Python - Confusion regarding how strings are stored and processed in Python
I am trying to learn how strings work in Python and am having a tough time deciphering the various functionalities. Here's what I understand so far; I'm hoping to get corrections and new perspectives on how to remember these nuances.
Firstly, I know that Unicode evolved to accommodate multiple languages and accents across the world. But how does Python store strings?
If I define s = 'hello', what is the encoding in which the string s is stored? Is it Unicode, or is it stored as plain bytes? On doing type(s) I got the answer <type 'str'>. However, when I did us = unicode(s), us was of the type <type 'unicode'>. Is us a str type, or is there actually a unicode type in Python?

Also, I know that to save space we encode strings as bytes using the encode() method. So suppose bs = s.encode('utf-8', errors='ignore') returns a bytes object. Now, when I am writing bs to a file, should I open the file in wb mode? I have seen that if the file is opened in w mode, it stores the string in the file as b"<content in s>".

What does the decode() method do? (I know, the question is too open-ended.) Is it that we apply it on a bytes object and this transforms the string into our chosen encoding? Or does it always convert it back to a Unicode sequence? Can any other insights be drawn from the following lines?
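For concreteness, here is a minimal Python 3 sketch of the file-writing behavior being asked about (the file names are arbitrary):

```python
# Minimal Python 3 sketch: writing encoded bytes vs. writing a string.
s = 'hello'
bs = s.encode('utf-8', errors='ignore')   # str -> bytes
print(type(bs))                           # <class 'bytes'>

# Bytes must be written in binary mode:
with open('out.bin', 'wb') as f:
    f.write(bs)

# A string written in text mode is encoded for us by the file object:
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(s)

# In Python 3, f.write(bs) on a text-mode file raises TypeError; a file
# containing the literal b"hello" appears only if str(bs) is written.
```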
>>> s = 'hello'
>>> bobj = bytes(s, 'utf-8')
>>> bobj
'hello'
>>> type(bobj)
<type 'str'>
>>> bobj.decode('ascii')
u'hello'
>>> us = bobj.decode('ascii')
>>> type(us)
<type 'str'>
- How does str(object) work? I read that it will try to call the __str__() method defined on the object. But how differently does this function act on, say, Unicode strings and regular byte strings?
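As a hedged illustration of the str(object) protocol (the class below is invented purely for the example):

```python
# str(obj) calls obj.__str__() if the class defines one.
class Greeting:                     # hypothetical example class
    def __init__(self, who):
        self.who = who

    def __str__(self):
        return 'hello, ' + self.who

print(str(Greeting('world')))       # hello, world

# On a str, str() is essentially the identity; on bytes it returns the
# repr (with the b'' prefix) -- it does NOT decode:
print(str('hi'))                    # hi
print(str(b'hi'))                   # b'hi'
```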
Thanks in advance.
python unicode encode
python 2 or python 3?
– Jean-François Fabre♦
Mar 23 at 9:56
@Jean-FrançoisFabre python 3. I know that there is some revamping in the str() function from python 2 to python 3.
– rajiv_
Mar 23 at 10:18
"However, when I did us = unicode(s)": you mean in Python 2, since unicode has been removed in Python 3...
– Jean-François Fabre♦
Mar 23 at 10:26
Now it's a mix of Python 2 and 3, because in Python 3 type(us) gives <class 'str'> and there's no unicode type.
– ForceBru
Mar 23 at 10:35
There's a big misconception in your question. You don't encode a string into bytes "to save space": you encode a string into bytes so you can have bytes. It's best to ignore the fact that strings are represented as bytes in memory: that's an implementation detail of strings. Any time you read or write strings from any file or device, you're converting to and from bytes. You can do this explicitly by opening a file in binary mode and calling encode/decode yourself, or you can pass an encoding and have the library do it for you. But either way some conversion is necessary.
– Daniel Pryden
Mar 23 at 10:57
asked Mar 23 at 9:47 by rajiv_ (edited Mar 23 at 10:46)
1 Answer
Important: the Python 3 behavior is described below. While Python 2 has some conceptual similarities, its observable behavior is different.
In a nutshell: thanks to Unicode support, the string object in Python 3 is a higher-level abstraction, and it's up to the interpreter how to represent it in memory. So when it comes to serialization (e.g. writing a string's textual representation to a file), one needs to explicitly encode it to a byte sequence first, using a specified encoding (e.g. UTF-8). The same is true for the bytes-to-string conversion, i.e. decoding. In Python 2 the same behavior can be achieved using the unicode class, while str is essentially a synonym for bytes.
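A minimal round-trip sketch of this encode/decode symmetry in Python 3:

```python
# str -> bytes requires choosing an encoding; bytes -> str requires
# knowing which encoding produced the bytes.
text = 'hello平仮名'
data = text.encode('utf-8')       # serialize: str -> bytes
print(type(data), len(data))      # <class 'bytes'> 14

back = data.decode('utf-8')       # deserialize: bytes -> str
assert back == text

# Decoding with an encoding that cannot represent these bytes fails:
try:
    data.decode('ascii')
except UnicodeDecodeError:
    print('ascii cannot decode these bytes')
```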
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python, which gives the interpreter the freedom to decide how a string will be encoded at creation time. Let's review the numbers from above:
- An empty string object has an overhead of 49 bytes.
- A string of 5 ASCII symbols has size 49 + 5, i.e. the encoding uses 1 byte per symbol.
- A string with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even though the length is still 8.
- The difference between u and u[1:], and likewise between u and u[:-1], is 90 - 88 = 2 bytes, i.e. the encoding uses 2 bytes per symbol, even though the prefix of the string could be encoded with 1 byte per symbol. This gives us the huge advantage of constant-time indexing on strings, but we pay for it with extra memory overhead.
- The memory footprint of string j is even higher, simply because we can't encode all of its symbols using 2 bytes per symbol, so the interpreter now uses 4 bytes per symbol.
OK, let's keep checking the behavior. We already know that the interpreter stores a string with a fixed number of bytes per symbol, to give us O(1) access by index. However, we also know that UTF-8 uses a variable-length representation of symbols. Let's prove it:
j = 'hello😋😋😋'
b = j.encode('utf8')  # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'
print(len(b))  # 17
So we can see that the first 5 characters are encoded using 1 byte per symbol, while the remaining 3 symbols are encoded using (17 - 5)/3 = 4 bytes per symbol. This also explains why Python uses a 4-bytes-per-symbol representation under the hood for this string.
And the other way around: when we have a sequence of bytes and decode it to a string, the interpreter decides on the internal string representation (1, 2, or 4 bytes per symbol), and this is completely opaque to the programmer. The only thing that must be explicit is the encoding of the byte sequence: we must tell the interpreter how to treat the bytes, while letting it decide on the internal representation of the string object.
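To make the last point concrete, here is a small sketch showing that the encoding of the bytes is the caller's responsibility: the same byte sequence decodes to different strings under different encodings.

```python
# Python cannot guess which encoding produced a byte sequence;
# we have to say so explicitly.
data = b'caf\xc3\xa9'               # the UTF-8 encoding of 'café'

as_utf8 = data.decode('utf-8')      # 'café'  (4 characters)
as_latin1 = data.decode('latin-1')  # 'cafÃ©' (5 characters)

print(as_utf8, len(as_utf8))
print(as_latin1, len(as_latin1))
```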
Thanks. This is a very informative answer. So, Python decides how many bytes to use per symbol dynamically?
– rajiv_
Mar 23 at 10:54
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
– Ivan Velichko
Mar 23 at 11:05
>>> j
'hello😋😋😋'
>>> b = j.encode('utf8')
>>> sys.getsizeof(b)
50
How come the size is 50?
– rajiv_
Mar 23 at 11:28
@rajiv_ sys.getsizeof shows the memory footprint of a Python object, which is almost always bigger than the underlying payload. For bytes it makes sense to check the size of the data using len(bytes), since every element in the sequence is a single byte.
– Ivan Velichko
Mar 23 at 11:31
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
– Daniel Pryden
Mar 23 at 12:07
|
show 1 more comment
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55312480%2fpython-confusion-regarding-how-strings-are-stored-and-processed-in-python%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Important: below python3 behavior is described. While python2 has some conceptual similarities, the exposed behavior would be different.
In a nutshell: due to the unicode support string object in python3 is a higher level abstraction. It's up to the interpreter how to represent it in memory. So, when it comes to serialization (eg. writing string's textual representation to a file), one needs to explicitly encode it to a bytes sequence first, using a specified encoding (eg. UTF-8). The same is true for the bytes to string conversion, i.e. decoding. In python2 same behavior can be achieved using unicode
class, while str
is rather a synonym to bytes
.
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python and this gives the interpreter an advantage to decide on a way the string will be encoded during the creation stage. Let's review the numbers from above:
- Empty string object has an overhead of 49 bytes.
- String with ASCII symbols of length 5 has size 49 + 5. I.e. the encoding uses 1 byte per symbol.
- String with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even
though the length is still 8. - The difference of
u
andu[1:]
and at the same time the difference ofu
andu[:-1]
is90 - 88 = 2 bytes
. I.e. the encoding uses 2 bytes per symbol. Even though the prefix of the string can be encoded with 1 byte per symbol. This gives us a huge advantage of having constant time indexing operation on strings, but we pay with an extra memory overhead. - Memory footprint of string
j
is even higher. It's just because we can't encode all the symbols in it using 2 bytes per symbol, so the interpreter uses 4 bytes per each symbol now.
Ok, keep checking the behavior. We already know, that the interpreter stores strings in even number of bytes per symbol way to give us O(1)
access by index. However, we also know that UTF-8
uses variadic length representation of symbols. Let's prove it:
j = 'hello😋😋😋'
b = j.encode('utf8') # b'helloxf0x9fx98x8bxf0x9fx98x8bxf0x9fx98x8b'
print(len(b)) # 17
So, we can see, that the first 5 characters are encoded using 1 byte per symbol while the remaining 3 symbols are encoded using (17 - 5)/3 = 4
bytes per symbol. This also explains why python uses 4 bytes per symbol representation under the hood.
And another way around, when we have a sequence of bytes and decode
it to a string, the interpreter will decide on the internal string representation (1, 2, or 4 bytes per symbol) and it's completely opaque to the programmer. The only thing which must be transparent is the encoding of the sequence of bytes. We must tell the interpreter how to treat the bytes. While we should let him decide on the internal representation of string object.
Thanks. This is very informative answer. So, python decides how many bytes to use per symbol dynamically?
– rajiv_
Mar 23 at 10:54
1
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
– Ivan Velichko
Mar 23 at 11:05
>>> j 'hello😋😋😋' >>> b = j.encode('utf8') >>> sys.getsizeof(b) 50 How come the size is 50?
– rajiv_
Mar 23 at 11:28
1
@rajiv_sys.getsizeof
shows the memory footprint of a python object, which almost all the time is bigger than an underlying payload. Forbytes
it makes sense to check the size of the data by usinglen(bytes)
since every element in a sequence is a single byte.
– Ivan Velichko
Mar 23 at 11:31
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
– Daniel Pryden
Mar 23 at 12:07
|
show 1 more comment
Important: below python3 behavior is described. While python2 has some conceptual similarities, the exposed behavior would be different.
In a nutshell: due to the unicode support string object in python3 is a higher level abstraction. It's up to the interpreter how to represent it in memory. So, when it comes to serialization (eg. writing string's textual representation to a file), one needs to explicitly encode it to a bytes sequence first, using a specified encoding (eg. UTF-8). The same is true for the bytes to string conversion, i.e. decoding. In python2 same behavior can be achieved using unicode
class, while str
is rather a synonym to bytes
.
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python and this gives the interpreter an advantage to decide on a way the string will be encoded during the creation stage. Let's review the numbers from above:
- Empty string object has an overhead of 49 bytes.
- String with ASCII symbols of length 5 has size 49 + 5. I.e. the encoding uses 1 byte per symbol.
- String with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even
though the length is still 8. - The difference of
u
andu[1:]
and at the same time the difference ofu
andu[:-1]
is90 - 88 = 2 bytes
. I.e. the encoding uses 2 bytes per symbol. Even though the prefix of the string can be encoded with 1 byte per symbol. This gives us a huge advantage of having constant time indexing operation on strings, but we pay with an extra memory overhead. - Memory footprint of string
j
is even higher. It's just because we can't encode all the symbols in it using 2 bytes per symbol, so the interpreter uses 4 bytes per each symbol now.
Ok, keep checking the behavior. We already know, that the interpreter stores strings in even number of bytes per symbol way to give us O(1)
access by index. However, we also know that UTF-8
uses variadic length representation of symbols. Let's prove it:
j = 'hello😋😋😋'
b = j.encode('utf8') # b'helloxf0x9fx98x8bxf0x9fx98x8bxf0x9fx98x8b'
print(len(b)) # 17
So, we can see, that the first 5 characters are encoded using 1 byte per symbol while the remaining 3 symbols are encoded using (17 - 5)/3 = 4
bytes per symbol. This also explains why python uses 4 bytes per symbol representation under the hood.
And another way around, when we have a sequence of bytes and decode
it to a string, the interpreter will decide on the internal string representation (1, 2, or 4 bytes per symbol) and it's completely opaque to the programmer. The only thing which must be transparent is the encoding of the sequence of bytes. We must tell the interpreter how to treat the bytes. While we should let him decide on the internal representation of string object.
Thanks. This is very informative answer. So, python decides how many bytes to use per symbol dynamically?
– rajiv_
Mar 23 at 10:54
1
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
– Ivan Velichko
Mar 23 at 11:05
>>> j 'hello😋😋😋' >>> b = j.encode('utf8') >>> sys.getsizeof(b) 50 How come the size is 50?
– rajiv_
Mar 23 at 11:28
1
@rajiv_sys.getsizeof
shows the memory footprint of a python object, which almost all the time is bigger than an underlying payload. Forbytes
it makes sense to check the size of the data by usinglen(bytes)
since every element in a sequence is a single byte.
– Ivan Velichko
Mar 23 at 11:31
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
– Daniel Pryden
Mar 23 at 12:07
|
show 1 more comment
Important: below python3 behavior is described. While python2 has some conceptual similarities, the exposed behavior would be different.
In a nutshell: due to the unicode support string object in python3 is a higher level abstraction. It's up to the interpreter how to represent it in memory. So, when it comes to serialization (eg. writing string's textual representation to a file), one needs to explicitly encode it to a bytes sequence first, using a specified encoding (eg. UTF-8). The same is true for the bytes to string conversion, i.e. decoding. In python2 same behavior can be achieved using unicode
class, while str
is rather a synonym to bytes
.
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python, which lets the interpreter decide at creation time how each string will be encoded. Let's review the numbers above:
- An empty string object has an overhead of 49 bytes.
- A string of 5 ASCII symbols has size 49 + 5, i.e. the encoding uses 1 byte per symbol.
- A string with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even though its length is still 8.
- The difference between u and u[1:], and likewise between u and u[:-1], is 90 - 88 = 2 bytes, i.e. the encoding uses 2 bytes per symbol, even though the ASCII prefix alone could fit in 1 byte per symbol. This gives us constant-time indexing into strings, but we pay for it with extra memory overhead.
- The memory footprint of string j is even higher, because its symbols cannot all be encoded with 2 bytes each, so the interpreter falls back to 4 bytes per symbol.
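These widths can be probed directly. This is a sketch assuming CPython's flexible string representation (PEP 393); the exact getsizeof numbers vary between versions, but the growth per extra symbol reveals the chosen width:

```python
import sys

def width(ch):
    # Doubling a one-symbol string grows getsizeof by exactly the
    # per-symbol width CPython chose for that code point (CPython detail).
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)

print(width('h'))   # 1: Latin-1 range
print(width('平'))  # 2: BMP, outside Latin-1
print(width('😋'))  # 4: beyond the BMP
```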
OK, let's keep checking the behavior. We already know that the interpreter stores strings with a fixed number of bytes per symbol to give us O(1) access by index. However, we also know that UTF-8 uses a variable-length representation of symbols. Let's verify that:
j = 'hello😋😋😋'
b = j.encode('utf8')  # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'
print(len(b)) # 17
So we can see that the first 5 characters are encoded with 1 byte each, while the remaining 3 symbols take (17 - 5) / 3 = 4 bytes each. This also matches why Python uses the 4-bytes-per-symbol representation under the hood for this string.
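The variable width can also be seen per character; each sample below sits in a different UTF-8 length class:

```python
# UTF-8 spends 1 to 4 bytes per code point, depending on its value.
for ch in 'hé平😋':
    print(ch, len(ch.encode('utf-8')))
# h 1
# é 2
# 平 3
# 😋 4
```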
And the other way around: when we have a sequence of bytes and decode it into a string, the interpreter decides on the internal representation (1, 2, or 4 bytes per symbol), and that choice is completely opaque to the programmer. The only thing that must be explicit is the encoding of the byte sequence: we have to tell the interpreter how to interpret the bytes, while letting it decide on the internal representation of the string object.
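A sketch of why the declared encoding matters when decoding (the sample text is illustrative):

```python
data = 'привет'.encode('utf-8')   # 12 bytes of UTF-8

print(data.decode('utf-8'))       # привет: the right encoding
print(data.decode('latin-1'))     # mojibake: every byte is "valid" latin-1

try:
    data.decode('ascii')          # bytes >= 0x80 are not ASCII
except UnicodeDecodeError as err:
    print('ascii failed:', err.reason)
```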
edited Mar 23 at 12:14
answered Mar 23 at 10:49
Ivan Velichko
Thanks. This is very informative answer. So, python decides how many bytes to use per symbol dynamically?
– rajiv_
Mar 23 at 10:54
1
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
– Ivan Velichko
Mar 23 at 11:05
>>> j 'hello😋😋😋' >>> b = j.encode('utf8') >>> sys.getsizeof(b) 50 How come the size is 50?
– rajiv_
Mar 23 at 11:28
1
@rajiv_ sys.getsizeof shows the memory footprint of a Python object, which is almost always bigger than the underlying payload. For bytes it makes sense to check the size of the data with len(bytes), since every element of the sequence is a single byte.
– Ivan Velichko
Mar 23 at 11:31
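To illustrate that comment (the exact getsizeof figure is a CPython implementation detail):

```python
import sys

b = 'hello😋😋😋'.encode('utf-8')

print(len(b))                     # 17: the actual payload size
print(sys.getsizeof(b))           # bigger: payload + bytes-object header
print(sys.getsizeof(b) > len(b))  # True
```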
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
– Daniel Pryden
Mar 23 at 12:07
3
python 2 or python 3?
– Jean-François Fabre♦
Mar 23 at 9:56
@Jean-François Fabre python 3. I know that there is some revamping in the str() function from python 2 to python 3.
– rajiv_
Mar 23 at 10:18
1
"However, when I did us = unicode(s)": you mean in Python 2, since unicode has been removed in Python 3...
– Jean-François Fabre♦
Mar 23 at 10:26
2
Now it's a mix of Python 2 and 3, because in Python 3 type(us) gives <class 'str'> and there's no unicode type.
– ForceBru
Mar 23 at 10:35
2
There's a big misconception in your question. You don't encode a string into bytes "to save space": you encode a string into bytes so you can have bytes. It's best to ignore the fact that strings are represented as bytes in memory: that's an implementation detail of strings. Any time you read or write strings from any file or device, you're converting to and from bytes. You can do this explicitly by opening a file in binary mode and calling encode/decode yourself, or you can pass an encoding and have the library do it for you. But either way some conversion is necessary.
– Daniel Pryden
Mar 23 at 10:57
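Both routes the comment mentions can be sketched side by side (the temp file and text are just for the demo):

```python
import os
import tempfile

text = 'naïve café'
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Route 1: text mode -- pass an encoding; the library converts for you.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Route 2: binary mode -- read raw bytes and decode explicitly.
with open(path, 'rb') as f:
    raw = f.read()

print(raw.decode('utf-8') == text)  # True: same conversion either way
```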