Python - Confusion regarding how strings are stored and processed in Python

I am trying to understand how strings work in Python and am having a tough time deciphering the various functionalities. Here's what I understand so far. I'm hoping to get corrections and new perspectives on how to remember these nuances.

  • Firstly, I know that Unicode evolved to accommodate multiple languages and accents across the world. But how does Python store strings?
    If I define s = 'hello', in what encoding is the string s stored? Is it Unicode? Or is it stored as plain bytes? On doing type(s) I got <type 'str'>. However, when I did us = unicode(s), us was of type <type 'unicode'>. Is us a str, or is there actually a separate unicode type in Python?

  • Also, I know that to save space we encode strings as bytes using the encode() function. So bs = s.encode('utf-8', errors='ignore') will return a bytes object. Now, when I am writing bs to a file, should I open the file in wb mode? I have seen that if the file is opened in w mode, it stores the string in the file as b"<content in s>".

  • What does the decode() function do? (I know the question is open-ended.) Do we apply it to a bytes object to transform the string into our chosen encoding, or does it always convert back to a Unicode sequence? Can any other insights be drawn from the following lines?

>>> s = 'hello'
>>> bobj = bytes(s, 'utf-8')
>>> bobj
'hello'
>>> type(bobj)
<type 'str'>
>>> bobj.decode('ascii')
u'hello'
>>> us = bobj.decode('ascii')
>>> type(us)
<type 'str'>

  • How does str(object) work? I read that it will try to call the object's __str__() method. But how differently does this function act on, say, Unicode strings and regular byte strings?

Thanks in advance.

python unicode encode

asked Mar 23 at 9:47 by rajiv_, edited Mar 23 at 10:46
  • python 2 or python 3? – Jean-François Fabre, Mar 23 at 9:56

  • @Jean-FrançoisFabre python 3. I know that there is some revamping of the str() function from python 2 to python 3. – rajiv_, Mar 23 at 10:18

  • "However, when I did us = unicode(s)": you mean in python 2, since unicode has been removed in python 3... – Jean-François Fabre, Mar 23 at 10:26

  • Now it's a mix of Python 2 and 3, because in Python 3 type(us) gives <class 'str'> and there's no unicode type. – ForceBru, Mar 23 at 10:35

  • There's a big misconception in your question. You don't encode a string into bytes "to save space": you encode a string into bytes so you can have bytes. It's best to ignore the fact that strings are represented as bytes in memory: that's an implementation detail of strings. Any time you read or write strings to or from any file or device, you're converting to and from bytes. You can do this explicitly by opening a file in binary mode and calling encode/decode yourself, or you can pass an encoding and have the library do it for you. But either way some conversion is necessary. – Daniel Pryden, Mar 23 at 10:57
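
A minimal sketch (not part of the original post) of the two approaches Daniel Pryden describes above: encoding to bytes yourself and using a binary-mode file, or letting open() do the conversion by passing an encoding. The file name demo.txt is just a placeholder.

s = 'héllo'

# Explicit: convert str to bytes ourselves and write them in binary ('wb') mode.
with open('demo.txt', 'wb') as f:
    f.write(s.encode('utf-8'))

# Implicit: text ('w') mode with an encoding; the I/O layer encodes for us.
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write(s)

# Reading mirrors the same choice: raw bytes plus an explicit decode...
with open('demo.txt', 'rb') as f:
    assert f.read().decode('utf-8') == s

# ...or text mode with an encoding, which hands back a str directly.
with open('demo.txt', encoding='utf-8') as f:
    assert f.read() == s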


















1 Answer
Important: the behavior described below is Python 3 behavior. Python 2 has some conceptual similarities, but the exposed behavior is different.



In a nutshell: thanks to Unicode support, the string object in Python 3 is a higher-level abstraction, and it's up to the interpreter how to represent it in memory. So when it comes to serialization (e.g. writing a string's textual representation to a file), one needs to explicitly encode it to a byte sequence first, using a specified encoding (e.g. UTF-8). The same is true for the bytes-to-string conversion, i.e. decoding. In Python 2 the same behavior can be achieved with the unicode class, while str is essentially a synonym for bytes.
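
To make that concrete, here is a tiny Python 3 sketch of the str/bytes round trip (my own illustration, not part of the original answer):

s = 'hello平仮名'               # str: a sequence of Unicode code points
b = s.encode('utf-8')           # bytes: the serialized form that goes to disk or over the network
print(type(s), type(b))         # <class 'str'> <class 'bytes'>
print(b.decode('utf-8') == s)   # True: decoding with the same encoding restores the string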



While it's not a direct answer to your question, have a look at these examples:



import sys

e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49

a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54

u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53

j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100


Strings are immutable in Python, which gives the interpreter the freedom to decide how a string will be encoded at creation time. Let's review the numbers from above:



  • An empty string object has an overhead of 49 bytes.

  • A string of 5 ASCII symbols has size 49 + 5, i.e. the encoding uses 1 byte per symbol.

  • A string with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even though its length is still 8.

  • The difference between u and u[1:], and likewise between u and u[:-1], is 90 - 88 = 2 bytes, i.e. the encoding uses 2 bytes per symbol, even though the ASCII prefix of the string could have been encoded with 1 byte per symbol. This gives us constant-time indexing into strings, but we pay for it with extra memory overhead.

  • The memory footprint of string j is even higher, because not all of its symbols can be encoded with 2 bytes per symbol, so the interpreter switches to 4 bytes per symbol.

OK, let's keep checking the behavior. We already know that the interpreter stores strings with a fixed number of bytes per symbol to give us O(1) access by index. However, we also know that UTF-8 uses a variable-length representation of symbols. Let's prove it:



j = 'hello😋😋😋'
b = j.encode('utf8') # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'
print(len(b)) # 17


So we can see that the first 5 characters are encoded with 1 byte per symbol, while each of the remaining 3 symbols takes (17 - 5)/3 = 4 bytes. This also explains why Python uses the 4-bytes-per-symbol representation for this string under the hood.



And the other way around: when we have a sequence of bytes and decode it into a string, the interpreter decides on the internal string representation (1, 2, or 4 bytes per symbol), and that choice is completely opaque to the programmer. The only thing that must be explicit is the encoding of the byte sequence: we must tell the interpreter how to interpret the bytes, while letting it decide on the internal representation of the string object.
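
As a small illustration of that last point (my own, not from the answer): the same byte sequence yields different strings depending on the encoding we claim it uses, so the encoding must always be stated explicitly.

raw = 'é'.encode('utf-8')     # b'\xc3\xa9'
print(raw.decode('utf-8'))    # é   -- the intended text
print(raw.decode('latin-1'))  # Ã©  -- same bytes, wrong declared encoding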






answered Mar 23 at 10:49 by Ivan Velichko, edited Mar 23 at 12:14
  • Thanks. This is a very informative answer. So, Python decides how many bytes to use per symbol dynamically? – rajiv_, Mar 23 at 10:54

  • @rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string. – Ivan Velichko, Mar 23 at 11:05

  • >>> j 'hello😋😋😋' >>> b = j.encode('utf8') >>> sys.getsizeof(b) 50 How come the size is 50? – rajiv_, Mar 23 at 11:28

  • @rajiv_ sys.getsizeof shows the memory footprint of a Python object, which is almost always bigger than the underlying payload. For bytes it makes sense to check the size of the data with len(bytes), since every element in the sequence is a single byte. (See the sketch after these comments.) – Ivan Velichko, Mar 23 at 11:31

  • But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not part of the public API, and could change in a different or future implementation of Python. – Daniel Pryden, Mar 23 at 12:07
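
A quick check of the sys.getsizeof versus len distinction from the comments above (the 50 matches the value rajiv_ reported; the exact object overhead is a CPython implementation detail):

import sys

b = 'hello😋😋😋'.encode('utf-8')
print(len(b))            # 17 -> number of payload bytes
print(sys.getsizeof(b))  # ~50 -> payload plus the bytes object's own overhead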










