Python - Confusion regarding how strings are stored and processed in Python
I am trying to learn how strings work in Python and am having a tough time deciphering the various functionalities. Here's what I understand so far; I'm hoping to get corrections and new perspectives on how to remember these nuances.
Firstly, I know that Unicode evolved to accommodate multiple languages and accents across the world. But how does Python store strings?
If I define s = 'hello', what is the encoding in which the string s is stored? Is it Unicode, or is it stored as plain bytes? On doing type(s) I got the answer <type 'str'>. However, when I did us = unicode(s), us was of the type <type 'unicode'>. Is us a str type, or is there actually a unicode type in Python?

Also, I know that to save space we encode strings as bytes using the encode() method. So suppose bs = s.encode('utf-8', errors='ignore') returns a bytes object. Now, when I am writing bs to a file, should I open the file in wb mode? I have seen that if the file is opened in w mode, it stores the string in the file as b"<content in s>".

What does the decode() method do? (I know, the question is too open-ended.) Is it that we apply it on a bytes object and this transforms the string into our chosen encoding? Or does it always convert it back to a Unicode sequence? Can any other insights be drawn from the following lines?
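For concreteness, here is a minimal Python 3 sketch of the file-writing behavior being asked about (the file names are arbitrary):

```python
# Minimal Python 3 sketch: writing encoded bytes vs. writing a string.
s = 'hello'
bs = s.encode('utf-8', errors='ignore')   # str -> bytes
print(type(bs))                           # <class 'bytes'>

# Bytes must be written in binary mode:
with open('out.bin', 'wb') as f:
    f.write(bs)

# A string written in text mode is encoded for us by the file object:
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(s)

# In Python 3, f.write(bs) on a text-mode file raises TypeError; a file
# containing the literal b"hello" appears only if str(bs) is written.
```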
>>> s = 'hello'
>>> bobj = bytes(s, 'utf-8')
>>> bobj
'hello'
>>> type(bobj)
<type 'str'>
>>> bobj.decode('ascii')
u'hello'
>>> us = bobj.decode('ascii')
>>> type(us)
<type 'str'>
- How does str(object) work? I read that it will try to call the __str__() method defined on the object. But how differently does this function act on, say, Unicode strings and regular byte strings?
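As a hedged illustration of the str(object) protocol (the class below is invented purely for the example):

```python
# str(obj) calls obj.__str__() if the class defines one.
class Greeting:                     # hypothetical example class
    def __init__(self, who):
        self.who = who

    def __str__(self):
        return 'hello, ' + self.who

print(str(Greeting('world')))       # hello, world

# On a str, str() is essentially the identity; on bytes it returns the
# repr (with the b'' prefix) -- it does NOT decode:
print(str('hi'))                    # hi
print(str(b'hi'))                   # b'hi'
```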
Thanks in advance.
python unicode encode
python 2 or python 3?
– Jean-François Fabre♦
Mar 23 at 9:56
@Jean-FrançoisFabre python 3. I know that there is some revamping in the str() function from python 2 to python 3.
– rajiv_
Mar 23 at 10:18
"However, when I did us = unicode(s)": you mean in Python 2, since unicode has been removed in Python 3...
– Jean-François Fabre♦
Mar 23 at 10:26
Now it's a mix of Python 2 and 3, because in Python 3 type(us) gives <class 'str'> and there's no unicode type.
– ForceBru
Mar 23 at 10:35
There's a big misconception in your question. You don't encode a string into bytes "to save space": you encode a string into bytes so you can have bytes. It's best to ignore the fact that strings are represented as bytes in memory: that's an implementation detail of strings. Any time you read or write strings from any file or device, you're converting to and from bytes. You can do this explicitly by opening a file in binary mode and calling encode/decode yourself, or you can pass an encoding and have the library do it for you. But either way some conversion is necessary.
– Daniel Pryden
Mar 23 at 10:57
asked Mar 23 at 9:47 by rajiv_ (edited Mar 23 at 10:46)
1 Answer
Important: the Python 3 behavior is described below. While Python 2 has some conceptual similarities, its observable behavior is different.
In a nutshell: thanks to Unicode support, the string object in Python 3 is a higher-level abstraction, and it's up to the interpreter how to represent it in memory. So when it comes to serialization (e.g. writing a string's textual representation to a file), one needs to explicitly encode it to a byte sequence first, using a specified encoding (e.g. UTF-8). The same is true for the bytes-to-string conversion, i.e. decoding. In Python 2 the same behavior can be achieved using the unicode class, while str is essentially a synonym for bytes.
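A minimal round-trip sketch of this encode/decode symmetry in Python 3:

```python
# str -> bytes requires choosing an encoding; bytes -> str requires
# knowing which encoding produced the bytes.
text = 'hello平仮名'
data = text.encode('utf-8')       # serialize: str -> bytes
print(type(data), len(data))      # <class 'bytes'> 14

back = data.decode('utf-8')       # deserialize: bytes -> str
assert back == text

# Decoding with an encoding that cannot represent these bytes fails:
try:
    data.decode('ascii')
except UnicodeDecodeError:
    print('ascii cannot decode these bytes')
```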
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python, which gives the interpreter the freedom to decide how a string will be encoded at creation time. Let's review the numbers from above:
- An empty string object has an overhead of 49 bytes.
- A string of 5 ASCII symbols has size 49 + 5, i.e. the encoding uses 1 byte per symbol.
- A string with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even though the length is still 8.
- The difference between u and u[1:], and likewise between u and u[:-1], is 90 - 88 = 2 bytes, i.e. the encoding uses 2 bytes per symbol, even though the prefix of the string could be encoded with 1 byte per symbol. This gives us the huge advantage of constant-time indexing on strings, but we pay for it with extra memory overhead.
- The memory footprint of string j is even higher, simply because we can't encode all of its symbols using 2 bytes per symbol, so the interpreter now uses 4 bytes per symbol.
OK, let's keep checking the behavior. We already know that the interpreter stores a string with a fixed number of bytes per symbol, to give us O(1) access by index. However, we also know that UTF-8 uses a variable-length representation of symbols. Let's prove it:
j = 'hello😋😋😋'
b = j.encode('utf8')  # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'
print(len(b))  # 17
So we can see that the first 5 characters are encoded using 1 byte per symbol, while the remaining 3 symbols are encoded using (17 - 5)/3 = 4 bytes per symbol. This also explains why Python uses a 4-bytes-per-symbol representation under the hood for this string.
And the other way around: when we have a sequence of bytes and decode it to a string, the interpreter decides on the internal string representation (1, 2, or 4 bytes per symbol), and this is completely opaque to the programmer. The only thing that must be explicit is the encoding of the byte sequence: we must tell the interpreter how to treat the bytes, while letting it decide on the internal representation of the string object.
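To make the last point concrete, here is a small sketch showing that the encoding of the bytes is the caller's responsibility: the same byte sequence decodes to different strings under different encodings.

```python
# Python cannot guess which encoding produced a byte sequence;
# we have to say so explicitly.
data = b'caf\xc3\xa9'               # the UTF-8 encoding of 'café'

as_utf8 = data.decode('utf-8')      # 'café'  (4 characters)
as_latin1 = data.decode('latin-1')  # 'cafÃ©' (5 characters)

print(as_utf8, len(as_utf8))
print(as_latin1, len(as_latin1))
```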
Thanks. This is a very informative answer. So, Python decides how many bytes to use per symbol dynamically?
– rajiv_
Mar 23 at 10:54
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
– Ivan Velichko
Mar 23 at 11:05
>>> j
'hello😋😋😋'
>>> b = j.encode('utf8')
>>> sys.getsizeof(b)
50
How come the size is 50?
– rajiv_
Mar 23 at 11:28
@rajiv_ sys.getsizeof shows the memory footprint of a Python object, which is almost always bigger than the underlying payload. For bytes it makes sense to check the size of the data using len(bytes), since every element in the sequence is a single byte.
– Ivan Velichko
Mar 23 at 11:31
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
– Daniel Pryden
Mar 23 at 12:07
|
show 1 more comment
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55312480%2fpython-confusion-regarding-how-strings-are-stored-and-processed-in-python%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Important: below python3 behavior is described. While python2 has some conceptual similarities, the exposed behavior would be different.
In a nutshell: due to the unicode support string object in python3 is a higher level abstraction. It's up to the interpreter how to represent it in memory. So, when it comes to serialization (eg. writing string's textual representation to a file), one needs to explicitly encode it to a bytes sequence first, using a specified encoding (eg. UTF-8). The same is true for the bytes to string conversion, i.e. decoding. In python2 same behavior can be achieved using unicode
class, while str
is rather a synonym to bytes
.
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python and this gives the interpreter an advantage to decide on a way the string will be encoded during the creation stage. Let's review the numbers from above:
- Empty string object has an overhead of 49 bytes.
- String with ASCII symbols of length 5 has size 49 + 5. I.e. the encoding uses 1 byte per symbol.
- String with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even
though the length is still 8. - The difference of
u
andu[1:]
and at the same time the difference ofu
andu[:-1]
is90 - 88 = 2 bytes
. I.e. the encoding uses 2 bytes per symbol. Even though the prefix of the string can be encoded with 1 byte per symbol. This gives us a huge advantage of having constant time indexing operation on strings, but we pay with an extra memory overhead. - Memory footprint of string
j
is even higher. It's just because we can't encode all the symbols in it using 2 bytes per symbol, so the interpreter uses 4 bytes per each symbol now.
Ok, keep checking the behavior. We already know, that the interpreter stores strings in even number of bytes per symbol way to give us O(1)
access by index. However, we also know that UTF-8
uses variadic length representation of symbols. Let's prove it:
j = 'hello😋😋😋'
b = j.encode('utf8') # b'helloxf0x9fx98x8bxf0x9fx98x8bxf0x9fx98x8b'
print(len(b)) # 17
So, we can see, that the first 5 characters are encoded using 1 byte per symbol while the remaining 3 symbols are encoded using (17 - 5)/3 = 4
bytes per symbol. This also explains why python uses 4 bytes per symbol representation under the hood.
And another way around, when we have a sequence of bytes and decode
it to a string, the interpreter will decide on the internal string representation (1, 2, or 4 bytes per symbol) and it's completely opaque to the programmer. The only thing which must be transparent is the encoding of the sequence of bytes. We must tell the interpreter how to treat the bytes. While we should let him decide on the internal representation of string object.
Thanks. This is very informative answer. So, python decides how many bytes to use per symbol dynamically?
– rajiv_
Mar 23 at 10:54
1
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
– Ivan Velichko
Mar 23 at 11:05
>>> j 'hello😋😋😋' >>> b = j.encode('utf8') >>> sys.getsizeof(b) 50 How come the size is 50?
– rajiv_
Mar 23 at 11:28
1
@rajiv_sys.getsizeof
shows the memory footprint of a python object, which almost all the time is bigger than an underlying payload. Forbytes
it makes sense to check the size of the data by usinglen(bytes)
since every element in a sequence is a single byte.
– Ivan Velichko
Mar 23 at 11:31
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
– Daniel Pryden
Mar 23 at 12:07
|
show 1 more comment
Important: below python3 behavior is described. While python2 has some conceptual similarities, the exposed behavior would be different.
In a nutshell: due to the unicode support string object in python3 is a higher level abstraction. It's up to the interpreter how to represent it in memory. So, when it comes to serialization (eg. writing string's textual representation to a file), one needs to explicitly encode it to a bytes sequence first, using a specified encoding (eg. UTF-8). The same is true for the bytes to string conversion, i.e. decoding. In python2 same behavior can be achieved using unicode
class, while str
is rather a synonym to bytes
.
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python and this gives the interpreter an advantage to decide on a way the string will be encoded during the creation stage. Let's review the numbers from above:
- Empty string object has an overhead of 49 bytes.
- String with ASCII symbols of length 5 has size 49 + 5. I.e. the encoding uses 1 byte per symbol.
- String with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even
though the length is still 8. - The difference of
u
andu[1:]
and at the same time the difference ofu
andu[:-1]
is90 - 88 = 2 bytes
. I.e. the encoding uses 2 bytes per symbol. Even though the prefix of the string can be encoded with 1 byte per symbol. This gives us a huge advantage of having constant time indexing operation on strings, but we pay with an extra memory overhead. - Memory footprint of string
j
is even higher. It's just because we can't encode all the symbols in it using 2 bytes per symbol, so the interpreter uses 4 bytes per each symbol now.
Ok, keep checking the behavior. We already know, that the interpreter stores strings in even number of bytes per symbol way to give us O(1)
access by index. However, we also know that UTF-8
uses variadic length representation of symbols. Let's prove it:
j = 'hello😋😋😋'
b = j.encode('utf8') # b'helloxf0x9fx98x8bxf0x9fx98x8bxf0x9fx98x8b'
print(len(b)) # 17
So, we can see, that the first 5 characters are encoded using 1 byte per symbol while the remaining 3 symbols are encoded using (17 - 5)/3 = 4
bytes per symbol. This also explains why python uses 4 bytes per symbol representation under the hood.
And another way around, when we have a sequence of bytes and decode
it to a string, the interpreter will decide on the internal string representation (1, 2, or 4 bytes per symbol) and it's completely opaque to the programmer. The only thing which must be transparent is the encoding of the sequence of bytes. We must tell the interpreter how to treat the bytes. While we should let him decide on the internal representation of string object.
Thanks. This is very informative answer. So, python decides how many bytes to use per symbol dynamically?
– rajiv_
Mar 23 at 10:54
1
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
– Ivan Velichko
Mar 23 at 11:05
>>> j 'hello😋😋😋' >>> b = j.encode('utf8') >>> sys.getsizeof(b) 50 How come the size is 50?
– rajiv_
Mar 23 at 11:28
1
@rajiv_sys.getsizeof
shows the memory footprint of a python object, which almost all the time is bigger than an underlying payload. Forbytes
it makes sense to check the size of the data by usinglen(bytes)
since every element in a sequence is a single byte.
– Ivan Velichko
Mar 23 at 11:31
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
– Daniel Pryden
Mar 23 at 12:07
|
show 1 more comment
Important: below python3 behavior is described. While python2 has some conceptual similarities, the exposed behavior would be different.
In a nutshell: due to the unicode support string object in python3 is a higher level abstraction. It's up to the interpreter how to represent it in memory. So, when it comes to serialization (eg. writing string's textual representation to a file), one needs to explicitly encode it to a bytes sequence first, using a specified encoding (eg. UTF-8). The same is true for the bytes to string conversion, i.e. decoding. In python2 same behavior can be achieved using unicode
class, while str
is rather a synonym to bytes
.
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python, which lets the interpreter decide at creation time how each string will be encoded. Let's review the numbers above:
- An empty string object has an overhead of 49 bytes.
- A string of 5 ASCII symbols has size 49 + 5, i.e. the encoding uses 1 byte per symbol.
- A string with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even though its length is still 8.
- The difference between u and u[1:], and likewise between u and u[:-1], is 90 - 88 = 2 bytes, i.e. the encoding uses 2 bytes per symbol, even though the ASCII prefix alone could fit in 1 byte per symbol. This gives us constant-time indexing into strings, but we pay for it with extra memory overhead.
- The memory footprint of string j is even higher, because its symbols cannot all be encoded with 2 bytes each, so the interpreter falls back to 4 bytes per symbol.
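These widths can be probed directly. This is a sketch assuming CPython's flexible string representation (PEP 393); the exact getsizeof numbers vary between versions, but the growth per extra symbol reveals the chosen width:

```python
import sys

def width(ch):
    # Doubling a one-symbol string grows getsizeof by exactly the
    # per-symbol width CPython chose for that code point (CPython detail).
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)

print(width('h'))   # 1: Latin-1 range
print(width('平'))  # 2: BMP, outside Latin-1
print(width('😋'))  # 4: beyond the BMP
```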
OK, let's keep checking the behavior. We already know that the interpreter stores strings with a fixed number of bytes per symbol to give us O(1) access by index. However, we also know that UTF-8 uses a variable-length representation of symbols. Let's verify that:
j = 'hello😋😋😋'
b = j.encode('utf8')  # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'
print(len(b)) # 17
So we can see that the first 5 characters are encoded with 1 byte each, while the remaining 3 symbols take (17 - 5) / 3 = 4 bytes each. This also matches why Python uses the 4-bytes-per-symbol representation under the hood for this string.
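The variable width can also be seen per character; each sample below sits in a different UTF-8 length class:

```python
# UTF-8 spends 1 to 4 bytes per code point, depending on its value.
for ch in 'hé平😋':
    print(ch, len(ch.encode('utf-8')))
# h 1
# é 2
# 平 3
# 😋 4
```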
And the other way around: when we have a sequence of bytes and decode it into a string, the interpreter decides on the internal representation (1, 2, or 4 bytes per symbol), and that choice is completely opaque to the programmer. The only thing that must be explicit is the encoding of the byte sequence: we have to tell the interpreter how to interpret the bytes, while letting it decide on the internal representation of the string object.
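A sketch of why the declared encoding matters when decoding (the sample text is illustrative):

```python
data = 'привет'.encode('utf-8')   # 12 bytes of UTF-8

print(data.decode('utf-8'))       # привет: the right encoding
print(data.decode('latin-1'))     # mojibake: every byte is "valid" latin-1

try:
    data.decode('ascii')          # bytes >= 0x80 are not ASCII
except UnicodeDecodeError as err:
    print('ascii failed:', err.reason)
```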
edited Mar 23 at 12:14
answered Mar 23 at 10:49
Ivan Velichko
Thanks. This is very informative answer. So, python decides how many bytes to use per symbol dynamically?
– rajiv_
Mar 23 at 10:54
1
@rajiv_ yes, exactly. And another important thing is that it always uses the same number of bytes per symbol along the whole string.
– Ivan Velichko
Mar 23 at 11:05
>>> j 'hello😋😋😋' >>> b = j.encode('utf8') >>> sys.getsizeof(b) 50 How come the size is 50?
– rajiv_
Mar 23 at 11:28
1
@rajiv_ sys.getsizeof shows the memory footprint of a Python object, which is almost always bigger than the underlying payload. For bytes it makes sense to check the size of the data with len(bytes), since every element of the sequence is a single byte.
– Ivan Velichko
Mar 23 at 11:31
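To illustrate that comment (the exact getsizeof figure is a CPython implementation detail):

```python
import sys

b = 'hello😋😋😋'.encode('utf-8')

print(len(b))                     # 17: the actual payload size
print(sys.getsizeof(b))           # bigger: payload + bytes-object header
print(sys.getsizeof(b) > len(b))  # True
```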
But that's not guaranteed. That's an implementation detail at best. As my comment above said, don't think about the memory representation of strings. A Python 3 string is not a collection of bytes, it's a collection of Unicode code points. How those code points happen to be stored in memory is not a part of the public API, and could change in a different or future implementation of Python.
– Daniel Pryden
Mar 23 at 12:07
3
python 2 or python 3?
– Jean-François Fabre♦
Mar 23 at 9:56
@Jean-François Fabre python 3. I know that there is some revamping in the str() function from python 2 to python 3.
– rajiv_
Mar 23 at 10:18
1
"However, when I did us = unicode(s)": you mean in Python 2, since unicode has been removed in Python 3...
– Jean-François Fabre♦
Mar 23 at 10:26
2
Now it's a mix of Python 2 and 3, because in Python 3 type(us) gives <class 'str'> and there's no unicode type.
– ForceBru
Mar 23 at 10:35
2
There's a big misconception in your question. You don't encode a string into bytes "to save space": you encode a string into bytes so you can have bytes. It's best to ignore the fact that strings are represented as bytes in memory: that's an implementation detail of strings. Any time you read or write strings from any file or device, you're converting to and from bytes. You can do this explicitly by opening a file in binary mode and calling encode/decode yourself, or you can pass an encoding and have the library do it for you. But either way some conversion is necessary.
– Daniel Pryden
Mar 23 at 10:57
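Both routes the comment mentions can be sketched side by side (the temp file and text are just for the demo):

```python
import os
import tempfile

text = 'naïve café'
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Route 1: text mode -- pass an encoding; the library converts for you.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Route 2: binary mode -- read raw bytes and decode explicitly.
with open(path, 'rb') as f:
    raw = f.read()

print(raw.decode('utf-8') == text)  # True: same conversion either way
```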