Python 3.6 Messy String with Unicode characters and Bytes
I am pulling article titles from the Common Crawl news repository using NewsPlease, but the titles come back as a mixture of normally encoded characters and what look like escaped UTF-8 bytes, and I am unable to decode them correctly. Taking one of the titles:
    x = articles[800].title
If I evaluate x in Spyder, it returns:
    'Las 10 canciones m\\xc3\\xa1s populares de la semana'
When I use print(x), I get:
    Las 10 canciones m\xc3\xa1s populares de la semana
But if I try to fix the encoding the way other posts suggest:
    x.encode('latin1').decode('utf8')
it returns:
    'Las 10 canciones m\\xc3\\xa1s populares de la semana'
which is obviously not correct.
Does anyone have any suggestions? I am using Python 3.6, by the way.
python python-3.x python-unicode
I'm fairly sure you already messed something up when you fetched this data; I have a hard time believing that the data set originally contained a string representation of hex characters. How did you initialize articles?
– Aran-Fey
Mar 23 at 1:11
I can't post the full code, but the issue is the from_warc() method of NewsPlease; that's the format it returns data in when pulling from a Common Crawl WARC. articles is just a list of NewsPlease article objects.
– J. Gursky
Mar 23 at 4:22
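A quick way to check whether a title really holds a textual representation of hex escapes, rather than mis-decoded characters, is sketched below. The `title` literal is a hypothetical stand-in for `articles[800].title`, written with doubled backslashes so that the string contains literal backslash characters. Under that assumption, every character is plain ASCII, which would also explain why the latin1/utf8 round trip returns the string unchanged.

```python
# Hypothetical stand-in for articles[800].title. The doubled backslashes in
# this literal produce single, literal backslash characters in the string.
title = 'Las 10 canciones m\\xc3\\xa1s populares de la semana'

# True if the string contains only ASCII characters
# (str.isascii() is avoided because it does not exist in Python 3.6).
is_ascii = all(ord(ch) < 128 for ch in title)

# Number of literal backslash characters in the string.
backslash_count = title.count('\\')

# For pure ASCII text, latin1 and utf8 agree byte for byte,
# so this round trip cannot change anything.
round_trip = title.encode('latin1').decode('utf8')

print(is_ascii)             # True
print(backslash_count)      # 2
print(round_trip == title)  # True
```

If `is_ascii` came back False instead, the string would contain real non-ASCII characters and a different repair would likely be needed.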
asked Mar 23 at 1:08, edited Mar 23 at 20:01
– J. Gursky
1 Answer
Found a solution to this:
    x = 'this is a test of the Spanish word m\\xc3\\xa1s'
    x = x.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
    print(x)
which prints:
    this is a test of the Spanish word más
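To see why the chained call works, it may help to split it into its four steps. This is only a sketch: the variable names `raw`, `step1`, `step2`, `step3`, and `fixed` are introduced here for illustration, and `raw` is assumed to contain literal backslashes (hence the doubled backslashes in the source literal).

```python
# Hypothetical sample; the doubled backslashes yield literal backslash
# characters, i.e. the text "\xc3\xa1" spelled out as eight ASCII characters.
raw = 'this is a test of the Spanish word m\\xc3\\xa1s'

# 1. str -> bytes. The text is pure ASCII, so latin1 maps each character
#    to one byte and the backslashes survive as ordinary bytes.
step1 = raw.encode('latin1')

# 2. bytes -> str. unicode_escape interprets the backslash sequences,
#    turning "\xc3" and "\xa1" into the characters U+00C3 and U+00A1.
step2 = step1.decode('unicode_escape')

# 3. str -> bytes. latin1 maps U+00C3 and U+00A1 to the raw bytes
#    0xC3 and 0xA1, which is the UTF-8 encoding of "á".
step3 = step2.encode('latin1')

# 4. bytes -> str. Decoding those bytes as UTF-8 yields the intended text.
fixed = step3.decode('utf8')

print(fixed)  # this is a test of the Spanish word más
```

Two caveats apply. `unicode_escape` interprets every backslash sequence, so a title containing something like a literal `\n` would also be rewritten. And either `encode('latin1')` call raises `UnicodeEncodeError` if the string contains characters above U+00FF.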
answered Mar 23 at 22:07
– J. Gursky