


Python 3.6 Messy String with Unicode characters and Bytes

























I am extracting article titles from the Common Crawl news repository using NewsPlease, but the titles come back as a mixture of normally encoded characters and escaped UTF-8 byte sequences, and I am unable to decode them correctly. Take one of the titles:



x = articles[800].title


If I evaluate x in Spyder, it returns:

'Las 10 canciones m\\xc3\\xa1s populares de la semana'

When I use print(x), I get:

Las 10 canciones m\xc3\xa1s populares de la semana

But if I try to fix the encoding the way other posts suggest:

x.encode('latin1').decode('utf8')

it returns the same string, unchanged:

'Las 10 canciones m\\xc3\\xa1s populares de la semana'


Which is obviously not correct.



Does anyone have any suggestions? I am using Python 3.6, by the way.
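A minimal sketch reproducing the symptom, assuming the title really contains literal backslash characters (which the print output suggests). The hard-coded string stands in for articles[800].title, and the variable names are illustrative only:

```python
# Assumption: the title holds literal backslashes, written here as '\\x'.
x = 'Las 10 canciones m\\xc3\\xa1s populares de la semana'

# Every character is plain ASCII, including the backslashes and hex digits.
is_ascii = all(ord(c) < 128 for c in x)
has_backslash = '\\' in x

# Latin-1 and UTF-8 agree on ASCII, so this round trip changes nothing.
round_trip = x.encode('latin1').decode('utf8')

print(is_ascii, has_backslash, round_trip == x)
```

If all three values print as True, the escapes are ordinary text rather than mis-decoded bytes, which would explain why the usual Latin-1/UTF-8 round trip has no effect.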


































  • I'm fairly sure you already messed something up when you fetched this data; I have a hard time believing that the data set originally contained a string representation of hex characters. How did you initialize articles?

    – Aran-Fey
    Mar 23 at 1:11











  • I can't post the full code, but the issue is the from_warc() method of NewsPlease; that's the format it returns data in when pulling from a Common Crawl WARC. articles is just a list of NewsPlease article objects.

    – J. Gursky
    Mar 23 at 4:22
























python python-3.x python-unicode














asked Mar 23 at 1:08 by J. Gursky (edited Mar 23 at 20:01)
























1 Answer














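Found a solution to this. The string contains literal backslash escapes, so decode those with unicode_escape first, then re-encode as Latin-1 and decode as UTF-8:

x = 'this is a test of the Spanish word m\\xc3\\xa1s'
x = x.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
print(x)
# this is a test of the Spanish word más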
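To apply the same chain to a whole list of titles, it could be wrapped in a small helper. This is only a sketch under the assumption that every affected title follows the pattern above; fix_escaped_utf8 is a hypothetical name, not part of NewsPlease, and titles that do not fit the pattern are returned untouched:

```python
def fix_escaped_utf8(text):
    """Hypothetical helper: turn literal backslash-x escapes that spell
    UTF-8 bytes back into the characters they encode."""
    try:
        return (text.encode('latin1')            # str -> bytes, one byte per char
                    .decode('unicode_escape')    # interpret the literal escapes
                    .encode('latin1')            # escaped code points -> raw bytes
                    .decode('utf8'))             # raw bytes -> real characters
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The title does not match the assumed pattern; leave it alone.
        return text


print(fix_escaped_utf8('this is a test of the Spanish word m\\xc3\\xa1s'))
print(fix_escaped_utf8('plain ASCII title'))
```

With the list from the question this would be used as [fix_escaped_utf8(a.title) for a in articles]. One caveat: unicode_escape also interprets other backslash sequences (a literal backslash followed by n becomes a newline), so titles that legitimately contain backslashes may be altered.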















answered Mar 23 at 22:07 by J. Gursky




























