Normalize Single String Column in Pandas by Fuzzy Matching

I have a Pandas DataFrame containing the results of entity sentiment using Google Cloud's Natural Language API (https://cloud.google.com/natural-language/docs/analyzing-entity-sentiment). I ran entity sentiment on a corpus of papers and want to compare how various entities are viewed by the various papers. However, the entity_name column contains duplicates due to subtle variations in how the organizations are referenced in the various papers. I would like to replace each entity name with one version so I can compare the sentiment analysis across papers.



Here's an example of what the DataFrame looks like:



entity_name                paper_name   score  magnitude
Dept. of Commerce          paper_1.pdf  0.67   0.13
Department of Commerce     paper_2.pdf  0.42   0.21
US Department of Commerce  paper_3.pdf  0.07   0.15


What I would like is to find, for example, all references to "Department of Commerce" so I can compare the sentiment scores for this entity across the various papers. This is not a large DataFrame (fewer than 100k rows), so speed is not a concern.



I've tried fuzzy matching the names using the fuzzywuzzy library in Python. I took the entity_name column and made two separate NumPy arrays of its unique values:



names_1 = df.entity_name.unique()
names_2 = df.entity_name.unique()


I then ran a fuzzy match for all pairs across the two arrays. The issue is that I still have duplicates. This is an example of what the match DataFrame looks like:



name_1                     name_2                     match_ratio
Dept. of Commerce          Department of Commerce     82
Department of Commerce     Dept. of Commerce          82
US Department of Commerce  Department of Commerce     100
Department of Commerce     US Department of Commerce  100
Dept. of Commerce          US Department of Commerce  82
US Department of Commerce  Dept. of Commerce          82
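
The mirrored rows in a table like this come from comparing a list against a copy of itself, so every pair is scored in both orders. A minimal sketch of scoring each unordered pair only once is shown here. It assumes the standard library's difflib.SequenceMatcher as a stand-in scorer (it is not fuzzywuzzy's own scorer, and its scores may differ from the ones in the table), and the helper name similarity is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The three spellings from the example DataFrame.
unique_names = [
    "Dept. of Commerce",
    "Department of Commerce",
    "US Department of Commerce",
]


def similarity(a: str, b: str) -> int:
    """Stand-in 0-100 similarity score (assumption: not fuzzywuzzy's scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


# combinations() yields each unordered pair exactly once and never pairs
# a name with itself, so no mirrored rows are produced.
pairs = [(a, b, similarity(a, b)) for a, b in combinations(unique_names, 2)]

for name_1, name_2, score in pairs:
    print(name_1, "|", name_2, "|", score)
```

This removes the mirrored rows, but it still only lists pairs; it does not yet decide which spelling each name should be replaced with.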


So in summary, what I'm looking for is a way to replace all variations of "Department of Commerce" with one version (it doesn't matter which one) so I can compare the sentiment scores across papers. I've found solutions that involve merging two DataFrames with fuzzy matching and others that involve fuzzy matching two separate columns in the same DataFrame, but so far haven't found a way to normalize a single String column in a DataFrame.
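
One possible way to get from pairwise matches to a single replacement per name is to treat the names as nodes, link every pair whose score clears a threshold, and give each linked group one representative. The following is only a sketch under stated assumptions: it uses difflib.SequenceMatcher from the standard library as a stand-in for a fuzzywuzzy scorer, the 0.8 threshold is an arbitrary illustrative value, the extra name "National Science Foundation" is made up to show an unrelated entity, and build_canonical_map and similarity are hypothetical helper names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity in [0, 1] (assumption: not fuzzywuzzy's scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_canonical_map(names, threshold=0.8):
    """Map every name to one representative of its fuzzy-match group."""
    unique = list(dict.fromkeys(names))  # de-duplicate, keep first-seen order
    parent = {name: name for name in unique}

    def find(name):
        # Walk up to the group's root, shortening the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every pair whose similarity clears the (illustrative) threshold.
    for a, b in combinations(unique, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    # Collect the members of each linked group.
    groups = {}
    for name in unique:
        groups.setdefault(find(name), []).append(name)

    # Arbitrary but deterministic choice: the longest spelling represents
    # the group (ties broken alphabetically).
    mapping = {}
    for members in groups.values():
        canonical = max(members, key=lambda s: (len(s), s))
        for name in members:
            mapping[name] = canonical
    return mapping


names = [
    "Dept. of Commerce",
    "Department of Commerce",
    "US Department of Commerce",
    "National Science Foundation",  # made-up unrelated entity
]
mapping = build_canonical_map(names)
```

With pandas, such a mapping could then be applied with df["entity_name"] = df["entity_name"].map(mapping), after which the scores can be grouped by entity_name. Because linking is transitive, two fairly different spellings can end up in the same group through an intermediate one, so the threshold would need tuning against the real data.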










      pandas data-cleaning fuzzy-search






asked Mar 27 at 20:28 by nmetts, edited Mar 27 at 21:00
























