Normalize Single String Column in Pandas by Fuzzy Matching
I have a Pandas DataFrame containing the results of entity sentiment using Google Cloud's Natural Language API (https://cloud.google.com/natural-language/docs/analyzing-entity-sentiment). I ran entity sentiment on a corpus of papers and want to compare how various entities are viewed by the various papers. However, the entity_name column contains duplicates due to subtle variations in how the organizations are referenced in the various papers. I would like to replace each entity name with one version so I can compare the sentiment analysis across papers.
Here's an example of what the DataFrame looks like:
entity_name                paper_name   score  magnitude
Dept. of Commerce          paper_1.pdf  0.67   0.13
Department of Commerce     paper_2.pdf  0.42   0.21
US Department of Commerce  paper_3.pdf  0.07   0.15
What I would like is to find, for example, all references to "Department of Commerce" so I can compare the sentiment scores for this entity across the various papers. This is not a large DataFrame (less than 100k rows), so the fastest/most optimal answer isn't a concern.
I've tried fuzzy matching the names using the fuzzywuzzy library in Python. I took the entity_name column and made two separate NumPy arrays of its unique values:
names_1 = df.entity_name.unique()
names_2 = df.entity_name.unique()
I then ran a fuzzy match for all pairs across the two arrays. The issue is that I still have duplicates. This is an example of what the match DataFrame looks like:
name_1                     name_2                     match_ratio
Dept. of Commerce          Department of Commerce     82
Department of Commerce     Dept. of Commerce          82
US Department of Commerce  Department of Commerce     100
Department of Commerce     US Department of Commerce  100
Dept. of Commerce          US Department of Commerce  82
US Department of Commerce  Dept. of Commerce          82
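The mirrored rows in the match table (A vs. B and then B vs. A) appear because both arrays hold the same unique values, so every ordered pair gets scored. A minimal sketch of scoring each unordered pair only once is below. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzywuzzy scorer, so the numbers will not necessarily match the ones in the table; the `score` helper and the hard-coded `names` list are illustrative assumptions, not part of the original attempt.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Stand-in for df.entity_name.unique(); hard-coded here for illustration.
names = [
    "Dept. of Commerce",
    "Department of Commerce",
    "US Department of Commerce",
]


def score(a: str, b: str) -> int:
    """Return a 0-100 similarity score (difflib-based, not fuzzywuzzy's scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


# combinations() yields each unordered pair exactly once,
# so (A, B) is scored but the mirrored (B, A) is not.
pairs = [(a, b, score(a, b)) for a, b in combinations(names, 2)]

for name_1, name_2, match_ratio in pairs:
    print(name_1, "|", name_2, "|", match_ratio)
```

This removes the mirrored rows but does not yet pick a single spelling per entity; it only halves the comparison table.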
So in summary, what I'm looking for is a way to replace all variations of "Department of Commerce" with one version (it doesn't matter which one) so I can compare the sentiment scores across papers. I've found solutions that involve merging two DataFrames with fuzzy matching and others that involve fuzzy matching two separate columns in the same DataFrame, but so far haven't found a way to normalize a single String column in a DataFrame.
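One possible way to get from pairwise scores to a single spelling per entity is to treat every pair above a threshold as "the same entity", group names into connected clusters, and map each name to one representative of its cluster. The sketch below is only an illustration of that idea, not an established answer: it uses `difflib.SequenceMatcher` from the standard library in place of fuzzywuzzy, and the `build_canonical_map` function, the `0.8` threshold, and the "longest spelling wins" rule are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_canonical_map(names, threshold=0.8):
    """Map every name to one representative of its fuzzy-match cluster."""
    names = list(dict.fromkeys(names))  # unique values, original order kept
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every unordered pair whose score clears the (assumed) threshold.
    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    # Collect the members of each cluster under its root.
    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)

    # Arbitrary but deterministic choice: the longest spelling represents the cluster.
    mapping = {}
    for members in clusters.values():
        canonical = max(members, key=lambda m: (len(m), m))
        for member in members:
            mapping[member] = canonical
    return mapping


names = [
    "Dept. of Commerce",
    "Department of Commerce",
    "US Department of Commerce",
    "National Science Foundation",  # unrelated name added for illustration
]
mapping = build_canonical_map(names)
for original, canonical in mapping.items():
    print(original, "->", canonical)
```

With a DataFrame like the one in the question, the mapping could presumably be built with `build_canonical_map(df["entity_name"].unique())` and applied with `df["entity_name"] = df["entity_name"].map(mapping)`, after which grouping by `entity_name` would compare scores across papers. One caveat of linking clusters transitively: two distinct entities with similar names (say, two different departments) can be chained together if the threshold is too low, so the resulting mapping is worth inspecting before it is applied.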
pandas data-cleaning fuzzy-search
asked Mar 27 at 20:28 by nmetts
edited Mar 27 at 21:00 by nmetts