Normalize Single String Column in Pandas by Fuzzy Matching

I have a Pandas DataFrame containing the results of entity sentiment analysis from Google Cloud's Natural Language API (https://cloud.google.com/natural-language/docs/analyzing-entity-sentiment). I ran entity sentiment on a corpus of papers and want to compare how different entities are viewed across those papers. However, the entity_name column contains duplicates caused by subtle variations in how each organization is referenced. I would like to replace each entity name with a single version so I can compare the sentiment analysis across papers.

Here's an example of what the DataFrame looks like:

entity_name                paper_name   score  magnitude
Dept. of Commerce          paper_1.pdf  0.67   0.13
Department of Commerce     paper_2.pdf  0.42   0.21
US Department of Commerce  paper_3.pdf  0.07   0.15

What I would like is to find, for example, all references to "Department of Commerce" so I can compare the sentiment scores for this entity across the various papers. This is not a large DataFrame (fewer than 100k rows), so performance isn't a concern.

I've tried fuzzy matching the names using the fuzzywuzzy library in Python. I took the entity_name column and made two separate NumPy arrays of its unique values:

names_1 = df.entity_name.unique()
names_2 = df.entity_name.unique()

I then ran a fuzzy match for all pairs across the two arrays. The issue is that I still have duplicates. This is an example of what the match DataFrame looks like:

name_1                     name_2                     match_ratio
Dept. of Commerce          Department of Commerce     82
Department of Commerce     Dept. of Commerce          82
US Department of Commerce  Department of Commerce     100
Department of Commerce     US Department of Commerce  100
Dept. of Commerce          US Department of Commerce  82
US Department of Commerce  Dept. of Commerce          82
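
A minimal sketch of scoring each unordered pair only once, which avoids the mirrored rows seen in a table like this. It is an illustration under assumptions, not the code used for the table above: to stay self-contained it uses the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzywuzzy scorer, so its scores may differ from the ones shown. The helper names `similarity` and `unique_pairs` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (assumed stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def unique_pairs(names):
    """Score every unordered pair of distinct names exactly once."""
    distinct = sorted(set(names))
    return [(a, b, similarity(a, b)) for a, b in combinations(distinct, 2)]


names = ["Dept. of Commerce", "Department of Commerce", "US Department of Commerce"]
pairs = unique_pairs(names)
for name_1, name_2, score in pairs:
    print(name_1, "|", name_2, "|", score)
```

With three distinct names this yields three rows instead of six, because `combinations` never emits both (a, b) and (b, a).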

So in summary, what I'm looking for is a way to replace all variations of "Department of Commerce" with one version (it doesn't matter which one) so I can compare the sentiment scores across papers. I've found solutions that involve merging two DataFrames with fuzzy matching and others that involve fuzzy matching two separate columns in the same DataFrame, but so far haven't found a way to normalize a single String column in a DataFrame.
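
One possible way to approach this, sketched under assumptions rather than offered as a definitive solution: treat names whose similarity clears a threshold as linked, group linked names transitively with a small union-find, and map every name in a group to one representative. The scorer is again `difflib.SequenceMatcher` from the standard library (a stand-in for fuzzywuzzy), and the threshold of 80, the function name `build_mapping`, and the extra "Federal Reserve" row are all hypothetical choices for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (assumed stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def build_mapping(names, threshold=80):
    """Map each name to one representative of its fuzzy-match group."""
    counts = Counter(names)
    distinct = sorted(counts)
    parent = {name: name for name in distinct}

    def find(name):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every pair of names that is similar enough.
    for a, b in combinations(distinct, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    # Collect the members of each group.
    groups = {}
    for name in distinct:
        groups.setdefault(find(name), []).append(name)

    # Representative: the most frequent spelling, ties broken alphabetically.
    mapping = {}
    for members in groups.values():
        canonical = min(members, key=lambda n: (-counts[n], n))
        for name in members:
            mapping[name] = canonical
    return mapping


names = [
    "Dept. of Commerce",
    "Department of Commerce",
    "US Department of Commerce",
    "Department of Commerce",
    "Federal Reserve",  # hypothetical unrelated entity
]
mapping = build_mapping(names)
for original, canonical in sorted(mapping.items()):
    print(original, "->", canonical)
```

The resulting dictionary can then be applied to the column with `df["entity_name"] = df["entity_name"].map(mapping)`. Because the grouping is transitive, two names can end up together through an intermediate spelling even when they do not clear the threshold directly; that helps with abbreviations, but a threshold that is too low can chain unrelated entities together, so the groups are worth inspecting before overwriting the column.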

edited Mar 27 at 21:00

asked Mar 27 at 20:28

nmetts

pandas data-cleaning fuzzy-search

