Fuzzywuzzy match multiple columns from different dataframes in PythonHow to merge on approximate strings?What is the difference between Python's list methods append and extend?Why is reading lines from stdin much slower in C++ than Python?Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersPandas: splitting dataframe into multiple dataframe based on threshold valuePython DataFrame particular columns conversiondataframes manipulation and merging python

Why are 120 V general receptacle circuits limited to 20 A?

Should I share with a new service provider a bill from its competitor?

What is the difference between x RadToDeg cos x div and COSC?

How did researchers use to find articles before the Internet and the computer era?

Generate and graph the Recamán Sequence

Mean Value Theorem: Continuous or Defined?

Did Wakanda officially get the stuff out of Bucky's head?

Who are these Discworld wizards from this picture?

How do Hassidim survive the summer heat?

Why do user defined scalar functions require the schema?

What are good ways to spray paint a QR code on a footpath?

Why was Mal so quick to drop Bester in favour of Kaylee?

Can you sign using a digital signature itself?

Do space suits measure "methane" levels or other biological gases?

Can Access Fault Exceptions of the MC68040 caused by internal access faults occur in normal situations?

Why isn’t the tax system continuous rather than bracketed?

How was film developed in the late 1920s?

Skipping over failed imports until they are needed (if ever)

What is the difference between handcrafted and learned features

Details of video memory access arbitration in Space Invaders

Java Optional working of orElse is not as if else

Prime parity peregrination

In native German words, is Q always followed by U, as in English?

Is there a category where products don't exist because uniqueness fails?



Fuzzywuzzy match multiple columns from different dataframes in Python


How to merge on approximate strings?What is the difference between Python's list methods append and extend?Why is reading lines from stdin much slower in C++ than Python?Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersPandas: splitting dataframe into multiple dataframe based on threshold valuePython DataFrame particular columns conversiondataframes manipulation and merging python






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








3















Let's say I have the following 3 dataframes:



import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)


I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:



First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:



df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)


Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:



df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))

for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)


my_list result:



('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......


At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):



 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing

matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27


How could I get final expected result in a way more efficient in Python? Thanks.










share|improve this question






















  • Is there a reason you are exporting to excel and then importing it again?

    – Erfan
    Mar 25 at 13:07











  • Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

    – ahbon
    Mar 25 at 13:20







  • 1





    stackoverflow.com/a/54494781/9177877

    – Chris
    Mar 25 at 14:06

















3















Let's say I have the following 3 dataframes:



import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)


I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:



First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:



df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)


Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:



df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))

for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)


my_list result:



('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......


At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):



 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing

matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27


How could I get final expected result in a way more efficient in Python? Thanks.










share|improve this question






















  • Is there a reason you are exporting to excel and then importing it again?

    – Erfan
    Mar 25 at 13:07











  • Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

    – ahbon
    Mar 25 at 13:20







  • 1





    stackoverflow.com/a/54494781/9177877

    – Chris
    Mar 25 at 14:06













3












3








3








Let's say I have the following 3 dataframes:



import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)


I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:



First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:



df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)


Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:



df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))

for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)


my_list result:



('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......


At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):



 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing

matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27


How could I get final expected result in a way more efficient in Python? Thanks.










share|improve this question














Let's say I have the following 3 dataframes:



import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)


I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:



First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:



df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)


Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:



df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))

for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)


my_list result:



('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......


At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):



 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing

matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27


How could I get final expected result in a way more efficient in Python? Thanks.







python pandas fuzzywuzzy






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Mar 25 at 12:26









ahbonahbon

7496 silver badges15 bronze badges




7496 silver badges15 bronze badges












  • Is there a reason you are exporting to excel and then importing it again?

    – Erfan
    Mar 25 at 13:07











  • Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

    – ahbon
    Mar 25 at 13:20







  • 1





    stackoverflow.com/a/54494781/9177877

    – Chris
    Mar 25 at 14:06

















  • Is there a reason you are exporting to excel and then importing it again?

    – Erfan
    Mar 25 at 13:07











  • Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

    – ahbon
    Mar 25 at 13:20







  • 1





    stackoverflow.com/a/54494781/9177877

    – Chris
    Mar 25 at 14:06
















Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07





Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07













Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20






Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20





1




1





stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06





stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06












1 Answer
1






active

oldest

votes


















1














try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]





share|improve this answer

























  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55337764%2ffuzzywuzzy-match-multiple-columns-from-different-dataframes-in-python%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]





share|improve this answer

























  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01















1














try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]





share|improve this answer

























  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01













1












1








1







try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]





share|improve this answer















try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]






share|improve this answer














share|improve this answer



share|improve this answer








edited Mar 25 at 16:41

























answered Mar 25 at 16:33









FrenchyFrenchy

2,6662 gold badges5 silver badges18 bronze badges




2,6662 gold badges5 silver badges18 bronze badges












  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01

















  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01
















Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01





Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01








Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.








Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.




















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55337764%2ffuzzywuzzy-match-multiple-columns-from-different-dataframes-in-python%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari

Swift 4 - func physicsWorld not invoked on collision? The Next CEO of Stack OverflowHow to call Objective-C code from Swift#ifdef replacement in the Swift language@selector() in Swift?#pragma mark in Swift?Swift for loop: for index, element in array?dispatch_after - GCD in Swift?Swift Beta performance: sorting arraysSplit a String into an array in Swift?The use of Swift 3 @objc inference in Swift 4 mode is deprecated?How to optimize UITableViewCell, because my UITableView lags

Access current req object everywhere in Node.js ExpressWhy are global variables considered bad practice? (node.js)Using req & res across functionsHow do I get the path to the current script with Node.js?What is Node.js' Connect, Express and “middleware”?Node.js w/ express error handling in callbackHow to access the GET parameters after “?” in Express?Modify Node.js req object parametersAccess “app” variable inside of ExpressJS/ConnectJS middleware?Node.js Express app - request objectAngular Http Module considered middleware?Session variables in ExpressJSAdd properties to the req object in expressjs with Typescript