Fuzzywuzzy match multiple columns from different dataframes in PythonHow to merge on approximate strings?What is the difference between Python's list methods append and extend?Why is reading lines from stdin much slower in C++ than Python?Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersPandas: splitting dataframe into multiple dataframe based on threshold valuePython DataFrame particular columns conversiondataframes manipulation and merging python

Why are 120 V general receptacle circuits limited to 20 A?

Should I share with a new service provider a bill from its competitor?

What is the difference between x RadToDeg cos x div and COSC?

How did researchers use to find articles before the Internet and the computer era?

Generate and graph the Recamán Sequence

Mean Value Theorem: Continuous or Defined?

Did Wakanda officially get the stuff out of Bucky's head?

Who are these Discworld wizards from this picture?

How do Hassidim survive the summer heat?

Why do user defined scalar functions require the schema?

What are good ways to spray paint a QR code on a footpath?

Why was Mal so quick to drop Bester in favour of Kaylee?

Can you sign using a digital signature itself?

Do space suits measure "methane" levels or other biological gases?

Can Access Fault Exceptions of the MC68040 caused by internal access faults occur in normal situations?

Why isn’t the tax system continuous rather than bracketed?

How was film developed in the late 1920s?

Skipping over failed imports until they are needed (if ever)

What is the difference between handcrafted and learned features

Details of video memory access arbitration in Space Invaders

Java Optional working of orElse is not as if else

Prime parity peregrination

In native German words, is Q always followed by U, as in English?

Is there a category where products don't exist because uniqueness fails?



Fuzzywuzzy match multiple columns from different dataframes in Python


How to merge on approximate strings?What is the difference between Python's list methods append and extend?Why is reading lines from stdin much slower in C++ than Python?Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersPandas: splitting dataframe into multiple dataframe based on threshold valuePython DataFrame particular columns conversiondataframes manipulation and merging python






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








3















Let's say I have the following 3 dataframes:



import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)


I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:



First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:



df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)


Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:



df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))

for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)


my_list result:



('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......


At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):



 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing

matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27


How could I get final expected result in a way more efficient in Python? Thanks.










share|improve this question






















  • Is there a reason you are exporting to excel and then importing it again?

    – Erfan
    Mar 25 at 13:07











  • Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

    – ahbon
    Mar 25 at 13:20







  • 1





    stackoverflow.com/a/54494781/9177877

    – Chris
    Mar 25 at 14:06

















3















Let's say I have the following 3 dataframes:



import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)


I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:



First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:



df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)


Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:



df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))

for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)


my_list result:



('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......


At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):



 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing

matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27


How could I get final expected result in a way more efficient in Python? Thanks.










share|improve this question






















  • Is there a reason you are exporting to excel and then importing it again?

    – Erfan
    Mar 25 at 13:07











  • Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

    – ahbon
    Mar 25 at 13:20







  • 1





    stackoverflow.com/a/54494781/9177877

    – Chris
    Mar 25 at 14:06













3












3








3








Let's say I have the following 3 dataframes:



import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)


I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:



First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:



df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)


Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:



df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))

for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)


my_list result:



('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......


At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):



 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing

matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27


How could I get final expected result in a way more efficient in Python? Thanks.










share|improve this question














Let's say I have the following 3 dataframes:



import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)


I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:



First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:



df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)


Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:



df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))

for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)


my_list result:



('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......


At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):



 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing

matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27


How could I get final expected result in a way more efficient in Python? Thanks.







python pandas fuzzywuzzy






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Mar 25 at 12:26









ahbonahbon

7496 silver badges15 bronze badges




7496 silver badges15 bronze badges












  • Is there a reason you are exporting to excel and then importing it again?

    – Erfan
    Mar 25 at 13:07











  • Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

    – ahbon
    Mar 25 at 13:20







  • 1





    stackoverflow.com/a/54494781/9177877

    – Chris
    Mar 25 at 14:06

















  • Is there a reason you are exporting to excel and then importing it again?

    – Erfan
    Mar 25 at 13:07











  • Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

    – ahbon
    Mar 25 at 13:20







  • 1





    stackoverflow.com/a/54494781/9177877

    – Chris
    Mar 25 at 14:06
















Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07





Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07













Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20






Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20





1




1





stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06





stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06












1 Answer
1






active

oldest

votes


















1














try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]





share|improve this answer

























  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55337764%2ffuzzywuzzy-match-multiple-columns-from-different-dataframes-in-python%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]





share|improve this answer

























  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01















1














try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]





share|improve this answer

























  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01













1












1








1







try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]





share|improve this answer















try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...



import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

:
:

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]






share|improve this answer














share|improve this answer



share|improve this answer








edited Mar 25 at 16:41

























answered Mar 25 at 16:33









FrenchyFrenchy

2,6662 gold badges5 silver badges18 bronze badges




2,6662 gold badges5 silver badges18 bronze badges












  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01

















  • Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

    – ahbon
    Mar 26 at 3:01
















Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01





Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01








Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.








Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.




















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55337764%2ffuzzywuzzy-match-multiple-columns-from-different-dataframes-in-python%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari

SQL error code 1064 with creating Laravel foreign keysForeign key constraints: When to use ON UPDATE and ON DELETEDropping column with foreign key Laravel error: General error: 1025 Error on renameLaravel SQL Can't create tableLaravel Migration foreign key errorLaravel php artisan migrate:refresh giving a syntax errorSQLSTATE[42S01]: Base table or view already exists or Base table or view already exists: 1050 Tableerror in migrating laravel file to xampp serverSyntax error or access violation: 1064:syntax to use near 'unsigned not null, modelName varchar(191) not null, title varchar(191) not nLaravel cannot create new table field in mysqlLaravel 5.7:Last migration creates table but is not registered in the migration table

은진 송씨 목차 역사 본관 분파 인물 조선 왕실과의 인척 관계 집성촌 항렬자 인구 같이 보기 각주 둘러보기 메뉴은진 송씨세종실록 149권, 지리지 충청도 공주목 은진현