Fuzzywuzzy match multiple columns from different dataframes in PythonHow to merge on approximate strings?What is the difference between Python's list methods append and extend?Why is reading lines from stdin much slower in C++ than Python?Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersPandas: splitting dataframe into multiple dataframe based on threshold valuePython DataFrame particular columns conversiondataframes manipulation and merging python

Why are 120 V general receptacle circuits limited to 20 A?

Should I share with a new service provider a bill from its competitor?

What is the difference between x RadToDeg cos x div and COSC?

How did researchers use to find articles before the Internet and the computer era?

Generate and graph the Recamán Sequence

Mean Value Theorem: Continuous or Defined?

Did Wakanda officially get the stuff out of Bucky's head?

Who are these Discworld wizards from this picture?

How do Hassidim survive the summer heat?

Why do user defined scalar functions require the schema?

What are good ways to spray paint a QR code on a footpath?

Why was Mal so quick to drop Bester in favour of Kaylee?

Can you sign using a digital signature itself?

Do space suits measure "methane" levels or other biological gases?

Can Access Fault Exceptions of the MC68040 caused by internal access faults occur in normal situations?

Why isn’t the tax system continuous rather than bracketed?

How was film developed in the late 1920s?

Skipping over failed imports until they are needed (if ever)

What is the difference between handcrafted and learned features

Details of video memory access arbitration in Space Invaders

Java Optional working of orElse is not as if else

Prime parity peregrination

In native German words, is Q always followed by U, as in English?

Is there a category where products don't exist because uniqueness fails?

Fuzzywuzzy match multiple columns from different dataframes in Python

How to merge on approximate strings?What is the difference between Python's list methods append and extend?Why is reading lines from stdin much slower in C++ than Python?Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersPandas: splitting dataframe into multiple dataframe based on threshold valuePython DataFrame particular columns conversiondataframes manipulation and merging python

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;

Let's say I have the following 3 dataframes:

import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
 [1010667747, 'Suzhou', 'Suzhou IFS'],
 [1010667356, 'Shenzhen', 'Kingkey 100'],
 [1010667289, 'Wuhan', 'Wuhan Center']]),
 columns=['id', 'city', 'name']
) 
df2 = pd.DataFrame(np.array([
 [190010, 'Shenzhen', 'Ping An Finance Centre'],
 [190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
 [190015, 'Beijing', 'China Zun']]),
 columns=['id', 'city', 'name']
) 
df3 = pd.DataFrame(np.array([
 ['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
 ['ZY-15', 'Hong Kong', 'International Commerce Centre'],
 ['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
 columns=['id', 'city', 'name']
)

I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:

First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:

df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name'] 
df5 = df2['full_name'] 
df6 = df3['full_name'] 

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)

Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:

df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
 if project:
 processedProject = fuzz._process_and_sort(project, True, True)
 processedProjects.append(processedProject)
 matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
 writer = csv.writer(f1, delimiter=',', lineterminator='n', )
 writer.writerow(('name', 'matched_name', 'similarity_ratio'))

 for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
 matcher = matchers[project1[0]]
 matcher.set_seq2(project2[1])
 ratio = int(round(100 * matcher.ratio()))
 if ratio >= threshold_ratio:
 #print(projects[project1[0]], projects[project2[0]])
 my_list = projects[project1[0]], projects[project2[0]], ratio
 print(my_list)
 writer.writerow(my_list)

my_list result:

('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......

At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):

 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen 
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan 
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen 
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou 
4 1010667747 Suzhou Suzhou IFS 190015 Beijing 

 matched_name.1 similarity_ratio 
0 Kingkey 100 44 
1 Wuhan Center 49 
2 Ping An Finance Centre 33 
3 Guangzhou CTF Finance Centre 47 
4 China Zun 27

How could I get final expected result in a way more efficient in Python? Thanks.

asked Mar 25 at 12:26

ahbon

7496 silver badges15 bronze badges

Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07

Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20

1

stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06

add a comment |

Let's say I have the following 3 dataframes:

import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
 [1010667747, 'Suzhou', 'Suzhou IFS'],
 [1010667356, 'Shenzhen', 'Kingkey 100'],
 [1010667289, 'Wuhan', 'Wuhan Center']]),
 columns=['id', 'city', 'name']
) 
df2 = pd.DataFrame(np.array([
 [190010, 'Shenzhen', 'Ping An Finance Centre'],
 [190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
 [190015, 'Beijing', 'China Zun']]),
 columns=['id', 'city', 'name']
) 
df3 = pd.DataFrame(np.array([
 ['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
 ['ZY-15', 'Hong Kong', 'International Commerce Centre'],
 ['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
 columns=['id', 'city', 'name']
)

I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:

df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name'] 
df5 = df2['full_name'] 
df6 = df3['full_name'] 

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)

Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:

df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
 if project:
 processedProject = fuzz._process_and_sort(project, True, True)
 processedProjects.append(processedProject)
 matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
 writer = csv.writer(f1, delimiter=',', lineterminator='n', )
 writer.writerow(('name', 'matched_name', 'similarity_ratio'))

 for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
 matcher = matchers[project1[0]]
 matcher.set_seq2(project2[1])
 ratio = int(round(100 * matcher.ratio()))
 if ratio >= threshold_ratio:
 #print(projects[project1[0]], projects[project2[0]])
 my_list = projects[project1[0]], projects[project2[0]], ratio
 print(my_list)
 writer.writerow(my_list)

my_list result:

('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......

At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):

 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen 
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan 
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen 
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou 
4 1010667747 Suzhou Suzhou IFS 190015 Beijing 

 matched_name.1 similarity_ratio 
0 Kingkey 100 44 
1 Wuhan Center 49 
2 Ping An Finance Centre 33 
3 Guangzhou CTF Finance Centre 47 
4 China Zun 27

How could I get final expected result in a way more efficient in Python? Thanks.

asked Mar 25 at 12:26

ahbon

7496 silver badges15 bronze badges

Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07

Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20

1

stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06

add a comment |

Let's say I have the following 3 dataframes:

import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
 [1010667747, 'Suzhou', 'Suzhou IFS'],
 [1010667356, 'Shenzhen', 'Kingkey 100'],
 [1010667289, 'Wuhan', 'Wuhan Center']]),
 columns=['id', 'city', 'name']
) 
df2 = pd.DataFrame(np.array([
 [190010, 'Shenzhen', 'Ping An Finance Centre'],
 [190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
 [190015, 'Beijing', 'China Zun']]),
 columns=['id', 'city', 'name']
) 
df3 = pd.DataFrame(np.array([
 ['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
 ['ZY-15', 'Hong Kong', 'International Commerce Centre'],
 ['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
 columns=['id', 'city', 'name']
)

I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:

df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name'] 
df5 = df2['full_name'] 
df6 = df3['full_name'] 

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)

Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:

df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
 if project:
 processedProject = fuzz._process_and_sort(project, True, True)
 processedProjects.append(processedProject)
 matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
 writer = csv.writer(f1, delimiter=',', lineterminator='n', )
 writer.writerow(('name', 'matched_name', 'similarity_ratio'))

 for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
 matcher = matchers[project1[0]]
 matcher.set_seq2(project2[1])
 ratio = int(round(100 * matcher.ratio()))
 if ratio >= threshold_ratio:
 #print(projects[project1[0]], projects[project2[0]])
 my_list = projects[project1[0]], projects[project2[0]], ratio
 print(my_list)
 writer.writerow(my_list)

my_list result:

('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......

At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):

 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen 
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan 
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen 
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou 
4 1010667747 Suzhou Suzhou IFS 190015 Beijing 

 matched_name.1 similarity_ratio 
0 Kingkey 100 44 
1 Wuhan Center 49 
2 Ping An Finance Centre 33 
3 Guangzhou CTF Finance Centre 47 
4 China Zun 27

How could I get final expected result in a way more efficient in Python? Thanks.

asked Mar 25 at 12:26

ahbon

7496 silver badges15 bronze badges

Let's say I have the following 3 dataframes:

import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
 [1010667747, 'Suzhou', 'Suzhou IFS'],
 [1010667356, 'Shenzhen', 'Kingkey 100'],
 [1010667289, 'Wuhan', 'Wuhan Center']]),
 columns=['id', 'city', 'name']
) 
df2 = pd.DataFrame(np.array([
 [190010, 'Shenzhen', 'Ping An Finance Centre'],
 [190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
 [190015, 'Beijing', 'China Zun']]),
 columns=['id', 'city', 'name']
) 
df3 = pd.DataFrame(np.array([
 ['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
 ['ZY-15', 'Hong Kong', 'International Commerce Centre'],
 ['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
 columns=['id', 'city', 'name']
)

I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:

df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name'] 
df5 = df2['full_name'] 
df6 = df3['full_name'] 

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)

Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:

df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
 if project:
 processedProject = fuzz._process_and_sort(project, True, True)
 processedProjects.append(processedProject)
 matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
 writer = csv.writer(f1, delimiter=',', lineterminator='n', )
 writer.writerow(('name', 'matched_name', 'similarity_ratio'))

 for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
 matcher = matchers[project1[0]]
 matcher.set_seq2(project2[1])
 ratio = int(round(100 * matcher.ratio()))
 if ratio >= threshold_ratio:
 #print(projects[project1[0]], projects[project2[0]])
 my_list = projects[project1[0]], projects[project2[0]], ratio
 print(my_list)
 writer.writerow(my_list)

my_list result:

('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......

At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):

 id city name matched_id matched_name 
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen 
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan 
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen 
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou 
4 1010667747 Suzhou Suzhou IFS 190015 Beijing 

 matched_name.1 similarity_ratio 
0 Kingkey 100 44 
1 Wuhan Center 49 
2 Ping An Finance Centre 33 
3 Guangzhou CTF Finance Centre 47 
4 China Zun 27

How could I get final expected result in a way more efficient in Python? Thanks.

python pandas fuzzywuzzy

asked Mar 25 at 12:26

ahbon

7496 silver badges15 bronze badges

asked Mar 25 at 12:26

ahbon

7496 silver badges15 bronze badges

asked Mar 25 at 12:26

ahbon

7496 silver badges15 bronze badges

asked Mar 25 at 12:26

ahbon

7496 silver badges15 bronze badges

asked Mar 25 at 12:26

ahbon

7496 silver badges15 bronze badges

Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07

Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20

1

stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06

add a comment |

Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07

Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20

1

stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06

Is there a reason you are exporting to excel and then importing it again?

– Erfan
Mar 25 at 13:07

Good question. If I don't export to excel and read it, I got a AttributeError: 'Series' object has no attribute 'full_name', quite weird.

– ahbon
Mar 25 at 13:20

stackoverflow.com/a/54494781/9177877

– Chris
Mar 25 at 14:06

add a comment |

1 Answer
1

active

oldest

votes

try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...

import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

 :
 :

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
 print(name)
 #print(DataFrameDict[name]

edited Mar 25 at 16:41

answered Mar 25 at 16:33

Frenchy

2,6662 gold badges5 silver badges18 bronze badges

Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55337764%2ffuzzywuzzy-match-multiple-columns-from-different-dataframes-in-python%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...

import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

 :
 :

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
 print(name)
 #print(DataFrameDict[name]

edited Mar 25 at 16:41

answered Mar 25 at 16:33

Frenchy

2,6662 gold badges5 silver badges18 bronze badges

Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01

add a comment |

try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...

import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

 :
 :

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
 print(name)
 #print(DataFrameDict[name]

edited Mar 25 at 16:41

answered Mar 25 at 16:33

Frenchy

2,6662 gold badges5 silver badges18 bronze badges

Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01

add a comment |

try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...

import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

 :
 :

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
 print(name)
 #print(DataFrameDict[name]

edited Mar 25 at 16:41

answered Mar 25 at 16:33

Frenchy

2,6662 gold badges5 silver badges18 bronze badges

try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...

import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

 :
 :

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])

for name in DataFrameDict.keys():
 print(name)
 #print(DataFrameDict[name]

edited Mar 25 at 16:41

answered Mar 25 at 16:33

Frenchy

2,6662 gold badges5 silver badges18 bronze badges

edited Mar 25 at 16:41

answered Mar 25 at 16:33

Frenchy

2,6662 gold badges5 silver badges18 bronze badges

answered Mar 25 at 16:33

Frenchy

2,6662 gold badges5 silver badges18 bronze badges

answered Mar 25 at 16:33

Frenchy

2,6662 gold badges5 silver badges18 bronze badges

Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01

add a comment |

Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01

Thanks. Please write to excel if you like: df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)

– ahbon
Mar 26 at 3:01

add a comment |

Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Styjun

1 Answer
1

Your Answer

Post as a guest

1 Answer
1

1 Answer
1

Post as a guest

Popular posts from this blog

용인 삼성생명 블루밍스 목차 통계 역대 감독 선수단 응원단 경기장 같이 보기 외부 링크 둘러보기 메뉴samsungblueminx.comeh선수 명단용인 삼성생명 블루밍스용인 삼성생명 블루밍스ehsamsungblueminx.comeheheheh

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

용인 삼성생명 블루밍스 목차 통계 역대 감독 선수단 응원단 경기장 같이 보기 외부 링크 둘러보기 메뉴samsungblueminx.comeh선수 명단용인 삼성생명 블루밍스용인 삼성생명 블루밍스ehsamsungblueminx.comeheheheh

1 Answer
1

1 Answer
1

1 Answer
1