Fuzzywuzzy match multiple columns from different dataframes in PythonHow to merge on approximate strings?What is the difference between Python's list methods append and extend?Why is reading lines from stdin much slower in C++ than Python?Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersPandas: splitting dataframe into multiple dataframe based on threshold valuePython DataFrame particular columns conversiondataframes manipulation and merging python
Why are 120 V general receptacle circuits limited to 20 A?
Should I share with a new service provider a bill from its competitor?
What is the difference between x RadToDeg cos x div and COSC?
How did researchers use to find articles before the Internet and the computer era?
Generate and graph the Recamán Sequence
Mean Value Theorem: Continuous or Defined?
Did Wakanda officially get the stuff out of Bucky's head?
Who are these Discworld wizards from this picture?
How do Hassidim survive the summer heat?
Why do user defined scalar functions require the schema?
What are good ways to spray paint a QR code on a footpath?
Why was Mal so quick to drop Bester in favour of Kaylee?
Can you sign using a digital signature itself?
Do space suits measure "methane" levels or other biological gases?
Can Access Fault Exceptions of the MC68040 caused by internal access faults occur in normal situations?
Why isn’t the tax system continuous rather than bracketed?
How was film developed in the late 1920s?
Skipping over failed imports until they are needed (if ever)
What is the difference between handcrafted and learned features
Details of video memory access arbitration in Space Invaders
Java Optional working of orElse is not as if else
Prime parity peregrination
In native German words, is Q always followed by U, as in English?
Is there a category where products don't exist because uniqueness fails?
Fuzzywuzzy match multiple columns from different dataframes in Python
How to merge on approximate strings?What is the difference between Python's list methods append and extend?Why is reading lines from stdin much slower in C++ than Python?Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersPandas: splitting dataframe into multiple dataframe based on threshold valuePython DataFrame particular columns conversiondataframes manipulation and merging python
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;
Let's say I have the following 3 dataframes:
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter
df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)
I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy
package, here is my solution which need to improve:
First, I concatenate all three dataframes to one column as full_name
. At this step, in fact, I shouldn't add id
to full_name
but in order to better distingue building names from different dataframes, I have added it:
df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']
df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']
frames = [df4, df5, df6]
df = pd.concat(frames)
df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)
Second, I iterate all full_names
and compare with each others to get similarity_ratio
of each pairs of building names:
df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()
processedProjects = []
matchers = []
threshold_ratio = 10
for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))
with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))
for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)
my_list
result:
('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......
At last step, I split output10.csv
manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):
id city name matched_id matched_name
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing
matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27
How could I get final expected result in a way more efficient in Python? Thanks.
python pandas fuzzywuzzy
add a comment |
Let's say I have the following 3 dataframes:
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter
df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)
I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy
package, here is my solution which need to improve:
First, I concatenate all three dataframes to one column as full_name
. At this step, in fact, I shouldn't add id
to full_name
but in order to better distingue building names from different dataframes, I have added it:
df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']
df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']
frames = [df4, df5, df6]
df = pd.concat(frames)
df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)
Second, I iterate all full_names
and compare with each others to get similarity_ratio
of each pairs of building names:
df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()
processedProjects = []
matchers = []
threshold_ratio = 10
for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))
with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))
for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)
my_list
result:
('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......
At last step, I split output10.csv
manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):
id city name matched_id matched_name
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing
matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27
How could I get final expected result in a way more efficient in Python? Thanks.
python pandas fuzzywuzzy
Is there a reason you are exporting to excel and then importing it again?
– Erfan
Mar 25 at 13:07
Good question. If I don't export to excel and read it, I got aAttributeError: 'Series' object has no attribute 'full_name'
, quite weird.
– ahbon
Mar 25 at 13:20
1
stackoverflow.com/a/54494781/9177877
– Chris
Mar 25 at 14:06
add a comment |
Let's say I have the following 3 dataframes:
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter
df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)
I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy
package, here is my solution which need to improve:
First, I concatenate all three dataframes to one column as full_name
. At this step, in fact, I shouldn't add id
to full_name
but in order to better distingue building names from different dataframes, I have added it:
df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']
df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']
frames = [df4, df5, df6]
df = pd.concat(frames)
df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)
Second, I iterate all full_names
and compare with each others to get similarity_ratio
of each pairs of building names:
df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()
processedProjects = []
matchers = []
threshold_ratio = 10
for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))
with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))
for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)
my_list
result:
('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......
At last step, I split output10.csv
manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):
id city name matched_id matched_name
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing
matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27
How could I get final expected result in a way more efficient in Python? Thanks.
python pandas fuzzywuzzy
Let's say I have the following 3 dataframes:
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter
df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
[190010, 'Shenzhen', 'Ping An Finance Centre'],
[190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
[190015, 'Beijing', 'China Zun']]),
columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
['ZY-15', 'Hong Kong', 'International Commerce Centre'],
['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
columns=['id', 'city', 'name']
)
I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy
package, here is my solution which need to improve:
First, I concatenate all three dataframes to one column as full_name
. At this step, in fact, I shouldn't add id
to full_name
but in order to better distingue building names from different dataframes, I have added it:
df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']
df4 = df1['full_name']
df5 = df2['full_name']
df6 = df3['full_name']
frames = [df4, df5, df6]
df = pd.concat(frames)
df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)
Second, I iterate all full_names
and compare with each others to get similarity_ratio
of each pairs of building names:
df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()
processedProjects = []
matchers = []
threshold_ratio = 10
for project in projects:
if project:
processedProject = fuzz._process_and_sort(project, True, True)
processedProjects.append(processedProject)
matchers.append(fuzz.SequenceMatcher(None, processedProject))
with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
writer = csv.writer(f1, delimiter=',', lineterminator='n', )
writer.writerow(('name', 'matched_name', 'similarity_ratio'))
for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
matcher = matchers[project1[0]]
matcher.set_seq2(project2[1])
ratio = int(round(100 * matcher.ratio()))
if ratio >= threshold_ratio:
#print(projects[project1[0]], projects[project2[0]])
my_list = projects[project1[0]], projects[project2[0]], ratio
print(my_list)
writer.writerow(my_list)
my_list
result:
('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......
At last step, I split output10.csv
manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):
id city name matched_id matched_name
0 1010667747 Suzhou Suzhou IFS 1010667356 Shenzhen
1 1010667747 Suzhou Suzhou IFS 1010667289 Wuhan
2 1010667747 Suzhou Suzhou IFS 190010 Shenzhen
3 1010667747 Suzhou Suzhou IFS 190012 Guangzhou
4 1010667747 Suzhou Suzhou IFS 190015 Beijing
matched_name.1 similarity_ratio
0 Kingkey 100 44
1 Wuhan Center 49
2 Ping An Finance Centre 33
3 Guangzhou CTF Finance Centre 47
4 China Zun 27
How could I get final expected result in a way more efficient in Python? Thanks.
python pandas fuzzywuzzy
python pandas fuzzywuzzy
asked Mar 25 at 12:26
ahbonahbon
7496 silver badges15 bronze badges
7496 silver badges15 bronze badges
Is there a reason you are exporting to excel and then importing it again?
– Erfan
Mar 25 at 13:07
Good question. If I don't export to excel and read it, I got aAttributeError: 'Series' object has no attribute 'full_name'
, quite weird.
– ahbon
Mar 25 at 13:20
1
stackoverflow.com/a/54494781/9177877
– Chris
Mar 25 at 14:06
add a comment |
Is there a reason you are exporting to excel and then importing it again?
– Erfan
Mar 25 at 13:07
Good question. If I don't export to excel and read it, I got aAttributeError: 'Series' object has no attribute 'full_name'
, quite weird.
– ahbon
Mar 25 at 13:20
1
stackoverflow.com/a/54494781/9177877
– Chris
Mar 25 at 14:06
Is there a reason you are exporting to excel and then importing it again?
– Erfan
Mar 25 at 13:07
Is there a reason you are exporting to excel and then importing it again?
– Erfan
Mar 25 at 13:07
Good question. If I don't export to excel and read it, I got a
AttributeError: 'Series' object has no attribute 'full_name'
, quite weird.– ahbon
Mar 25 at 13:20
Good question. If I don't export to excel and read it, I got a
AttributeError: 'Series' object has no attribute 'full_name'
, quite weird.– ahbon
Mar 25 at 13:20
1
1
stackoverflow.com/a/54494781/9177877
– Chris
Mar 25 at 14:06
stackoverflow.com/a/54494781/9177877
– Chris
Mar 25 at 14:06
add a comment |
1 Answer
1
active
oldest
votes
try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...
import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
:
:
frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())
#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]
#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])
for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]
Thanks. Please write to excel if you like:df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)
– ahbon
Mar 26 at 3:01
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55337764%2ffuzzywuzzy-match-multiple-columns-from-different-dataframes-in-python%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...
import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
:
:
frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())
#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]
#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])
for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]
Thanks. Please write to excel if you like:df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)
– ahbon
Mar 26 at 3:01
add a comment |
try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...
import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
:
:
frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())
#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]
#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])
for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]
Thanks. Please write to excel if you like:df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)
– ahbon
Mar 26 at 3:01
add a comment |
try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...
import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
:
:
frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())
#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]
#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])
for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]
try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...
import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
:
:
frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())
#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]
#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])
for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]
edited Mar 25 at 16:41
answered Mar 25 at 16:33
FrenchyFrenchy
2,6662 gold badges5 silver badges18 bronze badges
2,6662 gold badges5 silver badges18 bronze badges
Thanks. Please write to excel if you like:df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)
– ahbon
Mar 26 at 3:01
add a comment |
Thanks. Please write to excel if you like:df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)
– ahbon
Mar 26 at 3:01
Thanks. Please write to excel if you like:
df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)
– ahbon
Mar 26 at 3:01
Thanks. Please write to excel if you like:
df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)
– ahbon
Mar 26 at 3:01
add a comment |
Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.
Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55337764%2ffuzzywuzzy-match-multiple-columns-from-different-dataframes-in-python%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Is there a reason you are exporting to excel and then importing it again?
– Erfan
Mar 25 at 13:07
Good question. If I don't export to excel and read it, I got a
AttributeError: 'Series' object has no attribute 'full_name'
, quite weird.– ahbon
Mar 25 at 13:20
1
stackoverflow.com/a/54494781/9177877
– Chris
Mar 25 at 14:06