How would I group a set of similar product names?
I have a list of product names, some of which are redundant or similar:
List = ['CocaCola','CocaCola 3 Oz','Twix','Twix Caramel','Foldgers 3 Oz','Foldgers 10 Oz','Haagen Dazs Caramel','Black Forest Ham','Black Label Whiskey',...]
I would like to write a function that groups similar product names and returns a list:
NewList = ['CocaCola','Twix','Foldgers','Haagen Dazs Caramel','Black Forest Ham','Black Label Whiskey',...]
I thought about matching substrings, but that wouldn't work since 'CocaCola 3 Oz' and 'Foldgers 3 Oz' would both map to '3 Oz'.
I also thought about taking just the first word of each product name:
NewList = []
for w in List:
    ws = w.split(' ')
    NewList.append(ws[0])
But that would map 'Black Forest Ham' and 'Black Label Whiskey' to 'Black'.
How can I get this mapping? I know of beautifulsoup and thought it might help, but I couldn't find any posts that indicate that.
To clarify based on BruceWayne's comments:
I'm getting the list from a Pandas df (don't know why that is relevant?).
'CocaCola' and 'Pepsi' would map to different groups, 'CocaCola' and 'Pepsi'. 'Black Forest Ham' and 'Oscar Meyer Ham' would also map to different groups. 'CocaCola' and 'CocaCola Light' would map to the same group, 'CocaCola'. Basically, I'm looking for grouping based on brand names, not product categories. That is what determines similarity.
I already provided an example of what the output would look like based on the input.
I thought beautifulsoup would help because it is a text processing library.
python
edited Mar 25 at 2:57
Klaus D.
asked Mar 25 at 2:29
Alex Kinman
What determines if the products are similar? Twix and Twix Caramel is more obvious, but would you want to group CocaCola and Pepsi together? What about Black Forest Ham and Oscar Meyer Ham? For the samples you included, can you show the expected output? What does beautifulsoup have to do with it? Are you getting this list from somewhere online, or an XML document?
– BruceWayne
Mar 25 at 2:34
Regarding your edit: how would you identify the brand name in the string?
– Klaus D.
Mar 25 at 2:58
@KlausD. that's kind of the gist of my question. I don't know all the brand names beforehand, or I would use some sort of lookup table.
– Alex Kinman
Mar 25 at 3:03
So, you know the answer already. What you are trying to do is part of natural language processing and a complex topic on its own.
– Klaus D.
Mar 25 at 3:07
(My apologies, didn't realize the second list was the expected result.)
– BruceWayne
Mar 25 at 15:59
2 Answers
You can achieve what you are trying to do by using clustering algorithms on your dataset.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

a = ['CocaCola','CocaCola 3 Oz','Twix','Twix Caramel','Foldgers 3 Oz','Foldgers 10 Oz','Haagen Dazs Caramel','Black Forest Ham','Black Label Whiskey']

cv = CountVectorizer()       # bag-of-words vectorizer
vect = cv.fit_transform(a)   # one row per name, one column per word
km = KMeans(n_clusters=6)    # the number of clusters is chosen up front
km.fit_predict(vect)         # returns one cluster label per name
OUTPUT:
array([0, 0, 1, 1, 2, 2, 4, 3, 5], dtype=int32)
This tells us that:
Cluster 0: 'CocaCola','CocaCola 3 Oz'
Cluster 1: 'Twix','Twix Caramel'
Cluster 2: 'Foldgers 3 Oz','Foldgers 10 Oz'
Cluster 3: 'Black Forest Ham'
Cluster 4: 'Haagen Dazs Caramel'
Cluster 5: 'Black Label Whiskey'
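Reading the groups off the label array by hand is error-prone, so here is a minimal sketch of how the labels could be turned back into groups of names. The labels are copied from the output shown above rather than recomputed, and `group_by_label` is a hypothetical helper name, not part of scikit-learn.

```python
from collections import defaultdict

names = ['CocaCola', 'CocaCola 3 Oz', 'Twix', 'Twix Caramel',
         'Foldgers 3 Oz', 'Foldgers 10 Oz', 'Haagen Dazs Caramel',
         'Black Forest Ham', 'Black Label Whiskey']

# Labels copied from the output shown above; a fresh KMeans run may
# number the clusters differently.
labels = [0, 0, 1, 1, 2, 2, 4, 3, 5]


def group_by_label(names, labels):
    """Collect the names that share a cluster label (hypothetical helper)."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)


clusters = group_by_label(names, labels)
for label in sorted(clusters):
    print(f"Cluster {label}: {clusters[label]}")
```

The same helper should work on the array returned by `km.fit_predict(vect)`, since the `int(label)` conversion accepts NumPy integers as well.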
You first vectorize your data, i.e., you convert each item in your list into a 1D array of numbers. I am using a CountVectorizer here (easy to understand and sufficient for this purpose), but there are other vectorizers available too. Each position in the 1D array represents a word, and the value at that position is the number of times that word occurs in the text. CountVectorizer is also known as the Bag of Words algorithm.
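To make the vectorizing step concrete, here is a small standard-library sketch of the bag-of-words idea. It is a simplified stand-in and not scikit-learn's implementation: the `tokenize` and `to_vector` helpers are hypothetical, and the tokenizing rule (lowercase, tokens of two or more word characters) is an approximation of CountVectorizer's default behavior.

```python
import re
from collections import Counter

names = ['CocaCola', 'CocaCola 3 Oz', 'Twix', 'Twix Caramel']


def tokenize(text):
    # Lowercase and keep runs of two or more word characters, which is
    # roughly what CountVectorizer does with its default settings.
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: every distinct token, in sorted order, one column each.
vocabulary = sorted({tok for name in names for tok in tokenize(name)})


def to_vector(text):
    # Count how often each vocabulary word occurs in this text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [to_vector(name) for name in names]
print(vocabulary)
for name, vec in zip(names, vectors):
    print(f"{name!r:18} -> {vec}")
```

Note that under this rule the single-character token '3' is dropped, so 'CocaCola 3 Oz' differs from 'CocaCola' only in the 'oz' column.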
Once again, there are many clustering algorithms to choose from, and I have chosen KMeans clustering for the same reason as before: it is easy to understand and implement.
Note: You need to specify the number of clusters you require, as in km=KMeans(n_clusters=6). Changing this value might change your results. For example,
with km=KMeans(n_clusters=5), 'Black Forest Ham' and 'Black Label Whiskey' may end up in the same cluster.
I hope this helps you.
answered Mar 25 at 5:26
Sridhar Murali
You could use a regular expression to pick up only the name part that's before a number.
import re

products = ['CocaCola','CocaCola 3 Oz','Twix','Twix Caramel','Foldgers 3 Oz','Foldgers 10 Oz','Haagen Dazs Caramel','Black Forest Ham','Black Label Whiskey']

# Appending "0" guarantees a digit to match, so names without a number are kept whole.
products = list(set(re.findall(r"(.*?)[0-9]", name + "0")[0].strip() for name in products))
print(products)
# ['Black Label Whiskey', 'CocaCola', 'Black Forest Ham', 'Twix Caramel', 'Twix', 'Haagen Dazs Caramel', 'Foldgers']
# (a set is unordered, so the order may differ)
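The regex result still lists 'Twix' and 'Twix Caramel' separately. One possible extension, sketched below, folds a name into a shorter one when the shorter name is a leading run of its words. This is a heuristic that is not part of the answer above: it assumes the brand also appears on its own somewhere in the list, and `strip_quantity` and `group_names` are hypothetical helper names.

```python
import re

products = ['CocaCola', 'CocaCola 3 Oz', 'Twix', 'Twix Caramel',
            'Foldgers 3 Oz', 'Foldgers 10 Oz', 'Haagen Dazs Caramel',
            'Black Forest Ham', 'Black Label Whiskey']


def strip_quantity(name):
    # Keep only the text before the first digit, as in the regex above.
    return re.split(r"[0-9]", name, maxsplit=1)[0].strip()


def group_names(names):
    """Map each name to a group key (hypothetical helper, heuristic only)."""
    bases = {strip_quantity(n) for n in names}
    mapping = {}
    for name in names:
        base = strip_quantity(name)
        words = base.split()
        key = base
        # Prefer the shortest leading run of words that is itself a base,
        # e.g. 'Twix Caramel' -> 'Twix' because 'Twix' is in the list.
        for i in range(1, len(words)):
            candidate = " ".join(words[:i])
            if candidate in bases:
                key = candidate
                break
        mapping[name] = key
    return mapping


mapping = group_names(products)
groups = sorted(set(mapping.values()))
print(groups)
```

'Black Forest Ham' and 'Black Label Whiskey' stay separate here only because 'Black' never occurs as a product on its own; a list that did contain a bare 'Black' entry would merge them, so the result should be checked against real data.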
edited Mar 25 at 4:46
answered Mar 25 at 4:37
Alain T.