How would I group a set of similar product names?

















I have a list of product names, some of which are redundant or similar:



List = ['CocaCola','CocaCola 3 Oz','Twix','Twix Caramel','Foldgers 3 Oz','Foldgers 10 Oz','Haagen Dazs Caramel','Black Forest Ham','Black Label Whiskey',...] 


I would like to write a function that groups similar product names and returns a list:



NewList = ['CocaCola','Twix','Foldgers','Haagen Dazs Caramel','Black Forest Ham','Black Label Whiskey',...]


I thought about matching substrings, but that wouldn't work since 'CocaCola 3 Oz' and 'Foldgers 3 Oz' would both map to '3 Oz'.



I also thought about using just the first word of each product name:



NewList = []
for w in List:
    ws = w.split(' ')
    NewList.append(ws[0])


But that would map 'Black Forest Ham' and 'Black Label Whiskey' to 'Black'.



How can I get this mapping? I know of beautifulsoup and thought it might help, but I couldn't find any posts that indicate that.




To clarify, based on BruceWayne's comments:
I'm getting the list from a Pandas df (I don't know why that is relevant).
'CocaCola' and 'Pepsi' would map to different groups, 'CocaCola' and 'Pepsi'. 'Black Forest Ham' and 'Oscar Meyer Ham' would also map to different groups. 'CocaCola' and 'CocaCola Light' would map to the same group, 'CocaCola'. Basically, I'm looking for grouping based on brand names, not product categories. That is what determines similarity.



I already provided an example of what the output would look like based on the input.



I thought beautifulsoup would help because it is a text processing library.







python














edited Mar 25 at 2:57 by Klaus D.

asked Mar 25 at 2:29 by Alex Kinman







  • 4





    What determines whether the products are similar? Twix and Twix Caramel are more obvious, but would you want to group CocaCola and Pepsi together? What about Black Forest Ham and Oscar Meyer Ham? For the samples you included, can you show the expected output? What does beautifulsoup have to do with it? Are you getting this list from somewhere online, or an XML document?

    – BruceWayne
    Mar 25 at 2:34












  • Regarding your edit: how would you identify the brand name in the string?

    – Klaus D.
    Mar 25 at 2:58











  • @KlausD. That's kind of the gist of my question. I don't know all the brand names beforehand, or I would use some sort of lookup table.

    – Alex Kinman
    Mar 25 at 3:03






  • 2





    So, you know the answer already. What you are trying to do is part of natural language processing and a complex topic on its own.

    – Klaus D.
    Mar 25 at 3:07











  • (My apologies, didn't realize the second list was the expected result.)

    – BruceWayne
    Mar 25 at 15:59
























2 Answers













You can achieve this by running a clustering algorithm on your dataset.



from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

a = ['CocaCola','CocaCola 3 Oz','Twix','Twix Caramel','Foldgers 3 Oz','Foldgers 10 Oz','Haagen Dazs Caramel','Black Forest Ham','Black Label Whiskey']

# Convert each product name into a vector of word counts.
cv = CountVectorizer()
vect = cv.fit_transform(a)

# Assign each vector to one of six clusters.
km = KMeans(n_clusters=6)
km.fit_predict(vect)


OUTPUT:



array([0, 0, 1, 1, 2, 2, 4, 3, 5], dtype=int32)


This tells us that:



Cluster 0: 'CocaCola', 'CocaCola 3 Oz'



Cluster 1: 'Twix', 'Twix Caramel'



Cluster 2: 'Foldgers 3 Oz', 'Foldgers 10 Oz'



Cluster 3: 'Black Forest Ham'



Cluster 4: 'Haagen Dazs Caramel'



Cluster 5: 'Black Label Whiskey'
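To get from the label array to the kind of list the question asks for, one option is to collect the names under each label and keep only the leading words that all members of a group share. This is a minimal sketch and not part of the original answer: the labels are copied from the output above rather than recomputed, and `group_by_label` and `common_word_prefix` are hypothetical helper names.

```python
from collections import defaultdict

names = ['CocaCola', 'CocaCola 3 Oz', 'Twix', 'Twix Caramel',
         'Foldgers 3 Oz', 'Foldgers 10 Oz', 'Haagen Dazs Caramel',
         'Black Forest Ham', 'Black Label Whiskey']
# Labels copied from the OUTPUT above; a fresh KMeans run may number them differently.
labels = [0, 0, 1, 1, 2, 2, 4, 3, 5]


def group_by_label(names, labels):
    """Collect the names that share a cluster label."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)


def common_word_prefix(group):
    """Return the leading words shared by every name in the group."""
    prefix = []
    for words in zip(*(name.split() for name in group)):
        if len(set(words)) != 1:
            break
        prefix.append(words[0])
    return ' '.join(prefix)


groups = group_by_label(names, labels)
new_list = [common_word_prefix(groups[label]) for label in sorted(groups)]
print(new_list)
```

With these labels the result is ['CocaCola', 'Twix', 'Foldgers', 'Black Forest Ham', 'Haagen Dazs Caramel', 'Black Label Whiskey']. Comparing whole words rather than characters avoids cutting a name mid-word, but a group whose members share no leading word yields an empty string, so the output is only as good as the clustering.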



You first vectorize your data, i.e., you convert each item in your list into a 1D array of numbers. I am using a CountVectorizer here (it is easy to understand and serves the purpose), but other vectorizers are available too. Each position in the 1D array represents a word, and the value at that position is the number of times that word occurs in the text. This approach is also known as the bag-of-words model.
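To make the vectorization step concrete, here is a standard-library sketch of a bag-of-words count matrix. It is an approximation rather than scikit-learn's own code: it assumes CountVectorizer's default behaviour of lowercasing the text and keeping only tokens of two or more word characters, and `bag_of_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default tokenization:
# tokens of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def bag_of_words(docs):
    """Return (vocabulary, rows), where each row counts vocabulary words in one doc."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


vocab, rows = bag_of_words(['CocaCola 3 Oz', 'Foldgers 10 Oz', 'Twix Caramel'])
print(vocab)
for row in rows:
    print(row)
```

Under this tokenization rule the single-character token '3' is dropped while '10' is kept, so 'CocaCola 3 Oz' is represented only by 'cocacola' and 'oz'. Shared words such as 'oz' or 'black' are what can pull unrelated products toward the same cluster.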



Once again, there are many clustering algorithms to choose from, and I have chosen KMeans clustering for the same reason as before: it is easy to understand and implement.



Note: You need to specify the number of clusters up front, as in km=KMeans(n_clusters=6). Changing this value can change your results. For example, with km=KMeans(n_clusters=5), 'Black Forest Ham' and 'Black Label Whiskey' will be placed in the same cluster.



I hope this helps you.

















answered Mar 25 at 5:26 by Sridhar Murali






































You could use a regular expression to pick up only the part of the name that comes before a number.



import re

products = ['CocaCola','CocaCola 3 Oz','Twix','Twix Caramel','Foldgers 3 Oz','Foldgers 10 Oz','Haagen Dazs Caramel','Black Forest Ham','Black Label Whiskey']
# Keep the text before the first digit; appending "0" guarantees a match for names without digits.
products = list(set(re.findall("(.*?)[0-9]", name + "0")[0].strip() for name in products))
print(products)

# Example output (set order may vary):
# ['Black Label Whiskey', 'CocaCola', 'Black Forest Ham', 'Twix Caramel', 'Twix', 'Haagen Dazs Caramel', 'Foldgers']
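If the original order of the names matters, the same idea can be written so that duplicates are dropped without going through a set. This is only a sketch of a variation on the approach above, not the answerer's code; `brand_part` and `unique_in_order` are hypothetical helper names.

```python
import re

products = ['CocaCola', 'CocaCola 3 Oz', 'Twix', 'Twix Caramel',
            'Foldgers 3 Oz', 'Foldgers 10 Oz', 'Haagen Dazs Caramel',
            'Black Forest Ham', 'Black Label Whiskey']


def brand_part(name):
    """Return the text before the first digit, or the whole name if it has none."""
    return re.split(r"\d", name, maxsplit=1)[0].strip()


def unique_in_order(items):
    """Drop duplicates while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(items))


result = unique_in_order(brand_part(name) for name in products)
print(result)
```

This yields ['CocaCola', 'Twix', 'Twix Caramel', 'Foldgers', 'Haagen Dazs Caramel', 'Black Forest Ham', 'Black Label Whiskey']. As with the original one-liner, 'Twix Caramel' stays separate from 'Twix' because only a numeric size suffix is stripped.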














edited Mar 25 at 4:46

answered Mar 25 at 4:37 by Alain T.


























