tokenize string based on self-defined dictionary


I would like to tokenize a list of strings according to my self-defined dictionary.



The list of strings looks like this:



lst = ['vitamin c juice', 'organic supplement'] 


The self-defined dictionary:



dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}


My expected result:



vitamin c juice --> [(3,1), (1,1)]
organic supplement --> [(0,1), (2,1)]



My current code:



import gensim
import gensim.corpora as corpora
from gensim.utils import tokenize
dct = corpora.Dictionary([list(x) for x in tup_list])
corpus = [dct.doc2bow(text) for text in [s for s in lst]]


The error message I got is: TypeError: doc2bow expects an array of unicode tokens on input, not a single string. However, I do not want to simply tokenize "vitamin c" as vitamin and c. Instead, I want to tokenize based on the words in my existing dct; that is to say, "vitamin c" should remain a single token.
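(For reference, a minimal sketch of the call doc2bow expects, assuming gensim is installed: it operates on an already tokenized document, i.e. a list of token strings, so a multi-word entry such as "vitamin c" has to arrive as a single element of that list.)

from gensim import corpora

# one single-token document per dictionary entry, in the order of dct
gdct = corpora.Dictionary([['organic'], ['juice'], ['supplement'], ['vitamin c']])
print(gdct.doc2bow(['vitamin c', 'juice']))  # [(1, 1), (3, 1)] with this insertion order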










python nlp nltk tokenize gensim






asked Mar 22 at 16:44 by Abbey




2 Answers






I can only think of very inefficient ways to implement a tokenizer that recognizes substrings containing whitespace. However, if you do not insist on the whitespace, here is an easy way, changing vitamin c to vitamin_c:




lst = ['vitamin_c juice', 'organic supplement']
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin_c'}

word2index = {key: val for val, key in dct.items()}  # invert the mapping: word -> index

tokenized = [[word2index[word] for word in text.split()] for text in lst]


          If you do not insist on your predefined mapping dct, you could also create it with:



vocab = set([word for text in lst for word in text.split()])
word2index = {word: ind for ind, word in enumerate(sorted(vocab))}
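
To connect this back to the strings in the question (which still contain the space) and to the expected (index, count) pairs, one possible preprocessing sketch, not part of the answer above, is to rewrite the multi-word entries before splitting and then count with collections.Counter:

from collections import Counter

lst = ['vitamin c juice', 'organic supplement']
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}

# rewrite multi-word dictionary entries ("vitamin c") as single tokens ("vitamin_c")
def underscore(text, phrases):
    for p in phrases:
        text = text.replace(p, p.replace(' ', '_'))
    return text

phrases = [w for w in dct.values() if ' ' in w]
word2index = {w.replace(' ', '_'): i for i, w in dct.items()}

corpus = [sorted(Counter(word2index[w] for w in underscore(s, phrases).split()).items())
          for s in lst]
print(corpus)  # [[(1, 1), (3, 1)], [(0, 1), (2, 1)]]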





answered Mar 22 at 17:42 by Simon
You will first have to invert your dictionary so that the keywords become the keys. Then you can use a regular expression to break the list's entries down into keywords, and finally look each keyword up in the inverted dictionary to find the corresponding token index.

For example:



lst = ['vitamin c juice', 'organic supplement']
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}

import re
from collections import Counter
keywords = {keyword: token for token, keyword in dct.items()}                # inverted dictionary
sortedKw = sorted(keywords, key=lambda x: -len(x))                           # keywords in reverse order of size
pattern = re.compile("|".join(sortedKw))                                     # regular expression
lstKeywords = [pattern.findall(item) for item in lst]                        # list items --> keywords
tokenGroups = [[keywords[word] for word in words] for words in lstKeywords]  # keyword lists to lists of indexes
result = [list(Counter(tokens).items()) for tokens in tokenGroups]           # lists of token indexes to (token, count)
print(result)  # [[(3, 1), (1, 1)], [(0, 1), (2, 1)]]


            The regular expression takes the form: keyword1|keyword2|keyword3



            Because the "|" operator in regular expressions is never greedy, longer keywords must appear first in the list. This is the reason for sorting them before building the expression.



            After that it is merely a matter of converting list items to keyword lists (re.findall() does that) and then using the inverted dictionary to turn each keyword into a token index.



[UPDATE] To count the number of token occurrences, each list of keywords, converted to a list of token indexes, is loaded into a Counter object (from the collections module), which performs the counting in a specialized dictionary.
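
As a quick check of the counting step (hypothetical input, not from the question), a repeated keyword yields a count greater than one:

from collections import Counter

token_indexes = [3, 1, 1]  # e.g. tokens for "vitamin c juice juice"
print(list(Counter(token_indexes).items()))  # [(3, 1), (1, 2)]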






edited Mar 25 at 21:32
answered Mar 22 at 17:59 by Alain T.
• Yes, I would like to count how many times the token appears. – Abbey Mar 25 at 21:01

• Updated answer to make it count the number of occurrences of each token index. – Alain T. Mar 25 at 21:25










