tokenize string based on self-defined dictionary
I would like to tokenize a list of strings according to my self-defined dictionary.
The list of strings looks like this:
lst = ['vitamin c juice', 'organic supplement']
The self-defined dictionary:
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}
My expected result:
vitamin c juice --> [(3,1), (1,1)]
organic supplement --> [(0,1), (2,1)]
My current code:
import gensim
import gensim.corpora as corpora
from gensim.utils import tokenize
dct = corpora.Dictionary([list(x) for x in tup_list])
corpus = [dct.doc2bow(text) for text in [s for s in lst]]
The error message I got is TypeError: doc2bow expects an array of unicode tokens on input, not a single string
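For reference, the error arises because doc2bow works on a list of tokens rather than on a raw string. A minimal sketch of the difference (the Dictionary here is built from throwaway token lists, purely for illustration):

from gensim import corpora

# A Dictionary is built from lists of tokens, not from raw strings
demo_dct = corpora.Dictionary([['vitamin', 'c', 'juice'], ['organic', 'supplement']])

demo_dct.doc2bow(['vitamin', 'c', 'juice'])  # OK: a token list -> [(id, 1), ...]
# demo_dct.doc2bow('vitamin c juice')        # TypeError: doc2bow expects an array of unicode tokens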
However, I do not want to simply tokenize "vitamin c" as vitamin and c. Instead, I want to tokenize based on the words in my existing dct. That is to say, it should be vitamin c.
python nlp nltk tokenize gensim
asked Mar 22 at 16:44 by Abbey
2 Answers
I can only think of very inefficient ways to implement a tokenizer that recognizes substrings containing whitespace. However, if you do not insist on the whitespace, here is an easy way: change vitamin c to vitamin_c:
lst = ['vitamin_c juice', 'organic supplement']
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin_c'}
word2index = {key: val for val, key in dct.items()}
tokenized = [[word2index[word] for word in text.split()] for text in lst]
If you do not insist on your predefined mapping dct, you could also create it with:
vocab = set([word for text in lst for word in text.split()])
word2index = {word: ind for ind, word in enumerate(sorted(vocab))}
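As a follow-up sketch (an addition, assuming the vitamin_c renaming and the tokenized list from the first snippet above), the index lists can be turned into (id, count) pairs like the expected output with collections.Counter:

from collections import Counter

# Turn each list of token indexes into (index, count) pairs;
# Counter preserves first-seen order (Python 3.7+), so this matches the expected result
corpus = [list(Counter(doc).items()) for doc in tokenized]
print(corpus)  # [[(3, 1), (1, 1)], [(0, 1), (2, 1)]]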
answered Mar 22 at 17:42 by Simon
You will first have to reverse your dictionary so that the keywords become the keys. Then you can use regular expressions to break down the list's entries into keywords. Finally, use the keywords against the reversed dictionary to find the corresponding token indexes.
for example:
lst = ['vitamin c juice', 'organic supplement']
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}
import re
from collections import Counter
keywords = {keyword: token for token, keyword in dct.items()} # inverted dictionary
sortedKw = sorted(keywords,key=lambda x:-len(x)) # keywords in reverse order of size
pattern = re.compile( "|".join(sortedKw) ) # regular expression
lstKeywords = [ pattern.findall(item) for item in lst ] # list items --> keywords
tokenGroups = [ [keywords[word] for word in words] for words in lstKeywords ] # keyword lists to lists of indexes
result = [ list(Counter(token).items()) for token in tokenGroups ] # lists of token indexes to (token,count)
print(result) # [[(3, 1), (1, 1)], [(0, 1), (2, 1)]]
The regular expression takes the form: keyword1|keyword2|keyword3
Because the "|" operator in regular expressions is never greedy, longer keywords must appear first in the list. This is the reason for sorting them before building the expression.
After that it is merely a matter of converting list items to keyword lists (re.findall() does that) and then using the inverted dictionary to turn each keyword into a token index.
[UPDATE] In order to count the number of token occurrences, the list of keywords, converted to a list of token indexes, is loaded into a Counter object (from the collections module), which performs the counting in a specialized dictionary.
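One possible refinement, under additional assumptions not covered by the answer above: if any keyword contains regex metacharacters such as + or (, the pattern should be built from escaped keywords, and word boundaries help avoid matching inside longer words:

import re

# Escape each keyword and anchor the whole alternation at word boundaries
sortedKw = sorted(keywords, key=lambda x: -len(x))
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, sortedKw)) + r")\b")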
answered Mar 22 at 17:59, edited Mar 25 at 21:32, by Alain T.

Yes, I would like to count how many times the token appears. – Abbey, Mar 25 at 21:01

Updated answer to make it count the number of occurrences of each token index. – Alain T., Mar 25 at 21:25