tokenize string based on self-defined dictionary


I would like to tokenize a list of strings according to my self-defined dictionary.



The list of strings looks like this:



lst = ['vitamin c juice', 'organic supplement'] 


The self-defined dictionary:



dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}


My expected result:



vitamin c juice --> [(3,1), (1,1)]
organic supplement --> [(0,1), (2,1)]



My current code:



import gensim
import gensim.corpora as corpora
from gensim.utils import tokenize
dct = corpora.Dictionary([list(x) for x in tup_list])
corpus = [dct.doc2bow(text) for text in [s for s in lst]]


The error message I got is: TypeError: doc2bow expects an array of unicode tokens on input, not a single string. However, I do not want to simply tokenize "vitamin c" as vitamin and c. Instead, I want to tokenize based on the words in my existing dct; that is to say, "vitamin c" should remain a single token.
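(For reference, a minimal sketch of the call doc2bow expects, assuming gensim is installed: it operates on an already tokenized document, i.e. a list of token strings, so a multi-word entry such as "vitamin c" has to arrive as a single element of that list.)

from gensim import corpora

# one single-token document per dictionary entry, in the order of dct
gdct = corpora.Dictionary([['organic'], ['juice'], ['supplement'], ['vitamin c']])
print(gdct.doc2bow(['vitamin c', 'juice']))  # [(1, 1), (3, 1)] with this insertion order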










python nlp nltk tokenize gensim






asked Mar 22 at 16:44 by Abbey




2 Answers






I can only think of very inefficient ways to implement a tokenizer that recognizes substrings containing whitespace. However, if you do not insist on the whitespace, here is an easy way, changing vitamin c to vitamin_c:




lst = ['vitamin_c juice', 'organic supplement']
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin_c'}

word2index = {key: val for val, key in dct.items()}  # invert the mapping: word -> index

tokenized = [[word2index[word] for word in text.split()] for text in lst]


          If you do not insist on your predefined mapping dct, you could also create it with:



vocab = set([word for text in lst for word in text.split()])
word2index = {word: ind for ind, word in enumerate(sorted(vocab))}
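
To connect this back to the strings in the question (which still contain the space) and to the expected (index, count) pairs, one possible preprocessing sketch, not part of the answer above, is to rewrite the multi-word entries before splitting and then count with collections.Counter:

from collections import Counter

lst = ['vitamin c juice', 'organic supplement']
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}

# rewrite multi-word dictionary entries ("vitamin c") as single tokens ("vitamin_c")
def underscore(text, phrases):
    for p in phrases:
        text = text.replace(p, p.replace(' ', '_'))
    return text

phrases = [w for w in dct.values() if ' ' in w]
word2index = {w.replace(' ', '_'): i for i, w in dct.items()}

corpus = [sorted(Counter(word2index[w] for w in underscore(s, phrases).split()).items())
          for s in lst]
print(corpus)  # [[(1, 1), (3, 1)], [(0, 1), (2, 1)]]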





answered Mar 22 at 17:42 by Simon
You will first have to invert your dictionary so that the keywords become the keys. Then you can use a regular expression to break the list's entries down into keywords, and finally look each keyword up in the inverted dictionary to find the corresponding token index.

For example:



lst = ['vitamin c juice', 'organic supplement']
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}

import re
from collections import Counter
keywords = {keyword: token for token, keyword in dct.items()}                # inverted dictionary
sortedKw = sorted(keywords, key=lambda x: -len(x))                           # keywords in reverse order of size
pattern = re.compile("|".join(sortedKw))                                     # regular expression
lstKeywords = [pattern.findall(item) for item in lst]                        # list items --> keywords
tokenGroups = [[keywords[word] for word in words] for words in lstKeywords]  # keyword lists to lists of indexes
result = [list(Counter(tokens).items()) for tokens in tokenGroups]           # lists of token indexes to (token, count)
print(result)  # [[(3, 1), (1, 1)], [(0, 1), (2, 1)]]


            The regular expression takes the form: keyword1|keyword2|keyword3



            Because the "|" operator in regular expressions is never greedy, longer keywords must appear first in the list. This is the reason for sorting them before building the expression.



            After that it is merely a matter of converting list items to keyword lists (re.findall() does that) and then using the inverted dictionary to turn each keyword into a token index.



[UPDATE] To count the number of token occurrences, each list of keywords, converted to a list of token indexes, is loaded into a Counter object (from the collections module), which performs the counting in a specialized dictionary.
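
As a quick check of the counting step (hypothetical input, not from the question), a repeated keyword yields a count greater than one:

from collections import Counter

token_indexes = [3, 1, 1]  # e.g. tokens for "vitamin c juice juice"
print(list(Counter(token_indexes).items()))  # [(3, 1), (1, 2)]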






edited Mar 25 at 21:32
answered Mar 22 at 17:59 by Alain T.
• Yes, I would like to count how many times the token appears. – Abbey Mar 25 at 21:01

• Updated answer to make it count the number of occurrences of each token index. – Alain T. Mar 25 at 21:25










