How does spaCy keep track of character and token offsets during tokenization?
In spaCy, there's a Span object that keeps the start and end offsets of a token/span: https://spacy.io/api/span#init
There's a _recalculate_indices method that seems to retrieve token_by_start and token_by_end, but that looks like all the recalculation does.
When looking at extraneous spaces, it does some smart alignment of the spans.
Does it recalculate after every regex execution? Does it keep track of each character's movement? Does it do a span search after all the regexes have executed?
python algorithm nlp cython spacy
asked Mar 26 at 1:43
alvas
49.2k reputation · 65 gold badges · 267 silver badges · 488 bronze badges
1 Answer
Summary:
During tokenization, this is the part that keeps track of offsets and characters.
Simple answer: it goes through the string character by character.
TL;DR is at the bottom.
Explained chunk by chunk:
The tokenizer takes in the string to be tokenized and iterates through it one character (letter or space) at a time. It is a simple for loop over the string, where uc is the current character:
for uc in string:
It first checks whether the current character is a space, and compares that against the last in_ws setting. If they are the same (both space or both non-space), it skips ahead and increments i += 1.
in_ws tells the loop whether it is currently inside a run of whitespace, i.e. whether it should process a span yet. The tokenizer wants to act on space runs as well as character runs, so it can't just track isspace() and operate only on False. Instead, in_ws is initialized to string[0].isspace() and then compared against each subsequent character. If string[0] is a space, the comparison evaluates the same, so the loop skips down, increments i (discussed later), and moves on to the next uc, until it reaches a uc that differs from the first one. In practice, this lets it skip through multiple consecutive spaces after handling the first one, or through multiple characters until it reaches the next space boundary.
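The boundary test described above can be sketched in plain Python (a simplified illustration with a function name of my choosing, not spaCy's actual Cython code):

```python
def find_boundaries(string):
    """Yield the indices where the text flips between whitespace and
    non-whitespace, mirroring the `uc.isspace() != in_ws` test."""
    if not string:
        return
    in_ws = string[0].isspace()  # initialized from the first character
    for i, uc in enumerate(string):
        if uc.isspace() != in_ws:  # a run of spaces/characters just ended
            yield i
            in_ws = not in_ws

print(list(find_boundaries("Hello  world")))  # [5, 7]
```

Note that for "Hello  world" the flip happens once at the first space (index 5) and once at the first letter after the run of spaces (index 7); the second space produces no boundary, which is exactly how the loop skips through consecutive spaces.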
if uc.isspace() != in_ws:
It continues through characters until it reaches the next boundary, keeping the index of the current character as i.
It tracks two index values: start and i. start is the beginning of the potential token currently being built, and i is the index of the character currently being examined. When the loop begins, start is 0. After one cycle, start is the index of the last space plus 1, which makes it the first letter of the current word.
It first checks whether start is less than i, which tells it whether there is a pending character sequence to check against the cache and tokenize. This will make sense further down.
if start < i:
span is the word currently being considered for tokenization: the string sliced from index start up to (but not including) index i.
span = string[start:i]
It then hashes the word (start through i) and checks the cache dictionary to see whether that word has already been processed. If it has not, it calls the _tokenize method on that portion of the string.
key = hash_string(span)
cache_hit = self._try_cache(key, doc)
if not cache_hit:
self._tokenize(doc, span, key)
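The cache lookup can be sketched as an ordinary dict keyed by the span's hash (a simplified stand-in: hash() replaces spaCy's hash_string, tokenize_span is a hypothetical placeholder, and the real _try_cache stores pre-built token structs on the Doc rather than returning a list):

```python
cache = {}

def tokenize_span(span):
    """Hypothetical expensive tokenization step (stand-in for _tokenize)."""
    return [span]  # placeholder: the real code splits off prefixes/suffixes, etc.

def try_cache(span):
    key = hash(span)           # stand-in for spaCy's hash_string()
    tokens = cache.get(key)
    if tokens is None:         # cache miss: tokenize and remember the result
        tokens = tokenize_span(span)
        cache[key] = tokens
    return tokens
```

The point of the cache is that a word like "the" appears many times in a document, but its tokenization only has to be computed once; every later occurrence is a dictionary lookup.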
Next it checks whether the current character uc is exactly a space. If it is, it resets start to i + 1, where i is the index of the current character.
if uc == ' ':
doc.c[doc.length - 1].spacy = True
start = i + 1
If the character is not a space, it sets start to the current character's index. It then flips in_ws, indicating the loop is now inside a run of characters.
else:
start = i
in_ws = not in_ws
Then it increments i += 1 and loops to the next character.
i += 1
TL;DR
So, all of that said: it keeps track of its position in the string using i, and of the start of the current word using start. start is reset at the end of processing a word, and after any spaces it is set to the last space plus one (the start of the next word).
answered Apr 26 at 21:10
MyNameIsCaleb
1,518 reputation · 1 gold badge · 2 silver badges · 20 bronze badges