How does spaCy keep track of character and token offsets during tokenization?
In spaCy, there's a Span object that keeps the start and end offsets of a token/span: https://spacy.io/api/span#init
There's a _recalculate_indices method that seems to retrieve token_by_start and token_by_end, but that looks like all the recalculation does.
When looking at extraneous spaces, it does some smart alignment of the spans.
Does it recalculate after every regex execution? Does it keep track of each character's movement? Does it do a span search after all the regexes have executed?
python algorithm nlp cython spacy
asked Mar 26 at 1:43
alvas
49.2k reputation · 65 gold badges · 267 silver badges · 488 bronze badges
1 Answer
Summary:
During tokenization, this is the part that keeps track of offsets and characters.
Simple answer: it goes through the string character by character.
TL;DR is at the bottom.
Explained chunk by chunk:
The tokenizer takes in the string to be tokenized and iterates through it one character (letter or space) at a time. It is a simple for loop over the string, where uc is the current character:
for uc in string:
It first checks whether the current character is a space, and compares that against the last in_ws setting. If they are the same (both space or both non-space), it skips ahead and increments i += 1.
in_ws tells the loop whether it is currently inside a run of whitespace, i.e. whether it should process a span yet. The tokenizer wants to act on space runs as well as character runs, so it can't just track isspace() and operate only on False. Instead, in_ws is initialized to string[0].isspace() and then compared against each subsequent character. If string[0] is a space, the comparison evaluates the same, so the loop skips down, increments i (discussed later), and moves on to the next uc, until it reaches a uc that differs from the first one. In practice, this lets it skip through multiple consecutive spaces after handling the first one, or through multiple characters until it reaches the next space boundary.
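The boundary test described above can be sketched in plain Python (a simplified illustration with a function name of my choosing, not spaCy's actual Cython code):

```python
def find_boundaries(string):
    """Yield the indices where the text flips between whitespace and
    non-whitespace, mirroring the `uc.isspace() != in_ws` test."""
    if not string:
        return
    in_ws = string[0].isspace()  # initialized from the first character
    for i, uc in enumerate(string):
        if uc.isspace() != in_ws:  # a run of spaces/characters just ended
            yield i
            in_ws = not in_ws

print(list(find_boundaries("Hello  world")))  # [5, 7]
```

Note that for "Hello  world" the flip happens once at the first space (index 5) and once at the first letter after the run of spaces (index 7); the second space produces no boundary, which is exactly how the loop skips through consecutive spaces.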
if uc.isspace() != in_ws:
It continues through characters until it reaches the next boundary, keeping the index of the current character as i.
It tracks two index values: start and i. start is the beginning of the potential token currently being built, and i is the index of the character currently being examined. When the loop begins, start is 0. After one cycle, start is the index of the last space plus 1, which makes it the first letter of the current word.
It first checks whether start is less than i, which tells it whether there is a pending character sequence to check against the cache and tokenize. This will make sense further down.
if start < i:
span is the word currently being considered for tokenization: the string sliced from index start up to (but not including) index i.
span = string[start:i]
It then hashes the word (start through i) and checks the cache dictionary to see whether that word has already been processed. If it has not, it calls the _tokenize method on that portion of the string.
key = hash_string(span)
cache_hit = self._try_cache(key, doc)
if not cache_hit:
self._tokenize(doc, span, key)
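The cache lookup can be sketched as an ordinary dict keyed by the span's hash (a simplified stand-in: hash() replaces spaCy's hash_string, tokenize_span is a hypothetical placeholder, and the real _try_cache stores pre-built token structs on the Doc rather than returning a list):

```python
cache = {}

def tokenize_span(span):
    """Hypothetical expensive tokenization step (stand-in for _tokenize)."""
    return [span]  # placeholder: the real code splits off prefixes/suffixes, etc.

def try_cache(span):
    key = hash(span)           # stand-in for spaCy's hash_string()
    tokens = cache.get(key)
    if tokens is None:         # cache miss: tokenize and remember the result
        tokens = tokenize_span(span)
        cache[key] = tokens
    return tokens
```

The point of the cache is that a word like "the" appears many times in a document, but its tokenization only has to be computed once; every later occurrence is a dictionary lookup.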
Next it checks whether the current character uc is exactly a space. If it is, it resets start to i + 1, where i is the index of the current character.
if uc == ' ':
doc.c[doc.length - 1].spacy = True
start = i + 1
If the character is not a space, it sets start to the current character's index. It then flips in_ws, indicating the loop is now inside a run of characters.
else:
start = i
in_ws = not in_ws
Then it increments i += 1 and loops to the next character.
i += 1
TL;DR
So, all of that said: it keeps track of its position in the string using i, and of the start of the current word using start. start is reset at the end of processing a word, and after any spaces it is set to the last space plus one (the start of the next word).
answered Apr 26 at 21:10
MyNameIsCaleb
1,518 reputation · 1 gold badge · 2 silver badges · 20 bronze badges