

How does spaCy keep track of character and token offsets during tokenization?


In spaCy, there is a Span object that keeps the start and end offsets of a token/span: https://spacy.io/api/span#init



There is a _recalculate_indices method that seems to retrieve token_by_start and token_by_end, but that looks like all the recalculation it does.



When it encounters extraneous spaces, it does some smart alignment of the spans.



Does it recalculate after every regex execution? Does it keep track of each character's movement? Or does it do a single span search after all the regexes have run?
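
For context, this is the offset information that ends up on the Doc and Span objects (a minimal illustration using a blank English pipeline, so no model download is needed):

    import spacy

    nlp = spacy.blank("en")
    doc = nlp("Hello   world!")

    for token in doc:
        # token.idx is the token's character offset into the original string
        print(token.i, repr(token.text), token.idx)

    span = doc[0:2]
    # span.start/span.end are token indices;
    # span.start_char/span.end_char are character offsets
    print(span.start, span.end, span.start_char, span.end_char)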










python algorithm nlp cython spacy

asked Mar 26 at 1:43 by alvas

1 Answer

Summary:

During tokenization, the main loop of spaCy's Tokenizer (in tokenizer.pyx) is the part that keeps track of offsets and characters.

Simple answer: it goes character by character through the string.

TL;DR is at the bottom.

Explained chunk by chunk:

It takes in the string to be tokenized and iterates through it one character at a time, letters and spaces alike.

It is a simple for loop over the string, where uc is the current character:

    for uc in string:

It first checks whether the current character is a space and compares that against the current in_ws flag. If the two agree (no whitespace/non-whitespace boundary has been crossed), it skips the block below and just increments i += 1.

in_ws is used to know whether the pending span should be processed. The tokenizer wants to act on runs of spaces as well as runs of characters, so it can't just track isspace() and operate only on False. Instead, in_ws starts out as string[0].isspace() and is compared against each character. If string[0] is a space, the comparison keeps coming out equal, so the loop just increments i (discussed later) and moves on to the next uc until it reaches a uc that differs from the first one. In practice this lets it run through multiple spaces after having handled the first space, or through multiple characters until it reaches the next space boundary.

    if uc.isspace() != in_ws:
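
To see when that test fires, here is the boundary predicate traced on its own (plain Python for illustration, not the spaCy source):

    s = " hi  x"
    in_ws = s[0].isspace()          # True: the string starts inside whitespace
    for i, uc in enumerate(s):
        print(i, repr(uc), uc.isspace() != in_ws)
        if uc.isspace() != in_ws:   # boundary crossed: flip the flag
            in_ws = not in_ws
    # The test is True at i=1 (' '->'h'), i=3 ('i'->' ') and i=5 (' '->'x')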


It will continue through characters until it reaches the next boundary, keeping the index of the current character as i.

It tracks two index values: start and i. start is the start of the potential token it is on, and i is the ending character it is looking at. When the loop starts, start will be 0. After a cycle of this, start will be the index of the last space plus 1, which makes it the first letter of the current word.

It first checks whether start is less than i, which tells it whether there is a pending character sequence to look up in the cache and tokenize. This will make sense further down.

    if start < i:

span is the word currently being considered for tokenization. It is the string sliced from index start up to (but not including) index i.

    span = string[start:i]

It then takes the hash of that word and checks the cache dictionary to see whether the word has been processed already. If it has not, it calls the _tokenize method on that portion of the string.

    key = hash_string(span)
    cache_hit = self._try_cache(key, doc)
    if not cache_hit:
        self._tokenize(doc, span, key)
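
The effect is that a given word is only split once per tokenizer; later occurrences are a dictionary lookup. A rough pure-Python analogue of that pattern (Python's built-in hash standing in for spaCy's hash_string, and a naive split standing in for the real _tokenize):

    cache = {}

    def tokenize_cached(span):
        # key = hash_string(span); cache_hit = self._try_cache(key, doc)
        key = hash(span)
        if key not in cache:                 # cache miss: split once, remember it
            cache[key] = span.split("-")     # stand-in for self._tokenize(...)
        return cache[key]

    print(tokenize_cached("mother-in-law"))  # computed and cached
    print(tokenize_cached("mother-in-law"))  # served from the cache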


Next it checks whether the current character uc is literally ' ' (a plain space). If it is, it marks the previous token as being followed by a space (the spacy flag on the token struct) and resets start to i + 1, where i is the index of the current character.

    if uc == ' ':
        doc.c[doc.length - 1].spacy = True
        start = i + 1

If the boundary character is not a space, it sets start to the current character's index. Either way it then flips in_ws, since crossing a boundary means switching between whitespace and non-whitespace.

    else:
        start = i
    in_ws = not in_ws

Finally, it increments i and loops to the next character; this increment happens on every iteration, whether or not a boundary was crossed.

    i += 1

TL;DR

So, all of that said: it tracks the character position in the string with i and the start of the current word with start. start is reset at the end of processing a word: to the current index when a new word begins immediately, or to the last space plus one (the start of the next word) when a space was seen.
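
Putting the chunks back together, here is a minimal pure-Python sketch of the same loop structure (the real implementation is Cython and uses the hash-based cache and _tokenize shown above; the plain dict and the (offset, text) pairs below are stand-ins for illustration):

    def whitespace_loop(string):
        # Sketch of the tokenizer loop: track start (token start) and
        # i (current character), emitting (start_char, text) pairs.
        tokens = []                            # stand-in for the Doc being built
        cache = {}                             # stand-in for the hash-keyed cache
        if not string:
            return tokens
        i = 0
        start = 0
        in_ws = string[0].isspace()
        for uc in string:
            if uc.isspace() != in_ws:          # crossed a ws/non-ws boundary
                if start < i:                  # there is a pending span
                    span = string[start:i]
                    if span not in cache:      # real code: hash_string + _try_cache
                        cache[span] = span     # real code: self._tokenize(...)
                    tokens.append((start, span))
                if uc == ' ':                  # single space: record it, skip it
                    start = i + 1
                else:
                    start = i
                in_ws = not in_ws
            i += 1
        if start < i:                          # flush the final span
            tokens.append((start, string[start:i]))
        return tokens

    print(whitespace_loop("Hello   world!"))
    # [(0, 'Hello'), (6, '  '), (8, 'world!')]
    # The run of extra spaces becomes its own "token", while the single
    # space after 'Hello' is only recorded as a trailing-space flag
    # (here: simply skipped), matching the behaviour described above.

The real code does a similar flush after the loop for the final word, since the last token has no trailing boundary inside the string.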






answered Apr 26 at 21:10 by MyNameIsCaleb