



How to solve Memory Error with Spacy PhraseMatcher?


High-Level Background



I'm working on a project whose first step is to search for keywords and phrases in a large text corpus. I want to identify the passages/sentences where these keywords occur. Later I want to make these passages accessible through my local Postgres DB so the user can query the information. The data is stored on Azure Blob Storage, and I'm using a Minio server to connect it to my Django application.



Actual Problem



At first my shell was simply killed; after some trial-and-error refactoring/debugging I got a memory error when running my script, which:



  1. samples 30 random text documents from the blob storage (I want to sample 10,000, but it breaks already at low numbers),

  2. preprocesses the raw text for the NLP task,

  3. streams the texts through spaCy's nlp.pipe to get a list of Docs, and

  4. streams the list of Docs to the PhraseMatcher (whose on_match callback appends the rule_id, the start token of the matched sentence, the sentence itself, and the hash_id to a match_list).

When the shell was killed, I looked into the log files and saw that it was a memory error; to be honest, I'm quite new to this topic.



After rearranging the code, I got a MemoryError directly in the shell, within the language.pipe() step that streams the texts through spaCy.



Code extracts



Functions



# Function that samples filing documents
def random_Filings(amount):
    ...
    return random_list

# Function that connects to storage and returns cleaned text
def get_clean_text(random_list):
    try:
        text_contents = S3Client().get_buffer(remote_path)
    ...
    return clean_list

# Callback that runs on each match of the PhraseMatcher
def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    rule_id = nlp.vocab.strings[match_id]  # look up the string rule ID
    token = doc[start]
    sent_of_token = token.sent             # sentence containing the match
    match_list.append([str(rule_id), sent_of_token.start, sent_of_token,
                       doc.user_data])

def match_text_stream(clean_texts):
    some_pattern = [nlp(text) for text in ('foo', 'bar')]
    some_other_pattern = [nlp(text) for text in ('foo bar', 'barara')]

    matcher = PhraseMatcher(nlp.vocab)

    matcher.add('SOME', on_match, *some_pattern)
    matcher.add('OTHER', on_match, *some_other_pattern)

    doc_list = []

    for doc in nlp.pipe(clean_texts, batch_size=30):
        doc_list.append(doc)

    for doc in matcher.pipe(doc_list, batch_size=30):
        pass
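
For comparison, the same pipeline can be written as a single lazy chain so that only one batch of Docs is held in memory at a time. This is a minimal sketch against the spaCy 2.x API used above, with the same placeholder patterns, not the code that produced the error:

# Sketch (spaCy 2.x API): chain nlp.pipe and matcher.pipe as generators so
# that no list holding every parsed Doc is ever materialized.
import en_core_web_sm
from spacy.matcher import PhraseMatcher

nlp = en_core_web_sm.load()
match_list = []

def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    match_list.append((nlp.vocab.strings[match_id], doc[start].sent))

matcher = PhraseMatcher(nlp.vocab)
matcher.add('SOME', on_match, *[nlp(text) for text in ('foo', 'bar')])

def match_text_stream_lazy(clean_texts):
    docs = nlp.pipe(clean_texts, batch_size=30)    # lazy generator, not a list
    for doc in matcher.pipe(docs, batch_size=30):  # on_match fires as docs stream by
        pass                                       # matches accumulate in match_list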


Problem steps:



match_list = []

nlp = en_core_web_sm.load()
sample_list = random_Filings(30)
clean_texts = get_clean_text(sample_list)
match_text_stream(clean_texts)

print(match_list)



Error Message



MemoryError
<string> in match_text_stream(clean_texts)

../spacy/language.py in pipe(self, texts, as_tuples, n_threads, batch_size, disable, cleanup, component_cfg)

    709     original_strings_data = None
    710     nr_seen = 0
    711     for doc in docs:
    712         yield doc
    713         if cleanup:

...

MemoryError

../thinc/neural/_classes/convolution.py in begin_update(self, X__bi, drop)

     31
     32     def begin_update(self, X__bi, drop=0.0):
     33         X__bo = self.ops.seq2col(X__bi, self.nW)
     34         finish_update = self._get_finish_update()
     35         return X__bo, finish_update

ops.pyx in thinc.neural.ops.NumpyOps.seq2col()
ops.pyx in thinc.neural.ops.NumpyOps.allocate()

MemoryError:










python-3.x azure-storage-blobs spacy match-phrase

asked Mar 28 at 17:33 by wiesbadener, edited Mar 29 at 11:38


1 Answer

The solution is to cut your documents up into smaller pieces before training. Paragraph units work quite well, or possibly sections.






answered Mar 29 at 7:05 by Chandan Gupta
Comments:

• At the moment I'm not training but using the PhraseMatcher to identify these paragraphs. – wiesbadener, Mar 29 at 11:09

• Thanks for the hint. I noticed that indeed a few documents were very large and my VM had too little RAM to handle them... – wiesbadener, Mar 29 at 15:58
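
For illustration, here is a minimal sketch of the splitting step the answer suggests, assuming plain-text documents whose paragraphs are separated by blank lines; the split rule and the chunk-size guard are assumptions, not part of the original answer:

# Sketch: split each raw document into paragraph-sized chunks before
# handing them to nlp.pipe, so no single text is too large to parse.
# ASSUMPTION: paragraphs are separated by blank lines; adapt the regex
# to the actual filing format.
import re

def split_into_paragraphs(raw_text, max_chars=100_000):
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', raw_text) if p.strip()]
    # Guard against pathological "paragraphs" that are still huge.
    return [p[i:i + max_chars]
            for p in paragraphs
            for i in range(0, len(p), max_chars)]

# Usage with the pipeline from the question:
# chunks = [c for text in clean_texts for c in split_into_paragraphs(text)]
# match_text_stream(chunks)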












