



How to solve Memory Error with Spacy PhraseMatcher?


High-Level Background



I'm working on a project whose first step is to search for keywords and phrases in a large text corpus. I want to identify the passages/sentences where these keywords occur. Later I want to make these passages accessible through my local Postgres DB so the user can query the information. The data is stored on Azure Blob Storage, and I'm using a Minio server to connect it to my Django application.



Actual Problem



At first my shell was simply killed; after some trial-and-error refactoring/debugging I got a memory error when running my script, which:



  1. samples 30 random text documents from the blob storage (I want to sample 10,000, but it breaks already at low numbers),

  2. preprocesses the raw text for the NLP task,

  3. streams the texts through spaCy's nlp.pipe to get a list of Docs, and

  4. streams the list of Docs to the PhraseMatcher (whose on_match callback appends the rule_id, the start token of the matched sentence, the sentence itself, and the hash_id to a match_list).

When the shell was killed, I looked into the log files and saw that it was a memory error; to be honest, I'm quite new to this topic.



After rearranging the code, I got a MemoryError directly in the shell, within the language.pipe() step that streams the texts through spaCy.



Code extracts



Functions



# Function that samples filing documents
def random_Filings(amount):
    ...
    return random_list

# Function that connects to storage and returns cleaned text
def get_clean_text(random_list):
    try:
        text_contents = S3Client().get_buffer(remote_path)
    ...
    return clean_list

# Callback that runs on each match of the PhraseMatcher
def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    rule_id = nlp.vocab.strings[match_id]  # look up the string rule ID
    token = doc[start]
    sent_of_token = token.sent             # sentence containing the match
    match_list.append([str(rule_id), sent_of_token.start, sent_of_token,
                       doc.user_data])

def match_text_stream(clean_texts):
    some_pattern = [nlp(text) for text in ('foo', 'bar')]
    some_other_pattern = [nlp(text) for text in ('foo bar', 'barara')]

    matcher = PhraseMatcher(nlp.vocab)

    matcher.add('SOME', on_match, *some_pattern)
    matcher.add('OTHER', on_match, *some_other_pattern)

    doc_list = []

    for doc in nlp.pipe(clean_texts, batch_size=30):
        doc_list.append(doc)

    for doc in matcher.pipe(doc_list, batch_size=30):
        pass
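
For comparison, the same pipeline can be written as a single lazy chain so that only one batch of Docs is held in memory at a time. This is a minimal sketch against the spaCy 2.x API used above, with the same placeholder patterns, not the code that produced the error:

# Sketch (spaCy 2.x API): chain nlp.pipe and matcher.pipe as generators so
# that no list holding every parsed Doc is ever materialized.
import en_core_web_sm
from spacy.matcher import PhraseMatcher

nlp = en_core_web_sm.load()
match_list = []

def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    match_list.append((nlp.vocab.strings[match_id], doc[start].sent))

matcher = PhraseMatcher(nlp.vocab)
matcher.add('SOME', on_match, *[nlp(text) for text in ('foo', 'bar')])

def match_text_stream_lazy(clean_texts):
    docs = nlp.pipe(clean_texts, batch_size=30)    # lazy generator, not a list
    for doc in matcher.pipe(docs, batch_size=30):  # on_match fires as docs stream by
        pass                                       # matches accumulate in match_list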


Problem steps:



match_list = []

nlp = en_core_web_sm.load()
sample_list = random_Filings(30)
clean_texts = get_clean_text(sample_list)
match_text_stream(clean_texts)

print(match_list)



Error Message



MemoryError
<string> in match_text_stream(clean_texts)

../spacy/language.py in pipe(self, texts, as_tuples, n_threads, batch_size, disable, cleanup, component_cfg)

    709     original_strings_data = None
    710     nr_seen = 0
    711     for doc in docs:
    712         yield doc
    713         if cleanup:

...

MemoryError

../thinc/neural/_classes/convolution.py in begin_update(self, X__bi, drop)

     31
     32     def begin_update(self, X__bi, drop=0.0):
     33         X__bo = self.ops.seq2col(X__bi, self.nW)
     34         finish_update = self._get_finish_update()
     35         return X__bo, finish_update

ops.pyx in thinc.neural.ops.NumpyOps.seq2col()
ops.pyx in thinc.neural.ops.NumpyOps.allocate()

MemoryError:










python-3.x azure-storage-blobs spacy match-phrase

asked Mar 28 at 17:33 by wiesbadener, edited Mar 29 at 11:38


1 Answer

The solution is to cut your documents up into smaller pieces before training. Paragraph units work quite well, or possibly sections.






answered Mar 29 at 7:05 by Chandan Gupta
Comments:

• At the moment I'm not training but using the PhraseMatcher to identify these paragraphs. – wiesbadener, Mar 29 at 11:09

• Thanks for the hint. I noticed that indeed a few documents were very large and my VM had too little RAM to handle them... – wiesbadener, Mar 29 at 15:58
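
For illustration, here is a minimal sketch of the splitting step the answer suggests, assuming plain-text documents whose paragraphs are separated by blank lines; the split rule and the chunk-size guard are assumptions, not part of the original answer:

# Sketch: split each raw document into paragraph-sized chunks before
# handing them to nlp.pipe, so no single text is too large to parse.
# ASSUMPTION: paragraphs are separated by blank lines; adapt the regex
# to the actual filing format.
import re

def split_into_paragraphs(raw_text, max_chars=100_000):
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', raw_text) if p.strip()]
    # Guard against pathological "paragraphs" that are still huge.
    return [p[i:i + max_chars]
            for p in paragraphs
            for i in range(0, len(p), max_chars)]

# Usage with the pipeline from the question:
# chunks = [c for text in clean_texts for c in split_into_paragraphs(text)]
# match_text_stream(chunks)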












