How to solve Memory Error with Spacy PhraseMatcher?
High-level Background
I'm working on a project where, as a first step, I search for keywords and phrases inside a large text corpus. I want to identify the passages/sentences in which these keywords occur. Later I want to make these passages accessible through my local Postgres DB so users can query the information. The data is stored on Azure Blob Storage, and I'm using a MinIO server to connect it to my Django application.
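For context, the retrieval part looks roughly like this (an illustrative sketch only; the endpoint, credentials, bucket, and object names are placeholders rather than my actual setup):
from minio import Minio

# Illustrative only: endpoint, credentials, and names are placeholders.
client = Minio(
    "my-minio-host:9000",   # MinIO gateway in front of Azure Blob Storage
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
    secure=False,
)

response = client.get_object("filings-bucket", "filings/doc_0001.txt")
raw_text = response.read().decode("utf-8")
response.close()
response.release_conn()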
Actual Problem
At first my shell was simply killed when running my script; after some trial-and-error refactoring/debugging, I now get a memory error instead. The script:
- samples 30 random text documents from blob storage (I want to sample 10,000, but it already breaks at low numbers),
- preprocesses the raw text for the NLP task,
- streams the text through spaCy's nlp.pipe to get a list of docs, and
- streams the list of docs to the PhraseMatcher (which, via on_match, appends the rule_id, the start token of the sentence containing the match, the sentence itself, and the doc's hash_id to a match_list).
When the shell was killed, I looked into the log files and saw that it was a memory error, but to be honest I'm quite new to this topic. After rearranging the code, the MemoryError now occurs directly in the shell, within the language.pipe() step that streams the text through spaCy.
Code extracts
Functions
# Function that samples filing documents
def random_Filings(amount):
    ...
    return random_list

# Function that connects to storage and saves cleaned text
def get_clean_text(random_list):
    try:
        text_contents = S3Client().get_buffer(remote_path)
    ...
    return clean_list
# Matcher callback that runs on each match of the PhraseMatcher
def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    rule_id = nlp.vocab.strings[match_id]  # resolve the rule's string name
    token = doc[start]
    sent_of_token = token.sent
    match_list.append([str(rule_id), sent_of_token.start, sent_of_token,
                       doc.user_data])

def match_text_stream(clean_texts):
    some_pattern = [nlp(text) for text in ('foo', 'bar')]
    some_other_pattern = [nlp(text) for text in ('foo bar', 'barara')]

    matcher = PhraseMatcher(nlp.vocab)
    matcher.add('SOME', on_match, *some_pattern)
    matcher.add('OTHER', on_match, *some_other_pattern)

    doc_list = []
    for doc in nlp.pipe(clean_texts, batch_size=30):
        doc_list.append(doc)
    for doc in matcher.pipe(doc_list, batch_size=30):
        pass
Problem steps:
import en_core_web_sm
from spacy.matcher import PhraseMatcher

match_list = []
nlp = en_core_web_sm.load()

sample_list = random_Filings(30)
clean_texts = get_clean_text(sample_list)
match_text_stream(clean_texts)
print(match_list)
Error Message
MemoryError
<string> in match_text_stream(clean_texts)
../spacy/language.py in pipe(self, texts, as_tuples, n_threads, batch_size, disable, cleanup, component_cfg)
    709                 original_strings_data = None
    710         nr_seen = 0
    711         for doc in docs:
    712             yield doc
    713             if cleanup:
...
MemoryError
../thinc/neural/_classes/convolution.py in begin_update(self, X__bi, drop)
     31
     32     def begin_update(self, X__bi, drop=0.0):
     33         X__bo = self.ops.seq2col(X__bi, self.nW)
     34         finish_update = self._get_finish_update()
     35         return X__bo, finish_update

ops.pyx in thinc.neural.ops.NumpyOps.seq2col()
ops.pyx in thinc.neural.ops.NumpyOps.allocate()
MemoryError:
python-3.x azure-storage-blobs spacy match-phrase
edited Mar 29 at 11:38
asked Mar 28 at 17:33
wiesbadener
114 bronze badges
1 Answer
The solution is to cut your documents up into smaller pieces before training. Paragraph units work quite well, or possibly sections.
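For example, a minimal sketch of that idea, assuming paragraphs are separated by blank lines (the delimiter is an assumption; adapt it to your corpus):
def split_into_paragraphs(texts):
    # Yield paragraph-sized chunks instead of whole documents
    # (assumes blank-line-separated paragraphs).
    for text in texts:
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                yield paragraph

# Stream the smaller units; each Doc now stays small.
for doc in nlp.pipe(split_into_paragraphs(clean_texts), batch_size=30):
    matcher(doc)  # on_match callbacks fire as before
This keeps each Doc's memory footprint bounded, so nlp.pipe never has to allocate arrays for a multi-megabyte document in one go.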
answered Mar 29 at 7:05
Chandan Gupta
363, 1 silver badge, 8 bronze badges
At the moment I'm not training but using the PhraseMatcher to identify these paragraphs.
– wiesbadener
Mar 29 at 11:09
Thanks for the hint. I noticed that indeed a few documents were very large, and my VM had too little RAM to handle them...
– wiesbadener
Mar 29 at 15:58
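A quick way to spot such oversized documents before streaming them (an illustrative sketch; the 1,000,000-character threshold mirrors spaCy v2's default nlp.max_length and is an assumption here):
# Flag texts that may be too large to process as a single Doc
# (1_000_000 characters is spaCy v2's default nlp.max_length).
MAX_CHARS = 1_000_000

oversized = [i for i, text in enumerate(clean_texts) if len(text) > MAX_CHARS]
print(f"{len(oversized)} of {len(clean_texts)} texts exceed {MAX_CHARS} characters")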