Extracting text from paragraphs using Python
I'm working on a project where we need to extract a company name, city, state, and dollar amount from a paragraph of text. This information is usually at the beginning of the paragraph. I've been using a regex to find the first dollar sign (which marks the amount we want) and then taking the text between the commas before it, since we know the order in which the fields appear. For example:
company name, city, state, amount $123,456,653
We've run into cases where there could be any number of companies, each followed by its city and state, before the dollar amount.
Example: company name 1, city, state, company name 2, city, state, amount $123,456,653
There are also cases where the company name is given but the next field is not the city; instead, it is the name the company is operating as.
Example: company name 1, company name 1 longer, city, state, amount $123,456,653
And finally, we have seen some cases where there may be a statement saying how many companies are being given a dollar amount, followed by all of the company names.
Example (snippet): Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);xxxxxxxxxxxxxx
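A minimal sketch of how the multi-company layout above might be handled, under some assumptions that may not hold for every paragraph: the total amount appears before the first colon, each entry ends with a parenthesized contract ID, and the last two comma-separated fields of an entry are the city and state. The names `parse_company_list`, `ENTRY_RE`, and `AMOUNT_RE` are made up for this illustration.

```python
import re

# One entry = any text without parentheses, followed by "(CONTRACTID)".
# The optional ";" tolerates the misplaced semicolon seen in "(HTC71119DC008;)".
ENTRY_RE = re.compile(r'([^()]+)\(([A-Z0-9]+);?\)')
AMOUNT_RE = re.compile(r'\$([0-9][0-9,]*)')

def parse_company_list(paragraph):
    """Return (total_amount, [company dicts]) for the 'N companies ...: a; b; c' layout."""
    # Assumption: the first colon separates the summary sentence from the list.
    header, _, listing = paragraph.partition(':')
    amount_match = AMOUNT_RE.search(header)
    amount = int(amount_match.group(1).replace(',', '')) if amount_match else None

    companies = []
    for entry in ENTRY_RE.finditer(listing):
        text = entry.group(1).strip(' ;')
        # Split from the right so commas inside the company name are preserved.
        parts = [part.strip() for part in text.rsplit(',', 2)]
        if len(parts) != 3:
            continue  # unexpected shape; leave it for manual review
        name, city, state = parts
        companies.append({'company': name, 'city': city,
                          'state': state, 'contract': entry.group(2)})
    return amount, companies
```

Splitting from the right with `rsplit(',', 2)` is a deliberate choice here: the city and state are the most predictable fields, so everything to their left is treated as the company name.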
Usually, the paragraph will look like this (70-80% of the time):
L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a $43,094,331 fixed-price-incentive,xxxxxxxxxx
Just wondering if anyone has suggestions for Python libraries or a better way of extracting this text. I thought about implementing some kind of API that would take each extracted value (after splitting on commas) and check whether it is a city or a state. That would tell us which position in the list the value occupies and what is likely to come next (for example, a state after a city).
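The lookup idea described above could be sketched roughly as follows, using a hard-coded set of state names instead of an external API. Everything here (`US_STATES`, `group_by_state`, the record layout) is hypothetical and only illustrates the approach: walk the comma-separated values, and whenever a value is a known state, treat the value before it as the city and anything earlier as company names.

```python
US_STATES = {
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
    "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi",
    "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
    "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
    "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina",
    "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
    "Washington", "West Virginia", "Wisconsin", "Wyoming",
    "District of Columbia",
}

def group_by_state(prefix):
    """Group comma-separated values into records, using state names as anchors.

    Returns (records, leftover), where leftover holds trailing values
    that were not followed by a state.
    """
    tokens = [t.strip() for t in prefix.split(',') if t.strip()]
    records, pending = [], []
    for token in tokens:
        # Require at least a name and a city before accepting a state, so that
        # "Some Co, New York, New York" keeps the first "New York" as the city.
        if token in US_STATES and len(pending) >= 2:
            *names, city = pending
            records.append({'names': names, 'city': city, 'state': token})
            pending = []
        else:
            pending.append(token)
    return records, pending
```

This handles both the "operating as" case (two names before the city) and several companies in a row, but it is still a heuristic: a city that shares its name with a state, appearing after two or more name values, would be misread as the state.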
This is the current regex I am using: r'([^$]*),.*?\$([0-9,]+)'
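A minimal sketch of how this pattern could be applied to the common case, assuming the dollar sign is escaped as `\$` (unescaped, `$` is an end-of-string anchor and will not match a literal dollar sign). The function name `parse_simple_award` and the dictionary keys are illustrative only, and the sketch deliberately gives up on anything that is not exactly "name, city, state" before the amount.

```python
import re

# Group 1: everything up to the last comma before the first "$".
# Group 2: the digits and commas of the amount.
AWARD_RE = re.compile(r'([^$]*),.*?\$([0-9,]+)')

def parse_simple_award(paragraph):
    """Parse the 'name, city, state, ... $amount' layout; return None otherwise."""
    match = AWARD_RE.search(paragraph)
    if match is None:
        return None
    fields = [part.strip() for part in match.group(1).split(',')]
    if len(fields) != 3:
        return None  # not the simple layout; needs another strategy or manual entry
    digits = match.group(2).replace(',', '')
    if not digits:
        return None  # the amount group contained only commas
    name, city, state = fields
    return {'company': name, 'city': city, 'state': state, 'amount': int(digits)}
```

Returning `None` for the other layouts keeps the simple path strict, so the multi-company and "operating as" paragraphs can be routed to a different parser or flagged for manual entry.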
python regex python-3.x
1
Wow. This is ambitious. I personally have doubts that regex will work well here, because regex requires some kind of standardization. If there are varying orders, especially regarding city names, this will be difficult. First off though, you should post more samples. Secondly, it might be nice if you posted what your desired output would be...
– FailSafe
Mar 24 at 0:37
Secondly, in your 70-80% example, is L-3 typical? In a paragraph blob, you'd need something that primes the regex to know that what is captured in a group represents a company name, as distinct from just other words.
– FailSafe
Mar 24 at 0:48
1
@FailSafe A text analysis library might be overkill; I still think the best way is to use an n-grams database in combination with regex.
– GKE
Mar 24 at 2:29
1
lol, you and me both. I think for the 70-80% of the time I can capture what I need, but there may need to be some manual entry from the user.
– dataviews
Mar 24 at 2:38
1
i see u updated the readme, thanks ;)
– dataviews
Mar 24 at 4:09
asked Mar 24 at 0:25
dataviews