Extracting text from paragraphs using Python


I'm working on a project where we want to extract a company name, city, state, and dollar amount from a paragraph of text. This information is usually at the beginning of the paragraph. So far I've been using a regex to find the first dollar sign (which marks the amount we are extracting) and then taking the text between each comma, since we know the order the fields come in. For example:

company name, city, state, amount $123,456,653
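A minimal sketch of the approach described above, assuming the first dollar amount in the paragraph is the one of interest and that the fields before it are separated by commas. The names `AMOUNT_RE` and `split_before_amount` are made up for illustration.

```python
import re

# A dollar amount such as $123,456,653: digits with optional thousands groups.
AMOUNT_RE = re.compile(r"\$(\d+(?:,\d{3})*)")


def split_before_amount(paragraph):
    """Return (fields, amount) based on the first dollar amount, or None if there is none."""
    match = AMOUNT_RE.search(paragraph)
    if match is None:
        return None
    amount = int(match.group(1).replace(",", ""))
    # Everything before the dollar sign is treated as a comma-separated field list.
    head = paragraph[:match.start()]
    fields = [part.strip() for part in head.split(",") if part.strip()]
    return fields, amount


fields, amount = split_before_amount("company name, city, state, amount $123,456,653")
print(fields)
print(amount)
```

Splitting only the text before the match keeps the commas inside the amount from being treated as field separators.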

We've run into cases where there could be any number of companies, each followed by its city and state, before the dollar amount.

Example: company name 1, city, state, company name 2, city, state, amount $123,456,653

In other cases the company name is given, but the next piece of information is not the city; instead it is the name the company operates as.

Example: company name 1, company name 1 longer, city, state, amount $123,456,653

And finally, we have seen some cases where there may be a statement saying how many companies are being given a dollar amount, followed by all of the company names.

Example (snippet): Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);xxxxxxxxxxxxxx
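For listings like the snippet above, one possible sketch is to take the first dollar amount as the total and then scan the rest of the text for `name, city, state (contract id)` entries instead of splitting on semicolons, since the punctuation is not always consistent (note the `(HTC71119DC008;)` entry). The pattern and the function name `parse_multi_award` are assumptions for illustration, and the contract id is assumed to be upper-case letters and digits.

```python
import re

AMOUNT_RE = re.compile(r"\$(\d+(?:,\d{3})*)")

# One entry: company, city, state, then an opening parenthesis and a contract id.
# The company part may itself contain commas; city and state may not.
ENTRY_RE = re.compile(
    r"(?P<company>[^;()]+?),\s*"
    r"(?P<city>[^,;()]+),\s*"
    r"(?P<state>[^,;()]+?)\s*"
    r"\((?P<contract>[A-Z0-9]+)"
)


def parse_multi_award(paragraph):
    """Return (total_amount, list of entry dicts) for a multi-company paragraph."""
    amount_match = AMOUNT_RE.search(paragraph)
    if amount_match is None:
        return None, []
    amount = int(amount_match.group(1).replace(",", ""))
    # Only look at the text after the amount, skipping the colon that follows it.
    listing = paragraph[amount_match.end():].lstrip(": ")
    awards = [
        {key: value.strip() for key, value in m.groupdict().items()}
        for m in ENTRY_RE.finditer(listing)
    ]
    return amount, awards


text = (
    "Twenty-five companies have been awarded contracts with an estimated value "
    "of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); "
    "Atlas Air Inc., Purchase, New York (HTC71119DC008;) "
    "Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009);"
)
amount, awards = parse_multi_award(text)
for award in awards:
    print(award["company"], "|", award["city"], "|", award["state"])
```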

Usually, the paragraph will look like this (70-80% of the time):

L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a $43,094,331 fixed-price-incentive,xxxxxxxxxx
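For this common form, a rough sketch is to anchor on the phrase "is being awarded" and treat the two comma-separated fields just before it as city and state, with everything earlier as the company name. The pattern `TYPICAL_RE` is hypothetical and only covers this one verb phrase; other wordings would have to be added.

```python
import re

# company, city, state, "is being awarded" ... $amount
# The company group is lazy but may grow across commas, so a longer
# "operating as" name ends up in the company field rather than the city.
TYPICAL_RE = re.compile(
    r"^(?P<company>.+?),\s*"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>[^,]+),\s*"
    r"is being awarded\b.*?"
    r"\$(?P<amount>\d+(?:,\d{3})*)"
)


def parse_typical(paragraph):
    """Return a dict with company, city, state and amount, or None if the form does not match."""
    match = TYPICAL_RE.match(paragraph)
    if match is None:
        return None
    record = match.groupdict()
    record["amount"] = int(record["amount"].replace(",", ""))
    return record


print(parse_typical(
    "L-3 Chesapeake Sciences Corp., Millersville, Maryland, "
    "is being awarded a $43,094,331 fixed-price-incentive,"
))
```

Paragraphs that return `None` here could be passed on to a looser fallback or flagged for manual entry.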

I'm wondering if anyone has suggestions for Python libraries or a better way of extracting this specific text. I thought about using some kind of API that would take each extracted value (after splitting on commas) and check whether it is a city or a state; that would tell us which position in the list we are at and what is likely to come next (for example, the state).
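As a sketch of that idea without an external API, a fixed set of US state names can act as the lookup: whenever a field is a state, the field before it is taken as the city and everything since the previous state as the company name. This is only a heuristic under the assumption that states are spelled out in full; `US_STATES` and `group_by_state` are illustrative names.

```python
US_STATES = frozenset({
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming", "District of Columbia",
})


def group_by_state(fields):
    """Group comma-split fields into company/city/state records.

    Returns (records, leftover), where leftover holds trailing fields
    that were not closed off by a state name.
    """
    records, pending = [], []
    for index, field in enumerate(fields):
        next_field = fields[index + 1] if index + 1 < len(fields) else None
        # A state closes a record only if a company and a city precede it, and
        # the next field is not also a state (e.g. "Washington, District of Columbia").
        closes_record = (
            field in US_STATES
            and len(pending) >= 2
            and next_field not in US_STATES
        )
        if closes_record:
            *company_parts, city = pending
            records.append(
                {"company": ", ".join(company_parts), "city": city, "state": field}
            )
            pending = []
        else:
            pending.append(field)
    return records, pending


records, leftover = group_by_state([
    "company name 1", "city", "Ohio",
    "company name 2", "company name 2 longer", "city 2", "Texas",
    "amount",
])
print(records)
print(leftover)
```

Because the company name absorbs every field between the previous state and the city, this covers both the multi-company case and the longer "operating as" name.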

This is the current regex I am using: r'([^$]*),.*?\$([0-9,]+)'
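A quick check of how this pattern behaves, assuming the dollar sign is meant as a literal character: outside a character class, an unescaped `$` is an end-of-string anchor, so the pattern only matches once it is written as `\$`. The sample string is the typical paragraph from the question, shortened.

```python
import re

sample = (
    "L-3 Chesapeake Sciences Corp., Millersville, Maryland, "
    "is being awarded a $43,094,331 fixed-price-incentive,"
)

# Unescaped: "$" anchors at the end of the string, so no digits can follow it.
unescaped = re.compile(r"([^$]*),.*?$([0-9,]+)")
# Escaped: "\$" matches the literal dollar sign in front of the amount.
escaped = re.compile(r"([^$]*),.*?\$([0-9,]+)")

print(unescaped.search(sample))  # None

match = escaped.search(sample)
print(match.group(1))  # L-3 Chesapeake Sciences Corp., Millersville, Maryland
print(match.group(2))  # 43,094,331
```

The first group ends at the last comma before the dollar sign, so it still has to be split into company, city, and state afterwards.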

asked Mar 24 at 0:25 by dataviews

Wow. This is ambitious. I personally have doubts that regex will work well here, because regex requires some kind of standardization. If there are varying orders, especially regarding city names, this will be difficult. First off though, you should post more samples. Secondly, it might be nice if you posted what your desired output would be...

– FailSafe
Mar 24 at 0:37

Secondly, in your 70-80% example, is L-3 typical? In a paragraph blob, you'd need something that primes the regex to know that what is captured in a group represents a company name, as distinct from just other words.

– FailSafe
Mar 24 at 0:48

@FailSafe A text analysis library might be overkill; I still think the best way is to use an n-grams database in combination with regex.

– GKE
Mar 24 at 2:29

lol, you and me both. I think I can capture what I need for the 70-80% case, but the rest may need some manual entry from the user.

– dataviews
Mar 24 at 2:38

i see u updated the readme, thanks ;)

– dataviews
Mar 24 at 4:09

python regex python-3.x
