Extracting text from paragraphs using Python

I'm working on a project where we want to extract a company name, city, state, and dollar amount from a block of text in a paragraph. Usually, this information will be at the beginning of the paragraph, and I've been using a regex to find the first dollar sign (which would be the amount we are extracting), and finding the text between each comma since we know which order the text comes in. For example:



company name, city, state, amount $123,456,653


We've run into cases where there could be X number of companies, each followed by its city and state, before the dollar amount.



Example: company name 1, city, state, company name 2, city, state, amount $123,456,653


There can also be cases where the company name is given, but the next piece of info is not the city; instead it is the name the company operates as (xxx).



Example: company name 1, company name 1 longer, city, state, amount $123,456,653


And finally, we have seen some cases where there may be a statement saying how many companies are being given a dollar amount, followed by all of the company names.



Example (snippet): Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);xxxxxxxxxxxxxx
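For the multi-company layout above, one possible approach is to take the first dollar amount as the total and then read each `Name, City, State (CONTRACT_ID)` entry after the colon. This is only a sketch: the function name `parse_multi_award`, the two patterns, and the assumption that the last two comma-separated fields are always city and state are illustrative, not something confirmed by the data beyond the snippet shown. The sample string is abbreviated from that snippet.

```python
import re

# Assumed layout: "... $TOTAL: Name, City, State (ID); Name, City, State (ID); ..."
AMOUNT_RE = re.compile(r'\$([0-9][0-9,]*)')
# Tolerates the misplaced semicolon seen in "(HTC71119DC008;)".
ENTRY_RE = re.compile(r'\s*(?P<fields>[^;()]+?)\s*\((?P<contract>[A-Z0-9]+);?\)')


def parse_multi_award(text):
    """Return (total_amount, awards) for a multi-company paragraph."""
    amount_match = AMOUNT_RE.search(text)
    if amount_match is None:
        return None, []
    amount = int(amount_match.group(1).replace(',', ''))

    # Only look at the text after the colon that follows the amount.
    _, _, tail = text[amount_match.end():].partition(':')

    awards = []
    for entry in ENTRY_RE.finditer(tail):
        parts = [p.strip() for p in entry.group('fields').split(',')]
        if len(parts) < 3:
            continue  # not enough fields to be name, city, state
        *name_parts, city, state = parts
        awards.append({
            'company': ', '.join(name_parts),
            'city': city,
            'state': state,
            'contract': entry.group('contract'),
        })
    return amount, awards


sample = (
    "Twenty-five companies have been awarded contracts with an estimated "
    "value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); "
    "Atlas Air Inc., Purchase, New York (HTC71119DC008;) "
    "Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009);"
)
total, awards = parse_multi_award(sample)
```

Taking city and state from the right-hand end of each entry means a company name that itself contains a comma still ends up whole in `company`.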



Usually, the paragraph will look like this (70-80% of the time):



L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a $43,094,331 fixed-price-incentive,xxxxxxxxxx


Just wondering if anyone has suggestions on Python libraries or a better way of extracting the specific text. I thought about implementing some type of API that would take each extracted value (after splitting on commas) and check whether it is a city or a state. That would give us an idea of which position in the list the value occupies and what is likely to come next (for example, a state after a city).
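The lookup idea could be sketched without an external API by checking each comma-separated field against a set of state names. Everything here is an assumption for illustration: `US_STATES` is a small hand-written subset, `split_by_state` is a hypothetical helper, and the "Acme"/"Beta" names are made up. A field only counts as a state when at least a name and a city precede it, which is what lets "Washington, District of Columbia" come out right.

```python
# Assumption: a tiny subset for illustration; a real list would cover every
# state and territory that appears in the data.
US_STATES = {
    "Maryland", "Ohio", "Washington", "Nevada", "Texas",
    "Florida", "New York", "Georgia", "District of Columbia",
}


def split_by_state(prefix):
    """Group comma-separated fields into company/city/state records.

    Returns (records, leftover), where leftover holds trailing fields that
    were not closed off by a state name.
    """
    fields = [f.strip() for f in prefix.split(',') if f.strip()]
    records, pending = [], []
    for field in fields:
        # Treat the field as a state only if a name and a city came before it.
        if field in US_STATES and len(pending) >= 2:
            *name_parts, city = pending
            records.append({
                'company': ', '.join(name_parts),
                'city': city,
                'state': field,
            })
            pending = []
        else:
            pending.append(field)
    return records, pending


# Common case, using the text before the dollar sign.
common, rest = split_by_state(
    "L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a "
)

# Hypothetical input: an "operating as" name plus a second company.
multi, _ = split_by_state(
    "Acme Corp., Acme Holdings, Dayton, Ohio, Beta LLC, Austin, Texas"
)
```

Because the state name closes each record, any extra fields before the city (such as a longer "operating as" name) are folded into `company` rather than shifting the city and state out of position.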



This is the current regex I am using: r'([^$]*),.*?\$([0-9,]+)'
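For the common single-company layout, a pattern with an escaped dollar sign and named groups might look like the sketch below. The group names and the helper `parse_common` are assumptions for illustration, not the pattern actually in use, and it relies on the fields arriving strictly as name, city, state.

```python
import re

# Assumed layout: "Company, City, State, <any text> $AMOUNT ..."
COMMON_RE = re.compile(
    r'^(?P<company>.+?),\s*'      # shortest text up to the first usable comma
    r'(?P<city>[^,]+),\s*'
    r'(?P<state>[^,]+),'
    r'[^$]*\$(?P<amount>[0-9][0-9,]*)'  # "\$" matches a literal dollar sign
)


def parse_common(text):
    """Return a dict for the common layout, or None if it does not match."""
    match = COMMON_RE.search(text)
    if match is None:
        return None
    return {
        'company': match.group('company').strip(),
        'city': match.group('city').strip(),
        'state': match.group('state').strip(),
        'amount': int(match.group('amount').replace(',', '')),
    }


record = parse_common(
    "L-3 Chesapeake Sciences Corp., Millersville, Maryland, "
    "is being awarded a $43,094,331 fixed-price-incentive,"
)
```

An unescaped `$` in a Python pattern is an end-of-string anchor, so it has to be written as `\$` to match the currency symbol. This pattern will mislabel the fields when an extra "operating as" name sits between the company and the city, so it only covers the usual case.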










  • Wow. This is ambitious. I personally have doubts that regex will work well here, because regex requires some kind of standardization. If there are varying orders, especially regarding city names, this will be difficult. First off though, you should post more samples. Secondly, it might be nice if you posted what your desired output would be...
    – FailSafe, Mar 24 at 0:37

  • Secondly, in your 70-80% example, is L-3 typical? In a paragraph blob, you'd need something that primes the regex to know that what is captured in a group represents a company name, as distinct from just other words.
    – FailSafe, Mar 24 at 0:48

  • @FailSafe A text analysis library might be overkill; I still think the best way is to use an ngrams database in combination with regex.
    – GKE, Mar 24 at 2:29

  • lol, you and me both. I think for the 70-80% of the time I can capture what I need, but there may need to be some manual entry from the user.
    – dataviews, Mar 24 at 2:38

  • i see u updated the readme, thanks ;)
    – dataviews, Mar 24 at 4:09

















python regex python-3.x
















asked Mar 24 at 0:25 by dataviews






