What's the best (fastest) way to scrape webpages?


I'm trying to scrape data from Google Patents, and the execution time is taking too long. How can I increase the speed? Running through 8,000 patents has already taken 7 hours...



Here's an example of a patent: https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1



I need to get data from the tables on the patent page and write them to a csv file. I think the bottleneck is at WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='table style-scope patent-result']")))



Is this wait necessary, or can I use find_elements_by_css_selector and check whether it returns anything?



from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## read file of patent numbers and initiate chrome
driver = webdriver.Chrome()
driver.set_page_load_timeout(20)  # set before navigating so it applies to each load

url = "https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1"

for x in patent_number:

    # url = new url with new patent number, built like the one above

    try:
        driver.get(url)
    except:  # page load timed out
        # -- write to csv
        continue

    if "404" in driver.title:  # patent number not found
        # -- write to csv
        continue

    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='table style-scope patent-result']"))
        )
    except:  # table never appeared
        # -- write to csv
        continue

    ## rest of code to get data from tables and write to csv
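For concreteness, here's a sketch of the find_elements check I have in mind (it assumes the same driver as above, and the CSS selector is my translation of the XPath class list):

def tables_present(driver):
    # find_elements_by_css_selector returns an empty list (no exception)
    # when nothing matches, so a truthiness check replaces the try/except --
    # but unlike WebDriverWait it does not poll, so a table that is still
    # rendering would be missed
    return bool(driver.find_elements_by_css_selector(
        "div.table.style-scope.patent-result"))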


Is there a more efficient way of checking whether these tables exist on a patent page? And would it make a difference if I used BeautifulSoup?



I'm new to web scraping, so any help would be greatly appreciated :)

python selenium web-scraping beautifulsoup

asked Mar 23 at 6:12 by carmen__ (edited Mar 23 at 11:32 by Fozoro)

  • Are you after two tables? Patent Citations and Non Patent citations? All tables on page?

    – QHarr
    Mar 23 at 7:13
1 Answer
Not sure which tables you are after, but consider that you may be able to use requests and pandas to grab the tables, as well as Session to re-use the connection.



import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

codes = ['US6403086B1', 'US6403086B1']  # patent numbers to come from file
with requests.Session() as s:
    for code in codes:
        url = 'https://patents.google.com/patent/{}/en?oq={}'.format(code, code)
        r = s.get(url)
        tables = pd.read_html(str(r.content))
        print(tables)  # example only. Remove later
        # here would add some tidying up to tables e.g. dropna rows, replace NaN with '' ...
        # rather than print... whatever steps to store info you want until write out
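
To go from that list of DataFrames to the csv output the question asks for, one option (a sketch only -- the per-patent filename scheme here is an illustrative assumption, not part of the original answer) is to continue inside the loop:

        # pd.read_html returns one DataFrame per <table> found in the HTML;
        # write each to its own csv, keyed by patent number and table index
        for i, df in enumerate(tables):
            df.to_csv('{}_table_{}.csv'.format(code, i), index=False)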





answered Mar 23 at 7:18 by QHarr

  • I'd like to get data from all tables. I'm trying to understand the difference between BeautifulSoup and Selenium. What advantage does BeautifulSoup have over selenium in this use case? When would I use one over the other?

    – carmen__
    Mar 23 at 22:07












  • Selenium allows you to automate a browser: JavaScript can render on the page in cases where content is dynamically loaded, and you can interact with site controls, e.g. buttons. requests is much faster, but you lose the ability to interact with a webpage as there is no browser, and JavaScript-loaded content won't be present. In your case, requests is faster. The response object's content is then parsed with BeautifulSoup and lxml.

    – QHarr
    Mar 23 at 22:11











  • the above answer will print all tables. You can decide what you do with them. Each table is returned as a DataFrame inside the returned list.

    – QHarr
    Mar 23 at 22:12











  • Thanks! Is there a way to grab the titles of the tables as well? For example, I need "Patent Citations (1)" that comes before the table. I want to use these titles to determine where to store certain table data.

    – carmen__
    Mar 29 at 7:27
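
The thread ends without an answer to that last question. One hedged way to get the titles (a sketch only: it assumes each table's title sits in a heading element somewhere before the table, which is a guess about the Google Patents markup rather than something confirmed in this thread) is to walk the tables with BeautifulSoup and pair each with its nearest preceding heading:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1')
soup = BeautifulSoup(r.text, 'lxml')

for table in soup.find_all('table'):
    # nearest heading element above this table; h2/h3 is an assumption
    # about the page structure -- inspect the actual HTML to confirm
    heading = table.find_previous(['h2', 'h3'])
    title = heading.get_text(strip=True) if heading else ''
    df = pd.read_html(str(table))[0]
    print(title, df.shape)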