What's the best (fastest) way to scrape webpages?


I'm trying to scrape data from Google Patents, and the execution time is taking too long. How can I increase the speed? Running through 8,000 patents has already taken 7 hours...



Here's an example of a patent: https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1



I need to get data from the tables on the patent page and write them to a csv file. I think the bottleneck is at WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='table style-scope patent-result']")))



Is this wait necessary, or can I use find_elements_by_css_selector and check whether it returns anything?



from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## read file of patent numbers and initiate chrome
driver = webdriver.Chrome()
driver.set_page_load_timeout(20)  # set before navigating so it applies to each load

url = "https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1"

for x in patent_number:

    # url = new url with new patent number, built like the one above

    try:
        driver.get(url)
    except:  # page load timed out
        # -- write to csv
        continue

    if "404" in driver.title:  # patent number not found
        # -- write to csv
        continue

    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='table style-scope patent-result']"))
        )
    except:  # table never appeared
        # -- write to csv
        continue

    ## rest of code to get data from tables and write to csv
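For concreteness, here's a sketch of the find_elements check I have in mind (it assumes the same driver as above, and the CSS selector is my translation of the XPath class list):

def tables_present(driver):
    # find_elements_by_css_selector returns an empty list (no exception)
    # when nothing matches, so a truthiness check replaces the try/except --
    # but unlike WebDriverWait it does not poll, so a table that is still
    # rendering would be missed
    return bool(driver.find_elements_by_css_selector(
        "div.table.style-scope.patent-result"))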


Is there a more efficient way of checking whether these tables exist on a patent page? And would it make a difference if I used BeautifulSoup?



I'm new to web scraping, so any help would be greatly appreciated :)

python selenium web-scraping beautifulsoup

asked Mar 23 at 6:12 by carmen__ (edited Mar 23 at 11:32 by Fozoro)

  • Are you after two tables? Patent Citations and Non Patent citations? All tables on page?

    – QHarr
    Mar 23 at 7:13
1 Answer
Not sure which tables you are after, but consider that you may be able to use requests and pandas to grab the tables, as well as Session to re-use the connection.



import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

codes = ['US6403086B1', 'US6403086B1']  # patent numbers to come from file
with requests.Session() as s:
    for code in codes:
        url = 'https://patents.google.com/patent/{}/en?oq={}'.format(code, code)
        r = s.get(url)
        tables = pd.read_html(str(r.content))
        print(tables)  # example only. Remove later
        # here would add some tidying up to tables e.g. dropna rows, replace NaN with '' ...
        # rather than print... whatever steps to store info you want until write out
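
To go from that list of DataFrames to the csv output the question asks for, one option (a sketch only -- the per-patent filename scheme here is an illustrative assumption, not part of the original answer) is to continue inside the loop:

        # pd.read_html returns one DataFrame per <table> found in the HTML;
        # write each to its own csv, keyed by patent number and table index
        for i, df in enumerate(tables):
            df.to_csv('{}_table_{}.csv'.format(code, i), index=False)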





answered Mar 23 at 7:18 by QHarr

  • I'd like to get data from all tables. I'm trying to understand the difference between BeautifulSoup and Selenium. What advantage does BeautifulSoup have over selenium in this use case? When would I use one over the other?

    – carmen__
    Mar 23 at 22:07












  • Selenium allows you to automate a browser: JavaScript can render on the page in cases where content is dynamically loaded, and you can interact with site controls, e.g. buttons. requests is much faster, but you lose the ability to interact with a webpage as there is no browser, and JavaScript-loaded content won't be present. In your case, requests is faster. The response object's content is then parsed with BeautifulSoup and lxml.

    – QHarr
    Mar 23 at 22:11











  • the above answer will print all tables. You can decide what you do with them. Each table is returned as a DataFrame inside the returned list.

    – QHarr
    Mar 23 at 22:12











  • Thanks! Is there a way to grab the titles of the tables as well? For example, I need "Patent Citations (1)" that comes before the table. I want to use these titles to determine where to store certain table data.

    – carmen__
    Mar 29 at 7:27
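
The thread ends without an answer to that last question. One hedged way to get the titles (a sketch only: it assumes each table's title sits in a heading element somewhere before the table, which is a guess about the Google Patents markup rather than something confirmed in this thread) is to walk the tables with BeautifulSoup and pair each with its nearest preceding heading:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1')
soup = BeautifulSoup(r.text, 'lxml')

for table in soup.find_all('table'):
    # nearest heading element above this table; h2/h3 is an assumption
    # about the page structure -- inspect the actual HTML to confirm
    heading = table.find_previous(['h2', 'h3'])
    title = heading.get_text(strip=True) if heading else ''
    df = pd.read_html(str(table))[0]
    print(title, df.shape)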