What's the best (fastest) way to scrape webpages?
I'm trying to scrape data from Google Patents, but the execution time is far too long. How can I increase the speed? Running through 8,000 patents has already taken 7 hours...
Here's an example of a patent.
I need to get data from the tables below and write it to a CSV file. I think the bottleneck is the WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='table style-scope patent-result']"))) call.
Is this necessary, or can I use find_elements_by_css_selector and check whether it returns anything?
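For concreteness, the alternative I have in mind inside the loop is roughly this (an untested sketch; the CSS selector just restates the class list from the XPath above):

# Untested sketch: find_elements_* returns an empty list immediately
# instead of blocking, but without an explicit wait the tables may
# simply not have rendered yet on a JavaScript-heavy page.
tables = driver.find_elements_by_css_selector("div.table.style-scope.patent-result")
if not tables:
    # -- write "not found" to csv
    continue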
#...
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#...

## read file of patent numbers and initiate chrome
url = "https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1"

for x in patent_number:
    # url = new url with new patent number, similar to above
    try:
        driver.set_page_load_timeout(20)  # set before get() so the timeout actually applies
        driver.get(url)
    except Exception:
        # -- write to csv
        continue
    if "404" in driver.title:  # patent number not found
        # -- write to csv
        continue
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='table style-scope patent-result']")
            )
        )
    except Exception:
        # -- write to csv
        continue
    ## rest of code to get data from tables and write to csv
Is there a more efficient way to find out whether these tables exist on a patent page? Or would there be a difference if I used BeautifulSoup?
I'm new to web scraping, so any help would be greatly appreciated :)
python selenium web-scraping beautifulsoup
edited Mar 23 at 11:32 by Fozoro
asked Mar 23 at 6:12 by carmen__
Are you after two tables, Patent Citations and Non-Patent Citations? Or all tables on the page?
– QHarr
Mar 23 at 7:13
1 Answer
Not sure which tables you are after, but consider that you may be able to use requests and pandas to grab the tables, as well as a Session to re-use the connection.
import requests
import pandas as pd

codes = ['US6403086B1', 'US6403086B1']  # patent numbers to come from file

with requests.Session() as s:
    for code in codes:
        url = 'https://patents.google.com/patent/{}/en?oq={}'.format(code, code)
        r = s.get(url)
        tables = pd.read_html(r.text)  # r.text gives decoded HTML; str(r.content) would include the b'...' wrapper
        print(tables)  # example only, remove later
        # here you would tidy up the tables, e.g. drop NaN rows, replace NaN with '', ...
        # rather than print, do whatever steps you want to store the info until write-out
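A quick way to check whether requests alone will do here (a minimal sketch; the marker string comes from the XPath in the question) is to fetch one page and look for the table markup in the raw HTML:

import requests

# If the marker appears in the raw response, the tables are served
# directly in the HTML and no JavaScript rendering (Selenium) is needed.
html = requests.get('https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1').text
print('patent-result' in html)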
answered Mar 23 at 7:18 by QHarr
I'd like to get data from all tables. I'm trying to understand the difference between BeautifulSoup and Selenium. What advantage does BeautifulSoup have over Selenium in this use case? When would I use one over the other?
– carmen__
Mar 23 at 22:07
Selenium allows you to automate a browser: JavaScript can render on the page in cases where content is dynamically loaded, and you can interact with site controls, e.g. buttons. Requests is much faster, but you lose the ability to interact with the webpage as there is no browser; JavaScript-loaded content also won't be present. In your case, requests is faster. The response content is then parsed with BeautifulSoup and lxml.
– QHarr
Mar 23 at 22:11
The above answer will print all tables; you can decide what to do with them. Each table is returned as a DataFrame inside the returned list.
– QHarr
Mar 23 at 22:12
Thanks! Is there a way to grab the titles of the tables as well? For example, I need "Patent Citations (1)" that comes before the table. I want to use these titles to determine where to store certain table data.
– carmen__
Mar 29 at 7:27
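One possible approach, as an untested sketch: parse the page with BeautifulSoup and, for each table, walk back to the nearest preceding heading. The heading tag names here are assumptions; inspect the actual patent page to find the real elements that carry titles like "Patent Citations (1)":

import requests
from bs4 import BeautifulSoup

html = requests.get('https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1').text
soup = BeautifulSoup(html, 'lxml')

for table in soup.find_all('table'):
    # find_previous walks backwards through the document; the tag names
    # ('h2', 'h3') are assumed, so verify them against the page source
    heading = table.find_previous(['h2', 'h3'])
    title = heading.get_text(strip=True) if heading else ''
    print(title)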