Looking for a better solution to scrape multiple webpages with beautifulsoupScraping with BeautifulSoup and multiple paragraphsWeb scraping daily tables into CSV with BeautifulSoup in PythonScraping data from multiple previous dates in WSJ stocks websiteSimulate clicking a link when scraping with Python and BeautifulSoupBeautifulSoup Scraping: loading div instead of the contentScraping URLs using BeautifulSoupHow to scrape multiple pages with an unchanging URL - Python & BeautifulSoupHow can I scrape a site with multiple pages using beautifulsoup and python?Scrape table from HTML with beautifulsoupScrape results of multiple search pages with Beautiful Soup and Python
How should I mix small caps with digits or symbols?
tikz: 5 squares on a row, roman numbered 1 -> 5
Separate the element after every 2nd ',' and push into next row in bash
How to play vs. 1.e4 e5 2.Nf3 Nc6 3.Bc4 d6?
Parse a C++14 integer literal
How would a physicist explain this starship engine?
Is being an extrovert a necessary condition to be a manager?
Are there any nuances between "dismiss" and "ignore"?
Keeping the dodos out of the field
pwaS eht tirsf dna tasl setterl fo hace dorw
Difference in 1 user doing 1000 iterations and 1000 users doing 1 iteration in Load testing
What are the domains of the multiplication and unit morphisms of a monoid object?
Best practice for printing and evaluating formulas with the minimal coding
US F1 Visa grace period attending a conference
How do we explain the use of a software on a math paper?
List of lists elementwise greater/smaller than
How is dynamic resistance of a diode modeled for large voltage variations?
Department head said that group project may be rejected. How to mitigate?
400–430 degrees Celsius heated bath
What is metrics.roc_curve and metrics.auc measuring when I'm comparing binary data with probability estimates?
Gambler's Fallacy Dice
Salesforce bug enabled "Modify All"
Way of refund if scammed?
How can sister protect herself from impulse purchases with a credit card?
Looking for a better solution to scrape multiple webpages with beautifulsoup
Scraping with BeautifulSoup and multiple paragraphsWeb scraping daily tables into CSV with BeautifulSoup in PythonScraping data from multiple previous dates in WSJ stocks websiteSimulate clicking a link when scraping with Python and BeautifulSoupBeautifulSoup Scraping: loading div instead of the contentScraping URLs using BeautifulSoupHow to scrape multiple pages with an unchanging URL - Python & BeautifulSoupHow can I scrape a site with multiple pages using beautifulsoup and python?Scrape table from HTML with beautifulsoupScrape results of multiple search pages with Beautiful Soup and Python
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;
I am trying to scrape the results of sports game off a website. The website has all the results of all the games which is perfect, but they are on many pages. Each page represents one day and I am gathering data over many months of games so it will be quite a lot of urls to enter.
The way I set it up now is that i have a base url, and a list of the dates that I can append using a for loop. This way works fine, but I was curious if there was a better way before I enter the many many dates I will be scraping over.
url = 'http://www.url.com?'
#this list would hold hundreds of dates
dates = ['month=11&day=1&year=2016', 'month=11&day=2&year=2016', ...]
for i in dates:
page = requests.get(url+i)
soup = BeautifulSoup(page.text, 'html.parser')
#and so on, this part works as intended
python beautifulsoup
|
show 1 more comment
I am trying to scrape the results of sports game off a website. The website has all the results of all the games which is perfect, but they are on many pages. Each page represents one day and I am gathering data over many months of games so it will be quite a lot of urls to enter.
The way I set it up now is that i have a base url, and a list of the dates that I can append using a for loop. This way works fine, but I was curious if there was a better way before I enter the many many dates I will be scraping over.
url = 'http://www.url.com?'
#this list would hold hundreds of dates
dates = ['month=11&day=1&year=2016', 'month=11&day=2&year=2016', ...]
for i in dates:
page = requests.get(url+i)
soup = BeautifulSoup(page.text, 'html.parser')
#and so on, this part works as intended
python beautifulsoup
I mean, what you have looks already ok. What is the thing that bothers you with your current implementation? There are few improvements that can be done to improve the readability of the code, but it's not that essential. What's bothering you with your current implementation?
– Rafael
Mar 23 at 19:47
Whats bothering me is that I would have to manually enter hundreds of dates into the date list. I was just wondering if their was a more efficient way around that, If not then I would do it. any other tips for improvement would be appreciated!
– Gmuersch
Mar 23 at 19:49
Well, you are only interested in specific dates right? Do you have some constrains on the dates? Because you could automate the creation of the dates but I am not sure if you want to do this.
– Rafael
Mar 23 at 19:50
The dates have to be in the format of the dates list. As long as there isnt an obvious solution to automate this, I dont mind entering them. I just didnt want to do it this way if it would be considered bad practice.
– Gmuersch
Mar 23 at 19:55
Does the pages have some references to where the next page is, like a "next issue" link or something?
– yorodm
Mar 23 at 19:55
|
show 1 more comment
I am trying to scrape the results of sports game off a website. The website has all the results of all the games which is perfect, but they are on many pages. Each page represents one day and I am gathering data over many months of games so it will be quite a lot of urls to enter.
The way I set it up now is that i have a base url, and a list of the dates that I can append using a for loop. This way works fine, but I was curious if there was a better way before I enter the many many dates I will be scraping over.
url = 'http://www.url.com?'
#this list would hold hundreds of dates
dates = ['month=11&day=1&year=2016', 'month=11&day=2&year=2016', ...]
for i in dates:
page = requests.get(url+i)
soup = BeautifulSoup(page.text, 'html.parser')
#and so on, this part works as intended
python beautifulsoup
I am trying to scrape the results of sports game off a website. The website has all the results of all the games which is perfect, but they are on many pages. Each page represents one day and I am gathering data over many months of games so it will be quite a lot of urls to enter.
The way I set it up now is that i have a base url, and a list of the dates that I can append using a for loop. This way works fine, but I was curious if there was a better way before I enter the many many dates I will be scraping over.
url = 'http://www.url.com?'
#this list would hold hundreds of dates
dates = ['month=11&day=1&year=2016', 'month=11&day=2&year=2016', ...]
for i in dates:
page = requests.get(url+i)
soup = BeautifulSoup(page.text, 'html.parser')
#and so on, this part works as intended
python beautifulsoup
python beautifulsoup
edited Mar 23 at 19:50
Gmuersch
asked Mar 23 at 19:40
GmuerschGmuersch
11
11
I mean, what you have looks already ok. What is the thing that bothers you with your current implementation? There are few improvements that can be done to improve the readability of the code, but it's not that essential. What's bothering you with your current implementation?
– Rafael
Mar 23 at 19:47
Whats bothering me is that I would have to manually enter hundreds of dates into the date list. I was just wondering if their was a more efficient way around that, If not then I would do it. any other tips for improvement would be appreciated!
– Gmuersch
Mar 23 at 19:49
Well, you are only interested in specific dates right? Do you have some constrains on the dates? Because you could automate the creation of the dates but I am not sure if you want to do this.
– Rafael
Mar 23 at 19:50
The dates have to be in the format of the dates list. As long as there isnt an obvious solution to automate this, I dont mind entering them. I just didnt want to do it this way if it would be considered bad practice.
– Gmuersch
Mar 23 at 19:55
Does the pages have some references to where the next page is, like a "next issue" link or something?
– yorodm
Mar 23 at 19:55
|
show 1 more comment
I mean, what you have looks already ok. What is the thing that bothers you with your current implementation? There are few improvements that can be done to improve the readability of the code, but it's not that essential. What's bothering you with your current implementation?
– Rafael
Mar 23 at 19:47
Whats bothering me is that I would have to manually enter hundreds of dates into the date list. I was just wondering if their was a more efficient way around that, If not then I would do it. any other tips for improvement would be appreciated!
– Gmuersch
Mar 23 at 19:49
Well, you are only interested in specific dates right? Do you have some constrains on the dates? Because you could automate the creation of the dates but I am not sure if you want to do this.
– Rafael
Mar 23 at 19:50
The dates have to be in the format of the dates list. As long as there isnt an obvious solution to automate this, I dont mind entering them. I just didnt want to do it this way if it would be considered bad practice.
– Gmuersch
Mar 23 at 19:55
Does the pages have some references to where the next page is, like a "next issue" link or something?
– yorodm
Mar 23 at 19:55
I mean, what you have looks already ok. What is the thing that bothers you with your current implementation? There are few improvements that can be done to improve the readability of the code, but it's not that essential. What's bothering you with your current implementation?
– Rafael
Mar 23 at 19:47
I mean, what you have looks already ok. What is the thing that bothers you with your current implementation? There are few improvements that can be done to improve the readability of the code, but it's not that essential. What's bothering you with your current implementation?
– Rafael
Mar 23 at 19:47
Whats bothering me is that I would have to manually enter hundreds of dates into the date list. I was just wondering if their was a more efficient way around that, If not then I would do it. any other tips for improvement would be appreciated!
– Gmuersch
Mar 23 at 19:49
Whats bothering me is that I would have to manually enter hundreds of dates into the date list. I was just wondering if their was a more efficient way around that, If not then I would do it. any other tips for improvement would be appreciated!
– Gmuersch
Mar 23 at 19:49
Well, you are only interested in specific dates right? Do you have some constrains on the dates? Because you could automate the creation of the dates but I am not sure if you want to do this.
– Rafael
Mar 23 at 19:50
Well, you are only interested in specific dates right? Do you have some constrains on the dates? Because you could automate the creation of the dates but I am not sure if you want to do this.
– Rafael
Mar 23 at 19:50
The dates have to be in the format of the dates list. As long as there isnt an obvious solution to automate this, I dont mind entering them. I just didnt want to do it this way if it would be considered bad practice.
– Gmuersch
Mar 23 at 19:55
The dates have to be in the format of the dates list. As long as there isnt an obvious solution to automate this, I dont mind entering them. I just didnt want to do it this way if it would be considered bad practice.
– Gmuersch
Mar 23 at 19:55
Does the pages have some references to where the next page is, like a "next issue" link or something?
– yorodm
Mar 23 at 19:55
Does the pages have some references to where the next page is, like a "next issue" link or something?
– yorodm
Mar 23 at 19:55
|
show 1 more comment
1 Answer
1
active
oldest
votes
If you really wish to search on every single day, then datetime and timedelta can be used to iterate over all the possible days. Give it a starting date, this can then be advanced one day at a time until an end date (which could be datetime.now() for today):
from datetime import datetime, timedelta
base_url = "http://www.url.com?month=&day=&year="
search_date = datetime(2016, 11, 1)
end_date = datetime(2017, 1, 1)
one_day = timedelta(days=1)
while search_date < end_date:
url = base_url.format(search_date.month, search_date.day, search_date.year)
print(url)
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
search_date += one_day
This would give you something like:
http://www.url.com?month=11&day=1&year=2016
http://www.url.com?month=11&day=2&year=2016
http://www.url.com?month=11&day=3&year=2016
http://www.url.com?month=11&day=4&year=2016
.
.
.
http://www.url.com?month=12&day=29&year=2016
http://www.url.com?month=12&day=30&year=2016
http://www.url.com?month=12&day=31&year=2016
A better approach though would be to make use of the next link on your page. For this the URL of the actual page would be needed. BeautifulSoup could then be used to easily extract the link.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55317655%2flooking-for-a-better-solution-to-scrape-multiple-webpages-with-beautifulsoup%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
If you really wish to search on every single day, then datetime and timedelta can be used to iterate over all the possible days. Give it a starting date, this can then be advanced one day at a time until an end date (which could be datetime.now() for today):
from datetime import datetime, timedelta
base_url = "http://www.url.com?month=&day=&year="
search_date = datetime(2016, 11, 1)
end_date = datetime(2017, 1, 1)
one_day = timedelta(days=1)
while search_date < end_date:
url = base_url.format(search_date.month, search_date.day, search_date.year)
print(url)
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
search_date += one_day
This would give you something like:
http://www.url.com?month=11&day=1&year=2016
http://www.url.com?month=11&day=2&year=2016
http://www.url.com?month=11&day=3&year=2016
http://www.url.com?month=11&day=4&year=2016
.
.
.
http://www.url.com?month=12&day=29&year=2016
http://www.url.com?month=12&day=30&year=2016
http://www.url.com?month=12&day=31&year=2016
A better approach though would be to make use of the next link on your page. For this the URL of the actual page would be needed. BeautifulSoup could then be used to easily extract the link.
add a comment |
If you really wish to search on every single day, then datetime and timedelta can be used to iterate over all the possible days. Give it a starting date, this can then be advanced one day at a time until an end date (which could be datetime.now() for today):
from datetime import datetime, timedelta
base_url = "http://www.url.com?month=&day=&year="
search_date = datetime(2016, 11, 1)
end_date = datetime(2017, 1, 1)
one_day = timedelta(days=1)
while search_date < end_date:
url = base_url.format(search_date.month, search_date.day, search_date.year)
print(url)
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
search_date += one_day
This would give you something like:
http://www.url.com?month=11&day=1&year=2016
http://www.url.com?month=11&day=2&year=2016
http://www.url.com?month=11&day=3&year=2016
http://www.url.com?month=11&day=4&year=2016
.
.
.
http://www.url.com?month=12&day=29&year=2016
http://www.url.com?month=12&day=30&year=2016
http://www.url.com?month=12&day=31&year=2016
A better approach though would be to make use of the next link on your page. For this the URL of the actual page would be needed. BeautifulSoup could then be used to easily extract the link.
add a comment |
If you really wish to search on every single day, then datetime and timedelta can be used to iterate over all the possible days. Give it a starting date, this can then be advanced one day at a time until an end date (which could be datetime.now() for today):
from datetime import datetime, timedelta
base_url = "http://www.url.com?month=&day=&year="
search_date = datetime(2016, 11, 1)
end_date = datetime(2017, 1, 1)
one_day = timedelta(days=1)
while search_date < end_date:
url = base_url.format(search_date.month, search_date.day, search_date.year)
print(url)
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
search_date += one_day
This would give you something like:
http://www.url.com?month=11&day=1&year=2016
http://www.url.com?month=11&day=2&year=2016
http://www.url.com?month=11&day=3&year=2016
http://www.url.com?month=11&day=4&year=2016
.
.
.
http://www.url.com?month=12&day=29&year=2016
http://www.url.com?month=12&day=30&year=2016
http://www.url.com?month=12&day=31&year=2016
A better approach though would be to make use of the next link on your page. For this the URL of the actual page would be needed. BeautifulSoup could then be used to easily extract the link.
If you really wish to search on every single day, then datetime and timedelta can be used to iterate over all the possible days. Give it a starting date, this can then be advanced one day at a time until an end date (which could be datetime.now() for today):
from datetime import datetime, timedelta
base_url = "http://www.url.com?month=&day=&year="
search_date = datetime(2016, 11, 1)
end_date = datetime(2017, 1, 1)
one_day = timedelta(days=1)
while search_date < end_date:
url = base_url.format(search_date.month, search_date.day, search_date.year)
print(url)
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
search_date += one_day
This would give you something like:
http://www.url.com?month=11&day=1&year=2016
http://www.url.com?month=11&day=2&year=2016
http://www.url.com?month=11&day=3&year=2016
http://www.url.com?month=11&day=4&year=2016
.
.
.
http://www.url.com?month=12&day=29&year=2016
http://www.url.com?month=12&day=30&year=2016
http://www.url.com?month=12&day=31&year=2016
A better approach though would be to make use of the next link on your page. For this the URL of the actual page would be needed. BeautifulSoup could then be used to easily extract the link.
answered Apr 3 at 10:11
Martin EvansMartin Evans
29.1k133357
29.1k133357
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55317655%2flooking-for-a-better-solution-to-scrape-multiple-webpages-with-beautifulsoup%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
I mean, what you have looks already ok. What is the thing that bothers you with your current implementation? There are few improvements that can be done to improve the readability of the code, but it's not that essential. What's bothering you with your current implementation?
– Rafael
Mar 23 at 19:47
Whats bothering me is that I would have to manually enter hundreds of dates into the date list. I was just wondering if their was a more efficient way around that, If not then I would do it. any other tips for improvement would be appreciated!
– Gmuersch
Mar 23 at 19:49
Well, you are only interested in specific dates right? Do you have some constrains on the dates? Because you could automate the creation of the dates but I am not sure if you want to do this.
– Rafael
Mar 23 at 19:50
The dates have to be in the format of the dates list. As long as there isnt an obvious solution to automate this, I dont mind entering them. I just didnt want to do it this way if it would be considered bad practice.
– Gmuersch
Mar 23 at 19:55
Does the pages have some references to where the next page is, like a "next issue" link or something?
– yorodm
Mar 23 at 19:55