How to scrape a website that has a difficult table to read (pandas & beautiful soup)?
I am trying to scrape data from https://www.seethroughny.net/payrolls/110681345 but the table is difficult to deal with.
I have tried many things, for example:

    import pandas as pd
    import ssl
    import csv

    ssl._create_default_https_context = ssl._create_unverified_context
    calls_df = pd.read_html("https://www.seethroughny.net/payrolls/110681345", header=0)
    print(calls_df)
    calls_df.to_csv("calls.csv", index=False)

I would like to parse this into a CSV file, which I am then index-matching against another dataset.
python web-scraping html-table beautifulsoup
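Worth noting about the attempt above: pd.read_html always returns a list of DataFrames (one per <table> it finds), so calls_df.to_csv would fail even on a page whose table is plain HTML, and on this particular page the table appears to be filled in by JavaScript after load, so read_html on the page URL does not see the data at all (the answer below goes through the site's JSON endpoint instead). A minimal sketch of how read_html is normally used, on a self-contained HTML snippet:

    import pandas as pd

    html = """
    <table>
      <tr><th>Name</th><th>Pay</th></tr>
      <tr><td>Alice</td><td>100</td></tr>
      <tr><td>Bob</td><td>90</td></tr>
    </table>
    """

    tables = pd.read_html(html)           # always a list of DataFrames, one per <table> found
    tables[0].to_csv("calls.csv", index=False)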
I'd recommend opening up the dev console in the browser and first checking whether the payroll data is being retrieved from a JSON endpoint, before attempting to scrape it directly from the site. Failing that, if the site is dynamic, get the source via Selenium, parse the data with BeautifulSoup, and then write it out as CSV :)
– David Silveiro
Mar 24 at 19:21
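A rough sketch of the Selenium fallback described in this comment, assuming Chrome and a matching chromedriver are installed (the sleep is a crude stand-in for a proper WebDriverWait):

    import time

    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()                              # chromedriver must be on PATH
    driver.get("https://www.seethroughny.net/payrolls/110681345")
    time.sleep(5)                                            # give the JavaScript time to render the table
    html = driver.page_source
    driver.quit()

    soup = BeautifulSoup(html, "lxml")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:                                            # skip header/empty rows
            rows.append(cells)

    pd.DataFrame(rows).to_csv("calls.csv", index=False)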
Why is this table difficult to deal with? Simply use BeautifulSoup: first look up the table, then look up the thead inside this table to grab the data inside the <th> cells. Then get the rest from <tbody>.
– Desiigner
Mar 24 at 19:22
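A minimal sketch of that thead/tbody approach; the HTML literal here is just a stand-in for the rendered page source (which for this site would have to come from Selenium or from the JSON endpoint used in the answer below):

    from bs4 import BeautifulSoup

    html = """
    <table>
      <thead><tr><th>Name</th><th>Pay</th></tr></thead>
      <tbody>
        <tr><td>Alice</td><td>100</td></tr>
        <tr><td>Bob</td><td>90</td></tr>
      </tbody>
    </table>
    """

    soup = BeautifulSoup(html, "lxml")
    table = soup.find("table")

    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find("tbody").find_all("tr")]

    print(headers)   # ['Name', 'Pay']
    print(rows)      # [['Alice', '100'], ['Bob', '90']]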
Was your question answered?
– QHarr
Apr 4 at 17:35
1 Answer
There is a JSON response containing the HTML. It seems that something blocks requests at random points in the all-results loop version at the end.
Single-page version, where you change the current_page value to the appropriate page number:
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup as bs

    url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://www.seethroughny.net/payrolls/110681'
    }

    data = {
        'PayYear[]': '2018',
        'BranchName[]': 'Villages',
        'SortBy': 'YTDPay DESC',
        'current_page': '0',
        'result_id': '110687408',
        'url': '/tools/required/reports/payroll?action=get',
        'nav_request': '0'
    }

    r = requests.post(url, headers=headers, data=data).json()
    soup = bs(r['html'], 'lxml')                       # the table rows come back as an HTML fragment inside the JSON

    results = []
    for item in soup.select('tr:nth-child(odd)'):      # data rows are the odd <tr> elements
        row = [subItem.text for subItem in item.select('td')][1:]
        results.append(row)

    df = pd.DataFrame(results)
    df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
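If you would rather stay close to the pd.read_html approach from the question, the HTML fragment in the JSON response can be handed to read_html instead of walking the rows by hand. This assumes the fragment contains a proper <table> element; if it only ships bare <tr> rows, read_html finds nothing and the loop above remains the safer option:

    # continuing from the single-page version above (same url, headers and data)
    try:
        tables = pd.read_html(r['html'])                 # read_html accepts a raw HTML string; returns a list
        tables[0].to_csv('Data.csv', index=False, encoding='utf-8-sig')
    except ValueError:
        # ValueError is raised when no <table> is found in the fragment
        pass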
All-pages version (work in progress, as the request can currently fail to return JSON at varying points in the loop despite the delay). It seems improved with @Sim's suggestion of swapping out user agents:
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup as bs
    import time
    from requests.packages.urllib3.util.retry import Retry
    from requests.adapters import HTTPAdapter
    import random

    ua = [
        'Mozilla/5.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    ]

    url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://www.seethroughny.net/payrolls/110681'
    }

    data = {
        'PayYear[]': '2018',
        'BranchName[]': 'Villages',
        'SortBy': 'YTDPay DESC',
        'current_page': '0',
        'result_id': '110687408',
        'url': '/tools/required/reports/payroll?action=get',
        'nav_request': '0'
    }

    results = []
    i = 0

    with requests.Session() as s:
        retries = Retry(total=5,
                        backoff_factor=0.1,
                        status_forcelist=[500, 502, 503, 504])
        s.mount('http://', HTTPAdapter(max_retries=retries))

        while len(results) < 1000:  # total
            data['current_page'] = i
            data['result_id'] = str(int(data['result_id']) + i)
            try:
                r = s.post(url, headers=headers, data=data).json()
            except Exception as e:
                print(e)
                time.sleep(2)
                headers['User-Agent'] = random.choice(ua)   # swap the user agent and retry once
                r = s.post(url, headers=headers, data=data).json()
                continue
            soup = bs(r['html'], 'lxml')
            for item in soup.select('tr:nth-child(odd)'):
                row = [subItem.text for subItem in item.select('td')][1:]
                results.append(row)
            i += 1
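The all-pages loop above only accumulates rows in results; once it finishes (or is interrupted), they can be written out exactly as in the single-page version:

    # after the while loop, mirroring the single-page version
    df = pd.DataFrame(results)
    df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)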
@Sim's version:
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    from fake_useragent import UserAgent
    import time                      # needed for the time.sleep retry below

    url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://www.seethroughny.net/payrolls/110681'
    }

    data = {
        'PayYear[]': '2018',
        'BranchName[]': 'Villages',
        'SortBy': 'YTDPay DESC',
        'current_page': '0',
        'result_id': '110687408',
        'url': '/tools/required/reports/payroll?action=get',
        'nav_request': '0'
    }

    results = []
    i = 0

    def get_content(i):
        while len(results) < 15908:          # hard-coded total number of results
            print(len(results))
            data['current_page'] = i
            headers['User-Agent'] = ua.random
            try:
                r = requests.post(url, headers=headers, data=data).json()
            except Exception:
                time.sleep(1)
                get_content(i)               # retry by recursing (see the note below)
            soup = BeautifulSoup(r['html'], 'lxml')
            for item in soup.select('tr:nth-child(odd)'):
                row = [subItem.text for subItem in item.select('td')][1:]
                results.append(row)
            i += 1

    if __name__ == '__main__':
        ua = UserAgent()
        get_content(i)
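One caveat with @Sim's version: the retry recurses into get_content from the except block, and when that recursive call returns, execution continues to the soup = ... line with r either unbound (on the first iteration) or still holding the previous page's response. A sketch of an iterative retry that sidesteps this, keeping the rest of the loop as-is (the function name is just for illustration):

    def get_content_iterative(i):
        # same module-level variables (url, headers, data, ua, results) as in @Sim's version above
        while len(results) < 15908:
            data['current_page'] = i
            headers['User-Agent'] = ua.random
            try:
                r = requests.post(url, headers=headers, data=data).json()
            except Exception:
                time.sleep(1)
                continue             # retry the same page instead of recursing
            soup = BeautifulSoup(r['html'], 'lxml')
            for item in soup.select('tr:nth-child(odd)'):
                results.append([subItem.text for subItem in item.select('td')][1:])
            i += 1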
I get the following error: File "payroll.py", line 32, in <module> for item in soup.select('tr:nth-child(odd)'): File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/element.py", line 1528, in select 'Only the following pseudo-classes are implemented: nth-of-type.') NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
– Caitlin Ballingall
Mar 24 at 20:35
You need the latest bs4, i.e. 4.7.1, I'm afraid, as that has Soup Sieve.
– QHarr
Mar 24 at 20:36
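If upgrading bs4 is not an option, the odd rows can also be picked out without the :nth-child selector by slicing the row list; this is only roughly equivalent and assumes all the <tr> elements sit as siblings in the same table:

    # works on older bs4 versions that lack Soup Sieve's :nth-child support
    rows = soup.find_all('tr')
    for item in rows[::2]:                       # every other row, starting with the first
        row = [subItem.text for subItem in item.find_all('td')][1:]
        results.append(row)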
Lovely to see you @SIM. Yes... I am not sure that the id makes a difference, but I have been playing with a script that simply adds i to the id by adding in data['result_id'] = str(int(data['result_id']) + i)
– QHarr
Mar 24 at 20:37
My real sticking point at present is why the r = s.post line is failing at different points... which is usually indicative of blocking etc., but I have seen a 404 pop up despite the request being valid when run as a single request with the same page number.
– QHarr
Mar 24 at 20:38
I also need to extract, rather than hard-code, the end point for the results count, but that's a quick fix.
– QHarr
Mar 24 at 20:41