
How to scrape a website that has a difficult table to read (pandas & beautiful soup)?


I am trying to scrape data from https://www.seethroughny.net/payrolls/110681345, but the table is difficult to deal with.

I have tried many things, for example:



import pandas as pd
import ssl
import csv

ssl._create_default_https_context = ssl._create_unverified_context


calls_df = pd.read_html("https://www.seethroughny.net/payrolls/110681345", header=0)
print(calls_df)

calls_df.to_csv("calls.csv", index=False)


I would like to parse this into a CSV file, and I am index-matching it with another dataset.










python web-scraping html-table beautifulsoup

asked Mar 24 at 19:14 by Caitlin Ballingall (edited Mar 24 at 19:16)
  • I'd recommend opening up the dev console in the browser and seeing first whether the payrolls are being retrieved from a JSON endpoint before attempting to scrape them directly from the site. Failing that, if the site is dynamic, get the source via selenium, parse the data via beautifulsoup and then write it into CSV format :) (A selenium sketch follows these comments.)

    – David Silveiro
    Mar 24 at 19:21











  • Why is this table difficult to deal with? Simply use beautifulsoup: first look up the table, then look up the thead inside it to grab the data inside <th>. Then get the rest from <tbody>. (A sketch of this pattern also follows these comments.)

    – Desiigner
    Mar 24 at 19:22











  • Was your question answered?

    – QHarr
    Apr 4 at 17:35
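A minimal sketch of the selenium fallback David mentions, for the case where the table only exists after JavaScript has run. This is an assumption-heavy illustration rather than the site's confirmed behaviour: it assumes a Chrome driver is installed, that a plain <table> appears in the rendered page, and the output filename payroll_rows.csv is just an example.

import csv
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
try:
    driver.get("https://www.seethroughny.net/payrolls/110681345")
    time.sleep(5)  # crude wait for the table to render; an explicit WebDriverWait would be more robust
    html = driver.page_source  # HTML after JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip header rows, which contain <th> rather than <td>
        rows.append(cells)

with open("payroll_rows.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)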

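And a minimal sketch of the generic thead/tbody pattern Desiigner describes: column names from the <th> cells, data from the <tbody> rows. Note the assumption that the table is present in the raw HTML returned by requests; as the answer below shows, this site actually delivers its rows through a JSON endpoint, so treat this as the general recipe rather than a working scraper for this page. The filename table.csv is illustrative.

import csv

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.seethroughny.net/payrolls/110681345").text
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")  # will be None if the table is injected by JavaScript

# column names from the header row inside <thead>
headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]

# one list of cell texts per <tbody> row
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.tbody.find_all("tr")]

with open("table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(rows)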
















1 Answer

There is a JSON response containing the HTML. Something seems to block requests at random points in the all-results loop version at the end.



Single-page version, where you change the current_page value to the appropriate page number:



import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.seethroughny.net/payrolls/110681'
}

data = {
    'PayYear[]': '2018',
    'BranchName[]': 'Villages',
    'SortBy': 'YTDPay DESC',
    'current_page': '0',
    'result_id': '110687408',
    'url': '/tools/required/reports/payroll?action=get',
    'nav_request': '0'
}

# the endpoint answers with JSON whose 'html' field holds the table markup
r = requests.post(url, headers=headers, data=data).json()
soup = bs(r['html'], 'lxml')

results = []

# every other row holds the data; the first cell of each row is skipped
for item in soup.select('tr:nth-child(odd)'):
    row = [subItem.text for subItem in item.select('td')][1:]
    results.append(row)

df = pd.DataFrame(results)
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)



All-pages version (work in progress, as the request can currently fail to return JSON at varying points in the loop despite the delay). It seems improved with @Sim's suggestion of swapping out user agents:



import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
import time
from requests.packages.urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import random

# pool of user agents to rotate through when a request fails
ua = ['Mozilla/5.0',
      'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
      'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
      'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
      'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
      ]

url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.seethroughny.net/payrolls/110681'
}

data = {
    'PayYear[]': '2018',
    'BranchName[]': 'Villages',
    'SortBy': 'YTDPay DESC',
    'current_page': '0',
    'result_id': '110687408',
    'url': '/tools/required/reports/payroll?action=get',
    'nav_request': '0'
}

results = []
i = 0

with requests.Session() as s:
    # retry transient server errors automatically
    retries = Retry(total=5,
                    backoff_factor=0.1,
                    status_forcelist=[500, 502, 503, 504])

    s.mount('http://', HTTPAdapter(max_retries=retries))

    while len(results) < 1000:  # total:
        data['current_page'] = i
        data['result_id'] = str(int(data['result_id']) + i)

        try:
            r = s.post(url, headers=headers, data=data).json()
        except Exception as e:
            print(e)
            time.sleep(2)
            # swap the user agent and retry the same page before moving on
            headers['User-Agent'] = random.choice(ua)
            r = s.post(url, headers=headers, data=data).json()
            continue
        soup = bs(r['html'], 'lxml')

        for item in soup.select('tr:nth-child(odd)'):
            row = [subItem.text for subItem in item.select('td')][1:]
            results.append(row)
        i += 1



@Sim's version:



import requests
import pandas as pd
import time
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.seethroughny.net/payrolls/110681'
}

data = {
    'PayYear[]': '2018',
    'BranchName[]': 'Villages',
    'SortBy': 'YTDPay DESC',
    'current_page': '0',
    'result_id': '110687408',
    'url': '/tools/required/reports/payroll?action=get',
    'nav_request': '0'
}

results = []

i = 0

def get_content(i):
    while len(results) < 15908:
        print(len(results))
        data['current_page'] = i
        headers['User-Agent'] = ua.random  # fresh random user agent for every request
        try:
            r = requests.post(url, headers=headers, data=data).json()
        except Exception:
            time.sleep(1)
            get_content(i)  # wait, then retry from the current page

        soup = BeautifulSoup(r['html'], 'lxml')

        for item in soup.select('tr:nth-child(odd)'):
            row = [subItem.text for subItem in item.select('td')][1:]
            results.append(row)
        i += 1

if __name__ == '__main__':
    ua = UserAgent()
    get_content(i)





answered Mar 24 at 20:23 by QHarr (edited Mar 24 at 21:41)

























  • I get the following error: File "payroll.py", line 32, in <module> for item in soup.select('tr:nth-child(odd)'): File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/element.py", line 1528, in select 'Only the following pseudo-classes are implemented: nth-of-type.') NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.

    – Caitlin Ballingall
    Mar 24 at 20:35












  • You need the latest bs4, i.e. 4.7.1, I'm afraid, as that has Soup Sieve. (A workaround for older versions is sketched after these comments.)

    – QHarr
    Mar 24 at 20:36











  • Lovely to see you @SIM. Yes... I am not sure that the id makes a difference, but I have been playing with a script that simply adds i to the id via data['result_id'] = str(int(data['result_id']) + i).

    – QHarr
    Mar 24 at 20:37











  • My real sticking point at present is why the r = s.post line fails at different points... which is usually indicative of blocking etc., but I have seen a 404 pop up even though the same request is valid when run as a single request with the same page number.

    – QHarr
    Mar 24 at 20:38











  • I also need to extract, rather than hard-code, the end point for the results count, but that's a quick fix.

    – QHarr
    Mar 24 at 20:41
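If upgrading bs4 is awkward, a rough workaround for the nth-child error above is to drop the CSS pseudo-class and slice the row list instead. This is only approximately equivalent to tr:nth-child(odd) (it assumes every <tr> sits under the same parent), so treat it as a sketch:

# every other <tr>, starting with the first -- roughly tr:nth-child(odd)
# when all the rows are siblings under one parent element
for item in soup.find_all('tr')[::2]:
    row = [subItem.text for subItem in item.find_all('td')][1:]
    results.append(row)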











