
How to scrape a website that has a difficult table to read (pandas & beautiful soup)?


I am trying to scrape data from https://www.seethroughny.net/payrolls/110681345, but the table is difficult to deal with.

I have tried many things, for example:



import pandas as pd
import ssl
import csv

ssl._create_default_https_context = ssl._create_unverified_context


calls_df = pd.read_html("https://www.seethroughny.net/payrolls/110681345", header=0)
print(calls_df)

calls_df.to_csv("calls.csv", index=False)


I would like to parse this into a CSV file, and I am index-matching it with another dataset.










python web-scraping html-table beautifulsoup

asked Mar 24 at 19:14 by Caitlin Ballingall (edited Mar 24 at 19:16)
  • I'd recommend opening up the dev console in the browser and seeing first whether the payrolls are being retrieved from a JSON endpoint before attempting to scrape them directly from the site. Failing that, if the site is dynamic, get the source via selenium, parse the data via beautifulsoup and then write it into CSV format :) (A selenium sketch follows these comments.)

    – David Silveiro
    Mar 24 at 19:21











  • Why is this table difficult to deal with? Simply use beautifulsoup: first look up the table, then look up the thead inside it to grab the data inside <th>. Then get the rest from <tbody>. (A sketch of this pattern also follows these comments.)

    – Desiigner
    Mar 24 at 19:22











  • Was your question answered?

    – QHarr
    Apr 4 at 17:35
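A minimal sketch of the selenium fallback David mentions, for the case where the table only exists after JavaScript has run. This is an assumption-heavy illustration rather than the site's confirmed behaviour: it assumes a Chrome driver is installed, that a plain <table> appears in the rendered page, and the output filename payroll_rows.csv is just an example.

import csv
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
try:
    driver.get("https://www.seethroughny.net/payrolls/110681345")
    time.sleep(5)  # crude wait for the table to render; an explicit WebDriverWait would be more robust
    html = driver.page_source  # HTML after JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip header rows, which contain <th> rather than <td>
        rows.append(cells)

with open("payroll_rows.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)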

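And a minimal sketch of the generic thead/tbody pattern Desiigner describes: column names from the <th> cells, data from the <tbody> rows. Note the assumption that the table is present in the raw HTML returned by requests; as the answer below shows, this site actually delivers its rows through a JSON endpoint, so treat this as the general recipe rather than a working scraper for this page. The filename table.csv is illustrative.

import csv

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.seethroughny.net/payrolls/110681345").text
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")  # will be None if the table is injected by JavaScript

# column names from the header row inside <thead>
headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]

# one list of cell texts per <tbody> row
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.tbody.find_all("tr")]

with open("table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(rows)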
















1 Answer

There is a JSON response containing the HTML. Something seems to block requests at random points in the all-results loop version at the end.



Single-page version, where you change the current_page value to the appropriate page number:



import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.seethroughny.net/payrolls/110681'
}

data = {
    'PayYear[]': '2018',
    'BranchName[]': 'Villages',
    'SortBy': 'YTDPay DESC',
    'current_page': '0',
    'result_id': '110687408',
    'url': '/tools/required/reports/payroll?action=get',
    'nav_request': '0'
}

# the endpoint answers with JSON whose 'html' field holds the table markup
r = requests.post(url, headers=headers, data=data).json()
soup = bs(r['html'], 'lxml')

results = []

# every other row holds the data; the first cell of each row is skipped
for item in soup.select('tr:nth-child(odd)'):
    row = [subItem.text for subItem in item.select('td')][1:]
    results.append(row)

df = pd.DataFrame(results)
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)



All-pages version (work in progress, as the request can currently fail to return JSON at varying points in the loop despite the delay). It seems improved with @Sim's suggestion of swapping out user agents:



import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
import time
from requests.packages.urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import random

# pool of user agents to rotate through when a request fails
ua = ['Mozilla/5.0',
      'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
      'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
      'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
      'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
      ]

url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.seethroughny.net/payrolls/110681'
}

data = {
    'PayYear[]': '2018',
    'BranchName[]': 'Villages',
    'SortBy': 'YTDPay DESC',
    'current_page': '0',
    'result_id': '110687408',
    'url': '/tools/required/reports/payroll?action=get',
    'nav_request': '0'
}

results = []
i = 0

with requests.Session() as s:
    # retry transient server errors automatically
    retries = Retry(total=5,
                    backoff_factor=0.1,
                    status_forcelist=[500, 502, 503, 504])

    s.mount('http://', HTTPAdapter(max_retries=retries))

    while len(results) < 1000:  # total:
        data['current_page'] = i
        data['result_id'] = str(int(data['result_id']) + i)

        try:
            r = s.post(url, headers=headers, data=data).json()
        except Exception as e:
            print(e)
            time.sleep(2)
            # swap the user agent and retry the same page before moving on
            headers['User-Agent'] = random.choice(ua)
            r = s.post(url, headers=headers, data=data).json()
            continue
        soup = bs(r['html'], 'lxml')

        for item in soup.select('tr:nth-child(odd)'):
            row = [subItem.text for subItem in item.select('td')][1:]
            results.append(row)
        i += 1



@Sim's version:



import requests
import pandas as pd
import time
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

url = 'https://www.seethroughny.net/tools/required/reports/payroll?action=get'

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.seethroughny.net/payrolls/110681'
}

data = {
    'PayYear[]': '2018',
    'BranchName[]': 'Villages',
    'SortBy': 'YTDPay DESC',
    'current_page': '0',
    'result_id': '110687408',
    'url': '/tools/required/reports/payroll?action=get',
    'nav_request': '0'
}

results = []

i = 0

def get_content(i):
    while len(results) < 15908:
        print(len(results))
        data['current_page'] = i
        headers['User-Agent'] = ua.random  # fresh random user agent for every request
        try:
            r = requests.post(url, headers=headers, data=data).json()
        except Exception:
            time.sleep(1)
            get_content(i)  # wait, then retry from the current page

        soup = BeautifulSoup(r['html'], 'lxml')

        for item in soup.select('tr:nth-child(odd)'):
            row = [subItem.text for subItem in item.select('td')][1:]
            results.append(row)
        i += 1

if __name__ == '__main__':
    ua = UserAgent()
    get_content(i)





answered Mar 24 at 20:23 by QHarr (edited Mar 24 at 21:41)

























  • I get the following error: File "payroll.py", line 32, in <module> for item in soup.select('tr:nth-child(odd)'): File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/element.py", line 1528, in select 'Only the following pseudo-classes are implemented: nth-of-type.') NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.

    – Caitlin Ballingall
    Mar 24 at 20:35












  • You need the latest bs4, i.e. 4.7.1, I'm afraid, as that has Soup Sieve. (A workaround for older versions is sketched after these comments.)

    – QHarr
    Mar 24 at 20:36











  • Lovely to see you @SIM. Yes... I am not sure that the id makes a difference, but I have been playing with a script that simply adds i to the id via data['result_id'] = str(int(data['result_id']) + i).

    – QHarr
    Mar 24 at 20:37











  • My real sticking point at present is why the r = s.post line fails at different points... which is usually indicative of blocking etc., but I have seen a 404 pop up even though the same request is valid when run as a single request with the same page number.

    – QHarr
    Mar 24 at 20:38











  • I also need to extract, rather than hard-code, the end point for the results count, but that's a quick fix.

    – QHarr
    Mar 24 at 20:41
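If upgrading bs4 is awkward, a rough workaround for the nth-child error above is to drop the CSS pseudo-class and slice the row list instead. This is only approximately equivalent to tr:nth-child(odd) (it assumes every <tr> sits under the same parent), so treat it as a sketch:

# every other <tr>, starting with the first -- roughly tr:nth-child(odd)
# when all the rows are siblings under one parent element
for item in soup.find_all('tr')[::2]:
    row = [subItem.text for subItem in item.find_all('td')][1:]
    results.append(row)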











