Getting full html back from a website request using Python
I'm trying to send an HTTP request to a website (for example, Digikey) and read back the full HTML. I'm using this link pattern: https://www.digikey.com/products/en?keywords=part_number with a part number such as 511-8002-KIT: https://www.digikey.com/products/en?keywords=511-8002-KIT. However, what I get back is not the full HTML.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.digikey.com/products/en?keywords=511-8002-KIT')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())
Output:
<!DOCTYPE html>
<html>
<head>
<script>
var i10cdone =(function() function pingBeacon(msg) var i10cimg = document.createElement('script'); i10cimg.src='/i10c@p1/botox/file/nv-loaded.js?status='+window.encodeURIComponent(msg); i10cimg.onload = function() (document.head ; i10cimg.onerror = function() (document.head ; ( document.head ; pingBeacon('loaded'); if(String(document.cookie).indexOf('i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo')>=0) document.cookie = 'i10c.bdddb=;path=/';; var error=''; function errorHandler(e) if (e && e.error && e.error.stack ) error=e.error.stack; else if( e && e.message ) error = e.message; else error = 'unknown'; if(window.addEventListener) window.addEventListener('error',errorHandler, false); else if ( window.attachEvent ) window.attachEvent('onerror',errorHandler); return function() if (window.removeEventListener) window.removeEventListener('error',errorHandler); else if (window.detachEvent) window.detachEvent('onerror',errorHandler); if(error) pingBeacon('error-' + String(error).substring(0,500)); document.cookie='i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo;path=/'; ; )();
</script>
<script src="/i10c@p1/client/latest/auto/instart.js?i10c.nv.bucket=pci&i10c.nv.host=www.digikey.com&i10c.opts=botox&bcb=1" type="text/javascript">
</script>
<script type="text/javascript">
INSTART.Init("apiDomain":"assets.insnw.net","correlation_id":"1553546232:4907a9bdc85fe4e8","custName":"digikey","devJsExtraFlags":""disableQuerySelectorInterception" :true, 'rumDataConfigKey':'/instartlogic/clientdatacollector/getconfig/monitorprod.json','custName':'digikey','propName':'northamerica'","disableInjectionXhr":true,"disableInjectionXhrQueryParam":"instart_disable_injection","iframeCommunicationTimeout":3000,"nanovisorGlobalNameSpace":"I10C","partialImage":false,"propName":"northamerica","rId":"0","release":"latest","rum":false,"serveNanovisorSameDomain":true,"third_party":["IA://www.digikey.com/js/geotargeting.js"],"useIframeRpc":false,"useWrapper":false,"ver":"auto","virtualDomains":4,"virtualizeDomains":["^auth\.digikey\.com$","^authtest\.digikey\.com$","^blocked\.digikey\.com$","^dynatrace\.digikey\.com$","^search\.digikey\.com$","^www\.digikey\.ca$","^www\.digikey\.com$","^www\.digikey\.com\.mx$"]
);
</script>
<script>
typeof i10cdone === 'function' && i10cdone();
</script>
</head>
<body>
<script>
setTimeout(function()document.cookie="i10c.eac23=1";window.location.reload(true);,30);
</script>
</body>
</html>
The reason I need the full HTML is to search it for specific keywords, such as whether the terms "Lead free" or "Through hole" appear in the result for a particular part number. I'm not only doing this for Digikey, but also for other sites.
Any help would be appreciated!
Thanks!
EDIT:
Thank you all for your suggestions/answers. More info for others who are interested in this: Web-scraping JavaScript page with Python
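Once the fully rendered HTML is in hand, the keyword check itself needs no third-party library. A minimal sketch using only the standard library; the sample HTML and the helper names (`TextExtractor`, `find_keywords`) are made up for illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, skipping script/style."""
    def __init__(self):
        super().__init__()
        self._skip = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def find_keywords(html, keywords):
    # Extract the text content, then do a case-insensitive substring scan
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    return {kw: kw.lower() in text for kw in keywords}

# Made-up fragment standing in for a rendered product page
page = "<html><body><td>Lead Free Status</td><td>Lead free / RoHS Compliant</td></body></html>"
print(find_keywords(page, ["Lead free", "Through hole"]))
# {'Lead free': True, 'Through hole': False}
```

The same scan works on whatever source a browser driver hands back, so it can be shared across Digikey and the other sites.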
python-3.x beautifulsoup python-requests
edited Mar 26 at 5:29 by Shanteshwar Inde
asked Mar 25 at 20:45 by ItM
This is because the website is rendered with JavaScript, which means you'll need a browser to retrieve the rendered markup. Check out selenium.
– C.Nivs
Mar 25 at 20:47
2 Answers
Most likely the parts of the page you are looking for include content that is generated dynamically with JavaScript.
Open view-source:https://www.digikey.com/products/en?keywords=part_number
in your browser and you will see that requests is fetching the full HTML the server sends; it's just not executing the JavaScript code.
If you right-click the page and choose Inspect (Chrome), you will see the final DOM that is created after the JavaScript code has executed.
To get the rendered content, you need a full web driver like Selenium that is capable of executing the JavaScript and rendering the full page.
Here is an example of how to achieve that using Selenium, adapted (and updated for Python 3) from:
How can I parse a website using Selenium and Beautifulsoup in python?
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://news.ycombinator.com')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('title'):
    print(tag.text)
Output:
Hacker News
answered Mar 25 at 20:48, edited Mar 25 at 20:55 by gtalarico
The issue is that the page's JavaScript does not get a chance to run, and therefore never populates the necessary HTML elements. One solution is to drive a browser with Selenium:
from selenium import webdriver
chrome = webdriver.Chrome()
chrome.get("https://www.digikey.com/products/en?keywords=511-8002-KIT")
source = chrome.page_source
Often this is a lot less efficient, since you have to wait for the page to load fully. One way around that is to look for an API the website provides to access the data you want directly; I would recommend doing some research into what those might be.
Here are some of the potential APIs you can use to get the data directly
https://api-portal.digikey.com/product
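If an API route works out, the response is typically JSON, which is far easier to scan than rendered HTML. A minimal sketch with a made-up response shape; the real Digi-Key API schema will differ, so the field names here are illustration only:

```python
import json

# Hypothetical API response for one part; real field names will differ
sample = """
{
  "Product": {
    "RoHSStatus": "Lead free / RoHS Compliant",
    "MountingType": "Through Hole"
  }
}
"""

data = json.loads(sample)
# Flatten the attribute values into one lowercase string and scan it
attributes = " ".join(str(v) for v in data["Product"].values()).lower()
matches = {kw: kw.lower() in attributes for kw in ("Lead free", "Through hole")}
print(matches)
# {'Lead free': True, 'Through hole': True}
```

Compared with scraping, this also avoids the JavaScript-rendering problem entirely, since the data never passes through HTML at all.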
Seems like the APIs have limits on how many searches you can do per day and Selenium is painfully slow for searching thousands of parts. Thanks though!
– ItM
Mar 26 at 0:18
Selenium isn’t necessarily the part that makes it “slow”; it’s the page running the script. Selenium takes however long the page takes to render. If you need it fast, as stated above, you need to get the data directly (i.e. from the API), or you just have to wait for the page to render.
– chitown88
Mar 26 at 7:08
answered Mar 25 at 20:56, edited Mar 25 at 21:05 by NightShade