Can't Identify Proper CSS Selector to Scrape with Mechanize
I have built a web scraper that is successfully pulling almost everything I need out of the web page I'm looking at. The goal is to pull the URL for a particular image associated with all the coffees found at a particular URL.
The rake task I have defined to complete the scraping is as follows:
    mechanize = Mechanize.new
    mechanize.get(url) do |page|
      page.links_with(:href => /products/).each do |link|
        coffee_page = link.click
        bean = Bean.new
        bean.acidity = coffee_page.css('[data-id="acidity"]').text.strip.gsub("acidity ","")
        bean.elevation = coffee_page.css('[data-id="elevation"]').text.strip.gsub("elevation ","")
        bean.roaster_id = "2"
        bean.harvest_season = coffee_page.css('[data-id="harvest"]').text.strip.gsub("harvest ","")
        bean.price = coffee_page.css('.price-wrap').text.gsub("$","")
        bean.roast_profile = coffee_page.css('[data-id="roast"]').text.strip.gsub("roast ","")
        bean.processing_type = coffee_page.css('[data-id="process"]').text.strip.gsub("process ","")
        bean.cultivar = coffee_page.css('[data-id="cultivar"]').text.strip.gsub("cultivar ","")
        bean.flavor_profiles = coffee_page.css('.price-wrap+ p').text.strip
        bean.country_of_origin = coffee_page.css('#pdp-order h1').text.strip
        bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')
        if bean.country_of_origin == "Origin Set" || bean.country_of_origin == "Gift Card (online use only)"
          bean.destroy
        else
          ap bean
        end
      end
    end
The information I need is all on the page. I'm looking for the image URL in a tag like the one below, and I need it for every individual coffee_page linked from the source page. The selector needs to be generic enough to match this image's source on each page, but nothing else. I've tried a number of different CSS selectors, but everything returns either nil or blank.
<img src="//cdn.shopify.com/s/files/1/2220/0129/products/ceremony-product-gummy-bears_480x480.jpg?v=1551455589" alt="Burundi Kiryama" data-product-featured-image style="display:none">
The coffee_page I'm on is here: https://shop.ceremonycoffee.com/products/burundi-kiryama
ruby-on-rails ruby nokogiri mechanize
asked Mar 19 at 6:58 by Andrew Hyman
CSS does have substring matching, so you could use img[src^='//cdn.shopify.com/s/files/'] (not sure if that is specific enough for your needs; you can scope to a parent if required). See stackoverflow.com/questions/8903313/… and w3.org/TR/selectors/#attribute-substrings
– max pleaner
Mar 19 at 18:51
Let me know if my answer to your question is sufficient. If so please mark as correct.
– NemyaNation
Mar 28 at 23:00
Please read "How to Ask". When asking about a problem with your code we need the minimum data necessary to demonstrate the problem in the question itself. A link forces us to search through a page's HTML which wastes our time and discourages people from trying to help you. We need you to prepare the question so we can help you. In addition, now that the link is broken your question makes little sense.
– the Tin Man
May 23 at 23:59
1 Answer
You need to change
bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')
to
bean.image_url = coffee_page.css('#mobile-only>img').attr('src')
If you can, always use nearby identifiers to locate the element you want to access.
answered Mar 24 at 16:28 by NemyaNation