Can't Identify Proper CSS Selector to Scrape with Mechanize

I have built a web scraper that is successfully pulling almost everything I need out of the web page I'm looking at. The goal is to pull the URL for a particular image associated with all the coffees found at a particular URL.

The rake task I have defined to complete the scraping is as follows:

mechanize = Mechanize.new
mechanize.get(url) do |page|
  page.links_with(:href => /products/).each do |link|
    coffee_page = link.click

    bean = Bean.new

    bean.acidity = coffee_page.css('[data-id="acidity"]').text.strip.gsub("acidity ","")
    bean.elevation = coffee_page.css('[data-id="elevation"]').text.strip.gsub("elevation ","")
    bean.roaster_id = "2"
    bean.harvest_season = coffee_page.css('[data-id="harvest"]').text.strip.gsub("harvest ","")
    bean.price = coffee_page.css('.price-wrap').text.gsub("$","")
    bean.roast_profile = coffee_page.css('[data-id="roast"]').text.strip.gsub("roast ","")
    bean.processing_type = coffee_page.css('[data-id="process"]').text.strip.gsub("process ","")
    bean.cultivar = coffee_page.css('[data-id="cultivar"]').text.strip.gsub("cultivar ","")
    bean.flavor_profiles = coffee_page.css('.price-wrap+ p').text.strip
    bean.country_of_origin = coffee_page.css('#pdp-order h1').text.strip
    bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')

    if bean.country_of_origin == "Origin Set" || bean.country_of_origin == "Gift Card (online use only)"
      bean.destroy
    else
      ap bean
    end
  end
end

All the information I need is on the page. I'm looking for the image URL found in a tag like the one below, for every individual coffee_page linked from the source page. The selector needs to be generic enough to pull this picture source but nothing else. I've tried a number of different CSS selectors, but every one returns either nil or a blank string.

<img src="//cdn.shopify.com/s/files/1/2220/0129/products/ceremony-product-gummy-bears_480x480.jpg?v=1551455589" alt="Burundi Kiryama" data-product-featured-image style="display:none">

The coffee_page I'm on is here: https://shop.ceremonycoffee.com/products/burundi-kiryama

asked Mar 19 at 6:58

Andrew Hyman

CSS does have substring matching, so you could use img[src^='//cdn.shopify.com/s/files/'] (not sure if that is specific enough for your needs; you can scope it to a parent if required). See stackoverflow.com/questions/8903313/… and w3.org/TR/selectors/#attribute-substrings

– max pleaner
Mar 19 at 18:51

Let me know if my answer to your question is sufficient. If so please mark as correct.

– NemyaNation
Mar 28 at 23:00

Please read "How to Ask". When asking about a problem with your code we need the minimum data necessary to demonstrate the problem in the question itself. A link forces us to search through a page's HTML which wastes our time and discourages people from trying to help you. We need you to prepare the question so we can help you. In addition, now that the link is broken your question makes little sense.

– the Tin Man
May 23 at 23:59

ruby-on-rails ruby nokogiri mechanize

1 Answer

You need to change

bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')

to

bean.image_url = coffee_page.css('#mobile-only>img').attr('src')

If you can, always use nearby identifiers to locate the element you want to access.

answered Mar 24 at 16:28

NemyaNation
