Can't Identify Proper CSS Selector to Scrape with Mechanize


I have built a web scraper that is successfully pulling almost everything I need out of the web page I'm looking at. The goal is to pull the URL for a particular image associated with all the coffees found at a particular URL.



The rake task I have defined to complete the scraping is as follows:



    mechanize = Mechanize.new
    mechanize.get(url) do |page|
      page.links_with(:href => /products/).each do |link|
        coffee_page = link.click

        bean = Bean.new

        bean.acidity = coffee_page.css('[data-id="acidity"]').text.strip.gsub("acidity ","")
        bean.elevation = coffee_page.css('[data-id="elevation"]').text.strip.gsub("elevation ","")
        bean.roaster_id = "2"
        bean.harvest_season = coffee_page.css('[data-id="harvest"]').text.strip.gsub("harvest ","")
        bean.price = coffee_page.css('.price-wrap').text.gsub("$","")
        bean.roast_profile = coffee_page.css('[data-id="roast"]').text.strip.gsub("roast ","")
        bean.processing_type = coffee_page.css('[data-id="process"]').text.strip.gsub("process ","")
        bean.cultivar = coffee_page.css('[data-id="cultivar"]').text.strip.gsub("cultivar ","")
        bean.flavor_profiles = coffee_page.css('.price-wrap+ p').text.strip
        bean.country_of_origin = coffee_page.css('#pdp-order h1').text.strip
        bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')

        if bean.country_of_origin == "Origin Set" || bean.country_of_origin == "Gift Card (online use only)"
          bean.destroy
        else
          ap bean
        end
      end
    end


All of the information I need is on the page. I'm looking for the image URL, which appears in the markup as shown below, and I need it for every individual coffee_page linked from the source page. The selector needs to be generic enough to pull this image's source and nothing else. I've tried a number of different CSS selectors, but each one returns either nil or a blank string.



<img src="//cdn.shopify.com/s/files/1/2220/0129/products/ceremony-product-gummy-bears_480x480.jpg?v=1551455589" alt="Burundi Kiryama" data-product-featured-image style="display:none">


The coffee_page I'm on is here: https://shop.ceremonycoffee.com/products/burundi-kiryama










  • CSS does have substring matching, so you could use img[src^='//cdn.shopify.com/s/files/']. I'm not sure whether that is specific enough for your needs; you can scope it to a parent if required. See stackoverflow.com/questions/8903313/… and w3.org/TR/selectors/#attribute-substrings

    – max pleaner
    Mar 19 at 18:51











  • Let me know if my answer to your question is sufficient. If so please mark as correct.

    – NemyaNation
    Mar 28 at 23:00











  • Please read "How to Ask". When asking about a problem with your code we need the minimum data necessary to demonstrate the problem in the question itself. A link forces us to search through a page's HTML which wastes our time and discourages people from trying to help you. We need you to prepare the question so we can help you. In addition, now that the link is broken your question makes little sense.

    – the Tin Man
    May 23 at 23:59


















ruby-on-rails ruby nokogiri mechanize






asked Mar 19 at 6:58
Andrew Hyman

























1 Answer














You need to change



bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')


to



bean.image_url = coffee_page.css('#mobile-only>img').attr('src')


If you can, always use nearby identifiers to locate the element you want to access.






answered Mar 24 at 16:28
NemyaNation




























