

How to avoid storing duplicate results
I store scraped content in a CSV file.
Each row contains a unique ID and the description of an item.

The ID comes from the website I scrape; it is not generated on the scraper side.

I use Scrapy's feed exports (FeedExporter) to generate the CSV file.

When I scrape the website again, I would like my script to check whether each unique ID is already stored in the CSV file: if it is not, add the new row; if it is, just move on to the next item.

Since I assume this is a classic thing to do with a scraping framework, I believe there must be a smart way to do it with Scrapy, but I can't find anything on this topic in Scrapy's documentation.

Should I simply open the CSV file, go through each stored ID, and add a new row if an item's ID is not present, or skip the item if it is?
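A minimal, standalone sketch of the CSV-based check described above (the class name and the `id` field name are illustrative assumptions, not Scrapy API). In a real project the same logic would live in an item pipeline whose `process_item` raises `scrapy.exceptions.DropItem` for already-seen IDs; here it is plain Python so it can run in isolation:

```python
import csv
import os


class CsvDeduplicator:
    """Skip items whose unique ID already appears in the output CSV.

    Loads all IDs from a previous run into a set once, so the check
    per item is O(1) instead of re-reading the file each time.
    """

    def __init__(self, csv_path, id_field="id"):
        self.id_field = id_field
        self.seen = set()
        # Load IDs written by a previous run, if the file exists.
        if os.path.exists(csv_path):
            with open(csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    self.seen.add(row[id_field])

    def is_new(self, item):
        """Return True (and remember the ID) only the first time an ID is seen."""
        item_id = item[self.id_field]
        if item_id in self.seen:
            return False
        self.seen.add(item_id)
        return True
```

The set is populated once at startup, so this avoids re-scanning the CSV for every scraped item, which matters once the file grows.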
  • Possible duplicate of How to prevent duplicates on Scrapy fetching depending on an existing JSON list
    – Gallaecio, Mar 26 at 7:11

  • doc.scrapy.org/en/latest/topics/…
    – Gallaecio, Mar 26 at 7:13

  • Apart from the example that @Gallaecio linked, if you use a distributed system like Scrapy Cluster, you might want to use something like Redis to hold the set of seen IDs.
    – Tomáš Linhart, Mar 27 at 13:52
Tags: scrapy
asked Mar 25 at 23:18









jeanjean

699 bronze badges