How to avoid storing duplicate results
I store scraped content in a CSV file.
Each row contains a unique ID and the description of an item.
The ID comes from the website I scrape; it is not generated on the scraper side.
I use Scrapy's feed exports to generate the CSV file.
When I scrape the website again, I would like my script to check whether each unique ID is already stored in the CSV file: if it is not, add the new row; if it is, just move on to the next item.
Since I assume this is a classic task for a scraping framework, I believe there must be a smart way to do it with Scrapy, but I can't find anything on this topic in Scrapy's documentation.
Should I simply open the CSV file, go through each item, and add a new row if its ID is not present (or skip it if it is)?
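The manual approach described above (read the existing file, then append only unseen IDs) can be sketched with the standard library alone. This is a minimal sketch, not Scrapy-specific; the column names `id` and `description` and the file name `items.csv` are assumptions matching the row layout described above:

```python
import csv
import os

CSV_PATH = "items.csv"  # assumed feed file name


def load_seen_ids(path):
    """Collect the unique IDs already stored by previous runs."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f)}


def append_new_items(path, items):
    """Append only the items whose ID is not already in the file."""
    seen = load_seen_ids(path)
    new_items = [it for it in items if it["id"] not in seen]
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "description"])
        if write_header:
            writer.writeheader()
        writer.writerows(new_items)
    return new_items
```

This works, but it re-reads the whole file once per run; for very large files or distributed crawls, the pipeline and Redis approaches in the comments below scale better.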
scrapy
Possible duplicate of How to prevent duplicates on Scrapy fetching depending on an existing JSON list
– Gallaecio
Mar 26 at 7:11
1
doc.scrapy.org/en/latest/topics/…
– Gallaecio
Mar 26 at 7:13
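The docs page linked above includes a duplicates-filter item pipeline. Below is a hedged sketch adapting that idea to this question: the pipeline preloads IDs from the existing CSV on spider start and drops items whose ID was already stored. The field name `id` and path `items.csv` are assumptions; the `try/except` import fallback only exists so the sketch runs without Scrapy installed:

```python
import csv
import os

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class CsvDuplicatesPipeline:
    """Drop items whose unique ID is already present in the feed CSV."""

    def __init__(self, csv_path="items.csv", id_field="id"):
        self.csv_path = csv_path
        self.id_field = id_field
        self.seen_ids = set()

    def open_spider(self, spider):
        # Preload the IDs written by previous runs, if the feed file exists.
        if os.path.exists(self.csv_path):
            with open(self.csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    self.seen_ids.add(row[self.id_field])

    def process_item(self, item, spider):
        item_id = item[self.id_field]
        if item_id in self.seen_ids:
            raise DropItem(f"duplicate item: {item_id}")
        self.seen_ids.add(item_id)
        return item
```

You would enable it through the `ITEM_PIPELINES` setting. Note that this only helps across runs if the previous CSV is preserved, so check how your feed export writes the file (overwrite vs. append) in your Scrapy version.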
Apart from the example that @Gallaecio linked, if you use a distributed system like Scrapy Cluster, you might want to use something like Redis to hold the set of seen IDs.
– Tomáš Linhart
Mar 27 at 13:52
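The Redis approach from the comment above leans on `SADD`, which returns 1 when the member was actually added and 0 when it was already present, giving an atomic seen-check that multiple crawlers can share. A minimal sketch, assuming a redis-py-style client and a key name `seen_ids` of my own choosing:

```python
class RedisDedup:
    """Shared seen-ID set backed by a Redis SET."""

    def __init__(self, client, key="seen_ids"):
        # client would typically be redis.Redis(host="localhost"),
        # but any object exposing sadd(key, member) works.
        self.client = client
        self.key = key

    def is_new(self, item_id):
        # SADD returns the number of members actually added (0 or 1),
        # so the membership test and insert happen in one atomic step.
        return self.client.sadd(self.key, item_id) == 1
```

Each crawler would call `is_new(item_id)` before emitting a row, so the dedup state survives restarts and is shared across machines.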
asked Mar 25 at 23:18
jean
0 Answers