How to avoid storing duplicate results
I store scraped content in a CSV file.
Each row contains a unique ID and the description of an item.
The ID comes from the website I scrape; it is not generated on the scraper side.
I use Scrapy's feed exports to generate the CSV file.
When I scrape the website again, I would like my script to check whether each unique ID is already stored in the CSV file: if it is not, add the new row; if it is, just move on to the next item.
Since I assume this is a classic task for a scraping framework, I believe there must be a smart way to do it with Scrapy, but I can't find anything on this topic in Scrapy's documentation.
Should I simply open the CSV file, go through each item, and add a new row if its ID is not present (or skip it if it is)?
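The manual approach described above (read the existing file, then append only unseen IDs) can be sketched with the standard library alone. This is a minimal sketch, not Scrapy-specific; the column names `id` and `description` and the file name `items.csv` are assumptions matching the row layout described above:

```python
import csv
import os

CSV_PATH = "items.csv"  # assumed feed file name


def load_seen_ids(path):
    """Collect the unique IDs already stored by previous runs."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f)}


def append_new_items(path, items):
    """Append only the items whose ID is not already in the file."""
    seen = load_seen_ids(path)
    new_items = [it for it in items if it["id"] not in seen]
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "description"])
        if write_header:
            writer.writeheader()
        writer.writerows(new_items)
    return new_items
```

This works, but it re-reads the whole file once per run; for very large files or distributed crawls, the pipeline and Redis approaches in the comments below scale better.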
scrapy
Possible duplicate of How to prevent duplicates on Scrapy fetching depending on an existing JSON list
– Gallaecio
Mar 26 at 7:11
1
doc.scrapy.org/en/latest/topics/…
– Gallaecio
Mar 26 at 7:13
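The docs page linked above includes a duplicates-filter item pipeline. Below is a hedged sketch adapting that idea to this question: the pipeline preloads IDs from the existing CSV on spider start and drops items whose ID was already stored. The field name `id` and path `items.csv` are assumptions; the `try/except` import fallback only exists so the sketch runs without Scrapy installed:

```python
import csv
import os

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class CsvDuplicatesPipeline:
    """Drop items whose unique ID is already present in the feed CSV."""

    def __init__(self, csv_path="items.csv", id_field="id"):
        self.csv_path = csv_path
        self.id_field = id_field
        self.seen_ids = set()

    def open_spider(self, spider):
        # Preload the IDs written by previous runs, if the feed file exists.
        if os.path.exists(self.csv_path):
            with open(self.csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    self.seen_ids.add(row[self.id_field])

    def process_item(self, item, spider):
        item_id = item[self.id_field]
        if item_id in self.seen_ids:
            raise DropItem(f"duplicate item: {item_id}")
        self.seen_ids.add(item_id)
        return item
```

You would enable it through the `ITEM_PIPELINES` setting. Note that this only helps across runs if the previous CSV is preserved, so check how your feed export writes the file (overwrite vs. append) in your Scrapy version.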
Apart from the example that @Gallaecio linked, if you use a distributed system like Scrapy Cluster, you might want to use something like Redis to hold the set of seen IDs.
– Tomáš Linhart
Mar 27 at 13:52
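The Redis approach from the comment above leans on `SADD`, which returns 1 when the member was actually added and 0 when it was already present, giving an atomic seen-check that multiple crawlers can share. A minimal sketch, assuming a redis-py-style client and a key name `seen_ids` of my own choosing:

```python
class RedisDedup:
    """Shared seen-ID set backed by a Redis SET."""

    def __init__(self, client, key="seen_ids"):
        # client would typically be redis.Redis(host="localhost"),
        # but any object exposing sadd(key, member) works.
        self.client = client
        self.key = key

    def is_new(self, item_id):
        # SADD returns the number of members actually added (0 or 1),
        # so the membership test and insert happen in one atomic step.
        return self.client.sadd(self.key, item_id) == 1
```

Each crawler would call `is_new(item_id)` before emitting a row, so the dedup state survives restarts and is shared across machines.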
asked Mar 25 at 23:18
jean
0 Answers