Scrapy rules, callback for allowed_domains, and a different callback for denied domains
In Scrapy, how can I use different callback functions for allowed domains and denied domains?
I'm using the following rules:
rules = [
    Rule(LinkExtractor(allow=(), deny_domains=allowed_domains), callback='parse_denied_item', follow=True),
    Rule(LinkExtractor(allow_domains=allowed_domains), callback='parse_item', follow=True),
]
Basically, I want parse_item to be called whenever there is a request from an allowed domain (or a sub-domain of one of those domains), and parse_denied_item to be called for all requests that are not whitelisted by allowed_domains.
How can I do this?
python scrapy
asked Mar 26 at 7:41 – toast
2 Answers
I believe the best approach is not to use allowed_domains on LinkExtractor, and instead to parse the domain out of response.url in your parse_* method and apply different logic depending on the domain.
You can keep separate parse_* methods and a triaging method that, depending on the domain, calls yield from self.parse_*(response) (Python 3) with the corresponding parse_* method:
from urllib.parse import urlparse

rules = [Rule(LinkExtractor(), callback='parse_all', follow=True)]

def parse_all(self, response):
    # Get the domain out of response.url
    domain = urlparse(response.url).netloc
    if domain in self.allowed_domains:
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)

answered Mar 26 at 7:56, edited Mar 26 at 8:12 – Gallaecio
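Note that the exact membership test above only matches domains listed verbatim in allowed_domains. If sub-domains should also count as allowed, as the question asks, a suffix check along these lines could be used instead (a minimal sketch, not part of the original answer; the is_allowed helper name and the sub-domain rule are assumptions):

from urllib.parse import urlparse

def is_allowed(self, url):
    # Treat example.com and any *.example.com as allowed (assumed policy).
    netloc = urlparse(url).netloc
    return any(netloc == d or netloc.endswith('.' + d) for d in self.allowed_domains)

def parse_all(self, response):
    if self.is_allowed(response.url):
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)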
Could the process_request attribute of a Rule be used to capture the request before it is executed? Then filter from this point? docs.scrapy.org/en/latest/topics/… – toast, Mar 26 at 8:03
Indeed, I had not thought of that. Please add it as a response to your own question; I for one will vote that up. – Gallaecio, Mar 26 at 8:11
Based on Gallaecio's answer, an alternative option is to use the process_request attribute of Rule. process_request will capture the request before it is sent.
From my understanding (which could be wrong), Scrapy will only crawl domains listed in self.allowed_domains (assuming it's used). However, if an offsite link is encountered on a scraped page, Scrapy will in some cases send a single request to that offsite link [1]. I'm not sure why this happens. It is possibly occurring because the target site is performing a 301 or 302 redirect and the crawler is automatically following that URL; otherwise, it's probably a bug.
process_request can be used to perform processing on a request before it is executed. In my case, I wanted to log all links that aren't being crawled, so I verify that an allowed domain is in request.url before proceeding, and log any that aren't.
Here is an example:
from urllib.parse import urlparse

rules = [Rule(LinkExtractor(), callback='parse_item', process_request='process_item', follow=True)]

def process_item(self, request):
    found = False
    for url in self.allowed_domains:
        if url in request.url:
            # An allowed domain is in request.url, so proceed.
            found = True
    if not found:
        # Otherwise log the denied domain.
        self.logDeniedDomain(urlparse(request.url).netloc)
        # According to https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule,
        # returning None should prevent this request from being executed
        # (which turns out not to be the case for all requests; the downloader
        # middleware below is used to catch the few that slip through).
        request = None
    return request
[1]: If you're encountering this problem, using process_request in a downloader middleware appears to solve it.
My downloader middleware:
def process_request(self, request, spider):
    # Catch any requests that should be filtered, and ignore them.
    found = False
    for url in spider.allowed_domains:
        if url in request.url:
            # An allowed domain is in request.url, so proceed.
            found = True
    if not found:
        print("[ignored] " + request.url)
        raise IgnoreRequest('Offsite link, ignore')
    return None
Make sure you import IgnoreRequest as well:
from scrapy.exceptions import IgnoreRequest
and enable the downloader middleware in settings.py.
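For example (a sketch; the module path and the priority value 543 are placeholders, not from the original answer), enabling it could look like this:

# settings.py -- hypothetical path; point it at wherever the middleware class actually lives
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.OffsiteFilterDownloaderMiddleware': 543,
}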
To verify this, you can add some verification code in process_item of your crawler to ensure that no requests to out-of-scope sites have been made.
answered Mar 26 at 8:48, edited Mar 27 at 5:03 – toast
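One way to sanity-check this from the response side (an assumption about where the check goes, since the original answer does not spell it out) is to log anything out of scope that still reaches the parse callback:

def parse_item(self, response):
    # If the downloader middleware is doing its job, no out-of-scope
    # response should ever reach this callback.
    if not any(domain in response.url for domain in self.allowed_domains):
        self.logger.warning("Out-of-scope response reached parse_item: %s", response.url)
    # ... normal item extraction continues here ...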