Scrapy rules, callback for allowed_domains, and a different callback for denied domains
In Scrapy, how can I use different callback functions for allowed domains and denied domains?
I'm using the following rules:
rules = [
    Rule(LinkExtractor(allow=(), deny_domains=allowed_domains), callback='parse_denied_item', follow=True),
    Rule(LinkExtractor(allow_domains=allowed_domains), callback='parse_item', follow=True),
]
Basically, I want parse_item to be called whenever there is a request from an allowed domain (or a sub-domain of one of those domains), and parse_denied_item to be called for all requests that are not whitelisted by allowed_domains.
How can I do this?
python scrapy
asked Mar 26 at 7:41 – toast
2 Answers
I believe the best approach is not to use allowed_domains on LinkExtractor, and instead to parse the domain out of response.url in your parse_* method and apply different logic depending on the domain.
You can keep separate parse_* methods and a triaging method that, depending on the domain, calls yield from self.parse_*(response) (Python 3) with the corresponding parse_* method:
from urllib.parse import urlparse

rules = [Rule(LinkExtractor(), callback='parse_all', follow=True)]

def parse_all(self, response):
    # Get the domain out of response.url
    domain = urlparse(response.url).netloc
    if domain in self.allowed_domains:
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)

answered Mar 26 at 7:56, edited Mar 26 at 8:12 – Gallaecio
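Note that the exact membership test above only matches domains listed verbatim in allowed_domains. If sub-domains should also count as allowed, as the question asks, a suffix check along these lines could be used instead (a minimal sketch, not part of the original answer; the is_allowed helper name and the sub-domain rule are assumptions):

from urllib.parse import urlparse

def is_allowed(self, url):
    # Treat example.com and any *.example.com as allowed (assumed policy).
    netloc = urlparse(url).netloc
    return any(netloc == d or netloc.endswith('.' + d) for d in self.allowed_domains)

def parse_all(self, response):
    if self.is_allowed(response.url):
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)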
Could the process_request attribute of a Rule be used to capture the request before it is executed? Then filter from this point? docs.scrapy.org/en/latest/topics/… – toast, Mar 26 at 8:03
Indeed, I had not thought of that. Please add it as a response to your own question; I for one will vote that up. – Gallaecio, Mar 26 at 8:11
Based on Gallaecio's answer, an alternative option is to use the process_request attribute of Rule. process_request will capture the request before it is sent.
From my understanding (which could be wrong), Scrapy will only crawl domains listed in self.allowed_domains (assuming it's used). However, if an offsite link is encountered on a scraped page, Scrapy will in some cases send a single request to that offsite link [1]. I'm not sure why this happens. It is possibly occurring because the target site is performing a 301 or 302 redirect and the crawler is automatically following that URL; otherwise, it's probably a bug.
process_request can be used to perform processing on a request before it is executed. In my case, I wanted to log all links that aren't being crawled, so I verify that an allowed domain is in request.url before proceeding, and log any that aren't.
Here is an example:
from urllib.parse import urlparse

rules = [Rule(LinkExtractor(), callback='parse_item', process_request='process_item', follow=True)]

def process_item(self, request):
    found = False
    for url in self.allowed_domains:
        if url in request.url:
            # An allowed domain is in request.url, so proceed.
            found = True
    if not found:
        # Otherwise log the denied domain.
        self.logDeniedDomain(urlparse(request.url).netloc)
        # According to https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule,
        # returning None should prevent this request from being executed
        # (which turns out not to be the case for all requests; the downloader
        # middleware below is used to catch the few that slip through).
        request = None
    return request
[1]: If you're encountering this problem, using process_request in a downloader middleware appears to solve it.
My downloader middleware:
def process_request(self, request, spider):
    # Catch any requests that should be filtered, and ignore them.
    found = False
    for url in spider.allowed_domains:
        if url in request.url:
            # An allowed domain is in request.url, so proceed.
            found = True
    if not found:
        print("[ignored] " + request.url)
        raise IgnoreRequest('Offsite link, ignore')
    return None
Make sure you import IgnoreRequest as well:
from scrapy.exceptions import IgnoreRequest
and enable the downloader middleware in settings.py.
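For example (a sketch; the module path and the priority value 543 are placeholders, not from the original answer), enabling it could look like this:

# settings.py -- hypothetical path; point it at wherever the middleware class actually lives
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.OffsiteFilterDownloaderMiddleware': 543,
}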
To verify this, you can add some verification code in process_item of your crawler to ensure that no requests to out-of-scope sites have been made.
answered Mar 26 at 8:48, edited Mar 27 at 5:03 – toast
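One way to sanity-check this from the response side (an assumption about where the check goes, since the original answer does not spell it out) is to log anything out of scope that still reaches the parse callback:

def parse_item(self, response):
    # If the downloader middleware is doing its job, no out-of-scope
    # response should ever reach this callback.
    if not any(domain in response.url for domain in self.allowed_domains):
        self.logger.warning("Out-of-scope response reached parse_item: %s", response.url)
    # ... normal item extraction continues here ...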