Scrapy CrawlSpider parse_item for a 302 redirect response

I am using a Scrapy CrawlSpider to crawl websites and process their page content. My spider is based on the CrawlSpider example from the Scrapy docs.

One of the crawled pages takes a parameter, target, via a GET request (say http://www.example.com?target=x) and responds with a 302 redirect if the value is invalid. On receiving this 302 response, Scrapy follows the redirect but never passes the 302 response itself to my parse_item method, which is what I want it to do.

I came across a few solutions suggesting meta/dont_redirect/http_status_list, but none of them seems to take effect.

How can I parse the 302 response itself, without (or before) following the redirect to its target location?

Scrapy version: 0.24.6

redirect web-scraping scrapy web-crawler

asked Feb 10 '16 at 5:38
bawejakunal

  • First, you should use Scrapy v1+. Passing meta={'dont_redirect': True} should stop the RedirectMiddleware, which is enabled by default, from following redirects on status codes like 302. If that doesn't help, we need more info.

    – Granitosaurus
    Feb 10 '16 at 9:21

  • @Granitosaurus I know version 0.24 is quite old, but I am working on an old code base that can't be migrated to v1.0 immediately, so I will have to make do with it. Also, where exactly should meta={'dont_redirect': True} be put? Just defining it in the class definition does not help.

    – bawejakunal
    Feb 10 '16 at 10:23

  • Oh no, I've posted a detailed answer on how to enable this :)

    – Granitosaurus
    Feb 10 '16 at 10:43

2 Answers

To disable redirects, add meta={'dont_redirect': True} to the scrapy.Request objects you yield.

Your spider should then look something like this:

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://example.com']

        def start_requests(self):
            for url in self.start_urls:
                # dont_redirect makes RedirectMiddleware skip this request
                yield scrapy.Request(url, meta={'dont_redirect': True})

Scrapy has a downloader middleware called RedirectMiddleware, which is enabled by default and handles all redirections. By supplying this meta argument, you tell the middleware not to do its job for this particular request.
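Note that an unfollowed 302 only reaches a spider callback if HttpErrorMiddleware lets it through; by default it drops non-2xx responses, so 302 also has to be listed in handle_httpstatus_list (as a spider attribute or a meta key). Below is a minimal sketch of what a callback could then do with such a response: report the original URL together with the redirect target from the Location header. The names KEEP_REDIRECT_META and redirect_pair are hypothetical, the headers are assumed to be dict-like with bytes or str values, and the import is the Python 3 one (on Python 2, urljoin lives in the urlparse module).

```python
from urllib.parse import urljoin

# Meta to attach to a request so the 302 response itself reaches the callback.
# Both keys are standard Scrapy meta keys; the constant name is hypothetical.
KEEP_REDIRECT_META = {"dont_redirect": True, "handle_httpstatus_list": [302]}


def redirect_pair(url, status, headers):
    """Return (original_url, target_url) for a 302 response, else None.

    `headers` is assumed to be dict-like; values may be bytes or str.
    """
    if status != 302:
        return None
    location = headers.get("Location")
    if location is None:
        return None
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # Location may be relative, so resolve it against the requested URL.
    return url, urljoin(url, location)


# Inside a callback this would be called as (hypothetical usage):
#     pair = redirect_pair(response.url, response.status, response.headers)
print(redirect_pair("http://www.example.com/?target=x", 302, {"Location": b"/error"}))
```

Keeping the helper free of Scrapy imports is only a convenience for trying it out; in a real spider the same few lines could sit directly in parse_item, and the target URL could be yielded as a new request if it should be crawled as well.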



If you want to disable redirects for every request (which usually is not the best idea), you can add

    REDIRECT_ENABLED = False

to the settings.py of your Scrapy project.
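When redirects are switched off project-wide, the 302 responses still have to get past HttpErrorMiddleware before any callback sees them. A sketch of the relevant settings.py lines, assuming the goal is to receive 302 responses in every callback; REDIRECT_ENABLED and HTTPERROR_ALLOWED_CODES are documented Scrapy settings, and the value [302] is only an example:

```python
# settings.py (fragment)

# Turn off RedirectMiddleware for every request in the project.
REDIRECT_ENABLED = False

# Let responses with these non-2xx status codes through to spider callbacks.
HTTPERROR_ALLOWED_CODES = [302]
```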



The Scrapy docs have a helpful illustration of how all the Scrapy pieces, such as middlewares and spiders, work together:
http://doc.scrapy.org/en/latest/topics/architecture.html

answered Feb 10 '16 at 10:42
Granitosaurus

  • I guess I did not convey my goal correctly. I basically want to log all the links I come across while scraping a website. So even for a 302 response, I want the crawler to log the original link (the one that returns the 302) as well as the target location given in the response header, and process both of them in parse_item.

    – bawejakunal
    Feb 11 '16 at 3:07

  • Scrapy should indeed log everything. 302 links would show up in the log as scraping <website2> redirected from <website1>[302] or something like that.

    – Granitosaurus
    Feb 11 '16 at 7:42

  • I decided to upgrade Scrapy to v1.0.5, which seems to work better. Thanks for your help and advice :)

    – bawejakunal
    Feb 11 '16 at 7:51

    import time

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from selenium import webdriver


    class LagouSpider(CrawlSpider):
        # Let 302 responses through HttpErrorMiddleware to the callbacks
        handle_httpstatus_list = [302]
        meta = {'dont_redirect': True, "handle_httpstatus_list": [302]}
        name = 'lagou'
        allowed_domains = ['www.lagou.com']
        start_urls = ['https://www.lagou.com']
        login_url = "https://passport.lagou.com/login/login.html"
        # Disable RedirectMiddleware for every request of this spider
        custom_settings = {'REDIRECT_ENABLED': False}
        rules = (
            Rule(LinkExtractor(allow=(r"zhaopin/.*",)), follow=True),
            Rule(LinkExtractor(allow=(r"gongsi/j\d+.html",)), follow=True),
            Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
        )
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Host': 'www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'X-Anit-Forge-Code': '0',
            'X-Anit-Forge-Token': 'None',
            'Accept-Encoding': 'gzip, deflate, br',
            'X-Requested-With': 'XMLHttpRequest'
        }

        def start_requests(self):
            global rc, im
            browser = webdriver.Chrome(executable_path="/home/wqh/下载/chromedriver")
            browser.get(self.login_url)
            # ··········(some code; cookie_dict is presumably built here)

            return [scrapy.Request(self.start_urls[0], cookies=cookie_dict,
                                   meta=self.meta)]

        def parse_job(self, response):
            # With redirects disabled, a 302 response arrives here unfollowed
            if response.status == 302:
                print("302")
                time.sleep(100)  # note: this blocks the whole crawler







edited Mar 24 at 6:47
answered Mar 20 at 8:17
qihuan wu

  • Some explanation along with this script would be helpful.

    – TT.
    Mar 20 at 8:44