Scrapy CrawlSpider parse_item for a 302 redirect response

I am using a Scrapy CrawlSpider to crawl websites and process their page content. I am following the CrawlSpider example from the Scrapy docs.

One of the linked pages takes a parameter target via a GET request (say http://www.example.com?target=x) and redirects (302) if the value is erroneous. On receiving this 302 HTTP response, Scrapy follows the redirect but does not pass the 302 response itself to the parse_item method, which is what I intended.

I came across a few solutions suggesting meta/dont_redirect/http_status_list, but none of them seems to take effect.

How can I parse the response of the 302 redirection without (or before) following the redirect to its target location?

Scrapy version: 0.24.6

asked Feb 10 '16 at 5:38

bawejakunal

First, you should use Scrapy v1+. Setting meta={'dont_redirect': True} on a request should stop RedirectMiddleware, which is enabled by default, from following redirect status codes like 302. If that doesn't help, we need more info.

– Granitosaurus
Feb 10 '16 at 9:21

@Granitosaurus I know version 0.24 is quite old, but I am working on an old code base that can't be migrated to v1.0 immediately, so I will have to make do with it. Also, where exactly should meta={'dont_redirect': True} be put? Just defining it in the class definition does not help.

– bawejakunal
Feb 10 '16 at 10:23

oh no, I've posted a detailed answer how to enable this :)

– Granitosaurus
Feb 10 '16 at 10:43

redirect web-scraping scrapy web-crawler

2 Answers

To disable redirects, add meta={'dont_redirect': True} to the scrapy.Request objects you yield.

Your spider should look something like this:

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://example.com']

        def start_requests(self):
            for url in self.start_urls:
                # dont_redirect tells RedirectMiddleware to leave this request alone
                yield scrapy.Request(url, meta={'dont_redirect': True})

What happens here is that Scrapy has a downloader middleware called RedirectMiddleware, which is enabled by default and handles all redirections. By supplying this meta argument you are telling the middleware not to do its job for this particular request.
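In a CrawlSpider, the requests produced by the rules are built internally, so meta passed in start_requests does not carry over to them. One way to attach the keys to every followed link is the process_request argument of Rule. The sketch below is only an illustration under some assumptions: keep_302 is a hypothetical name, the handle_httpstatus_list key is added because HttpErrorMiddleware would otherwise filter out non-2xx responses before the callback, and the response=None default is there because newer Scrapy releases pass the response as a second argument while older ones pass only the request. Check the behaviour against the Scrapy version you run.

```python
from types import SimpleNamespace


def keep_302(request, response=None):
    """Hypothetical Rule.process_request hook: do not follow redirects for
    this request and let a 302 response reach the rule's callback."""
    request.meta['dont_redirect'] = True
    request.meta['handle_httpstatus_list'] = [302]
    return request


# Hypothetical wiring inside a CrawlSpider (not executed here):
#   rules = (
#       Rule(LinkExtractor(), callback='parse_item',
#            process_request=keep_302, follow=True),
#   )

# Stand-in object with a .meta dict, so the sketch runs without Scrapy installed
fake_request = SimpleNamespace(meta={})
keep_302(fake_request)
print(fake_request.meta)
```

With this in place, parse_item can check response.status and treat a 302 differently from a normal page.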

If you want to disable redirects for every request (which usually is not the best idea), you can add

REDIRECT_ENABLED = False

to settings.py in your Scrapy project.
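One caveat: with redirects switched off, a 302 response still has a non-2xx status, so HttpErrorMiddleware would normally drop it before it reaches a spider callback. A minimal settings sketch, assuming you really want this behaviour project-wide and that 302 is the only status you care about (both are assumptions, not something stated in the question):

```python
# settings.py (sketch)
REDIRECT_ENABLED = False         # RedirectMiddleware stops following 3xx responses
HTTPERROR_ALLOWED_CODES = [302]  # let 302 responses through to spider callbacks
```

A per-spider handle_httpstatus_list attribute is the narrower alternative if only one spider needs to see the 302 responses.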

There is a brilliant illustration on scrapy docs on how all of the scrapy pieces, like middlewares and spiders, work together:
http://doc.scrapy.org/en/latest/topics/architecture.html

answered Feb 10 '16 at 10:42

Granitosaurus

I guess I did not convey my goal clearly. I basically want to log all the links that I come across while scraping a website, so even if it's a 302 response I want the Scrapy crawler to log the original link (the one that gives the 302 response) as well as the target location specified in the response header, and process both of them in parse_item.

– bawejakunal
Feb 11 '16 at 3:07

Scrapy should indeed log everything. 302 links would come out as scraping <website2> redirected from <website1>[302] or something like that.

– Granitosaurus
Feb 11 '16 at 7:42

I decided to upgrade Scrapy to v1.0.5, that seems to be working better. Thanks for your help and advice :)

– bawejakunal
Feb 11 '16 at 7:51
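For the goal described in these comments (recording both the URL that answered with 302 and the target it points to), the target can be read from the Location header once the 302 response reaches the callback. The following is a minimal sketch using only the standard library. redirect_pair is a hypothetical helper name, and it assumes the 302 response is actually delivered to your callback, that is, the redirect is not followed and the status is allowed through.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_pair(url, status, location):
    """Return (original_url, target_url) for a redirect response, else None.

    `location` is the raw Location header value; Scrapy hands headers back
    as bytes on Python 3, so both bytes and str are accepted here.
    """
    if status not in REDIRECT_STATUSES or not location:
        return None
    if isinstance(location, bytes):
        location = location.decode('latin-1')
    # Location may be relative, so resolve it against the requesting URL
    return url, urljoin(url, location)


# Hypothetical use inside parse_item (not executed here):
#   pair = redirect_pair(response.url, response.status,
#                        response.headers.get('Location'))
#   if pair:
#       self.logger.info("302 from %s to %s", *pair)

print(redirect_pair('http://www.example.com/?target=x', 302, b'/error'))
```

When redirects are followed instead, RedirectMiddleware records the URLs a request went through in request.meta under the redirect_urls key, which is another way to recover the original link from the final response.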

    import time

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from selenium import webdriver


    class LagouSpider(CrawlSpider):
        # Let 302 responses reach callbacks instead of being filtered out
        handle_httpstatus_list = [302]
        meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
        name = 'lagou'
        allowed_domains = ['www.lagou.com']
        start_urls = ['https://www.lagou.com']
        login_url = "https://passport.lagou.com/login/login.html"
        # Disable RedirectMiddleware for this spider only
        custom_settings = {'REDIRECT_ENABLED': False}
        rules = (
            Rule(LinkExtractor(allow=(r"zhaopin/.*",)), follow=True),
            Rule(LinkExtractor(allow=(r"gongsi/j\d+.html",)), follow=True),
            Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
        )
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Host': 'www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'X-Anit-Forge-Code': '0',
            'X-Anit-Forge-Token': 'None',
            'Accept-Encoding': 'gzip, deflate, br',
            'X-Requested-With': 'XMLHttpRequest'
        }

        def start_requests(self):
            global rc, im
            browser = webdriver.Chrome(executable_path="/home/wqh/下载/chromedriver")
            browser.get(self.login_url)
            # ··········(some code)

            # cookie_dict is built by the elided login code above
            return [scrapy.Request(self.start_urls[0], cookies=cookie_dict,
                                   meta=self.meta)]

        def parse_job(self, response):
            if response.status == 302:
                print("302")
                time.sleep(100)

edited Mar 24 at 6:47

answered Mar 20 at 8:17

qihuan wu

Some explanation along with this script would be helpful

– TT.
Mar 20 at 8:44
