Scrapy CrawlSpider parse_item for a 302 redirect response
I am using a Scrapy CrawlSpider to crawl websites and process their page content, based on the CrawlSpider example from the Scrapy docs.
One of the crawled pages takes a parameter target via a GET request (say http://www.example.com?target=x) and redirects (302) if the value is erroneous. On receiving this 302 response, Scrapy follows the redirect but does not pass the 302 response itself to my parse_item method, which is what I want.
I came across a few solutions suggesting meta / dont_redirect / handle_httpstatus_list, but none seem to take effect.
How can I parse the 302 response itself, without (or before) following the redirect to its target location?
Scrapy version: 0.24.6
redirect web-scraping scrapy web-crawler
asked Feb 10 '16 at 5:38 by bawejakunal
First, you should use Scrapy v1+. Passing meta={'dont_redirect': True} should stop the RedirectMiddleware, which is enabled by default, from following redirects on status codes like 302. If that doesn't help, we need more info.
– Granitosaurus
Feb 10 '16 at 9:21
@Granitosaurus I know version 0.24 is quite old, but I am working on an old code base that can't be migrated to v1.0 immediately, so I will have to make do with it. Where exactly should meta={'dont_redirect': True} be put? Just defining it in the class definition does not help.
– bawejakunal
Feb 10 '16 at 10:23
Oh no, I've posted a detailed answer on how to enable this :)
– Granitosaurus
Feb 10 '16 at 10:43
2 Answers
To disable redirects, add meta={'dont_redirect': True} to the scrapy.Request objects you yield.
Your spider should look something like this:

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://example.com']

        def start_requests(self):
            for url in self.start_urls:
                # dont_redirect tells RedirectMiddleware to leave this request alone
                yield scrapy.Request(url, meta={'dont_redirect': True})

What happens here is that Scrapy has a downloader middleware called RedirectMiddleware, which is enabled by default and handles all redirections. By supplying this meta argument you are telling the middleware not to do its job for this particular request.
If you want to disable redirects for every request (which usually is not the best idea), you can add

    REDIRECT_ENABLED = False

to settings.py in your Scrapy project.
There is a brilliant illustration in the Scrapy docs of how all the Scrapy pieces, like middlewares and spiders, work together:
http://doc.scrapy.org/en/latest/topics/architecture.html
answered Feb 10 '16 at 10:42 by Granitosaurus
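In a CrawlSpider the requests are built by the rules rather than by your own scrapy.Request calls, which is probably why setting meta on the class has no effect. Rule accepts a process_request callable, so one option is to set the meta keys there. The sketch below is only an illustration: keep_redirect is a hypothetical name, and SimpleNamespace stands in for a real scrapy.Request so the snippet runs without Scrapy installed. It also sets handle_httpstatus_list, on the assumption that the 302 response would otherwise be filtered out as a non-2xx response before it reaches the callback.

```python
from types import SimpleNamespace


def keep_redirect(request, response=None):
    """Mark a request so that its 302 response is handed to the callback.

    `response` is optional because, as far as I know, older Scrapy releases
    call process_request with the request only, while newer ones also pass
    the response the link was extracted from.
    """
    # Ask RedirectMiddleware not to follow the redirect for this request.
    request.meta["dont_redirect"] = True
    # Let the 302 response through to the spider callback.
    request.meta["handle_httpstatus_list"] = [302]
    return request


# Stand-in for a scrapy.Request (assumption: only the .meta dict matters here).
fake_request = SimpleNamespace(meta={})
keep_redirect(fake_request)
print(fake_request.meta)
```

In the spider it would be wired in roughly as Rule(LinkExtractor(), callback='parse_item', process_request=keep_redirect, follow=True); whether this behaves the same on 0.24.6 is something I have not verified.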
I guess I did not convey my goal clearly. I basically want to log all the links that I come across while scraping a website, so even for a 302 response I want the Scrapy crawler to log the original link (which gives the 302 response) as well as the target location specified in the response header, and process both of them in parse_item.
– bawejakunal
Feb 11 '16 at 3:07
Scrapy should indeed log everything. 302 links would come out as "scraping <website2> redirected from <website1> [302]" or something like that.
– Granitosaurus
Feb 11 '16 at 7:42
I decided to upgrade Scrapy to v1.0.5, which seems to be working better. Thanks for your help and advice :)
– bawejakunal
Feb 11 '16 at 7:51
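For the goal described in the comments (recording both the link that returned the 302 and the location it points to), here is a minimal sketch of the bookkeeping part. It assumes the 302 response actually reaches parse_item; redirect_record is a hypothetical helper, not part of Scrapy. It accepts the header value as bytes or str because Scrapy returns header values as bytes on Python 3.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def redirect_record(status, url, location):
    """Describe a redirect as a dict, or return None if it is not one."""
    if status not in REDIRECT_STATUSES or not location:
        return None
    if isinstance(location, bytes):
        # Header values may arrive as bytes; latin-1 decoding never fails.
        location = location.decode("latin-1")
    # The Location header may be relative, so resolve it against the source URL.
    return {"status": status, "source": url, "target": urljoin(url, location)}


record = redirect_record(302, "http://www.example.com/?target=x", b"/error")
print(record)
# {'status': 302, 'source': 'http://www.example.com/?target=x', 'target': 'http://www.example.com/error'}
```

Inside parse_item this could be called as redirect_record(response.status, response.url, response.headers.get('Location')); the callback can then log the record and, if needed, yield a new request for record['target'] to process the destination page as well.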
    import time

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from selenium import webdriver


    class LagouSpider(CrawlSpider):
        name = 'lagou'
        allowed_domains = ['www.lagou.com']
        start_urls = ['https://www.lagou.com']
        login_url = "https://passport.lagou.com/login/login.html"

        # Let 302 responses reach the callbacks instead of being dropped
        handle_httpstatus_list = [302]
        # Meta passed to the first request: do not follow redirects
        meta = {'dont_redirect': True, "handle_httpstatus_list": [302]}
        # Turn off RedirectMiddleware for this spider only
        custom_settings = {'REDIRECT_ENABLED': False}

        rules = (
            Rule(LinkExtractor(allow=("zhaopin/.*",)), follow=True),
            Rule(LinkExtractor(allow=(r"gongsi/j\d+.html",)), follow=True),
            Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
        )

        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Host': 'www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'X-Anit-Forge-Code': '0',
            'X-Anit-Forge-Token': 'None',
            'Accept-Encoding': 'gzip, deflate, br',
            'X-Requested-With': 'XMLHttpRequest',
        }

        def start_requests(self):
            global rc, im
            browser = webdriver.Chrome(executable_path="/home/wqh/下载/chromedriver")
            browser.get(self.login_url)
            # ··········(some code)
            # cookie_dict is built in the omitted login code above
            return [scrapy.Request(self.start_urls[0], cookies=cookie_dict,
                                   meta=self.meta)]

        def parse_job(self, response):
            # The 302 response arrives here because redirects are not followed
            if response.status == 302:
                print("302")
                time.sleep(100)
answered Mar 20 at 8:17 (edited Mar 24 at 6:47) by qihuan wu
Some explanation along with this script would be helpful
– TT.
Mar 20 at 8:44