Scrapy: how to store url_id along with the crawled data


from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        self.browser = webdriver.Chrome(executable_path='E:/chromedriver')
        self.browser.set_page_load_timeout(100)

    def closed(self, spider):
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        start_urls = []
        with open("target_urls.txt", 'r', encoding='utf-8') as f:
            for line in f:
                url_id, url = line.split('\t\t')
                start_urls.append(url)

        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {
            'target_url': response.url,
            'comments': response.xpath('//div[@class="comments"]//em//text()').extract()
        }



Above is my Scrapy code. I run the crawler with scrapy crawl my_spider -o comments.json.



You may note that each of my URLs has a unique url_id associated with it. How can I match each crawled result with its url_id? Ideally, I want to store the url_id in the yielded output in comments.json.
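For reference, each line of target_urls.txt holds a tab-separated ID and URL, along these lines (the IDs and URLs here are just placeholders):

id001		https://example.com/page1
id002		https://example.com/page2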



Thanks a lot!










python python-3.x scrapy scrapy-pipeline






asked Mar 27 at 9:29









Steve Yang
















  • Another question is: in my output file comments.json, the target_url is not the same as the input url in start_urls; is there a way to make target_url store the original url?

    – Steve Yang
    Mar 27 at 9:32

















2 Answers
Try passing it in the meta parameter, for example. I've made some updates to your code:



def start_requests(self):
    with open("target_urls.txt", 'r', encoding='utf-8') as f:
        for line in f:
            url_id, url = line.split('\t\t')
            yield Request(url, self.parse, meta={'url_id': url_id, 'original_url': url})

def parse(self, response):
    yield {
        'target_url': response.meta['original_url'],
        'url_id': response.meta['url_id'],
        'comments': response.xpath('//div[@class="comments"]//em//text()').extract()
    }
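A side note: this goes through response.meta, which works on any Scrapy version. If you are on Scrapy 1.7 or newer, cb_kwargs is an alternative that hands the values to the callback as plain keyword arguments; a minimal sketch of the same idea:

def start_requests(self):
    with open("target_urls.txt", 'r', encoding='utf-8') as f:
        for line in f:
            url_id, url = line.split('\t\t')
            # cb_kwargs entries become keyword arguments of the callback
            yield Request(url, callback=self.parse,
                          cb_kwargs={'url_id': url_id, 'original_url': url})

def parse(self, response, url_id, original_url):
    yield {
        'target_url': original_url,
        'url_id': url_id,
        'comments': response.xpath('//div[@class="comments"]//em//text()').extract()
    }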






answered Mar 27 at 9:37
vezunchik
  • This is exactly what I need, thank you very much!

    – Steve Yang
    Mar 27 at 9:48
































Answering both the question and the comment, try something like this:



from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        self.browser = webdriver.Chrome(executable_path='E:/chromedriver')
        self.browser.set_page_load_timeout(100)

    def closed(self, spider):
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        with open("target_urls.txt", 'r', encoding='utf-8') as f:
            for line in f:
                url_id, url = line.split('\t\t')
                yield Request(url=url, callback=self.parse,
                              meta={'url_id': url_id, 'url': url})

    def parse(self, response):
        yield {
            'target_url': response.meta['url'],
            'comments': response.xpath('//div[@class="comments"]//em//text()').extract(),
            'url_id': response.meta['url_id']
        }


As said in the previous answer, you can pass parameters between methods using meta (http://scrapingauthority.com/scrapy-meta).
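With this change, every record in comments.json should carry the ID as well, roughly like this (values are hypothetical):

{"target_url": "https://example.com/page1", "comments": ["first comment", "second comment"], "url_id": "id001"}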






answered Mar 27 at 9:45
Anakin87
  • Thanks a lot, exactly what I need. However, I will accept vezunchik's answer since it came earlier.

    – Steve Yang
    Mar 27 at 9:50












