Scrapy: how to store url_id along with the crawled data
from scrapy import Spider, Request
from selenium import webdriver


class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        self.browser = webdriver.Chrome(executable_path='E:/chromedriver')
        self.browser.set_page_load_timeout(100)

    def closed(self, spider):
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        start_urls = []
        with open("target_urls.txt", 'r', encoding='utf-8') as f:
            for line in f:
                url_id, url = line.split('\t\t')
                start_urls.append(url)
        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {
            'target_url': response.url,
            'comments': response.xpath('//div[@class="comments"]//em//text()').extract()
        }
Above is my Scrapy code. I use scrapy crawl my_spider -o comments.json to run the crawler.

You may note that each of my urls has a unique url_id associated with it. How can I match each crawled result with its url_id? Ideally, I want to store the url_id in the yielded output in comments.json.

Thanks a lot!
python python-3.x scrapy scrapy-pipeline
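For context, the line.split('\t\t') above implies that each line of target_urls.txt holds an id and a url separated by tab characters. A hypothetical example of the assumed layout, with <TAB> standing in for a literal tab:

id001<TAB><TAB>https://example.com/page-1
id002<TAB><TAB>https://example.com/page-2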
asked Mar 27 at 9:29 by Steve Yang (162 reputation; 7 bronze badges)
Another question is, in my output file comments.json, the target_url is not the same as the input url in start_urls. Is there a way to let target_url store the original url? – Steve Yang, Mar 27 at 9:32
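The mismatch is most likely caused by redirects: response.url holds the final URL after Scrapy's redirect middleware has followed any redirects, so it can differ from the URL the request started with. Both answers below deal with this by carrying the original url through the request's meta.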
2 Answers
Try to pass it in the meta parameter, for example. I've done some updates to your code:

def start_requests(self):
    with open("target_urls.txt", 'r', encoding='utf-8') as f:
        for line in f:
            url_id, url = line.split('\t\t')
            yield Request(url, self.parse, meta={'url_id': url_id, 'original_url': url})

def parse(self, response):
    yield {
        'target_url': response.meta['original_url'],
        'url_id': response.meta['url_id'],
        'comments': response.xpath('//div[@class="comments"]//em//text()').extract()
    }
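For illustration (all values hypothetical), an input line of id001<TAB><TAB>https://example.com/page-1 would then produce an item in comments.json roughly like:

{"target_url": "https://example.com/page-1", "url_id": "id001", "comments": ["first comment", "second comment"]}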
answered Mar 27 at 9:37 by vezunchik (3,255 reputation; 3 gold, 12 silver, 25 bronze badges)

This is exactly what I need, thank you very much! – Steve Yang, Mar 27 at 9:48
Answering both the question and the comment, try something like this:

from scrapy import Spider, Request
from selenium import webdriver


class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        self.browser = webdriver.Chrome(executable_path='E:/chromedriver')
        self.browser.set_page_load_timeout(100)

    def closed(self, spider):
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        with open("target_urls.txt", 'r', encoding='utf-8') as f:
            for line in f:
                url_id, url = line.split('\t\t')
                yield Request(url=url, callback=self.parse, meta={'url_id': url_id, 'url': url})

    def parse(self, response):
        yield {
            'target_url': response.meta['url'],
            'comments': response.xpath('//div[@class="comments"]//em//text()').extract(),
            'url_id': response.meta['url_id']
        }

As said in the previous answer, you can pass parameters between the various methods using meta (http://scrapingauthority.com/scrapy-meta).
answered Mar 27 at 9:45 by Anakin87 (301 reputation; 5 bronze badges)

Thanks a lot, exactly what I need. However, I would accept vezunchik's answer for the earlier response. – Steve Yang, Mar 27 at 9:50
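A side note beyond the two answers: on Scrapy 1.7 and newer, cb_kwargs is the recommended way to hand values to a callback; they arrive as named arguments rather than through response.meta. A minimal sketch, assuming the same tab-separated input file as above:

def start_requests(self):
    with open("target_urls.txt", 'r', encoding='utf-8') as f:
        for line in f:
            url_id, url = line.split('\t\t')
            # cb_kwargs entries are passed to parse() as keyword arguments
            yield Request(url=url, callback=self.parse,
                          cb_kwargs={'url_id': url_id, 'original_url': url})

def parse(self, response, url_id, original_url):
    yield {
        'target_url': original_url,
        'url_id': url_id,
        'comments': response.xpath('//div[@class="comments"]//em//text()').extract(),
    }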