Scrapy: how to store url_id along with the crawled data


from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        self.browser = webdriver.Chrome(executable_path='E:/chromedriver')
        self.browser.set_page_load_timeout(100)

    def closed(self, spider):
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        start_urls = []
        with open("target_urls.txt", 'r', encoding='utf-8') as f:
            for line in f:
                url_id, url = line.split('\t\t')
                start_urls.append(url)

        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {
            'target_url': response.url,
            'comments': response.xpath('//div[@class="comments"]//em//text()').extract()
        }



Above is my Scrapy code. I run the crawler with scrapy crawl my_spider -o comments.json.



You may note that each of my URLs has a unique url_id associated with it. How can I match each crawled result with its url_id? Ideally, I want to store the url_id in the yielded output in comments.json.
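For reference, each line of target_urls.txt holds a tab-separated ID and URL, along these lines (the IDs and URLs here are just placeholders):

id001		https://example.com/page1
id002		https://example.com/page2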



Thanks a lot!










python python-3.x scrapy scrapy-pipeline






asked Mar 27 at 9:29









Steve Yang
















  • Another question is: in my output file comments.json, the target_url is not the same as the input url in start_urls; is there a way to make target_url store the original url?

    – Steve Yang
    Mar 27 at 9:32

















2 Answers
Try passing it in the meta parameter, for example. I've made some updates to your code:



def start_requests(self):
    with open("target_urls.txt", 'r', encoding='utf-8') as f:
        for line in f:
            url_id, url = line.split('\t\t')
            yield Request(url, self.parse, meta={'url_id': url_id, 'original_url': url})

def parse(self, response):
    yield {
        'target_url': response.meta['original_url'],
        'url_id': response.meta['url_id'],
        'comments': response.xpath('//div[@class="comments"]//em//text()').extract()
    }
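A side note: this goes through response.meta, which works on any Scrapy version. If you are on Scrapy 1.7 or newer, cb_kwargs is an alternative that hands the values to the callback as plain keyword arguments; a minimal sketch of the same idea:

def start_requests(self):
    with open("target_urls.txt", 'r', encoding='utf-8') as f:
        for line in f:
            url_id, url = line.split('\t\t')
            # cb_kwargs entries become keyword arguments of the callback
            yield Request(url, callback=self.parse,
                          cb_kwargs={'url_id': url_id, 'original_url': url})

def parse(self, response, url_id, original_url):
    yield {
        'target_url': original_url,
        'url_id': url_id,
        'comments': response.xpath('//div[@class="comments"]//em//text()').extract()
    }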






answered Mar 27 at 9:37
vezunchik
  • This is exactly what I need, thank you very much!

    – Steve Yang
    Mar 27 at 9:48
































Answering both the question and the comment, try something like this:



from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        self.browser = webdriver.Chrome(executable_path='E:/chromedriver')
        self.browser.set_page_load_timeout(100)

    def closed(self, spider):
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        with open("target_urls.txt", 'r', encoding='utf-8') as f:
            for line in f:
                url_id, url = line.split('\t\t')
                yield Request(url=url, callback=self.parse,
                              meta={'url_id': url_id, 'url': url})

    def parse(self, response):
        yield {
            'target_url': response.meta['url'],
            'comments': response.xpath('//div[@class="comments"]//em//text()').extract(),
            'url_id': response.meta['url_id']
        }


As said in the previous answer, you can pass parameters between methods using meta (http://scrapingauthority.com/scrapy-meta).
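With this change, every record in comments.json should carry the ID as well, roughly like this (values are hypothetical):

{"target_url": "https://example.com/page1", "comments": ["first comment", "second comment"], "url_id": "id001"}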






answered Mar 27 at 9:45
Anakin87
  • Thanks a lot, exactly what I need. However, I will accept vezunchik's answer since it came earlier.

    – Steve Yang
    Mar 27 at 9:50












