Scrapy:at 0 items/minFinding the index of an item given a list containing it in PythonHow to randomly select an item from a list?How can I count the occurrences of a list item?Scrapy spider difference between Crawled pages and Scraped itemsIssue with Scrapy: Crawled 0 pages (at 0 pages/min)Scrapy how to use a proxy poolScrapy spider fails to terminate after finishing web scrapeWhy am I getting empty “Messages: ” logging output when running Scrapy?Scrapy: crawled and scraped 0 itemsScrapy on Linkedin Crawled 0 pages
How are mathematicians paid to do research?
During copyediting, journal disagrees about spelling of paper's main topic
How would vampires avoid contracting diseases?
C program to parse source code of another language
Do I have a right to cancel a purchase of foreign currency in the UK?
As the Dungeon Master, how do I handle a player that insists on a specific class when I already know that choice will cause issues?
Terry Pratchett book with a lawyer dragon and sheep
Does throwing a penny at a train stop the train?
How to convert a file with several spaces into a tab-delimited file?
Do you know your 'KVZ's?
Are randomly-generated passwords starting with "a" less secure?
Using Newton's shell theorem to accelerate a spaceship
Why weren't bootable game disks ever common on the IBM PC?
Is the genetic term "polycistronic" still used in modern biology?
Find The One Element In An Array That is Different From The Others
How did the hit man miss?
How can I effectively communicate to recruiters that a phone call is not possible?
Why are they 'nude photos'?
Why didn't Nick Fury expose the villain's identity and plans?
Can the Mage Hand cantrip be used to trip an enemy who is running away?
Is a request to book a business flight ticket for a graduate student an unreasonable one?
How to tell someone I'd like to become friends without letting them think I'm romantically interested in them?
How would my creatures handle groups without a strong concept of numbers?
What's the point of having a RAID 1 configuration over incremental backups to a secondary drive?
Scrapy:at 0 items/min
Finding the index of an item given a list containing it in PythonHow to randomly select an item from a list?How can I count the occurrences of a list item?Scrapy spider difference between Crawled pages and Scraped itemsIssue with Scrapy: Crawled 0 pages (at 0 pages/min)Scrapy how to use a proxy poolScrapy spider fails to terminate after finishing web scrapeWhy am I getting empty “Messages: ” logging output when running Scrapy?Scrapy: crawled and scraped 0 itemsScrapy on Linkedin Crawled 0 pages
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;
I get a Scrapy example from a website,it works but seems something wrong:it can not get all the content,and I don't know what happened.
The example uses Scrapy+Redis+MongoDB.
the info:
2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)
2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:46:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:47:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:48:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:49:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:50:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:51:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:52:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:53:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:54:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:55:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:56:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:57:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:58:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
novspider.py
#-*-coding:utf8-*-
from scrapy_redis.spiders import RedisSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem
import re
class novSpider(RedisSpider):
name = "novspider"
redis_key = 'nvospider:start_urls'
start_urls = ['http://www.daomubiji.com/']
def parse(self,response):
selector = Selector(response)
table = selector.xpath('//table')
for each in table:
bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
content = each.xpath('tr/td/a/text()').extract()
url = each.xpath('tr/td/a/@href').extract()
for i in range(len(url)):
item = NovelspiderItem()
item['bookName'] = bookName
item['chapterURL'] = url[i]
try:
item['bookTitle'] = content[i].split(' ')[0]
item['chapterNum'] = content[i].split(' ')[1]
except Exception,e:
continue
try:
item['chapterName'] = content[i].split(' ')[2]
except Exception,e:
item['chapterName'] = content[i].split(' ')[1][-3:]
yield Request(url[i], callback='parseContent', meta='item':item)
def parseContent(self, response):
selector = Selector(response)
item = response.meta['item']
html = selector.xpath('//div[@class="content"]').extract()[0]
textField = re.search('<div style="clear:both"></div>(.*?)<div', html,re.S).group(1)
text = re.findall('<p>(.*?)</p>',textField,re.S)
fulltext = ''
for each in text:
fulltext += each
item['text'] = fulltext
yield item
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for novelspider project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'novelspider'
SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'
ITEM_PIPELINES = ['novelspider.pipelines.NovelspiderPipeline']
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'novdata'
MONGODB_DOCNAME = 'nov1'
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from items import NovelspiderItem
from scrapy.conf import settings
import pymongo
class NovelspiderPipeline(object):
def __init__(self):
host = settings['MONGODB_HOST']
port = settings['MONGODB_PORT']
dbName = settings['MONGODB_DBNAME']
client = pymongo.MongoClient(host=host, port=port)
tdb = client[dbName]
self.post = tdb[settings['MONGODB_DOCNAME']]
def process_item(self, item, spider):
bookInfo = dict(item)
self.post.insert(bookInfo)
return item
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Field, Item
class NovelspiderItem(Item):
# define the fields for your item here like:
# name = scrapy.Field()
bookName = Field()
bookTitle = Field()
chapterNum = Field()
chapterName = Field()
chapterURL = Field()
text = Field()
python scrapy
add a comment |
I get a Scrapy example from a website,it works but seems something wrong:it can not get all the content,and I don't know what happened.
The example uses Scrapy+Redis+MongoDB.
the info:
2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)
2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:46:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:47:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:48:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:49:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:50:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:51:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:52:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:53:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:54:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:55:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:56:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:57:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:58:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
novspider.py
#-*-coding:utf8-*-
from scrapy_redis.spiders import RedisSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem
import re
class novSpider(RedisSpider):
name = "novspider"
redis_key = 'nvospider:start_urls'
start_urls = ['http://www.daomubiji.com/']
def parse(self,response):
selector = Selector(response)
table = selector.xpath('//table')
for each in table:
bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
content = each.xpath('tr/td/a/text()').extract()
url = each.xpath('tr/td/a/@href').extract()
for i in range(len(url)):
item = NovelspiderItem()
item['bookName'] = bookName
item['chapterURL'] = url[i]
try:
item['bookTitle'] = content[i].split(' ')[0]
item['chapterNum'] = content[i].split(' ')[1]
except Exception,e:
continue
try:
item['chapterName'] = content[i].split(' ')[2]
except Exception,e:
item['chapterName'] = content[i].split(' ')[1][-3:]
yield Request(url[i], callback='parseContent', meta='item':item)
def parseContent(self, response):
selector = Selector(response)
item = response.meta['item']
html = selector.xpath('//div[@class="content"]').extract()[0]
textField = re.search('<div style="clear:both"></div>(.*?)<div', html,re.S).group(1)
text = re.findall('<p>(.*?)</p>',textField,re.S)
fulltext = ''
for each in text:
fulltext += each
item['text'] = fulltext
yield item
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for novelspider project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'novelspider'
SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'
ITEM_PIPELINES = ['novelspider.pipelines.NovelspiderPipeline']
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'novdata'
MONGODB_DOCNAME = 'nov1'
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from items import NovelspiderItem
from scrapy.conf import settings
import pymongo
class NovelspiderPipeline(object):
def __init__(self):
host = settings['MONGODB_HOST']
port = settings['MONGODB_PORT']
dbName = settings['MONGODB_DBNAME']
client = pymongo.MongoClient(host=host, port=port)
tdb = client[dbName]
self.post = tdb[settings['MONGODB_DOCNAME']]
def process_item(self, item, spider):
bookInfo = dict(item)
self.post.insert(bookInfo)
return item
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Field, Item
class NovelspiderItem(Item):
# define the fields for your item here like:
# name = scrapy.Field()
bookName = Field()
bookTitle = Field()
chapterNum = Field()
chapterName = Field()
chapterURL = Field()
text = Field()
python scrapy
add a comment |
I get a Scrapy example from a website,it works but seems something wrong:it can not get all the content,and I don't know what happened.
The example uses Scrapy+Redis+MongoDB.
the info:
2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)
2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:46:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:47:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:48:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:49:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:50:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:51:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:52:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:53:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:54:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:55:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:56:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:57:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:58:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
novspider.py
#-*-coding:utf8-*-
from scrapy_redis.spiders import RedisSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem
import re
class novSpider(RedisSpider):
name = "novspider"
redis_key = 'nvospider:start_urls'
start_urls = ['http://www.daomubiji.com/']
def parse(self,response):
selector = Selector(response)
table = selector.xpath('//table')
for each in table:
bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
content = each.xpath('tr/td/a/text()').extract()
url = each.xpath('tr/td/a/@href').extract()
for i in range(len(url)):
item = NovelspiderItem()
item['bookName'] = bookName
item['chapterURL'] = url[i]
try:
item['bookTitle'] = content[i].split(' ')[0]
item['chapterNum'] = content[i].split(' ')[1]
except Exception,e:
continue
try:
item['chapterName'] = content[i].split(' ')[2]
except Exception,e:
item['chapterName'] = content[i].split(' ')[1][-3:]
yield Request(url[i], callback='parseContent', meta='item':item)
def parseContent(self, response):
selector = Selector(response)
item = response.meta['item']
html = selector.xpath('//div[@class="content"]').extract()[0]
textField = re.search('<div style="clear:both"></div>(.*?)<div', html,re.S).group(1)
text = re.findall('<p>(.*?)</p>',textField,re.S)
fulltext = ''
for each in text:
fulltext += each
item['text'] = fulltext
yield item
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for novelspider project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'novelspider'
SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'
ITEM_PIPELINES = ['novelspider.pipelines.NovelspiderPipeline']
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'novdata'
MONGODB_DOCNAME = 'nov1'
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from items import NovelspiderItem
from scrapy.conf import settings
import pymongo
class NovelspiderPipeline(object):
def __init__(self):
host = settings['MONGODB_HOST']
port = settings['MONGODB_PORT']
dbName = settings['MONGODB_DBNAME']
client = pymongo.MongoClient(host=host, port=port)
tdb = client[dbName]
self.post = tdb[settings['MONGODB_DOCNAME']]
def process_item(self, item, spider):
bookInfo = dict(item)
self.post.insert(bookInfo)
return item
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Field, Item
class NovelspiderItem(Item):
# define the fields for your item here like:
# name = scrapy.Field()
bookName = Field()
bookTitle = Field()
chapterNum = Field()
chapterName = Field()
chapterURL = Field()
text = Field()
python scrapy
I get a Scrapy example from a website,it works but seems something wrong:it can not get all the content,and I don't know what happened.
The example uses Scrapy+Redis+MongoDB.
the info:
2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)
2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:46:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:47:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:48:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:49:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:50:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:51:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:52:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:53:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:54:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:55:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:56:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:57:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:58:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
novspider.py
#-*-coding:utf8-*-
from scrapy_redis.spiders import RedisSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem
import re
class novSpider(RedisSpider):
name = "novspider"
redis_key = 'nvospider:start_urls'
start_urls = ['http://www.daomubiji.com/']
def parse(self,response):
selector = Selector(response)
table = selector.xpath('//table')
for each in table:
bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
content = each.xpath('tr/td/a/text()').extract()
url = each.xpath('tr/td/a/@href').extract()
for i in range(len(url)):
item = NovelspiderItem()
item['bookName'] = bookName
item['chapterURL'] = url[i]
try:
item['bookTitle'] = content[i].split(' ')[0]
item['chapterNum'] = content[i].split(' ')[1]
except Exception,e:
continue
try:
item['chapterName'] = content[i].split(' ')[2]
except Exception,e:
item['chapterName'] = content[i].split(' ')[1][-3:]
yield Request(url[i], callback='parseContent', meta='item':item)
def parseContent(self, response):
selector = Selector(response)
item = response.meta['item']
html = selector.xpath('//div[@class="content"]').extract()[0]
textField = re.search('<div style="clear:both"></div>(.*?)<div', html,re.S).group(1)
text = re.findall('<p>(.*?)</p>',textField,re.S)
fulltext = ''
for each in text:
fulltext += each
item['text'] = fulltext
yield item
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for novelspider project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'novelspider'
SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'
ITEM_PIPELINES = ['novelspider.pipelines.NovelspiderPipeline']
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'novdata'
MONGODB_DOCNAME = 'nov1'
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from items import NovelspiderItem
from scrapy.conf import settings
import pymongo
class NovelspiderPipeline(object):
def __init__(self):
host = settings['MONGODB_HOST']
port = settings['MONGODB_PORT']
dbName = settings['MONGODB_DBNAME']
client = pymongo.MongoClient(host=host, port=port)
tdb = client[dbName]
self.post = tdb[settings['MONGODB_DOCNAME']]
def process_item(self, item, spider):
bookInfo = dict(item)
self.post.insert(bookInfo)
return item
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Field, Item
class NovelspiderItem(Item):
# define the fields for your item here like:
# name = scrapy.Field()
bookName = Field()
bookTitle = Field()
chapterNum = Field()
chapterName = Field()
chapterURL = Field()
text = Field()
python scrapy
python scrapy
edited Apr 3 at 22:06
Thiago Curvelo
2,9311 gold badge16 silver badges31 bronze badges
2,9311 gold badge16 silver badges31 bronze badges
asked Oct 8 '15 at 18:11
zwl1619zwl1619
1,1844 gold badges24 silver badges51 bronze badges
1,1844 gold badges24 silver badges51 bronze badges
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
You never reach the parse method that way. Use this instead:
yield Request(
url[i],
callback=self.parseContent, # <--
meta='item':item)
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f33023060%2fscrapyat-0-items-min%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
You never reach the parse method that way. Use this instead:
yield Request(
url[i],
callback=self.parseContent, # <--
meta='item':item)
add a comment |
You never reach the parse method that way. Use this instead:
yield Request(
url[i],
callback=self.parseContent, # <--
meta='item':item)
add a comment |
You never reach the parse method that way. Use this instead:
yield Request(
url[i],
callback=self.parseContent, # <--
meta='item':item)
You never reach the parse method that way. Use this instead:
yield Request(
url[i],
callback=self.parseContent, # <--
meta='item':item)
answered Mar 26 at 2:15
Thiago CurveloThiago Curvelo
2,9311 gold badge16 silver badges31 bronze badges
2,9311 gold badge16 silver badges31 bronze badges
add a comment |
add a comment |
Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.
Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f33023060%2fscrapyat-0-items-min%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown