Scrapy: at 0 items/min

I got a Scrapy example from a website. It runs, but something seems wrong: it does not fetch all the content, and I don't know what happened.
The example uses Scrapy + Redis + MongoDB.

The log output:

2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)
2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:46:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:47:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:48:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:49:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:50:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:51:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:52:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:53:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:54:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:55:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:56:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:57:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:58:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
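
Each of these lines is Scrapy's periodic stats report: a running total of pages and items, plus the rate over the last minute. As a rough aid for reading such a log, here is a minimal sketch (Python 3, standard library only) that extracts the counters and counts how many reports in a row show no progress. The names `parse_logstats` and `trailing_idle_lines` are made up for this illustration and are not part of Scrapy.

```python
import re

# Pattern for the periodic stats line exactly as it appears in the log above.
LOGSTATS = re.compile(
    r"Crawled (?P<pages>\d+) pages \(at (?P<page_rate>\d+) pages/min\), "
    r"scraped (?P<items>\d+) items \(at (?P<item_rate>\d+) items/min\)"
)


def parse_logstats(line):
    """Return the four counters of one stats line as ints, or None if it does not match."""
    match = LOGSTATS.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def trailing_idle_lines(lines):
    """Count the stats lines at the end of the log whose page and item rates are both 0."""
    idle = 0
    for line in lines:
        stats = parse_logstats(line)
        if stats is None:
            continue  # not a stats line, ignore it
        if stats["page_rate"] == 0 and stats["item_rate"] == 0:
            idle += 1
        else:
            idle = 0  # progress was made, restart the count
    return idle


sample = [
    "2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)",
    "2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)",
    "2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)",
]
print(parse_logstats(sample[0]))
print(trailing_idle_lines(sample))
```

Since the reports are one minute apart, the count is also the number of minutes the crawl has sat idle.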

novspider.py

#-*-coding:utf8-*-

from scrapy_redis.spiders import RedisSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem
import re

class novSpider(RedisSpider):
    name = "novspider"
    redis_key = 'nvospider:start_urls'
    start_urls = ['http://www.daomubiji.com/']

    def parse(self,response):
        selector = Selector(response)
        table = selector.xpath('//table')
        for each in table:
            bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
            content = each.xpath('tr/td/a/text()').extract()
            url = each.xpath('tr/td/a/@href').extract()
            for i in range(len(url)):
                item = NovelspiderItem()
                item['bookName'] = bookName
                item['chapterURL'] = url[i]

                try:
                    item['bookTitle'] = content[i].split(' ')[0]
                    item['chapterNum'] = content[i].split(' ')[1]
                except Exception,e:
                    continue

                try:
                    item['chapterName'] = content[i].split(' ')[2]
                except Exception,e:
                    item['chapterName'] = content[i].split(' ')[1][-3:]
                yield Request(url[i], callback='parseContent', meta={'item': item})

    def parseContent(self, response):
        selector = Selector(response)
        item = response.meta['item']
        html = selector.xpath('//div[@class="content"]').extract()[0]
        textField = re.search('<div style="clear:both"></div>(.*?)<div', html,re.S).group(1)
        text = re.findall('<p>(.*?)</p>',textField,re.S)
        fulltext = ''
        for each in text:
            fulltext += each
        item['text'] = fulltext
        yield item
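
The two pieces of string handling in this spider can be tried outside Scrapy. The following is a minimal Python 3 sketch of that logic. It assumes link texts are space-separated, as the spider does, and the helper names and sample strings are invented for the illustration.

```python
import re


def split_chapter_title(link_text):
    """Mirror the try/except logic in parse(): None when fewer than two fields exist."""
    parts = link_text.split(' ')
    if len(parts) < 2:
        return None  # parse() skips such links with `continue`
    book_title, chapter_num = parts[0], parts[1]
    # With no third field, parse() falls back to the last three characters of the second.
    chapter_name = parts[2] if len(parts) > 2 else parts[1][-3:]
    return book_title, chapter_num, chapter_name


def extract_text(content_html):
    """Mirror parseContent(): join the <p> bodies that follow the clear:both div."""
    match = re.search(r'<div style="clear:both"></div>(.*?)<div', content_html, re.S)
    if match is None:
        return None  # the original code would raise AttributeError on .group(1) here
    return ''.join(re.findall(r'<p>(.*?)</p>', match.group(1), re.S))


# Hypothetical inputs, not taken from the real site.
print(split_chapter_title('BookOne Chapter1 Opening'))
print(split_chapter_title('Preface'))
page = ('<div class="content"><div style="clear:both"></div>'
        '<p>First.</p><p>Second.</p><div class="ad"></div></div>')
print(extract_text(page))
```

A page whose markup does not match the regular expression makes the original `parseContent` raise instead of yielding an item, so one missing item per such page would be expected.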

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for novelspider project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'novelspider'

SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'

ITEM_PIPELINES = ['novelspider.pipelines.NovelspiderPipeline']

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'novdata'
MONGODB_DOCNAME = 'nov1'

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from items import NovelspiderItem 
from scrapy.conf import settings
import pymongo

class NovelspiderPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbName = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        tdb = client[dbName]
        self.post = tdb[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        bookInfo = dict(item)
        self.post.insert(bookInfo)
        return item
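
For anyone running this on a current stack, here is a hedged sketch of how the same pipeline might look in Python 3. It assumes a recent PyMongo, where `Collection.insert` no longer exists and `insert_one` is used instead, and it reads the settings through Scrapy's `from_crawler` hook rather than `scrapy.conf`. The class names `MongoPipelineSketch` and `FakeCollection` are made up, and the fake collection exists only so the item flow can be checked without a MongoDB server.

```python
class MongoPipelineSketch:
    """Store each scraped item as one MongoDB document."""

    def __init__(self, collection):
        # `collection` is any object with an insert_one(dict) method,
        # for example a pymongo Collection.
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, when defined, to build the pipeline
        # and hands over the crawler that carries the project settings.
        import pymongo  # imported lazily so the dry run below works without it
        settings = crawler.settings
        client = pymongo.MongoClient(
            host=settings.get('MONGODB_HOST'),
            port=settings.getint('MONGODB_PORT'),
        )
        database = client[settings.get('MONGODB_DBNAME')]
        return cls(database[settings.get('MONGODB_DOCNAME')])

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item


class FakeCollection:
    """Hypothetical stand-in for a MongoDB collection, used for a dry run."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


fake = FakeCollection()
pipeline = MongoPipelineSketch(fake)
returned = pipeline.process_item({'bookName': 'demo', 'text': 'hello'}, spider=None)
print(len(fake.docs))
```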

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Field, Item


class NovelspiderItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    bookName = Field()
    bookTitle = Field()
    chapterNum = Field()
    chapterName = Field()
    chapterURL = Field()
    text = Field()

edited Apr 3 at 22:06 by Thiago Curvelo

asked Oct 8 '15 at 18:11 by zwl1619

Tags: python, scrapy


1 Answer

You never reach the parseContent callback that way. Pass the method itself rather than its name:

yield Request(
    url[i],
    callback=self.parseContent,  # <--
    meta={'item': item})
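
The difference between the two spellings can be seen without Scrapy at all. A minimal Python 3 sketch, using a made-up stand-in class rather than a real spider: a string is only a name, while `self.parseContent` is a bound method that can be called directly. How a particular Scrapy version treats a string callback is not something this sketch demonstrates.

```python
class SpiderStandIn:
    """Hypothetical stand-in for the spider; this is not a Scrapy class."""

    def parseContent(self, response):
        return 'parsed ' + response


spider = SpiderStandIn()

as_string = 'parseContent'       # the form used in the question
as_method = spider.parseContent  # the form used in the answer

print(callable(as_string))  # False: a str cannot be called
print(callable(as_method))  # True: a bound method can

# A name only becomes usable once something looks it up on the spider.
resolved = getattr(spider, as_string)
print(resolved('page') == as_method('page'))  # True
```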

answered Mar 26 at 2:15 by Thiago Curvelo
