Scrapy: at 0 items/min

I took a Scrapy example from a website. It runs, but something seems wrong: it does not fetch all the content, and I don't know why.
The example uses Scrapy + Redis + MongoDB.

The log output:



2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)
2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:46:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:47:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:48:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:49:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:50:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:51:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:52:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:53:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:54:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:55:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:56:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:57:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:58:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)


novspider.py



#-*-coding:utf8-*-

from scrapy_redis.spiders import RedisSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem
import re

class novSpider(RedisSpider):
    name = "novspider"
    redis_key = 'nvospider:start_urls'
    start_urls = ['http://www.daomubiji.com/']

    def parse(self,response):
        selector = Selector(response)
        table = selector.xpath('//table')
        for each in table:
            bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
            content = each.xpath('tr/td/a/text()').extract()
            url = each.xpath('tr/td/a/@href').extract()
            for i in range(len(url)):
                item = NovelspiderItem()
                item['bookName'] = bookName
                item['chapterURL'] = url[i]

                try:
                    item['bookTitle'] = content[i].split(' ')[0]
                    item['chapterNum'] = content[i].split(' ')[1]
                except Exception,e:
                    continue

                try:
                    item['chapterName'] = content[i].split(' ')[2]
                except Exception,e:
                    item['chapterName'] = content[i].split(' ')[1][-3:]
                yield Request(url[i], callback='parseContent', meta={'item':item})

    def parseContent(self, response):
        selector = Selector(response)
        item = response.meta['item']
        html = selector.xpath('//div[@class="content"]').extract()[0]
        textField = re.search('<div style="clear:both"></div>(.*?)<div', html,re.S).group(1)
        text = re.findall('<p>(.*?)</p>',textField,re.S)
        fulltext = ''
        for each in text:
            fulltext += each
        item['text'] = fulltext
        yield item


settings.py



# -*- coding: utf-8 -*-

# Scrapy settings for novelspider project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'novelspider'

SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'

ITEM_PIPELINES = ['novelspider.pipelines.NovelspiderPipeline']

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'novdata'
MONGODB_DOCNAME = 'nov1'


pipelines.py



# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from items import NovelspiderItem
from scrapy.conf import settings
import pymongo

class NovelspiderPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbName = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        tdb = client[dbName]
        self.post = tdb[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        bookInfo = dict(item)
        self.post.insert(bookInfo)
        return item


items.py



# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Field, Item


class NovelspiderItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    bookName = Field()
    bookTitle = Field()
    chapterNum = Field()
    chapterName = Field()
    chapterURL = Field()
    text = Field()









      python scrapy














      edited Apr 3 at 22:06









      Thiago Curvelo











      asked Oct 8 '15 at 18:11









      zwl1619























          1 Answer














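          You never reach the parseContent method that way, because callback is given as the string 'parseContent' rather than the method itself. Request expects a callable, so pass the bound method instead: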



          yield Request(
              url[i],
              callback=self.parseContent,  # <-- the method itself, not the string 'parseContent'
              meta={'item': item})
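To illustrate the difference between passing the method and passing its name, here is a minimal sketch that does not import Scrapy. `MiniRequest` and `DemoSpider` are hypothetical stand-ins written for this example, and the callable check is an assumption modeled on the documented requirement that `callback` be a callable; as far as I know, recent Scrapy versions reject a string callback with a similar `TypeError`, but verify this against the version you run.

```python
class MiniRequest:
    """Hypothetical stand-in for scrapy.http.Request (not Scrapy's real class)."""

    def __init__(self, url, callback=None, meta=None):
        # Assumption: the callback is invoked later with the response,
        # so anything that is not callable is rejected up front.
        if callback is not None and not callable(callback):
            raise TypeError(
                "callback must be a callable, got %s" % type(callback).__name__
            )
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class DemoSpider:
    """Hypothetical spider with a second-stage callback, as in the question."""

    def parseContent(self, response):
        return "parsed %s" % response


spider = DemoSpider()

# Passing the bound method works: the stored callback can be invoked later.
good = MiniRequest(
    "http://example.com/chapter-1",
    callback=spider.parseContent,
    meta={"item": {"bookName": "demo"}},
)
result = good.callback("chapter-1")

# Passing only the method's name as a string fails the callable check.
try:
    MiniRequest("http://example.com/chapter-1", callback="parseContent")
    string_error = None
except TypeError as exc:
    string_error = str(exc)
```

In the spider from the question, `callback=self.parseContent` hands over the method, while `callback='parseContent'` hands over only its name.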















                answered Mar 26 at 2:15









                Thiago Curvelo


















