[Newbie help] scrapy.Request never invokes its callback in a Scrapy project

    15874103329 · 2019-01-09 17:38:52 +08:00 · 2462 views
    import scrapy

    from Demo.items import DemoItem


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quores.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            quotes = response.css('.quote')
            for quote in quotes:
                item = DemoItem()
                text = quote.css('.text::text').extract_first()
                author = quote.css('.author::text').extract_first()
                tags = quote.css('.tags .tag::text').extract()
                item['text'] = text
                item['author'] = author
                item['tags'] = tags
                yield item

            next = response.css('.pager .next a::attr("href")').extract_first()
            url = response.urljoin(next)
            if next:
                yield scrapy.Request(url=url, callback=self.parse)
    10 replies · 2019-01-10 13:00:39 +08:00
    #1 · 15874103329 (OP) · 2019-01-09 17:39:36 +08:00
    I wrote this following the tutorial, but my spider finishes after crawling only one page. Could someone please take a look?
    #2 · 15874103329 (OP) · 2019-01-09 21:38:06 +08:00
    Any help, please?
    #3 · Leigg · 2019-01-10 00:00:46 +08:00 via iPhone
    Print out the value of next.
    #4 · carry110 · 2019-01-10 04:48:30 +08:00 via iPhone
    On the line that assigns next, try it without extract_first().
    #5 · carry110 · 2019-01-10 10:55:22 +08:00
    Remove the if next: check and it works. I tested it myself!
    #6 · 15874103329 (OP) · 2019-01-10 10:58:01 +08:00
    @Leigg Printing next gives '/page/2/'
    and url is 'http://quotes.toscrape.com/page/2/'
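The two values reported here are what ordinary URL joining would produce. A minimal sketch, assuming response.urljoin behaves like the standard library's urllib.parse.urljoin applied to the URL of the page being parsed; the base URL is taken from start_urls in the original post:

```python
from urllib.parse import urljoin

base = 'http://quotes.toscrape.com/'  # page the response came from (from start_urls)
next_href = '/page/2/'                # value printed for next

# A root-relative href replaces the path of the base URL.
url = urljoin(base, next_href)
print(url)  # http://quotes.toscrape.com/page/2/
```

So the link extraction and the join look correct, and the problem is more likely to be in what happens to the request after it is yielded.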
    #7 · 15874103329 (OP) · 2019-01-10 11:00:16 +08:00
    @carry110
    It still only prints one page for me. I'm not sure what's going on.
    #8 · Leigg · 2019-01-10 12:09:33 +08:00 via iPhone
    Post the log that Scrapy prints when it finishes.
    #9 · 15874103329 (OP) · 2019-01-10 12:49:54 +08:00
    @Leigg
    2019-01-10 11:35:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'http': <GET http://http//quotes.toscrape.com/page/2>
    2019-01-10 11:35:18 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-01-10 11:35:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 446,
    'downloader/request_count': 2,
    'downloader/request_method_count/GET': 2,
    'downloader/response_bytes': 2701,
    'downloader/response_count': 2,
    'downloader/response_status_count/200': 1,
    'downloader/response_status_count/404': 1,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2019, 1, 10, 3, 35, 18, 314550),
    'item_scraped_count': 10,
    'log_count/DEBUG': 14,
    'log_count/INFO': 7,
    'offsite/domains': 1,
    'offsite/filtered': 9,
    'request_depth_max': 1,
    'response_received_count': 2,
    'scheduler/dequeued': 1,
    'scheduler/dequeued/memory': 1,
    'scheduler/enqueued': 1,
    'scheduler/enqueued/memory': 1,
    'start_time': datetime.datetime(2019, 1, 10, 3, 35, 14, 371325)}
    2019-01-10 11:35:18 [scrapy.core.engine] INFO: Spider closed (finished)
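The first log line reports the request as offsite to 'http'. A minimal sketch with the standard library shows why a URL of that shape is read that way. It only parses the URL string copied from the log and does not reproduce Scrapy's own middleware:

```python
from urllib.parse import urlparse

# URL copied verbatim from the "Filtered offsite request" log line.
logged_url = 'http://http//quotes.toscrape.com/page/2'
parts = urlparse(logged_url)

print(parts.hostname)  # http
print(parts.path)      # //quotes.toscrape.com/page/2
```

The host of that URL is the literal string http, with the real site pushed into the path. The logged URL also differs from the url value reported in reply #6, which may mean the code was changed between the two runs.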
    #10 · 15874103329 (OP) · 2019-01-10 13:00:39 +08:00
    Solved. I changed the code to: yield scrapy.http.Request(url, callback=self.parse, dont_filter=True)
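dont_filter=True lets the request bypass the offsite filter (and the duplicate-request filter as well), which is why the crawl continues. Judging from the spider in the original post, a likely underlying cause is the spelling of allowed_domains: 'quores.toscrape.com', while start_urls uses the host 'quotes.toscrape.com'. The sketch below is a simplified stand-in for an offsite check, not Scrapy's implementation, and is_offsite is a hypothetical helper name:

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is neither an allowed domain nor a subdomain of one."""
    host = (urlparse(url).hostname or '').lower()
    return not any(
        host == domain or host.endswith('.' + domain)
        for domain in allowed_domains
    )


next_page = 'http://quotes.toscrape.com/page/2/'

print(is_offsite(next_page, ['quores.toscrape.com']))  # True: misspelled domain, request dropped
print(is_offsite(next_page, ['quotes.toscrape.com']))  # False: corrected domain, request allowed
```

If that is the cause, correcting allowed_domains to ['quotes.toscrape.com'] should let the original scrapy.Request(url=url, callback=self.parse) work without dont_filter=True, which keeps duplicate filtering in place.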