I also wrote a Python program that scrapes Taobao MM

This topic was created 2824 days ago; the information in it may have since developed or changed.

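I normally write backend code, but seeing you all having fun with crawlers, I wanted to join in. It's pretty rough and doesn't handle error recovery or anything like that; I'll add that when I have time. The repo is at https://github.com/carlonelong/TaobaoMMCrawler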

33 replies • 2017-04-02 18:17:15 +08:00

aksoft

2017-03-31 13:30:15 +08:00

What does this scrape???

carlonelong

2017-03-31 14:00:17 +08:00

@aksoft MM photo albums

2017-03-31 14:20:17 +08:00

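So it scrapes Tao Nülang (Taobao models)...
Speaking of which, I once scraped buyer-show photos for a certain keyword, and there were plenty of surprises... OP could give it a try... Remember to exclude the underwear category (they don't allow photos there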

caicaicaiTrain

2017-03-31 14:24:08 +08:00

@RE Now that's exciting

mansur

2017-03-31 14:25:19 +08:00

Can it scrape the young ladies of Tokyo?

springmarker

2017-03-31 14:35:54 +08:00 via Android

Scrape the cosplay shops

carlonelong

2017-03-31 15:16:20 +08:00

@RE
Go on, share one~~

carlonelong

2017-03-31 15:16:51 +08:00

@springmarker Good point

tyhunter

2017-03-31 15:33:24 +08:00

It threw an error

start downloading 田媛媛
current page 1
start downloading album 10000702574 45ÕÅ 张
Traceback (most recent call last):
  File "/Users/hunter/Downloads/TaobaoMMCrawler-master/crawler.py", line 83, in <module>
    c.getAlbums()
  File "/Users/hunter/Downloads/TaobaoMMCrawler-master/crawler.py", line 58, in getAlbums
    self.getImages(model_id, album_id, album_img_count.strip(u'张'))
  File "/Users/hunter/Downloads/TaobaoMMCrawler-master/crawler.py", line 65, in getImages
    for page in xrange(1, (int(image_count)-1)/16+2):
ValueError: invalid literal for int() with base 10: '45\xd5\xc5'

carlonelong

2017-03-31 16:54:57 +08:00

@tyhunter It's an encoding problem... What environment are you on?

roist

2017-03-31 17:01:25 +08:00

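These photos are over-retouched with Meitu Xiuxiu; you'd be better off looking at the borderline photo sets from domestic sites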

zwh8800

2017-03-31 17:12:37 +08:00

Looks like there's a bug

```
$ python crawler.py
start downloading 田媛媛
current page 1
start downloading album 10000702574 45ÕÅ 张
Traceback (most recent call last):
  File "crawler.py", line 83, in <module>
    c.getAlbums()
  File "crawler.py", line 58, in getAlbums
    self.getImages(model_id, album_id, album_img_count.strip(u'张'))
  File "crawler.py", line 65, in getImages
    for page in xrange(1, (int(image_count)-1)/16+2):
ValueError: invalid literal for int() with base 10: '45\xd5\xc5'
```

123s

2017-03-31 17:18:34 +08:00

Scraping Taobao MM
How lewd

xiejc

2017-03-31 17:24:44 +08:00

Changing line 41 to soup = bs(self.readHtml(model_url).decode('gbk'), 'html.parser') fixed it; no more error
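For context on why decoding helps: the bytes `\xd5\xc5` in the traceback are the GBK encoding of 张, so `strip(u'张')` cannot match until the page has been decoded. Below is a minimal sketch of that step in Python 3 syntax, separate from the crawler itself. `parse_image_count` and `page_numbers` are hypothetical helper names, and the 16 images per page is inferred from the `/16` in the traceback's loop.

```python
IMAGES_PER_PAGE = 16  # inferred from the "/16" in the traceback's loop


def parse_image_count(raw_label):
    # Decode the GBK bytes first so the trailing counter word can be stripped.
    text = raw_label.decode("gbk")
    return int(text.strip(u"张"))


def page_numbers(image_count):
    # Same range as the original loop, using // so it also works on Python 3.
    return range(1, (image_count - 1) // IMAGES_PER_PAGE + 2)


count = parse_image_count(b"45\xd5\xc5")
print(count, list(page_numbers(count)))
```

Decoding once, right after the download, means every later string operation sees text rather than raw bytes.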

carlonelong

2017-03-31 17:26:35 +08:00

@xiejc OK, thx, I'll fix it

imherer

2017-03-31 17:29:31 +08:00

Which Python version does it need?
I'm on 2.7 and get the same error on both Mac and Windows
````
Traceback (most recent call last):
  File "TaobaoMMCrawler.py", line 5, in <module>
    from bs4 import BeautifulSoup as bs
ImportError: No module named bs4
````

zwh8800

2017-03-31 17:30:39 +08:00

@xiejc 👍

carlonelong

2017-03-31 17:32:50 +08:00

@imherer That's because you don't have Beautiful Soup installed; running pip install bs4 should fix it
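A script can also check for the dependency up front and print a hint instead of a bare ImportError. This is only a sketch, not something the crawler does; `is_installed` is a hypothetical helper, and it works on both Python 2 and 3.

```python
def is_installed(module_name):
    # Try the import and report whether it succeeded.
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if not is_installed("bs4"):
    print("bs4 is missing; try: pip install bs4")
```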

7654

2017-03-31 17:52:22 +08:00

You could add a browser User-Agent
Throttle the crawl too, or you'll get blocked (GG)
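Both suggestions can be sketched with the standard library alone (Python 3 here). The User-Agent string, the one-second delay, and the names `build_request` and `fetch_all` are assumptions for illustration; none of them come from the crawler.

```python
import time
import urllib.request

# Assumed values for illustration only; neither comes from the thread.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/57.0 Safari/537.36"
)
DELAY_SECONDS = 1.0


def build_request(url):
    # Attach a browser-like User-Agent instead of the default Python-urllib one.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls, delay=DELAY_SECONDS, opener=urllib.request.urlopen):
    # Fetch the URLs one by one, sleeping between requests to limit the rate.
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)
        with opener(build_request(url)) as response:
            pages.append(response.read())
    return pages
```

A fixed delay is the simplest form of throttling; how long is long enough depends on the site and is not something the thread specifies.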

neutrino

2017-03-31 17:58:07 +08:00

Sent a PR; some of the files are actually PNG (

imherer

2017-03-31 18:01:27 +08:00

@carlonelong Thanks a lot

carlonelong

2017-03-31 18:17:43 +08:00

@neutrino thx. Also, a small gripe: I really don't like Python 3's print = =

carlonelong

2017-03-31 18:18:20 +08:00

@7654 Yep, I'll fix that later

7654

2017-03-31 18:28:16 +08:00

r#22 @carlonelong import urllib.request

neutrino

2017-03-31 18:51:34 +08:00

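@carlonelong haha, I just couldn't be bothered to install two copies of bs4 and requests... not to mention that when I first started using print, I wrote it printf-style... *facepalm*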

7654

2017-03-31 18:59:48 +08:00

Strip the _620x10000.jpg suffix and you get the full-size image
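As a sketch, the suffix can be dropped with plain string handling. `full_size_url` is a hypothetical helper and the URL below is made up for illustration; only the `_620x10000.jpg` suffix comes from the post.

```python
THUMBNAIL_SUFFIX = "_620x10000.jpg"


def full_size_url(url):
    # Drop the resize suffix if present; leave other URLs untouched.
    if url.endswith(THUMBNAIL_SUFFIX):
        return url[: -len(THUMBNAIL_SUFFIX)]
    return url


# Hypothetical URL, for illustration only.
print(full_size_url("http://img.example.com/photo/abc.jpg_620x10000.jpg"))
```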

carlonelong

2017-03-31 19:41:30 +08:00

@7654 Wow, I can't believe I didn't notice that

neutrino

2017-03-31 22:32:07 +08:00

@7654
@carlonelong

After stripping it, imghdr sometimes can't identify the format; when I download the file and look, it's a JPG...
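One possible fallback when imghdr returns nothing is to check the leading signature bytes directly. This is a sketch under that assumption, not what the crawler or the PR does; `sniff_image_type` and `extension_for` are hypothetical names. It also avoids imghdr, which was deprecated in Python 3.11 and removed in 3.13.

```python
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_image_type(data):
    # Look only at the leading signature bytes of the downloaded content.
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None


def extension_for(data, default="jpg"):
    # Map the sniffed type to a file extension, falling back to a default.
    kind = sniff_image_type(data)
    return {"jpeg": "jpg", "png": "png"}.get(kind, default)
```

Falling back to `jpg` when nothing matches is a guess that fits the files described in the post; a stricter script could skip or log unknown files instead.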

aksoft

2017-04-01 04:28:33 +08:00 via iPhone

Too bad you can't scrape them home

carlonelong

2017-04-01 10:31:53 +08:00

@aksoft 3D printing: you deserve it

carlonelong

2017-04-01 10:33:36 +08:00

@neutrino I'll merge the two files into one tonight

aksoft

2017-04-01 11:15:33 +08:00

@carlonelong What's the use if you can't use it?

carlonelong

2017-04-02 18:17:15 +08:00

Merged the py2/3 versions together
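The thread does not say how the two versions were merged. One common approach, shown here purely as an assumption, is to guard the imports and names that differ between Python 2 and 3; `range_` is a hypothetical alias.

```python
try:
    # Python 3 location.
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2 fallback.
    from urllib2 import Request, urlopen

try:
    range_ = xrange  # Python 2: lazy range
except NameError:
    range_ = range   # Python 3: range is already lazy

# A single parenthesized argument prints the same on both versions.
print("pages: %s" % list(range_(1, 4)))
```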