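Introduction
Ruia is an asynchronous web-scraping framework built on asyncio and aiohttp. It is quick to write, non-blocking, and highly extensible, so you can write less code and get faster crawls.
Features:
- Custom middleware
- Support for JavaScript-rendered pages
- Friendly response data classes
- Asynchronous and non-blocking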
Installation
Before installing Ruia, make sure you are using Python 3.6 or later.
# For Linux & Mac
pip install -U ruia[uvloop]
# For Windows
pip install -U ruia
# Latest features (install directly from GitHub)
pip install git+https://github.com/howie6879/ruia
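A quick way to confirm the environment after installing is a small check script. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `module_available` are made up for illustration and are not part of Ruia.

```python
import importlib.util
import sys

# Minimum interpreter version stated in the installation notes above.
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version tuple is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def module_available(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("ruia installed:", module_available("ruia"))
    # uvloop is only pulled in by the ruia[uvloop] extra (Linux & Mac).
    print("uvloop installed:", module_available("uvloop"))
```

If `uvloop installed` prints `False` on Windows, that is expected, since the plain `pip install -U ruia` command does not request it.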
Code Snippets
The following example briefly shows how to use Ruia and how the framework runs. Create a file named hacker_news_spider.py and copy the code below into it:
#!/usr/bin/env python
"""
 Target: https://news.ycombinator.com/
 pip install aiofiles
"""
import aiofiles

from ruia import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    # Each element matched by target_item is parsed as one item
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        """
        Optional: this method can be omitted if the field needs no cleaning.
        """
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/news?p=1', 'https://news.ycombinator.com/news?p=2']
    concurrency = 10

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.html)
        for item in items:
            # Append each title to the output file
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                await f.write(item.title + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start(middleware=None)
Run python hacker_news_spider.py in a terminal. If everything goes well, you will see output like the following, and the scraped data will be saved to hacker_news.txt:
[2018-09-24 11:02:05,088]-ruia-INFO spider : Spider started!
[2018-09-24 11:02:05,089]-Request-INFO request: <GET: https://news.ycombinator.com/news?p=2>
[2018-09-24 11:02:05,113]-Request-INFO request: <GET: https://news.ycombinator.com/news?p=1>
[2018-09-24 11:02:09,820]-ruia-INFO spider : Stopping spider: ruia
[2018-09-24 11:02:09,820]-ruia-INFO spider : Total requests: 2
[2018-09-24 11:02:09,820]-ruia-INFO spider : Time usage: 0:00:01.731780
[2018-09-24 11:02:09,821]-ruia-INFO spider : Spider finished!
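In the log above, the two requests are issued almost at the same moment rather than one after the other, which is what non-blocking I/O combined with a setting like `concurrency = 10` allows. The sketch below illustrates the general idea of bounded concurrency with plain asyncio. It is not Ruia's actual implementation: `fetch`, `crawl`, `run`, and the `stats` dictionary are hypothetical names, and `asyncio.sleep` stands in for a real network call.

```python
import asyncio


async def fetch(url, semaphore, stats):
    """Simulate downloading `url` while holding one slot of the semaphore."""
    async with semaphore:  # waits here if all slots are in use
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # placeholder for real network I/O
        stats["in_flight"] -= 1
        return f"<html for {url}>"


async def crawl(urls, concurrency=10):
    """Fetch every URL, with at most `concurrency` downloads running at once."""
    semaphore = asyncio.Semaphore(concurrency)
    stats = {"in_flight": 0, "peak": 0}
    # gather() returns results in the same order as `urls`.
    pages = await asyncio.gather(*(fetch(u, semaphore, stats) for u in urls))
    return pages, stats["peak"]


def run(urls, concurrency=10):
    """Drive the coroutine on a fresh event loop (works on Python 3.6+)."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(crawl(urls, concurrency))
    finally:
        loop.close()


if __name__ == "__main__":
    urls = [f"https://news.ycombinator.com/news?p={n}" for n in range(1, 6)]
    pages, peak = run(urls, concurrency=2)
    print(len(pages), "pages fetched, peak concurrent downloads:", peak)
```

Because each simulated download yields control while it waits, the total time is governed by the number of batches, not the number of URLs, which is the effect the framework relies on for speed.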