Crawlers: Deploying the Python Crawler Management Platform Crawlab Locally

1. Pull the Crawlab code

This tutorial uses Crawlab v0.4.1:
https://github.com/crawlab-team/crawlab/releases/tag/v0.4.1

Latest (HTTPS): https://github.com/crawlab-team/crawlab.git
Latest (SSH): git@github.com:crawlab-team/crawlab.git

Cloning over SSH is recommended (e.g. git clone git@github.com:crawlab-team/crawlab.git); the repository is fairly large, and an HTTPS clone may fail with a buffer overflow error.

2. Deployment

cd into the crawlab directory and run docker-compose up.

If Docker reports a "File Sharing" path-not-found error, open the docker-compose.yml configuration file and delete the volumes entries under the mongo and redis services (the lines marked with a leading "-" in the listing below):

version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_API_ADDRESS: "http://localhost:8000"
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:
      - "8080:8080" # frontend
      - "8000:8000" # backend
    depends_on:
      - mongo
      - redis
  worker:
    image: tikazyq/crawlab:latest
    container_name: worker
    environment:
      CRAWLAB_SERVER_MASTER: "N"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
-   volumes:
-     - "/opt/crawlab/mongo/data/db:/data/db"
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
-   volumes:
-     - "/opt/crawlab/redis/data:/data"
    ports:
      - "6379:6379"

Output from a successful docker-compose up (the Chinese log lines report that the config, MongoDB, Redis, the task executor, node configuration, spider service, and user service were all initialized successfully):

=> docker-compose up

Starting master ... done
Starting worker ... done
Attaching to worker, master
master | * Starting nginx nginx
worker | * Starting nginx nginx
worker | ...done.
worker | [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
worker |
worker | [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
worker | - using env: export GIN_MODE=release
worker | - using code: gin.SetMode(gin.ReleaseMode)
worker |
worker | 2019/12/17 09:45:51 info 初始化配置成功
worker | 2019/12/17 09:45:51 info 初始化日志设置成功
worker | 2019/12/17 09:45:51 info 默认未开启定期清理日志配置
worker | 2019/12/17 09:45:51 info 初始化Mongodb数据库成功
worker | 2019/12/17 09:45:51 info 初始化Redis数据库成功
worker | 2019/12/17 09:45:51 info 初始化任务执行器成功
worker | 2019/12/17 09:45:51 info register type is :*register.MacRegister
worker | {subscribe nodes:5df871e26001790017ce4ebb 1}
worker | 2019/12/17 09:45:51 info 初始化节点配置成功
worker | 2019/12/17 09:45:51 info 初始化爬虫服务成功
worker | 2019/12/17 09:45:51 info 初始化用户服务成功
worker | [GIN-debug] GET /ping --> crawlab/routes.Ping (3 handlers)
worker | {subscribe nodes:public 1}
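
To sanity-check that the backend actually came up before opening the UI, you can hit the /ping route that appears in the Gin routing table above. A minimal sketch, assuming that route is served directly on the CRAWLAB_API_ADDRESS port (8000):

# ping_crawlab.py -- quick health check against the Crawlab backend
# (assumption: the /ping route from the Gin log above is reachable at http://localhost:8000)
from urllib.request import urlopen

API_ADDRESS = "http://localhost:8000"  # same value as CRAWLAB_API_ADDRESS in docker-compose.yml

with urlopen(f"{API_ADDRESS}/ping", timeout=5) as resp:
    # Expect HTTP 200 and a short response body if the backend is healthy
    print(resp.status, resp.read().decode())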

Open http://localhost:8080/#/login and log in with the default admin account.

(Screenshot: login page)

3. Run the spider

Change the mongo and redis settings in the configuration file to your own database addresses:

version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_API_ADDRESS: "http://localhost:8000"
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "192.168.0.0"
      CRAWLAB_MONGO_PORT: "6666"
      CRAWLAB_MONGO_DB: "crawlab_test"
      CRAWLAB_REDIS_ADDRESS: "192.168.0.0"
      CRAWLAB_REDIS_PASSWORD: "000000"
    ports:
      - "8080:8080" # frontend
      - "8000:8000" # backend

  worker:
    image: tikazyq/crawlab:latest
    container_name: worker
    environment:
      CRAWLAB_SERVER_MASTER: "N"
      CRAWLAB_MONGO_HOST: "192.168.0.0"
      CRAWLAB_MONGO_PORT: "6666"
      CRAWLAB_MONGO_DB: "crawlab_test"
      CRAWLAB_REDIS_ADDRESS: "192.168.0.0"
      CRAWLAB_REDIS_PASSWORD: "000000"

Add a MongoDB pipeline to your Scrapy project and enable it (see http://docs.crawlab.cn/Examples/ScrapyIntegration.html):

import os
from pymongo import MongoClient

MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

# Scrapy pipeline example
class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    # result collection -> MongoDB collection name
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        # Collection.save() was removed in PyMongo 4; insert_one works on all current versions
        self.col.insert_one(dict(item))
        return item
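
For Scrapy to actually run this pipeline, it has to be registered in the project's settings.py. A minimal sketch, assuming the project is named SimpleScrapy (as in the example spider below) and the class above lives in SimpleScrapy/pipelines.py:

# SimpleScrapy/settings.py -- assumed project/module names; adjust to your own layout
ITEM_PIPELINES = {
    # Register the MongoDB pipeline defined above; 300 is an arbitrary priority
    'SimpleScrapy.pipelines.JuejinPipeline': 300,
}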

Package (zip) your spider code from the project root and upload it.

(Screenshot: uploading the spider)

Set the execution command (for the example spider below, scrapy crawl newtop) and the result collection; the result collection name is the name of the MongoDB collection the results are stored in.

  • If the spider detail page reports that the file cannot be found, just wait a moment and it will show up.

(Screenshot: configuring the spider)

Save -> Run -> select a node to execute on.

(Screenshot: run completed)

4. Example spider code

# -*- coding: utf-8 -*-
import scrapy

from SimpleScrapy.items import NewTopItem, NewTopListDataItem


# Trending ("hot list") data from tophub.today
class NewtopSpider(scrapy.Spider):

    name = 'newtop'
    allowed_domains = ['tophub.today']
    start_urls = ['https://tophub.today']

    def parse(self, response):
        # All category titles
        listTitles = response.css('.bc-tc-tb::text').getall()
        # All category content blocks
        listContentHtml = response.css('.bc').css('.bc-tc').xpath('//div[@id="Sortable"]')

        for index in range(len(listTitles)):
            # Cards within this category
            for content in listContentHtml[index].css('.cc-cd'):
                item = NewTopItem()
                # Category
                item['newType'] = listTitles[index]
                # Source site
                item['webType'] = content.css('.cc-cd-lb').css('span::text').get()
                # Site ranking type
                item['listType'] = content.css('.cc-cd-sb-st::text').get()
                # Last updated time
                item['lastTime'] = content.css('.i-h::text').get()
                links = content.css('.cc-cd-cb-l').css('a')
                # Ranking entries
                item['listData'] = []
                for link in links:
                    data_item = NewTopListDataItem()
                    # Entry URL
                    data_item['link'] = link.css('::attr(href)').get()
                    # Rank
                    data_item['ranking'] = link.css('.s::text').get()
                    # Title
                    data_item['title'] = link.css('.t::text').get()
                    # Count / popularity
                    data_item['count'] = link.css('.e::text').get()
                    item['listData'].append(data_item)
                yield item
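
The spider imports NewTopItem and NewTopListDataItem from SimpleScrapy.items, which the post does not show. A minimal sketch of that file, with fields inferred from the spider and pipeline above:

# SimpleScrapy/items.py -- assumed layout; fields inferred from the code above
import scrapy


class NewTopListDataItem(scrapy.Item):
    link = scrapy.Field()     # entry URL
    ranking = scrapy.Field()  # rank within the list
    title = scrapy.Field()    # entry title
    count = scrapy.Field()    # popularity / count text


class NewTopItem(scrapy.Item):
    newType = scrapy.Field()   # category title
    webType = scrapy.Field()   # source site
    listType = scrapy.Field()  # site ranking type
    lastTime = scrapy.Field()  # last updated time
    listData = scrapy.Field()  # list of NewTopListDataItem
    task_id = scrapy.Field()   # set by the pipeline from CRAWLAB_TASK_ID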

--- The green hills stay, the waters keep flowing; should we meet again in the jianghu, we shall share a drink. Until then, farewell. ---
