Containerized Deployment of Crawlab, an Open-Source Distributed Crawler Management Platform
1. Introduction to Crawlab
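Crawlab currently supports scheduled tasks, data analysis, configurable spiders, an SDK, message notifications, Scrapy support, Git synchronization, and other features. It mainly addresses the difficulty of managing a large number of spiders. For example, a project that mixes scrapy and selenium and has to monitor hundreds of websites is hard to manage in one place, and managing it from the command line is costly and error-prone. Crawlab supports any language and any framework; combined with its task scheduling and task monitoring, it makes it easy to monitor and manage spider projects at scale.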
- Official feature overview: https://www.crawlab.cn/features/index.html
- GitHub: https://github.com/crawlab-team/crawlab
- Gitee: https://gitee.com/crawlab-team/crawlab
- GitBook: https://docs.crawlab.cn/zh/
2. Containerized Deployment of Crawlab
2.1 Environment Preparation
Recommended reading: Docker installation and deployment tutorial, Docker mirror speed
[root@10-27-0-224 ~]# yum install docker-compose -y
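Before moving on, it may be worth confirming that both `docker` and `docker-compose` are actually on the PATH. The following is a minimal sketch, not part of the original walkthrough; the helper names `check_cmd` and `check_prereqs` are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Docker or Crawlab):
# report whether a single command is available on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 found"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Check the two tools this deployment relies on.
# Succeeds only if neither of them is missing.
check_prereqs() {
    missing=0
    for tool in docker docker-compose; do
        check_cmd "$tool" || missing=$((missing + 1))
    done
    [ "$missing" -eq 0 ]
}

# Run the check; print a hint instead of aborting when something is missing.
check_prereqs || echo "Install the missing tools before continuing." >&2
```

If both tools are reported as found, the compose file in the next section can be started as shown there.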
2.2 自动部署
[root@10-27-0-224 ~]# vim docker-compose.yml
version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest              # image to use
    container_name: crawlab-master             # name of the started container
    environment:
      CRAWLAB_SERVER_MASTER: "Y"               # run as the master node (Y/N)
      CRAWLAB_MONGO_HOST: "mongo"              # MongoDB host
      CRAWLAB_REDIS_ADDRESS: "redis"           # Redis address
    ports:
      - "8080:8080"                            # map the container's frontend port 8080 to the host
    depends_on:
      - mongo                                  # container this service depends on
      - redis                                  # container this service depends on
    volumes:
      - "/var/crawlab/log:/var/logs/crawlab"   # persist logs
  worker:
    image: tikazyq/crawlab:latest
    container_name: crawlab-worker
    environment:
      CRAWLAB_SERVER_MASTER: "N"               # not the master, i.e. a worker node; there must not be more than one master
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    depends_on:
      - mongo
      - redis
    volumes:
      - "/var/crawlab/log:/var/logs/crawlab"   # persist logs
  mongo:                                       # MongoDB database
    image: mongo:latest
    restart: always
    volumes:
      - "/opt/crawlab/mongo/data/db:/data/db"  # persist data
    ports:
      - "27017:27017"                          # expose port to the host machine
  redis:                                       # Redis database
    image: redis:latest
    restart: always
    volumes:
      - "/opt/crawlab/redis/data:/data"        # persist data
    ports:
      - "6379:6379"                            # expose port to the host machine
  splash:                                      # use Splash to run spiders on dynamic pages
    image: scrapinghub/splash
    container_name: splash
    ports:
      - "8050:8050"

[root@10-27-0-224 ~]# docker-compose up -d   # start the containers
[root@10-27-0-224 ~]# docker ps
CONTAINER ID   IMAGE                    COMMAND                  CREATED         STATUS         PORTS                              NAMES
74fe33836fa9   tikazyq/crawlab:latest   "/bin/bash /app/dock…"   5 minutes ago   Up 5 minutes   8000/tcp, 0.0.0.0:8080->8080/tcp   crawlab-master
c519f15bccf1   tikazyq/crawlab:latest   "/bin/bash /app/dock…"   5 minutes ago   Up 5 minutes   8000/tcp, 8080/tcp                 crawlab-worker
3457cf598c6e   mongo:latest             "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes   0.0.0.0:27017->27017/tcp           root_mongo_1
b41f0774f671   redis:latest             "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes   0.0.0.0:6379->6379/tcp             root_redis_1
cbb7a364721f   scrapinghub/splash       "python3 /app/bin/sp…"   5 minutes ago   Up 5 minutes   0.0.0.0:8050->8050/tcp             splash
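Because indentation mistakes in a hand-edited docker-compose.yml are easy to make, it can help to validate the file before running `docker-compose up -d`. The sketch below wraps `docker-compose config -q`, which only checks the configuration and prints nothing on success. The function name `validate_compose` and the `COMPOSE_CMD` variable are hypothetical, introduced here only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: validate a compose file and print a short verdict.
# COMPOSE_CMD is an assumed override variable; it defaults to docker-compose.
validate_compose() {
    file="${1:-docker-compose.yml}"
    if [ ! -f "$file" ]; then
        echo "not found: $file" >&2
        return 2
    fi
    # "config -q" parses and validates the file without printing it.
    if ${COMPOSE_CMD:-docker-compose} -f "$file" config -q; then
        echo "valid: $file"
    else
        echo "invalid: $file" >&2
        return 1
    fi
}

# Example (run from the directory that holds the file):
#   validate_compose docker-compose.yml && docker-compose up -d
```

Chaining the check with `&&` means the containers are only started when the file parses cleanly.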
[root@10-27-0-224 ~]# docker-compose down || true   # stop and remove the containers
2.3 Crawlab Access Test
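Since the compose file in section 2.2 maps the master's frontend to host port 8080, one way to test access from the command line is to poll that port until it answers. This is a minimal sketch under that assumption; `http_status`, `wait_for_http`, and the `WAIT_INTERVAL` variable are hypothetical names used only for this example, and the exact page Crawlab serves is not checked.

```shell
#!/bin/sh
# Hypothetical helper: print the HTTP status code returned by a URL.
# curl prints "000" when the connection itself fails.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Hypothetical helper: poll a URL until it answers with a 2xx/3xx status
# or the given number of attempts (default 10) is used up.
wait_for_http() {
    url="$1"
    attempts="${2:-10}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        code=$(http_status "$url")
        case "$code" in
            2??|3??)
                echo "up: $url ($code)"
                return 0
                ;;
        esac
        # WAIT_INTERVAL (seconds between attempts) is an assumed knob, default 3.
        sleep "${WAIT_INTERVAL:-3}"
        i=$((i + 1))
    done
    echo "not reachable: $url" >&2
    return 1
}

# Example: wait_for_http http://localhost:8080 20
```

Once the check reports the URL as up, opening http://<host-ip>:8080 in a browser should reach the Crawlab frontend, provided the host firewall allows the port.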
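Author: UStarGao
Link: https://www.starcto.com/open-sourcing/206.html
Source: STARCTO
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, please credit the source.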