Most popular crawling repositories and open source projects

double-agent

A test suite of common scraper detection techniques. See how detectabl...

9   127   127  

scraply

Scraply, a simple DOM scraper to fetch information from any HTML based...

11   123   123  

estela

estela, an elastic web scraping cluster 🕸

5   110   110  

pdf-crawler

SimFin's open source PDF crawler

38   103   103  

dig-etl-engine

Download DIG to run on your laptop or server.

40   97   97  

proxifier

A fast, modern and intelligent proxy rotator perfect for crawling and...

15   95   95  

jkcrawler

A JK crawler written with Scrapy; images are sourced from Bilibili, Tumblr, Instagram, and...

27   95   95  

LinkedIn-Skills-Crawler

A simple Python script to crawl complete list of LinkedIn skills

111   94   94  

warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

0   87   87  

Infect

Create your virus in Termux!

10   84   84  

bathyscaphe

Fast, highly configurable, cloud native dark web crawler.

24   83   83  

bots-zoo

22   80   80  

ARGUS

ARGUS is an easy-to-use web scraping tool. The program is based on the...

24   79   79  

arachnid

Powerful web scraping framework for Crystal

11   78   78  

robots.txt

Simple robots.txt template. Keep unwanted robots out (disallow). White...

38   78   78  

Harvester

Web crawling and document processing through a usable interface.

15   69   69  

wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC...

10   61   61  

pomp

Screen scraping and web crawling framework

10   60   60  

Python-Crawling-Tutorial

Python crawling tutorial

26   60   60  

tech-seo-crawler

Build a small, 3 domain internet using GitHub Pages and Wikipedia and...

8   60   60  

proxycrawl-python

ProxyCrawl Python library for scraping and crawling

21   57   57  

crawling-projects

Web scraping and automation using python

15   56   56  

talospider

talospider - A simple, lightweight scraping micro-framework

4   55   55  

custom-crawler

🌌 High productivity semi-automatic crawler generator 🛠️🧰

2   55   55  

diffbot-php-client

[Deprecated - Maintenance mode - use APIs directly please!] The offici...

20   53   53  

learn.scrapinghub.com

Scrapinghub Learning Center. Report issues in Jira: Report issues in J...

24   53   53  

flink-crawler

Continuous scalable web crawler built on top of Flink and crawler-comm...

17   51   51  

scrapy-distributed

A series of distributed components for Scrapy. Including RabbitMQ-base...

12   44   44  

warcworker

A dockerized, queued high fidelity web archiver based on Squidwarc

7   43   43  

jason-the-miner

⛏ A versatile Web scraper for Node.js

11   43   43  

bluebird

Unofficial Python client for Twitter

11   43   43  

scrape-github-trending

Tutorial for web scraping / crawling with Node.js.

7   42   42  

feedsearch-crawler

Crawl sites for RSS, Atom, and JSON feeds.

7   42   42  

EngineeringTeam

A place for organizing the YBIGTA engineering team's materials.

10   41   41  

socials

👨‍👩‍👦 Social account detection and extraction in Python, e.g. for craw...

8   41   41  

burp-dom-scanner

Burp Suite's extension to scan and crawl Single Page Applications

4   40   40  

podcastcrawler

PHP library to find podcasts

10   39   39  

auctus

Dataset search engine, discovering data from a variety of sources, pro...

11   37   37  

bilib

A Python scraping library that integrates multiple native Bilibili APIs and combines them with crawling techniques

1   36   36  

Deepminer

Deep web crawler and search engine

7   36   36  

serverless-instagram-crawler

Serverless Instagram hashtag crawler with Lambda and DynamoDB

8   34   34  

mal-analysis

GitHub repo for MyAnimeList analysis. Also links to the MAL dataset.

7   32   32  

spidyquotes

Example site for web scraping tutorials

15   31   31  

video-crawler

Crawl websites for videos from YouTube, Vimeo, SoundCloud, etc.

4   30   30  

pdf_downloader

A Scrapy Spider for downloading PDF files from a webpage.

14   30   30  

scaling-to-distributed-crawling

Repository for the Mastering Web Scraping in Python: Scaling to Distri...

7   30   30  

BaiduSpider

The project has moved to: https://github.com/BaiduSpider/BaiduSpider !! A...

13   29   29  

serritor

Serritor is an open source web crawler framework built upon Selenium a...

15   28   28  

ferret-server

Advanced declarative web scraping

6   27   27  

AyugeSpiderTools

A Scrapy extension library: its main purpose is to let you develop with Scrapy without worrying about item, pipeline, middle...

3   27   27