Most popular crawling repositories and open source projects

scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python...

10965   57678   57678  

colly

Elegant Scraper and Crawler Framework for Golang

1815   24465   24465  

crawlee

Crawlee: A web scraping and browser automation library for Node.js to b...

889   18535   18535  

newspaper

newspaper3k is a news, full-text, and article metadata extraction in P...

2130   14668   14668  

awesome-web-scraping

List of libraries, tools and APIs for web scraping and data processing...

814   7082   7082  

Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python librar...

348   6369   6369  

rod

A Chrome DevTools Protocol driver for web automation and scraping.

397   6101   6101  

crawlee-python

Crawlee: A web scraping and browser automation library for Python to bu...

403   5919   5919  

ferret

Declarative web scraping

307   5836   5836  

headless-chrome-crawler

Distributed crawler powered by Headless Chrome

409   5582   5582  

hakrawler

Simple, fast web crawler designed for easy, quick discovery of endpoin...

524   4793   4793  

puppeteer-sharp

Headless Chrome .NET API

468   3710   3710  

nutch

Apache Nutch is an extensible and scalable web crawler

1257   3045   3045  

ai.robots.txt

A list of AI agents and robots to block.

112   2900   2900  

awesome-puppeteer

A curated list of awesome puppeteer resources.

160   2502   2502  

grab

Web Scraping Framework

275   2405   2405  

skycaiji

Skycaiji (Blue Sky Collector) is a free, open-source crawler system that collects data with just point-and-click rule editing, and can...

596   2016   2016  

cariddi

Take a list of domains, crawl urls and scan for endpoints, secrets, ap...

184   1793   1793  

holiday-cn

📅🇨🇳 Chinese statutory holiday data, automatically scraped daily from State Council announcements

169   1554   1554  

core

The complete web scraping toolkit for PHP.

76   1418   1418  

mlscraper

🤖 Scrape data from HTML websites automatically by just providing exam...

91   1359   1359  

bhban_rpa

Examples for the book "Work Automation: Finishing Six Months of Work in One Day" (Saengneung Publishing, 2020)...

1081   1119   1119  

crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

120   1037   1037  

scrapy-selenium

Scrapy middleware to handle javascript pages using selenium

362   945   945  

rebrowser-patches

Collection of patches for puppeteer and playwright to avoid automation...

46   901   901  

scrapyrt

HTTP API for Scrapy spiders

162   864   864  

browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Do...

109   831   831  

Lulu

[Unmaintained] A simple and clean video/music/image downloader 👾

142   811   811  

easy-scraping-tutorial

Simple but useful Python web scraping tutorial code.

546   802   802  

AdminHack

Today we will hack the admin panel of the site.

140   799   799  

dataflowkit

Extract structured data from web sites. Web sites scraping.

80   688   688  

linkedin-profile-scraper-api

🕵️‍♂️ LinkedIn profile scraper returning structured profile data in J...

169   667   667  

Craw4LLM

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pret...

57   633   633  

isp-data-pollution

ISP Data Pollution to Protect Private Browsing History with Obfuscatio...

52   608   608  

scrapfly-scrapers

Scalable Python web scraping scripts for +40 popular domains

136   569   569  

spidermon

Scrapy Extension for monitoring spiders execution.

101   546   546  

webster

A reliable high-level web crawling & scraping framework for Node.js.

53   540   540  

crawljax

Crawljax

223   526   526  

siteone-crawler

SiteOne Crawler is a cross-platform website crawler and analyzer for S...

39   520   520  

LinkedInDumper

Python 3 script to dump/scrape/extract company employees from LinkedIn...

52   477   477  

seonaut

Open source SEO audit tool.

73   407   407  

second-order

Second-order subdomain takeover scanner

67   405   405  

WarcDB

WarcDB: Web crawl data as SQLite databases.

11   404   404  

webpalm

🕸️ Crawl in the web network

39   372   372  

crawler

Library for Rapid (Web) Crawler and Scraper Development

13   366   366  

spidy

The simple, easy to use command line web crawler.

69   349   349  

telegram-crawler

🕷 Automatically detect changes made to the official Telegram sites, cl...

37   318   318  

stopstalk-deployment

Stop stalking and start StopStalking 😉

99   318   318  

memorious

Lightweight web scraping toolkit for documents and structured data.

62   313   313  

crawler

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

82   309   309