Most popular crawling repositories and open source projects

Scrapling D4Vinci Python

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

69.8k 6.9k 256

scrapy scrapy Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

63.2k 11.8k 1.8k

colly gocolly Go

Elegant Scraper and Crawler Framework for Golang

25.4k 1.9k 315

crawlee apify TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs,...

24.8k 1.6k 131

maxun getmaxun TypeScript

🔥 The open-source no-code platform for web scraping, crawling, search and AI data extraction • Turn websites into structured APIs in minutes 🔥

16.6k 1.4k 83

newspaper codelucas Python

newspaper3k is a news, full-text, and article metadata extraction library for Python 3.

15.1k 2.1k 373

crawlee-python apify Python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, P...

9.3k 777 46

awesome-web-scraping lorien Makefile

List of libraries, tools and APIs for web scraping and data processing.

8k 918 230

rod go-rod Go

A Chrome DevTools Protocol driver for web automation and scraping.

7k 473 51

ferret MontFerret Go

Declarative data automation language and Go runtime for structured extraction workflows.

6k 326 92

headless-chrome-crawler yujiosaka JavaScript

Distributed crawler powered by Headless Chrome

5.6k 404 112

hakrawler hakluke Go

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

5.1k 537 60

exa-mcp-server exa-labs TypeScript

Exa MCP for web search and web crawling!

4.7k 359 20

ai.robots.txt ai-robots-txt Python

A list of AI agents and robots to block.

4k 170 60

puppeteer-sharp hardkoded C#

Headless Chrome .NET API

3.9k 487 56

cariddi edoardottt Go

Take a list of domains, crawl URLs, and scan for endpoints, secrets, API keys, file extensions, tokens, and more

3.5k 296 16

nutch apache Java

Apache Nutch is an extensible and scalable web crawler

3.3k 1.3k 223

awesome-puppeteer transitive-bullshit

A curated list of awesome puppeteer resources.

2.6k 169 49

grab lorien Python

Web Scraping Framework

2.5k 276 83

skycaiji zorlan PHP

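SkyCaiji (蓝天采集器) is a free, open-source crawler system that collects data using only point-and-click rule editing. It runs locally, on shared hosting, or on cloud servers, can scrape almost any type of web page, and integrates seamlessly with all kinds of CMS site-building pro...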

2.1k 606 77

holiday-cn NateScarlet Python

📅🇨🇳 Chinese statutory public holiday data, scraped automatically every day from State Council announcements

2.1k 208 21

core roach-php PHP

The complete web scraping toolkit for PHP.

1.5k 87 17

rebrowser-patches rebrowser JavaScript

Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy...

1.4k 78 28

mlscraper lorey Python

🤖 Scrape data from HTML websites automatically by just providing examples

1.4k 93 15

bhban_rpa needleworm Python

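Example code for the book <Work Automation: Finishing Six Months of Work in a Single Day (Saengneung Publishing, 2020)>. The examples are aimed at people who have never learned Python, and range from Excel to ...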

1.2k 1.1k 6

crawly elixir-crawly Elixir

Crawly, a high-level web crawling & scraping framework for Elixir.

1.1k 123 17

browsertrix-crawler webrecorder TypeScript

Run a high-fidelity browser-based web archiving crawler in a single Docker container

1.1k 147 22

scrapfly-scrapers scrapfly Python

Scalable Python web scraping scripts for 40+ popular domains

1k 200 19

scrapy-selenium clemfromspace Python

Scrapy middleware to handle javascript pages using selenium

952 352 19

scrapyrt scrapinghub Python

HTTP API for Scrapy spiders

882 162 41

AdminHack mishakorzik Shell

Today we will hack the admin panel of the site.

881 156 24

easy-scraping-tutorial MorvanZhou Jupyter Notebook

Simple but useful Python web scraping tutorial code.

818 542 40

siteone-crawler janreges Rust

SiteOne Crawler is a cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization—ideal for developers,...

804 73 9

Lulu iawia002 Python

[Unmaintained] A simple and clean video/music/image downloader 👾

802 139 1

linkedin-profile-scraper-api josephlimtech TypeScript

🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON.

767 180 14

seonaut StJudeWasHere Go

Open source SEO audit tool.

737 123 9

dataflowkit slotix Go

Extract structured data from websites. Website scraping.

715 83 20

Craw4LLM cxcscmu Python

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

658 60 4

isp-data-pollution essandess Python

ISP Data Pollution to Protect Private Browsing History with Obfuscation

611 53 36

LinkedInDumper l4rm4nd Python

Python 3 script to dump/scrape/extract company employees from LinkedIn API

602 59 9

deepcrawl lumpinif TypeScript

100% free and fully open-source edge Firecrawl alternative with better link extraction for agents, which you can deploy to Cloudflare or Vercel by you...

588 74 5

spidermon scrapinghub Python

Scrapy extension for monitoring spider execution.

561 102 68