Most popular crawling repositories and open source projects

scrapy scrapy Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

61.2k stars, 11.4k forks

Scrapling D4Vinci Python

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

35k stars, 2.9k forks

colly gocolly Go

Elegant Scraper and Crawler Framework for Golang

24.5k stars, 1.8k forks

crawlee apify TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs,...

22.7k stars, 1.3k forks

newspaper codelucas Python

newspaper3k is a news, full-text, and article metadata extraction library for Python 3.

15k stars, 2.1k forks

crawlee-python apify Python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, P...

8.7k stars, 703 forks

awesome-web-scraping lorien Makefile

List of libraries, tools and APIs for web scraping and data processing.

7.1k stars, 814 forks

rod go-rod Go

A Chrome DevTools Protocol driver for web automation and scraping.

6.2k stars, 407 forks

ferret MontFerret Go

Declarative web scraping

5.8k stars, 307 forks

headless-chrome-crawler yujiosaka JavaScript

Distributed crawler powered by Headless Chrome

5.6k stars, 409 forks

hakrawler hakluke Go

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

4.8k stars, 524 forks

puppeteer-sharp hardkoded C#

Headless Chrome .NET API

3.7k stars, 468 forks

nutch apache Java

Apache Nutch is an extensible and scalable web crawler

3k stars, 1.3k forks

ai.robots.txt ai-robots-txt Python

A list of AI agents and robots to block.

2.9k stars, 112 forks

cariddi edoardottt Go

Take a list of domains, crawl URLs, and scan for endpoints, secrets, API keys, file extensions, tokens and more

2.7k stars, 243 forks

awesome-puppeteer transitive-bullshit

A curated list of awesome Puppeteer resources.

2.5k stars, 160 forks

grab lorien Python

Web Scraping Framework

2.5k stars, 280 forks

skycaiji zorlan PHP

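SkyCaiji (蓝天采集器) is a free, open-source crawler system that collects data with nothing more than point-and-click rule editing. It runs locally, on shared hosting, or on a cloud server, can collect almost any type of web page, and integrates seamlessly with all kinds of CMS site-building...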

2k stars, 596 forks

holiday-cn NateScarlet Python

📅🇨🇳 Chinese statutory public holiday data, scraped automatically every day from State Council announcements

1.6k stars, 173 forks

core roach-php PHP

The complete web scraping toolkit for PHP.

1.4k stars, 76 forks

mlscraper lorey Python

🤖 Scrape data from HTML websites automatically by just providing examples

1.4k stars, 91 forks

bhban_rpa needleworm Python

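Example code for the book "Work Automation That Finishes Six Months of Work in One Day" (Saengneung Publishing, 2020). The examples are written for people who have never learned Python, covering everything from Excel to...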

1.1k stars, 1.1k forks

crawly elixir-crawly Elixir

Crawly, a high-level web crawling & scraping framework for Elixir.

1k stars, 120 forks

scrapy-selenium clemfromspace Python

Scrapy middleware to handle JavaScript pages using Selenium

945 stars, 362 forks

rebrowser-patches rebrowser JavaScript

Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy...

901 stars, 46 forks

scrapyrt scrapinghub Python

HTTP API for Scrapy spiders

864 stars, 162 forks

browsertrix-crawler webrecorder TypeScript

Run a high-fidelity browser-based web archiving crawler in a single Docker container

831 stars, 109 forks

Lulu iawia002 Python

[Unmaintained] A simple and clean video/music/image downloader 👾

807 stars, 140 forks

easy-scraping-tutorial MorvanZhou Jupyter Notebook

Simple but useful Python web scraping tutorial code.

802 stars, 546 forks

AdminHack mishakorzik Shell

Today we will hack the admin panel of the site.

799 stars, 140 forks

siteone-crawler janreges Rust

SiteOne Crawler is a cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization—ideal for developers,...

709 stars, 56 forks