Topic

crawling

Repositories (1350)

scrapy
scrapy scrapy Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

61.5k
Scrapling
Scrapling D4Vinci Python

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

38.7k
colly
colly gocolly Go

Elegant Scraper and Crawler Framework for Golang

25.3k
crawlee
crawlee apify TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs,...

23k
maxun
maxun getmaxun TypeScript

🔥 The open-source no-code platform for web scraping, crawling, search and AI data extraction • Turn websites into structured APIs in minutes 🔥

15.5k
newspaper
newspaper codelucas Python

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:

15k
crawlee-python
crawlee-python apify Python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, P...

8.8k
awesome-web-scraping
awesome-web-scraping lorien Makefile

List of libraries, tools and APIs for web scraping and data processing.

7.9k
rod
rod go-rod Go

A Chrome DevTools Protocol driver for web automation and scraping.

6.9k
ferret
ferret MontFerret Go

Declarative web scraping

6k
headless-chrome-crawler
headless-chrome-crawler yujiosaka JavaScript

Distributed crawler powered by Headless Chrome

5.7k
hakrawler
hakrawler hakluke Go

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

5k
exa-mcp-server
exa-mcp-server exa-labs TypeScript

Exa MCP for web search and web crawling!

4.3k
puppeteer-sharp
puppeteer-sharp hardkoded C#

Headless Chrome .NET API

3.9k
ai.robots.txt
ai.robots.txt ai-robots-txt Python

A list of AI agents and robots to block.

3.8k
cariddi
cariddi edoardottt Go

Take a list of domains, crawl URLs and scan for endpoints, secrets, API keys, file extensions, tokens and more

3.4k
nutch
nutch apache Java

Apache Nutch is an extensible and scalable web crawler

3.1k
awesome-puppeteer
awesome-puppeteer transitive-bullshit

A curated list of awesome puppeteer resources.

2.5k
grab
grab lorien Python

Web Scraping Framework

2.5k
skycaiji
skycaiji zorlan PHP

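SkyCaiji (蓝天采集器) is a free, open-source crawler system that collects data through simple point-and-click rule editing. It runs locally, on virtual hosts, or on cloud servers, can collect almost any type of web page, and integrates seamlessly with all kinds of CMS site-building...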

2.1k
holiday-cn
holiday-cn NateScarlet Python

📅🇨🇳 Statutory public holiday data for China, scraped automatically every day from State Council announcements

1.9k
core
core roach-php PHP

The complete web scraping toolkit for PHP.

1.5k
mlscraper
mlscraper lorey Python

🤖 Scrape data from HTML websites automatically by just providing examples

1.4k
rebrowser-patches
rebrowser-patches rebrowser JavaScript

Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy...

1.3k
bhban_rpa
bhban_rpa needleworm Python

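Example code for the book <Work Automation: Finishing Six Months of Work in One Day (Saengneung Publishing, 2020)>. The examples are written for people who have never learned Python, and cover everything from Excel to...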

1.1k
crawly
crawly elixir-crawly Elixir

Crawly, a high-level web crawling & scraping framework for Elixir.

1.1k
browsertrix-crawler
browsertrix-crawler webrecorder TypeScript

Run a high-fidelity browser-based web archiving crawler in a single Docker container

1k
scrapy-selenium
scrapy-selenium clemfromspace Python

Scrapy middleware to handle javascript pages using selenium

954
scrapfly-scrapers
scrapfly-scrapers scrapfly Python

Scalable Python web scraping scripts for 40+ popular domains

947
scrapyrt
scrapyrt scrapinghub Python

HTTP API for Scrapy spiders

880
AdminHack
AdminHack mishakorzik Shell

Today we will hack the admin panel of the site.

872
easy-scraping-tutorial
easy-scraping-tutorial MorvanZhou Jupyter Notebook

Simple but useful Python web scraping tutorial code.

817
Lulu
Lulu iawia002 Python

[Unmaintained] A simple and clean video/music/image downloader 👾

806
linkedin-profile-scraper-api
linkedin-profile-scraper-api josephlimtech TypeScript

🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON.

753
siteone-crawler
siteone-crawler janreges Rust

SiteOne Crawler is a cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization—ideal for developers,...

728
dataflowkit
dataflowkit slotix Go

Extract structured data from websites. Website scraping.

712
seonaut
seonaut StJudeWasHere Go

Open source SEO audit tool.

690
Craw4LLM
Craw4LLM cxcscmu Python

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

654
isp-data-pollution
isp-data-pollution essandess Python

ISP Data Pollution to Protect Private Browsing History with Obfuscation

605
LinkedInDumper
LinkedInDumper l4rm4nd Python

Python 3 script to dump/scrape/extract company employees from LinkedIn API

584
deepcrawl
deepcrawl lumpinif TypeScript

100% free and full open-source edge Firecrawl alternative with better links extraction for agents - that you can deploy to cloudflare or vercel by you...

573
webster
webster zhuyingda JavaScript

A reliable high-level web crawling & scraping framework for Node.js.

561
spidermon
spidermon scrapinghub Python

Scrapy Extension for monitoring spiders execution.

554
crawljax
crawljax crawljax Java

Crawljax

540
WarcDB
WarcDB Florents-Tselai Python

WarcDB: Web crawl data as SQLite databases.

404
second-order
second-order mhmdiaa Go

Second-order subdomain takeover scanner

402
webpalm
webpalm XORbit01 Go

🕸️ Crawl in the web network

381
crawler
crawler crwlrsoft PHP

Library for Rapid (Web) Crawler and Scraper Development

369
spidy
spidy rivermont Python

The simple, easy to use command line web crawler.

354
telegram-crawler
telegram-crawler MarshalX Python

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

349