Topic

crawling

Repositories (1230)

proxycrawl-python
proxycrawl-python crawlbase Python

ProxyCrawl Python library for scraping and crawling

59
talospider
talospider howie6879 Python

talospider - A simple, lightweight scraping micro-framework

55
learn.scrapinghub.com
learn.scrapinghub.com scrapinghub CSS

Scrapinghub Learning Center. Report issues in Jira: https://scrapinghub.atlassian.net/projects/WEB

55
rag-web-browser
rag-web-browser apify TypeScript

RAG Web Browser is an Apify Actor to feed your LLM applications and RAG pipelines with up-to-date text content scraped from the web.

55
diffbot-php-client
diffbot-php-client Swader PHP

[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library

53
flink-crawler
flink-crawler ScaleUnlimited Java

Continuous scalable web crawler built on top of Flink and crawler-commons

52
fuckvkeypad
fuckvkeypad soulee-dev Python

Virtual keyboard (vKeypad) bypass tool

52
Deepminer
Deepminer Conso1eCowb0y Python

Deep web crawler and search engine

52
bilib
bilib OlafZhang Python

A Python scraping library that integrates multiple native Bilibili APIs and combines them with crawling techniques

50
billboard-json
billboard-json KoreanThinker TypeScript

🎧 Get the Billboard Hot 100 chart as JSON

50
thecrowler
thecrowler pzaino Go

A Content Discovery and Development Platform. Empowering Cybersecurity, AI, Marketing, and Finance professionals and researchers to discover, analyze,...

49
socials
socials lorey Python

👨‍👩‍👦 Social account detection and extraction in Python, e.g. for crawling/scraping.

47
web-languages
web-languages commoncrawl

Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code

46
scaling-to-distributed-crawling
scaling-to-distributed-crawling ZenRows HTML

Repository for the Mastering Web Scraping in Python: Scaling to Distributed Crawling blogpost with the final code.

46
covid-social-analysis
covid-social-analysis lunarwhite HTML

Apply ML on Weibo sentiment. Sentiment analysis and visualization of Weibo text during the epidemic.

46
jason-the-miner
jason-the-miner mawrkus JavaScript

⛏ A versatile Web scraper for Node.js

45
auctus
auctus VIDA-NYU Python

Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index

45
bluebird
bluebird labteral Python

Unofficial Python client for Twitter

44
warcworker
warcworker peterk Python

A dockerized, queued high fidelity web archiver based on Squidwarc

43
scrape-github-trending
scrape-github-trending transitive-bullshit JavaScript

Tutorial for web scraping / crawling with Node.js.

43
EngineeringTeam
EngineeringTeam YBIGTA

A place for organizing the YBIGTA engineering team's materials.

42
Raven
Raven Symbolexe Go

Raven is a powerful and customizable web crawler written in Go.

42
webtranspose
webtranspose mike-gee Python

Web scraping API for building AI applications.

41
Coupang-Review-Crawling
Coupang-Review-Crawling JaehyoJJAng Python

Coupang review crawling

41
crawl-data-api
crawl-data-api justoneapi

justoneapi Data API Services. We provide APIs for: Xiaohongshu, Red, Redbook, Rednote, Taobao, JD.com, Douyin (E-commerce), Douyin (Videos), Kuaishou,...

39
podcastcrawler
podcastcrawler podcastcrawler PHP

PHP library to find podcasts

39
DarkWeb-Crawling-Indexing
DarkWeb-Crawling-Indexing AshwinAmbal HTML

A DarkWeb Crawler based off the open-source TorSpider. Indexing with search engine created using Apache Solr.

38
XingDumper
XingDumper l4rm4nd Python

Python 3 script to dump/scrape/extract company employees from XING API

38
sneakpeek
sneakpeek flulemon Python

Sneakpeek is a framework that helps to quickly and conveniently develop scrapers. It’s the best choice for scrapers that have some specific complex sc...

37
mal-analysis
mal-analysis racinmat Jupyter Notebook

GitHub repo for MyAnimeList analysis. Also links to the MAL dataset.

34
pdf_downloader
pdf_downloader alaminopu Python

A Scrapy Spider for downloading PDF files from a webpage.

34
serverless-instagram-crawler
serverless-instagram-crawler kimcoder TypeScript

Serverless Instagram hashtag crawler with Lambda and DynamoDB

33
video-crawler
video-crawler garysieling Scala

Crawl websites for videos from Youtube, Vimeo, Soundcloud, etc

33
BaiduSpider
BaiduSpider samzhangjy Python

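The project has moved to: https://github.com/BaiduSpider/BaiduSpider !! A crawler for scraping Baidu search results; currently supports Baidu web search, Baidu image search, Baidu Zhidao search, Baidu...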

33
serritor
serritor peterbencze Java

Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that require JavaS...

32
spidyquotes
spidyquotes zytedata Julia

Example site for web scraping tutorials

31
squirm
squirm squirm-framework Crystal

This was the night of the crawling terror!

31
ferret-server
ferret-server MontFerret Go

Advanced declarative web scraping

30
NetExtract
NetExtract sabber-slt TypeScript

NetExtract: Efficiently extract core content from any webpage and convert it to clean, LLM-optimized Markdown with a simple API.

30
CrowLeer
CrowLeer erap320 C++

Powerful C++ web crawler based on libcurl

29
ProductHunt-scraper
ProductHunt-scraper fernandod1 Python

Scraper script for the popular website Producthunt.com. Scrapes all offers and saves them to an Excel spreadsheet file.

28
puppet-master
puppet-master saasify-sh TypeScript

Puppeteer as a service hosted on Saasify.

26
amharic_spell_corrector
amharic_spell_corrector yididiyan Python

Amharic Spelling Corrector based on SymSpell - Spelling corrector which is 1 million times faster through Symmetric Delete spelling correction algori...

26
billboard-player
billboard-player krtk-dev TypeScript

🎹 Free Billboard Hot 100 M/V streaming service

26
translators
translators krtk-dev TypeScript

🌐 Comparison of Google, Papago, and Kakao Translator

26
botasaurus-starter
botasaurus-starter omkarcloud TypeScript

🚀 OFFICIAL STARTER TEMPLATE FOR BOTASAURUS SCRAPING FRAMEWORK 🤖

25
popular_restaurants_from_officials
popular_restaurants_from_officials jy617lee Jupyter Notebook

A project to find genuinely good restaurants by analyzing the business expense records of Seoul city officials

25
scrapy-requests
scrapy-requests rafyzg Python

Scrapy middleware to handle javascript pages using requests-html

25
crawlkit
crawlkit crawlkit JavaScript

A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.

24
SentimentPoliticalCompass
SentimentPoliticalCompass JulianMar11 Jupyter Notebook

Framework to analyze newspapers with respect to their political conviction using entity sentiments of party representatives.

24