Package details

icecrawl

wangdangel | 79 | MIT | 0.4.0

Web scraping application with HTTP API (incl. Dashboard), CLI, and MCP Server interfaces.

webscraper, cli, api, typescript

readme

Icecrawl


A powerful web scraping application offering multiple interfaces: HTTP API (with Dashboard), CLI, and MCP Server.


Features

  • Multiple Interfaces:
    • HTTP API Server: RESTful API for integration, includes a web dashboard.
    • CLI Tool: icecrawl command for terminal-based scraping.
    • MCP Server: icecrawl-mcp command for programmatic use with MCP clients.
  • Web Dashboard: User-friendly UI for managing scrapes and viewing results.
  • Authentication: User management with role-based access control.
  • Database Storage: Persistent storage using Prisma ORM (SQLite default).
  • Crawling: Asynchronous website crawling with depth and scope control.
  • Flexible Output: JSON, Markdown, raw HTML, or screenshots.
  • Performance Optimization: Caching, request pooling.
  • Proxy Support: Use proxies for requests.
  • JS Rendering: Optional headless browser usage via Puppeteer.
  • And more: Content Transformation, Exporting, Scheduled Jobs...
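The "depth and scope control" mentioned in the Crawling feature can be pictured as a breadth-first walk that stops after a fixed number of link hops and ignores other origins. The sketch below is illustrative only: `crawlWithLimits` and `LinkFetcher` are hypothetical names, not part of Icecrawl's API, and link discovery is injected as a plain synchronous function so the example runs without network access (the real crawler is asynchronous).

```typescript
// Hypothetical helper: returns the links found on a page (absolute or relative).
type LinkFetcher = (url: string) => string[];

// Breadth-first crawl limited to `maxDepth` hops and to the start URL's origin.
// This is an assumption-based sketch, not Icecrawl's actual implementation.
function crawlWithLimits(
  startUrl: string,
  maxDepth: number,
  fetchLinks: LinkFetcher
): string[] {
  const start = new URL(startUrl);
  start.hash = "";
  const origin = start.origin;

  const visited = new Set<string>([start.href]);
  let frontier: string[] = [start.href];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const page of frontier) {
      for (const href of fetchLinks(page)) {
        const resolved = new URL(href, page); // resolve relative links
        resolved.hash = "";                   // treat /b and /b#top as one page
        const url = resolved.href;
        // Scope control: skip other origins and pages already seen.
        if (resolved.origin !== origin || visited.has(url)) continue;
        visited.add(url);
        next.push(url);
      }
    }
    frontier = next;
  }
  return Array.from(visited);
}
```

With `maxDepth` set to 1 only the start page and the pages it links to directly are returned; raising the limit widens the crawl one hop at a time.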

Sitemap Generation Mode

  • Crawl an entire website to build a hierarchical sitemap of all internal links.
  • Does not save page content or extract text.
  • Useful for visualizing site structure, auditing SEO, or link analysis.
  • Enable by setting crawl option "mode": "sitemap" via API or CLI.
  • The sitemap is saved as JSON in the crawl job record and can be retrieved via API.
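To make "hierarchical sitemap" concrete, here is a minimal sketch that folds a flat list of internal URLs into a tree keyed by path segment. The `SitemapNode` shape and the `buildSitemapTree` function are assumptions made for illustration; the JSON that Icecrawl actually stores in the crawl job record may be structured differently.

```typescript
// Hypothetical node shape; not taken from Icecrawl's source.
interface SitemapNode {
  segment: string;
  children: SitemapNode[];
}

// Fold a flat list of URLs into a tree, one level per path segment.
function buildSitemapTree(urls: string[]): SitemapNode {
  const root: SitemapNode = { segment: "/", children: [] };
  for (const raw of urls) {
    const segments = new URL(raw).pathname
      .split("/")
      .filter((s) => s.length > 0);
    let node = root;
    for (const segment of segments) {
      // Reuse an existing child for this segment, or create one.
      let child = node.children.find((c) => c.segment === segment);
      if (!child) {
        child = { segment, children: [] };
        node.children.push(child);
      }
      node = child;
    }
  }
  return root;
}
```

For example, `/docs/cli` and `/docs/api` end up as two children of a single `docs` node, which is the kind of structure that makes site layout easy to visualize or audit.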

Installation

From npm (Recommended)

npm install -g icecrawl
  • Creates a default data directory:
    • Windows: C:\Users\<username>\Documents\Icecrawl
    • macOS/Linux: ~/Icecrawl
  • Generates .env file, initializes database, seeds default admin user.
  • After install:
icecrawl --help
icecrawl-mcp

From Source (Development)

git clone https://github.com/wangdangel/icecrawl.git
cd icecrawl
npm install
cp .env.example .env
# Edit .env with your config
npx prisma migrate dev
npm run prisma:generate
npm run build
npm run build:dashboard
# Optionally: npm link

Usage

Start Dashboard + MCP Server (default)

icecrawl

Start only the Dashboard server

icecrawl dashboard

Start only the MCP server

icecrawl mcp-server

Scraping via CLI

icecrawl scrape url https://example.com
echo "https://example.com" | icecrawl scrape

See docs/cli-usage.md for full CLI documentation and examples.


Troubleshooting

Permission Denied Error when running icecrawl

If a global install (npm install -g icecrawl) succeeds but running icecrawl fails with a Permission denied error, you may need to add execute permission manually:

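  1. Find your global npm bin directory. On npm 8 and earlier:

     npm bin -g

     The npm bin command was removed in npm 9. On newer versions, print the global prefix instead; on macOS/Linux the executables live in its bin subdirectory:

     npm prefix -g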
    
  2. Run the following command, replacing the path with the one found above:

     chmod +x /path/to/your/global/bin/icecrawl
    

This should resolve the permission issue.


MCP Server Configuration Example

Add this to your MCP client configuration (e.g., Cline), replacing the example paths with the location of your own Icecrawl checkout:

{
  "command": "node",
  "args": ["k:/Documents/smart_crawler/dist/mcp-server.js"],
  "cwd": "k:/Documents/smart_crawler",
  "disabled": false,
  "autoApprove": [],
  "timeout": 60,
  "transportType": "stdio"
}
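The paths in the example above come from one particular machine. As a rough sketch of how the `args` and `cwd` fields relate, the hypothetical helper below derives both from a single install directory; `buildMcpEntry` and `McpServerEntry` are illustrative names, and the remaining field values are simply copied from the example.

```typescript
// Shape of the configuration entry shown above (field names from the example).
interface McpServerEntry {
  command: string;
  args: string[];
  cwd: string;
  disabled: boolean;
  autoApprove: string[];
  timeout: number;
  transportType: "stdio";
}

// Hypothetical helper: build the entry from the directory that contains
// the built project (the one holding dist/mcp-server.js).
function buildMcpEntry(installDir: string): McpServerEntry {
  const dir = installDir.replace(/[\\/]+$/, ""); // drop trailing slashes
  return {
    command: "node",
    args: [`${dir}/dist/mcp-server.js`],
    cwd: dir,
    disabled: false,
    autoApprove: [],
    timeout: 60,
    transportType: "stdio",
  };
}

// Print an entry ready to paste into the client configuration.
console.log(JSON.stringify(buildMcpEntry("k:/Documents/smart_crawler"), null, 2));
```

Keeping `cwd` and the script path in sync this way avoids the common mistake of updating one and not the other.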

Default Login Credentials

For initial access after seeding:

Username  Password  Email              Role
admin     password  admin@example.com  admin

Development Commands

npm test
npm run test:coverage
npm run lint
npm run format
npm run prisma:studio

Project Structure

To be documented.

CI/CD Workflow

To be documented.

Contribution Guidelines

To be documented.

Releasing

To be documented.


License

MIT

changelog

Changelog

All notable changes to this project will be documented in this file. See standard-version for commit guidelines.

0.3.23 (2025-04-07)

0.3.22 (2025-04-07)

0.3.21 (2025-04-07)

0.3.20 (2025-04-07)

0.3.19 (2025-04-07)

0.3.18 (2025-04-07)

0.3.17 (2025-04-07)

0.3.16 (2025-04-07)

0.3.15 (2025-04-07)

0.3.14 (2025-04-07)

0.3.12 (2025-04-07)

0.3.6 (2025-04-07)

0.3.2 (2025-04-07)

0.3.1 (2025-04-07)

0.3.0 (2025-04-07)

Features

  • Implement MCP Server Interface (e0ecc19): Adds a Model Context Protocol (MCP) server interface to Icecrawl, allowing programmatic interaction via MCP clients.
    • Installs @modelcontextprotocol/sdk dependency.
    • Creates src/mcp-server.ts as the server entry point, loading .env.
    • Implements ListTools handler defining scrape_url, start_crawl, and get_crawl_job_result tools with their input schemas.
    • Implements CallTool handler with basic logic for each tool, leveraging existing services (Scraper, MarkdownService, Prisma, BrowserService).
    • Adds icecrawl-mcp command to package.json bin field.
    • Updates README.md and PLANNING.md to document the new interface and configuration.
    • Updates TASK.md to track MCP integration progress.

Bug Fixes

  • align package.json with npm publish corrections (6d7b79d)

0.2.0 (2025-04-07)

Features

  • Add CI/CD, linting, API keys, dashboard, and user features (0c087a3)
  • Fix login issues and automate DB setup with seeding (1b6dc49)
  • Implement background job processing and fix dashboard errors (3c1b184)
  • Implement website crawl-to-markdown feature (225676c)
  • prepare for npm publish, rename to icecrawl, fix build errors (c989eea)
  • Refactor services and controllers for user and dashboard (6fd84a5)

Bug Fixes

  • Resolve dashboard loading issues and related errors (dc35bef)
    • Corrected dashboard statistics calculation in DashboardService to include pending/failed job counts.
    • Reset and re-seeded the database (dev.db) due to missing tables identified during debugging.
    • Updated package.json to configure Prisma seeding.
    • Fixed incorrect API path for login requests in login.html (/api/users/login -> /api/auth/login).
    • Corrected the API response structure for crawl jobs in DashboardController to include proper pagination details.
    • Added debug logging to DashboardService.getStatistics.
    • Updated TASK.md to mark dashboard investigation as complete.

[0.1.0] - YYYY-MM-DD

Added

  • Initial release