← Back to all projects

Nowhere Mirror WebScrapper

Python AI / LLM BeautifulSoup Multimodal Vision OCR SQLite3 Multi-Threading Distributed Architecture
Focus: AI-Powered Web Scraping & Distributed Architecture

Overview

Mirror WebScrapper is a distributed, three-tier architecture. Instead of manually writing scrapers for every new website, this system uses a multimodal AI (Vision + Text) to analyze a page and automatically generate the extraction logic.

Virtual Machine (VM-A) that serves as a target environment. The main app (webscrapper) running on Server (VM-B). The Server-AI, running a custom Python intermediary script that facilitates communication between the App Server (VM-B) and LM-Studio (Server-AI). The code generation and text extraction happens here.

Project Demo

Video — Mirror WebScrapper full walkthrough demo

Step 1 — Creating a High-Fidelity Web Server Mirror

To build a "Scraper Sandbox" on a dedicated Virtual Machine (VM-A) that serves as a target environment. This ensures that the web-scraping and AI-analysis logic can be developed and tested without violating terms of service, hitting rate limits, or being affected by network instability.

Mirroring a modern website is not as simple as saving HTML. Modern sites rely on complex relative paths, asset dependencies, and font-icons. The main challenges addressed here were:

Tech Stack Used:

By the end of this step, a mirror from the original site (books.toscrape.com) was successfully created in a browser. This includes:

Mirror Result
Fig 1 — High-fidelity web mirror of books.toscrape.com

Step 2 — The AI Bridge Script

A custom Python intermediary script facilitates communication between the App Server (VM-B) and LM-Studio (Server-AI). This bridge handles API requests and injects specialized system prompts and multi-shot instructions to maximize the accuracy of the AI-generated scraping functions.

AI Bridge Script Architecture
Fig 2 — AI Bridge Script connecting VM-B and LM-Studio

Step 3 — Main Application Workflow

DOM Sanitization & UI Selection

The system ingests a target URL and generates a "Naked HTML" version. By stripping scripts, styles, and fonts, the app presents a clean structural view. The user interacts with this simplified layout to select specific data points (e.g., Price, Title) they wish to extract.

DOM Sanitization
Fig 3 — DOM sanitization
Naked HTML View
Naked HTML View

Selecting Specific Data Points

Data Point Selection
Fig 4 — User selecting target data points on the cleaned layout
Selection Images
Fig 5 — Selection images for scraping targets

Multimodal Payload Transmission

The sanitized HTML and the page screenshot (to be used later) are bundled and transmitted to the Server-AI. The Bridge Script formats this data for the LLM. The app then provides a live preview of the generated Python tool, allowing the user to perform "Human-in-the-Loop" (HITL) refinements through natural language chat or manual code edits.

Multimodal Payload
Fig 6 — Multimodal payload transmission to Server-AI
HITL Refinement
Fig 7 — Human-in-the-Loop refinement interface

Validation & Sandbox Testing

Users can execute the generated function within a built-in sandbox to verify the output. Once the data extraction meets the required accuracy, the function is committed to the local library.

Sandbox Testing
Fig 8 — Built-in sandbox for validating generated scraping code
Sandbox Output
Fig 9 — Sandbox output verification before committing to library

Function Library

A centralized repository where users manage, version-control, and retrieve previously generated scraping scripts.

Function Library
Fig 10 — Centralized function library

Distributed Web-Scraper

The final execution engine. Users can:

Output: The system auto-generates a domain-specific directory containing a SQLite3 database and organized image folders.

Distributed Scraper
Fig 11 — Distributed web-scraper execution engine
Scraper Results
Fig 12 — Scraper results and generated output structure

System Architecture

Server-AI (The Logic Tier)

The 1080 Ti-powered server hosts three distinct API endpoints via the Bridge Script:

App-Server (The Orchestration Tier — VM-B)

VM-B manages the global state and workflow orchestration. To prevent hardware bottlenecks, the architecture uses Asynchronous Task Offloading:

Non-Blocking Logic: The main application does not wait for OCR completion; instead, it streams images to a temporary buffer on Server-AI. As the AI completes each image, the data is pushed back to the database on VM-B, and the temporary file is purged to maintain a lean footprint.

Database Creation

Database Creation
Fig 13 — Auto-generated SQLite3 database structure