Mirror WebScrapper is a distributed, three-tier architecture. Instead of manually writing scrapers for every new website, this system uses a multimodal AI (Vision + Text) to analyze a page and automatically generate the extraction logic.
Virtual Machine (VM-A) that serves as a target environment. The main app (webscrapper) running on Server (VM-B). The Server-AI, running a custom Python intermediary script that facilitates communication between the App Server (VM-B) and LM-Studio (Server-AI). The code generation and text extraction happens here.
To build a "Scraper Sandbox" on a dedicated Virtual Machine (VM-A) that serves as a target environment. This ensures that the web-scraping and AI-analysis logic can be developed and tested without violating terms of service, hitting rate limits, or being affected by network instability.
Mirroring a modern website is not as simple as saving HTML. Modern sites rely on complex relative paths, asset dependencies, and font-icons. The main challenges addressed here were:
?v=3.2.1) from CSS files to match local file system naming conventions.index.html dashboard that scans the directory structure to provide a clickable interface for all cloned resources.Tech Stack Used:
By the end of this step, a mirror from the original site (books.toscrape.com) was successfully created in a browser. This includes:
/catalogue/product/ structure.
A custom Python intermediary script facilitates communication between the App Server (VM-B) and LM-Studio (Server-AI). This bridge handles API requests and injects specialized system prompts and multi-shot instructions to maximize the accuracy of the AI-generated scraping functions.
The system ingests a target URL and generates a "Naked HTML" version. By stripping scripts, styles, and fonts, the app presents a clean structural view. The user interacts with this simplified layout to select specific data points (e.g., Price, Title) they wish to extract.
The sanitized HTML and the page screenshot (to be used later) are bundled and transmitted to the Server-AI. The Bridge Script formats this data for the LLM. The app then provides a live preview of the generated Python tool, allowing the user to perform "Human-in-the-Loop" (HITL) refinements through natural language chat or manual code edits.
Users can execute the generated function within a built-in sandbox to verify the output. Once the data extraction meets the required accuracy, the function is committed to the local library.
A centralized repository where users manage, version-control, and retrieve previously generated scraping scripts.
The final execution engine. Users can:
Output: The system auto-generates a domain-specific directory containing a SQLite3 database and organized image folders.
The 1080 Ti-powered server hosts three distinct API endpoints via the Bridge Script:
VM-B manages the global state and workflow orchestration. To prevent hardware bottlenecks, the architecture uses Asynchronous Task Offloading:
Non-Blocking Logic: The main application does not wait for OCR completion; instead, it streams images to a temporary buffer on Server-AI. As the AI completes each image, the data is pushed back to the database on VM-B, and the temporary file is purged to maintain a lean footprint.