Description
THE PROGRAM CONSISTS OF:
Gargoyle Site Mapper (SEO XML Generator) contained in GargoyleSiteMapper0.1.5.py
SOFTWARE PURPOSE:
Generate a sitemap.xml for your website with webpage and image end-points in order to enable Search Engine crawlers to find and map all your webpages.
INTERFACE:
Graphical User Interface
REQUIREMENTS – ADDITIONAL INSTALLATIONS (OPEN-SOURCE):
Python 3.6+ (This program was extensively tested on Python 3.12.3)
lxml==6.0.2 (pip install lxml)
REQUIREMENTS – ALREADY INCLUDED IN YOUR PYTHON 3.12.x INSTALLATION:
SYSTEM & OS: sys, os, time, datetime, random
NETWORKING: ssl, urllib.request, urllib.parse, urllib.robotparser, queue, threading
GUI: tkinter
DATA HANDLING: email.utils
– Memory (RAM): Minimum 2GB. The program is very light, but if you crawl a site with 10,000+ pages, the URL queue will sit in your RAM.
– Internet Connection: A stable connection is required. If your internet drops, the “Timeout” safety feature will trigger, and the crawler will mark those pages as an “Error.”
– The software needs Write Permissions in the folder where the script is located. It creates a new directory for every domain it crawls. It saves .xml and .txt files inside those directories. Ensure you are not running the script from a “Read-Only” location like a protected System32 folder or a locked USB drive.
SOFTWARE DESCRIPTION:
The Gargoyle Sitemap Generator v0.1.5 is a professional-grade web crawler and XML sitemap generator designed to map the architecture of a website while prioritizing stealth, ethics, and diagnostic accuracy. It acts as an automated explorer that “reads” a website like a human would, but at the speed of multiple concurrent threads.
Here is a breakdown of how the program operates and the specific safety layers I have engineered into its architecture.
1. Functional Overview
The program starts at a “Seed URL” and parses the HTML to find every internal link. It follows these links recursively, building a tree-like map of the entire domain.
Engine: A multi-threaded orchestrator that manages a “Queue” (to-do list) and a “Checked” dictionary (history).
Output: It produces a standard sitemap.xml for SEO and a broken_links.txt for site maintenance.
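To illustrate the sitemap output, here is a minimal sketch of turning a set of crawled URLs into sitemap XML. Gargoyle itself relies on lxml; this sketch swaps in the standard-library xml.etree.ElementTree so it runs without extra installs, and the function name build_sitemap is an assumption made for illustration, not the program's actual code.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return sitemap XML (as a string) with one <url><loc> entry per unique URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in sorted(set(urls)):  # de-duplicate and keep a stable order
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = address  # special characters are escaped on output
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

if __name__ == "__main__":
    print(build_sitemap([
        "https://example.com/privacy-policy",
        "https://example.com/",
    ]))
```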
2. Multi-Layered Safety Features
To ensure the program is a “Good Actor” on the web and doesn’t get your IP address banned or crash your server, it uses several safety protocols:
A. The “Good Citizen” Protocol (Robots.txt)
Before the first link is even crawled, the program fetches the site’s robots.txt file. It uses the RobotFileParser to ensure it never enters “Disallowed” directories. If a site owner has marked a folder as private, Gargoyle will skip it automatically.
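The robots.txt check can be sketched as follows. This is a minimal example, not Gargoyle's actual code: the robots.txt content, the helper names and the "GargoyleBot" user-agent string are assumptions. The rules are parsed from a string here so the example runs offline; a live crawl would call set_url() and read() on the parser instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A live crawl would instead use
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_text):
    """Parse robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def is_allowed(parser, url, user_agent="GargoyleBot"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/private/report.html"))  # False
print(is_allowed(parser, "https://example.com/blog/post-1"))          # True
```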
B. Human-Mimicry (Rate Limiting)
Aggressive bots can hit a server many times a second, which can look like a DDoS attack. Gargoyle uses Randomized Pauses (min_p and max_p). After every page visit, each thread sleeps for a random interval. This makes the traffic look more like natural human browsing than a machine.
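A randomized pause of this kind might look like the sketch below. The names min_p and max_p come from the description above; the function name polite_pause and the injectable sleep argument (handy for testing) are assumptions.

```python
import random
import time

def polite_pause(min_p, max_p, sleep=time.sleep):
    """Sleep for a random interval between min_p and max_p seconds; return the delay used."""
    if min_p < 0 or max_p < min_p:
        raise ValueError("expected 0 <= min_p <= max_p")
    delay = random.uniform(min_p, max_p)
    sleep(delay)
    return delay

# Each worker thread would call this after every page visit.
delay = polite_pause(0.01, 0.02)  # tiny values so the demo finishes quickly
print(f"slept for {delay:.3f}s")
```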
C. Domain Locking (The “Fence”)
To prevent the crawler from accidentally trying to “map the entire internet,” it uses strict Netloc Validation. If it finds a link to Facebook, Twitter, or an external blog, it recognizes that the “Network Location” doesn’t match your target domain and refuses to follow it.
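As a rough illustration of the netloc check, the sketch below resolves each link against the page it was found on and compares network locations. The function name is_internal is an assumption, and this simple version treats www.example.com and example.com as different hosts; the program's own matching rule may differ.

```python
from urllib.parse import urljoin, urlparse

def is_internal(link, page_url, target_netloc):
    """Return True if link (possibly relative) points to the target domain over HTTP(S)."""
    absolute = urljoin(page_url, link)  # resolve relative links such as "/about"
    parsed = urlparse(absolute)
    return (
        parsed.scheme in ("http", "https")
        and parsed.netloc.lower() == target_netloc.lower()
    )

page = "https://example.com/blog/"
print(is_internal("/about", page, "example.com"))                        # True
print(is_internal("https://www.facebook.com/page", page, "example.com")) # False
print(is_internal("mailto:info@example.com", page, "example.com"))       # False
```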
D. Keyword & Backend Filtering
The program includes a “Blocklist” of keywords (e.g., /admin, wp-login, ?). This prevents the crawler from getting stuck in “Spider Traps” (infinite loops caused by calendar filters) or attempting to access sensitive login portals.
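A blocklist filter along these lines can be sketched with a simple substring test. The three keywords are the examples given above; the real list is probably longer, and the function name is_blocked is an assumption.

```python
# Example keywords from the description; the real blocklist may contain more entries.
EXCLUDE_KEYWORDS = ["/admin", "wp-login", "?"]

def is_blocked(url, keywords=EXCLUDE_KEYWORDS):
    """Return True if the URL contains any blocklisted keyword (case-insensitive)."""
    lowered = url.lower()
    return any(keyword.lower() in lowered for keyword in keywords)

print(is_blocked("https://example.com/wp-login.php"))          # True
print(is_blocked("https://example.com/calendar/?month=2031"))  # True
print(is_blocked("https://example.com/about"))                 # False
```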
E. Thread-Safe “Locking”
When multiple threads try to write to the same data structure at once, “Race Conditions” can occur, leading to data corruption. Gargoyle guards its shared data with a mutual-exclusion lock created via threading.Lock() (a separate mechanism from Python’s Global Interpreter Lock). This ensures that only one thread can update the “Checked” dictionary at a time, keeping the crawl history consistent.
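The locking pattern can be sketched like this. The names checked, checked_lock and mark_checked are assumptions for illustration; the point is that the "have we seen this URL?" test and the write happen inside one locked block, so two threads can never both claim the same page.

```python
import threading

checked = {}                      # URL -> HTTP status (the crawl history)
checked_lock = threading.Lock()   # guards every read-modify-write on `checked`

def mark_checked(url, status):
    """Record url once; return True only for the first thread that records it."""
    with checked_lock:
        if url in checked:
            return False
        checked[url] = status
        return True

def worker(urls):
    for url in urls:
        mark_checked(url, 200)

# Eight threads all try to record the same 100 URLs.
urls = [f"https://example.com/page-{i}" for i in range(100)]
threads = [threading.Thread(target=worker, args=(urls,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(checked))  # 100
```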
F. The “Traffic Light” (Pause/Resume/Stop)
Unlike simpler scripts that you have to “kill” (potentially losing data), Gargoyle uses Condition Variables.
Pause: Gently tells threads to finish their current task and wait without consuming CPU.
Stop: Triggers an immediate “Safe Exit” that stops the engine and saves everything found up to that millisecond.
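One way to build such a “Traffic Light” on a condition variable is sketched below. The class name CrawlControl and its methods are assumptions, not the program's actual interface. A worker would call wait_if_paused() between pages and leave its loop when it returns False.

```python
import threading

class CrawlControl:
    """Pause/resume/stop switch shared by all worker threads."""

    def __init__(self):
        self._cond = threading.Condition()
        self._paused = False
        self._stopped = False

    def pause(self):
        with self._cond:
            self._paused = True

    def resume(self):
        with self._cond:
            self._paused = False
            self._cond.notify_all()  # wake every sleeping worker

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify_all()  # paused workers must wake up to exit

    def wait_if_paused(self):
        """Block (without using CPU) while paused; return False once stop was requested."""
        with self._cond:
            while self._paused and not self._stopped:
                self._cond.wait()
            return not self._stopped

control = CrawlControl()
print(control.wait_if_paused())  # True: not paused, not stopped
control.stop()
print(control.wait_if_paused())  # False: workers should exit
```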
3. Diagnostic Safety (Broken Link Detection)
The program treats errors (404, 403, 500) as valuable data rather than failures. By tracking the Referrer, it ensures that even if a page is “broken,” you know exactly which healthy page contains the bad link. This allows you to repair the site without manual searching.
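Referrer tracking can be illustrated with the short sketch below. The record layout, the function names and the report line format are assumptions; the actual broken_links.txt layout may differ.

```python
def record_result(broken, url, status, referrer):
    """Append an entry to `broken` when the HTTP status signals an error (4xx or 5xx)."""
    if status >= 400:
        broken.append({"url": url, "status": status, "referrer": referrer})

def format_report(broken):
    """Render one line per broken link, naming the page that contains the bad link."""
    return "\n".join(
        f"{entry['status']} {entry['url']} (linked from {entry['referrer']})"
        for entry in broken
    )

broken = []
record_result(broken, "https://example.com/old-page", 404, "https://example.com/blog")
record_result(broken, "https://example.com/about", 200, "https://example.com/")
print(format_report(broken))
# 404 https://example.com/old-page (linked from https://example.com/blog)
```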
TECHNICAL SPECIFICATIONS & USER MANUAL:
1. Core Architecture
Gargoyle is built on a Non-Blocking Multithreaded Orchestrator. It utilizes a “Producer-Consumer” model where the main engine manages a centralized queue of URLs, and worker threads consume those URLs to perform HTTP requests and HTML parsing.
Concurrency Model: threading.Thread with a threading.Condition synchronization construct.
Parsing Engine: lxml for high-performance XPath-based link extraction.
Networking: urllib.request with custom User-Agent rotation headers.
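The queue-and-workers model described in this section can be sketched as follows. To stay runnable offline, a hypothetical in-memory FAKE_SITE dictionary stands in for the HTTP request and lxml parsing step, and the function name crawl and its parameters are assumptions.

```python
import queue
import threading

# Hypothetical site map: page -> links found on that page.
FAKE_SITE = {
    "/": ["/about", "/shop"],
    "/about": ["/"],
    "/shop": ["/shop/item-1", "/about"],
    "/shop/item-1": [],
}

def crawl(seed, fetch_links, num_threads=4):
    """Visit every page reachable from seed; return the set of discovered URLs."""
    todo = queue.Queue()      # the to-do list
    checked = set()           # the history
    lock = threading.Lock()

    def enqueue(url):
        with lock:            # check-and-add must be atomic
            if url in checked:
                return
            checked.add(url)
        todo.put(url)

    def worker():
        while True:
            url = todo.get()
            if url is None:   # sentinel: no more work
                todo.task_done()
                return
            try:
                for link in fetch_links(url):
                    enqueue(link)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    enqueue(seed)
    todo.join()               # returns once every queued URL has been processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return checked

print(sorted(crawl("/", lambda url: FAKE_SITE.get(url, []))))
# ['/', '/about', '/shop', '/shop/item-1']
```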
2. Safety & Ethics Protocol (The “Politeness” Engine)
The program is engineered to adhere to the Web Robot Standards.
| Feature | Technical Implementation | Purpose |
|---|---|---|
| Robots.txt | urllib.robotparser | Prevents crawling of private or sensitive server directories. |
| Rate Limiting | random.uniform() | Prevents server strain by injecting human-like delays. |
| Domain Lock | urlparse().netloc | Ensures the bot never wanders onto third-party websites. |
| Keyword Filter | exclude_keywords list | Skips “Spider Traps” and administrative login portals. |
3. User Manual
A. Setting Up the Crawl
1. Target URL: Enter the full domain (e.g., https://example.com).
2. Threads: Recommended 4–6 for standard shared hosting; 8–10 for dedicated servers.
3. Timeout: Set to a minimum of 20s. This prevents the program from hanging on slow, unresponsive pages.
4. Pauses: Set Min Pause to 0.5 and Max Pause to 3.0 to maintain a steady, non-aggressive flow.
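The recommendations above can be expressed as a small sanity check. This is only a sketch: the function name check_settings and the warning texts are assumptions, and the thresholds are taken from this manual (20s timeout, 20+ threads), not from the program's source.

```python
def check_settings(threads, timeout, min_pause, max_pause):
    """Return a list of warnings; an empty list means the values follow the manual."""
    warnings = []
    if threads < 1:
        warnings.append("Threads must be at least 1.")
    elif threads >= 20:
        warnings.append("20+ threads may cause firewalls to block your IP address.")
    if timeout < 20:
        warnings.append("Timeout is below the recommended minimum of 20s.")
    if min_pause < 0 or max_pause < min_pause:
        warnings.append("Pauses must satisfy 0 <= Min Pause <= Max Pause.")
    elif max_pause == 0:
        warnings.append("Zero pauses disable the rate limiting.")
    return warnings

print(check_settings(threads=4, timeout=20, min_pause=0.5, max_pause=3.0))  # []
for message in check_settings(threads=25, timeout=5, min_pause=0, max_pause=0):
    print("WARNING:", message)
```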
B. Managing the Session
Pause/Resume: Use this if you notice your internet connection is lagging or if you need to temporarily free up system resources. Worker threads will complete their current page and “sleep” until resumed.
Stop: This is the “Graceful Exit.” It tells the engine to cease all operations and immediately compile the sitemap.xml and broken_links.txt using only the data collected so far.
C. Understanding the Output
Every crawl creates a folder named after the domain (e.g., mysite_co_za). Inside, you will find:
Sitemap (XML): Upload this to your root directory and submit it to Google Search Console. Remember to remove the timestamp in front. The filename must be in the format sitemap.xml for search engines to read it.
Repair Report (TXT): Open this to see a list of broken links. It identifies the Source Page, making it easy for you to log into your CMS and fix the typo.
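For illustration, the folder and file naming could be derived as in the sketch below. The folder rule (dots replaced by underscores) matches the mysite_co_za example; the timestamp format and the function names are assumptions, since the exact format the program uses is not documented here.

```python
import os
from datetime import datetime
from urllib.parse import urlparse

def domain_folder(url):
    """Turn the URL's host into a folder name, e.g. mysite.co.za -> mysite_co_za."""
    return urlparse(url).netloc.lower().replace(".", "_")

def output_path(url, now, base_name="sitemap.xml"):
    """Build <domain folder>/<timestamp>_<base_name>; the timestamp format is hypothetical."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return os.path.join(domain_folder(url), f"{stamp}_{base_name}")

print(domain_folder("https://mysite.co.za/shop"))  # mysite_co_za
print(output_path("https://mysite.co.za", datetime.now()))
```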
4. Safety Warnings
Note: While Gargoyle is designed to be safe, running too many threads (20+) with zero pauses may cause some firewalls (like Cloudflare) to temporarily block your IP address. Always start with the default settings.
5. Installation:
We highly recommend creating a virtual environment using: python -m venv <env_name>. The environment can be activated from within the venv folder you just created using source ./bin/activate
Some variants of Linux will require you to install venv using sudo apt install python3-venv
– Graphical User Interface (Tkinter) allows all settings and URL to be changed inside the program
– Set script to executable: `chmod +x GargoyleSiteMapper0.1.5.py`.
– Run the script: `python3 GargoyleSiteMapper0.1.5.py` or `python GargoyleSiteMapper0.1.5.py` (depending on how your system is configured).
6. Example Screenshot:
Gargoyle Site Mapper (SEO XML Generator) offers a simplified design with a single screen where you enter the URL to be crawled. Please take care not to change the settings for the minimum and maximum wait period, timeout or threads unless you are sure. These variables were designed to protect you from being marked by your server as a potential DDoS attacker. If you run too many simultaneous threads, your server will be overrun with requests. If you set the random pauses too low, you defeat the purpose of the program, which is to map your website while acting like a considerate human and not overrunning the server with requests.

Gargoyle Site Mapper Screen Shot
Output will be created as sitemap.xml with a date and timestamp in front of the file name and stored in a subfolder named after the website you crawled. The date and timestamp were added to allow you to compare sitemaps generated at different times.
You will notice email addresses and phone numbers will not be mapped. To do this would open you up to unsolicited marketing.
Before uploading the file to your webserver, rename the file to ‘sitemap.xml’.
Please take note, this program was designed for you to map your own website, for better search engine exposure, and not as a hacking tool. Please use the program for the purpose intended.
6. Example Output:
</url>
<url>
<loc>data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/privacy-policy</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/wp-content/uploads/unnamed-1.jpg</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/shop/cat_software/gargoyle-site-mapper</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/shop/?add-to-cart=15778</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-category/cat_software/?add-to-cart=15778</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/wp-content/uploads/Gargoyle-Site-Mapper.png</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-tag/linux</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-tag/python</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-tag/software</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-tag/windows</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/wp-content/uploads/Gargoyle-Site-Mapper-300x164.png</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-tag/python/?add-to-cart=15778</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-tag/linux/?add-to-cart=15778</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-tag/software/?add-to-cart=15778</loc>
</url>
<url>
<loc>https://www.bringmesome.co.za/product-tag/windows/?add-to-cart=15778</loc>
</url>
</urlset>
7. Guarantees and Warranties:
The program will run as intended, provided you installed it exactly as instructed, on the operating systems specified at the time we offer it for sale.
We take no responsibility for the software not running on future versions of your operating system or on any operating system not listed in the specifications of this software, nor do we take responsibility for changes in future technology that may make this program obsolete or inoperable.
We provide this program in good faith and with good intentions, and are not responsible for unscrupulous manipulation of the code for nefarious purposes.
Beyond the specific assurances provided in the first paragraph of this section, we offer no further guarantee regarding the software’s suitability for any particular purpose.
Downloaded software is not refundable. Our description is clear as to what the program can and cannot do. If you expected something other than what you downloaded, you did not read the information properly, and we are not responsible for that. By downloading the program, you gain access to our code and knowledge of its inner workings, therefore no refund will be given on purchases.
Once you have paid for the code, you will be able to download it freely as many times as you wish for a one-year period. Please make your own backup copy for long-term retention. After one year this code may (or may not) be available on our website, and you will have no claim against us for replacement, refund, or damages, or for any other reason.
We always try to assist our customers, but if there are costs involved to resupply an archived version of a program, the cost will be for your account.
