A Python Script to Scrape XML Sitemaps for URLs

JP Garbaccio · · 2 min read · SEO

A robust and simple Python script that scrapes XML sitemaps and extracts all URLs found within <loc> tags, following the Sitemaps.org Protocol.

This script supports two types of inputs:

  1. A sitemap index URL (e.g. https://www.google.com/sitemap.xml)
  2. A single sitemap URL (e.g. https://www.google.com/gmail/sitemap.xml)

If you want to speed up your workflow, you can input the main sitemap index URL, extract all child sitemap URLs, and then re-input those into the tool to fetch every listed URL within them.

👉 Download the code here (insert your GitHub or file URL)


Tutorial For Use

The script is written in Python 3, so make sure it’s installed on your system.

Step 1: Download the Repository

Clone or download the repository from GitHub to your local machine.

Step 2: Install Required Libraries

Open a terminal and run:

[PYTHON SCRIPT GOES HERE]

Once this is installed, you type 'clear' on Mac or 'cls' on Windows.

This will clear the terminal window.

Next, open the 'input_urls.csv' file and paste your sitemap URLs into the first column.

Do not add a header row, the script will run from the first row.

You can save the file, and then run the script from VSCode or from the terminal.

Check the logs to check the scripts activity, and, the folder you ran it in for the extracted URLs.

It's that simple! I will post a more updated guide soon.