Automatically Generating Content Inventories (Part 1)

Introduction

I’ll admit it, in my youth (say, a few days ago) I’d often generate a content inventory by hand. I’d simply open a new spreadsheet and start working my through the site until I was done chronicling the content. I chose this path because of its simplicity and because many of the websites I work on are quite small.

This month I’m working with a client on several sites, and the total number of pages is close to one thousand. Sure, I’ll likely still want to view each of the pages just in case the title and description fail to reflect the content (or it’s an asset that lacks this meta information), but automatically generating the url, file type, title and description should save a tremendous amount of time.

To automatically generate a content inventory, we’ll break the work up into three steps:

  1. Create a local copy of the website (covered in this post.)
  2. Create a list of broken links (covered in this post.)
  3. Parse the local files to create a spreadsheet (covered in the next post.)

Using Wget To Create A Local Copy Of Your Website

The GNU wget package makes it very easy to generate a local copy of a website. You can use it to crawl your entire website and download all of the linked assets (html files, images, pdf’s, etc.) While you can install wget on Windows and Macs, when I’m using one of these systems I just run a VM of my favorite Linux distro, which already has wget installed. I found a great tutorial that demonstrates how to create a mirror of a website with wget, and it’s most basic usage is illustrated by the command below.


$ wget -m http://www.site.com/

There are many more options, but the command above would create the directory “www.site.com” and put all of the linked files from your website in that directory.

Using Wget To Find Broken Links (404)

Next, let’s make sure we have a list of the broken links in the website. After all, a content inventory is supposed to guide future work, and all future work should take into account content that’s either missing or unfindable.

Again, making use of wget greatly simplifies this task, and I found another great tutorial that outlines using wget to find broken links. The basic command structure is listed below.


$ wget --spider -o file.log -r -p http://www.site.com

Once completed, you have a file that you can grep / search for occurrences of 404 errors.

A Bash Script To Automate Simplify Things

Of course, I’m old and I forget things easily. I can’t be expected to remember these commands for the next five minutes, let alone the next time I’m creating a content inventory a month from now. Additionally, instead of using multiple calls to wget, we can merge these operations into one roundtrip. Here’s a simple bash script that automates the creation of the local mirror of the website and the log file with broken link information.


#!/bin/bash

# remember to run chmod +x myFileNameWhateverItIs

# store domain
echo "Enter website domain (e.g., www.site.com):"
read domain
# store url
url="http://$domain"
# system status
echo "Creating mirror..."
# create local mirror
wget -m -w 2 -o wget.log -p $url
# system status
echo "Creating broken link log..."
# store broken link(s) info
grep -n -B 2 '404 Not Found' wget.log > wget-404.log
# system status
echo "Process completed."

If I store the code above in the file “local-site.sh” (and call chmod +x on it), I can call it directly to create a local copy of the website and a log file containing broken links:


$ ./local-site.sh
> Enter website domain (e.g., www.site.com):
> www.example.com
> Creating mirror...
> Creating broken link log...
> Process completed.

I’ll cover parsing of the local files to create a content inventory spreadsheet in the next post.