Automatically Generating Content Inventories (Part 1)

Introduction

I’ll admit it: in my youth (say, a few days ago) I’d often generate a content inventory by hand. I’d simply open a new spreadsheet and start working my way through the site until I was done chronicling the content. I chose this path because of its simplicity and because many of the websites I work on are quite small.

This month I’m working with a client on several sites, and the total number of pages is close to one thousand. Sure, I’ll likely still want to view each of the pages just in case the title and description fail to reflect the content (or it’s an asset that lacks this meta information), but automatically generating the URL, file type, title, and description should save a tremendous amount of time.

To automatically generate a content inventory, we’ll break the work up into three steps:

  1. Create a local copy of the website (covered in this post.)
  2. Create a list of broken links (covered in this post.)
  3. Parse the local files to create a spreadsheet (covered in the next post.)

Using Wget To Create A Local Copy Of Your Website

The GNU wget package makes it very easy to generate a local copy of a website. You can use it to crawl your entire website and download all of the linked assets (HTML files, images, PDFs, etc.) While you can install wget on Windows and Macs, when I’m using one of these systems I just run a VM of my favorite Linux distro, which already has wget installed. I found a great tutorial that demonstrates how to create a mirror of a website with wget, and its most basic usage is illustrated by the command below.


$ wget -m http://www.site.com/

There are many more options, but the command above would create the directory “www.site.com” and put all of the linked files from your website in that directory.

Using Wget To Find Broken Links (404)

Next, let’s make sure we have a list of the broken links in the website. After all, a content inventory is supposed to guide future work, and all future work should take into account content that’s either missing or unfindable.

Again, making use of wget greatly simplifies this task, and I found another great tutorial that outlines using wget to find broken links. The basic command structure is listed below.


$ wget --spider -o file.log -r -p http://www.site.com

Once completed, you have a file that you can grep / search for occurrences of 404 errors.

A Bash Script To Simplify Things

Of course, I’m old and I forget things easily. I can’t be expected to remember these commands for the next five minutes, let alone the next time I’m creating a content inventory a month from now. Additionally, instead of using multiple calls to wget, we can merge these operations into one roundtrip. Here’s a simple bash script that automates the creation of the local mirror of the website and the log file with broken link information.


#!/bin/bash

# remember to run chmod +x myFileNameWhateverItIs

# store domain
echo "Enter website domain (e.g., www.site.com):"
read -r domain
# store url
url="http://$domain"
# system status
echo "Creating mirror..."
# create local mirror
wget -m -w 2 -o wget.log -p "$url"
# system status
echo "Creating broken link log..."
# store broken link(s) info
grep -n -B 2 '404 Not Found' wget.log > wget-404.log
# system status
echo "Process completed."

If I store the code above in the file “local-site.sh” (and call chmod +x on it), I can call it directly to create a local copy of the website and a log file containing broken links:


$ ./local-site.sh
> Enter website domain (e.g., www.site.com):
> www.example.com
> Creating mirror...
> Creating broken link log...
> Process completed.

I’ll cover parsing of the local files to create a content inventory spreadsheet in the next post.

Isolating Side Effects Using Isolation Sets

A program or function is said to have side effects if it impacts the system state through a means other than its return value or reads the system state through a means other than its arguments. Every meaningful program eventually requires some form of side effect(s), such as writing output to the standard output file-stream or saving a record to a database. That said, working with pure functions, which lack side effects and are consistent, has many advantages. How can the practical necessity of side effects be reconciled with the benefits of avoiding them?

Your Special Island

If a program’s side effects are isolated in a small, known subset of the codebase, we can reap the benefits of working in their absence throughout large sections of the codebase whilst providing their practical application when needed. Indeed, functional programming languages like Haskell facilitate this approach by isolating side effects directly through language features / limitations. But what about the many languages that don’t directly facilitate side effect isolation? How can we achieve the same effects?

We Will All Go Down Together

Let’s begin with a typical example involving a non-isolated side effect. We’ll work through a small PHP function for sending email that resembles countless other examples online.* Because the side effect (the call to the mail function) is not isolated, the entire function is impure, making it all very difficult to test.


<?php
function sendSalesInquiry($from, $message)
{
  // validate email
  if (!filter_var($from, FILTER_VALIDATE_EMAIL)) {
    return "<p>Email address invalid.</p>";
  }
  // init vars
  $to = "sales@company.com";
  $subject = "Sales Inquiry";
  $headers = "From: $from";
  // attempt to send
  if (mail($to, $subject, $message, $headers)) {
    return "<p>Email successfully sent.</p>";
  } else {
    return "<p>Email delivery failed.</p>"; 
  }
}
?>

And They Parted The Closest Of Friends

To isolate the side effect, we’ll add some all-powerful indirection by refactoring the email function into multiple functions. Using a combination of a potentially-pure function with two fall-through functions allows us to easily, cleanly isolate the side effect in this example. When using this combination of function types specifically to isolate side effects, I refer to them collectively as an isolation set.

<?php
// potentially-pure function
function sendSalesInquiry($from, $message, $mailer)
{
  // validate email
  if (!filter_var($from, FILTER_VALIDATE_EMAIL)) {
    return "<p>Email address invalid.</p>";
  }
  // init vars
  $to = "sales@company.com";
  $subject = "Sales Inquiry";
  $headers = "From: $from";
  // attempt to send
  if ($mailer($to, $subject, $message, $headers)) {
    return "<p>Email successfully sent.</p>";
  } else {
    return "<p>Email delivery failed.</p>";
  }
}
// fall-through function provides implementation
function sendSalesInquiryMail($from, $message)
{
  // call potentially-pure function passing in mailer
  // the anonymous fall-through function encapsulates the side effect
  return sendSalesInquiry($from, $message, $mailer = function($to, $subject, $message, $headers) {
    return mail($to, $subject, $message, $headers);
  });
}
?>

The original example has been refactored into one potentially-pure function to handle the logic and initialization; and two fall-through functions, one to encapsulate the side effect, and one to provide the default behavior (in this case the mailer function) for production.**

When testing the code, the sendSalesInquiry() function becomes the natural entry point, as it contains all of the important logic and initialization to be tested. Because the function is potentially-pure, passing in pure arguments causes the function to behave like a pure function, yielding better testing and clarity.
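To make the testing story concrete, here’s a rough sketch of the same shape ported to JavaScript. Everything in it is an assumption for illustration: the simple regex is only a stand-in for PHP’s filter_var(), and the stub mailers are hypothetical test doubles that never send anything.

```javascript
// Potentially-pure function: all of the logic, no side effects of its own.
// The regex is a deliberately simple stand-in for PHP's filter_var().
function sendSalesInquiry(from, message, mailer) {
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(from)) {
    return "<p>Email address invalid.</p>";
  }
  const to = "sales@company.com";
  const subject = "Sales Inquiry";
  const headers = "From: " + from;
  return mailer(to, subject, message, headers)
    ? "<p>Email successfully sent.</p>"
    : "<p>Email delivery failed.</p>";
}

// Hypothetical pure stubs for testing: they ignore their arguments and
// return a fixed result, so no email is ever sent.
const mailerSucceeds = () => true;
const mailerFails = () => false;

// Each branch can be reached just by changing the arguments.
console.log(sendSalesInquiry("not-an-email", "Hi", mailerSucceeds));
console.log(sendSalesInquiry("pat@example.com", "Hi", mailerSucceeds));
console.log(sendSalesInquiry("pat@example.com", "Hi", mailerFails));
```

Because the stubs are pure, every call above is repeatable, and none of them touches a mail server.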

Music Left To Write

Although the example only dealt with one side effect, an isolation set can be used to isolate any number of side effects. We could extend the example above and add a spam-checking algorithm. We’d just have to add another fall-through function for the side effect.

<?php
// potentially-pure function
function sendSalesInquiry($from, $message, $mailer, $isSpam)
{
  // validate email
  if (!filter_var($from, FILTER_VALIDATE_EMAIL)) {
    return "<p>Email address invalid.</p>";
  }
  // check for spam
  if ($isSpam($from, $message)) {
    return "<p>Don't call us, we'll call you.</p>";
  }
  // init vars
  $to = "sales@company.com";
  $subject = "Sales Inquiry";
  $headers = "From: $from";
  // attempt to send
  if ($mailer($to, $subject, $message, $headers)) {
    return "<p>Email successfully sent.</p>";
  } else {
    return "<p>Email delivery failed.</p>";
  }
}

function sendSalesInquiryMail($from, $message)
{
  // call potentially-pure function passing in the mailer and the spam checker
  return sendSalesInquiry(
    $from,
    $message,
    $mailer = function($to, $subject, $message, $headers) {
      return mail($to, $subject, $message, $headers);
    },
    $isSpam = function($from, $message) {
      $spamChecker = new SpamChecker();
      // this analysis could involve any number of database queries, networking requests, etc.
      return $spamChecker->isSpam($from, $message);
    }
  );
}
?>

It’s Nine O’Clock On A Saturday

What? Doesn’t getting your side effects isolated put you in a mood for a melody?

* I’m not enamored with returning HTML markup in this type of function, but it represents a common example I found online, and it’s for a programming language that people don’t typically associate with functional programming practices, so the example works well for the purposes of the current demonstration.

** You could reduce this example to two functions, as the potentially pure function could be used to contain default values for the fall-through function(s), which could then be overridden by passing in an argument for testing purposes. However, I like the clarity granted by implementing an isolation set with three functions, as I want to avoid marrying the potentially pure function to any default implementation. For example, I could easily provide a different mailing mechanism by merely creating a new function, like sendSalesInquirySMTP(), which provides a PHPMailer implementation.

Potentially-Pure Functions

Overview: Potentially-pure functions are argument-based higher-order functions (i.e., functions that accept other functions as arguments) with pure function bodies (i.e., function bodies that are consistent and side-effect free), meaning their purity is dependent upon the arguments passed into the function.

Higher-Order Functions

All potentially-pure functions are higher-order functions, so let’s begin with a brief overview of what it means to be a higher-order function.

Higher-order functions accept functions as arguments (we’ll call this specific form argument-based higher-order functions) or return functions as values (we’ll call this specific form return-based higher-order functions.) Higher-order functions enable tremendous power, flexibility, and parsimony; and they are leveraged heavily in functional programming.
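As a quick illustration of the two forms, here’s a minimal JavaScript sketch. The function names (applyTwice, makeAdder) are made up for this example.

```javascript
// Argument-based higher-order function: accepts a function as an argument.
function applyTwice(fn, value) {
  return fn(fn(value));
}

// Return-based higher-order function: returns a function as its value.
function makeAdder(amount) {
  return function (x) {
    return x + amount;
  };
}

const addThree = makeAdder(3);
console.log(applyTwice(addThree, 10)); // 16
```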

In order to implement argument-based higher-order functions, a programming language must allow you to pass functionality into functions through their arguments. While not all languages provide first-class functions, which can be passed around and stored like other data, you can effectively emulate first-class functions in most languages. In low-level languages like C, you can pass in function pointers; in OOP languages like Java, you can pass in objects that implement an interface; and in dynamic languages like PHP, which lacked anonymous functions prior to version 5.3, you can pass in the string name of an existing function. No matter what language you’re using for your development, you should be able to fake it quite convincingly.

Potentially-Pure Functions

Pure functions are side-effect free and consistent. Higher-order functions provide a special situation when evaluating purity. If an argument-based higher-order function’s body is pure, then its purity is unfixed. In other words, the purity of the function is dependent upon the purity of the functions passed in as arguments. If the functions passed into the higher-order function are pure, then the function is pure; and if the functions passed in are impure, then the higher-order function is impure*. Because the phrase “argument-based higher-order function with unfixed purity” might jeopardize your conscious state, let’s just call this function type a potentially-pure function.

The natural duality of potentially-pure functions makes them especially helpful when it comes to isolating side effects. When testing a potentially-pure function, a pure form of the function can be passed in, allowing you to cleanly and easily test all possible states. When using a potentially-pure function in production, a fall-through function containing the side effect(s) can be passed in, allowing the code to perform its real-world requirements.
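Here’s a small JavaScript sketch of that duality. The names are hypothetical: greeting() is the potentially-pure function, nineAm and threePm are pure stand-ins for testing, and currentHour is an impure function that reads the system clock in production.

```javascript
// Potentially-pure: the body only uses its arguments.
function greeting(name, getHour) {
  return getHour() < 12 ? "Good morning, " + name : "Good day, " + name;
}

// Pure arguments for testing: each always returns the same hour.
const nineAm = () => 9;
const threePm = () => 15;

// Impure argument for production: reads the system clock.
const currentHour = () => new Date().getHours();

// Fall-through function that supplies the production behavior.
function greetNow(name) {
  return greeting(name, currentHour);
}

console.log(greeting("Ada", nineAm));  // Good morning, Ada
console.log(greeting("Ada", threePm)); // Good day, Ada
```

With the pure arguments, greeting() behaves like a pure function; greetNow() is where the impurity lives.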

* If an impure function is passed to an argument-based higher-order function with unfixed purity, it is possible for the function to remain pure if the impure function passed in as an argument is never called.

Pure Functions

Overview: Striving to write pure functions (i.e., functions that are consistent and side-effect free) improves the testability, simplicity, and clarity of code.

What are Pure Functions?

Pure functions are consistent and side-effect free. A consistent function returns the same value every time for a particular set of arguments (this type of function is said to be referentially transparent, as calls to the function can be replaced by the return value without changing the program’s behavior.) A side-effect-free function does not change state through any means beyond its return value, meaning the values that existed before the function call (e.g., global variables, disk contents, static instances, UI, etc.) were not directly altered by the function; and it does not read any state beyond its arguments (i.e., no reading of data from files, databases, etc.) Think of pure functions like Mr. Spock: given a set of inputs, you will always get the same straight-forward, logical result (okay, okay, Spock showed an unpredictable, emotional response in “Amok Time”, but c’mon, he thought he had killed Captain Kirk.)

You don’t have to be using some fancy-pants functional programming language to benefit from pure functions. In languages that aren’t purely functional, you’ll have to work to avoid things like side effects and pay attention to whether the arguments you’ve received are copies or references, semantics that are language/context dependent. When dealing with references, you should treat them like you treat dad’s favorite belongings (like a special lamp, for instance): you can look (read the values), but don’t touch (edit the values)!
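To sketch the “look, but don’t touch” rule in JavaScript, where arrays are passed by reference, compare the two hypothetical functions below. The first edits the caller’s array; the second only reads it and returns a new one.

```javascript
// Impure: edits the array that the caller passed in by reference.
function addTagInPlace(tags, tag) {
  tags.push(tag);
  return tags;
}

// Pure: reads the original array and returns a new one.
function addTag(tags, tag) {
  return tags.concat([tag]);
}

const original = ["php", "testing"];
const updated = addTag(original, "purity");
console.log(original.length); // 2
console.log(updated.length);  // 3
```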

Examples of Pure and Impure Functions

Let’s work through some example functions and determine if they’re pure (i.e., consistent and side-effect free) or impure.

Below is a trivial example of a JavaScript function that returns the square of a number.

function square(x){
    return x * x;
}

Given a particular number x, this square function will always return the same result, so it is consistent. Additionally, it makes no changes to the global state beyond its return value. Therefore, it’s a pure function.

Next, an example JavaScript function that checks out a book.

function checkOutBook(book, patron){
    if(book.isCheckedOut){
        return false;
    }
    // changes to the book object alter the object beyond the scope of this function
    book.isCheckedOut = true;
    book.checkedOutTo = patron;
    return true;
}

The function is consistent, as passing in a book in a particular state and a particular patron will always return the same result. However, the function changes some of the properties of the book object, changes that will persist even after the function has returned, so this function has side effects. Therefore, it’s an impure function.
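For contrast, here’s one possible pure variant, sketched under the assumption that a shallow copy of the book is enough. The name checkedOutCopy and the choice to return null for an unavailable book are mine, not part of the original example.

```javascript
// Returns a new book object instead of editing the one passed in.
// Returns null when the book is already checked out.
function checkedOutCopy(book, patron) {
  if (book.isCheckedOut) {
    return null;
  }
  // Object.assign makes a shallow copy of the book, then applies the
  // new property values to the copy, leaving the original untouched.
  return Object.assign({}, book, { isCheckedOut: true, checkedOutTo: patron });
}

const book = { title: "Dune", isCheckedOut: false };
const copy = checkedOutCopy(book, "pat");
console.log(book.isCheckedOut); // false
console.log(copy.isCheckedOut); // true
```

The caller decides what to do with the new object, so any persistence (the real side effect) can live in a separate fall-through function.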

The Benefits of Pure Functions

Pure functions facilitate simplicity and clarity. Because pure functions lack side effects, the outside world is completely abstracted away and the programmer can focus entirely on the parameters and control flow constructs contained within the function. Additionally, when calling a pure function, the programmer can focus solely on the visible context of the call and the return value, as the function has no other impact on state.

Testing pure functions proves extremely straight-forward. All possible paths/states of a pure function can be directly achieved by passing in different sets of arguments. The only things you’ll be mocking are Lions Fans (sure, we didn’t end the season well, but we really could have a great season next… oh, the abject sadness.)
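As a sketch of what that looks like in practice, here’s a made-up pure function with three paths and a tiny hand-rolled assertion helper (both hypothetical, and no test framework assumed). Each path is reached by nothing more than a different argument.

```javascript
// A pure function with three paths (the rates are invented for the example).
function shippingCost(weightKg) {
  if (weightKg <= 0) return 0;
  if (weightKg < 5) return 4;
  return 9;
}

// Minimal assertion helper so the example has no dependencies.
function assertEqual(actual, expected) {
  if (actual !== expected) {
    throw new Error("Expected " + expected + " but got " + actual);
  }
}

// One call per path: no mocks, no setup, no teardown.
assertEqual(shippingCost(0), 0);
assertEqual(shippingCost(2), 4);
assertEqual(shippingCost(5), 9);
console.log("All paths covered.");
```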

A Usage Strategy for Pure Functions

Because of the benefits of pure functions, I follow the simple rule, “Strive for purity.” That is to say, I work hard to write as many functions as I can in a pure form, and when needed, I write functions that have side effects or are inconsistent.

Side effects aren’t bad. Any meaningful program will have side effects, and it doesn’t bother me in the least when it’s time to write an impure function. However, I try to keep the side effects isolated in small fall-through functions, so as to improve the simplicity, clarity, and testability of the rest of the code base.

Return To Me: A Song About Our Walk With God

I’ve completed the basic arrangement of a new song: Return To Me. The song takes the perspective of God singing to us, his children, throughout our journey with him. The song ends with us singing a chorus of “Hallelujahs” to him, our God. You’ll have to use your imagination for now, as the melody is voiced with cello until I can complete a vocal version, which will hopefully include SATB parts for the chorus, too.

The song describes our need for God, our journey with God, and the love to which God has called us for all eternity. The lyrics and corresponding scriptural references are posted below the video.

My parents, Richard and Marjorie Richardson, inspired much of the work on the song, as did the precious example set by Laney and her family over the past few months. Davin Granroth and Rodney Page took the time to provide encouraging feedback on the work.

Return To Me (Instrumental)

Return To Me (Lyrics)

“Return to me for I still love you.”
Joel 2:13
Return to the Lord your God, for he is gracious and merciful, slow to anger, and abounding in steadfast love; and he relents over disaster.

“Return to me for I’ve redeemed you.”
Isaiah 44:22
I have blotted out your transgressions like a cloud and your sins like mist; return to me, for I have redeemed you.

“I so loved you, I laid my life down.”
1 John 3:16
By this we know love, that he laid down his life for us, and we ought to lay down our lives for the brothers.

“Come back home my wayward child.”
Jeremiah 3:14 & 3:22 (NLT and NET Bible) on wayward children.

“Walk with me and you will find rest.”
Jeremiah 6:16
Thus says the Lord: “Stand by the roads, and look, and ask for the ancient paths, where the good way is; and walk in it, and find rest for your souls.

“Walk with me and you will know hope.”
Ephesians 1:15 – 21
15 For this reason, because I have heard of your faith in the Lord Jesus and your love toward all the saints, 16 I do not cease to give thanks for you, remembering you in my prayers, 17 that the God of our Lord Jesus Christ, the Father of glory, may give you the Spirit of wisdom and of revelation in the knowledge of him, 18 having the eyes of your hearts enlightened, that you may know what is the hope to which he has called you, what are the riches of his glorious inheritance in the saints, 19 and what is the immeasurable greatness of his power toward us who believe, according to the working of his great might 20 that he worked in Christ when he raised him from the dead and seated him at his right hand in the heavenly places, 21 far above all rule and authority and power and dominion, and above every name that is named, not only in this age but also in the one to come.

“I so loved you, I laid my life down.”
1 John 3:16
By this we know love, that he laid down his life for us, and we ought to lay down our lives for the brothers.

“Come on home my little child.”
Matthew 19:14
…but Jesus said, “Let the little children come to me and do not hinder them, for to such belongs the kingdom of heaven.”

“I take great delight in you. Let me quiet you within my love.”
Zephaniah 3:17
The Lord your God is in your midst, a mighty one who will save; he will rejoice over you with gladness [other translations read “take delight in you”]; he will quiet you by his love; he will exult over you with loud singing.

“Come home with me. Your place is prepared.”
John 14:2-3
In my Father’s house are many rooms. If it were not so, would I have told you that I go to prepare a place for you? And if I go and prepare a place for you, I will come again and will take you to myself, that where I am you may be also.

“Come home with me and worship The King”
Revelation 19

“I so loved you, I laid my life down.”
1 John 3:16
By this we know love, that he laid down his life for us, and we ought to lay down our lives for the brothers.

“Welcome home my precious child.”
Matthew 19:14
…but Jesus said, “Let the little children come to me and do not hinder them, for to such belongs the kingdom of heaven.”

“Nothing in all creation can separate you from my love.”
Romans 8:38-39
For I am sure that neither death nor life, nor angels nor rulers, nor things present nor things to come, nor powers, nor height nor depth, nor anything else in all creation, will be able to separate us from the love of God in Christ Jesus our Lord.

“Hallelu! Hallelujah!”
Revelation 19