Markdown Rendering with Awk

Posted:

I can't believe I'm writing another post about Awk but I'm just having too much fun throwing together tiny Awk scripts. This time around that tiny Awk script is a markdown renderer, converting markdown to good ol' HTML.

Markdown is interesting because 80% of the language is rather easy to implement by walking through a file and applying a regular expression to each line. Most implementations seem to start this way. However, once that final 20% is hit some aspects of the language start to show their warts, exposing the not-so-happy path that eventually leads to lexical analysis.

To list a few examples,

Each of these conditions complicates the simple line-based approach.

The renderer that I'm building doesn't aim to be comprehensive, so most of these edge cases are not handled. For my toy renderer, I'm assuming that the markdown is written in accordance to a general style guide with friendly linebreaks between paragraphs.

I am also being careful to call this project an markdown "renderer" and not a "parser" because it's not really parsing the markdown file. Instead, we're immediately replacing markdown with HTML. The difference may seem nitpicky but implies that there's no intermediate format between the markdown and the HTML output, a nuance that makes this implementation less powerful but also much simpler.

Let's get cracking.

Initial approach

Headers are a natural first step. The solution emphasizes Awk's strengths when it comes to handing line-based input:

/^# / { print "<h1>" substr($0, 3) "</h1>" }

This reads, "On every line, match # followed by a space. Replace that line with header tags and the text of that line beginning at index 3 (one-based indexing)." Since we're piping awk into another file, print statements are effectively writing our rendered HTML.

Line replacements are the name of the game in Awk, where the simplicity of the syntax really shines. The call to substr is less elegant than the usual space-delimited Awk fields ($1, $2, etc.), but it's necessary since we want to preserve the entire line sans the first two characters (the leading header hashtag).

The remaining headers follow the same pattern:

/^# /   { print "<h1>" substr($0, 3) "</h1>" }
/^## /  { print "<h2>" substr($0, 4) "</h2>" }
/^### / { print "<h3>" substr($0, 5) "</h3>" }
# ...

For something a little trickier, let's move on to block quotes. Block quotes in Markdown look like the following, leading each line with a caret:

> Deep in the human unconscious is a pervasive need for a logical
> universe that makes sense. But the real universe is always one step
> beyond logic. - Frank Herbert

Finding block quote lines is easy, we just use the same approach as our headers:

/^> / { print "<blockquote>"; print substr($0, 3); print "</blockquote>"}

But as you have probably guessed, this simplification isn't quite what we want. Instead of wrapping each line with a block quote tag, we want to wrap the entire block (three lines in this case) with one set of tags. This will require us to keep track of some intermediate state between line-reads:

/^> / { if (!inquote) print "<blockquote>"; inquote = 1; print substr($0, 3) "}

If we match a blockquote character and we're not yet inquote, we write the opening tag and set inquote. Otherwise, we simply write the content of the line. We need an extra rule to write the closing tag:

inquote && !/^> / { print "</blockquote>"; inquote = 0 }

If our program state says we're in a quote but we reach a line that doesn't lead with a block quote character, it's time to close the block. This matches against paragraph breaks which are normally used to separate paragraphs in Markdown documents.

This same strategy can be applied to the other block-style markdown elements: code blocks and lists. Each requires its own variable to keep track of the block content, but the approach is the same.

Paragraphs are tricky

It is very tempting to implement inline paragraph elements in the same way as we've handled other, single-line markdown syntax. However, paragraphs are special in that they often span more than a single line, especially so if you use hard-wrapping in your text editor at some column. For example, it's very common for links to span multiple lines:

A link that spans [multiple
lines](https://definitely-a-valid-link-here.com)

This breaks the nice, single-line worldview that we've been operating under, requiring some special handling that will end up leaking into other aspects of our rendering engine.

My approach is to collect multiple paragraph lines into a single string, rendering it altogether on paragraph breaks. This allows me to search the entire string for inline elements (links, bold, italics), effectively matching against multiple lines of input.

/./  { for (i=1; i<=NF; i++) collect($i) }
/^$/ { flushp() }

# Concatenate our multi-line string
function collect(v) {
  line = line sep v
  sep = " "
}

# Flush the string, rendering any inline HTML elements
function flushp() {
  if (line) {
    print "<p>" render(line) "</p>"
    line = sep = ""
  }
}

Each line of text is collected into a variable, line, that is persisted between line-reads. When a paragraph break is hit (a line that contains no text, /^$/) we render that line, wrapping it in paragraph tags and replacing any inline elements with their respective HTML tags.

I'll point out that the technique of collecting fields into a string or array is a very common pattern in Awk, hence the utility variable NF for "number of fields". The Awk book uses this pattern in quite a few places.

For completeness, here's what that render function looks like:

function render(line) {
    if (match(line, /_(.*)_/)) {
        gsub(/_(.*)_/, sprintf("<em>%s</em>", substr(line, RSTART+1, RLENGTH-2)), line)
    }

    if (match(line, /\*(.*)\*/)) {
        gsub(/\*(.*)\*/, sprintf("<strong>%s</strong>", substr(line, RSTART+1, RLENGTH-2)), line)
    }

    if (match(line, /\[.+\]\(.+\)/)) {
        inner = substr(line, RSTART+1, RLENGTH-2)
        split(inner, spl, /\]\(/)
        gsub(/\[.+\]\(.+\)/, sprintf("<a href=\"%s\">%s</a>", spl[2], spl[1]), line)
    }

    return line
}

This code is noticeably less clean than our earlier HTML rendering, an unfortunate consequence of handling multi-line paragraphs. I won't go into too much detail here since there's a lot of Awk-specific regular expression matching stuff going on, but the gist is a standard regexp-replace of the paragraph text with HTML tags for matching elements.

Another problem that we run into when collecting multiple lines into the line variable is accidentally collecting text from previous match rules. Awk's expression syntax is like a switch statement that lacks a break: a line will match as many expressions as it can before moving onto the next. That means that all of our previous rules for headers, blockquotes, and so on are now also included in our paragraph text. That's no good!

# I match a header here:
/^# /  { print "<h1>" substr($0, 3) "</h1>" }

# But I also match "any text" here, so I'm collected:
/./  { for (i=1; i<=NF; i++) collect($i) }

Each of our previous matchers now has to include a call to next to immediately stop processing and move on to the next line. This prevents them from being included in paragraph collection.

/^# /  { print "<h1>" substr($0, 3) "</h1>"; next }
/^## / { print "<h2>" substr($0, 4) "</h2>"; next }
# ...

Styling for HTML exports

The last piece of this Markdown renderer is adding the boilerplate HTML that wraps our document:

BEGIN {
    print "<!doctype html><html>"
    print "<head>"
    print "  <meta charset=\"utf-8\">"
    print "  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">"
    if (head) print head
    print "</head>"
    print "<body>"
}

# ... all of our rules go here

END {
    print "</body>"
    print "</html>"
}

Unlike other Awk matchers, the special BEGIN and END keywords are only executed once.

As a nice bonus, we can add an optional head variable to inject a stylesheet into our rendered markdown, which can be added via the Awk CLI. The following adds the Simple CSS stylesheet to our rendered output:

awk -v head='  <link rel="stylesheet" href="https://cdn.simplecss.org/simple.min.css">' \
    -f awkdown.awk README.md > docs/index.html

The full source code is available here: https://github.com/mgmarlow/awkdown.


Thanks for reading! Send your comments to [email protected].