Extending Markdown Support in Yaydoc

Yaydoc, our automatic documentation generator, builds static websites from a set of markup documents in markdown or reStructuredText format. Yaydoc uses the sphinx documentation generator internally hence reStructuredText support comes out of the box with it. To support markdown we use multiple techniques depending on the context. Most of the markdown support is provided by recommonmark, a docutils bridge for sphinx which basically converts markdown documents into proper docutil’s abstract syntax tree which is then converted to HTML by sphinx. While It works pretty well for most of the use cases, It does fall short in some instances. They are discussed in the following paragraphs.

The first problem was inclusion of other markdown files in the starting page. This was due to the fact that markdown does not supports any include mechanism. And if we used the reStructuredText include directive, the included text was parsed as reStructuredText. This problem was solved earlier using pandoc – an excellent tool to convert between various markup formats. What we did was that we created another directive mdinclude which converts the markdown to reStructuredText before inclusion. Although this was solved a while ago, The reason I’m discussing this here is that this was the inspiration behind the solution to our recent problem.

The problem we encountered was that recommonmark follows the Commonmark spec which is an ongoing effort towards standardization of markdown which has been somewhat lacking till now. The process is currently going on so the recommonmark library doesn’t yet support the concept of extensions to support various features of different markdown flavours not in the core commonmark spec. We could have settled for only supporting the markdown features in the core spec but tables not being present in the core spec was problematic. We had to support tables as it is widely used in most of the docs present in github repositories as GFM(Github Flavoured Markdown) renders ascii tables nicely.

The solution was to use a combination of recommonmark and pandoc. recommonmark provides a eval_rst code block which can be used to embed non-section reStructuredText within markdown. I created a new MarkdownParser class which inherited the CommonMarkParser class from recommonmark. Within it, using regular expressions, I convert any text within `<!– markdown+ –>` and `<!– endmarkdown+ –>`  into reStructuredText and enclose it within eval_rst code block. The result was that tables when enclosed within those trigger html comments would be converted to reST tables and then enclosed within eval_rst block which resulted in recommonmark renderering them properly. Below is a snippet which shows how this was implemented.

import re
from recommonmark.parser import CommonMarkParser
from md2rst import md2rst

MARKDOWN_PLUS_REGEX = re.compile('<!--\s+markdown\+\s+-->(.*?)<!--\s+endmarkdown\+\s+-->', re.DOTALL)
EVAL_RST_TEMPLATE = "```eval_rst\n{content}\n```"

def preprocess_markdown(inputstring):
    def callback(match_object):
        text = match_object.group(1)
        return EVAL_RST_TEMPLATE.format(content=md2rst(text))

    return re.sub(MARKDOWN_PLUS_REGEX, callback, inputstring)

class MarkdownParser(CommonMarkParser):
    def parse(self, inputstring, document):
        content = preprocess_markdown(inputstring)
        CommonMarkParser.parse(self, content, document)


How to add a custom filter to pypandoc

In Yaydoc, we met the problem of converting Markdown file into restructuredText because sphinx needs restructured text.  

Let us say we have a yml CodeBlock in Yaydoc’s README.md, but sphinx  uses pygments for code highlighting which needs yaml instead of yml for proper documentation generation. Pandoc has an excellent feature which allows us to write our own custom logic to the AST parser.

INPUT --reader--> AST --filter--> AST --writer--> OUTPUT

Let me explain this in a few steps:

  1. Initially pandoc reads the file and then converts it into nodes.
  2. Then the nodes is sent to Pandoc AST for parsing the markdown to restructuredText.
  3. The parsed node will then go to the filter. The filter is converting the parsed node according to the logic implemented.
  4. Then the Pandoc AST performs further parsing and joins the Nodes into text and is written to the file.

One important point to remember is that, Pandoc reads the conversion from the filter output stream so don’t write print statement in the filter. If you write print statement pandoc cannot  parse the JSON. In order to do debugging you can use logging module from python standard module. Here is the sample Pypandoc filter:

#!/usr/bin/env python
from pandocfilters import CodeBlock, toJSONFilter

def yml_to_yaml(key, val, fmt, meta):
    if key == 'CodeBlock':
        [[indent, classes, keyvals], code] = val
        if len(classes) > 0:
            if classes[0] == u'yml':
                classes[0] = u'yaml'
        val = [[indent, classes, keyvals], code]
        return CodeBlock(val[0], val[1])

if __name__ == "__main__":

The above snippet checks whether the node is a CodeBlock or not. If it is a CodeBlock, it changes yml to yaml and prints it as a JSON in the output stream. It is then parsed by pandoc.

Finally, all you have to do is to add your filter to the Pypandoc’s filters argument.

output = pypandoc.convert_text(text, 'rst', format='md', filters=[os.path.join('filters', 'yml_filter.py')])