Extending Markdown Support in Yaydoc

Yaydoc, our automatic documentation generator, builds static websites from a set of markup documents in markdown or reStructuredText format. Yaydoc uses the sphinx documentation generator internally, so reStructuredText support comes out of the box. To support markdown we use multiple techniques depending on the context. Most of the markdown support is provided by recommonmark, a docutils bridge for sphinx which converts markdown documents into a proper docutils abstract syntax tree, which sphinx then converts to HTML. While it works pretty well for most use cases, it does fall short in some instances. They are discussed in the following paragraphs.

The first problem was the inclusion of other markdown files in the starting page. Markdown does not support any include mechanism, and if we used the reStructuredText include directive, the included text was parsed as reStructuredText. This problem was solved earlier using pandoc, an excellent tool to convert between various markup formats. We created another directive, mdinclude, which converts the markdown to reStructuredText before inclusion. Although this was solved a while ago, I’m discussing it here because it was the inspiration behind the solution to our recent problem.

The problem we encountered was that recommonmark follows the CommonMark spec, an ongoing effort towards standardization of markdown, which has been somewhat lacking till now. Since that process is still under way, the recommonmark library doesn’t yet support extensions for features of other markdown flavours that are not in the core CommonMark spec. We could have settled for only supporting the markdown features in the core spec, but the absence of tables from the core spec was problematic. We had to support tables as they are widely used in the docs of GitHub repositories, where GFM (GitHub Flavored Markdown) renders ASCII tables nicely.

The solution was to use a combination of recommonmark and pandoc. recommonmark provides an eval_rst code block which can be used to embed non-section reStructuredText within markdown. I created a new MarkdownParser class which inherits from the CommonMarkParser class of recommonmark. Within it, using regular expressions, I convert any text between `<!-- markdown+ -->` and `<!-- endmarkdown+ -->` into reStructuredText and enclose it within an eval_rst code block. As a result, tables enclosed within those trigger HTML comments are converted to reST tables and wrapped in an eval_rst block, which recommonmark then renders properly. Below is a snippet which shows how this was implemented.

import re
from recommonmark.parser import CommonMarkParser
from md2rst import md2rst

MARKDOWN_PLUS_REGEX = re.compile(r'<!--\s+markdown\+\s+-->(.*?)<!--\s+endmarkdown\+\s+-->', re.DOTALL)
EVAL_RST_TEMPLATE = "```eval_rst\n{content}\n```"

def preprocess_markdown(inputstring):
    def callback(match_object):
        text = match_object.group(1)
        return EVAL_RST_TEMPLATE.format(content=md2rst(text))

    return re.sub(MARKDOWN_PLUS_REGEX, callback, inputstring)

class MarkdownParser(CommonMarkParser):
    def parse(self, inputstring, document):
        content = preprocess_markdown(inputstring)
        CommonMarkParser.parse(self, content, document)
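
To make the effect of the preprocessing step concrete, here is a minimal sketch that runs the same regular expression over a small document. The real md2rst helper relies on pandoc, so a stand-in named fake_md2rst (hypothetical, it only upper-cases the text) is used to keep the example free of extra dependencies. The convert parameter is likewise an addition made for this illustration.

```python
import re

MARKDOWN_PLUS_REGEX = re.compile(
    r'<!--\s+markdown\+\s+-->(.*?)<!--\s+endmarkdown\+\s+-->', re.DOTALL)
FENCE = "`" * 3  # three backticks, built this way so the example embeds cleanly
EVAL_RST_TEMPLATE = FENCE + "eval_rst\n{content}\n" + FENCE


def fake_md2rst(text):
    # Hypothetical stand-in for the pandoc-backed md2rst helper.
    return text.strip().upper()


def preprocess_markdown(inputstring, convert=fake_md2rst):
    def callback(match_object):
        # Only the text between the trigger comments is converted.
        return EVAL_RST_TEMPLATE.format(content=convert(match_object.group(1)))
    return MARKDOWN_PLUS_REGEX.sub(callback, inputstring)


doc = "intro\n<!-- markdown+ -->\n| a | b |\n<!-- endmarkdown+ -->\noutro"
result = preprocess_markdown(doc)
print(result)
```

Text outside the trigger comments passes through untouched, which is why ordinary markdown documents are unaffected by the custom parser.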


Handling Errors While Parsing the yaml File in Yaydoc

Yaydoc, our automatic documentation generator, uses a yaml file to read a user’s configuration. The internal configuration parser converts the yaml file to a python dictionary. Then, it serializes the values of that dictionary using a custom serialization format. From there it associates those values with environment variables which are then passed to bash scripts for various tasks such as deployment, generation, etc. Some of those environment variables are again passed to another python layer which interacts with sphinx, where they are deserialized before use. This whole system works pretty well for our use cases.

Now let’s assume a user adds a yaml file where they have a malformed section in the file. For example, to specify a theme, one needs to add the following to the yaml file.

build:
  theme:
    name: sphinx_fossasia_theme

But our user has the following in their yaml file.

build:
  theme: sphinx_fossasia_theme

This will raise an error as we expect a dictionary as the value for the key ‘theme’ but got a string. How do we handle such cases without ignoring the entire file, which would be too much of a penalty for such a small mistake? One approach would have been to wrap each call to connect in a try/except block, but that would make the code unreadable. The initial motivation for implementing the connect method was to abstract the internal implementation so that other contributors who may not be well versed with python can easily add config options without needing to learn a bunch of python constructs.

So, while merging the dictionary containing default options with the dictionary containing the user preferences, we check whether the default value has the same data type as the incoming value. If the types match, it is deemed safe to merge. There are certain relaxations though: if the default is a list, the incoming value can be of any type, as it can always be converted to a list with a single element. This is required to support the following syntax.

key:
  - value

key: value

The above two blocks are treated as equivalent due to the above-mentioned approach, although their types differ.
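
The type-checked merge described above can be sketched roughly as follows. This is a simplified illustration rather than the actual Yaydoc code: the function name merge, the mock key and the wording of the message are assumptions made for the example.

```python
def merge(default, user, path=""):
    """Merge user values into defaults, skipping values of the wrong type."""
    merged = dict(default)
    for key, value in user.items():
        dotted = path + "." + key if path else key
        if key not in default:
            merged[key] = value
            continue
        expected = default[key]
        if isinstance(expected, list):
            # Relaxation: any value can become a single-element list.
            merged[key] = value if isinstance(value, list) else [value]
        elif isinstance(expected, dict) and isinstance(value, dict):
            merged[key] = merge(expected, value, dotted)
        elif expected is None or type(expected) is type(value):
            merged[key] = value
        else:
            # Type mismatch: keep the default and tell the user which section is wrong.
            print("Ignoring invalid section '{0}'".format(dotted))
    return merged


defaults = {"build": {"theme": {"name": "sphinx_fossasia_theme"}, "mock": []}}
user = {"build": {"theme": "alabaster", "mock": "numpy"}}
merged = merge(defaults, user)
```

Here the malformed theme entry is dropped in favour of the default while the rest of the file is still honoured, and the scalar mock value is promoted to a list.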

Once this pre-validation step is over, we can be sure that if the assumed type for a key is, say, a dictionary, then the value is indeed a dictionary. Hence no type errors are raised, such as calling a dict method on a string, which happened with the earlier implementation. After this, an extra parameter was added to the connect method through which we can pass a validation function; values for which it returns false are ignored. So far this feature is used on a small scale: we validate the links to subprojects and include them only if they look like a valid GitHub repo. Note that their existence is not checked. Only a regex based validation is performed.
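
A regex based validation function of this kind might look like the sketch below. The pattern and the names looks_like_github_repo and keep_valid are assumptions made for illustration; the pattern actually used by Yaydoc is not shown in this article.

```python
import re

# Hypothetical pattern: http(s)://github.com/<owner>/<repo> with an optional trailing slash
GITHUB_REPO_REGEX = re.compile(r'^https?://github\.com/[\w.-]+/[\w.-]+/?$')


def looks_like_github_repo(url):
    """Return True if url is shaped like a GitHub repository link."""
    return bool(GITHUB_REPO_REGEX.match(url))


def keep_valid(values, validate):
    """Drop every value for which the validation function returns False."""
    return [value for value in values if validate(value)]


urls = ["https://github.com/fossasia/yaydoc.git", "ftp://example.com/repo"]
valid = keep_valid(urls, looks_like_github_repo)
```

As in the text, only the shape of the link is checked; no network request is made to confirm that the repository exists.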

It was also important to notify the user when we detect that a specific section is invalid, without failing the build. Hence informative error messages were added so that the user knows exactly which section is to blame. This is similar to compilers, where the error message is crucial for debugging a piece of code.


Implementing a Custom Serializer for Yaydoc

At its core, Yaydoc is composed of a number of specialized bash scripts which perform various tasks such as generating documentation and publishing it to GitHub Pages, Heroku, etc. These bash scripts also serve as the central communication portal for the various technologies used in Yaydoc. The core generator is composed of several Python modules extending the sphinx documentation generator. The web interface has been built using Node, Express, etc. Yaydoc also contains a Python package dedicated to reading configuration options from a YAML file.

Till now the options were read and then converted to strings irrespective of the actual data type, based on some simple rules.

  • A list was converted to a comma separated string. (Nested lists were not handled.)
  • Boolean values were converted to true or false respectively.
  • None was converted to an empty string.

While these simple rules were enough at that time, it was certain that a better solution would be required as the project grew in size. It was also getting tough to maintain because a lot of hard-coding was required when we wanted to convert those strings back to python objects. To handle these cases, I decided to create a custom serialization format which would be simple for our use cases and easily parseable from a bash script, yet able to handle all edge cases. The format is mostly similar to its earlier form apart from lists, where it takes heavy inspiration from the python language itself.

With the new implementation, lists are converted to comma separated strings enclosed in square brackets. This allows us to encode the type of the object in the string so that it can later be decoded. It handles the cases of an empty list and a list with a single element well. The implementation also handles nested lists.

Two methods were created, namely serialize and deserialize, which detect the type of the corresponding object using several heuristics and apply the proper serialization or deserialization rule.

def serialize(value):
    """
    Serializes a python object to a string.
    None is serialized to an empty string.
    bool values are converted to the strings true or false.
    lists or tuples are recursively handled and are comma separated.
    """
    if value is None:
        return ''
    if isinstance(value, str):
        return value
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return '[' + ','.join(serialize(_) for _ in value) + ']'
    return str(value)

To deserialize we also had to handle the case of nested lists. The following snippet does that properly.

def deserialize(value, numeric=True):
    """
    Deserializes a string to a python object.
    The strings true and false (in any case) are converted to bools.
    `numeric` controls whether strings should be converted to
    ints or floats if possible. List strings are handled recursively.
    """
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    if numeric and _is_numeric(value):
        return _to_numeric(value)
    if value.startswith('[') and value.endswith(']'):
        split = []
        element = ''
        level = 0
        for c in value:
            if c == '[':
                level += 1
                if level != 1:
                    element += c
            elif c == ']':
                if level != 1:
                    element += c
                level -= 1
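            elif c == ',' and level == 1:
                # a top-level comma ends the current element
                split.append(element)
                element = ''
            else:
                element += c
        if split or element:
            # flush the last element (an empty list "[]" yields nothing)
            split.append(element)
        return [deserialize(_, numeric) for _ in split]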
    return value

With this new approach, we are able to handle many more cases than the previous implementation, and it is much more robust. It does, however, still lack certain features such as serializing dictionaries. That may be implemented in the future if need be.


Adding Github buttons to Generated Documentation with Yaydoc

Repository owners often want to link to their GitHub source code, issue tracker, etc. from the documentation. This can also help turn some readers into potential contributors to the repository. As a step towards this, we added the ability to add automatically generated GitHub buttons to the top of the docs with Yaydoc.

To do so we created a custom sphinx extension which makes use of http://buttons.github.io/, an excellent service for embedding GitHub buttons in any website. The extension takes multiple config values and uses them to generate the HTML, which it adds to the top of the internal docutils tree using a raw node.

GITHUB_BUTTON_SPEC = {
    'watch': ('eye', 'https://github.com/{user}/{repo}/subscription'),
    'star': ('star', 'https://github.com/{user}/{repo}'),
    'fork': ('repo-forked', 'https://github.com/{user}/{repo}/fork'),
    'follow': ('', 'https://github.com/{user}'),
    'issues': ('issue-opened', 'https://github.com/{user}/{repo}/issues'),
}

def get_button_tag(user, repo, btn_type, show_count, size):
    spec = GITHUB_BUTTON_SPEC[btn_type]
    icon, href = spec[0], spec[1].format(user=user, repo=repo)
    tag_fmt = '<a class="github-button" href="{href}" data-size="{size}"'
    if icon:
        tag_fmt += ' data-icon="octicon-{icon}"'
    tag_fmt += ' data-show-count="{show_count}">{text}</a>'
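    return tag_fmt.format(href=href,
                          icon=icon,
                          size=size,
                          show_count=show_count,
                          # assumption: the visible label is derived from the button type
                          text=btn_type.title())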

The above snippet shows how it takes various parameters: the user name, the name of the repository, the button type (one of fork, issues, watch, follow and star), whether to display counts beside the buttons, and whether a large button should be used. Another method named get_button_tags reads the various configs and calls the above method with appropriate parameters to generate each button.

The extension makes use of the doctree-resolved event emitted by sphinx to hook into the internal doctree. The following snippet shows how it is done.

def on_doctree_resolved(app, doctree, docname):
    if not app.config.github_user_name or not app.config.github_repo:
        # nothing to add when the repository is not configured
        return
    buttons = nodes.raw('', get_button_tags(app.config), format='html')
    doctree.insert(0, buttons)

Finally we add the custom javascript using the add_javascript method.


To use this with yaydoc, users would just need to add the following to their .yaydoc.yml file.

      watch: true
      star: true
      issues: true
      fork: true
      follow: true
    show_count: true
    large: true


  1. Homepage of GitHub:buttons – http://buttons.github.io/
  2. Sphinx extension tutorial – http://www.sphinx-doc.org/en/stable/extdev/tutorial.html

Implementing an Interface for Reading Configuration from a YAML File for Yaydoc

Yaydoc reads configuration specified in a YAML file to set various options during the build process. This allows users to customize various properties of the build process. Until recently the implementation was very basic. It used PyYAML, a YAML parser for python, to read the file and convert it to a python dictionary. From the dictionary we extracted values for various properties and converted them to strings using various heuristics, such as converting True to “true”, False to “false”, a list to a comma separated string and None to an empty string. Finally, we exported variables with those values.

Recently the entire code for this was rewritten using an object-oriented approach. The motivation came from the fact that the implementation lacked certain features and also required some refactoring for long term readability. The following paragraphs discuss the new implementation.

Firstly, a Configuration class was created which wraps a dictionary and provides certain utility methods. The primary difference is that the Configuration class allows dotted key access. This means that you can use the following syntax to access nested keys.

theme = conf['build.theme.name']

The class provides another method connect which is used to connect environment variables with configuration values. This method also takes a dotted key but provides an extension on top of that to handle the case when a certain option can take multiple values. For example,

option: my_option

option:
  - my_option1
  - my_option2

To indicate that a certain config is of this type, you can specify a “@” character at the end of the key. Anything after the “@” character is assumed to be an attribute of each element within the list. Let’s see an example of this whole process.

    - url: <url1>
      source: "doc"
    - url: <url2>

Now to extract all urls from the above file, we’d need to do the following

config.connect('SUBPROJECT_URLS', '[email protected]')

To extract sources, we’ll also use the default parameter as the source option is optional.

config.connect('SUBPROJECT_SOURCES', '[email protected]', default='docs')

Finally, the Configuration object also provides a getenv method which reads all connections and serializes the values to strings according to the previously described heuristics. It then returns a dictionary of all environment variables which must be set.
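
To tie the pieces together, here is a heavily simplified sketch of such a class. It is not the actual Yaydoc implementation: the helper _resolve, the repos key, the environment variable names and the plain comma-joined serialization are all assumptions made for illustration.

```python
class Configuration(object):
    """Hypothetical, simplified wrapper around a nested dictionary."""

    def __init__(self, data):
        self._data = data
        self._connections = {}

    def __getitem__(self, dotted_key):
        # Walk the nested dictionaries one key segment at a time.
        value = self._data
        for part in dotted_key.split('.'):
            value = value[part]
        return value

    def get(self, dotted_key, default=None):
        try:
            return self[dotted_key]
        except (KeyError, TypeError):
            return default

    def connect(self, envvar, key, default=None):
        # Remember which key feeds which environment variable.
        self._connections[envvar] = (key, default)

    def _resolve(self, key, default):
        if '@' not in key:
            return self.get(key, default)
        # "list.key@attr": collect attr from every element of the list.
        list_key, attr = key.split('@', 1)
        items = self.get(list_key, [])
        if not isinstance(items, list):
            items = [items]
        if not attr:
            return items
        return [item.get(attr, default) for item in items]

    def getenv(self):
        # Simplified serialization: lists become comma separated strings.
        env = {}
        for envvar, (key, default) in self._connections.items():
            value = self._resolve(key, default)
            if isinstance(value, list):
                value = ','.join(str(v) for v in value)
            env[envvar] = '' if value is None else str(value)
        return env


conf = Configuration({
    'build': {
        'theme': {'name': 'sphinx_fossasia_theme'},
        'repos': [{'url': 'u1', 'source': 'doc'}, {'url': 'u2'}],
    }
})
conf.connect('THEME', 'build.theme.name')
conf.connect('REPO_URLS', 'build.repos@url')
conf.connect('REPO_SOURCES', 'build.repos@source', default='docs')
env = conf.getenv()
```

The second list element has no source attribute, so the default passed to connect is used for it.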


Specifying Configurations for Yaydoc with .yaydoc.yml

Like many continuous integration services, Yaydoc reads its configuration from a YAML file. To get started with Yaydoc, the first step is to create a file named .yaydoc.yml and specify the required options. Recently the yaydoc team finalized the layout of the file. You can still expect some changes as the project continues to grow, but they should not be major ones. Let me give you an outline of the entire layout before describing each section in detail. Overall the file is divided into four sections.

  • metadata
  • build
  • publish
  • extras

For the first three sections, the intention is pretty clear from their names. The last section, extras, is meant to hold settings related to integration with external services.

Following is a description of the config options under the metadata section and an example of their usage.

Key | Description | Default
author | The author of the repository. It is used to construct the copyright text. |
projectname | The name of the project. This would be displayed on the generated documentation. | Name of the repository
version | The version of the project. This would be displayed alongside the project name. | Current UTC date
debug | If true, the logs generated would be a little more verbose. Can be one of true or false. |
inline_math | Whether inline math should be enabled. This only affects markdown documents. |



This section contains various settings to control the behavior of the auto generated index. Use this to customize the starting page while having the benefit of not having to specify a manual index.

metadata:
  author: FOSSASIA
  projectname: Yaydoc
  version: development
  debug: true
  inline_math: false

Following is a description of the config options under the build section and an example of their usage.

Key Description Default

The theme which should be used to build the documentation. The attribute name can be one of the built-in themes or any custom sphinx theme from PyPI. Note that for PyPI themes, you need to specify the distribution name and not the package name. It also provides an attribute options to control specific theme options


This is the path which would be scanned for markdown and reST files. Also any static content such as images referenced from embedded html in markdown should be placed under a _static directory inside source. Note that the README would always be included in the starting page irrespective of source from the auto-generated index


The path to an image to be used as logo for the project. The path specified should be relative to the source directory.


The markdown flavour which should be used to parse the markdown documents. Possible values for this are markdown, markdown_strict, markdown_phpextra, markdown_github, markdown_mmd and commonmark. Note that currently this option is only used for parsing any included documents using the mdinclude directive and it’s unlikely to change soon.


Any python modules or packages which should be mocked. This only makes sense if the project is in python, uses autodoc and has C dependencies.


If enabled, Yaydoc will crawl your repository and try to extract API documentation. It provides attributes for specifying the language and source path. Currently supported languages are java and python.


This section can be used to include other repositories when building the docs for the current repository. The source attribute should be set accordingly.


This section can be used to include various Github buttons such as fork, watch, star, etc.


This section can be used to include a Github ribbon linking to your github repo.

build:
  theme:
    name: sphinx_fossasia_theme
  source: docs
  logo: images/logo.svg
  markdown_flavour: markdown_github
    - numpy
    - scipy
    - language: python
      source: modules
    - language: java
    - url: <URL of Subproject 1>
      source: doc
    - url: <URL of subproject 2>

Following is a description of the config options under the publish section and an example of their usage.

Key Description Default

It provides an attribute url whose value is written to a CNAME file while publishing to GitHub Pages.


It provides an app_name attribute which is used as the name of the heroku app. Your docs would be deployed at <app_name>.herokuapp.com

    url: yaydoc.fossasia.org
    app_name: yaydoc 

Following is a description of the config options under the extras section and an example of their usage.

Key Description Default

This can be used to include swagger API documentation in the build. The attribute url should point to a valid swagger json file. It also accepts an additional parameter ui which for now only supports swagger.


It takes an attribute path and can include javadocs from the repository.

    url: http://api.susi.ai/docs/swagger.json
    ui: swagger
    path: 'src/' 


Using API Blueprint with Yaydoc

As part of extending the capability of Yaydoc to document APIs, this week we integrated API Blueprint with Yaydoc. Now we can parse apib files and add the parsed content to the generated documentation. From the official homepage of API Blueprint:

API Blueprint is simple and accessible to everybody involved in the API lifecycle. Its syntax is concise yet expressive. With API Blueprint you can quickly design and prototype APIs to be created or document and test already deployed mission-critical APIs. It is a documentation-oriented web API description language. The API Blueprint is essentially a set of semantic assumptions laid on top of the Markdown syntax used to describe a web API.

To integrate API Blueprint with Yaydoc, we used a sphinx extension named sphinxcontrib-apiblueprint. This extension can directly translate text in API Blueprint format into docutils nodes. The advantage of this approach compared to using tools like aglio is that the generated HTML fits in nicely with the existing theme. In future we may still provide the ability to generate HTML using tools like aglio if the user prefers. Adding an extension to sphinx is very easy. In the conf.py template, we added the extension to the list of already enabled extensions.

extensions += ['sphinxcontrib.apiblueprint']

The above extension provides a directive, apiblueprint, which can then be used to include apib files. The directive is very similar to the built-in include directive. The only difference is that it should only be used to include files in API Blueprint format. You can see an example below of how to use this directive.

.. apiblueprint:: <path to apib file>

Although this is enough for projects which use the reST markup format, it cannot be used with projects using markdown as the primary markup format, since markdown doesn’t support the concept of directives. To solve this, we used the eval_rst block provided by recommonmark in Yaydoc. It allows users to embed valid reST within markdown, and recommonmark will properly parse the embedded text as reST. A user can now use directives within markdown by placing them inside a fenced code block whose language is set to eval_rst. The directive to put inside such a block is shown below.

.. apiblueprint:: <path to apib file>

In order to implement this, we used the AutoStructify class provided by recommonmark. Here’s a snippet from our conf.py template. Note that this has far-reaching effects: users are now able to add constructs like toctree in markdown, which wasn’t possible before.

from recommonmark.transform import AutoStructify

def setup(app):
    app.add_config_value('recommonmark_config', {
        'enable_eval_rst': True,
    }, True)
    # register the transform so that eval_rst blocks are processed
    app.add_transform(AutoStructify)

Let’s see all of this in action. Here’s a preview of a generated documentation with API Blueprint using Yaydoc.


Documenting APIs with Yaydoc

API documentation is a quick and concise way to tell a user how to use a library or work with a program. It details classes, functions, parameters, return types and more. Courtesy of Sphinx, Yaydoc has had built-in support for documenting APIs of Python based projects right from its inception. Sphinx has a built-in extension, autodoc, which provides directives such as autoclass, automodule, etc. These can be used to automatically extract docstrings from all specified Python packages and modules and use them to generate API documentation. As a user of Yaydoc you could add reST source files with the appropriate autodoc directives and we would handle the rest. As part of enhancing this feature we wanted to do three things.

  • Enhance support for Python
  • Extend API documentation to other languages apart from Python
  • Automate the process of generating ReST source files

To enhance support for python projects, we implemented a few things.

Since autodoc imports the modules it needs to document, there could be import errors if a dependency was not met. To fix this issue, a user can now specify certain modules to be mocked. This comes in really handy for projects depending on packages with third party C extensions such as numpy, scipy, etc.

{% if mock_modules %}
mock_modules = [name.strip() for name in '{{ mock_modules }}'.split(',')]
sys.modules.update((mod_name, mock.Mock()) for mod_name in mock_modules)
{% endif %}
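
Outside of the template, the same trick can be tried in isolation. The sketch below uses unittest.mock from the standard library (the template may rely on the separate mock package instead), and the module name some_missing_c_extension is made up for the example.

```python
import sys
from unittest import mock

# Hypothetical module name that is not installed on this machine.
mock_modules = ['some_missing_c_extension']
sys.modules.update((name, mock.Mock()) for name in mock_modules)

# The import now succeeds because Python finds the entry in sys.modules.
import some_missing_c_extension

# Any attribute access or call on the mock works and returns another mock.
result = some_missing_c_extension.compute(1, 2)
```

This is enough for autodoc, which only needs the import statements in the documented modules to succeed.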

Apart from this, if we detect a setup.py in the repository or a requirements.txt, we automatically try to install from it to meet dependencies.

# autodoc imports the module while building source files. To avoid
# ImportError, install any packages in requirements.txt of the project
# if available
if [ -f "$ROOT_DIR/setup.py" ]; then
  pip install "$ROOT_DIR/"
elif [ -f "$ROOT_DIR/requirements.txt" ]; then
  pip install -q -r "$ROOT_DIR/requirements.txt"
fi

We also crawl the repository to detect any packages and add them to sys.path. With these changes, a user can expect generated API docs without having to extend conf.py.

{% if autoapi_python == 'true' %}
for (dirpath, dirnames, filenames) in os.walk('{{ root_dir }}'):
    # Directory contains __init__.py. It should be a python package
    if '__init__.py' in filenames:
        # appending instead of inserting at front so that user
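        # cannot overwrite some of our own modules.
        # assumption: the parent directory is added, as that is what makes the package importable
        sys.path.append(os.path.abspath(os.path.join(dirpath, os.path.pardir)))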
{% endif %}

The second goal is a no brainer. We would like to support as many languages as we can. With this week’s update, Java has been added to the officially supported list of languages for which Yaydoc can generate full API documentation without any manual intervention. To extract API documentation for java source files, we used a sphinx extension named javasphinx. From the official javasphinx docs,

javasphinx is a Sphinx extension that provides a Sphinx domain for documenting Java projects and a javasphinx-apidoc command line tool for automatically generating API documentation from existing Java source code and Javadoc documentation.

javasphinx-apidoc -o source/ $ROOT_DIR/$AUTOAPI_JAVA_PATH/
sphinx-apidoc -o source/ $ROOT_DIR/$AUTOAPI_PYTHON_PATH/

For the third goal, we use the tools sphinx-apidoc and javasphinx-apidoc to generate source files.


Improving Custom PyPI Theme Support In Yaydoc

Yaydoc has supported custom themes nearly from its inception. If it could not find a theme locally, it would automatically try to install it via pip and set up the appropriate metadata about the theme in the generated conf.py. It was one of the first major enhancements we provided compared to using bare sphinx to generate documentation. Since then, a large number of features have been added to ease the process of documentation generation, but the core theming aspects have remained unchanged.

To use a theme, sphinx needs the exact name of the theme and the absolute path to it. To obtain these metadata, the existing implementation accessed the __file__ attribute of the imported package to get the absolute path to the __init__.py file, a necessary element of all python packages. From there we searched for a file named theme.conf, and thus the directory containing that file was our required theme.

There were a few mistakes in our earlier implementation. For starters, we assumed that the distribution name of the theme on PyPI and the name of the package to be imported would be the same. This is generally true but not guaranteed. One such theme from PyPI is Flask-Sphinx-Themes. While you need to install it using

pip install Flask-Sphinx-Themes

to import it in a module one needs to write

import flask_sphinx_themes

This led to build errors when themes like this were used. To solve this, we used the pkg_resources package. It allows us to get various metadata about a package in an abstract way, without needing to handle specifically whether the package is zipped or not.

try:
    dist = pkg_resources.get_distribution('{{ html_theme }}')
    top_level = list(dist._get_metadata('top_level.txt'))[0]
    dist_path = os.path.join(dist.location, top_level)
except (pkg_resources.DistributionNotFound, IndexError):
    print("\nError with distribution {0}".format('{{ html_theme }}'))
    html_theme = 'fossasia_theme'
    html_theme_path = ['_themes']

The idea here is that instead of searching for __init__.py, we read the name of the top_level directory using the first entry of the top_level.txt, a file created by setuptools when installing the package. We build the path by joining the location attribute of the Distribution object and the name of the top_level directory. The advantage with this approach is that we don’t need to import anything and thus no longer need to know the exact package name.

With this update, support for custom themes has been greatly improved.


Using a YAML file to read configuration options in Yaydoc

Yaydoc provides a lot of configurable variables which can be set as required to configure various parts of the build process. You can see the entire list of variables on the project’s homepage. Till now the only way to do this was to set the appropriate environment variables. Since a web user interface for yaydoc is in development, providing a clean UI was very important. This meant that we could not just create a bunch of input fields for all variables, as that could be overwhelming for a new user. So we decided to ask for only minimal information in the web form and read the other variables, if the user chooses to specify them, from a YAML file in the target repository.

To read a YAML file, we used PyYAML. It is a well established Python package to safely read data from a YAML file and convert it to a Python dictionary. Here is the code snippet for that.

import yaml

def get_yaml_config():
    try:
        with open('.yaydoc.yml', 'r') as file:
            conf = yaml.safe_load(file)
    except FileNotFoundError:
        return {}
    return conf

The above code snippet returns a dictionary containing all keys read from the YAML file. Since none of the options are required, we first create a dictionary with all the defaults and recursively merge it with the yaml dict. The merging is done using the following code snippet:

def update_dict(base, head):
    # recursively merge the values of head into base
    for key, value in head.items():
        if isinstance(base, dict):
            if isinstance(value, dict):
                base[key] = update_dict(base.get(key, {}), value)
            else:
                base[key] = head[key]
        else:
            base = {key: head[key]}
    return base

Now you can create a .yaydoc.yml file in the root of your repository and yaydoc would read options from there. Here is a sample yml file.

  author: FOSSASIA
  projectname: Yaydoc
  version: development

  doctheme: fossasia_theme
  docpath: docs/
  logo: images/logo.svg
  markdown_flavour: markdown_github

    docurl: yaydoc.fossasia.org

It should be noted that the layout of the file may change in the future as the project is in active development.