Extending Markdown Support in Yaydoc

Yaydoc, our automatic documentation generator, builds static websites from a set of markup documents in markdown or reStructuredText format. Yaydoc uses the sphinx documentation generator internally hence reStructuredText support comes out of the box with it. To support markdown we use multiple techniques depending on the context. Most of the markdown support is provided by recommonmark, a docutils bridge for sphinx which basically converts markdown documents into proper docutil’s abstract syntax tree which is then converted to HTML by sphinx. While It works pretty well for most of the use cases, It does fall short in some instances. They are discussed in the following paragraphs.

The first problem was inclusion of other markdown files in the starting page. This was due to the fact that markdown does not supports any include mechanism. And if we used the reStructuredText include directive, the included text was parsed as reStructuredText. This problem was solved earlier using pandoc – an excellent tool to convert between various markup formats. What we did was that we created another directive mdinclude which converts the markdown to reStructuredText before inclusion. Although this was solved a while ago, The reason I’m discussing this here is that this was the inspiration behind the solution to our recent problem.

The problem we encountered was that recommonmark follows the Commonmark spec which is an ongoing effort towards standardization of markdown which has been somewhat lacking till now. The process is currently going on so the recommonmark library doesn’t yet support the concept of extensions to support various features of different markdown flavours not in the core commonmark spec. We could have settled for only supporting the markdown features in the core spec but tables not being present in the core spec was problematic. We had to support tables as it is widely used in most of the docs present in github repositories as GFM(Github Flavoured Markdown) renders ascii tables nicely.

The solution was to use a combination of recommonmark and pandoc. recommonmark provides a eval_rst code block which can be used to embed non-section reStructuredText within markdown. I created a new MarkdownParser class which inherited the CommonMarkParser class from recommonmark. Within it, using regular expressions, I convert any text within `<!– markdown+ –>` and `<!– endmarkdown+ –>`  into reStructuredText and enclose it within eval_rst code block. The result was that tables when enclosed within those trigger html comments would be converted to reST tables and then enclosed within eval_rst block which resulted in recommonmark renderering them properly. Below is a snippet which shows how this was implemented.

import re
from recommonmark.parser import CommonMarkParser
from md2rst import md2rst


MARKDOWN_PLUS_REGEX = re.compile('<!--\s+markdown\+\s+-->(.*?)<!--\s+endmarkdown\+\s+-->', re.DOTALL)
EVAL_RST_TEMPLATE = "```eval_rst\n{content}\n```"


def preprocess_markdown(inputstring):
    def callback(match_object):
        text = match_object.group(1)
        return EVAL_RST_TEMPLATE.format(content=md2rst(text))

    return re.sub(MARKDOWN_PLUS_REGEX, callback, inputstring)


class MarkdownParser(CommonMarkParser):
    def parse(self, inputstring, document):
        content = preprocess_markdown(inputstring)
        CommonMarkParser.parse(self, content, document)

Resources

Handling Errors While Parsing the yaml File in Yaydoc

Yaydoc, our automatic documentation generator uses a yaml file to read a user’s configuration. The internal configuration parser basically converts the yaml file to a python dictionary. Then, it serializes the values of that dictionary using a custom serialization format. From there it associates those values with environment variables which are then passed to bash scripts for various tasks such as deployment, generation, etc.. Some of those environment variables are again passed to another python layer which interacts with sphinx where they are deserialized before use. This whole system works pretty well for our use cases.

Now let’s assume a user adds a yaml file where they have a malformed section in the file. For example, to specify a theme, one needs to add the following to the yaml file.

build:
  theme:
    name: sphinx_fossasia_theme

But our user has the following in their yaml file.

build:
  theme: sphinx_fossasia_theme

Now this will raise an error as we expect a dictionary as a value for the key ‘theme’ but we got a string. Now how do we handle such cases without ignoring the entire file as that would be too much of a penalty for such a small mistake? One approach would have been to wrap each call to connect with a bunch of try-catch but that would render the code unreadable as the initial motivation for implementing the connect method was to abstract the internal implementation so that other contributors who may not be well versed with python can also easily add config options without needing to learn a bunch of python constructs.

So, what we did was that, while merging the dictionary containing default options and the dictionary containing the user preferences, we check whether the default has the same data type as that of the incoming value. If they are, It’s deemed safe to merge. There are certain relaxations though, like if the current type is a list, then the incoming value can be of any time as that can always be converted to a list of a single element. This is required to support the following syntax.

key:
  - value
key: value

The above two blocks are equivalent due to the above-mentioned approach although the type is different.

Now, after this pre-validation step is over we can ensure that the if the assumed type for a key is let’s say a dictionary, then it would be a dictionary. Hence no type errors would be raised like trying to access a dict method for another object, say a string which happened with the earlier implementation. After this, an extra parameter was added to the connect method to which we can now pass a validation function which if returns false, those values would be ignored. Usage of this feature has been implemented to a small level where we validate the links to subprojects and if they look like a valid github repo only then will they be included. Note that their existence is not checked. Only a regex based validation is performed.

It was also important to notify the user about these events when we detect that a specific section is invalid and provide informative and helpful error messages without failing the build. Hence proper error messages were also added which were informative so that the user knows exactly which section is to blame. This is similar to compilers where the error message is crucial to debug a certain piece of code.

Resources

Implementing a Custom Serializer for Yaydoc

At the crux of it, Yaydoc is comprised of a number of specialized bash scripts which perform various tasks such as generating documentation, publishing it to github pages, heroku, etc. These bash scripts also serve as the central communication portal for various technologies used in Yaydoc. The core generator is composed of several Python modules extending the sphinx documentation generator. The web Interface has been built using Node, Express, etc. Yaydoc also contains a Python package dedicated to reading configuration options from a Yaml file.

Till now the options were read and then converted to strings irrespective of the actual data type, based on some simple rules.

  • List was converted to a comma separated string.(Nested lists were not handled)
  • Boolean values were converted to true | false respectively.
  • None was converted to an empty string.

While these simple rules were enough at that time, It was certain that a better solution would be required as the project grew in size. It was also getting tough to maintain because a lot of hard-coding was required when we wanted to convert those strings to python objects. To handle these cases, I decided to create a custom serialization format which would be simple for our use cases and easily parseable from a bash script yet can handle all edge cases. The format is mostly similar to its earlier form apart from lists where it takes heavy inspiration from the python language itself.

With the new implementation, Lists would get converted to comma separated strings enclosed by square brackets. This allowed us to encode the type of the object in the string so that it can later be decoded. This handled the case of an empty list or a list with single element well. The implementation also handled nested lists.

Two methods were created namely serialize and deserialize which detected the type of the corresponding object using several heuristics and applied the proper serialization or deserialization rule.

def serialize(value):
    """
    Serializes a python object to a string.
    None is serialized to an empty string.
    bool values are converted to strings True False.
    list or tuples are recursively handled and are comma separated.
    """
    if value is None:
        return ''
    if isinstance(value, str):
        return value
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return '[' + ','.join(serialize(_) for _ in value) + ']'
    return str(value)

To deserialize we also had to handle the case of nested lists. The following snippet does that properly.

def deserialize(value, numeric=True):
    """
    Deserializes a string to a python object.
    Strings True False are converted to bools.
    `numeric` controls whether strings should be converted to
    ints or floats if possible. List strings are handled recursively.
    """
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    if numeric and _is_numeric(value):
        return _to_numeric(value)
    if value.startswith('[') and value.endswith(']'):
        split = []
        element = ''
        level = 0
        for c in value:
            if c == '[':
                level += 1
                if level != 1:
                    element += c
            elif c == ']':
                if level != 1:
                    element += c
                level -= 1
            elif c == ',' and level == 1:
                split.append(element)
                element = ''
            else:
                element += c
        if split or element:
            split.append(element)
        return [deserialize(_, numeric) for _ in split]
    return value

With this new approach, we are able to handle much more cases as compared to the previous implementation and is much more robust. It does however still lacks lacks certain features such as serializing dictionaries. That may be be implemented in the future if need be.

Resources

Adding Github buttons to Generated Documentation with Yaydoc

Many times repository owners would want to link to their github source code, issue tracker etc. from the documentation. This would also help to direct some users to become a potential contributor to the repository. As a step towards this feature, we added the ability to add automatically generated GitHub buttons to the top of the docs with Yaydoc.

To do so we created a custom sphinx extension which makes use of http://buttons.github.io/ which is an excellent service to embed GitHub buttons to any website. The extension takes multiple config values and using them generates the `html` which it adds to the top of the internal docutils tree using a raw node.

GITHUB_BUTTON_SPEC = {
    'watch': ('eye', 'https://github.com/{user}/{repo}/subscription'),
    'star': ('star', 'https://github.com/{user}/{repo}'),
    'fork': ('repo-forked', 'https://github.com/{user}/{repo}/fork'),
    'follow': ('', 'https://github.com/{user}'),
    'issues': ('issue-opened', 'https://github.com/{user}/{repo}/issues'),
}

def get_button_tag(user, repo, btn_type, show_count, size):
    spec = GITHUB_BUTTON_SPEC[btn_type]
    icon, href = spec[0], spec[1].format(user=user, repo=repo)
    tag_fmt = '<a class="github-button" href="{href}" data-size="{size}"'
    if icon:
        tag_fmt += ' data-icon="octicon-{icon}"'
    tag_fmt += ' data-show-count="{show_count}">{text}</a>'
    return tag_fmt.format(href=href,
                          icon=icon,
                          size=size,
                          show_count=show_count,
                          text=btn_type.title())

The above snippet shows how it takes various parameters such as the user name, name of the repository, the button type which can be one of fork, issues, watch, follow and star, whether to display counts beside the buttons and whether a large button should be used. Another method named get_button_tags is used to read the various configs and call the above method with appropriate parameters to generate each button.

The extension makes use of the doctree-resolved event emitted by sphinx to hook into the internal doctree. The following snippet shows how it is done.

def on_doctree_resolved(app, doctree, docname):
    if not app.config.github_user_name or not app.config.github_repo:
        return
    buttons = nodes.raw('', get_button_tags(app.config), format='html')
    doctree.insert(0, buttons)

Finally we add the custom javascript using the add_javascript method.

app.add_javascript('https://buttons.github.io/buttons.js')

To use this with yaydoc, users would just need to add the following to their .yaydoc.yml file.

build:
  github_button:
    buttons:
      watch: true
      star: true
      issues: true
      fork: true
      follow: true
    show_count: true
    large: true

Resources

  1.  Homepage of Github:buttons – http://buttons.github.io/
  2. Sphinx extension Tutorial – http://www.sphinx-doc.org/en/stable/extdev/tutorial.html

Implementing an Interface for Reading Configuration from a YAML File for Yaydoc

Yaydoc reads configuration specified in a YAML file to set various options during the build process. This allows users to customize various properties of the build process. The current implementation for this was very basic. Basically it uses a pyYAML, a yaml parser for python to read the file and convert it to a python dictionary. From the dictionary we extracted values for various properties and converting them to strings using various heuristics such as converting True to ”true”, False to ”false”, a list to comma separated string and None to an empty string. Finally, we exported variables with those values.

Recently the entire code for this was rewritten using object-oriented paradigm. The motivation for this came from the fact that the implementation lacked certain features and also required some refactoring for long term readability. In the following paragraph, I have discussed the new implementation.

Firstly a Configuration class was created which basically wraps around a dictionary and provide certain utility methods. The primary difference is that the Configuration class allows dotted key access. This means that you can use the following syntax to access nested keys.

theme = conf[‘build.theme.name’]

The class provides another method connect which is used to connect environment variables with configuration values. This method also takes a dotted key but provides an extension on top of that to handle the case when a certain option can take multiple values. For example,

option: my_option

Or,

option:
  - my_option1
  - my_option2

To indicate that a certain config is of this type, you can specify a “@” character at the end of the key. Anything after the “@” character is assumed to be an attribute of each element within the list. Let’s see an example of this whole process.

build:
  subproject:
    - url: <url1>
  source: “doc”
    - url: <url2>

Now to extract all urls from the above file, we’d need to do the following

config.connect(‘SUBPROJECT_URLS’, ‘[email protected]’)

To extract sources, we’ll also use the default parameter as the source option is optional.

config.connect(‘SUBPROJECT_SOURCES’, [email protected]’, default=’docs’)

Finally, The Configuration object also provides a getenv method which reads all connection and serializes values to string according to the previously described heuristics. It then returns a dictionary of all environment variables which must be set.

Resources

Specifying Configurations for Yaydoc with .yaydoc.yml

Like many continuous integration services, Yaydoc also reads configurations from a YAML file. To get started with Yaydoc the first step is to create a file named .yaydoc.yml and specify the required options. Recently the yaydoc team finalized the layout of the file. You can still expect some changes as the projects continues to grow on but they shall not be major ones. Let me give you an outline of the entire layout before describing each in detail. Overall the file is divided into four sections.

  • metadata
  • build
  • publish
  • extras

For the first three sections, their intention is pretty clear from their names. The last section extras is meant to hold settings related to integration to external services.

Following is a description of config options under the metadata section and it’s example usage.

Key Description Default
author

The author of the repository. It is used to construct the copyright text.

user/organization
projectname

The name of the project. This would be displayed on the generated documentation.

Name of the repository
version

The version of the project. This would be displayed alongside the project name

Current UTC date
debug

If true, the logs generated would be a little more verbose. Can be one of true|false.

true
inline_math

Whether inline math should be enabled. This only affects markdown documents.

false
autoindex

This section contains various settings to control the behavior of the auto generated index. Use this to customize the starting page while having the benefit of not having to specify a manual index.

metadata:
  author: FOSSASIA
  projectname: Yaydoc
  version: development
  debug: true
  inline_math: false

Following is a description of config options under the build section and it’s example usage.

Key Description Default
theme

The theme which should be used to build the documentation. The attribute name can be one of the built-in themes or any custom sphinx theme from PyPI. Note that for PyPI themes, you need to specify the distribution name and not the package name. It also provides an attribute options to control specific theme options

sphinx_fossasia_theme
source

This is the path which would be scanned for markdown and reST files. Also any static content such as images referenced from embedded html in markdown should be placed under a _static directory inside source. Note that the README would always be included in the starting page irrespective of source from the auto-generated index

docs/
logo

The path to an image to be used as logo for the project. The path specified should be relative to the source directory.

markdown_flavour

The markdown flavour which should be used to parse the markdown documents. Possible values for this are markdown, markdown_strict, markdown_phpextra, markdown_github, markdown_mmd and commonmark. Note that currently this option is only used for parsing any included documents using the mdinclude directive and it’s unlikely to change soon.

markdown_github
mock

Any python modules or packages which should be mocked. This only makes sense if the project is in python and uses autodoc has C dependencies.

autoapi

If enabled, Yaydoc will crawl your repository and try to extract API documentation. It Provides attributes for specifying the language and source path. Currently supported languages are java and python.

subproject

This section can be used to include other repositories when building the docs for the current repositories. The source attribute should be set accordingly.

github_button

This section can be used to include various Github buttons such as fork, watch, star, etc.

hidden
github_ribbon

This section can be used to include a Github ribbon linking to your github repo.

hidden
build:
  theme:
    name: sphinx_fossasia_theme
  source: docs
  logo: images/logo.svg
  markdown_flavour: markdown_github
  mock:
    - numpy
    - scipy
  autoapi:
    - language: python
      source: modules
    - language: java
  subproject:
    - url: <URL of Subproject 1>
      source: doc
    - url: <URL of subproject 2>

Following is a description of config options under the publish section and it’s example usage.

Key Description Default
ghpages

It provides a attribute url whose value is written in a CNAME file while publishing to github pages.

heroku

It provides an app_name attribute which is used as the name of the heroku app. Your docs would be deployed at <app_name>.herokuapp.com

publish:
  ghpages:
    url: yaydoc.fossasia.org
  heroku:
    app_name: yaydoc 

Following is a description of config options under the extras section and it’s example usage.

Key Description Default
swagger

This can be used to include swagger API documentation in the build. The attribute url should point to a valid swagger json file. It also accepts an additional parameter ui which for now only supports swagger.

javadoc

It takes an attribute path and can include javadocs from the repository.

extras:
  swagger:
    url: http://api.susi.ai/docs/swagger.json
    ui: swagger
  javadoc:
    path: 'src/' 

Resources

Displaying selective logs in Yaydoc with downloadable detailed logs

Yaydoc, our automatic documentation generation and deployment project, at the crux of it consists of bash scripts. These bash scripts perform various actions, including documentation generation, deployment to Github and deployment to Heroku.

Since the inception of the User Interface of the Web UI, we have been spitting out the output of the bash script directly to the console output block without any filter. We understand that the output contains a lot of jargon that provides no essential information to the user. Hence, to improve the user experience of our platform, we decided to separate the Detailed logs and show only basic logs outlining the flow of the processes.

To implement this, we append all the logs to a unique file, sending only basic logs to stdout and stderr. Since we attend to store logs and display them in the console block of Web UI simultaneously we use the tee command, piping it with echo commands.

function print_log {
  if [ -n “$LOGFILE” ]; then
    echo -e $1 | tee -a ${LOGFILE}
  else
    echo -e $1
  fi
}

${LOGFILE}  is a unique file that has the same name as the preview directory and the compressed repository. After storing the logs in the file, the contents of the file are outputted using the cat command and is then shown to the user in a modal which is after the Detailed Logs button is clicked.

$(“#btnLogs”).click(function () {
  socket.emit(‘retrieve-detailed-logs’, data);
  ....
});

socket.on(‘retrieve-detailed-logs’, function (data) {
  var process = spawn(‘cat’, [‘temp/’+data.email+’/’+data.uniqueId+’.txt’]);
  process.stdout.setEncoding(‘utf-8’);
  process.stdout.on(‘data’, function (data)) {
    socket.emit(‘file-content’, data);
  }
});

To keep the indentation of the logs, we display the content inside the HTML pre tags. Along with displaying the detailed logs, we also let our two additional functionalities. These are ‘Copy to Clipboard’ and ‘Download as text file’. The ‘copy to clipboard’ functionality is achieved using the clipboard.js jQuery library while the `Download` functionality is achieved using res.download(file) function of ExpressJS response.

socket.on(‘file-content’, function (data) {
  new Clipboard(‘#copy-button’);
  $(‘#detailed-logs’).html(data);
  $(‘#detailed-logs-modal’).modal(‘show’);
});

Resources:

  1. https://clipboardjs.com – A modern approach to copy text to clipboard
  2. https://socket.io/ – The fastest and most reliable real-time engine