In this post we will learn how to use the ast module to extract docstrings from Python files.
Simply put, ast is a module in the standard library that parses Python syntax. Its whole purpose is to read Python code and break it down into its syntactic components. Let’s explore this concept by analyzing a simple statement, a = 3 * (b + c).
To parse a statement with ast, we can pass the code as a string to the function ast.parse. The function will return an instance of the ast.Module class, which represents a piece of code.
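As a minimal sketch of this step, we can parse the statement a = 3 * (b + c) that the rest of the post decomposes. The variable names source and tree are only illustrative. Note that the code is parsed, not executed, so b and c do not need to exist.

```python
import ast

# The statement from the text, passed to ast.parse as a plain string.
source = "a = 3 * (b + c)"
tree = ast.parse(source)

# The result is an ast.Module instance wrapping the parsed code.
print(type(tree).__name__)
```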
How do we extract the contents of this piece of code? ast.Module has an attribute called body that lets you retrieve a list of all the top-level statements contained in this code.
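A short sketch of inspecting body for the same one-line statement could look like this (the variable name tree is our own choice):

```python
import ast

tree = ast.parse("a = 3 * (b + c)")

# body is a plain list holding one node per top-level statement.
print(len(tree.body))
print(type(tree.body[0]).__name__)
```

Since the source contains a single statement, the list has one element, an ast.Assign node.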
As you can see, the attribute body is a Python list containing a single element, of type ast.Assign. Unsurprisingly, this corresponds to the single assignment operation a = value that we performed.
How do we retrieve the left and right components of the assignment? Easily enough: ast.Assign has two attributes, targets and value, that contain exactly those two components.
To interactively explore which fields are available, each ast object exposes the attribute _fields, a tuple with the names of the available fields.
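Here is a small sketch of that exploration on the assignment node. The exact contents of _fields can vary between Python versions (recent ones also list type_comment), so the example only relies on targets and value being present.

```python
import ast

assign = ast.parse("a = 3 * (b + c)").body[0]

# _fields names the child fields this node type exposes.
print(assign._fields)

# targets is a list, because one assignment can have several targets (a = b = 1).
target = assign.targets[0]
print(type(target).__name__, target.id)

# value is the node for the right-hand side of the assignment.
print(type(assign.value).__name__)
```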
As you can see, targets holds what we are assigning to (in this case a single ast.Name object corresponding to the variable a), and value is a binary operation, ast.BinOp, that corresponds to the expression 3 * (b + c). We can continue this process until we decompose the expression into its basic components.
The end result of this process is called an Abstract Syntax Tree. Each entity (an instance of ast.AST, the base class of all nodes) can be decomposed in a recursive structure. The following scheme is an illustration of the Abstract Syntax Tree for the expression above:
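To see the recursive structure in plain text, one option is a small helper that walks the tree with ast.iter_child_nodes and indents each node by its depth. The describe function below is a hypothetical helper written for this post, not part of ast.

```python
import ast


def describe(node, depth=0):
    """Return one indented line per node in the tree rooted at node."""
    lines = ["  " * depth + type(node).__name__]
    # iter_child_nodes yields every direct child that is itself an AST node.
    for child in ast.iter_child_nodes(node):
        lines.extend(describe(child, depth + 1))
    return lines


tree = ast.parse("a = 3 * (b + c)")
outline = describe(tree)
print("\n".join(outline))
```

The outline starts at Module, descends into Assign, and shows two nested BinOp nodes, one for the multiplication and one for the addition. It also lists context and operator nodes such as Store, Load, Mult and Add, which ast represents as nodes too.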
Now that we have a good understanding of how the parsing works, we can write a simple tool that takes a Python file and extracts all the top-level function definitions.
The main idea is that we iterate over all the nodes in Module.body and we use isinstance to check if the node is a function definition. As an example, we’ll parse the ast module itself, but you can use whatever module you want. To retrieve the location of the ast module we will use the following code:
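One way to find the file, assuming ast was loaded from a regular source file (as in a standard CPython installation), is the module's __file__ attribute. The name ast_path is our own.

```python
import ast

# A module loaded from a source file records that file's path in __file__.
ast_path = ast.__file__
print(ast_path)
```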
At this point we read the file as a string and parse it with ast. Then we iterate over the statements contained in the module and collect all of the ast.FunctionDef instances.
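A sketch of these two steps, again assuming ast.__file__ points at a readable source file. The variable names are illustrative.

```python
import ast

# Read the source of the ast module itself as a string.
with open(ast.__file__, encoding="utf-8") as fd:
    file_contents = fd.read()

module = ast.parse(file_contents)

# Keep only the top-level nodes that are function definitions.
function_definitions = [
    node for node in module.body if isinstance(node, ast.FunctionDef)
]
print(len(function_definitions))
```

Only plain def statements at the top level are matched here. Functions declared with async def are ast.AsyncFunctionDef nodes and would need their own check.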
If we want to see the function names, we can simply access the name attribute of each ast.FunctionDef.
How do we extract the docstrings? Easy: you can use ast.get_docstring on an ast.FunctionDef object. The following code will print the name of each function and its documentation:
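A minimal version of that loop might look like the sketch below. The docstrings dictionary is our own addition, used to keep the results around after printing.

```python
import ast

with open(ast.__file__, encoding="utf-8") as fd:
    module = ast.parse(fd.read())

docstrings = {}
for node in module.body:
    if isinstance(node, ast.FunctionDef):
        # get_docstring returns None when the function has no docstring.
        docstrings[node.name] = ast.get_docstring(node)

for name, doc in docstrings.items():
    print("---", name, "---")
    print(doc)
```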
That will produce the following output:
--- parse ---
Parse the source into an AST node. Equivalent to compile(source, filename, mode, PyCF_ONLY_AST).
--- literal_eval ---
Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following ...
So far we learned how to extract docstrings from function definitions, but what about classes and methods?
As you know, when you declare a class, you write a bunch of function definitions in the class body to declare its methods. This translates to ast as follows. Class definitions are represented as ast.ClassDef instances, and each ast.ClassDef object contains a body attribute that contains the function definitions (or methods). In the following example we first collect all the classes in the module, then for each class we collect its methods.
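The two passes could be sketched as follows, still using the ast module's own source as input. The names class_definitions and methods_by_class are illustrative.

```python
import ast

with open(ast.__file__, encoding="utf-8") as fd:
    module = ast.parse(fd.read())

# First pass: the class definitions at the top level of the module.
class_definitions = [
    node for node in module.body if isinstance(node, ast.ClassDef)
]

# Second pass: the function definitions found directly in each class body.
methods_by_class = {
    class_def.name: [
        node for node in class_def.body if isinstance(node, ast.FunctionDef)
    ]
    for class_def in class_definitions
}

for class_name, methods in methods_by_class.items():
    print(class_name, [method.name for method in methods])
```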
At this point, extracting the docstring is a matter of calling ast.get_docstring on the collected ast.FunctionDef objects.
For more ast goodness, please check out the official documentation.
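To round off the method docstrings step, here is a self-contained sketch on a tiny, made-up class. The Greeter source is hypothetical and only keeps the example short; the same loop works on any parsed module.

```python
import ast

# Hypothetical sample source, used only to keep the example small.
sample = '''
class Greeter:
    """Say hello."""

    def greet(self, name):
        """Return a greeting for name."""
        return "Hello, " + name

    def reset(self):
        pass
'''

module = ast.parse(sample)
method_docs = {}
for class_def in module.body:
    if not isinstance(class_def, ast.ClassDef):
        continue
    for node in class_def.body:
        if isinstance(node, ast.FunctionDef):
            # Methods without a docstring map to None.
            key = class_def.name + "." + node.name
            method_docs[key] = ast.get_docstring(node)

print(method_docs)
```

The class docstring is skipped by this loop because it is not a function definition. Calling ast.get_docstring on the ast.ClassDef node itself would return it.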
Thank you for reading, and happy parsing!