In this post we will learn how to use the ast module to extract docstrings from Python files.
Simply put, ast is a module in the standard library that parses Python syntax. Its whole purpose is to read Python code and break it down into its syntactic components. Let’s explore this concept by analyzing a simple statement, a = 3 * (b + c).
To parse a statement with ast, we can pass the code as a string to the function ast.parse. The function will return an instance of the ast.Module class that represents, simply put, a piece of code.
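As a minimal sketch of this step, the snippet below parses the statement a = 3 * (b + c), which is the example the rest of the post takes apart. The variable names source and tree are chosen here for illustration.

```python
import ast

# The example statement discussed in the text.
source = "a = 3 * (b + c)"

# ast.parse only parses the text; it never executes it,
# so b and c do not need to be defined.
tree = ast.parse(source)

print(type(tree))  # <class 'ast.Module'>
```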
How do we extract the contents of this piece of code? ast.Module has an attribute called body that lets you retrieve a list of all the statements contained in this code.
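A short sketch of inspecting body for the same example statement (the variable names are illustrative):

```python
import ast

tree = ast.parse("a = 3 * (b + c)")

# body is a plain Python list with one entry per top-level statement.
print(len(tree.body))               # 1
print(type(tree.body[0]).__name__)  # Assign
```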
As you can see, the attribute body is a Python list containing a single element, of type ast.Assign. Unsurprisingly, this corresponds to the single assignment operation a = value that we performed.
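How do we retrieve the left and right components of the assignment? Easily enough: ast.Assign has two attributes, targets and value, that contain exactly those two components. Note that targets is a list, because a single statement such as a = b = 1 can assign to several targets at once.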
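The two sides of the assignment can be read like this; a minimal sketch using the same example statement, with illustrative variable names:

```python
import ast

assign = ast.parse("a = 3 * (b + c)").body[0]

# Left-hand side: a list of targets, here a single Name node.
target = assign.targets[0]
print(type(target).__name__, target.id)  # Name a

# Right-hand side: a single expression node.
print(type(assign.value).__name__)       # BinOp
```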
To interactively explore which fields are available, each ast object exposes the attribute _fields, a tuple containing the names of the available fields.
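A quick way to try this out is sketched below. The exact contents of _fields for ast.Assign depend on the Python version (newer releases add a type_comment entry), so no output is shown for that line.

```python
import ast

assign = ast.parse("a = 3 * (b + c)").body[0]

# Field names of the assignment node; always includes 'targets' and 'value'.
print(assign._fields)

# Field names of the right-hand side, a binary operation.
print(assign.value._fields)  # ('left', 'op', 'right')
```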
As you can see, targets holds what we are assigning to (in this case a single ast.Name object corresponding to the variable a), and value is a binary operation, ast.BinOp, that corresponds to the expression 3 * (b + c). We can continue this process until we decompose the expression into its basic components.
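Continuing the decomposition one level further might look like the sketch below. It uses ast.literal_eval to read the constant 3, since the node class used for numeric literals differs between Python versions.

```python
import ast

binop = ast.parse("a = 3 * (b + c)").body[0].value

# Outer operation: 3 * (...)
print(ast.literal_eval(binop.left))  # 3
print(type(binop.op).__name__)       # Mult

# Inner operation: b + c
inner = binop.right
print(inner.left.id, type(inner.op).__name__, inner.right.id)  # b Add c
```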
The end result of this process is called an Abstract Syntax Tree. Each entity (an instance of a subclass of ast.AST) can be decomposed in a recursive structure. The following scheme is an illustration of the Abstract Syntax Tree for the expression above:
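A text rendering of the tree can also be produced in code. The helper below, outline, is a hypothetical name introduced for this sketch; it relies on ast.iter_child_nodes to visit each node's children recursively. The exact set of nodes printed (for example the Load and Store context nodes) can vary slightly between Python versions.

```python
import ast

def outline(node, depth=0):
    """Return one indented line per node, children nested under parents."""
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines.extend(outline(child, depth + 1))
    return lines

tree = ast.parse("a = 3 * (b + c)")
print("\n".join(outline(tree)))
```

For a more detailed dump that includes field values, ast.dump(tree) returns a single-string representation of the same structure.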
Now that we have a good understanding of how the parsing works, we can write a simple tool that takes a Python file and extracts all the top-level function definitions.
The main idea is that we iterate over all the nodes in Module.body and use isinstance to check whether each node is a function definition. As an example, we’ll parse the ast module itself, but you can use whatever module you want. To parse it, we first need to retrieve the location of the ast module's source file.
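One way to locate the source file is the module's __file__ attribute, as sketched below. This assumes a standard CPython installation where ast is shipped as a regular .py file; the variable name ast_path is illustrative.

```python
import ast

# Path of the file the ast module was loaded from.
ast_path = ast.__file__
print(ast_path)
```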
At this point we read the file as a string and parse it with ast. Then we iterate over the statements contained in the module and collect all of the ast.FunctionDef instances.
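A minimal sketch of the read, parse, and filter steps, assuming the source file is found through ast.__file__:

```python
import ast

# Read the source of the ast module itself as a string.
with open(ast.__file__, encoding="utf-8") as f:
    tree = ast.parse(f.read())

# Keep only the top-level function definitions.
functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
print(len(functions))
```

Only top-level statements are inspected here, so functions nested inside classes or other functions are not collected.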
If we want to see the function names, we can simply access the name attribute of ast.FunctionDef.
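For instance, the names can be gathered with a list comprehension. The sketch repeats the parsing step so that it runs on its own; the list you get depends on your Python version.

```python
import ast

with open(ast.__file__, encoding="utf-8") as f:
    tree = ast.parse(f.read())

functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]

# Each FunctionDef stores the function's name as a plain string.
names = [func.name for func in functions]
print(names)
```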
How do we extract the docstrings? Easy: you can call ast.get_docstring on an ast.FunctionDef object. Doing so in a loop lets us print the name of each function and its documentation.
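A sketch of such a loop is shown below. The separator lines are an assumption chosen to match the output format shown in this post, and the docstring text itself may differ between Python versions.

```python
import ast

with open(ast.__file__, encoding="utf-8") as f:
    tree = ast.parse(f.read())

functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]

for func in functions:
    print("---")
    print(func.name)
    print("---")
    # get_docstring returns None when the function has no docstring.
    print(ast.get_docstring(func))
```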
That will produce the following output:
---
parse
---
Parse the source into an AST node.
Equivalent to compile(source, filename, mode, PyCF_ONLY_AST).
---
literal_eval
---
Safely evaluate an expression node or a string containing a Python
expression. The string or node provided may only consist of the following
...
So far we learned how to extract docstrings from function definitions, but what about classes and methods?
As you know, when you declare a class, you write a bunch of function definitions in the class body to declare its methods. This translates to ast as follows: class definitions are represented as ast.ClassDef instances, and each ast.ClassDef object has a body attribute that contains the function definitions (the methods). To gather them, we first collect all the classes in the module, then for each class we collect its methods.
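The two-step collection could be sketched as follows, again using the ast module's own source as input. The dictionary methods_by_class is a name introduced here for illustration.

```python
import ast

with open(ast.__file__, encoding="utf-8") as f:
    tree = ast.parse(f.read())

# Step 1: collect the top-level class definitions.
classes = [node for node in tree.body if isinstance(node, ast.ClassDef)]

# Step 2: for each class, collect the function definitions in its body.
methods_by_class = {}
for cls in classes:
    methods = [n for n in cls.body if isinstance(n, ast.FunctionDef)]
    methods_by_class[cls.name] = [m.name for m in methods]

print(methods_by_class["NodeVisitor"])
```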
At this point, extracting the docstring is a matter of calling ast.get_docstring on the collected ast.FunctionDef and ast.ClassDef objects.
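To keep the result easy to check, the sketch below runs on a small made-up class instead of a real module. The Greeter class and the docs dictionary are hypothetical and exist only for this illustration.

```python
import ast

# Hypothetical sample source, used only for this illustration.
source = '''
class Greeter:
    """Say hello."""

    def greet(self):
        """Return a greeting."""
        return "hello"
'''

tree = ast.parse(source)
docs = {}
for cls in tree.body:
    if isinstance(cls, ast.ClassDef):
        # Docstring of the class itself.
        docs[cls.name] = ast.get_docstring(cls)
        for item in cls.body:
            if isinstance(item, ast.FunctionDef):
                # Docstring of each method, keyed as Class.method.
                docs[cls.name + "." + item.name] = ast.get_docstring(item)

print(docs)
# {'Greeter': 'Say hello.', 'Greeter.greet': 'Return a greeting.'}
```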
For more ast goodness, please check out the official documentation.
Thank you for reading, and happy parsing!