Python-Markdown includes an API for extension writers to plug their own custom functionality and/or syntax into the parser. There are Preprocessors which allow you to alter the source before it is passed to the parser, inline patterns which allow you to add, remove or override the syntax of any inline elements, and Postprocessors which allow munging of the output of the parser before it is returned. If you really want to dive in, there are also Blockprocessors which are part of the core BlockParser.
As the parser builds an ElementTree object which is later rendered as Unicode text, there are also some helpers provided to ease manipulation of the tree. Each part of the API is discussed in its respective section below. Additionally, reading the source of some Available Extensions may be helpful. For example, the Footnotes extension uses most of the features documented here.
Preprocessors munge the source text before it is passed into the Markdown core. This is an excellent place to clean up bad syntax, extract things the parser may otherwise choke on and perhaps even store it for later retrieval.
Preprocessors should inherit from markdown.preprocessors.Preprocessor
and
implement a run
method with one argument lines
. The run
method of
each Preprocessor will be passed the entire source text as a list of Unicode
strings. Each string will contain one line of text. The run
method should
return either that list, or an altered list of Unicode strings.
A pseudo example:
from markdown.preprocessors import Preprocessor
class MyPreprocessor(Preprocessor):
def run(self, lines):
new_lines = []
for line in lines:
m = MYREGEX.match(line)
if m:
# do stuff
else:
new_lines.append(line)
return new_lines
Inline Patterns implement the inline HTML element syntax for Markdown such as
*emphasis*
or [links](http://example.com)
. Pattern objects should be
instances of classes that inherit from markdown.inlinepatterns.Pattern
or
one of its children. Each pattern object uses a single regular expression and
must have the following methods:
getCompiledRegExp()
:
Returns a compiled regular expression.
handleMatch(m)
:
Accepts a match object and returns an ElementTree element of a plain Unicode string.
Note that any regular expression returned by getCompiledRegExp
must capture
the whole block. Therefore, they should all start with r'^(.*?)'
and end
with r'(.*?)!'
. When using the default getCompiledRegExp()
method
provided in the Pattern
you can pass in a regular expression without that
and getCompiledRegExp
will wrap your expression for you and set the
re.DOTALL
and re.UNICODE
flags. This means that the first group of your
match will be m.group(2)
as m.group(1)
will match everything before the
pattern.
For an example, consider this simplified emphasis pattern:
from markdown.inlinepatterns import Pattern
from markdown.util import etree
class EmphasisPattern(Pattern):
def handleMatch(self, m):
el = etree.Element('em')
el.text = m.group(3)
return el
As discussed in Integrating Your Code Into Markdown, an instance of this class will need to be provided to Markdown. That instance would be created like so:
# an oversimplified regex
MYPATTERN = r'\*([^*]+)\*'
# pass in pattern and create instance
emphasis = EmphasisPattern(MYPATTERN)
Actually it would not be necessary to create that pattern (and not just because
a more sophisticated emphasis pattern already exists in Markdown). The fact is,
that example pattern is not very DRY. A pattern for **strong**
text would
be almost identical, with the exception that it would create a ‘strong’ element.
Therefore, Markdown provides a number of generic pattern classes that can
provide some common functionality. For example, both emphasis and strong are
implemented with separate instances of the SimpleTagPattern
listed below.
Feel free to use or extend any of the Pattern classes found at
markdown.inlinepatterns
.
Generic Pattern Classes
SimpleTextPattern(pattern)
:
Returns simple text of group(2)
of a pattern
.
SimpleTagPattern(pattern, tag)
:
Returns an element of type “tag
” with a text attribute of group(3)
of a pattern
. tag
should be a string of a HTML element (i.e.: ‘em’).
SubstituteTagPattern(pattern, tag)
:
Returns an element of type “tag
” with no children or text (i.e.: br
).
There may be other Pattern classes in the Markdown source that you could extend or use as well. Read through the source and see if there is anything you can use. You might even get a few ideas for different approaches to your specific situation.
Treeprocessors manipulate an ElementTree object after it has passed through the core BlockParser. This is where additional manipulation of the tree takes place. Additionally, the InlineProcessor is a Treeprocessor which steps through the tree and runs the Inline Patterns on the text of each Element in the tree.
A Treeprocessor should inherit from markdown.treeprocessors.Treeprocessor
,
over-ride the run
method which takes one argument root
(an ElementTree
object) and either modifies that root element and returns None
or returns a
new ElementTree object.
A pseudo example:
from markdown.treeprocessors import Treeprocessor
class MyTreeprocessor(Treeprocessor):
def run(self, root):
root.text = 'modified content'
Note that Python class methods return None
by default when no return
statement is defined. Additionally all Python variables refer to objects by
reference. Therefore, the above run
method modifies the root
element
in place and returns None
. The changes made to the root
element and its
children are retained.
Some may be inclined to return the modified root
element. While that would
work, it would cause a copy of the entire ElementTree to be generated each
time the Treeprocessor is run. Therefore, it is generally expected that
the run
method would only return None
or a new ElementTree object.
For specifics on manipulating the ElementTree, see Working with the ElementTree below.
Postprocessors manipulate the document after the ElementTree has been serialized into a string. Postprocessors should be used to work with the text just before output.
A Postprocessor should inherit from markdown.postprocessors.Postprocessor
and over-ride the run
method which takes one argument text
and returns
a Unicode string.
Postprocessors are run after the ElementTree has been serialized back into Unicode text. For example, this may be an appropriate place to add a table of contents to a document:
from markdown.postprocessors import Postprocessor
class TocPostprocessor(Postprocessor):
def run(self, text):
return MYMARKERRE.sub(MyToc, text)
Sometimes, Preprocessors, Treeprocessors, Postprocessors, and Inline Patterns
are not going to do what you need. Perhaps you want a new type of block type
that needs to be integrated into the core parsing. In such a situation, you can
add/change/remove functionality of the core BlockParser
. The BlockParser is
composed of a number of Blockprocessors. The BlockParser steps through each
block of text (split by blank lines) and passes each block to the appropriate
Blockprocessor. That Blockprocessor parses the block and adds it to the
ElementTree. The
Definition Lists extension would be a good example of an extension that
adds/modifies Blockprocessors.
A Blockprocessor should inherit from markdown.blockprocessors.BlockProcessor
and implement both the test
and run
methods.
The test
method is used by BlockParser to identify the type of block.
Therefore the test
method must return a Boolean value. If the test returns
True
, then the BlockParser will call that Blockprocessor’s run
method.
If it returns False
, the BlockParser will move on to the next
Blockprocessor.
The test
method takes two arguments:
parent
: The parent ElementTree Element of the block. This can be useful
as the block may need to be treated differently if it is inside a list, for
example.
block
: A string of the current block of text. The test may be a
simple string method (such as block.startswith(some_text)
) or a complex
regular expression.
The run
method takes two arguments:
parent
: A pointer to the parent ElementTree Element of the block. The run
method will most likely attach additional nodes to this parent. Note that
nothing is returned by the method. The ElementTree object is altered in place.
blocks
: A list of all remaining blocks of the document. Your run
method must remove (pop) the first block from the list (which it altered in
place - not returned) and parse that block. You may find that a block of text
legitimately contains multiple block types. Therefore, after processing the
first type, your processor can insert the remaining text into the beginning
of the blocks
list for future parsing.
Please be aware that a single block can span multiple text blocks. For example,
The official Markdown syntax rules state that a blank line does not end a
Code Block. If the next block of text is also indented, then it is part of
the previous block. Therefore, the BlockParser was specifically designed to
address these types of situations. If you notice the CodeBlockProcessor
,
in the core, you will note that it checks the last child of the parent
.
If the last child is a code block (<pre><code>...</code></pre>
), then it
appends that block to the previous code block rather than creating a new
code block.
Each Blockprocessor has the following utility methods available:
lastChild(parent)
:
Returns the last child of the given ElementTree Element or None
if it
had no children.
detab(text)
:
Removes one level of indent (four spaces by default) from the front of each line of the given text string.
looseDetab(text, level)
:
Removes “level” levels of indent (defaults to 1) from the front of each line of the given text string. However, this methods allows secondary lines to not be indented as does some parts of the Markdown syntax.
Each Blockprocessor also has a pointer to the containing BlockParser instance at
self.parser
, which can be used to check or alter the state of the parser.
The BlockParser tracks it’s state in a stack at parser.state
. The state
stack is an instance of the State
class.
State
is a subclass of list
and has the additional methods:
set(state)
:
Set a new state to string state
. The new state is appended to the end
of the stack.
reset()
:
Step back one step in the stack. The last state at the end is removed from the stack.
isstate(state)
:
Test that the top (current) level of the stack is of the given string
state
.
Note that to ensure that the state stack does not become corrupted, each time a state is set for a block, that state must be reset when the parser finishes parsing that block.
An instance of the BlockParser
is found at Markdown.parser
.
BlockParser
has the following methods:
parseDocument(lines)
:
Given a list of lines, an ElementTree object is returned. This should be
passed an entire document and is the only method the Markdown
class
calls directly.
parseChunk(parent, text)
:
Parses a chunk of markdown text composed of multiple blocks and attaches
those blocks to the parent
Element. The parent
is altered in place
and nothing is returned. Extensions would most likely use this method for
block parsing.
parseBlocks(parent, blocks)
:
Parses a list of blocks of text and attaches those blocks to the parent
Element. The parent
is altered in place and nothing is returned. This
method will generally only be used internally to recursively parse nested
blocks of text.
While is is not recommended, an extension could subclass or completely replace
the BlockParser
. The new class would have to provide the same public API.
However, be aware that other extensions may expect the core parser provided
and will not work with such a drastically different parser.
As mentioned, the Markdown parser converts a source document to an ElementTree object before serializing that back to Unicode text. Markdown has provided some helpers to ease that manipulation within the context of the Markdown module.
First, to get access to the ElementTree module import ElementTree from
markdown
rather than importing it directly. This will ensure you are using
the same version of ElementTree as markdown. The module is found at
markdown.util.etree
within Markdown.
from markdown.util import etree
markdown.util.etree
tries to import ElementTree from any known location,
first as a standard library module (from xml.etree
in Python 2.5), then as
a third party package (ElementTree). In each instance, cElementTree
is
tried first, then ElementTree if the faster C implementation is not
available on your system.
Sometimes you may want text inserted into an element to be parsed by
Inline Patterns. In such a situation, simply insert the text as you normally
would and the text will be automatically run through the Inline Patterns.
However, if you do not want some text to be parsed by Inline Patterns,
then insert the text as an AtomicString
.
from markdown.util import AtomicString
some_element.text = AtomicString(some_text)
Here’s a basic example which creates an HTML table (note that the contents of
the second cell (td2
) will be run through Inline Patterns latter):
table = etree.Element("table")
table.set("cellpadding", "2") # Set cellpadding to 2
tr = etree.SubElement(table, "tr") # Add child tr to table
td1 = etree.SubElement(tr, "td") # Add child td1 to tr
td1.text = markdown.util.AtomicString("Cell content") # Add plain text content
td2 = etree.SubElement(tr, "td") # Add second td to tr
td2.text = "*text* with **inline** formatting." # Add markup text
table.tail = "Text after table" # Add text after table
You can also manipulate an existing tree. Consider the following example which
adds a class
attribute to <a>
elements:
def set_link_class(self, element):
for child in element:
if child.tag == "a":
child.set("class", "myclass") #set the class attribute
set_link_class(child) # run recursively on children
For more information about working with ElementTree see the ElementTree Documentation (Python Docs).
Once you have the various pieces of your extension built, you need to tell
Markdown about them and ensure that they are run in the proper sequence.
Markdown accepts an Extension
instance for each extension. Therefore, you
will need to define a class that extends markdown.extensions.Extension
and
over-rides the extendMarkdown
method. Within this class you will manage
configuration options for your extension and attach the various processors and
patterns to the Markdown instance.
It is important to note that the order of the various processors and patterns
matters. For example, if we replace http://...
links with <a>
elements,
and then try to deal with inline HTML, we will end up with a mess.
Therefore, the various types of processors and patterns are stored within an
instance of the Markdown class in OrderedDicts. Your Extension
class
will need to manipulate those OrderedDicts appropriately. You may insert
instances of your processors and patterns into the appropriate location in an
OrderedDict, remove a built-in instance, or replace a built-in instance with
your own.
extendMarkdown
¶The extendMarkdown
method of a markdown.extensions.Extension
class
accepts two arguments:
md
:
A pointer to the instance of the Markdown class. You should use this to access the OrderedDicts of processors and patterns. They are found under the following attributes:
md.preprocessors
md.inlinePatterns
md.parser.blockprocessors
md.treeprocessors
md.postprocessors
Some other things you may want to access in the markdown instance are:
md.htmlStash
md.output_formats
md.set_output_format()
md.output_format
md.serializer
md.registerExtension()
md.html_replacement_text
md.tab_length
md.enable_attributes
md.smart_emphasis
md_globals
:
Contains all the various global variables within the markdown module.
Warning
With access to the above items, theoretically you have the option to change anything through various monkey_patching techniques. However, you should be aware that the various undocumented parts of markdown may change without notice and your monkey_patches may break with a new release. Therefore, what you really should be doing is inserting processors and patterns into the markdown pipeline. Consider yourself warned!
A simple example:
from markdown.extensions import Extension
class MyExtension(Extension):
def extendMarkdown(self, md, md_globals):
# Insert instance of 'mypattern' before 'references' pattern
md.inlinePatterns.add('mypattern', MyPattern(md), '<references')
An OrderedDict is a dictionary like object that retains the order of it’s items. The items are ordered in the order in which they were appended to the OrderedDict. However, an item can also be inserted into the OrderedDict in a specific location in relation to the existing items.
Think of OrderedDict as a combination of a list and a dictionary as it has
methods common to both. For example, you can get and set items using the
od[key] = value
syntax and the methods keys()
, values()
, and
items()
work as expected with the keys, values and items returned in the
proper order. At the same time, you can use insert()
, append()
, and
index()
as you would with a list.
Generally speaking, within Markdown extensions you will be using the special
helper method add()
to add additional items to an existing OrderedDict.
The add()
method accepts three arguments:
key
: A string. The key is used for later reference to the item.
value
: The object instance stored in this item.
location
: Optional. The items location in relation to other items.
Note that the location can consist of a few different values:
The special strings "_begin"
and "_end"
insert that item at the
beginning or end of the OrderedDict respectively.
A less-than sign (<
) followed by an existing key (i.e.:
"<somekey"
) inserts that item before the existing key.
A greater-than sign (>
) followed by an existing key (i.e.:
">somekey"
) inserts that item after the existing key.
Consider the following example:
>>> from markdown.odict import OrderedDict
>>> od = OrderedDict()
>>> od['one'] = 1 # The same as: od.add('one', 1, '_begin')
>>> od['three'] = 3 # The same as: od.add('three', 3, '>one')
>>> od['four'] = 4 # The same as: od.add('four', 4, '_end')
>>> od.items()
[("one", 1), ("three", 3), ("four", 4)]
Note that when building an OrderedDict in order, the extra features of the
add
method offer no real value and are not necessary. However, when
manipulating an existing OrderedDict, add
can be very helpful. So let’s
insert another item into the OrderedDict.
>>> od.add('two', 2, '>one') # Insert after 'one'
>>> od.values()
[1, 2, 3, 4]
Now let’s insert another item.
>>> od.add('two-point-five', 2.5, '<three') # Insert before 'three'
>>> od.keys()
["one", "two", "two-point-five", "three", "four"]
Note that we also could have set the location of “two-point-five” to be ‘after two’
(i.e.: '>two'
). However, it’s unlikely that you will have control over the
order in which extensions will be loaded, and this could affect the final
sorted order of an OrderedDict. For example, suppose an extension adding
“two-point-five” in the above examples was loaded before a separate extension
which adds ‘two’. You may need to take this into consideration when adding your
extension components to the various markdown OrderedDicts.
Once an OrderedDict is created, the items are available via key:
MyNode = od['somekey']
Therefore, to delete an existing item:
del od['somekey']
To change the value of an existing item (leaving location unchanged):
od['somekey'] = MyNewObject()
To change the location of an existing item:
t.link('somekey', '<otherkey')
Some extensions may need to have their state reset between multiple runs of the Markdown class. For example, consider the following use of the Footnotes extension:
md = markdown.Markdown(extensions=['footnotes'])
html1 = md.convert(text_with_footnote)
md.reset()
html2 = md.convert(text_without_footnote)
Without calling reset
, the footnote definitions from the first document will
be inserted into the second document as they are still stored within the class
instance. Therefore the Extension
class needs to define a reset
method
that will reset the state of the extension (i.e.: self.footnotes = {}
).
However, as many extensions do not have a need for reset
, reset
is only
called on extensions that are registered.
To register an extension, call md.registerExtension
from within your
extendMarkdown
method:
def extendMarkdown(self, md, md_globals):
md.registerExtension(self)
# insert processors and patterns here
Then, each time reset
is called on the Markdown instance, the reset
method of each registered extension will be called as well. You should also
note that reset
will be called on each registered extension after it is
initialized the first time. Keep that in mind when over-riding the extension’s
reset
method.
If an extension uses any parameters that the user may want to change,
those parameters should be stored in self.config
of your
markdown.extensions.Extension
class in the following format:
class MyExtension(markdown.extensions.Extension):
def __init__(self, **kwargs):
self.config = {'option1' : ['value1', 'description1'],
'option2' : ['value2', 'description2'] }
super(MyExtension, self).__init__(**kwargs)
When implemented this way the configuration parameters can be over-ridden at
run time (thus the call to super
). For example:
markdown.Markdown(extensions=[MyExtension(option1='other value'])
Note that if a keyword is passed in that is not already defined in
self.config
, then a KeyError
is raised.
The markdown.extensions.Extension
class and its subclasses have the
following methods available to assist in working with configuration settings:
getConfig(key [, default])
:
Returns the stored value for the given key
or default
if the key
does not exist. If not set, default
returns an empty string.
getConfigs()
:
Returns a dict of all key/value pairs.
getConfigInfo()
:
Returns all configuration descriptions as a list of tuples.
setConfig(key, value)
:
Sets a configuration setting for key
with the given value
. If key
is
unknown, a KeyError
is raised. If the previous value of key
was
a Boolean value, then value
is converted to a Boolean value. If
the previous value of key
is None
, then value
is converted to
a Boolean value except when it is None
. No conversion takes place
when the previous value of key
is a string.
setConfigs(items)
:
Sets multiple configuration settings given a dict of key/value pairs.
makeExtension
¶As noted in the library reference an instance of an extension can be passed directly to Markdown. In fact, this is the preferred way to use third-party extensions.
For example:
import markdown
import myextension
myext = myextension.MyExtension(option='value')
md = markdown.Markdown(extensions=[myext])
Markdown also accepts “named” third party extensions for those occasions when it is impractical to import an extension directly (from the command line or from within templates).
The “name” of your extension must be a string consisting of the importable path to
your module using Python’s dot notation. Therefore, if you are providing a library
to your users and would like to include a custom markdown extension within your
library, that extension would be named "mylib.mdext.myext"
where mylib/mdext/myext.py
contains the extension and the mylib
directory is on the PYTHONPATH.
The string can also include the name of the class separated by a colon. Therefore, if you were to import the class like this:
from path.to.module import SomeExtensionClass
Then the named extension would comprise this string:
"path.to.module:SomeExtensionClass"
You do not need to do anything special to support this feature. As long as your extension class is able to be imported, a user can include it with the above syntax.
The above two methods are especially useful if you need to implement a large number of
extensions with more than one residing in a module. However, if you do not want to require
that your users include the class name in their string, you must define only one extension
per module and that module must contain a module-level function called makeExtension
that accepts **kwargs
and returns an extension instance.
For example:
class MyExtension(markdown.extensions.Extension)
# Define extension here...
def makeExtension(**kwargs):
return MyExtension(**kwargs)
When Markdown is passed the “name” of your extension as a dot notation string, it will import
the module and call the makeExtension
function to initiate your extension.