YAML 101
Introduction
YAML is short for “YAML Ain’t Markup Language”. It’s a way of storing basic data structures in human-readable form. You may also have heard of JSON, XML and TOML, which serve the same purpose. It’s worth mentioning that YAML 1.2 is designed as a superset of JSON, so in practice a valid JSON file is also a valid YAML file. The aim of this blog post is to introduce you to the format if it’s new to you, and to show a few tricks if you know it already.
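To make the JSON claim concrete, here is a minimal sketch, assuming PyYAML is installed. The JSON text is made up for illustration. It feeds the same string to Python’s json module and to PyYAML and compares the results. PyYAML follows the older YAML 1.1 rules, so some JSON edge cases may load differently; this only shows the common case.

```python
import json

import yaml  # PyYAML, assumed to be installed

# A made-up JSON document, used purely for illustration.
json_text = (
    '{"name": "YAML", "tags": ["a", "b"], '
    '"nested": {"ok": true, "ratio": 1.5, "nothing": null}}'
)

from_json = json.loads(json_text)      # parsed as JSON
from_yaml = yaml.safe_load(json_text)  # the same text parsed as YAML

# For this input both parsers build the same Python structure.
print(from_json == from_yaml)
```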
The YAML 1.2 specification lists a few design goals:
- YAML is easily readable by humans.
- YAML data is portable between programming languages.
- YAML matches the native data structures of agile languages.
- YAML has a consistent model to support generic tools.
- YAML supports one-pass processing.
- YAML is expressive and extensible.
- YAML is easy to implement and use.
The library used to parse YAML in this post is PyYAML. A particular library’s compliance with the specification may vary.
Also, I believe the library Jekyll uses for highlighting snippets (together with the theme used here) does not cope well with some advanced YAML syntax, e.g. the multiline string variations.
Basic Syntax
The main reason behind YAML is data structure representation. There are basically three structures: the sequence, the map (mapping) and the variable (a scalar, in the specification’s terms); the document itself is usually just one big map. As I mentioned previously, valid JSON is valid YAML, but YAML can do much more with far fewer characters. First of all, identifiers don’t have to be enclosed in double quotes. Brackets can also be skipped; indentation is used to express the structure instead.
An identifier ends with “:” followed by at least one whitespace character, after which the formatting determines the structure. If the value is on the same line, the structure created is a variable. If “:” is followed by a newline and the following lines define identifiers that share one level of indentation, deeper than that of the “parent” identifier, the construct is a mapping. If the following lines start with “-”, a sequence is created.
string: value
integer: 1
float: 1.1
dictionary:
  key1: value1
  key2: value2
array:
- one
- two
- three
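To see which Python types these structures turn into, here is a small sketch, assuming PyYAML is installed. It loads the snippet above, with the indentation written out, and prints the type of every top-level value.

```python
import yaml  # PyYAML, assumed to be installed

document = """\
string: value
integer: 1
float: 1.1
dictionary:
  key1: value1
  key2: value2
array:
  - one
  - two
  - three
"""

data = yaml.safe_load(document)

# Variables become str/int/float, the mapping a dict, the sequence a list.
for key, value in data.items():
    print(f"{key}: {type(value).__name__} = {value!r}")
```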
The great thing about each of them is that nesting is allowed. Do you want an array of maps of maps? Piece of cake, just take a look.
array:
  - mapZero:
      key0:
        subkey0: s0
        subkey1: s1
  - mapOne:
      key0:
        subkey0: s0
        subkey1: s1
You can also create a sequence or a map inline, using syntax similar to JSON. This syntax lets you write documents that don’t depend on indentation, although the result is less clear.
array: [ mapZero: {key0: {subkey0: s0, subkey1: s1}}, mapOne: {key0: {subkey0: s0, subkey1: s1}}]
The last two YAML snippets are interchangeable.
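The equivalence of the indented and the inline form can be checked directly. This is a minimal sketch, assuming PyYAML is installed; the variable names block_style and flow_style are mine.

```python
import yaml  # PyYAML, assumed to be installed

# The nested example written with indentation...
block_style = """\
array:
  - mapZero:
      key0:
        subkey0: s0
        subkey1: s1
  - mapOne:
      key0:
        subkey0: s0
        subkey1: s1
"""

# ...and the same data written inline on a single line.
flow_style = (
    "array: [ mapZero: {key0: {subkey0: s0, subkey1: s1}}, "
    "mapOne: {key0: {subkey0: s0, subkey1: s1}}]"
)

block_data = yaml.safe_load(block_style)
flow_data = yaml.safe_load(flow_style)

# Both spellings should load into the same Python structure.
print(block_data == flow_data)
```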
Advanced Syntax
YAML, when looked at in depth, is much more complex than anyone might anticipate after first contact. This section is a brief overview of some less popular but useful features.
Multiline Strings
YAML is supposed to be “easily readable by humans”, yet sometimes a long string of characters has to be stored under a single identifier. Let’s say we want to put 256 characters under the key “long_string”. We can put them all on one line:
long_string: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer ac elit mi. Vestibulum sem orci, placerat sed condimentum id, rutrum imperdiet dolor. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Curabitur nullam.
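It’s readable to a certain degree, but we can do better. For example, we can add double quotes and break the string across several lines; the parser folds each line break into a single space: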
long_string: "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
  Integer ac elit mi. Vestibulum sem orci, placerat sed condimentum id,
  rutrum imperdiet dolor. Vestibulum ante ipsum primis in faucibus orci
  luctus et ultrices posuere cubilia curae; Curabitur nullam."
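Here is a short sketch of what happens to the line breaks in a double-quoted string, assuming PyYAML is installed. I shortened the text so the result is easy to read.

```python
import yaml  # PyYAML, assumed to be installed

document = '''\
long_string: "Lorem ipsum dolor sit amet,
  consectetur adipiscing elit.
  Integer ac elit mi."
'''

data = yaml.safe_load(document)

# Each line break, together with the indentation after it,
# is folded into a single space.
print(repr(data["long_string"]))
```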
YAML has several variants of multiline strings, selected by the character that follows the identifier. The string starts on the following line and has to be indented more deeply than the identifier. You can also state explicitly how many whitespace characters precede each line of the string, counted relative to the parent’s indentation. If unspecified, this defaults to the number of whitespace characters before the first line of the string. The variants go like this:
zero: |2
   This text is interpreted literally
   notice this was indented with 3 spaces with string starting at 2
one: |
  This text is interpreted literally
  new lines are preserved
two: >
  This text is folded
  newline characters are replaced with spaces

  blank lines are replaced with newline characters
tree: |-
  In this case trailing newline is removed
four: |+
  Now any trailing newline character is preserved


# Like ones above
I loaded this with PyYAML, which produced a dictionary. Pretty-printed with the pprint module, this is what I got:
{ 'four': 'Now any trailing newline character is preserved\n\n\n',
  'one': 'This text is interpreted literally\nnew lines are preserved\n',
  'tree': 'In this case trailing newline is removed',
  'two': 'This text is folded newline characters are replaced with spaces\n'
         'blank lines are replaced with newline characters\n',
  'zero': ' This text is interpreted literally\n'
          ' notice this was indented with 3 spaces with string starting at '
          '2\n'}
Remember that by definition maps are unordered, so your mileage may vary. In Python it’s a bit more complex (check the bibliography for Python’s dictionary).
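If you want to experiment with the indicators yourself, a compact sketch like the one below may help, assuming PyYAML is installed. The keys and texts are made up for this example; repr() makes the spaces and newlines visible.

```python
import yaml  # PyYAML, assumed to be installed

document = """\
indented: |2
   three spaces in, indicator says two
literal: |
  line one
  line two
folded: >
  line one
  line two

  after a blank line
strip: |-
  no trailing newline
keep: |+
  trailing newlines kept

"""

data = yaml.safe_load(document)

# Print every string with its whitespace made explicit.
for key, value in data.items():
    print(f"{key}: {value!r}")
```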
Variable interpretation
Variables in YAML can be mapped to different types, like floats, ints etc. We can also make the YAML parser convert between different representations of a number. It’s a nice little feature that makes some variables easier to define, for example:
sequence:
- 0xFFFFFF
- 0xFF0000
- 0xFFFFFF
turns into:
{'sequence': [16777215, 16711680, 16777215]}
Some other formats for numbers are listed below:
sequence:
- 010 #integer interpreted as Base8
- 0x10 #integer interpreted as hexadecimal
- 10 #integer
- 10.2 # floating point
- 10.2.1 # string
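Using PyYAML, this is what was loaded (PyYAML follows the YAML 1.1 rules, where a leading 0 marks an octal number; YAML 1.2 writes octal as 0o10):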
{'sequence': [8, 16, 10, 10.2, '10.2.1']}
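The conversion can surprise you when a value only looks like a number. The sketch below, assuming PyYAML is installed, repeats the list above and adds one quoted entry (my addition) to show that quoting keeps the value a string.

```python
import yaml  # PyYAML, assumed to be installed

document = """\
sequence:
  - 010      # octal under the YAML 1.1 rules PyYAML follows
  - 0x10     # hexadecimal
  - 10       # decimal integer
  - 10.2     # floating point
  - 10.2.1   # not a number, so it stays a string
  - "010"    # quoted, so it stays a string
"""

values = yaml.safe_load(document)["sequence"]

# Show each loaded value next to its Python type.
for value in values:
    print(f"{value!r} ({type(value).__name__})")
```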
Multiple documents in one file
A single file can hold several documents, separated by a line containing “---”:
#document zero
document_number: 0
just_array:
  - one:
      q: 1
  - two:
      w: 2
---
#document one
document_number: 1
strange_dictionary:
  k:
    - j
    - m
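A file like this cannot be read with a single-document load call. A minimal sketch, assuming PyYAML is installed, uses yaml.safe_load_all, which yields one Python object per document in the stream.

```python
import yaml  # PyYAML, assumed to be installed

stream = """\
# document zero
document_number: 0
just_array:
  - one:
      q: 1
  - two:
      w: 2
---
# document one
document_number: 1
strange_dictionary:
  k:
    - j
    - m
"""

# safe_load_all returns a generator; list() reads all documents at once.
documents = list(yaml.safe_load_all(stream))

for document in documents:
    print(document["document_number"], sorted(document))
```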
Anchors
You may have heard of the DRY principle, that is “Don’t Repeat Yourself”. YAML has a convenient feature that lets you apply this principle, namely anchors and aliases. Anchors behave like variables inside the document itself. An anchor is created using the “&anchor_name” syntax. Its value can later be referenced using the “*anchor_name” syntax.
variable: &reference value
different_variable: *reference
turns into:
{'variable': 'value', 'different_variable': 'value'}
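Anchors work for whole structures too. In the sketch below, which assumes PyYAML is installed, the shared_list and same_list keys are my own addition. As far as I can tell, PyYAML hands back the very same Python object for an anchored list and its alias, so changing one changes the other.

```python
import yaml  # PyYAML, assumed to be installed

document = """\
variable: &reference value
different_variable: *reference
shared_list: &items [1, 2, 3]
same_list: *items
"""

data = yaml.safe_load(document)

print(data["different_variable"])

# The alias does not copy the list; both keys point at one object.
print(data["same_list"] is data["shared_list"])
```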
This is already cool, but I’m not done yet. With more complex structures you can actually merge them together instead of overriding them; note that merging applies to maps only. For this, the special syntax “<<: *anchor_name” is used. It’s worth mentioning that it was added in YAML 1.1.
younger_sister: &one
  first name: Serena
  last name: Williams
older_sister:
  << : *one
  first name: Venus
After parsing it will look like:
{ 'older_sister': {'first name': 'Venus', 'last name': 'Williams'},
'younger_sister': {'first name': 'Serena', 'last name': 'Williams'}
}
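Actually, you can specify a sequence of references after the “<<:” operator, but an inline sequence must be used ([item0, item1]). When the same key appears in more than one of them, the earlier reference in the sequence wins, and keys written directly in the map override all merged ones. In theory an inline dictionary can also be given after the merge operator, like { q: 0, w: 1 }. As far as I know, “inline dictionary” and “inline sequence” aren’t the official YAML terms (the specification calls them flow mappings and flow sequences), but I hope you get the idea.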
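The precedence rules for merging several maps are easy to check with PyYAML. This is a small sketch, assuming PyYAML is installed; the base, override and merged maps are invented for the example.

```python
import yaml  # PyYAML, assumed to be installed

document = """\
base: &base
  color: red
  size: small
override: &override
  color: blue
  weight: heavy
merged:
  <<: [*base, *override]
  size: large
"""

merged = yaml.safe_load(document)["merged"]

# color  comes from *base     (earlier entry in the merge list wins)
# weight comes from *override (only defined there)
# size   comes from the map itself (explicit keys beat merged ones)
print(merged)
```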