Introduction to Colander

August 12, 2020

Despite being derived from Java Script, the JSON data format has quickly spread to other languages and python is no exception. Commonly referred to as dictionary, the simple key-value mapping offers a wide range of use cases: data storage, configuration and data transfer !

JSON serialization

To make sure that the information contained in a dictionary corresponds exactly to the JSON open standard file format, we’ll need to serialise it. This operation converts each value of the dictionary as a json data type and place the whole object into a string. This way it can be understood by language supporting the JSON format.

In python, you would simply do

import json
message_serialized = json.dumps(message)

message = {
    "name": "Paul O'Brady",
    "email": "paul.obrady@example.com",
    "gender": "M",
    "date_of_birth": "1985-01-01",
    "coffee_club_member": True,
    "order": {
        "order_id": 1111,
        "items": [
            {
                "product_id": 12345,
                "quantity": 1,
                "unit_price": 4.65,
            },
            {
                "product_id": 54321,
                "quantity": 2,
                "unit_price": 8.19,
            },
            {
                "product_id": 112233,
                "quantity": 1,
                "unit_price": None,
            },
        ],
    }
}

'message_serialized = {
    "name": "Paul O\'Brady", 
    "email": "paul.obrady@example.com", 
    "gender": "M", 
    "date_of_birth": "1985-01-01", 
    "coffee_club_member": true, 
    "order": {
        "order_id": 1111, 
        items": [
            {
                "product_id": 12345, 
                "quantity": 1, 
                "unit_price": 4.65
            }, 
            {
                "product_id": 54321, 
                "quantity": 2, 
                "unit_price": 8.19
            }, 
            {
                "product_id": 112233, 
                "quantity": 1, 
                "unit_price": null
            }
        ]
    }
}'

Few things worth noting here:

the apostrophe in the name is automatically escaped
the boolean coffee_club_member switched from True (python) to true (js)
the last unit_price switch from None (python) to null (js)
the whole object is now a string

Similarly, to deserialized a json string into a python dictionary and get back what we had originally:

message_deserialized = json.loads(message_serialized)
message_deserialized

Regardless of what you want to do with this serialized message (save it in a database, send it via a Pub/Sub messaging service…), it is often a good idea to enusre that a consistent schema is in place.

This JSON serialization is used to format python dictionary into JSON standard format, but it doesn’t define nor validate the data contained inside the dictionary.

In our case, we serialise a missing unit_price, and we get back a missing unit_price after deserialising, which could be an issue. Since it is best to catch the errors as early as they appear, let’s see how colander can help avoiding that.

Colander serialization

Colander is a better serialization library as it checks for consistency of the JSON schema by performing a key by key validation and then converts each value into a string. Similarly, colander deserialization will check each serialized string and and convert it back to the defined type.

We start by defining each structure by subclassing MappingSchema. Each key inside the structure is a SchemaNode which can be of different SchemaTypes (available types here). On top of defining the schema, we have the possibility to add validators and preparers. For lists, we subclass SequenceSchema. Additionally, if a key is missing, we can give it a default value during serialization by specifying default or a default value during deserialization with the missing argument.

Example image

import colander

class Item(colander.MappingSchema):
    product_id = colander.SchemaNode(colander.Int())
    quantity = colander.SchemaNode(colander.Int())
    unit_price = colander.SchemaNode(colander.Float(), missing=None)

class Items(colander.SequenceSchema):
    item = Item()
    
class Order(colander.MappingSchema):
    order_id = colander.SchemaNode(colander.Int())
    items = Items()
    
class Message(colander.MappingSchema):
    name = colander.SchemaNode(colander.Str())
    email = colander.SchemaNode(colander.Str(), validator=colander.Email())    
    gender = colander.SchemaNode(colander.Str(), validator=colander.OneOf(['M', 'F', '']), default='', missing='')
    date_of_birth = colander.SchemaNode(colander.Date(format='%d %b %Y'))
    coffee_club_member = colander.SchemaNode(colander.Bool(), default=False, missing=False)
    percent_discount = colander.SchemaNode(colander.Float(), validator=colander.Range(min=0, max=1), missing=0)
    order = Order()

json_serialized = Message().serialize(message)
json_serialized

>>> {'name': "Paul O'Brady",
 'email': 'paul.obrady@example.com',
 'gender': 'M',
 'date_of_birth': '01 Jan 1990',
 'coffee_club_member': 'true',
 'percent_discount': <colander.null>,
 'order': {'order_id': '1111',
  'items': [{'product_id': '12345', 'quantity': '1', 'unit_price': '4.65'},
   {'product_id': '54321', 'quantity': '2', 'unit_price': '8.19'},
   {'product_id': '112233', 'quantity': '1', 'unit_price': <colander.null>}]}}

json_deserialized = Message().deserialize(json_serialized)
json_deserialized

>>> {'name': "Paul O'Brady",
 'email': 'paul.obrady@example.com',
 'gender': 'M',
 'date_of_birth': datetime.date(1990, 1, 1),
 'coffee_club_member': True,
 'percent_discount': 0,
 'order': {'order_id': 1111,
  'items': [{'product_id': 12345, 'quantity': 1, 'unit_price': 4.65},
   {'product_id': 54321, 'quantity': 2, 'unit_price': 8.19},
   {'product_id': 112233, 'quantity': 1, 'unit_price': None}]}}

Here are few cases allowing you to catch a potential error early:

if a key cannot be validated, the serialization will fail
if a key is missing and does not have a default attribute, it is serialized as colander.null,
if a key is missing and does not have a missing attribute, it is considered as required and the deserialization will fail

Advanced colander features

Let’s try out a scenario where your JSON structure is more complex, for example it contains your model predictions.

import numpy as np

results_json = {
    'proba': np.array([0.56, 0.83, 0.23, 0.76, 0.92]),
    'categ': ['A', 'A', 'B', 'B', 'C']
}

Custom `SchemaType`

Same as before, we define the nodes, sequence and mapping

# Define each individual proba and category
class Probability(colander.SchemaNode):
    schema_type = colander.Float
    validator = colander.Range(min=0.00, max=1.00)

class Category(colander.SchemaNode):
    schema_type = colander.Str
    validator = validator=colander.OneOf(['A', 'B', 'C']) 

# Define the sequences
class Probabilities(colander.SequenceSchema):
    proba = Probability()

class Categories(colander.SequenceSchema):
    proba = Category()
    
# Define the final MappingSchema
class ModelResults(colander.MappingSchema):
    proba = Probabilities()
    categ = Categories()

# serialized
serialized_results = ModelResults().serialize(json_predictions)
serialized_results

{'proba': ['0.56', '0.83', '0.23', '0.76', '0.92'], 'categ': ['A', 'A', 'B', 'B', 'C']}

# deserialized
deserialized_results = ModelResults().deserialize(serialized_results)
deserialized_results

{'proba': [0.56, 0.83, 0.23, 0.76, 0.92], 'categ': ['A', 'A', 'B', 'B', 'C']}

Note that we have lost our original numpy array ! This is because Probabilities subclassed SequenceSchema and the default sequence data type is a list. In order to get back a numpy array, we’ll need to define our own SchemaType with its serialize and deserialize method for numpy arrays.

from colander import SchemaType, Invalid, null

class NumpyArray(SchemaType):
    def serialize(self, node, cstruct):
        if cstruct is null:
            return null
        if not isinstance(cstruct, np.ndarray):
            raise Invalid(node, '%r is not a np.array' % cstruct)
        return cstruct.tolist()
    def deserialize(self, node, appstruct):
        if appstruct is null:
            return null
        if not isinstance(appstruct, list):
            raise Invalid(node, '%r is not a list' % appstruct)
        return np.array([x for x in appstruct], dtype=float)

# Define the SchemaMapping
class ModelResults(colander.MappingSchema):
    proba = Probabilities(NumpyArray())
    categ = Categories()
    
# deserialized
deserialized_results = ModelResults().deserialize(serialized_results)
deserialized_results

{'proba': array([0.56, 0.83, 0.23, 0.76, 0.92]),
 'categ': ['A', 'A', 'B', 'B', 'C']}

Deferred function and schema binding

If we need an extra indicator in our deserialized JSON which is not always provided pre serialization. This indicator should default to False when not provided, we can do:

class Indicators(colander.SequenceSchema):
    ind = colander.SchemaNode(colander.Bool(), default=False, missing=False)
    missing = [False, False, False, False]
    
class ModelResults(colander.MappingSchema):
    proba = Probabilities(NumpyArray())
    categ = Categories()
    indic = Indicators()

# deserialized
deserialized_results = ModelResults().deserialize(serialized_results)
deserialized_results

{'proba': array([0.56, 0.83, 0.23, 0.76, 0.92]),
 'categ': ['A', 'A', 'B', 'B', 'C'],
 'indic': [False, False, False, False]}

It seems to works, however we’ve just accidently hardcoded the length of our input proba and categ lists to 5, the default indic value will always have a length of 5.

A way around this issue is to:

define a deferred function that we’ll use in place of the hardcoded missing values
bind the schema passing the relevent parameter before deserialization

@colander.deferred
def missing_indicators(node, kw):
    return [False] * kw.get('n_predictions')

class Indicators(colander.SequenceSchema):
    ind = colander.SchemaNode(colander.Bool(), default=False, missing=False)
    missing = missing_indicators

class ModelResults(colander.MappingSchema):
    proba = Probabilities(NumpyArray())
    categ = Categories()
    indic = Indicators()
    
# deserialized
deserialized_results = ModelResults().bind(n_predictions=len(results_json['proba'])).deserialize(serialized_results)
deserialized_results

{'proba': array([0.56, 0.83, 0.23, 0.76, 0.92]),
 'categ': ['A', 'A', 'B', 'B', 'C'],
 'indic': [False, False, False, False, False]}