Thinking About ETLs

My primary focus for the last year or so has been writing ETLs at work. It is an interesting problem because on some level it feels extremely easy, while in reality, it is a problem that is very difficult to abstract.


The essence of an ETL, beyond the obvious “extract, transform, load”, is the query. In the case of a database, the query is typically the SELECT statement, but it usually is more than that. It often includes the format of the results. You might need to chunk the data using multiple queries. There might be columns you skip or columns you create.

In non-database ETLs, the extraction still ends up looking very much like a query. You often still need to find boundaries for what you are extracting. For example, if you had a bunch of date-stamped log files, doing a find /var/logs -name '2014*.log.gz' could still be considered a query.

A query is important because ETLs are inherently fragile. ETLs are required because the standard interface to some data is not available due to some constraints. By bypassing standard, and more importantly supported, interfaces, you are on your own when it comes to ensuring the ETL runs. The database dump you are running might time out. The machine you are reading files from may reboot. The REST API node you are hitting may get a new version and restart. There are always good reasons for your ETL process to fail. The query makes it possible to go back and try again, limiting the retry to the specific subset of data you are missing.
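One way to picture the query as a retryable boundary is to split an extract into day-sized chunks and remember which ones failed. The sketch below is only an illustration: day_chunks, run_chunks and the extract callable are hypothetical names, not part of any particular ETL tool.

```python
from datetime import date, timedelta


def day_chunks(start, end):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) one day at a time."""
    current = start
    while current < end:
        yield current, current + timedelta(days=1)
        current += timedelta(days=1)


def run_chunks(extract, chunks):
    """Run extract(start, end) for each chunk and return the chunks that failed."""
    failed = []
    for chunk_start, chunk_end in chunks:
        try:
            extract(chunk_start, chunk_end)
        except Exception:
            # Keep the boundary so only this subset is retried later.
            failed.append((chunk_start, chunk_end))
    return failed
```

Calling run_chunks again with the returned list retries only the missing days rather than the whole extract.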


ETLs often are considered part of some analytics pipeline. The goal of an ETL is typically to take some data from some system and transform it to a format that can be loaded into another system for analysis. A better principle is to consider storing the intermediaries such that transformation is focused on a specific generalized format, rather than a specific system such as a database.

This is much harder than it sounds.

The key to providing generic access to data is a standard schema for the data. The “shape” of the data needs to be described in a fashion that is actionable by the transformation process that loads the data into the analytics system.

The schema is more than a type system. Some data is heavy with metadata while other data is extremely consistent. The schema should provide notation for both extremes.

The schema also should provide hints on how to convert the data. The most important aspect of the schema is to communicate to the loading system how to transform and / or import the data. One system might happily accept a string with 2014-02-15 as a date if you specify it is a date, while others may need something more explicit. The schema should communicate that the data is a date string with a specific format that the loading system can use accordingly.
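As a rough sketch of what such hints could look like, the schema below attaches a type and, for dates, a format string to each column. The SCHEMA layout and the convert and load_row helpers are assumptions made for illustration, not an established schema format.

```python
from datetime import datetime

# Hypothetical schema: each column names a type and, for dates, a format hint.
SCHEMA = {
    "id": {"type": "integer"},
    "signup": {"type": "date", "format": "%Y-%m-%d"},
    "email": {"type": "string"},
}


def convert(value, column):
    """Convert a raw string using the hints stored for one column."""
    if column["type"] == "integer":
        return int(value)
    if column["type"] == "date":
        return datetime.strptime(value, column["format"]).date()
    return value


def load_row(raw_row, schema=SCHEMA):
    """Apply the schema to a dict of raw string values."""
    return {name: convert(raw_row[name], column) for name, column in schema.items()}
```

A loading system that needs something more explicit than a bare string can read the format hint and parse the value itself.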

The schema can be difficult to create. Metadata might need a suite of queries to other systems in order to fill in the data. There might need to be calculations that have to happen that the querying system doesn’t support. In these cases you are not just transforming the data, but processing it.

I admit I just made an arbitrary distinction and definition of “processing”, so let me explain.

Processing Data

In a transformation you take the data you have and change it. If I have a URL, I might transform it into JSON that looks like {‘url’: $URL}. Processing, on the other hand, uses the data to create new data. For example, if I have a RESTful resource, I might crawl it to create a single view of some tree of objects. The important difference is that we are creating new information by using other resources not found in the original query data.

The processing of data can be expensive. You might have to make many requests for every row of output in a database table. The calculations, while small, might be on a huge dataset. Whatever processing needs to happen in order to get your data to a generically usable state, it is a difficult problem to abstract over a wide breadth of data.

While there is no silver bullet to processing data, there are tactics that can be used to process data reliably and reasonably fast. The key to abstracting processing is defining the unit of work.

A Unit of Work

“Unit of Work” is probably a loaded term, so once again, I’ll define what I mean here.

When processing data in an ETL, the Unit of Work is the combination of:

  • an atomic record
  • an atomic algorithm
  • the ability to run the implementation

If all this sounds very map/reducey it is because it is! The difference is that in an ETL you don’t have the same reliability you’d have with something like Hadoop. There is no magical distributed file system that has your data ready to go on a cluster designed to run code explicitly written to support your map/reduce platform.

The key difference with processing data in ETLs vs. some system like Hadoop is the implementation and execution of the algorithm. The implementation includes:

  • some command to run on the atomic record
  • the information necessary to set up an environment for that script to run
  • an automated way to input the atomic record to the command
  • a guarantee of reliable execution (or failure)

If we look at a system like Hadoop, and this applies to most map/reduce platforms that I’ve seen, there is an explicit step that takes data from some system and adds it to the HDFS store. There is another step that installs code, specifically written for Hadoop, onto the cluster. This code could be using Hadoop streaming or actual Java, but in either case, the installation is done via some deployment.

In other words, there is an unsaid step that Extracts data from some system, Transforms it for Hadoop and Loads it into HDFS. The processing in this case is getting the data from whatever the source system is into the analytics system, therefore, the requirements are slightly different.

We start off with a command. The command is simply an executable script like you would see in Hadoop streaming. No real difference here. Each line passed to the command contains the atomic record as usual.
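A command in this style can be very small. The sketch below assumes each input line is a URL and emits one JSON document per line; process_lines and main are hypothetical names chosen for the example.

```python
import json
import sys


def process_lines(lines):
    """Treat each input line as one atomic record and yield one output line."""
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        yield json.dumps({"url": url})


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write results to stdout, one per line."""
    for output in process_lines(stdin):
        stdout.write(output + "\n")
```

A real script would call main() at the bottom of the file so it can sit in a pipeline. Keeping the logic in process_lines makes the step easy to test without touching real file handles.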

Before we can run that command, we need to have an environment configured. In Hadoop, you’ve configured your cluster and deployed your code to the nodes. In an ETL system, due to the fragility and simpler processing requirements (no one should write a SQL-like system on top of an ETL framework), we want to set up an environment every time the command runs. By setting up this environment every time the command runs, you allow a clear path for development of your ETL steps. Making the environment creation part of the development process ensures the deployment is tested alongside the actual command(s) your ETL uses.

Once we have the command and an environment to run it in, we need a way to get our atomic record to the command for actual processing. In Hadoop streaming, we use everyone’s favorite file handle, stdin. In an ETL system, while the command may still use stdin, the way the data enters the ETL system doesn’t necessarily have a distributed file system to use. Data might be downloaded from S3, some RESTful service, and / or some queue system. It is important that you have a clear automated way to get data to an ETL processing node.

Finally, this processing must be reliable. ETLs are low priority. An ETL should not lock your production database for an hour in order to dump the data. Instead ETLs must quietly grab the data in a way that doesn’t add contention to the running systems. After all, you are extracting the data because a query on the production server will bog it down when it needs to be serving real time requests. An ETL system needs to reliably stop and start as necessary to get the data necessary and avoid adding more contention to an already resource intensive service.

Loading Data

Loading data from an ETL system requires analyzing the schema in order to construct the understanding between the analytics system and the data. In order to make this as flexible as possible, it is important that the schema use the source of data to add as much metadata as possible. If the data pulls from a Postgres table, the schema should ideally include most of the schema information. If that data must be loaded into some other RDBMS, you have all you need to safely read the data into the system.

Development and Maintenance

ETLs are always going to be changing. New analytics systems will be used and new sources of data will be created. As the source system constraints change, so do the constraints of an ETL system, again, with the ETL system being the lowest priority.

Since we can rely on ETLs changing and breaking, it is critical to raise awareness of maintenance within the system.

The key to creating a maintainable system is to build up from small tools. The reason is that as you create small abstractions at a low level, you can reuse them easily. The trade-off is that in the short term, more code is needed to accomplish common tasks. Over time, you find patterns specific to your organization’s requirements that allow repetitive tasks to be abstracted into tools.

The converse to building up an ETL system based on small tools is to use a pre-built execution system. Unfortunately, pre-built ETL systems have been generalized for common tasks. As we’ve said earlier, ETLs are often changing and require more attention than a typical distributed system. The result is that using a pre-built ETL environment often means creating ETLs that allow the pre-built ETL system to do its work!


Our goal for our ETLs is to make them extremely easy to test. There are many facets to testing ETLs such as unit testing within an actual package. The testing that is most critical for development and maintenance is simply being able to quickly run and test a single step of an ETL.

For example, let’s say we have an ETL that dumps a table, reformats some rows and creates a 10GB gzipped CSV file. I only mention the size here as it implies that it takes too long to run over the entire set of data every time while testing. The file will then be uploaded to S3 and a central data warehouse system will be notified. Here are some steps that the ETL might perform:

  1. Dump the table
  2. Create a schema
  3. Process the rows
  4. Gzip the output
  5. Upload the data
  6. Update the warehouse

Each of these steps should be runnable:

  • locally on a fake or testing database
  • locally, using a production database
  • remotely using a production database and testing system (test bucket and test warehouse)
  • remotely using the production database and production systems

By “runnable”, I mean that an ETL developer can run a command with a specific config and watch the output for issues.
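To make "run one step with a specific config" concrete, here is a minimal sketch. The config names, the connection strings and the step registry are all hypothetical; a real framework would load them from files and do actual work in each step.

```python
# Hypothetical named configs; the keys and values are placeholders.
CONFIGS = {
    "local-test": {"database": "sqlite:///test.db", "bucket": "test-bucket"},
    "remote-prod": {"database": "postgres://prod-db/app", "bucket": "warehouse"},
}

STEPS = {}


def step(func):
    """Register a function as a named, individually runnable ETL step."""
    STEPS[func.__name__] = func
    return func


@step
def dump_table(config):
    return "dumping from %s" % config["database"]


@step
def upload(config):
    return "uploading to %s" % config["bucket"]


def run_step(step_name, config_name):
    """Run exactly one step against one named config."""
    return STEPS[step_name](CONFIGS[config_name])
```

With this shape, run_step("dump_table", "local-test") exercises a single step locally, and switching the config name is the only change needed to point the same step at production.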

These steps are all pretty basic, but the goal with an ETL system is to abstract the pieces that can be used across all ETLs in a way that is optimal for your system. For example, if your system is consistently streaming, your ETL framework might allow you to chain file handles together:

$ dump table | process rows | gzip | upload

Another option might be that each step produces a file that is used by the next step.
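The streaming variant can also be sketched in Python, with generators standing in for the shell pipes. The function names below are hypothetical and the "table" is just an in-memory list, so this is an illustration of the chaining idea rather than a working ETL.

```python
import gzip
import io


def dump_table(rows):
    """Stand-in for a table dump: yield rows one at a time."""
    for row in rows:
        yield row


def process_rows(rows):
    """Reformat each row as a CSV line."""
    for row in rows:
        yield ",".join(str(value) for value in row) + "\n"


def gzip_lines(lines):
    """Compress the lines and return the gzipped bytes."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as archive:
        for line in lines:
            archive.write(line.encode("utf-8"))
    return buffer.getvalue()


def run(rows):
    # Equivalent in spirit to: dump table | process rows | gzip
    return gzip_lines(process_rows(dump_table(rows)))
```

Because each stage pulls rows from the previous one on demand, only one row is held in memory at a time until the compression step.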

Both tactics are valid and can be optimized over time to help distill ETLs to the minimal, changing requirements. In the above example, the database table dump could be abstracted to take the schema and some database settings to dump any table in your databases. The gzip, upload and data warehouse interactions can be broken out into a library and/or command line apps. Each of these optimizations is simple enough to be included in an ETL development framework without forcing a user to jump through a ton of hoops when a new data store needs to be considered.

An ETL Framework

Making it easy to develop ETLs means a framework. We want to create a Ruby on Rails for writing ETLs that makes it easy enough to get the easy stuff done and powerful enough to deal with the corner cases. The framework revolves around the schema and the APIs to the different systems and libraries that provide language specific APIs.

At some level the framework needs to allow the introduction of other languages. My only suggestion here is that other languages are abstracted through a command line layer. The ETL framework can eventually call a command that could be written in whatever language the developer wants to use. ETLs are typically used to export data to a system that is reasonably technical. Someone using this data most likely has some knowledge of some language such as R, Julia or maybe JavaScript. It is these technically savvy data wranglers we want to empower with the ETL framework in order to allow them to solve small ETL issues themselves and provide reliability where the system can be flaky.
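A command line layer of this kind could be as thin as the sketch below, which sends records to any executable over stdin and fails loudly on a non-zero exit. run_command is a hypothetical helper, and the stand-in command happens to be Python only so the example is self-contained.

```python
import subprocess
import sys


def run_command(command, records):
    """Send one record per line to the command's stdin and return its output lines."""
    payload = "".join(record + "\n" for record in records)
    result = subprocess.run(
        command,
        input=payload,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        # Fail loudly so the caller can retry or report the error.
        raise RuntimeError("command failed: %s" % result.stderr.strip())
    return result.stdout.splitlines()


# Stand-in command: any executable in any language would work here.
UPPERCASE = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
```

The framework only needs to know the command line and the exit code, so an R or Julia script can be swapped in without changing the caller.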

Open Questions

The system I’ve described is what I’m working on. While I’m confident the design goals are reasonable, the implementation is going to be difficult. Specifically, the task of generically supporting many languages is challenging because each language has its own ecosystem and environment. Python is an easy language for this task because it is trivial to connect to an Ubuntu host and have a good deal of the ecosystem in place. Other languages, such as R, probably require some coordination with the cluster provisioning system to make sure base requirements are available. That said, it is unclear if other languages provide small environments like virtualenvs do. Obviously typical scripting languages like Ruby and JavaScript have support for an application local environment, but I’m doubtful R or Julia would have the same facilities.

Another option would be to use a formal build / deployment pattern where a container is built. This answers many of the platform questions, but it brings up other questions such as how to make this available in the ETL Framework. It is ideal if an ETL author can simply call a command to test. If the author needs to build a container locally then I suspect that might be too large a requirement as each platform is going to be different. Obviously, we could introduce a build host to handle the build steps, but that makes it much harder for someone to feel confident the script they wrote will run in production.

The challenge arises because our hope is to empower semi-technical ETL authors. If we compare this goal to people who can write HTML/CSS vs. programmers, it clarifies the requirements. A user learning to write HTML/CSS only has to open the file in a web browser to test it. If the page looks correct, they can be confident that it will work when they deploy it. The goal with the ETL framework and APIs is that the system can provide a similar workflow and ease of use.

Wrapping Up

I’ve written a LOT of ETL code over the past year. Much of what I propose above reflects my experiences. It also reflects the server environment in which these ETLs run as well as the organizational environment. ETLs are low priority code, by nature, that can be used to build first class products. Systems that require a lot of sysadmin time, server resources or have too specific an API may still be helpful moving data around, but they will fall short as systems evolve. My goal has been to create a system that evolves with the data in the organization and empowers a large number of users to distribute the task of developing and maintaining ETLs.

Dadd, ErrorEmail and CacheControl Releases

I’ve written a couple of new bits of code that seemed like they could be helpful to others.


Dadd (pronounced Daddy) is a tool to help administer daemons.

Most deployment systems are based on the idea of long running processes. You want to release a new version of some service. You build a package, upload it somewhere and tell your package manager to grab it. Then you tell your process manager to restart it to get the new code.

Dadd works differently. Dadd lets you define a short spec that includes the process you want to run. A dadd worker then will use that spec to download any necessary files, create a temporary directory to run in and start the process. When the process ends, assuming everything went well, it will clean up the temp directory. If there was an error, it will upload the logs to the master and send an email.
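The run-in-a-temp-directory pattern described here can be sketched with the standard library. This is not Dadd's actual implementation: run_spec and the spec layout are hypothetical, and downloading files, uploading logs and sending email are left out.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_spec(spec):
    """Run spec['cmd'] in a fresh temp directory.

    Returns (returncode, workdir); workdir is None when it was cleaned up.
    """
    workdir = tempfile.mkdtemp(prefix="job-")
    log_path = os.path.join(workdir, "output.log")
    with open(log_path, "w") as log:
        returncode = subprocess.call(
            spec["cmd"], cwd=workdir, stdout=log, stderr=subprocess.STDOUT
        )
    if returncode == 0:
        shutil.rmtree(workdir)  # success: nothing worth keeping
        return returncode, None
    # Failure: keep the directory so the log can be uploaded or emailed.
    return returncode, workdir
```

Keeping the directory only on failure means successful runs leave nothing behind, while failed runs preserve exactly the evidence needed to debug them.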

Where this sort of system comes in handy is when you have scripts that take a while to run and that shouldn’t be killed when new code is released. For example, at work I manage a ton of ETL processes to get our data into a data warehouse we’ve written. These ETL processes are triggered with Celery tasks, but they typically will ssh into a specific host, create a virtualenv, install some dependencies, and copy files before running a daemon and disconnecting. Dadd makes this kind of processing more automatic, since it can run these processes on any host in our cluster. Also, because the dadd worker builds the environment, it means we can run a custom script without having to go through the process of a release. This is extremely helpful for running backfills or custom updates to migrate old data.

I have some ideas for Dadd such as incorporating a more involved build system and possibly using lxc containers to run the code. Another inspiration for Dadd is setting up nodes in a cluster. Oftentimes it would be really easy to just install a couple of Python packages, but most solutions are either too manual or require a specific image to use things like chef, puppet, etc. With Dadd, you could pretty easily write a script to install and run it on a node and then let it do the rest regarding setting up an environment and running some code.

But, for the moment, if you have code you run by copying some files, Dadd works really well.


ErrorEmail was written specifically for Dadd. When you have a script to run and you want a nice traceback email when things fail, give ErrorEmail a try. It doesn’t do any sort of rate limiting and the server config is extremely basic, but sometimes you don’t want to install a bunch of packages just to send an email on an error.
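The general idea of a traceback email can be sketched with only the standard library. The code below is not ErrorEmail's API: build_error_email and run_with_email are hypothetical helpers, and the send argument is any callable that delivers a message.

```python
import traceback
from email.mime.text import MIMEText


def build_error_email(exc, sender, recipient):
    """Return a plain-text message whose body is the formatted traceback."""
    body = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    message = MIMEText(body)
    message["Subject"] = "Script failed: %s" % type(exc).__name__
    message["From"] = sender
    message["To"] = recipient
    return message


def run_with_email(func, sender, recipient, send):
    """Call func(); on failure hand an error email to send() and re-raise."""
    try:
        return func()
    except Exception as exc:
        send(build_error_email(exc, sender, recipient))
        raise
```

In practice send could be the send_message method of an smtplib.SMTP connection. Passing it in as an argument keeps the sketch testable without a mail server.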

When you can’t install Django or some other framework for an application, you can still get nice error emails with ErrorEmail.


The CacheControl 0.10.6 release includes support for calling close on the cache implementation. This is helpful when you are using a cache via some client (e.g. Redis) and that client needs to safely close the connection.

Ugly Attributes

At some point in my programming career I recognized that Object Oriented Programming is not all it’s cracked up to be. It can be a powerful tool, especially in a statically typed language, but in the grand scheme of managing complexity, it often falls short of the design ideals that we were taught in school. One area where this becomes evident is object attributes.

Attributes are just variables that are “attached” to an object. This simplicity, unfortunately, makes attributes require a good deal more complexity to manage in a system. The reason is that languages do not provide any tools to respect the perceived boundaries that an attribute appears to provide.

Let’s look at a simple example.

class Person(object):

    def __init__(self, age):
        self.age = age

We have a simple Person object. We want to be able to access the person’s age by way of an attribute. The first change we’ll want to make is to make this attribute a property.

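from datetime import datetime

class Person(object):
    def __init__(self, year, month, day):
        self.year = year
        self.month = month
        self.day = day

    @property
    def age(self):
        # Approximate age in whole years from the number of elapsed days.
        age = datetime.now() - datetime(self.year, self.month, self.day)
        return age.days // 365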

So far, this feels pretty simple. But let’s get a little more realistic and presume that this Person is not a naive object but one that talks to a RESTful service in order to get its values.

A Quick Side Note

Most of the time you’d see a database and an ORM for this sort of code. If you are using Django or SQLAlchemy (and I’m sure other ORMs are the same) you’d see something like:

user = User.query.get(id)

You might have a nifty function on your model that calculates the age. That is, until you realize you stored your data in a non-timezone aware date field and now that your company has started supporting Europe, some folks are complaining that they are turning 30 a day earlier than they expected...

The point is that ORMs do an interesting thing that is your only logical choice if you want to ensure your attribute access is consistent with the database. ORMs MUST create new instances for each query and provide a SYNC method or function to ensure they are updated. Sure, they might have an eager commit mode or something, but Stack Overflow will most likely provide plenty of examples where this falls down.

I’d like to keep this reality in mind moving forward as it presents a fact of life when working with objects that is important to understand as your program gets more complex.

Back to Our Person

So, we want to make this Person object use a RESTful service as our database. Let’s change how we load the data.

class Person(ServiceModel):
    # We inherit from some ServiceModel that has the machinery to
    # grab our data from our service.

    @classmethod
    def by_id(cls, id):
        # conn is the service connection, assumed to be provided by
        # the ServiceModel machinery.
        doc = conn.get('people', id=id).pop()
        return cls(**doc)

    @property
    def age(self):
        age = datetime.now() - datetime(self.year, self.month, self.day)
        return age.days // 365

    # This would probably be implemented in the ServiceModel, but
    # I'll add it here for clarity.
    def __getattr__(self, name):
        if name in self.doc:
            return self.doc[name]
        raise AttributeError('%s is not in the resource.' % name)

Now assuming we get a document that has a year, month, day, our age function would still work.

So far, this all feels pretty reasonable. But what happens when things change? Fortunately in the age use case, people rarely change their birth date. But, unfortunately, we do have pesky time zones that we didn’t want to think about when we had 100 users and everyone lived on the west coast. The “least viable product” typically doesn’t afford thinking ahead that far, so these are issues you’ll need to deal with after you have a lot of code.

Also, the whole point of all this work has been to support an attribute on an object. We haven’t sped anything up. These are not new features. We haven’t even done something clever with metaclasses or generators! The reality is that you’ve refactored your code 4 or 5 times to support a single call in a template.

{{ person.age }}

Let’s take a step back for a bit.

Taking a Step Back

Do not feel guilty for going down this rabbit hole. I’ve taken the trip hundreds of times! But maybe it is time to reconsider how we think about object oriented design.

When we think back to when we were just learning OO there was a zoo. In this zoo we had the mythical Animal class. We’d have new animals show up at the zoo. We’d get a Lion, Tiger and Bear, and they would all need to eat. This modeling feels so right it can’t be wrong! And in many ways it isn’t.

If we take a step back, there might be a better way.

Let’s first acknowledge that our Animal does need to eat. But let’s really think about what it means to our zoo. The Animals will eat, but so will the Visitors. I’m sure the Employee would like to have some food now and then as well. The reason we want to know about all this sustenance is because we need to Order food and track its cost. If we reconsider this in the code, what if, and this is a big what if, we didn’t make eat a method on some class? What if we passed our object to our eat function?
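A minimal sketch of that idea follows. The Person and Lion classes, the eat signature and the orders list are all hypothetical, chosen only to show one function serving unrelated objects.

```python
class Person(object):
    def __init__(self, name):
        self.name = name


class Lion(object):
    def __init__(self, name):
        self.name = name


def eat(obj, food, orders):
    """Record that obj ate food; orders is the shared list used for ordering and cost."""
    orders.append((obj.name, food))
    return orders


orders = []
eat(Person("visitor"), "hot dog", orders)
eat(Lion("Leo"), "steak", orders)
```

Neither class knows anything about food or orders; the only thing eat relies on is that the object has a name.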


While that looks cannibalistic at first, we can reconsider our original age method as well.


And how about our Animals?


Looking back at our issues with time zones, because our zoo has grown and people come from all over the world, we can even update our code without much trouble.


Assuming we’re using imports, here is a more realistic refactoring.

from myapp.time import age


Rather than rewriting all our age calls for timezone awareness, we can change our myapp/

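from datetime import datetime

def age(obj):
    # adjust_for_timezones is a helper assumed to exist elsewhere in myapp.
    age = datetime.now() - adjust_for_timezones(obj.birthday())
    return age.days // 365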

In this idealized world, we haven’t thrown out objects completely. We’ve simply adjusted how we use them. Our age depends on a birthday method. This might be a Mixin class we use with our Models. We also could still have our classic Animal base class. Age might even be relative where you’d want to know how old an Animal is in “person years”. We might create a time.animal.age function that has slightly different requirements.

In any case, by reconsidering our object oriented design, we can remove quite a bit of code related to ugly attributes.

The Real World Conclusions

While it might seem obvious now how to implement a system using these ideas, it requires a different set of skills. Naming things is one of the two hard things in computer science. We don’t have obvious design patterns for grouping functions in dynamic languages in a way that makes the expectations clear. Our age function above likely would need some check to ensure that the object has a birthday method. You wouldn’t want every age call to be wrapped in a try/except.
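One way to do that check once, inside the function, is sketched below. It assumes birthday() returns a date, and it counts calendar years instead of dividing days by 365; years_between and Visitor are hypothetical names used only for the example.

```python
from datetime import date


def years_between(born, today):
    """Whole years elapsed between two dates."""
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - int(before_birthday)


def age(obj, today=None):
    """Return obj's age in years, checking the expected interface once, up front."""
    birthday = getattr(obj, "birthday", None)
    if not callable(birthday):
        raise TypeError("%r has no birthday() method" % (obj,))
    return years_between(birthday(), today or date.today())


class Visitor(object):
    def birthday(self):
        return date(1984, 2, 15)
```

Callers get a clear TypeError naming the missing method, so individual age calls do not need their own try/except.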

You also wouldn’t want to be too limiting on type, especially in a dynamic language like Python (or Ruby, JavaScript, etc.). Even though there has been some rumbling for type hints in Python that seem reasonable, right now you have to make some decisions on how you want to communicate that some function foo expects an object of type Bar or one that has a method baz. These are trivial problems at a technical level, but socially, they are difficult to enforce without formal language support.

There are also some technical issues to consider. In Python, function calls can be expensive. Each function call requires its own stack frame, so many small nested functions, while designed well, can become slow. There are tools to help with this, but again, it is difficult to make this style obvious over time.

There is never a panacea, but it seems that there is still room for OO design to grow and change. Functional programming, while elegant, is pretty tough to grok, especially when you have dynamic language code sitting in your editor, allowing you to mutate everything under the sun. Still, there are some powerful themes in Functional Programming that can make your Object Oriented code more helpful in managing complexity.


Programming is really about layering complexity. It is taking concepts and modeling them in a language that computers can take and, eventually, consider in terms of voltage. As we model our systems we need to consider the data vs. the functionality, which means avoiding ugly attributes (and methods) in favor of orthogonal functionality that respects the design inherent in the objects.

It is not easy by any stretch, but I believe by adopting the techniques mentioned above, we can move past the kludgy parts of OO (and functional programming) into better designed and more maintainable software.

Functional Programming in Python

While Python doesn’t natively support some essential traits of an actual functional programming language, it is possible to use a functional style (rather than object oriented) to write programs. What makes it hard is that some of the constraints functional programming requires must be followed manually.

First off, let’s talk about what Python does well that makes functional programming possible.

Python has first class functions that allow passing a function around the same way that you’d pass around a normal variable. First class functions make it possible to do things like currying and complex list processing. Fortunately, the standard library provides the excellent functools library. For example:

>>> from functools import partial
>>> def add(x, y): return x + y
>>> add_five = partial(add, 5)
>>> map(add_five, [10, 20, 30])
[15, 25, 35]

The next critical functional tool that Python provides is iteration. More specifically, Python generators provide a tool to process data lazily. Generators allow you to create functions that can create data on demand rather than forcing the creation of an entire set. Again, the standard library provides some helpful tools via the itertools library.

>>> from itertools import count, imap, islice
>>> nums = islice(imap(add_five, count(10, 10)), 0, 3)
>>> nums
<itertools.islice object at 0xb7cf6dc4>

In this example each of the functions only calculates and returns a value when it is required.
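The session above is Python 2, where imap lives in itertools. On Python 3, imap is gone and the built-in map is already lazy. A generator function makes the laziness explicit in either version; lazily_add_five is a hypothetical name for this sketch.

```python
from itertools import count, islice


def add_five(x):
    return x + 5


def lazily_add_five(numbers):
    """Yield one result at a time; nothing runs until a value is requested."""
    for number in numbers:
        yield add_five(number)


# count(10, 10) is infinite, so this only works because evaluation is lazy.
nums = islice(lazily_add_five(count(10, 10)), 0, 3)
first_three = list(nums)
```

Only three values are ever computed, even though the source sequence never ends.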

Python also has other functional concepts built in, such as list comprehensions and decorators, that when used with first class functions and generators make programming in a functional style feasible.

Where Python does not make functional programming easy is dealing with immutable data. In Python, critical core datatypes such as lists and dicts are mutable. In many functional languages, values are immutable by default. The result is that you often create a value based on some initial immutable variable that then has functions applied to it.

(defn add-markup [price]
  (+ price (* .25 price)))

(defn add-tax [total]
  (+ total (* .087 total)))

(defn get-total [initial-price]
  (add-tax (add-markup initial-price)))

In each of the steps above, the argument is passed in by value and can’t be changed. When you need to use the total returned from get-total, rather than storing it in a variable, you’d often just call the get-total function again. Typically a functional language will optimize these calls. In Python we can mimic this by memoizing the result.

import functools
import operator

def memoize(f):
    cache = {}

    @functools.wraps(f)
    def wrapper(*args, **kw):
        # Keyword arguments are sorted and frozen into a tuple so the
        # key is hashable.
        key = (args, tuple(sorted(kw.items())))
        if key not in cache:
            cache[key] = f(*args, **kw)
        return cache[key]
    return wrapper

@memoize
def factorial(num):
    return functools.reduce(operator.mul, range(1, num + 1), 1)

Now, calls to the function will re-use previous results without having to execute the body of the function.
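On Python 3.2 and later, the standard library's functools.lru_cache decorator offers the same behavior without a hand-written cache. The calls list below exists only to show when the function body actually runs.

```python
import functools
import operator

calls = []


@functools.lru_cache(maxsize=None)
def factorial(num):
    calls.append(num)  # record each time the body actually runs
    return functools.reduce(operator.mul, range(1, num + 1), 1)


factorial(5)
factorial(5)  # served from the cache; the body does not run again
```

Like the hand-written version, this only works when every argument is hashable.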

Another pattern seen in functional languages such as LISP is to re-use a core data type, such as a list, as a richer object. For example, association lists act like dictionaries in Python, but they are essentially still just lists that have functions to access them as a dictionary such that you can access random keys. In other functional languages such as Haskell or Clojure, you create actual types that are similar to a struct to communicate more complex type information.

Obviously in Python we have actual objects. The problem is that objects are mutable. In order to make sure we’re using immutable types we can use Python’s immutable data type, the tuple. What’s more, we can replicate richer types by using a named tuple.

from collections import namedtuple

User = namedtuple('User', ['name', 'email', 'password'])

def update_password(user, new_password):
    return User(user.name, user.email, new_password)

I’ve found that using named tuples often helps close the mental gap of going from object oriented to a functional style.
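Named tuples also ship with a _replace method that builds the updated copy for you. The sketch below uses placeholder values and shows both the copy and the fact that the original cannot be changed in place.

```python
from collections import namedtuple

User = namedtuple('User', ['name', 'email', 'password'])


def update_password(user, new_password):
    # _replace returns a new tuple; the original is left untouched.
    return user._replace(password=new_password)


original = User('alice', 'alice@example.com', 'old-secret')
updated = update_password(original, 'new-secret')

try:
    original.password = 'hacked'
    mutated = True
except AttributeError:
    # Named tuple fields are read-only.
    mutated = False
```

Using _replace means update_password keeps working if fields are later added to User.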

While Python is most definitely not a functional language, it has many tools that make using a functional paradigm possible. Functional programming can be a powerful model to consider as there is a whole class of bugs that disappear in an immutable world. Functional programming is also a great change of pace from the typical object oriented patterns you might be used to. Even if you don’t refactor all your code to a functional style, there is much to learn, and fortunately, Python makes it easy to get started in a language you are familiar with.

Parallel Processing

It can be really hard to work with data programmatically. There is some moment when working with a large dataset where you realize you need to process the data in parallel. As a programmer, this sounds like it could be a fun problem, and in many cases it is fun to get all your cores working hard crunching data.

The problem is parallel processing never is purely a matter of distributing your work across CPUs. The hard part ends up being getting the data organized before sending it to your workers and doing something with the results. Tools like Hadoop boast processing terabytes of data, but it’s a little misleading because there is most likely a ton of code on either end of that processing.

The input and output code (I/O) can also have a big impact on the processing itself. The input often needs to consider what the atomic unit is as well as what the “chunk” of data needs to be. For example, if you have 10 million tiny messages to process, you probably want to chunk the 10 million messages into groups of 5,000 when sending them to your worker nodes, yet the workers will need to know they are getting a chunk of messages rather than a single message. Similarly, for some applications the message:chunk ratio needs to be tweaked.
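A minimal sketch of that chunking step, assuming the messages arrive as any Python iterable. The names chunked, handle_chunk and handle_message are invented for illustration, and the per-message work is a placeholder.

```python
from itertools import islice


def chunked(messages, size):
    """Yield lists of at most `size` messages from any iterable."""
    iterator = iter(messages)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def handle_message(message):
    # Placeholder for the real per-message work.
    return message.upper()


def handle_chunk(chunk):
    """Worker entry point: it always receives a list of messages."""
    return [handle_message(message) for message in chunk]
```

Keeping the chunk size as a parameter makes the message:chunk ratio something you can tune without touching the worker code.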

In Hadoop this sort of detail can be dealt with via HDFS, but Hadoop is not trivial to set up, and it does not help if you have a bunch of data that doesn’t live in HDFS. The same goes for the output. When you are done, where does it go?

The point is that “data” always tends towards specificity. You can’t abstract away data. Data always ends up being physical at its core. Even if the processing happens in parallel, the I/O will always be a challenging constraint.

View Source

I watched a bit of this fireside chat with Steve Jobs. It was pretty interesting to hear Steve discuss the availability of the network and how it changes the way we can work. Specifically, he mentioned that because of NFS (presumably in the BSD family of unices), he could share his home directory on every computer he works on without ever having to think about backups or syncing his work.

What occurred to me was how much of the software we use is taken for granted. Back in the day an educational license for Unix was around $1800! I can only imagine the difficulties of becoming a software developer back then, when all the interesting tools like databases or servers were prohibitively expensive!

It reminds me of when I first started learning about HTML and web development. I could always view the source to see what was happening. It became an essential part of how I saw the web and programming in general. The value of software was not only in its function, but in its transparency. The ability to read the source and support myself as a user allowed me the opportunity to understand why the software was so valuable.

When I think about how difficult it must have been to become a hacker back in the early days of personal computing, it is no wonder that free software and open source became so prevalent. These early pioneers had to learn the systems without reading the source! Learning meant reading through incomplete, poorly written manuals. When the manual was wrong or out of date, I can only imagine the hair pulling that must have occurred. The natural solution to this problem was to make the source available.

The process of programming is still very new and very detailed, while still being extremely generic. We are fortunate as programmers that the computing landscape was not able to enclose software development within proprietary walls like so many other technical fields. I’m very thankful I can view the source!

Property Pattern

I’ve found myself doing this quite a bit lately and thought it might be helpful to others.

Oftentimes when I’m writing some code I want to access something as an attribute, even though it comes from some service or database. For example, say we want to download a bunch of files from some service and store them on our file system for processing.

Here is what we’d like the processing code to look like:

def process_files(self):
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

We don’t really care what the filter_and_store method does. What we do care about is the downloaded_files attribute.

Let’s step back and see what the calling code might look like:

processor = MyServiceProcessor(conn)
processor.process_files()

Again, this is pretty simple, but now we have a problem. When do we actually download the files and store them on the filesystem? One option would be to do something like this in our process_files method:

def process_files(self):
    self.downloaded_files = self.download_files()
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

While it may not seem like a big deal, we just created a side effect. The downloaded_files attribute is getting set in the process_files method. There is a good chance the downloaded_files attribute is something you’d want to reuse. This creates an odd coupling between the process_files method and the downloaded_files attribute.

Another option would be to do something like this in the constructor:

def __init__(self, conn):
    self.downloaded_files = self.download_files()

Obviously, this is a bad idea. Anytime you instantiate the object it will seemingly try to reach out across some network and download a bunch of files. We can do better!

Here are some goals:

  1. keep the API simple by using a simple attribute, downloaded_files
  2. don’t download anything until it is required
  3. only download the files once per-object
  4. allow injecting downloaded values for tests

The way I’ve been solving this recently has been to use the following property pattern:

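import tempfile


class MyServiceProcessor(object):

    def __init__(self, conn):
        self.conn = conn
        self._downloaded_files = None

    @property
    def downloaded_files(self):
        # Download lazily, on first access, then reuse the cached list.
        if not self._downloaded_files:
            self._downloaded_files = []
            tmpdir = tempfile.mkdtemp()
            for obj in self.conn.resources():
                # obj.download is a hypothetical stand-in for however the
                # service writes a resource to disk and returns its path.
                self._downloaded_files.append(obj.download(tmpdir))
        return self._downloaded_files

    def process_files(self):
        result = []
        for fn in self.downloaded_files:
            result.append(self.filter_and_store(fn))
        return result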

Say we wanted to test our process_files method. It becomes much easier.

def setup(self):
    self.test_files = os.listdir(os.path.join(HERE, 'service_files'))
    self.conn = Mock()
    self.processor = MyServiceProcessor(self.conn)

def test_process_files(self):
    # Just set the property variable to inject the values.
    self.processor._downloaded_files = self.test_files

    assert len(self.processor.process_files()) == len(self.test_files)

As you can see it was really easy to inject our stub files. We know that we don’t perform any downloads until we have to. We also know that the downloads are only performed once.

Here is another variation I’ve used that doesn’t require setting up _downloaded_files in the constructor.

@property
def downloaded_files(self):
    if not hasattr(self, '_downloaded_files'):
        self._downloaded_files = self.download_files()
    return self._downloaded_files

Generally, I prefer the explicit _downloaded_files attribute in the constructor as it allows more granularity when setting a default value. You can set it as an empty list for example, which helps to communicate that the property will need to return a list.

Similarly, you can set the value to None and have the property replace it with a list, possibly an empty one, the first time the attribute is accessed. This small distinction helps to make the API easier to use: an empty list is still iterable while still being “falsey”.
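One consequence of an empty list being “falsey” is that a plain truthiness check repeats the fetch every time the service returns nothing. Here is a sketch of a variant that compares against None instead, assuming you want empty results cached too. LazyProcessor and CountingConn are hypothetical names used only for this illustration.

```python
class LazyProcessor(object):

    def __init__(self, conn):
        self.conn = conn
        self._downloaded_files = None  # None means "not fetched yet"

    @property
    def downloaded_files(self):
        # Compare against None so an empty result is cached as well.
        if self._downloaded_files is None:
            self._downloaded_files = list(self.conn.resources())
        return self._downloaded_files


class CountingConn(object):
    """Hypothetical stand-in for a service that has nothing to return."""

    def __init__(self):
        self.calls = 0

    def resources(self):
        self.calls += 1
        return []
```

With this check, reading the attribute twice against an empty service still results in a single call to resources().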

There is nothing technically novel about this technique. What I hope someone takes from this is how you can use it to write clearer code and encapsulate your implementation, while exposing a clear API between your objects. Even if you don’t publish a library, keeping your internal object APIs simple and communicative helps make your code easier to reason about.

One caveat is that this method can add a lot of small property methods to your classes. There is nothing wrong with this, but it might give a reader of your code the impression the classes are complex. One method to combat this is to use mixins.

class MyWorkerMixinProperties(object):

    def __init__(self, conn):
        self.conn = conn
        self._categories = None
        self._foo_resources = None
        self._names = None

    @property
    def categories(self):
        if not self._categories:
            self._categories = self.conn.categories()
        return self._categories

    @property
    def foo_resources(self):
        if not self._foo_resources:
            self._foo_resources = self.conn.resources(name='foo')
        return self._foo_resources

    @property
    def names(self):
        if not self._names:
            # Assumes the names come from every resource on the connection.
            self._names = [r.meta()['name'] for r in self.conn.resources()]
        return self._names

class MyWorker(MyWorkerMixinProperties):

    def __init__(self, conn):
        MyWorkerMixinProperties.__init__(self, conn)

    def run(self):
        for resource in self.foo_resources:
            if resource.category in self.categories:
                self.put('/api/foos', {
                    'real_name': self.names[resource.name_id],
                    'values': self.process_values(resource.values),
                })

This is a somewhat contrived example, but the point is that we’ve taken all our service based data and made it accessible via normal attributes. Each service request is encapsulated in a function, while our primary worker class has a reasonably straightforward implementation of some algorithm.

The big win here is clarity. You can write an algorithm by describing what it should do. You can then test the algorithm easily by injecting the values you know should produce the expected results. Furthermore, you’ve decoupled the algorithm from the I/O code, which is typically where you’ll see a good deal of repetition in the case of RESTful services or optimization when talking to databases. Lastly, it becomes trivial to inject values for testing.

Again, this isn’t rocket science. It is a really simple technique that can help make your code much clearer. I’ve found it really useful and I hope you do too!

Iterative Code Cycle

TDD prescribes a simple process for working on code.

  1. Write a failing test
  2. Write some code to get the test to pass
  3. Refactor
  4. Repeat

If we consider this cycle more generically, we see a typical cycle every modern software developer must use when writing code.

  1. Write some code
  2. Run the code
  3. Fix any problems
  4. Repeat

In this generic cycle you might use a REPL, a stand alone script, a debugger, etc. to quickly iterate on the code.

Personally, I’ve found that I do use a test for this iteration because it is integrated into my editor. The benefit of using my test suite is that I often have a repeatable test when I’m done that proves (to some level of confidence) the code works as I expect it to. It may not be entirely correct, but at least it codifies how I think it should work. When it does break, I can take a more TDD-like approach: fix the test so that it fails, and then fix the actual bug.

The essence then of any developer’s work is to make this cycle as quick as possible, no matter what tool you use to run and re-run your code. The process should be fluid and help get you in the flow when programming. If you do use tests for this process, it may be a helpful design tool. For example, if you are writing a client library for some service, you write an idealistic API you’d like to have without letting the implementation drive the design.
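As a rough illustration of letting a test shape an API, here is a sketch that starts from the call we would like to make. WeatherClient, its current_temperature method, the get_json helper on the injected http object and the '/weather' path are all invented for this example; the mock stands in for the network.

```python
from unittest import mock


class WeatherClient(object):
    """Hypothetical client; its shape was driven by the test below."""

    def __init__(self, http):
        self.http = http

    def current_temperature(self, city):
        payload = self.http.get_json('/weather', params={'city': city})
        return payload['temperature']


def test_current_temperature():
    # The test reads like the usage we wanted before any code existed.
    http = mock.Mock()
    http.get_json.return_value = {'temperature': 21}
    client = WeatherClient(http)

    assert client.current_temperature('Austin') == 21
    http.get_json.assert_called_once_with('/weather', params={'city': 'Austin'})
```

Writing the test first here only fixes the calling convention; the implementation behind it is free to change.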

TDD has been on my mind as I’ve written a lot of code recently and have questioned whether or not my testing patterns have truly been helpful. They have been helpful in fixing bugs and provide a quick coding cycle. I’d argue the code has been improved, but at the same time, I do wonder if by making things testable I’ve introduced more abstractions than necessary. I’ve had to look back on some code that used these patterns and getting up to speed was somewhat difficult. At the same time, anytime you read code you need to put in effort in order to understand what is happening. Oftentimes I’ll assume that if code doesn’t immediately convey exactly what is happening it is terrible code. The reality is that code is complex and takes effort to understand. It should be judged based on how reasonable it is to fix once it is understood. In this way, I believe my test based coding cycle has proven itself to be valuable.

Obviously, the next person to look at the code will disagree, but hopefully once they understand what is going on, it won’t be too bad.


I watched DHH’s keynote at RailsConf 2014. A large part of his talk discusses how TDD has become misassociated with metrics and with making code “testable”, rather than stepping back and focusing on clarity, as an author would when writing.

If you’ve ever tried to do true TDD, you might have a similar feeling that you’re doing it wrong. I know I have. Yet, I’ve also seen the benefit of iterating on code via writing tests. The faster the code / test cycle, the easier it is to experiment and write the code. Similarly, I’ve noticed more bugs show up in code that is not as well covered by tests. It might not be clear how DHH’s perspective then fits in with the benefits of testing and facets of TDD.

What I’ve found is that readability and clarity in code often come by way of being testable. Tests and making code testable can go a long way in finding the clarity that DHH describes. It can become clear very quickly that your class API is actually really difficult to use by writing a test. You can easily spot odd dependencies in a class by the number of mocks you are required to deal with in your tests. Sometimes I find it easier to write a quick test rather than spin up a REPL to run and rerun code.

The point is that TDD can be a helpful tool to write clear code. As DHH points out, it is not a singular path to a well thought out design. Unfortunately, just as people take TDD too literally, people will feel that any sort of granular testing is a waste of time. The irony here is that DHH says very clearly that we, as software writers, need to practice. Writing tests and re-writing tests are a great way to become a better writer. Even though the ideals presented in TDD might be a bit too extreme, the mechanism of a fast test suite and the goal of 100% coverage are still valuable in that they force you to think about and practice writing code.

The process of thinking about code is what is truly critical in almost all software development exercises. Writing tests first is just another way to slow you down and force you to think about your problem before hacking out some code. Some developers can avoid tests, most likely because they are really good about thinking about code before writing it. These people can likely iterate on ideas and concepts in their head before turning to the editor for the actual implementation. The rest of us can use the opportunity of writing tests, taking notes, and even drawing a diagram as tools to force us to think about our system before hacking some ugly code together.

Concurrency Transitions

Glyph, the creator of Twisted, wrote an interesting article discussing the intrinsic flaws of using threads. The essential idea is that unless you know explicitly when you are switching contexts, it is extremely difficult to effectively reason about concurrency in code.

I agree that this is one way to handle concurrency. Glyph also provides a clear perspective into the underlying constraints of concurrent programming. The biggest constraint is that you need a way to guarantee a set of statements happens atomically. He suggests an event driven paradigm as the best way to do this. In a typical async system, the work is built up using small procedures that run atomically, yielding back control to the main loop as they finish. The reason the async model works so well is that you eliminate all CPU based concurrency and allow work to happen while waiting for I/O.
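To illustrate the idea of explicit yield points, here is a small sketch using asyncio from the standard library rather than Twisted, which is what Glyph’s article is about. The bump and main names and the counter are made up. Between two await expressions no other task can run, so the read and write of the counter behave atomically without a lock.

```python
import asyncio


async def bump(state, times):
    for _ in range(times):
        # No await between the read and the write, so no other task
        # can run in the middle of this update.
        current = state["count"]
        state["count"] = current + 1
        # Explicit transition: control goes back to the event loop here.
        await asyncio.sleep(0)


async def main():
    state = {"count": 0}
    # Both tasks interleave, but only at the await points.
    await asyncio.gather(bump(state, 1000), bump(state, 1000))
    return state["count"]
```

Running asyncio.run(main()) gives 2000 every time, because the only places the two tasks can interleave are the ones marked with await.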

There are other valid ways to achieve a similar effect. The key in all these methods, async included, is to know when you transition from atomic sequential operations to potentially concurrent, and often parallel, operations.

A great example of this mindset is found in functional programming, and specifically, in monads. Loosely speaking, a monad is a way to sequence a set of operations explicitly, so that the places where effects happen are marked in the code. In a pure functional language, functions are considered “pure” meaning they don’t introduce any “side effects”, or more specifically, they do not change any state. Monads give pure functional languages such as Haskell a way to interact with the outside world by providing a logical interface that the underlying system can use to do any necessary work to make the operation safe. Clojure, for example, uses a Software Transactional Memory system to safely apply changes to state. Another approach might be to use locking and mutexes. No matter the methodology, the goal is to provide a safe way to change state by allowing the developer an explicit way to identify portions of code that change external state.

Here is a classic example in Python of where mutable state can cause problems.
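A minimal sketch of that problem, assuming the example meant here is the well-known mutable default argument. The function name append_item is made up for illustration.

```python
def append_item(item, items=[]):
    # The default list is created once, when the function is defined,
    # and then shared by every call that omits `items`.
    items.append(item)
    return items


first = append_item(1)
second = append_item(2)

# Both names now refer to the same list, which holds both items.
same_list = first is second
```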

In Python, and the vast majority of languages, it is assumed that a function can act on a variable of a larger scope. This is possible thanks to mutable data structures. In the example above, calling the function multiple times doesn’t re-initialize the argument to an empty list. It is a mutable data structure that exists as state. When the function is called that state changes and that change of state is considered a “side effect” in functional programming. This sort of issue is even more difficult in threaded programming because your state can cross threads in addition to lexical boundaries.

If we generalize the purpose of monads and Clojure’s reference types, we can establish that concurrent systems need to be able to manage the transitions between pure functionality (no state manipulation) and operations that affect state.

One methodology that I have found to be effective for managing this transition is to use queues. More generally, this might be called message passing, but I don’t believe message passing guarantees the system understands when state changes. In the case of a queue, you have an obvious entrance and exit point for the transition between purity and side effects to take place.

The way to implement this sort of system is to consider each consumer of a queue as a different process. By considering consumers / producers as processes we ensure there is a clear boundary between them that protects shared memory, and more generally shared state. The queue then acts as a bridge to cross the “physical” border. The queue also provides the control over the transition between pure functionality and side effects.
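Here is a minimal sketch of that arrangement using the standard multiprocessing module. The word_count, consumer and run names, the SENTINEL convention and the single-worker setup are assumptions made for illustration, not a description of any particular system.

```python
import multiprocessing

SENTINEL = None  # marks the end of the work


def word_count(line):
    """Pure function: the result depends only on the argument."""
    return len(line.split())


def consumer(inbox, outbox):
    """Pop messages until the sentinel arrives; push results back."""
    while True:
        line = inbox.get()
        if line is SENTINEL:
            break
        outbox.put((line, word_count(line)))


def run(lines):
    """Producer side: every state change crosses a queue."""
    inbox = multiprocessing.Queue()
    outbox = multiprocessing.Queue()
    worker = multiprocessing.Process(target=consumer, args=(inbox, outbox))
    worker.start()
    for line in lines:
        inbox.put(line)
    inbox.put(SENTINEL)
    # Drain the results before joining so the child can flush its queue.
    totals = dict(outbox.get() for _ in lines)
    worker.join()
    return totals
```

The consumer only touches what it pops off its inbox, so it can be exercised with any object that has get and put methods. When calling run from a script, put the call under an if __name__ == "__main__": guard, since some platforms start worker processes by re-importing the main module.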

To relate this back to Glyph’s async perspective, when state is pushed onto the queue it is similar to yielding to the reactor in an async system. When state is popped off the queue into a process, it can be acted upon without worry of causing side effects that could affect other operations.

Glyph brought up the scenario where a function might yield multiple times in order to pass back control to the managing reactor. This becomes less necessary in the queue and process system I describe because there is no chance of a context switch interrupting an operation or slowing down the reactor. In a typical async framework, the job of the reactor is to order each bit of work. The work still happens in series. Therefore, if one operation takes a long time, it stops all other work from happening, assuming that work is not doing I/O. The queue and process system doesn’t have this same limitation as it is able to yield control to the queue at the correct logical point in the algorithm. Also, in terms of Python, the GIL is mitigated by using processes. The result is that you can program in a sequential manner for your algorithms, while still tackling problems concurrently.

Like anything, this queue and process model is not a panacea. If your data is large, you often need to pass around references to the data and where it can be retrieved. If the place the data lives is not designed to handle concurrent connections, the file system for example, you may still run into concurrency issues when accessing it. It also can be difficult to reason about failures in a queue based system. How full is too full? You can limit the queue size, but that might cause blocking issues that may be unreasonable.

There is no silver bullet, but if you understand the significance of transitions between pure functionality and side effects, you have a good chance of producing a reasonable system no matter what concurrency model you use.