Software Project Structure and Ecosystems

Code can smell, but just like our own noses, everyone has their own perspective on what stinks. And just because something smells bad now doesn't mean you can't get used to the smell and eventually enjoy it. The same applies to how software projects are organized and configured.

Most languages have the concept of a package. Your source code repository is organized to support building a package that can be distributed via your language’s ecosystem. Python has setuptools/pip, Perl has CPAN, JavaScript has npm, Ruby has gems, etc. Even compiled languages provide packages by way of builds for different operating system packaging systems.

In each case, there are common idioms and best practices that the community supports. Tools that promote these ideals end up part of the language and community packaging ecosystem. This ecosystem often ends up permeating not just the project layout, but the project itself. As developers, we want to be participants in the ecosystem and benefit from everything it provides.

As we become accustomed to our language ecosystem and its project tendencies, we develop an appreciation for the aroma of the code. Our sense of code smell can be tainted to the point of feeling that other project structures smell bad and are somehow “wrong” in how they work.

If you’ve ever gone from disliking some cuisine to making it an integral part of your diet, you will quickly see the problem with conflating a community’s ecosystem with sound software development practices. By dismissing different patterns as “smelly”, you also lose the opportunity to learn from the positive aspects of other ecosystems.

A great example is Python’s virtualenv infrastructure compared to Go’s lack of a packaging system. In Python, you create source-based packages (most of the time) that are extracted and built within a virtual Python environment. If the package requires other packages, the environment uses the same system (usually pip) to download and build the sub-requirements. Go, on the other hand, requires that you specify the libraries and other programs you use in your build step. These requirements are not “packaged” at all and typically point directly at source repositories. They become part of your distributed executable, which is a single file that can be copied to another machine and run without requiring any libraries or tools on the target machine.
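To make the contrast concrete, here is a minimal sketch of how a Python source package declares its sub-requirements. The project and dependency names are hypothetical, but this is the standard setuptools setup() call that pip acts on when building a package into a virtualenv.

from setuptools import setup, find_packages

setup(
    name='myproject',           # hypothetical project name
    version='0.1.0',
    packages=find_packages(),
    install_requires=[
        'requests>=2.0',        # pip downloads and builds these
        'simplejson',           # inside the active virtualenv
    ],
)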

Both of these systems are extremely powerful, yet radically different. Python might benefit a great deal from being able to package up a virtualenv and a specific command as a single file that could be run on another system. Similarly, Go could benefit from a more formalized package repository system that ensures higher security standards. It is easy to look at either system and feel that the lack of the other’s packaging features is detrimental, when in fact they are just different design trade-offs. Python has become very successful on platforms such as Heroku, where it is relatively easy to take a project from source code to a release because the ecosystem promotes building software during deployment. Go, on the other hand, has become an ideal systems management tool because it is trivial to copy a Go program to another machine and run it without requiring OS-specific libraries or dependencies.

While packaging is one area where we see programmers develop a nose for coding conventions, it doesn’t stop there.

The language ecosystem also prescribes coding standards. Some languages are perfectly happy with a huge single file while others require a new file for each class or function. Each methodology has its own penalties and rewards. Many a Python Vimmer has retched at the number of files produced when writing Java, while the Java developer stares in shock as the Vimmer uses search and replace to refactor code in a project. Again, these are design decisions with their own set of trade-offs.

One area where code smell becomes especially problematic is when it involves a coding paradigm that is different from the ecosystem’s. Until recently, the Twisted and Stackless ecosystems in Python felt strange and isolated. Many developers felt that things like Deferreds and greenlets were code smell when compared to tried-and-true threads. Yet, as we discovered the need to keep more socket connections open at the same time and to read more I/O concurrently, it became clear that asynchronous programming models should be a first-class citizen in the Python ecosystem at large. Prior to the ecosystem’s acceptance of async, the best practices felt very different and quite smelly.

Fortunately, in the case of async, Python managed to find a way to be more inclusive. The async paradigm still has far-reaching requirements (there needs to be a main loop somewhere...), but the community has begun the process of integrating the ideas as seamless, natural, fragrant additions to the language.

Unfortunately, other paradigms have not been as well integrated. Functional programming, while having many champions and excellent libraries, still has not managed to break into the ecosystem at the project level. If we have packages with very few classes, nested function calls, LRU caches, and tons of (cy)func|itertool(s|z), it feels smelly.

As programmers we need to make a conscious effort to expand our code olfaction to include as wide a bouquet as possible. When we see code that seems stinky, rather than assuming it is poorly designed or dangerous, take a big whiff and see what you find out. Make an effort to understand it and appreciate its fragrance. Over time, you’ll learn the difference between pungent code, which is powerful and efficient, and stinky code that is actually rancid and should be thrown out.

Flask vs. CherryPy

I’ve always been a fan of CherryPy. It avoids making decisions for you that will matter over time and helps you make good decisions where it matters most. CherryPy is simple, pragmatic, stable, fast enough, and plays nice with other processes. As much as I appreciate what CherryPy offers, it unfortunately doesn’t have a lot of mindshare in the greater Python community. I suspect the reason CherryPy is not seen as a hip framework is that most CherryPy users happily work around the rough edges and get work done rather than make an effort to market their framework of choice. Tragic.

While there are a lot of microframeworks out there, Flask seems to be the most popular. I don’t say this with any sort of scientific accuracy, just a gut feeling. So, when I set out to write a different kind of process manager, it seemed like a good time to see how other microframeworks work.

The best thing I can say about Flask is the community of projects. After having worked on a Django project, I appreciate the admin interface and how easy it is to get 80% of the way there. Flask is surprisingly similar in that searching Google for “flask + something” quickly provides some options to implement what you want. Also, as Flask generally tries to avoid being too specific, the plugins (called Blueprints... I think) seem to provide basic tools with the opportunity to customize as necessary. Flask-Admin is extremely helpful alongside Flask-SQLAlchemy.

Unfortunately, while this wealth of community packages is excellent, Flask falls short when it comes to actual development. Its lack of structure around dispatching makes organizing code feel haphazard. It is easy to create circular dependencies due to the use of imports for establishing what code gets called. In essence, Flask forces you to invent patterns that are application specific rather than prescribing models that make sense generally.

While a lack of direction can make the organization of the code less obvious, it does allow you to easily hook applications together. The Blueprint model, from what I can tell, makes it reasonably easy to compose applications within a site.

Another difficulty with Flask is configuration. Since you are using the import mechanism to configure your app, your configuration must also be semi-available at import time. This makes things slightly difficult when you are creating an app that starts a web server (as opposed to an app that runs a web service). It is kind of tricky to create a myapp --config style command because by the time you’ve started the app, you’ve already imported your application and set up some config. Not a huge issue, but it can be kludgy.
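As a rough sketch of the awkwardness (the module and file names here are hypothetical), the app object gets configured at import time, so a --config flag ends up applying the real settings after the fact:

import argparse

from flask import Flask

app = Flask(__name__)
app.config.from_object('myapp.default_settings')  # runs at import time

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--config')
    args = parser.parse_args()
    if args.config:
        # anything that imported `app` has already seen the defaults
        app.config.from_pyfile(args.config)
    app.run()

if __name__ == '__main__':
    main()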

This model is where CherryPy excels. It allows you to create a standalone process that acts as a server. It provides a robust configuration mechanism that lets you turn process- and request-level features on and off, and it allows configuration per URL as well. The result is that if you’re writing a daemon or a single app you want to run as a command, CherryPy makes this exceptionally easy and clear.
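A minimal sketch of that model might look like the following; the handler is made up, but the config keys are standard CherryPy settings, with a process-level option and a per-URL tool toggle passed straight to quickstart.

import cherrypy

class Root(object):
    @cherrypy.expose
    def index(self):
        return 'hello'

cherrypy.quickstart(Root(), '/', {
    'global': {'server.socket_port': 8080},   # process-level setting
    '/': {'tools.gzip.on': True},             # per-URL feature toggle
})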

CherryPy also helps you stay a bit more organized within the framework. It provides some helpful dispatcher patterns that support a wide array of functionality and offer more obvious patterns for organizing code. It is not a panacea; there are patterns that take some getting used to. But once you understand these patterns, it becomes a powerful model to code in.

Fortunately, if you do want to use Flask as a framework and CherryPy as a process runner / server, it couldn’t be easier. It is trivial to run a Flask app with CherryPy, getting the best of both worlds in some ways.
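Something along these lines is likely all it takes (the myapp import is hypothetical); CherryPy’s tree.graft accepts any WSGI application, and a Flask app is one.

import cherrypy

from myapp import app  # a hypothetical Flask application (a WSGI app)

cherrypy.tree.graft(app, '/')
cherrypy.config.update({'server.socket_port': 8080})
cherrypy.engine.start()
cherrypy.engine.block()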

While I wish CherryPy had more mindshare, I’m willing to face the fact that Flask might have “won” the microframework war. That said, I think there are valuable lessons from CherryPy that could be applied to Flask. I’d personally love to see the process bus model made available and a production web server included. Until then, I’m happy to use CherryPy for its server and continue to enjoy the functionality graciously provided by the Flask community.

Thinking About ETLs

My primary focus for the last year or so has been writing ETLs at work. It is an interesting problem because on some level it feels extremely easy, while in reality it is very difficult to abstract.

Queries

The essence of an ETL, beyond the obvious “extract, transform, load”, is the query. In the case of a database, the query is typically the SELECT statement, but it usually is more than that. It often includes the format of the results. You might need to chunk the data using multiple queries. There might be columns you skip or columns you create.

In non-database ETLs, the concept still ends up being very similar to a query. You often still need to find boundaries for what you are extracting. For example, if you had a bunch of date-stamped log files, running find /var/logs -name '2014*.log.gz' could still be considered a query.

A query is important because ETLs are inherently fragile. ETLs are required because the standard interface to some data is not available due to some constraint. By bypassing the standard, and more importantly supported, interfaces, you are on your own when it comes to ensuring the ETL runs. The database dump you are running might time out. The machine you are reading files from may reboot. The REST API node you are hitting gets a new version and restarts. There are always good reasons for your ETL process to fail. The query makes it possible to go back and try things again, limited to the specific subset of data you are missing.
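As a sketch of what a chunked, retryable query might look like, keyset pagination lets you restart from the last id you successfully handled. The table, columns, chunk size, and psycopg2-style placeholders are all assumptions; conn is any DB-API connection.

CHUNK_SIZE = 10000

def extract_rows(conn, last_id=0):
    """Yield rows in id order, CHUNK_SIZE at a time, starting after last_id."""
    cursor = conn.cursor()
    while True:
        cursor.execute(
            "SELECT id, email, created_at FROM users "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, CHUNK_SIZE))
        rows = cursor.fetchall()
        if not rows:
            break
        for row in rows:
            yield row
        last_id = rows[-1][0]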

Transforms

ETLs are often considered part of some analytics pipeline. The goal of an ETL is typically to take data from some system and transform it into a format that can be loaded into another system for analysis. A better principle is to store the intermediaries so that transformation targets a specific, generalized format rather than a specific system such as a database.

This is much harder than it sounds.

The key to providing generic access to data is a standard schema for the data. The “shape” of the data needs to be described in a fashion that is actionable by the transformation process that loads the data into the analytics system.

The schema is more than a type system. Some data is heavy with metadata while other data is extremely consistent. The schema should provide notation for both extremes.

The schema should also provide hints on how to convert the data. Its most important job is to communicate to the loading system how to transform and / or import the data. One system might happily accept a string like 2014-02-15 as a date if you specify it is a date, while others may need something more explicit. The schema should communicate that the data is a date string with a specific format that the loading system can use accordingly.
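There is no standard notation here, so purely as an illustration, a schema entry might carry both the type and the conversion hint the loading system needs:

# A hypothetical schema fragment; the field names and notation are made up.
schema = {
    'name': 'users',
    'fields': [
        {'name': 'id',        'type': 'integer'},
        {'name': 'email',     'type': 'string', 'max_length': 255},
        {'name': 'signed_up', 'type': 'date',   'format': '%Y-%m-%d'},
    ],
}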

The schema can be difficult to create. Metadata might require a suite of queries to other systems in order to fill in the gaps. There might be calculations that the querying system doesn’t support. In these cases you are not just transforming the data, but processing it.

I admit I just made an arbitrary distinction and definition of “processing”, so let me explain.

Processing Data

In a transformation you take the data you have and change it. If I have a URL, I might transform it into JSON that looks like {'url': $URL}. Processing, on the other hand, uses the data to create new data. For example, if I have a RESTful resource, I might crawl it to create a single view of some tree of objects. The important difference is that we are creating new information by using resources not found in the original query data.

Processing data can be expensive. You might have to make many requests for every row of a database table. The calculations, while small, might run over a huge dataset. Whatever processing needs to happen to get your data to a generically usable state, it is a difficult problem to abstract over a wide breadth of data.

While there is no silver bullet to processing data, there are tactics that can be used to process data reliably and reasonably fast. The key to abstracting processing is defining the unit of work.

A Unit of Work

“Unit of Work” is probably a loaded term, so once again, I’ll define what I mean here.

When processing data in an ETL, the Unit of Work is the combination of:

  • an atomic record
  • an atomic algorithm
  • the ability to run the implementation

If all this sounds very map/reducey, it is because it is! The difference is that in an ETL you don’t have the same reliability you’d have with something like Hadoop. There is no magical distributed file system that has your data ready to go on a cluster designed to run code explicitly written for your map/reduce platform.

The key difference between processing data in ETLs and a system like Hadoop is the implementation and execution of the algorithm. The implementation includes:

  • some command to run on the atomic record
  • the information necessary to setup an environment for that script to run
  • an automated way to input the atomic record to the command
  • a guarantee of reliable execution (or failure)

If we look at a system like Hadoop, and this applies to most map/reduce platforms that I’ve seen, there is an explicit step that takes data from some system and adds it to the HDFS store. There is another step that installs code, specifically written for Hadoop, onto the cluster. This code could be using Hadoop streaming or actual Java, but in either case, the installation is done via some deployment.

In other words, there is an unstated step that Extracts data from some system, Transforms it for Hadoop and Loads it into HDFS. In an ETL, the processing is getting the data from whatever the source system is into the analytics system; therefore, the requirements are slightly different.

We start off with a command. The command is simply an executable script like you would see in Hadoop streaming. No real difference here. Each line passed to the command contains the atomic record as usual.
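A command in this model might be nothing more than a small filter script. This sketch assumes each atomic record arrives as a line of JSON on stdin:

#!/usr/bin/env python
"""A minimal streaming command: one atomic record per line on stdin."""
import json
import sys

def transform(record):
    # hypothetical per-record transformation
    return {'url': record['url'].lower()}

for line in sys.stdin:
    record = json.loads(line)
    sys.stdout.write(json.dumps(transform(record)) + '\n')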

Before we can run that command, we need an environment configured. In Hadoop, you’ve configured your cluster and deployed your code to the nodes. In an ETL system, due to the fragility and simpler processing requirements (no one should write a SQL-like system on top of an ETL framework), we want to set up an environment every time the command runs. Doing so allows a clear path for developing your ETL steps. Making environment creation part of the development process means the deployment is tested alongside the actual command(s) your ETL uses.
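Here is a rough sketch of what “set up the environment every time” could mean for a Python step; the virtualenv command and the requirements file are assumptions rather than part of any particular framework:

import os
import subprocess
import tempfile

def build_env(requirements='requirements.txt'):
    """Create a throwaway virtualenv and install the step's requirements."""
    env_dir = tempfile.mkdtemp(prefix='etl-env-')
    subprocess.check_call(['virtualenv', env_dir])
    pip = os.path.join(env_dir, 'bin', 'pip')
    subprocess.check_call([pip, 'install', '-r', requirements])
    return env_dir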

Once we have the command and an environment to run it in, we need a way to get our atomic record to the command for actual processing. In Hadoop Streaming, we use everyone’s favorite file handle, stdin. In an ETL system, while the command may still read stdin, the data entering the system doesn’t necessarily have a distributed file system to live on. Data might be downloaded from S3, some RESTful service, and / or some queue system. It is important that you have a clear, automated way to get data to an ETL processing node.

Finally, this processing must be reliable. ETLs are low priority. An ETL should not lock your production database for an hour in order to dump the data. Instead, ETLs must quietly grab the data in a way that doesn’t add contention to the running systems. After all, you are extracting the data because a query on the production server would bog it down when it needs to be serving real-time requests. An ETL system needs to reliably stop and start as necessary to get the data it needs while avoiding adding more contention to an already resource intensive service.

Loading Data

Loading data from an ETL system requires analyzing the schema in order to construct the understanding between the analytics system and the data. To make this as flexible as possible, it is important that the schema use the source of the data to add as much metadata as possible. If the data pulls from a Postgres table, the schema should ideally include most of the table’s schema information. If that data must be loaded into some other RDBMS, you then have all you need to safely read the data into the system.

Development and Maintenance

ETLs are always going to be changing. New analytics systems will be used and new sources of data will be created. As the source system constraints change, so do the constraints of the ETL system, again with the ETL system being the lowest priority.

Since we can rely on ETLs changing and breaking, it is critical to raise awareness of maintenance within the system.

The key to creating a maintainable system is to build up from small tools. As you create small abstractions at a low level, you can reuse them easily. The trade-off is that in the short term, more code is needed to accomplish common tasks. Over time, you find patterns specific to your organization’s requirements that allow repetitive tasks to be abstracted into tools.

The converse of building up an ETL system from small tools is to use a pre-built execution system. Unfortunately, pre-built ETL systems have been generalized for common tasks. As we’ve said earlier, ETLs are often changing and require more attention than a typical distributed system. The result is that using a pre-built ETL environment often means writing ETLs whose main job is to let the pre-built ETL system do its work!

Testing

Our goal is to make ETLs extremely easy to test. There are many facets to testing ETLs, such as unit testing within an actual package, but the testing that is most critical for development and maintenance is simply being able to quickly run and test a single step of an ETL.

For example, let’s say we have an ETL that dumps a table, reformats some rows, and creates a 10GB gzipped CSV file. I only mention the size because it implies that it takes too long to run over the entire set of data every time while testing. The file will then be uploaded to S3, and a central data warehouse system will be notified. Here are some steps that the ETL might perform:

  1. Dump the table
  2. Create a schema
  3. Process the rows
  4. Gzip the output
  5. Upload the data
  6. Update the warehouse

Each of these steps should be runnable:

  • locally on a fake or testing database
  • locally, using a production database
  • remotely using a production database and testing system (test bucket and test warehouse)
  • remotely using the production database and production systems

By “runnable”, I mean that an ETL developer can run a command with a specific config and watch the output for issues.

These steps are all pretty basic, but the goal of an ETL system is to abstract the pieces that can be used across all ETLs in a way that is optimal for your system. For example, if your system is consistently streaming, your ETL framework might allow you to chain file handles together:

$ dump table | process rows | gzip | upload

Another option might be that each step produces a file that is used by the next step.

Both tactics are valid and can be optimized over time to help distill ETLs to the minimal, changing requirements. In the above example, the database table dump could be abstracted to take the schema and some database settings and dump any table in your databases. The gzip, upload, and data warehouse interactions can be broken out into a library and/or command line apps. Each of these optimizations is simple enough to be included in an ETL development framework without forcing a user to jump through a ton of hoops when a new data store needs to be considered.

An ETL Framework

Making it easy to develop ETLs means a framework. We want to create a Ruby on Rails for writing ETLs: easy enough to get the easy stuff done and powerful enough to deal with the corner cases. The framework revolves around the schema and the APIs to the different systems, along with libraries that provide language-specific APIs.

At some level the framework needs to allow the introduction of other languages. My only suggestion here is that other languages be abstracted through a command line layer. The ETL framework can eventually call a command that could be written in whatever language the developer wants to use. ETLs typically export data for an audience that is reasonably technical. Someone using this data most likely has some knowledge of a language such as R, Julia or maybe JavaScript. It is these technically savvy data wranglers we want to empower with the ETL framework, allowing them to solve small ETL issues themselves and provide reliability where the system can be flaky.

Open Questions

The system I’ve described is what I’m working on. While I’m confident the design goals are reasonable, the implementation is going to be difficult. Specifically, the task of generically supporting many languages is challenging because each language has its own ecosystem and environment. Python is an easy language for this task because it is trivial to connect to an Ubuntu host and have a good deal of the ecosystem in place. Other languages, such as R, probably require some coordination with the cluster provisioning system to make sure base requirements are available. That said, it is unclear whether other languages provide small environments the way virtualenvs do. Obviously, typical scripting languages like Ruby and JavaScript have support for an application-local environment, but I’m doubtful R or Julia have the same facilities.

Another option would be to use a formal build / deployment pattern where a container is built. This answers many of the platform questions, but it brings up others, such as how to make this available in the ETL framework. It is ideal if an ETL author can simply call a command to test. If the author needs to build a container locally, I suspect that might be too large a requirement, as each platform is going to be different. Obviously, we could introduce a build host to handle the build steps, but that makes it much harder for someone to feel confident the script they wrote will run in production.

The challenge comes from our hope to empower semi-technical ETL authors. If we compare this goal to people who can write HTML/CSS vs. programmers, it clarifies the requirements. Someone learning to write HTML/CSS only has to open the file in a web browser to test it. If the page looks correct, they can be confident it will work when they deploy it. The goal of the ETL framework and APIs is to provide a similarly easy workflow.

Wrapping Up

I’ve written a LOT of ETL code over the past year. Much of what I propose above reflects my experiences. It also reflects the server environment in which these ETLs run as well as the organizational environment. ETLs are low priority code, by nature, that can be used to build first class products. Systems that require a lot of sysadmin time or server resources, or that have too specific an API, may still be helpful for moving data around, but they will fall short as systems evolve. My goal has been to create a system that evolves with the data in the organization and empowers a large number of users to distribute the task of developing and maintaining ETLs.

Dadd, ErrorEmail and CacheControl Releases

I’ve written a couple new bits of code that seemed like they could be helpful to others.

Dadd

Dadd (pronounced Daddy) is a tool to help administer daemons.

Most deployment systems are based on the idea of long running processes. You want to release a new version of some service. You build a package, upload it somewhere and tell your package manager to grab it. Then you tell your process manager to restart it to get the new code.

Dadd works differently. Dadd lets you define a short spec that includes the process you want to run. A Dadd worker then uses that spec to download any necessary files, create a temporary directory to run in, and start the process. When the process ends, assuming everything went well, it cleans up the temp directory. If there was an error, it uploads the logs to the master and sends an email.

This sort of system comes in handy when you have scripts that take a while to run and that shouldn’t be killed when new code is released. For example, at work I manage a ton of ETL processes to get our data into a data warehouse we’ve written. These ETL processes are triggered with Celery tasks, but they typically ssh into a specific host, create a virtualenv, install some dependencies, and copy files before running a daemon and disconnecting. Dadd makes this kind of processing more automatic in that it can run these processes on any host in our cluster. Also, because the Dadd worker builds the environment, we can run a custom script without having to go through the process of a release. This is extremely helpful for running backfills or custom updates to migrate old data.

I have some ideas for Dadd, such as incorporating a more involved build system and possibly using LXC containers to run the code. Another inspiration for Dadd is setting up nodes in a cluster. Often it would be really easy to just install a couple of Python packages, but most solutions are either too manual or require a specific image in order to use things like Chef, Puppet, etc. With Dadd, you could pretty easily write a script to install and run it on a node and then let it do the rest regarding setting up an environment and running some code.

But, for the moment, if you have code you run by copying some files, Dadd works really well.

ErrorEmail

ErrorEmail was written specifically for Dadd. When you have a script to run and you want a nice traceback email when things fail, give ErrorEmail a try. It doesn’t do any sort of rate limiting and the server config is extremely basic, but sometimes you don’t want to install a bunch of packages just to send an email on an error.

When you can’t install django or some other framework for an application, you can still get nice error emails with ErrorEmail.

CacheControl

The CacheControl 0.10.6 release includes support for calling close on the cache implementation. This is helpful when you are using a cache via some client (e.g. Redis) and that client needs to safely close its connection.
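Here is a hedged sketch of where that matters, assuming the Redis-backed cache that ships with CacheControl; the cache wraps a Redis client, and close gives you a clean way to release that connection:

import redis
import requests

from cachecontrol import CacheControl
from cachecontrol.caches.redis_cache import RedisCache

cache = RedisCache(redis.Redis(host='localhost'))
session = CacheControl(requests.Session(), cache=cache)

session.get('http://example.com/')
cache.close()  # lets the underlying Redis client shut down cleanly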

Ugly Attributes

At some point in my programming career I recognized that Object Oriented Programming is not all it’s cracked up to be. It can be a powerful tool, especially in a statically typed language, but in the grand scheme of managing complexity, it often falls short of the design ideals that we were taught in school. One area where this becomes evident is object attributes.

Attributes are just variables that are “attached” to an object. This simplicity, unfortunately, means attributes require a good deal more complexity to manage in a system. The reason is that languages do not provide any tools to enforce the boundaries that an attribute appears to provide.

Let’s look at a simple example.

class Person(object):

    def __init__(self, age):
        self.age = age

We have a simple Person object. We want to be able to access the person’s age by way of an attribute. The first change we’ll want to make is to make this attribute a property.

from datetime import datetime

class Person(object):
    def __init__(self, year, month, day):
        self.year = year
        self.month = month
        self.day = day

    @property
    def age(self):
        age = datetime.now() - datetime(self.year, self.month, self.day)
        return age.days / 365

So far, this feels pretty simple. But let’s get a little more realistic and presume that this Person is not a naive object but one that talks to a RESTful service in order to get its values.

A Quick Side Note

Most of the time you’d see a database and an ORM for this sort of code. If you are using Django or SQLAlchemy (and I’m sure other ORMs are the same) you’d see something like:

user = User.query.get(id)

You might have a nifty function on your model that calculates the age. That is, until you realize you stored your data in a non-timezone-aware date field, and now that your company has started supporting Europe, some folks are complaining that they are turning 30 a day earlier than they expected...

The point is that ORMs do an interesting thing that is your only logical choice if you want to ensure your attribute access is consistent with the database: ORMs MUST create new instances for each query and provide a sync method or function to ensure instances are updated. Sure, they might have an eager commit mode or something, but Stack Overflow will most likely provide plenty of examples where this falls down.

I’d like to keep this reality in mind moving forward as it presents a fact of life when working with objects that is important to understand as your program gets more complex.

Back to Our Person

So, we want to make this Person object use a RESTful service as our database. Let’s change how we load the data.

class Person(ServiceModel):
    # We inherit from some ServiceModel that has the machinery to
    # grab our data form our service.

    @classmethod
    def by_id(cls, id):
        doc = conn.get('people', id=id).pop()
        return cls(**doc)

    @property
    def age(self):
        age = datetime.now() - datetime(self.year, self.month, self.day)
        return age.days / 365

    # This would probably be implemented in the ServiceModel, but
    # I'll add it here for clarity.
    def __getattr__(self, name):
        if name in self.doc:
            return self.doc[name]
        raise AttributeError('%s is not in the resource.' % name)

Now, assuming we get a document that has a year, month, and day, our age property would still work.

So far, this all feels pretty reasonable. But what happens when things change? Fortunately, in the age use case, people rarely change their birth date. But, unfortunately, we do have pesky time zones that we didn’t want to think about when we had 100 users and everyone lived on the west coast. The “least viable product” typically doesn’t afford thinking that far ahead, so these are issues you’ll need to deal with after you have a lot of code.

Also, the whole point of all this work has been to support an attribute on an object. We haven’t sped anything up. These are not new features. We haven’t even done something clever with metaclasses or generators! The reality is that you’ve refactored your code four or five times to support a single call in a template.

{{ person.age }}

Let’s take a step back for a bit.

Taking a Step Back

Do not feel guilty for going down this rabbit hole. I’ve taken the trip hundreds of times! But maybe it is time to reconsider how we think about object oriented design.

When we think back to when we were just learning OO, there was a zoo. In this zoo we had the mythical Animal class. We’d have new animals show up at the zoo. We’d get a Lion, Tiger, and Bear, and they would all need to eat. This modeling feels so right it can’t be wrong! And in many ways it isn’t.

If we take a step back, there might be a better way.

Let’s first acknowledge that our Animal does need to eat. But let’s really think about what that means to our zoo. The Animals will eat, but so will the Visitors. I’m sure the Employees would like to have some food now and then as well. The reason we want to know about all this sustenance is that we need to Order food and track its cost. If we reconsider this in the code, what if, and this is a big what if, we didn’t make eat a method on some class? What if we passed our object to our eat function?

eat(Person)

While that looks cannibalistic at first, we can reconsider our original age method as well.

age(Person)

And how about our Animals?

age(Lion)

Looking back at our issues with time zones, because our zoo has grown and people come from all over the world, we can even update our code without much trouble.

age(adjust_for_timezones(Person))

Assuming we’re using imports, here is a more realistic refactoring.

from myapp.time import age

age(Lion)

Rather than rewriting all our age calls for timezone awareness, we can change our myapp/time.py.

from datetime import datetime

def age(obj):
    delta = datetime.utcnow() - adjust_for_timezones(obj.birthday())
    return delta.days / 365

In this idealized world, we haven’t thrown out objects completely. We’ve simply adjusted how we use them. Our age function depends on a birthday method. This might be a mixin class we use with our models. We could also still have our classic Animal base class. Age might even be relative, where you’d want to know how old an Animal is in “person years”. We might create a time.animal.age function that has slightly different requirements.

In any case, by reconsidering our object oriented design, we can remove quite a bit of code related to ugly attributes.

The Real World Conclusions

While it might seem obvious now how to implement a system using these ideas, it requires a different set of skills. Naming things is one of the two hard things in computer science. In dynamic languages, we don’t have obvious design patterns for grouping functions in a way that makes the expectations clear. Our age function above would likely need some check to ensure that the object has a birthday method, and you wouldn’t want every age call to be wrapped in a try/except.

You also wouldn’t want to be too limiting on type, especially in a dynamic language like Python (or Ruby, JavaScript, etc.). Even though there has been some rumbling about type hints in Python that seems reasonable, right now you have to decide how you want to communicate that some function foo expects an object of type Bar or one that has a method baz. These are trivial problems at a technical level, but socially, they are difficult to enforce without formal language support.

There are also some technical issues to consider. In Python, function calls can be expensive. Each function call requires its own stack frame, so many small nested functions, while well designed, can become slow. There are tools to help with this, but again, it is difficult to make this style obvious over time.

There is never a panacea, but it seems there is still room for OO design to grow and change. Functional programming, while elegant, is pretty tough to grok, especially when you have dynamic language code sitting in your editor, allowing you to mutate everything under the sun. Still, there are some powerful themes in functional programming that can make your object oriented code more helpful in managing complexity.

Finally

Programming is really about layering complexity. It is taking concepts and modeling them in a language that computers can take and, eventually, consider in terms of voltage. As we model our systems we need to consider the data vs. the functionality, which means avoiding ugly attributes (and methods) in favor of orthogonal functionality that respects the design inherent in the objects.

It is not easy by any stretch, but I believe by adopting the techniques mentioned above, we can move past the kludgy parts of OO (and functional programming) into better designed and more maintainable software.

Functional Programming in Python

While Python doesn’t natively support some essential traits of an actual functional programming language, it is possible to use a functional style (rather than object oriented) to write programs. What makes it hard is that some of the constraints functional programming requires must be enforced manually.

First off, let’s talk about what Python does well that makes functional programming possible.

Python has first class functions, which allow passing a function around the same way that you’d pass around a normal variable. First class functions make it possible to do things like currying and complex list processing. Fortunately, the standard library provides the excellent functools module. For example:

>>> from functools import partial
>>> def add(x, y): return x + y
>>> add_five = partial(add, 5)
>>> map(add_five, [10, 20, 30])
[15, 25, 35]

The next critical functional tool that Python provides is iteration. More specifically, Python generators provide a way to process data lazily. Generators allow you to create functions that produce data on demand rather than forcing the creation of an entire set. Again, the standard library provides some helpful tools via the itertools module.

>>> from itertools import count, imap, islice
>>> nums = islice(imap(add_five, count(10, 10)), 0, 3)
>>> nums
<itertools.islice object at 0xb7cf6dc4>
>>> nums.next()
15
>>> nums.next()
25
>>> nums.next()
35

In this example each of the functions only calculates and returns a value when it is required.

Python also has other functional concepts built in, such as list comprehensions and decorators, that when used with first class functions and generators make programming in a functional style feasible.

Where Python does not make functional programming easy is in dealing with immutable data. In Python, critical core datatypes such as lists and dicts are mutable. In functional languages, all variables are immutable. The result is you often create values based on some initial immutable variable by applying functions to it.

(defn add-markup [price]
  (+ price (* .25 price)))

(defn add-tax [total]
  (+ total (* .087 total)))

(defn get-total [initial-price]
  (add-tax (add-markup initial-price)))

In each of the steps above, the argument is passed by value and can’t be changed. When you need the total described by get-total, rather than storing it in a variable, you’d often just call the get-total function again. Typically a functional language will optimize these calls. In Python we can mimic this by memoizing the result.

import functools
import operator

def memoize(f):
    cache = {}
    @functools.wraps(f)
    def wrapper(*args, **kw):
        key = (args, tuple(sorted(kw.iteritems())))
        if key not in cache:
            cache[key] = f(*args, **kw)
        return cache[key]
    return wrapper

@memoize
def factorial(num):
    return reduce(operator.mul, range(1, num + 1))

Now, calls to the function will re-use previous results without having to execute the body of the function.

Another pattern seen in functional languages such as LISP is to re-use a core data type, such as a list, as a richer object. For example, association lists act like dictionaries in Python, but they are essentially still just lists with functions that let you access them like a dictionary, looking up arbitrary keys. In other functional languages such as Haskell or Clojure, you create actual types, similar to a struct, to communicate more complex type information.

Obviously in Python we have actual objects. The problem is that objects are mutable. In order to make sure we’re using immutable types we can use Python’s immutable data type, the tuple. What’s more, we can replicate richer types by using a named tuple.

from collections import namedtuple

User = namedtuple('User', ['name', 'email', 'password'])

def update_password(user, new_password):
    return User(user.name, user.email, new_password)
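For what it’s worth, namedtuple also provides a _replace helper that returns a new tuple with the given fields swapped out, which reads even closer to the functional style:

def update_password(user, new_password):
    # _replace never mutates; it builds and returns a new User
    return user._replace(password=new_password)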

I’ve found that using named tuples often helps close the mental gap of going from object oriented to a functional style.

While Python is most definitely not a functional language, it has many tools that make using a functional paradigm possible. Functional programming can be a powerful model to consider as there are a whole class of bugs that disappear in an immutable world. Functional programming is also a great change of pace from the typical object oriented patterns you might be used to. Even if you don’t refactor all your code to a functional style, there is much to learn, and fortunately, Python makes it easy to get started in a language you are familiar with.

Parallel Processing

It can be really hard to work with data programmatically. There is a moment, when working with a large dataset, where you realize you need to process the data in parallel. As a programmer, this sounds like it could be a fun problem, and in many cases it is fun to get all your cores working hard crunching data.

The problem is that parallel processing is never purely a matter of distributing your work across CPUs. The hard part ends up being getting the data organized before sending it to your workers and doing something with the results. Tools like Hadoop boast about processing terabytes of data, but that’s a little misleading because there is most likely a ton of code on either end of that processing.

The input and output code (I/O) can also have a big impact on the processing itself. The input often needs to consider what the atomic unit is as well as what the “chunk” of data needs to be. For example, if you have 10 million tiny messages to process, you probably want to batch them into chunks of 5000 messages when sending them to your worker nodes, and the workers will need to know they are getting a chunk of messages vs. a single message. Similarly, for some applications the message-to-chunk ratio needs to be tweaked.
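A small sketch of that batching step (the chunk size of 5000 comes from the example above; everything else is an assumption):

from itertools import islice

def chunked(messages, size=5000):
    """Yield lists of `size` messages until the iterable is exhausted."""
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch

# Each worker then receives a chunk of messages rather than a single message.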

In Hadoop this sort of detail can be dealt with via HDFS, but Hadoop is not trivial to set up, not to mention if you have a bunch of data that doesn’t live in HDFS. The same goes for the output: when you are done, where does it go?

The point is that “data” always tends toward specificity. You can’t abstract away data. Data always ends up being physical at its core. Even if the processing happens in parallel, the I/O will always be a challenging constraint.

View Source

I watched a bit of this fireside chat with Steve Jobs. It was pretty interesting to hear Steve discuss the availability of the network and how it changes the way we can work. Specifically, he mentioned that because of NFS (presumably in the BSD family of unices), he could share his home directory on every computer he worked on without ever having to think about backups or syncing his work.

What occurred to me was how much of the software we use is taken for granted. Back in the day, an educational license for Unix was around $1800! I can only imagine the difficulties of becoming a software developer back then, when all the interesting tools like databases or servers were prohibitively expensive.

It reminds me of when I first started learning about HTML and web development. I could always view the source to see what was happening. It became an essential part of how I saw the web and programming in general. The value of software was not only in its function, but in its transparency. The ability to read the source and support myself as a user allowed me the opportunity to understand why the software was so valuable.

When I think about how difficult it must have been to become a hacker back in the early days of personal computing, it is no wonder that free software and open source became so prevalent. These early pioneers had to learn the systems without reading the source! Learning meant reading through incomplete, poorly written manuals. When the manual was wrong or out of date, I can only imagine the hair pulling that must have occurred. The natural solution to this problem was to make the source available.

The process of programming is still very new and very detailed, while still being extremely generic. We are fortunate as programmers that the computing landscape was not able to enclose software development within proprietary walls like so many other technical fields. I’m very thankful I can view the source!

Property Pattern

I’ve found myself doing this quite a bit lately and thought it might be helpful to others.

Often when I’m writing some code I want to access something as an attribute, even though it comes from some service or database. For example, say we want to download a bunch of files from some service and store them on our file system for processing.

Here is what we’d like the processing code to look like:

def process_files(self):
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

We don’t really care what the filter_and_store method does. What we do care about is the downloaded_files attribute.

Let’s step back and see what the calling code might look like:

processor = MyServiceProcessor(conn)
processor.process_files()

Again, this is pretty simple, but now we have a problem: when do we actually download the files and store them on the filesystem? One option would be to do something like this in our process_files method:

def process_files(self):
    self.downloaded_files = self.download_files()
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

While it may not seem like a big deal, we just created a side effect. The downloaded_files attribute is getting set in the process_files method. There is a good chance the downloaded_files attribute is something you’d want to reuse, and this creates an odd coupling between the process_files method and the downloaded_files attribute.

Another option would be to do something like this in the constructor:

def __init__(self, conn):
    self.downloaded_files = self.download_files()

Obviously, this is a bad idea. Anytime you instantiate the object it will seemingly try to reach out across some network and download a bunch of files. We can do better!

Here are some goals:

  1. keep the API simple by using a simple attribute, downloaded_files
  2. don’t download anything until it is required
  3. only download the files once per-object
  4. allow injecting downloaded values for tests

The way I’ve been solving this recently has been to use the following property pattern:

class MyServiceProcessor(object):

    def __init__(self, conn):
        self.conn = conn
        self._downloaded_files = None

    @property
    def downloaded_files(self):
        if not self._downloaded_files:
            self._downloaded_files = []
            tmpdir = tempfile.mkdtemp()
            for obj in self.conn.resources():
                self._downloaded_files.append(obj.download(tmpdir))
        return self._downloaded_files

    def process_files(self):
        result = []
        for fn in self.downloaded_files:
            result.append(self.filter_and_store(fn))
        return result

Say we wanted to test our process_files method. It becomes much easier.

def setup(self):
    self.test_files = os.listdir(os.path.join(HERE, 'service_files'))
    self.conn = Mock()
    self.processor = MyServiceProcessor(self.conn)

def test_process_files(self):
    # Just set the property variable to inject the values.
    self.processor._downloaded_files = self.test_files

    assert len(self.processor.process_files()) == len(self.test_files)

As you can see, it was really easy to inject our stub files. We know that we don’t perform any downloads until we have to. We also know that the downloads are only performed once.

Here is another variation I’ve used that doesn’t require setting up _downloaded_files in the constructor:

@property
def downloaded_files(self):
    if not hasattr(self, '_downloaded_files'):
        ...
    return self._downloaded_files

Generally, I prefer the explicit _downloaded_files attribute in the constructor, as it allows more granularity when setting a default value. You can set it to an empty list, for example, which helps to communicate that the property will need to return a list.

Similarly, you can set the value to None and ensure that when the attribute is accessed, the value becomes an empty list. This small differentiation helps to make the API easier to use. An empty list is still iterable while still being “falsey”.

This technique is nothing technically interesting. What I hope someone takes from this is how you can use it to write clearer code and encapsulate your implementation while exposing a clear API between your objects. Even if you don’t publish a library, keeping your internal object APIs simple and communicative helps make your code easier to reason about.

One caveat is that this method can add a lot of small property methods to your classes. There is nothing wrong with this, but it might give a reader of your code the impression that the classes are complex. One way to combat this is to use mixins.

class MyWorkerMixinProperties(object):

    def __init__(self, conn):
        self.conn = conn
        self._categories = None
        self._foo_resources = None
        self._names = None

    @property
    def categories(self):
        if not self._categories:
            self._categories = self.conn.categories()
        return self._categories

    @property
    def foo_resources(self):
        if not self._foo_resources:
            self._foo_resources = self.conn.resources(name='foo')
        return self._foo_resources

    @property
    def names(self):
        if not self._names:
            self._names = [r.meta()['name'] for r in self.foo_resources]
        return self._names



class MyWorker(MyWorkerMixinProperties):

    def __init__(self, conn):
        MyWorkerMixinProperties.__init__(self, conn)

    def run(self):
        for resource in self.foo_resources:
            if resource.category in self.categories:
                self.put('/api/foos', {
                    'real_name': self.names[resource.name_id],
                    'values': self.process_values(resource.values),
                })

This is a somewhat contrived example, but the point is that we’ve taken all our service based data and made it accessible via normal attributes. Each service request is encapsulated in a function, while our primary worker class has a reasonably straightforward implementation of some algorithm.

The big win here is clarity. You can write an algorithm by describing what it should do. You can then test the algorithm easily by injecting the values you know should produce the expected results. Furthermore, you’ve decoupled the algorithm from the I/O code, which is typically where you’ll see a good deal of repetition in the case of RESTful services or optimization when talking to databases. Lastly, it becomes trivial to inject values for testing.

Again, this isn’t rocket science. It is a really simple technique that can help make your code much clearer. I’ve found it really useful and I hope you do too!

Iterative Code Cycle

TDD prescribes a simple process for working on code.

  1. Write a failing test
  2. Write some code to get the test to pass
  3. Refactor
  4. Repeat

If we consider this cycle more generically, we see a typical cycle every modern software developer must use when writing code.

  1. Write some code
  2. Run the code
  3. Fix any problems
  4. Repeat

In this generic cycle you might use a REPL, a standalone script, a debugger, etc. to quickly iterate on the code.

Personally, I’ve found that I use a test for this iteration because it is integrated into my editor. The benefit of using my test suite is that I often have a repeatable test when I’m done that proves (to some level of confidence) the code works as I expect it to. It may not be entirely correct, but at least it codifies that I think it should work. When it does break, I can take a more TDD-like approach: fix the test, which makes it fail, and then fix the actual bug.

The essence of any developer’s work, then, is to make this cycle as quick as possible, no matter what tool you use to run and re-run your code. The process should be fluid and help get you in the flow when programming. If you do use tests for this process, they can also be a helpful design tool. For example, if you are writing a client library for some service, you can write the idealistic API you’d like to have without letting the implementation drive the design.

TDD has been on my mind lately because I’ve written a lot of code recently and have questioned whether or not my testing patterns have truly been helpful. They have been helpful in fixing bugs and provide a quick coding cycle. I’d argue the code has been improved, but at the same time, I do wonder if by making things testable I’ve introduced more abstractions than necessary. I’ve had to look back on some code that used these patterns and getting up to speed was somewhat difficult. At the same time, anytime you read code you need to put in effort to understand what is happening. Often I’ll assume that if code doesn’t immediately convey exactly what is happening, it is terrible code. The reality is that code is complex and takes effort to understand; it should be judged on how reasonable it is to fix once it is understood. In this way, I believe my test based coding cycle has proven itself to be valuable.

Obviously, the next person to look at the code will disagree, but hopefully once they understand what is going on, it won’t be too bad.