Todo Lists

With a baby on the way, I’ve started reconsidering how I keep track of my TODO list.

Seeing as I spend an inordinate amount of time in Emacs, org-mode is my go-to tool when it comes to keeping track of my life. I can keep notes, track time and customize my environment to support a workflow that works for me. The problem is that life happens outside of Emacs. Shocking, I know.

So, my goal is to have a system that integrates well with org-mode and Emacs, while still allowing me to use my phone, calendar, etc. when I’m away from my computer. Also, seeing as Emacs doesn’t provide an obvious, effective calendaring solution the way GMail does, I want to be sure I’m able to schedule things so I do them at specific times and have reminders.

With that in mind, I started looking at org-mobile. It seems like the perfect solution. It is basically a way to edit an org file on my phone and will (supposedly) sync deadlines and scheduled items to my calendar as one would expect. Sure, the UI could use some work, but having to know the date vs. picking it from a slick dialog on my phone seemed like a reasonable way to know what date it was...

Unfortunately, it didn’t work. I had one event synced to my Google calendar, but that was the end of that. It didn’t add anything to my calendar no matter the settings. That is a deal breaker.

I’m currently starting to play with org-trello instead. I’m confident I can make this work for two reasons.

  1. The mobile app is nice
  2. The sync’ing in Emacs is nice

What doesn’t work (yet?) is adding deadlines or scheduled items to my calendar, but seeing as this new year I’m resolving to slow down, I’m going to stop trying to over-optimize and just add stuff to my calendar. It is a true revelation, I know.

One thing I did consider was skipping the computer altogether and using a physical planner. The problems with a planner, for me, are:

  1. My handwriting is atrocious
  2. It doesn’t sync to my calendar

In addition to trying to understand my handwriting, I’d have to develop a habit to always look at it. I can’t see it happening when I can get my phone to annoy me as needed.

This effort is part of a larger plan to use some of the tactics from GTD to get tasks off of my mental plate and put them somewhere useful. So far, it has been sticking better than it ever has, so I’m hopeful this could be the time it becomes a real habit. Wish me luck, as I’m sure I’ll need it!

CacheControl 0.11.0 Released

Last night I released CacheControl 0.11.0. The big change this time around is using compressed JSON to store the cached response rather than a pickled dictionary. Using pickle caused problems if a cache was going to be read by both Python 3.x and Python 2.x clients.

Another benefit is that avoiding pickle also avoids exec’ing potentially dangerous code. It is not unreasonable that someone could include a header that could cause problems. This hasn’t happened yet, but it wouldn’t surprise me if it were feasible.

Finally, the size of the cached object should be a little smaller. Generally, responses are not going to be that large, but it should help if you use storage that keeps hot keys in memory. MongoDB comes to mind, along with Memcached and, probably, Redis. If you have been avoiding caching large objects, it could also be valuable. For example, a large sparse CSV response might compress quite a bit, making it reasonable to cache.
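
The core idea is straightforward; here is a minimal sketch of the approach (not CacheControl’s actual serializer, just the json + zlib concept):

import json
import zlib


def dumps(response_data):
    # Serialize the response dict as JSON (readable from both Python 2
    # and 3) and compress it before handing it to the cache store.
    return zlib.compress(json.dumps(response_data).encode('utf-8'))


def loads(blob):
    # Reverse the process when reading the cached value back out.
    return json.loads(zlib.decompress(blob).decode('utf-8'))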

I haven’t done any conclusive tests regarding the actual size impact of compression, so these are just my theories. Take them with a grain of salt or let me know your experiences.

Huge thanks to Donald Stufft for sending in the compressed JSON patches, as well as all the folks who have submitted other suggestions and pull requests.

The Future

I’ve avoided making any major changes to CacheControl as it has been reasonably stable and flexible. There are some features that others have requested that have been too low on my own personal priorities to take time to implement.

One thing I’ve been tempted to do is add more storage implementations. For example, I started working on a SQLite store. My argument, to myself at least, was that the standard library supports SQLite, which makes it a reasonable target.

I decided to stop that work as it didn’t really seem very helpful. What did seem enticing was the idea of a cache store becoming queryable. Unfortunately, since the cache store API only gets a key and a blob for the value, any cache store author would have to unpack the blob in order to read any values it is interested in.

In the future I’ll be adding a hook system to let a cache store author have access to the requests.Response object in order to create extra arguments for setting the value.

For example, in Redis, you can set an expires time that the DB will use to expire the response automatically. The cache store might then have an extra method that looks like this:

class RedisCache(BaseCache):

    def on_cache_set(self, response):
        kwargs = {}
        if 'expires' not in response.headers:
            return kwargs

        return {'expires': response.headers['expires']}

    def set(self, key, value, expires=None):
        # Set the value accordingly, passing any expiration on to Redis
        # (e.g. via SETEX) when one is provided.
        pass

I’m not crazy about this API, as it is a little confusing to communicate that creating an on_cache_set hook is really a way to edit the arguments sent to the set method. Maybe calling it a hook is the wrong term. Maybe it should be called prepare and it should explicitly call set. If anyone has thoughts, please let me know!
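
To make the flow concrete, here is a hedged sketch of how the caching layer might wire the hook into set. The cache_response function and its call site are hypothetical, not CacheControl’s actual internals:

def cache_response(cache, key, response, blob):
    # Ask the store for any extra keyword arguments it wants, based on
    # the full response (e.g. an 'expires' value Redis can act on).
    extra = {}
    if hasattr(cache, 'on_cache_set'):
        extra = cache.on_cache_set(response) or {}

    # The extra arguments are simply forwarded to set(), which is why
    # "hook" may be the wrong word; it is really argument preparation.
    cache.set(key, blob, **extra)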

The reasoning behind the hook system is that I’d like to remove the Redis cache and start a new project for CacheControl stores that includes common cache store implementations. At the very least, I’d like to find some good implementations that I can link to from the docs to help folks find a path from a local file cache to a real database when the time comes.

Lastly, there are a couple spec related details that could use some attention that I’ll be looking at in the meantime.

Replacing Monitors

I just read a quick article on Microsoft’s new VR goggles. The idea of layering virtual interfaces on top of the real world seems really cool and even practical in some use cases. What seems really difficult is how an application will understand the infinite number of visual environments in order to effectively and accurately render the interfaces. Hopefully the SDK for the device includes a library that provides for different real world elements, like placeOnTop(vObject, x, y, z), where it can recognize some object in the room and make that object available as a platform. In other words, it sees the coffee table and makes it available as an object that you can put something on top of.

The thing is, what I’d love to see is VR replacing monitors! Every year TVs and monitors get upgrades in size and quality, yet the prices rarely drop very radically. Right now I have a laptop and an external monitor that I use at home. I’d love to get rid of both tiny screens and just look into space and see my windows on a huge, virtual, flat surface.

An actual computer would still be required and there wouldn’t necessarily need to be huge changes to the OS at first. The goggles would just be another big screen that would take the rasterized screen and convert it to something seen in the analog world. Sure, this would probably ruin a few eyes at first, but having a huge monitor measured in feet vs. inches is extremely enticing.

DevOps

I finally realized why DevOps is an idea. Up until this point, I thought DevOps was a term for a developer who was also responsible for the infrastructure. In other words, I never associated DevOps with an actual strategy or idea; instead, it was simply something that happened. Well, no more!

DevOps is providing developers keys [1] to operations. In a small organization, these keys never have a chance to leave the hands of the small team of developers, who have nothing to concern themselves with except getting things done. As an organization grows, there becomes a dedicated person (and soon after, a group of people) dedicated to maintaining the infrastructure. What happens is that the keys developers had, to log into any server or install a new database, are taken away and given to operations to manage. DevOps is a trend where operations and developers share the keys.

Because developers and operations both have access to change the infrastructure, there is a meeting of the minds that has to happen. Developers and Ops are forced to communicate the what, where, when, why and how of changes to the infrastructure. Since folks in DevOps are familiar with code, version control becomes a place of communication and cohesion.

The reason I now understand this paradigm more clearly is that when a developer doesn’t have access to the infrastructure, it is a huge waste of time. When code doesn’t work, we need to be able to debug it. It is important to come up with theories for why things don’t work and iteratively test those theories until we find the reason for the failure. While it is possible to debug bugs that only show up in production, it can be slow, frustrating and difficult when access to the infrastructure isn’t available.

I say all this with a huge grain of salt. I’m not a sysadmin and I’ve never been a true sysadmin. I understand the hand-wavy idea that if all developers have ssh keys to the hosts in a datacenter, there are more vectors for attack. What I don’t understand is why a developer with ssh keys is any more dangerous than a sysadmin having ssh keys. Obviously, a sysadmin may have a more stringent outlook on what is acceptable, but at the same time, anyone can be duped. After all, if you trust your developers to write code that writes to your database millions (or billions!) of times a day, I’m sure you can trust them to keep an ssh key safe and avoid exposing services that are meant to remain private.

I’m now a full on fan of DevOps. Developers and Ops working together and applying software engineering techniques to everything they work with seems like a great idea. Providing Developers keys to the infrastructure and pushing Ops to communicate the important security and/or hardware concerns is only a good thing. The more cohesion between Ops and Dev, the better.

[1]I’m not talking about ssh keys, but rather the idea of having “keys to the castle”. One facet of this could be making sure developer keys and accounts are available on the servers, but that is not the only way to give devs access to the infrastructure.

Logging

Yesterday I fixed an issue in dadd where logs from processes were not being correctly sent back to the master. My solution ended up being a rather specific process of opening the file that would contain the logs and ensuring that any subprocesses used this file handle.

Here is the essential code annotated:

# This daemonizes the code. It can accept stdin/stdout parameters
# that I had originally used to capture output. But, the file used for
# capturing the output would not be closed or flushed and we'd get
# nothing. After this code finishes we do some cleanup, so my logs were
# empty.
with daemon.DaemonContext(**kw):

    # Just watching for errors. We pass in the path of our log file
    # so we can upload it for an error.
    with ErrorHandler(spec, env.logfile) as error_handler:
        configure_environ(spec)

        # We open our logfile as a context manager to ensure it gets
        # closed, and more importantly, flushed and fsync'd to the disk.
        with open(env.logfile, 'w+') as output:

            # Pass the file handle to our worker, which will start some
            # subprocesses that we want to know the output of.
            worker = PythonWorkerProcess(spec, output)

            # printf => print to file... I'm sure this will get
            # renamed in order to avoid confusion...
            printf('Setting up', output)
            worker.setup()
            printf('Starting', output)
            try:
                worker.start()
            except:
                import traceback

                # Print our traceback in our logfile
                printf(traceback.format_exc(), output)

                # Upload our log to our dadd master server
                error_handler.upload_log()

                # Raise the exception for our error handler to send
                # me an email.
                raise

            # Wrapping things up
            printf('Finishing', output)
            worker.finish()
            printf('Done', output)

Hopefully, the big question is, “Why not use the logging module?”

When I initially hacked the code, I just used print and had planned on letting the daemon library capture logs. That would make it easy for the subprocesses (scripts written by anyone) to get logs. Things were out of order though, and by the time the logs were meant to be sent, the code had already cleaned up the environment where the subprocesses had run, including deleting the log file.

My next step then was to use the logging module.

Logging is Complicated!

I’m sure it is not the intent of the logging module to be extremely complex, but the fact is, managing handlers, loggers and levels across a wide array of libraries and application code gets unwieldy fast. I’m not sure people run into this complexity that often, as it is easy to use basicConfig and be done with it. As an application scales, logging becomes more complicated and, in my experience, you either explicitly log to syslog (via the syslog module) or to stdout, where some other process manager handles the logs.

But, in the case where you do use logging, it is important to understand some essential complexity you simply can’t ignore.

First off, configuring loggers needs to be done early in the application. When I say early, I’m talking about at import time. The reason is that libraries, which should try to log intelligently, might configure the logging system as soon as they are imported.
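
As a rough sketch of what “early” looks like in practice (the entry point and module names are made up for illustration):

# entry point (e.g. a manage.py or wsgi.py style module):
# configure logging before importing anything that logs.
import logging.config

logging.config.dictConfig({
    'version': 1,
    # Keep any loggers libraries may have already created.
    'disable_existing_loggers': False,
    'formatters': {
        'plain': {'format': '%(asctime)s %(name)s %(levelname)s %(message)s'},
    },
    'handlers': {
        'stdout': {'class': 'logging.StreamHandler', 'formatter': 'plain'},
    },
    'root': {'handlers': ['stdout'], 'level': 'INFO'},
})

# Only now import code that might touch logging at import time, e.g.:
# import myapp.web  (hypothetical module that logs during import)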

Secondly, the propagation of the different loggers needs to be explicit, and again, some libraries / frameworks are going to do it wrong. By “wrong”, I mean that the assumptions the library author makes don’t align with your application. In dadd, I’m using Flask. Flask comes with a handy app.logger object that you can use to write to the log. It has a specific formatter as well, which makes messages really loud in the logs. Unfortunately, I couldn’t use this logger because I needed to reconfigure the logs for a daemon process. The problem was that this daemon process lives in the same repo as my main Flask application. If my daemon logging code gets loaded, which is almost certain to happen, it reconfigures the logging module, including Flask’s handy app.logger object. It was frustrating to test logging in my daemon process only to find my Flask logs had disappeared. When I got them back, I ended up seeing things show up multiple times because different handlers had been attached that used the same output, which leads me to my next gripe.

The logging module is opaque. It would be extremely helpful to be able to inject, at some point in your code, a pprint(logging.current_config) that would show the current config at that point in the code. That way, you could intelligently update the config with tools like logging.config.dictConfig, editing the current config or using incremental and disable_existing_loggers correctly.
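
There is no logging.current_config, but you can approximate one. Here is a rough sketch that dumps what the logging module currently knows about loggers and handlers:

import logging
from pprint import pprint


def dump_logging_state():
    # The root logger plus every named logger the manager knows about.
    loggers = {'': logging.getLogger()}
    loggers.update(logging.Logger.manager.loggerDict)

    state = {}
    for name, logger in loggers.items():
        # PlaceHolder entries exist for dotted prefixes that were never
        # configured directly; they carry no handlers or levels.
        if isinstance(logger, logging.PlaceHolder):
            continue
        state[name or 'root'] = {
            'level': logging.getLevelName(logger.level),
            'propagate': logger.propagate,
            'handlers': [type(h).__name__ for h in logger.handlers],
        }
    pprint(state)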

Logging is Great

I’d like to make it clear that I’m a fan of the logging module. It is extremely helpful as it makes logging reliable and can be used in a multithreaded / multiprocessing environment. You don’t have to worry about explicitly flushing the buffer or fsync’ing the file handle. You have an easy way to configure the output. There are excellent handlers that help you log intelligently, such as the RotatingFileHandler, WatchedFileHandler and SysLogHandler. Many libraries also allow turning up the log level to see more deeply into what they are doing. Requests and urllib3 do a pretty decent job of this.

The problem is that controlling output is a different problem than controlling logging, yet they are intertwined. If you find it difficult to add some sort of output control to your application and the logging module seems to be causing more problems than it is solving, then don’t use it! The technical debt you take on for a small, customized output control system is extremely low compared to the hoops you might need to jump through in order to mold logging to your needs.

With that said, learning the logging module is extremely important. Django provides a really easy way to configure logging, and you can be certain that it gets loaded early enough in the process that you can rely on it. Flask and CherryPy (and I’m sure others) provide hooks into their own loggers that are extremely helpful. Finally, basicConfig is a great tool to start logging in standalone scripts that need to differentiate between DEBUG statements and INFO. Just remember, if things get tough and you feel like you’re battling logging, you might have hit the edges of its valid use cases and it is time to consider another strategy. There is no shame in it!
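
For the standalone script case, a minimal sketch might look like this (the script name is arbitrary):

import argparse
import logging

log = logging.getLogger('myscript')  # hypothetical script name


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--debug', action='store_true')
    args = parser.parse_args()

    # basicConfig is enough to separate DEBUG chatter from INFO output.
    logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)

    log.debug('only shown with --debug')
    log.info('always shown')


if __name__ == '__main__':
    main()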

Build Tools

I’ve recently been creating more new projects for small libraries / apps and it’s got me thinking about build tools. First off, “build tools” is probably a somewhat misleading name, yet I think it is most often associated with the type of tools I’m thinking of. Make is the big one, but there is a whole host of tools like Make on almost every platform.

One of the apps I’ve been working on is a Flask application. For the past year or so, I’ve been working on a Django app alongside some CherryPy apps. The interesting thing about these different frameworks is their built-in integration with some sort of build tool.

CherryPy essentially has no integration. If you’re unfamiliar with CherryPy, I’d argue it is the un-opinionated web framework, so it shouldn’t be surprising that there are no tools to stub out directories, provide helpers for connecting to the DB or start a shell with the framework code loaded.

Flask is similar to CherryPy in that it is a microframework, but the community has been much more active in providing plugins (Blueprints, in Flask terms) that provide more advanced functionality. One such plugin mimics Django’s manage.py file, which provides a typical build tool and project helpers.

Django, as I just mentioned, provides a manage.py file that adds some project helpers and, arguably, functions as a generic build tool.

I’ve become convinced that every project should have some sort of build tool in place that codifies how to work with the project. The build tool should automate how to build and release the software, along with how that process interacts with the source control system. The build tool should provide helpers, where applicable, to aid in development and debugging. Finally, the build tool should help in running the app’s processes and/or supporting processes (i.e. making sure a database is up and running).
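
Even without a framework, a small script can codify these tasks. Here is a hedged sketch of the kind of entry point I have in mind; the task names and shell commands are purely illustrative:

#!/usr/bin/env python
"""A manage.py-style task runner: one place that codifies how to work on the project."""
import argparse
import subprocess


def run(cmd):
    print('+ %s' % cmd)
    subprocess.check_call(cmd, shell=True)


def test(args):
    run('py.test tests/')


def release(args):
    # Codify the build/release steps and how they touch source control.
    run('git tag v%s' % args.version)
    run('python setup.py sdist upload')


def serve(args):
    run('python -m myapp.server')  # hypothetical app module


def main():
    parser = argparse.ArgumentParser()
    parser.set_defaults(func=lambda args: parser.print_help())
    sub = parser.add_subparsers(dest='command')
    sub.add_parser('test').set_defaults(func=test)
    rel = sub.add_parser('release')
    rel.add_argument('version')
    rel.set_defaults(func=release)
    sub.add_parser('serve').set_defaults(func=serve)

    args = parser.parse_args()
    args.func(args)


if __name__ == '__main__':
    main()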

Yet, many projects don’t include these sorts of features. Frameworks, as we’ve already seen, don’t always provide it by default, which is a shame.

I certainly understand why programmers avoid build tools, especially in our current world where many programs don’t need an actual “build”. As programmers, we hope to create generalized solutions, while we are constantly pelted with proposed productivity gains in the form of personal automations. The result is that while we create simple programs that guide our users toward success, when it comes to writing code, we avoid prescribing how to develop on a project as if it’s a plague that will ruin free thinking.

The irony here is that Django, the framework with a built-in build tool, is an order of magnitude more popular. I’m confident the reason for its popularity lies in the integrated build tool, which helps developers find success.

At some point, we need to consider how other developers work with our code. As authors of a codebase, we should consider our users, our fellow developers, and provide them with build tools that aid in working on the code. The goal is not to force everyone into the same workflow. The goal is to make our codebases usable.

Good Enough

Eric Shrock wrote a great blog post on Engineer Anti-Patterns. Unfortunately, I can admit I’ve probably been guilty of each and every one of these patterns at one time or another. When I think back to the times these behaviors have crept in, the motivations always come back to what is really “good enough”.

For example, I’ve been “the talker” when I’ve seen a better solution to a problem and couldn’t let it go. The proposed solution was considered “good enough”, but not by me. My perspective of what is good enough clashed with that of the team and I felt it necessary to argue my point. I wouldn’t say that my motives were wrong, but at some point, a programmer must understand when and how to let an argument go.

The quest to balance “good enough” with best practices is rarely a simple yes or no question. Financial requirements might force you to make poor engineering decisions in order to avoid losing money in the moment. There are times when a program hasn’t proven its value, so strong engineering practices aren’t as important as simply proving the software is worth being written. Many times, writing and releasing something, even if it is broken, is the better political decision in an organization.

I suspect that most of these anti-patterns are a function of fear, specifically, the fear of failing. All of the anti-patterns reflect a lack of confidence in a developer. It might be imposter syndrome creeping in or the feeling of reliving a bad experience from the past. In order for programmers to be more productive and effective, it is critical that effort is made to reduce fear. In doing so, a developer can try a solution that may be simply “good enough”, knowing that if it falls short, it is something to learn from rather than fear.

Our goals as software developers and as an industry should be to raise the bar of “good enough” to the point where we truly are making educated risk / reward decisions instead of rolling the dice that some software is “good enough”.

The first step is to reduce the fear of failure. An organization should take steps to provide an environment where developers can easily and incrementally release code. Having tests you run locally, then in CI, then in a staging environment before finally releasing to production helps developers feel confident that pushing code is safe.

Similarly, an organization should make it easy to find failures. Tests are an obvious example here, but providing well known libraries for easy integration into your logging infrastructure and error reporting are critical. It should be easy for developers to poke around in the environment where things failed to see what went wrong. Adding new metrics and profiling to code should be documented and encouraged.

Finally, when failures do occur, they should not be a time to place blame. There should be blameless postmortems.

Many programmers and organizations fear spending time on basic tooling and consistent development environments. Some developers feel that having more than one step between commit and release represents a movement toward perfection, the enemy of “good enough”. We need to stop thinking like this! Basic logging, error reporting, writing tests and basic release strategies are all critical pieces that have been written and rewritten over and over again at thousands of organizations. We need to stop avoiding basic tenets of software development under the guise of being “good enough”.

Software Project Structure and Ecosystems

Code can smell, but just like our own human noses, everyone has his/her own perspective on what stinks. Also, just because something smells bad now, it doesn’t mean you can’t get used to the smell and eventually enjoy it. The same applies to how software projects are organized and configured.

Most languages have the concept of a package. Your source code repository is organized to support building a package that can be distributed via your language’s ecosystem. Python has setuptools/pip, Perl has CPAN, JavaScript has NPM, Ruby has gems, etc. Even compiled languages provide packages by way of builds for different operating systems’ packaging systems.

In each package use case, there are common idioms and best practices that the community supports. There are tools that promote these ideals and end up part of the language and community packaging ecosystem. This ecosystem often ends up permeating not just the project layout, but the project itself. As developers, we want to be participants in the ecosystem and benefit from everything that the ecosystem provides.

As we become accustomed to our language ecosystem and its project tendencies, we develop an appreciation for the aroma of the code. Our sense of code smell can be tainted to the point of feeling that other project structures smell bad and are somehow “wrong” in how they work.

If you’ve ever gone from disliking some cuisine to making it an integral part of your diet, you will quickly see the problem with associating a community’s ecosystem with sound software development practices. By disregarding different patterns as “smelly”, you also lose the opportunity to glean from positive aspects of other ecosystems.

A great example is Python’s virtualenv infrastructure compared to Go’s complete lack of packages. In Python, you create source based packages (most of the time) that are extracted and built within a virtual Python environment. If the package requires other packages, the environment uses the same system (usually pip) to download and build the sub-requirements. Go, on the other hand, requires that you specify libraries and other programs you use in your build step. These requirements are not “packaged” at all and typically use source repositories directly when defining requirements. These requirements become a part of your distributed executable. The executable is a single file that can be copied to another machine and run without requiring any libraries or tools on the target machine.

Both of these systems are extremely powerful, yet radically different. Python might benefit a great deal from building in the ability to package up a virtualenv and a specific command as a single file that could be run on another system. Similarly, Go could benefit from a more formalized package repository system that ensures higher security standards. It is easy to look at either system and feel the lack of the other’s packaging features is detrimental, when in fact, they are just different design trade-offs. Python has become very successful on systems such as Heroku where it is relatively easy to take a project from source code to a release because the ecosystem promotes building software during deployment. Go, on the other hand, has become an ideal system management tool because it is trivial to copy a Go program to another machine and run it without requiring OS specific libraries or dependencies.

While packaging is one area we see programmers develop a nose for coding conventions, it doesn’t stop there.

The language ecosystem also prescribes coding standards. Some languages are perfectly happy with a huge single file while others require a new file for each class / function. Each methodology has its own penalties and rewards. Many a Python / Vimmer has retched at the number of files produced when writing Java, while the Java developer stares in shock as the Vimmer uses search and replace to refactor code in a project. Again, these are design decisions with their own set of trade-offs.

One area where code smell becomes especially problematic is when it involves a coding paradigm that is different from the ecosystem’s. Up until recently, the Twisted and Stackless ecosystems in Python felt strange and isolated. Many developers felt that things like deferreds and greenlets were code smell when compared to tried and true threads. Yet, as we discovered the need for more socket connections to be open at the same time and the need to handle more I/O concurrently, it became clear that asynchronous programming models should be a first class citizen in the Python ecosystem at large. Prior to the ecosystem’s acceptance of async, the best practices felt very different and quite smelly.

Fortunately, in the case of async, Python managed to find a way to be more inclusive. The async paradigm still has far reaching requirements (there needs to be a main loop somewhere...), but the community has begun the process of integrating the ideas as seamless, natural, fragrant additions to the language.

Unfortunately, other paradigms have not been as well integrated. Functional programming, while having many champions and excellent libraries, still has not managed to break into the ecosystem at the project level. If we have a package that has very few classes, nested function calls, LRU caches and tons of (cy)func|itertool(s|z), it feels smelly.

As programmers we need to make a conscious effort to expand our code olfaction to include as wide a bouquet as possible. When we see code that seems stinky, rather than assuming it is poorly designed or dangerous, take a big whiff and see what you find out. Make an effort to understand it and appreciate its fragrance. Over time, you’ll eventually understand the difference between pungent code, that is powerful and efficient, vs. stinky code that is actually rancid and should be thrown out.

Flask vs. CherryPy

I’ve always been a fan of CherryPy. It seems to avoid making decisions for you that will matter over time and it helps you make good decisions where it matters most. CherryPy is simple, pragmatic, stable, fast enough and plays nice with other processes. As much as I appreciate what CherryPy offers, there is unfortunately not a lot of mindshare in the greater Python community. I suspect the reason CherryPy is not seen as a hip framework is that most users of CherryPy happily work around the rough edges and get work done rather than make an effort to market their framework of choice. Tragic.

While there are a lot of microframeworks out there, Flask seems to be the most popular. I don’t say this with any sort of scientific accuracy, just a gut feeling. So, when I set out to write a different kind of process manager, it seemed like a good time to see how other microframeworks work.

The best thing I can say about Flask is the community of projects. After having worked on a Django project, I appreciate the admin interface and how easy it is to get 80% there. Flask is surprisingly similar in that searching Google for “flask + something” quickly provides some options to implement something you want. Also, as Flask generally tries to avoid being too specific, the plugins (called Blueprints... I think) seem to provide basic tools with the opportunity to customize as necessary. Flask-Admin is extremely helpful alongside Flask-SQLAlchemy.

Unfortunately, while this wealth of community packages is excellent, Flask falls short when it comes to actual development. Its lack of structure in terms of dispatching makes organizing code feel very haphazard. It is easy to create circular dependencies due to the use of imports for establishing what code gets called. In essence, Flask forces you to build patterns that are application specific rather than prescribing models that make sense generally.

While a lack of direction can make the organization of the code less obvious, it does allow you to easily hook applications together. The Blueprint model, from what I can tell, makes it reasonably easy to compose applications within a site.

Another difficulty with Flask is configuration. Since you are using the import mechanism to configure your app, your configuration must also be semi-available at import time. Where this makes things slightly difficult is when you are creating an app that starts a web server (as opposed to an app that runs a web service). It is kind of tricky to create myapp --config because by the time you’ve started the app, you’ve already imported your application and set up some config. Not a huge issue, but it can be kludgy.
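
One common workaround, and I’m not claiming it is the official answer, is the application factory pattern, which defers configuration until the process has parsed its arguments. A rough sketch:

import argparse

from flask import Flask


def create_app(config_path=None):
    # Nothing is configured at import time; the app is built on demand.
    app = Flask(__name__)
    if config_path:
        app.config.from_pyfile(config_path)

    @app.route('/')
    def index():
        return 'configured from %s' % (config_path or 'defaults')

    return app


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--config')
    args = parser.parse_args()
    create_app(args.config).run()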

This standalone-server model is where CherryPy excels. It allows you to create a standalone process that acts as a server. It provides a robust configuration mechanism that allows turning process- and request-level features on and off. It allows configuration per-URL as well. The result is that if you’re writing a daemon or some single app you want to run as a command, CherryPy makes this exceptionally easy and clear.

CherryPy also helps you stay a bit more organized in the framework. It provides some helpful dispatcher patterns that support a wide array of functionality and provide some more obvious patterns for organizing code. It is not a panacea. There are patterns that take some getting used to. But, once you understand these patterns, it becomes a powerful model to code in.

Fortunately, if you do want to use Flask as a framework and CherryPy as a process runner / server, it couldn’t be easier. It is trivial to run a Flask app with CherryPy, getting the best of both worlds in some ways.
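
For reference, a minimal sketch of what that looks like; the host and port values are arbitrary:

import cherrypy
from flask import Flask

app = Flask(__name__)


@app.route('/')
def index():
    return 'Flask app served by CherryPy'


if __name__ == '__main__':
    # Graft the Flask WSGI app onto CherryPy's production-grade server.
    cherrypy.tree.graft(app, '/')
    cherrypy.config.update({
        'server.socket_host': '0.0.0.0',
        'server.socket_port': 8080,
    })
    cherrypy.engine.start()
    cherrypy.engine.block()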

While I wish CherryPy had more mindshare, I’m willing to face facts that Flask might have “won” the microframework war. With that said, I think there are valuable lessons to learn from CherryPy that could be implemented for Flask. I’d personally love to see the process bus model made available and a production web server included. Until then though, I’m happy to use CherryPy for its server and continue to enjoy the functionality graciously provided by the Flask community.

Thinking About ETLs

My primary focus for the last year or so has been writing ETLs at work. It is an interesting problem because on some level it feels extremely easy, while in reality, it is a problem that is very difficult to abstract.

Queries

The essence of an ETL, beyond the obvious “extract, transform, load”, is the query. In the case of a database, the query is typically the SELECT statement, but it usually is more than that. It often includes the format of the results. You might need to chunk the data using multiple queries. There might be columns you skip or columns you create.

In non-database ETLs, the extraction still ends up being very similar to a query. You often still need to find boundaries for what you are extracting. For example, if you had a bunch of date-stamped log files, running find /var/logs -name '2014*.log.gz' could still be considered a query.

A query is important because ETLs are inherently fragile. ETLs are required because the standard interface to some data is not available due to some constraint. By bypassing standard, and more importantly supported, interfaces, you are on your own when it comes to ensuring the ETL runs. The database dump you are running might time out. The machine you are reading files from may reboot. The REST API node you are hitting gets a new version and restarts. There are always good reasons for your ETL process to fail. The query makes it possible to go back and try things again, limited to the specific subset of data you are missing.
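
As a sketch of what the query buys you, assume a date-bounded extract (the table and column names are made up). Any day that fails can be re-run on its own:

import datetime


def day_queries(start, end):
    # One query per day; a failed day can be retried by itself without
    # re-extracting everything else.
    day = start
    while day < end:
        yield ("SELECT * FROM events "
               "WHERE created >= '%s' AND created < '%s'"
               % (day, day + datetime.timedelta(days=1)))
        day += datetime.timedelta(days=1)


for query in day_queries(datetime.date(2014, 2, 1), datetime.date(2014, 2, 4)):
    print(query)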

Transforms

ETLs are often considered part of some analytics pipeline. The goal of an ETL is typically to take data from some system and transform it to a format that can be loaded into another system for analysis. A better principle is to store intermediaries, so that transformation targets a specific, generalized format rather than a specific system such as a database.

This is much harder than it sounds.

The key to providing generic access to data is a standard schema for the data. The “shape” of the data needs to be described in a fashion that is actionable by the transformation process that loads the data into the analytics system.

The schema is more than a type system. Some data is heavy with metadata while other data is extremely consistent. The schema should provide notation for both extremes.

The schema should also provide hints on how to convert the data. The most important aspect of the schema is to communicate to the loading system how to transform and / or import the data. One system might happily accept a string with 2014-02-15 as a date if you specify that it is a date, while others may need something more explicit. The schema should communicate that the data is a date string with a specific format that the loading system can use accordingly.
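
Here is a hedged sketch of the kind of schema notation I mean; the field names and keys are invented for illustration, not a real standard:

schema = {
    'name': 'users_dump',
    'fields': [
        {'name': 'id',         'type': 'integer'},
        {'name': 'email',      'type': 'string', 'max_length': 255},
        # The hint tells the loading system how to parse the raw string.
        {'name': 'created_at', 'type': 'date',   'format': '%Y-%m-%d'},
    ],
}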

The schema can be difficult to create. Metadata might need a suite of queries to other systems in order to fill in the data. There might need to be calculations that have to happen that the querying system doesn’t support. In these cases you are not just transforming the data, but processing it.

I admit I just made an arbitrary distinction and definition of “processing”, so let me explain.

Processing Data

In a transformation, you take the data you have and change it. If I have a URL, I might transform it into JSON that looks like {'url': $URL}. Processing, on the other hand, uses the data to create new data. For example, if I have a RESTful resource, I might crawl it to create a single view of some tree of objects. The important difference is that we are creating new information by using resources not found in the original query data.
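
In code, the distinction might look something like this (the fetch callable and the 'links' key are hypothetical):

import json


def transform(url):
    # Transformation only reshapes data we already have.
    return json.dumps({'url': url})


def process(url, fetch):
    # Processing creates new information by consulting other resources,
    # e.g. crawling a RESTful resource tree rooted at the URL.
    resource = fetch(url)
    children = [fetch(child_url) for child_url in resource.get('links', [])]
    return json.dumps({'url': url, 'resource': resource, 'children': children})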

The processing of data can be expensive. You might have to make many requests for every row of output in a database table. The calculations, while small, might be on a huge dataset. Whatever processing needs to happen in order to get your data to a generically usable state, it is a difficult problem to abstract over a wide breadth of data.

While there is no silver bullet to processing data, there are tactics that can be used to process data reliably and reasonably fast. The key to abstracting processing is defining the unit of work.

A Unit of Work

“Unit of Work” is probably a loaded term, so once again, I’ll define what I mean here.

When processing data in an ETL, the Unit of Work is the combination of:

  • an atomic record
  • an atomic algorithm
  • the ability to run the implementation

If all this sounds very map/reducey it is because it is! The difference is that in an ETL you don’t have the same reliability you’d have with something like Hadoop. There is no magical distributed file system that has your data ready to go on a cluster designed to run code explicitly written to support your map/reduce platform.

The key difference with processing data in ETLs vs. some system like Hadoop is the implementation and execution of the algorithm. The implementation includes:

  • some command to run on the atomic record
  • the information necessary to setup an environment for that script to run
  • an automated way to input the atomic record to the command
  • a guarantee of reliable execution (or failure)

If we look at a system like Hadoop, and this applies to most map/reduce platforms that I’ve seen, there is an explicit step that takes data from some system and adds it to the HDFS store. There is another step that installs code, specifically written for Hadoop, onto the cluster. This code could be using Hadoop streaming or actual Java, but in either case, the installation is done via some deployment.

In other words, there is an unsaid step that Extracts data from some system, Transforms it for Hadoop and Loads it into HDFS. The processing in this case is getting the data from whatever the source system is into the analytics system, therefore, the requirements are slightly different.

We start off with a command. The command is simply an executable script like you would see in Hadoop streaming. No real difference here. Each line passed to the command contains the atomic record as usual.

Before we can run that command, we need to have an environment configured. In Hadoop, you’ve configured your cluster and deployed your code to the nodes. In an ETL system, due to the fragility and simpler processing requirements (no one should write a SQL-like system on top of an ETL framework), we want to set up an environment every time the command runs. By setting up this environment every time the command runs, you allow a clear path for development of your ETL steps. Making the environment creation part of the development process means that you ensure the deployment is tested alongside the actual command(s) your ETL uses.

Once we have the command and an environment to run it in, we need a way to get our atomic record to the command for actual processing. In Hadoop streaming, we use everyone’s favorite file handle, stdin. In an ETL system, while the command may still use stdin, the way the data enters the ETL system doesn’t necessarily involve a distributed file system. Data might be downloaded from S3, some RESTful service, and / or some queue system. It is important that you have a clear, automated way to get data to an ETL processing node.
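
A rough sketch of that hand-off, assuming records arrive as lines of text and the step is an arbitrary executable command:

import subprocess


def run_step(command, records):
    # Start the ETL step in its own (freshly prepared) environment and
    # feed it one atomic record per line on stdin, Hadoop-streaming style.
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    for record in records:
        proc.stdin.write(record.encode('utf-8') + b'\n')
    proc.stdin.close()

    # A non-zero exit code means the step failed and this slice of the
    # query can be retried on its own.
    return proc.wait() == 0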

Finally, this processing must be reliable. ETLs are low priority. An ETL should not lock your production database for an hour in order to dump the data. Instead, ETLs must quietly grab the data in a way that doesn’t add contention to the running systems. After all, you are extracting the data because a query on the production server would bog it down when it needs to be serving real time requests. An ETL system needs to reliably stop and start as necessary to get the data it needs while avoiding adding more contention to an already resource intensive service.

Loading Data

Loading data from an ETL system requires analyzing the schema in order to construct the understanding between the analytics system and the data. In order to make this as flexible as possible, it is important that the schema use the source of the data to add as much metadata as possible. If the data comes from a Postgres table, the schema should ideally include most of the table’s schema information. If that data must be loaded into some other RDBMS, you then have all you need to safely read the data into the system.

Development and Maintenance

ETLs are always going to be changing. New analytics systems will be used and new sources of data will be created. As the source system constraints change, so do the constraints of an ETL system, again, with the ETL system being the lowest priority.

Since we can rely on ETLs changing and breaking, it is critical to raise awareness of maintenance within the system.

The key to creating a maintainable system is to build up from small tools. The reason is that as you create small abstractions at a low level, you can reuse them easily. The trade-off is that in the short term, more code is needed to accomplish common tasks. Over time, you find patterns specific to your organization’s requirements that allow repetitive tasks to be abstracted into tools.

The converse to building up an ETL system based on small tools is to use a pre-built execution system. Unfortunately, pre-built ETL systems have been generalized for common tasks. As we’ve said earlier, ETLs are often changing and require more attention than a typical distributed system. The result is that using a pre-built ETL environment often means writing ETLs just to let the pre-built ETL system do its work!

Testing

Our goal for our ETLs is to make them extremely easy to test. There are many facets to testing ETLs such as unit testing within an actual package. The testing that is most critical for development and maintenance is simply being able to quickly run and test a single step of an ETL.

For example, let’s say we have an ETL that dumps a table, reformats some rows and creates a 10GB gzipped CSV file. I only mention the size here as it implies that it takes too long to run over the entire set of data every time while testing. The file will then be uploaded to S3 and a central data warehouse system notified. Here are some steps that the ETL might perform:

  1. Dumping the table
  2. Creating a schema
  3. Processing the rows
  4. Gzipping the output
  5. Uploading the data
  6. Updating the warehouse

Each of these steps should be runnable:

  • locally on a fake or testing database
  • locally, using a production database
  • remotely using a production database and testing system (test bucket and test warehouse)
  • remotely using the production database and production systems

By “runnable”, I mean that an ETL developer can run a command with a specific config and watch the output for issues.

These steps are all pretty basic, but the goal with an ETL system is to abstract the pieces that can be used across all ETLs in a way that is optimal for your system. For example, if your system is consistently streaming, your ETL framework might allow you to chain file handles together. For example

$ dump table | process rows | gzip | upload

Another option might be that each step produces a file that is used by the next step.

Both tactics are valid and can be optimized over time to help distill ETLs down to the minimal, changing requirements. In the above example, the database table dump could be abstracted to take the schema and some database settings and dump any table in your databases. The gzip, upload and data warehouse interactions can be broken out into a library and/or command line apps. Each of these optimizations is simple enough to be included in an ETL development framework without forcing a user to jump through a ton of hoops when a new data store needs to be considered.
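
Either style can be wrapped in a tiny runner. Here is a sketch of the file-per-step variant; the step commands are placeholders:

import subprocess


def run_pipeline(steps, workdir='.'):
    """Run each step as a command, handing it the previous step's output file."""
    previous = None
    for name, command in steps:
        output = '%s/%s.out' % (workdir, name)
        with open(output, 'wb') as out:
            cmd = command + ([previous] if previous else [])
            subprocess.check_call(cmd, stdout=out)
        previous = output
    return previous


# Hypothetical step commands; each reads the prior file and writes to stdout.
steps = [
    ('dump',    ['python', 'dump_table.py']),
    ('process', ['python', 'process_rows.py']),
    ('gzip',    ['gzip', '-c']),
]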

An ETL Framework

Making it easy to develop ETLs means a framework. We want to create a Ruby on Rails for writing ETLs: easy enough to get the easy stuff done and powerful enough to deal with the corner cases. The framework revolves around the schema and the APIs to the different systems, with libraries that provide language specific APIs.

At some level, the framework needs to allow the introduction of other languages. My only suggestion here is that other languages be abstracted through a command line layer. The ETL framework can eventually call a command that could be written in whatever language the developer wants to use. ETLs are typically used to export data to a consumer that is reasonably technical. Someone using this data most likely has some knowledge of a language such as R, Julia or maybe JavaScript. It is these technically savvy data wranglers we want to empower with the ETL framework, in order to allow them to solve small ETL issues themselves and provide reliability where the system can be flaky.

Open Questions

The system I’ve described is what I’m working on. While I’m confident the design goals are reasonable, the implementation is going to be difficult. Specifically, the task of generically supporting many languages is challenging because each language has its own ecosystem and environment. Python is an easy language for this task because it is trivial to connect to an Ubuntu host and have a good deal of the ecosystem in place. Other languages, such as R, probably require some coordination with the cluster provisioning system to make sure base requirements are available. That said, it is unclear whether other languages provide small environments the way virtualenvs do. Obviously, typical scripting languages like Ruby and JavaScript have support for an application-local environment, but I’m doubtful R or Julia have the same facilities.

Another option would be to use a formal build / deployment pattern where a container is built. This answers many of the platform questions, but it brings up other questions, such as how to make this available in the ETL framework. It is ideal if an ETL author can simply call a command to test. If the author needs to build a container locally, then I suspect that might be too large a requirement, as each platform is going to be different. Obviously, we could introduce a build host to handle the build steps, but that makes it much harder for someone to feel confident the script they wrote will run in production.

The challenge comes from our hope to empower semi-technical ETL authors. If we compare this goal to people who can write HTML/CSS vs. programmers, it clarifies the requirements. A user learning to write HTML/CSS only has to open the file in a web browser to test it. If the page looks correct, they can be confident it will work when they deploy it. The goal with the ETL framework and APIs is that the system can provide a similar work flow and ease of use.

Wrapping Up

I’ve written a LOT of ETL code over the past year. Much of what I propose above reflects my experiences. It also reflects the server environment in which these ETLs run, as well as the organizational environment. ETLs are low priority code, by nature, that can be used to build first class products. Systems that require a lot of sysadmin time or server resources, or that have too specific an API, may still be helpful for moving data around, but they will fall short as systems evolve. My goal has been to create a system that evolves with the data in the organization and empowers a large number of users to distribute the task of developing and maintaining ETLs.