Replacing Monitors

I just read a quick article on Microsoft’s new VR goggles. The idea of layering virtual interfaces on top of the real world seems really cool and even practical in some use cases. What seems really difficult is how an application will understand the infinite number of visual environments in order to effectively and accurately visualize the interfaces. Hopefully the SDK for the device includes a library that provides hooks for different real world elements, something like placeOnTop(vObject, x, y, z), where it can recognize some object in the room and make that object available as a platform. In other words, it sees the coffee table and makes it available as an object that you can put something on top of.

The thing is, what I’d love to see is VR replacing monitors! Every year TVs and monitors get upgrades in size and quality, yet the prices rarely drop very radically. Right now I have a laptop and an external monitor that I use at home. I’d love to get rid of both tiny screens and just look into space and see my windows on a huge, virtual, flat surface.

An actual computer would still be required and there wouldn’t necessarily need to be huge changes to the OS at first. The goggles would just be another big screen that would take the rasterized screen and convert it to something seen in the analog world. Sure, this would probably ruin a few eyes at first, but having a huge monitor measured in feet vs. inches is extremely enticing.

DevOps

I finally realized why DevOps is an idea. Up until this point, I felt DevOps was a term for a developer who was also responsible for the infrastructure. In other words, I never associated DevOps with an actual strategy or idea; instead, it was simply something that happened. Well, no more!

DevOps means providing developers the keys [1] to operations. In a small organization, those keys never leave the hands of a small team of developers who have nothing to concern themselves with except getting things done. As an organization grows, a dedicated person (and soon after, a group of people) takes over maintaining the infrastructure. What happens then is that the keys developers had to log into any server or install a new database are taken away and given to operations to manage. DevOps is a trend where operations and developers share the keys.

Because developers and operations both have access to change the infrastructure, there is a meeting of the minds that has to happen. Developers and Ops are forced to communicate the what, where, when, why and how of changes to the infrastructure. Since folks in DevOps are familiar with code, version control becomes a place of communication and cohesion.

The reason I now understand this paradigm more clearly is that when a developer doesn’t have access to the infrastructure, it is a huge waste of time. When code doesn’t work, we need to be able to debug it. It is important to come up with theories about why things don’t work and iteratively test those theories until we find the reason for the failure. While it is possible to debug bugs that only show up in production, it can be slow, frustrating and difficult when access to the infrastructure isn’t available.

I say all this with a huge grain of salt. I’m not a sysadmin and I’ve never been a true sysadmin. I understand the hand-wavy idea that if all developers have ssh keys to the hosts in a datacenter, there are more vectors for attack. What I don’t understand is why a developer with ssh keys is any more dangerous than a sysadmin with ssh keys. Obviously, a sysadmin may have a more stringent outlook on what is acceptable, but at the same time, anyone can be duped. After all, if you trust your developers to write code that writes to your database millions (or billions!) of times a day, I’m sure you can trust them to keep an ssh key safe or avoid exposing services that are meant to remain private.

I’m now a full on fan of DevOps. Developers and Ops working together and applying software engineering techniques to everything they work with seems like a great idea. Providing Developers keys to the infrastructure and pushing Ops to communicate the important security and/or hardware concerns is only a good thing. The more cohesion between Ops and Dev, the better.

[1] I’m not talking about ssh keys, but rather the idea of having “keys to the castle”. One facet of this could be making sure developer keys and accounts are available on the servers, but that is not the only way to give devs access to the infrastructure.

Logging

Yesterday I fixed an issue in dadd where logs from processes weren’t being correctly sent back to the master. My solution ended up being a rather specific process of opening the file that would contain the logs and ensuring that any subprocesses used this file handle.

Here is the essential code annotated:

# This daemonizes the code. It can accept stdin/stdout parameters
# that I had originally used to capture output. But the file used for
# capturing the output would not be closed or flushed and we'd get
# nothing. After this code finishes we do some cleanup, so my logs were
# empty.
with daemon.DaemonContext(**kw):

    # Just watching for errors. We pass in the path of our log file
    # so we can upload it for an error.
    with ErrorHandler(spec, env.logfile) as error_handler:
        configure_environ(spec)

        # We open our logfile as a context manager to ensure it gets
        # closed, and more importantly, flushed and fsync'd to the disk.
        with open(env.logfile, 'w+') as output:

            # Pass in the file handle to our worker, which will start some
            # subprocesses that we want to know the output of.
            worker = PythonWorkerProcess(spec, output)

            # printf => print to file... I'm sure this will get
            # renamed in order to avoid confusion...
            printf('Setting up', output)
            worker.setup()
            printf('Starting', output)
            try:
                worker.start()
            except:
                import traceback

                # Print our traceback in our logfile
                printf(traceback.format_exc(), output)

                # Upload our log to our dadd master server
                error_handler.upload_log()

                # Raise the exception for our error handler to send
                # me an email.
                raise

            # Wrapping things up
            printf('Finishing', output)
            worker.finish()
            printf('Done', output)

By now, the big question is probably, “Why not use the logging module?”

When I initially hacked the code, I just used print and had planned on letting the daemon library capture logs. That would make it easy for the subprocesses (scripts written by anyone) to get logs. Things were out of order though, and by the time the logs were meant to be sent, the code had already cleaned up the environment where the subprocesses had run, including deleting the log file.

My next step then was to use the logging module.

Logging is Complicated!

I’m sure it is not the intent of the logging module to be extremely complex, but the fact is, the management of handlers, loggers and levels across a wide array of libraries and application code gets unwieldy fast. I’m not sure people run into this complexity that often, as it is easy to use the basicConfig and be done with it. As an application scales, logging becomes more complicated and, in my experience, you either explicitly log to syslog (via the syslog module) or to stdout, where some other process manager handles the logs.

But, in the case where you do use logging, it is important to understand some essential complexity you simply can’t ignore.

First off, configuring loggers needs to be done early in the application. When I say early, I’m talking about at import time. The reason is that libraries, which should try to log intelligently, might configure the logging system themselves as soon as they are imported.
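
As a minimal sketch of what I mean by early (the myapp module name is made up), the entry point configures logging before importing anything that might touch it:

import logging

# Configure the root logger first, before any other imports run.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(name)s %(levelname)s %(message)s')

# Imports that might configure their own loggers come afterward
# (hypothetical module).
from myapp import tasks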

Secondly, the propagation of the different loggers needs to be explicit, and again, some libraries / frameworks are going to do it wrong. By “wrong”, I mean that the assumptions the library author makes don’t align with your application. In dadd, I’m using Flask. Flask comes with a handy app.logger object that you can use to write to the log. It has a specific formatter as well that makes messages really loud in the logs. Unfortunately, I couldn’t use this logger because I needed to reconfigure the logs for a daemon process. The problem was that this daemon process lives in the same repo as my main Flask application. If my daemon logging code gets loaded, which is almost certain to happen, it reconfigures the logging module, including Flask’s handy app.logger object. It was frustrating to test logging in my daemon process only to find my Flask logs had disappeared. When I got them back, messages showed up multiple times because different handlers had been attached that use the same output, which leads me to my next gripe.

The logging module is opaque. It would be extremely helpful to be able to drop a pprint(logging.current_config) into your code and see the effective config at that point. That way, you could intelligently update the config with tools like logging.config.dictConfig, either by editing the current config or by using the incremental and disable_existing_loggers options correctly.
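
No such function exists, but you can get part of the way there by walking the loggers the module already knows about. A rough sketch:

import logging
from pprint import pprint

def dump_logging_config():
    # Walk the loggers the logging module knows about and show their
    # levels, handlers and propagation settings.
    config = {'root': (logging.root.level, logging.root.handlers)}
    for name, logger in logging.Logger.manager.loggerDict.items():
        if isinstance(logger, logging.Logger):  # skip PlaceHolder entries
            config[name] = (logger.level, logger.propagate, logger.handlers)
    pprint(config)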

Logging is Great

I’d like to make it clear that I’m a fan of the logging module. It is extremely helpful as it makes logging reliable and can be used in a multithreaded / multiprocessing environment. You don’t have to worry about explicitly flushing the buffer or fsync’ing the file handle. You have an easy way to configure the output. There are excellent handlers that help you log intelligently such as the RotatingFileHandler, WatchedFileHandler and SysLogHandler. Many libraries also allow turning up the log level to see more deeply into what they are doing. Requests and urllib3 do a pretty decent job of this.
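
For example, cranking up a single library’s logger (here urllib3, assuming it is installed as the standalone package) is a one-liner once a basic config is in place:

import logging

logging.basicConfig(level=logging.INFO)

# See much more detail from one library without drowning in DEBUG
# output from everything else.
logging.getLogger('urllib3').setLevel(logging.DEBUG)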

The problem is that controlling output is a different problem than controlling logging, yet they are intertwined. If you find it difficult to add some sort of output control to your application and the logging module seems to be causing more problems than it is solving, then don’t use it! The technical debt you need to pay off for a small, customized output control system is extremely low compared to the hoops you might need to jump through in order to mold logging to your needs.

With that said, learning the logging module is extremely important. Django provides a really easy way to configure logging and you can be certain that it gets loaded early enough in the process that you can rely on it. Flask and CherryPy (and I’m sure others) provide hooks into their own loggers that are extremely helpful. Finally, basicConfig is a great tool to get started with logging in standalone scripts that need to differentiate between DEBUG and INFO statements. Just remember, if things get tough and you feel like you’re battling logging, you might have hit the edges of its valid use cases and it is time to consider another strategy. There is no shame in it!

Build Tools

I’ve recently been creating more new projects for small libraries / apps and it’s got me thinking about build tools. First off, “build tools” is probably a somewhat misleading name, yet I think it is the name most often associated with the type of tools I’m thinking of. Make is the big one, but there are a whole host of tools like Make on almost every platform.

One of the apps I’ve been working on is a Flask application. For the past year or so, I’ve been working on a Django app alongside some CherryPy apps. The interesting thing about these different frameworks is the built-in integration with some sort of build tool.

CherryPy essentially has no integration. If you’re unfamiliar with CherryPy, I’d argue it is the un-opinionated web framework, so it shouldn’t be surprising that there are no tools to stub out some directories, provide helpers for connecting to the DB or start a shell with the framework code loaded.

Flask is similar to CherryPy in that it is a microframework, but the community has been much more active providing plugins (Blueprints in Flask terms) that provide more advanced functionality. One such plugin mimics Django’s manage.py file, providing typical build tool and project helpers.

Django, as I just mentioned, provides a manage.py file that adds some project helpers and, arguably, functions as a generic build tool.

I’ve become convinced that every project should have some sort of build tool in place that codifies how to work with the project. The build tool should automate how to build and release the software, along with how that process interacts with the source control system. The build tool should provide helpers, where applicable, to aid in development and debugging. Finally, the build tool should help in running the app’s processes and/or supporting processes (i.e. making sure a database is up and running).
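
As a sketch of what I mean (the commands and module names here are hypothetical), even a small manage.py-style script can codify the common tasks:

import argparse
import subprocess
import sys

def test(args):
    # Run the test suite the same way every developer and CI would.
    return subprocess.call(['py.test', 'tests/'])

def serve(args):
    # Start the app's process locally (hypothetical module).
    return subprocess.call([sys.executable, '-m', 'myapp.server'])

def release(args):
    # Build and publish a package.
    return subprocess.call([sys.executable, 'setup.py', 'sdist', 'upload'])

TASKS = {'test': test, 'serve': serve, 'release': release}

def main():
    parser = argparse.ArgumentParser(description='Project tasks')
    parser.add_argument('task', choices=sorted(TASKS))
    args = parser.parse_args()
    sys.exit(TASKS[args.task](args))

if __name__ == '__main__':
    main()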

Yet, many projects don’t include these sorts of features. Frameworks, as we’ve already seen, don’t always provide it by default, which is a shame.

I certainly understand why programmers avoid build tools, especially in our current world where many programs don’t need an actual “build”. As programmers, we hope to create generalized solutions, while we are constantly pelted with proposed productivity gains in the form of personal automations. The result is that while we create simple programs that guide our users toward success, when it comes to writing code, we avoid prescribing how to develop on a project as if it’s a plague that will ruin free thinking.

The irony here is that Django, the framework with a built-in build tool, is an order of magnitude more popular. I’m confident the reason for its popularity lies in the integrated build tool that helps developers find success.

At some point, we need to consider how other developers work with our code. As authors of a codebase, we should consider our users, our fellow developers, and provide them with build tools that aid in working on the code. The goal is not to force everyone into the same workflow. The goal is to make our codebases usable.

Good Enough

Eric Shrock wrote a great blog on Engineer Anti-Patterns. I, unfortunately, can admit I’ve probably been guilty of each and every one of these patterns at one time or another. When I think back to times these behaviors have crept in, the motivations always come back to what is really “good enough”.

For example, I’ve been “the talker” when I’ve seen a better solution to a problem and can’t let it go. The proposed solution was considered “good enough”, but not by me. My perspective of what is good enough clashes with that of the team and I feel it necessary to argue my point. I wouldn’t say that my motives are wrong, but at some point, a programmer must understand when and how to let an argument go.

The quest to balance “good enough” with best practices is rarely a simple yes or no question. Financial requirements might force you to make poor engineering decisions to avoid losing money in the moment. There are times when a program hasn’t proven its value, so strong engineering practices aren’t as important as simply proving the software is worth writing. Many times, writing and releasing something, even if it is broken, is a better political decision in an organization.

I suspect that most of these anti-patterns are a function of fear, specifically, the fear of failing. All of the anti-patterns reflect a lack of confidence in a developer. It might be imposter syndrome creeping in or the feeling of reliving a bad experience in the past. In order for programmers to be more productive and effective, it is critical that effort is made to reduce fear. In doing so, a developer can try a solution that may be simply “good enough”, knowing that if it falls short, it is something to learn from rather than fear.

Our goals as software developers and as an industry should be to raise the bar of “good enough” to the point where we truly are making educated risk / reward decisions instead of rolling the dice that some software is “good enough”.

The first step is to reduce the fear of failure. An organization should take steps to provide an environment where developers can easily and incrementally release code. Having tests you run locally, then in CI, then in a staging environment before finally releasing to production helps developers feel confident that pushing code is safe.

Similarly, an organization should make it easy to find failures. Tests are an obvious example here, but providing well known libraries for easy integration into your logging infrastructure and error reporting are critical. It should be easy for developers to poke around in the environment where things failed to see what went wrong. Adding new metrics and profiling to code should be documented and encouraged.

Finally, when failures do occur, they should not be a time to place blame. There should be blameless postmortems.

Many programmers and organizations fear spending time on basic tooling and consistent development environments. Some developers feel having more than one step between commit and release represents a movement towards perfection, the enemy of “good enough”. We need to stop thinking like this! Basic logging, error reporting, writing tests and basic release strategies are all critical pieces that have been written and rewritten over and over again at thousands of organizations. We need to stop avoiding these basic tenets of software development under the guise of being “good enough”.

Software Project Structure and Ecosystems

Code can smell, but just like our own human noses, everyone has his/her own perspective on what stinks. Also, just because something smells bad now, it doesn’t mean you can’t get used to the smell and eventually enjoy it. The same applies for how software projects are organized and configured.

Most languages have the concept of a package. Your source code repository is organized to support building a package that can be distributed via your language’s ecosystem. Python has setuptools/pip, Perl has CPAN, JavaScript has NPM, Ruby has gems, etc. Even compiled languages provide packages by way of builds for different operating system packaging systems.

In each package use case, there are common idioms and best practices that the community supports. There are tools that promote these ideals and end up part of the language and community packaging ecosystem. This ecosystem often ends up permeating not just the project layout, but the project itself. As developers, we want to be participants in the ecosystem and benefit from everything that the ecosystem provides.

As we become accustomed to our language ecosystem and its project tendencies, we develop an appreciation for the aroma of the code. Our sense of code smell can be tainted to the point of feeling that other project structures smell bad and are somehow “wrong” in how they work.

If you’ve ever gone from disliking some cuisine to making it an integral part of your diet, you will quickly see the problem with associating a community’s ecosystem with sound software development practices. By disregarding different patterns as “smelly”, you also lose the opportunity to glean from positive aspects of other ecosystems.

A great example is Python’s virtualenv infrastructure compared to Go’s complete lack of packages. In Python, you create source based packages (most of the time) that are extracted and built within a virtual Python environment. If the package requires other packages, the environment uses the same system (usually pip) to download and build the sub-requirements. Go, on the other hand, requires that you specify libraries and other programs you use in your build step. These requirements are not “packaged” at all and typically use source repositories directly when defining requirements. These requirements become a part of your distributed executable. The executable is a single file that can be copied to another machine and run without requiring any libraries or tools on the target machine.

Both of these systems are extremely powerful, yet radically different. Python might benefit a great deal by building in the ability to package up a virtualenv and a specific command as a single file that could be run on another system. Similarly, Go could benefit from a more formalized package repository system that ensures higher security standards. It is easy to look at either system and feel the lack of the other’s packaging features is detrimental, when in fact, they are just different design trade-offs. Python has become very successful on systems such as Heroku where it is relatively easy to take a project from source code to a release because the ecosystem promotes building software during deployment. Go, on the other hand, has become an ideal system management tool because it is trivial to copy a Go program to another machine and run it without requiring OS specific libraries or dependencies.

While packaging is one area we see programmers develop a nose for coding conventions, it doesn’t stop there.

The language ecosystem also prescribes coding standards. Some languages are perfectly happy to have a huge single file while others require a new file for each class / function. Each methodology has its own penalties and rewards. Many a Python / Vimmer has retched at the number of files produced when writing Java, while the Java developer stares in shock as the vimmer uses search and replace to refactor code in a project. Again, these are design decisions with their own set of trade-offs.

One area where code smell becomes especially problematic is when it involves a coding paradigm that is different from the ecosystem. Up until recently, the Twisted and Stackless ecosystems in Python felt strange and isolated. Many developers felt that things like deferreds and greenlets were code smell compared to the tried and true threads. Yet, as we discovered the need for more socket connections to be open at the same time and as we needed to read more I/O concurrently, it became clear that asynchronous programming models should be a first class citizen in the Python ecosystem at large. Prior to the ecosystem’s acceptance of async, the best practices felt very different and quite smelly.

Fortunately, in the case of async, Python managed to find a way to be more inclusive. The async paradigm still has far reaching requirements (there needs to be a main loop somewhere...), but the community has begun the process of integrating the ideas as seamless, natural, fragrant additions to the language.

Unfortunately, other paradigms have not been as well integrated. Functional programming, while having many champions and excellent libraries, still has not managed to break into the ecosystem at the project level. If we have packages that have very few classes, nested function calls, LRU caches and tons of (cy)func|itertool(s|z), they feel smelly.

As programmers we need to make a conscious effort to expand our code olfaction to include as wide a bouquet as possible. When we see code that seems stinky, rather than assuming it is poorly designed or dangerous, take a big whiff and see what you find out. Make an effort to understand it and appreciate its fragrance. Over time, you’ll eventually understand the difference between pungent code, that is powerful and efficient, vs. stinky code that is actually rancid and should be thrown out.

Flask vs. CherryPy

I’ve always been a fan of CherryPy. It seems to avoid making decisions for you that will matter over time and it helps you make good decisions where it matters most. CherryPy is simple, pragmatic, stable, fast enough and plays nice with other processes. As much as I appreciate what CherryPy offers, there is unfortunately not a lot of mindshare in the greater Python community. I suspect the reason CherryPy is not seen as a hip framework is that most users of CherryPy happily work around the rough edges and get work done rather than make an effort to market their framework of choice. Tragic.

While there are a lot of microframeworks out there, Flask seems to be the most popular. I don’t say this with any sort of scientific accuracy, just a gut feeling. So, when I set out to write a different kind of process manager, it seemed like a good time to see how other microframeworks work.

The best thing I can say about Flask is the community of projects. After having worked on a Django project, I appreciate the admin interface and how easy it is to get 80% there. Flask is surprisingly similar in that searching Google for “flask + something” quickly provides some options to implement something you want. Also, as Flask generally tries to avoid being too specific, the plugins (called Blueprints... I think) seem to provide basic tools with the opportunity to customize as necessary. Flask-Admin is extremely helpful alongside Flask-SQLAlchemy.

Unfortunately, while this wealth of community packages is excellent, Flask falls short when it comes to actual development. Its lack of organization in terms of dispatching makes organizing code feel very haphazard. It is easy to create circular dependencies due to the use of imports for establishing what code gets called. In essence, Flask forces you to build some patterns that are application specific rather than prescribing some models that make sense generally.

While a lack of direction can make the organization of the code less obvious, it does allow you to easily hook applications together. The Blueprint model, from what I can tell, makes it reasonably easy to compose applications within a site.

Another difficulty with Flask is configuration. Since you are using the import mechanism to configure your app, your configuration must also be semi-available at import time. Where this makes things slightly difficult is when you are creating an app that starts a web server (as opposed to an app that runs a web service). It is kind of tricky to create myapp --config because by the time you’ve started the app, you’ve already imported your application and set up some config. Not a huge issue, but it can be kludgy.
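
As a sketch of the kludge (module names are hypothetical), the app has already been imported, and partially configured, before the flag can be read:

import argparse

from myapp import app  # importing already runs module-level config

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--config')
    args = parser.parse_args()
    if args.config:
        # Re-configure after the fact, which is where it gets kludgy.
        app.config.from_pyfile(args.config)
    app.run()

if __name__ == '__main__':
    main()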

This model is where CherryPy excels. It allows you to create a standalone process that acts as a server. It provides a robust configuration mechanism that allows turning on/off process and request level features. It allows configuration per-URL as well. The result is that if you’re writing a daemon or some single app you want to run as a command, CherryPy makes this exceptionally easy and clear.

CherryPy also helps you stay a bit more organized in the framework. It provides some helpful dispatcher patterns that support a wide array of functionality and offer more obvious ways to organize code. It is not a panacea. There are patterns that take some getting used to. But, once you understand these patterns, it becomes a powerful model to code in.

Fortunately, if you do want to use Flask as a framework and CherryPy as a process runner / server, it couldn’t be easier. It is trivial to run a Flask app with CherryPy, getting the best of both worlds in some ways.
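
For instance, a Flask WSGI app can be grafted onto CherryPy’s server in a few lines (a sketch, assuming the Flask app lives in a hypothetical myapp module):

import cherrypy

from myapp import app  # a Flask application, which is a WSGI app

# Mount the Flask app at the root and serve it with CherryPy.
cherrypy.tree.graft(app, '/')
cherrypy.config.update({
    'server.socket_host': '0.0.0.0',
    'server.socket_port': 8080,
})
cherrypy.engine.start()
cherrypy.engine.block()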

While I wish CherryPy had more mindshare, I’m willing to face facts that Flask might have “won” the microframework war. With that said, I think there are valuable lessons to learn from CherryPy that could be implemented for Flask. I’d personally love to see the process bus model made available and a production web server included. Until then though, I’m happy to use CherryPy for its server and continue to enjoy the functionality graciously provided by the Flask community.

Thinking About ETLs

My primary focus for the last year or so has been writing ETLs at work. It is an interesting problem because on some level it feels extremely easy, while in reality, it is a problem that is very difficult to abstract.

Queries

The essence of an ETL, beyond the obvious “extract, transform, load”, is the query. In the case of a database, the query is typically the SELECT statement, but it usually is more than that. It often includes the format of the results. You might need to chunk the data using multiple queries. There might be columns you skip or columns you create.

In non-database ETLs, it still ends up being very similar to a query. You often still need to find boundaries for what you are extracting. For example, if you had a bunch of date-stamped log files, doing a find /var/logs -name '2014*.log.gz' could still be considered a query.

A query is important because ETLs are inherently fragile. ETLs are required because the standard interface to some data is not available due to some constraints. By bypassing standard, and more importantly supported, interfaces, you are on your own when it comes to ensuring the ETL runs. The database dump you are running might time out. The machine you are reading files from may reboot. The REST API node you are hitting gets a new version and restarts. There are always good reasons for your ETL process to fail. The query makes it possible to go back and try things again, limiting them to the specific subset of data you are missing.

Transforms

ETLs often are considered part of some analytics pipeline. The goal of an ETL is typically to take some data from some system and transform it to a format that can be loaded into another system for analysis. A better principle is to consider storing the intermediaries such that transformation is focused on a specific generalized format, rather than a specific system such as a database.

This is much harder than it sounds.

The key to providing generic access to data is a standard schema for the data. The “shape” of the data needs to be described in a fashion that is actionable by the transformation process that loads the data into the analytics system.

The schema is more than a type system. Some data is heavy with metadata while other data is extremely consistent. The schema should provide notation for both extremes.

The schema also should provide hints on how to convert the data. The most important aspect of the schema is to communicate to the loading system how to transform and / or import the data. One system might happily accept a string with 2014-02-15 as a date if you specify it is a date, while others may need something more explicit. The schema should communicate that the data is a date string with a specific format that the loading system can use accordingly.
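
As a rough sketch (the field names here are hypothetical, not a real schema format), an entry describing such a column might look like:

column_schema = {
    'name': 'signup_date',
    'type': 'string',
    'format': 'date',        # hint for the loading system
    'pattern': '%Y-%m-%d',   # e.g. '2014-02-15'
    'nullable': False,
}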

The schema can be difficult to create. Metadata might need a suite of queries to other systems in order to fill in the data. There might need to be calculations that have to happen that the querying system doesn’t support. In these cases you are not just transforming the data, but processing it.

I admit I just made an arbitrary distinction and definition of “processing”, so let me explain.

Processing Data

In a transformation you take the data you have and change it. If I have a URL, I might transform it into JSON that looks like {‘url’: $URL}. Processing, on the other hand, uses the data to create new data. For example, if I have a RESTful resource, I might crawl it to create a single view of some tree of objects. The important difference is that we are creating new information by using other resources not found in the original query data.
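
As a rough illustration (the URL handling and use of requests here are placeholders, not any particular system):

import requests

def transform(url):
    # Transformation: reshape the data we already have.
    return {'url': url}

def process(url):
    # Processing: use the data to create new data by calling out to
    # another resource (here, fetching the RESTful resource itself).
    response = requests.get(url)
    return {'url': url, 'resource': response.json()}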

The processing of data can be expensive. You might have to make many requests for every row of output in a database table. The calculations, while small, might be on a huge dataset. Whatever processing needs to happen in order to get your data to a generically usable state, it is a difficult problem to abstract over a wide breadth of data.

While there is no silver bullet to processing data, there are tactics that can be used to process data reliably and reasonably fast. The key to abstracting processing is defining the unit of work.

A Unit of Work

“Unit of Work” is probably a loaded term, so once again, I’ll define what I mean here.

When processing data in an ETL, the Unit of Work is the combination of:

  • an atomic record
  • an atomic algorithm
  • the ability to run the implementation

If all this sounds very map/reducey it is because it is! The difference is that in an ETL you don’t have the same reliability you’d have with something like Hadoop. There is no magical distributed file system that has your data ready to go on a cluster designed to run code explicitly written to support your map/reduce platform.

The key difference with processing data in ETLs vs. some system like Hadoop is the implementation and execution of the algorithm. The implementation includes:

  • some command to run on the atomic record
  • the information necessary to setup an environment for that script to run
  • an automated way to input the atomic record to the command
  • a guarantee of reliable execution (or failure)
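
As a rough sketch (these field names are hypothetical, not from any particular tool), such an implementation might be captured in a small spec:

step_spec = {
    # The command run for each atomic record.
    'command': 'python process_record.py',
    # What the environment needs before the command can run.
    'requirements': ['requests==2.5.1'],
    # Where the atomic records come from.
    'input': 's3://my-bucket/2014/records.json.gz',
    # A crude reliability guarantee: how many times to retry before
    # reporting a failure.
    'retries': 3,
}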

If we look at a system like Hadoop, and this applies to most map/reduce platforms that I’ve seen, there is an explicit step that takes data from some system and adds it to the HDFS store. There is another step that installs code, specifically written for Hadoop, onto the cluster. This code could be using Hadoop streaming or actual Java, but in either case, the installation is done via some deployment.

In other words, there is an unsaid step that Extracts data from some system, Transforms it for Hadoop and Loads it into HDFS. The processing in this case is getting the data from whatever the source system is into the analytics system, therefore, the requirements are slightly different.

We start off with a command. The command is simply an executable script like you would see in Hadoop streaming. No real difference here. Each line passed to the command contains the atomic record as usual.

Before we can run that command, we need to have an environment configured. In Hadoop, you’ve configured your cluster and deployed your code to the nodes. In an ETL system, due to the fragility and simpler processing requirements (no one should write a SQL-like system on top of an ETL framework), we want to set up an environment every time the command runs. By setting up this environment every time the command runs, you allow a clear path for development of your ETL steps. Making the environment creation part of the development process means that the deployment is tested alongside the actual command(s) your ETL uses.
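
For example, a throwaway environment could be built for every run with something along these lines (the paths and virtualenv usage here are a sketch, not any tool’s actual implementation):

import os
import subprocess
import tempfile

def run_step(requirements, command):
    # Build a fresh environment in a temporary directory for this run.
    workdir = tempfile.mkdtemp(prefix='etl-step-')
    env_dir = os.path.join(workdir, 'env')
    subprocess.check_call(['virtualenv', env_dir])

    # Install the step's requirements into the throwaway environment.
    pip = os.path.join(env_dir, 'bin', 'pip')
    subprocess.check_call([pip, 'install'] + list(requirements))

    # Run the step's command with the environment's Python.
    python = os.path.join(env_dir, 'bin', 'python')
    subprocess.check_call([python, command], cwd=workdir)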

Once we have the command and an environment to run it in, we need a way to get our atomic record to the command for actual processing. In Hadoop streaming, we use everyone’s favorite file handle, stdin. In an ETL system, while the command may still use stdin, the data entering the system doesn’t necessarily have a distributed file system to come from. Data might be downloaded from S3, some RESTful service, and / or some queue system. It is important that you have a clear, automated way to get data to an ETL processing node.

Finally, this processing must be reliable. ETLs are low priority. An ETL should not lock your production database for an hour in order to dump the data. Instead, ETLs must quietly grab the data in a way that doesn’t add contention to the running systems. After all, you are extracting the data because a query on the production server will bog it down when it needs to be serving real time requests. An ETL system needs to reliably stop and start as necessary to get the data it needs while avoiding adding more contention to an already resource intensive service.

Loading Data

Loading data from an ETL system requires analyzing the schema in order to construct the understanding between the analytics system and the data. In order to make this as flexible as possible, it is important that the schema use the source of the data to add as much metadata as possible. If the data pulls from a Postgres table, the schema should ideally include most of the schema information. If that data must be loaded into some other RDBMS, you then have all you need to safely read the data into the system.

Development and Maintenance

ETLs are always going to be changing. New analytics systems will be used and new sources of data will be created. As the source system constraints change, so do the constraints of the ETL system, again, with the ETL system being the lowest priority.

Since we can rely on ETLs changing and breaking, it is critical to raise awareness of maintenance within the system.

The key to creating a maintainable system is to build up from small tools. The reason is that as you create small abstractions at a low level, you can reuse them easily. The trade-off is that in the short term, more code is needed to accomplish common tasks. Over time, you find patterns specific to your organization’s requirements that allow repetitive tasks to be abstracted into tools.

The converse to building up an ETL system based on small tools is to use a pre-built execution system. Unfortunately, pre-built ETL systems have been generalized for common tasks. As we’ve said earlier, ETLs are often changing and require more attention than a typical distributed system. The result is that using a pre-built ETL environment often means creating ETLs that allow the pre-built ETL system to do its work!

Testing

Our goal for our ETLs is to make them extremely easy to test. There are many facets to testing ETLs such as unit testing within an actual package. The testing that is most critical for development and maintenance is simply being able to quickly run and test a single step of an ETL.

For example, let’s say we have an ETL that dumps a table, reformats some rows and creates a 10GB gzipped CSV file. I only mention the size here as it implies that it takes too long to run over the entire set of data every time while testing. The file will then be uploaded to S3 and a central data warehouse system notified. Here are some steps that the ETL might perform:

  1. Dump the table
  2. Create a schema
  3. Process the rows
  4. Gzip the output
  5. Upload the data
  6. Update the warehouse

Each of these steps should be runnable:

  • locally on a fake or testing database
  • locally, using a production database
  • remotely using a production database and testing system (test bucket and test warehouse)
  • remotely using the production database and production systems

By “runnable”, I mean that an ETL developer can run a command with a specific config and watch the output for issues.

These steps are all pretty basic, but the goal with an ETL system is to abstract the pieces that can be used across all ETLs in a way that is optimal for your system. For example, if your system is consistently streaming, your ETL framework might allow you to chain file handles together:

$ dump table | process rows | gzip | upload

Another option might be that each step produces a file that is used by the next step.

Both tactics are valid and can be optimized over time to help distill ETLs to the minimal, changing requirements. In the above example, the database table dump could be abstracted to take the schema and some database settings to dump any table in your databases. The gzip, upload and data warehouse interactions can be broken out into a library and/or command line apps. Each of these optimizations is simple enough to be included in an ETL development framework without forcing a user to jump through a ton of hoops when a new data store needs to be considered.
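
As a minimal sketch, one such step could be a standalone command that reads rows on stdin and writes reformatted rows to stdout, so it can be chained with pipes or with intermediate files (the row format and field names are made up for illustration):

import json
import sys

def process_row(row):
    # A hypothetical per-row transformation.
    row['name'] = row['name'].strip().lower()
    return row

for line in sys.stdin:
    row = json.loads(line)
    sys.stdout.write(json.dumps(process_row(row)) + '\n')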

An ETL Framework

Making it easy to develop ETLs means a framework. We want to create a Ruby on Rails for writing ETLs that makes it easy enough to get the easy stuff done and powerful enough to deal with the corner cases. The framework revolves around the schema and the APIs to the different systems and libraries that provide language specific APIs.

At some level the framework needs to allow the introduction of other languages. My only suggestion here is that other languages are abstracted through a command line layer. The ETL framework can eventually call a command that could be written in whatever language the developer wants to use. ETLs are typically used to export data for a system whose users are reasonably technical. Someone using this data most likely has some knowledge of a language such as R, Julia or maybe JavaScript. It is these technically savvy data wranglers we want to empower with the ETL framework, in order to allow them to solve small ETL issues themselves and provide reliability where the system can be flaky.

Open Questions

The system I’ve described is what I’m working on. While I’m confident the design goals are reasonable, the implementation is going to be difficult. Specifically, the task of generically supporting many languages is challenging because each language has its own ecosystem and environment. Python is an easy language for this task because it is trivial to connect to an Ubuntu host and have a good deal of the ecosystem in place. Other languages, such as R, probably require some coordination with the cluster provisioning system to make sure base requirements are available. That said, it is unclear if other languages provide small environments like virtualenvs do. Obviously typical scripting languages like Ruby and JavaScript have support for an application-local environment, but I’m doubtful R or Julia would have the same facilities.

Another option would be to use a formal build / deployment pattern where a container is built. This answers many of the platform questions, but it brings up other questions such as how to make this available in the ETL Framework. It is ideal if an ETL author can simply call a command to test. If the author needs to build a container locally then I suspect that might be too large a requirement as each platform is going to be different. Obviously, we could introduce a build host to handle the build steps, but that makes it much harder for someone to feel confident the script they wrote will run in production.

The challenge exists because our hope is to empower semi-technical ETL authors. If we compare this goal to people who can write HTML/CSS vs. programmers, it clarifies the requirements. A user learning to write HTML/CSS only has to open the file in a web browser to test it. If the page looks correct, they can be confident it will work when they deploy it. The goal with the ETL framework and APIs is that the system can provide a similar workflow and ease of use.

Wrapping Up

I’ve written a LOT of ETL code over the past year. Much of what I propose above reflects my experiences. It also reflects the server environment in which these ETLs run as well as the organizational environment. ETLs are low priority code, by nature, that can be used to build first class products. Systems that require a lot of sysadmin time, server resources or have too specific an API may still be helpful moving data around, but they will fall short as systems evolve. My goal has been to create a system that evolves with the data in the organization and empowers a large number of users to distribute the task of developing and maintaining ETLs.

Dadd, ErrorEmail and CacheControl Releases

I’ve written a couple of new bits of code that seemed like they could be helpful to others.

Dadd

Dadd (pronounced Daddy) is a tool to help administer daemons.

Most deployment systems are based on the idea of long running processes. You want to release a new version of some service. You build a package, upload it somewhere and tell your package manager to grab it. Then you tell your process manager to restart it to get the new code.

Dadd works differently. Dadd lets you define a short spec that includes the process you want to run. A dadd worker then will use that spec to download any necessary files, create a temporary directory to run in and start the process. When the process ends, assuming everything went well, it will clean up the temp directory. If there was an error, it will upload the logs to the master and send an email.

Where this sort of system comes in handy is when you have scripts that take a while to run and that shouldn’t be killed when new code is released. For example, at work I manage a ton of ETL processes to get our data into a data warehouse we’ve written. These ETL processes are triggered with Celery tasks, but they typically will ssh into a specific host, create a virtualenv, install some dependencies, and copy files before running a daemon and disconnecting. Dadd makes this kind of processing more automatic because it can run these processes on any host in our cluster. Also, because the dadd worker builds the environment, it means we can run a custom script without having to go through the process of a release. This is extremely helpful for running backfills or custom updates to migrate old data.

I have some ideas for Dadd such as incorporating a more involved build system and possibly using lxc containers to run the code. Another inspiration for Dadd is setting up nodes in a cluster. Oftentimes it would be really easy to just install a couple of Python packages, but most solutions are either too manual or require a specific image to use tools like Chef, Puppet, etc. With Dadd, you could pretty easily write a script to install and run it on a node and then let it do the rest regarding setting up an environment and running some code.

But, for the moment, if you have code you run by copying some files, Dadd works really well.

ErrorEmail

ErrorEmail was written specifically for Dadd. When you have a script to run and you want a nice traceback email when things fail, give ErrorEmail a try. It doesn’t do any sort of rate limiting and the server config is extremely basic, but sometimes you don’t want to install a bunch of packages just to send an email on an error.

When you can’t install Django or some other framework for an application, you can still get nice error emails with ErrorEmail.

CacheControl

The CacheControl 0.10.6 release includes support for calling close on the cache implementation. This is helpful when you are using a cache via some client (i.e. Redis) and that client needs to safely close the connection.

Ugly Attributes

At some point in my programming career I recognized that Object Oriented Programming is not all it’s cracked up to be. It can be a powerful tool, especially in a statically typed language, but in the grand scheme of managing complexity, it often falls short of the design ideals that we were taught in school. One area where this becomes evident is object attributes.

Attributes are just variables that are “attached” to an object. This simplicity, unfortunately, means attributes require a good deal more complexity to manage in a system. The reason is that languages do not provide any tools to respect the perceived boundaries that an attribute appears to provide.

Let’s look at a simple example.

class Person(object):

    def __init__(self, age):
        self.age = age

We have a simple Person object. We want to be able to access the person’s age by way of an attribute. The first change we’ll want to make is to make this attribute a property.

from datetime import datetime

class Person(object):
    def __init__(self, year, month, day):
        self.year = year
        self.month = month
        self.day = day

    @property
    def age(self):
        age = datetime.now() - datetime(self.year, self.month, self.day)
        return age.days / 365

So far, this feels pretty simple. But let’s get a little more realistic and presume that this Person is not a naive object, but one that talks to a RESTful service in order to get its values.

A Quick Side Note

Most of the time you’d see a database and an ORM for this sort of code. If you are using Django or SQLAlchemy (and I’m sure other ORMs are the same) you’d see something like:

user = User.query.get(id)

You might have a nifty function on your model that calculates the age. That is, until you realize you stored your data in a non-timezone-aware date field and, now that your company has started supporting Europe, some folks are complaining that they are turning 30 a day earlier than they expected...

The point is that ORMs do an interesting thing, and it is your only logical choice if you want to ensure your attribute access is consistent with the database: ORMs MUST create new instances for each query and provide a SYNC method or function to ensure instances are updated. Sure, they might have an eager-commit mode or something, but Stack Overflow will most likely provide plenty of examples where this falls down.

I’d like to keep this reality in mind moving forward as it presents a fact of life when working with objects that is important to understand as your program gets more complex.

Back to Our Person

So, we want to make this Person object use a RESTful service as our database. Let’s change how we load the data.

class Person(ServiceModel):
    # We inherit from some ServiceModel that has the machinery to
    # grab our data from our service.

    @classmethod
    def by_id(cls, id):
        doc = conn.get('people', id=id).pop()
        return cls(**doc)

    @property
    def age(self):
        age = datetime.now() - datetime(self.year, self.month, self.day)
        return age.days / 365

    # This would probably be implemented in the ServiceModel, but
    # I'll add it here for clarity.
    def __getattr__(self, name):
        if name in self.doc:
            return self.doc[name]
        raise AttributeError('%s is not in the resource.' % name)

Now, assuming we get a document that has a year, month and day, our age property would still work.

So far, this all feels pretty reasonable. But what happens when things change? Fortunately in the age use case, people rarely change their birth date. But, unfortunately, we do have pesky time zones that we didn’t want to think about when we had 100 users and everyone lived on the west coast. The “least viable product” typically doesn’t afford thinking ahead that far, so these are issues you’ll need to deal with after you have a lot of code.

Also, the whole point of all this work has been to support an attribute on an object. We haven’t sped anything up. These are not new features. We haven’t even done anything clever with metaclasses or generators! The reality is that you’ve refactored your code 4 or 5 times to support a single call in a template.

{{ person.age }}

Let’s take a step back for a bit.

Taking a Step Back

Do not feel guilty for going down this rabbit hole. I’ve taken the trip hundreds of times! But maybe it is time to reconsider how we think about object oriented design.

When we think back to when we were just learning OO, there was a zoo. In this zoo we had the mythical Animal class. We’d have new animals show up at the zoo. We’d get a Lion, a Tiger and a Bear, and they would all need to eat. This modeling feels so right it can’t be wrong! And in many ways it isn’t.

If we take a step back, there might be a better way.

Let’s first acknowledge that our Animal does need to eat. But let’s really think about what it means to our zoo. The Animals will eat, but so will the Visitors. I’m sure the Employees would like to have some food now and then as well. The reason we want to know about all this sustenance is that we need to Order food and track its cost. If we reconsider this in the code, what if, and this is a big what if, we didn’t make eat a method on some class? What if we passed our object to our eat method?

eat(Person)

While that looks cannibalistic at first, we can reconsider our original age method as well.

age(Person)

And how about our Animals?

age(Lion)

Looking back at our issues with time zones, because our zoo has grown and people come from all over the world, we can even update our code without much trouble.

age(adjust_for_timezones(Person))

Assuming we’re using imports, here is a more realistic refactoring.

from myapp.time import age

age(Lion)

Rather than rewriting all our age calls for timezone awareness, we can change our myapp/time.py.

def age(obj):
    age = utc.now() - adjust_for_timezones(obj.birthday())
    return age.days / 365

In this idealized world, we haven’t thrown out objects completely. We’ve simply adjusted how we use them. Our age function depends on a birthday method. This might come from a mixin class we use with our Models. We also could still have our classic Animal base class. Age might even be relative, where you’d want to know how old an Animal is in “person years”. We might create a time.animal.age function that has slightly different requirements.
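
For example, the birthday method might come from a small mixin (a hypothetical sketch, reusing the year/month/day fields from earlier):

from datetime import datetime

class BirthdayMixin(object):
    # Supplies the birthday() method that age() relies on, assuming the
    # model provides year, month and day attributes.
    def birthday(self):
        return datetime(self.year, self.month, self.day)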

In any case, by reconsidering our object oriented design, we can remove quite a bit of code related to ugly attributes.

The Real World Conclusions

While it might seem obvious now how to implement a system using these ideas, it requires a different set of skills. Naming things is one of the two hard things in computer science. We don’t have obvious design patterns for grouping functions in dynamic languages in a way that makes the expectations clear. Our age function above would likely need some check to ensure that the object has a birthday method. You wouldn’t want every age call to be wrapped in a try/except.

You also wouldn’t want to be too limiting on type, especially in a dynamic language like Python (or Ruby, JavaScript, etc.). Even though there has been some rumbling for type hints in Python that seems reasonable, right now you have to make some decisions on how you want to handle communicating that some function foo expects some object of type Bar or with a method baz. These are trivial problems at a technical level, but socially, they are difficult to enforce without formal language support.

There are also some technical issues to consider. In Python, function calls can be expensive. Each function call requires its own stack frame, so many small nested functions, while well designed, can become slow. There are tools to help with this, but again, it is difficult to make this style obvious over time.

There is never a panacea, but it seems that there is still room for OO design to grow and change. Functional programming, while elegant, is pretty tough to grok, especially when you have dynamic language code sitting in your editor, allowing you to mutate everything under the sun. Still, there are some powerful themes in Functional Programming that can make your Object Oriented code more helpful in managing complexity.

Finally

Programming is really about layering complexity. It is taking concepts and modeling them in a language that computers can take and, eventually, consider in terms of voltage. As we model our systems we need to consider the data vs. the functionality, which means avoiding ugly attributes (and methods) in favor of orthogonal functionality that respects the design inherent in the objects.

It is not easy by any stretch, but I believe by adopting the techniques mentioned above, we can move past the kludgy parts of OO (and functional programming) into better designed and more maintainable software.