Ionrock Dot Org: Programming and Music

Announcing Withenv 0.7.0 (Tue, 06 Dec 2016)

I pushed a new version of Withenv and thought it might be helpful to discuss some new features.

No More Shells

In order to make things easy for me, I used shell=True when calling commands within Withenv. I've removed that and started parsing the shell commands myself in order to avoid the shell. Replacements should still work as expected.

This removes some security risk by making replacements explicit and avoiding the shell. Shells can leak environment information, and I'd like people to feel comfortable passing secrets via Withenv.
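The approach can be sketched in a few lines (this is a sketch of the idea, not Withenv's actual code): split the command with shlex, substitute variables explicitly against a known mapping, and hand the argv list straight to the OS, so no shell ever runs.

```python
import shlex
import subprocess

def expand(token, env):
    # Explicit $VAR replacement against a known mapping; the shell never runs.
    # (A real implementation would also handle ${VAR} and prefix collisions.)
    for name, value in env.items():
        token = token.replace("$" + name, value)
    return token

def run_without_shell(command, env):
    # shlex.split honors quoting the way a shell would, without executing one.
    argv = [expand(token, env) for token in shlex.split(command)]
    return subprocess.run(argv, env=env, capture_output=True, text=True)
```

Because the command is tokenized rather than evaluated, there is no opportunity for injection via `;` or backticks, and only the variables you pass in are visible.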


Commands and Pipes

Withenv allows dynamic environment variables to be injected by calling commands and scripts. This was always a little kludgy to me. I had hoped I could run something like /bin/bash and inspect the environment afterwards in order to reuse the local shell scripts people write to manage environments.

Instead, while working on a Go version of Withenv, I realized I could just load JSON or YAML from a command's output. There are tons of commands that will output JSON, so this seemed like a reasonable plan.
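The idea is small enough to sketch: run any command that prints a JSON object and fold the result into an environment mapping. The helper name here is mine, not Withenv's API.

```python
import json
import subprocess

def env_from_command(argv):
    """Run a command and return its JSON object output as environment values.

    Any command that prints a JSON object (cloud CLIs, jq, etc.) can feed
    the environment this way.
    """
    output = subprocess.run(
        argv, capture_output=True, text=True, check=True
    ).stdout
    # Environment values must be strings, so coerce whatever JSON gave us.
    return {key: str(value) for key, value in json.loads(output).items()}
```

The returned dict would then be merged over the current environment before starting the child process.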

With that in mind, you often want to use a piped command. For example, let's say I'm trying to get a secret token. I might do something like this:

$ curl -X POST $IDURL | jq .user.token

I went ahead and implemented piping between commands, so you can use this sort of thing in your Withenv YAML.

- TOKEN: "`curl -X $IDURL | jq .user.token`"
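Implementing that pipe without a shell amounts to wiring Popen objects together with OS pipes. Here is a hedged sketch of the idea (Withenv's own implementation may differ):

```python
import subprocess

def run_pipeline(*commands):
    """Connect argv lists with OS pipes, like `a | b`, without any shell."""
    procs = []
    prev_stdout = None
    for argv in commands:
        proc = subprocess.Popen(
            argv, stdin=prev_stdout, stdout=subprocess.PIPE, text=True
        )
        if prev_stdout is not None:
            prev_stdout.close()  # allow the upstream process to receive SIGPIPE
        prev_stdout = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()
    return output
```

Each stage's stdout becomes the next stage's stdin, which is exactly what the shell does for `|`, minus the shell itself.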

The Future

There are two things I've been working on. The first is feature parity with the Go version. This has been for my own practice and to provide a lighter-weight solution that might be more palatable to a larger audience. The second is implementing some config file templates.

I wrote a version of this in Go, but I'd like to consider one in Python that uses a popular template language. The idea is that you provide a template, and Withenv uses the environment as the context to write a file before starting the process. You could then configure the file to be deleted when the process exits, or after some specified time by which the upstream process should have already loaded it into memory.
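A minimal sketch of that lifecycle, using stdlib string.Template as a stand-in for whatever template language the real tool would use: render the environment into a file, and register a cleanup so the file disappears when the process exits.

```python
import atexit
import os
import string

def write_config(template_path, dest_path, env=None):
    """Render $VAR placeholders from the environment into a config file,
    and remove the rendered file when this process exits.

    string.Template keeps this dependency-free; a real tool might prefer
    a richer template language like jinja2.
    """
    env = dict(os.environ) if env is None else env
    with open(template_path) as f:
        rendered = string.Template(f.read()).safe_substitute(env)
    with open(dest_path, "w") as f:
        f.write(rendered)
    # Best-effort cleanup so secrets don't linger on disk after exit.
    atexit.register(lambda: os.path.exists(dest_path) and os.remove(dest_path))
    return rendered
```

The delete-after-a-delay variant would swap the atexit hook for a timer, on the assumption the app reads the file once at startup.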

This adds a lot of complexity, so we'll see if it makes it in. I originally wrote this functionality as a separate tool, which might still be the best course of action. I'd also like to support the same config language in the Go version, but, for now at least, I'm hoping I don't need to write a jinja2 parser in Go.

Configuration and Environments (Tue, 06 Dec 2016)

I've found myself increasingly frustrated with configuration when programming. It is not a difficult problem, but it gets complicated quickly, even though the intent is usually simplicity.

Let's consider a simple command line client that starts a server and needs some info such as a database connection, some file paths, etc. On the surface it seems pretty simple:

$ myservice --config service.conf

But, an operator might want to override values in the config via command line flags.

$ myservice --config service.conf --port 9001

This is all well and good, but what if you need to override something more sensitive that you don't want showing up in the process table, such as the database connection URL? In that case, the operator uses an environment variable.

$ export DBURL=mysql://appuser:secret@dbserver
$ myservice --config service.conf --port 9001

By the way, we haven’t talked about things like handling stdin so you can use pipes with your app.

Now, in your code you have to support all these sorts of inputs and make the configuration available to the rest of the code. This typically happens via a singleton of sorts that you import all over the place in your code and tests. This code needs to deal with command line arguments, read config files, and provide overrides from flags and the environment.
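The layering logic described above can be sketched as one merge with an explicit precedence order (the `MYSERVICE_` prefix and key names here are made up for illustration):

```python
import os

def layered_config(defaults, file_values, env_prefix="MYSERVICE_", cli_flags=None):
    """Merge configuration sources with explicit precedence:
    defaults < config file < environment < command line flags.
    """
    config = dict(defaults)
    config.update(file_values)
    # Environment overrides: MYSERVICE_PORT overrides the "port" key, etc.
    for key in defaults:
        env_key = env_prefix + key.upper()
        if env_key in os.environ:
            config[key] = os.environ[env_key]
    # Flags win over everything.
    config.update(cli_flags or {})
    return config
```

Most configuration frameworks implement some variant of this merge; the pain is that every application ends up re-deciding the precedence order and the plumbing around it.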

There are tons of frameworks to make these problems easier, but they get pretty complicated. What’s more, if you’ve already made some design decisions about how you want to pass around configuration in your application, there is a really good chance that library you chose won’t work with that model and you’ll need to adjust for that. If you have a lot of code, this can be a pretty difficult refactoring, especially if you realize the framework has a bug or doesn’t actually do what it says it does.

Is there a better way?

Maybe. Lately, I've been focusing on using only environment variables for configuration. This solves some issues in the code. I don't have to think about parsing command line flags or dealing with override operations. I also don't have to create some sort of global singleton, because languages provide access to the environment directly. This doesn't answer questions like type conversion, but generally, it is much simpler.
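The type conversion question is small enough to solve with a helper like this (a minimal sketch, assuming nothing beyond the standard library):

```python
import os

def env_value(name, default, cast=str):
    """Read an environment variable with a fallback and a type conversion."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    if cast is bool:
        # Environment values are strings, so define truthiness explicitly.
        return raw.lower() in ("1", "true", "yes", "on")
    return cast(raw)
```

With that, `env_value("PORT", 8080, int)` covers the common cases without any framework.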

The reason it is simpler is withenv. Using we, I can read config files as YAML and JSON, use the environment, layer overrides as needed, and codify that layering on the filesystem. I even have a good way to load dynamic values to avoid storing secrets on the file system. The downside is that we becomes a dependency, so this isn't a reasonable solution for some command line tool you distribute broadly. But if you're running network services and driving an operational code base, depending on we is a great way to reduce friction between different systems and have an extremely cross-platform means of communicating configuration to your applications.

The larger philosophy at play here is the division between declaring configuration and communicating that configuration to the application. These two aspects are often tightly coupled, resulting in a non-trivial amount of glue code spread throughout an application's code base. It also leaks into the operational code, as configuration files need to be written and rewritten, adding an unnecessary layer of abstraction in order to meet the needs of the application.

Environment variables can be tricky though. While it is non-trivial for an ordinary user to inspect the environment of a running process, it may be easy for an attacker to replace the application that is meant to run and capture sensitive data. Starting processes via a shell can also lend itself to executing dangerous code. With that said, there are many tools to aid in keeping a production filesystem secure, such that these sorts of attacks can be avoided for the most part, leaving only rare attack vectors available.

But my app doesn’t use environment variables

The one gotcha with a system like this is that not all apps use environment variables. I've tried to deal with this in withenv by allowing a config file to be written from a template before starting the app. Until then, the best I can offer is to write something similar yourself.


I've discussed this topic with some folks and have seen a fairly wide spectrum of positive and negative comments. For the folks that have tried withenv and seen how to extend it, it becomes a critical tool. This makes me believe it is a good method for constructing an environment and providing it to a process. But you never know until you try, so please give it a go the next time you find yourself exporting environment variables for some code, and see if withenv improves anything for you.

Announcing Bach (Sun, 04 Dec 2016)

A while back I wrote a tool called withenv to help manage environment variables. My biggest complaint with systems that depend on environment variables is that it is really easy for the variables to be sticky. For example, if you use docker-machine or docker swarm, you need to be very careful that your environment variables are configured or else you could be using the wrong connection information.

Another change for me recently has been using Go regularly. I've found the language easy to learn (coming from Python), fast, and a lot of fun for the type of problems I've been solving at work.

So, with some existing tooling around that I’d wanted to improve upon and with some new ideas in tow, I started writing Bach.

The basic idea is that Bach helps you orchestrate your process environment. While it is a clever name, I don't know that it really makes clear what the bach tools are meant to do, so I'll try to clarify why I think these tools can be helpful.

Say we have a web application that we want to run on 3 nodes with a database and load balancer. The application is pretty old, meaning it was written with the intent of deploying on traditional bare metal servers. Our goal is to take our app and run it in the cloud on some containers without making any major changes to the application.

When you start looking at containers and the cloud, it can be very limiting to consider a world where you can't just ssh into the box, make a configuration change, and be done. Even when you use containers, assuming the target platform will provide a way to utilize volumes is a bit tricky. These limitations can be difficult to work around without changing the code. For example, changing code that previously wrote to the file system to instead write to some service like Amazon S3 is non-trivial and introduces a pretty big paradigm shift for code and operations.

Bach is meant to provide tooling that makes these sorts of transitions easier, such that the operations code base doesn't dictate the developer experience, and vice versa. As a secondary goal, the bach tools should be easy to run and verify independently of each other, while working together in unison.

Going back to our example, let's think about how we'd configure our application to run in our container environment. Let's assume that we can't simply mount a config directory for our application, and instead need to pass configuration to the container via environment variables. Our app used to run with the following command.

$ app --config /etc/app.conf

Unfortunately, that won’t work now. Here is where the Bach tools can be helpful.

First off, our application needs a database endpoint. We'll use withenv (we) to find the URL and inject it into our environment before starting our app. Let's assume we use DNS to expose what services live at what endpoints. Here is a little script we might write to get our database URL.


#!/bin/sh
# The environment will provide a SERVICES environment variable that
# is the IP of the DNS server we use for service discovery. ENV_NAME
# is the name of the environment (test, prod, dev, branch-foo, etc.).
DBIP=$(dig @"$SERVICES" +short db."$ENV_NAME".service.list)
echo "{\"DBURI\": \"$DBIP\"}"

Now that we have the script, we can use we to inject its output into our environment before running our app.

$ we -s env | grep DBURI

Now that we know we can get our DBURI into our environment, we still need to add it to our application's configuration. For that, we'll use toconfig. We use a simple template to write the config before running our app.

dburi = {{ DBURI }}

We can test this template by running the following command.

$ DBURI=mysql:// toconfig --template app.conf.tmpl

This will print the resulting template for review.
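For illustration, the {{ VAR }} substitution a toconfig-style tool performs could be sketched like this in Python (this is a minimal stand-in, not toconfig's actual implementation):

```python
import os
import re

def render(template, env=None):
    """Replace {{ VAR }} placeholders with values from an environment mapping."""
    env = dict(os.environ) if env is None else env
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        # Missing variables render as empty strings in this sketch;
        # a real tool might prefer to fail loudly instead.
        lambda match: env.get(match.group(1), ""),
        template,
    )
```

Failing loudly on a missing variable is probably the better design for config files, since a silently empty `dburi` is a painful bug to track down.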

With both these pieces in place, we can start to put things together.

$ we -s \
  toconfig --template /etc/app.conf.tmpl \
           --config /etc/app.conf \
  app --config /etc/app.conf

Now, when we switch our command in our container to run the above command, we get to run our app in the new environment without any code changes while still capitalizing on new features the environment provides.

If that command is a bit too long, we can copy the arguments to a YAML file and run it with the bach command.

# setup.yml
- withenv:
  - script:
- toconfig:
  template: /etc/app.conf.tmpl
  config: /etc/app.conf

Then our command becomes a bit shorter.

$ bach --config setup.yml app --config /etc/app.conf
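The chaining convention at work here, where each tool's trailing arguments are the next command to run, can be sketched as simple argv composition (a sketch of the idea, not bach's code; the script and path names are made up):

```python
def compose(*wrappers, app):
    """Flatten wrapper argv lists around the final app command.

    Each wrapper tool execs whatever follows its own flags, so chaining
    them is just concatenation.
    """
    argv = []
    for wrapper in wrappers:
        argv.extend(wrapper)
    argv.extend(app)
    return argv
```

So `compose(["we", "-s", "db.sh"], ["toconfig", "--template", "app.conf.tmpl"], app=["app"])` yields the single command line that runs we, which runs toconfig, which finally runs the app.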

At the moment these are the only released apps that come with Bach. With that said, I have other tools to help with different tasks:

  • present: This runs a script before a command starts and another after it exits. The idea was to automate service discovery mechanisms by letting the app join some cluster on start and leave when the process exits.
  • cluster: This provides some minimal clustering functionality. When you start an app, it will create a cluster if none exists or join one if it is provided. You can then query any member of the cluster to get the cluster state and easily pass that result into the environment via we and a script (i.e. we -s 'cluster nodes').

At the moment, the withenv docs should be correct for Bach’s we command. I’m still working on getting documentation together for toconfig and the other tools, so the source is your best bet for reference.

If you try any of the tools out, please let me know!

Learning Golang (Wed, 21 Sep 2016)

Lately I've had the opportunity to spend some time with Go. I've been interested in the language for a while now, but had only done a couple of toy projects. The best way to get a feel for a language is to spend some consecutive time writing a real program that others will be using. Having a real project makes it possible to get a really good picture of what software development is like in a language.

First Impressions

When I first started going beyond toy projects, the most difficult aspect was the package (or module, in Python terms) system. In Python, you have an abstraction of modules that uses the filesystem but requires extra tooling to include other code. Other languages, like Ruby (if I remember correctly!), can use a more direct include style of module system where you include the code you want. PHP is a great example of this simple module pattern, where you truly just include other code as if it were written within your own. Go does something different: a package is really a directory, and the import statement includes everything in that directory that has the same package declaration. I'm sure there is more to it, but this crude understanding has been enough to be dangerous.

One aspect that drew me to Go was the simplified deployment. Go seems to be focused on producing extremely portable binaries that can be used on a wide array of systems. The result is that a Go binary built on Linux can typically be copied to any other Linux machine without any issues. While I'm sure the same is possible with C/C++ or any other compiled language, Go made it an early feature to produce all-inclusive binaries that don't depend on anything on the target machine. As I'm coming from Python, I'm not an expert in these sorts of things, so this is really just my impression, validated by my limited experience.

Finally, Go is reasonably fast without being incredibly complex. I don't have to manage memory, but I do get to play with pointers. Concurrency is a core feature of the language that helps implement fun features of Python like generators. Goroutines are poised to do the thing I always wished Python did: take care of work regardless of whether it is async I/O or CPU bound. Lastly, it doesn't have a warm-up period like you'd see in a JVM language, which makes it suitable for small scripts.

But What About Python?

I still enjoy Python, but the enjoyment comes from feeling fluent in the language more than from any particular set of features. The thing I like most about programming in Python is that I feel very comfortable banging out code and knowing that it is reasonably well written, has tests, and can function well within the larger Python ecosystem. This fluency is not something I'm willing to drop due to some hype around another language. But there are some frustrating aspects of programming in Python that make me want to try something new.

The first is the packaging landscape. I don’t mean to suggest that pip is terrible or wheels are a huge mistake. Instead, it is the more general pattern of shipping source code to an environment in order to run the code. There are TONS of great tools to make this manageable and containers only make it even easier, so again, I’m not condemning the future of Python deployments. But I am tired of it.

In Python, I need to make a ton of decisions (some that have been made for me), such as whether or not to use the system packages or a virtualenv. I have to consider how to test a release before releasing it. I have to establish how I can be sure that what I tested is the same as what I'm releasing. All these questions can become subtly complex over time, and it gets tiring.

The second aspect of Python that is frustrating is performance. Generally, it is fast enough, except when it isn't. It is when things need to be faster that Python becomes painful. If you are CPU bound, get ready to scale out those processes! If I/O is where you are lacking, there is a plethora of async I/O libraries to help access I/O efficiently. But dealing with both CPU and I/O bound issues is complicated, especially when you are using libraries that may not be compatible with the async I/O framework you chose. It is definitely possible to write fast Python code, but there is a lot of work to do, none of which makes deployment easier, hence it is a little tiring.

When all else fails you can write a C module, use Cython or try another interpreter. Again, all interesting ideas, but the cost shows up in complexity and in the deployment.

There is no easy fix to these problems and it isn’t as fun as it used to be to think about solving it.

So Why Go Then?

The question, then, is why Go is a contender to unseat my most fluent language, Python. The simple answer is that I'm tired of the complexity. The nice thing about Go is that it takes care of some essential aspects I don't care to delve into, specifically memory management and concurrency. Go is also fast enough that I should not have to be terribly concerned about adopting new programming paradigms to work around some critical bottleneck. While I'm confident that I could make Python work for almost anything, at some point it feels like it has become a hammer, and I'd like to start doing things other than hitting nails.

Now, even though I'm learning Go and have been excited about the possibilities, it doesn't change the fact that most of my day to day work is in Python. Maybe that will change at some point, but I don't see it changing in the foreseeable future, which is totally fine by me.

I'm sure there are issues with Go that I haven't run into yet. Static typing has been challenging to use after the massive freedom of Python. What is appealing is that less-than-perfect Go code has been good enough in quality for production use. That is exciting, because it means that while I learn more about the language and community, I'm not precluded from being productive and learning new things.

Balancing Microservices (Fri, 03 Jun 2016)

Microservices are often touted as a critical design pattern, but really they are just a tactic for managing certain vectors of complexity by decoupling the code from operations. For example, if you have a large suite of APIs, each driven by a different microservice, you can iterate and release different APIs as needed, reducing the complexity of releasing a single monolithic API service. In this use case, microservices allow more concurrent work between teams by decoupling the code and the release process.

Another use case for microservices is decoupling resource contention. Let's assume you have a database that is used by a few different apps. You could remove this coupling by using separate services that each manage their own data plane, removing the contention that exists between services sharing the same database.

From a design standpoint, this is non-trivial. The first example of a huge API suite can be implemented primarily through load balancing rules without much issue. The database example is more difficult, because there will need to be a middleman ensuring data is replicated properly. Just like normalization in databases, the normalization that occurs with microservices can be costly, as it requires more work (expense) when those services need consistency.

Another expense is designing and maintaining APIs between the services. Refactoring becomes more complicated because you have to consider that old code is still handling messages in addition to the new code. Before, the interactions were isolated to the code, but with microservices the APIs need some level of backwards compatibility.

The one assumption with microservices that makes them operationally feasible is automation. Artifacts should be built and tested reliably in your CI pipeline. Deployments need to be driven by a configuration management system (e.g. Chef, Ansible). Monitoring needs to be in place and communicated to everyone doing releases. The reason all this automation is so critical is that without it, you can't possibly keep track of everything. Imagine having 10 services, each scaled out to 10 "nodes", and you begin to see why it is difficult to manage this sort of system without a lot of automation.

Even though it is expensive, it might be worth it. Incidents are one area where microservices can be valuable, because they provide a smaller space to search for root causes. Rather than having to examine a single large codebase, you can (hopefully) review a single service in semi-isolation and determine the problem. Assuming your microservices are designed properly, even though the number of services is large, the hope is that problems will be limited to small services and can be more easily debugged and fixed.

Microservices require balance. Using microservices is a tactic, not a design. Lots of small processes don't make anything easier, but lots of small processes that divide operational concerns and decouple systems can be really helpful. Like anything else, it is best to measure and understand the reasoning before breaking up code into microservices.

REST Command Line Interface (Tue, 03 May 2016)

At work we have a support rotation where we're responsible for handling ticket escalations. Typically, this is a somewhat rare event that requires the team to get involved, thanks to the excellent and knowledgeable support folks. But when there is an issue, it can be a rough road to finding the answers you need.

The biggest challenge is simply being out of practice. We have a few different APIs to use, both internal and external, that use different authentication mechanisms. Using something like cURL definitely works, but gets rather difficult over time. It doesn't help that I'm not a bash expert who can bend and shape the world to my will. There are other tools, like httpie, that I'd like to spend some more time with, but I never seem to remember them in the heat of the moment. Some coworkers delved into this idea for a bit and wrote some tools, but my understanding is that it was still very difficult to get around the verbosity in a generic enough way for the approach to really pay off.

Looking at things from a different perspective, what if you had a shell of sorts? Specifically, it doesn't take much to configure something like ipython with some builtins for reading auth from known files and some helpful functions to make things easier. You could also progressively improve the support over time. For example, I can imagine writing small helpers for following links, dealing with pagination, or finding data in a specific document. Lastly, I can imagine it would be beneficial to store some state in between sessions in order to bounce back and forth between your HTTP shell and some other shell.
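As a sketch of the pagination helper idea, here is a tiny generator that follows "next" links until they run out. The "items" and "next" key names are assumptions, since every API spells its link convention differently; `fetch` is any callable that turns a URL into a parsed JSON dict (a requests wrapper, in practice).

```python
def follow_pages(fetch, url):
    """Yield every item from a paginated API, following `next` links.

    `fetch` maps a URL to a parsed JSON dict with optional "items"
    and "next" keys.
    """
    while url:
        page = fetch(url)
        yield from page.get("items", [])
        url = page.get("next")  # None (or missing) ends the iteration
```

In an ipython-based HTTP shell, a handful of helpers like this, plus stored auth, would cover most of what makes raw cURL painful during an escalation.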

Seeing as this doesn't seem too revolutionary, I'm curious whether others have investigated it already. I'm also curious how others balance a generic command line interface, API-specific tooling, and reusability over time, without adding a million flags and more to learn than just using cURL!

Balancing Security and Convenience (Mon, 04 Apr 2016)

There is always a question of whether to use a hosted service or manage a service yourself. It is a tough question, because the answer changes over time depending on your business needs. A startup might be totally fine using GitHub and Slack, while the size of Google means rolling your own solution. The arguments regarding hosted services revolve around security, and more specifically, the sensitive data you make available by using these services.

There are many who would argue that owning a service is more secure. While it is true that you may send fewer bits to a third party, that really says nothing about security. A hosted service such as GitHub or Slack is already targeted by hackers. While I'm sure there are vulnerabilities, popular hosted services have been vetted by huge amounts of usage. It is in the provider's interest to provide a secure and reliable service, constantly improving infrastructure and security over time. Running a service at scale shakes loose quite a few bugs that contain difficult-to-find attack vectors.

Even if a service is reasonably secure, there is still the risk of trusting your data to another company. Unless you have clients that specifically disallow this, I'd argue that avoiding hosted services is not worth the cost. Successful hosted services generally have a community of supporters that have done the work of integrating with the platform. That makes hooking up your bug tracker with your build system, chat, and monitoring trivial. Sorting out all the bits to make this work in an internal environment means writing, debugging, and maintaining code alongside operating each dependent system. That is far from impossible, but it is certainly expensive when you need developers and operators working on more pressing issues. The irony here is that by avoiding the hosted service, you've essentially made local development more difficult and reduced the ability of your development pipeline to improve the code.

To put this in financial terms, let's say you have a team of 5 people and the average salary is $100k, or about $48 / hour. If the team spends 10 hours a week on the CI/CD system and operating tools like chat, it costs $480 / week, or ~$25k per year. That doesn't seem too bad, but it doesn't include the opportunity cost of an effective build system catching bugs, or the initial development time to get these systems up and running and talking to each other. You might need to get hardware within the network, set up firewalls, and configure secure routes via VPNs to allow remote developers to use the system. At that point you might have involved the time of another 15 people and spent at least 3+ months of your team's time getting the initial system up and running, noting that it is all code you'll need to maintain. Also note that this says nothing about problems that come up.
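Spelled out, the back-of-the-envelope arithmetic looks like this (assuming a 40-hour week, 52 weeks a year, and 10 total team hours a week on tooling):

```python
HOURLY = 100_000 / (52 * 40)  # $100k salary over a 2080-hour year: ~$48/hour
WEEKLY = 10 * HOURLY          # 10 team hours a week on tooling: ~$480/week
YEARLY = 52 * WEEKLY          # ~$25k per year, before any setup costs
```

And that $25k is only the steady-state figure; the setup costs and opportunity costs described above sit on top of it.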

The fact is, it is really expensive to design, build, and operate a suite of services simply to avoid having some bits on another person's computer. It seems better to focus on making your team more productive by providing helpful tools they don't have to manage, and to prepare mitigation plans for how to recover from a security breach or service failure. Obviously there are dangers, but mitigating them is less work than rebuilding a service, along with its integrations, from scratch.

I’m curious what others have experienced when choosing an external service over a DIY solution. Did you feel the DIY solution was full featured? Did you get burned choosing a hosted solution? Let me know!

The Rough Edges of OpenStack (Mon, 22 Feb 2016)

I'm thankful Rackspace sent me to Tokyo for the OpenStack Summit. Besides the experience of visiting Tokyo, the conference and design summit proved to be a great experience. Going into the summit, I had become a little frustrated with the OpenStack development ecosystem. There seemed to be a lot of decisions that separated it from more mainstream Python development, with little actual reason for diverting from the norm. Fortunately, the summit contextualized the oddities of developing OpenStack code, which made things more palatable as a developer familiar with the larger Python ecosystem.

One thing that hit me at the summit was that OpenStack is really just 5 years old. This may not seem like a big deal, but when you consider the number of projects and the amount of code, infrastructure, and contributors that have made all this happen, it is really impressive. It is a huge community of paid developers that has managed to get amazing amounts of work done, making OpenStack a functioning and reasonably robust platform. So, while OpenStack is an impressive feat of software development, it still has a ton of rough edges, which should be expected!

As a developer, it can be easy to come to a project, see how things are different, and feel as though the code and/or project is of poor quality. While this is a bad habit, we are all human and fear things we don't understand! The best way to combat this silliness is to try to educate those coming to a new project on what can be improved. In addition to helping recognize ways to improve a project, it helps new developers feel confident enough, when they hit rough edges, to dig deeper and fix the problems.

With that in mind, here are some rough edges of OpenStack that developers can expect to run into, and hopefully, we can fix!

OpenStack Doesn’t Mean Stable

You'll quickly find that "OpenStack" as a suite of projects is HUGE. Each of these projects is at a different stage of stability. One project's documentation may be lacking while its code is well tested and reliable; another may have docs and nothing else. It is critical to keep in mind when developing for an OpenStack project that the other OpenStack requirements, and there will be TONS, may not be entirely stable.

What this means is that it is OK to dig deep when trying to figure out why something doesn't work as expected. Don't be afraid to check out the latest version of a library and dive into the source to see what is going on. Add some temporary print / log messages to get some visibility into what's happening. Write a small test script to get some context. All the tactics you'd use with your own internal libraries are the same ones you should use with ANY OpenStack project.

This is not to say that OpenStack libraries aren't stable, but you can't assume that just because something has the label "oslo" or "OpenStack", it has been tested and considered working. The inclusion of libraries or applications in OpenStack, from a development standpoint, has more to do with answering the question "Does the community need this?" Inclusion means that the community has identified a need, not a fully fleshed out solution.

Not Invented Here

OpenStack is an interesting project because the essence of what it provides is an interface to existing software. Nova, for example, provides an interface to start VMs. Nova doesn't actually do the work of a hypervisor; instead, it leaves that to pluggable backends that work with things like Xen or KVM. Right off the bat, this should make it clear that when a project is labeled as the OpenStack solution for X, it most likely means it provides an interface for managing some other software that implements the actual functionality.

Designate and Octavia are two examples of this. Designate manages DNS servers. You get a REST interface that can update different DNS servers like BIND or PowerDNS (or both!). Designate handles things like multi-tenancy, validation, and storing data in order to provide a reliable system. Octavia does a similar task, but specifically for haproxy.

It doesn’t stop there though. OpenStack aims to be as flexible as possible in order to cater to the needs / preferences of operators. If one organization prefers Postgres over MySQL that should be supported because an operator will need to manage that database. The result is that many libraries tend to provide the same sort of wrapping. Tooz and oslo.messaging, for example, provide access to distributed locking and queues respectively. Abstractions are created to consistently support different backends, so projects can not only provide flexibility for core functionality, but also the services that support the application.
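That abstraction pattern is easy to sketch in Python terms: define a small interface and let each backend implement it. The class names below are invented for illustration (this is not tooz’s actual API), but the shape is the same:

```python
import abc
import threading


class Lock(abc.ABC):
    """Abstract lock interface every backend must implement."""

    @abc.abstractmethod
    def acquire(self):
        """Take the lock, returning True on success."""

    @abc.abstractmethod
    def release(self):
        """Give the lock back."""


class LocalLock(Lock):
    """In-process backend: fine for tests or a single node."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        return self._lock.acquire()

    def release(self):
        self._lock.release()


# Operators pick the backend in config; application code never changes.
BACKENDS = {"local": LocalLock}


def get_lock(backend="local"):
    return BACKENDS[backend]()
```

A memcached or ZooKeeper backend would just be another entry in BACKENDS, chosen by the operator’s configuration rather than by the application.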

In the cases where there really was a decision to reimplement some library, it is often due to an incompatibility with another library. A good example of this is oslo.messaging. It supports building apps with an async or background worker pattern, much like celery. This makes one wonder: why not just use celery? My understanding is that celery has been tried in many different projects and it wasn’t a good fit within the community at large.

By the way, my vague answer of “it wasn’t a good fit” is intentional. There are so many projects in OpenStack that these questions are often brought up again and again. A new project is started and the developers try out other libs, like celery, because it is a good solution that is well supported. Sometimes the technical challenge of integrating with other services is a problem, while other times the dependencies of the library aren’t compatible with something in OpenStack, making it impossible to resolve dependencies. I’m sure there are cases where someone just doesn’t like the library in question. No matter what the reasons are, OpenStack has such a huge surface of software that it is hard for new libraries to be a “good fit”, so sometimes it is easier to rewrite something specifically for OpenStack.


Packaging

OpenStack is committed to providing software that can be deployed via distro packages. In other words, OpenStack wants to make yum install openstack or apt-get install openstack work. It is a noble goal, especially for a suite of applications written in Python, all moving at radically different rates of change.

You see, distro package managers have different priorities than most Python authors may have when it comes to packaging. A distro is something of an editor, ensuring that all the pieces for all the use cases work reliably. This is how Red Hat provides general “linux” support, by knowing that all the pieces work together. Python REST services (like OpenStack), on the other hand, typically assume that the person running it uses some best practices such as a separate virtualenv for each application. This design pattern means that at the application level, the dependencies are isolated from the rest of the system.

Even though the vast majority of OpenStack operators don’t rely on system packages in a way that requires all projects use the same versions, it is an implementation detail OpenStack has adopted. As a developer, you have to be ready to deal with this limitation, and more importantly, the impact it has on your ability to introduce new code. I believe that this restriction is most likely to blame for the Not Invented Here nature of much OpenStack tooling, which leads to reimplementations that are not very stable.

Why Develop for OpenStack?

If OpenStack has such a rough development experience, why should you commit to learning it and developing on OpenStack software?

You’ll remember, I began all this with a recognition that OpenStack is only 5 years old. Things will continue to change and, I believe, improve. Many of the rough edges of OpenStack have been caused by growing pains. There is a crazy amount of code being written, and it takes time and effort to improve development patterns. Even though it can be rough at first to develop in OpenStack, it gets better.

Another reason to develop OpenStack code is that it is exciting work. OpenStack includes everything from low-level, high-performance systems to distributed CPU-intensive tasks to containers and micro-services. If you enjoy scaling backend applications, OpenStack is a great place to be. The community is huge, with loads of great people. OpenStack also makes for a very healthy career path.

No project is perfect, and OpenStack is no different. Fortunately, even though there are rough edges, OpenStack is a great project to write code for. If you are new to OpenStack development and need a hand, please don’t hesitate to reach out!

Tue, 10 Nov 2015 00:00:00 +0000 <![CDATA[Development in the Cloud]]> Development in the Cloud

I’ve recently made an effort to stop using local virtual machines. This has not been by choice, but rather because OS X has become extremely unstable as of late with VirtualBox and seems to show similar behavior with VMWare. Rather than trying to find a version of VirtualBox that is more stable, I’m making an effort to develop on cloud servers instead.

First off, to aid in the transition, I’ve started using Emacs in a terminal exclusively. While I miss some aspects of GUI Emacs, such as viewing PDFs and images, it generally hasn’t been a huge change. I’ve had to do some fiddling as well with my $TERM in order to make sure Emacs picks up a value that provides a readable color setting.

Another thing I started doing was getting more familiar with byobu and tmux. As Emacs does most of my window management for me, my use is relatively limited. That said, it is nice to keep my actual terminal (iTerm2) tabs to a minimum and use consistent key bindings. It also makes keeping an IRC bouncer less of a requirement because my client is up all the time.

The one thing I haven’t done yet is to provision a new dev machine automatically. The dev machine I’m on now has been updated manually. I started using a Vagrantfile to configure a local VM that would get everything configured, but as is clear from my opening paragraph, frequent crashes made that less than ideal. I’m hoping to containerize some processes I run in order to make a Vagrantfile that can spin up a cloud server reasonably simple.

What makes all this possible is Emacs. It runs well in a terminal and makes switching between local and remote development reasonably painless. The biggest pain is the integrations with my desktop, aka my local web browser. When developing locally, I can automatically open links with key bindings. While I’m sure I could figure something out with iTerm2 to make this happen, I’m going to avoid wasting my time and just click the link.

If you don’t use Emacs, I can’t recommend tmux enough for “window” management. I can see how someone using vim could become very proficient with minimal setup.

Mon, 12 Oct 2015 00:00:00 +0000 <![CDATA[Modern Development]]> Modern Development

My MacBook Pro crashed with a gray screen 4 times yesterday, which gave me time to think about what sort of environment to expect when developing.

The first thing is that you can forget about developing locally. Running OS X or Windows means your operating system is nothing more than an inconvenient integration point that lets you use Office and video conferencing software. Even if you use Linux, you’ll still have some level of indirection as you separate your dev environment from your OS. At the very least, it will be language specific like virtualenv in Python. At most you’ll be running VirtualBox / Vagrant with Docker falling somewhere in between.

Seeing as you can’t really develop locally, that means you probably don’t have decent integration into an IDE. While I can already hear the Java folks about to tell me about remote debugging, let me define “decent”. Decent integration into an IDE means running tests and code quickly. So, even if you can step through remote code, it is going to be slow. The same goes for developing for iOS or Android. You have a VM in the mix and it is going to be slow. When developing server software, you’re probably running a Vagrant instance and sharing a folder. Again, this gets slow, and you break most of the slick debugging bits your editor / IDE might have provided.

So, when given the choice, I imagine most developers choose speed over integration. You can generally get something “good enough” working with a terminal and maybe some grep to iterate quickly on code. That means you work in a shell or deal with copying code over to some machine. In both cases, it’s kludgey to say the least.

For example, in my case, I’ve started ssh’ing into a server and working there in Emacs. Fortunately, Emacs is reasonably feature-full in a terminal. That said, there are still integration issues. The key bindings I’ve used to integrate with non-code have been lost. Copy and paste becomes tricky when you have Emacs or tmux open on a server with split screens. Hopefully, your terminal is smart enough to help with finding links and passing through mouse events, but that can be a painful process to configure.

OK. I’m venting. It could be worse.

That said, I don’t see a major shift anytime soon. I’ve gone ahead and tried to change my expectations. I’ll need to shell into servers to develop code. It is important to do a better job of learning bash and developing a decent shell workflow. Configuring tmux / screen / byobu is a good investment. Part of me can appreciate the lean and mean text interface 24/7, but at the same time, I do hope that we will eventually expect a richer environment for development than a terminal.

Thu, 01 Oct 2015 00:00:00 +0000 <![CDATA[Ansible and Version Control]]> Ansible and Version Control

I’ve become reasonably comfortable with both Chef and Ansible. Both have pros and cons, but there is one aspect of Ansible that I think deserves mention: how well it works with version control, thanks to its lack of a central server and to defining its operations via YAML.

No Central Server

In Chef, there is the chef server that keeps the actual scripts for different roles, recipes, etc. It also maintains the list of nodes / clients and environments available. The good thing about this design is that you have a single source of truth for all aspects of the process. The downside is that the central server must be updated outside of version control. This presents the situation where version 1.1 of some recipe introduces a bug and you may need to cut a 1.2 that is the same as 1.0 in order to fix it.

Another downside is that if a node goes down or doesn’t get cleaned up properly, it will still exist on the chef server. Other recipes may still think the node is up even though it has become unavailable.

Ansible, at its core, runs commands via SSH. The management of nodes happens in the inventory and is dependent on a file listing or a dynamic module. The result is that everything Ansible needs to work is on the local machine. While it is not automatic, using a dynamic inventory, Ansible can examine the infrastructure at run time and act accordingly.

If you are not using a dynamic inventory, you can add hosts to your inventory files and just commit them like any other change! From there you can see when nodes come up and go down in your change history.
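For example, a static inventory is just a committed text file of groups and hosts (the hostnames below are made up), so bringing a node up or down is an ordinary diff:

```ini
# inventory/production -- committed like any other file
[webservers]
web1.example.com
web2.example.com

[dbservers]
db1.example.com ansible_ssh_host=10.0.0.5
```

Adding a node is one line in one commit, and git log on the file becomes a history of your infrastructure.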

YAML Roles

Ansible defines its actions via playbooks defined as YAML. You can also add your own modules if need be in the same repo. What’s more, if you find a helpful role or library in Ansible Galaxy, installing the library downloads its files directly into your code tree, ready to be committed. This vendoring makes things like version pins unnecessary. Instead, you simply checkout the changeset or version tag and everything should be good to go.

To compare this with Chef, you can maintain roles, environments, etc. as JSON and sync them with the central Chef server using Kitchen. The problem with this tactic is that a new commit in version control may or may not be an update to the resource on the chef server. You can get around this limitation with things like commit hooks that automatically sync the repo with the chef server, but that is not always feasible. For example, if you mistakenly update a role with an incorrect version pin and your servers are updating on a cadence, then that change will get rolled out automatically.

Again, there are known ways around these sorts of issues, but the point is that it is harder to maintain everything via version control, which I’d argue is the most beneficial approach.

I’m a firm believer that keeping your config management code in version control is the best way to manage your systems. There are ways to make central-server-based systems effectively gated via version control, but it is not always obvious, and it is certainly not built in. Ansible, by running code from the local machine and repository code, makes it simple to keep the repository as the single source of truth.

Mon, 28 Sep 2015 00:00:00 +0000 <![CDATA[More Virtual Machine Development]]> More Virtual Machine Development

I’ve written before about developing on OS X using virtual machines. That was ~6 months ago, so it seemed time for an update.

Docker is Better

One thing that has improved is Docker. Docker Compose now works on OS X. Docker Machine also has become more stable, which allows a boot2docker-like experience on any “remote” VM, where “remote” includes things like cloud servers as well as Vagrant VMs.

In terms of development, Docker Compose seems really nice in theory. You can define a suite of services that get started and linked together without having to manually switch configs all over the place. Eventually, Docker Compose will also be supported on things like Docker Swarm, which means, in theory at least, that you could deploy an entire stack.

In practice though, the linking is not as robust as one would like. If a container doesn’t start, the link doesn’t get added to the host machine’s /etc/hosts file. It’s also unclear how portable this is across different distros. While an /etc/hosts file is pretty standard, it still must be honored by whatever name resolution system the OS has in place. I’ve gotten the impression that things are moving towards injecting DNS into the mix in order to avoid lower level changes, but time will tell.

While it hasn’t worked for me, I’m still planning to keep trying Docker Compose as it seems like the best solution for starting up a suite of microservices for development.

Docker and Emacs !?!

Emacs has a package called Prodigy that lets you manage processes. I also played around with running my dev environment in a docker container where I just used Prodigy to run all the services I need as subprocesses of the Emacs process. It is like a poor man’s Emacs init system. Alas, this wasn’t much better than working on a cloud server. While it is clever and portable, it is still like ssh’ing into a box to work, which gets frustrating over time.

Vagrant and rdo

A while back I released a tool called rdo that lets you remotely execute commands like they were local. This has become my primary entry point into using Vagrant and avoiding having to use a terminal to access my dev environment. I also integrated this into my xe tool to make things more cohesive, but overall, it is easier to just use rdo.

Even with a reasonably efficient way to make VM development feel local, it’s still a little painful. The pain comes from the slowness of always working through the VM. Vagrant shares (by way of VirtualBox shared folders) are slow. I’m looking at using NFS instead, which I understand should be faster, so we’ll see. The fastest method is to use rsync. This makes sense because it basically copies everything over to the VM, making things run at “native” speed. My understanding is that this is a one way operation, so it doesn’t work well if you want to run a command and have the output piped to a file so you can use it locally (i.e. dosomething | grep “foo” > result_to_email.txt).

Cloud Server

I also spent some time developing on a cloud server. Basically, I set up an Ubuntu box in our cloud and ssh’d in to work. From a purely development standpoint, this worked pretty well. I used byobu to keep my emacs and shell sessions around after logging out or getting disconnected. Everything worked really fast as far as running tests, starting processes and even IRC. The downside was the integration into my desktop.

When I work in a local Emacs, I have key bindings and tools that help me work. I use Emacs for IRC; therefore, I have an easy key binding to open URLs. This doesn’t work on a cloud server. What’s more, because I’m using Emacs, any fancy URL recognition in iTerm2 often gets screwed up, so I can’t click a link, and in some cases, I can’t even copy it. Another thing that was problematic was that since the cloud server used a hypervisor, I couldn’t run VirtualBox. This meant no Chef Kitchen, so I was forced to head back to OS X for doing any devops work. Due to firewall restrictions, I also couldn’t access my work email on my cloud server.

Lastly, since the machine was on a public cloud, any process I ran was usually available to the public! In theory, these are just dev processes that don’t matter. In practice though, spinning up a Jenkins instance to test jobs in our real environment was dangerous to run openly.

The result was that I had to make a rather hard switch between pure dev work and devops work when using a cloud server. I also had to be very aware of the security implications of using my cloud server to access important networks.

Sidenote: Emacs in the Terminal

Emacs through a terminal, while very usable, has some nits related to key bindings. There are far fewer key bindings available in a terminal due to the shell having to provide actual characters to the process. While I tried to work around these nits, the result was more complex key bindings that I had no hope of establishing as habit. One example is C-= (C == ctrl). In reStructuredText mode, hitting C-= will cycle through headings and automatically add the right text decoration. Outside rst-mode, I typically have C-= bound to expand region. Expand region lets me hit C-= repeatedly to semantically select more of the current section. For example, I could expand to a word, sentence, paragraph, section, and then document. While these seem like trivial issues, they are frustrating to overcome as they have become ingrained in how I work. This frustration is made more bitter by the fact that the hoops I’ve jumped through are simply due to using OS X rather than Linux for development.

Future Plans

I’m not going to give up just yet and install Linux on my Macbook Pro. I’m doing some work with rdo to allow using docker in addition to Vagrant. I’m also confident that the docker compose and network issues will be solved relatively soon. In the meantime, I’m looking at ways to make rdo faster by reusing SSH connections and seeing if there are any ways to speed up shared folders in Vagrant (NFS, Rsync).

Tue, 08 Sep 2015 00:00:00 +0000 <![CDATA[Config Files]]> Config Files

In Chef, you can use attributes to set up values that you can later use in your recipes. I suspect a majority of these attributes end up in config files. For example, at work, we have added pretty much every value necessary in our config files. The result is that we duplicate our configuration (more or less) as a Ruby hash that gets serialized using ERB templates into the necessary config, which, in this case, is a conf file format. The other thing that happens is that we also set many of these values via environment variables using withenv, which describes this data as YAML.

Essentially, we go from YAML to Ruby to a template to a config file in a known parsable format. The problem is that each time you transition between data formats there is a chance for mistakes. As we all know, humans can make mistakes writing config files. It is worth considering how we could improve the situation.

I imagine a tool that accepts YAML and spits out different config file formats. YAML seems like a good option because it is arguably ubiquitous and provides data structures that programmers like. The tool to spit out a config file would use consistent patterns to output formats like conf files and plugins would need to be written for common tools like databases, web servers, queues, etc.

For example:

$ ymlconf --format rsyslog rsyslog_conf.yml > /etc/rsyslog.conf

I’m sure there would be some weirdness for some apps’ archaic and silly configuration file formats, but I’d hope that 1) the people using these apps understand the archaic file format well enough that 2) translating it to a YAML format wouldn’t be that terrible. For apps that do understand simple conf files, or even better, YAML or JSON or environment variables, it is a matter of simply writing the files.
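To make the idea concrete, here is a minimal sketch of what the core of such a tool might look like. Note that ymlconf is hypothetical; the function below just renders a dict, as you’d get from loading YAML, into a simple conf format:

```python
def to_conf(data):
    """Render a dict (as loaded from YAML) into simple conf file text.

    Top-level keys become [sections], nested scalars become
    key = value lines, and lists are joined with commas.
    """
    lines = []
    for section, values in data.items():
        lines.append("[{}]".format(section))
        for key, value in values.items():
            if isinstance(value, list):
                value = ",".join(str(v) for v in value)
            lines.append("{} = {}".format(key, value))
        lines.append("")
    return "\n".join(lines)


# A real tool would first do something like:
#   data = yaml.safe_load(open("rsyslog_conf.yml"))
print(to_conf({"global": {"workDirectory": "/var/spool/rsyslog"}}))
```

Plugins for specific formats (rsyslog, nginx, etc.) would replace the generic section writer, but the YAML-in, config-out shape stays the same.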

What’s more, YAML is reasonably programmatic. If you have a list of nodes that need to repeat the same sections over a list of IPs you get when you start up cloud servers, that is trivial to do in a Chef recipe. Rather than adding it to a data structure, only to decompose and pass that data to a template, you just append to a list in a data structure read from YAML.

After using withenv, I think this is another way to greatly reduce the cognitive mapping required to constantly go from some driving data (like environment variables) to configuration management system data structures (chef attributes) that are passed to a template language in order to write config files. Instead, it would simply be a matter of running some command, passing it the path or the YAML as stdin, and being done with it.

Tue, 25 Aug 2015 00:00:00 +0000 <![CDATA[Getting Paid]]> Getting Paid

Cory wrote up some thoughts on funding OSS, inspired by the recent decision to stop feature development on Hypothesis. There is a problem in the FOSS community where developers get frustrated that the time they invest has a very low return. While I understand this feeling, it is also worthwhile to consider some benefits to investing in open source software that make it worth your time.

I think most FOSS developers would agree the main benefit of working on FOSS is recognition. It feels good to know others are using your software. Beyond simply being recognized, FOSS developers are often given respect by other programmers and the technical community at large. It is this respect that can be put to use.

So, what do you get, right now, from working on (successful) Open Source software?

While many developers would like to get money, the reality is you get something less tangible, a reputation as a good developer. You can use your reputation as an open source leader to negotiate for the things you want. Often you can work from home, negotiate for more vacation time, better pay and time to work on your project, all thanks to your reputation. Not to mention, you often can choose to work for well respected organizations doing work you find reasonably interesting.

Companies should support FOSS developers for the code they have graciously offered for free. At the same time, we as developers should realize that it is our responsibility to capitalize on our contributions, even when they may not be directly justifiable to a business. If a company hired a well known Emacs (or Vim!) developer, even though the company uses Python, the company may still be able to offer time to work on the FOSS code, more money and/or sponsored trips to conferences. These types of expenses are easy to account for on a balance sheet when compared to giving someone money for non-specific work on a project.

Hopefully, in the future, new methods of directly supporting FOSS developers will come to light, but in the meantime, let’s see what we can do today. Ask your manager about having X number of hours to work on your open source project. Request sponsorship to a conference. If they refuse, look for another job and include your project work as part of the package. A new opportunity is a great means of letting your employer know your skills are valuable and that your terms for staying include working on your FOSS project.

For companies, support developers to work on FOSS! Even if someone doesn’t work on something directly associated with your current code base or business, that person has proven themselves a good developer, with the real world evidence available in the open. Similarly, if your organization is struggling to keep or acquire good talent, offering someone 4-8 hours a week to work on a project they already contribute to is a relatively cheap benefit in order to hire a proven, successful developer. What’s more, that person is likely to have a network of other great devs that you can dip into.

Again, I understand why developers are frustrated that they spend time on FOSS with seemingly very little to gain. But, rather than sit by and wait for someone to offer to pay you money for your time, communicate your frustrations to your employer and try to use some of your reputation to get what you want. If your current employer refuses to listen, it is probably time to consider other options. Companies that have difficulties attracting or keeping talent should offer FOSS support as part of the benefits package.

Finally, for some developers, it is a good idea to take a step back and consider why you write software. As a musician, it is cliché to say I don’t do it for the money. The reason is that the music industry is notorious for not paying anything, with a long line of willing musicians ready to work for nothing. While we as software developers do make a good living writing software, there is something to be said for enjoying our independent time writing code. Taking a hobby you enjoy and turning it into a business often removes much of what makes that hobby fun. It no longer is yours to do with as you want. If you write FOSS code for fun, then you are under no obligations other than your own desires. Programming is fun, so regardless of whether 1 or 1 million people use your code, remember why you started writing the code in the first place.

Tue, 11 Aug 2015 00:00:00 +0000 <![CDATA[Heat vs. Ansible]]> Heat vs. Ansible

At work we’re using Ansible to start up servers and configure them to run Chef. While I’m sure we could do everything with Ansible or Chef, our goal is to use the right tool for each specific job. Ansible does a pretty decent job starting up infrastructure, while Chef works better maintaining infrastructure once it has been created.

With that in mind, as we’ve developed our Ansible plays and created some tooling, there is a sense that Heat could be a good option to do the same things we do with Ansible.

Before getting too involved comparing these two tools, let’s set the scope. The goal is to spin up some infrastructure. It is easy enough in either tool to run some commands to set up chef or what have you. The harder part is how to spin everything up in such a way that you can confirm everything is truly “up” and configured correctly. For example, say you wanted to spin up a load balancer, some application nodes, and some database nodes. You need to be sure that when you get a message that everything is “done”:

  1. The load balancer is accepting traffic with the correct DNS hostname.
  2. There are X number of app server nodes that can all be accessed by ssh and on any other ports their services might be running on.
  3. There are X number of database nodes that are accessed via the proxy using a private network.
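Whatever tool does the spin-up, “done” ultimately means checks like these pass. A rough sketch of the verification step (the hostnames and port numbers are placeholders):

```python
import socket


def port_open(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_ready(lb_host, app_nodes, db_nodes):
    """The stack is 'up' only when every piece answers on its port."""
    return (
        port_open(lb_host, 80)
        and all(port_open(n, 22) for n in app_nodes)
        and all(port_open(n, 5432) for n in db_nodes)
    )
```

Both Ansible and Heat can express this kind of check; the difference, as we’ll see, is who is responsible for writing and running it.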

I’m not going to provide examples on how to spin this infrastructure up in each tool, but rather discuss what each tool does well and not so well.


Ansible

Ansible, for the most part, connects to servers via ssh and runs commands. It has a bunch of handy modules for doing this, but in a nutshell, that is what Ansible does. Since all Ansible really does is run commands on a host, it might not be clear how you’d automate your infrastructure. The key is the inventory.

Ansible uses the concept of an inventory that contains the set of hosts that exist. This inventory can be dynamic as well, such that “plays” create and add hosts to the inventory. A common pattern we use is to create some set of hosts and pass that list on to subsequent plays that do other tasks, like configuring chef.
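A dynamic inventory is just an executable that prints JSON in the layout Ansible expects. Here is a rough sketch, where fake_cloud_api and the group names stand in for a real query against your cloud provider:

```python
import json


def fake_cloud_api():
    """Stand-in for querying a cloud provider for running servers."""
    return [
        {"name": "app1", "ip": "10.0.0.11", "role": "appservers"},
        {"name": "db1", "ip": "10.0.0.21", "role": "dbservers"},
    ]


def build_inventory(servers):
    """Group hosts by role and attach per-host variables."""
    inventory = {"_meta": {"hostvars": {}}}
    for server in servers:
        inventory.setdefault(server["role"], {"hosts": []})
        inventory[server["role"]]["hosts"].append(server["name"])
        inventory["_meta"]["hostvars"][server["name"]] = {
            "ansible_ssh_host": server["ip"],
        }
    return inventory


if __name__ == "__main__":
    # Ansible runs this script and reads the JSON from stdout.
    print(json.dumps(build_inventory(fake_cloud_api()), indent=2))
```

Because the inventory is computed at run time, the plays always act on what actually exists rather than on a stale list of nodes.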

What’s Good

The nice aspect of Ansible is that you have an extremely granular process for starting up hosts. You can run checks within the plays to ensure that nothing continues if there is a failure. You can “rescue” failed tasks as well in order to clean up broken resources. It is also really simple to understand what is happening. You know Ansible is simply running some code on a host (sometimes localhost!) via ssh. This makes it easy to reproduce and perform the same tasks manually.

What’s Bad

The difficulty in using Ansible is that you are responsible for everything. There is no free lunch, and the definition of what you want your infrastructure to look like is completely up to you. This is OK when you’re talking about a few servers that all look the same. But when you need 50 small 512MB machines along with 10 big compute machines using shared block storage, 10 memcache nodes with tons of RAM, and a load balancer, and you need to ensure this infrastructure runs in 3 different data centers, then it starts to hurt. While there is a dynamic inventory to maintain your infrastructure, it is not well integrated as a concept in Ansible. The process often involves using a template language within YAML to correctly access your infrastructure, which is less than ideal.


I’m sure ansible gurus have answers to my complaints. No doubt, using Tower could be one. Unfortunately, I haven’t had the opportunity to use Tower, and since it isn’t free, we haven’t considered it for our relatively limited use case.


Heat

Heat comes from Cloud Formation Templates in AWS. The idea is to define what you’d like your infrastructure to look like and pass that definition to your orchestration system. The orchestration system takes the template, establishes a plan of attack and starts performing the operations necessary. The end result is that everything gets created and linked together as requested and you’re done!

At Rackspace, we have a product called Cloud Orchestration that is responsible for making your template a reality.

What’s Good

Heat lets you define a template that outlines precisely what you want your infrastructure to look like. Just to provide a simple example, here is a template I wrote to spin up a Fleet cluster.

heat_template_version: '2015-04-30'
description: "This is a Heat template to deploy Linux servers running fleet and etcd"

resources:
  fleet_servers:
    type: 'OS::Heat::ResourceGroup'
    properties:
      count: 3
      resource_def:
        type: 'OS::Nova::Server'
        properties:
          flavor: '512MB Standard Instance'
          image: 'CoreOS (Stable)'
          config_drive: true
          user_data: |
            #cloud-config
            coreos:
              etcd2:
                initial-advertise-peer-urls: http://$private_ipv4:2380
                advertise-client-urls: http://$public_ipv4:2379
                listen-peer-urls: http://$private_ipv4:2380,http://$private_ipv4:7001
              fleet:
                public-ip: $private_ipv4
              units:
                - name: etcd2.service
                  command: start
                - name: fleet.service
                  command: start

outputs:
  server_ips:
    value: { get_attr: [fleet_servers, accessIPv4] }

Heat templates include a bunch of features to make this more programmable, such that you can pass in arguments where necessary. For example, I might make count a parameter in order to spin up 1 server when testing and more in production.
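A sketch of that change (the parameter name server_count is my own; get_param is Heat’s intrinsic function for reading template parameters):

```yaml
parameters:
  server_count:
    type: number
    default: 1

# then, in the ResourceGroup's properties:
#     count: { get_param: server_count }
```

Launching the stack with a different server_count is then just a matter of passing a different value, with no edit to the template itself.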

What we do currently in Ansible is to pass environment variables to our plays that end up as configuration options for creating our dynamic inventory. We use withenv to make this more declarative by writing it in YAML. Here is an example:

  - MDNS:          1
  - POOL_MGR:      1
  - CENTRAL:       1
  - API:           1
  - DB:            3
  - QUEUE:         3

As you can see, the process of defining this sort of infrastructure is slowly becoming closer to Heat templates.

Another benefit of using Heat is that you are not responsible for implementing every single step of the process. Heat provides semantics for naming a group of servers in such a way that they can be reused. If you create 5 hosts for some pool that need to be added to a load balancer, that is easy peasy with Heat. What’s more, the orchestration system can act with a deeper knowledge of the underlying system. It can perform retries as needed with no manual intervention.

Heat also makes it easy to use cloud-init. While this doesn’t provide the same flexibility as an Ansible play, it is an easy way to get a node configured after it boots.

What’s Bad

Heat templates are still just templates. The result is that if you are trying to do complex tasks, get ready to write a bunch of YAML that is not easy to look at. Heat also doesn’t provide a ton of granularity. If one step fails, where failure is defined by the orchestration system and the heat template, the entire stack must be thrown away.

Heat is really meant to spin up or teardown a stack. If you have a stack that has 5 servers and you want to add 5 more, updating that stack with your template will teardown the entire stack and rebuild it from scratch.


Update: I’m thankfully wrong here! Heat should recognize that you are only adding/removing servers and, assuming you aren’t changing other details, it will just add or remove the machines. The one caveat is if there are dependencies on that server. I’m not clear on what a “dependency” means in this case, but my basic use case of adding more nodes of a typical kind (i.e. more REST API nodes) should work just fine.

Conclusions and Closing Thoughts

Heat, currently, is a great tool to spin up and tear down a complex stack. While it seems frustrating that updates do not consider the state of the stack, it does promote a better deployment design where infrastructure is an orthogonal concern to how apps are actually run.

Also, Heat at Rackspace supports autoscaling, which handles the most common use case of adding / removing nodes from a cluster.

From the user perspective, decoupling your infrastructure from your application deployments works well when you run containers and use a tool like Fleet to automatically start your app on the available hosts in a cluster. When a host goes away, Fleet is responsible for running the lost processes on the nodes still available in the cluster.

With that in mind, if your applications haven’t been developed to run on containers and that isn’t part of your CI/CD pipeline, Ansible is a great option. Ansible is simple to understand and has a vibrant ecosystem. There are definitely mismatches when it comes to infrastructure, but nothing is ever perfect. For example, I do think the dynamic inventory ends up a little bit cleaner than the machine semantics I’ve seen in Chef.

Finally, there is no reason you can’t use both! In my Heat template example, you notice that there is an outputs section. That can be used to create your own dynamic inventory system so you could get the benefits of setup/teardown with Heat, while doing your initial machine configuration with Ansible rather than fitting it all into a cloud-init script.

I hope this overview helps. Both Heat and Ansible are excellent tools for managing infrastructure. The big thing to remember is that there is no free lunch when it comes to spinning up infrastructure. It is a good idea to treat it as a separate process from managing software. For example, it is tempting to try to install and configure your app via a cloud-init script or immediately after spinning up a node in Ansible. Step one should be to get your infrastructure up and tested before moving on to configuring software. By keeping the concerns separate, you’ll find that tools like Heat and Ansible become more reliable while staying simple.

Mon, 10 Aug 2015 00:00:00 +0000 <![CDATA[Vendoring Dependencies]]> Vendoring Dependencies

All software has to deal with dependencies. It is a fact of life. No program can execute without some supporting tools. These tools, more often than not, are made available by some sort of dependency management system.

There are many paths to including dependencies in your software. A popular approach, especially in uncompiled languages like Python or Ruby, is to resolve dependencies when installing the program. In Python, pip install will look at the dependencies a program or library needs, then download and install them into the environment.

The best argument for resolving dependencies during install (or runtime) is that you can automatically apply things like security fixes to dependent libraries across all applications. For example, if there is a libssl issue, you can install the new libssl, restart your processes and the new processes should automatically be using the newly installed library.

The biggest problem with this pattern is that it’s easy to have version conflicts. For example, if one program declares it needs version 1.2 of a dependency and some other library requires 1.1, the dependencies are in conflict. These conflicts are resolved by establishing some sort of sandboxed environment where each application can use the necessary dependencies in isolation. Unfortunately, by creating a sandboxed environment, you often lose the ability for a package to inherit system wide libraries!

The better solution is to vendor dependencies with your program. By packaging the necessary libraries you eliminate the potential for conflicts. The negative is that you also eliminate the potential for automatically inheriting some library fixes, but in reality, this relationship is not black and white.

To make this more concrete, let’s look at an example. Say we have a package called “supersecret” that can sign and decrypt messages. It uses libcrypto for doing the complicated work in C. Our “supersecret” package installs a command line utility ss that uses the click library. The current version of click is 4.x, but we wrote this when 3.x was released. Let’s assume as well that we use some feature that breaks our code if we’re using 4.x.

We’ll install it with pip

$ pip install supersecret

When this gets installed, it will use the system level shared libcrypto library. But we’ve vendored our click dependency.

The benefit here is that we’ve eliminated the opportunity to conflict with other packages, while still inheriting beneficial updates from the lower level system.
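The import-time side of this pattern is usually a small shim that prefers the bundled copy and falls back to whatever is installed system-wide. Here is a runnable sketch of the idiom — the supersecret / _vendor names are hypothetical, and the vendored module is simulated in-process so the snippet stands alone:

```python
import sys
import types

# Simulate a vendored copy of click shipped inside the package
# (in a real project this would be a pinned source tree on disk).
vendored_click = types.ModuleType("supersecret._vendor.click")
vendored_click.__version__ = "3.3"  # the pinned 3.x we wrote against

pkg = types.ModuleType("supersecret")
vendor = types.ModuleType("supersecret._vendor")
vendor.click = vendored_click
sys.modules["supersecret"] = pkg
sys.modules["supersecret._vendor"] = vendor
sys.modules["supersecret._vendor.click"] = vendored_click

# The actual shim: prefer the vendored copy, fall back to the system one.
try:
    from supersecret._vendor import click
except ImportError:
    import click

print(click.__version__)  # → 3.3
```

The rest of the code base imports click through this shim and never notices whether the system has 4.x installed.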

The argument against this pattern is that keeping these dependencies up to date can be difficult. I’d argue that this is incorrect when you consider automated testing via a continuous integration server. For example, if we simply have click in our dependency list via or requirements.txt, we can assume our test suite will be run from scratch, downloading the latest version and revealing broken dependencies. While this requires tests that cover your library usage, that is a good practice regardless.

To see a good example of how this vendoring pattern works in practice, take a look at the Go language. Go has explicitly made the decision to push dependency resolution to build time. The result is that Go binaries can be copied to a machine and run without any other requirements.

One thing that would make vendoring even safer is a standard means of declaring which libraries are vendored. For example, if you do use libssl, having a way to communicate that the dependency is vendored would allow an operator to recognize which applications need to be updated when certain issues arise. With that in mind, as we’ve seen above, languages such as Python and Ruby make it trivial to keep using the system level dependencies that usually come up in discussions of code rotting due to vendoring.

Vendoring is far from a panacea, but it does put the onus on the software author to take responsibility for dependencies. It also promotes working software over purity from the user’s perspective. Even if you are releasing services where you are the operator, managing your dependencies when you are working on the code will greatly simplify the rest of the build/release process.

Mon, 20 Jul 2015 00:00:00 +0000 <![CDATA[Small Functions without an IDE]]> Small Functions without an IDE

I’ve been reading Clean Code for a book club at work. So far, it is really a great book as it provides attributes and techniques for understanding what clean code really looks like. My only complaint, which is probably the wrong word, is that the suggestions assume a more verbose language such as Java that you use with an IDE.

As I said, this really isn’t a complaint so much as an observation: the author mentions that things like renaming and extracting very small functions are painless thanks to the functionality of the IDE. In dynamically typed languages like Python, the same level of introspection generally isn’t available to the tooling.

As an aside, function calls in Python can be expensive, so extracting a single function into many much smaller functions does have the potential to slow things down. I saw this impact in a data export tool that needed to perform a suite of operations on each row. It had started as one huge function and I refactored it into a class with a ton of smaller methods. Unfortunately, this did slow things down somewhat, but considering the domain and the expense of maintaining the single huge function, the slowdown was worth it.

Performance aside, I’d argue that it is definitely better to keep functions small and use more of them when writing any code. The question, then, is how do you manage the code base when you can’t reliably jump to function references automatically?

I certainly don’t have all the answers, but here are some things I’ve noticed that seem to help.

Use a Single File

While your editor might support refactoring tools, it most certainly has the ability to search. When you need to pop around to different functions, keep the number of files to a minimum so you can easily use search to your advantage.

Use a Flat Namespace

Using a flat namespace goes hand in hand with keeping more functions / methods in a single file. Avoid nesting your modules to make it faster to find files within the code. One thing to note is that the goal here is not to keep a single folder with hundreds of files. The goal is to limit the scope of each folder / module to the code it will be using.

You can think of this in the same terms as refactoring your classes. If a file has functionality that seems out of place in the module, move it somewhere else. One benefit of using a dynamic language like Python is you don’t have the same one class per file requirements you see in something like Java.

Consistent Naming

Consistent naming is pretty obvious, but it is even more important in a dynamic language. Make absolutely sure you use the same names for the same objects / classes throughout your code base. If you are diligent in how you name your variables, search and replace can be extremely effective in refactoring.

Write Tests

Another obvious one here, but make sure you write tests. Smaller functions mean more functions. More functions should mean more tests. Fortunately, writing the tests is much, much easier.

class TestFoo(object):

    def test_foo_default(self): ...
    def test_foo_bar(self): ...
    def test_foo_bar_and_baz(self): ...
    def test_foo_bar_and_baz_and_kw(self): ...

If you’ve ever written a class like this, adding more functions should make your tests easier to understand as well. A common pattern is to write a test for each path a function can take. You often end up with a class that has tons of oddly named test functions with different variables mocked in order to test the different code paths in isolation. When a function gets refactored into many small functions (or methods) you see something more like this:

class TestFoo(object):

    def setup(self):
        self.foo = Foo()

    def test_foo(self):
        self.foo.bar = Mock()
        self.foo.foo()
        self.foo.bar.assert_called_with()

    def test_bar(self):
        self.foo.load_config = Mock()
        self.foo.bar()
        self.foo.load_config.assert_called_with('/path/to/cfg')

In the above example, you can easily mock the functions that should be called and assert that the interface the function requires is being met. Since the functions are small, your tests end up being easy to isolate and you can test the typically very small bit of functionality that needs to happen in that function.

Use Good Tools

When I first started to program and found out about find ... | xargs grep my life was forever changed. Hopefully your editor supports some integration of search tools. For example, I use Emacs along with projectile, which supports searching with ag. When I use these tools alongside the massive amount of functionality my editor provides, it is a very powerful environment. If you write code in a dynamic language, it is extremely important to take some time to master the tools available.


I’m sure there are other best practices that help to manage well factored code in dynamic languages. I’ve heard some programmers say that refactoring code into very small functions is “overkill” in a language like Python, but I’d argue they are wrong. The cost associated with navigating the code base can be reduced a great deal using good tools and some simple best practices. The benefits of a clean, well tested code base far outweigh the cost of a developer reading the code.

Mon, 06 Jul 2015 00:00:00 +0000 <![CDATA[Operators]]> Operators

Someone mentioned to me a little while back a disinterest in going to PyCon because it felt directed towards operators more than programmers. Basically, there are now more talks about integrations using Python than discussions regarding language features, libraries or development techniques. I think this trend is natural because Python has proven itself as a mainstream language that has solved many common programming problems. Therefore, when people talk about it, it is a matter of how Python was used rather than describing how to apply some programming technique using the language.

With that in mind, it got me thinking about “Operators” and what that means.

Where I work there are two types of operators. The first is the somewhat traditional system administrator. This role is focused on knowledge about the particular system being administered. There is still a good deal of automation work that happens at this level, but it is typically focused on administering a particular suite of applications. For example, managing apache httpd or bind9 via config files and rolling out updates using the specific package manager. There is typically more nuance to this sort of role than can be expressed in a paragraph, so needless to say, these are domain experts that understand the common and extreme corner cases for the applications and systems they administer.

The second type of operator is closer to the operations side of DevOps. These operators are responsible for building the systems that run application software, designing the systems and infrastructure that run the custom applications. While more traditional sysadmins use configuration management, these operators master it. Ops must have a huge breadth of knowledge that spans everything: file systems, networking, databases, services, *nix, shell, version control and everything in between.

As a software developer, we think about abstract designs, while ops makes the abstract concrete.

After working with Ops for a while, I have a huge amount of respect for the complexity that must be managed. There is no way to simply import cloud and call cloud.start(). The tools available to Ops for encapsulating concepts are rudimentary by necessity. The configuration management tools are still very new and the terminology hasn’t coalesced into design patterns because everyone’s starting point is different. Ops is where Linux distros, databases, load balancers, firewalls, user management and apps come together to actually form working products.

It is this complexity that makes DevOps such an interesting place for software development. Amidst the myriad of programs and systems, there needs to be established concepts that can be reused as best practices, and eventually, as programs. Just as C revolutionized programming by allowing a way to build for different architectures, DevOps is creating the language, frameworks, and concepts to deploy large scale systems.

The current state of the art is using configuration management / orchestration tools to configure a system. While in many ways this is very high level, I’d argue that it is closer to assembly in the grand scheme of things. There is still room to encapsulate these tools and provide higher level abstractions that simplify and make safe the process of working with systems.

Thu, 02 Jul 2015 00:00:00 +0000 <![CDATA[Docker and Chef]]> Docker and Chef

Chef is considered a “configuration management” tool, but it really is an environment automation tool. Chef makes an effort to perform operations on your system according to a series of recipes. In theory, these recipes provide a declarative means of:

  1. Defining the process of performing some operations
  2. Defining the different paths to complete an operation
  3. The completed state on the system when the recipe has finished

An obvious, configuration specific example would be a chef recipe to add a new httpd config file in /etc/httpd/sites.enabled.d/ or somewhere similar. You can use tactics similar to those you see in make to check whether you have a newer file and decide how to apply the change.
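As a toy illustration of that check-then-apply idea — not Chef’s actual implementation, and with made-up file names — a file “resource” might converge like this:

```python
import os
import tempfile

def ensure_file(path, content):
    """Converge a file resource: only write when the on-disk state differs
    from the declared content, and report whether anything changed."""
    if os.path.exists(path):
        with open(path) as fh:
            if fh.read() == content:
                return False  # already converged; nothing to do
    with open(path, "w") as fh:
        fh.write(content)
    return True  # the resource changed

path = os.path.join(tempfile.mkdtemp(), "site.conf")
print(ensure_file(path, "Listen 8080\n"))  # → True  (file created)
print(ensure_file(path, "Listen 8080\n"))  # → False (second run is a no-op)
```

The value of a real tool is everything around this loop: templating the content, handling partial failures, and notifying dependent resources (e.g. restarting httpd) only when a change actually happened.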

Defining the operations that need to happen, along with handling valid error cases, is non-trivial. When you add to that also defining what the final state should look like between processes running, file changes or even database updates, you have a ton of work to do with an incredible amount of room for error.

Docker, while it is not a configuration management tool, allows you to bundle your build with your configuration, thus separating some of the responsibility. This doesn’t preclude using chef as much as it limits it to configuring the system in which you will run the containers.

Putting this into more concrete terms, what we want is a cascading system that allows each level to encapsulate its responsibilities. By doing so, a declaration that some requirement has been met can allow the lower layer to report back a simple true/false.

In a nutshell, use chef to configure the host that will run your processes. Use docker containers to run your process with the production configuration bundled in the container. By doing so, you take advantage of Chef and its community cookbooks while making configuration of your application encapsulated in your build step and the resulting container.

While this should work, there are still questions to consider. Chef can dynamically find configuration values when converging while a docker container’s filesystem is read only. While I don’t have a clear answer for this, my gut says it shouldn’t be that difficult to sort out in a reliable pattern. For example, chef could check out some tagged configuration from a git repo that gets mounted at /etc/$appname when running the container. Another option would be to use etcd to update the filesystem mounted in a container. In either case, the application uses the filesystem normally, while chef provides the dynamism when converging.

Another concern is that in order to use docker containers, it is important you have access to a docker registry. Fortunately, this is a relatively simple process. One downside is that there is not an OpenStack Swift backed v2 registry. The other option is to use docker hub and pay for more private containers. The containers should be registered as private because they include the production configuration.

It seems clear that a declarative system is valuable when configuring a host. Unfortunately, the reality is that the resources that are typically “declared” with Chef are too complex to maintain a completely declarative pattern. Using docker, a container can be tested reliably such that a running container is enough to consider its dependency met in the declared state.

Fri, 26 Jun 2015 00:00:00 +0000 <![CDATA[Emacs and Strings]]> Emacs and Strings

If you’ve ever programmed any elisp (emacs lisp) you might have been frustrated and surprised by the lack of string handling functions. In Python, it is trivial to do things like:

print('Hello World!'.lower().split())

The lack of string functions in elisp has been improved greatly by s.el, but why haven’t these sorts of functions existed in Emacs in the first place? Obviously, I don’t know the answer, but I do have a theory.

Elisp is (obviously) a LISP and LISPs are functional! One tenet of functional languages is the use of immutable data. While many would argue immutability is not something elisp is known for, when acting on a buffer, it is effectively immutable. So, rather than load some string into memory, mutate it and use it somewhere, my hunch is early Emacs authors saw things differently. Instead, they considered the buffer the place to act on strings. When you call an elisp function it acts like a monad or a transaction where the underlying text is effectively locked. Rather than loading it into some data structure, you instead are given access to the editor primitives to literally “edit” the text as necessary. When the function exits, the buffer is then returned to the UI and user in its new state.

The benefits here are:

  1. You use the same actions the user uses to manipulate text
  2. You re-use the same memory and content the editor is using

While it feels confusing coming from other languages, if you think of all the tools available to edit text in Emacs, one could argue that string manipulation is not necessary.

Of course, my theory could be totally wrong, so who knows. Fortunately, there is s.el to help bridge the gap between editing buffers and manipulating text.

Thu, 25 Jun 2015 00:00:00 +0000 <![CDATA[Announcing Withenv]]> Announcing Withenv

I wrote a tool to help sanely manage environment variables. Environment variables (env vars) are a great way to pass data to programs because they work practically everywhere with no setup. They are a lowest common denominator that almost all systems support, all the way from dev to production.

The problem with env vars is that they can be sticky. You are in a shell (zsh, bash, fish, etc...) and you set an environment variable. It exists and is available to every command from then on. If an env var contains an important secret such as a cloud account key, you could silently delete production nodes by mistake. Someone else could use your computer and do the same thing, with or without malicious intent.

Another difficulty with env vars is that they are a global key value store. Writing small shell scripts to export environment variables can be error prone. Copying and pasting or commenting out env vars in order to configure a script is easy to screw up. The fact these env vars are long lasting only makes it more difficult to automate reliably.

Withenv tries to improve this situation by providing some helpful features:

  • Set up the environment for each command without it leaking into your shell
  • Organization of your environment via YAML files
  • Cascading of your environment files in order to override specific values
  • Debugging the environment variables
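The first point is, at its core, the standard “pass a modified environment to the child process” trick, which you can sketch in plain Python (the GREETING variable is just a stand-in):

```python
import os
import subprocess
import sys

# Build a per-command environment instead of exporting into the shell.
env = dict(os.environ, GREETING="hello")

# The child process sees GREETING; the parent process is untouched.
out = subprocess.check_output(
    [sys.executable, "-c", "import os; print(os.environ['GREETING'])"],
    env=env,
)
print(out.decode().strip())  # → hello
print("GREETING" in os.environ)  # → False
```

Nothing sticks around after the command exits, which is exactly the behavior you want for secrets like cloud account keys.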

Here is how it works.

Let’s say we have a script that starts up some servers. It uses some environment variables to choose how many servers to spin up, what cloud account to use and what role to configure them with (via Chef or Ansible or Salt, etc.). The script isn’t important, so we’ll just assume make create does all the work.

Let’s organize our environment YAML files. We’ll create an envs folder that we can use to populate our environment. It will have some directories to help build up an environment.

envs
├─ env
│  ├─ dev
│  └─ prod
└─ roles
   ├─ app-foo
   └─ app-bar

Now we’ll add some YAML files. For example, let’s create a YAML file in envs/env/dev that connects to a development account.

# envs/env/dev/rax_creds.yml
- RACKSPACE:
  - USERNAME: eric
  - API_KEY: 02aloksjdfp;aoidjf;aosdijf

You’ll notice that we used a nested data structure as well as lists. Using lists ensures we get an explicit ordering. We could have used a normal dictionary as well if the order doesn’t matter. The nesting ensures that each child entry will use the correct prefix. For example, the YAML above is equivalent to the following bash script.

export RACKSPACE_USERNAME=eric
export RACKSPACE_API_KEY='02aloksjdfp;aoidjf;aosdijf'

Now, lets create another file for defining some object storage info.

# envs/env/dev/cloud_storage.yml
- STORAGE_BUCKET: devstore
- STORAGE_PREFIX: $STORAGE_BUCKET/myapp

You’ll notice that the STORAGE_PREFIX uses the value of STORAGE_BUCKET. You can do normal dollar-prefixed replacements like you would in a shell. This includes any variables currently defined in your environment, such as $HOME or $USER, that are typically set. Also, by using a list (as denoted by the -), we ensure that the variables are applied in order and STORAGE_BUCKET exists for use within the STORAGE_PREFIX value.
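The replacement behavior can be approximated in a few lines with string.Template. This is a sketch, not withenv’s actual implementation, and the myapp path suffix is made up:

```python
import os
from string import Template

def apply_env(pairs):
    """Apply ordered (NAME, value) pairs, expanding $VAR references
    against the variables set so far plus the inherited environment."""
    env = dict(os.environ)
    for name, value in pairs:
        env[name] = Template(value).safe_substitute(env)
    return env

env = apply_env([
    ("STORAGE_BUCKET", "devstore"),
    ("STORAGE_PREFIX", "$STORAGE_BUCKET/myapp"),
])
print(env["STORAGE_PREFIX"])  # → devstore/myapp
```

Because the pairs are applied in order, swapping the two entries would leave $STORAGE_BUCKET unexpanded, which is exactly why the list form matters.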

With our environment YAML in place, we can now use the we command that withenv provides to set up the environment before calling a command.

$ we -e envs/common.yml -d envs/env/dev -d envs/role/app-foo make create

The -e flag lets you point to a specific YAML file, while the -d flag points to a directory of YAML files. The ordering of the flags is important because the last entry takes precedence. In the command above, we might have configured common.yml with a personal dev account along with our defaults. The envs/env/dev/ folder contains a rax_creds.yml file that overrides the default cloud account with a shared development account, leaving the other defaults alone.
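The precedence rule is just “later layers overwrite earlier ones”, which a couple of lines of Python make concrete (the account values here are placeholders, not real config):

```python
def cascade(*layers):
    """Merge environment layers; later layers win, mirroring how the
    last -e / -d flag on the command line takes precedence."""
    env = {}
    for layer in layers:
        env.update(layer)
    return env

# common.yml sets a personal account; envs/env/dev overrides just the creds.
merged = cascade(
    {"RACKSPACE_USERNAME": "personal", "STORAGE_BUCKET": "devstore"},
    {"RACKSPACE_USERNAME": "shared-dev"},
)
print(merged["RACKSPACE_USERNAME"])  # → shared-dev
print(merged["STORAGE_BUCKET"])      # → devstore
```

Keys not mentioned in a later layer pass through untouched, which is what lets a small override file ride on top of a large set of defaults.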

The one limitation is that you cannot use the output from commands as a value to an env var. For example, the following wouldn’t work to set a directory path.

CONFIG_PATH: `pwd`/etc/foo/

This might be fixed in the future, but at the moment it is not supported.

If you don’t pass any arguments to the we command it will output the environment as a bash script, using export to set variables.

Withenv is available on pypi. Please let me know if you give it a try.

Thu, 18 Jun 2015 00:00:00 +0000 <![CDATA[Log Buffering]]> Log Buffering

Have you ever had code that needed to do some logging, but your logging configuration hadn’t been loaded yet? While it is a best practice to set up logging as early as possible, logging is still code that needs to be executed. The Python runtime will still do some setup (i.e. import everything) that MUST come before ANY code is executed, including your logging code.

One solution would be to jump through some hoops to make that code evaluated more lazily. For example, say you wanted to apply a decorator from some other package if it is installed. The first time the function is called, you could apply the decorator. This would get pretty complex pretty quickly.

class LazyDecorator(object):
    def __init__(self, entry_point):
        self.entry_point = entry_point
        self.func = None

    def find_decorator(self):
        # find our decorator...

    def __call__(self, f):
        self.original_func = f
        def lazy_wrapper(*args, **kw):
            if not self.func:
                self.func = self.find_decorator()
            return self.func(*args, **kw)
        return lazy_wrapper

I haven’t tried the code above, but it does rub me the wrong way. The reason is that we’re jumping through hoops just to do some logging. Function calls are expensive in Python, which means that if you decorated a ton of functions, the result could be a lot of overhead for a feature that only affects startup.

Instead, we can just buffer the log output until after we’ve loaded our logging config.

import functools
import logging

class LazyLogger(object):

    LVLS = dict(
        debug=logging.DEBUG,
        info=logging.INFO,
        warning=logging.WARNING,
        error=logging.ERROR,
        critical=logging.CRITICAL,
    )

    def __init__(self):
        self.messages = []

    def replay(self, logger=None):
        logger = logger or logging.getLogger(__name__)
        for level, msg, args, kw in self.messages:
            logger.log(level, msg, *args, **kw)

    __call__ = replay

    def capture(self, lvl, msg, *args, **kw):
        self.messages.append((lvl, msg, args, kw))

    def __getattr__(self, name):
        if name in self.LVLS:
            return functools.partial(self.capture, self.LVLS[name])
        raise AttributeError(name)

We can use this as our logging object in code that needs to log before logging has been configured. Then, when it is appropriate, we can replay our log by importing the logger and calling the replay method. We could even keep a registry of lazy loggers and call them all after configuring logging.

The benefit of this tactic is that you avoid adding runtime complexity, while supporting the same logging patterns at startup / import time.
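For what it’s worth, the standard library ships a ready-made version of this tactic: logging.handlers.MemoryHandler buffers records until you attach a target handler and flush. A minimal sketch (the logger name and message are arbitrary):

```python
import io
import logging
import logging.handlers

log = logging.getLogger("bootstrap")
log.setLevel(logging.DEBUG)

# Buffer records in memory; nothing is emitted until a target is attached.
buffered = logging.handlers.MemoryHandler(capacity=100,
                                          flushLevel=logging.CRITICAL)
log.addHandler(buffered)

log.info("loaded plugins")  # captured, not yet written anywhere

# ... later, once the real logging config is in place:
stream = io.StringIO()
buffered.setTarget(logging.StreamHandler(stream))
buffered.flush()  # replay the buffered records through the real handler

print(stream.getvalue().strip())  # → loaded plugins
```

The hand-rolled LazyLogger is still handy when you want to delay even choosing the logger, but MemoryHandler covers the common case with no custom code.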

Tue, 16 Jun 2015 00:00:00 +0000 <![CDATA[DevOps System Calls]]> DevOps System Calls

One thing I’ve found when looking at DevOps is the adherence to specific tools. For example, if an organization uses chef, then it is expected that chef be responsible for all tasks. It is understandable to reuse knowledge gained in a system, but at the same time, all systems have pros and cons.

More importantly, each tool adheres to its own philosophies for how a system should be defined. Some are declarative while others are imperative, and almost all systems define their own (clever at times) verbiage for what the different elements of a system should be.

What the DevOps ecosystem really needs is a low level suite of common primitives we can build off of. A set of DevOps System Calls, if you will, that we can use to build higher order systems. The reason is to gain some guarantees we can start to assume will hold.

For example, in Python, when I write tests, I assume the standard library functions such as open or the socket module work as expected. You don’t see tests such as:

def test_open():
    with open('test_file.txt', 'w') as fh:
        fh.write('foo')

    assert open('test_file.txt').read() == 'foo'

We have similar expectations regarding much of the TCP/IP stack. We assume the bits are read correctly on the network hardware and passed to the OS, eventually landing in our program correctly. We take it for granted that the HTTP request becomes something like request.headers['Content-Type'] in our language of choice.

These assumptions let us consider our program in higher level terms that are portable across languages and systems. Every programmer understands what it means to open file, connect to a database or make a HTTP request within our programs because our level of abstraction is reasonably high.

DevOps could use a similar standard and the implementation doesn’t matter. A machine might be created with Ansible, but configured via Chef. That part doesn’t matter. What matters is we can write simple code that manages our operations.

For example, let’s say I want to spin up a machine to run an app and a DB. Here is some pseudo code that might get the job done.

machine = cloud.create(flavor=provider.FLAVOR_COMPUTE)
app = packages.find('my-app')
machine.install(app)

This would compile to a suite of commands that trigger some DevOps tools to do the work necessary to build the machines. The configuration of provider, available flavors, and repository locations would all live in OS level config, like you see for networking, auth and everything else in /etc.

The key is that we can assume the calls will work or throw an error. The process is encapsulated in such a way that we don’t need to think about the provider, setting API keys in an environment, bootstrapping the node for our configuration management, and every other tiny detail that needs to be performed and validated in order to consider the “recipe” or “playbook” done.

Obviously, this is not trivial. But, if we consider where our tools excel and begin the process of encapsulating the tools behind some higher order concepts, we can begin to create a glossary and shared expectations. The result is a true Cloud OS.

Thu, 04 Jun 2015 00:00:00 +0000 <![CDATA[Playing with Repose]]> Playing with Repose

At work we use a proxy called repose in front of most services in order to make common tasks such as auth, rate limiting, etc. consistent. In Python, this type of functionality might also be accomplished via WSGI middleware, but by using a separate proxy, you get two benefits.

  1. The service can be written in any language that understands HTTP.
  2. The service gets to avoid many orthogonal concerns.

While the reasoning for repose makes a lot of sense, for someone not familiar with Java, it can be a little daunting to play with. Fortunately, the repose folks have provided some packages to make playing with repose pretty easy.

We’ll start with a docker container to run repose. The repose docs have an example we can use as a template. But first, let’s make a directory to play in.

$ mkdir repose-playground
$ cd repose-playground

Now let’s create our Dockerfile:

FROM ubuntu

RUN apt-get install -y wget

RUN wget -O - | apt-key add - && echo "deb stable main" > /etc/apt/sources.list.d/openrepose.list

RUN apt-get update && apt-get install -y \
  repose-valve \
  repose-filter-bundle

CMD ["java", "-jar", "/usr/share/repose/repose-valve.jar"]

The next step will be to start up our container and grab the default config files. This makes it much easier to experiment since we have decent defaults.

$ docker build -t repose-playground .
$ mkdir etc
$ docker run -it -v `pwd`/etc:/code repose-playground cp -r /etc/repose /code

Now that we have our config in ./etc/repose, we can try something out. Let’s change our default endpoint to point to a different website.

<?xml version="1.0" encoding="UTF-8"?>

<!-- To configure Repose see: -->
<system-model xmlns="">
    <repose-cluster id="repose">
        <nodes>
            <node id="repose_node1" hostname="localhost" http-port="8080"/>
        </nodes>
        <destinations>
            <!-- redirect to! -->
            <endpoint id="open_repose" protocol="http"
                      root-path="/" port="80" default="true"/>
        </destinations>
    </repose-cluster>
</system-model>

Now we’ll run repose from our container, using our local config instead of the config in the container.

$ docker run -it -v `pwd`/etc/repose:/etc/repose -p 8080:8080 repose-playground

If you’re using boot2docker, you can use boot2docker ip to find the IP of your VM.

$ export REPOSE_HOST=`boot2docker ip`
$ curl "http://$REPOSE_HOST:8080"

You should see the homepage HTML from!

Once you have repose running, you can leave it up and change the config as needed. Repose will periodically pick up any changes without restarting.

I’ve gone ahead and automated the steps in this repose-playground repo. While it can be tricky to get started with repose, especially if you’re not familiar with Java, it is worth taking a look at repose for implementing orthogonal requirements that would otherwise make the essential application code more complex. This is especially true if you’re using a microservices model where the less code the better. Just run repose on the same node, proxying requests to your service, which only listens on localhost, and you’re good to go.

Wed, 27 May 2015 00:00:00 +0000 <![CDATA[Docker vs. Provisioning]]> Docker vs. Provisioning

Lately, I’ve been playing around with Docker as I’ve moved back to OS X for development. At the same time, I’ve been getting acquainted with Chef in a reasonably complex production environment. As both systems have a decent level of overlap, it might be helpful to compare and contrast the different methodologies of these two deployment tactics.

What does Docker actually do?

Docker wraps up the container functionality built into the Linux kernel. Basically, it lets a process use the machine’s hardware in a very specific manner, using a predefined filesystem. When you use docker, it feels like starting up a tiny VM to run a process. But what really happens is that the container’s filesystem is used along with the hardware provided by the kernel in order to run the process in an isolated environment.

When you use Docker, you typically start from an “image”. The image is just an initial filesystem you’ll be starting from. From there, you might install some packages and add some files in order to run some process. When you are ready to run the process, you use docker run and it will get the filesystem ready and run the process using the computer’s hardware.

Where this differs from a VM is that you only start one process. While you might create a container that has Postgres, RabbitMQ and your own app installed, when you run docker run myimage myapp, no other processes are running. The container only provides the filesystem. It is up to the caller how the underlying hardware is accessed and utilized. This includes everything from the disk to the network.

What does a Provisioner do?

A provisioner, like Chef, configures a machine in a certain state. Like Docker, this means getting the file system sorted out, including installing packages, adding configuration, adding users, etc. A provisioner also can start processes on the machine as part of the provisioning process.

A provisioner usually starts from a known image. In this case, I’m using “image” in the more common VM context, where it is a snapshot of the OS. With that in mind, a provisioner doesn’t require a specific image, but rather the set of required resources necessary to consider the provisioned machine complete. For example, there is no reason you couldn’t use a provisioner to create user directories across a variety of unices, including OS X and the BSDs.

Different Deployment Strategies

The key difference when using Docker or a provisioner is the strategy used for deployment. How do you take your infrastructure and configure it to run your applications consistently?

Docker takes the approach of deploying containers. The concept of a container is that it is self contained. The OS doesn’t matter, assuming it supports docker. Your deployment then involves getting the container image and running the processes supported by the container.

From a development perspective, the deliverable artifact of the build process would be a container (or set of containers) to run your different processes. From there, you would configure your infrastructure accordingly, configuring the resources the processes can use at run time.

A provisioner takes a more generalized route. The provisioner configures the machine, therefore, it can use any number of deliverables to get your processes running. You can create system packages, programming language environments or even containers to get your system up and running.

The key difference from the devops perspective (the intersection of development and sysops), is development within constraints of the system must be coordinated with the provisioner. In other words, developers can’t simply choose some platform or application. All dependencies must be integrated into the provisioning system. A docker container, on the other hand, can be based on any image and use any resource available within the image’s filesystem.

What do you want to do?

The question of whether to use Docker or a provisioning system is not an either-or proposition. If you choose to use Docker containers as your deployment artifact, the actual machines may still need to be configured. There are options that avoid the need for a provisioning system, but generally you may still use something like Chef to maintain and provision the servers that will be running your containers.

One vector to make a decision on what strategy to use is the level of consistency across your infrastructure. If you are fine with developers creating containers that may use different operating systems and tooling, docker is an excellent choice. If you have hard requirements as to how your OS should be configured, using a provisioning system might be better suited for you.

Another thing to consider is development resources. It can be a blessing and a curse to provision a development system, no matter what system you use. Your team might be more than happy to take on managing containers efficiently, while other teams would be better off leaving most system decisions to the provisioning system. The ecosystem surrounding each platform is another consideration.


I don’t imagine that docker (and containers generally) will completely supplant provisioning services. But, I do believe the model does aid in producing more consistent deployment artifacts over time. Testing a container locally is a reasonably reliable means of ensuring it should run in production. That said, containers require that many resources (network, disk, etc.) be configured in order to work correctly. This is a non-trivial step, and making it work in development, especially when you consider devs using tools like boot2docker, can be a difficult process. It can be much easier to simply spin up a Vagrant VM with the necessary processes and be done with it. Fortunately, there are tools like docker compose and docker machine that seem to be addressing this shortcoming.

Tue, 19 May 2015 00:00:00 +0000 <![CDATA[Some Thoughts on Agile]]> Some Thoughts on Agile

I’ve recently started working in an “agile” environment. There are stories, story points, a board, etc. I couldn’t tell you whether it was pure scrum or some other flavor of “agile”, but I can say it is definitely meant to be an “agile” system of software development. It is not perfect, but that is sort of the point: to roll with the punches of the real world and do your best with a reasonable amount of data.

Some folks argue agile is nonsense, but the detractors typically treat agile as a concrete set of rules rather than as a tool. No project management technique perfectly compartmentalizes all problems into easily solvable units. The best we can do is utilize techniques in order to improve our chances of success writing software.

There are two benefits I’ve noticed when using an agile technique.

  1. You must communicate and record what is happening
  2. You may change things according to your needs

The requirement to communicate and record what’s happening is important because it forces developers to make information public. The process of writing a good story and updating the story with comments (I’m assuming some bug tracking software is being used) helps guard against problems going unnoticed. It also provides history that can be learned from. It holds people accountable.

Allowing change is also critical. Something like Scrum is extremely specific and detailed, yet as an organization, you have the option and privilege to adapt and change vanilla Scrum to your own requirements. For example, some organizations should use story points for estimating work and establishing velocity, while others would be better suited to using more specific time estimates. Both estimation methods have their place and you can choose the method that best meets your needs.

When adopting an agile practice it is a good idea to try out the vanilla version. But, just like your software, you should iterate and try things to optimize the process for your needs. It is OK to create stories that establish specifications. It is OK to use 1-10 for estimating work. It is OK to write “stories” more like traditional bug reports.

It is not OK to skip the communication and recording of what is going on, and it is not OK to ignore the needs of your organization in order to adhere to the tenets of your chosen agile methodology.

Fri, 08 May 2015 00:00:00 +0000 <![CDATA[Announcing rdo]]> Announcing rdo

At work I’ve been using Vagrant for development. The thing that bothers me most about using Vagrant or any remote development machine is the disconnect it presents with local tools. You are forced to either log in to the machine and run commands or jump through hoops to run the commands from the local machine, often losing the file system context that makes the local tools possible.

Local Tools

What I mean by local tools are things like IDEs or build code that performs tasks on your repository. IDEs assume you are developing locally and expect a local executable for certain tasks in order to work. Build code can be platform specific, which is likely why you are using Vagrant in the first place.

My answer to this is rdo.

Why rdo

I have a similar project called xe that you can configure to sort out your path when in a specific project. For example, if I have a virtualenv venv, in my cloned repo, I can use xe python to run the correct python without having to activate the virtualenv or manually include the path to the python executable.
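A rough Python sketch of the lookup xe might do; the venv/bin layout and the fall-back-to-PATH behavior are assumptions on my part, not xe’s actual code:

```python
import os


def resolve(cmd, project_dir, venv='venv'):
    # Prefer the project's virtualenv executable over the system one;
    # fall back to the bare command name if no virtualenv copy exists.
    candidate = os.path.join(project_dir, venv, 'bin', cmd)
    return candidate if os.path.exists(candidate) else cmd
```

With something like this, `xe python` resolves to `./venv/bin/python` when run inside the repo, without ever activating the virtualenv.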

rdo works in a similar way, the difference being that instead of adjusting the path, it configures the command to run on a remote machine, such as a Vagrant VM.

Using rdo

For example, lets assume you have a Makefile in your project repo. You’ve written a bootstrap task that will download any system dependencies for your app.

bootstrap:
	sudo apt-get install -y python-pip python-lxml

Obviously if you are on OS X or RHEL, you don’t use apt for package management, and therefore use a Vagrant VM. Rather than having to ssh into the VM, you can use rdo.

The first step is to create a config file in your repo.

driver = vagrant
directory = /vagrant

That assumes your Vagrantfile is mounting your repo at /vagrant. You can change it as needed.

From there you can use rdo to run commands.

$ rdo make bootstrap

That will compose a command to connect to the vagrant VM, cd to the correct directory and run your command.
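Conceptually, the composed command looks something like this sketch; this is an illustration of the idea, not rdo’s actual implementation:

```python
import shlex


def compose_remote(command, directory="/vagrant"):
    # Quote each argument, then run the whole thing on the VM via
    # `vagrant ssh -c` after changing to the configured directory.
    inner = "cd {} && {}".format(
        directory, " ".join(shlex.quote(part) for part in command))
    return ["vagrant", "ssh", "-c", inner]


# compose_remote(["make", "bootstrap"])
# → ["vagrant", "ssh", "-c", "cd /vagrant && make bootstrap"]
```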


I hope you give it a try and report back any issues. At the moment it is extremely basic in that it doesn’t do anything terribly smart as far as escaping goes. I hope to remedy that and to support generic ssh connections as well.

Thu, 07 May 2015 00:00:00 +0000 <![CDATA[Virtual Machine Development]]> Virtual Machine Development

I’ve recently started developing on OS X again on software that will be run on Linux. The solution I’ve used has been to use a Vagrant VM, but I’m not entirely happy with it. Here are a few other things I’ve tried.

Docker / Fig

On OS X, boot2docker makes it possible to use docker for running processes in containers. Fig lets you orchestrate and connect containers.


Fig is deprecated and will be replaced with Docker Compose, but I found that Docker Compose didn’t work for me on OS X.

The idea is that you’d run MySQL, RabbitMQ, etc. in containers and expose those processes’ ports and hosts to your app container. Here is an example:

mysql:
  image: mysql:5.5

app:
  build: path/to/myapp/  # A Dockerfile must be here
  links:
    - mysql

The app container can then access mysql as a host in order to get to the container running MySQL.

While I think this pattern could work, I found that it needs a bit too much hand holding. For example, you explicitly need to make sure volumes are set up for each service that needs persistence. Doing the typical database sync ended up being problematic because it wasn’t trivial to connect. I’m sure I was doing it wrong along the way, but it seems that you have to constantly tweak any tutorial because you have to use boot2docker.

Docker Machine

Another tactic I used was docker-machine. This is basically how boot2docker works. It can start a machine, configured by docker, and provide you commands so you can run things on that machine via the normal docker command line. This seemed promising, but in the end, it was pretty much the same as using Vagrant, only a lot less convenient.

I also considered using it with my Rackspace account, but, for whatever reason, the client couldn’t destroy machines, which made it much less enticing.


One thing that was frustrating with Vagrant is that if you use a virtualenv that is on a part of the file system mounted from the host (i.e. OS X), doing any sort of package loading is really slow. I have no clue why this is. I ended up just installing things as root, but I think a better tactic might be to use virtualenvwrapper, which should install it to the home directory, while code still lives in /vagrant/*.

One thing that I did do initially was to write a Makefile for working with Vagrant. Here is a snippet:

VCMD=vagrant ssh -c

bootstrap:
	$(VCMD) 'virtualenv $(VENV)'
	$(VCMD) 'cd $(SRC) && $(VENV)/bin/pip install -r requirements.txt -r test-requirements.txt'
	$(VCMD) 'cd $(SRC) && $(VENV)/bin/python setup.py develop'

test:
	$(VCMD) 'cd $(SRC) && $(VENV)/bin/tox'

It is kind of ugly, but it more or less works. I also tried some other options such as using my xe package to use paramiko or fabric, but both tactics made it too hard to simply do things like:

$ xe tox -e py27

And make xe figure out what needs to happen to run the commands correctly on the remote host. What is frustrating is that docker managed to essentially do this aspect rather well.


OS X is not Linux. There are more than enough differences that make developing locally really difficult. Also, most code is not meant to be portable. I’m on the fence as to whether this is a real problem or just a fact of life with more work being done on servers in the cloud. Finally, virtualization and containers still need tons of work. It feels a little like the wild west in that there are really no rules and very few best practices. The potential is obvious, but the path is far from paved. Of the two, virtualization definitely feels like a better tactic for development. With that in mind, it would be even better if you could simply do with Vagrant what you can do with docker. Time will tell!

Even though I didn’t manage to make major strides into a better dev story for OS X, I did learn quite a bit about the different options out there. Did it make me miss using Linux? Yes. But I haven’t given up yet!

Fri, 10 Apr 2015 00:00:00 +0000 <![CDATA[tsf and randstr]]> tsf and randstr

I wrote a couple really small tools the other day that I packaged up. I hope someone else finds them useful!


tsf directs stdin to a timestamped file or directory. For example:

$ curl | tsf homepage.html

The output from the cURL request goes into a file 20150326123412-homepage.html. You can also specify that a directory should be used.

$ curl | tsf -d homepage.html

That will create a homepage.html directory with the timestamped files in the directory.

Why is this helpful?

I often debug things by writing small scripts that automate repetitive actions. It is common that I’ll keep around output for this sort of thing so I can examine it. Using tsf, it is a little less likely that I’ll overwrite a version that I needed to keep.

Another place it can be helpful is if you periodically run a script and you need to keep the result in a time stamped file. It does that too.
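The timestamping itself is easy to sketch; the format string below is an assumption based on the 20150326123412-homepage.html example above, not tsf’s actual source:

```python
from datetime import datetime


def timestamped_name(filename, now=None):
    # Prefix the filename with a YYYYMMDDHHMMSS timestamp so repeated
    # runs never overwrite each other.
    now = now or datetime.now()
    return "{}-{}".format(now.strftime("%Y%m%d%H%M%S"), filename)


# timestamped_name("homepage.html", datetime(2015, 3, 26, 12, 34, 12))
# → "20150326123412-homepage.html"
```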


randstr is a function that creates a random string.

from randstr import randstr


randstr provides some globals containing different sets of characters you can pass in to the call in order to get different varieties of random strings. You can also set the length.
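For illustration, a minimal version of the function might look like this; the real package’s signature and globals may differ:

```python
import random
import string

# Character sets like this are presumably what the package exposes as
# globals for callers to pass in.
ALPHANUMERIC = string.ascii_letters + string.digits


def randstr(chars=ALPHANUMERIC, length=10):
    # Build a string by sampling the character set with replacement.
    return ''.join(random.choice(chars) for _ in range(length))
```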

Why is this helpful?

I’ve written this function a ton of times so I figured it was worth making a package out of it. It is helpful for testing because you can easily create random identifiers. For example:

from randstr import randstr

batch_records = [{'name': randstr()} for i in range(1000)]

I’m sure there are other tools out there that do similar or exactly the same thing, but these are mine and I like them. I hope you do too.

Thu, 26 Mar 2015 00:00:00 +0000 <![CDATA[Automate Everything]]> Automate Everything

I’ve found that it has become increasingly difficult to simply hack something together without formally automating the process. Something inside me just can’t handle the idea of repeating steps that could be automated. My solution has been to become faster at the process of formal automation. Most of the steps are small, easy to do and don’t take much time. Rather than feeling guilty that I’m wasting time by writing a small library or script, I work to make the process faster and am able to re-use these scripts and snippets in the future.

A nice side effect is that writing code has become much more fluid. I get more practice using essential libraries and tools where over time they’ve become second nature. It also can be helpful getting in the flow because taking the extra steps of writing to files and setting up a small package feels like a warm up of sorts.

One thing that has been difficult is navigating the wealth of options. For example, I’ve gone back to OS X for development. I’ve had to use VMs for running processes and tests. I’ve been playing with Vagrant and Docker. These can be configured with chef, ansible, or puppet in addition to writing a Vagrantfile or Dockerfile. Does chef set up a docker container? Do you call chef-apply in your Dockerfile? On OS X you have to use boot2docker, which seems to be a wrapper around docker machine. Even though I know the process can be configured to be completely automated, it is tough to feel as though you’re doing it right.

Obviously, there is a balance. It can be easy to become caught in a quagmire of automation, especially when you’re trying to automate something that was never intended to be driven programmatically. At some point, even though it hurts, I have to just bear down and type the commands or click the mouse over and over again.

That is until I break down and start writing some elisp to do it for me ;)

Fri, 20 Mar 2015 00:00:00 +0000 <![CDATA[Server Buffer Names in Circe]]> Server Buffer Names in Circe

Circe is an IRC client for Emacs. If you are dying to try out Emacs for your IRC-ing needs, it already comes with two other clients, ERC and rcirc. Both work just fine. Personally, I’ve found circe to be a great mix of helpful features alongside simple configuration.

One thing that was always bugging me was the server buffer names. I use an IRC bouncer that keeps me connected to the different IRC networks I use. At work, I connect to each network using a different username over a port forwarded by ssh. The result is that I get 3 buffers with extremely descriptive names such as localhost:6668<2>. I’d love to have names like *irc-freenode* instead, so here is what I came up with.

First off, I wrote a small function to connect to each network that looks like this:

(defun my-start-ircs ()
  (interactive)
  ;; Tell circe not to show mode changes as they are pretty noisy
  (circe-set-display-handler "MODE" (lambda (&rest ignored) nil))
  ;; Connect to each network
  (start-freenode-irc))

Then for each IRC server I call the normal circe call. The circe call returns the server buffer. In order to rename the buffer, I can do the following:

(defun start-freenode-irc ()
  (with-current-buffer (circe "localhost"
                              :port 6689
                              :nick "elarson"
                              :user "eric_on_freenode"
                              :password (my-irc-pw))
    (rename-buffer "*irc-freenode*")))

Bingo! I get a nice server buffer name. I suspect this could work with ERC and rcirc, but I haven’t tried it. Hope it helps someone else out!

Thu, 12 Mar 2015 00:00:00 +0000 <![CDATA[Hello Rackspace!]]> Hello Rackspace!

After quite a while at YouGov, I’ve started a new position at Rackspace working on the Cloud DNS team! Specifically, I’m working on Designate, a DNS as a Service project that is in incubation for OpenStack. I’ve had an interest in OpenStack for a while now, so I feel extremely lucky I have the opportunity to join the community with one of the founding organizations.

One thing that has been interesting is the idea of the Managed Cloud. AWS focuses on Infrastructure as a Service (IaaS). Rackspace also offers IaaS, but takes that a step farther by providing support. For example, if you need a DB as a Service, AWS provides services like Redshift or SimpleDB. It is up to the users to figure out ways to optimize queries and tune the database for the user’s specific needs. In the Managed Cloud, you can ask for help and know that an experienced expert will understand what you need to do and help to make it happen, even at the application level.

While this support feels expensive, it can be much cheaper than you think when you consider the amount of time developers spend becoming pseudo experts at a huge breadth of technology that doesn’t offer any actual value to a business. Just think of the time spent on sharding a database, maintaining a CI service, learning the latest / greatest container technology, building your own VM images, maintaining your own configuration management systems, etc. Then imagine having a company that will help you set it up and maintain it over time. That is what the managed cloud is all about.

It doesn’t stop at software either. You can mix and match physical hardware with traditional cloud infrastructure as needed. If your database server needs real hardware that is integrated with your cloud storage and cloud servers, Rackspace can do it. If you have strict security compliance requirements that prevent you from using traditional IaaS providers, Rackspace can help there too. If you need to use VMWare or Windows as well as Open Stack cloud technologies, Rackspace has you covered.

I just got back from orientation, so I’m still full of the Kool-Aid.

That said, Fanatical Support truly is ingrained in the culture here. It started when the founders were challenged. They hired someone to get a handle on support and he proposed Fanatical Support. His argument was simple: if we offer a product and don’t support it, we are lying to our customers. The service they are buying is not what they are getting, so don’t be a liar and give users Fanatical Support.

I’m extremely excited to work on great technology at an extremely large scale, but more importantly, I’m ecstatic to be working at a company that ingrains integrity and treats its customers and employees with the utmost respect.

Mon, 09 Mar 2015 00:00:00 +0000 <![CDATA[Setting up magit-gerrit in Emacs]]> Setting up magit-gerrit in Emacs

I recently started working on OpenStack and, being an avid Emacs user, I hoped to find a more integrated workflow with my editor of choice. Of the options out there, I settled on magit-gerrit.

OpenStack uses git for source control and gerrit for code review. The way code gets merged into OpenStack is through code review and gerrit. In a nutshell, you create a branch, write some code, submit a code review and after that code is reviewed and approved, it is merged upstream. The key is ensuring the code review process is thorough and convenient.

As developers with specific environments, it is crucial to be able to quickly download a patch and play around with the code. For example, running the tests locally or playing around with a new endpoint is important when approving a review. Fortunately, magit-gerrit makes this process really easy.

First off, you need to install the git-review tool. This is available via pip.

$ pip install git-review

Next up, you can check out a repo. We’ll use the Designate repo because that is what I’m working on!

$ git clone
$ cd designate

With a repo in place, we can start setting up magit-gerrit. Assuming you’ve set up Melpa, you can install it via M-x package-install RET magit-gerrit. Then add this to your Emacs init file:

(require 'magit-gerrit)

The magit-gerrit docs suggest setting two variables.

;; if remote url is not using the default gerrit port and
;; ssh scheme, need to manually set this variable
(setq-default magit-gerrit-ssh-creds "")

;; if necessary, use an alternative remote instead of 'origin'
(setq-default magit-gerrit-remote "gerrit")

The magit-gerrit package can infer the magit-gerrit-ssh-creds from the magit-gerrit-remote. This makes it easy to configure your repo via a .dir-locals.el file.

((nil . ((magit-gerrit-remote . "ssh://"))))

Once you have your repo configured, you can open it in magit via M-x magit-status. You should also see a message saying “Detected magit-gerrit-ssh-creds” that shows the credentials used to log in to the gerrit server. These are simple ssh credentials, so if you can’t ssh into the gerrit server using the credentials, then you need to adjust your settings accordingly.

If everything is configured correctly, there should be an entry in the status page that lists any reviews for the project. The listing shows the summary of the review. You can navigate to the review and press T to get a list of options. From there, you can download the patchset as a branch or simply view the diff. You can also browse to the review in gerrit. From what I can tell, you can’t comment on a review, but you can give a +/- for a review.

I’ve just started using gerrit and magit-gerrit, so I’m sure there are some features that I don’t fully understand. For example, I’ve yet to understand how to re-run git review in order to update a patch after getting feedback. Assuming that isn’t supported, I’m sure it shouldn’t be too hard to add.

Feel free to ping me if you try it and have questions or tips!

Thu, 05 Mar 2015 00:00:00 +0000 <![CDATA[Todo Lists]]> Todo Lists

With a baby on the way, I’ve started reconsidering how I keep track of my TODO list.

Seeing as I spend an inordinate amount of time in Emacs, org-mode is my go-to tool when it comes to keeping track of my life. I can keep notes, track time and customize my environment to support a workflow that works for me. The problem is life happens outside of Emacs. Shocking, I know.

So, my goal is to have a system that integrates well with org-mode and Emacs, while still allows me to use my phone, calendar, etc. when I’m away from my computer. Also, seeing as Emacs doesn’t provide an obvious, effective calendaring solution like GMail does, I want to be sure I’m able to schedule things so I do them at specific times and have reminders.

With that in mind, I started looking at org-mobile. It seems like the perfect solution. It is basically a way to edit an org file on my phone and will (supposedly) sync deadlines and scheduled items to my calendar as one would expect. Sure, the UI could use some work, but having to type the date rather than picking it from a slick dialog on my phone seemed like a reasonable trade-off...

Unfortunately, it didn’t work. I had one event sync’d to my google calendar, but that was the end of that. It didn’t seem to add anything to my calendar no matter the settings. That is a deal breaker.

I’m currently starting to play with org-trello instead. I’m confident I can make this work for two reasons.

  1. The mobile app is nice
  2. The sync’ing in Emacs is nice

What doesn’t work (yet?) is adding deadlines or scheduling to my calendar, but seeing as this new year I’m resolving to slow down, I’m going to stop trying to over optimize and just add stuff to my calendar. It is a true revelation, I know.

One thing I did consider was just skipping the computer all together and using a physical planner. The problem with a planner for me is,

  1. My handwriting is atrocious
  2. It doesn’t sync to my calendar

In addition to trying to understand my handwriting, I’d have to develop a habit to always look at it. I can’t see it happening when I can get my phone to annoy me as needed.

Part of this effort is part of a larger plan to use some of the tactics in GTD to get tasks off of my mental plate and put them somewhere useful. So far, it has been sticking more than it ever has, so I’m hopeful this could be the time it becomes a real habit. Wish me luck, as I’m sure I’ll need it!

Thu, 05 Feb 2015 00:00:00 +0000 <![CDATA[CacheControl 0.11.0 Released]]> CacheControl 0.11.0 Released

Last night I released CacheControl 0.11.0. The big change this time around is using compressed JSON to store the cache response rather than a pickled dictionary. Using pickle caused some problems if a cache was going to be read by both Python 3.x and Python 2.x clients.

Another benefit is that avoiding pickle also avoids exec’ing potentially dangerous code. It is not unreasonable that someone could include a header that could cause problems. This hasn’t happened yet, but it wouldn’t surprise me if it were feasible.

Finally, the size of the cached object should be a little smaller. Generally, responses are not going to be that large, but compression should help if you use storage that keeps hot keys in memory. MongoDB comes to mind, along with Memcached and, probably, Redis. It could also be valuable if you have been avoiding caching large objects. For example, a large sparse CSV response might compress well enough to make caching it reasonable.

I haven’t done any conclusive tests regarding the actual size impact of compression, so these are just my theories. Take them with a grain of salt or let me know your experiences.
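For illustration, the idea can be sketched with nothing but the standard library. This is not CacheControl’s actual serializer, just the general shape of storing compressed JSON instead of a pickled dictionary:

```python
import json
import zlib

def dumps(data):
    """Serialize a dict to compressed JSON bytes."""
    return zlib.compress(json.dumps(data).encode('utf-8'))

def loads(blob):
    """Restore a dict from compressed JSON bytes."""
    return json.loads(zlib.decompress(blob).decode('utf-8'))

# A repetitive payload, like a sparse CSV body, compresses well.
response = {'headers': {'content-type': 'text/csv'},
            'body': 'a,,,,\n' * 1000}
blob = dumps(response)
```

Any client in any language with JSON and zlib support can read the resulting blob, which is exactly what pickle could not offer across Python 2.x and 3.x.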

Huge thanks to Donald Stufft for sending in the compressed JSON patches, as well as all the folks who have submitted other suggestions and pull requests.

The Future

I’ve avoided making any major changes to CacheControl as it has been reasonably stable and flexible. There are some features others have requested that have been too low on my own personal priorities to take the time to implement.

One thing I’ve been tempted to do is add more storage implementations. For example, I started working on a SQLite store. My argument, to myself at least, was that the standard library supports SQLite, which makes it a reasonable target.

I decided to stop that work as it didn’t really seem very helpful. What did seem enticing was the idea of a cache store becoming queryable. Unfortunately, since the cache store API only gets a key and a blob for the value, it would require any cache store author to unpack the blob in order to read any values it is interested in.

In the future I’ll be adding a hook system to let a cache store author have access to the requests.Response object in order to create extra arguments for setting the value.

For example, in Redis, you can set an expires time that the DB will use to expire the response automatically. The cache store then might have an extra method that looks like this:

class RedisCache(BaseCache):

    def on_cache_set(self, response):
        # Return extra keyword arguments for set() based on the response.
        if 'expires' not in response.headers:
            return {}
        return {'expires': response.headers['expires']}

    def set(self, key, value, expires=None):
        # Set the value, passing the expiry along so Redis can remove
        # the key automatically (the header value would need converting
        # to seconds first).
        self.conn.set(key, value, ex=expires)

I’m not crazy about this API as it is a little confusing to communicate that creating an on_cache_set hook is really a way to edit the arguments sent to the set method. Maybe calling it a hook is really the wrong term. Maybe it should be called prepare and it should explicitly call set. If anyone has thoughts, please let me know!

The reasoning is that I’d like to remove the Redis cache and start a new project for CacheControl stores that includes common cache store implementations. At the very least, I’d like to find some good implementations that I can link to from the docs to help folks find a path from a local file cache to using a real database when the time comes.

Lastly, there are a couple spec related details that could use some attention that I’ll be looking at in the meantime.

Wed, 28 Jan 2015 00:00:00 +0000 <![CDATA[Replacing Monitors]]> Replacing Monitors

I just read a quick article on Microsoft’s new VR goggles. The idea of layering virtual interfaces on top of the real world seems really cool and even practical in some use cases. What seems really difficult is how an application will understand the infinite number of visual environments in order to effectively and accurately visualize the interfaces. Hopefully the SDK for the device includes a library that provides for different real world elements, like placeOnTop(vObject, x, y, z), where it can recognize some object in the room and allow that object to be made available as a platform. In other words, it sees the coffee table and makes it available as an object that you can put something on top of.

The thing is, what I’d love to see is VR replacing monitors! Every year TVs and monitors get upgrades in size and quality, yet the prices rarely drop very radically. Right now I have a laptop and an external monitor that I use at home. I’d love to get rid of both tiny screens and just look into space and see my windows on a huge, virtual, flat surface.

An actual computer would still be required and there wouldn’t necessarily need to be huge changes to the OS at first. The goggles would just be another big screen that would take the rasterized screen and convert it to something seen in the analog world. Sure, this would probably ruin a few eyes at first, but having a huge monitor measured in feet vs. inches is extremely enticing.

Thu, 22 Jan 2015 00:00:00 +0000 <![CDATA[DevOps]]> DevOps

I finally realized why DevOps is an idea. Up until this point, I felt DevOps was a term for a developer that was also responsible for the infrastructure. In other words, I never associated DevOps with an actual strategy or idea, and instead, it was simply something that happened. Well no more!

DevOps is providing developers keys [1] to operations. In a small organization these keys never have a chance to leave the hands of the small team of developers, who have nothing to concern themselves with except getting things done. As an organization grows, there emerges a dedicated person (and soon after, a group of people) responsible for maintaining the infrastructure. What happens is that the keys developers had to log into any server or install a new database are taken away and given to operations to manage. DevOps is a trend where operations and developers share the keys.

Because developers and operations both have access to change the infrastructure, there is a meeting of the minds that has to happen. Developers and Ops are forced to communicate the what, where, when, why and how of changes to the infrastructure. Since folks in DevOps are familiar with code, version control becomes a place of communication and cohesion.

The reason I now understand this paradigm more clearly is that when a developer doesn’t have access to the infrastructure, it is a huge waste of time. When code doesn’t work we need to be able to debug it. It is important to come up with theories why things don’t work and iteratively test the theories until we find the reason for the failure. While it is possible to debug bugs that only show up in production, it can be slow, frustrating and difficult when access to the infrastructure isn’t available.

I say all this with a huge grain of salt. I’m not a sysadmin and I’ve never been a true sysadmin. While I understand the hand wavy idea that if all developers have ssh keys to the hosts in a datacenter, there are more vectors for attack. What I don’t understand is why a developer with ssh keys is any more dangerous than a sysadmin having ssh keys. Obviously, a sysadmin may have a more stringent outlook on what is acceptable, but at the same time, anyone can be duped. After all, if you trust your developers to write code that writes to your database millions (or billions!) of times a day, I’m sure you can trust them to keep an ssh key safe or avoid exposing services that are meant to remain private.

I’m now a full on fan of DevOps. Developers and Ops working together and applying software engineering techniques to everything they work with seems like a great idea. Providing Developers keys to the infrastructure and pushing Ops to communicate the important security and/or hardware concerns is only a good thing. The more cohesion between Ops and Dev, the better.

[1]I’m not talking about ssh keys, but rather the idea of having “keys to the castle”. One facet of this could be making sure developer keys and accounts are available on the servers, but that is not the only way to give devs access to the infrastructure.
Fri, 16 Jan 2015 00:00:00 +0000 <![CDATA[Logging]]> Logging

Yesterday I fixed an issue in dadd where logs from processes were not being correctly sent back to the master. My solution ended up being a rather specific process of opening the file that would contain the logs and ensuring that any subprocesses used this file handle.

Here is the essential code annotated:

# This daemonizes the code. It can accept stdin/stdout parameters
# that I had originally used to capture output. But, the file used for
# capturing the output would not be closed or flushed and we'd get
# nothing. After this code finishes we do some cleanup, so my logs were
# empty.
with daemon.DaemonContext(**kw):

    # Just watching for errors. We pass in the path of our log file
    # so we can upload it for an error.
    with ErrorHandler(spec, env.logfile) as error_handler:

        # We open our logfile as a context manager to ensure it gets
        # closed, and more importantly, flushed and fsync'd to the disk.
        with open(env.logfile, 'w+') as output:

            # Pass in the file handle to our worker, which starts some
            # subprocesses that we want to know the output of.
            worker = PythonWorkerProcess(spec, output)

            # printf => print to file... I'm sure this will get
            # renamed in order to avoid confusion...
            printf('Setting up', output)
            try:
                printf('Starting', output)
                worker.run()  # illustrative; runs the subprocesses
            except Exception:
                import traceback

                # Print our traceback in our logfile
                printf(traceback.format_exc(), output)

                # Raise the exception for our error handler to upload
                # the log to our dadd master and send me an email.
                raise

            # Wrapping things up
            printf('Finishing', output)

Hopefully, the big question is, “Why not use the logging module?”

When I initially hacked the code, I just used print and had planned on letting the daemon library capture logs. That would make it easy for the subprocesses (scripts written by anyone) to get logs. Things were out of order though, and by the time the logs were meant to be sent, the code had already cleaned up the environment where the subprocesses had run, including deleting the log file.

My next step then was to use the logging module.

Logging is Complicated!

I’m sure it is not the intent of the logging module to be extremely complex, but the fact is, the management of handlers, loggers and levels across a wide array of libraries and application code gets unwieldy fast. I’m not sure people run into this complexity that often, as it is easy to use the basicConfig and be done with it. As an application scales, logging becomes more complicated and, in my experience, you either explicitly log to syslog (via the syslog module) or to stdout, where some other process manager handles the logs.

But, in the case where you do use logging, it is important to understand some essential complexity you simply can’t ignore.

First off, configuring loggers needs to be done early in the application. When I say early, I’m talking about at import time. The reason is that libraries, which should try to log intelligently, might configure the logging system as soon as they are imported.

Secondly, the propagation of the different loggers needs to be explicit, and again, some libraries / frameworks are going to do it wrong. By “wrong”, I mean that the assumptions the library author makes don’t align with your application. In dadd, I’m using Flask. Flask comes with a handy app.logger object that you can use to write to the log. It has a specific formatter as well that makes messages really loud in the logs. Unfortunately, I couldn’t use this logger because I needed to reconfigure the logs for a daemon process. The problem was that this daemon process was in the same repo as my main Flask application. If my daemon logging code gets loaded, which is almost certain to happen, it reconfigures the logging module, including Flask’s handy app.logger object. It was frustrating to test logging in my daemon process only to find that my Flask logs had disappeared. When I got them back, I ended up seeing things show up multiple times because different handlers had been attached that use the same output, which leads me to my next gripe.

The logging module is opaque. It would be extremely helpful to be able to inject at some point in your code a pprint(logging.current_config) that would provide the current config at that point in the code. In this way, you could intelligently make efforts to update the config correctly with tools like logging.config.dictConfig by editing the current config or by using the incremental and disable_existing_loggers options correctly.
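For what it’s worth, a minimal dictConfig at least makes the configuration explicit, even if the module’s internal state stays opaque. The 'myapp' logger name here is just a placeholder:

```python
import logging
import logging.config

# disable_existing_loggers=False keeps loggers that libraries
# configured at import time from being silently shut off.
config = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'plain': {'format': '%(name)s %(levelname)s %(message)s'},
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'plain',
            'stream': 'ext://sys.stderr',
        },
    },
    'loggers': {
        'myapp': {'handlers': ['console'], 'level': 'DEBUG'},
    },
}

logging.config.dictConfig(config)

# There is no logging.current_config, but you can at least inspect a
# logger's effective level and attached handlers after the fact.
log = logging.getLogger('myapp')
```

Calling dictConfig once, as early as possible, sidesteps a lot of the import-order surprises described above.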

Logging is Great

I’d like to make it clear that I’m a fan of the logging module. It is extremely helpful as it makes logging reliable and can be used in a multithreaded / multiprocessing environment. You don’t have to worry about explicitly flushing the buffer or fsync’ing the file handle. You have an easy way to configure the output. There are excellent handlers that help you log intelligently, such as the RotatingFileHandler, WatchedFileHandler and SysLogHandler. Many libraries also allow turning up the log level to see more deeply into what they are doing. Requests and urllib3 do a pretty decent job of this.

The problem is that controlling output is a different problem than controlling logging, yet they are intertwined. If you find it difficult to add some sort of output control to your application and the logging module seems to be causing more problems than it is solving, then don’t use it! The technical debt you need to pay off for a small, customized output control system is extremely low compared to the hoops you might need to jump through in order to mold logging to your needs.

With that said, learning the logging module is extremely important. Django provides a really easy way to configure the logging, and you can be certain that it gets loaded early enough in the process that you can rely on it. Flask and CherryPy (and I’m sure others) provide hooks into their own loggers that are extremely helpful. Finally, basicConfig is a great tool to get started logging in standalone scripts that need to differentiate between DEBUG statements and INFO. Just remember, if things get tough and you feel like you’re battling logging, you might have hit the edges of its valid use cases and it is time to consider another strategy. There is no shame in it!

Wed, 14 Jan 2015 00:00:00 +0000 <![CDATA[Build Tools]]> Build Tools

I’ve recently been creating more new projects for small libraries / apps and it’s got me thinking about build tools. First off, “build tools” is probably a somewhat misleading name, yet I think it is most often associated with the type of tools I’m thinking of. Make is the big one, but there are a whole host of tools like Make on almost every platform.

One of the apps I’ve been working on is a Flask application. For the past year or so, I’ve been working on a Django app along side some CherryPy apps. The interesting thing about these different frameworks is the built in integration with some sort of build tool.

CherryPy essentially has no integration. If you’re unfamiliar with CherryPy, I’d argue it is the un-opinionated web framework, so it shouldn’t be surprising that there are no tools to stub out some directories, provide helpers for connecting to the DB, or start a shell with the framework code loaded.

Flask is similar to CherryPy in that it is a microframework, but the community has been much more active in providing plugins (Blueprints in Flask terms) that provide more advanced functionality. One such plugin mimics Django’s manage.py file, which provides typical build tool and project helpers.

Django, as I just mentioned, provides a manage.py file that adds some project helpers and, arguably, functions as a generic build tool.

I’ve become convinced that every project should have some sort of build tool in place that codifies how to work with the project. The build tool should automate how to build and release the software, along with how that process interacts with the source control system. The build tool should provide helpers, where applicable, to aid in development and debugging. Finally, the build tool should help in running the app’s processes and/or supporting processes (i.e. make sure a database is up and running).
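As a sketch of what codifying a workflow can look like, here is a tiny manage.py-style helper in Python. The task names and the commands they run are hypothetical stand-ins for whatever a real project would need:

```python
import argparse
import subprocess
import sys

# Hypothetical tasks; a real project would codify its own build,
# test, and release commands here.
TASKS = {
    'test': [sys.executable, '-m', 'pytest'],
    'release': [sys.executable, 'setup.py', 'sdist'],
}

def main(argv):
    """Parse a task name and run the associated command."""
    parser = argparse.ArgumentParser(description='project helper')
    parser.add_argument('task', choices=sorted(TASKS))
    args = parser.parse_args(argv)
    return subprocess.call(TASKS[args.task])
```

A developer new to the project can then run something like `python manage.py test` without having to learn the project’s conventions first.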

Yet, many projects don’t include these sorts of features. Frameworks, as we’ve already seen, don’t always provide it by default, which is a shame.

I certainly understand why programmers avoid build tools, especially in our current world where many programs don’t need an actual “build”. As programmers, we hope to create generalized solutions, while we are constantly pelted with proposed productivity gains in the form of personal automations. The result is that while we create simple programs that guide our users toward success, when it comes to writing code, we avoid prescribing how to develop on a project as if it’s a plague that will ruin free thinking.

The irony here is that Django, the framework with a built in build tool, is an order of magnitude more popular. I’m confident the reason for its popularity lies in the integrated build tool, that helps developers find success.

At some point, we need to consider how other developers work with our code. As authors of a codebase, we should consider our users, our fellow developer, and provide them with build tools that aid in working on the code. The goal is not to force everyone into the same workflow. The goal is to make our codebases usable.

Mon, 12 Jan 2015 00:00:00 +0000 <![CDATA[Good Enough]]> Good Enough

Eric Shrock wrote a great blog on Engineer Anti-Patterns. I, unfortunately, can admit I’ve probably been guilty of each and every one of these patterns at one time or another. When I think back to times these behaviors have crept in, the motivations always come back to what is really “good enough”.

For example, I’ve been “the talker” when I’ve seen a better solution to a problem and can’t let it go. The proposed solution was considered “good enough” but not to me. My perspective of what is good enough clashes with that of the team and I feel it necessary to argue my point. I wouldn’t say that my motives are wrong, but at some point, a programmer must understand when and how to let an argument go.

The quest to balance “good enough” with best practices is rarely a simple yes or no question. Financial requirements might force you to make poor engineering decisions in favor of losing money in that moment. There are times where a program hasn’t proven its value, therefore strong engineering practices aren’t as important as simply proving the software is worth being written. Many times, writing and releasing something, even if it is broken, is a better political decision in an organization.

I suspect that most of these anti-patterns are a function of fear, specifically, the fear of failing. All of the anti-patterns reflect a lack of confidence in a developer. It might be imposter syndrome creeping in or the feeling of reliving a bad experience in the past. In order for programmers to be more productive and effective, it is critical that effort is made to reduce fear. In doing so, a developer can try a solution that may be simply “good enough”, as the programmer knows that if it falls short, it is something to learn from rather than fear.

Our goals as software developers and as an industry should be to raise the bar of “good enough” to the point where we truly are making educated risk / reward decisions instead of rolling the dice that some software is “good enough”.

The first step is to reduce the fear of failure. An organization should take steps to provide an environment where developers can easily and incrementally release code. Having tests you run locally, then in CI, then in a staging environment before finally releasing to production helps developers feel confident that pushing code is safe.

Similarly, an organization should make it easy to find failures. Tests are an obvious example here, but providing well known libraries for easy integration into your logging infrastructure and error reporting are critical. It should be easy for developers to poke around in the environment where things failed to see what went wrong. Adding new metrics and profiling to code should be documented and encouraged.

Finally, when failures do occur, they should not be a time to place blame. There should be blameless postmortems.

Many programmers and organizations fear spending time on basic tooling and consistent development environments. Some developers feel having more than one step between commit and release represents a movement towards perfection, the enemy of “good enough”. We need to stop thinking like this! Basic logging, error reporting, writing tests and basic release strategies are all critical pieces that have been written and rewritten over and over again at thousands of organizations. We need to stop avoiding basic tenets of software development under the guise of being “good enough”.

Fri, 09 Jan 2015 00:00:00 +0000 <![CDATA[Software Project Structure and Ecosystems]]> Software Project Structure and Ecosystems

Code can smell, but just like our own human noses, everyone has his/her own perspective on what stinks. Also, just because something smells bad now, it doesn’t mean you can’t get used to the smell and eventually enjoy it. The same applies for how software projects are organized and configured.

Most languages have the concept of a package. Your source code repository is organized to support building a package that can be distributed via your language’s ecosystem. Python has setuptools/pip, Perl has CPAN, JavaScript has NPM, Ruby has gems, etc. Even compiled languages provide packages by way of providing builds for different operating system packaging systems.

In each package use case, there are common idioms and best practices that the community supports. There are tools that promote these ideals that end up part of the language and community packaging ecosystem. This ecosystem often ends up permeating not just the project layout, but the project itself. As developers, we want to be participants in the ecosystem and benefit from everything that the ecosystem provides.

As we become accustomed to our language ecosystem and its project tendencies, we develop an appreciation for the aroma of the code. Our sense of code smell can be tainted to the point of feeling that other project structures smell bad and are somehow “wrong” in how they work.

If you’ve ever gone from disliking some cuisine to making it an integral part of your diet, you will quickly see the problem with associating a community’s ecosystem with sound software development practices. By disregarding different patterns as “smelly”, you also lose the opportunity to glean from positive aspects of other ecosystems.

A great example is Python’s virtualenv infrastructure compared to Go’s complete lack of packages. In Python, you create source based packages (most of the time) that are extracted and built within a virtual Python environment. If the package requires other packages, the environment uses the same system (usually pip) to download and build the sub-requirements. Go, on the other hand, requires that you specify libraries and other programs you use in your build step. These requirements are not “packaged” at all and typically use source repositories directly when defining requirements. These requirements become a part of your distributed executable. The executable is a single file that can be copied to another machine and run without requiring any libraries or tools on the target machine.

Both of these systems are extremely powerful, yet radically different. Python might benefit a great deal by building in the ability to package up a virtualenv and a specific command as a single file that could be run on another system. Similarly, Go could benefit from a more formalized package repository system that ensures higher security standards. It is easy to look at either system and feel the lack of the other’s packaging features is detrimental, when in fact, they are just different design trade-offs. Python has become very successful on systems such as Heroku, where it is relatively easy to take a project from source code to a release because the ecosystem promotes building software during deployment. Go, on the other hand, has become an ideal system management tool because it is trivial to copy a Go program to another machine and run it without requiring OS specific libraries or dependencies.

While packaging is one area we see programmers develop a nose for coding conventions, it doesn’t stop there.

The language ecosystem also prescribes coding standards. Some languages are perfectly happy to have a huge single file while others require a new file for each class / function. Each methodology has its own penalties and rewards. Many a Python / Vimmer has retched at the number of files produced when writing Java, while the Java developer stares in shock as the Vimmer uses search and replace to refactor code in a project. Again, these are design decisions with their own sets of trade-offs.

One area where code smell becomes especially problematic is when it involves a coding paradigm that is different from the ecosystem’s. Up until recently, the Twisted and Stackless ecosystems in Python felt strange and isolated. Many developers felt that things like deferreds and greenlets were code smell when compared to the tried and true threads. Yet, as we discovered the need for more socket connections to be open at the same time and as we needed to read more I/O concurrently, it became clear that asynchronous programming models should be a first class citizen in the Python ecosystem at large. Prior to the ecosystem’s acceptance of async, the best practices felt very different and quite smelly.

Fortunately, in the case of async, Python managed to find a way to be more inclusive. The async paradigm still has far reaching requirements (there needs to be a main loop somewhere...), but the community has begun the process of integrating the ideas as seamless, natural, fragrant additions to the language.

Unfortunately, other paradigms have not been as well integrated. Functional programming, while having many champions and excellent libraries, still has not managed to break into the ecosystem at the project level. If we have packages that have very few classes, nested function calls, LRU caches and tons of (cy)func|itertool(s|z), it feels smelly.

As programmers we need to make a conscious effort to expand our code olfaction to include as wide a bouquet as possible. When we see code that seems stinky, rather than assuming it is poorly designed or dangerous, take a big whiff and see what you find out. Make an effort to understand it and appreciate its fragrance. Over time, you’ll eventually understand the difference between pungent code, that is powerful and efficient, vs. stinky code that is actually rancid and should be thrown out.

Mon, 22 Dec 2014 00:00:00 +0000 <![CDATA[Flask vs. CherryPy]]> Flask vs. CherryPy

I’ve always been a fan of CherryPy. It seems to avoid making decisions for you that will matter over time, and it helps you make good decisions where it matters most. CherryPy is simple, pragmatic, stable, fast enough, and plays nice with other processes. As much as I appreciate what CherryPy offers, there is unfortunately not a lot of mindshare in the greater Python community. I suspect the reason CherryPy is not seen as a hip framework is that most users of CherryPy happily work around the rough edges and get work done rather than make an effort to market their framework of choice. Tragic.

While there are a lot of microframeworks out there, Flask seems to be the most popular. I don’t say this with any sort of scientific accuracy, just a gut feeling. So, when I set out to write a different kind of process manager, it seemed like a good time to see how other microframeworks work.

The best thing I can say about Flask is the community of projects. After having worked on a Django project, I appreciate the admin interface and how easy it is to get 80% there. Flask is surprisingly similar in that searching google for “flask + something” quickly provides some options to implement something you want. Also, as Flask generally tries to avoid being too specific, the plugins (called Blueprints... I think) seem to provide basic tools with the opportunity to customize as necessary. Flask-Admin is extremely helpful along side Flask-SQLAlchemy.

Unfortunately, while this wealth of community packages is excellent, Flask falls short when it comes to actual development. Its lack of organization in terms of dispatching makes organizing code feel very haphazard. It is easy to create circular dependencies due to the use of imports for establishing what code gets called. In essence, Flask forces you to build patterns that are application specific rather than prescribing models that make sense generally.

While a lack of direction can make the organization of the code less obvious, it does allow you to easily hook applications together. The Blueprint model, from what I can tell, makes it reasonably easy to compose applications within a site.

Another difficulty with Flask is configuration. Since you are using the import mechanism to configure your app, your configuration must also be semi-available at import time. Where this makes things slightly difficult is when you are creating an app that starts a web server (as opposed to an app that runs a web service). It is kind of tricky to create myapp --config because by the time you’ve started the app, you’ve already imported your application and set up some config. Not a huge issue, but it can be kludgy.

This model is where CherryPy excels. It allows you to create a standalone process that acts as a server. It provides a robust configuration mechanism that allows turning process and request level features on or off, and it allows configuration per URL as well. The result is that if you’re writing a daemon or some single app you want to run as a command, CherryPy makes this exceptionally easy and clear.

CherryPy also helps you stay a bit more organized in the framework. It provides some helpful dispatcher patterns that support a wide array of functionality and provide some more obvious patterns for organizing code. It is not a panacea. There are patterns that take some getting used to. But, once you understand these patterns, it becomes a powerful model to code in.

Fortunately, if you do want to use Flask as a framework and CherryPy as a process runner / server, it couldn’t be easier. It is trivial to run a Flask app with CherryPy, getting the best of both worlds in some ways.

While I wish CherryPy had more mindshare, I’m willing to face facts that Flask might have “won” the microframework war. With that said, I think there are valuable lessons to learn from CherryPy that could be implemented for Flask. I’d personally love to see the process bus model made available and a production web server included. Until then though, I’m happy to use CherryPy for its server and continue to enjoy the functionality graciously provided by the Flask community.

Fri, 19 Dec 2014 00:00:00 +0000 <![CDATA[Thinking About ETLs]]> Thinking About ETLs

My primary focus for the last year or so has been writing ETLs at work. It is an interesting problem because on some level it feels extremely easy, while in reality, it is a problem that is very difficult to abstract.


The essence of an ETL, beyond the obvious “extract, transform, load”, is the query. In the case of a database, the query is typically the SELECT statement, but it usually is more than that. It often includes the format of the results. You might need to chunk the data using multiple queries. There might be columns you skip or columns you create.

In non-database ETLs, it still ends up being very similar to a query. You often still need to find boundaries for what you are extracting. For example, if you had a bunch of date stamped log files, doing a find /var/logs -name '2014*.log.gz' could still be considered a query.

A query is important because ETLs are inherently fragile. ETLs are required because the standard interface to some data is not available due to some constraints. By bypassing standard, and more importantly supported, interfaces, you are on your own when it comes to ensuring the ETL runs. The database dump you are running might timeout. The machine you are reading files from may reboot. The REST API node you are hitting gets a new version and restarts. There are always good reasons for your ETL process to fail. The query makes it possible to go back and try things again, limiting them to the specific subset of data you are missing.


ETLs are often considered part of some analytics pipeline. The goal of an ETL is typically to take data from some system and transform it into a format that can be loaded into another system for analysis. A better principle is to store the intermediaries such that the transformation targets a specific generalized format, rather than a specific system such as a database.

This is much harder than it sounds.

The key to providing generic access to data is a standard schema for the data. The “shape” of the data needs to be described in a fashion that is actionable by the transformation process that loads the data into the analytics system.

The schema is more than a type system. Some data is heavy with metadata while other data is extremely consistent. The schema should provide notation for both extremes.

The schema also should provide hints on how to convert the data. The most important aspect of the schema is to communicate to the loading system how to transform and / or import the data. One system might happily accept a string with 2014-02-15 as a date if you specify it is a date, while others may need something more explicit. The schema should communicate that the data is date string with a specific format that the loading system can use accordingly.

The schema can be difficult to create. Metadata might require a suite of queries to other systems in order to fill in the data. There might be calculations that need to happen that the querying system doesn’t support. In these cases you are not just transforming the data, but processing it.

I admit I just made an arbitrary distinction and definition of “processing”, so let me explain.

Processing Data

In a transformation you take the data you have and change it. If I have a URL, I might transform it into JSON that looks like {'url': $URL}. Processing, on the other hand, uses the data to create new data. For example, if I have a RESTful resource, I might crawl it to create a single view of some tree of objects. The important difference is that we are creating new information by using other resources not found in the original query data.

The processing of data can be expensive. You might have to make many requests for every row of output in a database table. The calculations, while small, might be on a huge dataset. Whatever processing needs to happen to get your data to a generically usable state, it is a difficult problem to abstract over a wide breadth of data.

While there is no silver bullet to processing data, there are tactics that can be used to process data reliably and reasonably fast. The key to abstracting processing is defining the unit of work.

A Unit of Work

“Unit of Work” is probably a loaded term, so once again, I’ll define what I mean here.

When processing data in an ETL, the Unit of Work is the combination of:

  • an atomic record
  • an atomic algorithm
  • the ability to run the implementation

If all this sounds very map/reducey it is because it is! The difference is that in an ETL you don’t have the same reliability you’d have with something like Hadoop. There is no magical distributed file system that has your data ready to go on a cluster designed to run code explicitly written to support your map/reduce platform.

The key difference with processing data in ETLs vs. some system like Hadoop is the implementation and execution of the algorithm. The implementation includes:

  • some command to run on the atomic record
  • the information necessary to setup an environment for that script to run
  • an automated way to input the atomic record to the command
  • a guarantee of reliable execution (or failure)

If we look at a system like Hadoop, and this applies to most map/reduce platforms that I’ve seen, there is an explicit step that takes data from some system and adds it to the HDFS store. There is another step that installs code, specifically written for Hadoop, onto the cluster. This code could be using Hadoop streaming or actual Java, but in either case, the installation is done via some deployment.

In other words, there is an unstated step that Extracts data from some system, Transforms it for Hadoop and Loads it into HDFS. The processing in this case is getting the data from whatever the source system is into the analytics system; therefore, the requirements are slightly different.

We start off with a command. The command is simply an executable script like you would see in Hadoop streaming. No real difference here. Each line passed to the command contains the atomic record as usual.

Before we can run that command, we need to have an environment configured. In Hadoop, you’ve configured your cluster and deployed your code to the nodes. In an ETL system, due to the fragility and simpler processing requirements (no one should write a SQL-like system on top of an ETL framework), we want to set up an environment every time the command runs. Setting up this environment on every run allows a clear path for development of your ETL steps. Making the environment creation part of the development process means the deployment is tested alongside the actual command(s) your ETL uses.

Once we have the command and an environment to run it in, we need a way to get our atomic record to the command for actual processing. In Hadoop streaming, we use everyone’s favorite file handle, stdin. In an ETL system, while the command may still use stdin, the way the data enters the ETL system doesn’t necessarily have a distributed file system to use. Data might be downloaded from S3, some RESTful service, and / or some queue system. It is important that you have a clear automated way to get data to an ETL processing node.
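Whatever the transport, the hand-off to the command can be as simple as piping records over stdin, in the spirit of Hadoop streaming. This run_step helper is a sketch, not part of any real framework:

```python
import subprocess

def run_step(records, command):
    """Stream records to a worker command over stdin, one per line.

    `command` is an argv list for the step's executable; the step's
    stdout is returned so it can feed the next step.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate("\n".join(records).encode("utf-8"))
    if proc.returncode != 0:
        raise RuntimeError("step failed: %r" % (command,))
    return out.decode("utf-8")
```

A worker node that downloaded its chunk from S3 could call `run_step(lines, ["./process_rows"])` without caring where the lines came from.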

Finally, this processing must be reliable. ETLs are low priority. An ETL should not lock your production database for an hour in order to dump the data. Instead ETLs must quietly grab the data in a way that doesn’t add contention to the running systems. After all, you are extracting the data because a query on the production server will bog it down when it needs to be serving real time requests. An ETL system needs to reliably stop and start as necessary to get the data necessary and avoid adding more contention to an already resource intensive service.

Loading Data

Loading data from an ETL system requires analyzing the schema in order to construct the understanding between the analytics system and the data. To make this as flexible as possible, it is important that the schema use the source of data to add as much metadata as possible. If the data is pulled from a Postgres table, the schema should ideally include most of the table’s schema information. If that data must be loaded into some other RDBMS, you then have all you need to safely read the data into the system.

Development and Maintenance

ETLs are always going to be changing. New analytics systems will be used and new sources of data will be created. As the source system constraints change, so do the constraints of an ETL system, again, with the ETL system being the lowest priority.

Since we can rely on ETLs changing and breaking, it is critical to raise awareness of maintenance within the system.

The key to creating a maintainable system is to build up from small tools. The reason is that as you create small abstractions at a low level, you can reuse them easily. The trade-off is that in the short term, more code is needed to accomplish common tasks. Over time, you find patterns specific to your organization’s requirements that allow repetitive tasks to be abstracted into tools.

The converse to building up an ETL system based on small tools is to use a pre-built execution system. Unfortunately, pre-built ETL systems have been generalized for common tasks. As we’ve said earlier, ETLs are often changing and require more attention than a typical distributed system. The result is that using a pre-built ETL environment often means creating ETLs that allow the pre-built ETL system to do its work!


Our goal for our ETLs is to make them extremely easy to test. There are many facets to testing ETLs such as unit testing within an actual package. The testing that is most critical for development and maintenance is simply being able to quickly run and test a single step of an ETL.

For example, let’s say we have an ETL that dumps a table, reformats some rows and creates a 10GB gzipped CSV file. I only mention the size as it implies the dataset is too large to run over entirely every time while testing. The file will then be uploaded to S3 and a central data warehouse system will be notified. Here are some steps that the ETL might perform:

  1. Dump the table
  2. Create a schema
  3. Process the rows
  4. Gzip the output
  5. Upload the data
  6. Update the warehouse

Each of these steps should be runnable:

  • locally on a fake or testing database
  • locally, using a production database
  • remotely using a production database and testing system (test bucket and test warehouse)
  • remotely using the production database and production systems

By “runnable”, I mean that an ETL developer can run a command with a specific config and watch the output for issues.

These steps are all pretty basic, but the goal with an ETL system is to abstract the pieces that can be used across all ETLs in a way that is optimal for your system. For example, if your system is consistently streaming, your ETL framework might allow you to chain file handles together:

$ dump table | process rows | gzip | upload

Another option might be that each step produces a file that is used by the next step.

Both tactics are valid and can be optimized over time to help distill ETLs to the minimal, changing requirements. In the above example, the database table dump could be abstracted to take the schema and some database settings and dump any table in your databases. The gzip, upload and data warehouse interactions can be broken out into a library and/or command line apps. Each of these optimizations is simple enough to be included in an ETL development framework without forcing a user to jump through a ton of hoops when a new data store needs to be considered.
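The streaming variant of those chained steps can be sketched with generators, where each step consumes and produces lines like a file handle. The step bodies here are stand-ins for real work:

```python
import gzip
import io

def process_rows(rows):
    # Stand-in for real row reformatting.
    for row in rows:
        yield row.strip().upper()

def gzip_lines(lines):
    # Collect the processed lines into an in-memory gzipped blob,
    # the way a real step would write a .gz file.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as fh:
        for line in lines:
            fh.write((line + "\n").encode("utf-8"))
    return buf.getvalue()

def run_pipeline(dumped_rows):
    # Each step's output feeds the next, just like a shell pipe.
    return gzip_lines(process_rows(dumped_rows))
```

Because each step only touches one row at a time, the pipeline never needs the whole 10GB in memory.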

An ETL Framework

Making it easy to develop ETLs means a framework. We want to create a Ruby on Rails for writing ETLs that makes it easy enough to get the easy stuff done and powerful enough to deal with the corner cases. The framework revolves around the schema and the APIs to the different systems and libraries that provide language specific APIs.

At some level the framework needs to allow the introduction of other languages. My only suggestion here is that other languages be abstracted through a command line layer. The ETL framework can eventually call a command that could be written in whatever language the developer wants to use. ETLs are typically used to export data to a system whose users are reasonably technical. Someone using this data most likely has some knowledge of a language such as R, Julia or maybe JavaScript. It is these technically savvy data wranglers we want to empower with the ETL framework, in order to let them solve small ETL issues themselves and provide reliability where the system can be flaky.

Open Questions

The system I’ve described is what I’m working on. While I’m confident the design goals are reasonable, the implementation is going to be difficult. Specifically, the task of generically supporting many languages is challenging because each language has its own ecosystem and environment. Python is an easy language for this task because it is trivial to connect to an Ubuntu host and have a good deal of the ecosystem in place. Other languages, such as R, probably require some coordination with the cluster provisioning system to make sure base requirements are available. That said, it is unclear if other languages provide small environments like virtualenvs do. Obviously typical scripting languages like Ruby and JavaScript have support for an application-local environment, but I’m doubtful R or Julia have the same facilities.

Another option would be to use a formal build / deployment pattern where a container is built. This answers many of the platform questions, but it brings up other questions such as how to make this available in the ETL Framework. It is ideal if an ETL author can simply call a command to test. If the author needs to build a container locally then I suspect that might be too large a requirement as each platform is going to be different. Obviously, we could introduce a build host to handle the build steps, but that makes it much harder for someone to feel confident the script they wrote will run in production.

The challenge comes from our hope to empower semi-technical ETL authors. If we compare this goal to people who can write HTML/CSS vs. programmers, it clarifies the requirements. A user learning to write HTML/CSS only has to open the file in a web browser to test it. If the page looks correct, they can be confident it will work when they deploy it. The goal with the ETL framework and APIs is that the system can provide a similar work flow and ease of use.

Wrapping Up

I’ve written a LOT of ETL code over the past year. Much of what I propose above reflects my experiences. It also reflects the server environment in which these ETLs run as well as the organizational environment. ETLs are low priority code, by nature, that can be used to build first class products. Systems that require a lot of sysadmin time, server resources or have too specific an API may still be helpful moving data around, but they will fall short as systems evolve. My goal has been to create a system that evolves with the data in the organization and empowers a large number of users to distribute the task of developing and maintaining ETLs.

Wed, 17 Dec 2014 00:00:00 +0000 <![CDATA[Dadd, ErrorEmail and CacheControl Releases]]> Dadd, ErrorEmail and CacheControl Releases

I’ve written a couple new bits of code that seemed like they could be helpful to others.


Dadd (pronounced Daddy) is a tool to help administer daemons.

Most deployment systems are based on the idea of long running processes. You want to release a new version of some service. You build a package, upload it somewhere and tell your package manager to grab it. Then you tell your process manager to restart it to get the new code.

Dadd works differently. Dadd lets you define a short spec that includes the process you want to run. A dadd worker then will use that spec to download any necessary files, create a temporary directory to run in and start the process. When the process ends, assuming everything went well, it will clean up the temp directory. If there was an error, it will upload the logs to the master and send an email.

Where this sort of system comes in handy is when you have scripts that take a while to run and that shouldn’t be killed when new code is released. For example, at work I manage a ton of ETL processes to get our data into a data warehouse we’ve written. These ETL processes are triggered with Celery tasks, but they typically will ssh into a specific host, create a virtualenv, install some dependencies, and copy files before running a daemon and disconnecting. Dadd makes this kind of processing more automatic in that it can run these processes on any host in our cluster. Also, because the dadd worker builds the environment, we can run a custom script without having to go through the process of a release. This is extremely helpful for running backfills or custom updates to migrate old data.

I have some ideas for Dadd such as incorporating a more involved build system and possibly using lxc containers to run the code. Another inspiration for Dadd is setting up nodes in a cluster. Often it would be really easy to just install a couple of Python packages, but most solutions are either too manual or require a specific image to use tools like chef, puppet, etc. With Dadd, you could pretty easily write a script to install and run it on a node and then let it do the rest regarding setting up an environment and running some code.

But, for the moment, if you have code you run by copying some files, Dadd works really well.


ErrorEmail was written specifically for Dadd. When you have a script to run and you want a nice traceback email when things fail, give ErrorEmail a try. It doesn’t do any sort of rate limiting and the server config is extremely basic, but sometimes you don’t want to install a bunch of packages just to send an email on an error.

When you can’t install django or some other framework for an application, you can still get nice error emails with ErrorEmail.


The CacheControl 0.10.6 release includes support for calling close on the cache implementation. This is helpful when you are using a cache via some client (ie Redis) and that client needs to safely close the connection.

Sat, 08 Nov 2014 00:00:00 +0000 <![CDATA[Ugly Attributes]]> Ugly Attributes

At some point in my programming career I recognized that Object Oriented Programming is not all it’s cracked up to be. It can be a powerful tool, especially in a statically typed language, but in the grand scheme of managing complexity, it often falls short of the design ideals that we were taught in school. One area where this becomes evident is object attributes.

Attributes are just variables that are “attached” to an object. This simplicity, unfortunately, makes attributes require a good deal more complexity to manage in a system. The reason is that languages do not provide any tools to respect the perceived boundaries that an attribute appears to provide.

Let’s look at a simple example.

class Person(object):

    def __init__(self, age):
        self.age = age

We have a simple Person object. We want to be able to access the person’s age by way of an attribute. The first change we’ll want to make is to make this attribute a property.

from datetime import datetime

class Person(object):
    def __init__(self, year, month, day):
        self.year = year
        self.month = month
        self.day = day

    @property
    def age(self):
        age = - datetime(self.year, self.month,
        return age.days / 365

So far, this feels pretty simple. But let’s get a little more realistic and presume that this Person is not a naive object but one that talks to a RESTful service in order to get its values.

A Quick Side Note

Most of the time you’d see a database and an ORM for this sort of code. If you are using Django or SQLAlchemy (and I’m sure other ORMs are the same) you’d see something like:

user = User.query.get(id)

You might have a nifty function on your model that calculates the age. That is, until you realize you stored your data in a non-timezone-aware date field, and now that your company has started supporting Europe, some folks are complaining that they are turning 30 a day earlier than they expected...

The point is that ORMs do an interesting thing that is your only logical choice if you want to ensure your attribute access is consistent with the database: ORMs MUST create new instances for each query and provide a sync method or function to ensure they are updated. Sure, they might have an eagercommit mode or something, but Stack Overflow will most likely provide plenty of examples where this falls down.

I’d like to keep this reality in mind moving forward as it presents a fact of life when working with objects that is important to understand as your program gets more complex.

Back to Our Person

So, we want to make this Person object use a RESTful service as our database. Let’s change how we load the data.

class Person(ServiceModel):
    # We inherit from some ServiceModel that has the machinery to
    # grab our data from our service.

    @classmethod
    def by_id(cls, id):
        doc = conn.get('people', id=id).pop()
        return cls(**doc)

    def age(self):
        age = - datetime(self.year, self.month,
        return age.days / 365

    # This would probably be implemented in the ServiceModel, but
    # I'll add it here for clarity.
    def __getattr__(self, name):
        if name in self.doc:
            return self.doc[name]
        raise AttributeError('%s is not in the resource.' % name)

Now assuming we get a document that has a year, month, day, our age function would still work.

So far, this all feels pretty reasonable. But what happens when things change? Fortunately in the age use case, people rarely change their birth date. But, unfortunately, we do have pesky time zones that we didn’t want to think about when we had 100 users and everyone lived on the west coast. The “least viable product” typically doesn’t afford thinking ahead that far, so these are issues you’ll need to deal with after you have a lot of code.

Also, the whole point of all this work has been to support an attribute on an object. We haven’t sped anything up. These are not new features. We haven’t even done anything clever with metaclasses or generators! The reality is that you’ve refactored your code four or five times to support a single call in a template.

{{ person.age }}

Let’s take a step back for a bit.

Taking a Step Back

Do not feel guilty for going down this rabbit hole. I’ve taken the trip hundreds of times! But maybe it is time to reconsider how we think about object oriented design.

When we think back to when we were just learning OO, there was a zoo. In this zoo we had the mythical Animal class. We’d have new animals show up at the zoo. We’d get a Lion, a Tiger and a Bear, and they would all need to eat. This modeling feels so right it can’t be wrong! And in many ways it isn’t.

If we take a step back, there might be a better way.

Let’s first acknowledge that our Animal does need to eat. But let’s really think about what that means to our zoo. The Animals will eat, but so will the Visitors. I’m sure the Employees would like to have some food now and then as well. The reason we want to know about all this sustenance is because we need to Order food and track its cost. If we reconsider this in the code, what if, and this is a big what if, we didn’t make eat a method on some class? What if we passed our object to our eat method?
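One way to sketch that idea, with an invented per-pound price for illustration:

```python
FOOD_COST_PER_LB = 2.5  # invented price, purely for illustration

def eat(eater):
    """Feed anything with a `ration` attribute; return the cost to the zoo."""
    return eater.ration * FOOD_COST_PER_LB
```

Nothing about `eat` cares whether the eater is an Animal, a Visitor or an Employee.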


While that looks cannibalistic at first, we can reconsider our original age method as well.


And how about our Animals?
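They can become plain data as well. Redefining the pieces here so the sketch stands on its own, with the same invented price:

```python
from collections import namedtuple

FOOD_COST_PER_LB = 2.5  # invented price, purely for illustration

# Animals and visitors alike are just data; no Animal base class needed.
Lion = namedtuple("Lion", ["name", "ration"])
Visitor = namedtuple("Visitor", ["name", "ration"])

def eat(eater):
    """The same function feeds anything with a ration."""
    return eater.ration * FOOD_COST_PER_LB

daily_cost = eat(Lion("Leo", 12)) + eat(Visitor("Ann", 1))
```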


Looking back at our issues with time zones, because our zoo has grown and people come from all over the world, we can even update our code without much trouble.


Assuming we’re using imports, here is a more realistic refactoring.

from myapp.time import age


Rather than rewriting all our age calls for timezone awareness, we can change our myapp/

def age(obj):
    age = - adjust_for_timezones(obj.birthday())
    return age.days / 365

In this idealized world, we haven’t thrown out objects completely. We’ve simply adjusted how we use them. Our age depends on a birthday method. This might be a Mixin class we use with our Models. We also could still have our classic Animal base class. Age might even be relative where you’d want to know how old an Animal is in “person years”. We might create a time.animal.age function that has slightly different requirements.

In any case, by reconsidering our object oriented design, we can remove quite a bit of code related to ugly attributes.

The Real World Conclusions

While it might seem obvious now how to implement a system using these ideas, it requires a different set of skills. Naming things is one of the two hard things in computer science. In dynamic languages, we don’t have obvious design patterns for grouping functions in a way that makes the expectations clear. Our age function above would likely need some check to ensure that the object has a birthday method. You wouldn’t want every age call to be wrapped in a try/except.

You also wouldn’t want to be too limiting on type, especially in a dynamic language like Python (or Ruby, JavaScript, etc.). Even though there has been some rumbling for type hints in Python that seems reasonable, right now you have to decide how to communicate that some function foo expects an object of type Bar, or one with a method baz. These are trivial problems at a technical level, but socially, they are difficult to enforce without formal language support.

There are also some technical issues to consider. In Python, function calls can be expensive. Each call creates its own stack frame, so many small nested functions, while well designed, can become slow. There are tools to help with this, but again, it is difficult to make this style obvious over time.

There is never a panacea, but it seems that there is still room for OO design to grow and change. Functional programming, while elegant, is pretty tough to grok, especially when you have dynamic language code sitting in your editor, allowing you to mutate everything under the sun. Still, there are some powerful themes in Functional Programming that can make your Object Oriented code more helpful in managing complexity.


Programming is really about layering complexity. It is taking concepts and modeling them in a language that computers can take and, eventually, consider in terms of voltage. As we model our systems we need to consider the data vs. the functionality, which means avoiding ugly attributes (and methods) in favor of orthogonal functionality that respects the design inherent in the objects.

It is not easy by any stretch, but I believe by adopting the techniques mentioned above, we can move past the kludgy parts of OO (and functional programming) into better designed and more maintainable software.

Mon, 03 Nov 2014 00:00:00 +0000 <![CDATA[Functional Programming in Python]]> Functional Programming in Python

While Python doesn’t natively support some essential traits of an actual functional programming language, it is possible to use a functional style (rather than object oriented) to write programs. What makes it hard is that some of the constraints functional programming requires must be enforced manually.

First off, lets talk about what Python does well that makes functional programming possible.

Python has first class functions that allow passing a function around the same way that you’d pass around a normal variable. First class functions make it possible to do things like currying and complex list processing. Fortunately, the standard library provides the excellent functools library. For example:

>>> from functools import partial
>>> def add(x, y): return x + y
>>> add_five = partial(add, 5)
>>> map(add_five, [10, 20, 30])
[15, 25, 35]

The next critical functional tool that Python provides is iteration. More specifically, Python generators provide a tool to process data lazily. Generators allow you to create functions that produce data on demand rather than forcing the creation of an entire set. Again, the standard library provides some helpful tools via the itertools library.

>>> from itertools import count, imap, islice
>>> nums = islice(imap(add_five, count(10, 10)), 0, 3)
>>> nums
<itertools.islice object at 0xb7cf6dc4>

In this example each of the functions only calculates and returns a value when it is required.

Python also has other functional concepts built in, such as list comprehensions and decorators, that, when used with first class functions and generators, make programming in a functional style feasible.

Where Python does not make functional programming easy is in dealing with immutable data. In Python, critical core datatypes such as lists and dicts are mutable. In functional languages, all variables are immutable. The result is you often create values based on some initial immutable variable that then has functions applied to it.

(defn add-markup [price]
  (+ price (* .25 price)))

(defn add-tax [total]
  (+ total (* .087 total)))

(defn get-total [initial-price]
  (add-tax (add-markup initial-price)))

In each of the steps above, the argument is passed in by value and can’t be changed. When you need the total described by get-total, rather than storing it in a variable, you’d often simply call the get-total function again. Typically a functional language will optimize these calls. In Python we can mimic this by memoizing the result.

import functools
import operator

def memoize(f):
    cache = {}
    @functools.wraps(f)
    def wrapper(*args, **kw):
        # sorted() returns a list, which isn't hashable, so use a tuple
        key = (args, tuple(sorted(kw.iteritems())))
        if key not in cache:
            cache[key] = f(*args, **kw)
        return cache[key]
    return wrapper

@memoize
def factorial(num):
    return reduce(operator.mul, range(1, num + 1))
Now, calls to the function will re-use previous results without having to execute the body of the function.

Another pattern seen in functional languages such as LISP is to re-use a core data type, such as a list, as a richer object. For example, association lists act like dictionaries, but they are essentially still just lists with functions to access them as a dictionary, so you can look up arbitrary keys. In other functional languages such as Haskell or Clojure, you create actual types, similar to a struct, to communicate more complex type information.

Obviously in Python we have actual objects. The problem is that objects are mutable. In order to make sure we’re using immutable types we can use Python’s immutable data type, the tuple. What’s more, we can replicate richer types by using a named tuple.

from collections import namedtuple

User = namedtuple('User', ['name', 'email', 'password'])

def update_password(user, new_password):
    return User(,, new_password)

I’ve found that using named tuples often helps close the mental gap of going from object oriented to a functional style.

While Python is most definitely not a functional language, it has many tools that make using a functional paradigm possible. Functional programming can be a powerful model to consider as there are a whole class of bugs that disappear in an immutable world. Functional programming is also a great change of pace from the typical object oriented patterns you might be used to. Even if you don’t refactor all your code to a functional style, there is much to learn, and fortunately, Python makes it easy to get started in a language you are familiar with.

Sat, 25 Oct 2014 00:00:00 +0000 <![CDATA[Parallel Processing]]> Parallel Processing

It can be really hard to work with data programmatically. There is some moment when working with a large dataset where you realize you need to process the data in parallel. As a programmer, this sounds like it could be a fun problem, and in many cases it is fun to get all your cores working hard crunching data.

The problem is parallel processing never is purely a matter of distributing your work across CPUs. The hard part ends up being getting the data organized before sending it to your workers and doing something with the results. Tools like Hadoop boast processing terabytes of data, but it’s a little misleading because there is most likely a ton of code on either end of that processing.

The input and output code (I/O) can also have a big impact on the processing itself. The input often needs to consider what the atomic unit is as well as what the “chunk” of data needs to be. For example, if you have 10 million tiny messages to process, you probably want to batch them into chunks of 5000 messages when sending them to your worker nodes, and the workers will need to know they are getting a chunk of messages vs. a single message. Similarly, for some applications the message-to-chunk ratio needs to be tweaked.
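A chunking helper for that kind of batching might look like this; the batch size is illustrative:

```python
def chunks(messages, size=5000):
    """Yield lists of up to `size` messages for the worker nodes."""
    batch = []
    for msg in messages:
        batch.append(msg)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # don't drop a short final batch
        yield batch
```

Because it is a generator, the full set of messages never has to sit in memory at once.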

In Hadoop this sort of detail can be dealt with via HDFS, but Hadoop is not trivial to set up, not to mention that you may have a bunch of data that doesn’t live in HDFS. The same goes for the output. When you are done, where does it go?

The point is that “data” always tends towards specificity. You can’t abstract away data. Data always ends up being physical at its core. Even if the processing happens in parallel, the I/O will always be a challenging constraint.

Mon, 21 Jul 2014 00:00:00 +0000 <![CDATA[View Source]]> View Source

I watched a bit of this fireside chat with Steve Jobs. It was pretty interesting to hear Steve discuss the availability of the network and how it changes the way we can work. Specifically, he mentioned that because of NFS (presumably in the BSD family of unices), he could share his home directory on every computer he works on without ever having to think about back ups or syncing his work.

What occurred to me was how much of the software we use is taken for granted. Back in the day, an educational license for Unix was around $1800! I can only imagine the difficulties of becoming a software developer back then, when all the interesting tools like databases or servers were prohibitively expensive!

It reminds me of when I first started learning about HTML and web development. I could always view the source to see what was happening. It became an essential part of how I saw the web and programming in general. The value of software was not only in its function, but in its transparency. The ability to read the source and support myself as a user allowed me the opportunity to understand why the software was so valuable.

When I think about how difficult it must have been to become a hacker back in the early days of personal computing, it is no wonder that free software and open source became so prevalent. These early pioneers had to learn the systems without reading the source! Learning meant reading through incomplete, poorly written manuals. When the manual was wrong or out of date, I can only imagine the hair pulling that must have occurred. The natural solution to this problem was to make the source available.

The process of programming is still very new and very detailed, while still being extremely generic. We are fortunate as programmers that the computing landscape was not able to enclose software development within proprietary walls like so many other technical fields. I’m very thankful I can view the source!

Tue, 15 Jul 2014 00:00:00 +0000 <![CDATA[Property Pattern]]> Property Pattern

I’ve found myself doing this quite a bit lately and thought it might be helpful to others.

Often times when I’m writing some code, I want to access something as an attribute, even though it comes from some service or database. For example, say we want to download a bunch of files from some service and store them on our file system for processing.

Here is what we’d like the processing code to look like:

def process_files(self):
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

We don’t really care what the filter_and_store method does. What we do care about is the downloaded_files attribute.

Let’s step back and see what the calling code might look like:

processor = MyServiceProcessor(conn)
processor.process_files()

Again, this is pretty simple, but now we have a problem: when do we actually download the files and store them on the filesystem? One option would be to do something like this in our process_files method:

def process_files(self):
    self.downloaded_files = self.download_files()
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

While it may not seem like a big deal, we just created a side effect. The downloaded_files attribute is getting set in the process_files method. There is a good chance the downloaded_files attribute is something you’d want to reuse. This creates an odd coupling between the process_files method and the downloaded_files attribute.

Another option would be to do something like this in the constructor:

def __init__(self, conn):
    self.conn = conn
    self.downloaded_files = self.download_files()

Obviously, this is a bad idea. Anytime you instantiate the object it will seemingly try to reach out across some network and download a bunch of files. We can do better!

Here are some goals:

  1. keep the API simple by using a simple attribute, downloaded_files
  2. don’t download anything until it is required
  3. only download the files once per-object
  4. allow injecting downloaded values for tests

The way I’ve been solving this recently has been to use the following property pattern:

class MyServiceProcessor(object):

    def __init__(self, conn):
        self.conn = conn
        self._downloaded_files = None

    @property
    def downloaded_files(self):
        if not self._downloaded_files:
            self._downloaded_files = []
            tmpdir = tempfile.mkdtemp()
            for obj in self.conn.resources():
                # download each resource into the temp dir
                self._downloaded_files.append(obj.download(tmpdir))
        return self._downloaded_files

    def process_files(self):
        result = []
        for fn in self.downloaded_files:
            result.append(self.filter_and_store(fn))
        return result

Say we wanted to test our process_files method. It becomes much easier.

def setup(self):
    self.test_files = os.listdir(os.path.join(HERE, 'service_files'))
    self.conn = Mock()
    self.processor = MyServiceProcessor(self.conn)

def test_process_files(self):
    # Just set the property variable to inject the values.
    self.processor._downloaded_files = self.test_files

    assert len(self.processor.process_files()) == len(self.test_files)

As you can see, it was really easy to inject our stub files. We know that we don’t perform any downloads until we have to. We also know that the downloads are only performed once.

Here is another variation I’ve used that doesn’t require setting up _downloaded_files in the constructor.

@property
def downloaded_files(self):
    if not hasattr(self, '_downloaded_files'):
        # same download logic as above
        self._downloaded_files = self.download_files()
    return self._downloaded_files

Generally, I prefer the explicit _downloaded_files attribute in the constructor as it allows more granularity when setting a default value. You can set it as an empty list for example, which helps to communicate that the property will need to return a list.

Similarly, you can set the value to None and ensure that when the attribute is accessed, the value may become an empty list. This small differentiation helps to make the API easier to use. An empty list is still iterable while still being “falsey”.

There is nothing technically interesting about this technique. What I hope someone takes from this is how you can use it to write clearer code and encapsulate your implementation, while exposing a clear API between your objects. Even if you don’t publish a library, keeping your internal object APIs simple and communicative helps make your code easier to reason about.

One caveat is that this method can add a lot of small property methods to your classes. There is nothing wrong with this, but it might give a reader of your code the impression the classes are complex. One method to combat this is to use mixins.

class MyWorkerMixinProperties(object):

    def __init__(self, conn):
        self.conn = conn
        self._categories = None
        self._foo_resources = None
        self._names = None

    @property
    def categories(self):
        if not self._categories:
            self._categories = self.conn.categories()
        return self._categories

    @property
    def foo_resources(self):
        if not self._foo_resources:
            self._foo_resources = self.conn.resources(name='foo')
        return self._foo_resources

    @property
    def names(self):
        if not self._names:
            self._names = [r.meta()['name'] for r in self.foo_resources]
        return self._names

class MyWorker(MyWorkerMixinProperties):

    def __init__(self, conn):
        MyWorkerMixinProperties.__init__(self, conn)

    def run(self):
        for resource in self.foo_resources:
            if resource.category in self.categories:
                self.put('/api/foos', {
                    'real_name': self.names[resource.name_id],
                    'values': self.process_values(resource.values),
                })
This is a somewhat contrived example, but the point is that we’ve taken all our service-based data and made it accessible via normal attributes. Each service request is encapsulated in a function, while our primary worker class has a reasonably straightforward implementation of some algorithm.

The big win here is clarity. You can write an algorithm by describing what it should do. You can then test the algorithm easily by injecting the values you know should produce the expected results. Furthermore, you’ve decoupled the algorithm from the I/O code, which is typically where you’ll see a good deal of repetition in the case of RESTful services or optimization when talking to databases. Lastly, it becomes trivial to inject values for testing.

Again, this isn’t rocket science. It is a really simple technique that can help make your code much clearer. I’ve found it really useful and I hope you do too!

Sat, 05 Jul 2014 00:00:00 +0000 <![CDATA[Iterative Code Cycle]]> Iterative Code Cycle

TDD prescribes a simple process for working on code.

  1. Write a failing test
  2. Write some code to get the test to pass
  3. Refactor
  4. Repeat

If we consider this cycle more generically, we see a typical cycle every modern software developer must use when writing code.

  1. Write some code
  2. Run the code
  3. Fix any problems
  4. Repeat

In this generic cycle you might use a REPL, a stand alone script, a debugger, etc. to quickly iterate on the code.

Personally, I’ve found that I do use a test for this iteration because it is integrated into my editor. The benefit of using my test suite is that I often have a repeatable test when I’m done that proves (to some level of confidence) the code works as I expect it to. It may not be entirely correct, but at least it codifies that I think it should work. When it does break, I can take a more TDD-like approach and fix the test, which makes it fail, and then fix the actual bug.

The essence then of any developer’s work is to make this cycle as quick as possible, no matter what tool you use to run and re-run your code. The process should be fluid and help get you in the flow when programming. If you do use tests for this process, it may be a helpful design tool. For example, if you are writing a client library for some service, you write an idealistic API you’d like to have without letting the implementation drive the design.

TDD has been on my mind recently as I’ve written a lot of code and have questioned whether or not my testing patterns have truly been helpful. They have been helpful in fixing bugs and provide a quick coding cycle. I’d argue the code has been improved, but at the same time, I do wonder if by making things testable I’ve introduced more abstractions than necessary. I’ve had to look back on some code that used these patterns, and getting up to speed was somewhat difficult. At the same time, anytime you read code you need to put in effort in order to understand what is happening. Often times I’ll assume that if code doesn’t immediately convey exactly what is happening, it is terrible code. The reality is code is complex and takes effort to understand. It should be judged based on how reasonable it is to fix once it is understood. In this way, I believe my test-based coding cycle has proven itself to be valuable.

Obviously, the next person to look at the code will disagree, but hopefully once they understand what is going on, it won’t be too bad.

Tue, 27 May 2014 00:00:00 +0000 <![CDATA[TDD]]> TDD

I watched DHH’s keynote at Railsconf 2014. A large part of his talk discusses the misassociation of TDD with metrics and making code “testable” rather than stepping back and focusing on clarity, as an author would when writing.

If you’ve ever tried to do true TDD, you might have a similar feeling that you’re doing it wrong. I know I have. Yet, I’ve also seen the benefit of iterating on code via writing tests. The faster the code / test cycle, the easier it is to experiment and write the code. Similarly, I’ve noticed more bugs show up in code that is not as well covered by tests. It might not be clear how DHH’s perspective then fits in with the benefits of testing and facets of TDD.

What I’ve found is that readability and clarity in code often come by way of being testable. Tests and making code testable can go a long way in finding the clarity that DHH describes. It can become clear very quickly, just by writing a test, that your class API is actually really difficult to use. You can easily spot odd dependencies in a class by the number of mocks you are required to deal with in your tests. Sometimes I find it easier to write a quick test rather than spin up a REPL to run and rerun code.

The point is that TDD can be a helpful tool to write clear code. As DHH points out, it is not a singular path to a well thought out design. Unfortunately, just as people take TDD too literally, people will feel that any sort of granular testing is a waste of time. The irony here is that DHH says very clearly that we, as software writers, need to practice. Writing tests and re-writing tests are a great way to become a better writer. Even if the ideals presented in TDD are a bit too extreme, the mechanism of a fast test suite and the goal of 100% coverage are still valuable in that they force you to think about and practice writing code.

The process of thinking about code is what is truly critical in almost all software development exercises. Writing tests first is just another way to slow you down and force you to think about your problem before hacking out some code. Some developers can avoid tests, most likely because they are really good about thinking about code before writing it. These people can likely iterate on ideas and concepts in their head before turning to the editor for the actual implementation. The rest of us can use the opportunity of writing tests, taking notes, and even drawing a diagram as tools to force us to think about our system before hacking some ugly code together.

Tue, 27 May 2014 00:00:00 +0000 <![CDATA[Concurrency Transitions]]> Concurrency Transitions

Glyph, the creator of Twisted, wrote an interesting article discussing the intrinsic flaws of using threads. The essential idea is that unless you know explicitly when you are switching contexts, it is extremely difficult to effectively reason about concurrency in code.

I agree that this is one way to handle concurrency. Glyph also provides a clear perspective into the underlying constraints of concurrent programming. The biggest constraint is that you need a way to guarantee a set of statements happens atomically. He suggests an event driven paradigm as how best to do this. In a typical async system, the work is built up using small procedures that run atomically, yielding back control to the main loop as they finish. The reason the async model works so well is because you eliminate all CPU based concurrency and allow work to happen while waiting for I/O.

There are other valid ways to achieve a similar effect. The key in all these methods, async included, is to know when you transition from atomic sequential operations to potentially concurrent, and often parallel, operations.

A great example of this mindset is found in functional programming, and specifically, in monads. A monad is essentially a guarantee that some set of operations will happen atomically. In a functional language, functions are considered “pure” meaning they don’t introduce any “side effects”, or more specifically, they do not change any state. Monads allow functional languages a way to interact with the outside world by providing a logical interface that the underlying system can use to do any necessary work to make the operation safe. Clojure, for example, uses a Software Transactional Memory system to safely apply changes to state. Another approach might be to use locking and mutexes. No matter the methodology, the goal is to provide a safe way to change state by allowing the developer an explicit way to identify portions of code that change external state.
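As a hedged sketch of this idea in Python (a plain lock standing in for real STM; the Ref class is invented for illustration), funneling every state change through one explicit interface might look like:

```python
import threading

class Ref(object):
    """A tiny lock-guarded reference, loosely in the spirit of Clojure's refs."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def alter(self, fn):
        # Apply a pure function to the current value atomically.
        with self._lock:
            self._value = fn(self._value)
            return self._value

    def deref(self):
        with self._lock:
            return self._value

counter = Ref(0)
counter.alter(lambda v: v + 1)
counter.alter(lambda v: v + 1)
print(counter.deref())  # 2
```

The pure work lives in the functions passed to alter; the only place state changes is inside the Ref, so the transition points are explicit.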

Here is a classic example in Python of where mutable state can cause problems.
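A minimal version of that classic pitfall is the mutable default argument (the function name here is illustrative):

```python
def add_item(item, items=[]):
    # The default list is created once, at function definition time,
    # so every call without an explicit list shares the same object.
    items.append(item)
    return items

first = add_item('a')
second = add_item('b')
print(second)  # ['a', 'b'] -- the default list persisted between calls
```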

In Python, and the vast majority of languages, it is assumed that a function can act on a variable from a larger scope. This is possible thanks to mutable data structures. In the example above, calling the function multiple times doesn’t re-initialize the argument to an empty list. It is a mutable data structure that exists as state. When the function is called, that state changes, and that change of state is considered a “side effect” in functional programming. This sort of issue is even more difficult in threaded programming because your state can cross threads in addition to lexical boundaries.

If we generalize the purpose of monads and Clojure’s reference types, we can establish that concurrent systems need to be able to manage the transitions between pure functionality (no state manipulation) and operations that affect state.

One methodology that I have found to be effective in managing this transition is to use queues. More generally, this might be called message passing, but I don’t believe message passing guarantees the system understands when state changes. In the case of a queue, you have an obvious entrance and exit point for the transition between purity and side effects to take place.

The way to implement this sort of system is to consider each consumer of a queue as a different process. By considering consumers / producers as processes, we ensure there is a clear boundary between them that protects shared memory, and more generally shared state. The queue then acts as a bridge to cross the “physical” border. The queue also provides control over the transition between pure functionality and side effects.

To relate this back to Glyph’s async perspective, when state is pushed onto the queue it is similar to yielding to the reactor in an async system. When state is popped off the queue into a process, it can be acted upon without worry of causing side effects that could affect other operations.

Glyph brought up the scenario where a function might yield multiple times in order to pass back control to the managing reactor. This becomes less necessary in the queue and process system I describe because there is no chance of a context switch interrupting an operation or slowing down the reactor. In a typical async framework, the job of the reactor is to order each bit of work. The work still happens in series. Therefore, if one operation takes a long time, it stops all other work from happening, assuming that work is not doing I/O. The queue and process system doesn’t have this same limitation, as it is able to yield control to the queue at the correct logical point in the algorithm. Also, in terms of Python, the GIL is mitigated by using processes. The result is that you can program in a sequential manner for your algorithms, while still tackling problems concurrently.

Like anything, this queue and process model is not a panacea. If your data is large, you often need to pass around references to the data and where it can be retrieved. If that resource is not something that handles concurrent connections, the file system for example, you still may run into concurrency issues accessing it. It also can be difficult to reason about failures in a queue-based system. How full is too full? You can limit the queue size, but that might cause blocking issues that may be unreasonable.

There is no silver bullet, but if you understand the significance of transitions between pure functionality and side effects, you have a good chance of producing a reasonable system no matter what concurrency model you use.

Fri, 28 Feb 2014 00:00:00 +0000 <![CDATA[A Sample CherryPy App Stub]]> A Sample CherryPy App Stub

In many full stack frameworks, there is a facility to create a new application via some command. In Django, for example, you use startproject foo. The startproject command will create some directories and files to help you get started.

CherryPy tries very hard to avoid making decisions for you. Instead CherryPy allows you to setup and configure the layout of your code however you wish. Unfortunately, if you are unfamiliar with CherryPy, it can feel a bit daunting setting up a new application.

Here is how I would set up a CherryPy application that is meant to serve a basic site with static resources and some handlers.

The File System

Here is what the file system looks like.

├── myproj
│   ├── __init__.py
│   ├── config.py *
│   ├── controllers.py
│   ├── models.py
│   ├── server.py
│   ├── static
│   ├── lib *
│   └── views
│       └── base.tmpl
└── tests

First off, it is a python package with an __init__.py. If you’ve never created a python package before, here is a good tutorial.

Next up is the project directory. This is where all your code lives. Inside this directory we have a few files and directories.

  • config.py : Practically every application is going to need some configuration and a way to load it. I put that code in config.py and typically import it when necessary. You can leave this out until you need it.
  • controllers.py : MVC is a pretty good design pattern to follow. The controllers.py is where you put your objects that will be mounted on the cherrypy.tree.
  • models.py : Applications typically need to talk to a database or some other service for storing persistent data. I highly recommend SQLAlchemy for this. You can configure the models referred to in the SQLAlchemy docs here, in the models.py file.
  • server.py : CherryPy comes with a production ready web server that works really well behind a load balancing proxy such as Nginx. This web server should be used for development as well. I’ll provide a simple example of what might go in your server.py file.
  • static : This is where your css, images, etc. will go.
  • lib : CherryPy does a good job allowing you to write plain python. Once the controllers start becoming more complex, I try to move some of that functionality to well organized classes / functions in the lib directory.
  • views : Here is where you keep your template files. Jinja2 is a popular choice if you don’t already have a preference.

Lastly, I added a tests directory for adding unit and functional tests. If you’ve never done any testing in Python, I highly recommend looking at pytest to get started.

Hooking Things Together

Now that we have a bunch of files and directories, we can start to write our app. We’ll start with the Hello World example on the CherryPy homepage.

In our controllers.py we’ll add our HelloWorld class:

import cherrypy

class HelloWorld(object):
    def index(self):
        return 'Hello World!'
    index.exposed = True

Our server.py is where we will hook up our controller with the webserver. The server.py is also how we’ll run our code in development and potentially in production.

import os

import cherrypy

# if you have a config, import it here
# from myproj import config

from myproj.controllers import HelloWorld

HERE = os.path.dirname(os.path.abspath(__file__))

def get_app_config():
    return {
        '/static': {
            'tools.staticdir.on': True,
            'tools.staticdir.dir': os.path.join(HERE, 'static'),
        }
    }

def get_app(config=None):
    config = config or get_app_config()
    cherrypy.tree.mount(HelloWorld(), '/', config=config)
    return cherrypy.tree

def start():
    get_app()
    cherrypy.engine.start()
    cherrypy.engine.block()

if __name__ == '__main__':
    start()
Obviously, this looks more complicated than the example on the CherryPy homepage. I’ll walk you through it to let you know why it is a little more complex.

First off, if you have a config.py that sets up any configuration object, we import that first. Feel free to leave that out until you have a specific need.

Next up we import our controller from our controllers.py file.

After our imports we set up a HERE variable that will be used to configure any paths. The static resources are the obvious example.

At this point we start defining a few functions. The get_app_config function returns a configuration for the application. In the config, we set up the staticdir tool to point to our static folder. The default configuration is to expose these files via /static.

This default configuration is defined in a function to make it easier to test. As your application grows, you will end up needing to merge different configuration details together depending on the configuration passed into the application. Starting off by making your config come from a function will help make your application easier to test because it makes changing your config for tests much easier.
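For example, merging test overrides onto a default config might be sketched like this (merge_config is an invented helper, not part of CherryPy):

```python
def get_app_config():
    # Default config, as returned by the function in server.py.
    return {
        '/static': {
            'tools.staticdir.on': True,
        }
    }

def merge_config(base, overrides):
    # Shallow-merge per-path sections; values in overrides win.
    merged = {path: dict(section) for path, section in base.items()}
    for path, section in overrides.items():
        merged.setdefault(path, {}).update(section)
    return merged

test_config = merge_config(get_app_config(),
                           {'/static': {'tools.staticdir.on': False}})
print(test_config['/static']['tools.staticdir.on'])  # False
```

Because the default comes from a function, each test can build its own merged copy without mutating any shared module-level dict.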

In the same way we’ve constructed our config behind a function, we also have our application available behind a function. When you call get_app, it has the side effect of mounting the HelloWorld controller on the cherrypy.tree, making it available when the server starts. The get_app function also returns the cherrypy.tree. The reason for this is, once again, to allow easier testing with tools such as WebTest. WebTest allows you to take a WSGI application and make requests against it, asserting against the response. It does this without requiring you to start up a server. I’ll provide an example in a moment.

Finally we have our start function. It calls get_app to mount our application and then calls the necessary functions to start the server. The quickstart method used in the homepage tutorial does this under the hood, with the exception of also doing the mounting and adding the config. The quickstart can become less helpful as your application grows because it assumes you are mounting a single object at the root. If you prefer to use quickstart you certainly can. Just be aware that it can be easy to clobber your configuration when mixing it with cherrypy.tree.mount.

One thing I haven’t addressed here is the database connection. That is outside the scope of this post, but for a good example of how to configure SQLAlchemy and CherryPy, take a look at the example application, Twiseless. Specifically you can see how to setup the models and connections. I’ve chosen to provide a file system organization that is a little closer to other frameworks like Django, but please take liberally from Twiseless to fill in the gaps I’ve left here.


In full stack frameworks like Django, testing is part of the full package. While many venture outside the confines of whatever the defaults are (using pytest vs. django’s unittest based test runner), it is generally easy to test things like requests to the web framework.

CherryPy does not take any steps to make this easier, but fortunately, this default app configuration lends itself to relatively easy testing.

Let’s say we want to test our HelloWorld controller. First off, we should set up an environment to develop with. For this we’ll use virtualenv. I like to use a directory called venv. In the project directory:

$ virtualenv venv

Virtualenv comes bundled with pip. Pip has a helpful feature where you can define requirements in a single text file. Assuming you’ve already filled in your setup.py with information about your package, we’ll create a dev_requirements.txt to make it easy to get our environment set up.

# dev_requirements.txt

-e .  # install our package

# test requirements
pytest
webtest

Then we can install these into our virtualenv by doing the following in the shell:

$ source venv/bin/activate
(venv) $ pip install -r dev_requirements.txt

Once the requirements are all installed, we can add our test.

We’ll create a file in tests called, say, test_controllers.py. Here is what it will look like:

import pytest
import webtest

from myproj.server import get_app

@pytest.fixture
def http():
    return webtest.TestApp(get_app())

class TestHelloWorld(object):

    def test_hello_world_request(self, http):
        resp = http.get('/')
        assert resp.status_int == 200
        assert 'Hello World!' in resp

In the example, we are using a pytest fixture to inject WebTest into our test. WebTest allows you to perform requests against a WSGI application without having to start up a server. The http.get call in our test then is the same as if we had started up the server and made the request in our web browser. The resulting response from the request can be used to make assertions.

We can run the tests via the py.test command:

(venv) $ py.test tests/

It should be noted that we also could test the response by simply instantiating our HelloWorld class and asserting the result of the index method is correct. For example:

from myproj.controllers import HelloWorld

def test_hello_world_index():
    controller = HelloWorld()
    assert controller.index() == 'Hello World!'

The problem with directly using the controller objects is when you use more of CherryPy’s features, you end up using more of cherrypy.request and other cherrypy objects. This progression is perfectly natural, but it makes it difficult to test the handler methods without also patching much of the cherrypy framework using a library like mock. Mock is a great library and I recommend it, but when testing controllers, using WebTest to handle assertions on responses is preferred.

Similarly, I’ve found pytest fixtures to be a powerful way to introduce external services into tests. You are free to use any other method you’d like to utilize WebTest in your tests.

Conclusion


CherryPy is truly an unopinionated framework. The purpose of CherryPy is to create a simple gateway between HTTP and plain Python code. The result is that there are often many questions of how to do common tasks, as there are few constraints. Hopefully the above folder layout, alongside the excellent Twiseless example, provides a good jumping off point for getting the job done.

Also, if you don’t like the layout mentioned above, you are free to change it however you like! That is the beauty of CherryPy. It allows you to organize and structure your application the way you want it structured. You can feel free to be creative and customize your app to your own needs without fear of working against the framework.

Mon, 24 Feb 2014 00:00:00 +0000 <![CDATA[Queues]]> Queues

Ian Bicking has said goodbye. Paste and WSGI played a huge part of my journey as a Python programmer. After reading Ian’s post, I can definitely relate. Web frameworks are becoming more and more stripped down as we move to better JS frameworks like AngularJS. Databases have become rather boring as Postgres seems to do practically anything and MongoDB finally feels slightly stable. Where I think there is still room to grow is in actual data, which is where queues come in.

Recently I’ve been dragged into the wild world of Django. If you plan on doing anything outside of the typical request / response cycle, you will quickly run into Celery. Celery defines itself as a distributed task queue. The way it works is that you run celery worker processes that use the same code as your application. These workers listen to a queue (like RabbitMQ, for example) for task events that a worker will execute. There are some other pieces that are provided, such as scheduling, but generally, this is the model.

The powerful abstraction here is the queue. We’ve recently seen the acceptance of async models in programming. On the database front, eventual consistency has become more and more accepted as fact for big data systems. Browsers have adopted data storage models to help protect user data while that consistency gets replicated to a central system. Mobile devices with flaky connectivity provide the obvious use case for storing data temporarily on the client. All these technologies present a queue-like architecture where data is sent through a series of queues where workers are waiting to act on the data.

The model is similar to functional programming. In a functional programming language you use functions to describe a set of operations that will happen on a specific type or set of data. Here is a simple example in Clojure:

(defn handle-event [evt]
  (add-to-index (map split-by-id (parse (:data evt)))))

Here we are handling some evt data structure that has a data key. The data might be a string that gets parsed by the parse function. The result of the parsing is passed to a map operation that also returns an iterable that is consumed by the add-to-index function.

Now, say we wanted to implement something similar using queues in Python.

def parse(data, output):
    # some parsing...
    for part in parts:
        output.put(part)

def add_to_index(input):
    while True:
        doc = input.get()
        index.add(doc)

def POST(doc):
    id = gen_id()
    indexing_queue.push((id, doc))
    return {'message': 'added to index queue',
            'redirect': '/indexed/%s' % id}

Even though this is a bit more verbose, it presents a similar model as the functional paradigm. Each step happens on an immutable value. Once the function receives the value from the queue, it doesn’t have to be concerned with it changing as it does its operation. What’s more, the processing can be on the same machine or across a cluster of machines, mitigating the effect of the GIL.
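The pseudocode above can be made concrete with the standard library alone. Below is a minimal, runnable sketch of the same pipeline using a thread as the worker; the splitting logic, the list-backed index and the `None` sentinel are stand-ins I’ve invented for illustration, not part of any real system:

```python
import queue
import threading

def parse(data, output):
    # Stand-in parser: split the raw string and push each part downstream.
    for part in data.split():
        output.put(part)

def add_to_index(input_queue, index):
    # Consume values until a sentinel (None) signals shutdown.
    while True:
        doc = input_queue.get()
        if doc is None:
            break
        index.append(doc)

index = []
q = queue.Queue()
worker = threading.Thread(target=add_to_index, args=(q, index))
worker.start()

parse("alpha beta gamma", q)
q.put(None)   # tell the worker we are done
worker.join()
print(index)  # ['alpha', 'beta', 'gamma']
```

Swapping `queue.Queue` for a broker-backed queue (RabbitMQ, Redis, etc.) keeps the same shape while letting the workers live on other machines.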

This isn’t a new idea. It is very similar to the actor model and other concurrency paradigms. Async programming effectively does the same thing in that the main loop is waiting for I/O, at which time it sends the I/O to the respective listener. In theory, a celery worker could queue up a task on another celery queue in order to get a similar effect.

What is interesting is that we don’t currently have a good way to do this sort of programming. There is a lot of infrastructure and tooling that would be helpful. There are questions as to how to deploy new nodes, how to keep code up to date, and what happens when the queue gets backed up. Also, what happens when Python isn’t fast enough? How do you utilize a faster system? How do you do backfills of the data? Can you just re-queue old data?

I obviously don’t have all the answers, but I believe the model could work to make processing streamable data more powerful. What makes the queue model possible is an API and document format for using the queue. If all users of the queue understand the content on the queue, then any system that connects to the queue can participate in the application.

Again, I’m sure others have built systems like this, but as there is no framework available for Python, I suspect it is not a popular paradigm. One example of the pattern (sans a typical queue) is Mongrel2 with its use of ZeroMQ. Unfortunately, with the web providing things like streaming responses and the like, I don’t believe this model is very helpful for a generic web server.

Where I believe it could excel is when you have a lot of data coming that requires flexible analysis by many different systems, such that a single data store cannot provide the flexibility required. For example, if you wanted to process all facebook likes based on the URLs, users and times, it would require a rather robust database that could effectively query each facet and establish a reasonably fast means of calculating results. Often this is not possible. Using queues and a streaming model, you could listen to each like as it happens and have different workers process the data and create their own data sources customized for the specific queries.

I still enjoy writing python and at this point I feel I know the language reasonably well. At the same time I can relate to the feeling that it isn’t as exciting as it used to be. While JavaScript could be a lot of fun, I think there is still something to be done with big data that makes sense for Python. Furthermore, I’d hope the queue model I mentioned above could help leave room to integrate more languages and systems such that if another platform does make sense, it is trivial to switch where needed.

Have others written similar systems? Are there problems that I’m missing?

Wed, 12 Feb 2014 00:00:00 +0000 <![CDATA[Immutability]]> Immutability

One thing about functional programming languages that is a source of frustration is immutable data structures. In Clojure there are a host of data structures that allow you to change the data in place. This is possible because the operation is wrapped in a transaction of sorts that guarantees it will work or everything will be reverted.

One description that might be helpful is that Clojure uses locks by default. Any data you store is immutable and therefore locked. You can always make a copy efficiently and you are provided some tools to unlock the data when necessary.
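Python doesn’t enforce any of this, but the flavor of “locked by default, copy to change” can be approximated by never mutating in place. This is only a rough analogy (it copies eagerly, unlike Clojure’s structural sharing), and the config values here are made up:

```python
config = {"retries": 3, "timeout": 30}

# "Unlocking" the data means producing a new value, not mutating in place.
updated = dict(config, timeout=60)

print(config["timeout"])   # 30 -- the original is untouched
print(updated["timeout"])  # 60
```

Any code holding a reference to `config` can trust it never changes out from under it, which is the property the transactional model buys you.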

I’m definitely not used to this model by any stretch, but it seems the transactional paradigm, along with efficient copies, makes for a good balance of functional values alongside practical requirements.

Sun, 02 Feb 2014 00:00:00 +0000 <![CDATA[My First Clojure Experience]]> My First Clojure Experience

Clojure is a LISP built on top of the JVM. As a huge fan of Emacs, it shouldn’t be surprising that there is a soft spot in my heart for LISP as well as functional programming. The problem is that lisp is a rather lonely language. There are few easily googleable tutorials and a rather fractured community. You have a ton of options (Guile, Scheme, Racket, CL, etc.) with none of them providing much proof that a strong, long-lasting community exists. It can be rather daunting to spend time trying to learn a language based on a limited set of docs knowing that it is unlikely you will have many chances to actually use it.

Of course, learning a lisp (and functional programming) does make you a better programmer. Learning a lisp is definitely time well spent. That said, this reality of actually using lisp in the real world has always been a deterrent for me personally.

Clojure, on the other hand, is a little different.

Clojure, being built on the JVM, manages to provide a lisp that is contextualized by Java. Clojure doesn’t try to completely hide the JVM and instead provides clear points of interoperability and communicates its features in terms of Java. Rich Hickey does a great job explaining his perspective on Clojure, and more importantly, what the goal is. This all happens using Java colored glasses. The result is a creator that is able to present a practical lisp built from lessons learned programming in typical object oriented paradigms.

Idealism aside, what is programming in Clojure really like?

Well, as a Python programmer with limited Java experience, it is a little rough to get started. The most difficult part of learning a lisp is how to correctly access data. In Python (and any OO language) it is extremely easy to create a data structure and get what you need. For example, if you have a nested dictionary of data, you can always provide a couple keys and go directly to the data you want. Lisp does not take the same approach.
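To make the contrast concrete, here is the kind of direct keyed access Python makes trivial (the event structure is a toy I’ve made up). Clojure’s closest analogue is something like `get-in`, but the idioms around building and traversing data are quite different:

```python
event = {"user": {"profile": {"name": "ada"}}}

# Go straight to the leaf by providing the chain of keys.
name = event["user"]["profile"]["name"]
print(name)  # ada
```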

It would be really great if I were to tell you how best to map data in python data structures into Clojure data structures, but I really don’t know. And that is really frustrating! But, it is frustrating because I can see how effective the platform and constructs would be if only I could wrap my head around dealing with data.

Fortunately, Rich gives us some tips by way of Hammock Driven Development, which seem promising. A common concept within the world of lisp is that your program is really just data. Cascalog, a popular hadoop map reduce framework, provides a practical example of this through its logic engine. Here is a decent video that shows this declarative form, where your program really is just data used by a logic engine. Eventually, I’m sure my brain will figure out how to effectively use data in Clojure.

Another thing that is commonly frustrating for a Python programmer on the JVM is dealing with the overwhelming ecosystem. Clojure manages to make this aspect almost trivial thanks to Leiningen. Imagine virtualenv and pip merged into a single project tool and you start to see how powerful it is.

Finally, Clojure development is really nice in Emacs. The key is the inferior lisp process. If you’ve ever wanted a Python IDE you’ll find that the only way to reliably get features like autocomplete to work with knowledge of the project is to make sure the project is configured with the specific virtualenv. Emacs makes this sort of interaction trivial in Clojure because of tools like Cider that jack into the inferior lisp process to compile a specific function, run tests or play around in a repl.

I highly recommend checking out Clojure. Like a musical instrument, parens may take a little practice. But, once you get used to them, the language becomes really elegant. At a practical level you get a similar dynamism as you see in Python. You also get the benefits of a platform that is fast and takes advantage of multiple cores. Most importantly, there is a vibrant and helpful community.

Even if you don’t give Clojure a try, I encourage you to watch some of Rich Hickey’s talks online. They are easy to watch and take an interesting perspective on OO based languages. I’ve become a fan.

Fri, 31 Jan 2014 00:00:00 +0000 <![CDATA[Code by Line]]> Code by Line

I saw this tweet:

Limiting lines to 80 characters is a great way to ensure that variable names remain cryptically short while lines break in confusing places.

It makes some sense. For example, if I had something like:

put_to_s3(project_bucket, resultant_keyname, use_multipart=True, overwrite=False, confirm=True)

One way to a shorter line would be to make some variables names a bit shorter:

put_to_s3(bucket, key, use_multipart=True, overwrite=False, confirm=True)

Unfortunately, this doesn’t quite do the trick. A better tack, one with benefits that go beyond 80 characters, is to utilize vertical space. Or in simpler terms, code by lines rather than variables. For example, I would have refactored the original code like this:

put_to_s3(project_bucket,
          resultant_keyname,
          use_multipart=True,
          overwrite=False,
          confirm=True)

I get to keep my more descriptive names and when the signature of the function changes or I have to add another keyword argument, the diff / patch will be much clearer. Also, and this is obviously subjective, if the vertical listing seems to grow large, you have a more obvious “smell” to the code when you are browsing the codebase.

It is understandable to assume that limiting line size could result in cryptic variable names, but more often than not, longer lines end up being more difficult to read and decode. More importantly, you end up fighting the endless suite of line based tools we utilize in version control. The next time you feel limited by the line length, consider the vertical space you have and whether it might allow you to have your descriptive variable names alongside your line based coding tools.

Tue, 21 Jan 2014 00:00:00 +0000 <![CDATA[Announcing CacheControl 0.9.2]]> Announcing CacheControl 0.9.2

I’ve just released CacheControl 0.9.2! As requests now supports response pickling out of the box, CacheControl won’t try to patch the Response object unless it is necessary.

Also, I’ve heard that CacheControl is being used successfully in production! It has helped us replace httplib2 in our core application, which has pretty decent traffic.

Download the release over at pypi and check out the docs.

Sat, 11 Jan 2014 00:00:00 +0000 <![CDATA[Hiding Complexity vs. Too Many Layers]]> Hiding Complexity vs. Too Many Layers

If you’ve ever tried TDD there is a decent chance you’ve written some code like this:

from mock import patch

@patch('mypkg.uploader.upload_client')
def test_upload_foo(upload_client):
    do_upload()
    upload_client.upload.assert_called_with(new_filename())

In this example, we are testing some code that uploads a file somewhere like S3. We patch the actual upload layer to make sure we don’t have to upload anything. We then assert that we are uploading the file using the right filename, which is the result of the new_filename function.

The code might look something like this:

from mypkg.uploader import upload_client

def new_filename():
    return some_hash() + request.path

def do_upload():
    upload_client.upload(new_filename())
The nice thing about this code is that it is pretty reasonable to test. But, in this simplified state, it doesn’t reflect what happens when you have a more complex situation with multiple layers.
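To make the pattern runnable end to end, here is a self-contained sketch using the stdlib’s unittest.mock instead of importing a real uploader module; the hash and path values are invented stand-ins:

```python
from unittest import mock

def new_filename():
    # Stand-in for some_hash() + request.path
    return "abc123" + "/report.csv"

def do_upload(client):
    client.upload(new_filename())

def test_upload_foo():
    client = mock.Mock()
    do_upload(client)
    # The mock records the call so we can assert on the filename used.
    client.upload.assert_called_with("abc123/report.csv")

test_upload_foo()
print("ok")
```

Passing the client in as an argument, rather than patching a module global, is one way to keep this kind of test simple.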

For example, here is an object that creates a gzipped CSV writer based on some parameters and the current time.

import csv
import gzip
import os
from datetime import datetime


class Foo(object):

    basedir = '/'

    def __init__(self, bar, baz, now=None):
        self.bar = bar
        self.baz = baz
        self._now = now
        self._file_handle = None

    @property
    def now(self):
        if not self._now:
            self._now = datetime.now().strftime('%Y-%m-%d')
        return self._now

    @property
    def fname(self):
        return '%s.gz' % os.path.join(self.basedir, self.now,
                                      self.bar, self.baz)

    @property
    def file_handle(self):
        if not self._file_handle:
            self._file_handle = gzip.open(self.fname, 'w')
        return self._file_handle

    def writer(self):
        return csv.writer(self.file_handle)

The essence of this functionality could all be condensed down to a single method:

def get_writer(self):
    now = self._now
    if not now:
        now = datetime.now().strftime('%Y-%m-%d')

    fname = '%s.gz' % os.path.join(self.basedir, now,
                                   self.bar, self.baz)

    # NOTE: We have to keep this handle around to close it and
    #       actually save the data.
    self.file_handle = gzip.open(fname, 'w')
    return csv.writer(self.file_handle)

The single method is pretty easy to understand, but testing becomes more difficult.
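Here is a sketch of why the layered version pays off: the naming logic can be pinned down in a test without touching gzip or csv at all. The class below is a pared-down stand-in for the object above (only the naming layer, with an injected date), and the argument values are invented:

```python
import os

class NamedOutput(object):
    # Pared-down stand-in: just the naming layer, nothing else.
    basedir = '/'

    def __init__(self, bar, baz, now=None):
        self.bar = bar
        self.baz = baz
        self._now = now

    @property
    def fname(self):
        return '%s.gz' % os.path.join(self.basedir, self._now,
                                      self.bar, self.baz)

# Injecting `now` makes the name deterministic and therefore testable.
out = NamedOutput('events', 'clicks.csv', now='2014-01-06')
assert out.fname == '/2014-01-06/events/clicks.csv.gz'
```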

Even though the code is relatively easy to read, I believe it is better to lean towards the more testable code and I’ll tell you why.

Tests Automate Understanding

The goal of readable code and tests is to help those that have to work on the code after you’ve moved on. This person could be you! The code you pushed might have seemed perfectly readable when you originally sent it upstream. Unfortunately, that readability can only be measured by the reader. The developer might be new to the project, new to the programming language or, conversely, be an author that predates you! In each of these cases, your perspective on what is easy to understand is rarely going to be nearly as clear to the next developer reading your code.

Tests on the other hand provide the next developer with confidence because they have an automated platform on which to build. Rather than simply reading the code in order to gain understanding, the next developer can play with it and confirm his or her understanding authoritatively. In this way, tests automate your understanding of the code.

Be Cautious of Layers!

Even though hiding complexity by way of layers makes things easier to test and you can automate understanding, layers still present a difficult cognitive load. Nesting objects in order to hide complexity can often become difficult to keep track of, especially when you are in a dynamic language such as Python. In static languages like Java, you have the ability to create tools to help navigate the layers of complexity. Often times in dynamic languages, similar tools are not the norm.

Obviously, there are no hard and fast rules. The best course of action is to try and find a balance. We have rough rules of thumb that help us make sure our code is somewhat readable. It is a good idea to apply similar rules to your tests. If you find that some code, while reasonably easy to read, makes it difficult to confirm an isolated detail in a test, then it is probably worth factoring out that code and testing it separately. The same goes for writing tons of tests to cover all the code paths.

About the Example

I came up with the example because it was some actual code I had to write. I found that I wanted to be able to test each bit separately. I had a base class that would create the file handles, but the file naming was different depending on the specific class that inherited it. By breaking out the naming patterns I was able to easily test the naming and fix the naming bugs I ran into. What’s more, it gave me confidence when I needed to use those file names later and wanted to be sure they were correct. I didn’t have to rewrite any code that created the names because there was an obvious property that was tested.

It did make the code slightly more ugly. But, I was willing to accept that ugliness because I had tests that made sure when someone else needed to touch the code, they would have the same guarantees that I found helpful.

Tests are NOT Documentation

Lastly, tests are not a replacement for readable code, docs or comments. Code is meant for computers to read and understand, not people. Therefore it is in our best interest to take our surrounding tools and use them to the best of our abilities in order to convey as clearly as possible what the computer will be doing with our text. Tests offer a way to automate understanding. Tests are not a replacement for understanding.

Finally, it should be clear that my preference for tests and more layers is because I value maintainable code. My definition of maintainable code is defined by years (5-10) and updated by teams of developers. In other words, my assumption is that maintenance of the code is, by far, the largest cost. Other projects don’t have the same requirements, in which case, well commented code with less isolated tests may work just fine.

Mon, 06 Jan 2014 00:00:00 +0000 <![CDATA[Announcing CacheControl 0.8.2! Now with ETag and Vary Support]]> Announcing CacheControl 0.8.2! Now with ETag and Vary Support

I’ve released CacheControl 0.8.2. Thanks to tow for submitting the ETag support!

Take a look at the docs and specifically the way etags are handled. I believe it is a subtle improvement over httplib2’s behavior.

Lastly, I’ve also confirmed the test suite is running under Python 3.2+. This is my first foray into the brave new world of 3.x, so please open tickets for any issues or suggestions.

Tue, 26 Nov 2013 00:00:00 +0000 <![CDATA[Milk the Cat?]]> Milk the Cat?

I finally read Dune. It was more of a fantasy story than pure sci-fi. The picture Dune paints reminded me of something you’d see in Heavy Metal or some other comic, so it was a pretty fun read.

Then I made the mistake of watching the movie.

First off, the positive. The movie really tries to fit as much of the book as possible. It uses a narrator to fill in a lot of gaps and includes the internal dialog prevalent in the book. It ends up being pretty cheesy though and reminded me of The Wonder Years.

Now, obviously the book is better than the movie. What is funny is where the movie took liberties. Generally, the “powers” of the different characters feel more magical than in the book. This isn’t a huge deal, but it cheeses the movie out.

The worst and most ridiculous is the milking of the cat.

In the book there is a character that is captured by the antagonist camp. He is given a poison that requires a daily dosage in order to keep the mortal effect at bay. In the book they give it to him in his food.

What do you think they did in the movie?

That’s right. They brought in a totally stupid contraption built around an annoyed white cat that had a rat on its back and told this character that he had to milk the cat every day in order to keep the poison at bay. I have no idea...

The worst part of it all was that the movie made the book feel cheesy. It was so bad that the imagery and story the book painted started to feel like a cheesy 80s B sci-fi flick. It was kind of a bummer.

Bask in the awfulness.

Thu, 21 Nov 2013 00:00:00 +0000 <![CDATA[Our Eyes are Modems]]> Our Eyes are Modems

I read an article about a new retinal HMD. Virtual reality has never been a huge interest of mine, but seeing as I look at a screen all day as a programmer, anything that could improve what I see all day seems worth a look (no pun intended).

It occurred to me that the development of this sort of technology is really similar to a modem. If we think back to early days of the internet (it wasn’t really that long ago in the grand scheme of things) we had modems in our computers. A modem is a “MODulator DEmodulator” and it took analog sound from the phone line and translated it into data for your computer.

Our eyes act like a modem. We sense changes in light and translate that to data. We then act on that data accordingly. Sometimes that data causes us to blink while other times it stirs up emotions. This last bit is why I believe most of these technologies focus on films as an example use case. A movie is really a physical representation of experiences as told through light. If a movie causes the viewer to experience some emotions, it has effectively communicated its message.

I’m still on the fence as to whether this sort of direct analogue connection to our brains is beneficial or just plain old scary.

Thu, 14 Nov 2013 00:00:00 +0000 <![CDATA[Announcing CacheControl 0.7.1]]> Announcing CacheControl 0.7.1

I’ve released CacheControl 0.7.1. This release includes patching of the requests Response object to make it pickleable. This allows you to easily implement cache stores on anything that can store pickled text. I’ve also added a Redis and Filesystem based cache.

The FileCache requires lockfile be installed and obviously, the redis cache requires the redis module be installed.

I also added docs!

Please give it a go and file bugs!

Thu, 07 Nov 2013 00:00:00 +0000 <![CDATA[Avoiding Virtualenvwrapper in Build Tasks]]> Avoiding Virtualenvwrapper in Build Tasks

Virtualenvwrapper is a really helpful tool that allows you to keep your Python virtualenvs organized in a single location. It provides some hooks to make working with a virtualenv in a shell simple. Unfortunately, it is not well suited to organizing automated virtualenvs used in a project’s build tasks.

First off, I should say that my goal with any build task is that it can be run without any external requirements. No environment variables should need setting. No virtualenv activated. No other services need to be up and running (within reason). My goal with any project is to support something like this:

$ git clone $project
$ cd $project
$ make bootstrap
$ make test
$ make run
$ make release

In this case I’m using Make as an example, but Paver, Rake, Invoke, SCons, CMake, Redo, Ant or any other build tool would work.

The problem with virtualenvwrapper is that it assumes you are using it from a shell. It implements its functionality as shell functions. It is necessary that it does this because it is impossible for a child process to adjust the environment of the parent in a way that lasts after the child process ends. Virtualenvwrapper’s user interface wants to activate a virtualenv after it has been created, so shell functions are the best way to do this.

None of this means that a developer cannot use virtualenvwrapper. It simply means that using virtualenvwrapper to create and bootstrap your environment is more complex and could be more brittle over time. It is safer and more reliable to just create the virtualenv yourself, while making it configurable to utilize a virtualenv previously created by virtualenvwrapper.

Wed, 30 Oct 2013 00:00:00 +0000 <![CDATA[Protect Privacy]]> Protect Privacy

There have been a ton of discussions regarding privacy recently in light of the Snowden NSA revelations. Many discussions revolve around encryption, using services outside the US and generally how to make it difficult for a snooper to read information passed around the internet. I’m definitely in favor of tools that enable keeping information private.

At the same time, wouldn’t it be better if our understanding of the internet and technology changed such that the users could be considered the owners? It seems as though tools like copyright, professional privilege and unlawful search and seizure should extend to our life online. Privacy as an ideal should not be limited by the current technology of the day. Privacy is a concept that should permeate our laws, no matter the current state of technology.

As an aside, I’d wonder if the RIAA should consider suing the NSA for copyright infringement. Imagine the number of songs and copyrighted works that flow through email the NSA might have “copied” digitally. Anyway...

I’m sure the government is never going to offer its constituency true privacy. To put it generally, it makes life harder for law enforcement. If you can’t see what people are doing, then how can you punish (and most importantly tax) them? I have a hunch that there are still plenty of required mediums that make auditing and discovering wrongdoing possible. A warrant is a piece of paper that describes an exception to privacy. That seems like a pretty reasonable way to go about finding evidence. After all, we are innocent until proven guilty.

On the other side of the coin, I believe society could greatly benefit from a culture of privacy. That little black box the insurance companies want to put in your car would be more appealing if you owned the data it recorded and could feel safe that the government can’t simply ask the insurance companies for the data without proper cause. Smart phones could be tracking your every move for your own usage, not the government’s. Technology can create new ways of recording and using data without having to be concerned that users’ privacy could be compromised. There is a world of automation that is available when you don’t have to worry about the data becoming public knowledge.

I don’t imagine any of this will happen. Most likely our government will continue to make hidden strides into destroying privacy in order to maintain power. Technology will try to curb this threat to privacy and users will become increasingly accepting of big brother watching everything we do. I only hope that some in our government will realize the danger of stealing privacy and make a stand to keep it safe both now and in the future.

Tue, 29 Oct 2013 00:00:00 +0000 <![CDATA[Framework Frustration]]> Framework Frustration

At work we use two frameworks, Django and CherryPy. The decision to use one or the other typically comes down to who is starting the project and, to a lesser extent, whether the app is primarily a user facing app or an API. For example, if we need to put together an app to show off some data publicly, Django is our go to framework. If we are creating an internal REST API for other services, CherryPy is typically the way to go.

Developers typically feel more comfortable with one framework. I’m definitely a CherryPy guy, while the rest of the folks on my team fall on the Django side of the fence. The result is that I’m often working on Django code, which ends up being pretty frustrating.

First off, the nice thing about Django is that if you commit to the ecosystem and learn it, there is a wealth of 80% tools you can use to create a functional web app. This is true of any opinionated full stack framework and I’d consider Django a prime example. When you understand Django, you can get a lot of stuff done.

The problem is that when you don’t know Django, getting things done is a challenge. The reason is that the framework hides general Python techniques in order to hide complexity. As I said, when you understand what happens under the hood, hiding the complexity is fine. The problem is that many full stack frameworks, such as Django, don’t make it easy to look under the hood and follow the stack to the necessary code.

CherryPy, on the other hand, makes uncovering the layers of complexity much easier. You can typically isolate bits of the framework relatively easily and test them in a prompt or simple script to discover issues. The source code is also small enough that diving into its algorithms is not unreasonable. Sure, the documentation is lacking, there are fewer high quality plugins and you will probably have to make more decisions as to how to implement common idioms, but the result is that uncovering the logic is rarely a problem.

Personally, I like CherryPy because you can take the codebase and figure out what is going on. When you do hit frameworks such as sqlalchemy or templates such as mako or jinja2, the documentation is typically of a high quality because of the smaller set of topics that need covering. Also, while it is possible to create CherryPy specific integration points, it is just as easy to write your own classes and functions to hide complexity as the need arises.

It can be frustrating working on Django because it is difficult to peel back the layers. For example, we use Tastypie for some API endpoints. It is exceptionally nice for exposing models. You get pagination, multiple authentication schemes, and a whole host of other bits that are nice. That said, when you need to adjust the API, it is cumbersome and produces somewhat ugly code. Here is an example, from the docs.

class ParentResource(ModelResource):
    children = fields.ToManyField(ChildResource, 'children')

    def prepend_urls(self):
        return [
            url(r"^(?P<resource_name>%s)/(?P<pk>\w[\w/-]*)/children%s$" % (self._meta.resource_name, trailing_slash()), self.wrap_view('get_children'), name="api_get_children"),
        ]

    def get_children(self, request, **kwargs):
        try:
            obj = self.cached_obj_get(request=request, **self.remove_api_resource_names(kwargs))
        except ObjectDoesNotExist:
            return HttpGone()
        except MultipleObjectsReturned:
            return HttpMultipleChoices("More than one resource is found at this URI.")

        child_resource = ChildResource()
        return child_resource.get_detail(request, parent_id=obj.pk)

First off, you have to understand a suite of concepts. Tastypie generates URL regexes for you. You can override these via the prepend_urls method. Second, the get_children method catches some custom exceptions that come from Django core in order to return tastypie specific error return values. Finally, the get_detail method is a helper that automatically will render the object found in the get_children method and return a proper tastypie response.

As you begin to understand the code it is not a huge mystery what is happening. With that said, there is a lot of reading that has to happen before you can begin to understand what is really going on. You also have to understand the implicit barriers between tastypie and django. Finally, this all rides on a semi-magic set of Resource objects that inject themselves into the list of URL patterns, removing the benefit of having all your URLs in one place.

Hopefully it is clear how trying to understand and debug this type of code is challenging and can be frustrating. While it hides a great deal of complexity for you and adds many feature that you may or may not need, it presents a chasm between the code and the actual impact that must be crossed by reading documentation.

At this point I should mention that this kind of code is a pet peeve of mine because it is difficult to maintain. Someone approaching this code without a strong background in Django and Tastypie would have to spend a good amount of time getting up to speed before being able to try and fix a bug. What’s more, that person would not be able to simply open up a Python prompt or write a test without further reading about what specialized tools are available and how to use them. Obviously, it is not a waste of time to make the investment, but for me personally, I’d rather learn by writing code, isolating functionality and writing tests than reading docs, hoping they are up to date.

Thu, 24 Oct 2013 00:00:00 +0000 <![CDATA[Mechanical Switching Keyboards]]> Mechanical Switching Keyboards

I decided to take the plunge and buy a keyboard with mechanical switches. What put me over the edge was the claim that it improved typing accuracy because your hands got used to sound and feel of the actual switch.

After using it for a week or so, I can’t say that I’m in love just yet. The sound of the keyboard really is loud. I’ve found I have to pay attention to how I’m typing as well. I’ve heard that once you get used to it you don’t really press all the way down because you can feel where the switch engages. Whether or not this is entirely true, using a light touch seems to help avoid mistyping. The most common error when I’m trying to type quickly is when the wrong letter comes first. It is almost like a real typewriter in that it feels like I need to type slowly and deliberately in order to make sure I get it right. Another common frustration is repeated keystrokes. Often when I have to delete more than one letter or word, I’ll press the delete key and not press it fully to where it engages. The result is that I start typing and have to start over because the spacing is incorrect.

I will say that I’ve never been a very strong typist. While I can type quickly at times, my error rate is pretty high. My hope is that this new keyboard will help improve my accuracy, and so far I think it might be working, at least as far as typing on this keyboard is concerned. When I move to my laptop keyboard it feels rather foreign and takes a bit of getting used to. That also might be due to having a new X1 Carbon rather than my MacBook Pro. The X1 has a pretty good keyboard, but I wouldn’t consider it any better than my Mac keyboard.

I do hope that this change is helpful. The keyboard does feel really rugged and having something new to type on does add a little spice to writing code. At this point I can’t say I’d recommend mechanical switches, but I can definitely see how over time someone could really fall in love with the feel and sound.

Mon, 21 Oct 2013 00:00:00 +0000 <![CDATA[Recent Finds]]> Recent Finds

For the past few years I’ve been more or less happily developing on OS X. Thanks to Emacs, I have a nice text-based interface to work with that allows me to manage most of my core applications (text editor, IRC, email, terminal, etc.) in a keyboard-centered environment. At the same time, I missed the stripped down environment of my tiling window manager of choice, StumpWM. A recent associate moved on to “googlier” pastures and left behind an X1 Carbon that was up for grabs. Seeing as my MacBook Pro always had problems running VMs and the disk was always almost full, it seemed like a good time to switch.

Switching environments is usually a time when you are forced to reinvestigate the tools currently available. Here are some tools that I’ve found interesting.



Helm

Helm is the reinvention of anything.el. Many people compare it to Spotlight, Alfred and Quicksilver on the Mac in that it helps you configure smart lookups to find things. People use it for everything from autocomplete to a nice interface to Spotify. The Spotify video inspired me to write some code to browse the files in my recently converted blog. It was really easy to do and puts a pretty face and a usable UI on it for very little effort.

Expand Region

Anyone who follows Emacsrocks has probably already seen Expand Region. It is a really simple package that helps you semantically expand what is selected. Here is a short video showing how it works. The nice thing is that if you are refactoring code, this makes it easy to select the current expression, function or class and cut/copy it where you need it to go. Likewise, if you need to search/replace within a semantic block, it is trivial to do without having to move around to make the selection.

S and Dash

S is a string library for elisp and Dash is a library for working with lists in elisp. Arguably, neither is that helpful if you already know elisp and/or Common Lisp, but for someone like me who doesn’t have a strong Lisp background, these are really helpful libs.



Tinkerer

I recently migrated my blog to Tinkerer. The nice thing about it is that it uses Sphinx for generating the static pages.


Toolz

Toolz extends the itertools, functools and operator modules in order to provide more robust functional programming patterns in Python. After playing with it a bit, it was clear how helpful a tool it can be in a distributed processing model. It is trivial to construct a complex pipeline of transforms and pass it to a multiprocessing pool to quickly crank through some data.
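Since Toolz builds on itertools and functools, the pipeline idea can be sketched with the standard library alone. The `compose` helper below mimics `toolz.compose`, and the transforms and data are invented for illustration:

```python
from functools import reduce

def compose(*funcs):
    """Right-to-left composition, in the spirit of toolz.compose."""
    def composed(value):
        return reduce(lambda acc, f: f(acc), reversed(funcs), value)
    return composed

def normalize(line):
    # invented transform: trim whitespace and lowercase
    return line.strip().lower()

def tokenize(line):
    # invented transform: split into words
    return line.split()

# normalize runs first, then tokenize -- the same callable shape you
# could hand to a multiprocessing.Pool via pool.map(transform, lines)
transform = compose(tokenize, normalize)

print(transform("  Hello World "))  # ['hello', 'world']
```

The same chaining of small, pure transforms is what makes Toolz pipelines easy to fan out across worker processes.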

Python Daemon

There are tons of tutorials and libraries out there for creating a proper Unix daemon. PEP 3143 proposed a module for the standard library, since daemonization is something that hasn’t changed in a long time. The result was python-daemon. The python-daemon module is really easy to use and makes helpful bits like changing the working directory and capturing stdout/stderr trivial.


Invoke

Invoke is a Python build tool similar to Paver. What is interesting about it is that it has a mechanism for including other source files as extensions. It has a focus on calling multiple tasks at the same time and handling each task’s arguments correctly. I haven’t had a chance to mess with it very much, but my cursory overview has been positive. It cleans up a couple of annoyances I had with Paver regarding task arguments. It also comes from the folks that wrote Fabric.

This process of setting up my dev environment has been fun. It has been much simpler to get my Emacs up and running thanks to keeping my config files in source control and my package listings up to date. My fingers remembered how to use StumpWM. It is as though I never switched! Hope you enjoy my recent finds!

Fri, 18 Oct 2013 00:00:00 +0000 <![CDATA[Hosting Changes]]> Hosting Changes

I’ve had a VPS host since 2007 in order to run Python web apps. The reasoning is that most shared hosts, in addition to killing long running processes, rarely made it easy to create your own environment. I’ll remind you, this was in the early days of virtualenv, and there were still hairy tutorials on how to get a Python WSGI process running on Dreamhost.

Since then the landscape has changed quite a bit. There are far more hosts that support long running processes. There are services such as Heroku that make deployment of Python apps a cinch. VPS hosting has also become more common and easy to get up and running.

Beyond the technical differences, the biggest reason I switched from VPS Link to Digital Ocean was the price. As anyone who has used a VPS knows, RAM is a fleeting resource. With only 256MB, running any LAMP stack pushes the limits, never mind being able to use something like MongoDB or some other more interesting NoSQL store. I’m now spending $10 a month for a gig of memory where I was spending $25 a month for 256MB. It was a no brainer.

The other change is that I’ve switched from Pelican to Tinkerer for my blog. I’m not positive I’ll stick with it since sometimes it is nice to have the WordPress infrastructure in place. Now that I can actually run a database, I wouldn’t mind switching back. For the time being though, I’m going to check out Tinkerer and write some elisp to make using it easy in Emacs.

Tue, 15 Oct 2013 00:00:00 +0000 <![CDATA[Electric Sheep and RFID]]> Electric Sheep and RFID

I’m finally getting around to reading Do Androids Dream of Electric Sheep?. The premise is effectively what would happen if we had robots that were so real we couldn’t tell them from actual living things. It is an interesting problem. The methodology in the book for determining an entity as an android is a test of empathy. The innovation that throws a wrench in everything is the creation of a new type of android that might be able to fake the empathy test.

While the book has been fun, the premise is a little annoying at times. Why is an interview necessary at all? Just stick something like an RFID chip inside an android and you’re done. Obviously, as the book was published in 1968, a future of RFID and GPS may not have been at the forefront of futuristic thinking. Unfortunately, that doesn’t take away from the fact that we do have things like RFIDs today.

Still, this clash with our reality is a minor annoyance. The book is a great read and I can recommend it to anyone who enjoys sci-fi.

Mon, 09 Sep 2013 00:00:00 +0000 <![CDATA[Data Realization]]> Data Realization

17 hours and 35 seconds. That is the amount of time I’ve been running an Elastic Map Reduce job. The file is 16 GB compressed. I’m using 10 small EC2 instances.

It is my first EMR experience and it wouldn’t surprise me in the least if I was doing something wrong.

The thing that is interesting about this process is that all this time has been spent moving data around. My understanding (which the monitoring tool mrjob provides confirms) is that we must first take all our data and make it available to our EC2 instances. I’m supposing this process means copying it from S3 and putting it in a Hadoop filesystem for processing. My guess is that the actual processing won’t take nearly as long... At least I hope not.
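For a feel of the shape of the processing phase, here is a toy, single-machine word count in the mapper/reducer style; real EMR jobs (e.g. via mrjob) split the work into the same two phases, and the input lines here are made up:

```python
from collections import Counter
from itertools import chain

def mapper(line):
    # emit a (word, 1) pair per word, Hadoop streaming style
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    # sum the counts per word across all mapped pairs
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big problems", "big networks"]
result = reducer(chain.from_iterable(map(mapper, lines)))
print(result)  # {'big': 3, 'data': 1, 'problems': 1, 'networks': 1}
```

The computation itself is cheap; what EMR spends its time on is getting the `lines` (gigabytes of them) distributed to the machines running the mappers.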

It just goes to show that data has a price. In theory, digital content can be moved all over the world in an instant. In reality, our networks are insanely limited, our hard drives are slow, and the more moving pieces there are, the slower things become. It doesn’t come as a surprise, but it is something to think about when you have a lot of data you want to work with.

When I started working on this project, I questioned the decision of putting the data in S3. I wondered if uploading it directly to an HDFS cluster would be a better tack. Seeing as we hadn’t settled on a processing system, it felt like a premature optimization. Yet, in the back of my mind, I wondered how Amazon could conceivably take huge files, put them on an HDFS cluster for processing, run a job and clean things up faster than I could dump a database to a decent sized machine and run my processing locally. It appears that my intuition might have been right after all.

More data, more problems I suppose.

Thu, 29 Aug 2013 00:00:00 +0000 <![CDATA[Amazon Data Pipeline]]> Amazon Data Pipeline

Amazon Data Pipeline is a tool that makes running data processing jobs easier in the AWS cloud. Essentially it wraps up the process of grabbing data, putting it on EC2 instances, and running some processing via EMR (Hadoop). It takes care of a bunch of details like starting up a cluster, pulling and uncompressing data from S3 or other data sources, pushing the results back to S3, and a whole host of other bits that you’d have to sort out yourself if you were building a system on Hadoop.

The biggest benefit of AWS Data Pipeline is that it automates a lot of small details. It is non-trivial to spin up a Hadoop cluster, populate the filesystem across nodes, distribute your tasks, schedule jobs and return the results to a known location on S3. There are a lot of balls to juggle, and Data Pipeline provides a reasonably low point of entry.

The downside is that you really need some knowledge of what is going on with Data Pipeline in order to use it effectively. There are a ton of options to consider, and unless you are familiar with the specific storage medium or processing system you are working with, it can be pretty overwhelming and confusing.

Seeing as I’m just getting started with Data Pipeline, I’m unclear how big a benefit it is or where it really excels. For someone who doesn’t run large jobs often, or has only a small number of them that need to run, it seems like a good platform. The debugging and knowledge required to do a specific task a few times is reasonable if you are going to get reproducible data for a year or more. It remains to be seen whether it is a good platform when you have more robust requirements.

As an aside, it goes to show how difficult it can be to process data. There are no shortcuts to taking a data source and processing it. Computers can’t really guess because they can’t check whether they were right. People have the ability to try something, check the result and make an assertion about how correct it is. In this way we can see patterns that are opaque to a computer. Within the context of processing huge amounts of data, it would be extremely helpful if a query could describe to the computer how to pull data out of some dataset in order to push it into a processing system. If we solved this problem, we might be able to expose data processing to the masses.

The state of social media suggests the power to process massive amounts of data would probably amount to better cat videos, but who doesn’t love cats!

Tue, 27 Aug 2013 00:00:00 +0000 <![CDATA[Reader and Facebook]]> Reader and Facebook

We’ve lost Google Reader. Facebook continues to suck. Can we please consider a new way to publish our data?

It always has bothered me that social networks have managed to garner so much attention when they have so little to truly offer. Social networks that lack a creative focus (like Facebook, Plus, MySpace and Twitter) are kind of like pyramid schemes. Users are tasked with growing the network with friends so they can use it and have a place to communicate. What is somewhat seedy about it all is that users feel entitled, when in fact they are slaves to the network. They sign agreements that provide the social network with the ability to profit from users’ data and networks. The social network can change anything it wants and eventually charge for aspects that were previously free.

The first thing that needs to change is users need to own their data.

The other aspect of social networks that has seemed off is the actual network of friends. Most social networks have begun to rally around Facebook and use that as the basis of your network. That doesn’t mean that when you publish data, your network sees it. Facebook started throttling this sort of thing, most likely because traversing the network is expensive when the system is also responsible for publishing.

The second thing that needs to change is users need to own their list of friends.

The answer to owning data is to create a blog. It doesn’t matter so much whether that blog is hosted or self-hosted, because a blog is something you can migrate relatively cheaply as services rise and fall. The answer to owning the friend list is more difficult, but definitely doable.

Now that Reader is gone, there is an opportunity to start taking ownership of the friend listing. Each “friend” is a URL to their blog (or Facebook profile). You subscribe to your friends feeds. When they get sick of Facebook, or add another slick social network, the feed reader can be notified in the feed and the user can add the new network. With that in mind, assuming the social network can post content to the blog, then the new content can just show up.

The fly in the ointment is that this whole system doesn’t have to be based on advertising. Losing advertising helps users a great deal because there are fewer distractions. More importantly though, applications that want to include content in a person’s feed must actually provide some value and/or enjoyment to that user. If they don’t provide enough value to cause a person to pay for the service, then they can’t survive. If you are a glass half full kind of person, you’ll quickly realize that app developers can focus on writing tools that help people express themselves rather than having to worry about scaling up some infrastructure to handle the massive social network they are creating. While it sounds fun to be tasked with “scaling”, from a business perspective, you can stop paying for servers and development time.

For this to work there will need to be some re-branding. There needs to be a change in the social dictionary that provides a name for this social “Windows 95” moment. Users also need to take ownership of their content as something of value. I suspect that sharing ad revenue from reader applications could be a great motivator here.

I’m not holding my breath that we’ll see a huge paradigm shift any time soon. At some point it has to come though. The amount of money spent on maintaining a social graph is daunting to say the least. It is unlikely that we can continue to centralize our resources into social silos. Eventually, users will see the value in an online persona they own and won’t have a problem paying for the privilege.

Fri, 05 Jul 2013 00:00:00 +0000 <![CDATA[Reading Sci-Fi]]> Reading Sci-Fi

A while back some friends and I started watching “2001: A Space Odyssey” as a bit of late-night background viewing. The last time I experienced 2001 was enjoyable, even though I had no idea what happened at the end. Yes, it was really cool and artsy, which appealed to my subversive side, but beyond that, I was clueless.

This time around, I once again got sucked in and even though we didn’t finish it, I wanted to know what happened. A quick look on Wikipedia made me realize a father of sci-fi, Arthur C. Clarke, had written both the screenplay and the book. There are criticisms that the book ruins the mystery of the movie. Naturally, it seemed like a good way to get the inside scoop, so I bought the book in my Kindle app and dove in.

Despite having an appreciation for the geekier side of life, sci-fi literature had never been a part of that experience. There isn’t a good reason for avoiding sci-fi books other than it didn’t seem very interesting to me. With 2001, that has all changed dramatically and now I can’t stop.

My path has been to visit the greats before descending into more specific and lesser known authors. Here is a list:

  1. “The Sentinel” - Arthur C. Clarke: A collection of short stories, including The Sentinel which inspired 2001.
  2. “Childhood’s End” - Arthur C. Clarke: Aliens visit earth, providing a catalyst for the next phase of evolution.
  3. “Rendezvous with Rama” - Arthur C. Clarke: A really slow space thriller exploring a seemingly dead space ship.
  4. “Starship Troopers” - Robert Heinlein: A political discussion of the true cost for citizenship.
  5. “Ender’s Game” - Orson Scott Card: There are two stories here. The first is Ender and going through battle school. The second, and probably more important for the rest of the series, is that of his siblings. Check it out!
  6. “Neuromancer” - William Gibson: A tougher read due to its invented slang, but good nonetheless. It clearly influenced The Matrix movies.
  7. “Old Man’s War” - John Scalzi: Get drafted into the unknown space army when you’re 75 years old? Loved it.
  8. “The Ghost Brigade” - John Scalzi: The continuation of Old Man’s War. Again, loved it.
  9. “The Last Colony” - John Scalzi: The last formal novel in the Old Man’s War series. Once again, really great stuff.
  10. “Speaker for the Dead” - Orson Scott Card: Revisiting the Ender Quintet. This book was awesome! This is where the side story of Ender’s Game comes front and center.

I’m reading “Xenocide” by Orson Scott Card now, the third book in the Ender Quintet. Once again, it develops the storyline of “Speaker for the Dead”. I can’t seem to put these books down.

Besides the entertainment factor, reading sci-fi has helped me become more excited about technology. Before I had read these books, news about NASA and space stations seemed like a waste of money. Privacy and publishing everything we do had started to feel like a weight pulling down my desire to be online. Seeing the potential of fanciful technology and the excitement of exploring space has helped me rediscover how technology can be cool. Seeing as I’ve been somewhat burned out this past year, these stories have been a great tool in reigniting my interest in programming.

Wed, 26 Jun 2013 00:00:00 +0000 <![CDATA[Announcing CacheControl]]> Announcing CacheControl

A while back I took the time to make the httplib2 caching libraries available in requests. I called the project HTTPCache.

Recently, there were some changes to requests that Ian submitted some patches for. I also found out there was another httpcache project! It made sense to take a minute to revisit the project to see if there were some improvements. Specifically, I wanted to see if there was a better way to integrate with requests, and httpcache provided a great example.

With that said, I introduce to you CacheControl! There are a few important differences that I wanted to point out.

The httplib2 Cache Logic as a Library

You can import a class that will accept a minimum set of requirements to handle caching. Here is a quick example of how to use it.

import requests

from cachecontrol import CacheController

controller = CacheController()

resp = requests.get('')

# See if a request has been cached
controller.cached_request(resp.request.url, resp.request.headers)

# cache our response
controller.cache_response(resp.request, resp)
This still assumes a requests response for caching, which I might end up refactoring out, but for now it seems like a reasonable API. For an in-depth example of how it is used in CacheControl’s actual adapter, take a look at the code.

Use the Requests Transport Adapter

Thanks to Lukasa for telling me about Transport Adapters. Requests implements much of its functionality via the default HTTPAdapter, which means you can subclass it in order to make more customized clients.

For example, if you had a service at that you wanted to create a custom client for, you could do something like this:

from requests import Session
from ionrock.client import IonAdapter

sess = Session()
sess.mount('', IonAdapter())

The adapter can then do things like peek at the request prior to sending it, as well as take a look at the response. This is really handy if you need to do things like include application-specific headers or implement something that is non-trivial in a general HTTP client, such as ETags.

In the case of CacheControl, it provides the ability to change what is cached before the response is constructed. The nice thing about this flexibility is that you could consider storing a more optimal version of the response information. While CacheControl doesn’t do anything special yet, now we can if the default behavior is too slow or the cache store requires a specific format.

Project Changes

I actually released a package for CacheControl and plan on keeping it up to date. In addition to a new package, I’ve moved development to GitHub, most importantly because I’ve moved most of my packages to git.

The test suite has also been revamped to use webtest rather than the custom CherryPy test server I used. You can run the tests and get up and running for development by using the and paver.

Take a look at the PyPI page or the README for help on how to use CacheControl. At this point I believe it is reasonably stable. My next steps are to provide better documentation and work on making sure the cache implementation has a reasonable performance when compared to a similar threadsafe cache I’ve used with httplib2.
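As a sketch of what a threadsafe cache store of the sort mentioned above could look like, here is a minimal lock-guarded in-memory cache; the get/set/delete interface mirrors the dict-like stores used for HTTP caching, though this class itself is invented for illustration:

```python
import threading

class ThreadSafeCache:
    """A minimal lock-guarded in-memory cache with get/set/delete."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        # returns None on a cache miss
        with self._lock:
            return self._data.get(key)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def delete(self, key):
        # deleting a missing key is a no-op
        with self._lock:
            self._data.pop(key, None)

cache = ThreadSafeCache()
cache.set("http://example.com", b"cached response")
print(cache.get("http://example.com"))  # b'cached response'
```

A single lock around a plain dict is the simplest correct approach; a real benchmark would compare it against the httplib2-style cache on concurrent reads.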

Please let me know of any comments / questions!

Tue, 25 Jun 2013 00:00:00 +0000 <![CDATA[Understanding Code]]> Understanding Code

Beautiful code is a myth. It is trivial to look at code and feel it is lacking or could be written more clearly, yet it is rare that we write code and end up with a piece of work that is well written. Seeing as software engineering is still very much in its infancy, we can’t be too critical. While we might consider the great computer scientists comparable to great authors, the reality is, these monuments of software engineering are much closer to the ancient Egyptians or Mayans. They have seen the development of language, yet they have not mastered it to the point of art.

In light of this discovery that we lack truly beautiful code, it becomes extremely important that we read and understand code. We start with the basic elements of what a snippet of code is doing and build on that understanding. Quickly we will find that the programming language is the least of our concerns. When we are forced to understand old code that has been used and abused for years, we are confronted with the true challenge of understanding.

If I only knew the secret to making extremely complex and mature code easier to understand, I would happily share it with the world. But I don’t.

What I have found is that some code is easier to read than others. There is the obvious style consistency that improves the situation. PEP 8 in the python world is a great example of a “good enough” description of style that makes reading python code much less opinionated.

Commenting is another valuable tool. The comments that make the biggest impact on me have been the ones that reflect insecurity and doubt. The reason, IMHO, is that working code reads as extremely confident. When a programmer leaves a nugget suggesting that the code offers opportunity for improvement, it provides a sense of relief after following a bug to that location in the code. It is proof that something was “good enough” at the time, but might need some changes later. It is empowering to know, when trying to understand code, that the author empathized with your confusion and confirms to the reader they are not alone.

Alongside commenting, naming things (aka one of the hard things in computer science) is another critical tool in understanding code. Where naming becomes powerful is when we cease to critique solely based on length and transition to truly descriptive terms. Changing the index key from i to index is not enough. Great authors communicate intense concepts with very few words, and we should attempt the same in how we name things. Literary authors have the thesaurus to help find words with similar meanings. It would be pretty slick to have a thesaurus for variable and routine names. Something that examined the usage, arguments and context to help provide feedback that you might be naming some concept inconsistently, or better yet, that there is a larger name that communicates the encapsulated set of concepts.

Understanding code is really hard. It is not going to get radically easier any time soon. As someone who has worked on older codebases for a while, I’m empathetic to those who read my code after me. My hope is that empathy yields practical benefits by saving those poor souls some time as they are forced to unravel my mess just as I had to unravel the mess presented to me.

Sun, 09 Jun 2013 00:00:00 +0000 <![CDATA[Own the Data]]> Own the Data

With the recent public recognition of our government’s intent to spy on us, it raises the question of what we can do right now to subvert these attempts. My understanding of why the government has any right to data such as text messages or email is that they are stored with a third party. There are certain third parties that are protected from being queried, such as communication between a lawyer and a client, but these are few and limited. The vast majority of the time the third party can be forced to turn over records. A possible solution to maintaining our privacy is to take over the storage of the data.

There are a couple of issues with owning all your own data. The first is availability. As someone who has run his own website from home, it is difficult to provide a level of availability like that of other services. Tools like Gmail provide users massive amounts of bandwidth, storage and optimizations in order to create an excellent user experience. Owning your own data also means managing your own infrastructure, which may be difficult.

Another issue is transferring the data. When you use email or make a web request, there are often many different systems listening in. Your web host most likely keeps a caching web proxy in its localized data centers in order to save bandwidth and provide faster load times to users. It is a nice feature that does save some time, but it also means that anytime you transfer data over the Internet, there is a good chance you’ve indirectly made that data available to third parties.

At the moment, there are too many modes of communication to make owning all your data practical. When you consider phone calls, text messages, emails, chat (IRC, skype, google hangout, facebook) among others, it seems exceptionally difficult to keep your data to yourself, while still making it useful. That said, from a technical perspective it is totally feasible to manage the vast majority of services. The user experience would be a nightmare currently, but the fact it is possible means there is hope we can avoid always having a third party in order to utilize technologies and networks.

I should also mention that this third party issue is not solely the blame of the government. Internet providers have long made attempts to listen in on communication in order to get a better view of their users. The move to curb piracy probably has more to do with making sure our pipelines to the web must go through a third party that does listen in. Hopefully laws can be passed that force network providers to be dumb pipes. This way, they provide access with the understanding that they aren’t responsible for the content. This is similar to utility companies in that they do not consider what you do with the electricity or water, only the amount you use.

It is pretty scary that simply conversing within a community is not safe. It is even more frightening that the reason it is not safe lies in the government’s guise of keeping us safe. While it can be scary for some, we need to recognize the government is a tool that we the people use to lower the complexity of life. A government takes care of some big details that are difficult to achieve on a local community level. Our government has gone well beyond this task and moved into the realm of a parent. As children we don’t truly own anything. Our parents can provide us with a bountiful life or a prison. The reality is we are not children, and we must enforce our ownership of our lives and information.

Sun, 09 Jun 2013 00:00:00 +0000 <![CDATA[Creativity in New Mediums]]> Creativity in New Mediums

Thingpunk is the idea that we apply the design of physical things to all mediums, in spite of our new networked digital world. With the web providing a new medium and platform for expression that is wholly outside of the physical realm, how do we allow users to be creative without forcing an understanding of the underlying technology?

There is a market for empowering users to create interesting digital content. There are a huge number of photo editing apps for your phone and computer. Instagram is an excellent example, yet it also reflects the Thingpunk aesthetic. The photos are not being edited into something new. Instead they are given treatments to make them appear from another time when physical film was a necessity.

Looking back at film photography, it was possible for anyone to experiment with the same techniques as professional photographers. The amateur could easily play with different f-stops or apertures and see the result. There isn’t the same analogy when it comes to digital content as a whole. Even a reasonably simple tool such as CSS is rarely edited by everyday users to customize or “remix” their web browsing experience. The chasm between user and creator is too great.

The technical divide is truly tragic. Imagine students creatively dissecting terabytes of data. Social networks become pointless when people can simply compose their own graph of relationships, pointing them to each person’s identity (real and created) on the web. The walled gardens for content can finally come down because the masses have the ability to stake a claim and start a business without technical gatekeepers.

As a programmer, my wish is that people would take the time to learn, but that wish is naive. My wife proves this to me every day when we send links via email. Even though the information she needs is a single search term away, I am called upon to find it, package it in an email message and send it, knowing the process will be repeated. This is a failure of our technology.

There are small nuggets of creativity that suggest a future in the machine. Infographics are data centered communication that mix visual design and textual data to communicate effectively and creatively. Currently, creating an Infographic requires analyzing some data that is typically in some technical silo such as a remote database or set of data files.

Imagine if users could create spreadsheets of any data point. Anything from text messages to receipts would be interesting to look at.

    |   | # of Texts to Mom                                                    |
    | 1 | =SUM([1 for text in PULL(mobile://alltexts) if text.contains(mom)])  |

The specific cases aren’t difficult to implement, but a system where general data exists in a meaningful way and can be accessed is extremely difficult. Yet, it is necessary that we force our computers to do the work. The mundane is where our machines excel.
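That single cell can be sketched in plain Python; the message records and field names below are invented, since a real system would have to pull them from the phone:

```python
# hypothetical text-message records a PULL(mobile://alltexts) might return
texts = [
    {"to": "mom", "body": "running late"},
    {"to": "bob", "body": "lunch?"},
    {"to": "mom", "body": "call you tonight"},
]

# the =SUM(...) cell, written as a generator expression
texts_to_mom = sum(1 for t in texts if t["to"] == "mom")
print(texts_to_mom)  # 2
```

The counting is the easy part; the hard problem the paragraph describes is making `texts` exist as accessible, meaningful data in the first place.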

The fear of “Big Brother” goes away if users are empowered and given control. We are concerned about privacy because users must rely on others to provide the technical benefits. What if that barrier didn’t exist? What if true amateurs could take control of their data and safely share details they deem fit? What if they could sell that data themselves? Most importantly, what does the digital medium look like when any user can play with the same settings as the professionals?

While we patiently wait for users to learn the power of personal computing, there are steps we can take right now. Hadoop is a great example of an ecosystem trying to provide general users with massive analytical power. There are a multitude of languages and systems that provide non-programmer types the means to query huge amounts of data. While it is still far from being generally accessible, it is a start.

Creativity and design are dependent on experimentation. Musicians “jam” to find melodies and rhythms. Painters sketch ideas before committing them to ink. Tools such as Photoshop provide features such as layers and locks in order to allow users a way of experimenting and iterating on their work. We are far from riffing on our data and publishing it to the new medium of the web, with the result being we still live in a world ruled by Thingpunk design.

Thu, 23 May 2013 00:00:00 +0000 <![CDATA[Learning Lisp]]> Learning Lisp

Programming is often an iterative process. When learning new technology it is often the case the first attempt doesn’t work out. After a little time away, a second look might get you a bit further in the process. Eventually, the technology stops being something you are learning and becomes something you are using. This has been my experience with Lisp.

My first foray was via Emacs of course! I remember the first time I set out to write a small function to help me paste some code and managed to make it work. A python script did much of the heavy lifting, but it was still a breakthrough to be able to select some text, send it to a process and write the output to a buffer. When I finally managed to refactor that code to do what the python script did, that was the first time I was using Lisp rather than strictly learning it.

My most recent Lisping came from trying out elnode. Elnode is named after node.js and provides a similar service, an async web server written in Emacs Lisp. My task was to make a small web UI for some services that I typically start to work locally. I use a package called dizzee to start services in a similar way as you’d use foreman to start services for development.

I was able to get some HTML returned to the browser and handle some requests to start up services, which felt pretty good. I also wrote some advice for dizzee to help keep a data structure with all my services. Dizzee provides macros that create new functions, so there isn’t a listing you can simply query. Where things fell apart for me was trying to stream the output of the service. I wanted to be able to tail the service to my browser, but that proved more difficult than I hoped.

After hitting a wall, yet appreciating the elegance of writing application code in lisp, it seemed like a good time to give clojure another try. My first attempt at clojure was short lived, primarily due to my lack of experience in Java. Thankfully, there is leiningen to help those without much Java experience use clojure successfully. In this case I was able to get through a tutorial for noir, a clojure web framework. My hope is that I can get some more time with clojure and noir to create a more involved web app.

What is interesting is that when I returned to Python, my primary language, it became clear that it feels very comfortable to me. I rarely need to look things up in documentation. Solving environment issues such as dependency resolution doesn’t require much thought. The process of coding in Python has become natural. My hope is that someday I can say the same of lisp.

Wed, 03 Apr 2013 00:00:00 +0000 <![CDATA[CherryFlask]]> CherryFlask

Today I noticed a core, albeit simple, application we wrote uses flask. This seemed odd since typically we would consider ourselves a CherryPy and Django shop. Since the app is so small with no actual UI, it doesn’t make sense to use Django. But, why not use CherryPy? The author is typically a Django dev, so I’d assume it seemed easier (and possibly more fun) to give another micro-framework a try.

Seeing as I believe CherryPy is more than capable enough to make life easier in these situations, I was curious if I could replicate (more or less) flask’s API using a minimal amount of CherryPy.

Without further ado, CherryFlask:

import cherrypy

class CherryFlask(object):

    def __init__(self):
        self.dispatcher = cherrypy.dispatch.RoutesDispatcher()

    def route(self, path):
        def handler(f):
            self.dispatcher.connect(path, path, f)
            return f
        return handler

    def run(self):
        conf = {
            'global': {'script_name': '/'},
            '/': {'request.dispatch': self.dispatcher}
        }
        cherrypy.quickstart(None, '/', config=conf)

app = CherryFlask()

@app.route('/')  # cool, I can use cherrypy tools as decorators!
def hello():
    return 'hello world'

if __name__ == '__main__':
    app.run()

Obviously this is just a proof of concept, but it doesn’t seem that difficult to continue to build on the idea to construct the same sort of micro-framework with a minimal amount of code. Also, it shows that CherryPy could be used to create other, more specialized, APIs.

Wed, 30 Jan 2013 00:00:00 +0000 <![CDATA[A Cultural Observation: Git, Mercurial and Publishing]]> A Cultural Observation: Git, Mercurial and Publishing

Warning! I’m not an expert git user! My goal is to describe the philosophical difference between mercurial and git as I’ve seen them.

Recently I’ve had the opportunity to spend some time in git. It is really interesting to get some first-hand experience and see exactly why people feel there is such a strong difference between git and mercurial. After enjoying my time with git and going back to mercurial, I realized that the differences, while they exist, are primarily cultural.

The essence of what differentiates the two cultures is what it means to publish code.

A DVCS defines the concept of a “push” in order to take local changes and add them to a repo. It is reasonable to assume that pushing is analogous to publishing, but on a cultural level that is not entirely correct. Publishing code in a DVCS is the point at which code goes from “development” to “mainline”.

When code is in “development”, it is considered outside the scope of the source tree. It may still be in version control, but it has not been included in the official history of the code base. When code is “mainline” it has become part of the public lineage of the source. Development is where you write drafts that end up being published in mainline for consumption.

To clarify, “mainline” does not mean “master” or “default” as it does in git and mercurial respectively. A source repo could include a wide variety of “mainline” branches that are used to take code from unstable changes to stable releases. In this case “mainline” means the set of branches that are considered part of the public workflow. Once code hits “mainline” it cannot be changed.

Mercurial culturally has taken the approach that code goes from development to mainline by way of merging. You create a branch of some sort (bookmark, queue, branch, anonymous head, etc.), do work and when it is done you merge it into some mainline branch. You may push your branch to the public repo, but if it hasn’t been merged into one of the mainline branches, then even though the branch exists publicly, it hasn’t been truly published. Culturally, it is considered OK to make your unfinished work public. It is the series of merges in the mainline branches that we strive to make sensible and keep clean.

Git culturally takes the perspective that code goes from development to mainline when it is pushed to a public repo. The reasoning is simple. Once you push the code, you can’t very well adjust the history as it could be conflicted with others. Users of git consider it good practice to shape and edit the history before pushing it. Locally a repo can be as “messy” as you want, as long as the code you push has a well written history.

These two cultures are not entirely based on technical differences, especially today. Mercurial historically has been less forgiving of rewriting history at a technical level. Mercurial tries to keep track of where code is coming from (ie what branch). If you merge a branch and push the code, the branch that was merged will be pushed as well because the merge changeset has a link back to the original branch. Git on the other hand is more than happy to consider things only in terms of patches. My understanding is there is still the same sort of linkage, but editing those links is supported as a first class function in git (via tools like rebase and the reflog). Currently, both tools offer a very similar toolset assuming one perspective on how code goes from development to mainline is understood.

The important thing to take from these two different perspectives on how code gets published is that you should pick a model and create a workflow around it. Where things get difficult and frustrating is when the publishing process is undefined. A mercurial repo can adopt a publish-when-pushed model and a git repo can use a publish-on-merge technique. If you mix models, things become more frustrating because you can’t rely on the consistency provided by adhering to a clear publishing model.

To be clear, here are a few examples where mixing models can cause problems. Imagine a CI system that builds from every commit pushed to a remote repo. Pushes of “development” branches don’t fit in that model and could break the CI build. Similarly, in a merge based model, tools that expect branches to align with features for a release would become incomplete.

No matter what model you subscribe to, it is critical to communicate what it means to publish code. Whether it is an organization or open source project, the process of taking code from development to mainline should be a defined standard that is communicated to everyone working on the code. It will make life easier for everyone involved and remove much of the frustration when working with git and/or mercurial.

Mon, 07 Jan 2013 00:00:00 +0000 <![CDATA[Git vs. Mercurial]]> Git vs. Mercurial

I’m going to start by saying that I’m not arguing git/hg is better. They are both great and do an excellent job. I’d also like to point out that I’ve never really used git, so my descriptions are based on what I’ve read of git and git users’ arguments against mercurial branches. In any case, this is not meant to be negative toward either side and is simply here to potentially put a name to a face.

A large point of contention between the primary DVCSs is the concept of branching. The primary difference actually has nothing to do with either system’s branching feature. Instead it has everything to do with how each stores the underlying tree of changes, merges, tags, etc.

Mercurial acts like a pack rat. When you commit, mercurial wants to keep track of everything.

Git just cares about the patch. Everything else is just metadata.

Mercurial’s model means that you always know everything about a changeset. If something changes, it needs a new changeset.

Git’s model means that if you want to say that branch never existed, then just remove that metadata. It disappears.

The big difference is in the cultural impact this has.

Git has always been friendly to the “commit everything and let rebase sort it out” crowd. The workflow is to commit code (commit might be the wrong word here...) and create the perfect patch to push. Over time, the mentality is that your repository is not simply a collection of changes. It is the culmination of perfectly pruned patches.

While mercurial works best with a plan, hg isn’t nearly as inept as it might seem. While you are less likely to see the perfectly pruned repo (i.e. no merges), with a bit of organization you can have a repo with obvious scopes of development. A well organized hg repo will let you see a feature start its life as a branch, potentially have some child branches for experiments, and finally find the right solution. When the branch is merged back to default you can see the entire perfectly pruned changeset while still being able to go back and see how it came to be.

The git folks think this is ugly. Why create a persistent branch when you really just need that single patch in the grand scheme of things? Keeping your history means you can analyze it and understand it. The big list of perfectly pruned changesets you see in a well managed git repo is really nice when you look at 10, 20, maybe 50 changes. When you are looking at 1k, 20k, maybe 30k changes, that single list is not nearly as helpful. In those times it would be really nice to go back to the original branch and see what happened. If you use named branches in mercurial with a good naming convention (feature/$ticket-$title works for me), that history can actually be useful in the long run.

I should be clear that there is no reason you can’t have the same model in git that you would have in mercurial. At least that is my impression. I’m positive there are things possible in git that you can’t do in mercurial. I also understand the draw of a “clean” set of changes. Personally, I think I value the pack rat mentality of mercurial, but I plan on giving git a formal try sooner rather than later. After all, the mercurial mode I use in emacs is based on a git mode!

Tue, 18 Dec 2012 00:00:00 +0000 <![CDATA[No Management]]> No Management

Steve Yegge argues that the most valuable skill a programmer can master is the ability to market or brand ideas. I found out that at github there are no managers. Leading by example is really a matter of communicating your ideas and persuading others to follow by doing. Living in a world without management means leaders are really those who communicate ideas clearly and convince people to work on the idea.

One might believe that the only way to do this is by doing, but this method is problematic. If you are the one doing the work implementing your idea, then who are you leading? You lose the benefit of the team because you are the only one on the team.

Leading by example is doing the work and clearly communicating what you are doing. I’m terrible at doing this, yet I’d like to become a leader.

In order to change, there needs to be a clear metric. There should be a clear set of questions that advertising an idea by example should strive to answer. Here is a first pass.

  1. How do I help?
  2. Why do I care?
  3. What problem does it solve?
  4. Where do I get help?

The medium for the answers needs to come by way of documentation, blog posts, wiki articles and any other mode of communication that helps to codify the idea and makes it easy for contributors to help.

Like I said, this is a first pass. My goal is to create a situation where in the absence of management, I can still become a leader. I’m sure this metric will evolve and that I’ll crash and burn a lot, but hopefully the iterations will eventually yield a greater impact in the work I do.

Wed, 05 Dec 2012 00:00:00 +0000 <![CDATA[Testing Terminology]]> Testing Terminology

There are some common terms in testing that are often misunderstood. We have unit tests, functional tests, integration tests, smoke tests, etc. I’m going to try and give each a definition in hopes of clarifying how to think about each type of test and how it fits within the scope of a test suite.

We’ll start with unit tests.

Unit Tests

A unit test typically will be a mirror of some class where each method of the class has a test method that asserts the functionality. Unit tests are great for testing classes that actually do things and have methods that perform specific actions. A unit test is pointless for a container class. Unit tests can also be really difficult for classes that define operations on a lot of other classes.
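As a sketch of that mirroring, here is a hypothetical `User` class with a stdlib `unittest` case; the names are purely illustrative, not from any particular code base:

```python
import unittest


class User(object):
    """A hypothetical "object" class with behavior worth unit testing."""

    def __init__(self, name):
        self.name = name
        self.active = False

    def activate(self):
        self.active = True

    def display_name(self):
        return self.name.title()


class UserTest(unittest.TestCase):
    """Mirrors the class: one test method per method under test."""

    def test_activate(self):
        user = User('eric')
        user.activate()
        self.assertTrue(user.active)

    def test_display_name(self):
        self.assertEqual(User('eric').display_name(), 'Eric')
```

A pure container class with no such behavior would leave these tests with nothing meaningful to assert, which is why unit testing it is pointless.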

Let me explain.

You often will have classes that serve as glue amidst your more traditional “object” classes. Your object classes are things like models or anything that maps really obviously to a real world thing. These are your “User” or “EnemyNinja” classes that are meant to provide an actual “thing” in your application.

Your glue classes are the ones that take all these objects and do stuff with them. The glue classes often are containers of functions that are used to organize complex algorithms and maintain state. They usually will have one major “function” that will incorporate many methods. These kinds of classes are not the ones you need unit tests for because they really don’t provide a “unit” in your application. The glue really is providing some “functionality”, which means they really need a different kind of test.

Functional Tests

Functional tests are often confused as tests for functions. If you use this definition, it becomes unclear why you would have “functional” tests and “unit” tests since “unit” tests really just test things like methods. And let’s face it, methods really are functions when it comes down to it.

A functional test is meant to test functionality. Functions are conceptually where you keep your algorithms. Therefore functional tests should effectively assert that your algorithm or operation is correct.

Often your functional tests are where you will be testing your glue classes that don’t fit the traditional “object” classes mentioned above. Your functional tests will confirm the interaction between objects and make sure that you are getting the right output based on some set of inputs. Functional tests should make sure your application functions as expected.

There is a difference though between the functionality of an application and how it functions when it is deployed. Functional testing is meant to assert at the algorithmic level, things are working as expected.
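A minimal sketch of a functional test, with `checkout` standing in for a hypothetical glue function that combines smaller units, and the numbers chosen purely for illustration:

```python
import unittest


def apply_discount(total, percent):
    """A small "unit" that the functional test treats as reliable."""
    return round(total * (1 - percent / 100.0), 2)


def checkout(prices, discount_percent=0):
    """Glue code: the "functionality" under test, combining units
    into an operation with a known input/output contract."""
    total = sum(prices)
    return apply_discount(total, discount_percent)


class CheckoutFunctionalTest(unittest.TestCase):
    """Asserts the algorithm end to end: known inputs, expected output."""

    def test_checkout_with_discount(self):
        self.assertEqual(checkout([10.0, 20.0], discount_percent=10), 27.0)

    def test_checkout_without_discount(self):
        self.assertEqual(checkout([10.0, 20.0]), 30.0)
```

The test doesn’t care how `apply_discount` works internally; it asserts that the composed operation produces the right output for a set of inputs.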

Integration Tests

Applications never run within a vacuum. Any software must run on actual hardware and, in the vast majority of cases, software must interact with other software. Integration tests are meant to test these interactions between different pieces of software and hardware.

An integration test should verify that when an application interacts with some other system outside of itself, it is doing the right thing. The “right thing” is going to depend on what the application is interacting with. An integration test asserts the APIs are used correctly.

This definition doesn’t mean that integration tests need to fully test any API the program interacts with. For example, if you read a file, you don’t need to test that your language is properly creating a file handle, reading the bytes and converting them to the right string.

An integration test will assert that the parts of the API that are used work correctly. An integration test will also assert that when the output is not as expected, your application has a specified means of handling the problem.

Just to be clear, when I say API, this includes internal application APIs. Most applications define tons of APIs that are meant to be followed in the future. Whether they are a specific base class or some IPC protocol used between worker processes, integration tests should assert that things work as expected and that things fail as expected.
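To sketch the “works as expected, fails as expected” pair, here is a hypothetical `fetch_token` function tested with the stdlib `unittest.mock` (the external HTTP client is a stand-in, so only our use of its API is asserted):

```python
import json
import unittest
from unittest import mock


def fetch_token(http_get, url):
    """Application code that talks to an external API. ``http_get`` is a
    hypothetical stand-in for whatever HTTP client the app actually uses;
    it returns a (status, body) tuple."""
    status, body = http_get(url)
    if status != 200:
        # The "fail as expected" path the integration test must cover.
        raise RuntimeError('token service unavailable')
    return json.loads(body)['user']['token']


class TokenIntegrationTest(unittest.TestCase):

    def test_uses_api_correctly(self):
        http_get = mock.Mock(return_value=(200, '{"user": {"token": "abc"}}'))
        self.assertEqual(fetch_token(http_get, 'http://ids/token'), 'abc')
        http_get.assert_called_once_with('http://ids/token')

    def test_fails_as_expected(self):
        http_get = mock.Mock(return_value=(503, ''))
        self.assertRaises(RuntimeError, fetch_token,
                          http_get, 'http://ids/token')
```

One test asserts the part of the API we use works correctly, the other asserts the application has a specified way of handling a bad response.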

Smoke Tests

Smoke tests take the concepts of an integration test to the next level. A smoke test is meant to make sure that actual users experience the application as intended.

Smoke tests are difficult to maintain and perform because oftentimes they are really boring. This makes it hard to find people to take the time to run the tests and report findings. It can also be difficult to communicate when things break.

Difficulties aside, smoke tests provide a sanity check. Smoke tests will allow you to be confident you didn’t introduce new issues that prevent users from doing whatever it is they need to do with your software.

How About Code Coverage

Code coverage is meant to see if you have tests that execute all the code in the code base. Looking at a higher level, unit, functional, integration and smoke tests are all tools to help make sure your code is fully tested without sacrificing efficiency or accuracy. Each type of test layers on top of the others in order to provide a stable base.

It is sort of like a pyramid.

The bottom of the pyramid is the guarantees of your platform. You have an operating system, hardware and a programming language that acts to provide you reliable answers. When you add 1 + 1 in your language, you will always get 2. When your user types a Q on the keyboard, you will get a Q in your application.

The next layer of the pyramid is your essential functions and objects. These are small “units” of work that can be accomplished atomically using the stable layer of your platform. Unit tests cover these functions and make sure that everything else that depends on these units of work receives the same stability the platform offers.

The next layer is the algorithmic layer. You combine your objects and functions into more complex algorithms and operations. These operations can assume they are dealing only with simple, seemingly atomic units. Your functional tests make sure this level of code works as expected.

The next layer is your interaction layer. This is where you consider what happens when you have threads and processes using the same resources. The code here talks to services and maintains state in such a way that your algorithms can function atomically and reliably. Your integration tests cover this class of code.

Finally, you have your completed application where you see it not as individual pieces of code, but a single tool. Smoke tests consider the software as a whole and reflect this completed perspective.

These layers work together to provide guarantees and stability. The result is that the code usage becomes “covered”.

Testing Distribution

Up until this point I’ve tried to give each type of testing a somewhat equal standing in the grand scheme of things. The truth is that each application is very different. If your application does nothing but crunch numbers, then your integration tests will most likely be very minimal since you do not need to interact with different systems. If your application acts as a middleman between many different systems, you might have very few unit tests and no smoke tests because your users are other systems that do most of the actual work.

Instead of looking at testing as something that you do to code you’ve written, testing as a whole should assert a quality design. The layering of tests should align with the layering of your application functionality. Like a pyramid, the goal is to create stable layers of functionality. If it is easy to organize your tests and cover all your code with your unit, functional and integration tests, then you can also be confident you’ve designed a quality piece of software. Your smoke tests can then confirm that the internal quality of the code is reflected to the users.

Tue, 06 Nov 2012 00:00:00 +0000 <![CDATA[Going Static]]> Going Static

I’ve had a blog for a really long time although it has never been an ultra important part of my life. Most of the time my blog is simply a vessel for playing around with different programming languages or technology. Once again I’m giving a new blog technology a go.

I got the itch to change a little while back when I realized that I’d been running the same version of my blog software for a really long time. The design was boring and I had a huge archive of old blog posts that were not worth keeping around. I had no real interest in putting together a better design, but I did want to reduce the amount of memory I was using on my VPS host. Running any extra services was next to impossible and it would be nice to have an IRC bouncer to help keep track of the backlog on channels at work.

I backed up my necessary files, reinstalled the OS to start from scratch and migrated all my blog posts from my Wordpress site to reStructuredText. I wrote a simple script to format the text files and make sure the site generation worked as expected. I’m using nginx to serve the files and ZNC for my IRC bouncer. So far so good!

I had wanted to use o-blog. It is a blogging system similar to Pelican that uses org-mode and Emacs to generate a static site. It was a bit more work to get running and required that my entire blog be in one file. This wasn’t necessarily a deal breaker, but seeing as I didn’t have a desire to go crazy customizing some blog software, I just stuck with Pelican as I had given it a try a little while back. At some point I’d like to try and write some essays or more in-depth documentation where I could probably use org-mode’s publishing feature.

There is still a bit of work to do making sure old feed links work, but for now I’m pretty happy with how things are working. It didn’t take forever to set up and it should be easy to archive and start fresh should another tool strike my fancy.

Mon, 15 Oct 2012 00:00:00 +0000 <![CDATA[Forcing a Refresh]]> Forcing a Refresh

Web applications are interesting because it is exceptionally challenging to maintain state. The client and the server act independently and there is very little a web application developer can do to reliably keep the client and server in perfect sync. Yet, there are times where you need to sync your client with the server.

When a server API changes it means your client code needs to change. When the update to the server involves multiple processes, there is even more chance for the client and server to get out of sync. One way to make sure our client and server processes are in sync is to force the client to reload its resources. This involves downloading the new versions of static resources such as JavaScript, CSS and images.

In order to force our client to refresh, we need a couple pieces in place. First we need to be sure our client communicates what version it is running. Second we need our client to understand when the response from the server is indicating we need to refresh. There are other pieces that can be developed such as a UI explaining to the user what is happening, but to start this is all we need.

To start we should make sure our client has access to some version number. The version number is defined by the server and can be evaluated however you want. An easy way to add the version is via a hidden element in the HTML. If you wanted to do something clever you could limit a refresh to when there is a major version bump. For simplicity’s sake, I’d recommend using the application version and only doing simple equality comparisons when deciding whether to return a refresh error.

Whatever version number is in the HTML needs to be sent to the server on each request. How you send that data is totally up to you. I’ve used a ‘__version__’ named value submitted as a form value. You could have a __version__ key in a JSON document you post, make it part of the URL, use a header or use a cookie value. Whatever it is needs to be understood by the server API.

Once you have a version number being sent by your client, the server then needs to test whether or not the versions match before returning a response. If they don’t match, then you should send an error message. For example, you could return a 400 status along with a JSON message that says to reload the page. It is important that you return an error to the client rather than simply return a redirect because the client needs to refresh the page and make sure to avoid cached resources on the refresh. When the client gets the error message, the JavaScript can call ‘window.location.reload(true)’ in order to reload the page, avoiding the cache.
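As a framework-agnostic sketch of the server-side check; the `__version__` key, the `SERVER_VERSION` constant and the shape of the error payload are all illustrative choices, not a fixed convention:

```python
import json

SERVER_VERSION = '1.4.2'  # illustrative; use your application's real version


def check_client_version(request_data):
    """Return None when the client is current, otherwise a (status, body)
    error response telling the client to do a hard reload."""
    if request_data.get('__version__') == SERVER_VERSION:
        return None
    body = json.dumps({'error': 'stale_client',
                       # The client-side JS reacts to this by calling
                       # window.location.reload(true)
                       'action': 'reload',
                       'server_version': SERVER_VERSION})
    return (400, body)
```

A request handler would call this before doing any real work and return the error response whenever it is non-None.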

It should be noted that this doesn’t avoid using things like timestamps in the paths to static resources. Making the URL paths reflect the age of the file is still very helpful as it makes sure that subsequent reloads will reference completely different resources that are not cached. The system I’ve described is focused on reloading the initial page vs. performing some operation to clear the cache or determine what resources need to be reloaded. By keeping these concerns separate, we can keep the implementation simple.

I don’t think any of this is extremely difficult, but I think it is a critical part of any distributed web app. When you release new software to a cluster of applications, you want to slowly roll it out. This methodology ensures that as a roll out occurs, your client code can be upgraded at the same time as well as client code that has become stale.

Mon, 08 Oct 2012 00:00:00 +0000 <![CDATA[Better CherryPy Tools]]> Better CherryPy Tools

CherryPy allows creating extensions to the request level functionality by means of Tools. There are a standard set of tools that provide extra functionality such as gzipping responses, HTTP auth, and sessions. You can also create your own tools to provide specialized functionality as needed for your application.

From a design perspective, Tools are exceptionally powerful. Like WSGI middleware, they provide a way to layer features on top of request application handlers without adding extra complexity. The only problem is that writing your own custom tools can be a little tricky, depending on what you want to accomplish.

What makes them a little tricky is that they can be used in many different scenarios. You can attach them to hook points in the request cycle and they can be used as decorators. A Tool also requires that you attach a callable to a specific hook point, which is convenient when only one hook point is necessary for some functionality. When you need a tool to attach to more than one hook point, it requires a somewhat awkward extra method that can be a little confusing. None of this is impossible to understand, but it can be less than obvious when you are first working on writing Tools.

In order to make this process simpler, I wrote an alternative Tool base class that makes writing tools a little easier. Let’s start with a simple tool that logs when a request starts and ends.

import cherrypy

class BeforeAndAfterTool(cherrypy.Tool):

    def __init__(self):
        # Attach the initial handler. This is also the method that
        # would be used with a decorator.
        super(BeforeAndAfterTool, self).__init__('before_handler',
                                                 self.log_start)

    def _setup(self):
        # CherryPy will call this function when the tool is turned
        # "on" in the config.
        super(BeforeAndAfterTool, self)._setup()

        # The cherrypy.Tool._setup method compiles the configuration
        # and adds it to the hooks.attach call so it gets passed to the
        # attached function.
        conf = self._merged_args()
        cherrypy.request.hooks.attach('before_finalize',
                                      self.log_end, **conf)

    def log_start(self, **conf):
        cherrypy.log('before_handler called')

    def log_end(self, **conf):
        cherrypy.log('before_finalize called')

cherrypy.tools.before_and_after = BeforeAndAfterTool()

If you’ve never written a tool the above is a decent example to explain the process of attaching more than one hook. It should also make it clear why this API is not exactly obvious. Why do you attach the ‘before_handler’ in the __init__ call? There is really no reason. The Tool API has to accept an initial callable in order to allow using it as a decorator. Similarly, we could have used the log_start method to attach our ‘before_finalize’ and avoid using the _setup at all. What methodology is correct? There isn’t a right or wrong way.

It would be nice if we had a slightly more straightforward way of creating tools that was clearer in how they worked. This was my goal in writing the SimpleTool base class. SimpleTool provides a very simple wrapper around the above pattern to make creating a tool a bit more straightforward.

Here is the code for the actual base class.

import cherrypy
from cherrypy import Tool
from cherrypy._cprequest import hookpoints

class SimpleTool(Tool):

    def __init__(self, point=None, callable=None):
        self._point = point
        if callable is not None:
            # Avoid shadowing a callable set as a class attribute.
            self.callable = callable
        self._name = None
        self._priority = 50

    def _setup(self):
        conf = self._merged_args()
        hooks = cherrypy.request.hooks
        for hookpoint in hookpoints:
            if hasattr(self, hookpoint):
                func = getattr(self, hookpoint)
                p = getattr(func, 'priority', self._priority)
                hooks.attach(hookpoint, func, priority=p, **conf)

Here is the example tool above, using this new base class.

class BeforeAndAfterTool(SimpleTool):

    def before_handler(self, **conf):
        cherrypy.log('before_handler called')
    callable = before_handler

    def before_finalize(self, **conf):
        cherrypy.log('before_finalize called')

How does that look? Is it a little more obvious to see what is going on? I think so.

There are still some features we haven’t covered yet. When hook points are attached, they are ordered according to a priority. There are two ways to set the priority. The first is by setting a priority value on the method itself.

class HighPriorityTool(SimpleTool):

    def before_handler(self, **conf):
        cherrypy.log('this is high priority!')
    before_handler.priority = 10

The second way to set the priority is via the _priority attribute. This will be the default priority for any hook functions. Here is an example using the _priority attribute.

class PriorityTwentyTool(SimpleTool):

    _priority = 20

    def on_start_resource(self, **conf):
        cherrypy.log('on_start_resource called')

The last aspect of tools we haven’t covered yet is how to use these tools as decorators. As I mentioned earlier, the initial callable passed to the Tool is used for the decorator functionality. Using the SimpleTool base class it is simply a matter of setting the callable attribute.

CherryPy is pretty much voodoo free, so implementing the default tool behavior, where the initial callable is applied when the tool is used as a decorator, is straightforward.

class DefaultTool(SimpleTool):
    def on_start_resource(self, **conf):
        cherrypy.log('starting a resource')
    callable = on_start_resource

Pretty simple right?

If you use CherryPy and give this base class a try, let me know how it works out. Likewise, if you think the API could be improved, I’d love to hear any suggestions.

As an aside, if you are curious what Tools bring to the table over WSGI middleware, there is an important distinction. WSGI middleware ends up nesting function calls, whereas tools are called directly. The result is that if you utilize a lot of tools, the cumulative overhead is much smaller than with WSGI middleware. Most of the time this doesn’t make a huge difference, but it is good to know that if you use Tools in your application design, you are insured against the tools eventually becoming a bottleneck. The other benefit of tools is that they are much simpler to write and can be applied consistently via the CherryPy framework (i.e. via the config) rather than a simple decorator. These are not huge gains, but as complexity grows over time, Tools are a great way to keep the code simple.
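The call-shape difference can be pictured with a toy sketch (plain Python, not CherryPy or WSGI internals):

```python
# WSGI middleware wraps the application, so every layer adds a nested
# function call to each request; tools are simply iterated over and
# called directly, with no growing call depth.
def wsgi_stack(app, middlewares):
    for mw in middlewares:
        app = mw(app)  # each layer wraps the one below it
    return app

def run_hooks(hooks, request):
    for hook in hooks:  # flat iteration: one call per hook
        hook(request)
```

With ten middlewares, every request travels through ten extra stack frames; with ten hooks, the request loop just makes ten direct calls.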

Sun, 16 Sep 2012 00:00:00 +0000 <![CDATA[Silly Lawsuits]]> Silly Lawsuits

I just skimmed an article arguing the Apple vs. Samsung case doesn’t really matter. It is a trend I’ve noticed on Slashdot as of late where articles increasingly touch on things like patents and lawsuits rather than actual technology.

These copyright and patent cases have little actual meaning, yet they are really dangerous. These companies are using the courts and patents to hurt competitors. It is nothing more than an expensive tactic that we pay for as a society. We should be angry that companies are using our courts as tools in their quest to gain marketshare and fight competitors. If that was the intent behind providing things like patents, then we need to completely rethink the way we protect the ideas of others.

The root of the issue comes down to the ability to compete. A patent can be a valuable asset when you are talking about an inventor trying to build a business. The inventor can use the patent as a means of protecting the idea against those that can immediately copy and bring a product to market. When we are talking about companies like Apple and Samsung, both have no problems whatsoever taking an idea from patent to product reasonably quickly. Do they really need patent protection when they have the means to beat competitors fairly? Should we allow these companies to sue simply because the cases look the same?

The situation is similar to privacy for celebrities. The courts have ruled that when a person is considered a celebrity, you can use images taken in public without that person’s consent. I’m not entirely sure of the logic of the court, but it makes some sense in my mind. Famous people could create an entire industry based on suing news corporations for unsanctioned photos. Similarly, if a celebrity doesn’t want a photo to be taken, they have the means to prevent it from happening. Most of these patent lawsuits are not about protecting ideas. They are celebrity companies using the patent system as a business strategy.

Sun, 19 Aug 2012 00:00:00 +0000 <![CDATA[Decoupling Data Storage and The UI]]> Decoupling Data Storage and The UI

One thing that has always frustrated me about MongoDB is that it is very difficult to assert your data has been written. Eventual consistency is a tricky thing. You can’t always be positive the next read you make includes all the data. How do you get around this limitation?

One technique is to use a smart client. Rather than store everything in the database and rely completely on it as your primary source of data, your client loads the state and maintains its own state internally. This does not work when the client needs to consider a rich hierarchy of records, but for things like a user session’s settings, it is pretty reasonable.

For example, say you have a list of records you are displaying in your browser. Your initial request loads the records and displays them all. When you change the name of some record, the user interface is updated directly from the submitted data, while it is persisted to the database in the background. If the person managed to open another browser and reach the same UI, the record might not yet be updated, but that is the tradeoff you make for eventual consistency.
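A toy model of the pattern described above, with db_write standing in for the actual MongoDB write (the names here are illustrative, not real client code):

```python
import queue
import threading

class RecordStore(object):
    """Keep authoritative state in the client; persist in the background."""

    def __init__(self, db_write):
        self.state = {}                  # what the UI reads, always current
        self._writes = queue.Queue()
        self._db_write = db_write
        worker = threading.Thread(target=self._writer)
        worker.daemon = True
        worker.start()

    def update(self, record_id, fields):
        # Update the UI state immediately...
        self.state.setdefault(record_id, {}).update(fields)
        # ...and persist eventually, in the background.
        self._writes.put((record_id, fields))

    def _writer(self):
        while True:
            record_id, fields = self._writes.get()
            self._db_write(record_id, fields)
            self._writes.task_done()

    def flush(self):
        # Block until all queued writes have been persisted.
        self._writes.join()
```

The tradeoff is exactly the one mentioned: a second browser reading straight from the database may briefly see stale data until the queue drains.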

This technique works relatively well as long as the client is able to take the state and directly apply it to the user interface. If the new data impacts a wide variety of data or requires information that must be coordinated / evaluated centrally, the result is the client becomes much more complicated. For example, imagine if TurboTax had to do all the calculations for the tax code in the browser. It ceases to be a valid tactic.

Just because you need to visit the server for some evaluation of your data, it doesn’t mean you need to store the data in your database prior to evaluating the data. Most data driven applications use some concept of a model in order to translate the data from the data store format to your application format. There is no reason your application cannot submit data that gets translated to the same model format and is evaluated for a result.

I’ll use an example from my job to explain how to implement this kind of design.

As I mentioned numerous times, we do opinion polling. This involves writing questionnaires in a specialized DSL. The DSL supports things like logic blocks and flow control so authors can ask different questions and process data based on the answers. The flow control and logic are too complex to consider implementing them in the browser. It would mean introducing client issues into a core element of our processing, which is an unnecessary addition of complexity. The problem is that we need to consider this complex logic when finding the next question (or questions) we want to ask.

Currently we require that for each submitted answer, we have to write it to the database before returning our next question to the UI. This requirement is because the user might contact another node in the cluster. Therefore, we need to be sure when there is a node change, the new node pulls its state from our database.

What I propose then is that we decouple the data submission from the evaluation for the next page. When a question is answered we submit our data to a data queue that will process it and store it in the database. Rather than waiting on an answer, we send our newly updated state and current location in the interview to an evaluation service that will take the state and find the next question we need to ask.
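A rough sketch of the proposed flow, with answer_queue and evaluate as stand-ins for the real write queue and evaluation service:

```python
import queue

# Answers go to a write queue for eventual persistence while the next
# question is computed from in-memory state, so the response never
# waits on the database. All names here are illustrative.
answer_queue = queue.Queue()

def handle_answer(state, location, answer, evaluate):
    answer_queue.put((location, answer))   # persist eventually
    state[location] = answer               # update in-memory state now
    # Ask the evaluation service for the next question using the fresh
    # state rather than a possibly stale database read.
    return evaluate(state, location)
```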

It is possible that the user could change clients and switch nodes in the window before the data is actually written. The worst case scenario in this situation is that they might be asked the same question twice. It is reasonable to require that the data be written and made available within a few seconds, which would virtually eliminate this scenario.

One thing I haven’t touched on is whether or not the data is validated before writing or when looking for the next question. I would presume the former is a better tack as the validation should be relatively quick and we reduce the chance of bad data being written. I should also mention that in our scenario, our data, once written, is effectively static. This means that when requesting the data on a node change, there is a minimum requirement that it be available in a format suitable for interviewing. This might not be the same format we use for analysis. The analyzed format could require more time to process, but that time need not be considered for maintaining state during interviews.

Has anyone tried this methodology before? Any pitfalls or obvious problems I’m missing?

Tue, 14 Aug 2012 00:00:00 +0000 <![CDATA[Announcing AutoRebuild]]> Announcing AutoRebuild

At work we’ve started using sass for our CSS. The developer actually working on the code asked if we had a good way to rebuild the CSS when the sass file changed. I said no, but it would be really easy to write a CherryPy plugin for it.

AutoRebuild is that plugin.

AutoRebuild will let you configure a set of glob patterns and watch the files for changes. If they change, it will trigger a function to be called. It is up to you to write the function, but the README provides an example of how you configure a function to call make.
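The watching logic can be sketched roughly like this (an illustration of the idea, not AutoRebuild’s actual code):

```python
import glob
import os

def scan(patterns):
    """Record the mtime of every file matching the glob patterns."""
    mtimes = {}
    for pattern in patterns:
        for path in glob.glob(pattern):
            mtimes[path] = os.path.getmtime(path)
    return mtimes

def check_for_changes(patterns, previous, on_change):
    """Rescan and fire the callback if anything changed since last scan."""
    current = scan(patterns)
    if current != previous:
        on_change()
    return current
```

A plugin would run check_for_changes on a timer, with on_change being whatever rebuild function you configure (e.g. something that shells out to make).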

This plugin is really simple, but that is really the point. CherryPy makes it really easy to add integration points so working with other services and tools is easy and automated.

Mon, 13 Aug 2012 00:00:00 +0000 <![CDATA[Libertarian Programming]]> Libertarian Programming

If you haven’t already read it, Steve Yegge wrote a post providing an axis of alignment for programmers. The basic idea is that programmers tend to fall on a spectrum between liberal and conservative based on their aversion to risk.

For the most part I get it and can agree to an extent. My one beef is that the perspective on risk seems rather one sided. The question of whether something will break for users is only part of the story. The boring nature of “conservative” developers might appear to be solely out of fear for the user’s well being, but in my experience, the aversion to risk actually involves everyone involved with the code.

People, while having some pretty strong differences, still have survival instincts. If you are getting burned, your natural instincts kick in and you move extremely quickly. In other scientific fields, we’ve been burned plenty of times and as such, our industries reflect good instincts. Companies have had to shut their doors due to mistakes and carelessness, which in turn “burned” the entire industry in such a way as to cause an instinctual reaction.

Why is it in programming we haven’t seen a similar trend? Sure, we’ve seen source control and the Joel Test become relatively standard, but past that, what have we learned as an industry that is clearly the best practice for everyone involved in coding?

I have a theory why we have been somewhat slow developing our survival instinct.

If you follow sites like Slashdot, Hacker News and Reddit, you’ll see articles come up every once in a while that deal with age and programming. A quick search for “programming age” quickly returns some examples questioning if age is a limiting factor for a programmer. Generally, programming is considered to be something for the young, even if we don’t say so outright.

The result of this subtle ageism is that we often think new concepts can only come from young minds who have yet to be jaded by the old ways of thinking. The liberal programmers who don’t acquiesce to the conservative curmudgeon down the hall are the ones who will truly revolutionize the industry. As a culture, the new technology is what is most important, with the youth having the loudest voice.

Steve suggests that the longer you are a programmer and experience the pains of failure, the more risk averse, and hence conservative, you become. I’d argue that the real risk you are averse to involves spreading incomprehensible code and wasting the time of those that will come after you.

The developers I respect most have a knack for taking old code that makes no sense and turning it into clear and concise code with tests, tooling and documentation. Even though they are not making huge strides in new user functionality or creating new models for dealing with big data, they are exponentially saving the precious time of every other developer that works on the code. The potentially boring decisions and technology they used manages to hide complexity so effectively, it never seems to rear its head again.

If I were to put a political category on this line of thinking, I’d say it was Libertarian. New technology is great, as long as it doesn’t affect the liberty of other developers. Every new feature and bug fix should be tested for validity as well as reviewed in light of whether it adversely affects the ability of other developers to work on the code.

If I were to describe myself, I’m a libertarian programmer. My desire is to write code that helps the users while staying out of the way of others I work with. I don’t always do a great job of this, but I’ll claim the naivete of youth and mention that I’m listening to my elders in order to overcome my lack of understanding.

Passionate programmers are going to disagree. It can be helpful to put terms on the different perspectives because, as Steve says, it provides an easy way to agree to disagree. By the same token, our industry makes a lot of mistakes and assumes that our flexibility and disagreements are simply clashes of opinion. The fact is our metric is often too focused on the result without considering the means. The time we save our users can be exponentially increased by saving our developers time.

The next time you find it difficult to read some code or set of tests it might be worthwhile to consider what impact your code has on others. The freedom you feel when writing new code is because you have not been encumbered by the code of others. There is no reason that when you are working on code others have written, the same freedom is not possible. As a libertarian coder, my goal is to write code that protects liberty of my fellow programmer.

Sun, 12 Aug 2012 00:00:00 +0000 <![CDATA[Announcing MgoQuery]]> Announcing MgoQuery

If you remember not long ago I wrote a post about caution writing DSLs. Despite my own advice and reluctance towards DSLs, I wrote one for creating MongoDB queries.

The big question should be why. After all, MongoDB just uses simple dictionaries for queries. What could be easier than that? Surprisingly, a lot.

Where MongoDB queries get complicated is when you are combining AND and OR operations along side range operations that may use non-standard types such as dates or timestamps. It is for this reason I wrote MgoQuery (which stands for Mongo Query if that wasn’t obvious).

Here is a really simple example:

from pprint import pprint
from mgoquery import Parser, Query

p = Parser()
result = p.parse('"x:1,y:2"|"x:3,y:4"')
pprint(result)

# prints something like
# {'$or': [{'$and': [{'x': '1'}, {'y': '2'}]},
#          {'$and': [{'x': '3'}, {'y': '4'}]}]}

As you can see, we get a pretty concise language for expressing a rather complex query. You can also pass in a conversion function that will help to convert the values to their proper type in the query.
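MgoQuery’s actual conversion hook may look different, but the idea amounts to walking the generated query and converting each leaf value:

```python
def convert_values(node, convert):
    """Recursively apply convert(key, value) to every leaf of a
    Mongo-style query. Illustrative only; MgoQuery's real conversion
    hook may differ."""
    if isinstance(node, dict):
        return dict((k, convert_values(v, convert)
                        if isinstance(v, (dict, list)) else convert(k, v))
                    for k, v in node.items())
    if isinstance(node, list):
        return [convert_values(item, convert) for item in node]
    return node
```

For example, passing a converter that turns strings into ints would rewrite every '1' in the query above into 1 before the query is sent to MongoDB.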

Feel free to take a look at my first stab at this. I’m using PyParsing to create the grammar and parser. I don’t have a ton of experience with it, but my initial impressions are very good. Please feel free to send along comments or recommendations.

Thu, 09 Aug 2012 00:00:00 +0000 <![CDATA[Be Careful Designing DSLs]]> Be Careful Designing DSLs

During my brief stint as a Ruby developer, the flavor of the week seemed to be writing Domain Specific Languages, or DSLs. The DSL trend was exceptionally disturbing to me as I came from the world of XML, which is effectively a world of DSLs with angle brackets.

FWIW, I never avoided XML. I still think XSLT is a great, yet highly misunderstood, language. The problem with XML was the users. Most users of XML (myself included) don’t realize the technical debt required for an XML format. You start using XML and creating your own formats without understanding that those formats are exceptionally important for data interop. What’s more, many times you don’t even need your own format since something like JSON or even an RPC call would be better.

The thing that draws people into the XML rabbit hole is the parser. You don’t have to go very far from your core language to parse and process XML data. That is a really big deal! Writing parsers and designing languages are hard. But just because it is easy to parse, doesn’t make it a good language.

Recently I started writing a simple query language for MongoDB. The idea is to provide a concise language to be used in a URL query string. I’ve made an effort to write down my ideas on what could work. My goal is not to construct an entirely new language here but instead to build on the shoulders of giants by drawing from things like the GMail advanced query format and Xapian.

What I’m finding is that writing a good DSL is really hard! The parsing is half the trouble because it is easy to feel like you designed a good language only to realize your simple parser is going to become really complicated. Likewise, if you do choose to emphasize easy parsing, there is a good chance your language could suffer. Not to mention the subjective nature of what makes a language “good” or “bad”. I had a short epiphany that using a LISP-like dialect would make parsing really easy, but knew in the back of my mind its syntax would be jarring to most users.

The point is that writing a parser is not terribly difficult once you understand the principles. Coming up with a small DSL can be relatively easy as well. What is difficult is finding the right balance where your language and parser are both maintainable for the foreseeable future. Finding a balance of theoretical quality among practical constraints is a huge challenge and not one to be taken lightly. Hopefully, my current attempt will find some of that balance and I won’t regret the result.

Mon, 06 Aug 2012 00:00:00 +0000 <![CDATA[Great Music Requires Hard Work]]> Great Music Requires Hard Work

The thing that has been sticking out to me with our shows with Helmet and The Toadies and our show with Jane’s Addiction is how much work goes into playing music. With each new level of success, there is also a new level of performance you strive for. There are different reasons this happens.

Live music can be a profitable endeavor, but there are limits to what you can charge. If your fan base is not really growing and you want to make more money, your only choice is higher ticket prices. If you want to charge more you’re going to have to justify those costs to your fans, which in turn means upping the ante at your shows. This could be with lights, more musicians/sounds, etc. Likewise, you could try for a more intimate setting or some sort of theme. In any case, you can’t just go up and do the same thing you’ve always done and expect fans to pay more money.

If your fan base is growing, then you can stick to what you’re doing a bit longer as it is obviously working. You may not need to inject a new light show into the mix or add extra props. That said, you also will be expected to play longer. You can’t typically headline a 1500 cap room with a 30 minute set. This usually is going to mean more gear and dealing with the logistics of having to play a 75 minute set. There is more merch to sell and settle with at the end of the night and more gear to move in and out. Sound checks are earlier and tours are closer to a month, rather than 2 weeks. A bus seems like a luxury, but the reality is that you can’t make a 3pm sound check after a 9 hour drive without driving all night.

All of these things are practical reasons bands have to raise the amount of work that goes into shows. I think any band would tell you the real reason they use a bus or need a quiet green room with healthy food is so they can put on a good show for the fans. We’ve always been extremely appreciative of the people that come out to shows. They pay their hard earned money in hopes of experiencing a band’s music in a powerful and interesting way. I’ve been at shows where I had a lump in my throat because it was such an amazing experience. These shows have been huge productions like Roger Waters’ The Wall as well as small club shows as a kid. As a band grows in stature and gets more fans, there is a desire to say thank you by putting on a better show. It is more work and costs money, but when you experience thousands of people screaming for your music, especially when they paid to see you, putting on the best show possible is really a privilege.

Sun, 05 Aug 2012 00:00:00 +0000 <![CDATA[A Base Class for CherryPy]]> A Base Class for CherryPy

I’m a fan of CherryPy. It is a great framework that hits a sweet spot in terms of features and flexibility. My biggest struggle in learning CP, though, was always with dispatchers.

When I first started learning CP, my perspective was to use something like Routes. The problem is the routes dispatcher provided by CP doesn’t support other helpful aspects such as Tools. This disconnect was always frustrating until I understood how the handlers work. In CP, a handler is passed a set of arguments and keyword arguments that are the path segments and form values respectively. Understanding this aspect makes it much clearer how to use the framework effectively in addition to using Python (as opposed to framework specific libraries) for writing apps.
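That convention can be illustrated with plain Python, where invoke is a stand-in for CherryPy’s dispatcher rather than its real code:

```python
# Not CherryPy itself, just an illustration of its calling convention:
# path segments arrive as positional arguments and query string or
# form values arrive as keyword arguments.
def invoke(handler, path_segments, form):
    return handler(*path_segments, **form)

def report(year, month, fmt='html'):
    return 'year=%s month=%s fmt=%s' % (year, month, fmt)

# A request for /report/2012/07?fmt=json becomes
# invoke(report, ['2012', '07'], {'fmt': 'json'})
```

Once you see handlers this way, they are just Python functions, and writing apps stops feeling framework specific.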

Fast forward to today and things are even better. CP allows you to define a _cp_dispatch method that can be used to pick another handler. Here is an example:

import cherrypy

class Bar(object):

    def __init__(self, name): = name

    @cherrypy.expose
    def index(self, baz=None):
        return 'I am %s. baz == %s' % (, baz)

class Foo(object):

    @cherrypy.expose
    def index(self):
        return 'I am foo'

    def _cp_dispatch(self, vpath):
        if vpath:
            bar = Bar(vpath.pop(0))
            return bar

In this example, the Foo class is able to see that it gets an extra path segment and returns the Bar object instead. This allows you to intelligently dispatch without having to depend on attribute names like the traditional dispatcher.

Seeing as this is extremely simple, lets improve this pattern with a simple base class. I’d also like to point out that I copied this from a local project that was written by Tabo and Fumanchu.

import cherrypy

class BaseHandler(object):

    exposed = True

    def _cp_dispatch(self, vpath):
        try:
            dispatcher = self.dispatch
        except AttributeError:
            # No dispatch method defined, so stop dispatching here.
            return None
        try:
            return dispatcher(vpath)
        except cherrypy.HTTPRedirect:
            # Pass redirections along as usual.
            raise

And here is an example using the Base class.

class MyData(BaseHandler):

    def __init__(self, id): = id

    @cherrypy.tools.json_out()
    def GET(self):
        return db.get(

class MyAPI(BaseHandler):

    def dispatch(self, vpath):
        if vpath:
            id = vpath.pop(0)
            return MyData(id)
        return False

    @cherrypy.tools.json_out()
    def GET(self):
        return db.listing()

In this example suppose we had our app mounted at ‘/api’ and were using the MethodDispatcher. When a GET request was made to ‘/api’, we return some listing of IDs. A GET can be made to ‘/api/1234’ in order to get ID 1234 and we dispatch to a new instance of MyData which handles the GET request providing the specific object’s data.

You can also see how I’m using the json_out tool to return JSON to the client.

Hopefully this example is helpful in understanding the simplicity and power of CP’s dispatching model. You can use the above base class to implement any logic you might need, such as transforming the segment to the proper type, all without having to pollute your actual handler method or contend with a specialized syntax. Obviously this is not a perfect solution for every use case, but CP is more than happy to allow you to configure different dispatching paradigms at different paths.

Mon, 30 Jul 2012 00:00:00 +0000 <![CDATA[Continuous Integration Technique]]> Continuous Integration Technique

I’m going to start off by saying I’ve never set up a CI system, so my perspective might be completely naive. With that out of the way…

I took a quick look at BuildBot. I didn’t really have a reason outside of seeing a link on a blog while watching some logs. At work we use Jenkins and since it is the only tool I’ve used, it seems to work just fine. Yet, I always have a nagging feeling we’re missing an important aspect of CI.

In order to use Jenkins, we’ve created scripts that do things like build an environment and get things ready for testing. The part that doesn’t sit well with me is that the process of installing and running the tests in the environment is completely different from our production system. Sure, it uses Python and a virtualenv, but that is about it.

This disconnect between CI and a production deployment isn’t terrible, but it seems to suggest other problems. Why can’t you use the same deployment pattern to set up a test environment as your production environments? What if your production is using easy_install and your CI is using pip? Are you definitely using the same repositories? What about libraries that require building modules in C? Are details like environment variables and library paths going to be the same in the production environment as in CI?

It should be a goal to have a single starting point for running tests or creating a build and running an application in production. By providing a single, standard process for preparing an environment, you eliminate any questions regarding things like correct library versions, installation paths, dependency resolution, configuration questions, etc. This goal is not always easy, but it will make your CI server a better test of whether your code is really ready for deployment to production.

Sun, 29 Jul 2012 00:00:00 +0000 <![CDATA[Blogging to the Bear]]> Blogging to the Bear

I’ve been working on debugging an issue on and off for a few weeks and at this point, I’m stumped. I’m hoping writing about it might help dislodge whatever it is in my brain that is preventing me from new hypotheses.

The symptom is the database data getting out of whack. We keep a document in MongoDB that gets updated after each request. Our system has questions whose answers are required, but for some reason, these questions are not being answered. There are two reasons I can think of.

The first is the user submits an answer and in the process of responding to the request, things get confused and we skip the questions we were supposed to ask. The second is that we ask the question but we don’t record the answer.

Essentially my theory just wraps our entire request/response process. The problem either happens when the data comes in or when it goes out. Not very specific.

What has been done so far is that we confirmed when we do save the data, we do so atomically. It is not a transaction where an error would be rolled back, but we can be relatively confident that there is not a race condition trying to write to the database.

Thinking about it more now, here is my theory. The problem is in the page we send out. There could be a thread safety issue that would allow our “next page” algorithm to be incorrect. Our state is kept in a cursor object that I suspect might have some thread safety issues.

I don’t know that I’ve really come up with anything new here, but in writing this I do think I’ve helped to clarify my theory. The big question is how to reproduce the issue since it doesn’t seem to happen very often, which suggests there is a synchronization issue at play.

Thu, 19 Jul 2012 00:00:00 +0000 <![CDATA[Platform Focused Tools]]> Platform Focused Tools

The more I learn as a programmer, the more I recognize why Linux is the way it is. There are a set of tools that devops considers for deployments and system administration. Sometimes, as developers, we avoid learning things. It is more fun to write code and consider how this code we write might very well be the tool that finally solves some problem. The irony is that the tool that finally solves the problem is probably already written and you’d rather code than read how to use it!

There is a really good chance that tool already exists within the *nix ecosystem. You need a deployment and packaging system, try RPM or APT. You want to build things, Make (while having warts) is up to the task once you understand it. You need to be sure your daemons are up and running and can be controlled correctly, check out Upstart or init. They may not be perfect, but they are well tested. Also, these tools were written with *nix in mind as opposed to your application or programming language. While that means they are not optimized for you, they are optimized for the operating system, the platform you deploy to.

This realization came about recently because at work we use daemontools to manage our deployment tool. The deployment tool has two main components it uses on each node. There is a launching piece that has a heartbeat of sorts that checks it has instances of the applications running. If something isn’t running, it will try to start it. When it starts it, it actually starts a wrapper process that will accept signals from our main server. If you wanted to terminate some application process, you tell the main server, it tells the wrapper process to term, and the launch tool realizes it is gone and starts a new instance.

The problem with this system is that daemontools knows nothing of this hierarchy. Daemontools assumes you’re working with processes that are meant to be run as daemons (get it, “daemon tools”). Our system cuts off some of daemontools’ responsibility to restart processes. The result being that if our hierarchy of processes fails for some reason (broken network connection for example), daemontools is unaware and we have to log into the node to get things back in shape. If we kill a node via the master and it dies, but our launching tool doesn’t recognize it is gone, then we need to restart the launch tool. But that orphans our other processes and we have to kill them manually before starting back up our daemontools managed process.

Even if you didn’t follow any of that, the moral of the story is to consider the *nix tools. If you have a long running process that should start with the OS, use the init or upstart system. That is what it is made for and has been doing a good job for a long time. Does it suck to write init.d scripts? Yes. Will you need to consider how your app handles signals and the like? You betcha. Once you have that in place, will you have problems restarting your app due to your tools? Nope!

I’m still relatively new to a lot of these sorts of tools, so I’m sure I’m painting a prettier picture than reality. That said, the reality is that we deploy our apps on *nix platforms which have a specific model in place for managing processes and the file system. Tools like init, Make, RPM, etc. were all designed with these realities in mind. While it may seem like a lot of work to understand and use these tools, I want to remind you it is even more work to create your own and maintain them.

I’m also not trying to suggest that you shouldn’t use supervisord or create your own deployment strategy based on virtualenv and pip. Many people have found success creating a Python-specific release strategy. The point is that while it may seem obvious and easier to consider your language- and application-specific tools, a platform-focused deployment strategy may be a better tack in the long run.

Thu, 12 Jul 2012 00:00:00 +0000 <![CDATA[Emacs, Email and Mu]]> Emacs, Email and Mu

I’ve once again ventured into the world of using Emacs as my email client. The problem with email is that it is a messaging and event platform. You have actual messages sent to you by real people, and then there is the host of services that use your inbox as a user interface. When email gets out of control and you are a programmer with a powerful editor like Emacs, it makes you consider a world where you never have to leave your editor and can use all the slick text-editing tricks you know to manage the massive amount of email.

In the past I was a reasonably happy Wanderlust user. I’ve given Gnus plenty of tries but never felt comfortable using it. Tools like VM, RMail and Mew have never made the cut, primarily because installation is painful. This time around I’m giving something new and different a try: mu4e.

Mu4e is built on the email indexing tool Mu. I’ve used Mu in the past for searching through archives when Wanderlust wasn’t really up to the task. At the time I was using Linux for my desktop and it came with a simple GUI that made it really easy to find what I was looking for. When I found out about mu4e and that it was being worked on by emacs-fu, it seemed like it could be what I was looking for.

Mu4e is based on searching your mail. Mu is a simple Xapian index of your mail messages. It’s really fast and handles tons of email quickly. It uses Maildir directories for reading mail, which means it is well suited for use with Offlineimap. The mu4e package allows you to call a command prior to reindexing your maildir. Mu4e will also refresh the index every so often to keep your messages up to date.

For sending messages I opted to use msmtp, which was easier to configure with multiple servers. After moving my work email over, it seemed like it’d be a good idea to have my personal email in Emacs as well.

Here is what my config looks like. I’m still tweaking things, but overall, I’m pretty happy and things feel rather natural.

(require 'mu4e)
(setq mu4e-debug t)
(setq mu4e-mu-binary "/usr/local/bin/mu")
(setq mu4e-maildir "~/Mail") ;; top-level Maildir
(setq mu4e-sent-folder "/Sent") ;; where do i keep sent mail?
(setq mu4e-drafts-folder "/Drafts") ;; where do i keep half-written mail?
(setq mu4e-trash-folder "/Deleted") ;; where do i move deleted mail?
(setq mu4e-get-mail-command "offlineimap")
(setq mu4e-update-interval 900) ;; update every X seconds
(setq mu4e-html2text-command "w3m -dump -T text/html")
(setq mu4e-view-prefer-html t)
(setq mu4e-maildir-shortcuts
         '(("/YouGov/INBOX"     . ?i)
           ("/GMail/INBOX"   . ?g)
           ("/YouGov/archive" . ?a)
           ("/YouGov/error emails" . ?e)))

(setq mu4e-bookmarks
      '( ("flag:unread AND NOT flag:trashed" "Unread messages"      ?u)
         ("date:today..now"                  "Today's messages"     ?t)
         ("date:7d..now"                     "Last 7 days"          ?w)
         ("mime:image/*"                     "Messages with images" ?i)
         ("\"maildir:/YouGov/error emails\" subject:paix" "PAIX Errors" ?p)
         ("\"maildir:/YouGov/error emails\" subject:ldc" "LDC Errors" ?l)))

(setq w3m-command "/usr/local/bin/w3m")

;; sending mail
(setq message-send-mail-function 'message-send-mail-with-sendmail
      sendmail-program "/usr/local/bin/msmtp"
      user-full-name "Eric Larson")

;; Choose account label to feed msmtp -a option based on From header
;; in Message buffer; This function must be added to
;; message-send-mail-hook for on-the-fly change of From address before
;; sending message since message-send-mail-hook is processed right
;; before sending message.
(defun choose-msmtp-account ()
  (if (message-mail-p)
      (save-excursion
        (let* ((from (save-restriction
                       (message-narrow-to-headers)
                       (message-fetch-field "from")))
               ;; the address regexps were elided from the original post
               (account
                (cond
                 ((string-match "" from) "yougov")
                 ((string-match "" from) "gmail")
                 ((string-match "" from) "gmail"))))
          (setq message-sendmail-extra-arguments (list '"-a" account))))))
(setq message-sendmail-envelope-from 'header)
(add-hook 'message-send-mail-hook 'choose-msmtp-account)

What I like most is the speed at which I can load up messages and find what I’m looking for. Offlineimap has been a bit flaky at times, but I’m hoping I can continue to tweak my settings there to be sure I’m getting my mail in a timely fashion. If you don’t use Emacs but have a lot of mail you want to analyze programmatically, I encourage you to take a look at mu.

Sun, 08 Jul 2012 00:00:00 +0000 <![CDATA[Emacs and Doing One Thing Well]]> Emacs and Doing One Thing Well

I read a bit of this thread about how Emacs fits in with the Unix idea of small programs that do one thing well. Two responses struck me as particularly poignant.

The first was that Emacs does one thing well, evaluate elisp. When you really think about what Emacs does, this is right on the money. Elisp is just a specialized lisp dialect that contains a ton of tools for working with and editing text.

The other comment effectively pointed out that Emacs is for working with code. This is another important distinction to make. People usually point out things like the shell and pipes when they talk about what makes Unix and small programs doing one thing well work. In the case of Emacs, it allows a developer to easily participate in the craft of programming. If you need to talk on a mailing list about some feature, trade code snippets, chat on IRC and commit code into your version control system, you can do it all within Emacs.

Both of these answers helped to clarify how a general-purpose text editor with a mode for everything under the sun still adheres to the Unix Philosophy. And even if the old-school *nix hackers didn’t think Emacs fit in, it doesn’t surprise me that it was the editor of choice for many authors of *nix operating systems.

I don’t know about you, but can you imagine needing to copy and paste code into an email? I suspect it could look something like this:

head -34 some_source.c | tail -15 &> message && send message


Emacs, on the other hand, spawned a lot of modes that help programmers participate in programming and work with code. Even though that “thing” is rather large and complex, I’d argue that Emacs does a pretty good job of doing it well.

Thu, 05 Jul 2012 00:00:00 +0000 <![CDATA[My Own Python File Locking Snippet]]> My Own Python File Locking Snippet

Having already written about the plethora of file locking libraries and snippets, it only seemed fair to further muddy the waters with my own half-baked example! Check it out, maybe give it a try, and then walk, don’t run, to better tested implementations, but only after letting me know why what I’m doing is wrong. Thanks for reading and for any feedback provided!

Thu, 28 Jun 2012 00:00:00 +0000 <![CDATA[File Locking in Python]]> File Locking in Python

I’ve been working with a super simple threadsafe file based caching mechanism. The httplib2 library has a nice and simple file based cache, but it is not threadsafe. I tried to create a version that uses a single writer thread to write to the cache files, but that ended up being rather complex. I then started trying to write my own simple lock file rather than use the fcntl module. This also fell over in tests.

I started reconsidering the fcntl module when I found out there is an equivalent that seems to support the same flags.

I have tests passing and some really simple code that seems to work, but I suspect I’m missing something. Jason wrote yg.lockfile, which uses zc.lockfile. The difference is that zc.lockfile does the dirty work of finding the lock file mechanism supported by the system (i.e. fcntl or msvcrt). It also uses a separate lock file where the locking flags get set. There is also this lock file implementation mentioned on Stack Overflow. It shouldn’t be surprising that there is another file locking library that might have some useful tips within its design.

I’m not going to suggest that I have the answers for perfect file locking. That said, after looking through the different examples, there are a few basic use cases you need to be aware of.

  1. Lock a lock file, not the file you want to work with
  2. Use something to identify the process that originally locked the lock file
  3. Locking is platform specific, so consider your use case

The first two notes deal specifically with state getting corrupted. You need to be sure, if you are locking files, that you handle the case where the process exits unexpectedly. If this happens and your lock is still around, what do you do to take back that lock? Using a separate file with the pid in the contents is a good way to do that without locking the actual file you are working with.
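To make that concrete, here is a rough Python sketch of the separate-lock-file-with-pid idea. The LockFile class and its API are my own invention for illustration, not from any of the libraries mentioned above:

```python
import os


class LockFile:
    """Lock a separate lock file, storing our pid so a stale lock
    left behind by a dead process can be detected and reclaimed."""

    def __init__(self, path):
        self.path = path

    def acquire(self):
        if os.path.exists(self.path):
            pid = int(open(self.path).read())
            if self._alive(pid):
                raise RuntimeError('locked by pid %d' % pid)
            # The owner is gone; reclaim the stale lock.
            os.remove(self.path)
        # O_EXCL makes creation fail if another process beats us here.
        fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)

    def _alive(self, pid):
        try:
            os.kill(pid, 0)  # signal 0 only checks the process exists
        except OSError:
            return False
        return True

    def release(self):
        os.remove(self.path)
```

Note the check-then-create race is still there, which is exactly why the better tested implementations are worth running, not walking, to.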

The last point regarding platform independence is primarily this: if you don’t need to deal with a specific platform, then don’t. Most of the time you are locking a file because you are dealing with a concurrency condition. Adding code that does a bunch of platform checks can’t help make the file locking check as atomic as possible. It may not be detrimental, but you might as well master one platform rather than handle everything available, and save yourself the trouble of understanding the low-level filesystem flags. There are a lot of articles out there on using fcntl, all of which say it can be tricky. Avoid the headache of also learning the Windows-specific bits if you don’t have to.

It is frustrating that this has been this difficult. I honestly expected there to be a file locking library or function in the standard library. That said, it is definitely possible that I’m simply missing how different the use cases are. It might be best that people take different strategies as a common locking mechanism similar to the basic description above would not be beneficial to most. I also wouldn’t be surprised if this is a pretty consistent pain point when the issue comes up as it almost never sits amidst simple single threaded/process code.

Thu, 28 Jun 2012 00:00:00 +0000 <![CDATA[Debugging .emacs in OS X]]> Debugging .emacs in OS X

It is rare as an Emacs user that I’ll close my editor. Every package I want to try out can be made available without ever having to restart Emacs. While it is extremely helpful, it can also throw you for a loop when you do need to restart Emacs.

I recently started using Melpa and upgraded some packages. Unfortunately my MacBook Pro has a bug where the graphics driver seems to go south and freeze my machine, forcing a hard restart. Upon restarting, Emacs wasn’t happy loading my .emacs. Since this is OS X and Emacs is an actual “app”, it wasn’t obvious how to use the “--debug-init” flag when opening it.

This ended up being really easy. I opened a terminal and navigated inside the .app file. If you’ve never played around with this before, all the applications are really just directories with a known file structure that allows OS X to keep each app’s resources separate. It works kind of like an RPM or deb package that never actually gets expanded on the file system. At least that is how I look at it. Our goal is to find the executable in order to run it with “--debug-init”. On my system I changed to:

cd /Applications/

There is an “Emacs” executable there, and you can run it to debug your init file:

./Emacs --debug-init ~/.emacs

I usually open my .emacs file when I do this, since you’ll most likely be changing something there anyway. There might be other ways to do this that are more Mac-like, so feel free to leave examples in the comments.

Mon, 25 Jun 2012 00:00:00 +0000 <![CDATA[Announcing (the poorly named) HTTPCache]]> Announcing (the poorly named) HTTPCache

I’ve been a fan of httplib2 for a long time. It does caching right and has support for authentication via HTTP Basic and Digest. The problem with httplib2 is that it is not threadsafe. There are some data structures used within the code that make it unsafe for reuse across threads.

One thing you can do to improve the situation is use a different caching mechanism, but you still need a thread-specific instance in order to avoid issues using it with threads.

The new hotness in HTTP client libs is requests. The API is reminiscent of httplib2 and has support for many of the same authentication schemes. More importantly though, it is threadsafe and even provides a session construct that helps to make using requests similar to using any other data resource connection. These are all great features, but the one thing it lacks is HTTP caching support.

Enter HTTPCache.

HTTPCache is a really simple wrapper that ports httplib2’s caching for use with requests. Here is a simple example:

import requests
from httpcache import CacheControl

sess = CacheControl(requests.session())
resp = sess.get('')

There is still some work to do, but if you were interested in the caching of httplib2 with everything else in requests, take a look.

Thu, 21 Jun 2012 00:00:00 +0000 <![CDATA[Touring and Merch]]> Touring and Merch

People often feel artists can make a living by hitting the road and selling merch. There is obviously some truth to this claim, but generally it is a lot harder than it sounds. Making money on merch is difficult because it is a lot of work with very slim margins. The most obvious thing to sell on the road is music. It is good to have both vinyl and CDs available whenever possible. Most clubs do not try to take a percentage of “plastic”, which is a good thing. Unfortunately, unless you released the music yourself, you’ll need to keep really good records of what is sold for your record label and SoundScan. If you sell a CD for $10, you are probably giving the label half of that. You’ve also got to lug it in and out of the van every night. None of this is exceptionally hard, but it isn’t as simple as just showing up. You are running a mobile store that gets set up and torn down every night.

T-shirts are another obvious merch item to bring along. Shirts are a little easier in that you usually paid for them up front, so unless you want to keep really accurate records, the bookkeeping is a little more lax. That said, a shirt printed on a name-brand blank with a few colors can be pretty expensive. We’ve seen people spending anywhere from $5 to $15 per shirt depending on the design and make (i.e. American Apparel vs. Gildan vs. Anvil, etc.). T-shirts are also a little harder to sell simply because not everyone wears band t-shirts. The bill of the show also makes a difference. The crowd might be older and into heavy music, in which case you might need more larger sizes on black. If the other bands draw a younger audience, more color and smaller sizes could be better. In either case, though, you never really know.

The hardest thing about shirts is keeping them organized. If you have 2 designs with 5 sizes (Ladies, Small, Medium, Large, XL), that is 10 different “buckets” or “stacks” you need to maintain. If they are all on black shirts, keeping things organized can be a pain in the neck, and you might not find the size someone wants and lose a sale. We’ve come up with a simple system of keeping shirts in plastic bags according to size. We usually have different colored shirts for different designs as well. We’ve been able to keep our shirts reasonably organized, so we don’t have to spend all night hunting through a big sack of t-shirts for the right size.

Shirts also take up a lot of space. A big duffle bag is helpful because it lets you stuff it into non-standard spaces in the van. We are able to keep our merch with our gear for this very reason.

Selling merch on the road is a necessary part of being in a band, but it is not trivial. When you consider you are setting up amps/drums, maintaining instruments, and setting up a small store every night, the money you make from merch is hard fought. The more organized you make things the better, as it helps to make things faster to sell, easier to pack up and simpler to move. You’re not as likely to get to the next gig and realize you left all your larges back at the last venue.

It is not all blood, sweat and tears in the merch booth, though. The merch booth is where you meet fans and they have a chance to express what they thought of the show. You sign autographs and pose for photos, and most importantly you have a chance to see your music connect with someone. Even though it is hard work, in a way it is one of the more intimate ways to get to know your fans.

Thu, 14 Jun 2012 00:00:00 +0000 <![CDATA[The New Boss]]> The New Boss

I finally got around to reading “Meet The New Boss, Worse Than The Old Boss?”. The one response I did read concluded with a prime example of the tech industry mindset the author feels is foul. Seeing as I have a lot of respect for the author, it was frustrating to see such a lack of empathy.

I don’t believe that vocal artists who are frustrated with the music industry are whining. Most articles or presentations are done by respected members of the music industry. These are people who have found some success by creating something people actually wanted to listen to. Because they have found some success, it seems reasonable that they recognize the risk in the music industry as well as the difficulty in making good music.

When someone is frustrated by the music industry, it is important to see if they really are a part of the industry or just play in a band that no one really listens to. Of course, by this logic you could probably disqualify me. Seeing as we have managed to find some success, I hope you read on, but if not, I more than understand.

The New Boss article aims its sights directly at the tech industry. In my opinion it is accurate in its portrayal. As a programmer who made an effort to make a music start-up work, I can say without a shadow of a doubt that the tech start-up community feels that innovation is more important than copyright. Musicians and labels should feel privileged that iTunes is willing to host an album’s worth of material and provide a paragraph for a short description.

It might be rude of me to say this, but actions speak louder than words. I can get a web host for $10/month that has enough space to store every album I ever make as mp3s. I can add a Wordpress blog, a link to download the track after paying, and Disqus comments, and get effectively what I get with iTunes. The difference, of course, is that if I sell 10k songs, I still pay $10 for hosting (maybe more for bandwidth... maybe), whereas on iTunes you pay $3k. That sort of pricing is pretty foul if you ask me, when you have a band touring, creating content on social networks and paying for press.

There is still hope though. The key is to be smart.

At some point my hope is that the market corrects itself. Artists are getting smarter about their contracts and demanding they own their copyrights. Tools like Kickstarter offer some options for skipping middlemen. Labels are finding better tools to deter casual piracy. Record labels can treat their successful artists as huge bargaining chips in order to bring in more revenue. Technology like Cash Music could be used to implement a distributed store where artists sell music directly through a centralized user interface. We need to work together.

I’m not going to argue that bands should be paid huge advances and the major label system was better. But, I do think the tech industry has convinced the world that music should be free. Changing this is going to involve adjusting the market.

There is a reason it took a long time before The Beatles and Pink Floyd showed up on iTunes with higher prices than other records. The copyright holders felt the music had more value and negotiated a better deal as seen by a higher price on the store. This theme is a good thing. Hopefully indie labels will join a coalition to act collectively in order to secure higher revenue. Majors can also demand that they function on an artist by artist basis. If it becomes prohibitively expensive for tech companies to maintain the gateway to online music, they will stop and others will come in their place with tools that can empower those making music.

Wed, 06 Jun 2012 00:00:00 +0000 <![CDATA[Coupling and Cohesion in an Organization]]> Coupling and Cohesion in an Organization

At work we have a team that is trying to make our deployments better.

They are taking cues from the 12 Factor App model. Generally, I believe this model provides a great start for designing your applications in terms of deployment. The one caveat is that you may be missing out on some cohesion in your organization by assuming it is coupling.

The 12 Factor App was written by the fine folks at Heroku. Heroku is an application hosting platform that supports a variety of runtimes.

This is important because it means that Heroku’s use case is more generalized than most organizations. In order to support different runtimes consistently, there must be a lowest common denominator that works for all runtimes. Having this kind of generic requirement is not a negative, but it suggests that you are avoiding coupling that may not be necessary in an organization.

A good example of this is logging. The 12 Factor App defines that all logging should be a stream of events that are logged to stdout.

Obviously, if you are supporting multiple runtimes, demanding a specific logging platform is suboptimal. The language may not support the logging platform, or some libraries might be dependent on other platform-specific details. For example, supporting syslog could be done via the C library or by simply calling the local syslog command in a process. Requiring an application to log to a specific tool is an example of negative coupling for Heroku, but that doesn’t have to be the case in an organization.

A company that has control over its own deployment platform can very easily support a more specific logging system such as syslog. Using tools like puppet or chef, each host can meet any requirements necessary for the specific runtimes. Similarly, you are not tasked with inventing a system of log management as tools like syslog already have that functionality built in. More generally, you can separate management of logs from application development because you have a powerful tool you can depend on to be present. In one sense you are coupling yourself to a logging system, but in another sense, you really are finding a cohesive tool that makes things simpler across an organization.
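As a hedged sketch of what that looks like in practice: in Python, for example, the standard library can hand log records straight to syslog, so log rotation, forwarding and retention become the platform’s problem rather than the application’s. The logger name and address below are illustrative:

```python
import logging
import logging.handlers


def configure_syslog_logger(name, address=('localhost', 514)):
    """Attach a SysLogHandler so the platform's syslog does the log
    management instead of the application inventing its own."""
    logger = logging.getLogger(name)
    handler = logging.handlers.SysLogHandler(address=address)
    handler.setFormatter(
        logging.Formatter('%(name)s: %(levelname)s %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

The same effect is available in most runtimes, which is why a company that controls its own hosts can standardize on it where Heroku could not.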

Oftentimes as developers we try to make solutions that are generic in order to meet needs now and in the future. Preparing for the future is a good idea, but it is always important to recognize when a generic solution is adding unnecessary complexity that doesn’t help the organization meet its needs. Coupling and cohesion are abstract concepts that change based on context. When you can set up a simple, seemingly highly coupled system, and it works with no problems for the next 5 years, then I’d argue that coupling was helpful in creating a cohesive solution.

Fri, 01 Jun 2012 00:00:00 +0000 <![CDATA[End the Fed]]> End the Fed

NOTE: For anyone who reads my blog, this is the first political post I’ve ever written. Politics is not a theme I’d like to add to my published persona as it is rare that casual politics does anything but start arguments. For this reason, I hope that anyone reading this does so with an open mind and understands my goal is to reflect my own discoveries and hopefully, others find motivation to look deeper on their own. In other words if you want to call people names or be divisive, then you’ve already missed the point. Thanks for reading!

I just finished End the Fed by Ron Paul. Now I know some people think Ron Paul is a total kook. I had my doubts when I first started looking into his message, but over and over again, I found myself feeling his statements as logical, practical and realistic. One thing that has been a theme in his campaign and his fans is the desire to end the Federal Reserve. Not really understanding this aspect of his campaign, I decided to read his book.

It is funny because when I first learned about the Fed in school, it didn’t make sense why they said it “created” money by adjusting interest rates. In my mind, it didn’t truly “create” anything except when it printed money. Ron Paul argues that very truth. No matter what the theory says, you can’t create something from nothing, and the perceived benefit of more money in the economy is really just semantic. The most interesting perspective regarding the Fed is the moral argument. People have strong opinions regarding funding of government programs, yet they don’t mind that the Fed makes their wages worth less and less each year. Ron Paul argues this is immoral and acts as an evil against society. Honestly, I believe him. If you raise a wage to keep up with the rise of inflation, you are really just paying someone the same thing. There are many wage earners that never get these yearly raises, and the result is that every time the Fed “creates” more money, their wage is reduced.

If you think Ron Paul is crazy, I’d challenge you to read End the Fed. I can’t say it is the most concise book I’ve read or that it fully explains the economic beliefs of Ron Paul, but it does reflect the principles he believes would help remove a vast amount of corruption in the government. Even though he may not be a perfect candidate, his perspective on war, freedom and the role of government is worth investigating, as his message avoids party affiliation in favor of acting on principles. They say money is the root of all evil, which means ending the Fed is in fact helping to remove evil from our government.

Thu, 31 May 2012 00:00:00 +0000 <![CDATA[Better Mocking]]> Better Mocking

Recently I’ve been updating a lot of tests. I’ve been switching out Dingus for Mock as well as updating tests that were slow or unclear. It has been a good experience because it has forced me to look at old tests and mocking techniques and compare them directly with newer code that utilized mocks more effectively. I’ve come up with a couple best practices to help avoid some mistakes I made initially.

First off, avoid mocking. The ideal situation is that you’ve written code in such a way that you never need to mock anything because the code has been designed to be isolated and modular. As soon as you start to deal with I/O this becomes extremely difficult, but as a general rule of thumb, try running the code rather than mocking.

Secondly, be careful what you mock. One reason to mock is to help isolate code under test. If you find yourself mocking a ton of objects and asserting complex sets of methods were called, then that is a red flag that your code could be refactored. If you think of mocking as a means of providing a barrier between code under test and the code that you know works, it becomes slightly clearer when to mock. Your goal should be to mock the point at which the code under test is accessing code that you assume is working correctly. For example, when you write tests, you don’t test things supported by Python. Your assumption is that things like opening files or sockets work correctly. When you are mocking, the same distinction should be made and that is the point where a mock will be most helpful.

Finally, mock only one level deep. This is specifically relevant to databases, which usually provide some namespacing. MongoDB, for example, provides a database, which contains collections, which contain the actual documents. The API then provides access via similar nesting.

When mocking this type of connection, don’t try to mock the connection.

Instead, mock the point where the query is actually made. Generally, this makes the mock much simpler, and it also bases the assertion solely on the query rather than on the setup necessary to connect to the database and find the namespace you will be using.
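Here is a small sketch of the difference, using Python’s mock library with a pymongo-style nested API; the find_user function and its query are made-up examples:

```python
from unittest import mock  # the standalone mock library of 2012 works the same way


def find_user(db, name):
    # db.users.find_one(...) is the point where the query actually happens
    return db.users.find_one({'name': name})


def test_find_user():
    # Mock one level deep: stub the query method, not the
    # connection/database/collection lookup chain.
    db = mock.Mock()
    db.users.find_one.return_value = {'name': 'eric'}
    assert find_user(db, 'eric') == {'name': 'eric'}
    # The assertion is about the query itself, nothing else.
    db.users.find_one.assert_called_once_with({'name': 'eric'})
```

The test never mentions how the connection was made, which is the whole point.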

Hopefully these tips help avoid some mistakes that I made when I first started trying to use mocks. Using these simple guidelines has made using mocks much easier. It has also helped to improve and simplify the code, which is partially the point of writing tests. Good luck and happy mocking!

Sun, 13 May 2012 00:00:00 +0000 <![CDATA[Circus and Process Management]]> Circus and Process Management

I’ve been looking at Circus a bit as a process or service management tool for testing. Often when you are testing a web application, there is a suite of services required in order to test different levels of the stack. One argument is that you should mock most of these interactions, which I simply don’t agree with. It makes sense to mock and assert how you are calling your objects that do I/O, but it is dangerous to assume these sorts of tests truly verify your code is working. Only when you actually touch the code can you be somewhat sure that you are on track.

In the project I work on, we’ve started creating three different test types. There are unit tests that do utilize mocking techniques and aim for a very explicit exercising of an object’s methods. So far these have been focused on our models, as they typically do I/O and have very focused methods. The second type of tests are functional tests. These are tests that assert functionality is working by using a large portion of the stack. We still mock some aspects in order to control the output, but generally the idea is to use stubs and let each service handle the requests. The last set of tests are integration tests. These include things like Selenium tests or tests that are known to be long running because they exercise threading or synchronization issues.

In the last two scenarios, it is important to be able to start up services for the tests on demand. Circus is of interest to me because in addition to a command line interface, it also provides a library for orchestrating services. There is one thing that I’d personally like to see in a tool like this that I believe others would appreciate as well.

In the example Circus config there are a couple delay keys listed that I presume are there to help processes wait until some resource is available. A good example would be waiting for a database to be available prior to starting a service that depends on that database. There are two obvious problems with any sort of time delay when running tests. The first is that you could be waiting longer than you have to. If you start up a lot of services and have to wait a second or two for each, then you are almost certain to have a slow test suite. The second is that things will break inconsistently if the delay is not long enough. You end up working with a lowest common denominator, where the person with the slowest machine defines the delay time for everyone, including those with powerful machines or the CI server, which most likely has a good amount of processing power.

In our test suite, we’ve solved this by creating a service manager that instantiates service classes. These classes have some knowledge of how to determine that a service or resource is really up. For example, we use MongoDB. Our service class for MongoDB will start up the process and block until a connection can be made. We also have some applications that provide a status URL that is not available until the application is truly ready.
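A rough sketch of that service manager idea looks like this; the class names and the MongoDB specifics are illustrative, not our actual code:

```python
import socket
import subprocess
import time


class Service:
    """Start a process, then block until the subclass's is_ready
    check says the service is truly up. Subclasses provide
    command() and is_ready()."""

    check_interval = 0.1
    timeout = 30

    def start(self):
        self.proc = subprocess.Popen(self.command())
        deadline = time.time() + self.timeout
        while time.time() < deadline:
            if self.is_ready():
                return
            time.sleep(self.check_interval)
        raise RuntimeError('%s never became ready' % type(self).__name__)

    def stop(self):
        self.proc.terminate()
        self.proc.wait()


class MongoService(Service):
    def command(self):
        return ['mongod', '--port', '27017']

    def is_ready(self):
        # Ready means we can actually open a connection, not that the
        # process exists or that some fixed delay has elapsed.
        try:
            socket.create_connection(('localhost', 27017), timeout=1).close()
            return True
        except OSError:
            return False
```

The polling loop wastes at most check_interval, so fast machines never wait for a slow machine’s worst case.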

In a system like Circus, it is difficult to do this because you don’t want to require each application or service to implement a specific API in order to use it with Circus. With that in mind, I’d propose that Circus allow using a process to define the wait time before proclaiming the process is available. The config might look something like this:

cmd = python
args = -u $WID
numprocesses = 5
wait_for_cmd = python
wait_for_args = -p $PORT

Circus would wait for this “wait_for” process to finish before reporting the actual process as available.

I haven’t tried Circus just yet, but I plan to take a look and see if this idea could work for it. When you orchestrate services as generic processes, it is very difficult to maintain a featureful API: the concept of a process covers such a wide breadth of implementations that supporting anything beyond the typical start/stop/restart is close to impossible. Adding this sort of deterministic hook could allow for more advanced features, such as dependencies and priorities, that would make it possible to reliably start up a suite of services in a way that is always consistent and has no unnecessary waiting.

Fri, 20 Apr 2012 00:00:00 +0000 <![CDATA[Code Editing Ideas and Emacs]]> Code Editing Ideas and Emacs

I saw a Kickstarter campaign for Light Table and it got me thinking. First, I started thinking about how the code I work on is organized.

The code is Python, which means the files must adhere to the importing rules defined by the language. It also means there are some known metadata files we use to help our software play nice in a packaged environment. When I compare our Python code with something like Java or C# in terms of files, there is a pretty stark difference. Java (and to a lesser extent C#, if I remember correctly) requires a rather strict association of one file per class. Python, at the other end of the spectrum, requires nothing more than a single file in the most extreme case.

Thinking about this in terms of actually working on code, it makes me wonder why you see such a stark difference that seems to be paralleled in each community. Java and C# enjoy powerful IDEs that maintain in-depth discovery of the code and allow extreme refactoring using nothing more than simple dialogs. Python (and other languages like Ruby, Perl and, to a lesser extent, PHP) is typically coded in an all-purpose editor. Navigating the set of modules is often done by searching, be it with everyone’s close friends grep and find or within the editor itself. Refactoring is a function of search and replace rather than identifying actual references.

Going back to the code I work on, there are files that have 30+ classes and helper functions where others are closer to Java or C# with one class. This inconsistency (or flexibility) requires tools that can handle a wide array of styles, rarely becoming a master of any. Yet, even though these general purpose editors are the tools of choice, it makes me wonder if there isn’t a better way. More importantly, it makes me wonder if there isn’t a better way using the existing tools.

Personally, I use Emacs (as I’ve said plenty of times). Emacs is a very powerful editor, and as I’ve learned to use it and started to dabble in elisp, the possibilities have only become more evident. When I watched the Light Table proposal video, I realized that pretty much every general purpose editor is able to understand the concept of a “block” of code. In Emacs, the different programming modes all provide enough understanding of the syntax that you can always jump to the beginning and end of most methods, functions and classes. Emacs also allows you to “narrow” a view of a file to a specific region. Using these two very simple models, it seems incredibly simple to create editing features like you’d see in Light Table.

Thinking in terms of Emacs specifically, it seemed totally possible to add a feature to a mode that:

  1. Finds a specific function/class/method
  2. Creates a “narrowed” view of that block
  3. Displays it in a new frame or within a buffer arrangement

I gave this a quick try and in a few minutes I was able to select the class, open a frame and momentarily narrow the region. It didn’t quite work as I had hoped. With that said, for someone with my very limited elisp experience to even come close to this means two things. One, a skilled lisper would have a larger bag of tricks to pull from in order to control the Emacs UI. Two, our generic editors already have the knowledge needed to help us navigate and edit our code in a more meaningful way.

My goal here is not to proclaim that I’m starting a new project or will be attempting to create these Light Table features. But I do plan on playing around with the idea of viewing blocks of code as a set. I shouldn’t need to adjust the file system layout to support this model as we’ve already proven that grep and find are very powerful tools with powerful text editors gluing things together and providing the last bits of functionality.

Outside of file and code organization, Light Table and Bret Victor also suggest creating a space where we can iteratively see our code execute. These are really slick ideas, but I think our programming communities are more apt to solve the iterative development problems. If the language provides an interface to its AST and means of inspecting in such a way as to provide standard output, conforming that output to a UI in an editor seems very doable. Again, you’re not going to see me writing any of this code. But, things like PyPy are proof that we are quickly coming to common conclusions regarding taking code people write and adjusting it for machines in such a way that we can watch the transformations happen and visualize exactly what is happening. That may be something different than the slick demos you see where some graphics change in real time. In fact, I’m hoping it will be even better where we can visualize our system scaling or failing as we meet demand.

I should mention that I’m really just thinking aloud here. I encourage you to check out the Light Table video as it has some pretty concrete concepts that present a really interesting editing environment. My only request is that as you watch it, consider how the ideas might be made available in your very own editor and language of choice. And by all means, consider donating! I’m just rambling here while the Light Table project is trying to make reality a truly amazing coding system. I wish them the best of luck!

Of course, secretly, I hope they end up ditching the web based backend and just add all these features to Emacs.

Thu, 19 Apr 2012 00:00:00 +0000 <![CDATA[More Mocking]]> More Mocking

Recently the codebase I’ve been working on has gone through a major change that has brought about an opportunity to make some major changes in our test suite. My perspective has been that it is generally a good thing. Our goal is to simplify the code, remove complexity and create a test suite that helps improve our code. This last bit is really important because in the past our test suite, while relatively comprehensive, did not always help make our code better.

Specifically, the tests were brittle. There was a lot of code necessary to make the tests work correctly. This either involved complicated mocking or starting/stubbing a lot of services. The result was that the tests were extremely slow and were easy to break.

Since the reboot of the code, we’ve got a new test suite with around 400 tests. Some are still really slow, but we’ve started organizing tests into different categories: unit, functional and integration. Obviously the integration tests are slower. The functional tests are a mixed bag. Some are slow where others are relatively quick. Many could be faster with some mocks, which leads me to the unit tests. These are meant to be extremely fast and isolate the code under test as much as possible.

The unit tests are the newest and test the newest code. I realized relatively quickly that I would need to use mocks to make these tests fast, since they all interact heavily with HTTP services. If I didn’t mock the responses, the tests were extremely slow and interacted with a large part of the system. The positive aspect of using the mocks was that the tests were really fast. This is a really great feature! I can run all the unit tests in less than half a second.

The downside is that sometimes these tests seem to replicate the code in the method. This is not terrible, but it doesn’t help test what the method should really achieve. Instead it focuses on the steps taken to presumably achieve the correct end goal. That said, this indirection is not always a problem, and I can see that at times it is really effective. It also surfaces issues in the code that suggest a method is trying to do too much.

Also, each test ended up following a similar pattern. The initial steps built the state you required. Sometimes this was trivial, but many times it could be a pain in the neck. My suspicion is that dealing with state for a complex document oriented system intrinsically is difficult to test functionally because the system must always be adjusting state instead of performing operations. For example, if you had a method that adds two numbers, there is a limited set of input that confirms the output is going to be correct. But, when you have a document that needs to be mutated, the input and the output becomes less obvious.

The next step of the pattern was doing the action. This traditionally involved instantiating the object and calling the method. This was generally extremely simple and as expected.

After the method was called, it was time to assert the results. Wherever possible, my goal was to assert a value returned by the method. Often, though, I had to make sure specific API calls were made to the backend service using the correct data. This didn’t bother me as a tactic because the APIs were mocked and, by confirming the behavior, in theory we confirm we are using the API correctly. If we were having to mock more of the system, then I’d consider that evidence the system design is not as good as it could be.
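Taken together, the arrange/act/assert pattern with a mocked backend might look like this sketch. The `UserClient` class and its `get`/`put` API are invented for illustration, not taken from our codebase:

```python
from unittest import mock


class UserClient(object):
    """Hypothetical wrapper around a backend HTTP API."""

    def __init__(self, api):
        self.api = api

    def rename(self, user_id, name):
        doc = self.api.get("/users/%d" % user_id)
        doc["name"] = name
        self.api.put("/users/%d" % user_id, doc)
        return doc


# Arrange: mock the backend and give it a canned response.
api = mock.Mock()
api.get.return_value = {"id": 1, "name": "old"}

# Act: instantiate the object and call the method.
result = UserClient(api).rename(1, "new")

# Assert: check the return value, then confirm the backend
# API was called with the correct data.
assert result["name"] == "new"
api.put.assert_called_once_with("/users/1", {"id": 1, "name": "new"})
```

The last assertion is the “confirm the API usage” step: no HTTP happens, but the test still fails if the method stops sending the right data to the right endpoint.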

In the past I had tried to use mocks for tests and came away with a feeling that my tests weren’t really very effective. When you isolate the functionality by fooling your code, you are not really asserting the code functions correctly. That said, it is clear that you cannot expect to build a large system and always test all components. In my new testing environment, I’m loving mocking in my isolated unit tests. I’m also mocking lightly in my functional tests to help speed them up where possible. In taking a layered approach I’m able to cover the code without hiding when errors actually happen. At the same time, I’m finding that every test I write now feels like it needs a ton of setup in order to run. Part of that is code smell, but at the same time I also think it is a negative attribute of using mocks.

Time will tell if the tactics I’m using are best. I do feel I’m using the mocks more effectively than I have in the past. The type of code that uses mocks also feels more natural than the code I tried to mock in the past. Finally, I also switched to using Mock, which I’m pretty happy with.

Tue, 10 Apr 2012 00:00:00 +0000 <![CDATA[SXSW 2012 Recap]]> SXSW 2012 Recap

It has been a week since SXSW finished up. A lot of people ask us what it’s like to play “south by” as a local. I realize that it is probably a lot easier than stopping through on a tour. When you are from out of town, unless you got lucky with a hotel room, it is tough to have a home base where you can get away from things for a bit. I’d imagine finding a quiet place to park for an hour or two with some cold water would be quite the luxury. As a local, the nicest part is we can go home.

The thing that makes south by hard is that you have no excuse not to play as much as humanly possible. You live in town, you know how to get around and you probably know many people helping to put on shows for the folks from out of town. Sometimes you can overcommit yourself. Also, since you are local, it is not as though you only have a day or two to play. For us, things start getting crazy during interactive and don’t end until Sunday, when many locals still play shows for other locals and out-of-towners still hanging around.

Not that I’m complaining or anything. SXSW is always an interesting week that leaves me excited and totally exhausted all at the same time. Some day I hope we don’t have to play it and I can just grab my bike and see bands.

This year was a good year for us. We were able to get two official showcases and played some great parties, like Hotel Vegan for Brooklyn Vegan. We ate dinner with Anthony Bourdain. Lauren and Rachel were able to play Stubb’s outside (I was at PyCon). We did an interview for Rolling Stone France and wrapped up the week hanging with Thrasher. It was a ton of work of course, but we had a great time.

This weekend we also headed up to Hot Springs to play Valley of the Vapors. Most people probably don’t know this, but Hot Springs has an amazing art community that the whole town supports. Both the promoters and crowds do an amazing job making shows there tons of fun to play. It was an honor to play the fest again and hang out with some old friends.

We are officially off the road for a bit, which couldn’t have come at a better time. Our yard needs quite a bit of love to get ready for summer and I have plenty of ‘grammin that needs to get done. This is the first time after SXSW I was ever looking forward to being home for a while. Usually spending the week at the festival I want to hit the road or go in a studio. This time around I was definitely ready to stay at home for a rest. I’m sure I’ll get antsy soon enough for some van life, but for the time being, I’m going to enjoy hanging around the house and getting to spend time with our friends.

Sat, 24 Mar 2012 00:00:00 +0000 <![CDATA[Some Post PyCon 2012 Musings]]> Some Post PyCon 2012 Musings

PyCon is already over for me this year. SXSW Music means that I have obligations back in my home town to hang out, play music and have some free beer when I can. It also means I was unable to stick around for the surrounding PyCon activities, such as the sprints, which I had hoped to attend. Such is life.

This year I took a different tack than in the past. Previous PyCons brought the opportunity to hack on something. The process would involve thinking of some technology to utilize and writing some application that uses it. Rarely would I finish it, but usually it gave me a reason to be on my computer during talks and generally have some goal for the weekend of geekery. Sometimes it is really nice to spend a weekend with nothing to think about except code.

Rather than occupy my time with a throw away project, my goal was to listen. My laptop stayed closed in the sessions I attended. I took some time to actually look at the schedule and pick out talks I was interested in. When I’d go to a talk, unless it was extremely uninteresting, I’d sit and focus on the talk. The experience was definitely fruitful because instead of walking away from PyCon with some toy application, my impression is that I might be walking away with some helpful “teach a man to fish” type insights.

The first is testing. I’ve tried to be a student of the TDD school and have probably flunked out more times than I’d like to admit. My feeling is that TDD is a tactic, not a goal in and of itself: it is a way to help a programmer slow down and think about what code should be written. Comments, documentation and specs can all do the same thing, and none are replacements for each other.

When someone says tests are good documentation, they are not being honest. Tests can be helpful in understanding how to use a piece of code. Tests make terrible docs. Communication is hard enough in a robust language like English, so don’t think some DSL for testing or unittests provide some panacea of clarity because they do not. People are not computers in that they have the ability to make jumps. When a computer fails, it stops and will not move forward or try again unless explicitly told to do so. People will make a jump and assume they understand something and act on that perceived understanding. Therefore, you should make every effort to meet the communication needs of others and that often means repeating the same message over different mediums.

Tox is cool! One of the things Tox does is create a virtualenv for installing the package under test. It installs the dependencies and runs the tests from this env. This is what all tests should do, because it emulates not only the production environment, but also the idea of an application going from source to deployed. Hopefully that build step becomes configurable, as I believe it is important to make that deployment and execution environment work outside of a single language or runtime.
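For reference, a minimal tox.ini is only a few lines. The interpreter list and test runner here are placeholders, not a recommendation:

```ini
[tox]
envlist = py27

[testenv]
deps =
    pytest
commands =
    py.test
```

With that in place, `tox` builds the virtualenv, installs the package plus its deps, and runs the test command inside that env.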

The key to handling subtly complex details is to be explicit. Unicode and dealing with dates/times are examples of this. I saw a talk on each, and in both cases it was clear that the desire to avoid thinking about all the details was a mistake. It is better to accept that you need to deal with all the nitty gritty details. Your helpers and libraries will not be that hard to write and maintain, so go ahead and write them.
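A tiny example of that explicitness at the bytes-versus-text boundary; the helper name and marker encoding are my own, just to show the shape of such a helper:

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes explicitly at the boundary; pass text through.

    The point is to decide once, at the edge of the system, which
    encoding applies, rather than hoping implicit conversions do
    the right thing deeper in the code.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

The datetime equivalent is the same idea: convert to an explicit, known representation (say, UTC) the moment a value enters the system, and only format it for display at the other edge.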

Never assume some technique is going to be correct. Bob gave a great talk on maintaining CherryPy across many versions of Python. He assumed the prescribed methods from the community were correct. They were wrong. Mocking is another example where you can be leery to drink the Kool-Aid. The community makes an assumption that mocking is a critical part of a good test suite. This is simply not true. Again, it is a helpful tool, but it is not required. Remember, what you hear from the community as the “best” option is really the option that has the best marketing. This is not a negative, because good marketing requires you to do your homework. But as a user, just because everyone seems to be saying they use some tool or package, it doesn’t mean you’ll find success with it.

Always keep things in perspective. Being a working musician and programmer allows me to see two radically different communities. One thing I’ve noticed is that the people who come across with the attitude that they do nothing but code or play music (no sleep, no food, power through the pain, etc.) rarely end up making a big difference. Instead it is clear that the people who find success are the ones working hard in all facets of life. They work hard and play hard.

That is about it for now. I still have tons of information settling in my brain at the moment, so I’m sure there will be more realizations. I’m also looking forward to watching the videos on the talks I missed. Thanks for a great PyCon!

Sun, 11 Mar 2012 00:00:00 +0000 <![CDATA[To Mock or Not]]> To Mock or Not

The first day of PyCon is almost over and it seems my interests have swayed toward testing. That is probably not terribly surprising, as I’m currently working on making some major changes to the tests for the project I work on at my job.

One talk I was very interested in was Stop Mocking, Start Testing. Some developers gave an overview of conclusions they’ve made regarding testing while working on Google Project Hosting. This talk, I’d argue, was a “con” in terms of mocking. They still mocked I/O resources, but generally, they aimed to make their classes and operations easily testable, which in turn, removed the need for lots of isolation via mocks.

Another talk I was interested in was Fake It Til You Make It: Unit Testing Patterns With Mocks and Fakes. This talk discussed mocking techniques and basic Unittest best practices. This was definitely a “pro” mocking talk and assumed that mocking was the best means of isolating tests.

I’m here to tell you I’m thoroughly confused as to which methodology is better. Personally, I like the “con” argument because the result of the tests was simple, encapsulated code that did small operations. The “pro” argument had relatively “normal” code, but the quality seemed tied to the fact that each component was tested in complete isolation. Everything is tested individually and the goal is coverage of every statement, which just means that any time you exit some tested scope, you have a test (i.e., any return or exception).

I honestly don’t know that either approach is better. One aspect that I think frames both talks is the state of the code when the tests were written. In the “con” case, the code was old and had no real tests. The process was taking legacy code and writing new tests and code within an orchestra of services. The “pro” case, I believe, primarily involved new projects written from scratch. At least, those were the examples that were given. I wanted to ask about the size of the codebase(s) the “pro” speaker used these techniques with, as well as when the tests were written, but there wasn’t enough time. I suspect that legacy code might be kinder to refactoring for better tests without mocks, but I could be wrong there.

It is interesting to me that these questions are not obvious. In fact, it seems that no one really has a solid answer for whether or not to mock extensively. This is not to say there are not strong opinions, but I suspect they are limited by experience. For example, a programmer who does mostly contract jobs using Django is going to have a different set of code requirements than someone picking up a 10-year-old set of chatty services providing a critical business need. Hopefully, I’ll have some conversations that might help shed a little more light on the decisions others have made.

Fri, 09 Mar 2012 00:00:00 +0000 <![CDATA[A Guiding Principle]]> A Guiding Principle

There was a discussion on the Emacs Reddit regarding an inspiring video that was very relevant to editing code. The purpose of the video though was to reflect on a larger idea. The presenter in the video makes the point that you should be creating things out of principle.

His personal principle is things should never get in the way of an idea. His example is an editor where small tools help to visualize instantly the effect of changing the code. If you haven’t already, go check out the video. Besides the editor example being really slick, it really hits home how he acts according to his guiding principle.

This got me thinking about my own guiding principle. I had never thought about it before in these terms, but the exercise of finding one’s guiding principle is helpful to contextualize why you do the things you do. What I came up with was:

“Always be obvious.”

Seeing as this is my first attempt at naming this principle I reserve the right to change the actual wording. The meaning shouldn’t change.

When the speaker discussed how he felt it morally wrong to get in the way of ideas, I knew exactly what he was talking about. I feel the same way when time is wasted figuring out how to perform a simple task.

Seeing as the speaker provided an example, I’ll do the same and see if that helps.

In Unix, the shell provides the entry point to the operating system. It provides some slick tools, like files and streams, to make things happen. Once you learn the basics, new concepts and usages become available because the constructs are obvious. The question of how to take the output of some process and read it into another doesn’t require thought, because you know immediately you can do the operation via a file handle.

The simplicity of Unix and its constructs has been used to create incredibly complex, yet maintainable, systems. The reason is that the constructs are simple and become obvious when working on larger problems.

Taking a specific example of a simple task, such as reading a log, we can see why the Unix way is obvious and makes the entire process much simpler. If your task is to read a log, then your first question is how to get the content of the log into your application’s code. If you choose the route of logging to some database server, then you have to answer questions about connecting to the database, login credentials, network partitioning and many other small issues that can get in the way. Your log reading script then ends up dealing with traversing some network and managing database connections.

If we took the obvious route and just read the log file, we could simply program our log reader to accept stdin and read directly from the log file. It is all very simple and direct. If this is the methodology across an entire organization, then the question of how to read a log becomes obvious: log into the box where the process is running and read its log file. Do you want to collect logs from the same app across many servers? Then simply loop over the hosts and get each process’s log file. It might not be ideal, but it is obvious, and you spend the majority of your time working on getting the information you need rather than dealing with network issues, credentials, or database cursor/threading problems.
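As a toy version of that obvious route, here is a log reader that just consumes stdin. The ERROR marker is an assumption about the log format, purely for illustration:

```python
import sys


def count_errors(lines):
    """Count lines containing the marker; works on any iterable of text lines."""
    return sum(1 for line in lines if "ERROR" in line)


if __name__ == "__main__":
    # Reading stdin keeps the script composable, e.g.:
    #   ssh somehost cat /var/log/app.log | python count_errors.py
    print(count_errors(sys.stdin))
```

Because the function takes any iterable of lines, the same code works on a pipe, a local file object, or a list of strings in a test.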

This is my guiding principle. Always be obvious. Finding clever solutions is great in that it helps to try new things that could eventually become obvious. But in order to easily find new solutions it is exceptionally helpful to have an obvious base to work from. Unix provided this and I believe the fruits of those decisions are abundant.

So, always be obvious!

Thu, 01 Mar 2012 00:00:00 +0000 <![CDATA[Focus on the Goal]]> Focus on the Goal

My pet peeve recently has been people who want to do something great in order to say they did something great. It is a dangerous mindset because it diverts your attention from creating quality.

If your goal was to write the Great American Novel and you are constantly focused on things like finding a publisher and artist for the cover, then the chances of writing a great book taper off rather quickly. If you want to write a great book, you need to spend time writing.

This is especially frustrating when working with others who are concerned with moving their agenda forward instead of simply putting out quality. Even though it seems counterproductive at times, if your focus is on the goal, the exterior details will fall into place in your favor.

Wed, 22 Feb 2012 00:00:00 +0000 <![CDATA[Day Two of Riding a Fixie: Thankyou Brakes!]]> Day Two of Riding a Fixie: Thankyou Brakes!

I rode downtown to do some work and eventually meet up with some folks for a drink. It was my second ride on my fixed gear. There were some healthy hills to go down and a good deal more traffic to contend with. My overall conclusion? Brakes are awesome!

It shouldn’t be surprising that riding a fixed gear is very different. It is pretty much like a regular single gear most of the time, except where it is nothing like a single gear. Where things diverge most are hills and intersections.

On hills, you really need a pretty tough gear ratio unless you are really fast at pedaling. My new ratio is pretty decent for the time being, as it wasn’t too fast going downhill, but it was definitely fast. A couple of times I tried to slow down without my brakes and realized pretty quickly that I was not able to stop very quickly. With time and experience I’m sure this will change somewhat, but physics suggests it will never be like using brakes. Thankfully, my brakes are working like true champs and stopping is not really a problem.

The other difficult thing is dealing with traffic at intersections. Typically on my single speed, I can push off and coast for a second to get my feet in the toe straps. If I need to take a corner somewhat quickly, I can coast through it safely. On the fixed gear, neither is trivial. Getting my feet in the toe straps as the pedal is moving is tricky, especially when my shoes are wet (it had just rained a bit this afternoon before riding). Likewise, taking a corner with some speed is a little more nerve-racking since I can’t stop pedaling. I’m sure over time these will become less of an issue. Some more practice should help in both cases. The few times I managed to nail a turn or get my foot in the toe strap on the first try, it was actually a really good feeling.

I’m really excited I got these new wheels and the fixed gear. It is tough to find reasons to get out and exercise when you almost always have things you can do at home. Having a new toy to play with has been a great excuse to hop on my bike instead of thinking I don’t have the time. I’m hoping the extra time pedaling will help get me ready for tour in February!

Thu, 02 Feb 2012 00:00:00 +0000 <![CDATA[Riding a Fixie]]> Riding a Fixie

This past Christmas I received a gift card to my local bike shop, The Peddler. Seeing as I had just gotten some new tires and bar tape on my geared bike, it seemed like a great opportunity to start fixing up single speed.

My single speed is a steel frame Panasonic Sport 1000. It is a pretty typical inexpensive Japanese road frame. The fine folks at Freeze Thaw Cycles built it for me. They specialized in building bikes from scratch using used parts. It has always been a great bike. My only complaint was that the gear ratio was a little too easy at times.

With some disposable funds to spend, and knowing that my wheels needed a bit of work (some spokes had broken at one point, and the wheels could probably use a bit of love replacing the bad ones), I decided I wanted to get some new wheels and a higher-geared chain ring. They had a set of blue Retrospec deep-V wheels that came with tires and a flip-flop hub. Seeing as I’ve always found the hipster fixie look striking, I thought this was perfect. As for the chain ring, I went from 39 to 46 teeth. They had to special order the chain ring and a lock nut for the fixie side of the wheel, so I was able to ride around on the new wheels as a single speed while waiting for the parts to come in.

My initial reaction was rather mixed. The wheels were extremely smooth, but also a lot stiffer of a ride. My old tires were a bit bigger, so obviously it was going to take some adjustment. The brakes were also honking like no one’s business. After reading up a bit from every cyclist’s friend, Sheldon Brown, my pads seemed toed in just fine, although they were off center. I centered things up and it helped with the squawking. The rims were also unmachined, so I’m sure the pads wearing off the paint had a lot to do with the noise. On the positive side, the wheels were much faster. It was trivial to get going fast enough that my legs would start spinning due to the easy gear of my chain ring.

I had ridden out to the coffee shop next door to The Peddler, Flightpath Coffee, in anticipation of my new chain ring and lock nut. They put the parts on right away and I began my first fixie ride back to the house. Knowing it would take some getting used to, I took a back road part of the way and spent a minute or two trying to do a track stand. It was definitely harder than I expected. I’m sure once I get more familiar it shouldn’t be too difficult. Learning to get my feet in my pedal clips was also much more difficult than I expected.

Past the initial awkwardness, it was a ton of fun. The bike is silent and the new chain ring feels great. I still have my brakes in case I need to stop really quickly. Surprisingly, the brakes work really well. My impression was that brakes on a fixie didn’t really do much, but they were just as effective as before. Having always had a desire to try out a fixie, when I rode my single speed, I’d keep pedaling almost all the time. This ended up being great practice and helped a great deal in feeling comfortable.

I’m really excited to get out on the road. People say that riding a fixie allows you to make a strong connection with the bike. Your body is directly responsible for all aspects of starting, cruising and stopping. I haven’t really experienced that just yet, but I’m hopeful it is something I’ll notice as I get more experienced.

Wed, 01 Feb 2012 00:00:00 +0000 <![CDATA[Home Network Sysadmin]]> Home Network Sysadmin

After an excellent term of service, my Linksys WRT54G finally started showing its age and needed to be replaced. It was the first wireless router I had ever owned and it served me well. Yet, with new hardware comes new possibilities.

Besides an upgrade to the newer specifications, one of my goals was to find a router that allowed network storage. NAS systems have become insanely cheap, but they are not mobile. We often need our files on the road, which means the router + hard drive combination is a slightly better fit.

I settled on a Netgear WNR3500L after a good 20 minutes at Fry’s randomly looking at the selection of networking gear. This process took much longer than expected, as my home sysadmin skills have wavered in light of cheap VPS hosting. It was always a lot of fun to run a dynamic DNS service and host my website and media files at home. Unfortunately, I now value reliable service over noodling on geeky endeavors at the house. The result is that computer and home network hardware have failed to pique my interest for quite some time. I’d consider myself completely out of the loop when it comes to knowing what kind of hardware is out there. Fortunately for me, home networking gear hasn’t changed too terribly much.

My new router supports something called Readyshare (they put a TM at the end of this, so I suspect it is something specific to Netgear) that will let you place a USB based storage device on the network. It shares it via Samba. It was trivial to set up and I was backing up my data in no time.

I also use Vonage for my home phone and previously had its router doing my local DHCP and sitting in front of my wireless router. Seeing as I rarely even use my home phone (it forwards to my cell), it made more sense to go ahead and put this new wireless router first in the chain. After a little trial and error, I configured my Vonage router with a static IP and opened the necessary ports for UDP traffic, successfully allowing my home phone to function once again. The fact this didn’t take a few days of noodling to get working made me feel pretty good about the whole process.

There are times where I wish I could be a sysadmin. Actually, I take that back. There are times I wish I knew what a sysadmin knows. None of it is so difficult you can’t understand it, but it takes practice and requires a different type of mindset that thrives in making software and hardware play nice. I’m confident when this router dies after 10+ years I’ll have another chance to flex my sysadmin muscles a bit at home and be thankful for the experience. But, I’m also thankful I don’t have to do it every day. Thanks to all you sysadmins out there!

Thu, 26 Jan 2012 00:00:00 +0000 <![CDATA[Ubuntu HUD]]> Ubuntu HUD

I had seen an article on Ubuntu HUD recently and finally got a second to actually read a little about what it does. The idea seems to be to take the workflow of tools such as Quicksilver or Alfred and apply it to menu actions in all applications. Overall it seems like a slick idea. The video in the post shows how it can be used on the command line! Very nice.

My biggest question is that of discoverability. I remember when I was working on the NLD 9 usability and felt the push for search was somewhat misguided. Search is extremely powerful, but it is also something of an art form. As a programmer, many times my goal in asking questions in forums or on IRC is to gather new terms to help aid in my searches. When you are a young programmer with no formal training trying to understand how to work with files, knowing terms like “handle” or “EOF” can be tough to discover on your own.

What is easy to appreciate in the HUD idea is that you never need to leave your keyboard. This is the reason I’m enamored with Emacs. It provides an interface to do so many things, that you rarely need to leave the comfort of your frames and buffers. In Emacs the use of plain text enables this sort of tool to work and I’d assume that GTK+ (more or less) is what allows HUD to take things like menus out of the application and add them to an index. If this assumption is correct, we might finally see how a free platform has an advantage in providing a better user interface. I doubt you could ever see such a widespread conceptual change in something like OS X or Windows, partially because it may not make sense to users, but also on a technical level it seems really difficult.

No matter the long term effects, it is exciting to see some innovation in user interfaces that reflect the maturity of the “computer” as a tool that the majority of people use.

Wed, 25 Jan 2012 00:00:00 +0000 <![CDATA[Making Transitions]]> Making Transitions

I can’t tell you how many times a transition has made a song work. A riff or chorus, no matter how interesting or catchy, is only as good as the transition that introduces it. This theme also holds true when developing systems. APIs and tools such as databases are only as good as the data formats used in the transitions. A well designed architecture with a poorly designed data or storage format will quickly gain complexity, losing the benefits of the system design.

At work I’ve been taking an existing internal API and transitioning it to a service based API. The process has revealed a wide array of complexity. In my efforts to manage this complexity, the best tactic has been to focus on the data format passing between the two systems. Defining expectations and a contract around that format has simplified the service as well as made the client code manageable.
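One way to make that contract concrete is to validate it explicitly at the boundary. A minimal sketch of the idea; the field names and the `validate` helper here are hypothetical, not taken from the actual system:

```python
# A hypothetical contract for documents crossing the service boundary.
# Checking it at the edge keeps both sides honest about the format.
REQUIRED_FIELDS = {'id': int, 'name': str}

def validate(doc):
    """Raise ValueError if doc does not satisfy the contract."""
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(doc.get(field), ftype):
            raise ValueError('bad or missing field: %s' % field)
    return doc
```

Both the service and the client can call the same check, so a format drift shows up as a loud error at the boundary rather than a subtle bug deep inside either system.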

While I’m still very much in the process of making the changes, the strategy of adjusting the data going between the systems has made other decisions much simpler. The result is that both the service and the client application can be implemented with elegance and simplicity.

Fri, 20 Jan 2012 00:00:00 +0000 <![CDATA[Slowing Down]]> Slowing Down

Speed is overrated and yet my mind thinks it is of extreme importance. The extra few minutes spent triple checking some detail is agonizing to my silly brain that simply wants to go go go. It is extremely frustrating that the process of communicating ideas is rife with peril. Just taking the words from my mind to my keyboard is a struggle at times. Typos are just one example. My eyes look at what was written and find the concepts I wanted to communicate instead of words that need proofreading.

The one tool I know I have to battle this curse is to simply slow down. Instead of the silent reading most do, I can read aloud to help improve my comprehension. My own ears filter the cruft and make clear when an idea is simply wrong. When I was a kid I took a test that suggested I was an auditory learner, so my assumption is that by communicating audibly, my mind gets a healthy dose of reality.

Even though I recognize this powerful tool, my mind wants to avoid it in the quest for speed. I need to slow down and speak clearly to myself. This post is meant as a reminder.

Fri, 13 Jan 2012 00:00:00 +0000 <![CDATA[Testing and Design]]> Testing and Design

A well designed system is not always easy to test. A flexible, modular design can actually make testing difficult. The complexity that hides behind a good design still exists, which means we need to be sure the interfaces between the different layers of the design are fully tested.

Conceptually, testing interfaces seems relatively simple, but as layers build upon other layers, things can become more difficult. Typically a layer of design will interact with libraries which may in turn have similar layers. A change at one layer may have a larger impact than expected.

Fortunately there are strategies to help manage this complexity. Mocks can be a good option, as long as you also test the real interface using the same parameters. There are other tools, such as code injection, but mocks are what I’m personally familiar with.
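As a sketch of what testing against an interface with a mock can look like, using Python’s stdlib `unittest.mock`; the `Storage` class and `record_event` function are hypothetical stand-ins for two layers of a design:

```python
from unittest.mock import Mock

# A hypothetical interface between two layers of a design.
class Storage:
    def save(self, key, value):
        raise NotImplementedError

def record_event(storage, event):
    # Code under test: it only depends on the Storage interface.
    storage.save(event['id'], event)

# Replace the real layer with a mock, but assert on the same parameters
# you would pass the real interface, so the contract stays exercised.
storage = Mock(spec=Storage)
record_event(storage, {'id': 42, 'type': 'click'})
storage.save.assert_called_once_with(42, {'id': 42, 'type': 'click'})
```

The `spec=Storage` argument keeps the mock honest: calling a method that does not exist on the real interface raises immediately, which catches drift between the test and the real layer.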

A complex, well designed system may not be easy to test, but it should still be testable. The difference lies in the fact that testable software can be tested effectively. Even though testing such a system might be difficult at times, it should be clear that it is possible. If the code seems impossible to test effectively (brittle tests, tons of stubs/framework for small corner cases, extremely slow tests), it is probably a sign of a poor design. This is unfortunate, as code that is difficult to test is also code that is difficult to confidently refactor.

Thu, 12 Jan 2012 00:00:00 +0000 <![CDATA[A Busy New Year]]> A Busy New Year

I’ve been slacking on my writing this year thanks to life getting in the way. With that in mind, I have had a couple small things I wanted to post about, even if they don’t really merit a full-fledged blog post each.


I’m coming to the conclusion that the biggest advantage of Dad is that it is really lightweight and simple. This is a good thing in development, but in production it is missing important features. I plan to keep playing with the idea, but other tools are already doing a good job managing processes.


Xenv on the other hand is proving to be more useful. I recently read how to make a chroot jail, and it occurred to me how Xenv provides a similar yet different use case. The key difference is that a chroot jail aims to create a new system from scratch, effectively copying everything in some existing system. Xenv instead tries to provide a simple layer on top of an existing system. From a DevOps perspective, Xenv gives developers a way to meet their requirements without clobbering sysadmins. This may seem contrary to DevOps, but in larger organizations, or even when someone is better as a sysadmin than a programmer, having an easy way to layer requirements means a smoother transition from development to deployment.


Unbeknownst to most, I’ve been taking vocal lessons. My goal is to help improve my speaking voice as well as help me start singing a bit. I don’t have any really concrete goals at this point, but it is something I’d like to do. A bit of formal training goes a long way in getting a better idea of what might be a good goal and gives me tools to get there.

What is interesting about the methodology is how little it has to do with singing. Instead it is understanding how you naturally vocalize sound and training your body to do the same thing when you “sing” as when you speak. The process is simply singing scales using different vowel and consonant sounds. It is like free weights for your voice. I can say with confidence that it has already made a big difference in the shower and on the Karaoke stage.


We’ve been demoing some tracks to show to folks who might be interested in working with us in the future. The biggest thing I’ve learned this time around is how important it is to be confident in your sounds. This is predicated on taking time outside of the studio to practice and really understand your sound. If you are able to go in with confidence, it goes a long way toward saving time. You aren’t trying a million things to figure out how something could sound. Instead, you have the sound and you merely need to capture it.

Hopefully I can get back to a more regular writing schedule, but in the mean time I’ll keep collecting my thoughts whenever possible.

Mon, 09 Jan 2012 00:00:00 +0000 <![CDATA[Other Programming Languages]]> Other Programming Languages

Recently I’ve found myself wanting to check out some other programming languages. It has nothing to do with Python lacking anything; it just sounds like fun. When I begin actually taking a closer look, my desire quickly fades into the reality that learning another language would be really hard and have very little benefit. Python is a great language, and no matter what other language I try, heading back to Python always ends up being a beneficial decision.

If I did have a bit more motivation, here are some languages I’d like to spend some quality time with.


There are really tons of different Lisp dialects and runtimes. Here are some more specifics.

Scheme / Racket

Racket is the new name for PLT Scheme. It provides a web framework.

GNU Guile

The blessed scripting and extension language for GNU systems. There has been talk of rewriting Emacs using Guile, so naturally I’m curious.


The Java ecosystem has some powerful tools that are always enticing. Being able to write some service in Clojure and immediately reap some of the benefits of years of VM fine tuning always sounds appealing. Getting over the hump of the classpath and Java-isms always stands in the way.


There are some web frameworks in CL. The window manager I use can also use CL.


I’ve dabbled in Haskell here and there, but it never stuck. The static typing philosophy is something I can definitely appreciate, and functional paradigms make sense to me. I think the syntax is where I lose some interest. Yesod was a recent framework that looked interesting, but not enough to lure me from Python.


The other day I spent some time going through a more advanced Go tutorial and had a nice time doing it. The problem here is that I don’t have any ideas of what to write. Again, Python always gets in the way by being too easy.


I’m pretty dangerous with C now and would like to fix that. The larger concepts like pointers make sense, but I’d like to learn it more fluidly. Understanding the many tools surrounding C (autotools, make, etc.) is of interest to me as well.

There are others of course, but these are probably the languages I’d actually start spending time with sooner than later. Not long ago I had a similar interest in asynchronous programming that I did manage to satisfy. My conclusions then ended up sending me happily back to plain old Python. With the exception of C, my guess is the same thing will happen, but time will tell.

Fri, 06 Jan 2012 00:00:00 +0000 <![CDATA[A Silver Lining Amidst Tragedy]]> A Silver Lining Amidst Tragedy

The new year brought a painful dose of tragedy in the loss of our friend Esme. To learn more about what happened and to find information on how to help, check out these words from Waterloo Records. Esme was a really amazing person. She loved music and was an example of the kind of person that makes all the hard work and money spent being in a band completely worth it.

Losing someone is never easy as it creates a void. Something is missing that you wish was still there. The experience is different for everyone, but what is consistent is the mark left behind. It is safe to say that Esme left a beautiful and deep mark on everyone around her.

It is in this mark left behind that we can see some positivity within the tragedy. People begin to speak up and share stories. Friendships are grown and the community rallies. Some people find new meaning in what they do and others make commitments beyond themselves. I hope that as we reflect on our lost friend we do so with a desire to change how we live today. Let great things rise up from the sadness we feel in the loss of our friend. We as a community have already begun to react, and while our feelings are sad, our actions are becoming more powerful and spreading memories of a great person.

Wed, 04 Jan 2012 00:00:00 +0000 <![CDATA[Holiday Recap]]> Holiday Recap

I’ve slacked a bit on writing thanks to the holidays, but that seems totally appropriate. Christmas is always a busy time because it provides some time without work which ends up being a lot of time for other work. Whether it is shopping for gifts or working on new music, there is something that needs to be done.

This time we went into the studio for some demoing. There are a lot of reasons to demo music. A demo is really when you record without a specific plan to release some music. Our goals were to put together some of the new songs we’ve been working on with our new drummer to see what people think and get folks excited about helping us with a new record. We’re pretty happy with how things are sounding, which only makes us more excited to keep writing music.

As music ended up taking the vast majority of my free time, the work I did need to get done took up the rest of it. This wasn’t terribly difficult work, but it was time consuming. One thing that made it easier was Xenv, a thin wrapper around virtualenv. This little tool is really just my own experiment. I found when using virtualenv it was all too easy to just use Python. This is an advantage in many ways because it means the tool is getting out of your way. But when I wanted to work on creating a new environment (not just a virtualenv), it was not as direct. Xenv gave me a really simple place to connect to a specific environment. It also lent some clarity to my Dad project. It isn’t a lot of code, and I don’t think it has to be used so much as reimplemented for other situations. The concept, I think, is the most important aspect, and generally it has felt like a personal success.

After Christmas, we did more recording and had a great New Year’s Eve show. It was the first time using some new Christmas gear and things seemed to go very well. New Year’s Day we went with some friends to The Choctaw Casino before heading to Dallas for the evening. We enjoyed a great Ethiopian meal and saw both my alma maters battle it out in the Ticket City Bowl. It was a great time.

I don’t have any major resolutions or new commitments this year. My goal is to be an incrementally better person, which seems pretty doable.

Tue, 03 Jan 2012 00:00:00 +0000 <![CDATA[Instinctual Programming]]> Instinctual Programming

Programmers are often searching for optimizations to their workflow. Editors, shell scripts and customizable tools are all examples of making your development experience faster. As a programmer, you dedicate your life to knowing tools well in order to use them effectively. Effectiveness with many tools is a matter of making their use instinctual.

The irony of instinctual programming is that the tools you’ve become dedicated to finally get out of the way of your thinking and focusing on a problem. You find flow in thinking about and solving problems. The code ceases to be something you are typing into your editor and takes shape on your screen as though you are publishing your thoughts as you think them. Your body ceases to act based on your command and instead listens to an instinct you’ve developed through dedication to your tools and environment.

As nice as this may sound it is critical to recognize what I said in my previous paragraph. The tools you are dedicated to finally get out of your way. All the tooling and customizations you’ve made to optimize your development experience fall away to the background. Many times the tools you believe really help may actually be standing in the way of your focus and the natural instincts you have to solve the problem.

I’m not suggesting you shouldn’t use powerful tools. I can say from experience there have been many a time when my job was not to conceptualize and implement complex algorithms. It was actually mundane text editing that required next to nothing in terms of critical thinking. In these cases, mastering my editor became an optimization to help in hurrying through the mundane to get back to the important work of the day.

But just as my editor and tools have helped me to optimize the mundane, they have proven to be distracting. An infinitely customizable piece of software offers endless optimizations as well as an endless supply of pointless things to change. Change inhibits instinct. Instinct is when your body takes over and your mind is free to do other things. There is a reason cars all have roughly the same interface. Drivers immediately know how to drive any vehicle, and to do so instinctively, because they do not have to think about the basic tasks like steering, braking and acceleration.

In the development world this relates to things like working with source control, build/test scripts and deployment processes. All these things should be instinctual processes that do not require excessive thinking. The same goes for documentation. Imagine your favorite programming language didn’t have a central place for documentation. You had to constantly search for random articles in order to find out how to use the language. You’d quickly change your opinion of the language and find something that was more usable.

Instinctual programming is not about optimization. It is about repetition to the point of mastery. It is important to recognize when the process you are repeating may not be optimal, but at the same time, it is important to beware that constant change will never allow you to act instinctively.

Thu, 22 Dec 2011 00:00:00 +0000 <![CDATA[Paver and Throw Away Scripts]]> Paver and Throw Away Scripts

I’ve been making an effort to automate as much as I can recently. Part of this effort has been to utilize more Unix tools, but as a Python programmer, it isn’t always obvious how to combine the two. Paver is a build tool written in Python that aims to be similar to Rake for Ruby. If you’ve ever tried to maintain a Makefile and have felt out of your element, Paver is a great tool to help. It doesn’t have the same target based design that make does, but in terms of keeping a collection of operations dealing with files and/or running processes, it does a great job bridging the gap between Python and the shell.

Beyond being helpful in builds, Paver provides a great tool for writing throw away scripts. If you’ve ever had a task that needed some code but didn’t need a full-fledged module or package, a pavement file can help to keep things organized. Its concept of tasks helps to keep small operations organized and allows you to keep your code semi-modular as you hack away. Paver also provides some helpful tools to make things like command line flags and input simple and direct. If the throwaway code does end up becoming a module that needs to stick around, you are one step closer to making it official, as Paver provides some tooling for setuptools.

Using the simple example of examining log messages, I’ll show you how Paver makes the process really simple and intuitive.

First off, you have a log file somewhere you want to copy to your system. The easiest thing to do would be to copy it via scp. Here is an example in Paver.

# we'll assume the rest of these examples import this
from paver.easy import *

@task
def grab_logfile():
    # substitute your own host and log path here
    sh('scp user@example.com:/var/log/app.log .')

The ‘sh’ function calls a command much like the Popen class. You can also capture the output in a variable. Another helpful aspect is that you can add command line arguments. Here is a good example of how you can find the package name and version of a Python package.

@task
@cmdopts([('pkg=', 'p', 'The path to the package')])
def pkg_name_version():
    if not options.pkg_name_version.get('pkg'):
        print 'I need a package name'
        return

    scratch_dir = path('scratch_dir')
    scratch_dir.mkdir()

    pkg = path(options.pkg_name_version.pkg)
    pkg.copy(scratch_dir)
    sh('tar zxvf %s' % pkg.basename(), cwd=scratch_dir)

    # the tarball unpacks to a directory named without the .tar.gz
    pkg_dir = scratch_dir / pkg.basename().replace('.tar.gz', '')
    name = sh('python setup.py --name', cwd=pkg_dir, capture=True).strip()
    version = sh('python setup.py --version', cwd=pkg_dir, capture=True).strip()
    print 'Name: %s' % name
    print 'Version: %s' % version

I’m making an assumption that you have a tar.gz package that has the same name as the package. The code uses a ‘-p’ or ‘--pkg’ flag to get the package path. It makes a scratch directory where it copies the tar.gz. From there we unpack the tar.gz and ask the package’s setup.py to tell us the name and version.

You can see it is really simple to do things like run commands in specific directories and capture output as needed. Also, Paver includes a really handy path library that helps to make path operations more intuitive.
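To give a feel for that path style (joining with ‘/’, pulling names apart without string slicing), here is a rough illustration using the later stdlib pathlib, which behaves much the same way as Paver’s path objects; the package path is a made-up example:

```python
from pathlib import Path

scratch_dir = Path('scratch_dir')
pkg = Path('/tmp/mypkg-1.2.tar.gz')  # hypothetical package path

# join with the '/' operator; name handling via attributes, not slicing
unpacked = scratch_dir / pkg.name.replace('.tar.gz', '')
print(pkg.name)       # mypkg-1.2.tar.gz
print(unpacked.name)  # mypkg-1.2
```

The point is the same either way: path manipulation reads like the operation you mean, instead of a pile of os.path calls and string surgery.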

I haven’t really covered anything that isn’t in the docs, but hopefully you can see how some of its tools make simple throw away scripts easier. It should also be clear that Paver doesn’t make these scripts perfect. You can see from my example above that in order to provide a better interface you’d probably want to use something other than Paver’s built in command line options support. But for a throw away script, it is simple and gets the job done. The same goes for the path library. It can be a little verbose at times because you need to be more specific by using things like the basename or abspath method. Again, it gets the job done, adding just the right amount of framework to make things easier, not perfect.

The big win with Paver is that you have all the benefits of a Python environment while having easy access to the shell. Here is an example of reading and filtering a log file with Popen vs. Paver.

from subprocess import PIPE, Popen

def read_with_popen():
    log = path('/var/log/app.log')

    p1 = Popen(['tail', '-f', str(log)], stdout=PIPE)
    p2 = Popen(['grep', 'foo'], stdin=p1.stdout, stdout=PIPE)
    p3 = Popen(['grep', '-v', 'bar'], stdin=p2.stdout)
    p3.wait()

def read_with_sh():
    log = path('/var/log/app.log')
    sh('tail -f %s | grep foo | grep -v bar' % log)

You can always redirect the output to some file or you could use the ‘capture=True’ argument in the ‘sh’ call to do further processing. Either method certainly works, but for a quick script, Paver does a great job utilizing an obvious pattern that allows quick access from Python.

I’ve never been a huge advocate of Paver in the past because it didn’t seem like a critical tool, but I’m beginning to become a real fan. It is a tool that you can live without if you’ve never had it, but once you recognize where it excels and you begin to use it, it quickly becomes indispensable.

Tue, 20 Dec 2011 00:00:00 +0000 <![CDATA[Broken Python Packaging]]> Broken Python Packaging

Over the weekend I took some time to take a project at work and try to build a set of binary packages. My goal was to be able to untar a package into a virtualenv and have it work correctly. This isn’t that different from distributing egg packages or an RPM, so I figured it would be pretty easy.

My first step was to determine all its requirements. To do this, I created a blank virtualenv and installed my app. I then used pip freeze to record all the eventual requirements. Again, my goal was to have a set of package like files that can be untarred (or something similar) that will install the files in the virtualenv correctly such that I don’t need to perform some sort of build (compile C extensions for example) or get other requirements.

Having my requirements in place, I then started downloading their packages from our local cheeseshop using pip’s “--no-install” flag. Again, this was very convenient and felt promising.

Once everything was downloaded, my script would then visit each package and try to build a binary package. I tried quite a few options here but none seemed to work correctly. The biggest problem was that each package had some inconsistencies that made using the same command for all of them fail. One package gave an odd character error when trying to create a tar.gz. Another didn’t recognize different bdist formats. Trying an RPM format on a whim was useless. Taking a step back I tried doing a “build” of each package and manually putting them together in a tar, but that was non-trivial and different per-package.

I’m going ahead and giving up for the time being, as it is clear that Python packaging currently assumes source rather than pre-built packages. This is really too bad, because oftentimes distributing source and forcing your end-user system to have the necessary build tools makes a package unusable. In my case it means that releasing a new deployment requires running through the entire setuptools process for each package that gets installed. The benefit of avoiding this process is that you reduce the number of variables present when deploying. That makes a deployment much faster, which becomes more and more important in a distributed environment.

Hopefully with distutils2 and Python 3 the community can find some better solutions for packaging. I understand that the source based installation makes a lot of sense in development and even in many production environments, but that shouldn’t make a consistent binary package system impossible.

Mon, 19 Dec 2011 00:00:00 +0000 <![CDATA[The Skill of Listening]]> The Skill of Listening

Usually a couple times a year most or all of the remote developers get together to take some time to meet and reset. The goal is to have some time together to discuss and design some of the bigger picture details of our systems as well as work together to sprint on some code.

I remember one specific instance where we had some new developers and we were discussing some potential changes to our deployment system. One of the new developers had a rather strong opinion about keeping our own sources and packaging everything together when we release. My opinion was the exact opposite, of course. It seemed ridiculous to avoid the convenience of things like setuptools and easy_install for what I thought was no real gain.

Every once in a while, when there is a technical discussion, it is helpful for me to think back to that conversation. My opinions have changed and now I agree with his points. At the time I had a bias that was unsubstantiated. The fuel driving me to disagree was really my own closed-mindedness to new ideas. It was certainly OK to disagree, but my resolution should have been to recognize the impasse and contemplate the ideas presented. Instead, I was impatient and kept arguing.

None of this was damaging or that heated, but lately I’ve been thinking about it a lot alongside other technical arguments I’ve gotten myself into. The theme I keep coming back to is the importance of listening. For myself, that means explicitly shutting my mouth and thinking about the problem. Sometimes it means talking about it to myself out loud or trying to write some code. The main thing is that I stop arguing and start listening in order to consider and compare the ideas.

The other thing about listening is that it does not imply that my idea or direction is wrong. I’m not giving up on the argument or becoming acquiescent. This is also important because assuming I’m wrong means I’m not listening to myself and giving my own ideas a chance to mature and develop.

This process is something that takes a lot of practice and between writing code and writing music, I get my fair share. The irony is that only now am I realizing how poor my listening skills actually are, even though I could have been practicing for most of my life. You never stop learning.

Fri, 16 Dec 2011 00:00:00 +0000 <![CDATA[Testing MongoDB]]> Testing MongoDB

We’ve been noticing some issues with our data in MongoDB. Seeing as others have had less than stellar experiences, it seemed like we should make sure that MongoDB wasn’t the one causing problems.

My main goal is to see if incrementing updates by a large set of clients can cause data to be corrupted. I’m defining corruption as anything that diverges from the original document. Specifically, we’re looking for some keys that should have an answer but don’t. My theory is that under some conditions it is possible to develop a race condition when writing the documents.
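The shape of that race is the classic lost update. A minimal sketch in plain Python, with threads standing in for concurrent MongoDB clients (on the real system the remedy would be an atomic update operator like $inc rather than a read-modify-write):

```python
import threading

doc = {'count': 0}
lock = threading.Lock()

def unsafe_increment(n):
    # Read-modify-write with no atomicity: two clients can read the
    # same value, and one of the increments is silently lost.
    for _ in range(n):
        value = doc['count']
        doc['count'] = value + 1

def safe_increment(n):
    # Serialized update, the moral equivalent of an atomic $inc.
    for _ in range(n):
        with lock:
            doc['count'] += 1
```

Run enough clients through `unsafe_increment` and the final count can fall short of the total number of increments, which is exactly the kind of divergence the test is hunting for.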

The good news is that so far I haven’t found anything suggesting my theory is correct. The bad news is that the test takes an extremely long time to run. This could be a function of the Python client not being very fast, but I suspect MongoDB is at least partially to blame based on the logs. There are some long operations, like moving indexes and creating new datafiles, that seem slower than expected, which I’m assuming is because of the heavy write load.

The larger picture this test paints is that incrementing documents in MongoDB is probably a bad idea for any mission critical or highly available app. The incremental updates play into MongoDB’s weaknesses, such as the global write lock and the way it uses mmap to manage how files are created. For our application I have long thought that we should have a two phase system: one data store for “live” data and another store for static data. In an initial iteration of this design I would picture a MongoDB instance for both and migrate data in between, doing some compaction along the way. Hopefully these tests, even if they prove MongoDB has been doing fine, will help support my design ideas.
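The two phase idea can be sketched with dicts standing in for the two stores; the function names and document shapes here are hypothetical, just to show the flow:

```python
live = {}    # stands in for the write-heavy "live" instance
static = {}  # stands in for the read-mostly archival instance

def record(doc_id, field, value):
    # Incremental writes only ever touch the live store.
    live.setdefault(doc_id, {})[field] = value

def compact(doc_id):
    # Once a document stops changing, migrate it to the static store,
    # compacting it into its final shape along the way.
    static[doc_id] = dict(live.pop(doc_id))
```

The payoff is that the write-heavy store stays small and churny while the static store only ever sees whole, finished documents, sidestepping the incremental-update weaknesses entirely.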

Tue, 13 Dec 2011 00:00:00 +0000 <![CDATA[Talking About Music]]> Talking About Music

It is extremely difficult to talk about how something should sound. When we practice we hear different ideas for how the song should progress. You have to reference where in the song you want to try something and give some sense of what you are going for. The people who are going to play whatever you're imagining need some idea they can internalize and actually attempt. Rarely is the process simple, even though it can feel like it should have been.

One thing that I’ve noticed is that I usually want to communicate these ideas quickly. Part of the reason is simply because I don’t want to lose the idea. A rhythm or melody can be easy to forget when you first write it because your hands haven’t memorized it yet and your ears most definitely haven’t. Another reason I want to try and speak quickly about what I’m thinking is that experimenting takes a very finite amount of time. When you have a song that is 4 minutes and you are practicing for 2 hours, the most you could play the song is 30 times. The reality is you’re going to take a minute or two in between each play, and if someone messes something up (which is not uncommon at all), you might have to start over. When you have such a limited time period to work, communicating efficiently is critical.

That last sentence should speak volumes, because trying to say something quickly is rarely going to be the most effective means of saying what you mean. This is especially true when you’re talking about music, where the concepts can be really abstract and qualities are subjective. I’m coming to the conclusion that if you want to communicate some musical idea there is no better way than to just try and play it. Take the time to set a context and emphasize the part(s) you are working on so everyone sees when and what you’d like to change. This kind of patience is not something I seem to have, but I’m most definitely going to try and learn how to take my time and become more effective.

Mon, 12 Dec 2011 00:00:00 +0000 <![CDATA[Devops Mindset]]> Devops Mindset

I’ve never said it, but I’ve never really been a fan of the devops concept. I have no problem with developers doing deployments or having root access to servers or anything like that. My main complaint was that the divide between sysadmins and developers is helpful, as it forces an interface and protocol. I’m now thinking that perspective was wrong.

The devops movement always seemed to be born of the startup trend where developers are hired and must be responsible for every part of the system. Services like Amazon EC2 only made it a logical position to create in a small company that aims to build on the shoulders of giants. Seeing as I have no historical context to say whether my assumption was right or not, I’ll assume it was correct for the purpose of making my point.

The real impact of devops is providing empathy for both sets of concerns in application and system design. When you throw away the preconceived misconceptions each type of stakeholder has, there are some really powerful solutions that can be created that help everyone out. The developers can write code to help cross the chasm between development and production, while sysadmins can have their needs met more intimately with applications packaged and written with the system in mind.

In a way it is simply an optimization. Prior to devops, the consensus was that there was an actual wall you had to work around. Devops doesn’t remove the wall, but instead suggests that both sides work to build doors and windows that each can depend on. This is a much better situation. I’m glad I finally started to see it.

Fri, 09 Dec 2011 00:00:00 +0000 <![CDATA[Dad and adding Build Steps]]> Dad and adding Build Steps

I’ve been working on the same project at work for a while, so when I have a little free time I like to work on Dad. I’m not sure if I’ve mentioned it before, but it’s a process manager application that is meant to support a more comprehensive set of management tasks. It came from my idea of a devserver much like Foreman, with the biggest difference being that instead of simply controlling whether a process is up or down, you can ask it to perform tasks within its own sandboxed environment.

It got me thinking about how most people in the Python community end up looking to virtualenv for their deployment, when there might be other tactics. Virtualenv is a great tool and I use it all the time for development. There are tons of people that find great success with it as a deployment tool as well, so I don’t plan to sway opinion away from it. I do think there could be better ways though.

One of the reasons developers choose to write applications in languages like Ruby or Python is because it is easy to get a large portion of the functionality done quickly. Honestly, during this phase of development a tool like virtualenv makes a ton of sense because it provides a similar speed when deploying. Installing all your dependencies at deploy time is not a huge deal because there aren’t that many and most of the time they don’t change. In other words, it isn’t broken so why fix it?

The reason to consider a different methodology early on is because it is easier to do it early on. Eventually the simple Django app you wrote is going to push the limits of what the framework offers out of the box. You start adding functionality that doesn’t have an obvious fit within the repo. As these problems evolve the application gets complex. The complexity moves from being a simple repository level complexity to abstractions via services and processes. There is nothing wrong with this, but it presents a different sort of problem to manage. You no longer are simply refactoring code, you are orchestrating services both in development for testing and in production. There is a lot that can change going from dev to production and I don’t believe that tools like virtualenv really help that issue. They do too many things.

With that in mind, if you start out your application with a healthy appreciation for processes and services, there is a good chance you can avoid much of the pain of learning to orchestrate these details consistently. This is why I started writing Dad. It is a process manager that you can use in development to orchestrate your services. You can then use it to run your tests, and eventually the idea is that your development process from code to test to production is the same on your local machine as it is when you deploy to production.

The key to making this transition is to recognize the need for distinct steps of deployment. In Python it is easy to just pull some code on a server, run install in a virtualenv and fire off a command to start it up. The problem is when your app actually needs 10 other services, one of which is a database written in C++ and another is a Java application. When you have more steps for deployment, you have points in the process where you can create the necessary pieces to make deploying as simple as untarring a tarball, instead of finding dependencies, downloading them, installing console scripts, and so on.

The first step in a deployment process is to make a package. A package makes it easy to put files in the execution environment. Creating a tar.gz via setuptools is probably a good first step. RPMs or dpkg might be another option. But in either case, you need something you can create that will be used to build the actual “thing” that will be deployed.

Once you have a package, you then need to make a build. The difference between a package and a build is that a build can run in a production-like environment without doing anything but copying in its files. Pip supports a “freeze” operation that takes all the currently installed requirements and their versions and creates a requirements file that can be used to recreate an environment. This sort of function should happen at the package phase, prior to the build. The build stage should find all the requirements and compile them together accordingly. The result of that operation is what becomes the actual build.

After the build, it is time to actually deploy. The deployment should be really simple. You should have an execution environment where you are going to put the files. That is where the build is installed. This is where Dad comes in. When you add an application to a Dad server it creates a sandbox for you to create an execution environment. You have your executable files and supporting files as needed and Dad is configured to call commands in order to manage the processes.

While all the above seems like a ton of work, it really is a lot easier than you might think. It is pretty easy to write a short and simple script or Makefile that runs “python setup.py sdist”. From there it is pretty simple to run a command that installs it into a brand new virtualenv and resolves dependencies. At the end you can pop the build in your Dad sandbox and run the tests. All in all, it really should be simple, especially if you start when your application is still manageable.

Lastly, Dad is nowhere near finished. The design has been hashed out a few times with different models and only now would I argue that the model is correct. There is still a lot to do. If it interests you, feel free to fork it and try hacking on it.

Thu, 08 Dec 2011 00:00:00 +0000 <![CDATA[Coding Pride]]> Coding Pride

Are you proud of the code you write? What is it about some code that you are specifically proud of? For me, I’ve become less and less proud of my code.

This doesn’t seem like a bad thing. It helps me to be humble and less attached. Lacking pride doesn’t imply that I don’t care about the code, but simply that when I’m done, it doesn’t feel like something I’m excited about showing people.

I can say it does concern me slightly. My perspectives seem to have changed slightly when looking at code that I respect. It is typically pretty boring and nothing too interesting really happens. The meat of the code ends up being comments that describe rather clearly what should be going on. None of this is a bad thing, but I’m also finding myself less interested in new language features or using a feature that is not well supported. In short, my code has become rather conservative.

Some would argue that this is a good thing and, generally, it should be obvious that I agree. With that said, it also seems to be indicative of a more fundamental issue of questioning my skills. Is the reason I don’t use some experimental feature because I don’t trust it will continue to be developed, or is it really that the feature isn’t very easy for me to use? That slick new language feature that everyone is talking about looks helpful, but I’m not sure I really understand what the big deal about it is. Maybe I’m missing something.

It is important that I recognize when my code is wrong or suboptimal, but it also is important that I feel I can code confidently. My lack of pride in my code actually feels more like a lack of confidence. As with most of life, the solution is plain old hard work.

Wed, 07 Dec 2011 00:00:00 +0000 <![CDATA[The Envconf Tool]]> The Envconf Tool

Continuing to think about using env vars, it seems like a really easy way to start using them would be to have a way of configuring them using a file in development. For example, in our apps at work we traditionally have some sort of a base config YAML file that keeps all our defaults. The YAML format is pretty helpful in keeping things organized for development but there is really no reason you couldn’t use environment variables.

Without further ado, I wrote envconf. It is really meant for development where you have some configuration files and you want to read and create env vars from them. It only supports YAML files, and for lists it uses JSON to create a more universally parseable value. I’m not crazy about this list support, but one easy workaround is to explicitly use a string and parse it yourself. Not ideal, but probably good enough.

In theory if someone actually uses this and wants to use another config format like .ini, JSON or some other format, it should be pretty easy to add a new type. Implementing a different configuration type is a matter of reading a file and returning a dict, which shouldn’t be terribly hard.

I should also mention that envconf doesn’t overwrite anything currently in the environment. The result is that you could layer things if you wanted. Here is an example of how you could do it:

envconf -c dev.yaml envconf -c base.yaml run-my-server

You could also just pass it in explicitly in your shell:

APP_EXAMPLE=foo envconf -c base.yaml run-my-server
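The no-overwrite behavior could be sketched like this. To be clear, this is a guess at how envconf might apply values, not its actual source:

```python
import os

def apply_config(values):
    # anything already set in the environment wins; the config
    # file only fills in variables that are currently unset
    for name, value in values.items():
        os.environ.setdefault(name, str(value))
```

So in the layered example above, the values from dev.yaml land first, and the inner invocation with base.yaml only fills in whatever dev.yaml left unset.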

Unfortunately, I haven’t had a chance to try it out at my job as we have a semi-involved configuration pattern that would need to be adjusted and I don’t have the time. But hopefully I’ll get a chance to give it a go.

Tue, 06 Dec 2011 00:00:00 +0000 <![CDATA[Using Env Vars]]> Using Env Vars

Someone at work mentioned The Twelve Factor App in our IRC topic. It is a great read as it simply and concisely states the current general best practices for writing a distributed app.

One thing that did confuse though was the mention of Environment Variables for configuration. Here is what it said:

The twelve-factor app stores config in environment variables
(often shortened to env vars or env). Env vars are easy to change between deploys without changing any code; unlike config files, there is little chance of them being checked into the code repo accidentally; and unlike custom config files, or other config mechanisms such as Java System Properties, they are a language- and OS-agnostic standard.

The idea of using env vars as a means of passing configuration information makes sense in the general case. But if your application has more complicated configuration, it seems like it could be somewhat daunting.

That is just my initial impression coming from a YAML based configuration system. After taking a look at our production configuration I could see how most details could be reflected as traditional env vars. Where things do become more complicated it seems reasonable to consider moving that information to a service.

As a thought experiment, I’m going to consider moving some of our configuration to env vars to see if it feels reasonable. Starting off, the vast majority of values seem like they could easily be migrated to prefixed variable names. As we do use YAML, we have some dictionary and list structures. The lists seem easy enough to pass as strings. For example:

# set the env var as JSON
import os
os.environ['APP_HOSTS'] = '["", "", ""]'

# load the env var
import json
hosts = json.loads(os.environ['APP_HOSTS'])

While JSON seems like it might be somewhat heavy handed, it is important to have some standard otherwise you’d end up seeing lots of code that looks like:

hosts = os.environ['APP_HOSTS'].split()

Maybe this paradigm is totally fine. I would imagine it could break down though depending on the values and if they needed spaces.

For dictionaries, looking at our config, it seems most could, and probably should, be moved to some service. It is still possible to use something like JSON, but again, I think that might prove cumbersome in some situations. Writing raw JSON in a Bash script might be rough. I’m also noticing that many times a dictionary really is more of a visual organizing tool. It is possible to flatten the dictionary using an obvious delimiter that could be used to expand things later if need be.

Here is an example:

conf = {
    'csv': {'extra': {'repair': True, 'delimiter': '|'},
            'type': 'text/csv'},
    'xml': {'extra': {'xsl': 'http://host/path/to/foo.xsl'},
            'type': 'text/xml'},
}

This might be flattened in the env vars as the following:

APP_CSV_EXTRA_DELIMITER = "|"
APP_CSV_EXTRA_REPAIR = "true"
APP_CSV_TYPE = "text/csv"

APP_XML_EXTRA_XSL = "http://host/path/to/foo.xsl"
APP_XML_TYPE = "text/xml"
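That flattening is mechanical enough to script. Here is a sketch; the helper is hypothetical, and note that env var values must be strings, so a naive str() of a boolean gives "True" where a real tool would probably want to lower-case it:

```python
def flatten(conf, prefix='APP'):
    # recursively turn nested dicts into UPPER_SNAKE env var names,
    # joining each level of nesting with an underscore
    env = {}
    for key, value in conf.items():
        name = '%s_%s' % (prefix, key.upper())
        if isinstance(value, dict):
            env.update(flatten(value, prefix=name))
        else:
            env[name] = str(value)
    return env
```

Feeding a conf dict like the one above through this produces the same style of APP_* names, and expanding back into a nested dict later is just a matter of splitting on the underscore.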

What we need to do is take this kind of configuration and use it via env vars to configure a download service. Here is one option:

import os

class DownloadService(object):
    @classmethod
    def from_env(cls, type):
        kwargs = {}

        vkey = ('APP_%s' % type).upper()
        for key, value in os.environ.iteritems():
            if key.startswith(vkey + '_EXTRA'):
                # APP_CSV_EXTRA_REPAIR -> repair
                kwargs[key.split('_', 3)[3].lower()] = value

        kwargs['headers'] = {'Content-Type': os.environ[vkey + '_TYPE']}

        return cls(**kwargs)

Offhand it feels a little bit kludgy, but I think the reality is that it is not that complicated. The fact I wrote this method inline in a minute or two suggests that spending any more time on something cleaner is probably a waste.

In summary, it seems pretty reasonable to define configuration via environment variables. One concern is that keeping the configuration for development and testing might be cumbersome at times. But with that in mind, there is no reason one couldn’t keep them in YAML or JSON or even Python and simply generate them on the fly when running the application. I’ve also found that at times using environment variables can be tricky with scripts, where I preferred command line flags. My gut tells me the real issue there is that I’m not doing the right thing wrapping the script or configuring how it is run.

I hope this exercise has been somewhat helpful. Obviously a complex configuration is going to end up with a pretty massive set of environment variables. The gut reaction might be that it seems sloppy or that it will become too complex. This might very well be true, but I have a feeling it can be pretty easily managed. Writing a simple YAML to env var script would be pretty trivial. Maintaining a global namespace of env var names is effectively what Emacs does, and I can say from experience that while the namespace has become huge over the years, I’ve never hit a collision.

Mon, 05 Dec 2011 00:00:00 +0000 <![CDATA[Seemingly Disconnected]]> Seemingly Disconnected

It seems that my DNS service expired over the last week or so, the result being I missed some email. Life has also gotten in the way of my writing, which has added to the appearance of my disconnectedness. Even though I’m sure no one really even noticed, it still annoys me.

One of the reasons I haven’t been writing this past week is that my computer seemed to receive an update that it wasn’t happy with. I have an i7 MacBook Pro and unfortunately it has a history of locking up. I’m 99% sure it is a driver issue, and my money is on Thunderbolt being the cause. To make matters worse, VMWare Fusion, the tool I use to run Linux and do real work, seems to aggravate it. It has been really frustrating to realize I didn’t write anything for the day, sit down at my computer, see some odd pixelation, and know the computer froze up again.

The other pain in my neck (and wallet) has been our van. We played a show on Tuesday and after loading the gear into the van at the practice space it wouldn’t start back up. It had been running poorly that day, but that happens every once in a while and it usually seems to work itself out. This time it didn’t. Fortunately, after we practiced it did start up and I drove it home. I’m going to take it up to the shop this afternoon and probably end up dropping way too much money to get it fixed.

My only saving grace is that if it is something that commonly breaks on the van, then my mechanic usually has a means of fixing it where it won’t break again. After buying the van we ended up finding out that there are a lot of components that fail. My mechanic is something of a diesel genius and can fix most of them in a way where they don’t break again (or he can get rid of them completely!). This is definitely a good thing as I have heard horror stories of people with similar engines fixing the same thing over and over again.

For those that do read and pay attention, my apologies. I should be back on track now, at least until my computer freezes up again.

Thu, 01 Dec 2011 00:00:00 +0000 <![CDATA[Avoiding the Web]]> Avoiding the Web

The other day I unsubscribed from Hacker News. It is simply too much information. I also switched back to using StumpWM. I decided to give Kindle for Android a try and read Get in the Van: On the Road With Black Flag. All of this has been to help fight the focus sucking power of the web. They say that email is worse than marijuana when it comes to focus, and honestly I can believe it.

The thing about the web that is so dangerous is that it requires so little dedication to use it. No one wants to go online and scroll down a page reading for comprehension. Instead it is more enticing to click around and search for random information. You can let your mind wander down the never ending stream of hyperlinks until you forget where you came from and where you hoped to end up.

Obviously, as a person who works daily on creating software for the web, no matter how much I’d like to break away from the web, it simply isn’t an option. That said, it doesn’t mean that I can’t take real steps to avoid the time suck it can be. The first step is admitting you have a problem.

There is value in reading long challenging work or going through some tutorial. It can also be valuable to spend time on IRC helping others or trying to further an open source project. Reading documentation online is paramount to any developer. Clearly, not all of the web is truly dangerous.

The biggest danger on the web is the short spurts of information. Hacker News is a great example of this because it provides a never ending stream of headlines awaiting a click. There happens to be some discussion at times that can seem interesting, but it is really a farce. Discussion on Hacker News means nothing in the grand scheme of things. It is simply a means to justify reading it constantly. After all, you’re not simply reading news, you’re engaging in a community. You’re there for the conversations and the people. I’m not buying it.

This is traditionally my problem with social media sites in general. They act as though they are creating a new means of communication and relationship: the idea that you can take your offline life and place it in a social network that allows you to scale up your social interactions in a way never thought possible. The problem is the pipe is not getting bigger. You only have so much attention and focus to offer the world. In computers, context switches slow things down, and people are no different.

When I was in school a student and I were working on a project. In talking to him about his interests he shared that he really didn’t like programming as much as he enjoyed math. More specifically, he enjoyed “computer” math, meaning algorithms that computers do well. He described that there are many times the best way for a computer to actually compute things is to stick to repetitive, simple calculations. As people, we often want to optimize because it seems wasteful to slog through tons of work, but a computer excels at this type of work. I don’t think we are really that different. Instead of doing 10 things all the time, it is better to do one thing at a time.

The real goal in avoiding the web is to relearn how to focus. The result is twofold. First off, I should be able to get more done. The tough problem that seems to be never ending will most likely become solvable when given some time to focus on it. Secondly, with more focus it is possible to spend that focus on real relationships. Listening takes a great deal of focus and I’ve come to the conclusion that my introverted nature makes me an ideal listener. Yet, even with this natural affordance, my lack of focus steals my effectiveness. This is something I’d like to remedy and it feels as though avoiding the web is a great first step.

Mon, 28 Nov 2011 00:00:00 +0000 <![CDATA[Stop SOPA]]> Stop SOPA

As a musician that cares about fans, copyright law and lawsuits like this one between Universal and GrooveShark are both frustrating and encouraging. I can’t stand that people have to pay a massive amount of money for pirating music. A song costs a buck, so if I stole 1000 songs, I should owe $1000. I’d even go so far as to say that the person enforcing the copyright could get a court order to find out who I had shared the songs with in order to enact the same penalty. This kind of enforcement, while still kind of dumb, has some semblance of logic and sanity. If someone steals your car, the insurance company doesn’t give you 150000% of its value to buy a new one. You are lucky to get what it is worth, and you still have the hassle of dealing with the entire situation. Stealing sucks and people shouldn’t do it, but it is also wrong to hand a windfall to those who were stolen from. That would make being a victim a profitable position, which is simply not natural.

Even though it frustrates me to no end that people are getting royally screwed for downloading music, it is a good thing to know that songs we write can be protected. I’m not asking the RIAA to sue kids out of existence for downloading music illegally. But it is beneficial to know that I do have rights for content that I create. When someone tries to use my music for something I don’t believe in, I can say no. When someone else is trying to sell my songs, I can say no and sue them or go to the authorities. These are good things because, while I’m personally fine with kids getting my music online for free, I’m not OK with my music showing up in pornography or sold by someone who isn’t going to pay me for the use. Having a copyright helps me control my work and create a career.

The nice thing about copyright currently is that it is primarily a system based on owner enforcement. It is the copyright holder’s responsibility to enforce their rights, otherwise a violation might be considered authorized if the copyright holder knew and did nothing about it. Again, this is a good thing. If someone downloads my music on bittorrent (and I’ve seen our new record on plenty of torrent sites), I’m OK with it.

What does all this have to do with SOPA? SOPA aims to make it easier to sue and penalize those who infringe on copyrights by allowing rights holders to do things like shut down payments to the site and make search engines stop displaying the site. Streaming copyrighted content also becomes a felony. The result of this sort of action is that copyright ceases to be a tool for artists to retain their rights and instead includes the government in the process of stopping piracy. The result is that major labels will most likely try to sue most web services out of existence for losses that were never truly incurred. Not to mention that it censors the web by allowing copyright holders a mechanism for abuse. Imagine a label that cuts some deal with Google and then sues them for copyright infringement. In theory, they could potentially have Google search results removed from Google!

The fact is we need the government to stay out of copyright. If you write or create anything, you now own something. It is your prerogative to share it and make it available or keep it private. By involving the government more and more in copyright, you invite them to your creative table, to your privacy and your property. They don’t belong here.

If you are a musician you should recognize that copyright is not why you make a living. You can make a living as an artist by providing great content. That content is judged as great (in terms of a career) by how well it motivates people to pay for it. It doesn’t matter if they steal the song if they come to the show and buy the record, a poster and a t-shirt. Instead of worrying that your music is being spread across the world without you getting paid, it might be a better exercise to consider why people don’t feel compelled to pay you.

Tue, 22 Nov 2011 00:00:00 +0000 <![CDATA[Studying Live Music]]> Studying Live Music

(NOTE: I found this post when perusing a folder of posts, so I don’t know what inspired me to write it. Not that it should matter, but I thought it might be relevant to me if I stumble on this post online)

I’m pretty sure I’ve said it before, but when I go to shows it ends up being more like a lecture than simple entertainment. No matter what band I’m watching, my mind wanders towards what they are doing right that has moved others deeply. How are they helping people enjoy the music live?

When I say “lecture” I don’t want to imply it is something sterile or boring. When I was in school for my history degree I took some classes that had graduate level parallels. For example, the grad students would have a much heavier load in terms of papers and readings. Those grad students at lectures were deeply engaged. They would listen intently, taking organized notes and writing down questions they planned on asking about or researching later. The lectures that would sometimes put me to sleep were enthralling to the grad students who had dedicated their life to study. When I say “lecture” I mean it in this graduate experience sense, where it is no longer a learning requirement, but enjoyment.

It can be especially interesting to see a live band who comes from a non-mainstream subculture. The big band in the little pond so to speak. Bands like this end up teaching both their own music and performance as well as what the scene is really about. Sometimes it also sheds light on why the subculture doesn’t break out on the mainstream, while other times it becomes clear why it might break out in the future.

I remember hearing a quote from Bowie saying that he could look at a band playing live and know immediately if they could make it. That is a pretty bold statement, but I believe he is right. When he originally said it, making it probably meant finding some success with a major label deal. I’d argue that today “making it” is a little more subtle and revolves around finding an audience rather than something concrete such as finding a label. Sometimes a band needs to find that subculture and genre that can and will support them, even if it means they are not likely to play stadiums. Other times it might be best to just stop worrying and play what the people want to hear. No matter what kind of band it is, there are always interesting things to learn.

Thu, 17 Nov 2011 00:00:00 +0000 <![CDATA[The Spectrum of an Audience]]> The Spectrum of an Audience

Melt Banana played tonight in their traditional grind meets pop magnificence. We watched the show at Mohawk on the balcony and had a great view of the crowd breathing with the music. Having a birds eye view I noticed there seemed to be a spectrum of different types of audience members.

Starting from the nucleus closest to the stage were those that felt the most. They had a crowd pushing on their backs the whole time as they watched first hand the energy and subtleties of the band. To them the crowd was nothing but a reminder that they were in a public forum where everyone was going crazy. Up front it is always one of the craziest shows you’ve ever been to.

Immediately behind the front row is what many would call the “pit”, but really it is more dynamic than that. A “pit” really should be a designated area for potentially dangerous dancing. It doesn’t really happen that often and I’d argue there wasn’t a “pit” tonight. Instead there was a collection of people that refused to let the crowd control them. If there was someone in front of you, move in front of them. If someone pushed up against you, push them back. None of it was angry or malicious. It is simply a place where people try to own their space along with others doing the same thing, all in time and inspired by the music.

Outside of those people moving in the middle were those that had the rare experience enjoying both the music and the madness of the crowd. The band was a short distance away and the people bouncing and moving to the music were immediately in front. Every once in a while you get shoved or run into, but it is just part of the excitement. The entire experience is entertaining.

Beyond this is where the show becomes more lax. People are there with friends, talking and drinking. Some take pictures with each other or of the crowd. A person heads to the bar to get a round of drinks. They bob their heads and generally enjoy being somewhere that others also wanted to be. This stratum doesn’t want to get sweaty or involved in the madness. They are there for the music and to hang out, with the latter sometimes trumping the former.

This is traditionally the rough layout of almost every crowd (at least outside arenas or venues with seats). Sometimes more people move, and other times the first row melds with those moved by the music, leaving the rest to those who are making appearances. No matter who is there, though, as long as people are enjoying the music, it is a great night.

Wed, 16 Nov 2011 00:00:00 +0000 <![CDATA[Mundane Exercises]]> Mundane Exercises

It can be frustrating to see a ton of code that needs to be refactored in order to use a new API. Often the process can be automated, but at the same time, taking the time to automate it might take longer than just plowing through it. Even though it can be a pain in the neck and you might feel like you’re slinging text, there is usually something to learn in the process.

I remember a time at my first job that I refactored some code that was written using a strange (at least to me) style. I found the style was meant for C code and it emphasized a means of debugging. Nonetheless, I found it extremely frustrating that I had to edit the file so much just to read and understand the code. My manager at the time said that was a good thing, but personally, I felt a lot of pressure to get this done quickly and thought the process was a waste. Now that I’m a bit older and wiser, I understand what my old manager meant. It is helpful to get hands-on with code and to see what breaks. Breaking things and getting your hands dirty is partially why you have tools like version control and test suites. At the same time, it doesn’t make you feel any better editing mundane text.

The way around this is to try and establish a repertoire of tools for repetitive work. You can always write scripts to read in files, look for some text and change it accordingly. Like others have said before me, learn regexes and try to enjoy them because they can make repetitive tasks much easier. But what if a script really is too much work and yet find/replace isn’t quite up to snuff? This is when your editor and your knowledge of that tool become paramount.
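When a script is warranted, it rarely needs to be much. Here is a minimal sketch of the kind of throwaway Python I mean; the `old_api.fetch` to `new_api.get` rename is a made-up example:

```python
import os
import re

# Hypothetical rename: calls to old_api.fetch() should become new_api.get().
PATTERN = re.compile(r"\bold_api\.fetch\(")

def rewrite(text):
    # Replace every call site, leaving the arguments untouched.
    return PATTERN.sub("new_api.get(", text)

def rewrite_tree(root):
    # Walk a source tree and rewrite only the files that actually change.
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as f:
                original = f.read()
            updated = rewrite(original)
            if updated != original:
                with open(path, "w") as f:
                    f.write(updated)
```

The win over a plain find/replace is the `\b` word boundary and the fact that you can dry-run `rewrite` on a single string before letting it loose on the whole tree.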

But before we get to editing, one thing that is extremely helpful is solid basic typing skill. Sometimes it is easier to just type things out rather than look up some keyboard shortcut or command. It is not a matter of raw speed; typing quickly with accuracy is the huge win. Now back to editors.

Seeing as I use Emacs, there are a couple of tools that have been really helpful. The first is regex search and replace. Emacs allows a rather large set of complex regex operations and performs the replacements incrementally, in a way that lets you make small edits midstream. Honestly, I always forget the keybindings for this, so I’m still working on getting this tactic into my own toolbelt.

The biggest win for me personally has been macros. In Emacs you can record a macro that will let you do anything in Emacs and repeat the process. This includes switching to other files, copying and pasting and executing shell commands. For example, I wanted to change some import statements in a set of files. I was able to grep for the statements I wanted to change and then create a macro to visit each point in the file buffer and adjust the import. It was extremely easy and made a somewhat error-prone manual process automated and reliable.

If you don’t use Emacs (shocking!!) then take the time to learn how to do automated editing in Vim (I hear it excels at this sort of thing) or via command line tools like sed. In taking the time to work through the mundane, whether it is with a tool or just slinging code, you’ll come out the other side understanding more.

Tue, 15 Nov 2011 00:00:00 +0000 <![CDATA[Web UI Frameworks]]> Web UI Frameworks

The web is an interesting place because it is both visual and text based. The browser provides a middle ground between text and what we see with GUI toolkits. HTML5 seems to expand this role in both directions by providing more interactive tools alongside metadata that makes documents carry more information.

My interests have traditionally been on the server side of web development. JSON as an interchange format has been helpful, allowing the simple goal of computing data and utilizing it in web applications. REST-based services utilizing JSON have become a welcome norm for building dynamic applications in web browsers.

What has always been missing for me personally are tools that make it easy to build a good-looking UI that fully utilizes a JSON data service. The idea of a UI created completely in the browser has always been appealing, but implementing such a system has always been difficult. Fortunately, that all seems to be changing.

In terms of making consistent, clean UIs, we have a wealth of CSS frameworks to choose from. I’ve used YUI and Blueprint with some success. More recently, Bootstrap from Twitter seems like a promising option for getting a great head start on a nice interface. Most of these sorts of tools focus on some basic defaults and patterns, but that is oftentimes extremely helpful when you’re like me and your interests are on the server side.

Another really interesting project that I hope to try is Obviel. It is basically a client side templating system that relies on JSON data to fill in the content. This isn’t an entirely new idea, but the fact that there are now plenty of options performing this sort of function suggests the idea is getting clearer and more concise. The issues that others found in creating systems like this seem to be getting solved, which is great news.

When I first got out of school I worked at a company that had an actual desktop application. I wrote a few GUI tools in addition to working on the actual applications. It was always fun designing a GUI, yet horrible to actually implement it. The nice part was that almost all the basic design questions were answered. You knew the colors and the menu structures. You had a suite of widgets to work from and they all were pretty consistent in look and feel. Actually implementing these ideas was never trivial though, primarily because C# was complex and required tons of code.

With web applications the complexity can live in the browser, the server or both. The result is that there are more opportunities to contain complexity. Fortunately, thanks to tools like the ones above, there seem to be actual means of hiding more complexity, along with tools that help keep the process consistent and simple. While I think many people see more excitement in HTML5 features like video and WebGL, I’m personally happy it is easier than ever to provide the services I enjoy with clean, simple interfaces.

Mon, 14 Nov 2011 00:00:00 +0000 <![CDATA[Writing and Testing Code]]> Writing and Testing Code

Throughout the years my methodology for writing code has changed quite a bit. I’ve tried tons of editors and IDEs and jumped back and forth from GUI tools to the command line many times over. Since settling on Emacs and learning some of its features, the common thread that runs between all the different tools has become clearer.

The most obvious thread is compiling code. Since we use Python at work, we don’t actually “compile” things, but we do test them, which happens to produce errors very similar to compilation errors. Fortunately, Emacs has a very handy compilation mode that helps make the process very fluid.

When I’m writing code my basic iterative process is to create a test that is re-run each time the code changes. Using Pytest I’m able to do just that. This pytest process runs in a compilation mode buffer and allows me to jump between the different exception / error file points and immediately open that file to the line where the error occurred.
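For anyone who hasn’t used it, the test files themselves are dead simple. This is a minimal sketch of the kind of file pytest discovers and re-runs; the `normalize` function is a stand-in for whatever code is actually under development:

```python
# test_example.py -- pytest picks this up by the test_ naming convention.

def normalize(name):
    # The code under test: trim whitespace and lowercase the name.
    return name.strip().lower()

def test_normalize_strips_and_lowercases():
    assert normalize("  Melt Banana ") == "melt banana"

def test_normalize_handles_empty_string():
    assert normalize("") == ""
```

Running `pytest test_example.py` reports each failure with a file and line number, which is exactly the kind of output compilation mode can parse to jump straight to the offending spot.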

The key is making these tests run as quickly as possible. In order to do this, I use what I call a Dev Server that keeps the test instances of any services up and running. It makes sure my databases are started alongside any service that I might be testing. This way my test suite doesn’t shut down a database and reinstall any stub data for running a single test that might not need the service in the first place. Likewise, for the tests that need a good deal of orchestration, all the services are already up and running.

The whole process is pretty well integrated. If I want to paste a traceback or some code, I have some Lisp code to paste it to our internal pastebin. I can then paste the link in IRC, which is also running in Emacs via ERC. I can even use a tool like pdb to debug directly in Emacs, opening the current file where the execution was paused.

Before finding the compilation mode I ran tests in a shell buffer or in a terminal. The problem with this was that it was never trivial to find the initial test error or see what failed. I would search around to find the error, but sometimes tracebacks were extremely long and even included a “pretty” version that added to the visual clutter.

There is still more I can probably do to help make my testing experience more helpful, but for now I’m relatively happy. One thing I’ve noticed is that many times my quest to write tests is partially a means to avoid switching to a web browser. While in theory this is a good optimization, I’ve found myself needlessly avoiding firing up the application and its dependencies even though my dev server handles that for me just fine.

Sat, 12 Nov 2011 00:00:00 +0000 <![CDATA[A Vegas Vacation]]> A Vegas Vacation

Lauren and I went to Fabulous Las Vegas over the weekend with some friends to celebrate some birthdays and generally take some time off. Even though Lauren and I travel quite a bit, it is rarely for fun, so we were really excited to have some time to enjoy ourselves without having to think about work or getting to the next show.

We stayed in Caesars Palace, which was surprisingly convenient. We have stayed in the MGM a few times, The Golden Nugget and The Stratosphere, and Caesars was by far my favorite. It seemed to be right in the middle of everything and it was pretty nice to boot.

If you’ve never been to Vegas, there is an interesting dichotomy on the strip and to a lesser extent downtown. Among the high class, gaudy super hotels like The Bellagio and Caesars there are some old, cheap casinos like Casino Royale and Bill’s that seem out of place. Much like the downtown hotels, they reflect the traditional feeling of old school Vegas. They are kind of dirty and run down with lots of mirrors and classic western appeal. The ceilings are low and reflect a no-nonsense approach to gambling. You go to play and not to gawk at statues or to see a show. Fortunately, my favorite places to play, Casino Royale, Bill’s and O’Sheas, were all right across the street from Caesars.

One advantage of the older small casinos is that they often are a little looser and offer lower limits on table games. I’ve had relatively consistent luck playing machines in the older casinos on the strip and downtown, whereas the large casinos seem to always steal my money. This trip my favorite game to play was craps, and the old casinos all had low limit games, which made my education a much cheaper process. I also found out that many of the cheaper casinos allowed larger odds bets, which helped me out on a few big wins.

Outside the typical gambling we also ate some great food. We had a 13 course meal at a place well off the strip called Abriya Raku that was phenomenal. It was a mixture of homemade tofu, rare winter mushrooms, raw and cooked fish along with homemade soy sauce. They also had a slick Japanese toilet in their water themed bathroom, complete with bidet and heated seat. Needless to say, I want one.

On our last day we ate at Lotus of Siam, which is easily the best Thai food I’ve ever had. It is also pretty reasonably priced, which can be a nice surprise after losing a bunch of money over the course of a couple days.

Consistent with my theme of enjoying old school Vegas, we hit up the steak house in the El Cortez (one of my favorite casinos). It was classic Vegas, and while the food was relatively standard, the inside looked like one of the dining rooms in the original Ocean’s 11. Another old school steak house we ate at was Hugo’s Cellar at the 4 Queens. This is a truly great steakhouse serving traditional steaks cooked to perfection. We had eaten there before this trip and both experiences were amazing. It is a lot of fun to head into the basement of an old casino for a great steak with some wine.

After our steak we headed out to play some games downtown and managed to catch High on Fire after hearing some brutal heavy music coming from Beauty Bar. They are a great heavy band that we were really lucky to see in such a small club with a tiny crowd. Having seen them open for Mastodon and headline in Austin to a sold-out crowd, it was really cool to see them play a small, intimate venue.

We really had an awesome trip. It was the first time in a very long time that coming home wasn’t a huge relief. We wanted to stay another day and secretly hoped we’d miss our flight so we could Priceline a room downtown and have one more night of fun.

One negative of Vegas is that it can be rather seedy at times. There are tons of people snapping cards advertising material of questionable morals. Also, it is sad when you see people gambling who probably should avoid it. Vegas is definitely a place where people can make poor decisions.

Personally though, I enjoy the games and glam. It is kind of silly and amazing at the same time. Visiting allows us a chance to live like money doesn’t matter for a few days. You might blow a couple hundred dollars in an hour or two and it is easy to chalk it up to dumb luck. Some might think that is extremely wasteful and I can’t really deny that, but at the same time, money isn’t everything and being able to experience that lack of concern makes me really appreciate the things I do have.

One of the things I really liked about craps was that there was a whole table of people having fun playing together. It was fun to watch others be obnoxious or wear some gaudy outfit. High fiving a person you just met or throwing a few bucks to the person who rolled the dice and won you some money is a communal experience. The money you’re up or down is not going to change your life, but spending time enjoying the company of others just might.

Sat, 12 Nov 2011 00:00:00 +0000 <![CDATA[Learning Emacs Lisp]]> Learning Emacs Lisp

If you use Emacs then you need to learn Emacs Lisp. Lisp as a language has widely been considered one of those languages that will change the way you program by opening your mind to new ways of thinking about problems. While I think this is true, my suggestion to learn lisp is completely practical.

One of the benefits of Vi(m) is that you can stay in the shell most of the time. Vim can be like an advanced pager at times, batch editor or diff tool because the workflow is to stay in the shell and open it on demand. This means if you are a Vim user, it is important to understand your shell and recognize it as a place to customize your environment. Emacs, on the other hand, is really like a text based OS. You start it up and use it to interact with the filesystem and files using the paradigms it offers. You have a file manager, process runner, manual reader, IRC, etc. all from within Emacs and all using a text based paradigm. Therefore, if you want to optimize your environment, you need lisp.

If you’ve ever known someone good at a GUI toolkit you’ll notice they are very quick to produce simple GUI tools. They see a UI and can immediately start coding something up that presents them with the data quickly. Likewise web developers can often whip up a quick HTML interface for some server process without batting an eye. When you learn Emacs Lisp, you are learning a similar skill in that you immediately can see places where you’d like to optimize your system and create a UI for it.

A really good example of where this makes a difference is in copying and pasting. It is one of those things developers do all the time. They paste code or URLs from things like terminals, email, browsers, etc. When you work in Emacs, you have the tools immediately available to grab the URL your cursor is on and open it in a browser without even having to copy and paste. You can grep for logs and save the output in a file. One of my personal favorites is doing a bunch of things in a shell, saving the shell session as a file, and then editing it into an example. What all these examples show is how you don’t have to think about moving text around in terms of selecting it, copying it to some clipboard, changing applications and pasting it. Instead you can see some text and immediately run commands on it, using the output immediately afterwards directly in your editor.

A great example of this workflow is a paste function I wrote. We have a local pastebin at my job (you can download your own version here) along with an IRC bot. As all the developers work remotely, the ability to paste code quickly along with links is very powerful. Previously, I had written a command line script in Python to accept stdin and paste the result to our pastebin. In IRC you could use a command and the bot would get the most recent paste by your username and print the URL in the channel. This was slick, but not optimal. It seemed a waste to maintain a Python script for POSTing to the pastebin. It also didn’t make sense that the bot would have to grab the URL since I should know the URL after POSTing the code. I rewrote my code to POST the selected code to the pastebin, use the mode of the current buffer to find the type of code (for highlighting), get the resulting URL and add it to the kill ring so I can immediately paste it in IRC, which I’m running in Emacs.
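For comparison, the original stdin-to-pastebin script could be as small as the sketch below (written here with Python 3’s urllib; the endpoint URL and form field names are hypothetical, not those of any particular pastebin):

```python
import sys
import urllib.parse
import urllib.request

# Hypothetical pastebin endpoint and field names -- adjust for your instance.
PASTEBIN_URL = "http://pastebin.example.com/api/create"

def build_payload(code, syntax="python"):
    # The pastebin presumably wants the snippet plus a highlighting hint.
    return urllib.parse.urlencode({"content": code, "syntax": syntax}).encode()

def paste(code, syntax="python"):
    # POST the snippet and return the URL of the new paste.
    with urllib.request.urlopen(PASTEBIN_URL, data=build_payload(code, syntax)) as resp:
        return resp.read().decode().strip()

if __name__ == "__main__":
    # Read the snippet from stdin, like the original command line script.
    print(paste(sys.stdin.read()))
```

Used as `cat snippet.py | python paste.py`, it would print the URL of the new paste, which is exactly the piece of state the Emacs version keeps in the kill ring instead.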

This might sound like a lot of work, but honestly it wasn’t much work at all. I’m not an Emacs Lisp guru by any stretch but by picking up a little here and there and getting used to the docs, writing small functions like this has become relatively easy and ended up being quite helpful. There are obviously times when it is simpler to just use another tool, but if you are a programmer and have dedicated yourself to an editor, then finding this flow when customizing your tools should be pretty natural and profitable once you get used to it. If you are using Emacs and not learning its Lisp dialect, then you really are missing out.

Wed, 09 Nov 2011 00:00:00 +0000 <![CDATA[Ideas and Tools]]> Ideas and Tools

It is interesting to see the progression of cloud technology. Amazon has been extremely successful selling the concept of cloud based systems and it has had an impact on the entire software landscape. Companies like 10gen who make MongoDB have a product that assumes your datacenter lives in the cloud and that having multiple machines to provide replication is a given and that all your indexes can fit in memory. I don’t think we’d see things like MongoDB and the wealth of virtualization tools had it not been for Amazon’s initial push. It was a pretty powerful idea.

What has developed since then is a suite of tools that consider the machine the unit of deployment. Instead of pushing code to a server directory, you spin up a new image and reinstall some application code. Instead of having constraints like a single directory to keep files, you have no durable filesystem whatsoever. The data must be kept outside the application at all times because rebooting the machine means reinstalling everything. Again, it is a powerful idea.

The machine being the unit of deployment has impacted developer tools. Things like distributed version control can make it trivial for a script to check out some code or configuration on a server to set things up. Tools like chef and puppet help manage configuration across any number of machines that have been configured from a virtual machine image. Other tools like fabric allow the automation of running commands on a machine. All these tools build on the assumption that you have a set of machines that are (more or less) exactly the same and need to be consistently configured.

The point is that the ideas behind much of the cloud have had a huge impact on the tools. It makes me wonder, though, what might have been possible had there been a different set of ideas in place. What if EC2 were more like a chroot jail with a single directory to run processes in? Would we still see so many NoSQL databases, and would they be as focused on being distributed? Would tools like puppet and chef still be as widely used, or would we simply see more use of rsync? Would something like Hadoop be simpler by allowing direct use of the filesystem instead of its HDFS? It is interesting to think about.

At some point though there seems to be a moment where trying to do things differently isn’t really worth the time. Python has been a language that has competed with typical shared hosting and has not been very successful. I think most Python developers consider it a given that you should use a VPS to host a Python web app. But is that really the case? WebFaction has been successful providing Python hosting in a shared environment, so we know it can be done. Yet, it seems that the ship has sailed and we’d rather always configure entire machines rather than untar a tarball in a directory.

The irony in my mind is that the power behind the ideas has pushed the tools so far that the seemingly inefficient process of setting up entire machines has become easier than simply writing applications under the assumption there is a filesystem and a single directory to work from. It makes you wonder if all the presumed savings behind the cloud really are worth it. Virtualization is a double edged sword in that it provides flexible environments quickly while abstracting away access to efficient hardware. All those CPU cycles we use to make sure application A doesn’t interfere with application B by providing two different machines could have been used by both applications had they been written more carefully to use a sandbox on the filesystem. Maybe it is too cheap to matter, but it still makes me wonder.

I don’t think I really have an important point here beyond the fact that it is important to question the status quo every once in a while. I’ve always been a fan of the cloud, NoSQL and distributed version control, while at the same time looking back at my time with PHP and shared hosting and realizing how easy it was back then to get an application out the door. Maybe we are doing it wrong.

Wed, 09 Nov 2011 00:00:00 +0000 <![CDATA[Distributed Computing and Flat Files]]> Distributed Computing and Flat Files

Lately, I’ve been thinking a lot about different MapReduce implementations. I noticed a common theme: most systems establish some sort of file system. One reason is to have some level of abstraction over the distributed data, but the more I think about it, the real reason is that you need atomic pieces in order to compute something in a distributed manner.

The great thing about a file is that you can always read it. At some level all databases end up reading files (except the in memory ones...) and this works well because reading doesn’t interfere with anyone else. This is why you see web servers handle static files an order of magnitude faster than dynamic pages. A dynamic page has to consider state and anytime you need to check the state that means something gets computed or something changed. In order to distribute computation over distinct resources, you need to avoid those state changes and what better way than files.

With that concept in mind, I want to test the speed of reading files in different systems but the process of testing it is not apples to apples because some system might do some low level tricks or read other files. For argument’s sake, I’m going to disregard the physical disk. If a disk is faster then it will be faster and if you are doing something at a very low level to be sure data gets written in a way that avoids seeks on the platter, then it is beyond simply “reading a file” because it is dependent on how it was written. Hopefully this gives me a decent enough picture to see what reading a file really costs.

I’m going to throw out the typical disclaimer that these are just silly benchmarks that hopefully help me draw a general conclusion regarding reading files. I’m not trying to say one language is faster than another past finding out whether a higher level language using a C function is an order of magnitude slower than a compiled language that uses the same C system calls.

The two languages I chose are Go and Python. Python should be an obvious choice because I use it every day and if I were to write some sort of distributed processing system based on reading files, it would be Python I’d turn to first. The decision to use Go is because it is a “system” language and I’m somewhat interested in it because it does do concurrency well, even though this test has absolutely nothing to do with concurrency.

Here is the Python:

import sys

if __name__ == '__main__':
    fname = sys.argv[1]
    print 'got', fname
    afile = open(fname)
    count = 0
    chunk = afile.read(512)
    while chunk:
        chunk = afile.read(512)
        count += 1
    print count

We are reading the file once through in 512 byte chunks instead of looping over the lines, simply because in Go we’ll have to read the file by chunks.

Here is the code in Go:

package main

import (
	"fmt"
	"os"
)

func main() {
	var fname string = os.Args[1]
	var buf [512]byte
	var count int

	fmt.Printf("Got %s\n", fname)

	afile, _ := os.Open(fname)
	for {
		nr, _ := afile.Read(buf[:])
		if nr == 0 || nr < 0 {
			break
		}
		count++
	}
	fmt.Println(count)
}
I’m just learning Go and honestly don’t know nearly as much as I’d like about C, so please take this with a grain of salt.

I ran both programs a few times using the time command. The results were pretty similar.

# Using Go
real 0m0.049s
user 0m0.012s
sys 0m0.036s
# Using Python
real 0m0.054s
user 0m0.044s
sys 0m0.012s

The conclusion I would draw is not that one is faster than the other, but simply that reading a file costs about the same in most languages. This is an important point because it means that a database is typically going to have a similar level of overhead just reading files. If you can optimize things further once you open the files, there is a decent chance you can create something that is faster than a generic database. This optimization is probably non-trivial. Parsing the content of the file could be expensive and navigating the contents could also be difficult. That said, most databases keep a customized format on disk in order to find delimiters and limit creating tons of files. In a MapReduce type system, looping over all the files in a directory and knowing that each whole file needs to be read as a single entity could actually be an optimization vs. having to seek to different portions in order to find specific information.
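To make that idea concrete, a map step that treats each file in a directory as an atomic record read start to finish is only a few lines of Python (the function name and shape here are mine, not from any particular framework):

```python
import os

def map_over_files(directory, mapper):
    # Read every file whole -- no seeking -- and apply the map step to it.
    results = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            results.append(mapper(f.read()))
    return results
```

For example, `map_over_files("data", lambda blob: len(blob.split()))` would yield a word count per file, with no delimiter parsing or seeking at all.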

No matter what sort of system you need to build, it is good to know that you are building it on the shoulders of all the great software engineers that came before you. At some point the concept of a “file” became a critical interface and that interface still stands as a valuable tool to getting work done. It is not surprising that the simple file still manages to be at the center of innovation as we move towards more and more cores, crossing massive networks.

Wed, 09 Nov 2011 00:00:00 +0000 <![CDATA[MongoDB Headroom]]> MongoDB Headroom

There is a thread on Hacker News that discusses some issues someone had with MongoDB. The party line on MongoDB is that while it has problems, 10gen (the company behind MongoDB) is very supportive and that is definitely true in my experience. My problem with MongoDB has never been support, but rather that it offers no headroom.

Bass players often talk about how much headroom an amplifier offers.

What they are talking about is how loud you can turn it up before things start to go bad. Bass is not like guitar because the low frequencies it produces have the ability to break things. It shakes material and can blow out a speaker with long waveforms, which means the speaker cone travels a long way with every movement. Having headroom means you can still turn it up and not worry that things are going to break.

In our experience, it feels as if MongoDB never gives us any headroom.

The usage pattern we’ve had has been to release code, notice a problem in MongoDB days or weeks after the code change, and try to understand what changed, knowing the code didn’t change recently. There have also been times when we did change code and saw an immediate negative effect.

I remember one instance where we added an index (we had only one at the time) and our performance ground to a halt. Now obviously you can’t assume that adding an index is without cost, but since we had only one on the collection, you’d think that adding one more would be pretty reasonable. This is what I mean when I talk about headroom.

At this point I’m not a fan of MongoDB. I don’t understand where its sweet spot really is. Using it as a key value store is pretty decent if you don’t have many writes. We have a gettext-like service using MongoDB, and language catalogs don’t get updated too terribly often.

MongoDB has been just fine for this kind of usage (although, I’ll point out that a client uses HTTP and caching to make it as fast as gettext).

We download lots of data from MongoDB and it seems reasonably fast, but we have also had to move some processing out of queries because it was more expensive than just doing it on the application side. We have also had to implement an implicit external index pattern, where one database keeps an index on another database’s data, the “databases” being MongoDB databases. Outside of the obvious potential consistency issues, this pattern has worked pretty well. Yet even though it is “working”, we are constantly concerned about the next outage and when memory will start going up for no apparent reason.

There is obviously a reason for our problems and I’m sure our code is part of the problem. But, I also don’t think we are doing anything massively outside the realm of what MongoDB should excel at, yet it always feels that way. We’re not talking about terabytes of data we have to crawl through all the time. The documents are all effectively atomic.

Once they are done being written, they are never written to again. Yet, even with these obvious constraints that should help with performance and provide headroom, we seem to have none.

I have a feeling that we will continue to use MongoDB, but I’d like to figure out exactly where it excels and use it there. Otherwise, I’d be happy with other systems picking up some of the slack to provide a little headroom and a better plan of attack for scaling into the future.

Sun, 06 Nov 2011 00:00:00 +0000 <![CDATA[Music Needs Better Metrics]]> Music Needs Better Metrics

I’m not sure if people are really aware of how poorly the music industry works with metrics. While I’m sure major labels do provide revenue projections and other similar corporate goals, I’m also sure a major label does not consider all the available data before making a decision.

The argument might be that there is some “it” or “X” factor that makes something work, but that is simply not true.

Beyond making decisions, often times people in music turn to the charts for a sign of how an artist is doing. For those that don’t know, there is a company called Soundscan that handles tallying up the albums purchased. They weigh the values in some cases depending on the music store, so something that sells well at Amoeba Records is going to have a bigger impact than something that sells well in Boise. Presumably, this is a pretty decent metric based on actual sales. The problem is that it is easy for a label or band to take a few hundred copies to record stores, put them on consignment and immediately purchase them back. The seemingly quality metric of records sold is actually rather inconsistent because it doesn’t take into account the time or situation.

What I wish would happen is that bands could keep excellent records of everything they are doing. Everything could be recorded. At the end of the night some clubs give a rundown of who paid, guest list, expenses, advertising, hospitality, etc. All this kind of information could end up in this system. It could also be cross-checked by the venues and bookers putting on the shows. Then, you’d tie in the record store sales alongside the time/date when the sales took place. You could also allow weighting based on certain factors like the number of shows going on in town (if all the bookers used the system, this would be automatic), other competing events, extreme weather, etc. If you were to analyze all these details, I think they would reveal a much better picture of what a band is really doing in terms of getting new fans.

The nice thing is that with the right data, you could see a band doing well regionally and predict how they might do nationally. For example, if a band played 2 shows in an area and made a ton of sales, it might be an indicator the band has the chance to explode. Similarly, if the numbers reveal a slow build for a band, you could assume it might take longer to develop the artist, but that it is worth it for a long-term market. When you have this sort of information, you might think twice about advancing the band hundreds of thousands of dollars to record a new record. You’d have a chance to establish a more realistic budget for where to spend money. It might be worthwhile to keep a band on the road and focus on tour press because they win fans playing live.

Other bands might be blog darlings where keeping a band in the studio is more beneficial. The metrics would never be 100% accurate by a long shot, but at the very least you could try.

The lack of good metrics has surprisingly gotten worse with social media. Having a “fan” on Facebook is meaningless in terms of turning that relationship into something valuable, yet the number of likes is often considered a critical measure of an artist. The same situation happened with MySpace and the number of plays you had. The result is that people cheat. It is easy to find services to game the social networks and give yourself the best numbers. The sad thing is that the opportunities you get from positive social metrics are only going to result in false positives. The band with a million hits on their viral YouTube video might be given a large guarantee for a show, but the “fans” who clicked a thumbs up don’t necessarily want to pay $12 to see some band play the one song that was in a video on YouTube. You’re not getting “fans”; you’re just getting a click.

The online advertising business has started recognizing this problem and has turned to metrics to help argue the case that ads on the web do make sense. They are proving their value and providing a picture of what it takes to go from random stranger to customer. I don’t see why the music industry doesn’t take the same tack. Instead, it is as though they would prefer to fail over and over until they get that one “hit” rather than do a bit of research and see if their first impressions really do equate to value. It is sad really because it wastes the talents and hard work of a whole industry.

This is all not to suggest having fans on social networks is not important. I hope to use social networks to connect with our fans whenever possible. Social networks are a great medium to communicate things the band is doing or to spread new music. That doesn’t mean every click we get is a dollar in the bank. It is nothing more than a click.

But the 10 CDs sold at the show with 20 kids on a weeknight in Boise, ID? That might be a good sign the band is worth the time and money to support.

Sat, 05 Nov 2011 00:00:00 +0000 <![CDATA[Pivoting Code]]> Pivoting Code

Programming as a culture has changed quite a bit throughout the years.

The focus on purely geeky topics has faded as more people “startup” companies and consider how they can follow in the footsteps of other great hackers and innovators. The subject matter of most tech news relates to things like scaling, game theory and understanding markets. There is still plenty of information about purely geeky pursuits, but I would argue the geek of today is not interested in the same geekery as those of yesteryear.

The term “pivot” as taken from the startup world means to change the focus of some business in order to find success. The goal with pivoting is that you cease to spend time polishing a turd and instead change your focus to where actual customers are. This concept is not limited to startups though. It is important to understand when to pivot actual code as well.

At work we’ve been reworking an important system and throughout the process I’ve been challenged to change the way I think about some of the problems. My mentality up until this point has been focused on being a good steward of the previously written software and making every attempt to provide a fluid transition from one system to the next.

Unfortunately, I needed to pivot. The fact was that writing code that let every old test pass was not a sustainable measure. The old tests were wrong in some cases. They contained assumptions that were no longer true, and adjusting the tests or the code to bend to these assumptions only reduced their benefit. The converse of fixing the tests often meant doing a lot of work on new components that really had not been specified.

The pivot came to me when I realized that I had hit a barrier where the new code needed to start becoming intermingled with the old code. That is when tests run in isolation become worthless, because nothing is actually using the code. I hope this sounds familiar, because it is almost exactly the experience a startup founder has when the product is actually released. If no one is using it, then what is the point?

With this barrier firmly in place, my next steps were to see what was working and failing. In analyzing the current state, it became clear that I wasn’t nearly as far off as I anticipated. I could continue some of the work adjusting the tests and fixing bugs, and it was actually beneficial. What was important though was that I needed to analyze the current state of things before moving on. When pivoting code you have to take time to see where things are making sense and where they are not.

It is not fun work because it is monotonous and requires focused poring over the code, but it is beneficial and will help in pivoting how you work on the code.

The practice of performing a pivot is as follows:

  1. Stop and recognize when things are too complex
  2. Analyze the current state (what breaks and what doesn’t)
  3. Establish the metric that matters (fix what breaks or write new tests)
  4. Take action

Personally, I’ve always had a problem making major changes to code. I’d either just want to rewrite everything or fix it. There was no in between. For me, pivoting is a good way to organize the middle ground.

It manages the mental complexity my brain is trying to add in a way that is productive so I can move forward. If you ever find yourself stuck in a rut with some code, there is a good chance a pivot would be helpful to get things moving again.

Fri, 04 Nov 2011 00:00:00 +0000 <![CDATA[Quality Libraries]]> Quality Libraries

Have you ever read about a slick library that someone just released and thought how awesome it is, yet you don’t immediately start using it?

This happens to me all the time. People are constantly releasing interesting code that seems applicable to the projects I work on. Yet, even though it seems like a decent solution to problems I’m having, I’ll rarely try it out.

The reason is pretty simple. It takes time to learn some library and it takes experience to understand it well enough to make a qualified decision. If I tried all the libraries that I saw, nothing would get done! The converse to this is that I’m missing out on some packages that actually would be a good fit.

This issue came up when I read about Obviel, a JavaScript view/template library. We are currently in the process of migrating many services from being a mix of RPCs and HTML-based Ajax to a more RESTful platform. Obviel seemed like a pretty slick solution. The blog post I read about it also mentioned Fantastic, which seemed like another slick tool to help with static resources. Even though both seem like interesting solutions, the doubt comes up in my head that they will not continue to be maintained, or that I will choose these packages only to find a killer bug sometime down the line that causes us to switch.

The reality is I have a classic case of NIH syndrome. When code is open source/free software my level of comfort should be high. I can always try to fix my own bugs. I can always take bits and pieces if need be to make something customized. If we have to switch 6 months or a year down the line, that is OK. That is 6 months to a year of time that we got for free because we didn’t have to write or maintain some code.

I’m going to try and make an effort to try out the packages that look interesting to me. I did spend some time with Backbone.js and Underscore.js, both of which were very nice. The packages mentioned above seem really interesting and might fit extremely well in our homogenous environment built on CherryPy. Instead of being worried about the bad things happening, it seems much better to just give things a try.

Wed, 02 Nov 2011 00:00:00 +0000 <![CDATA[Keep Consistent Specialized Data]]> Keep Consistent Specialized Data

First off, I have to mention that I’m still rather sick. This thing has been going strong for 3 days now, which is really rare. I don’t believe I’m getting fevers as consistently, but it doesn’t change the fact I feel really weak and thinking has been a challenge. Hopefully I’ll start feeling better today, as I’ve made a decision to let up on the rest and make sure I get more work done. Being sick sucks.

Speaking of work, we recently put in a good deal of work testing a specialized solution for analyzing data. The initial tests were promising, but in the end I think there were some limitations in the actual implementation tools that kept it from seriously flying. This has me thinking about how you can make specialized processes for data in order to get the speed you need while still keeping some element of consistency across the different data sources.

The problem that comes up is that getting the data into some specialized format for special processing has a rather expensive cost. This expense is traditionally in the form of extra time. One solution is to make sure you write in the specialized format initially instead of pulling it from some other database. The problem then is that the original database, which might have been good for something else, has to jump through hoops to get the data into the format it requires. This leads to the solution that you might write it twice. Disk is cheap after all, so when you save your data, write it to the database and also write it to the specialized format. The potential problem here is that the transaction to write the data is rather lengthy and failures might be hard to handle. It really isn’t an easy problem.

One solution is to rethink the flow of data. Instead of the data being something that sits in a database until you query it, the data could instead be part of a workflow of transformations. This doesn’t mean that you don’t put your data in a database, but it does mean that you don’t always have to. If some portion of the workflow doesn’t require updating the data, a flat file might be a great way to handle the data. The goal is that by providing a workflow with the data, you change the expectations and expose the real qualities of the data at each step in such a way that you can capitalize on them.
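To make the workflow idea slightly less vague, here is a minimal sketch of what it could look like. Everything in it is hypothetical (the field names, the JSON-lines input, the CSV output); the point is only that each step consumes records and hands them to the next, and a read-only step can target a cheap flat-file format rather than the database.

```python
import csv
import json

def load(path):
    # Source step: read one JSON record per line from a flat file.
    with open(path) as fh:
        for line in fh:
            yield json.loads(line)

def transform(records):
    # Transformation step: derive a value without touching a database.
    for rec in records:
        rec["total"] = rec["price"] * rec["quantity"]
        yield rec

def write_csv(records, path):
    # Sink step: a specialized flat format for downstream analysis.
    fields = ["sku", "price", "quantity", "total"]
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)

# Chaining the steps forms the workflow:
# write_csv(transform(load("orders.jsonl")), "totals.csv")
```

Only the steps that actually need to update data would swap the flat-file sink for a database write; the rest of the pipeline never pays that cost.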

This idea is extremely vague because even though I’ve been thinking about it for a while, an actual implementation hasn’t come up that makes obvious sense. Hopefully someone reading might have some ideas on how this design could work and what tools would be helpful.

Wed, 02 Nov 2011 00:00:00 +0000 <![CDATA[Being Sick]]> Being Sick

It is pretty frustrating to be sick and work from home. There is a subtle pressure to keep working even though you feel awful. After all, it is not as though you need to get dressed up. You can’t get anyone else sick. It isn’t that much work to be on your computer anyway. All of that doesn’t mean you feel any better. I thought I had fought it off before tour, but obviously not. Tonight I’m hoping a good night’s rest helps turn the tide and break my fever.

Tue, 01 Nov 2011 00:00:00 +0000 <![CDATA[Graphs and Stats]]> Graphs and Stats

Seeing as I work for a company that does statistics, you’d think I would be really good at it. Unfortunately, I’m not. It is an entire world of math and formulas that I have limited experience with, and I have yet to have requirements that force me to acquire that experience. What I do see pretty consistently are graphs. I can say with the utmost confidence that graphs are not what statistics are really about.

A graph is all about communication. It is a visual representation of some data that is meant to convey some relationship or lack of a relationship. Graphs can be changed and manipulated to express differences as huge jumps or radical contrasts. They can be used to reveal real trends and find outliers. A graph is just a tool used to communicate, which means it can be accurate or biased.

Statistics on the other hand are math, plain and simple. They are formulas and theories that help take answers and calculate relationships with specific, understood precision under specific constraints. From the outside it looks like statisticians looking at graphs and giving their thoughts, but really it is calculating values and adjusting for inconsistencies in order to put a number on some trend. This was a tough thing to understand when I took stats in college. The class was really all math and we were told very specifically how to phrase results. The reason is that stats is not about predicting or finding correlation; it is about assigning values to results, much like equations are used to map things like sound waves and graphics.
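As a tiny illustration of "assigning values to results" (the sample data below is made up), Pearson's correlation coefficient is just a formula that maps two samples to a number between -1 and 1; it quantifies a linear relationship without claiming to prove anything about it:

```python
def pearson_r(xs, ys):
    # Pearson's correlation coefficient: covariance of the samples
    # divided by the product of their standard deviations.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear -> 1.0
```

The number says how tightly the samples move together; whether that relationship means anything is a separate question, which is exactly the distinction between a statistic and a conclusion.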

I mention this because recently I’ve seen some graphs on social networks with commentary on the results as though something has been proven correct. One thing my experience at my job has taught me is that statistical data doesn’t prove anything. When I see graphs I appreciate the communicative aspects, but rarely do I trust the conclusions. I’d rather see the stats.

Sun, 30 Oct 2011 00:00:00 +0000 <![CDATA[I Can't Sleep Right Now]]> I Can’t Sleep Right Now

I’d like to take a quick nap but I’m parked outside some dorm room and some guy is screaming the words to different songs. It is distracting when I’m trying to sleep. But it is also inspiring.  We’re playing a show in the basement of the dorm. I hope this guy comes down, buys a record and screams to our music. That would be awesome.

Fri, 28 Oct 2011 00:00:00 +0000 <![CDATA[Agile Specs]]> Agile Specs

Agile development methods typically don’t mention specifications as part of the process. You might hear things like user stories, but a specification is typically considered territory for the waterfall methodology. This is really a shame because a spec is a great tool when writing software. A good spec can help clarify purpose, known limitations and define finite details that might otherwise be left undefined. These sorts of constraints help to define what the big picture really is for the software.

What makes a spec fit in well with agile practices is a reference implementation that is developed alongside the spec. As your spec helps to define a larger design, the reference implementation helps keep the unknown details in focus. When you don’t have a reference implementation, it opens the door for massive feature creep and the design of features that are unnecessary. It is like trying to paint a picture with one big brush. It doesn’t work because you can only paint an enormous image with no real detail. The reference implementation provides an opportunity for detail, which means that while some system may become complex, it is still manageable because you have the big picture in the spec and an implementation that supports the design described by the spec.

My spec writing skills are not the best, but I’ve found it is worth it to try and design up front. Just as comments have been an essential tool for me to keep track of what I’m thinking when writing code, a short spec is helpful for logically working through the details of the application. It also can generally morph from technical notes into documentation, which is always helpful. One way software is like art is that you can look at it every day and still not grasp everything you see.

There are always small bugs or features that you may miss. A spec gives you a cheat sheet for better understanding code no matter what development method you use.

Thu, 27 Oct 2011 00:00:00 +0000 <![CDATA[Driving Music]]> Driving Music

I missed out on writing something yesterday because I had a million other things that needed to get done before we started driving north for a short tour. Fortunately, Lauren grabbed my all time favorite record to drive with, Sea Change by Beck.

The sounds on the record feel so warm and inviting. You can almost see the musicians sitting around playing and watching each other for changes in order to give each riff the perfect subtlety. There is a looming sadness that adds to the warmth of each track that makes you want to smile at the music and yet be completely silent in light of the tragedy painted by the lyrics. The opening line of the record helps to secure its space as a great driving record. “Put your hands on the wheel / Let the golden age begin”.

The other record we listened to was The Hunter by Mastodon. I’ve listened to Mastodon quite a bit but never a whole album the entire way through. Heavy music can be tough for long stretches at times, but The Hunter does end up being pretty reasonable to listen to. The vocals get really eerie and dark while not losing clarity. The subject matter seems pretty typical of metal, with things like blood and fire coming up pretty consistently. What stands out more than anything is the drums. That drummer never seems to play straight ahead. I kept listening for a beat that wasn’t altered by a roll or rhythmic pause and can’t remember one moment. While this can be difficult to focus on at times, it totally works. Eventually the rolls and misses merge together and spread out the sound of the song like a knife spreading a big glob of peanut butter on a piece of bread.

I have to be careful with driving music because it can get me thinking about other things. Whether it is playing music or remembering some past experience, if a song takes me somewhere it can make the actual driving secondary, which is never good. Be careful on the road!

Wed, 26 Oct 2011 00:00:00 +0000 <![CDATA[Models are Important]]> Models are Important

Models are a critical part of any large system because they define what can and cannot be expected. The NoSQL set of databases are praised because they are “schemaless” but I believe that benefit is really a hindrance. If you do not have a schema, then you can never truly know whether or not some data exists. It is possible to fake this knowledge for a time by just assuming some attribute will exist with a usable value, but that practice quickly begins to fail when you have to make any changes to these sorts of assumed values. The better thing to do is define the schema using models.

By “model” I do mean the “M” in MVC. Most people associate models with ORMs, which makes it especially easy to drop them when creating a system that doesn’t require mapping to a relational paradigm. The assumption becomes that the “model” is the data and that is all you need. This is completely false. The data can have “extra” information a model might not use, but the model still provides the contract that others can depend on. In fact the model can give the appearance of data existing when in fact it is a calculated value using a business-defined algorithm. This is the power of a model.

Where this realization became especially important recently was in a central concept in our code called a “context”. The context provides a singleton that, based on configuration, calls specific methods according to the “context” defined in the configuration. This sounds pretty slick, so let’s see an example:

config = {'env': 'devel'}

# The original import paths were lost; these module names are hypothetical.
from devdb import save as devsave
from proddb import save as prodsave

class Context(object):

    def __init__(self, conf):
        self.conf = conf

    def save(self, key, value):
        if self.conf['env'] == 'devel':
            devsave(key, value)
        else:
            prodsave(key, value)

This is a contrived example, but hopefully the point is clear. Instead of always checking the config everywhere, you keep that test in your “context” and provide a single function to call. This is a clever way to get around constantly checking your configuration and polluting your logic with redundant if statements.

But why are you doing this in the first place? Just pass the configuration details to your models and let them deal with the connection. If the configuration is pointing at a development database, then you are good to go. You never need the if statement in the first place.
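A minimal sketch of that alternative (the class name, the configuration key, and the dict standing in for a real connection are all hypothetical): the model takes the configuration once and owns the connection, so calling code is identical in development and production and never branches on the environment.

```python
class Store(object):
    """Model-layer store that owns its connection details."""

    def __init__(self, conf):
        # conf already points at the right database; no env checks needed.
        self.dsn = conf["db_dsn"]
        self.data = {}  # stand-in for a real database connection

    def save(self, key, value):
        self.data[key] = value

# Callers look the same regardless of which environment conf describes:
store = Store({"db_dsn": "postgres://localhost/devel"})
store.save("answer", 42)
```

The environment decision happens exactly once, when the configuration is built, instead of being re-made at every call site.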

Going the other way, if you need to get data out, you can just ask your model to do it for you. The model will pull it from the database and validate the contract is correct. You never will have to see code that checks different values to understand the current state. For example:

if doc.get('name') and not doc['name'].startswith('/implicit'):
    create_element(doc['element'], name=doc['name'])
elif not doc.get('name') and len(doc.get('children', [])) > 0:
    name = doc['children'][0].get('name', gen_name())
    create_element(doc['element'], name=name)

This kind of code is terrible to debug later because you never really know what concept you’re enforcing. Why does it matter that the name starts with ‘/implicit’? Is that a URL or a path? What scenario would there not be a ‘name’? Is that a fix for an old bug that introduced bad data? Why are you calling “gen_name”? What does that really do? Models can help with all these issues.

A model can encapsulate not only data but a business concept. The above code can become as simple as:

element = Element.create_from_doc(doc)

This works because the information for what is optional is defined in the model. For example, the name attribute we tested above can be generated immediately if it isn’t present:

class Element(Model):

    @property
    def name(self):
        if not self._name:
            self._name = gen_name()
        return self._name

The concept of “implicit” that was mentioned above can also be encapsulated in a property or method. Not to mention, it can be tested easily and in isolation instead of having to construct complex state in order to traverse a particular code path.
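For instance, the name-generating rule can be exercised directly, with no document or database state at all. This is a hypothetical sketch, not the actual codebase; `gen_name` and the base class are stand-ins:

```python
def gen_name():
    # Stand-in for the real name generator.
    return "generated-name"

class Element(object):
    def __init__(self, name=None):
        self._name = name

    @property
    def name(self):
        # Encapsulates the rule "generate a name when none was given".
        if not self._name:
            self._name = gen_name()
        return self._name

# The business rule is testable in complete isolation:
assert Element().name == "generated-name"
assert Element(name="header").name == "header"
```

Compare that to the earlier if/elif block, where checking the same rule meant constructing an entire document with the right mix of missing keys.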

I remember when I first learned about MVC and Rails and it seemed like a handy way to organize code. You had fewer questions to answer such as where to put files and which files were for working with the database vs. filling in a page template. I was naive to the problems MVC was truly addressing. The goal of a model is to encapsulate concepts in order to minimize confusion. This happens by making sure your models provide expectations via the contract within their API.

Mon, 24 Oct 2011 00:00:00 +0000 <![CDATA[Thinking About Walking Dead]]> Thinking About Walking Dead

We started watching the AMC series, The Walking Dead. It isn’t anything terribly special in terms of the story or acting, but we enjoy watching it with our friends and it gives us a good chance to hang out. What I do like about it is that you constantly think about what you might do in a situation like that. What happens when a plague turns people into flesh eating monsters? Do you fight or flee? Is it worth the effort? Do you just let go and sacrifice yourself alone or in a blaze of glory? It is an interesting thing to think about and I believe that is why the series has been a success.

Since Lauren and I have been feeling under the weather recently, we’ve been watching more TV than typical. We rented The Tree of Life, another movie that is not as interesting until you think about it. Lauren is exceptionally good at this sort of thing, which makes an abstract movie like Tree of Life a lot more of a positive memory than a bit of wasted time. She is able to pick out ideas and communicate them so clearly that things you might have been thinking begin to take shape. It is a really great talent and skill she has honed throughout the years.

It is interesting how simply thinking about some subject changes your perception of it. There have been tons of bands and movies that I heard and walked away from initially wondering why they were important or a big deal. Then, after thinking about things, hearing/seeing more and generally letting whatever it was sink in, I end up really appreciating it. I like that art can change and that you can change yourself in order to enjoy it.

Sun, 23 Oct 2011 00:00:00 +0000 <![CDATA[OS Panacea]]> OS Panacea

You can’t have everything. I updated my VM to use the latest Ubuntu and while I like it just fine, it makes me want to delve deeper. I’d be alright using it every day as that is what I used to do, but alas, OS X is pretty darn reliable where it counts (the network just works). The appeal of flexibility is often trumped by the reality of quality. When things just work it is hard to justify using something else.

NOTE: I wrote this post yesterday. I’ve been feeling sick the past few days and it felt like whatever sickness was waiting for me to push myself too hard before kicking in with a vengeance. My conclusion is that living like you are going to get sick is just as bad or worse than being sick. So, yesterday Lauren and I hung out with friends, hence this post coming a day late. I’m still on course for blogging every day for 30 days!

Sun, 23 Oct 2011 00:00:00 +0000 <![CDATA[Writing PHP]]> Writing PHP

PHP was my first programming language and set the stage for the rest of my career. This is both a good and bad thing. I wish my experience had a bit more compilation involved, but it didn’t. I started with a dynamic language that was special cased for inclusion in HTML.

Once I exited the world of PHP, I really didn’t look back too often. I had a couple scripts I maintained for the band website, but that was pretty much it. Recently though, I found CASH and it got me thinking about adding some code to the band website.

It is interesting coming back to PHP after so long away. I can still read PHP code, but that is a far shot from writing it from scratch. I had to look up basic syntax like how to create a class and loops. The whole public/private/protected paradigm was also rather awkward looking and it seemed best to avoid it. One funny thing is that my yasnippets library didn’t come with any pre-bundled PHP snippets. That was surprising, but probably says more about the recent popularity of the language.

The thing I like most about PHP is how simple it is to deploy. If you want to do something dynamic in a web page (note the use of “page” and not “application”), it is really hard to beat PHP. It is easy to find a host, drop some files on a server and you’re off and running. You don’t import code, you include or require it based on the filesystem. It is simplicity all the way and that is really refreshing.

One thing that was kind of a pain was getting MySQL set up. Fortunately there are packages like MAMP that make the whole thing really painless.

It is fun to try and write code in PHP after a long time away writing Python (with a little C# and XSLT thrown in). It makes you realize that much of programming has nothing to do with the tools. Solving the problem ends up trumping everything, especially the language.

Fri, 21 Oct 2011 00:00:00 +0000 <![CDATA[Activity Monitors are Scary]]> Activity Monitors are Scary

Have you heard of Usermetrix? They have a slick video you can check out. The idea is that you use their library to see exactly how users are using your mobile application. It sounds great. You know what steps people are taking and what is causing errors. It emails everything your customers are doing with your app so you can take that information and make your product better. Everyone likes better products right?

Progressive is another company making an effort to understand their customers better as well. You can have a monitor installed in your car that will tell Progressive how you are driving. Finally, all those bad drivers out there will see that their poor driving costs others money and you will see a huge discount for your attention to safety. Nothing wrong with that. Right?

Wrong. Both of these ideas are ludicrous. The world is moving towards more and more monitors on the theory that this expanse of information will help make life easier. The fact is this information does nothing. If you could just log interactions and find out what is wrong, then software could feasibly fix itself. Did you get a call that a family member had been hurt badly and needed you to speed over to help? Well guess whose insurance might go up. If it doesn’t go up, then why? Aren’t they paying attention? All this information is rarely usable.

Most importantly though, these monitors inhibit privacy. Privacy is a core tenet of freedom. The boundary created by freedom is worthless without enforcing privacy. Should the government say, “Do what you want, we just get to watch”? When you invite organizations to monitor your every move, you are saying your freedom doesn’t matter. The $100 discount you might save per year is worth handing over part of your car ownership to an insurance company. The $1 app purchase is more important than keeping your contacts safe. The gains you are making are trivial. The people wanting these logs are lazy. Your privacy is worth more than a few bucks.

This is why Open Source and Free Software are so important. When you own something, it is yours. Free Software is meant to protect your right to use that computer as you see fit. Are people going to abuse these tools? Of course. People abuse everything! That doesn’t mean we should limit everyone’s access. Freedom is not perfection, it is a right. It is your right to recognize opportunity, take on the risk and find success or failure. By providing access to your life via these monitors, you are only lowering the risk taken by the companies asking for the information. They won’t be able to use the information in a way that truly benefits you. Just say no.

Thu, 20 Oct 2011 00:00:00 +0000 <![CDATA[Watching a Log in CherryPy]]> Watching a Log in CherryPy

Last night I realized that I’m slow switching hats. When there is a sysadmin type problem, it can be tough for me to immediately know where to start looking. It is not that I can’t figure it out, but rather it seems to be a difficult transition. My theory on why this can be hard is because I don’t do it very often. My development environment is meant to be free from sysadmin tasks wherever possible. I’ve made an effort to automate as much as I can away, so I can focus on the task at hand. The downside of this is that I then lose some practice being a sysadmin.

Because of this mental jump that always seems to trip me up, I started writing a really simple sysadmin server for myself. The idea is that when the need to switch to the sysadmin world arises, I can run a command and be off and running watching logs and checking our metrics.

With that in mind, I wanted to be able to click a link and tail a log file.

My server/framework of choice being CherryPy means that I’m inherently threaded, and streaming responses like this are not well suited to that model. My resolution then was to find a workaround to this problem.

The first thing I did was create a monitoring script. It was easy to add commands for the logs I’d want to watch, the hard part was starting up a CherryPy server that served the responses and closed when I stopped watching them in my browser. Here is what I came up with:

#!/usr/bin/env python
"""A monitor is a web server that handles one request. That one
request, though, is one that will return the output of a command and
stream it as long as the command is running.

For example, streaming a log to a web browser.

The script expects an environment variable called PERVIEW_PROC_PORT to
contain an integer for the port to run the server on.
"""

import os
import sys

from subprocess import Popen, PIPE, STDOUT

import cherrypy

from cherrypy.process import plugins


class MonitoredProcessPlugin(plugins.SimplePlugin):

    def __init__(self, bus):
        super(MonitoredProcessPlugin, self).__init__(bus)
        self.bus.subscribe('start_process', self.run_command)
        self.proc = None

    def run_command(self, cmd):
        if not self.proc:
            self.proc = Popen(cmd, stdout=PIPE, stderr=STDOUT)

        def gen():
            while self.proc.poll() is None:
                yield self.proc.stdout.readline()
        return gen

    def stop(self):
        # Called on the engine's stop channel; take the child down too.
        if self.proc:
            try:
                self.proc.kill()
            except OSError:
                pass  # the process already exited


class MonitorServer(object):

    def __init__(self, cmd):
        self.cmd = cmd

    def index(self):
        cherrypy.response.headers['Content-Type'] = 'text/plain'
        p = cherrypy.engine.publish('start_process', self.cmd).pop()
        return p()
    index.exposed = True
    index._cp_config = {'response.stream': True}


class StopServerTool(cherrypy.Tool):

    def __init__(self):
        super(StopServerTool, self).__init__('on_end_request',
                                             self.kill_server)

    def kill_server(self):
        cherrypy.engine.exit()

cherrypy.tools.single_request = StopServerTool()


def run():
    args = sys.argv[1:]

    if not args:
        sys.exit('usage: %s CMD [ARGS...]' % sys.argv[0])

    cherrypy.config.update({
        'server.socket_port': int(os.environ['PERVIEW_PROC_PORT']),
    })

    config = {'/': {'tools.single_request.on': True}}
    cherrypy.tree.mount(MonitorServer(args), '/', config)

    # add our process plugin
    engine = cherrypy.engine
    engine.proc_plugin = MonitoredProcessPlugin(engine)
    engine.proc_plugin.subscribe()

    engine.start()
    engine.block()


if __name__ == '__main__':
    run()

You can run the server with a command like this (monitor.py stands in for whatever you name the script, and somehost for the machine whose log you want to watch):

PERVIEW_PROC_PORT=9999 python monitor.py ssh somehost tail -F /var/log/myapp.log

I called my app “Perview” for my own “personal view” of some systems.

It is not something I plan on releasing or anything but it is always good to have a somewhat descriptive directory name.

One thing to note is that I had to create the generator that I return from the index method in the same scope as the Popen call. I suspect the reason is that when it goes out of scope, the handle to stdout is released. I used a plugin for creating my process because it can easily tie in with the main CherryPy process. By using SimplePlugin as a base class and defining the “stop” method, the plugin kills the process when the server is asked to stop and exit. The stop and exit calls happen in a simple tool that waits for the end of a request.

The “on_end_request” hook is called after *everything* is done in the request/response cycle so it is safe to do any cleanup there. In the case where a user just closes the page, the socket eventually times out and the on_end_request hook is called. That takes a little while unfortunately, so when I do start working on a better UI, part of that will be to recognize when a user wants to kill the process.

Eventually, it’d be nice to have a more user friendly HTML page that will make the streamed content a little easier to watch. For example, making sure it scrolls up and follows the new content coming in.

Hopefully there are already some nice libraries for this sort of thing.

Hopefully this example shows off some nifty features of CherryPy. I think the plugin makes a great model for wrapping these kinds of operations where you have some other process you need hooked into your CherryPy server. The hook model is also nice because it allows your handlers to focus on providing a response. Had I not had the tool and/or plugin, getting a reliable way to stop the process or know when it stopped producing output would have needed a good deal more code.

Wed, 19 Oct 2011 00:00:00 +0000 <![CDATA[Betting on the Dogs]]> Betting on the Dogs

My understanding is that there are some questionable aspects of this sport, so I’m very sorry if my attendance is offensive.

If you’ve never been to a dog track, it is a fun experience. It is really cheap to get in. You sit and watch the dogs walk out to the track. A lot of times they poop and/or pee. They then sprint around the track with amazing speed and you can win some money.

To make it easier to bet, they give common bets names. There is a list with descriptions here.

The bets I find most interesting are the Quiniela and the Perfecta. Both are bets on which dogs finish first and second. The Quiniela bet pays for either order, whereas the Perfecta requires the specific order. What is interesting is the payouts for these bets. A win with the Quiniela was between $20-30 for a $2 bet. The Perfecta was $3 or $4 per bet, but the winnings were a couple hundred dollars. Seeing as it was a long time ago that I went, my numbers could be off.

I had won the Quiniela one time and realized that the order I initially chose was correct. I based it off the dogs’ past history and probably accounted for someone pooping on the way to the gates. When I saw the payout of the Perfecta, it was clear I could have bet smarter.

The reason was that I would bet the Quiniela along with a bet or two on some other dogs. In all I would spend around $5 on a race. It would have been better had I just bet the Perfecta in both orders and spent $6. I think I ended up winning on the Quiniela 3 times that day.

Had I been throwing away my random bets and just stuck with the two Perfectas, I would have walked away with a pretty huge profit.
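To put rough numbers on the comparison: the per-race costs and payouts below are the approximate figures from above, and the number of races is a made-up stand-in, since I never counted.

```python
# Rough comparison of the two strategies for the day described above.
races = 13          # hypothetical number of races that day
quiniela_wins = 3   # the number of quiniela hits mentioned above

# Strategy 1: a $2 quiniela plus ~$3 of random side bets per race.
quiniela_cost = races * 5
quiniela_winnings = quiniela_wins * 25   # ~$20-30 per winning $2 ticket
quiniela_profit = quiniela_winnings - quiniela_cost

# Strategy 2: both $3 perfecta orders, covering the same two-dog outcome,
# so every quiniela hit is also a perfecta hit.
perfecta_cost = races * 6
perfecta_winnings = quiniela_wins * 200  # "a couple hundred dollars" per hit
perfecta_profit = perfecta_winnings - perfecta_cost

print(quiniela_profit, perfecta_profit)  # -> 10 522
```

Even with fuzzy numbers, the extra dollar per race is dwarfed by the difference in payouts.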

I read this article on making sure you get a Fair Coin, which made me think of my experience at the dog tracks. I don’t think the two scenarios are mathematically related, but one made me think of the other.

Wed, 19 Oct 2011 00:00:00 +0000 <![CDATA[Ubuntu 11.10 and Old UI Design]]> Ubuntu 11.10 and Old UI Design

Over the weekend I updated my development VM to the latest Ubuntu release, 11.10. The big change in this release is the Unity UI as the default. My understanding is that the release prior also had Unity as the default, but it was only enabled on machines that had 3D acceleration.

My overall impression: I like it.

Long ago Garret gave me a quick demo of Quicksilver on the Mac. Beagle was still the search engine of choice and apps like Banshee were very much in beta. It was the sort of tool that when you became familiar with it could easily become indispensable. It was a theme that we used at the time when we were redesigning the Novell Linux Desktop start button menu. The idea was that you’d get something very similar to what Unity is doing now.

Fast forward to today and the world of mobile, and it is interesting to see how the ideas we were building on have become more reasonable to the masses. Now the app of choice on OS X seems to be Alfred. Mobile applications don’t have a start button, so to speak, but use a lot of the same search technology; instead of just finding things, it attempts to complete your thoughts in order to reduce the need to type.

Going back to Unity, what I like is that the concepts of Quicksilver and Alfred are first class citizens. I also like how the maximize works as it feels a little like the tiled window managers. The UI gets out of your way and you can focus more on the applications.

I should also mention that because I’m using it within a VM, I’m probably not getting the full experience. I usually only have 3 apps running: Emacs (of course!), a web browser (Firefox) and thg (the tortoise hg tool). All of these apps have their own models for containing different projects and organizing their sub-content, which might reveal why I care more about the UI of the application than the UI of the shell.

On a side note, 11.10 sets Chrome (Chromium actually) as the default browser. This is kind of funny because I had been using Chrome pretty exclusively for a while, leaving Firefox. When Firefox switched over to the short release cycle with its major release I gave it a try. I dedicated a day or two to using it and the big feature that sold me on it was the tab grouping. You can group your tabs and switch between tabs via a keyboard shortcut that works kind of like Expose. The nice thing about it is that I can once again start typing and it will highlight the tabs according to my search terms. The result is that I can dedicate a set of tabs for things like “work”, “music” and “reading” and switch back and forth as necessary using the keyboard. The grouping also allows me to avoid distractions by keeping things like reader or social networks on a whole other tab group.

I’ve always tried to avoid the mouse and while it is often the best tool for the job, there is something to be said for finding better models. Typing a word may not seem very efficient but I can type “fir” and hit enter to pull up my browser without having to negotiate where my cursor is or what icons are available. As we continue to evolve our systems to include larger datasets and most likely more space for UIs, having a way to be accurate may very well trump the flexibility of the mouse and pointing.

Tue, 18 Oct 2011 00:00:00 +0000 <![CDATA[Regions]]> Regions

One of the interesting things about copyright is that it is not universal. Every country has different rules regarding what rights an author has when creating a work. Another added complexity is that copyrights are not traditionally granted universally. It is common to negotiate different licenses across different regions. This is beneficial because if your content is generating revenue in one region, it is more likely to be well received elsewhere.

As a musician, this has its benefits. If you put out your own record in the US, it might help get you a really good deal with a label in Canada and the UK. Some bands have also found success in the US by first finding an audience in Europe. This whole scenario rarely occurs for artists on majors, though, because most majors have branches in other regions. Also, a major will traditionally provide a rather healthy advance to make the record. This usually ends up meaning they own the masters, which in turn means they control the use of those masters.

While this separation of markets can be profitable for an artist, it also adds a level of complexity. When you release your music, it is trivial to also list it on iTunes in foreign markets. If you do this, though, it hurts your chances of finding a label there. While it can be effective to go without a label at home, in a foreign country it is much more difficult. It is hard to know who is a reputable press person or booking agent. If you do try to promote your record in another country, how can you be sure the people working with you are legitimate? How can you hold them accountable? Do you have any legal recourse if something goes bad?

It is tough enough to be a band, book tours, and record in one country.

Taking it worldwide is only that much harder.

For us it is frustrating at times because we’ve gotten requests from folks overseas wanting to buy our music. Services like Bandcamp are great, but they don’t help in these situations because they don’t have a way to limit by region. It is also difficult to find similar services in other countries. If they are out there, I haven’t found them. Another option is to just do the geo filtering myself, but that is a lot of code to write for a pretty limited number of sales. That is not to say these sales don’t mean anything, but rather, it is tough to justify the work.

Especially if we did end up finding a label in the UK or Europe.

It is an interesting problem because the way of the web is global, yet the content is not. It makes the argument that ISPs should act like utilities a little more complicated. A utility provides a resource that they create or purchase. An ISP provides a link to a network of resources they have no control over whatsoever. Not that I’m advocating ISPs do region filtering, but it seems like it does add a subtle hint of gray to the whole issue.

If anyone does know a non-US version of Bandcamp, please let me know. My other option is to keep a close eye on CASH and hopefully find some solutions.

Tue, 18 Oct 2011 00:00:00 +0000 <![CDATA[Safety First]]> Safety First

It should be a design goal in software to make the systems safe. This is different from security because instead of preventing unrequested access, you are preventing accidental access. You keep a gun in a safe so no one else shoots it, but you keep the safety on so you don’t shoot someone else.

The key to designing a safe system is in the data. Specifically, the data should be immutable whenever possible. The reason is that if you fire off a script that ends up writing the same document over and over again, you keep the damage small. This is partially a data design detail, but it is also related to how the application actually works.

The canonical example of this is the decision to actually delete data.

When you first build something like a blog and you remove some blog post, you write some handler that deletes it from the database. Problem solved. But at some point you realize you just deleted something you needed. Whoops. Instead of actually deleting things from the database, you decide to just mark the “thing” as deleted and update your code to check for the deleted flag. This is a great example of writing safety into your software.
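A minimal sketch of that flag, using an in-memory SQLite table as a stand-in for whatever database the application actually uses:

```python
import sqlite3

# Toy posts table with a "deleted" flag instead of real DELETEs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, "
           "title TEXT, deleted INTEGER DEFAULT 0)")
db.execute("INSERT INTO posts (title) VALUES ('hello'), ('oops')")

def delete_post(post_id):
    # Mark the row instead of destroying it; the data stays recoverable.
    db.execute("UPDATE posts SET deleted = 1 WHERE id = ?", (post_id,))

def visible_posts():
    # Readers filter on the flag, so a "deleted" post simply stops appearing.
    return [title for (title,) in
            db.execute("SELECT title FROM posts WHERE deleted = 0")]

delete_post(2)
print(visible_posts())  # -> ['hello']
```

Undeleting is now just another UPDATE, instead of a trip through the backups.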

A friend at work forwarded this article on beating the CAP theorem that I think also speaks to the benefit of safe software. Immutability and idempotency are two great features to have for certain operations that can go a long ways making your software safer. The result is fewer trips to your keyboard in the middle of the night looking through backups to find the data you just screwed up royally.

Sun, 16 Oct 2011 00:00:00 +0000 <![CDATA[Deep CherryPy]]> Deep CherryPy

My first experience with Python web development was with CherryPy. I did the Hello World and that was pretty much it. It didn’t strike me as that compelling at the time, so I moved on. Later I tried Rails and it made a ton of sense. At the time I was also doing some Python for school and really liked the language. When I finally got a job out of school, Python was my preference of the two, and I started looking for web frameworks like Rails. Django and TurboGears were brand spankin’ new, and of the two I liked TurboGears better. What struck me though was that most of TurboGears was really just extra fluff on CherryPy.

Once again, I started playing around with CherryPy. For whatever reason, I still wasn’t entirely satisfied. I ended up going with web.py for a time and enjoyed it. Then I stuck to raw WSGI. Of course, when I looked for a server I ended up back on CherryPy’s.

Now, I work at a job where I work with CherryPy every day and honestly, things couldn’t be better. Every time I take a peek over the fence at other frameworks and tools, nothing ever ends up having the right balance of features and flexibility like CherryPy. I’ll often hit some bug or limitation that isn’t in CherryPy, or there will be some moment where the model the framework wants to use just doesn’t fit. It has happened every single time, and now I don’t really look very often.

I did stumble on this framework for Ruby called Renee. What struck me was that it was essentially like my transitions between TurboGears, web.py, and CherryPy. Each had a different model for how the application should be written. CherryPy gives you the flexibility to do all of the above.

First you start with Rails where you define routing via a central file.

This is how Django works and how Pylons (now Pyramid) used to work.

In CherryPy you can mimic that behavior with the following:

import cherrypy
from my_controllers import HomePage, BlogPage

d = cherrypy.dispatch.RoutesDispatcher()
d.connect('main', '/', HomePage())
d.connect('blog', '/blog/:year/:month/:day/:slug', BlogPage())

This model can be really powerful, but there is also a subtle disconnect at times. Often you want to dispatch not only on the URL path, but also on the HTTP method. web.py had a similar model where you would define your routes and then have a class handle the request based on the method. Here is how you can do that in CherryPy:

import cherrypy
from myapp import models
from myapp.templates import render  # hypothetical template helper

class MyHandler(object):

    exposed = True

    def GET(self):
        params = {'error': cherrypy.session.get('last_error')}
        cherrypy.session['last_error'] = None
        return render('myhandler.tmpl', params=params)

    def POST(self, **kw):
        '''The **kw are the submitted form arguments'''
        try:
            models.save(**kw)  # hypothetical call that validates and saves
        except models.InvalidData as e:
            cherrypy.session['last_error'] = e
        raise cherrypy.HTTPRedirect(cherrypy.request.path_info)

d = cherrypy.dispatch.MethodDispatcher()
conf = {'/': {'request.dispatch': d}}
cherrypy.tree.mount(MyHandler(), '/foo', conf)

This is nice, but what happens if you want to mix and match the models?

Here is an example:

import cherrypy
from myapp import controllers
from myapp import urls

d = cherrypy.dispatch.MethodDispatcher()

method_conf = {'/': {'request.dispatch': d}}
routes_conf = {'/': {'request.dispatch': urls}} # routes dispatcher
cherrypy.tree.mount(controllers.APIHandler(), '/api', method_conf)
cherrypy.tree.mount(None, '/app', routes_conf)

In this example, I’ve mixed the two models in order to use the one that works best for each scenario.

With all that said, what I’ve found is that the more specialized dispatchers are often unnecessary. The CherryPy default tree dispatch is actually really powerful and can easily be extended to support other models. I think this is the real power of CherryPy. You have a great set of defaults that allows customizing via plain old Python alongside a fast and stable server.
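To see why the default dispatcher feels so natural, here is a toy version of the core idea in plain Python. This is not CherryPy’s actual dispatcher, just the attribute walk it performs: each URL path segment maps onto an object attribute, and the final object’s exposed index handles the request.

```python
# Toy tree dispatch: /blog maps to Root().blog, with no routing table.
class Blog(object):
    def index(self):
        return "the blog"
    index.exposed = True

class Root(object):
    blog = Blog()

    def index(self):
        return "home"
    index.exposed = True

def dispatch(root, path):
    # Walk the path segments along object attributes.
    node = root
    for segment in path.strip("/").split("/"):
        if segment:
            node = getattr(node, segment)
    handler = node.index
    if not getattr(handler, "exposed", False):
        raise LookupError("404: nothing exposed at %s" % path)
    return handler()

print(dispatch(Root(), "/"))      # -> home
print(dispatch(Root(), "/blog"))  # -> the blog
```

The real dispatcher handles defaults, positional arguments, and configuration on top of this, but the mental model stays that simple.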

There are obviously some drawbacks in that if your application is going to keep many long connections open, then you will use up all the available threads, since CherryPy uses a traditional thread-per-request model. There is the slim chance you’ll run into the GIL as well if you are doing rather extreme processing for each request, since it is using Python threading. With that said, these limitations are really easy to work around. You can start more than one CherryPy server and use a load balancer quite easily. For processing-intensive tasks you can obviously fork off different processes/threads as needed. CherryPy even includes an excellent bus implementation that makes orchestrating processes somewhat straightforward. In fact, it is what I use for my test/dev server implementation.

CherryPy may never be the hippest web framework around, but that is OK by me. I use Python because it helps me to get things done. Other web frameworks have some definite benefits, but CherryPy always seems to help me get things done quickly using a stable foundation. If you’ve never checked it out, I encourage you to take a look. The repository just moved to BitBucket as well, so feel free to fork away and take a closer look at the internals.

Sun, 16 Oct 2011 00:00:00 +0000 <![CDATA[Kim and Thurston]]> Kim and Thurston

If you haven’t already heard, Kim and Thurston have separated. It honestly came as quite a shock. Their relationship is not dinner conversation or a daily concern, but it was an example of a rock and roll couple that seemed to have made it. We get asked in almost every interview something about Lauren’s and my relationship and how it impacts the music. It isn’t a touchy subject or anything, but it does feel slightly invasive in that it is our business. It obviously impacts our relationship in both positive and negative ways, but every marriage has similar challenges, whether it is finances or kids.

Yet even though we like to believe our struggles and triumphs are just par for the course, musicians typically have a pretty bad track record for keeping relationships together. There is always the addition of substances, lots of time away and a really busy schedule that all add to the equation. But that was why Kim and Thurston doing well seemed like a small glimmer of hope. They were doing it just fine and seemed like a relatively normal couple. Amidst the examples of rock couples, I would have said they were making it all work.

Really though, I shouldn’t be surprised. Marriage and music are two tough things to do and doing them together doesn’t make anything easier.

Just because they were a success story or someone we could look at as an example, it doesn’t mean they didn’t have struggles or grow apart over the years. There are probably millions of couples out there having similar problems and separating, so why would Kim and Thurston be any different? Of course, that doesn’t make it any less tragic.

I do think Lauren and I have better examples for marriage than some folks in a great band. We both have parents that have good relationships. They’ve struggled and been through difficult times and managed to come away stronger and closer to each other. My Dad even wrote a book on the subject. It is good to know that when things are tough, if you can stick to it and try to change yourself, there can be good times again.

As an aside, if you are married and are ever having trouble, I can recommend my Dad’s book and seminar. I realize that it might not seem applicable if you don’t go to church or aren’t really religious, but the fact is that even if you don’t feel a commitment to some higher power, marriage is about a commitment to one another. His book and seminar really do a good job helping to put terms on fundamental needs in a way that can help in explaining how you might be feeling. This kind of understanding makes a huge difference when you are trying to communicate what hurts and how you can avoid causing each other terrible pain. Being able to talk frankly about how you feel makes a world of difference understanding how to make conscious decisions to help repair a broken marriage.

Sat, 15 Oct 2011 00:00:00 +0000 <![CDATA[OFF!]]> OFF!

Tonight I finally got to see OFF! and it was awesome. When I was a kid seeing punk bands I had a complete lack of appreciation for lyrics.

Don’t ask me why, but as long as it had a fast back beat and distorted guitars, I was pretty happy. Fortunately, my tastes have changed and lyrics definitely mean something when I listen to a band. Lyrics are still not the most important thing to me, but when they are great, they make a song amazing.

This is what OFF! did so well. Keith Morris is amazing. Period.

A lot of people don’t understand punk. It isn’t about anger or violence or negativity. Punk is self expression. Keith Morris understands this and is a great advocate. His lyrics are simple and understandable amidst the squall and rumble of the music. They connect with people directly without an ounce of cliche or reminiscence. He makes hardcore punk feel new.

The music has the same effect. The drums are intense with a background of elegance. You can completely ignore them beating away and for that split second when you end up hearing them, you’re floored. The guitar sounds thick and huge even though the effect is thin and scrappy. The bass breaks up and growls. It sounds odd outside the songs, but during it is a critical part of the wall of aggressive sound. I love it.

I missed many opportunities to see OFF! It was redeeming that the show blew my mind and made the wait worth it. If you have a chance, go watch. It is not nostalgia, yet it speaks to a time when the mainstream was a different world from the underground. You can feel like you’re a part of something and find a sense of community. You see musicians that are passionate and skilled, making loud, aggressive music that is infectious. I can only hope that one day people can say the same of the music I’ve made.

Fri, 14 Oct 2011 00:00:00 +0000 <![CDATA[Announcing Pytest.el]]> Announcing Pytest.el

The other day when I rebooted my Emacs setup to use the ELPA and Marmalade repos, I took a little time to see what other packages were out there that might be interesting. I found Nosemacs, which seemed pretty slick, and took a look. My experiments with Nose having failed for the time being, I looked at porting it to use pytest.

The result is pytest.el and it is available in the Marmalade repo.

The functionality is pretty much the same as Nosemacs. When you open a test file you can run the entire “module” (the file) with the pytest-module function. You can also run the test your cursor is in with pytest-one. To test all the files in the directory of the current buffer, use pytest-directory. Finally, you can use pytest-all to run the entire test suite.

Since I use virtualenv for many projects, I end up wanting to use my virtualenv’s version of pytest. Pytest.el supports this by allowing you to configure what test runner you want to use. You can read more about this in the actual code.

Here is what I currently have in my .emacs:

;; py.test
(load-file "~/Projects/pytest-el/pytest.el")
(require 'pytest)
(add-to-list 'pytest-project-names "run-tests")
(setq pytest-use-verbose nil) ; non-verbose output
(setq pytest-loop-on-failing nil) ; don't use the -f flag from xdist
(setq pytest-assert-plain t) ; don't worry about re-evaluating exceptions (faster)

;; python-mode hook
(defun my-python-mode-hook ()
   (linum-mode t) ; show line numbers
   (subword-mode t) ; allow M-b to understand camel case
   (setq tab-width 4) ; pep8
   (local-set-key (kbd "C-c a") 'pytest-all)
   (local-set-key (kbd "C-c m") 'pytest-module)
   (local-set-key (kbd "C-c .") 'pytest-one)
   (local-set-key (kbd "C-c d") 'pytest-directory)
   (setq indent-tabs-mode nil))
(add-hook 'python-mode-hook 'my-python-mode-hook)

This is my first Emacs package, so any suggestions are welcome on how to improve things. I’d like to eventually make the configuration variables customizable via customize.

Fri, 14 Oct 2011 00:00:00 +0000 <![CDATA[Talking is Not Honed]]> Talking is Not Honed

I just read this article on Siri, Apple’s new voice recognition software. The article makes a point to describe our evolution of speech as “honed”, which is utter crap.

I can just hear it now, “Um phone? Can you, like see if that one bar is still open? The one where that guy with tattoo on his face works?”

Language has definitely changed, but I wouldn’t say it’s been “honed”.

Thu, 13 Oct 2011 00:00:00 +0000 <![CDATA[My Phone Sucks]]> My Phone Sucks

When I first got interested in computers it ended up revolving around Linux. Back then people made a point to mention how well Linux can run on cheap hardware. Being a young, recently married college student scraping by with my wife by my side, doing things like purchasing high end hardware was out of the question. This also excluded Apple from being a preferred manufacturer.

After college and into my first real job, we ended up contracting to get some machines and they were blazingly fast. This was the first time I really got it. When you are working on something, your tools need to help, not hurt. I had been using cheap machines in one way or another for a long time, and while it could have been much worse, having a good tool makes a difference.

When my iPhone started to show its age and imminent demise, I wanted to get the best phone I could. I thought about an iPhone, but the hacker in me wanted something that I could potentially customize a bit and hack on. Android was the choice for me and I got the biggest/baddest dual core phone I could find, the Atrix.

This phone sucks! It reboots randomly, and I have no clue why. The touch sensors on the screen vary from insanely oversensitive to unresponsive. The 8-megapixel camera takes horrible photos. The supposedly 4G speeds are a joke. All in all, I’m not very happy with it and plan to replace it as soon as I can. I’ve already rooted it in hopes I could remove whatever software was causing so many problems. I’m hoping someone ends up distributing some software (a ROM?) I can use to replace the original OS with a more stock Android in hopes of getting rid of the bugs.

I should have gotten an iPhone. I thought the software could be enough better to trump the hardware being less polished, and I was really wrong.

It is really important to recognize that details are not just details.

When something has been done with an attention to detail it means that someone didn’t just make something pretty or create a smooth line. An attention to detail means there was attention made to the things that you don’t want to pay attention to. Technology is composed of complex systems built on complex platforms. One small disregard for design at any level means potentially millions of lines of code sitting atop a fragile foundation. That is a recipe for disaster and why details mean more than a small design feature or subtle benefit.

My phone, while powerful, does not show an attention to detail and that sucks.

Thu, 13 Oct 2011 00:00:00 +0000 <![CDATA[My Mechanic is a Hacker]]> My Mechanic is a Hacker

Once again my van is getting fixed.  I’d traditionally be pretty frustrated, but having a chance to get it fixed at home under reasonably convenient circumstances makes me feel slightly better about it.

I also don’t mind visiting my mechanic. He is an honest guy and most importantly, he is a good debugger.

Even if you know nothing about cars, it is important to understand what is tough about fixing them. Replacing parts is generally pretty easy.

What is much harder is figuring out what parts need to be replaced. It takes a good deal of systematic troubleshooting, especially with so many computerized components. Understanding the systems, in addition to how they report data, is nontrivial and takes tons of patience alongside a healthy dose of experience.

My mechanic is like an old school unix hacker and I really appreciate that about him.

Wed, 12 Oct 2011 00:00:00 +0000 <![CDATA[Back to Pytest]]> Back to Pytest

My short trip into nose has become something of an uneventful exploration into basically the same thing as pytest. This is not a bad thing, because it means I don’t feel my investment in pytest is a waste and that there was something obviously better out there. That doesn’t mean there isn’t something better, but I feel a little more confident saying that it is not nose.

None of this is to declare pytest or nose the winner. In fact, both definitely have their flaws and neither really differentiates itself enough to warrant zealots. We’re talking about test runners here. It is not as if they are editors! Did I mention Emacs rules and Vim sucks? I kid.

I did learn one pattern that I will probably try migrating over to pytest. My issue (as I mentioned the other day) with my development server concept is that you really end up having to start it out of band because having one command ends up being too much of a hassle. The development server really needs more maintenance than a simple script can provide to start and stop. This is why we have tools like monit, init and the whole host of other process management tools out there.

Obviously all the PHP devs out there don’t have a problem running Apache and some database, so I’m going to call YAGNI on this one and just focus on making the tests faster.

What I can do in my test runner is check for running services and start them via the development service if they are available. This was actually really easy to do in the test runner and was efficient in between test runs. There is still the issue of setup / teardown for fixture data, but so far that hasn’t been a problem. I’m assuming this is because most tests are rather explicit in this regard. After all, just because some database is up and running, it doesn’t mean that it also has all your test data too. You always need fresh data.
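The check itself can be as simple as a connection attempt. A rough sketch of the idea, where the service names, ports, and the start_service hook are all invented for illustration:

```python
import socket

# Hypothetical services the tests might depend on.
SERVICES = {"database": 5432, "cache": 11211}

def service_running(port, host="localhost"):
    """Return True if something is accepting connections on the port."""
    try:
        sock = socket.create_connection((host, port), timeout=0.25)
    except OSError:
        return False
    sock.close()
    return True

def ensure_services(start_service, services=SERVICES):
    # start_service is whatever hook the development server exposes;
    # only the services that are not already up get started.
    for name, port in services.items():
        if not service_running(port):
            start_service(name)
```

Run between test invocations, this is cheap when everything is already up, which is what makes it efficient across repeated test runs.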

The other gap I’d like to jump is how to make the transition from a “development” server to a “production” server. How can I take this idea and make production ready?

My first step is to continue to focus on the tests. I don’t want to become mired in the details of how to install apps or upgrade software.

Instead, we’ll stick with how I can start/restart/stop applications and update configurations in a way that is atomic and simple. Installation of new software is really an orthogonal problem that when you stop conflating with configuration changes becomes much simpler.

Unfortunately, my experiences have been with systems that do both more or less at the same time, so it has been a tough lesson to realize.

As an aside, are there other test runners out there that warrant some exploration? Same goes for continuous integration. The fact everyone seems to just use Jenkins really makes me think the problem could be solved differently.

Tue, 11 Oct 2011 00:00:00 +0000 <![CDATA[The Development Server and Porting A Test Suite to Nose]]> The Development Server and Porting A Test Suite to Nose

I have a dream of a test server. This server starts when you run your tests and stays up after it is done. It will help with things like running your failing test on every file change and make reporting to a continuous integration server a breeze. Deployment means moving the config to different servers and pulling in the new code, still being able to run the tests and even keep an eye on performance metrics.

Alas this dream has so far been a pipe dream. The reason is that I have a single requirement that never seems to make life easy. It is the one-command test run that kills things. It is very difficult to start the whole process, wait for an orchestra of services to start and then run the tests. Why this is so hard, I can’t really understand.

In py.test we have some configuration where a test can register what services it needs. I always felt that was elegant, yet had the ability to be somewhat wasteful. If you refactor your code, there is a good chance that many of the test requirements could have changed. I’d go so far as to argue they should have changed in that refactoring should help to make each test more atomic, which hopefully means fewer tests truly need the extra running services.
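The registration idea itself is small enough to sketch. This is not py.test’s actual mechanism; the decorator name and registry here are invented to show the shape of it:

```python
# Tests declare the services they depend on; after collection, the
# runner starts only what was actually registered. All names invented.

REQUIRED_SERVICES = set()


def requires(*services):
    """Record the services a test depends on."""
    def mark(test_func):
        test_func.services = services
        REQUIRED_SERVICES.update(services)
        return test_func
    return mark


@requires("postgres", "memcached")
def test_user_lookup():
    pass  # the test body would hit both services


# The runner would then start exactly REQUIRED_SERVICES and nothing else.
```

The wastefulness I mention above shows up here: nothing forces the declarations to shrink when a refactoring makes a test stop needing a service.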

I’m making a go at porting our tests to nose and finding the simplicity of their test fixtures nice and simple. Yet, it also feels harder to control. Obviously, if I relaxed my one-command constraint or made that command a script that started and verified the main server prior to starting my test runner, then it should be more doable. I might in fact do just that. With that in mind, it feels a little off.

One could compare the struggle of my design vs. reality in a similar light as running applications in a VM vs. a single directory. Sometimes it is easier to just say heck with it and just copy the VM and put it out there instead of crafting your application and environment to run in a single directory. After all, I see tons of VPS hosting out there alongside the “cloud” when developers talk about deploying apps. At the same time, there is WebFaction doing a great job of supporting seemingly complex application deployments with only a single directory.

I think I’m going to go ahead and punt on running the tests with one command. It doesn’t seem like a huge deal to check that a process is running before moving onto the tests. That seems like a pretty safe way to start while still making it discoverable.

Tue, 11 Oct 2011 00:00:00 +0000 <![CDATA[Nose and Pytest]]> Nose and Pytest

My initial introduction to testing in Python was via py.test, so it shouldn’t be surprising that it has been my preferred test runner. With that said, I’ve recently started using Nose here and there. I honestly can’t say I have a huge preference, but there are a couple things about Nose that I *think* make it better.

The first thing is the command line arguments. Pytest uses a rather generic keyword flag to limit what tests are run. This always seemed rather elegant in that you had a single place to interact with the test gathering using a powerful search idiom. When I saw that Nose actually lets you specifically use a $file:$TestClass.$test_method model, I thought it must be limiting. The more I use it, though, the more it feels like a nicer model, as it lets you be more specific in a programmatic way.

Another thing I’ve been liking about nose is that it feels more pluggable. Pytest seems to always use a rather specific means of customization via its conftest.py file. Again, seeing as I don’t have a massive amount of experience, this never bothered me. Now that I’ve seen some more extensive configuration, making this customizable outside of the test runner feels like a better model.

One aspect of Pytest that I do like is how it will try to replay a broken test to provide better debugging information. That said, I think this depth has its negatives. The xdist looponfail flag often fails, which I believe is due to this introspection.

What exposed some of these conclusions was the nose.el package for Emacs. It seemed really simple, so I started porting it to pytest. In the end the port was pretty easy to get going with, but I realized that it would never be as nice simply because pytest didn’t support the same models. I also recognized that the models seemed like they would be helpful in many situations and not simply in an Emacs library.

I think I might go ahead and try to port the project I work on at my job to use Nose and see what happens. There is already a good deal of stubbing and framework that has to be created before running the tests, so that will either provide a good place to start or become a rabbit hole of services. In either case, my hope is that in learning nose I will widen my knowledge of testing and improve my debugging skills.

As an aside, I did look at nose a long time ago and quickly dismissed it b/c it seemed to prescribe a specialized way to write tests that was incompatible with what I had at the time. That seemed like a bunch of work for little to no gain. Looking back, I might have misunderstood the requirements. Regardless, I hope people do not trust my suggestions when choosing a test running tool. Do the work yourself to see what will work best.

Sun, 09 Oct 2011 00:00:00 +0000 <![CDATA[Emacs Reset]]> Emacs Reset

I’ve had a pretty stable Emacs configuration for a while, but it also has been somewhat unruly at times. The big issue is that I really needed some package management. One option is to simply stick with the distro version of Emacs and apt for my package needs, but that would preclude me from trying new features and modes that have been developed recently.

For this reason, I build my own Emacs and have traditionally tried to keep my environment tidy myself.

As much as I thought I had things under control, whenever I set things up on a new machine it became apparent that my system for keeping track of things was less than ideal. Depending on what sort of development I had been doing recently, my environment would be at different levels of portability and rarely was it a lossless process moving around from machine to machine. Fortunately, it didn’t happen much, but it is enough pain to make me consider being a little more proactive with my Emacs config.

I read a post about the Emacs Starter Kit getting a reboot and while I knew it was a little too customized for my taste, the use of ELPA and Marmalade was intriguing. These are package repositories for Emacs using package.el. After upgrading to the dev version of Emacs from git, I went ahead and started getting my packages in line, so I no longer had to keep my own ~/.emacs.d/vendor/ directory in line.

Overall the process was actually pretty quick. I know I lost some custom functions, but I made a decision to bite the bullet and create some actual Emacs packages for those specialized functions and even found some new helpers. For example, I found dizzee, which is a service starter that I used to replace all my customized functions. I also found nose.el for running nose on files and am in the process of rewriting it for pytest. This also exposed a project called Eco that looked like a nice middle ground between the shell focused nature of virtualenv and a more explicit Python setup.

This whole reboot might have also helped in diving into rooting my phone too ;)

Sat, 08 Oct 2011 00:00:00 +0000 <![CDATA[Rash Communication]]> Rash Communication

Speed is an interesting thing. We often want to do things fast, yet there is an inherent danger doing anything quickly. This goes well beyond covering distances quickly. Going fast in anything can be a prescription for danger... Or at least some mistakes.

My key example of an issue with speed is typing. I’m not a terribly fast typist. What slows me down is not the speed at which my fingers hit the keys. The thing that really slows me down is inaccuracy. I’m incredibly fast hitting the delete key and probably use it more than any other on the keyboard.

Another area where speed can be a problem is communication. No matter how important it seems, saying the wrong thing is much worse than simply taking a few minutes to properly craft an idea. I know this is a huge problem for me. When I get an email with an idea that feels counter to what I was thinking, my initial reaction is somewhat panicked. I want to respond as quickly as possible about how it is not right or some aspect is being missed in the solution. It happens when writing songs as well.

I’ll hear some effect or anticipate some feeling for a song and it is very hard to put that aside to listen to how others hear things.

The key to stopping this rash communication is to listen. Often times I’ll go ahead and type like crazy making points and getting my initial contrast out of my head. Then, when the typing starts to slow (this happens b/c when you don’t think about your message, you will invariably ramble), I go back and start reading the initial message and rereading what I wrote. The result is almost always a recognition that I missed something. Assuming there is still a point I want to make (many times I just throw it away b/c what I’m saying really didn’t make any sense), I’ll continue to edit and eventually come up with something that isn’t a rambling mess.

I’m still learning to control my rash communication, but making a point to listen has been key to overcoming my desire to blurt out nonsense and understand what others are saying.

Fri, 07 Oct 2011 00:00:00 +0000 <![CDATA[Blogging Every Day]]> Blogging Every Day

I’m going to give blogging every day a try. Lately I’ve felt like my communication skills have not been as sharp as I would like. What better way than to practice.

I’m not promising huge insights, specific topic matter or anything else that might put undue constraints on this. I’m pretty bad at keeping commitments like this, so I’m keeping it simple. My goal is to go 30 days in a row with at least one solid paragraph per day at a minimum.

Wish me luck!

Thu, 06 Oct 2011 00:00:00 +0000 <![CDATA[Indie Music is For Friends]]> Indie Music is For Friends

I made an observation today. A rather large indie blog claimed how much they enjoy the new Girls record. From what I heard, I like it too. What struck me though, was that their first record was kind of terrible. The songs were OK, but they weren’t very different and didn’t add anything new.

This is just my opinion of course. I had planned on seeing them and then I saw live videos and promptly realized I would not be spending money to see them. Again, just my opinion.

Going back to my observation, I think that a lot of indie music is really about seeing your friends succeed. What I mean is that even though the first Girls record and their live performances were not very good, the indie community saw something in the band and worked to make them succeed. The result is a new record that seems really great and they don’t suck live anymore. I can’t say the music is that different or provides something new, but at the same time, it is really accessible.

When I was younger and getting into punk and independent music, the bands had a vision to do what they wanted to do. It had nothing to do with getting big or signing your life away. It was, idealistically at least, about making the music that came naturally. A great example is Nirvana.

When you watch Live! Tonight! Sold Out!! you can see immediately that their goal was not to put on a perfect show or perform their record. The goal was to do what they do and that was to play loud, have fun and do it for themselves.

I don’t get that perspective from indie music today. I get the impression that bands start hearing they could be someone big and start thinking in terms of how to get there. The indie music blogs also get a kick out of being someone that discovers a band that breaks out vs. being someone who exposes people to new music. Gorilla vs. Bear is a good example of this kind of blogging as they have a distinct style they like and push, which means most bands end up sounding similar. This is fine when you want to find another chillwave artist, but when you are sick of dreamy music and want something electronic or heavy, you have to find someone else to help you out.

I’m not saying this is a negative. When I was first discovering punk music, it was very difficult to find bands similar to what I like. I listened to the Minor Threat discography over and over, partially because I loved it, but it was also because it wasn’t trivial to find bands that had similar characteristics. If I had a Gorilla vs. Bear of 80s hard core and punk as a kid, life would have been grand.

At least for a little while.

The more I listened to punk music the more punk expanded. I had bought Sonic Youth as a kid, shunned it as alternative, then rediscovered and realized it as punk. Nirvana was the same story. I found hard core and stoner rock and noise, all of which helped shape my perspective of punk and independent music. When we toured with the Meat Puppets it sealed the deal in terms of recognizing punk is not about the sounds, but where those sounds come from.

To be clear, I’m not saying that indie music today is insincere or too polished or really anything else that might allude to it lacking integrity. I am saying that the focus has changed. The gates are open and people see that success is a step through the door. It has become really easy to write songs without thinking some part sounds too “radio” or “main stream” because no one will say a band sold out. While it is really great that fans and press still support bands who do take steps to support themselves, it makes me wonder if that also lowered the standard bands have when writing. There is not a requirement to do something innovative or take risks. Taking risks might also be more difficult since the field of music has been saturated with artists for quite some time.

No matter the motives of artists, what really matters are that people are moved by music. Fortunately, music traditionally is not appreciated in a vacuum. Music helps to mark moments in time in a person’s life, which means that any song can be great to someone. My goal as a musician is to write the music we feel. We practice and work hard as a band to make great music and we want people to hear it, and most importantly, be moved by it.

Thu, 15 Sep 2011 00:00:00 +0000 <![CDATA[Packaging vs. Freezing]]> Packaging vs. Freezing

I’m currently sitting through setting up an environment for a project and started thinking about how slow this process is. In some ways it doesn’t really matter. How often do you really set up a development environment? That said, if you can’t easily set up your development environment, what makes you think you are setting up your production environment in a way that is reliable?

The essence of any installation is taking a set of files and putting them on the file system in a way some executable can run. There can always be other aspects such as updating databases or running some indexing process, but generally, you put the files in the folders.

One strategy for release is to take your development structure and simply tar it up. The benefit of this model is that it is simple and relatively reliable. When you are deploying to a platform where all you get is a single directory, this can in fact be advantageous. The downside comes about when you lose your development system or whatever system it is you are using to build from. The work you put into making sure things run is gone and even though you can recreate it, there is no telling what really changed.

Another strategy is to create a package. In this situation the important thing is not the “package” but the “creation”.  In this scenario you have some sort of a tool that takes the source, compiles it (if necessary) and places all the files in a package that can be extracted on the file system. Unlike freezing, this process is more reliable and understood in terms of knowing what has to happen to go from source code to a package. The biggest hurdle is that you have to know where everything goes, which can feel like a daunting task. You also may have to package up other dependencies, which can also be a large endeavor.

Personally, I’m a fan of packaging. Even though my experience is limited, I’ve come to the conclusion that the process of understanding where everything is going to go, documenting it in a processable format and creating a step for that process is the only way you really will be certain your environment can be trustworthy.

When I say “trustworthy” I’m not talking about security. What I’m really talking about is confidence. Businesses consistently pay more for items that could be bought more inexpensively, but they pay the extra cost for the confidence the goods will function or be quickly replaced when they fail. Packaging allows you to build that confidence so you can be sure when something breaks or goes down, reproducing the environment is just a script away. When you can’t be sure you’ve built your environment correctly, then you really can’t be sure any of the software will work as expected.

Thu, 01 Sep 2011 00:00:00 +0000 <![CDATA[Business Logic, Boring Code and Personal Perspective]]> Business Logic, Boring Code and Personal Perspective

Today we had a code review of some code. We do this in hopes of exposing some code someone else is working on and getting a better vision of what others’ apps do, their design patterns and specifics that would help us contribute if (when) the need arises. One theme I’ve noticed is that most comments that are offered have little to do with the business logic. Not many folks care about some algorithm or function. This is most likely because you are not going to gain a deep enough understanding within the course of an hour to have a strong opinion on some algorithm being used for a project. The result is that business logic is boring.

I wouldn’t say that it is intrinsically boring, but rather that unless you are invested it is boring. Art and music can both have similar effects. You hear a song or see a painting and think it is simple and anyone could do it. Yet those well versed in the genre or style can quickly attest that the artist has in fact introduced a new revolution within the scene.

It applies to other areas as well. Take the car you drive. If you are a car person you might really prefer a Charger over a Camaro, but to a mechanic, who has a deeper understanding of the engines, they are just two shades of the same thing. Hopefully this example is not completely lost behind my massive lack of knowledge in cars.

Going back to the code review though, it is interesting to see what portions of the code did prompt comments. It was almost always opinions on things like library choice and code patterns. More generally, I might categorize these as framework-type opinions. This is interesting because many projects at my job use different frameworks that traditionally are dependent on the preference of the author. The result being that these code reviews often end up being an exercise in a personal perspective on code rather than an exploration into the business logic and the project’s meat.

If this sounds like something is broken, then I agree. If it doesn’t then let me tell you why it is an indicator that something is broken.

When a team of developers all use their personal perspective to code, a few different things happen. The first is that emotions become tied to the code. Code is not emotional. It is logical through and through. The second thing that happens with personal perspectives is a lack of instinct. When you approach another project to fix something or run the tests and it is an exercise in teasing apart config files and debugging, then you’re coding with personal perspectives instead of a standard.

Your goal should be to have a code base where anyone familiar with one application can automatically do things like run tests, start servers/daemons, add logging (print statements) and deploy any other application to production. This kind of familiarity and standard allows for instincts to develop. Instead of wrestling with understanding the metadata (testing, deployment, debugging), people working on the code can focus on the purpose of the application. They have a bug report and it is a matter of grepping the source, adding some logging to do some introspection and fixing the bug instead of understanding the different test runner or build script that is used to create a package.

I’m not saying that people should code like robots or anything else that sounds extreme and stale. Rather, I’m pointing out that your personal preferences should take a back seat to the goals of the code.

If an application is going to be thrown away, then sure, rigging up your own test harness to do something is totally fine. The start up with two developers shouldn’t feel bad they haven’t settled on a test framework or deployment strategy that considers applications written outside of their core language. But if you are debugging code someone else wrote, then you have hit the point where you should have some standards that help move the code to a higher level among developers. We use languages like Python and Ruby (and even Java) because they hide a lot of complexity that doesn’t help us get the job done, yet when we consider testing, deployment and debugging, our opinions matter most. Adding standards for the common development practices is like adding memory management and garbage collection; it lets you stop focusing on the lower level details so you put your focus higher up the conceptual ladder. The result is better code.

By the way, regarding my critique of our recent code review and it revealing a symptom of a problem, it is something we’re in the process of fixing. Hindsight is 20/20 and we realize we could have made some better decisions earlier on that would have helped avoid the discrepancies between projects, but we didn’t at the time. Fortunately, it is never too late to change and change is exactly what we’re doing!

Thu, 01 Sep 2011 00:00:00 +0000 <![CDATA[Code Usability]]> Code Usability

Often times when we as developers call some piece of code “beautiful” it is in relation to some aspect of the functionality. Unfortunately, just because a piece of code does some function using a rather clever algorithm or pattern, it doesn’t make that code usable. Usability is typically considered in the realm of user interfaces, and to a lesser extent, programming APIs. I would argue that this compartmentalization of the word is in fact a deadly prescription for wasted time and money.

Every time you write code that gets deployed (whether that is a server, version control system or simply a /bin directory), an “interface” is created that should be usable. The interface I’m talking about here is not some config or command line flags, but the actual code. When an application is used by someone, its code needs to be maintained, which means it must be “used”.

The code interface is rarely given the actual attention it deserves, yet it is a huge source of wasted time and energy. Consider the amount of time you spend actually writing code vs. reading old code and fixing bugs. For argument’s sake, let’s say it is 50%, or 20 hours a week. Now multiply that by your hourly salary and the number of people on your team and you can start to see how every extra few minutes spent dealing with code that is not usable costs time and money.

The big question then is how can we write code that is usable?

To be frank, I don’t know! It is a huge challenge to create things that are usable because you need to find a way to sit in the shoes of someone you may not even know. Something you may believe is clear and direct could be excruciatingly confusing to someone else. Like most everything in software development, there is no silver bullet.

With that in mind, there are some things that can help.

The most obvious way to help communicate how to work with code is via comments. That is why they have been present in programming languages for so long. It became immediately clear that there is value in providing text in the same space as code in order to help someone else understand what is going on. Those that say it adds noise or it makes the code difficult to read are not making the connection between comments that communicate and comments that just are. If the comments get in the way, then there is a good chance those comments are not very helpful.

The goal of comments should be to answer “why” when looking at a piece of code. Why did you write the code in the first place? Why does some variable have some function called on it? Why isn’t this abstracted into a library or other place in the code? Why shouldn’t I change it? When you look at code for the first time you rarely have an understanding of what is really important and what is a hack. It becomes extremely easy to think some piece of code is the product of the prior developer’s (who you might be working for!) painstaking attention to detail. The reality is that developer could have learned the code base starting in the very function you are looking at and understood next to nothing about what should really happen in the code. This is where a comment at the beginning of the function, mentioning that the writer of the code doesn’t really know what is going on and that there is most likely a better way, would be extremely helpful.

This goes the other direction as well. When you write a concise algorithm that is fast, elegant and optimized, remember that you can add as many comments as you want describing your painstaking process and how it is probably not a good idea to change the code very much. Not only should you take pride in making your process clear, but you also make clear the assumptions you were working under at the time. When those assumptions change, the next developer working on the code can see immediately the pros and cons and evaluate them according to current needs.

Since some people do find comments difficult to parse all the time when writing code, I think it is safe to say that there are good and bad times to write comments. My thoughts are that you should comment before you start and before you push/commit. The comments you write before the code should reflect the design goals, why you are writing the code and what you hope to get out of it. They should be a thesis for the code.

They may change as the code develops, but if they are changing a lot, it might be a signal you haven’t written your comments very clearly. After you’ve written the code it is time to edit prior to publishing. Make sure your large design comments are clear and correct. Then look for places in the code where things might be confusing and leave some notes.

Finally, if you are using a DVCS, then leave good comments in the commit message before pushing. It is like a nice closing paragraph that wraps up the loose ends and gives someone a clue when slogging through the VCS logs fixing a bug you just introduced!

In terms of actual code, stick to a style. As trivial as this is, it is paramount to good design. No designer worth his/her salt uses tons of fonts and decorations. It is distracting and doesn’t help communicate a clear message. Code should do the same thing and sticking to a style is a great way to do so. I think this is a huge benefit of Python and the concept of being “pythonic.” It is a commitment to a style of code that we can all settle on in order to help reduce the number of visual complexities our brains have to deal with when reading a cryptic format.

Similarly, C# often has a very similar style because most everyone writing C# is using the same editor (Visual Studio). The point is, pick a style that works with the majority of developers and stick to it.

Related to the above point on style, consider the next developer’s development environment. If your company requires a certain dev environment, then you can depend on certain tools to be available. The converse is that every developer has a completely different environment, in which case you have a tougher job because the clarity you find in keeping every class in its own file is in fact a nightmare for another developer who feels more comfortable navigating within larger files vs. on the file system. This is one really clear example of how code can be usable for one person and not for another.

I’m not suggesting that you need to create a hugely detailed style that permeates the entire code base or have rules for extreme comments. The point is to make the code easier to use for the next developer that looks at the code. Making simple agreements that lower the level of complexity developers have to deal with is a good way to make the code more usable. Just as keeping the application simple and removing features can help make a user more successful, communicating as much about the code as possible in order to make the next developer successful is the goal when writing usable code.

Mon, 22 Aug 2011 00:00:00 +0000 <![CDATA[DiffDB]]> DiffDB

At work a consistent problem is that we store a single document that can grow to be rather large. If you consider a database like MongoDB where you store relevant information in a single atomic document, it can be problematic to update the document. The unoptimized perspective is to just throw everything away and update the entire document on each change. This model is really pretty reasonable, but it does break down eventually because databases work at a rather low level at times.

Continually writing and rewriting large documents is not an operation many data stores are optimized for. Databases typically need to know exactly what pages a document lives on and just as adding a page in the middle of a book has an effect on the rest of the pages following, so does increasing a document in a data store.

The other thing to consider is that every document you update sends that entire document’s contents over the network. Considering I/O is almost always the bottleneck in many applications, continually sending MBs of information when all that really changes is a few bytes means a lot of extra data goes over the wire for each request. I think in the long run RESTful services have some advantages in this space because there is caching built in, but generally, it is a really tough problem for a database to handle, hence most just punt and let the application logic deal with what gets stored and how.

My idea then is to create a database that is based on updates. I realize this is pretty much how MVCC databases work, so nothing too novel there. The thing that seems like it can be slightly more interesting is that the client is also part of the equation, which allows some different assumptions to be made that can hopefully help optimize the generic storage a bit. Perforce, for example (and I’m positive about this), allows a client to check out a file, which prevents others from changing it. I’m not proposing this model, but rather simply that a database client could help out at times. In all honesty, I doubt this model adds enough value that it is worthwhile in many applications. That said, at my job, I know the incremental nature of the data is something that does cause problems at scale, so why not take a stab at one method of solving it.

I’m calling my experiment DiffDB. The idea is that a client starts with a document, and sends patches to the database, where they are then applied. There isn’t any actual storage happening, but the assumption is that you’d store the document and query it using some other database like MongoDB. DiffDB then really would act as a REST based API to support incremental updates on JSON documents.

The most interesting thing in this equation is the combination of HTTP (caching, etags, etc.) along side a known patch format for updating dictionary / hash objects. Since the HTTP layer is a bit more obvious (we use PATCH to send the patch and PUT to create a new document), I’ll focus on the patch format. The basic idea is to be able to update a JSON hash with new values as well as remove values.

Fortunately, Python has a rather handy ‘update’ method on dictionaries that makes much of the actual patch operation really easy. When you want to update a dictionary with new values, you just do the following:

a = {'x': 1}
a.update({'y': 2})
print(a)  # {'x': 1, 'y': 2}

Now, removing elements seems like it would be harder, but it is really pretty simple. If you have a nested dictionary, removing an element is the same as replacing the parent with a copy that leaves out the elements you want to drop. Here is an example:

a = {'x': 1, 'y': {'z': 2}}

# removing z
a.update({'y': {}})

This means that the only time you actually have to explicitly remove something is when it is at the root level.

With all that in mind, the patch format then looks like this:


{
  '+': { ... },  # the dictionary you want to use in the 'update' method
  '-': [ ... ]   # a list of keys you want to remove from the root dictionary
}


Nothing terribly complicated here. To apply the format, you first loop through the ‘-‘ keys and remove them from the original. Then you call the ‘update’ method on the dictionary using the ‘+’ value, and that is it.
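Putting those two operations together, applying a patch is only a few lines of Python. This is just a sketch of the idea; the function name is mine:

```python
def apply_patch(doc, patch):
    """Apply a DiffDB-style patch to a dict in place.

    The patch is a dict with two optional keys:
      '+': a dict merged in via dict.update
      '-': a list of root-level keys to delete
    """
    for key in patch.get('-', []):
        doc.pop(key, None)
    doc.update(patch.get('+', {}))
    return doc


doc = {'x': 1, 'y': {'z': 2}}
patch = {'+': {'y': {}}, '-': ['x']}
apply_patch(doc, patch)  # doc is now {'y': {}}
```

The nice property is that the patch itself is plain JSON, so it can travel over HTTP PATCH as-is.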

Again, this is just a simple idea with an even simpler proof of concept. I don’t know if this would be really slow or not, but I suspect by doing the diff and patch creation on the client side you could get the gains of small packets on the network, with a limited amount of CPU time. After all, it is rarely CPU bound operations that are your bottleneck, but the I/O.

On a grander scale, I’d hope the idea of a dict patch model has legs for being helpful in other areas. There is some code here that has a diff and patch object you can play with to see if it could be helpful. It always seems like a good idea to put things out there in hopes others may find inspiration, even if the original idea doesn’t work out.

If anyone does try this out and has thoughts, please let me know.

Mon, 15 Aug 2011 00:00:00 +0000 <![CDATA[Nevermind]]> Nevermind

No, this post is not about ignoring something I said earlier; it is in fact a short reflection on Nirvana’s seminal album Nevermind. My monthly Spin focused on the record in its recent issue, and it is chock-full of short snippets by people in the know reflecting on what it meant to them. Hindsight is always 20/20, and when asked your opinion on such things it seems obvious that you recognized greatness when it hit you in the face. I did not have the realization that this music would change the world, but I can honestly say it is because I was too busy being changed by it.

I first heard Smells Like Teen Spirit at a skatepark in the valley (in California, not the great state of Texas). It was actually my first trip to a skate park, so the stage was definitely set for something amazing to happen to a young kid without an older sibling interested in punk. I skated most of the day, and in the afternoon the intro riff started and I immediately felt like it was my turn in the skate movie.

The reality is I probably dropped into some mini ramp and got hung up on the coping due to a lack of turning knowledge. Yet, I can honestly say that after hearing it, I was changed. It was probably the first time I heard heavy music that wasn’t metal or industrial or some other extremely stylized music (aka Hair bands).

I was hooked. A friend of mine bought the record and I copied it. His dad was concerned it was too negative, but we were able to persuade him that our interest lay in the music more than the lyrics, which was more or less true. I listened to that record like crazy and eventually everyone I knew was also hooked.

After finding Nirvana, I still wanted more music. Probably one of my favorite songs was Territorial Pissings because it was fast. Looking back now that I’ve played music for quite a while, it is safe to say it was the punk backbeat that got me riled up. Soon after Nirvana, I heard Fugazi. I loved *13 Songs* and played it every second of the day, with my only complaint being I wished it was faster. Then Minor Threat happened to me and that was it. I loved hardcore punk, and Nirvana was a sellout.

As my tastes evolved, I was exposed to more music: pop punk, emo (not the guyliner emo, but the Rites of Spring and midwest / Jade Tree / Polyvinyl / Crank Records emo), hardcore and the wealth of other “core” sub-genres (grindcore for example; thank you Lauren, Chris, Bob and 12 Blades). Eventually, I got somewhat bored with the general scope of bands I listened to and, to an extent, the music I was playing. After the band I was in broke up, I didn’t really do much with music for a bit. I still listened, but there wasn’t a lot of purpose as there was before.

Fortunately, Lauren mentioned that we should start playing again, and that is when Ume started. It began by going to some shows and playing in a garage. One thing that happened, though, was that I started finding new bands. I was introduced to No Wave and had been listening to Sonic Youth with new ears. We watched *1991: The Year Punk Broke* and *Live! Tonight! Sold Out!!* (more than once!) and I recognized that my youth had hidden the fact that Nirvana were most definitely punk. While Nevermind wasn’t my album of choice (I had to make up for lost time with *In Utero*), Nirvana became an icon. The 20/20 vision of hindsight made me realize my musical mistakes as a kid. I’m just happy I came back around, as it helped me enjoy a whole new set of bands bubbling up in Houston. Even though most broke up (Handdriver 9 1/2, The Kants, Drillbox Ignition), it was cool to be a part of a scene that (I think at least) was inspired to be punk like Nirvana.

Thu, 28 Jul 2011 00:00:00 +0000 <![CDATA[SQL Isn't the Problem]]> SQL Isn’t the Problem

I read this article commenting on Facebook and how their MySQL based system must be so complex that it is hell to work with. At the end of the article the author brings up “NewSQL,” which is basically the practice of providing ACID compliance without many of the other features of a typical RDBMS. My personal take is that it is a reboot on relational databases that recognizes which assumed “requirements” can be safely dropped in order to gain better performance.

After spending a good deal of time with MongoDB and some similar home-grown databases, I don’t believe that SQL is really the problem.

More specifically, I don’t think the relational model is the problem, but rather how we think about storing things. Just because you are using an RDBMS doesn’t mean you have to normalize your data. It doesn’t mean you have to store all your fields in tables. Instead, when using any data store, your main concern should be how you write and retrieve your data in a way that will be performant. Looking at the data storage problem from a simpler perspective is helpful: when you take away preconceived ideas of how you “should” do things, you have room to consider what is possible and to realistically evaluate what could work for your situation.

Many times NoSQL solutions are based on the idea of having a key / value store. The need to query is often bolted on this aspect by providing data types and recognizing JSON formatted “values” in the store. The result being that the store is not really a hash table, but rather it is a set of indexes on fields within your formatted document that allow you to combine and “query” for more specificity than you could otherwise. You could just as easily keep index tables in a RDBMS that you update via a trigger or manually and store the document as JSON in a blob. There are downsides to this of course, but the reality is that an RDBMS and a NoSQL database are probably using many of the same techniques to do the core operations of storing the document efficiently and utilizing the indexes to find the data. Just because you are not using SQL, it doesn’t mean you magically remove the limitations of computer hardware and the years of research that has been done on how to effectively store data.
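To make the “index tables plus a JSON blob” idea concrete, here is a minimal sketch using Python’s built-in sqlite3. The table names and the single indexed field are made up for illustration; a real system would also keep the index in sync on updates (the trigger mentioned above):

```python
import json
import sqlite3

# The document itself is stored opaquely as JSON; a separate table
# indexes just the one field we want to query on.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)')
conn.execute('CREATE TABLE name_idx (name TEXT, doc_id INTEGER)')

def save(doc):
    # Store the whole document as a blob of JSON...
    cur = conn.execute('INSERT INTO docs (body) VALUES (?)',
                       (json.dumps(doc),))
    # ...and maintain the index row manually.
    conn.execute('INSERT INTO name_idx (name, doc_id) VALUES (?, ?)',
                 (doc['name'], cur.lastrowid))
    return cur.lastrowid

def find_by_name(name):
    rows = conn.execute(
        'SELECT body FROM docs JOIN name_idx ON docs.id = name_idx.doc_id '
        'WHERE name_idx.name = ?', (name,))
    return [json.loads(body) for (body,) in rows]

save({'name': 'eric', 'tags': ['music', 'code']})
find_by_name('eric')  # [{'name': 'eric', 'tags': ['music', 'code']}]
```

Structurally this is not far from what a document store does under the hood, which is the point: the query machinery is an index either way.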

My point is not to tout RDBM systems as the perfect solutions because they are not. The point here is to recognize there isn’t a free lunch.

I’m sure if Facebook had a system based on MongoDB, Redis or CouchDB there would be a massive amount of complexity there as well. Storing massive amounts of data quickly and retrieving it is not a simple problem. No matter what data store you choose, there will need to be similar operations done when saving the data (adding to and creating indexes) and retrieving it (finding the row / document, filtering or finding the fields requested). There are tons of options within the workflow, but those steps will have to happen. NoSQL often moves them to the application layer, so while your database system might become simpler, you may have increased the test complexity in your application. There really isn’t a free lunch.

Fri, 08 Jul 2011 00:00:00 +0000 <![CDATA[Passing Arguments]]> Passing Arguments

Have you ever found yourself in a programming situation where you see your function signatures growing? Typically, this is a sign that you need to think about refactoring. Unfortunately, sometimes a library you are using doesn’t really make that as simple as you might like.

Lately, I’ve been working with a parser and this ends up being a good example of how this can happen. A parser usually will help in parsing by providing some state tracking functionality. The problem is that it can be difficult to fit all the functionality within the confines of the parser’s API. In my specific case, I need to support line numbers, but the parser doesn’t really function on lines.

Finding the line numbers was relatively easy. Making them available at the right context was more difficult. The result was that I started adding arguments that would pass the necessary info along until it reached the right context. This added to the already somewhat long set of arguments.
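One workaround I’ve seen for this kind of signature growth is to bundle the extra state into a single context object that rides along with every call. This is only a sketch with made-up names, not the actual parser code:

```python
from dataclasses import dataclass

@dataclass
class ParseContext:
    """Bundle the state that would otherwise travel as extra arguments."""
    source: str
    line: int = 1

    def advance(self, text):
        # Track line numbers ourselves, since the parser doesn't.
        self.line += text.count('\n')

def handle_token(token, ctx):
    # Instead of handle_token(token, filename, line, ...), one context
    # argument carries everything to the place that needs it.
    return '%s:%d: %s' % (ctx.source, ctx.line, token)

ctx = ParseContext(source='example.cfg')
ctx.advance('first\nsecond\n')
handle_token('key', ctx)  # 'example.cfg:3: key'
```

It doesn’t shrink the real problem, but it stops each new piece of state from widening every signature in the chain.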

I’m positive that there is a better solution. I’m sure I’ll find it sooner or later.

Fri, 01 Jul 2011 00:00:00 +0000 <![CDATA[First Bike Ride Back]]> First Bike Ride Back

Last night I went on my first ride since getting back from tour. Things went pretty well for not having ridden for three weeks. I’m also getting over a cold, which didn’t help.

General health aside, I noticed a couple things that I think will help improve my rides. I ride with a friend who has been riding a little while now, and even though I know he is in better shape, I think there are some things that help him go faster more consistently. Here are some of the differences.

First off, he has skinnier tires. I’ve ridden his bike and it is definitely a rougher ride, but I think they do help keep your speed up.

He also has a bigger chain ring. Obviously, since this makes it harder to pedal, changing it is not entirely a physics issue. Still, there are many times when I am pedalling but not getting any power.

The last thing is that he has a bike lock that fits on his bike. I have a bigger u-lock that I have to bring my bag for. My theory is the wind resistance adds up. Even if that isn’t the case, I’d like to get a chain I can just wear instead of always having my bag.

I’m not denying that I need to get in better shape, but I think making a couple small changes might help remove some slightly minor issues and help make riding a little more consistent. One dumb thing I didn’t do was pump up my tires. That actually made a surprisingly big difference.

Sat, 25 Jun 2011 00:00:00 +0000 <![CDATA[Loving Music]]> Loving Music

We have a new record coming out and while the process was a huge struggle, I’m really glad it is coming out. Like anything that takes a lot of work, it is rewarding to finally see it come to fruition. But beyond that, I think the album, struggles we’ve had to get it finished and  getting on the road all reveal that as a band, we love music. That is why we do it.

Our goal as a band is to make the best music humanly possible. We want to be excellent players and write songs that stretch our abilities physically, but more importantly, we want to write music that connects with people in a challenging way. We want to write songs that demand attention and a desire for repeated listening. We want to be someone’s favorite band. Most importantly though, we want to do this because it is what music has done for us.

It is interesting because even though I love music, it is difficult to listen to it while doing other things. It draws too much attention too easily.

I’ll hear a lyric or melody that makes me reflect on the effect it has on me and others. It doesn’t have to be a style I particularly enjoy, either. If there is a song playing, I listen. My desire is to understand how it touches others and, often, how it touches me.

My hope and goal is that people enjoy our new record and in turn love more music. We want to be popular, not for any personal gain, but rather as a side effect of lots of people connecting with our music.

We’d like to be like The Rolling Stones, Fleetwood Mac, Nirvana, etc.

because those bands have all made an impact on a huge number of people.

People have wept when seeing big bands, and it is not because they are sad. It is because they have the opportunity to take songs they’ve listened to over and over, through difficult times in their lives, and see them performed live by the very people who wrote and recorded the music.

I’ve had a few chances to meet musicians who have made an impact on my life. Rarely was I star-struck or in awe. I always had an undeniable desire to tell them thank you.

Thu, 23 Jun 2011 00:00:00 +0000 <![CDATA[Installing TortoiseHG in Ubuntu]]> Installing TortoiseHG in Ubuntu

At work we use Mercurial for our source control needs, and as such there is an element of complexity that can be difficult to parse when dealing strictly with the command line. Mercurial comes with a “view” command that brings up a dialog showing the DAG and lets you see where the flow of changesets is really going. This is a really helpful feature, but the view can also be truncated to the point where it isn’t quite as valuable. Enter TortoiseHG.

Long ago at a company where I worked, we used CVS. And while most would say it is a terrible VCS, I actually felt it was a pretty decent system. The reason was that we all used WinCVS and Araxis Merge in a very specific workflow. The result was that we rarely, if ever, had to deal with CVS itself. In fact, this is where I started using Mercurial, because I would keep my incremental changesets locally in my own repo while committing my bug fixes via CVS and our workflow. My hope then is to see if I can get a similar workflow in TortoiseHG as I had in WinCVS in terms of reviewing and committing bug fixes.

The first thing to do is to add some software sources to your sources.list. You can do this in the Ubuntu Software Center, but being a long time Debian user, it was easier to just edit /etc/apt/sources.list. You’ll need to add sources for Mercurial PPA and TortoiseHG PPA. There are instructions on each respective Launchpad page.

# latest mercurial releases
deb natty main
deb-src natty main

# latest tortoisehg releases
deb natty main
deb-src natty main

You also need to make sure apt can trust these sources. This is done by adding each source’s signing key.

# mercurial
$ sudo apt-key adv --keyserver --recv-key 323293EE

# tortoisehg
$ sudo apt-key adv --keyserver --recv-key D5056DDE

After you’ve added the sources, you can apt-get update to make the new source’s packages available. From there, installing TortoiseHG is as simple as apt-get install tortoisehg.

This will install the thg command. From a terminal you can launch the command in any repository and you will get a window showing the DAG, uncommitted changes, etc.

I haven’t spent much time with TortoiseHG yet, but so far it is a bit more usable than the other tools I’ve tried. I do wish Emacs had better Mercurial support where the graph of changes could be viewed, but I have a feeling a dedicated app will do a better job supporting the workflow I’d eventually like to establish.

Tue, 21 Jun 2011 00:00:00 +0000 <![CDATA[Personal Stats]]> Personal Stats

I had an idea this afternoon. I’d like to compile a set of personal stats. As a programmer it is probably a little easier, since adding data can be somewhat automatic. The idea is basically to keep a log of things you get done and track them. Revolutionary, I know. Where I think this goes differently is that instead of just marking things as done, you keep track of more minuscule data. For example, if you write, you keep a running total of how many words you’ve written, how many you’ve deleted and a final count when you’re done. The idea is that you’d not only see a history of what you did, but you might get better insight into things like accuracy. A lot of typing tutors do this kind of basic analysis, but taking the concept into more general areas might be really interesting. Here are some examples:

  • How many complete articles did you read today vs. how many you started reading vs. how many you hovered your mouse over for a second as though you were considering reading them?
  • How many files did you create vs. how many were created by your code vs. other code?
  • How much data do you typically save on your machine every day, explicitly vs. implicitly (caches)?

There is a whole ton of data that might be interesting to plot and that might shed some light on where you’re really wasting time or resources. Honestly, this idea is probably a micro-optimization to help get more done, but it is interesting nonetheless. I wonder, looking at the data, what sorts of goals someone might create. In any case, that was my idea.
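The logging half of the idea doesn’t need much machinery. Here is a tiny sketch of what it might look like; the class, event names and counts are all hypothetical:

```python
import time

class StatLog:
    """Append tiny events as they happen; total them up later."""

    def __init__(self):
        self.events = []

    def record(self, action, count=1):
        # Each event keeps a timestamp so history can be plotted later.
        self.events.append({'ts': time.time(), 'action': action,
                            'count': count})

    def totals(self):
        # Collapse the event history into running totals per action.
        out = {}
        for evt in self.events:
            out[evt['action']] = out.get(evt['action'], 0) + evt['count']
        return out

log = StatLog()
log.record('words_written', 120)
log.record('words_deleted', 30)
log.record('words_written', 45)
log.totals()  # {'words_written': 165, 'words_deleted': 30}
```

The interesting part would be the collectors feeding it automatically (editor hooks, filesystem watchers), not the log itself.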

Thu, 26 May 2011 00:00:00 +0000 <![CDATA[I Finished Life (The Keith Richards Book)]]> I Finished Life (The Keith Richards Book)

Last year at CMJ we played a party for Relics Magazine in honor of Keith Richards’ biography, Life. I was already on a huge Rolling Stones kick. When we were on the road, our drummer liked listening to NPR while driving, and we heard Keith’s interview about the book every other day. I finally got to borrow Lauren’s parents’ copy and give it a spin.

I’m always a sucker for music biographies and documentaries. It is always really interesting to see where an artist came from and what sorts of breaks they might have had to help make their career something special. Who were they playing with? What did their scene associate itself with? Did they work really hard or get extremely lucky? How involved were they in the business side of the music industry? Were they conscious about being cool or not? All these sorts of questions are fascinating to me, which made Life right up my alley.

As something well written and easy to read, Life falls in the average category. The biographer added very little of his own words to clarify the story and Keith’s messages. The result is that it really reads as a long, rambling lecture from Keith Richards. The positive side is that you get a fairly clear vision of Keith as a musician, his perspective on making music and how he works. The downside is that some of it feels somewhat out of proportion with reality.

The big story of course is the relationship between Keith and Mick.

There is also plenty of info regarding the loves of Keith’s life, but I think they are traditionally secondary to the larger story of the Stones. The Keith/Mick divide seems almost predestined, with infidelity between the two friends. You get the impression that it is something that just goes with rock’n’roll, but I have a feeling those sorts of explanations are more a function of expectations than of actual reality.

The interesting thing is how Keith describes the song writing process.

He would start and Mick would finish. That is a pretty interesting model because I think it has parallels to how Lauren and I write. The reality is Lauren does the majority of the work here, but in terms of providing an important ingredient that helps define the sound, I think we have a similar relationship. That said, I kind of get the impression that there are times Mick is actually doing the majority of the work, even though Keith explains otherwise. I’m not slighting Keith as a songwriter here, but as a musician, it is pretty easy to come up with riffs. Taking a song from a riff or chorus to a great song is a tough road.

This is the overarching issue of believability with Life that kind of gets on my nerves. It all seemed so easy. The drugs, sex and rock’n’roll seem heavy, and yet Keith handles it all in stride. Either he is not telling the whole truth about how hard he was working (it is not cool to let people see you sweat) or he really had massive amounts of help (from Mick, for example, working to make the Stones what they were). No matter how far from the truth Life wanders, it is fascinating to see that life through the eyes of a great musician like Keith Richards. I’m a fan and a musician myself, so I’m probably biased, but finding that certain songs (Gimme Shelter, Can’t You Hear Me Knocking, Street Fighting Man) hold a similar place in the original writer’s perspective is cool. I have a huge amount of respect for Keith Richards, and I think Life goes a long way in demystifying the legend while exposing that, no matter what your opinions are, he is a lover of music.

Fri, 20 May 2011 00:00:00 +0000 <![CDATA[A Rambling Multiprocessing Story]]> A Rambling Multiprocessing Story

At work I had a really clear example of a distributable problem. Pretty much just queueing up some tasks and having workers process the queues. I ran some benchmarks to see what the most effective worker was, threads or processes. The result was that processes were the fastest in my case. There were a couple important details that make it clear why. The first is that the “task” is really just a dict that can be small or large. The dict comes from MongoDB and is a MongoDB document. My initial design was to put the entire document on the queue, but it became apparent that it was much more efficient to just pass around the document object ids and let each worker fetch the data. This is obvious in hindsight, but at the time I was avoiding prematurely optimizing things and potentially making things more complicated. Along similar lines, I found in testing that if workers always connected, did some work and closed the connections for each operation, it also caused some problems. These discoveries created a couple requirements and opportunities.

  • Workers must reuse their connections across jobs
  • The smaller the job payload the faster we can go
  • A cursor from MongoDB can get all the object ids easily, but not entire documents

The process of starting a job then began with a thread that does the initial query from MongoDB. This happens in a thread because the job is kicked off from a request to a CherryPy application. The application starts a thread to do the work and then is able to return a response to the client immediately. The thread then queues up the results and starts a set of worker processes. The worker processes then get items off the queue and do the work, preserving the connection to MongoDB.
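The shape of that design can be sketched in a few lines. `FAKE_DB` stands in for MongoDB, and I’m using `multiprocessing.dummy` (thread-backed) only so the sketch is trivial to run; the real design swaps in the `multiprocessing` module for true worker processes. Only ids go on the queue, and each worker holds one “connection” for all of its jobs:

```python
import multiprocessing.dummy as mp

FAKE_DB = {1: {'size': 'small'}, 2: {'size': 'large'}, 3: {'size': 'medium'}}

def worker(ids, results):
    conn = FAKE_DB               # stands in for one reused connection
    while True:
        doc_id = ids.get()
        if doc_id is None:       # sentinel: no more work is coming
            break
        doc = conn[doc_id]       # the worker fetches the full document
        results.put((doc_id, doc['size']))

def run_job(num_workers=2):
    ids, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(ids, results))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for doc_id in FAKE_DB:       # queue only the ids, not the documents
        ids.put(doc_id)
    for _ in workers:
        ids.put(None)            # one sentinel per worker
    for w in workers:
        w.join()
    return dict(results.get() for _ in FAKE_DB)
```

The sentinel-per-worker trick is one way to answer the “is the queue finished or just empty?” question that comes up later in this post.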

In addition to being able to start jobs, the web application also monitors the rate at which jobs are being processed. The idea here is that if we see the application putting too much strain on the database, we can ease things up.

This whole application went through a few iterations before settling on the above design. I started with simple threads and found that it was trivial to max out a processor. When I switched from threading to processes I saw an increase in performance, as well as all the processors being utilized. The difficulty in switching to processes was that it was not as simple as passing objects around. This goes beyond the obvious serialization difficulties of passing objects between processes.

When I first started switching to processes, I had not yet made the decision to use the object ID and make the workers fetch the data. The result was that I jumped through a lot of hoops. For one, I couldn’t queue all the documents before starting the workers. This resulted in trying to do a batch operation while still keeping the workers available. I would do a batch, wait for the workers to finish, find more documents and repeat. The problem was that the workers then couldn’t effectively tell whether the queue was actually finished or just empty while more cases were being readied. Looking back, it would have been better to go ahead and let the process finish and start a new set of workers. My experience with connections failing silently pushed me to avoid stopping and restarting the workers.

When I took this model and tried to apply it to processes, my first inclination was to use Pipes. I made a very simple protocol where a worker would listen for a case and process it. This worked, but it was cumbersome to test because I couldn’t easily make my workers join. When you are testing, you add an item to the queue and let the worker process it, but how do you tell when you can check that everything worked? If the worker is broken, it doesn’t migrate the case and you’ve deadlocked. That doesn’t make for a very good test, because the failure doesn’t happen within the test runner; it just blocks forever.

I then began to use a queue like I had with the threads. It shouldn’t be surprising then that I ran into the same problems. With that in mind I took a minute to start investigating what really was the best methodology, outside the scope of the application. I tested different models using threads, processes, queues, pipes and data. Eventually I came to my conclusion regarding queueing the object id and letting the worker fetch the data. My original concern that it would strain the database was incorrect. Again, hindsight is 20/20 here.

With the better queueing strategy in place, it was trivial to test whether it was faster to use processes or threads and see if there was an optimum number of each. With my model in place I refactored the code to reflect my findings and got all my tests passing.

The next step was to introduce the reporting aspects for the web interface.

As an aside, when working on the web interface I took some time to investigate Backbone.js (which requires Underscore.js) and Head.js. I had heard about these libraries for a while, but hadn’t had an opportunity to see what sort of help they are. It is honestly very exciting to see things like Backbone.js because it begins to make it much more obvious how to organize Javascript code. Likewise, tools like Head.js make it possible to create a more effective system of including those files in a way that is efficient in both the browser and for development.

Back to reporting.

Again, the way the system works is that the CherryPy application receives the request, processes it and asks a manager bus to perform an operation. The bus then checks to see if it has already done the action (there is an idea of a “team,” which is a set of threads and processes responsible for doing a “job”) and tells the team to get to work. The team then creates a thread to actually do the work. The thread begins by creating a result tracker thread, which will listen to queues the workers write to in order to provide some stats. Next it finds the documents that need to be migrated or copied and adds them to the queue. The worker processes are then started. Each worker has its own result queue that it writes to and that the result tracker thread reads from. After the worker processes are started, the thread started by the team closes the queue, signaling to the worker processes that when the queue is empty, all the work has finished. The queue is then joined and the result tracker thread stopped.

Something that was important was where a queue was created. Originally I tried to create the result queue and pass it to the workers, but that didn’t work. I suspect that the queue can only have a single writer. Similarly, I had to create the work queue in the thread that creates the worker processes. Again, I believe this is an implementation issue in the multiprocessing library. My understanding is the multiprocessing library has to create a pipe for each queue and it starts its own thread to actually handle sending things via the pipe. The impact of this is that even though you would suspect you could pass that queue object around, it fails when it crosses the boundary of a thread.

There were some things I didn’t try that might be helpful in the future. I avoided using an external queueing system. Part of the goal of this project is to lower the system administration resources. Adding something like RabbitMQ or Redis doesn’t really help our sysadmins keep track of fewer processes. Another thing I didn’t really investigate as fully was using the logging module for getting statistics. Instead I used the logging.statistics protocol, which meant that I needed to collect the results in the main process, hence the reason for the result tracker thread. I considered just using a log file, but decided against it. Logging every result in a set of files would most likely end up with a negative impact because of the file I/O via actually writing to the disk or even simply taking up space.

With that in mind, I did add some tooling for working with logs for future improvements. It could be a very useful feature to journal the operations for either a replay or for importing into another system. Again, this would need to be something the sysadmins would want to do since it is taking compressed binary data from MongoDB and writing it in uncompressed JSON files. That could be quite a lot of data for a large database.

I’m hoping that when we see where this tool really fits, we can eventually open source it. I think its strength lies in being able to move MongoDB data in a way that is tunable, without having to introduce an explicit queue. What I mean is that when you have a system, incorporating a queue can introduce complexity. This idea lets you avoid the explicit queue and instead consider something more akin to smarter replication. Time will tell how valuable this is, but at the very least I learned quite a bit, and I think the result is a tool that does one thing well.

Tue, 10 May 2011 00:00:00 +0000 <![CDATA[This is a Bug]]> This is a Bug

If you’ve ever programmed in Javascript, you’ve obviously dealt with “this”. In some ways, “this” is a pretty powerful concept because it lets you have a dynamic context. Where the idea excels is with events and the DOM. At least that is where I think using “this” is really slick. Where “this” falls apart is when you use it in a way similar to Python’s “self” concept; more generally, when you have an object and want to reference that object’s variables within one of its own functions. Here is a really simple example.

var Foo = function () {
  this.bar = 'hello world';
};

Foo.prototype.handler = function (evt) {
  console.log(this.bar);
};

Foo.prototype.connect = function () {
  some_object.bind('my_event', this.handler);
};
The idea here is that when you fire some event ‘my_event’ the Foo object’s ‘handler’ function is called and prints the Foo object’s ‘bar’ value, which is ‘hello world.’ That is the idea anyway. What really happens though is that when you do fire that event, the “this” ends up referencing ‘some_object’ instead. Unless ‘some_object’ also has a ‘bar’ variable, it will print undefined.

On the one hand this is really cool. Again, it is a really effective model for DOM events. At the same time though, it is really sort of a pain in the neck to write more organized objects because you have some hoops to jump through in order to change what “this” references. In fact there are plenty of examples where a library or tool tries to make this easier.

jQuery.proxy is one example that lets you set the value of “this” in a function. CoffeeScript, which is something I’ve been playing with lately, has a slightly different function operator (=>) that will keep the value of “this” as the parent object instead of using the calling context. Backbone.js is another tool that aims to solve a similar problem in that it provides a more typical object model where it is most effective, defining Models. The point is that others have hit this same “bug” and had to work around it.
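All of these workarounds boil down to fixing the value of “this” before the function is handed off. Here is a minimal sketch of that idea using plain JavaScript’s standard Function.prototype.bind; the `some_object` emitter here is a hypothetical stand-in for whatever library object the example above binds against:

```javascript
// Function.prototype.bind returns a copy of a function with "this"
// permanently set to the object you pass in.
var Foo = function () {
  this.bar = 'hello world';
};

Foo.prototype.handler = function (evt) {
  return this.bar;
};

// A toy event emitter that, like many libraries, invokes handlers
// with itself as "this".
var some_object = {
  handlers: {},
  bind: function (name, fn) { this.handlers[name] = fn; },
  fire: function (name) { return this.handlers[name].call(this); }
};

var foo = new Foo();

// Bound the naive way, "this" inside handler ends up being some_object,
// which has no "bar", so the handler returns undefined.
some_object.bind('my_event', foo.handler);
var broken = some_object.fire('my_event'); // undefined

// Bound with .bind(foo), "this" is locked to foo no matter who calls it.
some_object.bind('my_event', foo.handler.bind(foo));
var fixed = some_object.fire('my_event');  // 'hello world'
```

jQuery.proxy does essentially the same thing as bind here, so this pattern carries over directly to the libraries mentioned above.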

Obviously, the behavior of “this” shouldn’t really change. And the fact that there are workarounds provided by libraries just means that it doesn’t have to be too big a deal. Still, I wish that there was a simple way to access an object’s method or attribute more reliably within Javascript. The other option is to construct clearer best practices for organizing code that doesn’t rely on typical objects. Not being a Javascript guru, I’m sure others have come up with solutions to both questions. Maybe I just need to look a little harder.

As an aside, I’ve been playing around with CoffeeScript. It is a bang-up great idea if you ask me. They have managed to create a language that is still Javascript in that it works with other libraries and tools, yet it manages to add some of the most helpful macros to make things simpler. Creating classes is one example where it excels, with things like list comprehensions and better looping being another huge win. What is also exciting to me is that it compiles to really basic Javascript. The result is that you don’t see generated code that depends on a library or framework. Instead you get plain old Javascript. In this way it becomes a simple way to abstract the structure and code while still optimizing the code by using straight Javascript. If you’ve ever needed to do things like process data in Javascript you’ll find that using something like jQuery can be slow at times compared to plain JS. CoffeeScript takes a different tack in that it is an optimization of syntax that doesn’t affect the runtime performance very much. Check it out if you haven’t already.
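The fat arrow is a good example of why the generated code stays fast: it compiles down to the classic capture-the-outer-this closure pattern. Roughly (this is a sketch of the idea, not CoffeeScript’s exact output):

```javascript
// Roughly what a fat-arrow method becomes: capture the outer "this"
// in a local variable (_this) and close over it, so the handler works
// no matter what "this" is at the call site.
var Foo = function () {
  var _this = this;
  this.bar = 'hello world';
  this.handler = function (evt) {
    return _this.bar; // always the Foo instance, via the closure
  };
};

var foo = new Foo();
var detached = foo.handler; // detach it, losing any receiver
var result = detached();    // still returns 'hello world'
```

No library, no runtime helper beyond an extra local variable, which is why the performance cost is essentially nil.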

Wed, 27 Apr 2011 00:00:00 +0000 <![CDATA[Privacy is Important]]> Privacy is Important

I just read this short blog on Angry Birds reading contact information from your phone. Now, I bet most people don’t really care and honestly I don’t blame them. Who cares if Rovio gets some email addresses from your address book? That is what spam filters are for anyway. Besides, you can easily unsubscribe from any spam and, thanks to the CAN-SPAM Act, sue those who disregard your desire not to be bothered. What really bothers me is that my phone is not my computer.

Let me explain. A computer has an operating system that works atop a file system. You can install different operating systems or applications and you can use different filesystems in some cases. You can create folders and organize your files however you want. More than one person typically can use a computer thanks to the features of the operating system. Your computer is something that you can control at a very low level. This is very different from your phone.

A cell phone is on a provider’s network, which brings with it certain restrictions. They have limited physical resources and as such they want to control what can use those resources. Phones also have to adhere to standards set by the government for things like the spectrum at which they communicate and how powerful or limited the antenna can be.

Generally, you can’t easily change the operating system the phone uses and installing applications is done through an intermediary (the iTunes App Store, Google App Store, etc.).  Most importantly though, phones are used by only one person.

The larger distinction here is that a phone is personal by default, yet it is the least personalized. If an application is run on your phone, there is a good chance it has access to everything on that phone: contacts, text messages, call history, minutes used, billing info, location information, other applications, etc. It is trivial to access this kind of information because the platform is completely controlled.

There is probably documentation that explicitly shares how to access all this information. Since a phone is personal, the relevance to some company is that the information found on the phone can be directly applied to a specific person.

This is honestly a little scary. It is not scary in that I have things to “hide” but it is scary because there are things that I don’t feel comfortable sharing. If a person comes to your home you welcome them into your living room. They may use your bathroom and you offer them a drink or snack from the kitchen. You don’t give them a key to your filing cabinet or your medicine cabinet. These areas of your life are private for a variety of reasons. It is impolite to ask someone how much they make, so in theory, we should be offended that these apps take our personal information from our phones without even asking. It is rude behavior for a guest (an app on your phone), and it is scary because they sell what they learn to others.

As a programmer, I’d say the reason apps don’t ask if it is OK is that it is inconvenient. I don’t believe Rovio is out to do anything more than make money selling games. If they can collect information in the meantime that other people want to use, then why not? Users typically don’t want hurdles. Being asked every time you start up Angry Birds whether you mind if they use your location and contacts would be annoying and turn people off to the game. Since they want to make money, they just don’t ask. That doesn’t make it right and it’s rather offensive, but most would probably do the same thing in their shoes. You have a captive audience, a sample you can gather specific, accurate information from at almost any time in order to learn something new to help sell other products.

The tough part is that it is convenient for users. Google prides itself on the usefulness of their search results. They take all that anonymous information and use it to help understand what people really want. It is also convenient because instead of teaching our computers explicitly to do things, our machines simply watch what we do and learn our patterns in order to help us. As scary as that sounds, it is realistically very powerful. Computers really are not very smart. Despite what the movies would have you believe, computers need a lot of instructions and people rarely want to provide all that instruction. If you had to provide a value for every variable your computer needs, it would be a total waste of your time. Again, if you were asked whether your information could be used every time you opened Angry Birds, would you still play it?

I’m not saying that we shouldn’t use Google or play Angry Birds. We simply need to be more understanding of what we really have in our pockets. Our phones are much closer to our wallet and private life than our computers are. A computer can do anything, but a phone is personal.

The convenience the applications provide is helpful to a point, but we as users need to find a way to assert our desire for privacy. Taking the time to see what options are available for applications is a good first step to understanding what an application might be doing behind the scenes to give you a better experience. Avoiding advertisements is another way to take away the value of secret data gathering. If the emails don’t work, they are not going to keep sending them. People fear the world of Big Brother and 1984, but the reality is society is not ruled by an all-seeing police force. Instead, we are becoming slaves to our own desires and wants such that we can be controlled simply by focusing our desires on what someone is selling. In this way it is comforting to know that defeating this kind of control is as simple as ignoring the ads and supporting those who support your privacy.

Tue, 26 Apr 2011 00:00:00 +0000 <![CDATA[Coming to Grips with Kickstarter]]> Coming to Grips with Kickstarter

Since SXSW is pretty much over and my RSS reader (yes I still use an RSS reader) is chock full of new articles discussing the “music industry”, it seems like a good time to recognize a new trend that seems like it will stick. That trend is embodied by Kickstarter, the site that lets you ask for money in order to do some project.

I was driving the van downtown and for whatever reason I was thinking about how bands could make it without a label or press. A label really helps with physical products and acts as an investor, with a major part of that investment being in press. Press is what lets you scale. You have the ability to garner a huge audience and scale your exposure.

Conversely, it can be a total waste of money. Radio is a good example of where you can spend money where the returns can be huge but are most likely pretty limited. The name of the game though is exposure.

If a band doesn’t have a label they still might be able to hire a press company. Most of the time that isn’t possible. You can always ask people to work on spec, which is to say that you promise some future profit or income in trade for services now. Again, this was popular when major labels would sign long shots with big advances, but not so much now. The only thing you really can do is ask people for money in order to pay a press person.

Kickstarter meets this need really well. You can ask for money and you just might get it. There has to be some legs behind your idea if you really want to get the money, but the interest is something that allows you to further bond with those that believe in you. Up until this point, my biggest issue with Kickstarter is that I’d hate to ask for money, but that perspective is changing for me.

The big change is that by using something like Kickstarter you are not really asking for donations. Your goal is to mobilize. As a band you have fans and fans want to see and hear your art. They come to shows, buy t-shirts, tell their friends and generally support you. Kickstarter is not about asking for money. It is recognizing that the middlemen known as labels are not really necessary. Fans are the ones that make everything work, so why use a label? Kickstarter gives fans yet another way to support the music.

This doesn’t mean that bands always need to use Kickstarter. There are still labels out there that believe in bands and can help. Having a label means that someone took on the risk for an artist and that says something about the art. This helps a great deal in formalizing a band’s stature as a respectable artist. Still, it seems that fans and artists are both recognizing that Kickstarter is a respectable way to move forward finding funding for making music.

This concept of mobilization is really the whole point, whether someone is using Kickstarter or not. Fans may pirate your music, but they will buy the record, the t-shirt and give money to your Kickstarter campaign to get your van fixed for the next tour. When I think of Kickstarter as mobilizing fans instead of begging for money it doesn’t feel wrong. You have fan clubs and street teams that do even more by posting flyers and spreading the word. Really, asking for $20 to get something done doesn’t seem like any more of an investment.

Mon, 21 Mar 2011 00:00:00 +0000 <![CDATA[Being Crotchety]]> Being Crotchety

Is it bad that when I hear ideas I often think of how they shouldn’t work? Tonight I hung out with some #sxswi folks and listened as they passionately discussed things they were working on. Initially, I was really excited. The idea was cool and spoke to part of my own beliefs that would be really exciting to expand upon. But then something happened. Amidst the flurry of interesting ideas and real world applications I started thinking about what could go wrong. The core issue, which I think could be the core issue with many tech start-ups, is that the eventual need to make money doesn’t align with the intent of the start-up.

Taking Twitter as an example, one thing they do right is making the barrier to entry really low. On the other hand though, if they expect to make money via ads, how can they really captivate when people are only writing 140 characters? Sure, there are links, but how can you really support the massive amount of messaging bandwidth and traffic without charging for the service or getting revenue from ads? My understanding is that while Twitter is extremely popular, they still haven’t started making real money.

Facebook is another example. They actually have a ton of traffic and a platform that can support ads. The problem in my mind is how do they avoid getting beat? The like buttons and login are kind of nice but eventually won’t they just become AOL? At some point doesn’t the world that Facebook creates end up being some stripped down version of something better? Maybe they can hold on to their stature as a person’s profile for the internet, but I doubt it.

Throwing myself to my own crotchety devices, I’m really just whining.

Both Twitter and Facebook are successes. And the initial idea I heard tonight was really compelling, even if it feels like it can’t make money. I’m not going to speak negatively to someone trying to change the world. It still makes me think that there could be problems though.

Sometimes it feels like the world we live in just doesn’t make sense, and yet we all rely on these assumptions to make things work. One can simply look at the recent economic crisis to recognize that relying on incorrect assumptions is a dangerous pastime. In some ways I think my recent crotchety attitudes are really a sign of my own negativity. But at the same time it wouldn’t surprise me if it turns out that my own intuitions regarding the state of our society are a lot more correct than not. In either case though, I do believe that even though an idea may seem tough to execute as a commercial venture, it is worth trying. Cynicism is not very helpful. Even though I might be crotchety at times, it doesn’t change the fact that my presuppositions are really just untested theories. In this specific case, it is actually something I would love to be proven wrong about.

Tonight, instead of taking a crotchety attitude as negative, I’m going to take some time to enjoy the excitement of what could be. Hopefully I’ll have a chance to help in small ways to make a far-fetched idea one small step closer and wish new friends the best of luck in an excellent venture that really could make a difference to a lot of people.

Mon, 14 Mar 2011 00:00:00 +0000 <![CDATA[The Ada Initiative]]> The Ada Initiative

I’m really thrilled to see the Ada Initiative! I’m a guy, but I play music with two very talented female musicians. Watching them destroy day in and day out, writing amazing music and mastering their instruments while not getting the respect they deserve is very motivating when it comes to supporting things like the Ada Initiative. While I don’t have the same experience in technology in terms of seeing sexism at work, I have definitely seen it happen and it is really wrong.

Also, I don’t think that sexist actions mean you really have a deep-seated lesser opinion of the opposite sex. This goes for many small subtle acts of bigotry. But that is the point. There are many times we don’t realize we are acting like a bigot. We say something or act a certain way towards someone without recognizing its potential offensiveness. This is not about simply not offending people either. I know I’ve said things that have been lame without even realizing it, only to be given a kind pass. But that doesn’t mean it was OK or reflects what I really think. What it does reflect is that there are accepted forms of bigotry in our society that are really easy to maintain. It takes practice and awareness to recognize it and make it change.

Finally, I hope it is clear that this kind of thinking is not necessarily based around feelings. When you demean someone else, it doesn’t matter why, you are removing their opportunities. The Ada Initiative is a great example. There are companies everywhere that will only hire people who care about open source software and have tried to contribute. By avoiding the effort to include these women, you not only hurt their stature in the community, you also are hurting their career opportunities.

I’m really excited about The Ada Initiative. As a society we can only become better by recognizing our flaws and correcting them. Bigotry is a flaw because it hides reality. Making an effort to include more women in technology is not about simply balancing numbers, but revealing more talent and growing abilities that might otherwise not see the exposure they should.

Thu, 17 Feb 2011 00:00:00 +0000 <![CDATA[Python Packaging Frustrations]]> Python Packaging Frustrations

I’m not a fan of Python packaging. This is a new development for me because up until now, things like easy_install and setuptools have felt perfectly adequate. The big reason for the change is a realization that software should have a build step. A build step is important because it forces you to think about distributing your application in an environment. You build your software to some executable or package that will be put on some other system.

Distutils has this concept, but setuptools took it and ended up making it the norm to just install the source as an egg instead of creating a true package. The result is that tools like virtualenv have cropped up that focus more on recreating a dev environment than reproducing an installation of packages. I’m not bashing virtualenv, but rather just suggesting that there might be better ways to think about creating a dev environment and translating that to a production environment.

I realize this is a subtle difference. But after having a nightmare trying to install an application on OS X thanks to its dependencies, it became clear that deploying source packages and letting dependencies be downloaded and built is not a great model. Fortunately, I don’t think it is that hard a problem to fix. The bigger issue is how to deal with the cultural aspects and communicate the subtleties of the two techniques when the current status quo is really easy.

Hopefully the distutils2 effort will show some improvements.

Mon, 07 Feb 2011 00:00:00 +0000 <![CDATA[Mercurial Workflows]]> Mercurial Workflows

I have an idea for a workflow using mercurial where you have a simple tool hide things like merging and dealing with multiple heads. I felt pretty good about it until I talked to my manager about it and he mentioned the difficulties of communicating status based on the state of the source repo. Even though I disagree, it is clear that one must be careful not to lose the communicative nature of source control.

My idea for a workflow is based on this video. It is a good idea to go ahead and watch it first before moving on.

The biggest thing to draw from this is the idea that it is not the facilities of your VCS that make development work with complicated code trees, but rather the etiquette prescribed for the team. If you have a vision for the flow of code, it is best to create that system and use tools instead of allowing the tool to decide your version control strategy. The reason is that your etiquette is a protocol for interop between developers. In other words, it is a UI that you have to live with for a potentially long time.

With that in mind, my specific workflow does cater to ease of use over expertise. My goal is that you avoid running into trouble using this system. Where does trouble come from? In a word, merges. By merges I mean taking one code line and merging it into another. This distinction is important because it differs greatly depending on your tools. The point is though, almost all VC systems contain the idea of a branch, and merging is taking those branches, no matter what they contain, and merging them into another. The difference in my workflow is that you avoid taking branches and merging them in favor of working with changesets. Abstract thoughts aside, here is my proposal.

You have a “project”. This is the traditional repo for your code.

Somewhere there is a canonical version of that code that you do releases from. In my system you have a canonical remote repo and you have a canonical local repo. We’ll call these the remote and local mainlines.

The idea is that the remote mainline must always be at a reasonable level of stability. I’m defining “reasonable” here as a developer should be able to clone it and “run” the package successfully on the main branches.

The reasonable level of stability is important because it prescribes a condition such that you shouldn’t be pushing code for others that hasn’t reached a certain level of stability. That doesn’t mean you will never break the build, but it means you understand it is a serious problem when you do. In order to keep this possible problem to a minimum, we utilize a local mainline to stage your pushes.

Traditionally when you have a bug or feature you need to work on you will branch. That will create a copy of your local mainline in a directory called “branches”. You then point your environment at that branch and start working on fixing the bug. The next day you “sync” the branch you are working on. A “sync” is a process that pulls in the stable changes found in the public repo and immediately adds them to your “unstable” feature branch. You are then forced to handle any issues at that time. Likewise, the “sync” process rebases your branch to the new stable mainline. In this way your changes are always going to be easy to apply to mainline.

When you finish your feature or bug fix, the next step is to get the changes into mainline. The first step is to apply those changes to the local mainline. This intermediate step is important because it gives you a chance to stage other code that might not be public yet. For example, if you have a bug that requires changes to other parts of the code, you could work on each and push them to the local mainline as they are ready. When you are all done and everything seems to be working correctly, you can then push your changes to the remote mainline.

Doing this with mercurial, the process looks something like this.

% cd $project
% hg clone . branches/foo
% cd branches/foo
% python setup.py develop # point the virtualenv at the branch package

# work on the code
# ok we're done

% cd ../../ # back to the project dir
% hg pull branches/foo
% run-tests
% hg push ssh://mainline/repo/$project

The essential bit is that you “pull” into the mainline branch. Assuming that the branch and the mainline are in “sync”, that makes sure the branch changesets end up on the top of the mainline stack of changesets.

The result is that you appear to have a sequence of changesets that can be viewed atomically. You also have not “merged” anything. It is as if you perfectly applied all your changes with the result being a very simple stack of changes ending up on the source tree.

This simple list is advantageous because it removes the complexity of dealing with parents and ancestors. If something causes a regression, the solution can be to simply pop off changesets until the regression is gone. There is also little confusion over whether a merge pulled in changes that were undesirable.

Going back to my original paragraph, my manager mentioned that by not having the branch in the remote code repo you lose track of what the developers are doing. I do see the benefit. If you want to collaborate on something, you just switch to the branch that work is happening on.

If you don’t want the changes, then don’t pull or push them to your working branch.

The problem is when you have to follow that tree of changes. This happens when you have a break in production and you have to ask yourself what the correct baseline is to move back to. This is when trying to understand the parallel lines in your graphical log becomes hopeless.

Once you do get a picture of what happened, how do you back it out?

Where is the best place to apply the changes? How do they propagate between all those parallel branches? I’m not saying that my workflow is totally correct, but I believe when it counts the most, simplicity will make life easier.

An answer to the collaboration question is also very possible. Most developers have a desire to pull at the beginning of the day and push at the end. It is trivial then that you’d always push to a personal repo or branch. Here is an example of a potential filesystem for a suite of projects.

├── main
│   ├── project1
│   ├── project2
│   └── project3
└── users
    ├── eric
    │   ├── project2
    │   └── project2-feature-branch
    └── mike
        └── project1

If you want to see what people are working on, then using the idea of the “local mainline” mentioned above, you could keep that copied on the server. The point is that the etiquette defined by the protocol is the most important function of the version control workflow because it is that protocol that guarantees the release process is exactly the same and that developers can be confident in their actions using the version control system.

This workflow does make the assumption that it is trivial to point the environment at different branches. If that is difficult then I’d argue that there is something wrong. If you want to make sure deployments are simple and the same across N servers, you need to be sure you can create that environment from scratch at the push of a button. Therefore this workflow makes that assumption on the source code.

Lastly, I’m sure different tools would make the workflow different. My understanding is that git gets branching right such that some of the problems that this workflow solves might not be issues. In my mind it doesn’t matter what tool you are using. All that matters is that when you get a source tree you can easily see the obvious path the code took getting from point a to point b. The analogy is like publishing. You keep drafts private, sometimes you share them with select people and collaborate for different sections, but at some point you publish it and at that point you can’t simply change things. If you ask me, the same technique and process applies to source control.

Fri, 28 Jan 2011 00:00:00 +0000 <![CDATA[Rewriting Code]]> Rewriting Code

We have a large project that is going to be going through a pretty large change. There will be a new incarnation of the project that hopefully sets the bar for the future. Seeing as that is a pretty big lofty goal and the real world is rarely big and lofty, it seemed like a good idea to write down some specific reasons we need to make rather major changes.

I’m not going to deny the reason for this explanation is to help me feel better about effectively rewriting our application. After all, rewriting software is a really bad idea. I heartily agree that a big rewrite is rarely going to solve problems. But, in this case, the goal is to improve the system. How is that different? When you write software you have bugs and bugs have assumptions. Take persistence for example.

You might be getting errors every so often that happen because the database layer is a bottleneck. Fixing the bug might be to include some caching or simply throwing hardware at the issue. The system on the other hand could include a completely different design of the data such that writes can be made incrementally and later compiled into a complete object. The difference here is that you’ve started changing the assumptions and in doing so opened up a different set of opportunities that could be the difference between constant frustration and actually getting new features.

Our code base, while very successful, has begun to show these sorts of systemic issues that will prevent us from expanding far beyond where we are currently. I say “far” beyond because that is real goal. We want to handle 1000x the load with 1/10th the hardware and that can’t happen given the current system assumptions. Likewise, if we continue to focus on fixing our bugs, we’ll never be able to radically change the system.

The biggest systemic issue that we have is speed. We need to be faster.

Our response times need to be much lower and our ability to develop exciting features needs to be faster as well. The current assumption is that there is one data store. The single data store implies that you write to one “place” and read from the same “place”. The problem is that as we’ve grown, the realization has come that we don’t read/write to the same place. There can be an intermediary. Along similar lines we don’t read everything in the same fashion, yet the vast majority of data is in fact read only. Our persistence goals then are to make sure our updates are fast and don’t impact our read performance. There will be a trade off that much of our data will be slightly more stale than before.

Another speed issue deals with development speed. There are infinite possibilities for asking questions, but it is not simple to create another way of asking questions very quickly. This problem becomes more complex as we enter different platforms (mobile). Here the solution is not as sweeping but simply involves producing some API to our storage that any client can utilize. While this seems simple (GET the question foo, POST the answer bar) in reality there is a huge set of assumptions the system makes throughout that have never been truly codified. By codified I mean they have not been defined in a publishable manner in addition to lacking consistency throughout the code. This improvement will mean providing a true API that we publish along with tools to make things easier to work with. From there, our hope is that we can have a platform for more customized questions that help us move beyond check boxes and into rich interfaces.

Finally, we need to make our authors faster. I’ve mentioned before that we have a custom language we use for writing questionnaires. These “scripters” as we call them are in fact a mix of programmer, designer, statistician and project manager all rolled into one. As such there is a wide variety of skill levels that need to be supported. In this case our goal will be to extend our scripting environment to better support our more basic users while giving the advanced users tools that can improve everyone’s workflow.

The reality is that while we are effectively rewriting our application there is a clear direction that we want to go. Our first iterations have been naive as to the problems at scale. Now we have an opportunity to take some systemic bottlenecks and hopefully improve things for the foreseeable future. We’re pretty confident that we’ll see a whole new set of problems but hopefully by that time we will have gained enough understanding that we can make another shift to keep growing.

Tue, 04 Jan 2011 00:00:00 +0000 <![CDATA[Appreciating Mercurial]]> Appreciating Mercurial

There is a lot of buzz around git. Since I’ve never spent much time with it, I can’t really say whether it is warranted or not. I ended up using mercurial and never had a reason to change.

One thing that consistently happens when using a DVCS is that you reconsider how you work with version control. There are some larger concepts that are largely static such as tagging releases and branching for features or bug fixes, but past that the world is wide open. This is a blessing and a curse. The options are never ending, so like vim or emacs, you can always tinker with your version control. The downside is that it can be really difficult to find a canonical method of use.

Some people might wonder why you’d want a “canonical” way to use your version control system. After all, a DVCS lets you program on planes, so your work flow doesn’t affect anyone else, right? In theory this is true, but in practice I’d argue that isn’t the case. The reality is a DVCS is a complicated beast and in most dev environments you really don’t need the extra complications. A successful colleague of mine expressed his appreciation for Perforce, a known target for version control bigotry. His point was not necessarily that Perforce was such a perfect design but rather that its constraints were reasonable and helped get things done efficiently. It had a very clear canonical way of working with it that made everything from getting new employees up to speed to pushing out new releases simple. Unfortunately, the git vs. hg debates usually come from radically different environments where this idea of a canonical use doesn’t easily apply, and the result is that there seems to be a greater discrepancy between the two than there really needs to be.

I read this article regarding how git gets branching more correct than mercurial. Looking at the context, the author’s work flow requires accepting and reviewing patches before applying them to the project code base. His perspective is that losing the context of where a patch came from in terms of the branch doesn’t really matter compared to the ability to disassociate patches from branches. I might have misunderstood his point, but it doesn’t really matter. The use case of pushing forward development via the submission of patches is a very specific one that doesn’t happen in many situations.

Most open source programmers have day jobs and I’ve yet to see the situation where fixing bugs in an organization goes through some maintainer that reviews the patches and applies them to the main branch. It is more common that developers work on specific bugs and features within the context of some time period. At the end of the time period, there is a release event that tags the current stable state of the repo and the cycle continues. One option would be to create a release manager position that is responsible for integrating patches to make sure they work and don’t cause problems, but the smarter way to deal with this is via automation and continuous integration.

Hopefully it is clear that the biggest difference between this traditional organization based model and open source experiences is that in an organization you are responsible for the code. In an open source situation you can submit patches all day long and there is no obligation for anyone to pay attention. The open source developer has to politic one way or another in order to be heard, whereas in an organization, your obligation is to communicate and produce code. This distinction is critical because in addition to using a tool, an organization can specify the “right” way to use it such that it reduces issues associated with random features colliding. This is important because by specifying the correct way to use the tool you open the door for other assumptions to be made.

A really good example of this would be in a release process. If you as a group decided to always add a “closed #{bug}” format in commit messages, writing a script that compiles the release notes and posts them to a wiki would be pretty trivial. In a similar fashion, you could add flags to your commit messages that hooks in the VCS use to do things like post back to a ticket/bug page. This is something a developer at our company recently started working on. It would be impossible to do things like this in an open source model.
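To make the idea concrete, here is a minimal sketch of what such a release notes script might look like. The `closed #<bug>` pattern is the convention described above; the function name and message format are my own assumptions.

```python
import re

# Assumed team convention: commit messages contain "closed #<bug>".
CLOSED_RE = re.compile(r'closed #(\d+)', re.IGNORECASE)

def release_notes(commit_messages):
    """Collect the bugs a list of commit messages claims to close,
    pairing each bug id with the commit's summary line."""
    notes = []
    for message in commit_messages:
        summary = message.splitlines()[0]
        for bug in CLOSED_RE.findall(message):
            notes.append('#%s: %s' % (bug, summary))
    return notes
```

From there, posting the collected notes to a wiki is just another small script run at release time.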

I’m not trying to argue that one system is better or worse than the other. My goal is simply to make it clear that you can’t just read blogs about git or hg and assume that you’re finding a consensus on the best tool. It is not the tool, but how you use it, that really matters most. Personally, I’d stick with mercurial because, as many git fans have mentioned, the UI is easier to use. My perspective is that you can work around the vast majority of subtle issues by simply specifying the best way to use the tool.

As a side note, in this branching model post, one thing that might help in mercurial to avoid many branches in default is to only push to default or the production branch. If I have two features, each in its own named branch, and I finish one, I can choose to merge that branch into default and push only the default branch changes. That way you can experiment and create branches as needed without polluting the canonical repo. Does mercurial do anything to help this work flow? Nope, it is just something you have to tell the team to do. Some smart folks say that constraints can be good, and this is simply an example of that concept.

Thu, 30 Dec 2010 00:00:00 +0000 <![CDATA[The Query Queue]]> The Query Queue

A really basic data structure is a queue. You put things in the queue at one end and grab things off the queue at the other end of the line. In terms of making highly scalable web applications, queues allow you to set up work to be done by some other process in order to get the response to the user faster.

I had an idea based on some slightly different use cases. The idea is that instead of simply popping off the last item in the queue, you instead query the queue to get the item you want. Querying it allows you to have different types of workers utilizing the queue without stepping on each other’s toes. This can solve an issue of granularity. If you are saving some set of data that gets collected in steps, there is a good chance each step has meaning. In a survey, for example, there is value in the set of answers to all the questions, but there is also value in the single questions as well. This is especially true if the entire survey wasn’t completed.

There might be other situations where it would be beneficial. When you register for some service, they might need to verify an email address or do other operations that cross communication lines (sending an sms message). If you queue the progress in a query queue, each component could query for the unsent emails or sms messages while the actual registration process waits for finalized and confirmed registrations. I obviously don’t have all the details worked out. For example, what happens when the queue gets full? What happens if not all queries are fulfilled? Using our registration example, if the person never verifies their account, it just sits in the queue. It should probably be expunged from the queue at some point. How should that happen? This might be a horrible idea or it might be something someone else has implemented or found a different solution to. Already I have considered that you could just create more simple queues for each operation. That would probably get around a good portion of the problems, but you inherently lose the more natural continuation type pattern, which is the benefit of this kind of system.
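A tiny in-memory sketch of the core idea, assuming nothing beyond the stdlib (the class and method names are mine, not an existing library): consumers pull the oldest item matching a predicate instead of blindly popping the head.

```python
import threading

class QueryQueue(object):
    """Sketch of a query queue: different workers pull only the items
    that match their own predicate, without disturbing the rest."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def put(self, item):
        with self._lock:
            self._items.append(item)

    def query(self, predicate):
        """Remove and return the oldest matching item, or None."""
        with self._lock:
            for i, item in enumerate(self._items):
                if predicate(item):
                    del self._items[i]
                    return item
            return None
```

With this, the email worker queries for unsent emails while the sms worker queries for sms messages, and neither sees the other’s items. The open questions above (expiry, full queues) would still need answers in any real version.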

If you know of any similar systems or people who have tried this kind of design please let me know. Part of me feels it would be worth trying out, but at the same time I have a nagging feeling someone really smart sees a much different and better pattern available that makes the whole idea moot. That wouldn’t bother me in the slightest because the problem is getting solved.

Update: I just read about the end of Nsyght. This is exactly the sort of problem that I think a query queue could support. A river of data comes in really quickly and multiple services read off it as fast as possible, getting the information they need. Some look for images, others links, others focus on indexing text, while others focus on the relationships between the atomic units. Again, my idea for a query queue could be totally off, but it is an idea.

Wed, 22 Dec 2010 00:00:00 +0000 <![CDATA[Micro Frameworks]]> Micro Frameworks

I came across Bottle today and thought it was kind of silly. Not in the sense that the actual framework design or functionality is silly, but rather that there are so many attempts to make stripped down frameworks. There is really nothing wrong with making these frameworks. I’m sure the authors learn a lot and they scratch an itch. Every time one comes up though, I wonder about something similar built on CherryPy, and I’m reminded that CP is really the original microframework and works better than ever.

Even though CP has become my framework of choice, others may not realize how similar it really is to the other micro frameworks out there, with the main difference being it has been tested in the real world for years. Let’s take a really simple example of templates and see how we can make it easy to use Mako with CherryPy.

First off, let’s write a little controller that will be our application. I’m going to use the CP method dispatcher.

import cherrypy

class SayHello(object):
    exposed = True  # handlers must be exposed or CherryPy answers with a 404. very pythonic!

    def GET(self, user, id):
        some_obj = db.find(user, id)  # `db` stands in for your data access layer
        return {
            'model': some_obj,
        }

    def POST(self, user, id, new_foo, *args, **kw):
        updated_foo = SomeModel(user, id, new_foo)
        db.save(updated_foo)  # persist the update before redirecting
        raise cherrypy.HTTPRedirect(cherrypy.request.path_info)

I’ve kind of stacked the deck a little bit here with my ‘GET’ method. It is returning a dict because we are going to use that to pass info into a render function that renders the template. There are many ways you could do this, but since I like to reuse the template look up, I’ll make a subclass that includes a render function.

import os
import json

import cherrypy

from mako.lookup import TemplateLookup

__here__ = os.path.dirname(os.path.abspath(__file__))

class RenderTemplate(object):
    def __init__(self):
        self.directories = [
            os.path.normpath(os.path.join(__here__, 'view/')),
        ]
        self.theme = TemplateLookup(directories=self.directories)

        self.constants = {
            'req': cherrypy.request,
        }

    def __call__(self, template, **params):
        tmpl = self.theme.get_template(template)
        kw = self.constants.copy()
        kw.update(params)
        return tmpl.render(**kw)

_render = RenderTemplate()

class PageMixin(object):
    def render(self, tmpl, params=None):
        params = params or {}
        params.update(
            (name, getattr(self, name))
            for name in dir(self)
            if not name.startswith('_')
        )
        return _render(tmpl, **params)

    def json(self, obj):
        cherrypy.response.headers['Content-Type'] = 'application/json'
        return json.dumps(obj)

There is a bunch of extra code here, but what I’m doing is setting up a simple wrapper around the Mako template and template look up. I could have used pkg_resources here as well. You’ll also notice that the handler will automatically get cherrypy.request as a constant called ‘req’ for use in the template. Below our renderer is a PageMixin. I do this because it is easy to add simple functions that make certain aspects faster, for example, quickly returning JSON.

Here is how our controller class’ GET method would change.

def GET(self, user, id):
    some_obj = db.find(user, id)
    return self.render('foo.mako', {
        'model': some_obj,
    })

Pretty simple really. I could try to get more clever by automatically passing in our locals() or do some other tricks to make things a little more magic, but that is really not the point. The point here is that I’m just using Python. I don’t have to use CherryPy Tools to make major changes to the way everything works. Including a library is just an import away. If I wanted to write my render function as a decorator, that is possible since it would just be a matter of writing the wrapper. If we wanted to do some sort of a cascaded look up on template files, no problem. It is all just Python.
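For instance, the render-as-a-decorator idea is a few lines of plain Python. A sketch, assuming a renderer like the `_render` above; the `_render` stand-in here just formats its arguments so the example is self-contained.

```python
import functools

def _render(template, **params):
    # Stand-in for the Mako-backed renderer above; any callable
    # taking a template name and keyword params works the same way.
    return '%s: %r' % (template, sorted(params.items()))

def template(name):
    """Hypothetical decorator: the handler returns a dict of params
    and the wrapper passes them through the renderer."""
    def decorate(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kw):
            return _render(name, **handler(*args, **kw))
        return wrapper
    return decorate

@template('foo.mako')
def GET(user, id):
    return {'model': (user, id)}
```

No Tool API required; it is just a function wrapping a function.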

To wrap things up, the other day I started looking into writing a Tool for CherryPy. After messing with things a bit, I came to the conclusion I wasn’t really a huge fan of the Tool API. After thinking of ways I could improve it and getting some good ideas from Bob, something struck me. The Tool API has been around for a long time and yet it has never been a really important part of my writing apps with CherryPy. The reason is really simple. I can write Python with CherryPy. Python has decorators, itertools, functools, context managers and a whole host of facilities for doing things like wrapping function calls. It doesn’t mean I can’t write a tool, but I don’t have to. The framework isn’t asking me to either. When I used WSGI, I would write my whole application as bits of middleware and compose the pieces. It felt reusable and very powerful, but it also ended up being a pain in the neck. Frameworks have a tendency to be opinionated, and while CherryPy is seemingly rather unbiased, I’d argue the real opinion it reflects is “quit messing with frameworks and get things done”. I like that.

Tue, 21 Dec 2010 00:00:00 +0000 <![CDATA[Recognizing Patterns and Macros]]> Recognizing Patterns and Macros

If you’ve ever had to write a language for a user, you’ve probably had a vision of how you could make things easier for the user with a domain specific language (DSL). The thing about many DSLs is that they inherit from the parent language. I’m specifically looking at you, Ruby, even though that is nowhere close to the real use case I’m describing. No, my use case is much closer to the many declarative XML based languages. These are all languages that, aside from the parser, create syntax and structure from scratch.

The question then is how do you recognize when your initial structure has outgrown its humble roots. More importantly, how do you meet your users’ requirements without increasing complexity? I should also mention that complexity is the only real measure we should be addressing. This is my opinion, but it is based in the idea that no matter the syntax, complexity is what you are battling against. Complexity is also not simply a measure of how many different tokens or keywords there are, but rather the number of specific details that must be kept in the forefront of your mind in order to get the correct meaning.

Let’s take a look at a piece of code:


This looks pretty nasty. The first optimization would be to allow something like a dict. I’m going to focus on Python references because not only is that what I use every day, it is also the parent language. If we improve things by introducing a dict, then that makes sense for the initial variable definitions. But then the question is how you use those variables. The above code block is actually within a set of code defined between some curly braces (think wiki syntax as opposed to a C function). Outside of the curly braces the syntax changes.

Using one of the variables in the traditional scope, the core of the special language, we prefix its usage with a $. Again, it is very similar to a template language in this regard. The problem is things like square brackets already have meaning within the parser in the normal scope. This makes it somewhat difficult to simply add features like a dict that would generally improve the use of complicated or repetitive patterns.

This is the challenge in having a DSL. On the one hand it makes things much simpler. You can write a simple language that doesn’t have to look like HTML or other more visually noisy languages that have a subtle parsing requirement that doesn’t really help authors. On the other hand, the parser must be written in a way to support later features that might conflict with the current syntax that is in the field. Backwards compatibility is a must in these situations because unless you’ve written your parser and objects in such a way as to allow lossless serialization, fixing old scripts ends up being a bug ridden exercise in regex.

Beyond the practical challenges of syntax, there are still questions as to what is truly easier for users. Take an idea such as modules, again as in Python. How do you allow including them in the code? Do they get included where they effectively become written inline, or can we import the code, adding things via a virtual context? How does the editor play a role in the whole operation? In our case, the language is not something people interact with on the command line but rather via a simple web interface. Therefore, things like imports/includes involve not only the mechanical functionality, but the UI for writing, validating and storing them within their own scope. When you consider the environment you have the consideration of whether or not the include actually becomes like a macro when the code gets saved. Likewise, macros are another tack to take in order to make things easier for scripters to reuse code.

In some ways the answer is really all of the above, but that still begs the question of whether moving the complexity outside of the basic language and script has simplified things or in fact just moved the complexity. What you want to do is remove complexity by allowing the user to think at a higher level. This means abstractions that create a contract with the more detailed lower level aspects let the user work without the need to consider whether or not some lower level piece gets done. Adding things like imports/includes and macros may all do that, but it depends on how they are used. Some fancy user might end up writing scripts like this:

include opening
include b2b_12
include b2b_13
include b2b_14
include gen_opt_6
include footer
include postproc

At this point you’ve successfully created something extremely opaque. The complexity is not gone, but simply moved.

Just like in programming you notice patterns and develop tactics for abstraction, when writing a DSL you have a similar task. The difference is that instead of writing it for your own use cases in a known language where users are expected to understand a larger environment (build system, vcs, editor, etc.), you are defining a language for a potentially non-technical user. These users don’t read blogs on writing code. They don’t go to conferences for your language. The users don’t look forward to the latest version that includes closures. Instead they rely on you to guide their options in a way that lets them get their work done quickly. It is your responsibility to take their use cases, find the patterns and figure out a way of adding abstractions that actually reduce complexity. It is anything but easy.

Mon, 13 Dec 2010 00:00:00 +0000 <![CDATA[Starting Fresh]]> Starting Fresh

At work I’m on a project that has been around for quite a while now. As such, we’ve tried to add a lot of new features, but the reality is almost all the work is maintenance. Fortunately though, I was informed yesterday that our next iteration was on the horizon and the current version was going to have a strict feature freeze in the new year! This is really exciting for me because while I’ve done some new code at work, the vast majority of my responsibility is maintaining our current system.

A theme of this maintenance was actually that often times it was better to rewrite a feature rather than edit the code. The code isn’t bad, but it is from a different time. It started with a different Python version, testing wasn’t necessarily a top priority (I’m still personally working on this aspect), and the company was in startup mode where new features were often more important than making sure you went back and cleaned up the code. Along similar lines, we are now a lot larger and some of the designs have been stressed by the load. Making changes to add features usually ends up meaning schema changes (in one way or another, even though we are using MongoDB), which are rarely cheap. Now is definitely a good time to consider a reset to see where we can improve our design and prepare for the future.

There is a difference between starting fresh and restarting. To use a computer term, restarting shuts down the computer and brings it back up again. Anything left in memory is gone. Starting fresh is more a process of taking what you have now and using it as a reference as you recreate it. Consider it like a disk defrag or a reindex in the database. Your goal is not to change the function, but rather to clean up the debris left behind through years of use.

What is also interesting is that we will finally have an opportunity to truly change how we store and accept data. This second iteration of our application went from using a home-made storage engine based on bsddb to using MongoDB. While MongoDB is pretty cool, it has its warts. Honestly, I’m not sure it is the right path either. The big plus with MongoDB is that you can query it. When there are fires to put out finding information is critical and queries in MongoDB can be lifesavers.

Outside of that though, MongoDB feels somewhat dangerous. We added indexes and realized that we are more write heavy than we thought. That, or we are write heavy enough to plague MongoDB. You also need a lot of machines to really fulfill MongoDB’s replication expectations. This might be a good thing, but outside of losing power and natural disasters, there are people called users that make things called “typos” that can end up “killing” the wrong process. MongoDB doesn’t protect against this “threat” and that makes things a little scary at times.

All this is not to say we won’t use MongoDB though. What I know I’d like to implement is a queue that also acts like a bus. My takeaway from our old design is that we have different use cases for our data. We need some layer of abstraction over our data, but at the same time, we need to do some more setup before trusting a single DB system to just work supporting all our use cases. The idea then is to have somewhere to queue data coming in, while at the same time notifying the systems that need it. As new data comes in, some processes want it right away, while others need to wait until more data in its set has completed. We work on finding what people think, so there are multiple levels of this data. There is the overarching topic described by the survey, individual questions and how each person’s opinion fits within the sample. The bus then should support keeping the data while each of these use cases is supported.
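In rough terms, the queue-that-acts-like-a-bus could be sketched like this. This is an illustration of the shape of the idea, not a design; the class and method names are made up.

```python
class DataBus(object):
    """Sketch: retain incoming data in order while notifying every
    subscriber, so eager consumers react immediately and slower,
    set-oriented consumers can replay the retained log later."""

    def __init__(self):
        self.log = []          # data is kept, not discarded after delivery
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, item):
        self.log.append(item)
        for callback in self.subscribers:
            callback(item)

    def replay(self, offset):
        # Consumers waiting for a complete set can read back from here.
        return self.log[offset:]
```

The real version would need persistence and delivery guarantees, but the split between "notify now" and "read the set later" is the part that matters.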

There are a lot of details yet to understand, but the process should be helpful and fun. As a developer you are always trying to improve your skills. The goal should be to write functioning code that is easy to maintain, but that is a difficult thing to practice if you a) never maintain code or b) never have to write code that gets maintained by others. This gives me personally a chance to write some new code that others will need to maintain at some point in the future. I’m also excited to hopefully consider using a newer version of Python, writing tests from scratch and generally getting to try out some new-ish tools to make life easier. It should be a great new year!

Wed, 08 Dec 2010 00:00:00 +0000 <![CDATA[I Saw The Social Network]]> I Saw The Social Network

Last night Lauren and I took a break from working and music to catch The Social Network. For those that don’t know, the movie is the story of Mark Zuckerberg and how Facebook got started. Going into the movie, I wasn’t really thinking I’d like it. I’m not really a fan of Facebook or Mark Zuckerberg. In the grand scheme of things, Facebook doesn’t feel like Google or Microsoft or Apple. On a technical level, it doesn’t seem as though it is really improving the world via technology as much as it is an addictive website that allows people a place for their online identities. Me, I have a real website, so social networks fail to provide a place to call home. That is simply my own perspective.

Even though my attitude could have been better, it wasn’t too bad a movie. In the previews I saw, the image of the “rockstar” programmer was laid on pretty thick, complete with the wild nights, drugs, alcohol and promiscuous women. Fortunately, this wasn’t nearly as prevalent as I expected. There was definitely a bit of a college lifestyle represented but it was somewhat believable. The one standout situation that was truly ridiculous was the drunken hacker internship where college students were forced to take shots while trying to break into a “python” web server. It might have happened, but that one seemed really far fetched. Again, my attitude was somewhat lacking in that I don’t see Facebook as some powerhouse of technical innovation. It is a completely irrational opinion, but being honest, it still exists. While they tried to make Zuckerberg into something of a genius, at best, he came off as being relatively bright with a bent on ambition and capitalizing on opportunities. Overall, it was a pretty decent movie and, most importantly, I had a good time, so that is that.

Our drummer thought I would have really liked it though because it seems like a “geek” movie. The fact is Facebook just isn’t very interesting from a “geek” perspective. There are those that are interested in its social graph, but that just doesn’t seem that interesting because it still is in the walled garden of Facebook. In my mind Facebook and the social aspects that have been seen as revolutionary never seem that poignant. Social networks have continued to be very similar in that they consistently give people a place online to point others at that they can easily publish to and gather an audience. All the photos, videos, witty comments, etc. all are just people publishing. Some have purpose while others don’t. I’m not saying this is a bad thing or suggesting it is negative. Lauren always says I’m extremely active on Facebook when I probably visit it once a month, if that. The fact is I use Facebook to publish just like anyone else. But when the next “social” publishing platform comes around, I’ll do the same thing and update it with whatever it is I’m trying to publish. And that is the point, something else will come along.

I’m really not trying to be a downer here. I know tons of people love Facebook and just can’t get enough of the communication that goes on there. My “geek” perspective is that I’m just not that impressed or excited about it. Personal publishing online has been something I’ve done for a long time, so the opportunity that Facebook offers just isn’t that compelling. That doesn’t mean Facebook isn’t a powerful medium for others to publish online. In fact, I’d argue that what does make Facebook compelling is that it raises the expectations of what non-programmer citizens of the web expect. It is good that people have talked about privacy and the problems of using many social networks. It is because at some point, I’d hope people will want to have not just an account, but a real domain they truly own. Facebook is just one step towards that, just like Blogger, Wordpress, MySpace, Posterous, Twitter, etc....

Thu, 07 Oct 2010 00:00:00 +0000 <![CDATA[Migrating Data]]> Migrating Data

At work we have something of a problem. We keep a ton of data, but most of that data is read only. In the spirit of avoiding optimizations, we’ve just kept all that data in one MongoDB process. This has been working pretty well, but recently we realized we’re hitting some performance limits that happen because our use cases for the data are different and the DB isn’t really equipped to handle each use case efficiently at the same time. The idea is to migrate the data out of that database to a read only archive database.

This feels like a pretty simple problem, but since it deals with our data, it is really important to get it right. A co-worker recommended Celery, which looks promising. I really want to make sure I have accurate logs on this project. It should be really easy to monitor and find interesting information from the constant stream of data. Obviously, some might be asking why not just use more hardware, and that is a valid question. It is interesting because that was the initial plan, but when we started thinking about the amount of hardware and space we’d need, it became rather clear that creating a MongoDB cluster wouldn’t be trivial or cheap. If we used EC2 it wouldn’t be a big deal, but we also would have hit performance issues much sooner since the issues stem from reading old data off the disk.

The other reason for the design change is that the data usage really is different. The vast majority of the data is read-only. What happens then is that a read would be expensive and lock the DB. This automatically should raise red flags since MongoDB doesn’t lock on reads. The problem was actually that the query used enough resources to effectively lock the DB. The connections and queries would pile up and then finally stampede. This would interrupt our writing systems and generally just cause a lot of errors, not to mention a poor user experience.

If the migration tool seems helpful, I’ll try to post it. It is interesting in that it reveals the power in customization and how generic solutions, while seemingly effective, usually end up falling down over the long haul.

Tue, 05 Oct 2010 00:00:00 +0000 <![CDATA[Foreign Films from iTunes]]> Foreign Films from iTunes

The other day we rented The Girl with the Dragon Tattoo from iTunes. Lately, we’ve been using iTunes and Amazon quite a bit for renting movies. Our track record for returning movies to the video store is pretty abysmal even though Austin has a truly great video store in I Luv Video.

Our TV watching technology is pretty “old school” as television has never been an integral part of our life. Our TV is a hand me down from my grand dad and has begun to show its age rather clearly. To watch shows or movies we just hook up the laptop and go full screen. Pretty simple.

The problem is that when we watch foreign films (which is somewhat often) iTunes seems to prefer subtitles in white text. The Girl with the Dragon Tattoo was a Swedish film, which means, surprise surprise, there is a lot of snow. Another theme in the movie is collecting information from text that traditionally is on a white background. The result was an EXTREMELY difficult time reading the dialog. The frustration is hard to express, in fact. As we became intrigued by the complicated story line, we’d hit a character wearing a white shirt and have to jump up and try to catch the missing pieces in the story we couldn’t read.

You would think in this day and age of extreme usability we’d see a bit more dedication to obvious issues. I’ve seen plenty of foreign films with yellow text and that would have done nicely. It would also be obvious to move the text to the unused portion of the screen. Yes, I realize that is a holdover from my ancient television tech, but it still is a decent option an application could offer. It would be a neat trick to have the text change color according to the background, but if the text was yellow, that would work too.

Fri, 01 Oct 2010 00:00:00 +0000 <![CDATA[Whatever Happened to "Indie"]]> Whatever Happened to &quot;Indie&quot;

The other day we had to rent a mini-van to take our gear to a show. One of the niceties of the rental was satellite radio. Seeing as we had a healthy stretch of road to cover, it gave me plenty of time to enjoy the wealth of options on the radio. I try to keep up with new music, but it is tough to actually listen to all the new music coming out.

Fortunately, there was SiriusXMU to help pass the time and give me a chance to become more familiar with some of the more recent “indie” artists that I’ve read about on blogs.

My conclusion after listening is that “indie” music has evolved. At one point, “indie” was short for independent. This could take many forms, but the overarching theme was that “indie” meant the music wasn’t a part of the typical music establishment. More often than not, some subversive aspect, whether chosen by the artists or inherent in the music itself, set it outside the scope of traditional culture. While there are still plenty of bands that could be considered subversive, I’m not so sure they are the majority. What is interesting is that even though the music has become more acceptable to the mainstream, the “industry” behind it is still very much independent of major labels and their vast wealth.

Another interesting observation is that while much of “indie” music has become more mainstream, the mainstream has also been opening its doors to more subversive acts. The spectrum of music has become more detailed and at the same time wider. This trend is something that’s been going on for a long time.

Personally, when it comes to subversive music, I tend to agree with Ian McKaye’s perspective. While it is arguable that “indie” is different than “punk”, it is good to know that there will always be people discovering (and rediscovering) new perspectives on music.

Wed, 15 Sep 2010 00:00:00 +0000 <![CDATA[The Trick to Driving]]> The Trick to Driving

Being in a band often means you have to drive a lot. We’re not talking truck driver miles, but you can easily put an extra 15k-30k on the odometer without much thought. While many times the trek from city to city is pretty minimal, there are usually days where you need to get up at 6 or 7 just to get to the club before doors. It is on these long stretches that the ability to drive long distances becomes of the utmost importance.

This came up the other day when my drummer and I had to trek up to Denver. We had two days, which is plenty of time, but we had to use a friend’s van (ours was in the shop) that didn’t have any radio or air conditioning. As luck would have it, there was a nice cold front that kept the back sweat to a minimum on the first day. Still, I ended up driving in near silence on back road highways for about 11 hours and it didn’t really bother me. I can’t say exactly why sometimes it is nice to drive like this, but here are some theories.

As a kid my family would take vacations and we would drive cross country. One time I remember vividly not having moved for a really long time. My hand had been holding my wrist and it was sweaty. My mouth felt strange since it had obviously not been open for a while. In short, I zoned out for probably an hour or so. Before I broke out of my unknown trance, it occurred to me I was on a roll. I ended up seeing how long I could simply sit still and ride. The scenery whizzed past and I enjoyed the vast landscapes of the country. It is this kind of focus that makes driving extremely palatable. Not the extreme front of the mind kind of focus, but letting your instincts take over while allowing your mind to open to the experience. Kind of zen-ish, but it honestly is reality.

Alongside allowing your mind to find some existential plane to sit on while the miles fly by is finding ways to keep yourself focused on the road. Meditating on the road is fine as long as you always remember you’re behind the wheel. This past trip brought on a mock slam poetry session in my mind where I described my experience in snap-worthy prose for a made-up audience in some movie-scene book store. I’ll be the first to admit that it was cheesy, but it did exercise my focus on driving, my environment, and language.

The last thing that helps you pile on the miles is to avoid stopping. Instead of watching the miles slowly fall off the Garmin, focus on the gas tank. That is the real reason you need to stop and no other. Make a goal to get through half a tank or, better yet, get the arrow below 3/4 of a tank. Most trips don’t take more than a tank or two, so by setting your goals according to fuel, you get to focus on getting there rather than having to pee.

So, that is how I do it. If you have other suggestions for making long stints behind the wheel, leave a comment!

Tue, 07 Sep 2010 00:00:00 +0000 <![CDATA[Not Writing]]> Not Writing

My writing has dropped off dramatically lately. This blog is not exactly the brainchild of an extremely prolific writer, but it also isn’t a stagnant pool of text bit rotting on the web. So what has changed? The first big change has been some of my responsibilities at work. The project that was led by my previous partner in crime has become more my responsibility. This isn’t really a promotion, but rather a shifting of guidance and responsibility. Even though it has not been a huge change, it is still change, which is rarely easy. Things have begun to pick up as we’ve gotten a bit of momentum, but it is still an adjustment for me.

The other big change has been less stark. There is some new music coming down the pipe, but it is not without its share of blood, sweat and tears. If anyone says recording a record is easy, they have never done it. And if by chance they have done it, then they weren’t successful at it. If they were successful at making a record and still claim it is easy then I don’t know what to tell you other than they are really lucky. This has been some of the toughest music to write and record. We’ve been playing for a really long time and while in a sense our chemistry is pretty well solidified, it also feels somewhat too familiar at times. That feeling isn’t too much of a detractor, except when, after you finish an exceptionally painful recording process, the only feedback suggests you have a lot more work to do and no money to do it with. The details aren’t terribly important, but it has been a really tough couple of months.

The good side is that we have some tours coming up and, as I said before, we’ve gotten a bit of momentum at work. The goal is that some of that momentum can be directed towards writing. Whether or not it shows up on my blog is another question altogether, but I’m sure that there will be some more reflections and discoveries I’d like to share sooner rather than later.

Mon, 30 Aug 2010 00:00:00 +0000 <![CDATA[Band Relationships]]> Band Relationships

We just got back from playing Denver for The UMS. Our first show was pretty crappy and honestly we didn’t play very well. It ended up being OK as people showed up, but for whatever reasons, the stage manager and people running it were non-existent. Not a huge deal, but it made for a rough start after a long trip from Oklahoma and getting in a small fender bender in our hotel parking lot. The second show was much better. Tons of people came out, it was in an actual club (the first was in a stage built into a truck) and we sold a bunch of merch. All in all, not a bad trip.

We also spent the weekend with our manager and generally tried to hash out our plans. After being in the studio for a time we have some songs recorded but the mix is pretty terrible, which is frustrating. The short and skinny is that the folks working on the record could have done a better job. Our expectations might very well have been higher this time around, but at the same time, we still expected more engagement and effort in terms of doing things right. It felt like an uphill battle, which is unfortunate, since we did spend a lot of time and worked really hard. The result though is that we’ll need to get our tracks remixed.

This brings me to the actual title of this blog. Our friend Dave constantly tells us to just ask our fans for money and use tools like Kickstarter to raise funds for things like recording. I always agreed with him that it would work for some bands, but for whatever reason I didn’t think it would work for us. Lately, I think I’ve realized why. It all comes down to the relationship bands have with their fans.

In my mind, I see our music a certain way. It is probably an ideal vision of super cool people pushing the boundaries of tone and ripping faces off live. The reality is though, I think we’re actually more of a rock band. The concepts of cool that I always try to instantiate in my visions are probably not really there. This isn’t to say we aren’t “cool”, but rather we’re not the leather jacket, cigarette smoking, sun glasses wearing, modern Velvet Underground. Live we try to put on a show and there is intensity, but the more I think about who we are and what we really sound like, it feels like I’ve misunderstood something.

This disconnect between my vision of our music and our aesthetic reality is what has always dissuaded me from wanting to ask fans for money and be more communicative. In my vision, we’re an enigma and a mystery. The reality is everyone says we’re really nice people. We shred on stage and off stage you can come up to us and we’ll smile and talk to you. In short, asking for money from fans and providing a more open relationship with them seems totally natural in reality.

The reason this has begun to feel natural is because fans understand music is entertainment. I don’t believe anyone thinks we are simply acting on stage, but rather they realize that in addition to getting on stage and playing music, we also drive around in our van, eat cheap food and generally struggle like anyone else. The 40 minutes on stage is the show and the rest is real life. Up until this point, I don’t think we’ve connected with our fans in real life and the result is that we haven’t made clear our needs, and more importantly, given them a chance to help. Moving forward, I’m not sure what will change today, but something definitely should. There are some obvious things we can do to shed a little light on what our day to day looks like. There will always be some mystery, but we don’t really have much to hide. One thing that has become exceptionally clear is that we can’t try to be something we’re not and now that I think I have a more realistic view of what we really are, avoiding the things we’re not should be much easier.

Thu, 29 Jul 2010 00:00:00 +0000 <![CDATA[Concurrency and Focus]]> Concurrency and Focus

Last night as we were driving back towards the great state of Texas, we listened to Coast to Coast. If you’ve never listened, it is a radio show that traditionally focuses on non-mainstream topics that can often stretch well into the weird. It is a fun show to listen to for the freak factor as well as the interesting perspectives. The guest in this case was the author of the book The Shallows, Nicholas Carr. The book argues that the Internet has changed the way people think and robbed people of focus, and personally, I have to agree.

One interesting thing he talked about was the brain’s concept of memory. He described the basic ideas of RAM vs. CPU Cache vs. the hard drive. The idea was that we’ve gotten faster RAM thanks to the web, but in doing so, we’ve lost the ability to save ideas and knowledge to our long term memory storage. One argument he disagreed with was that the web offers you a place to store information so you can think about other things. I partially disagreed with this because as a programmer, much of my job is finding interfaces and layers to help manage complexity, effectively relieving my mind of the burdens associated with lower level problems.

This quest to hide complexity has been a major theme in computing. Back when computers were actual people, the idea was that a scientist could let the computer do the low level work while they continued to hone their higher-level algorithms and models. The ability to abstract and classify is critical to moving beyond what you can do by yourself. There are so many details in the world that without tools to manage them, life would be unbearable.

Concurrency is a very similar situation. I read an article discussing how trends in concurrency often do not fairly characterize the problem. This got me thinking about the issues with focus. If you consider concurrency in terms of workers that all do things equally, you miss the obvious abstractions you get from more specific tools. Like learning to focus, the goal is not to store information as much as it is to manage the transitions and interruptions. It is often simpler to have a master process that decides specifically who does what than to rely on a generic solution to magically do everything correctly. Just like the scientists in the early days of computing, we should think of concurrency in terms of how we can create abstractions while freeing up resources in a way that doesn’t force us to pass everything through an intermediary. There were probably plenty of computations that early programmers did themselves. But at some point, a decision was made, on some set of parameters, that the solution should be sent to a computer to do the work. It is this kind of mindset that will make concurrency relevant and reasonable to implement in the future.
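
To make the “master process that decides who does what” idea concrete, here is a minimal sketch in Python. The task names and handlers are invented for illustration; the point is only the shape: specialized workers each own one kind of job, and a master routes work to them explicitly rather than dumping everything into a generic pool.

```python
import queue
import threading

def worker(inbox, results, handler):
    """Each worker only knows how to do one kind of job."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: shut down
            break
        results.put(handler(item))

def run_master(tasks):
    """The master explicitly routes each task to a specialized worker."""
    # Hypothetical task kinds; real ones might be "resize", "transcode", etc.
    handlers = {"double": lambda n: n * 2, "negate": lambda n: -n}
    inboxes = {kind: queue.Queue() for kind in handlers}
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inboxes[k], results, handlers[k]))
        for k in handlers
    ]
    for t in threads:
        t.start()
    for kind, payload in tasks:   # the routing decision lives in one place
        inboxes[kind].put(payload)
    for q in inboxes.values():
        q.put(None)               # tell each worker to stop
    for t in threads:
        t.join()
    return [results.get() for _ in tasks]

print(sorted(run_master([("double", 3), ("negate", 4), ("double", 5)])))
```

The design choice mirrors the paragraph above: instead of a generic pool that must magically handle any job, the knowledge of “who does what” is concentrated in the master, and each worker stays simple.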

Mon, 26 Jul 2010 00:00:00 +0000 <![CDATA[Why Choose MongoDB]]> Why Choose MongoDB

A co-worker will occasionally send along recent blog posts that call into question our decision to use MongoDB. These are small reminders that technical decisions are never black and white and that no matter how careful you are, there are going to be trade offs. Usually the complaints I hear about MongoDB have to do with durability.

Specifically, people have a problem with 10gen (the folks who make MongoDB) suggesting that you run at least two MongoDB instances to provide some level of reliability in the face of one system going down. The fact is we have been bitten by a bug that required a good deal of downtime. This kind of thing totally stinks. I’d say it is unacceptable, but the reality is that at some point you have to trust that a system will do its job. In other words, it is not enough to prevent problems, you must also have a means of recovering from problems. In our case, that recovery plan was the problem. Shame on us.

This has since been remedied and there is probably more to do. I’m sure other organizations might have considered the hiccup a worthy reason to abandon MongoDB for another more durable system. The problem is that while MongoDB’s durability did prove to be a major problem, that isn’t the whole story. There are other reasons to choose MongoDB, but durability is not one of them.

In our situation, the biggest benefit we get from MongoDB is the ability to query. This seems like such a basic tenet of databases that you’d be hard pressed to find someone who doesn’t believe running queries on your data is the killer feature of a database. With that in mind, I’d point out that many NoSQL systems do not provide the same level of support for queries. CouchDB (which I’m personally a fan of) is in fact really bad at ad-hoc queries. Once you save a view in CouchDB, you’re in pretty good shape, but randomly querying the DB is considered bad form. Likewise, generating a new view on a massive dataset is time consuming as well. That is the trade off CouchDB made. In return you get excellent durability, a fantastic HTTP based interface and the ability to utilize pure JSON as a document format. Totally reasonable trade offs if you ask me. The only problem is, the ability to query the dataset in real time is an order of magnitude more important to us than the other features. We can run a slave (for reads even), deal with BSON, and use the socket based Python driver that MongoDB provides, yet if we couldn’t do realtime queries in a reasonable amount of time, then what is the point?

By the way, I don’t mean to pick on CouchDB or any other system. The fact of the matter is that the discussion is not over what system to use as much as weighing which features are important. For us, the ability to query very quickly and under load trumps the other features such as single machine durability. We’ve paid the price of that trade off once, so we’re aiming not to do so again. Thanks to my cautious and objective thinking co-worker, we also consistently consider other data stores.
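
To illustrate the ad-hoc query style at issue, here is a toy, stdlib-only sketch that evaluates a small subset of Mongo-style filter documents against plain dicts. The collection and field names are invented; with pymongo you would pass the same filter and projection dicts to `collection.find()` and the server would do the matching.

```python
def matches(doc, flt):
    """Evaluate a tiny subset of Mongo-style filter documents."""
    for field, cond in flt.items():
        if isinstance(cond, dict):           # operator form, e.g. {"$gte": 5}
            for op, val in cond.items():
                if op == "$gte" and not doc.get(field, 0) >= val:
                    return False
                if op == "$lt" and not doc.get(field, 0) < val:
                    return False
        elif doc.get(field) != cond:         # plain equality form
            return False
    return True

def find(collection, flt, projection=None):
    """Ad-hoc query: filter first, then trim fields, like find()'s arguments."""
    hits = [d for d in collection if matches(d, flt)]
    if projection:
        hits = [{k: d[k] for k in projection if k in d} for d in hits]
    return hits

# Hypothetical log documents, just to show a query in action.
events = [
    {"host": "a", "level": "error", "count": 7},
    {"host": "b", "level": "info", "count": 9},
    {"host": "c", "level": "error", "count": 2},
]
print(find(events, {"level": "error", "count": {"$gte": 5}}, ["host"]))
```

The appeal is that the filter is just data built at request time; nothing like a CouchDB view has to be defined and indexed in advance before you can ask a new question.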

That said, I think we made a good choice to go with MongoDB. It is not perfect, but nothing is. MongoDB has been meeting needs very successfully and I have no problem recommending it.

Mon, 12 Jul 2010 00:00:00 +0000 <![CDATA[Mobile is the Future]]> Mobile is the Future

I read a great article the other day by someone who used to work at Nokia on how the US is so radically different from the rest of the world when it comes to cellular technology. It was really interesting to see how Nokia as a company was driven by research in its own local market, which happened to be a great predictor of how the rest of the world uses mobile phones. What was also really interesting was the impact of SMS on the mobile industry, and more specifically, for carriers in terms of profits. These profits (along with a wildly different geography) also seemed to help with providing better service to carrier customers. For example, no one wanted antennas on their phones because they got in the way of putting them in your pocket. In the US though, an antenna meant better reception and fewer dropped calls. Elsewhere, that never happened, so users just assumed the phone always worked.

Articles like this are really interesting as a programmer because they reinforce the importance of gathering data. Nokia’s real claim to fame in the story is its research. It paid close attention to its users and acted accordingly. As a result, Nokia sells millions of phones at a really high price all over the world and has customer loyalty. While it is something I’m not very good at, it is a good reminder that being able to gather data can make the difference between tweaking a setting and rewriting an application. A wrong hypothesis can be expensive, and data helps you make better guesses as to what the problem is. It is something I personally want to get better at.

The other theme of the article is the importance of mobile in terms of technology. The author described the impact of SMS on the industry, which was enormous. Any company that has a technology focus needs to see the importance of mobile phones as a platform. Also, the platform is not the iPhone. The author made it clear that the iPhone application business model doesn’t work, which is something I have felt for a while now. The numbers just aren’t there. You sell the app for a couple bucks to a million people and you might have made a couple million dollars. But that’s it. There is nothing else to really be done except releasing updates and hoping people keep buying it. The problem is that if it gets extremely popular, there is a really good chance Apple will just release its own version and put you out of business. The other tack is to release system services for other companies to use, but again this seems like a bad idea. You’re selling some service to the guy that only makes a few million dollars off some application. Some percentage of those sales is pretty weak in terms of revenue and you still have the same problem of Apple simply destroying your business by releasing their own version. Apps just don’t make sense in the long term and don’t reflect the rest of the world.

The reason for all this is that defaults rule. SMS changed everything because it was a built-in feature of every phone. There are tons of phones now with data plans, but the reality is that everyone texts. This is not because texts are better than Safari or Twitter, but because interoperability is taken care of. The web flourished because browsers felt the same. The mobile industry gets the same feeling from SMS. The article states that everyday users won’t buy apps, and after thinking about it a bit, I have to agree. The one thing that could change that would be the iPad. The iPad has the opportunity to change the way people think of computers and it is built completely on the idea of apps.

Before, you had a filesystem and there was a disconnect between files and applications. The iPad is destroying that assumption and people seem to appreciate the simplicity. If the new assumption is that applications are the baseline and it is a pattern that naturally works on a phone (which it arguably has), it seems possible that the idea of the app store could become widespread. A person would have their iPad as their computer and their iPhone as their laptop with both synced via the cloud and apps. I don’t know that it will work like that but it could.

The point is that in either case, mobile computing is only getting more important. One interesting side effect is that mobile platforms are only now considering how to handle multiprocessing. This is not even how to mimic threading, but rather how to do two things at once. It is getting solved quickly, but it is interesting because developers have been focused on how to handle the problem of writing apps for more than one processor. It appears it doesn’t matter because while desktops and laptops are getting more cores, people might just stop using them in favor of a different kind of machine. It still means the web and servers are very important, but we can probably keep punting on needing things like functional paradigms to handle the multiple processors all our users supposedly will be utilizing in the future.

Tue, 29 Jun 2010 00:00:00 +0000 <![CDATA[MongoUI - Barely Better Than a Terminal]]> MongoUI - Barely Better Than a Terminal

At work we use MongoDB for storing most of our critical data. We started with a customized key/value store we called Faststore. It was built in Python and used BsdDB. It was a pretty successful system in that it was reliable and performed very well. The thing that killed Faststore was that there wasn’t a good way to replicate the data. We considered a different key/value database at the core, such as Tokyo Cabinet, but it quickly became clear the speed wasn’t there for our critical operations. After some research, we settled on MongoDB for our storage needs. So far, the decision has been a rather successful one. We did hit a rather nasty bug that required some significant downtime, but beyond that, MongoDB has been excellent. This is especially true when you consider the amount of data.

As we’ve used MongoDB, one theme I’ve hit is that it can be a little hard to view the data. Things like CSVs and tables in terminals can be a little easier to read than a pretty printed dictionary. In order to help make perusing data a little bit easier when developing with MongoDB, I made a little tool called MongoUI. MongoUI is really simple. It is only a little better than a terminal. That said, it makes basic browsing a little easier.

Pretty much all you have to do to use it is download it (hg clone), install it (cd mongoui && python setup.py install) and then run it (mongoui). Once it is running, point your browser to localhost, port 9000 and you’ll see a simple list of your databases. Clicking on a DB will show you the collections and clicking on a collection gives you a couple of fields to query. The first field is for excluding/including fields. The second is for providing parameters ({'foo': 'bar'}). You can include a count by clicking the count checkbox and change the include to an exclude by clicking the exclude checkbox.

The output is still pretty much just a pretty printed dictionary, but the browser is a slightly easier place to scroll up and down than a terminal. There is no writing or editing of documents, and I seriously doubt I’d add that anytime soon. It also is not meant for huge databases. For example, if you have 20k collections in a DB, it will show a really big list. Same goes for large collections. I’m not paging or anything like that. It really is meant to make basic queries on local test databases a little bit easier.

If it seems helpful I’ll try to develop the data viewing a bit more as well as possibly include some editing bits. That said, it is nowhere on my todo list. The other thing I’d like to try and incorporate is a good RESTful API. My biggest complaint about MongoDB is that the protocol to the DB is not HTTP. It was a basic design decision, but it would be nice at times to have a basic HTTP enabled caching proxy in front of MongoDB for things like saving queries as views. Likewise, things like replication and load balancing might also be interesting uses. That said, I don’t really have a need for it at work, so I don’t know that I would get very far.

Feedback is always welcome!

Thu, 24 Jun 2010 00:00:00 +0000 <![CDATA[MongoDB Schema]]> MongoDB Schema

Unfortunately, I can’t remember where I read it, but there was a blog post that mentioned the importance of schema in a schemaless database. It struck me as somewhat closed-minded at the time. After all, they don’t call them “schemaless” for nothing. Slowly, there has been a change in how I view the issue.

Contracts have been a theme recently in my programming life. Tests essentially create contracts. The environment you run in can be rather limited and yet more robust thanks to contracts with the underlying system. Contracts are what let you sleep at night because without some guarantees you’d never get anything done. The key with contracts is that they happen at the right time and place. Abstraction (and much