Parallel Processing

It can be really hard to work with data programmatically. There comes a moment when working with a large dataset that you realize you need to process the data in parallel. As a programmer, this sounds like it could be a fun problem, and in many cases it is fun to get all your cores working hard crunching data.

The problem is that parallel processing is never purely a matter of distributing your work across CPUs. The hard part ends up being getting the data organized before sending it to your workers, and then doing something with the results. Tools like Hadoop boast of processing terabytes of data, but that's a little misleading because there is most likely a ton of code on either end of that processing.

The input and output (I/O) code can also have a big impact on the processing itself. The input side often needs to consider what the atomic unit is, as well as what the “chunk” of data needs to be. For example, if you have 10 million tiny messages to process, you probably want to batch them into chunks of, say, 5000 messages when sending them to your worker nodes, and the workers will need to know they are getting a chunk of messages rather than a single message. Similarly, for some applications the message-to-chunk ratio needs to be tweaked.
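As a rough sketch of that input side, here is one way to batch a stream of messages before handing them to workers; message_stream and send_to_worker are hypothetical stand-ins for the real producer and transport:

def chunked(messages, size=5000):
    """Yield lists of at most `size` messages from an iterable."""
    chunk = []
    for msg in messages:
        chunk.append(msg)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # don't drop the final partial chunk
        yield chunk


# each worker receives a list of messages rather than a single message
for batch in chunked(message_stream()):
    send_to_worker(batch)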

In Hadoop this sort of detail can be dealt with via HDFS, but Hadoop is not trivial to set up, to say nothing of when you have a bunch of data that doesn’t live in HDFS. The same goes for the output: when you are done, where does it go?

The point is that “data” always tends towards specificity. You can’t abstract away data; it always ends up being physical at its core. Even if the processing happens in parallel, the I/O will always be a challenging constraint.

View Source

I watched a bit of this fireside chat with Steve Jobs. It was pretty interesting to hear Steve discuss the availability of the network and how it changes the way we can work. Specifically, he mentioned that because of NFS (presumably in the BSD family of unices), he could share his home directory on every computer he worked on without ever having to think about backups or syncing his work.

What occurred to me was how much of the software we use is taken for granted. Back in the day, an educational license for Unix was around $1800! I can only imagine the difficulty of becoming a software developer back then, when all the interesting tools like databases or servers were prohibitively expensive.

It reminds me of when I first started learning about HTML and web development. I could always view the source to see what was happening. It became an essential part of how I saw the web and programming in general. The value of software was not only in its function, but in its transparency. The ability to read the source and support myself as a user allowed me the opportunity to understand why the software was so valuable.

When I think about how difficult it must have been to become a hacker back in the early days of personal computing, it is no wonder that free software and open source became so prevalent. These early pioneers had to learn the systems without reading the source! Learning meant reading through incomplete, poorly written manuals. When the manual was wrong or out of date, I can only imagine the hair pulling that must have occurred. The natural solution to this problem was to make the source available.

The process of programming is still very new and very detailed, while also being extremely generic. We are fortunate as programmers that the computing landscape was not able to enclose software development within proprietary walls like so many other technical fields. I’m very thankful I can view the source!

Property Pattern

I’ve found myself doing this quite a bit lately and thought it might be helpful to others.

Oftentimes when I’m writing some code, I want to access something as an attribute even though it comes from some service or database. For example, say we want to download a bunch of files from some service and store them on our file system for processing.

Here is what we’d like the processing code to look like:

def process_files(self):
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

We don’t really care what the filter_and_store method does. What we do care about is the downloaded_files attribute.

Let’s step back and see what the calling code might look like:

processor = MyServiceProcessor(conn)
processor.process_files()

Again, this is pretty simple, but now we have a problem: when do we actually download the files and store them on the filesystem? One option would be to do something like this in our process_files method:

def process_files(self):
    self.downloaded_files = self.download_files()
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

While it may not seem like a big deal, we just created a side effect: the downloaded_files attribute is getting set in the process_files method. There is a good chance downloaded_files is something you’d want to reuse, and this creates an odd coupling between the process_files method and the downloaded_files attribute.

Another option would be to do something like this in the constructor:

def __init__(self, conn):
    self.downloaded_files = self.download_files()

Obviously, this is a bad idea. Anytime you instantiate the object it will seemingly try to reach out across some network and download a bunch of files. We can do better!

Here are some goals:

  1. keep the API simple by using a simple attribute, downloaded_files
  2. don’t download anything until it is required
  3. only download the files once per-object
  4. allow injecting downloaded values for tests

The way I’ve been solving this recently has been to use the following property pattern:

import tempfile


class MyServiceProcessor(object):

    def __init__(self, conn):
        self.conn = conn
        self._downloaded_files = None

    @property
    def downloaded_files(self):
        # Download on first access only; later accesses reuse the
        # cached list.
        if not self._downloaded_files:
            self._downloaded_files = []
            tmpdir = tempfile.mkdtemp()
            for obj in self.conn.resources():
                self._downloaded_files.append(obj.download(tmpdir))
        return self._downloaded_files

    def process_files(self):
        result = []
        for fn in self.downloaded_files:
            result.append(self.filter_and_store(fn))
        return result

Say we wanted to test our process_files method. It becomes much easier.

import os

from mock import Mock

HERE = os.path.dirname(os.path.abspath(__file__))


def setup(self):
    self.test_files = os.listdir(os.path.join(HERE, 'service_files'))
    self.conn = Mock()
    self.processor = MyServiceProcessor(self.conn)

def test_process_files(self):
    # Just set the property variable to inject the values.
    self.processor._downloaded_files = self.test_files

    assert len(self.processor.process_files()) == len(self.test_files)

As you can see, it was really easy to inject our stub files. We know that we don’t perform any downloads until we have to. We also know that the downloads are only performed once.

Here is another variation I’ve used that doesn’t require setting up _downloaded_files in the constructor:

@property
def downloaded_files(self):
    if not hasattr(self, '_downloaded_files'):
        ...
    return self._downloaded_files

Generally, I prefer explicitly setting the _downloaded_files attribute in the constructor, as it allows more granularity when setting a default value. You can set it to an empty list, for example, which helps to communicate that the property will need to return a list.

Similarly, you can set the value to None and ensure that when the attribute is accessed, it becomes a list, possibly an empty one. This small differentiation helps to make the API easier to use: an empty list is still iterable while also being “falsey”.

There is nothing technically interesting about this technique. What I hope someone takes from this is how you can use it to write clearer code and encapsulate your implementation, while exposing a clear API between your objects. Even if you don’t publish a library, keeping your internal object APIs simple and communicative helps make your code easier to reason about.

One caveat is that this method can add a lot of small property methods to your classes. There is nothing wrong with this, but it might give a reader of your code the impression that the classes are complex. One way to combat this is to use mixins.

class MyWorkerMixinProperties(object):

    def __init__(self, conn):
        self.conn = conn
        self._categories = None
        self._foo_resources = None
        self._names = None

    @property
    def categories(self):
        if not self._categories:
            self._categories = self.conn.categories()
        return self._categories

    @property
    def foo_resources(self):
        if not self._foo_resources:
            self._foo_resources = self.conn.resources(name='foo')
        return self._foo_resources

    @property
    def names(self):
        if not self._names:
            self._names = [r.meta()['name'] for r in self.foo_resources]
        return self._names



class MyWorker(MyWorkerMixinProperties):

    def __init__(self, conn):
        MyWorkerMixinProperties.__init__(self, conn)

    def run(self):
        for resource in self.foo_resources:
            if resource.category in self.categories:
                self.put('/api/foos', {
                    'real_name': self.names[resource.name_id],
                    'values': self.process_values(resource.values),
                })

This is a somewhat contrived example, but the point is that we’ve taken all our service-based data and made it accessible via normal attributes. Each service request is encapsulated in a property, while our primary worker class has a reasonably straightforward implementation of some algorithm.

The big win here is clarity. You can write an algorithm by describing what it should do, then test it easily by injecting the values you know should produce the expected results. Furthermore, you’ve decoupled the algorithm from the I/O code, which is typically where you’ll see a good deal of repetition in the case of RESTful services, or optimization when talking to databases.

Again, this isn’t rocket science. It is a really simple technique that can help make your code much clearer. I’ve found it really useful and I hope you do too!

Iterative Code Cycle

TDD prescribes a simple process for working on code.

  1. Write a failing test
  2. Write some code to get the test to pass
  3. Refactor
  4. Repeat

If we consider this cycle more generically, we see a typical cycle every modern software developer must use when writing code.

  1. Write some code
  2. Run the code
  3. Fix any problems
  4. Repeat

In this generic cycle you might use a REPL, a standalone script, a debugger, etc. to quickly iterate on the code.

Personally, I’ve found that I use a test for this iteration because it is integrated into my editor. The benefit of using my test suite is that I often have a repeatable test when I’m done that proves (to some level of confidence) the code works as I expect it to. It may not be entirely correct, but at least it codifies how I think it should work. When the code does break, I can take a more TDD-like approach: adjust the test so it captures the bug and fails, and then fix the actual bug.

The essence of any developer’s work, then, is to make this cycle as quick as possible, no matter what tool you use to run and re-run your code. The process should be fluid and help get you into a flow when programming. If you do use tests for this process, it can also be a helpful design tool. For example, if you are writing a client library for some service, you can write the ideal API you’d like to have, without letting the implementation drive the design.

TDD has been on my mind lately as I’ve written a lot of code recently and have questioned whether or not my testing patterns have truly been helpful. They have been helpful in fixing bugs and they provide a quick coding cycle. I’d argue the code has been improved, but at the same time, I do wonder if by making things testable I’ve introduced more abstractions than necessary. I’ve had to look back on some code that used these patterns and getting up to speed was somewhat difficult. Then again, any time you read code you need to put in effort to understand what is happening. Oftentimes I’ll assume that if code doesn’t immediately convey exactly what is happening, it is terrible code. The reality is that code is complex and takes effort to understand; it should be judged on how reasonable it is to fix once it is understood. In this way, I believe my test-based coding cycle has proven itself valuable.

Obviously, the next person to look at the code will disagree, but hopefully once they understand what is going on, it won’t be too bad.

TDD

I watched DHH’s keynote at RailsConf 2014. A large part of his talk discusses the misguided association of TDD with metrics and with making code “testable”, rather than stepping back and focusing on clarity, as an author would when writing.

If you’ve ever tried to do true TDD, you might have had a similar feeling that you’re doing it wrong. I know I have. Yet I’ve also seen the benefit of iterating on code by writing tests: the faster the code/test cycle, the easier it is to experiment and write the code. Similarly, I’ve noticed more bugs show up in code that is not well covered by tests. It might not be clear, then, how DHH’s perspective fits in with the benefits of testing and the facets of TDD.

What I’ve found is that readability and clarity in code often come by way of being testable. Tests, and making code testable, can go a long way in finding the clarity that DHH describes. It can become clear very quickly, just by writing a test, that your class’s API is actually really difficult to use. You can easily spot odd dependencies in a class by the number of mocks your tests require. Sometimes I find it easier to write a quick test than to spin up a REPL to run and rerun code.

The point is that TDD can be a helpful tool for writing clear code. As DHH points out, it is not a singular path to a well-thought-out design. Unfortunately, just as some people take TDD too literally, others will feel that any sort of granular testing is a waste of time. The irony here is that DHH says very clearly that we, as software writers, need to practice. Writing and re-writing tests are a great way to become a better writer. Even if the ideals presented in TDD are a bit too extreme, the mechanism of a fast test suite and the goal of 100% coverage are still valuable in that they force you to think about and practice writing code.

The process of thinking about code is what is truly critical in almost all software development exercises. Writing tests first is just another way to slow you down and force you to think about your problem before hacking out some code. Some developers can avoid tests, most likely because they are really good at thinking about code before writing it. These people can likely iterate on ideas and concepts in their head before turning to the editor for the actual implementation. The rest of us can use the opportunity of writing tests, taking notes, and even drawing a diagram as tools to force us to think about our system before hacking some ugly code together.

Concurrency Transitions

Glyph, the creator of Twisted, wrote an interesting article discussing the intrinsic flaws of using threads. The essential idea is that unless you know explicitly when you are switching contexts, it is extremely difficult to effectively reason about concurrency in code.

I agree that this is one way to handle concurrency, and Glyph provides a clear perspective on the underlying constraints of concurrent programming. The biggest constraint is that you need a way to guarantee a set of statements happens atomically. He suggests an event-driven paradigm as the best way to do this. In a typical async system, the work is built up from small procedures that run atomically, yielding control back to the main loop as they finish. The reason the async model works so well is that you eliminate all CPU-based concurrency and allow work to happen while waiting for I/O.

There are other valid ways to achieve a similar effect. The key in all these methods, async included, is to know when you transition from atomic, sequential operations to potentially concurrent, and often parallel, operations.

A great example of this mindset is found in functional programming, and specifically in monads. A monad is essentially a guarantee that some set of operations will happen atomically. In a functional language, functions are considered “pure”, meaning they don’t introduce any “side effects”; more specifically, they do not change any state. Monads give functional languages a way to interact with the outside world by providing a logical interface that the underlying system can use to do whatever work is necessary to make the operation safe. Clojure, for example, uses a Software Transactional Memory system to safely apply changes to state. Another approach might be to use locking and mutexes. No matter the methodology, the goal is to provide a safe way to change state by giving the developer an explicit way to identify portions of code that change external state.

Here is a classic example in Python of where mutable state can cause problems: the mutable default argument.
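A minimal sketch (the function name is made up):

def append_item(item, items=[]):
    # the default list is created once, at definition time, not per call
    items.append(item)
    return items

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] -- the default list kept its state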

In Python, and the vast majority of languages, it is assumed that a function can act on a variable from a larger scope. This is possible thanks to mutable data structures. In the example above, calling the function multiple times doesn’t re-initialize the argument to an empty list. It is a mutable data structure that exists as state. When the function is called, that state changes, and that change of state is considered a “side effect” in functional programming. This sort of issue is even more difficult in threaded programming because your state can cross threads in addition to lexical boundaries.

If we generalize the purpose of monads and Clojure’s reference types, we can establish that concurrent systems need to be able to manage the transitions between pure functionality (no state manipulation) and operations that affect state.

One methodology that I have found effective for managing this transition is to use queues. More generally, this might be called message passing, but I don’t believe message passing alone guarantees the system knows when state changes. With a queue, you have an obvious entrance and exit point where the transition between purity and side effects takes place.

The way to implement this sort of system is to consider each consumer of a queue as a separate process. By treating consumers and producers as processes, we ensure there is a clear boundary between them that protects shared memory, and more generally shared state. The queue then acts as a bridge across the “physical” border, and it provides control over the transition between pure functionality and side effects.
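Here is a minimal sketch of that shape using Python’s multiprocessing module, where process_doc is a made-up stand-in for whatever work the consumer does:

import multiprocessing as mp


def process_doc(doc):
    print('processed', doc)  # stand-in for the consumer's real work


def consumer(in_q):
    # Inside this process, work is sequential and isolated: state only
    # arrives via the queue, so there is no shared memory to protect.
    for doc in iter(in_q.get, None):  # None acts as a shutdown sentinel
        process_doc(doc)


if __name__ == '__main__':
    q = mp.Queue()
    worker = mp.Process(target=consumer, args=(q,))
    worker.start()
    for doc in ['a', 'b', 'c']:
        q.put(doc)  # the explicit transition point: state crosses here
    q.put(None)     # tell the consumer to exit
    worker.join()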

To relate this back to Glyph’s async perspective, pushing state onto the queue is similar to yielding to the reactor in an async system. When state is popped off the queue into a process, it can be acted upon without worry of causing side effects that could affect other operations.

Glyph brought up the scenario where a function might yield multiple times in order to pass control back to the managing reactor. This becomes less necessary in the queue-and-process system I describe because there is no chance of a context switch interrupting an operation or slowing down the reactor. In a typical async framework, the job of the reactor is to order each bit of work, but the work still happens in series. If one operation takes a long time, it stops all other work from happening, assuming that work is not doing I/O. The queue-and-process system doesn’t have this limitation, as it yields control to the queue at the correct logical point in the algorithm. Also, in terms of Python, the GIL is mitigated by using processes. The result is that you can write your algorithms in a sequential manner while still tackling problems concurrently.

Like anything, this queue-and-process model is not a panacea. If your data is large, you often need to pass around references to the data and where it can be retrieved. If that resource is not something designed to handle concurrent connections (the file system, for example), you may still run into concurrency issues accessing it. It can also be difficult to reason about failures in a queue-based system: how full is too full? You can limit the queue size, but that might cause blocking issues that are unreasonable.
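For example, a bounded queue turns “too full” into explicit backpressure that the producer has to handle somehow; a small sketch, with work_item as a made-up payload:

import multiprocessing as mp
from queue import Full

q = mp.Queue(maxsize=100)  # a bounded queue applies backpressure
work_item = {'id': 1}      # hypothetical payload

try:
    q.put(work_item, timeout=5.0)  # blocks while the queue is full
except Full:
    pass  # decide: drop the item, retry later, or fail loudly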

There is no silver bullet, but if you understand the significance of transitions between pure functionality and side effects, you have a good chance of producing a reasonable system no matter what concurrency model you use.

A Sample CherryPy App Stub

In many full-stack frameworks, there is a facility to create a new application via some command. In Django, for example, you use django-admin.py startproject foo. The startproject command will create some directories and files to help you get started.

CherryPy tries very hard to avoid making decisions for you. Instead, CherryPy allows you to set up and configure the layout of your code however you wish. Unfortunately, if you are unfamiliar with CherryPy, it can feel a bit daunting to set up a new application.

Here is how I would set up a CherryPy application that is meant to serve a basic site with static resources and some handlers.

The File System

Here is what the file system looks like (entries marked with * are optional):

├── myproj
│   ├── __init__.py
│   ├── config.py *
│   ├── controllers.py
│   ├── models.py
│   ├── server.py
│   ├── static
│   ├── lib *
│   └── views
│       └── base.tmpl
├── setup.py
└── tests

First off, it is a Python package with a setup.py. If you’ve never created a Python package before, here is a good tutorial.

Next up is the project directory. This is where all your code lives. Inside this directory we have a few files and directories.

  • config.py : Practically every application is going to need some configuration and a way to load it. I put that code in config.py and typically import it when necessary. You can leave this out until you need it.
  • controllers.py : MVC is a pretty good design pattern to follow. The controllers.py file is where you put the objects that will be mounted on the cherrypy.tree.
  • models.py : Applications typically need to talk to a database or some other service for storing persistent data. I highly recommend SQLAlchemy for this. You can configure the models referred to in the SQLAlchemy docs here in models.py.
  • server.py : CherryPy comes with a production-ready web server that works really well behind a load-balancing proxy such as Nginx. This web server should be used for development as well. I’ll provide a simple example of what might go in your server.py file.
  • static : This is where your css, images, etc. will go.
  • lib : CherryPy does a good job of letting you write plain Python. Once the controllers start becoming more complex, I try to move some of that functionality into well-organized classes / functions in the lib directory.
  • views : Here is where you keep your template files. Jinja2 is a popular choice if you don’t already have a preference.

Lastly, I added a tests directory for adding unit and functional tests. If you’ve never done any testing in Python, I highly recommend looking at pytest to get started.

Hooking Things Together

Now that we have a bunch of files and directories, we can start to write our app. We’ll start with the Hello World example on the CherryPy homepage.

In our controllers.py we’ll add our HelloWorld class:

# controllers.py
import cherrypy


class HelloWorld(object):
    def index(self):
        return 'Hello World!'
    index.exposed = True

Our server.py is where we will hook up our controller with the web server. The server.py file is also how we’ll run our code in development, and potentially in production:

import os

import cherrypy

# if you have a config, import it here
# from myproj import config

from myproj.controllers import HelloWorld

HERE = os.path.dirname(os.path.abspath(__file__))


def get_app_config():
    return {
        '/static': {
            'tools.staticdir.on': True,
            'tools.staticdir.dir': os.path.join(HERE, 'static'),
        }
    }


def get_app(config=None):
    config = config or get_app_config()
    cherrypy.tree.mount(HelloWorld(), '/', config=config)
    return cherrypy.tree


def start():
    get_app()
    cherrypy.engine.signals.subscribe()
    cherrypy.engine.start()
    cherrypy.engine.block()

if __name__ == '__main__':
    start()

Obviously, this looks more complicated than the example on the CherryPy homepage. I’ll walk you through it to explain why it is a little more complex.

First off, if you have a config.py that sets up a configuration object or anything similar, we import that first. Feel free to leave that out until you have a specific need.

Next up we import our controller from our controllers.py file.

After our imports we set up a HERE variable that will be used to configure any paths, the static resources being the obvious example.

At this point we start defining a few functions. The get_app_config function returns a configuration for the application. In the config, we set up the staticdir tool to point to our static folder; this default configuration exposes those files via /static.

This default configuration is defined in a function to make it easier to test. As your application grows, you will end up needing to merge different configuration details together depending on what is passed into the application. Starting off with your config coming from a function makes changing your config for tests much easier.
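For example, a test can take the default config and swap in whatever it needs before building the app. A sketch using pytest’s tmpdir fixture (the test name is made up):

from myproj.server import get_app, get_app_config


def test_static_dir_override(tmpdir):
    config = get_app_config()
    # point the static dir at a throwaway directory for this test
    config['/static']['tools.staticdir.dir'] = str(tmpdir)
    app = get_app(config)
    assert app is not None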

In the same way we’ve put our config behind a function, we also have our application available behind a function. Calling get_app has the side effect of mounting the HelloWorld controller on the cherrypy.tree, making it available when the server starts. The get_app function also returns the cherrypy.tree. The reason for this is, once again, to allow easier testing with tools such as WebTest. WebTest lets you take a WSGI application and make requests against it, asserting against the response, without requiring that you start up a server. I’ll provide an example in a moment.

Finally, we have our start function. It calls get_app to mount our application and then calls the necessary functions to start the server. The quickstart function used in the homepage tutorial does the same thing under the hood, while also handling the mounting and config for you. quickstart can become less helpful as your application grows because it assumes you are mounting a single object at the root. If you prefer to use quickstart you certainly can; just be aware that it can be easy to clobber your configuration when mixing it with cherrypy.tree.mount.
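For reference, the single-root case above collapses to roughly this one-liner, which mounts, applies the config, and starts the engine in one call:

# roughly equivalent to get_app() + start() for a single root object
cherrypy.quickstart(HelloWorld(), '/', config=get_app_config())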

One thing I haven’t addressed here is the database connection. That is outside the scope of this post, but for a good example of how to configure SQLAlchemy and CherryPy, take a look at the example application, Twiseless. Specifically, you can see how to set up the models and connections. I’ve chosen a file system organization that is a little closer to other frameworks like Django, but please borrow liberally from Twiseless to fill in the gaps I’ve left here.

Testing

In full-stack frameworks like Django, testing is part of the full package. While many venture outside the confines of the defaults (using pytest vs. Django’s unittest-based test runner), it is generally easy to test things like requests to the web framework.

CherryPy does not take any steps to make this easier, but fortunately, this default app configuration lends itself to relatively easy testing.

Let’s say we want to test our HelloWorld controller. First off, we should set up an environment to develop in. For this we’ll use virtualenv. I like to use a directory called venv. In the project directory:

$ virtualenv venv

Virtualenv comes bundled with pip. Pip has a helpful feature where you can define requirements in a single text file. Assuming you’ve already filled in your setup.py with information about your package, we’ll create a dev_requirements.txt to make it easy to get our environment set up.

# dev_requirements.txt

-e .  # install our package

# test requirements
pytest
webtest

Then we can install these into our virtualenv by doing the following in the shell:

$ source venv/bin/activate
(venv) $ pip install -r dev_requirements.txt

Once the requirements are all installed, we can add our test.

We’ll create a file in tests called test_controller_hello_world.py. Here is what it will look like:

import pytest
import webtest

from myproj.server import get_app


@pytest.fixture(scope='module')
def http():
    return webtest.TestApp(get_app())


class TestHelloWorld(object):

    def test_hello_world_request(self, http):
        resp = http.get('/')
        assert resp.status_int == 200
        assert 'Hello World!' in resp

In the example, we are using a pytest fixture to inject WebTest into our test. WebTest allows you to perform requests against a WSGI application without having to start up a server. The http.get call in our test is then the same as if we had started up the server and made the request in a web browser. The resulting response can be used to make assertions.

We can run the tests via the py.test command:

(venv) $ py.test tests/

It should be noted that we could also test the response by simply instantiating our HelloWorld class and asserting that the result of the index method is correct. For example:

from myproj.controllers import HelloWorld


def test_hello_world_index():
    controller = HelloWorld()
    assert controller.index() == 'Hello World!'

The problem with directly using controller objects is that as you use more of CherryPy’s features, you end up depending on cherrypy.request and other cherrypy objects. This progression is perfectly natural, but it makes it difficult to test handler methods without also patching much of the cherrypy framework using a library like mock. Mock is a great library and I recommend it, but when testing controllers, using WebTest to assert against responses is preferable.

Similarly, I’ve found pytest fixtures to be a powerful way to introduce external services into tests, but you are free to use any other method you’d like for getting WebTest into your tests.

Conclusions

CherryPy is truly an unopinionated framework. Its purpose is to create a simple gateway between HTTP and plain Python code. The result is that there are often many questions about how to do common tasks, since there are so few constraints. Hopefully the above folder layout, alongside the excellent Twiseless example, provides a good jumping-off point for getting the job done.

Also, if you don’t like the layout mentioned above, you are free to change it however you like! That is the beauty of CherryPy: it allows you to organize and structure your application the way you want it structured. You can be creative and customize your app to your own needs without fear of working against the framework.

Queues

Ian Bicking has said goodbye. Paste and WSGI played a huge part in my journey as a Python programmer, and after reading Ian’s post, I can definitely relate. Web frameworks are becoming more and more stripped down as we move to better JS frameworks like AngularJS. Databases have become rather boring, as Postgres seems to do practically anything and MongoDB finally feels slightly stable. Where I think there is still room to grow is in actual data, which is where queues come in.

Recently I’ve been dragged into the wild world of Django. If you plan on doing anything outside the typical request / response cycle, you will quickly run into Celery. Celery defines itself as a distributed task queue. The way it works is that you run celery worker processes that use the same code as your application. These workers listen to a queue (RabbitMQ, for example) for task events that a worker will execute. There are some other pieces provided, such as scheduling, but generally this is the model.
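A minimal sketch of that model (the module layout and broker URL are made up):

# tasks.py
from celery import Celery

app = Celery('myproj', broker='amqp://localhost')


@app.task
def process_like(url, user):
    ...  # runs in a celery worker process, not in the web process

# from the web application, this line pushes a task event onto the queue:
#   process_like.delay('http://example.com', 'someuser')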

The powerful abstraction here is the queue. We’ve recently seen the acceptance of async models in programming. On the database front, eventual consistency has become more and more accepted as a fact of life for big data systems. Browsers have adopted data storage models to help protect user data while it gets replicated to a central system. Mobile devices with flaky connectivity provide the obvious use case for storing data temporarily on the client. All these technologies present a queue-like architecture where data is sent through a series of queues, with workers waiting to act on it.

The model is similar to functional programming. In a functional programming language you use functions to describe a set of operations that will happen on a specific type or set of data. Here is a simple example in Clojure:

(defn handle-event [evt]
  (add-to-index (map split-by-id (parse (:data evt)))))

Here we are handling some evt data structure that has a data key. The data might be a string that gets parsed by the parse function. The result of the parsing is passed to a map operation, which returns an iterable that is consumed by the add-to-index function.

Now, say we wanted to implement something similar using queues in Python.

def parse(data, output):
    # some parsing that produces `parts`...
    for part in parts:
        output.push(split_by_id(part))


def add_to_index(input):
    while True:
        doc = input.get()
        db.write(doc)


def POST(doc):
    id = gen_id()
    indexing_queue.push((id, doc))
    return {'message': 'added to index queue',
            'redirect': '/indexed/%s' % id}

Even though this is a bit more verbose, it presents a similar model to the functional paradigm. Each step happens on an immutable value. Once a function receives a value from the queue, it doesn’t have to be concerned with that value changing while it does its operation. What’s more, the processing can be on the same machine or across a cluster of machines, mitigating the effect of the GIL.

This isn’t a new idea. It is very similar to the actor model and other concurrency paradigms. Async programming effectively does the same thing: the main loop waits for I/O, and when it arrives, sends it to the respective listener. In theory, a celery worker could queue up a task on another celery queue in order to get a similar effect.

What is interesting is that we don’t currently have a good way to do this sort of programming; a lot of infrastructure and tooling is missing. There are questions as to how to deploy new nodes, how to keep code up to date, and what happens when the queue gets backed up. Also, what happens when Python isn’t fast enough? How do you utilize a faster system? How do you do backfills of the data? Can you just re-queue old data?

I obviously don’t have all the answers, but I believe the model could make processing streamable data more powerful. What makes the queue model possible is an API and document format for using the queue. If all users of the queue understand the content on the queue, then any system that connects to the queue can participate in the application.
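A sketch of what such a shared document format might look like, with entirely made-up fields:

import json

# an envelope every producer and consumer agrees on
message = json.dumps({
    'version': 1,    # format version, so consumers can handle upgrades
    'type': 'like',  # tells consumers how to interpret the payload
    'payload': {'url': 'http://example.com',
                'user': 'someuser',
                'time': '2014-06-01T12:00:00Z'},
})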

Again, I’m sure others have built systems like this, but as there is no framework available for Python, I suspect it is not a popular paradigm. One example of the pattern (sans a typical queue) is Mongrel2 with its use of ZeroMQ. Unfortunately, with the web providing things like streaming responses, I don’t believe this model is very helpful for a generic web server.

Where I believe it could excel is when you have a lot of data coming in that requires flexible analysis by many different systems, such that a single data store cannot provide the flexibility required. For example, if you wanted to process all Facebook likes based on URLs, users, and times, it would require a rather robust database that could effectively query each facet and calculate results reasonably fast. Often this is not possible. Using queues and a streaming model, you could listen to each like as it happens and have different workers process the data and create their own data sources customized for specific queries.

I still enjoy writing Python, and at this point I feel I know the language reasonably well. At the same time, I can relate to the feeling that it isn’t as exciting as it used to be. While JavaScript could be a lot of fun, I think there is still something to be done with big data that makes sense for Python. Furthermore, I’d hope the queue model I mentioned above leaves room to integrate more languages and systems, such that if another platform does make sense, it is trivial to switch where needed.

Have others written similar systems? Are there problems that I’m missing?

Immutability

One thing about functional programming languages that is a source of frustration is immutable data structures. In Clojure there are a host of constructs that allow you to change data in place. This is possible because the operation is wrapped in a transaction of sorts that guarantees it will work or everything will be reverted.

One description that might be helpful is that Clojure uses locks by default. Any data you store is immutable and therefore locked. You can always make a copy efficiently and you are provided some tools to unlock the data when necessary.

I’m definitely not used to this model by any stretch, but it seems the transactional paradigm, along with efficient copies, makes for a good balance of functional values alongside practical requirements.

My First Clojure Experience

Clojure is a LISP built on top of the JVM. As a huge fan of Emacs, it shouldn’t be surprising that there is a soft spot in my heart for LISP as well as functional programming. The problem is that lisp is a rather lonely language. There are few easily googleable tutorials and a rather fractured community. You have a ton of options (Guile, Scheme, Racket, CL, etc.) with none of them providing much proof that a strong, long-lasting community exists. It can be rather daunting to spend time learning a language from a limited set of docs, knowing it is unlikely you will have many chances to actually use it.

Of course, learning a lisp (and functional programming) does make you a better programmer; it is definitely time well spent. That said, the reality of actually using lisp in the real world has always been a deterrent for me personally.

Clojure, on the other hand, is a little different.

Clojure, being built on the JVM, manages to provide a lisp that is contextualized by Java. Clojure doesn’t try to completely hide the JVM; instead it provides clear points of interoperability and communicates its features in terms of Java. Rich Hickey does a great job explaining his perspective on Clojure and, more importantly, what its goal is, all through Java-colored glasses. The result is a creator who is able to present a practical lisp built from lessons learned programming in typical object-oriented paradigms.

Idealism aside, what is programming in Clojure really like?

Well, as a Python programmer with limited Java experience, it is a little rough to get started. The most difficult part of learning a lisp is figuring out how to correctly access data. In Python (or any OO language) it is extremely easy to create a data structure and get what you need. For example, if you have a nested dictionary of data, you can always provide a couple of keys and go directly to the data you want. Lisp does not take the same approach.

It would be really great if I could tell you how best to map Python data structures onto Clojure data structures, but I really don’t know, and that is really frustrating! It is frustrating precisely because I can see how effective the platform and its constructs would be if only I could wrap my head around dealing with data.

Fortunately, Rich gives us some tips by way of Hammock Driven Development that seem promising. A common concept within the world of lisp is that your program is really just data. Cascalog, a popular Hadoop map reduce framework, provides a practical example of this through its logic engine. Here is a decent video that shows a declarative form, where your program really is just data used by a logic engine. Eventually, I’m sure my brain will figure out how to effectively use data in Clojure.

Another thing that is commonly frustrating for a Python programmer on the JVM is dealing with the overwhelming ecosystem. Clojure manages to make this aspect almost trivial thanks to Leiningen. Imagine virtualenv/pip merged with Django’s manage.py and you start to see how powerful a tool it is.

Finally, Clojure development is really nice in Emacs. The key is the inferior lisp process. If you’ve ever wanted a Python IDE, you’ll find that the only way to reliably get features like autocomplete to work with knowledge of the project is to make sure the project is configured with the specific virtualenv. Emacs makes this sort of interaction trivial in Clojure thanks to tools like Cider that jack into the inferior lisp process to compile a specific function, run tests, or play around in a REPL.

I highly recommend checking out Clojure. Like a musical instrument, parens may take a little practice, but once you get used to them, the language becomes really elegant. At a practical level you get a similar dynamism to Python, along with the benefits of a platform that is fast and takes advantage of multiple cores. Most importantly, there is a vibrant and helpful community.

Even if you don’t give Clojure a try, I encourage you to watch some of Rich Hickey’s talks online. They are easy to watch and take an interesting perspective on OO based languages. I’ve become a fan.