I've finally gotten around to playing with CouchDB. It really reminds me more of an object database than a document store, in that your documents are effectively JSON files that can be queried via JavaScript. I'm not sure if you could put something like an image or some other binary data in the store just yet, but seeing as that never seems to be that great an idea, it's probably for the best.
My big question is the speed. I just can't imagine that, as things get really large, you're not going to pay a pretty hefty price. If a view (kind of like a query) gets updated on each write, is that an enormous index keeping track of all the documents? If it doesn't keep something like an index, then how could it search its stored items any faster than a typical database?
Seeing as the CouchDB guys must be pretty smart, I'm betting they have thought about these kinds of issues before. There are also many examples of similar ideas in other hash-based databases, so hopefully things are effectively solved and the issue is now just a marketing question. In either case, hopefully the Lazyweb can help me out here.
Wanting to get a better feel for CouchDB, I figured what better way to explore it than to build a small application. Usually the example application is a blog, but being rather sick of writing blog software, I decided to make a simple todo list.
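To make the question a bit more concrete, here is a toy sketch of what "a view kept as an index" could look like. This is purely my own illustration and not how CouchDB is actually implemented: the ToyView class, the by_status map function, and the status and title fields are all made up. The idea is just that each document contributes some sorted (key, value) rows, and a changed document only replaces its own rows instead of forcing a full rebuild.

```python
import bisect


class ToyView:
    """Toy model of a view: a sorted list of (key, doc_id, value) rows."""

    def __init__(self, map_fn):
        self.map_fn = map_fn   # yields (key, value) pairs for a document
        self.rows = []         # kept sorted so key lookups can use bisect
        self.by_doc = {}       # doc_id -> rows that document contributed

    def update(self, doc_id, doc):
        # Remove only the rows this document contributed earlier
        # (list.remove is linear; a real index would use a tree).
        for row in self.by_doc.pop(doc_id, []):
            self.rows.remove(row)
        # Insert the document's fresh rows in sorted position.
        new_rows = [(key, doc_id, value) for key, value in self.map_fn(doc)]
        for row in new_rows:
            bisect.insort(self.rows, row)
        self.by_doc[doc_id] = new_rows

    def lookup(self, key):
        # Binary search to the first row with this key, then scan forward.
        i = bisect.bisect_left(self.rows, (key,))
        matches = []
        while i < len(self.rows) and self.rows[i][0] == key:
            matches.append(self.rows[i][1:])
            i += 1
        return matches


def by_status(doc):
    # Hypothetical map function: index todo items by their status.
    if "status" in doc:
        yield doc["status"], doc["title"]


view = ToyView(by_status)
view.update("a", {"title": "write post", "status": "todo"})
view.update("b", {"title": "buy milk", "status": "done"})
view.update("a", {"title": "write post", "status": "done"})  # only doc "a" is re-indexed
done_items = view.lookup("done")
todo_items = view.lookup("todo")
```

If something along these lines is going on, the cost of a write is tied to the documents that changed rather than to the size of the whole database, which is about what a regular database index gives you.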
I should mention that my example was not meant to be a Tada-List or Remember the Milk or some other equally heavy time management tool. While I'm sure if I could commit to using one it would be beneficial in the long run, I'm just too simple a guy to use some web service for checking off things I get done. Instead, my todo list app was based loosely on Org-Mode and its built in TODO functionality.
If anyone wants to look at the code, it is here.
I'm using CherryPy because it is pretty easy to get up and running and I've found it to be a very solid server. I'm also using jQuery on the JavaScript side of things. I've tried to get accustomed to writing JavaScript in terms of jQuery plugins. This is something of a hassle, while at the same time, I feel somewhat protected via the jQuery namespace. This could very easily be a false sense of security, but after writing a simple widget framework, my impression is that when working with JavaScript, the less time you spend on the front lines battling cross-browser issues, the better. There really is a better chance of the jQuery folks solving problems than me understanding issues and fixing them myself, simply because they have spent more time on the issues. Anyway.
The Good.
The Python CouchDB library is pretty slick. It uses httplib2, which is a plus in my book as it does caching right and has things like HTTP Auth built in. The abstraction between the actual JSON document and the Python object is really shallow, which is a huge plus. Most things seem to work when treated as a dict, which feels very pythonic. Also, it was trivial to get started on some simple interactive interpreter tools to help work with a database. This sort of thing is really more of a testament to Python and CouchDB than anything else. Not having to be concerned with schemas and having simplejson makes playing with a CouchDB database really easy. Finally, the python-couchdb source was very helpful for learning the library. It looks like testing is done via doctest, so the docstrings are essential.
The Bad.
I had a few errors where it seems like the persistent connection was lost between CouchDB and the httplib2 connection object. Essentially, it looked like a request failed because things timed out. I'm not sure if that is the case, but from the standpoint of a user, I would prefer to ignore that kind of error in favor of making another attempt at the request. That is, if the error really was due to a timed out persistent HTTP connection.
Another area that I wish would be addressed is working with views. Specifically, it was sort of a pain to figure out how to programmatically add a view (which is essentially like creating an index on some part of the document in the CouchDB world). It makes total sense why this would be left out of the library, because views are simply regular documents following a special convention. But, since the rest of the library works so darn well, why not add a couple helpful bits to make life easier? In this case it would probably be better for me to shut up and send a patch, so that might be what I'll do.
Lastly, I didn't see how to deal with security. One of the benefits of httplib2 is that it has HTTP Auth built in. That means I could use something like Digest to make talking to CouchDB a little more secure. Seeing as I've barely scratched the surface with CouchDB, I might be missing some obvious best practices or patterns. The python-couchdb lib might also have some surprises that I don't know about. In other words, I'm not consciously trying to be lazy. Security just didn't hit me in the face.
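The "just try again" behavior I'm wishing for could be sketched as a small wrapper like the one below. This is my own guess at a workaround, not something python-couchdb or httplib2 provides: with_retry and the flaky stand-in are made-up names, and it assumes the request is safe to repeat and that the failure shows up as a socket timeout or connection error.

```python
import socket
import time


def with_retry(request, attempts=3, delay=0.0,
               retry_on=(socket.timeout, ConnectionError)):
    """Call request() and retry on connection-level errors only."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except retry_on:
            if attempt == attempts:
                raise          # out of attempts: surface the real error
            time.sleep(delay)  # optional pause before trying again


# Stand-in for a request that fails once on a stale connection.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 2:
        raise ConnectionResetError("stale persistent connection")
    return "ok"


result = with_retry(flaky)
```

Anything that isn't a connection-level error still propagates right away, so a genuine problem with the request itself wouldn't get silently retried.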
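For what it's worth, the "special convention" is small enough to sketch. A design document is a regular document whose id starts with _design/ and whose views field maps view names to JavaScript map functions stored as strings. The helper below is only an illustration of the kind of thing I'd want in the library: make_design_doc, the todos and by_status names, and the status and title fields are all made up for the example.

```python
import json


def make_design_doc(name, views, language="javascript"):
    """Build the body of a design document holding the given views.

    views maps a view name to the source of its map function (a string).
    """
    return {
        "_id": "_design/" + name,
        "language": language,
        "views": {
            view_name: {"map": map_source}
            for view_name, map_source in views.items()
        },
    }


design = make_design_doc("todos", {
    # Hypothetical view: index todo items by their status field.
    "by_status": "function(doc) { if (doc.status) emit(doc.status, doc.title); }",
})
body = json.dumps(design, indent=2)  # the JSON that would be stored in CouchDB
```

Saving it is then an ordinary document write under that id, which is exactly why the library can get away without special support for it.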
Overall it was a lot of fun playing around with CouchDB. I'm sure I could have done more with JavaScript, but without security in place that seemed like a bad idea. I also didn't notice the performance being a problem. I didn't do any benchmarks or anything, but things felt snappy enough to feel confident things could work. And yes, this is a totally useless subjective opinion on performance, but I'm claiming to be avoiding the root of all evil.
I'm interested to see what happens with CouchDB. The whole document store side of things seems really powerful. I also like the idea of views. The big question in my mind is how the complexity of managing views scales over time. When I end up with massive amounts of data over a cluster of CouchDB servers, am I also going to have to keep track of tons of code used for the views? Assuming views can get rather complicated, what are my strategies for reducing the complexity and how do those impact performance? SQLAlchemy, a truly great ORM, does a good job helping handle complexity without sacrificing too much speed, but does adding functions available in views cause a major slowdown? Time will tell how these sorts of questions pan out. One thing for sure is that CouchDB seems like a relatively worthwhile application to take a chance on. If I were an early adopter looking for an edge, CouchDB is a good option that might pay off.
In changing jobs a while back I also had the opportunity to change version control systems from Subversion to Mercurial. I had been a fan of Mercurial, but my usage was rather limited. When I actually had to get serious about using hg, things were not as smooth as I would have hoped.
The biggest issue was my own misunderstandings regarding what you could do with hg. I was under the impression you could clone a repo, make changes and push selective changes back to the main repo. The key word there is selective. You can branch/clone and merge back to the main repo, but it really needs to be an all or nothing operation. Otherwise you need to go back and use something like the transplant extension to migrate specific changesets to the other repo. Transplants are also somewhat scary because there is not a separate commit operation. When you transplant, the changesets are immediately committed to the branch/repo.
The other option for working with specific changesets is to use Mercurial Queues. This method seems to be very popular among hg fans, but honestly, it was more confusing than working with branches and transplant. I gave it a try on a small test repo and things broke almost immediately. While I can't recall exactly what the error messages were, I do remember it was a pain to fix. The biggest problem with MQ was that it used an entirely different model. The idea of working with patchsets seems appealing, but I got the impression that one really needed to commit to using Queues for things to really work as advertised. Just to be clear though, the chances for user error were pretty high. If a hg guru were to show me the right way to work on these things, I have no doubts much of my confusion could be cleared up.
The other huge frustration with Mercurial was the concept of heads. I sort of understand heads as divergent directions within one branch. So, if a branch has two heads, then there are almost two sub-branches (the heads) within the single branch. At the 5000 ft level, this seems pretty clear. As you work in a branch, you are also working in a head. If you pull changes from someone else, that creates a new head. To remove the extra heads (you really don't need two working heads) you need to merge the two heads back into one. This effectively is the same thing as merging two branches, with the outcome being a single head or branch. Again, from a design perspective, this makes perfect sense. Where things get sticky is when a problem arises due to conflicts. For example, I ended up making changes in a branch that I should have worked on in the default branch (I'll go into the tip/default issues next) and merged back into the versioned branch. For whatever reason, every time I needed to commit to the versioned branch, conflicts would need to be resolved for one file. This ended up getting propagated to others as well after pushing to the main repo. Again, user error could have easily been the cause, but fixing the problem was next to impossible.
Lastly, the distinction between tip and default is just weird. You'll notice above I mention working in a branch. This is because you need to realize you are always in a branch, even if it is the "default" branch. If you ever consider yourself working in the "head" or "trunk", you'll run into problems. My understanding is that the default branch is the branch you change into when you do something like "hg up" without any arguments. I've eventually become explicit about this by always doing "hg up default" if I want to work on the "trunk". So far, thinking in terms of always using a branch is very helpful. Where things get odd is the idea of "tip". To this day I still don't truly understand what "tip" really is. If you do a "hg tip" it will tell you the most recently added changeset, in the most recently changed head (hg help tip). Again, the big picture makes sense. The problem is that you can use other commands on the "tip" such as the "heads" command. If the tip is the most recent changeset in the most recent head, then it would seem that doing something like "hg heads tip" must refer to something larger, such as the current or default branch. The truth is I have no idea what it refers to, so my resolution has been to always pay attention to the branch I'm in and never use "tip" for anything.
Overall, the idea of a DVCS is really great. When Mercurial has been working for me, it is actually really nice and doesn't require much extra work. When I think of alternatives (Subversion + DVCS), it always seems like a messy solution to a problem that a single DVCS solves on its own. The biggest benefit I've gotten from Mercurial is the need to pay close attention to what I'm doing with my VCS. I've become very conscious of what operations I'm doing and how they impact the tree as a whole. Mercurial requires careful thought in order to keep track of all the details, but in the end, it has been good practice for me.
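To keep heads and tip straight in my own mind, it helps to model the history as a plain graph. This is only a toy picture, not Mercurial's code, and it ignores named branches entirely: the history dict and the heads and tip functions are made up for the illustration. A head is any changeset with no children, and the tip is simply whichever changeset was added to the repo last.

```python
def heads(parents):
    """Return revisions that no other revision names as a parent."""
    has_child = {p for ps in parents.values() for p in ps}
    return sorted(rev for rev in parents if rev not in has_child)


def tip(parents):
    """Return the most recently added revision (highest local number)."""
    return max(parents)


# rev -> list of parent revs; revision numbers are local insertion order.
history = {0: [], 1: [0], 2: [1]}   # linear history, one head
history[3] = [1]                    # pulled change based on rev 1: a second head
two_heads = heads(history)
tip_before_merge = tip(history)     # the pulled changeset, since it arrived last

history[4] = [2, 3]                 # merge commit with both heads as parents
one_head = heads(history)
tip_after_merge = tip(history)
```

Seen this way, the tip is always one of the heads, but which one it is depends only on what landed in the repo last, which is probably why leaning on it feels so slippery.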
Seeing as I actually enjoy programming, sometimes it is nice to make small projects for myself. I have another project I'm working on for a friend that has been taking up most of my free programming time, so my blog makes for an arena of experimentation. Recently, I've been interested in using CouchDB. I'm partly interested in it because it is not a traditional database. While I can put SQL on my resume, I'm far from an expert and lack the desire to become one. Unless I find a good backend storage system, I'm always going to be a slave to either the filesystem or my own concoctions.
One option is to move the backend to CouchDB, but honestly, SQLite has been treating me very well, so I don't mind sticking with it for the time being. Moving my friend's project is also an option, but my job is to make a working project, not experiment. One area where my blog always seems to be lacking is Web 2.0-isms. I don't have any widgets, tagging, or more generally, social features. At this point, my interest is not to participate, but to contribute. My writing has become more important to me this past year and without readership, I'm losing out on important feedback from readers. Making it easy to post articles to popular sites might help in boosting readership and generally get a little more attention. Tagging, on the other hand, just seems semantically cool over the long run.
So, my next post should hopefully include some nifty "Post to Social Website X" widgets as well as a simple tags list. Hopefully it will help improve my readership.