tsf and randstr

I wrote a couple really small tools the other day that I packaged up. I hope someone else finds them useful!

tsf

tsf directs stdin to a timestamped file or directory. For example:

$ curl http://ionrock.org | tsf homepage.html

The output from the cURL request goes into a file like 20150326123412-homepage.html. You can also specify that a directory should be used.

$ curl http://ionrock.org | tsf -d homepage.html

That will create a homepage.html directory with the timestamped files inside it.

Why is this helpful?

I often debug things by writing small scripts that automate repetitive actions. I commonly keep the output from this sort of thing around so I can examine it. Using tsf, it is a little less likely that I’ll overwrite a version that I needed to keep.

It is also helpful when you periodically run a script and need to keep each result in a timestamped file. It does that too.

randstr

randstr is a function that creates a random string.

from randstr import randstr

print(randstr())

randstr provides some globals containing different sets of characters you can pass in to the call in order to get different varieties of random strings. You can also set the length.
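For example, to get a shorter, hex-flavored string you might do something along these lines (the keyword names here are illustrative guesses rather than the package’s documented API):

from randstr import randstr

# Illustrative only: 'length' and 'chars' are placeholder names for
# whatever arguments the package actually accepts.
print(randstr(length=12))
print(randstr(length=8, chars="0123456789abcdef"))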

Why is this helpful?

I’ve written this function a ton of times so I figured it was worth making a package out of it. It is helpful for testing because you can easily create random identifiers. For example:

from randstr import randstr

batch_records = [{'name': randstr()} for i in range(1000)]

I’m sure there are other tools out there that do similar or exactly the same thing, but these are mine and I like them. I hope you do too.

Automate Everything

I’ve found that it has become increasingly difficult to simply hack something together without formally automating the process. Something inside me just can’t handle the idea of repeating steps that could be automated. My solution has been to become faster at the process of formal automation. Most of the steps are small, easy to do and don’t take much time. Rather than feeling guilty that I’m wasting time by writing a small library or script, I work to make the process faster and am able to re-use these scripts and snippets in the future.

A nice side effect is that writing code has become much more fluid. I get more practice using essential libraries and tools, and over time they’ve become second nature. It also helps me get into a flow, because taking the extra steps of writing to files and setting up a small package feels like a warm up of sorts.

One thing that has been difficult is navigating the wealth of options. For example, I’ve gone back to OS X for development. I’ve had to use VMs for running processes and tests. I’ve been playing with Vagrant and Docker. These can be configured with chef, ansible, or puppet in addition to writing a Vagrantfile or Dockerfile. Does chef set up a docker container? Do you call chef-apply in your Dockerfile? On OS X you have to use boot2docker, which seems to be a wrapper around docker machine. Even though I know the process can be configured to be completely automated, it is tough to feel as though you’re doing it right.

Obviously, there is a balance. It can be easy to become caught in a quagmire of automation, especially when you’re trying to automate something that was never intended to be driven programmatically. At some point, even though it hurts, I have to just bear down and type the commands or click the mouse over and over again.

That is, until I break down and start writing some elisp to do it for me ;)

Server Buffer Names in Circe

Circe is an IRC client for Emacs. If you are dying to try out Emacs for your IRC-ing needs, it already comes with two other clients, ERC and rcirc. Both work just fine. Personally, I’ve found circe to be a great mix of helpful features alongside simple configuration.

One thing that always bugged me was the server buffer names. I use an IRC bouncer that keeps me connected to the different IRC networks I use. At work, I connect to each network using a different username over a port forwarded by ssh. The result is that I get three buffers with extremely descriptive names such as localhost:6668<2>. I’d love to have names like *irc-freenode* instead, so here is what I came up with.

First off, I wrote a small function to connect to each network that looks like this:

(defun my-start-ircs ()
  (interactive)
  (start-freenode-irc)
  (start-oftc-irc)
  (start-work-irc)
  ;; Tell circe not to show mode changes as they are pretty noisy
  (circe-set-display-handler "MODE" (lambda (&rest ignored) nil)))

Then, for each IRC server, I make the normal circe call, which returns the server buffer. In order to rename the buffer, I can do the following:

(defun start-freenode-irc ()
  (interactive)
  (with-current-buffer (circe "localhost"
                              :port 6689
                              :nick "elarson"
                              :user "eric_on_freenode"
                              :password (my-irc-pw))
    (rename-buffer "*irc-freenode*")))

Bingo! I get a nice server buffer name. I suspect this could work with ERC and rcirc, but I haven’t tried it. Hope it helps someone else out!

Hello Rackspace!

After quite a while at YouGov, I’ve started a new position at Rackspace working on the Cloud DNS team! Specifically, I’m working on Designate, a DNS as a Service project that is in incubation for OpenStack. I’ve had an interest in OpenStack for a while now, so I feel extremely lucky to have the opportunity to join the community with one of the founding organizations.

One thing that has been interesting is the idea of the Managed Cloud. AWS focuses on Infrastructure as a Service (IaaS). Rackspace also offers IaaS, but takes that a step further by providing support. For example, if you need a DB as a Service, AWS provides services like Redshift or SimpleDB. It is up to the users to figure out ways to optimize queries and tune the database for their specific needs. In the Managed Cloud, you can ask for help and know that an experienced expert will understand what you need to do and help make it happen, even at the application level.

While this support feels expensive, it can be much cheaper than you think when you consider the amount of time developers spend becoming pseudo experts at a huge breadth of technology that doesn’t offer any actual value to a business. Just think of the time spent on sharding a database, maintaining a CI service, learning the latest / greatest container technology, building your own VM images, maintaining your own configuration management systems, etc. Then imagine having a company that will help you set it up and maintain it over time. That is what the managed cloud is all about.

It doesn’t stop at software either. You can mix and match physical hardware with traditional cloud infrastructure as needed. If your database server needs real hardware that is integrated with your cloud storage and cloud servers, Rackspace can do it. If you have strict security compliance requirements that prevent you from using traditional IaaS providers, Rackspace can help there too. If you need to use VMware or Windows as well as OpenStack cloud technologies, Rackspace has you covered.

I just got back from orientation, so I’m still full of the Kool-Aid.

That said, Fanatical Support truly is ingrained in the culture here. It started when the founders were challenged. They hired someone to get a handle on support, and he proposed Fanatical Support. His argument was simple: if we offer a product and don’t support it, we are lying to our customers. The service they are buying is not what they are getting, so don’t be a liar and give users Fanatical Support.

I’m extremely excited to work on great technology at an extremely large scale, but more importantly, I’m ecstatic to be working at a company that ingrains integrity and treats its customers and employees with the utmost respect.

Setting up magit-gerrit in Emacs

I recently started working on OpenStack and, being an avid Emacs user, I hoped to find a more integrated workflow with my editor of choice. Of the options out there, I settled on magit-gerrit.

OpenStack uses git for source control and gerrit for code review. The way code gets merged into OpenStack is through code review and gerrit. In a nutshell, you create a branch, write some code, submit a code review and after that code is reviewed and approved, it is merged upstream. The key is ensuring the code review process is thorough and convenient.

As a developer with a specific environment, it is crucial to be able to quickly download a patch and play around with the code. For example, running the tests locally or playing around with a new endpoint is important when approving a review. Fortunately, magit-gerrit makes this process really easy.

First off, you need to install the git-review tool. This is available via pip.

$ pip install git-review

Next up, you can check out a repo. We’ll use the Designate repo because that is what I’m working on!

$ git clone https://github.com/openstack/designate.git
$ cd designate

With a repo in place, we can start setting up magit-gerrit. Assuming you’ve set up Melpa, you can install it via M-x package-install RET magit-gerrit. Then add this to your Emacs init file:

(require 'magit-gerrit)

The magit-gerrit docs suggest setting two variables.

;; if remote url is not using the default gerrit port and
;; ssh scheme, need to manually set this variable
(setq-default magit-gerrit-ssh-creds "myid@gerrithost.org")

;; if necessary, use an alternative remote instead of 'origin'
(setq-default magit-gerrit-remote "gerrit")

The magit-gerrit package can infer the magit-gerrit-ssh-creds from the magit-gerrit-remote. This makes it easy to configure your repo via a .dir-locals.el file.

((magit-mode
  (magit-gerrit-remote . "ssh://eric@review.openstack.org:29418/openstack/designate")))

Once you have your repo configured, you can open it in magit via M-x magit-status. You should also see a message saying “Detected magit-gerrit-ssh-creds” that shows the credentials used to log in to the gerrit server. These are simple ssh credentials, so if you can’t ssh into the gerrit server using them, you need to adjust your settings accordingly.

If everything is configured correctly, there should be an entry in the status page that lists any reviews for the project. The listing shows the summary of the review. You can navigate to the review and press T to get a list of options. From there, you can download the patchset as a branch or simply view the diff. You can also browse to the review in gerrit. From what I can tell, you can’t comment on a review, but you can give a +/- for a review.

I’ve just started using gerrit and magit-gerrit, so I’m sure there are some features that I don’t fully understand. For example, I’ve yet to understand how to re-run git review in order to update a patch after getting feedback. Assuming that isn’t supported, I’m sure it shouldn’t be too hard to add.

Feel free to ping me if you try it and have questions or tips!

Todo Lists

With a baby on the way, I’ve started reconsidering how I keep track of my TODO list.

Seeing as I spend an inordinate amount of time in Emacs, org-mode is my go-to tool when it comes to keeping track of my life. I can keep notes, track time and customize my environment to support a workflow that works for me. The problem is life happens outside of Emacs. Shocking, I know.

So, my goal is to have a system that integrates well with org-mode and Emacs while still allowing me to use my phone, calendar, etc. when I’m away from my computer. Also, seeing as Emacs doesn’t provide an obvious, effective calendaring solution like GMail does, I want to be sure I’m able to schedule things so I do them at specific times and have reminders.

With that in mind, I started looking at org-mobile. It seemed like the perfect solution. It is basically a way to edit an org file on my phone, and it will (supposedly) sync deadlines and scheduled items to my calendar as one would expect. Sure, the UI could use some work, but having to actually know the date, rather than picking it from a slick dialog on my phone, seemed like a reasonable trade-off...

Unfortunately, it didn’t work. I had one event sync’d to my Google calendar, but that was the end of that. Nothing else was added to my calendar no matter what settings I tried. That is a deal breaker.

I’m currently starting to play with org-trello instead. I’m confident I can make this work for two reasons.

  1. The mobile app is nice
  2. The sync’ing in Emacs is nice

What doesn’t work (yet?) is adding deadlines or scheduled items to my calendar, but seeing as I’m resolving to slow down this new year, I’m going to stop trying to over-optimize and just add stuff to my calendar. It is a true revelation, I know.

One thing I did consider was just skipping the computer altogether and using a physical planner. The problem with a planner for me is:

  1. My handwriting is atrocious
  2. It doesn’t sync to my calendar

In addition to trying to decipher my handwriting, I’d have to develop the habit of always looking at it. I can’t see that happening when I can get my phone to annoy me as needed.

This effort is part of a larger plan to use some of the tactics from GTD to get tasks off my mental plate and put them somewhere useful. So far, it has stuck more than it ever has, so I’m hopeful this could be the time it becomes a real habit. Wish me luck, as I’m sure I’ll need it!

CacheControl 0.11.0 Released

Last night I released CacheControl 0.11.0. The big change this time around is using compressed JSON to store the cache response rather than a pickled dictionary. Using pickle caused some problems if a cache was going to be read by both Python 3.x and Python 2.x clients.
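The general idea looks roughly like the following sketch (a simplified illustration, not CacheControl’s actual serializer):

import json
import zlib

def dump_response(data):
    # Serialize a dict of response data to compressed JSON bytes.
    return zlib.compress(json.dumps(data).encode('utf-8'))

def load_response(blob):
    # Decompress and parse the stored bytes back into a dict.
    return json.loads(zlib.decompress(blob).decode('utf-8'))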

Another benefit is that avoiding pickle also avoids executing potentially dangerous code when deserializing. It is not unreasonable that someone could craft a header that causes problems. This hasn’t happened yet, but it wouldn’t surprise me if it were feasible.

Finally, the size of the cached object should be a little smaller. Generally, responses are not going to be that large, but it should help if you use storage that keeps hot keys in memory; MongoDB comes to mind, along with Memcached and, probably, Redis. It could also be valuable if you have been avoiding caching large objects. For example, a large, sparse CSV response might compress enough to make caching it reasonable.

I haven’t done any conclusive tests regarding the actual size impact of compression, so these are just my theories. Take them with a grain of salt or let me know your experiences.

Huge thanks to Donald Stufft for sending in the compressed JSON patches, as well as all the folks who have submitted other suggestions and pull requests.

The Future

I’ve avoided making any major changes to CacheControl as it has been reasonably stable and flexible. There are some features that others have requested that have been too low on my own personal priorities to take time to implement.

One thing I’ve been tempted to do is add more storage implementations. For example, I started working on a SQLite store. My argument, to myself at least, was that the standard library supports SQLite, which makes it a reasonable target.

I decided to stop that work as it didn’t really seem very helpful. What did seem enticing was the idea of a cache store becoming queryable. Unfortunately, since the cache store API only gets a key and a blob for the value, any cache store author would have to unpack the blob in order to read any values they are interested in.

In the future I’ll be adding a hook system to let a cache store author have access to the requests.Response object in order to create extra arguments for setting the value.

For example, in Redis, you can set an expiration time that the database will use to expire the response automatically. The cache store might then have an extra method that looks like this:

from cachecontrol.cache import BaseCache


class RedisCache(BaseCache):

    def on_cache_set(self, response):
        # Only pass along an expires argument when the response
        # actually has an Expires header.
        if 'expires' not in response.headers:
            return {}

        return {'expires': response.headers['expires']}

    def set(self, key, value, expires=None):
        # Set the value accordingly, honoring the expiration if given.
        pass

I’m not crazy about this API, as it is a little confusing to communicate that creating an on_cache_set hook is really a way to edit the arguments sent to the set method. Maybe calling it a hook is the wrong term. Maybe it should be called prepare and it should explicitly call set. If anyone has thoughts, please let me know!
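For comparison, here is a rough sketch of what that prepare style might look like (purely hypothetical, not an API CacheControl provides today):

from cachecontrol.cache import BaseCache


class RedisCache(BaseCache):

    def prepare(self, key, value, response):
        # Inspect the response and call set() explicitly with any
        # extra arguments, instead of returning kwargs from a hook.
        self.set(key, value, expires=response.headers.get('expires'))

    def set(self, key, value, expires=None):
        # Store the value, letting Redis expire it when requested.
        pass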

The reasoning is that I’d like to remove the Redis cache and start a new project for CacheControl stores that includes common cache store implementations. At the very least, I’d like to find some good implementations that I can link to from the docs to help folks find a path from a local file cache to using a real database when the time comes.

Lastly, there are a couple spec related details that could use some attention that I’ll be looking at in the meantime.

Replacing Monitors

I just read a quick article on Microsoft’s new VR goggles. The idea of layering virtual interfaces on top of the real world seems really cool and even practical in some use cases. What seems really difficult is how an application will understand the infinite number of visual environments in order to effectively and accurately visualize its interfaces. Hopefully the SDK for the device includes a library that provides hooks for different real world elements, like placeOnTop(vObject, x, y, z), where it can recognize some object in the room and make that object available as a platform. In other words, it sees the coffee table and makes it available as an object that you can put something on top of.

The thing is, what I’d love to see is VR replacing monitors! Every year TVs and monitors get upgrades in size and quality, yet the prices rarely drop very radically. Right now I have a laptop and an external monitor that I use at home. I’d love to get rid of both tiny screens and just look into space and see my windows on a huge, virtual, flat surface.

An actual computer would still be required and there wouldn’t necessarily need to be huge changes to the OS at first. The goggles would just be another big screen that would take the rasterized screen and convert it to something seen in the analog world. Sure, this would probably ruin a few eyes at first, but having a huge monitor measured in feet vs. inches is extremely enticing.

DevOps

I finally realized why DevOps is an idea. Up until this point, I felt DevOps was a term for a developer who was also responsible for the infrastructure. In other words, I never associated DevOps with an actual strategy or idea; instead, it was simply something that happened. Well, no more!

DevOps is providing developers keys [1] to operations. In a small organization these keys never have a chance to leave the hands of the small team of developers, who have nothing to concern themselves with except getting things done. As an organization grows, a dedicated person (and soon after, a group of people) emerges to maintain the infrastructure. What happens is that the keys developers had to log into any server or install a new database are taken away and handed to operations to manage. DevOps is a trend where operations and developers share the keys.

Because developers and operations both have access to change the infrastructure, there is a meeting of the minds that has to happen. Developers and Ops are forced to communicate the what, where, when, why and how of changes to the infrastructure. Since folks in DevOps are familiar with code, version control becomes a place of communication and cohesion.

The reason I now understand this paradigm more clearly is that when a developer doesn’t have access to the infrastructure, it is a huge waste of time. When code doesn’t work, we need to be able to debug it. It is important to come up with theories about why things don’t work and iteratively test those theories until we find the reason for the failure. While it is possible to debug problems that only show up in production, it can be slow, frustrating and difficult when access to the infrastructure isn’t available.

I say all this with a huge grain of salt. I’m not a sysadmin and I’ve never been a true sysadmin. I understand the hand-wavy idea that if all developers have ssh keys to the hosts in a datacenter, there are more vectors for attack. What I don’t understand is why a developer with ssh keys is any more dangerous than a sysadmin with ssh keys. Obviously, a sysadmin may have a more stringent outlook on what is acceptable, but at the same time, anyone can be duped. After all, if you trust your developers to write code that writes to your database millions (or billions!) of times a day, I’m sure you can trust them to keep an ssh key safe or avoid exposing services that are meant to remain private.

I’m now a full on fan of DevOps. Developers and Ops working together and applying software engineering techniques to everything they work with seems like a great idea. Providing Developers keys to the infrastructure and pushing Ops to communicate the important security and/or hardware concerns is only a good thing. The more cohesion between Ops and Dev, the better.

[1]I’m not talking about ssh keys, but rather the idea of having “keys to the castle”. One facet of this could be making sure developer keys and accounts are available on the servers, but that is not the only way to give devs access to the infrastructure.

Logging

Yesterday I fixed an issue in dadd where logs from processes were not being correctly sent back to the master. My solution ended up being a rather specific process of opening the file that would contain the logs and ensuring that any subprocesses used that file handle.

Here is the essential code annotated:

# This daemonizes the code. It can accept stdin/stdout parameters
# that I had originally used to capture output. But the file used for
# capturing the output would not be closed or flushed and we'd get
# nothing. After this code finishes we do some cleanup, so my logs were
# empty.
with daemon.DaemonContext(**kw):

    # Just watching for errors. We pass in the path of our log file
    # so we can upload it when there is an error.
    with ErrorHandler(spec, env.logfile) as error_handler:
        configure_environ(spec)

        # We open our logfile as a context manager to ensure it gets
        # closed, and more importantly, flushed and fsync'd to the disk.
        with open(env.logfile, 'w+') as output:

            # Pass the file handle to our worker, which will start some
            # subprocesses whose output we want to capture.
            worker = PythonWorkerProcess(spec, output)

            # printf => print to file... I'm sure this will get
            # renamed in order to avoid confusion...
            printf('Setting up', output)
            worker.setup()
            printf('Starting', output)
            try:
                worker.start()
            except:
                import traceback

                # Print our traceback in our logfile
                printf(traceback.format_exc(), output)

                # Upload our log to our dadd master server
                error_handler.upload_log()

                # Re-raise the exception for our error handler to send
                # me an email.
                raise

            # Wrapping things up
            printf('Finishing', output)
            worker.finish()
            printf('Done', output)

Hopefully, the big question is, “Why not use the logging module?”

When I initially hacked the code, I just used print and had planned on letting the daemon library capture logs. That would make it easy for the subprocesses (scripts written by anyone) to get logs. Things were out of order though, and by the time the logs were meant to be sent, the code had already cleaned up the environment where the subprocesses had run, including deleting the log file.

My next step then was to use the logging module.

Logging is Complicated!

I’m sure it is not the intent of the logging module to be extremely complex, but the fact is, the management of handlers, loggers and levels across a wide array of libraries and application code gets unwieldy fast. I’m not sure people run into this complexity that often, as it is easy to use the basicConfig and be done with it. As an application scales, logging becomes more complicated and, in my experience, you either explicitly log to syslog (via the syslog module) or to stdout, where some other process manager handles the logs.

But, in the case where you do use logging, it is important to understand some essential complexity you simply can’t ignore.

First off, configuring loggers needs to be done early in the application; when I say early, I’m talking about import time. The reason is that libraries, which should try to log intelligently, may configure the logging system as soon as they are imported, before your own configuration runs.

Secondly, the propagation between the different loggers needs to be explicit, and again, some libraries and frameworks are going to do it wrong. By “wrong”, I mean that the assumptions the library author makes don’t align with your application. In dadd, I’m using Flask. Flask comes with a handy app.logger object that you can use to write to the log. It also has a specific formatter that makes messages really loud in the logs. Unfortunately, I couldn’t use this logger because I needed to reconfigure the logs for a daemon process, and that daemon process lives in the same repo as my main Flask application. If my daemon logging code gets loaded, which is almost certain to happen, it reconfigures the logging module, including Flask’s handy app.logger object. It was frustrating to test logging in my daemon process only to find that my Flask logs had disappeared. When I got them back, things showed up multiple times because different handlers had been attached that used the same output, which leads me to my next gripe.

The logging module is opaque. It would be extremely helpful to be able to inject a pprint(logging.current_config) at some point in your code and see the effective configuration at that point. That way, you could intelligently update the config with tools like logging.config.dictConfig, either by editing the current config or by using the incremental and disable_existing_loggers options correctly.
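Nothing like that exists in the standard library as far as I know, but you can get a rough picture by walking the registry the logging module keeps internally. A quick sketch:

import logging

def dump_logging_config():
    # Print every known logger along with its level, propagation flag
    # and attached handlers to get a picture of the current setup.
    names = [''] + sorted(logging.Logger.manager.loggerDict)
    for name in names:
        logger = logging.getLogger(name)
        print('%s: level=%s propagate=%s handlers=%r' % (
            logger.name,
            logging.getLevelName(logger.level),
            logger.propagate,
            logger.handlers,
        ))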

Logging is Great

I’d like to make it clear that I’m a fan of the logging module. It is extremely helpful as it makes logging reliable and can be used in a multithreaded / multiprocessing environment. You don’t have to worry about explicitly flushing the buffer or fsync’ing the file handle. You have an easy way to configure the output. There are excellent handlers that help you log intelligently, such as the RotatingFileHandler, WatchedFileHandler and SysLogHandler. Many libraries also allow turning up the log level to see more deeply into what they are doing. Requests and urllib3 do a pretty decent job of this.

The problem is that controlling output is a different problem than controlling logging, yet the two are intertwined. If you find it difficult to add some sort of output control to your application and the logging module seems to be causing more problems than it is solving, then don’t use it! The technical debt you take on with a small, customized output control system is extremely low compared to the hoops you might need to jump through in order to mold logging to your needs.

With that said, learning the logging module is extremely important. Django provides a really easy way to configure logging, and you can be certain it gets loaded early enough in the process that you can rely on it. Flask and CherryPy (and I’m sure others) provide hooks into their own loggers that are extremely helpful. Finally, basicConfig is a great tool to get started with logging in standalone scripts that need to differentiate between DEBUG and INFO statements. Just remember: if things get tough and you feel like you’re battling logging, you may have hit the edges of its valid use cases, and it is time to consider another strategy. There is no shame in it!
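For that standalone script case, a couple of lines is usually all it takes:

import logging

logging.basicConfig(
    level=logging.DEBUG,  # switch to logging.INFO to quiet the output
    format='%(asctime)s %(name)s %(levelname)s %(message)s',
)

log = logging.getLogger(__name__)
log.debug('details you only want while debugging')
log.info('the normal progress messages')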