With this post, I'm switching to a new blog layout based on Material for MkDocs rather than the old layout based on Pelican. I tried to keep the mostly white layout from the previous blog, but still went with a generally more modern look and feel. In addition, the new and shiny blog provides a couple of new features that I feel are improvements.
It is somewhat clear that software releases should carry a version like v3 or v1.1 or something. But this often seems more or less arbitrary, and you might feel that just counting git-commits or adding the current date would do equally well. After all, these would be easy to create automatically and once set up, you would never have to think about versioning again.
In fact, I used to think that way, but I don't anymore. What's more, I now believe that versions are a key ingredient in decoupling different software components, at least if the versions are done right.
Data science projects can be messy. We try out something here, just to find out that it didn't really help. Then we try out another thing, and maybe it does improve performance a little bit. Most of the code that we write in these projects needs to run once and can be forgotten afterwards. As long as we know what the general outcome (e.g. a trained model) of this approach was, we don't need to actually maintain it. The only exception is that we occasionally need to re-train a model as new data has been collected.
Production software systems are quite different from this. New requirements usually mean that we need to be able to change the code in one place without affecting other parts of the system. Running machine learning models in production therefore requires a fairly clear separation between the data science world of experimentation and the rigid world of maintainable production code. This separation can be achieved by treating machine learning models as—more or less—abstract, exchangeable artifacts, while the backend, on which these models run, only operates on these abstractions.
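To make the idea of an exchangeable artifact a bit more concrete, here is a minimal sketch in Python. The names `Model`, `MeanModel`, `MaxModel` and `handle_request` are made up for illustration and are not taken from any particular project; the only point is that the backend code depends on the abstraction and not on how a model was trained.

```python
from typing import Protocol, Sequence


class Model(Protocol):
    """Hypothetical interface: the only thing the backend knows about a model."""

    def predict(self, features: Sequence[float]) -> float: ...


class MeanModel:
    """Stand-in for a trained artifact; a real one would be loaded from storage."""

    def predict(self, features: Sequence[float]) -> float:
        return sum(features) / len(features)


class MaxModel:
    """A second stand-in, to show that models can be swapped freely."""

    def predict(self, features: Sequence[float]) -> float:
        return max(features)


def handle_request(model: Model, features: Sequence[float]) -> dict:
    """Backend code: works with any object that satisfies the Model interface."""
    return {"prediction": model.predict(features)}


first = handle_request(MeanModel(), [1.0, 2.0, 3.0])
second = handle_request(MaxModel(), [1.0, 2.0, 3.0])
```

Swapping `MeanModel` for `MaxModel` requires no change to `handle_request`, which is the kind of separation described above.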
Data science projects have a lot of moving parts—code, data, configuration, hypotheses, modeling assumptions, ... All of these change at different time scales and with different impact on what and how much needs to be stored. As a result, data science projects often diverge into some sort of chaos. This doesn't have to be the case!
For small to medium size data science projects, data are still often in csv files. In fact, I also often find myself manually editing data in such files, in particular during early data exploration phases.
Unfortunately, csv files are notorious when it comes to model validity.
While data in a database have usually been explicitly modelled in some way (and may still be messy), csv files can just be typed in by hand. There is no restriction whatsoever on whether the content can be parsed, or on whether it parses to a given datatype. As a result, I often end up with date-times or numbers that get parsed as object. Finding the one item that caused this can be a nightmare.
Here, I describe how we can fix this.
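As a rough illustration of the problem (not necessarily the approach taken in the post itself), the sketch below uses only Python's standard library to locate the cells in a column that fail to parse. The helper name `find_unparseable` and the sample data are made up for this example.

```python
import csv
import io
from datetime import datetime


def find_unparseable(lines, column, parse):
    """Return (line_number, raw_value) for every cell in `column` that `parse` rejects."""
    reader = csv.DictReader(lines)
    bad = []
    for row in reader:
        raw = row[column]
        try:
            parse(raw)
        except (ValueError, TypeError):
            # reader.line_num is the number of source lines read so far,
            # i.e. the line on which the current row ends.
            bad.append((reader.line_num, raw))
    return bad


# Made-up data: one price uses a decimal comma, one date has month 13.
SAMPLE = """date,price
2021-01-01,1.5
2021-01-02,"2,5"
2021-13-03,3.0
"""

bad_prices = find_unparseable(io.StringIO(SAMPLE), "price", float)
bad_dates = find_unparseable(io.StringIO(SAMPLE), "date", datetime.fromisoformat)
print(bad_prices)
print(bad_dates)
```

A single entry like the decimal comma or the impossible month is enough to make a whole column fall back to a generic type, so reporting the line number of each offender is usually the quickest way to find it.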
Almost no startup sets out with the right idea from the beginning. It is quite common to change almost everything about what the company's product should do within a few weeks. From a technology perspective, this is quite challenging: We are used to thinking about creating software that is meant to last, but a startup environment likely means that things won't last very long. Yet, adopting a careless attitude towards software quality can be equally fatal, often resulting in a gridlock where nothing can move forward or backward.
In an ancient time, I wrote code in C++ (see here for a now dead project I was involved in). Since then, Python has clearly dominated my programming and I came to really love it: Python is concise, readable and easy to learn (and teach). And I really liked Python's duck typing approach, which basically meant that you never really had to worry about types. Then, over the past maybe 5 years, a number of different people kept praising the elegance and power of Scala, and they often highlighted one particular strength of Scala: static types. What!? Wasn't that something that I had never liked about C++ and that I was glad to leave behind?
In the past couple of weeks, I made it a point to do a number of coding katas in all three of these languages. Not surprisingly, all three of them have their advantages.
The other day, I had lunch with my friend Ori Barbut and it turns out that we both have a passion for a fairly mouse-free computer experience. However, the centerpieces of our interaction with a computer are quite different: Ori uses emacs and I use vim. Ori is certainly interested in vim and I have used emacs quite extensively in the past, so much of the conversation circled around the why-and-why-not of the editors we are using.
I couldn't imagine writing any text on a computer without vim. Yet, the conversation with Ori made me think. Ori's environment can effortlessly do all kinds of things that sound really interesting.