
Data Science

Building a model registry in GitHub

Data science projects can be messy. We try something here, only to find that it didn't really help. Then we try something else, and maybe it improves performance a little. Most of the code we write in these projects needs to run once and can be forgotten afterwards. As long as we know the general outcome of an approach (e.g. a trained model), we don't need to maintain the code itself. The only exception: we occasionally need to re-train a model once new data has been collected.

Production software systems are quite different from this. New requirements usually mean that we need to be able to change the code in one place without affecting other parts of the system. Running machine learning models in production therefore requires a fairly clear separation between the data science world of experimentation and the rigid world of maintainable production code. This separation can be achieved by treating machine learning models as more or less abstract, exchangeable artifacts, while the backend that runs them operates only on these abstractions.
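To make the idea concrete, here is a minimal sketch (not taken from the post) of what such an abstraction boundary could look like in Python. The names Model, load_model and MODEL_DIR are illustrative assumptions; the point is only that the backend depends on a small interface and a versioned artifact, never on the training code itself.

```python
# Sketch of "models as exchangeable artifacts"; names are hypothetical.
from __future__ import annotations

import pickle
from pathlib import Path
from typing import Protocol, Sequence


class Model(Protocol):
    """The only contract the backend relies on: a predict method."""

    def predict(self, features: Sequence[Sequence[float]]) -> Sequence[float]: ...


# Directory holding versioned artifacts, e.g. filled from a tagged release.
MODEL_DIR = Path("models")


def load_model(name: str, version: str) -> Model:
    """Load a pickled model artifact by name and version.

    The backend never imports the data science code; it only unpickles
    the artifact and talks to it through the Model protocol.
    """
    artifact = MODEL_DIR / f"{name}-{version}.pkl"
    with artifact.open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Assumes the artifact file exists locally.
    model = load_model("churn-classifier", "1.2.0")
    print(model.predict([[0.1, 0.2, 0.3]]))
```

In a GitHub-based registry, MODEL_DIR could be populated by downloading a tagged release asset during deployment, so swapping models would mean nothing more than pointing at a different tag.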

Version control for data science projects

Data science projects have a lot of moving parts: code, data, configuration, hypotheses, modeling assumptions, ... All of these change on different time scales and with different implications for what, and how much, needs to be stored. As a result, data science projects often descend into some sort of chaos. This doesn't have to be the case!

Data validation for CSV files

For small to medium-sized data science projects, data still often live in CSV files. In fact, I frequently find myself manually editing data in such files, in particular during early data exploration phases.

Unfortunately, CSV files are notorious when it comes to data validity. While data in a database have usually been modelled explicitly in some way (and may still be messy), a CSV file can simply be typed in by hand, with no restriction whatsoever on whether its content can be parsed at all, or whether it parses to the expected datatype. As a result, I often end up with date-time or numeric columns that get parsed as object. Finding the one offending item can be a nightmare. Here, I describe how we can fix this.
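As a quick illustration of that hunt (a sketch with assumed file and column names, not the post's actual solution), pandas' errors="coerce" makes it easy to isolate the cells that refuse to parse:

```python
# Find the rows that keep a numeric column stuck at dtype "object".
import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical file

# Coerce the column to numeric; unparseable values become NaN.
parsed = pd.to_numeric(df["temperature"], errors="coerce")

# The culprits: rows where coercion failed but the original cell was not empty.
bad_rows = df[parsed.isna() & df["temperature"].notna()]
print(bad_rows)

# The same trick works for date-times via pd.to_datetime(..., errors="coerce").
```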

"What's your favourite machine learning algorithm?"

According to my friend Kyle Becker, this interview question seems to throw off quite a few candidates. Honestly, it threw me off as well: I don't think I have a "favourite machine learning algorithm"; the best solution, I typically feel, is largely dictated by the problem at hand. However, thinking a little more about the question, I must admit that there are some ideas I find very elegant and that have certainly had a big impact on the way I think about data.