Data validation for csv files

For small to medium-sized data science projects, data often still live in csv files. In fact, I often find myself manually editing data in such files, in particular during early data exploration phases.

Unfortunately, csv files are notorious when it comes to data validity. While data in a database have usually been explicitly modelled in some way (and may still be messy), csv files can simply be typed in by hand, and there is no restriction whatsoever on whether the content can be parsed at all, or whether it parses to the intended datatype. As a result, I often find that date-time or number columns get parsed as dtype object. Finding the one item that caused this can be a nightmare.
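
To make this concrete, here is a minimal illustration with pandas (the column name and values are made up):

import io
import pandas as pd

# One mistyped cell ("4S" instead of "45") is enough to silently turn
# the whole column into dtype "object" instead of int64:
raw = io.StringIO("username,n_logins\nchris93,4S\nlisagg,35\n")
df = pd.read_csv(raw)
print(df["n_logins"].dtype)  # object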

Data linting to the rescue

With code, we use linters (e.g. flake8) and static analysis (e.g. mypy) to validate our code before we even compile or run it. Can we do something similar with csv files? Can we run linting on csv files?

I wrote a small tool called csvmodel that validates a csv file similarly to how flake8 validates style in python files. The idea is simple: We first model our data by spelling out what we expect each record to look like, and then we check whether the data in our csv file matches that data model. If a line does not match, csvmodel will spit out the line number and a brief description of what went wrong with that record. This way, we immediately know which lines violate our data model, and we don't have to search through a 1000-line (or more) csv file to spot the errors.

Sound cool? Let's see it in action.

Getting started with csvmodel

Let's first install csvmodel from pypi (don't actually type the $; that's just to indicate your command prompt):

  $ pip install csvmodel

We now need to specify our data model. For that, csvmodel supports two kinds of data models:

  1. You can specify a jsonschema document that describes your data (see the sketch right after this list).
  2. You can use a pydantic model to specify your schema.
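
For a taste of the first option, here is a rough sketch of a jsonschema document for a user record, written as a Python dict and checked with the jsonschema package (the schema below is an illustration of jsonschema itself, not the exact form csvmodel requires):

from jsonschema import FormatChecker, validate

user_schema = {
    "type": "object",
    "properties": {
        "username": {"type": "string"},
        "registered_on": {"type": "string", "format": "date"},
        "number_of_logins_since_registered": {"type": "integer"},
        "is_premium": {"type": "boolean"},
    },
    "required": ["username", "registered_on"],
}

# "format": "date" is only enforced if a format checker is passed:
validate(
    {"username": "lisagg", "registered_on": "2022-01-01"},
    user_schema,
    format_checker=FormatChecker(),
)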

Both approaches have their advantages and disadvantages. If your main language is python, you may be familiar with python's dataclasses or even with pydantic itself. Pydantic provides a drop-in replacement for standard dataclasses (pydantic.dataclasses), but it is recommended to derive classes from pydantic.BaseModel instead. So let's write a simple data model:

from typing import Optional
from datetime import date
from pydantic import BaseModel


class DataModel(BaseModel):
    username: str
    registered_on: date
    number_of_logins_since_registered: int
    is_premium: bool
    premium_since: Optional[date]
    number_of_logins_since_premium: Optional[int]
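
Before wiring this up with csvmodel, we can play with the model directly to get a feel for pydantic's type coercion (this snippet is purely illustrative and not needed for csvmodel):

record = DataModel(
    username="lisagg",
    registered_on="2022-01-01",              # parsed into a datetime.date
    number_of_logins_since_registered="35",  # coerced to int
    is_premium="true",                       # coerced to bool
    premium_since="2022-03-05",
    number_of_logins_since_premium="22",
)
print(record.registered_on)  # 2022-01-01, now a datetime.date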

We will store this model in a python file, let's say datamodel.py. Next, we need to tell csvmodel where the file is located. To do that, we create a config file (csvmodel.ini by default). In the simplest case, where we only have a single csv file corresponding to a single data model, the config file could look like this:

[csvmodel]
validator = pydantic
schema = file:datamodel.py:DataModel

Note how the schema contains both the file and the class name inside that file (there are other ways to specify the data model, but we'll stick with this one for now).

Let's create a small data file (we'll call it data.csv):

username,registered_on,number_of_logins_since_registered,is_premium,premium_since,number_of_logins_since_premium
chris93,2021-11-11,45,false
lisagg,2022-01-01,35,true,2022-03-05,22
tyson2000,2022-03-04,false,2022-04-21,-3
claire,2021-12-01,20,true,24.5.2022,13
madmax,2022-02-28,34,true,2022-03-21,2

Although this is a tiny data file, it is already difficult to spot all the issues with it. So let's run csvmodel on it.
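
Like flake8, csvmodel should be pointed at the file (or files) to check, and it picks up csvmodel.ini from the current directory on its own:

  $ csvmodel data.csv

This gives us the following output: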

data.csv:4: Issue in column number_of_logins_since_registered: value is not a valid integer
data.csv:4: Issue in column is_premium: value could not be parsed to a boolean
data.csv:5: Issue in column premium_since: invalid date format

Each line of the output starts with the name of the file tested, followed by a colon, the number of the offending line in that file, and finally an error message. Note that these error messages should be taken with a grain of salt: csvmodel simply takes the values in a line and assumes that each one belongs to the corresponding column from the header. That means that forgetting a column can result in pretty confusing error messages. However, the line numbers will still be right.

Taking a closer look at line 4 (tyson2000), we find that we actually forgot to enter the "number of logins since registered". Adding a number there (say 33) fixes the issues with that line and leaves us with only one error message, for line 5 (claire). Inspecting that line, we find that the premium_since column contains a date in an invalid format (24.5.2022; pandas would have parsed the whole column as object without complaining). We assume that someone entered the date as day.month.year, so we rewrite it as "2022-05-24" and csvmodel is happy.

Slightly more advanced data linting

Are we happy? Our data file looks like this now:

username,registered_on,number_of_logins_since_registered,is_premium,premium_since,number_of_logins_since_premium
chris93,2021-11-11,45,false
lisagg,2022-01-01,35,true,2022-03-05,22
tyson2000,2022-03-04,33,false,2022-04-21,-3
claire,2021-12-01,20,true,2022-05-24,13
madmax,2022-02-28,34,true,2022-03-21,2

In fact, line 4 is still not quite right: tyson2000 has is_premium=False but has supposedly been a premium user since April 21, 2022, with -3 logins since then. This record perfectly matches our data model, but our data model does not properly model the data. Two constraints are missing:

  1. The number of logins since a certain date can never be negative.
  2. If is_premium=False, then the premium-related columns (premium_since and number_of_logins_since_premium) should be empty.

Let's add those constraints to our data model! Pydantic handles this kind of check through validators. For the first constraint, we can simply use a regular pydantic validator like this:

from pydantic import validator  # in addition to the imports we already have


class DataModel(BaseModel):
    ...

    @validator('number_of_logins_since_registered')
    def n_logins_since_reg_is_positive(cls, v):
        if v < 0:
            raise ValueError('should be positive or 0')
        return v

    @validator('number_of_logins_since_premium')
    def n_logins_since_premium_is_positive(cls, v):
        if v < 0:
            raise ValueError('should be positive or 0')
        return v
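
With these validators in place, we can test the behaviour in a plain python session, independently of csvmodel (a quick ad-hoc check):

from pydantic import ValidationError

try:
    DataModel(
        username="tyson2000",
        registered_on="2022-03-04",
        number_of_logins_since_registered=-3,
        is_premium=False,
    )
except ValidationError as err:
    print(err)  # complains that the login count "should be positive or 0"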

For the second kind of constraint, we need a root_validator, according to the pydantic documentation. We'll add two separate root validators: one for the relationship between is_premium and premium_since, and one for the relationship between is_premium and number_of_logins_since_premium. Since our two field validators perform exactly the same check, we also factor it out into a small is_positive helper. Taking everything together, we then have the following in our datamodel.py:

from typing import Optional
from datetime import date
from pydantic import BaseModel, validator, root_validator


def is_positive(v):
    if v < 0:
        raise ValueError('should be positive or 0')
    return v


class DataModel(BaseModel):
    username: str
    registered_on: date
    number_of_logins_since_registered: int
    is_premium: bool
    premium_since: Optional[date]
    number_of_logins_since_premium: Optional[int]

    @validator('number_of_logins_since_registered')
    def n_logins_since_reg_is_positive(cls, v):
        return is_positive(v)

    @validator('number_of_logins_since_premium')
    def n_logins_since_premium_is_positive(cls, v):
        return is_positive(v)

    @root_validator
    def if_premium_have_date_else_not(cls, values):
        is_premium = values.get('is_premium')
        premium_since = values.get('premium_since', None)
        if is_premium and premium_since is None:
            raise ValueError('If is_premium=True, premium_since should not be empty')
        if is_premium is False and premium_since is not None:
            raise ValueError('If is_premium=False, premium_since should be empty.')
        return values

    @root_validator
    def if_premium_have_logins_else_not(cls, values):
        is_premium = values.get('is_premium')
        n_logins_name = 'number_of_logins_since_premium'
        n_logins = values.get(n_logins_name, None)
        if is_premium and n_logins is None:
            raise ValueError(f'If is_premium=True, {n_logins_name} should not be empty')
        if is_premium is False and n_logins is not None:
            raise ValueError(f'If is_premium=False, {n_logins_name} should be empty.')
        return values
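
As before, we can poke at the finished model directly; feeding it the record from line 4 triggers both the field validator and the first root validator (again just an ad-hoc check outside of csvmodel):

from pydantic import ValidationError

try:
    DataModel(
        username="tyson2000",
        registered_on="2022-03-04",
        number_of_logins_since_registered=33,
        is_premium=False,
        premium_since="2022-04-21",
        number_of_logins_since_premium=-3,
    )
except ValidationError as err:
    print(err)  # reports both the -3 login count and the premium_since mismatch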

Yes, there is a lot of validation code now. However, if you read through it, it should be pretty straightforward to follow. And let's be honest: these are quite advanced constraints!

Re-running csvmodel with this more detailed data model actually finds two more issues with line 4:

data.csv:4: Issue in column number_of_logins_since_premium: should be positive or 0
data.csv:4: Issue in column __root__: If is_premium=False, premium_since should be empty.

Well, we knew that it would find those, because this record is what prompted us to add the extended validation in the first place. But imagine we had just stumbled over this issue in one record of a 1000-line csv file. We would now immediately know all the records that share it.

Getting it in your editor

Most people don't run flake8 by hand (except inside a continuous integration / continuous delivery, CI/CD, setup); instead, they have their editor warn them about issues as they type. With csvmodel, the same is easy to set up for csv files. I wrote a simple vim plugin that registers csvmodel as a syntax checker for csv files to be used with syntastic. To use it, just add

Plug 'igordertigor/csvmodel-syntastic'

(or the corresponding line for your vim plugin manager) to your .vimrc or your .config/nvim/init.vim, and csvmodel will automatically check any csv file that you edit with vim.

Please let me know if you write a similar plugin for another editor (or for vim-ale for that matter).