Data Mining Workflows with Databases

Machine learning is the study of algorithms that mine large databases for insight. It is therefore surprising how much machine learning research is conducted without using databases.

Any machine learning practitioner, and indeed anyone who has spent an afternoon munging CSV files, can appreciate the utility of storing data in a structured database. Why, then, is most machine learning research conducted with scripts that generate CSV files and plots?

I recently worked on an entirely self-contained ML project where the goal was to take some text documents, build models to classify that text, evaluate those models, and draw some conclusions.

As a graduate student trained in Mathematics and Computer Science, I was not well versed in enterprise Relational Database Management Systems (RDBMS) and the Structured Query Language (SQL). I knew that SQL is a query language for describing operations on data, but thought that surely one would rather do their work in a real programming language like Python.

So I wrote code that emitted text, used awk, grep, and sed to gather that output into CSV files, ingested those CSV files with Pandas, and wrote my data analysis scripts against the resulting data frames.

Having graduated and worked on a variety of software projects, including web applications, I have become a much more proficient RDBMS and SQL user, and this has markedly improved the efficiency of my data mining research.

Why should I use a data mining database?

If you are running a business or a scientific project and are not storing your data in an RDBMS, then you are probably doing it wrong. No matter how complex your data seems, if you cannot convert it to rows and columns, then you don't fully understand your data.

Since the data starts out in a database, it makes sense that the analytical results would live with the data in the same database [1]; call this the data mining database (DMDB). Once you have paid the initial cost of setting up a database and designing a schema, you will find that the marginal cost of using it is small. In fact, since you end up writing less procedural code and spending less time managing many data files, you will often find that the marginal cost of using a database is negative.

What should I store in my DMDB?

Once you have made a home for your analytical results in your database [2], you must consider the schema you are going to use. Start with the underlying records. These come from another schema in the database, so all you need to do is give every record that you want to include in the analysis a new primary key that is consistent within the analysis. The joincolumn field identifies the original records, which hold the structured data.

Records and Features

In any data mining project, we start with the data. Here we will refer to a single piece of data (an image, document, message, or user) as a record.

  CREATE TABLE IF NOT EXISTS records (
    id INTEGER NOT NULL PRIMARY KEY,
    name TEXT NOT NULL,
    joincolumn INTEGER NOT NULL
  );

  CREATE INDEX RecordsNameIdx ON records(name ASC);
To access the records, you would run a query:

  SELECT records.id, field1, field2, field3
  FROM datatable JOIN records
    ON datatable.joincolumn = records.joincolumn;

This allows you to select the columns that you want and derive any additional columns as you pull the data. For example, if you wanted to build a polynomial feature for feature engineering, you could do that in the database:

  SELECT records.id, field1, field2, field3,
         (field1 * field3) + (field1 * field2) AS poly
  FROM datatable JOIN records
    ON datatable.joincolumn = records.joincolumn;

If features are extracted this way, then they are easily shareable and repeatable since anyone with the same database schema can execute your query to extract the same features.

Not all features can be easily computed in SQL, so you will probably want a table that lets you store features derived from some offline process like NLP or image processing.

For sparse features, you can use a table that stores both the row and the column identifier.

  CREATE TABLE IF NOT EXISTS features (
    id INTEGER PRIMARY KEY NOT NULL,
    record INTEGER NOT NULL,
    -- "column" is a reserved word in SQL, so we use "col"
    col INTEGER NOT NULL,
    value FLOAT NOT NULL,
    FOREIGN KEY (record) REFERENCES records(id)
  );
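
As a minimal sketch of how this table is used (assuming a record with id 1 already exists, and that the column indices come from something like a vocabulary built offline), storing and reassembling a sparse vector looks like:

  -- store two non-zero entries of record 1's feature vector
  INSERT INTO features (record, col, value) VALUES (1, 7, 0.25);
  INSERT INTO features (record, col, value) VALUES (1, 42, 1.0);

  -- reassemble the vector; absent columns are implicitly zero
  SELECT col, value FROM features WHERE record = 1 ORDER BY col;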

For dense features such as those used in image processing, you would want to use an alternative structure such as the following:

  CREATE TABLE IF NOT EXISTS media (
    id     INTEGER NOT NULL PRIMARY KEY,
    record INTEGER NOT NULL,
    name   TEXT    NOT NULL,
    -- so we know how to read it
    mime   TEXT    NOT NULL,
    -- exactly one of blob and url should be set
    blob   BLOB,
    url    TEXT,
    FOREIGN KEY (record) REFERENCES records(id)
  );

The mime field is important because it tells the application how to read the data stored in this table. You can use a custom subtype such as application/x-mytype to define types that your own software knows how to read.
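
For example, a hypothetical row storing a small dense vector that our own software serializes (the name, mime type, and bytes here are all illustrative):

  -- a dense feature vector serialized by our own code
  INSERT INTO media (record, name, mime, blob, url)
  VALUES (1, 'hog-descriptor', 'application/x-float32-array',
          X'0000803F0000003F', NULL);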

For medium-sized dense features, the BLOB field is a good tradeoff between convenience and performance. If you are using an advanced database such as PostgreSQL, you may also have native array types available for this purpose.

For large feature vectors such as images, it makes sense to store the data outside of the database. The url field can store a reference to the data so that any application can fetch it with a simple HTTP GET or FTP download.

This brings us to our first foreign key: the features refer back to a record [3]. Every feature vector in our analysis refers back to the record that it represents. A foreign key constraint like FOREIGN KEY (record) REFERENCES records(id) tells the database that every entry in the record column of the features table must match some entry in the id column of the records table. This lets the database ensure that we do not store features for records that don't exist [4].
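
One caveat: in SQLite, foreign key enforcement is off by default and must be enabled per connection, so a quick sanity check looks like:

  PRAGMA foreign_keys = ON;

  -- fails with "FOREIGN KEY constraint failed"
  -- because no record with id 999 exists
  INSERT INTO features (record, col, value) VALUES (999, 0, 1.0);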

Labels

Every machine learning problem requires labels of some kind. These can be class labels for a supervised classification problem or continuous targets for regression. Even unsupervised problems will usually have labels, such as a cluster id. The following tables store the label data along with some common information that is useful for a classification problem.

An integer label with a floating point proba field works well for binary classification, so that you can use models that produce a value to compare against a threshold. You can also use the proba field for a regression target variable.

Since we use the same table for input labels provided by the problem and output labels generated by algorithms, every record could have multiple labels.

A fold is a set of labels. For example, when we are comparing multiple algorithms on the same problem, we will produce a label for each record for each algorithm.

     CREATE TABLE IF NOT EXISTS descriptions (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        description TEXT NOT NULL
     );

     CREATE TABLE IF NOT EXISTS folds (
        id INTEGER PRIMARY KEY,
        training BOOL,
        testing BOOL,
        validation BOOL,
        generated BOOL,
        -- "desc" is a reserved word in SQL, so we spell it out
        description INTEGER NOT NULL,
        created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (description) REFERENCES descriptions(id)
     );
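
As a sketch of how these tables are populated (all ids and names here are illustrative), we might record a ground-truth fold and a generated fold like so:

     -- describe the folds, then create them
     INSERT INTO descriptions (id, name, description)
     VALUES (10, 'ground-truth', 'labels provided with the dataset'),
            (11, 'svm-run-1',    'labels predicted by an SVM');

     INSERT INTO folds (training, testing, validation, generated, description)
     VALUES (1, 0, 0, 0, 10),  -- the provided training labels
            (0, 1, 0, 1, 11);  -- predictions on the test set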


     CREATE TABLE IF NOT EXISTS labels (
        id     INTEGER PRIMARY KEY,
        record INTEGER NOT NULL,
        fold   INTEGER NOT NULL,
        label  INTEGER NOT NULL,
        proba  FLOAT NOT NULL,
        FOREIGN KEY (record) REFERENCES records(id),
        FOREIGN KEY (fold) REFERENCES folds(id),
        FOREIGN KEY (label) REFERENCES descriptions(id)
     );
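
With ground truth and predictions stored as folds in the same table, evaluation reduces to a self-join. A sketch of computing accuracy, assuming fold 1 holds the provided labels and fold 2 holds a model's predictions:

     SELECT AVG(CASE WHEN truth.label = pred.label
                     THEN 1.0 ELSE 0.0 END) AS accuracy
     FROM labels AS truth
     JOIN labels AS pred ON pred.record = truth.record
     WHERE truth.fold = 1 AND pred.fold = 2;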

This illustrates a common idiom in database design. Often there will be a small set of potentially long strings that many items reference. In this case we have the labels for the data, which are often expressed in terms of their domain meaning, such as healthy vs. sick, legit vs. illegit, or hot dog vs. not hot dog. We handle this by using an integer to represent the label and creating a second table which contains the mapping from integer to string. Thus the descriptions table looks something like:

| id | name        | description                      |
|----+-------------+----------------------------------|
|  1 | healthy     | The person is healthy            |
|  2 | sick        | The person has a disease         |
|  3 | legit       | This customer is not a scammer   |
|  4 | illegit     | This customer is a scammer       |
|  5 | hot dog     | Sausage between buns             |
|  6 | not hot dog | Literally anything but a hot dog |
|  7 | dachshund   | the intersection of 5 and 6      |

This way you have both a short and a long name for each label without repeating them for each record. Folds also need these descriptions. Some databases break the descriptions up into separate tables based on what they describe; here we have chosen to put all descriptions in the same table.
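
To recover the human-readable names, you simply join back to the descriptions table. For example, to list each labeled record with its label name:

  SELECT labels.record, descriptions.name, labels.proba
  FROM labels
  JOIN descriptions ON labels.label = descriptions.id;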

Models

The final piece of our puzzle is storing the trained models for evaluation. When we train a model, we want to enter it into the database along with its relations to the folds.

     CREATE TABLE IF NOT EXISTS models (
       id          INTEGER PRIMARY KEY NOT NULL,
       name        TEXT NOT NULL,
       description TEXT NOT NULL,
       created_at  timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
       blob        BLOB NOT NULL
     );

     CREATE TABLE IF NOT EXISTS modelfolds (
       id INTEGER PRIMARY KEY NOT NULL,
       model INTEGER NOT NULL,
       fold INTEGER NOT NULL,
       relation INTEGER NOT NULL,
       FOREIGN KEY (model) REFERENCES models(id),
       FOREIGN KEY (fold) REFERENCES folds(id),
       FOREIGN KEY (relation) REFERENCES descriptions(id)
     );

The modelfolds table is a join table, which means that its primary purpose is to connect two tables that can be in a many-to-many relationship. Since a model can be trained on one fold and applied to potentially many other folds, we use a join table [5].
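
As a sketch (the relation descriptions and ids here are illustrative), recording that model 1 was trained on fold 1 and tested on fold 2 looks like:

     -- hypothetical relation descriptions
     INSERT INTO descriptions (id, name, description)
     VALUES (20, 'trained-on', 'the model was fit on this fold'),
            (21, 'tested-on',  'the model was evaluated on this fold');

     INSERT INTO modelfolds (model, fold, relation)
     VALUES (1, 1, 20),  -- model 1 trained on fold 1
            (1, 2, 21);  -- model 1 tested on fold 2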

Provenance Information

This section forms the basis of a second post on tracking data provenance. It is too much material to cover in one post, so we are breaking it up: the second post will introduce these provenance tables and modify the existing tables to refer to them.

The provenance information is also harder to record, because you have to use some specialized software to insert things into the DB instead of just inserting into the tables as you go. But it does allow for multi-tenancy on the same database, which is important for larger projects.

  CREATE TABLE IF NOT EXISTS experiments (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL,
    description TEXT NOT NULL,
    created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP
  );

  CREATE UNIQUE INDEX IF NOT EXISTS
    ExperimentIdx ON experiments(name ASC);


  CREATE TABLE IF NOT EXISTS algorithms (
    id INTEGER PRIMARY KEY NOT NULL,
    experiment INTEGER NOT NULL,
    name TEXT NOT NULL,
    description TEXT NOT NULL,
    fold INT NOT NULL,
    version TEXT NOT NULL,
    paramset TEXT NOT NULL,
    FOREIGN KEY (experiment) REFERENCES experiments(id),
    FOREIGN KEY (fold) REFERENCES folds(id)
  );

  -- for tagged stable releases
  CREATE TABLE IF NOT EXISTS versions (
    id INTEGER PRIMARY KEY NOT NULL,
    experiment INTEGER NOT NULL,
    name TEXT NOT NULL, -- the name of the software package or data product
    description TEXT NOT NULL, -- a human readable description
    hash TEXT NOT NULL, -- the hash of the commit produced by version control
    major INT NOT NULL, -- semantic versioning major
    minor INT NOT NULL, -- semantic versioning minor
    patch INT NOT NULL, -- semantic versioning patch
    suffix TEXT NOT NULL, -- semantic versioning suffix, e.g. -alpha, -beta, -dev, -rc1
    released_at timestamptz NOT NULL, -- should be the release date, but could be now if no release date is provided; useful for interactive development
    created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP, -- time of entry into DB
    FOREIGN KEY (experiment) REFERENCES experiments(id)
  );

  CREATE UNIQUE INDEX IF NOT EXISTS
    VersionNameIdx ON versions(name ASC, experiment ASC);

  CREATE TABLE IF NOT EXISTS parametersets (
    experiment INTEGER NOT NULL,
    name text NOT NULL, -- human name for display
    value text NOT NULL, -- a json string of the parameters
    created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP, -- time of creation or entry into DB
    PRIMARY KEY (experiment, name),
    FOREIGN KEY (experiment) REFERENCES experiments(id)
  );
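
Putting the provenance tables together, a sketch of registering a single run (every name, hash, and parameter value here is illustrative):

  INSERT INTO experiments (name, description)
  VALUES ('spam-filter', 'classify messages as spam or ham');

  INSERT INTO versions (experiment, name, description, hash,
                        major, minor, patch, suffix, released_at)
  VALUES (1, 'classifier', 'our text classification package',
          'a1b2c3d', 0, 3, 0, '-dev', CURRENT_TIMESTAMP);

  INSERT INTO parametersets (experiment, name, value)
  VALUES (1, 'svm-defaults', '{"C": 1.0, "kernel": "linear"}');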

Footnotes


  1. I use [[https://www.sqlite.org/][sqlite]] for prototyping and testing because it is easy to create and share a database.
  2. Just like a web application separates the data used to manage the web application from the data that the web application manages, one should separate the analytical results from the data they are derived from.
  3. This foreign key is the reason we have a records table. We want features and labels in our analysis schema to relate back to a record in our dataset from the main schema. Since you might not be able to put foreign keys into that schema for technical or policy reasons, we made a new table to put our foreign key constraints onto.
  4. Foreign key constraints do have a measurable impact on performance and can make some ad-hoc operations harder. Overall, the benefits to correctness outweigh the performance costs for most applications.
  5. You could solve this problem without a join table by picking a maximum number of folds that a model can be related to. For example, you could say that every model has a training fold and an application fold. This would allow you to remove the join table, but would build into the database the assumption that a model can only be related to two folds. If you wanted to break that assumption in the future, you would have to change the database schema.