Machine learning is the study of algorithms that mine large databases for insight. It is therefore surprising how much machine learning research is conducted without using databases.
Machine learning practitioners, and anyone who has spent an afternoon munging CSV files, can appreciate the utility of storing data in a structured database. Why, then, is most machine learning research conducted with scripts that generate CSV files and plots?
I recently worked on an entirely self-contained ML project: the goal was to take some text documents, build models to classify that text, evaluate those models, and draw some conclusions.
As a graduate student trained in mathematics and computer science, I was not well versed in enterprise Relational Database Management Systems (RDBMS) or the Structured Query Language (SQL). I knew that SQL is a query language for describing operations on data, but thought that clearly one would rather do their work in a real programming language like Python.
So I wrote code that emitted text; used awk, grep, and sed to gather that output into CSV files; ingested those CSV files with Pandas; and wrote my data analysis scripts against those data frames.
Having graduated and worked on a variety of software projects, including web applications, I have become a much more proficient RDBMS and SQL user. This has markedly improved the efficiency of my data mining research.
Why should I use a Data Mining Database (DMDB)?
If you are running a business or scientific project and are not storing your data in an RDBMS, then you are probably doing it wrong. No matter how complex your data seems, if you cannot convert it to rows and columns, then you do not fully understand your data.
Since the data starts out in a database, it makes sense for the analytical results to live alongside it in the same database [1]. Once you have paid the initial cost of setting up a database and designing a schema, you will find that the marginal cost of using the database is small. In fact, since you end up writing less procedural code and spending less time managing many data files, you will often find that the marginal cost of using a database is negative.
What should I store in my DMDB?
Once you have made a home for your analytical results in your database [2], you must consider the schema you are going to use. Start with the underlying records. These will usually come from another schema in the database, so you just give every record that you want to include in the analysis a new primary key that is consistent within the analysis. The joincolumn field identifies the original records that hold the structured data.
Records and Features
In any data mining project, we start with the data. Here we will refer to a single piece of data (image, document, message, user) as a record.

CREATE TABLE IF NOT EXISTS records (
id INTEGER NOT NULL PRIMARY KEY,
name TEXT NOT NULL,
joincolumn INTEGER NOT NULL
);
CREATE INDEX RecordsNameIdx ON records(name ASC);
To access the records you would run a query like:
select records.id, field1, field2, field3
from datatable join records
on datatable.joincolumn = records.joincolumn
This allows you to select the columns that you want and to derive additional columns as you pull the data. For example, if you wanted to build a polynomial feature for feature engineering, you could do that in the database:
select records.id, field1, field2, field3, (field1*field3) + field1*field2 as poly
from datatable join records
on datatable.joincolumn = records.joincolumn
If features are extracted this way, then they are easily shareable and repeatable since anyone with the same database schema can execute your query to extract the same features.
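As a concrete sketch of this workflow, here is the join query run from Python's sqlite3 against an in-memory database; the datatable schema and the values in it are made-up stand-ins for whatever structured source data your analysis starts from:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "datatable" and its fields are hypothetical; in practice this is an
# existing schema that your analysis joins against.
conn.executescript("""
CREATE TABLE datatable (joincolumn INTEGER PRIMARY KEY,
                        field1 FLOAT, field2 FLOAT, field3 FLOAT);
CREATE TABLE records (id INTEGER NOT NULL PRIMARY KEY,
                      name TEXT NOT NULL,
                      joincolumn INTEGER NOT NULL);
""")
conn.execute("INSERT INTO datatable VALUES (100, 2.0, 3.0, 4.0)")
conn.execute("INSERT INTO records VALUES (1, 'doc-a', 100)")

# The derived (polynomial) feature is computed inside the query itself,
# so anyone with the same schema reproduces exactly the same features.
rows = conn.execute("""
    SELECT records.id, field1, field2, field3,
           (field1*field3) + field1*field2 AS poly
    FROM datatable JOIN records
    ON datatable.joincolumn = records.joincolumn
""").fetchall()
print(rows)  # [(1, 2.0, 3.0, 4.0, 14.0)]
```

The feature engineering lives in the query text, which is easy to check into version control alongside the analysis code.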
Not all features can be easily computed in SQL, so you will probably want a table for storing features derived from some offline process such as NLP or image processing.
For sparse features you can use a table that stores both the row and column identifier.
CREATE TABLE IF NOT EXISTS features (
id INTEGER PRIMARY KEY NOT NULL,
record INTEGER NOT NULL,
"column" INTEGER NOT NULL, -- quoted because COLUMN is a reserved word
value FLOAT NOT NULL,
FOREIGN KEY (record) REFERENCES records(id)
);
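A minimal sketch of filling this sparse table from Python's sqlite3; the record and the vector entries are invented for illustration, and note that SQLite only enforces foreign keys when the pragma is enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE records (id INTEGER NOT NULL PRIMARY KEY,
                      name TEXT NOT NULL, joincolumn INTEGER NOT NULL);
CREATE TABLE features (id INTEGER PRIMARY KEY NOT NULL,
                       record INTEGER NOT NULL,
                       "column" INTEGER NOT NULL,
                       value FLOAT NOT NULL,
                       FOREIGN KEY (record) REFERENCES records(id));
""")
conn.execute("INSERT INTO records VALUES (1, 'doc-a', 100)")

# Only the non-zero entries of a sparse vector are stored, one row each.
sparse_vector = {3: 0.5, 17: 1.25}   # column index -> value
conn.executemany(
    'INSERT INTO features (record, "column", value) VALUES (?, ?, ?)',
    [(1, col, val) for col, val in sparse_vector.items()])

rows = conn.execute(
    'SELECT "column", value FROM features WHERE record = 1 '
    'ORDER BY "column"').fetchall()
print(rows)  # [(3, 0.5), (17, 1.25)]
```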
For dense features such as those used in image processing, you would want to use an alternative structure such as the following:
CREATE TABLE IF NOT EXISTS media (
id INTEGER NOT NULL PRIMARY KEY,
record INTEGER NOT NULL,
name TEXT NOT NULL,
-- so we know how to read it
mime TEXT NOT NULL,
blob BLOB,
url TEXT NOT NULL,
FOREIGN KEY (record) REFERENCES records(id)
);
The mime field is important because it tells the application how to read the data stored in this table. You can use an application/mytype value to define custom types that your software knows how to read.
For medium-sized dense features, the blob field is a good tradeoff between convenience and performance.
For large feature vectors, such as images, it makes sense to store the data outside of the database. The url field can store a reference to the data so that any application can retrieve it with a simple HTTP GET or FTP download.
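Here is a sketch of the blob-plus-mime idiom in Python's sqlite3. The application/x-dense-float64 mime type and the struct packing scheme are illustrative choices of my own, not a standard; the empty url marks that the data is stored inline:

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (id INTEGER NOT NULL PRIMARY KEY,
                      name TEXT NOT NULL, joincolumn INTEGER NOT NULL);
CREATE TABLE media (id INTEGER NOT NULL PRIMARY KEY,
                    record INTEGER NOT NULL,
                    name TEXT NOT NULL,
                    mime TEXT NOT NULL,
                    blob BLOB,
                    url TEXT NOT NULL,
                    FOREIGN KEY (record) REFERENCES records(id));
""")
conn.execute("INSERT INTO records VALUES (1, 'img-a', 100)")

# Pack a dense float vector into bytes under a custom mime type that
# our own software knows how to decode.
vec = [0.1, 0.2, 0.3]
packed = struct.pack(f"<{len(vec)}d", *vec)
conn.execute(
    "INSERT INTO media (record, name, mime, blob, url) VALUES (?, ?, ?, ?, ?)",
    (1, "dense-vec", "application/x-dense-float64", packed, ""))

mime, data = conn.execute(
    "SELECT mime, blob FROM media WHERE record = 1").fetchone()
if mime == "application/x-dense-float64":   # dispatch on the mime field
    vec_back = list(struct.unpack(f"<{len(data) // 8}d", data))
print(vec_back)  # [0.1, 0.2, 0.3]
```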
This brings us to our first foreign key: the features refer back to a record [3]. Every feature vector in our analysis refers back to the record that it represents. A foreign key constraint like FOREIGN KEY (record) REFERENCES records(id) tells the database that every entry in the record column of the features table must match some entry in the id column of the records table. This allows the database to ensure that we do not store features for records that don't exist [4].
Labels
Every machine learning problem requires labels of some kind: class labels for a supervised classification problem, or continuous targets for regression. Even unsupervised problems usually have labels, such as a cluster id. The following tables store the label data along with some common information that is useful for a classification problem.
An integer label with a floating point proba field works well for binary classification, because you can use models that produce a score to compare with a threshold. You can also use the proba field for a regression target variable.
Since we use the same table for input labels provided by the problem and output labels generated by algorithms, every record could have multiple labels.
A fold is a set of labels; for example, when we are comparing multiple algorithms on the same problem, we will produce a label for each record for each algorithm.
CREATE TABLE IF NOT EXISTS descriptions (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
description TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS folds (
id INTEGER PRIMARY KEY,
training BOOL,
testing BOOL,
validation BOOL,
generated BOOL,
"desc" INTEGER NOT NULL, -- quoted because DESC is a reserved word
created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY ("desc") REFERENCES descriptions(id)
);
CREATE TABLE IF NOT EXISTS labels (
id INTEGER PRIMARY KEY,
record INTEGER NOT NULL,
fold INTEGER NOT NULL,
label INTEGER NOT NULL,
proba FLOAT NOT NULL,
FOREIGN KEY (record) REFERENCES records(id),
FOREIGN KEY (fold) REFERENCES folds(id),
FOREIGN KEY (label) REFERENCES descriptions(id)
);
This illustrates a common idiom in database design. Often there is a small set of potentially long strings that many items reference. In this case we have the labels for the data, which are often expressed in domain terms such as healthy vs. sick, legit vs. illegit, or hot dog vs. not hot dog. We handle this by using an integer to represent the label and creating a second table that contains the mapping from integer to string. Thus the descriptions table is something like:
| id | name        | description                      |
|----+-------------+----------------------------------|
|  1 | healthy     | The person is healthy            |
|  2 | sick        | The person has a disease         |
|  3 | legit       | This customer is not a scammer   |
|  4 | illegit     | This customer is a scammer       |
|  5 | hot dog     | Sausage between buns             |
|  6 | not hot dog | Literally anything but a hot dog |
|  7 | dachshund   | The intersection of 5 and 6      |
This way you have both a short and a long name for each label without repeating them for each record. Folds also need these descriptions. Some databases break the descriptions into separate tables based on what they describe; here we have chosen to put all descriptions in one table.
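Putting the labels, folds, and descriptions tables together, here is a sketch in Python's sqlite3 of storing one model's predictions as a generated fold and reading them back in domain terms; the ids, fold flags, and label values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE descriptions (id INTEGER PRIMARY KEY,
                           name TEXT NOT NULL, description TEXT NOT NULL);
CREATE TABLE folds (id INTEGER PRIMARY KEY,
                    training BOOL, testing BOOL, validation BOOL, generated BOOL,
                    "desc" INTEGER NOT NULL,
                    FOREIGN KEY ("desc") REFERENCES descriptions(id));
CREATE TABLE labels (id INTEGER PRIMARY KEY,
                     record INTEGER NOT NULL, fold INTEGER NOT NULL,
                     label INTEGER NOT NULL, proba FLOAT NOT NULL);
""")
conn.executemany("INSERT INTO descriptions VALUES (?, ?, ?)", [
    (1, "healthy", "The person is healthy"),
    (2, "sick", "The person has a disease"),
    (10, "logreg-preds", "Labels generated by a model"),
])
# A generated fold holding one model's predictions.
conn.execute("INSERT INTO folds VALUES (1, 0, 0, 0, 1, 10)")
# Each prediction stores the hard label plus the score it was thresholded from.
conn.executemany(
    "INSERT INTO labels (record, fold, label, proba) VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 0.91), (2, 1, 1, 0.12)])

# Join back to descriptions to read results in domain terms.
rows = conn.execute("""
    SELECT labels.record, descriptions.name, labels.proba
    FROM labels JOIN descriptions ON labels.label = descriptions.id
    WHERE labels.fold = 1 ORDER BY labels.record
""").fetchall()
print(rows)  # [(1, 'sick', 0.91), (2, 'healthy', 0.12)]
```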
Models
The final piece of our puzzle is storing the trained models for evaluation. When we train a model, we want to enter it into the database along with its relations to the folds.
CREATE TABLE IF NOT EXISTS models (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT NOT NULL,
description TEXT NOT NULL,
created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
blob BLOB NOT NULL
);
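A sketch of storing a trained model in the blob column using Python's sqlite3 and pickle; the dictionary here is a stand-in for whatever picklable model object your framework produces, and pickle itself is just one serialization choice:

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE models (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL, description TEXT NOT NULL,
    created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
    blob BLOB NOT NULL)""")

# Stand-in for a trained model object.
model = {"weights": [0.4, -1.2], "bias": 0.3}
conn.execute(
    "INSERT INTO models (name, description, blob) VALUES (?, ?, ?)",
    ("logreg-v1", "toy linear model", pickle.dumps(model)))

# Any later evaluation script can pull the model back out by name.
blob, = conn.execute(
    "SELECT blob FROM models WHERE name = 'logreg-v1'").fetchone()
restored = pickle.loads(blob)
print(restored == model)  # True
```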
CREATE TABLE IF NOT EXISTS modelfolds (
id INTEGER PRIMARY KEY NOT NULL,
model INTEGER NOT NULL,
fold INTEGER NOT NULL,
relation INTEGER NOT NULL,
FOREIGN KEY (model) REFERENCES models(id),
FOREIGN KEY (fold) REFERENCES folds(id),
FOREIGN KEY (relation) REFERENCES descriptions(id)
);
The modelfolds table is a join table, which means that its primary purpose is to connect two tables that are in a many-to-many relationship. Since a model can be trained on one fold and applied to potentially many other folds, we use a join table [5].
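A sketch of the join table in use with Python's sqlite3; the trained-on and applied-to relation names are illustrative rows in the descriptions table, not part of the schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE descriptions (id INTEGER PRIMARY KEY,
                           name TEXT NOT NULL, description TEXT NOT NULL);
CREATE TABLE modelfolds (id INTEGER PRIMARY KEY NOT NULL,
                         model INTEGER NOT NULL,
                         fold INTEGER NOT NULL,
                         relation INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO descriptions VALUES (?, ?, ?)", [
    (20, "trained-on", "model was fit on this fold"),
    (21, "applied-to", "model produced labels for this fold"),
])
# Model 1 was trained on fold 1 and applied to folds 2 and 3; each
# relationship is just another row, so no fixed limit is baked in.
conn.executemany(
    "INSERT INTO modelfolds (model, fold, relation) VALUES (?, ?, ?)",
    [(1, 1, 20), (1, 2, 21), (1, 3, 21)])

rows = conn.execute("""
    SELECT modelfolds.fold, descriptions.name
    FROM modelfolds JOIN descriptions
    ON modelfolds.relation = descriptions.id
    WHERE modelfolds.model = 1 ORDER BY modelfolds.fold
""").fetchall()
print(rows)  # [(1, 'trained-on'), (2, 'applied-to'), (3, 'applied-to')]
```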
Provenance Information
This section forms the basis of a second post on tracking data provenance; it is too much material to cover in one post, so we are breaking it up. The second post will introduce these provenance tables and modify the existing tables to refer to them.
Provenance information is also harder to record, because you have to use some specialized software to insert things into the DB instead of just inserting into the tables as you go. But it does allow for multi-tenancy on the same database, which is important for larger projects.

CREATE TABLE IF NOT EXISTS experiments (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT NOT NULL,
description TEXT NOT NULL,
created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE UNIQUE INDEX IF NOT EXISTS
ExperimentIdx on experiments(name ASC);
CREATE TABLE IF NOT EXISTS algorithms (
id INTEGER PRIMARY KEY NOT NULL,
experiment INTEGER NOT NULL,
name TEXT NOT NULL,
description TEXT NOT NULL,
fold INT NOT NULL,
version TEXT NOT NULL,
paramset TEXT NOT NULL,
FOREIGN KEY (experiment) REFERENCES experiments(id)
);
-- for tagged stable releases
CREATE TABLE IF NOT EXISTS versions (
id INTEGER PRIMARY KEY NOT NULL,
experiment INTEGER NOT NULL,
name TEXT NOT NULL, -- the name of the software package or data product
description TEXT NOT NULL, -- a human readable description
hash TEXT NOT NULL, -- the hash of the commit produced by version control
major INT NOT NULL, -- semantic versioning major
minor INT NOT NULL, -- semantic versioning minor
patch INT NOT NULL, -- semantic versioning patch
suffix TEXT NOT NULL, -- semantic versioning suffix, e.g. -alpha, -beta, -dev, -rc1
released_at timestamptz NOT NULL, -- the release date, or now if no release date was provided; useful for interactive development
created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP, -- time of entry into DB
FOREIGN KEY (experiment) REFERENCES experiments(id)
);
CREATE UNIQUE INDEX IF NOT EXISTS
VersionNameIdx on versions(name ASC, experiment ASC);
CREATE TABLE IF NOT EXISTS parametersets (
experiment INTEGER NOT NULL,
name text NOT NULL, -- human name for display
value text NOT NULL, -- a json string of the parameters
created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP, -- time of creation or entry into DB
PRIMARY KEY (experiment, name),
FOREIGN KEY (experiment) REFERENCES experiments(id)
);
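A sketch of recording a parameter set as a JSON string from Python's sqlite3, so that a run is reproducible from the database alone; the hyperparameter names and values are illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiments (id INTEGER PRIMARY KEY NOT NULL,
                          name TEXT NOT NULL, description TEXT NOT NULL);
CREATE TABLE parametersets (experiment INTEGER NOT NULL,
                            name TEXT NOT NULL,
                            value TEXT NOT NULL,
                            PRIMARY KEY (experiment, name),
                            FOREIGN KEY (experiment) REFERENCES experiments(id));
""")
conn.execute("INSERT INTO experiments VALUES (1, 'spam-filter', 'toy experiment')")

# Serialize the hyperparameters to JSON; the (experiment, name) primary
# key prevents two different parameter sets from sharing a display name.
params = {"C": 0.1, "penalty": "l2", "max_iter": 1000}
conn.execute(
    "INSERT INTO parametersets (experiment, name, value) VALUES (?, ?, ?)",
    (1, "logreg-default", json.dumps(params)))

value, = conn.execute("""SELECT value FROM parametersets
                         WHERE experiment = 1
                         AND name = 'logreg-default'""").fetchone()
print(json.loads(value))  # {'C': 0.1, 'penalty': 'l2', 'max_iter': 1000}
```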
Footnotes
1. I use [[https://www.sqlite.org/][sqlite]] for prototyping and testing because it is easy to create and share a database.
2. Just like a web application separates the data used to manage the application from the data that the application manages, one should separate the analytical results from the data they are derived from.
3. This foreign key is the reason we have a records table. We want features and labels in our analysis schema to relate back to a record in our dataset from the main schema. Since you might not be able to put foreign keys into that schema for technical or policy reasons, we made a new table to put our foreign key constraints onto.
4. Foreign key constraints do have a measurable impact on performance and can make some ad-hoc operations harder. Overall, the benefits to correctness outweigh the performance costs for most applications.
5. You could solve this problem without a join table by picking a maximum number of folds that a model can be related to. For example, you could say that every model has a training fold and an application fold. This would allow you to remove the join table, but would build into the database the assumption that a model can only be related to two folds. If you wanted to break that assumption in the future, you would have to change the database schema.