Back to shazow.net

Arbitrarily-Structured Data in Relational Databases

This approach is similar to FriendFeed’s schemaless database framework. The key difference is in the data locality.

The goal is not to build an effective schemaless database on top of a relational database, but rather to accomodate for rapidly-evolving relational schemas and reducing the difficult of migrating forward.

Hypothesis

In an evolving relational (SQL) database schema, we store two types of data: Data we will be querying against and data we will be displaying. There is often a subset of display data which will not be used for querying in the foreseeable future, and this is the data whose structure changes most often.

Solution

Store query data and display data separately such that display data is less strictly-structured and thus more easily evolved.

Imagine a standardized table structure where each table has the following columns: id, time_created, time_updated, _data, and additional “index columns”.

The _data column contains a dictionary of arbitrary data serialized into JSON (or could be zlib-compressed Pickle if it were Python-specific). Index columns are columns which you query against.

Example

A typical user table might have the following columns (using an SQLAlchemy declarative model):

class User(Model):

    id = Column(types.Integer, primary_key=True)
    time_created = Column(types.DateTime, default=datetime.now, nullable=False)
    time_updated = Column(types.DateTime, onupdate=datetime.now)

    is_admin = Column(types.Boolean, default=False, nullable=False)

    email = Column(types.String(255), nullable=False, index=True, unique=True)
    display_name = Column(types.String(64))

    password_hash = Column(types.String(40), nullable=False)
    password_salt = Column(types.String(8), nullable=False)

In our example, this table will have two types of queries:

-- Load the user object from the current session (where we store the user_id)
SELECT * FROM user WHERE id = :user_id;

-- Check the given password against the email address, for login
SELECT password_hash, password_salt FROM user WHERE email = :user_email;

In the schemaless model, the table would look like this:

class User(SchemalessModel):

    id = Column(types.Integer, primary_key=True)
    time_created = Column(types.DateTime, default=datetime.now, nullable=False)
    time_updated = Column(types.DateTime, onupdate=datetime.now)
    _data = Column(types.JSON)

    email = Column(types.String(255), nullable=False, index=True, unique=True)

Where the _data column would contain data like this:

{
    "display_name": "Andrey Petrov",
    "is_admin": 1,
    "password_hash": "cSKSsy315E4EroxeDQrsxjTb6ijBxxbK",
    "password_salt": "vS5Otm",
}

And perhaps we would build a framework on top of SQLAlchemy which would let us access columns as user.display_name or user.email regardless whether it’s an extracted indexed property or a buried _data element.

Process: Adding an index

  1. Build a table with just a free-structure _data field.
  2. Determine queries, extract relevant properties into indexed columns:
    1. ALTER TABLE to add the column
    2. Run full-scan query to populate new column with data
    3. Add relevant index onto said column
    4. Deprecate property from _data (optional, we could just assume that proper columns always supercede _data attributes)

Concerns

Back to shazow.net