Sunday, February 23, 2014

Some musings on databases and the NoSQL vs. SQL debate

So I've been doing some research into what kind of database this project will use for the knowledge base component.  There may be a separate database for managing user accounts, which is a much simpler functionality.  The short of it is that I'm thinking MongoDB is a good fit...

One of the main things I've been trying to decide is whether we want to do SQL or "NoSQL", and what are the real strengths and weaknesses of those two 'classes' (if you want to call them that).  Note that NoSQL actually stands for "Not Only SQL", and some NoSQL databases have SQL-like query languages but use different representations and organization for data.  What I'm generally finding (and have seen this in limited experience with systems such as MySQL) is that SQL databases, or more generally relational databases, are not all that good at encoding graphs nor hierarchical objects.  This is a natural weakness from the explicit use of tables as the fundamental concept.  Also, there's often a disconnect between how most object-oriented programs represent data and how that can and is represented in a relational database; this difference is sometimes called "impedance mismatch".

On the other hand, I'm liking what I find about NoSQL databases, and in particular document-oriented databases such as MongoDB.  Instead of rigid tables and pre-defined schemas, a document database stores data in formatted files, which can be plain text but are often binary for efficiency.  There is, of course, a syntax to these documents but there are generally not rigid requirements on what each object must or must not contain.  This kind of flexibility allows similar objects to store things if they need them and omit them if they don't.  I find this strength to be in direct contrast to the column concept in SQL-type database systems.  You often see those types of database tables with some columns that are usually NULL - this is either because most records simply don't need or have that data, or designed into the schema in the beginning and remains because it's harder to remove it than to keep storing all those NULL's.

A key aspect of this decision is that we will have some complex data structures in the database.  Per the work Alan Kuntz is doing, we will be storing graphs in this database representing diagnostic procedures with edge weights reflecting frequency of problem occurrence.  I am deeply concerned that using a relational database for this purpose will cause us nothing but heartache and pain.  By contrast, I believe a procedural step object (node) in a system like MongoDB could look something like this (pseudocode of course):

object node_step
{ "instructions": "check X"                                 // what's displayed to the user
   "id": "12345"                                                   // a key into this object
   "connected": "12678,78912,56742"                // connected nodes, could also point to Edge objects
   "value": "0.45"                                                // frequency of occurrence (based on repair history)
}

By contrast, I'm not all that sure how this would look in a relational database like MySQL.  I guess we would have a table for node_step, with the columns shown above, and the field "connected" would be a set of numbers (there may be an inherent flexibility about the number of connections right there) that are foreign keys into other entries in node_step.  To me, that document-oriented style just seems so much more suitable.  MongoDB uses BSON, a binary encoding of JavaScript Object Notation (JSON).  JSON is a language-independent data format that plays nicely with Java, Javascript, and many other languages.  The format of JSON is basically like that node_step object above.  I am feeling like the choice of this technology is a critical one because this knowledge base is essentially the core functionality of this application.

No comments:

Post a Comment