Database Considerations

TL;DR: Discussion on database variants, evaluation metrics, and choosing the right database for non-graphical information systems.

This document summarizes the considerations, evaluation metrics, clarifications, and observations behind the choice of database for non-graphical information.

A Case for NoSQL

SQL Strain

SQL strain is a key phenomenon observed with relational data sets: the environment becomes strenuous to work in, which adversely affects information manageability. Key characteristics include:

  • Delineation of concerns - While the graph DB helps build active connections, the master DB is responsible for retaining end/computed results, audit trails, and cold-storage attributes that help reconstruct future graphs
  • De-normalisation - When applied extensively to speed up access, it degrades data quality (leading to redundancies and orphan data)
  • Overhead - Only a segment of the data might be required at a given time, so retrieving complete records is wasteful

Systems that deal with steadily changing or evolving entity states accumulate substantial data. The sheer volume of information to be processed warrants a methodology that goes beyond immediate or transactional consistency.

NoSQL Variants

Based on an assessment of how queries impact entity models, NoSQL stores are primarily categorized as:

Key-Value Stores

  • Suited to relatively short values associated with a key
  • Simple key-value pairs expecting relatively few reads/writes (like LDAP)

Document Stores

  • Handle structured and semi-structured information
  • Accommodate varying attributes with embedding support
  • Provide indexing and filtering for subsets
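
The three properties above can be illustrated with a minimal in-memory sketch. It is not the API of any particular document store: the records, the field names, and the helpers `build_index` and `project` are all hypothetical and serve only to show varying attributes, an embedded sub-document, and index-backed filtering of a subset.

```python
from collections import defaultdict

# Hypothetical documents: attributes vary per record, and "address" is embedded.
documents = [
    {"_id": 1, "name": "sensor-a", "type": "thermal", "address": {"city": "Pune"}},
    {"_id": 2, "name": "sensor-b", "type": "optical", "firmware": "1.4"},
    {"_id": 3, "name": "sensor-c", "type": "thermal",
     "address": {"city": "Delhi"}, "tags": ["roof"]},
]


def build_index(docs, field):
    """Map each value of `field` to the ids of the documents that carry it."""
    index = defaultdict(list)
    for doc in docs:
        if field in doc:  # documents without the field are simply skipped
            index[doc[field]].append(doc["_id"])
    return index


def project(doc, fields):
    """Return only the requested attributes that the document actually has."""
    return {f: doc[f] for f in fields if f in doc}


type_index = build_index(documents, "type")
by_id = {doc["_id"]: doc for doc in documents}

# Use the index to fetch a subset, then keep only two attributes of each hit.
thermal = [project(by_id[i], ["name", "address"]) for i in type_index["thermal"]]
```

A real document store would maintain such indexes itself and persist them; the sketch only shows why filtering on an indexed attribute avoids scanning every document.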

Column-Family Stores

  • Handle structured and unstructured information with variable/evolving entity design
  • Accommodate dense data and high volume of information
  • Typically run in clusters by design
  • Always writable, with extensive geographic distribution
  • Tolerance toward short-term inconsistency across replicas
  • Dynamic fields with variable data structure support
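
A toy model may help picture the last three points: writes accepted by any replica, short-term disagreement between replicas, and columns that appear dynamically. The `Replica` class, its methods, and the timestamp-based "last write wins" reconciliation are assumptions made for illustration; they are a simplification and not the implementation of any specific store.

```python
class Replica:
    """Toy replica: row key -> {column name: (timestamp, value)}."""

    def __init__(self):
        self.rows = {}

    def write(self, row_key, column, value, timestamp):
        cells = self.rows.setdefault(row_key, {})
        current = cells.get(column)
        # Keep the cell with the newest timestamp (last write wins).
        if current is None or timestamp > current[0]:
            cells[column] = (timestamp, value)

    def read(self, row_key):
        return {col: cell[1] for col, cell in self.rows.get(row_key, {}).items()}

    def merge_from(self, other):
        """Pull every cell from another replica and reconcile by timestamp."""
        for row_key, cells in other.rows.items():
            for column, (timestamp, value) in cells.items():
                self.write(row_key, column, value, timestamp)


east, west = Replica(), Replica()
east.write("device-1", "status", "online", timestamp=1)
west.write("device-1", "status", "offline", timestamp=2)  # newer write lands elsewhere
west.write("device-1", "battery", 87, timestamp=2)        # column unknown to east

stale = east.read("device-1")  # replicas disagree for a while
east.merge_from(west)          # reconciliation brings them back in line
west.merge_from(east)
```

Both replicas accept writes independently, which is what keeps the system writable across regions; the price is the window in which `stale` differs from the newest state.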

Available Choices

With the growing popularity of NoSQL and the exponential rise in data collection points, there are several choices. The evaluation exercise aimed to draw out specific parameters to narrow down on a suitable implementation.

High Level Parameters

  • Heavy on writes from users and graph connection deductions
  • Eventually heavy on reads, to support analytical capabilities

Comparison Matrix

Metric                     | Cassandra | HBase                   | MongoDB     | Comments
---------------------------+-----------+-------------------------+-------------+-----------------------------------------
API supportability         | Good      | Good                    | Good        | Good overall API support
Ease of spawn              | Good      | Medium                  | Good        | Underlying HDFS dependency
Cost of ownership          | Apache V2 | Apache V2               | AGPL        | All have routes for enterprise editions
Deployment and operation   | Good      | Medium                  | Good        | Due to underlying Hadoop dependency
Operational and analytical | Right Mix | Right Mix as subsystems | Operational | Cassandra is good for sensor driven data
Architectural layout       | Good      | Medium                  | Medium      | Varying benchmark reports
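
One rough way to read the matrix is to tally its Good/Medium ratings per database. The sketch below does only that, under stated assumptions: the 2/1 scoring scale and the equal weighting of rows are illustrative choices and were not part of the original evaluation, and the licence and workload rows are left out because they are not Good/Medium ratings.

```python
# Ratings transcribed from the comparison matrix (Good/Medium rows only).
ratings = {
    "API supportability":       {"Cassandra": "Good", "HBase": "Good",   "MongoDB": "Good"},
    "Ease of spawn":            {"Cassandra": "Good", "HBase": "Medium", "MongoDB": "Good"},
    "Deployment and operation": {"Cassandra": "Good", "HBase": "Medium", "MongoDB": "Good"},
    "Architectural layout":     {"Cassandra": "Good", "HBase": "Medium", "MongoDB": "Medium"},
}

# Assumed scoring scale, chosen for illustration only.
SCORE = {"Good": 2, "Medium": 1}


def tally(matrix):
    """Sum the assumed scores per database across all rated metrics."""
    totals = {}
    for row in matrix.values():
        for database, rating in row.items():
            totals[database] = totals.get(database, 0) + SCORE[rating]
    return totals


totals = tally(ratings)
ranked = sorted(totals, key=totals.get, reverse=True)  # highest total first
```

A tally like this is only a reading aid; the qualitative rows (licensing, operational versus analytical fit) still need to be weighed by hand.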

Inclination

The evaluation gravitated toward a column-family based choice (Cassandra) due to:

  • Better write performance
  • Native clustering support
  • Geographic distribution capabilities
  • Balance of operational and analytical workloads

References

  • [WP1, 2015] Overcoming SQL Strain and SQL Pain - Neo4j
  • [Art1, 2015] Comprehensive analytical analysis - kkovacs.eu
  • [WP2, 2015] Benchmarking top NoSQL databases - DataStax