Database Considerations

This document summarises the considerations, evaluation metrics, clarifications, and observations behind the choice of database for non-graphical information.

A Case for NoSQL

SQL Strain

SQL strain is a key phenomenon observed with relational data sets, in which a strained environment adversely affects information manageability. Key characteristics include:

Delineation of concerns - While the graph DB helps build active connections, the master DB is responsible for retaining end/computed results, audit trails, and cold-storage attributes that help reconstruct future graphs
De-normalisation - When applied extensively to speed up access, it degrades data quality (leading to redundancies and orphan data)
Overhead - Only a segment of the data might be required at a given time

In systems where entity states change steadily and substantial data accumulates, the sheer volume of information to be processed warrants an approach that goes beyond immediate (transactional) consistency.
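
The de-normalisation concern listed above can be illustrated with a minimal sketch. The order and customer records, field names, and helper functions below are hypothetical and exist only for illustration; they do not describe any schema from the evaluation.

```python
# Hypothetical data: the customer's city is copied into every order row
# (de-normalised) so that orders can be read without a join.
orders = [
    {"order_id": 1, "customer_id": "c1", "customer_city": "Pune"},
    {"order_id": 2, "customer_id": "c1", "customer_city": "Pune"},
    {"order_id": 3, "customer_id": "c2", "customer_city": "Delhi"},
]

# Master copy of customers; "c2" has already been deleted here.
customers = {"c1": {"city": "Pune"}}


def find_stale_copies(orders, customers):
    """Return ids of orders whose copied city no longer matches the master record."""
    return [
        o["order_id"]
        for o in orders
        if o["customer_id"] in customers
        and o["customer_city"] != customers[o["customer_id"]]["city"]
    ]


def find_orphans(orders, customers):
    """Return ids of orders that point at a customer that no longer exists."""
    return [o["order_id"] for o in orders if o["customer_id"] not in customers]


# Updating only the master copy leaves every duplicated value out of date.
customers["c1"]["city"] = "Mumbai"

stale = find_stale_copies(orders, customers)
orphans = find_orphans(orders, customers)
```

Here a single update to the master record leaves two redundant copies stale, and one order is orphaned because the customer it refers to has been removed.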

NoSQL Variants

Based on how queries affect entity models, NoSQL stores are primarily categorized as:

Key-Value Stores

Suited to relatively short values associated with a key
Simple key-value pairs with relatively few reads/writes expected (like LDAP)
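
As a rough illustration of the key-value access pattern described above, the sketch below wraps an in-memory dictionary. The class name, method names, and key format are assumptions made for this example; real key-value stores expose their own client APIs.

```python
class KeyValueStore:
    """Toy in-memory key-value store: opaque values looked up by exact key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store or overwrite the value held under a key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value for a key, or the default when the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.put("user:42:email", "asha@example.com")  # hypothetical key and value
email = store.get("user:42:email")
missing = store.get("user:99:email")
removed = store.delete("user:42:email")
```

There is no query over the value itself: the store is only asked for a value by its key, which is why this model fits short, simple associated data.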

Document Stores

Handle structured and semi-structured information
Accommodate varying attributes with embedding support
Provide indexing and filtering for subsets
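
The three properties above (varying attributes, embedding, and indexing/filtering on subsets) can be sketched with plain dictionaries. The documents, the dotted-path convention, and the helper names are assumptions made for illustration, not the API of any particular document store.

```python
# Hypothetical documents: attributes vary per document, and "location" is embedded.
documents = [
    {"_id": 1, "type": "sensor", "location": {"city": "Pune", "floor": 2}, "unit": "C"},
    {"_id": 2, "type": "user", "name": "Asha", "location": {"city": "Pune"}},
    {"_id": 3, "type": "sensor", "location": {"city": "Delhi"}},
]


def get_path(doc, dotted_path):
    """Walk an embedded document along a dotted path; return None if any part is missing."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, criteria):
    """Return the documents whose fields equal every path/value pair in criteria."""
    return [
        d for d in docs
        if all(get_path(d, path) == value for path, value in criteria.items())
    ]


def build_index(docs, dotted_path):
    """Map each value found at dotted_path to the ids of the documents holding it."""
    index = {}
    for d in docs:
        index.setdefault(get_path(d, dotted_path), []).append(d["_id"])
    return index


pune_sensors = find(documents, {"type": "sensor", "location.city": "Pune"})
city_index = build_index(documents, "location.city")
```

The filter selects a subset by a field inside the embedded document, and the index groups document ids by that same field so a lookup does not need to scan every document.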

Column-Family Stores

Handle structured and unstructured information with variable/evolving entity design
Accommodate dense data and high volume of information
Typically run in clusters by design
Always writable, even with extensive geographic distribution
Tolerance toward short-term inconsistency across replicas
Dynamic fields with variable data structure support
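
A minimal sketch of two of the properties above: rows with dynamic columns, and tolerance toward short-term inconsistency across replicas. It assumes last-write-wins reconciliation by timestamp, which is one common approach; the `Replica` class, the `reconcile` helper, and the sample row keys are hypothetical.

```python
class Replica:
    """Toy replica: each row holds an open-ended set of timestamped columns."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # row_key -> {column: (timestamp, value)}

    def write(self, row_key, column, value, timestamp):
        """Accept a write locally; keep the cell only if it is newer than what is stored."""
        cells = self.rows.setdefault(row_key, {})
        current = cells.get(column)
        if current is None or timestamp > current[0]:
            cells[column] = (timestamp, value)

    def read(self, row_key):
        """Return the current column values of a row, without timestamps."""
        return {col: value for col, (_, value) in self.rows.get(row_key, {}).items()}


def reconcile(a, b):
    """Exchange cells in both directions so each replica ends with the newest version."""
    for src, dst in ((a, b), (b, a)):
        for row_key, cells in src.rows.items():
            for column, (timestamp, value) in cells.items():
                dst.write(row_key, column, value, timestamp)


east, west = Replica("east"), Replica("west")
east.write("device-7", "temp", 21.5, timestamp=1)
west.write("device-7", "humidity", 40, timestamp=2)  # column unknown to east
west.write("device-7", "temp", 22.0, timestamp=3)    # newer value for temp

before = (east.read("device-7"), west.read("device-7"))  # replicas disagree
reconcile(east, west)
after = (east.read("device-7"), west.read("device-7"))   # replicas converge
```

Each replica accepts writes on its own, so the two briefly return different answers for the same row. After reconciliation both hold the newest value of every column, including the column that only one of them had seen.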

Available Choices

With the growing popularity of NoSQL and the exponential rise in data collection points, several choices are available. The evaluation exercise drew out specific parameters to narrow these down to a suitable implementation.

High Level Parameters

Write-heavy, with writes coming from users and from graph connection deductions
Eventually read-heavy, to support analytical capabilities

Comparison Matrix

Metric	Cassandra	HBase	MongoDB	Comments
API supportability	Good	Good	Good	Good overall API support
Ease of spawn	Good	Medium	Good	HBase has an underlying HDFS dependency
Cost of ownership	Apache V2	Apache V2	AGPL	All have routes for enterprise editions
Deployment and operation	Good	Medium	Good	HBase rated lower due to its underlying Hadoop dependency
Operational and analytical	Right Mix	Right Mix as subsystems	Operational	Cassandra is good for sensor driven data
Architectural layout	Good	Medium	Medium	Varying benchmark reports
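
One way to read the matrix is to tally its ordinal ratings. The sketch below does this under loudly stated assumptions: the weights (Good = 2, Medium = 1) are arbitrary and not part of the original evaluation, all metrics are weighted equally, and the rows without Good/Medium ratings (cost of ownership, operational and analytical) are left out.

```python
# Assumed weights, not part of the original evaluation.
SCORES = {"Good": 2, "Medium": 1}

# Ratings copied from the rows of the matrix that use Good/Medium values.
matrix = {
    "API supportability": {"Cassandra": "Good", "HBase": "Good", "MongoDB": "Good"},
    "Ease of spawn": {"Cassandra": "Good", "HBase": "Medium", "MongoDB": "Good"},
    "Deployment and operation": {"Cassandra": "Good", "HBase": "Medium", "MongoDB": "Good"},
    "Architectural layout": {"Cassandra": "Good", "HBase": "Medium", "MongoDB": "Medium"},
}


def tally(matrix, scores):
    """Sum the numeric score of every rating, per database."""
    totals = {}
    for ratings in matrix.values():
        for database, rating in ratings.items():
            totals[database] = totals.get(database, 0) + scores[rating]
    return totals


totals = tally(matrix, SCORES)
ranked = sorted(totals, key=totals.get, reverse=True)
```

A tally like this is only a coarse aid: changing the weights, or adding the rows that were left out, could change the ordering.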

Inclination

The evaluation gravitated toward a column-family based choice (Cassandra) due to:

Better write performance
Native clustering support
Geographic distribution capabilities
Balance of operational and analytical workloads

References

[WP1, 2015] Overcoming SQL Strain and SQL Pain - Neo4j
[Art1, 2015] Comprehensive analytical analysis - kkovacs.eu
[WP2, 2015] Benchmarking top NoSQL databases - DataStax
