
Before we go into the details in the following parts, this page gives an
overview of current Big Data solutions and shows how caches fit in as one
approach to tackle the Big Data challenge.

As exponentially decreasing storage costs have made recording and storing
data almost free, data growth has become one of the biggest challenges
faced by today’s organizations.

The term Big Data was coined to
describe a collection of data sets so large and complex that it becomes
difficult to handle using traditional, centralized solutions.

[Figure: the Big Data bottleneck]

Modern Java application servers support clustered deployments, and provide
good horizontal scalability for growing applications. However, traditional
centralized shared data stores often become the major performance bottleneck
when the amount of data grows.

Categorization of Big Data solutions

The following is an overview of current approaches in Big Data solutions.

Extended File Systems with Map/Reduce

  • Overview: The Hadoop Distributed File System (HDFS),
    which was originally modeled on the
    Google File System (GFS),
    is one of the most established technologies in Big Data.

    The basic idea is that, instead of making servers larger and larger,
    data should be distributed among multiple commodity hardware servers.

    Apart from the scalable data storage, Hadoop also provides an infrastructure
    for processing data: Map/Reduce Jobs
    can be distributed among the servers, such that each server processes the
    part of the data that is locally available.

  • Typical Use Cases: HDFS is good at storing large amounts of unstructured
    data, and making the data available for batch processing.
  • Non-Goals: HDFS is designed for write-once-read-many (WORM) applications
    and does not scale well when data is frequently updated. Moreover,
    Map/Reduce is targeted at long-running background jobs and is not
    suited to real-time queries.
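
The Map/Reduce idea described above can be illustrated without Hadoop
itself. The following is a minimal single-process sketch in plain Java; it
does not use the Hadoop API, and the class name WordCountSketch and its
methods are hypothetical. Each list of lines stands in for the data that is
locally available on one server: the map step counts words per partition,
and the reduce step merges the partial counts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the Map/Reduce idea; this is NOT the Hadoop API.
public class WordCountSketch {

    // Map step: turn one partition (the data local to one server)
    // into partial word counts.
    public static Map<String, Integer> mapPartition(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Reduce step: merge the partial results of all partitions.
    public static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two partitions, standing in for the local data of two servers.
        List<String> serverA = Arrays.asList("big data", "big cache");
        List<String> serverB = Arrays.asList("data store");

        Map<String, Integer> result =
                reduce(Arrays.asList(mapPartition(serverA), mapPartition(serverB)));
        System.out.println(result);
    }
}
```

In a real cluster the map step would run on the server that holds the
partition, and only the much smaller partial results would travel over the
network to the reduce step.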

NoSQL Databases

  • Overview: While file systems are just a flat store for unstructured data,
    NoSQL databases allow for some relational structuring of the data. In
    most cases, NoSQL does not aim to provide the full relational
    capabilities of SQL databases; instead, it provides simpler relations,
    like the key/value relation in hash tables.

    Common implementations include HBase, which
    is built on top of Hadoop, and MongoDB.

  • Typical Use Cases: Real-time access using a simple relational model.
  • Non-Goals: As compared to SQL, relations are very limited. Complex
    relations affect scalability, as they might result in queries where each
    node in the cluster has to interact with each other node.
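
To make the scalability argument more concrete, here is a minimal sketch of
a key/value store whose keys are spread across several nodes by hash. It is
an assumption-laden toy in plain Java, not modeled on the HBase or MongoDB
APIs; the class PartitionedStoreSketch and its methods are hypothetical.
A lookup by key touches exactly one node, whereas a query by value has to
visit every node, which is a simpler cousin of the cross-node interaction
that complex relations cause.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of hash-partitioned key/value storage;
// each in-memory map stands in for one node of a cluster.
public class PartitionedStoreSketch {

    private final List<Map<String, String>> nodes = new ArrayList<>();

    public PartitionedStoreSketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    // The key alone decides which node is responsible.
    // floorMod keeps the index non-negative for negative hash codes.
    public int nodeFor(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    // put and get each touch a single node.
    public void put(String key, String value) {
        nodes.get(nodeFor(key)).put(key, value);
    }

    public String get(String key) {
        return nodes.get(nodeFor(key)).get(key);
    }

    // A query by value cannot use the partitioning: every node is scanned.
    public List<String> keysWithValue(String value) {
        List<String> result = new ArrayList<>();
        for (Map<String, String> node : nodes) {
            for (Map.Entry<String, String> entry : node.entrySet()) {
                if (entry.getValue().equals(value)) {
                    result.add(entry.getKey());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PartitionedStoreSketch store = new PartitionedStoreSketch(3);
        store.put("user:1", "Berlin");
        store.put("user:2", "Paris");
        store.put("user:3", "Berlin");

        System.out.println(store.get("user:2"));                  // one node involved
        System.out.println(store.keysWithValue("Berlin").size()); // all nodes involved
    }
}
```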

Search Indexes

  • Overview: Lucene is a well-known Java
    library for full-text indexing. In order to handle larger amounts of
    data, Solr and
    Elasticsearch implement clustered deployments
    of Lucene indexes.
  • Typical Use Cases: Lucene-based data stores make data available via
    full-text queries, range queries, etc.
  • Non-Goals: Unlike Map/Reduce infrastructures, Lucene-based stores do not
    give access to the raw data. Lucene provides access only via fields that
    have explicitly been indexed.
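
The restriction to indexed fields follows from how a full-text index works.
The following is a minimal sketch of an inverted index in plain Java. It
does not use the Lucene API, and the class InvertedIndexSketch is
hypothetical; it only illustrates that a search consults the term-to-document
mapping built at indexing time, never the original text.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index; this is NOT the Lucene API.
public class InvertedIndexSketch {

    // term -> ids of the documents whose indexed text contains the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index one text field of a document. The text itself is not kept,
    // only the mapping from each term to the document id.
    public void index(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up a single term; anything that was never indexed cannot be found.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.index(1, "Caches in Big Data");
        idx.index(2, "Big Data with Hadoop");

        System.out.println(idx.search("big"));    // prints [1, 2]
        System.out.println(idx.search("hadoop")); // prints [2]
        System.out.println(idx.search("solr"));   // prints []
    }
}
```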

Caches

  • Overview: Originally, caches were local temporary stores in RAM
    designed to reduce access time. As we will show, current cache
    implementations, like Ehcache,
    Hazelcast, and
    Infinispan, have evolved into clustered
    in-memory data management systems and have become an independent class of
    Big Data solutions.
  • Typical Use Cases: Caches provide very fast real-time access, and
    can be configured to support frequent updates of the cached data.
  • Non-Goals: As caches mostly hold their data in RAM, the amount of
    data is typically smaller than in file-system-based solutions.
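
As an illustration of the original, local form of a cache, here is a
minimal sketch of a bounded in-memory cache placed in front of a slower data
source. It is plain Java and does not use the Ehcache, Hazelcast, or
Infinispan APIs; the class LocalCacheSketch, its loader function, and the
least-recently-used eviction policy are assumptions chosen for the example.
The fixed capacity reflects the point above that RAM limits how much data a
cache can hold.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a local, bounded, read-through cache.
public class LocalCacheSketch<K, V> {

    private final Function<K, V> loader; // stands in for the slow data store
    private final Map<K, V> entries;
    private int loads = 0;               // how often the slow store was hit

    public LocalCacheSketch(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the capacity is exceeded.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Serve from RAM if possible; otherwise load and remember the value.
    // Assumes the loader never returns null.
    public synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    public synchronized int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        LocalCacheSketch<String, Integer> cache =
                new LocalCacheSketch<>(2, String::length);
        cache.get("a");
        cache.get("a");   // served from the cache, no second load
        System.out.println(cache.loadCount()); // prints 1
    }
}
```

The clustered systems named above go well beyond this single-JVM sketch, for
example by distributing entries across several machines.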

Convergence of Big Data Solutions

Although the solutions above have very different roots, the differences
between them are beginning to blur. There are caches that support distributed
Lucene indexes and Map/Reduce jobs, there are combinations of distributed
file systems with Lucene clusters on top, etc.

Next

On this page, we gave a very brief overview of related work.
We pointed out that caches are used to implement in-memory data management
systems, which are one class of Big Data solutions among others. The specific
strength of in-memory data management systems is very fast read and write
operations.

In the next part, we will introduce the example application.