Simple Big Data Streaming Solution

Problem

How can we process large data sets, both structured and unstructured, in the most efficient way? Is this a big data challenge that requires case-by-case solutions?

Resolution

Big data challenges are not new. Many people have been working on them for a while; the problem has simply become fashionable recently. A few challenges that we need to resolve are:

  1. Quickly get data out of the data store (database) without stressing the system, either CPU or memory
  2. Optimize data processing in the application. It is preferable to process small chunks rather than working on the whole data set
  3. Use Component Architecture (a.k.a. Microservice Architecture) for efficient processing through reuse of components as services

How to stream data from a database?

The basic problem in operating on a large data set is that the size of the data is likely to saturate memory in the database, the application, or both. Therefore, instead of holding all the data in memory, the data set needs to be cut into smaller logical chunks, and those smaller sets fed to the processing pipeline.

Most databases provide a result set processing API that gives the application a pointer into the result set; such a pointer is called a CURSOR. A cursor lets a caller walk through rows without bringing all the data into memory, and this is what enables a client to stream the data.

A simple streaming strategy is to get a cursor and then iterate through the data. As you iterate, you can publish one row at a time: to a messaging bus, to a callback object, or simply to a file.
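
A minimal sketch of this strategy, assuming a hypothetical RowPublisher sink (a message bus, a callback object, or a file writer) and a generic Iterator standing in for a database cursor; the names are illustrative, not a specific library API:

    import java.util.Iterator;

    public class CursorStreamer<R> {

        /** Hypothetical sink: a message bus publisher, a callback object, or a file writer. */
        public interface RowPublisher<T> {
            void publish(T row);
        }

        /** Iterate the cursor and publish one row at a time, never holding the full result set. */
        public void stream(Iterator<R> cursor, RowPublisher<R> publisher) {
            while (cursor.hasNext()) {
                publisher.publish(cursor.next());
            }
        }
    }

The point of the sketch is the shape of the loop: the only row held in memory is the one currently being published, so the consumer decides where it goes next.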

Databases may provide many different APIs, but you need to find and use the one appropriate for streaming data.

Let’s take the example of a JDBC client.
When using JDBC, we need to call Statement#setFetchSize(int) to give the database/JDBC driver a hint about the number of result set rows to fetch in a single batch. As this is only a hint, it may not be enough to prevent the ResultSet object from collecting all rows in memory, or the driver from waiting until all results have been collected by the database.
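
A hedged sketch of the setFetchSize hint in plain JDBC; the table name, columns, and fetch size are placeholders for illustration, only the JDBC calls themselves are standard API:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class FetchSizeHint {
        static void streamRows(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement(
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                // Hint to the driver: fetch roughly this many rows per round trip,
                // instead of materializing the whole result set.
                stmt.setFetchSize(100);
                try (ResultSet rs = stmt.executeQuery("SELECT id, payload FROM events")) {
                    while (rs.next()) {
                        // Publish one row at a time: message bus, callback, or file.
                        System.out.println(rs.getLong("id"));
                    }
                }
            }
        }
    }

Whether the driver honors the hint, and under what conditions, is driver-specific; the PostgreSQL case below is one concrete example.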

PostgreSQL

For PostgreSQL, you must also call setAutoCommit(false) on the Connection object, because PostgreSQL only enables cursors inside a transaction.
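
A minimal sketch of the PostgreSQL recipe, assuming a placeholder connection URL, credentials, and query; note that autocommit is turned off before the statement is created:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class PostgresStreamingExample {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret")) {
                // Required: PostgreSQL only uses a server-side cursor inside a transaction.
                conn.setAutoCommit(false);
                try (Statement stmt = conn.createStatement(
                        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                    stmt.setFetchSize(100); // rows fetched per round trip via the cursor
                    try (ResultSet rs = stmt.executeQuery("SELECT id, payload FROM events")) {
                        while (rs.next()) {
                            // Process or publish each row as it arrives.
                            System.out.println(rs.getLong("id"));
                        }
                    }
                }
                conn.commit(); // end the transaction once the stream is consumed
            }
        }
    }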

If you have a “LIMIT” clause in your query, read on:

  1. http://stackoverflow.com/questions/13407555/postgresql-query-taking-too-long