# Background

Computation requirements across different industries have exploded, moving from 10’s of gigabytes per month to a terabyte a week. It is forcing companies to expand their clusters and increase processing bandwidth. Cloud has assuaged this to a great extent, it provides 100’s of powerful machines on demand with the ability to scale the cluster up and down as the data flow fluctuates.

This power of on-demand computation now allows developers to control how much scale is useful for their algorithms. Does some processing which took 10 hrs on 1 Machine earlier go to 1 hr with 10 machines or 0.5 hrs with 20 machines?

The shift from fixed-cost to on-demand pricing makes it really important that the resources are being used as efficiently as possible. As the data size begins to grow, the computing cost grows linearly, or often exponentially, depending on how the computation algorithms have been designed by the developers. Typical multi-pass algorithms are of O(n^2) complexity, compared to a single-pass algorithms of complexity O(n). Single-pass algorithms are generally cheaper, faster, and are also able to work on a streaming data source. The trade-off until now has been that single-pass algorithms require higher memory for efficient performance.

## Map Reduce

While Map part of the algorithm is truly scalable, the reduce phase is much slower as it requires the data to be sorted. Hence, it is usually done on order of smaller machines than the map phase. Therefore, truly scalable algorithms do most (if not all) of their computations in the Map phase & very small reduce phases.

# Statistical Algorithms for quicker analytics

In many analytical use cases, results with an accuracy of about 95% is “good enough” for making business decisions. Statistical algorithms allow us to achieve this. Statistical algorithms avoid storing the original data, and replace them by hash representations of the data. This eliminates the need to use the entire network cluster.

Some of the most common analytical queries that we have seen in real-life applications are:

- How many distinct elements are in the data set? (Hyperloglog)
- What are the most frequent elements (“heavy hitters”/“top-k elements”)? (Count min sketch)
- What are the frequencies of the most frequent elements? (Count min sketch)
- Does the dataset contain a particular element (search query)? (Bloom filters)

## Methods for statistical analysis

### Bloom Filters

A Bloom filter is a data structure designed to determine whether an element is present in a set in a rapid and memory-efficient way. Bloom filters are highly distributable, and therefore are greatly suitable in cases when we need to quickly filter items which are present in a set.

In this example, we have the data set (represented by {x, y, z}). First, the dataset is converted into its hash representation, shown by the numbers in cells. The incoming element (w) is first converted to its hash representation. It is then matched against the hash representation of the entire dataset. For some incoming data points, it might be that the hash of the incoming element matches the hash of a different element in the master dataset. This means that bloom filters’ positive matches are “maybe yes”, and negative matches are “surely no”.

This is useful in scenarios where we need to quickly filter against a set to narrow down the suspects – such as, analysing offensive activities daily my matching data against a set containing all malicious userIDs.

### Hyperloglog

Hyperloglog is an approximate technique for computing the number of distinct entries in a set (cardinality). It does this while using a small amount of memory. For instance, to achieve 99% accuracy, it needs only 16 KB. In cases where we need to count distinct elements in a dataset spread across a hadoop cluster, we could compute the hashes on different machines, build the bit index and combine the bit index to compute the overall distinct elements. This eliminates the need of moving the data across network, thus saving a lot of time.

Let’s say we have N values in a set, and we hash it using a uniform hashing function, mapping it to [0, 1]. The expected width, w, between two adjacent hash values can be represented by 1/N. Of course, in real situations, it would not be exact, and to minimise deviations and error, we take E(kw) = k/N ; where k is a certain number of adjacent hash values in the set. ‘w’ is a good estimator for keeping track of the total number of items, N. The trade-off here is of speed vs. accuracy.

### Count-min sketch

Count–min sketch is a probabilistic sub-linear space streaming algorithm which can be used to summarize a data stream to obtain frequency of elements. It allocates a fixed amount of space to store count information, which does not vary over time even as more and more counts are updated. It is useful for estimating the total number of ‘counts’ or ‘frequency’ of an element in a set, and the accuracy of the estimate scales with the total sum of all the counts stored.

We divide the data set into K different buckets, each with a different hash. The incoming keys are matched to all of the different buckets, and what we get is an upper bound (max) for the frequency of the element being matched. Since we get K different upper bounds because there are K buckets, we choose the minimum of these (min) for getting the best estimate.

# Faster execution engine – Spark

Spark is a faster execution engine which can provide upto 10x performance over to MapReduce. Combining Spark with the use of these statistical algorithms gives us a huge leap both in terms of costs and performance. Spark gets most of its speed by constructing Directed Acyclic Graph (DAG) out of the job operations, and uses memory to save intermediate data, thus making the reads extremely efficient operations. When we use statistical algorithms on top of Spark, the intermediate hashes of the datasets are saved in-memory, giving a tremendous boost to the overall performance of these algorithms.

References: Images for the algorithms are taken from Avi Byrant’s excellent talk available at: https://www.youtube.com/watch?v=yzitqjUI6ok

Authors: Mayur Rustagi and Praveen R, from Sigmoid Analytics. T

## One thought on “Reducing cost in Big Data using statistics & in-memory technology”