Data Analysis on SparkFebruary 18, 2020 No Comments
Featured article by Angelica Germain , Independent technology Author
Computing systems in industry and research organizations are becoming more complex every day with the accelerated growth of big data. The growth is fueled by an unprecedented generation of enormous amounts of new data from application domains and instruments including e-commerce and social networks. This presents an opportunity for computing firms to develop big data processing frameworks as the amount of data and the speed of data production outpace the capabilities of single machines. New programming models capable of processing specialized data and new workloads were developed. Google developed MapReduce and Dremel. MapReduce is a batch-processing framework. The other models are Storm, Flink, Dryad, Caffe, and Tensor flow. Storm and Flink are both streaming processing frameworks whilst Dryad is used for graph processing. All these models nonetheless, are limited to work with for a certain data computation. In an attempt to resolve this limitation, the University of California, in 2009, started the development of Spark, a unified programming model for distributed data processing, proposed by Matei Zaharia.
Spark is a unified computing engine and a set of libraries for distributed data processing on large computer clusters. It is a general computing framework with a programming model similar to MapReduce and is capable of processing large-scale datasets fast. Spark uses a Resilient Distributed Datasets (RDDs) extension that can support a wide range of processing workloads including SQL, streaming, machine learning, and graph processing. Before the introduction of Spark, such computations required separate engines. Spark is supported by most platforms including Python, Java, Scala and R and can run on a laptop as well as a cluster of several servers. As a result, Spark is widely used both in industry and in academia.
Functional significance of Spark
The generality of Spark gives numerous benefits. Spark makes it easier to develop applications because of the unified API. Spark combines processing tasks, run diverse functions in memory, and do not require data storage in another engine. Besides, another functional significance of Spark is that it can compute new applications such as streaming and machine learning. Such interactive queries could not be handled by previous systems.
Components of Spark
Spark core is the building block of the program upon which all functions are built and executed. Spark core provides In-memory computing as well as a reference dataset. Spark SQL contains Schema RDD that supports processing of specialized data. Spark streaming aids Spark cores’ scheduling capability and performs streaming analytics. MLlib is machine-learning framework on top of Spark core. GraphX is a distributed graph-processing framework above the Spark core that executes graph computations.
Applications of Spark
Spark usage is widespread. Its use is dominant in financial services, telecommunications, retail, energy and life sciences. Common applications of Spark include streaming data, artificial intelligence, machine learning, business intelligence and data integration.
Spark Program model
Resilient Distributed Datasets (RDD) is s a read-only, partitioned collection of records that can be operated in parallel. There are only two ways to create RDDs. That is, the use of data in storage such as HDFS or other RDDS by coarse-grained transformations including map, filter and join. Spark uses RDDs to process MapReduce operations faster and efficiently. A transformation can only define a new RDD and in this regard, it is a lazy operation. Computation of RDD can also be launched by using action operations such as count, collect, save and reduce. Action operations can recycle a value to an application program. The same can export the RDD’s data to an external storage system.
Pros of the RDD model
The advantages of RDDs is compared to distributed shared models (DSM). The major difference between the two is that RDD is written through coarse-grained transformations while DSM permit reads and writes in each location. As a result, RDDs allows efficient fault tolerance but cannot perform bulk writes
Another benefit of RDDs allows mitigation of low nodes by duplicating back-up copies of slow tasks. With DSM, implementation of back up is difficult due to likely access memory conflict by two copies of a task in the same location. The tasks can also interfere with their updates. Besides, in congested operations RDDs can separate tasks to improve performance. Moreover, RDDs degenerate if memory is insufficient or can be stored on disk.
For each operation, a signature is given that shows type parameters in square brackets. Certain operations, for example join, are only available on RDDs of key value pairs. All functions match APIs in Scala and other programming languages.
For instance, map is a one-to-one mapping, and the flatMap function maps input to one or more outputs. Users can also request RDD to be set aside as buffer storage or get a partition order.
Spark has a master-slave architecture, which consist of two daemons. They are the master daemon and the worker daemon. A single master daemon supports several worker daemons. Spark also has a cluster manager. The cluster manager administers spark cluster.
The master daemon, sometimes referred to as the driver, is responsible for task scheduling in Spark. Master daemon is also the entry point of Spark shell and creates Spark context. The master daemon contains like DAG Scheduler, Backend Scheduler, Task Scheduler, and Block Manager. By using these components, the driver translates spark user code into actual spark jobs executed on the cluster manager. The user code is converted into tasks that are processed by the worker daemon. Separately the schedulers have distinct functions. The DAG Scheduler split the DAG into separate stages of tasks. It schedules tasks and controls their execution. The Task Scheduler submits each stage of tasks to cluster manager.
The primary function of a worker or slave daemon is the execution of tasks. The number of executors are configurable according to user requirements and executor instances.
The executor performs the given task. It reads and writes data to external sources. The computation output is stored in-memory, cache or on hard disk drives.
The cluster manager manages Spark jobs and acquires the resources all for a job execution. Cluster manager allocates memory and other resources required to execute a task. A Spark application can run on Hadoop, YARN or Apache Mesos cluster managers.
Spark streaming is the processing of live streams of big data such as log files and status updates. Sources of data ingestion into Spark streaming are Twitter, HDFS or TCP sockets. The output is written onto file-systems, databases and live dashboards
Spark streaming creates a series of data streams or batches at regular time intervals of a few seconds. The time intervals also referred to as batch interval, ranges between 500ms and several seconds Each data stream is treated as RDDs and is therefore processed using RDD operations. The results are released in batches.
Discretized Stream (DStream)
DStream is a basic abstraction provided by Spark as a continuous stream of data. DStream is different from traditional streams. It produces a set of short, stateless, deterministic tasks. A DStream is stored as fault-tolerant data structures (RDDs). The RDDs can be recomputed by performing a series of deterministic operations. DStream provides fault-tolerance and unification by batch processing
Machine learning (ML) in Spark
Machine learning is a system of data analysis used to create personalization, recommendations and predictive insights using analytical model building. Machine learning produces a wide range of data products and services using algorithms and iterative computation.
ML is useful to solve real-life problems in marketing, human resources, risk management, health care, travel or education.
Machine Learning Architecture
Users generates data on browsers, mobile applications or external web APIs and initiates the process. Data storage systems include HDFS, Amazon S3, and other file systems.
Data transformation process follows five distinct steps. The first is to remove records with missing values. This is followed by filling bad or missing data and application of robust techniques to eliminate outliers. Finally, apply transformations to other potential outliers and secure useful features.
The next step of ML process is to set input parameters for model training and testing and generate predictions. The predictions are evaluated by checking accuracy and prediction error. Feature standardization is used to improve model performance.
Machine learning system
There are two broad machine learning systems namely supervised and unsupervised machine learning models.
Supervised models create a mode that makes sound predictions. In supervised models, correct classes of the training data are known and the performance of the model can be validated. Supervised models involves classification and regression. Classification refers to a process of assigning a class to an observation. Regression predict continuous performance for an observation, for instance fuel prices.
Unsupervised models build a model that executes data analysis to identify obscured patterns or grouped data. In unsupervised models simulation of data clusters following a certain measure of conformity.
MLlib is the largest scalable machine-learning library in Spark. MLlib is composed of common learning algorithms and utilities. This includes classification, regression, clustering, collaborative filtering, dimensionality, reduction, optimization primitives and higher-level pipeline APIs.
Features of MLlib
MLlib has the following core features:
- It executes classic machine learning algorithms,
- MLlib performs numerous optimizations to support distributed learning and prediction,
- MLlib consist of spark.ml which enhances practical learning pipelines and provide high-level APIs, and
- MLlib is well integrated with Spark SQL, GraphX, Spark streaming and Spark core to achieve high performance for MLlib.
MLlib also has numerous advantages. Tang et al (2018) asserts that MLlib is simple, scalable and is compatible with other modules in Spark. MLlib is used extensively in marketing, advertisements and fraud detection.
The growth of big data is a tremendous challenge to traditional data analysis models. At the same time presents an opportunity to develop new data analysis models. The amount of data associated with current information systems and the speed of data production is beyond the capabilities of most traditional models including MapReduce, Dremel, Storm, Flink, Dryad, Caffe, and Tensorflow. Besides, there are new interactive jobs and real-time queries that are difficult to execute by traditional data analysis frameworks. Spark programing model was created to resolve this challenge.
Spark is a programming model capable of processing specialized data and new workloads at high speeds. Spark uses a Resilient Distributed Datasets (RDDs) extension that can support a wide range of processing workloads including SQL, streaming, machine learning, and graph processing. The most common applications of Spark include streaming data, artificial intelligence, machine learning, business intelligence and data integration. Besides, RDDs allows mitigation of low nodes by duplicating back-up copies of slow tasks and allows efficient fault tolerance but cannot perform bulk writes.