Introduction to Apache Spark


Hello friends! In this blog, we will study why we need Apache Spark and the benefits of using it. So, let's get started...

Basic Info about Spark...

The main motivation for creating Spark was that MapReduce was not well suited for iterative and interactive applications.
Now another question arises: what actually is Spark?

Spark is a cluster computing framework. It is fast, supports in-memory computation (the biggest advantage of using Spark), and is designed to cover a wide range of workloads such as batch applications, interactive queries, streaming, machine learning, etc.

In-memory computation means Spark loads data into RAM, processes it there, and returns the result. The ability to process data in RAM makes Spark faster because it avoids costly disk I/O operations.
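
As a minimal sketch of this idea, the snippet below caches a dataset in memory so that repeated actions do not re-read it from disk (it assumes a local PySpark installation; the file name events.txt is just a placeholder):

from pyspark.sql import SparkSession

# Start a local Spark session (assumes PySpark is installed)
spark = SparkSession.builder.appName("in-memory-demo").master("local[*]").getOrCreate()

# Placeholder input file; replace with your own data
df = spark.read.text("events.txt")

# cache() asks Spark to keep the data in RAM after the first action computes it
df.cache()

print(df.count())   # first action: reads from disk, then caches in memory
print(df.count())   # second action: served from the in-memory cache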

Also, Spark handles many different types of workloads; for example, we can process batches, structured data, machine learning, streaming data, graphs, etc.
In Hadoop, by contrast, we can do all of the above-mentioned tasks, but only with the help of third-party applications; for example, to process structured data we need to install Hive on top of Hadoop, and to perform machine learning we need to install Mahout. Spark does all of these tasks through its own components, as the imports below illustrate.
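
These components ship with Spark itself, so a single PySpark installation is enough (a small sketch; no separate Hive or Mahout installation is needed):

# Structured data processing (Spark SQL)
from pyspark.sql import SparkSession

# Stream processing (Spark Streaming)
from pyspark.streaming import StreamingContext

# Machine learning (MLlib)
from pyspark.ml.classification import LogisticRegression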

Cluster Managers

Spark can be installed on top of several cluster managers (see the sketch after this list):
  • Hadoop YARN
  • Apache Mesos
  • Standalone Spark Cluster
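
The cluster manager is usually selected through the master URL, either in code or with the --master option of spark-submit. A sketch (the host names are placeholders):

from pyspark.sql import SparkSession

# Standalone Spark cluster (host name is a placeholder)
spark = SparkSession.builder.master("spark://master-host:7077").appName("demo").getOrCreate()

# Other possible master URLs:
#   "yarn"                     -- Hadoop YARN (cluster location comes from the Hadoop configuration)
#   "mesos://mesos-host:5050"  -- Apache Mesos
#   "local[*]"                 -- local mode, using all cores of a single machine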

Storage Types Supported by Spark

Spark can process data stored in the following storage types (a few read examples follow the list):

  • Hadoop Distributed File System (HDFS)
  • Text files
  • Parquet Format (Column-based storage format)
  • Avro Format (Row-based storage format)
  • Amazon S3
  • Cassandra
  • Local File System
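
A few read examples in PySpark (a sketch reusing the spark session from above; the paths and bucket name are placeholders, and S3, Avro, and Cassandra additionally need the matching connector packages and credentials configured):

df_text    = spark.read.text("hdfs:///data/events.txt")           # HDFS / text files
df_parquet = spark.read.parquet("/data/events.parquet")           # Parquet (column-based)
df_avro    = spark.read.format("avro").load("/data/events.avro")  # Avro (row-based; needs the spark-avro package)
df_s3      = spark.read.csv("s3a://my-bucket/events.csv")         # Amazon S3 (via the s3a connector)
df_local   = spark.read.json("file:///tmp/events.json")           # local file system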

Basic Concepts of Spark

  • RDD (Resilient Distributed Dataset): In short, an RDD is a dataset that is distributed across the cluster and fault-tolerant. The important thing about RDDs is that they are immutable.
  • Transformations: lazy operations (such as map and filter) that produce a new RDD from an existing one.
  • Actions: operations (such as count and collect) that trigger the actual computation and return a result to the driver. A short example follows this list.
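
A minimal sketch tying these three concepts together (it reuses the spark session created earlier; sc is its underlying SparkContext):

sc = spark.sparkContext

# An RDD: a distributed, immutable collection
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: nothing is computed yet
squares = numbers.map(lambda x: x * x)
evens   = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the computation and return results to the driver
print(evens.collect())   # [4, 16]
print(squares.count())   # 5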

Spark Feature Stack

  • Spark Core: the base engine for distributed data processing in Spark. It is responsible for memory management, fault recovery, task scheduling, and job monitoring on the cluster.
  • Spark SQL: provides relational processing for structured data and supports many data sources such as Hive tables, Parquet, etc. A small example follows.
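
A short Spark SQL sketch (it reuses the spark session from above; the table and column names are made up):

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

# The same query with the DataFrame API
df.filter(df.age > 30).select("name").show()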

Spark Architecture


Spark's architecture is based on the master-slave model.
  • The Spark driver program is the program where the SparkContext object is created. It is the job of the SparkContext to connect to the cluster manager.
  • The cluster manager allocates resources on the cluster nodes.
  • Executors are processes that run computations and store data for your application. The SparkContext sends tasks to run on the executors. A minimal driver program is sketched below.
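
A minimal, self-contained driver program showing where the SparkContext fits (the app name and master URL are placeholders):

from pyspark import SparkConf, SparkContext

# The driver program creates the SparkContext, which connects to the cluster manager
conf = SparkConf().setAppName("architecture-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Tasks from this job are shipped to the executors, which run them and hold the data
data = sc.parallelize(range(1000))
print(data.sum())   # 499500

sc.stop()
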
So, friends, I hope you liked this blog. Please write your comments and share it with your friends. Happy learning!
