What is AWS Data Pipeline?
AWS Data Pipeline is a web service, designed to make it easier for users to integrate data spread across multiple AWS services and analyze it from a single location.
Using AWS Data Pipeline, data can be accessed from the source, processed, and then the results can be efficiently transferred to the respective AWS services.
The following are the main goals:
- Improve the content: Show customers the content they are most likely to want to see next, so the content itself can be continually enhanced.
- Manage applications efficiently: Keep track of all the activities in an application and store that data in an existing database rather than in a new one.
- Move faster: Grow the business faster while keeping costs low.
Achieving these goals can be difficult because large amounts of data are stored in different formats, which makes analyzing, storing, and processing the data complex. Different tools are used to store the different data formats. A practical solution to this situation is a data pipeline: AWS Data Pipeline integrates data that is spread across different data sources and processes it in one place.
What is a Data Pipeline?
AWS Data Pipeline is a web service that can access data from different services, analyze and process it in one place, and then store the results in AWS services such as DynamoDB and Amazon S3.
For example, using Data Pipeline you can archive your web server logs to an Amazon S3 bucket daily and then run an EMR cluster over those logs to generate reports weekly.
Concept of AWS Data Pipeline
The concept of AWS Data Pipeline is very simple. A Data Pipeline sits on top of input stores, which could be Amazon S3, DynamoDB, or Redshift. Data from these input stores is sent to the Data Pipeline, which analyzes and processes it, and the results are then sent to output stores such as Amazon Redshift, Amazon S3, or DynamoDB.
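As a rough sketch of how this flow is driven programmatically, the snippet below uses boto3 (the AWS SDK for Python) to register an empty pipeline; the pipeline name and unique ID are placeholder assumptions, and the definition that wires up the input and output stores is added separately (see the components section below).

```python
import boto3

# Assumes AWS credentials and a default region are already configured.
client = boto3.client("datapipeline")

# Register an empty pipeline; the name and uniqueId are placeholders.
# uniqueId guards against accidentally creating duplicate pipelines.
response = client.create_pipeline(
    name="demo-pipeline",
    uniqueId="demo-pipeline-001",
    description="Moves data from an input store to an output store",
)
pipeline_id = response["pipelineId"]
print("Created pipeline:", pipeline_id)
```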
Advantages of AWS Data Pipeline
- Easy to use
AWS Data Pipeline is very simple to create, as AWS provides a drag-and-drop console; you do not have to write business logic to create a data pipeline.
- Distributed
It is built on distributed, reliable infrastructure. If a fault occurs in an activity, the AWS Data Pipeline service retries the activity.
- Flexible
Data Pipeline supports features such as scheduling, dependency tracking, and error handling. It can perform actions such as running Amazon EMR jobs, executing SQL queries against databases, or executing custom applications running on EC2 instances.
- Inexpensive
AWS Data Pipeline is inexpensive to use and is billed at a low monthly rate.
- Scalable
By using Data Pipeline, you can dispatch work to one or many machines, serially as well as in parallel.
- Transparent
AWS Data Pipeline offers full control over the computational resources, such as EC2 instances or EMR clusters, that carry out your pipeline logic.
Components of AWS Data Pipeline
Following are the main components of the AWS Data Pipeline:
- Pipeline Definition
It specifies how your business logic should communicate with Data Pipeline. A pipeline definition contains the following kinds of information (a minimal definition sketch follows this components list):
  - Data Nodes
They specify the name, location, and format of the data sources, such as Amazon S3 or DynamoDB.
  - Activities
Activities are the actions that perform the work, such as running SQL queries against a database or transforming data from one data source to another.
  - Schedules
Schedules define when the activities run.
  - Preconditions
Preconditions must be satisfied before an activity is scheduled. For example, if you want to move data from Amazon S3, a precondition can check whether the data is available in Amazon S3. The activity runs only if the precondition is satisfied.
  - Resources
Resources are the compute resources, such as Amazon EC2 instances or an EMR cluster, that perform the work.
  - Actions
Actions update the status of your pipeline, for example by sending you an email or triggering an alarm.
- Pipeline
A pipeline consists of three important items:
  - Pipeline components
We have already discussed the pipeline components; they describe how your Data Pipeline communicates with the AWS services.
  - Instances
When all the pipeline components are compiled in a pipeline, Data Pipeline creates actionable instances, each of which contains the information for a specific task.
  - Attempts
Data Pipeline retries failed operations; these retries are called Attempts.
- Task Runner
Task Runner is an application that polls Data Pipeline for tasks and then performs those tasks.
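To make these components concrete, here is a hedged boto3 sketch of a pipeline definition containing a schedule, an S3 data node, an EC2 resource, and a shell-command activity. The object IDs, S3 paths, roles, and schedule values are illustrative assumptions rather than values taken from this walkthrough.

```python
import boto3

client = boto3.client("datapipeline")
pipeline_id = "df-EXAMPLE1234"  # placeholder: use the ID returned by create_pipeline

# Each object below is one pipeline component; fields reference other
# objects through "refValue" (e.g., the activity runs on the EC2 resource).
pipeline_objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "pipelineLogUri", "stringValue": "s3://logstoredata/logs/"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
    ]},
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [   # Schedule
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 day"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]},
    {"id": "OutputNode", "name": "OutputNode", "fields": [         # Data node
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://studata/output/"},
    ]},
    {"id": "Ec2Instance", "name": "Ec2Instance", "fields": [       # Resource
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "terminateAfter", "stringValue": "1 Hour"},
    ]},
    {"id": "ShellActivity", "name": "ShellActivity", "fields": [   # Activity
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "echo hello > ${OUTPUT1_STAGING_DIR}/hello.txt"},
        {"key": "stage", "stringValue": "true"},
        {"key": "runsOn", "refValue": "Ec2Instance"},
        {"key": "output", "refValue": "OutputNode"},
    ]},
]

client.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
client.activate_pipeline(pipelineId=pipeline_id)
```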
Architecture of Task Runner
Task Runner polls Data Pipeline for tasks and reports its progress as soon as a task is done. After the report, Data Pipeline checks whether the task succeeded. If the task succeeded, it ends; if not, the remaining retry attempts are checked. If retry attempts remain, the whole process runs again; otherwise, the task ends with a failure status.
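AWS ships Task Runner as a ready-made application, but the polling cycle described above can be sketched with the low-level boto3 calls it is built on. This is only an illustration: the worker-group name is an assumption, and the "perform the work" step is left as a comment.

```python
import socket
import boto3

client = boto3.client("datapipeline")

# Ask Data Pipeline for a task assigned to a hypothetical worker group.
response = client.poll_for_task(
    workerGroup="my-worker-group",   # assumption: must match the pipeline's workerGroup
    hostname=socket.gethostname(),
)

task = response.get("taskObject")
if task:
    task_id = task["taskId"]
    try:
        # ... perform the actual work described by task["objects"] here ...
        client.report_task_progress(taskId=task_id)                 # progress report
        client.set_task_status(taskId=task_id, taskStatus="FINISHED")
    except Exception as exc:
        # Reporting a failure lets Data Pipeline schedule another attempt
        # if any retry attempts remain.
        client.set_task_status(
            taskId=task_id,
            taskStatus="FAILED",
            errorId="ExampleError",
            errorMessage=str(exc),
        )
```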
Creating a Data Pipeline
- Sign in to the AWS Management Console.
- First, we will create the DynamoDB table and two S3 buckets.
- Now, we will create the DynamoDB table. Click on Create table.
- Fill in the details, such as the table name and primary key, to create a new table.
- The below screen shows that the table "student" has been created.
- Click on Items and then click on Create item.
- We add an item with three attributes, i.e., id, Name, and Gender.
- The below screen shows that data is inserted in a DynamoDB table.
- Now we create two S3 buckets. The first will store the data that we are exporting from DynamoDB, and the second will store the logs.
We have created two buckets, i.e., logstoredata and studata. The logstoredata bucket stores the logs, while the studata bucket stores the data that we are exporting from DynamoDB.
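For reference, the same table, item, and buckets can be created with boto3. The sketch below mirrors the console steps under a few assumptions: the item values are placeholders, and because S3 bucket names are globally unique, logstoredata and studata would almost certainly need different names in your account.

```python
import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

# Create the "student" table with "id" as the partition (primary) key.
dynamodb.create_table(
    TableName="student",
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
dynamodb.get_waiter("table_exists").wait(TableName="student")

# Insert an item with the id, Name, and Gender attributes (placeholder values).
dynamodb.put_item(
    TableName="student",
    Item={"id": {"S": "1"}, "Name": {"S": "John"}, "Gender": {"S": "Male"}},
)

# Create the two buckets. Outside us-east-1, create_bucket also needs a
# CreateBucketConfiguration with a LocationConstraint.
for bucket in ("logstoredata", "studata"):
    s3.create_bucket(Bucket=bucket)
```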
- Now we create the Data Pipeline. Move to the Data Pipeline service and then click on the Get started button.
- Fill in the details to create a pipeline, and then click on Edit in Architect if you want to change any component in the pipeline.
- The below screen appears on clicking Edit in Architect. We can see a warning, i.e., TerminateAfter is missing. To remove this warning, add the new TerminateAfter field in Resources. After adding the field, click on the Activate button.
- Initially, the WAITING_FOR_DEPENDENCIES status appears. On refreshing, the status becomes WAITING_FOR_RUNNER. As soon as the RUNNING state appears, you can check your S3 bucket; the data will be stored there.
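The same status progression can also be watched from code. The sketch below is a hedged illustration using boto3; the pipeline ID is a placeholder, and the "@status" runtime field name is based on the Data Pipeline object model.

```python
import boto3

client = boto3.client("datapipeline")
pipeline_id = "df-EXAMPLE1234"  # placeholder pipeline ID

# List the pipeline's runtime instances and print each one's status
# (e.g., WAITING_FOR_DEPENDENCIES, WAITING_FOR_RUNNER, RUNNING, FINISHED).
instances = client.query_objects(pipelineId=pipeline_id, sphere="INSTANCE")
if instances["ids"]:
    described = client.describe_objects(
        pipelineId=pipeline_id, objectIds=instances["ids"]
    )
    for obj in described["pipelineObjects"]:
        status = next(
            (f.get("stringValue") for f in obj["fields"] if f["key"] == "@status"),
            "UNKNOWN",
        )
        print(obj["name"], status)
```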
How to Set Up Data Pipeline?
Following are the steps to set up a data pipeline −
Step 1 − Create the Pipeline using the following steps.
Sign-in to AWS account.
Use this link to open the AWS Data Pipeline console − https://console.aws.amazon.com/datapipeline/
Select the region in the navigation bar.
Click the Create New Pipeline button.
Fill in the required details in the respective fields.
In the Source field, choose Build using a template and then select this template − Getting Started using ShellCommandActivity.
The Parameters section opens only when the template is selected. Leave the S3 input folder and Shell command to run with their default values. Click the folder icon next to S3 output folder, and select the buckets.
In Schedule, leave the values as default.
In Pipeline Configuration, leave the logging as enabled. Click the folder icon under S3 location for logs and select the buckets.
In Security/Access, leave IAM roles values as default.
Click the Activate button.
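There is no API call that applies a console template directly, but once a pipeline has a definition with parameters, the same "fill in the parameters and activate" step can be sketched with boto3. The parameter IDs below mirror the ShellCommandActivity getting-started template as I recall it and should be treated as assumptions, as should the pipeline ID and S3 path.

```python
import boto3

client = boto3.client("datapipeline")
pipeline_id = "df-EXAMPLE1234"  # placeholder pipeline ID

# Supply values for the definition's parameters at activation time.
client.activate_pipeline(
    pipelineId=pipeline_id,
    parameterValues=[
        {"id": "myS3OutputLoc", "stringValue": "s3://studata/output/"},   # assumed parameter ID
        {"id": "myShellCmd", "stringValue": "echo hello > ${OUTPUT1_STAGING_DIR}/out.txt"},
    ],
)
```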
How to Delete a Pipeline?
Deleting the pipeline will also delete all associated objects.
Step 1 − Select the pipeline from the pipelines list.
Step 2 − Click the Actions button and then choose Delete.
Step 3 − A confirmation prompt window opens. Click Delete.
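The same deletion can be done programmatically. The sketch below finds a pipeline by name and deletes it along with its associated objects; the pipeline name is a placeholder.

```python
import boto3

client = boto3.client("datapipeline")

# Look up the pipeline by name, then delete it (and its objects).
for p in client.list_pipelines()["pipelineIdList"]:
    if p["name"] == "demo-pipeline":   # placeholder name
        client.delete_pipeline(pipelineId=p["id"])
        print("Deleted pipeline:", p["id"])
```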
Features of AWS Data Pipeline
Simple and cost-efficient − Its drag-and-drop feature makes it easy to create a pipeline on the console. Its visual pipeline creator provides a library of pipeline templates. These templates make it easier to create pipelines for tasks like processing log files, archiving data to Amazon S3, etc.
Reliable − Its infrastructure is designed for fault-tolerant execution of activities. If failures occur in the activity logic or data sources, then AWS Data Pipeline automatically retries the activity. If the failure persists, it sends a failure notification. We can even configure these notification alerts for situations like successful runs, failures, delays in activities, etc.
Flexible − AWS Data Pipeline provides various features like scheduling, tracking, error handling, etc. It can be configured to take actions like run Amazon EMR jobs, execute SQL queries directly against databases, execute custom applications running on Amazon EC2, etc.
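For the failure notifications mentioned above, a pipeline definition can attach an SnsAlarm action to an activity's onFail event. The fragment below is a hedged sketch: the topic ARN, role, and object IDs are placeholders, and the object would be appended to the pipelineObjects list passed to put_pipeline_definition.

```python
# Hedged fragment: an SnsAlarm object plus an "onFail" reference on an activity.
# The topic ARN, role, and IDs are illustrative placeholders.
failure_alarm = {
    "id": "FailureAlarm",
    "name": "FailureAlarm",
    "fields": [
        {"key": "type", "stringValue": "SnsAlarm"},
        {"key": "topicArn", "stringValue": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "subject", "stringValue": "Pipeline activity failed"},
        {"key": "message", "stringValue": "Activity #{node.name} failed."},
    ],
}

# On the activity object, add this field so the alarm fires when the activity fails:
on_fail_field = {"key": "onFail", "refValue": "FailureAlarm"}
```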