AWS Certification: EMR Questions

Amazon EMR

Overview
Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. With EMR you can run petabyte-scale analysis at less than half the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. For short-running jobs, you can spin clusters up and down and pay per second for the instances used. For long-running workloads, you can create highly available clusters that automatically scale to meet demand. If you have existing on-premises deployments of open-source tools such as Apache Spark and Apache Hive, you can also run EMR clusters on AWS Outposts.

1. You have a set of IIS servers running on EC2 instances. You want to collect and process the log files generated by these IIS servers. Which of the following combinations of services is ideal for this scenario?

A. Amazon S3 for storing the log files and Amazon EMR for processing the log files.

B. Amazon S3 for storing the log files and EC2 Instances for processing the log files.

C. Amazon EC2 for storing and processing the log files.

D. Amazon DynamoDB to store the logs and EC2 for running custom log analysis scripts.

Answer

A. Amazon S3 for storing the log files and Amazon EMR for processing the log files.

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Options B and C are technically workable, but processing the log files on EC2 instances means building and operating the processing layer yourself, which is unnecessary overhead when a managed service already exists for this purpose. Option D is invalid because DynamoDB is not an ideal store for log files.

For more information on EMR, visit the following URL: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html
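To make "processing the log files" concrete, here is a minimal Python sketch of the map-and-reduce pattern that frameworks on EMR apply at scale. It is an illustration only, not an EMR API: the sample log lines, the field list, and the function names map_status and reduce_counts are assumptions made for this example. It assumes logs in the W3C extended format with a #Fields directive, and it runs locally on a handful of lines, whereas on EMR the same logic would run in parallel over log files stored in S3.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical sample in W3C extended log format (field list chosen for the example).
SAMPLE_LOG = """\
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Fields: date time cs-method cs-uri-stem sc-status time-taken
2024-01-15 10:00:01 GET /index.html 200 15
2024-01-15 10:00:02 GET /missing.html 404 3
2024-01-15 10:00:03 POST /api/orders 500 120
2024-01-15 10:00:04 GET /index.html 200 12
"""


def map_status(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (status_code, 1) for every request line."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive names the columns of the lines that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives carry no request data
        record = dict(zip(fields, line.split()))
        if "sc-status" in record:
            yield record["sc-status"], 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce phase: sum the emitted values per key."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


status_counts = reduce_counts(map_status(SAMPLE_LOG.splitlines()))
print(dict(status_counts))
```

Keeping the map and reduce steps as separate functions mirrors how a distributed framework splits the work: the map step can run independently on each file or block, and only the small per-key totals need to be combined afterwards.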

2. You need to start using resources in AWS to build a big data processing system. Which one of the following services would you ideally use for this requirement?

A. Amazon DynamoDB

B. Amazon EMR

C. Amazon ECS

D. Amazon ECR

Answer

B. Amazon EMR

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks, such as Apache Spark, HBase, Presto, and Flink, in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

For more information on the EMR service, visit the following URL: https://aws.amazon.com/emr/?nc2=h_m1
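As a rough illustration of running a framework on EMR, the Python sketch below builds the request parameters for a short-lived cluster that runs one Spark step and then terminates. The parameter names follow the boto3 EMR client's run_job_flow call. Everything else is an assumption made for the example: the bucket name, script path, cluster name, release label, instance types and counts, and the --input argument passed to the hypothetical script. The role names are the EMR defaults and exist only if they have been created in the account. The block only builds the request, so it runs without AWS credentials.

```python
from typing import Any, Dict


def build_cluster_request(log_bucket: str, script_uri: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": "log-analysis-transient",           # hypothetical cluster name
        "LogUri": f"s3://{log_bucket}/emr-logs/",   # where EMR writes its own logs
        "ReleaseLabel": "emr-6.15.0",               # assumed release; pick a current one
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False makes the cluster terminate once its steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "analyze-logs",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit", "--deploy-mode", "cluster",
                        script_uri,
                        # Hypothetical argument understood by the example script.
                        "--input", f"s3://{log_bucket}/iis-logs/",
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default instance profile, if created
        "ServiceRole": "EMR_DefaultRole",      # default service role, if created
    }


request = build_cluster_request(
    "example-log-bucket",
    "s3://example-log-bucket/scripts/analyze_logs.py",
)

# To submit the request (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Setting KeepJobFlowAliveWhenNoSteps to False is what makes this a transient cluster, so EC2 instances are billed only while the step is running.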