AWS EMR free usage - amazon-web-services

I am very new to cloud-based services. I want to try Impala queries on AWS EMR and EC2. Is that possible? Can I create a free account for EC2/EMR, and if so, how?

Impala is not available as a standard option in Amazon EMR.
You would probably need to launch your own Hadoop cluster on Amazon EC2 instances.
However, the AWS Free Usage Tier only provides micro-sized EC2 instances, which are not appropriate for a Hadoop cluster.
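If you do go the self-managed route, below is a minimal sketch (not a full Hadoop or Impala install) of launching suitably sized EC2 instances with boto3; the region, AMI ID, key pair, and instance type are placeholders you would replace with your own.

```python
import boto3

# Sketch only: launches EC2 instances that could host a self-managed
# Hadoop/Impala cluster. It does not install Hadoop or Impala; it just starts
# suitably sized instances, since the free-tier micro instances are too small
# for this workload. AMI ID and key pair are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m5.xlarge",          # free tier only covers micro instances
    KeyName="my-key-pair",             # placeholder key pair
    MinCount=3,
    MaxCount=3,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "hadoop-impala-node"}],
    }],
)
print([i["InstanceId"] for i in response["Instances"]])
```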

Related

Does an AWS Elasticsearch snapshot contain data?

I was reading up on the AWS documentation for Elasticsearch, and in the latest versions it takes a snapshot of the AWS ES cluster every hour and stores it in S3.
This can prove super useful for recovery.
But I could not find whether this snapshot contains just the cluster information or the data as well.
Can someone confirm whether it stores the data as well, or just the cluster information?
Thanks!
From the AWS documentation:
On Amazon Elasticsearch Service, snapshots come in two forms: automated and manual.
Automated snapshots are only for cluster recovery. You can use them to restore your domain in the event of red cluster status or other data loss. Amazon ES stores automated snapshots in a preconfigured Amazon S3 bucket at no additional charge.
Manual snapshots are for cluster recovery or moving data from one cluster to another. As the name suggests, you have to initiate manual snapshots. These snapshots are stored in your own Amazon S3 bucket, and standard S3 charges apply. If you have a snapshot from a self-managed Elasticsearch cluster, you can even use that snapshot to migrate to an Amazon ES domain.
So snapshots do contain the data: both types support cluster recovery, and a manual snapshot can additionally be used to move data from one cluster to another. Networking and configuration of the cluster within the Elasticsearch service itself are managed entirely via the AWS API, so those should be handled with infrastructure as code (such as CloudFormation or Terraform).
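For completeness, here is a hedged sketch of registering a manual-snapshot repository and taking a manual snapshot against an Amazon ES domain, using SigV4-signed requests. The domain endpoint, bucket name, and IAM role ARN are placeholders, and the snapshot role is assumed to already be set up with S3 access as described in the AWS docs.

```python
import boto3
import requests
from requests_aws4auth import AWS4Auth

# Placeholders: domain endpoint, bucket, and role ARN are examples only.
region = "us-east-1"
host = "https://search-my-domain.us-east-1.es.amazonaws.com"

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, "es", session_token=credentials.token)

# Register an S3 repository for manual snapshots (data plus cluster metadata).
payload = {
    "type": "s3",
    "settings": {
        "bucket": "my-snapshot-bucket",
        "region": region,
        "role_arn": "arn:aws:iam::123456789012:role/EsSnapshotRole",
    },
}
r = requests.put(f"{host}/_snapshot/manual-snapshots", auth=awsauth, json=payload)
print(r.status_code, r.text)

# Take a manual snapshot; it includes the indexed data, not just cluster settings.
r = requests.put(f"{host}/_snapshot/manual-snapshots/snapshot-1", auth=awsauth)
print(r.status_code, r.text)
```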

AWS and GCP centrally managed airflows and Dataflow equivalent for AWS

I have two questions to ask:
1. My company has two instances of Airflow running, one on a GCP-provisioned cluster and another on an AWS-provisioned cluster. Since GCP has Composer, which helps you manage Airflow, is there a way to have the Airflow DAGs on the AWS cluster managed by GCP as well?
2. For batch ETL/streaming jobs (in Python), GCP has Dataflow (Apache Beam). What's the AWS equivalent of that?
Thanks!
No, you can't do that. For now you have to run Airflow on AWS yourself, provisioning and managing it on your own; the usual options are EC2, ECS + Fargate, or EKS.
Dataflow is roughly equivalent to Amazon Elastic MapReduce (EMR) or AWS Batch. Moreover, if you want to keep running your existing Apache Beam jobs, you can run Beam on EMR (for example with the Spark or Flink runner) and everything should work much the same.
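For illustration, a minimal Beam word-count sketch in Python is below. The S3 paths are placeholders, reading from S3 assumes the apache-beam[aws] extra is installed, and the same pipeline code can target Dataflow or a Spark/Flink cluster on EMR just by changing the runner option.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Word-count sketch. Switching the runner is what moves the same code between
# Dataflow and a Spark/Flink cluster running on EMR; paths are placeholders.
options = PipelineOptions(runner="SparkRunner")  # or "FlinkRunner" / "DataflowRunner"

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("s3://example-bucket/input/*.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Sum" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("s3://example-bucket/output/counts")
    )
```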

Can we use Pre configured AMI to create AWS EMR cluster?

I am not good at writing a shell script / bootstrap action for EMR. Can I use a preconfigured AMI snapshot to create the cluster?
You can use a preconfigured AMI for EMR. However, there are some restrictions in that you must start with a supported EMR AMI. I have done this many times to create encrypted root volumes for EMR (copying the AMI and enabling encryption).
Amazon EMR now supports launching clusters with custom Amazon Linux AMIs
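As a hedged sketch, the only EMR-specific part is passing CustomAmiId to run_job_flow (supported from release emr-5.7.0 onwards). The AMI ID, key pair, and IAM role names below are placeholders, and the default EMR roles are assumed to already exist in the account.

```python
import boto3

# Launch an EMR cluster from a custom Amazon Linux AMI (e.g. a copy with an
# encrypted root volume). AMI ID, key pair, and roles are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="custom-ami-cluster",
    ReleaseLabel="emr-5.30.0",
    CustomAmiId="ami-0123456789abcdef0",        # your copied/encrypted AMI
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",            # placeholder key pair
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```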

aws emr pricing programmatically

I have a requirement in which I am launching an EMR cluster programmatically through the Java API. I want to predict how much AWS will charge before launching the cluster. I went through the new AWS Pricing API, but it did not seem to provide any information about the charges applied to an EMR cluster. I also checked the AWS price calculator, but it is entirely JavaScript-based. Could you please suggest any way to programmatically find the EMR charges, for example through an AWS pricing API?
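One possible approach, sketched below with boto3: query the AWS Price List API for an ElasticMapReduce offer. The service code and filter fields are assumptions based on the public Price List data, and the figures returned are the EMR surcharge per instance type; the underlying EC2 instances are priced separately under the AmazonEC2 service code.

```python
import boto3
import json

# Sketch only: "ElasticMapReduce" service code and filter fields are assumed.
# The pricing endpoint is only available in a few regions (e.g. us-east-1).
pricing = boto3.client("pricing", region_name="us-east-1")

resp = pricing.get_products(
    ServiceCode="ElasticMapReduce",
    Filters=[
        {"Type": "TERM_MATCH", "Field": "instanceType", "Value": "m5.xlarge"},
        {"Type": "TERM_MATCH", "Field": "location", "Value": "US East (N. Virginia)"},
    ],
    MaxResults=10,
)

for item in resp["PriceList"]:          # each entry is a JSON string
    product = json.loads(item)
    print(json.dumps(product["terms"]["OnDemand"], indent=2))
```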

What should be a suitable configuration to set up a 2-3 node Hadoop cluster on AWS?

What should be a suitable configuration to set up a 2-3 node Hadoop cluster on AWS?
I want to set up Hive, HBase, Solr, and Tomcat on the Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR, or with EC2 and manually setting up the cluster on it.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (eg Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to deploy your own Hadoop cluster under Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR
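As a rough sketch, a small POC cluster can be launched via the EMR API like this. The default EMR IAM roles are assumed to exist, the key pair is a placeholder, and Solr and Tomcat are not EMR-managed applications, so they would need to be installed separately (for example via a bootstrap action).

```python
import boto3

# Small POC cluster: 1 master + 2 core nodes with Hive and HBase. Solr/Tomcat
# are not EMR applications and are not installed here. Names are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="poc-hadoop-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "HBase"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                     # 1 master + 2 core nodes
        "Ec2KeyName": "my-key-pair",            # placeholder key pair
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])
```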