How to use newer versions of HBase on Amazon Elastic MapReduce? - amazon-web-services

Amazon's Elastic MapReduce tool seems to only have support for HBase v0.92.x and v0.94.x.
The documentation for the EMR AMIs and HBase is seemingly out-of-date and there is no information about HBase on the newest release label emr-4.0.0.
Using this example from an AWS engineer, I was able to concoct a way to install another version of HBase on the nodes, but it was ultimately unsuccessful.
After much trial and error with the Java SDK to provision EMR with better versions, I ask:
Is it possible to configure EMR to use more recent versions of HBase (e.g., 0.98.x and newer)?

After several days of trial, error, and support tickets to AWS, I was able to get HBase 0.98 running on Amazon's Elastic MapReduce service. Here's how to do it using the Java SDK and some bash- and ruby-based bootstrap actions.
Credit for these bash and ruby scripts goes to Amazon support. They are in-development scripts and not officially supported. It's a hack.
Supported Version: HBase 0.98.2 for Hadoop 2
I also created mirrors in Google Drive for the supporting files in case Amazon pulls them down from S3.
Java SDK example
RunJobFlowRequest jobFlowRequest = new RunJobFlowRequest()
    .withSteps(new ArrayList<StepConfig>())
    .withName("HBase 0.98 Test")
    .withAmiVersion("3.1.0")
    .withInstances(instancesConfig) // a JobFlowInstancesConfig built elsewhere
    .withLogUri("s3://bucket/path/to/logs")
    .withVisibleToAllUsers(true)
    .withBootstrapActions(new BootstrapActionConfig()
        .withName("Download HBase")
        .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
            .withPath("s3://bucket/path/to/wget/ssh/script.sh"))
    )
    .withBootstrapActions(new BootstrapActionConfig()
        .withName("Install HBase")
        .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
            .withPath("s3://bucket/path/to/hbase/install/ruby/script"))
    )
    .withServiceRole("EMR_DefaultRole")
    .withJobFlowRole("EMR_EC2_DefaultRole");
"Download HBase" Bootstrap Action (bash script)
Original from S3
Mirror from Google Drive
"Install HBase" Bootstrap Action (ruby script)
Original from S3
Mirror from Google Drive
HBase Installation Tarball (used in "Download HBase" script)
Original from S3
Mirror from Google Drive
Make copies of these files
I highly recommend that you download these files and upload them to your own S3 bucket for use in your bootstrap actions and scripts. Adjust the paths where necessary.

Related

Data streaming from raspberry pi CSV file to BigQuery table

I have some CSV files generated by a Raspberry Pi that need to be pushed into BigQuery tables.
Currently, we have a Python script using bigquery.LoadJobConfig for batch uploads, and I run it manually. The goal is to have streaming data (or a load every 15 minutes) in a simple way.
I explored different solutions:
Using airflow to run the python script (high complexity and maintenance)
Dataflow (I am not familiar with it but if it does the job I will use it)
Scheduling pipeline to run the script through GitLab CI (cron syntax: */15 * * * * )
Could you please suggest the best way to push CSV files into BigQuery tables in real time or every 15 minutes?
Good news, you have many options! Perhaps the easiest would be to automate the Python script you already have, since it does what you need. Assuming you currently run it manually on a local machine, you could upload it to a lightweight VM on Google Cloud and then use cron on the VM to run it on a schedule; I have used this approach in the past and it worked well.
Another option would be to deploy your Python code to a Google Cloud Function, which lets GCP run the code without you having to worry about maintaining the backend resources.
Find out more about Cloud Functions here: https://cloud.google.com/functions
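As a rough sketch of that option: a background Cloud Function triggered by new objects in a Cloud Storage bucket could load each CSV into BigQuery with the same LoadJobConfig approach you already use. The dataset and table names below are placeholders.

from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Triggered by a new CSV object in the bucket; appends it to a BigQuery table."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # "my_project.my_dataset.my_table" is a placeholder destination table
    load_job = client.load_table_from_uri(
        uri, "my_project.my_dataset.my_table", job_config=job_config
    )
    load_job.result()  # wait for the load to finish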
A third option, depending on where your .csv files are being generated: you could use the BigQuery Data Transfer Service to handle the imports into BigQuery.
More on that here: https://cloud.google.com/bigquery/docs/dts-introduction
Good luck!
Adding to @Ben's answer, you can also use Cloud Composer to orchestrate this workflow. It is built on Apache Airflow, so you can use Airflow-native tools such as the Airflow web interface, command-line tools, and the Airflow scheduler, without worrying about infrastructure and maintenance.
You can implement a DAG to:
upload the CSV from the local machine to GCS, then
load it from GCS into BigQuery using GCSToBigQueryOperator, as sketched below.
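A rough sketch of such a DAG, assuming Airflow's Google provider is available in the Composer environment; the bucket, dataset, and table names are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="csv_to_bigquery",
    schedule_interval="*/15 * * * *",  # every 15 minutes
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-bucket",                      # placeholder bucket
        source_objects=["incoming/*.csv"],
        destination_project_dataset_table="my_project.my_dataset.my_table",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )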
More on Cloud Composer

Migrate Apache Cassandra to Amazon DynamoDB

I want to migrate the database from Apache Cassandra to Amazon DynamoDB.
I am following this user guide
https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/agents.cassandra.html
When I try to create a clone data centre for extraction it throws
If you read through that document, you'll find that the conversion tool supports very old versions of Cassandra: 3.11.2, 3.1.1, 3.0, 2.1.20.
There will be a lot of configuration items in your cassandra.yaml that are not compatible with the conversion tool, including replica_filtering_protection, since that property was not added until C* 3.0.22 / 3.11.8 (CASSANDRA-15907).
You'll need to engage AWS Support to figure out what migration options are available to you. Cheers!

ModuleNotFoundError: No module named 'aiohttp' in AWS Glue

I am using AWS Glue to create an ETL workflow, where I fetch data from an API and load it into RDS. In AWS Glue, I use a PySpark script. In the same script, I use Python's 'aiohttp' and 'asyncio' modules to call my API asynchronously. But in AWS Glue it throws a Module Not Found error for aiohttp only.
I have already tried different versions of the aiohttp module in the Glue job, but it still throws the same error. Can someone please help me with this?
Glue 2.0
AWS Glue version 2.0 lets you provide additional Python modules or different versions at the job level. You can use the --additional-python-modules job parameter with a list of comma-separated Python modules to add a new module or change the version of an existing module.
Also, within the --additional-python-modules option you can specify an Amazon S3 path to a Python wheel module.
This link to official documentation lists all modules already available. If you need a different version or need one to be installed, it can be specified in the parameter mentioned above.
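For instance, here is a sketch of creating a Glue 2.0 job with aiohttp pinned via this parameter, using boto3; the job name, IAM role, script location, and module version are placeholders.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="api-to-rds-etl",                                  # placeholder job name
    Role="MyGlueServiceRole",                               # placeholder IAM role
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/etl.py",  # placeholder script path
        "PythonVersion": "3",
    },
    DefaultArguments={
        # asyncio is part of the standard library, so only aiohttp needs installing
        "--additional-python-modules": "aiohttp==3.7.4",
    },
)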
Glue 1.0 & 2.0
You can zip the Python library, upload it to S3, and specify the path in the --extra-py-files job parameter.
See link to official documentation for more information.
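As a sketch of that approach, assuming you have already zipped the library and uploaded it to a bucket of your own, the parameter can also be supplied when starting a job run; the job name and S3 path are placeholders.

import boto3

glue = boto3.client("glue")

glue.start_job_run(
    JobName="api-to-rds-etl",  # placeholder job name
    Arguments={
        "--extra-py-files": "s3://my-bucket/libs/aiohttp.zip",  # placeholder zip path
    },
)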

Does Google Dataproc support Apache Impala?

I am new to using cloud services and navigating Google's Cloud Platform is quite intimidating. When it comes to Google Dataproc, they do advertise Hadoop, Spark and Hive.
My question is, is Impala available at all?
I would like to do some benchmarking projects using all four of these tools, and I require Apache Impala alongside Spark/Hive.
No. Dataproc is a managed cluster service that supports Hadoop, Spark, Hive, and Pig using its default images.
Check this link for more information about the image versions available for Dataproc:
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions
You can also try creating a Dataproc cluster with a non-default setup instead of using the default image.
For example, you can create a Dataproc cluster with HUE (Hadoop User Experience), an interface for working with a Hadoop cluster, built by Cloudera. The advantage here is that HUE includes Apache Impala as a default component, along with Pig, Hive, etc., so it's a pretty good option for using Impala.
Another solution would be to build your own cluster from scratch, but that's not a good idea unless you want to customize everything. That way, you could also install Impala yourself.
Here is a link, for more information:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/hue
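As a rough illustration, a cluster with the HUE initialization action could be created with the google-cloud-dataproc client along these lines; the project, region, cluster name, and the exact initialization-action path are assumptions to verify and adjust.

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder project
region = "us-central1"      # placeholder region

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "hue-cluster",
    "config": {
        # Path assumed from the initialization-actions repo; verify before use
        "initialization_actions": [
            {"executable_file": f"gs://goog-dataproc-initialization-actions-{region}/hue/hue.sh"}
        ],
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready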
Dataproc provides SSH access to the master and workers, so it is possible to install additional software. According to the Impala documentation, you would need to:
Ensure Impala Requirements.
Set up Impala on a cluster by building from source.
Remember that it is recommended to install the impalad daemon on each DataNode.
Cloud Dataproc supports Hadoop, Spark, Hive, and Pig by default on the cluster. You can install additional optional components such as Zookeeper, Jupyter, Anaconda, Kerberos, Druid, and Presto (you can find the complete list here). In addition, you can install a large set of open source components using initialization actions.
Impala is not supported as an optional component and there is no initialization-action script for it yet. You could get it to work on Dataproc with HDFS, but making it work with GCS may require non-trivial changes.

Read file from s3a along with AWS Athena SDK (1.11+)

I am writing a spark/scala program which submits a query on athena (using aws-java-sdk-athena:1.11.420) and waits for the query to complete. Once the query is complete, my spark program directly reads from the S3 bucket using s3a protocol (the output location of the query) using spark's sparkSession.read.csv() function.
In order to read the CSV file, I need to use org.apache.hadoop:hadoop-aws:2.8+ and org.apache.hadoop:hadoop-client:2.8+. Both of these libraries are built against AWS SDK version 1.10.6. However, the AWS Athena SDK is not available at that version; the oldest version they have is 1.11+.
How can I resolve the conflict? I need to use the latest version of the AWS SDK to get access to Athena, but hadoop-aws pushes me back to an older version.
Are there other versions of hadoop-aws that use 1.11+ AWS SDKs? If so, which versions will work for me? If not, what other options do I have?
I found out that I can use hadoop-aws:3.1+, which comes with aws-java-sdk-bundle:1.11+. This AWS SDK bundle includes Athena.
However, I still need to run Spark with hadoop-common:3.1+ libraries, while my Spark cluster runs the 2.8 libraries.
Because the cluster runs 2.8, spark-submit jobs were failing, while running the jar directly (java -jar myjar.jar) worked fine. This is because Spark was replacing the Hadoop libraries I provided with the versions it was bundled with.
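As an illustration (sketched in PySpark rather than Scala), pulling a matching hadoop-aws and its bundled AWS SDK at submit time might look like the snippet below. The exact coordinates are examples and must match the Hadoop version your cluster actually runs; if the cluster itself is still on Hadoop 2.8, Spark will substitute its own libraries, as noted above. The bucket path is a placeholder.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-athena-output")
    # hadoop-aws 3.1.0 pulls in a compatible aws-java-sdk-bundle (1.11.x);
    # adjust both coordinates to match the cluster's Hadoop version
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.1.0,com.amazonaws:aws-java-sdk-bundle:1.11.271",
    )
    .getOrCreate()
)

# Read the Athena query results directly from their S3 output location over s3a
df = spark.read.csv("s3a://my-athena-results/path/to/query-results.csv", header=True)
df.show()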