Does Google Dataproc support Apache Impala? - google-cloud-platform

I am new to using cloud services, and navigating Google Cloud Platform is quite intimidating. When it comes to Google Dataproc, they advertise Hadoop, Spark and Hive.
My question is: is Impala available at all?
I would like to do some benchmarking projects using all four of these tools, and I require Apache Impala alongside Spark/Hive.

No, Dataproc clusters support Hadoop, Spark, Hive and Pig out of the box, using the default images.
See this link for more information about the image versions that Dataproc provides natively:
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions

You can also try spinning up a new Dataproc cluster with a custom setup instead of using the defaults.
For example, you can create a Dataproc cluster with HUE (Hadoop User Experience), an interface for working with Hadoop clusters built by Cloudera. The advantage here is that HUE ships with Apache Impala as a default component, and it also includes Pig, Hive, etc., so it is a reasonably good way to get Impala.
Another option is to build your own cluster from scratch, but that is not a good idea unless you want to customize everything. Going that route, you can install Impala yourself.
Here is a link for more information:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/hue
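For reference, here is a minimal sketch of creating a cluster with that HUE initialization action through the google-cloud-dataproc Python client; the project, region, machine shapes and the regional public init-action path are assumptions you would adjust (copying hue.sh into your own bucket is the safer route), and the gcloud CLI's --initialization-actions flag achieves the same thing.

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "hue-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Path to the hue.sh script from the repository linked above; the
        # regional public bucket name below is an assumption.
        "initialization_actions": [
            {"executable_file": f"gs://goog-dataproc-initialization-actions-{region}/hue/hue.sh"}
        ],
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print("Created cluster:", operation.result().cluster_name)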

Dataproc provides you SSH access to the master and workers, so it is possible to install additional software, and according to the Impala documentation you would need to:
Ensure the Impala requirements are met.
Set up Impala on the cluster by building from source.
Remember that it is recommended to install the impalad daemon on each DataNode.

Cloud Dataproc supports Hadoop, Spark, Hive and Pig by default on the cluster. You can optionally install supported components such as Zookeeper, Jupyter, Anaconda, Kerberos, Druid and Presto (you can find the complete list here). In addition, you can install a large set of open source components using initialization actions.
Impala is not supported as an optional component, and there is no initialization-action script for it yet. You could get it to work on Dataproc with HDFS, but making it work with GCS may require non-trivial changes.
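For illustration, a minimal sketch of how optional components are requested at cluster creation time with the google-cloud-dataproc Python client (note that Impala is not among the available component names); the project, region and component list are placeholders.

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "benchmark-cluster",
    "config": {
        # Optional components are listed in the software config; Impala has
        # no entry here, which is the limitation described above.
        "software_config": {"optional_components": ["ZOOKEEPER", "PRESTO"]},
    },
}

client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()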

Related

Migrate Apache Cassandra to Amazon DynamoDB

I want to migrate the database from Apache Cassandra to Amazon DynamoDB.
I am following this user guide
https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/agents.cassandra.html
When I try to create a clone data centre for extraction, it throws an error.
If you read through that document, you'll find that the conversion tool supports only very old versions of Cassandra: 3.11.2, 3.1.1, 3.0, 2.1.20.
There will be a lot of configuration items in your cassandra.yaml that will not be compatible with the conversion tool, including replica_filtering_protection, since that property was not added until C* 3.0.22 and 3.11.8 (CASSANDRA-15907).
You'll need to engage AWS Support to figure out what migration options are available to you. Cheers!

When I try to fetch data from Amazon Keyspaces with Pyspark, I get an "Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner" error

I'm not experienced in Java or the Hadoop ecosystem. I configured my Spark cluster to connect to Amazon Keyspaces using the spark-cassandra-connector from Datastax, and I'm using Pyspark to fetch data from Cassandra. I can successfully connect to the Keyspaces/Cassandra cluster, but when I try to fetch data from it:
df = spark.sql("SELECT * FROM cass.tutorialkeyspace.tutorialtable")
print("Table Row Count: ")
print(df.count())
I get this error:
Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner
Yes, the keyspace and table exist and contain data. How can I fix or work around this? Thanks!
As an FYI, Keyspaces now supports the RandomPartitioner, which enables reading and writing data in Apache Spark with the open-source Spark Cassandra Connector.
Docs: https://docs.aws.amazon.com/keyspaces/latest/devguide/spark-integrating.html
Launch announcement: https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-keyspaces-read-write-data-apache-spark/
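Building on that, here is a minimal PySpark configuration sketch for reading from Keyspaces through the Spark Cassandra Connector; the connector version, endpoint region and the service-specific credentials are placeholders/assumptions (Keyspaces requires TLS on port 9142 and service-specific credentials).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("keyspaces-read")
    # Connector version is an assumption; pick one matching your Spark/Scala build.
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
    .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
    .config("spark.sql.catalog.cass", "com.datastax.spark.connector.datasource.CassandraCatalog")
    .config("spark.cassandra.connection.host", "cassandra.us-east-1.amazonaws.com")  # placeholder region
    .config("spark.cassandra.connection.port", "9142")
    .config("spark.cassandra.connection.ssl.enabled", "true")
    .config("spark.cassandra.auth.username", "SERVICE_USER")      # placeholder
    .config("spark.cassandra.auth.password", "SERVICE_PASSWORD")  # placeholder
    .getOrCreate()
)

df = spark.sql("SELECT * FROM cass.tutorialkeyspace.tutorialtable")
print("Table Row Count:", df.count())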
The Spark Cassandra Connector relies on a specific partitioner implementation to define data splits, etc. There is no workaround for this problem right now, until somebody adds an implementation of the corresponding TokenFactory to this code. It shouldn't be very complex; it just needs to be done by someone who is interested in it.
Thank you for the feedback. At this time, you can write to Keyspaces using the Cassandra Spark Connector. Reading requires support for token range. Please see the following doc page for the list of supported APIs: https://docs.aws.amazon.com/keyspaces/latest/devguide/cassandra-apis.html.
Although we don't have timelines to share at the moment, we prioritize our roadmap based on customer feedback, and we are releasing new features all the time. To learn more about our roadmap and upcoming features, please contact your AWS account manager.

Don't know how to load data into a GCP notebook (AI Platform)

I am turning to GCP (Google Cloud Platform) to train a Keras model using Google's powerful GPUs; for that I created a VM instance on which I run a JupyterLab notebook.
I found myself unable to access my data, which is stored in a bucket on Google Cloud Storage.
I found this small doc; under Python, it defines two functions for creating and filling a dataset. My problem here is that I couldn't install the datalabeling_v1beta1 module.
I already tried the command below, but it had no effect.
! gcloud components install datalab
I am new to GCP, so I really don't know much about the terminology; my goal for the moment is to upload my data set so I can use it as if I were on Google Colab or on my local machine.
Please refer to installing dependencies.
Create a new notebook: File -> New -> Notebook
%pip install google-cloud-datalabeling
For Data Labeling, see the reference.
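If the underlying goal is simply to read files from the bucket inside the notebook, a minimal sketch using the google-cloud-storage client may help; the bucket and object names are placeholders.

from google.cloud import storage

client = storage.Client()                      # uses the notebook VM's credentials
bucket = client.bucket("my-training-bucket")   # placeholder bucket name
blob = bucket.blob("data/train.csv")           # placeholder object path
blob.download_to_filename("train.csv")         # now readable like a local file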

Medium Hadoop / Spark Cluster Administration

Please let me know if this question is more appropriate for a different channel, but I was wondering what the recommended tools are for installing, configuring and deploying Hadoop/Spark across a large number of remote servers. I'm already familiar with how to set up all of the software, but I'm trying to determine what I should start using to easily deploy across many servers. I've started to look into configuration management tools (i.e. Chef, Puppet, Ansible), but was wondering what the best and most user-friendly option is to start off with. I also do not want to use spark-ec2. Should I be creating homegrown scripts to loop through a hosts file containing IPs? Should I use pssh? pscp? etc. I just want to be able to ssh into as many servers as needed and install all of the software.
If you have some experience with a scripting language, then you can go for Chef. Recipes are already available for deploying and configuring a cluster, and it's very easy to start with.
And if you want to do it on your own, you can use the sshxcute Java API, which runs scripts on a remote server. You can build up the commands there and pass them to the sshxcute API to deploy the cluster.
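As a rough illustration of that do-it-yourself, loop-over-a-hosts-file approach (not the sshxcute API itself), here is a minimal Python sketch that runs one command on each machine through the system ssh client; the hosts file layout, user name and command are assumptions.

import subprocess

# hosts.txt: one IP or hostname per line (assumed format)
with open("hosts.txt") as f:
    hosts = [line.strip() for line in f if line.strip()]

# Placeholder command; replace with your install/configure steps.
command = "sudo yum -y install java-1.8.0-openjdk"

for host in hosts:
    print(f"==> {host}")
    subprocess.run(
        ["ssh", "-o", "StrictHostKeyChecking=no", f"admin@{host}", command],
        check=True,  # stop on the first failing host
    )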
Check out Apache Ambari. It's a great tool for central management of configs, adding new nodes, monitoring the cluster, etc. This would be your best bet.

How to use newer versions of HBase on Amazon Elastic MapReduce?

Amazon's Elastic MapReduce tool seems to only have support for HBase v0.92.x and v0.94.x.
The documentation for the EMR AMIs and HBase is seemingly out-of-date and there is no information about HBase on the newest release label emr-4.0.0.
Using this example from an AWS engineer, I was able to concoct a way to install another version of HBase on the nodes, but it was ultimately unsuccessful.
After much trial and error with the Java SDK to provision EMR with better versions, I ask:
Is it possible to configure EMR to use more recent versions of HBase (e.g. 0.98.x and newer?)
After several days of trial, error and support tickets to AWS, I was able to implement HBase 0.98 on Amazon's ElasticMapReduce service. Here's how to do it using the Java SDK and some bash- and ruby-based bootstrap actions.
Credit for these bash and ruby scripts goes to Amazon support. They are in-development scripts and not officially supported. It's a hack.
Supported Version: HBase 0.98.2 for Hadoop 2
I also created mirrors in Google Drive for the supporting files in case Amazon pulls them down from S3.
Java SDK example
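// Note: instancesConfig (a JobFlowInstancesConfig) is assumed to be defined elsewhere.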
RunJobFlowRequest jobFlowRequest = new RunJobFlowRequest()
.withSteps(new ArrayList<StepConfig>())
.withName("HBase 0.98 Test")
.withAmiVersion("3.1.0")
.withInstances(instancesConfig)
.withLogUri("s3://bucket/path/to/logs")
.withVisibleToAllUsers(true)
.withBootstrapActions(new BootstrapActionConfig()
.withName("Download HBase")
.withScriptBootstrapAction(new ScriptBootstrapActionConfig()
.withPath("s3://bucket/path/to/wget/ssh/script.sh"))
)
.withBootstrapActions(new BootstrapActionConfig()
.withName("Install HBase")
.withScriptBootstrapAction(new ScriptBootstrapActionConfig()
.withPath("s3://bucket/path/to/hbase/install/ruby/script"))
)
.withServiceRole("EMR_DefaultRole")
.withJobFlowRole("EMR_EC2_DefaultRole");
"Download HBase" Bootstrap Action (bash script)
Original from S3
Mirror from Google Drive
"Install HBase" Bootstrap Action (ruby script)
Original from S3
Mirror from Google Drive
HBase Installation Tarball (used in "Download HBase" script)
Original from S3
Mirror from Google Drive
Make copies of these files
I highly recommend that you download these files and upload them into your own S3 bucket for use in your bootstrap actions and scripts. Adjust where necessary.