Apache Phoenix - GCP Dataproc - google-cloud-platform

I am doing a POC on Google Cloud Dataproc with HBase as one of the components.
I created a cluster and was able to get it running along with the HBase service. I can list and create tables via the shell.
I want to use Apache Phoenix as the client to query HBase. I installed it on the cluster by referring to this link.
The installation went fine, but when I execute sqlline.py localhost, which should create the meta table in HBase, it fails with a "Region in Transition" error.
Does anyone know how to resolve this, or is there a limitation that prevents Apache Phoenix from being used with Dataproc?

There is no limitation that prevents using Apache Phoenix on Dataproc. You might want to dig deeper into the error message; it is likely a configuration issue. Once sqlline.py comes up cleanly, you can also query Phoenix programmatically, as in the sketch below.
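For reference, here is a minimal sketch of querying HBase through Phoenix over JDBC from Java once the installation works. It assumes the Phoenix client jar is on the classpath and that ZooKeeper runs on the cluster master; the host name, port, and table name are placeholders, not values from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixJdbcExample {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URL format: jdbc:phoenix:<zookeeper-quorum>:<port>:<znode>
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        String url = "jdbc:phoenix:cluster-m:2181:/hbase"; // placeholder ZooKeeper host
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS test_tbl (id BIGINT PRIMARY KEY, name VARCHAR)");
            stmt.execute("UPSERT INTO test_tbl VALUES (1, 'phoenix-on-dataproc')");
            conn.commit(); // Phoenix connections are not auto-commit by default
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM test_tbl")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}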

Related

Send a metrics query to AWS AMP

I am using the AWS Managed Prometheus (AMP) service and set up a Prometheus server on my EKS cluster to collect and write metrics to my AMP workspace, using the Helm chart, as per the tutorial from AWS. All works fine; I have also connected a cluster running Grafana and I can see the metrics without a problem.
However, my use case is to query metrics from my web application, which runs on the cluster, and to display said metrics using my own diagram widgets. In other words, I don't want to use Grafana.
So I was thinking of using the AWS SDK (Java in my case, https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/amp/model/package-summary.html), which works fine (I can list my workspaces, etc.), except it doesn't have any method for querying metrics.
The documentation indeed mentions that this is not available out of the box and basically redirects to Grafana.
This seems fairly odd to me, as the basic use case would be to run some queries, no? Am I missing something here? Do I need to create my own HTTP requests for this?
FYI, I ended up doing the query manually, creating an SdkHttpFullRequest and using an Aws4Signer to sign it. It works OK, but I wonder why it couldn't be included in the SDK directly. The only gotcha was to make sure to specify "aps" as the signing name when creating the Aws4SignerParams.
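For anyone looking for the shape of that workaround, below is a minimal sketch of signing a PromQL query against the AMP query endpoint with the SigV4 signer from the AWS SDK for Java v2. The region, workspace ID, and query are placeholders, and the signed request still has to be executed with an HTTP client of your choice.

import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.auth.signer.Aws4Signer;
import software.amazon.awssdk.auth.signer.params.Aws4SignerParams;
import software.amazon.awssdk.http.SdkHttpFullRequest;
import software.amazon.awssdk.http.SdkHttpMethod;
import software.amazon.awssdk.regions.Region;

public class AmpQuerySigner {
    public static SdkHttpFullRequest signedQuery() {
        // Build the raw request against the AMP Prometheus-compatible query API.
        // Workspace ID and region are placeholders.
        SdkHttpFullRequest request = SdkHttpFullRequest.builder()
                .method(SdkHttpMethod.GET)
                .protocol("https")
                .host("aps-workspaces.eu-west-1.amazonaws.com")
                .encodedPath("/workspaces/ws-00000000-0000-0000-0000-000000000000/api/v1/query")
                .putRawQueryParameter("query", "up")
                .build();

        // Sign it with SigV4; "aps" is the signing name for AMP (the gotcha mentioned above).
        Aws4SignerParams params = Aws4SignerParams.builder()
                .awsCredentials(DefaultCredentialsProvider.create().resolveCredentials())
                .signingName("aps")
                .signingRegion(Region.EU_WEST_1)
                .build();

        // Execute the returned request with any SdkHttpClient (e.g. ApacheHttpClient)
        // and parse the Prometheus JSON response.
        return Aws4Signer.create().sign(request, params);
    }
}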

Can AWS Glue connect to a remote server via SFTP?

I am trying to establish a connection from AWS Glue to a remote server via SFTP using Python 3.7. I tried using the pysftp library for this task.
But pysftp depends on a library named bcrypt that contains both Python and C code, and at the moment AWS Glue only supports pure-Python libraries, as mentioned in the documentation (link below).
https://docs.aws.amazon.com/glue/latest/dg/console-custom-created.html
The error I am getting is as below.
ImportError: cannot import name '_bcrypt'
I am stuck here due to a compilation error.
Hence, I tried the JSch Java library from Scala. There the compilation succeeds, but I get the exception below.
com.jcraft.jsch.JSchException: java.net.UnknownHostException: [Remote Server Hostname]
How can we connect to a remote server via SFTP from AWS Glue? Is it possible?
How can we configure outbound rules (if required) for a Glue job?
I am answering my own question here for anyone this might help.
The straight answer is no.
I found the resources below, which indicate that AWS Glue is an ETL tool for AWS resources.
"AWS Glue uses other AWS services to orchestrate your ETL (extract, transform, and load) jobs to build a data warehouse."
Source - https://docs.aws.amazon.com/glue/latest/dg/how-it-works.html
"Glue works well only with ETL from JDBC and S3 (CSV) data sources. In case you are looking to load data from other cloud applications, File Storage Base, etc. Glue would not be able to support."
Source - https://hevodata.com/blog/aws-glue-etl/
Hence, to implement what I was working on, I used an AWS Lambda function to connect to the remote server via SFTP, pick up the required files, and drop them in an S3 bucket. The AWS Glue job can then pick the files up from S3. A sketch of that Lambda approach is below.
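As an illustration only, here is a minimal sketch of such a Lambda handler in Java, using JSch (as in the Glue/Scala attempt above) for SFTP and the S3 SDK for the upload. The host, credentials, bucket, and paths are placeholders, and the function still needs network access to the SFTP host (VPC routing/NAT), which is what the UnknownHostException usually points to.

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SftpToS3Handler {

    // Simplified entry point; a real Lambda would implement RequestHandler and
    // read the host, credentials, and bucket from env vars or Secrets Manager.
    public void handleRequest() throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("sftp-user", "sftp.example.com", 22); // placeholder host/user
        session.setPassword("secret");                    // placeholder; use Secrets Manager in practice
        session.setConfig("StrictHostKeyChecking", "no"); // or provide a known_hosts file
        session.connect();

        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();
        try (InputStream in = sftp.get("/outbound/data.csv")) {           // placeholder remote path
            Path tmp = Files.createTempFile("sftp", ".csv");              // Lambda /tmp space is limited
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);

            try (S3Client s3 = S3Client.create()) {
                s3.putObject(PutObjectRequest.builder()
                                .bucket("my-landing-bucket")              // placeholder bucket for Glue
                                .key("sftp/data.csv")
                                .build(),
                        RequestBody.fromFile(tmp));
            }
        } finally {
            sftp.disconnect();
            session.disconnect();
        }
    }
}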
I know it has been some time since this question was posted, so I'd like to share some tools that could help you get data from an SFTP server more easily and quickly. To build a layer the easy way, use this tool: https://github.com/aws-samples/aws-lambda-layer-builder. You can build a pysftp layer faster and free of those annoying errors (cffi, bcrypt).
Lambda has a limit of 500 MB, so if you are trying to extract heavy files, the Lambda will crash for that reason. To fix this you have to attach EFS (Elastic File System) to your Lambda: https://docs.aws.amazon.com/lambda/latest/dg/services-efs.html

AWS EMR Presto job

Is it possible to submit Presto jobs/steps to an EMR cluster in any way, just like you can submit Hive jobs/steps via a script in S3?
I would prefer not to SSH to the instance to execute the commands, but to do it automatically.
You can use any JDBC/ODBC client to connect to the Presto cluster. If you want to connect programmatically, you can do so with the available drivers (see the sketch after the links below).
https://prestodb.github.io/docs/current/installation/jdbc.html
https://teradata.github.io/presto/docs/0.167-t/installation/odbc.html
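Programmatically, the JDBC route is just standard java.sql code. Below is a minimal sketch, assuming the presto-jdbc driver is on the classpath and the coordinator listens on port 8080 on the master node; the host, catalog, schema, user, and table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoJdbcExample {
    public static void main(String[] args) throws Exception {
        // URL format: jdbc:presto://<coordinator-host>:<port>/<catalog>/<schema>
        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        String url = "jdbc:presto://emr-master-node:8080/hive/default"; // placeholder host
        try (Connection conn = DriverManager.getConnection(url, "hadoop", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}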
If you have a BI tool like Tableau, QlikView, or Superset, it can easily be done as well.
e.g. https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_presto.htm

Manual installation of Sqoop 1.4 on AWS EMR 5.5.0

I have launched a Hadoop EMR cluster (5.5.0, components: Hive, Hue) but not Sqoop. Now I also need Sqoop, to query and dump data from a MySQL database. Since the cluster is already launched and holds a good amount of data, I wanted to know whether I can add Sqoop to it. I don't see this option on the AWS Console.
Thanks
I installed it manually and did the required configuration. The limitation now, I guess, is that if I have to clone the cluster, it won't be available on the clone.

Connecting to Spark SQL on EMR using JDBC

I have Spark running on EMR and have been trying to connect to Spark SQL from SQLWorkbench using the JDBC Hive drivers, but in vain. I started the Thrift server on EMR and am able to connect to Hive on port 10000 (the default) from Tableau/SQL Workbench. When I run a query, it fires a Tez/Hive job. However, I want to run the query using Spark. From within the EMR box, I am able to connect to Spark SQL using beeline and run a query as a Spark job. The Resource Manager shows that the beeline query runs as a Spark job, while the query run through SQLWorkbench runs as a Hive/Tez job.
When I checked the logs, I found that the Thrift server for Spark was running on port 10001 (the default).
When I fire up beeline, entries show up for the connection and the SQL I'm running. However, when the same connection parameters are used to connect from SQLWorkbench/Tableau, I get an exception without much detail; it just says the connection ended.
I tried running on a custom port by passing the parameters; beeline works, but not the JDBC connection.
Any help to resolve this issue?
I was able to resolve the issue and connect to Spark SQL from Tableau. The reason I was not able to connect before was that we were bringing up the Thrift service as root. I'm not sure why it matters, but I had to change the permissions on the log folder to the current user (not root) and bring up the Thrift service again, which enabled me to connect without any issues.
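For anyone reproducing this, here is a minimal sketch of the JDBC connection that ends up running queries as Spark jobs. It assumes the Spark Thrift Server listens on its default port 10001 and that the Hive JDBC driver is on the classpath; the host name, user, and table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkThriftJdbcExample {
    public static void main(String[] args) throws Exception {
        // The Spark Thrift Server speaks the HiveServer2 protocol, so the regular
        // Hive JDBC driver works; port 10001 routes queries to Spark, whereas
        // port 10000 is HiveServer2 and runs them as Hive/Tez jobs.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://emr-master-node:10001/default"; // placeholder host
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}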