Hadoop - ODBC with HBASE without HIVE - web-services

I would like to know if it is possible to develop a bridge between Hadoop (HBase) and ODBC, for example via a web service (so, without using Hive):
Hadoop without HIVE <-> Web service <->(ODBC flow)<->Database
Thank you all for your help.

Yes, you can build such a bridge using the Simba Engine SDK. We would be pleased to have a call to discuss.
Please let me know a time that works.
Tom Newton

Related

When I try to fetch data from Amazon Keyspaces with PySpark, I get an "Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner" error

I'm not experienced in Java or the Hadoop ecosystem. I configured my Spark cluster to connect to Amazon Keyspaces by using the spark-cassandra-connector from Datastax. I'm using PySpark to fetch data from Cassandra. I can successfully connect to the Keyspaces/Cassandra cluster, but when I try to fetch data from it:
df = spark.sql("SELECT * FROM cass.tutorialkeyspace.tutorialtable")
print ("Table Row Count: ")
print (df.count())
I get this error:
Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner
Yes, the keyspace & table exist and have data. How can I fix/work around this? Thanks!
As an FYI, Keyspaces now supports using the RandomPartitioner, which enables reading and writing data in Apache Spark by using the open-source Spark Cassandra Connector.
Docs: https://docs.aws.amazon.com/keyspaces/latest/devguide/spark-integrating.html
Launch announcement: https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-keyspaces-read-write-data-apache-spark/
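For reference, here is a minimal PySpark sketch of the connector settings involved, assuming you have already switched your account's partitioner to RandomPartitioner as described in the AWS guide above. The connector version, region endpoint, truststore path, catalog name, and credentials below are placeholders to adapt, not exact values.
from pyspark.sql import SparkSession

# All values below are placeholders: connector version, region endpoint,
# truststore, and service-specific credentials must match your environment.
spark = (
    SparkSession.builder
    .appName("keyspaces-read-check")
    # Spark Cassandra Connector package (Scala 2.12 build shown as an example)
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
    # Amazon Keyspaces service endpoint for your region (TLS on port 9142)
    .config("spark.cassandra.connection.host", "cassandra.eu-west-1.amazonaws.com")
    .config("spark.cassandra.connection.port", "9142")
    .config("spark.cassandra.connection.ssl.enabled", "true")
    .config("spark.cassandra.connection.ssl.trustStore.path", "/path/to/cassandra_truststore.jks")
    .config("spark.cassandra.connection.ssl.trustStore.password", "truststore-password")
    # Service-specific credentials generated for your IAM user
    .config("spark.cassandra.auth.username", "keyspaces-user-at-123456789012")
    .config("spark.cassandra.auth.password", "service-specific-password")
    # Expose Keyspaces tables under the "cass" catalog used in the question
    .config("spark.sql.catalog.cass",
            "com.datastax.spark.connector.datasource.CassandraCatalog")
    .getOrCreate()
)

df = spark.sql("SELECT * FROM cass.tutorialkeyspace.tutorialtable")
print("Table Row Count:", df.count())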
The Spark Cassandra Connector relies on a specific partitioner implementation to define data splits, etc. There is no workaround for this problem right now, until somebody adds an implementation of the corresponding TokenFactory to this code. It shouldn't be very complex; it just needs to be done by someone who is interested in it.
Thank you for the feedback. At this time, you can write to Keyspaces using the Cassandra Spark Connector. Reading requires support for token range. Please see the following doc page for the list of supported APIs: https://docs.aws.amazon.com/keyspaces/latest/devguide/cassandra-apis.html
Although we don't have timelines to share at the moment, we prioritize our roadmap based on customer feedback. We are releasing new features all the time. To learn more about our roadmap and upcoming features, please contact your AWS account manager.

How to call a BigQuery stored procedure in NiFi

I have BigQuery stored procedures which run on some GCS objects and do magic with them. The procedures work perfectly when run manually, but I want to call them from NiFi. I have worked with HANA and know that I need a JDBC driver to connect and run queries.
I could use either the ExecuteProcess processor or the ExecuteSQL processor; to be honest, I don't know which.
I am not sure how to achieve this in NiFi with BigQuery stored procedures. Could anyone help me with this?
Thanks in advance!!
Updated with a new error, in case someone can help.
Option 1: ExecuteProcess
The closest thing to "running it manually" is installing the Google Cloud SDK and executing this inside ExecuteProcess:
bq query 'CALL STORED_PROCEDURE(ARGS)'
or
bq query 'SELECT STORED_PROCEDURE(ARGS)'
Option 2: ExecuteSQL
If you want to use ExecuteSQL with NiFi to call the stored procedure, you'll need the BigQuery JDBC driver.
Both 'select' and 'call' methods will work with BigQuery.
Which option is better?
I believe ExecuteSQL is easier than ExecuteProcess.
Why? Because you need to install the GCloud SDK on all systems that might run ExecuteProcess, and you must pass the Google Cloud credentials to them.
That means sharing the job is not easy.
Plus, this might involve administrator rights on all the machines.
In the ExecuteSQL case you'll need to:
1 - Copy the JDBC driver to the lib directory inside your NiFi installation
2 - Connect to BigQuery using pre-generated access/refresh tokens - see the "JDBC Driver for Google BigQuery Install and Configuration" guide - that's OAuth type 2.
The good part is that when you export the flow, the credentials are embedded in it: no need to mess with credentials.json files, etc. (this could also be bad from a security standpoint).
Distributing JDBC jars is easier than installing the GCloud SDK: just drop a file into the lib folder. If you need it on more than one node, you can scp/sftp it or distribute it with Ambari.
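If you want to sanity-check the JDBC URL and credentials outside NiFi before filling in the connection pool settings behind ExecuteSQL, a rough Python sketch using jaydebeapi is below. The jar path, project, dataset, procedure name, and OAuth values are placeholders; the driver class and URL layout follow the Simba install guide mentioned above, so double-check them against your driver version.
import jaydebeapi  # pip install jaydebeapi (needs a local JVM)

# Placeholders/assumptions: adjust jar path, project, dataset, procedure
# name, and OAuth credentials to your environment.
JAR = "/path/to/GoogleBigQueryJDBC42.jar"
URL = (
    "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;"
    "ProjectId=my-project;"
    "OAuthType=2;"                      # pre-generated refresh token, as in step 2 above
    "OAuthClientId=my-client-id;"
    "OAuthClientSecret=my-client-secret;"
    "OAuthRefreshToken=my-refresh-token;"
)

conn = jaydebeapi.connect(
    "com.simba.googlebigquery.jdbc42.Driver",  # class name shipped with the Simba driver
    URL,
    jars=JAR,
)
try:
    cur = conn.cursor()
    # The same statement you would put in ExecuteSQL
    cur.execute("CALL `my-project.my_dataset.my_stored_procedure`('gs://bucket/object')")
finally:
    conn.close()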

Talend Identity and Access Management for Talend Cloud

We are using the Talend Cloud version, so there is TMC (Talend Management Console) instead of TAC. We need to set up authentication and authorization for our ESB services, but that is impossible within TMC. We have found Talend Identity and Access Management, but have no idea whether it is used only with TAC or with TMC as well. Could you tell me whether Talend IAM supports TMC, and if so, how? If not, which tool could be used instead?
Kind Regards
It would appear that TIA is a white-labeled version of Apache Syncope. With on-premise Talend (i.e. TAC) you could install it on the same server running TAC; however, as you are using Talend Cloud, this isn't an option.
It looks like you are going to need a server of some description to run TIA on. If you are using Remote Engines (which I think you must be, as I don't think you can run ESB jobs on Cloud Engines yet), then I recommend you install TIA (or even the latest version of Apache Syncope, as Talend can sometimes ship pretty ancient versions of the software they have white-labeled) on a remote engine.
As far as I can tell there should be no reason why your ESB jobs shouldn't be able to use TIA (or Syncope) provided the appropriate firewall rules are in place.

How to copy a huge file (200-500 GB) every day from a Teradata server to HDFS

I have Teradata files on Server A and I need to copy them to Server B into HDFS. What options do I have?
distcp is ruled out because Teradata is not on HDFS.
scp is not feasible for huge files.
Flume and Kafka are meant for streaming, not for file movement. Even if I used Flume with a spooling directory, it would be overkill.
The only option I can think of is NiFi. Does anyone have any suggestions on how I can use NiFi?
Or, if someone has already gone through this kind of scenario, what approach was followed?
I haven't specifically worked with Teradata dataflows in NiFi, but having worked with other SQL sources in NiFi, I believe it is possible and pretty straightforward to develop a dataflow that ingests data from Teradata into HDFS.
For starters, you can do a quick check with the ExecuteSQL processor available in NiFi. The SQL-related processors take a DBCPConnectionPool property, which is a NiFi controller service that should be configured with the JDBC URL of your Teradata server, the driver path, and the driver class name. Once you validate that the connection is fine, you can take a look at GenerateTableFetch/QueryDatabaseTable. A sketch of the values involved is shown after the link below.
Hortonworks has an article which talks about configuring DBCPConnectionPool with a Teradata server: https://community.hortonworks.com/articles/45427/using-teradata-jdbc-connector-in-nifi.html
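As a rough sketch only, these are the kinds of values the DBCPConnectionPool controller service expects, together with a quick Python connectivity check via jaydebeapi that uses the same JDBC URL and driver class. Host, database, credentials, jar locations, and the sample table are placeholders to adapt to your environment.
import jaydebeapi  # pip install jaydebeapi (needs a local JVM)

# The same values you would enter in NiFi's DBCPConnectionPool controller
# service (all placeholders here):
DB_URL = "jdbc:teradata://teradata-server-a.example.com/DATABASE=mydb"
DRIVER_CLASS = "com.teradata.jdbc.TeraDriver"
# Older Teradata drivers ship terajdbc4.jar plus tdgssconfig.jar; newer ones
# only need terajdbc4.jar - point this at whatever your driver download contains.
DRIVER_JARS = ["/opt/nifi/drivers/terajdbc4.jar", "/opt/nifi/drivers/tdgssconfig.jar"]
USER, PASSWORD = "dbc_user", "dbc_password"

# Quick check that the URL, driver, and credentials work before wiring up
# ExecuteSQL / GenerateTableFetch / QueryDatabaseTable in NiFi.
conn = jaydebeapi.connect(DRIVER_CLASS, DB_URL, [USER, PASSWORD], DRIVER_JARS)
try:
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM mydb.mytable")  # placeholder table
    print("Row count:", cur.fetchone()[0])
finally:
    conn.close()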

Redshift to R Shiny Connectivity

Are there any existing packages in R or somewhere else that can connect AWS Redshift clusters to R shiny apps? I'm trying to build up an interactive dashboard using Shiny and the data source is primarily Amazon Redshift or S3. Any workable alternatives or suggestions are welcomed too.
I am using R Shiny with Redshift with very nice results.
First you have to install and load the packages, then open the connection:
library(RPostgreSQL)
library(shinydashboard) # only if you want to use nice dashboards
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, host="blabla.eu-west-1.redshift.amazonaws.com",
                  port="5439", dbname="xx", user="aaaaa", password="xxxxx")
conn # print your connection
test <- data.frame(dbGetQuery(conn, "select * from yourtablename"))
That is working for me.
I know this is an old post but wanted to mention RPostgres.
Unlike RPostgreSQL, RPostgres supports SSL and parameterization. Plus, you don't have to download an additional driver like RJDBC.
More on this here:
https://auth0.com/blog/a-comprehensive-guide-for-connecting-with-r-to-redshift/
I've connected in the past using both RJDBC and RPostgreSQL - both work pretty well.
Bear in mind that neither the ODBC nor the JDBC Redshift drivers are supported on Shinyapps.io (because Shinyapps.io is built on Ubuntu) - RPostgreSQL may therefore be your best bet.
It's very easy to get a working connection in either RJDBC or RPostgreSQL.