We have tried Kafka Connect to push data from Kafka topics into HDFS-backed Hive tables, and NiFi to push data from Kafka topics into HBase. Even though HBase is a NoSQL database, the Kafka Connect HDFS-to-Hive path seems to be much faster than the NiFi-to-HBase path.
We are trying to build real-time dashboards. Which combination do you feel is the most promising technology stack?
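For context, the Kafka Connect side is roughly the setup sketched below, registered through the Connect REST API. This is only a sketch: the hostnames, topic name, metastore URI, and sizing values are placeholders, not our actual configuration.

```python
# Sketch: register Confluent's HDFS sink connector (with Hive integration)
# through the Kafka Connect REST API. All hosts/topics below are placeholders.
import json
import requests

connector_config = {
    "name": "hdfs-hive-sink",                        # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "4",
        "topics": "events",                          # placeholder topic
        "hdfs.url": "hdfs://namenode:8020",          # placeholder NameNode
        "flush.size": "10000",
        "hive.integration": "true",                  # auto-create/update Hive tables
        "hive.metastore.uris": "thrift://metastore:9083",
        "hive.database": "default",
        "schema.compatibility": "BACKWARD",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",                # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector_config),
)
resp.raise_for_status()
print(resp.json())
```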
I need your suggestions on the following scenario:
USE CASE
I have an on-premises MySQL database (20 tables), and I need to transfer/sync certain tables (6 of them) from this database to BigQuery for reporting.
Possible solutions:
1. Transfer the whole database to Cloud SQL using Database Migration Service (DMS), then connect the Cloud SQL instance to BigQuery and query the needed tables for reporting.
2. Use a Dataflow pipeline with Pub/Sub: how do I move data from MySQL to BigQuery?
Any suggestions on how to sync only some tables to BigQuery without migrating the whole database?
Big Thanks!
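If near-real-time isn't strictly required, here is a minimal sketch of what syncing just the needed tables with a simple scheduled batch job could look like. All identifiers (hosts, credentials, project/dataset, table names) are placeholders, and it assumes mysql-connector-python, pandas, and google-cloud-bigquery are installed.

```python
# Sketch: periodically copy a selected subset of MySQL tables into BigQuery.
# All identifiers below are placeholders.
import pandas as pd
import mysql.connector
from google.cloud import bigquery

TABLES = ["orders", "customers", "invoices"]          # the subset of tables to sync
BQ_DATASET = "my_project.reporting"                   # placeholder project.dataset

def sync_table(conn, bq_client, table):
    df = pd.read_sql(f"SELECT * FROM {table}", conn)  # full extract; fine for small tables
    job = bq_client.load_table_from_dataframe(
        df,
        f"{BQ_DATASET}.{table}",
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
    )
    job.result()                                      # wait for the load job to finish

def main():
    conn = mysql.connector.connect(
        host="onprem-mysql", user="reporter", password="...", database="erp"
    )
    bq_client = bigquery.Client()
    for table in TABLES:
        sync_table(conn, bq_client, table)
    conn.close()

if __name__ == "__main__":
    main()
```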
What is the best practice for moving data from a Kafka cluster to a Redshift table?
We have continuous data arriving on Kafka and I want to write it to tables in Redshift (it doesn't have to be in real time).
Should I use a Lambda function?
Should I write a Redshift connector (consumer) that will run on a dedicated EC2 instance? (The downside is that I need to handle redundancy myself.)
Is there some AWS pipeline service for that?
Kafka Connect is commonly used for streaming data from Kafka to (and from) data stores. It does useful things like automagically managing scale-out, failover, schemas, serialisation, and so on.
This blog shows how to use the open-source JDBC Kafka Connect connector to stream to Redshift. There is also a community Redshift connector, but I've not tried it.
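For illustration, registering the JDBC sink against Redshift looks roughly like the sketch below. The endpoint, credentials, and topic are placeholders, the exact option names depend on the connector version, and a Redshift-compatible JDBC driver has to be on the Connect worker's classpath.

```python
# Sketch: JDBC sink connector config targeting Redshift, posted to the Kafka
# Connect REST API (same pattern as any connector). Endpoint, credentials, and
# topic names are placeholders.
import requests

jdbc_sink = {
    "name": "redshift-jdbc-sink",                    # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "tasks.max": "2",
        "topics": "clickstream",                     # placeholder topic
        "connection.url": "jdbc:redshift://my-cluster:5439/analytics",
        "connection.user": "etl_user",
        "connection.password": "...",
        "insert.mode": "insert",
        "auto.create": "true",                       # create the target table if missing
    },
}

requests.post("http://connect:8083/connectors", json=jdbc_sink).raise_for_status()
```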
This blog shows another approach, not using Kafka Connect.
Disclaimer: I work for Confluent, who created the JDBC connector.
I want to migrate our database and data warehouse into AWS.
Our on-prem database is Oracle, and we use Oracle Data Integrator for data warehousing on IBM AIX.
My first thought was to migrate our database with AWS DMS (Database Migration Service) into a staging area (S3), and then use Lambda (to trigger when data is updated, deleted, or inserted) and Kinesis Firehose (for streaming and the ETL) to send the data into Redshift.
The data in Redshift must be a replica of our on-prem data warehouse (containing facts and dimensions, aggregations, and multiple joins), and whenever any change happens in the on-prem database, it should automatically update S3 and Redshift so I can have near-real-time data in Redshift.
I was wondering if my architecture is correct and/or is there a better way to do it?
Thank you
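To make the Lambda → Firehose hop concrete, here is a rough sketch of a handler triggered by S3 object creation (e.g. the files DMS drops into the staging bucket) that forwards records to a Firehose delivery stream configured to load into Redshift. The stream name and the assumption of one JSON record per line are mine, not part of the actual setup.

```python
# Sketch: Lambda handler triggered by S3 ObjectCreated events (e.g. DMS CDC files),
# forwarding each line as a record to a Kinesis Data Firehose delivery stream that
# is configured to deliver into Redshift. Stream name and file format (one JSON
# record per line) are assumptions.
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

DELIVERY_STREAM = "dw-to-redshift"                   # hypothetical Firehose stream

def lambda_handler(event, context):
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Firehose accepts at most 500 records per PutRecordBatch call.
        lines = [l for l in body.splitlines() if l.strip()]
        for i in range(0, len(lines), 500):
            batch = [{"Data": (line + "\n").encode("utf-8")} for line in lines[i : i + 500]]
            resp = firehose.put_record_batch(
                DeliveryStreamName=DELIVERY_STREAM, Records=batch
            )
            if resp.get("FailedPutCount", 0):
                # In a real pipeline you would retry the failed records here.
                raise RuntimeError(f"{resp['FailedPutCount']} records failed")
    return {"status": "ok"}
```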
I need suggestions on importing data from a Hadoop data lake (Kerberos-authenticated) to AWS. All the Hive tables should land in S3 and then need to be loaded into AWS RDS.
I have considered the following options:
1) AWS Glue?
2) Spark connecting to the Hive metastore?
3) Connecting to Impala from AWS?
There are around 50 tables to be imported. How can I maintain the schema? Is it better to import the data and then create a separate schema in RDS?
Personally, I would dump a list of all tables needing to be moved.
From that, run SHOW CREATE TABLE over them all, and save the queries.
Run distcp (or whatever else you want) to move the data into S3 / EBS.
Edit each CREATE TABLE command to specify a Hive table location that's in the cloud data store. I believe you'll need to make all of these external tables, since you cannot put data directly into the Hive warehouse directory and have the metastore immediately know of it.
Run all the commands on the AWS Hive connection.
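Here's a rough sketch of the SHOW CREATE TABLE / rewrite step in Python, assuming PyHive for the HiveServer2 connection; the host, database, bucket, and the naive regex are purely illustrative.

```python
# Sketch: dump CREATE TABLE statements for a list of Hive tables and rewrite them
# as EXTERNAL tables whose LOCATION points at S3. Assumes PyHive with a Kerberos
# HiveServer2 connection; host, tables, bucket, and the regex are placeholders.
import re
from pyhive import hive

TABLES = ["db1.orders", "db1.customers"]             # the tables being moved
S3_PREFIX = "s3a://my-datalake-bucket/warehouse"     # placeholder bucket/prefix

conn = hive.connect(host="hive-server", port=10000, auth="KERBEROS",
                    kerberos_service_name="hive")
cur = conn.cursor()

for table in TABLES:
    cur.execute(f"SHOW CREATE TABLE {table}")
    ddl = "\n".join(row[0] for row in cur.fetchall())

    # Make it external and point LOCATION at the S3 copy of the data (distcp'd earlier).
    ddl = ddl.replace("CREATE TABLE", "CREATE EXTERNAL TABLE", 1)
    ddl = re.sub(r"LOCATION\s+'[^']+'",
                 f"LOCATION '{S3_PREFIX}/{table.replace('.', '/')}'", ddl)

    with open(f"{table}.ddl", "w") as f:              # save to run against the AWS Hive endpoint
        f.write(ddl + ";\n")
```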
I have coworkers who have used CircusTrain.
Impala and Spark are meant for processing. You're going to need to deal mostly with the Hive metastore here.
I want to run an incremental nightly job that extracts hundreds of GB of data from an Oracle data warehouse into HDFS. After processing, the results (a few GB) need to be exported back to Oracle.
We are running Hadoop in Amazon AWS, and our data warehouse is on premises. The data link between AWS and on premises is 100 Mbps and not reliable.
If I use sqoop import to bring the data from Oracle and the network experiences intermittent outages, how does Sqoop handle this?
Also, what happens if I have imported (or exported) 70% of my data and the network goes down during the remaining 30%?
Since Sqoop uses JDBC by default, how does the data transfer happen at the network level? Can we compress the data in transit?
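For reference, a nightly incremental import could look something like the sketch below, wrapped in Python only for scheduling; the connection string, table, and check column are placeholders. As far as I know, --compress only compresses the files Sqoop writes to HDFS, not the JDBC stream coming out of Oracle, so on its own it doesn't reduce traffic over the 100 Mbps link.

```python
# Sketch: nightly incremental Sqoop import from Oracle into HDFS, launched from
# Python (e.g. by cron or a scheduler). Connection string, table, check column,
# and last-value bookkeeping are placeholders.
import subprocess

LAST_VALUE = "2024-01-01 00:00:00"   # normally read from a metastore/state file

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@dw-host:1521/DWPROD",   # placeholder DSN
    "--username", "etl_user",
    "--password-file", "hdfs:///user/etl/.oracle_pwd",      # avoid passwords on the CLI
    "--table", "SALES_FACT",                                # placeholder table
    "--target-dir", "/data/staging/sales_fact",
    "--incremental", "lastmodified",
    "--check-column", "LAST_UPDATED",
    "--last-value", LAST_VALUE,
    "--num-mappers", "4",                                   # 4 parallel JDBC connections
    "--compress",                                           # compresses HDFS output files
    "--compression-codec", "org.apache.hadoop.io.compress.SnappyCodec",
]

result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    # A failed run (e.g. a network outage) is simply re-launched; the target dir
    # may need cleaning up first.
    raise RuntimeError(result.stderr[-2000:])
print("Sqoop import finished")
```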