Hi, I am new to Hadoop and want to set up HDFS replication using the Cloudera Manager API.
How can I create HDFS replication using the Cloudera Manager (CM) API?
After doing a lot of research, I was able to put together the command below, which replicates from one HDFS location to another within the same cluster.
With slight variations, we can do remote (cluster-to-cluster) replication as well.
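For reference, a request against the Cloudera Manager replication-schedules endpoint looks roughly like the sketch below. The host, credentials, API version, cluster/service names, and paths are placeholders, and the exact JSON fields should be verified against the ApiReplicationSchedule schema in the API docs for your CM version; for remote replication, sourceService would reference a configured CM peer via peerName instead of the local cluster.

curl -X POST -u admin:admin \
  -H "Content-Type: application/json" \
  "http://cm-host.example.com:7180/api/v19/clusters/Cluster1/services/HDFS-1/replications" \
  -d '{
        "items": [{
          "startTime": "2021-01-01T00:00:00.000Z",
          "interval": 1,
          "intervalUnit": "DAY",
          "hdfsArguments": {
            "sourceService": {"peerName": null, "clusterName": "Cluster1", "serviceName": "HDFS-1"},
            "sourcePath": "/user/source/dir",
            "destinationPath": "/user/backup/dir",
            "mapreduceServiceName": "YARN-1"
          }
        }]
      }'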
Related
Hoping to achieve something like Cloudera Backup and Disaster Recovery (BDR) to AWS, but in GCP, I am searching for alternatives.
Will the below approach work?
adding the GCS (Cloud Storage) connector to an on-prem Cloudera cluster
then copying with hadoop distcp
then syncing the HDFS source directory to a GCS directory with gsutil rsync [OPTION]... src_url dst_url
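For example, step 3 could look like the line below, assuming the HDFS data has first been staged to a local directory (gsutil rsync works on local directories and gs:// URLs; paths and bucket name are placeholders):

gsutil -m rsync -r /data/hdfs-export gs://my-backup-bucket/hdfs-export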
If the above approach is not possible, is there any other alternative to achieve Cloudera BDR with Google Cloud Storage (GCS)?
At the moment, Cloudera Manager's Backup and Disaster Recovery does not support Google Cloud Storage; this is listed in its limitations. Please see the full documentation through this link for Configuring Google Cloud Storage Connectivity.
The above approach will work. We just need to add a few steps to begin with:
We first need to establish a private link between the on-prem network and the Google network using Cloud Interconnect or Cloud VPN.
A Dataproc cluster is needed for the data transfer.
Use the Google Cloud CLI (gcloud) to connect to your cluster's master instance.
Finally, you can run DistCp commands to move your data.
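For example, the last two steps might look like this (cluster name, zone, paths, and bucket are placeholders):

# SSH into the Dataproc master node (Dataproc names it <cluster-name>-m)
gcloud compute ssh my-dataproc-cluster-m --zone=us-central1-a

# From the master node, copy HDFS data to Cloud Storage with DistCp
hadoop distcp hdfs:///user/data/source-dir gs://my-backup-bucket/source-dir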
For more detailed information, you may check this full documentation on Using DistCp to copy your data to Cloud Storage.
Google also has its own BDR offering; you can check this Data Recovery planning guide.
Please be advised that Google Cloud Storage cannot be the default file system for the cluster.
You can also check this link: Working with Google Cloud partners
You can use the connector in any of the following ways:
In a Spark (or PySpark) or Hadoop application using the gs:// prefix.
The hadoop shell: hadoop fs -ls gs://bucket/dir/file.
The Cloud Console Cloud Storage browser.
Using the gsutil cp or gsutil rsync commands.
You can check this full documentation on using connectors.
Let me know if you have questions.
I have many (15-20) Redshift stored procedures; some can run asynchronously, while many have to run in a synchronous manner.
I tried scheduling them in an async and sync manner using AWS EventBridge but found many limitations (failure handling and orchestration).
So I went ahead with AWS Managed Airflow (MWAA).
How can we set up the Redshift cluster connection in Airflow,
so that we can call our stored procedures from Airflow DAGs and have them run on the Redshift cluster?
Is there a RedshiftOperator for the connection, or can we create a direct connection to the Redshift cluster using the connection option in the Airflow menu?
If possible, can we achieve all of this using the AWS console only, without the AWS CLI?
How can we set up the Redshift cluster connection in Airflow,
so that we can call our stored procedures from Airflow DAGs and have them run on the Redshift cluster?
You can use Airflow Connections to connect to Redshift. This is the native approach for managing connections to external services such as databases.
Managing Connections (Airflow)
Amazon Redshift Connection (Airflow)
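As an alternative to creating the connection in the Airflow UI (Admin > Connections), a connection can also be supplied as a URI-style environment variable named after the connection ID; the host, credentials, and database below are placeholders, and on AWS Managed Airflow (MWAA) connections are more commonly created in the UI or via a configured secrets backend:

export AIRFLOW_CONN_REDSHIFT_DEFAULT='postgres://awsuser:my_password@my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com:5439/dev'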
Is there a RedshiftOperator for the connection, or can we create a direct connection to the Redshift cluster using the connection option in the Airflow menu?
You can use the PostgresOperator to execute SQL commands in the Redshift cluster. When initializing the PostgresOperator, set the postgres_conn_id parameter to the Redshift connection ID (e.g. redshift_default). Example:
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Executes sql/stored_proc.sql against the Airflow connection with ID "redshift_default"
PostgresOperator(
    task_id="call_stored_proc",
    postgres_conn_id="redshift_default",
    sql="sql/stored_proc.sql",
)
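The sql argument can be either a path to a .sql file (resolved relative to the DAG file's folder or a configured template_searchpath) or an inline statement, e.g. sql="CALL my_schema.my_stored_proc();" to invoke a stored procedure (schema and procedure names here are placeholders).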
PostgresOperator (Airflow)
How-to Guide for PostgresOperator (Airflow)
If possible, can we achieve all of this using the AWS console only, without the AWS CLI?
No, it's not possible to achieve this only using the AWS console.
Can GCP Dataproc sqoop-import data from a local DB into GCP Storage (without a GCP VPC)?
We have a remote Oracle DB connected to our local network via a VPN tunnel, and each day we use a Hadoop cluster to extract data out of it via Apache Sqoop. We would like to replace this process with a GCP Dataproc cluster to run the Sqoop jobs, and GCP Storage for the output.
I found this article that appears to do something similar, Moving Data with Apache Sqoop in Google Cloud Dataproc, but it assumes that users have a GCP VPC (which I did not intend on purchasing).
So my question is:
Without this VPC connection, would the Cloud Dataproc cluster know how to get the data from the DB on our local network using the job submission API?
If so, how would this work (perhaps I do not understand enough about how Hadoop jobs work / get data)?
If not, is there some other way?
Without using VPC/VPN you will not be able to grant Dataproc access to your local DB.
Instead of using VPC, you can use VPN if it meets your needs better: https://cloud.google.com/vpn/docs/
The only other option you have is to open up your local DB to the Internet so that Dataproc can access it without VPC/VPN, but this is inherently insecure.
Installing the GCS connector on-prem might work in this case. It will not require VPC/VPN.
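A rough sketch of what the on-prem Sqoop job could look like once the GCS connector and service-account credentials are configured on the Hadoop cluster (JDBC URL, credentials, table, and bucket are placeholders):

sqoop import \
  --connect jdbc:oracle:thin:@//oracle-db.internal:1521/ORCL \
  --username SCOTT -P \
  --table MY_TABLE \
  --target-dir gs://my-landing-bucket/sqoop/MY_TABLE \
  --num-mappers 4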
What would be a suitable configuration to set up a 2-3 node Hadoop cluster on AWS?
I want to set up Hive, HBase, Solr, and Tomcat on the Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR, or with EC2 and setting up the cluster manually on it.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (e.g. Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to run your own Hadoop cluster on Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR
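As a rough sketch, a small POC cluster can be created from the AWS CLI like this (region, key pair, instance type, and release label are placeholders; Solr and Tomcat are not EMR-managed applications, so they would need to be installed separately, e.g. via bootstrap actions):

aws emr create-cluster \
  --name "hadoop-poc" \
  --release-label emr-6.10.0 \
  --applications Name=Hadoop Name=Hive Name=HBase \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair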
We want to use Amazon Elastic MapReduce on top of our current DB (we are using Cassandra on EC2). Looking at the Amazon EMR FAQ, it should be possible:
Amazon EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon S3?
However, when creating a new job flow, we can only configure an S3 bucket as the input data origin.
Any ideas/samples on how to do this?
Thanks!
P.S.: I've seen the question How to use external data with Elastic MapReduce, but the answers do not really explain how to do or configure it, only that it is possible.
How are you processing the data? EMR is just managed Hadoop. You still need to write a process of some sort.
If you are writing a Hadoop MapReduce job, then you are writing Java and you can use the Cassandra APIs to access your data.
If you want to use something like Hive, you will need to write a Hive storage handler to use data backed by Cassandra.
Try using scp to copy files to your EMR instance:
my-desktop-box$ scp mylocaldatafile my-emr-node:/path/to/local/file
(or use ftp, or wget, or curl, or anything else you want)
then log into your EMR instance with ssh and load it into HDFS:
my-desktop-box$ ssh my-emr-node
my-emr-node$ hadoop fs -put /path/to/local/file /path/in/hdfs/file