Syncing HDFS data and Google Cloud Storage data (for BDR)

In the hope of achieving something like Cloudera Backup and Disaster Recovery (BDR) to AWS, but targeting GCP, I am searching for alternatives.
Will the approach below work?
adding the GCS connector to an on-prem Cloudera cluster
then copying with hadoop distcp
then syncing the HDFS source directory to the GCS directory with gsutil rsync [OPTION]... src_url dst_url
If the above approach is not possible, is there any other alternative for achieving Cloudera BDR to Google Cloud Storage (GCS)?

At the moment, Cloudera Manager's Backup and Disaster Recovery does not support Google Cloud Storage; this is listed in the limitations. Please check the full documentation through this link for Configuring Google Cloud Storage Connectivity.
The above approach will work. We just need to add a few steps to begin with:
We first need to establish a private link between the on-prem network and the Google network using Cloud Interconnect or Cloud VPN.
A Dataproc cluster is needed for the data transfer.
Use the gcloud CLI to connect to your cluster's master instance.
Finally, you can run DistCp commands to move your data, as sketched below.
For more detailed information, you may check the full documentation on Using DistCp to copy your data to Cloud Storage.
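For illustration, a minimal sketch of those last two steps might look like this (the cluster name, zone, bucket, and paths are placeholders, not values from your setup):
gcloud compute ssh my-dataproc-cluster-m --zone=us-central1-a
hadoop distcp hdfs:///user/backup/data gs://my-backup-bucket/backup/data
Adding the -update flag to distcp copies only files that have changed, which is useful for repeated incremental syncs.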
Google also has its own BDR and you can check this Data Recovery planning guide.
Please be advised that Google Cloud Storage cannot be the default file system for the cluster.
You can also check this link: Working with Google Cloud partners
Once the connector is installed, you can work with data in Cloud Storage in any of the following ways:
In a Spark (or PySpark) or Hadoop application, using the gs:// prefix (see the sketch after this list).
The Hadoop shell: hadoop fs -ls gs://bucket/dir/file.
The Cloud Console Cloud Storage browser.
Using the gsutil cp or gsutil rsync commands.
You can check the full documentation on using the connectors.
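As a rough sketch of the first option (PySpark reading and writing gs:// paths directly; the bucket and paths are placeholders, and it assumes the connector is already on the cluster's classpath):
from pyspark.sql import SparkSession

# Assumes the Cloud Storage connector is installed on the cluster
spark = SparkSession.builder.appName("gcs-example").getOrCreate()

# Read from and write to GCS using the gs:// prefix (placeholder bucket and paths)
df = spark.read.parquet("gs://your-bucket/input/")
df.write.mode("overwrite").parquet("gs://your-bucket/output/")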
Let me know if you have questions.

Related

How to store bacula (community-edition) backup on Amazon S3?

I am using a CentOS 7 machine, Bacula community edition 11.0.5, and a PostgreSQL database.
Bacula is used to take full and incremental backups.
I followed the document linked below to store the backup in an Amazon S3 bucket.
https://www.bacula.lat/community/bacula-storage-in-any-cloud-with-rclone-and-rclone-changer/?lang=en
I configured the storage daemon as shown in the link above. The backup succeeds and the backed-up file is stored in the given path /mnt/vtapes/tapes, but the backup file is not moving from /mnt/vtapes/tapes to the AWS S3 bucket.
The document above mentions that we need to create schedule routines to the cloud in order to move the backup file from /mnt/vtapes/tapes to the Amazon S3 bucket.
I am not aware of what these cloud schedule routines in AWS are. Is this a Lambda function or something else?
Is there any S3 cloud driver that supports Bacula backups, or any other way to store Bacula community backup files on Amazon S3, other than S3FS-FUSE and libs3?
The link you shared is for Bacula Enterprise, but we are using Bacula Community. Is there any related document you can suggest for the Bacula community edition?
Bacula Community includes the AWS S3 cloud driver starting from version 9.6.0. Check https://www.bacula.org/11.0.x-manuals/en/main/main.pdf, Chapter 3, New Features in 9.6.0, and additionally section 4.0.1, New Commands, Resource, and Directives for Cloud. This is the exact same driver available in the Enterprise version.
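As a rough, hedged sketch of the storage daemon side (directive names follow the Cloud resource described in the manual above; every value here is a placeholder to adjust):
Cloud {
  Name = AWS_S3
  Driver = "S3"
  HostName = "s3.amazonaws.com"
  BucketName = "your-bacula-bucket"
  AccessKey = "YOUR_ACCESS_KEY"
  SecretKey = "YOUR_SECRET_KEY"
  Protocol = HTTPS
  UriStyle = VirtualHost
}

Device {
  Name = CloudStorage
  Device Type = Cloud
  Cloud = AWS_S3
  Media Type = CloudType
  Archive Device = /mnt/vtapes/tapes
  LabelMedia = yes
}
With a setup along these lines, the cloud driver itself transfers volume parts from the local cache to the bucket (the Upload directive described in the manual controls when), so no external schedule routine is needed to move files out of /mnt/vtapes/tapes.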

Migrating GCP Object storage data to AWS S3 Bucket

We have terabytes of data in Google Cloud object storage that we want to migrate to AWS S3. What are the best ways to do it? Is there any third-party tool that would work better than a direct transfer?
There are multiple options available, even without using any transfer appliance (cloud-to-cloud migration), and in less time.
Use gsutil to copy data from a Google Cloud Storage bucket to an Amazon S3 bucket, using a command such as:
gsutil -m rsync -r gs://your-gcp-bucket s3://your-aws-s3-bucket
More details are available at https://cloud.google.com/storage/docs/gsutil/commands/rsync
Note: if you run into throughput limits with the default Cloud Shell, you can create a larger Compute Engine VM and execute the above command from there.
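Note also that for gsutil to write to s3:// URLs it needs AWS credentials in its boto configuration; a minimal sketch of the relevant ~/.boto entries (placeholder values) is:
[Credentials]
aws_access_key_id = YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key = YOUR_AWS_SECRET_ACCESS_KEY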

What is the difference between gcloud and gsutil?

I want to know the difference between gcloud and gsutil. Where do we use which? Why do certain commands begin with gsutil while others begin with gcloud?
The gsutil command is used only for Cloud Storage.
With the gcloud command, you can interact with other Google Cloud products like App Engine, Google Kubernetes Engine, etc. You can have a look here and here for more info.
gsutil is a Python application that lets you access Google Cloud Storage from the command line. You can use gsutil to do a wide range of bucket and object management tasks, including:
Creating and deleting buckets.
Uploading, downloading, and deleting objects.
Listing buckets and objects.
Moving, copying, and renaming objects.
Editing object and bucket ACLs.
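For example (bucket, object, and account names are placeholders):
gsutil mb gs://my-example-bucket
gsutil cp local-file.txt gs://my-example-bucket/
gsutil ls gs://my-example-bucket
gsutil mv gs://my-example-bucket/local-file.txt gs://my-example-bucket/renamed.txt
gsutil acl ch -u someone@example.com:R gs://my-example-bucket/renamed.txt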
The gcloud command-line interface is the primary CLI tool to create and manage Google Cloud resources. You can use this tool to perform many common platform tasks either from the command line or in scripts and other automations.
For example, you can use the gcloud CLI to create and manage:
Google Compute Engine virtual machine instances and other resources,
Google Cloud SQL instances,
Google Kubernetes Engine clusters,
Google Cloud Dataproc clusters and jobs,
Google Cloud DNS managed zones and record sets,
Google Cloud Deployment Manager deployments.
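For example (names, zones, and machine types below are placeholders):
gcloud compute instances create my-vm --zone=us-central1-a --machine-type=e2-medium
gcloud container clusters create my-gke-cluster --zone=us-central1-a
gcloud dataproc clusters create my-dataproc-cluster --region=us-central1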
"gcloud" can create and manage Google Cloud resources while "gsutil" cannot do so.
"gsutil" can manipulate buckets, bucket's objects and bucket ACLs on GCS(Google Cloud Storage) while "gcloud" cannot do so.
With gcloud storage you can now do everything that you can do with gsutil. Look here: https://cloud.google.com/blog/products/storage-data-transfer/new-gcloud-storage-cli-for-your-data-transfers and also ACLs on objects: https://cloud.google.com/sdk/gcloud/reference/storage/objects/update
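For example, the gcloud storage equivalents of everyday gsutil operations (placeholder names):
gcloud storage cp local-file.txt gs://my-example-bucket/
gcloud storage ls gs://my-example-bucket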

How to download a file from the internet to a Google Cloud bucket directly

I want to download a file of over 20 GB from the internet directly into a Google Cloud bucket, just like doing the following on a local command line:
wget http://some.url.com/some/file.tar
I refuse to download the file to my own computer and then copy it to the bucket using:
gsutil cp file.tar gs://the-bucket/
For the moment I am trying to use Datalab to download the file and then copy it from there to the bucket.
A capability of the Google Cloud Platform as it relates to Google Cloud Storage is the functional area known as "Storage Transfer Service". The documentation for this is available here.
At the highest level, this capability allows you to define a source of data that is external to Google such as data available as a URL or on AWS S3 storage and then schedule that to be copied to Google Cloud Storage in the background. This function seems to perform the task you want ... the data is copied from an Internet source to GCS directly.
A completely different story would be the realization that GCP itself provides compute capabilities. What this means is that you can run your own logic on GCP through simple mechanisms such as a VM, Cloud Functions or Cloud Run. This helps us in this story by realizing that we could execute our code to download the Internet-based data from within GCP itself to a local temp file. That file could then be uploaded into GCS from within GCP. At no time does the data that will end up in GCP go anywhere other than from the source to Google. Once retrieved from the source, the transfer rate of the data from the GCP compute to GCS storage should be optimal, as it is passing exclusively over Google's internal ultra-high-speed networks.
You can run the curl http://some.url.com/some/file.tar | gsutil cp - gs://YOUR_BUCKET_NAME/file command from inside Cloud Shell on GCP. That way it never uses your own network and stays entirely within GCP.
For large files, one-liners will very often fail, as will the Google Storage Transfer Service. Part two of Kolban's answer is then needed, and I thought I'd add a little more detail as it can take time to figure out the easiest way of actually downloading to a google compute instance and uploading to a bucket.
For many users, I believe the easiest way will be to open a notebook from the Google AI Platform and do the following:
%pip install wget
import wget
from google.cloud import storage  # Already available in the notebook environment, no install required

# Download the source file to the notebook VM's local disk
wget.download('source_url', 'temp_file_name')

# Upload the downloaded file to the target bucket
client = storage.Client()
bucket = client.get_bucket('target_bucket')
blob = bucket.blob('upload_name')
blob.upload_from_filename('temp_file_name')
No need to set up an environment, it benefits from the convenience of notebooks, and the client will have automatic access to your bucket if the notebook is hosted under the same GCP account.
I found a similar post where it is explained that you can download a file from the web and copy it to your bucket in just one command line:
curl http://some.url.com/some/file.tar | gsutil cp - gs://YOUR_BUCKET_NAME/file.tar
I tried it in my own bucket and it works correctly, so I hope this is what you are expecting.

Can GCP Dataproc sqoop data (or run other jobs on) from local DB?

Can a GCP Dataproc cluster use Sqoop to import data from a local DB into GCP Storage (without a GCP VPC)?
We have a remote Oracle DB connected to our local network via a VPN tunnel, and we use a Hadoop cluster to extract data out of it each day via Apache Sqoop. We would like to replace this process with a GCP Dataproc cluster to run the Sqoop jobs and GCP Storage.
I found this article that appears to be doing something similar, Moving Data with Apache Sqoop in Google Cloud Dataproc, but it assumes that users have a GCP VPC (which I did not intend on purchasing).
So my question is:
Without this VPC connection, would the Cloud Dataproc cluster know how to get the data from the DB on our local network using the job submission API?
If so, how would this work (perhaps I do not understand enough about how Hadoop jobs work / get data)?
If not, is there some other way?
Without using VPC/VPN you will not be able to grant Dataproc access to your local DB.
Instead of using VPC, you can use VPN if it meets your needs better: https://cloud.google.com/vpn/docs/
The only other option you have is to open up your local DB to the Internet so that Dataproc can access it without VPC/VPN, but this is inherently insecure.
Installing the GCS connector on-prem might work in this case. It will not require VPC/VPN.
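If you go that route, a rough sketch of an on-prem Sqoop import writing straight to a bucket through the connector could look like this (the connection string, credentials, table, and bucket are all placeholders):
sqoop import \
  --connect jdbc:oracle:thin:@//your-db-host:1521/YOURSERVICE \
  --username your_user --password-file hdfs:///user/you/.db_password \
  --table YOUR_TABLE \
  --target-dir gs://your-bucket/sqoop/YOUR_TABLE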