I created a Spark cluster using Dataproc with a Jupyter Notebook attached to it. Then I deleted the cluster, assuming the notebooks would be gone as well. However, after creating another cluster (connected to the same bucket) I can still see my old notebooks. Does that mean the notebooks (or their checkpoints) are stored in my bucket? If not, where are they stored, and how can I make sure they are deleted?
Dataproc lets you create distributed computing clusters (Hadoop, MapReduce, Spark, ...). The cluster is only for processing (you can keep temporary data in its internal HDFS), but all input and output are done through a bucket. Cloud Storage is the newer, internal Google counterpart of HDFS: HDFS is the open-source implementation of the specification Google released publicly, and Google has since improved its own system (Cloud Storage), which remains usable as an HDFS-compatible file system through the Cloud Storage connector.
Therefore, yes, it is normal that your data is still in your Cloud Storage bucket.
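If you want to confirm where the notebooks live and remove them, a minimal sketch with gsutil is below. It assumes the Jupyter optional component, which typically saves notebooks under a notebooks/ prefix in the cluster's Cloud Storage staging bucket; the bucket name and exact prefix are assumptions, not values from your setup.

    # List what the Jupyter component left behind (bucket name and prefix are assumptions):
    gsutil ls -r gs://my-dataproc-staging-bucket/notebooks/

    # Delete the notebooks if you really want them gone:
    gsutil -m rm -r gs://my-dataproc-staging-bucket/notebooks/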
In the hope of achieving functionality in GCP similar to Cloudera Backup and Disaster Recovery to AWS, I am searching for some alternatives.
Will the below approach work?
adding the GCP (Cloud Storage) connector to an on-prem Cloudera cluster
then copying with hadoop distcp
then syncing the HDFS source directory to the GCS directory with gsutil rsync [OPTION]... src_url dst_url
If the above approach is not possible then is there any other alternative to achieve Cloudera BDR in Google Cloud Storage (GCS)?
At the moment, Cloudera Manager's Backup and Disaster Recovery does not support Google Cloud Storage; it is listed in the limitations. Please check the full documentation through this link for Configuring Google Cloud Storage Connectivity.
The above approach will work. We just need to add a few steps to begin with:
We first need to establish a private link between the on-prem network and the Google network using Cloud Interconnect or Cloud VPN.
A Dataproc cluster is needed for the data transfer.
Use the gcloud CLI to connect to your cluster's master instance.
Finally, you can run DistCp commands to move your data.
For more detailed information, you may check this full documentation on Using DistCp to copy your data to Cloud Storage.
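A minimal sketch of that last step, assuming the Cloud Storage connector is already set up on the cluster; the HDFS path and bucket name are placeholders:

    # Copy an HDFS directory into a Cloud Storage bucket with DistCp:
    hadoop distcp hdfs:///user/cloudera/source_dir gs://my-backup-bucket/source_dir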
Google also has its own BDR and you can check this Data Recovery planning guide.
Please be advised that Google Cloud Storage cannot be the default file system for the cluster.
You can also check this link: Working with Google Cloud partners
You can then work with the data in Cloud Storage in any of the following ways:
In a Spark (or PySpark) or Hadoop application using the gs:// prefix.
The hadoop shell: hadoop fs -ls gs://bucket/dir/file.
The Cloud Console Cloud Storage browser.
Using the gsutil cp or gsutil rsync commands.
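For the last option, a minimal sketch (the local export path and bucket name are placeholders):

    # Mirror a local export directory into the bucket; -m parallelizes the transfer:
    gsutil -m rsync -r /data/hdfs_export gs://my-backup-bucket/hdfs_export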
You can check this full documentation on using connectors.
Let me know if you have questions.
I was trying to run a Docker image with Cloud Run and realised that there is no option for adding persistent storage. I found a list of services in https://cloud.google.com/run/docs/using-gcp-services#connecting_to_services_in_code, but all of them are accessed from code. I was looking to share a volume with persistent storage. Is there a way around this? Is it because persistent storage might not work when shared between multiple instances at the same time? Is there an alternative solution?
Cloud Run is serverless: it abstracts away all infrastructure management.
It is also a managed compute platform that automatically scales your stateless containers.
Filesystem access: The filesystem of your container is writable and is subject to the following behavior: it is an in-memory filesystem, so writing to it uses the container instance's memory, and data written to the filesystem does not persist when the container instance is stopped.
You can use Google Cloud Storage, Firestore or Cloud SQL if your application is stateful.
3 Great Options for Persistent Storage with Cloud Run
What's the default storage for Google Cloud Run?
Cloud Run (fully managed) has a list of services that are not yet supported, including Filestore, which is a persistent storage option. However, you can consider running your Docker image on Cloud Run for Anthos, which runs on GKE; there you can use persistent volumes, which are typically backed by Compute Engine persistent disks.
Having persistent storage in (fully managed) Cloud Run should be possible now.
Cloud Run's second generation execution environment (gen2) supports network mounted file systems.
Here are some alternatives:
Cloud Run + GCS: Using Cloud Storage FUSE with Cloud Run tutorial
Cloud Run + Filestore: Using Filestore with Cloud Run tutorial
If you need help deciding between those, check this:
Design an optimal storage strategy for your cloud workload
NOTE: At the time of this answer, Cloud Run gen2 is in Preview.
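As a rough sketch of that prerequisite, deploying on the gen2 execution environment looks something like the command below; the service name, image, and region are placeholders, and the beta track may be needed while gen2 is in Preview:

    # Deploy to the second generation execution environment (required for network file systems):
    gcloud beta run deploy my-service \
      --image gcr.io/my-project/my-image \
      --region us-central1 \
      --execution-environment gen2

The actual mount of the bucket or Filestore share is then set up inside the container (for example with gcsfuse or an NFS client), as described in the tutorials linked above.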
Can GCP Dataproc run Sqoop to import data from a local DB into GCP Storage (without a GCP VPC)?
We have a remote Oracle DB connected to our local network via a VPN tunnel, from which we extract data each day with Apache Sqoop running on a Hadoop cluster. We would like to replace this process with a GCP Dataproc cluster to run the Sqoop jobs and GCP Storage to hold the output.
I found this article that appears to be doing something similar, Moving Data with Apache Sqoop in Google Cloud Dataproc, but it assumes that users have a GCP VPC (which I did not intend on setting up).
So my question is:
Without this VPC connection, would the Dataproc cluster know how to get the data from the DB on our local network using the job submission API?
If so, how would this work (perhaps I do not understand enough about how Hadoop jobs work / get data)?
If not, is there some other way?
Without using VPC/VPN you will not be able to grant Dataproc access to your local DB.
Instead of using VPC, you can use VPN if it meets your needs better: https://cloud.google.com/vpn/docs/
The only other option you have is to open up your local DB to the Internet so Dataproc will be able to access it without VPC/VPN, but this is inherently insecure.
Installing the GCS connector on-prem might work in this case. It will not require VPC/VPN.
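Once connectivity is solved (for example with Cloud VPN), the Sqoop job itself can be submitted to Dataproc and write straight to Cloud Storage. A minimal sketch, loosely following the article linked in the question; the cluster name, region, jar locations, JDBC URL, credentials, and table are all placeholders:

    # Submit Sqoop as a Hadoop job on Dataproc; the import lands in a GCS bucket.
    gcloud dataproc jobs submit hadoop \
      --cluster=my-dataproc-cluster --region=us-central1 \
      --class=org.apache.sqoop.Sqoop \
      --jars=gs://my-bucket/jars/sqoop-1.4.7-hadoop260.jar,gs://my-bucket/jars/ojdbc8.jar \
      -- import \
      --connect jdbc:oracle:thin:@//db.internal.example.com:1521/ORCL \
      --username sqoop_user \
      --password-file gs://my-bucket/secrets/sqoop.pwd \
      --table MY_SCHEMA.MY_TABLE \
      --target-dir gs://my-bucket/sqoop-imports/my_table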
When trying to use Amazon Redshift to create a datasource for my Machine Learning model, I encountered the following error when testing the access of my IAM role:
There is no '' cluster, or the cluster is not in the same region as your Amazon ML service. Specify a cluster in the same region as the Amazon ML service.
Is there any way around this? It would be a huge pain, since all of our development team's data is stored in a region that Machine Learning doesn't work in.
That's an interesting situation to be in.
What you can probably do:
1) Wait for Amazon Web Services to support Amazon ML in your preferred region (that could be a long wait, though).
2) Or create a backup plan for your Redshift data.
Amazon Redshift provides default tools to back up your cluster via snapshots to Amazon Simple Storage Service (Amazon S3). These snapshots can be restored in any AZ in that region, or copied automatically to other regions of your choosing (in your case, the region where Amazon ML is running).
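A minimal sketch of the cross-region snapshot route with the AWS CLI; the cluster identifier, snapshot identifier, and regions are placeholders:

    # Enable automatic copying of snapshots to the region where Amazon ML runs:
    aws redshift enable-snapshot-copy \
      --cluster-identifier my-cluster \
      --destination-region us-east-1

    # Later, restore a copied snapshot into a new cluster in that region:
    aws redshift restore-from-cluster-snapshot \
      --region us-east-1 \
      --cluster-identifier my-cluster-for-ml \
      --snapshot-identifier my-copied-snapshot-id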
There is probably no other way around using Amazon ML with Redshift in a different region.
Hope it helps!
I have a MySQL database pod running within a Google Cloud (GKE) cluster, using a persistent volume (GCEPersistentDisk) to back up my data.
Is there a way to also have an AWS persistent volume (AWSElasticBlockStore) backing up the same pod, in case something goes wrong with Google Cloud Platform and I can't reach any of my data? That way, when I create another Kubernetes pod within AWS, I would be able to get my latest data (from before GCP crashed) from that AWSElasticBlockStore.
If not, what's the best way to simultaneously back up a Kubernetes database pod at two different cloud providers, so that when one crashes, you can still deploy at the other?
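One commonly used pattern (not something from the question itself) is to take scheduled logical backups and push the same archive to object storage in both clouds, rather than trying to attach an EBS volume to a GKE pod. A minimal sketch of such a backup script, which would typically run as a Kubernetes CronJob; the host, credentials, and bucket names are placeholders:

    #!/bin/sh
    # Dump the database, then copy the same archive to both clouds.
    STAMP=$(date +%Y%m%d-%H%M%S)
    mysqldump -h mysql-service -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases \
      | gzip > /tmp/backup-"$STAMP".sql.gz

    # Copy to Google Cloud Storage:
    gsutil cp /tmp/backup-"$STAMP".sql.gz gs://my-gcp-backup-bucket/mysql/

    # Copy to Amazon S3 as a second, independent location:
    aws s3 cp /tmp/backup-"$STAMP".sql.gz s3://my-aws-backup-bucket/mysql/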