Hi, I'm playing around with HDInsight. I just started by creating a cluster, but I'm having trouble understanding how the HDFS is created. Does it use the local disks of the data nodes or the Azure Storage account I selected while creating the cluster? I also want to increase the size of HDFS in my cluster; how can I do that? Should I attach a managed disk to every data node?
Thanks in advance.
Welcome to the HDInsight users family. HDInsight is a fully managed big data service on Azure.
The article https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters, in the section "Storage endpoints for clusters", describes the storage available to HDInsight clusters.
https://learn.microsoft.com/en-us/azure/hdinsight/ is a good starting point.
Let me know if there are any further questions I can answer.
Best,
Amar
The cluster uses Azure Blob storage which isolates your data from your cluster.
While creating an HDInsight cluster you specify the Azure Storage account you want to associate with it. In addition to this storage account, you can add additional storage accounts from the same Azure subscription or different Azure subscriptions during the creation process or after a cluster has been created.
HDInsight uses either Azure Storage or Azure Data Lake Store as the default storage, depending on the option you choose while creating the cluster.
For instructions, see Add additional storage accounts to HDInsight.
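To make the storage model concrete, here is a minimal PySpark sketch (the account, container, and file names are placeholders I made up): on HDInsight the default filesystem is the blob container you picked at creation time, and any additional storage account you attach is reached through its own wasbs:// URI, not through disks on the data nodes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wasbs-demo").getOrCreate()

# Default storage account/container chosen when the cluster was created.
default_df = spark.read.csv("wasbs:///example/data/sample.csv", header=True)

# An additional storage account that was attached to the cluster afterwards.
extra_df = spark.read.csv(
    "wasbs://mycontainer@mysecondaccount.blob.core.windows.net/data/sample.csv",
    header=True,
)

print(default_df.count(), extra_df.count())

So, to grow the usable storage, you generally add capacity (or more storage accounts) on the Azure Storage side rather than attaching managed disks to the worker nodes; the workers' local disks are mainly used for temporary data.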
Related
My org is currently using CLEO as a file transfer tool to move files from GCP storage buckets to Azure Blob Storage. Due to the high volume and size of the files, performance is becoming a concern.
I would like to know what other options there are for copying data files from GCP buckets to Azure Blob Storage. AzCopy is one option I found, but is it performance-efficient? Do I need to install it on Composer or on a VM? And what kind of security access do I need to get the required privileges on Azure Storage? Would that be a service account?
Thanks in advance.
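For comparison with AzCopy, here is a minimal Python sketch of a plain object-by-object copy using the two vendor client libraries; the bucket, container, and credential names are placeholders, both sides are assumed to be authenticated already (a GCP service account for the bucket, a connection string or service principal for the storage account), and a dedicated transfer tool is likely to be faster at very large volumes.

from azure.storage.blob import BlobServiceClient
from google.cloud import storage

GCS_BUCKET = "my-gcs-bucket"                     # hypothetical source bucket
AZURE_CONNECTION_STRING = "<connection-string>"  # hypothetical Azure credentials
AZURE_CONTAINER = "my-container"                 # hypothetical destination container

gcs_bucket = storage.Client().bucket(GCS_BUCKET)
container = BlobServiceClient.from_connection_string(
    AZURE_CONNECTION_STRING
).get_container_client(AZURE_CONTAINER)

for blob in gcs_bucket.list_blobs():
    # Stream each object out of GCS and push it into Azure Blob Storage.
    data = blob.download_as_bytes()
    container.upload_blob(name=blob.name, data=data, overwrite=True)
    print(f"Copied gs://{GCS_BUCKET}/{blob.name} -> {AZURE_CONTAINER}/{blob.name}")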
I created a Spark cluster using Dataproc with a Jupyter Notebook attached to it. Then I deleted the cluster and assumed the notebooks were gone. However, after creating another cluster (connected to the same bucket), I can see my old notebooks. Does that mean the notebooks (or their checkpoints) are stored in my bucket? If not, where are they stored, and how can I make sure they are deleted?
Dataproc lets you create a distributed computing cluster (Hadoop, MapReduce, Spark, ...). The cluster is only for processing: you can keep temporary data in its internal HDFS, but all input and output goes through a Cloud Storage bucket. Cloud Storage is Google's newer, internal evolution of the system HDFS came from (HDFS is the open source implementation of the design Google published; Google has since improved its internal system into Cloud Storage, which Dataproc can still use as an HDFS-compatible filesystem).
Therefore, yes, it is normal that your data is still in your Cloud Storage bucket.
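If you want to confirm this and clean the notebooks up, here is a minimal sketch using the google-cloud-storage client. It assumes the usual layout in which the Jupyter component saves notebooks under notebooks/jupyter/ in the cluster's staging bucket; the bucket name is a placeholder, so check the actual prefix in your own bucket before deleting anything.

# List and delete Jupyter notebooks that Dataproc left in the staging bucket.
from google.cloud import storage

BUCKET_NAME = "my-dataproc-staging-bucket"  # hypothetical bucket name
PREFIX = "notebooks/jupyter/"               # typical notebook location; verify first

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

for blob in bucket.list_blobs(prefix=PREFIX):
    print(f"Deleting gs://{BUCKET_NAME}/{blob.name}")
    blob.delete()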
I can't find any documentation about the storage that Google Cloud Run provides. For example, does it come with a few gigabytes of storage, as when we create a VM?
If not, is there a '/tmp' folder that I can put data into temporarily during the request? If it is available, what are its limits?
If neither is available, what is the recommended way to save temporary data while running on Cloud Run?
Cloud Run is a stateless service platform, and does not have any built-in storage mechanism.
Files can be temporarily stored for processing in a container instance, but this storage comes out of the available memory for the service as described in the runtime contract. Maximum memory available to a service is 8 GB.
For persistent storage the recommendation is to integrate with other GCP services that provide storage or databases.
The top services for this are Cloud Storage and Cloud Firestore.
These two are a particularly good match for Cloud Run because they are the most "serverless"-compatible: they scale horizontally to match Cloud Run's scaling capacity, and they can trigger events on state changes to plug into asynchronous, serverless architectures via Cloud Pub/Sub, Cloud Storage's Registering Object Changes, and Cloud Functions Events & Triggers.
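As a concrete illustration, here is a minimal sketch of a Flask-based Cloud Run service (the bucket name, route, and file names are made up) that uses /tmp only as scratch space and hands anything that must survive the request over to Cloud Storage.

import os
import tempfile

from flask import Flask
from google.cloud import storage

app = Flask(__name__)
BUCKET_NAME = os.environ.get("RESULTS_BUCKET", "my-results-bucket")  # hypothetical bucket

@app.route("/process")
def process():
    # /tmp is an in-memory filesystem on Cloud Run: fine for scratch work, but
    # it disappears with the instance and counts against the memory limit.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", dir="/tmp", delete=False) as tmp:
        tmp.write("intermediate result\n")
        tmp_path = tmp.name

    # Persist the result to Cloud Storage so it outlives the instance.
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob("results/output.txt").upload_from_filename(tmp_path)
    os.remove(tmp_path)
    return "done\n"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))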
The writable disk storage is an in-memory filesystem, limited by the instance memory to a maximum of 8 GB. Anything written to the filesystem is not persisted between instances.
See:
https://cloud.google.com/run/quotas
https://cloud.google.com/run/docs/reference/container-contract#filesystem
https://cloud.google.com/run/docs/reference/container-contract#memory
How can we programmatically find details about GCP infrastructure, such as the various folders, projects, Compute Engine instances, datasets, etc.? This would help build a better understanding of the GCP platform.
Regards,
Neeraj
There is a service in GCP called Cloud Asset Inventory. It is a metadata inventory service that keeps a five-week history of Google Cloud Platform (GCP) asset metadata.
It allows you to export all asset metadata at a certain timestamp to Google Cloud Storage or BigQuery.
It also allows you to search resources and IAM policies.
It supports a wide range of resource types, including:
Resource Manager
google.cloud.resourcemanager.Organization
google.cloud.resourcemanager.Folder
google.cloud.resourcemanager.Project
Compute Engine
google.compute.Autoscaler
google.compute.BackendBucket
google.compute.BackendService
google.compute.Disk
google.compute.Firewall
google.compute.HealthCheck
google.compute.Image
google.compute.Instance
google.compute.InstanceGroup
...
Cloud Storage
google.cloud.storage.Bucket
BigQuery
google.cloud.bigquery.Dataset
google.cloud.bigquery.Table
Find the full list here.
The equivalent service in AWS is called AWS Config.
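A minimal sketch of querying it programmatically with the google-cloud-asset Python client is below; the project ID is a placeholder, and the asset type strings use the API's service/Type format rather than the shorter names listed above.

from google.cloud import asset_v1

PROJECT_ID = "my-project-id"  # hypothetical project

client = asset_v1.AssetServiceClient()
response = client.list_assets(
    request={
        "parent": f"projects/{PROJECT_ID}",
        "asset_types": [
            "compute.googleapis.com/Instance",
            "bigquery.googleapis.com/Dataset",
        ],
        "content_type": asset_v1.ContentType.RESOURCE,
    }
)

# Print every matching asset together with its type.
for asset in response:
    print(asset.name, asset.asset_type)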
I have found an open source tool named "Forseti Security", which is easy to install and use. It has five major components:
Inventory: regularly collects data from GCP and stores the results in Cloud SQL in the table "gcp_inventory". To get the latest inventory information, look at the maximum value of the column inventory_index_id (see the sketch after this list).
Scanner: periodically compares the policies applied to GCP resources against the data collected by Inventory. It stores the scanner information in the table "scanner_index".
Explain: helps you manage Cloud IAM policies.
Enforcer: uses the Google Cloud APIs to enforce the policies you have set on the GCP platform.
Notifier: sends notifications to Slack, Cloud Storage, or SendGrid.
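Here is the sketch referenced above for reading the latest Inventory run straight out of Forseti's Cloud SQL (MySQL) database. The connection details and database name are placeholders; only the table and column names come from the component description above.

import pymysql

conn = pymysql.connect(
    host="10.0.0.5",             # hypothetical Cloud SQL proxy / private IP
    user="forseti",              # hypothetical user
    password="change-me",        # hypothetical password
    database="forseti_security", # hypothetical database name
)

with conn.cursor() as cur:
    # The latest inventory run is the one with the highest inventory_index_id.
    cur.execute("SELECT MAX(inventory_index_id) FROM gcp_inventory")
    (latest_index,) = cur.fetchone()

    cur.execute(
        "SELECT COUNT(*) FROM gcp_inventory WHERE inventory_index_id = %s",
        (latest_index,),
    )
    (row_count,) = cur.fetchone()
    print(f"Inventory {latest_index} contains {row_count} rows")

conn.close()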
You can find the official documentation here.
I tried using this tool and found it really useful.
Let's say a company has an application with a database hosted on AWS, along with a read replica, also on AWS. That same company then wants to build out a data analytics infrastructure in Google Cloud, to take advantage of Google Cloud's data analysis and ML services.
Is it necessary to create an additional read replica within Google Cloud? If not, is there an alternative strategy that is commonly used to bridge the two cloud providers?
While services like Amazon Relational Database Service (RDS) provide read-replica capabilities, they only replicate between managed database instances on AWS.
If you are replicating a database between providers, then you are probably running the database yourself on virtual machines rather than using a managed service. The databases then appear like any other resources on the Internet, so you can connect them exactly as you would connect any two resources across the Internet. However, you would be responsible for managing, monitoring, deploying, and so on, which takes away much of the benefit of using managed cloud services.
Replicating between storage services like Amazon S3 would be easier, since it is just raw data rather than a running database. Big data is also normally stored in raw form rather than being loaded into a database.
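As a rough illustration of that raw-data path, here is a minimal sketch that copies objects from an S3 bucket into a Cloud Storage bucket using boto3 and google-cloud-storage; the bucket names are placeholders and both SDKs are assumed to be authenticated already.

import boto3
from google.cloud import storage

S3_BUCKET = "my-aws-data-bucket"   # hypothetical source bucket
GCS_BUCKET = "my-gcp-data-bucket"  # hypothetical destination bucket

s3 = boto3.resource("s3")
gcs_bucket = storage.Client().bucket(GCS_BUCKET)

for obj in s3.Bucket(S3_BUCKET).objects.all():
    # Stream each S3 object and re-upload it to Cloud Storage.
    body = obj.get()["Body"].read()
    gcs_bucket.blob(obj.key).upload_from_string(body)
    print(f"Copied s3://{S3_BUCKET}/{obj.key} -> gs://{GCS_BUCKET}/{obj.key}")

For large volumes, a managed option such as Google's Storage Transfer Service, which can read directly from S3, avoids funneling all the data through a single machine.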
If the existing infrastructure is on a cloud provider, then try to perform the remaining activities on the same cloud provider.