hbase using google cloud storage (bucket) - google-cloud-platform

I'm new to HBase and exploring a few options.
Can I use a GCS bucket as the backing filesystem for HBase on Google Cloud?
I know there's Google's Bigtable; I'm assuming that setting up our own cluster with GCS as the filesystem will reduce cost. Agree?

You can try the following as an experiment. I'm almost certain that it will work in principle, but I would be concerned with the impact on speed and performance, because HBase can be very IO intensive. So try this at small scale just to see if it works, and then try to stress-test it at scale:
HBase uses HDFS as the underlying file storage system. So your HBase config should point to an HDFS cluster that it writes to.
Your HDFS cluster, in turn, needs to point to a physical file directory to which it will write its files. So this really is just a physical directory on each HDFS Datanode in the HDFS cluster. Assuming you want to set up your HBase/HDFS cluster on a set of VMs in GCP, this is really just a directory on each VM that runs an HDFS Datanode.
GCP has a utility called gcsfuse (Cloud Storage FUSE), at least for Linux VMs. It allows you to mount a GCP Cloud Storage bucket as a local directory on a Linux VM. Once you make it work, you can read/write files directly to/from the GCS bucket as if it were just a directory on your Linux VM.
So why not do this for the directory that is configured as your underlying physical storage in HDFS? Simply fuse that directory to a GCS bucket and it should work, at least in theory. I'd start with actually fusing a directory first and writing a couple of files to it. Once you make that work, simply point your HDFS to that directory.
I don't see a reason why the above shouldn't work, but I'm not sure whether gcsfuse is meant for very heavy and fast IO loads. You can give it a try, though!
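For example, a minimal sketch of fusing a directory to a bucket with gcsfuse on one of the VMs (the bucket name and mount point are placeholders, and gcsfuse needs to be installed first per its own documentation):
# Create a mount point and mount the bucket (my-hbase-bucket is a placeholder)
mkdir -p /home/data/gcs-mount
gcsfuse my-hbase-bucket /home/data/gcs-mount
# Sanity check: write and read a file through the mount
echo "hello" > /home/data/gcs-mount/test.txt
cat /home/data/gcs-mount/test.txt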
If you don't know how/where to configure all this at the HBase and HDFS level, all you have to do is this:
In HBase, edit the hbase-site.xml file in the ../HBase/conf folder:
<property>
<name>hbase.rootdir</name>
<value>hdfs://namenode-address:9000/hbase</value>
</property>
So this entry tells HBase that it needs to write its files into HDFS, whose Namenode is running at address namenode-address, port 9000.
From here, in HDFS, edit hdfs-site.xml in the ../hadoop/etc/hadoop folder:
<property>
<name>dfs.data.dir</name>
<value>file:///home/data/hdfs1/datanode</value>
</property>
This tells each HDFS Datanode where exactly to write its physical files. This could be the directory that is fused to the GCP bucket.
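For instance, if you had fused /home/data/gcs-mount to a bucket as sketched above, the Datanode entry would simply point there instead (note that newer Hadoop versions name this property dfs.datanode.data.dir):
<property>
<name>dfs.data.dir</name>
<value>file:///home/data/gcs-mount/datanode</value>
</property>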
Another thought: If I'm not mistaken, HBase might now be available in GCP as part of Dataproc? I haven't used GCP for a while, and back in the day there was no HBase in Dataproc, but I think it's been added since.

Use Bigtable as an alternative to HBase. The Cloud Bigtable open-source client libraries implement the same interfaces as the HBase client libraries, making Bigtable HBase-compatible. As for your last question, there's no way to connect Bigtable to Cloud Storage as its backing store.

Related

Is it possible to run HBase on AWS but have it store data in / point to HDFS?

I just wanted to know if this question is even relevant.
I tried to understand many blogs but could not reach a conclusion.
Yes, you can run HBase on Amazon EMR. And you can choose either S3 (via EMRFS) or native HDFS (on cluster):
It utilizes Amazon S3 (with EMRFS) or the Hadoop Distributed Filesystem (HDFS) as a fault-tolerant datastore.
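As a rough sketch only (not a complete EMR setup): with HBase on S3, hbase.rootdir points at a bucket rather than at HDFS, and EMR additionally has to be told to use S3 as its HBase storage mode (via the hbase.emr.storageMode setting in its configuration classifications). The bucket name below is a placeholder:
<property>
<name>hbase.rootdir</name>
<value>s3://my-emr-hbase-bucket/hbase</value>
</property>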

How to schedule importing data files from SFTP server located on compute engine instance into BigQuery?

What I want to achieve:
Transfer the hourly data files arriving on an SFTP file server (located on a Compute Engine VM) from several different feeds into BigQuery, with real-time updates, effectively and cost-efficiently.
Context:
The software I am trying to import data from is old legacy software and does not support direct exports to the cloud, so a direct connection from the software to the cloud isn't an option.
It does, however, support exporting data to an SFTP server, which is not accessible via any GCP tools directly.
So I have set up an SFTP server manually using vsftpd on a Compute Engine VM instance with expandable storage, given it a static IP, and hardwired that IP into my software. Data now arrives on the Compute Engine instance at hourly intervals seamlessly.
Files are generated on an hourly basis, so there is a different file for each hour. However, they might contain some duplication, i.e. some of the end records of the previous hour's file may overlap with the beginning of the current hour's file.
Files are coming from different source feeds and I have the feed names in the filenames, so the ever-growing data on my Compute Engine VM instance looks like:
feed1_210301_0500.csv
feed2_210301_0500.csv
feed3_210301_0500.csv
feed1_210301_0600.csv
feed2_210301_0600.csv
feed3_210301_0600.csv
feed1_210301_0700.csv
feed2_210301_0700.csv
feed3_210301_0700.csv
...
What I have tried:
I have set BigQuery access and Cloud Storage permissions on the VM instance so that data can be moved from the VM instance into BigQuery.
I have tried importing data into BigQuery directly, as well as via Google Cloud Storage, yet I could not find an option to directly import data from the VM instance into BigQuery, nor to move data from the VM to GCS and load it into BigQuery from there. The documentation is silent on the matter of scheduled transfers as well.
There are some external data transfer services like Fivetran and HevoData, but they are relatively expensive and also seem like overkill, as both my source and destination are on GCP; it wouldn't be much different from having a third VM and scheduling some scripts for imports (which, by the way, is my current workaround, i.e. using Python scripts to stream data into BigQuery as explained here).
Currently I am exploring Data Fusion, which is only free for 120 hours each month and has extra costs for the underlying Dataproc clusters it runs pipelines on, and I'm not sure it's the right way to go. I am also exploring tools like Cloud Scheduler and Cloud Composer to see if any of them fits my data needs, but as of now I could not find a viable solution.
I am happy to learn any new tools and technologies, and any advice for improving the situation in any way is also appreciated.
I just tried uploading directly from the GCE VM and it worked flawlessly. I enabled BigQuery in the Cloud API access scopes, created a file (test_data.csv) with some random data that satisfies the schema of the table (test_table) I have in a BigQuery dataset (test_dataset), and ran:
bq load test_dataset.test_table test_data.csv
You could use the Storage Transfer Service for on-premises data (it can be scheduled) to move the files into GCS, and then schedule a BigQuery Data Transfer Service transfer from GCS into BigQuery.
If neither this nor the external data transfer services work for you, then I believe your best bet is to create a script that runs a scheduled batch load of the data from your VM into BigQuery.
Maybe this other answer might help you as well.
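For instance, a minimal sketch of such a scheduled batch load using cron and the bq CLI on the VM (the SFTP landing directory, dataset, and table names below are placeholders, and deduplication of overlapping records is not handled):
#!/bin/bash
# load_feeds.sh - load the most recent hourly file of each feed into an existing BigQuery table
for feed in feed1 feed2 feed3; do
  latest=$(ls -1 /srv/sftp/${feed}_*.csv | sort | tail -n 1)
  bq load --source_format=CSV my_dataset.${feed}_table "$latest"
done
# Scheduled hourly via crontab, e.g.:
# 5 * * * * /home/user/load_feeds.sh >> /var/log/load_feeds.log 2>&1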

How Can we share files between multiple Windows VMs in GCP?

I have 10 Windows VMs where I want a persistent disk mounted read-write on all of them. But I came to know that we cannot attach a disk to multiple VMs in read-write mode. So I am looking for an option where I can access a disk from any of those VMs. For Linux we can use gcsfuse to mount a Cloud Storage bucket as a disk; do we have any option for Windows where we can mount a single disk or Cloud Storage bucket on multiple Windows VMs?
If you want it specifically to be a GCP disk, your best option will be setting up an additional Windows instance and sharing a disk from it over SMB with the other instances.
Another option, if you don't want to get too messy, would be using the Filestore service ( https://cloud.google.com/filestore/ ), which is NFS as a service, provided you have an NFS client for your Windows version.
I believe you could use Google Cloud Storage buckets, which could serve as an intermediate transfer point between your instances, regardless of OS.
Upload your files from your workstation to a Cloud Storage bucket. Then, download those files from the bucket to your instances. When you need to transfer files in the other direction, reverse the process. Upload the files from your instance and then download those files to your workstation.
To achieve this, follow these steps (a short gsutil sketch follows the list):
Create a new Cloud Storage bucket or identify an existing bucket that you want to use to transfer files.
Upload files to the bucket.
Connect to your instance using RDP.
Upload/download files from the bucket.
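For illustration, a sketch of these transfers using the Cloud SDK's gsutil from PowerShell (the bucket name and paths are placeholders):
# On your workstation: push a file to the transfer bucket
gsutil cp C:\data\report.xlsx gs://my-transfer-bucket/
# On the Windows VM (over RDP): pull it down
gsutil cp gs://my-transfer-bucket/report.xlsx C:\data\
# Reverse the two commands to move files in the other direction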
However, there are other options, like using file servers on Compute Engine, including the following:
Cloud Storage
Compute Engine persistent disks
Single Node File Server
Elastifile
Quobyte
Avere vFXT
These options have their advantages and disadvantages; for more details, refer to the links attached to each of these options.

HDF to HDP data store

We have two clusters. One is an HDF cluster that includes NiFi, and the other is an HDP cluster that includes HDFS, Hive, and other components. We are reading data from a file and want to place it in the HDP cluster's HDFS.
Can anybody point me to documentation regarding this, or to some examples?
Thanks in advance
NiFi's PutHDFS processor will write data to HDFS. You configure it with your hdfs-site.xml and core-site.xml files.
Sometimes network, security, or application configurations make it difficult to securely write files from a remote NiFi to a Hadoop cluster. A common pattern is to use two NiFis - one NiFi collects, formats, and aggregates records before transmitting to a second NiFi inside the Hadoop cluster via NiFi site-to-site protocol. Because the second NiFi is inside the Hadoop cluster, it can make it easier to write files securely to HDFS.
PutHDFS features in a couple of NiFi Example Dataflow Templates, which also demonstrate commonly related activities like aggregating data, directory and file naming, and NiFi site-to-site communication.

How to configure the PutHDFS processor in Apache NiFi so that I can transfer a file from a local machine to HDFS over the network?

I have data in a file on my local Windows machine. The local machine has Apache NiFi running on it. I want to send this file to HDFS over the network using NiFi. How could I configure the PutHDFS processor in NiFi on the local machine so that I could send data to HDFS over the network?
Thank you!
You need to copy the core-site.xml and hdfs-site.xml from one of your hadoop nodes to the machine where NiFi is running. Then configure PutHDFS so that the configuration resources are "/path/to/core-site.xml,/path/to/hdfs-site.xml". That is all that is required from the NiFi perspective, those files contain all of the information it needs to connect to the Hadoop cluster.
You'll also need to ensure that the machine where NiFi is running has network access to all of the machines in your Hadoop cluster. You can look through those config files and find any hostnames and IP addresses and make sure they can be accessed from the machine where NiFi is running.
Using the GetFile processor, or the combination of ListFile/FetchFile, it would be possible to bring this file from your local disk into NiFi and pass it on to the PutHDFS processor. The PutHDFS processor relies on the associated core-site.xml and hdfs-site.xml files in its configuration.
Just add the Hadoop configuration file paths to the first field (Hadoop Configuration Resources), e.g.
$HADOOP_HOME/etc/hadoop/hdfs-site.xml, $HADOOP_HOME/etc/hadoop/core-site.xml
then set the HDFS directory where the ingested data should be stored in the Directory field, and leave everything else at its defaults.