HDF to HDP data store - HDFS

We have two clusters. One is an HDF cluster that includes NiFi, and the other is an HDP cluster that includes HDFS, Hive, and other components. We are reading data from a file and want to place it in the HDP cluster's HDFS.
Can anybody point me to documentation or examples for this?
Thanks in advance

NiFi's PutHDFS processor will write data to HDFS. You configure it with your hdfs-site.xml and core-site.xml files.
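As a rough sketch, the PutHDFS properties involved look something like this (the file paths and the target directory are placeholders for your environment):
Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Directory: /data/landing
Conflict Resolution Strategy: replace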
Sometimes network, security, or application configurations make it difficult to securely write files from a remote NiFi to a Hadoop cluster. A common pattern is to use two NiFis - one NiFi collects, formats, and aggregates records before transmitting to a second NiFi inside the Hadoop cluster via the NiFi site-to-site protocol. Because the second NiFi is inside the Hadoop cluster, it is much easier for it to write files securely to HDFS.
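As a rough sketch of that pattern (hostnames and ports are placeholders, and the exact property names can vary slightly between NiFi versions): the NiFi inside the Hadoop cluster enables site-to-site in nifi.properties and exposes an Input Port that feeds PutHDFS, while the edge NiFi sends to it through a Remote Process Group pointed at the in-cluster NiFi's URL.
# nifi.properties on the NiFi inside the Hadoop cluster
nifi.remote.input.host=nifi-in-cluster.example.com
nifi.remote.input.socket.port=10443
nifi.remote.input.secure=true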
PutHDFS features in a couple of the NiFi Example Dataflow Templates, which also demonstrate related activities like aggregating data, directory and file naming, and NiFi site-to-site communication.

Related

HBase using Google Cloud Storage (bucket)

I'm new to HBase and exploring a few options.
Can I use the GCS file system (a Google Cloud Storage bucket) as the backing store for HBase on Google Cloud?
I know there's Google's Bigtable; I'm assuming that setting up our own cluster with GCS as the filesystem will reduce cost. Agree?
You can try the following as an experiment. I'm almost certain that it will work in principle, but I would be concerned with the impact on speed and performance, because HBase can be very IO intensive. So try this at small scale just to see if it works, and then try to stress-test it at scale:
HBase uses HDFS as the underlying file storage system. So your HBase config should point to an HDFS cluster that it writes to.
Your HDFS cluster, in turn, needs to point to a physical file directory to which it will write its files. So this is just a physical directory on each Datanode in the HDFS cluster. Assuming you want to set up your HBase/HDFS cluster on a set of VMs in GCP, this is simply a directory on each VM that runs an HDFS Datanode.
GCP has a utility called gcsfuse (Cloud Storage FUSE), at least for Linux VMs. It allows you to fuse a local directory on a Linux VM to a GCP cloud bucket. Once you set it up, you can read/write files directly to/from the GCP bucket as if it were just a directory on your Linux VM.
So why not do this for the directory that is configured as your underlying physical storage in HDFS? Simply fuse that directory to a GCP cloud bucket and it should work, at least in theory. I'd start with actually fusing a directory first and writing a couple of files to it. Once you have that working, simply point your HDFS to that directory.
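As a small sketch of that first step (the bucket name and mount point are placeholders, and gcsfuse and gsutil need to be installed on the VM):
mkdir -p /mnt/hdfs-data
gcsfuse my-hbase-bucket /mnt/hdfs-data
# write a test file through the mount and confirm it shows up in the bucket
echo "hello" > /mnt/hdfs-data/test.txt
gsutil ls gs://my-hbase-bucket/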
I don't see a reason why the above shouldn't work, but I'm not sure whether gcsfuse is meant for very heavy and fast IO loads. Still, you can give it a try!
If you don't know how/where to configure all of this at the HBase and HDFS level, all you have to do is this:
In HBase, edit the hbase-site.xml file in the ../HBase/conf folder:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode-address:9000/hbase</value>
</property>
So this entry tells HBase that it needs to write its files into HDFS, whose NameNode is running at address namenode-address, port 9000.
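A quick sanity check before starting HBase (assuming the HDFS client is on your path) is to confirm that this address is actually reachable:
hdfs dfs -ls hdfs://namenode-address:9000/
HBase will create the /hbase directory itself on first startup, as long as the user it runs as is allowed to write there.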
From here, in HDFS, edit hdfs-site.xml in the ../hadoop/etc/hadoop folder:
<property>
  <name>dfs.data.dir</name>
  <value>file:///home/data/hdfs1/datanode</value>
</property>
This tells each HDFS Datanode where exactly to write its physical files. This could be the directory that is fused to the GCP bucket.
Another thought: If I'm not mistaken, HBase might now be available in GCP as part of Dataproc? I haven't used GCP for a while, and back in the day there was no HBase in Dataproc, but I think it's been added since.
Use Bigtable as an alternative to HBase. The Cloud Bigtable open-source client libraries implement the same set of interfaces as the HBase client libraries, making it HBase compliant. As for your last question, there's no way you could connect Bigtable with Cloud Storage.

Is it possible to run HBase on AWS but have it store/point to HDFS?

I just wanted to know if this question is even relevant.
I tried reading many blogs but could not reach a conclusion.
Yes, you can run HBase on Amazon EMR, and you can choose either S3 (via EMRFS) or native HDFS (on the cluster) as its storage:
It utilizes Amazon S3 (with EMRFS) or the Hadoop Distributed Filesystem (HDFS) as a fault-tolerant datastore.
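If you go the EMR route, here is a rough sketch of creating such a cluster from the AWS CLI (the cluster name, release label, instance settings, and bucket are all placeholders; the hbase.emr.storageMode setting is what switches between S3 and the default on-cluster HDFS):
cat > hbase-on-s3.json <<'EOF'
[
  { "Classification": "hbase",      "Properties": { "hbase.emr.storageMode": "s3" } },
  { "Classification": "hbase-site", "Properties": { "hbase.rootdir": "s3://my-hbase-bucket/" } }
]
EOF
aws emr create-cluster \
  --name "hbase-on-s3" \
  --release-label emr-5.30.0 \
  --applications Name=HBase \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --configurations file://hbase-on-s3.json
Drop the --configurations argument to keep the default behaviour, which stores the HBase data in the cluster's own HDFS.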

How to configure the PutHDFS processor in Apache NiFi so that I can transfer a file from a local machine to HDFS over the network?

I have data in a file on my local Windows machine, and that machine has Apache NiFi running on it. I want to send this file to HDFS over the network using NiFi. How can I configure the PutHDFS processor in NiFi on the local machine so that it sends the data to HDFS over the network?
Thank you!
You need to copy the core-site.xml and hdfs-site.xml from one of your hadoop nodes to the machine where NiFi is running. Then configure PutHDFS so that the configuration resources are "/path/to/core-site.xml,/path/to/hdfs-site.xml". That is all that is required from the NiFi perspective, those files contain all of the information it needs to connect to the Hadoop cluster.
You'll also need to ensure that the machine where NiFi is running has network access to all of the machines in your Hadoop cluster. You can look through those config files and find any hostnames and IP addresses and make sure they can be accessed from the machine where NiFi is running.
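As a rough sketch of those two steps from a Unix-like shell (hostnames, paths, and ports are placeholders; on Windows the same can be done with pscp/WinSCP and Test-NetConnection):
# copy the client configs from a Hadoop node to somewhere NiFi can read them
scp hadoop-node1:/etc/hadoop/conf/core-site.xml /opt/nifi/conf/
scp hadoop-node1:/etc/hadoop/conf/hdfs-site.xml /opt/nifi/conf/
# confirm the NiFi machine can reach the NameNode and DataNodes
# (fs.defaultFS in core-site.xml gives the NameNode host/port, commonly 8020)
nc -vz namenode-host 8020
nc -vz datanode-host 50010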
Using the GetFile processor or the combination of ListFile/FetchFile, it is possible to bring this file from your local disk into NiFi and pass it on to the PutHDFS processor. The PutHDFS processor relies on the associated core-site.xml and hdfs-site.xml files in its configuration.
Just add the Hadoop configuration files to the first field (Hadoop Configuration Resources), e.g.
$HADOOP_HOME/etc/hadoop/core-site.xml,$HADOOP_HOME/etc/hadoop/hdfs-site.xml
then set the HDFS directory where the ingested data should be stored in the Directory field, and leave everything else at its defaults.

Configuring external data source for Elastic MapReduce

We want to use Amazon Elastic MapReduce on top of our current DB (we are using Cassandra on EC2). Looking at the Amazon EMR FAQ, it should be possible:
Amazon EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon S3?
However, when creating a new job flow, we can only configure an S3 bucket as the input data origin.
Any ideas/samples on how to do this?
Thanks!
P.S.: I've seen this question How to use external data with Elastic MapReduce but the answers do not really explain how to do it/configure it, simply that it is possible.
How are you processing the data? EMR is just managed Hadoop. You still need to write a process of some sort.
If you are writing a Hadoop MapReduce job, then you are writing Java, and you can use the Cassandra APIs to access your data.
If you want to use something like Hive, you will need to write a Hive storage handler to use data backed by Cassandra.
Try using scp to copy files to your EMR instance:
my-desktop-box$ scp mylocaldatafile my-emr-node:/path/to/local/file
(or use ftp, or wget, or curl, or anything else you want)
then log into your EMR instance with ssh and load the file into HDFS:
my-desktop-box$ ssh my-emr-node
my-emr-node$ hadoop fs -put /path/to/local/file /path/in/hdfs/file

How to use external data with Elastic MapReduce

From Amazon's EMR FAQ:
Q: Can I load my data from the internet or somewhere other than Amazon S3?
Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet, EC2 bandwidth charges will apply. Amazon Elastic MapReduce also provides Hive-based access to data in DynamoDB.
What are the specifications for loading data from external (non-S3) sources? There seems to be a dearth of resources around this option, and it doesn't appear to be documented in any form.
If you want to do it "the Hadoop way", you should either implement a DFS over your data source, or put references to your source URLs into a file that serves as the input for the MR job (see the sketch below).
At the same time, Hadoop is about moving code to the data. Even EMR over S3 is not ideal in this respect - EC2 and S3 are different clusters. So it is hard to imagine effective MR processing if the data source is physically outside of the data center.
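As a rough sketch of the URL-list idea (the jar path, URLs, and HDFS paths are placeholders): put the list of URLs into HDFS and run a map-only Hadoop streaming job whose mapper fetches each URL itself, so the downloading is done inside the cluster.
cat > urls.txt <<'EOF'
http://example.com/data/part1.csv
http://example.com/data/part2.csv
EOF
hadoop fs -mkdir -p /input
hadoop fs -put urls.txt /input/urls.txt
# map-only streaming job: each mapper reads its share of URLs on stdin and fetches them
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /input/urls.txt \
  -output /fetched \
  -mapper 'xargs -n1 curl -sS' \
  -reducer NONE
Splitting the URL list across several files under /input gives you more map tasks, and therefore more parallel downloads.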
Basically, what Amazon is saying is that your code can programmatically access any content from the internet or from any other source. For example, you can access a CouchDB instance via any HTTP-based client API.
I know that the Cassandra package for Java has a source package named org.apache.cassandra.hadoop, and there are two classes in it that are needed for getting data out of Cassandra when you are running AWS Elastic MapReduce.
Essential classes: ColumnFamilyInputFormat.java and ConfigHelper.java
Go to this link to see an example of what I'm talking about.