YARN or HDFS logs in Filebeat

If I want to ingest logs that are present in HDFS into Filebeat, how can I do that? I can specify any directory on local drives, but I want Filebeat to pick up data from HDFS. Is there any way this can be done? Any help will be greatly appreciated.

You can mount HDFS to the local file system; follow the instructions from this answer.
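For example, if HDFS is exposed locally through the HDFS NFS gateway (or fuse-dfs), Filebeat can then tail the mounted path like any other local directory. A minimal sketch, where the gateway host, mount point, and log path are all placeholders:

# Mount HDFS locally via the HDFS NFS gateway (options follow the Hadoop NFS gateway docs)
sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync <nfs-gateway-host>:/ /mnt/hdfs

# filebeat.yml - point a log input at the mounted HDFS path
filebeat.inputs:
  - type: log
    paths:
      - /mnt/hdfs/app-logs/*/*.log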

Related

HBase using Google Cloud Storage (bucket)

I'm new to HBase and exploring a few options.
Can I use the GCS file system as the backing store for HBase on Google Cloud?
I know there's Google's Bigtable, but I'm assuming that setting up our own cluster with GCS as the filesystem will reduce cost. Agree?
You can try the following as an experiment. I'm almost certain that it will work in principle, but I would be concerned with the impact on speed and performance, because HBase can be very IO intensive. So try this at small scale just to see if it works, and then try to stress-test it at scale:
HBase uses HDFS as the underlying file storage system. So your HBase config should point to an HDFS cluster that it writes to.
Your HDFS cluster, in turn, needs to point to a physical file directory to which it will write its files. So this really is just a physical directory on each HDFS Datanode in the HDFS cluster. Assuming you want to set up your HBase/HDFS cluster on a set of VMs in GCP, this is really just a directory on each VM that runs an HDFS Datanode.
GCP has a utility called Cloud Storage FUSE (gcsfuse), at least for Linux VMs. It allows you to fuse a local directory on a Linux VM to a GCP cloud bucket. Once you make it work, you can read/write files directly to/from the GCP bucket as if it were just a directory on your Linux VM.
So why not do this for the directory that is configured as your underlying physical storage in HDFS? Simply fuse that directory to a GCP cloud bucket and it should work, at least in theory. I'd start with actually fusing a directory first and writing a couple of files to it. Once you make that work, simply point your HDFS to that directory.
I don't see a reason why the above shouldn't work, but I'm not sure whether gcsfuse is meant for very heavy and fast IO loads. Still, you can give it a try!
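As a first experiment, something along these lines should verify the fuse step (the bucket name and mount point are just examples):

# Mount a GCS bucket onto a local directory with gcsfuse
gcsfuse my-hbase-bucket /home/data/gcs-mount

# Sanity check: write and read a file through the mount
echo "hello" > /home/data/gcs-mount/test.txt
cat /home/data/gcs-mount/test.txt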
If you don't know how/where to configure all this at the HBase and HDFS level, all you have to do is this:
In HBase, edit the hbase-site.xml file in the ../HBase/conf folder:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode-address:9000/hbase</value>
</property>
So this entry tells HBase that it needs to write its files into HDFS, whose Namenode is running at the address namenode-address, port 9000.
From here, in HDFS, edit hdfs-site.xml in the ../hadoop/etc/hadoop folder:
<property>
  <name>dfs.data.dir</name>
  <value>file:///home/data/hdfs1/datanode</value>
</property>
This tells each HDFS Datanode where exactly to write its physical files. This could be the directory that is fused to the GCP bucket.
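Putting the two together, the Datanode data directory could simply be the fused mount from the gcsfuse step above (the path is a placeholder):

<property>
  <name>dfs.data.dir</name>
  <value>file:///home/data/gcs-mount/datanode</value>
</property>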
Another thought: If I'm not mistaken, HBase might now be available in GCP as part of Dataproc? I haven't used GCP for a while, and back in the day there was no HBase in Dataproc, but I think it's been added since.
Use Bigtable as an alternative to HBase. The Cloud Bigtable open-source client libraries implement the same set of interfaces as the HBase client libraries, making it HBase compatible. As for your last question, there's no way you could connect Bigtable to Cloud Storage.

Sync between HDFS and NFS folders

How can I sync folders between NFS and HDFS without mounting the NFS share on the HDFS-configured server?
The sync should work in both directions: if any files or folders are deleted in HDFS, that should be reflected on the NFS system.
Please tell me the best approach or commands that I can run on the HDFS-configured Linux server.

Is it possible to copy files from Amazon AWS S3 directly to a remote server?

I receive some large data to process and I would like to copy the files to my remote GPU server for processing.
The data consists of 8000 files x 9 GB per file, which is quite large.
Is it possible to copy the files from AWS directly to the remote server (accessed over SSH)?
I have googled it and did not find anyone asking this question.
If anyone could kindly provide a guide/example URL, I would appreciate it a lot.
Thanks.
I assume your files are residing in S3.
If that is the case, then you can simply install the AWS CLI on your remote machine and use the aws s3 cp command.
For more details, see the AWS CLI documentation for aws s3 cp.
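A minimal sketch of that approach (the bucket name and local paths are placeholders):

# On the remote GPU server (the one you SSH into), install and configure the AWS CLI
pip install awscli        # or use your OS package manager / the AWS installer
aws configure             # supply access key, secret key, and default region

# Copy a single object
aws s3 cp s3://my-data-bucket/dataset/file0001.bin /data/incoming/

# Copy an entire prefix (e.g. all 8000 files) recursively
aws s3 cp s3://my-data-bucket/dataset/ /data/incoming/ --recursive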

How to change the YARN scheduler configuration on AWS EMR?

Unlike Hortonworks or Cloudera, AWS EMR does not seem to provide any GUI for changing the XML configurations of the various Hadoop ecosystem frameworks.
Logging into my EMR namenode and doing a quick
find / -iname yarn-site.xml
I was able to find it to be located at /etc/hadoop/conf.empty/yarn-site.xml and capacity-scheduler to be located at /etc/hadoop/conf.empty/capacity-scheduler.xml.
But note how these are under conf.empty and I suspect these might not be the actual locations for yarn-site and capacity-scheduler xmls.
I understand that I can set these configurations while creating a cluster, but what I need to know is how to change them without tearing the cluster apart.
I just want to play around with scheduling properties and such, and try out different schedulers to identify what might work well with my Spark applications.
Thanks in advance!
Well, yarn-site.xml and capacity-scheduler.xml are indeed in the correct locations (/etc/hadoop/conf.empty/), and on a running cluster, editing them on the master node and restarting the YARN ResourceManager daemon will change the scheduler.
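For example, after editing capacity-scheduler.xml on the master node, something along these lines should apply the change (service names vary by EMR release, so treat this as a sketch):

# Reload queue/capacity-scheduler settings without a full restart
yarn rmadmin -refreshQueues

# Or restart the ResourceManager daemon (on systemd-based EMR releases)
sudo systemctl restart hadoop-yarn-resourcemanager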
When spinning up a new cluster, you can use the EMR Configurations API to change the appropriate values. http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
For example: specify appropriate values in the capacity-scheduler and yarn-site classifications in your configuration, and EMR will change those values in the corresponding XML files.
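A sketch of what such a configuration file might look like (the property values are only illustrations), passed at cluster creation with something like aws emr create-cluster ... --configurations file://configurations.json:

[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.maximum-am-resource-percent": "0.5"
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
    }
  }
]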
Edit (Sep 4, 2019):
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.
Please see
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html
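A rough sketch of the CLI route described on that page (the cluster ID, instance-group ID, and property are placeholders; check the linked page for the exact JSON shape):

aws emr modify-instance-groups --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups file://reconfiguration.json

# reconfiguration.json
[
  {
    "InstanceGroupId": "ig-XXXXXXXXXXXX",
    "Configurations": [
      {
        "Classification": "yarn-site",
        "Properties": {
          "yarn.scheduler.minimum-allocation-mb": "1024"
        }
      }
    ]
  }
]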

How to configure the PutHDFS processor in Apache NiFi such that I can transfer a file from a local machine to HDFS over the network?

I have data in a file on my local Windows machine. The local machine has Apache NiFi running on it. I want to send this file to HDFS over the network using NiFi. How can I configure the PutHDFS processor in NiFi on the local machine so that I can send data to HDFS over the network?
Thank you!
You need to copy the core-site.xml and hdfs-site.xml from one of your hadoop nodes to the machine where NiFi is running. Then configure PutHDFS so that the configuration resources are "/path/to/core-site.xml,/path/to/hdfs-site.xml". That is all that is required from the NiFi perspective, those files contain all of the information it needs to connect to the Hadoop cluster.
You'll also need to ensure that the machine where NiFi is running has network access to all of the machines in your Hadoop cluster. You can look through those config files and find any hostnames and IP addresses and make sure they can be accessed from the machine where NiFi is running.
Using the GetFile processor, or the combination of ListFile/FetchFile, it would be possible to bring this file from your local disk into NiFi and pass it on to the PutHDFS processor. The PutHDFS processor relies on the associated core-site.xml and hdfs-site.xml files in its configuration.
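A minimal flow sketch (the local and HDFS paths are placeholders; the property names are as they appear in the processor configuration):

GetFile -> PutHDFS

GetFile
  Input Directory: C:\data\incoming

PutHDFS
  Hadoop Configuration Resources: C:\nifi\hadoop-conf\core-site.xml,C:\nifi\hadoop-conf\hdfs-site.xml
  Directory: /data/landing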
Just add the paths to the Hadoop configuration files in the first field (Hadoop Configuration Resources), e.g.
$HADOOP_HOME/etc/hadoop/hdfs-site.xml,$HADOOP_HOME/etc/hadoop/core-site.xml
and set the HDFS directory where the ingested data should be stored in the Directory field, leaving everything else at its defaults.