Sync between HDFS and NFS folders - hdfs

How can I sync folders between NFS and HDFS without mounting the NFS share on the HDFS-configured server?
The sync should work in both directions too: if any files or folders are deleted in HDFS, that should be reflected on the NFS system.
Please suggest the best approach or commands that I can run on the HDFS-configured Linux server.

Related

hbase using google cloud storage (bucket)

I'm new to HBase and exploring a few options.
Can I use the GCS file system as the backing store for HBase on Google Cloud?
I know there's Google's Bigtable, but I'm assuming that setting up our own cluster with GCS as the filesystem will reduce cost. Do you agree?
You can try the following as an experiment. I'm almost certain that it will work in principle, but I would be concerned with the impact on speed and performance, because HBase can be very IO intensive. So try this at small scale just to see if it works, and then try to stress-test it at scale:
HBase uses HDFS as the underlying file storage system. So your HBase config should point to an HDFS cluster that it writes to.
Your HDFS cluster, in turn, needs to point to a physical file directory to which it will write its files. So this really is just a physical directory on each HDFS Datanode in the HDFS cluster. Assuming you want to set up your HBase/HDFS cluster on a set of VMs in GCP, this is really just a directory on each VM that runs an HDFS Datanode.
GCP has a utility called Cloud Storage FUSE (gcsfuse), at least for Linux VMs. It allows you to fuse a local directory on a Linux VM to a GCP cloud bucket. Once you make it work, you can read/write files directly to/from the GCP bucket as if it were just a directory on your Linux VM.
So why not do this for the directory that is configured as your underlying physical storage in HDFS? Simply fuse that directory to a GCP cloud bucket and it should work, at least in theory. I'd start with actually fusing a directory first and writing a couple of files to it. Once you make that work, simply point your HDFS to that directory.
I don't see a reason why the above shouldn't work, but I'm not sure whether gcsfuse is meant for very heavy and fast IO loads. You can give it a try!
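A minimal sketch of the fusing step, using the Datanode directory from the config below and a hypothetical bucket name (my-hbase-bucket is a placeholder):
# Install gcsfuse on each Datanode VM, then mount the bucket at the data directory
mkdir -p /home/data/hdfs1/datanode
gcsfuse my-hbase-bucket /home/data/hdfs1/datanode
# Quick check: this file should appear as an object in the bucket
echo hello > /home/data/hdfs1/datanode/test.txt
# Unmount when done testing
fusermount -u /home/data/hdfs1/datanode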
If you don't know how/where to configure all this at the HBase and HDFS level, all you have to do is this:
In HBase, edit the hbase-site.xml file in the ../HBase/conf folder:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode-address:9000/hbase</value>
</property>
So this entry tells HBase that it needs to write its files into HDFS, whose Namenode is running at address namenode-address, port 9000.
From here, in HDFS, edit hdfs-site.xml in the ../hadoop/etc/hadoop folder:
<property>
  <name>dfs.data.dir</name>
  <value>file:///home/data/hdfs1/datanode</value>
</property>
This tells each HDFS Datanode where exactly to write its physical files. This could be the directory that is fused to the GCP bucket.
Another thought: If I'm not mistaken, HBase might now be available in GCP as part of Dataproc? I haven't used GCP for a while, and back in the day there was no HBase in Dataproc, but I think it's been added since.
Use Bigtable as an alternative to HBase. The open-source Cloud Bigtable client libraries implement the same set of interfaces as the HBase client libraries, which makes Bigtable HBase compliant. As for your last question, there's no way you could connect Bigtable with Cloud Storage.

How to make Spark save its temp files on S3?

I am running Spark jobs on an AWS EMR cluster, submitting them from a client host machine.
The client machine is just an EC2 instance that submits jobs to EMR with YARN in cluster mode.
The problem is that Spark saves temp files of about 200 MB each, like:
/tmp/spark-456184c9-d59f-48f4-9b0560b7d310655/__spark_conf__6943938018805427428.zip
The /tmp folder fills up with such files very fast, and I start getting failed jobs with the error:
No space left on device
I tried to configure spark.local.dir in spark-defaults.conf to point to my S3 bucket, but it adds the user directory prefix to the path, like this: /home/username/s3a://my-bucket/spark-tmp-folder
Could you please suggest how I can fix this problem?
I uploaded a zip archive with the Spark libs (spark_libs.zip) to the S3 bucket.
Then, on the client host machine that submits the jobs, I pointed spark-defaults.conf at it via the property
spark.yarn.archive s3a://mybucket/libs/spark_libs.zip.
Now Spark copies only the configs to the local tmp folder, which takes
only about 170 KB instead of 200 MB.
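For reference, a rough sketch of that setup (the bucket name and local paths below are placeholders, not from the original post):
# Package the Spark jars into an archive and upload it to S3
cd $SPARK_HOME/jars && zip -r /tmp/spark_libs.zip .
aws s3 cp /tmp/spark_libs.zip s3://my-bucket/libs/spark_libs.zip
# Then in spark-defaults.conf on the client machine:
# spark.yarn.archive   s3a://my-bucket/libs/spark_libs.zip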

Is it possible to copy files from Amazon AWS S3 directly to a remote server?

I receive some large data to process and I would like to copy the files to my remote GPU server for processing.
The data is 8000 files of about 9 GB each, which is quite large.
Is it possible to copy the files from AWS directly to the remote server (accessed over SSH)?
I have googled it and did not find anyone asking this question.
If anyone could kindly provide a guide/URL/example, I would appreciate it a lot.
Thanks.
I assume your files are residing in S3.
If that is the case, then you can simply install the AWS CLI on your remote machine and use the aws s3 cp command.
For more details click here
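As a rough sketch, assuming the data sits under a prefix in a bucket (bucket name and local path are placeholders), run this on the remote GPU server:
# Install the AWS CLI (v1 via pip, or the v2 installer), then configure credentials
pip install awscli
aws configure
# Copy everything under the prefix to a local directory
aws s3 cp s3://my-bucket/incoming-data/ /data/incoming/ --recursive
# aws s3 sync s3://my-bucket/incoming-data/ /data/incoming/   # alternative that can be safely re-run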

Copy files from local machine to AWS

A local project directory with its files and subdirectories, used for web app development, needs to move to the AWS cloud. And once there, changes to the local version will often need to be synced with the AWS version to keep it up to date.
The local Mac machine has aws-shell installed. The app gets built from a Dockerfile on EC2, so the project directory will eventually need to be on the EC2 instance.
Options:
1. Compress locally to ~100 MB and scp to EC2, unzip on EC2 and use Docker?
2. Compress locally and copy to S3, then copy from S3 to EC2?
What commands are used to pull this off?
Thanks
Compress locally to ~100 MB and scp to EC2, unzip on EC2 and use Docker?
What commands are used to pull this off?
scp
Compress locally and copy to S3, then copy from S3 to EC2?
What commands are used to pull this off?
aws s3 cp or aws s3 sync
See the documentation here
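A rough sketch of both options (the key file, host, bucket and project names below are placeholders):
# Option 1: compress, scp to EC2, unzip and build with Docker
tar czf project.tar.gz ./my-project
scp -i my-key.pem project.tar.gz ec2-user@ec2-host:/home/ec2-user/
ssh -i my-key.pem ec2-user@ec2-host 'tar xzf project.tar.gz && docker build -t my-app my-project'
# Option 2: go through S3; later runs of sync only transfer changed files
aws s3 sync ./my-project s3://my-bucket/my-project                   # on the Mac
aws s3 sync s3://my-bucket/my-project /home/ec2-user/my-project      # on the EC2 instance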

How to configure the PutHDFS processor in Apache NiFi so that I can transfer a file from a local machine to HDFS over the network?

I have data in a file on my local Windows machine. The local machine has Apache NiFi running on it. I want to send this file to HDFS over the network using NiFi. How can I configure the PutHDFS processor in NiFi on the local machine so that I can send data to HDFS over the network?
Thank you!
You need to copy the core-site.xml and hdfs-site.xml from one of your Hadoop nodes to the machine where NiFi is running. Then configure PutHDFS so that the configuration resources are "/path/to/core-site.xml,/path/to/hdfs-site.xml". That is all that is required from the NiFi perspective; those files contain all of the information it needs to connect to the Hadoop cluster.
You'll also need to ensure that the machine where NiFi is running has network access to all of the machines in your Hadoop cluster. You can look through those config files and find any hostnames and IP addresses and make sure they can be accessed from the machine where NiFi is running.
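For example, assuming the Hadoop configs live under /etc/hadoop/conf on a cluster node and NiFi is installed under /opt/nifi (both paths are placeholders), the setup might look like this:
# Copy the config files to the NiFi machine (any file-transfer tool works)
scp hadoop-user@hadoop-node:/etc/hadoop/conf/core-site.xml /opt/nifi/conf/
scp hadoop-user@hadoop-node:/etc/hadoop/conf/hdfs-site.xml /opt/nifi/conf/
# Then in the PutHDFS processor properties:
#   Hadoop Configuration Resources: /opt/nifi/conf/core-site.xml,/opt/nifi/conf/hdfs-site.xml
#   Directory: /data/landing    # target HDFS directory, also a placeholder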
Using the GetFile processor or the combination of ListFile/FetchFile, it would be possible to bring this file from your local disk into NiFi and pass it on to the PutHDFS processor. The PutHDFS processor relies on the associated core-site.xml and hdfs-site.xml files in its configuration.
Just add the Hadoop core configuration files to the first field (Hadoop Configuration Resources):
$HADOOP_HOME/conf/hadoop/hdfs-site.xml, $HADOOP_HOME/conf/hadoop/core-site.xml
Then set the HDFS directory where the ingested data should be stored in the Directory field, and leave everything else at its defaults.