If I were to copy a directory (with 10 files in it) from local to HDFS,
would it be better to write a single command
hdfs dfs -copyFromLocal <dir_loc> <hdfs_loc>
or
hdfs dfs -copyFromLocal <File1> <hdfs_loc>
hdfs dfs -copyFromLocal <File2> <hdfs_loc>
..
The commands above would be launched from multiprocessing code, so they would not run sequentially.
My question is: will running the commands in parallel improve speed, or will both approaches perform the same since everything runs on the same cluster?
Your second code snippet will not actually run the uploads in parallel; as written it is sequential (each command is synchronous). If you actually want to run the uploads in parallel from a shell, you should instead write:
hdfs dfs -copyFromLocal <File1> <hdfs_loc> &
hdfs dfs -copyFromLocal <File2> <hdfs_loc> &
...
Whether or not this will speed things up depends heavily on your hardware and configuration. Let's assume you are using the default replication factor (3), and that the machine you are running the upload from is identical to the machines running your DataNode processes (it has the same available network bandwidth). When you upload a file to a DataNode, that node streams the data on to other DataNodes to achieve the desired replication factor.
Thus, when uploading a single file at a time, the DataNode's network should saturate before your uploading machine's (it has to both receive the data and transmit it along). Doing multiple uploads in parallel sends the transfers to different DataNodes, so you may be able to use more of the available bandwidth. Doing more than a small number in parallel will likely saturate the uploading machine's network bandwidth and result in diminishing returns.
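Since you mention driving these commands from multiprocessing code, here is a minimal sketch of that approach (the file names, HDFS destination, and worker count are hypothetical); it caps the number of simultaneous uploads instead of forking one background process per file:

# Minimal sketch: upload local files to HDFS with a bounded number of
# parallel workers. Paths and the worker count below are hypothetical.
import subprocess
from multiprocessing.pool import ThreadPool

LOCAL_FILES = ["File1", "File2", "File3"]   # hypothetical local paths
HDFS_DEST = "/user/me/landing/"             # hypothetical HDFS directory
MAX_PARALLEL = 4                            # tune to your network capacity

def upload(path):
    # Each call blocks until its copy finishes; the pool provides the
    # parallelism. Threads are enough here because the real work happens
    # in the hdfs child processes.
    return subprocess.call(["hdfs", "dfs", "-copyFromLocal", path, HDFS_DEST])

if __name__ == "__main__":
    pool = ThreadPool(MAX_PARALLEL)
    exit_codes = pool.map(upload, LOCAL_FILES)
    pool.close()
    pool.join()
    print("exit codes:", exit_codes)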
If, however, you did the uploads from multiple machines, then you could greatly speed up the process, as each uploading machine could be sending to a different DataNode on the cluster.
TL;DR It may help a bit but only to a certain extent; you will be limited by the capabilities of the uploading machine.
Related
We have a cluster set up with HDP, and we use it to execute a process that runs for ~40 h, going through different tasks and stages. I would like to know what the highest HDFS disk usage was during this period, and at what time it occurred. I can see that the Ambari Dashboard (v2.7.4.0) and the NameNode UI show the current HDFS disk usage, but I can't find an option to show it over time (even though CPU and memory usage have such an option and nice graphs). Does anyone know if it is possible to gather such statistics?
Hi, could anyone explain to me what the HDFS master (NameNode) is responsible for? Also, what exactly is NameNode and DataNode metadata in HDFS? I recently started studying Spark, but our lecture did not go deep enough into HDFS. Many thanks.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes
https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
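To make the NameNode/DataNode split concrete, here is a minimal sketch (the HDFS path is hypothetical) that asks the NameNode for a file's block metadata by shelling out to the standard fsck tool; the output shows exactly the namespace and block-to-DataNode mapping described above:

# Minimal sketch: print which blocks make up a file and which DataNodes
# hold each replica. The path below is hypothetical.
import subprocess

path = "/user/me/dataset.csv"  # hypothetical HDFS path
# -files/-blocks/-locations print the file -> block -> DataNode mapping,
# i.e. metadata maintained by the NameNode; the block contents themselves
# live only on the DataNodes.
report = subprocess.check_output(
    ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]
)
print(report.decode())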
In Dask distributed documentation, they have the following information:
For example, Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv'), Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve load times for users.
However, it seems that get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask with regard to HDFS? Is it sending computation to nodes where the data is local? Is the scheduler optimized to take HDFS data locality into account?
Quite right: with the appearance of Arrow's HDFS interface, which is now preferred over hdfs3, the consideration of block locations is no longer part of workloads accessing HDFS, since Arrow's implementation doesn't include the get_block_locations() method.
However, we had already wanted to remove the somewhat convoluted code that made this work, because we found that the inter-node bandwidth on test HDFS deployments was good enough that it made little practical difference in most workloads. The extra constraints on the size of the blocks versus the size of the partitions you would like in memory created an additional layer of complexity.
By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (s3, gcs, azure) where it didn't matter which worker accessed which part of the data.
In short, yes the docs should be updated.
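For reference, here is a minimal sketch of what the high-level call looks like with the Arrow-backed filesystem (the NameNode host, port, user, and glob are hypothetical, and pyarrow needs to be installed since Dask delegates HDFS access to it):

# Minimal sketch: read CSVs from HDFS with Dask via the Arrow-backed
# filesystem. Host, port, user, and the glob are hypothetical.
import dask.dataframe as dd

df = dd.read_csv(
    "hdfs://namenode:8020/path/to/files.*.csv",
    storage_options={"user": "hadoop"},  # forwarded to the underlying filesystem
)
print(df.head())  # reads just enough partitions to show a preview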
I wrote a script that analyzes a lot of files on an AWS cluster.
Running it on the cloud seems to be slower than I expected: the filesystem is shared via NFS, so the round trip through the network appears to be the limiting step. Bottom line: the processing power of the cluster is limited by the speed of the internal network, which is considerably slower than the SSD the data is stored on.
How would you optimize the cluster so that IO intensive jobs will run efficiently?
There isn't much you can do given the circumstances.
Obviously the speed of the NFS itself is the drawback.
Consider:
Chunking - read only the pieces of each file that are actually required, so as little data as possible crosses the network
Copying locally - create a locking mechanism, copy the file in full to local storage, process it there, and push the results back (see the sketch after this list). This can require a lot of work: what if the worker gives up and doesn't clear the lock?
Optimize the NFS share - increase I/O throughput by clustering the NFS server, putting it on RAID, etc.
With a remote FS you want to limit the amount of back and forth. You can be creative, but creative solutions can become a problem in themselves.
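Here is a minimal sketch of the "copying locally" idea from the list above, using an atomically created lock file so that two workers don't claim the same input. All paths are hypothetical, and stale-lock cleanup (the failure mode mentioned above) is deliberately left out:

# Minimal sketch: claim a file on the NFS share with an atomic lock file,
# copy it to fast local storage, process it there, push the result back.
# Paths are hypothetical; stale-lock handling is deliberately omitted.
import os
import shutil

SHARED = "/nfs/data/input.bin"   # file on the slow NFS share (hypothetical)
LOCK = SHARED + ".lock"          # lock file next to it
LOCAL = "/local/ssd/input.bin"   # scratch space on the worker's SSD (hypothetical)

def process(path):
    # placeholder for the real analysis
    return os.path.getsize(path)

try:
    # O_CREAT | O_EXCL makes creation atomic: only one worker can win the lock.
    fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
except FileExistsError:
    print("another worker owns this file; skipping")
else:
    try:
        shutil.copy(SHARED, LOCAL)      # one sequential read over NFS
        result = process(LOCAL)         # all further I/O hits the local SSD
        with open(SHARED + ".result", "w") as out:
            out.write(str(result))      # push the output back to the share
    finally:
        os.close(fd)
        os.remove(LOCK)                 # release the claim
        if os.path.exists(LOCAL):
            os.remove(LOCAL)            # clean up the scratch copy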
I am running three MapReduce jobs in sequence (output of one is the input to another) on a Hadoop cluster with 3 nodes (1 master and 2 slaves).
Apparently, the total time taken by the same jobs to finish on a single-node cluster is less than this by quite a margin.
What could be the possible reasons? Is it network latency? The cluster runs on a 100 Mbps Ethernet network. Will it help if I increase the number of nodes?
I am using Hadoop Streaming, and my code is written in Python 2.7.
MapReduce isn't really meant to handle such a small input dataset. The MapReduce framework has to determine which nodes will run tasks and then spin up a JVM for each individual map and reduce task (the number of tasks depends on the size of your dataset). That usually carries a latency on the order of tens of seconds. Shipping non-local data between nodes is also expensive, as it involves sending data over the wire. For such a small dataset, the overhead of setting up a MapReduce job on a distributed cluster is likely higher than the runtime of the job itself.
On a single node you only see the overhead of starting up tasks on a local machine and don't have to copy any data over the network, which is why the job finishes faster on a single machine. If you had multi-gigabyte files, you would see better performance across several machines.
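As a rough back-of-envelope illustration of why the overhead dominates for small inputs (every number below is an assumption, not a measurement):

# Rough sketch: compare fixed per-job overhead with time spent moving data.
# All figures are illustrative assumptions, not measurements.
JOBS = 3                    # three chained MapReduce jobs
CLUSTER_SETUP_S = 30        # assumed per-job scheduling + JVM startup on the cluster
SINGLE_SETUP_S = 10         # assumed per-job startup cost on one local machine
DATA_MB = 200               # assumed (small) total input size
NETWORK_MB_S = 100 / 8.0    # 100 Mbps Ethernet is roughly 12.5 MB/s
LOCAL_SSD_MB_S = 500        # assumed local SSD throughput

cluster_s = JOBS * CLUSTER_SETUP_S + DATA_MB / NETWORK_MB_S
single_s = JOBS * SINGLE_SETUP_S + DATA_MB / LOCAL_SSD_MB_S
print("cluster: ~%.0f s, single node: ~%.0f s" % (cluster_s, single_s))

With these assumed numbers the fixed setup and network costs swamp the actual processing time; with multi-gigabyte inputs the data-processing term grows and the cluster's extra parallelism starts to pay off.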