How to tar a folder in HDFS?

Just like the Unix command tar -czf xxx.tgz xxx/, is there a way to do the same thing in HDFS? I have a folder in HDFS that holds over 100k small files, and I want to download it to the local file system as fast as possible. hadoop fs -get is too slow. I know hadoop archive can output a har, but that does not seem to solve my problem.

From what I see here:
https://issues.apache.org/jira/browse/HADOOP-7519
it is not possible to perform a tar operation using hadoop commands. The ticket above was filed as an improvement request and has not been resolved, so the feature is not available yet.
Hope this answers your question.
Regarding your scenario: having 100k small files in HDFS is not good practice. You could merge them (for example by creating Hive or Impala tables from the data), or move all the small files into a single folder in HDFS and use hadoop fs -copyToLocal <HDFS_FOLDER_PATH> to copy the whole folder, with all the files in it, to your local file system.

Related

Cannot use regular expression in hadoop in Linux command line

I have a folder that contains a large number of subfolders that are dates from 2018. In my HDFS I have created a folder of just December dates (formatted 2018-12-) and I need to delete specifically days 21 - 25. I copied this folder from my HDFS to my docker container and used the command
rm -r *[21-25]
in the folder it worked as expected. But when I run this same command adapted to hdfs
hdfs dfs -rm -r /home/cloudera/logs/2018-Dec/*[21-25]
it gives me the error
rm: `/home/cloudera/logs/2018-Dec/*[21-25]': No such file or directory
If anything needs to be explained in more detail, leave a comment. I'm brand new to all of this and don't always know the right terms for these things.
I figured it out with the help of @Barmer. There were two problems. First, I was pointing at my local system's base directory (/home/cloudera) instead of the HDFS one (/user/cloudera). Second, [21-25] is a character class that matches a single character (1, 2 or 5), not the numbers 21 to 25, so I had to change the pattern to 2[1-5]. The command ended up being hdfs dfs -rm -r /user/cloudera/logs/2018-Dec/*2[1-5].
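To see why the two patterns differ, here is a small sketch that uses Python's fnmatch module as a stand-in for shell and HDFS globbing. The assumption is that bracket expressions behave the same way in all three for this case; the directory names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical directory names, one per day of December 2018.
names = [f"2018-12-{day:02d}" for day in range(1, 32)]

def matching(pattern):
    """Return the names that the shell-style pattern matches."""
    return [name for name in names if fnmatchcase(name, pattern)]

# [21-25] is ONE character: '2', the range '1'-'2', or '5'.
too_broad = matching("*[21-25]")

# 2[1-5] is a literal '2' followed by one character in '1'-'5'.
intended = matching("*2[1-5]")

print(too_broad)
print(intended)
```

Under that assumption, the first pattern also picks up days such as 01, 12 and 31, while the second selects only days 21 through 25.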

No such file exists while running Hadoop pipes using c++

When I run a Hadoop MapReduce program through hadoop pipes, the job cannot find a file that is present in HDFS. If the program is executed without hadoop pipes, the libhdfs library finds the file easily, but when I run the program with the
hadoop pipes -input i -output o -program p
command, libhdfs does not find the file and a java.io.exception is thrown. I have tried adding the -fs parameter to the command, with the same result. I have also tried prefixing the file path with hdfs://localhost:9000/, still without success. The file parameter is set inside the C code as:
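/* Both path forms were tried: the plain path and the full hdfs:// URI. */
const char *file = "/path/to/file/in/hdfs"; /* or "hdfs://localhost:9000/path/to/file" */
hdfsFS fs = hdfsConnect("localhost", 9000);
hdfsFile input = hdfsOpenFile(fs, file, O_RDONLY, 0, 0, 0);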
Found the problem. The files in HDFS are not available to the MapReduce task node. Instead, I had to compress the files into a single tar file and pass it to the distributed cache through the archive tag. The same can also be achieved by writing a custom InputFormat class and providing the files through the input parameter.

Getting status of gsutil cp command in parallel mode

This command copies a huge number of files from Google Cloud storage to my local server.
gsutil -m cp -r gs://my-bucket/files/ .
There are 200+ files, each of which is over 5GB in size.
Once all files are downloaded, another process kicks in, reads the files one by one and extracts the info needed.
The problem is that, even though the gsutil copy process is fast and downloads several files at a time at very high speed, I still need to wait until all the files are downloaded before I can start processing them.
Ideally I would like to start processing the first file as soon as it is downloaded. But in parallel (-m) mode there seems to be no way of knowing when a given file has finished downloading (or is there?).
From Google docs, this can be done in individual file copy mode.
if ! gsutil cp ./local-file gs://your-bucket/your-object; then
<< Code that handles failures >>
fi
That means that if I run cp without the -m flag, I get a success/failure result for each file and can kick off the processing of that file.
The problem with this approach is that the overall download takes much longer, because the files are now downloaded one by one.
Any insight?
One thing you could do is have a separate process that periodically lists the directory, filtering out the files that are incompletely downloaded (they are downloaded to a filename ending with '.gstmp' and then renamed after the download completes) and keeps track of files you haven't yet processed. You could terminate the periodic listing process when the gsutil cp process completes, or you could just leave it running, so it processes downloads for the next time you download all the files.
Two potential complications with doing that are:
If the number of files being downloaded is very large, the periodic directory listings could be slow. How big "very large" is depends on the type of file system you're using. You could experiment by creating a directory with the approximate number of files you expect to download, and seeing how long it takes to list. Another option would be to use the gsutil cp -L option, which builds a manifest showing what files have been downloaded. You could then have a loop reading through the manifest, looking for files that have downloaded successfully.
If the multi-file download fails partway through (e.g., due to a network connection that's dropped for longer than gsutil will retry), you'll end up with a partial set of files. For this case you might consider using gsutil rsync, which can be restarted and will pick up where it left off.
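The periodic-listing idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names completed_files and watch, the process callback and the is_done check are all hypothetical, and the only gsutil behaviour relied on is the one described above (in-progress downloads have names ending in '.gstmp').

```python
import os
import time

def completed_files(directory, seen):
    """Return fully downloaded files under directory that are not in seen yet."""
    ready = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            # Files still being downloaded end with '.gstmp'; skip them.
            if name.endswith(".gstmp"):
                continue
            path = os.path.join(root, name)
            if path not in seen:
                seen.add(path)
                ready.append(path)
    return ready

def watch(directory, process, is_done, interval=5.0):
    """Call process(path) for each finished file until is_done() reports True."""
    seen = set()
    while True:
        # Check for completion BEFORE listing, so that files finished just
        # before the copy ended are still picked up by the final listing.
        finished = is_done()
        for path in completed_files(directory, seen):
            process(path)
        if finished:
            break
        time.sleep(interval)
```

One way to supply is_done, assuming the copy is started from the same script with subprocess.Popen, would be a lambda that checks whether proc.poll() is no longer None. The seen set lives only in memory, so a restarted watcher would process every file again.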

Why does cp'ing a file work while mv does not in ssc.textFileStream and HDFS?

Following the standard Spark Streaming example that uses ssc.textFileStream to read files from an HDFS directory, I noticed that files placed there with mv were not picked up, whereas files placed with cp were. That surprised me.
I am surprised because cp does not seem like a good idea to me: the file could be picked up while the copy is still in progress.
What could be going on here, and why do I read that mv should be used, which seems sort of obvious?

Tarring only the files of a directory

If I have a folder with a bunch of images, how can I tar ONLY the images, and not the folder structure leading to them, without having to cd into the directory of images?
tar czf images.tgz /path/to/images/*
Now when images.tgz is extracted, the contents that are extracted are /path/to/images/...
How can I have only the images included in the tgz file (and not the three folders that lead to the images)?
I know you can use --strip-components when untarring, although I'm not sure whether it would also work when creating.
Perhaps iterate through the folder structure and pipe the results via stdout/stdin to tar sequentially (with cat or similar)? I am not a Linux expert, so I'm afraid I can only theorise rather than provide hard code right now, but you've got me wondering...
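On the command line, tar's -C option (change to a directory before adding files) is the usual way to avoid storing leading folders. The same effect can be sketched in Python with the standard tarfile module by giving each file an explicit name inside the archive. The function name tar_files_flat is hypothetical, and the sketch assumes only the regular files directly inside the folder are wanted.

```python
import os
import tarfile

def tar_files_flat(src_dir, archive_path):
    """Write a gzip-compressed tar holding the files of src_dir by bare name."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(src_dir)):
            full_path = os.path.join(src_dir, name)
            # Skip subdirectories; only regular files are added.
            if os.path.isfile(full_path):
                # arcname sets the name stored in the archive, so the
                # leading /path/to/images/ part is not recorded.
                tar.add(full_path, arcname=name)
```

A call such as tar_files_flat("/path/to/images", "images.tgz") would then produce an archive that extracts the images directly into the current directory. The archive should be written outside src_dir, otherwise it may be added to itself.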