How to import files to HDFS as a HAR archive? in Java - mapreduce

Currently we are importing files into HDFS by invoking the org.apache.hadoop.fs.FileSystem.moveFromLocalFile() method of Hadoop's FileSystem API. We are now seeing a large heap size on our NameNode because the number of small files being imported is too high, and we want to reduce it. Is there an easier way to import the files into HDFS as a HAR without having to import all the small files first? In short: I import the small files, but in HDFS there is one HAR file containing my imported files.

It is not possible to directly ingest HAR (Hadoop ARchive) files into HDFS.
The better approach is to copy the smaller files into HDFS first, and then create a HAR file that bundles all of these smaller files together.
You can use the hadoop archive command (usage: hadoop archive -archiveName {name of the archive} -p {input parent folder path} {output folder path}) to create the HAR file, and after the HAR file has been created you can delete your original files.
If there are millions of small files, then you can copy these files in chunks.
For example, let's assume that you have 100,000 small files. One possible approach (see the code sketch after this list):
1. Copy 10,000 files into a temporary location in HDFS, e.g. hdfs:///tmp/partition1/
2. Create a HAR file from these 10,000 files, e.g. under hdfs:///tmp/archive1/
3. After creating the archive, delete the files from hdfs:///tmp/partition1/
Repeat steps 1 to 3 until you have ingested all 100,000 files.
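If you prefer to drive this from Java rather than the shell, a rough sketch of one batch could look like the following. This is only illustrative: the local and HDFS paths are made up, and it assumes the hadoop-archives artifact (which provides org.apache.hadoop.tools.HadoopArchives, the class behind the hadoop archive command) is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.HadoopArchives;
import org.apache.hadoop.util.ToolRunner;

public class HarBatchIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Step 1: copy one batch of local small files into a temporary HDFS partition
        // (paths here are placeholders).
        Path partition = new Path("/tmp/partition1");
        fs.mkdirs(partition);
        fs.copyFromLocalFile(new Path("/local/smallfiles/part-0001.txt"), partition);
        // ... repeat for the rest of the batch ...

        // Step 2: build a HAR from the partition, equivalent to
        // hadoop archive -archiveName batch1.har -p /tmp/partition1 /tmp/archive1
        String[] harArgs = { "-archiveName", "batch1.har", "-p", "/tmp/partition1", "/tmp/archive1" };
        int rc = ToolRunner.run(conf, new HadoopArchives(conf), harArgs);

        // Step 3: delete the originals only if archiving succeeded.
        if (rc == 0) {
            fs.delete(partition, true);
        }
    }
}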

Related

No such file exists while running Hadoop pipes using C++

While running a Hadoop MapReduce program using Hadoop Pipes, a file that is present in HDFS is not found by the MapReduce job. If the program is executed without Hadoop Pipes, the file is easily found through the libhdfs library, but when running the program with the
hadoop pipes -input i -output o -program p
command, the file is not found by libhdfs and a java.io.IOException is thrown. I have tried to include the -fs parameter in the command, but with the same result. I have also prefixed the files with hdfs://localhost:9000/, and still no results. The file parameter is set inside the C code as:
file="/path/to/file/in/hdfs" or "hdfs://localhost:9000/path/to/file"
hdfsFS fs = hdfsConnect("localhost", 9000);
hdfsFile input = hdfsOpenFile(fs, file, O_RDONLY, 0, 0, 0);
Found the problem. The files in HDFS are not available to the MapReduce task nodes. So instead I had to pass the files to the distributed cache through the archive tag, by compressing the files into a single tar file. This can also be achieved by writing a custom InputFormat class and providing the files in the input parameter.
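For comparison only, the same distributed-cache idea expressed through the Java MapReduce API looks roughly like the sketch below; the URI and job name are made-up placeholders, and the original post used the pipes command line rather than this API.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheArchiveSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-archive-sketch");

        // Ship a tar of the side files with the job; the framework unpacks it
        // into each task's working directory under the link name "sidefiles".
        job.addCacheArchive(new URI("hdfs://localhost:9000/user/me/sidefiles.tar#sidefiles"));

        // ... configure mapper/reducer, input and output paths, then submit the job ...
    }
}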

How to tar a folder in HDFS?

Just like the Unix command tar -czf xxx.tgz xxx/, is there a method that can do the same thing in HDFS? I have a folder in HDFS that has over 100k small files, and I want to download it to the local file system as fast as possible. hadoop fs -get is too slow. I know hadoop archive can output a har, but it seems that it cannot solve my problem.
From what I see in
https://issues.apache.org/jira/browse/HADOOP-7519
it is not possible to perform a tar operation using Hadoop commands. This has been filed as an improvement, as mentioned above, and is not resolved/available to use yet.
Hope this answers your question.
Regarding your scenario: having 100k small files in HDFS is not a good practice. You can find a way to merge them all (for example by creating tables through Hive or Impala from this data), or move all the small files to a single folder in HDFS and use hadoop fs -copyToLocal <HDFS_FOLDER_PATH> to get the whole folder, with all the files in it, to your local file system.
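If you would rather do that copy programmatically, a minimal Java sketch of the same copyToLocal step might look like this (the HDFS and local paths are placeholders, not from the original question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFolderToLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Programmatic equivalent of 'hadoop fs -copyToLocal <HDFS_FOLDER_PATH> <LOCAL_DIR>':
        // copies the whole HDFS folder, including every file inside it, to the local file system.
        fs.copyToLocalFile(new Path("/data/small-files"), new Path("/local/download"));

        fs.close();
    }
}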

How to remove all files matching specific file content in HDFS?

By mistake, using NiFi, a huge number of files got generated in an HDFS directory with the content "val3val2val1". I want to remove all files matching this content using an HDFS command. Please advise.
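No answer is recorded here. As far as I know the HDFS shell has no command that filters by file content, so one possible approach is a small FileSystem-API sweep like the sketch below; the directory path, the size guard, and the exact-match rule are all assumptions for illustration.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteFilesByContent {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/data/nifi-output");   // hypothetical directory
        String badContent = "val3val2val1";

        for (FileStatus status : fs.listStatus(dir)) {
            if (!status.isFile()) {
                continue;
            }
            // Only read files small enough to hold in memory; the offending files
            // here contain just the short marker string.
            if (status.getLen() > 1024) {
                continue;
            }
            byte[] buf = new byte[(int) status.getLen()];
            try (FSDataInputStream in = fs.open(status.getPath())) {
                in.readFully(0, buf);
            }
            String content = new String(buf, StandardCharsets.UTF_8).trim();
            if (content.equals(badContent)) {
                fs.delete(status.getPath(), false);
            }
        }
        fs.close();
    }
}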

AppleScript to move files of a type

I have a folder of approximately 1,800 .7z files. Some of them uncompress as a single file, and some uncompress as a folder containing several files of the same type. If I were to uncompress all of them at once, what I would end up with is a folder full of .7z files, folders containing multiple files of a given type, and multiple single instances of that same file type.
What I would like to do is run a script that would take all of the files of the same type, from all of the folders below the main folder, and copy them to another specified folder. Unfortunately I don't really have any experience with AppleScript, and while this may be simple, it sounds insurmountable to me. Any input would be appreciated.

Getting status of gsutil cp command in parallel mode

This command copies a huge number of files from Google Cloud storage to my local server.
gsutil -m cp -r gs://my-bucket/files/ .
There are 200+ files, each of which is over 5GB in size.
Once all files are downloaded, another process kicks in and starts reading the files one by one and extract the info needed.
The problem is that even though the gsutil copy process is fast and downloads multiple files in parallel at a very high speed, I still need to wait until all the files are downloaded before I can start processing them.
Ideally I would like to start processing the first file as soon as it is downloaded, but in multi-file cp mode there seems to be no way of knowing when an individual file has finished downloading (or is there?).
According to the Google docs, this can be done in individual-file copy mode:
if ! gsutil cp ./local-file gs://your-bucket/your-object; then
<< Code that handles failures >>
fi
That means that if I run cp without the -m flag, I can get a boolean indicating success for that file and kick off the processing for it.
The problem with this approach is that the overall download will take much longer, since the files are now downloaded one by one.
Any insight?
One thing you could do is have a separate process that periodically lists the directory, filters out the files that are incompletely downloaded (they are downloaded to a filename ending in '.gstmp' and then renamed after the download completes), and keeps track of the files you haven't yet processed. You could terminate the periodic listing process when the gsutil cp process completes, or you could just leave it running so that it processes downloads the next time you download all the files.
Two potential complications with doing that are:
1. If the number of files being downloaded is very large, the periodic directory listings could be slow. How big "very large" is depends on the type of file system you're using. You could experiment by creating a directory with the approximate number of files you expect to download and seeing how long it takes to list. Another option would be to use the gsutil cp -L option, which builds a manifest showing which files have been downloaded. You could then have a loop reading through the manifest, looking for files that have downloaded successfully.
2. If the multi-file download fails partway through (e.g., due to a network connection that's dropped for longer than gsutil will retry), you'll end up with a partial set of files. For this case you might consider using gsutil rsync, which can be restarted and will pick up where you left off.
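As a rough illustration of the periodic-listing idea (not part of the original answer), here is a minimal Java watcher; the download directory, poll interval, and processFile step are placeholders, and the only detail carried over from the answer is that in-progress downloads end in '.gstmp'.

import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class DownloadWatcher {
    public static void main(String[] args) throws InterruptedException {
        File downloadDir = new File("/data/gcs-downloads");   // hypothetical local target of gsutil cp
        Set<String> processed = new HashSet<>();

        while (true) {
            File[] files = downloadDir.listFiles();
            if (files != null) {
                for (File f : files) {
                    // Skip in-progress downloads ('.gstmp' suffix) and files already handled.
                    if (!f.isFile() || f.getName().endsWith(".gstmp") || processed.contains(f.getName())) {
                        continue;
                    }
                    processFile(f);               // placeholder for the real extraction step
                    processed.add(f.getName());
                }
            }
            Thread.sleep(10_000);                 // poll every 10 seconds
        }
    }

    private static void processFile(File f) {
        System.out.println("Processing " + f.getAbsolutePath());
    }
}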