How to exclude directories while using `copyToLocal` - hdfs

I want to copy files from HDFS, but I want the folders to be excluded while copying. I tried hdfs dfs -copyToLocal, but as far as I can tell it also copies directories.
Is there any way/command to copy only the files and not the directories?

As far as I know, there is no direct flag for -copyToLocal that copies only files. However, you can use grep to exclude directories from the listing before copying. Something like this:
hdfs dfs -ls <HDFS_DIR_PATH> | grep "^-" | awk 'BEGIN{FL=""} {FL=FL" "$8} END{system("hdfs dfs -copyToLocal "FL" .")}'
where:
hdfs dfs -ls <HDFS_DIR_PATH> lists all the files and directories,
grep "^-" excludes the directories (directory entries start with d, regular files with -),
awk 'BEGIN{FL=""} {FL=FL" "$8} builds a string containing only the file paths, and
END{system("hdfs dfs -copyToLocal "FL" .")}' runs copyToLocal on that list of file paths.
Note that instead of . in the last command you can use any local file system path.
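The same pipeline can also be written as an explicit loop, which is easier to read and debug at the cost of one copyToLocal invocation per file. A minimal sketch, with <HDFS_DIR_PATH> and <LOCAL_DIR> as placeholders:
# keep only regular files (lines starting with -), take the path column,
# and copy each file to the local destination one at a time
hdfs dfs -ls <HDFS_DIR_PATH> | awk '/^-/ {print $8}' | \
while read -r f; do
    hdfs dfs -copyToLocal "$f" <LOCAL_DIR>
done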

A variant of @daemon12's answer that achieves the same thing.
hadoop fs -ls <HDFS_DIR_PATH> | grep "^-" | \
awk '{print $8}' | hadoop fs -copyToLocal $(xargs) .
awk '{print $8}' is used to obtain the path column from the ls output.
$(xargs) is used to concatenate the lines of paths into a space-separated string.

How to use cat and grep with gsutil and filter by a subdirectory name?

Currently, to find a string (here: 123456789) in the files across all my buckets, I do the following:
gsutil cat -h gs://AAAA/** | grep '123456789' > 20221109.txt
I get the path of the file when there is a match, so it works, but done this way it searches through all the directories (and I have thousands of directories and thousands of files, so it takes a lot of time).
I want to filter by date using the name of the subdirectory, like:
gsutil cat -h gs://AAAA/*2022-11-09*/** | grep '123456789' > 20221109.txt
But it didn't work, and I have no clue how to solve the problem; I read a lot of answers on SO, but none of them fit.
PS: I can't use find with gsutil, so I am trying to do it with cat and grep with gsutil in a single command line.
Thanks in advance for your help.
Finally, I managed to get what I wanted, although it is rather hard to read. I think it's possible to do better, and I'm open to any improvement. As a reminder, this solution avoids reading every directory of the bucket.
1st step:
I get all the paths of the files whose subdirectory matches my pattern (a date here):
gsutil ls gs://directory1/*2022-11-09*/** > gs_path_files_2022_11_09.txt
After that, I grep each file and print the name of the file together with the line where the match occurs (again in the terminal):
while read -r line; do
gsutil cat "$line" | awk -v l="'Command: gsutil cat $line | awk '/the_string_i_want_to_match_in_my_file/{print ARGV[ARGIND] ":" $0}':" '/the_string_i_want_to_match_in_my_file/{print l $0}' >> results.txt
done < gs_path_files_2022_11_09.txt
After that you get the command (including the name of the file) plus the line containing your match.
Best regards
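A more compact variant of the second step, as a sketch (assuming GNU grep, whose --label option names standard input so that every match is prefixed with the file's path):
# read each GCS path from the list produced in step 1 and print
# "<path>:<matching line>" for every occurrence of the string
while read -r path; do
    gsutil cat "$path" | grep -H --label="$path" '123456789'
done < gs_path_files_2022_11_09.txt >> results.txt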

HDFS dfs -ls path/filename

I have copied a few files to the path, but when I try to run the command hdfs dfs -ls path/filename it returns no file found.
hdfs dfs -ls up to the directory works, but when I add the file name it returns no files found. For one of the files, I copied and pasted the file name using Ambari; after that, the file started being returned by hdfs dfs -ls path/filename.
What is causing this issue?
When you execute hdfs dfs -ls path/filename, you are asking HDFS to show you all the files that are in that directory, and if the end of the path is a file, of course, you are not listing anything. You must point to a directory, not a file.
@saravanan it seems like a permission issue if the file shows up only after using Ambari. Make sure the files have the correct ownership before drawing conclusions from the commands. The ls command will list both files and folders, per the documentation.
Here is the full documentation for the ls command:
[root@hdp ~]# hdfs dfs -help ls
-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...] :
List the contents that match the specified file pattern. If path is not
specified, the contents of /user/<currentUser> will be listed. For a directory a
list of its direct children is returned (unless -d option is specified).
Directory entries are of the form:
permissions - userId groupId sizeOfDirectory(in bytes)
modificationDate(yyyy-MM-dd HH:mm) directoryName
and file entries are of the form:
permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
modificationDate(yyyy-MM-dd HH:mm) fileName
-C Display the paths of files and directories only.
-d Directories are listed as plain files.
-h Formats the sizes of files in a human-readable fashion
rather than a number of bytes.
-q Print ? instead of non-printable characters.
-R Recursively list the contents of directories.
-t Sort files by modification time (most recent first).
-S Sort files by size.
-r Reverse the order of the sort.
-u Use time of last access instead of modification for
display and sorting.
-e Display the erasure coding policy of files and directories.
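If ownership or permissions are the suspect, a quick check looks like this (a sketch; /data/mydir/file1.csv, hdfsuser and hadoop are hypothetical names, substitute your own):
# show the owner, group and permission bits of the file
hdfs dfs -ls /data/mydir/file1.csv
# change the owner/group if they are wrong (requires sufficient privileges)
hdfs dfs -chown hdfsuser:hadoop /data/mydir/file1.csv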

List all directories which have two specific strings in their names and are located in HDFS

I need to get all the directories which have both family and test as words written in the directory name.
So, for all directories located in:
hdfs/rohd/data/1Ex3
I tried :
hadoop fs -ls hdfs/rohd/data/1Ex3 | grep family;test
But it doesn't work.
In fact, the needed result should look like this:
hdfs/rohd/data/1Ex3/1_family_Pub_test
hdfs/rohd/data/1Ex3/2_family_Pub_test
hdfs/rohd/data/1Ex3/7_family_Pub_test
hdfs/rohd/data/1Ex3/3_family_Pub_test
hdfs/rohd/data/1Ex3/5_family_Pub_test
The solution is:
hadoop fs -ls hdfs/rohd/data/1Ex3 | grep _family_Pub_test
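A slightly more general variant (a sketch) that matches the two words independently, regardless of their order or of what sits between them, and prints only the paths:
# keep only the path column, then require both words to be present
hadoop fs -ls hdfs/rohd/data/1Ex3 | awk '{print $8}' | grep family | grep test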

How can I list subdirectories recursively for HDFS?

I have a set of directories created recursively in HDFS. How can I list all the directories? For a normal Unix file system I can do that using the command below:
find /path/ -type d -print
But I want to get the similar thing for HDFS.
To list directory contents recursively, the hadoop dfs -lsr /dirname command can be used.
To filter only directories, you can grep "drwx" (since the owner has rwx permission on directories) in the output of the above command.
Hence the whole command will look like the one below.
$ hadoop dfs -lsr /sqoopO7 | grep drwx
The answer given by @Shubhangi Pardeshi is correct, but for the latest Hadoop versions that command is deprecated. The newer command can be used as below:
hdfs dfs -ls -R /user | grep drwx
The following method should be more robust to only get directories because it depends less on the permissions.
hdfs dfs -ls -R /folder | grep "^d"
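To get output comparable to find /path/ -type d -print (paths only, no permissions or sizes), the path column can be stripped out as well, for example:
# recursive listing, keep only directory entries, print only the path column
hdfs dfs -ls -R /folder | grep "^d" | awk '{print $8}'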

grep through an explicit filelist

Stack,
We have many files in our library that were never used in subsequent projects. We are now at a development phase where we can do some good housekeeping and carefully remove unused library code. I am trying to optimize my grep command; its current implementation is quite slow.
grep --include=*.cpp --recursive --files-with-matches <library function name> <network path to subsequent projects>
The main reason is that the projects path is expansive, and the bulk of the time is spent just navigating the directory tree and applying the file mask. This grep command is called many times on the same set of project files.
Rather than navigating the directory tree on every call, I would like grep to reference a static file list stored on my local disk.
Something akin to this:
grep --from-filelist=c:\MyProjectFileList.txt
The MyProjectFileList.txt would be:
\\server1\myproject1\main.cpp
\\server1\myproject1\func1.cpp
\\server1\myproject2\main.cpp
\\server1\myproject2\method.cpp
Grep would apply the pattern expression to the contents of each of those files. The grep output would be the fully qualified path of each project file that uses a specific library function.
Library functions whose grep commands return no project files are unused in any project, and that library code can be deleted.
How do you force grep to scan files from an external filelist stored in a text file?
(Thereby removing directory scanning.)
Experiment a little with the xargs command and pipes (|).
Try the following (the file list is newline-separated, so convert the newlines to NUL characters before handing it to xargs -0):
tr '\n' '\0' < list_of_files.txt | xargs -0 grep **YOUR_GREP_ARGS_HERE**
or, in a Windows environment with PowerShell installed, try:
Get-Content List_of_files.txt | Foreach-Object {grep GREP_ARGS_HERE $_}
I googled for the Windows equivalent and found this:
FOR /F %k in (filelist.txt) DO grep yourgrepargs %k
(but I use Linux, so I have no idea whether it works)
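If GNU xargs is available (for example under Cygwin or MSYS on Windows), the file list can also be fed to grep directly, without any loop. A sketch, assuming the list is newline-separated and the paths contain no embedded newlines:
# -a reads the argument list from the file, -d '\n' treats each line as one argument,
# so paths containing spaces survive intact
xargs -a MyProjectFileList.txt -d '\n' grep --files-with-matches <library function name>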