How can I list subdirectories recursively for HDFS?

I have a set of directories created recursively in HDFS. How can I list all the directories? On a normal Unix file system I can do that using the command below:
find /path/ -type d -print
But I want to do the same kind of thing for HDFS.

To list directory contents recursively, the hadoop dfs -lsr /dirname command can be used.
To filter only directories, you can grep for "drwx" (since the owner has rwx permission on directories) in the output of the above command.
Hence the whole command will look like this:
$ hadoop dfs -lsr /sqoopO7 | grep drwx

The answer given by @Shubhangi Pardeshi is correct, but in the latest Hadoop versions that command has been deprecated. The newer command can be used as below:
hdfs dfs -ls -R /user | grep drwx

The following method should be more robust for getting only directories, because it depends less on the permissions:
hdfs dfs -ls -R /folder | grep "^d"
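If you only need the directory paths rather than the full listing lines, the path is the last column of the ls output, so you can append an awk step. This is a small sketch, assuming the default -ls column layout and no spaces in the path names:
hdfs dfs -ls -R /folder | grep "^d" | awk '{print $NF}'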

Related

HDFS dfs -ls path/filename

I have copied a few files to the path, but when I run the command hdfs dfs -ls path/filename it returns no file found.
hdfs dfs -ls up to the directory works, but when I use the file name it returns no files found. For one of the files, I copied and pasted the file name using Ambari; after that, the file started getting returned by hdfs dfs -ls path/filename.
What is causing this issue?
When you execute hdfs dfs -ls path/filename, what you are asking HDFS is "show me all the files that are in this directory", and if the final path element is a file, of course nothing is listed. You must point to a directory, not a file.
@saravanan it seems like a permission issue if the file shows up only after using Ambari. Make sure the files have the correct ownership, then confirm with the commands again. The ls command will list files and folders per the documentation.
Here is full documentation for ls command:
[root@hdp ~]# hdfs dfs -help ls
-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...] :
List the contents that match the specified file pattern. If path is not
specified, the contents of /user/<currentUser> will be listed. For a directory a
list of its direct children is returned (unless -d option is specified).
Directory entries are of the form:
permissions - userId groupId sizeOfDirectory(in bytes)
modificationDate(yyyy-MM-dd HH:mm) directoryName
and file entries are of the form:
permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
modificationDate(yyyy-MM-dd HH:mm) fileName
-C Display the paths of files and directories only.
-d Directories are listed as plain files.
-h Formats the sizes of files in a human-readable fashion
rather than a number of bytes.
-q Print ? instead of non-printable characters.
-R Recursively list the contents of directories.
-t Sort files by modification time (most recent first).
-S Sort files by size.
-r Reverse the order of the sort.
-u Use time of last access instead of modification for
display and sorting.
-e Display the erasure coding policy of files and directories.
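Two of these options are handy for the question above, where a file name appears not to match: -q makes non-printable characters in names visible as ?, and -C prints only the paths. A small illustration, using /user/data as a placeholder directory:
hdfs dfs -ls -q -C /user/data
If a ? shows up inside the listed name, the file was created with a hidden character in its name, which would explain why typing the visible name finds nothing.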

List all directories of files which have two specific strings and are located in HDFS

I need to get all directories whose names contain both family and test as words.
So, for all the directories located in:
hdfs/rohd/data/1Ex3
I tried :
hadoop fs -ls hdfs/rohd/data/1Ex3 | grep family;test
But it doesn't work (the unquoted ; ends the grep command, so the shell tries to run test as a separate command).
In fact the needed result should be like this:
hdfs/rohd/data/1Ex3/1_family_Pub_test
hdfs/rohd/data/1Ex3/2_family_Pub_test
hdfs/rohd/data/1Ex3/7_family_Pub_test
hdfs/rohd/data/1Ex3/3_family_Pub_test
hdfs/rohd/data/1Ex3/5_family_Pub_test
The solution is:
hadoop fs -ls hdfs/rohd/data/1Ex3 | grep _family_Pub_test
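If the directory names were less uniform, a more general way to require both words is to chain two greps. This is just a sketch over the same listing as above:
hadoop fs -ls hdfs/rohd/data/1Ex3 | grep family | grep test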

How to exclude directories while using `copyToLocal`

I want to copy files from HDFS, and I want folders to be excluded while copying. I tried hdfs dfs -copyToLocal, but as far as I tested it copies directories as well.
Is there any way/command to copy the files but not the directories?
As far as I know, there is no direct flag for -copyToLocal to copy only files, but you can make use of Linux grep to exclude directories from the data you are copying. Something like this:
hdfs dfs -ls <HDFS_DIR_PATH> | grep "^-" | awk 'BEGIN{FL=""} {FL=FL" "$8} END{system("hdfs dfs -copyToLocal "FL" .")}'
where,
hdfs dfs -ls <HDFS_DIR_PATH> lists all the files and directories
grep "^-" keeps only regular files (lines starting with -), excluding the directories
awk 'BEGIN{FL=""} {FL=FL" "$8} builds a string containing only the file paths
END{system("hdfs dfs -copyToLocal "FL" .")}' copies that list of file paths
Note that instead of . in the last command you can use any local file system path.
A variant of @daemon12's answer that achieves the same thing.
hadoop fs -ls <HDFS_DIR_PATH> | grep "^-" | \
awk '{print $8}' | hadoop fs -copyToLocal $(xargs) .
awk '{print $8}' is used to obtain the actual path column from the ls output.
$(xargs) is used to concatenate the lines of paths into a space-separated string.
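Another way to do the same thing, without building one very long command line, is to feed the paths to xargs and copy them one by one. This is only a sketch, assuming none of the paths contain spaces; it starts one HDFS client per file, so it is slower for large numbers of files:
hdfs dfs -ls <HDFS_DIR_PATH> | grep "^-" | awk '{print $NF}' | xargs -I{} hdfs dfs -copyToLocal {} .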

How to list all subdirectories with a string but not subdirectories of a match

In a bash script I would like to parse the names of all subdirectories and find all subdirectories that have a matching string, but I do not want subdirectories of a match. I am interested in automating construction of my $PATH and $PYTHONPATH variables based on directory structure.
Here's an example:
Let's say I want to go through my ~/dev and ~/bin folders and find all subdirectories with bin/ which holds programs that I will want to run at the shell. I can get a list with
$ ls -lR $HOME/bin $HOME/dev |grep "\/" | grep "bin:"
/Users/dat5h/bin:
/Users/dat5h/bin/project/bin:
...
These can all be appended to $PATH and have all available scripts ready to run.
BUT, let's say I was searching for directories with python modules and packages to add to $PYTHONPATH. I could conceivably look for all directories that start with /py-. So, I try:
$ ls -lR $HOME/bin $HOME/dev |grep "\/" | grep "/py-"
/Users/dat5h/bin/py-test:
/Users/dat5h/bin/py-test/test-package:
/Users/dat5h/bin/py-test/test-package/nested-test:
...
My thinking is that I would not want to put package directories and their subdirectories into the path. I'm pretty sure that would be strange, but I am actually new to Python, so suggestions would be helpful. How would I go about constructing a test case to get only directories matching py-* but none of the subsequent subdirectories?
I tried:
$ ls -lR $HOME/bin $HOME/dev |grep "\/" | egrep "/py-.*[^/]:"
But this doesn't get the job done either. Maybe a better regex? Any help would be greatly appreciated!
SOLUTION
The solution I ended up satisfied with was the find suggested below with a custom regex:
find $HOME/bin $HOME/dev -type d -regex ".*\/py\(\w\|-\w\)*"
This will find all subdirectories of ~/bin and ~/dev that are some variant of "pySOMETHING", "py-SOMETHING", "pySOME_THING_ELSE", or "py-SOME_THING_ELSE" but does not grab any subdirectories of those unless they also match this string. This ensures that I can have some simple naming convention for all of my directories with python modules/packages and import them this way without accidentally being able to import nested packages without the hierarchy.
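Since the goal is to build $PYTHONPATH from that list, one way to wire it together is a small loop over the find output. This is only a sketch, assuming bash and directory names without newlines:
while IFS= read -r dir; do
    PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$dir"
done < <(find "$HOME/bin" "$HOME/dev" -type d -regex ".*\/py\(\w\|-\w\)*")
export PYTHONPATH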
Does this:
find -type d -regex ".*py.*bin.*"
give you some start?

script to add files to SVN with filters

My bash scripting is weak. I want to create a script that filters and adds files to svn.
So far I have this:
ls | egrep -v "(\.tab\.|\.yy\.|\.o$|\.exe$|~$)"
I tried to output it using exec but couldn't figure out how. Before that I checked whether svn add accepts a regex. I am not sure it does, and I couldn't figure out how to invert the above without the -v (I tried "[^((\.tab\.|\.yy\.|\.o$|\.exe$|~$))]", but that didn't work as expected; it seems to only ignore .tab. files).
How do I create a script to add files to svn after applying a filter? Would this be the simplest way: use ls and grep, put the results into a bash array, then loop over it with svn add $element?
NOTE: This is on Linux; I don't think I'll have this running on Windows (I couldn't set up bison there), so as long as it works on most Linux distros I am happy. Ignore the fact that the above uses .exe.
A number of ways:
Use backticks: svn add `ls | egrep stuff`
Use xargs: ls | egrep stuff | xargs svn add
Use find and xargs: find . -type f -name '*.c' -print | grep -v '\.svn' | xargs svn add
Obviously, change "stuff" and the "-name *.c" to suit your requirements...
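For the specific filter in the question, one way to combine these ideas is to ask svn itself which files are unversioned and then apply the inverted filter. This is only a sketch, assuming GNU xargs (for the -r flag) and file names without spaces:
svn status | awk '/^\?/ {print $2}' \
    | egrep -v '(\.tab\.|\.yy\.|\.o$|\.exe$|~$)' \
    | xargs -r svn add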
Try using find.
find . -name .svn -prune -o -type f -exec svn add {} \;
The command following -exec will be executed for each file, and {} will be replaced with the filename at each iteration.
I'm not in front of my linux system so I can't get you a pattern that you need right now but if you read the man, you might get there.
Another solution to this is to add those file extensions and the .svn folder to your SVN ignore pattern.
Armed with a client configured as such, you could then do svn add * and get only what you want into SVN.
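If you go the ignore-pattern route, the client-side setting this answer refers to is the global-ignores option in ~/.subversion/config. A sketch of what it might look like for the extensions in the question (adjust the patterns to taste):
[miscellany]
global-ignores = *.o *.exe *.tab.* *.yy.* *~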