Hadoop: delete files which are zero bytes in HDFS

I am looking for a command in Hadoop 2.x to delete files which are zero bytes in HDFS.
Can anyone please let me know the appropriate command?
I am trying to find the files that are zero bytes in HDFS and delete them from the directory.

for f in $(hdfs dfs -ls -R / | awk '$1 !~ /^d/ && $5 == "0" { print $8 }'); do hdfs dfs -rm "$f"; done
Step by step:
hdfs dfs -ls -R / - list all files in HDFS recursively
awk '$1 !~ /^d/ && $5 == "0" { print $8 }' - print the full path (field 8) of entries that are not directories and have size 0
for f in $(...); do hdfs dfs -rm "$f"; done - iteratively remove

Building on Kombajn's answer: if you have a lot of files to delete, it will be quicker to use xargs. This allows you to delete multiple files per hdfs command, each invocation of which is rather expensive.
hdfs dfs -ls -R / | awk '$1 !~ /^d/ && $5 == "0" { print $8 }' | xargs -n100 hdfs dfs -rm
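Both variants word-split on the paths, so they will misbehave if an HDFS path contains spaces ($8 is only the first word of the path). A sketch of a space-tolerant version, assuming GNU awk and GNU xargs: the sub() strips the first seven fields of the ls output so the whole path survives, and -d '\n' makes xargs split on newlines only.
hdfs dfs -ls -R / | awk '$1 !~ /^d/ && $5 == "0" { sub(/^([^ ]+ +){7}/, ""); print }' | xargs -d '\n' -n100 hdfs dfs -rm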

Related

How to use hdfs dfs cp with xargs to work around linux argument list limit?

I have a lot of files to copy on HDFS and I run into the operating system's maximum argument list limit. A workaround that currently works is to generate a single command per file to process. However, that takes time.
I am trying to use xargs to get around the argument limit and reduce processing time, but I am not able to make it work.
Here is the current situation.
I echo the file names (because I have read somewhere that echo, being a shell builtin, is not subject to the argument limit) and pipe them to xargs.
echo "/user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl" | xargs -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
However this throws:
cp: `/user/florian_castelain/test/yolo
/user/florian_castelain/ignore_dtl': No such file or directory
Based on this example, I tried with:
echo "/user/florian_castelain/test/yolo" "/user/florian_castelain/ignore_dtl" | xargs -0 -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
Which prints:
cp: `/user/florian_castelain/test/yolo
/user/florian_castelain/ignore_dtl
But no file has been copied at all.
How can I use xargs combined with the hdfs dfs -cp command to copy multiple files at once?
Hadoop 2.6.0-cdh5.13.0
Edit 1
With the verbose flag (-t) and this configuration, I have the following output:
echo "/user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl" | xargs -I % -t hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
hdfs dfs -cp -p /user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl /user/florian_castelain/test/xargs/
Which throws:
cp: `/user/florian_castelain/test/yolo
/user/florian_castelain/ignore_dtl': No such file or directory
Yet executing this command manually works fine. Why is that?
Edit 2
Based on jjo's answer, I tried the following:
printf "%s\n" /user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl | xargs -0 -t -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
Which prints:
hdfs dfs -cp -p /user/florian_castelain/test/yolo
/user/florian_castelain/ignore_dtl
/user/florian_castelain/test/xargs/
And does not copy anything.
So I tried removing the newline characters before passing to xargs:
printf "%s\n" /user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl | tr -d "\n" | xargs -0 -t -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
Which prints:
hdfs dfs -cp -p /user/florian_castelain/test/yolo/user/florian_castelain/ignore_dtl /user/florian_castelain/test/xargs/
But nothing is copied either. :(
The problem you're facing is the whitespace between the paths, plus xargs slurping stdin entries as separated by newlines.
As your files are local, you should leverage find -print0 | xargs -0, e.g.:
find /user/florian_castelain/foo/bar -type f -print0 | xargs -0 -I % hdfs dfs -cp -p % /some/dst
If you still need/want to feed xargs with "whitespace separated filenames", use printf "%s\n" instead (which, like echo, is a bash builtin), so that each file is printed with a newline in between:
printf "%s\n" /user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl | xargs -I % hdfs dfs -cp -p % /some/dst

List Volumes with DF, Grep, Awk | Bash Shell

Trying to print all entries which start with /Volumes/, in order to list the mounted volumes on a Mac. See the updates below.
IFS=$'\n' read -r -d '' -a volumes < <(
df | egrep -o '/Volumes/.*'
)
echo "${volumes}"
Update 1: This worked, but prints a space before each new line.
#!/usr/bin/env bash
IFS=$'\n' read -r -d '' -a volumes < <(
df | egrep -oi '(\s+/Volumes/\S+)'
)
printf "%s\n" "${volumes[#]}"
Update 2: Worked, but doesn't print volume names with spaces in them
IFS=$'\n' read -d '' -ra volumes < <(
df | awk 'index($NF, "/Volumes/")==1 { print $NF }'
)
printf '%s\n' ${volumes[@]}
Update 3: Prints the second part of the volume name with spaces in it on a new line
IFS=$'\n' read -d '' -ra volumes < <(
df | awk -F ' {2,}' 'index($NF, "/Volumes/")==1 { print $NF }'
)
printf '%s\n' ${volumes[@]}
Solution:
Tested Platform: macOS Catalina
IFS=$'\n' read -d '' -ra volumes < <(
df | sed -En 's~.* (/Volumes/.+)$~\1~p'
)
printf '%s\n' "${volumes[@]}"
DF Output
Filesystem 512-blocks Used Available Capacity iused ifree %iused Mounted on
/dev/disk1s5 976490576 21517232 529729936 4% 484332 4881968548 0% /
devfs 781 781 0 100% 1352 0 100% /dev
/dev/disk1s1 976490576 413251888 529729936 44% 576448 4881876432 0% /System/Volumes/Data
/dev/disk1s4 976490576 10487872 529729936 2% 6 4882452874 0% /private/var/vm
map auto_home 0 0 0 100% 0 0 100% /System/Volumes/Data/home
/dev/disk7s1 40880 5760 35120 15% 186 4294967093 0% /private/tmp/tnt12079/mount
/dev/disk8s1 21448 1560 19888 8% 7 4294967272 0% /Volumes/usb drive
/dev/disk6s1 9766926680 8646662552 1119135456 89% 18530 48834614870 0% /Volumes/root
/dev/disk2s1 60425344 26823168 33602176 45% 419112 525034 44% /Volumes/KINGS TON
You may use this pipeline in OSX:
IFS=$'\n' read -d '' -ra volumes < <(
df | sed -En 's~.* (/Volumes/.+)$~\1~p'
)
Check array content:
printf '%s\n' "${volumes[@]}"
or
declare -p volumes
declare -a volumes=([0]="/Volumes/Recovery" [1]="/Volumes/Preboot")
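For reference, the same pipeline with the sed expression annotated (the comments are mine):
# -E  use extended regular expressions
# -n  suppress automatic printing; only the trailing p flag prints
# s~.* (/Volumes/.+)$~\1~p
#   ~ is just the s-command delimiter; the greedy .* consumes everything
#   up to the last space before /Volumes/, the group captures the mount
#   point (embedded spaces included), and p prints only matching lines
df | sed -En 's~.* (/Volumes/.+)$~\1~p'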
You may use
IFS=$'\n' read -r -d '' -a volumes < <(
df -h | awk 'NR>1 && $6 ~ /^\/Volumes\//{print $6}'
)
printf "%s\n" "${volumes[#]}"
The awk command gets all lines other than the first one (NR>1) and where Field 6 ("Mounted on") starts with /Volumes/ (see $6 ~ /^\/Volumes\/), and then prints the Field 6 value.
The printf "%s\n" "${volumes[#]}" command will print all the items in the volumes array on separate lines.
If the volume paths happen to contain spaces, you may instead check whether the line has a digit followed by % followed by whitespace and /Volumes/, and then join the fields from Field 6 onward with spaces:
df -h | awk 'NR>1 && $0 ~ /[0-9]%[ \t]+\/Volumes\//{a=$6; for (i=7;i<=NF;i++) a=a" "$i; print a}'
It's a little unclear just what output you want, but you can always use awk to parse the information. For example, if you want the "Filesystem" and "Mounted on" information, you can use this with df:
df | awk '{
for (i=1; i<=NF; i++)
if ($i ~ /^\/Volumes/) {
print $1, substr($0, match($0,/\/Volumes/))
break
}
}'
Or using the input you provided in the file dfout, you could read the file as:
awk '{
for (i=1; i<=NF; i++)
if ($i ~ /^\/Volumes/) {
print $1, substr($0, match($0,/\/Volumes/))
break
}
}' dfout
Example Output
Using the file dfout with your data you would receive:
/dev/disk8s1 /Volumes/usb drive
/dev/disk6s1 /Volumes/root
/dev/disk2s1 /Volumes/KINGS TON
If you need more or less of each record, you can just output whatever other fields you like in the print statement.
Let me know if you want the format or output different and I'm happy to help further. I don't have a Mac to test on, but the functions used are standard awk and not GNU awk specific.

Printing Both Matching and Non-Matching Patterns

I am trying to compare two files and then return one of the files' columns upon a match. The code that I am using right now excludes non-matching patterns and just prints out matching patterns. I need to print all results, both matching and non-matching, using grep.
File 1:
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
File 2:
F
A
B
Z
C
P
E
Current Result:
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
Expected Result:
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E
Bash Code:
while IFS=',' read point lat lon; do
check=`grep "${point} /home/aaron/file2 | awk '{print $1}'`
echo "${check},${lat},${lon}"
done < /home/aaron/file1
In awk:
$ awk -F, 'NR==FNR{a[$1]=$0;next}{print ($1 in a?a[$1]:$1)}' file1 file2
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E
Explained:
$ awk -F, ' # field separator to ,
NR==FNR { # file1
a[$1]=$0 # hash record to a, use field 1 as key
next
}
{
print ($1 in a?a[$1]:$1) # print match if found, else nonmatch
}
' file1 file2
If you don't care about order, there's a join binary in GNU coreutils that does just what you need:
$ sort file1 > sortedFile1
$ sort file2 > sortedFile2
$ join -t, -a 2 sortedFile1 sortedFile2
A,42.4,-72.2
B,47.2,-75.9
C,41.7,-95.2
E
F
P
Z,38.3,-70.7
It relies on files being sorted and will not work otherwise.
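If you'd rather skip the temporary files, bash process substitution does the same thing in one line (a small sketch):
join -t, -a 2 <(sort file1) <(sort file2)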
Now will you please get out of my /home/ ?
Another join-based solution, preserving the order:
f() { nl -nln -s, -w1 "$1" | sort -t, -k2; }; join -t, -j2 -a2 <(f file1) <(f file2) |
sort -t, -k2 |
cut -d, -f2 --complement
F
A,42.4,-72.2,2
B,47.2,-75.9,3
Z,38.3,-70.7,4
C,41.7,-95.2,5
P
E
Cannot beat the awk solution, but here is another alternative utilizing the Unix toolchain, based on the decorate-sort-undecorate pattern.
Problems with your current solution:
1. You are missing a double-quote in grep "${point} /home/aaron/file2.
2. You should loop over the other file (file2), so that all of its lines get printed:
while IFS=',' read point; do
echo "${point}$(grep "${point}" /home/aaron/file1 | sed 's/[^,]*,/,/')"
done < /home/aaron/file2
3. The grep can give more than one result. Which one do you want (head -1)?
An improvement would be
while IFS=',' read point; do
echo "${point}$(grep "^${point}," /home/aaron/file1 | sed -n '1s/[^,]*,/,/p')"
done < /home/aaron/file2
4. Using while is the wrong approach.
For small files it will get the work done, but you will get stuck with larger files. The reason is that you call grep for each line in file2, re-reading file1 many times.
Better is using awk or some other solution.
Another solution is using sed with the output of another sed command:
sed -r 's#([^,]*),(.*)#s/^\1$/\1,\2/#' /home/aaron/file1
This generates the editing commands for the second sed, which are then applied to file2:
sed -f <(sed -r 's#([^,]*),(.*)#s/^\1$/\1,\2/#' /home/aaron/file1) /home/aaron/file2
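To see what the generated script looks like: for the file1 data above, the inner sed emits one substitution command per record:
s/^A$/A,42.4,-72.2/
s/^B$/B,47.2,-75.9/
s/^Z$/Z,38.3,-70.7/
s/^C$/C,41.7,-95.2/
The outer sed then applies these to file2, replacing each bare key with its full record and passing non-matching lines through untouched.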

Copy files based on a timestamp value in a file name

All the files reside in one folder. File names look like the following:
1695_6892_20160321000000_20160321235959.file.name.csv.gz
The third substring (after the second _) is a timestamp.
How do I copy all files with a timestamp < 20150531000000 to another folder, my_folder?
Try this:
for i in *.gz; do test `echo "$i" | cut -d _ -f 3` -lt 20150531000000 && cp "$i" my_folder; done
And... you can use awk:
for i in $(ls -1 org_folder | awk -F'_' '{ if ($3 < 20150531000000) print $0 }'); do cp org_folder/$i my_folder/; done
ls | awk -F'_' '$3<20150531000000{print}'
should be the files you want to move, so
for f in "$(ls|awk -F'_' '$3<20150531000000{print}')"; do mv "${f}" elsewhere/ ;done

Bash: copy all directories, with contents, that match a pattern

Is there some way to copy directories, including their contents, using a bash script? For example:
// Suppose there are many directories inside /media/test, such as:
/media/test/
    en_US
        file1
        file 2
    de_DE
        file 1
        SUB-dir1
            sub file 1
        file 2
    .....
    Test 1
        testfile1
        folder
            more 1
    ............
Now I want to copy all the directories (including sub-directories and files) whose names match the pattern to another location.
For example, in the above case I want the directories en_US and de_DE to be copied to another location, including their sub-directories and files.
So far I have done / found out:
1) The needed pattern: \b\w{2}_\w{2}\b
2) I can list all the directories:
MYDIR="/media/test/"
DIRS=`ls -l $MYDIR | egrep '^d' | awk '{print $10}'`
for DIR in $DIRS
do
echo ${DIR}
done
Now I need help combining these so that the script can copy all the directories (including their contents) that match the pattern to another location.
Thanks in advance.
To selectively copy an entire directory structure to a similar directory structure, while filtering the contents, in a general way your best bet is to archive the original directory and unarchive it at the destination. For instance, using GNU Tar:
$ mkdir destdir
$ tar -c /media/test/{en_US,de_DE} | tar -C destdir -x --strip-components=1
In this example, the /media/test directory structure is partially recreated under destdir, excluding the /media prefix (thanks to --strip-components=1).
The left side tar archives just the directories/paths which match the pattern that we specified. The archive is produced on that command's standard output, which is piped to the decoding tar on the right hand side. The -C tells it to change to the destination directory. It extracts the files there, removing a leading path component.
$ ls destdir
test
$ ls destdir/test
en_US de_DE
Of course, your specific example test case is quite easily handled with cp -a:
$ mkdir destdir
$ cp -a /media/test/{en_US,de_DE} destdir
If the pattern is complicated, involving multiple selections of subtree material at deeper and/or different levels of the source directory hierarchy, then you need the more general approach, if you wish to do the copy in a single batch command which just specifies source patterns.
I'm not sure about your environment, but I guess you're trying to do this:
cp -r src_dir/??_?? dest_dir
Here is your starter for 10:
You will have to add the extra checks and balances that you require, but it should give you a flying start.
#!/bin/bash
# assumes $1 is source to search and $2 to destination to copy to
subdirs=`find "$1" -name '??_??' -print`
echo $subdirs
for x in $subdirs
do
echo $x
cp -a $x $2
done
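A tightened variant of the same idea, assuming GNU find (-regextype is GNU-specific), matching only directories named exactly two letters, an underscore and two letters, so stray files or longer names are skipped; the script name is hypothetical:
#!/bin/bash
# usage: ./copy_locales.sh /media/test /some/destination
src=$1
dst=$2
find "$src" -maxdepth 1 -type d -regextype posix-extended -regex '.*/[a-zA-Z]{2}_[a-zA-Z]{2}' -exec cp -a {} "$dst" \;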
Please check if this is what you wanted. It searches for directories whose names have the format xx_yy / ab_cd / &&_$$ (2 characters, an underscore, 2 characters) and copies their contents to a new directory.
usage: ./script.sh
cat script.sh
#!/bin/bash
MYDIR="/media/test/"
NEWDIRPATH="/media/test_new"
DIRS=`ls -l $MYDIR | grep "^d" | awk '{print $9}'`
for DIR in $DIRS
do
total_characters=`echo $DIR | wc -m`
if [ $total_characters -eq 6 ]; then
has_underscore=`echo "$DIR" | grep "_"`
if [ "$has_underscore" != "" ]; then
echo "${DIR}"
start_string_count=`echo $DIR | awk -F '_' '{print $1}' | wc -m`
end_string_count=`echo $DIR | awk -F '_' '{print $2}' | wc -m`
echo "start_string_count => $start_string_count ; end_string_count => $end_string_count"
if [ $start_string_count -eq 3 ] && [ $end_string_count -eq 3 ]; then
mkdir -p $NEWDIRPATH/"$DIR"_new
cp -r $DIR $NEWDIRPATH/"$DIR"_new
fi
fi
fi
done