How to use hdfs dfs -cp with xargs to work around the Linux argument list limit? - hdfs

I have a lot of files to copy on HDFS and I run into the operating system's maximum argument list length. A workaround that currently works is to generate one command per file, but that takes time.
I am trying to use xargs to get around the argument limit and reduce processing time, but I cannot make it work.
Here is the current situation.
I echo the file names (because I have read somewhere that echo is not subject to the argument limit) and pipe them to xargs.
echo "/user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl" | xargs -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
However this throws:
cp: `/user/florian_castelain/test/yolo
/user/florian_castelain/ignore_dtl': No such file or directory
Based on this example, I tried with:
echo "/user/florian_castelain/test/yolo" "/user/florian_castelain/ignore_dtl" | xargs -0 -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
Which prints:
cp: `/user/florian_castelain/test/yolo
/user/florian_castelain/ignore_dtl
But no file has been copied at all.
How can I use xargs combined with the hdfs dfs -cp command to copy multiple files at once?
Hadoop 2.6.0-cdh5.13.0
Edit 1
With the verbose flag and this configuration, I have the following output:
echo "/user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl" | xargs -I % -t hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
hdfs dfs -cp -p /user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl /user/florian_castelain/test/xargs/
Which throws:
cp: `/user/florian_castelain/test/yolo
/user/florian_castelain/ignore_dtl': No such file or directory
While executing this command manually works fine. Why is that?
Edit 2
Based on jjo's answer, I tried the following:
printf "%s\n" /user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl | xargs -0 -t -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
Which prints:
hdfs dfs -cp -p /user/florian_castelain/test/yolo
/user/florian_castelain/ignore_dtl
/user/florian_castelain/test/xargs/
And does not copy anything.
So I tried removing new line character before passing to xargs:
printf "%s\n" /user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl | tr -d "\n" | xargs -0 -t -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
Which prints:
hdfs dfs -cp -p /user/florian_castelain/test/yolo/user/florian_castelain/ignore_dtl /user/florian_castelain/test/xargs/
But nothing is copied either. :(

The problem you're facing is twofold: the whitespace between your two paths, plus xargs (with -I %) consuming stdin one line at a time, so your single echoed line becomes a single % replacement containing both paths.
If the file names can be produced by a local find, leverage find -print0 | xargs -0, e.g.:
find /user/florian_castelain/foo/bar -type f -print0 | xargs -0 -I % hdfs dfs -cp -p % /user/florian_castelain/test/xargs/
If you still need/want to feed xargs whitespace-separated filenames, use printf "%s\n" instead (which, like echo, is a bash builtin), so that each file name is output on its own line:
printf "%s\n" /user/florian_castelain/test/yolo /user/florian_castelain/ignore_dtl | xargs -I % hdfs dfs -cp -p % /some/dst
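Note that -I % also works against the original goal: it makes xargs run one hdfs dfs -cp per input item, so you lose the batching that was the point of using xargs. Since hdfs dfs -cp accepts multiple sources followed by a destination directory, a small sh -c wrapper can put the destination last while xargs packs many sources per invocation. Here is a minimal local sketch of that pattern; plain cp stands in for hdfs dfs -cp -p, and all paths are invented for the demo:

```shell
#!/bin/sh
# Sketch: batch many source paths per command invocation while keeping
# the destination directory last. `cp -p` stands in for `hdfs dfs -cp -p`.
mkdir -p /tmp/xargs_demo/src /tmp/xargs_demo/dst
touch /tmp/xargs_demo/src/a.txt /tmp/xargs_demo/src/b.txt

# One path per line on stdin; xargs appends as many of them as fit after
# the wrapper, and "$@" expands to that batch before the destination.
printf '%s\n' /tmp/xargs_demo/src/a.txt /tmp/xargs_demo/src/b.txt |
  xargs sh -c 'cp -p "$@" /tmp/xargs_demo/dst/' _

ls /tmp/xargs_demo/dst
```

On the cluster the same shape would presumably be `... | xargs sh -c 'hdfs dfs -cp -p "$@" /user/florian_castelain/test/xargs/' _`, though I have not tested that against Hadoop 2.6.0-cdh5.13.0.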

Related

Cronned Django command output not posted in tee from bash script

I have a bash script, controlled by cron, that runs every day. I would like to grep some lines from the Python (Django) output and post them with slacktee to my Slack channel. But I am only catching some warnings from the script, not my own prints (something to do with stdout and stderr?), and I can't seem to debug it.
#!/bin/bash
printf "\nStarting the products update command\n"
DATE=`date +%Y-%m-%d`
source mypath/bin/activate
cd some/path/_production/_server
./manage.py nn_products_update > logs/product_updates.log
tail --lines=1000 logs/product_updates.log | grep INFO | grep $DATE
So for each day, I'm trying to grep messages like these:
[INFO][NEURAL_RECO_UPDATE][2017-08-28 22:15:04] Products update...
But it doesn't get posted to the Slack channel. Moreover, the log file gets overwritten every day instead of appended to; how do I change that, please? The tail command works fine when run in the shell by itself. How is that possible? (Sorry for asking two questions, but I believe they are related; I just can't find an answer.)
For reference, here is the cron entry:
20 20 * * * /bin/bash /path/to/server/_production/bin/runReco.sh 2>&1 | slacktee.sh -c 'my_watch'
Many thanks
EDIT:
Output when using grep -e INFO -e $DATE:
grep: [INFO][NEURAL_RECO_UPDATE][2017-08-29: No such file or directory
grep: 07:36:56]: No such file or directory
grep: No: No such file or directory
grep: new: No such file or directory
grep: active: No such file or directory
grep: products: No such file or directory
grep: to: No such file or directory
grep: calculate.: No such file or directory
grep: Terminating...: No such file or directory
Using:
#!/bin/bash
set -euo pipefail
at the top of your script will provide better failure detection and debug output.
See bash strict mode article for a full explanation of the settings.
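The pipefail part matters for pipelines like yours: without it, a pipeline's exit status is that of its last command only. A minimal bash demo:

```shell
#!/bin/bash
# Without pipefail, the pipeline's status is that of the LAST command,
# so the failing `false` on the left is invisible.
false | true
echo "without pipefail: $?"   # -> without pipefail: 0

set -o pipefail
false | true
echo "with pipefail: $?"      # -> with pipefail: 1
```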
The file gets overwritten because you are using a single > (redirection) rather than >> which appends redirected output.
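A two-line experiment (scratch path invented for the demo) makes the difference visible:

```shell
#!/bin/sh
# > truncates the file on every run; >> appends to it.
log=/tmp/append_demo.log
rm -f "$log"

echo "run 1" >  "$log"   # truncate-and-write
echo "run 2" >  "$log"   # truncates again: "run 1" is gone
echo "run 3" >> "$log"   # append: file now holds "run 2" and "run 3"

cat "$log"
```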
To help debug further it would probably make life easier if you put
2>&1 | slacktee.sh -c 'my_watch'
In your runReco.sh as in:
tail --lines=1000 logs/product_updates.log | grep INFO | grep "$DATE" 2>&1 | slacktee.sh -c 'my_watch'
Although chaining lots of commands together in shell scripts makes them harder to debug.
Hence, breaking up the tail line. Note that TMPFILE holds the text itself, not a file name, so it must be fed to grep on stdin (e.g. via a here-string) rather than as an argument list:
TMPFILE=`tail --lines=1000 logs/product_updates.log`
grep -e INFO -e "$DATE" <<< "$TMPFILE" 2>&1 | slacktee.sh -c 'my_watch'

Regex for finding a string recursively?

I have a set of commands listed in a file, say commands.txt, and I am trying to search all the PHP scripts recursively inside a directory using grep -r, but I am not successful in substituting the variable I get from commands.txt into the grep.
For example,
when I try
grep -R --include "*.php" <command> .
I get the desired result but when I try to do a for loop like
for i in `cat /var/tmp/commands.txt`; do 'grep -R --include *.php $i .' ; done
or
for i in `cat /var/tmp/commands`; do 'echo $i | grep -r --include *.php .' ; done
the expression fails.
You want to find commands from /var/tmp/commands.txt in the PHP files in the current directory and its subdirectories, with (what looks like) GNU grep? You might go about it like this:
fgrep -Rwf /var/tmp/commands.txt --include '*.php' .
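(fgrep is the same as grep -F, i.e. fixed-string matching.) A self-contained sketch of how the flags combine; the tree, file names, and pattern words here are invented for the demo:

```shell
#!/bin/sh
# Build a throwaway tree with a pattern file and two PHP files.
mkdir -p /tmp/grepf_demo/sub
printf '%s\n' exec_cmd shutdown > /tmp/grepf_demo/commands.txt
echo 'calls exec_cmd somewhere' > /tmp/grepf_demo/sub/a.php
echo 'nothing interesting here' > /tmp/grepf_demo/sub/b.php

# -F: fixed strings, -w: whole words only, -f FILE: one pattern per line,
# -R with --include: recurse but search only *.php files.
grep -RFw -f /tmp/grepf_demo/commands.txt --include '*.php' /tmp/grepf_demo
```

Only a.php is reported, since it is the only PHP file containing one of the listed words.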

Find files with regex match and different regex not match

I have three files foo1.txt, foo2.txt and foo3.txt, which contain the following lines
# foo1.txt
JOBDONE
and
# foo2.txt
Execution halted
and
# foo3.txt
Execution halted
JOBDONE
I have been able to find the ones with both JOBDONE and Execution halted using:
find ./foo*.txt | xargs grep -l 'Execution halted' | xargs grep -l "JOBDONE"
But have not been able to find those files which have JOBDONE or Execution halted but not both. I have tried:
find ./foo*.txt | xargs grep -lv "JOBDONE" | xargs grep -l "Execution halted"
find ./foo*.txt -exec grep -lv "JOBDONE" {} \; | xargs grep -l "Execution halted"
but these have been (to my understanding, incorrectly) returning
./foo2.txt
./foo3.txt
What is wrong with my understanding of how xargs and exec works with grep and how do I use grep or another portable command to select those logs that have JOBDONE but not Execution halted or vice versa?
Here is a GNU awk solution (GNU because a multi-character RS is a gawk extension):
awk -v RS="#-#-#" '/JOBDONE/ && /Execution halted/ {print FILENAME}' foo*
foo3.txt
Setting RS to something that does not occur in the files makes awk treat each whole file as a single record.
It then tests whether that record contains both strings and, if so, prints the filename.
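The one-liner above matches files containing both strings; for the "one but not the other" case the question actually asks about, here is a portable sketch using grep -q twice per file (sample files recreated in a scratch directory):

```shell
#!/bin/sh
# Recreate the three sample files from the question.
mkdir -p /tmp/xor_demo && cd /tmp/xor_demo
printf 'JOBDONE\n'                   > foo1.txt
printf 'Execution halted\n'          > foo2.txt
printf 'Execution halted\nJOBDONE\n' > foo3.txt

# Print files containing one marker or the other, but not both.
for f in foo*.txt; do
  a=0; grep -q 'JOBDONE' "$f"          || a=1
  b=0; grep -q 'Execution halted' "$f" || b=1
  if [ "$a" -ne "$b" ]; then echo "$f"; fi
done
```

This prints foo1.txt and foo2.txt but skips foo3.txt, which contains both markers.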

how to prevent all pipe redirections from getting appended to a file

I am trying the following hadoop command in unix.
hadoop fs -ls <HDFS path> | grep MatchValue | cut -d "/" -f11
or
hadoop fs -ls <HDFS path> | sed -e '/MatchValue/!d' | cut -d "/" -f11
I get the desired output.
Now here comes my problem. I am trying to redirect this output to a file in shell script.
hadoop fs -ls <HDFS path> | sed -e '/MatchValue/!d' | cut -d "/" -f11 >> LogName.lst
or
hadoop fs -ls <HDFS path> | sed -e '/MatchValue/!d' | cut -d "/" -f11 1>> LogName1.lst 2>> LogName2.lst
Now the log files also contain output from the first and second stages of the pipe.
I also tried with just the first two commands of the pipe, without cut; even then the hadoop command's own output ends up in the file.
I tried this approach in both ksh and bash. No use.
No: a pipe does not leak the stdout of earlier commands past the commands that consume it. What you are seeing in the file is most likely their stderr, which bypasses the pipe entirely.
Try this command to suppress the errors:
hadoop fs -ls <HDFS path> 2>/dev/null | sed -e '/MatchValue/!d' | cut -d "/" -f11 >> LogName.lst
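A quick way to convince yourself that it is stderr, not the pipe, reaching the file (the path here is invented for the demo):

```shell
#!/bin/sh
# stderr bypasses the pipe: wc -l only counts stdout lines, while the
# error text from ls goes wherever stderr points (here, a scratch file).
ls /no/such/dir 2>/tmp/stderr_demo.err | wc -l   # prints 0: stdout is empty
cat /tmp/stderr_demo.err                         # the error went to stderr
```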

How can I exclude directories matching certain patterns from the output of the Linux 'find' command?

I want to use regex's with Linux's find command to dive recursively into a gargantuan directory tree, showing me all of the .c, .cpp, and .h files, but omitting matches containing certain substrings. Ultimately I want to send the output to an xargs command to do certain processing on all of the matching files. I can pipe the find output through grep to remove matches containing those substrings, but that solution doesn't work so well with filenames that contain spaces. So I tried using find's -print0 option, which terminates each filename with a nul char instead of a newline (whitespace), and using xargs -0 to expect nul-delimited input instead of space-delimited input, but I couldn't figure out how to pass the nul-delimited find through the piped grep filters successfully; grep -Z didn't seem to help in that respect.
So I figured I'd just write a better regex for find and do away with the intermediary grep filters... perhaps sed would be an alternative?
In any case, for the following small sampling of directories...
./barney/generated/bam bam.h
./barney/src/bam bam.cpp
./barney/deploy/bam bam.h
./barney/inc/bam bam.h
./fred/generated/dino.h
./fred/src/dino.cpp
./fred/deploy/dino.h
./fred/inc/dino.h
...I want the output to include all of the .h, .c, and .cpp files but NOT the ones that appear in the 'generated' and 'deploy' directories.
BTW, you can create an entire test directory (named fredbarney) for testing solutions to this question by cutting & pasting this whole line into your bash shell:
mkdir fredbarney; cd fredbarney; mkdir fred; cd fred; mkdir inc; mkdir docs; mkdir generated; mkdir deploy; mkdir src; echo x > inc/dino.h; echo x > docs/info.docx; echo x > generated/dino.h; echo x > deploy/dino.h; echo x > src/dino.cpp; cd ..; mkdir barney; cd barney; mkdir inc; mkdir docs; mkdir generated; mkdir deploy; mkdir src; echo x > 'inc/bam bam.h'; echo x > 'docs/info info.docx'; echo x > 'generated/bam bam.h'; echo x > 'deploy/bam bam.h'; echo x > 'src/bam bam.cpp'; cd ..;
This command finds all of the .h, .c, and .cpp files...
find . -regextype posix-egrep -regex ".+\.(c|cpp|h)$"
...but if I pipe its output through xargs, the 'bam bam' files each get treated as two separate (nonexistent) filenames (note that here I'm simply using ls as a stand-in for what I actually want to do with the output):
$ find . -regextype posix-egrep -regex ".+\.(c|cpp|h)$" | xargs -n 1 ls
ls: ./barney/generated/bam: No such file or directory
ls: bam.h: No such file or directory
ls: ./barney/src/bam: No such file or directory
ls: bam.cpp: No such file or directory
ls: ./barney/deploy/bam: No such file or directory
ls: bam.h: No such file or directory
ls: ./barney/inc/bam: No such file or directory
ls: bam.h: No such file or directory
./fred/generated/dino.h
./fred/src/dino.cpp
./fred/deploy/dino.h
./fred/inc/dino.h
So I can enhance that with the -print0 and -0 args to find and xargs:
$ find . -regextype posix-egrep -regex ".+\.(c|cpp|h)$" -print0 | xargs -0 -n 1 ls
./barney/generated/bam bam.h
./barney/src/bam bam.cpp
./barney/deploy/bam bam.h
./barney/inc/bam bam.h
./fred/generated/dino.h
./fred/src/dino.cpp
./fred/deploy/dino.h
./fred/inc/dino.h
...which is great, except that I don't want the 'generated' and 'deploy' directories in the output. So I try this:
$ find . -regextype posix-egrep -regex ".+\.(c|cpp|h)$" -print0 | grep -v generated | grep -v deploy | xargs -0 -n 1 ls
barney fred
...which clearly does not work. So I tried using the -Z option with grep (not knowing exactly what the -Z option really does) and that didn't work either. So I figured I'd write a better regex for find and this is the best I could come up with:
find . -regextype posix-egrep -regex "(?!.*(generated|deploy).*$)(.+\.(c|cpp|h)$)" -print0 | xargs -0 -n 1 ls
...but bash didn't like that (!.*: event not found, whatever that means), and even if that weren't an issue, my regex doesn't seem to work on the regex tester web page I normally use.
Any ideas how I can make this work? This is the output I want:
$ find . [----options here----] | [----maybe grep or sed----] | xargs -0 -n 1 ls
./barney/src/bam bam.cpp
./barney/inc/bam bam.h
./fred/src/dino.cpp
./fred/inc/dino.h
...and I'd like to avoid scripts & temporary files, which I suppose might be my only option.
Thanks in advance!
-Mark
This works for me:
find . -regextype posix-egrep -regex '.+\.(c|cpp|h)$' -not -path '*/generated/*' \
-not -path '*/deploy/*' -print0 | xargs -0 ls -L1d
Changes from your version are minimal: I added exclusions of certain path patterns separately, because that's easier, and I single-quote things to hide them from shell interpolation.
The event not found is because ! is being interpreted as a request for history expansion by bash. The fix is to use single quotes instead of double quotes.
Pop quiz: What characters are special inside of a single-quoted string in sh?
Answer: Only ' is special (it ends the string). That's the ultimate safety.
grep with -Z (also known as --null) outputs a null character after each file name grep prints, instead of the usual separator. What you wanted was -z (also known as --null-data), which makes grep treat both its input and its data output as sequences of null-terminated records rather than newline-terminated lines. That makes it work as expected with the output of find ... -print0, which emits a null character after each file name instead of a newline.
If you had done it this way:
find . -regextype posix-egrep -regex '.+\.(c|cpp|h)$' -print0 | \
grep -vzZ generated | grep -vzZ deploy | xargs -0 ls -1Ld
Then the input and output of grep would have been null-delimited and it would have worked correctly... until one of your source files began being named deployment.cpp and started getting "mysteriously" excluded by your script.
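A tiny stand-alone demo of -z on NUL-delimited data, assuming GNU grep (the two file names are invented):

```shell
#!/bin/sh
# Two NUL-terminated records on stdin; -z keeps grep's input and output
# NUL-delimited, so the stream stays compatible with xargs -0.
printf 'bam bam.h\0dino.h\0' | grep -z 'dino' | xargs -0 -n 1 echo
```

Only dino.h survives the filter, and the embedded space in "bam bam.h" never gets a chance to be mis-split.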
Incidentally, here's a nicer way to generate your testcase file set.
while read -r file ; do
mkdir -p "${file%/*}"
touch "$file"
done <<'DATA'
./barney/generated/bam bam.h
./barney/src/bam bam.cpp
./barney/deploy/bam bam.h
./barney/inc/bam bam.h
./fred/generated/dino.h
./fred/src/dino.cpp
./fred/deploy/dino.h
./fred/inc/dino.h
DATA
Since I did this anyway to verify, I figured I'd share it and save you the repetition. Don't do anything twice! That's what computers are for.
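As another aside, pruning the unwanted directories inside find itself avoids both the post-filtering greps and any need for lookaround regexes. A self-contained sketch, assuming GNU find for -regextype (the /tmp tree is invented for the demo):

```shell
#!/bin/sh
# Build a two-directory sample, then exclude 'generated' and 'deploy'
# with -prune so only the wanted branch reaches -regex/-print0.
mkdir -p /tmp/prune_demo/src /tmp/prune_demo/generated
echo x > '/tmp/prune_demo/src/bam bam.cpp'
echo x > '/tmp/prune_demo/generated/bam bam.h'

find /tmp/prune_demo -regextype posix-egrep \
     \( -name generated -o -name deploy \) -prune -o \
     -regex '.+\.(c|cpp|h)$' -print0 |
  xargs -0 -n 1 ls -d
```

Only the src file is printed; the generated directory is never descended into.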
Your command:
find . -regextype posix-egrep -regex "(?!.*(generated|deploy).*$)(.+\.(c|cpp|h)$)" -print0 | xargs -0 -n 1 ls
fails because you are trying to use POSIX extended regular expressions, which don't support lookahead/lookbehind assertions: https://superuser.com/a/596499/658319
GNU find's -regextype does not offer a PCRE dialect, so the lookahead cannot be expressed there; excluding the directories with -not -path (or -prune), as in the answer above, is the practical fix.