Regex grep file contents and invoke command - regex

I have a file that has been generated containing MD5 info along with filenames. I'm wanting to remove the files from the directory they are in. I'm not sure how to go about doing this exactly.
filelist (file) contains:
MD5 (dupe) = 1fb218dfef4c39b4c8fe740f882f351a
MD5 (somefile) = a5c6df9fad5dc4299f6e34e641396d38
my command (which i would like to include with rm) looks like this:
grep -o "\((.*)\)" filelist
returns this:
(dupe)
(somefile)
*almost good, although the parentheses need to be eliminated (not sure how). I tried using grep -Po "(?<=\().*(?=\))" filelist using a lookahead/lookaround, but the command didn't work.
The next thing I would like to do is take the output filenames and delete them from the directory they are in. I'm not sure how to script it, but it would essentially do:
<returned results from grep>
rm dupe $target
rm somefile $target

If I understand correctly, you want to take lines like these
MD5 (dupe) = 1fb218dfef4c39b4c8fe740f882f351a
MD5 (somefile) = a5c6df9fad5dc4299f6e34e641396d38
extract the second column without the parentheses to get the filenames
dupe
somefile
and then delete the files?
Assuming the filenames don't have spaces, try this:
# this is where your duplicate files are.
dupe_directory='/some/path'
# Check that you found the right files:
awk '{print $2}' file-with-md5-lines.txt | tr -d '()' | xargs -I{} ls -l "$dupe_directory/{}"
# Looks ok, delete:
awk '{print $2}' file-with-md5-lines.txt | tr -d '()' | xargs -I{} rm -v "$dupe_directory/{}"
xargs -I{} tells xargs to replace every {} in the command with the current argument (the dupe filename), so the argument can be used inside a more complex command.
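If the filenames might contain spaces after all, the awk '{print $2}' step would cut them short. Here is a minimal sketch of one way around that, assuming the list really has the "MD5 (name) = hash" layout shown in the question. The scratch directory and the sample names are made up for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical demo setup: a scratch directory with two files and a matching list.
dupe_directory=$(mktemp -d)
touch "$dupe_directory/dupe" "$dupe_directory/some file"
filelist="$dupe_directory/filelist"
printf '%s\n' \
  'MD5 (dupe) = 1fb218dfef4c39b4c8fe740f882f351a' \
  'MD5 (some file) = a5c6df9fad5dc4299f6e34e641396d38' > "$filelist"

# Keep only what sits between "MD5 (" and ") = <32 hex digits>", one name per line.
# read -r with an empty IFS preserves spaces and backslashes in each name.
sed -n 's/^MD5 (\(.*\)) = [0-9a-f]\{32\}$/\1/p' "$filelist" |
while IFS= read -r name; do
  rm -v -- "$dupe_directory/$name"
done

# Only the list itself should be left behind.
remaining=$(ls "$dupe_directory")
echo "left in directory: $remaining"
```

Swap rm -v for ls -l first if you want the same "check, then delete" routine as above.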

The tool you're looking for is xargs: http://unixhelp.ed.ac.uk/CGI/man-cgi?xargs
It's pretty standard on *nix systems.
UPDATE: Given that target equals the directory where the files live...
I believe the syntax would look something like:
yourgrepcmd | xargs -I{} rm "$target/{}"
The -I creates a placeholder string, and each line from your grep command gets inserted there.
UPDATE:
The step you need to remove the parens is a little use of sed's substitution command (http://unixhelp.ed.ac.uk/CGI/man-cgi?sed)
Something like this:
cat filelist | sed "s/MD5 (\([^)]*\)) .*$/\1/" | xargs -I{} rm "$target/{}"
The moral of the story here is that if you learn to use sed and xargs (or awk, if you want something a little more advanced), you'll be a more capable Linux user.

Related

How to find specific text in a text file, and append it to the filename?

I have a collection of plain text files which are named as yymmdd_nnnnnnnnnn.txt, which I want to append another number sequence to the filenames, so that they each become named as yymmdd_nnnnnnnnnn_iiiiiiiii.txt instead, where the iiiiiiiii is taken from the one line in each file which contains the text "GST: 123456789⏎" (or similar) at the end of the line. While I am sure that there will only be one such matching line within each file, I don't know exactly which line it will be on.
I need an elegant one-liner solution that I can run over the collection of files in a folder, from a bash script file, to rename each file in the collection by appending the specific GST number for each filename, as found within the files themselves.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the error "warning: regexp escape sequence \d is not a known regexp operator". I had assumed that this regex should return any number of digits which are at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also to help with the appropriate coding to move the collection of files to the new filenames? Thanks.
Thanks to a comment from @Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer towards a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from @Renaud's contribution to rename the files. I'm getting closer to a working solution, thanks!
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
fn = FILENAME
sub(/\.txt$/, "", fn)
print "mv", FILENAME, fn "_" $2 ".txt"
nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
Drop the echo if you're satisfied with the output.
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
n=$(awk '/GST:/ {print $NF}' "$f")
if [[ -z "$n" ]]; then
printf '%s: GST not found\n' "$f"
continue
fi
mv "$f" "${f%.txt}_$n.txt"
done
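For anyone who wants to try the loop without touching real data, here is a rough self-contained sketch. It assumes the GST number is the last field on its line and consists of digits only. The scratch directory and the sample file contents are invented for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical demo: one sample file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Invoice 42' 'Our GST: 112060340' 'Thank you' \
  > "$workdir/150101_2224567890.txt"

for f in "$workdir"/*_*.txt; do
  # Print the last field of the first line containing "GST:", then stop reading.
  n=$(awk '/GST:/ { print $NF; exit }' "$f")
  # Skip files whose GST value is empty or contains anything other than digits
  # (for example a stray carriage return from Windows line endings).
  case $n in
    ''|*[!0-9]*) printf '%s: no usable GST number\n' "$f" >&2; continue ;;
  esac
  mv -- "$f" "${f%.txt}_$n.txt"
done

renamed=$(ls "$workdir")
echo "$renamed"
```

Note that running it a second time over the same folder would append the number again, because the renamed files still match *_*.txt.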
Another one-line solution to consider, although perhaps not so elegant.
for original_filename in *_*.txt; do \
new_filename=${original_filename%'.txt'}_$(
grep -E 'GST: ' "$original_filename" | \
sed -E 's/.*GST//g; s/[^0-9]//g'
)'.txt' && \
mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi line script:-
#!/bin/sh
for f in *.txt; do
prefix=$(echo "${f}" | sed s'#\.txt##')
cp "${f}" f1
sed -i s'#GST#%GST#' "./f1"
cat "./f1" | tr '%' '\n' > f2
number=$(cat "./f2" | sed -n '/GST/'p | cut -d':' -f2 | tr -d ' ')
newname="${prefix}_${number}.txt"
mv -v "${f}" "${newname}"
rm -v "./f1"
rm -v "./f2"
done
In general, if you want to make your files easy to work with, then leave as many potential places for them to be split with newlines as possible. It is much easier to alter files by simply being able to put what you want to delete or print on its own line than it is to search for things horizontally with regular expressions.

Find all pdf in folder of a given year in filename

I have a folder with thousands of pdf named according to dates like 20100820.pdf or 20000124.pdf etc.
In the command line, I used the following command in other projects to search for all pdf in a folder and attach a command to it, like so ls | grep -E "\.pdf$" | [command here]. Now I would like it to search only those pdf in a given folder from the year 2010 for example. How can I achieve that?
Wow, it was so easy in the end.
This solves the problem:
ls | grep -E "\.pdf$" | grep -E "2010"
ls | grep -E "\.pdf$" | awk -F "." '{if ($1>20100000) print $1}'
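The second command takes all the PDFs and splits each filename at the dot; the digits before the dot are then compared with 20100000. If they are greater, it prints that numeric part (the name without the .pdf extension). Note that this selects files from 2010 and every later year, and that the first command matches 2010 anywhere in the name, not only as the year.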
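If only the year 2010 is wanted, anchoring the year at the start of the name may be closer to what was asked. A small sketch, assuming every name is eight digits (yyyymmdd) followed by .pdf; the scratch directory and sample names are made up.

```shell
#!/usr/bin/env bash
# Hypothetical demo files in a scratch directory.
pdfdir=$(mktemp -d)
touch "$pdfdir/20100820.pdf" "$pdfdir/20000124.pdf" "$pdfdir/20201001.pdf"

year=2010
# The name must start with the year, continue with four more digits, and end in .pdf.
# 20201001.pdf contains "2010" in the middle, so an unanchored grep would list it too.
matches=$(ls "$pdfdir" | grep -E "^${year}[0-9]{4}\.pdf$")
echo "$matches"
```

The result can be piped into another command the same way as in the ls | grep | [command here] pattern from the question.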

Make changes in multiple files using sed and find

I've got to go through 100+ sites and add two lines to the same file in all of them (sendmail1.php).
The boss wants me to hand copy/paste this stuff, but there's GOT to be an easier way, so I'm trying to do it with find and sed, both of which apparently I'm not using well. I just want to run a script in the dir housing the directories that have all of the sites in them.
I have this:
#!/bin/bash
read -p "Which file, sir? : " file
# find . -depth 1 -type f -name 'sendmail1.php' -exec \
sed -i 's/require\ dirname\(__FILE__\).\'\/includes\/validate.php\';/ \
a require_once dirname\(__FILE__\).\'\/includes\/carriersoft_email.php\';' $file
sed -i 's/\else\ if\($_POST[\'email\']\ \&\&\ \($_POST\'work_email\'\]\ ==\ \"\"\)\){/ \
a\t$carriersoft_sent = carriersoft_email\(\);' $file
exit 0
At the moment, I have the find commented out while trying to sort out sed here and testing the script, but I'd like to solve both.
I think I'm not escaping something necessary in the sed bits, but I keep going over it and changing it and getting different errors (sometimes "unfinished s/ statement", other times other stuff).
The point is I have to do this:
Right below require dirname(__FILE__).'/includes/validate.php';, add this line:
require_once dirname(__FILE__).'/includes/carriersoft_email.php';
AND
Under else if($_POST['email'] && ($_POST['work_email'] == "")){, add this line:
$carriersoft_sent = carriersoft_email();
I'd like to turn this 4 hours copy/pasta nightmare into a 2 minutes lazy admin type script it and get it done job.
But my fu is not strong with the sed or the find...
As for the find, I get "path must preceed expression: 1"
I've found questions here addressing that error, but indicating that using the '' to surround the filename should resolve it, but it's not working.
Keep it simple and just use awk since awk can operate with strings, unlike sed which only works on REs with additional caveats:
find whatever |
while IFS= read -r file
do
awk '
{ print }
index($0,"require dirname(__FILE__).\047/includes/validate.php\047;") {
print "require_once dirname(__FILE__).\047/includes/carriersoft_email.php\047;"
}
index($0,"else if($_POST[\047email\047] && ($_POST[\047work_email\047] == \"\")){") {
print "$carriersoft_sent = carriersoft_email();"
}
' "$file" > /usr/tmp/tmp_$$ &&
mv /usr/tmp/tmp_$$ "$file"
done
With GNU awk you can use -i inplace to avoid manually specifying the tmp file name if you like, just like with sed -i.
The \047s are one way to specify a single quote inside a single-quote-delimited script.
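A quick way to convince yourself of the \047 trick, plus one alternative that passes the quote in as an awk variable (the variable name q is just an illustration):

```shell
# \047 is the octal escape for a single quote inside an awk string literal.
quoted=$(awk 'BEGIN { print "it\047s" }')
echo "$quoted"

# Alternative: hand the quote character to awk through -v.
also_quoted=$(awk -v q="'" 'BEGIN { print "it" q "s" }')
echo "$also_quoted"
```

Both commands print the same three characters with a real single quote in the middle.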
Try this:
sed -e "s/require dirname(__FILE__).'\/includes\/validate.php';/&\nrequire_once dirname(__FILE__).'\/includes\/carriersoft_email.php'\;/" \
-e "s/else if(\$_POST\['email'\] && (\$_POST\['work_email'\] == \"\")){/&\n\$carriersoft_sent = carriersoft_email();/" \
file
Note: I haven't used the -i flag. Once you have confirmed that it works for you, you can add the -i flag. I have also combined your two sed commands into one using the -e option.
I think it would be clearer if instead of s you used another fine command, a. To output one changed file, create a script (eg. script.sed) with the following content:
/require dirname(__FILE__)\.'\/includes\/validate.php';/a\
require_once dirname(__FILE__).'/includes/carriersoft_email.php';
/else if(\$_POST\['email'\] && (\$_POST\['work_email'\] == "")){/a\
$carriersoft_sent = carriersoft_email();
and run sed -f script.sed sendmail1.php.
To apply changes in all files, run:
find . -name 'sendmail1.php' -exec sed -i -f script.sed {} \;
(-i causes sed to change file in-place).
It is always advisable in such operations to do a backup and check out the exact changes after running the command. :)
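One possible way to get that backup and check in a single pass is to give -i a suffix, so sed keeps the original next to the edited file. This is only a sketch under assumptions: the scratch tree, the site1/site2 names and the two-line sample file are invented, and only the first of the two rules is included to keep it short.

```shell
#!/usr/bin/env bash
# Hypothetical demo tree: two sites, each with a sendmail1.php.
sites=$(mktemp -d)
mkdir "$sites/site1" "$sites/site2"
for d in "$sites/site1" "$sites/site2"; do
  printf '%s\n' '<?php' "require dirname(__FILE__).'/includes/validate.php';" \
    > "$d/sendmail1.php"
done

# The "a" command for the first rule, written to a script file.
cat > "$sites/script.sed" <<'EOF'
/require dirname(__FILE__)\.'\/includes\/validate\.php';/a\
require_once dirname(__FILE__).'/includes/carriersoft_email.php';
EOF

# -i.bak edits in place and keeps the original as sendmail1.php.bak.
find "$sites" -name 'sendmail1.php' -exec sed -i.bak -f "$sites/script.sed" {} \;

# Review one change; diff exits non-zero when the files differ, which is expected here.
diff "$sites/site1/sendmail1.php.bak" "$sites/site1/sendmail1.php" || true

# Count how many files now contain the new line.
changed=$(grep -l 'carriersoft_email' "$sites"/site*/sendmail1.php | wc -l | tr -d ' ')
echo "files changed: $changed"
```

Once the diffs look right, the .bak files can be removed with another find.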

Find directories with names matching pattern and move them

I have a bunch of directories like 001/ 002/ 003/ mixed in with others that have letters in their names. I just want to grab all the directories with numeric names and move them into another directory.
I try this:
file */ | grep ^[0-9]*/ | xargs -I{} mv {} newdir
The matching part works, but it ends up moving everything to the newdir...
I am not sure I understood correctly but here is at least something to help.
Use a combination of find and xargs to manipulate lists of files.
find -maxdepth 1 -regex './[0-9]*' -print0 | xargs -0 -I'{}' mv "{}" "newdir/{}"
Using -print0 and -0 and quoting the replacement symbol {} makes your script more robust. It will handle most situations where names contain spaces or non-printable characters, because find then passes the names to xargs separated by a \0 character instead of a \n.
mv is not powerful enough by itself. It cannot work on patterns.
Try this approach: Rename multiple files by replacing a particular pattern in the filenames using a shell script
Either use a loop or a rename command.
With loop and array,
Your script would be something like this:
#!/bin/bash
DIR=( $(file */ | grep '^[0-9]*/' | awk -F/ '{print $1}') )
for dir in "${DIR[@]}"; do
mv "$dir" /path/to/DIRECTORY
done
done
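A variant that avoids parsing the output of file altogether is to let the shell glob the directories and test each name against a regex. This is a rough sketch for bash only; the scratch tree and the directory names are invented for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical demo tree.
base=$(mktemp -d)
mkdir "$base/001" "$base/002" "$base/abc" "$base/12b" "$base/newdir"

for dir in "$base"/*/; do
  name=${dir%/}      # strip the trailing slash left by the */ glob
  name=${name##*/}   # keep only the last path component
  # Move the directory only if its name consists solely of digits.
  if [[ $name =~ ^[0-9]+$ ]]; then
    mv -- "$base/$name" "$base/newdir/"
  fi
done

ls "$base/newdir"
```

Names such as 12b stay where they are because the regex is anchored at both ends.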

How can I exclude one word with grep?

I need something like:
grep ^"unwanted_word"XXXXXXXX
You can do it using -v (for --invert-match) option of grep as:
grep -v "unwanted_word" file | grep XXXXXXXX
grep -v "unwanted_word" file will filter out the lines that contain unwanted_word, and grep XXXXXXXX will then list only the lines that match the pattern XXXXXXXX.
EDIT:
From your comment it looks like you want to list all lines without the unwanted_word. In that case all you need is:
grep -v 'unwanted_word' file
I understood the question as "How do I match a word but exclude another", for which one solution is two greps in series: First grep finding the wanted "word1", second grep excluding "word2":
grep "word1" | grep -v "word2"
In my case: I need to differentiate between "plot" and "#plot", which grep's "word" option won't do ("#" not being an alphanumeric character).
If your grep supports Perl regular expression with -P option you can do (if bash; if tcsh you'll need to escape the !):
grep -P '^(?!.*unwanted_word).*keyword' file
Demo:
$ cat file
foo1
foo2
foo3
foo4
bar
baz
Let us now list all foo except foo3
$ grep -P '^(?!.*foo3).*foo' file
foo1
foo2
foo4
$
The right solution is to use grep -v "word" file, with its awk equivalent:
awk '!/word/' file
However, if you happen to have a more complex situation in which you want, say, XXX to appear and YYY not to appear, then awk comes handy instead of piping several greps:
awk '/XXX/ && !/YYY/' file
# ^^^^^ ^^^^^^
# I want it |
# I don't want it
You can even say something more complex. For example: I want those lines containing either XXX or YYY, but not ZZZ:
awk '(/XXX/ || /YYY/) && !/ZZZ/' file
etc.
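To see both conditions at work, here is a small sketch with made-up sample lines fed to awk on standard input:

```shell
# Hypothetical sample input, one case per line.
sample=$(printf '%s\n' 'XXX alone' 'XXX with YYY' 'YYY alone' 'XXX and ZZZ' 'nothing')

# Lines that contain XXX but not YYY.
only_x=$(printf '%s\n' "$sample" | awk '/XXX/ && !/YYY/')
echo "$only_x"

# Lines that contain XXX or YYY, but not ZZZ.
x_or_y=$(printf '%s\n' "$sample" | awk '(/XXX/ || /YYY/) && !/ZZZ/')
echo "$x_or_y"
```

The first command keeps "XXX alone" and "XXX and ZZZ"; the second keeps "XXX alone", "XXX with YYY" and "YYY alone".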
Invert match using grep -v:
grep -v "unwanted word" file pattern
grep provides '-v' or '--invert-match' option to select non-matching lines.
e.g.
grep -v 'unwanted_pattern' file_name
This will output all the lines from the file file_name that do not contain 'unwanted_pattern'.
If you are searching the pattern in multiple files inside a folder, you can use the recursive search option as follows
grep -r 'wanted_pattern' * | grep -v 'unwanted_pattern'
Here grep will list all the occurrences of 'wanted_pattern' in all the files within the current directory and pass them to the second grep, which filters out the lines containing 'unwanted_pattern'.
'|' - pipe will tell shell to connect the standard output of left program (grep -r 'wanted_pattern' *) to standard input of right program (grep -v 'unwanted_pattern').
The -v option will show you all the lines that don't match the pattern.
grep -v ^unwanted_word
I excluded the root ("/") mount point by using grep -vw "^/".
# cat /tmp/topfsfind.txt| head -4 |awk '{print $NF}'
/
/root/.m2
/root
/var
# cat /tmp/topfsfind.txt| head -4 |awk '{print $NF}' | grep -vw "^/"
/root/.m2
/root
/var
I've a directory with a bunch of files. I want to find all the files that DO NOT contain the string "speedup" so I successfully used the following command:
grep -iL speedup *
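For reference, a tiny sketch of what that command does, using three invented files in a scratch directory:

```shell
# Hypothetical demo files.
logs=$(mktemp -d)
printf 'speedup: 2x\n' > "$logs/a.txt"
printf 'SpeedUp measured\n' > "$logs/b.txt"
printf 'no result yet\n' > "$logs/c.txt"

# -L lists the files that have no matching line; -i ignores case,
# so b.txt counts as containing the string as well.
without=$(cd "$logs" && grep -iL speedup *)
echo "$without"
```

Only c.txt is listed, since it is the one file without the string in any capitalisation.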