How to call grep on pattern files?

I'm trying to grep over files whose names match a regexp, but the following:
#!/bin/bash
grep -o -e "[a-zA-Z]\{1,\}" $1 -h --include=$2 -R
works only in some cases. When I call the script like this:
./script.sh dir1/ [A-La-l]+
it doesn't work. But following:
./script.sh dir1/ \*.txt
works fine. I have also tried passing arguments within double-quotes and quotes but neither worked for me.
Do you have any ideas how to solve this problem?

grep's --include option does not accept a regex but a glob (such as *.txt), which is considerably less powerful. You will have to decide whether you want to match regexes or globs -- *.txt is not a valid regex (the equivalent regex is .*\.txt) while [A-La-l]+ is not a valid glob.
If you want to do it with regexes, you will not be able to do it with grep alone. One thing you could do is to leave the file selection to a different tool such as find:
find "$1" -type f -regex "$2" -exec grep -o -e '[a-zA-Z]\{1,\}' -h '{}' +
This constructs and runs the command grep -o -e '[a-zA-Z]\{1,\}' -h followed by the list of files under $1 whose paths match the regex $2. If you replace the + with \;, it will run the command for each file individually, which should yield the same results (very) slightly more slowly.
If you want to do it with globs, you can carry on as before; your code already does that. You should put double quotes around $1 and $2, though.
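For the glob route, a minimal quoted version of the original script might look like this (a sketch; the options are reordered but equivalent):
#!/bin/bash
# Recursively grep for alphabetic runs in files under $1 whose names match the glob $2
grep -o -h -R --include="$2" -e '[a-zA-Z]\{1,\}' "$1"
It would then be invoked as ./script.sh dir1/ '*.txt'; the quotes keep the shell from expanding the glob before grep sees it.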

ls and regular expression linux

I have two directories:
run.2016-02-25_01.
run.2016-02-25_01.47.04
Both these directories are present under a common directory called gte.
I want to list only the directory that does not end with a dot character (.).
I am using the following command; however, I am not able to make it work:
ls run* | grep '.*\d+'
The command is not able to find anything.
The negated character set in shell globbing uses ! not ^:
ls -d run*[!.]
(The ^ was at one time an archaic synonym for |.) The -d option lists directory names, not the contents of those directories.
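Applied to the two directories from the question, the glob keeps only the name whose final character is not a dot:
$ ls -d run*[!.]
run.2016-02-25_01.47.04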
Your attempt using:
ls run* | grep '.*\d+'
requires a PCRE-enabled grep and the PCRE regex option (-P), and you are looking for zero or more of any character followed by one or more digits, which isn't what you said you wanted. You could use:
ls -d run* | grep '[^.]$'
which doesn't require the PCRE regexes, but simply having the shell glob the right names is probably best.
If you're worried that there might not be a name starting run and ending with something other than a dot, you should consider shopt -s nullglob, as mentioned in Anubhava's answer. However, note the discussion below between hek2mgl and myself about the potentially confusing behaviour of, in particular, the ls command in conjunction with shopt -s nullglob. If you were using:
for name in run*[!.]
do
…
done
then shopt -s nullglob is perfect; the loop iterates zero times when there's no match for the glob expression. It isn't so good when the glob expression is an argument to commands such as ls that provide a default behaviour in the absence of command line arguments.
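To see the pitfall concretely: when nullglob is set and nothing matches, the glob expands to no arguments at all, so ls falls back to its default and lists the current directory:
shopt -s nullglob
ls -d run*[!.]    # with no match this runs a plain 'ls -d', which prints '.'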
You don't need grep. Just use:
shopt -s nullglob
ls -d run*[0-9]
If your directories do not always end with digits, then use extglob:
shopt -s nullglob extglob
ls -d run*+([^.])
or, to list all entries inside the run* directories that end without a dot:
printf "%s\n" run*+([^.])/*
This works...
ls|grep '.*[^.]$'
That says: match any amount of anything, but the last character before the end of the line must be anything except a period.
To list the directories that don't end with a dot:
ls -d run* |grep "[^.]$"
I would use find
find -regextype posix-awk -maxdepth 1 -type d -regex '.*[[:digit:]]+$'

Remove duplicate filename extensions

I have thousands of files named something like filename.gz.gz.gz.gz.gz.gz.gz.gz.gz.gz.gz
I am using the find command like this, find . -name "*.gz*", to locate these files, and I want to either use -exec or pipe to xargs with some magic command to clean up this mess, so that I end up with filename.gz.
Someone please help me come up with this magic command that would remove the unneeded instances of .gz. I have tried experimenting with sed 's/\.gz//' and sed 's/(\.gz)//', but they do not seem to work (or, to be more honest, I am not very familiar with sed). I do not have to use sed, by the way; any solution that would help solve this problem would be welcome :-)
One way with find and awk:
find $(pwd) -name '*.gz'|awk '{n=$0;sub(/(\.gz)+$/,".gz",n);print "mv",$0,n}'|sh
Note:
I assume there are no special chars (like spaces) in your filenames. If there were, you would need to quote the filenames in the mv command.
I added $(pwd) to get the absolute path of each found name.
You can remove the trailing |sh to check that the generated mv ... .... commands are correct.
If everything looks good, add the |sh back to execute the mv.
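For instance, with the |sh removed, the pipeline just prints the commands it would run (output sketched for a hypothetical /data/filename.gz.gz.gz):
find $(pwd) -name '*.gz'|awk '{n=$0;sub(/(\.gz)+$/,".gz",n);print "mv",$0,n}'
mv /data/filename.gz.gz.gz /data/filename.gz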
You may use
ls a.gz.gz.gz |sed -r 's/(\.gz)+/.gz/'
or without the regex flag
ls a.gz.gz.gz |sed 's/\(\.gz\)\+/.gz/'
ls *.gz | perl -ne '/((.*?\.gz).*)/; print "mv $1 $2\n"'
It will print shell commands to rename your files; it won't execute those commands, so it is safe to review them first. To execute them, you can save them to a file and run it, or simply pipe to the shell:
ls *.gz | ... | sh
sed is great for replacing text inside files.
You can do that with bash string substitution:
for file in *.gz.gz; do
mv "${file}" "${file%%.*}.gz"
done
This might work for you (GNU sed):
printf '%s\n' *.gz | sed -r 's/^([^.]*)(\.gz){2,}$/mv -v & \1\2/e'
find . -name "*.gz.gz" |
while read f; do echo mv "$f" "$(sed -r 's/(\.gz)+$/.gz/' <<<"$f")"; done
This only previews the renaming (mv) command; remove the echo to perform actual renaming.
Processes matching files in the current directory tree, as in the OP (and not just files located directly in the current directory).
Limits matching to files that end in at least 2 .gz extensions (so as not to needlessly process files that end in just one).
When determining the new name with sed, makes sure that substring .gz doesn't just match anywhere in the filename, but only as part of a contiguous sequence of .gz extensions at the end of the filename.
Handles filenames with special chars such as embedded spaces correctly (with the exception of filenames with embedded newlines).
Using bash string substitution:
for f in *.gz.gz; do
mv "$f" "${f%%.gz.gz*}.gz"
done
This is a slight modification of jaypal's nice answer (which would fail if any of your files had a period as part of its name, such as foo.c.gz.gz). Mine is not perfect either. Note the use of double quotes, which protects against filenames with "bad" characters, such as spaces or stars.
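A quick shell sketch of the difference between the two suffix-stripping expansions:
f=foo.c.gz.gz
echo "${f%%.*}.gz"        # prints foo.gz: %%.* cuts at the FIRST dot, losing the .c
echo "${f%%.gz.gz*}.gz"   # prints foo.c.gz: cutting at .gz.gz preserves it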
If you wish to use find to process an entire directory tree, the variant is:
find . -name \*.gz.gz | \
while read f; do
mv "$f" "${f%%.gz.gz*}.gz"
done
And if you are fussy and need to handle filenames with embedded newlines, change the while read to while IFS= read -r -d $'\0', and add a -print0 to find; see How do I use a for-each loop to iterate over file paths output by the find utility in the shell / Bash?.
But is this renaming a good idea? How was your filename.gz.gz created? gzip has guards against accidentally doing so. If you circumvent these via something like gzip -c $1 > $1.gz, buried in some script, then renaming these files will give you grief.
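For instance, a hypothetical wrapper script along these lines would stack an extra extension on every run, because the redirection sidesteps gzip's own suffix check:
# compress.sh (hypothetical): gzip to stdout, redirect to NAME.gz
gzip -c "$1" > "$1.gz"    # run on filename.gz, this produces filename.gz.gz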
Another way with rename:
find . -iname '*.gz.gz' -exec rename -n 's/(\.\w+)\1+$/$1/' {} +
When happy with the results remove -n (dry-run) option.

Replace text across multiple files in a directory with sed

I need to hide the IP addresses in the log files for security reasons. The addresses are both IPv4 and IPv6. How do I hide them so that, for example, the IPv4 address 123.4.32.16 is replaced by x.x.x.x and the IPv6 address 232e:23o5:te43:5423:5433:0000:ef09:23ff is replaced by x:x:x:x:x:x:x:x? Is it possible to do this with a single sed command?
You might want to use find and sed for this.
Let's assume your logs have the extension ".log":
find /path/to/logs -type f -name '*.log' -exec \
sed -i -e 's,[0-9]\+\(\.[0-9]\+\)\{3\},x.x.x.x,g' \
-e 's,[0-9a-f]\+\(:[0-9a-f]\+\)\{7\},x:x:x:x:x:x:x:x,gi' {} \;
How does this work?
First, we ask find to recursively locate files with the .log extension, starting from /path/to/logs. -type f tells find we want to find regular files.
For each file, it will execute sed. The -i argument tells sed you want to edit the file in place. (Check out http://www.grymoire.com/Unix/Sed.html)
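You can sanity-check both substitutions on a sample line before touching any files (assuming GNU sed, as above):
printf 'client 123.4.32.16 peer fe80:0:0:0:0:0:a:1 up\n' |
  sed -e 's,[0-9]\+\(\.[0-9]\+\)\{3\},x.x.x.x,g' -e 's,[0-9a-f]\+\(:[0-9a-f]\+\)\{7\},x:x:x:x:x:x:x:x,gi'
client x.x.x.x peer x:x:x:x:x:x:x:x up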
One solution using find and perl:
find /the/directory -type f -exec perl -pi -e '
s/\b\d{1,3}(\.\d{1,3}){3}\b/x.x.x.x/g;
s/\b[a-f\d]{1,4}(:[a-f\d]{1,4}){7}\b/x:x:x:x:x:x:x:x/gi' {} \;
(type on one line)
Well, first you should probably just fix whatever is doing the logging to log the way you want to.
Now if you need to go back and modify historical files, you might consider using sed (GNU sed syntax shown):
sed -E 's/\b([0-9]{1,3}\.){3}[0-9]{1,3}\b/x.x.x.x/g' /path/to/file
sed -E 's/\b([[:xdigit:]]{1,4}:){7}[[:xdigit:]]{1,4}\b/x:x:x:x:x:x:x:x/g' /path/to/file
I use this:
find . -name "*.log" -exec grep -izl PATTERN {} \; | xargs perl -i.orig -e -n 's/PATTERN/REPLACEMENT/g'
You'd want to insert your PATTERN(s) and replace *.log with something else depending on the name of your log files.
The -i.orig backs up the files being replaced with an extension of .orig.
I found that this was relatively faster than other things I tried: a find/grep combo to identify candidates, then perl to do the work.

Recursive find and replace based on regex

I have changed my directory structure and I want to do the following:
Do a recursive grep to find all instances of a match
Change each match to the updated location string
One example (out of hundreds) would be:
from common.utils import debug --> from etc.common.utils import debug
To get all the instances of what I'm looking for I'm doing:
$ grep -r 'common.' ./
However, I also need to make sure common is preceded by a space. How would I do this find and replace?
It's hard to tell exactly what you want because your refactoring example changes the import as well as the package, but the following will change common. -> etc.common. for all files in a directory:
sed -i 's/\bcommon\./etc.&/' $(egrep -lr '\bcommon\.' .)
This assumes you have gnu sed available, which most linux systems do. Also, just to let you know, this will fail if there are too many files for sed to handle at one time. In that case, you can do this:
egrep -lr '\bcommon\.' . | xargs sed -i 's/\bcommon\./etc.&/'
Note that it might be a good idea to run the sed command as sed -i'.OLD' 's/\bcommon\./etc.&/' so that you get a backup of the original file.
If your grep implementation supports Perl syntax (-P flag, on e.g. Linux it's usually available), you can benefit from the additional features like word boundaries:
$ grep -Pr '\bcommon\.'
By the way:
grep -r tends to be much slower than piping the output of a find command, as in Rob's example. Furthermore, when you're sure that the file-names found do not contain any whitespace, using xargs is much faster than -exec:
$ find . -type f -name '*.java' | xargs grep -P '\bcommon\.'
Or, applied to Tim's example:
$ find . -type f -name '*.java' | xargs sed -i.bak 's/\<common\./etc.common./'
Note that, in the latter example, the replacement is done after creating a *.bak backup for each file changed. This way you can review the command's results and then delete the backups:
$ find . -type f -name '*.bak' | xargs rm
If you've made an oopsie, the following command will restore the previous versions:
$ find . -type f -name '*.bak' | while read LINE; do mv -f "$LINE" "${LINE%.bak}"; done
Of course, if you aren't sure that there's no whitespace in the file names and paths, you should apply the commands via find's -exec parameter.
Cheers!
This is roughly how you would do it using find. This requires testing:
find . -name \*.java -exec sed -i "s/FIND_STR/REPLACE_STR/g" {} \;
This translates as "Starting from the current directory, find all files that end in .java and execute sed -i on each one (where {} is a placeholder for the currently found file)". The expression "s/FIND_STR/REPLACE_STR/g" replaces FIND_STR with REPLACE_STR on each line of the current file.
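Applied to the OP's import example, that pattern might look like this (assuming GNU sed for -i.bak and that the sources are .py files; the leading space enforces the "preceded by a space" requirement):
find . -name \*.py -exec sed -i.bak 's/ common\./ etc.common./g' {} \;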

Using non-consuming matches in Linux find regex

Here's my problem in a simplified scenario.
Create some test files:
touch /tmp/test.xml
touch /tmp/excludeme.xml
touch /tmp/test.ini
touch /tmp/test.log
I have a find expression that returns me all the XML and INI files:
[root@myserver] ~> find /tmp -name -prune -o -regex '.*\.\(xml\|ini\)'
/tmp/test.ini
/tmp/test.xml
/tmp/excludeme.xml
I now want a way of modifying this -regex to exclude the excludeme.xml file from being included in the results.
I thought this should be possible by using/combining a non-consuming regex (?=expr) with a negated match (?!expr). Unfortunately I can't quite get the format of the command right, so my attempts result in no matches being returned. Here was one of my attempts (I've tried many different forms of this with different escaping!):
find /tmp -name -prune -o -regex '\(?=.*excludeme\.xml\).*\.\(xml\|ini\)'
I can't break down the command into multiple steps (e.g. piping through grep -v) as the find command is assumed as input into other parts of our tool.
This does what you want on linux:
find /tmp -name -prune -o -regex '.*\.\(xml\|ini\)' \! -regex '.*excludeme\.xml'
I'm not sure if the "!" operator is unique to gnu find.
Not sure about what escapes you need or if lookarounds work, but these work for Perl:
/^(?!.*\/excludeme\.).*\.(xml|ini)$/
/(?<!\/excludeme)\.(xml|ini)$/
Edit - Just checked the find command; the best you can do with find is to change the regextype to -regextype posix-extended, but that doesn't do stuff like look-arounds. The only way around this looks to be using some gnu stuff, either as @unholygeek suggests with find or piping find into gnu grep with the -P perl option. You can use the above regex verbatim if you go with a gnu grep. Something like find .... -print | xargs grep -P ...
Sorry, that's the best I can do.
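For completeness, a concrete version of that pipe against the question's test files might be (assuming a GNU grep built with PCRE support, and leaving aside the -prune clause):
find /tmp -regex '.*\.\(xml\|ini\)' -print | grep -P '^(?!.*/excludeme\.).*\.(xml|ini)$'
/tmp/test.ini
/tmp/test.xml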