grep and replace multiple strings between multiple files - regex

I have multiple XML files with strings that are all set to
<msg type="Status"
Some of these strings should be type "Warning" and I have a separate text file with the warning strings.
I can do a
grep -f strings.txt *.xml
this allows me to see which warning strings are incorrectly listed as status strings. What I would like to do is
grep -f strings.txt *.xml | sed 's/status/warning/'
This gives me my desired output, but it is only displayed (not saved). There are multiple XML files, so I can't just save the output to one file. I need sed to replace the string in the original .xml file it came from. Thanks for your help.

You were close. Instead of
grep -f strings.txt *.xml | sed 's/status/warning/'
do
grep -lf strings.txt *.xml | xargs sed -i 's/status/warning/g'
grep -l prints only the names of the files that match, and xargs passes those names to sed -i, which edits each file in place.
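Note that the sed command in the answer above rewrites every matching line of each listed file, not only the lines that contain one of the warning strings. A minimal sketch of a narrower variant follows. It assumes that strings.txt holds one literal message text per line (no regexes, no empty lines), that the attribute is spelled type="Status" as in the question, and that file names contain no newlines. The function name fix_warnings is made up for illustration.

```shell
# fix_warnings STRINGS_FILE XML_FILE...
# Hypothetical helper: switch type="Status" to type="Warning" only on
# lines that contain one of the literal strings listed in STRINGS_FILE.
fix_warnings() {
    strings=$1
    shift
    # -l: list matching files only, -F: fixed strings, -f: read them from a file
    grep -lFf "$strings" -- "$@" | while IFS= read -r file; do
        awk '
            NR == FNR { if ($0 != "") want[$0]; next }   # first file: load the strings
            {
                for (s in want)
                    if (index($0, s)) {                  # line mentions a warning string
                        sub(/type="Status"/, "type=\"Warning\"")
                        break
                    }
                print
            }' "$strings" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    done
}

# Usage (run in the directory that holds the files):
#   fix_warnings strings.txt *.xml
```

Writing to a temporary file and moving it into place avoids depending on sed -i, whose syntax differs between GNU and BSD sed.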

Recommend trying xmllint. Using the --pattern option, you can specify xpath search queries to search for exactly what you need it to find without complicated regular expressions.

There is a Perl script, xml_grep, that uses XPath as its query language:
http://search.cpan.org/dist/XML-Twig/tools/xml_grep/xml_grep
xml_grep '//msg[@type="Warning"]' *.xml

Related

How use cat and grep with gsutil and filter with a subdirectory name?

Currently, for getting a string (here: 123456789) into some files in all my buckets I do the following:
gsutil cat -h gs://AAAA/** | grep '123456789' > 20221109.txt
I get the path of the file when there is a match, so it works. But this way it searches all the directories (I have thousands of directories and thousands of files), which takes a very long time.
I want to filter with a date thanks to the name of the subdirectory, like:
gsutil cat -h gs://AAAA/*2022-11-09*/** | grep '123456789' > 20221109.txt
But it didn't work, and I have no clue how to solve my problem. I read a lot of answers on SO, but I couldn't find one that fits.
ps: I can't use find with gsutil , so I try to make it with cat and grep with gsutil in a single command line.
Thanks in advance for your help.
Finally, I managed to get what I wanted, but it was highly illegible. I think it's possible to do better, and I'm open to any improvement. As a reminder, this solution avoids reading every directory of a bucket.
1st Step :
I get the paths of all the files under the subdirectories that match my pattern (here, a date):
gsutil ls gs://directory1/*2022-11-09*/** > gs_path_files_2022_11_09.txt
After that, I grep each file and write to the output the name of the file and the line where I get my match:
while read -r line; do
gsutil cat "$line" | awk -v f="$line" '/the_string_i_want_to_match_in_my_file/{print f ":" $0}' >> results.txt
done < gs_path_files_2022_11_09.txt
After that, results.txt contains the name of each file followed by the line where you get your match.
Best regards

Safe search&replace on linux

Suppose I have files located in different subfolders and I would like to search, test and replace something in these files.
I would like to do it in three steps:
Search for a specific pattern (with or without regexp)
Test replacing it with something (with or without regexp)
Apply the changes only to the affected files
My current solution is to define some aliases in my .bashrc in order to easily use grep and sed:
alias findsrc='find . -name "*.[ch]" -or -name "*.asm" -or -name "*.inc"'
alias grepsrc='findsrc | xargs grep -n --color '
alias sedsrc='findsrc | xargs sed '
Then I use
grepsrc <pattern> to search my pattern
(no solution found yet)
sedsrc -i 's/<pattern>/replace/g'
Unfortunately this solution does not satisfy me. The first issue is that sed touches all the files, even those with no changes. Also, the need to use aliases does not look very clean to me.
Ideally I would like to have a workflow similar to this one:
Register a new context:
$ fetch register 'mysrcs' --recurse *.h *.c *.asm *.inc
Context list:
$ fetch context
1. mysrcs --recurse *.h *.c *.asm *.inc
Extracted from ~/.fetchrc
Find something:
$ fetch files mysrcs /0x[a-f0-9]{3}/
./foo.c:235 Yeah 0x245
./bar.h:2 Oh yeah 0x2ac hex
Test a replacement:
$ fetch test mysrcs /0x[a-f0-9]{3}/0xabc/
./foo.c:235 Yeah 0xabc
./bar.h:2 Oh yeah 0xabc hex
Apply the replacement:
$ fetch subst --backup mysrcs /0x[a-f0-9]{3}/0xabc/
./foo.c:235 Yeah 0xabc
./bar.h:2 Oh yeah 0xabc hex
Backup number: 242
Restore in case of mistake:
$ fetch restore 242
This kind of tool looks pretty standard to me. Everybody needs to search and replace. What alternative can I use that is standard on Linux?
#!/bin/ksh
# Call the script with the 2 patterns (search, then replace) as arguments
# assuming both patterns are "sed" compliant regexes
SearchStr="$1"
ReplaceStr="$2"
# Assuming the search starts from the current folder and takes any file
# if more filtering is needed, use a find before with a pipe
grep -l -r "$SearchStr" . | while IFS= read -r ThisFile
do
sed -i -e "s/${SearchStr}/${ReplaceStr}/g" "${ThisFile}"
done
This should be a base script to adapt to your needs.
I often have to perform such maintenance tasks. I use a mix of find, grep, sed, and awk.
And instead of aliases, I use functions.
For example:
# i. and ii.
function grepsrc {
find . \( -name "*.[ch]" -or -name "*.asm" -or -name "*.inc" \) -exec grep -Hn "$1" {} +
}
# iii.
function sedsrc {
grepsrc "$1" | awk -F: '{print $1}' | uniq | while read -r f; do
sed -i "s/$1/$2/g" "$f"
done
}
Usage example:
sedsrc "foo[bB]ar*" "polop"
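The "test" step that the question left open could be sketched as a dry run that shows a unified diff of what the substitution would change, without writing anything. This is only a sketch under assumptions: the function name previewsrc is made up, the pattern must be valid for both grep -E and sed -E, neither argument may contain a slash, and every file under the current directory is searched rather than only source files.

```shell
# previewsrc PATTERN REPLACEMENT
# Hypothetical helper: print what "s/PATTERN/REPLACEMENT/g" would change
# in each matching file below the current directory. No file is modified.
previewsrc() {
    grep -rlE -- "$1" . | while IFS= read -r f; do
        # diff exits with status 1 when the inputs differ, which is the
        # expected case here, so that status is deliberately ignored
        sed -E "s/$1/$2/g" "$f" | diff -u "$f" - || :
    done
}

# Usage:
#   previewsrc '0x[a-f0-9]{3}' '0xabc'
```

Once the diff looks right, the same two arguments can be passed to sedsrc to apply the change.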
for F in $(grep -Rl <pattern>) ; do sed 's/search/replace/' "$F" | sponge "$F" ; done
grep with the -l argument just lists files that match
We then use an iterator to just run those files which match through sed
We use the sponge program from the moreutils package to write the processed stream back to the same file
This is simple and requires no additional shell functions or complex scripts.
If you want to make it safe as well... check the folder into a Git repository. That's what version control is for.
Yes, there is a tool that does exactly what you are looking for: Git. Why manage the backup of your files yourself in case of mistakes when a specialized tool can do that job for you?
You split your request into 3 subquestions:
How to quickly search a subset of my files?
How to apply a substitution temporarily, then go back to the original state?
How to substitute in your subset of files?
First, some setup in your workspace. You need to init a Git repository, then add all your files to this repository:
$ cd my_project
$ git init
$ git add **/*.h **/*.c **/*.inc
$ git commit -m "My initial state"
Now, you can quickly get the list of your files with:
$ git ls-files
To do a replacement, you can either use sed, perl or awk. Here the example using sed:
$ git ls-files | xargs sed -i -e 's/search/replace/'
If you are not happy with this change, you can roll back at any time with:
$ git checkout HEAD -- .
This allows you to test your change and step back any time you want to.
So far we have not simplified the commands. I suggest adding an alias to your Git configuration file, usually located at ~/.gitconfig. Add this:
[alias]
sed = ! git grep -z --full-name -l '.' | xargs -0 sed -i -e
So now you can just type:
$ git sed s/a/b/
It's magic...

Remove duplicate filename extensions

I have thousands of files named something like filename.gz.gz.gz.gz.gz.gz.gz.gz.gz.gz.gz
I am using the find command like this find . -name "*.gz*" to locate these files and either use -exec or pipe to xargs and have some magic command to clean this mess, so that I end up with filename.gz
Someone please help me come up with this magic command that would remove the unneeded instances of .gz. I had tried experimenting with sed 's/\.gz//' and sed 's/(\.gz)//' but they do not seem to work (or to be more honest, I am not very familiar with sed). I do not have to use sed by the way, any solution that would help solve this problem would be welcome :-)
one way with find and awk:
find $(pwd) -name '*.gz'|awk '{n=$0;sub(/(\.gz)+$/,".gz",n);print "mv",$0,n}'|sh
Note:
I assume there are no special characters (such as spaces) in your filenames. If there are, you need to quote the filenames in the mv command.
I added a $(pwd) to get the absolute path of found name.
you can remove the ending |sh to check generated mv ... .... cmd, if it is correct.
If everything looks good, add the |sh to execute the mv
You may use
ls a.gz.gz.gz |sed -r 's/(\.gz)+/.gz/'
or without the regex flag
ls a.gz.gz.gz |sed 's/\(\.gz\)\+/.gz/'
ls *.gz | perl -ne '/((.*?\.gz).*)/; print "mv $1 $2\n"'
It will print shell commands to rename your files, it won't execute those commands. It is safe. To execute it, you can save it to file and execute, or simply pipe to shell:
ls *.gz | ... | sh
sed is great for replacing text inside files.
You can do that with bash string substitution:
for file in *.gz.gz; do
mv "${file}" "${file%%.*}.gz"
done
This might work for you (GNU sed):
printf '%s\n' *.gz | sed -r 's/^([^.]*)(\.gz){2,}$/mv -v & \1\2/e'
find . -name "*.gz.gz" |
while read f; do echo mv "$f" "$(sed -r 's/(\.gz)+$/.gz/' <<<"$f")"; done
This only previews the renaming (mv) command; remove the echo to perform actual renaming.
Processes matching files in the current directory tree, as in the OP (and not just files located directly in the current directory).
Limits matching to files that end in at least 2 .gz extensions (so as not to needlessly process files that end in just one).
When determining the new name with sed, makes sure that substring .gz doesn't just match anywhere in the filename, but only as part of a contiguous sequence of .gz extensions at the end of the filename.
Handles filenames with special chars. such as embedded spaces correctly (with the exception of filenames with embedded newlines.)
Using bash string substitution:
for f in *.gz.gz; do
mv "$f" "${f%%.gz.gz*}.gz"
done
This is a slight modification of jaypal's nice answer (which would fail if any of your files had a period as part of its name, such as foo.c.gz.gz). (Mine is not perfect, either) Note the use of double-quotes, which protects against filenames with "bad" characters, such as spaces or stars.
If you wish to use find to process an entire directory tree, the variant is:
find . -name \*.gz.gz | \
while read f; do
mv "$f" "${f%%.gz.gz*}.gz"
done
And if you are fussy and need to handle filenames with embedded newlines, change the while read to while IFS= read -r -d $'\0', and add a -print0 to find; see How do I use a for-each loop to iterate over file paths output by the find utility in the shell / Bash?.
But is this renaming a good idea? How was your filename.gz.gz created? gzip has guards against accidentally doing so. If you circumvent these via something like gzip -c $1 > $1.gz, buried in some script, then renaming these files will give you grief.
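The newline-safe handling mentioned above can also be had without read at all, by letting find hand the names straight to a small inline shell script. The following is a sketch under assumptions rather than a drop-in tool: the function name collapse_gz is made up, and it silently skips a file when the shortened name already exists.

```shell
# collapse_gz [DIR]
# Hypothetical helper: rename every file below DIR (default: current
# directory) that ends in two or more ".gz" so that it ends in one.
collapse_gz() {
    find "${1:-.}" -type f -name '*.gz.gz' -exec sh -c '
        for f do
            new=$f
            # strip one trailing .gz at a time until a single one is left
            while case $new in *.gz.gz) true ;; *) false ;; esac; do
                new=${new%.gz}
            done
            # never overwrite an existing file
            [ -e "$new" ] || mv -- "$f" "$new"
        done' sh {} +
}

# Usage:
#   collapse_gz /path/to/dir
```

Because the names travel as arguments and never through a pipe, spaces, stars and embedded newlines need no special treatment, and a name such as foo.c.gz.gz keeps its inner period.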
Another way with rename:
find . -iname '*.gz.gz' -exec rename -n 's/(\.\w+)\1+$/$1/' {} +
When happy with the results remove -n (dry-run) option.

How to grep a word inside xml files in a folder

I know I can use grep to find a word in all the files present in a folder like this
grep -rn core .
But my current directory has many sub-directories and I just want to search in all xml files present in the current directory and its all sub directories. How can I do that ?
I tried this
grep -rn core *.xml // Does not work
But it searches for xml files present in the current directory only. It does not do it recursively.
Try the --include option
grep -R --include="*.xml" "pattern" /path/to/dir
Reference: Grep Include Only *.txt File Pattern When Running Recursive Mode
Use find:
find /path/to/dir -name '*.xml' -exec grep -H 'pattern' {} \;
I use this often:
grep 'pattern' /path/to/dir/**/*.xml
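EDIT:
This relies on recursive ** globbing, which works in zsh (in bash 4 or later, enable it first with shopt -s globstar).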

How to grep for a file extension

I am currently trying to make a script that would grep input to see if something is of a certain file type (zip for instance), although the text before the file type could be anything, so for instance
something.zip
this.zip
that.zip
would all fall under the category. I am trying to grep for these using a wildcard, and so far I have tried this
grep ".*.zip"
But whenever I do that, it will find the .zip files just fine, but it will still display output if there are additional characters after the .zip so for instance .zippppppp or .zipdsjdskjc would still be picked up by grep. Having said that, what should I do to prevent grep from displaying matches that have additional characters after the .zip?
Test for the end of the line with $ and escape the second . with a backslash so it only matches a period and not any character.
grep ".*\.zip$"
However ls *.zip is a more natural way to do this if you want to list all the .zip files in the current directory or find . -name "*.zip" for all .zip files in the sub-directories starting from (and including) the current directory.
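A quick way to check the effect of the escaped dot and the $ anchor is to feed a few sample names through the pattern. In this sketch the function name filter_zip and the sample names are made up for illustration.

```shell
# Keep only the input lines that end in a literal ".zip".
filter_zip() {
    grep '\.zip$'
}

printf '%s\n' something.zip this.zippppp thatzip archive.zip.bak | filter_zip
# prints: something.zip
```

The name thatzip is rejected because the escaped dot must match a real period, and the other two are rejected because $ requires .zip to be the last thing on the line.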
On UNIX, try:
find . -type f -name \*.zip
You can also use grep to find all files with a specific extension:
find .|grep -e "\.gz$"
The . means the current folder.
If you want to specify a folder other than the current folder, just replace the . with the path of the folder.
Here is an example: Let's find all files that end with .gz and are in the folder /var/log
find /var/log/ |grep -e "\.gz$"
The output is something similar to the following:
$ find /var/log/ |grep -e "\.gz$"
/var/log//mail.log.1.gz
/var/log//mail.log.0.gz
/var/log//system.log.3.gz
/var/log//system.log.7.gz
/var/log//system.log.6.gz
/var/log//system.log.2.gz
/var/log//system.log.5.gz
/var/log//system.log.1.gz
/var/log//system.log.0.gz
/var/log//system.log.4.gz
The $ sign anchors the match to the end of the line, so only names that end with .gz are listed.
I use this to get a listing of the file types inside a folder.
find . -type f | grep -i -E -o "\.{1}\w*$" | sort -su
Outputs for example:
.DS_Store
.MP3
.aif
.aiff
.asd
.doc
.flac
.jpg
.m4a
.m4p
.m4r
.mp3
.pdf
.png
.txt
.wav
.wma
.zip
BONUS: with
find . -type f | grep -i -E -o "\.{1}\w*$" | sort | uniq -c
You'll get the file count:
106 .DS_Store
35 .MP3
89 .aif
5 .aiff
525 .asd
1 .doc
60 .flac
48 .jpg
149 .m4a
11 .m4p
1 .m4r
12844 .mp3
1 .pdf
5 .png
9 .txt
108 .wav
44 .wma
2 .zip
You need to do a couple of things. It should look like this:
grep '.*\.zip$'
You need to escape the second dot, so it will just match a dot, and not any character. Using single quotes makes the escaping a bit easier.
You need the dollar sign at the end of the line to indicate that you want the "zip" to occur at the end of the line.
grep -r pattern --include="*.txt" /path/to/dir/
Try: grep -o -E "(\\.([A-Za-z])+)+"
I used this to get multi-dotted/multiple extensions. So if the input was hello.tar.gz, then it would output .tar.gz.
For single dotted, use grep -o -E "\\.([A-Za-z])+$".
Tested on Cygwin/MingW+MSYS.
One more fix/add-on to the above example, which also accepts digits:
# multi-dotted/multiple extensions
grep -oEi "(\\.([a-z0-9])+)+" file.txt
# single dotted
grep -oEi "\\.([a-z0-9])+$" file.txt
This will get file extensions such as '.mp3'.
Just reviewing some of the other answers. The .* isn't necessary, and if you're looking for a certain file extension, it's best to include -i so that it's case-insensitive, in case the file is HELLO.ZIP, for example. The quotes are needed, though: without them the shell strips the backslash and grep sees .zip$, where the dot matches any character.
grep -i '\.zip$'
If you just want to find in the current folder, why not with this simple command without grep ?
ls *.zip
Simply do:
grep ".*\.zip$"
The "$" indicates the end of the line, and the backslash makes the dot match only a literal period.