Using sed/awk and regex to process logs

I have 1000s of log files generated by a very verbose PHP script. The general structure is as follows:
###Unknown no of lines, which I want to ignore###
=================================================
$insert_vars['cdr_pkey']=17568
$id<TAB>$g1<TAB>$i1<TAB>$rating1<TAB>$g2<TAB>$i2<TAB>$rating2 #<TAB>more $gX,$iX,$ratingX
#numerical values of $id $g1 $i1 etc. separated by tab
#numerical values of ---""---
#I do not know how many lines will be there (unique column is $id)
=================================================
###Unknown no of lines, which I want to ignore###
I have to process these log files and create an Excel sheet (I am thinking CSV format) and report the data back. I am really bad at Excel, but I thought of outputting something like:
cdr_pkey<TAB>id<TAB>g1<TAB>i1<TAB>rating1<TAB>g2<TAB>i2<TAB>rating2 #and so on
17568<TAB>1349<TAB>0.0004532<TAB>0.01320<TAB>2.014E-4<TAB>...#rest of numerical values
17568<TAB>1364<TAB>...#values for id=1364
17568<TAB>1321<TAB>...#values for id=1321
...
17569<TAB>1048<TAB>...#values for id=1048
17569<TAB>1426<TAB>...#values for id=1426
...
...
So my cdr_pkey is the unique column in the sheet, and for each $cdr_pkey, I have multiple $ids, each having their own set of $g1,$i1,$rating1...
After testing, such a format can be read by Excel. Now I just want to extend it to all those 1000s of files.
I am just not sure how to proceed further. What's the next step?

The following bash script does something that might be related to what you want. It is parameterized by what you meant when you said <TAB>. I assume you mean the ascii tab character, but if your logs are so verbose that they spell out <TAB> you will need to modify the variable $WHAT_DID_YOU_MEAN_BY_TAB accordingly. Note that there is very little about this script that does The Right Thing™; it reads the entire file into a string variable, which might not even be possible depending on how big your log files are. On the up side, the script could be easily modified to make two passes, instead, if you think that's better.
#!/bin/bash
WHAT_DID_YOU_MEAN_BY_TAB='\t'
if [[ $# -ne 1 ]] ; then echo "Requires one argument: the file to process" ; exit 1 ; fi
FILENAME="$1"
# Keep only the block between the ===== delimiter lines; the sed '1d' drops
# the opening delimiter and GNU head -n -1 drops the closing one.
RELEVANT=$(sed -n '/^==*$/,/^==*$/p' "$FILENAME" | sed '1d' | head -n '-1')
# Pull the numeric key out of the $insert_vars['cdr_pkey']=NNNNN line.
CDR_PKEY=$(echo "$RELEVANT" | \
grep '$insert_vars\['"'cdr_pkey'\]" | \
sed 's/.*=\(.*\)/\1/')
# Drop the cdr_pkey line and the header line, then prefix every remaining
# row with the key (\0 in GNU sed's replacement is the whole match, like &).
echo "$RELEVANT" | sed '1,2d' | \
sed "s/.*/${CDR_PKEY}$WHAT_DID_YOU_MEAN_BY_TAB\0/"
The following find command is an example use, but your case will depend on how your logs are organized.
find . -name 'LOG_PATTERN' -exec THIS_SCRIPT '{}' \;
Lastly, I have ignored the issue of putting the CSV headers on the output. This is easily done out-of-band.
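For instance (a sketch only; the column names are guessed from the sample above, so adjust them to your real fields):
echo -e "cdr_pkey\tid\tg1\ti1\trating1\tg2\ti2\trating2" > report.csv
find . -name 'LOG_PATTERN' -exec THIS_SCRIPT '{}' \; >> report.csv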
(Edit: updated the script to reflect discussion in the comments.)

EDIT: James tells me that changing the sed in the last echo from 1d to 1,2d and dropping the grep -v 'id' should do the trick.
Confirmed that it works, so I've changed it below. Thanks again to James Wilcox.
Based on @James' script, this is what I came up with; originally I just piped the final echo to grep -v 'id' (that version is kept as a comment below).
#!/bin/bash
WHAT_DID_YOU_MEAN_BY_TAB='\t'
if [[ $# -lt 1 ]] ; then echo "Requires at least one argument: the files to process" ; exit 1 ; fi
echo -e "key\tid\tg1\ti1\td1\tc1\tr1\tg2\ti2\td2\tc2\tr2\tg3\ti3\td3\tc3\tr3"
for i in "$@"
do
FILENAME="$i"
RELEVANT=$(sed -n '/^==*$/,/^==*$/p' "$FILENAME" | sed '1d' | head -n '-1')
CDR_PKEY=$(echo "$RELEVANT" | \
grep '$insert_vars\['"'cdr_pkey'\]" | \
sed 's/.*=\(.*\)/\1/')
echo "$RELEVANT" | sed '1,2d' | \
sed "s/.*/${CDR_PKEY}$WHAT_DID_YOU_MEAN_BY_TAB\0/"
#the one with grep looked like :-
#echo "$RELEVANT" | sed '1d' | \
#sed "s/.*/${CDR_PKEY}$WHAT_DID_YOU_MEAN_BY_TAB\0/" | grep -v 'id'
done
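Assuming the above is saved as, say, process_logs.sh (a name picked just for this example) and made executable, it can be pointed at any number of log files and the output captured in one sheet:
chmod +x process_logs.sh
./process_logs.sh log1.txt log2.txt > summary.csv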


How to find specific text in a text file, and append it to the filename?

I have a collection of plain text files named as yymmdd_nnnnnnnnnn.txt. I want to append another number sequence to each filename, so that it becomes yymmdd_nnnnnnnnnn_iiiiiiiii.txt instead, where the iiiiiiiii is taken from the one line in each file which contains the text "GST: 123456789⏎" (or similar) at the end of the line. While I am sure that there will only be one such matching line within each file, I don't know exactly which line it will be on.
I need an elegant one-liner that I can run over the collection of files in a folder, from a bash script, renaming each file by appending its specific GST number as found within the file itself.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the error "warning: regexp escape sequence \d is not a known regexp operator". I had assumed that this regex would return any number of digits at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also help with the appropriate code to move the collection of files to the new filenames? Thanks.
Thanks to a comment from @Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer towards a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from @Renaud's contribution to rename the files. I'm getting closer to a working solution, thanks!
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
fn = FILENAME
sub(/\.txt$/, "", fn)
print "mv", FILENAME, fn "_" $2 ".txt"
nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
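With the sample file from the question (150101_2224567890.txt, and assuming its matching line starts with the GST: field, as the $1 test above requires), the generated command would be:
mv 150101_2224567890.txt 150101_2224567890_112060340.txt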
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
Drop the echo if you're satisfied with the output. (In the sed expression, /.*GST: /!d deletes every non-matching line, the empty pattern in s/// reuses that same regex to strip everything up to and including "GST: ", and q quits after the first match.)
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
n=$(awk '/GST:/ {print $NF}' "$f")
if [[ -z "$n" ]]; then
printf '%s: GST not found\n' "$f"
continue
fi
mv "$f" "${f%.txt}_$n.txt"
done
Another one-line solution to consider, although perhaps not so elegant.
for original_filename in *_*.txt; do \
new_filename=${original_filename%'.txt'}_$(
grep -E 'GST: ' "$original_filename" | \
sed -E 's/.*GST//g; s/[^0-9]//g'
)'.txt' && \
mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi-line script:
#!/bin/sh
for f in *.txt; do
prefix=$(echo "${f}" | sed s'#\.txt##')
cp "${f}" f1
sed -i s'#GST#%GST#' "./f1"
cat "./f1" | tr '%' '\n' > f2
number=$(cat "./f2" | sed -n '/GST/'p | cut -d':' -f2 | tr -d ' ')
newname="${prefix}_${number}.txt"
mv -v "${f}" "${newname}"
rm -v "./f1"
rm -v "./f2"
done
In general, if you want to make your files easy to work with, then leave as many potential places for them to be split with newlines as possible. It is much easier to alter files by simply being able to put what you want to delete or print on its own line than it is to search for things horizontally with regular expressions.
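As a tiny illustration of that idea, splitting a line on spaces turns a horizontal regex search into a simple vertical lookup (reusing the GST number from the question above):
$ echo "registered for GST: 112060340" | tr ' ' '\n' | grep -A1 '^GST:$' | tail -n 1
112060340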

Extract strings from text files and rename them accordingly in bash

I have a lot of text files named randomly (something like 70000 files); all I know is that somewhere in the first 30 lines there are two lines, one of the form Author: Samuel Richardson and another of the form Title: Clarissa, Volume 5 (of 9). I am not sure of the case of these two lines.
I want to extract the title and the author and rename the file accordingly, something like "Clarissa, Volume 5 (of 9) ,___, Samuel Richardson.txt" (I use ,___, so that there is a valid separator between author and title).
My code is
for filename in *.txt; do
title=$(head -n 30 $filename.txt | grep -i 'Title:' | sed -n 's/^.*Title: //p')
author=$(head -n 30 $filename.txt | grep -i 'Author:' | sed -n 's/^.*Author: //p')
new_name="$title ,___, $author"
mv $filename $new_name.txt
done
It is not working as expected. The snippet
echo "title: $title _"
echo "author: $author _"
new_name="$title ,___, $author"
echo $new_name
prints as output the following
_tle: Clarissa, Volume 5 (of 9)
_thor: Samuel Richardson
,___, Samuel Richardson)
Moreover, I don't know how to save the result of extracting the first 30 lines with the head command into a variable firstlines, so that it is not recomputed.
The code
firstlines=$(head -n 30 randomname.txt)
and the use of title=$($firstlines | grep -i 'Title:' | sed -n 's/^.*Title: //p')
prints out the error command not found.
@Poshi's comment about line endings is correct, and @B.Shefter's answer is on the right track but has a number of problems (unquoted variable references, relying on nonstandard features of echo and sed), so I thought I'd rewrite it with (hopefully) the problems fixed.
Also, I'll repeat the recommendation I gave in a comment: use mv -n or mv -i to avoid overwriting files if anything goes wrong, and make a backup first. (You have a backup anyway, right? You should always have a backup of anything you don't want to lose.)
Anyway, here's my take on it:
#!/bin/bash
for filename in *.txt; do
# Grab the first 30 lines with carriage returns removed:
firstlines=$(head -n 30 "$filename" | tr -d '\r')
# Capture the title and author. Note that sed doesn't have case-insensitive
# patterns, so use e.g. [Tt] to manually make them case-insensitive. Also, use
# [[:blank:]]* to allow any number of spaces and/or tabs after the ":".
title=$(echo "$firstlines" | sed -n 's/^.*[Tt][Ii][Tt][Ll][Ee]:[[:blank:]]*//p')
if [ -z "$title" ]; then
echo "Unable to find Title: in $filename; skipping" >&2
continue
fi
author=$(echo "$firstlines" | sed -n 's/^.*[Aa][Uu][Tt][Hh][Oo][Rr]:[[:blank:]]*//p')
if [ -z "$author" ]; then
echo "Unable to find Author: in $filename; skipping" >&2
continue
fi
new_name="$title ,___, $author.txt"
# Note: the filenames here will contain spaces, so double-quoting is *critical*
mv -i "$filename" "$new_name"
done
@Poshi's right: your main issue is line endings. It looks as if each line ending includes a carriage return (\r). By itself, \r just moves the cursor back to the beginning of the line. When coupled with \n it works fine--because it moves to the beginning of the next line--but by itself it causes what you're seeing: some text, followed by the cursor going back to the beginning of the line, followed by more text overwriting what was there originally.
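You can reproduce the effect directly in a terminal (printf interprets the \r escape):
$ printf 'Title: Clarissa\rXY\n'
XYtle: Clarissa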
EDIT: It would probably help if I included a solution for that. Something like this should work, inserted before the assignment to new_name:
title=$(echo "$title" | sed 's/\r//')
author=$(echo "$author" | sed 's/\r//')
As for your second problem, the reason you're getting command not found is that the first word in the variable $firstlines isn't a command. You want something like:
title=$(echo "$firstlines" | grep -i 'Title:' | sed -n 's/^.*Title: //p')

How to grep two patterns at once and have the result in one string?

I have existing log files that have, among others, the following type of lines:
2018-05-14T10:10:22.769029+03:00 timom usbmonitor: [INFORMATION 6] [FILE: UsbChecker.cpp:51][FUNC: vendorCheck][MSG: USB vendors changed: "0403 14e1 05e3 05e3 03f0 0403 0bda 1d6b 1d6b 1d6b 1d6b 1d6b 1d6b 1d6b" ]
From these files I want to grep lines like the one above so that I get the timestamp from the beginning plus the text inside the quotes, giving a nice and compact output:
2018-05-14T10:10:22.769029+03:00 0403 14e1 05e3 05e3 03f0 0403 0bda 1d6b 1d6b 1d6b 1d6b 1d6b 1d6b 1d6b
Is there a way to do this with a one-liner?
I'm looking for a way to efficiently get the desired output without the need to loop over grepped lines. I have thousands of log files each of which may have hundreds of matches so the grep/sed/whatever needs to be efficient.
So far I've done it like this:
#!/bin/bash
INPUTDIR=
OUTPUTDIR=
while getopts ":h:d:o:" OPTION; do
case $OPTION in
h)
usage
exit 1
;;
d)
INPUTDIR=$OPTARG
;;
o)
OUTPUTDIR=$OPTARG
;;
?)
usage
exit 1
;;
esac
done
if [ -z $INPUTDIR ] || [ -z $OUTPUTDIR ]; then
echo "BAD ARGUMENTS: both directories aren't given" >&2
usage
exit 1
fi
OUTPUTFILE="$(date +%Y%m%d%H%M%S)-usb-analysis-summary"
for i in $( ls $INPUTDIR ); do
# Interesting files are of format <number>_<number>
if [ $(echo "$i" | grep -Ev "^[0-9]+_[0-9]+$") ] ; then
echo "Skipping $i"
continue
fi
grep vendorCheck $INPUTDIR/$i | while read -r l ; do
# We do know timestamp is 32 characters long. GEFN
echo "$l" | sed -r "s|^(.{32}).*changed: \"(.*)\".*|\1 \2|" >>$OUTPUTFILE
done
done
But this is not optimal, as now I'm looping over the files and then looping over the grep matches from each file.
I tried
grep "vendorCheck" $INPUTDIR/$i | sed -r "s|^(.{32}).*changed: \"(.*)\".*|\1 \2|"
But this removes line breaks.
Then if I put multiple patterns in one grep I'm also in trouble with formatting; I need to get the timestamp and the text inside the quotes on one line, and the next similar match on the next line.
Sed can do the line selection, matching, and editing all in one go.
You could also use $(...) to generate sed's input file list, so you really can get it all into one line, I think, but that ls isn't ideal, and you said in a comment that you need the filenames, so...
Rather than
sed -r -n '/vendorCheck/{s/(.{32}).*changed: \"(.*)\"/\1 \2/; p;}' $( ls -1 $INPUTDIR | egrep '^[0-9]+_[0-9]+$' ) >> $OUTPUTFILE
You can embed some whitespace to make it a little less ugly without changing the "one-liner" functionality, and a loop can replace the ls:
for f in $INPUTDIR/[0-9]*_[0-9]* # limit input, not a definitive check
do echo "${f##*/}" | egrep -q '^[0-9]+_[0-9]+$' || continue # CONFIRM filename match (basename only, quietly)
[[ -f $f ]] || continue # and assert file, not dir
sed -r -n "/vendorCheck/{
s/(.{32}).*changed: \"(.*)\"/\1 \2/;
s/^/$f: /;
p;
}" "$f" # the "s/^/$f: /;" is a placeholder of your need for the name
done >> $OUTPUTFILE
NOTE: deleted my test data, so this rework didn't get vetted as carefully. Let me know if anyone sees a typo.
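If GNU awk is available, the whole job can also be done in one pass, with no per-line shell loop. This is a sketch only; match() with a capture array is a gawk extension, and the glob is the same loose filename filter as above:
gawk '/vendorCheck/ && match($0, /changed: "([^"]*)"/, m) {
    print FILENAME ": " substr($0, 1, 32), m[1]
}' "$INPUTDIR"/[0-9]*_[0-9]* >> "$OUTPUTFILE"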

Escape dollar sign in regexp for sed

I will introduce what my question is about before actually asking - feel free to skip this section!
Some background info about my setup
To update files manually in a software system, I am creating a bash script to remove all files that are not present in the new version, using diff:
for i in $(diff -r old new 2>/dev/null | grep "Only in old" | cut -d "/" -f 3- | sed "s/: /\//g"); do echo "rm -f $i" >> REMOVEOLDFILES.sh; done
This works fine. However, apparently my files often have a dollar sign ($) in the filename; this is due to some permutations of the GWT framework. Here is one example line from the bash script created above:
rm -f var/lib/tomcat7/webapps/ROOT/WEB-INF/classes/ExampleFile$3$1$1$1$2$1$1.class
Executing this script would not remove the wanted files, because bash reads these $-sequences as variables. Hence I have to escape the dollar signs as "\$".
My actual question
I now want to add a sed command to the aforementioned pipeline, replacing this dollar sign. As a matter of fact, sed also reads the dollar sign as a special character for regular expressions, so obviously I have to escape it as well.
But somehow this doesn't work and I could not find an explanation after googling a lot.
Here are some variations I have tried:
echo "Bla$bla" | sed "s/\$/2/g" # Output: Bla2
echo "Bla$bla" | sed 's/$$/2/g' # Output: Bla
echo "Bla$bla" | sed 's/\\$/2/g' # Output: Bla
echo "Bla$bla" | sed 's/#"\$"/2/g' # Output: Bla
echo "Bla$bla" | sed 's/\\\$/2/g' # Output: Bla
The desired output in this example should be "Bla2bla".
What am I missing?
I am using GNU sed 4.2.2
EDIT
I just realized that the above example is wrong to begin with: the echo command already interprets the $ as a variable, so the following sed never gets it anyway... Here is a proper example:
Create a textfile test with the content bla$bla
cat test gives bla$bla
cat test | sed "s/$/2/g" gives bla$bla2
cat test | sed "s/\$/2/g" gives bla$bla2
cat test | sed "s/\\$/2/g" gives bla2bla
Hence, the last version is the answer. Remember: when testing, first make sure your test is correct before you question the test object.
The correct way to escape a dollar sign in regular expressions for sed is a double backslash. Then, for creating the escaped version in the output, we need some additional backslashes:
cat filenames.txt | sed "s/\\$/\\\\$/g" > escaped-filenames.txt
Yep, that's four backslashes in a row. This creates the required changes: a filename like bla$1$2.class would then change to bla\$1\$2.class.
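To see it on a single name (using the example class-file name from above):
$ echo 'bla$1$2.class' | sed "s/\\$/\\\\$/g"
bla\$1\$2.class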
This I can then insert into the full pipeline:
for i in $(diff -r old new 2>/dev/null | grep "Only in old" | cut -d "/" -f 3- | sed "s/: /\//g" | sed "s/\\$/\\\\$/g"); do echo "rm -f $i" >> REMOVEOLDFILES.sh; done
Alternative to solve the background problem
chepner posted an alternative to solve the background problem by simply adding single quotes around the filenames in the output. This way, the $-signs are not read as variables by bash when executing the script and the files are also properly removed:
for i in $(diff -r old new 2>/dev/null | grep "Only in old" | cut -d "/" -f 3- | sed "s/: /\//g"); do echo "rm -f '$i'" >> REMOVEOLDFILES.sh; done
(note the changed echo "rm -f '$i'" in that line)
There are other problems with your script, but file names containing $ are not a problem if you properly quote the argument to rm in the resulting script.
echo "rm -f '$i'" >> REMOVEOLDFILES.sh
or using printf, which makes quoting a little nicer and is more portable:
printf "rm -f '%s'\n" "$i" >> REMOVEOLDFILES.sh
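If a filename might itself contain a single quote, this fixed quoting breaks down; bash's printf %q directive quotes arbitrary strings safely:
printf 'rm -f %q\n' "$i" >> REMOVEOLDFILES.sh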
(Note that I'm addressing the real problem, not necessarily the question you asked.)
There is already a nice answer directly in the edited question that helped me a lot - thank you!
I just want to add a bit of curious behavior that I stumbled across: matching against a dollar sign at the end of lines (e.g. when modifying PS1 in your .bashrc file).
As a workaround, I match for additional whitespace.
$ DOLLAR_TERMINATED="123456 $"
$ echo "${DOLLAR_TERMINATED}" | sed -e "s/ \\$/END/"
123456END
$ echo "${DOLLAR_TERMINATED}" | sed -e "s/ \\$$/END/"
sed: -e expression #1, char 13: Invalid back reference
$ echo "${DOLLAR_TERMINATED}" | sed -e "s/ \\$\s*$/END/"
123456END
Explanation to the above, line by line:
Defining DOLLAR_TERMINATED - I want to replace the dollar sign at the end of DOLLAR_TERMINATED with "END"
It works if I don't check for the line ending
It won't work if I match for the line ending as well (adding one more $ on the left side): inside double quotes the shell expands $$ to its own process ID, so sed actually receives something like s/ \12345/END/ and rejects \1 as an invalid back reference
It works if I additionally match for (non-present) whitespace
(My sed version is 4.2.2 from February 2016, bash is version 4.3.48(1)-release (x86_64-pc-linux-gnu), in case that makes any difference)
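For completeness, single quotes sidestep the problem entirely, since the shell then never touches the dollar signs (same test string as above):
$ echo "${DOLLAR_TERMINATED}" | sed -e 's/ \$$/END/'
123456END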

Replace string with another string based on backreference with sed

I'm trying to replace a predefined string %c#, where # can be some number, with another string. The catch is that the other string must be truncated to # characters.
Ideally this set of commands would work:
FORMAT="%c10"
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"
echo $FORMAT | sed "s/%c\([0-9]\+\)/${LAST_COMMIT:0:\1}/g"
but clearly there is a syntax error on the \1. You can replace it with a number to see what I'm trying to get as output.
I'm open to using some program other than sed to achieve this, but ideally it should be a program that is native to most Linux installations.
Thanks!
This is my idea.
echo ${LAST_COMMIT} | head -c $(echo ${FORMAT} | sed -e 's/%c//')
Get the number with sed, then take that many leading characters with head.
EDIT1
This might be better.
echo ${LAST_COMMIT} | head -c $(echo ${FORMAT} | sed -e 's/%c\([0-9]\+\)/\1/')
EDIT2
I turned it into a script because the one-liner is too tough to understand. Please try this.
$ cat sample.sh
#!/bin/bash
FORMAT="%b-%t-%c10-%c5"
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"
## List numbers
lengths=$(echo ${FORMAT} | sed -e "s/%[^c]//g" -e "s/-//g" -e "s/%c/ /g")
## Substitute %cXX to first XX characters of LAST_COMMIT
for n in ${lengths}
do
to_str=$(echo ${LAST_COMMIT:0:${n}})
FORMAT=$(echo ${FORMAT} | sed "s/%c${n}/${to_str}/")
done
## Print result
echo ${FORMAT}
This is the result.
$ ./sample.sh
%b-%t-5189e42b14-5189e
Here it is as one long command (same contents, but much harder to read):
for n in $(echo ${FORMAT} | sed -e "s/%[^c]//g" -e "s/-//g" -e "s/%c/ /g"); do to_str=$(echo ${LAST_COMMIT:0:${n}}); FORMAT=$(echo ${FORMAT} | sed "s/%c${n}/${to_str}/"); done; echo ${FORMAT}
The value of $LAST_COMMIT gets interpolated before sed runs, so there is no backreference to refer back to yet. There is an /e extension in GNU sed which would support something like this, but I would simply use a slightly more capable tool.
perl -e '$fmt = shift; $fmt=~ s/%c(\d+)/%.$1s/g; printf("$fmt\n", @ARGV)' '%c10' "$LAST_COMMIT"
Of course, if you can let go of your own ad-hoc format string specifier, and switch to a printf-compatible format string altogether, just use the printf shell command straight off.
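For example, with the sample commit hash from the question:
$ printf '%.10s\n' "$LAST_COMMIT"
5189e42b14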
Or, extracting the length first and then using bash substring expansion:
length=$(echo $FORMAT | sed "s/%c\([0-9]\+\)/\1/g")
echo "${LAST_COMMIT:0:$length}"
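For example:
$ FORMAT="%c10"
$ length=$(echo $FORMAT | sed "s/%c\([0-9]\+\)/\1/g")
$ echo "${LAST_COMMIT:0:$length}"
5189e42b14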