grep through a file conditionally in both directions - regex

I have a log file written to by several instances of a CGI script. I need to extract certain information, with the following typical workflow:
1. search for the first occurrence of RequestString
2. extract the PID from that log line
3. search backwards for the first occurrence of PID<separator>ConnectionString, to identify the client that initiated the request
4. do something with ConnectionString and repeat the search from after the RequestString
What is the best way to do this? I was thinking of writing a perl script that caches the last N lines, and then matches through those lines to perform step 3.
Is there a better way to do this? Like an extended regex that would do exactly this?
Sample with line numbers for reference -- not part of the file:
1 date pid1 ConnectionString1
2 date pid2 ConnectionString2
3 date pid3 ConnectionString3
4 date pid2 SomeOutput2
5 date pid2 SomeOutput2
6 date pid4 ConnectionString4
7 date pid3 SomeOutput3
8 date pid4 RequestString4
9 date pid1 SomeOutput1
10 date pid1 ConnectionString1
11 date pid1 RequestString1
12 date pid5 RequestString5
When I grep through this sample file, I wish for the following to match:
line 8, paired with line 6
line 11, paired with line 10 (and not with line 1)
Specifically, the following shouldn't be matched:
line 12, because no matching ConnectionString with that pid is found (pid5)
line 1, because there is a new ConnectionString for that pid before the next RequestString for that pid (line 10). (Imagine that the first connection attempt failed before the RequestString was logged.)
any of the lines for pid2/pid3, because they don't have a RequestString logged.
I could imagine writing a regex with the option for . to match \n: ((pid\d)\s*(ConnectionString\d))(?!\1).*\2\s*RequestString\d and then use \3 to identify the client.
However, there are disproportionately more (perhaps between 1000 and 10000 times more) ConnectionStrings than RequestStrings, so my intuition was to first go for the RequestString and then backtrack.
I guess I could play with lookbehind ((?<=...)), but the distance between ConnectionStrings and RequestStrings is essentially arbitrary -- will that work well?

Something along these lines:
#!/bin/bash
# Find and number all RequestStrings, then loop through them
grep -n RequestString file | while IFS=":" read -r n string; do
    echo "$n,$string"                                 # Debug
    head -n "$n" file | tail -r | grep -m1 Connection
done
Output:
2,RequestString 1
6189:Connection
5,RequestString 2
7230:Connection
7,RequestString 3
8280:Connection
with this input file:
6189:Connection
RequestString 1
7229:Connection
7230:Connection
RequestString 2
8280:Connection
RequestString 3
Note: I used tail -r because OS X lacks tac, which I would have preferred.
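A single-pass alternative is also possible and avoids re-reading the file for every match. Here is a sketch in awk (untested, and assuming the timestamp is a single field so the PID is always $2): remember the most recent ConnectionString line per PID, and emit a pair whenever a RequestString for that PID shows up. This automatically drops requests with no connection (pid5) and stale connections (line 1), and since only the latest connection per PID is kept, memory stays bounded by the number of concurrent PIDs even with 1000-10000 times more ConnectionStrings.

awk '
    /ConnectionString/ { conn[$2] = NR ": " $0 }   # latest connection per PID wins
    /RequestString/ {
        if ($2 in conn) {                          # skip PIDs with no open connection
            print NR ": " $0                       # the request line...
            print conn[$2]                         # ...paired with its connection line
            delete conn[$2]                        # one connection serves one request
        }
    }
' file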

Related

regex | List of results from grep

The following grep command gives me the number of requests from July 1st to July 31st between 8 a.m. and 4 p.m.
zgrep -E "[01\-31]/Jul/2021:[08\-16]" localhost_access.log* | wc -l
I don't want the total for the whole month, but rather the number of requests per day. I could of course run the command 31 times, but that's tedious. Is there a way to display the requests per day one below the other (ideally sorted by number), so that I get something like the following as a result
543
432
321
etc.
How to do that?
You want to count lines based on a certain value in a line. That's a good job for awk. With grep alone, you would have to process the input files once per day. In any case, we need to fix your regex first:
zgrep -E "[01\-31]/Jul/2021:[08\-16]" localhost_access.log* | wc -l
[08\-16] matches the single characters 0, 8, -, 1 and 6. What you want to match is (0[89]|1[0-6]); that is, 0 followed by 8 or 9, or 1 followed by a digit in the range 0-6. To make it easier, we assume zero-padded days in the date and therefore match the day with [0-9]{2} (two digits).
Here's a complete awk for your task:
awk -F/ '/[0-9]{2}\/Jul\/2021:(0[89]|1[0-6])/{a[$1]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log*
Explanation:
/[0-9]{2}\/Jul\/2021:(0[89]|1[0-6])/ matches date + time for every day (hours 08-16) in July; the alternation is kept inside one group so that it applies only to the hour, rather than letting 1[0-6] match anywhere in the line
{a[$1]++} builds an array with key=day and a counter of occurrences.
END{for (i in a) print "day " i ": " a[i]} prints the array once all input files have been processed
Because the field separator is /, a[$1] addresses the day only when the date starts the line; if there are more slashes before the actual date (say two, as in a path), change a[$1] accordingly (here: a[$3]). (Of course this can be solved in a more dynamic way; see the sketch below.)
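For instance, a sketch in plain POSIX awk (untested) that extracts the day from the matched text with match() and RSTART instead of relying on the slash-separated field position:

awk 'match($0, /[0-9][0-9]\/Jul\/2021:(0[89]|1[0-6])/) {
         day = substr($0, RSTART, 2)   # first two characters of the match
         a[day]++
     }
     END { for (i in a) print "day " i ": " a[i] }' localhost_access.log*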
Example:
$ cat localhost_access.log
01/Jul/2021:08 log message
01/Jul/2021:08 log message
02/Jul/2021:08 log message
02/Jul/2021:07 log message
$ awk -F/ '/[0-9]{2}\/Jul\/2021:(0[89]|1[0-6])/{a[$1]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log*
day 01: 2
day 02: 1
Run zcat ... | awk ... in case your log files are compressed, but remember the regex above matches "Jul/2021" only.
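You also mentioned sorting by number; since the counter is the last colon-separated field of each output line, piping through sort does it. For example:

awk -F/ '/[0-9]{2}\/Jul\/2021:(0[89]|1[0-6])/{a[$1]++}
         END{for (i in a) print "day " i ": " a[i]}' localhost_access.log* |
    sort -t: -k2 -rn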

How to add a prefix to all lines that don't start with one of multiple words

I am trying to add a prefix to all the lines in a file that don't start with one of multiple words using sed.
Example :
someText
sleep 1
anotherString
sleep 1
for i in {1..50}
do
command
sleep 1
secondCommand
sleep 1
done
Should become
PREFIX_someText
sleep 1
PREFIX_anotherString
sleep 1
for i in {1..50}
do
PREFIX_command
sleep 1
PREFIX_secondCommand
sleep 1
done
I am able to exclude any line starting with a single pattern word (i.e. sleep, for, do, done), but I don't know how to exclude all lines starting with any of multiple patterns.
Currently I use the following command :
sed -i '/^sleep/! s/^/PREFIX_/'
Which works fine on all the lines starting with sleep.
I imagine there is some way to combine pattern words, but I can't seem to find a solution.
Something like this (which obviously doesn't work) :
sed -i '/[^sleep;^for;^do]/! s/^/PREFIX_/'
Any help would be greatly appreciated.
Use alternation with multiple words for negation:
sed -i -E '/^(sleep|for|do)/! s/^/PREFIX_/' file
PREFIX_someText
sleep 1
PREFIX_anotherString
sleep 1
for i in {1..50}
do
PREFIX_command
sleep 1
PREFIX_secondCommand
sleep 1
done
/^(sleep|for|do)/! selects all lines except those that start with the words sleep, for or do (done is covered as well, because it starts with do).
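One caveat: ^(sleep|for|do) also skips any line that merely begins with those letters -- done is wanted here, but a hypothetical line starting with, say, dotest would be skipped too. If that matters, require whitespace or end-of-line after the keyword, along these lines (untested):

sed -i -E '/^(sleep|for|do|done)([[:space:]]|$)/! s/^/PREFIX_/' file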
I like awk.
awk '/^(sleep|for|do)/ { print; next } { print "PREFIX_" $0 }' filename

AWK: Pattern match multiline data with variable line number

I am trying to write a script which will analyze data from a pipe. The problem is, a single element is described in a variable number of lines. Look at the example data set:
3 14 -30.48 17.23
4 1 -18.01 12.69
4 3 -11.01 2.69
8 12 -21.14 -8.76
8 14 -18.01 -5.69
8 12 -35.14 -1.76
9 2 -1.01 22.69
10 1 -88.88 17.28
10 1 -.88 14.28
10 1 5.88 1.28
10 1 -8.88 -7.28
In this case, the first column is what defines the event to which the data on that line belongs. In the case of event number 8, we have data on 3 lines. To simplify the rather complex problem that I am trying to solve, let us imagine that I want to calculate the following expression:
sum_i($2 * ($3 + $4))
Where i is taken over all lines belonging to a given element. The output I want to produce would then look like:
3=-185.5 [14(-30.48+17.23) ]
4=-30.28 [1(-18.01+12.69) + 3(-11.01+2.69)]
8=-1106.4 [...]
I thus need a script which reads all the lines that have the same index entry.
I am an AWK newbie and I've started learning the language a couple of days ago. I am now uncertain whether I will be able to achieve what I want. Therefore:
Is this doable with AWK?
If not, with what? sed?
If yes, how? I would be grateful if one provided a link describing how this can be implemented.
Finally, I know that there is a similar question: Can awk patterns match multiple lines? However, I do not have a constant pattern that separates my data.
Thanks!
You could try this:
awk '{ ar[$1] += $2 * ($3 + $4) }
     END { for (key in ar)
               print key "=" ar[key] }' inputFile
For each input line we do the desired calculation and sum the result in an array; $1 serves as the key of the array.
When the entire file is read, we print the results in the END{...}-block.
The output for the given sample input is:
4=-30.28
8=-1133.4
9=43.36
10=-67.2
3=-185.5
If sorting of the output is required, you might want to have a look at gawk's asorti function or the sort command (e.g. awk '{...}' inputFile | sort -n); a gawk sketch follows below.
This solution does not require that the input is sorted.
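asorti is one route; a simpler gawk-only sketch (gawk 4.0+, untested) uses PROCINFO["sorted_in"], which makes the for (key in ar) loop traverse the keys in numeric order, so no external sort is needed:

gawk '{ ar[$1] += $2 * ($3 + $4) }
      END { PROCINFO["sorted_in"] = "@ind_num_asc"   # iterate keys numerically
            for (key in ar) print key "=" ar[key] }' inputFile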
This streaming one-liner assumes that lines with the same index are contiguous (as they are in the sample):
awk 'id!=$1{if(id){print id"="sum; sum=0}; id=$1} {sum+=$2*($3+$4)} END{print id"="sum}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
Yet another similar awk:
$ awk -v OFS="=" 'NR==1{p=$1}
p!=$1{print p,s; s=0; p=$1}
{s+=$2*($3+$4)}
END{print p,s}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
PS: Your expected value for "8" seems off -- the sample data gives 12(-21.14-8.76) + 14(-18.01-5.69) + 12(-35.14-1.76) = -1133.4, not -1106.4.

Delete Specific Lines with AWK [or sed, grep, whatever]

Is it possible to remove lines from a file using awk? I'd like to find any lines that have Y in the last column and then remove any lines that match the value in column 2 of said line.
Before:
KEY1,TRACKINGKEY1,TRACKINGNUMBER1-1,PACKAGENUM1-1,N
,TRACKINGKEY1,TRACKINGNUMBER1-2,PACKAGENUM1-2,N
KEY1,TRACKINGKEY1,TRACKINGNUMBER1-1,PACKAGENUM1-1,Y
,TRACKINGKEY1,TRACKINGNUMBER1-2,PACKAGENUM1-2,Y
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
So awk would find that row 3 has Y in the last column, then look at column 2 [TRACKINGKEY1] and remove all lines that have TRACKINGKEY1 in column 2.
Expected result:
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
The reason for this is that our shipping program puts out a file whenever a shipment is processed, as well as when that shipment gets voided [in case of an error]. So what I end up with is the initial package info, then the same info indicating that it was voided, then yet another set of lines with the new shipment info. Unfortunately our ERP software has a fairly simple scripting language in which I can't even make an array so I'm limited to shell tools.
Thanks in advance!
One way is to make two passes over the same file with awk:
awk -F, 'NR == FNR && $NF=="Y" && !($2 in seen){seen[$2]}
NR != FNR && !($2 in seen)' file file
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
Explanation:
NR == FNR            # if processing the file the 1st time
&& $NF == "Y"        # and the last field is Y
&& !($2 in seen) {   # and we haven't seen field 2 before
    seen[$2]         # store field 2 as a key in array seen
}
NR != FNR            # when processing the file the 2nd time
&& !($2 in seen)     # and array seen doesn't have field 2:
                     # take the default action and print the line
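If the file fits in memory, a single pass also works: buffer the lines, collect the voided tracking keys along the way, and filter in the END block. A sketch (untested):

awk -F, '$NF == "Y" { bad[$2] }              # remember voided tracking keys
         { line[NR] = $0; key[NR] = $2 }     # buffer every line and its key
         END { for (i = 1; i <= NR; i++)
                   if (!(key[i] in bad)) print line[i] }' file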
This solution is kind of gross, but kind of fun.
grep ',Y$' file | cut -d, -f2 | sort -u | grep -vwFf - file
grep ',Y$' file -- find the lines with Y in the last column
cut -d, -f2 -- print just the tracking key from those lines
sort -u -- give just the unique keys
grep -vwFf - file --
read the unique tracking keys from stdin (-f -)
only consider them a match if they are whole words (-w)
they are fixed strings, not regular expressions (-F)
then exclude lines matching these patterns (-v) from file

bash: Batch reformatting using sed + date?

I have a bunch of data that looks like this:
"2004-03-23 20:11:55" 3 3 1
"2004-03-23 20:12:20" 1 1 1
"2004-03-31 02:20:04" 15 15 1
"2004-04-07 14:33:48" 141 141 1
"2004-04-15 02:08:31" 2 2 1
"2004-04-15 07:56:01" 1 2 1
"2004-04-16 12:41:22" 4 4 1
and I need to feed this data to a program which only accepts time in UNIX (Epoch) format. Is there a way I can change all the dates in bash? My first instinct tells me to do something like this:
sed 's/"(.*)"/`date -jf "%Y-%m-%d %T" "\1" "+%s"`'
But I am not entirely sure that the \1 inside the date call will properly backreference the regex matched by sed. In fact, when I run this, I get the following response:
sed: 1: "s/(".*")/`date -jf "% ...": unterminated substitute in regular expression
Can anyone guide me in the right direction on this? Thank you.
Nothing is going to be expanded between single quotes. And even with double quotes it wouldn't work: the shell performs the command substitution before sed ever sees the line, so \1 inside the date call would be the literal string \1, not the text matched by sed. How about something like this (untested):
while read -r date time a b c
do
    # ${date:1} strips the leading quote, ${time::-1} the trailing one;
    # GNU date shown (on macOS/BSD: date -jf "%Y-%m-%d %T" "${date:1} ${time::-1}" "+%s")
    printf "%s %s %s %s\n" "$(date --date "${date:1} ${time::-1}" "+%s")" "$a" "$b" "$c"
done < file
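If GNU awk is available, its mktime() avoids forking date once per line. A sketch (untested), assuming the quoted timestamp is always the first two fields:

gawk '{
    ts = $1 " " $2            # e.g. "2004-03-23 20:11:55", quotes included
    gsub(/[":-]/, " ", ts)    # mktime wants "YYYY MM DD HH MM SS"
    print mktime(ts), $3, $4, $5
}' file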