awk regex: difference in behavior when using it with or without variables

I have an awk script that behaves differently when I put a regular expression in different places. I intended the logic of the program to work the same in both cases, but it does not. The script analyzes logs in which each transaction has a unique ID. The log looks like
timestamp (ID) more info
for example:
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with real information and a key string that determines the type of thransaction
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with other information
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with more information
2014-10-06 05:24:40,035 INFO (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction
What I want is to process all the log lines of a certain type of transaction to see how much time they take. Each transaction is spread across several log lines and is identified by its unique ID. To know whether a transaction is of the type I want, I have to search for a certain string in the first line of that transaction. The log may also contain lines that do not follow the above format.
What do I want:
Distinguish if the current line is part of a transaction (it has an ID)
Check if the ID is already registered in a cumulative array.
If not, check if it is of the desired type: search for a fixed string in the body of the line.
If it is, register the timestamp, and blah blah
And here is the code (note this is a very minified version).
This is what I would like to use: first check whether it is a transaction line, and then check whether it is of the correct type.
awk '$4 ~ /^\([:alnum:]/
{
name=$4;gsub(/[()]|:.*/,"",name);++matched
if(!(name in arr)){
if($0 ~ /transaction type/){arr[name]=1;print name}}
}END
{
print "Found :"length(arr)
print "Processed "NR
print matched" lines matched the filter"
}'
That script only finds 868 transactions, but there are more than 14K. If I change the script to look like the code below, it finds all 14K transactions, but only the first line of each of them, so it is not useful for me.
awk '/transaction type/
{
name=$4;gsub(/[()]|:.*/,"",name);++matched
if(!(name in arr)){
arr[name]=1;print name
}
}END
{
print "Found :"length(arr)
print "Processed "NR
print matched" lines matched the filter"
}'
Thanks in advance.
Edit
Shame on me. There was more than one actual problem in this topic.
The main one was that the regex was not matching the proper string. The ID string and the transaction-type string were on the same line, that is true, but on those lines the ID looked like (aaaaaabbbbbcccc:  ), with two spaces at the end. That makes awk parse
"(aaaaaaaabbbbcccc:" and ")" as two different fields. I realized this when I ran
$4 !~ /regex/ {print $4}
and a lot of valid IDs appeared.
The second problem, which appeared after fixing the regular expression, has been addressed by some people here. Having the main regular expression and the first { on separate lines makes awk print each record. I realized that myself, and later the same day I read the solutions here. Amazing.
Thank you very much to everyone. I can only accept one answer as valid, but I learned a lot from all of them.
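Putting both fixes together (pattern and opening { on the same line, and the character class double-bracketed), here is a minimal sketch against hypothetical sample lines shaped like the ones described in the edit; the ID value and "transaction type" marker are made up for illustration:

```shell
# Hypothetical sample: one transaction over two lines, ID field like "(4aaab:  )"
printf '%s\n' \
  '2014-10-06 05:24:40,035 INFO (4aaab:  ) [s] body with transaction type marker' \
  '2014-10-06 05:24:41,035 INFO (4aaab:  ) [s] body with more information' |
awk '$4 ~ /^\([[:alnum:]]/ {            # pattern and { on the SAME line
  name = $4; gsub(/[()]|:.*/, "", name); ++matched
  if (!(name in arr) && $0 ~ /transaction type/) { arr[name] = 1; ++found; print name }
}
END { print "Found: " found + 0; print matched " lines matched the filter" }'
# prints:
# 4aaab
# Found: 1
# 2 lines matched the filter
```

A plain counter (`found`) is used instead of `length(arr)` because taking the length of an array is not portable across all awk implementations.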

Whitespace matters in awk. This:
/foo/ {
print "found"
}
means print "found" every time "foo" is present, while this:
/foo/
{
print "found"
}
means print the current record every time "foo" is present, and print "found" for every single input record. So chances are that when you wrote:
$4 ~ /^\([:alnum:]/
{
....
}
you actually meant to write:
$4 ~ /^\([:alnum:]/ {
....
}
also, chances are you meant to use the POSIX character class [[:alnum:]] instead of the set of characters : a l n u m described by the bracket expression [:alnum:]:
$4 ~ /^\([[:alnum:]]/ {
....
}
If you fix those things and still need help, provide some testable sample input and expected output so we can help you more.
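A quick way to convince yourself of the difference, with any POSIX awk and a two-line input:

```shell
# Pattern and action on one line: the action runs only for matching records.
printf 'bar\nfoo\n' | awk '/foo/ { print "found" }'
# prints: found

# Pattern alone on its own line: it defaults to { print $0 } on a match,
# and the bare block below it runs for EVERY record.
printf 'bar\nfoo\n' | awk '/foo/
{ print "found" }'
# prints: found, foo, found (one "found" per record, plus the matched record)
```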

It's only a syntax error. When you use a POSIX character class you must enclose it in an extra pair of square brackets:
[[:alnum:]]
Otherwise [:alnum:] is seen as a bracket expression that contains the characters : a l n u m
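To see the difference concretely (note that some awks print a warning to stderr for the single-bracket form):

```shell
# [[:alnum:]] is the POSIX class: any letter or digit, so "(4abc)" matches.
echo '(4abc)' | awk '$1 ~ /^\([[:alnum:]]/ { print "matched" }'
# prints: matched

# [:alnum:] is just the set of characters : a l n u m, so "(4abc)" does not
# match, but a string starting with "(n" would.
echo '(4abc)' | awk '$1 ~ /^\([:alnum:]/ { print "matched" }'
# prints nothing
echo '(nope)' | awk '$1 ~ /^\([:alnum:]/ { print "matched" }'
# prints: matched
```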

So, in brief, if I understood properly, you wish to get the IDs of a certain type of transaction.
First assumption: the ID and the transaction type are on the same line. Something like this should do (largely adapted from your code):
awk 'BEGIN {
matched=0 # more for clarity than really needed
}
/\([[:alnum:]]*\).*transaction type/ { # only lines matching an id and the transaction type
gsub(/[()]/,"",$4) # strip the () around the id
++matched # count the matched lines, including repeated ids
if (!($4 in arr)) { # as in yours: if the id is not in the array
arr[$4]=1 # add the found id to the array so it is not included twice
print $4 # print the found id (only once, as we're inside the if)
}
}
END { # nothing changed here, printing the stats...
print "Found :"length(arr)
print "Processed "NR
print matched" lines matched the filter"
}'
Output of this from your sample input:
prompt=> awk 'BEGIN { matched=0}; / \([a-z0-9]*\) / { gsub(/[()]/,"",$4); ++matched; if (!($4 in arr)) { arr[$4]=1; print $4 }}; END { print "Found: "length(arr)"\nProcessed "NR"\n"matched" lines matched the filter" }' awkinput
4aaaaaaaaabbbbbbcccb
4xxbbbbbbbbbbbbbcccb
Found: 2
Processed 4
4 lines matched the filter
I've omitted the transaction type in the test since I have no clue what it may be.


How to check multiline text with regex after first match and only before the second one

I'm trying to find a particular log entry in a pretty big log file (let's say 250 MB).
Every single log entry starts with
YYYY-MM-DD time:
Next comes the one-line or multiline text that I want to match,
and it finally ends with a newline and a new DateTime pattern.
The question is how to match the text inside a log entry when it is multiline, and only up to the next entry.
The order of the matched values is unknown, as are the lines they appear on.
I have tried the following solution:
grep -Pzio '^(\d{4}-\d{2}-\d{2} timePattern)(?=[\s\S]*?Value1)(?=[\s\S]*?Value2)(?=[\s\S]*?Value3)[\s\S]*?(?=\n\1|\Z)' file.log
But it exceeds the PCRE backtracking limit even with the ungreedy [\s\S]*?, or it simply starts at an earlier unmatched entry and swallows lots of other entries in [\s\S]* before it finally finds all three values, giving me back a huge chunk of text.
So I think the only difficulty here is the multiline matching.
Will appreciate any help!
EDIT 0: I need to find only one log entry, the one that has all the values I'm trying to match.
EDIT 1: Example
2018-02-09 03:52:46,347 Activity=SomeAct
#Request=<S:Body><S:RQ><S:Info><S:Key><S:First>Value1</S:First><S:Second>Value2</S:Second></S:Key></S:Info></S:RQ></S:Body>
#Response=<SOAP-ENV:Body><S:RS><S:StatusCode>FAILURE</S:StatusCode></S:RS></SOAP-ENV:Body>
2018-02-09 03:52:51,377 Activity=SomeAct
#Request=<S:Body><S:RQ><S:Info><S:Key><S:First>Value1</S:First><S:Second>Value2</S:Second></S:Key></S:Info></S:RQ></S:Body>
#Response=<SOAP-ENV:Body><S:RS><S:StatusCode>SUCCESSFUL</S:StatusCode></S:RS></SOAP-ENV:Body>
2018-02-09 03:52:52,112 Activity=SomeAct
#Response=<SOAP-ENV:Body><S:RS><S:StatusCode>FAILURE</S:StatusCode></S:RS></SOAP-ENV:Body>
#Request=<S:Body><S:RQ><S:Info><S:Key><S:First>Value1</S:First><S:Second>Value3</S:Second></S:Key></S:Info></S:RQ></S:Body>
I need to get only the record with Value1 and Value2 in SUCCESSFUL status. BUT it is not necessary that the response comes after the request, or that <S:First> comes before <S:Second>, or that the RS/RQ parts are single lines.
It's not really clear what you want to find, but a common approach is to use awk with a custom record separator so that a record can span multiple lines. Or you can collect the records manually:
awk '/^YYYY-MM-DD time: / { if (seen1 && seen2 && seen3) print rec;
seen1 = seen2 = seen3 = 0; rec = "" }
{ rec = (rec ? rec "\n" $0 : $0) }
/Value1/ { seen1++ }
/Value2/ { seen2++ }
/Value3/ { seen3++ }
END { if (seen1 && seen2 && seen3) print rec }' file
This collects into rec the lines seen since the previous separator; when we see a new separator, we print the previous value of rec before starting over, but only if all the "seen" flags are set, indicating that the text in the current rec matched all the regexes.
A common omission is forgetting to also do this check in the END block for the last record, when we reach the end of the file.
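A runnable sketch of the same idea against a hypothetical mini-log (timestamps simplified, and Value1..Value3 standing in for the real patterns):

```shell
# Two multiline records; only the second one contains all three values.
printf '%s\n' \
  '2018-02-09 03:52:46 entry one' 'Value1' 'Value2' \
  '2018-02-09 03:52:51 entry two' 'Value1' 'Value2' 'Value3' > mini.log

awk '/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
  if (seen1 && seen2 && seen3) print rec   # flush previous record if it matched everything
  seen1 = seen2 = seen3 = 0; rec = ""
}
{ rec = (rec ? rec "\n" $0 : $0) }
/Value1/ { seen1++ }
/Value2/ { seen2++ }
/Value3/ { seen3++ }
END { if (seen1 && seen2 && seen3) print rec }' mini.log
# prints the second record only, all four of its lines
```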

BASH: Split strings without any delimiter and keep only first sub-string

I have a CSV file containing 7 columns and I am interested in modifying only the first column. In some of the rows, a row name appears n times, concatenated without any space. I need a script that can identify where the duplication starts and remove all the duplicates.
Example of a row name among others:
Row name = EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4
Replace by: EXAMPLE1.ABC_DEF.panel4
In the different rows:
n can vary
The length of the row name can vary
The structure of the row name can vary (e.g. the number of _ and .), but it is always concatenated without any space
What I have tried:
:%s/(.+)\1+/\1/
Step-by-step:
%s: substitute in the whole file
(.+): First capturing group. .+ matches any character (except line terminators); + is a quantifier matching between one and unlimited times, as many times as possible, giving back as needed.
\1+: matches the same text as most recently matched by the 1st capturing group
Substitute by \1
However, I get the following errors:
E65: Illegal back reference
E476: Invalid command
From what I understand you need only one line containing EXAMPLE1.ABC_DEF.panel4. In that case you can do the following:
First remove the duplicates within each line:
sed -i "s/EXAMPLE1\.ABC_DEF\.panel4.*/EXAMPLE1.ABC_DEF.panel4/g"
Then remove duplicated lines:
awk '!a[$0]++'
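The !a[$0]++ idiom keeps the first occurrence of each line and drops later duplicates: a[$0] is 0 the first time a line is seen, so !a[$0] is true and the line is printed (the default action), while the post-increment makes every later occurrence false.

```shell
printf 'x\ny\nx\nz\ny\n' | awk '!a[$0]++'
# prints: x, y, z (each line once, in first-seen order)
```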
If all your rows are of the format you gave in the question (like EXAMPLExyzEXAMPLExyz) then this should work-
awk -F"EXAMPLE" '{print FS $2}' file
This takes "EXAMPLE" as the field delimiter and asks it to print only the first 'column'. It prepends "EXAMPLE" to this first column (by using the built-in awk variable FS). Thanks, @andlrc.
Not an ideal solution but may be good enough for this purpose.
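For instance, with the row name from the question duplicated three times:

```shell
# Splitting on "EXAMPLE" gives an empty $1, then one repeat per field;
# printing FS followed by $2 reconstructs a single copy.
echo 'EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4' |
  awk -F"EXAMPLE" '{ print FS $2 }'
# prints: EXAMPLE1.ABC_DEF.panel4
```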
This script, whose first argument is the string to test, retrieves the longest duplicated substring (i.e. for "totototo" it finds "toto", not "to"):
#!/usr/bin/env bash
row_name="$1"
# Test duplicates from the longest unit to the smallest: into how many parts do we split the string?
for (( i=2; i<${#row_name}; i++ ))
do
    match="True"
    # Continue only if the length is evenly divisible
    if (( ${#row_name} % i )); then
        continue
    fi
    # Length of the potential duplicated substring
    len_sub=$(( ${#row_name} / i ))
    # Test whether the first substring is equal to each of the others
    for (( s=1; s<i; s++ ))
    do
        if ! [ "${row_name:0:${len_sub}}" = "${row_name:$((len_sub * s)):${len_sub}}" ]; then
            match="False"
            break
        fi
    done
    # All substrings are equal, so keep the string without the duplicates
    if [ $match = "True" ]; then
        row_name="${row_name:0:${len_sub}}"
        break
    fi
done
echo "$row_name"
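Wrapped as a function for a quick check (bash-specific, same logic as the script above):

```shell
dedup() {
  local row_name="$1" i s len_sub match
  for (( i=2; i<${#row_name}; i++ )); do
    match="True"
    (( ${#row_name} % i )) && continue          # skip lengths that don't divide evenly
    len_sub=$(( ${#row_name} / i ))
    for (( s=1; s<i; s++ )); do
      if [ "${row_name:0:len_sub}" != "${row_name:$((len_sub * s)):len_sub}" ]; then
        match="False"; break
      fi
    done
    [ "$match" = "True" ] && { row_name="${row_name:0:len_sub}"; break; }
  done
  printf '%s\n' "$row_name"
}

dedup 'totototo'
# prints: toto
dedup 'EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4'
# prints: EXAMPLE1.ABC_DEF.panel4
```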

Stopping the regex at the first match; it matches two times

I am writing a perl script and I have a simple regex to capture a line from a data file. That line starts with IG-XL Version:, followed by the data, so my regex matches that line.
if($row =~/IG-XL Version:\s(.*)\;/)
{
print $1, "\n";
}
Let's say $1 prints out 9.0.0. That's my desired outcome. However, another part of the same data file has the same IG-XL Version: line, so $1 now prints the data twice.
I only want to match the first occurrence so that I get a single value. I have tried /IG-XL Version:\s(.*?)\;/, the most commonly suggested solution, making the match non-greedy with .*?, but it still outputs two values. Any help?
EDIT:
The value of $row is:
Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31
Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31
The desired value I want is 8.00.01_uflx (P7), which I did get, but two times.
The only way to do this while reading the file line by line is to keep a status flag that records whether you have already found the pattern. But if you are storing the data in a hash, as you were in your previous question, then it won't matter, as you will just overwrite the hash element with the same value:
if ( $row =~ /IG-XL Version:\s*([^;]+)/ and not $seen_igxl_vn ) {
print $1, "\n";
$seen_igxl_vn = 1;
}
Or, if the file is reasonably small, you could read the whole thing into memory and search for just the first occurrence of each item
I suggest you post a question showing your complete program, your input data, and your required output, so that we can give you a complete solution rather than seeing your problem bit by bit.

Perl, regular expression, matching exactly 2 spaces does not work

I'm working on a parser for STA/SSTA timing reports. The following cases of an "Arrival Time" line are possible:
Arrival Time 3373.000
- Arrival Time 638.700 | 100.404
Arrival Time Report
The goal is to match the 1st and 2nd cases, but ignore the 3rd.
I tried two matching patterns in my Perl code:
1) if (m/^-?\s{1,2}Arrival\sTime/) { ($STA_DATA{$file}{$path}{Arrival_Time}) = m/\sArrival\sTime\s+(.*)\s+$/ }
2) if (m/^-\sArrival\sTime/ || m/^\s{1,2}Arrival\sTime/) { ($STA_DATA{$file}{$path}{Arrival_Time}) = m/\sArrival\sTime\s+(.*)\s+$/ }
Both of them pick up the 3rd case as well, and I do not understand why. I specifically required one or two space characters with \s{1,2}, no more than that. As the 3rd line starts with more than two whitespace characters, it should not match the pattern. How is this possible?
The data you have published is not the same as the data you used in your test.
This program checks both of your regex patterns against the data copied directly from an edit of your original post. Neither pattern matches any of the lines in your data:
use strict;
use warnings;
use 5.010;
my (%STA_DATA, $file, $path);
while ( <DATA> ) {
if ( /^-?\s{1,2}Arrival\sTime/ ) {
say 'match1';
$STA_DATA{$file}{$path}{Arrival_Time} = m/\sArrival\sTime\s+(.*)\s+$/
}
if ( /^-\sArrival\sTime/ or m/^\s{1,2}Arrival\sTime/ ) {
say 'match2';
$STA_DATA{$file}{$path}{Arrival_Time} = m/\sArrival\sTime\s+(.*)\s+$/
}
}
__DATA__
Arrival Time 3373.000
- Arrival Time 638.700 | 100.404
Arrival Time Report
Here is a possible workaround you can try:
if (m/^-?\s{1,2}Arrival\sTime\s{2,}/) { ($STA_DATA{$file}{$path}{Arrival_Time}) = m/\sArrival\sTime\s+(.*)\s+$/ }
You can match the string "Arrival Time" followed by two or more spaces, ruling out the string "Arrival Time Report".
Can you confirm your regex is inside a loop reading the input line by line?
If $_ contains the whole text, your observation would be expected, because you anchored the extracting regex to the end of the text with $.
It may help to replace the spaces in your data with Unicode U+2423 OPEN BOX, which is commonly used to signify a space with a visible character:
␣␣␣␣␣␣Arrival␣Time␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣3373.000
␣␣␣␣-␣Arrival␣Time␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣638.700␣|␣100.404
␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣Arrival␣Time␣Report␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣
As rightfully requested by Borodin, for the benefit of others I'm going to explain the mistake I made and show the solution.
The mistake I made is the following:
I wrongly assumed that my matching pattern was being applied to the text as seen in the .rpt file.
The three cases (relevant to my matching pattern) that can occur in such a file are the following:
      Arrival Time                3373.000
    - Arrival Time                          638.700 | 100.404
                                 Arrival Time Report
But I had forgotten that earlier in the code I had implemented the following line:
s/->//g; s/\s\S+\s[v\^]\s//g; s/\s+/ /g;
It is the last substitution in this series that collapses all whitespace runs, changing the original text into:
Arrival Time 3373.000
- Arrival Time 638.700 | 100.404
Arrival Time Report
Therefore my matching patterns (presented in the question above) did not work.
Knowing this, the solution is simple. I adjusted the matching pattern as follows:
if (m/^\-?\sArrival\sTime\s\d+/) { ($STA_DATA{$file}{$path}{Arrival_Time}) = m/\sArrival\sTime\s(.*)\s?$/ }
I appreciate all the help and feedback received, and I am truly sorry for wasting everyone's time with this ill-defined problem.

Parsing log files

I'm trying to write a script to simplify the process of searching through a particular application's log files for specific information. I thought maybe there's a way to convert them into an XML tree, and I'm off to a decent start... but the problem is, the application log files are an absolute mess, if you ask me.
Some entries are simple:
2014/04/09 11:27:03 INFO Some.code.function - Doing stuff
Ideally I'd like to turn the above into something like this:
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
</Message>
Other entries are something like this, where there's additional information and line breaks:
2014/04/09 11:27:04 INFO Some.code.function - Something happens
changes:
this stuff happened
I'd like to turn this last chunk into something like the above, but add the additional info into a details section:
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
<details>changes:
this stuff happened</details>
</Message>
and then other messages and errors will be in the form of:
2014/04/09 11:27:03 ERROR Some.code.function - Something didn't work right
Log Entry: LONGARSEDGUID
Error Code: E3145
Application: Name
Details:
message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry
This last chunk I'd like to convert like the two examples above, but adding XML nodes for the log entry, error code, application and, again, details, like so:
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>ERROR </Type>
<Source>Some.code.function</Source>
<Sub>Something didn't work right</Sub>
<Entry>LONGARSEDGUID</Entry>
<Code>E3145</Code>
<Application>Name</Application>
<details>message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry</details>
</Message>
Now, I know that Select-String has a context option which would let me select a number of lines after the line I've filtered; the problem is, this isn't a constant number.
I'm thinking a regular expression would also allow me to select the paragraph chunk before the next date string, but regular expressions are not a strong point of mine, and I thought there might be a better way, because the one constant is that new entries start with a date string.
The idea is to break these up into XML or tables of sorts, and from there I'm hoping it might make the task of filtering out non-relevant or recurring messages a little easier.
I have a sample I just tossed on Pastebin after removing/replacing a few bits of information for privacy reasons:
http://pastebin.com/raw.php?i=M9iShyT2
Sorry this is kind of late, I got tied up with work for a bit there (darn work, expecting me to be productive while on their dime). I ended up with something similar to Ansgar Wiechers' solution, but formatted things into objects and collected those into an array. It doesn't produce the XML that you added later, but it gives you a nice array of objects to work with for the records. I'll explain the main RegEx line here and comment in-line where it's practical.
'(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$' is the RegEx that detects the start of a new record. I started to explain it, but there are probably better resources for you to learn RegEx than me explaining it to you. See this RegEx101.com link for a full breakdown and examples.
$Records=@() #Create an empty array that we will populate with custom objects later
$Event = $Null #make sure there's nothing in $Event, to give the script a clean start
Get-Content 'C:\temp\test1.txt' | #Load file, and start looping through it line-by-line.
?{![string]::IsNullOrEmpty($_)}|% { #Filter out blank lines, and then perform the following on each line
if ($_ -match '(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$') { #New Record Detector line! If it finds this RegEx match, it means we're starting a new record.
if ($Event) { #If there's already a record in progress, add it to the Array
$Records+=$Event
}
$Event = New-Object PSObject -Property @{ #Create a custom PSObject with the properties we just got from that RegEx match
DateStamp = [datetime](get-date $Matches[1]) #We convert the date/time stamp into an actual DateTime object. That way sorting works better, and you can compare it to real dates if needed.
Type = $Matches[2]
Source = $Matches[3]
Message = $Matches[4]}
Ok, a little pause for the cause here. $Matches isn't defined by me, so why am I referencing it? When PowerShell gets matches from a RegEx expression it automagically stores the results in $Matches. So all the groups that we matched in parentheses become $Matches[1], $Matches[2], and so on. There is also a $Matches[0], but that is the entire string that was matched against, not just the groups. We now return you to your regularly scheduled script...
} else { #End of the 'New Record' section. If it's not a new record it does the following
if($_ -match "^((?:[^ ^\[])(?:\w| |\.)+?):(.*)$"){
Another RegEx match. It starts off by stating that this has to be the beginning of the string, with the caret character (^). Then comes a non-capturing group, noted by the (?:<stuff>) format, which for my purposes just means it won't show up in $Matches. Inside it is [^ \[], which means the next character cannot be a space or an opening bracket (the bracket escaped with a \), just to speed things up and skip those lines for this check. When the first character inside brackets [] is a caret, it means 'don't match anything in these brackets'.
I actually just changed the next part to include periods, and used \w instead of [a-zA-Z0-9] because it's essentially the same thing but shorter. \w is a "word character" in RegEx, and includes letters, numbers, and the underscore. I'm not sure why the underscore is considered part of a word, but I don't make the rules, I just play the game. [a-zA-Z0-9] matches anything between 'a' and 'z' (lowercase), 'A' and 'Z' (uppercase), and '0' and '9'. At the risk of including the underscore character, \w is a lot shorter and simpler.
Then the actual capturing part of this RegEx. It has 2 groups: the first is letters, numbers, underscores, spaces, and periods (the period escaped with a \ because '.' on its own matches any character). Then a colon. Then a second group that captures everything else until the end of the line.
$Field = $Matches[1] #Everything before the colon is the name of the field
$Value = $Matches[2].trim() #everything after the colon is the data in that field
$Event | Add-Member $Field $Value #Add the Field to $Event as a NoteProperty, with a value of $Value. Those two are actually positional parameters for Add-Member, so we don't have to go and specify what kind of member, specify what the name is, and what the value is. Just Add-Member <[string]name> <value can be a string, array, yeti, whatever... it's not picky>
} #End of New Field for current record
else { #If it didn't find the regex to determine whether it is a new field, then this is just more data for the last field, so don't change the field, just append it all as data:
$Event.$Field += if(![string]::isNullOrEmpty($Event.$Field)){"`r`n$_"}else{$_}
This is a long explanation for a fairly short bit of code. Really, all it does is add data to the field! It uses an inverted (prefixed with !) check to see whether the current field already has any data. If the field already has data, it adds a new line and then appends the data. If the field is empty, it skips the new line and just adds the data.
}
}
}
$Records+=$Event #Adds the last event to the array of records.
Sorry, I'm not very good with XML, but at least this gets you clean records.
Edit: Ok, the code is annotated now; hopefully everything is explained well enough. If something is still confusing, perhaps I can refer you to a site that explains it better than I can. I ran the above against your sample input from Pastebin.
One possible way to deal with such files is to process them line by line. Each log entry starts with a timestamp and ends where the next line starting with a timestamp appears, so you could do something like this:
Get-Content 'C:\path\to\your.log' | % {
if ($_ -match '^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}') {
if ($logRecord) {
# If a current log record exists, it is complete now, so it can be added
# to your XML or whatever, e.g.:
$logRecord -match '^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) ...'
$message = $xml.CreateElement('Message')
$date = $xml.CreateElement('Date')
$date.InnerText = $matches[1]
$message.AppendChild($date)
$time = $xml.CreateElement('Time')
$time.InnerText = $matches[2]
$message.AppendChild($time)
$type = $xml.CreateElement('Type')
$type.InnerText = $matches[3]
$message.AppendChild($type)
...
$xml.SelectSingleNode('...').AppendChild($message)
}
$logRecord = $_ # start new record
} else {
$logRecord += "`r`n$_" # append to current record
}
}