Parsing log files - regex

I'm trying to write a script to simplify the process of searching through a particular applications log files for specific information. So I thought maybe there's a way to convert them into an XML tree, and I'm off to a decent start....but The problem is, the application log files are an absolute mess if you ask me
Some entries are simple
2014/04/09 11:27:03 INFO Some.code.function - Doing stuff
Ideally I'd like to turn the above into something like this
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
</Message>
Other entries are something like this where there's additional information and line breaks
2014/04/09 11:27:04 INFO Some.code.function - Something happens
changes:
this stuff happened
I'd like to turn this last chunk into something like the above, but add the additional info into a section
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
<details>changes:
this stuff happened</details>
</Message>
and then other messages, errors will be in the form of
2014/04/09 11:27:03 ERROR Some.code.function - Something didn't work right
Log Entry: LONGARSEDGUID
Error Code: E3145
Application: Name
Details:
message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry
This last chunk I'd like to convert as the last to above examples, but adding XML nodes for log entry, error code, application, and again, details like so
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>ERROR </Type>
<Source>Some.code.function</Source>
<Sub>Something didn't work right</Sub>
<Entry>LONGARSEDGUID</Entry>
<Code>E3145</Code>
<Application>Name</Application>
<details>message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry</details>
</Message>
Now I know that Select-String has a context option which would let me select a number of lines after the line I've filtered, the problem is, this isn't a constant number.
I'm thinking a regular expression would also me to select the paragraph chunk before the date string, but regular expressions are not a strong point of mine, and I thought there might be a better way because the one constant is that new entries start with a date string
the idea though is to either break these up into xml or tables of sorts and then from there I'm hoping it might take the last or filtering non relevant or recurring messages a little easier
I have a sample I just tossed on pastebin after removing/replacing a few bits of information for privacy reasons
http://pastebin.com/raw.php?i=M9iShyT2

Sorry this is kind of late, I got tied up with work for a bit there (darn work expecting me to be productive while on their dime). I ended up with something similar to Ansgar Wiechers solution, but formatted things into objects and collected those into an array. It doesn't manage your XML that you added later, but this gives you a nice array of objects to work with for the other records. I'll explain the main RegEx line here, I'll comment in-line where it's practical.
'(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) [\d+?] (\w+?) {1,2}(.+?) - (.+)$' is the Regex that detects the start of a new record. I started to explain it, but there are probably better resources for you to learn RegEx than me explaining it to me. See this RegEx101.com link for a full breakdown and examples.
$Records=#() #Create empty array that we will populate with custom objects later
$Event = $Null #make sure nothing in $Event to give script a clean start
Get-Content 'C:\temp\test1.txt' | #Load file, and start looping through it line-by-line.
?{![string]::IsNullOrEmpty($_)}|% { #Filter out blank lines, and then perform the following on each line
if ($_ -match '(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$') { #New Record Detector line! If it finds this RegEx match, it means we're starting a new record.
if ($Event) { #If there's already a record in progress, add it to the Array
$Records+=$Event
}
$Event = New-Object PSObject -Property #{ #Create a custom PSObject object with these properties that we just got from that RegEx match
DateStamp = [datetime](get-date $Matches[1]) #We convert the date/time stamp into an actual DateTime object. That way sorting works better, and you can compare it to real dates if needed.
Type = $Matches[2]
Source = $Matches[3]
Message = $Matches[4]}
Ok, little pause for the cause here. $Matches isn't defined by me, why am I referencing it? . When PowerShell gets matches from a RegEx expression it automagically stores the resulting matches in $Matches. So all the groups that we just matched in parenthesis become $Matches[1], $Matches[2], and so on. Yes, it's an array, and there is a $Matches[0], but that is the entire string that was matched against, not just the groups that matched. We now return you to your regularly scheduled script...
} else { #End of the 'New Record' section. If it's not a new record if does the following
if($_ -match "^((?:[^ ^\[])(?:\w| |\.)+?):(.*)$"){
RegEx match again. It starts off by stating that this has to be the beginning of the string with the carat character (^). Then it says (in a non-capturing group noted by the (?:<stuff>) format, which really for my purposes just means it won't show up in $Matches) [^ \[]; that means that the next character can not be a space or opening bracket (escaped with a ), just to speed things up and skip those lines for this check. If you have things in brackets [] and the first character is a carat it means 'don't match anything in these brackets'.
I actually just changed this next part to include periods, and used \w instead of [a-zA-Z0-9] because it's essentially the same thing but shorter. \w is a "word character" in RegEx, and includes letters, numbers, and the underscore. I'm not sure why the underscore is considered part of a word, but I don't make the rules I just play the game. I was using [a-zA-Z0-9] which matches anything between 'a' and 'z' (lowercase), anything between 'A' and 'Z' (uppercase), and anything between '0' and '9'. At the risk of including the underscore character \w is a lot shorter and simpler.
Then the actual capturing part of this RegEx. This has 2 groups, the first is letters, numbers, underscores, spaces, and periods (escaped with a \ because '.' on it's own matches any character). Then a colon. Then a second group that is everything else until the end of the line.
$Field = $Matches[1] #Everything before the colon is the name of the field
$Value = $Matches[2].trim() #everything after the colon is the data in that field
$Event | Add-Member $Field $Value #Add the Field to $Event as a NoteProperty, with a value of $Value. Those two are actually positional parameters for Add-Member, so we don't have to go and specify what kind of member, specify what the name is, and what the value is. Just Add-Member <[string]name> <value can be a string, array, yeti, whatever... it's not picky>
} #End of New Field for current record
else{$Value = $_} #If it didn't find the regex to determine if it is a new field then this is just more data from the last field, so don't change the field, just set it all as data.
} else { #If it didn't find the regex then this is just more data from the last field, so don't change the field, just set it all as data.the field does not 'not exist') do this:
$Event.$Field += if(![string]::isNullOrEmpty($Event.$Field)){"`r`n$_"}else{$_}}
This is a long explanation for a fairly short bit of code. Really all it does is add data to the field! This has an inverted (prefixed with !) If check to see if the current field has any data, if it, or if it is currently Null or Empty. If it is empty it adds a new line, and then adds the $Value data. If it doesn't have any data it skips the new line bit, and just adds the data.
}
}
}
$Records+=$Event #Adds the last event to the array of records.
Sorry, I'm not very good with XML. But at least this gets you clean records.
Edit: Ok, code is notated now, hopefully everything is explained well enough. If something is still confusing perhaps I can refer you to a site that explains better than I can. I ran the above against your sample input in PasteBin.

One possible way to deal with such files is to process them line by line. Each log entry starts with a timestamp and ends when the next line starting with a timestamp appears, so you could do something like this:
Get-Content 'C:\path\to\your.log' | % {
if ($_ -match '^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}') {
if ($logRecord) {
# If a current log record exists, it is complete now, so it can be added
# to your XML or whatever, e.g.:
$logRecord -match '^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) ...'
$message = $xml.CreateElement('Message')
$date = $xml.CreateElement('Date')
$date.InnerText = $matches[1]
$message.AppendChild($date)
$time = $xml.CreateElement('Time')
$time.InnerText = $matches[2]
$message.AppendChild($time)
$type = $xml.CreateElement('Type')
$type.InnerText = $matches[3]
$message.AppendChild($type)
...
$xml.SelectSingleNode('...').AppendChild($message)
}
$logRecord = $_ # start new record
} else {
$logRecord += "`r`n$_" # append to current record
}
}

Related

RegEx to format Wikipedia's infoboxes code [SOLVED]

I am a contributor to Wikipedia and I would like to make a script with AutoHotKey that could format the wikicode of infoboxes and other similar templates.
Infoboxes are templates that displays a box on the side of articles and shows the values of the parameters entered (they are numerous and they differ in number, lenght and type of characters used depending on the infobox).
Parameters are always preceded by a pipe (|) and end with an equal sign (=). On rare occasions, multiple parameters can be put on the same line, but I can sort this manually before running the script.
A typical infobox will be like this:
{{Infobox XYZ
| first parameter = foo
| second_parameter =
| 3rd parameter = bar
| 4th = bazzzzz
| 5th =
| etc. =
}}
But sometime, (lazy) contributors put them like this:
{{Infobox XYZ
|first parameter=foo
|second_parameter=
|3rd parameter=bar
|4th=bazzzzz
|5th=
|etc.=
}}
Which isn't very easy to read and modify.
I would like to know if it is possible to make a regex (or a serie of regexes) that would transform the second example into the first.
The lines should start with a space, then a pipe, then another space, then the parameter name, then any number of spaces (to match the other lines lenght), then an equal sign, then another space, and if present, the parameter value.
I try some things using multiple capturing groups, but I'm going nowhere... (I'm even ashamed to show my tries as they really don't work).
Would someone have an idea on how to make it work?
Thank you for your time.
The lines should start with a space, then a pipe, then another space, then the parameter name, then a space, then an equal sign, then another space, and if present, the parameter value.
First the selection, it's relatively trivial:
^\s*\|\s*([^=]*?)\s*=(.*)$
Then the replacement, literally your description of what you want (note the space at the beginning):
| $1 = $2
See it in action here.
#Blindy:
The best code I have found so far is the following : https://regex101.com/r/GunrUg/1
The problem is it doesn't align the equal signs vertically...
I got an answer on AutoHotKey forums:
^i::
out := ""
Send, ^x
regex := "O)\s*\|\s*(.*?)\s*=\s*(.*)", width := 1
Loop, Parse, Clipboard, `n, `r
If RegExMatch(A_LoopField, regex, _)
width := Max(width, StrLen(_[1]))
Loop, Parse, Clipboard, `n, `r
If RegExMatch(A_LoopField, regex, _)
out .= Format(" | {:-" width "} = {2}", _[1],_[2]) "`n"
else
out .= A_LoopField "`n"
Clipboard := out
Send, ^v
Return
With this script, pressing Ctrl+i formats the infobox code just right (I guess a simple regex isn't enough to do the job).

perl regex to handle and preserver single and multiple words into a variable

I am writing a perl script to read the full name of a member and save it to variables firstname and lastname like below:
my ($firstname, $lastname) = $member =~ m/^(\w+.*?) +(\w+)$/;
my $member_name = $firstname.' '.$lastname;
The value for $member comes from an upstream service which would be like for example "Jane Doe"
Now the code above cannot handle when the service sends $member value like "Jane". The regex fails to handle a single word in that code line. I need it to handle both multiple and single words. I cannot implement a new code functionality so I am looking to add to the existing regex so that there is minimal change and that it can handle both the conditions.
So far this is what I am testing with in the command line but so far no luck:
perl -e 'my ($firstname, $lastname) = "Jane Doe" =~ m/^(\w+.*?) +(\w+)$/|m/^(\w+)$/; print "$firstname\n$lastname";'
When I substitute "Jane Doe" with "Jane", nothing prints. I want the code to be in this format though. like if the value is multiple words it should print them both, otherwise just the single word.
Your help will be greatly appreciated.
There is a syntax error in your Perl code. You terminated the pattern too early.
# / / / /
# V
m/^(\w+.*?) +(\w+)$/|m/^(\w+)$/
This will lead to the | being interpreted as a bit-wise or. Since there's another m// behind it, the | will take the return values of both m// operations and do its magic. The second m// will just match against the topic $_.
What you actually want is to merge both patterns.
my ($firstname, $lastname) = "Jane Doe" =~ m/^(?:(\w+.*?) +)?(\w+)$/;
You need to make the first name optional with a non-capture group (?:), followed by a ? none-or-one quantifier.
You cannot have three capture groups, like you probably intended, because the third one would go to $3, and not $1.
However, the above solution uses the last name, which you then assign to the $firstname variable. Your full name pattern allows for names with any characters in them, like Jean-Luc Picard. But if you pass in just Jean-Luc, the match will fail. So if you want only the first name, you should use the correct pattern to make it consistent.
A simple way of doing that is to make the last name optional instead.
my ($firstname, $lastname) = "Jane" =~ m/^(\w+.*?)(?: +(\w+))?$/;
Remember that this will set $lastname to undef, which doesn't matter so much in your command line example, but in a proper program with strict and warnings (which you of course have turned on, right?) it will complain if $lastname is used as a string while it's undef.
I suggest you read this article about names.

Perl - Regexp to manipulate .csv

I've got a function in Perl that reads the last modified .csv in a folder, and parses it's values into variables.
I'm finding some problems with the regular expressions.
My .csv look like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first AND second line (that are the column names from the .csv) to be ignored;
The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
How can I do this?
When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
# open the file, I use the DATA internal file handle here
my $title = <DATA>;
# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );
while (my $row = $csv->getline_hr(*DATA)) {
# you can now access the variables via their header names, e.g.:
if (defined $row->{Duration}) { # this will skip the blank lines
say $row->{Duration};
}
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};
open ...
my $names_from_first_line = <$file>; # you can use them or just ignore them
while($my line = <$file>) {
unless ($line =~ /\S/) {
# skip empty lines
next;
}
..
}
Also, consider using Text::CSV to handle CSV format
1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
2) The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
next if $line =~ /^\s*$/;
You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
warn qq(next if not $line =~ /^"\d{2}-\d{2}-d{4}/;); # Temp debugging line
next if not $line =~ /^"\d{2}-\d{2}-d{4}/;
warn qq($line matched regular expression); # Temp debugging line
...
}
The /^"\d{2}-\d{2}-d{4}",/ is a regular expression pattern. The pattern is between the /.../:
^ - Beginning of the line.
" - Quotation Mark.
\d{2} - Followed by two digits.
- - Followed by a dash.
\d{2] - Followed by two more digits.
- - Followed by a dash.
\d{4} - Followed by four more digits
This should be describing the first part of your line which is the date in MM-DD-YYYY format surrounded by quotes and followed by a comma. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
Regular expressions can be difficult to understand, and is one of the reasons why Perl has such a reputation of being a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions is an extremely powerful tool, and worth the effort to learn. And with some experience, you'll be able to easily decode them.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use post-fix if and never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}-\d{2}-d{4}",/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and what do you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a file handle value of zero. The or states that if open doesn't return a non-zero file handle, execute the line that follows which kills your program.
So, look through your program, and see any place where you make an assumption that something works as expected and think what happens if it didn't. Then, add checks in your program to something if you get that exception. It could be that you want to report the error or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. What ever you do, check for possible errors (especially from user input) and handle possible errors.
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space then the time which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us into debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote like operator. It's another way to do double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.

awk regex: difference between using it or without variables

I have an awk script that behaves different when I put a regular expression in different places. Obviously I make the logic of the program to work the same in both cases, but it does not. The script is for analyzing some logs where each transaction has an unique ID. The log looks like
timestamp (ID) more info
for example:
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with real information and a key string that determines the type of thransaction
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with other information
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with more information
2014-10-06 05:24:40,035 INFO (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction
What I want is to process all the log lines of a certain type of transaction to see how much time do they take. Each transaction is spread across several log lines and its identified by its unique ID. To know if a certain transaction is of the type I want I have to search for certain string in the first line of that transaction. In the log could be lines without the above format.
What do I want:
Distinguish if the current line is part of a transaction (it has an ID)
Check if the ID is already registered in an cumulative array.
If not, check if it is of the desired type: search for a fixed string in the body of the line.
If it is, register the timestamp, and blah blah
And here is the code (note this is a very minified version).
This is what I would like to use, first check if it is a transaction line and after check if it is of the correct type
awk '$4 ~ /^\([:alnum:]/
{
name=$4;gsub(/[()]|:.*/,"",name);++matched
if(!(name in arr)){
if($0 ~ /transaction type/){arr[name]=1;print name}}
}END
{
print "Found :"length(arr)
print "Processed "NR
print matched" lines matched the filter"
}'
That script only finds 868 transactions and there are some more than 14K. If I change the script to look like the code below if finds all the 14k transactions, but only the first line of all of them, so it is not useful for me.
awk '/transaction type/
{
name=$4;gsub(/[()]|:.*/,"",name);++matched
if(!(name in arr)){
arr[name]=1;print name
}
}END
{
print "Found :"length(arr)
print "Processed "NR
print matched" lines matched the filter"
}'
Thanks in advance.
Edit
Shame on me. There were more than one actual problem in this topic.
The main one was that the regex was not matching the proper string. The ID string and the type of transaction string were on the same line, that is true, but on those lines the ID was like (aaaaaabbbbbcccc: ), with two spaces at the end. That makes AWK to parse
"(aaaaaaaabbbbcccc:" and ")" as two different fields. I realized when I did
$4 !~ /regex/ print $4
and a lot of valid IDs appeared.
The second problem, which appeared after fixing the regular expression have been addressed by some people here. Having the main regular expression and the firs { in separated lines makes awk to print each record. I realized that myself and the same day later I read here the solutions. Amazing.
Thank you very much to every one. I can only accept one answer as valid, but I learned a lot from all of them.
white space matters in awk. This:
/foo/ {
print "found"
}
means print 'found' every time "foo" is present while this:
/foo/
{
print "found"
}
means print the current record every time "foo" is present and print "found" for every single input record so chances are when you wrote:
$4 ~ /^\([:alnum:]/
{
....
}
you actually meant to write:
$4 ~ /^\([:alnum:]/ {
....
}
also, chances are you meant to use the POSIX character class [[:alnum:]] instead of the set of characters [ : a l n u m as described by the character set [:alnum:]:
$4 ~ /^\([[:alnum:]]/ {
....
}
If you fix those things and you still need help, provide some testable sample input and expected output we can help you more.
It's only a syntax error. When you use a posix character class you must enclose it between square brackets:
[[:alnum:]]
Otherwise [:alnum:] is seen as a character class that contains : a l m n u
So in brief if I understood properly you wish to get ids of certain type of transactions.
First assumption: id and transaction type are on the same line, something like this should do (largely adapted from your code)
awk 'BEGIN {
matched=0 # more for clarity than really needed
}
/\([[:alnum:]]*\).*transaction type/ { # get lines matching the id and the transaction only
gsub(/[()]/,"",$4) # strip the () around the id
++matched # to get the number of matched lines including the multiples ones.
if (!($4 in arr)) { # as yours, if the id is not in array
arr[$4]=1 # add the found id to array for no including it twice
print $4 # print the found id (only once as we're in the if
}
}
END { # nothing changed here, printing the stats...
print "Found :"length(arr)
print "Processed "NR
print matched" lines matched the filter"
}'
Output of this from your sample input:
prompt=> awk 'BEGIN { matched=0}; / \([a-z0-9]*\) / { gsub(/[()]/,"",$4); ++matched; if (!($4 in arr)) { arr[$4]=1; print $4 }}; END { print "Found: "length(arr)"\nProcessed "NR"\n"matched" lines matched the filter" }' awkinput
4aaaaaaaaabbbbbbcccb
4xxbbbbbbbbbbbbbcccb
Found: 2
Processed 4
4 lines matched the filter
I've ommitted the transaction in the test as I've no clue on what it may be

Regex named capture group with multiple values

I seem to be having a tough regex week. Anyone that can save me from throwing my laptop out the window gets a virtual beer. I have some data in the form of:
... f=something group="First Group,Group2" foo=val ...
where the number of groups can vary. I need to capture each group entry to a named capture. Based on a previous post, The difference here is that I don't have a constant to key off of within the values (i.e. ID-1-1, ID-2-2 allows me to say ID-\d+-\d+ whereas these values could be pretty much anything). I've been trying a ton of stuff, but I tend to get matches that are far too greedy, or I (often) get these 2 values:
First Group
First Group,Group2
What I need is:
First Group
Group2
...
I'm currently trying regex such as this where I'm trying to anchor to the group=" portion, and not exceed the ending ":
(?:(?:group=\")|(?:\"))(?<group>(?:(.+)+?)
Hopefully someone can make my day a lot better...
Here's the PHP solution. Once again, regex doesn't like capturing the multiple values so we need to break it in to two searches. One extracts the group value, the next extracts each value from the group
$test = 'f=something group="First Group,Group2" foo=val';
$re = '/(?:group=)?\x22(?<group>(?:[^\x2C]+\x2C*)+)\x22/';
$_ = null;
if (preg_match($re,$test,$_))
echo "Group Contents: ".$_['group']."\r\n";
$__ = null;
$re = '/(?:^|\x2C)(?<value>(?:[^\x2C]+)+)/';
if (preg_match_All($re,$_['group'],$__))
echo "Group Values: ".print_r($__['value'],true);
Should be pretty easy to port in to another language, just extract the regexes out and manage them the way you normally would.