Is my regex just wrong, or is there a bug in td-agent's format handling?


I am using Fluentd, Elasticsearch and Kibana to organize logs. Unfortunately, these logs do not follow any standard format such as Apache's, so I had to come up with the format regex myself. I used this site to verify that it works: http://fluentular.herokuapp.com/ .
The logs have roughly this format:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
The format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website, which is supposed to test specifically Fluentd's behaviour with regexes, the output SHOULD be this:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead, though, I have this ?bug? where pri is always shortened to DEBU. The same goes for ERROR, which becomes ERRO; only INFO stays INFO. I am not very experienced with regular expressions and I find it hard to believe that this is a bug. Still, it confuses me, and any help is greatly appreciated.
I'm not sure I can link the complete config file, because I don't personally own these log files and I want to avoid posting sensitive information that my boss would object to. If it is definitely needed, I will post it later, after asking him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO, then the date and time, then what we call the subject, which is always written in [ ], and finally just a message.
Here is a link to Fluentular with the format I am using and a test string that produces the right result in Fluentular, but not in my config file:
Fluentular
Sorry I couldn't make it work like a regular link to just click on.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter, although fluentular says it shouldn't. My fault or a bug?

How the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
In your format string, you were using . and ... where the colon and spaces should be. I'm not too sure why this works in Fluentular, but you should match the \: and each space between the values explicitly.
So you'd be looking at the following regular expression with the Fluentd fields (which are the group names):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
  type tail
  path /var/log/foo/bar.log
  pos_file /var/log/td-agent/foo-bar.log.pos
  tag foo.bar
  format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
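The truncation can be reproduced outside td-agent. The following is a minimal sketch in Python, assuming the log line is exactly the sample from the question with a single space after the colon. Python spells named groups (?P<name>...) where Ruby, and therefore Fluentd, uses (?<name>...); otherwise both patterns are copied unchanged. The point it illustrates: [DEBUG] is a character class that matches one of the letters D, E, B, U or G, not the word DEBUG. The repeated group therefore grabs letters greedily and then has to give back the final G so that ... can consume three characters ("G: ") before the date.

```python
import re

# Sample line from the question (single space after the colon is an assumption).
LINE = "DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten."

# Pattern from the question; only the named-group syntax is adapted to Python.
ORIGINAL = re.compile(
    r"(?P<pri>([INFO]|[DEBUG]|[ERROR])+)...(?P<date>(\d{2}\.\d{2}\.\d{4}))"
    r".(?P<time>(\d{2}:\d{2}:\d{2})).\[(?P<subject>(.*))\].(?P<msg>(.*))"
)

# Pattern from the answer, adapted the same way.
FIXED = re.compile(
    r"(?P<pri>(INFO|ERROR|DEBUG))\:\s+(?P<date>(\d{2}\.\d{2}\.\d{4}))"
    r"\s(?P<time>(\d{2}:\d{2}:\d{2}))\s\[(?P<subject>(.*))\]\s(?P<msg>(.*))"
)


def parse(pattern, line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = pattern.search(line)
    return match.groupdict() if match else None


original_record = parse(ORIGINAL, LINE)
fixed_record = parse(FIXED, LINE)
print(original_record["pri"])  # DEBU
print(fixed_record["pri"])     # DEBUG
```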
I would also look into comparing Logstash and Fluentd. I like Logstash far more because you create Grok filters to match the type of data you want, and that abstraction layer makes formatting your fields much easier, but you will essentially get the same data.
I would also be careful with sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like Regexr, which gives immediate feedback and lets you set global and multiline matching as well.

Related

Regex with multiple groups, some of which are optional

I have trouble matching multiple groups, some of which are optional. I've tried variations of greedy/non greedy, but can't get it to work.
As input, I have cells which look like this:
SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith Omschrijving: Hello hello Kenmerk: 03-05-2019 23:12 533238
I want to split these up into the groups IBAN, BIC, Naam, Omschrijving and Kenmerk.
For this example, this yields: AB1234; LALA678; John Smith; Hello hello; 03-05-2019 23:12 533238.
To obtain this, I've used:
.*IBAN: (.*)\s+BIC: (.*)\s+Naam: (.*)\s+Omschrijving: (.*)\s+Kenmerk: (.*)
This works perfectly as long as all these groups are present in the input. Some cells, however, don't have the "Omschrijving" and/or "Kenmerk" part. As output, I would like to have empty groups if they're not present. Right now, nothing is matched.
I've tried variations with greedy/non greedy, but couldn't get it to work.
Help would be greatly appreciated!
N.B.: I'm working in KNIME (open source data analysis tool)

I was able to split your input using the following regular expression:
^.*
\s+IBAN\:\s*(?<IBAN>.*?)
\s+BIC\:\s*(?<BIC>.*?)
\s+Naam\:\s*(?<Naam>.*?)
(?:\s+Omschrijving\:\s*(?<Omschrijving>.*?))?
(?:\s+Kenmerk\:\s*(?<Kenmerk>.*?))?
$
This requires your fields to follow the given order and treats the fields IBAN, BIC and Naam as required; the fields Omschrijving and Kenmerk are optional. I am pretty sure this can still be optimized, but it results in the following output, which should be fine for you (or at least a starting point):
For evaluation and testing in KNIME, I used Palladian's Regex Extractor node, which can be configured as follows and provides a nice preview functionality:
I added an example workflow to my NodePit Space. It contains some example lines, parses them and produces the output shown above.
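For readers without KNIME at hand, the behaviour of the optional groups can be checked with a short Python sketch. It assumes the multi-line pattern above is meant to be read in free-spacing mode (re.VERBOSE in Python) and adapts the named groups to Python's (?P<name>...) syntax. The helper name split_cell and the conversion of missing groups to empty strings are illustrative choices, not part of the KNIME workflow.

```python
import re

# The answer's pattern, written in verbose mode so it can span several lines.
PATTERN = re.compile(
    r"""
    ^.*
    \s+IBAN\:\s*(?P<IBAN>.*?)
    \s+BIC\:\s*(?P<BIC>.*?)
    \s+Naam\:\s*(?P<Naam>.*?)
    (?:\s+Omschrijving\:\s*(?P<Omschrijving>.*?))?
    (?:\s+Kenmerk\:\s*(?P<Kenmerk>.*?))?
    $
    """,
    re.VERBOSE,
)


def split_cell(cell):
    """Return the five fields as a dict; absent optional fields become ''."""
    match = PATTERN.match(cell)
    if match is None:
        return None
    return {name: value or "" for name, value in match.groupdict().items()}


full = ("SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith "
        "Omschrijving: Hello hello Kenmerk: 03-05-2019 23:12 533238")
partial = "SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith"

print(split_cell(full))
print(split_cell(partial))
```

The lazy .*? quantifiers matter here: each field stops at the first point where the rest of the pattern can match, so Naam does not swallow the optional Omschrijving and Kenmerk parts when they are present.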

Aws CloudWatch filtering compact json with # in it

We use Serilog to output logs from our .NET Core app. We are using compact JSON to keep the size down. In compact format it seems to prefix the level key with a # sign:
"#l": "Warning"
I can’t seem to get a filter working; it either returns no results or reports an error. I’ve tried many things, but I’m sure this should work:
{ $.#l = "Warning" }
Can anyone suggest where I’m going wrong?

I don't think you can use # in the selector. From the docs:
Property selectors are alphanumeric strings that also support '-' and '_' characters.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html#extract-json-log-event-values
One way to get around this would be to match the line as if it's not part of json.
For example, if your log line looks like this:
"#l": "Warning"
you could filter it out with:
[key="#l", colon, value=Warning]

I had the same issue. Most likely you used Serilog.Formatting.Compact.CompactJsonFormatter, as I did.
Implementing your own ITextFormatter is a workaround, because prefixes like # or $ are hardcoded inside CompactJsonFormatter.
I used CompactJsonFormatter as a basis, replaced its uses of # and $ with s_, and it works.

How do I Build a Regex Expression to Find String

I've been studying content on the regex topic, but am having trouble understanding how to make it work! I need to build a regex to locate a particular string, potentially in multiple places throughout numerous log files. If I were keying the search expression into a text editor, it would look like this...
*Failed to Install*
Following is a typical example of a line containing the string I would like to search for (exit code # will vary)
!!! Failed to install, with exit code 1603
I would really appreciate any help on how to build the regex for this. I suspect I might need the end of line character too?
I plan on using it in a variation of the script that was provided by https://stackoverflow.com/users/3142139/m-hassan in the following thread
Use PowerShell to Quickly Search Files for Regex and Output to CSV
I'm a newbie to PowerShell scripts, but I'd rather spend the time to figure this out than pore over hundreds of log files!
Thanks,
Jim

You're in luck: you only need a very simple regex for this. Assuming you want to capture the error code, this will work fine:
^.*Failed to install.*(exit code \d+)$
If you don't care about the error code, and just want to know if it failed or not, you can honestly get away with something as simple as:
^.*Failed to install.*$
Hope this helps.
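As a quick check of the first pattern, here is a small Python sketch rather than the PowerShell script the question plans to adapt. The function name find_failures and the surrounding sample lines are made up for illustration. Case-insensitive matching is an added assumption, since the question writes "Failed to Install" while the log line has a lowercase "install".

```python
import re

# Pattern from the answer; IGNORECASE is an assumption, see the note above.
FAILED_INSTALL = re.compile(r"^.*Failed to install.*(exit code \d+)$", re.IGNORECASE)


def find_failures(lines):
    """Yield (line_number, captured text) for each line reporting a failed install."""
    for number, line in enumerate(lines, start=1):
        match = FAILED_INSTALL.search(line.rstrip("\r\n"))
        if match:
            yield number, match.group(1)


# Only the second line comes from the question; the others are placeholders.
sample = [
    "Starting install",
    "!!! Failed to install, with exit code 1603",
    "Done",
]
print(list(find_failures(sample)))  # [(2, 'exit code 1603')]
```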

parse hl7 with regex

I have the following hl7 message:
MSH|^~\&|EPIC|SMHRMC|JCAPS|QHN|20170626165726|EDILABIH|ORU^R01^LAB|00004841|P|2.3|||||||||
PID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E||19860905|F||1|2601 somestreet AVE NO 8^^City^ST^zip^USA^^^county|MESA|(970)xxx-xxxx^P^PH|||Single||175375903|xxxxxxx||last^first^^|NON-HISPANIC||||||||||
PV1|1|I|MNEU^908^A^^R^^^^^^||||9999999^pcp^pcp^LYNNE^^^^^NPI^^^^NPI~999999999^last^first^LEE^^^^^NPI^^^^NPI||||||||||00000000^last^first^LYNNE^^^^^NPI^^^^NPI||000000603|CAID||||||||||||||||||||||||20170626000000
HL7 is hard to extract with regex; however, I have a field that is always in the same location, and I feel that might be easier. I need to pull the encounter number, which is the 'W00xxxxx' in the stream above. It is always in the 3rd pipe-delimited section of the PID segment and stops at the ^.
Currently I have select substring(column from 'PID\|[1]\|\|(.)\^'), but this is not working. However, when I use select substring(column from 'PV1\|[1]\|(.)\|'), it pulls the 'I'. I can't see a big enough difference between the two regexes to know why the first one isn't working. Thanks.

How about this:
PID\|[1]\|\|(.+?)\^
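The difference is the quantifier. (.) captures exactly one character, which happens to be enough for the single-character 'I' in the PV1 segment but cannot span 'W00xxxxx'. (.+?) captures one or more characters and stops at the first ^. A small Python sketch shows this on a shortened copy of the PID segment from the question; Python's re module is used here only for illustration and is assumed to treat these two patterns the same way as the substring(... from ...) call.

```python
import re

# Shortened PID segment from the question.
PID = "PID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E||19860905|F"

# Pattern from the question: exactly one character must sit between "||" and "^".
one_char = re.search(r"PID\|[1]\|\|(.)\^", PID)

# Pattern from the answer: one or more characters, as few as possible.
lazy = re.search(r"PID\|[1]\|\|(.+?)\^", PID)

print(one_char)       # None
print(lazy.group(1))  # W00xxxxx
```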

You can't reliably parse HL7 V2.x messages using regex because the encoding characters may change in MSH-1 and MSH-2. Whatever language you're using there's probably already an HL7 parsing library you can use instead.
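To illustrate the point about encoding characters, here is a rough Python sketch that reads the field separator from MSH-1 (the character right after "MSH") and the component separator from the first character of MSH-2, instead of hard-coding | and ^. The function name encounter_number is hypothetical. The sketch ignores escape sequences and repetitions, so it is not a substitute for a real HL7 library.

```python
def encounter_number(message):
    """Return the first component of PID-3, or None if it cannot be found."""
    if not message.startswith("MSH") or len(message) < 5:
        return None
    field_sep = message[3]      # MSH-1: field separator
    component_sep = message[4]  # first character of MSH-2: component separator
    # splitlines() handles the carriage returns that usually separate segments.
    for segment in message.splitlines():
        fields = segment.split(field_sep)
        if fields[0] == "PID" and len(fields) > 3:
            return fields[3].split(component_sep)[0]
    return None


standard = "MSH|^~\\&|EPIC|SMHRMC\rPID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E"
unusual = "MSH#*~\\&#EPIC#SMHRMC\rPID#1##W00xxxxx***SMHRMC##mouse*Mickey*E"

print(encounter_number(standard))
print(encounter_number(unusual))
```

Both calls return the same encounter number even though the second message uses different delimiters, which is the case a fixed regex such as PID\|[1]\|\|(.+?)\^ would miss.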

Yahoo Pipes Using Regex to change link

Hi, I am pretty new to regex. I can do some basic things, but I am having trouble with this: I need to change the link in an RSS feed.
I have a url like this:
http://mysite.test/Search/PropDetail.aspx?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult
and want to change it to the updated site:
http://mysite.test/PropertyDetail/?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult
The only thing that changes is /Search/PropDetail.aspx,
which becomes /PropertyDetail/
I don't have access to the original RSS feed, or I would change it there, so I have to use Pipes. Please help. Thanks!

Use the Regex module.
In it, specify the DOM address of the node containing your link (prefixed by "item.") in the "In" field. In the "replace" field, type
(.*)\/Search\/PropDetail\.aspx
and in the "with" field, type
$1/PropertyDetail/
I've escaped the '/' and '.' characters in the "replace" field. I'm not sure the '/' needs escaping, only the '.', so some trial and error may be needed. The query string after .aspx is not part of the match, so it is left untouched.
Hopefully this will achieve the result you want.
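The substitution itself can be checked outside Pipes. In this minimal Python sketch the function name rewrite_link is made up for illustration. In Python the '/' characters need no escaping, only the '.' does, and because only the matched path is replaced, no capture group is required.

```python
import re

# Only the path segment is matched; the rest of the URL is left untouched.
OLD_PATH = re.compile(r"/Search/PropDetail\.aspx")


def rewrite_link(url):
    """Replace the old detail-page path with the new one, at most once."""
    return OLD_PATH.sub("/PropertyDetail/", url, count=1)


old_url = ("http://mysite.test/Search/PropDetail.aspx?id=38464&id=38464"
           "&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult")
print(rewrite_link(old_url))
```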
