Regex, Grafana Loki, Promtail: Parsing a timestamp from logs using regex - regex

I want to parse a timestamp from logs to be used by loki as the timestamp.
I'm a total noob when it comes to regex.
The log file is from "endlessh", which is essentially a tarpit/honeypot for SSH attackers.
It looks like this:
2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26
2022-04-03 14:38:07.723962122 2022-04-03T12:38:07.723Z ACCEPT host=::ffff:218.92.0.192 port=64475 fd=4 n=1/4096
What I want to match, using regex, is the second timestamp on each line, since it's a UTC timestamp and should be parseable by Promtail.
I've tried different approaches, but just couldn't get it right at all.
So first of all I need a regex that matches the timestamp I want.
Secondly, I somehow need to turn it into a regex that exposes the matched value in some way.
The docs offer this example:
.*level=(?P<level>[a-zA-Z]+).*ts=(?P<timestamp>[T\d-:.Z]*).*component=(?P<component>[a-zA-Z]+)
Afaik, those are named groups, and that is all that it takes to expose the value for me to use it in the config?
Would be nice if someone can provide a solution for the regex, and an explanation of what it does :)

You could for example create a specific pattern to match the first part, and capture the second part:
^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b
Or, if the format is always the same, use a very broad pattern: skip an exact number of non-whitespace parts and capture the part that you want to keep. Promtail uses Go's RE2 syntax, where named groups can be written as (?P<name>...):
^(?:\S+\s+){2}(?P<timestamp>\S+)
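A quick way to check both patterns against the sample lines is a short script. This is only a sketch in Python, whose re module is not the Go RE2 engine that Promtail uses; the (?P<name>...) named-group syntax is accepted by both, which is why it is used here. The names LINES, SPECIFIC, BROAD and extract_timestamp are made up for illustration.

```python
import re
from datetime import datetime, timezone

# The two sample lines from the question.
LINES = [
    "2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26",
    "2022-04-03 14:38:07.723962122 2022-04-03T12:38:07.723Z ACCEPT host=::ffff:218.92.0.192 port=64475 fd=4 n=1/4096",
]

# Specific pattern: match the local timestamp, then capture the UTC one.
SPECIFIC = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b"
)

# Broad pattern: skip two whitespace-separated fields, capture the third.
BROAD = re.compile(r"^(?:\S+\s+){2}(?P<timestamp>\S+)")


def extract_timestamp(pattern, line):
    """Return the text of the 'timestamp' group, or None if there is no match."""
    m = pattern.search(line)
    return m.group("timestamp") if m else None


for line in LINES:
    ts = extract_timestamp(SPECIFIC, line)
    # Both patterns should agree on well-formed lines.
    assert ts == extract_timestamp(BROAD, line)
    # Confirm the captured text really parses as a UTC timestamp.
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    print(ts, "->", parsed.isoformat())
```

In Promtail, a named group extracted by a regex stage is made available to later pipeline stages, so a timestamp stage can refer to it by name through its source setting; the exact format value to use there is best taken from the Promtail documentation.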

Related

Regex needed to search for a numeric id within a tag

My very basic regex skills are not allowing me to successfully extract an id number within a tag.
I think it would be fairly straightforward. I would like to extract the id from the following extract.
<id>53222132</id>
The id number is not a specific length but I just need to be able to find the id number which is numeric only.
More specifically this is the only instance of the tag id so it's unique and should be used within the regex.
Finally, is there a way that this can be saved in a variable?
Using regex as part of a splunk query where I will use the variable to make it distinct.
I have got as far as the following which captures everything including the tag.
<\s*id[^>]*>(.*?)<\s*\/\s*id>
Thanks in advance
(?<=<id>)\d+(?=<\/id>)
This would be my first thought. This will use a positive look ahead and positive look behind and it will only match a string of digit characters in the middle. Another alternative is:
\d+(?=<\/id>)
This uses only the lookahead, since lookbehind is not supported in every regex flavor. One other option:
\d+(?=\s*<\s*\/\s*id\s*>)
This will ignore any spaces that might be present in that ending tag, and still find the id regardless. One of these should work for your scenario.
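The three variants can be compared side by side with a small script. This is a sketch in Python's re module, not Splunk's regex engine, so lookaround support may differ there; the names SAMPLE, SPACED and find_id are hypothetical and exist only for this illustration.

```python
import re

SAMPLE = "<id>53222132</id>"
# Hypothetical variant with stray spaces inside the closing tag.
SPACED = "<id>53222132< / id >"

# Lookbehind + lookahead: only the digits are part of the match.
LOOKAROUND = re.compile(r"(?<=<id>)\d+(?=</id>)")
# Lookahead only, for flavors without lookbehind.
LOOKAHEAD_ONLY = re.compile(r"\d+(?=</id>)")
# Lookahead that tolerates whitespace in the closing tag.
TOLERANT = re.compile(r"\d+(?=\s*<\s*/\s*id\s*>)")


def find_id(pattern, text):
    """Return the matched digits, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


for name, pattern in [("lookaround", LOOKAROUND),
                      ("lookahead only", LOOKAHEAD_ONLY),
                      ("tolerant", TOLERANT)]:
    print(name, find_id(pattern, SAMPLE), find_id(pattern, SPACED))
```

Only the third pattern still finds the id when the closing tag contains spaces, which is the trade-off described above.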

Extracting data using regex from bank feed

I am looking to extract some text from a raw credit card feed for a workflow. I have gotten almost where I want to but am struggling with the final piece of information I'm trying to extract.
An example of the raw feed is:
LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE
I am looking to extract this from the above:
(ICGROUP,INC.MELBOURNE)June5UNITEDSTATESDOLLARAUD(50.07)includesconversioncommissionof
with the brackets representing the two groups I am after. The consistent parts across all instances of what I'm trying to extract is:
DIGITS (TEXT) DATE TEXT AMOUNT includesconversioncommissionof
I have been able to use the regex:
([A-Z][a-z]\d)[A-Z]AUD(\d\,?\d+?.\d*)includesconversioncommissionofAUD
to get me the date and the amount. I am struggling to find a way to get as per the example above the words ICGROUP,INC.MELBOURNE
I have tried putting \d\d(.*) before the above regex but that doesn't work for some reason.
Would appreciate if anyone is able to help with what I'm after!
The closest I think we can get (PCRE) is something like:
/
[\d,.]+ # a currency value to bookend
(.+?) # capture everything in-between
[A-Z][a-z]+\d+ # a month followed by a day, e.g. "June5"
.+? # everything in-between
([\d,.]+) # capture a currency value
includesconversioncommissionof # our magic token to bookend
/x
The technique here is to pit greedy expressions against non-greedy expressions in a very deliberate way. Let me know if you have any questions about it. I would be extremely hesitant to put this in production—or even trust its output as an ad-hoc pass—without rigorous testing!
I'm using the pattern [\d,.] for currency, but you can replace that with something more sophisticated, especially if you expect weird formats and currency symbols. The biggest potential pitfall here is if the ICGROUP,INC.MELBOURNE token might start with a number. Then you'll definitely need a more sophisticated currency pattern!
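To see the greedy versus non-greedy interplay on the sample feed, the same pattern can be run in Python, whose re.VERBOSE flag plays the role of PCRE's /x modifier. This is a sketch only: it uses Python's engine rather than PCRE, and the names FEED, PATTERN, merchant and amount are invented for the example.

```python
import re

# The raw feed from the question, split across lines only for readability.
FEED = ("LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5"
        "UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96"
        "WOOLWORTHS3335CHADSTOCHADSTONE")

PATTERN = re.compile(r"""
    [\d,.]+                          # a currency value to bookend (greedy)
    (.+?)                            # group 1: everything in-between (lazy)
    [A-Z][a-z]+\d+                   # a month followed by a day, e.g. June5
    .+?                              # everything in-between (lazy)
    ([\d,.]+)                        # group 2: a currency value (greedy)
    includesconversioncommissionof   # the fixed token to bookend
""", re.VERBOSE)

match = PATTERN.search(FEED)
merchant, amount = match.groups()
print(merchant, amount)
```

The lazy groups stop at the first place where the rest of the pattern can match, while the greedy currency classes swallow the whole run of digits, commas and dots; that is why the leading "350.0735.00" is consumed as a single bookend.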
Here's what I've got (in php).
$string = "LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE";
$cleaned = preg_replace("/^(LEO'SFINEFOOD&WINEHARTWELL)([A-Za-z]{3,9})(\.|\d)*/", "", $string);
echo $cleaned;
What it returns is: ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE
Which you can then use and run your own little regex on.
Explanation:
The [A-Za-z]{3,9} is used to remove the month name, which may be 3-9 characters long. Then the (\.|\d)* removes the digits and dots that follow it. I'm thinking that we could parse the month/date better using your regex to extract that June 5 part, but from the example given, it shouldn't be necessary.
However, it would be much more helpful if you could provide at least 3 examples, optimally 5, so we can get a good feel of the pattern. Otherwise this is the best I can do with what you've given.

RegEx SQL, issue escaping quotes

I am trying to use PSQL, specifically AWS Redshift to parse a line. Sample data follows
{"c.1.mcc":"250","appId":"sx-calllog","b.level":59,"c.1.mnc":"01"}
{"appId":"sx-voice-call","b.level":76,"foreground":9}
I am trying the following regex in order to extract the appId field, but my query is returning empty fields.
'appId\":\"[\w*]\",'
Query
SELECT app_params,
regexp_substr(app_params, 'appId\":\"[\w*]\",')
FROM sample;
You can do that as follows:
(\"appId\":\"[^"]*\")(?:,)
Demo: http://regex101.com/r/xP0hW3
The first extracted group is what you want.
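Your regex was not matching because \w does not match the - character, and because [\w*] is a character class that matches only a single character rather than a run of them.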
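The failure can be reproduced outside the database. The sketch below uses Python's re module, not Redshift's POSIX engine, which may treat escapes inside brackets differently, so it illustrates the reasoning rather than Redshift's exact behaviour. The names ROWS, ORIGINAL, WORD_RUN, FIXED and first_group are hypothetical.

```python
import re

# The two sample rows from the question.
ROWS = [
    '{"c.1.mcc":"250","appId":"sx-calllog","b.level":59,"c.1.mnc":"01"}',
    '{"appId":"sx-voice-call","b.level":76,"foreground":9}',
]

# The question's pattern: [\w*] matches exactly ONE character.
ORIGINAL = re.compile(r'appId":"[\w*]",')
# Moving the star outside the class still fails, because \w excludes "-".
WORD_RUN = re.compile(r'appId":"\w*",')
# The answer's pattern: anything up to the closing quote.
FIXED = re.compile(r'("appId":"[^"]*")(?:,)')


def first_group(pattern, row):
    """Return group 1 if the pattern has one, the whole match otherwise, or None."""
    m = pattern.search(row)
    if m is None:
        return None
    return m.group(1) if pattern.groups else m.group(0)


for row in ROWS:
    print(first_group(ORIGINAL, row), first_group(WORD_RUN, row), first_group(FIXED, row))
```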
Adding this here despite this being an old question since it may help someone viewing this down the road...
If your lines of data are valid JSON, you can use Redshift's JSON_EXTRACT_PATH_TEXT function to extract the value of a given key. Emphasis on the JSON being valid, as it will fail if even one line cannot be parsed and Redshift will throw a JSON parsing error.
Example using given data:
select json_extract_path_text('{"c.1.mcc":"250","appId":"sx-calllog","b.level":59,"c.1.mnc":"01"}','appId');
returns sx-calllog
This is especially useful since Redshift uses POSIX regex, which supports neither lookahead/lookbehind nor extracting capture groups.
You can try using some lookahead and look behinds to isolate just the text inside the quotes for the appid. (?<=appId\":\")(?=.*\",)[^\"]*. I tested this out a bit using your examples you provided here.
To explain the regex a bit more: (?<=appId\":\")(?=.*\",)[^\"]*
1. (?<=appId\":\"): positive lookbehind for appId":". Since you don't want the appId text itself to be returned (just the value), you can preface the regex with a lookbehind to say "find me the following regex, but only when it follows the lookbehind text."
2. (?=.*\",): positive lookahead for the ending ",. You don't want quotes to be returned in your match, but as with number 1 you want your regex to be bounded a bit, and a lookahead does that.
3. [^\"]*: The actual matching portion. You want to find the string of chars that are NOT ". This will match the entire value and stop matching right before the closing ".
EDIT: Changed the 3rd step a little bit, removed the , from that last piece, it is not needed and would break the match if the value were to actually contain a ,.
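The three steps can be checked against the sample rows with a short script. This is a sketch in Python's re module, which supports lookarounds; it says nothing about whether a given database engine does. The names ROWS, LAST_KEY, VALUE and app_id are made up for the example, and LAST_KEY is a hypothetical row added to show a limitation.

```python
import re

ROWS = [
    '{"c.1.mcc":"250","appId":"sx-calllog","b.level":59,"c.1.mnc":"01"}',
    '{"appId":"sx-voice-call","b.level":76,"foreground":9}',
]
# Hypothetical row where appId is the last key, so no '",' follows the value.
LAST_KEY = '{"b.level":76,"appId":"sx-voice-call"}'

# Lookbehind for the key, lookahead for a later '",', then the value itself.
VALUE = re.compile(r'(?<=appId":")(?=.*",)[^"]*')


def app_id(row):
    """Return the bare appId value without quotes, or None if not found."""
    m = VALUE.search(row)
    return m.group(0) if m else None


for row in ROWS + [LAST_KEY]:
    print(app_id(row))
```

Because the lookahead requires a ", somewhere after the value, the pattern finds nothing when appId is the final key in the object; dropping the lookahead would avoid that.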

Extract text between two given strings

Hopefully someone can help me out. Been all over google now.
I'm doing some zone-ocr of documents, and want to extract some text with regex. It is always like this:
"Til: Name Name Name org.nr 12323123".
I want to extract the name part. It can be 1-4 names, but "Til:" and "org.nr" are always before and after it.
Anyone?
If you can't use capturing groups (check your documentation) you can try this:
(?<=Til:).*?(?=org\.nr)
This solution uses lookbehind and lookahead assertions, which are not supported by every regex flavour. If they work, this regex will return only the part you want: the text inside the assertions is not included in the match; the engine only checks that it is present.
Use the pattern:
Til:(.*)org\.nr
Then take capture group 1 (group 0 is the whole match, so this is the second entry in the result) to get the content between the parentheses.
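Both approaches can be tried on the sample string with a few lines of Python. This is a sketch using Python's re module rather than whatever engine the OCR tool provides; the names TEXT, CAPTURE, LOOKAROUND, by_group and by_lookaround are hypothetical.

```python
import re

TEXT = "Til: Name Name Name org.nr 12323123"

# Capturing-group approach: the names end up in group 1.
CAPTURE = re.compile(r"Til:(.*)org\.nr")
# Lookaround approach: the names are the whole match (group 0).
LOOKAROUND = re.compile(r"(?<=Til:).*?(?=org\.nr)")

# Both results include the surrounding spaces, so strip them.
by_group = CAPTURE.search(TEXT).group(1).strip()
by_lookaround = LOOKAROUND.search(TEXT).group(0).strip()

print(repr(by_group), repr(by_lookaround))
```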

Regex to extract part of a url

I'm being lazy tonight and don't want to figure this one out. I need a regex to match 'jeremy.miller' and 'scottgu' from the following inputs:
http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx
http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx
Ideas?
Edit
Chris Lutz did a great job of meeting the requirements above. What if these were the inputs so you couldn't use 'archive' in the regex?
http://codebetter.com/blogs/jeremy.miller/
http://weblogs.asp.net/scottgu/
Would this be what you're looking for?
'/([^/]+)/archive/'
Captures the piece before "archive" in both cases. Depending on regex flavor you'll need to escape the /s for it to work. As an alternative, if you don't want to match the archive part, you could use a lookahead, but I don't like lookaheads, and it's easier to match a lot and just capture the parts you need (in my opinion), so if you prefer to use a lookahead to verify that the next part is archive, you can write one yourself.
EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second cases, you can just pluck the appropriate part off the end, with the same / conditions as before:
'/([^/]+)/$'
If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:
'/(jeremy\.miller|scottgu)/'
As yet a third alternative, if you want the field after the domain name, unless that field is "blogs", it's going to get hairy, especially with the / caveat:
'http://[^/]+/(?:blogs/)?([^/]+)/'
This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it's just like regular parentheses, but won't capture the value, so the only value captured is the value you want. (?:) has a risk of varying depending on your particular regex flavor. I don't know what language you're asking for, but I predominantly use Perl, so this regex should pretty much do it if you're using PCRE. If you're using something different, look into non-capturing groups.
Wow. That's a lot of talking about regexes. I need to shut up and post already.
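The patterns from this answer can be exercised against all four URLs from the question. The sketch below uses Python's re module, where / needs no escaping; other flavors may require \/ as noted above. The names ARCHIVE_URLS, SHORT_URLS, BEFORE_ARCHIVE, LAST_SEGMENT, AFTER_DOMAIN and first_group are invented for the illustration.

```python
import re

ARCHIVE_URLS = [
    "http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx",
    "http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx",
]
SHORT_URLS = [
    "http://codebetter.com/blogs/jeremy.miller/",
    "http://weblogs.asp.net/scottgu/",
]

# The path segment immediately before "/archive/".
BEFORE_ARCHIVE = re.compile(r"/([^/]+)/archive/")
# The last path segment of a URL that ends with a slash.
LAST_SEGMENT = re.compile(r"/([^/]+)/$")
# The segment after the domain, skipping an optional "blogs/" prefix.
AFTER_DOMAIN = re.compile(r"http://[^/]+/(?:blogs/)?([^/]+)/")


def first_group(pattern, url):
    """Return capture group 1, or None when the pattern does not match."""
    m = pattern.search(url)
    return m.group(1) if m else None


for url in ARCHIVE_URLS + SHORT_URLS:
    print(first_group(BEFORE_ARCHIVE, url),
          first_group(LAST_SEGMENT, url),
          first_group(AFTER_DOMAIN, url))
```

Only the last pattern handles both the archive URLs and the shortened ones, which matches the reasoning given for it above.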
Try this one:
/\/([\w\.]+)\/archive/