RegEx SQL, issue escaping quotes - regex

I am trying to use PSQL, specifically AWS Redshift, to parse a line. Sample data follows:
{"c.1.mcc":"250","appId":"sx-calllog","b.level":59,"c.1.mnc":"01"}
{"appId":"sx-voice-call","b.level":76,"foreground":9}
I am trying the following regex in order to extract the appId field, but my query is returning empty fields.
'appId\":\"[\w*]\",'
Query
SELECT app_params,
regexp_substr(app_params, 'appId\":\"[\w*]\",')
FROM sample;

You can do that as follows:
(\"appId\":\"[^"]*\")(?:,)
Demo: http://regex101.com/r/xP0hW3
The first extracted group is what you want.
Your regex was not matching because \w does not match the - character, and because [\w*] is a character class that matches only a single character.
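A small Python sketch of why the original pattern finds nothing. Python's re module is not Redshift's POSIX engine, so this only illustrates the character-class behaviour; the variable names and the simplified pattern (without the non-capturing group) are assumptions made for the illustration.

```python
import re

line = '{"c.1.mcc":"250","appId":"sx-calllog","b.level":59,"c.1.mnc":"01"}'

# Original attempt: [\w*] is a character class that matches exactly ONE
# character (a word character or a literal *), so it cannot span the
# whole value, and \w would reject the hyphen anyway.
original = re.search(r'appId":"[\w*]",', line)

# Negated class: any run of characters up to the closing quote,
# which also covers the hyphen.
fixed = re.search(r'"appId":"[^"]*"', line)

print(original)        # None
print(fixed.group(0))  # "appId":"sx-calllog"
```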

Adding this here despite this being an old question since it may help someone viewing this down the road...
If your lines of data are valid JSON, you can use Redshift's JSON_EXTRACT_PATH_TEXT function to extract the value of a given key. Emphasis on the JSON being valid: the query will fail if even one line cannot be parsed, and Redshift will throw a JSON parsing error.
Example using given data:
select json_extract_path_text('{"c.1.mcc":"250","appId":"sx-calllog","b.level":59,"c.1.mnc":"01"}','appId');
returns sx-calllog
This is especially useful since Redshift uses POSIX regex, which does not support lookahead/lookbehind, and it cannot extract capture groups.

You can try using lookahead and lookbehind assertions to isolate just the text inside the quotes for the appId: (?<=appId\":\")(?=.*\",)[^\"]*. I tested this out a bit using the examples you provided.
To explain the regex a bit more: (?<=appId\":\")(?=.*\",)[^\"]*
(?<=appId\":\"): positive lookbehind for appId":". Since you don't want the appId text itself to be returned (just the value), you can preface the regex with a lookbehind to say "find me the following regex, but only when it follows the lookbehind text".
(?=.*\",): positive look ahead for the ending ",. You don't want quotes to be returned in your match, but as with number 1 you want your regex to be bounded a bit and a look ahead does that.
[^\"]*: The actual matching portion. You want to find the string of chars that are NOT ". This will match the entire value and stop matching right before the closing ".
EDIT: Changed the 3rd step a little bit, removed the , from that last piece, it is not needed and would break the match if the value were to actually contain a ,.
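As a rough check of the three pieces described above, here is a Python sketch run against the two sample lines from the question. It assumes an engine that supports lookarounds (Python's re here, not Redshift); the variable names are made up for the example.

```python
import re

lines = [
    '{"c.1.mcc":"250","appId":"sx-calllog","b.level":59,"c.1.mnc":"01"}',
    '{"appId":"sx-voice-call","b.level":76,"foreground":9}',
]

# (?<=appId":")  lookbehind: the match must be preceded by appId":"
# (?=.*",)       lookahead: a closing ", must appear later in the line
# [^"]*          the value itself, stopping before the closing quote
pattern = re.compile(r'(?<=appId":")(?=.*",)[^"]*')

values = [pattern.search(line).group(0) for line in lines]
print(values)  # ['sx-calllog', 'sx-voice-call']
```

Neither the key nor the surrounding quotes end up in the match, because lookarounds only assert context and consume no characters.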

Related

Regex, Grafana Loki, Promtail: Parsing a timestamp from logs using regex

I want to parse a timestamp from logs to be used by loki as the timestamp.
I'm a total noob when it comes to regex.
The log file is from "endlessh" which is essentially a tarpit/honeypit for ssh attackers.
It looks like this:
2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26
2022-04-03 14:38:07.723962122 2022-04-03T12:38:07.723Z ACCEPT host=::ffff:218.92.0.192 port=64475 fd=4 n=1/4096
What I want to match, using regex, is the second timestamp present there, since it's a UTC timestamp and should be parseable by promtail.
I've tried different approaches, but just couldn't get it right at all.
So first of all I need a regex that matches the timestamp I want.
But secondly, I somehow need to turn it into a regex that exposes the value in some way?
The docs offer this example:
.*level=(?P<level>[a-zA-Z]+).*ts=(?P<timestamp>[T\d-:.Z]*).*component=(?P<component>[a-zA-Z]+)
Afaik, those are named groups, and that is all that it takes to expose the value for me to use it in the config?
Would be nice if someone can provide a solution for the regex, and an explanation of what it does :)
You could for example create a specific pattern to match the first part, and capture the second part:
^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b
Or use a very broad pattern if the format is always the same: repeat an exact number of non-whitespace parts, then capture the part that you want to keep.
^(?:\S+\s+){2}(?P<timestamp>\S+)
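To illustrate how a named group exposes the value, here is a minimal Python sketch using the broad pattern on one of the sample log lines. Python's re uses the same (?P<name>...) syntax as the docs example quoted in the question; the constant name and the strptime format string are assumptions for this illustration, and no promtail configuration is implied.

```python
import re
from datetime import datetime, timezone

LOG_LINE = (
    "2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z "
    "CLOSE host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26"
)

# Skip the two leading whitespace-separated fields (local date and time),
# then capture the third field under the name "timestamp".
pattern = re.compile(r"^(?:\S+\s+){2}(?P<timestamp>\S+)")

match = pattern.match(LOG_LINE)
raw = match.group("timestamp")

# Parse the captured text to confirm it is a well-formed UTC timestamp.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

print(raw)  # 2022-04-03T12:37:25.101Z
print(parsed.isoformat())
```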

Regex needed to search for a numeric id within a tag

My very basic regex skills are not allowing me to successfully extract an id number within a tag.
I think it would be fairly straightforward. I would like to extract the id from the following extract.
<id>53222132</id>
The id number is not a specific length but I just need to be able to find the id number which is numeric only.
More specifically this is the only instance of the tag id so it's unique and should be used within the regex.
Finally, is there a way that this can be saved within a variable?
I am using the regex as part of a Splunk query, where I will use the variable to make it distinct.
I have got as far as the following which captures everything including the tag.
<\s*id[^>]*>(.*?)<\s*\/\s*id>
Thanks in advance
(?<=<id>)\d+(?=<\/id>)
This would be my first thought. This will use a positive look ahead and positive look behind and it will only match a string of digit characters in the middle. Another alternative is:
\d+(?=<\/id>)
This uses only the lookahead, since lookbehind is not supported by every regex flavor. One other option:
\d+(?=\s*<\s*\/\s*id\s*>)
This will ignore any spaces that might be present in that ending tag, and still find the id regardless. One of these should work for your scenario.
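A short Python sketch comparing the lookaround approach with a plain capture group on the sample tag. This only demonstrates the patterns themselves; how the value is stored in a Splunk variable is not shown here, and the variable names are illustrative.

```python
import re

text = "<id>53222132</id>"

# Lookaround version: the match is only the digits.
lookaround = re.search(r"(?<=<id>)\d+(?=</id>)", text)

# Capture-group version: the match includes the tags, group 1 is the digits.
# It reuses the whitespace-tolerant tag pattern from the question, with
# (.*?) narrowed to (\d+) because the id is numeric only.
captured = re.search(r"<\s*id[^>]*>(\d+)<\s*/\s*id\s*>", text)

id_from_lookaround = lookaround.group(0)
id_from_group = captured.group(1)
print(id_from_lookaround, id_from_group)  # 53222132 53222132
```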

Extract only the text field needed

I am at the beginning of learning Regex, and I use every opportunity to understand how it's working. Currently I am trying to extract dates from a text file (which is in fact a vnt-file type from my mobile phone). It looks like the following:
BEGIN:VNOTE
VERSION:1.1
BODY;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:18.07.=0A14.08.=0A15.09.=0A15.10.=
=0A13.11.=0A13.12.=0A12.01.=0A03.02. Grippe=0A06.03.=0A04.04.2015=0A0=
5.05.2015=0A03.06.2015=0A03.07.2015=0A02.08.2015=0A30.08.2015=0A28.09=
17.11.2017=0A
DCREATED:20171118T095601
X-IRMC-LUID:150
END:VNOTE
I want to extract all dates, so that the final list is like that:
18.07.
14.08.
15.09.
15.10.
and so on. If the date has also a year, it should also be displayed.
I almost found out how to detect the dates by the following regex:
.+(\d\d\.\d\d\.(2015|2016|2017)?).+
But it only detects very few of the dates. The result is this:
BEGIN:VNOTE
VERSION:1.1
15.10.
04.04.2015
30.08.2015
24.01.2016
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
Then I tried to add a question mark which makes the .+ not greedy, as far as I read in tutorials. Then the regex looks like:
.+?(\d\d\.\d\d\.(2015|2016|2017)?).+?
But the result is still not what I am looking for:
BEGIN:VNOTE
VERSION:1.1
21.03.20.04.18.05.18.06.18.07.14.08.15.09.15.10.
13.11.13.12.12.01.03.02.06.03.04.04.20150A0=
03.06.201503.07.201502.08.201530.08.20150A28.09=
28.10.201525.11.201528.12.201524.01.20160A
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
For someone who is familiar with regex I am pretty sure this is very easy to solve, but I don't get it. It's very confusing when you are new to regex. I tried to find a hint in some tutorials or stackoverflow posts, but all I found is this: Notepad++ how to extract only the text field which is needed?
But it doesn't work for me. I assume it might have something to do with the fact that my text file is not one single line.
I have my example on regex101 too.
I would be very thankful if maybe someone can give me a hint what else I can try.
Edit: I would like to detect the dates with the regex and as a result have a list with only the dates (maybe it is called substitute?)
Edit 2: Sorry for not mentioning it earlier: I just want to use the regex in e.g. Notepad++ or an online regex test website. Just to get the result of the dates and save the result in a new txt-file. I don't want to use the regex in a programming language. My apologies for not being precise before.
Edit 3: The result should be a list with the dates, and each date in a new line:
I want to extract all dates, so that the final list is like that:
18.07.
14.08.
15.09.
15.10.
I suggest this pattern:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)
This makes use of the \G anchor, which matches at the position where the previous match ended. In this case it allows consecutive matches to start exactly where the last one stopped, leaving no unmatched characters in between, which makes it possible to remove everything but what's wanted.
If you want to remove the extra matches as well, add |.* at the end:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)|.*
In N++, make sure the options underlined are selected, and that the cursor is at the beginning. In the picture below, I replaced then undid the replacement, only to show that matches were identified (16 replacements).
You can try using the following pattern:
\d{2}\.\d{2}\.(?:\d{4})?
This will match day.month dates of the form 18.07., but it also allows such a date to be followed by a four-digit year, e.g. 18.07.2017. While it would be nice to make the pattern more restrictive to avoid false positives, I do not see anything obvious that could be added to the above pattern.
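For readers who do end up scripting this, here is a hedged Python sketch of the same pattern producing one date per line. The sample string is a shortened excerpt in the style of the vNote body, and joining lines that end in = is an assumption based on quoted-printable soft line breaks, not something the answers above state.

```python
import re

# Shortened sample in the same quoted-printable style as the vNote body.
body = "18.07.=0A14.08.=0A03.02. Grippe=0A04.04.2015=0A0=\n5.05.2015=0A"

# Assumption: "=" at the end of a line is a quoted-printable soft line
# break, so the two halves are joined before matching.
joined = body.replace("=\n", "")

# Day.month. with an optional four-digit year.
dates = re.findall(r"\d{2}\.\d{2}\.(?:\d{4})?", joined)
print("\n".join(dates))
```

Without the joining step, a date split across two lines (such as 05.05.2015 in the excerpt) would not be found.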

Reg Ex Facebook

I am trying to extract some information from facebook using Regex. Here is a link with an example:
https://graph.facebook.com/210989592315921
I was interested in what would the regular expression be in order to extract just the number of likes from this string.
I have tried for example this expression:
"likes":\s[0-9]$
Thank you in advance for any advice regarding this matter,
Mark
You should follow the "#Hope I helped" comment and use a JSON parser. You can't be sure the text will always be formatted the same way:
Are you always going to have a single space between : and the number?
By the way, here is the error you are looking for: your current regex matches a single digit, not a multi-digit number. You should use something like [0-9]+ and probably remove the $, which is not correct in your example because a comma follows the number.
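A minimal Python sketch of both routes. The payload below is hypothetical (the real response behind the URL is not shown in the question), so the field values and variable names are placeholders.

```python
import json
import re

# Hypothetical payload: only the "likes" key is taken from the question.
payload = '{"id": "210989592315921", "name": "Example Page", "likes": 1234, "category": "Example"}'

# Preferred: parse the JSON and read the key, independent of formatting.
likes_from_json = json.loads(payload)["likes"]

# Regex fallback: \s* tolerates any amount of whitespace after the colon,
# [0-9]+ accepts multi-digit numbers, and there is no $ because a comma
# (or more text) may follow the number.
match = re.search(r'"likes":\s*([0-9]+)', payload)
likes_from_regex = int(match.group(1))

print(likes_from_json, likes_from_regex)  # 1234 1234
```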

Extract text between two given strings

Hopefully someone can help me out. Been all over google now.
I'm doing some zone-ocr of documents, and want to extract some text with regex. It is always like this:
"Til: Name Name Name org.nr 12323123".
I want to extract the name part; it can be 1-4 names, but "Til:" and "org.nr" are always before and after.
Anyone?
If you can't use capturing groups (check your documentation) you can try this:
(?<=Til:).*?(?=org\.nr)
This solution uses lookbehind and lookahead assertions, but those are not supported by every regex flavour. If they work, this regex will return only the part you want: the parts inside the assertions are not included in the match, the regex only checks that they are present.
Use the pattern:
Til:(.*)org\.nr
Then take the first capturing group (group 1; group 0 is the whole match) to get the content between the parentheses.
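Both suggestions can be compared with a short Python sketch on the sample string from the question. In Python, group(0) is the whole match and group(1) is the content of the parentheses; the strip() call is an addition here to drop the spaces around the names.

```python
import re

text = "Til: Name Name Name org.nr 12323123"

# Capture-group version: the markers are matched, the names are group 1.
group_match = re.search(r"Til:(.*)org\.nr", text)
name_from_group = group_match.group(1).strip()

# Lookaround version: the markers are only asserted, so the match itself
# is the text between them.
lookaround_match = re.search(r"(?<=Til:).*?(?=org\.nr)", text)
name_from_lookaround = lookaround_match.group(0).strip()

print(name_from_group)       # Name Name Name
print(name_from_lookaround)  # Name Name Name
```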