AWS Athena regexp_extract() broken - regex

I am using AWS Athena to extract some statistics from CloudWatch logs. However, the Presto regexp_extract() function returns empty result sets, even though the regexp looks correct according to online regexp testers.
The source CloudWatch log sample is as follows:
2021-10-04 00:10:56.201 INFO 10711 --- [io-5000-exec-31] au.com.crecy.VP4CStatistics : {"atlassianLicense" : {"key" : "visio-publisher-for-confluence","version" : "1.1.5-AC","state" : "ENABLED","installedDate" : 1619695028000,"lastUpdated" : 1632692975000,"license" : {"active" : true,"type" : "COMMERCIAL","evaluation" : false,"supportEntitlementNumber" : "SEN-0123456789"},"valid" : true,"host" : {"product" : "Confluence","contacts" : [ ]},"links" : {"marketplace" : [{"href" : "https://marketplace.atlassian.com/plugins/visio-publisher-for-confluence"}],"self" : [{"href" : "https://acme.atlassian.net/wiki/rest/atlassian-connect/1/addons/visio-publisher-for-confluence"}]}},"viewAttachments" : [{"height" : "1000","width" : "100%","scrolling" : "no","frameBorder" : "hide","url" : "/download/attachments/574160906/Foo.html.zip?version=22&modificationDate=1632311039065&cacheVersion=1&api=v2","space" : "VM","page" : 574160906,"id" : "att568885320","frameBorderStyle" : "border:none;"}],"durations" : {"1" : {"method" : "ModelGenAtlassianConnectPlugin.loadHtmlAttachment","startTime" : 2542004145837271,"endTime" : 2542005346331840,"durationMillis" : 1200,"durationNanos" : 1200494569},"2" : {"method" : "AtlassianHostRestClientsHelper.getLicense","startTime" : 2542004145845740,"endTime" : 2542004523777555,"durationMillis" : 377,"durationNanos" : 377931815},"3" : {"method" : "AtlassianHostRestClientsHelper.processJwt","startTime" : 2542004523813282,"endTime" : 2542004525757229,"durationMillis" : 1,"durationNanos" : 1943947},"4" : {"method" : "AttachmentLoaderHelper.loadAttachment","startTime" : 2542004525774026,"endTime" : 2542005346321184,"durationMillis" : 820,"durationNanos" : 820547158},"5" : {"method" : "AtlassianHostRestClientsHelper.getAttachments","startTime" : 2542004525784513,"endTime" : 2542004796450920,"durationMillis" : 270,"durationNanos" : 270666407},"6" : {"method" : "AtlassianHostRestClientsHelper.getCompressedPageSource","startTime" : 2542004796503557,"endTime" : 
2542005341655641,"durationMillis" : 545,"durationNanos" : 545152084},"7" : {"method" : "ChecksumHelper.checksumValid","startTime" : 2542005341695482,"endTime" : 2542005341889382,"durationMillis" : 0,"durationNanos" : 193900},"8" : {"method" : "UnzipCompressionHelper.unzipCompressedPageSource","startTime" : 2542005341899585,"endTime" : 2542005346303984,"durationMillis" : 4,"durationNanos" : 4404399},"9" : {"method" : "VP4CResponseHeaderFilter.doFilter","startTime" : 2542008074147431,"endTime" : 2542008074514454,"durationMillis" : 0,"durationNanos" : 367023}},"uncompressedSize" : 520631,"compressedSize" : 48836}
The AWS Athena / Presto query is as follows:
select regexp_extract(message, '(au.com.crecy.VP4CStatistics : )({.*}$)', 2)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams"
where message LIKE '%visio-publisher-for-confluence%'
order by time desc
In short I want to extract the JSON payload at the end of the log message. The above query is generating empty result sets.
Thanks and regards,
Andrew

Note that {, }, and . are regex metacharacters, and probably need to be escaped via backslash.
SELECT REGEXP_EXTRACT(message, 'au\.com\.crecy\.VP4CStatistics : (\{.*\})$', 1)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams"
WHERE message LIKE '%visio-publisher-for-confluence%'
ORDER BY time DESC;

OK, figured it out: the regexp needs to allow for multiple spaces or tabs around the colon (even though the sample log appears to have a single space). The following pattern works:
select regexp_extract(message, '(au\.com\.crecy\.VP4CStatistics[ \t]*:[ \t]*)(\{.*\})', 2)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams"
where message LIKE '%visio-publisher-for-confluence%'
order by time desc
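The same pattern can be checked offline before running it in Athena. The following is a minimal Python sketch, not Athena code: the message is a shortened, made-up stand-in for the real log line, and Python's re module is not identical to the Java-style regex engine Presto uses, so treat it only as a sanity check of the escaping and the whitespace tolerance.

```python
import json
import re

# Shortened, hypothetical stand-in for the CloudWatch message in the question.
# Note the extra spaces before the colon, which the original pattern did not allow.
message = (
    '2021-10-04 00:10:56.201 INFO 10711 --- [io-5000-exec-31] '
    'au.com.crecy.VP4CStatistics   : '
    '{"uncompressedSize" : 520631,"compressedSize" : 48836}'
)

# Dots and braces are escaped; spaces or tabs are allowed around the colon.
PATTERN = re.compile(r'au\.com\.crecy\.VP4CStatistics[ \t]*:[ \t]*(\{.*\})')


def extract_payload(line):
    """Return the trailing JSON payload as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    return json.loads(match.group(1)) if match else None


payload = extract_payload(message)
print(payload)
```

Parsing the captured group with json.loads is a cheap way to confirm that the capture really is the complete JSON object and not a truncated fragment.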


Parsing String into Custom Object with Powershell and regex

I have a string that I am trying to parse into an array of PSCustomObject instances using a regex with named groups.
The string looks like this:
date=2021-09-13 time=20:05:25 devname="chwitrfg01" devid="FG10E0TB20903187" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" eventtime=1631556325 srcip=192.168.10.226 srcname="192.168.10.226" srcport=54809 srcintf="port8" srcintfrole="dmz" dstip=8.8.4.4 dstname="dns.google" dstport=53 dstintf="wan1" dstintfrole="lan" poluuid="01533038-da7b-51eb-b854-8fd38a0deba3" sessionid=1472996904 proto=17 action="accept" policyid=278 policytype="policy" service="DNS" dstcountry="United States" srccountry="Reserved" trandisp="snat" transip=194.56.218.226 transport=54809 duration=180 sentbyte=245 rcvdbyte=144 sentpkt=2 rcvdpkt=1 shapersentname="default_class" shaperdropsentbyte=0 shaperrcvdname="default_class" shaperdroprcvdbyte=0 appcat="unscanned" dstdevtype="Unknown" dstdevcategory="None" masterdstmac="00:00:0c:07:ac:8d" dstmac="00:00:0c:07:ac:8d" dstserver=1
I tried something like this, but I'm a total noob at regex and have no idea how to solve this. Is there an easy way to add each value to a property of the custom object?
$Pattern = @(
'(?<devname>\devname=w+)'
'(?<srcip>(srcip=?:[0-9]+\.){3}[0-9]+):(?<srcport>srcport=[0-9]+)'
'(?<dstip>(dstip=?:[0-9]+\.){3}[0-9]+):(?<dstport>dstport=[0-9]+)'
) -join '\s+'
$cmd |
ForEach-Object {
if ($_ -match $Pattern) {
$Matches.Remove(0)
[PsCustomObject]@{
srcip = $_.Groups['srcip'].Value
dstip = $_.Groups['dstip'].Value
dstport = $_.Groups['dstport'].Value
srcport = $_.Groups['srcport'].Value
fw = $_.Groups['devname'].Value
}
}
}| Select-Object -First 5
$cmd | Format-Table
The simplest way to do this that I know of is to use the ConvertFrom-StringData cmdlet. That cmdlet creates a hashtable of name/value pairs out of a set of name=value formatted things. What you would do is put each value on its own line to make a multi-line string, then create a new custom object, and use that hashtable to define the properties.
$cmd -replace ' (\w+=)',"`n`$1"|
%{new-object psobject -prop (ConvertFrom-StringData $_)}
Or the shorter version in v3+ (thanks to @mklement0):
$cmd -replace ' (\w+=)',"`n`$1"|
%{[pscustomobject] (ConvertFrom-StringData $_)}
When I ran that against the string you provided I got back:
sessionid : 1472996904
action : "accept"
rcvdbyte : 144
vd : "root"
logid : "0000000013"
policyid : 278
duration : 180
proto : 17
dstname : "dns.google"
srcintf : "port8"
eventtime : 1631556325
appcat : "unscanned"
srcip : 192.168.10.226
dstip : 8.8.4.4
trandisp : "snat"
srcname : "192.168.10.226"
srcport : 54809
devid : "FG10E0TB20903187"
dstdevcategory : "None"
level : "notice"
sentbyte : 245
shaperdroprcvdbyte : 0
sentpkt : 2
masterdstmac : "00:00:0c:07:ac:8d"
shaperrcvdname : "default_class"
poluuid : "01533038-da7b-51eb-b854-8fd38a0deba3"
type : "traffic"
srcintfrole : "dmz"
subtype : "forward"
policytype : "policy"
dstport : 53
transip : 194.56.218.226
shapersentname : "default_class"
dstdevtype : "Unknown"
dstserver : 1
dstcountry : "United States"
dstintf : "wan1"
service : "DNS"
srccountry : "Reserved"
shaperdropsentbyte : 0
dstintfrole : "lan"
transport : 54809
date : 2021-09-13
rcvdpkt : 1
dstmac : "00:00:0c:07:ac:8d"
devname : "chwitrfg01"
time : 20:05:25
You could probably strip quotes out of it if that is desired.
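For comparison, the same idea (split the line into name=value pairs, then build one record from them) can be sketched outside PowerShell. This is a minimal Python analogue, not a replacement for the cmdlet: the sample is a shortened version of the log line above, parse_kv is a hypothetical helper name, and every value is kept as a string. It also strips the surrounding quotes.

```python
import re

# A key, an equals sign, then either a double-quoted value (which may contain
# spaces, e.g. "United States") or a run of non-whitespace characters.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_kv(line):
    """Turn a key=value log line into a dict, dropping surrounding quotes."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


# Shortened version of the firewall log line from the question.
sample = (
    'date=2021-09-13 time=20:05:25 devname="chwitrfg01" '
    'srcip=192.168.10.226 srcport=54809 dstip=8.8.4.4 dstport=53 '
    'dstcountry="United States"'
)

record = parse_kv(sample)
for key in ('devname', 'srcip', 'srcport', 'dstip', 'dstport'):
    print(key, '=', record[key])
```

Matching quoted values explicitly avoids splitting on the space inside values such as "United States".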

fail2ban-regex doesn't match snort logfile in alert_json format

I am trying to match a fail2ban-regex against a snort3 log file in alert_json format.
Example alert_json output in the log file:
{ "timestamp" : "21/03/22-12:23:56.370262", "seconds" : 1616412236, "action" : "allow", "class" : "none", "b64_data" : "lVAAFpTzAXEAAAAAoAJyELUuAAACBAW0BAIICikv9agAAAAAAQMDBw==", "dir" : "C2S", "dst_addr" : "6.7.8.9", "dst_ap" : "6.7.8.9:0", "eth_dst" : "00:11:22:33:44:55", "eth_len" : 102, "eth_src" : "11:11:22:33:44:55", "eth_type" : "0x800", "gid" : 1, "icmp_code" : 3, "icmp_id" : 0, "icmp_seq" : 0, "icmp_type" : 3, "iface" : "eth0", "ip_id" : 5814, "ip_len" : 68, "msg" : "ICMP Traffic Detected", "mpls" : 0, "pkt_gen" : "raw", "pkt_len" : 88, "pkt_num" : 2270045, "priority" : 0, "proto" : "ICMP", "rev" : 0, "rule" : "1:10000001:0", "service" : "unknown", "sid" : 10000001, "src_addr" : "1.2.3.4", "src_ap" : "1.2.3.4:0", "tos" : 192, "ttl" : 64, "vlan" : 0 }
My fail2ban-regex, which doesn't match:
^\{.*\"src_addr\"\ :\ \"<HOST>\".*\}$
I tried this on regexr.com and it matches.
I have already found out that there may be a problem with the timestamp, but I haven't figured out what it is.
Can somebody help here?
Thanks
It probably depends on the fail2ban version. For example, recent fail2ban (>= 0.10.6/0.11.2) no longer requires a timestamp (it simulates "now"), so it shows me the IP and the current time (as of when I execute it):
$ fail2ban-regex -v /tmp/log '^\{.*\"src_addr\"\ :\ \"<HOST>\".*\}$'
...
Lines: 1 lines, 0 ignored, 1 matched, 0 missed
To specify your own datepattern, you have to set it in the filter (or supply it to fail2ban-regex with the -d parameter), so this will work:
# either for timestamp tag:
$ fail2ban-regex -v -d '^\{\s*"timestamp"\s*:\s*"%y/%m/%d-%H:%M:%S\.%f"' /tmp/log '\"src_addr\"\ :\ \"<HOST>\"'
# or for POSIX seconds (probably better because it needs no conversion):
$ fail2ban-regex -v -d '"seconds"\s*:\s*{EPOCH}\s*,\s*' /tmp/log '\"src_addr\"\ :\ \"<HOST>\"'
Note that in fail2ban configs you must escape every % as %% due to Python's ini-config substitution rules.
Also note that fail2ban cuts the part of the message that matches the date pattern out before it applies the prefregex or failregex.
Also note that your RE is somewhat vulnerable; see https://github.com/fail2ban/fail2ban/issues/2932#issuecomment-777320874 for a better example.
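To see what the two expressions pick out of the sample line, here is a minimal Python sketch. It is not how fail2ban works internally: the HOST group is a simplified IPv4-only stand-in for fail2ban's <HOST> tag, the EPOCH pattern only mimics the {EPOCH} idea from the command above, and the log line is shortened.

```python
import re
from datetime import datetime, timezone

# Shortened version of the alert_json line from the question.
line = (
    '{ "timestamp" : "21/03/22-12:23:56.370262", "seconds" : 1616412236, '
    '"action" : "allow", "src_addr" : "1.2.3.4", "src_ap" : "1.2.3.4:0", '
    '"ttl" : 64 }'
)

# Simplified IPv4-only stand-in for <HOST> (an assumption; the real tag is broader).
HOST = r'(?P<host>\d{1,3}(?:\.\d{1,3}){3})'
FAILREGEX = re.compile(r'"src_addr" : "' + HOST + r'"')

# Rough equivalent of the "seconds" date pattern: capture the POSIX timestamp.
EPOCH = re.compile(r'"seconds"\s*:\s*(\d+)\s*,')

host = FAILREGEX.search(line).group('host')
seconds = int(EPOCH.search(line).group(1))
when = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(host, when.isoformat())
```

Using the epoch field sidesteps any ambiguity in the two-digit year/month/day layout of the "timestamp" field.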

Regex to find string between patterns not containing specific string

OK gurus,
Let's say I have the following string:
{
"event" : "party" ,
"Id" : "store" ,
"timestamp" : "2019-07-07T13:14:26.329Z" ,
"localDateTime" : "2019-07-07T16:14" ,
"orderStateUpdate" : {
"id" : "fj09bA9ywfGS" ,
"orderId" : "2315043" ,
"visitId" : "2315043" ,
"items" :{{
"id" : "fj09bA6K3K8u" ,
"quantity" : 1 ,
"stat" : "ok"
},
{
"id" : "fj09bA6K3K8u2" ,
"quantity" : 2 ,
"stat" : "ok"
}}
,
"items" :{{
"id" : "fj09bA6K3K8u" ,
"quantity" : 1 ,
"stat" : "junk"
},
{
"id" : "fj09bA6K3K8u2" ,
"quantity" : 2 ,
"stat" : "ok"
}}
,
"extraParams" : {"extraparamstuff1":"bugger"},"somethingelse" :"blahblahblah"
}}
The string has two (nested arrays) wrapped by double curly braces. This string specifically contains an error where the LAST curly brace is ALSO double; what I am trying to capture with regex is the string that starts with '}}' , ends with '}}' and DOES NOT CONTAIN '{{' like so:
}}
,
"extraParams" : {"extraparamstuff1":"bugger"},"conversationLink" :"https://qa.app.package.ai/qa/#/app/dashboard?d=1561248000000&c=fdxkID9IifGv&p=fdxfaFgV1l1Y"
}}
I am Regex-challenged, but have come up with this:
(?:(\}\})).*(?:\{\{).*(?:\}\s*?\})
which captures
}}
,
"items" :{{
"id" : "fj09bA6K3K8u" ,
"quantity" : 1 ,
"itemState" : "LOADED"
},
{
"id" : "fj09bA6K3K8u2" ,
"quantity" : 2 ,
"itemState" : "LOADED2"
}}
,
"extraParams" : {"extraparamstuff1":"bugger"},"conversationLink" :"https://qa.app.package.ai/qa/#/app/dashboard?d=1561248000000&c=fdxkID9IifGv&p=fdxfaFgV1l1Y"
}}
which is too much. Can someone help me understand how to find this? This is for error-checking inbound data (and yes I need to check for extra opening '{{' as well).
Okay, so, I think you need a negative lookahead since you have to accept curly braces, but not doubles... this is what I've come up with, not sure if it will work in every case though.
}}([^{]|{(?!{))+}}
It basically says: look for two closing curlies (}}), then either any non-opening curly character ([^{]) OR a single opening curly character (using negative lookahead) ({(?!{)), repeat that as many times as needed (+), and finish with a double closing curly (}})
Link to live (updateable) demo: https://regex101.com/r/kwlzco/2
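The negative-lookahead pattern can also be tried locally. Below is a minimal Python sketch with the braces escaped explicitly; the text is a shortened, made-up variant of the string in the question, so it only illustrates the technique and does not cover every malformed input.

```python
import re

# "}}", then any run of characters in which no "{" is immediately followed by
# another "{", then a closing "}}".
PATTERN = re.compile(r'\}\}(?:[^{]|\{(?!\{))+\}\}')

# Shortened, hypothetical variant of the input from the question.
text = (
    '{"items" :{{"id" : "a"},{"id" : "b"}}\n'
    ',\n'
    '"items" :{{"id" : "c"}}\n'
    ',\n'
    '"extraParams" : {"x":"y"},"somethingelse" :"blah"\n'
    '}}'
)

match = PATTERN.search(text)
print(match.group(0))
```

The attempt starting at the first "}}" fails because it runs into the next "{{" before finding a closing "}}", so the search moves on and only the final stretch (the part containing "extraParams") is returned.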

how to add special characters in mongo $regex

I want to look for "\r" in a string field I have in Mongo, and I found this, which looks like it works well:
db.users.findOne({"username" : {$regex : ".*son.*"}});
The problem is that I want to look for "\r", which I know is there, but I can't find it. I just did:
db.users.findOne({"username" : {$regex : ".*\r.*"}});
and it doesn't work. How can I fix this?
Example document:
{
"personId" : 1,
"personName" : "john",
"address" : {
"city" : "Rue Neuve 2\\r\\rue Pré-du-Mar \\r ché 1 1003 Lausanne",
"street" : "",
"zipCode" : "",
"streetNumber" : ""
}
}
So my query is:
db.users.findOne({"address.city" : {$regex : ".*\r.*"}});
I also tried:
db.users.findOne({"address.city" : {$regex : ".*\\r.*"}});
Try:
db.users.findOne({"username" : {$regex : ".*\\r.*"}});
I think your issue is that you have your .* backwards at the end. You are looking for a "2." literal followed by any characters as opposed to what you have at the beginning, .*, saying anything before the literal that isn't a carriage return. Try to change this to
db.users.findOne({"username" : {$regex : ".*\\r*."}});
Which says give me "\r" with any non carriage return characters before the literal and any non carriage return characters after the literal.
I found that the way to do it is:
db.users.findOne({"username" : {$regex : ".*\\\\.*"}});
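Much of the confusion here comes from two different things both being written as "\r": a real carriage-return character, and the two literal characters backslash and r (which is what the example document appears to store). The following Python sketch illustrates the difference with Python's re module; it is only an analogy, since MongoDB uses its own regex engine and the shell adds a layer of JavaScript string escaping, and the two sample strings are made up.

```python
import re

# A string containing a real carriage-return character (one character).
with_cr = 'Rue Neuve 2\rrue'

# A string containing a literal backslash followed by the letter r (two characters).
with_backslash_r = 'Rue Neuve 2\\rrue'

# The regex \r matches a carriage return only.
cr_hits = [bool(re.search(r'\r', s)) for s in (with_cr, with_backslash_r)]

# The regex \\r matches a literal backslash followed by "r".
literal_hits = [bool(re.search(r'\\r', s)) for s in (with_cr, with_backslash_r)]

print(cr_hits)
print(literal_hits)
```

Which pattern is needed therefore depends on what is actually stored in the field, and each layer of string quoting in front of the regex engine doubles the backslashes again.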

Vi regular expression

Looking to perform a find and replace on the following string:
"_id" : { "$oid" : "52853800bb1177ca391c17ff" }, "Ticker" : "A", "Profit Margin" : 0.137, "Institutional Ownership" : 0.847, "EPS growth past 5 years" : 0.158, "Total Debt/Equity" : 0.5600000000000001, "CurrentRatio" : 3, "Return on Assets" : 0.089, "Sector" : "Healthcare", "P/S" : 2.54, "Change from Open" : -0.0148, "Performance (YTD)" : 0.2605, "Performance (Week)" : 0.0031, "Quick Ratio" : 2.3, "Insider Transactions" : -0.1352, "P/B" : 3.63, "EPS growth quarter over quarter" : -0.29, "Payout Ratio" : 0.162, "Performance (Quarter)" : 0.09279999999999999, "Forward P/E" : 16.11, "P/E" : 19.1, "200-Day Simple Moving Average" : 0.1062, "Shares Outstanding" : 339, "Earnings Date" : { "$date" : 1384464600000 }, "52-Week High" : -0.0544, "P/Cash" : 7.45, "Change" : -0.0148, "Analyst Recom" : 1.6, "Volatility (Week)" : 0.0177, "Country" : "USA", "Return on Equity" : 0.182, "50-Day Low" : 0.0728, "Price" : 50.44, "50-Day High" : -0.0544, "Return on Investment" : 0.163, "Shares Float" : 330.21, "Dividend Yield" : 0.0094, "EPS growth test years" : 0.13 }
Specifically, I want to find all characters in quotations and remove any whitespaces found. i.e. "Profit Margin" becomes "ProfitMargin", "Institutional Ownership" becomes "InstitutionalOwnership" etc. I'd like to do this in Vi.
Thanks for the help in advance!
A possible answer:
:%s/\("[^"]*"\)/\=substitute(submatch(1), " ", "", "g")/g
And the way I got it:
Search what we want to replace => /".*" (quote symbol + n times whatever + quote symbol)
Do it properly => /"[^"]*" (quote symbol + n times whatever is not a quote symbol + quote symbol)
Transform that into a substitution that does nothing => :%s/\("[^"]*"\)/\1/g
Check :help :%s, from there :help sub-replace-special.
Use the magic \= learned before, still doing nothing => :%s/\("[^"]*"\)/\=submatch(1)/g
Replace \=submatch(1) by something useful => :%s/\("[^"]*"\)/\=substitute(submatch(1), " ", "", "g")/g (:help substitute).
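The same technique (match each quoted run, then compute the replacement from the match) can be sketched outside Vim as well. This is a minimal Python analogue of the substitution above, using a shortened sample of the input; the callback plays the role of \=substitute(submatch(1), " ", "", "g").

```python
import re

# Shortened sample of the line from the question.
text = '"Profit Margin" : 0.137, "Institutional Ownership" : 0.847, "Ticker" : "A"'

# Match each double-quoted run and remove the spaces inside it only.
result = re.sub(r'"[^"]*"', lambda m: m.group(0).replace(' ', ''), text)

print(result)
```

Spaces outside the quotes are untouched because only the matched text is passed to the callback.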