Issue with a expression capturing groups

I have some data like this
Wed Mar 18 15:16:10 2015 eth0:1 109.224.232.219 up (not currently mapped)
Wed Mar 18 15:18:12 2015 eth0:1 109.224.232.219 down (not responding)
Wed Mar 18 15:20:46 2015 eth0:1 109.224.232.219 up (not currently mapped)
Wed Mar 18 15:22:52 2015 eth0:1 109.224.232.219 down (not responding)
Wed Mar 18 15:24:26 2015 eth0:1 109.224.232.219 up (not currently mapped)
I am trying to capture the IP and the date string on each line. I thought I could just match anything before the word eth and then apply my IP check, but this isn't working. Have I misunderstood the concept of capture groups?
Is there a sensible way to get this data from 1 regex?
(^(.*?)eth)(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
Any help would be appreciated.
This is an image of the regex currently
https://www.debuggex.com/i/BaXnqh2DzRhUCph8.png

You're almost there. You just need to add .*? after eth so that it would match the characters present in-between eth and the ip-address.
^(.*?)eth.*?\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
If you don't want the space before eth to be captured by group 1, you could change your regex like this:
^(.*?)\s+eth.*?\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
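For anyone who wants to try this outside a regex tester, here is a minimal Python sketch of the second pattern applied to two of the sample lines from the question. The names LINE_RE, sample_lines and parse_line are only illustrative, and Python's re module is assumed to behave like the original flavour for these constructs.

```python
import re

# Group 1: everything before the whitespace that precedes "eth".
# Group 2: the first dotted quad that follows.
LINE_RE = re.compile(r"^(.*?)\s+eth.*?\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")

sample_lines = [
    "Wed Mar 18 15:16:10 2015 eth0:1 109.224.232.219 up (not currently mapped)",
    "Wed Mar 18 15:18:12 2015 eth0:1 109.224.232.219 down (not responding)",
]


def parse_line(line):
    """Return (date_string, ip) for a matching line, or None."""
    m = LINE_RE.search(line)
    return m.groups() if m else None


results = [parse_line(line) for line in sample_lines]
for date_string, ip in results:
    print(date_string, "|", ip)
```

Lines without "eth" followed by a dotted quad simply return None, so the function can be used as a filter as well.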

Sometimes, people ignore what a well-defined sequence of characters a dotted-decimal IP representation is. I have almost no problems identifying an IP when I fully detail a proper IP octet.
my $octet = qr/\b(?:0|1\d{0,2}|2(?:[0-4]\d?|5[0-5]?|[6-9])?|[3-9]\d?)\b/;
( my $foctet = "$octet" ) =~ s/0[|]//;
And then on top of that, I specify that an IP address is a set of four octets, separated by dots.
my $ip_regex = qr/($foctet(\.$octet){3})/;
This little beauty will almost always pull for me anything that is a valid IP from any file.
Along with this, dates can be specified more precisely. And again, if you follow this specification, what you get will almost inevitably be a date:
my $dow = qr/\b(?:Fri|Mon|Sat|Sun|Thu|Tue|Wed)\b/;
my $mon = qr/\b(?:Apr|Aug|Dec|Feb|Jan|Jul|Jun|Mar|May|Nov|Oct|Sep)\b/;
my $day = qr/\b(?:[012]\d?|3[01]?|[4-9])\b/;
my $hr24 = qr/\b(?:[01]\d?|2[0-3])\b/;
my $minsec = qr/\b(?:[0-5]\d)\b/;
my $datetime_regex = qr/$dow\s+$mon\s+$day\s+$hr24:$minsec:$minsec\s+\d+/;
So simply using both regexes against the source line, you get what you want without a whole lot of backtracking needed.
my @date_parts = $line =~ /($datetime_regex)/;  # capture the whole date string
my ( $ip ) = $line =~ /$ip_regex/;
In fact, if performance is a concern: I saw numerous failed match attempts in the single regex with the non-greedy match, whereas the IP regex succeeds on the first try. The regex engine finds '.' at offset 35 and starts back at position 32.
However, the following does not fail once for either part. It is just an indication of how it can help to narrow your expressions to the expected range of data:
my ( $dt, $ip ) = m/($datetime_regex)\s+eth\d:\d+\s+($ip_regex)/;
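If Perl is not your language, the same idea can be sketched in Python. This is not a line-by-line port: the octet pattern below is a simplified one of my own (it accepts 0 to 255 without leading zeros, but does not exclude a first octet of 0 the way $foctet does), and all the names are illustrative.

```python
import re

# One octet, 0-255, no leading zeros (simplified, see the note above).
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)"
IP = rf"\b{OCTET}(?:\.{OCTET}){{3}}\b"

DOW = r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)"
MON = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
DAY = r"(?:[12]\d|3[01]|0?[1-9])"
HOUR = r"(?:[01]\d|2[0-3])"
MINSEC = r"[0-5]\d"
DATETIME = rf"\b{DOW}\s+{MON}\s+{DAY}\s+{HOUR}:{MINSEC}:{MINSEC}\s+\d{{4}}\b"

IP_RE = re.compile(IP)
# Both pieces are tightly specified, so the combined pattern needs no lazy ".*?".
LINE_RE = re.compile(rf"({DATETIME})\s+eth\d+:\d+\s+({IP})")


def parse(line):
    """Return (datetime_string, ip) or None if the line does not fit."""
    m = LINE_RE.search(line)
    return m.groups() if m else None


sample = "Wed Mar 18 15:16:10 2015 eth0:1 109.224.232.219 up (not currently mapped)"
print(parse(sample))
```

Because each octet is range-checked, strings such as 999.1.1.1 or 256.1.1.1 are rejected instead of being picked up as addresses.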

Related

How to match, but not capture, part of a regex with Powershell

I have a line in a text doc that I'm trying to pull data from.
Example
Line I'm searching for in the text file:
Valid from: Sun May 17 19:00:00 CDT 1998
I want to find the key words "Valid from:" and then get only Sun May 17 and 1998
So end result should look like this:
Sun May 17 1998
I think I'm close to getting it right. This is what I have. It finds the keyword "Valid from:" but it returns more than I need:
Sun May 17 19:00:00 CDT 1998
(?<=Valid from:)\s+\w+\s+\w+\s+\d+\s+\d+:\d+:\d+\s+\w+\s+\d+
Thank you in advance for any assistance.

I would use two capture groups (…) instead of a lookaround construct:
$sampleText = 'Valid from: Sun May 17 19:00:00 CDT 1998'
$regEx = 'Valid from:\s+(\w+\s+\w+\s+\d+)\s+\d+:\d+:\d+\s+\w+\s+(\d+)'
if( $sampleText -match $regEx ) {
    # Combine the matched values of both capture groups into a single string
    $matches[1,2] -join ' '
}
Output:
Sun May 17 1998
If the -match operator successfully matches the pattern on the right-hand-side with the input text on the left-hand-side, the automatic variable $matches is set.
$matches contains the full match at index 0 and the matched values of any capture groups at subsequent indices, in this case 1 and 2.
Using the -join operator we combine the matched values of the capture groups into a single string.
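The "match everything, capture only the parts you want" technique is not specific to PowerShell. As a rough cross-check, the same pattern can be run in Python; VALID_FROM_RE and short_date are names I made up for this sketch.

```python
import re

sample_text = "Valid from: Sun May 17 19:00:00 CDT 1998"

# Same pattern as above: the time and time zone are matched but not captured.
VALID_FROM_RE = re.compile(
    r"Valid from:\s+(\w+\s+\w+\s+\d+)\s+\d+:\d+:\d+\s+\w+\s+(\d+)"
)

m = VALID_FROM_RE.search(sample_text)
# Join the two captured pieces, mirroring $matches[1,2] -join ' '.
short_date = " ".join(m.groups()) if m else None
print(short_date)
```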

Graylog regex extract first valid Mac Address in message

I am trying to extract the first valid common MAC address out of several different message entries in Graylog. I can do it with different Grok extractors, but I want to do it with regex so I can convert the MAC to all lower case. Below are some sample messages and the Grok patterns that work.
Question: how would I convert these Grok extractors to regex, and/or is there a single regex that would work in all 4 examples? Basically the regex would just need to match the first valid MAC address in each string and extract it.
Sample 1:
Equinox: *spamApTask1: Mar 20 15:26:04.033: #CAPWAP-3-ECHO_ERR: capwap_ac_sm.c:7019 Did not receive heartbeat reply; AP: 00:3a:9a:48:9b:40
Sample 2:
Equinox: *spamReceiveTask: Mar 17 12:34:39.264: #CAPWAP-3-DTLS_CONN_ERR: capwap_ac.c:934 00:3a:9a:30:f5:90: DTLS connection not found forAP 192.168.99.74 (43456), Controller: 192.168.99.2 (5246) send packet
Sample 3:
Equinox: *spamApTask1: Mar 22 08:35:14.562: #LWAPP-4-SIG_INFO1: spam_lrad.c:44474 Signature information; AP 00:14:1b:61:f8:40, alarm ON, standard sig NULL probe resp 1, track per-Macprecedence 2, hits 1, slot 0, channel 1, most offending MAC 00:00:00:00:00:00 #yes but must make Mac lowercase
Sample 4:
Equinox: *idsTrackEventTask: Mar 22 08:40:13.816: #WPS-4-SIG_ALARM_OFF: sig_event.c:656 AP 00:14:1B:61:F8:40 : Alarm OFF, standard sig NULL probe resp 1, track=per-Mac preced=2 hits=1 slot=0 channel=1 yes but must make Mac lowercase
Sample1 Grok pattern:%{GREEDYDATA}AP: %{COMMONMAC:WLC_APBaseMac}
Sample2 Grok pattern:%{GREEDYDATA}capwap_ac.c:934 %{COMMONMAC:WLC_APBaseMac}
Sample3 Grok pattern:%{GREEDYDATA}AP %{COMMONMAC:WLC_APBaseMac}
Sample4 Grok pattern:%{GREEDYDATA}AP %{COMMONMAC:WLC_APBaseMac}

You can make a pattern that matches 5 groups of 2 hex digits, each followed by a colon, followed by a final sixth group of 2 hex digits:
(?i)(?:[0-9a-f]{2}:){5}[0-9a-f]{2}
The (?i) at the start makes the search case-insensitive.
UPDATED
If the above regex does not work in Graylog then you can try the very basic form of it, where all the quantifiers and character sets are expanded:
[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F]
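To check the "first MAC only, lowercased" behaviour outside Graylog, here is a small Python sketch. It only illustrates the regex; it says nothing about how Graylog extractors or converters are configured. The function name first_mac is hypothetical and the messages are shortened versions of samples 3 and 4.

```python
import re

# Five "hh:" groups followed by a final "hh"; IGNORECASE plays the role of (?i).
MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)


def first_mac(message):
    """Return the first MAC address in the message, lowercased, or None."""
    m = MAC_RE.search(message)
    return m.group(0).lower() if m else None


sample3 = ("spam_lrad.c:44474 Signature information; AP 00:14:1b:61:f8:40, "
           "alarm ON, most offending MAC 00:00:00:00:00:00")
sample4 = ("Mar 22 08:40:13.816: #WPS-4-SIG_ALARM_OFF: sig_event.c:656 "
           "AP 00:14:1B:61:F8:40 : Alarm OFF")

print(first_mac(sample3))
print(first_mac(sample4))
```

A search (rather than a find-all) stops at the leftmost match, which is why the second address in sample 3 is ignored. Timestamps such as 08:40:13.816 do not match because they have only three colon-separated groups.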

Regex for date format dd Mmm yyyy from email header

I have the following regex that I have been working on:
^(\d\d)\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(\d{4})?$
I am trying to grab the date from an email header that is formatted like so:
"Mon, 18 Nov 2019 09:19:17 -0700 (MST)"
and I want the result to be:
18 Nov 2019
It seems that the \s for whitespace could be the culprit, but I have yet to find another forum result that grabs dates with whitespace instead of "-" or "/".
Does anyone have any suggestions for getting this working to extract as described above? Thanks in advance.

The problem is that you have added the "^" and "$" anchors at the start and end of the regex.
"^": anchors the match to the beginning of the string.
"$": anchors the match to the end of the string.
Since the text does not start with two digits (\d\d) and does not end with four digits (\d{4}), you will not get any result from this regex.
You can simply remove those two anchors, or use the following code to extract the date.
/(\d{2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4})/.exec("Mon, 18 Nov 2019 09:19:17 -0700 (MST)")[1]
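The effect of the anchors is easy to see side by side. Below is a Python sketch (the variable names are mine) that runs the anchored pattern from the question and the unanchored one against the same header. I added \b on both sides as an extra assumption, so the pattern does not start in the middle of a longer number.

```python
import re

MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
header = "Mon, 18 Nov 2019 09:19:17 -0700 (MST)"

# Anchored as in the question: the whole string would have to be the date.
anchored = re.search(rf"^(\d\d)\s{MONTHS}\s(\d{{4}})?$", header)

# Unanchored: the date may appear anywhere in the header.
unanchored = re.search(rf"\b(\d{{2}}\s{MONTHS}\s\d{{4}})\b", header)
extracted = unanchored.group(1) if unanchored else None

print(anchored)
print(extracted)
```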

REQ: Assistance with Splunk - Rex Query

I'm having some issues with a rex query where a single digit date renders an incorrect result, but a double digit date provides the correct result.
These are the log entries I'm querying:
Mar  7 14:24:29 10.52.176.215 Mar  7 12:24:29 963568 - Melbourne details-cable-issue - vdvfvfv
Mar 20 09:52:55 10.52.176.215 Mar 20 07:52:55 963569 - Brisbane cable-issue
And this is the query:
^(?:[^ \n]* ){7}(?P<extension>[^ ]+)[^\-\n]*\-\s+(?P<location>\w+)
For the Mar 7 entry, my query is giving me group extension "7" whilst my Mar 20 entry is giving me group extension "963569" which is correct.
Can someone shed some light on my query to acknowledge a single and double digit date? #7 vs 20
Thanks all :)

There are several consecutive spaces (they look like padding spaces) in the first string, and since you only match one space within (?:[^ \n]* ) you get mismatches.
I suggest matching 1 or more spaces in that first group. With the sample lines as shown, seven space-separated fields still precede the extension, so the limiting quantifier stays at 7:
^(?:[^ \n]* +){7}(?P<extension>[^ ]+)[^-\n]*-\s+(?P<location>\w+)
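Here is a Python sketch of the padding problem and the one-or-more-spaces fix. It assumes the single-digit day is padded with two spaces, as syslog usually does. It uses a count of 7 because seven fields precede the extension in the sample lines as reproduced here, so adjust the count to your real events. Python's re accepts the same (?P<name>...) syntax that rex uses; the names ORIGINAL, FIXED and events are only for illustration.

```python
import re

# Pattern from the question: exactly one space after each field.
ORIGINAL = re.compile(
    r"^(?:[^ \n]* ){7}(?P<extension>[^ ]+)[^\-\n]*\-\s+(?P<location>\w+)"
)
# One or more spaces after each field, so padding no longer counts as a field.
FIXED = re.compile(
    r"^(?:[^ \n]* +){7}(?P<extension>[^ ]+)[^-\n]*-\s+(?P<location>\w+)"
)

events = [
    "Mar  7 14:24:29 10.52.176.215 Mar  7 12:24:29 963568 - Melbourne details-cable-issue - vdvfvfv",
    "Mar 20 09:52:55 10.52.176.215 Mar 20 07:52:55 963569 - Brisbane cable-issue",
]

for event in events:
    before = ORIGINAL.search(event).group("extension")
    m = FIXED.search(event)
    print(before, "->", m.group("extension"), m.group("location"))
```

With the single-space pattern, each extra padding space is consumed as an empty field, which is how the first event ends up with "7" as its extension.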

Regular expression: "between" : and a space

I know there are a load of similar "combine 2 regular expressions" posts on here, but I've tried the solutions and keep getting errors.
I've got the regular expressions to parse a description such as:
Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.
to extract the DOI (Digital Object Identifier):
([^:]+$) --> 10.1039/c1ob05128h. Epub 2011 Mar 28.
([^\s]+) --> 10.1039/c1ob05128h.
But am pretty clueless as to how to combine these. If it's difficult, then it's not necessary, but would simplify my calculations.
I also can't figure out how to get rid of that last "." which is not part of the DOI string (for the record there may be more than 2 full stops in a DOI so the regex can't simply be "after the 2nd full stop").
Some other examples as requested:
Chem Soc Rev. 2008 Nov;37(11):2413-21. doi: 10.1039/b719548f. Epub 2008 Sep 16.
Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Dec 27;16(48):14285-9. doi: 10.1002/chem.201002111. No abstract available.
All the attempts I've made so far give a result much the same as this:
Some of the exceptions to Dukeling's suggestion of "doi: ([^\s]+).? ([^:]+).?", for reasons unknown, were:
Chem Commun (Camb). 2012 Dec 25;48(99):12094-6. doi: 10.1039/c2cc35588d.
Org Biomol Chem. 2013 Jan 7;11(1):27-30. doi: 10.1039/c2ob26587g.
Chem Commun (Camb). 2013 Jan 25;49(7):671-3. doi: 10.1039/c2cc37953h.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Jul 26;16(28):8537-44. doi: 10.1002/chem.201000773.

If you only want the . gone, this seems to work:
"doi: ([^\s]+)\."
So we're just putting the . outside the brackets, so it doesn't get grouped with the string.
If you want to extract 10.1039/c1ob05128h and Epub 2011 Mar 28 in 2 separate strings, you can do this with groups. You can make the regex something like:
"doi: ([^\s]+)\.(?: ([^:]+)\.)?"
Given that the second part appears to be optional, we need to surround it with brackets which we mark as optional with ? (the ?: makes that outer group non-capturing, so your second cell receives only the inner group you want rather than the whole optional part).
And Google seems to automatically fill =CONTINUE(..., 1, 2) into the next cell, which gives you the two groups next to one another.
The pursuit to make the .'s optional
At first I tried just saying \.?, but obviously the [^\s]+ will then consume the . (which is not desired).
So you need to include something inside the brackets to prevent this. Specifically, you need to check the last character and make sure it's not a ..
This led me to:
"doi: ([^\s]*[^.\s])\.?(?: ([^:]*[^.:])\.?)?"
This allows for optional .'s, but if there are more than 1 . at the end, it won't work. Assuming we want none of these in our output, it's easily fixed by changing the \.? to \.*.
"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?"

I believe this may do the trick:
/doi: ((\S+)(?:\. .+)?)\.$/
The outermost group (which captures the longer string) is capture group 1, and the innermost group is capture group 2.

=REGEXEXTRACT(cell;"doi: ([.\d]+\/[\w\.]+)\.(?: |$)")
--> it extracts 10.1039/c1ob05128h
No need to combine regular expressions, it can be done at once.
I tried it on all of your examples and it works.
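For checking these patterns against more citations than fit in a spreadsheet, here is a Python sketch of the optional-dot pattern from the first answer above. Google Sheets uses a different regex engine, so treat this only as a way to test the logic; DOI_RE and split_citation are hypothetical names.

```python
import re

# Group 1: the DOI without trailing dots.
# Group 2 (optional): the trailing note, also without trailing dots.
DOI_RE = re.compile(r"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?")


def split_citation(text):
    """Return (doi, note); note is None when nothing follows the DOI."""
    m = DOI_RE.search(text)
    return (m.group(1), m.group(2)) if m else None


citations = [
    "Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.",
    "Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.",
    "Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.",
]

parsed = [split_citation(c) for c in citations]
for doi, note in parsed:
    print(doi, "|", note)
```

The third citation has nothing after the DOI, so the optional group is skipped and the note comes back as None.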
