Regex for capturing fields from logs with multiple formats - regex

I have a regex for capturing certain fields from logs. It works fine for one type of log entry, but sometimes I get a slightly different format and it captures different fields.
Demo
In the given example, there are two different log entries. The regex works fine for the first entry.
In the second entry, there are three different sets:
1. 314624K->314624K(419456K)
2. 9862316K->9542223K(12478080K)
3. 12261641K->11966292K(12478080K)
My regex skips the first one and captures the next two. I want to capture the first two occurrences.
My regex:
(?=[^P]*(?:ParNew|P.*ParNew|PSYoungGen|DefNew)).*?(?P<ParNew_before_1>\d+)K->(?P<ParNew_after_1>\d+)K\((?P<young_heap_size>\d+)K\), (?P<par_new_duration>\d+\.\d+) secs\] (?P<ParNew_before_2>\d+)K\->(?P<ParNew_after_2>\d+)K\((?P<total_heap_size>\d+)
I think the problem is that I am using ".*?". I have tried replacing it with \s+, but that does not work either.

Related

regex how to get one or more of "(.*)#" without grouping them?

I'm a real regex beginner. I have the following string:
<string1#string2#string3#string4>
From that I want to get:
Match 0: <string1#string2#string3#string4> Groups: "string1","string2","string3","string4"
I want all the strings to be captured in different groups of one match. I've already tried: <(.*)#(.*)#(.*)#(.*)>. How can I look for one or more of these (.*)# combinations without writing another group each time, so that it works with any number of strings?

Excluding 3dots additional to other characters with regex in a string

I have such an http-url detector regex:
(?:http|https)(?::\/{2}[\w]+)(?:[\/|\.]?)(?:[^\s<"]*)
It works pretty well for the following url representation:
http://www.acer.com/clearfi/download/
What modification can I make to extract
http://schemas.microsoft.com/office/word/2003/wordml2450
from
Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()
?
You can modify it to capture:
group of http stuff
followed by (group of) subdomain stuff
followed by as many as possible groups of:
one point or slash
followed by a group of characters (non-point, non-space, non-", non-<)
(?:http|https)(?::\/{2}[\w]+)([\/|\.][^\s<"\.]+)*
I made capturing groups to visualize the results
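As a minimal Python sketch of the pattern described above: it assumes the colon after the scheme is kept from the original expression, and it makes the repeated group non-capturing so that the whole match is the URL. The name extract_url is made up for illustration.

```python
import re

# Scheme, "://", first host label, then any number of chunks that start with
# "/", "|" or "." and contain at least one character that is not a dot,
# whitespace, "<" or a double quote. A trailing "..." therefore cannot match.
URL_RE = re.compile(r'(?:http|https)(?::\/{2}[\w]+)(?:[\/|\.][^\s<"\.]+)*')

def extract_url(text):
    """Return the first URL-like substring, or None (hypothetical helper)."""
    match = URL_RE.search(text)
    return match.group(0) if match else None

sample = 'Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()'
print(extract_url(sample))
# http://schemas.microsoft.com/office/word/2003/wordml2450
```

The trailing dots are dropped because each repeated chunk needs at least one non-dot character after its leading separator.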
I've changed your expression here and there: (.*)(https?:\/{2}[\w]+[\/|\.]?[^\s<"]*)(\.{3}.*) and take only the second capturing group from it. See example here: https://regex101.com/r/0viPC5/2
This expression can probably be simplified further, but I don't know your exact input and search criteria, so let's stick with what you already wrote.

Getting "null" when extracting string using REGEXP_EXTRACT in Tableau

I have been trying to use the REGEXP_EXTRACT function in Tableau without success (see image below). I have a string column 'FOB', and I want to extract the leading capital letters. Sometimes there's a dash following the capital letters, sometimes not, so I used the following syntax in the created field 'Advertiser':
REGEXP_EXTRACT([FOB],'^[A-Z]*')
However, this produces a column full of "null". The weird thing is even if I changed the pattern from '^[A-Z]*' to 'SDM', it was still the same. It just seems that Tableau is not regex enabled...
I did check my regex online here and it worked... getting really confused, any help will be appreciated.
Since you need to extract the first character in each [FOB] column cell, you need to use ^ anchor and a [A-Z] character class, but also you need to wrap the pattern with a capturing group (i.e. paired parentheses, (...)) to tell Tableau you need to extract this pattern part:
REGEXP_EXTRACT([FOB],'^([A-Z])')
                       ^     ^
To extract all (one or more) leading capital letters, add +:
REGEXP_EXTRACT([FOB],'^([A-Z]+)')
                             ^
See Mark Jackson's regex blog excerpt:
The whole pattern is wrapped in parenthesis to tell Tableau what part of the pattern to return. This is an update from the earlier beta version I was using when I created this post. The nice thing about this addition is that Tableau lets you pattern match on a larger portion of the string, but allows you to return a subset of the pattern.
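The pattern itself can be checked outside Tableau. Below is a rough Python sketch that returns the first capturing group of ^([A-Z]+); it only mimics the "return the parenthesized part" behaviour described above and is not Tableau's implementation. The function name and the sample values are invented for illustration.

```python
import re

def leading_capitals(value):
    """Return the leading run of capital letters, or None if there is none.

    Hypothetical stand-in for REGEXP_EXTRACT([FOB], '^([A-Z]+)').
    """
    match = re.search(r"^([A-Z]+)", value)
    return match.group(1) if match else None

# Made-up sample values, with and without a dash after the capitals.
for fob in ["SDM-2015", "ABC123", "123-XYZ"]:
    print(fob, "->", leading_capitals(fob))
```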

Regex - orderless extraction of string

I have 2 strings which are 2 records
string1 = "abc/BS-QANTAS\\/DS-12JUL15\\dfd"
string2 = "/DS-10JUN15\\/BS-AIRFRANCE\\dfdsfsdf"
BS is booking airline
DS is Date
I want to use a single regex and extract the booking source & date. Please let me know if it is feasible.
I have tried lookaheads and still couldn't achieve it.
The target language is Splunk, not JavaScript.
Whatever the language may be, please post; I'll give it a try in Splunk.
You mentioned that you've tried lookahead, what about lookbehind?
(?<=BS-|DS-)(\w+)
Tested at Regex101
Here's a more scalable (and more readable, IMO) alternative to miroxlav's answer:
(?:\/BS-(?P<source>\w+)|\/DS-(?P<date>\w+)|[^\/\v]+)+
I'm assuming the fields you're interested in always start with a slash. That allows me to use [^/]+ to safely consume the junk between/around them.
demo
This is effectively three regexes in one, wrapped in a group, to give each one a chance to match in turn, and applied multiple times. If the first alternative matches, you're looking at a "source airline" field, and the name is captured in the group named "source". If the second alternative matches, you're looking at the date, which is captured in the "date" group.
But, because the fields aren't in a predetermined order, the regex has to match the whole string to be sure of matching both fields (in fact, I should have used start and end anchors--^ and $--to enforce that; I've added them below). The third alternative, [^/]+, allows it to consume the parts that the first two can't, thus making an overall match possible. Here's the updated regex:
^(?:\/BS-(?P<source>\w+)|\/DS-(?P<date>\w+)|[^\/\v]+)+$
...and the updated demo. As noted in the comment, the \v is there only because I'm combining your two examples into one multiline string and doing two matches. You shouldn't need it in real life.
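For anyone who wants to try the anchored regex outside regex101, here is a small Python sketch. It assumes each record is matched on its own, so the \v is dropped as suggested, and it assumes the doubled backslashes in the question are escapes for a single backslash. The helper name extract_fields is invented for illustration.

```python
import re

# Same three alternatives as the anchored regex, minus \v.
FIELDS_RE = re.compile(r"^(?:/BS-(?P<source>\w+)|/DS-(?P<date>\w+)|[^/]+)+$")

def extract_fields(record):
    """Return (source, date) for one record, or None if it does not match."""
    match = FIELDS_RE.match(record)
    if match is None:
        return None
    return match.group("source"), match.group("date")

string1 = "abc/BS-QANTAS\\/DS-12JUL15\\dfd"
string2 = "/DS-10JUN15\\/BS-AIRFRANCE\\dfdsfsdf"
print(extract_fields(string1))  # ('QANTAS', '12JUL15')
print(extract_fields(string2))  # ('AIRFRANCE', '10JUN15')
```

Python keeps the last value captured by a named group even when later iterations of the outer group take a different alternative, which is why both fields are still available after the full match.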
This gives you both strings filled either in match groups airline1+date1 or in airline2+date2:
((BS-(?<airline1>\w+).*DS-(?<date1>[\w]+))|(DS-(?<date2>[\w]+).*BS-(?<airline2>\w+)))
>> view at regex101.com
Since there are only 2 groups, I used simple permutation.
This regex will take the last occurrence if there is more than one. If you need the earliest one (using lookbehind), let me know.

Problem getting nested groups in Regex

Given the following text:
//[&][$][*]\n81723&8992%9892*2343%8734
I need to get:
1. &
2. $
3. *
4. 81723&8992%9892*2343%8734
The first line defines the delimiters that separate the numbers on the second line.
There is an undefined number of delimiters.
I made this regex:
//(?:\[([^\]]+)\])+\n(.+)
But only 2 groups are obtained. The first is the last delimiter and the second is the string containing the numbers. I tried but I couldn't get all the delimiters.
I'm not good at regex, but I think the first group is being overwritten on every iteration of (?:\[([^\]]+)\])+ and I can't solve this.
Any help?
Regards
Victor
That's not a nested group you're dealing with, it's a repeated group. And you're right: when a capturing group is controlled by a quantifier, it gets repopulated on every iteration, so the final value is whatever was captured the last time around.
What you're trying to do isn't possible in any regex flavor I'm familiar with.
Here's a fuller explanation: Repeating a Capturing Group vs. Capturing a Repeated Group
The best thing I see that you could do (with regex) would be something like this:
(?:\[([^\]]+)\])?(?:\[([^\]]+)\])? #....etc....# \n(.+)
You can’t write something like (foo)+ and match against "foofoofoo" and expect to get three groups back. You only get one per open paren. That means you need more groups than you’ve written.
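One way around the one-group-per-paren limit is to leave pure regex and do it in two passes: capture the whole run of bracketed delimiters as a single group, then scan that group for the individual delimiters. This is a different technique from the regexes in the answers here. The Python sketch below assumes the \n in the sample text is a real newline; the function names are invented for illustration.

```python
import re

text = "//[&][$][*]\n81723&8992%9892*2343%8734"

def repeated_group_only(s):
    """Show the problem: the repeated group keeps only its last capture."""
    return re.match(r"//(?:\[([^\]]+)\])+\n(.+)", s).groups()

def parse_header(s):
    """Two passes: grab the whole delimiter run, then split it up."""
    match = re.match(r"//((?:\[[^\]]+\])+)\n(.+)", s)
    if match is None:
        return None
    delimiters = re.findall(r"\[([^\]]+)\]", match.group(1))
    return delimiters, match.group(2)

print(repeated_group_only(text))  # ('*', '81723&8992%9892*2343%8734')
print(parse_header(text))         # (['&', '$', '*'], '81723&8992%9892*2343%8734')
```

The second pass works for any number of delimiters, at the cost of running two expressions instead of one.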
The following regex works for javascript:
(\[.+\])(\[.+\])(\[.+\])\\n(.*)
This assumes your & $ * will have values.