Replace / Delete everything after first + character in datastudio - regex

I have a string looking like this (stored as an Event Action value from Google Analytics)
0+171235652++zu
or
122+115166747++en
I would like (using calculated fields) to create a new field that shows only the number before the first '+' character. So for the examples above:
0 or 122
What I tried is below, but it did not help. Any ideas?
REGEXP_REPLACE(Event Action, '(^\\+).*', '')

You may use
REGEXP_EXTRACT(Event Action, '^([^+]+)')
See the regex in action. The regex matches:
^ - start of string
([^+]+) - Capturing group 1: any one or more chars other than a + (you may use ([^+]*) if you want to also get empty match when a + is the first char).
If you want a replacement function, you may use
REGEXP_REPLACE(Event Action,"[+].*","")
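For readers who want to try both approaches outside Data Studio, here is a minimal Python sketch of the same two patterns. The function names are made up for illustration, and Python's re module is only a stand-in for the Data Studio functions, so treat it as a way to check the patterns rather than as Data Studio syntax.

```python
import re

def before_first_plus(value: str) -> str:
    """Extract approach: capture everything before the first '+'."""
    match = re.match(r'^([^+]+)', value)
    return match.group(1) if match else ''

def strip_from_first_plus(value: str) -> str:
    """Replace approach: remove the first '+' and everything after it."""
    return re.sub(r'[+].*', '', value)

for sample in ('0+171235652++zu', '122+115166747++en'):
    print(before_first_plus(sample), strip_from_first_plus(sample))
```

Both functions return 0 and 122 for the two sample values from the question.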

The pattern you tried, (^\\+).*, did not work because the part ^\\+ matches a literal plus sign at the very start of the string. Your values start with a digit, so the pattern never matches and nothing is replaced.
If what comes before the first plus sign should be digits and the plus sign itself must be present, you can capture the leading digits, then match the plus sign and the rest of the string.
Refer to group 1 with \\1 in the replacement.
^(\\d+)\\+.*
In parts
^ Start of string
(\\d+) Capture group 1, match 1 or more digits
\\+.* Match a + char and 0 or more times any char except a newline
Regex demo
Example code
REGEXP_REPLACE(Event Action, '^(\\d+)\\+.*', '\\1')
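A quick way to see the difference between the two patterns is to run them side by side. This is a rough Python sketch, assuming the doubled backslashes in the Data Studio strings correspond to single backslashes in a Python raw string; the variable names are illustrative only.

```python
import re

samples = ['0+171235652++zu', '122+115166747++en']

# The pattern from the question: a literal '+' is required at the very
# start of the string, so it never matches these values.
tried = [re.sub(r'(^\+).*', '', s) for s in samples]

# The pattern from this answer: keep the leading digits (group 1) and
# drop the '+' and everything after it.
fixed = [re.sub(r'^(\d+)\+.*', r'\1', s) for s in samples]

print(tried)  # the input strings, unchanged
print(fixed)  # ['0', '122']
```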

Related

How to match strings that are entirely composed of a predefined set of substrings with regex

How to match strings that are entirely composed of a predefined set of substrings. For example, I want to see if a string is composed of only the following allowed substrings:
,
034
140
201
In the case when my string is as follows:
034,201
The string is fully composed of the 'allowed' substrings, so I want to positively match it.
However, in the following string:
034,055,201
There is an additional 055, which is not in my 'allowed' substrings set. So I want to not match that string.
What regex would be capable of doing this?
Try this one:
^(034|201|140|,)+$
Here is a demo
Step by step:
^ beginning of a line
(034|201|140|,) capturing group with the alternative possible matches
+ the captured group appears one or more times
$ end of a line
This regex will match only your values and ensure that the line doesn't start or end with a comma. The whole match (group 0) is returned only if the string is valid; the groups are non-capturing.
^(?:034|140|201)(?:,(?:034|140|201))*$
^: start
(?:034|140|201): non-capturing group for your set of items (no comma)
(?:,(?:034|140|201))*: non-capturing group of a comma followed by a non-capturing group of values, 0 or more times
$: end
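To compare the two patterns from the answers above, here is a small Python sketch. The variable names loose and strict are just labels for this example. It shows that the first pattern treats the comma as one more allowed piece, while the second insists on values separated by single commas.

```python
import re

# First answer: any sequence of the allowed pieces, comma included.
loose = re.compile(r'^(034|201|140|,)+$')
# Second answer: values separated by single commas, no leading/trailing comma.
strict = re.compile(r'^(?:034|140|201)(?:,(?:034|140|201))*$')

for s in ['034,201', '034,055,201', ',034', '034201', ',,']:
    print(repr(s), bool(loose.match(s)), bool(strict.match(s)))
```

Both patterns accept 034,201 and reject 034,055,201. Only the first one accepts ,034, 034201 and ,, so the choice depends on whether those strings should count as valid.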

Regex Help required for User-Agent Matching

Have used an online regex learning site (regexr) and created something that works but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is an explanation of your regex pattern; if you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with \, but outside a character class an unescaped - is also a literal hyphen, so the escape is not needed here. The - is followed by \d+, which means one or more digits (0-9), so there must be at least one digit. Whatever this part matches is saved in the first capturing group.
(?:\w)+ Non-capturing group: \w matches one word character, equivalent to [A-Za-z0-9_]. The + after the group means match the whole group one or more times, and since the group contains only \w, it matches one or more word characters. The non-capturing group is redundant here; \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02 and \- matches the hyphen after it. \d+ (one or more digits) first matches 02, so the value so far is scan-02. Then (?:\w)+ must match at least one word character. It tries the period . and fails, because the period is not a word character. The match is not over at this point: the regex engine backtracks into the previous \d+, which now matches only the 0, so scan-0 is saved in the first capturing group and (?:\w)+ matches the 2. The engine backtracks because both \d+ and (?:\w)+ carry a +, requiring at least one digit and at least one word character respectively, and it tries every way to satisfy both.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2, then (?:\w)+ tries to match the period after scan-2 and fails. This is the important point: \d+ holds only one digit, so it has nothing to give back. The engine then retries the whole pattern from the next position in scan-2.shadowserver.org, the character c, where the s in (scan\-\d+) fails to match c. It keeps trying at each later position, and in the end the match fails.
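The two examples can be checked with a few lines of Python. This is only a sketch using Python's re module, which is an assumption on my part (the IIS URL Rewrite module uses its own regex engine), but the backtracking behaviour described here is the same in any backtracking engine.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(scan\-\d+)(?:\w)+\.shadowserver\.org')

# Example 1: the match succeeds, but only after \d+ gives back the "2".
m = pattern.search('scan-02.shadowserver.org')
print(m.group(0))  # scan-02.shadowserver.org
print(m.group(1))  # scan-0

# Example 2: a single digit leaves nothing for (?:\w)+, so there is no match.
print(pattern.search('scan-2.shadowserver.org'))  # None
```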
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
See regex demo
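Here is a minimal Python sketch that runs the simple solution against the input text from the question. Python's re module is an assumption for illustration and the hosts list is copied from the question; the two extra negative cases at the end are hypothetical values added to show what gets rejected.

```python
import re

pattern = re.compile(r'(scan-\d+[a-z]?)\.shadowserver\.org')

hosts = [
    'scan-2.shadowserver.org',
    'scan-02.shadowserver.org',
    'scan-2j.shadowserver.org',
    'scan-02j.shadowserver.org',
    'scan-17w.shadowserver.org',
    'scan-101p.shadowserver.org',
]

# Group 1 holds the "scan-<digits><optional letter>" part of each match.
captured = [pattern.search(h).group(1) for h in hosts]
print(captured)

# Hypothetical values that should not match: no digit, or two letters.
rejected = [h for h in ('scan-.shadowserver.org', 'scan-2jj.shadowserver.org')
            if pattern.search(h) is None]
print(rejected)
```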

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g.
IN: OK THT PHP This is it 06222021
OUT: This is it
IN: NO MTM PYT Get this content 111111
OUT: Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space character
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY
You don't need 4 groups. You can use a single capture group 1 in the replacement, and match 6-8 digits for the last part instead of exactly 6.
Note that \w{0,2} will also match an empty string; you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 runs of word characters, each with its own quantifier and followed by a whitespace char
(.*?) Capture group 1, match any char as few times as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Regex demo
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it
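If you want to check the pattern against both rows of the question quickly, the same replacement can be sketched in Python. The helper name extract_content is made up for this example, and Python's re.sub stands in for Regex.Replace, with \1 in place of $1.

```python
import re

PATTERN = re.compile(r'^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$')

def extract_content(line: str) -> str:
    """Replace the whole line with capture group 1 (the middle part).

    A line that does not match the pattern is returned unchanged.
    """
    return PATTERN.sub(r'\1', line)

print(extract_content('OK THT PHP This is it 06222021'))
print(extract_content('NO MTM PYT Get this content 111111'))
```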
My approach uses neither groups nor a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this also works if there are more than 6 digits, because the lookahead is satisfied as soon as it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 occurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(input, @"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value; // input is the string to search
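A caveat if you port this to another engine: the look-behind here has variable length, which .NET accepts but many engines do not. As an illustration, Python's re module rejects it, so the sketch below swaps in a different technique (consume the prefix, capture the middle, keep the look-ahead). The names lookbehind_supported and find_content are made up for this example.

```python
import re

# Python's re only allows fixed-width look-behind, so compiling the
# .NET pattern as-is raises re.error.
try:
    re.compile(r'(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)')
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Swapped-in technique: match the three word-space pairs normally and
# capture what follows, still using the look-ahead for the suffix.
CAPTURE = re.compile(r'^(?:\w+\s){3}(.*?)(?=\s\d+\s?$)')

def find_content(line):
    """Return the text between the 3-word prefix and the trailing digits, or None."""
    m = CAPTURE.search(line)
    return m.group(1) if m else None

print(lookbehind_supported)
print(find_content('OK THT PHP This is it 06222021'))
print(find_content('NO MTM PYT Get this content 111111'))
```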

Regex for validating multiple Inputs as per the below regex

I have a regex that validates my inputs like
Regex :
^(?=.{1,15}$)([a-zA-Z0-9]+)(?:[-]{1})[a-zA-Z0-9]+$
Example 1: BBB-123BBB
Now I want to build on the above so that the regex validates multiple inputs separated by a semicolon (;), with a maximum of 20 inputs.
Like for ex 2:
BBB-123BB,AAA-1234;EEE-9876....20 items
Ex 2.
BB-123BB,AAA3-1234;EEE334-9876....20 items
How can I extend my regex above (the first one) to allow multiple inputs to be added while letting them be split by a semicolon and can have a maximum of 20 items (as shown in ex 2)?
Building on your pattern, I removed unnecessary capturing groups and used simple -, which is equivalent to (?:[-]{1}). Here's what I came up with:
^(?:(?:^|;)(?=[^;]{1,15})[a-zA-Z0-9]+-[a-zA-Z0-9]+){1,20}$
Explanation:
^ - match beginning of a string
(?:...) - non-capturing group
^|; - alternation: match the beginning of the string or match ; literally
(?=[^;]{1,15}) - positive lookahead: assert that between 1 and 15 characters other than ; follow
{1,20} - match preceding pattern between 1 and 20 times
$ - match end of a string
Demo
EDIT: Pattern:
^(?=[^;]{1,15})[a-zA-Z0-9]+-[a-zA-Z0-9]+(?:;(?=[^;]{1,15})[a-zA-Z0-9]+-[a-zA-Z0-9]+){0,19}$
won't accept ; at the beginning.
SECOND EDIT:
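^(?=[^;]{1,15}(?:;|$))[a-zA-Z0-9]+-[a-zA-Z0-9]+(?:;(?=[^;]{1,15}(?:;|$))[a-zA-Z0-9]+-[a-zA-Z0-9]+){0,19}$
Added: (?:;|$) at the end of each lookahead - match either ; literally or $ - end of string
What it does: correctly limits the length of every token to 15, because each lookahead must now reach a ; or the end of the string within 15 characters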
If the maximum is not important to enforce, simply allow arbitrary repetitions.
^(?=[^;]{3,15}(?:;[^;]{3,15})*$)[a-zA-Z0-9]+-[a-zA-Z0-9]+(;[a-zA-Z0-9]+-[a-zA-Z0-9]+)*$
If you want to specifically allow between 0 and 19 repeats, change the last * to {0,19}.
The minimal string which can match the main expression has three characters; so I updated the length constraint to {3,15}.
A minus simply matches itself so there is no need to put it in a character class; and there is never a good reason to specify a single repetition of anything, so I simplified the main regex accordingly.
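As a rough check of the length and count limits, here is a Python sketch of the pattern above with the last * changed to {0,19}. The names ITEM, PATTERN and is_valid are made up for this example, and splitting the pattern into pieces is only for readability.

```python
import re

ITEM = r'[a-zA-Z0-9]+-[a-zA-Z0-9]+'

# Lookahead: every ;-separated token is 3 to 15 characters long.
# Main part: one item, then up to 19 more items, each preceded by ';'.
PATTERN = re.compile(
    r'^(?=[^;]{3,15}(?:;[^;]{3,15})*$)' + ITEM + r'(?:;' + ITEM + r'){0,19}$'
)

def is_valid(value: str) -> bool:
    return PATTERN.match(value) is not None

print(is_valid('BBB-123BBB'))
print(is_valid('BB-123BB;AAA3-1234;EEE334-9876'))
print(is_valid(';'.join(['A-1'] * 21)))
```

A single item and a three-item list are accepted, while a list of 21 items is rejected. Note that a comma is not treated as a delimiter here, so an input like BBB-123BB,AAA-1234 fails validation.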

Puppet dynamic variable from hostname

I am looking at trying to get a dynamic variable out of my ec2's hostname. Hostnames follow this pattern
us-east-1b-application-type-environment-138-10.domain.com
I would like my variable to end up looking like this
application-type-environment
Using this
$variable = regsubst($hostname, '/[a-z]{1}[0-9]{1}-([^-]+)-[0-9]{1,3}/', '')
I end up with this though
us-east-1b-application-type-environment-138-10
How can I get my expected outcome?
You do not need regex delimiters in regsubst. You need to match the whole string to be able to remove it and only keep what you need. The technique consists of matching what you do not want to keep, and matching and capturing what you do want to have as a result.
You can use
regsubst($hostname, '^[^0-9]*[0-9][a-z]-(.*?)-[0-9]{1,3}.*$', '\1')
I think you are trying to get just what is in between the first [digit][lowercase-letter] chunk and the next chunk of one to three digits.
Here is a regex demo
Breakdown of the expression:
^ - start of line (if start of string is meant, replace with \A)
[^0-9]* - 0 or more non-digit symbols (all but digits, this can be replaced with \D*)
[0-9][a-z]- - a digit followed by a lowercase letter followed by - (the same as \d[a-z]-)
(.*?) - match and capture any characters but a newline, as few as possible, up to the closest...
-[0-9]{1,3} - a hyphen followed by 1 to 3 digits (the same as -\d{1,3})
.*$ - 0 or more of any characters but a newline up to the end of line (if end of string is meant, replace with \z).
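To sanity-check the expression outside Puppet, the same match-everything-and-keep-group-1 idea can be sketched in Python. The function name app_name is made up for this example, and re.sub is only a stand-in for regsubst.

```python
import re

def app_name(hostname: str) -> str:
    """Replace the whole hostname with capture group 1."""
    return re.sub(r'^[^0-9]*[0-9][a-z]-(.*?)-[0-9]{1,3}.*$', r'\1', hostname)

print(app_name('us-east-1b-application-type-environment-138-10.domain.com'))
# application-type-environment
```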