How can I get two variables from a string with Regex - regex

I have the following string that I need to extract two variables from with Regex. I've been looking at tutorials, but not getting the results I need. Can anyone help?
AWS-010062347904-uptree-base-prod-admin
I need the 12 digits between the first two hyphens as accountid (ie. 010062347904)
and everything after the 2nd hyphen as role (ie. uptree-base-prod-admin)

Try Regex: (?:[^-]*-)(?<accountId>\d{12})(?:[^-]*-)(?<role>.*)
Demo

The pattern from your comment, ^AWS\-(?<accountid>\d+), matches AWS- at the start of the string (^) and then uses a named capturing group, accountid, to capture one or more digits.
You can read the account ID by referring to that named capturing group. To capture the role as well, add a second named capturing group:
^AWS-(?<accountId>\d+)-(?<role>.*)$
Regex demo
If you want the digits after the first hyphen regardless of what the string starts with, you could begin with a negated character class, [^-]*, which matches 0+ characters other than a hyphen:
^[^-]*-(?<accountId>\d+)-(?<role>.*)$
Regex demo
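For readers who want to try the last pattern in code, here is a minimal sketch in Python. Two assumptions: Python's re module spells named groups (?P<name>...) rather than (?<name>...), so the pattern is adapted accordingly, and the helper name parse_role_string is made up for illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for named groups.
PATTERN = re.compile(r"^[^-]*-(?P<accountId>\d+)-(?P<role>.*)$")

def parse_role_string(value):
    """Return (accountId, role), or None if the string does not match.

    parse_role_string is a hypothetical helper name, not part of the question.
    """
    m = PATTERN.match(value)
    if m is None:
        return None
    return m.group("accountId"), m.group("role")

print(parse_role_string("AWS-010062347904-uptree-base-prod-admin"))
# ('010062347904', 'uptree-base-prod-admin')
```

If the account ID must be exactly 12 digits, \d{12} can replace \d+ as in the first answer.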

Related

Regex to extract text between two character patterns

I have multiple rows of data that look like the following:
dgov-nonprod-adp-personal.groups
dgov-prod-gcp-sensitive.groups
I want to get the text between the last hyphen and before the period so:
personal
sensitive
I have this regex, (?:prod-(.*)-)(.*).groups, but it produces two capture groups, and in BigQuery I can only extract when there is exactly one group. What would the regex be to extract just the text I want?
Note: after the second hyphen and before the third it will always be prod or nonprod, which is why my original regex uses prod- as a constant.
Assuming the BigQuery function you are using supports a capture group, I would phrase your requirement as:
([^-]+)\.groups$
Demo
For the example data, you can make the pattern a bit more specific matching -nonprod or -prod with a single capture group:
-(?:non)?prod-[^-]+-([^-]+)\.groups$
See a regex demo.
If there can be more occurrences of the hyphen:
-(?:non)?prod(?:-[^-]+)*-([^-]+)\.groups$
The pattern matches
-(?:non)?prod Match either -nonprod or -prod
(?:-[^-]+)* Optionally match - followed by 1+ chars other than -
- Match literally
([^-]+) Capture group 1, match 1+ chars other than -
\.groups Match .groups
$ End of string
See another regex demo.
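The single-group pattern can be checked outside BigQuery. This is a rough sketch in Python, whose re module is a different engine from the one BigQuery uses, so treat it only as a quick sanity check of the pattern itself. The function name last_segment is an assumption for the example.

```python
import re

# One capturing group only: the text between the last hyphen and ".groups".
PATTERN = re.compile(r"-(?:non)?prod(?:-[^-]+)*-([^-]+)\.groups$")

def last_segment(row):
    """Return the captured segment, or None when the row does not match."""
    m = PATTERN.search(row)
    return m.group(1) if m else None

rows = ["dgov-nonprod-adp-personal.groups", "dgov-prod-gcp-sensitive.groups"]
print([last_segment(r) for r in rows])
# ['personal', 'sensitive']
```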

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?
Your regular expression was almost right:
(^\d+\w?)\s([\w\s]+).\s([\w\s]+)\s([A-Z]{2})\s(\d{5})
What I changed are the second and third groups: in both you used a group inside a group, ((\w*\s?)+), whereas a character class inside a group, ([\w\s]+), matches the same text and gives you the proper group content. The quantifiers * and ? are left out of the class because inside a character class they are literal characters, not quantifiers.
With your previous syntax, the inner group would be able to match an empty substring, since both quantifiers allow for a zero-length match (* is 0 to unlimited matches and ? is zero or one match). Since this group was repeated one or more times with the +, the last occurrence would match an empty string and only keep that.
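The effect of repeating a capturing group can be reproduced on a small input. The sketch below uses Python's re module, which is an assumption (the question was tested on regex101), and compares the nested form with a character class inside a single group.

```python
import re

# Nested form: the inner group is repeated by +, and only its last
# iteration is kept. That last iteration matches the empty string.
nested = re.match(r"((\w*\s?)+)", "Lake Forest Park")
print(repr(nested.group(1)))  # the whole repeated span
print(repr(nested.group(2)))  # what the inner group kept

# Character class inside one group: a single capture, no inner group.
flat = re.match(r"([\w\s]+)", "Lake Forest Park")
print(flat.groups())
```

In CPython the inner group comes back as an empty string here, which corresponds to the "null" groups reported in the question.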
To avoid this, use a non-capturing group, written (?:regex), in the places where you currently see your "null" results. This gives you the regex:
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: ([^s]+) (?:[^s]+)
The first group is captured into "Group 1", while the second one is not captured at all.
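The same comparison can be run in code. This is a small sketch in Python (an assumption, since the answer does not name a language), and the sample text "foo bar" is made up for the example. Note that [^s] excludes the letter s, exactly as written in the pattern above.

```python
import re

pattern = re.compile(r"([^s]+) (?:[^s]+)")
m = pattern.match("foo bar")  # hypothetical sample input

print(pattern.groups)  # number of capturing groups in the pattern
print(m.group(0))      # the full match still covers both parts
print(m.groups())      # only the capturing group is reported
```

The non-capturing group still takes part in the match; it just does not get a group number.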
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Regex demo
Or take the part before the comma that ends with two or more uppercase characters, and then match optional non-word characters using \W* to reach the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
Regex demo
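As a quick check of the comma-based pattern against the address from the question, here is a minimal sketch in Python. The language choice is an assumption, and the pattern is copied unchanged from above.

```python
import re

# Groups: house number, street, city, state, zip code.
ADDRESS = re.compile(r"^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$")

m = ADDRESS.match("5173B 63rd Ave NE, Lake Forest Park WA 98155")
print(m.groups())
# ('5173B', '63rd Ave NE', 'Lake Forest Park', 'WA', '98155')
```

Each field lands in its own group with no empty extras, because no capturing group is repeated.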

Google Data Studio Regexp Replace formula - delete all characters after ? and #

I have a dashboard in Google Data Studio.
I'm trying to create a custom field that removes the # and ? signs and everything that comes after them. For some reason, this formula does not work.
This is what I tried:
REGEXP_REPLACE(Landing Page,'(#|\?)(.*)','')
Could you please help?
The pattern you tried, (#|\?)(.*), captures either # or ? in a capturing group using the alternation |, and then captures any character 0+ times in a second capturing group.
Because the replacement is an empty string, everything that is matched is removed.
You could use a character class in a capturing group, ([#?]), to capture one of the listed characters.
To only do the replacement where there is something after the match, you could match 1+ times any character except a newline using .+
To remove what comes after the matched character, you could refer to the capturing group using \\1 so that you keep the # or ? and remove what is matched afterwards.
The pattern could look like:
([#?]).+
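To see what the two replacement choices do, here is a rough sketch in Python. This is an assumption-heavy stand-in: Data Studio's REGEXP_REPLACE runs on a different regex engine, the sample paths are invented, and both function names are hypothetical.

```python
import re

def keep_marker(url):
    """Keep the # or ? and drop what follows (replacement refers to group 1)."""
    return re.sub(r"([#?]).+", r"\1", url)

def drop_marker(url):
    """Drop the # or ? together with everything after it (empty replacement)."""
    return re.sub(r"[#?].*", "", url)

for path in ["/page?utm=1#top", "/page#frag", "/page"]:  # made-up landing pages
    print(path, "->", keep_marker(path), "|", drop_marker(path))
```

The second function corresponds to the behaviour asked for in the question, where the # or ? is removed as well.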

How can I remove something from the middle of a string with regex?

I have strings which look like this:
/xxxxx/xxxxx-xxxx-xxxx-338200.html
With my regex:
(?<=-)(\d+)(?=\.html)
It matches just the numbers before .html.
Is it possible to write a regex that matches everything that surrounds the numbers (matches the .html part and the part before the numbers)?
In your current pattern you already use a capturing group. In that case you could simply match what comes before and after it instead of using lookarounds:
-(\d+)\.html
To get what comes before and after the digits, you could use 2 capturing groups:
^(.*-)\d+(\.html)$
Regex demo
In the replacement use the 2 groups.
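A minimal sketch of that replacement in Python (the language is an assumption; the question does not say which tool is used). In Python's re.sub, \1 and \2 in the replacement refer to the two groups.

```python
import re

path = "/xxxxx/xxxxx-xxxx-xxxx-338200.html"

# Keep group 1 (everything up to the last hyphen) and group 2 (".html"),
# dropping the digits in between.
result = re.sub(r"^(.*-)\d+(\.html)$", r"\1\2", path)
print(result)
# /xxxxx/xxxxx-xxxx-xxxx-.html
```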
This should do the job:
.*-\d+\.html
Explanation: .* matches any characters up to the hyphen, -\d+ matches a - followed by a sequence of digits, and \.html matches the literal text .html (where \. represents the character .).
To capture groups, just do (.*-)(\d+)(\.html). This will put everything before the number in a group, the number in another group and everything after the number in another group.

Regex match negative optional group

I'm trying to match MenuSearch and User in these occurrences:
/MenuSearch?action=read
/User
The following regex matches the first case:
/\/(.*)(?=\?)/g
But it doesn't match User, because that line doesn't contain a ? character. How can I make the second regex group optional?
See online:
https://regex101.com/r/qU6hN6/2
/\/([^?\n]*)(\?.*)?/g
This grabs a forward slash, \/, followed by any number of non-? non-newline characters, ([^?\n]*), optionally followed by a question mark and any number of characters, (\?.*)?
The first capture group is the menu item, the second capture group is the query.
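The optional second group can be tried on both inputs from the question. This sketch uses Python rather than JavaScript (an assumption), writes the character class as [^?\n], and uses a made-up function name, split_path.

```python
import re

# Group 1: the menu item. Group 2: the optional query, including the "?".
PATTERN = re.compile(r"/([^?\n]*)(\?.*)?")

def split_path(line):
    """Return (item, query); query is None when there is no "?" part."""
    m = PATTERN.match(line)
    return m.groups() if m else None

print(split_path("/MenuSearch?action=read"))  # ('MenuSearch', '?action=read')
print(split_path("/User"))                    # ('User', None)
```

When the optional group does not take part in the match, Python reports it as None rather than as an empty string.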
You can use this negation based regex:
/^\/([^?]+)/gm
Updated RegEx Demo
You could also use the \w metacharacter if you just need to find word characters.
/\/(\w+)/g