BigQuery - Alternative method to Positive Lookahead for RegExes - regex

I've written a RegEx pattern that identifies alpha-characters that are immediately followed by a numeric character, with the intention that it would be used in BigQuery's REGEXP_EXTRACT function.
Here's the pattern: ([A-Z]|[a-z])*(?=[0-9])
However, because BigQuery uses the RE2 regular expression library, Positive Lookahead is not supported. What's an alternative method of identifying the numeric character without including it in the extracted string/match?
Use case:
To extract the first 1 or 2 alpha-characters of a UK postcode, e.g.
NW9 9KL
M1 0TE
ph3 2ee
N10 10KE

You can use
REGEXP_EXTRACT(col, '^[A-Za-z]+')
The ^[A-Za-z]+ regex matches
^ - start of string
[A-Za-z]+ - one or more letters.
Also, if you MUST check for a digit right after the initial letters, you can use
REGEXP_EXTRACT(col, '^([A-Za-z]+)[0-9]')
The ^([A-Za-z]+)[0-9] regex matches and captures into Group 1 the initial letters, and then just matches a digit (with [0-9]). The REGEXP_EXTRACT function returns the captured substring if there is a capturing group.
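As a rough illustration, the capture-group approach can be checked in Python against the postcodes from the question. Python's re module is not RE2, so this is only a sketch of how the pattern behaves; it uses nothing beyond syntax that both engines accept. The helper name extract_area is made up for this example.

```python
import re

# Same pattern as in the REGEXP_EXTRACT call above: capture the leading
# letters, then require (but do not capture) one digit.
AREA_RE = re.compile(r'^([A-Za-z]+)[0-9]')

def extract_area(postcode):
    """Return the leading letters, or None if no digit follows them."""
    m = AREA_RE.match(postcode)
    return m.group(1) if m else None

samples = ["NW9 9KL", "M1 0TE", "ph3 2ee", "N10 10KE"]
areas = [extract_area(s) for s in samples]
print(areas)
```

The digit is consumed by the match but sits outside the group, which is what stands in for the lookahead.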

Related

Using REGEXEXTRACT on an IMPORTRANGE in Google Docs

I am importing a range from another Google sheet and I need to pull a specific number from the data that is imported. The data looks something like:
R2.word.4.word
I want to extract the second number. It will always follow this format (a letter and a number then a period then a word then a period then a number (might be single or double digit) then a period and a word). The regex to extract the second number should be: (\d+)(?!.*\d) and I have tested it in multiple regex test sites. However, Google docs gives me an error stating it is not a regular expression. I tried something like this (edited out URL and the sheet name):
=REGEXEXTRACT(IMPORTRANGE(URL,Sheet!A2:A200), "(\d+)(?!.*\d"))
Can anyone help me understand how I can fix this?
And the other issue here is that it isn't actually importing the range. I only get it to import on the first cell and not down the column.
You could write a pattern like:
=REGEXEXTRACT(A2,"^[A-Z]\d+\.\w+\.(\d+)")
Explanation
^ Start of string
[A-Z] Match a single uppercase char
\d+ Match 1+ digits
\. Match a dot
\w+ Match 1+ word characters
\. Match a dot
(\d+) Capture group 1, match 1+ digits
Regex demo
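The same pattern can be tried outside Sheets. Below is a minimal Python sketch, assuming the value has already been imported into a plain string; second_number is a hypothetical helper name, and the second sample value is made up to cover the double-digit case.

```python
import re

# Pattern from the formula above; group 1 holds the second number.
PATTERN = re.compile(r'^[A-Z]\d+\.\w+\.(\d+)')

def second_number(value):
    """Return the number in the third dot-separated field, or None."""
    m = PATTERN.match(value)
    return m.group(1) if m else None

single = second_number("R2.word.4.word")
double = second_number("R2.word.12.word")
print(single, double)
```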
With your shown samples please try following regex.
=REGEXEXTRACT(A2,"^[a-zA-Z]\d+\.[^.]*\.(\d+)\.\S+$")
Here is the Online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^[a-zA-Z] ##From starting of value matching a-zA-Z here.
\d+ ##Matching 1 or more occurrences of digits.
\.[^.]*\. ##Matching literal dot till next occurrence of dot here.
(\d+) ##Creating 1 capturing group and which has 1 or more digits matching in it.
\.\S+$ ##Matching literal dot followed by 1 or more non-spaces till end of value.
"It will always follow this format"
Based on the above; you can use REGEXEXTRACT() but it's slow compared to simple SPLIT() which in your standardized format is ideal:
Formula in B2:
=INDEX(SPLIT(A2:A3,"."),0,3)
This is an array-formula by default and will spill all values down. Just apply it to your entire range.
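For comparison, the split-based idea looks like this in Python. It is only a sketch of the same technique, not of how Sheets evaluates SPLIT; third_field is a hypothetical helper, the second row is a made-up sample, and the result stays a string.

```python
def third_field(value, sep="."):
    """Return the third separator-delimited field, or None if it is missing."""
    parts = value.split(sep)
    return parts[2] if len(parts) > 2 else None

rows = ["R2.word.4.word", "R10.other.12.word"]
thirds = [third_field(r) for r in rows]
print(thirds)
```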

regex match two words based on a matching substring

there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with the above regex is that as soon as the value ends with any of the characters of "QUERY_answer", the regex doesn't match any value for the grouping. For instance, the below 2 strings don't match at all as the VALUESTU ends with "U" here:
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?
With your shown samples please try following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.
You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.
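To see the pairing in action, here is a small Python sketch that groups the file names from the question by the value of Group 1. It assumes each name should match in full, so it uses re.fullmatch; the variable names are illustrative only.

```python
import re
from collections import defaultdict

# Lazy (.*?) stops as early as possible, so an optional trailing
# QUERY_answer is left for the non-capturing group to consume.
PATTERN = re.compile(r'ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv')

names = [
    "ABC_FIXED_20220720_VALUEABC.csv",
    "ABC_FIXED_20220720_VALUEABCQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUEDEF.csv",
    "ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUESTU.csv",
    "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv",
]

groups = defaultdict(list)
for name in names:
    m = PATTERN.fullmatch(name)
    if m:
        groups[m.group(1)].append(name)

for key, members in groups.items():
    print(key, members)
```

Each key ends up with two file names, including the VALUESTU pair that the negated character class rejected.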

Pattern to match everything except a string of 5 digits

I only have access to a function that can match a pattern and replace it with some text:
Syntax
regexReplace('text', 'pattern', 'new text')
And I need to return only the 5 digit string from text in the following format:
CRITICAL - 192.111.6.4: rta nan, lost 100%
Created Time Tue, 5 Jul 8:45
Integration Name CheckMK Integration
Node 192.111.6.4
Metric Name POS1
Metric Value DOWN
Resource 54871
Alert Tags 54871, POS1
So from this text, I want to replace everything with "" except the "54871".
I have come up with the following:
regexReplace("{{ticket.description}}", "\w*[^\d\W]\w*", "")
Which almost works but it doesn't match the symbols. How can I change this to match any word that includes a letter or symbol, essentially.
As you can see, the pattern I have is very close, I just need to include special characters and letters, whereas currently it is only letters.
You can match the whole string but capture the 5-digit number into a capturing group and replace with the backreference to the captured group:
regexReplace("{{ticket.description}}", "^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$", "$1")
See the regex demo.
Details:
^ - start of string
(?:[\w\W]*\s)? - an optional substring of any zero or more chars as many as possible and then a whitespace char
(\d{5}) - Group 1 ($1 contains the text captured by this group pattern): five digits
(?:\s[\w\W]*)? - an optional substring of a whitespace char and then any zero or more chars as many as possible.
$ - end of string.
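A quick way to check this pattern is Python's re.sub, used here only as a stand-in for regexReplace (whose regex flavor is not stated in the question). Python writes the backreference as \1 rather than $1, and the text below is the sample from the question.

```python
import re

text = """CRITICAL - 192.111.6.4: rta nan, lost 100%
Created Time Tue, 5 Jul 8:45
Integration Name CheckMK Integration
Node 192.111.6.4
Metric Name POS1
Metric Value DOWN
Resource 54871
Alert Tags 54871, POS1"""

# Match the whole text, keep only the whitespace-delimited 5-digit run.
PATTERN = r'^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$'
result = re.sub(PATTERN, r'\1', text)
print(result)
```

Because the number must have whitespace (or the string boundary) on both sides, the match here comes from the "Resource" line; the occurrence followed by a comma does not qualify.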
The easiest regex is probably:
^(.*\D)?(\d{5})(\D.*)?$
You can then replace the string with "$2" ("\2" in other languages) to only place the contents of the second capture group (\d{5}) back.
The only issue is that . doesn't match newline characters by default. Normally you can pass a flag to change . to match ALL characters. For most regex variants this is the s (single line) flag (PCRE, Java, C#, Python). Other variants use the m (multi line) flag (Ruby). Check the documentation of the regex variant you are using for verification.
However, the question suggests that you're not able to pass flags separately, in which case you could pass them as part of the regex itself.
(?s)^(.*\D)?(\d{5})(\D.*)?$
regex101 demo
(?s) - Set the s (single line) flag for the remainder of the pattern, which enables . to match newline characters ((?m) for Ruby).
^ - Match the start of the string (\A for Ruby).
(.*\D)? - [optional] Match anything followed by a non-digit and store it in capture group 1.
(\d{5}) - Match 5 digits and store it in capture group 2.
(\D.*)? - [optional] Match a non-digit followed by anything and store it in capture group 3.
$ - Match the end of the string (\z for Ruby).
This regex will result in the last 5-digit number being stored in capture group 2. If you want to use the first 5-digit number instead, you'll have to use a lazy quantifier in (.*\D)?. Meaning that it becomes (.*?\D)?.
(?s) is supported by most regex variants, but not all. Refer to the regex variant documentation to see if it's available for you.
An example where the inline flags are not available is JavaScript. In such scenario you need to replace . with something that matches ALL characters. In JavaScript [^] can be used. For other variants this might not work and you need to use [\s\S].
With all this out of the way, and assuming a language that can use "$2" as the replacement, where you do not need to escape backslashes, and a regex variant that supports an inline (?s) flag, the answer would be:
regexReplace("{{ticket.description}}", "(?s)^(.*\D)?(\d{5})(\D.*)?$", "$2")
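The greedy versus lazy difference described above can be sketched in Python, again with re.sub standing in for regexReplace and \2 in place of $2. The two-line input is made up for illustration so that it contains two different 5-digit numbers.

```python
import re

# Hypothetical input with two 5-digit numbers on separate lines.
text = "first 11111 line\nsecond 22222 line"

# Greedy (.*\D)? pushes the match as far right as possible: last number.
last = re.sub(r'(?s)^(.*\D)?(\d{5})(\D.*)?$', r'\2', text)

# Lazy (.*?\D)? stops at the earliest possible position: first number.
first = re.sub(r'(?s)^(.*?\D)?(\d{5})(\D.*)?$', r'\2', text)

print(last, first)
```

The inline (?s) flag is what lets . cross the newline; Python accepts it at the start of the pattern.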

How do I create a regex expression for a 10 digit phone number with the same separator?

I am trying to create a basic regular expression to match a phone number which can either use dots [.] or hyphens [-] as the separator.
The format is 123.456.7890 or 123-456-7890.
The expression I am currently using is:
\d\d\d[-.]\d\d\d[-.]\d\d\d\d
The issue here is that it also matches the phone numbers that have both separators in them which I want to be termed as invalid/not a match. For example, with my expression, 123.456-7890 and 123-456.7890 show up as a match, something I do not want happening.
Is there a way to do that?
Use a backreference:
^\d{3}([.-])\d{3}\1\d{4}$
Here is an explanation of the regex:
^ from the start of the number
\d{3} match any 3 digits
([.-]) then match AND capture either a dot or a dash separator
\d{3} match any 3 digits
\1 match the SAME separator seen earlier
\d{4} match any 4 digits
$ end of the number
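A short Python check of the backreference pattern against the numbers from the question follows; is_valid_phone is a hypothetical helper name.

```python
import re

# ([.-]) captures the first separator; \1 demands the same one again.
PHONE_RE = re.compile(r'^\d{3}([.-])\d{3}\1\d{4}$')

def is_valid_phone(number):
    """True only when both separators are the same character."""
    return PHONE_RE.match(number) is not None

candidates = ["123.456.7890", "123-456-7890", "123.456-7890", "123-456.7890"]
results = {c: is_valid_phone(c) for c in candidates}
print(results)
```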
You can use this regex:
^\d{3}([-.])\d{3}\1\d{4}$
You can see that it works here.
The key point here is that you capture the separator character using parentheses, ([-.]), and then reuse it with the backreference \1.

Elastic search regex to get last 7 digits from right

I have data indexed in this format 676767 2343423 2344444 32494444. I need a regular expression for a pattern analyzer that extracts the last 7 digits from the right. Ex output: 2494444. The pattern we have tried, [0-9]{7}, is not working.
In Elasticsearch, the pattern is anchored by default. That means you cannot rely on partial matches: you need to match the entire string and capture the last 7 consecutive digits.
Use
.*([0-9]{7})
where
.* - will match any 0+ chars other than newline (as many as possible) and then will backtrack to match...
([0-9]{7}) - 7 digits placed into Capture group 1.
The Sense plug-in returns the captured value if a capturing group is defined in the regular expression pattern, so, no additional extraction work (or group accessing work) needs to be done.
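The effect of anchoring can be imitated in Python with re.fullmatch. This is only a sketch of the matching behaviour, not of Elasticsearch itself, and the variable names are illustrative.

```python
import re

value = "676767 2343423 2344444 32494444"

# Anchored on both ends, the bare pattern cannot cover the whole string.
bare = re.fullmatch(r'[0-9]{7}', value)

# Greedy .* consumes everything, then backtracks just enough to leave
# the final 7 digits for the capturing group.
m = re.fullmatch(r'.*([0-9]{7})', value)
last_seven = m.group(1) if m else None

print(bare, last_seven)
```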