Using REGEXEXTRACT on an IMPORTRANGE in Google Docs

I am importing a range from another Google sheet and I need to pull a specific number from the data that is imported. The data looks something like:
R2.word.4.word
I want to extract the second number. It will always follow this format (a letter and a number then a period then a word then a period then a number (might be single or double digit) then a period and a word). The regex to extract the second number should be: (\d+)(?!.*\d) and I have tested it in multiple regex test sites. However, Google docs gives me an error stating it is not a regular expression. I tried something like this (edited out URL and the sheet name):
=REGEXEXTRACT(IMPORTRANGE(URL,Sheet!A2:A200), "(\d+)(?!.*\d"))
Can anyone help me understand how I can fix this?
And the other issue here is that it isn't actually importing the range. I only get it to import on the first cell and not down the column.

You could write a pattern like:
=REGEXEXTRACT(A2,"^[A-Z]\d+\.\w+\.(\d+)")
Explanation
^ Start of string
[A-Z] Match a single uppercase char
\d+ Match 1+ digits
\. Match a dot
\w+ Match 1+ word characters
\. Match a dot
(\d+) Capture group 1, match 1+ digits
Regex demo
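The pattern above can be checked outside Sheets with a minimal Python sketch. Python's re module is a different engine from the one Sheets uses, so this only confirms the pattern logic against the sample value from the question; the helper name second_number is hypothetical.

```python
import re
from typing import Optional

# Pattern copied from the answer above.
PATTERN = re.compile(r"^[A-Z]\d+\.\w+\.(\d+)")

def second_number(value: str) -> Optional[str]:
    """Return the digits after the second dot, or None if the format differs."""
    match = PATTERN.search(value)
    return match.group(1) if match else None

print(second_number("R2.word.4.word"))   # 4
print(second_number("R2.word.12.word"))  # 12
```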

With your shown samples, please try the following regex.
=REGEXEXTRACT(A2,"^[a-zA-Z]\d+\.[^.]*\.(\d+)\.\S+$")
Here is the Online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^[a-zA-Z] ##From starting of value matching a-zA-Z here.
\d+ ##Matching 1 or more occurrences of digits.
\.[^.]*\. ##Matching literal dot till next occurrence of dot here.
(\d+) ##Creating 1 capturing group and which has 1 or more digits matching in it.
\.\S+$ ##Matching literal dot followed by 1 or more non-spaces till end of value.

"It will always follow this format"
Based on the above, you can use REGEXEXTRACT(), but it's slow compared to a simple SPLIT(), which is ideal for your standardized format:
Formula in B2:
=INDEX(SPLIT(A2:A3,"."),0,3)
This is an array-formula by default and will spill all values down. Just apply it to your entire range.
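For comparison, the same split-and-index idea can be sketched in Python. The helper name third_field and the second sample value are made up for illustration. Note that Python keeps the result as a string, whereas Sheets may convert it to a number.

```python
def third_field(value: str) -> str:
    # Split on "." and take the third piece (index 2),
    # mirroring INDEX(SPLIT(...), 0, 3) in the formula above.
    return value.split(".")[2]

values = ["R2.word.4.word", "R10.other.12.word"]
print([third_field(v) for v in values])  # ['4', '12']
```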

Related

Regex Help required for User-Agent Matching

Have used an online regex learning site (regexr) and created something that works but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org

This is an explanation of your regex pattern; if you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also literal, so the escape is not needed here. The - is followed by \d+, which means one or more digits from 0-9 (there must be at least one digit). Whatever this group matches is saved in the first capturing group.
(?:\w)+ Non-capturing group: \w matches one character and is equal to [A-Za-z0-9_]. The plus sign + after the group means match the whole group one or more times; since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here, and \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02, and \- matches the hyphen after it. \d+ (one or more digits) first matches the 02 after scan-, so the value so far is scan-02. Then (?:\w)+ must match at least one word character; it tries to match the period . and fails, because the period is not a word character. Is it over at this point? No: the regex engine backtracks into the previous \d+, which this time matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ then matches the 2. Why does the engine go back to \d+? Because both \d+ and (?:\w)+ carry a +, meaning at least one digit and at least one word character respectively, and the engine tries to satisfy both literally.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2, and (?:\w)+ tries to match the period after scan-2 but fails. This is the important point: \d+ holds only one digit and cannot give any back, so the engine returns to the beginning of the string scan-2.shadowserver.org and tries to match (scan\-\d+) again, starting from the character c. The s in (scan\-\d+) fails to match c, the engine keeps trying at each later position, and in the end the match fails.
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
See regex demo
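The backtracking behaviour described above can be observed with a short Python sketch that runs the original and the simplified pattern over the input text from the question. It uses Python's re module rather than the IIS URL Rewrite engine, so treat it as a check of the pattern logic only.

```python
import re

# Original pattern from the question and the simplified one from this answer.
ORIGINAL = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")
FIXED = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

hosts = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

for host in hosts:
    old = ORIGINAL.search(host)
    new = FIXED.search(host)
    # Show group 1 for each pattern, or None when the pattern does not match.
    print(host, old.group(1) if old else None, new.group(1) if new else None)
```

The original pattern returns no match for scan-2.shadowserver.org and captures only scan-0 for scan-02.shadowserver.org, while the simplified pattern matches every host in the list.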

regex match two words based on a matching substring

there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with the above regex is that as soon as the value ends with any of the characters of "QUERY_answer", the regex doesn't match any value for the group. For instance, the 2 strings below don't match at all, because VALUESTU ends with "U":
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?

With your shown samples, please try the following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.

You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.
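Both answers rely on the lazy (.*?) group. As a rough illustration, the Python sketch below uses the anchored 8-digit variant to pair up file names by the captured value; the grouping dictionary is an assumption about how the matches might be consumed, not part of either answer.

```python
import re
from collections import defaultdict

# Anchored 8-digit pattern from the first answer.
PATTERN = re.compile(r"^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$")

names = [
    "ABC_FIXED_20220720_VALUEABC.csv",
    "ABC_FIXED_20220720_VALUEABCQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUESTU.csv",
    "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv",
]

# Group file names by the value captured in group 1.
groups = defaultdict(list)
for name in names:
    match = PATTERN.match(name)
    if match:
        groups[match.group(1)].append(name)

for key, members in groups.items():
    print(key, members)
```

The VALUESTU pair, which the negated character class in the question rejected, ends up under the same key as expected.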

Regex to remove time zone stamp

In Google Sheets, I have time stamps with formats like the following:
5/25/2022 14:13:05
5/25/2022 13:21:07 EDT
5/25/2022 17:07:39 GMT+01:00
I am looking for a regex that will remove everything after the time, so the desired output would be:
5/25/2022 14:13:05
5/25/2022 13:21:07
5/25/2022 17:07:39
I have come up with the following regex after some trial and error, although I am not sure if it is prone to errors: [^0-9:\/' '\n].*
And the function in Google Sheets that I plan to use is REGEXREPLACE().
My goal is to be able to do calculations regardless of one's time zone, however the result will be stamped with the user's local time zone.
Could someone confirm this is correct? Appreciate any feedback I can get!

You can use
=REGEXREPLACE(A1, "^(\S+\s\S+).*", "$1")
=REGEXREPLACE(A1, "^([\d/]+\s[\d:]+).*", "$1")
See the regex demo #1 / regex demo #2.
Details:
^ - start of string
(\S+\s\S+) - Group 1: one or more non-whitespaces, a single whitespace and one or more non-whitespaces
[\d/]+\s[\d:]+ - one or more digits or / chars, a whitespace, one or more digits or colons
.* - any zero or more chars other than line break chars as many as possible.
The $1 is a replacement backreference that refers to the Group 1 value.
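The first pattern can be tried against the three sample timestamps with a small Python sketch. Python's re.sub writes the backreference as \1 where Sheets uses $1, and the function name strip_zone is hypothetical.

```python
import re

# First pattern from the answer above.
PATTERN = re.compile(r"^(\S+\s\S+).*")

def strip_zone(stamp: str) -> str:
    # Keep only group 1 (date, one whitespace, time) and drop the rest.
    return PATTERN.sub(r"\1", stamp)

samples = [
    "5/25/2022 14:13:05",
    "5/25/2022 13:21:07 EDT",
    "5/25/2022 17:07:39 GMT+01:00",
]
for sample in samples:
    print(strip_zone(sample))
```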

With your shown samples and attempts, please try the following regex in REGEXREPLACE. It matches the time stamp specifically. Here is the Online demo for the following regex. It creates only 1 capturing group, which replaces the whole value (as per the requirement).
=REGEXREPLACE(A1, "^((?:\d{1,2}\/){2}\d{4}\s+(?:\d{1,2}:){2}\d{1,2}).*", "$1")
Explanation: Adding detailed explanation for above used regex.
^( ##Matching from starting of the value and creating/opening one and only capturing group.
(?:\d{1,2}\/){2} ##Creating a non-capturing group and matching 1 to 2 digits followed by / with 2 times occurrence here.
\d{4}\s+ ##Matching 4 digits occurrence followed by 1 or more spaces here.
(?:\d{1,2}:){2} ##In a non-capturing group matching 1 to 2 occurrences of digits followed by colon and this combination should occur 2 times.
\d{1,2} ##Matching 1 to 2 occurrences of digits.
) ##Closing capturing group here.
.* ##This will match everything till the end, but it's not captured.

You can do this without REGEX by simply splitting and adding the first and second index.
=ARRAYFORMULA(
IF(ISBLANK(A2:A),,
INDEX(SPLIT(A2:A," "),0,1)+
INDEX(SPLIT(A2:A," "),0,2)))
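A comparable split-based approach in Python might look like the sketch below. It assumes the month/day/year order and 24-hour time shown in the samples; the function name parse_stamp is made up for illustration.

```python
from datetime import datetime

def parse_stamp(stamp: str) -> datetime:
    # Keep the first two space-separated pieces (date and time)
    # and ignore any trailing zone label.
    date_part, time_part = stamp.split(" ")[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H:%M:%S")

print(parse_stamp("5/25/2022 17:07:39 GMT+01:00"))
```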

Capture number if string contains "X", but limit match (cannot use groups)

I need to extract numbers like 2.268 out of strings that contain the word output:
Approxmiate output size of output: 2.268 kilobytes
But ignore it in strings that don't:
some entirely different string: 2.268 kilobytes
This regex:
(?:output.+?)([\d\.]+)
Gives me a match with 1 group, with the group being 2.268 for the target string. But since I'm not using a programming language but rather CloudWatch Log Insights, I need a way to only match the number itself without using groups.
I could use a positive lookbehind ?<= in order to not consume the string at all, but then I don't know how to throw away size of output: without using .+, which positive lookbehind doesn't allow.

With your shown samples, please try the following regex.
output:\D+\K\d+(?:\.\d+)?
Online demo for above regex
Explanation: Adding detailed explanation for above.
output:\D+ ##Matching output colon followed by non-digits(1 or more occurrences)
\K ##\K to forget previous matched values to make sure we get only further matched values in this expression.
\d+(?:\.\d+)? ##Matching 1 or more digits followed by an optional dot and digits.

Since you are using PCRE, you can use
output.*?\K\d[\d.]*
See the regex demo. This matches
output - a fixed string
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that removes all text matched so far from the overall match memory buffer
\d - a digit
[\d.]* - zero or more digits or periods.
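Python's built-in re module has no \K, so the sketch below stands in a capture group for it purely to show which text the pattern above would leave as the match. It is not a way around the no-groups restriction in the question, and the function name extract_size is hypothetical.

```python
import re
from typing import Optional

# Same pattern as above, with a capture group in place of \K
# (an adaptation for Python's re module, not the answer's exact regex).
PATTERN = re.compile(r"output.*?(\d[\d.]*)")

def extract_size(line: str) -> Optional[str]:
    match = PATTERN.search(line)
    return match.group(1) if match else None

print(extract_size("Approxmiate output size of output: 2.268 kilobytes"))
print(extract_size("some entirely different string: 2.268 kilobytes"))
```

The first line yields 2.268 and the second yields None, because it does not contain the word output.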

Regex to get value from <key, value> by asserting conditions on the value

I have a regex which takes the value from the given key as below
Regex: .*key="([^"]*)".*
Input value: key="abcd-qwer-qaa-xyz-vwxc"
Output: abcd-qwer-qaa-xyz-vwxc
But on top of this, I need to validate that the value starts with abcd- and contains the pattern -xyz somewhere after it.
Thus, the input and outputs has to be as follows:
I tried below which is not working as expected
.*key="([^"]*)"?(/Babcd|-xyz).*
The key value pair is part of the large string as below:
object{one="ab-vwxc",two="value1",key="abcd-eest-wd-xyz-bnn",four="obsolete Values"}
I think that by matching the key it takes the value, and that's why I used .*key="([^"]*)".*
Note:
It's a dashboard. You can refer to this link and search for Regex: /"([^"]+)"/. This regex is applied to the query result, which is the string I referred to. It works with the regex .*key="([^"]*)".* above. I'm trying to alter that regex group itself. Hope this helps.
Can anyone guide or suggest me on this please? That would be helpful. Thanks!

Looks like you could do with:
\bkey="(abcd(?=.*-xyz\b)(?:-[a-z]+){4})"
See the demo online
\bkey=" - A word-boundary and literally match 'key="'
( - Open 1st capture group.
abcd - Literally match 'abcd'.
(?=.*-xyz\b) - Positive lookahead for zero or more characters (except newline) followed by literally '-xyz' and a word-boundary.
(?: - Open non-capturing group.
-[a-z]+ - Match a hyphen followed by at least a single lowercase letter.
){4} - Close non-capture group and match it 4 times.
) - Close 1st capture group.
" - Match a literal double quote.
I'm not 100% sure you'd only want to allow lowercase letters, so you can adjust that part if need be. The whole pattern validates the input value, whereas you could use capture group one to grab your key.
Update after edited question with new information:
Prometheus uses the RE2 engine in all regular expressions. Therefore the above suggestion won't work due to the lookarounds. A less restrictive but possible answer for OP could be:
\bkey="(abcd(?:-\w+)*-xyz(?:-\w+)*)"
See the online demo
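Because this pattern avoids lookarounds, its logic can be checked with any engine. The Python sketch below runs it against the sample string from the question plus two made-up counterexamples; the function name key_value is hypothetical.

```python
import re
from typing import Optional

# Lookaround-free pattern from the update above.
PATTERN = re.compile(r'\bkey="(abcd(?:-\w+)*-xyz(?:-\w+)*)"')

def key_value(text: str) -> Optional[str]:
    match = PATTERN.search(text)
    return match.group(1) if match else None

sample = 'object{one="ab-vwxc",two="value1",key="abcd-eest-wd-xyz-bnn",four="obsolete Values"}'
print(key_value(sample))                      # abcd-eest-wd-xyz-bnn
print(key_value('key="abcd-qwer-qaa-vwxc"'))  # None (no -xyz segment)
print(key_value('key="zzzz-qwer-xyz-vwxc"'))  # None (does not start with abcd)
```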

Will this work?
Pattern
\bkey="(abcd-[^"]*\bxyz\b[^"]*)"
Demo

You could use the following regular expression to verify the string has the desired format and to match the portion of the string that is of interest.
(?<=\bkey=")(?=.*-xyz(?=-|$))abcd(?:-[a-z]+)+(?=")
Start your engine!
Note there are no capture groups.
The regex engine performs the following operations.
(?<=\bkey=") : positive lookbehind asserts the current position in the string is preceded by 'key="'
(?= : begin positive lookahead
.*-xyz : match 0+ characters, then '-xyz'
(?=-|$) : positive lookahead asserts the current position is followed by '-' or is at the end of the string
) : end positive lookahead
abcd : match 'abcd'
(?: : begin non-capture group
-[a-z]+ : match '-' followed by 1+ characters in the class [a-z]
)+ : end non-capture group and execute it 1+ times
(?=") : positive lookahead asserts the current position is followed by '"'
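The lookaround version can be tried in Python, whose re module accepts this fixed-width lookbehind. The sketch below is only a check of the pattern against the sample string from the question; the second and third inputs are made up for illustration.

```python
import re
from typing import Optional

# Pattern from the answer above; the whole match is the value, no groups needed.
PATTERN = re.compile(r'(?<=\bkey=")(?=.*-xyz(?=-|$))abcd(?:-[a-z]+)+(?=")')

def key_value(text: str) -> Optional[str]:
    match = PATTERN.search(text)
    return match.group(0) if match else None

sample = 'object{one="ab-vwxc",two="value1",key="abcd-eest-wd-xyz-bnn",four="obsolete Values"}'
print(key_value(sample))                      # abcd-eest-wd-xyz-bnn
print(key_value('key="abcd-qwer-qaa-vwxc"'))  # None
```

One caveat worth noting: the .* inside the lookahead is not limited to the quoted value, so a -xyz- that appears later in the same line, in another field, would also satisfy it.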
