Capture number if string contains "X", but limit match (cannot use groups) - regex

I need to extract numbers like 2.268 out of strings that contain the word output:
Approxmiate output size of output: 2.268 kilobytes
But ignore it in strings that don't:
some entirely different string: 2.268 kilobytes
This regex:
(?:output.+?)([\d\.]+)
Gives me a match with 1 group, with the group being 2.268 for the target string. But since I'm not using a programming language but rather CloudWatch Logs Insights, I need a way to match only the number itself, without using groups.
I could use a positive lookbehind ?<= in order to not consume the string at all, but then I don't know how to throw away size of output: without using .+, which positive lookbehind doesn't allow.

With your shown samples, please try following regex.
output:\D+\K\d+(?:\.\d+)?
Online demo for above regex
Explanation: Adding detailed explanation for above.
output:\D+ ##Matching "output:" followed by 1 or more non-digits.
\K ##\K forgets everything matched so far, so only what follows is reported as the match.
\d+(?:\.\d+)? ##Matching 1 or more digits followed by an optional dot and digits.

Since you are using PCRE, you can use
output.*?\K\d[\d.]*
See the regex demo. This matches
output - a fixed string
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that removes all text matched so far from the overall match memory buffer
\d - a digit
[\d.]* - zero or more digits or periods.
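
For anyone who wants to check the filtering behaviour outside CloudWatch, here is a minimal Python sketch. Python's built-in re module has no \K, so this sketch assumes a capturing group as a stand-in; it only illustrates which strings the pattern accepts, not the group-free match itself. The function name extract_output_number is made up for this example.

```python
import re

# Python's built-in re has no \K, so a capturing group stands in for it here.
NUMBER_AFTER_OUTPUT = re.compile(r"output.*?(\d[\d.]*)")


def extract_output_number(line):
    """Return the first number that follows the word "output", or None."""
    match = NUMBER_AFTER_OUTPUT.search(line)
    return match.group(1) if match else None


# Only the line containing "output" yields a number.
print(extract_output_number("Approxmiate output size of output: 2.268 kilobytes"))
print(extract_output_number("some entirely different string: 2.268 kilobytes"))
```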

Using REGEXEXTRACT on an IMPORTRANGE in Google Docs

I am importing a range from another Google sheet and I need to pull a specific number from the data that is imported. The data looks something like:
R2.word.4.word
I want to extract the second number. It will always follow this format (a letter and a number then a period then a word then a period then a number (might be single or double digit) then a period and a word). The regex to extract the second number should be: (\d+)(?!.*\d) and I have tested it in multiple regex test sites. However, Google docs gives me an error stating it is not a regular expression. I tried something like this (edited out URL and the sheet name):
=REGEXEXTRACT(IMPORTRANGE(URL,Sheet!A2:A200), "(\d+)(?!.*\d"))
Can anyone help me understand how I can fix this?
And the other issue here is that it isn't actually importing the range. I only get it to import on the first cell and not down the column.
You could write a pattern like:
=REGEXEXTRACT(A2,"^[A-Z]\d+\.\w+\.(\d+)")
Explanation
^ Start of string
[A-Z] Match a single uppercase char
\d+ Match 1+ digits
\. Match a dot
\w+ Match 1+ word characters
\. Match a dot
(\d+) Capture group 1, match 1+ digits
Regex demo
With your shown samples please try following regex.
=REGEXEXTRACT(A2,"^[a-zA-Z]\d+\.[^.]*\.(\d+)\.\S+$")
Here is the Online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^[a-zA-Z] ##From starting of value matching a-zA-Z here.
\d+ ##Matching 1 or more occurrences of digits.
\.[^.]*\. ##Matching literal dot till next occurrence of dot here.
(\d+) ##Creating 1 capturing group and which has 1 or more digits matching in it.
\.\S+$ ##Matching literal dot followed by 1 or more non-spaces till end of value.
"It will always follow this format"
Based on the above, you can use REGEXEXTRACT(), but it's slow compared to a simple SPLIT(), which is ideal for your standardized format:
Formula in B2:
=INDEX(SPLIT(A2:A3,"."),0,3)
This is an array-formula by default and will spill all values down. Just apply it to your entire range.
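
As far as I know, the regex functions in Google Sheets use RE2, which has no lookahead support, so (\d+)(?!.*\d) is rejected there even though it works in other testers. To sanity-check the two approaches above outside Sheets, here is a rough Python sketch. The function names are hypothetical, and Python's re is not RE2, so treat it as an illustration of the logic only.

```python
import re

# Same shape as the first answer's pattern: letter, digits, dot, word, dot, digits.
SECOND_NUMBER = re.compile(r"^[A-Z]\d+\.\w+\.(\d+)")


def second_number_regex(value):
    """Extract the second number with a capture group, or None if no match."""
    match = SECOND_NUMBER.match(value)
    return match.group(1) if match else None


def second_number_split(value):
    """Extract the third dot-separated field, mirroring INDEX(SPLIT(...),0,3)."""
    parts = value.split(".")
    return parts[2] if len(parts) > 2 else None


print(second_number_regex("R2.word.4.word"))
print(second_number_split("R2.word.12.word"))
```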

Is there an easier way to find regex of the string?

I'm just beginning with regex and just got stuck in a difficult situation.
The string i have is:
ITEM DESCRIPTION: KING AUTHUR 2LB FLOUR PACK: 10 SIZE: 0011.00 OZ
I need to get the parts within "< >":
ITEM DESCRIPTION: <KING AUTHUR 2LB FLOUR> PACK: <10> SIZE: <0011.00 OZ>
I've tried
: *([\w\.]+ ?[\w]* [\d\w]* *[\w]*)
which is not 100% accurate and feels repetitive and also becomes tedious when the text gets longer (multiple key:value).
Is there a generalized way of getting all the values from a key:value pair from a text of indefinite length?
And also, why does something like (ITEM).*: not stop at ITEM DESCRIPTION: but select all the way up to ITEM DESCRIPTION: ... SIZE: if I just want to get the first key?
Here is one way with a PCRE-compatible regex:
:\s*\K.*?(?=\s*\w+:|$)
See the regex demo.
An ECMAScript 2018+/.NET/Python PyPi regex compliant pattern is
(?<=:\s*\b).*?(?=\s*\w+:|$)
See this regex demo.
For the rest, you may rely on capturing:
:\s*(.*?)(?=\s*\w+:|$)
See the regex demo.
Details:
:\s* - a colon and zero or more whitespaces
\K - a match reset operator that discards the whole text matched so far from the match memory buffer
(?<=:\s*\b) - a positive lookbehind that matches a location that is immediately preceded with :, zero or more whitespaces and a word boundary
.*? - any zero or more chars other than line break chars as few as possible
(?=\s*\w+:|$) - a positive lookahead that matches a location in string that is immediately followed with zero or more whitespaces, one or more word chars and then a colon, or end of string.
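
A small Python sketch of the capturing variant, in case it helps to see it run. Python's built-in re supports neither \K nor variable-length lookbehind, so only the third pattern applies there. The sketch also covers the side question about (ITEM).*: by comparing a greedy .* (which backtracks from the end of the line to the last colon) with a lazy .*? (which stops at the first colon).

```python
import re

TEXT = "ITEM DESCRIPTION: KING AUTHUR 2LB FLOUR PACK: 10 SIZE: 0011.00 OZ"

# Capture everything after a colon up to the next "key:" or the end of the string.
VALUE = re.compile(r":\s*(.*?)(?=\s*\w+:|$)")
values = VALUE.findall(TEXT)
print(values)

# Greedy .* runs to the end and backtracks to the LAST colon ...
greedy = re.search(r"(ITEM).*:", TEXT).group(0)
# ... while lazy .*? stops at the FIRST colon.
lazy = re.search(r"(ITEM).*?:", TEXT).group(0)
print(greedy)
print(lazy)
```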

Regex to find a line with two capture groups that match the same regex but are still different

I am trying to analyse my source code (written in C) for mismatched timer variable comparisons/assignments. I have a range of timers with different timebases (2-250 milliseconds). Every timer variable contains its granularity in milliseconds in its name (e.g. timer10ms), as does every timer-photo and define (e.g. fooTimer10ms, DOO_TIMEOUT_100MS).
Here are some example lines:
fooTimer10ms = timer10ms;
baaTimer20ms = timer10ms;
if (DIFF_100MS(dooTimer10ms) >= DOO_TIMEOUT_100MS)
if (DIFF_100MS(dooTimer10ms) < DOO_TIMEOUT_100MS)
I want to match those lines where the timebases do not correspond (in this case the second, third and fourth lines). So far I have this regex:
(\d{1,3}(?i)ms(?-i)).*[^\d](\d{1,3}(?i)ms(?-i))
that is capable of finding every line where there are two of those granularities. So instead of just line 2, 3 and 4 it matches all of them. The only idea I had to narrow it down is to add a negative lookbehind with a back-reference, like so:
(\d{1,3}(?i)ms(?-i)).*[^\d](\d{1,3}(?i)ms(?-i))(?<!\1)
but this is not allowed because a negative lookbehind has to have a fixed length.
I found these two questions (one, two) but the first does not have the restriction of having both capture groups being of the same kind and the second is looking for equal instances of the capture group.
If what I want can be achieved way easier, by using something other than regex, I would be happy to know. My mind is just stuck due to my belief that regex is capable of that and I am just not creative enough to use it properly.
One option is to match the timer part followed by the digits and use a negative lookahead with a backreference to assert that it does not occur at the right.
For the example data, a bit specific pattern using a range from 2-250 might be:
.*?(timer(?:2[0-4]\d|250|1?\d\d|[2-9])ms)\b\S*[^\S\r\n]*[<>]?=[^\S\r\n]*\b(?!\S*\1)\S+
The pattern matches
.*? Match any char except a newline, as few as possible (non-greedy)
( Capture group 1
timer Match literally
(?:2[0-4]\d|250|1?\d\d|[2-9]) Match a number in the range of 2-250
ms Match literally
)\b Close group and a word boundary
\S*[^\S\r\n]* Match optional non whitespace chars and optional spaces without newlines
[<>]?= Match an optional < or > and =
[^\S\r\n]*\b Match optional whitespace chars without a newline and a word boundary
(?!\S*\1) Negative lookahead, assert no occurrence of what is captured in group 1 in the value
\S+ Match 1+ non whitespace chars
Regex demo
Or perhaps a broader pattern matching 1-3 digits and optional whitespace chars which might also match a newline:
.*?(timer\d{1,3}ms\b)\S*\s*[<>]?=\s*\b(?!.*\1)\S+
Regex demo
Note that the quantifier must be written {1,3} rather than {1-3}, and that \d{1,3} can also match values outside the 2-250 range, such as 999.
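
To try this on the sample lines, here is a Python sketch under a couple of assumptions. First, the sample names use a capital T (fooTimer10ms), so the broader pattern is compiled with re.IGNORECASE, which I am assuming the demo also enabled. Second, [<>]?= requires an = sign, so by my reading a bare < comparison like the fourth sample line is not matched by the pattern as written. Since the question also asks about something simpler than a single regex, the sketch adds a hypothetical helper, has_mixed_timebases, that collects every NNNms token on a line and flags the line when they differ.

```python
import re

LINES = [
    "fooTimer10ms = timer10ms;",
    "baaTimer20ms = timer10ms;",
    "if (DIFF_100MS(dooTimer10ms) >= DOO_TIMEOUT_100MS)",
    "if (DIFF_100MS(dooTimer10ms) < DOO_TIMEOUT_100MS)",
]

# Broader pattern from the answer; IGNORECASE is an assumption (see lead-in).
MISMATCH = re.compile(
    r".*?(timer\d{1,3}ms\b)\S*\s*[<>]?=\s*\b(?!.*\1)\S+", re.IGNORECASE
)
regex_flagged = [line for line in LINES if MISMATCH.search(line)]

# Alternative without backreferences: compare all granularities on the line.
TIMEBASE = re.compile(r"(?<!\d)(\d{1,3})ms", re.IGNORECASE)


def has_mixed_timebases(line):
    """True when a line mentions more than one distinct NNNms granularity."""
    return len(set(TIMEBASE.findall(line))) > 1


set_flagged = [line for line in LINES if has_mixed_timebases(line)]

print(regex_flagged)
print(set_flagged)
```

The set-based check treats DIFF_100MS and DOO_TIMEOUT_100MS as timebases too, which matches the expectation in the question that lines 2, 3 and 4 are flagged.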

Regex Exclude Number Within Two Characters of Number

I have some manually entered data (it's an email subject), and I am trying to extract the correct ID to perform a series of actions with RPA on.
RE:'HC=312-822-281' abc2-1234567 7354612
I have a regex query:
(?<!\d)\d{7}(?!\d)
I want to extract 7354612 but not 1234567.
I want to avoid matching any 7-digit number that is preceded with a hyphen, or a hyphen and a space.
My initial query works 80% of the time, but this hyphen issue is interfering with the other 20%.
You can modify the existing (?<!\d) lookbehind to also exclude the position after a hyphen, i.e. (?<![\d-]), and add another lookbehind to exclude the hyphen + space context ((?<!- ) or (?<!-\s)):
(?<![\d-])(?<!- )\d{7}(?!\d)
(?<![\d-])(?<!-\s)\d{7}(?!\d)
Note \s matches any whitespace. See the regex demo.
Details
(?<![\d-]) - a negative lookbehind that fails the match if there is a digit or a hyphen immediately to the left of the current location
(?<!-\s) - a negative lookbehind that fails the match if there is a - and a space after it immediately to the left of the current location
\d{7} - any seven digits
(?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current location.
Variations
With PCRE regex, you may also use
-\s*\d{7}(?!\d)(*SKIP)(*F)|(?<!\d)\d{7}(?!\d)
See the regex demo, where -\s*\d{7}(?!\d)(*SKIP)(*F)| matches -, 0+ spaces, seven digits after which there are no more digits and skips that match, only returning matches for the (?<!\d)\d{7}(?!\d) pattern.
In .NET, modern JavaScript and PyPi regex in Python, you may use
(?<!\d|-\s*)\d{7}(?!\d)
See this regex demo. Here, (?<!\d|-\s*) negative lookbehind fails the match if there is a digit or - + 0 or more whitespace chars immediately to the left of the current position.
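
If the RPA tool can run Python, the first pattern works unchanged with the built-in re module, because both lookbehinds are fixed-width. The (*SKIP)(*F) and (?<!\d|-\s*) variations will not compile there, since re supports neither backtracking control verbs nor variable-length lookbehind. A minimal sketch, with the second subject line made up to exercise the hyphen + space case:

```python
import re

# Two fixed-width lookbehinds: no digit or hyphen directly before,
# and no "hyphen + whitespace" directly before.
ID_PATTERN = re.compile(r"(?<![\d-])(?<!-\s)\d{7}(?!\d)")

subject = "RE:'HC=312-822-281' abc2-1234567 7354612"
spaced = "RE:'HC=312-822-281' abc2- 1234567 7354612"  # hypothetical variant

ids = ID_PATTERN.findall(subject)
ids_spaced = ID_PATTERN.findall(spaced)

print(ids)
print(ids_spaced)
```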

Using regex to determine straight (unordered hand)

A straight in poker is five cards in a row, for example 23456 or 89TJQ. With a "sorted" hand, the regex could be written as:
^(A2345|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)$
It's a bit verbose but straightforward enough. However, would it be possible to generate a (sensible) regex if the hand was unordered? For example, if the hand was 52634 or JQ89T?
One possible way would be to use a ?=.*<item> lookahead (which would essentially be "unsorted"), for example:
^(?:
(?=.*A)(?=.*2)(?=.*3)(?=.*4)(?=.*5)
|(?=.*2)(?=.*3)(?=.*4)(?=.*5)(?=.*6)
|(?=.*3)(?=.*4)(?=.*5)(?=.*6)(?=.*7)
|(?=.*4)(?=.*5)(?=.*6)(?=.*7)(?=.*8)
|(?=.*5)(?=.*6)(?=.*7)(?=.*8)(?=.*9)
|(?=.*6)(?=.*7)(?=.*8)(?=.*9)(?=.*T)
|(?=.*7)(?=.*8)(?=.*9)(?=.*T)(?=.*J)
|(?=.*8)(?=.*9)(?=.*T)(?=.*J)(?=.*Q)
|(?=.*9)(?=.*T)(?=.*J)(?=.*Q)(?=.*K)
|(?=.*T)(?=.*J)(?=.*Q)(?=.*K)(?=.*A)
)
.{5}$
Are there other / better approaches to finding if a straight exists using regex only?
You can use the following regex:
See regex in use here
(?!.*(.).*\1)(?:[A2345]{5}|[23456]{5}|[34567]{5}|[45678]{5}|[56789]{5}|[6789T]{5}|[789TJ]{5}|[89TJQ]{5}|[9TJQK]{5}|[TJQKA]{5})
This works by first using a negative lookahead to ensure that the string doesn't contain any duplicates (?!.*(.).*\1). Then it matches 5 characters from any of the straight possibilities.
(?!.*(.).*\1)
(?! ... ) - negative lookahead ensuring what follows doesn't match
.* - match any character any number of times
(.) - capture a character into capture group 1
.* - match any character any number of times
\1 - match the same text as most recently matched by the 1st capture group
Against JQQ89, it works as follows:
- .* matches J
- (.) captures Q
- .* matches nothing
- \1 tries to match Q (and succeeds)
- Negative lookahead has a match, so fail the match.
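
A quick way to try this is a short Python sketch. It assumes the hand is exactly five characters drawn from A23456789TJQK with no suits, and it uses fullmatch so that the whole hand has to be consumed. The function name is_straight is made up for the example.

```python
import re

# No repeated rank anywhere, then five characters from one straight's rank set.
STRAIGHT = re.compile(
    r"(?!.*(.).*\1)"
    r"(?:[A2345]{5}|[23456]{5}|[34567]{5}|[45678]{5}|[56789]{5}"
    r"|[6789T]{5}|[789TJ]{5}|[89TJQ]{5}|[9TJQK]{5}|[TJQKA]{5})"
)


def is_straight(hand):
    """True when the five ranks form a straight, in any order."""
    return STRAIGHT.fullmatch(hand) is not None


for hand in ("52634", "JQ89T", "JQQ89", "23457"):
    print(hand, is_straight(hand))
```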