Regexport() both integers and numbers with decimals - regex

I'm working in Google Sheets and wondering if it's possible to use one regexport() function to export both whole and partial numbers.
I have a column with:
1 Ml/ 2 Ml
2 Ml/ 2.02 Ml
3 Ml/ 4.01 Ml
and want a column with:
2
2.02
4.01
The first value could be 2.00 as well.
I was wondering if this is possible specifically with regular expressions. I know how to do it without. I currently have regexport(cell#, "\/\D(\d+)\D")
Thanks!

I think all you need as a pattern is:
(\d+(?:\.\d+)?) Ml$
( - 1St apture group.
\d+ - One or more digits.
(?: - Open non-capture group.
\.\d+ - A literal dot followed by one or more digits.
)? - Close non-capture group and make it optional.
) - Close 1st capture group.
Ml$ - Match "Ml" literally upto the end string ancor ($).
Add this to an ARRAYFORMULA() like:
=ARRAYFORMULA(REGEXEXTRACT(A1:A3,"(\d+(?:\.\d+)?) Ml$"))

without Regex
We want to grab a value encapsulated by slash-space on the left and space on the right:
=TRIM(MID(A1,FIND("/ ",A1)+2,FIND(" ",A1,FIND("/ ",A1)+2)-(FIND("/ ",A1)+2)))
(Both Excel and Google Sheets should work the same way. If we have to grab multiple instances, I would use Regex.)

use:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A1:A2, " (\d+\.\d+) Ml"),
REGEXEXTRACT(A1:A2, " (\d+) Ml")))

You can also try the simpler which takes care of errors and returns results as numbers at the same time.
=ArrayFormula(IFERROR(
REGEXEXTRACT(A1:A,"/ (.*) ")*1))

Use
=REGEXEXTRACT(A1, "(\d[\d.]*)\s*Ml$")
See proof
Explanation
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
[\d.]* any character of: digits (0-9), '.' (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
Ml 'Ml'
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string

Related

regex if (text contain this text) match this

I have these two sentence
TAGGING ODP:-7.160792, 113.496069
TAGGING pel:-7.160792, 113.496069
I want to match -7.160792 part only if the full sentence contain "odp" in it.
I tried the following (?(?=odp)-\d+.\d+) but it doesn't work, i don't know why.
Any help is appreciated.
(?(?=odp)-\d+\.\d+) won't work because (?=odp) is a positive lookahead that imposes a constraint on the pattern on the right, -\d+\.\d+. Namely, it requires odp string to occur exactly at the same location where - and a number are expected.
Use
(?<=ODP:)-\d+\.\d+
ODP:(-\d+\.\d+)
If lookbehinds are supported, the first variant is more viable.
Otherwise, another option with capturing groups is good to use.
And if odp can appear anywhere, even after the number:
(?i)^(?=.*odp).*(-\d+\.\d+)
This will capture the value into a group.
EXPLANATION
--------------------------------------------------------------------------------
(?i) set flags for this block (case-
insensitive) (with ^ and $ matching
normally) (with . not matching \n)
(matching whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
odp 'odp'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \1
You can use the regex, (?i)(?<=odp:)[^,]*.
Explanation:
(?i): Case-insenstitive flag
(?<=odp:): Positive lookbehind for odp:
[^,]*: Anything but ,
👉 If you want the match to be restricted to numbers only, you can use the regex, (?i)(?<=odp:)(?:-\d+.\d+)
Explanation:
(?i): Case-insenstitive flag
(?<=odp:): Positive lookbehind for odp:
(?:: Start non capturing group
-: Literal, -
\d+: 1+ digit(s)
.\d+: . followed by 1+ digit(s)
): End non capturing group
👉 If the sign can be either + or -, you can use the regex, (?i)(?<=odp:)(?:[+-]\d+.\d+)
The pattern (?(?=odp)\-\d+\.\d+) is using a conditional (? stating in the if clause:
If what is directly to the right from the current position is odp,
then match -\d+.\d+
That can not match.
What you also could do is match odp followed by any char other than a digit using \D* and capture the digit part in a group.
\bodp\b\D*(-\d+\.\d+)\b
The pattern matches:
\bodp\b match odp between word boundaries to prevent a partial match
\D* Optionally match any char other than a digit
(-\d+\.\d+) Capture - and 1+ digits with a decimal part in group 1
\b A word boundary
Regex demo
(?<=ODP:)(-\d+.\d+)
You can try using the negative look behind.
This should solve for the code you ve provided.

Matching regex after colon but before underscore

I have two strings below which i need to apply a regex function to in Google BigQuery with its desired outputs: Input:
MERCURE ENGAGEMENT_LaL_FB_TALENT:HENRIQUE_PORTUGAL_WEEK 4_IMAGE CAROUSEL_I19
MERCURE ENGAGEMENT_LaL_FB_UGC:_ENGLAND_TBC_WEEK 4_IMAGE CAROUSEL_I25
Output:
HENRIQUE
ENGLAND
I cannot use a reverse or positive look ahead within bigquery.
The closest I have gotten is the following:
:\D*
Which matches the word after the colon but before the white space.
Any ideas helpful
You might also use a capturing group with with REGEXP_EXTRACT.
:_?([^\s_]+)
Explanation
:_? Match : and an optional underscore
( Capture group 1
[^\s_]+ Match 1+ times any char other than a whitespace char or an underscore (Omit \s if there can also be spaces in between)
) Close group 1
Regex demo
You could also exclude matching an underscore from a word character which narrows down the range of accepted characters.
:_?([^\W_]+)
One approach uses REGEXP_REPLACE:
SELECT REGEXP_REPLACE(col, r'^.*:_?([^_]+)_.*$', r'\1') AS output
FROM yourTable;
Use
REGEXP_EXTRACT("column_name", r":[^a-zA-Z]*([a-zA-Z]+)")
See regex proof
Explanation
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
[^a-zA-Z]* any character except: 'a' to 'z', 'A' to
'Z' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[a-zA-Z]+ any character of: 'a' to 'z', 'A' to 'Z'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of \1

Fix for the catastrophic backtracking in the regex expression

I have encountered the following regex in the company's code base.
(?:.*)(\\bKEYWORD\\b(?:\\s+)+(?:.*)+;)(?:.*)
Breakdown of Regex:
Non Capturing group: (?:.*)
Capturing group (\\bKEYWORD\\b(?:\\s+)+(?:.*)+;)
2.1 Non Captuing group: (?:\\s+)+
2.2 Non Capturing group: (?:.*)+
Non Capturing group: (?:.*)
The above regex goes into catastrophic bactracking when it fails to match ; or the test sample becomes too long. Check below the two test samples:
1. -- KEYWORD the procedure if data match between Type 1 and Type 2/3 views is not required.
2. KEYWORD SET MESSAGE_TEXT = '***FAILURE : '||V_G_ERR_MSG_01; /*STATEMENT TO THROW USER DEFINED ERROR CODE AND EXCEPTION TO THE CALLER*/
I went though Runaway Regular Expression's article aswell and tried to use to Atomic Grouping but still no results. Can anybody help me how to fix this regex ?
According to the link you provided, there are patterns like (x+x+)+ in your expression: (?:\\s+)+ and the subsequent (?:.*)+. The first matches one or more whitespace characters one or more times, and the second matches indefinite amount of any chars one or more times. This hardly makes sense.
Non-capturing groups are unnecessary here.
Use
.*\\b(KEYWORD\\s.*;).*
See proof
Explanation
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
KEYWORD 'KEYWORD'
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
; ';'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))

regex filename of a unixpath without the first two digits

I have filename in a unix-path starting with two digits ... how can i extract the name without the extension
/this/is/my/path/to/the/file/01filename.ext should be filename
I currently have [^/]+(?=\.ext$) so I get 01filename, but how do I get rid of the first two digits?
You can add a look-behind in front of what you already have, looking for two digits:
(?<=\d\d)[^/]+(?=.ext$)
This only works if you have exactly two digits! Unfortunately, in most regex engines it is not possible to use quantifiers like * or + in lookbehinds.
(?<=\d\d) - checks for two digits before the match
[^/]+ - matches 1 or more characters, except /
(?=.ext$) - checks for .ext behind the match
Try this one :
/\d\d(.*?).\w{3}$
Explanation :
/\d\d : slash followed by two digit
(.*?) : the capture
.\w{3} : a dot followed by three letters
$ : end of string
It works for me on Expresso
Consider the following Regex...
(?<=\d{2})[^/]+(?=.ext$)
Good Luck!
A more general regex:
(?:^|\/)[\d]+([^.]+)\.[\w.]+$
Explanation:
(?: group, but do not capture:
^ the beginning of the string
| OR
\/ '/'
) end of grouping
[\d]+ any character of: digits (0-9) (1 or more
times (matching the most amount possible))
( group and capture to \1:
[^.]+ any character except: '.' (1 or more
times (matching the most amount
possible))
) end of \1
\. '.'
[\w\.]+ any character of: word characters (a-z, A-
Z, 0-9, _), '.' (1 or more times
(matching the most amount possible))
$ before an optional \n, and the end of the
string

Perl, delete everything after first three characters

I promise you all I've searched the site for about two hours now. I've found several that should have worked, but they didn't.
I have a line that consists of a varying amount of numbers separated by spaces. I want to delete everything after the third number.
I should say that everything I've been writing has been assuming that \S\s\S\s\S would match the first three numbers. with spaces between 1 and 2, and 2 and 3.
I anticipated the following working:
s/^.*?[\S\s\S\s\S].{5}//s;
but it did the exact opposite of what I wanted.
I would like 2 3 0 4 5 6 7 1 0 1 2 to become 2 3 0
I would really prefer to keep it substitution. I've tried look-behind as one person mentioned and I had no luck. Should I be saving the first 3 numbers as a string before I'm trying these commands?
EDIT:
I should have clarified that these numbers could be in the form 1.57 or 1.00E01 as well. I had integers when I was trying to get that to just baseline work.
\S\s\S\s\S will indeed match three non-space characters separated by space characters. However, ^.*?[\S\s\S\s\S].{5} does something completely different:
^ matches the beginning of the line.
.*? matches characters until the next match can start (not as many as it can). Since you specify /s, . will match newline as well.
[\S\s\S\s\S] is a character class, and so is the same as [\S\s]—match either \S or \s, which is to say anything.
.{5} will match five characters.
Since [\S\s] and . with /s match the same things, the .*? will never match any characters as it wants to match as little as possible. Thus, this is the same as s/^.{6}//s—delete the first six characters from the string. As you can see, that's not what you wanted!
One way to keep the first three numbers is to explicitly match them: s/^(\d \d \d).*/$1/s. Here, \d matches a single digit (0–9) with literal spaces in between them. We match the first three followed by anything at all, and then replace the whole match—since it ends in .*, that's the whole string—with just the bit in between parentheses, i.e. the first three numbers. If your numbers can be more than one digit long, then s/^(\d+ \d+ \d+).*/$1/s will do what you want; if you can have arbitrary space-like characters (space, tab, newline) separating them, then s/^(\d\s\d\s\d\s).*/$1/s is what you want (or \s+ if you can have multiple spaces). If you want to catch lines which have things other than digits, you can use \S or \S+, just as you were.
Another approach, using lookbehind, would be s/(?<=^\d \d \d).*//s. In other words, delete any characters which are preceded by ^\d \d \d—the beginning of the string followed by three space-separated numbers. There's no real advantage to this approach—I'd probably do it the other way—but since you mentioned lookbehind, here's how you can do it. (Again, things like s/(?<=^\S\s\S\s\S).*//s are more general.)
So match the first three numbers explicitly, and drop everything else.
s/^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$/$1 $2 $3/;
This works as follows:
$ perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$})->explain;'
The regular expression:
(?-imsx:^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(Updated in consideration of the changes the OP made to the original specification.)
Your code where you say s/^.*?[\S\s\S\s\S].{5}//s;
I would write as: s/^(\S\s\S\s\S).*$/$1/
You're forgetting to use a $1 to capture the part of the substitution that you want to keep, and having a .* at the beginning could lead to starting numbers being removed instead of trailing numbers.
Also, I'm not sure if you have some guarantee of single digit numbers, or of single whitespace characters, so you could write the code with s/^(\S+\s+\S+\s+\S+).*$/$1/ to capture all of the spaces and all of the digits.
Let me know if I need to clarify that a little more.
Here's a website I find super helpful for Perl regex: http://pubcrawler.org/perl-reference.html
Question is, why do u want to do such a thing with regexp? it seems easier to me with:
substr $string, 5;
or if u really want to (I didn't test):
s/^(.{5})(.*)/$1/
parentheses allows you to "remember" patterns, this is the way to say that you want to replace pretty much everything with just the first part of the pattern (the first five characters). this pattern will match any line of text and leave just the first 5 characters maybe you want to modify it to match 3 digits with spaces between them