Regex: How to Implement Negative Lookbehind in PL/SQL

How do I match all the strings that begin with lookup. and end with _id but are not prefixed by msg? Here are some examples:
lookup.asset_id -> should match
lookup.msg_id -> shouldn't match
lookup.whateverelse_id -> should match
I know Oracle does not support negative lookbehind (i.e. (?<!))... so I've tried to explicitly enumerate the possibilities using alternation:
if regexp_count('i_asset := lookup.asset_id;', 'lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id') <> 0 then
  dbms_output.put_line('match'); -- this matches as expected
end if;
if regexp_count('i_msg := lookup.msg_id;', 'lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id') <> 0 then
  dbms_output.put_line('match'); -- this shouldn't match
  -- but it does, like the previous example... why?
end if;
The second regexp_count expression shouldn't match, but it does, just like the first one. Am I missing something?
EDIT
In the real use case, I have a string that contains PL/SQL code that might contain more than one lookup.xxx_id instance:
declare
l_source_code varchar2(2048) := '
...
curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');
asset := lookup.asset_id(key_val => ''UBSN''); -- this is wrong since it does
-- not specify key_type
...
msg := lookup.msg_id(key_val => ''hello''); -- this is fine since msg_id does
-- not require key_type
';
...
end;
I need to determine whether there is at least one wrong lookup, i.e. all occurrences, except lookup.msg_id, must also specify the key_type parameter.

With lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id, you are basically asking to check for a string
starting with lookup. denoted by lookup\.,
followed by at least one character different from ( denoted by [^\(]+,
followed by either -- ( | | )
one character different from m -- [^m], or
two characters: m plus no s -- m[^s], or
three characters: ms and no g -- ms[^g], and
ending in _id denoted by _id.
So, for lookup.msg_id, the first part obviously matches, the second part ([^\(]+) backtracks until it consumes only ms, and that leaves the g to be matched by the first alternative ([^m]) of the third part.
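The backtracking described above can be observed outside Oracle. The following is a minimal sketch in Python, assuming that Python's `re` engine treats this particular pattern the same way Oracle does; the helper name `trace` is hypothetical and only exists for this illustration.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id")

def trace(text):
    """Return (whole match, text captured by the alternation group)."""
    m = PATTERN.search(text)
    return (m.group(0), m.group(1)) if m else None

for sample in ("i_asset := lookup.asset_id;", "i_msg := lookup.msg_id;"):
    # For lookup.msg_id the group captures "g": [^\(]+ gave back
    # characters until [^m] could match the last letter before _id.
    print(sample, "->", trace(sample))
```

The group capturing a single `g` is what makes the unwanted match possible: nothing forces the alternation to look at the whole `msg` prefix.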
This could be fixed by patching the third part so that it is always three characters long, as in lookup\.[^\(]+([^m]..|m[^s].|ms[^g])_id. This, however, would fail whenever the part between lookup. and _id is not at least four characters long:
WITH
Input (s, r) AS (
SELECT 'lookup.asset_id', 'should match' FROM DUAL UNION ALL
SELECT 'lookup.msg_id', 'shouldn''t match' FROM DUAL UNION ALL
SELECT 'lookup.whateverelse_id', 'should match' FROM DUAL UNION ALL
SELECT 'lookup.a_id', 'should match' FROM DUAL UNION ALL
SELECT 'lookup.ab_id', 'should match' FROM DUAL UNION ALL
SELECT 'lookup.abc_id', 'should match' FROM DUAL
)
SELECT
  r, s, INSTR(s, 'lookup.msg_id') has_msg,
  REGEXP_COUNT(s, 'lookup\.[^\(]+([^m]..|m[^s].|ms[^g])_id') matched
FROM Input
;
| R | S | HAS_MSG | MATCHED |
|-----------------|------------------------|---------|---------|
| should match | lookup.asset_id | 0 | 1 |
| shouldn't match | lookup.msg_id | 1 | 0 |
| should match | lookup.whateverelse_id | 0 | 1 |
| should match | lookup.a_id | 0 | 0 |
| should match | lookup.ab_id | 0 | 0 |
| should match | lookup.abc_id | 0 | 0 |
If you just have to make sure that there is no msg in the position in question, you might want to go for
(INSTR(s, 'lookup.msg_id') = 0) AND REGEXP_COUNT(s, 'lookup\.[^\(]+_id') <> 0
For code clarity REGEXP_INSTR(s, 'lookup\.[^\(]+_id') > 0 might be preferable…
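The same two-step check (first rule out the literal lookup.msg_id, then look for any lookup.…_id) can be sketched in Python. This is only an illustration of the logic, not Oracle code; the function name `is_wanted_lookup` is an assumption made for the example.

```python
import re

# Any lookup.<something>_id, where <something> contains no "(".
LOOKUP = re.compile(r"lookup\.[^\(]+_id")

def is_wanted_lookup(s: str) -> bool:
    """True if s contains a lookup.…_id and no literal lookup.msg_id."""
    return "lookup.msg_id" not in s and LOOKUP.search(s) is not None

for s in ("lookup.asset_id", "lookup.msg_id", "lookup.a_id"):
    print(s, is_wanted_lookup(s))
```

Note that this check looks at the string as a whole, so a string that contains both lookup.msg_id and another lookup is rejected.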
@j3d Just comment if further detail is required.

With the requirements still being kind of vague…
Split the string at the semicolon.
Check each substring s to comply:
WITH Input (s) AS (
SELECT ' curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');' FROM DUAL UNION ALL
SELECT 'curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');' FROM DUAL UNION ALL
SELECT 'asset := lookup.asset_id(key_val => ''UBSN'');' FROM DUAL UNION ALL
SELECT 'msg := lookup.msg_id(key_val => ''hello'');' FROM DUAL
)
SELECT
s
FROM Input
WHERE REGEXP_LIKE(s, '^\s*[a-z]+\s+:=\s+lookup\.msg_id\(key_val => ''[a-zA-Z0-9]+''\);$')
OR
((REGEXP_INSTR(s, '^\s*[a-z]+\s+:=\s+lookup\.msg_id') = 0)
AND (REGEXP_INSTR(s, '[(,]\s*key_type') > 0)
AND (REGEXP_INSTR(s,
'^\s*[a-z]+\s+:=\s+lookup\.[a-z]+_id\(( ?key_[a-z]+ => ''[a-zA-Z_]+?'',?)+\);$') > 0))
;
| S |
|--------------------------------------------------------------------------|
|[tab] curry := lookup.curry_id(key_val => 'CHF', key_type => 'asset_iso');|
| curry := lookup.curry_id(key_val => 'CHF', key_type => 'asset_iso');|
| msg := lookup.msg_id(key_val => 'hello');|
This would tolerate a superfluous comma right before the closing parenthesis. But if the input is syntactically correct, such a comma won't exist.
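The split-then-check idea can also be sketched in Python. This is a rough illustration under several assumptions: statements are separated by semicolons, no comment or string literal contains a semicolon, and each statement holds at most one lookup call. The names `wrong_lookups`, `CALL` and `KEY_TYPE` are hypothetical.

```python
import re
from typing import List

# A call of the form lookup.<name>_id( ... ), capturing <name>.
CALL = re.compile(r"lookup\.([a-z]+)_id\s*\(")
# key_type used as a named argument: right after "(" or ",".
KEY_TYPE = re.compile(r"[(,]\s*key_type\b")

def wrong_lookups(source: str) -> List[str]:
    """Return statements whose lookup call (other than msg_id) lacks key_type."""
    wrong = []
    for statement in source.split(";"):
        m = CALL.search(statement)
        if m is None or m.group(1) == "msg":
            continue  # no lookup call, or msg_id, which needs no key_type
        # Only inspect the argument list, starting at the opening parenthesis.
        if KEY_TYPE.search(statement[m.end() - 1:]) is None:
            wrong.append(statement.strip())
    return wrong

source = """
curry := lookup.curry_id(key_val => 'CHF', key_type => 'asset_iso');
asset := lookup.asset_id(key_val => 'UBSN'); -- this is wrong
msg := lookup.msg_id(key_val => 'hello'); -- this is fine
"""
print(wrong_lookups(source))
```

A real PL/SQL source would need proper tokenizing to cope with semicolons inside strings or comments; the sketch deliberately ignores that.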


Ora 06512/04088 triggers errors when INSERT INTO statement

I'm working on a trigger that enforces a "domain" for the column Molteplicità in a table called Partecipa by using a function.
The trigger I've created is the following:
CREATE OR REPLACE TRIGGER dominioMolteplicità
BEFORE INSERT OR UPDATE ON partecipa
FOR EACH ROW
BEGIN
IF motepl_valido(:NEW.molteplicità) = 'f' THEN
RAISE_APPLICATION_ERROR(-20002, 'Invalid type');
END IF;
END;
which uses the following function:
CREATE OR REPLACE FUNCTION motepl_valido(mol VARCHAR2) RETURN CHAR IS
BEGIN
IF regexp_like(LOWER(mol), ' [*]\..[*] ') THEN
RETURN 't';
ELSE
RETURN 'f';
END IF;
END;
Table Partecipa has the following columns:
CodP INT,
molteplicità VARCHAR2,
codAss INT,
className VARCHAR2,
PRIMARY KEY (codP),
FOREIGN KEY (className) REFERENCES class(name),
FOREIGN KEY (codAss) REFERENCES associazione(cod)
My Associazione table contains rows (in particular codAss: 42), and my Class table contains rows (in particular className: 'Impiegato').
Yet when I execute the following statement
insert into Partecipa(molteplicità, className, codAss)
values ('*..*', 'Impiegato', 42);
I get these errors:
ORA-20002 INVALID TYPE
ORA-06512: AT "dominioMolteplicità", line 3
ORA-04088: ERROR DURING EXECUTION OF TRIGGER "dominioMolteplicità"
(Note that if I disable my trigger, the insert statement works properly. There's some problem with the trigger, but I can't find the mistake.)
It's not related to the trigger.
Your function motepl_valido returns 'f', which leads to ORA-20002 INVALID TYPE, if the supplied string (in this case '*..*') does not match the regex ' [*]\..[*] '. It doesn't match because the string is missing the leading and trailing spaces that the pattern requires.
Demo showing the effect of a selection of regex patterns (I've added | around the patterns to show the leading and trailing spaces):
with demo (molteplicita) as
( select '*..*' from dual union all
select ' *..* ' from dual union all
select ' *x.* ' from dual )
, patterns (pattern) as
( select '[*]\..[*]' from dual union all
select ' [*]\..[*] ' from dual union all
select ' *[*]\..[*] *' from dual union all
select ' *\*\..\* *' from dual )
select '|'||pattern||'|' as pattern
, '|'||molteplicita||'|' as molteplicita
, case when regexp_like(molteplicita, pattern) then 'Yes' else 'No' end as matched
from demo cross join patterns
order by pattern, molteplicita desc;
PATTERN MOLTEPLICITA MATCHED
---------------- ------------ -------
| *[*]\..[*] *| |*..*| Yes
| *[*]\..[*] *| | *x.* | No
| *[*]\..[*] *| | *..* | Yes
| *\*\..\* *| |*..*| Yes
| *\*\..\* *| | *x.* | No
| *\*\..\* *| | *..* | Yes
| [*]\..[*] | |*..*| No
| [*]\..[*] | | *x.* | No
| [*]\..[*] | | *..* | Yes
|[*]\..[*]| |*..*| Yes
|[*]\..[*]| | *x.* | No
|[*]\..[*]| | *..* | Yes
12 rows selected.
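The effect of the literal spaces can be reproduced with any regex engine. Here is a small Python sketch, assuming `re.search` behaves like REGEXP_LIKE for these simple patterns (both look for the pattern anywhere in the string); the helper name `matches` is made up for the example.

```python
import re

def matches(pattern: str, value: str) -> bool:
    """Unanchored search, comparable to REGEXP_LIKE(value, pattern)."""
    return re.search(pattern, value) is not None

value = "*..*"
# The spaces in the pattern are literal, so the value must contain them too.
print(matches(r" [*]\..[*] ", value))    # pattern with required spaces
print(matches(r"[*]\..[*]", value))      # pattern without spaces
print(matches(r" *[*]\..[*] *", value))  # pattern with optional spaces
```

Only the first pattern rejects '*..*', which corresponds to the "No" row in the demo output.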
Because your pattern doesn't conform to your data.
I suppose regexp_like( lower(mol), '\*..\*') would be alright, and in this case values such as '*=-*' or '*34*' for molteplicità would work.
Btw, using '[\*]..[\*]' (where the backslash is used as an escape character) as the pattern might also be possible for the above regular expression.
Demo:
with t( mol ) as
(
select '*24*' from dual union all
select 'B' from dual union all
select '*=-*' from dual
)
select
case when regexp_like(lower(mol), '\*..\*') then 't' else 'f' end suggested_pattern1,
case when regexp_like(lower(mol), '[\*]..[\*]') then 't' else 'f' end suggested_pattern2,
case when regexp_like(lower(mol), '[*]\..[*]') then 't' else 'f' end original_pattern,
case when regexp_like(lower(mol), '*..*') then 't' else 'f' end anticipated_pattern
from t;
SUGGESTED_PATTERN1 SUGGESTED_PATTERN2 ORIGINAL_PATTERN ANTICIPATED_PATTERN
t t f t
f f f t
t t f t
P.S. Note that anticipated_pattern would fail also ( for mol = 'B' in the above sample).

Spark - extracting numeric values from an alphanumeric string using regex

I have an alphanumeric column named "Result" that I'd like to parse into 4 different columns: prefix, suffix, value, and pure_text.
I'd like to solve this in Spark SQL using RLIKE and regular expressions, but I'm also open to PySpark/Scala.
pure_text: contains only letters; or, if numbers are present, they either include the special character "-", have multiple decimal points (e.g. 9.9.0), or consist of a number followed by a letter and then a number again (e.g. 3x4u).
prefix: applies to anything that can't be categorized as pure_text. Any character(s) before the first digit [0-9] need to be extracted.
suffix: applies to anything that can't be categorized as pure_text. Any character(s) after the last digit [0-9] need to be extracted.
value: applies to anything that can't be categorized as pure_text. Extract the numerical value, including the decimal point.
Result
11 H
111L
<.004
>= 0.78
val<=0.6
xyz 100 abc
1-9
aaa 100.3.4
a1q1
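Expected Output:
| Result      | Prefix | Suffix | Value | Pure_Text   |
|-------------|--------|--------|-------|-------------|
| 11 H        |        | H      | 11    |             |
| 111L        |        | L      | 111   |             |
| .9          |        |        | 0.9   |             |
| <.004       | <      |        | 0.004 |             |
| >= 0.78     | >=     |        | 0.78  |             |
| val<=0.6    | val<=  |        | 0.6   |             |
| xyz 100 abc | xyz    | abc    | 100   |             |
| 1-9         |        |        |       | 1-9         |
| aaa 100.3.4 |        |        |       | aaa 100.3.4 |
| a1q1        |        |        |       | a1q1        |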
Here's one approach using a UDF that applies pattern matching to extract the string content into a case class. The pattern matching centers around the numeric value with Regex pattern [+-]?(?:\d*\.)?\d+ to extract the first occurrence of numbers like "1.23", ".99", "-100", etc. A subsequent check of numbers in the remaining substring captured in suffix determines whether the numeric substring in the original string is legitimate.
import org.apache.spark.sql.functions._
import spark.implicits._
case class RegexRes(prefix: String, suffix: String, value: Option[Double], pure_text: String)

val regexExtract = udf{ (s: String) =>
  // prefix (lazy), first numeric token, and everything after it
  val pattern = """(.*?)([+-]?(?:\d*\.)?\d+)(.*)""".r
  s match {
    case pattern(pfx, num, sfx) =>
      if (sfx.exists(_.isDigit))
        RegexRes("", "", None, s)  // more digits follow: treat as pure text
      else
        RegexRes(pfx, sfx, Some(num.toDouble), "")
    case _ =>
      RegexRes("", "", None, s)    // no number at all: pure text
  }
}

val df = Seq(
  "11 H", "111L", ".9", "<.004", ">= 0.78", "val<=0.6", "xyz 100 abc", "1-9", "aaa 100.3.4", "a1q1"
).toDF("result")

df.
  withColumn("regex_res", regexExtract($"result")).
  select($"result", $"regex_res.prefix", $"regex_res.suffix", $"regex_res.value", $"regex_res.pure_text").
  show
// +-----------+------+------+-----+-----------+
// | result|prefix|suffix|value| pure_text|
// +-----------+------+------+-----+-----------+
// | 11 H| | H| 11.0| |
// | 111L| | L|111.0| |
// | .9| | | 0.9| |
// | <.004| <| |0.004| |
// | >= 0.78| >= | | 0.78| |
// | val<=0.6| val<=| | 0.6| |
// |xyz 100 abc| xyz | abc|100.0| |
// | 1-9| | | null| 1-9|
// |aaa 100.3.4| | | null|aaa 100.3.4|
// | a1q1| | | null| a1q1|
// +-----------+------+------+-----+-----------+
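Since the question is also open to PySpark, the core of the UDF can be sketched as a plain Python function. This is an assumption-laden port of the Scala logic above, not tested inside Spark; the names `RegexRes` and `regex_extract` mirror the Scala ones, and wrapping the function in a PySpark UDF is left out on purpose.

```python
import re
from typing import NamedTuple, Optional

class RegexRes(NamedTuple):
    prefix: str
    suffix: str
    value: Optional[float]
    pure_text: str

# Lazy prefix, first numeric token, then the remainder of the string.
NUMBER = re.compile(r"(.*?)([+-]?(?:\d*\.)?\d+)(.*)")

def regex_extract(s: str) -> RegexRes:
    m = NUMBER.fullmatch(s)
    if m is None:
        return RegexRes("", "", None, s)      # no number: pure text
    pfx, num, sfx = m.groups()
    if any(ch.isdigit() for ch in sfx):
        return RegexRes("", "", None, s)      # digits after the number: pure text
    return RegexRes(pfx, sfx, float(num), "")

for s in ("11 H", "<.004", "val<=0.6", "1-9", "aaa 100.3.4", "a1q1"):
    print(repr(s), regex_extract(s))
```

As in the Scala version, prefix and suffix keep their surrounding whitespace; trimming them would be a separate design decision.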

Matching five characters in Excel/VBA using RegEx, with first character being dependant on cell value

I need your help! I'd like to use RegEx in an Excel/VBA environment. I do have an approach, but I'm kind of reaching my limits...
I need to match 5 characters within a great many strings (the strings are in column B of my Excel sheet; column A comes later). The 5 characters can be 5 digits or a „K“ followed by 4 digits (e.g. 12345, 98765, K2345). This would be covered by (\d{5}|K\d{4}).
These five characters can be preceded or followed by letters or special characters, but not by digits. This means that no leading zeros are allowed, and the digits shouldn't be matched within a longer number. That's one point where I'm stuck.
If there’s more than one possible match in a string, I need them all to be matched. If the same number has been matched within a line already, I’d like it not to be matched again. For these two requirements, I do have a sort of solution already, that works as part of the VBA code at the end of this posting: (\d{5}|K\d{4})(?!.*?\1.*$)
In addition, I do have a specific single digit (or a „K“) in column A. I need the five characters to start with this specific character, or otherwise not be matched.
Example of strings (numbered). The two columns A and B are separated by "|" for better readability
(1) | 1 | 2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN
(2) | 3 | Kompo 32280EP ###Baukasten### 3789936690 ID PFK Carbon0
(3) | 2 | 20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF
(4) | 2 | Q21009 China lokal produzierte Derivate f/Radverbund 991222 VV
(5) | 6 | ID:61953 F-Pace Enfantillages (Machine arriere) VvSKPMG Lyon09
(6) | 2 | 2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS
(7) | K | ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC
(8) | 7 | LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142
The results that I'm looking for here are:
(1) | 11298 | ............................. [but don't match 12345, since no preceding digits are allowed]
(2) | 32280 | ............................. [but don't match 37899 within 3789936690]
(3) | 20613 | 20614 | ................ [match both starting with a 2, don't match the one starting with 3]
(4) | 21009 | ............................. [preceded by a letter, which is perfectly fine]
(5) | 61953 | ..............................[random example]
(6) | 22222 | 21895 | 28932 | ... [match them all, but no duplicates]
(7) | K1245 | ............................. [special case with a "K"]
(8) | 70142 | 70119 | ................ [ignore second 70142]
The RegEx/VBA Code that I've put together so far is:
Sub RegEx()
Dim varOut() As Variant
Dim objRegEx As Object
Dim lngColumn As Long
Dim objRegA As Object
Dim varArr As Variant
Dim lngUArr As Long
Dim lngTMP As Long
On Error GoTo Fin
With Worksheets("Sheet1")
varArr = .Range("B2:B50")
Set objRegEx = CreateObject("VBScript.Regexp")
With objRegEx
.Pattern = "(\d{5}|K\d{4})(?!.*?\1.*$)" 'this is where the magic happens
.Global = True
For lngUArr = 1 To UBound(varArr)
Set objRegA = .Execute(varArr(lngUArr, 1))
If objRegA.Count >= lngColumn Then
lngColumn = objRegA.Count
End If
Set objRegA = Nothing
Next lngUArr
If lngColumn = 0 Then Exit Sub
ReDim varOut(1 To UBound(varArr), 1 To lngColumn)
For lngUArr = 1 To UBound(varArr)
Set objRegA = .Execute(varArr(lngUArr, 1))
For lngTMP = 1 To objRegA.Count
varOut(lngUArr, lngTMP) = objRegA(lngTMP - 1)
Next lngTMP
Set objRegA = Nothing
Next lngUArr
End With
.Cells(2, 3).Resize(UBound(varOut), UBound(varOut, 2)) = varOut
End With
Fin:
Set objRegA = Nothing
Set objRegEx = Nothing
If Err.Number <> 0 Then MsgBox "Error: " & Err.Number & " " & Err.Description
End Sub
This code is checking the string from column B and delivering its matches in columns C, D, E etc. It's not matching duplicates. It is however matching numbers within larger numbers, which is a problem. \b for example doesn't work for me, because I still want to match 12345 in EP12345.
Also, I have no idea how to implement the character from column A to be the very first character.
I've uploaded my excel file here: mollmell.de/RegEx.xlsm
Thank you so much for suggestions
Stephan
To sort out the numbers that are too long, you can use a negative lookbehind and a negative lookahead that reject preceding and succeeding digits:
(?x) (?<!\d) (\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/RBnoMo/1
Matching only the numbers that start with the key from the second column is rather hard. Maybe you could match either the key or the numbers and do the logic afterwards:
(?x)
\|[ ](?<key>.)[ ]\| |
(?<!\d) (?<number>\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/60d0yT/2
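The "match first, apply the key logic afterwards" idea can be sketched in Python, whose `re` module supports the lookbehind and lookahead used above. As far as I know, VBScript.RegExp does not support lookbehind, so this is an illustration of the logic rather than drop-in VBA. The function name `extract_codes` and passing the key as a separate argument are assumptions made for the example.

```python
import re
from typing import List

# Five digits, or K plus four digits, not embedded in a longer run of digits.
CODE = re.compile(r"(?<!\d)(\d{5}|K\d{4})(?!\d)")

def extract_codes(key: str, text: str) -> List[str]:
    """Return the codes in text that start with key, without duplicates."""
    result = []
    for code in CODE.findall(text):
        if code.startswith(key) and code not in result:
            result.append(code)
    return result

print(extract_codes("1", "2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN"))
print(extract_codes("2", "20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF"))
print(extract_codes("K", "ID:K1245 Panamera Nitsche Radlager"))
```

Filtering by the key and removing duplicates in ordinary code keeps the pattern itself short, at the cost of a second pass over the matches.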

Checking if last word of string is (case-insensitively) contained in another string

I'm using the regex SPARQL function and I pass two variables to it in this way:
FILTER regex(?x, ?y, "i")
I would like, for example, to compare these two strings: Via de' cerretani and via dei Cerretani. by extracting the significant word of the first string, which is usually the last word, cerretani in this case, and check if it's contained in the second string. As you can see, I pass these two strings as variables. How can I do this?
At first I thought that this was a duplicate of your earlier question, Comparing two strings with SPARQL, but that one asks about a function that returns an edit distance. The task here is much more specific: check whether the last word of a string is contained (case-insensitively) in another string. As long as we take your specification that
the significant word of the string … is usually the last one
strictly and always use only the last word of the string (since there's no way to determine, in general, what the “significant word of the string” is), we can do this. You won't end up using the regex function, though. Instead we'll use replace, contains, and lcase (or ucase).
The trick is that we can get the last word of a string ?x by using replace to remove all the words but the last one (and the space before it), and can then use contains to check whether this last word is contained in the other string. Using case normalization functions (in the following code, I used lcase, but ucase should work, too) we can do the containment check case-insensitively.
select ?x ?y ?lastWordOfX ?isMatch ?isIMatch where {
# Values gives us some test data. It just means that ?x and ?y
# will be bound to the specified values. In your final query,
# these would be coming from somewhere else.
values (?x ?y) {
("Via de' cerretani" "via dei Cerretani")
("Doctor Who" "Who's on first?")
("CaT" "The cAt in the hat")
("John Doe" "Don't, John!")
}
# For "the significant word of the string which is
# usually the last one", note that "all but the last word"
# is matched by the pattern ".* ". We can replace "all but the
# last word" with the empty string to leave just the last word.
# (Note that if the pattern doesn't match, then the original
# string is returned. This is good for us, because if there's
# just a single word, then it's also the last word.)
bind( replace( ?x, ".* ", "" ) as ?lastWordOfX )
# When you check whether the second string contains the first,
# you can either leave the cases as they are and have a case
# sensitive check, or you can convert them both to the same
# case and have a case insensitive match.
bind( contains( ?y, ?lastWordOfX ) as ?isMatch )
bind( contains( lcase(?y), lcase(?lastWordOfX) ) as ?isIMatch )
}
---------------------------------------------------------------------------------
| x | y | lastWordOfX | isMatch | isIMatch |
=================================================================================
| "Via de' cerretani" | "via dei Cerretani" | "cerretani" | false | true |
| "Doctor Who" | "Who's on first?" | "Who" | true | true |
| "CaT" | "The cAt in the hat" | "CaT" | false | true |
| "John Doe" | "Don't, John!" | "Doe" | false | false |
---------------------------------------------------------------------------------
That might look like a lot of code, but that's because there are comments, the last word is bound to another variable, and I've included both case-sensitive and case-insensitive matches. When you're actually using this, it will be much shorter. For instance, to select only those ?x and ?y that match in this way:
select ?x ?y {
values (?x ?y) {
("Via de' cerretani" "via dei Cerretani")
("Doctor Who" "Who's on first?")
("CaT" "The cAt in the hat")
("John Doe" "Don't, John!")
}
filter( contains( lcase(?y), lcase(replace( ?x, ".* ", "" ))))
}
----------------------------------------------
| x | y |
==============================================
| "Via de' cerretani" | "via dei Cerretani" |
| "Doctor Who" | "Who's on first?" |
| "CaT" | "The cAt in the hat" |
----------------------------------------------
It's true that
contains( lcase(?y), lcase(replace( ?x, ".* ", "" )))
is a bit longer than something like
regex( ?x, ?y, "some-special-flag" )
but I think it's fairly short. If you're willing to use the last word of ?x as a regular expression (which probably isn't a good idea, because you don't know that it doesn't contain special regular expression characters), you could even use:
regex( ?y, replace( ?x, ".* ", "" ), "i" )
but I suspect that it's probably faster to use contains, since regex has many more things to check.
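For readers who want to try the logic outside a SPARQL endpoint, here is a minimal Python sketch of the same two steps: strip everything up to the last space, then do a lower-cased containment check. The function names `last_word` and `last_word_contained` are assumptions made for this example.

```python
import re

def last_word(s: str) -> str:
    """Remove everything up to and including the last space."""
    return re.sub(r".* ", "", s)

def last_word_contained(x: str, y: str) -> bool:
    """Case-insensitive check that the last word of x occurs in y."""
    return last_word(x).lower() in y.lower()

pairs = [
    ("Via de' cerretani", "via dei Cerretani"),
    ("Doctor Who", "Who's on first?"),
    ("CaT", "The cAt in the hat"),
    ("John Doe", "Don't, John!"),
]
for x, y in pairs:
    print(x, "|", y, "|", last_word(x), "|", last_word_contained(x, y))
```

Because the containment check is a plain substring test, special regex characters in the last word need no escaping.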

isnumeric() with PostgreSQL

I need to determine whether a given string can be interpreted as a number (integer or floating point) in an SQL statement. As in the following:
SELECT AVG(CASE WHEN x ~ '^[0-9]*.?[0-9]*$' THEN x::float ELSE NULL END) FROM test
I found that Postgres' pattern matching could be used for this. And so I adapted the statement given in this place to incorporate floating point numbers. This is my code:
WITH test(x) AS (
VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'))
SELECT x
, x ~ '^[0-9]*.?[0-9]*$' AS isnumeric
FROM test;
The output:
x | isnumeric
---------+-----------
| t
. | t
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
(11 rows)
As you can see, the first two items (the empty string '' and the sole period '.') are misclassified as being a numeric type (which they are not). I can't get any closer to this at the moment. Any help appreciated!
Update: Based on this answer (and its comments), I adapted the pattern to:
WITH test(x) AS (
VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'), ('1x234'), ('1.234e-5'))
SELECT x
, x ~ '^([0-9]+[.]?[0-9]*|[.][0-9]+)$' AS isnumeric
FROM test;
Which gives:
x | isnumeric
----------+-----------
| f
. | f
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
1x234 | f
1.234e-5 | f
(13 rows)
There are still some issues with the scientific notation and with negative numbers, as I see now.
As you may have noticed, a regex-based method is almost impossible to get right. For example, your test says that 1.234e-5 is not a valid number, when it really is. You also missed negative numbers. And what if something looks like a number, but storing it would cause an overflow?
Instead, I would recommend creating a function that actually tries to cast to NUMERIC (or FLOAT if your task requires it) and returns TRUE or FALSE depending on whether the cast succeeded.
This code simulates an ISNUMERIC() function:
CREATE OR REPLACE FUNCTION isnumeric(text) RETURNS BOOLEAN AS $$
DECLARE x NUMERIC;
BEGIN
x = $1::NUMERIC;
RETURN TRUE;
EXCEPTION WHEN others THEN
RETURN FALSE;
END;
$$
STRICT
LANGUAGE plpgsql IMMUTABLE;
Calling this function on your data gets following results:
WITH test(x) AS ( VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'), ('1x234'), ('1.234e-5'))
SELECT x, isnumeric(x) FROM test;
x | isnumeric
----------+-----------
| f
. | f
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
1x234 | f
1.234e-5 | t
(13 rows)
Not only is it more correct and easier to read, it will also work faster if the data actually is a number.
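The cast-and-catch idea is not specific to PL/pgSQL. Below is a minimal Python sketch of the same approach, assuming that Python's `float()` is an acceptable stand-in for the NUMERIC cast; its grammar is not identical (for example, it also accepts surrounding whitespace and underscores between digits). Returning False for None is a choice made here, whereas the STRICT SQL function returns NULL. Only ValueError is caught, so unrelated errors are not swallowed.

```python
def is_numeric(text) -> bool:
    """True if float() can parse text; a stand-in for a NUMERIC cast."""
    if text is None:
        return False
    try:
        float(text)
    except ValueError:
        return False
    return True

samples = ["", ".", ".0", "0.", "0", "1", "123", "123.456",
           "abc", "1..2", "1.2.3.4", "1x234", "1.234e-5"]
for s in samples:
    print(repr(s), is_numeric(s))
```

For these sample values the results agree with the table above, including 1.234e-5 being accepted.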
Your problem is the two "0 or more" [0-9] elements on each side of the decimal point. You need to use a logical OR | in the number identification line:
~'^([0-9]+\.?[0-9]*|\.[0-9]+)$'
This will exclude a decimal point alone as a valid number.
I suppose one could have that opinion (that it's not a misuse of exception handling), but generally I think that an exception handling mechanism should be used only for exceptional conditions. Testing whether a string contains a number is part of normal processing, and isn't "exceptional".
But you're right about not handling exponents. Here's a second stab at the regular expression (below). The reason I had to pursue a solution that uses a regular expression was that the solution offered as the "correct" solution here will fail when the directive is given to exit when an error is encountered:
SET exit_on_error = true;
We use this often when groups of SQL scripts are run, and when we want to stop immediately if there is any issue/error. When this session directive is given, calling the "correct" version of isnumeric will cause the script to exit immediately, even though there's no "real" exception encountered.
create or replace function isnumeric(text) returns boolean
immutable
language plpgsql
as $$
begin
if $1 is null or rtrim($1)='' then
return false;
else
return (select $1 ~ '^ *[-+]?[0-9]*([.][0-9]+)?[0-9]*(([eE][-+]?)[0-9]+)? *$');
end if;
end;
$$;
Since PostgreSQL 9.5 (2016) you can just ask the type of a json field:
jsonb_typeof(field)
From the PostgreSQL documentation:
json_typeof(json)
jsonb_typeof(jsonb)
Returns the type of the outermost JSON value as a text string. Possible types are object, array, string, number, boolean, and null.
Example
When aggregating numbers and wanting to ignore strings:
SELECT m.title, SUM(m.body::numeric)
FROM messages as m
WHERE jsonb_typeof(m.body) = 'number'
GROUP BY m.title;
Without WHERE the ::numeric part would crash.
The obvious problem with the accepted solution is that it is an abuse of exception handling. If there's another problem encountered, you'll never know it because you've tossed away the exceptions. Very bad form. A regular expression would be the better way to do this. The regex below seems to behave well.
create function isnumeric(text) returns boolean
immutable
language plpgsql
as $$
begin
if $1 is not null then
return (select $1 ~ '^(([-+]?[0-9]+(\.[0-9]+)?)|([-+]?\.[0-9]+))$');
else
return false;
end if;
end;
$$
;
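To experiment with a regex that also covers signs and exponents, here is a Python sketch. The pattern is my own combination of the alternation shown earlier ('^([0-9]+[.]?[0-9]*|[.][0-9]+)$') with an optional sign and an optional exponent; it is an assumption for illustration, not one of the patterns posted above, and it does not address overflow or special values such as NaN. The name `looks_numeric` is hypothetical.

```python
import re

# Optional sign, digits with optional fraction (or a bare fraction),
# then an optional exponent such as e-5 or E+10.
NUMERIC = re.compile(r"[-+]?(?:[0-9]+[.]?[0-9]*|[.][0-9]+)(?:[eE][-+]?[0-9]+)?")

def looks_numeric(text) -> bool:
    """True if the whole string has the shape of a decimal number."""
    return text is not None and NUMERIC.fullmatch(text) is not None

for s in ("", ".", ".0", "0.", "-12.5", "1.234e-5", "e5", "1..2", "1x234"):
    print(repr(s), looks_numeric(s))
```

Requiring at least one digit before the optional exponent is what rejects inputs such as 'e5' or a lone sign.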