I'm just wondering if it's possible to use one regular expression to match another, that is some sort of:
['a-z'].match(['b-x'])
True
['m-n'].match(['0-9'])
False
Is this sort of thing possible with regex at all? I'm doing work in python, so any advice specific to the re module's implementation would help, but I'll take anything I can get concerning regex.
Edit: Ok, some clarification is obviously in order! I definitely know that normal matching syntax would look something like this:
expr = re.compile(r'[a-z]*')
string = "some words"
expr.match(string)
<sRE object blah blah>
but I'm wondering whether regular expressions can match other, less specific expressions. In the syntactically incorrect version I tried to explain above, any letter from b-x would always be a subset (match) of any letter from a-z. I know from trying that you can't do this by simply calling match on one compiled expression with another compiled expression, but the question remains: is this possible at all?
Let me know if this still isn't clear.
I think — in theory — to tell whether regexp A matches a subset of what regexp B matches, an algorithm could:
Compute the minimal Deterministic Finite Automaton of B and also of the "union" A|B.
Check if the two DFAs are identical. This is true if and only if A matches a subset of what B matches.
However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.
But what are you trying to do? Perhaps there's an easier approach...?
Verification of the post by "antinome" using two regexes, 55* and 5*:
REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]
REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]
[Here we've assumed that 55* is not implicitly .55.* and 5* is not .5.* - This is why 5* does not match 4]
REGEX_A can have an NFA as below:
{A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
{B} -----------------epsilon --------> {E}
{C} <--- epsilon ------ {E}
REGEX_B can have an NFA as below:
{A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
{A} --------------epsilon -----------> {D}
{B} <--- epsilon ------ {D}
Now we can derive the NFA and DFA of (REGEX_A|REGEX_B) as below:
NFA:
{state A} ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
{state C} ---epsilon --> {state D}
{state C} <---epsilon -- {state D}
{state A} ---epsilon --> {state E} ---5--> {state F}
{state E} ---epsilon --> {state F}
{state E} <---epsilon -- {state F}
NFA -> DFA:
| 5 | epsilon*
----+--------------+--------
A | B,C,E,F,G | A,C,E,F
B | C,D,E,F | B,C,E,F
C | C,D,E,F | C
D | C,D,E,F,G | C,D,E,F
E | C,D,E,F,G | C,E,F
F | C,E,F,G | F
G | C,D,E,G | C,E,F,G
5(epsilon*)
-------------+---------------------
A | B,C,E,F,G
B,C,E,F,G | C,D,E,F,G
C,D,E,F,G | C,D,E,F,G
Finally the DFA for (REGEX_A|REGEX_B) is:
{A}--5--->{B,C,E,F,G}--5--->{C,D,E,F,G}
{C,D,E,F,G}---5--> {C,D,E,F,G}
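Note: {A} is the start state. Because REGEX_B (5*) also matches the empty string, {A} is an accepting state here, as are {B,C,E,F,G} and {C,D,E,F,G}, so this DFA accepts "", "5", "55", etc.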
Similarly DFA for REGEX_A (55*) is:
| 5 | epsilon*
----+--------+--------
A | B,C,E | A
B | C,D,E | B,C,E
C | C,D,E | C
D | C,D,E | C,D,E
E | C,D,E | C,E
5(epsilon*)
-------+---------------------
A | B,C,E
B,C,E | C,D,E
C,D,E | C,D,E
{A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
{C,D,E}--5--->{C,D,E}
Note: {A} is the start state; {B,C,E} and {C,D,E} are both accepting states, since each contains the NFA's final state E.
Similarly DFA for REGEX_B (5*) is:
| 5 | epsilon*
----+--------+--------
A | B,C,D | A,B,D
B | B,C,D | B
C | B,C,D | B,C,D
D | B,C,D | B,D
5(epsilon*)
-------+---------------------
A | B,C,D
B,C,D | B,C,D
{A} ---- 5 -----> {B,C,D}
{B,C,D} --- 5 ---> {B,C,D}
Note: {A} is the start state and is itself accepting (its epsilon closure {A,B,D} contains the NFA's final state D); {B,C,D} is accepting as well.
Conclusions:
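REGEX_A|REGEX_B accepts "", "5", "55", etc., exactly like REGEX_B, so after minimization the DFA of REGEX_A|REGEX_B is identical to the DFA of REGEX_B
-- implies REGEX_A is a subset of REGEX_B.
The DFA of REGEX_A|REGEX_B is NOT identical to the DFA of REGEX_A (which rejects the empty string)
-- implies REGEX_B is not a subset of REGEX_A.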
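The same subset question can also be checked mechanically without comparing minimized DFAs. The sketch below uses a different technique from the union-and-compare procedure above, namely a product-automaton search: it walks both DFAs in lockstep and fails as soon as the first accepts in a state pair where the second rejects. The two transition tables are hand-written for 55* and 5* over an assumed alphabet of "5" and "4", with an explicit dead state; the names DFA, DFA_A, DFA_B and is_subset are illustrative only.

```python
from collections import deque
from typing import NamedTuple


class DFA(NamedTuple):
    start: str
    accepting: frozenset
    delta: dict  # (state, symbol) -> state, defined for every symbol


ALPHABET = "54"  # assumed alphabet: "5" plus one symbol that never matches

# Hand-written complete DFA for 55* (one or more 5s).
DFA_A = DFA(
    start="a0",
    accepting=frozenset({"a1"}),
    delta={
        ("a0", "5"): "a1", ("a0", "4"): "dead",
        ("a1", "5"): "a1", ("a1", "4"): "dead",
        ("dead", "5"): "dead", ("dead", "4"): "dead",
    },
)

# Hand-written complete DFA for 5* (zero or more 5s).
DFA_B = DFA(
    start="b0",
    accepting=frozenset({"b0"}),
    delta={
        ("b0", "5"): "b0", ("b0", "4"): "dead",
        ("dead", "5"): "dead", ("dead", "4"): "dead",
    },
)


def is_subset(x: DFA, y: DFA, alphabet: str = ALPHABET) -> bool:
    """Return True if every string accepted by x is also accepted by y."""
    seen = {(x.start, y.start)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        # A reachable pair where x accepts but y rejects is a counterexample.
        if p in x.accepting and q not in y.accepting:
            return False
        for symbol in alphabet:
            nxt = (x.delta[(p, symbol)], y.delta[(q, symbol)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


print(is_subset(DFA_A, DFA_B))  # is 55* a subset of 5*?
print(is_subset(DFA_B, DFA_A))  # is 5* a subset of 55*?
```

With these tables the first call reports that 55* is contained in 5*, and the second fails immediately because 5* accepts the empty string while 55* does not.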
In addition to antinome's answer:
Many of the constructs that are not part of the basic regex definition are still regular, and can be converted after parsing the regex (with a real parser, because the language of regex is not regular itself): (x?) to (x|), (x+) to (xx*), character classes like [a-d] to their corresponding union (a|b|c|d) etc. So one can use these constructs and still test whether one regex matches a subset of the other regex using the DFA comparison mentioned by antinome.
Some constructs, like back references, are not regular, and cannot be represented by NFA or DFA.
Even the seemingly simple problem of testing whether a regex with back references matches a particular string is NP-complete (http://perl.plover.com/NPC/NPC-3COL.html).
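As a rough sanity check of the rewrites mentioned above, the sketch below compares each convenience construct with its basic form using Python's re module. It is only a bounded test over short strings from an assumed alphabet, not a proof of equivalence, and the helper name agree_up_to is made up for illustration.

```python
import re
from itertools import product


def agree_up_to(pattern_1, pattern_2, alphabet, max_len=4):
    """Bounded check: do both patterns accept the same strings up to max_len?"""
    r1, r2 = re.compile(pattern_1), re.compile(pattern_2)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if (r1.fullmatch(s) is None) != (r2.fullmatch(s) is None):
                return False
    return True


# (convenience construct, equivalent basic form)
REWRITES = [
    ("x?", "(?:x|)"),
    ("x+", "xx*"),
    ("[a-d]", "(?:a|b|c|d)"),
]

results = [agree_up_to(sugar, basic, "abcdex") for sugar, basic in REWRITES]
print(results)
```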
pip3 install https://github.com/leafstorm/lexington/archive/master.zip
python3
>>> from lexington.regex import Regex as R
>>> from lexington.regex import Null
>>> from functools import reduce
>>> from string import ascii_lowercase, digits
>>> a_z = reduce(lambda a, b: a | R(b), ascii_lowercase, Null)
>>> b_x = reduce(lambda a, b: a | R(b), ascii_lowercase[1:-2], Null)
>>> a_z | b_x == a_z
True
>>> m_n = R("m") | R("n")
>>> zero_nine = reduce(lambda a, b: a | R(b), digits, Null)
>>> m_n | zero_nine == m_n
False
Also tested successfully with Python 2. See also how to do it with a different library.
Alternatively, pip3 install https://github.com/ferno/greenery/archive/master.zip and:
from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n
You should do something along these lines:
re.match(r"\[[b-x]-[b-x]\]", "[a-z]")
The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.
If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.
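Testing whether a pattern compiles could look like the minimal sketch below; the function name is_valid_regex is illustrative, and it only checks syntax, not what the pattern matches.

```python
import re


def is_valid_regex(pattern: str) -> bool:
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(is_valid_regex("[a-z]"))
print(is_valid_regex("[a-z"))  # unterminated character class
```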
It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile. I don't see what use this would be, though. Also, it takes a different syntax.
Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re module doesn't do it.
Some clarification is required, I think:
rA = re.compile('(?<! )[a-z52]+')
'(?<! )[a-z52]+' is a pattern
rA is an instance of class RegexObject whose type is <type '_sre.SRE_Pattern'>.
Personally, I use the term regex exclusively for this kind of object, not for the pattern.
Note that rB = re.compile(rA) is also possible; it returns the same object (id(rA) == id(rB) evaluates to True).
ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
x = rA.match(ch)
# or
y = rA.search(ch)
x and y are instances of the class MatchObject whose type is <type '_sre.SRE_Match'>.
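The distinction between pattern string, compiled regex and match object can be seen in a short session. This is a sketch for Python 3, where the same classes are exposed as re.Pattern and re.Match rather than under the older _sre names; the variable same_object is added here for illustration.

```python
import re

rA = re.compile(r'(?<! )[a-z52]+')   # compiled regex built from the pattern string
rB = re.compile(rA)                  # compiling a compiled regex returns it unchanged
same_object = rB is rA

ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
x = rA.match(ch)    # anchored at the start of ch
y = rA.search(ch)   # first match anywhere in ch

print(type(rA).__name__, type(x).__name__)
print(same_object, x.group(), y.group())
```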
That said, you want to know whether there is a way to determine if a regex rA can match all the strings matched by another regex rB, where rB matches only a subset of all the strings matched by rA.
I don't think such a way exists, either theoretically or practically.
I need your help! I’d like to use RegEx in a Excel/VBA environment. I do have an approach, but I’m kind of reaching my limits...
I need to match 5 characters within a great many lines of string (the string being in column B of my excel sheet, A comes later). The 5 characters can be 5 digits or a „K“ followed by 4 digits (ex. 12345, 98765, K2345). This would be covered by (\d{5}|K\d{4}).
These five characters can be preceded or followed by letters or special characters, but not by digits. That means no leading zeros are allowed, and the digits shouldn't be matched inside a longer number. That's one point where I'm stuck.
If there’s more than one possible match in a string, I need them all to be matched. If the same number has been matched within a line already, I’d like it not to be matched again. For these two requirements, I do have a sort of solution already, that works as part of the VBA code at the end of this posting: (\d{5}|K\d{4})(?!.*?\1.*$)
In addition, I do have a specific single digit (or a „K“) in column A. I need the five characters to start with this specific character, or otherwise not be matched.
Example of strings (numbered). The two columns A and B are separated by "|" for better readability
(1) | 1 | 2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN
(2) | 3 | Kompo 32280EP ###Baukasten### 3789936690 ID PFK Carbon0
(3) | 2 | 20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF
(4) | 2 | Q21009 China lokal produzierte Derivate f/Radverbund 991222 VV
(5) | 6 | ID:61953 F-Pace Enfantillages (Machine arriere) VvSKPMG Lyon09
(6) | 2 | 2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS
(7) | K | ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC
(8) | 7 | LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142
The results that I'm looking for here are:
(1) | 11298 | ............................. [but don't match 12345, since no preceding numbers allowed]
(2) | 32280 | ............................. [but don't match 37899 within 3789936690]
(3) | 20613 | 20614 | ................ [match both starting with a 2, don't match the one starting with 3]
(4) | 21009 | ............................. [preceded by a letter, which is perfectly fine]
(5) | 61953 | ..............................[random example]
(6) | 22222 | 21895 | 28932 | ... [match them all, but no duplicates]
(7) | K1245 | ............................. [special case with a "K"]
(8) | 70142 | 70119 | ................ [ignore second 70142]
The RegEx/VBA Code that I've put together so far is:
Sub RegEx()
    Dim varOut() As Variant
    Dim objRegEx As Object
    Dim lngColumn As Long
    Dim objRegA As Object
    Dim varArr As Variant
    Dim lngUArr As Long
    Dim lngTMP As Long
    On Error GoTo Fin
    With Worksheets("Sheet1")
        varArr = .Range("B2:B50")
        Set objRegEx = CreateObject("VBScript.Regexp")
        With objRegEx
            .Pattern = "(\d{5}|K\d{4})(?!.*?\1.*$)" 'this is where the magic happens
            .Global = True
            For lngUArr = 1 To UBound(varArr)
                Set objRegA = .Execute(varArr(lngUArr, 1))
                If objRegA.Count >= lngColumn Then
                    lngColumn = objRegA.Count
                End If
                Set objRegA = Nothing
            Next lngUArr
            If lngColumn = 0 Then Exit Sub
            ReDim varOut(1 To UBound(varArr), 1 To lngColumn)
            For lngUArr = 1 To UBound(varArr)
                Set objRegA = .Execute(varArr(lngUArr, 1))
                For lngTMP = 1 To objRegA.Count
                    varOut(lngUArr, lngTMP) = objRegA(lngTMP - 1)
                Next lngTMP
                Set objRegA = Nothing
            Next lngUArr
        End With
        .Cells(2, 3).Resize(UBound(varOut), UBound(varOut, 2)) = varOut
    End With
Fin:
    Set objRegA = Nothing
    Set objRegEx = Nothing
    If Err.Number <> 0 Then MsgBox "Error: " & Err.Number & " " & Err.Description
End Sub
This code is checking the string from column B and delivering its matches in columns C, D, E etc. It's not matching duplicates. It is however matching numbers within larger numbers, which is a problem. \b for example doesn't work for me, because I still want to match 12345 in EP12345.
Also, I have no idea how to implement the character from column A to be the very first character.
I've uploaded my excel file here: mollmell.de/RegEx.xlsm
Thank you so much for suggestions
Stephan
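To filter out the numbers that are too long, you can use a negative lookbehind and a negative lookahead that reject preceding and following digits. Note that the pattern below uses PCRE syntax; the VBScript regex engine used in the question supports lookahead but not lookbehind or free-spacing mode, so it would need adapting there: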
(?x) (?<!\d) (\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/RBnoMo/1
Matching only the numbers that start with the key from the other column is hard to do in the regex itself. You could match either the key or the numbers and apply the logic afterwards:
(?x)
\|[ ](?<key>.)[ ]\| |
(?<!\d) (?<number>\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/60d0yT/2
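The "match first, apply the logic afterwards" idea can be sketched in Python, whose re module supports the lookbehind used above. This is an illustration of the approach rather than a drop-in VBA solution; the names CODE, find_codes and ROWS are made up, and the rows are a few of the example strings from the question.

```python
import re

# Five digits, or "K" plus four digits, not adjacent to any other digit.
CODE = re.compile(r"(?<!\d)(\d{5}|K\d{4})(?!\d)")


def find_codes(key, text):
    """Return the unique codes in text that start with key, in order of first appearance."""
    found = []
    for code in CODE.findall(text):
        if code.startswith(key) and code not in found:
            found.append(code)
    return found


# (column A key, column B text), taken from the question's examples.
ROWS = [
    ("1", "2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN"),
    ("2", "20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF"),
    ("2", "2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS"),
    ("K", "ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC"),
    ("7", "LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142"),
]

for key, text in ROWS:
    print(key, find_codes(key, text))
```

Duplicates are dropped in plain code here instead of with the \1 lookahead, which keeps the first occurrence of each code rather than the last.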
I need to determine whether a given string can be interpreted as a number (integer or floating point) in an SQL statement. As in the following:
SELECT AVG(CASE WHEN x ~ '^[0-9]*.?[0-9]*$' THEN x::float ELSE NULL END) FROM test
I found that Postgres' pattern matching could be used for this. And so I adapted the statement given in this place to incorporate floating point numbers. This is my code:
WITH test(x) AS (
VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'))
SELECT x
, x ~ '^[0-9]*.?[0-9]*$' AS isnumeric
FROM test;
The output:
x | isnumeric
---------+-----------
| t
. | t
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
(11 rows)
As you can see, the first two items (the empty string '' and the sole period '.') are misclassified as being a numeric type (which they are not). I can't get any closer to this at the moment. Any help appreciated!
Update Based on this answer (and its comments), I adapted the pattern to:
WITH test(x) AS (
VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'), ('1x234'), ('1.234e-5'))
SELECT x
, x ~ '^([0-9]+[.]?[0-9]*|[.][0-9]+)$' AS isnumeric
FROM test;
Which gives:
x | isnumeric
----------+-----------
| f
. | f
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
1x234 | f
1.234e-5 | f
(13 rows)
There are still some issues with the scientific notation and with negative numbers, as I see now.
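The remaining gaps are easy to reproduce outside the database. The sketch below runs the updated pattern through Python's re module, on the assumption that it behaves the same as the Postgres regex engine for this simple pattern; PATTERN, SAMPLES and accepted are illustrative names, and two extra inputs ("-1" and "1e5") are added to show the issues mentioned above.

```python
import re

PATTERN = re.compile(r"^([0-9]+[.]?[0-9]*|[.][0-9]+)$")

SAMPLES = ["", ".", ".0", "0.", "0", "1", "123", "123.456",
           "abc", "1..2", "1.2.3.4", "1x234", "1.234e-5", "-1", "1e5"]

# Keep only the inputs the pattern classifies as numeric.
accepted = [s for s in SAMPLES if PATTERN.match(s) is not None]
print(accepted)
```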
As you may have noticed, a regex-based method is almost impossible to get right. For example, your test says that 1.234e-5 is not a valid number, when it really is. Also, you missed negative numbers. And what if something looks like a number but causes an overflow when you try to store it?
Instead, I would recommend creating a function that tries to actually cast to NUMERIC (or FLOAT if your task requires it) and returns TRUE or FALSE depending on whether the cast succeeded.
This code simulates an ISNUMERIC() function:
CREATE OR REPLACE FUNCTION isnumeric(text) RETURNS BOOLEAN AS $$
DECLARE x NUMERIC;
BEGIN
x = $1::NUMERIC;
RETURN TRUE;
EXCEPTION WHEN others THEN
RETURN FALSE;
END;
$$
STRICT
LANGUAGE plpgsql IMMUTABLE;
Calling this function on your data gives the following results:
WITH test(x) AS ( VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'), ('1x234'), ('1.234e-5'))
SELECT x, isnumeric(x) FROM test;
x | isnumeric
----------+-----------
| f
. | f
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
1x234 | f
1.234e-5 | t
(13 rows)
Not only is it more correct and easier to read, it will also work faster if the data actually is a number.
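The same "try the conversion and see whether it fails" idea can be sketched in Python for comparison. This is only an analogy: it uses decimal.Decimal rather than Postgres NUMERIC, so the accepted syntax is not identical (Decimal also accepts forms such as "NaN" or digits separated by underscores), and returning False for None is a choice made here, whereas the STRICT SQL function returns NULL for NULL input.

```python
from decimal import Decimal, InvalidOperation


def is_numeric(text):
    """Return True if text can be converted to a Decimal."""
    if text is None:
        return False  # assumption: treat missing input as non-numeric
    try:
        Decimal(text)
    except InvalidOperation:
        return False
    return True


SAMPLES = ["", ".", ".0", "0.", "0", "1", "123", "123.456",
           "abc", "1..2", "1.2.3.4", "1x234", "1.234e-5"]
accepted = [s for s in SAMPLES if is_numeric(s)]
print(accepted)
```

Catching only the conversion error, rather than every exception as WHEN others does, keeps unrelated failures visible.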
Your problem is the two "zero or more" [0-9] elements on each side of the decimal point. You need to use a logical OR (|) in the number identification line:
~'^([0-9]+\.?[0-9]*|\.[0-9]+)$'
This will exclude a decimal point alone as a valid number.
I suppose one could have that opinion (that it's not a misuse of exception handling), but generally I think that an exception handling mechanism should be used just for that. Testing whether a string contains a number is part of normal processing, and isn't "exceptional".
But you're right about not handling exponents. Here's a second stab at the regular expression (below). The reason I had to pursue a solution that uses a regular expression was that the solution offered as the "correct" solution here will fail when the directive is given to exit when an error is encountered:
SET exit_on_error = true;
We use this often when groups of SQL scripts are run, and when we want to stop immediately if there is any issue/error. When this session directive is given, calling the "correct" version of isnumeric will cause the script to exit immediately, even though there's no "real" exception encountered.
create or replace function isnumeric(text) returns boolean
immutable
language plpgsql
as $$
begin
if $1 is null or rtrim($1)='' then
return false;
else
return (select $1 ~ '^ *[-+]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][-+]?[0-9]+)? *$');
end if;
end;
$$;
Since PostgreSQL 9.4 (2014) you can just ask for the type of a JSON field:
jsonb_typeof(field)
From the PostgreSQL documentation:
json_typeof(json)
jsonb_typeof(jsonb)
Returns the type of the outermost JSON value as a text string. Possible types are object, array, string, number, boolean, and null.
Example
When aggregating numbers and wanting to ignore strings:
SELECT m.title, SUM(m.body::numeric)
FROM messages as m
WHERE jsonb_typeof(m.body) = 'number'
GROUP BY m.title;
Without the WHERE clause, the ::numeric cast would fail on non-numeric values.
The obvious problem with the accepted solution is that it is an abuse of exception handling. If there's another problem encountered, you'll never know it because you've tossed away the exceptions. Very bad form. A regular expression would be the better way to do this. The regex below seems to behave well.
create function isnumeric(text) returns boolean
immutable
language plpgsql
as $$
begin
if $1 is not null then
return (select $1 ~ '^(([-+]?[0-9]+(\.[0-9]+)?)|([-+]?\.[0-9]+))$');
else
return false;
end if;
end;
$$
;
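To see where this pattern draws the line, it can be exercised in Python's re module, again assuming the two regex engines agree on a pattern this simple. The names ISNUMERIC, CASES and verdicts are illustrative. Signs are handled, but a trailing decimal point and exponent notation are rejected, although the cast-based function above accepts both "0." and "1.234e-5".

```python
import re

ISNUMERIC = re.compile(r"^(([-+]?[0-9]+(\.[0-9]+)?)|([-+]?\.[0-9]+))$")

CASES = ["-1.5", "+.5", "42", "5.", "1e5", "", "-"]

# Map each input to whether the pattern classifies it as numeric.
verdicts = {s: ISNUMERIC.match(s) is not None for s in CASES}
for text, ok in verdicts.items():
    print(repr(text), ok)
```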