Regex Like for ORACLE with lookahead and negative lookahead - regex

I am working with a program that uploads email addresses to another program, but it accepts emails in only one format.
I tried to write a regular expression to filter out the email addresses that are not accepted:
^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>#,;: \[\]]{1,64}\"))\#(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$
The description says:
username#domain
The at sign ('#') must be present and not first or last character.
The length of the name can have up to and including 64 characters.
The length of the domain can have up to and including 64 characters.
All email addresses are forced to lowercase when the email is sent. Therefore any email addresses requiring uppercase will most likely not be delivered correctly by the ISP as we will have changed it to lowercase.
username
Can contain:
A-Z
a-z
0-9
! # $ % * + - . = ? ~ | ` _ ^
The entire name can be surrounded by double quotes (though this is not supported by many ISPs). In this case, the following additional characters are allowed between the quotes - ( ) { } < > # , ; : [ ] (space)
domain
Can contain:
A-Z
a-z
0-9
Cannot contain 2 or more consecutive periods
Must contain at least 1 period
Domain - Cannot begin or end with a period or dash
Also, the part with [] does not work.
Thanks for your help.

Oracle does not, natively, support non-capturing groups, look-ahead or look-behind in regular expressions.
However, if you have Java enabled in the database then you can compile a Java class:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;
public class RegexpMatch {
  public static int match(
    final String value,
    final String regex
  ){
    final Pattern pattern = Pattern.compile(regex);
    return pattern.matcher(value).matches() ? 1 : 0;
  }
}
/
And create a PL/SQL wrapper function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
and then you can use your regular expression (or any other regular expression that Java supports):
SELECT REGEXP_JAVA_MATCH(
'alice#example.com',
'^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>#,;: \[\]]{1,64}\"))\#(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$'
) AS match
FROM DUAL
Which outputs:
MATCH
1
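For readers who want to try the pattern outside the database, here is a minimal Python sketch of the same idea: hand the unchanged pattern to an engine that supports look-ahead and report 1 or 0. The names LOOKAHEAD_PATTERN and regexp_match are made up for this illustration, and Python's re module is only assumed to behave like Java's engine for this particular pattern.

```python
import re

# The asker's pattern, copied unchanged. '#' is the separator between
# the local part and the domain, exactly as written in the question.
LOOKAHEAD_PATTERN = re.compile(
    r'^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})'
    r'|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>#,;: \[\]]{1,64}\"))'
    r'\#(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)'
    r'([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$'
)


def regexp_match(value: str) -> int:
    """Return 1 if the whole value matches, else 0 (hypothetical helper)."""
    return 1 if LOOKAHEAD_PATTERN.fullmatch(value) else 0


if __name__ == "__main__":
    for sample in ("alice#example.com", "alice#.example.com", "alice#exa..mple.com"):
        print(sample, regexp_match(sample))
```

The look-aheads are what reject a domain that starts with a period or dash, or that contains two consecutive periods.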

Your regular expression can be rewritten in a form that Oracle supports:
(?:) non-capturing groups are not supported; use a () capturing group instead.
Look-ahead is not supported, but you can rewrite the look-ahead patterns using character lists, so #(?!\.)(?!-)([A-Za-z0-9.-]{1,61}) can be rewritten as #[A-Za-z0-9][A-Za-z0-9.-]{0,60}.
The (?!.*\.$) look-ahead is redundant, as the pattern ends with ([a-z]{2,10})$ and can never match a trailing "." character.
If you want to include ] and - in a character list, then ] should be the first character and - the last in the set.
The only thing that cannot be implemented in a single Oracle regular expression is simultaneously restricting the length of the post-# segment and ensuring that it contains no ".." double dots; to do that, you need to check one of those two conditions in a second regular expression.
SELECT REGEXP_SUBSTR(
REGEXP_SUBSTR(
'alice#example.com',
'^('
-- Unquoted local-part
|| '[A-Za-z0-9!#$%*+.=?~|`_^-]{1,64}'
-- or
|| '|'
-- Quoted local-part
|| '"[]A-Za-z0-9!#$%*+.=?~|`_^(){}<>#,;: [-]{1,64}"'
|| ')#'
-- Domains
|| '[A-Za-z0-9]([A-Za-z0-9.-]{0,60})?'
-- Top-level domain
|| '\.[a-z]{2,10}$'
),
-- Local-part
'^([^"]*?|".*?")'
|| '#'
-- Domains - exclude .. patterns
|| '([^.]+\.)+[a-z]{2,10}$'
) AS match
FROM DUAL
Or, using POSIX character lists:
SELECT REGEXP_SUBSTR(
REGEXP_SUBSTR(
'alice#example.com',
'^('
-- Unquoted local-part
|| '[[:alnum:]!#$%*+.=?~|`_^-]{1,64}'
-- or
|| '|'
-- Quoted local-part
|| '"[][:alnum:]!#$%*+.=?~|`_^(){}<>#,;: [-]{1,64}"'
|| ')#'
-- Domains
|| '[[:alnum:]]([[:alnum:].-]{0,60})?'
-- Top-level domain
|| '\.[[:lower:]]{2,10}$'
),
-- Local-part
'^([^"]*?|".*?")'
|| '#'
-- Domains
|| '([^.]+\.)+[[:lower:]]{2,10}$'
) AS match
FROM DUAL
Which both output:
MATCH
alice#example.com
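The two-pass idea can also be sketched without any look-ahead in Python. This is only an illustration under assumptions: FIRST_PASS, SECOND_PASS and is_accepted are hypothetical names, and the brackets and dash inside the quoted character class are backslash-escaped, which Python allows, instead of relying on their position in the list.

```python
import re

# Pass 1: allowed characters and lengths, no look-ahead needed.
FIRST_PASS = re.compile(
    r'^('
    r'[A-Za-z0-9!#$%*+.=?~|`_^-]{1,64}'                      # unquoted local part
    r'|'
    r'"[A-Za-z0-9!#$%*+.=?~|`_^(){}<>#,;: \[\]-]{1,64}"'     # quoted local part
    r')#'
    r'[A-Za-z0-9]([A-Za-z0-9.-]{0,60})?'                     # domain labels
    r'\.[a-z]{2,10}$'                                        # top-level domain
)

# Pass 2: every domain label is non-empty, which rules out "..".
SECOND_PASS = re.compile(r'^([^"]*?|".*?")#([^.]+\.)+[a-z]{2,10}$')


def is_accepted(value: str) -> bool:
    """True only if both passes match the whole value."""
    return bool(FIRST_PASS.fullmatch(value)) and bool(SECOND_PASS.fullmatch(value))


if __name__ == "__main__":
    for sample in ("alice#example.com", "alice#exa..mple.com"):
        print(sample, is_accepted(sample))
```

The first pass on its own still lets alice#exa..mple.com through, which is why the second pass is needed.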

Regex for validating account names for NEAR protocol

I want to have accurate form field validation for NEAR protocol account addresses.
I see at https://docs.near.org/docs/concepts/account#account-id-rules that the minimum length is 2, maximum length is 64, and the string must either be a 64-character hex representation of a public key (in the case of an implicit account) or must consist of "Account ID parts" separated by . and ending in .near, where an "Account ID part" consists of lowercase alphanumeric symbols separated by either _ or -.
Here are some examples.
The final 4 cases here should be marked as invalid (and there might be more cases that I don't know about):
example.near
sub.ex.near
something.near
98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de
wrong.near.suffix (INVALID)
shouldnotendwithperiod.near. (INVALID)
space should fail.near (INVALID)
touchingDotsShouldfail..near (INVALID)
I'm wondering if there is a well-tested regex that I should be using in my validation.
Thanks.
P.S. Originally my question pointed to what I was starting with at https://regex101.com/r/jZHtDA/1 but starting from scratch like that feels unwise given that there must already be official validation rules somewhere that I could copy.
I have looked at code that I would have expected to use some kind of validation, such as these links, but I haven't found it yet:
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/utils/account.js#L8
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/components/accounts/AccountFormAccountId.js#L95
https://github.com/near/near-cli/blob/cdc571b1625a26bcc39b3d8db68a2f82b91f06ea/commands/create-account.js#L75
The pre-release (v0.6.0-0) version of the JS SDK comes with a built-in accountId validation function:
const ACCOUNT_ID_REGEX =
  /^(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+$/;
/**
 * Validates the Account ID according to the NEAR protocol
 * [Account ID rules](https://nomicon.io/DataStructures/Account#account-id-rules).
 *
 * @param accountId - The Account ID string you want to validate.
 */
export function validateAccountId(accountId: string): boolean {
  return (
    accountId.length >= 2 &&
    accountId.length <= 64 &&
    ACCOUNT_ID_REGEX.test(accountId)
  );
}
https://github.com/near/near-sdk-js/blob/dc6f07bd30064da96efb7f90a6ecd8c4d9cc9b06/lib/utils.js#L113
Feel free to implement this in your program too.
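If your form validation runs somewhere other than JavaScript, the same check is easy to port. Below is a rough Python sketch of it, run against the examples from the question. The name validate_account_id is my own, and [0-9] replaces \d because Python's \d also matches non-ASCII digits.

```python
import re

# Same structure as the SDK regex: parts separated by '.', where each
# part is lowercase alphanumerics joined by single '-' or '_' characters.
ACCOUNT_ID_REGEX = re.compile(
    r'^(([a-z0-9]+[-_])*[a-z0-9]+\.)*([a-z0-9]+[-_])*[a-z0-9]+$'
)


def validate_account_id(account_id: str) -> bool:
    """Length must be 2..64 and the whole string must match the regex."""
    return (
        2 <= len(account_id) <= 64
        and ACCOUNT_ID_REGEX.fullmatch(account_id) is not None
    )


if __name__ == "__main__":
    for name in (
        "example.near",
        "98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de",
        "wrong.near.suffix",
        "touchingDotsShouldfail..near",
    ):
        print(name, validate_account_id(name))
```

Note that this rule set does not require a .near suffix, so wrong.near.suffix passes; if you need to insist on the suffix, that has to be an extra check on top.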
Something like this should do: /^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$/gm
Breakdown
^ # start of line
(
\w # match word characters (letters, digits, underscore)
| # OR
(?<!\.)\. # dots can't be preceded by dots
)+
(?<!\.) # "." should not precede:
\. # "."
(testnet|near) # match "testnet" or "near"
$ # end of line
Try the Regex out: https://regex101.com/r/vctRlo/1
If you want to match word characters only, separated by a dot:
^\w+(?:\.\w+)*\.(?:testnet|near)$
Explanation
^ Start of string
\w+ Match 1+ word characters
(?:\.\w+)* Optionally repeat . and 1+ word characters
\. Match .
(?:testnet|near) Match either testnet or near
$ End of string
A slightly broader variant that matches any non-whitespace character except the dot:
^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$
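To see how the look-behind pattern and the plain \w+ pattern behave on the question's examples, here is a small Python comparison. It is only a sketch: the names LOOKBEHIND, PLAIN and classify are invented here, and re.ASCII is an assumption made so that \w means [A-Za-z0-9_] only.

```python
import re

# Pattern using look-behind so that dots cannot touch.
LOOKBEHIND = re.compile(r'^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$', re.ASCII)
# Pattern using repeated ".word" groups instead.
PLAIN = re.compile(r'^\w+(?:\.\w+)*\.(?:testnet|near)$', re.ASCII)


def classify(name: str) -> tuple:
    """Return (look-behind result, plain result) for one candidate name."""
    return (
        LOOKBEHIND.fullmatch(name) is not None,
        PLAIN.fullmatch(name) is not None,
    )


if __name__ == "__main__":
    for name in (
        "example.near",
        "sub.ex.near",
        "wrong.near.suffix",
        "touchingDotsShouldfail..near",
        ".a.near",
    ):
        print(name, classify(name))
```

Both patterns reject the 64-character implicit account ID, since it has no .near or .testnet suffix. They differ on a leading dot: the look-behind version accepts .a.near, the \w+ version does not.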

Postgres Regex Negative Lookahead

Scenario: Match any string that starts with "J01" except the string "J01FA09".
I'm baffled why the following code returns nothing:
SELECT 1
WHERE
'^J01(?!FA09).*' ~ 'J01FA10'
when I can see on regexr.com that it's working (I realize there are different flavors of regex and that could be the reason for the site working).
I have confirmed in the postgres documentation that negative look aheads are supported though.
Table 9-15. Regular Expression Constraints
(?!re) negative lookahead matches at any point where no substring
matching re begins (AREs only). Lookahead constraints cannot contain
back references (see Section 9.7.3.3), and all parentheses within them
are considered non-capturing.
Match any string that starts with "J01" except the string "J01FA09".
You can do without a regex using
WHERE s LIKE 'J01%' AND s != 'J01FA09'
Here, LIKE 'J01%' requires a string to start with J01 and then may have any chars after, and s != 'J01FA09' will filter out the matches.
If you want to achieve the same with a regex, use
WHERE s ~ '^J01(?!FA09$)'
The ^ matches the start of a string, J01 matches the literal J01 substring, and (?!FA09$) asserts that J01 is not immediately followed by FA09 and then the end of the string. If FA09 appears there and is followed by the end of the string, no match is returned.
See the online demo:
CREATE TABLE table1
(s character varying)
;
INSERT INTO table1
(s)
VALUES
('J01NNN'),
('J01FFF'),
('J01FA09'),
('J02FA09')
;
SELECT * FROM table1 WHERE s ~ '^J01(?!FA09$)';
SELECT * FROM table1 WHERE s LIKE 'J01%' AND s != 'J01FA09';
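The effect of the $ inside the look-ahead is easy to check outside Postgres as well. Here is a quick Python sketch using the same sample rows plus one extra value, J01FA091, which I added to show the difference. The names EXACT, PREFIX and keep are only for illustration.

```python
import re

# Rejects only the exact string J01FA09.
EXACT = re.compile(r'^J01(?!FA09$)')
# Rejects every string that starts with J01FA09.
PREFIX = re.compile(r'^J01(?!FA09)')

ROWS = ['J01NNN', 'J01FFF', 'J01FA09', 'J02FA09', 'J01FA091']


def keep(pattern, values):
    """Return the values in which the pattern finds a match."""
    return [v for v in values if pattern.search(v)]


if __name__ == "__main__":
    print(keep(EXACT, ROWS))
    print(keep(PREFIX, ROWS))
```

With the $, J01FA091 is kept; without it, anything beginning with J01FA09 is filtered out.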
The regular expression must be the right-hand operand of ~:
SELECT 1
WHERE 'J01FA10' ~ '^J01(?!FA09)';
?column?
----------
1
(1 row)

Oracle Substring after specific character

I already found out that I need to use substr/instr or regex, but even after reading the documentation about those, I can't get it done...
I am here on Oracle 11.2.
So here is what I have.
A list of Strings like:
743H5-34L-56
123HD34-7L
12HSS-34R
23Z67-4R-C23
What I need is the number (one or two digits) after the first '-', up to the following 'L' or 'R'.
Does anybody have some advice?
regexp_replace(string, '^.*?-(\d+)[LR].*$', '\1')
Another version (without fancy lookarounds :-) :
with v_data as (
select '743H5-34L-56' val from dual
union all
select '123HD34-7L' val from dual
union all
select '12HSS-34R' val from dual
union all
select '23Z67-4R-C23' val from dual
)
select
val,
regexp_replace(val, '^[^-]+-(\d+)[LR].*', '\1')
from v_data
It matches
the beginning of the string "^"
one or more characters that are not a '-' "[^-]+"
followed by a '-' "-"
followed by one or more digits (capturing them in a group) "(\d+)"
followed by 'L' or 'R' "[LR]"
followed by zero or more arbitrary characters ".*"
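The same pattern can be tried out quickly in Python. This is a sketch only, with one assumption worth flagging: number_after_first_dash is a made-up helper that returns None when nothing matches, whereas regexp_replace returns the input string unchanged in that case.

```python
import re

# Start of string, a run of non-dash characters, the first '-',
# then the digits we want, followed by 'L' or 'R'.
PATTERN = re.compile(r'^[^-]+-(\d+)[LR]')


def number_after_first_dash(val: str):
    """Return the captured digits, or None if the value does not fit."""
    match = PATTERN.match(val)
    return match.group(1) if match else None


if __name__ == "__main__":
    for val in ('743H5-34L-56', '123HD34-7L', '12HSS-34R', '23Z67-4R-C23'):
        print(val, number_after_first_dash(val))
```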

Proper way to add unescaped text from a field to a regex in postgres?

What's the proper way to add a literal text value from a field to a regex in postgres?
For example, something like this where some_field could contain invalid regex syntax if left unescaped:
where some_text ~* ('\m' || some_field || '\M');
The easiest thing to do is to use a regex to prep your string to be in a regex. Escaping non-word characters in your string should be sufficient to make it regex-safe, for example:
=> select regexp_replace('. word * and µ{', E'([^\\w\\s])', E'\\\\\\1', 'g');
regexp_replace
--------------------
\. word \* and µ\{
So something like this should work in general:
where some_text ~* (x || regexp_replace(some_field, E'([^\\w\\s])', E'\\\\\\1', 'g') || y)
where x and y are the other parts of the regex.
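For comparison, the same escape-then-embed approach looks like this in Python, where re.escape does the escaping for you. This is a sketch under assumptions: contains_word is a hypothetical helper, and \b is used on both sides because Python has no direct equivalent of Postgres's \m and \M.

```python
import re


def contains_word(text: str, field: str) -> bool:
    """Case-insensitive search for the literal field value as a whole word."""
    # re.escape neutralises any regex metacharacters in the field value.
    pattern = r'\b' + re.escape(field) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    print(contains_word('the a.c file', 'a.c'))
    print(contains_word('the abc file', 'a.c'))
```

Without the escaping step, the dot in a.c would match any character and abc would be reported as a hit.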
If you didn't need a regex at the end (i.e. no y above), then you could use (?q):
An ARE can begin with embedded options: a sequence (?xyz) (where xyz is one or more alphabetic characters) specifies options affecting the rest of the RE.
and a q means that the:
rest of RE is a literal ("quoted") string, all ordinary characters
So you could use:
where some_text ~* (x || '(?q)' || some_field)
in this limited case.

How can I use regex to validate iSCSI target names?

I am trying to craft a regexp to validate iSCSI qualified names. An example of a qualified name is iqn.2011-08.com.example:storage. This example is minimal; I have seen other examples that are more extended.
So far, this is what I have to validate it:
print "Enter a new target name: ";
my $target_name = <STDIN>;
chomp $target_name;
if ($target_name =~ /^iqn\.\d{4}-\d{2}/xmi) {
print GREEN . "Target name is valid!" . RESET . "\n";
} else {
print RED . "Target name is not valid!" . RESET . "\n";
}
How can I extend that to work with the rest, up to the ":"? I am not going to parse anything after the ":" because it is a description tag.
Is there a limit to how big a domain name can be?
According to RFC 3720 (and, in turn, RFC 1035):
/
(?(DEFINE)
(?<IQN_PAT>
iqn
\.
[0-9]{4}-[0-9]{2}
\.
(?&REV_SUBDOMAIN_PAT)
(?: : .* )?
)
(?<EUI_PAT>
eui
\.
[0-9A-Fa-f]{16}
)
(?<REV_SUBDOMAIN_PAT>
(?&LABEL_PAT) (?: \. (?&LABEL_PAT) )*
)
(?<LABEL_PAT>
[A-Za-z] (?: [A-Za-z0-9\-]* [A-Za-z0-9] )?
)
)
^ (?: (?&IQN_PAT) | (?&EUI_PAT) ) \z
/sx
It's not clear if the eui names accept lowercase hex digits or not. I figured it was safer to allow them.
If you condense the above, you get /^(?:iqn\.[0-9]{4}-[0-9]{2}(?:\.[A-Za-z](?:[A-Za-z0-9\-]*[A-Za-z0-9])?)+(?::.*)?|eui\.[0-9A-Fa-f]{16})\z/s.
(By the way, your use of /m is wrong, your use of /i is wrong, and \d can match far more than the allowed [0-9].)
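If you want to experiment with the condensed pattern outside Perl, here is a rough Python port. Two assumptions are made: fullmatch replaces the ^ and \z anchors, and re.DOTALL stands in for /s. The names ISCSI_NAME and is_valid_iscsi_name are mine.

```python
import re

# Condensed pattern from above, without the anchors (fullmatch supplies them).
ISCSI_NAME = re.compile(
    r'(?:iqn\.[0-9]{4}-[0-9]{2}'
    r'(?:\.[A-Za-z](?:[A-Za-z0-9\-]*[A-Za-z0-9])?)+'   # reversed domain labels
    r'(?::.*)?'                                        # optional ":description"
    r'|eui\.[0-9A-Fa-f]{16})',                         # or an EUI-64 name
    re.DOTALL,
)


def is_valid_iscsi_name(name: str) -> bool:
    """True if the whole name matches the iqn or eui form."""
    return ISCSI_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ('iqn.2011-08.com.example:storage', 'iqn.2011-8.com.example'):
        print(name, is_valid_iscsi_name(name))
```

Because the pattern is case-sensitive, an upper-case IQN prefix is rejected, which is the point of the remark about /i.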
If you only need the part before the ":", you can use the following regexp:
if ($target_name =~ /^iqn\.(\d{4}-\d{2})\.([^:]+):/xmi) {
    my ($date, $reversed_domain_name) = ($1, $2);
    # work with $date and $reversed_domain_name here
}
The regexp [^:]+ matches one or more non-":" characters. It will match even if the domain name is not well formed. Further improvements depend on your goal: do you just need to get the individual components of the iSCSI name, or do you need to validate its syntax?
Is there a limit to how big a domain name can be?
From Wikipedia:
The full domain name may not exceed a total length of 253