Regular expression for Swift X character set in Oracle


As per the SWIFT documentation, the SWIFT X character set consists of the following:
X Character Set - SWIFT Character Set
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
/ - ? : ( ) . , ' + CrLf Space
I have come up with the pattern below to validate the SWIFT character set. It seems to work, but I would like to know whether there is a better way of doing it. Also, what should I use for CRLF to be OS-neutral? Since I use Unix, I have put chr(10).
^[a-zA-Z0-9 -?:().,''+chr(10)/]*$

Unfortunately, a range like a-z may include accented letters and collation elements, depending on the value of nls_sort at the time of running a query. And, alas, Oracle does not support the character class [[:ascii:]], which would be another way to specify what you need.
You have two choices. Either you specify the nls_sort parameter explicitly every time, before running the query (or rely on it being something like 'English' already), which to me doesn't sound like a good practice; or you specify all letters explicitly.
There are a few more things to fix. The dash - has a special meaning in a bracketed expression (it denotes a range); if you want it to mean a literal dash, it should appear as either the first or the last character in the list, where it can't have its special meaning. Apart from the dash, a leading ^ and the closing ], regexp special characters are not special in a bracketed expression, so you don't need to worry about dot, question mark, asterisk, parentheses, etc.
However, note that the single-quote character has no special meaning in a regular expression (in a bracketed expression or otherwise), but it does have a special meaning in an Oracle string literal; to include a single quote in a hard-coded string, you must escape it by typing two single-quote characters.
Then, if you write chr(10) inside a bracketed expression, it is just the characters c, h, r, (, 1, 0 and ). If you mean LF, you need to either include an actual newline character in your string (probably a bad idea) or concatenate chr(10) into the pattern by hand.
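These bracketed-expression pitfalls are easy to reproduce outside the database. The following is a minimal sketch in Python, used here only as a convenient stand-in (its re module is not Oracle's regex engine, and the names original, fixed and accepts are made up for illustration). It shows that chr(10) typed inside the brackets is plain text, and that a dash between two characters forms a range:

```python
import re

# The question's pattern as typed: " -?" is a range (space through "?"),
# and "chr(10)" is just the literal characters c, h, r, (, 1, 0, ).
original = re.compile(r"^[a-zA-Z0-9 -?:().,'+chr(10)/]*$")

# Dash moved to the end so it is literal; LF and CR concatenated in explicitly.
fixed = re.compile("^[a-zA-Z0-9/?:().,'+ " + chr(10) + chr(13) + "-]*$")


def accepts(pattern, text):
    """True when the whole text matches the pattern."""
    return pattern.match(text) is not None


print(accepts(original, "$122.38"))  # True: "$" falls inside the " -?" range
print(accepts(original, "a\nb"))     # False: no real newline in the class
print(accepts(fixed, "$122.38"))     # False
print(accepts(fixed, "a\nb"))        # True
```

In Python a-z is always the 26 ASCII letters, so the nls_sort concern described above has no counterpart in this sketch.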
And if you want to validate against the official character set of "swift x" (whatever that is), you should include all characters, regardless of your OS. So you should accept CR (chr(13)) too, unless you have a better reason to omit it. If it is present but you don't want it in your db, you should accept it and then remove it after the fact and save the resulting string (after you remove CR), not reject the entire string altogether.
To keep the work organized, I would create a very small table (or view) to store the needed validation string, then use it in all queries that need it.
Something like this:
create table swift_validation (validation_pattern varchar2(100));
insert into swift_validation (validation_pattern)
  with helper(ascii_letters) as (
    select 'abcdefghijklmnopqrstuvwxyz' from dual
  )
  select '^[' || ascii_letters        -- a-z (ASCII)
              || upper(ascii_letters) -- A-Z (ASCII)
              || '0-9'
              || chr(10) || chr(13)   -- LF and CR
              || '/?:().,''+ -'
              || ']*$'
  from   helper
;
commit;
Check what was saved in the table:
select * from swift_validation;
VALIDATION_PATTERN
------------------------------------------------------------
^[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0-9

/?:().,'+ -]*$
Note that the result is in three lines. chr(10) is seen as a newline; then chr(13) by itself is converted to another newline.
In any case, if you really want to see the exact characters saved in this string, you can use the dump function. With option 17 for the second argument, the output is readable (you will have to scroll though):
select dump(validation_pattern, 17) from swift_validation;
DUMP(VALIDATION_PATTERN,17)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Typ=1 Len=73: ^,[,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,0,-,9,^J,^M,/,?,:,(,),.,,,',+, ,-,],*,$
Notice in particular the control characters, ^J and ^M; they mean chr(10) and chr(13) respectively (easy to remember: J and M are the tenth and thirteenth letters of the Latin alphabet).
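The caret notation follows simple arithmetic: a control character with code n is shown as ^ followed by the character with code n + 64. A small Python sketch of that convention (caret_notation is a made-up helper, not how Oracle's dump is implemented):

```python
def caret_notation(ch):
    """Return ^X for ASCII control characters, the character itself otherwise."""
    code = ord(ch)
    if code < 32:
        return "^" + chr(code + 64)
    return ch


print(caret_notation(chr(10)), caret_notation(chr(13)))  # ^J ^M
```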
Then you use this as follows:
with
test_strings (str) as (
select 'abc + (12)' from dual union all
select '$122.38' from dual union all
select null from dual union all
select 'café au lait' from dual union all
select 'A / B - C * D' from dual
)
select t.str,
case when regexp_like(t.str, sv.validation_pattern)
then 'valid' else 'invalid' end as swift_valid
from test_strings t, swift_validation sv
;
STR           SWIFT_VALID
------------- -----------
abc + (12)    valid
$122.38       invalid
              invalid
café au lait  invalid
A / B - C * D invalid
Notice one last oddity here. In my test, I included a row where the input string is empty (null). Oracle regular expressions are odd in this respect: null is not regexp_like anything, not even 'a*', even though * is supposed to mean "zero or more". Oracle's reasoning, perhaps, is that null may be anything; it is just one of the hundreds of ways in which Oracle's identification of null and "empty string" is plain idiotic. It is what it is, though; make sure you don't reject a row with an empty string (I assume "swift x" allows empty strings). You will need to handle that separately, like this:
with
test_strings (str) as (
select 'abc + (12)' from dual union all
select '$122.38' from dual union all
select null from dual union all
select 'café au lait' from dual union all
select 'A / B - C * D' from dual
)
select t.str,
case when t.str is null
or regexp_like(t.str, sv.validation_pattern)
then 'valid' else 'invalid' end as swift_valid
from test_strings t, swift_validation sv
;
STR           SWIFT_VALID
------------- -----------
abc + (12)    valid
$122.38       invalid
              valid
café au lait  invalid
A / B - C * D invalid
Left as exercise:
You may need to find the invalid characters in an invalid string. For such generalized applications (more than a straight validation of a whole string), you might be better off saving just the bracketed expression in the swift_validation table (without the leading anchor ^, and the trailing quantifier * and anchor $). Then you need to re-write the validation query slightly, to concatenate these fragments to the validation pattern in the regexp_like condition; but then you can include, for example, an additional column to show the first invalid character in an invalid string.
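One way to approach the exercise is to negate the bracketed expression and search for the first character it matches. The sketch below does this in Python rather than Oracle SQL, purely as an illustration; ALLOWED, INVALID_CHAR and first_invalid are hypothetical names, and CR and LF are treated as individually valid, as in the first version of the pattern above.

```python
import re
import string

# Characters from the SWIFT X list quoted in the question (CR and LF included).
ALLOWED = string.ascii_letters + string.digits + "/?:().,'+ -" + "\r\n"

# Negated bracketed expression: matches any single character NOT in the set.
INVALID_CHAR = re.compile("[^" + re.escape(ALLOWED) + "]")


def first_invalid(text):
    """Return (position, character) of the first invalid character, or None."""
    m = INVALID_CHAR.search(text)
    return None if m is None else (m.start(), m.group())


print(first_invalid("abc + (12)"))     # None
print(first_invalid("$122.38"))        # (0, '$')
print(first_invalid("A / B - C * D"))  # (10, '*')
```

In Oracle the same idea would presumably use regexp_substr or regexp_instr with the negated list; that translation is not shown here.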
EDIT
In follow-up discussion (see comments below this answer), the OP clarified that only the combination chr(13) || chr(10) (in that order) is permitted. chr(10) and chr(13) are invalid if they appear by themselves, or in the wrong order.
This makes the problem more interesting (more complicated). To allow only the letters a, b, c or the sequence xy (that is: x alone, or y alone, are not allowed; every x must appear followed immediately by y, and every y must appear immediately preceded by x), the proper matching pattern looks like
'^([abc]|xy)*$'
Here expr1|expr2 is alternation, and it needs to be enclosed in parentheses to apply the * quantifier.
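The behaviour of this pattern can be checked quickly in Python, whose syntax for alternation and grouping is the same for this simple case (a sketch only; the sample strings are invented):

```python
import re

# Either one of a, b, c, or the exact two-character sequence "xy", repeated.
pattern = re.compile(r"^([abc]|xy)*$")

for s in ["abxyc", "xyxy", "abx", "ayb", "yx"]:
    print(repr(s), pattern.match(s) is not None)
```

Only the first two samples match: x and y are accepted solely as the adjacent pair xy.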
An additional complication is that $ doesn't actually match "the end of the input string"; it anchors either at the end of the input string, or if the input string ends in newline (chr(10)), it anchors before that character. Happily, there is the alternative anchor \z that doesn't suffer from that defect; it anchors truly at the end of the input string. This will be needed if we don't want to validate input strings that end in chr(10) not preceded immediately by chr(13). (If we do want to allow those - even though technically they do violate the "swift x" rules - then replace \z with $ as we had it before).
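Python's re module has the same quirk, so the difference between the two anchors can be demonstrated there. Note the spelling: Python writes the strict end-of-string anchor as \Z, which plays the role that \z plays in Oracle. The names below are illustrative only.

```python
import re

body = r"^([abc]|\r\n)*"
with_dollar = re.compile(body + "$")    # also matches just before a final newline
strict_end = re.compile(body + r"\Z")   # matches only at the very end

lone_lf = "abc\n"
crlf = "abc\r\n"

print(with_dollar.match(lone_lf) is not None)  # True: "$" anchors before the LF
print(strict_end.match(lone_lf) is not None)   # False: the lone LF is rejected
print(strict_end.match(crlf) is not None)      # True: CR LF pair is consumed
```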
Here I demonstrate a slightly modified approach: the small table that stores the validation rule now contains only the alternation (either one character out of an enumeration, or the two-character sequence chr(13) || chr(10)), letting the "caller" wrap this in whatever is needed for a complete matching pattern.
The small table (note that I changed the column name):
drop table swift_validation purge;
create table swift_validation (valid_patterns varchar2(100));
insert into swift_validation (valid_patterns)
with helper(ascii_letters) as (
select 'abcdefghijklmnopqrstuvwxyz' from dual
)
select '[' -- open bracketed expression
|| ascii_letters -- a-z (ASCII)
|| upper(ascii_letters) -- A-Z (ASCII)
|| '0-9'
|| '/?:().,''+ -' -- '' escape for ', - last
|| ']' -- close bracketed expression
|| '|' -- alternation
|| chr(13) || chr(10) -- CR LF
from helper
;
commit;
Testing (notice the modified match pattern: now the ^ and \z anchors, the parentheses and the * quantifier are hard-coded in the query, not in the saved string):
with
test_strings (id, str) as (
select 1, 'abc + (12)' from dual union all
select 2, '$122.38' from dual union all
select 3, null from dual union all
select 4, 'no_underline' from dual union all
select 5, 'A / B - C * D' from dual union all
select 6, 'abc' || chr(10) || chr(13) from dual union all
select 7, 'abc' || chr(10) from dual union all
select 8, 'abc' || chr(13) || chr(10) from dual union all
select 9, 'café au lait' from dual
)
select t.id, t.str,
case when t.str is null
or regexp_like(t.str, '^(' || sv.valid_patterns || ')*\z')
then 'valid' else 'invalid' end as swift_valid
from test_strings t, swift_validation sv
;
ID STR           SWIFT_VALID
-- ------------- -----------
 1 abc + (12)    valid
 2 $122.38       invalid
 3               valid
 4 no_underline  invalid
 5 A / B - C * D invalid
 6 abc           invalid
 7 abc           invalid
 8 abc           valid
 9 café au lait  invalid
The newline characters (CR and LF) aren't clearly visible in the output; I added an id column so you can reference the output according to the input in the with clause.
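For readers who want to experiment with the rule outside Oracle, here is a rough Python equivalent of the final check. It is a sketch under assumptions: SINGLE, VALID and swift_valid are invented names, None stands in for Oracle's null, and fullmatch replaces the explicit ^ and \z anchors.

```python
import re
import string

# One allowed character (no CR or LF here) ...
SINGLE = "[" + re.escape(string.ascii_letters + string.digits + "/?:().,'+ -") + "]"
# ... or the exact pair CR LF, repeated any number of times.
VALID = re.compile("(?:" + SINGLE + "|\r\n)*")


def swift_valid(text):
    """None (null) is treated as valid, mirroring the 'str is null' branch."""
    return text is None or VALID.fullmatch(text) is not None


print(swift_valid("abc\n\r"))  # False: wrong order
print(swift_valid("abc\n"))    # False: LF without CR
print(swift_valid("abc\r\n"))  # True
```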

Related

Regex Like for ORACLE with lookahead and negative lookahead

I am working with a program that uploads email addresses to another program, but it accepts email addresses in only one format.
I tried to write a regular expression to filter out email addresses that are not accepted:
^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>#,;: \[\]]{1,64}\"))\#(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$
The description says:
username#domain
The at sign ('#') must be present and not first or last character.
The length of the name can have up to and including 64 characters.
The length of the domain can have up to and including 64 characters.
All email addresses are forced to lowercase when the email is sent. Therefore any email addresses requiring uppercase will most likely not be delivered correctly by the ISP as we will have changed it to lowercase.
username
Can contain:
A-Z
a-z
0-9
! # $ % * + - . = ? ~ | ` _ ^
The entire name can be surrounded by double quotes (though this is not supported by many ISPs). In this case, the following additional characters are allowed between the quotes - ( ) { } < > # , ; : [ ] (space)
domain
Can contain:
A-Z
a-z
0-9
Cannot contain 2 or more consecutive periods
Must contain at least 1 period
Domain - Cannot begin or end with a period or dash
Also, the part with [] does not work.
Thanks for your help.

Oracle does not, natively, support non-capturing groups, look-ahead or look-behind in regular expressions.
However, if you have Java enabled in the database then you can compile a Java class:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;
public class RegexpMatch {
  public static int match(
    final String value,
    final String regex
  ){
    final Pattern pattern = Pattern.compile(regex);
    return pattern.matcher(value).matches() ? 1 : 0;
  }
}
/
And create a PL/SQL wrapper function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
and then you can use your regular expression (or any other regular expression that Java supports):
SELECT REGEXP_JAVA_MATCH(
'alice#example.com',
'^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>#,;: \[\]]{1,64}\"))\#(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$'
) AS match
FROM DUAL
Which outputs:
MATCH
1

Your regular expression can be rewritten in a form that Oracle supports, as follows:
Non-capturing groups (?:) are not supported; use an ordinary capturing group () instead.
Look-ahead is not supported, but you can rewrite the look-ahead patterns using character lists, so #(?!\.)(?!-)([A-Za-z0-9.-]{1,61}) can be rewritten as #[A-Za-z0-9][A-Za-z0-9.-]{0,60}.
The (?!.*\.$) look-ahead is redundant: the pattern ends with ([a-z]{2,10})$, so it can never match a trailing dot.
If you want to include ] and - in a character list, then ] should be the first character and - the last in the set.
The only thing that cannot be done in a single Oracle regular expression is to restrict the length of the post-# segment and, at the same time, ensure that it contains no consecutive dots; you need to check one of those two conditions in a second regular expression:
SELECT REGEXP_SUBSTR(
REGEXP_SUBSTR(
'alice#example.com',
'^('
-- Unquoted local-part
|| '[A-Za-z0-9!#$%*+.=?~|`_^-]{1,64}'
-- or
|| '|'
-- Quoted local-part
|| '"[]A-Za-z0-9!#$%*+.=?~|`_^(){}<>#,;: [-]{1,64}"'
|| ')#'
-- Domains
|| '[A-Za-z0-9]([A-Za-z0-9.-]{0,60})?'
-- Top-level domain
|| '\.[a-z]{2,10}$'
),
-- Local-part
'^([^"]*?|".*?")'
|| '#'
-- Domains - exclude .. patterns
|| '([^.]+\.)+[a-z]{2,10}$'
) AS match
FROM DUAL
Or, using POSIX character lists:
SELECT REGEXP_SUBSTR(
REGEXP_SUBSTR(
'alice#example.com',
'^('
-- Unquoted local-part
|| '[[:alnum:]!#$%*+.=?~|`_^-]{1,64}'
-- or
|| '|'
-- Quoted local-part
|| '"[][:alnum:]!#$%*+.=?~|`_^(){}<>#,;: [-]{1,64}"'
|| ')#'
-- Domains
|| '[[:alnum:]]([[:alnum:].-]{0,60})?'
-- Top-level domain
|| '\.[[:lower:]]{2,10}$'
),
-- Local-part
'^([^"]*?|".*?")'
|| '#'
-- Domains
|| '([^.]+\.)+[[:lower:]]{2,10}$'
) AS match
FROM DUAL
Which both output:
MATCH
alice#example.com
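The claim that the two negative look-aheads can be replaced by a leading character list is easy to sanity-check in an engine that supports both forms. The Python sketch below compares the two fragments on a few invented samples; it keeps the # separator used throughout this page and adds a trailing $ to both fragments so that the length limit is actually exercised. All names are illustrative.

```python
import re

with_lookahead = re.compile(r"#(?!\.)(?!-)([A-Za-z0-9.-]{1,61})$")
without_lookahead = re.compile(r"#[A-Za-z0-9][A-Za-z0-9.-]{0,60}$")

samples = [
    "#example", "#.example", "#-example", "#ex-am.ple", "#e", "#",
    "#" + "a" * 61,  # longest accepted segment
    "#" + "a" * 62,  # one character too long
]


def agree(text):
    """True when both fragments give the same verdict for the text."""
    return (with_lookahead.search(text) is None) == (without_lookahead.search(text) is None)


print(all(agree(s) for s in samples))  # True
```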

Regexp to Validate an email local-part ORACLE

I'm trying to build a regexp for email validation on both parts; local-part and Domain-part, respectively:
Local-Part: ^[A-Z0-9][A-Z0-9._%+-]{0,63} - The maximum total length of the local-part of an email address is 64 octets.
Domain-Part: (([A-Z0-9]{1,63})[A-Z0-9]+(-[A-Z0-9]+)*\.){1,8}[A-Z]{2,63}$
Regexp : ^[A-Z0-9][A-Z0-9._%+-]{0,63}#(([A-Z0-9]{1,63})[A-Z0-9]+(-[A-Z0-9]+)*\.){1,8}[A-Z]{2,63}$
I'm satisfied with the domain part, but there are some rules for the local part, concerning dots ('.'), that I have not been able to enforce.
Dot rule:
provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. John..Doe#example.com is not allowed but "John..Doe"#example.com is allowed).
My regexp already guarantees that a dot is not the first character. I need to check for consecutive dots (e.g. '..' is not allowed but ".." in quotes is allowed), and '.#' must not occur either.
Any help, please?

Add a second check that the email does not match any number of unquoted, non-at characters and then two consecutive dots.
SELECT *
FROM your_table
WHERE REGEXP_LIKE(
email,
-- your regular expression
'^[A-Z0-9][A-Z0-9._%+-]{0,63}#(([A-Z0-9]{1,63})[A-Z0-9]+(-[A-Z0-9]+)*\.){1,8}[A-Z]{2,63}$'
)
AND NOT REGEXP_LIKE( email, '^[^"#]+\.\.' )
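The two-condition idea carries over directly to other regex engines. As a sketch only, here it is in Python with the question's (upper-case-only) pattern and the same second check; main_check, double_dot and acceptable are made-up names, and # is kept as the separator to match the examples on this page.

```python
import re

# The asker's pattern, unchanged.
main_check = re.compile(
    r"^[A-Z0-9][A-Z0-9._%+-]{0,63}#"
    r"(([A-Z0-9]{1,63})[A-Z0-9]+(-[A-Z0-9]+)*\.){1,8}[A-Z]{2,63}$"
)
# Second check: unquoted, non-separator characters followed by two dots.
double_dot = re.compile(r'^[^"#]+\.\.')


def acceptable(email):
    return main_check.search(email) is not None and double_dot.search(email) is None


print(acceptable("JOHN.DOE#EXAMPLE.COM"))   # True
print(acceptable("JOHN..DOE#EXAMPLE.COM"))  # False: rejected by the second check
```

A quoted local part such as "JOHN..DOE" never triggers the second check, because [^"#]+ cannot match at the start of the string; the main pattern still rejects it, since that pattern does not allow quotes at all.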
However, I feel that you would be better off not using your regular expression, since it does not accept quotes or extended character sets (if you can get it working in Oracle, use the one from the linked answer instead). Even better, just accept whatever has been entered and send a confirmation e-mail to the user: this not only checks that the e-mail address is syntactically valid, but also that it exists and that the user wants whatever service you are providing.
Update:
To use the regular expression in the answer linked above:
CREATE TABLE test_data ( id, email ) AS
SELECT 1, 'abc#example.com' FROM DUAL UNION ALL
SELECT 2, 'abc.def#example.com' FROM DUAL UNION ALL
SELECT 3, 'abc..def#example.com' FROM DUAL UNION ALL
SELECT 4, 'abc.def.#example.com' FROM DUAL UNION ALL
SELECT 5, '".abc.."#example.com' FROM DUAL UNION ALL
SELECT 6, 'abc.def++yourdomain.com#example.com' FROM DUAL UNION ALL
SELECT 7, '"with\"quotes\""#example.com' FROM DUAL UNION ALL
SELECT 8, '""#example.com' FROM DUAL UNION ALL
SELECT 9, '"\' || CHR(9) || '"#example.com' FROM DUAL UNION ALL
SELECT 10, '"""#example.com' FROM DUAL UNION ALL
SELECT 11, '123456789.123456789.123456789.123456789.123456789.123456789.1234567890#example.com' FROM DUAL UNION ALL
SELECT 12, 'ABC#example.com' FROM DUAL;
Query:
SELECT *
FROM test_data
WHERE REGEXP_LIKE(
email,
'^('
-- Unquoted local-part
|| '[a-z0-9!#$%&''*+/=?^_{|}~-]+(\.[a-z0-9!#$%&''*+/=?^_{|}~-]+)*'
-- ^^
-- Allow a dot but always expect a
-- non-dot after it.
-- Quoted local-part
|| '|"('
-- Unescaped characters in the quotes
|| '[]' || CHR(1) || '-' || CHR(8) || CHR(11) || CHR(12) || CHR(14) || '-!#-[^-'||CHR(127)||']'
-- Escaped characters in the quotes
|| '|\\[' || CHR(1) || '-' || CHR(9) || CHR(11) || CHR(12) || CHR(14) || '-' || CHR(127) || ']'
|| ')*"'
|| ')'
-- Match at symbol at end of local-part
|| '#',
-- Case insensitive
'i'
)
Output:
ID | EMAIL
-: | :---------------------------------------------------------------------------------
1 | abc#example.com
2 | abc.def#example.com
5 | ".abc.."#example.com
6 | abc.def++yourdomain.com#example.com
7 | "with\"quotes\""#example.com
8 | ""#example.com
9 | "\ "#example.com
11 | 123456789.123456789.123456789.123456789.123456789.123456789.1234567890#example.com
12 | ABC#example.com

Regex_replace Postgres - Check if <= 2 characters length

I need a minimum of 3 characters for user accounts. I reuse existing names such as
"tata-fzef - vcefv" or "kk" from the IMP_FR field to create these accounts.
In the second example, "kk" should become "k_k" because it has fewer than 3 characters.
How can I do this in PostgreSQL?
regexp_replace( IMP_FR , regexp, first_character + '_' + last character, 'g')

Regular expressions won't help much here, since REGEXP_REPLACE does not support conditional replacement patterns. You need a different replacement pattern for each case: when the input contains one character, two characters, or three or more.
So, it is better to rely on CASE ... WHEN ... ELSE here and the regular string manipulation functions:
CREATE TABLE tabl1
(s character varying)
;
INSERT INTO tabl1
(s)
VALUES
('tata-fzef - vcefv'),
('kkk'),
('kk'),
('k')
;
SELECT
CASE
WHEN LENGTH(s) = 1 THEN '_' || s || '_'
WHEN LENGTH(s) = 2 THEN LEFT(s,1) || '_' || RIGHT(s,1)
ELSE s
END AS Result
FROM tabl1;
Result for the four sample rows:
 result
-------------------
 tata-fzef - vcefv
 kkk
 k_k
 _k_
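The branching in the CASE expression is simple enough to restate in any language. A minimal Python sketch of the same rule follows (pad_account_name is a hypothetical helper, not part of the original answer):

```python
def pad_account_name(s):
    """Pad names shorter than 3 characters with underscores, as the CASE does."""
    if len(s) == 1:
        return "_" + s + "_"
    if len(s) == 2:
        return s[0] + "_" + s[1]
    return s


for name in ["tata-fzef - vcefv", "kkk", "kk", "k"]:
    print(pad_account_name(name))
```

Like the SQL, this leaves an empty string unchanged; whether that case can occur in IMP_FR is not stated in the question.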

Oracle Substring after specific character

I already found out that I need to use substr/instr or regex, but even after reading the documentation on those I can't get it done.
I am on Oracle 11.2.
Here is what I have.
A list of strings like:
743H5-34L-56
123HD34-7L
12HSS-34R
23Z67-4R-C23
What I need is the number (length 1 or 2) after the first '-', up to the following 'L' or 'R'.
Does anybody have some advice?

regexp_replace(string, '^.*?-(\d+)[LR].*$', '\1')

Another version (without fancy lookarounds :-) :
with v_data as (
select '743H5-34L-56' val from dual
union all
select '123HD34-7L' val from dual
union all
select '12HSS-34R' val from dual
union all
select '23Z67-4R-C23' val from dual
)
select
val,
regexp_replace(val, '^[^-]+-(\d+)[LR].*', '\1')
from v_data
It matches
the beginning of the string "^"
one or more characters that are not a '-' "[^-]+"
followed by a '-' "-"
followed by one or more digits (capturing them in a group) "(\d+)"
followed by 'L' or 'R' "[LR]"
followed by zero or more arbitrary characters ".*"
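The pattern breakdown above can be tried on the four sample strings in Python, where this particular pattern has the same syntax. This is a sketch; number_before_side is an invented name, and unlike regexp_replace (which returns the input unchanged when nothing matches) it returns None for non-matching input.

```python
import re

pattern = re.compile(r"^[^-]+-(\d+)[LR].*")


def number_before_side(val):
    """Return the digits between the first '-' and the following L or R."""
    m = pattern.match(val)
    return m.group(1) if m else None


for val in ["743H5-34L-56", "123HD34-7L", "12HSS-34R", "23Z67-4R-C23"]:
    print(val, number_before_side(val))
```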

Proper way to add unescaped text from a field to a regex in postgres?

What's the proper way to add a literal text value from a field to a regex in postgres?
For example, something like this where some_field could contain invalid regex syntax if left unescaped:
where some_text ~* ('\m' || some_field || '\M');

The easiest thing to do is to use a regex to prep your string to be in a regex. Escaping non-word characters in your string should be sufficient to make it regex-safe, for example:
=> select regexp_replace('. word * and µ{', E'([^\\w\\s])', E'\\\\\\1', 'g');
regexp_replace
--------------------
\. word \* and µ\{
So something like this should work in general:
where some_text ~* (x || regexp_replace(some_field, E'([^\\w\\s])', E'\\\\\\1', 'g') || y)
where x and y are the other parts of the regex.
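The same escape-then-embed technique can be illustrated in Python, which is used here only because it is easy to run; it is not PostgreSQL, and escape_literal is a made-up helper that mirrors the regexp_replace call above. Python also ships re.escape for this purpose.

```python
import re


def escape_literal(s):
    """Backslash-escape every character that is neither a word char nor whitespace."""
    return re.sub(r"([^\w\s])", r"\\\1", s)


print(escape_literal(". word * and \u00b5{"))

# Embedding a field value between other regex parts (here, word boundaries).
field = "1.5"
safe = re.compile(r"\b" + escape_literal(field) + r"\b")
unsafe = re.compile(r"\b" + field + r"\b")

print(safe.search("costs 125 units") is not None)    # False
print(unsafe.search("costs 125 units") is not None)  # True: "." matched the "2"
```

The last line shows why the escaping matters: left unescaped, the dot in the field value is treated as a wildcard.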
If you didn't need a regex at the end (i.e. no y above), then you could use (?q):
An ARE can begin with embedded options: a sequence (?xyz) (where xyz is one or more alphabetic characters) specifies options affecting the rest of the RE.
and a q means that the:
rest of RE is a literal ("quoted") string, all ordinary characters
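So you could use:
where some_text ~* (x || '(?q)' || some_field)
in this limited case. Note that, per the documentation quoted above, embedded options belong at the beginning of the RE, so this only applies when x contributes nothing in front of (?q).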
