Finding and removing Non-ASCII characters from an Oracle Varchar2 - regex

We are currently migrating one of our Oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit.
When we try to migrate these records they fail, as they contain characters that become multibyte UTF8 characters.
What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.
I would like to do:
SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]]','')
but Oracle does not implement the [:ascii:] character class.
Is there a simple way of doing what I want to do?

I think this will do the trick:
SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')

If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...
UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
...where field and table are your field and table names respectively.
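To see what the regex in that UPDATE actually operates on, here is a minimal sketch of the ASCIISTR output. The sample string is made up, and UNISTR is used only so that the example does not depend on the client encoding. One caveat worth checking on your own system: as far as I know, ASCIISTR also escapes a literal backslash (as \005C), so the pattern would remove backslashes from the data as well.

```sql
-- Hypothetical sample value: 'café' built with UNISTR so no non-ASCII
-- character has to appear in the script itself.
SELECT ASCIISTR(UNISTR('caf\00E9')) AS escaped,   -- caf\00E9
       REGEXP_REPLACE(ASCIISTR(UNISTR('caf\00E9')),
                      '\\[[:xdigit:]]{4}', '') AS stripped,   -- caf
       ASCIISTR('a\b') AS backslash_check   -- inspect how the backslash comes back
FROM   dual;
```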

I wouldn't recommend it for production code, but it makes sense and seems to work:
SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')

The select may look like the following sample:
select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)
order by nvalue;

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.
Create a PL/SQL function to receive your input string and return a varchar2.
In the PL/SQL function, do an asciistr() of your input. PL/SQL is used because asciistr() may return a string longer than 4000 bytes, and you have 32K available for varchar2 in PL/SQL.
That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.
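The steps described above could be sketched roughly as follows. The names strip_non_ascii, p_input and l_escaped are invented for illustration, and the sizing comment assumes a single-byte database character set as in the question.

```sql
-- Sketch only: strip_non_ascii, p_input and l_escaped are made-up names.
CREATE OR REPLACE FUNCTION strip_non_ascii (p_input IN VARCHAR2)
  RETURN VARCHAR2
IS
  -- A PL/SQL VARCHAR2 can hold up to 32767 bytes, so the escaped form of a
  -- 4000-byte value fits even though each non-ASCII character grows to a
  -- five-character \xxxx escape.
  l_escaped VARCHAR2(32767);
BEGIN
  l_escaped := ASCIISTR(p_input);
  -- Remove every \xxxx escape that ASCIISTR produced.
  RETURN REGEXP_REPLACE(l_escaped, '\\[[:xdigit:]]{4}', '');
END strip_non_ascii;
/
```

It could then be called like any other function, for example SELECT strip_non_ascii(your_col) FROM your_table, where your_col and your_table are placeholders.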

The following also works:
select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
CONVERT (
'3735844533120%$03  ',
'US7ASCII',
'WE8ISO8859P1'),
'[^!#/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
FROM DUAL);

I had a similar issue and blogged about it here.
I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:
select dump(a,1016), a, b
from
(select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
COLUMN b
from TABLE)
where a is not null
order by a;
I used dump with the 1016 variant to give out the hex characters I wanted to replace, which I could then use in a utl_raw.cast_to_varchar2.
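A sketch of that last step, assuming dump showed the offending byte to be hex a0 (a non-breaking space in Latin-1) in a single-byte database. The byte value is only an example, and your_col and your_table are placeholder names.

```sql
-- Build the one-byte string from its hex code and replace it with a plain space.
SELECT REPLACE(your_col,
               UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('A0')),
               ' ') AS cleaned
FROM   your_table;
```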

I found the answer here:
http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html
CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:='^'||TO_CHAR(INPUT_STR)||'^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :='^'||REPLACE(str, CHR(askey),'');
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),'^'),'^'));
RETURN (OUTPUT_STR);
end;
/
Then run this to update your data
update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);

Try the following:
-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')
-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual

You can try something like following to search for the column containing non-ascii character :
select * from your_table where your_col <> asciistr(your_col);

I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.
I tried this from an excellent comment
'[^ -~|[:space:]]'
but got this ORA-12728: invalid range in regular expression .
but it led me to my solution:
select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from
(select '- <- strangest thing here, and I want to keep line break after
-' deta from dual ) t
displays (in my TOAD tool) as
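That is: replace everything that is not (^) in the sets of printing ([:print:]) or whitespace ([:space:]) characters. Inside the brackets the | is just a literal pipe character, not an "or".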

Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.
REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))
I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.

Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.
Here is the simple test in Oracle 11.2.03
select s
, regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
, dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
from (select 127+level l,chr(127+level) c from dual connect by level < 129))
And "rep 127-255" is
Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
i.e. chr(226) and above are left in place, which is consistent with the range in this test query stopping at chr(225) rather than chr(255).
Using '['||chr(127)||'-'||chr(255)||']' should cover the whole range.
If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different than '' (null string).

Please note that whenever you use
regexp_like(column, '[A-Z]')
Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.
Instead of fiddling with regular expressions, try changing to the NVARCHAR2 datatype prior to the character set upgrade.
Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar-looking ASCII characters, like
å => a
ä => a
ö => o
of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.
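For a handful of characters the built-in TRANSLATE function may already be enough for that mapping. A minimal sketch, where your_col and your_table are placeholders and the character list is only an example; the non-ASCII literals in the statement depend on the client's NLS_LANG matching the encoding of the script.

```sql
-- Each character in the second argument is mapped to the character at the
-- same position in the third argument; everything else passes through unchanged.
SELECT TRANSLATE(your_col, 'åäöÅÄÖ', 'aaoAAO') AS ascii_lookalike
FROM   your_table;
```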

As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:
SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;
This will replace anything outside that printable range with a question mark.
This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.

Do this, it will work.
trim(replace(ntwk_slctor_key_txt, chr(0), ''))

I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is).
The following is a simple character whitelist approach:
SELECT est.clients_ref
,TRANSLATE (
est.clients_ref
, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
|| REPLACE (
TRANSLATE (
est.clients_ref
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
)
,'~'
)
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
)
clean_ref
FROM edms_staging_table est

Related

Oracle regex and replace

I have a varchar field in the database that contains text. I need to replace every occurrence of any 2 letter + 8 digit string with a link, such that VA12345678 will return /cs/page.asp?id=VA12345678
I have a regex that replaces the string but how can I replace it with a string where part of it is the string itself?
SELECT REGEXP_REPLACE ('test PI20099742', '[A-Z]{2}[0-9]{8}$', 'link to replace with')
FROM dual;
I can have more than one of these strings in one varchar field and ideally I would like to have them replaced in one statement instead of a loop.
As mathguy had said, you can use backreferences for your use case. Try a query like this one.
SELECT REGEXP_REPLACE ('test PI20099742', '([A-Z]{2}[0-9]{8})', '/cs/page.asp?id=\1')
FROM DUAL;
For such cases, you may want to keep the "text to add" somewhere at the top of the query, so that if you ever need to change it, you don't have to hunt for it.
You can do that with a with clause, as shown below. I also put some input data for testing in the with clause, but you should remove that and reference your actual table in your query.
I used the [:alpha:] character class, to match all letters - upper or lower case, accented or not, etc. [A-Z] will work until it doesn't.
with
text_to_add (link) as (
select '/cs/page.asp?id=' from dual
)
, sample_strings (str) as (
select 'test VA12398403 and PI83048203 to PT3904' from dual
)
select regexp_replace(str, '([[:alpha:]]{2}\d{8})', link || '\1')
as str_with_links
from sample_strings cross join text_to_add
;
STR_WITH_LINKS
------------------------------------------------------------------------
test /cs/page.asp?id=VA12398403 and /cs/page.asp?id=PI83048203 to PT3904

Oracle SQL String Manipulation

My field contains short codes that I want to access, such as C-COR3.
The issue is some records have additional information (F and H with numbers). An example is C-COR3 F1.54H19, I only care about C-COR3. Anything after "F" I want to ignore.
Code below works, but only if I hard-code the full F1.54H19. I want to use wildcards to abstract this for other occurrences that have F and H info in the field. (Ex C-R3 F0.18H18 -> C-R3 or C-COR3 F0.23H8.5 -> C-COR3), note varying short code string lengths.
/* Translates C-COR3 F1.54H19 to C-COR3. */
select distinct SUBSTR(lud_code_short,1,INSTR(lud_code_short, 'F1.54H19')-2)
from rep_dba.mytable
I've read that SUBSTR does not allow wildcards, but have had no luck trying my hand at REGEXP_INSTR and REGEXP_SUBSTR instead. Any help appreciated.
Assuming that the "code" is always the first continuous sequence of non-space characters (and that there are no leading spaces - if there are, that's easy to handle), you could do something like this. Note the str || ' ' in the call to instr() - that takes care of the case when the input string has no spaces in it to begin with. Also notice the last input - since there are no spaces anywhere, the output is the same as the input. (Showing that if the "code" is not always separated from the "additional information" by at least one space, the solution would not work.)
with
test_data (str) as (
select 'C-COR3 F14H2.5' from dual union all
select 'C-AB3' from dual union all
select null from dual union all
select 'C-AB2F14H2.5' from dual
)
select str, substr(str, 1, instr(str || ' ', ' ') - 1) as code
from test_data
;
STR            CODE
-------------- --------------
C-COR3 F14H2.5 C-COR3
C-AB3          C-AB3
C-AB2F14H2.5   C-AB2F14H2.5
Try using regexp_replace within your query like below
SELECT
regexp_replace('C-COR3 F14H2.5', '(C-[[:alnum:]]+) [FH].*', '\1')
FROM dual;

PL/SQL: Find all cyrillic (or non-latin1) signs via regex

I'm currently trying to figure out a way to output the IDs of all Rows within a table that contain any cyrillic (or non-latin-1) letters, no matter what column they're in
I've inherited a script that uses cursors to iterate through the tables and columns and searches for the Cyrillic signs via a regex statement using unistr(), but I can't figure out why it does not seem to be working anymore on our Oracle 12 DB.
The statement is as follows:
stmt := 'select ID from '||table_name || ' where regexp_LIKE('||table_name||'.'||column_name||','||stmt_template|| ')';
table_name and column_name should be self-explanatory; stmt_template is a template that is defined earlier and contains my problem. 'stmt' is used as follows (and works):
OPEN stmt_cursor for stmt;
LOOP [some code]
The stmt_template is defined as follows and always throws me an error
stmt_template VARCHAR(32767) := '^[''||unistr(''\20AC'')||unistr(''\1EF8'')||''-''||unistr(''\1EF9'')||unistr(''\1EF2'')||''-''||unistr(''\1EF3'')||unistr(''\1EE4'')||''-''||unistr(''\1EE5'')||unistr(''\1ED6'')||''-''||unistr(''\1ED7'')||unistr(''\1ECA'')||''-''||unistr(''\1ECF'')||unistr(''\1EC4'')||''-''||unistr(''\1EC5'')||unistr(''\1EBD'')||unistr(''\1EAA'')||''-''||unistr(''\1EAC'')||unistr(''\1EA0'')||''-''||unistr(''\1EA1'')||unistr(''\1E9E'')||unistr(''\1E9B'')||unistr(''\1E8C'')||''-''||unistr(''\1E93'')||unistr(''\1E80'')||''-''||unistr(''\1E85'')||unistr(''\1E6A'')||''-''||unistr(''\1E6B'')||unistr(''\1E60'')||''-''||unistr(''\1E63'')||unistr(''\1E56'')||''-''||unistr(''\1E57'')||unistr(''\1E44'')||''-''||unistr(''\1E45'')||unistr(''\1E40'')||''-''||unistr(''\1E41'')||unistr(''\1E30'')||''-''||unistr(''\1E31'')||unistr(''\1E24'')||''-''||unistr(''\1E27'')||unistr(''\1E1E'')||''-''||unistr(''\1E21'')||unistr(''\1E10'')||''-''||unistr(''\1E11'')||unistr(''\1E0A'')||''-''||unistr(''\1E0B'')||unistr(''\1E02'')||''-''||unistr(''\1E03'')||unistr(''\0292'')||unistr(''\0259'')||unistr(''\022A'')||''-''||unistr(''\0233'')||unistr(''\01FA'')||''-''||unistr(''\021F'')||unistr(''\01F7'')||unistr(''\01F4'')||''-''||unistr(''\01F5'')||unistr(''\01E2'')||''-''||unistr(''\01EF'')||unistr(''\01DE'')||''-''||unistr(''\01DF'')||unistr(''\01CD'')||''-''||unistr(''\01D4'')||unistr(''\01BF'')||unistr(''\01B7'')||unistr(''\01AF'')||''-''||unistr(''\01b0'')||unistr(''\01A0'')||''-''||unistr(''\01A1'')||unistr(''\018F'')||unistr(''\0187'')||''-''||unistr(''\0188'')||unistr(''\0134'')||''-''||unistr(''\017f'')||unistr(''\00AE'')||''-''||unistr(''\0131'')||unistr(''\00A1'')||''-''||unistr(''\00AC'')||unistr(''\0009'')||unistr(''\000A'')||unistr(''\000D'')||unistr(''\0020'')||''-''||unistr(''\007E'')||'']*$'')';
This is supposed to be searching for a long list of cyrillic letters and other special characters, though it throws me the following:
ORA-00936: missing expression
I've already tried to search for everything not within the ascii table using
stmt_template VARCHAR(32767) :='''[^-~]''';
though this doesn't seem to give me the test-tuples I prepared (using some cyrillic characters as well as a € sign and stuff) but some rows that don't contain any 'illegal' characters
stmt_template VARCHAR(32767) := '''[^.' || CHR (1) || '-' || CHR (255) || ']''';
doesn't work either as it gives me the same as the above
can anyone help me identify my mistake/typo or whatever error there is in the first regex statement?
If you need any more information, please tell me, thx in advance
Your statement evaluates to:
select ID from table_name where regexp_LIKE(table_name.column_name,,'^['||unistr('\20AC')||unistr('\1EF8')||'-'||unistr('\1EF9')||unistr('\1EF2')||'-'||unistr('\1EF3')||unistr('\1EE4')||'-'||unistr('\1EE5')||unistr('\1ED6')||'-'||unistr('\1ED7')||unistr('\1ECA')||'-'||unistr('\1ECF')||unistr('\1EC4')||'-'||unistr('\1EC5')||unistr('\1EBD')||unistr('\1EAA')||'-'||unistr('\1EAC')||unistr('\1EA0')||'-'||unistr('\1EA1')||unistr('\1E9E')||unistr('\1E9B')||unistr('\1E8C')||'-'||unistr('\1E93')||unistr('\1E80')||'-'||unistr('\1E85')||unistr('\1E6A')||'-'||unistr('\1E6B')||unistr('\1E60')||'-'||unistr('\1E63')||unistr('\1E56')||'-'||unistr('\1E57')||unistr('\1E44')||'-'||unistr('\1E45')||unistr('\1E40')||'-'||unistr('\1E41')||unistr('\1E30')||'-'||unistr('\1E31')||unistr('\1E24')||'-'||unistr('\1E27')||unistr('\1E1E')||'-'||unistr('\1E21')||unistr('\1E10')||'-'||unistr('\1E11')||unistr('\1E0A')||'-'||unistr('\1E0B')||unistr('\1E02')||'-'||unistr('\1E03')||unistr('\0292')||unistr('\0259')||unistr('\022A')||'-'||unistr('\0233')||unistr('\01FA')||'-'||unistr('\021F')||unistr('\01F7')||unistr('\01F4')||'-'||unistr('\01F5')||unistr('\01E2')||'-'||unistr('\01EF')||unistr('\01DE')||'-'||unistr('\01DF')||unistr('\01CD')||'-'||unistr('\01D4')||unistr('\01BF')||unistr('\01B7')||unistr('\01AF')||'-'||unistr('\01b0')||unistr('\01A0')||'-'||unistr('\01A1')||unistr('\018F')||unistr('\0187')||'-'||unistr('\0188')||unistr('\0134')||'-'||unistr('\017f')||unistr('\00AE')||'-'||unistr('\0131')||unistr('\00A1')||'-'||unistr('\00AC')||unistr('\0009')||unistr('\000A')||unistr('\000D')||unistr('\0020')||'-'||unistr('\007E')||']*$'))
Which, with the guts of the regular expression removed looks like:
REGEXP_LIKE(table_name.column_name,,'your regex...'))
You need to remove the duplicate comma from the start of the regular expression string and the duplicate closing round bracket from the end.
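One way to catch this kind of problem early is to print the generated statement before opening a cursor on it. The block below is only a sketch: the template is shortened to a few of the original ranges, my_table and my_column stand in for table_name and column_name, and q'{...}' quoting is used so the single quotes do not have to be doubled. It needs SET SERVEROUTPUT ON (or the equivalent in your tool) to show the output.

```sql
DECLARE
  -- Shortened, hypothetical template; the real one lists many more ranges.
  stmt_template VARCHAR2(32767) :=
    q'{'^['||unistr('\20AC')||unistr('\0020')||'-'||unistr('\007E')||']*$'}';
  stmt          VARCHAR2(32767);
BEGIN
  -- Exactly one comma before the template and one closing bracket after it.
  stmt := 'select ID from my_table where regexp_LIKE(my_table.my_column,'
          || stmt_template || ')';
  -- Inspect the final SQL text before running it.
  DBMS_OUTPUT.PUT_LINE(stmt);
END;
/
```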
Change your definition of stmt_template to
stmt_template VARCHAR(32767) := '^[''''||unistr(''\20AC'')||unistr(''\1EF8'')||''-''||
unistr(''\1EF9'')||unistr(''\1EF2'')||''-''||
unistr(''\1EF3'')||unistr(''\1EE4'')||''-''||
unistr(''\1EE5'')||unistr(''\1ED6'')||''-''||
unistr(''\1ED7'')||unistr(''\1ECA'')||''-''||
unistr(''\1ECF'')||unistr(''\1EC4'')||''-''||
unistr(''\1EC5'')||unistr(''\1EBD'')||unistr(''\1EAA'')||''-''||
unistr(''\1EAC'')||unistr(''\1EA0'')||''-''||
unistr(''\1EA1'')||unistr(''\1E9E'')||unistr(''\1E9B'')||unistr(''\1E8C'')||''-''||
unistr(''\1E93'')||unistr(''\1E80'')||''-''||
unistr(''\1E85'')||unistr(''\1E6A'')||''-''||
unistr(''\1E6B'')||unistr(''\1E60'')||''-''||
unistr(''\1E63'')||unistr(''\1E56'')||''-''||
unistr(''\1E57'')||unistr(''\1E44'')||''-''||
unistr(''\1E45'')||unistr(''\1E40'')||''-''||
unistr(''\1E41'')||unistr(''\1E30'')||''-''||
unistr(''\1E31'')||unistr(''\1E24'')||''-''||
unistr(''\1E27'')||unistr(''\1E1E'')||''-''||
unistr(''\1E21'')||unistr(''\1E10'')||''-''||
unistr(''\1E11'')||unistr(''\1E0A'')||''-''||
unistr(''\1E0B'')||unistr(''\1E02'')||''-''||
unistr(''\1E03'')||unistr(''\0292'')||unistr(''\0259'')||unistr(''\022A'')||''-''||
unistr(''\0233'')||unistr(''\01FA'')||''-''||
unistr(''\021F'')||unistr(''\01F7'')||unistr(''\01F4'')||''-''||
unistr(''\01F5'')||unistr(''\01E2'')||''-''||
unistr(''\01EF'')||unistr(''\01DE'')||''-''||
unistr(''\01DF'')||unistr(''\01CD'')||''-''||
unistr(''\01D4'')||unistr(''\01BF'')||unistr(''\01B7'')||unistr(''\01AF'')||''-''||
unistr(''\01b0'')||unistr(''\01A0'')||''-''||
unistr(''\01A1'')||unistr(''\018F'')||unistr(''\0187'')||''-''||
unistr(''\0188'')||unistr(''\0134'')||''-''||
unistr(''\017f'')||unistr(''\00AE'')||''-''||
unistr(''\0131'')||unistr(''\00A1'')||''-''||
unistr(''\00AC'')||unistr(''\0009'')||unistr(''\000A'')||unistr(''\000D'')||unistr(''\0020'')||''-''||
unistr(''\007E'')||'''']*$'')';
It appears that the original definition left an unbalanced single-quote at the beginning and end of the string. I'm still not certain that will work as there appears to be an unmatched right-parenthesis at the very end of the string but it might be better.
Best of luck.
This should give you data that isn't within the ascii-7 range chr(32) - chr(127):
select col1
from my_table
where regexp_like(col1, '[^'||chr(32)||'-'||chr(127)||']')
Note that I'm excluding control characters (less than dec 32) and extended ascii (> 127) in my range.

Extract numbers from a field in PostgreSQL

I have a table with a column po_number of type varchar in Postgres 8.4. It stores alphanumeric values with some special characters. I want to ignore the characters [/alpha/?/$/encoding/.] and check if the column contains a number or not. If it's a number then it needs to be typecast as a number, or else pass null, as my output field po_number_new is a number field.
Below is the example:
SQL Fiddle.
I tried this statement:
select
(case when regexp_replace(po_number,'[^\w],.-+\?/','') then po_number::numeric
else null
end) as po_number_new from test
But I got an error for explicit cast:
Simply:
SELECT NULLIF(regexp_replace(po_number, '\D','','g'), '')::numeric AS result
FROM tbl;
\D being the class shorthand for "not a digit".
And you need the 4th parameter 'g' (for "globally") to replace all occurrences.
Details in the manual.
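A small illustration of the difference the 'g' flag makes, using a made-up sample value. It assumes standard_conforming_strings is on (the default since PostgreSQL 9.1); on 8.4 with the old default the pattern would have to be written as E'\\D'.

```sql
-- 'PO-12/34' is a hypothetical sample value.
SELECT regexp_replace('PO-12/34', '\D', '')      AS first_match_only,  -- O-12/34
       regexp_replace('PO-12/34', '\D', '', 'g') AS all_matches;       -- 1234
```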
For a known, limited set of characters to replace, plain string manipulation functions like replace() or translate() are substantially cheaper. Regular expressions are just more versatile, and we want to eliminate everything but digits in this case. Related:
Regex remove all occurrences of multiple characters in a string
PostgreSQL SELECT only alpha characters on a row
Is there a regexp_replace equivalent for postgresql 7.4?
But why Postgres 8.4? Consider upgrading to a modern version.
Consider pitfalls for outdated versions:
Order varchar string as numeric
WARNING: nonstandard use of escape in a string literal
I think you want something like this:
select (case when regexp_replace(po_number, '[^\w],.-+\?/', '') ~ '^[0-9]+$'
then regexp_replace(po_number, '[^\w],.-+\?/', '')::numeric
end) as po_number_new
from test;
That is, you need to do the conversion on the string after replacement.
Note: This assumes that the "number" is just a string of digits.
The logic I would use to determine if the po_number field contains numeric digits is that its length should decrease when attempting to remove numeric digits.
If so, then all non numeric digits ([^\d]) should be removed from the po_number column. Otherwise, NULL should be returned.
select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number)
then regexp_replace(po_number, '[^0-9]', '', 'g')
else null
end as po_number_new
from test
If you want to extract floating numbers try to use this:
SELECT NULLIF(regexp_replace(po_number, '[^\.\d]','','g'), '')::numeric AS result FROM tbl;
It's the same as Erwin Brandstetter answer but with different expression:
[^...] - match any character except a list of excluded characters, put the excluded characters instead of ...
\. - point character (also you can change it to , char)
\d - digit character
Since version 12 - that's 2 years + 4 months ago at the time of writing (but after the last edit that I can see on the accepted answer), you could use a GENERATED FIELD to do this quite easily on a one-time basis rather than having to calculate it each time you wish to SELECT a new po_number.
Furthermore, you can use the TRANSLATE function to extract your digits which is less expensive than the REGEXP_REPLACE solution proposed by @ErwinBrandstetter!
I would do this as follows (all of the code below is available on the fiddle here):
CREATE TABLE s
(
num TEXT,
new_num INTEGER GENERATED ALWAYS AS
(NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER) STORED
);
You can add to the 'ABCDEFG... string in the TRANSLATE function as appropriate - I have decimal point (.) and a space ( ) at the end - you may wish to have more characters there depending on your input!
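Table t, which is used for comparison in the checks and timings that follow, is not shown in this post. Presumably it is the same table with the REGEXP_REPLACE expression in place of TRANSLATE; the definition below is a guess along those lines, not the original from the fiddle.

```sql
-- Assumed definition of the comparison table t (not taken from the source).
CREATE TABLE t
(
  num TEXT,
  new_num INTEGER GENERATED ALWAYS AS
    (NULLIF(REGEXP_REPLACE(num, '\D', '', 'g'), '')::INTEGER) STORED
);
```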
And checking:
INSERT INTO s VALUES ('2'), (''), (NULL), (' ');
INSERT INTO t VALUES ('2'), (''), (NULL), (' ');
SELECT * FROM s;
SELECT * FROM t;
Result (same for both):
num new_num
2 2
NULL
NULL
NULL
So, I wanted to check how efficient my solution was, so I ran the following test inserting 10,000 records into both tables s and t as follows (from here):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
INSERT INTO t
with symbols(characters) as
(
VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
)
select string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '')
from symbols
join generate_series(1,10) as word(chr_idx) on 1 = 1 -- word length
join generate_series(1,10000) as words(idx) on 1 = 1 -- # of words
group by idx;
The differences weren't that huge but the regex solution was consistently slower by about 25% - even changing the order of the tables undergoing the INSERTs.
However, where the TRANSLATE solution really shines is when doing a "raw" SELECT as follows:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER
FROM s;
and the same for the REGEXP_REPLACE solution.
The differences were very marked, the TRANSLATE taking approx. 25% of the time of the other function. Finally, in the interests of fairness, I also did this for both tables:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
num, new_num
FROM t;
Both extremely quick and identical!

Oracle regex to find the special character in name field

I'm trying to filter out the names which have special characters.
Requirement:
1) Filter the names which have characters other than a-zA-Z , space and forward slash(/).
Regex being tried out:
1) regexp_like (customername,'[^a-zA-Z[:space:]\/]'))
2) regexp_like (customername,'[^a-zA-Z \/]'))
The above two regexes help in finding the names with special characters like ? and dot(.)
For example:
LEAL/JO?O
FRANCO/DIVALDO Sr.
But I couldn't figure out why some names(listed below) with the allowed characters(a-zA-Z , space and forward slash(/)) also get retrieved.
For example:
ESTEVES/MARIA INES
PEREZ/JOSE
DUTRA SILVA/LIGIA
Please help to figure out the mistake in the regex being used.
Many thanks in advance!
Your regex #1 worked for me on 11g with the name data copied/pasted from this page. I wonder if you have non-printable control characters in the data? Try adding [:cntrl:] to the regex to catch control characters. P.S. the backslash is not needed before the slash when inside of a character class (square brackets).
SQL> with tbl(name) as (
select 'LEAL/JO?O' from dual union
select 'FRANCO/DIVALDO Sr.' from dual union
select 'ESTEVES/MARIA INES' from dual union
select 'PEREZ/JOSE' from dual union
select 'DUTRA SILVA/LIGIA' from dual
)
select *
from tbl
where regexp_like(name, '[^a-zA-Z[:space:][:cntrl:]/]');
NAME
------------------
FRANCO/DIVALDO Sr.
LEAL/JO?O
SQL>
If you can copy/paste this, run it and get the same results, then something is up with the data in your table. Have a look at the data in HEX which will bring to light a previously hidden character perhaps. Here's a simple example which shows the name "JOSE" in HEX. Using one of the numerous ASCII charts out there like http://www.asciitable.com/ you can see there are no hidden characters:
SQL> select 'JOSE' as chr, rawtohex('JOSE') as hex from dual;
CHR HEX
---- --------
JOSE 4A4F5345
SQL>
So, have a look at a name or two and see if you have any hidden characters. If not, I suspect a conflicting characterset issue maybe.
@gary_w has most of the bases well covered....
Here's my sql version of unix: cat -vet MyFile
select replace(regexp_replace(my_column,'[^[:print:]]', '!ACK!'),' ','.') as CAT_VET
from my_table
... all the non-printing characters become !ACK! and spaces become . You still need to determine what the characters actually ARE, but it's useful to find the looney-toon characters in your data.
Also, select dump(my_column) ... is another way to view the raw column values.
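For example, a sketch using the same placeholder names as above: format 16 prints each byte in hex, and adding 1000 to the format also reports the character set of the value.

```sql
-- my_column and my_table are the placeholder names used above.
SELECT my_column,
       DUMP(my_column, 1016) AS raw_bytes
FROM   my_table;
```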