PL/SQL: Find all cyrillic (or non-latin1) signs via regex

I'm trying to find a way to output the IDs of all rows in a table that contain any Cyrillic (or non-Latin-1) letters, no matter which column they are in.
I've inherited a script that uses cursors to iterate through the tables and columns and searches for the Cyrillic characters via a regex built with unistr(), but I can't figure out why it no longer seems to work on our Oracle 12 DB.
The statement is as follows:
stmt := 'select ID from '||table_name || ' where regexp_LIKE('||table_name||'.'||column_name||','||stmt_template|| ')';
table_name and column_name should be self-explanatory; stmt_template is a template that is defined earlier and contains my problem. 'stmt' is used as follows (and works):
OPEN stmt_cursor for stmt;
LOOP [some code]
The stmt_template is defined as follows and always throws an error:
stmt_template VARCHAR(32767) := '^[''||unistr(''\20AC'')||unistr(''\1EF8'')||''-''||unistr(''\1EF9'')||unistr(''\1EF2'')||''-''||unistr(''\1EF3'')||unistr(''\1EE4'')||''-''||unistr(''\1EE5'')||unistr(''\1ED6'')||''-''||unistr(''\1ED7'')||unistr(''\1ECA'')||''-''||unistr(''\1ECF'')||unistr(''\1EC4'')||''-''||unistr(''\1EC5'')||unistr(''\1EBD'')||unistr(''\1EAA'')||''-''||unistr(''\1EAC'')||unistr(''\1EA0'')||''-''||unistr(''\1EA1'')||unistr(''\1E9E'')||unistr(''\1E9B'')||unistr(''\1E8C'')||''-''||unistr(''\1E93'')||unistr(''\1E80'')||''-''||unistr(''\1E85'')||unistr(''\1E6A'')||''-''||unistr(''\1E6B'')||unistr(''\1E60'')||''-''||unistr(''\1E63'')||unistr(''\1E56'')||''-''||unistr(''\1E57'')||unistr(''\1E44'')||''-''||unistr(''\1E45'')||unistr(''\1E40'')||''-''||unistr(''\1E41'')||unistr(''\1E30'')||''-''||unistr(''\1E31'')||unistr(''\1E24'')||''-''||unistr(''\1E27'')||unistr(''\1E1E'')||''-''||unistr(''\1E21'')||unistr(''\1E10'')||''-''||unistr(''\1E11'')||unistr(''\1E0A'')||''-''||unistr(''\1E0B'')||unistr(''\1E02'')||''-''||unistr(''\1E03'')||unistr(''\0292'')||unistr(''\0259'')||unistr(''\022A'')||''-''||unistr(''\0233'')||unistr(''\01FA'')||''-''||unistr(''\021F'')||unistr(''\01F7'')||unistr(''\01F4'')||''-''||unistr(''\01F5'')||unistr(''\01E2'')||''-''||unistr(''\01EF'')||unistr(''\01DE'')||''-''||unistr(''\01DF'')||unistr(''\01CD'')||''-''||unistr(''\01D4'')||unistr(''\01BF'')||unistr(''\01B7'')||unistr(''\01AF'')||''-''||unistr(''\01b0'')||unistr(''\01A0'')||''-''||unistr(''\01A1'')||unistr(''\018F'')||unistr(''\0187'')||''-''||unistr(''\0188'')||unistr(''\0134'')||''-''||unistr(''\017f'')||unistr(''\00AE'')||''-''||unistr(''\0131'')||unistr(''\00A1'')||''-''||unistr(''\00AC'')||unistr(''\0009'')||unistr(''\000A'')||unistr(''\000D'')||unistr(''\0020'')||''-''||unistr(''\007E'')||'']*$'')';
This is supposed to search for a long list of Cyrillic letters and other special characters, but it throws the following:
ORA-00936: missing expression
I've already tried to search for everything outside the ASCII table using
stmt_template VARCHAR(32767) :='''[^-~]''';
but this doesn't return the test tuples I prepared (using some Cyrillic characters as well as a € sign and so on); instead it returns some rows that don't contain any 'illegal' characters.
stmt_template VARCHAR(32767) := '''[^.' || CHR (1) || '-' || CHR (255) || ']''';
doesn't work either, as it gives me the same result as the above.
Can anyone help me identify my mistake, typo or whatever error there is in the first regex statement?
If you need any more information, please tell me. Thanks in advance.

Your statement evaluates to:
select ID from table_name where regexp_LIKE(table_name.column_name,,'^['||unistr('\20AC')||unistr('\1EF8')||'-'||unistr('\1EF9')||unistr('\1EF2')||'-'||unistr('\1EF3')||unistr('\1EE4')||'-'||unistr('\1EE5')||unistr('\1ED6')||'-'||unistr('\1ED7')||unistr('\1ECA')||'-'||unistr('\1ECF')||unistr('\1EC4')||'-'||unistr('\1EC5')||unistr('\1EBD')||unistr('\1EAA')||'-'||unistr('\1EAC')||unistr('\1EA0')||'-'||unistr('\1EA1')||unistr('\1E9E')||unistr('\1E9B')||unistr('\1E8C')||'-'||unistr('\1E93')||unistr('\1E80')||'-'||unistr('\1E85')||unistr('\1E6A')||'-'||unistr('\1E6B')||unistr('\1E60')||'-'||unistr('\1E63')||unistr('\1E56')||'-'||unistr('\1E57')||unistr('\1E44')||'-'||unistr('\1E45')||unistr('\1E40')||'-'||unistr('\1E41')||unistr('\1E30')||'-'||unistr('\1E31')||unistr('\1E24')||'-'||unistr('\1E27')||unistr('\1E1E')||'-'||unistr('\1E21')||unistr('\1E10')||'-'||unistr('\1E11')||unistr('\1E0A')||'-'||unistr('\1E0B')||unistr('\1E02')||'-'||unistr('\1E03')||unistr('\0292')||unistr('\0259')||unistr('\022A')||'-'||unistr('\0233')||unistr('\01FA')||'-'||unistr('\021F')||unistr('\01F7')||unistr('\01F4')||'-'||unistr('\01F5')||unistr('\01E2')||'-'||unistr('\01EF')||unistr('\01DE')||'-'||unistr('\01DF')||unistr('\01CD')||'-'||unistr('\01D4')||unistr('\01BF')||unistr('\01B7')||unistr('\01AF')||'-'||unistr('\01b0')||unistr('\01A0')||'-'||unistr('\01A1')||unistr('\018F')||unistr('\0187')||'-'||unistr('\0188')||unistr('\0134')||'-'||unistr('\017f')||unistr('\00AE')||'-'||unistr('\0131')||unistr('\00A1')||'-'||unistr('\00AC')||unistr('\0009')||unistr('\000A')||unistr('\000D')||unistr('\0020')||'-'||unistr('\007E')||']*$'))
Which, with the guts of the regular expression removed, looks like:
REGEXP_LIKE(table_name.column_name,,'your regex...'))
You need to remove the duplicate comma from the start of the regular expression string and the duplicate closing round bracket from the end.
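To make the fix concrete, here is a minimal sketch of how the pieces could be assembled so that the generated statement has exactly one comma and balanced parentheses. The table and column names (my_table, my_col) and the shortened character list are placeholders, not the original script's values, and the NOT is my assumption that the goal is to list rows containing characters outside the allowed set:

```sql
-- Sketch only: my_table, my_col and the two-character range are placeholders.
-- Run with SERVEROUTPUT enabled to see the generated statement.
DECLARE
  -- The template holds only the SQL text of the pattern expression:
  -- no leading comma and no trailing parenthesis.
  stmt_template VARCHAR2(32767) :=
    '''^['' || unistr(''\0020'') || ''-'' || unistr(''\007E'') || '']*$''';
  stmt          VARCHAR2(32767);
BEGIN
  -- The comma and the closing parenthesis are added here, once.
  stmt := 'select ID from my_table where not regexp_like(my_table.my_col, '
          || stmt_template || ')';
  DBMS_OUTPUT.PUT_LINE(stmt);
END;
/
```

Printing the statement before opening the cursor makes a stray comma, quote or parenthesis easy to spot.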

Change your definition of stmt_template to
stmt_template VARCHAR(32767) := '^[''''||unistr(''\20AC'')||unistr(''\1EF8'')||''-''||
unistr(''\1EF9'')||unistr(''\1EF2'')||''-''||
unistr(''\1EF3'')||unistr(''\1EE4'')||''-''||
unistr(''\1EE5'')||unistr(''\1ED6'')||''-''||
unistr(''\1ED7'')||unistr(''\1ECA'')||''-''||
unistr(''\1ECF'')||unistr(''\1EC4'')||''-''||
unistr(''\1EC5'')||unistr(''\1EBD'')||unistr(''\1EAA'')||''-''||
unistr(''\1EAC'')||unistr(''\1EA0'')||''-''||
unistr(''\1EA1'')||unistr(''\1E9E'')||unistr(''\1E9B'')||unistr(''\1E8C'')||''-''||
unistr(''\1E93'')||unistr(''\1E80'')||''-''||
unistr(''\1E85'')||unistr(''\1E6A'')||''-''||
unistr(''\1E6B'')||unistr(''\1E60'')||''-''||
unistr(''\1E63'')||unistr(''\1E56'')||''-''||
unistr(''\1E57'')||unistr(''\1E44'')||''-''||
unistr(''\1E45'')||unistr(''\1E40'')||''-''||
unistr(''\1E41'')||unistr(''\1E30'')||''-''||
unistr(''\1E31'')||unistr(''\1E24'')||''-''||
unistr(''\1E27'')||unistr(''\1E1E'')||''-''||
unistr(''\1E21'')||unistr(''\1E10'')||''-''||
unistr(''\1E11'')||unistr(''\1E0A'')||''-''||
unistr(''\1E0B'')||unistr(''\1E02'')||''-''||
unistr(''\1E03'')||unistr(''\0292'')||unistr(''\0259'')||unistr(''\022A'')||''-''||
unistr(''\0233'')||unistr(''\01FA'')||''-''||
unistr(''\021F'')||unistr(''\01F7'')||unistr(''\01F4'')||''-''||
unistr(''\01F5'')||unistr(''\01E2'')||''-''||
unistr(''\01EF'')||unistr(''\01DE'')||''-''||
unistr(''\01DF'')||unistr(''\01CD'')||''-''||
unistr(''\01D4'')||unistr(''\01BF'')||unistr(''\01B7'')||unistr(''\01AF'')||''-''||
unistr(''\01b0'')||unistr(''\01A0'')||''-''||
unistr(''\01A1'')||unistr(''\018F'')||unistr(''\0187'')||''-''||
unistr(''\0188'')||unistr(''\0134'')||''-''||
unistr(''\017f'')||unistr(''\00AE'')||''-''||
unistr(''\0131'')||unistr(''\00A1'')||''-''||
unistr(''\00AC'')||unistr(''\0009'')||unistr(''\000A'')||unistr(''\000D'')||unistr(''\0020'')||''-''||
unistr(''\007E'')||'''']*$'')';
It appears that the original definition left an unbalanced single-quote at the beginning and end of the string. I'm still not certain that will work as there appears to be an unmatched right-parenthesis at the very end of the string but it might be better.
Best of luck.

This should give you data that isn't within the ascii-7 range chr(32) - chr(127):
select col1
from my_table
where regexp_like(col1, '[^'||chr(32)||'-'||chr(127)||']')
Note that I'm excluding control characters (less than dec 32) and extended ascii (> 127) in my range.
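A quick way to check a pattern like this is to run it against a few hand-made rows. The sketch below uses an inline my_table with hypothetical test values built with UNISTR. How ranges inside brackets are evaluated can depend on the session's NLS_SORT setting, so treat the result as something to verify on your own instance:

```sql
-- Hypothetical test rows; UNISTR keeps the script itself pure ASCII.
WITH my_table (id, col1) AS (
  SELECT 1, UNISTR('plain ascii text') FROM dual UNION ALL
  SELECT 2, UNISTR('price 5 \20AC')    FROM dual UNION ALL  -- euro sign
  SELECT 3, UNISTR('\0416\0443\043A')  FROM dual            -- Cyrillic letters
)
SELECT id, col1
FROM   my_table
WHERE  regexp_like(col1, '[^' || chr(32) || '-' || chr(127) || ']');
```

If the range behaves as intended, only the rows with the euro sign and the Cyrillic letters should be returned.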

Related

Having difficulty in pattern matching Postal Codes for an oracle regexp_like command

The Problem:
All I'm trying to do is come up with a pattern matching string for my regular expression that lets me select Canadian postal codes in this format: 'A1A-2B2' (for example).
The types of data I am trying to insert:
Insert Into Table
(Table_Number, Person_Name, EMail_Address, Street_Address, City, Province, Postal_Code, Hire_Date)
Values
(87, 'Tommy', 'mobster@gmail.com', '123 Street', 'location', 'ZY', 'T4X-1S2', To_Date('30-Aug-2020 08:50:56', 'DD-Mon-YYYY HH24:MI:SS'));
This is a slightly modified/generic version to protect some of the data. All of the other columns are accepted without complaint, but the postal code is rejected when I run a load data script.
The Column & Constraint in question:
Postal_Code varchar2(7) Constraint Table_Postal_Code Null
Constraint CK_Postal_Code Check ((Regexp_like (Postal_Code, '^\[[:upper:]]{1}[[:digit:]]{1}[[:upper:]][[:punct:]]{1}[[:digit:]]{1}[[:upper:]](1}[[:digit:]]{1}$')),
My logic here: following the regular expression documentation:
I have:
an open quote
a caret (^) to indicate the start of the string
a backslash (I think to interpret a string literal)
1 uppercase letter, 1 digit, 1 uppercase letter, 1 [:punct:] to account for the hyphen, 1 digit, 1 uppercase letter, 1 digit
$ to indicate end of string
Close quote
In my mind, something like this should work, it accounts for every single letter/character and the ranges they have to be in. But something is off regarding my formatting of this pattern matching string.
The error I get is:
ORA-02290: check constraint (user.CK_POSTAL_CODE) violated
(slightly modified once more to protect my identity)
Which tells me that the insert statement is tripping my check constraint, and that's about it. So it's an issue with the condition of the constraint itself, i.e. the string I'm using to match. My instructor has told me that the insert data is valid and doesn't need any fix-up, so I'm at a loss.
Limits/Rules: The hyphen has to be there/matched, to my understanding of the problem. The letters are all uppercase in the dataset, so I don't have to worry about lowercase for this example.
I have tried countless variations of this regexp statement to see if anything at all would work, including:
changing all those uppers to :alpha:, then using 'i' to skip case sensitivity for the time being
removing the {1} in case that was redundant
using \- (backslash hyphen), to turn it into a string literal maybe
using only the hyphen by itself
even removing regexp altogether and trying a LIKE [A-Z][0-9][A-Z]-[0-9][A-Z][0-9] etc
keeping the uppers , turning :digit:'s to [0-9] to see if that would maybe work
The only logical explanation I can think of now is that the check constraint is actually working fine and trips when the data matches my pattern, because I didn't write it clearly enough to say "IGNORE these cases and only trip if the data doesn't meet these conditions".
But I'm at my wits' end and asking here as a last resort. I wouldn't if I thought I could find my mistake eventually, but I have probably tried everything I can think of. I'm sure it's some tiny formatting rule I just can't see (I can feel it). Thank you kindly to anyone who knows how to format a pattern matching string like this properly.

It looks like you may have been overcomplicating the regex a bit. The regex below matches your description, based on the first set of bullets you laid out:
REGEXP_LIKE (postal_code, '^[A-Z]\d[A-Z]-\d[A-Z]\d$')

I see two problems with that regexp.
Firstly, you have a spurious \ at the start. It serves you no purpose, get rid of it.
Secondly, the second-to-last {1} appears in your code with mismatched brackets as (1}. I get the error ORA-12725: unmatched parentheses in regular expression because of this.
To be honest, you don't need the {1}s at all: they just tell the regular expression that you want one of the previous item, which is exactly what you'd get without them.
So you can fix the regexp in your constraint by getting rid of the \ and removing the {1}s, including the one with mismatched parentheses.
Here's a demo of the fixed constraint in action:
SQL> CREATE TABLE postal_code_test (
2 Postal_Code varchar2(7) Constraint Table_Postal_Code Null
3 Constraint CK_Postal_Code Check ((Regexp_like (Postal_Code, '^[[:upper:]][[:digit:]][[:upper:]][[:punct:]][[:digit:]][[:upper:]][[:digit:]]$'))));
Table created.
SQL> INSERT INTO postal_code_test (postal_code) VALUES ('T4X-1S2');
1 row created.
SQL> INSERT INTO postal_code_test (postal_code) VALUES ('invalid');
INSERT INTO postal_code_test (postal_code) VALUES ('invalid')
*
ERROR at line 1:
ORA-02290: check constraint (user.CK_POSTAL_CODE) violated

You do not need the backslash and you have (1} instead of {1}.
You can simplify the expression to:
Postal_Code varchar2(7)
Constraint Table_Postal_Code Null
Constraint CK_Postal_Code Check (
REGEXP_LIKE(Postal_Code, '^[A-Z]\d[A-Z][[:punct:]]\d[A-Z]\d$')
)
or:
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[A-Z][0-9][A-Z][[:punct:]][0-9][A-Z][0-9]$'
)
)
or:
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[[:upper:]][[:digit:]][[:upper:]][[:punct:]][[:digit:]][[:upper:]][[:digit:]]$'
)
)
or (although the {1} syntax is redundant here):
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[[:upper:]]{1}[[:digit:]]{1}[[:upper:]]{1}[[:punct:]]{1}[[:digit:]]{1}[[:upper:]]{1}[[:digit:]]{1}$'
)
)
Regarding "removing regexp altogether and trying a LIKE [A-Z][0-9][A-Z]-[0-9][A-Z][0-9] etc":
That will not work, as the LIKE operator does not match regular expression patterns.
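A small side-by-side check illustrates the difference. In Oracle's LIKE only % and _ are wildcards, so the bracket expressions are compared as literal text, while REGEXP_LIKE interprets them as character classes. The sample value is the one from the question:

```sql
-- Same pattern text, once through LIKE and once through REGEXP_LIKE.
SELECT CASE WHEN 'T4X-1S2' LIKE '[A-Z][0-9][A-Z]-[0-9][A-Z][0-9]'
            THEN 'match' ELSE 'no match' END AS like_result,
       CASE WHEN REGEXP_LIKE('T4X-1S2', '^[A-Z][0-9][A-Z]-[0-9][A-Z][0-9]$')
            THEN 'match' ELSE 'no match' END AS regexp_result
FROM dual;
```

The first column should come back as no match and the second as match.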

Is it possible to only get non alphabetic, English alphabet, and non numeric from a column DB2

Sometimes I get Ÿ (hex C5B8: 2 bytes, 1 character) in my database. I have a script that processes a lot of data and cannot read that character; since it doesn't know what to do with it, it stops the whole process, and I have to go into my logs to see where the error is so that I can restart the whole process.
I want to execute a query that only gives me characters that are not in the English alphabet, so that I can see if they should be changed.
I tried to only look for UTF8 characters, but Ÿ is a UTF8 character, so I need another approach.
Words containing anything other than:
A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z
and numbers
0-1-2-3-4-5-6-7-8-9
excluding alphanumeric (in case someone writes an address like this)
h3ll0
I was thinking of something like this:
SELECT * FROM myTable WHERE myCol != (/^[A-Za-z]+$/)
something like that, where I only get values that have characters which do not belong to the English alphabet or the numbers 0-9

I'm not sure I understood you correctly. Basically, you want to find all rows where the column contains characters that are not in the English alphabet? If so, this might work:
SELECT * FROM `myTable` WHERE `myCol` REGEXP '[^A-Za-z0-9]'
EDIT: This answer was written for the question's old tag, which was "mySQL"; you've since changed it to db2. I've tried modifying it for Db2 11, but it's at best an educated guess:
SELECT * FROM myTable WHERE REGEXP_LIKE(myCol, '[^A-Za-z0-9]')

Check out the TRANSLATE function (see the documentation).
Translate all regular characters and numbers to an empty string, like:
select translate(mycol, '', 'ABCDEFGabcdefghi1234567890')
from mytable
This is not the complete solution, but you should get the idea. This works in DB2 LUW and is available for i series.
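To turn this idea into a filter, one possible sketch (assuming Db2 LUW, a table mytable with a character column mycol, and that blanks count as allowed) is to translate every allowed character to a blank and keep the rows where something other than blanks is left:

```sql
-- Sketch for Db2 LUW; mytable and mycol are placeholder names.
-- Every character listed in the third argument is replaced by a blank,
-- so anything that survives TRIM is outside A-Z, a-z and 0-9.
SELECT mycol
FROM mytable
WHERE LENGTH(TRIM(TRANSLATE(
        mycol,
        ' ',
        'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
      ))) > 0
```

Rows where mycol is NULL are not returned by this predicate, which is usually what is wanted for this kind of check.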

Oracle regex to find the special character in name field

I'm trying to filter out the names which have special characters.
Requirement:
1) Filter the names which have characters other than a-zA-Z , space and forward slash(/).
Regex being tried out:
1) regexp_like (customername,'[^a-zA-Z[:space:]\/]'))
2) regexp_like (customername,'[^a-zA-Z \/]'))
The above two regexes help in finding the names with special characters like ? and dot (.)
For example:
LEAL/JO?O
FRANCO/DIVALDO Sr.
But I couldn't figure out why some names (listed below) containing only the allowed characters (a-zA-Z, space and forward slash (/)) also get retrieved.
For example:
ESTEVES/MARIA INES
PEREZ/JOSE
DUTRA SILVA/LIGIA
Please help to figure out the mistake in the regex being used.
Many thanks in advance!

Your regex #1 worked for me on 11g with the name data copied/pasted from this page. I wonder if you have non-printable control characters in the data? Try adding [:cntrl:] to the regex to catch control characters. P.S. the backslash is not needed before the slash when inside of a character class (square brackets).
SQL> with tbl(name) as (
select 'LEAL/JO?O' from dual union
select 'FRANCO/DIVALDO Sr.' from dual union
select 'ESTEVES/MARIA INES' from dual union
select 'PEREZ/JOSE' from dual union
select 'DUTRA SILVA/LIGIA' from dual
)
select *
from tbl
where regexp_like(name, '[^a-zA-Z[:space:][:cntrl:]/]');
NAME
------------------
FRANCO/DIVALDO Sr.
LEAL/JO?O
SQL>
If you can copy/paste this, run it and get the same results, then something is up with the data in your table. Have a look at the data in HEX which will bring to light a previously hidden character perhaps. Here's a simple example which shows the name "JOSE" in HEX. Using one of the numerous ASCII charts out there like http://www.asciitable.com/ you can see there are no hidden characters:
SQL> select 'JOSE' as chr, rawtohex('JOSE') as hex from dual;
CHR HEX
---- --------
JOSE 4A4F5345
SQL>
So, have a look at a name or two and see if you have any hidden characters. If not, I suspect a conflicting character set issue, maybe.

@gary_w has most of the bases well covered....
Here's my sql version of unix: cat -vet MyFile
select replace(regexp_replace(my_column,'[^[:print:]]', '!ACK!'),' ','.') as CAT_VET
from my_table
... all the non-printing characters become !ACK! and spaces become . You still need to determine what the characters actually ARE, but it's useful to find the looney-toon characters in your data.
Also, select dump(my_column) ... is another way to view the raw column values.
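For instance, DUMP with format 1016 prints each byte in hexadecimal together with the character set name. The sketch below appends a tab, CHR(9), to a literal to mimic a hidden control character; against real data you would pass the column instead:

```sql
-- CHR(9) stands in for an invisible character hiding in the data.
SELECT DUMP('JOSE' || CHR(9), 1016) AS raw_bytes
FROM dual;
```

The byte list should show a fifth byte with value 9 after the four letter codes, which is the kind of stray value to look for.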

DB2: find field value where first character is a lower case letter

I am trying to pick out a value in a field where the first character is a lower case letter. This is difficult since DB2 does not permit regular expressions. My current attempt is:
select * from mytable
where field1 like lcase('_%')
where I was hoping the underscore followed by the percent wildcard would find any character in the first position, with lcase() wrapped around it to ensure it is lowercase. The result is that any and every value gets selected, so lcase() is not doing what I want; in hindsight, it only lowercases the pattern string itself.
With that in mind, how do I ensure that the result of
('_%')
is lowercase only?
Thanks very much

I would use something like:
... where substr(field1,1,1) <> upper(substr(field1,1,1))
A solution with 'a'...'z' will not work with characters outside the Latin character set (e.g. Cyrillic etc.)

Why not
where field1 >= 'a' and field1 < '{'
This will even make use of an appropriate index, if any.
Be warned, however, that this won't work when your DB instance does lexicographic ordering. I am not sure if the latter is a DB attribute or a session attribute, however.
Another, more general way (especially when considering non-ASCII letters) would be to check that the length of the field is > 0, that the lowercased first character equals the first character, and that the uppercased first character does not equal the first character. (Look up the functions in the DB2 reference; I don't have mine at hand at the moment.)
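The check described in the last paragraph could be written roughly as follows (a sketch reusing mytable and field1 from the question, and assuming a case-sensitive comparison, which is the usual default):

```sql
SELECT *
FROM mytable
WHERE LENGTH(field1) > 0
  -- lowercasing the first character leaves it unchanged ...
  AND LOWER(SUBSTR(field1, 1, 1)) = SUBSTR(field1, 1, 1)
  -- ... but uppercasing it changes it, so it is a lowercase letter
  AND UPPER(SUBSTR(field1, 1, 1)) <> SUBSTR(field1, 1, 1)
```

Digits and punctuation are unchanged by both LOWER and UPPER, so the last condition filters them out.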

DB2 DOES allow Regular expressions with xQuery. For example:
with cteGender(VALUE) as
(
values
('M'),('F'),('U'),('S'),(' M'),('f')
),
cteResult(VALUE,RESULT_BOOLEAN) as
(
select '"' || VALUE || '"',
xmlquery('fn:matches($VALUE,''^[MFU]{1}$'')') from cteGender
)
select VALUE, RESULT_BOOLEAN,
xmlcast(RESULT_BOOLEAN as integer) RESULT_INTEGER from cteResult;
I took this example from: http://www.idug.org/p/bl/et/blogid=278&blogaid=187 That article explains very well how to use xQuery.
DB2 does not have SQL functions for Regular Expressions, but with xQuery you can do that. But if you really want SQL functions for RegEx, please visit this site: https://www.ibm.com/developerworks/jp/data/library/db2/j_d-regularexpression/ (In Japanese, but the code can be understood)
For more information about RegEx in DB2 please visit: http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.xml.doc/doc/xqrregexp.html

Finding and removing Non-ASCII characters from an Oracle Varchar2

We are currently migrating one of our Oracle databases to UTF8 and we have found a few records that are near the 4000-byte varchar limit.
When we try to migrate these records they fail, as they contain characters that become multibyte UTF8 characters.
What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.
I would like to do :
SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]],'')
but Oracle does not implement the [:ascii:] character class.
Is there a simple way of doing what I want to do?

I think this will do the trick:
SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')

If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...
UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
...where field and table are your field and table names respectively.

I wouldn't recommend it for production code, but it makes sense and seems to work:
SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')

The select may look like the following sample:
select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)
order by nvalue;

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.
Create a PLSQL function to receive your input string and return a varchar2.
In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.
That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.
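A minimal sketch of the function described above; the name strip_non_ascii and the parameter name are made up for illustration:

```sql
-- Sketch only: function and parameter names are hypothetical.
CREATE OR REPLACE FUNCTION strip_non_ascii (p_input IN VARCHAR2)
  RETURN VARCHAR2
IS
  -- PL/SQL allows up to 32767 bytes, so the escaped form has room to grow.
  l_escaped VARCHAR2(32767);
BEGIN
  -- Non-ASCII characters become \xxxx escape sequences.
  l_escaped := ASCIISTR(p_input);
  -- Remove every backslash followed by four hex digits.
  RETURN REGEXP_REPLACE(l_escaped, '\\[[:xdigit:]]{4}', '');
END strip_non_ascii;
/
```

One caveat to verify before relying on it: as far as I know, ASCIISTR also escapes a literal backslash (as \005C), so backslashes in the input would be stripped along with the non-ASCII characters.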

The following also works:
select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
CONVERT (
'3735844533120%$03  ',
'US7ASCII',
'WE8ISO8859P1'),
'[^!#/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
FROM DUAL);

I had a similar issue and blogged about it here.
I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:
select dump(a,1016), a, b
from
(select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
COLUMN b
from TABLE)
where a is not null
order by a;
I used dump with the 1016 variant to give out the hex characters I wanted to replace, which I could then use in a utl_raw.cast_to_varchar2.
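The round trip from hex back to text can be sketched like this; the hex string 4A4F5345 is just an illustrative value (the ASCII bytes for JOSE), not data from the original post:

```sql
-- HEXTORAW turns the hex digits into a RAW value,
-- which UTL_RAW.CAST_TO_VARCHAR2 reinterprets as characters.
SELECT UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('4A4F5345')) AS txt
FROM dual;
```

In an ASCII-compatible database character set this returns JOSE.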

I found the answer here:
http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html
CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:='^'||TO_CHAR(INPUT_STR)||'^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :='^'||REPLACE(str, CHR(askey),'');
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),'^'),'^'));
RETURN (OUTPUT_STR);
end;
/
Then run this to update your data
update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);

Try the following:
-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')
-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual

You can try something like following to search for the column containing non-ascii character :
select * from your_table where your_col <> asciistr(your_col);

I had a similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed.), but had to keep the line breaks.
I tried this from an excellent comment
'[^ -~|[:space:]]'
but got ORA-12728: invalid range in regular expression.
However, it led me to my solution:
select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from
(select '- <- strangest thing here, and I want to keep line break after
-' deta from dual ) t
displays (in my TOAD tool) as
That is, replace everything that is not (^) in the sets of printing ([:print:]) or whitespace ([:space:]) characters.

Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.
REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))
I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.

Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.
Here is the simple test in Oracle 11.2.03
select s
, regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
, dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
from (select 127+level l,chr(127+level) c from dual connect by level < 129))
And "rep 127-255" is
Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
i.e. for some reason this version of Oracle does not replace chr(226) and above.
Using '['||chr(127)||'-'||chr(225)||']' gives the desired result.
If you need to replace other characters, just add them to the regex above or use nested replace|regexp_replace if the replacement is different from '' (null string).

Please note that whenever you use
regexp_like(column, '[A-Z]')
Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.
Instead of fiddling with regular expressions, try changing to the NVARCHAR2 datatype prior to the character set upgrade.
Another approach: instead of cutting away part of the fields' contents, you might try the SOUNDEX function, provided your database contains European (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like
å => a
ä => a
ö => o
of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.
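Such a mapping can be sketched with TRANSLATE. The three characters below are the ones listed above, written with UNISTR so the script itself stays ASCII; the sample text is hypothetical, and a real mapping would need many more entries:

```sql
-- \00E5 = å, \00E4 = ä, \00F6 = ö; each maps to the ASCII letter
-- at the same position in the third argument.
SELECT TRANSLATE(UNISTR('p\00E5 v\00E4g \00F6ver'),
                 UNISTR('\00E5\00E4\00F6'),
                 'aao') AS ascii_lookalike
FROM dual;
```

Because TRANSLATE maps character by character, it cannot expand one character into two (for example ß to ss); that case would need REPLACE instead.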

As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:
SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;
This will replace anything outside that printable range as a question mark.
This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.

Do this, it will work.
trim(replace(ntwk_slctor_key_txt, chr(0), ''))

I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is).
The following is a simple character whitelist approach:
SELECT est.clients_ref
,TRANSLATE (
est.clients_ref
, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
|| REPLACE (
TRANSLATE (
est.clients_ref
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
)
,'~'
)
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
)
clean_ref
FROM edms_staging_table est
