High Surrogate Detection in Oracle Regular Expression

I have a web service that pulls text from an NCLOB column and returns the data via XML. The NCLOB column is populated by extracting text from documents, so there are occasions where invalid XML characters are placed in the XML, causing the consuming system to fail.
As per the W3C, the range of valid characters is:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
We have tried a few different RegExp patterns, and we're close, but we're not completely there yet. Here is the closest we've come. All of the invalid characters are replaced except for the high surrogates (DB9B - DBFF).
REGEXP_REPLACE(
TEXT,
'[^[:print:]' || chr(13) || chr(10) || ']|[' || UNISTR('\FFFE-\FFFF') || ']',
'*')
We have also tried this, but none of the surrogates (D800 - DFFF) are replaced.
REGEXP_REPLACE(REPLACE(TEXT, unistr('\0000'), ' '),
'[' || unistr('\0001-\0008') || ']'
|| '|[' || unistr('\000B-\000C') || ']'
|| '|[' || unistr('\000E-\001F') || ']'
|| '|[' || unistr('\D800-\DFFF') || ']'
|| '|[' || unistr('\FFFE-\FFFF') || ']',' ')
How can we match the high surrogates? Any thoughts or guidance would be most appreciated.
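For readers who want to check the ranges outside the database, here is a minimal Python sketch of the W3C character rule quoted above. It is not Oracle code. The names INVALID_XML_CHARS and replace_invalid_xml are made up for this illustration, and the sketch assumes a regex engine that treats a lone surrogate as a single character, as Python's re does.

```python
import re

# Negated class built from the W3C XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything outside these ranges (control characters, surrogates, FFFE, FFFF) matches.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_invalid_xml(text, replacement="*"):
    """Replace every character that XML 1.0 does not allow (hypothetical helper)."""
    return INVALID_XML_CHARS.sub(replacement, text)


# A NUL, a lone high surrogate, and U+FFFE are each replaced.
print(replace_invalid_xml("a\x00b\ud800c\ufffe") == "a*b*c*")
```

Tabs, line breaks, and supplementary-plane characters pass through unchanged, because they fall inside the allowed ranges.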

You could write your own function, since REGEXP_REPLACE does not seem to work for the high surrogates. Here's an example (tested on 9.2 and 11.2):
CREATE OR REPLACE FUNCTION replace_invalid(p_clob NCLOB) RETURN NCLOB IS
   l_result NCLOB;
   l_char   NVARCHAR2(1 char);
BEGIN
   -- Walk the NCLOB one character at a time
   FOR i IN 1 .. length(p_clob) LOOP
      l_char := substr(p_clob, i, 1);
      -- Compare the character's raw value against the DB9B-DBFF range
      IF utl_raw.cast_to_binary_integer(utl_raw.cast_to_raw(l_char))
         BETWEEN to_number('DB9B', 'xxxx') AND to_number('DBFF', 'xxxx') THEN
         l_result := l_result || N'*';
      ELSE
         l_result := l_result || l_char;
      END IF;
   END LOOP;
   RETURN l_result;
END;
It should also work with a large NCLOB. Here's an example with a CLOB of more than 32k characters:
DECLARE
   l_in  NCLOB;
   l_out NCLOB;
BEGIN
   FOR i IN 1 .. to_number('DBFF', 'xxxx') LOOP
      l_in := l_in || nchr(i);
   END LOOP;
   dbms_output.put_line('l_in length:' || length(l_in));
   l_out := replace_invalid(l_in);
   dbms_output.put_line('l_out length:' || length(l_out));
   dbms_output.put_line('chars replaced:'
      || (length(l_out) - length(REPLACE(l_out, '*', ''))));
END;
/
l_in length:56319
l_out length:56319
chars replaced:102

You can use a series of TRANSLATE calls.
For example:
SELECT UNISTR('abc\00e5\00f1\00f6') Source FROM DUAL;
SELECT TRANSLATE(UNISTR('abc\00e5\00f1\00f6'), UNISTR('a\00e5\00f1'), 'a') Final FROM DUAL;
Source
------
abcåñö
Final
------
abcö

Related

Oracle - How to find carriage return, new line and tab using REGEXP_LIKE?

I am trying to run a query in Oracle 11g where I am looking in a VARCHAR column for any rows that contain any of a carriage return, new line or tab. So far my code is as shown
select c1 from table_name where regexp_like(c1, '[\r\n\t]')
Not sure why, but I am getting unexpected results. I saw some mention that Oracle doesn't support '\r' or the other escapes I used. Some folks suggested using chr(10), for example, so I tried the following code:
select c1 from table_name where regexp_like(c1, '[chr(10)|chr(13)]')
And again I am getting unexpected results. Pretty sure I am misunderstanding something here and I was hoping for some guidance.
You can use:
select c1
from table_name
where c1 LIKE '%' || chr(10) || '%'
or c1 LIKE '%' || chr(13) || '%'
or c1 LIKE '%' || chr(9) || '%';
or
select c1
from table_name
where regexp_like(c1, '[' || chr(10) || chr(13) || chr(9) || ']')
where regexp_like(c1, '[\r\n\t]') does not work because the bracket expression matches any character that is \, r, n, or t; Oracle does not treat \r, \n, or \t inside it as Perl-like escape sequences.
where regexp_like(c1, '[chr(10)|chr(13)]') does not work because the whole pattern is a string literal whose contents are never evaluated, so it matches any character that is c, h, r, (, 1, 0, ), |, or 3. If you want CHR to be evaluated as a function call, it must sit outside the string literal, as in the second example above.
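The string-literal pitfall can be reproduced in other regex engines. The following is a minimal Python sketch, not Oracle code; the variable names are made up for illustration. Note that Python's re does understand \r, \n, and \t, so only the second point (function calls written inside a literal are never evaluated) carries over directly.

```python
import re

# The call text is typed inside the literal, so the class contains the
# characters c, h, r, (, 1, 0, ), | and 3 rather than any control character.
literal_class = re.compile("[chr(10)|chr(13)]")

# The class is built by concatenation, mirroring
# '[' || chr(10) || chr(13) || chr(9) || ']' in the answer above.
built_class = re.compile("[" + chr(10) + chr(13) + chr(9) + "]")

sample_with_newline = "line\nend"
sample_plain = "chr"

print(bool(literal_class.search(sample_with_newline)))  # no listed character present
print(bool(built_class.search(sample_with_newline)))    # the line feed is found
print(bool(literal_class.search(sample_plain)))         # matches the letter c
print(bool(built_class.search(sample_plain)))           # no control character present
```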

How to remove all foreign characters SAS

I have been dealing with this problem all day. So far I have tried the code below, but it still causes an error. The error reads:
Last Name field may only contain alphabetic characters, hyphens, or apostrophes. Please remove all foreign characters and resubmit.
data APPLIED_GRAD1;
set APPLIED_GRAD;
last_name=compress(last_name,"ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890`~!@#$%^&*()-_=+\|[]{};:',.<>?/ " , "kis");
pos_notalpha = notalpha ( TRIMN ( last_name )) ;
keep last_name;
run;
data APPLIED_GRAD2;
set APPLIED_GRAD1;
where pos_notalpha=0;
run;
Is there anything else I can do to remove all foreign characters?
Thanks
Here's how you can clean specific ASCII chars from a string. Define an FCMP function:
proc fcmp outlib=work.funcs.funcs;
function clean(iField $) $200;
bad_char_list = byte( 0) || byte( 1) || byte( 2) || byte( 3) || byte( 4) || byte( 5) || byte( 6) || byte( 7) || byte( 8) || byte( 9) ||
byte(10) || byte(11) || byte(12) || byte(13) || byte(14) || byte(15) || byte(16) || byte(17) || byte(18) || byte(19) ||
byte(20) || byte(21) || byte(22) || byte(23) || byte(24) || byte(25) || byte(26) || byte(27) || byte(28) || byte(29) ||
byte(30) || byte(31) ||
byte(127)
;
iCleaned = translate(iField," ",bad_char_list);
return (iCleaned );
endsub;
run;
Example Usage - Cleaning line breaks prior to exporting to CSV:
data x;
length employer $200;
employer = cats("blah",byte(10),"diblah");
employer = clean(employer);
run;
proc export data=x
outfile="d:\test.csv"
dbms=csv
replace;
run;
Note - This function is pretty slow if you have a very large dataset and/or are running against many fields. If you are targeting very specific bad characters (for example those that may affect CSV integrity) then you may want to reduce the character list to just bytes 9/10/13.
Here are two ways
Regular expression
Use PRXCHANGE with a regular expression whose negated (^) character class lists the characters to keep; every character that does not match is replaced with nothing (//). The literal dash (-) and apostrophe (') are escaped with a backslash (\).
lastname = prxchange("s/[^A-Za-z\-\']//", -1, lastname);
Compress with list of characters to keep
Use the COMPRESS function with a third argument of K, the modifier that keeps the listed characters instead of removing them.
lastname = compress(lastname, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-'", 'K');
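The keep-list idea behind both SAS answers is not specific to SAS. As a minimal sketch in Python (not SAS; the names NOT_ALLOWED and clean_last_name are made up for illustration), the same negated character class looks like this:

```python
import re

# Anything that is not an unaccented Latin letter, an apostrophe, or a hyphen.
# The hyphen is placed last in the class so it is read literally.
NOT_ALLOWED = re.compile(r"[^A-Za-z'-]")


def clean_last_name(name):
    """Drop every character outside the keep-list (hypothetical helper)."""
    return NOT_ALLOWED.sub("", name)


print(clean_last_name("O'Brien-Smíth 3rd"))
```

Accented letters are removed rather than transliterated, which matches what the two SAS snippets do.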

Find out if a string contains only ASCII characters

I need to know whether a string contains only ASCII characters. So far I use this REGEX:
DECLARE
str VARCHAR2(100) := 'xyz';
BEGIN
IF REGEXP_LIKE(str, '^[ -~]+$') THEN
DBMS_OUTPUT.PUT_LINE('Pure ASCII');
END IF;
END;
/
Pure ASCII
' ' (space) and ~ are the first and last printable characters in ASCII, respectively.
The problem is that this REGEXP_LIKE fails under certain NLS settings:
ALTER SESSION SET NLS_SORT = 'GERMAN';
DECLARE
str VARCHAR2(100) := 'xyz';
BEGIN
IF REGEXP_LIKE(str, '^[ -~]+$') THEN
DBMS_OUTPUT.PUT_LINE('Pure ASCII');
END IF;
END;
/
ORA-12728: invalid range in regular expression
ORA-06512: at line 4
Does anybody know a solution that works independently of the current user's NLS settings? Is this behavior intentional, or should it be considered a bug?
You can use TRANSLATE to do this. Basically, translate away all the ASCII printable characters (there aren't that many of them) and see what you have left.
Here is a query that does it:
WITH input ( p_string_to_test) AS (
SELECT 'This this string' FROM DUAL UNION ALL
SELECT 'Test this ' || CHR(7) || ' string too!' FROM DUAL UNION ALL
SELECT 'xxx' FROM DUAL)
SELECT p_string_to_test,
case when translate(p_string_to_test,
chr(0) || q'[ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~]',
chr(0)) is null then 'Yes' else 'No' END is_ascii
FROM input;
+-------------------------+----------+
| P_STRING_TO_TEST | IS_ASCII |
+-------------------------+----------+
| This this string | Yes |
| Test this string too! | No |
| xxx | Yes |
+-------------------------+----------+
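The "translate everything away and see what is left" idea can be sketched outside Oracle as well. The following Python version is an illustration only, with made-up names (PRINTABLE_ASCII, DELETE_PRINTABLE, is_printable_ascii); it uses the same definition of printable ASCII as the query, code points 32 through 126.

```python
# All printable ASCII characters, from space (32) to tilde (126).
PRINTABLE_ASCII = "".join(chr(code) for code in range(32, 127))

# Translation table that deletes every printable ASCII character.
DELETE_PRINTABLE = str.maketrans("", "", PRINTABLE_ASCII)


def is_printable_ascii(text):
    """True when nothing is left after deleting printable ASCII (hypothetical helper)."""
    return text.translate(DELETE_PRINTABLE) == ""


for sample in ["This this string", "Test this " + chr(7) + " string too!", "xxx"]:
    print(repr(sample), "Yes" if is_printable_ascii(sample) else "No")
```

Because the check is a plain character deletion, it does not depend on any collation or sort setting, which is the property the question asks for.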
The ASCII function with an upper limit of 127 may also be used:
declare
str nvarchar2(100) := '\xyz~*-=)(/&%+$#£>|"éß';
a nvarchar2(1);
b number := 0;
begin
for i in 1..length(str)
loop
a := substrc(str,i,1);
b := greatest(ascii(a),b);
end loop;
if b < 128 then
dbms_output.put_line('String is composed of Pure ASCII characters');
else
dbms_output.put_line('String has non-ASCII characters');
end if;
end;
I think I will go for one of these two:
IF CONVERT(str, 'US7ASCII') = str THEN
DBMS_OUTPUT.PUT_LINE('Pure ASCII');
END IF;
IF ASCIISTR(REPLACE(str, '\', '/')) = REPLACE(str, '\', '/') THEN
DBMS_OUTPUT.PUT_LINE('Pure ASCII');
END IF;

Informatica Expression IS_DATE

This works:
TO_DATE(TO_CHAR('12'|| '-' || '12' || '-01'),'YYYY/MM/DD')
This does not work:
IS_DATE(TO_DATE(TO_CHAR('12'|| '-' || '12' || '-01'),'YYYY/MM/DD'))
IS_DATE(TO_DATE(TO_CHAR('12'|| '-' || '12' || '-01'),'YYYY/MM/DD'),'YYYY/MM/DD')
What exactly am I doing wrong?
I have tried datatypes STRING and DATE/TIME
Please try this (IS_DATE expects a string, so drop the TO_DATE call):
IS_DATE(TO_CHAR('12'|| '-' || '12' || '-01'),'YYYY/MM/DD')
Syntax:
IS_DATE(input as char,format as char)
IS_DATE returns 1 if the input is a valid date and 0 if the date is not valid.
IS_DATE('02/01/2013', 'DD/MM/YYYY') -> returns 1
IS_DATE('02312013', 'MMDDYYYY') -> returns 0 (as February 31st is not a valid date)
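The behavior described above (take a string and a format, report whether the string parses as a real calendar date) can be sketched in Python. This is an illustration rather than Informatica code: the helper name is_date is made up, it returns True/False instead of 1/0, and Python's strptime format codes (%d, %m, %Y) differ from Informatica's DD, MM, YYYY.

```python
from datetime import datetime


def is_date(value, fmt):
    """Return True if the string parses as a valid date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        # Raised for a format mismatch and for impossible dates such as 31 February.
        return False


print(is_date("02/01/2013", "%d/%m/%Y"))
print(is_date("02312013", "%m%d%Y"))
```

As with IS_DATE, the input is the unparsed string; converting it to a date first defeats the purpose of the check.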

Search every column, of every database of every schema for a regular expression

I'm working for a company undergoing acquisition at the moment. They use Oracle 11g and have a requirement for identifying all references to the current company name in their databases and having these listed by the schema/owner, table, column and number of occurrences in that column.
I've used the following with some success, as taken from another answer.
SET SERVEROUTPUT ON SIZE 100000
DECLARE
match_count INTEGER;
BEGIN
FOR T IN
(
SELECT owner, table_name, column_name
FROM all_tab_columns
WHERE
OWNER <> 'SYS' AND DATA_TYPE LIKE '%CHAR%'
) LOOP
EXECUTE IMMEDIATE
'SELECT COUNT(*) FROM ' || t.owner || '.' || t.table_name ||
' WHERE '||t.column_name||' = :1'
INTO MATCH_COUNT
USING 'NAME';
IF MATCH_COUNT > 0 THEN
dbms_output.put_line( t.owner ||' '|| t.table_name ||' '||t.column_name||' '||match_count );
END IF;
END LOOP;
END;
/
However, it only finds literal strings of NAME, and I also want to find Name, Name Shops, Name Accounts, Name someOtherStringICantGuess, etc. So I think I should use a regular expression. I'm fine with the regular expression part, but I'm unsure how to incorporate it into the above functionality. In fact, I'm uncertain whether I will be adapting the above code or doing something completely different.
One last thing: performance and duration of the run of the script are irrelevant and subordinate to the certainty of every column being checked. There is a dedicated environment that mimics production where this script will be deployed so it won't adversely affect the company's customers.
Thanks in advance.
EDIT: Just removed some company specific code...
The simplest method is to wrap the column in UPPER and compare it against an uppercased search value.
SET SERVEROUTPUT ON SIZE 100000
DECLARE
-- set l_wildcard_search to true if you are using wildcards ('%'),
-- false if you want a straight match on the name
-- Wild card searches (like) are not able to use indexes whereas '='
-- potentially can.
l_wildcard_search CONSTANT BOOLEAN := FALSE;
match_count INTEGER;
--
l_searchvalue VARCHAR2 (100) := UPPER ('NAME');
l_cmd VARCHAR2 (200);
BEGIN
FOR t IN (SELECT owner, table_name, column_name
FROM all_tab_columns
WHERE owner NOT IN ('SYS', 'SYSTEM')
AND data_type LIKE '%CHAR%')
LOOP
BEGIN
l_cmd := 'SELECT COUNT(*) FROM '
|| t.owner
|| '.'
|| t.table_name
|| ' WHERE upper('
|| t.column_name
|| ')'
|| CASE WHEN l_wildcard_search THEN ' like ' ELSE ' = ' END
|| ':1';
DBMS_OUTPUT.put_line (l_cmd);
EXECUTE IMMEDIATE l_cmd INTO match_count USING l_searchvalue;
IF match_count > 0
THEN
DBMS_OUTPUT.put_line (t.owner || ' ' || t.table_name || ' ' || t.column_name || ' ' || match_count);
END IF;
EXCEPTION
WHEN OTHERS
THEN
DBMS_OUTPUT.put_line ('Error executing: ' || l_cmd);
END;
END LOOP;
END;
/
Here is your answer using regular expressions:
SET SERVEROUTPUT ON SIZE 100000
DECLARE
match_count INTEGER;
l_searchvalue VARCHAR2 (100) := UPPER ('NAME');
l_cmd VARCHAR2 (200);
BEGIN
FOR t IN (SELECT owner, table_name, column_name
FROM all_tab_columns
WHERE owner NOT IN ('SYS', 'SYSTEM')
AND data_type LIKE '%CHAR%')
LOOP
BEGIN
l_cmd := 'SELECT COUNT(*) FROM '
|| t.owner
|| '.'
|| t.table_name
|| ' WHERE regexp_like('
|| t.column_name
|| ', :1, ''i'')';
DBMS_OUTPUT.put_line (l_cmd);
EXECUTE IMMEDIATE l_cmd INTO match_count USING l_searchvalue;
IF match_count > 0
THEN
DBMS_OUTPUT.put_line (t.owner || ' ' || t.table_name || ' ' || t.column_name || ' ' || match_count);
END IF;
EXCEPTION
WHEN OTHERS
THEN
DBMS_OUTPUT.put_line ('Error executing: ' || l_cmd);
END;
END LOOP;
END;
/
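The overall shape of both scripts (list every text column, count the rows matching a pattern, report only non-zero counts) can be tried out without an Oracle instance. The sketch below is an assumption-heavy stand-in: it swaps in SQLite through Python's sqlite3 module, reads the catalog from sqlite_master and PRAGMA table_info instead of all_tab_columns, and does the case-insensitive matching in Python. The function name count_matches and the sample table are made up.

```python
import re
import sqlite3


def count_matches(conn, pattern):
    """Return (table, column, count) for each column holding matching text values."""
    regex = re.compile(pattern, re.IGNORECASE)
    results = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Second field of each PRAGMA table_info row is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            rows = conn.execute(f'SELECT "{column}" FROM "{table}"')
            count = sum(1 for (value,) in rows
                        if isinstance(value, str) and regex.search(value))
            if count:
                results.append((table, column, count))
    return results


# Hypothetical sample data standing in for a real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shops (id INTEGER, title TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO shops VALUES (?, ?, ?)",
    [(1, "Name Shops", "n/a"), (2, "Other", "see NAME accounts"), (3, "name", "x")],
)
print(count_matches(conn, "name"))
```

The counts are rows per column that contain the pattern at least once, which is also what the COUNT(*) queries above report.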