Oracle Regex to remove small words

Oracle Regex to remove small words - regex

I am trying to scrub some address catalog which was originally user input for later processing. Having removed numbers and special characters, now I have strings like:
ST SOMETHING SOMEWHERE H BL D S
So, ST stands for street, H house and so on.. Not trying to be super precise here, so I figured just removing anything under 3 chars would be enough for now, but I can't seem to find the correct regexp in Oracle. Have tried:
SELECT
regexp_replace('ST SOMETHING SOMEWHERE H BL D S','[\ ^]\w{1,3}[\ $]', ' ')
FROM dual;
which yields:
ST SOMETHING SOMEWHERE BL S
and I wish I had only
SOMETHING SOMEWHERE
could someone lend me a hand?
(I'm on oracle 11g)

try
SELECT
regexp_replace(
'<raw string>'
,'(^|[[:space:]])([^[:space:]]{1,3}[[:space:]]+)*[^[:space:]]{1,3}([[:space:]]|$)'
, ' '
)
FROM dual;
hope that will suit your needs, regards, carsten

Related

Oracle SQL String Manipulation

My field contains short codes that I want to access, such as C-COR3.
The issue is some records have additional information (F and H with numbers). An example is C-COR3 F1.54H19, I only care about C-COR3. Anything after "F" I want to ignore.
Code below works, but only if I hard-code the full F1.54H19. I want to use wildcards to abstract this for other occurrences that have F and H info in the field. (Ex C-R3 F0.18H18 -> C-R3 or C-COR3 F0.23H8.5 -> C-COR3), note varying short code string lengths.
/* Translates C-COR3 F1.54H19 to C-COR3. */
select distinct SUBSTR(lud_code_short,1,INSTR(lud_code_short, 'F1.54H19')-2)
from rep_dba.mytable
I've read that SUBSTR does not allow wildcards, but have had no luck trying my hand at REGEXP_INSTR and REGEX_SUBSTR instead. Any help appreciated.

Assuming that the "code" is always the first continuous sequence of non-space characters (and that there are no leading spaces - if there are, that's easy to handle), you could do something like this. Note the str || ' ' in the call to instr() - that takes care of the case when the input string has no spaces in it to begin with. Also notice the last input - since there are no spaces anywhere, the output is the same as the input. (Showing that if the "code" is not always separated from the "additional information" by at least one space, the solution would not work.)
with
test_data (str) as (
select 'C-COR3 F14H2.5' from dual union all
select 'C-AB3' from dual union all
select null from dual union all
select 'C-AB2F14H2.5' from dual
)
select str, substr(str, 1, instr(str || ' ', ' ') - 1) as code
from test_data
;
STR CODE
-------------- --------------
C-COR3 F14H2.5 C-COR3
C-AB3 C-AB3
C-AB2F14H2.5 C-AB2F14H2.5

Try using regexp_replace within your query like below
SELECT
regexp_replace('C-COR3 F14H2.5', '(C-[[:alnum:]]+) [FH].*', '\1')
FROM dual;

Postgresql regexp_split_to_array NEGATION

So i can split a string into an array like this:
select regexp_split_to_array('Village at river nearby city' , ' at | nearby ')
-----------------------
{Village,river,city}
Easy.
But what if I want to negate this whole statement so the result is
{at,nearby}
I couldnt really find anything, because I coudnt find how to negate a list of words in regex...Any hints?

Replace parts in a string using REGEXP_REPLACE

Hi I have a string in oracle like this:
temp_a, temp_b, temp_c
What I want to get is :
e.temp_a, e.temp_b, e.temp_c
So I want to put an "e." before every part in this string
I searched in internet and I found examples to split the string or to replace simpler strings but nothing to guide me through my problem

select regexp_replace('temp_a, temp_b, temp_c',
'([a-zA-Z0-9_]+)(,?)',
'e.\1\2') from dual;
This should work.

I just noticed you're specifically asking for regex, but for what it's worth I would probably do it like this:
rtrim( replace( 'e.'||your_string, ', ', ', e.'), 'e.')

psql uppercase backreferenced string in regexp_replace

I have a string, that was previously processed with initcap(), and I wanto to uppercase part of it.
To be specific - I want to uppercase basic roman digits that might occur.
To be even more specific I'd like to replace
Jana Iii Sobieskiego
to
Jana III Sobieskiego
I suppose I could use some kind of a upper-substring-subquery combo to achieve it, but I am trying to make it work in a single regexp_replace, like this:
SELECT
ulica
--, regexp_matches(ulica , '((^|\s)([XxIiVv]+)(\s|$))', 'g')
, regexp_replace(ulica, '((^|\s)([XxIiVv]+)(\s|$))', '\2'||upper('q\3q')||'\4' , 'g')
FROM (
SELECT unnest(ARRAY['Jana Iii Sobieskiego', 'Xx Lecia', 'Xxx Lecia Panowania Zygmunta Iii Wazy'])::text AS ulica
) AS src
What happens, is that upper works on the 'static' part of replacement string (q...q), but not on the backreference.
I get
Jana QIiiQ Sobieskiego
Anyone has an idea how to do this?
PostgreSQL 9.1

SHORT ANSWER
Unfortunately, what you have tried is not possible with regexp_replace.
LONG ANSWER
INTRO
This line
regexp_replace(ulica, '((^|\s)([XxIiVv]+)(\s|$))', '\2'||upper('q\3q')||'\4' , 'g')
is equivalent to
regexp_replace(ulica, '((^|\s)([XxIiVv]+)(\s|$))', '\2Q\3Q\4' , 'g')
As you can see, regexp_replace won't upper case any backreferences.
WORKAROUND
You can create your own function that take an ulica as parameter and returns ulica with basic roman digits uppercased.
Step 1
In a first step this function would mark (I choose $$ as a marker but you can use any.) the part of ulica to be uppercased like this:
regexp_replace(ulica, '((^|\s)([XxIiVv]+)(\s|$))', '\2$$\3$$\4' , 'g')
Step 2
In a second step, go through the resulting string and uppercase each char located between two markers.

Finding and removing Non-ASCII characters from an Oracle Varchar2

We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit.
When we try and migrate these record they fail as they contain characters that become multibyte UF8 characters.
What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.
I would like to do :
SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]],'')
but Oracle does not implement the [:ascii:] character class.
Is there a simple way doing what I want to do?

I think this will do the trick:
SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')

If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...
UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
...where field and table are your field and table names respectively.

I wouldn't recommend it for production code, but it makes sense and seems to work:
SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')

The select may look like the following sample:
select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)
order by nvalue;

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.
Create a PLSQL function to receive your input string and return a varchar2.
In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.
That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.

The following also works:
select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
CONVERT (
'3735844533120%$03  ',
'US7ASCII',
'WE8ISO8859P1'),
'[^!#/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
FROM DUAL);

I had a similar issue and blogged about it here.
I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:
select dump(a,1016), a, b
from
(select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
COLUMN b
from TABLE)
where a is not null
order by a;
I used dump with the 1016 variant to give out the hex characters I wanted to replace which I could then user in a utl_raw.cast_to_varchar2.

I found the answer here:
http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html
CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:=’^'||TO_CHAR(INPUT_STR)||’^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :=’^'||REPLACE(str, CHR(askey),”);
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),’^'),’^'));
RETURN (OUTPUT_STR);
end;
/
Then run this to update your data
update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);

Try the following:
-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')
-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual

You can try something like following to search for the column containing non-ascii character :
select * from your_table where your_col <> asciistr(your_col);

I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.
I tried this from an excellent comment
'[^ -~|[:space:]]'
but got this ORA-12728: invalid range in regular expression .
but it lead me to my solution:
select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from
(select '- <- strangest thing here, and I want to keep line break after
-' deta from dual ) t
displays (in my TOAD tool) as
replace all that ^ => is not in the sets (of printing [:print:] or space |[:space:] chars)

Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.
REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))
I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.

Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.
Here is the simple test in Oracle 11.2.03
select s
, regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
, dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
from (select 127+level l,chr(127+level) c from dual connect by level < 129))
And "rep 127-255" is
Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
i.e for some reason this version of Oracle does not replace char(226) and above.
Using '['||chr(127)||'-'||chr(225)||']' gives the desired result.
If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different then '' (null string).

Please note that whenever you use
regexp_like(column, '[A-Z]')
Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.
Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade.
Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European characters (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like
å => a
ä => a
ö => o
of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.

As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:
SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;
This will replace anything outside that printable range as a question mark.
This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.

Do this, it will work.
trim(replace(ntwk_slctor_key_txt, chr(0), ''))

I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is).
The following is a simple character whitelist approach:
SELECT est.clients_ref
,TRANSLATE (
est.clients_ref
, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
|| REPLACE (
TRANSLATE (
est.clients_ref
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
)
,'~'
)
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
)
clean_ref
FROM edms_staging_table est

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Oracle Regex to remove small words - regex

try SELECT regexp_replace( '<raw string>' ,'(^|[[:space:]])([^[:space:]]{1,3}[[:space:]]+)*[^[:space:]]{1,3}([[:space:]]|$)' , ' ' ) FROM dual; hope that will suit your needs, regards, carsten

Related

Oracle SQL String Manipulation

Postgresql regexp_split_to_array NEGATION

Replace parts in a string using REGEXP_REPLACE

psql uppercase backreferenced string in regexp_replace

Finding and removing Non-ASCII characters from an Oracle Varchar2

Categories

Resources