Related
Given input string as
'PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,kWh,800,Gallons,500'
and unit_param = 'Gallons'
I need to extract value of unit_param (Gallons) which is 1000 using postgresql regex functions.
As of now, I have a function that first extracts value for THRESHOLDLIST which is "kWh,2000,Gallons,1000,litre,3000", then splits and loops over the array to get the value.
Can I get this efficiently using regex.
SELECT substring('PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,xyz' FROM '%THRESHOLDLIST=#".........#",%' FOR '#')
Use substring() with the target input grouped:
substring(myCol, 'THRESHOLDLIST=[^=]*Gallons,([0-9]+)')
The expression [^=]* means “characters that are not =”, so it won’t match Gallons within another parameter.
select
Substring('PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,xyz' from 'Gallons,\d*');
returns Gallons,1000
I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
Below example is for BigQuery Standard SQL using super simple SPLIT approach
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need/want to use regexp here - use below
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note use of REGEXP_EXTRACT instead of REGEXP_REPLACE
You can play, test above options with dummy string from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output :
first_number second_number third_number first_number_re second_number_re third_number_re fifth_number_re
1.55763239875389 2.80373831775701 1.24610591900312 1.55763239875389 2.80373831775701 1.24610591900312 26.4797507788162
I don't know of a single regex replace which could be used to isolate a single number in your CSV string, because we need to remove things on both sides of the match, in general. But, we can chain together two calls to regex_replace. For example, if you wanted to target the third number in the CSV string, we could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', ''))
The pattern I am using to strip of the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
Demo
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching accross lines in m|multiline mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', '$1')
Explanation:
([^,]+(?:,|$){n} captures everything to the next comma or the end of the string n times
([^,]+(?:,|$))* captures the rest 0 or more times
^.*$ capture everything if we cannot match n times
And then, finally, we can reinsert the nth match using $1.
I need to replace multiple words such as (dog|cat|bird) with nothing in a string where there may be multiple consecutive occurrences of a word. The actual code is to remove salutations and suffixes from a name. Unfortunately the garbage data I get sometimes contains "SNERD JR JR."
I was able to create a regular expression pattern that accomplishes my goal but only for the first occurrence. I implemented a stupid hack to get rid of the second occurrence, but I believe there has to be a better way. I just can't figure it out.
Here is my "hacked" code;
FUNCTION REMOVE_SALUTATIONS(IN_STRING VARCHAR2) RETURN VARCHAR2 DETERMINISTIC
AS
REGEX_SALUTATIONS VARCHAR2(4000) := '(^|\s)(MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR)(\.?)(\s|$)';
BEGIN
RETURN TRIM(REGEXP_REPLACE(REGEXP_REPLACE(IN_STRING,REGEX_SALUTATIONS,' '),REGEX_SALUTATIONS,''));
END REMOVE_SALUTATIONS;
I was actually proud that I was able to get this far, as regular expression are not very regular to me. All help is appreciated.
EDIT:
The default for regexp_replace based on my understanding is to do a global replace. But on the outside chance my DB is configured different I did try;
select REGEXP_REPLACE('SNERD JR JR','(^|\s)(MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR)(\.?)(\s|$)',' ',1,0) from dual;
and the results are;
SNERD JR
Use occurrence parameter of REGEXP_REPLACE function. The docs says:
occurrence is a nonnegative integer indicating the occurrence of the replace operation:
If you specify 0, then Oracle replaces all occurrences of the match.
If you specify a positive integer n, then Oracle replaces the nth occurrenc
https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions137.htm#SQLRF06302
It should look like:
...
REGEXP_REPLACE(IN_STRING,REGEX_SALUTATIONS,' ', 1,0 )
...
How can I replace a list of values like
married
single
non
married
couple
to a list like this using a regular expression
Status 2
Status 1
non
Status 2
couple
? I know can match each group by something like this
/(married|single)/gm
and that I can address the matched group by $1, $2, ... . But how can I address and/or if-else the group-value in the replace-part to acutally translate the values?
Edit
Let's say I have the values to replace in a MariaDB-colum marital in myTable. Then I can do something like
SELECT
marital,
REGEXP_REPLACE(REGEXP_REPLACE(marital,
"married", "Status 2")
, "single", "Status 1")
FROM myTable
To get the desired result. But Is there a way to do this with just one REGEXP_REPLACE?
Thanks for your help!
You cannot do it with a single REGEXP_REPLACE because MariaDB doesn't support the required features in the third parameter.
You may do it using PHP with arrays: http://php.net/manual/en/function.preg-replace.php
or with callback: http://php.net/manual/en/function.preg-replace-callback.php
You may do it using Perl: How to replace a set of search/replace pairs?
We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit.
When we try and migrate these record they fail as they contain characters that become multibyte UF8 characters.
What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.
I would like to do :
SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]],'')
but Oracle does not implement the [:ascii:] character class.
Is there a simple way doing what I want to do?
I think this will do the trick:
SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')
If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...
UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
...where field and table are your field and table names respectively.
I wouldn't recommend it for production code, but it makes sense and seems to work:
SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')
The select may look like the following sample:
select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)
order by nvalue;
In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.
There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.
Create a PLSQL function to receive your input string and return a varchar2.
In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.
That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.
The following also works:
select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
CONVERT (
'3735844533120%$03 ',
'US7ASCII',
'WE8ISO8859P1'),
'[^!#/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
FROM DUAL);
I had a similar issue and blogged about it here.
I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:
select dump(a,1016), a, b
from
(select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
COLUMN b
from TABLE)
where a is not null
order by a;
I used dump with the 1016 variant to give out the hex characters I wanted to replace which I could then user in a utl_raw.cast_to_varchar2.
I found the answer here:
http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html
CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:=’^'||TO_CHAR(INPUT_STR)||’^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :=’^'||REPLACE(str, CHR(askey),”);
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),’^'),’^'));
RETURN (OUTPUT_STR);
end;
/
Then run this to update your data
update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);
Try the following:
-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')
-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual
You can try something like following to search for the column containing non-ascii character :
select * from your_table where your_col <> asciistr(your_col);
I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.
I tried this from an excellent comment
'[^ -~|[:space:]]'
but got this ORA-12728: invalid range in regular expression .
but it lead me to my solution:
select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from
(select '- <- strangest thing here, and I want to keep line break after
-' deta from dual ) t
displays (in my TOAD tool) as
replace all that ^ => is not in the sets (of printing [:print:] or space |[:space:] chars)
Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.
REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))
I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.
Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.
Here is the simple test in Oracle 11.2.03
select s
, regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
, dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
from (select 127+level l,chr(127+level) c from dual connect by level < 129))
And "rep 127-255" is
Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
i.e for some reason this version of Oracle does not replace char(226) and above.
Using '['||chr(127)||'-'||chr(225)||']' gives the desired result.
If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different then '' (null string).
Please note that whenever you use
regexp_like(column, '[A-Z]')
Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.
Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade.
Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European characters (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like
å => a
ä => a
ö => o
of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.
As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:
SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;
This will replace anything outside that printable range as a question mark.
This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.
Do this, it will work.
trim(replace(ntwk_slctor_key_txt, chr(0), ''))
I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is).
The following is a simple character whitelist approach:
SELECT est.clients_ref
,TRANSLATE (
est.clients_ref
, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
|| REPLACE (
TRANSLATE (
est.clients_ref
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
)
,'~'
)
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
)
clean_ref
FROM edms_staging_table est