How to split a string in db2? - regex

I've some URL's in my cas_fnd_dwd_det table,
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf
www.casiac.net/fnds/casi/as.pdf
www.casiac.net/fnds/casi/vindq.pdf
www.casiac.net/fnds/CASI/mnip.pdf
how do i copy the letters between last '/' and '.pdf' to another column
expected outcome
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf qnxp
www.casiac.net/fnds/casi/as.pdf as
www.casiac.net/fnds/casi/vindq.pdf vindq
www.casiac.net/fnds/CASI/mnip.pdf mnip
the below URL's are static
www.casiac.net/fnds/CASI/
www.casiac.net/fnds/casi/
Advise, how do i select the codes between last '/' and '.pdf' ?

I would recommend to take a look at REGEXP_SUBSTR. It allows to apply a regular expression. Db2 has string processing functions, but the regex function may be the easiest solution. See SO question on regex and URI parts for different ways of writing the expression. The following would return the last slash, filename and the extension:
SELECT REGEXP_SUBSTR('http://fobar.com/one/two/abc.pdf','\/(\w)*.pdf' ,1,1)
FROM sysibm.sysdummy1
/abc.pdf
The following uses REPLACE and the pattern is from this SO question with the pdf file extension added. It splits the string in three groups: everything up to the last slash, then the file name, then the ".pdf". The '$1' returns the group 1 (groups start with 0). Group 2 would be the ".pdf".
SELECT REGEXP_REPLACE('http://fobar.com/one/two/abc.pdf','(?:.+\/)(.+)(.pdf)','$1' ,1,1)
FROM sysibm.sysdummy1
abc
You could apply LENGTH and SUBSTR to extract the relevant part or try to build that into the regex.

For older Db2 versions than 11.1. Not sure if it works for 9.5, but definitely should work since 9.7.
Try this as is.
with cas_fnd_dwd_det (casi_imp_urls) as (values
'www.casiac.net/fnds/CASI/qnxp.pdf'
, 'www.casiac.net/fnds/casi/as.pdf'
, 'www.casiac.net/fnds/casi/vindq.pdf'
, 'www.casiac.net/fnds/CASI/mnip.PDF'
)
select
casi_imp_urls
, xmlcast(xmlquery('fn:replace($s, ".*/(.*)\.pdf", "$1", "i")' passing casi_imp_urls as "s") as varchar(50)) cas_code
from cas_fnd_dwd_det

Related

How can I use regular expressions to select text between commas?

I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
Below example is for BigQuery Standard SQL using super simple SPLIT approach
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need/want to use regexp here - use below
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note use of REGEXP_EXTRACT instead of REGEXP_REPLACE
You can play, test above options with dummy string from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output :
first_number second_number third_number first_number_re second_number_re third_number_re fifth_number_re
1.55763239875389 2.80373831775701 1.24610591900312 1.55763239875389 2.80373831775701 1.24610591900312 26.4797507788162
I don't know of a single regex replace which could be used to isolate a single number in your CSV string, because we need to remove things on both sides of the match, in general. But, we can chain together two calls to regex_replace. For example, if you wanted to target the third number in the CSV string, we could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', ''))
The pattern I am using to strip of the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
Demo
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching accross lines in m|multiline mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', '$1')
Explanation:
([^,]+(?:,|$){n} captures everything to the next comma or the end of the string n times
([^,]+(?:,|$))* captures the rest 0 or more times
^.*$ capture everything if we cannot match n times
And then, finally, we can reinsert the nth match using $1.

Replace pair of % in oracle

please, I have in Oracle table this texts (as 2 records)
"Sample text with replace parameter %1%"
"You reached 90% of your limit"
I need replace %1% with specific text from input parameter in Oracle Function. In fact, I can have more than just one replace parameters. I have also record with "Replace this %12% with real value"
This functionality I have programmed:
IF poc > 0 THEN
FOR i in 1 .. poc LOOP
p := get_param(mString => mbody);
mbody := replace(mbody,
'%' || p || '%', parameters(to_number(p, '99')));
END LOOP;
END IF;
But in this case I have problem with text number 2. This functionality trying replace "90%" also and I then I get this error:
ORA-06502: PL/SQL: numeric or value error: NULL index table key value
It's a possible to avoid try replace "90%"? Many thanks for advice.
Best regards
PS: Oracle version: 10g (OCI Version: 10.2)
Regular expressions can work here. Try the following and build them into your script.
SELECT REGEXP_REPLACE( 'Sample text with replace parameter %1%',
'\%[0-9]+\%',
'db_size' )
FROM DUAL
and
SELECT REGEXP_REPLACE( 'Sample text with replace parameter 1%',
'\%[0-9]+\%',
'db_size' )
FROM DUAL
The pattern is pretty simple; look for patterns where a '%' is followed by 1 or more numbers followed by a '%'.
The only issue here will be if you have more than one replacement to make in each string and each replacement is different. In that case you will need to loop round the string each time replacing the next parameter. To do this add the position and occurrence parameters to REGEXP_REPLACE after the replacement string, e.g.
REGEXP_REPLACE( 'Sample text with replace parameter %88888888888%','\%[0-9]+\%','db_size',0,1 )
You are getting the error because at parameters(to_number(p, '99')). Can you please check the value of p?
Also, if the p=90 then then REPLACE will not try to replace "90%". It will replace "%90%". How have you been sure that it's trying to replace "90%"?

complex regex to identify parameters with brackets only

I'm trying to build a regex that will identify parameters inside brackets and ignore pl/sql comments (single line --, and multiple lines /* */)
For instance:
create or replace table_name ---sfjdslkfjslkfjslkfjdsfsdf
**(var1 in out number, var2 number)**
/* sdfls
sfdsd jfs
sfs f
sd f
sfsf */
AS
BEGIN
(var1 in out number, var2 number) should be matched only. It should also account for cases where:
There are not comments (single or multiple lines)
There are only single line comments either before or after the parameters
There are only multiple line comments either before or after the parameters
There are no parameters
Assumptions:
Parameters are always enclosed in brackets ()
Procedures can sometimes have no parameters but have comments (either single or multiple line comments) before the AS BEGIN clause
Procedures start with create or replace table_name
We're only interested to read the until the AS BEGIN clause
In other words, I need to find the index of the first opening bracket '(' that is outside any comments (single or multiple lines) and that comes before the AS BEGIN clause.
UPDATE:
I have managed to match the comments using the following regex:
(?:\/\*(?:[\s\S]*?)\*\/)|(?:\-\-(?:.*)$)
For instance here it will match all the comments:
create or replace table_name
-- sdlfksl kjs slkjslds js
/* lsdjfdkj
s fskjfs
sf sf
sdf;;''
sfs fs
*/
(hello number, var2 number)
--sdflksf
/*sl --sdflks s kdjfls())({fsfs */
AS
BEGIN
I can do this now in Java to identify the first opening bracket outside of any matching group. However it would be easier if I could just ignore the matching group and match the one parameters in the brackets only instead.
EDIT
This is not asking for a solution with pl/sql or sqlplus or whatever. I have a few pl/sql procedures stored in files that I need to modify and add new parameters to. I'm using Java to do that and inside Java im using a combination of loops and regexs.
Not sure why you want to reinvent the wheel. In Oracle, you could use the *_ARGUMENTS view to get the list of all the arguments.
For example,
SQL> CREATE OR REPLACE
2 PROCEDURE p(
3 in_date DATE,
4 out_text VARCHAR2)
5 AS
6 BEGIN
7 NULL;
8 END;
9 /
Procedure created.
SQL>
SQL> SELECT object_name,
2 package_name,
3 argument_name,
4 position,
5 data_type
6 FROM USER_ARGUMENTS
7 WHERE OBJECT_NAME = 'P';
OBJECT_NAM PACKAGE_NA ARGUMENT_NAME POSITION DATA_TYPE
---------- ---------- --------------- ---------- ---------
P OUT_TEXT 2 VARCHAR2
P IN_DATE 1 DATE
SQL>
I'm unfamiliar with Oracle's flavor of regexes, but I made a pattern that works in most regex-engines. I know MySQL has a wildly different syntax so this is quite possibly also the case in Oracle, but the general idea is this:
create or replace table_name(?:(?!AS\s+BEGIN)(?:--[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\/|(?!\/\*|--)[^(]))*\(([^)]+)
Debuggex Demo
It's a "match this except in contexts A, B or C" approach also used here and explained in depth in this SO answer. Combined with the requirement that whatever we're matching isn't followed by AS\s+BEGIN.
You can see in the Debuggex Demo that the correct part between brackets is matched in capture group 1. As well, the 2nd create doesn't have parameters, and indeed nothing is matched (even though there's brackets in comments and after the AS BEGIN.
The task at hand is porting this to Oracle's regex flavor.

Why would regex to separate filename from extension not work in ColdFusion?

I'm trying to retrieve a filename without the extension in ColdFusion. I am using the following function:
REMatchNoCase( "(.+?)(\.[^.]*$|$)" , "Doe, John 8.15.2012.docx" );
I would like this to return an array like: ["Doe, John 8.15.2012","docx"]
but instead I always get an array with one element - the entire filename:["Doe, John 8.15.2012.docx"]
I tried the regex string above on rexv.org and it works as expected, but not on ColdFusion. I got the string from this SO question: Regex: Get Filename Without Extension in One Shot?
Does ColdFusion use a different syntax? Or am I doing something wrong?
Thanks.
Why you're not getting expected results...
The reason you are getting a one-item array with the whole filename is because your pattern matches the entire filename, and matches once.
It is capturing the two groups, but rematch returns arrays of matches, not arrays of the captured groups, so you don't see those groups.
How to solve the problem...
If you are dealing with simple files (i.e. no .htaccess or similar), then the simplest solution is to just use...
ListLast( filename , '.' )
....to get only the file extension and to get the name without extension you can do...
rematch( '.+(?=\.[^.]+$)' , filename )
This uses a lookahead to ensure there is a . followed by at least one non-. at the end of the string, but (since it's a lookahead) it is excluded from the match (so you only get the pre-extension part in your match).
To deal with non-extensioned files (e.g. .htaccess or README) you can modify the above regex to .+(?=(?:\.[^.]+)?$) which basically does the same thing except making the extension optional. However, there isn't a trivial way to get update the ListLast method for these (guess you'd need to check len(extension) LT len(filename)-1 or similar).
(optional) Accessing captured groups...
If you want to get at the actual captured groups, the closest native way to do this in CF is using the refind function, with the fourth argument set to true - however, this only gives you positions and lengths - requiring that you use mid to extract them yourself.
For this reason (amongst many others), I've created an improved regex implementation for CF, called cfRegex, which lets you return the group text directly (i.e. no messing around with mid).
If you wanted to use cfRegex, you can do so with your original pattern like so:
RegexMatch( '(.+?)(\.[^.]*$|$)' , filename , 1 , 0 , 'groups' )
Or with named arguments:
RegexMatch( pattern='(.+?)(\.[^.]*$|$)' , text=filename , returntype='groups' )
And you get returned an array of matches, within each element being an array of the captured groups for that match.
If you're doing lots of regex work dealing with captured groups, cfRegex is definitely better than doing it with CF's re methods.
If all you care about is getting the extension and/or the filename with extension excluded then the previous examples above are sufficient.
#Peter's response is great, however the approach is perhaps a bit longer-winded than necessary. One can do this with reMatch() with a slight tweak to the regex.
<cfscript>
param name="URL.filename";
sRegex = "^.+?(?=(?:\.[^.]+?)?$)";
aMatch = reMatch(sRegex, URL.filename);
writeDump(aMatch);
</cfscript>
This works on the following filename patterns:
foo.bar
foo
.htaccess
John 8.15.2012.docx
Explanation of the regex:
^ From the beginning of the string
.+? One or more (+) characters (.), but the fewest (?) that will work with the rest of the regex. This is the file name.
(?=) Look ahead. Make sure the stuff in here appears in the string, but don't actually match it. This is the key bit to NOT return any file extension that might be present.
(?: Group this stuff together, but don't remember it for a back reference.
. A dot. This is the separator between file name and file extension.
[^.]+? One or more (+) single ([]) non-dot characters (^.), again matching the fewest possible (?) that will allow the regex as a whole to work.
? (This is the one after the (?:) group). Zero or one of those groups: ie: zero or one file extensions.
$ To the end of the string
I've only tested with those four file name patterns, but it seems to work OK. Other people might be able to finetune it.
A few more ways of achieving the same result. They all execute in roughly the same amount of time.
<cfscript>
str = 'Doe, John 8.15.2012.docx';
// sans regex
arr1 = [
reverse( listRest( reverse( str ), '.' ) ),
listLast( str, '.' )
];
// using Java String lastIndexOf()
arr2 = [
str.substring( 0, str.lastIndexOf( '.' ) ),
str.substring( str.lastIndexOf( '.' ) + 1 )
];
// using listToArray with non-filename safe character replace
arr3 = listToArray( str.replaceAll( '\.([^\.]+)$', '|$1' ), '|' );
</cfscript>

REGEXP_EXTRACT () every word except ‘,’ in a field

I’d like to select country except ‘,’ from a data field which looks like this
Japan,Singapore,Italy,France
and my Code looks like this REGEXP_EXTRACT(country,'([^,]*)'), unfortunately, it works but only the country at the first was selected. How can I code it to select it all?
I slightly changed the RegEx to ([^,]+) to make the country name at least one digit. Using * creates empty matches so that every other match contains the country name. (Example)
Take a look at the fixed example here.
Important is the /g tag in the end to make the RegEx match globally.
If you are looking to extract all the characters except , then it could be achieved using either of the the REGEXP_REPLACE Calculated Fields below:
1) Replace , with (space)
REGEXP_REPLACE(country, ",", " ")
2) Remove ,
REGEXP_REPLACE(country, ",", "")
Google Data Studio Report and a GIF to elaborate: