Using Regex to extract numeric values in Tableau - regex

I am trying to pull out the numeric values (10004, 12245, 13456) from the following IDs:
10004a,
12v245, and
13456n
I can get the correct ID numbers with the exception of 12v245 ID, using the following regex code:
REGEXP_EXTRACT([ID], '([0-9]+)')
The 12v245 ID is only returning the the first two numbers. What am I missing in my code?

Your issue is that the function REGEXP_EXTRACT in Tableau requires exactly one capturing group.
The function [0-9]+ returns a capturing group per block of numbers and as the ID 12v245 has a letter in between the string of numbers it returns two capturing groups i.e. the 12 and then the 245.
The workaround for this is to use a nested replace as follows:
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE([ID], '[\D]+',"")
, '[\D]+' , "")
, '[\D]+' , "")
Depending on the nature of your data you may want to add more replaces.
This issue is documented on the Tableau community so feel free to vote up for a better fix: https://community.tableau.com/ideas/4975#

Related

How to split a string in db2?

I've some URL's in my cas_fnd_dwd_det table,
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf
www.casiac.net/fnds/casi/as.pdf
www.casiac.net/fnds/casi/vindq.pdf
www.casiac.net/fnds/CASI/mnip.pdf
how do i copy the letters between last '/' and '.pdf' to another column
expected outcome
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf qnxp
www.casiac.net/fnds/casi/as.pdf as
www.casiac.net/fnds/casi/vindq.pdf vindq
www.casiac.net/fnds/CASI/mnip.pdf mnip
the below URL's are static
www.casiac.net/fnds/CASI/
www.casiac.net/fnds/casi/
Advise, how do i select the codes between last '/' and '.pdf' ?
I would recommend to take a look at REGEXP_SUBSTR. It allows to apply a regular expression. Db2 has string processing functions, but the regex function may be the easiest solution. See SO question on regex and URI parts for different ways of writing the expression. The following would return the last slash, filename and the extension:
SELECT REGEXP_SUBSTR('http://fobar.com/one/two/abc.pdf','\/(\w)*.pdf' ,1,1)
FROM sysibm.sysdummy1
/abc.pdf
The following uses REPLACE and the pattern is from this SO question with the pdf file extension added. It splits the string in three groups: everything up to the last slash, then the file name, then the ".pdf". The '$1' returns the group 1 (groups start with 0). Group 2 would be the ".pdf".
SELECT REGEXP_REPLACE('http://fobar.com/one/two/abc.pdf','(?:.+\/)(.+)(.pdf)','$1' ,1,1)
FROM sysibm.sysdummy1
abc
You could apply LENGTH and SUBSTR to extract the relevant part or try to build that into the regex.
For older Db2 versions than 11.1. Not sure if it works for 9.5, but definitely should work since 9.7.
Try this as is.
with cas_fnd_dwd_det (casi_imp_urls) as (values
'www.casiac.net/fnds/CASI/qnxp.pdf'
, 'www.casiac.net/fnds/casi/as.pdf'
, 'www.casiac.net/fnds/casi/vindq.pdf'
, 'www.casiac.net/fnds/CASI/mnip.PDF'
)
select
casi_imp_urls
, xmlcast(xmlquery('fn:replace($s, ".*/(.*)\.pdf", "$1", "i")' passing casi_imp_urls as "s") as varchar(50)) cas_code
from cas_fnd_dwd_det

regular expression replace for SQL

I have to replace a string pattern in SQL with empty string, could anyone please suggest me?
Input String 'AC001,AD001,AE001,SA001,AE002,SD001'
Output String 'AE001,AE002
There are the 4 digit codes with first 2 characters "alphabets" and last two are digits. This is always a 4 digit code. And I have to replace all codes except the codes starting with "AE".
I can have 0 or more instances of "AE" codes in the string. The final output should be a formatted string "separated by commas" for multiple "AE" codes as mentioned above.
Here is one option calling regex_replace multiple times, eliminating the "not required" strings little by little in each iteration to arrive at the required output.
SELECT regexp_replace(
regexp_replace(
regexp_replace(
'AC001,AD001,AE001,SA001,AE002,SD001', '(?<!AE)\d{3},{0,1}', 'X','g'
),'..X','','g'
),',$','','g'
)
See Demo here
I would convert the list to an array, unnest that to rows then filter out those that should be kept and aggregate it back to a string:
select string_agg(t, ',')
from unnest(string_to_array('AC001,AD001,AE001,SA001,AE002,SD001',',') as x(t)
where x.t like 'AE%'; --<< only keep those
This is independent of the number of elements in the string and can easily be extended to support more complex conditions.
This is a good example why storing comma separated values in a single column is not such a good idea to begin with.

How can I use regular expressions to select text between commas?

I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
Below example is for BigQuery Standard SQL using super simple SPLIT approach
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need/want to use regexp here - use below
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note use of REGEXP_EXTRACT instead of REGEXP_REPLACE
You can play, test above options with dummy string from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output :
first_number second_number third_number first_number_re second_number_re third_number_re fifth_number_re
1.55763239875389 2.80373831775701 1.24610591900312 1.55763239875389 2.80373831775701 1.24610591900312 26.4797507788162
I don't know of a single regex replace which could be used to isolate a single number in your CSV string, because we need to remove things on both sides of the match, in general. But, we can chain together two calls to regex_replace. For example, if you wanted to target the third number in the CSV string, we could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', ''))
The pattern I am using to strip of the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
Demo
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching accross lines in m|multiline mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', '$1')
Explanation:
([^,]+(?:,|$){n} captures everything to the next comma or the end of the string n times
([^,]+(?:,|$))* captures the rest 0 or more times
^.*$ capture everything if we cannot match n times
And then, finally, we can reinsert the nth match using $1.

How can I replace multiple words "globally" using regexp_replace in Oracle?

I need to replace multiple words such as (dog|cat|bird) with nothing in a string where there may be multiple consecutive occurrences of a word. The actual code is to remove salutations and suffixes from a name. Unfortunately the garbage data I get sometimes contains "SNERD JR JR."
I was able to create a regular expression pattern that accomplishes my goal but only for the first occurrence. I implemented a stupid hack to get rid of the second occurrence, but I believe there has to be a better way. I just can't figure it out.
Here is my "hacked" code;
FUNCTION REMOVE_SALUTATIONS(IN_STRING VARCHAR2) RETURN VARCHAR2 DETERMINISTIC
AS
REGEX_SALUTATIONS VARCHAR2(4000) := '(^|\s)(MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR)(\.?)(\s|$)';
BEGIN
RETURN TRIM(REGEXP_REPLACE(REGEXP_REPLACE(IN_STRING,REGEX_SALUTATIONS,' '),REGEX_SALUTATIONS,''));
END REMOVE_SALUTATIONS;
I was actually proud that I was able to get this far, as regular expression are not very regular to me. All help is appreciated.
EDIT:
The default for regexp_replace based on my understanding is to do a global replace. But on the outside chance my DB is configured different I did try;
select REGEXP_REPLACE('SNERD JR JR','(^|\s)(MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR)(\.?)(\s|$)',' ',1,0) from dual;
and the results are;
SNERD JR
Use occurrence parameter of REGEXP_REPLACE function. The docs says:
occurrence is a nonnegative integer indicating the occurrence of the replace operation:
If you specify 0, then Oracle replaces all occurrences of the match.
If you specify a positive integer n, then Oracle replaces the nth occurrenc
https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions137.htm#SQLRF06302
It should look like:
...
REGEXP_REPLACE(IN_STRING,REGEX_SALUTATIONS,' ', 1,0 )
...

Hive regex split a string in to two different fields

my record is like:
0x0000110PPPP111KZY0 H123456789 XYZ 000000000000000000607532030000607532000060753203002014101707199999
I am searching for a regex where i can split first 3 char 0x0 in to one field in a hive table and the rest 000110PPPP111KZY0 in to second field and so on fixed length file and no delimiter.
I have no experience with hadoop or hive, however the following regex will work with what I believe you're looking for.
/(\dx\d)(.*)/ This will capture/split 0x0 into the first capture group, and everything afterwards into the second capture group. If you only want the numbers/letters following the 0x0 number (so none of the H123456789 or trailing words and letters), use /(\dx\d)([^ ]*)/
If I misunderstood what you're looking for, can you just clarify the exact section of that code you provided that you'd like to select and/or capture? Thanks!
Select
regexp_extract(data, '^(\\dx\\d).*', 1),
regexp_extract(data, '^\\dx\\d(.*)', 1)
from (Select '0x0000110PPPP111KZY0 ' as data) a;
This code returns a Hive row with two fields:
0x0 000110PPPP111KZY0