Regexp expression from Oracle SQL to Big Query - regex

I previously had help here with a regular expression in Oracle SQL, which worked great. However, our place is converting to BigQuery and the regexp does not seem to work anymore.
In my tables, I have the following data:
WC 12/10 change FC from 24 to 32
W/C 12/10 change fc from 401 to 340
W/C12/10 18-26
This Oracle SQL split each comment to give me the before number (24), the after number (32), and the date (12/10).
cast(REGEXP_SUBSTR(Line_Comment, '((\d+ |\d+)(change )?(- |-|to |to|too|too )(\d+))', 1, 1, 'i',2) as Int) as Before,
cast(REGEXP_SUBSTR(Line_Comment, '((\d+ |\d+)(change )?(- |-|to |to|too|too )(\d+))', 1, 1, 'i', 5) as Int) as After,
REGEXP_SUBSTR(Line_Comment, '((\d+)(\/|-|.| )(\d+)(\/|-|.| )(\d+))|(\d+)(\/|-|.| )(\d+)', 1, 1, 'i') as WC_Date,
I totally understand that the comments are not consistent, so this may not always work, but it has worked more than 80% of the time and we are fine with that.
Since moving to BigQuery, I'm getting the error message below. In Oracle the columns were VARCHAR, but after the migration to BigQuery they are now STRING. Could this be the reason it's broken? Can anyone help with this? This is way over my head.
No matching signature for function REGEXP_SUBSTR for argument types:
STRING, STRING, INT64, INT64, STRING, INT64. Supported signatures:
REGEXP_SUBSTR(STRING, STRING, [INT64], [INT64]); REGEXP_SUBSTR(BYTES,
BYTES, [INT64], [INT64]) at [69:12]

Since Google BigQuery's REGEXP_SUBSTR doesn't support the match_param and subexpr parameters of Oracle's REGEXP_SUBSTR, you need to modify your regexes to take advantage of the fact that:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group.
So for each value you are trying to extract, you need to make that the only capturing group in the regex. Use raw string literals (r'...') so that BigQuery keeps the backslashes, and (?i) in place of Oracle's 'i' match parameter. For the date you want the whole match, so that regex should have no capturing group at all:
cast(REGEXP_SUBSTR(Line_Comment, r'(?i)(?:(\d+ |\d+)(?:change )?(?:- |-|to |to|too|too )(?:\d+))', 1, 1) as Int) as Before,
cast(REGEXP_SUBSTR(Line_Comment, r'(?i)(?:(?:\d+ |\d+)(?:change )?(?:- |-|to |to|too|too )(\d+))', 1, 1) as Int) as After,
REGEXP_SUBSTR(Line_Comment, r'(?:\d+(?:\/|-|.| )\d+(?:\/|-|.| )\d+)|(?:\d+(?:\/|-|.| )\d+)', 1, 1) as WC_Date,
Note you can substantially simplify your regexes as below:
(\d+) ?(?:change )?(?:-|too?) ?(?:\d+)
(?:\d+) ?(?:change )?(?:-|too?) ?(\d+)
(?:\d+)(?:[\/.-](?:\d+)){1,2}
Regex demos on regex101: numbers, date
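To check the simplified patterns against the sample comments without touching the warehouse, a minimal sketch in Python may help. It assumes Python's re module behaves like BigQuery's regex engine for these particular patterns, and the function name parse_comment is made up for illustration.

```python
import re

# The simplified patterns from above. Each of the first two has exactly one
# capturing group; the date pattern has none, so the whole match is used.
BEFORE = re.compile(r"(\d+) ?(?:change )?(?:-|too?) ?(?:\d+)", re.IGNORECASE)
AFTER = re.compile(r"(?:\d+) ?(?:change )?(?:-|too?) ?(\d+)", re.IGNORECASE)
WC_DATE = re.compile(r"\d+(?:[/.-]\d+){1,2}")


def parse_comment(comment):
    """Return (before, after, wc_date); an item is None when nothing matches."""
    before = BEFORE.search(comment)
    after = AFTER.search(comment)
    wc_date = WC_DATE.search(comment)
    return (
        int(before.group(1)) if before else None,
        int(after.group(1)) if after else None,
        wc_date.group(0) if wc_date else None,
    )


samples = [
    "WC 12/10 change FC from 24 to 32",
    "W/C 12/10 change fc from 401 to 340",
    "W/C12/10 18-26",
]
for sample in samples:
    print(sample, "->", parse_comment(sample))
```

Note that the date pattern returns the first digit group joined by /, . or -, so a comment without a date but with a range such as 18-26 would report that range as the date.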

Based on the sample data you provided in the comment section, you can try below query:
with t1 as (
select 'WC 12/10 change FC from 24 to 32' as Comment
union all select 'W/C 12/10 change fc from 401 to 340' as Comment,
union all select 'W/C12/10 18-26' as Comment
)
select Comment,
regexp_extract(t1.Comment, r'(\d+\/\d+)') as WC,
regexp_extract(t1.Comment, r'.+\s(\d{1,3})[\s|\-]') as Before,
regexp_extract(t1.Comment, r'.+[\sto\s|\-](\d{1,3})$') as After
from t1

Consider below super simple approach
select Comment,
format('%s/%s', arr[offset(0)], arr[safe_offset(1)]) as wc,
arr[safe_offset(2)] as before,
arr[safe_offset(3)] as after
from your_table, unnest([struct(regexp_extract_all(Comment, r'\d+') as arr)])
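The positional idea behind this query (take every run of digits and assign meaning by position) can be sketched in Python as follows. This is only an illustration: split_comment is a hypothetical helper, and padding with None is an assumption meant to roughly mirror safe_offset returning NULL.

```python
import re


def split_comment(comment):
    """Return (wc, before, after) from the digit runs in comment, by position."""
    nums = re.findall(r"\d+", comment)
    # Pad to four entries so that missing positions come back as None.
    nums = nums + [None] * max(0, 4 - len(nums))
    if nums[0] is not None and nums[1] is not None:
        wc = f"{nums[0]}/{nums[1]}"
    else:
        wc = None
    return wc, nums[2], nums[3]


print(split_comment("W/C12/10 18-26"))
print(split_comment("WC 12/10 change 2 FC from 24 to 32"))
```

The second call shows the trade-off of the positional approach: any extra number in the comment shifts the before and after values.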

Related

Substitute for Function STUFF (SQL Server) in AWS redshift

I have to replace the first 3 digits of a column with a fixed 3-digit prefix (123).
This is working SQL Server code (it does not work on AWS Redshift).
Code:
Select
Stuff (ColName,1,3,'123')as NewColName
From DataBase.dbo.TableName
Example 1: input 8010001802000000000209092396, output 1230001802000000000209092396
Example 2: input 555209092396, output 123209092396
It should replace the first 3 digits with 123 irrespective of the value's length.
Please advise on anything that is supported in AWS Redshift.
So far I have been trying SUBSTRING and REPLACE.
I see that AWS RedShift was based on an old version of Postgres, and I looked up the SUBSTRING function for you (https://docs.aws.amazon.com/redshift/latest/dg/r_SUBSTRING.html), which is pretty forgiving of its argument values.
In this sample in Transact-SQL, and as documented for RedShift, the third argument of SUBSTRING can be much longer than the actual strings without causing an error. In Transact-SQL, even the second argument is "forgiving" if it starts after the end of the actual string:
;
WITH RawData AS
(SELECT * FROM (VALUES ('8010001802000000000209092396'),
('555209092396'),
('AB')
) AS X(InputString)
)
SELECT InputString, '123' + SUBSTRING(InputString, 4, 1000) AS OutputString
FROM RawData
InputString                   OutputString
8010001802000000000209092396  1230001802000000000209092396
555209092396                  123209092396
AB                            123
As it appears that the concatenation operator in Redshift is ||, I think your expression will be very close to:
'123' || SUBSTRING(InputString, 4, 1000)
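The same "fixed prefix plus everything from the fourth character on" logic can be checked quickly outside the database. The sketch below is Python, not Redshift SQL, and replace_prefix is a made-up name; slicing past the end of a short string returns an empty string, which mirrors the forgiving SUBSTRING behaviour described above.

```python
def replace_prefix(value, prefix="123"):
    """Replace the first three characters of value with prefix."""
    # value[3:] is '' when value has three or fewer characters.
    return prefix + value[3:]


for value in ["8010001802000000000209092396", "555209092396", "AB"]:
    print(value, "->", replace_prefix(value))
```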
Got this and it worked
--Using Substring and concat
Select
cast('123'+substring(ColName,4,LEN(ColName)) as numeric (28)) as NewColName
From DataBase.dbo.TableName

replace expression format xx-xx-xxxx_12345678

IDENTIFIER
31-03-2022_13636075
01-04-2022_13650262
04-04-2022_13663174
05-04-2022_13672025
20220099001
11614491_R
10781198
00000000000
11283627_P
11614491_R
-1
How can I remove (only) the "XX-XX-XXXX_" part from certain values of a column in SSIS WITHOUT affecting values that don't have this format? For example, "21-05-2022_12345678" should become "12345678", but the other values should not be affected. These are just examples of the many rows in this column, so I want only the values that have this format to be changed.
SELECT REVERSE(substring(REVERSE('09-03-2022_13481330'),0,CHARINDEX('_',REVERSE('09-03-2022_13481330'),0)))
result
13481330
But this also affects other values. Also, this is in SSMS, not SSIS, because I am not sure how to translate this expression into SSIS code.
Update: the corrected code in SSIS is as follows:
(FINDSTRING(IDENTIFIER,"__-__-____[_]",1) == 1) ? SUBSTRING(IDENTIFIER,12,LEN(IDENTIFIER) - 11) : IDENTIFIER
Do you have access to the SQL source? You can do this in the SQL by using LIKE and crafting a match pattern with the single-character wildcard _. Please see the example below:
DECLARE @Value VARCHAR(50) = '09-03-2022_13481330'
SELECT CASE WHEN @Value LIKE '__-__-____[_]%' THEN
SUBSTRING(@Value,12,LEN(@Value)-11) ELSE @Value END
Please see the Microsoft documentation on LIKE and single-character wildcards.
If you don't have access to the source SQL it gets a bit more tricky, as you might need to use a regex in a script task, or maybe there is an expression you can apply.
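For the regex route, the matching rule itself is small. The sketch below shows it in Python purely to illustrate the pattern; an SSIS script task would use .NET's regex classes instead, and strip_date_prefix is a hypothetical name. Unlike the LIKE pattern, which accepts any characters in the date positions, this version insists on digits.

```python
import re

# Two digits, dash, two digits, dash, four digits, underscore, at the start.
DATE_PREFIX = re.compile(r"^\d{2}-\d{2}-\d{4}_")


def strip_date_prefix(identifier):
    """Drop a leading DD-MM-YYYY_ prefix; leave every other value untouched."""
    return DATE_PREFIX.sub("", identifier, count=1)


for value in ["31-03-2022_13636075", "20220099001", "11614491_R", "-1"]:
    print(value, "->", strip_date_prefix(value))
```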

How to find all the source lines containing desired table names from user_source by using 'regexp'

For example, we have a large database that contains lots of Oracle packages, and now we want to see where a specific table is referenced in the source code. The source code is stored in the user_source table and our desired table is called 'company'.
Normally, I would like to use:
select * from user_source
where upper(text) like '%COMPANY%'
This will return all words containing 'company', like
121 company cmy
14 company_id, idx_name %% end of coding
453 ;companyname
1253 from db.company.company_id where
989 using company, idx, db_name,
So how can I make this result more intelligent, using a regular expression to return only the source lines that contain a meaningful table name (that is, a table name as the compiler would see it)?
Normally the matched word may be adjacent to characters like . ; , '' "" but not _
Can anyone make this work?
To find company as a "whole word" with a regular expression:
SELECT * FROM user_source
WHERE REGEXP_LIKE(text, '(^|\s)company(\s|$)', 'i');
The third argument of i makes the REGEXP_LIKE search case-insensitive.
As far as ignoring the characters . ; , '' "", you can use REGEXP_REPLACE to turn them into spaces before doing the comparison, so that a name such as db.company still counts as a whole word:
SELECT * FROM user_source
WHERE REGEXP_LIKE(REGEXP_REPLACE(text, '[.;,''"]', ' '), '(^|\s)company(\s|$)', 'i');
Addendum: The following query will also help locate table references. It won't give the source line, but it's a start:
SELECT *
FROM user_dependencies
WHERE referenced_name = 'COMPANY'
AND referenced_type = 'TABLE';
If you want to identify the objects that refer to your table, you can get that information from the data dictionary:
select *
from all_dependencies
where referenced_owner = 'DB'
and referenced_name = 'COMPANY'
and referenced_type = 'TABLE';
You can't get the individual line numbers from that, but you can then either look at user_source or use a regexp on the specific source code, which would at least reduce false positives.
SELECT * FROM user_source
WHERE REGEXP_LIKE(text,'([^_a-z0-9])company([^_a-z0-9])','i')
Thanks @Ed Gibbs, with a little trick this modified answer could be more intelligent.
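The effect of the "not an identifier character on either side" rule can be tried on the sample lines from the question. This is a Python sketch rather than Oracle SQL, and it adds ^ and $ alternatives (an assumption beyond the query above) so that a name at the very start or end of a line is also found.

```python
import re

# 'company' must not be directly preceded or followed by _, a letter or a digit.
TABLE_REF = re.compile(r"(?:^|[^_a-z0-9])company(?:$|[^_a-z0-9])", re.IGNORECASE)

source_lines = [
    "company cmy",
    "company_id, idx_name %% end of coding",
    ";companyname",
    "from db.company.company_id where",
    "using company, idx, db_name,",
]

hits = [line for line in source_lines if TABLE_REF.search(line)]
for line in hits:
    print(line)
```

Lines that only contain company_id or companyname are filtered out, while db.company and "company," are kept.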

How to Select Date Range using RegEx

I have date strings that look like so:
20120817110329
Which, as you can see, is formatted: YYYYMMDDHHMMSS
How would I select (using RegEx) dates that are between 7/15 and 8/20? Or what about 8/1 to 8/15?
I have this working if I want to select a range that doesn't involve more than one place, but it is very limited:
^2012081[0-7] //selects 8/10 to 8/17
Update
Never forget the obvious (as pointed out by Wiseguy below): one can simply look for a range between 20120715000000 and 20120820235959.
Since you're just querying a database field that contains these values, you could simply check for a value between 20120715000000 and 20120820235959.
If you still want the regex, it ain't pretty, but this does it:
^20120(7(1[5-9]|2\d|3[01])|8([0-1]\d|20))\d{6}$
reFiddle example
You basically have to account for each possible range by hand. The trailing \d{6} covers the HHMMSS part.
^20120
(
7
(
1[5-9]
|2\d
|3[01]
)
|
8
(
[0-1]\d
|20
)
)
\d{6}$
I think this should work:
^2012(07(1[5-9]|[2-3][0-9])|08([0-1][0-9]|20))
Although the other answers are pretty much the same...
You can check this for more info.
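One way to gain confidence in a hand-built range regex is to compare it against the plain range check for every day of the year. The Python sketch below does that for 7/15 to 8/20. It assumes 14-character YYYYMMDDHHMMSS strings, so the pattern ends in \d{6}, and the sample time of 11:03:29 is arbitrary.

```python
import re
from datetime import datetime, timedelta

# Regex for 2012-07-15 through 2012-08-20, followed by HHMMSS.
IN_RANGE = re.compile(r"^20120(7(1[5-9]|2\d|3[01])|8([01]\d|20))\d{6}$")


def in_range_by_compare(stamp):
    """Plain string comparison; works because the format sorts chronologically."""
    return "20120715000000" <= stamp <= "20120820235959"


# One timestamp per day of 2012 (a leap year, hence 366 days).
start = datetime(2012, 1, 1, 11, 3, 29)
stamps = [(start + timedelta(days=i)).strftime("%Y%m%d%H%M%S") for i in range(366)]

mismatches = [s for s in stamps if bool(IN_RANGE.match(s)) != in_range_by_compare(s)]
print("mismatches:", mismatches)
```

An empty mismatch list means the regex and the comparison agree for every tested day, which is a useful check whenever the range is edited by hand.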

SQL and regular expression to check if string is a substring of larger string?

I have a database filled with some codes like
EE789323
990
78000
These numbers are ALWAYS endings of a larger code. Now I have a function that needs to check if the larger code contains the subcode.
So if I have codes 90 and 990 and my full code is EX888990, it should match both of them.
However I need to do it in the following way:
SELECT * FROM tableWithRecordsWithSubcode
WHERE subcode MATCHES [reg exp with full code];
Is a regular expression like this even possible?
EDIT:
To clarify the issue I'm having, I'm not using SQL here. I just used that to give an example of the type of query I'm using.
In fact I'm using iOS with CoreData, and I need a predicate to fetch me only the records that match.
In the way that is mentioned below.
Given the observations from a comment:
Do you have two tables, one called tableWithRecordsWithSubcode and another that might be tableWithFullCodeColumn? So the matching condition is in part a join - you need to know which subcodes match any of the full codes in the second table? But you're only interested in the information in the tableWithRecordsWithSubcode table, not in which rows it matches in the other table?
and the laconic "you're correct" response, then we have to rewrite the query somewhat.
SELECT DISTINCT S.*
FROM tableWithRecordsWithSubcode AS S
JOIN tableWithFullCodeColumn AS F
ON F.Fullcode ...ends-with... S.Subcode
or maybe using an EXISTS sub-query:
SELECT S.*
FROM tableWithRecordsWithSubcode AS S
WHERE EXISTS(SELECT * FROM tableWithFullCodeColumn AS F
WHERE F.Fullcode ...ends-with... S.Subcode)
This uses a correlated sub-query but avoids the DISTINCT operation; it may mean the optimizer can work more efficiently.
That just leaves the magical 'X ...ends-with... T' operator to be defined. One possible way to do that is with LENGTH and SUBSTR. The version below assumes that SUBSTR positions are 1-based, as they are in most DBMS; SUBSTR does not behave the same way everywhere, so you may have to tinker with this (possibly adding a third argument, LENGTH(s.subcode)):
LENGTH(f.fullcode) >= LENGTH(s.subcode) AND
SUBSTR(f.fullcode, LENGTH(f.fullcode) - LENGTH(s.subcode) + 1) = s.subcode
This leads to two possible formulations:
SELECT DISTINCT S.*
FROM tableWithRecordsWithSubcode AS S
JOIN tableWithFullCodeColumn AS F
ON LENGTH(F.Fullcode) >= LENGTH(S.Subcode)
AND SUBSTR(F.Fullcode, LENGTH(F.Fullcode) - LENGTH(S.Subcode) + 1) = S.Subcode;
and
SELECT S.*
FROM tableWithRecordsWithSubcode AS S
WHERE EXISTS(
SELECT * FROM tableWithFullCodeColumn AS F
WHERE LENGTH(F.Fullcode) >= LENGTH(S.Subcode)
AND SUBSTR(F.Fullcode, LENGTH(F.Fullcode) - LENGTH(S.Subcode) + 1) = S.Subcode);
This is not going to be a fast operation; joins on computed results such as required by this query seldom are.
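The ends-with test itself is easy to get wrong by one position, so here is a small Python sketch of the same check on the codes from the question. The function names are hypothetical; the second function spells out the index arithmetic that a 1-based SUBSTR would need, translated to Python's 0-based slicing.

```python
def matching_subcodes(full_code, subcodes):
    """Return the subcodes that full_code ends with."""
    return [sub for sub in subcodes if full_code.endswith(sub)]


def ends_with_by_slice(full_code, sub):
    """Same test via explicit slicing.

    The 0-based start index is len(full_code) - len(sub); a 1-based SUBSTR
    would start at that value plus one.
    """
    if len(full_code) < len(sub):
        return False
    return full_code[len(full_code) - len(sub):] == sub


codes = ["EE789323", "990", "78000", "90"]
print(matching_subcodes("EX888990", codes))
```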
I'm not sure why you think that you need a regular expression. Just use the charindex function, which takes the string to find first and the string to search second:
select something
from table
where charindex(subcode, code) <> 0
Edit:
To find strings at the end, you can create a pattern with the % wildcard from the subcode:
select something
from table
where code like '%' + subcode