SQL Server function to eliminate replicated characters - regex

I wonder if there is an easy and efficient way in SQL Server 2005 to eliminate replicated characters in a string. Like converting
'ABBBCDEEFFFFG' to 'ABCDEFG'
It really sucks that SQL Server has such a poor string library and no ready-to-use regexp feature...

You can use the CLR functionality built into SQL Server 2005/2008 to get this done by .NET code.
MSDN magazine wrote about it in their February 2007 issue.
If this is not an acceptable solution, here is a UDF that will do the same. Mind you, it is about two orders of magnitude slower than the CLR solution.

YMMV. This appears to work for your string above, but not for 'ABBBCDEEBBBBG', because DISTINCT removes every repeat of a character, not only the adjacent ones.
DECLARE @Numbers TABLE (Num smallint NOT NULL PRIMARY KEY)
INSERT @Numbers (Num)
SELECT TOP 8000
    ROW_NUMBER() OVER (ORDER BY c1.NAME)
FROM
    sys.columns c1
DECLARE @Stuff TABLE (Seq varchar(100) NOT NULL PRIMARY KEY)
INSERT @Stuff (Seq) VALUES ('ABBBCDEEFFFFG') --works
SELECT
    Single
FROM
    (
    SELECT DISTINCT
        CAST(Single AS varchar(100))
    FROM
        @Numbers N
        CROSS APPLY
        (SELECT Seq, SUBSTRING(Seq, Num, 1) AS Single FROM @Stuff) S
    WHERE
        Num <= LEN(Seq)
    FOR XML PATH ('')
    ) foo(Single)

I know about the CLR solution, but as I said, I am neither responsible for nor authorized to implement it in the DB in question.
For this particular problem, I decided to write a very simple and kinda silly loop. I am afraid it won't be fast enough for millions of records, but anyways... I wish I could do this stuff in the application layer but I am bound to T-SQL here..
DECLARE @i int ; -- counter
DECLARE @input varchar(200) ;
SET @input = 'AAABCDEEFFBBBXYZSSSWWWNT'
IF LEN(@input) > 1
BEGIN
    DECLARE @unduplicated varchar(200) ;
    SET @unduplicated = SUBSTRING(@input,1,1) ;
    SET @i = 2 ;
    WHILE @i <= LEN(@input)
    BEGIN
        -- If current char is different from the last char, concatenate, else not
        IF SUBSTRING(@unduplicated, LEN(@unduplicated), 1) <> SUBSTRING(@input, @i, 1)
            SET @unduplicated = @unduplicated + SUBSTRING(@input, @i, 1) ;
        SET @i = @i + 1;
    END
END
SELECT @unduplicated AS unduplicated;
Result:
unduplicated
ABCDEFBXYZSWNT
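For comparison, the same run-collapsing idea can be sketched outside the database. This is only a minimal Python sketch, not part of the original answer, and the function name collapse_runs is made up for illustration. It keeps one character per run of identical adjacent characters, which is what the WHILE loop above does.

```python
from itertools import groupby


def collapse_runs(text: str) -> str:
    """Collapse each run of identical adjacent characters to one character."""
    # groupby yields one (key, group) pair per run of equal adjacent items
    return "".join(key for key, _ in groupby(text))


print(collapse_runs("AAABCDEEFFBBBXYZSSSWWWNT"))  # ABCDEFBXYZSWNT
print(collapse_runs("ABBBCDEEBBBBG"))  # ABCDEBG
```

Unlike the DISTINCT approach, the second B run in 'ABBBCDEEBBBBG' survives, because only adjacent repeats are dropped.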

Related

Count number of WHERE filters in SQL query using regex

Update: I've updated the test string to cover a case that I've missed.
I'm trying to count the number of WHERE filters in a query using regex.
So the general idea is to count the number of WHERE and AND keywords occurring in the query, while excluding any AND that comes after a JOIN and before a WHERE, and also excluding any AND inside a CASE WHEN clause.
For example, this query:
WITH cte AS (\nSELECT a,b\nFROM something\nWHERE a>10\n AND b<5)\n, cte2 AS (\n SELECT c,\nd FROM another\nWHERE c>10\nAND d<5)\n SELECT CASE WHEN c1.a=1\nAND c2.c=1 THEN 'yes' ELSE 'no' \nEND,c1.a,c1.b,c2.c,c2.d\nFROM cte c1\nINNER JOIN cte2 c2 ON c1.a = c2.c\nAND c1.b = c2.d\nWHERE c1.a<4 AND DATE(c1)>'2022-01-01'\nAND c2.c>6
-- FORMATTED FOR EASE OF READ. PLEASE USE LINE ABOVE AS REGEX TEST STRING
WITH cte AS (
SELECT a,b
FROM something
WHERE a>10
AND b<5
)
, cte2 AS (
SELECT c,d
FROM another
WHERE c>10
AND d<5
)
SELECT
CASE
WHEN c1.a=1 AND c2.c=1 THEN 'yes'
WHEN c1.a=1 AND c2.c=1 THEN 'maybe'
ELSE 'no'
END,
c1.a,
c1.b,
c2.c,
c2.d
FROM cte c1
INNER JOIN cte2 c2
ON c1.a = c2.c
AND c1.b = c2.d
WHERE c1.a<4
AND DATE(c1)>'2022-01-01'
AND c2.c>6
should return 7, which are:
WHERE a>10
AND b<5
WHERE c>10
AND d<5
WHERE c1.a<4
AND DATE(c1)>'2022-01-01'
AND c2.c>6
The portion AND c1.b = c2.d is not counted because it happens after JOIN, before WHERE.
The portion AND c2.c=1 is not counted because it is in a CASE WHEN clause.
I eventually plan to use this on PostgreSQL queries to count the number of filters that occur in all queries in a certain period.
I've tried searching around for an answer and trying it myself, but to no avail, hence looking for help here. Thank you in advance!
I try to stay away from lookarounds as they could be messy and too painful to use, especially with the fixed-width limitation of lookbehind assertion.
My proposed solution is to capture all scenarios in different groups, and then select only the group of interest. The undesired scenarios will still be matched, but will not be selected.
Group 1 - Starts with JOIN (undesired)
Group 2 - Starts with WHERE (desired)
Group 3 - Starts with CASE (undesired)
(JOIN.*?(?=$|WHERE|JOIN|CASE|END))|(WHERE.*?(?=$|WHERE|JOIN|CASE|END))|(CASE.*?(?=$|WHERE|JOIN|CASE|END))
Note: Feel free to replace WHERE|JOIN|CASE|END with any keywords you want to act as the 'stopper' words.
All scenarios, including the undesired ones, will be matched, but you need to select only Group 2.
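To show how the group selection could be applied, here is a minimal Python sketch. It reuses the pattern above unchanged; the function name count_where_filters, the use of re.DOTALL so that '.' can cross line breaks, and the word-boundary AND count are assumptions added for illustration, not part of the answer.

```python
import re

# The answer's pattern, unchanged. Group 2 is the WHERE scenario.
STOP = r"(?=$|WHERE|JOIN|CASE|END)"
PATTERN = re.compile(
    rf"(JOIN.*?{STOP})|(WHERE.*?{STOP})|(CASE.*?{STOP})",
    re.DOTALL,  # assumption: let '.' match newlines in the query text
)

# The one-line test string from the question, with real newlines.
QUERY = (
    "WITH cte AS (\nSELECT a,b\nFROM something\nWHERE a>10\n AND b<5)\n"
    ", cte2 AS (\n SELECT c,\nd FROM another\nWHERE c>10\nAND d<5)\n"
    " SELECT CASE WHEN c1.a=1\nAND c2.c=1 THEN 'yes' ELSE 'no' \n"
    "END,c1.a,c1.b,c2.c,c2.d\nFROM cte c1\n"
    "INNER JOIN cte2 c2 ON c1.a = c2.c\nAND c1.b = c2.d\n"
    "WHERE c1.a<4 AND DATE(c1)>'2022-01-01'\nAND c2.c>6"
)


def count_where_filters(query: str) -> int:
    """Count WHERE keywords plus the ANDs that follow them."""
    total = 0
    for match in PATTERN.finditer(query):
        where_part = match.group(2)
        if where_part is None:
            continue  # a JOIN or CASE segment: matched but not selected
        # one filter for the WHERE itself plus one per AND inside the segment
        total += 1 + len(re.findall(r"\bAND\b", where_part))
    return total


print(count_where_filters(QUERY))  # 7
```

The JOIN and CASE segments are still consumed by the regex, which is what keeps their ANDs out of the WHERE segments.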
You can try something like this:
WITH DataSource (parts) AS
(
SELECT REGEXP_MATCHES(
'WITH cte AS (SELECT a,b FROM something WHERE a>10 AND b<5)\n, cte2 AS (SELECT c,d FROM another WHERE c>10 AND d<5)\n SELECT c1.a,c1.b,c2.c,c2.d FROM cte c1 INNER JOIN cte2 c2 ON c1.a = c2.c AND c1.b = c2.d WHERE c1.a<4 AND c2.c>6',
E'(?= WHERE)[^)|;]+'
,'gmi'
)
)
SELECT SUM
(
(length(parts[1]) - length(REPLACE(parts[1], 'AND', ''))) / 3 -- counting ANDs
+ 1 -- for the where
)
FROM DataSource
The idea is to match the text after each WHERE keyword, and then simply count the ANDs and add one for the matched WHERE.
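The same counting logic can be tried outside PostgreSQL. The following is a rough Python sketch under the assumption that Python's re behaves like REGEXP_MATCHES for this pattern; the function name count_filters is hypothetical, and the query is the simplified string used in the answer.

```python
import re

# The simplified query string from the answer above.
QUERY = (
    "WITH cte AS (SELECT a,b FROM something WHERE a>10 AND b<5)\n"
    ", cte2 AS (SELECT c,d FROM another WHERE c>10 AND d<5)\n"
    " SELECT c1.a,c1.b,c2.c,c2.d FROM cte c1 INNER JOIN cte2 c2"
    " ON c1.a = c2.c AND c1.b = c2.d WHERE c1.a<4 AND c2.c>6"
)


def count_filters(query: str) -> int:
    """Sum (number of ANDs + 1) over every segment that starts at ' WHERE'."""
    # Each match starts at ' WHERE' and runs until ')', '|' or ';'
    parts = re.findall(r"(?= WHERE)[^)|;]+", query, flags=re.IGNORECASE)
    return sum(part.count("AND") + 1 for part in parts)


print(count_filters(QUERY))  # 6
```

Because a closing parenthesis ends a match, a filter that contains a function call, such as DATE(c1) in the question's test string, would cut its segment short, so this approach fits best when the WHERE clauses contain no parentheses.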

How to use listagg function in select query? [duplicate]

Would it be possible to construct SQL to concatenate column values from
multiple rows?
The following is an example:
Table A
PID
A
B
C
Table B
PID SEQ Desc
A 1 Have
A 2 a nice
A 3 day.
B 1 Nice Work.
C 1 Yes
C 2 we can
C 3 do
C 4 this work!
Output of the SQL should be -
PID Desc
A Have a nice day.
B Nice Work.
C Yes we can do this work!
So basically the Desc column of the output table is a concatenation of the Desc values from Table B, in SEQ order?
Any help with the SQL?
There are a few ways depending on what version you have - see the oracle documentation on string aggregation techniques. A very common one is to use LISTAGG:
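-- DESC is a reserved word in Oracle, so the real column needs another name (e.g. description) or a quoted identifier
SELECT pid, LISTAGG(Desc, ' ') WITHIN GROUP (ORDER BY seq) AS description
FROM B GROUP BY pid;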
Then join to A to pick out the pids you want.
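To make clear what LISTAGG ... WITHIN GROUP (ORDER BY seq) computes, here is a small Python sketch of the same grouping logic, using the Table B rows from the question. The names TABLE_B and listagg are made up for this illustration; it mimics the semantics only and says nothing about how Oracle implements it.

```python
from collections import defaultdict

# Rows of Table B from the question: (pid, seq, desc)
TABLE_B = [
    ("A", 1, "Have"), ("A", 2, "a nice"), ("A", 3, "day."),
    ("B", 1, "Nice Work."),
    ("C", 1, "Yes"), ("C", 2, "we can"), ("C", 3, "do"), ("C", 4, "this work!"),
]


def listagg(rows, delimiter=" "):
    """Group rows by pid, order each group by seq, and join the descriptions."""
    groups = defaultdict(list)
    for pid, seq, desc in rows:
        groups[pid].append((seq, desc))
    return {
        pid: delimiter.join(desc for _, desc in sorted(items))
        for pid, items in sorted(groups.items())
    }


for pid, description in listagg(TABLE_B).items():
    print(pid, description)
```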
Note: Out of the box, LISTAGG only works correctly with VARCHAR2 columns.
There's also an XMLAGG function, which works on versions prior to 11.2. Because WM_CONCAT is undocumented and unsupported by Oracle, it's recommended not to use it in production systems.
With XMLAGG you can do the following:
SELECT XMLAGG(XMLELEMENT(E,ename||',')).EXTRACT('//text()') "Result"
FROM employee_names
What this does is
put the values of the ename column (concatenated with a comma) from the employee_names table in an xml element (with tag E)
extract the text of this
aggregate the xml (concatenate it)
call the resulting column "Result"
With SQL model clause:
select pid
     , ltrim(sentence) sentence
  from ( select pid
              , seq
              , sentence
           from b
          model
                partition by (pid)
                dimension by (seq)
                measures (descr,cast(null as varchar2(100)) as sentence)
                ( sentence[any] order by seq desc
                  = descr[cv()] || ' ' || sentence[cv()+1]
                )
       )
 where seq = 1
/
P SENTENCE
- ---------------------------------------------------------------------------
A Have a nice day
B Nice Work.
C Yes we can do this work!
3 rows selected.
I wrote about this here. And if you follow the link to the OTN-thread you will find some more, including a performance comparison.
The LISTAGG analytic function was introduced in Oracle 11g Release 2, making it very easy to aggregate strings.
If you are using 11g Release 2 you should use this function for string aggregation.
Please refer to the URL below for more information about string concatenation.
http://www.oracle-base.com/articles/misc/StringAggregationTechniques.php
String Concatenation
As most of the answers suggest, LISTAGG is the obvious option. However, one annoying aspect of LISTAGG is that if the total length of the concatenated string exceeds 4000 characters (the limit for VARCHAR2 in SQL), the error below is thrown, which is difficult to manage in Oracle versions up to 12.1
ORA-01489: result of string concatenation is too long
A new feature added in 12cR2 is the ON OVERFLOW clause of LISTAGG.
The query including this clause would look like:
SELECT pid, LISTAGG(Desc, ' ' ON OVERFLOW TRUNCATE) WITHIN GROUP (ORDER BY seq) AS description
FROM B GROUP BY pid;
The above will restrict the output to 4000 characters but will not throw the ORA-01489 error.
These are some of the additional options of the ON OVERFLOW clause:
ON OVERFLOW TRUNCATE 'Contd..' : This will display 'Contd..' at the end of the string (the default is '...').
ON OVERFLOW TRUNCATE '' : This will display up to 4000 characters without any terminating string.
ON OVERFLOW TRUNCATE WITH COUNT : This will display the number of truncated values in parentheses after the terminating characters, e.g. '...(5512)'.
ON OVERFLOW ERROR : Use this if you expect the LISTAGG to fail with the ORA-01489 error (which is the default anyway).
For those who must solve this problem using Oracle 9i (or earlier), you will probably need to use SYS_CONNECT_BY_PATH, since LISTAGG is not available.
To answer the OP, the following query will display the PID from Table A and concatenate all the DESC columns from Table B:
SELECT pid, SUBSTR (MAX (SYS_CONNECT_BY_PATH (description, ', ')), 3) all_descriptions
FROM (
SELECT ROW_NUMBER () OVER (PARTITION BY pid ORDER BY pid, seq) rnum, pid, description
FROM (
SELECT a.pid, seq, description
FROM table_a a, table_b b
WHERE a.pid = b.pid(+)
)
)
START WITH rnum = 1
CONNECT BY PRIOR rnum = rnum - 1 AND PRIOR pid = pid
GROUP BY pid
ORDER BY pid;
There may also be instances where keys and values are all contained in one table. The following query can be used where there is no Table A, and only Table B exists:
SELECT pid, SUBSTR (MAX (SYS_CONNECT_BY_PATH (description, ', ')), 3) all_descriptions
FROM (
SELECT ROW_NUMBER () OVER (PARTITION BY pid ORDER BY pid, seq) rnum, pid, description
FROM (
SELECT pid, seq, description
FROM table_b
)
)
START WITH rnum = 1
CONNECT BY PRIOR rnum = rnum - 1 AND PRIOR pid = pid
GROUP BY pid
ORDER BY pid;
All values can be reordered as desired. Individual concatenated descriptions can be reordered by changing the ORDER BY inside the OVER (PARTITION BY ...) clause, and the list of PIDs can be reordered in the final ORDER BY clause.
Alternately: there may be times when you want to concatenate all the values from an entire table into one row.
The key idea here is using an artificial value for the group of descriptions to be concatenated.
In the following query, the constant string '1' is used, but any value will work:
SELECT SUBSTR (MAX (SYS_CONNECT_BY_PATH (description, ', ')), 3) all_descriptions
FROM (
SELECT ROW_NUMBER () OVER (PARTITION BY unique_id ORDER BY pid, seq) rnum, description
FROM (
SELECT '1' unique_id, b.pid, b.seq, b.description
FROM table_b b
)
)
START WITH rnum = 1
CONNECT BY PRIOR rnum = rnum - 1;
Individual concatenated descriptions can be reordered by changing the ORDER BY inside the OVER (PARTITION BY ...) clause.
Several other answers on this page have also mentioned this extremely helpful reference:
https://oracle-base.com/articles/misc/string-aggregation-techniques
LISTAGG delivers the best performance if sorting is a must (00:00:05.85):
SELECT pid, LISTAGG(Desc, ' ') WITHIN GROUP (ORDER BY seq) AS description
FROM B GROUP BY pid;
COLLECT delivers the best performance if sorting is not needed (00:00:02.90):
SELECT pid, TO_STRING(CAST(COLLECT(Desc) AS varchar2_ntt)) AS Vals FROM B GROUP BY pid;
COLLECT with ordering is a bit slower (00:00:07.08):
SELECT pid, TO_STRING(CAST(COLLECT(Desc ORDER BY Desc) AS varchar2_ntt)) AS Vals FROM B GROUP BY pid;
All other techniques were slower. Note that TO_STRING and varchar2_ntt are not Oracle built-ins; the queries assume a user-defined function and nested table type with those names.
Before you run a select query, run this:
SET SERVEROUT ON SIZE 6000
SELECT XMLAGG(XMLELEMENT(E,SUPLR_SUPLR_ID||',')).EXTRACT('//text()') "SUPPLIER"
FROM SUPPLIERS;
Try this code:
SELECT XMLAGG(XMLELEMENT(E,fieldname||',')).EXTRACT('//text()') "FieldNames"
FROM FIELD_MASTER
WHERE FIELD_ID > 10 AND FIELD_AREA != 'NEBRASKA';
In the select where you want your concatenation, call a SQL function.
For example:
select PID, dbo.MyConcat(PID)
from TableA;
Then for the SQL function:
create function dbo.MyConcat(@PID varchar(10))
returns varchar(1000)
as
begin
    declare @x varchar(1000);
    -- isnull(@x + ',', '') yields '' for the first row and 'previous,' afterwards
    select @x = isnull(@x + ',', '') + [Desc]
    from TableB
    where PID = @PID;
    return @x;
end
The Function Header syntax might be wrong, but the principle does work.

Date format substitution in PL/SQL. Example: from 5y 6m 20d to 050620

I am writing a query where I need to perform a date format transformation to meet the specified requirements.
In the database which I have to search, the date format looks like the one in the example: 5y 6m 10d with spaces in between and with optional digits (10y 30d; 1m 23d; 6m are also valid) and they are always ordered (first years, then month and then days).
The format transformation should be the following:
10y 6m 10d => 100610
1y 10m 1d => 011001
6m 2d => 000602
So that the output is always a 6-digit number.
I tried writing regular expressions within REGEXP_SUBSTR to isolate the tokens and then concatenate them together, along the lines of SELECT REGEXP_SUBSTR(text_source, '(\d+)*y') FROM database, and I also tried using the REGEXP_REPLACE function. Nevertheless, I am not able to transform each token to two digits without spaces, nor to replace one pattern with another; I can only replace the pattern with a fixed string.
Although I am able to output the separated tokens without spaces using the query above, I am not able to get the whole transformation. Is there any possibility of writing a regex and combining it with any of the PL/SQL functions in order to transform the dates listed above? I am also open to any other solutions not involving regex; I just thought it was sensible to make proper use of them here.
Here is a simple solution in SQL.
You get the values for year, month and day, e.g. with regexp_substr.
With nvl you set the value to 0 if it is null.
Then lpad it with 0 to two digits.
with tab as(
select '10y 6m 10d' as str from dual union all
select '1y 10m 1d ' as str from dual union all
select '6m 2d ' as str from dual
)
select lpad(nvl(y,0), 2,'0') ||lpad(nvl(m,0), 2,'0')|| lpad(nvl(d,0), 2,'0')
from (
select rtrim(regexp_substr(str, '[0-9]{1,2}y', 1),'y') as y
,rtrim(regexp_substr(str, '[0-9]{1,2}m', 1),'m') as m
,rtrim(regexp_substr(str, '[0-9]{1,2}d', 1),'d') as d
from tab
)
;
LPAD(N
------
100610
011001
000602
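The three steps (extract each token, default a missing one to 0, left-pad to two digits) can also be sketched in Python to check the expected outputs. This is an illustrative sketch only; the function name to_six_digits is hypothetical and the one-or-two-digit token pattern mirrors the '[0-9]{1,2}' used in the query above.

```python
import re


def to_six_digits(text: str) -> str:
    """Turn e.g. '1y 10m 1d' into '011001'; missing tokens become '00'."""
    parts = []
    for unit in "ymd":
        # one or two digits directly followed by the unit letter
        match = re.search(rf"(\d{{1,2}}){unit}", text)
        value = match.group(1) if match else "0"  # same role as nvl(..., 0)
        parts.append(value.zfill(2))              # same role as lpad(..., 2, '0')
    return "".join(parts)


for sample in ("10y 6m 10d", "1y 10m 1d ", "6m 2d "):
    print(to_six_digits(sample))
```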
I hope it works
declare
  myDate_ varchar2(50) := REPLACE('1y 10m 81d',' ','');
  -- default each part to '00' so a missing token still yields two digits
  year_ varchar2(50) := '00';
  month_ varchar2(50) := '00';
  day_ varchar2(50) := '00';
begin
  if instrb(myDate_,'y',1,1)>0 then
    year_ := lpad(regexp_substr(substr(myDate_,1,instrb(myDate_,'y',1,1)), '[^y]+',1 , 1),2,'0');
  end if;
  if instrb(myDate_,'m',1,1)>0 then
    month_ := lpad(regexp_substr(substr(myDate_,instrb(myDate_,'y',1,1)+1,instrb(myDate_,'m',1,1)), '[^m]+',1 , 1),2,'0');
  end if;
  if instrb(myDate_,'d',1,1)>0 then
    -- start after the last preceding unit letter, whether that is 'y' or 'm'
    day_ := lpad(regexp_substr(substr(myDate_,greatest(instrb(myDate_,'y',1,1),instrb(myDate_,'m',1,1))+1), '[^d]+',1 , 1),2,'0');
  end if;
  dbms_output.put_line(year_||month_||day_);
end;

Breaking down multiple Left join in multiple steps in Proc Sql

I have code that uses a lot of left joins across many tables. When I run it, it takes more than an hour and ends with a Sort Execution Failure error. So I am thinking of breaking the left joins down into multiple steps, but I am not sure how to do it and need your help.
The code is as follows:
Proc sql;
create table newlib.Final_test as
SELECT
POpener.Name as Client,
Popener.PartyId as Account_Number,
Case
When BalLoc.ConvertedRefNo NE '' then BalLoc.ConvertedRefNo
else BalLoc.Ourreferencenum
End as LC_Number,
BalLoc.OurReferenceNum ,
BalLoc.CnvLiabilityCode as Liability_Code,
POfficer.PartyID as Officer_Num,
POfficer.Name as Officer_Name,
POpener.ExpenseCode,
BalLoc.IssueDate as Issue_Date format=mmddyy10.,
BalLoc.ExpirationDate AS Expiry format=mmddyy10.,
BalLoc.LiabilityAmountBase as Total_LC_Balance,
Case
When BalLoc.Syndicated = 0 Then BalLoc.LiabilityAmountBase
else 0
End as SunTrust_Non_Syndicated_Exposure,
Case
When BalLoc.Syndicated = 1 and BalLoc.PartOutGroupPkey NE 0 Then
BalLoc.LiabilityAmountBase
else 0
End as SunTrust_Syndicated_Exposure,
Case
When BalLoc.Syndicated = 1 and BalLoc.PartOutGroupPkey NE 0 Then
(BalLoc.LiabilityAmountBase - (BalLoc.LiabilityAmountBase *
(PParty.ParticipationPercent/100)))
Else BalLoc.LiabilityAmountBase
End as SunTrust_Exposure,
Case
When BalLoc.Syndicated = 1 and BalLoc.PartOutGroupPkey <> 0 Then
(BalLoc.LiabilityAmountBase * PParty.ParticipationPercent/100)
Else 0
End as Exposure_Held_By_Other_Banks,
PBene.Name as Beneficiary_Trustee,
cat(put(input(POpener.ObligorNumber,best10.),z10.),put(input
(BalLoc.CommitmentNumber,best10.),Z10.)) as Key,
case
when BalLoc.BeneCusip2 NE ' ' then catx
('|',Balloc.BeneCusip,Balloc.BeneCusip2)
else BalLoc.BeneCusip
End as Cusip,
Case
when balLoc.OKtoExpire = 1 then '0'
when balLOc.OKtoExpire=0 and BalLoc.AutoExtTermDays NE 0 then put
(Balloc.AutoExtTermDays,z3.)
when balLoc.OKtoExpire=0 and BalLoc.AutoExtTermsMonth NE 0 then put
(balloc.AutoExtTermsMonth,z3.)
else '000'
End as Evergreen,
Case
when blf.AnnualRate NE 0 then put(blf.AnnualRate,z7.)
when blf.Amount NE 0 then cats('F',put(blf.amount,z7.))
else 'WAIVE'
End as Pricing
FROM BalLocPrimary BalLoc
Left JOIN Party POpener on POpener.Pkey = BalLoc.OpenerPkey
Left join PartGroup PGroup on BallOC.PartOutGroupPkey = PGroup.pKey
Left join PartParties PParty ON PGroup.pKey = PParty.PartGroupPkey and
PParty.ParticipationPercent > 0 and
PParty.combined in
(select PPartParties.All_combined
from PPartParties /*group by PartGroupPkey, PartyPkey*/)
Left Join MemExpenseCodes ExpCodes on POpener.ExpenseCode = ExpCodes.Code
Left JOIN Party PBene on PBene.Pkey = BalLoc.BenePkey
Left join Party POfficer on POfficer.Pkey = BalLoc.AccountOfficerPkey
left join maxfee on maxfee.LocPrimaryPkey = BalLoc.LocPrimaryPkey
left join BalLocFee BLF on BLF.Pkey = maxfee.pkey
Where BalLoc.LetterType not in ('STBA','EXPA', 'FEE',' ') and
BalLoc.LiabilityAmountBase > 0 and BalLoc.irdb = 1
;
quit;
Thank you,
Shankar
A few things I would suggest:
1, for each dataset that you reference, keep only the variables you need to join on, or which get used in the SELECT statement. E.g., from your Party dset, it looks like you only need the Pkey field and Name. Therefore when you make your join to that dset, you should use:
Left JOIN Party(keep=Pkey Name) PBene on PBene.Pkey = BalLoc.BenePkey
2, Push your WHERE statement into the FROM statement like so:
FROM BalLocPrimary(where=(LetterType not in ('STBA','EXPA', 'FEE',' ') and
LiabilityAmountBase > 0 and irdb = 1)) BalLoc
And make sure the conditions are in the order of most common to least (barring any index that might be on those 3 fields)
3, You are driving off the BalLocPrimary dataset, left joining to everything else. Is that what you really intend? Is it OK that your result set comes back without a Client or Account_Number? Left Joins can be computationally expensive, and the more you can minimize them, the better.
4, Joe asked about indexes on the join fields. You probably should have some. I have found myself referencing this SUGI paper regularly enough to bookmark it. Similarly, you could review the EXPLAIN PLAN from the query to see where it might be bottlenecking. Another SUGI paper would be a good start.
5, You're right that this could (should?) be broken up into multiple steps. That's a good intuition. However, the optimal breaks are going to be highly dependent on the underlying data, indexes, and the join paths, so it's hard to prescribe them from the other side of the screen. I think that second paper I linked could give you some good tips on optimization for your specific case.

oracle regular expression and MERGE

As an update to my previous question:
I have some newline-separated strings.
I need to insert each of those words into a table.
The new condition is that a word should be inserted if it does not exist, or otherwise its count should be incremented by 1 (as with MERGE).
But my current query just uses INSERT, so I've used the CONNECT BY LEVEL method without checking whether the value exists or not.
The logic is somewhat like:
if the word already EXISTS THEN
UPDATE my_table set w_count = w_count +1 where word = '...';
else
INSERT INTO my_table (word, w_count)
SELECT REGEXP_SUBSTR(i_words, '[^[:cntrl:]]+', 1 ,level),
1
FROM dual
CONNECT BY REGEXP_SUBSTR(i_words, '[^[:cntrl:]]+', 1 ,level) IS NOT NULL;
end if;
Try this
MERGE INTO my_table m
USING(WITH the_data AS (
SELECT 'a
bb
&
c' AS dat
FROM dual
)
SELECT regexp_substr(dat, '[^[:cntrl:]]+', 1 ,LEVEL) wrd
FROM the_data
CONNECT BY regexp_substr(dat, '[^[:cntrl:]]+', 1 ,LEVEL) IS NOT NULL) word_list
ON (word_list.wrd = m.word)
WHEN matched THEN UPDATE SET m.w_count = m.w_count + 1
WHEN NOT matched THEN insert(m.word,m.w_count) VALUES (word_list.wrd,1);
More details on MERGE here.
Sample fiddle
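The insert-or-increment behaviour that the MERGE expresses can be illustrated with a small Python sketch. The function name merge_words and the dict standing in for my_table are assumptions for illustration; the character class approximates '[^[:cntrl:]]+' for ASCII control characters only.

```python
import re


def merge_words(counts: dict, words_text: str) -> dict:
    """Insert each word with count 1, or add 1 if it is already present."""
    # Runs of non-control characters, like REGEXP_SUBSTR(..., '[^[:cntrl:]]+', 1, LEVEL)
    for word in re.findall(r"[^\x00-\x1f\x7f]+", words_text):
        counts[word] = counts.get(word, 0) + 1  # matched: update; not matched: insert
    return counts


my_table = {"bb": 2}  # pretend 'bb' already exists with w_count = 2
print(merge_words(my_table, "a\nbb\n&\nc"))
```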