Breaking down multiple left joins into multiple steps in PROC SQL - sas

I have code that uses a lot of left joins across many tables. When I run it, it takes more than an hour and then fails with a Sort Execution Failure error. So I am thinking of breaking that chain of left joins into multiple steps, but I am not sure how to do it and need your help.
The code is as follows:
proc sql;
create table newlib.Final_test as
select
    POpener.Name as Client,
    POpener.PartyId as Account_Number,
    case
        when BalLoc.ConvertedRefNo ne '' then BalLoc.ConvertedRefNo
        else BalLoc.OurReferenceNum
    end as LC_Number,
    BalLoc.OurReferenceNum,
    BalLoc.CnvLiabilityCode as Liability_Code,
    POfficer.PartyID as Officer_Num,
    POfficer.Name as Officer_Name,
    POpener.ExpenseCode,
    BalLoc.IssueDate as Issue_Date format=mmddyy10.,
    BalLoc.ExpirationDate as Expiry format=mmddyy10.,
    BalLoc.LiabilityAmountBase as Total_LC_Balance,
    case
        when BalLoc.Syndicated = 0 then BalLoc.LiabilityAmountBase
        else 0
    end as SunTrust_Non_Syndicated_Exposure,
    case
        when BalLoc.Syndicated = 1 and BalLoc.PartOutGroupPkey ne 0
            then BalLoc.LiabilityAmountBase
        else 0
    end as SunTrust_Syndicated_Exposure,
    case
        when BalLoc.Syndicated = 1 and BalLoc.PartOutGroupPkey ne 0
            then (BalLoc.LiabilityAmountBase - (BalLoc.LiabilityAmountBase * (PParty.ParticipationPercent/100)))
        else BalLoc.LiabilityAmountBase
    end as SunTrust_Exposure,
    case
        when BalLoc.Syndicated = 1 and BalLoc.PartOutGroupPkey ne 0
            then (BalLoc.LiabilityAmountBase * PParty.ParticipationPercent/100)
        else 0
    end as Exposure_Held_By_Other_Banks,
    PBene.Name as Beneficiary_Trustee,
    cat(put(input(POpener.ObligorNumber,best10.),z10.),
        put(input(BalLoc.CommitmentNumber,best10.),z10.)) as Key,
    case
        when BalLoc.BeneCusip2 ne ' ' then catx('|',BalLoc.BeneCusip,BalLoc.BeneCusip2)
        else BalLoc.BeneCusip
    end as Cusip,
    case
        when BalLoc.OKtoExpire = 1 then '0'
        when BalLoc.OKtoExpire = 0 and BalLoc.AutoExtTermDays ne 0 then put(BalLoc.AutoExtTermDays,z3.)
        when BalLoc.OKtoExpire = 0 and BalLoc.AutoExtTermsMonth ne 0 then put(BalLoc.AutoExtTermsMonth,z3.)
        else '000'
    end as Evergreen,
    case
        when BLF.AnnualRate ne 0 then put(BLF.AnnualRate,z7.)
        when BLF.Amount ne 0 then cats('F',put(BLF.Amount,z7.))
        else 'WAIVE'
    end as Pricing
from BalLocPrimary BalLoc
left join Party POpener on POpener.Pkey = BalLoc.OpenerPkey
left join PartGroup PGroup on BalLoc.PartOutGroupPkey = PGroup.Pkey
left join PartParties PParty
    on PGroup.Pkey = PParty.PartGroupPkey
    and PParty.ParticipationPercent > 0
    and PParty.combined in
        (select PPartParties.All_combined
         from PPartParties /*group by PartGroupPkey, PartyPkey*/)
left join MemExpenseCodes ExpCodes on POpener.ExpenseCode = ExpCodes.Code
left join Party PBene on PBene.Pkey = BalLoc.BenePkey
left join Party POfficer on POfficer.Pkey = BalLoc.AccountOfficerPkey
left join maxfee on maxfee.LocPrimaryPkey = BalLoc.LocPrimaryPkey
left join BalLocFee BLF on BLF.Pkey = maxfee.Pkey
where BalLoc.LetterType not in ('STBA','EXPA','FEE',' ')
    and BalLoc.LiabilityAmountBase > 0
    and BalLoc.irdb = 1
;
quit;
Thank you,
Shankar

A few things I would suggest:
1. For each dataset that you reference, keep only the variables you need to join on or that get used in the SELECT statement. E.g., from your Party dataset it looks like you only need the Pkey and Name fields, so when you make your join to that dataset you should use:
Left JOIN Party(keep=Pkey Name) PBene on PBene.Pkey = BalLoc.BenePkey
2. Push your WHERE clause into the FROM clause as a dataset option, like so:
FROM BalLocPrimary(where=(LetterType not in ('STBA','EXPA', 'FEE',' ') and
LiabilityAmountBase > 0 and irdb = 1)) BalLoc
Also make sure the conditions are ordered from the one that filters out the most rows to the one that filters out the fewest (barring any index that might exist on those three fields).
3. You are driving off the BalLocPrimary dataset and left joining to everything else. Is that really what you intend? Is it OK for rows in your result set to come back without a Client or an Account_Number? Left joins can be computationally expensive, and the more you can minimize them, the better.
4. Joe asked about indexes on the join fields. You probably should have some; I have found myself referencing one SUGI paper on indexing regularly enough to bookmark it. Similarly, you could review the query plan to see where it might be bottlenecking; another SUGI paper covers that and would be a good start.
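If these are plain SAS datasets, creating the indexes is a one-off step. A minimal sketch in PROC SQL (the work libref is an assumption, so point it at wherever the tables actually live, and note that a simple index in SAS must be named after its column):
proc sql;
    /* a simple index must carry the name of the column it indexes */
    create index Pkey on work.Party (Pkey);                       /* lookup key used by three joins */
    create index LocPrimaryPkey on work.maxfee (LocPrimaryPkey);
quit;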
5. You're right that this could (should?) be broken up into multiple steps; that's a good intuition. However, the optimal break points are going to be highly dependent on the underlying data, indexes, and join paths, so it's hard to prescribe them from the other side of the screen. I think that second paper could give you some good tips on optimizing your specific case.
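For what it's worth, here is a minimal sketch of what the multi-step version could look like. Table and column names are taken from your query; the work staging library and the choice of which join goes in which step are assumptions you would tune:
proc sql;
    /* Step 1: apply the WHERE filter once, up front, to shrink the driving table */
    create table work.balloc_f as
    select *
    from BalLocPrimary
    where LetterType not in ('STBA','EXPA','FEE',' ')
      and LiabilityAmountBase > 0
      and irdb = 1;

    /* Step 2: attach one lookup at a time, keeping only the columns you need */
    create table work.step1 as
    select b.*,
           p.Name    as Client,
           p.PartyId as Account_Number,
           p.ExpenseCode
    from work.balloc_f b
    left join Party(keep=Pkey Name PartyId ExpenseCode) p
       on p.Pkey = b.OpenerPkey;

    /* Steps 3..n: join PBene, POfficer, PGroup/PParty, and maxfee/BLF the same
       way, then compute the CASE columns in one final step */
quit;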

Related

Count number of WHERE filters in SQL query using regex

Update: I've updated the test string to cover a case that I had missed.
I'm trying to count the number of WHERE filters in a query using regex.
The general idea is to count the occurrences of WHERE and AND in the query, while excluding any AND that appears after a JOIN and before a WHERE, and also excluding any AND inside a CASE WHEN clause.
For example, this query:
WITH cte AS (\nSELECT a,b\nFROM something\nWHERE a>10\n AND b<5)\n, cte2 AS (\n SELECT c,\nd FROM another\nWHERE c>10\nAND d<5)\n SELECT CASE WHEN c1.a=1\nAND c2.c=1 THEN 'yes' ELSE 'no' \nEND,c1.a,c1.b,c2.c,c2.d\nFROM cte c1\nINNER JOIN cte2 c2 ON c1.a = c2.c\nAND c1.b = c2.d\nWHERE c1.a<4 AND DATE(c1)>'2022-01-01'\nAND c2.c>6
-- FORMATTED FOR EASE OF READ. PLEASE USE LINE ABOVE AS REGEX TEST STRING
WITH cte AS (
SELECT a,b
FROM something
WHERE a>10
AND b<5
)
, cte2 AS (
SELECT c,d
FROM another
WHERE c>10
AND d<5
)
SELECT
CASE
WHEN c1.a=1 AND c2.c=1 THEN 'yes'
WHEN c1.a=1 AND c2.c=1 THEN 'maybe'
ELSE 'no'
END,
c1.a,
c1.b,
c2.c,
c2.d
FROM cte c1
INNER JOIN cte2 c2
ON c1.a = c2.c
AND c1.b = c2.d
WHERE c1.a<4
AND DATE(c1)>'2022-01-01'
AND c2.c>6
should return 7, which are:
WHERE a>10
AND b<5
WHERE c>10
AND d<5
WHERE c1.a<4
AND DATE(c1)>'2022-01-01'
AND c2.c>6
The portion AND c1.b = c2.d is not counted because it appears after a JOIN and before the WHERE.
The portion AND c2.c=1 is not counted because it is inside a CASE WHEN clause.
I eventually plan to use this on PostgreSQL queries to count the number of filters that occur in all queries over a certain period.
I've tried searching around for an answer and experimenting myself, but to no avail, hence I'm looking for help here. Thank you in advance!
I try to stay away from lookbehinds as they can be messy and painful to use, especially with the fixed-width limitation of lookbehind assertions.
My proposed solution is to capture all scenarios in different groups, and then select only the group of interest. The undesired scenarios will still be matched, but will not be selected.
Group 1 - Starts with JOIN (undesired)
Group 2 - Starts with WHERE (desired)
Group 3 - Starts with CASE (undesired)
(JOIN.*?(?=$|WHERE|JOIN|CASE|END))|(WHERE.*?(?=$|WHERE|JOIN|CASE|END))|(CASE.*?(?=$|WHERE|JOIN|CASE|END))
Note: Feel free to replace WHERE|JOIN|CASE|END to any keyword you want to be the 'stopper' words.
All scenarios, including the undesired ones, will be matched, but you select only Group 2.
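If you are doing this in PostgreSQL (an assumption based on your eventual target), one way to keep only the Group 2 matches is to rely on non-participating groups coming back as NULL in the match array. A sketch with a short inline test string:
SELECT m[2] AS where_segment
FROM regexp_matches(
       'SELECT a FROM t JOIN u ON a = b AND c = d WHERE a > 1 AND b < 2',
       '(JOIN.*?(?=$|WHERE|JOIN|CASE|END))|(WHERE.*?(?=$|WHERE|JOIN|CASE|END))|(CASE.*?(?=$|WHERE|JOIN|CASE|END))',
       'g') AS m
WHERE m[2] IS NOT NULL;  -- keep only matches where Group 2 participated
Each returned where_segment then contains one WHERE plus its trailing ANDs, which you can count as in the next answer.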
You can try something like this:
WITH DataSource (parts) AS
(
    SELECT REGEXP_MATCHES(
        'WITH cte AS (SELECT a,b FROM something WHERE a>10 AND b<5)\n, cte2 AS (SELECT c,d FROM another WHERE c>10 AND d<5)\n SELECT c1.a,c1.b,c2.c,c2.d FROM cte c1 INNER JOIN cte2 c2 ON c1.a = c2.c AND c1.b = c2.d WHERE c1.a<4 AND c2.c>6',
        E'(?= WHERE)[^)|;]+',
        'gmi'
    )
)
SELECT SUM(
    (length(parts[1]) - length(REPLACE(parts[1], 'AND', ''))) / 3 -- counting the ANDs
    + 1                                                           -- plus one for the WHERE
)
FROM DataSource
The idea is to match the text after each WHERE clause and then simply count the ANDs, adding one for the matched WHERE itself.

PL/SQL if control for different calculations

I am using PL/SQL to implement the function below.
I have a table with piece-level data that includes each piece's weight. Basically I want to implement the following:
if the piece weight is over 1 lb, group by ceil(weight) (next lb);
if the piece weight is under 1 lb, group by ceil(weight*16) (next oz).
I am just curious how I can do that in PL/SQL. I feel I need an IF statement, but I am not sure how to write it.
(Weight is already a column in that table; do I need to declare it here?)
begin
  if weight < 1 then
    select ceil(weight*16), sum(weight)
    from ops_owner.track_mail_item
    where manifestdate = '24-aug-2016'
    group by ceil(weight*16)
  else
    select ceil(weight), sum(weight)
    from ops_owner.track_mail_item
    where manifestdate = '24-aug-2016'
  end if,
end;
Thank you very much!
I would adjust the weight value in an inline view, I think.
select
ceil(adjusted_weight),
sum(adjusted_weight)
from
(
select
case
when weight < 1 then weight * 16
else weight
end adjusted_weight
from
ops_owner.track_mail_item
where
manifestdate = '24-aug-2016'
)
group by
ceil(adjusted_weight);
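One caveat of my own on top of that answer: after the adjustment, a 0.1 lb piece (ceil(1.6) = 2, in oz) and a 1.5 lb piece (ceil(1.5) = 2, in lb) land in the same group. If the oz and lb buckets must stay separate (an assumption about the requirement), a sketch that carries a unit flag through the inline view:
select
  unit,
  ceil(adjusted_weight) as bucket,
  sum(adjusted_weight)  as total_adjusted_weight
from
  (
  select
    case when weight < 1 then 'OZ' else 'LB' end as unit,          -- hypothetical flag
    case when weight < 1 then weight * 16 else weight end as adjusted_weight
  from
    ops_owner.track_mail_item
  where
    manifestdate = '24-aug-2016'
  )
group by
  unit,
  ceil(adjusted_weight);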

PL/SQL: optimize searching for a date in a varchar

I have a table that contains a date field (let it be date s_date) and a description field (varchar2(n) desc). What I need is to write a script (or a single query, if possible) that will parse the desc field, and if it contains a valid Oracle date, cut that date out and use it to update s_date where s_date is null.
But there is one more condition: there must be exactly one occurrence of a date in desc. If there are 0 or more than 1, nothing should be updated.
So far I have come up with this pretty ugly solution using regular expressions:
----------------------------------------------
create or replace function to_date_single( p_date_str in varchar2 )
return date
is
  l_date date;
  pRegEx varchar(150);
  pResStr varchar(150);
begin
  pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)((.|\n|\t|\s)*((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d))?';
  pResStr := regexp_substr(p_date_str, pRegEx);
  if not (length(pResStr) = 10) then
    return null;
  end if;
  l_date := to_date(pResStr, 'dd.mm.yyyy');
  return l_date;
exception
  when others then return null;
end to_date_single;
----------------------------------------------
update myTable t
set t.s_date = to_date_single(t.desc)
where t.s_date is null;
----------------------------------------------
But it runs extremely slowly (more than a second per record, and I need to update about 30000 records). Is it possible to optimize the function somehow? Maybe there is a way to do this without regexp? Any other ideas?
Any advice is appreciated :)
EDIT:
OK, maybe it'll be useful for someone. The following regular expression performs check for valid date (DD.MM.YYYY) taking into account the number of days in a month, including the check for leap year:
(((0[1-9]|[12]\d|3[01])\.(0[13578]|1[02])\.((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\.(0[13456789]|1[012])\.((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\.02\.((19|[2-9]\d)\d{2}))|(29\.02\.((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))
I used it with the query suggested by @David (see the accepted answer), but I tried a SELECT instead of an UPDATE (so it's one regexp fewer per row, because we skip the regexp_substr), just for benchmarking purposes.
Numbers probably won't tell much here, since it all depends on hardware, software and the specific DB design, but it took about 2 minutes to select 36K records for me. The update will be slower, but I think it will still finish in a reasonable time.
I would refactor it along the lines of a single update query.
Use two regexp_instr() calls in the WHERE clause to find rows for which a first occurrence of the pattern exists and a second occurrence does not, and regexp_substr() to pull the matching characters for the update.
update my_table
set my_date = to_date(regexp_substr(desc,...),...)
where regexp_instr(desc,pattern,1,1) > 0 and
regexp_instr(desc,pattern,1,2) = 0
You might get even better performance with:
update my_table
set my_date = to_date(regexp_substr(desc,...),...)
where case regexp_instr(desc,pattern,1,1)
        when 0 then 'N'
        else case regexp_instr(desc,pattern,1,2)
               when 0 then 'Y'
               else 'N'
             end
      end = 'Y'
... as it only evaluates the second regexp if the first one is non-zero. The first query might also do that, but the optimiser might choose to evaluate the second predicate first because it is an equality condition, on the assumption that it is more selective.
Reordering the CASE expression might be better still; it's a trade-off that is difficult to judge and probably very dependent on the data.
I think there's no way to make this task faster. Actually, in order to achieve what you want, it has to get even slower.
Your regular expression matches text like 31.02.2013 or 31.04.2013, where the day is out of range for the month. If you bring the year into play, it gets even worse: 29.02.2012 is valid, but 29.02.2013 is not.
That's why you have to test whether the result is a valid date, and since there is no complete regular expression for that, you really have to do it in PL/SQL.
In your to_date_single function you return null when an invalid date is found, but that doesn't mean there won't be other valid dates further along in the text. So you have to keep trying until you either find two valid dates or hit the end of the text:
create or replace function fn_to_date(p_date_str in varchar2) return date is
  l_date date;
  pRegEx varchar(150);
  pResStr varchar(150);
  vn_findings number;
  vn_loop number;
begin
  vn_findings := 0;
  vn_loop := 1;
  pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)';
  loop
    pResStr := regexp_substr(p_date_str, pRegEx, 1, vn_loop);
    if pResStr is null then
      exit;
    end if;
    begin
      l_date := to_date(pResStr, 'dd.mm.yyyy');
      vn_findings := vn_findings + 1;
      -- your crazy requirement :)
      if vn_findings = 2 then
        return null;
      end if;
    exception when others then
      null;
    end;
    -- you have to keep trying :)
    vn_loop := vn_loop + 1;
  end loop;
  return l_date;
end;
Some tests:
select fn_to_date('xxxx29.02.2012xxxxx') c1              --ok
     , fn_to_date('xxxx29.02.2012xxx29.02.2013xxx') c2   --ok, 2nd is invalid
     , fn_to_date('xxxx29.02.2012xxx29.02.2016xxx') c3   --null, both are valid
from dual
As you are going to have to use trial and error anyway, one idea would be to use a simpler regular expression; something like \d\d[.]\d\d[.]\d\d\d\d would suffice. That depends on your data, of course.
Using @David's idea you could filter the number of rows you apply your to_date_single function to (because it's slow), but regular expressions alone won't do what you want:
update my_table
set my_date = fn_to_date(desc)
where regexp_instr(desc,pattern,1,1) > 0

Doctrine2: Doctrine query returns different results from raw query?

I have come across this multiple times in the past, and I just ended up writing a raw SQL query to get around it, but I really want to find out why it is happening. Look at the statement below:
$q = $this->entityManager->createQuery('SELECT um,s,st
FROM Dashboard\Entity\Metric um
JOIN um.stat st
JOIN um.site s
JOIN s.clients c
WHERE c.id = ?1
AND s.competitor = 0
AND s.ignored = 0
AND st.id IN (?2)
GROUP BY s.id, st.id
ORDER BY st.response_field, s.id')
->setParameter(1, $params['c_id'])
->setParameter(2, $statId);
$sql = $q->getSql();
$rs = $q->getResult();
If I take the contents of $sql, paste them into a MySQL tool and run the raw query, it returns 18 results, which is correct.
However, $rs only contains 3 results. $statId is a comma-separated string of 6 numbers: (1,2,3,4,5,6). I am grouping by st.id and s.id, and there will be 3 s.id elements for every st.id element, which works out to the 18 results I expected. What's happening is that Doctrine only returns the first st.id, i.e. the first group of 3 s.id's.
Any idea what could be causing this?
I got my answer on IRC, but I'm posting it here in case it helps someone else: smart WHERE IN statements aren't planned for support until version 2.1.
http://groups.google.com/group/doctrine-dev/browse_thread/thread/fbf70837293676fb
But I can accomplish the same goal with query builder.
http://www.doctrine-project.org/docs/orm/2.0/en/reference/query-builder.html

SQL Server function to eliminate replicated characters

I wonder if there is an easy and efficient way in SQL Server 2005 to eliminate repeated characters in a string, like converting
'ABBBCDEEFFFFG' to 'ABCDEFG'
It really sucks that SQL Server has such a poor string library and no ready-to-use regex feature...
You can use the CLR functionality built into SQL Server 2005/2008 to get this done in .NET code; MSDN Magazine wrote about it in their February 2007 issue.
If that is not an acceptable solution, here is a UDF that will do the same, though mind you it is about two orders of magnitude slower than the CLR solution.
YMMV. It appears to work for your string above, but not for ABBBCDEEBBBBG:
DECLARE @Numbers TABLE (Num smallint NOT NULL PRIMARY KEY)

INSERT @Numbers (Num)
SELECT TOP 8000
    ROW_NUMBER() OVER (ORDER BY c1.name)
FROM
    sys.columns c1

DECLARE @Stuff TABLE (Seq varchar(100) NOT NULL PRIMARY KEY)

INSERT @Stuff (Seq) VALUES ('ABBBCDEEFFFFG') --works

SELECT
    Single
FROM
    (
    SELECT DISTINCT
        CAST(Single AS varchar(100))
    FROM
        @Numbers N
        CROSS APPLY
        (SELECT Seq, SUBSTRING(Seq, Num, 1) AS Single FROM @Stuff) S
    WHERE
        Num <= LEN(Seq)
    FOR XML PATH ('')
    ) foo(Single)
I know about the CLR solution, but as I said, I am neither responsible for nor authorized to implement it in the database in question.
For this particular problem I decided to write a very simple, kind of silly loop. I am afraid it won't be fast enough for millions of records, but anyway... I wish I could do this in the application layer, but I am bound to T-SQL here...
DECLARE @i int ; -- counter
DECLARE @input varchar(200) ;
SET @input = 'AAABCDEEFFBBBXYZSSSWWWNT' ;

IF LEN(@input) > 1
BEGIN
    DECLARE @unduplicated varchar(200) ;
    SET @unduplicated = SUBSTRING(@input, 1, 1) ;
    SET @i = 2 ;
    WHILE @i <= LEN(@input)
    BEGIN
        -- If the current char differs from the last kept char, concatenate it
        IF SUBSTRING(@unduplicated, LEN(@unduplicated), 1) <> SUBSTRING(@input, @i, 1)
            SET @unduplicated = @unduplicated + SUBSTRING(@input, @i, 1) ;
        SET @i = @i + 1 ;
    END
END

SELECT @unduplicated AS unduplicated ;
Result:
unduplicated
ABCDEFBXYZSWNT
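For comparison, here is a set-based sketch of my own (not from the answers above) using the same numbers-table trick as the first answer. It keeps a character only when it differs from its predecessor, so it also handles the ABBBCDEEBBBBG case flagged earlier; it assumes sys.columns has at least as many rows as the input is long:
DECLARE @input varchar(200) ;
SET @input = 'ABBBCDEEBBBBG' ;

SELECT
    (SELECT SUBSTRING(@input, n.Num, 1) AS [text()]
     FROM (SELECT TOP 200 ROW_NUMBER() OVER (ORDER BY c1.name) AS Num
           FROM sys.columns c1) n
     WHERE n.Num <= LEN(@input)
       -- keep a char only if it differs from the one before it
       -- (SUBSTRING(@input, 0, 1) is '' so the first char is always kept)
       AND SUBSTRING(@input, n.Num, 1) <> SUBSTRING(@input, n.Num - 1, 1)
     ORDER BY n.Num
     FOR XML PATH('')) AS unduplicated ;
-- expected result: ABCDEBG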