SQL Dev: Updating column from another table's column with where statement - sql-update

I have the following 2 table examples (large databases with many more columns)
table1
Dirty1 code
Ne yok 553
Bufflo 5767
Ne yok -345
Tchicgo -35
Albunny 543
Dtroit -443
Bufflo -4534
Matatan -45
Ne yok -345
table 2
Dirty2 Standardized
Manhatahn Manhattan
Ne yok New York
Matatan Manhattan
Brocklyn Brooklyn
Albunny Albany
Bufflo Buffalo
Baffalow Buffalo
I want to update table 1 with the standardized city format in table 2 where table1.dirty1 = table2.dirty2 and code is < 0
so the output should look like the following
output table1
Dirty1 code
Ne yok 553
Bufflo 5767
New York -345
Tchicgo -35
Albunny 543
Dtroit -443
Buffalo -4534
Manhattan -45
New York -345
I also want to make sure any that don't have a standardized form in the table 2 get skipped (example: Dtroit and tchicgo)

UPDATE: for Oracle-
UPDATE table1 SET table1.Dirty1= (SELECT table2.Standardized FROM table2
WHERE table1.Dirty1=table2.Dirty2)
WHERE table1.code<0 AND EXISTS (SELECT table2.Standardized FROM table2
WHERE table1.Dirty1=table2.Dirty2);
Note I'm not using Oracle and haven't tested it but it should work.
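Since the statement above was posted untested, here is a minimal sketch that runs the same correlated-subquery pattern against the question's sample rows. It assumes SQLite (through Python's sqlite3 module) rather than Oracle, so it only shows that the logic produces the desired output, not that Oracle accepts the exact syntax.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (Dirty1 TEXT, code INTEGER)")
conn.execute("CREATE TABLE table2 (Dirty2 TEXT, Standardized TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [
    ("Ne yok", 553), ("Bufflo", 5767), ("Ne yok", -345), ("Tchicgo", -35),
    ("Albunny", 543), ("Dtroit", -443), ("Bufflo", -4534), ("Matatan", -45),
    ("Ne yok", -345),
])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [
    ("Manhatahn", "Manhattan"), ("Ne yok", "New York"), ("Matatan", "Manhattan"),
    ("Brocklyn", "Brooklyn"), ("Albunny", "Albany"), ("Bufflo", "Buffalo"),
    ("Baffalow", "Buffalo"),
])

# The correlated subquery supplies the new value; EXISTS skips rows without a
# match, so 'Tchicgo' and 'Dtroit' keep their spelling instead of becoming NULL.
conn.execute("""
    UPDATE table1
    SET Dirty1 = (SELECT t2.Standardized FROM table2 AS t2
                  WHERE t2.Dirty2 = table1.Dirty1)
    WHERE code < 0
      AND EXISTS (SELECT 1 FROM table2 AS t2 WHERE t2.Dirty2 = table1.Dirty1)
""")

result = conn.execute("SELECT Dirty1, code FROM table1 ORDER BY rowid").fetchall()
for row in result:
    print(row)
```

Note that the subquery must return at most one row per dirty value; if table2 ever holds two standardized forms for the same Dirty2, Oracle would raise an error while SQLite would silently pick one.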
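This should do the trick in MS-SQL (T-SQL puts the join in a FROM clause after SET; the UPDATE ... INNER JOIN ... SET form is MySQL/Access syntax and is rejected by SQL Server)-
UPDATE t1 SET t1.Dirty1 = t2.Standardized
FROM table1 AS t1 INNER JOIN table2 AS t2 ON t1.Dirty1 = t2.Dirty2
WHERE t1.code < 0;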

Related

SAS proc sql inner join without duplicates

I am struggling to join two tables without creating duplicate rows using proc sql (not sure if any other method is more efficient).
Inner join is on: datepart(table1.date)=datepart(table2.date) AND tag=tag AND ID=ID
I think the problem is the date and the different names in table 1. Just by looking at the tables it's clear that table 1's row 1 should be joined with table 2's row 1, because the transaction started at 00:04 in table 1 and finished at 00:06 in table 2. The issue I am having is that I can't join on dates with the timestamp, so I am removing the timestamps, and because of that it's creating duplicates.
Table1:
id tag date amount name_x
1 23 01JUL2018:00:04 12 smith ltd
1 23 01JUL2018:00:09 12 anna smith
table 2:
id tag ref amount date
1 23 19 12 01JUL2018:00:06:00
1 23 20 12 01JUL2018:00:10:00
Desired output:
id tag date amount name_x ref
1 23 01JUL2018 12 smith ltd 19
1 23 01JUL2018 12 anna smith 20
Appreciate your help.
Thanks!
You need to set a boundary for that datetime join. You are correct in why you are getting duplicates. I would guess the lower bound is the previous datetime, if it exists and the upper bound is this record's datetime.
As an aside, this is poor database design on someone's part...
Let's first sort table2 by id, tag, and date
proc sort data=table2 out=temp;
by id tag date;
run;
Now write a data step to add the previous date for unique id/tag combinations.
data temp;
set temp;
format low_date datetime20.;
by id tag;
retain p_date;
if first.tag then
p_date = 0;
low_date = p_date;
p_date = date;
run;
Now update your join to use the date range.
proc sql noprint;
create table want as
select a.id, a.tag, a.date, a.amount, a.name_x, b.ref
from table1 as a
inner join
temp as b
on a.id = b.id
and a.tag = b.tag
and b.low_date < a.date <= b.date;
quit;
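The same bounded join can be tried outside SAS. The sketch below is an illustration under stated assumptions, not a translation of the steps above: it uses SQLite through Python's sqlite3 module, stores the datetimes as ISO-8601 text, and derives low_date with a correlated subquery instead of the sort plus retain step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INT, tag INT, date TEXT, amount INT, name_x TEXT)")
conn.execute("CREATE TABLE table2 (id INT, tag INT, ref INT, amount INT, date TEXT)")
# Datetimes are stored as ISO-8601 text so string comparison follows time order.
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", [
    (1, 23, "2018-07-01 00:04:00", 12, "smith ltd"),
    (1, 23, "2018-07-01 00:09:00", 12, "anna smith"),
])
conn.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?, ?)", [
    (1, 23, 19, 12, "2018-07-01 00:06:00"),
    (1, 23, 20, 12, "2018-07-01 00:10:00"),
])

rows = conn.execute("""
    WITH bounds AS (
        -- low_date = previous table2 datetime for the same id/tag, or '' if none
        SELECT b.*,
               COALESCE((SELECT MAX(p.date) FROM table2 AS p
                         WHERE p.id = b.id AND p.tag = b.tag AND p.date < b.date),
                        '') AS low_date
        FROM table2 AS b
    )
    SELECT a.id, a.tag, a.date, a.amount, a.name_x, b.ref
    FROM table1 AS a
    INNER JOIN bounds AS b
      ON a.id = b.id AND a.tag = b.tag
     AND b.low_date < a.date AND a.date <= b.date
    ORDER BY a.date
""").fetchall()

for row in rows:
    print(row)
```

Each table1 row falls into exactly one (low_date, date] interval, which is what removes the duplicates.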
If my understanding is correct, you want to merge by ID, tag and the closest date: 01JUL2018:00:04 in table1 is closest to 01JUL2018:00:06:00 in table2, and 01JUL2018:00:09 is closest to 01JUL2018:00:10:00. You could try this:
data table1;
input id tag date:datetime21. amount name_x $15.;
format date datetime21.;
cards;
1 23 01JUL2018:00:04 12 smith ltd
1 23 01JUL2018:00:09 12 anna smith
;
data table2;
input id tag ref amount date: datetime21.;
format date datetime21.;
cards;
1 23 19 12 01JUL2018:00:06:00
1 23 20 12 01JUL2018:00:10:00
;
proc sql;
select a.*,b.ref from table1 a inner join table2 b
on a.id=b.id and a.tag=b.tag
group by a.id,a.tag,a.date
having abs(a.date-b.date)=min(abs(a.date-b.date));
quit;

Splitting a Column into two based on conditions in Proc Sql, SAS

I want to split the Airlines column into two groups and then add up each group's amount for every client:
Group 1 = Air India & Jet Airways
Group 2 = Others
Loc Client_Name Airlines Amount
BBI A_1ABC2 Air India 41302
BBI A 1ABC2 Air India 41302
MAA Th 1ABC2 Spice Jet Airlines 288713
HYD Ma 1ABC2 Jet Airways 365667
BOM Vi 1ABC2 Air India 552506
Something like this: -
Rank Client_name Group1 Group2 Total
1 Ca 1ABC2 5266269 7040320 12306589
2 Ve 1ABC2 2815593 2675886 5491479
3 Ma 1ABC2 1286686 437843 1724529
4 Th 1ABC2 723268 701712 1424980
5 Ec 1ABC2 113517 627734 741251
6 A 1ABC2 152804 439381 592185
I grouped the rows first, but I am confused about how to split them into columns:
Data assign6.Airlines_grouping1;
Set assign6.Airlines_grouping;
if Scan(Airlines,1) IN ('Air','Jet') then Group = "Group1";
else
if Scan(Airlines,1) Not in('Air','Jet') then Group = "Group2";
Run;
You are categorizing a row based on the first word of the airline.
Proc TRANSPOSE with an ID statement is one common way to reshape data so that a categorical value becomes a column. A second way is to bypass the categorization and use a data step to produce the new shape of data directly.
Here is an example of the second way -- create new columns group1 and group2 and set value based on airline criteria.
data airlines_group_amounts;
set airlines;
if scan (airlines,1) in ('Air', 'Jet') then
group1 = amount;
else
group2 = amount;
run;
Then summarize over client:
proc sql;
create table want as
select
client_name
, sum(group1) as group1
, sum(group2) as group2
, sum(amount) as total
from airlines_group_amounts
group by client_name
;
quit;
You can avoid the two steps and do all of the processing in a single query, or you can do the summarization with Proc MEANS.
Here is a single query way.
proc sql;
create table want as
select
client_name
, sum(case when scan (airlines,1) in ('Air', 'Jet') then amount else 0 end) as group1
, sum(case when scan (airlines,1) in ('Air', 'Jet') then 0 else amount end) as group2
, sum(amount) as total
from airlines
group by client_name
;
quit;
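To see the conditional-aggregation idea in action without SAS, here is a small sketch using SQLite through Python's sqlite3 module. The client names and amounts are made up for the demo, and because SQLite has no SCAN() function the first word is cut out with substr/instr; both are assumptions of this example, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airlines (client_name TEXT, airlines TEXT, amount INTEGER)")
# Illustrative rows only; client names and amounts are hypothetical.
conn.executemany("INSERT INTO airlines VALUES (?, ?, ?)", [
    ("client_a", "Air India", 100),
    ("client_a", "Spice Jet Airlines", 40),
    ("client_a", "Jet Airways", 60),
    ("client_b", "Spice Jet Airlines", 70),
    ("client_b", "Air India", 30),
])

# Stand-in for SAS scan(airlines, 1): everything before the first space.
first_word = "substr(airlines, 1, instr(airlines || ' ', ' ') - 1)"

totals = conn.execute(f"""
    SELECT client_name,
           SUM(CASE WHEN {first_word} IN ('Air', 'Jet') THEN amount ELSE 0 END) AS group1,
           SUM(CASE WHEN {first_word} IN ('Air', 'Jet') THEN 0 ELSE amount END) AS group2,
           SUM(amount) AS total
    FROM airlines
    GROUP BY client_name
    ORDER BY client_name
""").fetchall()

for row in totals:
    print(row)
```

"Spice Jet Airlines" lands in group2 because only the first word is tested, which matches the scan(airlines,1) rule used above.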

SAS Recursive Join

I have a large table of connections, and would like to expand that table to include recursive connections.
My data looks like this --
data city_list;
input from_city $ to_city $;
datalines;
PORTLAND SEATTLE
SEATTLE BOISE
BOISE PORTLAND
PORTLAND HELENA
NYC ORLANDO
ORLANDO MIAMI
;
run;
I'd like to expand the data set to include stopovers, so it ends up looking like this. I'm not concerned about whether I have both a "PORTLAND/SEATTLE" and a "SEATTLE/PORTLAND" record -- I can handle those afterwards as necessary.
BOISE HELENA
BOISE PORTLAND
BOISE SEATTLE
NYC MIAMI
NYC ORLANDO
ORLANDO MIAMI
PORTLAND HELENA
PORTLAND SEATTLE
SEATTLE HELENA
I've tried using the following macro, but ran into performance problems when there were too many levels of recursion. I believe the best option would be hash tables, but am not sure how to code this precise scenario.
%macro RecurJoin(
baseTbl,
destTbl,
baseKey,
compKey
);
Proc SQL;
Create Table WORK.RECUR_JOIN_TBL as
SELECT distinct Base.&baseKey, Connect.&compkey
FROM &baseTbl AS Base
INNER JOIN &baseTbl AS Connect
ON (Base.&compkey = Connect.&baseKey)
LEFT JOIN &baseTbl AS Subbase
ON (Base.&baseKey = Subbase.&baseKey) AND
(Connect.&compkey = Subbase.&compkey)
WHERE Subbase.&baseKey IS NULL;
quit;
proc sql noprint;
select count(1) into :connectCnt from RECUR_JOIN_TBL;
quit;
Data &destTbl;
set &baseTbl
RECUR_JOIN_TBL;
run;
Proc DataSets nolist;
Delete RECUR_JOIN_TBL;
Quit;
%if &connectCnt > 0 %then %do;
%RecurJoin(
baseTbl=&destTbl,
destTbl=&destTbl,
baseKey=&baseKey,
compKey=&compKey
);
%end;
%mend;
%RecurJoin(
baseTbl=city_list,
destTbl=FNL_CITY_LIST,
baseKey=from_city,
compKey=to_city
);
Proc Sort data=WORK.FNL_CITY_LIST (where=(NOT(from_city=to_city)));
by from_city to_city;
run;
Memory allowing, you can use the hash-based approach I came up with in this answer to identify the groups of connected cities within your dataset. Then you just need to generate a row for every pair of cities within the same group, which can easily be done via a cartesian join in proc sql.
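The linked answer is not reproduced here, so the following is only a sketch of the general idea (group connected cities, then pair up every city within a group), written in Python with a union-find structure rather than a SAS hash object. The function names are hypothetical.

```python
from itertools import combinations

edges = [
    ("PORTLAND", "SEATTLE"), ("SEATTLE", "BOISE"), ("BOISE", "PORTLAND"),
    ("PORTLAND", "HELENA"), ("NYC", "ORLANDO"), ("ORLANDO", "MIAMI"),
]

parent = {}  # city -> parent city in the union-find forest

def find(city):
    """Return the representative city of the group containing `city`."""
    parent.setdefault(city, city)
    while parent[city] != city:
        parent[city] = parent[parent[city]]  # path halving keeps trees shallow
        city = parent[city]
    return city

def union(a, b):
    """Merge the groups containing cities a and b."""
    parent[find(a)] = find(b)

for a, b in edges:
    union(a, b)

# Collect the members of each connected group.
groups = {}
for city in list(parent):
    groups.setdefault(find(city), []).append(city)

# One row per unordered pair of cities within the same group.
pairs = sorted(pair for members in groups.values()
               for pair in combinations(sorted(members), 2))

for pair in pairs:
    print(*pair)
```

For the sample data this yields the same nine connections as the desired output, though some pairs come out in the opposite direction (for example HELENA PORTLAND instead of PORTLAND HELENA), which the question says is acceptable.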

How to assign values for dummy variables in sas

I have a dataset which includes cities, state, claims and premium:
City state Claims Model
Mumbai Karnataka 200000 Honda city
Bangalore Maharastra 190000 Ford
Kochi Kerala 150000 honda city
I have created dummy variables for model. I want to impute values of claim in the dummy variable. Example is given below. I want my dataset to look like this.
City state Claims Model HondaCity Ford
Mumbai Karnataka 200000 Honda city 200000 0
Bangalore Maharastra 190000 Ford 0 190000
Kochi Kerala 150000 honda city 150000 0
Instead of a 0/1 dummy, I want to put the claim value into the matching model variable. My aim is to predict the risk-based premium. How can I do that?
In case you still need help with this (or for future reference), the following code converts the first dataset into the second one:
proc sql;
create table new_table as
select
a.*
,case when upper(model) = "HONDA CITY" then claims else 0 end as HondaCity
,case when upper(model) = "FORD" then claims else 0 end as Ford
from old_table as a;
quit;

How to calculate quantile data for table of frequencies in SAS?

I am interested in dividing my data into thirds, but I only have a summary table of counts by a state. Specifically, I have estimated enrollment counts by state, and I would like to calculate what states comprise the top third of all enrollments. So, the top third should include at least a total cumulative percentage of .33333...
I have tried various means of specifying cumulative percentages between .33333 and .40000 but with no success in specifying the general case. PROC RANK also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
input state $20. enrollment;
cards;
CALIFORNIA 440233
TEXAS 318921
NEW YORK 224867
FLORIDA 181517
ILLINOIS 162664
PENNSYLVANIA 155958
OHIO 141083
MICHIGAN 124051
NEW JERSEY 117131
GEORGIA 104351
NORTH CAROLINA 102466
VIRGINIA 93154
MASSACHUSETTS 80688
INDIANA 75784
WASHINGTON 73764
MISSOURI 73083
MARYLAND 73029
WISCONSIN 72443
TENNESSEE 71702
ARIZONA 69662
MINNESOTA 66470
COLORADO 58274
ALABAMA 54453
LOUISIANA 50344
KENTUCKY 49595
CONNECTICUT 47113
SOUTH CAROLINA 46155
OKLAHOMA 43428
OREGON 42039
IOWA 38229
UTAH 36476
KANSAS 36469
MISSISSIPPI 33085
ARKANSAS 32533
NEVADA 27545
NEBRASKA 24571
NEW MEXICO 22485
WEST VIRGINIA 21149
IDAHO 20596
NEW HAMPSHIRE 19121
MAINE 18213
HAWAII 16304
RHODE ISLAND 13802
DELAWARE 12025
MONTANA 11661
SOUTH DAKOTA 11111
VERMONT 10082
ALASKA 9770
NORTH DAKOTA 9614
WYOMING 7457
DIST OF COLUMBIA 6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...
You can easily count percentages using PROC FREQ with WEIGHT statement and then select those in the first third using LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your datastep:
pctile_flag = put(cum_percent,pctile.);
Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
Note: your first. syntax is incorrect. first. and last. only work on BY groups. You get the right values in CUM_PERCENT by way of the cum_percent + percent_total; statement.
I am not aware of a PROC that will do this for you.
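For readers who want to check the flagging rule from the rewritten data step by hand, here is a small Python sketch of the same logic: keep every row while the cumulative share is below one third, plus the one row that crosses the threshold. The counts are illustrative, not the question's data, and integer arithmetic is used to avoid floating-point comparisons.

```python
# Illustrative counts only (not the question's data), already sorted descending.
counts = [("A", 25), ("B", 20), ("C", 15), ("D", 15), ("E", 15), ("F", 10)]
total = sum(n for _, n in counts)

flags = []
cum = 0
crossed = False  # becomes True once the cumulative share reaches one third
for state, n in counts:
    cum += n
    if cum * 3 < total:      # cumulative share still strictly below 1/3
        top_third = 1
    elif not crossed:        # the row that crosses the threshold is kept too
        top_third = 1
        crossed = True
    else:
        top_third = 0
    flags.append((state, top_third))

for state, flag in flags:
    print(state, flag)
```

Here A (25%) is below one third and B brings the running share to 45%, so both are flagged and every later row gets 0.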