I'm currently using SAS Enterprise Guide for one of my assignments, and I'm trying to reshape my current table into a desired table like this one. I've already used the Split/Stack Columns task, but I'm not sure which variable(s) I should put in which task role(s). Any suggestions?
Thank you!
You want to transpose your dataset.
Try this:
proc transpose data = current_data out = new_data (drop=_name_);
by GENDER NOTSORTED;
id STATUS;
var Avg_Claim_amt;
run;
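If you need sample data to test this, here is a small data step that builds current_data in the shape the transpose assumes (values borrowed from the SQL answer below):
data current_data;
    length GENDER $1 STATUS $11;
    input GENDER $ STATUS $ Avg_Claim_amt;
    datalines;
F Married_Yes 1546.19
F Married_No 2269.10
M Married_Yes 1485.45
M Married_No 2308.96
;
run;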
Try something like this:
create temporary table data
(
Gender varchar(1),
Status varchar(20),
Avg_Claim_Amt numeric(8,2)
);
insert into data values('F','Married_Yes',1546.19);
insert into data values('F','Married_No' ,2269.10);
insert into data values('M','Married_Yes',1485.45);
insert into data values('M','Married_No' ,2308.96);
select * from data;
select gender,sum(married_no) as Married_No,sum(married_yes) as Married_Yes
from
(
select gender,
case when Status='Married_No' then Avg_Claim_amt else 0.00 end as Married_No,
case when Status='Married_Yes' then Avg_Claim_amt else 0.00 end as Married_Yes
from data) as x
group by gender;
gender | married_no | married_yes
--------+------------+-------------
M | 2308.96 | 1485.45
F | 2269.10 | 1546.19
You want to use the TRANSPOSE task in SAS EG with the following parameters defined:
https://documentation.sas.com/?activeCdc=egdoccdc&cdcId=egcdc&cdcVersion=8.3&docsetId=egamotasks&docsetTarget=n0f6xtevrykn11n1n8n41ocko1cj.htm&locale=en&docsetVersion=8.3
If you're doing it via code, use #LuizZ's response instead; the task should build very similar code. Another option is to 'back up' and create that table via a different process or task that generates it in the desired format from the start.
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns - account#, sales, transaction, store#) and table b (columns - account#, account zipcode) and an excel file (columns - store# and store zipcode). I need to first join these two tables on column account#.
Next step is to join the resulting values with the excel file on column store#, and also add a column called 'distance', which calculates the distance between account zipcode and store zipcode using the zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next, I want to use a CASE statement to create a distance-bucket column based on the distance from the above query, e.g.,
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two data step set statements and join directly on those files themselves:
create table d as
...
from xyzstore.filea as a left join xyzstore.fileb as b
...
quit;
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15 - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
INNER JOIN
filec as c
ON d.store_number = c.store_number
)
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
How do I start this?
I have two data sets.
For the output you will deliver:
It should be in Excel or XML format
Each query logic/programmed check should be on its own tab
Columns should be:
Subject #,
Visit Date (You will need the Visit Date Listing also attached)
Visit Name (Visit date from the file_34422 must match Visit name in the Blood Pressure File)
Date of Assessment (From the BP Log), VSBPDT_RAW, VSTPT, BP results.
A column each for SYSBP1, SYSBP2, SYSBP3, DIABP1, DIABP2, DIABP3
Findings/query text.
Below are the specifications for BP:
For the same SUBJECT and same FOLDERNAME, where VSTPT is Blood Pressure 1:
if VSBPYN is No, then all must be null or =0 (VSBPDT_RAW, VSBPTM1, SYSBP1, DIABP1, VSBPND2, VSBPTM2, SYSBP2, DIABP2, VSBPND3, VSBPTM3, SYSBP3, DIABP3)
This is what I have started with:
proc sql;
select
f.subject,
f.SVSTDT_RAW, f.FolderName,
b.FolderName,
VSBPDT_RAW, VSTPT,
SYSBP1, SYSBP2, SYSBP3,
DIABP1, DIABP2, DIABP3
FROM first_data as f, bp_data as b
group by subject, foldername
where f.subject = b.subject
having VSTPT is Blood Pressure set 1,
VSBPYN is No;
quit;
I just need to be pointed towards the right direction. I know this can't be right.
I do not know the exact structure of your data, so the solution below may need to be modified by you to select the right columns.
From the description, this looks like a good situation for SQL plus a data step. You have a lot of columns to merge with the bp table, and it is easy to merge all of those columns with first_data in SQL.
When you have lots of by-row conditionals, a data step is easier to write and read than many CASE statements in SQL, so we'll take a two-stage approach: SQL for the merge, then a data step for the conditional logic.
Step 1: Merge the data
proc sql noprint;
create table stage as
select t1.*
, t2.VSBPYN
from bp_data as t1
INNER JOIN
first_data as t2
ON t1.subject = t2.subject
AND t1.foldername = t2.foldername
where t1.VSTPT = 1
;
quit;
Step 2: Conditionally set values to missing
Next, we'll use a data step for our conditional logic. call missing() is a useful function that lets you set many variables to missing in a single statement.
data want;
set stage;
if(upcase(VSBPYN) = 'NO') then call missing(VSBPDT_RAW, VSBPTM1, SYSBP1, DIABP1,
VSBPND2, VSBPTM2, SYSBP2, DIABP2,
VSBPND3, VSBPTM3, SYSBP3, DIABP3
);
run;
Step 3: Output to Excel
Finally, we send the output to Excel.
proc export
data=want
outfile='/my/location/want.xlsx'
dbms=xlsx
replace;
run;
I need help reading the code below. I am not sure what specific parts in this code are doing. For example, what does ( firstobs = 2 keep = column3 rename = (column3 = column4) ) do?
Also, what does ( obs = 1 drop = _all_ ); do?
I have also not used column5 = ifn( first.column1, (.), lag(column3) ); before. What does this do?
I am reading someone else's code. I wish I could provide more detail. If I find a solution, I will post it. Thanks for your help.
data out.dataset1;
set out.dataset2;
by column1;
WHERE column2 = 'N';
set out.dataset1 ( firstobs = 2 keep = column3 rename = (column3 = column4) )
out.dataset1 ( obs = 1 drop = _all_ );
FORMAT column5 DATETIME20.;
FORMAT column4 DATETIME20.;
column5 = ifn( first.column1, (.), lag(column3) );
column4 = ifn( last.column1, (.), column4 );
IF first.column1 then DIF=intck('dtday',column4,column3);
ELSE DIF= intck('dtday',column5,column3);
format column6 $6.;
IF first.column1
OR intck('dtday',column5,column3) GT 20 THEN column6= 'HARM';
ELSE column6= 'REPEAT';
run;
Seems like you need to learn about the SAS data step language!
The things happening in parentheses are data step options.
You can use those options whenever you are referencing a table, even in PROC SQL.
The options you have:
firstobs = starts the read at the given record; in your case, 2 means SAS will start reading the table at the 2nd record.
keep = uses only the fields listed rather than all the fields in the table.
rename = renames a field, so it works like an alias in SQL.
obs = limits the number of records you pull out of a table, like TOP or LIMIT in SQL.
drop = removes the listed fields from the table; in your case _all_ is used, which means it drops all the fields.
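If you want to see those options in action, here is a minimal sketch using the built-in sashelp.class table; the result has 3 rows and a single variable named student:
data demo;
    /* start at record 2, stop after record 4, read only NAME and rename it */
    set sashelp.class ( firstobs = 2 obs = 4 keep = name rename = (name = student) );
run;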
As for the functions:
LAG keeps the value from the previous record for the field you put in parentheses, here column3.
IFN works like a CASE or IF: you give a condition as the 1st argument; the 2nd argument is returned when the condition is true, and the 3rd argument when it is false.
So, to answer that question: if it's the first record for the by-variable column1, then column5 will be missing (.); otherwise it will be the previous value of column3.
The best advice I can give you is to Google SAS plus the name of whichever function you are wondering about; after that it's just a matter of how the pieces fit together!
Hope this helps!
Basically your data step is using the LAG() function to look back one observation and the extra SET statement to look ahead one observation.
The IFN() function calls are then being used to make sure that missing values are assigned when at the boundary of a group.
You then use these calculated previous (column5) and next (column4) dates to calculate the DIF variable.
Note that for this to work you need to reference the same input dataset in the two different SET statements. (Strictly speaking, the dataset used last, with the obs=1 and drop=_all_ dataset options, doesn't need to be the same, since no actual data is read from it; it just has to have at least one observation.)
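Here is a minimal sketch of that pattern on the built-in sashelp.class table (the variable names are mine, not from your code):
data neighbors;
    set sashelp.class;                                  /* current record */
    set sashelp.class ( firstobs = 2 keep = age rename = (age = next_age) )
        sashelp.class ( obs = 1 drop = _all_ );         /* pads the last iteration */
    prev_age = lag(age);   /* look back: missing on the first record */
    /* next_age is stale on the last record, which is why your code
       blanks it with ifn(last.column1, (.), column4) */
run;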
( firstobs = 2 keep = column3 rename = (column3 = column4) ) do?
Here firstobs=2 tells SAS to start reading the data from the 2nd observation in the dataset, and the rename option changes the name of the variable.
(obs = 1 drop = _all_);
obs=1 reads only the 1st observation in the dataset. If you specify obs=2, then observations up to the 2nd will be read.
drop=_all_ drops all of your variables.
Firstobs:
Reads part of the data. If you specify firstobs=10, it starts reading the data from the 10th observation.
Obs:
If you specify obs=15, the data will be read up to the 15th observation.
If you run the step below, it gives you 3 observations (from the 2nd to the 4th) in the output.
Example:
DATA INSURANCE;
INFILE CARDS FIRSTOBS=2 OBS=4;
INPUT NAME$ GENDER$ AGE INSURANCE $;
CARDS;
SOWMYA FEMALE 20 MEDICAL
SUNDAR MALE 25 MEDICAL
DIANA FEMALE 67 MEDICARE
NINA FEMALE 56 MEDICAL
RUN;
I have a question about the following 2 codes in SAS PROC SQL.
Code 1: (Standard Book version)
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)
Code 2: (The much faster way in practice)
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN ('10001', '10002', '10003', ... '15000')
When I try to do it more elegantly by using the subquery in Code 1, the run time blows up to 50+ minutes, while the same input returns within 3 minutes using Code 2. Why is that? Note that it's just as slow using an INNER JOIN too (after reading this). The input is 5000+ CLAIMIDs, which I manually paste into the IN('...') block every day.
PS: The CLAIMID values are made up; in real life they are random.
The CLAIMID column is indexed in DW.CLAIMS_BAV. I am using SAS PROC SQL to access an Oracle database. What is going on, and is there a better way? Thanks!
I can't tell you exactly why SAS is so slow at the first select; clearly something's not optimized in that scenario.
If I had to guess, I'd guess that in the first case SAS decides it can't use pass-through SQL, so it downloads the whole big table and runs the query SAS-side, while in the second case it passes the query up to the SQL database and only transports the resulting rows back.
But there are several ways to work around this, anyway. Here's one: use a macro variable to do precisely the pasting you're doing!
proc sql;
select quote(strip(claimid)) into :claimlist separated by ','
from work.input
;
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (&claimlist.)
;
quit;
Tada, you don't have to touch this anymore, and it's identical to the copy/paste that you did.
A few extra notes given some comments:
If CLAIMID is ever shorter than 15 characters, you may have space padding, so I added strip to remove it. The padding doesn't matter for string comparisons (except insofar as it eats macro variable space), and I worry that some DBMSs may actually care about it. You can leave out strip if 15 is a constant length.
Macro variables hold up to 64K characters. With a 15-character value plus two quote characters plus a comma, each entry takes 18 characters, which leaves room for a bit over 3,500 values. That's under 5,000, unfortunately.
In this case, you can either split the list into two macro variables (easy enough, hopefully: use the obs and firstobs dataset options, as sketched below) or do some other solution.
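A hedged sketch of the split (the cut-off at 2500 assumes roughly 5000 IDs):
proc sql noprint;
    select quote(strip(claimid)) into :claimlist1 separated by ','
        from work.input(obs=2500);                /* first 2500 IDs */
    select quote(strip(claimid)) into :claimlist2 separated by ','
        from work.input(firstobs=2501);           /* the rest */
quit;
/* ... AND a.CLAIMID IN (&claimlist1., &claimlist2.) ... */
Some other solutions: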
Transfer the work.input dataset into the DW libname, then do the join in SQL there.
Put the contents of the claimID into a file instead of into a macro variable, and then %include that file.
Use call execute to execute the whole proc SQL.
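For the first of these options, a sketch assuming you have write access in the DW libname (the table name claim_ids is my invention):
data DW.claim_ids;    /* creates a table on the database side */
    set work.input(keep=claimid);
run;

proc sql;
    create table work.output as
    select "CLAIM" as SOURCE, a.CLAIMID, a.DXCODE
    from DW.CLAIMS_BAV as a
         inner join DW.claim_ids as b
            on a.CLAIMID = b.CLAIMID
    where a.SITEID = '0001';
quit;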
Here's one example of CALL EXECUTE.
data _null_;
set work.input end=eof;
if _n_=1 then do;
call execute('proc sql; CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = "0001"
AND a.CLAIMID IN ('); *the part of the SQL query before the list of IDs;
end;
call execute(quote(claimID) || ' ');
if EOF then do;
call execute('); QUIT;'); *the part of the SQL query after the list of IDs;
end;
run;
This would be nearly identical to the %INCLUDE solution really, except there you put that stuff to a text file instead of CALL EXECUTEing it, and then you %INCLUDE that text file.
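For completeness, a hedged sketch of that %INCLUDE variant (the fileref name and file layout are my choices):
filename sqlcode temp;

data _null_;
    set work.input end=eof;
    file sqlcode;
    length line $40;
    line = quote(strip(claimid));        /* "10001" etc., one per line */
    if _n_ = 1 then put 'proc sql;'
                      / 'create table work.output as'
                      / 'select "CLAIM" as SOURCE, a.CLAIMID, a.DXCODE'
                      / 'from DW.CLAIMS_BAV as a'
                      / "where a.SITEID = '0001'"
                      / 'and a.CLAIMID in (';
    put line;                            /* SAS SQL accepts a space-separated IN list */
    if eof then put '); quit;';
run;

%include sqlcode;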
I think you're working with both local data and data on your server. When SAS works with data from different sources (databases), it brings it all into SAS for processing, which can be really, really slow.
Instead, you can build a macro variable and use that within your query. A macro variable's size limit is 64K characters, so whether 5000 IDs fit in one depends on their length; with quotes and a separator, each value can take no more than about 13 characters. If they don't fit, you could create a macro instead.
proc sql noprint;
select quote(claimID, "'") into : claim_list separated by ", " from input;
quit;
proc sql;
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (&claim_list.);
quit;
Please be sure to use
option sastrace=',,,ds' sastraceloc=saslog nostsuffix;
to see how your code is translated by the SAS/ACCESS engine into database statements.
To give SAS a hint to dynamically build an IN (1,2,3, ..) clause from your IN (SELECT .. subquery:
add MULTI_DATASRC_OPT=IN_CLAUSE to your libname DW ... statement (a libname example follows the queries below), and
add the DBMASTER= dataset option to the "master" table,
like one of the following queries:
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV (dbmaster=yes) AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)
or
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV (dbmaster=yes) AS a
inner join WORK.INPUT AS b
on a.CLAIMID = b.CLAIMID
WHERE
a.SITEID = '0001'
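For the libname piece, a hedged example assuming an Oracle engine (the connection details are placeholders):
libname DW oracle user=myuser password=mypass path=mydb
        multi_datasrc_opt=in_clause;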
Using IN() without a subquery is definitely faster, but another performance consideration to keep in mind is the network and compute-server load/traffic at the time of running, assuming you are on a client/server configuration.
If you plan to use the select-into-macro-variable solution, keep in mind the count of distinct values and the length of the string you are saving in the macro variable, as there is a size limit.
You can also save the In() values in a table and just do a join.
PROC SQL;
/* Claims for the site */
CREATE TABLE WORK.OUTPUT1 AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001';
/*ID Lookup Table*/
CREATE TABLE WORK.OUTPUT2 AS
SELECT
DISTINCT b.CLAIMID FROM WORK.INPUT AS b
;
/*Inner Join Table / AKA lookup join*/
CREATE TABLE WORK.Final AS
SELECT
a.SOURCE, a.CLAIMID, a.DXCODE
FROM WORK.OUTPUT1 AS a INNER JOIN WORK.OUTPUT2 AS b
ON a.CLAIMID = b.CLAIMID
;
QUIT;
I have to join 2 tables on a key (say ABC). I have to update one single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
contains some 100 columns. Key column: ABC.
TABLE B:
contains just 2 columns: key column ABC and status_cd.
TABLE A, which I use in this left-join query, has more than 100 columns. Is there a way to use a.* together with this coalesce function in my PROC SQL; CREATE TABLE AS ... step, without ending up with an extra status_cd column?
Thanks in advance.
You can take advantage of dataset options so that you can still use a.* in the select statement. Note that the order of the columns could change when doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
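A minimal sketch of explicit pass-through, assuming Oracle and that both tables live in the database (connection details are placeholders):
proc sql;
    connect to oracle (user=myuser password=mypass path=mydb);
    execute (
        update master a
        set status_cd = (select b.status_cd
                         from transaction b
                         where a.abc = b.abc)
        where exists (select 1 from transaction b
                      where a.abc = b.abc)
    ) by oracle;
    disconnect from oracle;
quit;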
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted, then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.