I am trying to collapse my data using proc sql. However, I noticed that when I collapsed my data I lost a bunch of variables that I wanted to keep. I am collapsing on the variable MRN (which is numeric). The other variables I want to keep are CITY and SITE (these are character values). They are constant for each unique MRN, so collapsing them should be fine.
Here is the code I am using:
proc sql;
create table collapsed_data as
select distinct mrn,
sum(msk_tx_yes) as msk_tx_yes,
sum(msk_cancel_tx_yes) as msk_cancel_tx_yes,
sum(msk_ca_yes) as msk_ca_yes,
sum(msk_cancel_ca_yes) as msk_cancel_ca_yes,
sum(msk_dc_yes) as msk_dc_yes,
sum(conc_psych_tx_yes) as conc_psych_tx_yes,
sum(conc_psych_ca_yes) as conc_psych_ca_yes,
sum (conc_psych_dc_yes) as conc_psych_dc_yes,
sum (conc_yes) as conc_yes,
sum (psych_yes) as psych_yes,
sum (foot_prog) as foot_prog,
sum (hand_prog) as hand_prog,
sum (surg_prog) as surg_prog,
sum (sx_yes) as sx_yes
from temp_collapsed_data
group by mrn;
quit;
I'm not sure how to use the SELECT and DISTINCT functions together.
I thought maybe I could add the variables CITY and STATE after SELECT, while keeping DISTINCT, but it doesn't seem to work.
I want to be able to keep CITY and STATE in the new table along with the new summed variables I am making. How can I achieve this without turning CITY and STATE into dummy coded variables? I would like to keep them as character values if possible.
Anyone know how I can achieve this?
Your code is already correct. Just add the variables to the SELECT clause.
proc sql;
create table collapsed_data as
select distinct mrn, city, site,
sum(msk_tx_yes) as msk_tx_yes,
sum(msk_cancel_tx_yes) as msk_cancel_tx_yes,
sum(msk_ca_yes) as msk_ca_yes,
sum(msk_cancel_ca_yes) as msk_cancel_ca_yes,
sum(msk_dc_yes) as msk_dc_yes,
sum(conc_psych_tx_yes) as conc_psych_tx_yes,
sum(conc_psych_ca_yes) as conc_psych_ca_yes,
sum (conc_psych_dc_yes) as conc_psych_dc_yes,
sum (conc_yes) as conc_yes,
sum (psych_yes) as psych_yes,
sum (foot_prog) as foot_prog,
sum (hand_prog) as hand_prog,
sum (surg_prog) as surg_prog,
sum (sx_yes) as sx_yes
from temp_collapsed_data
group by mrn;
quit;
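Because city and site are not in the GROUP BY clause, SAS remerges the summed values back onto every original row (the log prints a note about remerging summary statistics). The DISTINCT keyword then removes the duplicate rows, so you end up with one row per MRN as long as city and site are constant within each MRN.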
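An alternative sketch, assuming CITY and SITE really are constant within each MRN: list them in GROUP BY as well, so that every non-aggregated column is a grouping column and DISTINCT is no longer needed. Only two of the sums are shown here to keep it short; the rest follow the same pattern.

```sas
proc sql;
  create table collapsed_data as
  select mrn,
         city,
         site,
         sum(msk_tx_yes) as msk_tx_yes,
         sum(sx_yes)     as sx_yes      /* add the remaining sums the same way */
  from temp_collapsed_data
  group by mrn, city, site;             /* one output row per mrn/city/site combination */
quit;
```

If an MRN turns out to have more than one CITY or SITE value, this version returns one row per combination, which makes the inconsistency easy to spot.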
I have the following query in M:
= Table.Combine({
Table.Distinct(Table.SelectColumns(Tab1,{"item"})),
Table.Distinct(Table.SelectColumns(Tab2,{"Column1"}))
})
Is it possible to get it working without prior changing column names?
I want to get something similar to SQL syntax:
select item from Tab1 union all
select Column1 from Tab2
If you need just one column from each table then you may use this code:
= Table.FromList(List.Distinct(Tab1[item])
& List.Distinct(Tab2[Column1]))
If you use M (like in your example, or the Append Queries option), the column names must be the same, otherwise it won't work.
But it works in DAX with the command
=UNION(Table1; Table2)
https://learn.microsoft.com/en-us/dax/union-function-dax
Table.Combine can't do this in Power Query M when the column names differ, because it builds the union from columns whose names match. If you want to keep everything in a single step, you can nest the rename step in place of Tab2, as you did with Table.SelectColumns.
Matching on column names is how it lines the columns up correctly.
Hope you can manage it in a single step, if that's what you want.
I have list of keywords; CORPORATE, REAL ESTATE, COMPETITION, TRADE, DISPUTE
I want to be able to count the number of occurrences these keywords appear between columns practice_area1—Practice_area10. Therefore, I want to scan across 10 columns and create new columns. Each new column will represent each of the keywords (above) and a count as a value e.g. Corporate 4
Once this statement has run and we have our new columns, I want to create a new variable “Practice Group” which is populated with the keyword that has the highest count among the new variables we have just created. The dataset below is an example of how the data should look:
Please could somebody offer me some advice of the best approach to do this?
Many Thanks
Chris
All you need to do is make an array of all the columns you want to check. Then loop over the columns, use the count function to count each word you are looking for, and add that count to a running total inside the loop.
The code below checks three columns for three values. You can apply it to as many columns as you want.
data have(drop=i);
col1 = 'CORPORATE, REAL ESTATE, REAL ESTATE';
col2 = 'CORPORATE, CORPORATE, TRADE, TRADE, TRADE, REAL ESTATE';
col3 = 'TRADE, TRADE, DISPUTE,REAL ESTATE';
array col[*] col1 - col3;
/* start every counter at zero */
realestate = 0;
trade = 0;
corporate = 0;
/* dim(col) keeps the loop correct if more columns are added to the array */
do i = 1 to dim(col);
/* count() returns how many times the keyword occurs in the column */
realestate = realestate + count(col[i], 'REAL ESTATE');
trade = trade + count(col[i], 'TRADE');
corporate = corporate + count(col[i], 'CORPORATE');
end;
run;
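The question also asks for a "Practice Group" variable holding the keyword with the highest count. A minimal sketch of one way to do that, assuming the counter variables corporate, realestate and trade from the step above already exist in a dataset called have; the names want, practice_group and idx are made up for illustration:

```sas
data want;
  set have;
  array counts[*] corporate realestate trade;
  length practice_group $32;
  /* position of the first counter equal to the row maximum */
  idx = whichn(max(of counts[*]), of counts[*]);
  /* vname returns the name of the variable at that array position */
  practice_group = vname(counts[idx]);
  drop idx;
run;
```

When two keywords tie, whichn returns the first match, so the order of the variables in the array decides which one wins.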
I have two table with exactly the same column headers and one row each. I have the code to concatenate them which works fine.
data concatenation;
set CURR_CURR CURR_30;
run;
However, there is no index in the output to say which row corresponds to which table.
I've tried using 'create index' and 'index create' already but they don't work syntactically. Simply I'd just want to add a column of strings and move it to the front of all the other columns in the data set.
Use the INDSNAME= option on the SET statement, plus a variable to store the information.
If you set the length statement ahead of your SET statement it will create it as the first column.
Just a note that this isn't the same as an 'index'. An index in SAS has a different meaning which isn't what you're trying to create here.
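data concatenation;
/* declared before SET so that dset becomes the first column */
length dset source $50;
/* source holds the name of the dataset the current row came from; it is temporary and not written to the output */
set CURR_CURR CURR_30 indsname=source;
/* copy it into a permanent variable */
dset = source;
run;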
Reeza's answer is very similar to something I figured out that worked as well. Here's my version as an alternative.
data concatenation;
length id $ 10;
set CURR_CURR (in=a) CURR_30 (in=b);
if a then id = 'curr_curr';
else if b then id = 'curr_30';
run;
I am rather a beginner in SAS. I have the following problem. Given is a big data set (my_time) which I imported into SAS looking as follows
I want to implement the following algorithm
for every account look for a status and if it is equal to na then look for the same contract after one year (one year after it gets the status na) and put the information "my_date", "status" and "money" in three new columns "new_my_date", "new_status" and "new_money" like in
I need something like countifs in excel. I found loops in SAS like DO but not for the purpose to look through all rows.
I do not even know for which key word I have to look.
I would be grateful for any hint.
A simple method is to sort the data, then use the automatic first. variables from by-group processing together with a retain statement to get the desired result.
Step 1: Sort by account, date, and status
proc sort data=have;
by account my_date status;
run;
This guarantees that your data is in the order you need. Since we are only looking for year+1 after status = 'na', anything that happens in between doesn't matter.
Step 2: Use a data step to remember the first year when na happens for that account
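data want;
set have;
by account my_date status;
/* first_na_month must be retained too, otherwise it is reset to missing on every row */
retain first_na_year first_na_month first_na_account;
/* start of a new account: forget what was remembered for the previous one */
if first.account then call missing(first_na_year, first_na_month, first_na_account);
if status in ('na', 'tna') then do;
first_na_year = year;
first_na_month = month;
first_na_account = account;
end;
/* same account, same month, one year after the remembered 'na' row */
if year = first_na_year + 1
and month = first_na_month
and account = first_na_account
and status not in ('na', 'tna')
then do;
new_status = status;
new_my_date = my_date;
new_money = money;
end;
/* subsetting IF: keep the row only when none of the three new variables is missing */
if cmiss(new_status, new_my_date, new_money) = 0;
drop first:;
run;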
For each row, we check four things:
Is the status not 'na' (or 'tna')?
Is the year 1 year bigger than the last time it was 'na'?
Is the month the same as the last time it was 'na'?
Is this the same account we're comparing?
If all are true, then we want to create the three new variables.
What's happening:
SAS is inherently a looping language, so we do not need to use a do loop here. When the data step moves to a new row, it resets the variables created in the step to missing in the Program Data Vector (PDV) and fills in the variables from the input dataset with the values of the new row.
Since the SAS data step only goes forwards and doesn't like to go backwards, we want it to remember when na occurs for that account. retain tells SAS not to discard the value of a variable when it reads a new row.
When we are done with our comparison and have moved on to the next account, we reset these variables to missing. by-group processing lets SAS know exactly where the first and last occurrence of the account is in the dataset.
At the end, we output only if all 3 of the new variables are not missing. cmiss counts how many of its arguments are missing, so a result of 0 means all three are populated. Note that, when there is no explicit output statement, output is implied before the run statement, so we simply need to use an "if without then" (a subsetting IF) in this case.
The final statement, drop first:;, is a simple shortcut to remove any variables whose names start with the prefix first. This prevents them from being shown in the final dataset.
I’m wondering if someone can help with a coding problem I have.
Background – I have a project that imports some files and uses the data in those files to perform projections. The contents of the files determine some aspects of the size of the output that follows. Simply put, values in the loaded data drive the size and shape of the tables that follow, and this can vary.
The following code is an example of the problem.
The data loaded will have a variable year start (note wf2009, 2009 is the first year) and variable range (this example goes from 2009 to 2030, but this will vary too).
proc summary data= labeled_proj_data_hc;
class jurisdiction specialty measure;
types jurisdiction*specialty*measure;
VAR wf2009--wf2030;
output out= sum_labeled_proj_data_hc
sum(wf2009) = y2009
sum(wf2010) = y2010
sum(wf2011) = y2011
sum(wf2012) = y2012;
run;
Where I’m not sure how to proceed is:
sum(wf2009) = y2009
sum(wf2010) = y2010
sum(wf2011) = y2011
sum(wf2012) = y2012;
In the sequence of lines calling for the sum of their respective columns, how can I make this dynamic so that the start year is populated from a variable and increments yearly until the last year, which is also variable?
Has anyone solved a similar problem?
Cheers,
Is renaming the variables necessary? If not then you can use the : wildcard operator to access all variables that begin with 'wf', then just put SUM= in the output statement, which will preserve the original names.
So your proc summary would look like this.
proc summary data= labeled_proj_data_hc;
class jurisdiction specialty measure;
types jurisdiction*specialty*measure;
VAR wf: ;
output out= sum_labeled_proj_data_hc
sum=;
run;
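If the y-prefixed names are needed after all, one possible sketch is to hold the first and last year in macro variables and use numbered range lists on both sides. The macro variable names first_year and last_year are hypothetical, and the values are hard-coded here only for illustration; in practice they would be derived from the loaded data. This assumes every variable from wf&first_year through wf&last_year exists in the input dataset.

```sas
%let first_year = 2009;   /* hypothetical: derive these from the loaded data */
%let last_year  = 2030;

proc summary data=labeled_proj_data_hc;
  class jurisdiction specialty measure;
  types jurisdiction*specialty*measure;
  /* resolves to wf2009-wf2030, a numbered range list */
  var wf&first_year-wf&last_year;
  output out=sum_labeled_proj_data_hc
         sum=y&first_year-y&last_year;   /* output names y2009-y2030, in the same order as VAR */
run;
```

The output names are matched to the VAR variables by position, so both lists need to cover the same years.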