I have a dataset that looks like this, and am using SAS Enterprise Guide 6.3:
data have;
input id state $;
cards;
134 NC,NY,SC
145 AL,NC,NY,SC
;
run;
I have another dataset that has several metrics for each id in every state, but I only need to pull the data for the states listed in the second column of the have dataset.
data complete;
input id state $ metric;
cards;
134 AL 5
134 NC 4.3
134 NY 4
134 SC 5.5
145 AL 1.3
145 NC 1.3
145 NY 1.5
145 SC 1.1
177 AL 10
177 NC 74
177 NY 23
177 SC 33
;
run;
I thought about using trnwrd to replace the comma with ', ' and concatenating a beginning and ending quote to make the list a character list, so that I could use a WHERE IN statement. However, I think it would be more helpful if I could somehow transpose the comma separated list to something like this:
data have_mod;
input id state $;
cards;
134 NC
134 NY
134 SC
145 AL
145 NC
145 NY
145 SC
;
run;
Then I could simply join this table to the complete data table to get the subset I need (below).
data want;
input id state $ metric;
cards;
134 NC 4.3
134 NY 4
134 SC 5.5
145 AL 1.3
145 NC 1.3
145 NY 1.5
145 SC 1.1
;
run;
Any thoughts? Thanks.
I'd do exactly what you propose and transpose it - except i'd read it in that way.
data have;
infile datalines truncover dlm=', ';
length state $2;
input id #; *read in the id for that line;
do until (state=''); *keep reading in until state is missing = EOL;
input state $ #;
if not missing(state) then output;
end;
cards;
134 NC,NY,SC
145 AL,NC,NY,SC
;
run;
Alternately, you can SCAN for the first statecode.
data want_to_merge;
set have;
state_first = scan(state,1,',');
run;
SCAN is the function equivalent of reading in a delimited file.
Related
Piggy backing on a similar question I asked
(Summing a Column By Group In a Dataset With Macros)...
I have the following dataset:
Month Cost_Center Account Actual Annual_Budget
May 53410 Postage 23 134
May 53420 Postage 7 238
May 53430 Postage 98 743
May 53440 Postage 0 417
May 53710 Postage 102 562
May 53410 Phone 63 137
May 53420 Phone 103 909
May 53430 Phone 90 763
June 53410 Postage 13 134
June 53420 Postage 0 238
June 53430 Postage 48 743
June 53440 Postage 0 417
June 53710 Postage 92 562
June 53410 Phone 73 137
June 53420 Phone 103 909
June 53430 Phone 90 763
I would like to "splice" it so each month has its own respective column for Actual while summing the numeric values by Account.
So for example, I want the output to look like the following:
Account May_Actual_Sum June_Actual_Sum Annual_Budget
Postage 14562 37960 255251
Phone 4564 2660 32241
The code below provided by a fellow user works great when not needing to further dis-aggregated by month; however, I'm not sure if it's possible to do so (I tired adding a 'by month clause' - didn't work).
proc means data=Test N SUM NWAY STACKODS;
class Account_Description;
var Actual annual_budget;
by month;
ods output summary = summary_stats1;
output out = summary_stats2 N = SUM= / AUTONAME;
data want;
set summary_stats2;
run;
Use PROC MEANS to get summaries - same as last time. Please read up the documentation on PROC MEANS to understand how the CLASS statements works and how you can control the different levels of output.
Use PROC TRANSPOSE to flip the data wide. Since the budget amount is consistent across rows you'll be fine.
I'm guessing your next set of question will then be how to sort the columns correctly because your months won't sort and how to reference them dynamically to calculate the month to date changes. Which are some of the reasons why this data structure is not recommended.
data have;
input Month $ Cost_Center $ Account $ Actual Annual_Budget;
cards;
May 53410 Postage 23 134
May 53420 Postage 7 238
May 53430 Postage 98 743
May 53440 Postage 0 417
May 53710 Postage 102 562
May 53410 Phone 63 137
May 53420 Phone 103 909
May 53430 Phone 90 763
June 53410 Postage 13 134
June 53420 Postage 0 238
June 53430 Postage 48 743
June 53440 Postage 0 417
June 53710 Postage 92 562
June 53410 Phone 73 137
June 53420 Phone 103 909
June 53430 Phone 90 763
;
;
;;
run;
*summarize;
proc means data=have noprint nway;
class account month;
var actual annual_budget;
output out=temp sum=actual_total budget_total;
run;
*transpose;
proc transpose data=temp out=want prefix=Month_;
by account budget_total;
var actual_total;
id month;
run;
Output:
I cannot think of a way to generate this report using just one PROC. You will need to do some post processing of PROC MEANS or PROC SUMMARY results to get to this:
proc means data=have SUM ;
class Account month;
var Actual annual_budget;
output out = summary_stats SUM=;
run;
/* Look at summary_stats to understand it's structure here */
/* Otherwise you will not understand the following code */
proc sort data = summary_stats;
where _type_ in (2,3);
by account;
run;
data want;
set summary_stats;
by account ;
retain May_Actual_Sum June_Actual_Sum Annual_Budget_sum;
if first.account then Annual_Budget_sum = Annual_Budget;
else do;
select(month);
when ('May') May_Actual_Sum = actual;
when ('June') June_Actual_Sum = actual;
/* List other months also here. Can use some macros here to make the code compact and expandable for future enhancements */
end;
end;
if last.account then output;
keep account May_Actual_Sum June_Actual_Sum Annual_Budget_sum;
run;
I´m trying to combine and sum certain observations of a dataset with different values for their common variables, in this case, I am trying to combine the deaths of three age intervals (85-90), (91-95), (95+) in one only (85+) age interval. Our teacher told us it is better if we do not create a new variable and use proc means, tabulate etc.
I have read every google page and all I can find is a proc means combining and summing by variable, but I don´t need the whole group summed, just some observations of the group.
Having the dataset like:
.
.
.
71 to 75 3
76 to 80 4
81 to 85 2
86 to 90 3
91 to 95 1
95+ 3
I would like to have it like
.
.
.
71 to 75 3
76 to 80 4
81 to 85 2
85+ 7
Thanks!
Create a custom format to map the existing literal categorizations into a new ones.
* A format to map literal agecat strings to broader categories;
proc format ;
value $age_cat_want (default=20)
'86 to 90' = '86+'
'91 to 95' = '86+'
'95+' = '86+'
;
This only works for concatenating categories, creating a coarser aggregation.
Example:
* A format to get you into the pickle you are in;
proc format;
value age_cat_have
71-75 = '71 to 75'
76-80 = '76 to 80'
81-84 = '81 to 85'
86-90 = '86 to 90'
91-95 = '91 to 95'
95-high = '95+'
;
data have;
input age ##;
agecat = put (age, age_cat_have.);
datalines;
71 72 73
76 77 78 79
82 83
87 86 86
94
99 101 113
;
proc freq data=have;
title "Original categories are character literals";
table agecat;
run;
* A format to map literal agecat strings to broader categories;
proc format ;
value $age_cat_want (default=20)
'86 to 90' = '86+'
'91 to 95' = '86+'
'95+' = '86+'
;
proc freq data=have;
title "New age categories via custom format $age_cat_want";
table agecat;
format agecat $age_cat_want.;
run;
Note: An existing literal categorization cannot be explicitly split. You would have to make presumptions about the age value distribution within each category and impute a specific age that could be applied to a different age mapping format.
I have 2 datasets like below.
dataset ab;
input m;
cards;
102
103
104
run;
dataset ac;
input m;
cards;
102
102
103
103
104
104
104
run;
when i wrote the below statement,
data a;
merge ab ac;
by m;
run;
I got the output as 102 102 103 103 104 104 104
but when i wrote update statement,
data b;
update ab ac;
by m;
run;
i got output as 102 103 104.
Can you please explain me what has happened in the update statement.
Thanks in Advance,
Nikhila
Update applies the transactions 1 by 1. The master table is required to have Unique BY values which is true. The transaction table has multiples, but doesn't have any new values so they are not added.
If the transaction had a BY value not in the table it would add it.
With an UPDATE and BY the following may help:
BY value is in transaction dataset AND master - >records in master are updates with values from transaction. If there are multiple records in the transaction table for a BY group they're each applied in order. There will only be one record in the Master table with the value from the last match in the transaction table.
BY value in transaction, not in master -> Record is added to master table
BY value is not in transaction, is in Master -> record in master remains unchanged.
This would be easier to see if you add a second variable to your test datasets that are unique.
data ab;
input m ##;
cards;
101 102 103 104
;;;;
run;
data ac;
input m ##;
cards;
102 102 103 103 104 104 104
;;;;
run;
data b;
update ab ac(in=in1);
by m;
if first.m then tCount=0;
tCount + in1;
run;
proc print;
run;
I have the following data set:
Date jobboardid Sales
Jan05 3 256
Jan05 6 70
Jan05 54 90
Feb05 32 456
Feb05 11 89
Feb05 16 876
March05
April05
.
.
.
Jan06 6 678
Jan06 54 87
Jan06 13 56
Feb06 McDonald 67
Feb06 11 281
Feb06 16 876
March06
April06
.
.
.
Jan07 6 567
Jan07 54 76
Jan07 34 87
Feb07 10 678
Feb07 11 765
Feb07 16 67
March07
April06
I am trying to calculate a 12 month growth rate for Sales column when jobboardid column has the same value 12 months apart. I have the following code:
data Want;
set Have;
by Date jobboardid;
format From Till monyy7.;
from = lag12(Date);
oldsales = lag12(sales);
if lag12 (jobboardid) EQ jobboardid
and INTCK('month', from, Date) EQ 12 then do;
till = Date;
rate = (sales - oldsales) / oldsales;
output;
end;
run;
However I keep getting the following error message:
Note: Missing values were created as a result of performing operation on missing values.
But when I checked my dataset, there aren't any missing values. What's the problem?
Note: My date column is in monyy7. format. jobboardid is numeric value and so does the Sales.
The NOTE is being thrown by the INTCK() function. When you say from=lag12(date) the first 12 records will have a missing value for from. And then INTCK('month', from, Date) will throw the NOTE. Even though INTCK is not used in an assignment statement, it still throws the NOTE because one of its arguments has a missing value. Below is an example. The log reports that missing values were created 12 times, because I used lag12.
77 data have;
78 do Date=1 to 20;
79 output;
80 end;
81 run;
NOTE: The data set WORK.HAVE has 20 observations and 1 variables.
82 data want;
83 set have;
84 from=lag12(Date);
85 if intck('month',from,today())=. then put 'Missing: ' (_n_ Date)(=);
86 else put 'Not Missing: ' (_n_ Date)(=);
87 run;
Missing: _N_=1 Date=1
Missing: _N_=2 Date=2
Missing: _N_=3 Date=3
Missing: _N_=4 Date=4
Missing: _N_=5 Date=5
Missing: _N_=6 Date=6
Missing: _N_=7 Date=7
Missing: _N_=8 Date=8
Missing: _N_=9 Date=9
Missing: _N_=10 Date=10
Missing: _N_=11 Date=11
Missing: _N_=12 Date=12
Not Missing: _N_=13 Date=13
Not Missing: _N_=14 Date=14
Not Missing: _N_=15 Date=15
Not Missing: _N_=16 Date=16
Not Missing: _N_=17 Date=17
Not Missing: _N_=18 Date=18
Not Missing: _N_=19 Date=19
Not Missing: _N_=20 Date=20
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
12 at 85:6
NOTE: There were 20 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 20 observations and 2 variables.
One way to avoid the problem would be to add another do block something like (untested):
if lag12 (jobboardid) EQ jobboardid and _n_> 12 then do;
if INTCK('month', from, Date) EQ 12 then do;
till = Date;
rate = (sales - oldsales) / oldsales;
output;
end;
end;
I am new to SAS, so this might be a silly type of question.
Assume there are several datasets with similar structure but different column names. I want to get new datasets with the same number of rows but only a subset of columns.
In the following example, Data_A and Data_B are original datasets and SubA and SubBare what I want. What is the efficient way of deriving SubA and SubB?
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord 4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer 4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox 6295 23 3 2.5 11 2070 174 36 97 3.70 1
;
RUN;
DATA B_auto;
LENGTH make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica 5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla 3748 31 5 3.0 9 2200 165 35 97 3.21 1
VW Scirocco 6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;
DATA SubA;
set A_auto;
keep A_make A_price;
RUN;
DATA SubB;
set B_auto;
keep B_make B_price;
RUN;
Here's my new answer. This introduces quite a few concepts, but all are necessary to complete this task.
First of all I would store the required part variable names (the suffixes that are common to all datasets) in a new dataset. This keeps them all in one place and makes it easier to change if required.
The next step is to create a regular expression (regex) search string that combines all the names, separated by a pipe (|), which is the regex symbol for or. I've also added a $ symbol to end of the names, this ensures only variables ending with the part names will be selected.
select into :[macroname] is the method to create macro variables within proc sql
Then I set up a macro to extract the specific variable names for the current dataset and use those names to create a view (like my original answer)
The dictionary library referenced in the proc sql is a metadata library that contains information on all active libraries, tables, columns etc, so is a good source of identifying what the actual variable names are called (based on the regex search string created earlier).
You won't need the proc print in your code, I just put it in to show everything is working as expected.
Let me know if this works for you
/* create intial datasets */
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord 4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer 4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox 6295 23 3 2.5 11 2070 174 36 97 3.70 1
;
RUN;
DATA B_auto;
LENGTH B_make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica 5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla 3748 31 5 3.0 9 2200 165 35 97 3.21 1
VW Scirocco 6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;
/* create dataset containing partial name of variables to keep */
data keepvars;
input part_name $ :20.;
datalines;
_make
_price
;
run;
/* create regular expression search string from partial names */
proc sql noprint;
select
cats(part_name,'$') /* '$' matches end of string */
into
:name_str separated by '|' /* '|' is an 'or' search operator in regular expressions */
from
keepvars;
quit;
%put &name_str.; /* print search string to log */
/* macro to create views from datasets */
%macro create_views (dsname, vwname); /* inputs are dataset name being read in and view name being created */
/* extract specific variable names to be kept, based on search string */
proc sql noprint;
select
name
into
:vars separated by ' '
from
dictionary.columns
where
libname = 'WORK'
and memname = upper("&dsname.")
and prxmatch("/&name_str./",strip(name))>0; /* prxmatch is regular expression search function */
quit;
%put &vars.; /* print variables to keep to log */
/* create views */
data &vwname. / view=&vwname.;
set &dsname. (keep=&vars.);
run;
/* test view by printing */
proc print data=&vwname.;;
run;
%mend create_views;
/* run macro for each dataset */
%create_views(A_auto, SubA);
%create_views(B_auto, SubB);