I have a dataset that has weekly values stored by location. I want to determine how many times the value has changed. Initially I thought I could just count distinct values, but the issue is that sometimes the values are repeated. Consider the example below:
data have;
input location $ week value;
cards;
NC 1 100
NC 2 200
NC 3 200
NC 4 200
NC 5 100
NC 6 200
SC 1 500
SC 2 500
SC 3 500
SC 4 500
SC 5 500
SC 6 500
;
run;
Notice that the value at location NC changes three times, at weeks 2, 5, and 6. The value at location SC changes 0 times.
I would like an output of the change frequency...something like:
NC 3
SC 0
Any help would be greatly appreciated. Thank you.
Use the NOTSORTED keyword on a BY statement and you can then count the number of FIRST. occurrences.
proc sort data=have;
by location week;
run;
data want;
set have;
by location value notsorted ;
if first.location then nchange=0;
else nchange + first.value;
if last.location;
keep location nchange ;
run;
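To see why this works, it can help to copy the automatic FIRST. flags into ordinary variables and look at them. The following is only a sketch: the data set name `flags` and the variables `first_loc` and `first_val` are made up for illustration, and it assumes `have` is already sorted by location and week as above.

```sas
/* Sketch: expose the FIRST. flags that the counting step relies on */
data flags;
  set have;
  by location value notsorted;
  first_loc = first.location;  /* 1 on the first row of each location */
  first_val = first.value;     /* 1 whenever a new run of equal values starts */
run;

proc print data=flags;
run;
```

Each row where `first_val` is 1 and `first_loc` is 0 is one change, which is what the sum statement accumulates.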
Make sure the data is sorted. Your example already is, but if your real data is not, sort it first:
proc sort data=have;
by location week;
run;
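After that, use the BY statement inside the data step. This creates indicators (FIRST.location and LAST.location) that tell you when you are at the start and end of a BY group.
RETAIN keeps the values of the listed variables from one observation to the next instead of resetting them to missing.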
data want;
set have;
by location;
retain last count;
if first.location then do;
count = 0;
last = value;
end;
if last ^= value then
count = count + 1;
last = value;
if last.location then
output;
run;
SAS - I'd like to count the number of times a record appears within a variable (Ref) and apply a count in a new variable (Count) for these.
E.g.
Ref  Count
1000 1
1001 1
2000 1
3000 1
1000 2
1000 3
What is the best way to do this?
That is what PROC FREQ is for. It will count the number of OBSERVATIONS for each value of a variable (or combination of variables).
proc freq data=have;
tables REF ;
run;
If you want the result in a dataset then use the OUT= option of the TABLES statement.
proc freq data=have;
tables REF / out=want;
run;
Managed to achieve the results with the code below.
Please note: the data needs to be sorted by the BY variable with PROC SORT before running this.
DATA want;
set have;
BY Variable;
IF FIRST.Variable then counter = 1;
ELSE counter + 1;
RUN;
If you use the SAS hash object, you can do it with the original order intact:
data have;
input Ref $;
datalines;
1000
1001
2000
3000
1000
1000
;
data want;
if _N_ = 1 then do;
dcl hash h();
h.definekey('Ref');
h.definedata('Count');
h.definedone();
end;
set have;
if h.find() ne 0 then Count = 1;
else Count + 1;
h.replace();
run;
I have a column for dollar-amount that I need to break apart into $1000 segments - so $0-$999, $1,000-$1,999, etc.
I could use Case/When, but there are an awful lot of groups I would have to make.
Is there a more efficient way to do this?
Thanks!
You could just use arithmetic. For example, you could convert each amount to the upper limit of its $1,000 range.
up_to = 1000*ceil(dollar/1000);
Let's make up some example data:
data test;
do dollar=0 to 5000 by 500 ;
up_to = 1000*ceil(dollar/1000);
output;
end;
run;
Results:
Obs dollar up_to
1 0 0
2 500 1000
3 1000 1000
4 1500 2000
5 2000 2000
6 2500 3000
7 3000 3000
8 3500 4000
9 4000 4000
10 4500 5000
11 5000 5000
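Note that with `ceil` an exact multiple such as 1000 lands in the same group as 500. If the bins should instead be $0-$999, $1,000-$1,999 and so on, as in the question, a `floor`-based lower bound is one option. This is a sketch only: the names `test_floor`, `lower` and `upper` are made up here, and the `+ 999` assumes whole-dollar amounts.

```sas
/* Sketch: label each amount by the lower bound of its $1,000 bin */
data test_floor;
  do dollar = 0 to 5000 by 500;
    lower = 1000*floor(dollar/1000);  /* 0, 1000, 2000, ... */
    upper = lower + 999;              /* assumes whole-dollar amounts */
    output;
  end;
run;

proc print data=test_floor;
run;
```

Here 1000 gets `lower` = 1000 and 999 would get `lower` = 0, so the boundaries match the ranges asked for.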
Absolutely. This is a great use case for user-defined formats.
proc format;
value segment
0-<1000 = '0-1000'
1000-<2000 = '1000s'
2000-<3000 = '2000s'
;
quit;
If the number is too high to write out, do it with code!
data segments;
retain
fmtname 'segment'
type 'n' /* numeric format */
eexcl 'Y' /* exclude the "end" match, so 0-1000 excluding 1000 itself */
;
do start = 0 to 1e6 by 1000;
end = start + 1000;
label = catx('- <',start,end); * what you want this to show up as;
output;
end;
run;
proc format cntlin=segments;
quit;
Then you can use segment = put(dollaramt,segment.); to assign the value of segment, or just apply the format format dollaramt segment.; if you're just using it in PROC SUMMARY or somesuch.
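As a rough illustration of the second usage, procedures such as PROC FREQ group by the formatted value, so attaching the format is enough to get counts per segment. The data set name `have` is an assumption; `dollaramt` and `segment.` are the names used above.

```sas
/* Sketch: count observations per $1,000 segment without creating a new variable */
proc freq data=have;
  tables dollaramt;
  format dollaramt segment.;  /* grouping happens on the formatted value */
run;
```

The stored values of `dollaramt` are not changed; only the grouping and display are.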
And you can combine the two approaches above to generate a user-defined format that will bin the amounts for you. The options are:
1. Create bins to set up a user-defined format. One drawback of this method is that it requires you to know the range of the data ahead of time.
2. Use a user-defined function via PROC FCMP.
3. Use a manual calculation.
I illustrate versions of the solution for 1 and 3 below. #2 requires PROC FCMP, but I think a plain data step is simpler.
data thousands_format;
fmtname = 'thousands_fmt';
type = 'N';
do Start = 0 to 10000 by 1000;
END = Start + 1000 - 1;
label = catx(" - ", put(start, dollar12.0), put(end, dollar12.0));
output;
end;
run;
proc format cntlin=thousands_format;
run;
data demo;
do i=100 to 10000 by 50;
custom_format = put(i, thousands_fmt.);
/* use floor for both bounds so exact multiples of 1000 get e.g. $1,000 - $1,999 */
manual_format = catx(" - ", put(floor(i/1000)*1000, dollar12.0), put(floor(i/1000)*1000+999, dollar12.0));
output;
end;
run;
I have a dataset similar to this. I don't know in advance how many columns or rows it will have, because it is generated by earlier code, but the first row will always be bucket 0.
DATA MY_data;
INPUT bucket D_201503 D_201504 ;
DATALINES;
0 1000 20500
1 200 6700
2 101 456
3 45 567
;
For example, in this dataset I want any value below 10% of the first-row value in its column to be set to missing. The first value of D_201503 (bucket 0) is 1000, so 45 should become missing; the same rule applies to the 20500 column. The dataset is generally not huge, but the number of columns and rows has to be determined at run time.
So I should get this as
0 1000 20500
1 200 6700
2 101 .
3 . .
I am not sure how I should loop through the dataset and make this condition
DATA MY_data;
INPUT bucket D_201503 D_201504 ;
DATALINES;
0 1000 20500
1 200 6700
2 101 456
3 45 567
;
data want;
set MY_data;
array row(*) _all_;
array _first_row(999); /*any number >= the number of columns of MY_data*/
/*we read the first line and store the values in _first_row array*/
retain _first_row:;
if _n_ = 1 then do i=1 to dim(row);
_first_row(i) = row(i);
end;
/*replacing values <10% of the first row*/
else do i=1 to dim(row);
if upcase(vname(row(i))) ne "BUCKET" and row(i) < 0.1*_first_row(i) then row(i) = .;
end;
drop i _first_row:;
run;
/*Find out how many variables there are (assume we just want all vars prefixed D_)*/
data _null_;
set my_data(obs = 1);
array vars_of_interest(*) D_:;
call symputx("nvars", dim(vars_of_interest)); /* macro variable name first, then the value */
run;
/*Save bucket 0 values to a temp array, compare each row and set missing values*/
data want;
set my_data;
array bucket_0(&nvars) _temporary_;
array vars_of_interest(*) D_:;
do i = 1 to &nvars;
if bucket = 0 then bucket_0[i] = vars_of_interest[i];
else if vars_of_interest[i] < bucket_0[i] / 10 then call missing(vars_of_interest[i]); /* below 10% of the bucket 0 value */
end;
run;
You need a way to remember the values from the first row (or perhaps from the row where BUCKET=0?) so that you can then compare the value from the first row to the current value. A temporary ARRAY is an easy way to do that.
So assuming that BUCKET is always the first numeric variable in your data then you can just do something like this.
data want ;
set my_data;
array x _numeric_;
array y (1000) _temporary_;
do i=2 to dim(x);
if bucket=0 then y(i)=x(i);
else if x(i) < 0.1*y(i) then x(i)=.; /* below 10% of the bucket 0 value */
end;
drop i;
run;
If BUCKET is not the first variable then you could add retain bucket; before the set statement to force it to be the first. Or change the first array statement to list the specific variables you want to process, just remember to change the lower bound on the DO loop.
If you have more than a thousand variables then increase the dimension of the temporary array.
I have a data set that looks like this:
company Assets Liabilities strategy1 strategy2 strategy3.....strategy22
1 500 500 0 50 50 50
2 200 300 33 30 33 0
My goal is to find the maximum value across the row for all strategies (strategy1 - strategy22), and basically bucket the company by the strategy they use. The problem comes when some companies have the same maximum value under multiple strategies. In this case I would want to place the firm into multiple buckets. The final dataset would be something like this:
company Assets Liab. strategy1 strategy2 strategy3.....strategy22 Strategy
1 500 500 0 50 50 50 Strategy2
1 500 500 0 50 50 50 Strategy3
1 500 500 0 50 50 50 Strategy22
Etc.
The end goal is to be able to run a proc means on the company's assets, liabilities, etc. by strategy. So far I have been able to achieve a dataset close to what I would like, but in the "Strategy" column I can't get it so SAS doesn't always output the first strategy with the maximum value.
Data want;
set have;
MAX = max(of strategy1-strategy22);
array nums {22} strategy1-strategy22;
do _n_=1 to 21;
count=1;
do _i_ = _n_+1 to 22;
if nums{_n_} = nums{_i_} AND nums{_i_} ne 0 then count + 1;
end;
if count > maxcount then do;
mode = nums{_n_};
maxcount = count;
end;
end;
Run;
Data want2;
set want (where=( maxcount > 1 AND Mode = Max));
by company;
strat=1;
do until (strat gt maxcount);
output;
strat = strat +1;
end;
Run;
Basically, I computed the mode and the count of identical maximum values and if maxcount > 1 and mode = max then I output identical observations. However, I am stuck regarding how to get SAS to output different strategies if there are multiple maximum values that are the same.
That seems more complicated than you need.
data want;
set have;
array strategies[22] strategy1-strategy22;
do strategy = 1 to dim(strategies);
if strategies[strategy] = max(of strategies[*]) then output;
end;
run;
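In the step above, `strategy` holds the array index (1 to 22). If a name is more convenient for the later PROC MEANS, one possible variation uses VNAME to store the variable name instead. This is a sketch: `want_named`, `strategy_name` and `row_max` are names invented here, and `assets` and `liabilities` are assumed to be the variable names behind the column headings in the question.

```sas
/* Sketch: one output row per strategy that ties for the row maximum */
data want_named;
  set have;
  array strategies[22] strategy1-strategy22;
  length strategy_name $32;
  row_max = max(of strategies[*]);   /* compute the maximum once per row */
  do i = 1 to dim(strategies);
    if strategies[i] = row_max then do;
      strategy_name = vname(strategies[i]);
      output;
    end;
  end;
  drop i row_max;
run;

/* Summarize company figures by strategy bucket */
proc means data=want_named;
  class strategy_name;
  var assets liabilities;
run;
```

A company that ties across several strategies contributes to each of those buckets, which is the behavior the question asks for.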
Why not just output the row whenever the strategy column matches the MAX?
Here is a sketch of what I'm thinking.
If the column you're on has a value EQ MAX, then output that row with the strategy column set to the strategy that you're looking at:
Data want;
set have;
MAX = max(of strategy1-strategy22);
array nums {22} strategy1-strategy22;
length strategy $10;
/* check every strategy column, not just those after the current row number */
do i = 1 to 22;
if nums{i} eq MAX then do;
strategy = cats("Strategy", i);
output;
end;
end;
drop i;
Run;
How can I perform a calculation on the last n observations in a data set?
For example, if I have 10 observations, I would like to create a variable that sums the last 5 values of another variable. Please do not suggest that I lag 5 times or use module ( N ). I need a somewhat more elegant solution than that.
In the code below, alpha is the data set that I have and bravo is the one I need.
data alpha;
input lima @@ ;
cards ;
3 1 4 21 3 3 2 4 2 5
;
run ;
data bravo;
input lima juliet;
cards;
3 .
1 .
4 .
21 .
3 32
3 32
2 33
4 33
2 14
5 16
;
run;
Thank you in advance!
You can do this in the data step or using PROC EXPAND from SAS/ETS if available.
For the data step the idea is that you start with a cumulative sum (summ), but keep track of the number of values that were added so far (ninsum). Once that reaches 5, you start outputting the cumulative sum to the target variable (juliet), and from the next step you start subtracting the lagged-5 value to only store the sum of the last five values.
data beta;
set alpha;
retain summ ninsum 0;
summ + lima;
ninsum + 1;
l5 = lag5(lima);
if ninsum = 6 then do;
summ = summ - l5;
ninsum = ninsum - 1;
end;
if ninsum = 5 then do;
juliet = summ;
end;
run;
proc print data=beta;
run;
However, there is a procedure that can do all kinds of cumulative and moving-window calculations: PROC EXPAND, in which this is really just one line. We tell it to calculate the backward moving sum in a window of width 5 and to set the first 4 observations to missing (by default it pads the series with 0's on the left).
proc expand data=alpha out=gamma;
convert lima = juliet / transformout=(movsum 5 trimleft 4);
run;
proc print data=gamma;
run;
Edit
If you want to do more complicated calculations, you need to carry the previous values in retained variables. I thought you wanted to avoid that, but here it is:
data epsilon;
set alpha;
array lags {5};
retain lags1 - lags5;
/* shift over lagged values, and add self at the beginning, */
/* so the window includes the current observation */
do i=5 to 2 by -1;
lags{i} = lags{i-1};
end;
lags{1} = lima;
/* do whatever calculation is needed; the result is missing */
/* until all five slots have been filled */
juliet = 0;
do i=1 to 5;
juliet = juliet + lags{i};
end;
drop i;
run;
proc print data=epsilon;
run;
I can offer a rather ugly solution:
1. Run a data step and add an increasing number to each observation within a group.
2. Run a SQL step and add a column with max(number) per group.
3. Run another data step and check whether the value from (2)-(1) is less than 5. If so, assign the value that you want to sum to a _num_to_sum_ variable (for example); otherwise leave it blank or assign 0.
4. Last, do a SQL step with sum(_num_to_sum_) and group the results by the grouping variable from (1).
EDIT: I have added a live example of the concept in a bit more compacted way.
data sourcetable;
input var1 $ var2;
cards;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;
data step1;
set sourcetable;
by var1;
retain obs 0;
if first.var1 then obs = 0;
else obs = obs+1;
if obs >=5 then to_sum = var2;
run;
proc sql;
create table rezults as
select distinct var1, sum(to_sum) as needed_summs
from step1
group by var1;
quit;
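For comparison, the four steps as listed can also be followed more literally, so that only the last five rows of each group are summed. This is a sketch under assumptions: `numbered`, `last5`, `max_obs` and `needed_sum` are names invented here, `sourcetable` is assumed to be sorted by var1, and the inner query relies on PROC SQL remerging the group maximum back onto each row (SAS writes a note to the log when it does this).

```sas
/* Step 1: number the rows within each group, starting at 1 */
data numbered;
  set sourcetable;
  by var1;
  if first.var1 then obs = 0;
  obs + 1;
run;

/* Steps 2-4: attach the group size, keep the last five rows, and sum them */
proc sql;
  create table last5 as
  select var1, sum(var2) as needed_sum
  from (select var1, var2, obs, max(obs) as max_obs
        from numbered
        group by var1)
  where max_obs - obs < 5
  group by var1;
quit;
```

With the sample data, group aaa keeps its last five values (7, 1, 11, 8, 6) and group bbb, which has only four rows, keeps all of them.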
In case anyone reads this :)
I solved it the way I needed it to be solved, although now I am curious which of the two (the retain approach and my solution) is more efficient in terms of processing time.
Here is my solution:
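data bravo(keep = var1 summ);
set alpha;
/* only look back once five observations are available, so POINT= never gets a row number below 1 */
if _n_ >= 5 then do i=_n_ to _n_-4 by -1;
set alpha(rename=(var1=var2)) point=i;
summ=sum(summ,var2);
end;
run;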