I have two columns. The first column contains values (count) and the second column the percentage (from the total population). So I have:
count percentage
12 (48%)
14 (29%)
89 (50%)
I would like one column with both parts:
count
12(48%)
14(29%)
89(50%)
I have tried:
data mydata1;
set mydata;
countper=catx(' ',count, percent);
run;
and
data mydata1;
set mydata;
counter=count || percent;
run;
Both of these sort of combine, the 'catx' being more successful, but I lose the brackets and percentage and the number of decimal places in the percentage originally used.
How can I combine these columns as I wish?
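countper=catx(' ',vvalue(count), vvalue(percent)); will do it for you without requiring any re-formatting. VVALUE returns each variable's formatted value as text, so whatever the format on percent displays (brackets, percent sign, decimal places) is kept. If you don't want a space between the two parts, use cats(vvalue(count), vvalue(percent)) instead.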
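As a minimal sketch of the VVALUE approach, the step below assumes that count is numeric and that percent is a fraction carrying the PERCENT8.1 format. Both are assumptions, since the original formats are not shown. Because PERCENT8.1 does not print brackets for positive values, the brackets are added explicitly here.

```sas
/* Hypothetical input: count is numeric, percent carries the PERCENT8.1 format */
data mydata;
  input count percent;
  format percent percent8.1;
  datalines;
12 0.48
14 0.29
89 0.50
;
run;

data mydata1;
  set mydata;
  length countper $20;
  /* VVALUE returns the formatted text; CATS strips blanks and adds no separator */
  countper = cats(count, '(', vvalue(percent), ')');
run;
```

With these assumed formats the first row should come out as 12(48.0%). If the format on percent already prints the brackets, the two literal bracket arguments can be dropped.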
For each row, starting from the 250th, I want to compute the sum of the previous 250 rows.
X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)
X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )
I tried to use the lag function, but it does not work with this many lags.
I also want to calculate the sum of the 250 rows that follow each row.
What you're looking for is a moving sum both forwards and backwards where the sum is missing until that 250th observation. The easiest way to do this is with PROC EXPAND.
Sample data:
data have;
do MKDate = '01JAN1993'd to '31DEC2000'd;
VWRET = rand('uniform');
output;
end;
format MKDate mmddyy10.;
run;
Code:
proc expand data=have out=want;
id MKDate;
convert VWRET = x_backwards_250 / transform=(movsum 250 trimleft 250);
convert VWRET = x_forwards_250 / transform=(reverse movsum 250 trimleft 250 reverse);
run;
Here's what the transformation operations are doing:
x_backwards_250: creates a backward moving sum of 250 observations, then sets the initial 250 values to missing.
x_forwards_250: reverses VWRET, creates a moving sum of 250 observations, sets the initial 250 values to missing, then reverses the result again. This effectively creates a forward moving sum.
Note that the MOVSUM window includes the current observation.
The key is how to read observations from the preceding and following rows. As for your sum(n1, n2,...,nx) call, you can replace it with iterative summation.
This example uses multiple SET statements with POINT= to sum a variable over the 25 previous and the 25 following rows:
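data test;
set sashelp.air nobs=nobs;
/* only rows with 25 neighbours on both sides are calculated */
if 25<_n_<nobs-25+1 then do;
/* read the 25 previous rows by direct access */
do i=_n_-25 to _n_-1;
set sashelp.air(keep=air rename=(air=pre_air)) point=i;
sum_pre=sum(sum_pre,pre_air);
end;
/* read the 25 following rows by direct access */
do j=_n_+1 to _n_+25;
set sashelp.air(keep=air rename=(air=post_air)) point=j;
sum_post=sum(sum_post,post_air);
end;
end;
drop pre_air post_air;
run;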
Only rows 26 through nobs-25 are calculated, where nobs is the number of observations in the input data set sashelp.air.
Multiple SET statements can be slow on a big data set. To be more efficient, you can load the values into an array with a DOW-loop instead:
data test;
array _val_[1024] _temporary_;
if _n_=1 then do i=1 by 1 until(eof);
set sashelp.air end=eof;
_val_[i]=air;
end;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
sum_pre=sum(sum_pre,_val_[i]);
end;
do j=_n_+1 to _n_+25;
sum_post=sum(sum_post,_val_[j]);
end;
end;
drop i j;
run;
The weakness is that you have to give the array a dimension, and it must be greater than or equal to nobs.
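One possible way around the fixed dimension, sketched here under the assumption that the input is a regular data set whose observation count is known to SAS, is to store the count in a macro variable first and use it to size the array. The macro variable name nobs is a hypothetical choice.

```sas
/* Store the observation count in a macro variable (hypothetical name: nobs) */
data _null_;
  if 0 then set sashelp.air nobs=n;
  call symputx('nobs', n);
  stop;
run;

data test;
  /* the array now has exactly one element per observation */
  array _val_[&nobs] _temporary_;
  if _n_=1 then do i=1 by 1 until(eof);
    set sashelp.air end=eof;
    _val_[i]=air;
  end;
  set sashelp.air nobs=nobs;
  if 25<_n_<nobs-25+1 then do;
    do i=_n_-25 to _n_-1;
      sum_pre=sum(sum_pre,_val_[i]);
    end;
    do j=_n_+1 to _n_+25;
      sum_post=sum(sum_post,_val_[j]);
    end;
  end;
  drop i j;
run;
```

The rest of the step is unchanged from the version above; only the array declaration differs.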
These techniques come from a concept called "Table Look-Up". For the SAS context, read "Table Look-Up by Direct Addressing: Key-Indexing -- Bitmapping -- Hashing", Paul Dorfman, SUGI 26.
You don't want to use normal arithmetic with missing values because then the result is always a missing value. Use the SUM() function instead.
You don't need to spell out all of the lags. Just keep a normal running sum, but add the wrinkle of subtracting the value that drops out of the window. So your equation only needs to reference the one lagged value.
Here is a simple example using running sum of 5 using SASHELP.CLASS data as an example:
%let n=5 ;
data step1;
set sashelp.class(keep=name age);
retain running_sum ;
running_sum=sum(running_sum,age,-(sum(0,lag&n.(age))));
if _n_ >= &n then want=running_sum;
run;
So the sum of the first 5 observations is 68. But for the next observation the sum goes down to 66 since the age on the 6th observation is 2 less than the age on the first observation.
To calculate the other variable sort the dataset in descending order and use the same logic to make another variable.
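The descending-order step could look roughly like the sketch below. The helper variable row and the data set names numbered, reversed and step2 are assumptions introduced for illustration. As in the example above, the sum includes the current observation.

```sas
%let n=5;

/* Remember the original order (row is a hypothetical helper variable) */
data numbered;
  set sashelp.class(keep=name age);
  row = _n_;
run;

/* Reverse the order so that LAG looks at the rows that originally followed */
proc sort data=numbered out=reversed;
  by descending row;
run;

data step2;
  set reversed;
  retain forward_sum;
  forward_sum = sum(forward_sum, age, -(sum(0, lag&n.(age))));
  if _n_ >= &n then want_forward = forward_sum;
run;

/* Restore the original order */
proc sort data=step2;
  by row;
run;
```

After the final sort, want_forward is missing for the last n-1 rows of the original order, mirroring how want is missing for the first n-1 rows.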
What I have:
Number Cost Amount
52 98 1
108 50 3
922 12 1
What I want:
Number Cost
52 98
108 50
109 50
110 50
922 12
My dataset has a variable Amount. If Amount is 2 for a certain row, I want to create a new row right beneath it with the same Cost and the Number equal to that of the row above + 1. If the Amount is 3, I want to create two new rows right beneath it, both with the same Cost and with the Numbers being Number from row above +1 and Number from row above +2, and so on.
My final step would be to delete the Amount column, which I can do with
data want (drop=Amount);
set have;
I am having problems implementing this. My thought was to use proc sql insert into, but I am having trouble combining this with an if condition that runs through the Amount variable.
Code to reproduce table:
proc sql;
create table want
(Number num, Cost num, Amount num);
insert into want
values(52,98,1)
values(108,50,3)
values(922,12,1);
quit;
This can help you:
proc sort data=want out=want_s nodupkey;
by Number;
run;
data result;
keep Number Cost;
set want_s;
do i=1 to Amount;
output;
Number=Number+1;
end;
run;
You might need to take care that Number does not overlap with the next input row like below:
Number ; Amount
108 ; 10
110 ; 1
Use a DO loop to output the AMOUNT number of rows. You can code the index variable of the loop to increment NUMBER.
Example (untested):
data want(keep=number cost);
set have;
/* the loop bounds are evaluated once, so NUMBER can safely be the index */
do number = number to number + amount-1;
output;
end;
run;
However, you may not need to perform this expansion of data in some cases. Many SAS procedures provide a WEIGHT or FREQ statement that lets a variable perform that statistical or processing role.
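As a small illustration of the FREQ idea, the sketch below runs PROC MEANS on the unexpanded table and lets Amount act as the frequency variable. Choosing PROC MEANS and these statistics is an assumption made for the example. The question does not say what analysis follows.

```sas
data have;
  input Number Cost Amount;
  datalines;
52 98 1
108 50 3
922 12 1
;
run;

/* FREQ treats each row as if it occurred Amount times */
proc means data=have n mean sum;
  freq Amount;
  var Cost;
run;
```

Here the three input rows are counted as five observations, which is the same result the expanded data set would give for these statistics.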
I have three columns in a dataset: spend, age_bucket, and multiplier. The data looks something like...
spend age_bucket multiplier
10 18-24 2x
120 18-24 2x
1 35-54 3x
I'd like a dataset with the columns as the age buckets, the rows as the multipliers, and the entries as the sum (or other aggregate function) of the spend column. Is there a proc to do this? Can I accomplish it easily using proc SQL?
There are a few ways to do this.
data have;
input spend age_bucket $ multiplier $;
datalines;
10 18-24 2x
120 18-24 2x
1 35-54 3x
10 35-54 2x
;
proc summary data=have;
var spend;
class age_bucket multiplier;
output out=temp sum=;
run;
First you can use PROC SUMMARY to calculate the aggregation, sum in this case, for the variable in question. The CLASS statement gives you things to sum by. This will calculate the N-Way sums and the output data set will contain them all. Run the code and look at data set temp.
Next you can use PROC TRANSPOSE to pivot the table. We need to use a BY statement so a PROC SORT is necessary. I also filter to the aggregations you care about.
proc sort data=temp(where=(_type_=3));
by multiplier;
run;
proc transpose data=temp out=want(drop=_name_);
by multiplier;
var spend;
id age_bucket;
idlabel age_bucket;
run;
In traditional mode 35-54 is not a valid SAS variable name. SAS will convert your columns to proper names. The label on the variable will retain the original value. Just be aware if you need to reference the variable later, the name has changed to be valid.
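For the PROC SQL part of the question, one option is conditional aggregation. This is only a sketch: it assumes the age buckets are known in advance and hard-codes them, and the column names age_18_24 and age_35_54 are hypothetical choices. Unlike PROC TRANSPOSE, it does not adapt when new buckets appear.

```sas
data have;
  input spend age_bucket $ multiplier $;
  datalines;
10 18-24 2x
120 18-24 2x
1 35-54 3x
10 35-54 2x
;
run;

proc sql;
  create table want_sql as
  select multiplier,
         /* one conditional sum per age bucket; missing when no rows match */
         sum(case when age_bucket = '18-24' then spend else . end) as age_18_24,
         sum(case when age_bucket = '35-54' then spend else . end) as age_35_54
  from have
  group by multiplier;
quit;
```

Swapping sum for another aggregate such as mean or max works the same way.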
I calculate a ratio for 40 stocks. I need to sort those into three groups high, medium and low based on the value of the ratio. The ratios are fractions of one and there aren't many repetitions. What I need is to create three groups of about 13 stocks each, in group 1 to have the high ratios, in group 2 medium ratios and group 3 low ratios. I have the below code but it just assigns rank 1 to all my stocks.
How can I correct this?
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
proc rank data=sourceh.combinedFreq2 out=sourceh.ranked groups=3;
by symbol notsorted;
var ratio;
ranks rank;
run;
If you want to automatically partition into three relatively even groups, you can use PROC RANK (See example using sashelp.stocks):
data have;
set sashelp.stocks;
ratio=high/low;
run;
proc rank data=have out=want groups=3;
by stock notsorted;
var ratio;
ranks rank;
run;
That partitions them into three groups. As long as you have 40 different values (i.e., not a lot of repeats of one value), it will make 3 evenly split groups (with ~13 in each). With GROUPS=3 the rank values run from 0 (lowest ratios) to 2 (highest ratios).
In your case, do not use by anything - by will create separate sets of ranks (here I'm ranking dates by stock, but you want to rank stocks.)
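A version without the BY statement might look like the sketch below. The input data set stocks is made up for illustration (40 hypothetical symbols with random ratios), and the DESCENDING option is an added choice so that group 0 holds the highest ratios.

```sas
/* Hypothetical data: 40 symbols with a random ratio between 0 and 1 */
data stocks;
  call streaminit(1);
  length symbol $4;
  do i = 1 to 40;
    symbol = cats('S', i);
    ratio = rand('uniform');
    output;
  end;
  drop i;
run;

/* No BY statement: all 40 rows are ranked together.
   DESCENDING makes group 0 the highest ratios. */
proc rank data=stocks out=ranked groups=3 descending;
  var ratio;
  ranks rank;
run;

/* Check the group sizes */
proc freq data=ranked;
  tables rank;
run;
```

The PROC FREQ step is only there to confirm that the three groups are of roughly equal size.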
I think people are making this more complicated than it needs to be. Let's do this on easy mode.
First, we'll create the dataset and create our ratios.
Second, we'll sort the data by ratio.
Lastly, we'll assign a group based on observation number.
WARNING! UNTESTED CODE!
/*Make the dataset. I stole this from your code above*/
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
/*sort the data so that it's ordered by ratio*/
PROC SORT DATA=sourceh.combinedfreq2 OUT=sourceh.combinedfreq2 ;
BY DESCENDING ratio ;
RUN ;
/*Assign a value based on observation number*/
Data sourceh.combinedfreq2;
Set sourceh.combinedfreq2;
length Group $6.;
if _N_ <=13 Then Group = "High";
if _N_ > 13 and _N_ <= 26 Then Group = "Medium";
if _N_ > 26 Then Group = "Low";
run;
I have data on exam results for 2 years for a number of students. I have a column with the year, the student's name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'.
I know the output I want is a table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category on the other, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 subsequent categories. But I don't know how to proceed to show whether this is the case or not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc., but to no avail.
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shouldn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with
PROC SGPLOT data = dataset;
/* LOESS draws the scatter points and overlays the smoothed fit */
loess x = year1 y = year2;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;