I have numeric salary values for different employees, and I want to break them into range categories. However, I do not want a new column; I just want to format the existing salary column using these ranges:
At least $20,000 but less than $100,000 - <$100,000
At least $100,000 and up to $500,000 - >$100,000
Missing - Missing salary
Any other value - Invalid salary
I've done something similar with gender. I just want to use PROC PRINT with a FORMAT statement to show salary and gender.
DATA Work.nonsales2;
SET Work.nonsales;
RUN;
PROC FORMAT;
VALUE $Gender
'M'='Male'
'F'='Female'
'O'='Other'
other='Invalid Code';
PROC FORMAT;
VALUE salrange
'At least $20,000 but less than $100,000 '=<$100,000
other='Invalid Code';
PROC PRINT;
title 'Salary and Gender';
title2 'for Non-Sales Employees';
format gender $gender.;
RUN;
Proc Format is the correct method and you need a numeric format:
proc format;
value salfmt
20000 -< 100000 = "At least $20,000 but less than $100,000"
100000 - 500000 = "100,000 +"
. = 'Missing'
other = 'Other';
run;
Then in your print apply the format, similar to what you did for gender.
format salary salfmt.;
This should help get you started.
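Putting the pieces together, a minimal sketch might look like the following. The dataset name nonsales_demo and its four rows are made up for illustration, and the labels follow the wording in the question, so adjust both to your data.

```sas
/* Hypothetical sample rows; the real table is Work.nonsales */
data work.nonsales_demo;
  input gender $ salary;
  datalines;
M 25000
F 150000
O .
M 750000
;
run;

proc format;
  /* character format for gender, as in the question */
  value $gender
    'M'   = 'Male'
    'F'   = 'Female'
    'O'   = 'Other'
    other = 'Invalid Code';
  /* numeric format for salary; -< excludes the upper bound */
  value salfmt
    20000 -< 100000 = '<$100,000'
    100000 - 500000 = '>$100,000'
    .               = 'Missing salary'
    other           = 'Invalid salary';
run;

proc print data=work.nonsales_demo;
  title 'Salary and Gender';
  format gender $gender. salary salfmt.;
run;
```

The salary column keeps its numeric values; only the displayed text changes, so no new column is needed.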
I created a small function that mimics R's cut function:
options cmplib=work.functions;
proc fcmp outlib=work.functions.test;
function cut2string(var, cutoffs[*], values[*] $) $;
if var <cutoffs[1] then return (values[1]);
if var >=cutoffs[dim(cutoffs)] then return (values[dim(values)]);
do i=1 to dim(cutoffs)-1; /* stop one short so cutoffs[i+1] stays in bounds */
if var >=cutoffs[i] & var <cutoffs[i+1] then return (values[i+1]);
end;
return ("Error, this shouldn't ever happen");
endsub;
run;
Then you can use it like this:
data Work.nonsales2;
set Work.nonsales;
array cutoffs[3] _temporary_ (20000 100000 500000);
array valuesString[4] $10 _temporary_ ("<20k " "20k-100k" "100k-500k" ">500k");
salary_string = cut2string(salary ,cutoffs,valuesString);
run;
I can't find a way to summarize the same variable using different weights.
I'll try to explain with an example of three records:
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning, because I used the same variable "a" more than once (and I can't rename it).
I have to store a huge amount of data without much physical space, and otherwise I would have to construct about 120 fields (a0-a6, b0-b6, etc.) that are the same variables multiplied by fixed weights (wgt0-wgt5).
I want to store a dataset with 20 columns (a, b, c, ...) and 6 weights (wgt0-wgt5) and, on demand, run a summary without an intermediate data step that forces me to create 120 fields.
Because of the volume of data (roughly 55 GB every month), I'd also prefer not to use a PROC SQL statement such as:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
Is there a "super PROC SUMMARY" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
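Here is one possible reading of the data step view option, sketched against the three-row pippo example from the question. The view turns each weight into its own row, so the weight index can act as a CLASS variable. The names pippo_v, widx and wgt are hypothetical, and this is only a sketch, not necessarily what was originally suggested.

```sas
/* The three records from the question */
data pippo;
  input a wgt1 wgt2 wgt3;
  datalines;
10 0.5 1 0
3 0 0 1
8.9 1.2 0.3 0.1
;
run;

/* View: one row per original record per weight; nothing is stored on disk */
data pippo_v / view=pippo_v;
  set pippo;
  array w[3] wgt1-wgt3;
  do widx = 1 to dim(w);
    wgt = w[widx];   /* current weight for this copy of the row */
    output;
  end;
  drop wgt1-wgt3;
run;

/* One weighted sum of a per weight index */
proc summary data=pippo_v nway;
  class widx;
  var a / weight=wgt;
  output out=pluto (drop=_freq_ _type_) sum=;
run;

proc print data=pluto;
run;
```

For the example data this gives one row per weight, e.g. widx=1 holds 10*0.5 + 3*0 + 8.9*1.2 = 15.68. The source table is read once and no 120-column table is written, at the cost of PROC SUMMARY processing one row per weight. More analysis variables can be listed on the same VAR statement.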
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the data;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
Hi, I am trying to use a data step and an array to convert a table from long format to wide format. My table was originally wide, and I figured out how to make it long, but now I need to use an array to make it wide again. When I run my last data step, I get a table with empty Expense1, Expense2, Expense3, etc. columns. My table needs to look like this, but with nine hotels and six expense columns:
Resort   Expense1   Expense2   Expense3   Expense4
HOTEL1   $165.89    $45.50     $78.00     $56.25
HOTEL2   $215.32    $64.00     $54.00     $62.50
The long table looks like this, but with nine hotels:
Resort   ExpenseID   Expense
HOTEL1   1           $165.89
HOTEL1   2           $45.50
HOTEL1   3           $78.00
Here is my code; the last data step is my attempt to convert from long to wide.
proc import datafile="/home/u54324957/The Files/Hotels.xlsx" out=Sheet1
dbms=xlsx replace;
run;
data Hotels;
set Sheet1;
array TheExpense(*) Expense1-Expense6;
array Peak(6) PeakExpense1-PeakExpense6;
do i=1 to 6;
Peak(i)=TheExpense(i) * 1.25;
drop i;
drop Expense1-Expense6;
format PeakExpense1-PeakExpense6
dollar7.2;
end;
run;
title "Peak Season Resort Pricing";
proc print data=Hotels noobs;
run;
data Hotels1;
set Sheet1;
array Hotels(*) Expense1-Expense6;
do ExpenseID=1 to 6;
Expense = Hotels(ExpenseID);
drop Expense1-Expense6;
output;
end;
run;
title "Restructure Data from Wide to Long Format";
proc print data=Hotels1 noobs;
format Expense dollar7.2;
run;
proc sort data=Hotels1;
by ExpenseID;
run;
data Hotels2;
set Hotels1;
array Hotels(*) Expense1-Expense6;
retain Expense1-Expense3;
by ExpenseID;
if first.ExpenseID then i=0;
i+1;
if last.ExpenseID then output;
run;
proc print data=Hotels2;
run;
Any ideas for how I can fill in these empty columns with values?
Array-based transposition by group can be accomplished as follows (the input, here called tall, must be sorted by resort):
data wide(keep=resort expense1-expense6);
if 0 then set tall (keep=resort); * prep PDV with resort variable;
array expenses expense1-expense6; * prep PDV with wide variables;
* reset array to zeroes, resorts without a specific expenseID will have a 0;
do index = 1 to dim(expenses);
expenses[index] = 0;
end;
* if you want missing values instead of zeroes;
* call missing (of expenses(*));
* dow loop, iterate down the by group;
do until (last.resort);
set tall;
by resort;
expenses[expenseID] = expense;
end;
*implicit output, one row per resort;
run;
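As a self-contained illustration of the same DOW-loop idea, here is a sketch that uses only the two hotels and four expenses shown in the question. The dataset names tall_demo and wide_demo are made up. The rows are entered in ExpenseID order on purpose, since the question sorted by ExpenseID; the step needs the data sorted by Resort instead.

```sas
/* Long data, ordered by ExpenseID as in the question's PROC SORT */
data tall_demo;
  input Resort $ ExpenseID Expense;
  datalines;
HOTEL1 1 165.89
HOTEL2 1 215.32
HOTEL1 2 45.50
HOTEL2 2 64.00
HOTEL1 3 78.00
HOTEL2 3 54.00
HOTEL1 4 56.25
HOTEL2 4 62.50
;
run;

/* The transposition reads one BY group per iteration, so sort by Resort */
proc sort data=tall_demo;
  by Resort ExpenseID;
run;

data wide_demo (keep=Resort Expense1-Expense4);
  if 0 then set tall_demo (keep=Resort);  /* put Resort first in the output */
  array expenses[4] Expense1-Expense4;    /* wide variables, missing at the start of each group */
  do until (last.Resort);
    set tall_demo;
    by Resort;
    expenses[ExpenseID] = Expense;        /* ExpenseID selects the target column */
  end;
  format Expense1-Expense4 dollar7.2;
  /* implicit output here: one row per Resort */
run;

proc print data=wide_demo noobs;
run;
```

The attempt in the question never assigns Expense into the array, which is why its Expense columns stay empty; the assignment inside the loop is the piece that fills them.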
I've got 50 columns of data, with 4 different measurements in each, as well as designation tags (groups C, D, and E). I've averaged the 4 measurements, so every data point now has an average. Now I am supposed to take the average of those data point averages within each group.
So I want all the data in group C to be averaged, and likewise for D and E, and I don't know how to do that.
avg1=(MEAS1+MEAS2+MEAS3+MEAS4)/4;
avg_score=round(avg1, .1);
run;
proc print;
run;
This is what I have so far.
Several procedures, as well as PROC SQL, can average values over a group.
I'll assume you meant 50 rows of data.
Example using PROC MEANS:
data have;
call streaminit(314159);
do _n_ = 1 to 50;
group = substr('CDE', rand('integer',3),1);
array v meas1-meas4;
do _i_ = 1 to dim(v);
num + 2;
v(_i_) = num;
end;
output;
end;
drop num;
run;
data rowwise_means;
set have;
avg_meas = mean (of meas:);
run;
* group wise means of row means;
proc means noprint data=rowwise_means nway;
class group;
var avg_meas;
output out=want mean=meas_grandmean;
run;
rowwise_means
want (grandmean, or mean of means)
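Since SQL was mentioned as an alternative, here is a small sketch of the same group-wise mean of row means with PROC SQL. The table scores_demo, the column grp and the five rows are invented for illustration only.

```sas
/* Hypothetical rows: one row mean (avg_meas) per data point */
data scores_demo;
  input grp $ avg_meas;
  datalines;
C 1.0
C 2.0
D 4.0
E 3.0
E 5.0
;
run;

/* Mean of the row means within each group */
proc sql;
  create table want_sql as
  select grp,
         mean(avg_meas) as meas_grandmean
  from scores_demo
  group by grp;
quit;

proc print data=want_sql noobs;
run;
```

With these made-up rows the result is 1.5 for C, 4 for D and 4 for E.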
I have a dataset that looks like the following:
Name Number
a 1
b 2
c 9
d 6
e 5.5
Total ???
I want to calculate the sum of the variable Number and record it in the last row (the one with Name = 'Total'). I know I can do this using PROC MEANS and then merging the output back to this file, but that does not seem very efficient. Can anyone tell me whether there is a better way, please?
You can do the following in a data step:
data test2;
drop sum;
set test end = last;
retain sum;
if _n_ = 1 then sum = 0;
sum = sum + number;
output;
if last then do;
NAME = 'TOTAL';
number = sum;
output;
end;
run;
It takes just one pass through the dataset.
This is also easy to do with PROC REPORT.
data have;
input Name $ Number ;
cards;
a 1
b 2
c 9
d 6
e 5.5
;
proc report data=have out=want(drop=_:);
rbreak after/ summarize ;
compute after;
name='Total';
endcomp;
run;
The following code uses the DOW-Loop (DO-Whitlock) to achieve the result by reading through the observations once, outputting each one, then lastly outputting the total:
data want(drop=tot);
do until(lastrec);
set have end=lastrec;
tot+number;
output;
end;
name='Total';
number=tot;
output;
run;
For all of the data step solutions offered, keep the length of the Name variable in mind: it must accommodate both 'Total' and the original values.
proc sql;
select max(5,length) into :len trimmed from dictionary.columns WHERE LIBNAME='WORK' AND MEMNAME='TEST' AND UPCASE(NAME)='NAME';
QUIT;
data test2;
length name $ &len;
set test end=last;
...
run;
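To make the length point concrete, here is a sketch with a hard-coded length instead of the macro variable. The dataset names have_names and want_total are hypothetical, and Name is deliberately created with length 1, so that 'Total' would be cut to 'T' without the LENGTH statement.

```sas
/* Name is only 1 character wide in the source data */
data have_names;
  length Name $1;
  input Name $ Number;
  datalines;
a 1
b 2
c 9
d 6
e 5.5
;
run;

data want_total (drop=tot);
  length Name $5;          /* must come before SET so the wider length wins */
  set have_names end=last;
  tot + Number;            /* sum statement: tot is retained and starts at 0 */
  output;
  if last then do;
    Name = 'Total';
    Number = tot;
    output;                /* extra row holding the grand total */
  end;
run;

proc print data=want_total noobs;
run;
```

The last row should read Total with Number 23.5 (1 + 2 + 9 + 6 + 5.5).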
County   AgeGrp   Population
A        1        200
A        2        100
A        3        100
A        All      400
B        1        200
So, I have a list of counties and I'd like to find the under 18 population as a percent of the population for each county, so as an example from the table above I'd like to add only the population of agegrp 1 and 2 and divide by the 'all' population. In this case it would be 300/400. I'm wondering if this can be done for every county.
Let's call your SAS data set "HAVE" and say it has two character variables (County and AgeGrp) and one numeric variable (Population). And let's say you always have one observation in your data set for each County with AgeGrp='All', on which the value of Population is the total for the county.
To be safe, let's sort the data set by County and process it in another data step, creating a new data set named "WANT" with new variables for the county population (TOT_POP) and the sum of the two age group values you want (TOT_GRP), and calculate the proportion (AgeGrpPct):
proc sort data=HAVE;
by County;
run;
data WANT;
retain TOT_POP TOT_GRP 0;
set HAVE;
by County;
if first.County then do;
TOT_POP = 0;
TOT_GRP = 0;
end;
if AgeGrp in ('1','2') then TOT_GRP + Population;
else if AgeGrp = 'All' then TOT_POP = Population;
if last.County;
AgeGrpPct = TOT_GRP / TOT_POP;
keep County TOT_POP TOT_GRP AgeGrpPct;
output;
run;
Notice that the observation containing AgeGrp='All' is not really needed; you could just as well have created another variable to collect a running total for all age groups.
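A sketch of that variant, which ignores the 'All' rows and accumulates the county total itself, could look like this. The dataset names have_demo and want_demo are made up, and the rows for county B are invented to complete the example.

```sas
data have_demo;
  input County $ AgeGrp $ Population;
  datalines;
A 1 200
A 2 100
A 3 100
A All 400
B 1 200
B 2 300
B 3 500
B All 1000
;
run;

proc sort data=have_demo;
  by County;
run;

data want_demo;
  set have_demo (where=(AgeGrp ne 'All'));  /* skip the summary rows */
  by County;
  if first.County then do;
    TOT_POP = 0;
    TOT_GRP = 0;
  end;
  TOT_POP + Population;                               /* running total, all age groups */
  if AgeGrp in ('1','2') then TOT_GRP + Population;   /* running total, under 18 */
  if last.County;                                     /* keep one row per county */
  AgeGrpPct = TOT_GRP / TOT_POP;
  keep County TOT_POP TOT_GRP AgeGrpPct;
run;

proc print data=want_demo noobs;
run;
```

For county A this gives 300 / 400 = 0.75, matching the example in the question.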
If you want a procedural approach, create a format for the under-18s, then use PROC FREQ to calculate the percentage. This method requires excluding the 'All' rows from the dataset (it's generally bad practice to include summary rows in the source data).
PROC TABULATE could also be used for this.
data have;
input County $ AgeGrp $ Population;
datalines;
A 1 200
A 2 100
A 3 100
A All 400
B 1 200
B 2 300
B 3 500
B All 1000
;
run;
proc format;
value $age_fmt '1','2' = '<18'
other = '18+';
run;
proc sort data=have;
by county;
run;
proc freq data=have (where=(agegrp ne 'All')) noprint;
by county;
table agegrp / out=want (drop=COUNT where=(agegrp in ('1','2')));
format agegrp $age_fmt.;
weight population;
run;