I am using the ASA24 HEI SAS macro for the SAS University Edition.
I believe SAS is truncating my IDs (ASA24 Usernames). In my data (.csv) I have Usernames like
After the macro code, the GFHSp040x (x is a natural number) IDs get truncated to GFHSp040.
Is this related to how the macro imports?
Code below:
/*The SAS program (HEI-2015 Individual Scores using multiple days of data from ASA24-2016 and ASA24-2018)
ByPerson.SAS */
/*This SAS program can be used to calculate Healthy Eating Index (HEI)-2015 scores from 24-hour recall or food records data collected using ASA24-2016 and ASA24-2018. This program calculates HEI-2015 component and total scores for each individual (using multiple recalls, if available, for a single respondent). Additional code that calculates HEI-2015 component and total scores for each day of 24HR recall or food record data is available on the ASA24 HEI Resources page. */
/*This program has been tested using SAS UNIVERSITY EDITION. */
/*Note: Some users have found that the SAS program will drop observations
from the analysis if the ID field is not the same length
for all observations. To prevent this error, the observations with the
longest ID length should be listed first when the data is imported into SAS. */
**********************************************************************;
%let home = /folders/myfolders/ASA24;
/* Rename FS_BL_CH_02.02.20_Totals to specify your input file. Do not forget .csv at the end of the name. */
filename Totals "&home/Totals/FS_BL_PA_02.02.20_Totals.csv"; /* In this example, the ASA24-2016 or ASA24-2018 Daily Total Nutrient and Pyramid Equivalents data “Totals”, are saved in a folder called “Totals”, within the “home” folder. The data are in csv format. */
/* Rename FS_BL_CH_02.02.20_HEI to specify your output file. */
filename res "&home/FS_BL_PA_02.02.20_HEI";
/*NOTE: Once you have renamed the above, all you need to do is run the SAS program.*/
%include "&home/hei2015.score.macro.sas";
TITLE 'ASA24-2016 and ASA24-2018 HEI-2015 scores - by person using all days';
/*Step 1.
Input daily total data and create six additional required variables. These variables are:
FWHOLEFRT, MONOPOLY, VTOTALLEG, VDRKGRLEG, PFALLPROTLEG, and PFSEAPLANTLEG
*/
Proc import datafile=Totals
Out=Totals
Dbms=csv
Replace;
Getnames=yes;
Run;
DATA Totals;
SET Totals;
FWHOLEFRT=F_CITMLB+F_OTHER;
MONOPOLY=MFAT+PFAT;
VTOTALLEG=V_TOTAL+V_LEGUMES;
VDRKGRLEG=V_DRKGR+V_LEGUMES;
PFALLPROTLEG=PF_MPS_TOTAL+PF_EGGS+PF_NUTSDS+PF_SOY+PF_LEGUMES;
PFSEAPLANTLEG=PF_SEAFD_HI+PF_SEAFD_LOW+PF_NUTSDS+PF_SOY+PF_LEGUMES;
run;
/*Step 2.
Calculates total food group and nutrient intake over all possible days reported per individual.
*/
proc sort data=Totals;
by UserName UserID;
run;
*get sum per person of variables of interest;
proc means data=Totals noprint;
by UserName UserID;
var KCAL VTOTALLEG VDRKGRLEG F_TOTAL FWHOLEFRT G_WHOLE D_TOTAL
PFALLPROTLEG PFSEAPLANTLEG MONOPOLY SFAT SODI G_REFINED ADD_SUGARS;
output out=idtot sum=;
run;
/*Step 3.
Runs the HEI2015 scoring macro which calculates intake density amounts and HEI scores.
*/
%HEI2015 (indat=idtot,
kcal= KCAL,
vtotalleg= VTOTALLEG,
vdrkgrleg= VDRKGRLEG,
f_total= F_TOTAL,
fwholefrt=FWHOLEFRT,
g_whole= G_WHOLE,
d_total= D_TOTAL,
pfallprotleg= PFALLPROTLEG,
pfseaplantleg= PFSEAPLANTLEG,
monopoly=MONOPOLY,
satfat=SFAT,
sodium=SODI,
g_refined=G_REFINED,
add_sugars=ADD_SUGARS,
outdat=hei2015);
/*Step 4.
Displays and saves the results.
*/
Data hei2015r (keep=UserName UserID kcal HEI2015C1_TOTALVEG HEI2015C2_GREEN_AND_BEAN HEI2015C3_TOTALFRUIT
HEI2015C4_WHOLEFRUIT HEI2015C5_WHOLEGRAIN HEI2015C6_TOTALDAIRY HEI2015C7_TOTPROT HEI2015C8_SEAPLANT_PROT
HEI2015C9_FATTYACID HEI2015C10_SODIUM HEI2015C11_REFINEDGRAIN HEI2015C12_SFAT HEI2015C13_ADDSUG HEI2015_TOTAL_SCORE);
Set hei2015;
Run;
proc means n nmiss min max mean data=hei2015r;
run;
proc export data= hei2015r
file=res
dbms=xlsx
replace;
run;
_______________________________ THE MACRO
/*************************************************************************/
/*************************************************************************/
/* */
/* THE HEI-2015 SCORING MACRO */
/* (hei2015.score.macro.sas) */
/*************************************************************************/
/* VERSION 1.0 06/25/2017 */
/* */
/* */
/* This HEI-2015 macro is to be used to calculate densities */
/* and HEI-2015 component and total scores. */
/* */
/* The macro requires an input dataset with variables for each of */
/* the HEI-2015 components, noted below. */
/* */
/* The resulting dataset, which is named by the user, contains the */
/* same variables as the supplied dataset, and creates 27 new */
/* variables. These include: */
/* */
/* The densities (per 1000 kcal) or percent (of total calories) */
/* for each of the 13 HEI-2015 components. */
/* */
/* Scores for the 13 components of the HEI-2015. */
/* */
/* The total HEI-2015 score, which is the sum of the */
/* scores for the 13 components. */
/* */
/* The syntax for calling the macro is: */
/* */
/* %HEI2015 */
/* (indat=,kcal=,vtotalleg=,vdrkgrleg=,f_total=,fwholefrt=,g_whole=, */
/* d_total=,pfallprotleg=,pfseaplantleg=,monopoly=,satfat=,sodium=, */
/* g_refined=,add_sugars=,outdat=) */
/* */
/* where */
/* */
/* "indat" * Specifies the dataset to be used. */
/* */
/* "kcal" * Specifies calorie amount. */
/* */
/* "vtotalleg" * Specifies the intake of total veg plus */
/* legumes in cup eq. */
/* */
/* "vdrkgrleg" * Specifies the intake of dark green veg */
/* plus legumes in cup eq. */
/* */
/* "f_total" * Specifies the intake of total fruit in cup eq */
/* */
/* "fwholefrt" * Specifies the intake of whole fruit in cup eq. */
/* */
/* "g_whole" * Specifies the intake of whole grain in oz. eq. */
/* */
/* "d_total" * Specifies the intake of total dairy in cup eq. */
/* */
/* "pfallprotleg" * Specifies the intake of total protein */
/* (includes legumes) in oz. eq. */
/* */
/* "pfseaplantleg" * Specifies the intake of seafood, fish and plant */
/* protein (includes legumes) in oz. eq. */
/* */
/* "monopoly" * Specifies the grams of mono fat plus poly fat. */
/* */
/* "satfat" * Specifies the grams of saturated fat. */
/* */
/* "sodium" * Specifies the mg of sodium. */
/* */
/* "g_refined" * Specifies the intake of refined */
/* grain in oz. eq. */
/* */
/* "add_sugars" * Specifies the intake of added sugars in tsp. eq. */
/* */
/* "outdat" * Specifies the name of the resulting dataset. */
/* */
/* */
/* Caution: variable names "FARMIN", "FARMAX", "SODMIN", */
/* "SODMAX", "RGMIN", "RGMAX", "SFATMIN", "SFATMAX", "ADDSUGMIN", */
/* "ADDSUGMAX" are reserved for this macro. */
/* */
/* */
/*************************************************************************/
;
%macro HEI2015 (indat=,kcal=,vtotalleg=,vdrkgrleg=,f_total=,fwholefrt=,g_whole=,d_total=,
pfallprotleg=,pfseaplantleg=,monopoly=,satfat=,sodium=,g_refined=,add_sugars=,outdat=);
data &outdat (drop=FARMIN FARMAX SODMAX SODMIN RGMIN RGMAX SFATMIN SFATMAX ADDSUGMIN ADDSUGMAX);
set &indat;
IF &kcal > 0 then VEGDEN=&vtotalleg/(&kcal/1000);
HEI2015C1_TOTALVEG=5*(VEGDEN/1.1);
IF HEI2015C1_TOTALVEG > 5 THEN HEI2015C1_TOTALVEG=5;
IF VEGDEN=0 THEN HEI2015C1_TOTALVEG=0;
IF &kcal > 0 then GRBNDEN=&vdrkgrleg/(&kcal/1000);
HEI2015C2_GREEN_AND_BEAN=5*(GRBNDEN/0.2);
IF HEI2015C2_GREEN_AND_BEAN > 5 THEN HEI2015C2_GREEN_AND_BEAN=5;
IF GRBNDEN=0 THEN HEI2015C2_GREEN_AND_BEAN=0;
IF &kcal > 0 then FRTDEN=&f_total/(&kcal/1000);
HEI2015C3_TOTALFRUIT=5*(FRTDEN/0.8);
IF HEI2015C3_TOTALFRUIT > 5 THEN HEI2015C3_TOTALFRUIT=5;
IF FRTDEN=0 THEN HEI2015C3_TOTALFRUIT=0;
IF &kcal > 0 then WHFRDEN=&fwholefrt/(&kcal/1000);
HEI2015C4_WHOLEFRUIT=5*(WHFRDEN/0.4);
IF HEI2015C4_WHOLEFRUIT > 5 THEN HEI2015C4_WHOLEFRUIT=5;
IF WHFRDEN=0 THEN HEI2015C4_WHOLEFRUIT=0;
IF &kcal > 0 then WGRNDEN=&g_whole/(&kcal/1000);
HEI2015C5_WHOLEGRAIN=10*(WGRNDEN/1.5);
IF HEI2015C5_WHOLEGRAIN > 10 THEN HEI2015C5_WHOLEGRAIN=10;
IF WGRNDEN=0 THEN HEI2015C5_WHOLEGRAIN=0;
IF &kcal > 0 then DAIRYDEN=&d_total/(&kcal/1000);
HEI2015C6_TOTALDAIRY=10*(DAIRYDEN/1.3);
IF HEI2015C6_TOTALDAIRY > 10 THEN HEI2015C6_TOTALDAIRY=10;
IF DAIRYDEN=0 THEN HEI2015C6_TOTALDAIRY=0;
IF &kcal > 0 then PROTDEN=&pfallprotleg/(&kcal/1000);
HEI2015C7_TOTPROT=5*(PROTDEN/2.5);
IF HEI2015C7_TOTPROT > 5 THEN HEI2015C7_TOTPROT=5;
IF PROTDEN=0 THEN HEI2015C7_TOTPROT=0;
IF &kcal > 0 then SEAPLDEN=&pfseaplantleg/(&kcal/1000);
HEI2015C8_SEAPLANT_PROT=5*(SEAPLDEN/0.8);
IF HEI2015C8_SEAPLANT_PROT > 5 THEN HEI2015C8_SEAPLANT_PROT=5;
IF SEAPLDEN=0 THEN HEI2015C8_SEAPLANT_PROT=0;
IF &satfat > 0 THEN FARATIO=&monopoly/&satfat;
FARMIN=1.2;
FARMAX=2.5;
if &satfat=0 and &monopoly=0 then HEI2015C9_FATTYACID=0;
else if &satfat=0 and &monopoly > 0 then HEI2015C9_FATTYACID=10;
else if FARATIO >= FARMAX THEN HEI2015C9_FATTYACID=10;
else if FARATIO <= FARMIN THEN HEI2015C9_FATTYACID=0;
else HEI2015C9_FATTYACID=10* ( (FARATIO-FARMIN) / (FARMAX-FARMIN) );
IF &kcal > 0 then SODDEN=&sodium/&kcal;
SODMIN=1.1;
SODMAX=2.0;
IF SODDEN <= SODMIN THEN HEI2015C10_SODIUM=10;
ELSE IF SODDEN >= SODMAX THEN HEI2015C10_SODIUM=0;
ELSE HEI2015C10_SODIUM=10 - (10 * (SODDEN-SODMIN) / (SODMAX-SODMIN) );
IF &kcal > 0 then RGDEN=&g_refined/(&kcal/1000);
RGMIN=1.8;
RGMAX=4.3;
IF RGDEN <= RGMIN THEN HEI2015C11_REFINEDGRAIN=10;
ELSE IF RGDEN >= RGMAX THEN HEI2015C11_REFINEDGRAIN=0;
ELSE HEI2015C11_REFINEDGRAIN=10 - ( 10* (RGDEN-RGMIN) / (RGMAX-RGMIN) );
IF &kcal > 0 then SFAT_PERC=100*(&satfat*9/&kcal);
SFATMIN=8;
SFATMAX=16;
IF SFAT_PERC >= SFATMAX THEN HEI2015C12_SFAT=0;
ELSE IF SFAT_PERC <= SFATMIN THEN HEI2015C12_SFAT=10;
ELSE HEI2015C12_SFAT= 10 - ( 10* (SFAT_PERC-SFATMIN) / (SFATMAX-SFATMIN) );
IF &kcal > 0 then ADDSUG_PERC=100*(&add_sugars*16/&kcal);
ADDSUGMIN=6.5;
ADDSUGMAX=26;
IF ADDSUG_PERC >= ADDSUGMAX THEN HEI2015C13_ADDSUG=0;
ELSE IF ADDSUG_PERC <= ADDSUGMIN THEN HEI2015C13_ADDSUG=10;
ELSE HEI2015C13_ADDSUG= 10 - ( 10* (ADDSUG_PERC-ADDSUGMIN) / (ADDSUGMAX-ADDSUGMIN) );
IF &kcal=0 THEN DO;
HEI2015C1_TOTALVEG=0; HEI2015C2_GREEN_AND_BEAN=0; HEI2015C3_TOTALFRUIT=0; HEI2015C4_WHOLEFRUIT=0; HEI2015C5_WHOLEGRAIN=0; HEI2015C6_TOTALDAIRY=0;
HEI2015C7_TOTPROT=0; HEI2015C8_SEAPLANT_PROT=0; HEI2015C9_FATTYACID=0; HEI2015C10_SODIUM=0; HEI2015C11_REFINEDGRAIN=0; HEI2015C12_SFAT=0; HEI2015C13_ADDSUG=0;
END;
/**Calculate HEI-2015 total score**/
/*total HEI-2015 score is the sum of 13 HEI component scores*/
HEI2015_TOTAL_SCORE = HEI2015C1_TOTALVEG + HEI2015C2_GREEN_AND_BEAN + HEI2015C3_TOTALFRUIT + HEI2015C4_WHOLEFRUIT + HEI2015C5_WHOLEGRAIN + HEI2015C6_TOTALDAIRY +
HEI2015C7_TOTPROT + HEI2015C8_SEAPLANT_PROT + HEI2015C9_FATTYACID + HEI2015C10_SODIUM + HEI2015C11_REFINEDGRAIN + HEI2015C12_SFAT + HEI2015C13_ADDSUG;
LABEL HEI2015_TOTAL_SCORE='TOTAL HEI-2015 SCORE'
HEI2015C1_TOTALVEG='HEI-2015 COMPONENT 1 TOTAL VEGETABLES'
HEI2015C2_GREEN_AND_BEAN='HEI-2015 COMPONENT 2 GREENS AND BEANS'
HEI2015C3_TOTALFRUIT='HEI-2015 COMPONENT 3 TOTAL FRUIT'
HEI2015C4_WHOLEFRUIT='HEI-2015 COMPONENT 4 WHOLE FRUIT'
HEI2015C5_WHOLEGRAIN='HEI-2015 COMPONENT 5 WHOLE GRAINS'
HEI2015C6_TOTALDAIRY='HEI-2015 COMPONENT 6 DAIRY'
HEI2015C7_TOTPROT='HEI-2015 COMPONENT 7 TOTAL PROTEIN FOODS'
HEI2015C8_SEAPLANT_PROT='HEI-2015 COMPONENT 8 SEAFOOD AND PLANT PROTEIN'
HEI2015C9_FATTYACID='HEI-2015 COMPONENT 9 FATTY ACID RATIO'
HEI2015C10_SODIUM='HEI-2015 COMPONENT 10 SODIUM'
HEI2015C11_REFINEDGRAIN='HEI-2015 COMPONENT 11 REFINED GRAINS'
HEI2015C12_SFAT='HEI-2015 COMPONENT 12 SAT FAT'
HEI2015C13_ADDSUG='HEI-2015 COMPONENT 13 ADDED SUGAR'
VEGDEN='DENSITY OF TOTAL VEGETABLES PER 1000 KCAL'
GRBNDEN='DENSITY OF DARK GREEN VEG AND BEANS PER 1000 KCAL'
FRTDEN='DENSITY OF TOTAL FRUIT PER 1000 KCAL'
WHFRDEN='DENSITY OF WHOLE FRUIT PER 1000 KCAL'
WGRNDEN='DENSITY OF WHOLE GRAIN PER 1000 KCAL'
DAIRYDEN='DENSITY OF DAIRY PER 1000 KCAL'
PROTDEN='DENSITY OF TOTAL PROTEIN PER 1000 KCAL'
SEAPLDEN='DENSITY OF SEAFOOD AND PLANT PROTEIN PER 1000 KCAL'
FARATIO='FATTY ACID RATIO'
SODDEN='DENSITY OF SODIUM PER 1000 KCAL'
RGDEN='DENSITY OF REFINED GRAINS PER 1000 KCAL'
SFAT_PERC='PERCENT OF CALORIES FROM SAT FAT'
ADDSUG_PERC='PERCENT OF CALORIES FROM ADDED SUGAR'
;
run;
%mend HEI2015;
/* END OF THE HEI2015 MACRO */
/*******************************************************************/
The macro is NOT calling PROC IMPORT; your own code is doing that. PROC IMPORT decides how long to make each character variable from the rows it scans, so if you add the GUESSINGROWS=MAX statement to the PROC IMPORT step it will scan the whole file and do a better job of guessing how long to define the character variables.
Or just write your own data step to read the CSV file. You can see the data step that PROC IMPORT generated in your log, and you can use that as a model if you want.
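A minimal sketch of the first suggestion, reusing the Totals fileref and data set name from the question. The PROC CONTENTS step is an addition for checking the result and is not part of the original program.

```sas
/* Scan every row of the CSV before choosing variable types and lengths, */
/* so UserName is defined wide enough for the longest ID in the file.    */
proc import datafile=Totals
    out=Totals
    dbms=csv
    replace;
    getnames=yes;
    guessingrows=max;
run;

/* Check the length that PROC IMPORT assigned to UserName. */
proc contents data=Totals;
run;
```

If UserName still shows the shorter length, the generated data step in the log shows the INFORMAT and FORMAT that were applied to it.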
Let's say I have stores all around the world and I want to know the sum of the five largest losses for each store. What is the code for that?
Here is my attempt:
proc sort data= store out=sorted_store;
by store descending amount;
run;
and
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then "Sum_5Largest_Losses"n=sum(amount);
end;
run;
But this just gives the 5th amount and not the sum of amounts 1 to 5, and I really don't know how to select the top 5 of EACH store. I think some kind of GROUP BY would be a perfect fit. But first things first: how do I select i = 1...5, and not just i = 5?
There is also a way of doing it with proc sql:
data have;
input store$ amount;
datalines;
A 100
A 200
A 300
A 400
A 500
A 600
A 700
B 1000
B 1100
C 1200
C 1300
C 1400
D 600
D 700
E 1000
E 1100
F 1200
;
run;
proc sql outobs=4; /* limit to the first 4 results */
select store, sum(amount) as TOTAL_AMT
from have
group by 1
order by 2 desc; /* order them to the TOP selection*/
quit;
The data step sum(,) function adds up its arguments. If you only give it one argument then there is nothing to actually sum so it just returns the input value.
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then Sum_5Largest_Losses=sum(Sum_5Largest_Losses,amount);
end;
run;
I would highly recommend learning the basic methods before getting into DOW loops.
Add a counter so you can find the first 5 of each store.
Retain the running total so that the sum accumulates as the data step loops.
Output the sum when the counter reaches 5, or on the last record if a store has fewer than 5.
proc sort data= store out=sorted_store;
by store descending amount;
run;
data calc1;
set sorted_store;
by store;
*keep the running total from row to row (counter is retained by its sum statement);
retain total_sum;
*if first store then set counter to 1 and total sum to 0;
if first.store then do ;
counter=1;
total_sum=0;
end;
*otherwise increment the counter;
else counter+1;
*accumulate the sum if counter <= 5;
if counter <=5 then total_sum = sum(total_sum, amount);
*output on the 5th record, or on the last record when a store has fewer than 5;
if counter=5 or (last.store and counter<5) then output;
run;
When the code below is executed, how many columns will be in the Willis dataset?
data Willis;
put "Willis"; /* Line 32 */
do b = 1 to 12;
A = rand("normal", 12.5, 1.57); /* Line 34 */
do j = 1 to 5;
lev = rand("normal", 4, 2.155) + A; /* Line 36 */
output;
end;
end;
put "Willis" b=; /* Line 40 */
run;
A data step that does not use a KEEP or DROP statement, or a KEEP= or DROP= data set option, will output every non-automatic variable in the program data vector (PDV). Any variable that appears in a statement in the code becomes part of the PDV. The PUT statements only write text to the log; they do not create variables.
The four non-automatic variables are:
b - first used in do loop statement
A - first used to receive a random value
j - first used in do loop statement
lev - first used to receive a random value
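One way to confirm the count is to run the step and list the stored variables. This is a sketch that repeats the data step from the question without the PUT statements (they only affect the log); the PROC CONTENTS step is the addition.

```sas
data Willis;
    do b = 1 to 12;
        A = rand("normal", 12.5, 1.57);
        do j = 1 to 5;
            lev = rand("normal", 4, 2.155) + A;
            output;   /* one row per (b, j) pair */
        end;
    end;
run;

/* VARNUM lists the variables in creation order: b, A, j, lev. */
proc contents data=Willis varnum;
run;
```

The report should show 4 variables and 60 observations (12 values of b times 5 values of j).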
I would like to assign IDs with blank Sizes a size based on the frequency distribution of their Group.
Dataset A contains a snapshot of my data:
ID Group Size
1 A Large
2 B Small
3 C Small
5 D Medium
6 C Large
7 B Medium
8 B -
Dataset B shows the frequency distribution of the Sizes among the Groups:
Group Small Medium Large
A 0.31 0.25 0.44
B 0.43 0.22 0.35
C 0.10 0.13 0.78
D 0.29 0.27 0.44
For ID 8, we know that it has a 43% probability of being "small", a 22% probability of being "medium" and a 35% probability of being "large". That's because these are the Size distributions for Group B.
How do I assign ID 8 (and other blank IDs) a Size based on the Group distributions in Dataset B? I'm using SAS 9.4. Macros, SQL, anything is welcome!
The table distribution is ideal for this. The last data step here shows that; the steps before it create random data and build the frequency table, so you can skip them if you already have both.
See Rick Wicklin's blog about simulating multinomial data for an example of this in other use cases (and more information about the function).
*Setting this up to help generate random data;
proc format;
value sizef
low - 1.3 = 'Small'
1.3 <-<2.3 = 'Medium'
2.3 - high = 'Large'
;
quit;
*Generating random data;
data have;
call streaminit(7);
do id = 1 to 1e5;
group = byte(65+rand('Uniform')*4); *A = 65, B = 66, etc.;
size = put((rank(group)-66)*0.5 + rand('Uniform')*3,sizef.); *Intentionally making size somewhat linked to group to allow for differences in the frequency;
if rand('Uniform') < 0.05 then call missing(size); *A separate call to set missingness;
output;
end;
run;
proc sort data=have;
by group;
run;
title "Initial frequency of size by group";
proc freq data=have;
by group;
tables size/list out=freq_size;
run;
title;
*Transpose to one row per group, needed for table distribution;
proc transpose data=freq_size out=table_size prefix=pct_;
var percent;
id size;
by group;
run;
data want;
merge have table_size;
by group;
array pcts pct_:; *convenience array;
if first.group then do _i = 1 to dim(pcts); *must divide by 100 but only once!;
pcts[_i] = pcts[_i]/100;
end;
if missing(size) then do;
size_new = rand('table',of pcts[*]); *table uses the pcts[] array to tell SAS the table of probabilities;
size = scan(vname(pcts[size_new]),2,'_');
end;
run;
title "Final frequency of size by group";
proc freq data=want;
by group;
tables size/list;
run;
title;
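To see what rand('table', ...) does on its own, here is a small sketch that uses the Group B probabilities from the question directly. The data set name demo, the seed, and the sizes lookup array are made up for illustration.

```sas
data demo;
    call streaminit(1);
    /* Lookup from the index returned by the table distribution to a label. */
    array sizes[3] $6 _temporary_ ('Small' 'Medium' 'Large');
    do i = 1 to 1000;
        /* Returns 1, 2 or 3 with probabilities 0.43, 0.22 and 0.35. */
        idx = rand('table', 0.43, 0.22, 0.35);
        size = sizes[idx];
        output;
    end;
run;

/* The observed shares should be close to 43%, 22% and 35%. */
proc freq data=demo;
    tables size;
run;
```

The full solution above does the same thing, except that the probabilities come from the pct_ variables merged in for each group rather than being typed in.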
You can also do this with a random value and some if-else logic:
proc sql;
create table temp_assigned as select
a.*, rand("Uniform") as random_roll, /*generate a random number from 0 to 1*/
case when missing(size) then
case when calculated random_roll < small then small
when calculated random_roll < sum(small, medium) then medium
when calculated random_roll < sum(small, medium, large) then large
end end as value_selected, /*pick the value of the size associated with that value in each group*/
coalesce(case when calculated value_selected = small then "Small"
when calculated value_selected = medium then "Medium"
when calculated value_selected = large then "Large" end, size) as group_assigned /*pick the value associated with that size*/
from temp as a
left join freqs as b
on a.group = b.group;
quit;
Obviously you can do this without creating the value_selected variable, but I thought showing it for demonstrative purposes would be helpful.
Is there a way to make multiple bar charts with uniform axis with proc gchart?
In proc gplot, I can use the uniform option like this:
proc gplot data=test uniform;
by state;
plot var*date;
run;
This will give me a set of plots for the by variable that all use the same axis range.
This option doesn't exist for proc gchart--is there any other way to do this? I can't just define a fixed range since my data will vary.
Thanks for the input everyone.
Since it looks like there isn't a good solution within the proc itself, I went with a macro approach to manually setting the axis.
This paper provided the foundation for what I did:
http://analytics.ncsu.edu/sesug/2012/BB-09.pdf
Since I couldn't find the text of the program anywhere except for in that non-searchable PDF, I've typed it in here. My version adds one additional parameter that optionally pads the low value in order to leave space for data labels below the low point (useful if you are making a column chart with labels above the positive values and below the negative values)
%macro set_axis_minmaxincrement(ds=,
axisvar=,
axis_length = 51,
sa_min = 999999,
sa_max = -999999,
returned_min = axis_min,
returned_max = axis_max,
returned_increment = axis_increment,
force_zero = 0,
pad_bottom = 0
) ;
%global &returned_min &returned_max &returned_increment;
/* Find the high and low values. Note: a data step was used versus a proc */
/* to allow the application of the option parameters, if specified. */
proc sort data=&ds out=sortlb(keep=&axisvar);
by &axisvar;
where &axisvar ne .;
run;
data axisdata(keep=low high);
retain low 0;
set sortlb end=eof;
by &axisvar;
if _n_=1 then low = &axisvar;
if eof then do;
high = &axisvar;
if &sa_min ^= 999999 and &sa_min < low then low = &sa_min;
if &sa_max ^= -999999 and &sa_max > high then high = &sa_max;
%if &force_zero = 1 %then %do;
if low > 0 then low = 0;
else if high < 0 then high = 0;
%end;
%if &pad_bottom = 1 %then %do;
if low < 0 then low = low-((high-low)*.06);
%end;
output;
end;
run;
data axisdata;
set axisdata;
/* ensure that high is greater than low */
if high <= low then do;
if abs(low) <= 1 then high = low + 1;
else high = low+10;
end;
/* Calculate the conversion unit to transform the standard range to */
/* include the actual range. This value is used to convert the standard */
/* to the actual increment for the actual range. */
axisrange = high - low;
/* ranges of 6 or less */
if axisrange <= 6 then do;
check = 6;
conversion_unit = .1;
do until (axisrange > check);
check = check/10;
if axisrange <= check then conversion_unit = conversion_unit / 10;
end;
end;
/* Ranges of 1 or greater */
else do;
check = 60;
conversion_unit = 1.0;
do while (axisrange > check);
check = check*10;
conversion_unit = conversion_unit * 10;
end;
end;
/* standardize the range to lie between 6 to 60 */
unit_range = axisrange/conversion_unit;
/* Set the increment based on the unitized range */
/* 'Long' axis, 8 - 12 increments */
%if &axis_length >50 %then %do;
if unit_range < 12 then axisinc = 1 * conversion_unit;
else if unit_range < 24 then axisinc = 2 * conversion_unit;
else if unit_range < 30 then axisinc = 2.5 * conversion_unit;
else axisinc = 5 * conversion_unit;
%end;
/* Otherwise, 'short' axis, 4-6 increments */
%else %do;
if unit_range < 12 then axisinc = 2 * conversion_unit;
else if unit_range < 18 then axisinc = 3 * conversion_unit;
else if unit_range < 24 then axisinc = 4 * conversion_unit;
else if unit_range < 30 then axisinc = 5 * conversion_unit;
else axisinc = 10 * conversion_unit;
%end;
/*Round the min's value to match the increment; if the number is */
/* rounded up so that it becomes larger than the lowest data value, */
/* decrease the min by one increment. */
axislow = round(low,axisinc);
if axislow > low then axislow = axislow - axisinc;
/* Round the max; if the number is rounded down, */
/* increase the max by one increment. */
axishigh = round(high, axisinc);
if axishigh < high then axishigh = axishigh + axisinc;
/* put the values into the global macro variables */
call symput("&returned_min",compress(put(axislow, best.)));
call symput("&returned_max",compress(put(axishigh, best.)));
call symput("&returned_increment",compress(put(axisinc, best.)));
run;
%mend set_axis_minmaxincrement;
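A sketch of how the macro might then be used to get a uniform axis across BY groups in proc gchart. It assumes the data set test and the variables state, date and var from the question, with one row per state and date so that each bar's height is a single value of var.

```sas
/* Compute one axis range from the whole data set, across all BY groups. */
/* The results land in the global macro variables axis_min, axis_max     */
/* and axis_increment (the macro's default return names).                */
%set_axis_minmaxincrement(ds=test, axisvar=var);

/* Define a response axis with that fixed range. */
axis1 order=(&axis_min to &axis_max by &axis_increment);

proc sort data=test;
    by state;
run;

/* Every chart in the BY group uses the same response axis. */
proc gchart data=test;
    by state;
    vbar date / discrete sumvar=var raxis=axis1;
run;
quit;
```

If several rows share a date within a state, the bars show sums of var, so the range would need to be computed on the summed values instead.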
This is somewhat complex (well to me at least).
Here is what I have to do:
Say that I have the following dataset:
date price volume
02-Sep 40 100
03-Sep 45 200
04-Sep 46 150
05-Sep 43 300
Say that I have a breakpoint where I wish to create an interval in my dataset. For instance, let my breakpoint = 200 volume transaction.
What I want is to create an ID column that takes the values 1, 2, 3, ... and moves to the next value each time the cumulative volume reaches the breakpoint of 200. When you sum the volume per ID, the total must be the same for every ID.
So using my example above, my final dataset should look like the following:
date price volume id
02-Sep 40 100 1
03-Sep 45 100 1
03-Sep 45 100 2
04-Sep 46 100 2
04-Sep 46 50 3
05-Sep 43 150 3
05-Sep 43 150 4
(last row can miss some value but that is fine. I will kick out the last id)
As you can see, I had to "decompose" some rows (like the second row for instance, I break the 200 into two 100 volume) in order to have constant value of the sum, 200, of volume across all ID.
Looks like you're doing volume bucketing for a flow toxicity VPIN calculation. I think this works:
%let bucketsize = 200;
data buckets(drop=bucket volume rename=(vol=volume));
set tmp;
retain bucket &bucketsize id 1;
do until(volume=0);
vol=min(volume,bucket);
output;
volume=volume-vol;
bucket=bucket-vol;
if bucket=0 then do;
bucket=&bucketsize;
id=id+1;
end;
end;
run;
I tested this with your dataset and it looks right, but I would carefully check several cases to confirm that it works correctly.
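One way to run such a check is to build the sample data, apply the bucketing step, and total the volume per id. This is a sketch: the data set tmp is the example from the question, the bucketing step is repeated from the answer so the block runs on its own, and bucket_totals and total_volume are names chosen here.

```sas
%let bucketsize = 200;

/* Sample data from the question. */
data tmp;
    input date $ price volume;
    datalines;
02-Sep 40 100
03-Sep 45 200
04-Sep 46 150
05-Sep 43 300
;
run;

/* Bucketing step, repeated from the answer. */
data buckets(drop=bucket volume rename=(vol=volume));
    set tmp;
    retain bucket &bucketsize id 1;
    do until(volume=0);
        vol=min(volume,bucket);
        output;
        volume=volume-vol;
        bucket=bucket-vol;
        if bucket=0 then do;
            bucket=&bucketsize;
            id=id+1;
        end;
    end;
run;

/* Total volume per id: every complete bucket should sum to &bucketsize. */
proc means data=buckets noprint nway;
    class id;
    var volume;
    output out=bucket_totals(drop=_type_ _freq_) sum=total_volume;
run;

proc print data=bucket_totals;
run;
```

With the sample data, ids 1 to 3 should each total 200 and the incomplete id 4 should total 150.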
If you have a variable which indicates 'Buy' or 'Sell', then you can try this. Let's say this variable is called type and takes the values 'B' or 'S'. One advantage of using this method would be that it is easier to process 'by-groups' if any.
%let bucketsize = 200;
data tmp2;
set tmp;
retain volsumb idb volsums ids;
/* Initialize once, on the first observation only. */
if _n_ = 1 then do; volsumb = 0; idb = 1; volsums = 0; ids = 1; end;
/* Store the current total for each type. */
if type = 'B' then volsumb = volsumb + volume;
else if type = 'S' then volsums = volsums + volume;
/* Write the row with its current ids before any reset, so the row that completes a bucket keeps that bucket's id. */
output;
/* If the total has reached 200, then reset and increment id. */
/* You have not given the algorithm if the volume exceeds 200, for example the first two values are 150 and 75. */
if volsumb = &bucketsize then do; idb = idb + 1; volsumb = 0; end;
if volsums = &bucketsize then do; ids = ids + 1; volsums = 0; end;
drop volsumb volsums;
run;