Calculating the length of a string within a variable SAS - sas

I have imported some data into SAS through SQL. I want to change the length of the variables to improve the efficiency and storage of the dataset; however, I don't know what the maximum length of each variable should be.
For example, I have the variable "Forename". Its length is currently set to $300. I know this is too large, but I don't want to guess at what it should be in case I chop off any data. If I have the following names, how can I work out that I need to reset the length to $11?
Obs  Forename     Total Character Length
1    Tim          3
2    Gary         4
3    Samantha     8
4    Christopher  11

This isn't really a dynamic answer but it's pretty simple. Just get the maximum length of the Forename variable.
proc sql;
select max(length(forename)) from have;
quit;
This returns the maximum length, which you can then plug into the following data step. The LENGTH statement must come before the SET statement for the new length to take effect.
data want;
length forename $11;
set have;
run;
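If a more dynamic version is wanted, one possible sketch is to store the maximum length in a macro variable and reuse it in the LENGTH statement. The macro variable name maxlen is an assumption made for illustration; have, want and forename are the names used above. SAS may still write a warning to the log about multiple lengths being specified for forename, which is expected when a variable is deliberately shortened.

```sas
/* Store the longest observed value length in a macro variable (maxlen is a made-up name) */
proc sql noprint;
  select max(length(forename)) into :maxlen trimmed
  from have;
quit;

/* Reuse it in the LENGTH statement, which must precede SET */
data want;
  length forename $&maxlen;
  set have;
run;
```

This reads the data twice, once to measure and once to copy, which is usually acceptable for a one-off clean-up.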

Related

SAS Array change to header text

I have a large data set with 73 columns of characteristics. If a member has a 1 in the box, then they have the characteristic. I would like to change all of the 1s to the characteristic name. When uploading the data set I changed all columns to text, and I can use the following code to replace the 1s with "Yes", but I am trying to figure out how to change them to the column header text, i.e. "Single", "Married", etc.
DATA DataSetb;
SET DataSetA ;
array change _CHARACTER_ ;
do over change;
if change=1. then change=????????
End;
run;
This will only work if the variables are character in the first place, and if they are, you also need to make sure their length is large enough to hold the name.
Assuming both conditions are met, you can use the VNAME() function to retrieve the name.
DO OVER was deprecated some 20 years ago, so I don't use it. The loop below will work if the assumptions above are met.
You may also want to declare another array to hold the names if the length isn't large enough.
do i=1 to dim(change);
if change(i)='1' then change(i)=vname(change(i));
end;
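For context, a complete data step built around that loop might look like the sketch below. The dataset names flags and flags_named and the two sample columns are hypothetical; the columns are declared as $32 so that any SAS variable name fits.

```sas
/* Hypothetical sample data: character flag columns wide enough to hold a variable name */
data flags;
  length Single Married $32;
  input Single $ Married $;
  datalines;
1 0
0 1
1 1
;
run;

data flags_named;
  set flags;
  array change {*} Single Married;   /* _character_ would pick up every character column */
  do i = 1 to dim(change);
    /* replace the flag value '1' with the name of the column it sits in */
    if change{i} = '1' then change{i} = vname(change{i});
  end;
  drop i;
run;
```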
You can use the VNAME() function. Note you need to have defined the variables long enough to hold the names. SAS variable names can be up to 32 characters long.
Here is example code you can try that will show what happens if the variables are too short. Notice how the value of SEX is truncated to just 'S' since that variable is defined with length=$1.
data test;
set sashelp.class ;
array _c _character_;
do over _c ; _c=vname(_c); end;
run;
proc freq ;
tables _character_ / list ;
run;

Is it possible to have a numeric variable in SAS with length less than 3 bytes?

I have a numeric variable in a SAS dataset which has length 8. Despite its length being 8 bytes, it only ever holds a single digit. See the example below.
my_variable
1
2
5
9
0
3
The problem is that I need this variable to be only 1 byte in length and SAS doesn't accept it. I am running the following code:
data my_data_2;
set my_data;
length my_variable 1;
run;
And SAS reports this error message:
ERROR 352-185: The length of numeric variables is 3-8.
1 - So, why can't I have a numeric variable with a length less than 3 (or greater than 8) bytes?
2 - Is there a way to manage this? I really need this variable to be length 1.
Edit - adding more context:
I need this specific variable to be length one because I need to submit this dataset to a regulatory authority in my country. They require this variable to be numeric and of length one; otherwise their validation program will not be able to read it. It also needs to be submitted as a .DBF file (which is simply done by using SAS proc export).
I tried to use Microsoft Access 2013 to change length to 1 and it works. The problem is that Access 2013 does not read or save .DBF as it is an old file format. So, I wanted to change the length in SAS and simply export it .DBF.
According to the documentation:
The minimum length for a SAS variable on Windows and UNIX operating
systems is 3 bytes, and the maximum length is 8 bytes. On IBM
mainframes, the minimum length for a SAS variable is 2 bytes, and the
maximum length is 8 bytes.
The SAS numeric variable length is counted in bytes, not digits.
If you need a flag use character variable instead.
The length of a numeric variable in SAS is the number of bytes that it can be stored in. As SAS only uses floating point numerics, they cannot be smaller than 3 bytes in Windows or Unix (2 in z/OS); there is no integer or binary/bit data type in base SAS.
You're welcome to use a format which controls the field width displayed on the screen.
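As a sketch of the difference between storage length and display width: the step below stores a flag in the 3-byte minimum (which on Windows and UNIX is enough to hold integers up to 8,192 exactly) and attaches a 1. format so that it displays in a single column. The dataset and variable names are made up for illustration.

```sas
data flags;
  length flag 3;      /* storage: 3 bytes, the minimum on Windows/UNIX */
  format flag 1.;     /* display: one character wide */
  do flag = 0 to 9;
    output;
  end;
run;

/* Reports the storage length and the attached format for each variable */
proc contents data=flags;
run;
```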
I think that PROC EXPORT will write a numeric variable with a length of 1. You just need to attach a format to it so that the proc knows that is what you want.
Try this test program.
%let fname=%sysfunc(pathname(work))/test.dbf ;
data test;
length male 8 sex $1 female 8;
set sashelp.class(obs=3 keep=sex );
male=(sex='M');
female=(sex='F');
format male female F1.;
run;
proc export data=test outfile="&fname" replace
dbms=dbf
;
run;
Then dump the contents of your DBF file as binary to the log
data _null_;
infile "&fname" recfm=f lrecl=32;
input;
list;
run;
and compare it to the description of the file format https://en.wikipedia.org/wiki/.dbf#File_architecture_overview

Macro that outputs table with testing results of SAS table

Problem
I'm not a very experienced SAS user, but unfortunately the lab where I can access data is restricted to SAS. Also, I don't currently have access to the data since it is only available in the lab, so I've created simulated data for testing.
I need to create a macro that gets the values and dimensions from a PROC MEANS table and performs some tests that check whether or not the top two values from the data make up 90% of the results.
As an example, assume I have panel data that lists firms revenue, costs, and profits. I've created a table that lists n, sum, mean, median, and std. Now I need to check whether or not the top two firms make up 90% of the results and if so, flag if it's profit, revenue, or costs that makes up 90%.
I'm not sure how to get started.
Here are the steps:
1. Read the data.
2. Read the PROC MEANS table created, and get its dimensions and variables.
3. Get the top two firms in each variable and perform the check.
4. Create a new table that lists the variable, the value from the table read, the largest and second largest values, and a flag.
5. Print the table.
Simulated data :
https://www.dropbox.com/s/ypmri8s6i8irn8a/dataset.csv?dl=0
PROC MEANS Table
proc import datafile="/folders/myfolders/dataset.csv"
out=dt
dbms=csv
replace;
getnames=yes;
run;
TITLE "Macro Project Sample";
PROC MEANS n sum mean median std;
VAR V1 V2 V3;
RUN;
Desired Results:
Variable  Value      Largest  Sec. Largest  Flag
V1        463138.09  9888.09  9847.13
V2        148.92     1.99     1.99
V3        11503375   9999900  1000000       Y
At the moment I can't open your simulated dataset, but I can give you some advice that I hope will help.
You can add the n extreme values of the given variables to the output data set by using the IDGROUP option of the OUTPUT OUT= statement.
Here is an example using the Charity dataset (run the code at http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p1oii7oi6k9gfxn19hxiiszb70ms.htm to create it):
proc means data=Charity;
var MoneyRaised HoursVolunteered;
output out=try sum=
IDGROUP ( MAX (Moneyraised HoursVolunteered) OUT[2] (moneyraised hoursvolunteered)=max1 max2);
run;
data var1 (keep=name1 _freq_ moneyraised max1_1 max1_2 rename=(moneyraised=value max1_1=largest max1_2=seclargest name1=name))
var2 (keep=name2 _freq_ HoursVolunteered max2_1 max2_2 rename=(HoursVolunteered=value max2_1=largest max2_2=seclargest name2=name));
length name1 name2 $4;
set try ;
name1='VAR1';
name2='VAR2';
run;
data finalmerge;
length flag $1;
set var1 var2;
if largest+seclargest > value*0.9 then flag='Y';
run;
In the PROC MEANS step I chose the variables moneyraised and hoursvolunteered; you will use your own variables instead and make the matching changes throughout the program.
IDGROUP outputs the maximum values of both variables, as listed in the parentheses, and OUT[2] requests the two largest values, i.e. the largest and the second largest.
You must name the output variables. I chose max1 and max2, and SAS automatically appends _1 and _2 for the largest and second largest values.
All the output is on a single row, so I use a data step that writes two output datasets (data var1 var2), keeping the variables needed and renaming them for the next step; I also chose a naming scheme, as you can see.
Finally, I stack the two datasets and add the flag.
Here are some initial steps and pointers for a non-macro approach that restructures the data so that no array processing is required. This approach should teach you a bit about manipulating data in SAS, but it will not be as fast as a single-pass approach (like the macros you originally posted), because it transposes and sorts the data.
First create some nice looking dummy data.
/* Create some dummy data with three variables to assess */
data have;
do firm = 1 to 3;
revenue = rand("uniform");
costs = rand("uniform");
profits = rand("uniform");
output;
end;
run;
Transpose the data so all the values are in one column (with the variable names in another).
/* Move from wide to deep table */
proc transpose
data = have
out = trans
name = Variable;
by firm;
var revenue costs profits;
run;
Sort the data so each variable is in a contiguous group of rows and the highest values are at the end of each Variable group.
/* Sort by Variable and then value
so the biggest values are at the end of each Variable group */
proc sort data = trans;
by Variable COL1;
run;
Because of the structure of this data, you could go down through each observation in turn, creating a running total, which when you get to the final observation in a Variable group would be the Variable total. In this observation you also have the largest value (the second largest was in the previous observation).
At this point you can create a data step that:
- Is aware when it is in the first and last values of each variable group
  - by statement to make the data step aware of your groups
  - first.Variable temporary variable so you can initialise your total variable to 0
  - last.Variable temporary variable so you can output only the last line of each group
- Sums up the values in each group
  - retain statement so SAS doesn't empty your total with each new observation
  - sum() function or + operator to create your total
- Creates and populates new variables for the largest and second largest values in each group
  - lag() function or retain statement to keep the previous value (the second largest)
- Creates your flag
- Outputs your new variables at the end of each group
  - output statement to request an observation be stored
  - keep statement to select which variables you want
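Put together, the outline above could be sketched as the following data step. It assumes the sorted trans dataset from the previous step (with the columns Variable and COL1); the names summary, total, previous, largest, sec_largest and flag are made up for illustration, and the 90% threshold comes from the question.

```sas
data summary;
  set trans;
  by Variable;                      /* enables first.Variable / last.Variable */
  length flag $1;
  retain total;                     /* keep the running total across observations */

  previous = lag(COL1);             /* call LAG on every row so its queue stays in step */
  if first.Variable then do;
    total = 0;
    previous = .;                   /* do not carry a value over from the prior group */
  end;

  total = sum(total, COL1);

  if last.Variable then do;
    largest     = COL1;             /* data is sorted ascending within each group */
    sec_largest = previous;
    if largest + sec_largest > 0.9 * total then flag = 'Y';
    output;                         /* one row per Variable group */
  end;

  keep Variable total largest sec_largest flag;
run;
```

For a group with a single row, sec_largest stays missing, so the comparison is false and no flag is set; whether that is the desired behaviour is a design choice to check against the real data.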
The macros you posted originally looked like they were meant to perform the analysis you are describing but with some extras (only positive values contributed to the Total, an arbitrary number of values could be included rather than just the top 2, the total was multiplied by another variable k1198, negative values were caught in the second largest, extra flags and values were calculated).

Unknown Errors with Proc Transpose

I'm trying to use proc transpose to convert a dataset of the form:
ID_Variable Target_Variable String_Variable_1 ... String_Variable_100
1 0 The End
2 0 Don't Stop
to the form:
ID_Variable Target_Variable String_Variable
1 0 The
. . .
. . .
1 0 End
2 0 Don't
. . .
. . .
2 0 Stop
However, when I run the code:
proc transpose data=input_data out=output_data;
by ID_Variable Target_Variable;
var String_Variable_1-String_Variable_100;
run;
The file size balloons from 33.6GB for the input to over 14TB for the output, and instead of the output described above we get that output plus many additional, completely null string variables (41 of them). There are no other columns in the input dataset, so I'm unsure why the resulting output occurs. I already have a workaround using macros to create my own proxy transposing procedure, but any information about why the situation above is being encountered would be extremely appreciated.
In addition to the suggestion of compression (which is nearly always a good one when dealing with even medium sized datasets!), I'll make a suggestion for a simple solution without PROC TRANSPOSE, and hazard a few guesses as to what's going on.
First off, wide-to-narrow transpose is usually just as easy in a data step, and sometimes can be faster (not always). You don't need a macro to do it, unless you really like typing ampersands and percent signs, in which case feel free.
data want;
set have;
array transvars string_Variable_1-string_Variable_100;
do _t = 1 to dim(transvars);
string_variable = transvars[_t];
if not missing(String_variable) then output; *unless you want the missing ones;
end;
keep id_variable target_variable string_Variable;
run;
Nice short code, and if you want you can throw in a call to vname to get the name of the transposed variable (or not). PROC TRANSPOSE is shorter, but this is short enough that I often just use it instead.
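As a sketch of the vname variant mentioned above, the same step can also record which column each value came from. The output column name source_name is an assumption made for illustration; everything else follows the step above.

```sas
data want;
  set have;
  array transvars {*} string_Variable_1-string_Variable_100;
  length source_name $32;                  /* long enough for any SAS variable name */
  do _t = 1 to dim(transvars);
    string_variable = transvars[_t];
    source_name = vname(transvars[_t]);    /* name of the column being transposed */
    if not missing(string_variable) then output;
  end;
  keep id_variable target_variable string_variable source_name;
run;
```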
Second, my guess. 41 extra string variables tells me that you very likely have some duplicate rows within your BY groups. If PROC TRANSPOSE sees duplicates, it will create that many columns. For EVERY ROW, since that's how columns work. It will look like they're empty, and who knows, maybe they are empty - but SAS still transposes empty things if it sees them.
To verify this, run a PROC SORT NODUPKEY before the transpose. If that doesn't delete at least 40 rows (maybe blank rows - if this data originated from Excel or something I wouldn't be shocked to learn you had 41 blank rows at the end) I'll be surprised. If it doesn't fix it, and you don't like the data step solution, then you'll need to provide a reproducible example (i.e., provide some data that has a similar expansion of variables).
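One way to run that check, as a sketch: sort on the BY variables with NODUPKEY and send the removed rows to a separate dataset so they can be inspected. The output names input_nodup and input_dups are made up for illustration.

```sas
/* Rows that share a BY key with an earlier row are written to input_dups */
proc sort data=input_data out=input_nodup nodupkey dupout=input_dups;
  by ID_Variable Target_Variable;
run;
```

The log reports how many observations with duplicate key values were deleted, and input_dups shows what they contain. Sorting a dataset of this size is not cheap, so it may be worth trying on a subset first.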
Without seeing a working example, it's hard to say exactly what's going on here with regards to the extra variables generated by proc transpose.
However, I can see three things that might be contributing to the increased file size after transposing:
1. If you have option compress = no; set, proc transpose creates an uncompressed dataset by default. Also, if some of your character variables are different lengths, they will all be transposed into one variable with the longest length of any of them, further increasing the file size if compression is disabled in the output dataset.
2. I suspect that some of the increase in file size may be coming from the automatic _NAME_ column generated by proc transpose, which contains an extra ~100 * max_var_name_length bytes for every ID-target combination in the input dataset.
3. If you are using option compress = BINARY; (i.e. compressing all output datasets that way by default), the SAS compression algorithm may be less effective after transposing. This is because SAS only compresses one record at a time, and this type of compression is much less effective with shorter records. There isn't much you can do about this, unfortunately.
Here's an example of how you can avoid the first two of these potential issues.
/*Start with a compressed dataset*/
data have(compress = binary);
length String_variable_1 $ 10 String_variable_2 $20; /*These are transposed into 1 var with length 20*/
input ID_Variable Target_Variable String_Variable_1 $ String_Variable_2 $;
cards;
1 0 The End
2 0 Don't Stop
;
run;
/*By default, proc transpose creates an uncompressed output dataset*/
proc transpose data = have out = want_default prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;
/*Transposing with compression enabled and without the _NAME_ column*/
proc transpose data = have out = want(drop = _NAME_ compress = binary) prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;

SAS: How do I point to a specific observation of a value?

I'm very new to SAS and I'm trying to figure out some basic things available in other languages.
I have a table
ID Number
-- ------
1 2
2 5
3 6
4 1
I would like to create a new variable in which the value of Number from one specific observation is added to Number in every observation, like
Number2 = Number + Number[3]
ID Number Number2
-- ------ ------
1 2 8
2 5 11
3 6 12
4 1 7
How do I get the value of the third observation of Number and add it to each observation of Number in a new variable?
There are several ways to do this; here is one using the SAS POINT= option:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
;
run;
data want;
retain adder;
drop adder;
if _n_=1 then do;
adder = 3;
set have point=adder;
adder = number;
end;
set have;
number = number + adder;
run;
The RETAIN and DROP statements define a temp variable to hold the value you want to add. RETAIN means the value is not to be re-initialized to missing each time through the data step and DROP means you do not want to include that variable in the output data set.
The POINT= option allows one to read a specific observation from a SAS data set. The _n_=1 part is a control mechanism to only execute that bit of code once, assigning the variable adder to the value of the third observation.
The next section reads the data set one observation at a time and applies your change.
Note that the same data set is read twice; a handy SAS feature.
I'll start by suggesting that Base SAS doesn't really work this way, normally; it's not that it can't, but normally you can solve most problems without pointing to a specific row.
So while this answer will solve your explicit problem, it's probably not something useful in a real world scenario; usually in the real world you'd have a match key or some other element other than 'row number' to combine with, and if you did then you could do it much more efficiently. You also likely could rearrange your data structure in a way that made this operation more convenient.
That said, the specific example you give is trivial:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
;;;;
run;
data want;
set have;
_t = 3;
set have(rename=(number=number3) keep=number) point=_t ;
number2=number+number3;
run;
If you have SAS/IML (SAS's matrix language), which is somewhat similar to R, then this is a very different story both in your likelihood to perform this operation and in how you'd do it.
proc iml;
a= {1 2, 2 5, 3 6, 4 1}; *create initial matrix;
b = a[,2] + a[3,2]; *create a new matrix which is the 2nd column of a added
elementwise to the value in the third row second column;
c = a||b; *append new matrix to a - could be done in same step of course;
print b c;
quit;
To do this with the first observation is a lot easier.
data want;
set have;
retain _firstpoint; /* prevents _firstpoint from being set to missing each iteration */
if _n_ = 1 then _firstpoint=number; /* on the first iteration (usually first row) set to the value of number */
number = number - _firstpoint; /* now subtract that from number to get relative value */
run;
I'll elaborate a little more on this. SAS works on a record-by-record level, where each record is independently processed in the DATA step. (PROCs, on the other hand, may not behave this way, though many do at some level.) SAS, like SQL and similar databases, doesn't truly acknowledge that any row is "first" or "second" or "nth"; however, unlike SQL, it does let you pretend that it is, based on the current sort. The POINT= random access method is one way to go about doing that.
Most of the time, though, you're going to be using something in the data to determine what you want to do rather than something related to the ordering of the data. Here's a way you could do the same thing as the POINT= method, but using the value of ID:
data want;
if _n_ = 1 then set have(where=(ID=3) rename=(number=number3));
set have;
number2=number+number3;
run;
In the first iteration of the data step (_N_=1), that takes the row from HAVE where ID=3, and then takes the rows from HAVE in order. Really, it does this:
*check to see if _n_=1; it is; so take row id=3;
*take first row (id=1);
*check to see if _n_=1; it is not;
*take second row (id=2);
... continue ...
Variables that are in a SET statement are automatically retained, so NUMBER3 is automatically retained (yay!) and not set to missing between iterations of the data step loop. As long as you don't modify the value, it will stay for each iteration.