Unknown Errors with Proc Transpose - sas

Trying to utilize proc transpose to a dataset of the form:
ID_Variable Target_Variable String_Variable_1 ... String_Variable_100
1 0 The End
2 0 Don't Stop
to the form:
ID_Variable Target_Variable String_Variable
1 0 The
. . .
. . .
1 0 End
2 0 Don't
. . .
. . .
2 0 Stop
However, when I run the code:
proc transpose data=input_data out=output_data;
by ID_Variable Target_Variable;
var String_Variable_1-String_Variable_100;
run;
The change in file size from input to output balloons from 33.6GB to over 14TB, and instead of the output described above we have that output with many additional completely null string variables (41 of them). There are no other columns on the input dataset so I'm unsure why the resulting output occurs. I already have a work around using macros to create my own proxy transposing procedure, but any information around why the situation above is being encountered would be extremely appreciated.

In addition to the suggestion of compression (which is nearly always a good one when dealing with even medium sized datasets!), I'll make a suggestion for a simple solution without PROC TRANSPOSE, and hazard a few guesses as to what's going on.
First off, wide-to-narrow transpose is usually just as easy in a data step, and sometimes can be faster (not always). You don't need a macro to do it, unless you really like typing ampersands and percent signs, in which case feel free.
data want;
set have;
array transvars string_Variable_1-string_Variable_100;
do _t = 1 to dim(transvars);
string_variable = transvars[_t];
if not missing(String_variable) then output; *unless you want the missing ones;
end;
keep id_variable target_variable string_Variable;
run;
Nice short code, and if you want you can throw in a call to vname to get the name of the transposed variable (or not). PROC TRANSPOSE is shorter, but this is short enough that I often just use it instead.
Second, my guess. 41 extra string variables tells me that you very likely have some duplicates by your BY group. If PROC TRANSPOSE sees duplicates, it will create that many columns. For EVERY ROW, since that's how columns work. It will look like they're empty, and who knows, maybe they are empty - but SAS still transposes empty things if it sees them.
To verify this, run a PROC SORT NODUPKEY before the transpose. If that doesn't delete at least 40 rows (maybe blank rows - if this data originated from excel or something I wouldn't be shocked to learn you had 41 blank rows at the end) I'll be surprised. If it doesn't fix it, and you don't like the datastep solution, then you'll need to provide a reproducible example (ie, provide some data that has a similar expansion of variables).

Without seeing a working example, it's hard to say exactly what's going on here with regards to the extra variables generated by proc transpose.
However, I can see three things that might be contributing towards the increased file size after transposing:
If you have option compress = no; set, proc transpose creates an uncompressed dataset by default. Also, if some of your character variables are different lengths, they will all be transposed into one variable with the longest length of any of them, further increasing the file size if compression is disabled in the output dataset.
I suspect that some of the increase in file size may be coming from the automatic _NAME_ column generated by proc transpose, which contains an extra ~100 * max_var_name_length bytes for every ID-target combination in the input dataset.
If you are using option compress = BINARY; (i.e. compressing all output datasets that way by default), the SAS compression algorithm may be less effective after transposing. This is because SAS only compresses one record at a time, and this type of compression is much less effective with shorter records. There isn't much you can do about this, unfortunately.
Here's an example of how you can avoid both of these potential issues.
/*Start with a compressed dataset*/
data have(compress = binary);
length String_variable_1 $ 10 String_variable_2 $20; /*These are transposed into 1 var with length 20*/
input ID_Variable Target_Variable String_Variable_1 $ String_Variable_2 $;
cards;
1 0 The End
2 0 Don't Stop
;
run;
/*By default, proc transpose creates an uncompressed output dataset*/
proc transpose data = have out = want_default prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;
/*Transposing with compression enabled and without the _NAME_ column*/
proc transpose data = have out = want(drop = _NAME_ compress = binary) prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;

Related

SAS: Adding aggregated data to same dataset

I'm migrating from SPSS to SAS.
I need to compute the sum of variable varX, separately by groups of variables varA varB, and add it as a new variable sumX to the same dataset.
In SPSS this is implemented easily with aggregate:
aggregate outfile *
/break varA varB
/SUMvarX = sum(varX).
can this be done in SAS?
There are a number of ways to do this, but the best way depends on your data.
For a typical use case, the PROC MEANS solution is what I'd recommend. It's not the fastest, but it gets the job done, and it has a lot lower opportunity of error - you're not really doing anything except match-merging afterwards.
Use the class statement instead of by in most cases; it shouldn't make much of a difference, but it's the purpose of class. by runs the analysis separately for each value of those variables; class runs one analysis grouping by all of those variables. It is more flexible and doesn't require a sorted dataset (though you would have to sort anyway for the later merge). class also lets you do multiple combinations - not just the nway combination you ask for here, but if you want it grouped just by a, just by b, and by a*b, you can get that (with class and types).
proc means data=have;
class a b;
var x;
output out=summary sum(x)=;
run;
data want;
merge have summary;
by a b;
run;
The DoW loop covered in Kermit's answer is a reasonable data step option also, though more risky in terms of programmer error; I'd use it only in particular cases where the dataset is very very large - more than fits in memory in summary size large - and performance was important.
If the data fits in memory, you can also use a hash table to do the summary, and that's what I'd do if the summary dataset fit comfortably in memory. This is too long for an answer here, but Data Aggregation using Hash Object is a good start for how to do that. Basically, you use a hash table to store the results of the summary (not the raw data), adding to it with each row, and then output the hash table at the end. A bit faster than the DoW loop, but slightly memory constrained (although if you used SPSS, you're much more memory constrained than this!). Also very easy to handle multiple combinations.
Another "programmer easy" way to do it is with SQL.
proc sql;
create want as
select *, sum(x) as sum_x
from have
group by a,b
;
quit;
This is not standard SQL, but SAS manages it - basically it does the two step process of the proc means and the merge, in one step. I like this in some ways (because it skips the intermediate dataset, even if it does actually make this dataset in the util folder, just cleans up for you automatically) and dislike it in others (it's not standard SQL so it will confuse people, and it leaves a note in the log - only a note, so not a big deal, but still).
Adding a note about SPSS -> SAS thinking. One of the bigger differences you'll see going from SPSS to SAS is that, in SPSS, you have one dataset, and you do stuff to it (mostly). You could save it as a different dataset, but you mostly don't until the end - all of your work really is just editing one dataset, in memory.
In SAS, you read datasets from disk and do stuff and then write them out, and if you're doing anything that is at the dataset level (like a summary), you mostly will do it separately and then recombine with the data in a later step. As such, it's very, very common to have lots of datasets - a program I just ran probably has a thousand. Not kidding! Don't worry about random temporary datasets being produced - it doesn't mean your code is not efficient. It's just how SAS works. There are times where you do have to be careful about it - like you have 150GB datasets or something - but if you're working with 5000 rows with 150 variables, your dataset is so small you could write it a thousand times without noticing a meaningful difference to your code execution time.
The big benefit to this style is that you have different datasets for each step, so if you go back and want to rerun part of your code, you can safely - knowing the predecessor dataset still exists, without having to rerun all of your code. It also lets you debug really easily since you can see each of the component parts.
It's a tradeoff for sure, because it does mean it takes a little longer to run the code, but in the modern day CPUs are really really fast, and so are SSDs - it's just not necessary to write code that stays all in one data step or runs entirely in memory. The tradeoff is that you get the ability to do crazy large amounts of things that couldn't possibly fit in memory, work with massive datasets, etc. - only constrained by disk, which is usually in far greater supply. It's a tradeoff worth making in many cases. When it's possible to do something in a PROC, do so, even when that means it costs a tiny bit of time at the end to re-merge it - the PROCs are what you're paying SAS the big bucks for, they are easy to use, well tested, and fast at what they do.
OK, I think I found a way of doing that.
First, you produce the summing varables:
proc means data= <dataset> noprint nway;
by varA varB;
var varX;
output out=<TEMPdataset> sum = SUMvarX;
run;
then you merge the two datasets:
DATA <dataset>;
MERGE <TEMPdataset> <dataset>;
BY varA varB;
run;
This seems to work, although an extra dataset and several extra variables are formed in the process.
There are probably more efficient ways of doing it...
Ever heard of DoW Loop?
*-- Create synthetic data --*
data have;
varA=2; varB=4; varX=21; output;
varA=4; varB=6; varX=32; output;
varA=5; varB=8; varX=83; output;
varA=4; varB=3; varX=78; output;
varA=4; varB=8; varX=72; output;
varA=2; varB=4; varX=72; output;
run;
proc sort data=have; by varA varB; quit;
varA varB varX
2 4 21
2 4 72
4 3 78
4 6 32
4 8 72
5 8 83
data stage1;
set have;
by varA varB;
if first.varB then group_number+1;
run;
data want;
do _n_=1 by 1 until (last.group_number);
set stage1;
by group_number;
SUMvarX=sum(SUMvarX, varX);
end;
do until (last.group_number);
set stage1;
by group_number;
output;
end;
drop group_number;
run;
varA varB varX SUMvarX
2 4 21 93
2 4 72 93
4 3 78 78
4 6 32 32
4 8 72 72
5 8 83 83

SAS - count of records which appear in both datasets (large data)

Consider I have two datasets:
data dataset_1;
input CASENO X;
datalines;
1 100
2 200
3 300
;
data dataset_2;
input CASENO Y;
datalines;
2 200000
3 300000
;
I'm looking to find how many CASENOs appear in both lists: in the example above, I would get 2.
My data is very large. It is taking a long time to get this sort of result by using a merge.
data result;
merge dataset_1 (in = a) and dataset_2 (in = b);
by CASENO;
if a and b;
RUN;
I'm looking for a more efficient way -
edit: for clarity, is there a way to return the number of matches in two datasets without SAS having to write out the resulting file?
If the datasets are already sorted, the data step merge is incredibly efficient. It passes over each row in each table exactly once. Of course, if you just want the count, you don't need to output all of the rows to a dataset, you can just:
data _null_;
merge dataset_1 (in = a keep=caseno) dataset_2 (in = b keep=caseno) end=eof;
by CASENO;
if a and b then count+1;
if eof then call symputx('count',count);
RUN;
This will be much faster to run since you're not writing anything out. I also add KEEP statements (as Tom points out in comments) to the incoming datasets to only read in the by variable, this produces a speed-up of about 10%.
If the datasets are indexed, you have some additional options that will be faster as they will be doing index scans (such as using SQL). But sorted, non-indexed tables, it's hard to improve on the data step merge.
Try:
proc sql;
select count(caseno) as Number from dataset_1 where caseno in (select caseno from dataset_2);
quit;

SAS Transpose Comma Separated Field

This is a follow-up to an earlier question of mine.
Transposing Comma-delimited field
The answer I got worked for the specific case, but now I have a much larger dataset, so reading it in a datalines statement is not an option. I have a dataset similar to the one created by this process:
data MAIN;
input ID STATUS STATE $;
cards;
123 7 AL,NC,SC,NY
456 6 AL,NC
789 7 ALL
;
run;
There are two problems here:
1: I need a separate row for each state in the STATE column
2: Notice the third observation says 'ALL'. I need to replace that with a list of the specific states, which I can get from a separate dataset (below).
data STATES;
input STATE $;
cards;
AL
NC
SC
NY
TX
;
run;
So, here is the process I am attempting that doesn't seem to be working.
First, I create a list of the STATES needed for the imputation, and a count of said states.
proc sql;
select distinct STATE into :all_states separated by ','
from STATES;
select count(distinct STATE) into :count_states
from STATES;
quit;
Second, I try to impute that list where the 'ALL' value appears for STATE. This is where the first error appears. How can I ensure that the variable STATE is long enough for the new value? Also, how do I handle the commas?
data x_MAIN;
set MAIN;
if STATE='ALL' then STATE="&all_states.";
run;
Finally, I use a SCAN function to read in one state at a time. I'm also getting an error here, but I think fixing the above part may solve it.
data x_MAIN_mod;
set x_MAIN;
array state(&count_states.) state:;
do i=1 to dim(state);
state(i) = scan(STATE,i,',');
end;
run;
Thanks in advance for the help!
Looks like you are almost there. Try this on the last Data Step.
data x_MAIN_mod;
set x_MAIN;
format out_state $2.;
nstate = countw(state,",");
do i=1 to nstate;
out_state = scan(state,i,",");
output;
end;
run;
Do you have to actually have two steps like that? You can use a 'big number' in a temporary variable and not have much effect on things, if you don't have the intermediate dataset.
data x_MAIN;
length state_temp $150;
set MAIN;
if STATE='ALL' then STATE_temp="&all_states.";
else STATE_temp=STATE;
array state(&count_states.) state:;
do i=1 to dim(state);
state(i) = scan(STATE,i,',');
end;
drop STATE_temp;
run;
If you actually do need the STATE, then honestly I'd go with the big number (=50*3, so not all that big) and then add OPTIONS COMPRESS=CHAR; which will (give or take) turn your CHAR fields into VARCHAR (at the cost of a tiny bit of CPU time, but usually far less than the disk read/write time saved).

editing categorical data for uniformity

i have 1 million + rows of data and on of the columns is channel_name. The people collecting the data didn't seem to care that they entered one channel in about 10 different variations, lots of which contain the # symbol. Google search isn't giving me any decent documentation, can anyone direct me to something useful?
To some extent the answer has to be, "it depends". Your actual data will determine the best solution to this; and there may not be one true solution - you may have to try a few things, and there may well be more manual work than you'd like.
One option is to build a format based on what you see. That format can either convert various values to one consistent value, or convert to a numeric category (which is then overlaid with a format that shows the consistent value).
For example, you might have 'channel' as retail store:
data have;
infile datalines truncover;
input #1 channel $8.;
datalines;
Best Buy
BestBuy
BB
;;;;
run;
So you can do one of two things:
proc format;
value $channel
"Best Buy","BB","BestBuy" = "Best Buy";
quit;
data want;
set have;
channel_coded = put(channel,$channel.);
run;
Or you can do:
proc format;
invalue channeli
"Best Buy", "BB","BestBuy" = 1
;
value channelf
1 = "Best Buy"
;
quit;
data want;
set have;
channel_coded = input(channel,CHANNELI.);
format channel_coded channelf.;
run;
Which you do is largely up to you - the latter gives you more flexibility in the long run, for example when Sears and K-Mart merged, it would be somewhat to take 2 and 16 and format then as Sears, than to change the stored values for the character format - and even easier to roll back if/when KMart splits off again.
This does require some manual work, though; you have to code things by hand here, or develop some method for figuring out what the coding is. You can use the other option in proc format to easily identify new values and add them to the format (which can be derived from a dataset, instead of hand written code), but at the end of the day the actual values you have determine what solution is best for the actual work of determining what is "Best Buy", and a by-hand solution (each time a new value comes in, it is looked at by a person and coded) may ultimately be the best.

Print sas datatset having 100,000 columns in to excel file

I need to print a sas dataset having 100,000 rows * 100,000 columns into excel file.
Proc export or ODS html statements are breaking and hence, are unable to print the same.
Data in file statements are able to print the same. But, due to their logical record length limit, the printing is not proper and essentially my one row is being broken down into 3 rows.
Is there a way out or is this a limitation of SAS in terms of data handling?
Not so much a limitation of SAS, but a limitation of Excel, which can handle up to 16384 columns and up to ~1 million rows, depending on the version. Excel isn't meant to handle datasets of this magnitude; use a proper database.
You certainly cannot get this into excel in any system.
You should be able to get this into another format, like a text file. For example:
data mydata;
array vars[100000];
do _n_=1 to 10;
do _t = 1 to dim(vars);
vars[_t]=_t;
end;
output;
end;
drop _t;
run;
data _null_;
file "c:\temp\myfile.csv" dlm=',' lrecl=2000000;
set mydata;
put _all_;
run;
*put all doesn't really work properly for this, but as I don't know your variable names or setup I cannot really give you a better solution; more than likely you can use a shortcut to define the put statement.;
Maximum LRECL value depends on your operating system, but I'd think most of them could handle a million or two. Certainly Win7 can. You could also use PROC EXPORT to a csv, but you'd have to grab the (300k lines of) code from the log and modify the LRECL to be larger as it defaults to 32767, and I don't think you can modify it in the proc.
SAS/IML would also allow another option. I'm not sure you could really do 100k*100k on any reasonable system (if it's numeric 8 byte matrix elements, you're at 80 billion bytes required to store...)
proc iml;
x=j(1e5,1e5,12345);
filename out ’c:\temp\myfile.csv’;
file out lrecl=800000;
do i=1 to nrow(x);
do j=1 to ncol(x);
put (x[i,j]) 5.0 +5 ',' #;
end;
put;
end;
closefile out;
quit;
Edit: It seems that the lrecl statement in IML doesn't quite behave properly, or else I'm doing something wrong here - but that may be a fault of my system. I get buffer overflows even when the lrecl is clearly long enough.