editing categorical data for uniformity - sas

i have 1 million + rows of data and on of the columns is channel_name. The people collecting the data didn't seem to care that they entered one channel in about 10 different variations, lots of which contain the # symbol. Google search isn't giving me any decent documentation, can anyone direct me to something useful?

To some extent the answer has to be, "it depends". Your actual data will determine the best solution to this; and there may not be one true solution - you may have to try a few things, and there may well be more manual work than you'd like.
One option is to build a format based on what you see. That format can either convert various values to one consistent value, or convert to a numeric category (which is then overlaid with a format that shows the consistent value).
For example, you might have 'channel' as retail store:
data have;
infile datalines truncover;
input #1 channel $8.;
datalines;
Best Buy
BestBuy
BB
;;;;
run;
So you can do one of two things:
proc format;
value $channel
"Best Buy","BB","BestBuy" = "Best Buy";
quit;
data want;
set have;
channel_coded = put(channel,$channel.);
run;
Or you can do:
proc format;
invalue channeli
"Best Buy", "BB","BestBuy" = 1
;
value channelf
1 = "Best Buy"
;
quit;
data want;
set have;
channel_coded = input(channel,CHANNELI.);
format channel_coded channelf.;
run;
Which you do is largely up to you - the latter gives you more flexibility in the long run, for example when Sears and K-Mart merged, it would be somewhat to take 2 and 16 and format then as Sears, than to change the stored values for the character format - and even easier to roll back if/when KMart splits off again.
This does require some manual work, though; you have to code things by hand here, or develop some method for figuring out what the coding is. You can use the other option in proc format to easily identify new values and add them to the format (which can be derived from a dataset, instead of hand written code), but at the end of the day the actual values you have determine what solution is best for the actual work of determining what is "Best Buy", and a by-hand solution (each time a new value comes in, it is looked at by a person and coded) may ultimately be the best.

Related

SAS: Adding aggregated data to same dataset

I'm migrating from SPSS to SAS.
I need to compute the sum of variable varX, separately by groups of variables varA varB, and add it as a new variable sumX to the same dataset.
In SPSS this is implemented easily with aggregate:
aggregate outfile *
/break varA varB
/SUMvarX = sum(varX).
can this be done in SAS?
There are a number of ways to do this, but the best way depends on your data.
For a typical use case, the PROC MEANS solution is what I'd recommend. It's not the fastest, but it gets the job done, and it has a lot lower opportunity of error - you're not really doing anything except match-merging afterwards.
Use the class statement instead of by in most cases; it shouldn't make much of a difference, but it's the purpose of class. by runs the analysis separately for each value of those variables; class runs one analysis grouping by all of those variables. It is more flexible and doesn't require a sorted dataset (though you would have to sort anyway for the later merge). class also lets you do multiple combinations - not just the nway combination you ask for here, but if you want it grouped just by a, just by b, and by a*b, you can get that (with class and types).
proc means data=have;
class a b;
var x;
output out=summary sum(x)=;
run;
data want;
merge have summary;
by a b;
run;
The DoW loop covered in Kermit's answer is a reasonable data step option also, though more risky in terms of programmer error; I'd use it only in particular cases where the dataset is very very large - more than fits in memory in summary size large - and performance was important.
If the data fits in memory, you can also use a hash table to do the summary, and that's what I'd do if the summary dataset fit comfortably in memory. This is too long for an answer here, but Data Aggregation using Hash Object is a good start for how to do that. Basically, you use a hash table to store the results of the summary (not the raw data), adding to it with each row, and then output the hash table at the end. A bit faster than the DoW loop, but slightly memory constrained (although if you used SPSS, you're much more memory constrained than this!). Also very easy to handle multiple combinations.
Another "programmer easy" way to do it is with SQL.
proc sql;
create want as
select *, sum(x) as sum_x
from have
group by a,b
;
quit;
This is not standard SQL, but SAS manages it - basically it does the two step process of the proc means and the merge, in one step. I like this in some ways (because it skips the intermediate dataset, even if it does actually make this dataset in the util folder, just cleans up for you automatically) and dislike it in others (it's not standard SQL so it will confuse people, and it leaves a note in the log - only a note, so not a big deal, but still).
Adding a note about SPSS -> SAS thinking. One of the bigger differences you'll see going from SPSS to SAS is that, in SPSS, you have one dataset, and you do stuff to it (mostly). You could save it as a different dataset, but you mostly don't until the end - all of your work really is just editing one dataset, in memory.
In SAS, you read datasets from disk and do stuff and then write them out, and if you're doing anything that is at the dataset level (like a summary), you mostly will do it separately and then recombine with the data in a later step. As such, it's very, very common to have lots of datasets - a program I just ran probably has a thousand. Not kidding! Don't worry about random temporary datasets being produced - it doesn't mean your code is not efficient. It's just how SAS works. There are times where you do have to be careful about it - like you have 150GB datasets or something - but if you're working with 5000 rows with 150 variables, your dataset is so small you could write it a thousand times without noticing a meaningful difference to your code execution time.
The big benefit to this style is that you have different datasets for each step, so if you go back and want to rerun part of your code, you can safely - knowing the predecessor dataset still exists, without having to rerun all of your code. It also lets you debug really easily since you can see each of the component parts.
It's a tradeoff for sure, because it does mean it takes a little longer to run the code, but in the modern day CPUs are really really fast, and so are SSDs - it's just not necessary to write code that stays all in one data step or runs entirely in memory. The tradeoff is that you get the ability to do crazy large amounts of things that couldn't possibly fit in memory, work with massive datasets, etc. - only constrained by disk, which is usually in far greater supply. It's a tradeoff worth making in many cases. When it's possible to do something in a PROC, do so, even when that means it costs a tiny bit of time at the end to re-merge it - the PROCs are what you're paying SAS the big bucks for, they are easy to use, well tested, and fast at what they do.
OK, I think I found a way of doing that.
First, you produce the summing varables:
proc means data= <dataset> noprint nway;
by varA varB;
var varX;
output out=<TEMPdataset> sum = SUMvarX;
run;
then you merge the two datasets:
DATA <dataset>;
MERGE <TEMPdataset> <dataset>;
BY varA varB;
run;
This seems to work, although an extra dataset and several extra variables are formed in the process.
There are probably more efficient ways of doing it...
Ever heard of DoW Loop?
*-- Create synthetic data --*
data have;
varA=2; varB=4; varX=21; output;
varA=4; varB=6; varX=32; output;
varA=5; varB=8; varX=83; output;
varA=4; varB=3; varX=78; output;
varA=4; varB=8; varX=72; output;
varA=2; varB=4; varX=72; output;
run;
proc sort data=have; by varA varB; quit;
varA varB varX
2 4 21
2 4 72
4 3 78
4 6 32
4 8 72
5 8 83
data stage1;
set have;
by varA varB;
if first.varB then group_number+1;
run;
data want;
do _n_=1 by 1 until (last.group_number);
set stage1;
by group_number;
SUMvarX=sum(SUMvarX, varX);
end;
do until (last.group_number);
set stage1;
by group_number;
output;
end;
drop group_number;
run;
varA varB varX SUMvarX
2 4 21 93
2 4 72 93
4 3 78 78
4 6 32 32
4 8 72 72
5 8 83 83

Missing values in a FREQ (SAS)

I'm going to ask this with an example...
Suppose i have a data set where each observation represents a person. Two of the variables are AGE and HASADOG (and say this has values 1 for yes and 2 for no.) Is there a way to run a PROC FREQ (by AGE*HASADOG) that forces SAS to include in the report a line for instances where the count is zero?
By this I mean: if there is a particular value for AGE such that no observation with this AGE value has a 1 in the HASADOG variable, the report will still include a row for this combination (with a row percent of 0.)
Is this possible?
The SPARSE option in PROC FREQ is likely all you need.
proc freq data=sashelp.class;
table sex*age / sparse list;
run;
If the value is nowhere in your data set at all, then there's no way for SAS to know it exists. In this case you'd need a more complex solution, basically a way to tell SAS all values you would be using ahead of time. This can be done via a PRELOADFMT or CLASSDATA option on several procs. There are asked an answered questions on this topic here on SO, so I won't provide a solution for this option, which seems beyond the scope of your question.

Unknown Errors with Proc Transpose

Trying to utilize proc transpose to a dataset of the form:
ID_Variable Target_Variable String_Variable_1 ... String_Variable_100
1 0 The End
2 0 Don't Stop
to the form:
ID_Variable Target_Variable String_Variable
1 0 The
. . .
. . .
1 0 End
2 0 Don't
. . .
. . .
2 0 Stop
However, when I run the code:
proc transpose data=input_data out=output_data;
by ID_Variable Target_Variable;
var String_Variable_1-String_Variable_100;
run;
The change in file size from input to output balloons from 33.6GB to over 14TB, and instead of the output described above we have that output with many additional completely null string variables (41 of them). There are no other columns on the input dataset so I'm unsure why the resulting output occurs. I already have a work around using macros to create my own proxy transposing procedure, but any information around why the situation above is being encountered would be extremely appreciated.
In addition to the suggestion of compression (which is nearly always a good one when dealing with even medium sized datasets!), I'll make a suggestion for a simple solution without PROC TRANSPOSE, and hazard a few guesses as to what's going on.
First off, wide-to-narrow transpose is usually just as easy in a data step, and sometimes can be faster (not always). You don't need a macro to do it, unless you really like typing ampersands and percent signs, in which case feel free.
data want;
set have;
array transvars string_Variable_1-string_Variable_100;
do _t = 1 to dim(transvars);
string_variable = transvars[_t];
if not missing(String_variable) then output; *unless you want the missing ones;
end;
keep id_variable target_variable string_Variable;
run;
Nice short code, and if you want you can throw in a call to vname to get the name of the transposed variable (or not). PROC TRANSPOSE is shorter, but this is short enough that I often just use it instead.
Second, my guess. 41 extra string variables tells me that you very likely have some duplicates by your BY group. If PROC TRANSPOSE sees duplicates, it will create that many columns. For EVERY ROW, since that's how columns work. It will look like they're empty, and who knows, maybe they are empty - but SAS still transposes empty things if it sees them.
To verify this, run a PROC SORT NODUPKEY before the transpose. If that doesn't delete at least 40 rows (maybe blank rows - if this data originated from excel or something I wouldn't be shocked to learn you had 41 blank rows at the end) I'll be surprised. If it doesn't fix it, and you don't like the datastep solution, then you'll need to provide a reproducible example (ie, provide some data that has a similar expansion of variables).
Without seeing a working example, it's hard to say exactly what's going on here with regards to the extra variables generated by proc transpose.
However, I can see three things that might be contributing towards the increased file size after transposing:
If you have option compress = no; set, proc transpose creates an uncompressed dataset by default. Also, if some of your character variables are different lengths, they will all be transposed into one variable with the longest length of any of them, further increasing the file size if compression is disabled in the output dataset.
I suspect that some of the increase in file size may be coming from the automatic _NAME_ column generated by proc transpose, which contains an extra ~100 * max_var_name_length bytes for every ID-target combination in the input dataset.
If you are using option compress = BINARY; (i.e. compressing all output datasets that way by default), the SAS compression algorithm may be less effective after transposing. This is because SAS only compresses one record at a time, and this type of compression is much less effective with shorter records. There isn't much you can do about this, unfortunately.
Here's an example of how you can avoid both of these potential issues.
/*Start with a compressed dataset*/
data have(compress = binary);
length String_variable_1 $ 10 String_variable_2 $20; /*These are transposed into 1 var with length 20*/
input ID_Variable Target_Variable String_Variable_1 $ String_Variable_2 $;
cards;
1 0 The End
2 0 Don't Stop
;
run;
/*By default, proc transpose creates an uncompressed output dataset*/
proc transpose data = have out = want_default prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;
/*Transposing with compression enabled and without the _NAME_ column*/
proc transpose data = have out = want(drop = _NAME_ compress = binary) prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;

Creating a template dataset from multiple datasets with different columns

Currently, I have several sets of business unit data that I'd like to put into a standard template format. Some business unit data contains columns that others don't. I would like to check if certain columns exist and then to create them if they don't. I understand that techniques to achieve similar functionality have been discussed earlier, here and here. However, I was wondering if a better method exists.
My current code is:
data Source_Data4;
set Interm.Source_Data3;
if 0 then do;
a="";
b="";
end;
run;
Using the RETAIN statement should be the fastest and easiest way to do this. If the field you are checking for is numeric then put a . instead of "".
data Source_Data4;
set Interm.Source_Data3;
retain a b "";
run;
If you have multiple datasets with different columns that you want to use a template for, an excellent way to do this is something like this:
data want;
if 0 then set template;
set have2;
run;
This is far easier to code than a bunch of retain/length statements. It accomplishes the identical results as the retain solution (it defines the PDV), with one exception; it will define lengths and formats of variables based on template (while retain does not affect length or format). This may be desirable or may not be, depending on your use case. It is very helpful when combining multiple datasets, as it provides a single point at which length/format differences can be tested for; once this step occurs, you can be confident that your various datasets are all identical in variable length/format.
Creating this dataset can be done a number of ways. One simple way is:
data template;
if 0 then set have;
if 0 then set have2;
stop;
run;
That will create a blank dataset with have1 order followed by any new variables from have2. If that's not desired, you may want to add a RETAIN statement prior to the if 0's that draws from a data dictionary.

Interleaving output from two different procedures with a by value

I have a large SAS dataset and I would like to make a series of tables and charts using by value processing. I am outputing these to a PDF.
Is there any way to get SAS to alternate between the table and the chart as it goes through the data? Right now, I have to print all of the tables first and then print the charts. If it were just 4 tables/charts, then I would be ok writing
Here is a simple example:
data sample;
input byval $ item $ amount;
datalines;
A X 15
A Y 16
A Z 12
B X 25
B Y 10
B Z 18
;
run;
symbol1 i=j;
proc print data=sample;
by byval;
var item amount;
run;
proc gplot uniform data=sample;
by byval;
plot amount*item;
run;
This prints 2 tables, followed by 2 charts.
I would like the Chart for "A" to come after the table for "A" so that the reader can flip through the pdf and always see the associated charts and tables together.
I could write separate procs for each one, but then the gplot won't have a uniform axis (and it gets messy if I have 100 different groups instead of 2).
I thought about pumping them into greplay but then you can't use titles with "#BYVAL1".
Is there any easy way to do this?
I've never used it, but it may be worth checking out ODS DOCUMENT. This allows you to store the output of all your procedures and then reference specific items from them using PROC DOCUMENT.
Below is a link to the SAS website with useful information about this, in particular the paper by Cynthia Zender for the SAS Global Forum 2009.
http://support.sas.com/rnd/base/ods/odsdocument/index.html
Cynthia also regularly contributes to the SAS Support Communities website (https://communities.sas.com/community/support-communities), so it may be worth asking on there if you are still stuck.
Good luck
I don't know of any way to do what you ask directly. GREPLAY is probably the closest you'll come; the primary problem is that SAS processes the PROCs linearly, first processing the entire PROC PRINT, then the entire PROC GPLOT. GREPLAY would allow you to redisplay the output, but if that doesn't work for your needs due to the #BYVAL issue, I'm not sure there's a better solution. Perhaps you can modify the title afterwards (not sure if GREPLAY allows this)?
You could try using ODS LAYOUT, but I don't think that would be any better. The one way it could be better is if you can work out having two columns on a 'page', one column being the PROC PRINT outputs, one the PROC GPLOT, and then print the columns one page than the other. I don't think this is possible, but it might be worth exploring.
You might also try setting up a macro to do each BYVAL separately, defining the axis in a uniform manner manually (ie, defining it based on your own calculation of the correct axis parameters, as an argument to the macro). That is probably the easiest solution that might still allow #BYVAL to work properly.
You might also try browsing about Richard DeVenezia's site (http://www.devenezia.com/downloads/sas/samples/ ) which has a lot of examples of SAS/GRAPH solutions. He also posts on SAS-L (sasl#listserv.uga.edu) sometimes, not sure if I've seen him on StackOverflow. He's probably the person most likely to be able to answer the question that I know of.