SAS data step with BY variable on unsorted data - sas

I'm executing a SAS data step with a BY variable. I understand the output when the data is sorted by the key (X in my case). However, when the data is unsorted, I get the following output:
I'm using SAS ODA's AFRICA dataset from MAPS library which has 52824 rows. Here's the link to the CSV file.
data AFRICA_NEW12;
set Maps.AFRICA;
by X;
firstX = FIRST.X;
lastX = LAST.X;
run;
I don't understand how rows are selected when data is not sorted. Why does the output have 14 rows?

You have an error in your log because you didn't sort it. Make sure to read your log.
This likely generates the same issue for you:
data cars;
set sashelp.cars;
by model;
run;
proc print data=cars;
var make model origin;
run;
Output is:
Obs Make Model Origin
1 Acura MDX Asia
2 Acura RSX Type S 2dr Asia
And the log shows:
ERROR: BY variables are not properly sorted on data set SASHELP.CARS.
Make=Acura Model=TSX 4dr Type=Sedan Origin=Asia DriveTrain=Front MSRP=$26,990 Invoice=$24,647 EngineSize=2.4 Cylinders=4
Horsepower=200 MPG_City=22 MPG_Highway=29 Weight=3230 Wheelbase=105 Length=183 FIRST.Model=1 LAST.Model=1 _ERROR_=1 _N_=3
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 4 observations read from the data set SASHELP.CARS.
WARNING: The data set WORK.CARS may be incomplete. When this step was stopped there were 2 observations and 15 variables.
WARNING: Data set WORK.CARS was not replaced because this step was stopped.
Note this portion specifically:
WARNING: The data set WORK.CARS may be incomplete. When this step was stopped there were 2 observations and 15 variables.
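For the data set in the question, the usual remedy is to sort by the BY variable first. A minimal sketch, assuming a sorted copy in WORK is acceptable (africa_sorted is a hypothetical name, not something from the original post):

```sas
/* Sort a copy into WORK so the MAPS library itself is left untouched. */
proc sort data=maps.africa out=africa_sorted;
  by x;
run;

/* With sorted input, FIRST.X and LAST.X mark the start and end of each X group. */
data africa_new12;
  set africa_sorted;
  by x;
  firstX = first.x;
  lastX  = last.x;
run;
```

With sorted input the step should read every observation instead of stopping at the first out-of-order value.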
If you know the data is already grouped in the order you want, which may not be the sorted order SAS expects, you can add the NOTSORTED option to the BY statement. This changes how BY groups are formed (each run of consecutive equal values becomes its own group), so check your code thoroughly.
data cars;
set sashelp.cars;
by model notsorted;
run;
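To see how NOTSORTED changes the FIRST. and LAST. flags, here is a small sketch on made-up data (the data set runs and the variables grp and val are hypothetical):

```sas
/* Toy input: B appears in two separate runs. */
data runs;
  input grp $ val;
  datalines;
B 1
B 2
A 3
B 4
;
run;

/* NOTSORTED: a new BY group starts whenever grp changes from the previous row. */
data flagged;
  set runs;
  by grp notsorted;
  first_grp = first.grp;
  last_grp  = last.grp;
run;

proc print data=flagged;
run;
```

Here the final B row forms its own group (first_grp=1 and last_grp=1) rather than joining the earlier B rows, which is why NOTSORTED is not a drop-in replacement for sorting.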

Related

How to move specific rows from an original table to a new table?

In SAS, I have a table that has 1000 rows. I am trying to split it into two tables: rows 1-500 to table A and rows 501-1000 to table B. What code does this? Thank you all for helping!
I am searching the code online and cannot get anything on google, help is appreciated.
The DATA statement lists the output tables of the step. An OUTPUT statement explicitly sends a row to each of the output tables. An explicit OUTPUT <target> ... <target-k> statement sends records to the specified tables only. The automatic implicit loop index variable _n_ can act as a row counter when a single data set is being read with SET.
Try
data want1 want2;
set have;
if _n_ <= 500 then output want1; else output want2;
run;
However, you may be better served by creating a categorical variable that can be used later in WHERE or BY statements.
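A minimal sketch of the categorical-variable idea, assuming the same input data set have (the variable name half is hypothetical):

```sas
/* Tag each row instead of splitting the table. */
data want;
  set have;
  length half $1;
  if _n_ <= 500 then half = 'A';
  else half = 'B';
run;

/* Later steps can subset with WHERE rather than needing a separate table. */
proc print data=want;
  where half = 'B';
run;
```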
Maybe data set options will help. Try FIRSTOBS= and OBS= to choose the rows you want.
Here is how to use them:
data want1;
set have(obs=500);
run;
data want2;
set have(firstobs=501 obs=1000);
run;

SAS: How to create datasets in loop based on macro variable

I have a macro variable like this:
%let months = 202002 202001 201912 201911 201910;
As one can see, we have 5 months, separated by space ' '.
I would like to create 5 datasets like a_202002, a_202001, a_201912, a_201911, a_201910. How can I run this in a loop and create 5 datasets, instead of writing the data step 5 times?
Pseudo code:
for m in &months.
data a_m;
....
....
run;
How can I do that in SAS? I tried %do_over but that did not help me.
Use the knowledge you gained from @Tom's answer to an earlier question to create the macro
%macro datasets_for_months ...
...
%mend;
Specify the output data sets in the DATA statement:
DATA %datasets_for_months(...);
...
RUN;
Direct rows to specific output data sets by naming the data set, such as
OUTPUT a_202002;
Note:
A step with no OUTPUT statements will implicitly output to each data set
A step with an OUTPUT statement will cause records to be written to either ALL data sets, or only the ones named in the statement:
OUTPUT writes records to all output data sets
OUTPUT data-set-name-1 writes records to only data sets specified
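As one possible shape for this, the sketch below uses a hypothetical macro month_datasets to generate the data set names, then routes rows with explicit OUTPUT statements. The input data set have and its numeric variable month are assumptions, not something given in the question:

```sas
%let months = 202002 202001 201912 201911 201910;

/* Hypothetical helper: emits a_202002 a_202001 ... as plain text. */
%macro month_datasets(months, prefix=a_);
  %local i;
  %do i = 1 %to %sysfunc(countw(&months, %str( )));
    &prefix.%scan(&months, &i, %str( ))
  %end;
%mend month_datasets;

/* The macro call expands to the list of output data sets. */
data %month_datasets(&months);
  set have;                          /* assumed input with numeric month */
  select (month);
    when (202002) output a_202002;
    when (202001) output a_202001;
    when (201912) output a_201912;
    when (201911) output a_201911;
    when (201910) output a_201910;
    otherwise;                       /* rows for other months are dropped */
  end;
run;
```

The WHEN clauses are written out by hand here to keep the sketch short; they could also be generated by a second macro loop.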
The DATA Step documentation covers what you need to know in greater detail
DATA Statement
Begins a DATA step and provides names for any output such as SAS data sets, views, or programs.
...
Syntax
Form 1:
DATA statement for creating output data sets
DATA <data-set-name-1 <(data-set-options-1)>>
... <data-set-name-n <(data-set-options-n)>>
... ;
THE ROAD AHEAD
You will likely discover that month will be better served in a conceptual role as a categorical variable in a single large data set, instead of breaking the data into multiple month-named data sets.
A categorical variable will let you leverage the power of SAS' partitioning and segregating statements such as WHERE, BY and CLASS when pursuing processing, reporting and visualization of your data at different combinations of class level values.
How about this approach? Create the data set names in another macro variable and use a single data step.
%let months = 202002 202001 201912 201911 201910;
data _null_;
ds = prxchange('s/(\d+)/a_$1/', -1, "&months.");
call symputx('ds', ds);
run;
options symbolgen;
data &ds.;
run;
You can use a %DO loop and the %SCAN() function. Use the COUNTW() function to find the upper bound.
%do i=1 %to %sysfunc(countw(&months,%str( )));
%let month=%scan(&months,&i,%str( ));
....
%end;
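The loop above is a fragment; %DO is normally used inside a macro definition. A fuller sketch, assuming a hypothetical macro name make_month_datasets and an input data set have with a numeric variable month:

```sas
%let months = 202002 202001 201912 201911 201910;

%macro make_month_datasets(months);
  %local i month;
  %do i = 1 %to %sysfunc(countw(&months, %str( )));
    %let month = %scan(&months, &i, %str( ));
    /* One DATA step is generated per month in the list. */
    data a_&month;
      set have;               /* assumed input data set */
      where month = &month;   /* assumed numeric month variable */
    run;
  %end;
%mend make_month_datasets;

%make_month_datasets(&months)
```

This reads have once per month, so for large inputs a single DATA step with several output data sets may be cheaper.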

proc gchart stacked bar with attached table annotate months

I'm trying to recreate a graph that looks like this:
It's a stacked bar graph with many types of visits, with the values shown in an attached data table and 2 types of goal lines.
My data looks like this (I wasn't sure how to create sample code):
I transformed the data so it's long:
I'm basing my method on this.
In the example, if I run the first annotate portion (anno_values) using the example data from the thread, everything runs fine. However, using a similar setup but accounting for more groups (Visit1, Visit2, etc.) I keep getting this error message:
NOTE: ERROR DETECTED IN ANNOTATE= DATASET WORK.ANNO_VALUES.
MINIMUM VARIABLES NOT MET - AMBIGUITY PREVENTS SELECTION
NOTE: ERROR LIMIT REACHED IN ANNOTATE PROCESS. PROCESSING IS TERMINATED.
NOTE: PROCESSING TERMINATED BY INDIVIDUAL ERROR COUNT.
NOTE: 1 TOTAL ERRORS.
data anno_values; set long2;
format xc monyy.; informat month monyy.;
xsys='2'; ysys='3'; hsys='3'; when='a';
function='label'; position='5';
xc=month;
if type='Total' then do;
y=15;
text=trim(left(value));
output;
end;
if type='Visit1' then do;
y=7;
text=trim(left(value));
output;
end;
if type='Visit2' then do;
y=0;
text=trim(left(value));
output;
end;
if type='Visit3' then do;
y=-7;
text=trim(left(value));
output;
end;
run;
proc gchart data=long2 anno=anno_values;
vbar month / type=sum sumvar=value discrete
subgroup=type nolegend
raxis=axis1 maxis=axis2
coutline=gray77;
run; quit;
I'm not sure if it's the months that are causing the issue, but I couldn't get further than the first step.
There are macros installed with SAS/Graph that will help you construct a proper annotation data set. The macro is named dclanno, short for "declare annotation variables".
Add these lines to your code:
%annomac /* compiles the SAS/Graph annotation macros */
data myAnno;
/* The dclanno macro, part of the annomac package does code generation
* for defining the annotation variables in the PDV
*/
%dclanno;
dclanno is part of the annomac package found in your installation at SASHOME\SASFoundation\9.4\core\sasmacro.
Here is a link to another example of A stacked vbar chart annotated to display counts of another subgroup

"BY variables are not properly sorted" error although it was sorted already

I am using SAS for a large dataset (>20gb). When I run a DATA step, I received the "BY variables are not properly sorted ......" although I sorted the dataset by the same variables. When I ran the PROC SORT again, SAS even said "Input dataset is already sorted, No sorting done"
My code is:
proc sort data=output.TAQ;
by market ric date miliseconds descending type order;
run;
options nomprint;
data markers (keep=market ric date miliseconds type order);
set output.TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;
And the error message was:
ERROR: BY variables are not properly sorted on data set OUTPUT.TAQ.
RIC=CXR.CCP Date=20160914 Time=13:47:18.125 Type=Quote Price=. Volume=. BidPrice=9.03 BidSize=400
AskPrice=9.04 AskSize=100 Qualifiers= order=116458952 Miliseconds=49638125 exchange=CCP market=1
FIRST.market=0 LAST.market=0 FIRST.RIC=0 LAST.RIC=0 FIRST.Date=0 LAST.Date=1 i=. _ERROR_=1
_N_=43297873
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 43297874 observations read from the data set OUTPUT.TAQ.
WARNING: The data set WORK.MARKERS may be incomplete. When this step was stopped there were
56770826 observations and 6 variables.
WARNING: Data set WORK.MARKERS was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:14.21
cpu time 26.71 seconds
The error is occurring deep into your data step, at _N_=43297873. That suggests to me that the PROC SORT is working up to a point, but then fails. It is hard to know what the reason is without knowing your SAS environment or how OUTPUT.TAQ is stored.
Some people have reported resource problems or file system limitations when sorting large data sets.
From SAS FAQ: Sorting Very Large Datasets with SAS (not an official source):
When sorting in a WORK folder, you must have free storage equal to 4x the size of the data set (or 5x if under Unix)
You may be running out of RAM
You may be able to use options MSGLEVEL=i and FULLSTIMER to get a fuller picture
Also using options sastraceloc=saslog; can produce helpful messages.
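A sketch of turning on the extra diagnostics before re-running the sort. The FORCE option is my addition, not part of the original suggestions: it asks PROC SORT to sort again even when the data set is already flagged as sorted, which may help if that flag is stale.

```sas
/* More detailed notes and resource statistics in the log. */
options msglevel=i fullstimer;

/* FORCE (an assumption for this case) overrides the "already sorted" shortcut. */
proc sort data=output.TAQ force;
  by market ric date miliseconds descending type order;
run;
```

If the forced sort finishes cleanly and the DATA step then runs, the earlier sort indicator was probably not trustworthy.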
Maybe instead of sorting it, you could break it up into a few steps, something like:
/* Get your market ~ ric ~ date pairs */
proc sql;
create table market_ric_date as
select distinct market, ric, date
from output.TAQ
/* Possibly an order by clause here on market, ric, date */
; quit;
data millisecond_stuff;
set market_ric_date;
*Possibly add type/order in this step as well?;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;
run;
/* Possibly a third step here to add type / order if you need to get from original data source */
If your source dataset is in a database, it may be sorted in a different collation.
Try the following before your sort:
options sortpgm=sas;
I had the same error, and the solution was to make a copy of the original table in the WORK library and sort the copy; after that, the BY statement worked.
In your case something like below:
data tmp_TAQ;
set output.TAQ;
run;
proc sort data=tmp_TAQ;
by market ric date miliseconds descending type order;
run;
data markers (keep=market ric date miliseconds type order);
set tmp_TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;

Can I get some default/empty text to display if a PROC REPORT doesn't generate due to no valid data?

I have a SAS program that loops through certain sets of data and generates a bunch of reports to an ODS HTML destination.
Sometimes, due to small sets of data I run these reports for, a certain PROC REPORT will not generate because, for this set of data I'm on, there is no data to report. I get this message for those instances:
WARNING: A GROUP, ORDER, or ACROSS variable is missing on every observation.
What I want in the HTML is to display some sort of message for these like "did not generate" or something.
I tried to use return/error codes or the warning text above to detect this, but the error code is 0 (no problem, really?) and the warning text doesn't reset if the next PROC REPORT generates OK.
If it is of any importance, I'm using a data step with CALL EXECUTE to get all this PROC REPORT code generated for these sets of data.
Is there any way to generate this "did not generate" message or at least to catch these warnings per PROC REPORT?
You can substitute in a value for the missing observations in your report.
First redefine missing values to some character. I think you can only use a single character, I could be wrong, though.
options missing='M';
Then make sure to use the "missing" option in your PROC REPORT.
proc report data=somedata nowd headline missing;
....
run;
EDITS BASED ON COMMENTS
To get comments to show up, I see a few possibilities.
One, scan the data set and check for missing values. If any are present, write out a message.
Data _Null_;
Set dataset;
file print notitles;
if obs = . then do;
put #01 'DID NOT COMPUTE';
stop;
end;
run;
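Two, add a column with a compute:
define xx / computed "(Message)";
compute xx / char length=28;
if obs = . then xx = 'did not compute value in row';
endcomp;
Three, a conditional line using compute:
compute after obs;
/* LINE cannot sit inside IF/THEN, so the text length is made conditional instead */
length msg $15;
if obs = . then do; msg = 'DID NOT COMPUTE'; len = 15; end;
else do; msg = ' '; len = 0; end;
line msg $varying15. len;
endcomp;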
See: http://www2.sas.com/proceedings/sugi26/p095-26.pdf
Look for the MTANYOBS macro and the section on printing a 'no observations' page.
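A sketch along the same lines: check the data before running PROC REPORT and print a message instead when there is nothing to report. The macro name report_or_message and its parameters are hypothetical, and the PROC REPORT body is a placeholder for the real report code:

```sas
%macro report_or_message(ds, groupvar);
  %local n_nonmissing;

  /* COUNT(column) counts non-missing values only; it is 0 for an empty table too. */
  proc sql noprint;
    select count(&groupvar) into :n_nonmissing trimmed
    from &ds;
  quit;

  %if &n_nonmissing > 0 %then %do;
    proc report data=&ds nowd;
      column &groupvar;
      define &groupvar / group;
    run;
  %end;
  %else %do;
    /* FILE PRINT sends the text to the open ODS destinations. */
    data _null_;
      file print notitles;
      put "Report for &ds did not generate: &groupvar is missing on every observation.";
    run;
  %end;
%mend report_or_message;
```

If the macro is invoked through CALL EXECUTE, wrapping the call in %NRSTR (for example call execute('%nrstr(%report_or_message(work.mydata, region))'), with hypothetical arguments) should delay macro execution until after the generating DATA step, so the %IF sees the count from PROC SQL.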