I was using the following code to analyze data:
set taq.cq_&yyyymmdd:;
by symbol date time NOTSORTED ex;
There are are thousands of datasets I am running the code on in the unit of days. When &yyyymmdd only specifies one dataset (for one day. for example, 20130102), it works. However, when I try to run it for multiple datasets (for example, 201301:), SAS returns the following errors:
BY NOTSORTED/NOBYSORTED cannot be used with SET statement when
more than one data set is specified.
If I cannot use NOTSORTED here, what is an equivalent statement that I could use?
My understanding of the keyword NOTSORTED is that you use it when the data is not sorted yet. Therefore, do I need to sort it first? How to do it?
I am also confused by the number of variables that NOTSORTED is referencing. Does it only have an effect on "time", or it has effect on "symbol, data, time"?
Many thanks!
UPDATE#2:
The rest of the process immediately following the set statement is: (pseudo code as i don't have the permission to post the original code)
Data _quotes;
SET STATEMENT HERE
Change the name of a variable in the dataset (Variable name is EXN).
last.EXN in a if statement. If the condition is satisfied, label EXN.
Drop some variables.
Run;
DATA NEWDATASET (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
SET _quotes;
by symbol date time;
....
Run;
NOTSORTED means that SAS can assume the sort order in the data is correct, so it may not have explicitly gone through a PROC SORT but it is in logical order as listed in the BY statement.
All variables in the BY statement are included in the NOTSORTED option. Given that I suspect you fully don't understand BY group processing.
It's usually a bit dangerous to use, especially if you don't understand BY group processing. If your data is in the same group but not adjacent it won't work properly and will not produce an error. The correct workaround depends on your processes to be honest.
I would suggest reviewing the documentation regarding BY group processing. It's quite in depth and has lots of samples to illustrate the different type of calculations.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n138da4gme3zb7n1nifpfhqv7clq.htm
NOTSORTED is often used in example posts to either avoid a sort or when using a custom sort that's difficult to implement in other ways. Explicitly sorting will remove this issue but you may also be misunderstanding how SAS processes data when you have a SET statement with a BY statement. I believe this is called interleaving.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n1tgk0uanvisvon1r26lc036k0w7.htm
I suspect that the NOTSORTED keyword is being using to find groups for observations with the same value for the EX variable within the same symbol,date,time. If you only need to find the FIRST then you can use the LAG() function to calculate the FIRST.EX flag.
data want;
set taq.cq_&yyyymmdd:;
by symbol date time;
first_ex = first.time or ex ne lag(ex);
Otherwise then perhaps you want to convert the process to data step views and then set the views together.
data work.view_cq_20130102 / view=work.view_cq_20130102;
set taq.cq_20130102;
by symbol date time ex NOTSORTED;
...
run;
...
data want ;
set work.view_cq_201301: ;
by symbol date time;
...
Related
I am learning SAS at the moment and I wanted to know how do you join two tables without using any SQL where I need to get only the common values in two tables.
Both tables have a common unique id. Also the tables don't have common variables.
Please don't give any documentation links as I already have and I know merge. I am trying it with an IN operator.
Table 1 : Screenshot
Table 2 : Screenshot
Description: The first table has 157 records and the other has 161 records.
I tried searching a solution but didn't get any. Please refer a solution.
Thanks !
In DATA Step you will want to use the MERGE statement and the IN= option which sets up flags indicating 'contribution' to the current state of the program data vector (PDV)
data want;
merge
have1 (in=_from1)
have2 (in=_from2)
;
by uniqueid; * variable of same name, type and length should be in have1 and have2;
if _from1 and _from2; * subsetting if;
run;
DATA Step is an implicit loop. The MERGE automatically advances reads through the contributing data, synchronizing about the BY variables.
When a DATA Step has no explicit OUTPUT statement, there will be an implicit OUPUT of the values in the PDV when control reaches the bottom of the step. Thus, the if without a then is called subsetting because control only goes past the if (and reaches the bottom for implicit output) when both flags are true (or when data is coming from both tables at a common key value)
I am reading some code that involves a sortedby option which confuses me a little bit.
So in the first part of the code, the following code was used to readin the data:
data _quotes / view=_quotes;
set taq.&yyyymmdd:;
by symbol date time NOTSORTED ex; length EXN 3.;
run;
Note that there is a NOTSORTED option here. when I remove it, SAS would return an error ERROR: BY variables are not properly sorted on data set.
Based on my understanding of how SAS NOTSORTED works, the taq dataset is not properly sorted, but in proper groups.
However, in the following code almost immediately after the previous code (no sort code is involved in between), there in no more NOTSORTED option, but there is no error:
data &outset (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
set _quotes;
by symbol date time;
run;
Therefore, I was wondering if it is because of the sortedby statement that made the difference? I read the SAS documentation, it seems that sortedby will NOT sort the dataset but only specify how the data is currently sorted.
But why did the by statement without the NOTSORTED option work in the second code but not the first code?
Note that the variable ex is not on the BY statement on the second step. If the source data is sorted by symbol date time but is not sorted by ex, what you are observing makes sense.
It's unusual to put the notsorted option in the middle of a list of variables. No matter where it is placed, it applies to the entire list. In this case, it's possible the author was intending to suggest to the reader which variable was not sorted. I find this style confusing.
To check, add ex to the BY statement on the second step and see if it throws an error.
So I have created a macro, which works perfectly fine. Within the macro, I set where the observation will begin reading, and then how many observations it will read.
But, in my proc print call, I am not able to simply do:
(firstobs=&start obs=&obs)
Because although firstobs will correctly start where I want it, obs does not cooperate, as it must be a higher number than firstobs. For example,
%testmacro(start=5, obs=3)
Does not work, because it is reading in the first 3 observations, but trying to start at observation 5. What I want the macro to do is, start at observation 5, and then read the next 3. So what I did is this:
(firstobs=&start obs=%eval((&obs-1)+&start))
This works perfectly fine when I use it. But I am just wondering if there is a simpler way to do this, rather than having to use the whole %eval... call. Is there one simple call, something like numberofobservations=...?
I don't think there is. You can only simplify your macro a little, within the %eval(). .
%let start=5;
%let obs=3;
data want;
set sashelp.class (firstobs=&start obs=%eval(&obs-1+&start));
run;
Data set options are listed here:
http://support.sas.com/documentation/cdl/en/ledsoptsref/68025/HTML/default/viewer.htm#p0h5nwbig8mobbn1u0dwtdo0c0a0.htm
You could count the obs inside the data step using a counter and only outputting the records desired, but that won't work on something like proc print and isn't efficient for larger data steps.
You could try the point= option, but I'm not familiar with that method, and again I don't think it will work with proc print.
As #Reeza said - there is not a dataset option that will do what you are looking for. You need to calculate the ending observation unfortunately, and %eval() is about as good a way to do it as any.
On a side-note, I would recommend making your macro parameter more flexible. Rather than this:
%testmacro(start=5, obs=3)
Change it to take a single parameter which will be the list of data-set options to apply:
%macro testmacro(iDsOptions);
data want;
set sashelp.class (&iDsOptions);
run;
%mend;
%testmacro(firstobs=3 obs=7);
This provides more flexibility if you need to add in additional options later, which means fewer future code changes, and it's simpler to call the macro. You also defer figuring out the observation counts in this case to the calling program which is a good thing.
I've got the following question.
I'm trying to run a partial least square forecast on a data model I have. Issue is that I need to block certain line in order to have the forecast for a specific time.
What I want would be the following. For June, every line before May 2014 will be blocked (see the screenshot below).
For May , every line before April 2014 will be blocked (see the screenshot below).
I was thinking of using a delete through a proc sql to do so but this solution seems to be very brutal and I wish to keep my table intact.
Question : Is there a way to block the line for a specific date with needing a deletion?
Many thanks for any insight you can give me as I've never done that before and don't know if there is a way to do that (I did not find anything on the net).
Edit : The aim of the blocking will be to use the missing values and to run the forecast on this missing month namely here in June 2014 and in May 2014 for the second example
I'm not sure what proc you are planning to use, but you should be able to do something like the below.
It builds a control data set, based on a distinct set of dates, including a filter value and building a text data set name. This data set is then called from a data null step.
Call execute is a ridiculously powerful function for this sort of looping behaviour that requires you to build strings that it will than pass as if they were code. Note that column names in the control set are "outside" the string and concatenated with it using ||. The alternative is probably using quite a lot of macro.
proc sql;
create table control_dates as
select distinct
nuov_date,
put(nuov_date,mon3.)||'_results' as out_name
from [csv_import];
quit;
data _null_;
set control_dates;
call execute(
'data '||out_name||';
set control_dates
(where=(nuov_date<'||nouv_date||'));
run;');
call execute('proc [analysis proc] data='||out_name||';run;');
run;
This might be a weird question. I have a data set contains data like agree, neutral, disagree...for many questions. There is not so many observations so for some question, one or more options has frequency of 0, say neutral. When I run proc freq, since neutral shows up in that variable, the table does not contain a row for neutral. I end up with tables with different number of rows. I would like to know if there is a option to show these 0 frequency rows. I will also need to run proc gchart for the same data set, and I will run into the same problem for having different number of bars. Please help me on this. Thank you!
This depends on how exactly you are running your PROC FREQ. It has the sparse option, which tells it to create a value for every logical cell on the table when creating an output dataset; normally, while you would have a cell with a missing value (or zero) in a crosstab, if that is output to a dataset (which is vertical, ie each combination of x and y axis value are placed in one row) those rows are left off. Sparse makes sure that doesn't happen; and in a larger (n-dimensional) crosstab, it creates rows for every possible combination of every variable, even ones that don't occur in the data.
However, if you're just doing
proc freq data=mydata;
tables myvar;
run;
That won't help you, as SAS doesn't really have anything to go on to figure out what should be there.
For that, you have to use a class variable procedure. Proc Tabulate is one of such procedures, and is similar to Proc Freq in its syntax (sort of). You need to either use CLASSDATA on the proc statement, or PRINTMISS on the table statement. In the former case, you do not need to use a format, I don't believe. In the latter case (PRINTMISS), you need to create a format for your variable (if you don't already have one) that contains all levels of the data that you want to display (even if it's just an identity format, e.g. formatting character strings to identical character strings), and specify PRELOADFMT on the proc statement. See this man page for more details.