Hello, I have this data
and I need to perform these edit checks:
"If answer to “Did the subject consume the entire high fat/high calorie
breakfast?” is YES, then “If no, did the subject consume at least 50%
of the high fat/high calorie breakfast?” fasting question must not be
present"
"If answer to “Did the subject consume the entire high fat/high calorie
breakfast?” is NO, then “If no, did the subject consume at least 50%
of the high fat/high calorie breakfast?” fasting question must be
present"
"If answer to “Did the subject consume the entire high fat/high calorie
breakfast?” is NO, then answer to “If no, did the subject consume at
least 50% of the high fat/high calorie breakfast?” must be present"
I have written the following code
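data fq;
set dm.fq;
/* strip the hyphens from clientid; +0 forces a character-to-numeric conversion */
ptno=strip(compress(clientid,'-'))+0;
run;
proc sort data=fq;
by ptno period day hour;
run;
/* one row per ptno/period/day/hour, one column per question */
proc transpose data=fq out=tempfq (DROP=_NAME_ _LABEL_);
by ptno period day hour;
var fq_yn;
id fq_qst;
run;
data final;
set tempfq;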
However, I get the following error in the log
ERROR: The ID value "DID_THE_SUBJECT_FAST_AT_LEAST_4" occurs twice in the same BY group.
And the headers come out capitalised and truncated
How do I deal with the error in the log and how can I expand the column width to produce the whole question when transposing?
You will need to do two things:
Determine what you want for repeated questions, that is,
for the case where an ID value occurs twice in the same BY group:
do you want 2 columns, or
one column from the first occurrence, or
one column from the second occurrence, or
one column with YES if both YES & NO occur, or
one column with NO if both YES & NO occur, or
one column with MULTI if both YES & NO occur, or
one column with YES, NO if they occur in that order per period, day, hour, or
one column with NO, YES if they occur in that order per period, day, hour, or
one column with NO, YES regardless of order, or
one column with YES, NO regardless of order, or
one column with a (freq) suffix such as YES(2) or NO(2)?
The same ideas apply to BY groups with 3 or more replicates.
Each idea will require some preprocessing of the data before TRANSPOSE.
Use IDLABEL to have the whole original question in the column header when output.
Pivoting survey data may be useful for modeling and forecasting purposes. However, if it is done for reporting purposes, you might be better off using Proc TABULATE or Proc REPORT.
Possible 'Fix'
When an ID value occurs more than once in a BY group, there is more than one VAR value going into a single pivot destination. Hence the ERROR.
For the case of a question repeated with the same answer within the group ptno period day hour, you can sort by one more key, FQ_QST, and specify the NODUPKEY option. After such sorting no duplicates of FQ_QST occur within a group, so only a single FQ_YN value is pivoted into each column via ID.
proc sort NODUPKEY;
by ptno period day hour FQ_QST;
run;
If you have data with repeated questions within group and the questions have different answers, the answer remaining per NODUPKEY is dependent on how SORT is being run. From help:
When the SORT procedure’s input is a Base SAS engine data set and the sorting is done by SAS, then the order of observations within an output BY group is predictable. The order of the observations within the group is the same as the order in which they were written to the data set when it was created. Because the Base SAS engine maintains observations in the order that they were written to the data set, they are read by PROC SORT in the same order. While processing, PROC SORT maintains the order of the observations because it uses a stable sorting algorithm. The stable sorting algorithm is used because the EQUALS option is set by default. Therefore, the observation that is selected by PROC SORT to be written to the output data set for a given BY group is the first observation in the data set having the BY variable values that define the group.
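Putting the two suggestions together, a minimal sketch might look like the following. It assumes the data set and variable names from the question (fq, ptno, period, day, hour, fq_qst, fq_yn) and that keeping the first occurrence of a repeated question is acceptable; fq_dedup is a hypothetical name for the de-duplicated copy.

```sas
/* Keep one row per question within each ptno/period/day/hour group. */
/* With NODUPKEY the first occurrence in input order is retained.    */
proc sort data=fq out=fq_dedup nodupkey;
  by ptno period day hour fq_qst;
run;

/* ID builds the (upper-cased, truncated) column names;             */
/* IDLABEL stores the full question text as each column's label.    */
proc transpose data=fq_dedup out=tempfq (drop=_name_ _label_);
  by ptno period day hour;
  var fq_yn;
  id fq_qst;
  idlabel fq_qst;
run;

/* The LABEL option prints the labels instead of the column names. */
proc print data=tempfq label;
run;
```

The column names themselves stay within the SAS variable-name rules; the full question is only visible where labels are displayed.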
I'm interested in seeing how sedentary behaviors change throughout time (Time 1, 2, 3) and see in a second step how it relates to mental health.
Thus, I would like to obtain an estimate (slope/intercept) for each subject to allow me to do the 2nd step. I can't find online how to do it (not sure what to search for).
Here's my code so far, which gives me 2 estimates (boys and girls); I would rather have an estimate for every participant.
ods output LSMeans=Means1;
proc mixed data=sb.LFcomplete method=ml covtest;
class SexeF time;
model CompDay = Time SexeF Time*SexeF;
repeated time;
lsmeans time*sexeF;
run;
Thank you in advance!
Please check this website for a similar example:
https://www.stat.ncsu.edu/people/davidian/courses/st732/examples/ex10_1.sas
The professor was using HLM for longitudinal data analysis. He used gender like your SexeF, age like your Time, and child as ID. The tricky part is when he was organizing the random effects file: he sorted by ID and created Gender (Group, or your SexeF) for subsequent merging with the fixed effects file. If your current ID variable is not aligned with your SexeF, you may sort by SexeF and create a new ID variable before you import your data into SAS.
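As a rough sketch of that approach applied to the code in the question: a random intercept and slope per subject can be requested with a RANDOM statement, and the estimates captured with ODS OUTPUT. This assumes the data set contains a subject identifier, here given the hypothetical name id, and that Time is treated as a continuous covariate (so it is left out of the CLASS statement).

```sas
proc mixed data=sb.LFcomplete method=ml covtest;
  class SexeF id;                        /* id is an assumed subject identifier */
  model CompDay = Time SexeF Time*SexeF / solution;
  /* random intercept and slope for each subject; SOLUTION prints them */
  random intercept Time / subject=id type=un solution;
  /* SolutionF = fixed-effect estimates, SolutionR = per-subject deviations */
  ods output SolutionF=fixed_est SolutionR=random_est;
run;
```

The values in random_est are deviations from the fixed effects, so a subject's own intercept or slope is the matching fixed-effect estimate for that subject's sex plus the subject's random effect. That is the merging step the linked example performs.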
I'm going to ask this with an example...
Suppose I have a data set where each observation represents a person. Two of the variables are AGE and HASADOG (say this has values 1 for yes and 2 for no). Is there a way to run a PROC FREQ (by AGE*HASADOG) that forces SAS to include in the report a line for instances where the count is zero?
By this I mean: if there is a particular value for AGE such that no observation with this AGE value has a 1 in the HASADOG variable, the report will still include a row for this combination (with a row percent of 0.)
Is this possible?
The SPARSE option in PROC FREQ is likely all you need.
proc freq data=sashelp.class;
table sex*age / sparse list;
run;
If the value is nowhere in your data set at all, then there's no way for SAS to know it exists. In this case you'd need a more complex solution, basically a way to tell SAS ahead of time all the values you would be using. This can be done via the PRELOADFMT or CLASSDATA options, which several procs support. There are asked and answered questions on this topic here on SO, so I won't provide a solution for this option, which seems beyond the scope of your question.
I just started learning SAS and would like some help understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat ='S' then yearly = wagerate *12;
else if wagecat = 'H' then yearly = wagerate *2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand the data step creates 2 temporary variables for each BY variable (first.variable/last.variable) and that their values are either 1 or 0, but what exactly do first.dept and last.dept do here in the code?
Why do we need payroll=0 after first.dept in the second to the last line?
This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set which is sorted and stored in the work library. RTM.
From the docs:
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION: Use care when you use PROC SORT without OUT=. Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default: Without OUT=, PROC SORT overwrites the original data set.
Tips: With in-database sorts, the output data set cannot refer to the input table on the DBMS. You can use data set options with OUT=.
See: SAS Data Set Options: Reference
Example: Sorting by the Values of Multiple Variables
First.DEPT is an indicator variable that flags the first observation of a specific BY group, so the first record for a department is identified when you encounter it. Last.DEPT flags the last record for that specific department. It means the next record would be the first record for a different department.
It sets PAYROLL to 0 at the first record of each department. Since you have if last.dept; only the last record for each department is output. This code is not intuitive: it's a manual way to sum the wages for people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY, but I assume they were trying to avoid having two passes of the data. Though if you're not sorting, it may be just as fast anyway.
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
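To see the flags and the running total in action, here is a small self-contained sketch. The data set work.demo and its three rows are made up for illustration; only dept, yearly and payroll mirror names from the question.

```sas
/* Made-up example data, already sorted by dept */
data work.demo;
  input dept $ yearly;
  datalines;
A 100
A 200
B 50
;
run;

data work.flags;
  set work.demo;
  by dept;
  is_first = first.dept;            /* 1 on the first row of each dept    */
  is_last  = last.dept;             /* 1 on the last row of each dept     */
  if first.dept then payroll = 0;   /* restart the total for a new dept   */
  payroll + yearly;                 /* sum statement: payroll is retained */
run;

proc print data=work.flags;
run;
```

The printed rows show is_first/is_last/payroll of 1/0/100 and 0/1/300 for department A, then 1/1/50 for department B. Adding if last.dept; to the step would keep only the 300 and 50 rows, which are the department totals.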
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor;
output out=company.budget sum(wagerate)=payroll;
run;
I was using the following code to analyze data:
set taq.cq_&yyyymmdd:;
by symbol date time NOTSORTED ex;
There are thousands of datasets I am running the code on, one per day. When &yyyymmdd specifies only one dataset (one day, for example 20130102), it works. However, when I try to run it for multiple datasets (for example, 201301:), SAS returns the following error:
BY NOTSORTED/NOBYSORTED cannot be used with SET statement when
more than one data set is specified.
If I cannot use NOTSORTED here, what is an equivalent statement that I could use?
My understanding of the keyword NOTSORTED is that you use it when the data is not sorted yet. Therefore, do I need to sort it first? How to do it?
I am also confused by the number of variables that NOTSORTED is referencing. Does it only have an effect on "time", or does it have an effect on "symbol, date, time"?
Many thanks!
UPDATE#2:
The rest of the process immediately following the set statement is (pseudo code, as I don't have permission to post the original code):
Data _quotes;
SET STATEMENT HERE
Change the name of a variable in the dataset (Variable name is EXN).
last.EXN in a if statement. If the condition is satisfied, label EXN.
Drop some variables.
Run;
DATA NEWDATASET (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
SET _quotes;
by symbol date time;
....
Run;
NOTSORTED tells SAS that observations with the same BY values are grouped together but that the groups are not necessarily in sorted order. The data need not have gone through a PROC SORT, but it must already be in a logical grouped order as listed in the BY statement.
NOTSORTED applies to all variables in the BY statement, wherever it appears in that statement. Given that, I suspect you don't fully understand BY group processing.
It's usually a bit dangerous to use, especially if you don't understand BY group processing. If observations belonging to the same group are not adjacent, they are treated as separate groups: the step will not work as intended and will not produce an error. The correct workaround depends on your process, to be honest.
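A small made-up example may help show what NOTSORTED does to the FIRST. and LAST. flags. The data set work.quotes and its four rows are invented for illustration; only the variable names symbol and ex come from the question.

```sas
/* Made-up quotes: exchange N appears, then T, then N again */
data work.quotes;
  input symbol $ ex $;
  datalines;
IBM N
IBM N
IBM T
IBM N
;
run;

data work.flagged;
  set work.quotes;
  by symbol ex notsorted;   /* no error although N follows T      */
  first_ex = first.ex;      /* 1 whenever a new run of ex starts  */
  last_ex  = last.ex;       /* 1 whenever a run of ex ends        */
run;

proc print data=work.flagged;
run;
```

The four rows get first_ex/last_ex of 1/0, 0/1, 1/1 and 1/1. The final N row is treated as its own group rather than being joined to the first two, and the log stays clean. Without NOTSORTED the same step would stop with a "BY variables are not properly sorted" error.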
I would suggest reviewing the documentation regarding BY group processing. It's quite in depth and has lots of samples to illustrate the different type of calculations.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n138da4gme3zb7n1nifpfhqv7clq.htm
NOTSORTED is often used in example posts to either avoid a sort or when using a custom sort that's difficult to implement in other ways. Explicitly sorting will remove this issue but you may also be misunderstanding how SAS processes data when you have a SET statement with a BY statement. I believe this is called interleaving.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n1tgk0uanvisvon1r26lc036k0w7.htm
I suspect that the NOTSORTED keyword is being used to find groups of observations with the same value for the EX variable within the same symbol, date, time. If you only need to find the FIRST then you can use the LAG() function to calculate the FIRST.EX flag.
data want;
set taq.cq_&yyyymmdd:;
by symbol date time;
/* a new EX group starts at a new time or when EX differs from the previous row */
first_ex = first.time or ex ne lag(ex);
run;
Otherwise then perhaps you want to convert the process to data step views and then set the views together.
data work.view_cq_20130102 / view=work.view_cq_20130102;
set taq.cq_20130102;
by symbol date time ex NOTSORTED;
...
run;
...
data want ;
set work.view_cq_201301: ;
by symbol date time;
...
(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site  IDnumb  Date
a     1       6/12
b     2       3/4
c     4       5/1
a     3       .
d     5       .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
            N   %
Complete    3   75%
Last Month  1   25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
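proc sort data=yourdata;
by site descending date;  /* DESCENDING goes before the variable it applies to */
run;
data yourdata_want;
set yourdata;
by site descending date;
/* the first row per site has the latest date; missing dates sort last */
if first.site then do;
comp = ifn(date>0,1,0);
output;
end;
run;
proc freq data=yourdata_want;
tables comp;
run;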
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort it by SITE DESCENDING DATE, then by SITE with NODUPKEY. That way the latest date is up top. You could also format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
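A minimal sketch of that format-based route follows. It assumes DATE is a numeric SAS date; the years in the sample data, the format name compfmt and the data set names work.sorted and work.one_per_site are made up for illustration.

```sas
/* Sample data from the question, with assumed years added */
data work.yourdata;
  input site $ idnumb date :mmddyy10.;
  format date mmddyy10.;
  datalines;
a 1 06/12/2023
b 2 03/04/2023
c 4 05/01/2023
a 3 .
d 5 .
;
run;

proc format;
  value compfmt
    .     = 'Last Month'   /* missing date  */
    other = 'Complete';    /* any real date */
run;

/* Latest date first within each site, then keep one row per site */
proc sort data=work.yourdata out=work.sorted;
  by site descending date;
run;
proc sort data=work.sorted out=work.one_per_site nodupkey;
  by site;
run;

/* PROC FREQ groups by formatted value; MISSING counts the missing dates */
proc freq data=work.one_per_site;
  tables date / missing;
  format date compfmt.;
run;
```

With the five sample rows this leaves sites a, b and c as Complete and site d as Last Month, matching the 3 and 1 counts in the question.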
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).