I have a dataset of 60 observations and I'm trying to create a new variable to label the first 30 observations as "A" and the second 30 observations as "B". I assume I would need to use an IF THEN statement for this, but how would I go about creating the new variable and writing the statement to achieve this goal?
Figured out a roundabout way to solve this.
DATA market_new;
SET mydata_old;
id = _N_;
RUN;
Then using conditional logic within this.
IF id <31 THEN Letter = 'A';
IF id >=31 THEN Letter = 'B';
The _N_ automatic variable is what you want to use. I would first capture it permanently in a variable, like row_num=_N_, because _N_ only counts data step iterations: if the data set gets reordered for some reason, the _N_ of a given row will change. Then you can do your conditional assignments according to the value of row_num.
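As a minimal sketch of that advice, the two steps can be combined into one data step. The data set names come from the question; row_num and the explicit LENGTH statement are illustrative choices, not part of the original code.

```sas
data market_new;
  set mydata_old;
  row_num = _N_;          /* permanent copy of the row position at read time */
  length Letter $1;       /* assumed length: one character is enough for A/B */
  if row_num <= 30 then Letter = 'A';
  else Letter = 'B';      /* rows 31 and later */
run;
```

Using IF/ELSE instead of two separate IF statements means the second condition is not evaluated for rows that were already labeled.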
I want to create a custom variable that will store the name of the table each observation comes from.
Something like this:
data FTTH_SOHO_2;
ATTRIB scoring_month;
set fmscore.SCORE_FTTH_CHURN_SOHO_202009 - fmscore.SCORE_FTTH_CHURN_SOHO_202012
fmscore.SCORE_FTTH_CHURN_SOHO_202101 - fmscore.SCORE_FTTH_CHURN_SOHO_202106 = tablename;
scoring_month = tablename;
where tp_desig_num = 'XXXXXXXXXXX';
run;
Of course I get a syntax error, but is it possible to store the name of the data set currently being read in some kind of variable and use it to mark each observation with the table it came from?
I need to see the months from which I receive observations.
You're looking for the INDSNAME option.
data want;
set have1 have2 have3 indsname=dsn;
ds_name = dsn;
run;
You do have to create a variable with an assignment statement separate from the indsname option, as the option creates only a temporary variable.
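For the monthly tables in the question, a sketch along these lines could also pull the month out of the data set name. The two small tables and the names ds_name and scoring_month are hypothetical stand-ins; the code assumes every table name ends in a six-digit YYYYMM suffix.

```sas
/* Hypothetical monthly tables, named in the same style as the question's */
data work.score_202009 work.score_202010;
  x = 1;
run;

data want;
  /* INDSNAME= holds LIBREF.MEMBER, so 41 characters is enough */
  length ds_name $41 scoring_month $6;
  set work.score_202009 work.score_202010 indsname=dsn;
  ds_name = dsn;                                  /* e.g. WORK.SCORE_202009 */
  scoring_month = substr(dsn, length(dsn) - 5);   /* last six characters */
run;
```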
I have a dataset with two variables, 'shift' and 'scheduled'. The 'shift' variable contains a number of different time value records, for example "ED A 7a-4p"; the 'scheduled' variable contains the number of days that shift is scheduled, so for example there would be a "3" in the cell to represent 3 days.
I created the following code to understand how many shifts are staffed at a given hour.
data ED_A_7a_4p;
set schedule schedule10;
if shift = 'ED A 7a-4p' and Scheduled = '3' then SevenToEightAM = ???;
if shift = 'ED A 7a-4p' and Scheduled = '7' then EightToNineAM = ???;
run;
I would like the created variables, for example 'SevenToEightAM', to equal the number that is in "scheduled" variable column. So if 'scheduled' is 3 I would want 'SevenToEightAM' to equal 3.
The issue is that 'scheduled' is totally random and I can't autocode it so I was hoping there is a conditional option in SAS that would allow me to make 'SevenToEightAM' to whatever "scheduled" is within my dataset.
You probably want a TABULATE report instead of creating new variables. Try:
data have;
set original;
scheduled_num = input(scheduled, best12.);
run;
Proc TABULATE data=have;
class shift;
var scheduled_num;
table shift, scheduled_num*sum;
run;
I need help reading the code below. I am not sure what specific parts in this code are doing. For example, what does ( firstobs = 2 keep = column3 rename = (column3 = column4) ) do?
Also, what does ( obs = 1 drop = _all_ ); do?
I have also not used column5 = ifn( first.column1, (.), lag(column3) ); before. What does this do?
I am reading someone else's code. I wish I could provide more detail. If I find a solution, I will post it. Thanks for your help.
data out.dataset1;
set out.dataset2;
by column1;
WHERE column2 = 'N';
set out.dataset1 ( firstobs = 2 keep = column3 rename = (column3 = column4) )
out.dataset1 ( obs = 1 drop = _all_ );
FORMAT column5 DATETIME20.;
FORMAT column4 DATETIME20.;
column5 = ifn( first.column1, (.), lag(column3) );
column4 = ifn( last.column1, (.), column4 );
IF first.column1 then DIF=intck('dtday',column4,column3);
ELSE DIF= intck('dtday',column5,column3);
format column6 $6.;
IF first.column1
OR intck('dtday',column5,column3) GT 20 THEN column6= 'HARM';
ELSE column6= 'REPEAT';
run;
Seems like you need to learn about the SAS data step language!
The things in the parentheses are data set options.
You can use those options whenever you are referencing a table, even in PROC SQL.
The options you have:
firstobs = sets the record at which reading starts; in your case it is 2, so SAS will start reading the table at the 2nd record.
keep = uses only the fields in the list rather than all the fields in the table.
rename = renames a field, so it works like an alias in SQL.
obs = sets the last record to read, which limits the records you pull out of a table, like TOP or LIMIT in SQL.
drop = removes the selected fields from the table; in your case _all_ is used, which means all the fields are dropped.
as for the functions:
LAG returns the value from the previous record for the field you put in parentheses, so DPD_CLOSE_OF_BUSINESS_DT. (Strictly speaking it returns the value from the previous time the function was called, which is the same thing here because it runs on every record.)
IFN works like a CASE or IF. Basically you put a condition in the 1st argument; the 2nd argument is returned when that condition is true, and the 3rd argument is returned when it is false.
So to answer that question if it's the first record for the variable SOR_LEASE_NBR then the field Prev_COB_DT will be . otherwise it will be a previous value of DPD_CLOSE_OF_BUSINESS_DT.
The best advice I can give you is to start googling SAS plus the name of the function you are wondering about; after that it's a matter of encapsulation!
Hope this helps!
Basically your data step is using the LAG() function to look back one observation and the extra SET statement to look ahead one observation.
The IFN() function calls are then being used to make sure that missing values are assigned when at the boundary of a group.
You then use these calculated PREV and NEXT dates to calculate the DIF variable.
Note for this to work you need to be referencing the same input dataset in the two different SET statements. (The dataset used last, with the obs=1 and drop=_all_ dataset options, doesn't really need to be the same since it is not reading any of the actual data; it just has to have at least one observation.)
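The look-back and look-ahead pattern described here can be sketched on a small made-up table. The data set have and the variables id, value, prev_value and next_value are hypothetical names chosen for illustration, not the ones from the original code.

```sas
/* Hypothetical sample data, already sorted by id */
data have;
  input id value;
  datalines;
1 10
1 20
1 30
2 40
2 50
;
run;

data want;
  set have;
  by id;
  /* Look ahead: read the same table again, starting one row later.
     The extra one-row, no-variable read keeps this SET statement from
     running out of rows before the first SET has read its last row. */
  set have (firstobs=2 keep=value rename=(value=next_value))
      have (obs=1 drop=_all_);
  /* IFN evaluates all of its arguments, so LAG runs on every row */
  prev_value = ifn(first.id, ., lag(value));
  /* Blank out the look-ahead value at the end of each group */
  next_value = ifn(last.id, ., next_value);
run;
```

Wrapping LAG in IFN rather than in an IF/THEN statement is what keeps the LAG queue updated on every row, including the first row of each group.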
( firstobs = 2 keep = DPD_CLOSE_OF_BUSINESS_DT rename =
(DPD_CLOSE_OF_BUSINESS_DT = Next_COB_DT) ) do?
Here firstobs=2 tells SAS to start reading the data at the 2nd observation in the dataset,
and the rename option changes the name of the variable.
(obs = 1 drop = _all_);
obs=1 stops reading after the 1st obs in the dataset. If you specify obs=2, observations up to the 2nd will be read.
drop=_all_ drops all of your variables.
Firstobs:
Reads part of the data. If you specify firstobs=10, reading starts from the 10th observation.
Obs:
If you specify obs=15, the data will be read up to the 15th observation.
If you run the step below, it gives you 3 observations (the 2nd to the 4th data lines) in the output.
Example:
DATA INSURANCE;
INFILE CARDS FIRSTOBS=2 OBS=4;
INPUT NAME$ GENDER$ AGE INSURANCE $;
CARDS;
SOWMYA FEMALE 20 MEDICAL
SUNDAR MALE 25 MEDICAL
DIANA FEMALE 67 MEDICARE
NINA FEMALE 56 MEDICAL
;
RUN;
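The same two keywords also work as data set options on a SET statement, which is how they are used in the code from the question. A minimal sketch, using the SASHELP.CLASS sample table that ships with SAS (the output name subset is arbitrary):

```sas
/* Reads observations 2 through 4 of the input table, i.e. 3 rows */
data subset;
  set sashelp.class (firstobs=2 obs=4);
run;
```

Note that obs= names the last observation to read rather than a row count, so firstobs=2 obs=4 yields three rows, not four.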
Probably a simple question. I have a simple dataset with scheduled payment dates in it.
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
11/16/2015 12/16/2015
12/17/2015 01/16/2016
01/17/2016 02/16/2016
;
What I'm trying to do is to create a binary latest row indicator. For example, if I wanted to know the latest row as of 1/31/2016, I'd want row 2 to be flagged as the latest row. What I had been doing before is finding out where 1/31/2016 is between the previous_pmt_date and the scheduled_pmt_date, but that isn't correct for my purposes. I'd like to do this in a data step as opposed to SQL subqueries. Any ideas?
Want:
previous_pmt_date scheduled_pmt_date latest_row_ind
11/16/2015 12/16/2015 0
12/17/2015 01/16/2016 1
01/17/2016 02/16/2016 0
Here's a solution that does it all in the single existing datastep without any additional sorting. First I'm going to modify your data slightly to include account as the solution really should take that into account as well:
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT account previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
1 11/16/2015 12/16/2015
1 12/17/2015 01/16/2016
1 01/17/2016 02/16/2016
2 11/16/2015 12/16/2015
2 12/17/2015 01/16/2016
2 01/17/2016 02/16/2016
;
run;
Specify a cutoff date:
%let cutoff_date = %sysfunc(mdy(1,31,2016));
This solution uses the approach from this question to save the variables in the next row of data, into the current row. You can drop the vars at the end if desired (I've commented out for the purposes of testing).
data want;
set inform2 end=eof;
by account scheduled_pmt_date;
recno = _n_ + 1;
if not eof then do;
set inform2 (keep=account previous_pmt_date scheduled_pmt_date
rename=(account = next_account
previous_pmt_date = next_previous_pmt_date
scheduled_pmt_date = next_scheduled_pmt_date)
) point=recno;
end;
else do;
call missing(next_account, next_previous_pmt_date, next_scheduled_pmt_date);
end;
select;
when ( next_account eq account and next_scheduled_pmt_date gt &cutoff_date ) flag='a';
when ( next_account ne account ) flag='b';
otherwise flag = 'z';
end;
*drop next:;
run;
This approach works by using the current observation in the dataset (obtained via _n_) and adding 1 to it to get the next observation. We then use a second set statement with the point= option to load in that next observation and rename the variables at the same time so that they don't overwrite the current variables.
We then use some logic to flag the necessary records. I'm not 100% sure of the logic you require for your purposes, so I've provided some sample logic and used different flags to show which logic is being triggered.
Some notes...
The by statement isn't strictly necessary but I'm including it to (a) ensure that the data is sorted correctly, and (b) help future readers understand the intent of the datastep as some of the logic requires this sort order.
The call missing statement is simply there to clean up the log. SAS doesn't like it when you have variables that don't get assigned values, and this will happen on the very last observation so this is why we include this. Comment it out to see what happens.
The end=eof syntax basically creates a temporary variable called eof that has a value of 1 when we get to the last observation on that set statement. We simply use this to determine if we're at the last row or not.
Finally, and very importantly, make sure you keep only the required variables when you load in the second dataset; otherwise you will overwrite existing vars in the original data.
I have my data sorted by ID and date. I have converted the date to a single number which preserves ordering (the year followed by the week of the year). I want to make a new variable that is a function of the minimum value in the finest partition. An example follows.
ID Start listen
1 201134 201138
1 201204 201150
2 200905 200910
2 201005 201020
I want something like
ID Start listen weekSincestart
1 201134 201138 4
1 201204 201150 54
2 200905 200910 5
2 201005 201020 15
All I'm doing is taking (listen-min(start)), but I am assuming min() takes the minimum start for a given ID. So I am asking whether there is a "by statement" for the min function.
In my opinion there is no need to convert your start and listen values from dates using the method you have.
I converted your data back into dates using INTNX using the first day of the year in START and LISTEN variables and increment by week in the same variables. The dates might not be exactly what you have on your dataset, however it should result in something similar.
The following should do what you want, if I understand you correctly.
DATA WANT2;
SET HAVE;
BY ID START;
RETAIN _START;
FORMAT _START DATE9.;
IF FIRST.ID THEN _START = START;
WEEKSINCESTART = INTCK("WEEK",_START,LISTEN);
RUN;
In this instance your sample is sorted; however, if you wish to use BY statement processing to identify the first instance of a value in ID, you will need to sort your dataset first.

The RETAIN statement holds a value across observations, and by using the BY statement we can specify when the value in the retained variable is altered. In this instance we want to alter the _START variable when the first instance of an ID is encountered. I use the underscore prefix because it makes it easier to drop these variables en masse if necessary. This value will not be replaced until the first instance of the next ID, which means it will be the value for subsequent observations of ID 1 and so on.

The INTCK function measures the number of intervals, in this example the number of WEEKS, between period one and period two, in this instance between the first instance of START for each ID (captured in _START) and LISTEN for each observation.
The end result is:
ID START LISTEN _START WEEKSINCESTART
1 21AUG2011 18SEP2011 21AUG2011 4
1 29JAN2012 11DEC2011 21AUG2011 16
2 01FEB2009 08MAR2009 01FEB2009 5
2 31JAN2010 16MAY2010 01FEB2009 67
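The conversion back to dates mentioned earlier is not shown, so here is one way it could be sketched. It assumes the raw year-week numbers sit in a data set called have_raw (a hypothetical name) in numeric variables start and listen. For the four sample rows this formula gives the same START and LISTEN dates as the table above.

```sas
/* Hypothetical raw input with YYYYWW numbers, as in the question */
data have_raw;
  input id start listen;
  datalines;
1 201134 201138
1 201204 201150
2 200905 200910
2 201005 201020
;
run;

data have;
  set have_raw (rename=(start=start_yyyyww listen=listen_yyyyww));
  /* Start of the week that lies WW weeks after the week containing 1 January */
  start  = intnx('week', mdy(1, 1, floor(start_yyyyww / 100)),  mod(start_yyyyww, 100));
  listen = intnx('week', mdy(1, 1, floor(listen_yyyyww / 100)), mod(listen_yyyyww, 100));
  format start listen date9.;
  drop start_yyyyww listen_yyyyww;
run;
```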
I hope this is useful.
Regards,
Scott
You could easily do it with proc sql:
proc sql;
create table RESULT as
select *, listen-min(start) as weekSincestart
from INPUT
group by id;
quit;
It will take the min of each id group to calculate min(start).
And since you select variables that are neither in the group by nor inside an aggregate function, it will not collapse multiple rows into one per group; SAS remerges the group minimum back onto every row (the log contains a note about remerging summary statistics).
Your question is a bit confusing. If you want just listen minus start (what your 'result' is), then just do that. The min function does not cross rows; in SAS it's difficult to cross rows (or at least it's something you would have to do intentionally). Of course you do need to figure out how to deal with the year barrier; if I were you I'd leave the dates as actual dates and use INTCK to determine the difference in weeks.
If you actually do want the minimum for the entire ID, the data step solution (not quite as neat as the SQL solution but works roughly the same):
data want;
set have;
by id start;
retain _initial_start;
if first.id then _initial_start=start;
weeksincestart=listen-_initial_start; *or whatever you intended - this does not seem right;
drop _initial_start;
run;