I have a sample dataset with three columns (Row, ID, and Level) and 10 rows. ID1 has 6 rows and ID2 has 4 rows.
I want to create a new column, Level1, derived from Level by the following rule: while Level moves from Moderate to Severe, Level1 matches Level. Once Level has reached Severe, Level1 stays Severe for the rest of that ID, even if Level later drops back to Moderate. For ID1, Level moves from Moderate to Severe over Rows 1 to 4, so Level1 equals Level there. From Row 5, Level drops back to Moderate, so Level1 is Severe for Rows 5 and 6. For ID2, Level only moves from Moderate to Severe, so Level1 equals Level throughout.
Use RETAIN to keep the value of LEVEL1 unchanged until you explicitly change it.
data want;
set have;
by id;
if first.id then level1=level;
if level='Severe' then level1=level;
retain level1;
run;
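To try the RETAIN step, here is a minimal sketch with made-up input. The rows in have are an assumption reconstructed from the description in the question (the original table was posted as an image), so replace them with your real data. The data step is repeated so the block runs on its own.

```sas
/* Hypothetical sample data: ID 1 reaches Severe and drops back, ID 2 does not */
data have;
input Row ID Level $;
datalines;
1 1 Moderate
2 1 Moderate
3 1 Severe
4 1 Severe
5 1 Moderate
6 1 Moderate
7 2 Moderate
8 2 Moderate
9 2 Severe
10 2 Severe
;
run;

/* Same logic as the step above: LEVEL1 is reset at the start of each ID, */
/* switches to Severe when LEVEL does, and is otherwise carried forward   */
data want;
set have;
by id;
retain level1;
if first.id then level1 = level;
if level = 'Severe' then level1 = level;
run;

proc print data=want noobs;
run;
```

With this input, rows 5 and 6 should show Level1 as Severe while Level is Moderate.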
What I have:
Number Cost Amount
52 98 1
108 50 3
922 12 1
What I want:
Number Cost
52 98
108 50
109 50
110 50
922 12
My dataset has a variable Amount. If Amount is 2 for a certain row, I want to create a new row right beneath it with the same Cost and the Number equal to that of the row above + 1. If the Amount is 3, I want to create two new rows right beneath it, both with the same Cost and with the Numbers being Number from row above +1 and Number from row above +2, and so on.
My final step would be to delete the Amount column, which I can do with
data want (drop=Amount);
set have;
run;
I am having trouble implementing this. My idea was to use proc sql with insert into, but I cannot work out how to combine it with an if condition that runs through the Amount variable.
Code to reproduce table:
proc sql;
create table want
(Number num, Cost num, Amount num);
insert into want
values(52,98,1)
values(108,50,3)
values(922,12,1);
quit;
This can help you:
proc sort data=want out=want_s nodupkey;
by Number;
run;
data result;
keep Number Cost;
set want_s;
do i=1 to Amount;
output;
Number=Number+1;
end;
run;
You might need to take care that Number does not overlap with the next input row like below:
Number ; Amount
108 ; 10
110 ; 1
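One way to detect such overlaps after the expansion is to let PROC SORT separate out repeated keys. This is only a sketch: result is the table built by the step above, while result_unique and overlaps are hypothetical output names.

```sas
/* Keep the first row per Number in RESULT_UNIQUE and write every */
/* later row with an already-seen Number to OVERLAPS              */
proc sort data=result out=result_unique nodupkey dupout=overlaps;
by Number;
run;

/* Any rows printed here are Numbers generated more than once */
proc print data=overlaps noobs;
run;
```

If overlaps is empty, the generated Number values did not collide.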
Use a DO loop to output AMOUNT rows for each input row. You can use NUMBER itself as the loop index so that it increments on each iteration.
Example (untested)
data want(keep=number cost);
set have;
do number = number to number + amount-1;
output;
end;
run;
However, you may not need to expand the data at all in some cases. Many SAS procedures provide a WEIGHT or FREQ statement that lets a variable fill that statistical or processing role.
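As a small illustration of the FREQ idea, the sketch below runs PROC MEANS on the unexpanded table. The have data is copied from the question; choosing PROC MEANS and the statistics shown is my assumption about what the downstream analysis might look like.

```sas
data have;
input Number Cost Amount;
datalines;
52 98 1
108 50 3
922 12 1
;
run;

/* FREQ makes each row count AMOUNT times, so the statistics are  */
/* computed as if the row had been written out that many times    */
proc means data=have n mean sum;
var Cost;
freq Amount;
run;
```

Here N is reported as 5 rather than 3, because the row with Amount 3 is counted three times.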
I am working with crime data. Now, I have the following table crimes. Each row contains a specific crime (e.g. assault): the date it was committed (date) and a person-ID of the offender (person).
date person
------------------------------
02JAN2017 1
03FEB2017 1
04JAN2018 1 --> not to be counted (more than a year after 02JAN2017)
27NOV2017 2
28NOV2018 2 --> should not be counted (more than a year after 27NOV2017)
01MAY2017 3
24FEB2018 3
10OCT2017 4
I am interested in whether each person has committed (relapse=1) or not committed (relapse=0) another crime within 1 year after the first crime committed by the same person. Another condition is that the first crime has to be committed within a specific year (here 2017).
The result should therefore look like this:
date person relapse
------------------------------
02JAN2017 1 1
03FEB2017 1 1
04JAN2018 1 1
27NOV2017 2 0
28NOV2018 2 0
01MAY2017 3 1
24FEB2018 3 1
10OCT2017 4 0
Can anyone please give me a hint on how to do this in SAS?
Obviously, the real data are much larger, so I cannot do it manually.
One approach is to use DATA step by group processing.
The BY <var> statement sets up binary variables first.<var> and last.<var> that flag the first row in a group and the last row in a group.
You appear to be assigning the computed relapse flag to every row of the group. That kind of computation can be done with what SAS coders call a DOW loop: a loop with the SET statement inside the loop, followed by a second loop that assigns the computed value to each row in the group.
The INTCK function can compute the number of years between two dates.
For example:
data want(keep=person date relapse);
* DOW loop computes assertion that relapse occurred;
relapse = 0;
do _n_ = 1 by 1 until (last.person);
set crimes; * <-------------- CRIMES;
by person date;
* check if persons first crime was in 2017;
if _n_ = 1 and year(date) = 2017 then _first = date;
* check if persons second crime was within 1 year of first;
if _n_ = 2 and _first then relapse = intck('year', _first, date, 'C') < 1;
end;
* at this point the relapse flag has been computed, and its value
* will be repeated for each row output;
* serial loop over same number of rows in the group, but
* read in through a second SET statement;
do _n_ = 1 to _n_;
set crimes; * <-------------- CRIMES;
output;
end;
run;
The process would be more complex, with more bookkeeping variables, if the actual process is to classify different time frames of a person as either relapsed or reformed based on rules more nuanced than "1st in 2017 and next within 1 year".
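The 'C' argument in the INTCK call above matters. A short sketch of the difference, using one pair of dates from the question (the variable names are only illustrative):

```sas
data _null_;
d1 = '01MAY2017'd;
d2 = '24FEB2018'd;
/* Default (discrete) method: counts 1 January boundaries crossed */
discrete = intck('year', d1, d2);
/* Continuous method: counts full years elapsed since d1 */
continuous = intck('year', d1, d2, 'C');
put discrete= continuous=;
run;
```

The log shows discrete=1 continuous=0: the dates are less than a year apart, but one calendar-year boundary lies between them, so only the continuous method answers the "within 1 year" question directly.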
I started using SAS relatively recently, so I'm not by any means attempting to create perfect code here.
I'd sort the data by id/person and date first (date should be numeric), and then use a retain statement to check against the date of the first crime. It's not perfect, but if your data is good (no missing dates), it'll work, and it is easy to follow imho.
This only works if the first record and act of crime is supposed to happen in 2017. If you have crimes happening in 2016, and want to check whether 'a crime' is committed in 2017 and then check the relapse, then this code is not going to work - but I think that is covered in the comments beneath your question.
data test;
input tmp_year $ 1-9 person;
datalines;
02JAN2017 1
03FEB2017 1
04JAN2018 1
27NOV2017 2
28NOV2018 2
01MAY2017 3
24FEB2018 3
10OCT2017 4
;
run;
data test2;
set test;
crime_date = input(tmp_year, date9.);
act_year = year(crime_date);
run;
proc sort data=test2;
by person crime_date ;
run;
data want;
set test2;
by person crime_date;
retain date_of_crime;
if first.person and act_year = 2017 then date_of_crime = crime_date;
else if first.person then call missing(date_of_crime);
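* continuous method (C) counts full years elapsed, and a missing first date never flags a relapse;
if not first.person and not missing(date_of_crime)
and intck('YEAR', date_of_crime, crime_date, 'C') < 1
then relapse = 1;
else relapse = 0;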
run;
The above code flags crimes committed within one year after a person's first crime in 2017. You can then retrieve the unique persons with a proc sql statement, and join them with whatever dataset you have.
I need advice on how to split up a dataset efficiently (around 7 million rows and 280 columns).
My dataset contains the columns 'department' and 'classid' which are not unique.
I would like to split my dataset depending on the department variable and the maximum number of observations (100k). Another limitation is shown by the following example:
Ex 1:
math_1 - 10k rows
math_2 - 80k rows
math_3 - 20k rows
Result 1:
math_1 + math_2 -> 90.000 rows - OK
math_3 -> 20.000 rows - OK
Ex. 2:
math_1 - 90k rows
math_2 - 80k rows
math_3 - 10k rows
Result 2.1:
math_1 + math_2 -> 100k rows (90k from math1, 10k from math2) -> not OK
math_2 + math_3 -> 80k rows (70k from math_2, 10k from math_3) -> not OK
math_2 is split across 2 tables although it would fit into one, so it should be split like this:
Result 2.2:
math_1 -> 90k rows -> OK
math_2 + math_3 -> 90k rows -> OK
Even if math_2 would not fit into one table, I don't want it to be mixed with rows from another original table.
I tried to solve it with hash tables but am simply running out of memory because of the large number of columns.
Not sure what hashes have to do here.
I would first summarize the data by Department and ClassID and put the counts into a table. Then you can go down the table and create a new variable called group: if the running total exceeds the limit X, increment group and restart the total with the current count; otherwise keep the same group. This gives you a variable that describes your file structure.
Then use that data set with the groups to build your table split. I would recommend a CALL EXECUTE or DOSUBL to split the data into the subsets.
7 million rows with a maximum of 100k per table means at least 70 data sets, and it will be a nightmare to work out where to go to get your data because the split is not designed logically. You will always need to reference this lookup table anyway.
data have;
input department $ classID $ num_records;
cards;
A math1 500
A math2 500
A math3 200
A math4 100
;
run;
data groups;
set have;
retain running_total;
running_total=sum(running_total, num_records);
* start a new group when adding this class would exceed the limit (500 here);
if running_total > 500 then do;
group + 1;
running_total = num_records;
end;
run;
Use this with the links above to create the subsets if really, really desired.
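A minimal sketch of the CALL EXECUTE step, assuming the groups table built above, a source table called original (a hypothetical name), and classID values that each belong to a single department. The output names split_<group> are also hypothetical.

```sas
data _null_;
set groups;
by group;
length classes $2000;
retain classes;
/* Collect the quoted classID values of one group into a list */
if first.group then classes = ' ';
classes = catx(',', classes, quote(strip(classID)));
/* At the end of each group, queue one DATA step that extracts it */
if last.group then
call execute(cats('data split_', group,
'; set original; where classID in (', classes, '); run;'));
run;
```

Each generated step reads the full source table once, so for many groups a single DATA step with several output data sets may be cheaper. The sketch only shows how the group table can drive the code generation.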
Create a test dataset to play with:
data test; set original(keep=department classid); run;
Use PROC TABULATE to get an overview of departments and classids.
Use PROC SORT; BY department classid; to sort your data.
Write SAS code that writes the SAS code to split your data:
data _null_;
put 'data classid1; set original; if classid="math_1"; run;';
run;
So the code for splitting looks like this:
data classid1;
set original;
if classid="math_1";
run;
data classid2;
set original;
if classid="math_2";
run;
I am trying to come up with code that will select a random column from a group of columns of interest. The group of columns will change depending on the values in the columns for each observation. Each observation is a subject.
Let me explain to be more clear:
I have 8 columns, names V1-V8. Each column has 3 potential responses ('Small','Medium','High'). Due to certain circumstances in our project, I need to "combine" all this information into 1 column.
Key factor 1: We only want the columns per subject where he/she selected 'High' (lots of combinations here). This is what I refer to when I say the columns of interest changes per subject.
Key factor 2: Once I have identified the columns where 'High' was selected for the subject, select one of the columns at random.
At the end, I need a new variable (New_V) with values V1-V8 (NOT 'Small','Medium','High') indicating which column was selected for each subject.
Any advice would be great. I have tried arrays and macro variables, but I can't seem to tackle this the right way.
This method uses macro variables and a loop. There are three main steps: First, find all variables that are "high." Second, select a random value from 1 to the number of variables that are "high." Third, pick that variable and call it selected_var.
data temp;
input subject $ v1 $ v2 $ v3 $ v4 $ v5 $ v6 $ v7 $ v8 $;
datalines;
1 high medium small high medium small high medium
2 medium small high medium small high medium high
3 small high high medium small high medium high
4 medium medium high medium small small medium medium
5 medium medium high small small high medium small
6 small small high medium small high high high
7 small small small small small small small small
8 high high high high high high high high
;
run;
%let vars = v1 v2 v3 v4 v5 v6 v7 v8;
%macro find_vars;
data temp2;
set temp;
/*find possible variables*/
format possible_vars $20.;
%do i = 1 %to %sysfunc(countw(&vars.));
%let this_var = %scan(&vars., &i.);
if &this_var. = "high" then possible_vars = cats(possible_vars, "&this_var.");
%end;
/*create a random integer between 1 and number of variables to select from*/
rand = 1 + floor((length(possible_vars) / 2) * rand("Uniform"));
/*pick that one!*/
selected_var = substr(possible_vars, (rand * 2 - 1), 2);
run;
%mend find_vars;
%find_vars;
You're on the right track with arrays. The vname function will be helpful here. The want datastep shows how to do this (the rest just sets up example data):
proc format;
value smh
1='Small'
2='Medium'
3='High'
other=' '
;
quit;
data have;
call streaminit(5);
array v[8] $;
do _i = 1 to 1000;
do _j = 1 to 8;
__rand = ceil(1+rand('Binomial',.7,2));
v[_j] = put(__rand,smh6.);
end;
if whichc('High',of v[*]) = 0 then v8 = 'High'; *guarantee have one high;
output;
end;
drop _:;
run;
data want;
call streaminit(7); *arbitrary seed here, pick any positive number;
set have;
array v[8] ;
do until (v[_rand] = 'High'); *repeat this loop until one is picked that is High;
_rand = ceil(8*rand('Uniform'));
end;
chosen_v = vname(v[_rand]); *assign the chosen name to chosen_v variable;
drop _:;
run;
proc freq data=want;
tables chosen_v;
run;
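The loop in the want step keeps drawing until it hits a 'High' column, so it never ends for a row that has no 'High' at all. A variant that first collects the positions of the 'High' columns and then draws once could look like the sketch below. It assumes the have data set created above; want2 and the helper variables are hypothetical names.

```sas
data want2;
call streaminit(7); /* arbitrary seed */
set have;
array v[8];
array _idx[8] _temporary_; /* positions of the columns that are High */
length chosen_v $32;
_n_high = 0;
do _i = 1 to dim(v);
if v[_i] = 'High' then do;
_n_high = _n_high + 1;
_idx[_n_high] = _i;
end;
end;
/* Draw one of the collected positions; chosen_v stays blank if none is High */
if _n_high > 0 then do;
_pick = ceil(_n_high * rand('Uniform'));
chosen_v = vname(v[_idx[_pick]]);
end;
drop _:;
run;
```

Every 'High' column of a subject has the same chance of being picked, and exactly one random number is used per row.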
I calculate a ratio for 40 stocks. I need to sort those into three groups high, medium and low based on the value of the ratio. The ratios are fractions of one and there aren't many repetitions. What I need is to create three groups of about 13 stocks each, in group 1 to have the high ratios, in group 2 medium ratios and group 3 low ratios. I have the below code but it just assigns rank 1 to all my stocks.
How can I correct this?
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
proc rank data=sourceh.combinedFreq2 out=sourceh.ranked groups=3;
by symbol notsorted;
var ratio;
ranks rank;
run;
If you want to automatically partition into three relatively even groups, you can use PROC RANK (See example using sashelp.stocks):
data have;
set sashelp.stocks;
ratio=high/low;
run;
proc rank data=have out=want groups=3;
by stock notsorted;
var ratio;
ranks rank;
run;
That partitions them into three groups. As long as you have 40 different values (i.e., not a lot of repeats of one value), it will make 3 evenly split groups (with ~13 in each).
In your case, do not use a BY statement at all: BY creates a separate set of ranks for each BY group (here I'm ranking dates within each stock, but you want to rank the stocks themselves).
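Applied to the table from the question, that would be something like the following sketch. The DESCENDING option is my addition so that the highest ratios land in the first group.

```sas
/* No BY statement: all stocks are ranked against each other */
proc rank data=sourceh.combinedfreq2 out=sourceh.ranked groups=3 descending;
var ratio;
ranks rank;
run;
```

With GROUPS=3 the rank values are 0, 1 and 2, so here 0 holds the highest ratios and 2 the lowest.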
I think people are making this more complicated than it needs to be. Let's do this on easy mode.
First, we'll create the dataset and create our ratios.
Second, we'll sort the data by ratio.
Lastly, we'll assign a group based on observation number.
WARNING! UNTESTED CODE!
/*Make the dataset. I stole this from your code above*/
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
/*sort the data so that its ordered by ratio*/
PROC SORT DATA=sourceh.combinedfreq2 OUT=sourceh.combinedfreq2 ;
BY DESCENDING ratio ;
RUN ;
/*Assign a value based on observation number*/
Data sourceh.combinedfreq2;
Set sourceh.combinedfreq2;
length Group $6.;
if _N_ <=13 Then Group = "High";
if _N_ > 13 and _N_ <= 26 Then Group = "Medium";
if _N_ > 26 Then Group = "Low";
run;
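If the number of stocks can change, the fixed cut points 13 and 26 can be derived from the row count instead. A sketch, assuming the table is already sorted by descending ratio as above; sourceh.grouped is a hypothetical output name.

```sas
data sourceh.grouped;
/* NOBS= supplies the total number of rows without an extra pass */
set sourceh.combinedfreq2 nobs=n_obs;
length Group $6;
/* Map the row position to third 1, 2 or 3 */
select (ceil(3 * _n_ / n_obs));
when (1) Group = "High";
when (2) Group = "Medium";
otherwise Group = "Low";
end;
run;
```

For 40 rows this gives groups of 13, 13 and 14.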