Between equivalent in Stata - stata

I want to flag all rows that have salaries between 100,000 and 500,000. I have Data1.dta as:
ID Salary
1 100000
2 65000
3 400000
4 5000000
Is there a way to create a flag variable in Stata using SQL's between operator or its equivalent?
The resulting dataset would look like this:
ID Salary six_figures
1 100000 1
2 65000 0
3 400000 1
4 5000000 0
I can only think of the following:
gen six_figures = 0
replace six_figures = 1 if salary == 100000
replace six_figures = 1 if salary == 100001
... and so on, which would be inefficient and silly.

gen six_figures = 0
replace six_figures = 1 if inrange(salary,100000,500000)
or, equivalently (inrange() is inclusive at both ends):
gen six_figures = 0
replace six_figures = 1 if salary >= 100000 & salary <= 500000
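For readers who want to sanity-check the boundary behaviour outside Stata, here is a minimal Python sketch of the same inclusive range test on the four rows from the question. The helper name in_range and the list rows are made up for illustration; they are not part of Stata.

```python
# The four rows from the question: (ID, Salary)
rows = [(1, 100000), (2, 65000), (3, 400000), (4, 5000000)]


def in_range(value, low, high):
    """Inclusive range test, mirroring low <= value <= high."""
    return low <= value <= high


# Build (ID, Salary, six_figures) with a 0/1 flag
flagged = [
    (id_, salary, int(in_range(salary, 100000, 500000)))
    for id_, salary in rows
]

for row in flagged:
    print(*row)
```

Note that 100000 itself is flagged, because both ends of the range are included.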

Related

SAS Count number of changes between consecutive time periods

Proc SQL Version=9.4. No windows functions to use.
There are client id, time period(month), amount and corresponding class.
client_id data_period amount class
1 200801 30000 2
2 200801 17000 1
3 200801 9000 1
1 200802 30000 2
2 200802 55555 2
3 200802 11000 2
Threshold amount = 20 000.
amount > 20k gives class = 2, amount <= 20k makes class = 1
client_id = 1, amount and class are the same for 200801 and 200802.
client_id = 2, amount gets higher from 17k to 55.5k, class change is correct, from 1 to 2.
client_id =3, amount changed within the same class 1 (<20K), but class changed incorrectly.
Desired result is
client_id oldDate newDate AmtOld AmtNew ClassOld ClassNew Good Bad
2 200801 200802 17000 55555 1 2 1 0
3 200801 200802 9000 11000 1 1 0 1
I tried to apply a self join to get all the differences between data periods, but there are too many rows in the output. The data below is not from the example above; these are real numbers.
client_id oldDate newDate AmtOld AmtNew ClassOld ClassNew
A001687463 200808 200802 -5613 1690386 I03 I04
A001687463 200807 200802 -5613 1690386 I03 I04
A001687463 200806 200802 -5613 1690386 I03 I04
A001687463 200805 200802 -5613 1690386 I03 I04
PROC SQL;
CREATE TABLE WORK.'Q'n AS
SELECT distinct
t1.client_id, t1.data_period as oldDate, t2.data_period as newDate, t1.amount as expAmtOld, t2.amount as expAmtNew, t1.class as classOld, t2.class as classNew
FROM WORK.'E'n t1, WORK.'E'n t2
where
t1.client_id = t2.client_id and
t1.amount <> t2.amount
order by t1.client_id;
QUIT;
Do not attempt to do sequential processing using SQL. It is not built for that.
It should be easy to do in a data step. For example let's convert your printout into an actual SAS dataset so we have something to code with.
data have ;
input client_id data_period amount class ;
cards;
1 200801 30000 2
2 200801 17000 1
3 200801 9000 1
1 200802 30000 2
2 200802 55555 2
3 200802 11000 2
;
And let's sort it by client and period.
proc sort data=have ;
by client_id data_period ;
run;
Now just set the data and use the LAG() function to get the previous values.
Not sure what your definitions of GOOD and BAD were, so I just created new class variables based on your 20K rule.
data want ;
set have ;
by client_id;
old_period = lag(data_period);
old_class = lag(class);
newclass = 1 + (amount > 20000) ;
old_newclass = lag(newclass);
if first.client_id then call missing(of old_:);
bad = (class ne newclass) or (old_newclass ne old_class) ;
run;
So here are the results.
client_id data_period amount class old_period old_class newclass old_newclass bad
1 200801 30000 2 . . 2 . 0
1 200802 30000 2 200801 2 2 2 0
2 200801 17000 1 . . 1 . 0
2 200802 55555 2 200801 1 2 1 0
3 200801 9000 1 . . 1 . 0
3 200802 11000 2 200801 1 1 1 1
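To make the lag logic easier to follow, here is a rough Python sketch of the same steps: sort, look back one row within each client, recompute the class from the 20K rule, and flag mismatches. It is only an illustration of the data step above, not SAS behaviour in general; the dictionary prev stands in for LAG() plus the first.client_id reset, and all names are made up for the example.

```python
# (client_id, data_period, amount, class) rows from the question
have = [
    (1, 200801, 30000, 2),
    (2, 200801, 17000, 1),
    (3, 200801, 9000, 1),
    (1, 200802, 30000, 2),
    (2, 200802, 55555, 2),
    (3, 200802, 11000, 2),
]

THRESHOLD = 20000
want = []
prev = {}  # last row seen per client; empty for a client's first row

# sorted() orders the tuples by client_id, then data_period
for client_id, period, amount, klass in sorted(have):
    newclass = 1 + (amount > THRESHOLD)  # 2 above the threshold, else 1
    old = prev.get(client_id)
    old_class = old["class"] if old else None
    old_newclass = old["newclass"] if old else None
    # None != None is False, so a client's first row is never flagged
    # by the second comparison
    bad = int(klass != newclass or old_newclass != old_class)
    want.append((client_id, period, amount, klass, newclass, bad))
    prev[client_id] = {"class": klass, "newclass": newclass}

for row in want:
    print(*row)
```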

SAS : getting list of numbers based on reducing months

I have this data
data have;
input cust_id $ pmt months;
datalines;
AA 100 0
AA 50 1
AA 200 2
AA 350 3
AA 150 4
AA 700 5
BB 500 0
BB 300 1
BB 1000 2
BB 800 3
run;
and I'd like to generate an output that looks like this
data want;
input cust_id $ pmt months i;
datalines;
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 50 1 0
AA 200 1 1
AA 350 1 2
AA 150 1 3
AA 700 1 4
AA 200 2 0
AA 350 2 1
AA 150 2 2
AA 700 2 3
AA 350 3 0
AA 150 3 1
AA 700 3 2
AA 150 4 0
AA 700 4 1
AA 700 5 0
BB 500 0 0
BB 300 0 1
BB 1000 0 2
BB 800 0 3
BB 300 1 0
BB 1000 1 1
BB 800 1 2
BB 1000 2 0
BB 800 2 1
BB 800 3 0
run;
There are a few thousand rows with different cust_id values and different numbers of months. I tried joining tables, but I couldn't get the sequence 100 50 200 350 150 700 (for cust_id AA). I could only replicate 100 when months is 0, 50 when months is 1, and so on. I created maxval, which is the maximum months value. My code is something like this:
data temp1;
set have;
do i = 0 to maxval;
if (months <=maxval) then output;
end;
I thought of creating a unique key to join my have data and temp1 data, but it could only give me:
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 100 1 0
AA 50 1 1
AA 200 1 2
AA 350 1 3
AA 150 1 4
AA 100 2 0
AA 50 2 1
AA 200 2 2
AA 350 2 3
AA 100 3 0
AA 50 3 1
AA 200 3 2
AA 100 4 0
AA 50 4 1
AA 100 5 0
Any thoughts or different approach on how to generate my want table? Thank you!
This problem is a little tricky because you have things going in three directions:
The number of repetitions of each group equals the group count, and each repetition is one row shorter than the one before it.
Within each repetition, the payment start index ascends and the run always ends at the group count.
Within each repetition, the months item (output as i) always starts at the first row of the group and its end point descends from the group count.
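The three directions can be sketched in plain Python as two nested loops per customer. This is only an illustration of the pattern, assuming the rows of each customer are contiguous and already in months order; the variable names are made up for the example.

```python
from itertools import groupby

# (cust_id, pmt, months) rows from the question
have = [
    ("AA", 100, 0), ("AA", 50, 1), ("AA", 200, 2),
    ("AA", 350, 3), ("AA", 150, 4), ("AA", 700, 5),
    ("BB", 500, 0), ("BB", 300, 1), ("BB", 1000, 2), ("BB", 800, 3),
]

want = []
for cust_id, rows in groupby(have, key=lambda r: r[0]):
    pmts = [pmt for _, pmt, _ in rows]
    # one repetition per starting month; each is one row shorter
    for months in range(len(pmts)):
        # payments start at the current month, i restarts at 0
        for i, pmt in enumerate(pmts[months:]):
            want.append((cust_id, pmt, months, i))

for row in want:
    print(*row)
```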
SQL
One SQL approach is a three-way reflexive join within each group. The months values act as a within-group index and must increase by 1, starting from 0, for this to work.
proc sql;
create table want as
select X.cust_id, Z.pmt, X.months, Y.months as i
from have as X
join have as Y on X.cust_id = Y.cust_id
join have as Z on Y.cust_id = Z.cust_id
where
X.months + Y.months = Z.months
order by
X.cust_id, X.months, Z.months
;
quit;
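To see why the condition X.months + Y.months = Z.months picks out the right rows, here is a small Python sketch of the same three-way self join written as a brute-force comprehension. It is meant only as an illustration on the sample data (it is cubic in the row count); the tuple layout and names are assumptions for the example.

```python
# (cust_id, pmt, months) rows from the question
have = [
    ("AA", 100, 0), ("AA", 50, 1), ("AA", 200, 2),
    ("AA", 350, 3), ("AA", 150, 4), ("AA", 700, 5),
    ("BB", 500, 0), ("BB", 300, 1), ("BB", 1000, 2), ("BB", 800, 3),
]

# x supplies months, y supplies i, z supplies the payment
joined = [
    (x[0], z[1], x[2], y[2], z[2])
    for x in have
    for y in have
    for z in have
    if x[0] == y[0] == z[0] and x[2] + y[2] == z[2]
]

# same ordering as ORDER BY X.cust_id, X.months, Z.months
joined.sort(key=lambda r: (r[0], r[2], r[4]))

# keep cust_id, pmt, months, i
want = [r[:4] for r in joined]

for row in want:
    print(*row)
```

Each output row pairs a starting month (from x) with an offset i (from y) and looks up the payment whose month is their sum (from z), which is why the months values have to run 0, 1, 2, ... without gaps.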
DATA Step
A DOW loop is used to count the group size. 2-deep looping crosses the combinations and three point= values are computed (finagled) to retrieve the relevant values.
data want2;
if 0 then set have; * prep pdv to match have;
retain point_end ;
point_start = sum(point_end,0);
do group_count = 1 by 1 until (last.cust_id);
set have(keep=cust_id);
by cust_id;
end;
do index1 = 1 to group_count;
point1 = point_start + index1;
set have (keep=months) point = point1;
do index2 = 0 to group_count - index1 ;
point2 = point_start + index1 + index2;
set have (keep=pmt) point=point2;
point3 = point_start + index2 + 1;
set have (keep=months rename=months=i) point=point3;
output;
end;
end;
point_end = point1;
keep cust_id pmt months i;
run;
Try the following:
data want(drop = start_obs limit j);
retain start_obs 1;
/* read by cust_id group */
do until(last.cust_id);
set have end = last_obs;
by cust_id;
end;
limit = months;
do j = 0 to limit;
i = 0;
do obs_num = start_obs + j to start_obs + limit;
/* read specific observations using direct access */
set have point = obs_num;
months = j;
output;
i = i + 1;
end;
end;
/* prepare for next direct access read */
start_obs = start_obs + limit + 1;
if last_obs then
stop;
run;

Python: max occurence of consecutive days

I have an Input file:
ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300
I'm trying to find maximum count of consecutive days where FEES is 0 for a particular ROLL_NO. If FEES is not equal to zero for consecutive days, max count will be zero for that particular ROLL_NO.
Expected Output:
ID,ROLL_NO,MAX_CNT -- First occurrence of ID for a particular ROLL_NO should come as ID in output
1,12345,3
3,987654,0
This is what I've come up with so far,
import pandas as pd
df = pd.read_csv('I5.txt')
df['COUNT'] = df.groupby(['ROLLNO','ADM_DATE'])['ROLLNO'].transform(pd.Series.value_counts)
print df
But I don't believe this is the right way to approach this.
Could someone help a Python newbie out here?
You can use:
#consecutive groups
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
print (a)
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
8 4
9 5
10 5
dtype: int32
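#keep rows where FEES is 0, count each run, take the max per ROLL_NO, then reindex to add ROLL_NO values with no zero-fee rows
#groupby(level=0).max() is used because Series.max(level=0) was removed in pandas 2.0
mask = df['FEES'].eq(0)
df = (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.groupby(level=0).max()
.reindex(df['ROLL_NO'].unique(), fill_value=0)
.reset_index(name='MAX_CNT'))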
print (df)
ROLL_NO MAX_CNT
0 12345 3
1 987654 0
Explanation:
First compare the FEES column with 0 (eq is the same as ==) and multiply the mask by the ROLL_NO column:
mask = df['FEES'].eq(0)
r = df['ROLL_NO'] * mask
print (r)
0 0
1 0
2 0
3 12345
4 12345
5 0
6 12345
7 12345
8 12345
9 0
10 0
dtype: int64
Get consecutive group ids by comparing r with its shifted copy and taking the cumsum:
a = r.ne(r.shift()).cumsum()
print (a)
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
8 4
9 5
10 5
dtype: int32
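If the shift-and-cumsum trick looks opaque, the same idea can be spelled out in plain Python: mark every position where the value differs from the one before it, then take a running total of those marks. This is just a sketch of the idea using the values of r printed above; the names changed and run_id are made up for the example.

```python
from itertools import accumulate

# values of r from the step above
r = [0, 0, 0, 12345, 12345, 0, 12345, 12345, 12345, 0, 0]

# True where a value differs from its predecessor;
# the first value always starts a new run
changed = [i == 0 or r[i] != r[i - 1] for i in range(len(r))]

# the running total of the change marks gives one id per consecutive run
run_id = list(accumulate(int(c) for c in changed))

print(run_id)
```

Rows that share a run id form one unbroken stretch of equal values, which is what makes the later groupby count consecutive rows.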
Keep only the rows where FEES is 0 and group by ROLL_NO and run id with size; a is filtered by the same mask so the indexes match:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size())
ROLL_NO
12345 2 2
4 3
dtype: int64
Get the max per first level of the MultiIndex:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size().groupby(level=0).max())
ROLL_NO
12345 3
dtype: int64
Finally, add the ROLL_NO values that have no zero-fee rows by reindex:
print (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.groupby(level=0).max()
.reindex(df['ROLL_NO'].unique(), fill_value=0))
ROLL_NO
12345 3
987654 0
dtype: int64
and use reset_index to turn the index into a column.
EDIT:
For the first ID of each ROLL_NO, use drop_duplicates with insert and map:
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
s = df.drop_duplicates('ROLL_NO').set_index('ROLL_NO')['ID']
mask = df['FEES'].eq(0)
df1 = (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.groupby(level=0).max()
.reindex(df['ROLL_NO'].unique(), fill_value=0)
.reset_index(name='MAX_CNT'))
df1.insert(0, 'ID', df1['ROLL_NO'].map(s))
print (df1)
ID ROLL_NO MAX_CNT
0 1 12345 3
1 3 987654 0
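As a cross-check that does not depend on pandas, here is a small sketch with itertools.groupby. It differs slightly from the approach above: it first collects the fees per ROLL_NO and then looks for runs of zeros, so it assumes the rows of each ROLL_NO are already in date order with no missing days. The helper name longest_zero_run is made up for the example.

```python
from itertools import groupby

# (ID, ROLL_NO, FEES) rows from the question, in file order
rows = [
    (1, 12345, 500), (2, 12345, 200), (3, 987654, 1000),
    (4, 12345, 0), (5, 12345, 0), (6, 12345, 100),
    (7, 12345, 0), (8, 12345, 0), (9, 12345, 0),
    (10, 987654, 150), (11, 987654, 300),
]

fees_by_roll = {}
first_id = {}
for id_, roll_no, fees in rows:
    fees_by_roll.setdefault(roll_no, []).append(fees)
    first_id.setdefault(roll_no, id_)  # keeps the first ID seen


def longest_zero_run(fees):
    """Length of the longest stretch of consecutive zero fees (0 if none)."""
    runs = (
        sum(1 for _ in run)
        for is_zero, run in groupby(fees, key=lambda f: f == 0)
        if is_zero
    )
    return max(runs, default=0)


result = [
    (first_id[roll_no], roll_no, longest_zero_run(fees))
    for roll_no, fees in fees_by_roll.items()
]

for row in result:
    print(*row, sep=",")
```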

SAS Dataset : Counting observation that match an IF condition

Here is a very basic question, but I'm unable to find an easy way to do it.
I have a dataset that references different highschools and students :
Highschool Students Sexe
A 1 m
A 2 m
A 3 m
A 4 f
A 5 f
B 1 m
B 2 m
And I'd like to create two new variables that count the number of males and females in each school:
Highschool Students Sexe Nb_m Nb_f
A 1 m 1 0
A 2 m 2 0
A 3 m 3 0
A 4 f 3 1
A 5 f 3 2
B 1 m 1 0
B 2 m 2 0
And I can finally extract the last line with the total that would look like this :
Highschool Students Sexe Nb_m Nb_f
A 5 f 3 2
B 2 m 2 0
Any ideas ?
You can do this in a single PROC SQL step...
Also, I don't think you really need the value of Sexe from the last row.
proc sql ;
create table want as
select Highschool,
sum(case when Sexe = 'f' then 1 else 0 end) as Nb_f,
sum(case when Sexe = 'm' then 1 else 0 end) as Nb_m,
calculated Nb_f + calculated Nb_m as Students
from have /* assumed name of the input dataset */
group by Highschool
order by Highschool ;
quit ;
First you have to sort your dataset by Highschool:
proc sort data = your_dataset;
by Highschool;
run;
then you use
- retain to not reset Nb_m and Nb_f at every record;
- the last.Highschool flag and an output statement to write only the last observation for each school.
data new_dataset;
set your_dataset;
by Highschool;
retain Nb_m Nb_f 0;
if Sexe = 'm' then
Nb_m + 1;
else
Nb_f + 1;
if last.Highschool then do;
Students = Nb_m + Nb_f;
output;
Nb_m = 0;
Nb_f = 0;
end;
run;
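The retain-and-output pattern can also be pictured as a running tally per school. Below is a rough Python sketch of that idea on the sample rows, producing both the intermediate running counts from the question and the final row per school; it is an illustration only, and the names counts, running and totals are made up.

```python
# (Highschool, Students, Sexe) rows from the question
rows = [
    ("A", 1, "m"), ("A", 2, "m"), ("A", 3, "m"), ("A", 4, "f"), ("A", 5, "f"),
    ("B", 1, "m"), ("B", 2, "m"),
]

counts = {}   # running tallies per school, like retained variables
running = []  # one output row per input row
for school, student, sexe in rows:
    tally = counts.setdefault(school, {"m": 0, "f": 0})
    tally[sexe] += 1
    running.append((school, student, sexe, tally["m"], tally["f"]))

# keep only the last row seen for each school
last_row = {row[0]: row for row in running}
totals = list(last_row.values())

for row in totals:
    print(*row)
```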

subset of dataset using first and last in sas

Hi, I am trying to subset a dataset which has the following:
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to keep only those IDs that are recorded 4 times.
I wrote this:
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
set BU;
by ID;
retain keepMe;
If first.ID THEN DO;
IF count = 4 THEN keepMe=1;
ELSE keepMe=0;
END;
if keepMe=0 THEN DELETE;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
If I understand correctly, you are trying to extract all observations that are repeated 4 times or more. If so, your use of last.count and first.count is wrong. last.var is a boolean that flags the last observation in each BY group. Have a look at Tim's suggestion.
In order to extract all observations that are repeated four times or more, I would suggest to use the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
Slight variation on Tim's answer, assuming you don't necessarily have the count variable.
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;
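The GROUP BY ... HAVING Count(ID) >= 4 idea amounts to counting rows per ID first and then keeping every row whose ID reaches the threshold. A minimal Python sketch of that two-pass logic on the sample data, with made-up names bu, size and an:

```python
from collections import Counter

# (ID, sal, count) rows from the question
bu = [
    (1, 10, 1), (1, 10, 2), (1, 10, 3), (1, 10, 4),
    (2, 20, 1), (2, 20, 2), (2, 20, 3),
    (3, 30, 1), (3, 30, 2), (3, 30, 3), (3, 30, 4),
]

# first pass: number of rows per ID
size = Counter(id_ for id_, _, _ in bu)

# second pass: keep all rows of IDs recorded at least 4 times
an = [row for row in bu if size[row[0]] >= 4]

for row in an:
    print(*row)
```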