Stata: add values onto existing values

year
0
1
6
....
(omit)
....
77
90
....
(omit)
....
The "year" is a numeric variable. I need to add "200" before the 1-digit values, and "19" before the 2-digit values.
year
2000
2001
2006
....
1977
1990
....
How can I do this in Stata?

Be careful: the variable might be stored as a byte, and that will bite. Creating a new variable sidesteps the storage type of the original.
This should work:
gen year2 = cond(year < 10, 2000 + year, 1900 + year)
tab year2
If year2 looks good,
drop year
rename year2 year
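For anyone who wants to sanity-check the rule outside Stata, here is a minimal Python sketch of the same cond() logic. The function name expand_year and the use of None for a missing value are assumptions made for illustration only; they are not part of the answer above.

```python
from typing import Optional


def expand_year(year: Optional[int]) -> Optional[int]:
    """Mirror cond(year < 10, 2000 + year, 1900 + year) for abbreviated years."""
    if year is None:  # assumption: None stands in for a missing value
        return None
    return 2000 + year if year < 10 else 1900 + year


abbreviated = [0, 1, 6, 77, 90]
expanded = [expand_year(y) for y in abbreviated]
print(expanded)
```

The rule only makes sense if one-digit values always mean 200x and two-digit values always mean 19xx, as the question states.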

Related

SAS problem: sum up rows and divide till it reach a specific value

I have the following problem: I would like to build a running sum of a column and divide each running subtotal by the sum of the whole column until a specific value is reached. In pseudocode it would look like this:
data;
set auto;
sum_of_whole_column = sum(price);
subtotal[i] = 0;
i =1;
do until (subtotal[i] = 70000)
subtotal[i] = (subtotal[i] + subtotal[i+1])/sum_of_whole_column
i = i+1
end;
run;
I get the error that I haven't defined an array... So can I use something else instead of subtotal[i]? And how can I put a column into an array? I tried the following, but it doesn't work (auto is the data set and price is the column I want to put into an array):
data invent_array;
set auto;
array price_array {1} price;
run;
EDIT: maybe the dataset I used is helpful :)
DATA auto ;
LENGTH make $ 20 ;
INPUT make $ 1-17 price mpg rep78 ;
CARDS;
AMC Concord 4099 22 3
AMC Pacer 4749 17 3
Audi 5000 9690 17 5
Audi Fox 6295 23 3
BMW 320i 9735 25 4
Buick Century 4816 20 3
Buick Electra 7827 15 4
Buick LeSabre 5788 18 3
Cad. Eldorado 14500 14 2
Olds Starfire 4195 24 1
Olds Toronado 10371 16 3
Plym. Volare 4060 18 2
Pont. Catalina 5798 18 4
Pont. Firebird 4934 18 1
Pont. Grand Prix 5222 19 3
Pont. Le Mans 4723 19 3
;
RUN;
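Perhaps I am missing your point, but your subtotal will never be equal to 70,000 if you divide by the sum of its column: the maximum value will be 1. Your running sum, however, can reach or exceed 70,000. The code below keeps the rows whose running sum is still below 70,000 and then divides each running sum by the sum of price over those retained rows, which is why the last subtotal is exactly 1.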
data stage1;
retain _sum 0;
set auto;
_sum = sum(_sum, price);
if _sum < 70000 then output;
run;
proc sql;
create table want as
select t1.*, t1._sum/sum(price) as subtotal
from stage1 as t1;
quit;
subtotal
0.0607268256
0.1310834235
0.2746411058
0.3679017467
0.5121261056
0.5834753107
0.6994325842
0.7851820027
1
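The arithmetic behind that output can be checked with a short Python sketch. It uses the price column from the question; the variable names (running, kept, kept_total, share_of_all) are my own and only illustrative. It also shows what the figures would be if you divided by the sum of the whole column instead, which is what the question literally asked for.

```python
from itertools import accumulate

# price column of the auto data set from the question, in input order
prices = [4099, 4749, 9690, 6295, 9735, 4816, 7827, 5788,
          14500, 4195, 10371, 4060, 5798, 4934, 5222, 4723]
LIMIT = 70000

# Running sum of price, like the retained _sum in the data step.
running = list(accumulate(prices))

# Keep only the rows whose running sum is still below the limit.
kept = [(p, s) for p, s in zip(prices, running) if s < LIMIT]

# Divide by the sum of price over the kept rows (what PROC SQL sees in stage1).
kept_total = sum(p for p, _ in kept)
subtotal = [s / kept_total for _, s in kept]

# For comparison: running sum as a share of the whole column.
share_of_all = [s / sum(prices) for _, s in kept]

for value in subtotal:
    print(f"{value:.10f}")
```

Nine rows survive the filter, their prices add up to 67499, and the last ratio is 1 by construction. Dividing by the full column total (106802) would make the last kept share fall short of 1.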

How to count no of days id is having positive balance and negative balance in SAS

My data is as following:
id balance date
1 10 02Mar2018
1 12 05Mar2018
1 -15 07Mar2018
1 14 14Mar2018
1 -25 25Mar2018
Now I want the number of days in March that id 1 had a positive balance and the number of days it had a negative balance.
For example, the positive days are counted as follows: 01Mar to 06Mar, because the first negative entry came on 07Mar, so that is 6 days.
Then the balance was positive again from 14Mar to 24Mar, which is 11 days,
so in total it was positive for 6 + 11 = 17 days.
The negative-balance days are counted similarly.
I tried using following code:
DATA B;
SET A ;
BY ID;
IF FIRST.ID THEN Y=DATE;
RETAIN Y;
ELSE Y=INTCK('day',DATE,Y);
RUN;
But couldn't get the exact results.
Any help will be appreciated.
Assuming your data is sorted by id and date.
First do a 'look-ahead' merge (to get the next date) :
data lookahead ;
merge have
have (firstobs=2 rename=(date=nextdate id=nextid)) ;
if id ^= nextid then call missing(nextdate) ;
drop nextid ;
run ;
/* data now looks like this */
id balance date nextdate
1 10 02Mar2018 05Mar2018
1 12 05Mar2018 07Mar2018
1 -15 07Mar2018 14Mar2018
1 14 14Mar2018 25Mar2018
1 -25 25Mar2018
Then, expand out the missing dates, dealing with instances where the first date per id isn't the 1st of a month, and the last record per id isn't the last day of the month :
data expand ;
set lookahead (rename=(date=thisdate)) ;
by id ;
if first.id and day(thisdate) ^= 1 then do ;
/* loop from 1st of month to day before date, output new record for each date */
do date = intnx('month',thisdate,0,'b') to thisdate - 1 ;
output ;
end ;
end ;
/* output the input record */
date = thisdate ; output ;
/* output dates up to the next date */
if nextdate > thisdate + 1 then do ;
do date = thisdate + 1 to nextdate - 1 ;
output ;
end ;
end ;
else
/* last record for id, loop to end of month */
if missing(nextdate) and thisdate ^= intnx('month',thisdate,0,'end') then do ;
do date = thisdate + 1 to intnx('month',thisdate,0,'end') ;
output ;
end ;
end ;
drop thisdate nextdate ;
format date date9. ;
run ;
/* data now looks like this */
id balance date
1 10 01Mar2018
1 10 02Mar2018
1 10 03Mar2018
1 10 04Mar2018
1 12 05Mar2018
1 12 06Mar2018
1 -15 07Mar2018
1 -15 08Mar2018
... etc ...
1 -15 13Mar2018
1 14 14Mar2018
1 14 15Mar2018
... etc ...
1 14 24Mar2018
1 -25 25Mar2018
... etc ...
1 -25 31Mar2018
It should now be relatively easy to flag the values accordingly, and count them up per id/month.
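To check the expected counts for the sample data, the expand-and-count idea can be sketched in Python. This is not a translation of the SAS steps, just the same logic under two assumptions taken from the answer: days before the first record inherit the first record's balance, and days after the last record carry its balance to the end of the month. The function name count_days is hypothetical.

```python
from calendar import monthrange
from datetime import date, timedelta

# (balance, date) records for id 1, sorted by date, as in the question
records = [(10, date(2018, 3, 2)), (12, date(2018, 3, 5)),
           (-15, date(2018, 3, 7)), (14, date(2018, 3, 14)),
           (-25, date(2018, 3, 25))]


def count_days(records):
    """Return (positive_days, negative_days) for the month of the first record."""
    first = records[0][1]
    month_end = first.replace(day=monthrange(first.year, first.month)[1])
    positive = negative = 0
    day = first.replace(day=1)
    idx = 0
    while day <= month_end:
        # Move to the latest record dated on or before this day. Days before
        # the first record keep idx = 0, i.e. the first record's balance.
        while idx + 1 < len(records) and records[idx + 1][1] <= day:
            idx += 1
        balance = records[idx][0]
        if balance > 0:
            positive += 1
        elif balance < 0:
            negative += 1
        day += timedelta(days=1)
    return positive, negative


print(count_days(records))
```

For the sample data this gives 17 positive days, matching the 6 + 11 in the question, and 14 negative days (07Mar to 13Mar plus 25Mar to 31Mar).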

How can I select the first and last week of each month in SAS?

I have monthly data with several observations per day. I have day, month and year variables. How can I retain data from only the first and the last 5 days of each month? I have only weekdays in my data, so the first and last five days of the month change from month to month, i.e. for Jan 2008 the first five days can be the 2nd, 3rd, 4th, 7th and 8th of the month.
Below is an example of the data file. I wasn't sure how to share this so I just copied some lines below. This is from Jan 2, 2008.
Would a variation of first.variable and last.variable work? How can I retain observations from the first 5 days and last 5 days of each month?
Thanks.
1 AA 500 B 36.9800 NH 2 1 2008 9:10:21
2 AA 500 S 36.4500 NN 2 1 2008 9:30:41
3 AA 100 B 36.4700 NH 2 1 2008 9:30:43
4 AA 100 B 36.4700 NH 2 1 2008 9:30:48
5 AA 50 S 36.4500 NN 2 1 2008 9:30:49
If you want to examine the data and determine the minimum 5 and maximum 5 values then you can use PROC SUMMARY. You could then merge the result back with the data to select the records.
So if your data has variables YEAR, MONTH and DAY you can make a new data set that has the top and bottom five days per month using simple steps.
proc sort data=HAVE (keep=year month day) nodupkey
out=ALLDAYS;
by year month day;
run;
proc summary data=ALLDAYS nway;
class year month;
output out=MIDDLE
idgroup(min(day) out[5](day)=min_day)
idgroup(max(day) out[5](day)=max_day)
/ autoname ;
run;
proc transpose data=MIDDLE out=DAYS (rename=(col1=day));
by year month;
var min_day: max_day: ;
run;
proc sql ;
create table WANT as
select a.*
from HAVE a
inner join DAYS b
on a.year=b.year and a.month=b.month and a.day = b.day
;
quit;
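The idea in the steps above (reduce to distinct days, take the five smallest and five largest per month, join back) can be sketched in Python as follows. The function name first_last_days and the tuple layout are assumptions for illustration, and the list of January 2008 trading days is hypothetical sample data chosen so that the first five days match the ones mentioned in the question.

```python
from collections import defaultdict


def first_last_days(rows, n=5):
    """Keep rows whose day is among the n earliest or n latest days seen in its month.

    rows is a list of (year, month, day, payload) tuples.
    """
    # Distinct days present per (year, month), like the NODUPKEY sort.
    days_by_month = defaultdict(set)
    for year, month, day, _ in rows:
        days_by_month[(year, month)].add(day)

    # The n smallest and n largest days per month, like the two IDGROUPs.
    keep = {}
    for key, days in days_by_month.items():
        ordered = sorted(days)
        keep[key] = set(ordered[:n]) | set(ordered[-n:])

    # Inner join back to the data.
    return [r for r in rows if r[2] in keep[(r[0], r[1])]]


# Hypothetical trading days for January 2008, two observations per day.
trading_days = [2, 3, 4, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18,
                22, 23, 24, 25, 28, 29, 30, 31]
rows = [(2008, 1, d, obs) for d in trading_days for obs in ("a", "b")]

kept = first_last_days(rows)
kept_days = sorted({r[2] for r in kept})
print(kept_days)
```

Because the selection is driven by the days that actually occur in the data, holidays and other gaps are handled without any calendar logic.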
/****
get some dates to play with
****/
data dates(keep=i thisdate);
offset = input('01Jan2015',DATE9.);
do i=1 to 100;
thisdate = offset + round(599*ranuni(1)+1); *** within 600 days from offset;
output;
end;
format thisdate date9.;
run;
/****
BTW: intnx('month',thisdate,1) = first day of next month. Deduct 1 to get the last day
of the current month.
intnx('month',thisdate,0,"BEGINNING") = first day of the current month
****/
proc sql;
create table first5_last5 AS
SELECT
*
FROM
dates /* replace with name of your data set */
WHERE
/* replace all occurrences of 'thisdate' with name of your date variable */
( intnx('month',thisdate,1)-5 <= thisdate <= intnx('month',thisdate,1)-1 )
OR
( intnx('month',thisdate,0,"BEGINNING") <= thisdate <= intnx('month',thisdate,0,"BEGINNING")+4 )
ORDER BY
thisdate;
quit;
Create some data with the desired structure:
Data inData (drop=_:); * forget all variables starting with an underscore*;
format date yymmdd10. time time8.;
_instant = datetime();
do _i = 1 to 1E5;
date = datepart(_instant);
time = timepart(_instant);
yy = year(date);
mm = month(date);
dd = day(date);
*just some more random data*;
letter = byte(rank('a') +floor(rand('uniform', 0, 26)));
*select week days*;
if weekday(date) in (2,3,4,5,6) then output;
_instant = _instant + 1E5*rand('exponential');
end;
run;
Count the days per month:
proc sql;
create view dayCounts as
select yy, mm, count(distinct dd) as _countInMonth
from inData
group by yy, mm;
quit;
Select the days:
data first_5(drop=_:) last_5(drop=_:);
merge inData dayCounts;
by yy mm;
_newDay = dif(date) ne 0;
retain _nrInMonth;
if first.mm then _nrInMonth = 1;
else if _newDay then _nrInMonth + 1;
if _nrInMonth le 5 then output first_5;
if _nrInMonth gt _countInMonth - 5 then output last_5;
run;
Use the INTNX() function. You can use INTNX('month',...) to find the beginning and ending days of the month and then use INTNX('weekday',...) to find the first 5 week days and last five week days.
You can convert your month, day, year values into a date using the MDY() function. Let's assume that you do that and create a variable called TODAY. Then to test if it is within the first 5 weekdays or last 5 weekdays of the month you could do something like this:
first5 = intnx('weekday',intnx('month',today,0,'B'),0) <= today
<= intnx('weekday',intnx('month',today,0,'B'),4) ;
last5 = intnx('weekday',intnx('month',today,0,'E'),-4) <= today
<= intnx('weekday',intnx('month',today,0,'E'),0) ;
Note that those ranges will include the week-ends, but it shouldn't matter if your data doesn't have those dates.
But you might have issues if your data skips holidays.
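The holiday caveat is easy to see with January 2008. The Python sketch below uses a plain Monday-to-Friday calendar rule; it does not reproduce the exact semantics of INTNX('weekday',...), and the function name calendar_weekdays is an assumption for illustration.

```python
from calendar import monthrange
from datetime import date


def calendar_weekdays(year, month):
    """All Monday-to-Friday dates in the given month, in order."""
    last = monthrange(year, month)[1]
    days = [date(year, month, d) for d in range(1, last + 1)]
    return [d for d in days if d.weekday() < 5]  # Monday=0 ... Friday=4


jan = calendar_weekdays(2008, 1)
first5 = [d.day for d in jan[:5]]
last5 = [d.day for d in jan[-5:]]
print(first5, last5)
```

A calendar rule picks the 1st, 2nd, 3rd, 4th and 7th as the first five weekdays, while the question says the first five days in the data are the 2nd, 3rd, 4th, 7th and 8th. With a calendar-based window the 8th would be left out, so a data-driven count of days is safer when holidays are missing from the data.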

How to collapse numbers with same identifier but different date, but preserve the date of first observation for each identifier

I have a dataset that can be simplified in the following format:
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
In the dataset, there is Date, ID, VarA, and VarB. Each ID represents a unique set of transactions. I want to collapse (sum) VarA VarB, by(Date) in Stata. However, I want to keep the date of the first observation for each ID number.
Essentially, I want the above dataset to become the following:
+--------------------------------+
| Date ID VarA VarB |
|--------------------------------|
| 12jan2010 5 21 42 |
| 12jan2010 6 41 17 |
| 15jan2010 10 7 68 |
+--------------------------------+
The observations for ID 6 are dated 12jan2010, 17jan2010 and 19jan2010, so I want to collapse (sum) VarA and VarB over these three observations. I want to keep the date 12jan2010 because it is the date of the first observation. The other two observations are dropped.
I know it might be possible to collapse by ID first and then merge with the original dataset and then subset. I was wondering if there is an easier way to make this work. Thanks!
collapse allows you to calculate a variety of statistics, so you can convert your string date into a numerical date, then take the minimum of the numerical date to get the first occurrence.
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY")
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
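As a cross-check of that result, here is the same sum-and-minimum-date aggregation sketched in Python on the example data. The function name collapse_by_id is hypothetical, and parsing the month abbreviation with strptime assumes an English locale.

```python
from datetime import datetime

# (Date, ID, VarA, VarB), same rows as the example data
rows = [("12jan2010", 5, 21, 42), ("12jan2010", 6, 47, 21),
        ("15jan2010", 10, 7, 68), ("17jan2010", 6, -5, -3),
        ("19jan2010", 6, -1, -1)]


def collapse_by_id(rows):
    """Sum VarA and VarB per ID and keep the earliest date per ID."""
    out = {}
    for text, id_, a, b in rows:
        d = datetime.strptime(text, "%d%b%Y").date()  # string -> real date
        if id_ not in out:
            out[id_] = [d, a, b]
        else:
            out[id_][0] = min(out[id_][0], d)  # (min) Date2
            out[id_][1] += a                   # (sum) VarA
            out[id_][2] += b                   # (sum) VarB
    return {k: tuple(v) for k, v in sorted(out.items())}


collapsed = collapse_by_id(rows)
for id_, (d, a, b) in collapsed.items():
    print(d.strftime("%d%b%Y").lower(), id_, a, b)
```

The key point is the same as in the Stata code: the minimum only means "first occurrence" once the date is a number (or a real date object) rather than a string.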
In response to the comment: You can generate the formatted date for only observations where VarA is > 0 (and not missing). (Assuming that, per your comment, VarA & VarB always have the same sign.)
// now assume ID 6 has an earliest date of 17jan2005 (obs.4)
// but you want to return your 'first date' as the
// first date where varA & varB are both positive
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2005" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY") if VarA > 0 & !missing(VarA)
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
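The conditional version can be checked the same way. The sketch below, with the hypothetical name first_positive_date, takes the earliest date per ID only over rows where VarA is positive. The explicit !missing(VarA) in the Stata code matters because Stata treats numeric missing values as larger than any number, so VarA > 0 alone would be true for a missing VarA; here None plays the role of a missing value, which is an assumption of this sketch.

```python
from datetime import datetime

# (Date, ID, VarA, VarB), with the 17jan2005 row from the second example
rows = [("12jan2010", 5, 21, 42), ("12jan2010", 6, 47, 21),
        ("15jan2010", 10, 7, 68), ("17jan2005", 6, -5, -3),
        ("19jan2010", 6, -1, -1)]


def first_positive_date(rows):
    """Earliest date per ID among rows where VarA is positive.

    IDs with no positive VarA are simply absent from the result.
    """
    first = {}
    for text, id_, a, _ in rows:
        if a is None or a <= 0:  # skip missing and non-positive VarA
            continue
        d = datetime.strptime(text, "%d%b%Y").date()
        if id_ not in first or d < first[id_]:
            first[id_] = d
    return first


first_dates = first_positive_date(rows)
print(first_dates[6])
```

For ID 6 this returns 12 January 2010 even though the earliest row overall is dated 17jan2005, because that row has a negative VarA.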

Stata egen combined with if

I have data like this
year month X Y weight
2013 1 1 0 1000
2001 12 0 1 2000
I want to create a variable Z based on the X and Y variables, conditional on year. I have two formulas for year before and after 2002. If I use egen with if,
if year > 2002 {
bysort year month :egen Z= total( x*weight)
}
else {
bysort year month : egen Z= total(y*weight*0.5)
}
This code is not going to work, because for year < 2002 Stata would report that z has already been created. Is there any way to achieve the goal?
I used a very crude, brute-force way to solve this problem: I created two variables for z, namely z and z_2002, and then replaced z with z_2002 if the year is less than 2002.
If I understand correctly, this should work.
Compute the products in a first step (conditional on the year) and the sums in a second step.
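As other answers already note, there's a difference between the if qualifier and the if programming command: the if command evaluates its expression once (using the first observation when a variable is named), whereas the if qualifier is applied observation by observation. There's a short FAQ on this: http://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/.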
(I use code provided by @NickCox in a comment to another answer.)
clear all
set more off
*----- example data -----
input year month x y weight
2013 1 1 0 1000
2013 1 1 0 800
2013 2 0 1 1200
2013 2 1 0 1400
2001 12 1 0 1500
2001 12 0 1 2000
2001 11 1 1 4000
end
sort year month
list, sepby(year month)
*----- computations -----
gen Z = cond(year > 2002, x * weight, y * weight * 0.5)
bysort year month: egen totZ = total(Z) // already sorted so -by- should be enough
list, sepby(year month)
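The two-step logic (row-level value chosen by year, then a group total) can be verified on the example data with a short Python sketch. The names row_value, totals and tot_z are mine and only illustrative.

```python
from collections import defaultdict

rows = [  # (year, month, x, y, weight), same as the example data
    (2013, 1, 1, 0, 1000), (2013, 1, 1, 0, 800),
    (2013, 2, 0, 1, 1200), (2013, 2, 1, 0, 1400),
    (2001, 12, 1, 0, 1500), (2001, 12, 0, 1, 2000),
    (2001, 11, 1, 1, 4000),
]


def row_value(year, x, y, weight):
    """Step 1: pick the formula per observation, like cond() in the gen line."""
    return x * weight if year > 2002 else y * weight * 0.5


# Step 2: total per (year, month), like egen ... total() under bysort.
totals = defaultdict(float)
for year, month, x, y, weight in rows:
    totals[(year, month)] += row_value(year, x, y, weight)

# Attach the group total to every row, as egen does.
tot_z = [totals[(r[0], r[1])] for r in rows]
print(tot_z)
```

The choice between the two formulas is made row by row, which is what the if qualifier or cond() gives you and what the if command does not.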
clear
input year month x y weight
2013 1 1 0 1000
2001 12 0 1 2000
end
preserve
keep if year>2002
bysort year month :egen z= total(x*weight)
tempfile t1
save `t1'
restore
keep if year<=2002
bysort year month : egen z= total(y*weight*0.5)
append using `t1'
list