How can I add data into a SAS table of different groups? - sas

The table's primary key is car, model, and date. I need to fill in the empty fields with the value from the previous row, where the key used for the fill is car and model.
Example:
Row Car Model Date Sec Door Colour
1 Ford Focus 2002 1 5 blue
2 Ford Focus 2002 2 5 blue
3 Ford Focus 2002 3 5 blue
4 Ford Focus 2002 4 5 blue
5 Ford kuga 2004 5 5 blue
6 Ford kuga 2004 1 5
7 Ford kuga 2004 2 5
8 Ford Mondeo 2004 3 5 red
9 Ford Mondeo 2004 4 4 red
10 Ford Mondeo 2004 5 red
11 Ford Mondeo 2004 6 red
12 Ford Mondeo 2004 7 4 red
13 Mercedes Benz 2010 1 3
14 Mercedes Benz 2010 1 3 white
15 Mercedes Benz 2010 1 5 Yellow
16 Mercedes 190E 2011 1 red
17 Mercedes 190E 2012 1 6
And the final output of the table is ...
Output:
Row Car Model Date Sec Door Colour
1 Ford Focus 2002 1 5 blue
2 Ford Focus 2002 2 5 blue
3 Ford Focus 2002 3 5 blue
4 Ford Focus 2002 4 5 blue
5 Ford kuga 2004 5 5 blue
6 Ford kuga 2004 1 5 blue
7 Ford kuga 2004 2 5 blue
8 Ford Mondeo 2004 3 5 red
9 Ford Mondeo 2004 4 4 red
10 Ford Mondeo 2004 5 4 red
11 Ford Mondeo 2004 6 4 red
12 Ford Mondeo 2004 7 4 red
13 Mercedes Benz 2010 1 3 red
14 Mercedes Benz 2010 1 3 white
15 Mercedes Benz 2010 1 5 Yellow
16 Mercedes 190E 2011 1 5 red
17 Mercedes 190E 2012 1 6 red
How is it done? Thank you

The UPDATE trick will work to produce the output you show.
data cars;
retain dummyby 1;
infile cards firstobs=2;
input row car $ model $ date sec door colour $;
cards;
Row Car Model Date Sec Door Colour
1 Ford Focus 2002 1 5 blue
2 Ford Focus 2002 2 5 blue
3 Ford Focus 2002 3 5 blue
4 Ford Focus 2002 4 5 blue
5 Ford kuga 2004 5 5 blue
6 Ford kuga 2004 1 5 .
7 Ford kuga 2004 2 5 .
8 Ford Mondeo 2004 3 5 red
9 Ford Mondeo 2004 4 4 red
10 Ford Mondeo 2004 5 . red
11 Ford Mondeo 2004 6 . red
12 Ford Mondeo 2004 7 4 red
13 Mercedes Benz 2010 1 3 .
14 Mercedes Benz 2010 1 3 white
15 Mercedes Benz 2010 1 5 Yellow
16 Mercedes 190E 2011 1 . red
17 Mercedes 190E 2012 1 6 .
;;;;
run;
data locf;
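/* Empty master (obs=0): every row comes from the transaction copy, and UPDATE */
/* never overwrites a value with a missing one, so values carry forward. */
update cars(obs=0) cars;
by dummyby; /* Use BY CAR; instead to LOCF within each car. */
output; /* write every row, not just the last one per BY group */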
drop dummyby;
run;
proc print;
run;
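For readers who think in pandas, the difference between carrying values forward over the whole table (the dummy BY variable) and carrying them forward within each car (BY CAR) can be sketched as below. This is an analogous technique using ffill, not the SAS UPDATE method itself, and the small cars frame is made-up sample data in the shape of the question's table.

```python
import pandas as pd

# Hypothetical sample shaped like the question's table (None marks an empty field).
cars = pd.DataFrame({
    "car":    ["Ford", "Ford", "Ford", "Mercedes", "Mercedes"],
    "model":  ["kuga", "kuga", "Mondeo", "Benz", "Benz"],
    "door":   [5, 5, None, 3, 3],
    "colour": ["blue", None, "red", None, "white"],
})

# Carry the last non-missing value forward over the whole table,
# like the dummy BY variable that puts every row in one group.
filled_all = cars.ffill()

# Carry forward only within each car, like BY CAR.
filled_by_car = cars.copy()
filled_by_car[["door", "colour"]] = cars.groupby("car")[["door", "colour"]].ffill()
```

With the grouped version, the first Mercedes row keeps its missing colour because there is no earlier Mercedes row to copy from; the ungrouped version takes "red" from the Ford row above it, which is what the output requested in the question does.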

Related

Keep individuals in the same firm by year (Stata)

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep Id 1 and 2 in both years, because they are in the same firm in both years of the sample, and Id 3 and 4 for 2010 only.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
Any suggestions on how to perform this in Stata?
Regards,
bysort Id (Firm_id) : keep if Firm_id[1] == Firm_id[_N]
See FAQ here.

Filter specific observations

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep only the individuals corresponding to Id 1 and 2, because they are in the same firm in both years of the sample. Individual 3 in 2011 and individual 4 in 2011 would be dropped.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
This works for your data example:
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end
bysort Year Firm_id : keep if Id[1] != Id[_N]
sort Id Year
list

Filter Specific Data in Stata

I'm using Stata 13 and have to clean a data set in a panel format with different ids for a given period from 2000 to 2003. My data looks like:
id year ln_wage
1 2000 2.30
1 2001 2.31
1 2002 2.31
2 2001 1.89
2 2002 1.89
2 2003 2.10
3 2002 1.60
4 2002 2.46
4 2003 2.47
5 2000 2.10
5 2001 2.10
5 2003 2.12
For each year, I would like to keep only the individuals who also appear in the previous year (t-1). As a result, the first year of my sample (2000) will be dropped. I'm looking for output like:
2001:
id year ln_wage
1 2001 2.31
5 2001 2.10
2002:
id year ln_wage
1 2002 2.31
2 2002 1.89
2003:
id year ln_wage
2 2003 2.10
4 2003 2.47
Regards,
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int year float ln_wage
1 2000 2.3
1 2001 2.31
1 2002 2.31
2 2001 1.89
2 2002 1.89
2 2003 2.1
3 2002 1.6
4 2002 2.46
4 2003 2.47
5 2000 2.1
5 2001 2.1
5 2003 2.12
end
xtset id year
drop if missing(L.ln_wage)
sort year id
list, noobs sepby(year)
+---------------------+
| id   year   ln_wage |
|---------------------|
|  1   2001      2.31 |
|  5   2001       2.1 |
|---------------------|
|  1   2002      2.31 |
|  2   2002      1.89 |
|---------------------|
|  2   2003       2.1 |
|  4   2003      2.47 |
+---------------------+
// Alternatively, assuming no duplicate years within id exist
bysort id (year): gen todrop = year[_n-1] != year - 1
drop if todrop
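The same "keep a row only if the previous year exists for that id" rule can also be sketched in pandas. This is a rough analogue of the bysort alternative above, not a Stata feature, and like that alternative it assumes there are no duplicate years within an id.

```python
import pandas as pd

df = pd.DataFrame({
    "id":      [1, 1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5],
    "year":    [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2002, 2003, 2000, 2001, 2003],
    "ln_wage": [2.30, 2.31, 2.31, 1.89, 1.89, 2.10, 1.60, 2.46, 2.47, 2.10, 2.10, 2.12],
})

# Sort so that, within each id, the row above is the previous observation.
df = df.sort_values(["id", "year"])

# Year of the previous observation for the same id (NaN for the first one).
prev_year = df.groupby("id")["year"].shift(1)

# Keep rows whose previous observation is exactly one year earlier.
kept = df[prev_year == df["year"] - 1].sort_values(["year", "id"])
```

The first observation of each id has no previous year, so the comparison is false and the row is dropped, which mirrors how year[_n-1] is missing for the first observation in Stata.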

How to add a factor/group variable to line plot in Stata

I would like to have a line plot of a continuous variable over time using xtline and overlay a scatterplot or label for each data point indicating a group membership at this point.
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(id year group variable)
101 2003 3 12
102 2003 2 10
102 2005 1 10
102 2007 2 10
102 2009 1 10
102 2011 2 10
103 2003 4 3
103 2005 2 1
104 2003 4 50
105 2003 4 8
105 2005 4 12
105 2007 4 12
105 2009 4 12
106 2003 1 6
106 2005 1 28
106 2007 2 15
106 2009 2 4
106 2011 3 4
106 2015 1 2
106 2017 1 2
end
xtset id year
xtline variable, overlay
Here I added/marked/labelled groups of id 103.
I have four groups, which I hope can be shown in the legend as well.
Solutions
preserve
separate variable, by(id) veryshortlabel
line variable101-variable106 year ///
|| scatter variable year, ///
mla(group) ms(none) mlabc(black) ytitle(variable)
restore
Alternatively
xtline variable, overlay addplot(scatter variable year, mlabel(group))
I recommend direct labelling here. It is likely to yield a slightly messy graph, but your own example is already messy and will only get worse if you add more details.
Here is a reproducible example.
webuse grunfeld, clear
set scheme s1color
separate invest, by(company) veryshortlabel
line invest1-invest10 year , ysc(log) ///
|| scatter invest year if year == 1954, ///
mla(company) ms(none) mlabc(black) legend(off) yla(1 10 100 1000, ang(h)) ytitle(investment)
EDIT:
In your example two identifiers are present only for single years. To show some technique for line plots with panel data, I focus on the others.
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(id year group variable)
101 2003 3 12
102 2003 2 10
102 2005 1 10
102 2007 2 10
102 2009 1 10
102 2011 2 10
103 2003 4 3
103 2005 2 1
104 2003 4 50
105 2003 4 8
105 2005 4 12
105 2007 4 12
105 2009 4 12
106 2003 1 6
106 2005 1 28
106 2007 2 15
106 2009 2 4
106 2011 3 4
106 2015 1 2
106 2017 1 2
end
bysort id : gen include = _N > 1
ssc install fabplot
set scheme s1color
fabplot line variable year if include, xla(2003 " 2003" 2010 2017 "2017 ") by(id) frontopts(lw(thick)) xtitle("")

Extract weeks from datetime (Python Pandas)

I have a dataframe:
time year month
0 12/28/2013 0:17 2013 12
1 12/28/2013 0:20 2013 12
2 12/28/2013 0:26 2013 12
3 12/29/2013 0:20 2013 12
4 12/29/2013 0:26 2013 12
5 12/30/2013 0:31 2013 12
6 12/30/2013 0:31 2013 12
7 12/31/2013 0:17 2013 12
8 12/31/2013 0:20 2013 12
9 12/31/2013 0:26 2013 12
10 1/1/2014 4:30 2014 1
11 1/1/2014 4:34 2014 1
12 1/1/2014 4:37 2014 1
13 1/2/2014 4:30 2014 1
14 1/2/2014 5:30 2014 1
15 1/3/2014 4:30 2014 1
16 1/3/2014 4:34 2014 1
17 1/3/2014 4:37 2014 1
18 1/4/2014 4:30 2014 1
19 1/4/2014 4:34 2014 1
20 1/4/2014 4:37 2014 1
I use the following code to extract the week information:
df['week'] = df['time'].dt.week
This makes the dataframe as following:
time year month week
0 2013-12-28 00:17:00 2013 12 52
1 2013-12-28 00:20:00 2013 12 52
2 2013-12-28 00:26:00 2013 12 52
3 2013-12-29 00:20:00 2013 12 52
4 2013-12-29 00:26:00 2013 12 52
5 2013-12-30 00:31:00 2013 12 1
6 2013-12-30 00:31:00 2013 12 1
7 2013-12-31 00:17:00 2013 12 1
8 2013-12-31 00:20:00 2013 12 1
9 2013-12-31 00:26:00 2013 12 1
10 2014-01-01 04:30:00 2014 1 1
11 2014-01-01 04:34:00 2014 1 1
12 2014-01-01 04:37:00 2014 1 1
13 2014-01-02 04:30:00 2014 1 1
14 2014-01-02 05:30:00 2014 1 1
15 2014-01-03 04:30:00 2014 1 1
16 2014-01-03 04:34:00 2014 1 1
17 2014-01-03 04:37:00 2014 1 1
18 2014-01-04 04:30:00 2014 1 1
19 2014-01-04 04:34:00 2014 1 1
20 2014-01-04 04:37:00 2014 1 1
I would like to create another column showing year-week (e.g., 2013-52, 2014-1). The problem is when I combine two columns (year, week) in rows 5 through 9, the result is 2013-1 saying the first week of 2013. This is not correct. Is there any solution for this issue?
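Use dt.strftime (the format codes are listed at http://strftime.org/). %W is the week number of the year with Monday as the first day of the week; days before the year's first Monday fall in week 00.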
df.time.dt.strftime('%Y-%W')
0 2013-51
1 2013-51
2 2013-51
3 2013-51
4 2013-51
5 2013-52
6 2013-52
7 2013-52
8 2013-52
9 2013-52
10 2014-00
11 2014-00
12 2014-00
13 2014-00
14 2014-00
15 2014-00
16 2014-00
17 2014-00
18 2014-00
19 2014-00
20 2014-00
Name: time, dtype: object
As @TrigonaMinima pointed out, ISO 8601 (which dt.week follows) defines the first week of the year as follows:
It is the first week with a majority (4 or more) of its days in January
In your case, week = 1 has 2 days in December and the rest in January, thus fitting the definition of the first week.
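If the goal is a year-week label that follows the ISO definition, one option is to take the year from the ISO calendar as well, so that 30 and 31 December 2013 are labelled as week 1 of 2014 rather than week 1 of 2013. A minimal sketch, assuming pandas 1.1 or later (where Series.dt.isocalendar() is available) and using three rows from the question as sample data:

```python
import pandas as pd

# Three timestamps taken from the question's dataframe.
df = pd.DataFrame({"time": ["12/28/2013 0:17", "12/30/2013 0:31", "1/1/2014 4:30"]})
df["time"] = pd.to_datetime(df["time"], format="%m/%d/%Y %H:%M")

# isocalendar() returns a frame with the ISO year, ISO week and ISO weekday.
iso = df["time"].dt.isocalendar()

# Pair the ISO week with the ISO year (not the calendar year).
# zfill(2) zero-pads the week so the labels sort correctly as text.
df["year_week"] = iso["year"].astype(str) + "-" + iso["week"].astype(str).str.zfill(2)

print(df["year_week"].tolist())  # ['2013-52', '2014-01', '2014-01']
```

Drop the zfill(2) call if labels such as 2014-1 are preferred. Combining the calendar year column with the ISO week is what produces the misleading 2013-1.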