Conditionally create new observations - Stata

I have data in the following format (there are a lot more variables):
year ID Dummy
1495 65 1
1496 65 1
1501 65 1
1502 65 1
1520 65 0
1522 65 0
What I am trying to achieve is to conditionally create new observations that fill in the data between two points in time, conditional on a dummy. If the dummy is equal to 1, the gap is supposed to be filled in; if it is equal to 0, it shall not be.
For example:
year ID Dummy
1495 65 1
1496 65 1
1497 65 1
1498 65 1
.
.
1501 65 1
1502 65 1
1503 65 1
1504 65 1
.
.
.
1520 65 0
1522 65 0

Here's one way to do this:
clear
input year id dummy
1495 65 1
1496 65 1
1501 65 1
1502 65 1
1520 65 0
1522 65 0
end
* flag dummy == 1 observations whose year differs from the next observation's
generate tag = year[_n] != year[_n+1] & dummy == 1
* (negative) distance in years to the next observation
generate delta = year[_n] - year[_n+1] if tag
* a distance of 1 means the years are already consecutive: nothing to fill in
replace delta = . if abs(delta) == 1
* expand to abs(delta) observations: the original year plus one per missing year
expand abs(delta) if tag & delta != .
sort year
* number the duplicates 0, 1, 2, ... within each year and shift years forward
bysort year: egen seq = seq() if delta != .
replace seq = seq - 1
replace seq = 0 if seq == .
replace year = year + seq if year != .
drop tag delta seq
The above code snippet will produce:
list
+-------------------+
| year id dummy |
|-------------------|
1. | 1495 65 1 |
2. | 1496 65 1 |
3. | 1497 65 1 |
4. | 1498 65 1 |
5. | 1499 65 1 |
|-------------------|
6. | 1500 65 1 |
7. | 1501 65 1 |
8. | 1502 65 1 |
9. | 1503 65 1 |
10. | 1504 65 1 |
|-------------------|
11. | 1505 65 1 |
12. | 1506 65 1 |
13. | 1507 65 1 |
14. | 1508 65 1 |
15. | 1509 65 1 |
|-------------------|
16. | 1510 65 1 |
17. | 1511 65 1 |
18. | 1512 65 1 |
19. | 1513 65 1 |
20. | 1514 65 1 |
|-------------------|
21. | 1515 65 1 |
22. | 1516 65 1 |
23. | 1517 65 1 |
24. | 1518 65 1 |
25. | 1519 65 1 |
|-------------------|
26. | 1520 65 0 |
27. | 1522 65 0 |
+-------------------+
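An alternative sketch uses Stata's built-in panel tools: tsset plus tsfill create new observations for every gap within each ID, so the filled-in years that follow a dummy == 0 observation must be dropped again afterwards. Unlike the expand approach, tsfill leaves all other variables missing in the new observations; filled below is just a marker variable introduced for that purpose.
tsset id year
tsfill
* mark the observations that -tsfill- created (their dummy is still missing)
generate byte filled = missing(dummy)
* carry the dummy forward within each ID ...
bysort id (year): replace dummy = dummy[_n-1] if missing(dummy)
* ... and drop the filled-in years that belong to a dummy == 0 stretch
drop if filled & dummy == 0
drop filled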

Merging two variables

I have the following data, and I want to turn the data in the upper panel into the data in the lower one.
For each origin group, I want to add one line with destination value -1, and var value from varnew.
I tried to find out whether there is a command that adds one row, so that I could do something like:
bysort origin: addrow
However, it seems there is no such thing.
Using your toy example data:
clear
input destination origin var varnew
0 111 124 .
111 111 671 168
0 222 623 .
222 222 768 865
end
list, abbreviate(15)
+-------------------------------------+
| destination origin var varnew |
|-------------------------------------|
1. | 0 111 124 . |
2. | 111 111 671 168 |
3. | 0 222 623 . |
4. | 222 222 768 865 |
+-------------------------------------+
expand 2 if varnew != .
sort origin destination
list, abbreviate(15)
+-------------------------------------+
| destination origin var varnew |
|-------------------------------------|
1. | 0 111 124 . |
2. | 111 111 671 168 |
3. | 111 111 671 168 |
4. | 0 222 623 . |
5. | 222 222 768 865 |
|-------------------------------------|
6. | 222 222 768 865 |
+-------------------------------------+
The following works for me:
* after the expand, each duplicated pair is adjacent within its origin group;
* the condition is true only for the first observation of each pair
bysort origin: replace destination = -1 if destination[_n] == destination[_n+1] & !missing(varnew)
bysort origin: replace var = varnew if var[_n] == var[_n+1] & !missing(varnew)
list destination origin var, abbreviate(15)
+----------------------------+
| destination origin var |
|----------------------------|
1. | 0 111 124 |
2. | -1 111 168 |
3. | 111 111 671 |
4. | 0 222 623 |
5. | -1 222 865 |
|----------------------------|
6. | 222 222 768 |
+----------------------------+
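A slightly more direct sketch of the same idea uses expand's generate() option, which marks the copies so they can be rewritten without comparing adjacent observations (copy is just a marker variable name chosen here):
expand 2 if !missing(varnew), generate(copy)
* rewrite the copies directly
replace destination = -1 if copy
replace var = varnew if copy
drop copy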

Select minimum value by ID over a range of visits

I'm trying to extract a variable for the lowest value over a range of visits. In this case,
I want the lowest value over the first 3 days of admission (admission day 1, 2, or 3), by visitID. Any suggestions?
visitID value day of admission
1 941 1
1 948 2
1 935 4
2 83 1
2 84 2
2 50 4
2 79 5
and I would want:
visitID value day minvalue
1 941 1 941
1 948 2 941
1 935 4 941
2 83 1 83
2 84 2 83
2 50 4 83
2 79 5 83
It would have been helpful if you had presented your data in an easily usable form. But here's an approach that should point you in a useful direction.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte visitid int value byte day
1 941 1
1 948 2
1 935 4
2 83 1
2 84 2
2 50 4
2 79 5
end
* within each visitid, take the minimum of value over days 1-3 only:
* cond() returns value when day <= 3 and missing otherwise, and min() ignores missing
bysort visitid (day) : egen minvalue = min(cond(day<=3,value,.))
Which results in
. list, sepby(visitid)
+----------------------------------+
| visitid value day minvalue |
|----------------------------------|
1. | 1 941 1 941 |
2. | 1 948 2 941 |
3. | 1 935 4 941 |
|----------------------------------|
4. | 2 83 1 83 |
5. | 2 84 2 83 |
6. | 2 50 4 83 |
7. | 2 79 5 83 |
+----------------------------------+
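An equivalent one-line sketch uses the division trick: dividing by a condition that is false (0) yields missing in Stata, and min() ignores missing values.
egen minvalue2 = min(value / (day <= 3)), by(visitid)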

Frequency table with group variable

I have a dataset with firm-level data.
I have a variable employees (an integer) and a variable nace2 (an integer indicating which industry or service sector the company belongs to).
I have created a third variable for grouping employees:
gen employees_cat = .
replace employees_cat = 1 if employees >=0 & employees<10
replace employees_cat = 2 if employees >=10 & employees<20
replace employees_cat = 3 if employees >=20 & employees<49
replace employees_cat = 4 if employees >=49 & employees<249
replace employees_cat = 5 if employees >=249 & !missing(employees) // guard: missing employees would otherwise land here, since missing sorts above every number
I would like to create a frequency table showing how many employees work in every nace2 sector per employees_cat.
As a reproducible example take
sysuse auto.dta
Let's try to get a frequency table showing the overall mileage (mpg) of all domestic / foreign cars that have a trunk space of 11, 12, 16, etc.
The starting point for frequency tabulations in Stata is tabulate, which can show one- and two-way breakdowns. Used with by:, multi-way breakdowns can be produced as a series of two-way tables. See also table.
With the variables you mention in the auto data, there are 21 distinct values for mpg and 18 for trunk, so a two-way table would be 21 x 18 or 18 x 21 with many empty cells, as the number of observations, 74, is much less than the product, 378. (Here the user-written command distinct is used to count distinct values: type search distinct in Stata for literature references and the latest code version to download.)
. sysuse auto, clear
(1978 Automobile Data)
. distinct mpg trunk
------------------------------
| total distinct
-------+----------------------
mpg | 74 21
trunk | 74 18
------------------------------
One way around this problem is to collapse the tabulation into a list with typical entry {row variable, column variable, frequency information}. This is offered by the program groups, which must be installed first, as here:
. ssc inst groups
. groups trunk mpg
+-------------------------------+
| trunk mpg Freq. Percent |
|-------------------------------|
| 5 28 1 1.35 |
| 6 23 1 1.35 |
| 7 18 1 1.35 |
| 7 24 2 2.70 |
| 8 21 1 1.35 |
|-------------------------------|
| 8 24 1 1.35 |
| 8 26 1 1.35 |
| 8 30 1 1.35 |
| 8 35 1 1.35 |
| 9 22 1 1.35 |
|-------------------------------|
| 9 28 1 1.35 |
| 9 29 1 1.35 |
| 9 31 1 1.35 |
| 10 21 1 1.35 |
| 10 24 1 1.35 |
|-------------------------------|
| 10 25 1 1.35 |
| 10 26 2 2.70 |
| 11 17 1 1.35 |
| 11 18 1 1.35 |
| 11 22 1 1.35 |
|-------------------------------|
| 11 23 1 1.35 |
| 11 28 1 1.35 |
| 11 30 1 1.35 |
| 11 34 1 1.35 |
| 11 35 1 1.35 |
|-------------------------------|
| 12 22 1 1.35 |
| 12 23 1 1.35 |
| 12 25 1 1.35 |
| 13 19 3 4.05 |
| 13 21 1 1.35 |
|-------------------------------|
| 14 14 1 1.35 |
| 14 17 1 1.35 |
| 14 18 1 1.35 |
| 14 19 1 1.35 |
| 15 14 1 1.35 |
|-------------------------------|
| 15 17 1 1.35 |
| 15 18 1 1.35 |
| 15 25 1 1.35 |
| 15 41 1 1.35 |
| 16 14 3 4.05 |
|-------------------------------|
| 16 18 1 1.35 |
| 16 19 3 4.05 |
| 16 20 2 2.70 |
| 16 21 1 1.35 |
| 16 22 1 1.35 |
|-------------------------------|
| 16 25 1 1.35 |
| 17 16 3 4.05 |
| 17 18 1 1.35 |
| 17 19 1 1.35 |
| 17 20 1 1.35 |
|-------------------------------|
| 17 22 1 1.35 |
| 17 25 1 1.35 |
| 18 12 1 1.35 |
| 20 14 1 1.35 |
| 20 15 1 1.35 |
|-------------------------------|
| 20 16 1 1.35 |
| 20 18 2 2.70 |
| 20 21 1 1.35 |
| 21 17 1 1.35 |
| 21 18 1 1.35 |
|-------------------------------|
| 22 12 1 1.35 |
| 23 15 1 1.35 |
+-------------------------------+
groups has many more options, which are documented in its help file. It also extends easily to multi-way tables collapsed to lists, as here with a third grouping variable:
. groups foreign trunk mpg, sepby(foreign trunk)
+------------------------------------------+
| foreign trunk mpg Freq. Percent |
|------------------------------------------|
| Domestic 7 18 1 1.35 |
| Domestic 7 24 2 2.70 |
|------------------------------------------|
| Domestic 8 26 1 1.35 |
| Domestic 8 30 1 1.35 |
|------------------------------------------|
| Domestic 9 22 1 1.35 |
| Domestic 9 28 1 1.35 |
| Domestic 9 29 1 1.35 |
|------------------------------------------|
| Domestic 10 21 1 1.35 |
| Domestic 10 24 1 1.35 |
| Domestic 10 26 1 1.35 |
|------------------------------------------|
| Domestic 11 17 1 1.35 |
| Domestic 11 22 1 1.35 |
| Domestic 11 28 1 1.35 |
| Domestic 11 34 1 1.35 |
|------------------------------------------|
| Domestic 12 22 1 1.35 |
|------------------------------------------|
| Domestic 13 19 3 4.05 |
| Domestic 13 21 1 1.35 |
|------------------------------------------|
| Domestic 14 19 1 1.35 |
|------------------------------------------|
| Domestic 15 14 1 1.35 |
| Domestic 15 18 1 1.35 |
|------------------------------------------|
| Domestic 16 14 3 4.05 |
| Domestic 16 18 1 1.35 |
| Domestic 16 19 3 4.05 |
| Domestic 16 20 2 2.70 |
| Domestic 16 22 1 1.35 |
|------------------------------------------|
| Domestic 17 16 3 4.05 |
| Domestic 17 18 1 1.35 |
| Domestic 17 19 1 1.35 |
| Domestic 17 20 1 1.35 |
| Domestic 17 22 1 1.35 |
| Domestic 17 25 1 1.35 |
|------------------------------------------|
| Domestic 18 12 1 1.35 |
|------------------------------------------|
| Domestic 20 14 1 1.35 |
| Domestic 20 15 1 1.35 |
| Domestic 20 16 1 1.35 |
| Domestic 20 18 2 2.70 |
| Domestic 20 21 1 1.35 |
|------------------------------------------|
| Domestic 21 17 1 1.35 |
| Domestic 21 18 1 1.35 |
|------------------------------------------|
| Domestic 22 12 1 1.35 |
|------------------------------------------|
| Domestic 23 15 1 1.35 |
|------------------------------------------|
| Foreign 5 28 1 1.35 |
|------------------------------------------|
| Foreign 6 23 1 1.35 |
|------------------------------------------|
| Foreign 8 21 1 1.35 |
| Foreign 8 24 1 1.35 |
| Foreign 8 35 1 1.35 |
|------------------------------------------|
| Foreign 9 31 1 1.35 |
|------------------------------------------|
| Foreign 10 25 1 1.35 |
| Foreign 10 26 1 1.35 |
|------------------------------------------|
| Foreign 11 18 1 1.35 |
| Foreign 11 23 1 1.35 |
| Foreign 11 30 1 1.35 |
| Foreign 11 35 1 1.35 |
|------------------------------------------|
| Foreign 12 23 1 1.35 |
| Foreign 12 25 1 1.35 |
|------------------------------------------|
| Foreign 14 14 1 1.35 |
| Foreign 14 17 1 1.35 |
| Foreign 14 18 1 1.35 |
|------------------------------------------|
| Foreign 15 17 1 1.35 |
| Foreign 15 25 1 1.35 |
| Foreign 15 41 1 1.35 |
|------------------------------------------|
| Foreign 16 21 1 1.35 |
| Foreign 16 25 1 1.35 |
+------------------------------------------+
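Returning to the variables named in the question, the same approach would be something like this (a sketch, assuming the nace2 and employees_cat variables exist as described):
ssc install groups
groups nace2 employees_cat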

Train our own classifier

I am now training my own classifier, and for that I am using traincascade. But when I give this command:
opencv_traincascade -data facedet -vec vecfile.vec -bg negative.txt -npos 2650 -nneg 581 -nstages 20 -w 20 -h 20
it shows an error like this:
PARAMETERS:
cascadeDirName: facedet
vecFileName: vecfile.vec
bgFileName: negative.txt
numPos: 2000
numNeg: 1000
numStages: 20
precalcValBufSize[Mb] : 256
precalcIdxBufSize[Mb] : 256
stageType: BOOST
featureType: HAAR
sampleWidth: 20
sampleHeight: 20
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: BASIC
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 2000 : 2000
NEG count : acceptanceRatio 1000 : 1
Precalculation time: 3
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 1| 1|
+----+---------+---------+
| 4| 1| 1|
+----+---------+---------+
| 5| 1| 1|
+----+---------+---------+
| 6| 0.9955| 0.391|
+----+---------+---------+
END>
Parameters can not be written, because file facedet/params.xml can not be opened.
What is this error? I don't understand it. Can anyone help me solve it?
Positive samples:
/home/arya/myown/Positive/images18413.jpeg 1 1 1 113 33
/home/arya/myown/Positive/images1392.jpeg 1 113 33 107 133
/home/arya/myown/Positive/face841.jpeg 1 185 93 35 73
/home/arya/myown/Positive/images866.jpeg 2 121 26 64 68 121 26 88 123
/home/arya/myown/Positive/images83.jpeg 1 102 13 107 136
/home/arya/myown/Positive/images355.jpeg 2 92 16 224 25 92 16 117 130
/home/arya/myown/Positive/images888.jpeg 1 108 29 116 71
/home/arya/myown/Positive/images2535.jpeg 1 108 29 111 129
/home/arya/myown/Positive/images18221.jpeg 1 110 34 109 124
/home/arya/myown/Positive/images1127.jpeg 1 110 34 92 104
/home/arya/myown/Positive/images18357.jpeg 1 103 27 142 133
/home/arya/myown/Positive/images889.jpeg 1 86 25 134 124
Negative samples:
./Negative/face150.jpeg
./Negative/face1051.jpeg
./Negative/Pictures174.jpeg
./Negative/Pictures160.jpeg
./Negative/Pictures34.jpeg
./Negative/face130.jpeg
./Negative/face1.jpeg
./Negative/Pictures319.jpeg
./Negative/face1120.jpeg
./Negative/Pictures317.jpeg
./Negative/face1077.jpeg
./Negative/Pictures93.jpeg
./Negative/Pictures145.jpeg
./Negative/face1094.jpeg
./Negative/Pictures7.jpeg
Please be sure that you have already created the folder "facedet" before training your classifier, as opencv_traincascade does not create it by itself.
It needs this folder to be able to create the params.xml file inside it.
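For example, creating the output folder first and then running the command from the question:
mkdir facedet
opencv_traincascade -data facedet -vec vecfile.vec -bg negative.txt -npos 2650 -nneg 581 -nstages 20 -w 20 -h 20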

Editing a text file in Python?

I have this in a text file:
Rubble HM3 80 HM2 90 HM4 92
Bunny HM2 92 HM5 70 HM1 98
Duck HM1 86 HM3 100 HM2 93 HM4 94
Chipmunk HM4 96 HM1 86
Simpson HM3 70 HM1 90 Test1 90
and I want to write code that changes it to this:
Name | HM1 | HM2 | HM3 | HM4 | Avg. |
________________________________________________
Bunny | 98 | 92 | 0 | 0 | 47.50 |
Chipmunk | 86 | 0 | 0 | 96 | 45.50 |
Duck | 86 | 93 | 100 | 94 | 93.25 |
Rubble | 0 | 90 | 80 | 92 | 65.50 |
Simpson | 90 | 0 | 70 | 0 | 40.00 |
So far I have:
# read the file once, then loop over its lines (not its characters)
my_file = open("C:/python27/tools/student_grades.txt", "r")
contents = my_file.read()
my_file.close()
for line in contents.splitlines():
    print line.replace("HM2", "|")
Go easy, first-time programmer. :)
And if I use the replace function, how can I print it all at once and then sort it under every subject ("HM1", "HM2", ...)?
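One way to build the table is to parse each line into a name followed by (subject, score) pairs, keep only HM1-HM4, treat missing homeworks as 0, and print the rows sorted by name. Here is a minimal sketch in the Python 2.7 implied by the path in your question; the exact column widths are an assumption:
# parse name/score pairs per line; subjects outside HM1-HM4 (HM5, Test1) are ignored
columns = ["HM1", "HM2", "HM3", "HM4"]

rows = {}
with open("C:/python27/tools/student_grades.txt") as f:
    for line in f:
        parts = line.split()
        if not parts:          # skip blank lines
            continue
        name = parts[0]
        # e.g. ["HM3", "80", "HM2", "90"] -> {"HM3": "80", "HM2": "90"}
        scores = dict(zip(parts[1::2], parts[2::2]))
        rows[name] = [int(scores.get(c, 0)) for c in columns]

header = "Name     | HM1 | HM2 | HM3 | HM4 | Avg.  |"
print header
print "_" * len(header)
for name in sorted(rows):
    grades = rows[name]
    avg = sum(grades) / 4.0    # average over the four homework columns
    print "%-8s | %3d | %3d | %3d | %3d | %5.2f |" % tuple([name] + grades + [avg])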