Merging two variables - stata

I have the following data:
I want to turn the data in the upper panel into the data in the lower one.
For each origin group, I want to add one line with destination value -1, and var value from varnew.
I tried to find if there is a command which adds one row so that I can do something like:
bysort origin: addrow
However, it seems there isn't any such thing.

Using your toy example data:
clear
input destination origin var varnew
0 111 124 .
111 111 671 168
0 222 623 .
222 222 768 865
end
list, abbreviate(15)
     +-------------------------------------+
     | destination   origin   var   varnew |
     |-------------------------------------|
  1. |           0      111   124        . |
  2. |         111      111   671      168 |
  3. |           0      222   623        . |
  4. |         222      222   768      865 |
     +-------------------------------------+
expand 2 if varnew != .
sort origin destination
list, abbreviate(15)
     +-------------------------------------+
     | destination   origin   var   varnew |
     |-------------------------------------|
  1. |           0      111   124        . |
  2. |         111      111   671      168 |
  3. |         111      111   671      168 |
  4. |           0      222   623        . |
  5. |         222      222   768      865 |
     |-------------------------------------|
  6. |         222      222   768      865 |
     +-------------------------------------+
The following works for me:
bysort origin: replace destination = -1 if destination[_n] == destination[_n+1] & !missing(varnew)
bysort origin: replace var = varnew if var[_n] == var[_n+1] & !missing(varnew)
list destination origin var, abbreviate(15)
     +----------------------------+
     | destination   origin   var |
     |----------------------------|
  1. |           0      111   124 |
  2. |          -1      111   168 |
  3. |         111      111   671 |
  4. |           0      222   623 |
  5. |          -1      222   865 |
     |----------------------------|
  6. |         222      222   768 |
     +----------------------------+

Related

Filling missing observations with equal parts of the existing observation (Stata)

I would like to fill the missing observation(s) with the values of the next cell and distribute it equally over the missing rows.
For example, using the data below, I would fill 2004m1 and 2004m2 with 142 and also replace the value for 2004m3 with 142, since one value is spread over three rows (142 = 426/3). For 2005m7 and 2005m8 it would be 171, and so on. I am able to fill the missing values with reversed sorting and carryforward, but I cannot figure out how to redistribute the values, especially since the number of rows to fill can vary, so it is not simply a matter of [_n+1].
My try to fill the values (but this does not redistribute):
carryforward value, gen(value_filled)
Example data set:
date_m value
2005m12 56
2005m11 150
2005m10 190
2005m9 157
2005m8 342
2005m7 .
2005m6 181
2005m5 151
2005m4 107
2005m3 131
2005m2 247
2005m1 100
2004m12 77
2004m11 181
2004m10 132
2004m9 153
2004m8 380
2004m7 .
2004m6 174
2004m5 178
2004m4 104
2004m3 426
2004m2 .
2004m1 .
Expected result
date_m value
2005m12 56
2005m11 150
2005m10 190
2005m9 157
2005m8 171
2005m7 171
2005m6 181
2005m5 151
2005m4 107
2005m3 131
2005m2 247
2005m1 100
2004m12 77
2004m11 181
2004m10 132
2004m9 153
2004m8 190
2004m7 190
2004m6 174
2004m5 178
2004m4 104
2004m3 142
2004m2 142
2004m1 142
Thanks for your data example, which is helpful, but as detailed in the Stata tag wiki and on Statalist, an example using dataex is even better. Date and time variables are especially awkward otherwise.
You allude to carryforward, which is from SSC and which many have found useful. Having written the FAQ on this, my prejudice is that most such problems yield quickly and directly to sorting, subscripting and replace. Your problem is trickier than most in that it includes a value to be divided after an unpredictable gap of missing values.
This works for your example and doesn't rule out a simpler solution.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float date int mvalue
551 56
550 150
549 190
548 157
547 342
546 .
545 181
544 151
543 107
542 131
541 247
540 100
539 77
538 181
537 132
536 153
535 380
534 .
533 174
532 178
531 104
530 426
529 .
528 .
end
format %tm date
gsort -date
gen copy = mvalue
replace copy = copy[_n-1] if missing(copy)
gen gap = missing(mvalue[_n+1]) | missing(mvalue)
replace gap = gap + gap[_n-1] if gap == 1 & _n > 1
sort date
replace gap = gap[_n-1] if inrange(gap[_n-1], 1, .) & gap >= 1
gen wanted = cond(gap, copy/gap, copy)
list , sepby(gap)
     +----------------------------------------+
     |    date   mvalue   copy   gap   wanted |
     |----------------------------------------|
  1. |  2004m1        .    426     3      142 |
  2. |  2004m2        .    426     3      142 |
  3. |  2004m3      426    426     3      142 |
     |----------------------------------------|
  4. |  2004m4      104    104     0      104 |
  5. |  2004m5      178    178     0      178 |
  6. |  2004m6      174    174     0      174 |
     |----------------------------------------|
  7. |  2004m7        .    380     2      190 |
  8. |  2004m8      380    380     2      190 |
     |----------------------------------------|
  9. |  2004m9      153    153     0      153 |
 10. | 2004m10      132    132     0      132 |
 11. | 2004m11      181    181     0      181 |
 12. | 2004m12       77     77     0       77 |
 13. |  2005m1      100    100     0      100 |
 14. |  2005m2      247    247     0      247 |
 15. |  2005m3      131    131     0      131 |
 16. |  2005m4      107    107     0      107 |
 17. |  2005m5      151    151     0      151 |
 18. |  2005m6      181    181     0      181 |
     |----------------------------------------|
 19. |  2005m7        .    342     2      171 |
 20. |  2005m8      342    342     2      171 |
     |----------------------------------------|
 21. |  2005m9      157    157     0      157 |
 22. | 2005m10      190    190     0      190 |
 23. | 2005m11      150    150     0      150 |
 24. | 2005m12       56     56     0       56 |
     +----------------------------------------+

Split variable to get the last string as a new variable

I have a large dataset of 5,000 observations and a subset of my data looks as follows:
AandB
1 222 454 213.51  59.15%
444 630 789.46  6.15%
2 374 798 807.69  32.00%
304 738 263.59  19.95%
177 641 617.86  18.07%
857 937 842.27  51.97%
973 127.33  0.03%
86 205 146.62  1.18%
I need two variables, A and B out of this one variable.
For example, 1 222 454 213.51 should be in column A as 1222454213.51, and the corresponding observation in variable B should be 59.15%.
There is a double-space separating what values I want in A, and what I want in B in the raw data.
Hence, I need:
A B
1222454213.51 59.15%
444630789.46 6.15%
2374798807.69 32.00%
304738263.59 19.95%
177641617.86 18.07%
857937842.27 51.97%
973127.33 0.03%
86205146.62 1.18%
I was able to obtain variable A with the following:
generate A = reverse(substr(reverse(AandB),strpos(reverse(AandB), " "), . ))
replace A = subinstr(A, " ", "", .)
However, I have trouble extracting the percentage numbers.
Another way to advance is to peel off the last "word" (Stata sense) first:
clear
input str42 AandB
"1 222 454 213.51  59.15%"
"444 630 789.46  6.15%"
"2 374 798 807.69  32.00%"
"304 738 263.59  19.95%"
"177 641 617.86  18.07%"
"857 937 842.27  51.97%"
"973 127.33  0.03%"
"86 205 146.62  1.18%"
end
generate B = word(AandB, -1)
generate A = trim(subinstr(AandB, B, "", .))
list AandB A B, separator(0)
     +------------------------------------------------------+
     |                    AandB                  A        B |
     |------------------------------------------------------|
  1. | 1 222 454 213.51  59.15%   1 222 454 213.51   59.15% |
  2. |    444 630 789.46  6.15%     444 630 789.46    6.15% |
  3. | 2 374 798 807.69  32.00%   2 374 798 807.69   32.00% |
  4. |   304 738 263.59  19.95%     304 738 263.59   19.95% |
  5. |   177 641 617.86  18.07%     177 641 617.86   18.07% |
  6. |   857 937 842.27  51.97%     857 937 842.27   51.97% |
  7. |        973 127.33  0.03%         973 127.33    0.03% |
  8. |     86 205 146.62  1.18%      86 205 146.62    1.18% |
     +------------------------------------------------------+
If you do want A to be regarded as specifying some very big numbers, then
generate double A2 = real(subinstr(A, " ", "", .))
is one way forward. Measuring to 12 significant figures implies that you are in astronomy (and perhaps the first 6 digits are good) or in economics (and perhaps the first digit is reliable).
The following works for me:
clear
input str50 AandB
"1 222 454 213.51  59.15%"
"444 630 789.46  6.15%"
"2 374 798 807.69  32.00%"
"304 738 263.59  19.95%"
"177 641 617.86  18.07%"
"857 937 842.27  51.97%"
"973 127.33  0.03%"
"86 205 146.62  1.18%"
end
generate A = subinstr(substr(AandB, 1, strpos(AandB,"%")-6)," ", "", .)
generate B = subinstr(substr(AandB, strpos(AandB,"%")-6, .)," ", "", .)
list, separator(0)
     +---------------------------------------------------+
     |                    AandB               A        B |
     |---------------------------------------------------|
  1. | 1 222 454 213.51  59.15%   1222454213.51   59.15% |
  2. |    444 630 789.46  6.15%    444630789.46    6.15% |
  3. | 2 374 798 807.69  32.00%   2374798807.69   32.00% |
  4. |   304 738 263.59  19.95%    304738263.59   19.95% |
  5. |   177 641 617.86  18.07%    177641617.86   18.07% |
  6. |   857 937 842.27  51.97%    857937842.27   51.97% |
  7. |        973 127.33  0.03%       973127.33    0.03% |
  8. |     86 205 146.62  1.18%     86205146.62    1.18% |
     +---------------------------------------------------+
EDIT:
On second thought this can be simplified to the following:
generate A = subinstr(substr(AandB, 1, strpos(AandB,"  "))," ", "", .)
generate B = subinstr(substr(AandB, strpos(AandB,"  "), .)," ", "", .)
One way would be:
split AandB, p("  ")
rename AandB1 A
rename AandB2 B
replace A = subinstr(A, " ", "", .)
list, separator(0)
     +---------------------------------------------------+
     |                    AandB               A        B |
     |---------------------------------------------------|
  1. | 1 222 454 213.51  59.15%   1222454213.51   59.15% |
  2. |    444 630 789.46  6.15%    444630789.46    6.15% |
  3. | 2 374 798 807.69  32.00%   2374798807.69   32.00% |
  4. |   304 738 263.59  19.95%    304738263.59   19.95% |
  5. |   177 641 617.86  18.07%    177641617.86   18.07% |
  6. |   857 937 842.27  51.97%    857937842.27   51.97% |
  7. |        973 127.33  0.03%       973127.33    0.03% |
  8. |     86 205 146.62  1.18%     86205146.62    1.18% |
     +---------------------------------------------------+

Conditionally create new observations

I have data in the following format (there are a lot more variables):
year ID Dummy
1495 65 1
1496 65 1
1501 65 1
1502 65 1
1520 65 0
1522 65 0
What I am trying to achieve is to conditionally create new observations that fill in the years between two points in time, depending on a dummy. If the dummy is equal to 1, the gap up to the next observation is supposed to be filled in. If the dummy is equal to 0, it should not be filled in.
For example:
year ID Dummy
1495 65 1
1496 65 1
1497 65 1
1498 65 1
.
.
1501 65 1
1502 65 1
1503 65 1
1504 65 1
.
.
.
1520 65 0
1522 65 0
Here's one way to do this:
clear
input year id dummy
1495 65 1
1496 65 1
1501 65 1
1502 65 1
1520 65 0
1522 65 0
end
generate tag = year[_n] != year[_n+1] & dummy == 1
generate delta = year[_n] - year[_n+1] if tag
replace delta = . if abs(delta) == 1
expand abs(delta) if tag & delta != .
sort year
bysort year: egen seq = seq() if delta != .
replace seq = seq - 1
replace seq = 0 if seq == .
replace year = year + seq if year != .
drop tag delta seq
The above code snippet will produce:
list
     +-------------------+
     | year   id   dummy |
     |-------------------|
  1. | 1495   65       1 |
  2. | 1496   65       1 |
  3. | 1497   65       1 |
  4. | 1498   65       1 |
  5. | 1499   65       1 |
     |-------------------|
  6. | 1500   65       1 |
  7. | 1501   65       1 |
  8. | 1502   65       1 |
  9. | 1503   65       1 |
 10. | 1504   65       1 |
     |-------------------|
 11. | 1505   65       1 |
 12. | 1506   65       1 |
 13. | 1507   65       1 |
 14. | 1508   65       1 |
 15. | 1509   65       1 |
     |-------------------|
 16. | 1510   65       1 |
 17. | 1511   65       1 |
 18. | 1512   65       1 |
 19. | 1513   65       1 |
 20. | 1514   65       1 |
     |-------------------|
 21. | 1515   65       1 |
 22. | 1516   65       1 |
 23. | 1517   65       1 |
 24. | 1518   65       1 |
 25. | 1519   65       1 |
     |-------------------|
 26. | 1520   65       0 |
 27. | 1522   65       0 |
     +-------------------+

Train our own classifier

I am training my own classifier, and for that I am using traincascade. But when I run the command 'opencv_traincascade -data facedet -vec vecfile.vec -bg negative.txt -npos 2650 -nneg 581 -nstages 20 -w 20 -h 20', it shows the following output, ending in an error:
PARAMETERS:
cascadeDirName: facedet
vecFileName: vecfile.vec
bgFileName: negative.txt
numPos: 2000
numNeg: 1000
numStages: 20
precalcValBufSize[Mb] : 256
precalcIdxBufSize[Mb] : 256
stageType: BOOST
featureType: HAAR
sampleWidth: 20
sampleHeight: 20
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: BASIC
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 2000 : 2000
NEG count : acceptanceRatio 1000 : 1
Precalculation time: 3
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 1| 1|
+----+---------+---------+
| 4| 1| 1|
+----+---------+---------+
| 5| 1| 1|
+----+---------+---------+
| 6| 0.9955| 0.391|
+----+---------+---------+
END>
Parameters can not be written, because file facedet/params.xml can not be opened.
What is this error? I don't understand it. Can anyone help me solve this?
Positive samples:
/home/arya/myown/Positive/images18413.jpeg 1 1 1 113 33
/home/arya/myown/Positive/images1392.jpeg 1 113 33 107 133
/home/arya/myown/Positive/face841.jpeg 1 185 93 35 73
/home/arya/myown/Positive/images866.jpeg 2 121 26 64 68 121 26 88 123
/home/arya/myown/Positive/images83.jpeg 1 102 13 107 136
/home/arya/myown/Positive/images355.jpeg 2 92 16 224 25 92 16 117 130
/home/arya/myown/Positive/images888.jpeg 1 108 29 116 71
/home/arya/myown/Positive/images2535.jpeg 1 108 29 111 129
/home/arya/myown/Positive/images18221.jpeg 1 110 34 109 124
/home/arya/myown/Positive/images1127.jpeg 1 110 34 92 104
/home/arya/myown/Positive/images18357.jpeg 1 103 27 142 133
/home/arya/myown/Positive/images889.jpeg 1 86 25 134 124
Negative samples:
./Negative/face150.jpeg
./Negative/face1051.jpeg
./Negative/Pictures174.jpeg
./Negative/Pictures160.jpeg
./Negative/Pictures34.jpeg
./Negative/face130.jpeg
./Negative/face1.jpeg
./Negative/Pictures319.jpeg
./Negative/face1120.jpeg
./Negative/Pictures317.jpeg
./Negative/face1077.jpeg
./Negative/Pictures93.jpeg
./Negative/Pictures145.jpeg
./Negative/face1094.jpeg
./Negative/Pictures7.jpeg
Please make sure that you have already created the folder "facedet" before training your classifier, as opencv_traincascade does not create it by itself.
It needs this folder to exist so that it can create the "params.xml" file inside it.
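If the training run is launched from a script, the folder can be created just before the tool is invoked. A minimal Python sketch, assuming the output folder is named facedet as in the question; the helper name ensure_output_dir is made up for illustration and is not part of OpenCV:

```python
import os


def ensure_output_dir(path):
    """Create the cascade output folder if it is missing and return its absolute path."""
    # exist_ok=True makes this safe to call again when the folder already exists.
    os.makedirs(path, exist_ok=True)
    # Fail early if the folder exists but cannot be written to.
    if not os.access(path, os.W_OK):
        raise PermissionError("cannot write to " + path)
    return os.path.abspath(path)
```

Calling ensure_output_dir("facedet") from the directory where opencv_traincascade is started should give the tool a place to write params.xml and the stage files.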

Editing a text file in python?

I have this in a text file :
Rubble HM3 80 HM2 90 HM4 92
Bunny HM2 92 HM5 70 HM1 98
Duck HM1 86 HM3 100 HM2 93 HM4 94
Chipmunk HM4 96 HM1 86
Simpson HM3 70 HM1 90 Test1 90
and I want to write code that changes it to this:
Name     | HM1 | HM2 | HM3 | HM4 | Avg.  |
________________________________________________
Bunny    | 98  | 92  | 0   | 0   | 47.50 |
Chipmunk | 86  | 0   | 0   | 96  | 45.50 |
Duck     | 86  | 93  | 100 | 94  | 93.25 |
Rubble   | 0   | 90  | 80  | 92  | 65.50 |
Simpson  | 90  | 0   | 70  | 0   | 40.00 |
so far :
my_file=open("C:/python27/tools/student_grades.txt", "r+")
my_file_pointer=my_file.read()
for lines in my_file_pointer:
    x=my_file_pointer.replace("HM2","|")
    print x
Go easy, first-time programmer. :)
And if I use the replace function, how can I print it all at once and then sort the scores under each subject, e.g. "HM1"?
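One possible approach, sketched in Python 3 rather than the Python 2.7 used above: instead of replace, split each line into a name followed by label/score pairs, look up the four homework columns (treating a missing one as 0, as the wanted table does), and format the rows sorted by name. The function names and column widths are illustrative choices, not something from the question.

```python
SUBJECTS = ["HM1", "HM2", "HM3", "HM4"]  # columns of the wanted table


def parse_line(line):
    """Turn 'Rubble HM3 80 HM2 90' into ('Rubble', {'HM3': 80, 'HM2': 90})."""
    parts = line.split()
    name, rest = parts[0], parts[1:]
    # Labels sit at even positions, each followed by its score.
    scores = {label: int(score) for label, score in zip(rest[0::2], rest[1::2])}
    return name, scores


def format_row(first, cells, last):
    """Lay out one table row with fixed-width, left-aligned cells."""
    middle = "".join(" {:<4}|".format(cell) for cell in cells)
    return "{:<9}|".format(first) + middle + " {:<6}|".format(last)


def format_table(lines):
    """Build the whole table as one string, rows sorted by student name."""
    rows = [parse_line(line) for line in lines if line.strip()]
    header = format_row("Name", SUBJECTS, "Avg.")
    out = [header, "_" * len(header)]
    for name, scores in sorted(rows, key=lambda row: row[0]):
        # A homework that is not listed counts as 0; other labels are ignored.
        marks = [scores.get(subject, 0) for subject in SUBJECTS]
        average = sum(marks) / len(SUBJECTS)
        out.append(format_row(name, marks, "{:.2f}".format(average)))
    return "\n".join(out)
```

Since format_table accepts any iterable of lines, an open file object can be passed to it directly and the result printed in one go, which covers the "print it all at once" part. Labels outside HM1 to HM4 (such as HM5 or Test1) are parsed but left out of the table and the average.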