I would like to fill missing observations with the value of the next non-missing cell, distributing that value equally over the missing rows and the row it came from.
For example, using the data below, I would fill the values for 2004m1 and 2004m2 with 142 and also replace the value for 2004m3 with 142, since we fill two missing rows (142 = 426/3). For 2005m7/m8 it would be 171, and so on. I am able to fill the missings with reversed sorting and carryforward, but I cannot figure out how to redistribute the values, especially since the number of rows to fill varies, so it is not a simple [_n+1].
My attempt to fill the values (but this does not redistribute):
carryforward value, gen(value_filled)
Example data set:
date_m value
2005m12 56
2005m11 150
2005m10 190
2005m9 157
2005m8 342
2005m7 .
2005m6 181
2005m5 151
2005m4 107
2005m3 131
2005m2 247
2005m1 100
2004m12 77
2004m11 181
2004m10 132
2004m9 153
2004m8 380
2004m7 .
2004m6 174
2004m5 178
2004m4 104
2004m3 426
2004m2 .
2004m1 .
Expected result
date_m value
2005m12 56
2005m11 150
2005m10 190
2005m9 157
2005m8 171
2005m7 171
2005m6 181
2005m5 151
2005m4 107
2005m3 131
2005m2 247
2005m1 100
2004m12 77
2004m11 181
2004m10 132
2004m9 153
2004m8 190
2004m7 190
2004m6 174
2004m5 178
2004m4 104
2004m3 142
2004m2 142
2004m1 142
Thanks for your data example, which is helpful, but as detailed in the Stata tag wiki and on Statalist, an example using dataex is even better. Date and time variables are especially awkward otherwise.
You allude to carryforward, which is from SSC and which many have found useful. Having written the FAQ on this, my prejudice is that most such problems yield quickly and directly to sorting, subscripting, and replace. Your problem is trickier than most in including a value to be divided after an unpredictable gap of missing values.
This works for your example and doesn't rule out a simpler solution.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float date int mvalue
551 56
550 150
549 190
548 157
547 342
546 .
545 181
544 151
543 107
542 131
541 247
540 100
539 77
538 181
537 132
536 153
535 380
534 .
533 174
532 178
531 104
530 426
529 .
528 .
end
format %tm date
gsort -date
* carry each observed value backwards in time over the missing rows before it
gen copy = mvalue
replace copy = copy[_n-1] if missing(copy)
* flag the rows that form a run: a value whose earlier neighbour is missing,
* plus the missing rows themselves
gen gap = missing(mvalue[_n+1]) | missing(mvalue)
* accumulate the run length while walking back in time
replace gap = gap + gap[_n-1] if gap == 1 & _n > 1
sort date
* the total run length sits at the earliest row of each run;
* propagate it forward over the whole run
replace gap = gap[_n-1] if inrange(gap[_n-1], 1, .) & gap >= 1
* divide the carried value equally over the run
gen wanted = cond(gap, copy/gap, copy)
list , sepby(gap)
+----------------------------------------+
| date mvalue copy gap wanted |
|----------------------------------------|
1. | 2004m1 . 426 3 142 |
2. | 2004m2 . 426 3 142 |
3. | 2004m3 426 426 3 142 |
|----------------------------------------|
4. | 2004m4 104 104 0 104 |
5. | 2004m5 178 178 0 178 |
6. | 2004m6 174 174 0 174 |
|----------------------------------------|
7. | 2004m7 . 380 2 190 |
8. | 2004m8 380 380 2 190 |
|----------------------------------------|
9. | 2004m9 153 153 0 153 |
10. | 2004m10 132 132 0 132 |
11. | 2004m11 181 181 0 181 |
12. | 2004m12 77 77 0 77 |
13. | 2005m1 100 100 0 100 |
14. | 2005m2 247 247 0 247 |
15. | 2005m3 131 131 0 131 |
16. | 2005m4 107 107 0 107 |
17. | 2005m5 151 151 0 151 |
18. | 2005m6 181 181 0 181 |
|----------------------------------------|
19. | 2005m7 . 342 2 171 |
20. | 2005m8 342 342 2 171 |
|----------------------------------------|
21. | 2005m9 157 157 0 157 |
22. | 2005m10 190 190 0 190 |
23. | 2005m11 150 150 0 150 |
24. | 2005m12 56 56 0 56 |
+----------------------------------------+
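Since nothing rules out a simpler solution, here is a more compact variant of the same logic (a sketch; wanted2 should reproduce wanted for this example). Reading from the latest date backwards, each non-missing value opens a block that also captures the missing rows just before it in time, so the block's value and size give the redistributed figure directly.
gsort -date
gen block = sum(!missing(mvalue))
* within each block the latest date holds the value; _N is the block size
bysort block (date): gen wanted2 = mvalue[_N] / _N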
I have the following dataset:
clear
input float(department employee expertise_area share)
1 56 334 1
1 143 389 .04
1 143 334 .18
1 143 383 .02
1 143 398 .1
1 143 414 .02
1 143 396 .08
1 143 385 .08
1 143 403 .3
1 143 409 .02
1 143 373 .02
1 143 392 .06
1 143 397 .06
1 143 394 .02
1 214 373 1
4 145 399 .029
4 145 409 .7681
4 145 311 .0145
4 145 403 .1884
4 161 62 .4
4 161 373 .6
4 285 355 .5333
4 285 392 .0333
4 285 304 .0333
4 285 310 .2333
4 285 73 .0333
4 285 331 .0333
4 285 399 .0333
4 285 414 .0667
186 161 62 .4
186 161 373 .6
186 247 409 .0025
186 247 311 .0025
186 247 338 .25
186 247 298 .0051
186 247 334 .649
186 247 337 .0051
186 247 404 .0076
186 247 339 .0051
186 247 301 .0025
186 247 403 .0631
186 247 347 .0025
186 247 336 .0051
186 285 304 .0333
186 285 399 .0333
186 285 355 .5333
186 285 392 .0333
186 285 310 .2333
186 285 73 .0333
186 285 414 .0667
186 285 331 .0333
end
I would like to compute the differences between the distributions of prior experience of the employees in a team (or department).
The measure is the mean Euclidean distance, which captures the separation of individuals in a team. Here, p_ij and p_kj are the shares of employee i's and k's expertise in area j over their careers, and n equals the team size.
For example, for department 1, employee 143 has worked 18% of his career on area 334 (this corresponds to observation 3). The team size of department 1 is 3, that is, n = 3.
In summary, I want to calculate the Euclidean distance for each department (1, 4, 186), each with 3 employees (or points) [(56, 143, 214), (145, 161, 285), and (161, 247, 285) respectively] and 13, 13, and 22 expertise areas (or dimensions) respectively. Note that I should be able to produce output even if a department has more than 3 employees (or points).
The output should look as follows:
+------------+--------------------+
| department | euclidean_distance |
+------------+--------------------+
| 1 | .4022 |
| 4 | .4131 |
| 186 | .3882 |
+------------+--------------------+
How can I compute this in Stata?
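Without committing to the exact averaging in the formula, here is a sketch of one way to get the building blocks in Stata: the Euclidean distance between every pair of employees in a department, then an average per department. It relies on the identity ||p_i - p_k||^2 = S_i + S_k - 2*C_ik, where S_i is the sum of employee i's squared shares and C_ik is the cross product over the expertise areas the pair has in common, so areas held by only one employee of a pair need no zero-filling. The final collapse may need a different normalization to reproduce the expected numbers above.
* sketch: mean pairwise Euclidean distance per department
* assumes the data block above is in memory
* 1) sum of squared shares per (department, employee)
preserve
gen double share_sq = share^2
collapse (sum) S = share_sq, by(department employee)
tempfile sums
save `sums'
restore
* 2) cross products over the areas a pair of employees has in common
preserve
rename (employee share) (employee_k share_k)
tempfile partner
save `partner'
restore
joinby department expertise_area using `partner'
keep if employee < employee_k
gen double cross = share * share_k
collapse (sum) C = cross, by(department employee employee_k)
tempfile crossprods
save `crossprods'
* 3) all within-department pairs, with C = 0 where no area is shared
use `sums', clear
rename (employee S) (employee_k S_k)
joinby department using `sums'
keep if employee < employee_k
merge 1:1 department employee employee_k using `crossprods', keep(master match) nogenerate
replace C = 0 if missing(C)
gen double dist = sqrt(S + S_k - 2*C)
collapse (mean) euclidean_distance = dist, by(department)
list, noobs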
I have to read a data set of 50 numbers from a text file. The numbers are space-delimited and spread over multiple lines of uneven length, for example:
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15
16 17 18 19 20 21
Etc.
The first 25 numbers belong to group 1 and the second 25 belong to group 2. So I need to make a group variable (either 1 or 2), a counter (1 to 25), and a value variable holding the number itself.
I am stuck on how to split the data in half while reading it. I tried to use truncover, but it did not work.
Try something like this, replacing the datalines keyword with the path to your file:
data groups;
    infile datalines;
    format number 8. counter 2. group 1.; * not mandatory, used here to order variables;
    retain group (1);
    input number @@; * @@ holds the input line across iterations;
    counter + 1;
    if counter = 26 then do;
        group = 2;
        counter = 1;
    end;
datalines;
192 105 435 448 160 499 184 246 388 190 316
139 146 147 192 231 449 101 216 342 399 352 122 418
280 400 187 352 321 180 425 500 320 179 105
232 105 323 132 106 255 449
186 135 472 174 119 255
308 350
run;
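The @@ line-hold specifier is the key: it keeps the input pointer on the current line across data step iterations, so several observations are read from each input line instead of only the first number of every line.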
This question is related to Stata: select the minimum of each observation.
I have data as follows:
clear
input str4 id int eventdate byte dia_bp_copy int sys_bp_copy
"pat" 15698 100 140
"pat" 16183 80 120
"pat" 19226 98 155
"pat" 19375 80 130
"sue" 14296 80 120
"sue" 14334 88 127
"sue" 14334 96 158
"sue" 14334 84 136
"sue" 14403 86 124
"sue" 14403 88 134
"sue" 14403 90 156
"sue" 14403 86 134
"sue" 14403 90 124
"sue" 14431 80 120
"sue" 14431 80 140
"sue" 14431 80 130
"sue" 15456 80 130
"sue" 15501 80 120
"sue" 15596 80 120
"mary" 14998 90 154
"mary" 15165 91 179
"mary" 15280 91 156
"mary" 15386 81 154
"mary" 15952 77 133
"mary" 15952 80 144
"mary" 16390 91 159
end
Some people have multiple readings on one day, e.g. Sue on 31 March 1999. I want to select the lowest reading per day.
Here is my code, which gets me some of the way. It is clunky and clumsy, and I am looking for help to do what I want in a more straightforward way.
*make flag for repeat observations on same day
sort id eventdate
by id: gen flag =1 if eventdate==eventdate[_n-1]
by id: gen flag2=1 if eventdate==eventdate[_n+1]
by id: gen flag3 =1 if flag==1 | flag2==1
drop flag flag2
* group repeat observations together
egen group = group(id flag3 eventdate)
* find lowest `sys_bp_copy` value per group
bys group (eventdate flag3): egen low_sys=min(sys_bp_copy)
* remove the observations where the lowest value of `sys_bp_copy` doesn't exist
bys group: gen remove =1 if low_sys!=sys_bp_copy
drop if remove==1 & group !=.
**Problems with this and where I'd like help**
The problem with the above approach is that for Sue, two of her repeat readings share the same value of sys_bp_copy, so the approach leaves me with multiple readings for her.
In this instance I would like to refer to dia_bp_copy and select the lowest value there, to pick out one row per person when multiple readings remain. Code for this is below, but there must be a simpler way to do this?
drop flag3 remove group
sort id eventdate
by id: gen flag =1 if eventdate==eventdate[_n-1]
by id: gen flag2=1 if eventdate==eventdate[_n+1]
by id: gen flag3 =1 if flag==1 | flag2==1
egen group = group(id flag3 eventdate)
bys group (eventdate flag3): egen low_dia=min(dia_bp_copy)
bys group: gen remove =1 if low_dia!=dia_bp_copy
drop if remove==1 & group !=.
The lowest systolic pressure for a patient on a particular day is easy to define: you just sort and look for the lowest value in each block of observations.
We can refine the definition by breaking ties on systolic by values of diastolic. That's another sort. In this example, that makes no difference.
clear
input str4 id int eventdate byte dia_bp_copy int sys_bp_copy
"pat" 15698 100 140
"pat" 16183 80 120
"pat" 19226 98 155
"pat" 19375 80 130
"sue" 14296 80 120
"sue" 14334 88 127
"sue" 14334 96 158
"sue" 14334 84 136
"sue" 14403 86 124
"sue" 14403 88 134
"sue" 14403 90 156
"sue" 14403 86 134
"sue" 14403 90 124
"sue" 14431 80 120
"sue" 14431 80 140
"sue" 14431 80 130
"sue" 15456 80 130
"sue" 15501 80 120
"sue" 15596 80 120
"mary" 14998 90 154
"mary" 15165 91 179
"mary" 15280 91 156
"mary" 15386 81 154
"mary" 15952 77 133
"mary" 15952 80 144
"mary" 16390 91 159
end
bysort id eventdate (sys) : gen lowest = sys[1]
bysort id eventdate (sys dia) : gen lowest_2 = sys[1]
egen tag = tag(id eventdate)
count if lowest != lowest_2
list id event dia sys lowest* if tag, sepby(id)
+-----------------------------------------------------------+
| id eventd~e dia_bp~y sys_bp~y lowest lowest_2 |
|-----------------------------------------------------------|
1. | mary 14998 90 154 154 154 |
2. | mary 15165 91 179 179 179 |
3. | mary 15280 91 156 156 156 |
4. | mary 15386 81 154 154 154 |
5. | mary 15952 77 133 133 133 |
7. | mary 16390 91 159 159 159 |
|-----------------------------------------------------------|
8. | pat 15698 100 140 140 140 |
9. | pat 16183 80 120 120 120 |
10. | pat 19226 98 155 155 155 |
11. | pat 19375 80 130 130 130 |
|-----------------------------------------------------------|
12. | sue 14296 80 120 120 120 |
13. | sue 14334 88 127 127 127 |
16. | sue 14403 86 124 124 124 |
21. | sue 14431 80 120 120 120 |
24. | sue 15456 80 130 130 130 |
25. | sue 15501 80 120 120 120 |
26. | sue 15596 80 120 120 120 |
+-----------------------------------------------------------+
egen is very useful (disclosure of various interests there), but the main idea here is just that by: defines groups of observations, that you can do that for two or more variables rather than just one, and that you can control the sort order too. As it were, about half of egen is built on such ideas, but it can be easiest and best to use them directly.
If I understand correctly:
Create an identifier for same id and same date
egen temp_group = group(id eventdate)
Find the first occurrence based on lowest sys_bp_copy and then lowest dia_bp_copy
bys temp_group (sys_bp_copy dia_bp_copy): gen temp_first = _n
keep if temp_first == 1
drop temp*
or in one line, as suggested in a comment:
bys id eventdate (sys_bp_copy dia_bp_copy): keep if _n==1
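The parenthesized variables control only the sort order within each (id, eventdate) group, so _n == 1 is the day's lowest systolic reading, with diastolic breaking ties.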
I have a file that looks like this:
gene_id_100100 sp|Q53IZ1|ASDP_PSESP 35.81 148 90 2 13 158 6 150 6e-27 109 158 531
gene_id_100600 sp|Q49W80|Y1834_STAS1 31.31 99 63 2 1 95 279 376 7e-07 50.1 113 402
gene_id_100 sp|A7TSV7|PAN1_VANPO 36.36 44 24 1 41 80 879 922 1.9 32.3 154 1492
gene_id_10100 sp|P37348|YECE_ECOLI 32.77 177 104 6 3 172 2 170 2e-13 71.2 248 272
gene_id_101100 sp|B0U4U5|SURE_XYLFM 29.11 79 41 3 70 148 143 206 0.14 35.8 175 262
gene_id_101600 sp|Q5AWD4|BGLM_EMENI 35.90 39 25 0 21 59 506 544 4.9 30.4 129 772
gene_id_102100 sp|P20374|COX1_APILI 38.89 36 22 0 3 38 353 388 0.54 32.0 92 521
gene_id_102600 sp|Q46127|SYW_CLOLO 79.12 91 19 0 1 91 1 91 5e-44 150 92 341
gene_id_103100 sp|Q9UJX6|ANC2_HUMAN 53.57 28 13 0 11 38 608 635 2.1 28.9 42 822
gene_id_103600 sp|C1DA02|SYL_LARHH 35.59 59 30 2 88 138 382 440 4.6 30.8 140 866
gene_id_104100 sp|B8DHP2|PROB_LISMH 25.88 85 50 2 37 110 27 109 0.81 32.3 127 276
gene_id_105100 sp|A1ALU1|RL3_PELPD 31.88 69 42 2 14 77 42 110 2.2 31.6 166 209
gene_id_105600 sp|P59696|T200_SALTY 64.00 125 45 0 5 129 3 127 9e-58 182 129 152
gene_id_10600 sp|G3XDA3|CTPH_PSEAE 28.38 74 48 1 4 77 364 432 0.56 31.6 81 568
gene_id_106100 sp|P94369|YXLA_BACSU 35.00 100 56 3 25 120 270 364 4e-08 53.9 120 457
gene_id_106600 sp|P34706|SDC3_CAEEL 60.00 20 8 0 18 37 1027 1046 2.3 32.7 191 2150
Now, I need to extract the gene ID, which is the part between the two | characters in the second column. In other words, I need an output that looks like this:
Q53IZ1
Q49W80
A7TSV7
P37348
B0U4U5
Q5AWD4
P20374
Q46127
Q9UJX6
C1DA02
B8DHP2
A1ALU1
P59696
G3XDA3
P94369
P34706
I have been trying to do it using the following command:
awk '{for(i=1;i<=NF;++i){ if($i==/[A-Z][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/){print $i} } }'
but it doesn't seem to work.
Pattern matching is not really necessary. I'd suggest
awk -F\| '{print $2}' filename
This splits the line into |-delimited fields and prints the second of them.
Alternatively,
cut -d\| -f 2 filename
achieves the same.
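Incidentally, the reason the attempt in the question fails is that == tests equality, while matching a field against a regular expression uses the ~ operator. A sketch mirroring the question's own pattern:
awk -F\| '$2 ~ /^[A-Z][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]$/ {print $2}' filename
This prints the second |-delimited field only when it looks like a six-character accession.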
I am training my own classifier, using opencv_traincascade. But when I give the command 'opencv_traincascade -data facedet -vec vecfile.vec -bg negative.txt -npos 2650 -nneg 581 -nstages 20 -w 20 -h 20', it shows an error like this:
PARAMETERS:
cascadeDirName: facedet
vecFileName: vecfile.vec
bgFileName: negative.txt
numPos: 2000
numNeg: 1000
numStages: 20
precalcValBufSize[Mb] : 256
precalcIdxBufSize[Mb] : 256
stageType: BOOST
featureType: HAAR
sampleWidth: 20
sampleHeight: 20
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: BASIC
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 2000 : 2000
NEG count : acceptanceRatio 1000 : 1
Precalculation time: 3
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 1| 1|
+----+---------+---------+
| 4| 1| 1|
+----+---------+---------+
| 5| 1| 1|
+----+---------+---------+
| 6| 0.9955| 0.391|
+----+---------+---------+
END>
Parameters can not be written, because file facedet/params.xml can not be opened.
What is this error? I don't understand it. Can anyone help me solve it?
Positive samples:
/home/arya/myown/Positive/images18413.jpeg 1 1 1 113 33
/home/arya/myown/Positive/images1392.jpeg 1 113 33 107 133
/home/arya/myown/Positive/face841.jpeg 1 185 93 35 73
/home/arya/myown/Positive/images866.jpeg 2 121 26 64 68 121 26 88 123
/home/arya/myown/Positive/images83.jpeg 1 102 13 107 136
/home/arya/myown/Positive/images355.jpeg 2 92 16 224 25 92 16 117 130
/home/arya/myown/Positive/images888.jpeg 1 108 29 116 71
/home/arya/myown/Positive/images2535.jpeg 1 108 29 111 129
/home/arya/myown/Positive/images18221.jpeg 1 110 34 109 124
/home/arya/myown/Positive/images1127.jpeg 1 110 34 92 104
/home/arya/myown/Positive/images18357.jpeg 1 103 27 142 133
/home/arya/myown/Positive/images889.jpeg 1 86 25 134 124
Negative samples:
./Negative/face150.jpeg
./Negative/face1051.jpeg
./Negative/Pictures174.jpeg
./Negative/Pictures160.jpeg
./Negative/Pictures34.jpeg
./Negative/face130.jpeg
./Negative/face1.jpeg
./Negative/Pictures319.jpeg
./Negative/face1120.jpeg
./Negative/Pictures317.jpeg
./Negative/face1077.jpeg
./Negative/Pictures93.jpeg
./Negative/Pictures145.jpeg
./Negative/face1094.jpeg
./Negative/Pictures7.jpeg
Please be sure that you have already created the folder "facedet" before training your classifier, as opencv_traincascade does not create it by itself.
It needs this folder in order to create the "params.xml" file inside it.
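For example (a minimal sketch reusing the command from the question):
mkdir -p facedet    # create the output folder traincascade expects
opencv_traincascade -data facedet -vec vecfile.vec -bg negative.txt \
    -npos 2650 -nneg 581 -nstages 20 -w 20 -h 20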