Select lowest value per group - grouping

This question is related to Stata: select the minimum of each observation.
I have data as follows:
clear
input str4 id int eventdate byte dia_bp_copy int sys_bp_copy
"pat" 15698 100 140
"pat" 16183 80 120
"pat" 19226 98 155
"pat" 19375 80 130
"sue" 14296 80 120
"sue" 14334 88 127
"sue" 14334 96 158
"sue" 14334 84 136
"sue" 14403 86 124
"sue" 14403 88 134
"sue" 14403 90 156
"sue" 14403 86 134
"sue" 14403 90 124
"sue" 14431 80 120
"sue" 14431 80 140
"sue" 14431 80 130
"sue" 15456 80 130
"sue" 15501 80 120
"sue" 15596 80 120
"mary" 14998 90 154
"mary" 15165 91 179
"mary" 15280 91 156
"mary" 15386 81 154
"mary" 15952 77 133
"mary" 15952 80 144
"mary" 16390 91 159
end
Some people have multiple readings on one day, e.g. Sue on 31st March 1999. I want to select the lowest reading per day.
Here is my code which gets me some of the way. It is clunky and clumsy and I am looking for help to do what I want to do in a more straightforward way.
*make flag for repeat observations on same day
sort id eventdate
by id: gen flag =1 if eventdate==eventdate[_n-1]
by id: gen flag2=1 if eventdate==eventdate[_n+1]
by id: gen flag3 =1 if flag==1 | flag2==1
drop flag flag2
* group repeat observations together
egen group = group(id flag3 eventdate)
* find lowest `sys_bp_copy` value per group
bys group (eventdate flag3): egen low_sys=min(sys_bp_copy)
*remove the observations where the lowest value of `sys_bp_copy` doesn't exist
bys group: gen remove =1 if low_sys!=sys_bp_copy
drop if remove==1 & group !=.
**Problems with this and where I'd like help**
The problem with the above approach is that for Sue, two of her repeat readings have the same value of sys_bp_copy, so my approach above leaves me with multiple readings for her.
In this instance I would like to refer to `dia_bp_copy` and select the lowest value there to help me pick out one row per person when multiple readings are in place. Code for this is below - but there must be a simpler way to do this?
drop flag3 remove group
sort id eventdate
by id: gen flag =1 if eventdate==eventdate[_n-1]
by id: gen flag2=1 if eventdate==eventdate[_n+1]
by id: gen flag3 =1 if flag==1 | flag2==1
egen group = group(id flag3 eventdate)
bys group (eventdate flag3): egen low_dia=min(dia_bp_copy)
bys group: gen remove =1 if low_dia!=dia_bp_copy
drop if remove==1 & group !=.

The lowest systolic pressure for a patient on a particular day is easy to define: you just sort and look for the lowest value in each block of observations.
We can refine the definition by breaking ties on systolic by values of diastolic. That's another sort. In this example, that makes no difference.
clear
input str4 id int eventdate byte dia_bp_copy int sys_bp_copy
"pat" 15698 100 140
"pat" 16183 80 120
"pat" 19226 98 155
"pat" 19375 80 130
"sue" 14296 80 120
"sue" 14334 88 127
"sue" 14334 96 158
"sue" 14334 84 136
"sue" 14403 86 124
"sue" 14403 88 134
"sue" 14403 90 156
"sue" 14403 86 134
"sue" 14403 90 124
"sue" 14431 80 120
"sue" 14431 80 140
"sue" 14431 80 130
"sue" 15456 80 130
"sue" 15501 80 120
"sue" 15596 80 120
"mary" 14998 90 154
"mary" 15165 91 179
"mary" 15280 91 156
"mary" 15386 81 154
"mary" 15952 77 133
"mary" 15952 80 144
"mary" 16390 91 159
end
bysort id eventdate (sys) : gen lowest = sys[1]
bysort id eventdate (sys dia) : gen lowest_2 = sys[1]
egen tag = tag(id eventdate)
count if lowest != lowest_2
list id event dia sys lowest* if tag, sepby(id)
+-----------------------------------------------------------+
| id eventd~e dia_bp~y sys_bp~y lowest lowest_2 |
|-----------------------------------------------------------|
1. | mary 14998 90 154 154 154 |
2. | mary 15165 91 179 179 179 |
3. | mary 15280 91 156 156 156 |
4. | mary 15386 81 154 154 154 |
5. | mary 15952 77 133 133 133 |
7. | mary 16390 91 159 159 159 |
|-----------------------------------------------------------|
8. | pat 15698 100 140 140 140 |
9. | pat 16183 80 120 120 120 |
10. | pat 19226 98 155 155 155 |
11. | pat 19375 80 130 130 130 |
|-----------------------------------------------------------|
12. | sue 14296 80 120 120 120 |
13. | sue 14334 88 127 127 127 |
16. | sue 14403 86 124 124 124 |
21. | sue 14431 80 120 120 120 |
24. | sue 15456 80 130 130 130 |
25. | sue 15501 80 120 120 120 |
26. | sue 15596 80 120 120 120 |
+-----------------------------------------------------------+
egen is very useful (disclosure of various interests there), but the main idea here is just that by: defines groups of observations, and you can do that for two or more variables, not just one -- and control the sort order too. About half of egen is built on such ideas, but it can be easiest and best to use them directly.

If I understand:
Create an identifier for same id and same date
egen temp_group = group(id eventdate)
Find the first occurrence based on lowest sys_bp_copy and then lowest dia_bp_copy
bys temp_group (sys_bp_copy dia_bp_copy): gen temp_first = _n
keep if temp_first == 1
drop temp*
or, in one line, as suggested in a comment:
bys id eventdate (sys_bp_copy dia_bp_copy): keep if _n==1
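For comparison, the same sort-within-group, keep-the-first-row idea can be sketched in pandas (column names assumed from the Stata example above; this is a translation, not part of the original answer):

```python
import pandas as pd

# A small subset of the example data, with the same column names.
df = pd.DataFrame({
    "id": ["sue", "sue", "sue", "pat"],
    "eventdate": [14334, 14334, 14334, 15698],
    "dia_bp_copy": [88, 96, 84, 100],
    "sys_bp_copy": [127, 158, 136, 140],
})

# Sort so the lowest sys_bp_copy (ties broken by dia_bp_copy) comes
# first within each id/eventdate block, then keep one row per block.
lowest = (df.sort_values(["id", "eventdate", "sys_bp_copy", "dia_bp_copy"])
            .drop_duplicates(subset=["id", "eventdate"], keep="first"))
```

This mirrors `bys id eventdate (sys_bp_copy dia_bp_copy): keep if _n==1`: the sort keys define the order and `drop_duplicates` plays the role of `keep if _n==1`.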


Filling missing observations with equal parts of the existing observation (Stata)

I would like to fill the missing observation(s) with the value of the next cell and distribute it equally over the missing rows.
For example, using the data below, I would fill the values for 2004m1 and 2004m2 with 142 and also replace the value for 2004m3 with 142, as we fill two missings (142 = 426/3). For 2005m7/m8 it would be 171, etc. I am able to fill the missings with reversed sorting and carryforward, but I cannot figure out how to redistribute the values, especially since the number of rows I try to fill can vary and it is not simply [_n+1].
My try to fill the values (but this does not redistribute):
carryforward value, gen(value_filled)
Example data set:
date_m value
2005m12 56
2005m11 150
2005m10 190
2005m9 157
2005m8 342
2005m7 .
2005m6 181
2005m5 151
2005m4 107
2005m3 131
2005m2 247
2005m1 100
2004m12 77
2004m11 181
2004m10 132
2004m9 153
2004m8 380
2004m7 .
2004m6 174
2004m5 178
2004m4 104
2004m3 426
2004m2 .
2004m1 .
Expected result
date_m value
2005m12 56
2005m11 150
2005m10 190
2005m9 157
2005m8 171
2005m7 171
2005m6 181
2005m5 151
2005m4 107
2005m3 131
2005m2 247
2005m1 100
2004m12 77
2004m11 181
2004m10 132
2004m9 153
2004m8 190
2004m7 190
2004m6 174
2004m5 178
2004m4 104
2004m3 142
2004m2 142
2004m1 142
Thanks for your data example, which is helpful, but as detailed in the Stata tag wiki and on Statalist, an example using dataex is even better. Date and time variables are especially awkward otherwise.
You allude to carryforward, which is from SSC and which many have found useful. Having written the FAQ on this, my prejudice is that most such problems yield quickly and directly to sorting, subscripting, and replace. Your problem is trickier than most in including a value to be divided after an unpredictable gap of missing values.
This works for your example and doesn't rule out a simpler solution.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float date int mvalue
551 56
550 150
549 190
548 157
547 342
546 .
545 181
544 151
543 107
542 131
541 247
540 100
539 77
538 181
537 132
536 153
535 380
534 .
533 174
532 178
531 104
530 426
529 .
528 .
end
format %tm date
gsort -date
gen copy = mvalue
replace copy = copy[_n-1] if missing(copy)
gen gap = missing(mvalue[_n+1]) | missing(mvalue)
replace gap = gap + gap[_n-1] if gap == 1 & _n > 1
sort date
replace gap = gap[_n-1] if inrange(gap[_n-1], 1, .) & gap >= 1
gen wanted = cond(gap, copy/gap, copy)
list , sepby(gap)
+----------------------------------------+
| date mvalue copy gap wanted |
|----------------------------------------|
1. | 2004m1 . 426 3 142 |
2. | 2004m2 . 426 3 142 |
3. | 2004m3 426 426 3 142 |
|----------------------------------------|
4. | 2004m4 104 104 0 104 |
5. | 2004m5 178 178 0 178 |
6. | 2004m6 174 174 0 174 |
|----------------------------------------|
7. | 2004m7 . 380 2 190 |
8. | 2004m8 380 380 2 190 |
|----------------------------------------|
9. | 2004m9 153 153 0 153 |
10. | 2004m10 132 132 0 132 |
11. | 2004m11 181 181 0 181 |
12. | 2004m12 77 77 0 77 |
13. | 2005m1 100 100 0 100 |
14. | 2005m2 247 247 0 247 |
15. | 2005m3 131 131 0 131 |
16. | 2005m4 107 107 0 107 |
17. | 2005m5 151 151 0 151 |
18. | 2005m6 181 181 0 181 |
|----------------------------------------|
19. | 2005m7 . 342 2 171 |
20. | 2005m8 342 342 2 171 |
|----------------------------------------|
21. | 2005m9 157 157 0 157 |
22. | 2005m10 190 190 0 190 |
23. | 2005m11 150 150 0 150 |
24. | 2005m12 56 56 0 56 |
+----------------------------------------+
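The same redistribute-over-the-gap logic can be sketched outside Stata too. A hedged pandas translation (mine, not part of the original answer), assuming values are in ascending date order so each observed value covers the run of missings immediately before it:

```python
import pandas as pd

# Ascending date order: 2004m1, 2004m2, 2004m3 = 426, 2004m4, ...
s = pd.Series([None, None, 426.0, 104.0, 178.0, 174.0, None, 380.0])

# Each run of missing values is covered by the NEXT observed value,
# so label groups from the bottom up: reverse, cumulate non-missing
# flags, reverse back.
grp = s.notna()[::-1].cumsum()[::-1]

# Spread each group's single observed value equally over its rows
# (sum() skips NaN; len() counts all rows in the group).
wanted = s.groupby(grp).transform(lambda x: x.sum() / len(x))
# wanted → [142, 142, 142, 104, 178, 174, 190, 190]
```

This reproduces the `gap`/`copy` bookkeeping of the Stata answer in two lines: `grp` plays the role of the group boundaries, and the transform does the division.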

Get frequency from dataset with repeated measurements over time

This is my problem: I have a dataset that has 10 measurements over time, something like this:
ID Expenditure Age
25 100 89
25 102 89
25 178 89
25 290 89
25 200 89
.
.
.
26 100 79
26 102 79
26 178 79
26 290 79
26 200 79
.
.
.
27 100 80
27 102 80
27 178 80
27 290 80
27 200 80
.
.
.
Now I want to obtain the frequency of age, so I did this:
proc freq data=Expenditure;
table Age / out= Age_freq outexpect sparse;
run;
Output:
Age Frequency Count Percent of total frequency
79 10 0.1
80 140 1.4
89 50 0.5
The problem is that this counts all rows, but doesn't take into account the repeated measurements per id. So I wanted to create a new column with the actual frequencies, like this:
data Age;
set Age_freq;
freq = Frequency Count /10;
run;
but I think SAS doesn't recognize this 'Frequency Count' variable. Can anybody give me some insight on this?
Thanks
You have to remove the duplicate records so that each ID has one record containing the age.
Solution: create a new table with the distinct values of ID and Age, then run the proc freq.
Code:
I created a new table called Expenditure_ids that doesn't have any duplicate values for the ID & Age.
data Expenditure;
input ID Expenditure Age ;
datalines;
25 100 89
25 102 89
25 178 89
25 290 89
25 200 89
26 100 79
26 102 79
26 178 79
26 290 79
26 200 79
27 100 80
27 102 80
27 178 80
27 290 80
27 200 80
28 100 80
28 102 80
28 178 80
28 290 80
28 200 80
;
run;
proc sql;
create table Expenditure_ids as
select distinct ID, Age from Expenditure ;
quit;
proc freq data=Expenditure_ids;
table Age / out= Age_freq outexpect sparse;
run;
Output:
Age=79 COUNT=1 PERCENT=25
Age=80 COUNT=2 PERCENT=50
Age=89 COUNT=1 PERCENT=25
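The same de-duplicate-then-count approach, sketched in pandas for comparison (column names taken from the SAS step above; this is an illustration, not part of the original answer):

```python
import pandas as pd

# Same layout as the SAS datalines: 5 repeated measurements per ID.
df = pd.DataFrame({
    "ID": [25]*5 + [26]*5 + [27]*5 + [28]*5,
    "Age": [89]*5 + [79]*5 + [80]*5 + [80]*5,
})

# Keep one record per ID/Age pair (the proc sql "select distinct"),
# then count ages (the proc freq).
freq = df.drop_duplicates(subset=["ID", "Age"])["Age"].value_counts().sort_index()
# freq → Age 79: 1, 80: 2, 89: 1
```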

Year to date vs Year to date last year | Pandas

I would like to calculate the Year to date (YTD) value for this year and compare it to the same period last year in Pandas. My df looks like this:
Month Product A Product B
2015-01-01 24 62
2015-02-01 46 24
2015-03-01 30 70
2015-04-01 26 51
2015-05-01 34 42
2015-06-01 45 35
2015-07-01 25 13
2015-08-01 98 95
2015-09-01 6 81
2015-10-01 93 38
2015-11-01 98 59
2015-12-01 98 1
2016-01-01 67 42
2016-02-01 72 34
2016-03-01 7 6
2016-04-01 19 24
2016-05-01 82 38
2016-06-01 15 79
2016-07-01 49 83
2016-08-01 97 56
The two values I am after for product A are
YTD = 408 and YTD SPLY = 328 (sum Jan-Aug 2016, sum Jan-Aug 2015).
When a new month is added to the df, I would like the formula to calculate Jan-Sep, and so on.
Any ideas how to proceed?
Not exactly sure what you want, but it looks like you want to take the cumulative sum for each year.
df[['A_cumsum', 'B_cumsum']] = df.resample('A', on='Month').transform('cumsum')
Month Product A Product B A_cumsum B_cumsum
0 2015-01-01 24 62 24 62
1 2015-02-01 46 24 70 86
2 2015-03-01 30 70 100 156
3 2015-04-01 26 51 126 207
4 2015-05-01 34 42 160 249
5 2015-06-01 45 35 205 284
6 2015-07-01 25 13 230 297
7 2015-08-01 98 95 328 392
8 2015-09-01 6 81 334 473
9 2015-10-01 93 38 427 511
10 2015-11-01 98 59 525 570
11 2015-12-01 98 1 623 571
12 2016-01-01 67 42 67 42
13 2016-02-01 72 34 139 76
14 2016-03-01 7 6 146 82
15 2016-04-01 19 24 165 106
16 2016-05-01 82 38 247 144
17 2016-06-01 15 79 262 223
18 2016-07-01 49 83 311 306
19 2016-08-01 97 56 408 362
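The cumulative sums above contain the two requested numbers in their last rows, but the YTD and YTD SPLY figures can also be computed directly by filtering on the latest month. A sketch (Product A only, for brevity):

```python
import pandas as pd

# Product A values from the question, Jan 2015 through Aug 2016.
df = pd.DataFrame({
    "Month": pd.date_range("2015-01-01", periods=20, freq="MS"),
    "Product A": [24, 46, 30, 26, 34, 45, 25, 98, 6, 93, 98, 98,
                  67, 72, 7, 19, 82, 15, 49, 97],
})

last = df["Month"].max()  # latest month in the data

# YTD: current year, months up to the latest month.
this_ytd = df[(df["Month"].dt.year == last.year) &
              (df["Month"].dt.month <= last.month)]["Product A"].sum()

# YTD SPLY: previous year, same span of months.
sply = df[(df["Month"].dt.year == last.year - 1) &
          (df["Month"].dt.month <= last.month)]["Product A"].sum()
# this_ytd → 408, sply → 328
```

Because `last` is recomputed from the data, appending September's row automatically extends both windows to Jan-Sep, as the question asks.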

Creating statistical data from a table

I have a table with 20 columns of measurements. I would like to 'convert' the table into a table with 20 rows and columns of Avg, Min, Max, StdDev, Count types of information. There is another question like this, but it was for the R language.
I could do the following for each column (processing the results with C++):
Select Count(Case When [avgZ_l1] <= 0.15 and avgZ_l1 > 0 then 1 end) as countValue1,
Count(case when [avgZ_l1] <= 0.16 and avgZ_l1 > 0.15 then 1 end) as countValue2,
Count(case when [avgZ_l1] <= 0.18 and avgZ_l1 > 0.16 then 1 end) as countValue3,
Count(case when [avgZ_l1] <= 0.28 and avgZ_l1 > 0.18 then 1 end) as countValue4,
Avg(avgwall_l1) as avg1, Min(avgwall_l1) as min1, Max(avgZ_l1) as max1,
STDEV(avgZ_l1) as stddev1, count(*) as totalCount from myProject.dbo.table1
But I do not want to process the 50,000 records 20 times (once for each column). I thought there would be a way to 'pivot' the table onto its side and process the data at the same time. I have seen examples of 'Pivot', but they all seem to pivot on an integer type field, like a month number or device ID. Once the table is converted I could then fetch each row with C++. Maybe this is really just 'Insert into ... select ... from' statements.
Would the fastest (execution time) approach be to simply create a really long select statement that returns all the information I want for all the columns?
We might end up with 500,000 rows. I am using C++ and SQL 2014.
Any thoughts or comments are welcome. I just don't want to have my naive code used as a shining example of how NOT to do something... ;)...
If your table looks the same as the data that you sent in R, then the following query should work for you. It selects the data that you requested and pivots it at the same time.
create table #temp(ID int identity(1,1),columnName nvarchar(50));
insert into #temp
SELECT COLUMN_NAME as columnName
FROM myProject.INFORMATION_SCHEMA.COLUMNS -- change myProject to the name of your database
WHERE TABLE_NAME = N'table1'; -- change table1 to the table you're looking at
declare @TableName nvarchar(50) = 'table1'; -- change table1 to your table again
declare @loop int = 1;
declare @query nvarchar(max) = '';
declare @columnName nvarchar(50);
declare @endQuery nvarchar(max) = '';
while (@loop <= (select count(*) from #temp))
begin
set @columnName = (select columnName from #temp where ID = @loop);
set @query = 'select t.columnName, avg(['+@columnName+']) as Avg, min(['+@columnName+']) as Min, max(['+@columnName+']) as Max, stdev(['+@columnName+']) as StDev, count(*) as TotalCount from '+@TableName+' join #temp t on t.columnName = '''+@columnName+''' group by t.columnName';
set @loop += 1;
set @endQuery += 'union all('+ @query + ')';
end;
set @endQuery = stuff(@endQuery,1,9,'');
Execute(@endQuery);
drop table #temp;
It creates a #temp table which stores your column headings next to an ID. It then uses the ID while looping through the columns, generating a query for each one and unioning the pieces together. This query works for any number of columns, so if you add or remove columns it should still give the correct result.
With this input:
age height_seca1 height_chad1 height_DL weight_alog1
1 19 1800 1797 180 70
2 19 1682 1670 167 69
3 21 1765 1765 178 80
4 21 1829 1833 181 74
5 21 1706 1705 170 103
6 18 1607 1606 160 76
7 19 1578 1576 156 50
8 19 1577 1575 156 61
9 21 1666 1665 166 52
10 17 1710 1716 172 65
11 28 1616 1619 161 66
12 22 1648 1644 165 58
13 19 1569 1570 155 55
14 19 1779 1777 177 55
15 18 1773 1772 179 70
16 18 1816 1809 181 81
17 19 1766 1765 178 77
18 19 1745 1741 174 76
19 18 1716 1714 170 71
20 21 1785 1783 179 64
21 19 1850 1854 185 71
22 31 1875 1880 188 95
23 26 1877 1877 186 106
24 19 1836 1837 185 100
25 18 1825 1823 182 85
26 19 1755 1754 174 79
27 26 1658 1658 165 69
28 20 1816 1818 183 84
29 18 1755 1755 175 67
It will produce this output:
avg min max stdev totalcount
age 20 17 31 3.3 29
height_seca1 1737 1569 1877 91.9 29
height_chad1 1736 1570 1880 92.7 29
height_DL 173 155 188 9.7 29
weight_alog1 73 50 106 14.5 29
Hope this helps and works for you. :)
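For comparison, the same pivoted summary is a one-liner in pandas, which may also help sanity-check the dynamic SQL (data here is a subset of the input table above):

```python
import pandas as pd

# First five rows of two of the columns from the example input.
df = pd.DataFrame({
    "age": [19, 19, 21, 21, 21],
    "height_seca1": [1800, 1682, 1765, 1829, 1706],
})

# describe() computes summary stats per column; transposing gives one
# row per column, matching the pivoted layout of the SQL answer.
summary = df.describe().T[["mean", "min", "max", "std", "count"]]
```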

AWK - Printing a specific pattern

I have a file that looks like this
gene_id_100100 sp|Q53IZ1|ASDP_PSESP 35.81 148 90 2 13 158 6 150 6e-27 109 158 531
gene_id_100600 sp|Q49W80|Y1834_STAS1 31.31 99 63 2 1 95 279 376 7e-07 50.1 113 402
gene_id_100 sp|A7TSV7|PAN1_VANPO 36.36 44 24 1 41 80 879 922 1.9 32.3 154 1492
gene_id_10100 sp|P37348|YECE_ECOLI 32.77 177 104 6 3 172 2 170 2e-13 71.2 248 272
gene_id_101100 sp|B0U4U5|SURE_XYLFM 29.11 79 41 3 70 148 143 206 0.14 35.8 175 262
gene_id_101600 sp|Q5AWD4|BGLM_EMENI 35.90 39 25 0 21 59 506 544 4.9 30.4 129 772
gene_id_102100 sp|P20374|COX1_APILI 38.89 36 22 0 3 38 353 388 0.54 32.0 92 521
gene_id_102600 sp|Q46127|SYW_CLOLO 79.12 91 19 0 1 91 1 91 5e-44 150 92 341
gene_id_103100 sp|Q9UJX6|ANC2_HUMAN 53.57 28 13 0 11 38 608 635 2.1 28.9 42 822
gene_id_103600 sp|C1DA02|SYL_LARHH 35.59 59 30 2 88 138 382 440 4.6 30.8 140 866
gene_id_104100 sp|B8DHP2|PROB_LISMH 25.88 85 50 2 37 110 27 109 0.81 32.3 127 276
gene_id_105100 sp|A1ALU1|RL3_PELPD 31.88 69 42 2 14 77 42 110 2.2 31.6 166 209
gene_id_105600 sp|P59696|T200_SALTY 64.00 125 45 0 5 129 3 127 9e-58 182 129 152
gene_id_10600 sp|G3XDA3|CTPH_PSEAE 28.38 74 48 1 4 77 364 432 0.56 31.6 81 568
gene_id_106100 sp|P94369|YXLA_BACSU 35.00 100 56 3 25 120 270 364 4e-08 53.9 120 457
gene_id_106600 sp|P34706|SDC3_CAEEL 60.00 20 8 0 18 37 1027 1046 2.3 32.7 191 2150
Now, I need to extract the gene ID, which is the one between || in the second column. In other words, I need an output that looks like this:
Q53IZ1
Q49W80
A7TSV7
P37348
B0U4U5
Q5AWD4
P20374
Q46127
Q9UJX6
C1DA02
B8DHP2
A1ALU1
P59696
G3XDA3
P94369
P34706
I have been trying to do it using the following command:
awk '{for(i=1;i<=NF;++i){ if($i==/[A-Z][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/){print $i} } }'
but it doesn't seem to work.
Pattern matching is not really necessary. I'd suggest
awk -F\| '{print $2}' filename
This splits the line into |-delimited fields and prints the second of them.
Alternatively,
cut -d\| -f 2 filename
achieves the same.
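The same split-on-| idea carries over directly to other tools; for instance, a minimal Python equivalent of the awk and cut one-liners:

```python
# One line from the input file.
line = ("gene_id_100100 sp|Q53IZ1|ASDP_PSESP 35.81 148 90 2 "
        "13 158 6 150 6e-27 109 158 531")

# Split the whole line on "|"; the accession is the second field,
# exactly as with awk -F\| '{print $2}' or cut -d\| -f 2.
accession = line.split("|")[1]
# accession → "Q53IZ1"
```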