Lets say I ahve a dataset as follows:
ID Cat
101 G
101 G
101 F
101 G
102 F
102 F
102 G
102 F
102 F
i want to create a variable for sequence by group variable ID,Cat(notsorted)
a count can be this way
data X1; set have; by ID, cat notsorted;
if first.cat then count=1; else count+1;run;
ID Cat count
101 G 1
101 G 2
101 F 1
101 G 1
102 F 1
102 F 2
102 G 1
102 F 1
102 F 1
however what I am looking for is.
ID Cat Seq
101 G 1
101 G 1
101 F 2
101 G 3
102 F 1
102 F 1
102 G 2
102 F 3
102 F 3
You can just use
seq+first.cat;
So every time you start a new CAT value the SEQ will increment by one.
To reset for each ID add:
if first.id then seq=1;
Related
I need to get the weight for below titles? Since they aren't follow the same way I couldn't get the desired results.
Pommes d'Aquitaine et mangue (dès 4 mois) - 2 x 130 g (Babybio)
Céréales vanille avec quinoa - à partir de 6 mois - 220 g (Babybio)
Ratatouille riz (dès 12 mois) 2 x 200 g (Babybio)
Pomme de terre, petits pois et jambon (dès 8 mois) 2 x 200 g
Fondue de carotte et maïs doux au quinoa (dès 12 mois) 230 g (Babybio)
Gourdes de fruits: pomme d'Aquitaine, poire et pêche - dès 6 mois - 4 x 90 g (Babybio)
Douceur de panais du Val de Loire, carotte et riz (dès 12 mois) 230 g (Babybio)
Expecting results:
2 x 130 g
220 g
2 x 200 g
230 g
4 x 90 g
230 g
I tried this pattern:
[0-9]+ x \d+ g
Try this:
\d+( ?x ?\d+)? g
Here Is Demo
This allow:
220 g
220x10 g
10 x23 g
45x 78 g
...
This will work. Your regex was missing only the cases that there was not x
[0-9]+( x )?\d+ g
The (...) captures everything enclosed in it, and the ? captures one occurrence or none.
I need to assign seq number by group. I have tried using seq number but got it by one group (1,2,3, etc). However, I need it by two groups. As in the example below:
Have:
Var1 Var2 Var3
101 aaa 202
101 aaa 202
101 bbb 203
101 ccc 206
101 ddd 207
102 aaa 222
102 aaa 222
102 bbb 223
Want:
Obs var1 var2 var3 seq
1 101 aaa 202 1
2 101 aaa 202 1
3 101 bbb 203 2
4 101 ccc 206 3
5 101 ddd 207 4
6 102 aaa 222 1
7 102 aaa 222 1
8 102 bbb 223 2
If you sort your data it is quite simple:
proc sort data=sashelp.class out=class;
by sex age;
run;
data class;
set class;
by sex age;
if first.sex then
seqn = 0;
if first.age then
seqn + 1;
run;
data bob;
input subjid test$ lbdy flag1$;
datalines;
101 alt 1 e
101 alt 2 .
101 alt 3 e
101 alt 4 e
101 ast 1 e
101 ast 3 e
101 ast 4 .
101 ast 5 e
102 alt 1 e
102 alt 2 .
102 alt 3 e
102 alt 4 e
102 ast 1 e
102 ast 2 .
102 ast 3 e
102 ast 4 .
102 ast 5 e
102 ast 6 e
;
run;
I need to find the consecutive value
My expected output is
count
101 alt 1 e 1
101 alt 2 .
101 alt 3 e 2
101 alt 4 e 2
101 ast 1 e 2
101 ast 3 e 2
101 ast 4 .
101 ast 5 e 1
102 alt 1 e 4
102 alt 2 e 4
102 alt 3 e 4
102 alt 4 e 4
102 ast 1 e 1
102 ast 2 .
102 ast 3 e 1
102 ast 4 .
102 ast 5 e 2
102 ast 6 e 2
Thanks In Advance
I reckon this is what you want. Hard to know from cryptic description.
data bob;
input subjid test$ lbdy flag1$;
datalines;
101 alt 1 e
101 alt 2 .
101 alt 3 e
101 alt 4 e
101 ast 1 e
101 ast 3 e
101 ast 4 .
101 ast 5 e
102 alt 1 e
102 alt 2 e
102 alt 3 e
102 alt 4 e
102 ast 1 e
102 ast 2 .
102 ast 3 e
102 ast 4 .
102 ast 5 e
102 ast 6 e
;;;;
run;
proc sort data=bob;
by subjid test lbdy;
run;
proc print;
run;
data dob2;
do _n_=1 by 1 until(last.flag1);
set bob;
by subjid test flag1 notsorted;
end;
if not missing(flag1) then count = _n_;
do _n_ = 1 to _n_;
set bob;
output;
end;
run;
proc print;
run;
I have this data:
id test test_date value
1 A 02/06/2014 12:26 11
1 B 02/06/2014 12:26 23
1 C 02/06/2014 13:17 43
1 D 02/06/2014 13:17 65
1 E 02/06/2014 13:17 34
1 F 02/06/2014 13:17 64
1 A 05/06/2014 15:14 234
1 B 05/06/2014 15:14 646
1 C 05/06/2014 16:50 44
1 E 05/06/2014 16:50 55
2 E 05/06/2014 16:50 443
2 F 05/06/2014 16:50 22
2 G 05/06/2014 16:59 445
2 B 05/06/2014 20:03 66
2 C 05/06/2014 20:03 77
2 D 05/06/2014 20:03 88
2 E 05/06/2014 20:03 44
2 F 05/06/2014 20:19 33
2 G 05/06/2014 20:19 22
I would like to transform this data into wide format like this:
id date A B C D E F G
1 02/06/2014 12:26 11 23 43 65 34 64 .
1 05/06/2014 15:14 234 646 44 . 55 . .
2 05/06/2014 16:50 . . . . 443 22 445
2 05/06/2014 20:03 . 66 77 88 44 33 22
I am using reshape command in Stata, but it is not producing required results:
reshape wide test_date value, i(id) j(test) string
Any idea how to do this?
UPDATE:
You're right that we need this missvar. I try to create this by programming, but failed. Let say with-in 2 hours of test date the batch will consider same. We have only 7 tests (A,B,C,D,E,F,G). First I try to find the time difference;
bysort id: gen diff_bd = (test_date[_n] - test_date[_n-1])/(1000*60*60)
bysort id: generate missvar = _n if diff_bd <= 2
#jfeigenbaum has given part of the answer.
The problem I see is that you are missing a variable that identifies relevant sub-groups. These sub-groups seem to be bounded by test taking values A - G. But I may be wrong.
I've included this variable in the example data set, and named it missvar. I forced this variable into the data set believing it identifies groups that, although implicit in your original post, are important for your analysis.
clear
set more off
*----- example data -----
input ///
id str1 test str30 test_date value missvar
1 A "02/06/2014 12:26" 11 1
1 B "02/06/2014 12:26" 23 1
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
2 E "05/06/2014 16:50" 443 1
2 F "05/06/2014 16:50" 22 1
2 G "05/06/2014 16:59" 445 1
2 B "05/06/2014 20:03" 66 2
2 C "05/06/2014 20:03" 77 2
2 D "05/06/2014 20:03" 88 2
2 E "05/06/2014 20:03" 44 2
2 F "05/06/2014 20:19" 33 2
2 G "05/06/2014 20:19" 22 2
end
gen double tdate = clock( test_date, "DM20Yhm")
format %tc tdate
drop test_date
list, sepby(id)
*----- what you want ? -----
reshape wide value, i(id missvar tdate) j(test) string
collapse (min) tdate value?, by(id missvar)
rename value* *
list
There should be some way of identifying the groups programmatically. Relying on the original sort order of the data is one way, but it may not be the safest. It may be the only way, but only you know that.
Edit
Regarding your comment and the "missing" variable, one way to create it is:
// one hour is 3600000 milliseconds
bysort id (tdate): gen batch = sum(tdate - tdate[_n-1] > 7200000)
For your example data, this creates a batch variable identical to my missvar. You can also use time-series operators.
Let me emphasize the need for you to be carefull when presenting your example data. It must be representative of the real one or you might get code that doesn't suit it; that includes the possibility that you don't notice it because Stata gives no error.
For example, if you have the same test, applied to the same id within the two-hour limit, then you'll lose information with this code (in the collapse). (This is not a problem in your example data.)
Edit 2
In response to another question found in the comments:
Suppose a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at a different time :
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:35" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Test A is applied at 12:26 and at 12:35. Reshape will have no problem with this, but the collapse will discard information because it is taking the minimum values amongst the id missvar groups; notice that for the variable valueA, new information (the 99) will be lost (so too happens with all other variables, but you are explicit about wanting to discard that). After the reshape but before the collapse you get:
. list, sepby(id)
+--------------------------------------------------------------------------------------------------+
| id missvar tdate valueA valueB valueC valueD valueE valueF valueG |
|--------------------------------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 . . . . . |
2. | 1 1 02jun2014 12:35:00 99 . . . . . . |
3. | 1 1 02jun2014 13:17:00 . . 43 65 34 64 . |
4. | 1 2 05jun2014 15:14:00 234 646 . . . . . |
5. | 1 2 05jun2014 16:50:00 . . 44 . 55 . . |
|--------------------------------------------------------------------------------------------------|
6. | 2 1 05jun2014 16:50:00 . . . . 443 22 . |
7. | 2 1 05jun2014 16:59:00 . . . . . . 445 |
8. | 2 2 05jun2014 20:03:00 . 66 77 88 44 . . |
9. | 2 2 05jun2014 20:19:00 . . . . . 33 22 |
+--------------------------------------------------------------------------------------------------+
Running the complete code confirms what we just said:
. list, sepby(id)
+--------------------------------------------------------------------------+
| id missvar tdate A B C D E F G |
|--------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 43 65 34 64 . |
2. | 1 2 05jun2014 15:14:00 234 646 44 . 55 . . |
|--------------------------------------------------------------------------|
3. | 2 1 05jun2014 16:50:00 . . . . 443 22 445 |
4. | 2 2 05jun2014 20:03:00 . 66 77 88 44 33 22 |
+--------------------------------------------------------------------------+
Suppose now a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at the same time:
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:26" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Then the reshape won't work. Stata complains:
values of variable test not unique within id missvar tdate
and with reason. The error is clear in signalling the problem. (If not clear, go back to help reshape and work out some exercises.) The request makes no sense given the functioning of the command.
Finally, note it's relatively easy to check if something will work or not: just try it! All that was necessary in this case was to modify a bit the example data. Go back to help files and manuals, if necessary.
The command is slightly misspecified. You want to reshape value. Look at the output you want and notice the observations are uniquely identified by id and test_date. Therefore, they should be in the i option.
reshape wide value, i(id test_date) j(test) string
This yields something close you what you want, you just need to rename a few variables to get exactly the output. Specifically:
rename test_date date
renpfix value
In the following SAS statement, what do the parameters "noobs" and "label" stand for?
proc print data-sasuser.schedule noobs label;
per SAS 9.2 documentation on PROC PRINT:
"NOOBS - Suppress the column in the output that identifies each observation by number"
"LABEL - Use variables' labels as column headings"
noobs don't show you the column of observations number
(1,2,3,4,5,....)
my first title
results without noobs
Obs name sex group height weight
1 mike m a 21 150
2 henry m b 30 140
3 norian f b 18 130
4 nadine f b 32 135
5 dianne f a 23 135
results with noobs
my first title
name sex group height weight
mike m a 21 150
henry m b 30 140
norian f b 18 130
nadine f b 32 135
dianne f a 23 135