I have this data
ID A1 A2 B1 B2 C
1 0 1 2 3 4
2 5 6 7 8 9
Here, A1 means A at year 1, A2 means A at year 2. Same goes for B.
I want to make a data where each row is ID-year pair, not just ID.
Like this:
ID year A B C
1 1 0 2 4
1 2 1 3 4
2 1 5 7 9
2 2 6 8 9
Luckily, there are same number of years of A and B.
Honestly I am stuck and all I could come up was just create the desired data structure first and manually copy and paste things. But the data is too big to do it manually.
How should I go about it?
EDIT:
The names of the variables should be more like below:
ID A00 A01 B00 B01 C
1 0 1 2 3 4
2 5 6 7 8 9
See help for the reshape command. It's a reshape long problem.
clear
input ID A1 A2 B1 B2 C
1 0 1 2 3 4
2 5 6 7 8 9
end
reshape long A B , i(ID) j(Year)
list, sepby(ID)
+-----------------------+
| ID Year A B C |
|-----------------------|
1. | 1 1 0 2 4 |
2. | 1 2 1 3 4 |
|-----------------------|
3. | 2 1 5 7 9 |
4. | 2 2 6 8 9 |
+-----------------------+
Related
This question already has answers here:
variable showing the highest value attained of another variable, recorded so far, over time
(2 answers)
Closed 1 year ago.
I want to generate a variable max_count wherein, for a given group ID, if the value of count for the current year is higher than for the previous year then max_count takes the value for the current year. The value for the current year will be applied to the succeeding years until a higher value than that in the current year occurs. For instance, in the example below for ID 2, the value of count in 2001 is 10 but the succeeding years (2002 and 2003) have values less than 10 (i.e. 2 and 4) so 2002 and 2003 then take the value of 10 (the highest value after 2001).
I used this Stata code but it doesn't work:
bysort id (Year): gen max_count=max(count, count[_n-1])
The highest value is only applied to the immediately succeeding year and not to all succeeding years.
ID Year count max_count
1 2000 5 5
1 2001 0 5
1 2002 3 5
1 2003 7 7
2 2000 5 5
2 2001 10 10
2 2002 2 10
2 2003 4 10
3 2000 2 2
3 2001 5 5
3 2002 9 9
3 2003 6 9
clear
input ID Year count max_count
1 2000 5 5
1 2001 0 5
1 2002 3 5
1 2003 7 7
2 2000 5 5
2 2001 10 10
2 2002 2 10
2 2003 4 10
3 2000 2 2
3 2001 5 5
3 2002 9 9
3 2003 6 9
end
bysort ID (Year) : gen wanted = count[1]
by ID : replace wanted = max(wanted[_n-1], count) if _n > 1
list, sepby(ID)
+---------------------------------------+
| ID Year count max_co~t wanted |
|---------------------------------------|
1. | 1 2000 5 5 5 |
2. | 1 2001 0 5 5 |
3. | 1 2002 3 5 5 |
4. | 1 2003 7 7 7 |
|---------------------------------------|
5. | 2 2000 5 5 5 |
6. | 2 2001 10 10 10 |
7. | 2 2002 2 10 10 |
8. | 2 2003 4 10 10 |
|---------------------------------------|
9. | 3 2000 2 2 2 |
10. | 3 2001 5 5 5 |
11. | 3 2002 9 9 9 |
12. | 3 2003 6 9 9 |
+---------------------------------------+
There is a detailed discussion of how to get such records (the maximum or minimum so far is the "record", as in sport) in this Stata FAQ.
For a one-line solution, install rangestat from SSC and then
rangestat (max) WANTED = count, int(Year . 0) by(ID)
The problem of when the record occurred is naturally related:
by ID : gen when = Year[1]
by ID : replace when = cond(wanted > wanted[_n-1], Year, when[_n-1]) if _n > 1
I have data which is as follows:
data have;
length
group 8
replicate $ 1
day 8
observation 8
;
input (_all_) (:);
datalines;
1 A 1 0
1 A 1 5
1 A 1 3
1 A 1 3
1 A 2 7
1 A 2 2
1 A 2 4
1 A 2 2
1 B 1 1
1 B 1 3
1 B 1 8
1 B 1 0
1 B 2 3
1 B 2 8
1 B 2 1
1 B 2 3
1 C 1 1
1 C 1 5
1 C 1 2
1 C 1 7
1 C 2 2
1 C 2 1
1 C 2 4
1 C 2 1
2 A 1 7
2 A 1 5
2 A 1 3
2 A 1 1
2 A 2 0
2 A 2 5
2 A 2 3
2 A 2 0
2 B 1 0
2 B 1 3
2 B 1 4
2 B 1 8
2 B 2 1
2 B 2 3
2 B 2 4
2 B 2 0
2 C 1 0
2 C 1 4
2 C 1 3
2 C 1 1
2 C 2 2
2 C 2 3
2 C 2 0
2 C 2 1
3 A 1 4
3 A 1 5
3 A 1 6
3 A 1 7
3 A 2 3
3 A 2 1
3 A 2 5
3 A 2 2
3 B 1 2
3 B 1 0
3 B 1 2
3 B 1 3
3 B 2 0
3 B 2 6
3 B 2 3
3 B 2 7
3 C 1 7
3 C 1 5
3 C 1 3
3 C 1 1
3 C 2 0
3 C 2 3
3 C 2 2
3 C 2 1
;
run;
I want to split observation into two columns based on day.
observation_ observation_
Obs group replicate day_1 day_2
1 1 A 0 7
2 1 A 5 2
3 1 A 3 4
4 1 A 3 2
5 1 B 1 3
6 1 B 3 8
7 1 B 8 1
8 1 B 0 3
9 1 C 1 2
10 1 C 5 1
11 1 C 2 4
12 1 C 7 1
13 2 A 7 0
14 2 A 5 5
15 2 A 3 3
16 2 A 1 0
17 2 B 0 1
18 2 B 3 3
19 2 B 4 4
20 2 B 8 0
21 2 C 0 2
22 2 C 4 3
23 2 C 3 0
24 2 C 1 1
25 3 A 4 3
26 3 A 5 1
27 3 A 6 5
28 3 A 7 2
29 3 B 2 0
30 3 B 0 6
31 3 B 2 3
32 3 B 3 7
33 3 C 7 0
34 3 C 5 3
35 3 C 3 2
36 3 C 1 1
The observant SO reader will notice that I have asked essentially the same question previously. However, because of SAS's obsession with "levels" and "by groups", since the variable being used to split the variable of interest isn't binary, that solution doesn't generalize.
Trying it directly, the following occurs:
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test;
by
group
replicate
;
var observation;
id day;
run;
ERROR: The ID value "_1" occurs twice in the same BY group.
I can use a LET statement to repress the errors, but in addition to cluttering up the log, SAS retains only the last observation of each BY group.
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test let;
by
group
replicate
;
var observation;
id day;
run;
Obs group replicate _NAME_ _1 _2
1 1 A observation 3 2
2 1 B observation 0 3
3 1 C observation 7 1
4 2 A observation 1 0
5 2 B observation 8 0
6 2 C observation 1 1
7 3 A observation 7 2
8 3 B observation 3 7
9 3 C observation 1 1
I don't doubt there's some kludgy way it could be done, such as splitting each group into a separate data set and then re-merging them. It seems like it should be doable with PROC TRANSPOSE, although how escapes me. Any ideas?
Not sure what you're talking about with "SAS's obsession...", but the issue here is fairly straightforward; you need to tell SAS about the four rows (or whatever) being separate, distinct rows. by tells SAS what the row-level ID is, but you're lying to it when you say by group replicate, since there are still multiple rows under that. So you need to have a unique key. (This would be true in any database-like language, nothing unique to SAS here. )
I would do this - make a day_row field, then sort by that.
data have_id;
set have;
by group replicate day;
if first.day then day_row = 0;
day_row+1;
run;
proc sort data=have_id;
by group replicate day_row;
run;
proc transpose data=have_id out=want(drop=_name_) prefix=observation_day_;
by group replicate day_row;
var observation;
id day;
run;
Your output looks like you don't want to transpose the data but instead just want split it into DAY1 and DAY2 sets and merge them back together. This will just pair the multiple readings per BY group in the same order that they appear, which is what it looks like you did in your example.
data want ;
merge
have(where=(day=1) rename=(observation=day_1))
have(where=(day=2) rename=(observation=day_2))
;
by group replicate;
drop day ;
run;
You can read the source data as many times as you need for the number of values of DAY.
If you think that you might not have the same number of observations per BY group for each DAY then you should add these statements at the end of the data step.
output;
call missing(of day_:);
I have a Stata dataset organized as follows:
payment class molecule State
10 1 1 1
8 2 1 1
25 3 2 1
7 4 2 1
12 1 1 2
5 2 1 2
24 3 2 2
7 4 2 2
How do I create a variable that is the difference of the payment variable between classes within the same molecule?
Expected output:
payment class molecule State payment_difference
10 1 1 1 2
8 2 1 1 2
25 3 2 1 18
7 4 2 1 18
12 1 1 2 7
5 2 1 2 7
24 3 2 2 17
7 4 2 2 17
Using your toy example:
clear
input payment class molecule state
10 1 1 1
8 2 1 1
25 3 2 1
7 4 2 1
12 1 1 2
5 2 1 2
24 3 2 2
7 4 2 2
end
The following works for me:
bysort state molecule (class) : generate diff = payment[1] - payment[2]
list, separator(0)
+-------------------------------------------+
| payment class molecule state diff |
|-------------------------------------------|
1. | 10 1 1 1 2 |
2. | 8 2 1 1 2 |
3. | 25 3 2 1 18 |
4. | 7 4 2 1 18 |
5. | 12 1 1 2 7 |
6. | 5 2 1 2 7 |
7. | 24 3 2 2 17 |
8. | 7 4 2 2 17 |
+-------------------------------------------+
For details, read Speaking Stata: How to move step by: step
on Stata Journal.
I have dataframe look like this:
a b c d e
0 0 1 2 1 0
1 3 0 0 4 3
2 3 4 0 4 2
3 4 1 0 4 3
4 2 1 3 4 3
5 3 2 0 3 3
6 2 1 1 1 0
7 1 1 0 3 3
8 3 3 3 3 4
9 2 3 4 2 2
I do following command:
df.groupby('A').sum()
And i get:
b c d e
a
0 1 2 1 0
1 1 0 3 3
2 5 8 7 5
3 9 3 14 12
4 1 0 4 3
And after that I want to access
labels = df['A']
But I have an error that there are no such column.
So does pandas have some syntax to get something like this?
a b c d e
0 0 1 2 1 0
1 1 1 0 3 3
2 2 5 8 7 5
3 3 9 3 14 12
4 4 1 0 4 3
I need to sum all values of columns b, c, d, e to column a with the relevant index
You can just access the index with df.index, and add it back into your dataframe as another column.
grouped_df = df.groupby('A').sum()
grouped_df['A'] = grouped_df.index
grouped_df.sum(axis=1)
Alternatively, groupby has 'as_index' option to keep the column 'A'
groupby('A', as_index=False)
or, after groupby, you can use reset_index to put the column 'A' back.
I have the following data:
id tests testvalue
1 A 4
1 B 5
1 C 3
1 D 3
2 A 3
2 B 3
3 C 3
3 D 4
4 A 3
4 B 5
4 A 1
4 B 3
I would like to change the above long data format into following wide data.
id testA testB testC testD index
1 4 5 3 3 1
2 3 3 . . 2
3 . . 3 4 3
4 3 5 . . 4
4 1 3 . . 5
I am trying
reshape wide testvalue, i(id) j(tests)
It gives error because there are no unique values within tests.
What would be the solution to this problem?
You need to create an extra identifier to make replicates distinguishable.
clear
input id str1 tests testvalue
1 A 4
1 B 5
1 C 3
1 D 3
2 A 3
2 B 3
3 C 3
3 D 4
4 A 3
4 B 5
4 A 1
4 B 3
end
bysort id tests: gen replicate = _n
reshape wide testvalue, i(id replicate) j(tests) string
See also here for documentation.