I have the following data:
id tests testvalue
1 A 4
1 B 5
1 C 3
1 D 3
2 A 3
2 B 3
3 C 3
3 D 4
4 A 3
4 B 5
4 A 1
4 B 3
I would like to reshape the long data above into the following wide format.
id testA testB testC testD index
1 4 5 3 3 1
2 3 3 . . 2
3 . . 3 4 3
4 3 5 . . 4
4 1 3 . . 5
I am trying
reshape wide testvalue, i(id) j(tests)
It gives an error because the values of tests are not unique within id.
What would be the solution to this problem?
You need to create an extra identifier to make replicates distinguishable.
clear
input id str1 tests testvalue
1 A 4
1 B 5
1 C 3
1 D 3
2 A 3
2 B 3
3 C 3
3 D 4
4 A 3
4 B 5
4 A 1
4 B 3
end
bysort id tests: gen replicate = _n
reshape wide testvalue, i(id replicate) j(tests) string
See also help reshape for the documentation.
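For readers who work in pandas rather than Stata, the same idea (number the replicates within each id/tests pair, then reshape on id plus that counter) can be sketched roughly as below. This is only an illustration under assumptions: the names long_df and wide_df are made up here, and cumcount plus unstack stand in for bysort and reshape wide.

```python
import pandas as pd

# The example data from the question, in long format.
long_df = pd.DataFrame({
    "id":        [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],
    "tests":     ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "A", "B"],
    "testvalue": [4, 5, 3, 3, 3, 3, 3, 4, 3, 5, 1, 3],
})

# Extra identifier: 1 for the first occurrence of an (id, tests) pair,
# 2 for the second, and so on, in order of appearance.
long_df["replicate"] = long_df.groupby(["id", "tests"]).cumcount() + 1

# (id, replicate, tests) is now unique, so the reshape is well defined.
wide_df = (
    long_df.set_index(["id", "replicate", "tests"])["testvalue"]
    .unstack("tests")      # one column per test
    .add_prefix("test")    # testA, testB, ...
    .reset_index()
)
print(wide_df)
```

Tests that were never taken for a given id/replicate pair come out as NaN, which plays the role of Stata's missing value.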
I have data which is as follows:
data have;
length
group 8
replicate $ 1
day 8
observation 8
;
input (_all_) (:);
datalines;
1 A 1 0
1 A 1 5
1 A 1 3
1 A 1 3
1 A 2 7
1 A 2 2
1 A 2 4
1 A 2 2
1 B 1 1
1 B 1 3
1 B 1 8
1 B 1 0
1 B 2 3
1 B 2 8
1 B 2 1
1 B 2 3
1 C 1 1
1 C 1 5
1 C 1 2
1 C 1 7
1 C 2 2
1 C 2 1
1 C 2 4
1 C 2 1
2 A 1 7
2 A 1 5
2 A 1 3
2 A 1 1
2 A 2 0
2 A 2 5
2 A 2 3
2 A 2 0
2 B 1 0
2 B 1 3
2 B 1 4
2 B 1 8
2 B 2 1
2 B 2 3
2 B 2 4
2 B 2 0
2 C 1 0
2 C 1 4
2 C 1 3
2 C 1 1
2 C 2 2
2 C 2 3
2 C 2 0
2 C 2 1
3 A 1 4
3 A 1 5
3 A 1 6
3 A 1 7
3 A 2 3
3 A 2 1
3 A 2 5
3 A 2 2
3 B 1 2
3 B 1 0
3 B 1 2
3 B 1 3
3 B 2 0
3 B 2 6
3 B 2 3
3 B 2 7
3 C 1 7
3 C 1 5
3 C 1 3
3 C 1 1
3 C 2 0
3 C 2 3
3 C 2 2
3 C 2 1
;
run;
I want to split observation into two columns based on day.
observation_ observation_
Obs group replicate day_1 day_2
1 1 A 0 7
2 1 A 5 2
3 1 A 3 4
4 1 A 3 2
5 1 B 1 3
6 1 B 3 8
7 1 B 8 1
8 1 B 0 3
9 1 C 1 2
10 1 C 5 1
11 1 C 2 4
12 1 C 7 1
13 2 A 7 0
14 2 A 5 5
15 2 A 3 3
16 2 A 1 0
17 2 B 0 1
18 2 B 3 3
19 2 B 4 4
20 2 B 8 0
21 2 C 0 2
22 2 C 4 3
23 2 C 3 0
24 2 C 1 1
25 3 A 4 3
26 3 A 5 1
27 3 A 6 5
28 3 A 7 2
29 3 B 2 0
30 3 B 0 6
31 3 B 2 3
32 3 B 3 7
33 3 C 7 0
34 3 C 5 3
35 3 C 3 2
36 3 C 1 1
The observant SO reader will notice that I have asked essentially the same question previously. However, because of SAS's obsession with "levels" and "by groups", since the variable being used to split the variable of interest isn't binary, that solution doesn't generalize.
Trying it directly, the following occurs:
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test;
by
group
replicate
;
var observation;
id day;
run;
ERROR: The ID value "_1" occurs twice in the same BY group.
I can use the LET option to suppress the error, but in addition to cluttering up the log, SAS then retains only the last observation for each ID value within each BY group.
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test let;
by
group
replicate
;
var observation;
id day;
run;
Obs group replicate _NAME_ _1 _2
1 1 A observation 3 2
2 1 B observation 0 3
3 1 C observation 7 1
4 2 A observation 1 0
5 2 B observation 8 0
6 2 C observation 1 1
7 3 A observation 7 2
8 3 B observation 3 7
9 3 C observation 1 1
I don't doubt there's some kludgy way it could be done, such as splitting each group into a separate data set and then re-merging them. It seems like it should be doable with PROC TRANSPOSE, although how escapes me. Any ideas?
Not sure what you're talking about with "SAS's obsession...", but the issue here is fairly straightforward: you need to tell SAS that the four rows (or however many) are separate, distinct rows. by tells SAS what the row-level ID is, but you're lying to it when you say by group replicate, since there are still multiple rows under that. So you need a unique key. (This would be true in any database-like language; nothing unique to SAS here.)
I would do this - make a day_row field, then sort by that.
data have_id;
set have;
by group replicate day;
if first.day then day_row = 0;
day_row+1;
run;
proc sort data=have_id;
by group replicate day_row;
run;
proc transpose data=have_id out=want(drop=_name_) prefix=observation_day_;
by group replicate day_row;
var observation;
id day;
run;
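To see the unique-key point outside SAS, here is a rough pandas sketch of the same steps on a subset of the data (group 1, replicates A and B only). It is an illustration under assumptions, not a translation of PROC TRANSPOSE: cumcount plays the role of the day_row counter and unstack plays the role of the transpose.

```python
import pandas as pd

# Subset of the question's data: group 1, replicates A and B.
have = pd.DataFrame({
    "group":       [1] * 16,
    "replicate":   ["A"] * 8 + ["B"] * 8,
    "day":         [1, 1, 1, 1, 2, 2, 2, 2] * 2,
    "observation": [0, 5, 3, 3, 7, 2, 4, 2, 1, 3, 8, 0, 3, 8, 1, 3],
})

# Row counter within each (group, replicate, day), like day_row above.
have["day_row"] = have.groupby(["group", "replicate", "day"]).cumcount() + 1

# With day_row in the key, every (key, day) combination is unique,
# so spreading day across columns is unambiguous.
want = (
    have.set_index(["group", "replicate", "day_row", "day"])["observation"]
    .unstack("day")
    .add_prefix("observation_day_")
    .reset_index()
)
print(want)
```

Dropping day_row from the key makes the unstack step fail on duplicate entries, which is the pandas counterpart of the "occurs twice in the same BY group" error.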
Your output looks like you don't want to transpose the data but instead just want to split it into DAY1 and DAY2 sets and merge them back together. This will just pair the multiple readings per BY group in the same order that they appear, which is what it looks like you did in your example.
data want ;
merge
have(where=(day=1) rename=(observation=day_1))
have(where=(day=2) rename=(observation=day_2))
;
by group replicate;
drop day ;
run;
You can read the source data as many times as you need for the number of values of DAY.
If you think that you might not have the same number of observations per BY group for each DAY then you should add these statements at the end of the data step.
output;
call missing(of day_:);
I have a dataframe that looks like this:
a b c d e
0 0 1 2 1 0
1 3 0 0 4 3
2 3 4 0 4 2
3 4 1 0 4 3
4 2 1 3 4 3
5 3 2 0 3 3
6 2 1 1 1 0
7 1 1 0 3 3
8 3 3 3 3 4
9 2 3 4 2 2
I run the following command:
df.groupby('a').sum()
And I get:
b c d e
a
0 1 2 1 0
1 1 0 3 3
2 5 8 7 5
3 9 3 14 12
4 1 0 4 3
And after that I want to access
labels = df['a']
But I get an error saying that there is no such column.
So does pandas have some syntax to get something like this?
a b c d e
0 0 1 2 1 0
1 1 1 0 3 3
2 2 5 8 7 5
3 3 9 3 14 12
4 4 1 0 4 3
I need to sum all values of columns b, c, d, e to column a with the relevant index
You can just access the index with df.index, and add it back into your dataframe as another column.
grouped_df = df.groupby('a').sum()
grouped_df['a'] = grouped_df.index
grouped_df.sum(axis=1)
Alternatively, groupby has an as_index option that keeps 'a' as a regular column:
df.groupby('a', as_index=False)
Or, after the groupby, you can use reset_index to put the column 'a' back.
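A small runnable sketch of the last two options, using the data from the question. The variable names via_flag and via_reset are made up for this illustration.

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    "a": [0, 3, 3, 4, 2, 3, 2, 1, 3, 2],
    "b": [1, 0, 4, 1, 1, 2, 1, 1, 3, 3],
    "c": [2, 0, 0, 0, 3, 0, 1, 0, 3, 4],
    "d": [1, 4, 4, 4, 4, 3, 1, 3, 3, 2],
    "e": [0, 3, 2, 3, 3, 3, 0, 3, 4, 2],
})

# Option 1: keep the grouping key as an ordinary column from the start.
via_flag = df.groupby("a", as_index=False).sum()

# Option 2: group normally, then move the index back into a column.
via_reset = df.groupby("a").sum().reset_index()

labels = via_flag["a"]  # works, because 'a' is a column again
print(via_flag)
```

Both variants should give the same frame; which one to use is mostly a matter of taste.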
I am trying to copy the value from the previous column to the present column if there is a missing value, but there is something wrong in the code I wrote.
data X;
input A B C D E;
cards;
1 . . . 2
2 2 3 . .
3 3 4 5 6
4 4 4 2 .
. . 6 . .
;
run;
Program
data Y;
set x;
array arr(5) a--e;
array brr(4) b--e;
do j=1 to dim(arr);
do i =2 to dim(brr);
if brr(i)=. then brr(i)=arr(j);
end;
end;
drop i j;
run;
However the output that I get is
1 . 1 1 2
2 2 3 2 2
3 3 4 5 6
4 4 4 2 4
. . 6 6 6
Which is wrong!
The output I want is like this:
1 1 1 1 2
2 2 3 3 3
3 3 4 5 6
4 4 4 2 4
. . 6 6 6
What is wrong with the code?
Do you want 4 4 4 2 2 instead of 4 4 4 2 4?
You need only one loop, copying each missing value from the element immediately to its left.
Try this code:
data Y;
set x;
array arr(5) a--e;
do i=2 to dim(arr);
if arr(i)=. then arr(i)=arr(i-1);
end;
drop i;
run;
Also, don't forget to think through what is happening in this code!
You could try to check for every row and every i:
what is the arr(i) value?
what is the arr(i-1) value?
is the outcome what is expected? (Convince yourself that the problem is solved :) )
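One way to do that check is to replay the single loop outside SAS. The pandas sketch below is an illustration under assumptions (the helper carry_right and the frame names are made up here): it applies the same left-to-right rule row by row and compares the result with pandas' built-in forward fill along columns.

```python
import numpy as np
import pandas as pd

# The data set X from the question; np.nan stands in for SAS's missing value.
x = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan, 2],
     [2, 2, 3, np.nan, np.nan],
     [3, 3, 4, 5, 6],
     [4, 4, 4, 2, np.nan],
     [np.nan, np.nan, 6, np.nan, np.nan]],
    columns=list("ABCDE"),
)

def carry_right(row):
    """Single loop: fill a missing element from its left neighbour."""
    values = list(row)
    for i in range(1, len(values)):
        if pd.isna(values[i]):
            values[i] = values[i - 1]
    return values

y_loop = pd.DataFrame([carry_right(r) for r in x.to_numpy()], columns=x.columns)

# Built-in equivalent: forward fill across columns within each row.
y = x.ffill(axis=1)
print(y)
```

Because a filled value becomes the left neighbour of the next element, runs of several missing values are handled by the same loop. Leading missing values (as in the last row) stay missing, since there is nothing to their left to copy.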
I have a dataframe with multiindexed columns. I want to select on the first level based on the column name, and then return all columns but the last one, and assign a new value to all these elements.
Here's a sample dataframe:
In [1]: mydf = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
Out[1]:
A B C
a b c a b c a b c
0 4 1 2 1 4 2 1 1 3
1 4 4 1 2 3 4 2 2 3
2 2 3 4 1 2 1 3 2 3
3 1 3 4 2 3 4 1 5 1
I want to be able to assign to these elements, for example:
In [2]: mydf.loc[:,('A')].iloc[:,:-1]
Out[2]:
A
a b
0 4 1
1 4 4
2 2 3
3 1 3
If I wanted to modify one column only, I know how to select it properly with a tuple so that the assigning works:
In [3]: mydf.loc[:,('A','a')] = 0
In [4]: mydf.loc[:,('A','a')]
Out[4]:
0 0
1 0
2 0
3 0
Name: (A, a), dtype: int32
So that worked well.
Now the following doesn't work...
In [5]: mydf.loc[:,('A')].ix[:,:-1] = 6 - mydf.loc[:,('A')].ix[:,:-1]
In [6]: mydf.loc[:,('A')].iloc[:,:-1] = 6 - mydf.loc[:,('A')].iloc[:,:-1]
Sometimes I will, and sometimes I won't, get the warning that a value is trying to be set on a copy of a slice from a DataFrame. But in both cases it doesn't actually assign.
I've tried pretty much everything I could think of, and I still can't figure out how to mix label and integer indexing in order to set the values correctly.
Any idea please?
Versions:
Python 2.7.9
Pandas 0.16.1
This is not directly supported, as .loc MUST have labels and NOT positions. In theory .ix could support this with multi-index slicers, but there are the usual complications of figuring out what is 'meant' by the user (e.g. is it a label or a position).
In [63]: df = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
In [64]: df
Out[64]:
A B C
a b c a b c a b c
0 4 4 4 4 3 2 5 1 4
1 1 2 1 3 2 1 1 4 5
2 3 2 4 4 2 2 3 1 4
3 5 1 1 3 1 1 5 5 5
First we compute the indexer for the 'A' block with get_loc, which here returns a slice; np.r_ turns that slice into an array of integer positions; then we select the element we want (position 0 in this case). The result feeds into .iloc.
In [65]: df.iloc[:,np.r_[df.columns.get_loc('A')][0]] = 0
In [66]: df
Out[66]:
A B C
a b c a b c a b c
0 0 4 4 4 3 2 5 1 4
1 0 2 1 3 2 1 1 4 5
2 0 2 4 4 2 2 3 1 4
3 0 1 1 3 1 1 5 5 5
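The question asked for all columns of the 'A' block except the last one, not a single column. A hedged sketch of how the positional approach might extend to that case follows. It swaps np.r_ with get_loc for np.flatnonzero on the first-level labels, which is a substitution made here, and the names block and target are made up. The values are the ones printed in the question instead of random draws.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["A", "B", "C"], ["a", "b", "c"]])
mydf = pd.DataFrame(
    [[4, 1, 2, 1, 4, 2, 1, 1, 3],
     [4, 4, 1, 2, 3, 4, 2, 2, 3],
     [2, 3, 4, 1, 2, 1, 3, 2, 3],
     [1, 3, 4, 2, 3, 4, 1, 5, 1]],
    columns=columns,
)

# Integer positions of every column whose first level is 'A'.
block = np.flatnonzero(mydf.columns.get_level_values(0) == "A")

# All of them except the last one.
target = block[:-1]

# A single .iloc call on the original frame, so the assignment is not
# made on a temporary copy as in the chained .loc[...].iloc[...] form.
mydf.iloc[:, target] = 6 - mydf.iloc[:, target].to_numpy()
print(mydf)
```

The chained attempts in the question fail because the first indexing step returns a new object, and the assignment then lands on that object, not on mydf.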
I have a dataset that shows how much was paid ("cenoz", cents per ounce) per product category during a specific week and in a specific store.
clear
set more off
input week store cenoz category
1 1 2 1
1 1 4 2
1 1 3 3
1 2 5 1
1 2 7 2
1 2 8 3
2 1 4 1
2 1 1 2
2 1 10 3
2 2 3 1
2 2 4 2
2 2 7 3
3 1 5 1
3 1 3 2
3 2 5 1
3 2 4 2
end
I create a new variable cenoz3 that indicates how much was paid on average for category 3 in a given week and store. Same with cenoz1 and cenoz2.
egen cenoz1 = mean(cenoz/ (category == 1)), by(week store)
egen cenoz2 = mean(cenoz/ (category == 2)), by(week store)
egen cenoz3 = mean(cenoz/ (category == 3)), by(week store)
It turns out that category 3 was not sold in any of the stores (1 and 2) in week 3. As a result, missing values are generated.
week store cenoz category cenoz1 cenoz2 cenoz3
1 1 2 1 2 4 3
1 1 4 2 2 4 3
1 1 3 3 2 4 3
1 2 5 1 5 7 8
1 2 7 2 5 7 8
1 2 8 3 5 7 8
2 1 4 1 4 1 10
2 1 1 2 4 1 10
2 1 10 3 4 1 10
2 2 3 1 3 4 7
2 2 4 2 3 4 7
2 2 7 3 3 4 7
3 1 5 1 5 3 .
3 1 3 2 5 3 .
3 2 5 1 5 4 .
3 2 4 2 5 4 .
I would like to replace missing values of a particular week with values of the previous week and matching store. That's to say:
replace missing values for category 3 in week 3 in store 1
with values for category 3 in week 2 in store 1
and
replace missing values for category 3 in week 3 in store 2
with values for category 3 in week 2 in store 2
Can I use command replace or is it something more complicated than that?
Something like:
replace cenoz1 = cenoz1[_n-1] if missing(cenoz1)
But I also need the stores to match, not just the time variable week.
I found this code provided by Nicholas Cox at
http://www.stata.com/support/faqs/data-management/replacing-missing-values/:
by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .
Do you think
by store (week), sort: replace cenoz1 = cenoz1[_n-1] if missing(cenoz1)
is sufficient?
UPDATE:
When I use the code
by store (week category), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
It seems it delivers correct values:
week store cenoz category cenoz1 cenoz2 cenoz3
1 1 2 1 2 4 3
1 1 4 2 2 4 3
1 1 3 3 2 4 3
1 2 5 1 5 7 8
1 2 7 2 5 7 8
1 2 8 3 5 7 8
2 1 4 1 4 1 10
2 1 1 2 4 1 10
2 1 10 3 4 1 10
2 2 3 1 3 4 7
2 2 4 2 3 4 7
2 2 7 3 3 4 7
3 1 5 1 5 3 10
3 1 3 2 5 3 10
3 2 5 1 5 4 7
3 2 4 2 5 4 7
Is there any way to double check this code given that my dataset is quite large?
How can I make this code less specific, so that it applies to any cenoz variable with missing values (cenoz1, cenoz2, cenoz3, cenoz4...cenoz12)?
If you want to use the previous information for the same store and the same category, that should be
by store category (week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
A generalization could be
sort store category week
forval j = 1/12 {
by store category: replace cenoz`j' = cenoz`j'[_n-1] if missing(cenoz`j')
}
However, carrying values forward like this is a fairly crude method of interpolation. Consider linear, cubic, cubic spline, or PCHIP interpolation instead. Use search to find Stata programs.
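On the question of double-checking the result on a large dataset, one option is to replay the logic in another tool and compare. The pandas sketch below is only an illustration under assumptions: it rebuilds the example data, recomputes cenoz1 to cenoz3 the way the egen lines do, and then carries values forward within store and category, ordered by week. The loop covers three categories here where the real data has twelve.

```python
import pandas as pd

# The example data from the question.
df = pd.DataFrame({
    "week":     [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    "store":    [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
    "cenoz":    [2, 4, 3, 5, 7, 8, 4, 1, 10, 3, 4, 7, 5, 3, 5, 4],
    "category": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2],
})

# Mean price of category j within each week and store, spread over all rows
# of that week and store (missing when category j was not sold there).
for j in (1, 2, 3):
    only_j = df["cenoz"].where(df["category"] == j)
    df[f"cenoz{j}"] = only_j.groupby([df["week"], df["store"]]).transform("mean")

# Carry the previous week's value forward within each store and category.
df = df.sort_values(["store", "category", "week"])
for j in (1, 2, 3):
    df[f"cenoz{j}"] = df.groupby(["store", "category"])[f"cenoz{j}"].ffill()

print(df[df["week"] == 3])
```

On this example the week 3 rows end up with cenoz3 equal to 10 for store 1 and 7 for store 2, matching the table in the question's update.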
A quick note on why your code
by store (category week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
won't work.
It will work for the example dataset you give. But a slight modification can give unexpected results. Consider the following example:
clear all
set more off
input week store cenoz category
1 1 2 1
1 1 4 2 /*
1 1 3 3 deleted observation */
1 2 5 1
1 2 7 2
1 2 8 3
2 1 4 1
2 1 1 2
2 1 10 3
2 2 3 1
2 2 4 2
2 2 7 3
3 1 5 1
3 1 3 2
3 1 999 3 // new observation
3 2 5 1
3 2 4 2
end
egen cenoz1 = mean(cenoz/ (category == 1)), by(week store)
egen cenoz2 = mean(cenoz/ (category == 2)), by(week store)
egen cenoz3 = mean(cenoz/ (category == 3)), by(week store)
order store category week
sort store category week
list, sepby(store category)
*----- method 1 (your code) -----
gen cenoz3x1 = cenoz3
by store (category week), sort: replace cenoz3x1 = cenoz3x1[_n-1] if missing(cenoz3x1)
*----- method 2 (Nick's code) -----
gen cenoz3x2 = cenoz3
by store category (week), sort: replace cenoz3x2 = cenoz3x2[_n-1] if missing(cenoz3x2)
list, sepby(store category)
Method 1 will assign the price of a category 1 article to a category 2 article (observation 4 of cenoz3x1). Presumably, something you don't want. If you want to avoid this, then the groups should be based on store category and not just store.
The best place to start reading is help and the manuals.