I have a sample data frame below which has two values (Bus and Car+Minibus) in the Mode level of its index. I have two questions. First, how can I break this up into two data frames, each with a single Mode value, as shown below? Second, since I have a list of data frames and only some of them occur in this input format, how can I write a conditional statement to handle them?
input
Dest                                a      b      c
Orig Variable Time Mode
1    x        y    Bus           5.00  17.32  12.61
                   Car+Minibus   0.87  15.34  12.01
2    x        y    Bus           5.00  14.72  10.15
                   Car+Minibus  10.47   3.03  11.05
3    x        y    Bus          14.72   5.00  15.98
                   Car+Minibus  11.64  11.25   2.08
4    x        y    Bus          15.15  14.62   5.94
                   Car+Minibus  12.02   9.25   5.80
outputs:
Dest                                a      b      c
Orig Variable Time Mode
1    x        y    Bus           5.00  17.32  12.61
2    x        y    Bus           5.00  14.72  10.15
3    x        y    Bus          14.72   5.00  15.98
4    x        y    Bus          15.15  14.62   5.94
Dest                                a      b      c
Orig Variable Time Mode
1    x        y    Car+Minibus   0.87  15.34  12.01
2    x        y    Car+Minibus  10.47   3.03  11.05
3    x        y    Car+Minibus  11.64  11.25   2.08
4    x        y    Car+Minibus  12.02   9.25   5.80
I believe you need to check the fourth level of the MultiIndex (position 3, since levels are counted from 0) and filter by boolean indexing:
mask = df.index.get_level_values(3) == 'Bus'
df1 = df[mask]
df2 = df[~mask]
If you want to work with a list of DataFrames instead:
dfs = [df11, df12, df13]
for df in dfs:
    mask = df.index.get_level_values(3) == 'Bus'
    df1 = df[mask]
    print(df1)
    df2 = df[~mask]
    print(df2)
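If the mode names are not known in advance, or there may be more than two of them, grouping on the index level avoids hard-coding 'Bus'. The following is only a sketch: it assumes the index levels are named Orig, Variable, Time and Mode as in the question, rebuilds the first two origins of the sample frame for illustration, and uses the hypothetical names by_mode and needs_split.

```python
import pandas as pd

# Rebuild a small version of the question's frame (first two origins only).
index = pd.MultiIndex.from_tuples(
    [
        (1, "x", "y", "Bus"),
        (1, "x", "y", "Car+Minibus"),
        (2, "x", "y", "Bus"),
        (2, "x", "y", "Car+Minibus"),
    ],
    names=["Orig", "Variable", "Time", "Mode"],
)
df = pd.DataFrame(
    [
        [5.00, 17.32, 12.61],
        [0.87, 15.34, 12.01],
        [5.00, 14.72, 10.15],
        [10.47, 3.03, 11.05],
    ],
    index=index,
    columns=pd.Index(["a", "b", "c"], name="Dest"),
)

# One sub-frame per distinct value of the "Mode" level.
by_mode = {mode: part for mode, part in df.groupby(level="Mode")}

# The conditional part of the question: a frame only needs splitting
# when its "Mode" level holds more than one distinct value.
needs_split = df.index.get_level_values("Mode").nunique() > 1

print(needs_split)
print(by_mode["Bus"])
```

In a loop over a list of frames, the needs_split test can decide whether a given frame is in the two-mode input format before it is split.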
I want to generate ranks of values from lowest to highest across multiple variables in Stata. In the table below, the columns 2–4 show observed data values for variables x, y, and z, and columns 5–7 show ranks—including tied ranks—across all three variables.
Notice that "across all three variables" means that, for example, the lowest rank = 1 is applied only to the smallest value out of all three variables (i.e. only to the value 0.2 for variable x).
id   x     y     z     rank(x)  rank(y)  rank(z)
1    1.2   2.6   2.0   5        12       10.5
2    0.2   2.0   0.9   1        10.5     3.5
3    0.6   1.5   1.7   2        6        7
4    1.8   0.9   1.9   8        3.5      9
I was hoping egen would provide a one-line kind of solution, but I think it only creates a single rank variable.
Is there a function or one-liner a la (an imagined) rankvars x y z that would accomplish this? Or would it require writing a program to do so?
Correct: egen creates one outcome variable at a time, and you need other code to do this. That is not a program; it could be a few lines in a do-file.
A better way would push the data into Mata and pull out the results.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id float(x y z) byte rankx float(ranky rankz)
1 1.2 2.6 2 5 12 10.5
2 .2 2 .9 1 10.5 3.5
3 .6 1.5 1.7 2 6 7
4 1.8 .9 1.9 8 3.5 9
end
rename (x y z) (v=)
reshape long v, i(id) j(which) string
egen Rank = rank(v)
reshape wide Rank v, i(id) j(which) string
rename v* *
order id x y z rankx Rankx ranky Ranky rankz
list
+----------------------------------------------------------------------+
| id x y z rankx Rankx ranky Ranky rankz Rankz |
|----------------------------------------------------------------------|
1. | 1 1.2 2.6 2 5 5 12 12 10.5 10.5 |
2. | 2 .2 2 .9 1 1 10.5 10.5 3.5 3.5 |
3. | 3 .6 1.5 1.7 2 2 6 6 7 7 |
4. | 4 1.8 .9 1.9 8 8 3.5 3.5 9 9 |
+----------------------------------------------------------------------+
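For readers who want to check the ranks outside Stata, the same reshape-long, rank, reshape-wide idea can be sketched in Python. This is an illustration only, not a replacement for the Stata code; the helper average_ranks is a hypothetical name, and it assigns tied values the mean of their positions, which is the convention the question's table uses.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


# The question's data: id -> (x, y, z).
rows = {
    1: (1.2, 2.6, 2.0),
    2: (0.2, 2.0, 0.9),
    3: (0.6, 1.5, 1.7),
    4: (1.8, 0.9, 1.9),
}

# "Reshape long": flatten every cell into one list and rank it as a whole.
flat = [value for triple in rows.values() for value in triple]
flat_ranks = average_ranks(flat)

# "Reshape wide": regroup the ranks three at a time, one tuple per id.
ranked = {rid: tuple(flat_ranks[3 * k: 3 * k + 3]) for k, rid in enumerate(rows)}

for rid, triple in ranked.items():
    print(rid, triple)
```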
In the matrix, I have this representation -
X Y Z TOTAL
A 3 4 6 13
B 6 44 55 105
C 0 4 8 12
TOTAL 9 52 69 130
I want to show this as the following -
X Y Z
A 23% 31% 46%
B 6% 42% 52%
C 0% 33% 67%
For example, for row A: (X/Total)*100, (Y/Total)*100, (Z/Total)*100.
How do I do it?
Thanks in advance for your help!
Select the values field and show the value as a percentage of the row total.
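The arithmetic behind that setting can be checked outside the spreadsheet. A minimal Python sketch, assuming the counts from the question are typed in by hand (row_percentages is a hypothetical helper name) and that percentages are rounded to whole numbers as in the desired output:

```python
# Counts from the question, without the TOTAL row and column.
counts = {
    "A": {"X": 3, "Y": 4, "Z": 6},
    "B": {"X": 6, "Y": 44, "Z": 55},
    "C": {"X": 0, "Y": 4, "Z": 8},
}


def row_percentages(row):
    """Express each cell as a whole-number percentage of its row total."""
    total = sum(row.values())
    # Guard against an all-zero row so we never divide by zero.
    return {key: (round(100 * value / total) if total else 0)
            for key, value in row.items()}


percent = {label: row_percentages(row) for label, row in counts.items()}

for label, row in percent.items():
    print(label, " ".join(f"{p}%" for p in row.values()))
```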
I have written C++ code to numerically solve a PDE. I would like to plot the result. I have outputted the data to an ascii file, as 3 columns of numbers. The x-coordinate, the y-coordinate and the z-coordinate. This might look like
0.01 7 -3
-12 1.2 -0.24
...
I often have in excess of 1000 data points. I want to plot a surface. I was able to load the data in both R and octave. In R scatterplot3D worked, and in octave plot3 worked. However, I wish to produce a surface, and not distinct points (scatterplot3d), or a curve (plot3). I am struggling to get mesh or surf to work from data in octave. I am looking for a simple way to plot a surface in 3D space with octave, R, C++ or any other program.
You could coerce the data into the correct format for plotting with the base R function persp. This requires a vector of unique x values, a vector of unique y values, and a matrix of z values which is a length(unique(x)) by length(unique(y)) matrix.
Suppose your data looks like this:
x <- y <- seq(-pi, pi, length = 20)
df <- expand.grid(x = x, y = y)
df$z <- cos(df$x) + sin(df$y)
head(df)
#> x y z
#> 1 -3.141593 -3.141593 -1.00000000
#> 2 -2.810899 -3.141593 -0.94581724
#> 3 -2.480205 -3.141593 -0.78914051
#> 4 -2.149511 -3.141593 -0.54694816
#> 5 -1.818817 -3.141593 -0.24548549
#> 6 -1.488123 -3.141593 0.08257935
Then you can create a matrix like this:
z <- tapply(df$z, list(df$x, df$y), mean)
So your plot would look like this:
persp(unique(df$x), unique(df$y), z,
col = "gold", theta = 45, shade = 0.75, ltheta = 90)
If your x and y co-ordinates are not nicely aligned, then a more general approach would be:
z <- tapply(df$z, list(cut(df$x, 20), cut(df$y, 20)), mean, na.rm = TRUE)
persp(as.numeric(factor(levels(cut(df$x, 20)), levels(cut(df$x, 20)))),
as.numeric(factor(levels(cut(df$y, 20)), levels(cut(df$y, 20)))),
z, col = "gold", theta = 45, shade = 0.75, ltheta = 90, xlab = "x",
ylab = "y")
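The reshaping step described above (sorted unique x values, sorted unique y values, and a matrix of z values) does not depend on R. A minimal Python sketch of the same idea, assuming the points lie on a regular grid; the sample points are made up for illustration, and cells with no data are left as None:

```python
import math

# Hypothetical solver output: (x, y, z) triples on a 3 x 2 grid,
# written in arbitrary order.
points = [
    (x, y, math.cos(x) + math.sin(y))
    for y in (0.0, 1.0)
    for x in (2.0, 0.0, 1.0)
]

# Sorted unique coordinates define the grid axes.
xs = sorted({p[0] for p in points})
ys = sorted({p[1] for p in points})
x_pos = {x: i for i, x in enumerate(xs)}
y_pos = {y: j for j, y in enumerate(ys)}

# z[i][j] holds the value at (xs[i], ys[j]); None marks cells with no data.
z = [[None] * len(ys) for _ in xs]
for x, y, value in points:
    z[x_pos[x]][y_pos[y]] = value

print(xs, ys)
print(z)
```

The resulting xs, ys and z have the shape that surface-plotting routines such as persp expect.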
I have a variable age, 13 variables x1 to x13, and 802 observations in a Stata dataset. age has values ranging 1 to 9. x1 to x13 have values ranging 1 to 13.
I want to know how to count the number of 1 .. 13 in x1 to x13 according to different values of age. For example, for age 1, in x1 to x13, count the number of 1,2,3,4,...13.
I first change x1 to x13 as a matrix by using
mkmat x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13, matrix (a)
Then, I want to count using the following loop:
gen count = 0
quietly forval i = 1/802 {
quietly forval j = 1/13 {
replace count = count + inrange(a[r'i', x'j'], 0, 1), if age==1
}
}
I failed.
I am still somewhat uncertain about what you would like to achieve. But if I understand you correctly, here is one way to do it.
First, a simple dataset in which age ranges from one to three, with four variables x1-x4, each holding integer values between 5 and 7.
clear
input age x1 x2 x3 x4
1 5 6 6 6
1 7 5 6 5
2 5 7 6 6
3 5 6 7 7
3 7 6 6 6
end
Then we create three count variables (n5, n6 and n7) that count the number of 5s, 6s, and 7s for each subject across x1-x4.
forval i=5/7 {
    egen n`i'=anycount(x1 x2 x3 x4),v(`i')
}
Below is what the data look like now. To explain, the first "1" under n5 indicates that there is only one "5" for that subject across x1-x4.
+----------------------------------------+
| age x1 x2 x3 x4 n5 n6 n7 |
|----------------------------------------|
1. | 1 5 6 6 6 1 3 0 |
2. | 1 7 5 6 5 2 1 1 |
3. | 2 5 7 6 6 1 2 1 |
4. | 3 5 6 7 7 1 1 2 |
5. | 3 7 6 6 6 0 3 1 |
+----------------------------------------+
It sounds to me like your ultimate goal is to have sums calculated separately for each value in age. Assuming this is true, let's create a 3x3 matrix to store such results.
mat A=J(3,3,.) // age (1-3) and values (5-7)
mat rown A=age1 age2 age3
mat coln A=value5 value6 value7
forval i=5/7 {
    forval j=1/3 {
        qui su n`i' if age==`j'
        local k=`i'-4 // the first column for value5
        mat A[`j',`k']=r(sum)
    }
}
The matrix looks like this. To explain, the first "3" under value5 indicates that, for all children aged 1, the value 5 appears a total of three times across x1-x4.
A[3,3]
value5 value6 value7
age1 3 4 1
age2 1 2 1
age3 1 4 3
With Aspen's example, you could do this:
gen id = _n
reshape long x, i(id)
tab age x
Note that your sample code doesn't loop over different ages and there is an incorrect comma in the replace command (before the if qualifier). I won't try to fix the code, as there are many more direct methods, one of which is above. tabulate has an option to save the table as a matrix.
Here is another solution closer to the original idea. Warning: code not tested.
matrix count = J(9, 13, 0)
forval i = 1/9 {
    forval j = 1/13 {
        forval J = 1/13 {
            qui count if age == `i' & x`J' == `j'
            matrix count[`i', `j'] = count[`i', `j'] + r(N)
        }
    }
}
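As a cross-check on the counts, the same age-by-value table can be sketched in Python. This is an illustration only, using Aspen's five-row sample data typed in by hand; the name counts is hypothetical.

```python
from collections import Counter

# Aspen's sample data: (age, x1, x2, x3, x4).
data = [
    (1, 5, 6, 6, 6),
    (1, 7, 5, 6, 5),
    (2, 5, 7, 6, 6),
    (3, 5, 6, 7, 7),
    (3, 7, 6, 6, 6),
]

# One Counter per age, accumulating every x value seen at that age.
counts = {}
for age, *xs in data:
    counts.setdefault(age, Counter()).update(xs)

# Print one row per age with the number of 5s, 6s and 7s.
for age in sorted(counts):
    print(age, [counts[age][value] for value in (5, 6, 7)])
```

The printed rows should agree with matrix A in the earlier answer.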
I would be very thankful if you could give me some hints (how to do it, or which procedures to look at)
on the following issue:
If, for example, I have a dataset that contains (for each brand) 4 character variables and 3 numerical variables, I would like to calculate averages of the numerical variables for all possible combinations of the character variables (whether or not some of the character variables are missing).
Brand Char1 Char2 Char3 Char4 NumVar1 NumVar2 NumVar3
A a xx 3 a 0.471 0.304 0.267
A b xy 3 s 0.086 0.702 0.872
A c xz 3 a 0.751 0.962 0.080
A d xx 2 s 0.711 0.229 0.474
A a xy 3 a 0.160 0.543 0.256
A b xz 1 s 0.200 0.633 0.241
A c xx 3 a 0.765 0.511 0.045
A d xy 4 s 0.397 0.815 0.950
A a xz 1 a 0.890 0.757 0.483
A b xx 3 a 0.575 0.625 0.341
A c xy 3 a 0.595 0.047 0.584
A d xz 1 s 0.473 0.806 0.329
A a xx 2 s 0.062 0.161 0.018
A b xy 2 s 0.935 0.990 0.072
A c xz 4 s 0.564 0.490 0.112
A d xx 2 a 0.251 0.228 0.215
A a xy 4 a 0.551 0.778 0.605
A b xz 1 s 0.887 0.392 0.866
A c xx 1 s 0.238 0.569 0.245
A d xz 1 a 0.736 0.961 0.627
Thus, I want to compute the following (written not in SAS notation, but just logically):
%let numeric_var = NumVar1 NumVar2 NumVar3; *macro of all numerical variables;
*compute mean values for each NumVar by all combinations of Char.variables;
compute mean(&numeric_var) by Char1 Char2 Char3 Char4
compute mean(&numeric_var) by Char1 Char2 Char3
compute mean(&numeric_var) by Char1 Char2
compute mean(&numeric_var) by Char1
compute mean(&numeric_var) by Char1 Char2 Char4
compute mean(&numeric_var) by Char1 Char4
compute mean(&numeric_var) by Char1 Char3 Char4
etc.
Is there a more efficient way in SAS to compute all these averages than typing all these combinations by hand?
Ultimately, I would like to merge two datasets: the dataset given above, and another dataset with only the character variables (Brand Char1 Char2 Char3 Char4), some of which have missing values. That is why I want to calculate averages of the numerical variables over all possible combinations of the character variables.
Many thanks in advance for any ideas.
Best,
Vlada
You will want to do some reading about PROC MEANS, one of my favorite SAS procedures. For example, consider this:
data have;
input Brand $ Char1 $ Char2 $ Char3 $ Char4 $
NumVar1 NumVar2 NumVar3;
datalines;
A a xx 3 a 0.471 0.304 0.267
A b xy 3 s 0.086 0.702 0.872
A c xz 3 a 0.751 0.962 0.080
A d xx 2 s 0.711 0.229 0.474
A a xy 3 a 0.160 0.543 0.256
A b xz 1 s 0.200 0.633 0.241
A c xx 3 a 0.765 0.511 0.045
A d xy 4 s 0.397 0.815 0.950
A a xz 1 a 0.890 0.757 0.483
A b xx 3 a 0.575 0.625 0.341
A c xy 3 a 0.595 0.047 0.584
A d xz 1 s 0.473 0.806 0.329
A a xx 2 s 0.062 0.161 0.018
A b xy 2 s 0.935 0.990 0.072
A c xz 4 s 0.564 0.490 0.112
A d xx 2 a 0.251 0.228 0.215
A a xy 4 a 0.551 0.778 0.605
A b xz 1 s 0.887 0.392 0.866
A c xx 1 s 0.238 0.569 0.245
A d xz 1 a 0.736 0.961 0.627
run;
proc means noprint data=have completetypes;
class Char1 Char2 Char3 Char4;
var NumVar1 NumVar2 NumVar3;
output out=want mean=mNumVar1 mNumVar2 mNumVar3;
run;
As written, the procedure will create an output data set named "want" with one observation for every combination of the variables listed in the "class" statement and with the MEAN statistic for each variable listed in the "var" statement. In this example, there will be 300 observations (which you will note is larger than the original data set).
Additionally, the output data set will contain two automatic variables:
_FREQ_ - The number of observations in the combination
_TYPE_ - An identifier for the specific combination (based on the CLASS variables)
The _TYPE_ variable will be especially useful in your case. It's a numeric value that identifies which of the variables listed in the class statement define the combination: each class variable corresponds to one binary digit, with the last class variable as the lowest bit. Because you have four class variables, _TYPE_ will have 16 values ranging from 0 to 15. For example, the twelve observations that account for the combinations of variables Char1 and Char2 will have _TYPE_=12 (binary 1100).
Here is a link to the Online Docs for PROC MEANS in SAS version 9.3.
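To make the _TYPE_ numbering concrete, here is a small Python sketch of what the procedure computes: a mean for every subset of the class variables, keyed by a bitmask. It is an illustration only. It uses just the first three rows of the data and one numeric variable, type_code is a hypothetical helper, and it does not reproduce the empty combinations that COMPLETETYPES adds.

```python
from collections import defaultdict
from itertools import combinations

class_vars = ["Char1", "Char2", "Char3", "Char4"]

# First three rows of the question's data, for illustration only.
rows = [
    {"Char1": "a", "Char2": "xx", "Char3": "3", "Char4": "a", "NumVar1": 0.471},
    {"Char1": "b", "Char2": "xy", "Char3": "3", "Char4": "s", "NumVar1": 0.086},
    {"Char1": "c", "Char2": "xz", "Char3": "3", "Char4": "a", "NumVar1": 0.751},
]


def type_code(subset):
    """Mimic _TYPE_: the last class variable is the lowest bit."""
    return sum(1 << (len(class_vars) - 1 - class_vars.index(v)) for v in subset)


# means[type] maps each combination of class values to the mean of NumVar1.
means = {}
for size in range(len(class_vars) + 1):
    for subset in combinations(class_vars, size):
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[v] for v in subset)].append(row["NumVar1"])
        means[type_code(subset)] = {
            key: sum(values) / len(values) for key, values in groups.items()
        }

print(type_code(("Char1", "Char2")))
print(means[1])
```

Here means[0] holds the overall mean, means[12] the means by Char1 and Char2, and means[15] the means by all four class variables.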
PROC MEANS should accomplish what you need, assuming I understand your problem.
proc means data=have;
class char1 char2 char3 char4;
types char1*char2*char3*char4
char1*char2*char3
char2*char3*char4 ... etc... ; *or use the various WAYS statements to get all combinations of a particular number of variables, or use _ALL_ to get all combinations;
var NumVar1 NumVar2 NumVar3;
output out=want mean=;
run;
If the character variables might have missing values, then you need to use /missing; on the CLASS statement.
(Largely crossposted from SAS-L)