I have a list comprising sub-lists with different numbers of entries, as follows:
x <- list(
c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
c("b1","b2","b3","b4"),
c("c1","c2","c3"),
c("d1")
)
I want to convert this list to a data frame with three columns: the 1st column is the index of the sub-list (i.e. 1 to 4), the 2nd column holds the entries, and the 3rd is my stop code, for which I use 1 on every line. The final result should look as follows:
1 a1 1
1 a2 1
1 a3 1
1 a4 1
1 a5 1
1 a6 1
1 a7 1
2 b1 1
2 b2 1
2 b3 1
2 b4 1
3 c1 1
3 c2 1
3 c3 1
4 d1 1
I tried to use cbind; however, it seems to work only for sub-lists with the same number of entries. Is there a smarter way of doing this?
Here is an example:
data.frame(
x=rep(seq_along(x), lengths(x)),
y=unlist(x),
z=1
)
library(reshape2)
x <- melt(x) ## Done...
## Trivial...
x$stop <- 1
x <- x[c(2,1,3)]
One option is to use the split-apply-combine functionality in the plyr package. In this case you need ldply, which takes a list and combines the results into a data.frame:
library(plyr)
ldply(seq_along(x), function(i)data.frame(n=i, x=x[[i]], stop=1))
n x stop
1 1 a1 1
2 1 a2 1
3 1 a3 1
4 1 a4 1
5 1 a5 1
6 1 a6 1
7 1 a7 1
8 2 b1 1
9 2 b2 1
10 2 b3 1
11 2 b4 1
12 3 c1 1
13 3 c2 1
14 3 c3 1
15 4 d1 1
I have done a groupby which resulted in a dataframe similar to the below example.
df = pd.DataFrame({'a': ['A', 'A','A', 'B', 'B','B'], 'b': ['A1', 'A2','A3' ,'B1', 'B2','B3'], 'c': ['2','3','4','5','6','1'] })
>>> df
a b c
0 A A1 2
1 A A2 3
2 A A3 4
3 B B1 5
4 B B2 6
5 B B3 1
desired output
>>> df
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
As you can see, it is a double ranking based on column a and then column b. We start with the highest group, which is B, and within B we also start with the highest, which is B2.
How can I do that in Python, please?
You can first find the maximum in each group, then sort your DataFrame in descending order by this per-group maximum and by column c:
In [49]: (df.assign(x=df.groupby('a')['c'].transform('max'))
            .sort_values(['x', 'c'], ascending=[False, False])
            .drop(columns='x'))
Out[49]:
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
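The two steps above (per-group maximum, then a descending sort on that maximum and on c) can also be spelled out without pandas. This is only a minimal sketch of the logic, not the pandas implementation: the names `rows`, `group_max` and `ordered` are made up for illustration, and c is stored as integers here (an assumption, since the original column holds strings) so that the comparison is numeric rather than lexicographic.

```python
from collections import defaultdict

# Rows mirror the example frame as (a, b, c) tuples.
# Assumption: c is converted to int so that values compare numerically.
rows = [
    ("A", "A1", 2), ("A", "A2", 3), ("A", "A3", 4),
    ("B", "B1", 5), ("B", "B2", 6), ("B", "B3", 1),
]

# Step 1: per-group maximum of c (the value transform('max') repeats on each row).
group_max = defaultdict(lambda: float("-inf"))
for a, _, c in rows:
    group_max[a] = max(group_max[a], c)

# Step 2: sort by (group maximum, c), both descending.
ordered = sorted(rows, key=lambda r: (group_max[r[0]], r[2]), reverse=True)

for row in ordered:
    print(row)
```

With string values longer than one digit, '10' would sort before '9', which is why converting c to a numeric type before sorting is usually the safer choice.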
Use
In [1072]: df.sort_values(by=['a', 'c'], ascending=[False, False])
Out[1072]:
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
I think you first need to get the max values by aggregating, then create an ordered Categorical whose categories follow the order of those maxima, and finally sort_values works as you need:
c = df.groupby('a')['c'].max().sort_values(ascending=False)
print (c)
a
B 6
A 4
Name: c, dtype: object
df['a'] = pd.Categorical(df['a'], categories=c.index, ordered=True)
df = df.sort_values(by=['a', 'c'], ascending=[True, False])
print (df)
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
For every row, I need to find the usage over the last 3 hours (usage is one of the columns in the dataset), grouped by User and ID_option.
Every row represents one record (within a 3-minute time interval). For example (including the desired column sum_usage_3hr):
User ID_option time usage sum_usage_3hr
1 a1 12OCT2017:11:20:32 3 10
1 a1 12OCT2017:10:23:24 7 14
1 b1 12OCT2017:09:34:55 12 12
2 b1 12OCT2017:08:55:06 4 6
1 a1 12OCT2017:07:59:53 7 7
2 b1 12OCT2017:06:59:12 2 2
I have used the code below, which relies on a hash table:
data want;
if _n_=1 then do;
if 0 then set have(rename=(usage=_usage));
declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
h.definekey('user','id_option','time');
h.definedata('_usage');
h.definedone();
end;
set have;
sum_usage_3hr=0;
do i=time-3*3600 to time ;
if h.find(key:user,key:id_option,key:i)=0 then sum_usage_3hr+_usage;
end;
drop _usage i;
run;
But I got an error: Invalid DO loop control information, either the INITIAL or TO expression is missing or the BY expression is missing, zero, or invalid. If I add:
output;
end:
just above the "run;" it gives me an error: 'No matching DO/Select statement'.
Does anybody know what causes the problem?
I also have a version that sorts the table first, and it gives me the same error.
Thank you
After implementing the answer:
User ID_option time usage sum_usage_3hr col_i_got
1 a1 12OCT2017:11:22:32 3 12 3
1 a1 12OCT2017:11:20:24 0 9 3
1 a1 12OCT2017:10:34:55 2 9 2
1 a1 12OCT2017:09:55:06 0 7 2
1 a1 12OCT2017:09:43:45 0 7 0
1 a1 12OCT2017:08:59:53 7 7 7
1 a1 12OCT2017:06:59:12 0 0 7
Try this out:
Problem 1:
Input:
data have;
input User ID_option $ time usage ;
informat time datetime18.;
format time datetime18.;
cards;
1 a1 12OCT2017:11:20:32 3
1 a1 12OCT2017:10:23:24 7
1 b1 12OCT2017:09:34:55 12
2 b1 12OCT2017:08:55:06 4
1 a1 12OCT2017:07:59:53 7
2 b1 12OCT2017:06:59:12 2
;
run;
Code:
proc sort data=have out=have1;
by user id_option time;
run;
data have2;
set have1;
by user id_option;
format previous_time datetime18.;
previous_time = lag(time);
previous_usage = lag(usage);
if first.ID_option then previous_time=.;
if previous_time ~= . and intnx("hour",time,-3,"s") <= previous_time <= time then sum_usage_3hr=usage+previous_usage;
else sum_usage_3hr = usage;
drop previous_time previous_usage;
run;
proc sort data=have2 out=want;
by descending time ;
run;
Output:
User ID_option time usage sum_usage_3hr
1 a1 12Oct2017 11:20:32 3 10
1 a1 12Oct2017 10:23:24 7 14
1 b1 12Oct2017 9:34:55 12 12
2 b1 12Oct2017 8:55:06 4 6
1 a1 12Oct2017 7:59:53 7 7
2 b1 12Oct2017 6:59:12 2 2
Problem 2:
Input:
data have;
input user1 ID_option $ time usage ;
informat time datetime18.;
format time datetime18.;
cards;
1 a1 12OCT2017:11:22:32 3
1 a1 12OCT2017:11:20:24 0
1 a1 12OCT2017:10:34:55 2
1 a1 12OCT2017:09:55:06 0
1 a1 12OCT2017:09:43:45 0
1 a1 12OCT2017:08:59:53 7
1 a1 12OCT2017:06:59:12 0
;
run;
Code:
proc sql;
create table want as
select user1,id_option,time,min(usage) as usage,sum(usage1) as sum_usage_3hr
from
(
select a.*,b.time as time1 ,b.usage as usage1
from
have a
left join
have b
on a.user1 = b.user1 and a.id_option = b.id_option and b.time <= a.time
where intck("hour",a.time ,b.time) >= -3
)
group by 1,2,3
order by time desc;
quit;
Output:
user1 ID_option time usage sum_usage_3hr
1 a1 12Oct2017 11:22:32 3 12
1 a1 12Oct2017 11:20:24 0 9
1 a1 12Oct2017 10:34:55 2 9
1 a1 12Oct2017 9:55:06 0 7
1 a1 12Oct2017 9:43:45 0 7
1 a1 12Oct2017 8:59:53 7 7
1 a1 12Oct2017 6:59:12 0 0
Let me know in case of any queries.
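For checking results, the windowed sum requested in the question can be written down independently of SAS. The following is a minimal Python sketch, not a translation of either SAS program: the function name `rolling_usage`, the helper `ts` and the tuple layout are made up for illustration, and the window is assumed to be inclusive on both ends, i.e. a record counts if its time lies in [time - 3 hours, time].

```python
from datetime import datetime, timedelta

def rolling_usage(records, window=timedelta(hours=3)):
    """For each (user, id_option, time, usage) record, sum the usage of all
    records with the same user and id_option whose time lies in
    [time - window, time] (both ends inclusive, by assumption)."""
    result = []
    for user, opt, t, usage in records:
        total = sum(
            u for (usr, o, s, u) in records
            if usr == user and o == opt and t - window <= s <= t
        )
        result.append((user, opt, t, usage, total))
    return result

def ts(hour, minute, second):
    # All sample records fall on 12 Oct 2017.
    return datetime(2017, 10, 12, hour, minute, second)

# Sample data from Problem 1.
records = [
    (1, "a1", ts(11, 20, 32), 3),
    (1, "a1", ts(10, 23, 24), 7),
    (1, "b1", ts(9, 34, 55), 12),
    (2, "b1", ts(8, 55, 6), 4),
    (1, "a1", ts(7, 59, 53), 7),
    (2, "b1", ts(6, 59, 12), 2),
]

sums = [row[-1] for row in rolling_usage(records)]
print(sums)  # [10, 14, 12, 6, 7, 2]
```

The nested scan is quadratic in the number of records, which is fine for verifying a small sample but would need an index or a sorted sweep for a large table.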
I have a data frame like the following:
pop state value1 value2
0 1.8 Ohio 2000001 2100345
1 1.9 Ohio 2001001 1000524
2 3.9 Nevada 2002100 1000242
3 2.9 Nevada 2001003 1234567
4 2.0 Nevada 2002004 1420000
And I have an ordered dictionary like the following:
OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(1, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
I want to change the data frame as the OrderedDict specifies:
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 0 1 2 1003 45
1 1.9 Ohio 20 1 1 1 5 24
2 3.9 Nevada 20 2 100 1 2 42
3 2.9 Nevada 20 1 3 1 2345 67
4 2.0 Nevada 20 2 4 1 4200 0
I think this is really complex logic in Python pandas. How can I solve it? Thanks.
First, your OrderedDict uses the same key (1) twice, so the second entry overwrites the first; you need to use different keys.
d= OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(2, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
Now, for your actual problem, you can iterate through d to get the items, and use the apply function on the DataFrame to get what you need.
for k,v in d.items():
    for k1,v1 in v.items():
        if k == 1:
            df[k1] = df.value1.apply(lambda x : int(str(x)[v1[0]-1:v1[1]]))
        else:
            df[k1] = df.value2.apply(lambda x : int(str(x)[v1[0]-1:v1[1]]))
Now, df is
pop state value1 value2 value1_1 value1_2 value1_3 value2_1 \
0 1.8 Ohio 2000001 2100345 20 0 1 2
1 1.9 Ohio 2001001 1000524 20 1 1 1
2 3.9 Nevada 2002100 1000242 20 2 100 1
3 2.9 Nevada 2001003 1234567 20 1 3 1
4 2.0 Nevada 2002004 1420000 20 2 4 1
value2_2 value2_3
0 1003 45
1 5 24
2 2 42
3 2345 67
4 4200 0
I think this would point you in the right direction.
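The core of the lambda above is a slice of the number's decimal string using 1-based, inclusive positions. A minimal standalone sketch of that step follows; the helper name `slice_digits` and the variable `fields` are made up for illustration and are not part of the original answer.

```python
from collections import OrderedDict

def slice_digits(number, start, end):
    """Return digits start..end (1-based, inclusive) of number as an int."""
    # str slices are 0-based and end-exclusive, hence start - 1 and end.
    return int(str(number)[start - 1:end])

spec = OrderedDict([
    ("value1_1", [1, 2]),
    ("value1_2", [3, 4]),
    ("value1_3", [5, 7]),
])

# Apply every (start, end) pair to the first value1 entry, 2000001.
fields = {name: slice_digits(2000001, start, end)
          for name, (start, end) in spec.items()}
print(fields)  # {'value1_1': 20, 'value1_2': 0, 'value1_3': 1}
```

Converting to int drops leading zeros, so the slice '00' becomes 0 and '001' becomes 1, which matches the desired output in the question.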
Converting the value1 and value2 columns to string type:
df['value1'], df['value2'] = df['value1'].astype(str), df['value2'].astype(str)
dct_1, dct_2 = (OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])]),
                OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))
Converting Ordered Dictionary to a list of tuples:
dct_1_list, dct_2_list = list(dct_1.items()), list(dct_2.items())
Flattening a list of lists to a single list:
L1, L2 = sum(list(x[1] for x in dct_1_list), []), sum(list(x[1] for x in dct_2_list), [])
Subtracting 1 from the even-indexed elements of each list (the start positions), since string indices start from 0 and not 1:
L1[::2], L2[::2] = np.array(L1[0::2]) - np.array([1]), np.array(L2[0::2]) - np.array([1])
Taking the appropriate slice positions and mapping those values to the newly created columns of the dataframe:
df['value1_1'],df['value1_2'],df['value1_3']= map(df['value1'].str.slice,L1[::2],L1[1::2])
df['value2_1'],df['value2_2'],df['value2_3']= map(df['value2'].str.slice,L2[::2],L2[1::2])
Dropping off unwanted columns:
df.drop(['value1', 'value2'], axis=1, inplace=True)
Final result:
print(df)
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 00 001 2 1003 45
1 1.9 Ohio 20 01 001 1 0005 24
2 3.9 Nevada 20 02 100 1 0002 42
3 2.9 Nevada 20 01 003 1 2345 67
4 2.0 Nevada 20 02 004 1 4200 00
This question is an extension of what I read in "Python pandas groupby object apply method duplicates first group".
I understand the answer and tried some experiments of my own, e.g.:
import pandas as pd
from io import StringIO
s = '''c1 c2 c3
1 2 3
4 5 6'''
df = pd.read_csv(StringIO(s), sep=' ')
print(df)
def f2(df):
    print(df.iloc[:])
    print("--------")
    return df.iloc[:]
df2 = df.groupby(['c1']).apply(f2)
print("======")
print(df2)
gives as expected:
c1 c2 c3
0 1 2 3
1 4 5 6
c1 c2 c3
0 1 2 3
--------
c1 c2 c3
0 1 2 3
--------
c1 c2 c3
1 4 5 6
--------
======
c1 c2 c3
0 1 2 3
1 4 5 6
However, when I return df.iloc[0:] instead:
def f3(df):
    print(df.iloc[0:])
    print("--------")
    return df.iloc[0:]
df3 = df.groupby(['c1']).apply(f3)
print("======")
print(df3)
I get an additional index level:
c1 c2 c3
0 1 2 3
--------
c1 c2 c3
0 1 2 3
--------
c1 c2 c3
1 4 5 6
--------
======
c1 c2 c3
c1
1 0 1 2 3
4 1 4 5 6
I did some searching and suspect this may mean that a different code path is taken?
The difference is that iloc[:] returns the object itself, while iloc[0:] returns a view of the object. Take a look at this:
>>> df.iloc[:] is df
True
>>> df.iloc[0:] is df
False
Where this makes a difference is that within the groupby, each group has a name attribute that reflects the grouping. When your function returns an object with this name attribute, no index is added to the result, while if you return an object without this name attribute, an index is added to track which group each came from.
Interestingly, you can force the iloc[:] behavior for iloc[0:] by explicitly setting the name attribute of the group before returning:
def f(x):
    out = x.iloc[0:]
    out.name = x.name
    return out
df.groupby('c1').apply(f)
# c1 c2 c3
# 0 1 2 3
# 1 4 5 6
My guess is that the no-index behavior with named output is basically a special case meant to make df.groupby(col).apply(lambda x: x) be a no-op.
Define:
dats <- list( df1 = data.frame(A=sample(1:3), B = sample(11:13)),
df2 = data.frame(AA=sample(1:3), BB = sample(11:13)))
such that:
> dats
$df1
A B
1 2 12
2 3 11
3 1 13
$df2
AA BB
1 1 13
2 2 12
3 3 11
I would like to change all variable names from all caps to lower. I can do this with a loop but somehow cannot get this lapply call to work:
dats <- lapply(dats, function(x)
names(x)<-tolower(names(x)))
which results in:
> dats
$df1
[1] "a" "b"
$df2
[1] "aa" "bb"
while the desired result is:
> dats
$df1
a b
1 2 12
2 3 11
3 1 13
$df2
aa bb
1 1 13
2 2 12
3 3 11
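If you don't use return at the end of a function, the value of the last evaluated expression is returned. Here that expression is the assignment, whose value is the new vector of names, so you need to return x explicitly: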
dats <- lapply(dats, function(x) {
names(x)<-tolower(names(x))
x})