Pandas - Create two columns - Simple, no? - python-2.7

Well hello everyone!
I want to create a pandas DataFrame called df. It must contain an "Id" and a "Feature" column. Any idea how to do it?
I have written the code below, but the dictionary keys come out in arbitrary order, so the two columns end up scrambled. I want "Id" as the first column and "Feature" as the second.
Thank you in advance! Have a loooong weekend!
df = DataFrame({'Feature': X["Feature"], 'Id': X["Id"]})

From the pandas docs: "If no columns are passed, the columns will be the sorted list of dict keys." You can use a simple trick to arrange the columns: just add "1", "2", etc. to the beginning of your column names. For example:
>>> df1 = pd.DataFrame({"Id": [1, 2, 3], "Feature": [5, 6, 7]})
>>> df1
   Feature  Id
0        5   1
1        6   2
2        7   3
>>> df2 = pd.DataFrame({"1Id": [1, 2, 3], "2Feature": [5, 6, 7]})
>>> df2
   1Id  2Feature
0    1         5
1    2         6
2    3         7
>>> df2.columns = ["Id", "Feature"]
>>> df2
   Id  Feature
0   1        5
1   2        6
2   3        7
Now you have the order you wanted for printing or saving the DataFrame.
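A simpler alternative worth noting (a sketch on the same toy data, not from the original answer): pd.DataFrame accepts a columns argument that fixes the column order explicitly, regardless of the dict's key order:
>>> df3 = pd.DataFrame({"Feature": [5, 6, 7], "Id": [1, 2, 3]}, columns=["Id", "Feature"])
>>> df3
   Id  Feature
0   1        5
1   2        6
2   3        7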

Is this what you wanted?
import numpy as np
import pandas as pd
index = [1, 2]
df = pd.DataFrame(np.random.randn(2, 2), index=index, columns=('id', 'features'))
The resulting data frame:
>>> df['id']
1    0.254105
2   -0.132025
Name: id, dtype: float64
>>> df['features']
1    0.189972
2    2.262103
Name: features, dtype: float64

Related

How to split dataframe or reorder dataframe by rows in pandas

I just want to clean the dataframe and analyse it, but I ran into trouble. I created a simple dataframe to illustrate:
import pandas as pd
d = {'Resutls': ['IIL', 'pass','pass','IIH','pass','IIL','pass'], 'part':['None',1,2,'None',5,'None',4] }
df = pd.DataFrame(d)
The result looks like:
  Resutls  part
0     IIL  None
1    pass     1
2    pass     2
3     IIH  None
4    pass     5
5     IIL  None
6    pass     4
There are some repeated blocks in the dataframe. I just want to reorder the dataframe by rows and drop the duplicated block headers, like:
  Resutls  part
0     IIL  None
1    pass     1
2    pass     2
6    pass     4
3     IIH  None
4    pass     5
or just split the dataframe into several sub-dataframes:
  Resutls  part
0     IIL  None
1    pass     1
2    pass     2
3    pass     4
  Resutls  part
0     IIH  None
1    pass     5
This is just a simple example of what I want to do. My actual dataframe has 4000-thousand rows; I tried to use reindex or df.iloc for this. It is intuitive to me but seems a little complicated to achieve. Is there any good way to do this? Please advise.
I think you need to replace 'pass' with NaN and forward-fill, then sort with argsort and reorder with iloc:
df = df.iloc[df['Resutls'].mask(df['Resutls'].eq('pass')).ffill().argsort()]
print (df)
  Resutls  part
3     IIH  None
4    pass     5
0     IIL  None
1    pass     1
2    pass     2
5     IIL  None
6    pass     4
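To see what that one-liner is doing, here is a minimal sketch of the intermediate sort key it builds (computed on the original toy dataframe, before the reordering; the variable name key is mine):
key = df['Resutls'].mask(df['Resutls'].eq('pass')).ffill()
print (key)
0    IIL
1    IIL
2    IIL
3    IIH
4    IIH
5    IIL
6    IIL
Name: Resutls, dtype: object
mask() turns every 'pass' into NaN and ffill() propagates the preceding block label down over those NaNs; argsort() on this key then yields the row positions that put the IIH block before the IIL blocks, which iloc applies.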
Last, remove the repeated block headers by boolean indexing:
df = df[~df['Resutls'].duplicated() | (df['Resutls'] == 'pass')]
print (df)
  Resutls  part
3     IIH  None
4    pass     5
0     IIL  None
1    pass     1
2    pass     2
6    pass     4
If you want each DataFrame separately:
df['g'] = df['Resutls'].mask(df['Resutls'].eq('pass')).ffill()
df = df[~df['Resutls'].duplicated() | (df['Resutls'] == 'pass')]
print (df)
  Resutls  part    g
0     IIL  None  IIL
1    pass     1  IIL
2    pass     2  IIL
3     IIH  None  IIH
4    pass     5  IIH
6    pass     4  IIL
dfs = {k:v.drop('g', axis=1) for k, v in df.groupby('g')}
#print (dfs)
print (dfs['IIH'])
  Resutls  part
3     IIH  None
4    pass     5
print (dfs['IIL'])
  Resutls  part
0     IIL  None
1    pass     1
2    pass     2
6    pass     4
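A small usage note: dfs is just a plain dict keyed by the block label, so you can loop over the sub-DataFrames directly:
for name, sub in dfs.items():
    print (name)
    print (sub)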

Delete Rows That Have The Same Value In One Column In Pandas [duplicate]

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.
So this:
A   B
1  10
1  20
2  30
2  40
3  10
Should turn into this:
A   B
1  20
2  40
3  10
I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
This takes the last row per value of A, though not necessarily the maximum:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
   A   B
1  1  20
3  2  40
4  3  10
You can do also something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
   A   B
A
1  1  20
2  2  40
3  3  10
The top answer is doing too much work and looks to be very slow for larger data sets: apply is slow and should be avoided if possible, and ix is deprecated and should be avoided as well.
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
   A   B
1  1  20
3  2  40
4  3  10
Or simply group by all the other columns and take the max of the column you need:
df.groupby('A', as_index=False).max()
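On the question's toy frame, that groupby gives the same result (a quick sketch):
print (df.groupby('A', as_index=False).max())
   A   B
0  1  20
1  2  40
2  3  10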
Simplest solution:
To drop duplicates based on one column:
df = df.drop_duplicates('column_name', keep='last')
To drop duplicates based on multiple columns:
df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
I would sort the dataframe first with column B descending, then drop duplicates for column A, keeping the first:
df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")
without any groupby
Try this:
df.groupby(['A']).max()
I was brought here by a link from a duplicate question.
For just two columns, wouldn't it be simpler to do:
df.groupby('A')['B'].max().reset_index()
And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):
df.loc[df.groupby(...)[column].idxmax()]
For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):
Setup:
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})
(sort_index() is added so both solutions return identical frames):
%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think in your case you don't really need a groupby. I would sort your B column in descending order, then drop duplicates on column A; if you want, you can also get a nice, clean new index like this:
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
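On the toy frame from the question, that chain produces (a quick sketch; note how reset_index(drop=True) renumbers the rows 0, 1, 2):
   A   B
0  1  20
1  2  40
2  3  10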
Easiest way to do this:
# First you need to sort this DF as Column A as ascending and column B as descending
# Then you can drop the duplicate values in A column
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step.
d = {'A': [1, 1, 2, 3, 1, 2, 3, 1], 'B': [30, 40, 50, 42, 38, 30, 25, 32]}
df = pd.DataFrame(data=d)
df
   A   B
0  1  30
1  1  40
2  2  50
3  3  42
4  1  38
5  2  30
6  3  25
7  1  32
df = df.sort_values(['A', 'B'], ascending=[True, False]).drop_duplicates(['A']).reset_index(drop=True)
df
   A   B
0  1  40
1  2  50
2  3  42
You can try this as well
df.drop_duplicates(subset='A', keep='last')
I referred to https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html for this.
Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.
df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()
The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
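If the .any() tie-breaking feels fragile, an alternative (my suggestion, not part of the original answer) is to pick the first modal value explicitly, which also works for ints:
# x.mode() returns a Series of the modal values; iloc[0] picks the first one
df.groupby('columnA').agg({'columnB': lambda x: x.mode().iloc[0]}).reset_index()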
For the original question, the corresponding approach simplifies to:
df.groupby('columnA').columnB.agg('max').reset_index()
While the already-given posts answer the question, I made a small change: adding the column name on which the max() function is applied, for better code readability.
df.groupby('A', as_index=False)['B'].max()
A very similar method to the selected answer, but sorting the data frame by multiple columns might be an easier way to code it.
Firstly, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from highest value to lowest:
df.sort_values(["A", "B"], ascending=False, inplace=True)
Then drop duplicates on "A" and keep only the first item, which is already the one with the highest value:
df.drop_duplicates(subset="A", inplace=True)
This also works:
a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})
I am not going to give you the whole answer (I don't think you're looking for the parsing and writing-to-file part anyway), but a pivotal hint should suffice: use Python's set() function, and then sorted() or .sort() coupled with .reverse():
>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
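As a side note, sorted() can do both steps at once via its reverse flag:
>>> sorted(set([10, 60, 30, 10, 50, 20, 60, 50, 60, 10, 30]), reverse=True)
[60, 50, 30, 20, 10]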

How to display or select rows in pandas where any column contains NaN

My table:
Ram  Shyam  Kamal
  2    nan      4
  1      2      5
  8      7     10
I want to select or display the first row. How should I do that?
Ram  Shyam  Kamal
  2    nan      4
Let df be your dataframe; then you can do:
df = df[df.isnull().any(axis=1)]
This returns:
   Ram Shyam  Kamal
0    2   NaN      4
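If instead you only care about NaNs in one particular column (Shyam here), the same boolean-indexing pattern works per column; a minimal sketch:
# rows where the 'Shyam' column specifically is NaN
df[df['Shyam'].isnull()]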

stop pd.DataFrame.from_csv() from converting integer index to date

pandas.DataFrame.from_csv(filename) seems to be converting my integer index into a date.
This is undesirable. How do I prevent this?
The code shown here is a toy version of a larger problem. In the larger problem, I am estimating and writing the parameters of statistical models for each zone for later use. I thought that by using a pandas dataframe indexed by zone, I could easily read back the parameters. While pickle or some other format like json might solve this problem, I'd like to see a pandas solution... except pandas is converting the zone number to a date.
#!/usr/bin/python
cache_file = "./mydata.csv"
import numpy as np
import pandas as pd

zones = [1, 2, 3, 8, 9, 10]

def create():
    data = []
    for z in zones:
        info = {'m': int(10*np.random.rand()), 'n': int(10*np.random.rand())}
        info.update({'zone': z})
        data.append(info)
    df = pd.DataFrame(data, index=zones)
    print "about to write this data:"
    print df
    df.to_csv(cache_file)

def read():
    df = pd.DataFrame.from_csv(cache_file)
    print "read this data:"
    print df

create()
read()
Sample output:
about to write this data:
    m  n  zone
1   0  3     1
2   5  8     2
3   6  4     3
8   1  8     8
9   6  2     9
10  7  2    10
read this data:
            m  n  zone
2013-12-01  0  3     1
2013-12-02  5  8     2
2013-12-03  6  4     3
2013-12-08  1  8     8
2013-12-09  6  2     9
2013-12-10  7  2    10
The CSV file looks OK, so the problem seems to be in reading not creating.
mydata.csv
,m,n,zone
1,0,3,1
2,5,8,2
3,6,4,3
8,1,8,8
9,6,2,9
10,7,2,10
I suppose this might be useful:
>>> pd.__version__
'0.12.0'
The Python version is 2.7.5+.
I want to record the zone as an index so I can easily pull out the corresponding
parameters later. How do I keep pandas.DataFrame.from_csv() from turning it into a date?
Reading the pandas.DataFrame.from_csv docs: the parse_dates argument defaults to True. Set it to False.
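Applied to the read() function above, that looks like this (a sketch; parse_dates is a documented from_csv argument, the rest is unchanged):
def read():
    # parse_dates=False keeps the integer index as integers
    df = pd.DataFrame.from_csv(cache_file, parse_dates=False)
    print "read this data:"
    print df
In current pandas, DataFrame.from_csv has since been removed; the equivalent is pd.read_csv(cache_file, index_col=0), which does not parse the index as dates by default.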

Working with a list of lists of dataframes with different dimensions

I am working with a list of lists of dataframes, like so:
results <- list()
for (i in 1:4) {
  runData <- data.frame(id=i, t=1:10, value=runif(10))
  runResult <- data.frame(id=i, avgValue=mean(runData$value))
  results <- c(results, list(list(runResult, runData)))
}
The data looks this way because it is essentially how my actual data is generated: I run simulations via clusterApply from the new parallel package in R 2.14.0, and each simulation returns a list of some summary results (runResult) and the raw data (runData).
I would like to combine the first dataframes of the second-level lists together (they have the same structure), and likewise the second dataframes of the second-level lists. This question seemed to be the answer; however, there all the dataframes have the same structure.
The best method I've found so far uses unlist to flatten it into a list of dataframes, where the odd indices and the even indices represent the dataframes that need to be combined:
results <- unlist(results,recursive=FALSE)
allRunResults <- do.call("rbind", results[seq(1,length(results),2)])
allRunData <- do.call("rbind", results[seq(2,length(results),2)])
I'm certain there's a better way to do this; I just don't see it yet. Can anyone supply one?
Shamelessly stealing a construct from Ben Bolker's excellent answer to this question...
Reduce(function(x,y) mapply("rbind", x,y), results)
[[1]]
  id  avgValue
1  1 0.3443166
2  2 0.6056410
3  3 0.6765076
4  4 0.4942554
[[2]]
   id  t      value
1   1  1 0.11891086
2   1  2 0.17757710
3   1  3 0.25789284
4   1  4 0.26766182
5   1  5 0.83790204
6   1  6 0.99916116
7   1  7 0.40794841
8   1  8 0.19490817
9   1  9 0.16238479
10  1 10 0.01881849
11  2  1 0.62178443
12  2  2 0.49214165
........
One option is to extract the given data frame from each piece of the list, then rbind them together:
runData <- do.call(rbind, lapply(results, '[[', 2))
runResult <- do.call(rbind, lapply(results, '[[', 1))
This example gives 2 data frames, but you can recombine them into a single list if you want.