I have a data frame that is grouped by 2 columns - Date and Client - and I sum the Amount, like so:
new_df = df.groupby(['Date','Client']).sum()
Now I get the following df:
             Sum
Date Client
1/1  A       0.8
     B       0.2
1/2  A       0.1
     B       0.9
I want to be able to catch the fact that there is a high fluctuation in the ratio, which changed from 0.8/0.2 to 0.1/0.9. What would be the most efficient way to do it? Also, I can't access the Date and Client fields when I try to do
new_df[['Date','Client']]
Why is that?
IIUC you can use pct_change or diff:
new_df = df.groupby(['Date','Client'], as_index=False).sum()
print (new_df)
Date Client Sum
0 1/1 A 0.8
1 1/1 B 0.2
2 1/2 A 0.1
3 1/2 B 0.9
new_df['pct_change'] = new_df.groupby('Date')['Sum'].pct_change()
new_df['diff'] = new_df.groupby('Date')['Sum'].diff()
print (new_df)
Date Client Sum pct_change diff
0 1/1 A 0.8 NaN NaN
1 1/1 B 0.2 -0.75 -0.6
2 1/2 A 0.1 NaN NaN
3 1/2 B 0.9 8.00 0.8
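On the second question - why new_df[['Date','Client']] fails: groupby by itself returns a GroupBy object, and after aggregating with the default as_index=True, Date and Client become levels of the index rather than columns. A minimal sketch of getting them back as columns (the column name Amount is an assumption about the raw data):
# assuming the raw frame has an Amount column that is being summed
agg = df.groupby(['Date', 'Client'])['Amount'].sum()  # Date/Client end up in a MultiIndex
agg = agg.reset_index()                               # Date and Client are ordinary columns again
Passing as_index=False to groupby, as in the code above, achieves the same thing in one step.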
I have a dataset which has columns Event and Time. I need to create columns Group and Cumulative. What I need to measure is the duration from an 'Event1_Stop' until an 'Event1_Start' appears. The last group should sum the time up to the end of the data, meaning the stop is still ongoing and no start for the event has been recorded.
My data sample is:
data have;
length Event $15;
input Event $ Time;
datalines;
Event3_Start 0.2
Event2_Start 0.4
Event2_Stop 0.2
Event1_Stop 0.2
Event3_Start 0
Event4_Start 0.5
Event3_Stop 0.2
Event1_Start 0
Event4_Stop 0
Event4_Stop 0
Event1_Stop 0.3
Event3_Start 0.3
Event1_Start 0
Event3_Start 0.4
Event3_Stop 0
Event1_Stop 0.2
Event3_Start 0.2
Event2_Start 0.4
run;
The result dataset that I need to obtain is:
data have;
length Event $15;
input Event $ Time Group Cumulative;
datalines;
Event3_Start 0.2 0 0
Event2_Start 0.4 0 0
Event2_Stop 0.2 0 0
Event1_Stop 0.2 1 0.9
Event3_Start 0 1 0
Event4_Start 0.5 1 0
Event3_Stop 0.2 1 0
Event1_Start 0 0 0
Event4_Stop 0 0 0
Event4_Stop 0 0 0
Event1_Stop 0.3 2 0.6
Event3_Start 0.3 2 0
Event1_Start 0 0 0
Event3_Start 0.4 0 0
Event3_Stop 0 0 0
Event1_Stop 0.2 3 0.8
Event3_Start 0.2 3 0
Event2_Start 0.4 3 0
run;
Thanks for your suggestions.
Regards.
Thanks to #mkeintz on the SAS forum for the solution:
data stop_to_start (keep=group cumulative);
    set have end=end_of_have;
    group+(event='Event1_Stop');
    if event='Event1_Stop' then cumulative=0;
    cumulative+time;
    if end_of_have or event='Event1_Start';
run;

data want;
    set have;
    if _n_=1 or event='Event1_Start' then group=0;
    cumulative=0;
    if event='Event1_Stop' then set stop_to_start;
run;
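In short: the first data step numbers each Event1_Stop via group+, resets and accumulates cumulative from that stop onward, and keeps only the observation at the next Event1_Start (or at end of data), so stop_to_start ends up with one row per group holding the total elapsed time. The second step re-reads have and, whenever it reaches an Event1_Stop, pulls the next row from stop_to_start through the conditional SET, which places the group number and cumulative sum on the stop row.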
I have a text file with a header and a few columns, which represents results of experiments where some parameters were fixed to obtain some metrics. The file is in the following format:
     A    B     C     D     E
0  0.5  0.2  0.25  0.75  1.25
1  0.5  0.3  0.12  0.41  1.40
2  0.5  0.4  0.85  0.15  1.55
3  1.0  0.2  0.11  0.15  1.25
4  1.0  0.3  0.10  0.11  1.40
5  1.0  0.4  0.87  0.14  1.25
6  2.0  0.2  0.23  0.45  1.55
7  2.0  0.3  0.74  0.85  1.25
8  2.0  0.4  0.55  0.55  1.40
So I want to plot x = B, y = C for each fixed value of A and E. Basically, for E = 1.25 I want a series of line plots of x = B, y = C, one line per value of A, and then such a plot for each unique value of E.
Could anyone help with this?
You could do a combination of groupby() and seaborn.lineplot():
import matplotlib.pyplot as plt
import seaborn as sns

for e, d in df.groupby('E'):
    fig, ax = plt.subplots()
    sns.lineplot(data=d, x='B', y='C', hue='A', ax=ax)
    ax.set_title(e)
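If the data still needs to be loaded from the text file first, something like this should work (the file name results.txt is an assumption):
import pandas as pd

# whitespace-separated file; pandas treats the unnamed leading column as the index
df = pd.read_csv('results.txt', sep=r'\s+')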
New to Python.
I'm trying to merge multiple text files into 1 csv; example below -
filename.csv
Alpha
0
0.1
0.15
0.2
0.25
0.3
text1.txt
Alpha,Beta
0,10
0.2,20
0.3,30
text2.txt
Alpha,Charlie
0.1,5
0.15,15
text3.txt
Alpha,Delta
0.1,10
0.15,20
0.2,50
0.3,10
Desired output in the csv file: -
filename.csv
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 10
0.15 0 15 20
0.2 20 0 50
0.25 0 0 0
0.3 30 0 10
The code I've been working with (and others that were provided) gives me an answer similar to what is shown at the bottom of the page.
def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    outputDf = pandas.merge(leftDf, outputDf, how='inner', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
The answer I get, however, instead of the desired result, is:
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 0
0.1 0 0 10
0.15 0 15 0
0.15 0 0 20
0.2 20 0 0
0.2 0 0 50
0.25 0 0 0
0.3 30 0 0
0.3 0 0 10
IIUC, you can create a list of all DataFrames - dfs - append mergedDf to it inside the loop, and finally concat all the DataFrames into one:
import pandas
import glob
import os

def mergeData(indir="dir/path", outdir="dir/path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/path/filename.csv"
        right = filename
        output = "/path/filename.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='right', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    # add missing rows from leftDf (in sample Alpha - 0.25)
    # fill NaN values by 0
    outputDf = pandas.merge(leftDf, outputDf, how='left', on="Alpha", sort=True).fillna(0)
    # columns are converted to int
    outputDf[['Beta', 'Charlie']] = outputDf[['Beta', 'Charlie']].astype(int)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
Alpha Beta Charlie
0 0.00 10 0
1 0.10 0 5
2 0.15 0 15
3 0.20 20 0
4 0.25 0 0
5 0.30 30 0
EDIT:
The problem is that the second merge uses how='inner'; the parameter needs to be changed to how='left':
def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    # need a left join, not inner
    outputDf = pandas.merge(leftDf, outputDf, how='left', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
Alpha Beta Charlie Delta
0 0.00 10.0 0.0 0.0
1 0.10 0.0 5.0 0.0
2 0.10 0.0 0.0 10.0
3 0.15 0.0 15.0 0.0
4 0.15 0.0 0.0 20.0
5 0.20 20.0 0.0 0.0
6 0.20 0.0 0.0 50.0
7 0.25 0.0 0.0 0.0
8 0.30 30.0 0.0 0.0
9 0.30 0.0 0.0 10.0
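If you need exactly one row per Alpha, as in the desired output at the top, one possible extra step (my own addition, not part of the original answer) is to collapse the duplicated Alpha rows after the fillna(0):
# each Alpha then appears once; the zeros coming from fillna(0) do not affect the sums
outputDf = outputDf.groupby('Alpha', as_index=False).sum()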
import pandas as pd
data1 = pd.read_csv('samp1.csv',sep=',')
data2 = pd.read_csv('samp2.csv',sep=',')
data3 = pd.read_csv('samp3.csv',sep=',')
df1 = pd.DataFrame({'Alpha':data1.Alpha})
df2 = pd.DataFrame({'Alpha':data2.Alpha,'Beta':data2.Beta})
df3 = pd.DataFrame({'Alpha':data3.Alpha,'Charlie':data3.Charlie})
mergedDf = pd.merge(df1, df2, how='outer', on ='Alpha',sort=False)
mergedDf1 = pd.merge(mergedDf, df3, how='outer', on ='Alpha',sort=False)
a = pd.DataFrame(mergedDf1)
print(a.drop_duplicates())
output:
Alpha Beta Charlie
0 0.00 10.0 NaN
1 0.10 NaN 5.0
2 0.15 NaN 15.0
3 0.20 20.0 NaN
4 0.25 NaN NaN
5 0.30 30.0 NaN
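To match the desired layout, where missing values show as 0 rather than NaN, the result can additionally be filled, e.g. print(a.drop_duplicates().fillna(0)).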
Here is an example:
INPUT:
0.6 0.7 A:0.01 - 0
C:0.01 0.1 - 0.2 0
0.7 0.02 G:0.2 - 0
0.5 0.23 0.1 T:0.05 0
0.1 0.2 0.3 0.58 0
So, if a column has a value starting with A, C, T or G, I would like to change it to "0" or "-", and change the last column to "W" (in my real data these columns are $34 $35 $36 $37 $38).
OUTPUT:
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
I would like to use awk.
awk '{if($34=="^:^");gsub($34,"*","0") && gsub($38,"0","W"); else print}' file
and same for other columns.
Thank you.
How about like this:
$ awk '{for(i=1;i<=4;i++){if ($i ~ /A:|C:|T:|G:/){$i=0; $NF="W"}}}1' file | column -t
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
In a more readable format:
$ awk '{
    for(i=1;i<=4;i++) {           # Loop through the fields
        if ($i ~ /A:|C:|T:|G:/) { # If current field matches pattern
            $i=0                  # Replace it with zero
            $NF="W"               # And make the last field a 'W'
        }
    }
}1' file | column -t
If you want to limit it to specific columns, you can use an array:
awk '{c="1,3";split(c,cols,/,/);for(i in cols){if ($cols[i] ~ /A:|C:|T:|G:/){$cols[i]=0; $NF="W"}}}1' file | column -t
What about something like this:
awk -v OFS="\t" '{if (gsub(/G:|C:|A:|T:/, "0")) print $1,$2,$3,$4,"W"; else print $0}'
And then replace the values starting with 00 by zero.
If you don't care about spacing:
$ awk 'gsub(/[ACGT][^[:space:]]+/,0){$NF="W"}1' file
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
If you do:
$ awk 'gsub(/[ACGT][^[:space:]]+/,0){$NF="W"}1' file | column -t
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
I have a table in an HDFStore with a column of floats f stored as a data_column. I would like to select a subset of rows where, e.g., f==0.6.
I'm running into trouble that I assume is related to a floating-point precision mismatch somewhere. Here is an example:
In [1]: f = np.arange(0, 1, 0.1)
In [2]: s = f.astype('S')
In [3]: df = pd.DataFrame({'f': f, 's': s})
In [4]: df
Out[4]:
f s
0 0.0 0.0
1 0.1 0.1
2 0.2 0.2
3 0.3 0.3
4 0.4 0.4
5 0.5 0.5
6 0.6 0.6
7 0.7 0.7
8 0.8 0.8
9 0.9 0.9
[10 rows x 2 columns]
In [5]: with pd.get_store('test.h5', mode='w') as store:
...: store.append('df', df, data_columns=True)
...:
In [6]: with pd.get_store('test.h5', mode='r') as store:
...: selection = store.select('df', 'f=f')
...:
In [7]: selection
Out[7]:
f s
0 0.0 0.0
1 0.1 0.1
2 0.2 0.2
4 0.4 0.4
5 0.5 0.5
8 0.8 0.8
9 0.9 0.9
[7 rows x 2 columns]
I would like the query to return all of the rows but instead several are missing. A query with where='f=0.3' returns an empty table:
In [8]: with pd.get_store('test.h5', mode='r') as store:
selection = store.select('df', 'f=0.3')
...:
In [9]: selection
Out[9]:
Empty DataFrame
Columns: [f, s]
Index: []
[0 rows x 2 columns]
I'm wondering whether this is the intended behavior, and if so, whether there is a simple workaround, such as setting a precision limit for floating-point queries in pandas? I'm using version 0.13.1:
In [10]: pd.__version__
Out[10]: '0.13.1-55-g7d3e41c'
I don't think so, no. Pandas is built around numpy, and I have never seen any tools for approximate float equality except testing utilities like assert_allclose, and that won't help here.
The best you can do is something like:
In [17]: with pd.get_store('test.h5', mode='r') as store:
selection = store.select('df', '(f > 0.2) & (f < 0.4)')
....:
In [18]: selection
Out[18]:
f s
3 0.3 0.3
If this is a common idiom for you, make a function for it. You can even get fancy by incorporating numpy float precision.
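For example, a minimal sketch of such a helper (the function name, the default tolerance, and the store layout are my own choices, not from pandas):
import pandas as pd

def select_float(store, key, column, value, tol=1e-9):
    # query an interval around the target instead of exact equality;
    # choose tol to suit the scale of your data (or derive it from numpy float eps)
    where = '({col} > {lo}) & ({col} < {hi})'.format(col=column, lo=value - tol, hi=value + tol)
    return store.select(key, where)

with pd.HDFStore('test.h5', mode='r') as store:
    selection = select_float(store, 'df', 'f', 0.3)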