Merging multiple .txt files into a csv - python-2.7

New to Python.
I'm trying to merge multiple text files into one CSV; example below:
filename.csv
Alpha
0
0.1
0.15
0.2
0.25
0.3
text1.txt
Alpha,Beta
0,10
0.2,20
0.3,30
text2.txt
Alpha,Charlie
0.1,5
0.15,15
text3.txt
Alpha,Delta
0.1,10
0.15,20
0.2,50
0.3,10
Desired output in the csv file:
filename.csv
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 10
0.15 0 15 20
0.2 20 0 50
0.25 0 0 0
0.3 30 0 10
The code I've been working with (and others that were suggested) gives me an answer similar to what is at the bottom of the page:
import pandas
import glob
import os

def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    outputDf = pandas.merge(leftDf, outputDf, how='inner', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
The answer I get, however, is this instead of the desired result:
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 0
0.1 0 0 10
0.15 0 15 0
0.15 0 0 20
0.2 20 0 0
0.2 0 0 50
0.25 0 0 0
0.3 30 0 0
0.3 0 0 10

IIUC, you can create a list of all DataFrames (dfs), append mergedDf to it inside the loop, and finally concat all the DataFrames into one:
import pandas
import glob
import os

def mergeData(indir="dir/path", outdir="dir/path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/path/filename.csv"
        right = filename
        output = "/path/filename.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='right', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    # add missing rows from leftDf (in the sample, Alpha 0.25)
    # and fill NaN values with 0
    outputDf = pandas.merge(leftDf, outputDf, how='left', on="Alpha", sort=True).fillna(0)
    # columns are converted to int
    outputDf[['Beta', 'Charlie']] = outputDf[['Beta', 'Charlie']].astype(int)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
Alpha Beta Charlie
0 0.00 10 0
1 0.10 0 5
2 0.15 0 15
3 0.20 20 0
4 0.25 0 0
5 0.30 30 0
EDIT:
The problem is that you changed the how='left' parameter in the second merge to how='inner':
import pandas
import glob
import os

def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    # need a left join, not inner
    outputDf = pandas.merge(leftDf, outputDf, how='left', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
Alpha Beta Charlie Delta
0 0.00 10.0 0.0 0.0
1 0.10 0.0 5.0 0.0
2 0.10 0.0 0.0 10.0
3 0.15 0.0 15.0 0.0
4 0.15 0.0 0.0 20.0
5 0.20 20.0 0.0 0.0
6 0.20 0.0 0.0 50.0
7 0.25 0.0 0.0 0.0
8 0.30 30.0 0.0 0.0
9 0.30 0.0 0.0 10.0
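If the duplicated Alpha rows in the output above are unwanted, one possible follow-up (an addition of mine, not part of the original answer) is to collapse them by summing the value columns per Alpha after the concat:

```python
import pandas as pd

# Hypothetical stand-in for outputDf after the left merge: duplicated
# Alpha values, with the non-matching columns filled with 0.
outputDf = pd.DataFrame({
    'Alpha':   [0.1, 0.1, 0.15, 0.15],
    'Beta':    [0.0, 0.0, 0.0, 0.0],
    'Charlie': [5.0, 0.0, 15.0, 0.0],
    'Delta':   [0.0, 10.0, 0.0, 20.0],
})

# Collapse duplicate Alpha rows by summing the value columns.
collapsed = outputDf.groupby('Alpha', as_index=False).sum()
```

Because the filler values are 0, summing within each Alpha group keeps exactly the one real value per column.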

import pandas as pd
data1 = pd.read_csv('samp1.csv',sep=',')
data2 = pd.read_csv('samp2.csv',sep=',')
data3 = pd.read_csv('samp3.csv',sep=',')
df1 = pd.DataFrame({'Alpha':data1.Alpha})
df2 = pd.DataFrame({'Alpha':data2.Alpha,'Beta':data2.Beta})
df3 = pd.DataFrame({'Alpha':data3.Alpha,'Charlie':data3.Charlie})
mergedDf = pd.merge(df1, df2, how='outer', on ='Alpha',sort=False)
mergedDf1 = pd.merge(mergedDf, df3, how='outer', on ='Alpha',sort=False)
a = pd.DataFrame(mergedDf1)
print(a.drop_duplicates())
output:
Alpha Beta Charlie
0 0.00 10.0 NaN
1 0.10 NaN 5.0
2 0.15 NaN 15.0
3 0.20 20.0 NaN
4 0.25 NaN NaN
5 0.30 30.0 NaN
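The chain of pairwise merges above generalizes to any number of input files. A minimal sketch using functools.reduce with outer merges (the frames are built inline here to keep it self-contained; in practice each would come from pd.read_csv on a globbed file):

```python
from functools import reduce
import pandas as pd

# Inline stand-ins for filename.csv, text1.txt, text2.txt, text3.txt.
frames = [
    pd.DataFrame({'Alpha': [0, 0.1, 0.15, 0.2, 0.25, 0.3]}),
    pd.DataFrame({'Alpha': [0, 0.2, 0.3], 'Beta': [10, 20, 30]}),
    pd.DataFrame({'Alpha': [0.1, 0.15], 'Charlie': [5, 15]}),
    pd.DataFrame({'Alpha': [0.1, 0.15, 0.2, 0.3], 'Delta': [10, 20, 50, 10]}),
]

# Outer-merge every frame on Alpha, then fill the gaps with 0.
merged = reduce(lambda l, r: pd.merge(l, r, how='outer', on='Alpha', sort=True), frames)
merged = merged.fillna(0)
```

Starting the reduction from the file that lists every Alpha value means the 0.25 row survives without a second fix-up merge.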

Related

How do I plot data in a text file depending on the value present in one of the columns

I have a text file with a header and a few columns, which represents results of experiments where some parameters were fixed to obtain some metrics. The file is in the following format:
A B C D E
0 0.5 0.2 0.25 0.75 1.25
1 0.5 0.3 0.12 0.41 1.40
2 0.5 0.4 0.85 0.15 1.55
3 1.0 0.2 0.11 0.15 1.25
4 1.0 0.3 0.10 0.11 1.40
5 1.0 0.4 0.87 0.14 1.25
6 2.0 0.2 0.23 0.45 1.55
7 2.0 0.3 0.74 0.85 1.25
8 2.0 0.4 0.55 0.55 1.40
So I want to plot x = B, y = C for each fixed value of A and E; basically, for E=1.25 I want a series of line plots of x = B, y = C at each value of A, then a plot like that for each unique value of E.
Could anyone help with this?
You could do a combination of groupby() and seaborn.lineplot():
import matplotlib.pyplot as plt
import seaborn as sns

for e, d in df.groupby('E'):
    fig, ax = plt.subplots()
    sns.lineplot(data=d, x='B', y='C', hue='A', ax=ax)
    ax.set_title(e)
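If seaborn is not available, the same grouping idea can be sketched with plain matplotlib (the small frame below is an assumed sample using the question's column names, not the asker's full data):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Sample frame with the question's columns.
df = pd.DataFrame({
    'A': [0.5, 0.5, 1.0, 1.0, 2.0, 2.0],
    'B': [0.2, 0.3, 0.2, 0.3, 0.2, 0.3],
    'C': [0.25, 0.12, 0.11, 0.10, 0.23, 0.74],
    'E': [1.25, 1.40, 1.25, 1.40, 1.55, 1.25],
})

figures = []
# One figure per unique E; within it, one line per value of A.
for e, d in df.groupby('E'):
    fig, ax = plt.subplots()
    for a, g in d.groupby('A'):
        ax.plot(g['B'], g['C'], marker='o', label=f'A={a}')
    ax.set_title(f'E={e}')
    ax.set_xlabel('B')
    ax.set_ylabel('C')
    ax.legend()
    figures.append(fig)
```

The outer groupby splits the data by E, and the inner one draws one line per A, which is what seaborn's hue='A' does internally.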

Getting 'ValueError: x and y must be 1D arrays of the same length' when they are in fact 1D arrays of same length

I have this dataframe:
key variable value
0 0.25 -0.2 606623.455859
1 0.27 -0.2 621462.029200
2 0.30 -0.2 640299.078053
3 0.33 -0.2 653686.910706
4 0.35 -0.2 659278.593742
5 0.37 -0.2 665684.466383
6 0.40 -0.2 671975.695814
7 0.25 0 530091.733402
8 0.27 0 542501.852937
9 0.30 0 557799.179433
10 0.33 0 571140.149887
11 0.35 0 575117.783803
12 0.37 0 582709.048163
13 0.40 0 588168.965913
14 0.25 0.2 466275.721535
15 0.27 0.2 478678.452615
16 0.30 0.2 492749.041489
17 0.33 0.2 500792.917910
18 0.35 0.2 503620.638204
19 0.37 0.2 507884.996510
20 0.40 0.2 512504.976664
21 0.25 0.5 351579.595889
22 0.27 0.5 359555.855803
23 0.30 0.5 368924.362358
24 0.33 0.5 375069.238800
25 0.35 0.5 377847.414729
26 0.37 0.5 381146.573247
27 0.40 0.5 383836.933547
And I am trying to make a contour plot using this dataframe with the following code:
x = df['key'].values
y = df['variable'].values
z = df['value'].values
plt.tricontourf(x, y, z, colors='k')
I keep getting this error:
ValueError: x and y must be 1D arrays of the same length
But whenever I check the len, .size, .shape, and .ndim of x and y, they are 1D arrays of the same length. Does anyone know why I would get this error?
x.shape returns (28L,) and y.shape returns (28L,) as well
Okay, I found a way to make it work. I'm really not sure why it didn't work the original way, because I was feeding tricontourf 1D arrays, but I wrapped my data in a list() call just to make doubly sure it was 1D, and that made it work. Here's the code:
x = df_2020_pivot['key'].values
y = df_2020_pivot['variable'].values
z = df_2020_pivot['value'].values
plt.tricontourf(list(x), list(y), list(z))
plt.show()
And this is what it produced
I had the same issue crop up. I was passing in two numpy arrays of the same length, and got the 'must be 1D arrays of same length' error. Looking at type(array), the arrays I was passing in were numpy.ndarrays. I used array.tolist() to turn them into simple (1D) lists, and this removed the error for me. Wrapping in the list() function as mentioned above also works.
x = df['key'].values.tolist()
y = df['variable'].values.tolist()
z = df['value'].values
plt.tricontourf(x, y, z, colors='k')
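For reference, here is a self-contained sketch of the list-conversion workaround on synthetic data (the random scatter below is an assumption, not the asker's frame; tricontourf just needs non-collinear points):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

# Synthetic non-collinear scatter so the triangulation is valid.
rng = np.random.default_rng(0)
x = rng.random(30)
y = rng.random(30)
z = x * y

# Convert the ndarrays to plain lists, as in the workaround above.
cs = plt.tricontourf(list(x), list(y), list(z))
```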

pandas.DataFrame: How to div row by row [python]

I want to divide row[i] by row[i+1] in a pandas DataFrame:
row[i] = row[i] / row[i+1]
for example:
1 2 3 4
4 2 6 2
8 5 3 1
the result is
0.25 1 0.5 2
0.5 0.4 2 2
You can divide by the shifted DataFrame using div, then remove the NaN row with dropna:
print (df)
a b c d
0 1 2 3 4
1 4 2 6 2
2 8 5 3 1
print (df.div(df.shift(-1), axis=1))
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
2 NaN NaN NaN NaN
df = df.div(df.shift(-1), axis=1).dropna(how='all')
print (df)
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
Another solution for removing the last row is to select by iloc:
df = df.div(df.shift(-1), axis=1).iloc[:-1]
print (df)
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
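The same row[i] / row[i+1] computation can also be sketched with NumPy slicing, which avoids the intermediate NaN row entirely (an alternative of mine, not from the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 8], 'b': [2, 2, 5], 'c': [3, 6, 3], 'd': [4, 2, 1]})

# Row i divided elementwise by row i+1; the slice [:-1] / [1:] pairs
# each row with the next one, so the last row simply drops out.
vals = df.to_numpy(dtype=float)
out = pd.DataFrame(vals[:-1] / vals[1:], columns=df.columns)
```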

Calculation on groups after group by with Pandas

I have a data frame that is grouped by 2 columns - Date and Client - and I sum the amount, like so:
new_df = df.groupby(['Date', 'Client']).sum()
Now I get the following df:
Sum
Date Client
1/1 A 0.8
B 0.2
1/2 A 0.1
B 0.9
I want to be able to catch the fact that there is a high fluctuation: the ratio of 0.8 to 0.2 changed to 0.1 to 0.9. What would be the most efficient way to do it? Also, I can't access the Date and Client fields when I try to do
new_df[['Date','Client']]
Why is that?
IIUC you can use pct_change or diff:
new_df = df.groupby(['Date','Client'], as_index=False).sum()
print (new_df)
Date Client Sum
0 1/1 A 0.8
1 1/1 B 0.2
2 1/2 A 0.1
3 1/2 B 0.9
new_df['pct_change'] = new_df.groupby('Date')['Sum'].pct_change()
new_df['diff'] = new_df.groupby('Date')['Sum'].diff()
print (new_df)
Date Client Sum pct_change diff
0 1/1 A 0.8 NaN NaN
1 1/1 B 0.2 -0.75 -0.6
2 1/2 A 0.1 NaN NaN
3 1/2 B 0.9 8.00 0.8
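To actually catch a fluctuation like 0.8 → 0.1 for the same client across dates, one possible follow-up (the 0.5 cutoff is an arbitrary assumption of mine) is to diff per Client instead of per Date and flag large moves:

```python
import pandas as pd

new_df = pd.DataFrame({
    'Date':   ['1/1', '1/1', '1/2', '1/2'],
    'Client': ['A', 'B', 'A', 'B'],
    'Sum':    [0.8, 0.2, 0.1, 0.9],
})

# Change in each client's share from one date to the next.
new_df['client_diff'] = new_df.groupby('Client')['Sum'].diff()

# Flag rows whose share moved by more than the (arbitrary) 0.5 cutoff.
new_df['high_fluctuation'] = new_df['client_diff'].abs() > 0.5
```

Each client's first row gets NaN from diff, and NaN compares as False, so only genuine date-over-date swings are flagged.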

Subtract two columns of different DataFrames with Python

I have two DataFrames, df1:
Lat1 Lon1 tp1
0 34.475000 349.835000 1
1 34.476920 349.862065 0.5
2 34.478833 349.889131 0
3 34.480739 349.916199 3
4 34.482639 349.943268 0
5 34.484532 349.970338 0
and df2:
Lat2 Lon2 tp2
0 34.475000 349.835000 2
1 34.476920 349.862065 1
2 34.478833 349.889131 0
3 34.480739 349.916199 6
4 34.482639 349.943268 0
5 34.484532 349.970338 0
I want to subtract the columns (tp1-tp2) and create a new DataFrame whose columns are Lat1, Lon1, tp1-tp2. Does anyone know how I can do it?
import pandas as pd

# copy() avoids SettingWithCopyWarning when adding the new column
df3 = df1[['Lat1', 'Lon1']].copy()
df3['tp1-tp2'] = df1.tp1 - df2.tp2
Out[97]:
Lat1 Lon1 tp1-tp2
0 34.4750 349.8350 -1.0
1 34.4769 349.8621 -0.5
2 34.4788 349.8891 0.0
3 34.4807 349.9162 -3.0
4 34.4826 349.9433 0.0
5 34.4845 349.9703 0.0
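As a self-contained sketch of the slice-then-assign approach (the two small frames below reproduce the first rows of the sample data):

```python
import pandas as pd

df1 = pd.DataFrame({'Lat1': [34.4750, 34.4769],
                    'Lon1': [349.8350, 349.8621],
                    'tp1':  [1.0, 0.5]})
df2 = pd.DataFrame({'Lat2': [34.4750, 34.4769],
                    'Lon2': [349.8350, 349.8621],
                    'tp2':  [2.0, 1.0]})

# Copy the slice so the new column is written to an independent frame,
# then subtract the two tp columns (they align on the shared index).
df3 = df1[['Lat1', 'Lon1']].copy()
df3['tp1-tp2'] = df1['tp1'] - df2['tp2']
```

Note the subtraction aligns on the row index, so this assumes df1 and df2 share the same index order.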