I have a CSV file laid out like this:
a
1 2 3
4 5 6
7 8 9
b
7 8 9
4 5 6
1 2 3
How can I change it to the following form:
a 1 2 3
4 5 6
7 8 9
b 7 8 9
4 5 6
1 2 3
with a and b in the first column and the numbers in the second, third and fourth columns respectively?
My code is:
import csv
with open('csv_test.csv', 'w') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(['a', 'b', 'c'])
    wr.writerow([1, 2, 3])
    wr.writerow([4, 5, 6])
    wr.writerow([7, 8, 9])
I note that you are using the csv module.
This code reads each line. If the line consists of a single field, the contents of that field are treated as a header and remembered for the next line.
After the next line is read, the header and the contents of the line are written out, then the header is replaced with an empty string, so subsequent lines get a blank field in the first column and the numbers in the second, third, etc. columns.
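import csv
# newline="" lets the csv module handle line endings itself (Python 3)
with open("infile.csv", "r", newline="") as infile:
    with open("myfile.csv", "w", newline="") as myfile:
        reader = csv.reader(infile)
        wr = csv.writer(myfile)
        column1 = ""
        for fields in reader:
            if len(fields) == 1:
                # a single field is a header: remember it for the next row
                column1 = fields[0]
            else:
                # write the pending header (or a blank) followed by the row
                wr.writerow([column1] + fields)
                column1 = ""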
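The same logic can be tried without touching the disk. This is only a sketch: the function name fold_headers is made up for illustration, and it assumes the real file is comma-separated (the sample in the question is shown with spaces, which csv.reader would read as a single field per line).

```python
import csv
import io


def fold_headers(infile, outfile):
    """Copy rows from infile to outfile, moving each one-field header
    line into the first column of the row that follows it."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = ""
    for fields in reader:
        if len(fields) == 1:
            # remember the header for the next data row
            header = fields[0]
        else:
            writer.writerow([header] + fields)
            header = ""


# Hypothetical comma-separated sample standing in for infile.csv
sample = "a\n1,2,3\n4,5,6\nb\n7,8,9\n"
out = io.StringIO()
fold_headers(io.StringIO(sample), out)
result = out.getvalue()
print(result)
```

Passing file objects instead of file names keeps the transformation separate from the file handling, so the same function works on opened files as well.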
I tried to create a DataFrame df using the code below:
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print(df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same DataFrame using the syntax below, I am getting a weird output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print(df)
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why NaN is displayed in the DataFrame when both Series are non-empty, and why only two rows are displayed and not the rest.
Please also show the correct way to create the same DataFrame as above using the columns argument of the pandas DataFrame constructor.
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
If you remove the columns argument, you get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
Then define the columns. If a requested column does not exist, you get a column of NaNs:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
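To see where the two rows come from, it may help to look at the shape and column labels of the frame built from the list. The sketch below uses the same s and t as the question; the names rows and fixed are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])
t = pd.Series([2, 4, 6, 8, 10, 12])

# A list of Series is read row by row: 2 rows, columns labelled 0..5.
rows = pd.DataFrame([s, t])
print(rows.shape)
print(list(rows.columns))

# Transposing turns the two rows into two columns, which can then be named.
fixed = rows.T
fixed.columns = ["MUL1", "MUL2"]
print(fixed)
```

Because the column labels are 0 to 5, asking for "MUL1" and "MUL2" in the constructor selects labels that do not exist, which is why only NaN appears.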
It is better to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
And if you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in the DataFrame documentation.
Another solution uses concat, so the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
A pandas.DataFrame takes in the parameter data that can be of type ndarray, iterable, dict, or dataframe.
If you pass in a list it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output 1:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
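You are getting NaN because each Series is treated as a row, so the columns of the result are the Series index labels [0,1,2,3,4,5]; none of them matches "MUL1" or "MUL2", so those columns are filled with NaN.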
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
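A minimal sketch of the transposed data being passed on to the constructor, using the same a and b lists as above (the column names Col1 and Col2 are just for illustration):

```python
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 4, 6]

# Stack the lists as rows, then transpose so each list becomes a column.
data = np.array([a, b]).transpose()

# Two columns now, so two column names are expected.
df = pd.DataFrame(data, columns=["Col1", "Col2"])
print(df)
```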
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6
I have to compare one column with all the other columns in a DataFrame. The column I have to compare with the others is at position 4, so I write df.iloc[x,4] to get its values. Then I have to take these values, multiply them by the values in the next column (for example df.iloc[x,5]), create a new column in the DataFrame and save the results. Then I have to repeat this procedure up to the last existing column (the original DataFrame has 43 columns, so the end is df.iloc[x,43]).
How can I do this in Python?
If possible, could you give some examples? I tried to put my code in the post, but I'm not good with my new phone.
I think you can use eq to compare the filtered DataFrame with column E at position 4:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,8,9],
'G':[1,3,5],
'H':[5,3,6],
'I':[7,4,3]})
print (df)
A B C D E F G H I
0 1 4 7 1 5 7 1 5 7
1 2 5 8 3 3 8 3 3 4
2 3 6 9 5 6 9 5 6 3
print (df.iloc[:,5:].eq(df.iloc[:,4], axis=0))
F G H I
0 False False True False
1 False True True False
2 False False True False
If you need to multiply by the column at position 4, use mul:
print (df.iloc[:,5:].mul(df.iloc[:,4], axis=0))
F G H I
0 35 5 25 35
1 24 9 9 12
2 54 30 36 18
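Or, if you need to multiply two overlapping slices (note that mul aligns on column labels, so each shared column is multiplied by itself, and E, which has no partner, is filled with 1):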
print (df.iloc[:,4:].mul(df.iloc[:,5:], axis=0, fill_value=1))
E F G H I
0 5.0 49 1 25 49
1 3.0 64 9 9 16
2 6.0 81 25 36 9
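The question also asks for the products to be saved as new columns. One possible sketch, assuming the products of column 4 with every later column are wanted, is to prefix the product columns and join them back; the "E_x_" naming scheme is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 8, 9],
                   'G': [1, 3, 5],
                   'H': [5, 3, 6],
                   'I': [7, 4, 3]})

# Column at position 4 (here 'E') is the one to multiply by.
base = df.iloc[:, 4]

# Multiply every later column by it, row by row, and rename the results
# so they do not clash with the original column names.
products = df.iloc[:, 5:].mul(base, axis=0).add_prefix(base.name + "_x_")

# Append the product columns to the original frame.
result = df.join(products)
print(result)
```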
[Image: the CSV file with the two columns]
You can use:
df.drop_duplicates('Salesperson_1')
Or maybe you need:
df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
Sample:
df = pd.DataFrame({'Salesperson_1':['a','a','b'],
'Salesperson_1_ID':[4,5,6]})
print (df)
Salesperson_1 Salesperson_1_ID
0 a 4
1 a 5
2 b 6
df1 = df.drop_duplicates('Salesperson_1')
print (df1)
Salesperson_1 Salesperson_1_ID
0 a 4
2 b 6
df.Salesperson_1_ID = df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
print (df)
Salesperson_1 Salesperson_1_ID
0 a 4
1 a 4
2 b 6
Pandas.groupby.first()
If your DataFrame is called df, you could just do this:
df.groupby('Salesperson_1_ID').first()
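For comparison, here is a small sketch of first() on the sample frame used earlier in this thread. It assumes the goal is one ID per salesperson, so it groups by the name column rather than by the ID column as in the line above.

```python
import pandas as pd

df = pd.DataFrame({'Salesperson_1': ['a', 'a', 'b'],
                   'Salesperson_1_ID': [4, 5, 6]})

# Keep the first ID seen for each salesperson name.
first_ids = df.groupby('Salesperson_1')['Salesperson_1_ID'].first()
print(first_ids)
```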
I have an xlsx file with over 1000 columns of data. I would first like to parse every second column of the data file (which can contain numbers and letters) and then create a unique list from the parsed data.
I'm a complete noob and have tried a "for" loop and a "do while" loop, but neither has worked for me.
So far I have:
import pandas as pd
workbook = pd.read_excel('C:\Python27\Scripts\Data.xlsx')
worksheet = workbook.sheetname='Data'
for col in range(worksheet[0], worksheet[1300]):
    print(col)
I think I need to append the data and maybe write to a text file then create a unique list from the text file - I can do the second part it's just getting it into the text file I'm having trouble with.
Thanks
You can iterate over your columns by slicing with a step argument, i.e. df.iloc[:, ::2] (older answers use df.ix, which was removed in pandas 1.0):
In [35]:
df = pd.DataFrame({'a':1, 'b':[1,2,3,4,5], 'c':[2,3,4,5,6], 'd':0,'e':np.random.randn(5)})
df
Out[35]:
a b c d e
0 1 1 2 0 -0.352310
1 1 2 3 0 1.189140
2 1 3 4 0 -1.470507
3 1 4 5 0 0.742709
4 1 5 6 0 -2.798007
Here we step through every 2nd column:
In [37]:
df.iloc[:,::2]
Out[37]:
a c e
0 1 2 -0.352310
1 1 3 1.189140
2 1 4 -1.470507
3 1 5 0.742709
4 1 6 -2.798007
We can then just call np.unique on the entire df to get a single array of all the unique values:
In [36]:
np.unique(df.iloc[:,::2])
Out[36]:
array([-2.79800676, -1.47050675, -0.35231005, 0.74270934, 1. ,
1.18914011, 2. , 3. , 4. , 5. , 6. ])
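The question mentions columns that mix numbers and letters. np.unique sorts its input, which can fail in Python 3 when strings and numbers are mixed, so a possible alternative is pd.unique, which does not sort. The sketch below uses a small made-up frame in place of the Excel data; the column names c0 to c3 are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the sheet: columns 0 and 2 mix numbers and letters.
df = pd.DataFrame({'c0': [1, 'x', 2],
                   'c1': [9, 9, 9],
                   'c2': ['x', 3, 1],
                   'c3': [8, 8, 8]})

# Take every second column, starting with the first.
every_second = df.iloc[:, ::2]

# Flatten to one dimension and keep each value once, in order of appearance.
unique_values = pd.unique(every_second.to_numpy().ravel())
print(list(unique_values))
```

With a real workbook, the frame would come from pd.read_excel instead, and list(unique_values) could then be written to a text file if needed.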
I recently started teaching a course and want to handle my grades using Python and the pandas module. For this class the students work in groups and turn in one assignment per table. I have a file with all of the students, formatted like this:
Name, Email, Table
"John Doe", jdoe#school.edu, 3
"Jane Doe", jane#gmail.com, 5
.
.
.
and another file with the grades for each table for the assignments done
Table, worksheet, another assignment, etc
1, 8, 15, 4
2, 9, 23, 5
3, 3, 20, 7
.
.
.
What I want to do is assign the appropriate grade to each student based on their table number. Here is what I have done
import pandas as pd
t_data = pd.read_csv('table_grades.csv')
roster = pd.read_csv('roster.csv')
for i in range(1, len(t_data.columns)):
    x = []
    for j in range(len(roster)):
        for k in range(len(t_data)):
            if roster.Table.values[j] == k+1:
                x.append(t_data[t_data.columns.values[i]][k])
    roster[t_data.columns.values[i]] = x
This does what I want, but I feel there must be a better way to do a task like this using pandas. I am new to pandas and appreciate any help.
IIUC -- unfortunately your code doesn't run for me with your data and you didn't give example output, so I can't be sure -- you're looking for merge. Adding a new student, Fred Smith, to table 3:
In [182]: roster.merge(t_data, on="Table")
Out[182]:
Name Email Table worksheet another assignment etc
0 John Doe jdoe#school.edu 3 3 20 7
1 Fred Smith fsmith#example.com 3 3 20 7
[2 rows x 6 columns]
or maybe an outer merge, to make it easier to spot missing/misaligned data:
In [183]: roster.merge(t_data, on="Table", how="outer")
Out[183]:
Name Email Table worksheet another assignment etc
0 John Doe jdoe#school.edu 3 3 20 7
1 Fred Smith fsmith#example.com 3 3 20 7
2 Jane Doe jane#gmail.com 5 NaN NaN NaN
3 NaN NaN 1 8 15 4
4 NaN NaN 2 9 23 5
[5 rows x 6 columns]
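If the aim is to keep every student and simply attach the grades of their table, a left merge is another option. This is a sketch on small hand-built frames that mirror the question's data, not the asker's actual files.

```python
import pandas as pd

roster = pd.DataFrame({"Name": ["John Doe", "Jane Doe"],
                       "Email": ["jdoe#school.edu", "jane#gmail.com"],
                       "Table": [3, 5]})
t_data = pd.DataFrame({"Table": [1, 2, 3],
                       "worksheet": [8, 9, 3],
                       "another assignment": [15, 23, 20],
                       "etc": [4, 5, 7]})

# how="left" keeps one row per student, in roster order; students whose
# table has no grades yet get NaN in the grade columns.
graded = roster.merge(t_data, on="Table", how="left")
print(graded)
```

Unlike the outer merge, this does not add rows for tables that have grades but no students.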
I would do something like this
import pandas as pd
from io import StringIO
roster = pd.read_csv(\
StringIO("""Name,Email,Table
'John Doe', jdoe#school.edu, 1
'Jane Doe', jane#gmail.com, 3
'Jack Doe', jack#gmail.com, 2"""))
t_data = pd.read_csv(\
StringIO("""Table,worksheet,another assignment,etc
1, 8, 15, 4
2, 9, 23, 5
3, 3, 20, 7"""))
roster=roster.set_index('Table')
res = pd.concat((roster.loc[t_data.Table].set_index(t_data.index), t_data), axis=1)
The result is
Name Email Table worksheet another assignment etc
0 'John Doe' jdoe#school.edu 1 8 15 4
1 'Jack Doe' jack#gmail.com 2 9 23 5
2 'Jane Doe' jane#gmail.com 3 3 20 7