So I have recently started teaching a course and wanted to handle my grades using Python and the pandas module. For this class the students work in groups and turn in one assignment per table. I have a file with all of the students that is formatted like this:
Name, Email, Table
"John Doe", jdoe@school.edu, 3
"Jane Doe", jane@gmail.com, 5
.
.
.
and another file with the grades for each table on the assignments done:
Table, worksheet, another assignment, etc
1, 8, 15, 4
2, 9, 23, 5
3, 3, 20, 7
.
.
.
What I want to do is assign the appropriate grade to each student based on their table number. Here is what I have done:
import pandas as pd

t_data = pd.read_csv('table_grades.csv')
roster = pd.read_csv('roster.csv')
for i in range(1, len(t_data.columns)):
    x = []
    for j in range(len(roster)):
        for k in range(len(t_data)):
            if roster.Table.values[j] == k + 1:
                x.append(t_data[t_data.columns.values[i]][k])
    roster[t_data.columns.values[i]] = x
This does what I want, but I feel like there must be a better way to do a task like this using pandas. I am new to pandas and appreciate any help.
IIUC -- unfortunately your code doesn't run for me with your data and you didn't give example output, so I can't be sure -- you're looking for merge. Adding a new student, Fred Smith, to table 3:
In [182]: roster.merge(t_data, on="Table")
Out[182]:
         Name               Email  Table  worksheet  another assignment  etc
0    John Doe     jdoe@school.edu      3          3                  20    7
1  Fred Smith  fsmith@example.com      3          3                  20    7

[2 rows x 6 columns]
or maybe an outer merge, to make it easier to spot missing/misaligned data:
In [183]: roster.merge(t_data, on="Table", how="outer")
Out[183]:
         Name               Email  Table  worksheet  another assignment  etc
0    John Doe     jdoe@school.edu      3          3                  20    7
1  Fred Smith  fsmith@example.com      3          3                  20    7
2    Jane Doe      jane@gmail.com      5        NaN                 NaN  NaN
3         NaN                 NaN      1          8                  15    4
4         NaN                 NaN      2          9                  23    5

[5 rows x 6 columns]
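If the end goal is a graded roster file, the merged frame can be written straight back out. A minimal sketch, assuming a left merge so every student is kept even without grades, and a hypothetical output name graded_roster.csv:

graded = roster.merge(t_data, on="Table", how="left")  # left merge: keep every student
graded.to_csv("graded_roster.csv", index=False)        # hypothetical output file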
I would do something like this:
import pandas as pd
from io import StringIO  # on Python 2 this was: from StringIO import StringIO

roster = pd.read_csv(StringIO("""Name,Email,Table
'John Doe', jdoe@school.edu, 1
'Jane Doe', jane@gmail.com, 3
'Jack Doe', jack@gmail.com, 2"""))
t_data = pd.read_csv(StringIO("""Table,worksheet,another assignment,etc
1, 8, 15, 4
2, 9, 23, 5
3, 3, 20, 7"""))
roster = roster.set_index('Table')
res = pd.concat((roster.loc[t_data.Table].set_index(t_data.index), t_data), axis=1)
The result is:
         Name            Email  Table  worksheet  another assignment  etc
0  'John Doe'  jdoe@school.edu      1          8                  15    4
1  'Jack Doe'   jack@gmail.com      2          9                  23    5
2  'Jane Doe'   jane@gmail.com      3          3                  20    7
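For comparison, a minimal sketch of the same alignment with a single join, assuming roster is the raw frame from read_csv (before the set_index call above):

res = roster.join(t_data.set_index('Table'), on='Table')  # look up each student's table row by its Table value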
Related
I tried to create a data frame df using the code below:
import numpy as np
import pandas as pd

index = [0, 1, 2, 3, 4, 5]
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)
t = pd.Series([2, 4, 6, 8, 10, 12], index=index)
df = pd.DataFrame(s, columns=["MUL1"])
df["MUL2"] = t
print(df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same data frame using the syntax below, I am getting a weird output.
df = pd.DataFrame([s, t], columns=["MUL1", "MUL2"])
print(df)
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why NaN is displayed in the dataframe when both Series are non-empty, and why only two rows are displayed and not the rest.
Also, please provide the correct way to create the same data frame as above using the columns argument of the pandas DataFrame constructor.
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
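As a side note, np.c_ is index-trick shorthand for column stacking; the named equivalent produces the same 2D array:

np.column_stack([s, t])  # same array as np.c_[s, t]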
If you remove the columns argument, you get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
Then define columns - if the columns do not exist in the data, you get NaN columns:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
Better is to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
And if you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in the DataFrame documentation.
Another solution uses concat - the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
A pandas.DataFrame takes in the parameter data that can be of type ndarray, iterable, dict, or dataframe.
If you pass in a list, it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns=["Col1", "Col2", "Col3"])
# output:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
You are getting NaN because the constructor treats each Series as a row, so the resulting columns are labelled by the Series index (0 through 5); neither "MUL1" nor "MUL2" exists among those labels, so both requested columns are filled with NaN, and only two rows appear because you passed two Series.
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
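Putting that together, a quick sketch of the full construction with the transposed data:

import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 4, 6]
data = np.array([a, b]).transpose()  # shape (3, 2): one row per element
df = pd.DataFrame(data, columns=["Col1", "Col2"])
print(df)
#    Col1  Col2
# 0     1     2
# 1     2     4
# 2     3     6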
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6
[Image: the csv file with the two columns Salesperson_1 and Salesperson_1_ID]
You can use:
df.drop_duplicates('Salesperson_1')
Or maybe you need:
df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
Sample:
df = pd.DataFrame({'Salesperson_1':['a','a','b'],
'Salesperson_1_ID':[4,5,6]})
print (df)
Salesperson_1 Salesperson_1_ID
0 a 4
1 a 5
2 b 6
df1 = df.drop_duplicates('Salesperson_1')
print (df1)
Salesperson_1 Salesperson_1_ID
0 a 4
2 b 6
df.Salesperson_1_ID = df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
print (df)
Salesperson_1 Salesperson_1_ID
0 a 4
1 a 4
2 b 6
Pandas.groupby.first()
If your DataFrame is called df, you could just do this:
df.groupby('Salesperson_1').first()
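On the sample frame from the previous answer, that call keeps the first ID per salesperson and moves the name into the index:

import pandas as pd

df = pd.DataFrame({'Salesperson_1': ['a', 'a', 'b'],
                   'Salesperson_1_ID': [4, 5, 6]})
print(df.groupby('Salesperson_1').first())
#                Salesperson_1_ID
# Salesperson_1
# a                             4
# b                             6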
I have a csv file in this order:
a
1 2 3
4 5 6
7 8 9
b
7 8 9
4 5 6
1 2 3
How can I change it to the following form,
a 1 2 3
4 5 6
7 8 9
b 7 8 9
4 5 6
1 2 3
with a and b in the first column and the numbers in the second, third, and fourth columns respectively?
My code is:
import csv

with open('csv_test.csv', 'w') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(['a', 'b', 'c'])
    wr.writerow([1, 2, 3])
    wr.writerow([4, 5, 6])
    wr.writerow([7, 8, 9])
I note that you are using the csv module.
This code reads each line. If a line consists of a single field, the contents of that field are treated as a header and remembered for the next line.
After the next line is read, the header and the contents of that line are written out; the header is then replaced with an empty string, so subsequent lines get a blank field in the first column and the numbers in the second, third, etc. columns.
import csv

with open("infile.csv", "r") as infile:
    with open("myfile.csv", "w", newline="") as myfile:  # newline="" avoids blank lines on Windows
        reader = csv.reader(infile)
        wr = csv.writer(myfile)
        column1 = ""
        for fields in reader:
            if len(fields) == 1:
                column1 = fields[0]   # a one-field line is a header; remember it
            else:
                wr.writerow([column1] + fields)
                column1 = ""          # blank the header for subsequent lines
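For example, assuming the numbers are actually comma-separated in the file (the question's display drops the commas), an infile.csv containing

a
1,2,3
4,5,6
7,8,9
b
7,8,9
4,5,6
1,2,3

is written to myfile.csv as

a,1,2,3
,4,5,6
,7,8,9
b,7,8,9
,4,5,6
,1,2,3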
I have an xlsx file with over 1000 columns of data. I would like to first parse every second column from the data file (which can contain numbers and letters) and then create a unique list from the parsed data.
I'm a complete noob and have tried "for" and "do while" loops, but neither has worked for me.
So far I have:
import pandas as pd
workbook = pd.read_excel('C:\Python27\Scripts\Data.xlsx')
worksheet = workbook.sheetname='Data'
for col in range(worksheet[0], worksheet[1300]):
print(col)
I think I need to append the data and maybe write it to a text file, then create a unique list from the text file - I can do the second part; it's just getting the data into the text file that I'm having trouble with.
Thanks
You can select every second column by slicing with a step argument, i.e. df.iloc[:, ::2] (the older .ix accessor has been removed from modern pandas):
In [35]:
import numpy as np
df = pd.DataFrame({'a': 1, 'b': [1,2,3,4,5], 'c': [2,3,4,5,6], 'd': 0, 'e': np.random.randn(5)})
df
Out[35]:
a b c d e
0 1 1 2 0 -0.352310
1 1 2 3 0 1.189140
2 1 3 4 0 -1.470507
3 1 4 5 0 0.742709
4 1 5 6 0 -2.798007
here we step every 2nd column:
In [37]:
df.iloc[:,::2]
Out[37]:
a c e
0 1 2 -0.352310
1 1 3 1.189140
2 1 4 -1.470507
3 1 5 0.742709
4 1 6 -2.798007
We can then just call np.unique on the sliced frame to get a single array of all the unique values:
In [36]:
np.unique(df.iloc[:,::2])
Out[36]:
array([-2.79800676, -1.47050675, -0.35231005, 0.74270934, 1. ,
1.18914011, 2. , 3. , 4. , 5. , 6. ])
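Tying this back to the original xlsx file, a minimal sketch assuming the path and sheet name from the question and a recent pandas (where the keyword is sheet_name); pd.unique on the flattened values also copes with columns that mix numbers and letters:

import pandas as pd

df = pd.read_excel(r'C:\Python27\Scripts\Data.xlsx', sheet_name='Data')
every_second = df.iloc[:, ::2]                        # every second column
unique_vals = pd.unique(every_second.values.ravel())  # one flat array of unique values
print(unique_vals)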
Assume I have a DataFrame that looks like the one below.
df = pd.DataFrame({
'name' : ['1st', '2nd', '3rd'],
'john_01' : [1, 2, 3],
'mary_02' : [4,5,6],
'peter_03' : [7, 8, 9],
'roger_04' : [10,11, 12],
'ken_05' : [13, 14, 15],
})
df2 = df.set_index('name')
john_01 ken_05 mary_02 peter_03 roger_04
name
1st 1 13 4 7 10
2nd 2 14 5 8 11
3rd 3 15 6 9 12
Modify_List_col = ['mary_02','peter_03']
Modify_List_row = ['2nd'] # use tolist() to get this list from additional files
I only want to modify the cells whose columns are in Modify_List_col and whose rows are in Modify_List_row, so I will get something like the output below, where those cells are replaced by 'X'.
john_01 ken_05 mary_02 peter_03 roger_04
name
1st 1 13 4 7 10
2nd 2 14 X X 11
3rd 3 15 6 9 12
Does anyone know how to get this result in one line using pandas, please?
You can use the loc method:
In[25]: df = pd.DataFrame(np.arange(25).reshape(5,5)).set_index(0)  # np is numpy; the old pd.np alias has been removed
In[26]: df
Out[26]:
1 2 3 4
0
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
In[27]: df.loc[[10,15],[2,3,4]] = "x"
In[28]: df
Out[28]:
1 2 3 4
0
0 1 2 3 4
5 6 7 8 9
10 11 x x x
15 16 x x x
20 21 22 23 24
To do that, just set the column 0 as index, then select the portion of the dataframe with loc and assign the value "x".
It works in the same way for your last dataset:
In[51]: Modify_List_col = ['mary_02', 'peter_03']
Modify_List_row = ['2nd']
df2.loc[Modify_List_row, Modify_List_col] = "X"  # df2 has 'name' as its index, so the '2nd' label exists
In[52]: df2
Out[52]:
john_01 ken_05 mary_02 peter_03 roger_04
name
1st 1 13 4 7 10
2nd 2 14 X X 11
3rd 3 15 6 9 12
I hope this can help you.