ReshapeError while trying to pivot pandas dataframe - python-2.7

Using pandas 0.11 on Python 2.7.3, I am trying to pivot a simple dataframe with the following values:
StudentID QuestionID Answer DateRecorded
0 1234 bar a 2012/01/21
1 1234 foo c 2012/01/22
2 4321 bop a 2012/01/22
3 5678 bar a 2012/01/24
4 8765 baz b 2012/02/13
5 4321 baz b 2012/02/15
6 8765 bop b 2012/02/16
7 5678 bop c 2012/03/15
8 5678 foo a 2012/04/01
9 1234 baz b 2012/04/11
10 8765 bar a 2012/05/03
11 4321 bar a 2012/05/04
12 5678 baz c 2012/06/01
13 1234 bar b 2012/11/01
I am using the following command:
df.pivot(index='StudentID', columns='QuestionID')
But I am getting the following error:
ReshapeError: Index contains duplicate entries, cannot reshape
Note that without the last row,
13 1234 bar b 2012/11/01
the same dataframe pivots successfully to the following:
Answer DateRecorded
QuestionID bar baz bop foo bar baz bop foo
StudentID
1234 a b NaN c 2012/01/21 2012/04/11 NaN 2012/01/22
4321 a b a NaN 2012/05/04 2012/02/15 2012/01/22 NaN
5678 a c c a 2012/01/24 2012/06/01 2012/03/15 2012/04/01
8765 a b b NaN 2012/05/03 2012/02/13 2012/02/16 NaN
I am new to pivoting and would like to know why a duplicate (StudentID, QuestionID) pair causes this problem, and how I can fix it using the df.pivot() function.
Thank you.

What do you expect your pivot table to look like with the duplicate entries? I'm not sure it would make sense to have multiple elements for (1234, bar) in the pivot table. Your data looks like it's naturally indexed by (questionID, studentID, dateRecorded).
If you go with the Hierarchical Index approach (they're really not that complicated!) I'd try:
In [104]: df2 = df.set_index(['StudentID', 'QuestionID', 'DateRecorded'])
In [105]: df2
Out[105]:
Answer
StudentID QuestionID DateRecorded
1234 bar 2012/01/21 a
foo 2012/01/22 c
4321 bop 2012/01/22 a
5678 bar 2012/01/24 a
8765 baz 2012/02/13 b
4321 baz 2012/02/15 b
8765 bop 2012/02/16 b
5678 bop 2012/03/15 c
foo 2012/04/01 a
1234 baz 2012/04/11 b
8765 bar 2012/05/03 a
4321 bar 2012/05/04 a
5678 baz 2012/06/01 c
1234 bar 2012/11/01 b
In [106]: df2.unstack('QuestionID')
Out[106]:
Answer
QuestionID bar baz bop foo
StudentID DateRecorded
1234 2012/01/21 a NaN NaN NaN
2012/01/22 NaN NaN NaN c
2012/04/11 NaN b NaN NaN
2012/11/01 b NaN NaN NaN
4321 2012/01/22 NaN NaN a NaN
2012/02/15 NaN b NaN NaN
2012/05/04 a NaN NaN NaN
5678 2012/01/24 a NaN NaN NaN
2012/03/15 NaN NaN c NaN
2012/04/01 NaN NaN NaN a
2012/06/01 NaN c NaN NaN
8765 2012/02/13 NaN b NaN NaN
2012/02/16 NaN NaN b NaN
2012/05/03 a NaN NaN NaN
Otherwise you can come up with some rule to determine which of the multiple entries to take for the pivot table, and avoid the Hierarchical index.
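For example, one simple rule is to keep only the most recent answer per (StudentID, QuestionID) pair before pivoting. A minimal sketch, assuming the frame is named df as in the question and the pandas-0.11-era API (DataFrame.sort; the zero-padded yyyy/mm/dd date strings sort chronologically):
# Sort by date so that .last() picks the most recent record
# in each (StudentID, QuestionID) group, then widen as before.
df_latest = df.sort('DateRecorded').groupby(['StudentID', 'QuestionID']).last()
print df_latest.unstack('QuestionID')
This gives the same wide table as above, except that the (1234, bar) cell now holds the 2012/11/01 answer b instead of raising ReshapeError.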

Instead of relying on pandas (which is of course the better option), you could also aggregate your data manually.
from collections import defaultdict
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def heatmap_seaborn():
    na_lr_measures = [50, 50, 50, 49, 49, 49, 48, 47, 47, 47, 46, 46, 46, 46, 45, 45, 45, 45, 45, 45, 45, 45, 45, 43, 43, 43, 43, 42, 42, 42, 41, 41, 41, 41, 41, 41, 41, 40, 40, 40, 40, 40, 40, 40, 40, 39, 39, 37, 37, 36, 36, 36, 36, 35, 35, 35, 35, 35, 34, 34, 34, 33, 33, 33, 32, 32, 31, 30, 30, 30, 29, 29]
    na_lr_labels = ('bi2e', 'bi21', 'bi22', 'si21', 'si22', 'si2e', 'si11', 'bi11', 'bi1e', 'si1e', 'bx21', 'ti22', 'bx2e', 'si12', 'ti1e', 'sx22', 'ti21', 'bx22', 'sx2e', 'bi12', 'ti11', 'sx21', 'ti2e', 'ti12', 'sx11', 'sx1e', 'bxx2', 'bx1e', 'bx11', 'tx2e', 'tx22', 'tx21', 'sx12', 'six1', 'six2', 'sixe', 'sixx', 'tx11', 'bx12', 'bix2', 'bix1', 'tx1e', 'bixe', 'bixx', 'bxxe', 'sxx2', 'tx12', 'tixe', 'tix1', 'sxxe', 'sxx1', 'si1x', 'tixx', 'bxx1', 'tix2', 'bi2x', 'sxxx', 'si2x', 'txx1', 'bxxx', 'txxe', 'ti2x', 'sx2x', 'bx2x', 'txxx', 'bi1x', 'tx1x', 'sx1x', 'tx2x', 'txx2', 'bx1x', 'ti1x')
    na_lr_labelcategories = ["TF", "IDF", "Normalisation", "Regularisation", "Acc#161"]
    measures = na_lr_measures
    labels = na_lr_labels
    cats = na_lr_labelcategories
    new_measures = defaultdict(list)
    new_labels = []
    #cats = ["TF", "Normalisation", "Acc#161"]
    # collapse each four-character label to its first and third characters
    for i, c in enumerate(labels):
        c = c[0] + c[2]
        new_labels.append(c)
        m = measures[i]
        new_measures[c].append(m)
    # average the measures that fall into the same collapsed label
    labels = list(set(new_labels))
    measures = []
    for l in labels:
        m = np.mean(new_measures[l])
        measures.append(m)
    df = pd.DataFrame(
        {cats[0]: pd.Categorical([a[0] for a in labels]),
         #cats[1]: pd.Categorical([a[1] for a in labels]),
         cats[2]: pd.Categorical([a[1] for a in labels]),
         #cats[3]: pd.Categorical([a[3] for a in labels]),
         cats[4]: measures})
    print df
    df = df.pivot(cats[0], cats[2], cats[4])
    sns.set_context("paper", font_scale=2.7)
    fig, ax = plt.subplots()
    ax = sns.heatmap(df)
    plt.show()
As you can see in the example, a pandas DataFrame is built from some arrays, and then the table is aggregated manually.
I did this because I didn't have the time to learn more pandas.
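For reference, the manual defaultdict aggregation in the middle of that function is exactly what a pandas groupby computes; a short sketch, with hypothetical small stand-ins for the full na_lr_labels/na_lr_measures arrays above:
import pandas as pd
# hypothetical small stand-ins for na_lr_labels / na_lr_measures
labels = ['bi2e', 'bi21', 'si21', 'sx22']
measures = [50, 50, 49, 45]
tmp = pd.DataFrame({'label': [c[0] + c[2] for c in labels],
                    'measure': measures})
# mean measure per collapsed two-character label, as the two loops compute
print tmp.groupby('label')['measure'].mean()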

Related

How can I find pinpoint citations with regex

I am trying to analyze strings with a regex (e.g. "20, 38,", "20, 24 n2,", "20, 28, 38,", "851, 859 n3,") in XML files.
Example text:
<p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 38, 111 S Ct 1647:</p>
<p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 24 n2, 111 S Ct 1647</p>
<p>Gilmer v Interstate/Johnson Lane Corp.</italic> (1991) 500 US 20, 28, 38, 111 S Ct 1647</p>
<p>International Bhd. of Elec. Workers v Hechler (1987) 481 US 851, 859 n3, 107 S Ct 2161:</p>
I want to modify the regex (\([^()]*)|([0-9]+,)\s*[0-9]+,?\s*[0-9]+, because I am replacing the matched text with $1$2.
(https://regex101.com/r/jWt2w1/2)
Use
(\([^()]*)|([0-9]+,)\s*[0-9]+(?:\s+[a-z]+)?,?\s*[0-9]+(?:\s+[a-z]+)?,
See proof
The (?:\s+[a-z]+)? part optionally matches one or more whitespace characters followed by one or more lowercase letters, which is what lets footnote markers such as n2 and n3 appear between the numbers.
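A quick way to sanity-check the pattern is a small script. One caveat: in Python 2.7, re.sub raises an "unmatched group" error when the replacement references a group that did not participate in the match, so a callable replacement stands in for $1$2:
import re

pattern = r'(\([^()]*)|([0-9]+,)\s*[0-9]+(?:\s+[a-z]+)?,?\s*[0-9]+(?:\s+[a-z]+)?,'
text = '<p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 24 n2, 111 S Ct 1647</p>'

# The lambda keeps whichever of group 1 or group 2 matched,
# exactly like the $1$2 replacement on regex101.
result = re.sub(pattern,
                lambda m: (m.group(1) or '') + (m.group(2) or ''),
                text)
print result
# <p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 111 S Ct 1647</p>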

Subsetting using a Bool-Vector in Rcpp-Function (problems of a Rcpp Beginner...)

Problem description (think of a membership with different prices for adults and kids):
I have two data sets, one containing an age and a code. A second dataframe "decodes" the codes to numeric values depending on whether someone is a kid or an adult. I now want to match the codes in both data sets and receive a vector that contains a numeric value for each customer in the data set.
I can make this work with standard R functionality, but since my original data contains several million observations I would like to speed up the computation using the Rcpp package.
Unfortunately I am not succeeding, especially with how to perform the subsetting based on a logical vector as I would do it in R. I am quite new to Rcpp and have no experience with C++, so I may be missing some very basic point.
I attached a minimal working example in R and would appreciate any kind of help or explanation!
library(Rcpp)
raw_data = data.frame(
  age = c(10, 14, 99, 67, 87, 54, 12, 44, 22, 8),
  iCode = c("code1", "code2", "code3", "code1", "code4", "code3", "code2", "code5", "code5", "code3"))
decoder = data.frame(
  code = c("code1", "code2", "code3", "code4", "code5"),
  kid = c(0, 0, 0, 0, 100),
  adult = c(100, 200, 300, 400, 500))
#-------- R approach (works, but takes ages for my original data set)
calc_value = function(data, decoder){
  y = nrow(data)
  for (i in 1:nrow(data)){
    position_in_decoder = (data$iCode[i] == decoder$code)
    if (data$age[i] > 18){
      y[i] = decoder$adult[position_in_decoder]
    } else {
      y[i] = decoder$kid[position_in_decoder]
    }
  }
  return(y)
}
y = calc_value(raw_data, decoder)
#--------- RCPP approach (I cannot make this one work) :(
cppFunction(
  'NumericVector calc_Rcpp(DataFrame df, DataFrame decoder) {
    NumericVector age = df["age"];
    CharacterVector iCode = df["iCode"];
    CharacterVector code = decoder["code"];
    NumericVector adult = decoder["adult"];
    NumericVector kid = decoder["kid"];
    const int n = age.size();
    LogicalVector position;
    NumericVector y(n);
    for (int i = 0; i < n; ++i) {
      position = (iCode[i] == code);
      if (age[i] > 18) y[i] = adult[position];
      else y[i] = kid[position];
    }
    return y;
  }')
There is no need to go for C++ here. Just use R properly:
raw_data = data.frame(
  age = c(10, 14, 99, 67, 87, 54, 12, 44, 22, 8),
  iCode = c("code1", "code2", "code3", "code1", "code4", "code3", "code2", "code5", "code5", "code3"))
decoder = data.frame(
  code = c("code1", "code2", "code3", "code4", "code5"),
  kid = c(0, 0, 0, 0, 100),
  adult = c(100, 200, 300, 400, 500))
foo <- merge(raw_data, decoder, by.x = "iCode", by.y = "code")
foo$res <- ifelse(foo$age > 18, foo$adult, foo$kid)
foo
#> iCode age kid adult res
#> 1 code1 10 0 100 0
#> 2 code1 67 0 100 100
#> 3 code2 14 0 200 0
#> 4 code2 12 0 200 0
#> 5 code3 54 0 300 300
#> 6 code3 99 0 300 300
#> 7 code3 8 0 300 0
#> 8 code4 87 0 400 400
#> 9 code5 44 100 500 500
#> 10 code5 22 100 500 500
That should also work for large data sets.
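As an aside, if the data ever ends up on the Python side, the same merge-then-vectorized-choice pattern translates directly to pandas; a sketch, assuming the same column names as above:
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({
    'age': [10, 14, 99, 67, 87, 54, 12, 44, 22, 8],
    'iCode': ['code1', 'code2', 'code3', 'code1', 'code4',
              'code3', 'code2', 'code5', 'code5', 'code3']})
decoder = pd.DataFrame({
    'code': ['code1', 'code2', 'code3', 'code4', 'code5'],
    'kid': [0, 0, 0, 0, 100],
    'adult': [100, 200, 300, 400, 500]})

# Merge the decoder onto every observation, then pick the kid/adult
# value with a vectorized condition instead of a per-row loop.
foo = raw_data.merge(decoder, left_on='iCode', right_on='code')
foo['res'] = np.where(foo['age'] > 18, foo['adult'], foo['kid'])
print foo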

Find sum of the column values based on some other column

I have an input file like this:
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
c,u,g,nfk,ekh,trc,085,83,xppnl
For every unique value of column 1, I need to find the sum of column 7.
Similarly, for every unique value of column 2, I need to find the sum of column 7.
The output for column 1 should look like:
j,686
u,308
t,98
c,83
The output for column 2 should look like:
z,686
i,308
a,98
u,83
I am fairly new to Python. How can I achieve the above?
This could be done using Python's Counter and csv library as follows:
from collections import Counter
import csv

c1 = Counter()
c2 = Counter()

with open('input.csv') as f_input:
    for cols in csv.reader(f_input):
        col7 = int(cols[6])
        c1[cols[0]] += col7
        c2[cols[1]] += col7

print "Column 1"
for value, count in c1.iteritems():
    print '{},{}'.format(value, count)

print "\nColumn 2"
for value, count in c2.iteritems():
    print '{},{}'.format(value, count)
Giving you the following output:
Column 1
c,85
j,686
u,308
t,1080
Column 2
i,308
a,1080
z,686
u,85
A Counter is a type of Python dictionary that is useful for counting items automatically. c1 holds all of the column 1 entries and c2 holds all of the column 2 entries. Note, Python numbers lists starting from 0, so the first entry in a list is [0].
The csv library loads each line of the file into a list, with each entry in the list representing a different column. The code takes column 7 (i.e. cols[6]) and converts it into an integer, as all columns are held as strings. It is then added to the counter using either the column 1 or 2 value as the key. The result is two dictionaries holding the totaled counts for each key.
You can use pandas:
df = pd.read_csv('my_file.csv', header=None)
print(df.groupby(0)[6].sum())
print(df.groupby(1)[6].sum())
Output:
0
c 85
j 686
t 1080
u 308
Name: 6, dtype: int64
1
a 1080
i 308
u 85
z 686
Name: 6, dtype: int64
The data frame should look like this:
print(df.head())
Output:
0 1 2 3 4 5 6 7 8
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
You can also use your own names for the columns, like c1, c2, ..., c9:
df = pd.read_csv('my_file.csv', index_col=False, names=['c' + str(x) for x in range(1, 10)])
print(df)
Output:
c1 c2 c3 c4 c5 c6 c7 c8 c9
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
5 t a a jqj dtd yxq 540 49 kxthz
6 c u g nfk ekh trc 85 83 xppnl
Now, group by column c1 or column c2 and sum up column c7:
print(df.groupby(['c1'])['c7'].sum())
print(df.groupby(['c2'])['c7'].sum())
Output:
c1
c 85
j 686
t 1080
u 308
Name: c7, dtype: int64
c2
a 1080
i 308
u 85
z 686
Name: c7, dtype: int64
SO isn't supposed to be a code-writing service, but I had a few minutes. :) Without pandas you can do it with the csv module:
import csv

def sum_to(results, key, add_value):
    if key not in results:
        results[key] = 0
    results[key] += int(add_value)

column1_results = {}
column2_results = {}

with open("input.csv", 'rt') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        sum_to(column1_results, row[0], row[6])
        sum_to(column2_results, row[1], row[6])

print column1_results
print column2_results
Results:
{'c': 85, 'j': 686, 'u': 308, 't': 1080}
{'i': 308, 'a': 1080, 'z': 686, 'u': 85}
Your expected results don't seem to match the math that Mike's answer and mine got using your spec. I'd double check that.

How to detect character components in values and replace those values with -99

My data looks like:
VAR_A: 134, 15M3, 2004, 301ME, 201E, 41, 53, 22
I'd like to change this vector like below:
VAR_A: 134, -99, 2004, -99, -99, 41, 53, 22
If a value contains characters (e.g., M, E), I want to replace those values with -99.
How could I do it in R? I've heard that regular expressions would be a possible way, but I'm not good at them.
It seems to me you want to replace the values that contain any non-digit character; if that is the case ...
x <- c('134', '15M3', '2004', '301ME', '201E', '41', '53', '22')
sub('.*\\D.*', '-99', x)
# [1] "134" "-99" "2004" "-99" "-99" "41" "53" "22"
Or essentially you could do:
x[grepl('\\D', x)] <- -99
as.numeric(x)
# [1] 134 -99 2004 -99 -99 41 53 22

How do you calculate expanding mean on time series using pandas?

How would you create a column (or columns) in the pandas DataFrame below where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'? Imagine this as if it were time series data where 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2.
I have tried every way I could think of but just can't seem to get it right.
left4 = pd.DataFrame({'ID': [1, 2, 3, 4],
                      'val': [10000, 25000, 20000, 40000],
                      'Mod_ID': [15, 35, 15, 42],
                      'car': ['ford', 'honda', 'ford', 'lexus']})
right4 = pd.DataFrame({'ID': [3, 1, 2, 4],
                       'color': ['red', 'green', 'blue', 'grey'],
                       'wheel': ['4wheel', '4wheel', '2wheel', '2wheel'],
                       'Mod_ID': [15, 15, 35, 42]})
df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)
Hard to test properly on your DataFrame, but you can use something like this:
>>> df1["exp_mean"] = df1[["Mod_ID_x","val"]].groupby("Mod_ID_x").transform(pd.expanding_mean)
>>> df1
ID Mod_ID_x car val color wheel exp_mean
0 1 15 ford 10000 green 4wheel 10000
1 2 35 honda 25000 blue 2wheel 25000
2 3 15 ford 20000 red 4wheel 15000
3 4 42 lexus 40000 grey 2wheel 40000
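If you also want the expanding median, the same groupby/transform pattern should work with the matching function from the old moments API; a sketch, assuming a pandas version of that era (pd.expanding_mean and pd.expanding_median were removed in later releases in favor of .expanding()):
# Same idea as above, with the median instead of the mean.
df1["exp_median"] = df1[["Mod_ID_x", "val"]].groupby("Mod_ID_x").transform(pd.expanding_median)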