Multiply rows in a dataframe, then sum them together (Python 2.7)

I have a function to apply to this table:
F(x) = 1.5*x1 + 2*x2 - 1.5*x3
where xi, i = 1, 2, 3, is the value in column Xi.
And I have the following table below.
X1    | X2    | X3
------|-------|------
20    | 15    | 12
30    | 17    | 24
40    | 23    | 36
The desired output is shown below: the function is applied to each row, taking the value from each column, and the resulting sum is appended to the dataframe as a new column.
X1    | X2    | X3    | F(X)
------|-------|-------|------
20    | 15    | 12    | 42
30    | 17    | 24    | 43
40    | 23    | 36    | 52
Is there a way to do this in Python 2.7?

Something like this?
df['F(x)'] = df.mul([1.5, 2, -1.5]).sum(axis=1)
df
Out[1076]:
X1 X2 X3 F(x)
0 20 15 12 42.0
1 30 17 24 43.0
2 40 23 36 52.0
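For completeness, a minimal, self-contained sketch of that element-wise approach, assuming the example table from the question (the DataFrame construction here is only illustrative):

import pandas as pd

# Rebuild the example table from the question
df = pd.DataFrame({'X1': [20, 30, 40],
                   'X2': [15, 17, 23],
                   'X3': [12, 24, 36]})

# Multiply each column by its coefficient, then sum across the columns
df['F(x)'] = df.mul([1.5, 2, -1.5]).sum(axis=1)
print(df)
# The F(x) column should come out as 42.0, 43.0, 52.0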

OK, I found sample code that solves my problem.
var1 = 1.5
var2 = 2
var3 = -1.5

def calculate_fx(row):
    return (var1 * row['X1']) + (var2 * row['X2']) + (var3 * row['X3'])

# function_df is the predefined dataframe
function_df['F(X)'] = function_df.apply(calculate_fx, axis=1)
function_df

Related

How to rename a column name to a new value in a dataframe if the column names are dynamic

I have a CSV file whose column names change based on month and year, but they always contain a keyword like 'sales', 'product', etc. Is there a way in Python to rename such a column to a fixed value by searching for the keyword?
Sample column names would be 2019 May sales Tv, 2018 April sales Fridge.
This is what I have tried so far:
df_nw = df.rename(df.filter(like='Sales').columns.values
Current data:
column1 column2 2019AprilSalesTV 2018ActualSalesTV
X BBBB 7766 60
Y CCCC 10 20
Z LLLLL 60 65
K TTTTT 10 67
New Data:
column1 column2 Sales ActualSales
X BBBB 7766 60
Y CCCC 10 20
Z LLLLL 60 65
K TTTTT 10 67
You can do:
import re

clean_colname = lambda x: re.sub(r'(^\w+(?<!Actual))(Sales)', r'\2',
                                 re.sub(r'^\d+|TV$', r'', x))

df_nw.rename(clean_colname, axis=1)
column2 Sales ActualSales
column1
X BBBB 7766 60
Y CCCC 10 20
Z LLLLL 60 65
K TTTTT 10 67
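A minimal, self-contained sketch of the same idea on the sample column names from the question (the DataFrame values here are just placeholders):

import re
import pandas as pd

df = pd.DataFrame({'column1': ['X', 'Y'],
                   'column2': ['BBBB', 'CCCC'],
                   '2019AprilSalesTV': [7766, 10],
                   '2018ActualSalesTV': [60, 20]})

def clean_colname(name):
    # Strip the leading year and the trailing product keyword ...
    name = re.sub(r'^\d+|TV$', '', name)
    # ... then collapse '<Month>Sales' to 'Sales' while leaving 'ActualSales' alone
    return re.sub(r'(^\w+(?<!Actual))(Sales)', r'\2', name)

df_nw = df.rename(clean_colname, axis=1)
print(df_nw.columns.tolist())
# ['column1', 'column2', 'Sales', 'ActualSales']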

Iterating through pandas dataframe and appending to a list

I have a pandas dataframe df
Date SKU Balance
0 1/1/2017 X1 8
1 2/1/2017 X2 45
2 3/1/2017 X1 47
3 4/1/2017 X2 16
4 5/1/2017 X1 14
5 6/1/2017 X2 67
6 7/1/2017 X2 9
8 8/1/2017 X1 66
9 9/1/2017 X1 158
I want to break it up and append it to a list so that each item in the list is a 4-day window of the dataframe.
For Example
List[1]
Date SKU Balance
0 1/1/2017 X1 8
1 2/1/2017 X2 45
2 3/1/2017 X1 47
3 4/1/2017 X2 16
List[2]
Date SKU Balance
0 2/1/2017 X2 45
1 3/1/2017 X1 47
2 4/1/2017 X2 16
3 5/1/2017 X1 14
At the moment I can only append one day per list item, using the code below:
dr = pd.date_range('20170101', '20170109')
list = []
for d in dr:
    list.append(df.loc[df.Date.isin([d])])
As shown above, how can I append the 4 days starting from the 1st day as one list item, then move on to the 2nd day and append the next 4 days of rows, and so on?
I'd highly appreciate your help.
Use reindex and np.r_ with a list comprehension:
l = [df.reindex(np.r_[i:i+4]) for i in range(len(df))]
You can try with np.roll:
l = []
a = df.index.values
for x in a:
    l.append(df.loc[a[:4]])
    a = np.roll(a, -1)
Slice in a list comprehension.
ls = [df.loc[i:i+3] for i in range(len(df))]
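A quick, self-contained check of the first answer's reindex approach (a simplified frame with a plain integer index is assumed here; windows that run past the end are padded with NaN by reindex):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2017-01-01', periods=9),
                   'SKU': ['X1', 'X2', 'X1', 'X2', 'X1', 'X2', 'X2', 'X1', 'X1'],
                   'Balance': [8, 45, 47, 16, 14, 67, 9, 66, 158]})

# One 4-row window starting at every row
windows = [df.reindex(np.r_[i:i + 4]) for i in range(len(df))]

print(windows[0])   # rows 0-3 (1/1 to 4/1)
print(windows[1])   # rows 1-4 (2/1 to 5/1)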

Delete or remove unexpected records and strings based on multiple criteria with a Python or R script

I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete the unnecessary records/rows and strings based on multiple criteria, using a Python or R script, and save the result into a new .csv file named resultFile.csv.
What I want to do is as follows:
1. Delete the first column.
2. Split column BB into two columns named a_id and b_id. Separate the value at the _ (underscore): the left side goes to a_id and the right side to b_id.
3. Keep only the records that have a .csv file name in the BB column and do not contain No Bi in the CC column.
4. Assign a new name to each of the columns.
5. Delete the records that contain strings like less in the CC column.
6. Trim all other unnecessary strings from the records.
7. Delete the remaining fields of each row after the "Mi" value in that row.
My fileOne.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My first expected result would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57 Mi
1 3 0 10 27 Mi 0.5
1 6 0 10 Mi 53 cnt
7 9 0 26 46 Mi 121
My final expected result would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
This can be achieved with the following Python script:
import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

sanitise_table = string.maketrans("", "")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)   # Keep digits only

with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    input_header = next(f_input)
    csv_output.writerow(output_header)

    for row in csv_input:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
        if bb and row[2] not in ['No Bi', 'less']:
            # Blank out all columns from 'Mi' onwards, if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)
To simply blank out the columns from 'Mi' onwards in an existing file, the following can be used:
import csv

with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        try:
            mi = row.index('Mi')
            row[:] = row[:mi] + [''] * (len(row) - mi)
        except ValueError:
            pass
        csv_output.writerow(row)
Tested using Python 2.7.9
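Note that string.maketrans and the two-argument form of str.translate used above exist only in Python 2. On Python 3, a regex-based sanitiser is one way to get the same digit-keeping behaviour (a hedged sketch, not part of the original answer):

import re

def sanitise_cell(cell):
    # Keep only the digits in the cell, mirroring the translate() trick above
    return re.sub(r'\D', '', cell)

# e.g. sanitise_cell('=10"') -> '10', sanitise_cell('"Mi"') -> ''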

Rolling sum with unbalanced panel with non-even times in Stata

I have an unbalanced daily panel where entries occur at uneven times. I would like to generate the rolling sum of some variable x over the past 365 days. I can think of two ways to do this, but the first is memory hungry and the second is processor hungry. Is there a third alternative that avoids these problems?
Here are my two solutions:
clear
set obs 200
set seed 2001
/* panel variables */
generate id = 1 + int(2*runiform())
generate time = mdy(1, 1, 2000) + int(10*365*runiform())
format time %td
duplicates drop
xtset id time
/* data */
generate x = runiform()
/* first approach is to fill the panel with `tsfill` */
/* then remove "seasonality" with `s.` */
tsfill
generate sx = sum(x)
generate ssx = s365.sx
/* second approach without `tsfill` */
/* but nested loop is fairly slow */
drop if missing(x)
generate double ssx_alt = 0
forvalues i = 1/`= _N' {
    local j = `i'
    local delta = time[`i'] - time[`j']
    while ((`j' > 0) & (`delta' < 365) & (id[`i'] == id[`j'])) {
        local x = cond(missing(x[`j']), 0, x[`j'])
        replace ssx_alt = ssx_alt + `x' in `i'
        local j = `j' - 1
        local delta = time[`i'] - time[`j']
    }
}
The sum over the last # days is the difference between two cumulative sums, the cumulative sum to now and the cumulative sum to # days ago. The extension to panel data is easy, but not shown here. I don't think gaps disturb this principle once you have applied tsfill.
. set obs 20
obs was 0, now 20
. gen t = _n
. gen y = 100 + _n
. gen sumy = sum(y)
. tsset t
time variable: t, 1 to 20
delta: 1 unit
. gen diff = sumy - L10.sumy
(10 missing values generated)
. l
+------------------------+
| t y sumy diff |
|------------------------|
1. | 1 101 101 . |
2. | 2 102 203 . |
3. | 3 103 306 . |
4. | 4 104 410 . |
5. | 5 105 515 . |
|------------------------|
6. | 6 106 621 . |
7. | 7 107 728 . |
8. | 8 108 836 . |
9. | 9 109 945 . |
10. | 10 110 1055 . |
|------------------------|
11. | 11 111 1166 1065 |
12. | 12 112 1278 1075 |
13. | 13 113 1391 1085 |
14. | 14 114 1505 1095 |
15. | 15 115 1620 1105 |
|------------------------|
16. | 16 116 1736 1115 |
17. | 17 117 1853 1125 |
18. | 18 118 1971 1135 |
19. | 19 119 2090 1145 |
20. | 20 120 2210 1155 |
+------------------------+
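As an aside for readers working in pandas rather than Stata, the same past-365-day sum can be sketched with a time-based rolling window, which needs neither gap filling nor an explicit loop (illustrative data and column names, not from the original question):

import pandas as pd

# Toy unbalanced daily panel: id, time, x
df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2],
    'time': pd.to_datetime(['2000-01-01', '2000-06-01', '2001-03-01',
                            '2000-02-15', '2000-04-10']),
    'x':    [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Rolling 365-day sum of x within each id (the window endpoints may need
# adjusting to match the Stata definition exactly)
df = df.sort_values(['id', 'time'])
df['ssx'] = (df.set_index('time')
               .groupby('id')['x']
               .rolling('365D')
               .sum()
               .values)
print(df)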

Why are the Sums of Squares different between Stata anova and SAS glm?

anova y group hop
Number of obs = 206 R-squared = 0.0331
Root MSE = 20.0345 Adj R-squared = 0.0139
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 2761.85468 4 690.463671 1.72 0.1469
|
group | 42.2798948 1 42.2798948 0.11 0.7459
hop | 2633.73186 3 877.910619 2.19 0.0907
|
Residual | 80677.5664 201 401.380927
-----------+----------------------------------------------------
Total | 83439.4211 205 407.021566
proc glm data=ccc;
class group hop;
model y=group hop;
run;
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 4 2761.79407 690.44852 1.72 0.1469
Error 201 80677.50607 401.38063
Corrected Total 205 83439.30014
R-Square Coeff Var Root MSE hbalcv27 Mean
0.033099 129.8628 20.03449 15.42743
Source DF Type I SS Mean Square F Value Pr > F
group 1 128.138891 128.138891 0.32 0.5727
HOP 3 2633.655176 877.885059 2.19 0.0907
Source DF Type III SS Mean Square F Value Pr > F
group 1 42.289824 42.289824 0.11 0.7458
HOP 3 2633.655176 877.885059 2.19 0.0907
Perhaps the storage precision is not the same in the SAS and Stata data sets. The computations could also be done in different precisions. I don't know about SAS, but according to this blog by Bill Gould:
Stata does all calculations in double (and sometimes quad) precision.
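A small numeric illustration of how storage precision alone can move a sum of squares around the seventh significant digit (simulated data, not the data from the question):

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(100, 20, size=206)

# Total sum of squares from double-precision vs single-precision storage
ss64 = np.sum((y - y.mean()) ** 2)
y32 = y.astype(np.float32)
ss32 = np.sum((y32 - y32.mean()) ** 2, dtype=np.float64)

print(ss64, ss32)   # typically agree to only about 7 significant digits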