How to convert Object to Float in Python - python-2.7

I have the following dataframe:
Daily_KWH_System year month day hour minute second
0 4136.900384 2016 9 7 0 0 0
1 3061.657187 2016 9 8 0 0 0
2 4099.614033 2016 9 9 0 0 0
3 3922.490275 2016 9 10 0 0 0
4 3957.128982 2016 9 11 0 0 0
5 4177.014316 2016 9 12 0 0 0
6 3077.103445 2016 9 13 0 0 0
7 4123.103795 2016 9 14 0 0 0
.. ... ... ... ... ... ... ...
551 NaN 2016 11 23 0 0 0
552 NaN 2016 11 24 0 0 0
553 NaN 2016 11 25 0 0 0
.. ... ... ... ... ... ... ...
579 NaN 2016 11 27 0 0 0
580 NaN 2016 11 28 0 0 0
The variable types are as follows:
print(df.dtypes)
Daily_KWH_System object
year int32
month int32
day int32
hour int32
minute int32
second int32
I need to convert "Daily_KWH_System" to float so that I can use it in a linear regression model.
I tried the code below, which worked fine.
df['Daily_KWH_System'] = pd.to_numeric(df['Daily_KWH_System'], errors='coerce')
Then I replaced the NaNs with a blank space, to use in my model, with the following code:
df = df.replace(np.nan,' ', regex=True)
But the variable "Daily_KWH_System" gets converted back to object as soon as I replace the NaNs.
Please let me know how to go about it.
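A note on what is happening here: replacing NaN with the string ' ' mixes floats and strings in one column, and pandas can only store such a mix as object dtype, which is why the column reverts. A minimal sketch of keeping the column numeric instead, assuming 0 is an acceptable placeholder for the regression (dropping the rows is the alternative):
import pandas as pd
df['Daily_KWH_System'] = pd.to_numeric(df['Daily_KWH_System'], errors='coerce')
df['Daily_KWH_System'] = df['Daily_KWH_System'].fillna(0)  # or: df = df.dropna(subset=['Daily_KWH_System'])
print(df['Daily_KWH_System'].dtype)  # remains float64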

Related

Incorrect reading of a variable from a txt (Fortran)

I'm trying to read this txt:
Fecha dia mes ano hora min
03/06/2016 00:00 3 6 2016 0 0
03/06/2016 00:05 3 6 2016 0 5
03/06/2016 00:10 3 6 2016 0 10
03/06/2016 00:15 3 6 2016 0 15
03/06/2016 00:20 3 6 2016 0 20
03/06/2016 00:25 3 6 2016 0 25
03/06/2016 00:30 3 6 2016 0 30
03/06/2016 00:35 3 6 2016 0 35
03/06/2016 00:40 3 6 2016 0 40
03/06/2016 00:45 3 6 2016 0 45
03/06/2016 00:50 3 6 2016 0 50
03/06/2016 00:55 3 6 2016 0 55
03/06/2016 01:00 3 6 2016 1 0
With the following code:
program fecha
  implicit none
  integer, dimension(13) :: dia, mes, ano, hora, minuto
  character*50 :: formato = '(11x,5x,1x,i1,1x,i1,1x,i4,1x,i1,1x,i2)'
  open (unit = 10, file = 'datos.txt')
  read (10, *)
  read (unit = 10, fmt = formato) dia, mes, ano, hora, minuto
  write (*, *) dia
  close (10)
end program
Why does this code read 'dia' in this way:
3 6 2016 0 0 3 6 2016 0 5 3 6 2016
(I can see what it is reading, but not why.)
Your read statement lists whole arrays, so all 13 elements of dia are filled first (with format reversion pulling in a new record each time the format is exhausted) before mes is ever touched; that is why dia collects values from several lines. You need to skip two lines at the beginning as well as read the values line by line, element by element.
The following example is a slight modification of your program which runs smoothly.
program fecha
  implicit none
  integer :: i, iounit
  integer, parameter :: n = 13
  integer, dimension(n) :: dia, mes, ano, hora, minuto
  open (newunit = iounit, file = 'datos.txt')
  read (iounit, *)
  read (iounit, *)
  do i = 1, n
    read (unit = iounit, fmt = '(16x, i5, i4, i7, 2i5)') dia(i), mes(i), ano(i), hora(i), minuto(i)
    print *, dia(i), mes(i), ano(i), hora(i), minuto(i)
  end do
  close (iounit)
end program
My output is
$ gfortran -g3 -Wall -fcheck=all a.f90 && ./a.out
3 6 2016 0 0
3 6 2016 0 5
3 6 2016 0 10
3 6 2016 0 15
3 6 2016 0 20
3 6 2016 0 25
3 6 2016 0 30
3 6 2016 0 35
3 6 2016 0 40
3 6 2016 0 45
3 6 2016 0 50
3 6 2016 0 55
3 6 2016 1 0

In SAS: How to consolidate non zero values in rows by group

I have a dataset consisting of variables ObservationNumber, MeasurementNumber, SubjectID, and many dummy variables.
I would like to consolidate all non-zero values into one row per SubjectID and MeasurementNumber.
Have:
ObsNum MeasurementNum SubjectID Dummy0 Dummy1 ... Dummy999
----------------------------------------------------...---------------
01 1 1 0 1 ... 0
02 2 1 0 1 ... 0
03 3 1 0 1 ... 0
04 4 1 0 0 ... 0
05 5 1 - - ... -
06 6 1 0 0 ... 0
07 1 2 1 0 ... 0
08 2 2 0 0 ... 0
09 3 2 0 1 ... 0
10 4 2 1 0 ... 0
11 4 2 0 1 ... 0
12 5 2 0 0 ... 1
13 6 2 0 0 ... 0
14 6 2 0 0 ... 1
15 6 2 0 0 ... 0
16 6 2 0 0 ... 0
17 6 2 0 1 ... 0
18 6 2 0 0 ... 0
19 6 2 0 0 ... 0
20 6 2 0 0 ... 0
21 6 2 1 0 ... 0
22 1 3 1 0 ... 0
23 2 3 0 1 ... 0
24 3 3 0 0 ... 1
25 4 3 - - ... -
26 5 3 0 0 ... 0
27 6 3 0 0 ... 0
28 1 4 - - ... -
29 2 4 0 0 ... 0
30 3 4 0 1 ... 0
31 4 4 1 0 ... 0
32 4 4 0 1 ... 0
33 4 4 0 0 ... 1
34 5 4 0 0 ... 1
35 6 4 0 1 ... 0
36 6 4 0 0 ... 1
Want:
MeasurementNum SubjectID Dummy0 Dummy1 ... Dummy999
----------------------------------------------------...---------------
1 1 0 1 ... 0
2 1 0 1 ... 0
3 1 0 1 ... 0
4 1 0 0 ... 0
5 1 - - ... -
6 1 0 0 ... 0
1 2 1 0 ... 0
2 2 0 0 ... 0
3 2 0 1 ... 0
4 2 1 1 ... 0
5 2 0 0 ... 1
6 2 1 1 ... 1
1 3 1 0 ... 0
2 3 0 1 ... 0
3 3 0 0 ... 1
4 3 - - ... -
5 3 0 0 ... 0
6 3 0 0 ... 0
1 4 - - ... -
2 4 0 0 ... 0
3 4 0 1 ... 0
4 4 1 1 ... 1
5 4 0 0 ... 1
6 4 0 1 ... 1
Each SubjectID has six measurements in which a set of dummy variables is measured with outcome 0, 1, or missing. If a missing value occurs, all dummy variables for the respective observation are missing, and only one observation will be present in the dataset for that MeasurementNumber.
I have tried to use the UPDATE statement, but it does not seem able to deal with '0' and '-'.
Is there a direct way of condensing all dummy variables in this dataset for each SubjectID, grouped by MeasurementNumber?
Use Proc MEANS with BY and OUTPUT statements.
data have;
  /* Simulate example data: 1000 subjects, 6 measurements each,
     1 to 4 repeated rows per measurement, 999 random 0/1 flags;
     subjects 1 to 6 each get one all-missing measurement. */
  rownum = 0;
  do rowid = 1 to 1000;
    subjectid + 1;
    do measurenum = 1 to 6;
      do repeat = 1 to ceil(4 * ranuni(123));
        array flags flag1-flag999;
        do _n_ = 1 to dim(flags);
          flags(_n_) = ranuni(123) < 0.10;
          if subjectid < 7 and measurenum = subjectid then flags(_n_) = .;
        end;
        rownum + 1;
        output;
      end;
    end;
  end;
  keep rownum measurenum subjectid flag:;
run;

/* MAX within each BY group returns 1 if any row has a 1, 0 if all
   rows are 0, and missing if every row in the group is missing. */
proc means noprint data=have;
  by subjectid measurenum;
  var flag:;
  output max=;
run;
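For comparison, the same consolidation can be sketched in pandas: grouping by subject and measurement and taking the column-wise maximum reproduces the MAX= logic, since any 1 in a group survives, an all-zero group stays 0, and an all-missing group stays missing. A minimal sketch with hypothetical column names:
import numpy as np
import pandas as pd
# Toy frame mirroring the SAS layout (column names are hypothetical)
have = pd.DataFrame({
    'SubjectID':      [2, 2, 2, 2],
    'MeasurementNum': [4, 4, 6, 6],
    'Dummy0':         [1, 0, 0, np.nan],
    'Dummy1':         [0, 1, np.nan, np.nan],
})
# Column-wise max per (SubjectID, MeasurementNum) group
want = have.groupby(['SubjectID', 'MeasurementNum'], as_index=False).max()
print(want)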

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within-group "scenario", and the third an outcome. I would like to calculate the within-group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between the different groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc., but this has me stumped. Any help greatly appreciated.
You can get the difference from scenario 0 within groups using groupby and transform, like:
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315
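For the percentage differences mentioned at the end of the question, the same groupby/transform pattern can be reused; a minimal sketch, assuming (as the answer above does) that scenario 0 is the first row within each group:
df['TN_load_pct'] = df['TN_load'].groupby(df['FieldId']) \
                                 .transform(lambda x: (x - x.iloc[0]) / x.iloc[0] * 100)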

time series sliding window with occurrence counts

I am trying to get a count between two timestamped values:
for example:
time letter
1 A
4 B
5 C
9 C
18 B
30 A
30 B
I am dividing the time into windows of size (1 + 30) / 30,
and then I want to know how many A's, B's, and C's occur in each time window of size 1.
timeseries A B C
1 1 0 0
2 0 0 0
...
30 1 1 0
This should give me a table of 30 rows and 3 columns (A, B, C) of occurrence counts.
The problem is that the breakdown takes too long, because the code iterates through the whole master table for every window to slice the data, even though the data is already sorted:
master = mytable
minimum = master.timestamp.min()
maximum = master.timestamp.max()
window = (minimum + maximum) / maximum
wstart = minimum
wend = minimum + window
concurrent_tasks = []
while wstart <= maximum:
    As = 0
    Bs = 0
    Cs = 0
    for d, row in master.iterrows():
        ttime = row.timestamp
        if (ttime >= wstart) & (ttime < wend):
            # print(row.channel)
            if row.channel == 'A':
                As = As + 1
            elif row.channel == 'B':
                Bs = Bs + 1
            elif row.channel == 'C':
                Cs = Cs + 1
    concurrent_tasks.append([m_id, As, Bs, Cs])  # m_id is defined elsewhere in the original code
    wstart = wstart + window
    wend = wend + window
Could you help me make this perform better? I want to use a map function and prevent Python from looping through the whole table every time. This is part of a big dataset and it is taking days to finish.
Thank you.
There is a faster approach - pd.get_dummies():
In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 0 0
30 0 1 0
If you want to "compress" (group) it by time:
In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
or using sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
                       index=df['time'].unique(),
                       columns=df['letter'].unique(),
                       default_fill_value=0)
Result:
In [143]: r
Out[143]:
A B C
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
If we want to list all times from 1 to 30:
In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
A B C
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
20 0 0 0
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
or using a pandas approach:
In [159]: pd.get_dummies(df.set_index('time')['letter']) \
...: .groupby(level=0) \
...: .sum() \
...: .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
...:
Out[159]:
A B C
time
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
... .. .. ..
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
[30 rows x 3 columns]
UPDATE:
Timing:
In [163]: df = pd.concat([df] * 10**4, ignore_index=True)
In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop
In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop
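If the windows are wider than one time unit, the same get_dummies idea still works after binning the timestamps; a minimal sketch, assuming integer timestamps and a hypothetical window size of 5:
window = 5  # hypothetical window size
bucket = (df['time'] // window) * window  # left edge of each window
counts = pd.get_dummies(df['letter']).groupby(bucket).sum()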

Fortran 77 Real to Int rounding Direction?

Porting a bit of Fortran 77 code, it appears that REAL variables are being assigned to INTEGER variables. I do not have a way to run this code, and I wonder what the behavior is in the following case:
REAL*4 A
A = 123.25
B = INT(A)
Does B = 123 or B = 124?
How about at the 0.5 mark?
REAL*4 C
C = 123.5
D = INT(C)
Does D = 123 or D = 123.5?
INT always truncates toward zero:
From the GCC documentation:
These functions return an INTEGER variable or array under the following rules:
(A) If A is of type INTEGER, INT(A) = A.
(B) If A is of type REAL and |A| < 1, INT(A) equals 0. If |A| ≥ 1, then INT(A) equals the largest integer whose magnitude does not exceed the magnitude of A and whose sign is the same as the sign of A.
(C) If A is of type COMPLEX, rule B is applied to the real part of A.
If you want to round to the nearest integer, use NINT.
So, in your case B and D are always 123 (if they are declared as integer).
Here is an example of code and its output; it extends the previous answer:
PROGRAM test
  implicit none
  integer :: i = 0
  real :: dummy = 0.
  do i = 0, 30
    dummy = -1.0 + (i * 0.1)
    write(*,*) i, dummy, int(dummy), nint(dummy), floor(dummy)
  enddo
  stop
end PROGRAM test
This is the output:
$ ./test
0 -1.000000 -1 -1 -1
1 -0.9000000 0 -1 -1
2 -0.8000000 0 -1 -1
3 -0.7000000 0 -1 -1
4 -0.6000000 0 -1 -1
5 -0.5000000 0 -1 -1
6 -0.4000000 0 0 -1
7 -0.3000000 0 0 -1
8 -0.2000000 0 0 -1
9 -9.9999964E-02 0 0 -1
10 0.0000000E+00 0 0 0
11 0.1000000 0 0 0
12 0.2000000 0 0 0
13 0.3000001 0 0 0
14 0.4000000 0 0 0
15 0.5000000 0 1 0
16 0.6000000 0 1 0
17 0.7000000 0 1 0
18 0.8000001 0 1 0
19 0.9000000 0 1 0
20 1.000000 1 1 1
21 1.100000 1 1 1
22 1.200000 1 1 1
23 1.300000 1 1 1
24 1.400000 1 1 1
25 1.500000 1 2 1
26 1.600000 1 2 1
27 1.700000 1 2 1
28 1.800000 1 2 1
29 1.900000 1 2 1
30 2.000000 2 2 2
I hope this better clarifies the behavior.
EDIT: Compiled with ifort 2013 on a Xeon.
Just to complete the existing answers, I want to add an example of how commercial rounding (round half away from zero) can be realized without using NINT:
L = INT(F + 0.5)
where L is an INTEGER and F is a positive REAL number. I've found this in FORTRAN 77 code samples from the last century.
Extending this to negative REAL numbers gives
L = SIGN(1.0,F)*INT(ABS(F) + 0.5)
and, going back to the 1980s, the minimal code example looks like this:
      PROGRAM ROUNDTEST
      DO 12345 I = 0, 30
        F = -1.0 + I * 0.1
        J = INT(F)
        K = NINT(F)
        L = SIGN(1.0,F)*INT(ABS(F) + 0.5)
        PRINT *, I, F, J, K, L
12345 CONTINUE
      END
which creates the output
$ ./ROUNDTEST
0 -1.00000000 -1 -1 -1
1 -0.899999976 0 -1 -1
2 -0.800000012 0 -1 -1
3 -0.699999988 0 -1 -1
4 -0.600000024 0 -1 -1
5 -0.500000000 0 -1 -1
6 -0.399999976 0 0 0
7 -0.300000012 0 0 0
8 -0.199999988 0 0 0
9 -9.99999642E-02 0 0 0
10 0.00000000 0 0 0
11 0.100000024 0 0 0
12 0.200000048 0 0 0
13 0.300000072 0 0 0
14 0.399999976 0 0 0
15 0.500000000 0 1 1
16 0.600000024 0 1 1
17 0.700000048 0 1 1
18 0.800000072 0 1 1
19 0.899999976 0 1 1
20 1.00000000 1 1 1
21 1.10000014 1 1 1
22 1.20000005 1 1 1
23 1.29999995 1 1 1
24 1.40000010 1 1 1
25 1.50000000 1 2 2
26 1.60000014 1 2 2
27 1.70000005 1 2 2
28 1.79999995 1 2 2
29 1.90000010 1 2 2
30 2.00000000 2 2 2
ROUNDTEST is compiled and linked with gfortran version 7.4.0:
$ gfortran.exe ROUNDTEST.FOR -o ROUNDTEST
Hope this helps you if you have to deal with old FORTRAN code.
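As a side note for anyone porting such code to Python, here is a minimal sketch of the same rounding modes, assuming Python 3 (whose built-in round() rounds halves to even and therefore does not match NINT):
import math

def fortran_int(f):
    # INT: truncate toward zero
    return math.trunc(f)

def fortran_nint(f):
    # NINT: round half away from zero
    return int(math.copysign(math.floor(abs(f) + 0.5), f))

for i in range(31):
    f = -1.0 + i * 0.1
    print(i, f, fortran_int(f), fortran_nint(f), math.floor(f))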