Why does math.ceil() give different answers? - python-2.7

I am new to Python. I am trying to solve a matrix problem in which the exit condition of a loop depends on the matrix size: if the matrix has 3 or 4 rows and columns, I want the loop to run 2 times, and if it has 5 or 6, I want it to run 3 times. math.ceil() does not give the result I expect:
>>> import math
>>> math.ceil(1.5)
2.0
>>> i=3
>>> math.ceil(i/2)
1.0

This is because in Python 2, 3 / 2 is integer division and evaluates to 1, not 1.5, so math.ceil() only ever sees 1. Put from __future__ import division at the top of your file and the result will be what you expect.
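As a rough sketch of how the loop count from the question could be computed without relying on the division behaviour of a particular Python version (the function names loop_count and loop_count_int are made up for illustration):

```python
import math

def loop_count(n):
    """Number of passes for an n x n matrix, i.e. n / 2 rounded up."""
    # Dividing by 2.0 forces float division on Python 2 and Python 3 alike.
    return int(math.ceil(n / 2.0))

def loop_count_int(n):
    """Integer-only alternative: floor-dividing the negated value rounds up."""
    return -(-n // 2)

for n in (3, 4, 5, 6):
    print(loop_count(n))
    print(loop_count_int(n))
```

Both functions give 2 passes for sizes 3 and 4, and 3 passes for sizes 5 and 6. The integer-only version avoids floats entirely, which may be preferable for very large values.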

Try this first:
import math
i = 3/2
print i
j = float(3)/2
print j
print math.ceil(j)
You should see:
1
1.5
2.0
Python 2 integer division takes the floor of the result, so 3/2 gives 1 before math.ceil() is ever called.
Reference:
http://docs.python.org/2/reference/expressions.html

Related

Looping through file with .ix and .isin

My original data looks like this:
SUBBASIN HRU HRU_SLP OV_N
1 1 0.016155144 0.15
1 2 0.015563287 0.14
2 1 0.010589782 0.15
2 2 0.011574839 0.14
3 1 0.013865396 0.15
3 2 0.01744597 0.15
3 3 0.018983217 0.14
3 4 0.013890315 0.05
3 5 0.011792533 0.05
I need to modify value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
    hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
    hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
    hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
    hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code results in an infinite value of OV_N for all subbasins, so I assume that looping through a file goes multiple times, but I can't find a mistake and this code was working before with different numbers of subbasins.
It is generally better not to loop over and index individual rows in a pandas DataFrame. Transforming the DataFrame with column operations is the more idiomatic pandas approach. A pandas DataFrame can be thought of as a zipped combination of pandas Series: each column is its own Series, and all of them share the same index. Operations can be applied to one or more Series to create a new Series that shares the same index. Operations can also combine a Series with a one-dimensional numpy array to create a new Series. It helps to understand pandas indexing, but this answer just uses sequential integer indexing.
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from the hru.csv as in the original question. Here we initialize it with the data given in the question.
import numpy as np
import pandas as pd
hru = pd.DataFrame({
'SUBBASIN':[1,1,2,2,3,3,3,3,3],
'HRU':[1,2,1,2,1,2,3,4,5],
'HRU_SLP':[0.016155144,0.015563287,0.010589782,0.011574839,0.013865396,0.01744597,0.018983217,0.013890315,0.011792533],
'OV_N':[0.15,0.14,0.15,0.14,0.15,0.15,0.14,0.05,0.05]})
Create one separate pandas Series that gathers the values from the various DataFrames (df21, df23, df56, and df58) into one place. It will be used to look up the multiplier for a subbasin by index. Let’s call it subbasin_multiplier_ds. Let’s assume the values 21, 23, 56, and 58, respectively, were read from the txt file; replace these with the real values read in from the txt file. Note that the subbasin numbers are passed as a list, so that each one is treated as a separate index label.
subbasin_multiplier_ds = pd.Series([21]*96)
subbasin_multiplier_ds.loc[[80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,
    34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]] = 23
subbasin_multiplier_ds.loc[[85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,
    86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,
    49,29,22,24,25,9,10]] = 56
subbasin_multiplier_ds.loc[[92,88,95,94,93]] = 58
Replace OV_N in hru DataFrame based on columns in the DataFrame and a lookup in subbasin_multiplier_ds by index.
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
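.values above turns the looked-up multipliers into a plain numpy array, so the multiplication is done position by position and the expected results are achieved. Without .values, pandas would align the two Series on their index labels (row numbers on one side, subbasin numbers on the other), which is not what is wanted here. Try removing .values to see what happens.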
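The same lookup could also be written with a dictionary and Series.map. This is only a sketch of an alternative, not the method from the answer above. The multiplier values 0.21 and 0.23 and the names groups, multiplier and factor are placeholders; the real multipliers would come from df21.value[12] and so on.

```python
import pandas as pd

hru = pd.DataFrame({
    'SUBBASIN': [1, 1, 2, 2, 3, 3, 3, 3, 3],
    'OV_N': [0.15, 0.14, 0.15, 0.14, 0.15, 0.15, 0.14, 0.05, 0.05],
})

# Placeholder multipliers, each paired with the subbasins it applies to.
groups = [
    (0.21, [1, 2]),
    (0.23, [3]),
]

# Flatten the groups into one {subbasin: multiplier} dictionary.
multiplier = {sub: value for value, subs in groups for sub in subs}

# map() looks up every row's SUBBASIN and keeps the original row index,
# so the result lines up with hru['OV_N'] without needing .values.
factor = 1 + hru['SUBBASIN'].map(multiplier)
hru['OV_N'] = hru['OV_N'] * factor
print(hru)
```

A subbasin that is missing from the dictionary maps to NaN, which makes forgotten subbasins easy to spot in the result.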

How does Python calculate the % operator? Can someone please explain 3%5?

How does Python calculate the % operator? Can someone please explain why 3%5 comes out as 3 in Python? The answer for 5%3 is also showing 3. I use Python 2.7.
The Python % operator isn't percentage, it's modulo. That means the remainder part of a division. Remember when you were a kid and your math problems would be like 11 divided by 3 = 3 R 2 (remainder 2)? That's what % does. 5 % 3 = 2. Likewise 3 % 5 = 3, because 5 goes into 3 zero times with 3 left over.
If you want to calculate percentage, do that yourself like A * 100.0 / B.
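A small sketch that puts the two operations side by side. The helper name percentage is made up for illustration; divmod is a Python built-in that returns the quotient and the remainder together.

```python
def percentage(a, b):
    """Return a as a percentage of b; 100.0 forces float division."""
    return a * 100.0 / b

# divmod returns (quotient, remainder): 11 divided by 3 is 3 remainder 2.
quotient, remainder = divmod(11, 3)
print(quotient)
print(remainder)

print(5 % 3)             # 2: 3 goes into 5 once, 2 left over
print(3 % 5)             # 3: 5 goes into 3 zero times, 3 left over
print(percentage(3, 5))  # 60.0
```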

Moving GroupBys

My data set looks like this:
1
2
3
4
5
...
I have an intermediate step which should do the following:
1
1,2
1,2,3
1,2,3,4
1,2,3,4,5
....
And finally calculate its mean
1
1.5
2
2.5
3
...
Questions
a) Is there a way to implement this in Python / PySpark?
b) Is there a method/API which does this out of the box?
c) I googled around for this kind of solution; the closest I got was moving mean / rolling average / moving group. Is there a term for this operation?
In pandas, this is called an expanding mean:
import pandas as pd
s = pd.Series(range(1,6))
# Older pandas versions spelled this pd.expanding_mean(s); that function was later removed.
s.expanding().mean()
Out[128]:
0 1.0
1 1.5
2 2.0
3 2.5
4 3.0
dtype: float64
I'm not sure how you'd do this in Spark. That said, I'm also not sure this is a parallelizable task: since each step relies on the previous one, I don't see how you'd break it up into independent pieces.
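For illustration, the same calculation in plain Python, carrying a running total from one step to the next. This is a minimal sketch (the function name expanding_mean is chosen here, it is not a library call), and it makes the step-to-step dependency mentioned above visible.

```python
def expanding_mean(values):
    """Mean of all values seen so far, for each position in the input."""
    total = 0.0
    means = []
    for count, value in enumerate(values, start=1):
        total += value               # depends on the total from the previous step
        means.append(total / count)
    return means

print(expanding_mean([1, 2, 3, 4, 5]))  # [1.0, 1.5, 2.0, 2.5, 3.0]
```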

Iterating over selection with query of an HDFStore

I have a very large table in an HDFStore of which I would like to select a subset using a query and then iterate over the subset chunk by chunk. I would like the query to take place before the selection is broken into chunks, so that all of the chunks are the same size.
The documentation here seems to indicate that this is the default behavior but is not so clear. However, it seems to me that the chunking is actually taking place before the query, as shown in this example:
In [1]: pd.__version__
Out[1]: '0.13.0-299-gc9013b8'
In [2]: df = pd.DataFrame({'number': np.arange(1,11)})
In [3]: df
Out[3]:
number
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
[10 rows x 1 columns]
In [4]: with pd.get_store('test.h5') as store:
   ...:     store.append('df', df, data_columns=['number'])
In [5]: evens = [2, 4, 6, 8, 10]
In [6]: with pd.get_store('test.h5') as store:
   ...:     for chunk in store.select('df', 'number=evens', chunksize=5):
   ...:         print len(chunk)
2
3
I would expect only a single chunk of size 5 if the querying were happening before the result is broken into chunks, but this example gives two chunks of lengths 2 and 3.
Is this the intended behavior and if so is there an efficient workaround to give chunks of the same size without reading the table into memory?
I think when I wrote that, the intent was for chunksize to apply to the results of the query. I think it was changed while it was being implemented. As it stands, the chunksize determines the sections of the table to which the query is applied, and you then iterate over those. The problem is that you don't know a priori how many rows you are going to get.
However, there is a way to do this. Here is a sketch. Use select_as_coordinates to actually execute your query; this returns an Int64Index of the row numbers (the coordinates). Then iterate over those coordinates in chunks and select the rows for each chunk.
Something like this (this makes a nice recipe, will include in the docs I think):
In [15]: def chunks(l, n):
   ....:     return [l[i:i+n] for i in xrange(0, len(l), n)]
   ....:
In [16]: with pd.get_store('test.h5') as store:
   ....:     coordinates = store.select_as_coordinates('df','number=evens')
   ....:     for c in chunks(coordinates, 2):
   ....:         print store.select('df',where=c)
   ....:
number
1 2
3 4
[2 rows x 1 columns]
number
5 6
7 8
[2 rows x 1 columns]
number
9 10
[1 rows x 1 columns]
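If the list of coordinates is itself large, the chunking helper could be written as a generator so that only one slice exists at a time. This is a hedged variation on the recipe above, not part of it; the name iter_chunks is made up, and a plain list stands in for the Int64Index that select_as_coordinates returns.

```python
def iter_chunks(seq, n):
    """Yield consecutive slices of seq, each holding at most n items."""
    for start in range(0, len(seq), n):
        yield seq[start:start + n]

# Row numbers of the even values in the example table above.
coordinates = [1, 3, 5, 7, 9]

for c in iter_chunks(coordinates, 2):
    print(c)
```

Every chunk has exactly n rows except possibly the last, which is what the question asked for.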

Computation of Kullback-Leibler (KL) distance between text-documents using numpy

My goal is to compute the KL distance between the following text documents:
1)The boy is having a lad relationship
2)The boy is having a boy relationship
3)It is a lovely day in NY
I first of all vectorised the documents in order to easily apply numpy
1)[1,1,1,1,1,1,1]
2)[1,2,1,1,1,2,1]
3)[1,1,1,1,1,1,1]
I then applied the following code for computing KL distance between the texts:
import numpy as np
import math
from math import log
v=[[1,1,1,1,1,1,1],[1,2,1,1,1,2,1],[1,1,1,1,1,1,1]]
c=v[0]
def kl(p, q):
    p = np.asarray(p, dtype=np.float)
    q = np.asarray(q, dtype=np.float)
    return np.sum(np.where(p != 0,(p-q) * np.log10(p / q), 0))
for x in v:
    KL=kl(x,c)
    print KL
Here is the result of the above code: [0.0, 0.602059991328, 0.0].
Texts 1 and 3 are completely different, but the distance between them is 0, while texts 1 and 2, which are highly related, have a distance of 0.602059991328. This isn't accurate.
Does anyone have an idea of what I'm doing wrong with regard to KL? Many thanks for your suggestions.
Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance - they are, according to the following documentation, the same) is designed to measure the difference between probability distributions. This means basically that what you pass to the function should be two array-likes, the elements of each of which sum to 1.
Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":
scipy.stats.entropy(pk, qk=None, base=None)
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html
From the docs:
If qk is not None, then compute a relative entropy (also known as
Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk *
log(pk / qk), axis=0).
A bonus of this function is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass, i.e. how they are constructed from data).
Hope this helps. At least a library provides it, so you don't have to code your own.
After a bit of googling to understand the KL concept, I think that your problem is due to the vectorization: you're comparing the number of appearances of different words. You should either link each column index to one word, or use a dictionary:
#  The boy is having a lad relationship It lovely day in NY
1)[1   1   1  1      1 1   1            0  0      0   0  0]
2)[1   2   1  1      1 0   1            0  0      0   0  0]
3)[0   0   1  0      1 0   0            1  1      1   1  1]
Then you can use your kl function.
To automatically vectorize to a dictionary, see How to count the frequency of the elements in a list? (collections.Counter is exactly what you need). Then you can loop over the union of the keys of the dictionaries to compute the KL distance.
A potential issue might be in your NumPy definition of KL. Read the Wikipedia page for the formula: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Note that you multiply (p-q) by the log result. According to the KL formula, the factor should be p only:
return np.sum(np.where(p != 0,(p) * np.log10(p / q), 0))
That may help...
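Putting the three suggestions together (a shared vocabulary, probability distributions that sum to 1, and p rather than p-q in the formula), a minimal standard-library sketch could look like the following. The add-one smoothing is an assumption that is not part of any answer above; it is only there so that a word missing from one document does not cause a division by zero. The function names are made up for illustration.

```python
import math
from collections import Counter

def word_distribution(text, vocabulary, smoothing=1.0):
    """Word frequencies over a shared vocabulary, normalised to sum to 1."""
    counts = Counter(text.lower().split())
    # Assumption: add-one smoothing keeps every probability above zero.
    total = sum(counts[word] + smoothing for word in vocabulary)
    return [(counts[word] + smoothing) / total for word in vocabulary]

def kl_divergence(p, q):
    """sum(p * log(p / q)) over all positions where p is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

docs = [
    "The boy is having a lad relationship",
    "The boy is having a boy relationship",
    "It is a lovely day in NY",
]

# One column per distinct word across all documents.
vocabulary = sorted(set(word for doc in docs for word in doc.lower().split()))
distributions = [word_distribution(doc, vocabulary) for doc in docs]

for dist in distributions:
    print(kl_divergence(distributions[0], dist))
```

With this setup the first document has a divergence of 0 from itself, a small divergence from the second document and a larger one from the third, which matches the intuition in the question. Keep in mind that KL divergence is not symmetric, so swapping the arguments generally gives a different number.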