Iterating over selection with query of an HDFStore - python-2.7

I have a very large table in an HDFStore of which I would like to select a subset using a query and then iterate over the subset chunk by chunk. I would like the query to take place before the selection is broken into chunks, so that all of the chunks are the same size.
The documentation here seems to indicate that this is the default behavior, but it is not entirely clear. However, it seems to me that the chunking actually takes place before the query, as shown in this example:
In [1]: pd.__version__
Out[1]: '0.13.0-299-gc9013b8'
In [2]: df = pd.DataFrame({'number': np.arange(1,11)})
In [3]: df
Out[3]:
   number
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10

[10 rows x 1 columns]
In [4]: with pd.get_store('test.h5') as store:
   ....:     store.append('df', df, data_columns=['number'])
   ....:
In [5]: evens = [2, 4, 6, 8, 10]
In [6]: with pd.get_store('test.h5') as store:
   ....:     for chunk in store.select('df', 'number=evens', chunksize=5):
   ....:         print len(chunk)
   ....:
2
3
I would expect only a single chunk of size 5 if the querying were happening before the result is broken into chunks, but this example gives two chunks of lengths 2 and 3.
Is this the intended behavior, and if so, is there an efficient workaround that yields chunks of the same size without reading the whole table into memory?

I think when I wrote that, the intent was for chunksize to apply to the results of the query; the behavior changed while it was being implemented. The chunksize determines the sections to which the query is applied, and you then iterate over those. The problem is that you don't know a priori how many rows each section will return.
However, there IS a way to do this. Here is the sketch: use select_as_coordinates to actually execute your query, which returns an Int64Index of the row numbers (the coordinates); then apply an iterator to that and select based on those rows.
Something like this (this makes a nice recipe; I will include it in the docs, I think):
In [15]: def chunks(l, n):
   ....:     return [l[i:i+n] for i in xrange(0, len(l), n)]
   ....:

In [16]: with pd.get_store('test.h5') as store:
   ....:     coordinates = store.select_as_coordinates('df', 'number=evens')
   ....:     for c in chunks(coordinates, 2):
   ....:         print store.select('df', where=c)
   ....:
   number
1       2
3       4

[2 rows x 1 columns]
   number
5       6
7       8

[2 rows x 1 columns]
   number
9      10

[1 rows x 1 columns]
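
To make the recipe reusable, here is a minimal sketch that wraps it in a generator function (the name select_in_chunks is my own, not a pandas API; it assumes the same pandas 0.13-era get_store/select_as_coordinates interface used above):

import pandas as pd

def select_in_chunks(path, key, where, chunksize):
    # Run the query once via select_as_coordinates, then yield
    # fixed-size slices of the matching row coordinates.
    # Note: the store stays open while the generator is being consumed.
    with pd.get_store(path) as store:
        coords = store.select_as_coordinates(key, where)
        for i in xrange(0, len(coords), chunksize):
            yield store.select(key, where=coords[i:i + chunksize])

for chunk in select_in_chunks('test.h5', 'df', 'number=evens', chunksize=2):
    print len(chunk)  # every chunk but possibly the last has exactly chunksize rows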


How to identify an invalid pattern using regex?

I have a dataset such as below:
import math
import pandas as pd

dic = {"ID": [1, 2, 3, 4, 5, 6],
       "Size": ["3-4mm", "12mm", math.nan, "1 mm", "1mm, 2mm, 3mm", "13*18mm"]}
dt = pd.DataFrame(dic)
so, the dataset is:
ID  Size
1   3-4mm
2   12mm
3   NaN
4   1 mm
5   1mm, 2mm, 3mm
6   13*18mm
In the column Size, I should have only 3 valid patterns, and anything other than these 3 is invalid. These 3 patterns are:
3-4mm (int-intmm)
NaN
4mm (intmm)
How can I write a function that identifies the IDs of the rows with an invalid Size pattern?
So, in my example:
ID
4
5
6
The reason is that their sizes are not in a valid format.
I have no preference for the solution, but I guess the easiest one uses a regex.
Using @CodeManiac's pattern, you can pass it to Series.str.contains() and set the na parameter to True, since row 3 contains an actual NaN:
dt.loc[~dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True), 'ID']
3    4
4    5
5    6
Details:
executing: dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$')
0     True
1     True
2      NaN
3    False
4    False
5    False
pass na=True to fill NaN as True:
dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True)
0     True
1     True
2     True
3    False
4    False
5    False
Then use the invert operator ~ to flip True and False, since we want the rows that do not match, and select the ID column via dt.loc[].
Here is a function that returns the IDs of the rows with an invalid value in the Size column:
import re  # standard Python regular expressions module

def get_invalid(dt):
    # Group the alternatives so ^ and $ anchor the whole pattern;
    # str(nan) == 'nan' handles the missing values.
    return dt[dt['Size'].apply(
        lambda r: re.match(r'^(?:\d+-\d+mm|nan|\d+mm)$', str(r)) is None)]['ID']
Output:
3    4
4    5
5    6
Name: ID, dtype: int64
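As a quick sanity check, calling the function on the example frame built above should reproduce the invalid IDs:

print(get_invalid(dt))
# 3    4
# 4    5
# 5    6
# Name: ID, dtype: int64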

How to read generator data as a numpy array

def laser_callback(self, laserMsg):
    cloud = self.laser_projector.projectLaser(laserMsg)
    gen = pc2.read_points(cloud, skip_nans=True, field_names=('x', 'y', 'z'))
    self.xyz_generator = gen
    print(gen)
I'm trying to convert the laser data into pointcloud2 data, and then display them using matplotlib.pyplot. I tried traversing individual points in the generator but it takes a long time. Instead I'd like to convert them into a numpy array and then plot it. How do I go about doing that?
Take a look at some of these other posts which seem to answer the basic question of "convert a generator to an array":
How do I build a numpy array from a generator?
How to construct an np.array with fromiter
How to fill a 2D Python numpy array with values from a generator?
numpy fromiter with generator of list
Without knowing exactly what your generator is returning, the best I can do is provide a somewhat generic (but not particularly efficient) example:
#!/usr/bin/env python
import numpy as np

# Sample generator of (x, y, z) tuples
def my_generator():
    for i in range(10):
        yield (i, i*2, i*2 + 1)

def gen_to_numpy(gen):
    # Materialize the generator into a list, then build the array.
    return np.array([x for x in gen])

gen = my_generator()
array = gen_to_numpy(gen)
print(type(array))
print(array)
Output:
<class 'numpy.ndarray'>
[[ 0 0 1]
[ 1 2 3]
[ 2 4 5]
[ 3 6 7]
[ 4 8 9]
[ 5 10 11]
[ 6 12 13]
[ 7 14 15]
[ 8 16 17]
[ 9 18 19]]
Again, though, I cannot comment on the efficiency of this. You mentioned that it takes a long time to plot by reading points directly from the generator, but converting to a NumPy array still requires consuming the whole generator to get the data. It would probably be much more efficient if the laser-to-pointcloud implementation you are using could provide the data directly as an array, but that is a question for the ROS Answers forum (I notice you already asked it there).
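For what it's worth, if the generator really does yield flat (x, y, z) tuples, np.fromiter with a structured dtype avoids building the intermediate list; a sketch under that assumption, reusing the toy my_generator from above:

import numpy as np

# One record per (x, y, z) point.
point_dtype = np.dtype([('x', np.float32), ('y', np.float32), ('z', np.float32)])

def gen_to_numpy_fromiter(gen):
    # fromiter consumes the generator directly, with no intermediate list.
    records = np.fromiter(gen, dtype=point_dtype)
    # View the homogeneous records as a plain (N, 3) float array for plotting.
    return records.view(np.float32).reshape(-1, 3)

points = gen_to_numpy_fromiter(my_generator())
print(points.shape)  # (10, 3)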

Set values of a column to a specific list of numbers

I want to create a small test data set with some specific values:
x
-
1
3
4
5
7
I can do this the hard way:
. set obs 5
. generate x = .
. replace x = 1 in 1
. replace x = 3 in 2
. replace x = 4 in 3
. replace x = 5 in 4
. replace x = 7 in 5
I can also use the data editor, but I'd like to create a .do file which can recreate this data set.
So how do I set the values of a variable from a list of numbers?
This can be done using a (to my mind) poorly documented feature of input:
clear
input x
1
3
4
5
7
end
I say poorly documented because the title of the input help page is
[D] Input -- Enter data from keyboard
which is clearly only a subset of what this command can do.
Here is another way
clear
mat x = (1,3,4,5,7)
set obs `=colsof(x)'
generate x = x[1, _n]
and another
clear
mata : x = (1,3,4,5,7)'
getmata x=x

Looping through file with .ix and .isin

My original data looks like this:
SUBBASIN  HRU  HRU_SLP      OV_N
1         1    0.016155144  0.15
1         2    0.015563287  0.14
2         1    0.010589782  0.15
2         2    0.011574839  0.14
3         1    0.013865396  0.15
3         2    0.01744597   0.15
3         3    0.018983217  0.14
3         4    0.013890315  0.05
3         5    0.011792533  0.05
I need to modify the value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
    hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
    hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
    hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
    hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code produces infinite values of OV_N for all subbasins, so I assume the loop applies the multiplication repeatedly, but I can't find the mistake, and this code worked before with different numbers of subbasins.
It is generally better not to loop and index over rows in a pandas DataFrame; transforming the DataFrame with column operations is the more idiomatic pandas approach. A DataFrame can be thought of as a zipped combination of pandas Series: each column is its own Series, all sharing the same index. Operations can be applied to one or more Series to create a new Series that shares the same index, and a Series can also be combined with a one-dimensional numpy array to create a new Series. It is helpful to understand pandas indexing, but this answer will just use sequential integer indexing.
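As a tiny illustration of that column-wise style (a toy example, separate from the hru data):

import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([10, 20, 30])
# Operations align on the shared index and return a new Series.
print(a + b)      # values 11, 22, 33
print(a * 2 + 1)  # values 3, 5, 7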
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from hru.csv as in the original question; here we initialize it directly with the data given in the question.
import numpy as np
import pandas as pd

hru = pd.DataFrame({
    'SUBBASIN': [1, 1, 2, 2, 3, 3, 3, 3, 3],
    'HRU': [1, 2, 1, 2, 1, 2, 3, 4, 5],
    'HRU_SLP': [0.016155144, 0.015563287, 0.010589782, 0.011574839,
                0.013865396, 0.01744597, 0.018983217, 0.013890315, 0.011792533],
    'OV_N': [0.15, 0.14, 0.15, 0.14, 0.15, 0.15, 0.14, 0.05, 0.05]})
Create one separate pandas Series that gathers all the values from the various DataFrames (df21, df23, df56, and df58) into one place, to be used for looking up values by index; call it subbasin_multiplier_ds. Let's assume the values 21, 23, 56, and 58 were read from the txt file for df21, df23, df56, and df58 respectively; replace these with the real values read from the txt file.
subbasin_multiplier_ds = pd.Series([21]*96)
subbasin_multiplier_ds[[80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,
    34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]] = 23
subbasin_multiplier_ds[[85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,
    86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,
    49,29,22,24,25,9,10]] = 56
subbasin_multiplier_ds[[92,88,95,94,93]] = 58
Replace OV_N in the hru DataFrame using column operations and an index lookup into subbasin_multiplier_ds:
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
The .values call above converts the looked-up multipliers to a numpy array so they combine with hru['OV_N'] positionally; without it, pandas would try to align the result's index with hru's row index and give unexpected results. If you want to experiment, try removing .values to see what happens.

Pythonic way to get some rows of a matrix

I was thinking about some code I wrote a few years ago in Python; at some point it had to get just some elements, by index, from a list of lists.
I remember I did something like this:
def getRows(m, row_indices):
    tmp = []
    for i in row_indices:
        tmp.append(m[i])
    return tmp
Now that I've learnt a little bit more since then, I'd use a list comprehension like this:
[m[i] for i in row_indices]
But I'm still wondering if there's an even more pythonic way to do it. Any ideas?
I would also like to know about alternatives using numpy or any other array libraries.
It's worth looking at NumPy for its slicing syntax. Scroll down in the linked page until you get to "Indexing, Slicing and Iterating".
It's the clean and obvious way, so I'd say it doesn't get more Pythonic than that.
As Curt said, it seems that Numpy is a good tool for this. Here's an example,
import numpy as np

a = np.arange(16).reshape((4, 4))
b = a[:, [1, 2]]   # select columns 1 and 2
c = a[[1, 2], :]   # select rows 1 and 2
print a
print b
print c
gives
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[ 1 2]
[ 5 6]
[ 9 10]
[13 14]]
[[ 4 5 6 7]
[ 8 9 10 11]]
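
For plain lists of lists (no numpy), operator.itemgetter is another common idiom, though the list comprehension above is arguably more readable; a small sketch:

from operator import itemgetter

m = [[0, 1], [2, 3], [4, 5], [6, 7]]
row_indices = [1, 3]

# itemgetter builds a callable that fetches several indices in one call.
# Note: with a single index it returns the row itself, not a 1-tuple.
rows = list(itemgetter(*row_indices)(m))
print rows  # [[2, 3], [6, 7]]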