I have a dataset such as below:
import math
import pandas as pd

dic = {"ID": [1, 2, 3, 4, 5, 6],
       "Size": ["3-4mm", "12mm", math.nan, "1 mm", "1mm, 2mm, 3mm", "13*18mm"]}
dt = pd.DataFrame(dic)
So, the dataset is:
ID Size
1 3-4mm
2 12mm
3 NaN
4 1 mm
5 1mm, 2mm, 3mm
6 13*18mm
In the column Size, I should have only 3 valid patterns; anything except these 3 is invalid. These 3 patterns are as below:
3-4mm (int-intmm)
NaN
4mm (intmm)
I am wondering how I can write a function that returns the IDs of the rows with an invalid Size pattern.
So, in my example:
ID
4
5
6
The reason is that their Size is not in a valid format.
I have no preference for the solution, but I guess the easiest solution comes from regex.
Using #CodeManiac's pattern, you can pass it to Series.str.contains() and set the na parameter to True, since the missing value is an actual NaN:
dt.loc[~dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True), 'ID']
3 4
4 5
5 6
Details:
executing: dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$')
0 True
1 True
2 NaN
3 False
4 False
5 False
passing na=True fills the NaN result with True:
dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True)
0 True
1 True
2 True
3 False
4 False
5 False
Then use the invert operator ~ to turn True into False and vice versa, since we want the rows that failed the match, and select the ID column with dt.loc[].
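Putting the pieces together, a minimal end-to-end sketch (rebuilding the sample frame from the question):

```python
import math
import pandas as pd

dt = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                   "Size": ["3-4mm", "12mm", math.nan, "1 mm", "1mm, 2mm, 3mm", "13*18mm"]})

# na=True marks the NaN row as valid up front, since str.contains
# would otherwise return NaN for it
valid = dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True)
invalid_ids = dt.loc[~valid, 'ID']
print(invalid_ids.tolist())  # [4, 5, 6]
```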
The function that returns the 'ID's of rows with an invalid value in the 'Size' column:
import re  # standard Python regular expressions module

def get_invalid(dt):
    # str() turns NaN into the string 'nan', so it is listed as a valid pattern;
    # the non-capturing group anchors the whole alternation, not just the outer branches
    return dt[dt['Size'].apply(lambda r: re.match(r'^(?:\d+-\d+mm|nan|\d+mm)$', str(r)) is None)]['ID']
Output:
3 4
4 5
5 6
Name: ID, dtype: int64
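A note on precedence: in a regex, | binds looser than the anchors, so ^a|b|c$ means (^a)|(b)|(c$), not ^(a|b|c)$. A quick sketch of the difference, using a made-up invalid value:

```python
import re

loose = r'^\d+-\d+mm|nan|\d+mm$'      # anchors apply only to the outer alternatives
tight = r'^(?:\d+-\d+mm|nan|\d+mm)$'  # anchors apply to every alternative

# 'nandium' is clearly invalid, but the loose pattern accepts it because
# the bare 'nan' alternative matches at the start of the string
print(bool(re.match(loose, 'nandium')))  # True
print(bool(re.match(tight, 'nandium')))  # False
```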
def laser_callback(self, laserMsg):
    cloud = self.laser_projector.projectLaser(laserMsg)
    gen = pc2.read_points(cloud, skip_nans=True, field_names=('x', 'y', 'z'))
    self.xyz_generator = gen
    print(gen)
I'm trying to convert the laser data into pointcloud2 data, and then display them using matplotlib.pyplot. I tried traversing individual points in the generator but it takes a long time. Instead I'd like to convert them into a numpy array and then plot it. How do I go about doing that?
Take a look at some of these other posts which seem to answer the basic question of "convert a generator to an array":
How do I build a numpy array from a generator?
How to construct an np.array with fromiter
How to fill a 2D Python numpy array with values from a generator?
numpy fromiter with generator of list
Without knowing exactly what your generator is returning, the best I can do is provide a somewhat generic (but not particularly efficient) example:
#!/usr/bin/env python
import numpy as np

# Sample generator of (x, y, z) tuples
def my_generator():
    for i in range(10):
        yield (i, i*2, i*2 + 1)

def gen_to_numpy(gen):
    return np.array([x for x in gen])
gen = my_generator()
array = gen_to_numpy(gen)
print(type(array))
print(array)
Output:
<class 'numpy.ndarray'>
[[ 0 0 1]
[ 1 2 3]
[ 2 4 5]
[ 3 6 7]
[ 4 8 9]
[ 5 10 11]
[ 6 12 13]
[ 7 14 15]
[ 8 16 17]
[ 9 18 19]]
Again though, I cannot comment on the efficiency of this. You mentioned that it takes a long time to plot by reading points directly from the generator, but converting to a Numpy array will still require going through the whole generator to get the data. It would probably be much more efficient if the laser to pointcloud implementation you are using could provide the data directly as an array, but that is a question for the ROS Answers forum (I notice you already asked this there).
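If the generator yields fixed-size numeric tuples, as pc2.read_points does here, np.fromiter can fill the array directly instead of building an intermediate Python list. A sketch, with a stand-in generator since I don't have the ROS data:

```python
import numpy as np

def fake_points():
    # stand-in for pc2.read_points(...): yields (x, y, z) float tuples
    for i in range(5):
        yield (float(i), float(i) * 2.0, float(i) * 3.0)

# flatten the tuples into a scalar stream, then reshape into N x 3;
# np.fromiter allocates the array directly rather than list-then-copy
flat = np.fromiter((c for p in fake_points() for c in p), dtype=np.float64)
pts = flat.reshape(-1, 3)
print(pts.shape)  # (5, 3)
```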
My original data looks like this:
SUBBASIN HRU HRU_SLP OV_N
1 1 0.016155144 0.15
1 2 0.015563287 0.14
2 1 0.010589782 0.15
2 2 0.011574839 0.14
3 1 0.013865396 0.15
3 2 0.01744597 0.15
3 3 0.018983217 0.14
3 4 0.013890315 0.05
3 5 0.011792533 0.05
I need to modify value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
    hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
    hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
    hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
    hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code results in an infinite value of OV_N for all subbasins, so I assume the loop is applying the multiplication many times over, but I can't find the mistake, and this code was working before with different numbers of subbasins.
It is generally better not to loop over and index into rows of a pandas DataFrame. Transforming the DataFrame with column operations is the more idiomatic pandas approach. A pandas DataFrame can be thought of as a zipped combination of pandas Series: each column is its own pandas Series, all sharing the same index. Operations can be applied to one or more pandas Series to create a new Series that shares the same index. Operations can also combine a Series with a one-dimensional numpy array to create a new Series. It is helpful to understand pandas indexing; however, this answer will just use sequential integer indexing.
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from the hru.csv as in the original question. Here we initialize it with the data given in the question.
import numpy as np
import pandas as pd
hru = pd.DataFrame({
    'SUBBASIN': [1, 1, 2, 2, 3, 3, 3, 3, 3],
    'HRU': [1, 2, 1, 2, 1, 2, 3, 4, 5],
    'HRU_SLP': [0.016155144, 0.015563287, 0.010589782, 0.011574839, 0.013865396,
                0.01744597, 0.018983217, 0.013890315, 0.011792533],
    'OV_N': [0.15, 0.14, 0.15, 0.14, 0.15, 0.15, 0.14, 0.05, 0.05]})
Create one separate pandas Series that gathers and stores all the values from the various DataFrames, i.e. df21, df23, df56, and df58, into one place. This will be used to look up values by index. Let’s call it subbasin_multiplier_ds. Let’s respectively assume values of 21, 23, 56, and 58 were read from the txt file. Do replace these with the real values read in from the txt file.
subbasin_multiplier_ds = pd.Series([21] * 96)
subbasin_multiplier_ds[[80, 74, 75, 66, 55, 53, 57, 63, 61, 41, 38, 27, 26, 45, 40,
                        34, 35, 31, 33, 21, 20, 17, 18, 19, 23, 14, 13, 8, 7, 11, 6, 4, 3, 5, 12]] = 23
subbasin_multiplier_ds[[85, 58, 78, 54, 59, 51, 52, 30, 28, 16, 15, 77, 79, 71, 70,
                        86, 73, 68, 69, 56, 67, 62, 82, 87, 83, 91, 89, 90, 43, 36, 39, 47, 32, 49, 42, 48, 50,
                        49, 29, 22, 24, 25, 9, 10]] = 56
subbasin_multiplier_ds[[92, 88, 95, 94, 93]] = 58
Replace OV_N in hru DataFrame based on columns in the DataFrame and a lookup in subbasin_multiplier_ds by index.
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
.values converts the looked-up Series to a plain numpy array, discarding its index. Without it, pandas would align the two Series by index label rather than by position, which is not what we want here. If you want to experiment with removing .values, give it a try and see what happens.
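With the sample data above (every SUBBASIN here falls in the default group, multiplier 21), the lookup can be sketched end to end on a trimmed-down frame:

```python
import pandas as pd

hru = pd.DataFrame({
    'SUBBASIN': [1, 1, 2, 2, 3],
    'OV_N': [0.15, 0.14, 0.15, 0.14, 0.15]})

# one multiplier per subbasin number; index position = subbasin id
subbasin_multiplier_ds = pd.Series([21] * 96)
subbasin_multiplier_ds[[92, 88, 95, 94, 93]] = 58

# look up each row's multiplier by its SUBBASIN, drop the index with .values
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
print([round(v, 2) for v in hru['OV_N']])  # [3.3, 3.08, 3.3, 3.08, 3.3]
```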
I was thinking about a code that I wrote a few years ago in Python, at some point it had to get just some elements, by index, of a list of lists.
I remember I did something like this:
def getRows(m, row_indices):
    tmp = []
    for i in row_indices:
        tmp.append(m[i])
    return tmp
Now that I've learnt a little bit more since then, I'd use a list comprehension like this:
[m[i] for i in row_indices]
But I'm still wondering if there's an even more pythonic way to do it. Any ideas?
I would also like to know about alternatives with NumPy or any other array libraries.
It's worth looking at NumPy for its slicing syntax. Scroll down in the linked page until you get to "Indexing, Slicing and Iterating".
It's the clean and obvious way. So I'd say it doesn't get more Pythonic than that.
As Curt said, NumPy seems to be a good tool for this. Here's an example:
import numpy as np

a = np.arange(16).reshape((4, 4))
b = a[:, [1, 2]]   # columns 1 and 2
c = a[[1, 2], :]   # rows 1 and 2
print(a)
print(b)
print(c)
gives
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[ 1 2]
[ 5 6]
[ 9 10]
[13 14]]
[[ 4 5 6 7]
[ 8 9 10 11]]
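Note that combining both fancy indexes in one subscript, a[[1, 2], [1, 2]], picks the individual elements (1, 1) and (2, 2) rather than a 2x2 block; np.ix_ builds the open mesh that selects the row/column cross product instead. A quick sketch:

```python
import numpy as np

a = np.arange(16).reshape((4, 4))

# paired fancy indexes select elements (1,1) and (2,2)
print(a[[1, 2], [1, 2]])         # [ 5 10]

# np.ix_ selects the full rows-by-columns block instead
print(a[np.ix_([1, 2], [1, 2])])
# [[ 5  6]
#  [ 9 10]]
```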