How to identify invalid patterns using regex? - regex

I have a dataset like the one below:
import math

import pandas as pd

dic = {"ID": [1, 2, 3, 4, 5, 6],
       "Size": ["3-4mm", "12mm", math.nan, "1 mm", "1mm, 2mm, 3mm", "13*18mm"]}
dt = pd.DataFrame(dic)
So the dataset is:
ID Size
1 3-4mm
2 12mm
3 NaN
4 1 mm
5 1mm, 2mm, 3mm
6 13*18mm
In the column Size, I should have only 3 valid patterns, and anything except these 3 is invalid. These 3 patterns are listed below:
3-4mm (int-intmm)
NaN
4mm (intmm)
I am wondering how I can write a function that returns the IDs of the rows whose Size has an invalid pattern.
So, in my example:
ID
4
5
6
The reason is that their Size is not in a valid format.
I have no preference for the solution, but I guess the easiest solution uses regex.

Using @CodeManiac's pattern, you can pass it to Series.str.contains() and set the na parameter to True, since the missing value is an actual NaN:
dt.loc[~dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True), 'ID']
3 4
4 5
5 6
Details:
Executing: dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$')
0 True
1 True
2 NaN
3 False
4 False
5 False
Pass na=True to fill NaN as True:
dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True)
0 True
1 True
2 True
3 False
4 False
5 False
Then use the ~ operator to invert the mask (True becomes False and vice versa), since we want the rows that do not match, and select the ID column with dt.loc[].
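Putting the pieces together, a minimal sketch of the whole answer (the helper name invalid_ids is just for illustration):
import math

import pandas as pd

def invalid_ids(df):
    # NaN counts as valid (na=True); anything not matching int-intmm or intmm is invalid
    mask = ~df['Size'].str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True)
    return df.loc[mask, 'ID']

dic = {"ID": [1, 2, 3, 4, 5, 6],
       "Size": ["3-4mm", "12mm", math.nan, "1 mm", "1mm, 2mm, 3mm", "13*18mm"]}
print(invalid_ids(pd.DataFrame(dic)).tolist())   # [4, 5, 6]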

A function that returns the 'ID's of rows with an invalid value in the 'Size' column:
import re  # standard Python regular expressions module

def get_invalid(dt):
    # NaN becomes the string 'nan' via str(r); group the alternatives so the
    # ^...$ anchors apply to each of them
    pattern = r'^(?:\d+-\d+mm|nan|\d+mm)$'
    return dt[dt['Size'].apply(lambda r: re.match(pattern, str(r)) is None)]['ID']
Output:
3 4
4 5
5 6
Name: ID, dtype: int64

Related

Logic and spaces in Regex

I have the following Regex:
^(?P<port>[\w\-\>]+)(?P<virtual_route> +None|None)\s+(?P<port_state>\w)\s+(?P<link_state>[\w]+)\s+(?P<auto_neg>[\w]+)\s+(?P<cfg_speed>[\d]+) (?P<actual_speed>[\d\s]+)(?P<cfg_duplex>[\w]+)
That I use on the following configuration:
MLAG-ISC>None E A OFF 40000 40000 FULL FULL NONE 1 Q+CR4_1m
2 None E NP OFF 10000 FULL NONE
3 None E NP OFF 10000 FULL NONE
4 None E NP OFF 10000 FULL NONE
MLAG-ISC>None E A OFF 40000 40000 FULL FULL NONE 1 Q+CR4_1m
6 None E NP OFF 10000 FULL NONE
Which gives me this result (https://regex101.com):
This is the result I want; however, I would also like to capture the next FULL or empty, NONE or empty, 1 or empty, and Q+CR4_1m or NONE. I just can't seem to make it work because of the spaces in rows 2, 3, 4 and 6.
Note that I am using Python 3.
If FULL, NONE and 1 are the only possible (though optional) values in those columns:
^(?P<port>[\w\-\>]+)(?P<virtual_route> +None|None)\s+(?P<port_state>\w)\s+(?P<link_state>[\w]+)\s+(?P<auto_neg>[\w]+)\s+(?P<cfg_speed>[\d]+) (?P<actual_speed>\d*)\s+(?P<cfg_duplex>[\w]+)\s+((?:FULL)?)\s+((?:NONE)?)\s+(1?)\s+([\w+]+)
otherwise:
^(?P<port>[\w\-\>]+)(?P<virtual_route> +None|None)\s+(?P<port_state>\w)\s+(?P<link_state>[\w]+)\s+(?P<auto_neg>[\w]+)\s+(?P<cfg_speed>[\d]+) (?P<actual_speed>\d*)\s+(?P<cfg_duplex>[\w]+)[^\S\r\n]+(\w*)[^\S\r\n]+(\w*)[^\S\r\n]+(\d*)[^\S\r\n]+([\w+]+)
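For completeness, a sketch of applying the first pattern from Python 3 with re.finditer and re.MULTILINE (so ^ anchors every line); the file name switch_output.txt is hypothetical, and whether each optional group fills in depends on the exact column spacing of the real device output:
import re

# First suggestion: FULL / NONE / 1 as the only possible optional values.
pattern = re.compile(
    r'^(?P<port>[\w\-\>]+)(?P<virtual_route> +None|None)\s+(?P<port_state>\w)'
    r'\s+(?P<link_state>[\w]+)\s+(?P<auto_neg>[\w]+)\s+(?P<cfg_speed>[\d]+) '
    r'(?P<actual_speed>\d*)\s+(?P<cfg_duplex>[\w]+)\s+((?:FULL)?)\s+((?:NONE)?)'
    r'\s+(1?)\s+([\w+]+)',
    re.MULTILINE)

with open('switch_output.txt') as fh:   # hypothetical file holding the raw device output
    config_text = fh.read()

for match in pattern.finditer(config_text):
    print(match.group('port'), match.groupdict())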

Converting datetime to pandas index

My pandas dataframe is structured as follows:
date tag
0 2015-07-30 19:19:35-04:00 E7RG6
1 2016-01-27 08:20:01-05:00 ER57G
2 2015-11-15 23:32:16-05:00 EQW7G
3 2016-07-12 00:01:11-04:00 ERV7G
4 2016-02-14 00:35:21-05:00 EQW7G
5 2016-03-01 00:08:59-05:00 EQW7G
6 2015-06-19 07:15:06-04:00 ER57G
7 2016-09-08 18:17:53-04:00 ER5TT
8 2016-09-03 01:53:45-04:00 EQW7G
9 2015-11-30 09:31:02-05:00 ER57G
10 2016-03-03 22:28:26-05:00 ES5TG
11 2016-02-11 10:39:24-05:00 E5P7G
12 2015-03-16 07:18:47-04:00 ER57G
...
[11015 rows x 2 columns]
date datetime64[ns, America/New_York]
tag object
dtype: object
I'm attempting to set the column 'date' as the index:
df = df.set_index(pd.DatetimeIndex(df['date']))
which yields the following error (using pandas 0.19)
File "pandas/tslib.pyx", line 3753, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:64516)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:38:12'), try using the 'ambiguous' argument
I've consulted this, but I'm still unable to work through this error. For example,
df = df.set_index(pd.DatetimeIndex(df['date']), ambiguous='infer')
yields:
File "pandas/tslib.pyx", line 3703, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:63553)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2015-11-01 01:38:12 as there are no repeated times
Any advice on how to convert the datetime column to the index would be greatly appreciated.
If the dtype of the column is already datetime, then you can just call set_index without needing to construct a DatetimeIndex from the column:
df.set_index(df['date'], inplace=True)
should just work; the dtype for the index is sniffed out, so there is no need to construct an index object from the Series/column here.
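A minimal sketch of that, assuming df holds the tz-aware 'date' column shown above; the commented lines are a hedged fallback for the case where the column is still plain strings:
import pandas as pd

# 'date' is already datetime64[ns, America/New_York], so just set it as the index
df = df.set_index(df['date'])

# If the column were still strings, one option is to parse it first; utc=True
# handles the mixed -04:00/-05:00 offsets, then convert back to the local zone:
# df['date'] = pd.to_datetime(df['date'], utc=True).dt.tz_convert('America/New_York')
# df = df.set_index('date')

print(df.index.dtype)   # datetime64[ns, America/New_York]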

Filtering on annotations with max date in Django

I have 3 models in a Django project:
from datetime import date

from django.db import models

class Hardware(models.Model):
    inventory_number = models.IntegerField(unique=True)

class Subdivision(models.Model):
    name = models.CharField(max_length=50)

class Relocation(models.Model):
    hardware = models.ForeignKey('Hardware')
    subdivision = models.ForeignKey('Subdivision')
    # pass the callable date.today (no parentheses) so the default is evaluated
    # on each save rather than once at import time
    relocation_date = models.DateField(verbose_name='Relocation Date', default=date.today)
Table 'Hardware_Relocation' with example values:
id hardware subdivision relocation_date
1 1 1 01.01.2009
2 1 2 01.01.2010
3 1 1 01.01.2011
4 1 3 01.01.2012
5 1 3 01.01.2013
6 1 3 01.01.2014
7 1 3 01.01.2015 # Now hardware 1 is located in subdivision 3 because its relocation_date is the max
I would like to write a filter that finds which hardware is in which subdivision today.
I'm trying to write a filter:
subdivision = Subdivision.objects.get(pk=1)
hardware_list = Hardware.objects.annotate(relocation__relocation_date=Max('relocation__relocation_date')).filter(relocation__subdivision = subdivision)
Now hardware_list contains hardware 1, but that is wrong (because hardware 1 is now in subdivision 3).
hardware_list should be empty in this example.
The following code also gives the wrong result (hardware_list contains hardware 1 for subdivision 1).
limit_date = datetime.datetime.now()
q1 = Hardware.objects.filter(relocation__subdivision=subdivision, relocation__relocation_date__lte=limit_date)
q2 = q1.exclude(~Q(relocation__relocation_date__gt=F('relocation__relocation_date')), ~Q(relocation__subdivision=subdivision))
hardware_list = q2.distinct()
Maybe it would be better to use raw SQL?
This might work...
from django.db.models import F, Q

(Hardware.objects
 .filter(relocation__subdivision=target_subdivision,
         relocation__relocation_date__lte=limit_date)
 .exclude(~Q(relocation__subdivision=target_subdivision),
          relocation__relocation_date__gt=F('relocation__relocation_date'))
 .distinct())
The idea is: give me all hardware that has been relocated to the target subdivision before the limit date and has NOT been relocated to another subdivision after that.
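A different sketch of the same idea for newer Django versions (1.11+, where Subquery and OuterRef are available): annotate each hardware with the subdivision of its latest relocation and filter on that. The name latest_subdivision is illustrative:
from django.db.models import OuterRef, Subquery

# subdivision id of each hardware's most recent relocation
latest_subdivision = (Relocation.objects
                      .filter(hardware=OuterRef('pk'))
                      .order_by('-relocation_date')
                      .values('subdivision')[:1])

# hardware whose latest relocation points at the requested subdivision
hardware_list = (Hardware.objects
                 .annotate(current_subdivision=Subquery(latest_subdivision))
                 .filter(current_subdivision=subdivision.pk))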

Pandas Interpolate Returning NaNs

I'm trying to do basic interpolation of position data at 60 Hz (~16 ms) intervals. When I try to use pandas 0.14 interpolation over the DataFrame, it tells me I only have NaNs in my data set (not true). When I try to run it over individual Series pulled from the DataFrame, it returns the same Series without the NaNs filled in. I've tried setting the indices to integers, using different methods, and fiddling with the axis and limit parameters of the interpolation function, with no luck. What am I doing wrong?
df.head(5):
x y ms
0 20.5815 14.1821 333.3333
1 NaN NaN 350
2 20.6112 14.2013 366.6667
3 NaN NaN 383.3333
4 20.5349 14.2232 400
df = df.set_index(df.ms) # set indices to milliseconds
When I try running
df.interpolate(method='values')
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-462-cb0f1f01eb84> in <module>()
12
13
---> 14 df.interpolate(method='values')
15
16
/Users/jsb/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
2511
2512 if self._data.get_dtype_counts().get('object') == len(self.T):
-> 2513 raise TypeError("Cannot interpolate with all NaNs.")
2514
2515 # create/use the index
TypeError: Cannot interpolate with all NaNs.
I've also tried running it over individual Series, which only returns what I put in:
temp = df.x
temp.interpolate(method='values')
333.333333 20.5815
350.000000 NaN
366.666667 20.6112
383.333333 NaN
400.000000 20.5349
Name: x, dtype: object
EDIT:
Props to Jeff for inspiring the solution.
Adding:
df[['x','y','ms']] = df[['x','y','ms']].astype(float)
before
df.interpolate(method='values')
interpolation did the trick.
Building on your edit, with props to Jeff for inspiring the solution:
Adding:
df = df.astype(float)
before
df.interpolate(method='values')
interpolation did the trick for me as well. Unless you're sub-selecting a column set, you don't need to specify the columns.
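A quick way to confirm that dtype is the culprit is to inspect df.dtypes before interpolating; a short sketch (the to_numeric line is a hedged alternative for frames that contain stray non-numeric strings):
import pandas as pd

print(df.dtypes)   # columns reported as 'object' are what trigger the error above

df = df.astype(float)                             # coerce everything to float
# df = df.apply(pd.to_numeric, errors='coerce')   # alternative: bad values become NaN
df = df.interpolate(method='values')              # interpolate against the index values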
I'm not able to reproduce the error (see below for a copy/paste-able example); can you make sure the data you show is actually representative of your data?
In [137]: from StringIO import StringIO
In [138]: df = pd.read_csv(StringIO(""" x y ms
...: 0 20.5815 14.1821 333.3333
...: 1 NaN NaN 350
...: 2 20.6112 14.2013 366.6667
...: 3 NaN NaN 383.3333
...: 4 20.5349 14.2232 400"""), delim_whitespace=True)
In [140]: df = df.set_index(df.ms)
In [142]: df.interpolate(method='values')
Out[142]:
x y ms
ms
333.3333 20.58150 14.18210 333.3333
350.0000 20.59635 14.19170 350.0000
366.6667 20.61120 14.20130 366.6667
383.3333 20.57305 14.21225 383.3333
400.0000 20.53490 14.22320 400.0000
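For reference, a Python 3 version of the same copy/paste-able example (a sketch; only the StringIO import and the print call differ):
from io import StringIO

import pandas as pd

data = """ x y ms
0 20.5815 14.1821 333.3333
1 NaN NaN 350
2 20.6112 14.2013 366.6667
3 NaN NaN 383.3333
4 20.5349 14.2232 400"""

df = pd.read_csv(StringIO(data), delim_whitespace=True)
df = df.set_index(df.ms)
print(df.interpolate(method='values'))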

Iterating over selection with query of an HDFStore

I have a very large table in an HDFStore of which I would like to select a subset using a query and then iterate over the subset chunk by chunk. I would like the query to take place before the selection is broken into chunks, so that all of the chunks are the same size.
The documentation here seems to indicate that this is the default behavior, but it is not so clear. However, it seems to me that the chunking actually takes place before the query, as shown in this example:
In [1]: pd.__version__
Out[1]: '0.13.0-299-gc9013b8'
In [2]: df = pd.DataFrame({'number': np.arange(1,11)})
In [3]: df
Out[3]:
number
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
[10 rows x 1 columns]
In [4]: with pd.get_store('test.h5') as store:
   ...:     store.append('df', df, data_columns=['number'])
In [5]: evens = [2, 4, 6, 8, 10]
In [6]: with pd.get_store('test.h5') as store:
   ...:     for chunk in store.select('df', 'number=evens', chunksize=5):
   ...:         print len(chunk)
2
3
I would expect only a single chunk of size 5 if the querying were happening before the result is broken into chunks, but this example gives two chunks of lengths 2 and 3.
Is this the intended behavior and if so is there an efficient workaround to give chunks of the same size without reading the table into memory?
I think when I wrote that, the intent was to apply the chunksize to the results of the query. I think it was changed as it was being implemented. The chunksize determines the sections to which the query is applied, and then you iterate over those. The problem is that you don't know a priori how many rows you are going to get.
However, there IS a way to do this. Here is the sketch: use select_as_coordinates to actually execute your query; this returns an Int64Index of the row numbers (the coordinates). Then iterate over that, selecting based on those rows.
Something like this (this makes a nice recipe; I will include it in the docs, I think):
In [15]: def chunks(l, n):
   ....:     return [l[i:i+n] for i in xrange(0, len(l), n)]
   ....:
In [16]: with pd.get_store('test.h5') as store:
   ....:     coordinates = store.select_as_coordinates('df','number=evens')
   ....:     for c in chunks(coordinates, 2):
   ....:         print store.select('df',where=c)
   ....:
number
1 2
3 4
[2 rows x 1 columns]
number
5 6
7 8
[2 rows x 1 columns]
number
9 10
[1 rows x 1 columns]
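For readers on Python 3 and newer pandas (where pd.get_store is no longer available), a hedged sketch of the same coordinates recipe using pd.HDFStore directly; it mirrors the test.h5 example above and needs the tables (PyTables) package installed:
import numpy as np
import pandas as pd

def chunks(index, n):
    # split an index (or list) into consecutive pieces of length n
    return [index[i:i + n] for i in range(0, len(index), n)]

df = pd.DataFrame({'number': np.arange(1, 11)})
evens = [2, 4, 6, 8, 10]

with pd.HDFStore('test.h5', mode='w') as store:
    store.append('df', df, data_columns=['number'])
    # run the query once to get the matching row coordinates, then chunk those
    coordinates = store.select_as_coordinates('df', 'number=evens')
    for c in chunks(coordinates, 2):
        print(store.select('df', where=c))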