Selecting elements in numpy array using regular expressions - regex

One may select elements in a numpy array as follows:
a = np.random.rand(100)
sel = a > 0.5  # select elements that are greater than 0.5
a[sel] = 0  # do something with the selection
b = np.array(list('abc abc abc'))
b[b == 'a'] = 'A'  # convert all the a's to A's
This property is used by the np.where function to retrieve indices:
indices = np.where(a > 0.9)
What I would like is to be able to use regular expressions in such element selection. For example, if I want to select the elements of b above that match the [Ab] regexp, I need to write the following code:
regexp = '[Ab]'
selection = np.array([bool(re.search(regexp, element)) for element in b])
This looks too verbose to me. Is there a shorter, more elegant way to do this?

There's some setup involved here, but unless numpy has some kind of direct support for regular expressions that I don't know about, this is the most "numpythonic" solution. It tries to make iteration over the array more efficient than standard Python iteration.
import numpy as np
import re

r = re.compile('[Ab]')
vmatch = np.vectorize(lambda x: bool(r.match(x)))  # elementwise regex test
A = np.array(list('abc abc abc'))
sel = vmatch(A)  # boolean mask over A
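For completeness, the resulting boolean mask plugs into the same selection patterns shown in the question (pattern and data as above; note that [Ab] matches only 'b' here, since the array contains no capital A):

```python
import re
import numpy as np

r = re.compile('[Ab]')
vmatch = np.vectorize(lambda x: bool(r.match(x)))  # elementwise regex test

A = np.array(list('abc abc abc'))
sel = vmatch(A)            # boolean mask, same shape as A
print(A[sel])              # → ['b' 'b' 'b']
print(np.where(sel)[0])    # → [1 5 9]
```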

Related

sympy: check if expression is a trigonometric function

I have a dict whose keys are tuples of a sympy expression and its number of arguments, and whose values are customized translations. I want to check whether the sympy expression is a trigonometric function, i.e. one of the functions listed here: https://omz-software.com/pythonista/sympy/modules/mpmath/functions/trigonometric.html
Do you know a nice command to check this? I could just build a list naming [sin, cos, tan] etc. and check whether the key is contained in it, but I'd be really happy if there were a nicer solution.
Can one generally classify sympy expressions and check which class of expression they belong to?
default_semantic_latex_table = {
    (sympy.functions.elementary.trigonometric.sin, 1): FormatTemplate(r"\sin#{$0}"),
    (sympy.functions.special.polynomials.jacobi, 4): FormatTemplate(r"\JacobipolyP{{$0}{$1}{$2}#{$3}}"),
    (sympy.functions.elementary.trigonometric.cos, 2): FormatTemplate(r"\cos#{$0,$2}")
}
for k, v in default_semantic_latex_table:
    # check if the key is an instance of a sympy trigonometric function;
    # if so and the number of args is not equal to one, throw a warning
If the if condition evaluates to True, I want to check the second element of the tuple/the number of args. How would I do that?
I expected k to be a tuple like (sympy.functions.special.polynomials.jacobi, 4) yet if I test it k turns out to be just sympy.functions.special.polynomials.jacobi. How can I get the second tuple element?
I'd be really glad if someone could help!
You can import the TrigonometricFunction base class from sympy and use the built-in issubclass function to perform the test.
Note that you have to use dictionary_name.items() in order to loop over the keys and values of a dictionary:
import sympy
from sympy.functions.elementary.trigonometric import TrigonometricFunction
import warnings

default_semantic_latex_table = {
    (sympy.functions.elementary.trigonometric.sin, 1): "a",
    (sympy.functions.elementary.trigonometric.cos, 2): "b"
}
for k, v in default_semantic_latex_table.items():
    t, n = k
    if issubclass(t, TrigonometricFunction) and (n != 1):
        warnings.warn("Your warning: function=%s with n=%s" % (t, n))
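As a quick sanity check, using the top-level names sympy.sin and sympy.jacobi (the same classes as the fully qualified paths above):

```python
import sympy
from sympy.functions.elementary.trigonometric import TrigonometricFunction

# sin is a trigonometric function; the Jacobi polynomial is not
print(issubclass(sympy.sin, TrigonometricFunction))     # → True
print(issubclass(sympy.jacobi, TrigonometricFunction))  # → False
```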

How to sympify initial conditions for ODE in sympy?

I am passing initial conditions as a string, to be used for solving an ODE in sympy.
It is a first order ODE, so let's take y(0):3 as the initial condition, for example. From the help:
ics is the set of initial/boundary conditions for the differential
equation. It should be given in the form of {f(x0): x1,
f(x).diff(x).subs(x, x2): x3}
I need to pass this to sympy.dsolve, but sympify(ic) gives an error for some reason.
What other tricks can I use to make this work? Here is a MWE. The first example shows that it works when the initial conditions are not given as a string (the normal mode of operation):
from sympy import *
x = Symbol('x')
y = Function('y')
ode = Eq(Derivative(y(x),x),1+2*x)
sol = dsolve(ode,y(x),ics={y(0):3})
which gives sol as Eq(y(x), x**2 + x + 3)
Now the case where ics is a string:
from sympy import *
ic = "y(0):3"
x = Symbol('x')
y = Function('y')
ode = Eq(Derivative(y(x),x),1+2*x)
sol = dsolve(ode,y(x),ics={ sympify(ic) })
gives
SympifyError: Sympify of expression 'could not parse 'y(0):3'' failed,
because of exception being raised: SyntaxError: invalid syntax
(<string>, line 1)
So, looking at the sympify documentation:
sympify(a, locals=None, convert_xor=True, strict=False, rational=False, evaluate=None)
I tried changing the different options shown above, but the syntax error still shows up.
I also tried
sol = dsolve(ode, y(x), ics={ eval(ic) })
but this gives a syntax error as well.
Is there a trick to use to convert this initial conditions string to something sympy is happy with?
Python 3.7 with sympy 1.5.
As a temporary workaround, I currently do this:
from sympy import *
ic = "y(0):3"
ic = ic.split(":")
x = Symbol('x')
y = Function('y')
ode = Eq(Derivative(y(x),x),1+2*x)
sol = dsolve(ode,y(x),ics= {S(ic[0]):S(ic[1])} )
This works. So the problem is with the colon: sympify (or S) does not seem to handle : on its own.
You can use sympify('{y(0):3}').
I don't know what your actual goal is, but I don't recommend parsing strings like this in general. The format for ICs is actually slightly awkward, so for a second order ODE it looks like:
ics = '{y(0):3, y(x).diff(x).subs(x, 0):1}'
If you're parsing a string anyway, you can come up with a better syntax than that, like
ics = "y(0)=3, y'(0)=1"
Also, you should use parse_expr rather than converting strings with sympify or S:
https://docs.sympy.org/latest/modules/parsing.html#sympy.parsing.sympy_parser.parse_expr
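Putting the sympify('{y(0):3}') suggestion together with the MWE from the question:

```python
from sympy import Symbol, Function, Eq, Derivative, dsolve, sympify

x = Symbol('x')
y = Function('y')
ode = Eq(Derivative(y(x), x), 1 + 2*x)

# include the braces so sympify sees a dict literal rather than a bare "key: value"
ics = sympify('{y(0): 3}')
sol = dsolve(ode, y(x), ics=ics)
print(sol)  # → Eq(y(x), x**2 + x + 3)
```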

Use regular expression to extract elements from a pandas data frame

From the following data frame:
import pandas as pd
d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as strings) into a pandas series. For that I am using pandas' .str.findall() method, as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output, i.e. the letter a, b or c in each row, comes wrapped in a single-element list, as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as string, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b)
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use extract with a capturing group:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
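If you want a Series rather than a one-column DataFrame, str.extract also takes expand=False; anchoring the pattern with ^ additionally restricts the match to the leading letter:

```python
import pandas as pd

d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)

# expand=False returns a Series; ^ anchors the match to the start of each string
s = df['col1'].str.extract(r'^([abc])', expand=False)
print(s.tolist())  # → ['a', 'b', 'c', 'a']
```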
To fix your original code, take the first element of each list with .str[0]:
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
Simply try str.split() like this: df["col1"].str.split("-", n=1, expand=True)[0]
import pandas as pd
d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())
print(df.head())
Output:
col1
0 a
1 b
2 c
3 a

Pandas apply function taking up to 10 min (numba does not help)

I have got a very simple function to apply to each row of my dataframe:
def distance_ot(fromwp, towp, pl, plee):
    if fromwp[0:3] == towp[0:3]:
        sxcord = pl.loc[fromwp, "XCORD"]
        sycord = pl.loc[fromwp, "YCORD"]
        excord = pl.loc[towp, "XCORD"]
        eycord = pl.loc[towp, "YCORD"]
        x = np.abs(excord - sxcord); y = np.abs(eycord - sycord)
        distance = x + y
        return distance
    else:
        x1 = np.abs(plee.loc[fromwp[0:3], "exitx"] - pl.loc[fromwp, "XCORD"])
        y1 = np.abs(plee.loc[fromwp[0:3], "exity"] - pl.loc[fromwp, "YCORD"])
        x2 = np.abs(plee.loc[fromwp[0:3], "exitx"] - plee.loc[towp[0:3], "entryx"])
        y2 = np.abs(plee.loc[fromwp[0:3], "exity"] - plee.loc[towp[0:3], "entryy"])
        x3 = np.abs(plee.loc[towp[0:3], "entryx"] - pl.loc[towp, "XCORD"])
        y3 = np.abs(plee.loc[towp[0:3], "entryy"] - pl.loc[towp, "YCORD"])
        distance = x1 + x2 + x3 + y1 + y2 + y3
        return distance
With this line it is called:
pot["traveldistance"]=pot.apply(lambda row: distance_ot(fromwp=row["from_wpadr"],towp=row["to_wpadr"],pl=pl,plee=plee),axis=1)
where fromwp and towp are both strings, and XCORD and YCORD are floats. I tried using numba but for some reason it does not improve performance here. Any suggestions?
Thanks to caiohamamura's hint, here is the solution:
def distance_ot(pl, plee):
    from_df = pl.loc[pot["from_wpadr"]]
    to_df = pl.loc[pot["to_wpadr"]]
    sxcord = from_df["XCORD"].values
    sycord = from_df["YCORD"].values
    excord = to_df["XCORD"].values
    eycord = to_df["YCORD"].values
    x = np.abs(excord - sxcord); y = np.abs(eycord - sycord)
    pot["distance1"] = x + y
    from_df2 = plee.loc[pot["from_wpadr"].str[0:3]]
    to_df2 = plee.loc[pot["to_wpadr"].str[0:3]]
    x1 = np.abs(from_df2["exitx"].values - from_df["XCORD"].values)
    y1 = np.abs(from_df2["exity"].values - from_df["YCORD"].values)
    x2 = np.abs(from_df2["exitx"].values - to_df2["entryx"].values)
    y2 = np.abs(from_df2["exity"].values - to_df2["entryy"].values)
    x3 = np.abs(to_df2["entryx"].values - to_df["XCORD"].values)
    y3 = np.abs(to_df2["entryy"].values - to_df["YCORD"].values)
    pot["distance2"] = x1 + x2 + x3 + y1 + y2 + y3

distance_ot(pl=pl, plee=plee)
pot.loc[pot.from_wpadr.str[0:3] == pot.to_wpadr.str[0:3], "traveldistance"] = pot["distance1"]
pot.loc[pot.from_wpadr.str[0:3] != pot.to_wpadr.str[0:3], "traveldistance"] = pot["distance2"]
Vectorize the distance_ot function to calculate all distances at once. I would begin by populating a from_df and a to_df like the following:
import numpy as np
from_df = pl.loc[np.in1d(pl.index, pot["from_wpadr"])]
to_df = pl.loc[np.in1d(pl.index, pot["to_wpadr"])]
Then you can continue as in your function:
sxcord=from_df["XCORD"]
sycord=from_df["YCORD"]
excord=to_df["XCORD"]
eycord=to_df["YCORD"]
x=np.abs(excord-sxcord); y=np.abs(eycord-sycord)
distances=x+y
This will calculate all the distances at once. The if clause can also be vectorized: compute the results for both branches into separate arrays, and keep track of the boolean array saying which branch applies to each row, so you can put the pieces back together in the dataframe afterwards:
first_three_equals = np.char.ljust(pot["from_wpadr"].values.astype(str), 3) \
== np.char.ljust(pot["to_wpadr"].values.astype(str), 3)
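A minimal end-to-end sketch of this vectorized approach, with made-up toy frames standing in for the question's pl, plee and pot (only the column names are taken from the question):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's pl / plee / pot frames (made-up data)
pl = pd.DataFrame({'XCORD': [0.0, 3.0, 10.0], 'YCORD': [0.0, 4.0, 10.0]},
                  index=['AAA-1', 'AAA-2', 'BBB-1'])
plee = pd.DataFrame({'exitx': [1.0, 11.0], 'exity': [1.0, 11.0],
                     'entryx': [0.5, 10.5], 'entryy': [0.5, 10.5]},
                    index=['AAA', 'BBB'])
pot = pd.DataFrame({'from_wpadr': ['AAA-1', 'AAA-1'],
                    'to_wpadr': ['AAA-2', 'BBB-1']})

from_df = pl.loc[pot['from_wpadr']]
to_df = pl.loc[pot['to_wpadr']]

# Manhattan distance when both waypoints are in the same zone
d1 = (np.abs(to_df['XCORD'].values - from_df['XCORD'].values)
      + np.abs(to_df['YCORD'].values - from_df['YCORD'].values))

from_z = plee.loc[pot['from_wpadr'].str[:3]]
to_z = plee.loc[pot['to_wpadr'].str[:3]]

# Distance via the zone exit/entry points when zones differ
d2 = (np.abs(from_z['exitx'].values - from_df['XCORD'].values)
      + np.abs(from_z['exity'].values - from_df['YCORD'].values)
      + np.abs(from_z['exitx'].values - to_z['entryx'].values)
      + np.abs(from_z['exity'].values - to_z['entryy'].values)
      + np.abs(to_z['entryx'].values - to_df['XCORD'].values)
      + np.abs(to_z['entryy'].values - to_df['YCORD'].values))

# Boolean mask selecting between the two branches, row by row
same_zone = pot['from_wpadr'].str[:3].values == pot['to_wpadr'].str[:3].values
pot['traveldistance'] = np.where(same_zone, d1, d2)
print(pot['traveldistance'].tolist())  # → [7.0, 22.0]
```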

Deleting duplicate x values and their corresponding y values

I am working with a list of points in Python 2.7 and running some interpolations on the data. My list has over 5000 points, and some "x" values repeat with different corresponding "y" values. I want to get rid of these repeated points, because repeating "x" values with different "y" values do not satisfy the criteria of a function, which makes my interpolation routine raise an error. Here is a simple example of what I am trying to do:
Input:
x = [1,1,3,4,5]
y = [10,20,30,40,50]
Output:
xy = [(1,10),(3,30),(4,40),(5,50)]
The interpolation function I am using is InterpolatedUnivariateSpline(x, y)
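For the toy input above, a plain dict already does the deduplication (a sketch, assuming the first y seen for each x should win):

```python
x = [1, 1, 3, 4, 5]
y = [10, 20, 30, 40, 50]

seen = {}
for xv, yv in zip(x, y):
    seen.setdefault(xv, yv)  # keep the first y seen for each x

xy = sorted(seen.items())
print(xy)  # → [(1, 10), (3, 30), (4, 40), (5, 50)]
```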
Keep a variable that stores the previous x value; if it is the same as the current value, skip that point. For example:
previous_x = None
xs, ys = [], []
for xv, yv in zip(x, y):
    if xv == previous_x:
        continue  # skip repeated x values
    xs.append(xv)
    ys.append(yv)
    previous_x = xv
spline = InterpolatedUnivariateSpline(xs, ys)
I am assuming your data is already sorted by x, so that repeated x values are adjacent.
A bit late, but if anyone is interested, here's a solution with numpy and pandas:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

x = [1,1,3,4,5]
y = [10,20,30,40,50]

# convert lists into numpy arrays:
array_x, array_y = np.array(x), np.array(y)

# sort x and y by x value
order = np.argsort(array_x)
xsort, ysort = array_x[order], array_y[order]

# create a dataframe and add 2 columns for your x and y data:
df = pd.DataFrame()
df['xsort'] = xsort
df['ysort'] = ysort

# create a new dataframe (mean) with no duplicate x values and the
# corresponding mean values in all other columns:
mean = df.groupby('xsort').mean()
df_x = mean.index
df_y = mean['ysort']

# poly1d to create a polynomial line from coefficient inputs:
trend = np.polyfit(df_x, df_y, 14)
trendpoly = np.poly1d(trend)

# plot polyfit line:
plt.plot(df_x, trendpoly(df_x), linestyle=':', dashes=(6, 5), linewidth='0.8',
         color=colour, zorder=9, figure=[name of figure])
Also, if you just use argsort() to put the values in order of x, the interpolation should work even without having to delete the duplicate x values. Trying this on my own dataset:
polyfit on its own
sorting the data in order of x first, then polyfit
sorting the data, deleting duplicates, then polyfit
... I get the same result twice.