How to use pandas.Series.str.contains with tqdm progress map?

I'm trying to add a new column to a dataframe (dfA) based on values from another dataframe (dfB):
s = dfA['value'].tolist()
dfB['value'] = dfB['text_bod'].str.contains('|'.join(s))
Can progress_map be used with this setup?
dfB['value'] = dfB['text_bod'].progress_map(func)
Or is there some other way tqdm can be implemented?
Alternative method using FlashText:
from flashtext import KeywordProcessor
from tqdm import tqdm
tqdm.pandas()  # registers progress_map / progress_apply on pandas objects
s = dfA['value'].tolist()
processor = KeywordProcessor()
processor.add_keywords_from_list(s)
dfB['value'] = dfB['text_bod'].progress_map(lambda x: processor.extract_keywords(x))

I'm not aware of a str.contains way, but you can use progress_map with a callback that does the exact same thing using re.search:
import re
dfB['value'] = dfB['text_bod'].progress_map(
    lambda x: bool(re.search('|'.join(s), x))
)
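Since that lambda re-joins s on every row, compiling the pattern once up front is slightly cheaper; a minimal variant, reusing s and the re import from above:
pattern = re.compile('|'.join(s))  # compile once, reuse for every row
dfB['value'] = dfB['text_bod'].progress_map(lambda x: bool(pattern.search(x)))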
As a function, you can use
import numpy as np
def extract(x, p):
    m = p.search(x)
    if m:
        return m.group(0)  # return the full match
    return np.nan
p = re.compile('|'.join(s))
dfB['value'] = dfB['text_bod'].progress_map(lambda x: extract(x, p))
This should allow you greater flexibility than a lambda.

Replace values in sympy NDimArray

I would like to replace values in a sympy NDimArray.
I have the following code
import sympy as sp
import numpy as np
e = sp.MatrixSymbol('e',3,3)
E = sp.Matrix(e)
# Make E symmetric
E[1,0] = E[0,1]
E[2,0] = E[0,2]
E[2,1] = E[1,2]
result = sp.tensorproduct(E,E)
E_tst = np.random.rand(3,3)
E_tst[1,0] = E_tst[0,1]
E_tst[2,0] = E_tst[0,2]
E_tst[2,1] = E_tst[1,2]
resultNumeric = np.tensordot(E_tst,E_tst,axes=0)
check = resultNumeric - result.as_mutable().subs({E: sp.Matrix(E_tst)})
I get the error AttributeError: 'MutableDenseNDimArray' object has no attribute 'subs'.
How can I replace the symbols in a NDimArray?
Best Regards
Unfortunately, MutableDenseNDimArray doesn't inherit from Basic, whereas ImmutableDenseNDimArray does, therefore some attributes are not available. Don't ask me about this design decision.
However, you can achieve the same result by creating a substitution dictionary:
# substitution dictionary
d = {k: v for k, v in zip(list(E), list(sp.Matrix(E_tst)))}
check = resultNumeric - result.subs(d)
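Putting the pieces together, a minimal end-to-end sketch of the fix applied to the question's setup (same names as above):
import sympy as sp
import numpy as np
e = sp.MatrixSymbol('e', 3, 3)
E = sp.Matrix(e)
E[1, 0], E[2, 0], E[2, 1] = E[0, 1], E[0, 2], E[1, 2]  # make E symmetric
result = sp.tensorproduct(E, E)
E_tst = np.random.rand(3, 3)
E_tst[1, 0], E_tst[2, 0], E_tst[2, 1] = E_tst[0, 1], E_tst[0, 2], E_tst[1, 2]
resultNumeric = np.tensordot(E_tst, E_tst, axes=0)
d = dict(zip(list(E), list(sp.Matrix(E_tst))))  # symbol -> numeric value
check = resultNumeric - np.array(result.subs(d).tolist(), dtype=float)  # ~0 up to float precision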

Error using multiprocessing library: "got multiple values for keyword argument 'x' "

I am trying to parallelize a penalized linear model using the multiprocessing library in python.
I created a function that solves my model:
from __future__ import division
import numpy as np
from cvxpy import *
def lm_lasso_solver(x, y, lambda1):
    n = x.shape[0]
    m = x.shape[1]
    lambda1_param = Parameter(sign="positive")
    betas_var = Variable(m)
    response = dict(model='lm', penalization='l')
    response["parameters"] = {"lambda_vector": lambda1}
    lasso_penalization = lambda1_param * norm(betas_var, 1)
    lm_penalization = 0.5 * sum_squares(y - x * betas_var)
    objective = Minimize(lm_penalization + lasso_penalization)
    problem = Problem(objective)
    lambda1_param.value = lambda1
    try:
        problem.solve(solver=ECOS)
    except:
        try:
            problem.solve(solver=CVXOPT)
        except:
            problem.solve(solver=SCS)
    beta_sol = np.asarray(betas_var.value).flatten()
    response["solution"] = beta_sol
    return response
In this function, x is a matrix of predictors and y is the response variable; lambda1 is the parameter that must be optimized, and so it is the parameter that I want to parallelize over. I saved this script in a Python file called "ms.py".
Then I created another python file called "parallelization.py" and in that file I defined the following:
import multiprocessing as mp
import ms
import functools
def myFunction(x, y, lambda1):
    pool = mp.Pool(processes=mp.cpu_count())
    results = pool.map(functools.partial(ms.lm_lasso_solver, x=x, y=y), lambda1)
    return results
The idea was then to execute the following in the Python interpreter:
import numpy as np
from sklearn.datasets import load_boston
boston = load_boston()
x = boston.data
y = boston.target
runfile('parallelization.py')
lambda_vector = np.array([1, 2, 3])
myFunction(x, y, lambda_vector)
But when I do this, I get the error message from the title: got multiple values for keyword argument 'x'.
The problem is on the line:
results = pool.map(functools.partial(ms.lm_lasso_solver, x=x, y=y), lambda1)
You are calling functools.partial() with x and y as keyword arguments, so pool.map() passes each element of lambda1 as the first positional argument, which collides with the keyword argument x. Pass x and y as positional arguments instead:
results = pool.map(functools.partial(ms.lm_lasso_solver, x, y), lambda1)
or simply use the apply_async() method of the pool object (it returns an AsyncResult, so call .get() on it to obtain the result):
results = pool.apply_async(ms.lm_lasso_solver, args=[x, y, lambda1])
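A minimal sketch of a corrected driver script, assuming the ms.py module above (the if __name__ == '__main__' guard matters on platforms that spawn worker processes, e.g. Windows):
import multiprocessing as mp
import functools
import numpy as np
from sklearn.datasets import load_boston
import ms
def myFunction(x, y, lambdas):
    # x and y are bound positionally, so each element of lambdas
    # fills the remaining lambda1 parameter
    with mp.Pool(processes=mp.cpu_count()) as pool:
        return pool.map(functools.partial(ms.lm_lasso_solver, x, y), lambdas)
if __name__ == '__main__':
    boston = load_boston()
    results = myFunction(boston.data, boston.target, np.array([1, 2, 3]))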

Tensorflow while loop : dealing with lists

import tensorflow as tf
array = tf.Variable(tf.random_normal([10]))
i = tf.constant(0)
l = []
def cond(i, l):
    return i < 10
def body(i, l):
    temp = tf.gather(array, i)
    l.append(temp)
    return i + 1, l
index, list_vals = tf.while_loop(cond, body, [i, l])
I want to process a tensor array in the way described in the above code: in the body of the while loop I want to process the array element by element and apply some function. For demonstration, I have given a small code snippet. However, it gives the following error message.
ValueError: Number of inputs and outputs of body must match loop_vars: 1, 2
Any help in resolving this is appreciated.
Thanks
Citing the documentation:
loop_vars is a (possibly nested) tuple, namedtuple or list of tensors that is passed to both cond and body
You cannot pass a regular Python list as a tensor. What you can do is:
i = tf.constant(0)
l = tf.Variable([])
def body(i, l):
    temp = tf.gather(array, i)
    l = tf.concat([l, [temp]], 0)
    return i + 1, l
index, list_vals = tf.while_loop(cond, body, [i, l],
                                 shape_invariants=[i.get_shape(),
                                                   tf.TensorShape([None])])
The shape invariants are needed because tf.while_loop normally expects the shapes of the tensors inside the loop not to change.
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(list_vals)
Out: array([-0.38367489, -1.76104736, 0.26266089, -2.74720812, 1.48196387,
-0.23357525, -1.07429159, -1.79547787, -0.74316853, 0.15982138],
dtype=float32)
TF offers a TensorArray to deal with such cases. From the doc,
Class wrapping dynamic-sized, per-time-step, write-once Tensor arrays.
This class is meant to be used with dynamic iteration primitives such as while_loop and map_fn. It supports gradient back-propagation via special "flow" control flow dependencies.
Here is an example,
import tensorflow as tf
array = tf.Variable(tf.random_normal([10]))
step = tf.constant(0)
output = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
def cond(step, output):
    return step < 10
def body(step, output):
    output = output.write(step, tf.gather(array, step))
    return step + 1, output
_, final_output = tf.while_loop(cond, body, loop_vars=[step, output])
final_output = final_output.stack()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(final_output))
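If the per-element computation has no cross-step dependencies, tf.map_fn is a more compact alternative to an explicit while loop; a minimal sketch, where elem_fn is a hypothetical stand-in for whatever per-element function you want to apply:
import tensorflow as tf
array = tf.Variable(tf.random_normal([10]))
elem_fn = lambda x: x * 2.0  # hypothetical per-element function
result = tf.map_fn(elem_fn, array)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(result))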

Construct a matrix in openCV based on a given matrix

I am writing some code using the OpenCV library in Python. In the process, I need to construct a matrix based on another matrix given. Now my code looks like the following:
for x in range(0, width):
    for y in range(0, height):
        if I_mat[x][y] >= 0 and I_mat[x][y] <= c_low:
            w_mat[x][y] = float(I_mat[x][y]) / c_low
        elif I_mat[x][y] > c_low and I_mat[x][y] < c_high:
            w_mat[x][y] = 1
        else:
            w_mat[x][y] = float((255 - I_mat[x][y])) / float((255 - c_high))
where I_mat is the input matrix and w_mat is the matrix I am going to construct. Since the input matrix is quite large, this algorithm is quite slow. I wonder if there are any other methods to construct w_mat more efficiently. Thanks a lot!
(It is not necessary to show the solution in Python.)
Edit: you might want to use numba:
import numpy as np
import timeit
from numba import void, jit
c_low = .3
c_high = .6
def func(val):
    if val >= 0 and val <= c_low:
        return float(val) / c_low
    elif val > c_low and val < c_high:
        return 1.
    else:
        return (255. - val) / (255. - c_high)
def npvectorize():
    global w_mat
    vfunc = np.vectorize(func)
    w_mat = vfunc(I_mat)
def orig():
    for x in range(I_mat.shape[0]):
        for y in range(I_mat.shape[1]):
            if I_mat[x][y] >= 0 and I_mat[x][y] <= c_low:
                w_mat[x][y] = float(I_mat[x][y]) / c_low
            elif I_mat[x][y] > c_low and I_mat[x][y] < c_high:
                w_mat[x][y] = 1
            else:
                w_mat[x][y] = float((255 - I_mat[x][y])) / float((255 - c_high))
I_mat = np.array(np.random.random((1000, 1000)), dtype=float)
w_mat = np.empty_like(I_mat)
fast = jit(void(), nopython=True)(orig)
print(timeit.Timer(fast).timeit(1))
print(timeit.Timer(npvectorize).timeit(1))
print(timeit.Timer(orig).timeit(1))
output:
0.0352660446331
0.472590475098
4.78634474265
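A plain NumPy alternative is np.select, which evaluates all three branches vectorized, no numba required. A minimal sketch reusing I_mat, c_low and c_high from above (it assumes I_mat >= 0, as in the original first branch):
conditions = [I_mat <= c_low, I_mat < c_high]  # checked in order, so the second effectively means c_low < val < c_high
choices = [I_mat / c_low, np.ones_like(I_mat)]
w_mat = np.select(conditions, choices, default=(255. - I_mat) / (255. - c_high))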

How do I do an OR filter in a Django query?

I want to be able to list the items that either a user has added (they are listed as the creator) or the item has been approved.
So I basically need to select:
item.creator = owner or item.moderated = False
How would I do this in Django? (preferably with a filter or queryset).
There are Q objects that allow complex lookups. Example:
from django.db.models import Q
Item.objects.filter(Q(creator=owner) | Q(moderated=False))
You can use the | operator to combine querysets directly without needing Q objects:
result = Item.objects.filter(creator=owner) | Item.objects.filter(moderated=False)
(edit - I was initially unsure if this caused an extra query but @spookylukey pointed out that lazy queryset evaluation takes care of that)
It is worth noting that it's possible to add Q expressions.
For example:
from django.db.models import Q
query = Q(first_name='mark')
query.add(Q(email='mark@test.com'), Q.OR)
query.add(Q(last_name='doe'), Q.AND)
queryset = User.objects.filter(query)
This ends up with a query like:
(first_name = 'mark' or email = 'mark@test.com') and last_name = 'doe'
This way there is no need to deal with | operators, reduce(), etc.
If you want to make the filter dynamic, you can use reduce with a lambda:
from functools import reduce
from django.db.models import Q
brands = ['ABC', 'DEF', 'GHI']
queryset = Product.objects.filter(reduce(lambda x, y: x | y, [Q(brand=item) for item in brands]))
reduce(lambda x, y: x | y, [Q(brand=item) for item in brands]) is equivalent to
Q(brand=brands[0]) | Q(brand=brands[1]) | Q(brand=brands[2]) | .....
Similar to older answers, but a bit simpler, without the lambda...
To filter these two conditions using OR:
Item.objects.filter(Q(field_a=123) | Q(field_b__in=(3, 4, 5)))
To get the same result programmatically:
filter_kwargs = {
    'field_a': 123,
    'field_b__in': (3, 4, 5),
}
list_of_Q = [Q(**{key: val}) for key, val in filter_kwargs.items()]
Item.objects.filter(reduce(operator.or_, list_of_Q))
operator is in standard library: import operator
From docstring:
or_(a, b) -- Same as a | b.
For Python3, reduce is not a builtin any more but is still in the standard library: from functools import reduce
P.S.
Don't forget to make sure list_of_Q is not empty - reduce() will choke on an empty list; it needs at least one element.
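A minimal guard sketch for that edge case, using the same hypothetical names as above:
import operator
from functools import reduce
from django.db.models import Q
if list_of_Q:
    qs = Item.objects.filter(reduce(operator.or_, list_of_Q))
else:
    qs = Item.objects.none()  # or .all(), depending on the semantics you want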
Multiple ways to do so.
1. Directly using the pipe (|) operator:
from django.db.models import Q
Items.objects.filter(Q(field1=value) | Q(field2=value))
2. Using the __or__ method:
Items.objects.filter(Q(field1=value).__or__(Q(field2=value)))
3. By changing the default connector (be careful to reset the default behaviour):
Q.default = Q.OR # Not recommended (Q.AND is default behaviour)
Items.objects.filter(Q(field1=value, field2=value))
Q.default = Q.AND # Reset after use.
4. By using the Q class argument _connector:
logic = Q(field1=value, field2=value, field3=value, _connector=Q.OR)
Item.objects.filter(logic)
Snapshot of Q implementation
class Q(tree.Node):
    """
    Encapsulate filters as objects that can then be combined logically (using
    `&` and `|`).
    """
    # Connection types
    AND = 'AND'
    OR = 'OR'
    default = AND
    conditional = True

    def __init__(self, *args, _connector=None, _negated=False, **kwargs):
        super().__init__(children=[*args, *sorted(kwargs.items())],
                         connector=_connector, negated=_negated)

    def _combine(self, other, conn):
        if not (isinstance(other, Q) or getattr(other, 'conditional', False) is True):
            raise TypeError(other)
        if not self:
            return other.copy() if hasattr(other, 'copy') else copy.copy(other)
        elif isinstance(other, Q) and not other:
            _, args, kwargs = self.deconstruct()
            return type(self)(*args, **kwargs)
        obj = type(self)()
        obj.connector = conn
        obj.add(self, conn)
        obj.add(other, conn)
        return obj

    def __or__(self, other):
        return self._combine(other, self.OR)

    def __and__(self, other):
        return self._combine(other, self.AND)

    # ...
Ref. Q implementation
This might be useful: https://docs.djangoproject.com/en/dev/topics/db/queries/#spanning-multi-valued-relationships
Basically it sounds like they act as OR:
Item.objects.filter(field_name__startswith='yourkeyword')