Multiprocessing / How to map a function whose two arguments are packed in a list of tuples - python-2.7

Here is my need:
I have a huge set of IPs to test within a complicated network. For better understanding, each IP is associated with the name of its equipment. The test shouldn't last too long, so multiprocessing seems to be a good idea.
poolThread = Pool(threadNumber)
results = poolThread.map(testIP, [tupleIP for tupleIP in listTupleIP])
def testIP(name, ip): [...]
But I'm stuck on unpacking tupleIP when mapping it to the function.
I've tried [tupleIP[0], tupleIP[1]]... but that is a single list, not two separate arguments.
The easy way is to unpack the tuple inside the function, and that works perfectly. Nevertheless, I'd like to know whether there is a more elegant way to do this.
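A common workaround in Python 2.7 is a small module-level wrapper that unpacks the tuple before calling the real function (Pool can only pickle top-level functions, so a lambda won't do). A minimal sketch with illustrative names and sample data; on Python 3.3+ you could instead call poolThread.starmap(testIP, listTupleIP) directly:
from multiprocessing import Pool

def testIP(name, ip):
    # ... run the actual test here ...
    return (name, ip)

def testIP_star(tupleIP):
    # unpack the (name, ip) tuple into two positional arguments
    return testIP(*tupleIP)

if __name__ == '__main__':
    listTupleIP = [('router-1', '10.0.0.1'), ('switch-2', '10.0.0.2')]
    poolThread = Pool(4)
    results = poolThread.map(testIP_star, listTupleIP)
    poolThread.close()
    poolThread.join()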

Related

Slow insertion using Neptune and Gremlin

I'm having problems inserting data into Neptune using Gremlin.
I am trying to insert many nodes and edges, potentially hundreds of thousands of them, while checking for existence.
Currently, we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem was the coalesce and the where steps - it takes more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn’t exist, and that’s why I am using the coalesce and where steps.
For example, the query we use to insert nodes with inject:
properties_list = [{'uid': '1642'}, {'uid': '1322'}, ...]
g.inject(properties_list).unfold().as_('node')
.sideEffect(__.V().where(P.eq('node')).by('uid').fold()
.coalesce(__.unfold(), __.addV(label).property(Cardinality.single, 'uid', '1')))
With 1000 nodes in the graph and properties_list with 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection with the same environment as the query above, without coalesce and where, takes less than 1 second.
I'd like to hear your suggestions and to know what the best practices are for inserting many nodes and edges (while checking for existence).
Thank you very much.
If you have a set of IDs that you want to check for existence, you can speed the query up significantly by also passing just a list of those IDs to the query and calculating up front which of them already exist. Then, having calculated the set that needs inserting, you can apply the changes in one go. This will make a big difference. The reason you are running into problems is that the mid-traversal V() has a lot of work to do. In general it would be better to use actual vertex IDs rather than properties (uid in your case); if that is not an option, the same technique works for property-based IDs. The steps are:
Using inject or sideEffect, pass in the IDs to be checked as one list and, separately, the map containing the changes to be conditionally applied.
Work out which of those IDs already exist in the graph and which do not.
Using the set of non-existing ones, apply the inserts, using the values in that set to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
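Since the question's snippet is written with gremlinpython, here is a rough Python translation of the same pattern, keyed on the question's 'uid' property rather than real vertex IDs (as noted above, the technique works for either). Treat it as a sketch: g is assumed to be an already-connected traversal source, and the sample data is illustrative.
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P, Cardinality

uids = ['1', '2', '9998', '9999']
data = [{'uid': '1', 'value': 'XYZ'},
        {'uid': '9998', 'value': 'ABC'},
        {'uid': '9999', 'value': 'DEF'}]

new_vertices = (
    g.V().has('uid', P.within(*uids)).values('uid').fold().as_('exist').
      constant(data).
      unfold().as_('d').
      # keep only the maps whose 'uid' is not already in the 'exist' list
      where(P.without('exist')).by('uid').by().
      addV('test').
        property(Cardinality.single, 'uid', __.select('d').select('uid')).
        property('value', __.select('d').select('value')).
      toList())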
As a side note, we are adding two new steps to Gremlin - mergeV and mergeE that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.

UserWarning in pymc3: What does reparameterize mean?

I built a pymc3 model using the DensityDist distribution. I have four parameters, of which three use Metropolis and one uses NUTS (this is chosen automatically by pymc3). However, I get two different UserWarnings:
1. Chain 0 contains number of diverging samples after tuning. If increasing target_accept does not help try to reparameterize.
May I know what reparameterize means here?
2. The acceptance probability in chain 0 does not match the target. It is , but should be close to 0.8. Try to increase the number of tuning steps.
Digging through a few examples I used 'random_seed', 'discard_tuned_samples', 'step = pm.NUTS(target_accept=0.95)' and so on, and got rid of these user warnings. But I couldn't find details of how these parameter values are decided. I am sure this might have been discussed in various contexts, but I am unable to find solid documentation for it. I was using a trial-and-error method, as below.
with patten_study:
    #SEED = 61290425 #51290425
    step = pm.NUTS(target_accept=0.95)
    trace = sample(step=step)  # previously tried: sample(4000, tune=10000, step=step, discard_tuned_samples=False, random_seed=SEED)
I need to run this on different datasets, so I am struggling to fix these parameter values for each dataset I use. Is there any way to supply these values, or to detect the outcome (whether there are any user warnings, and then try other values), and run it in a loop?
Pardon me if I am asking something stupid!
In this context, re-parametrization basically means finding a different but equivalent model that is easier to compute. There are many things you can do, depending on the details of your model:
Instead of using a Uniform distribution you can use a Normal distribution with a large variance.
Changing from a centered hierarchical model to a non-centered one (see the sketch at the end of this answer).
Replacing a Gaussian with a Student-t distribution.
Modeling a discrete variable as a continuous one.
Marginalizing variables, like in this example.
Whether these changes make sense or not is something you should decide based on your knowledge of the model and the problem.
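To make the non-centered item concrete, here is a minimal sketch of that re-parametrization in pymc3, using a toy hierarchical model (the data and variable names are purely illustrative; the sampler call mirrors the one in the question):
import numpy as np
import pymc3 as pm

# toy group-level observations and their known standard errors
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

with pm.Model() as non_centered:
    mu = pm.Normal('mu', mu=0., sd=5.)
    tau = pm.HalfNormal('tau', sd=5.)
    # centered form would be: theta = pm.Normal('theta', mu=mu, sd=tau, shape=len(y))
    # non-centered form: sample a standard Normal and rescale it
    theta_tilde = pm.Normal('theta_tilde', mu=0., sd=1., shape=len(y))
    theta = pm.Deterministic('theta', mu + tau * theta_tilde)
    obs = pm.Normal('obs', mu=theta, sd=sigma, observed=y)

    step = pm.NUTS(target_accept=0.95)
    trace = pm.sample(step=step)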

model.remove_constraint() performance

I'm working with CPLEX/docplex, solving an LP problem that has a lot of infeasible constraints. Most of the feasibility issues come from the automated formulation of the model, and it's hard to detect the conflicts between constraints a priori.
Using the docplex function ConflictRefiner().refine_conflict(model) I'm able to find at least one set of conflicting constraints.
The problem is that, in order to find all the sets of conflicting constraints, I have to remove some of the conflicting constraints using model.remove_constraint(constraint.name), and that function takes a long time to execute.
Edit: the timings for 135,000 constraints are:
model.remove_constraint(constraint.name)
time= 124 sec
model.remove_constraint(constraint.element)
time= 126 sec
Is there a way to remove a constraint faster than with model.remove_constraint(str_name_constraint)? Is there a way to get all the conflict sets without having to remove/refine_conflict() for each set? Is there a way to use a hierarchy of constraints in order to avoid conflicts between constraints?
(The last question is a little off topic, but it's related to the original problem.)
Thanks in advance!
Finally I used a workaround:
I didn't use mdl.remove_constraint(). I added a priority to all the constraints, and then I used the relaxer module provided by docplex. I couldn't find any example in the docs (or anywhere else) of how to use the relaxer, so I made one of my own (really simple to understand). The relaxer is a really powerful tool, and it's much easier to use than doing all the relaxations by hand, especially when you have to deal with hierarchies in the constraints.
Example:
from docplex.mp.model import Model
import docplex
# we create a simple model
mdl = Model("relax_model")
x1=mdl.continuous_var(name='X1', lb=0)
x2=mdl.continuous_var(name='X2', lb=0)
# add conflict constraints
c1=mdl.add_constraint(x1<=10,'c1_low')
c2=mdl.add_constraint(x1<=5,'c2_medium')
c3=mdl.add_constraint(x1>=400,'c3_high')
c4=mdl.add_constraint(x2>=1,'c4_low')
mdl.minimize(x1+x2)
mdl.solve()
print mdl.report()
print mdl.get_solve_status() #infeasible model
print
print 'relaxation begin'
from docplex.mp.relaxer import Relaxer
rx = Relaxer(prioritizer='match')
rx.relax(mdl,relax_mode= docplex.mp.relaxer.RelaxationMode.OptInf)
print 'number_of_relaxations= ' + str(rx.number_of_relaxations)
print rx.relaxations()
print mdl.report()
print mdl.get_solve_status()
print mdl.solution
I know that this isn't "the solution" for the model.remove_constraint() performance problem, but it fits well when you need to avoid it.

Check a fingerprint in the database

I am saving fingerprints in a blob field, and I wonder whether the only way to compare them is to retrieve all the prints saved in the database and then build a list to check against, using the function "identify_finger". Can you check directly from the database using a SELECT?
I'm working with libfprint. In this code the verification is done against a list built in memory:
def test_identify():
    cur = DB.cursor()
    cur.execute('select id, fp from print')
    id = []
    gallary = []
    for row in cur.fetchall():
        data = pyfprint.pyf.fp_print_data_from_data(str(row['fp']))
        gallary.append(pyfprint.Fprint(data_ptr=data))
        id.append(row['id'])
    n, fp, img = FingerDevice.identify_finger(gallary)
There are two fundamentally different ways to use a fingerprint database. One is to verify the identity of a person who is known through other means, and one is to search for a person whose identity is unknown.
A simple library such as libfprint is suitable for the first case only. Since you're using it to verify someone you can use their identity to look up a single row from the database. Perhaps you've scanned more than one finger, or perhaps you've stored multiple scans per finger, but it will still be a small number of database blobs returned.
A fingerprint search algorithm must be designed from the ground up to narrow the search space, to compare quickly, and to rank the results and deal with false positives. Just as a Google search may come up with pages totally unrelated to what you're looking for, so too will a fingerprint search. There are companies that devote their entire existence to solving this problem.
Another way would be to have a mysql plugin that knows how to work with fingerprint images and select based on what you are looking for.
I really doubt that there is such a thing.
You could also try to parallelize the fingerprint comparison, i.e. calling:
FingerDevice.identify_finger(gallary)
in parallel, on different cores/machines
You can't check directly from the database using a SELECT, because each scan is different and will produce a different blob. libfprint does the hard work of comparing different scans and judging whether they are from the same person or not.
What zinking and Tudor are saying, I think, is that if you understand how that judgement process works (which is, by the way, by minutiae comparison) you can develop a method of storing the relevant data for the process (the minutiae, maybe?) in the database and then a method for fetching the relevant values -- maybe a kind of index or some type of extension to the database.
In other words, you would have to reimplement the libfprint algorithms in a more complex (and beautiful) way, instead of just accepting the libfprint method of comparing the scan with all stored fingerprints in a loop.
Other solutions for speeding up your program:
Use C:
I only know enough C to write hello-world kinds of programs, but it was not hard to write pure C code using the fp_identify_finger_img function of libfprint, and I can tell you it is much faster than pyfprint.identify_finger.
You can continue doing the enrollment part in Python; I do.
Use a time/location-based SELECT:
If you know your users are more likely to scan their fingerprints at some times than at others, or at some places than at others (maybe arriving at work at a certain time and scanning their fingers, or leaving, or entering the building through one gate rather than another), you can collect data at each scan to measure those probabilities and build parallel tables that sort the users by their probability of arriving at each time and location.
We know that identify_finger tries to identify fingers by looping over the fingerprint objects you provide in a list, so we can exploit that and pass the objects sorted so that the most likely user for that time and location comes first in the list, and so on.
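A rough sketch of that idea, reusing the pyfprint calls from the snippet in the question (pyfprint, DB and FingerDevice are as defined there). The scan_stats table and its hour/gate/scan_count columns are hypothetical; substitute whatever statistics you actually collect:
def build_sorted_gallery(cur, hour, gate):
    # most frequently seen users for this hour and gate come first
    cur.execute(
        'SELECT p.id, p.fp FROM print p '
        'JOIN scan_stats s ON s.id = p.id '
        'WHERE s.hour = %s AND s.gate = %s '
        'ORDER BY s.scan_count DESC',
        (hour, gate))
    ids, gallary = [], []
    for row in cur.fetchall():
        data = pyfprint.pyf.fp_print_data_from_data(str(row['fp']))
        gallary.append(pyfprint.Fprint(data_ptr=data))
        ids.append(row['id'])
    return ids, gallary

# ids, gallary = build_sorted_gallery(DB.cursor(), current_hour, current_gate)
# n, fp, img = FingerDevice.identify_finger(gallary)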

How to optimize use of querysets with lists

I have a model that has a couple million objects. Each object represents a call made/received by a company.
To simplify things, let's say this model, Call, has these fields:
calldate, context, channel.
My goal is to know the average # of calls made and received during each hour of the day of the month (load by hour). The catch is: I need to find this for port1 and port2 separately.
As of now, my code works fine, except that it takes around a whole minute to give me the result for a range of 4 months, and it seems extremely inefficient.
I've done some simple profiling and discovered that the extend is taking around 99% of the processing time:
queryset = Call.objects.filter(calldate__gte='SOME_DATE')
port1, port2 = [],[]
port1.extend(queryset.filter(context__icontains="e1-1"))
port2.extend(queryset.filter(context__icontains="e1-2"))
channels_in_port1 = ["Port/%d-2" % x for x in range(1,32)]
channels_in_port2 = ["Port/%d-2" % x for x in range(32,63)]
for i in channels_in_port1:
    port1.extend(queryset.filter(channel__icontains=i))
for i in channels_in_port2:
    port2.extend(queryset.filter(channel__icontains=i))
port1 and port2 have around 150k objects combined now.
As soon as I have all calls for port1 and port2, I'm good to go. The rest of the code is basically some for loops over port1 and port2 that sum up and average the calls by hour/day/month. Trivial stuff.
I tried to avoid using any "extend" by using itertools.chain and chaining the querysets instead. However, that made the processing time shift to the part where I do the trivial for loops to calculate the load by hour.
Any alternatives? Better ways to filter the queryset?
Thanks very much!!
Have you considered using Django's aggregation functions? http://docs.djangoproject.com/en/dev/topics/db/aggregation/
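For example, a sketch of pushing the per-hour counting into the database with annotate(). ExtractHour needs Django 1.10 or newer; the field names follow the question, and the port filter is simplified to a single context pattern:
from django.db.models import Count
from django.db.models.functions import ExtractHour

per_hour = (
    Call.objects
        .filter(calldate__gte='SOME_DATE', context__icontains='e1-1')
        .annotate(hour=ExtractHour('calldate'))
        .values('hour')
        .annotate(total=Count('pk'))
        .order_by('hour'))
# per_hour yields dicts like {'hour': ..., 'total': ...}, one per hour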
I presume your problem is with the second set of extends, i.e. those within the for loops, rather than the first. (The first is completely unnecessary, in any case: rather than defining an empty list up front and extending it, you can just do port1 = list(queryset.filter(context__icontains="e1-1")).)
Anyway, to summarize what I think you are trying to do: you want to get all Call objects for a certain date, in two blocks depending on the value of channel: one with values from 1 to 31, and one with values from 32 to 62.
It seems like you could do this with just two queries, without any extending at all:
port1 = queryset.filter(channel__range=["Port/1-2", "Port/31-2"])
port2 = queryset.filter(channel__range=["Port/1-32", "Port/31-62"])
Does that not do what you want?
Edit in response to comment: but that's then just two queries, which you can extend or concatenate. The problem with your code as posted is that you are doing 31 queries and extend operations for each port, which is bound to be expensive. If you just do one each, plus one extend/concat, that will be much cheaper.