Test if lists share any items in Python

I want to check whether any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function for this. If not, is there a more Pythonic way of achieving the same result?
In [78]: a = [1, 2, 3, 4, 5]
In [79]: b = [8, 7, 6]
In [80]: c = [8, 7, 6, 5]
In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....:
In [82]: lists_overlap(a, b)
Out[82]: False
In [83]: lists_overlap(a, c)
Out[83]: True
In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....:

Short answer: use not set(a).isdisjoint(b), it's generally the fastest.
There are four common ways to test whether two lists a and b share any items. The first option is to convert both to sets and check their intersection, like so:
bool(set(a) & set(b))
Because sets are stored using a hash table in Python, looking an item up in them is O(1) on average (see here for more information about the complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But
it must first create sets out of the lists, which can take a non-negligible amount of time, and
it supposes that hash collisions are sparse among your data.
The second way is to use a generator expression that iterates over the lists, such as:
any(i in a for i in b)
This searches in place, so no new memory is allocated for intermediate variables, and it bails out on the first match. But the in operator is always O(n) on lists (see here).
A third option is a hybrid: convert one of the lists to a set, then iterate over the other list and test membership against that set, like so:
a = set(a); any(i in a for i in b)
A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:
not set(a).isdisjoint(b)
If the elements you are searching for are near the beginning of the list (e.g. it is sorted), the generator expression is favored, because the set-intersection method has to allocate new memory for the intermediate variables:
>>> from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974
Here's a graph of the execution time for this example as a function of list size:
Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.
On the other hand, since the hybrid and the generator expression begin their search at the start of the list, if the shared elements are systematically at the end of the list (or if the lists share no values at all), the isdisjoint() and set-intersection approaches are then way faster than the generator expression and the hybrid approach.
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000)
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000)
0.08102107048034668
It is interesting to note that the generator expression is way slower for bigger list sizes, and this is with only 1000 repetitions instead of the 100000 used for the previous figure. This setup also approximates well the case where no elements are shared, and is the best case for the isdisjoint() and set-intersection approaches.
Here are two analyses using random numbers (instead of rigging the setup to favor one technique or another):
High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].
Up to now, this analysis has assumed both lists are of the same size. In the case of two lists of different sizes, for example when a is much smaller, isdisjoint() is always faster:
Make sure that the a list is the smaller one, otherwise performance decreases. In this experiment, the size of the a list was kept constant at 5.
In summary:
If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest on large list sizes;
Test the set intersection with not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
The hybrid "iterate through list, test on set" a = set(a); any(i in a for i in b) is generally slower than other methods.
The generator expression and the hybrid are much slower than the two other approaches when it comes to lists that share no elements.
In most cases, using the isdisjoint() method is the best approach, as the generator expression will take much longer to execute when no elements are shared, which makes it very inefficient.
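To reproduce these comparisons on your own data, a small timeit harness along the following lines can be used (a sketch only; the placeholder lists and the repetition count are arbitrary and not part of the measurements above):

from timeit import timeit

setup = "a = list(range(1000)); b = list(range(1000))"   # placeholder data

candidates = {
    "set intersection":     "bool(set(a) & set(b))",
    "generator expression": "any(i in a for i in b)",
    "hybrid":               "sa = set(a); any(i in sa for i in b)",
    "isdisjoint":           "not set(a).isdisjoint(b)",
}

for name, stmt in candidates.items():
    # far fewer repetitions than above, to keep the run short
    print(name, timeit(stmt, setup=setup, number=1000))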

def lists_overlap3(a, b):
    return bool(set(a) & set(b))
Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an if statement, just use if set(a) & set(b):
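For instance, a minimal illustration of using the intersection directly as a condition (made-up sample lists):

a = [1, 2, 3]
b = [3, 4, 5]

if set(a) & set(b):          # a non-empty set is truthy
    print("the lists share at least one item")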

def lists_overlap(a, b):
    sb = set(b)
    return any(el in sb for el in a)
This is asymptotically optimal (worst case O(n + m)), and might be better than the intersection approach due to any's short-circuiting.
E.g.:
lists_overlap([3,4,5], [1,2,3])
will return True as soon as it gets to 3 in sb
EDIT: Another variation (with thanks to Dave Kirby):
import itertools

def lists_overlap(a, b):
    sb = set(b)
    return any(itertools.imap(sb.__contains__, a))
This relies on imap's iterator, which is implemented in C, rather than a generator comprehension. It also uses sb.__contains__ as the mapping function. I don't know how much performance difference this makes. It will still short-circuit.
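Note that itertools.imap exists only in Python 2; in Python 3 the built-in map is already lazy, so an equivalent (a sketch, not part of the original answer) would be:

def lists_overlap(a, b):
    sb = set(b)
    # map is lazy in Python 3, so any() still short-circuits on the first hit
    return any(map(sb.__contains__, a))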

You could also use any with list comprehension:
any([item in a for item in b])

In python 2.6 or later you can do:
return not frozenset(a).isdisjoint(frozenset(b))

You can use the any built-in function with a generator expression:
def list_overlap(a, b):
    return any(i for i in a if i in b)
As John and Lie have pointed out, this gives incorrect results when every i shared by the two lists has bool(i) == False. It should be:
return any(i in b for i in a)
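To see the difference, consider lists whose only shared element is falsy, such as 0 (illustrative values only):

a = [0, 1]
b = [0, 2]

any(i for i in a if i in b)   # False: the shared element 0 is falsy
any(i in b for i in a)        # True: tests membership, not truthiness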

This question is pretty old, but I noticed that while people were arguing sets vs. lists, that no one thought of using them together. Following Soravux's example,
Worst case for lists:
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234
And the best case for lists:
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707
So even faster than iterating through two lists is iterating through a list to see whether each item is in a set, which makes sense, since checking whether a number is in a set takes constant time, while checking by iterating through a list takes time proportional to the length of the list.
Thus, my conclusion is: iterate through a list and check whether each item is in a set.
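Wrapped up as a function, the approach recommended here is essentially the same hybrid shown earlier in the thread (a sketch):

def lists_overlap(a, b):
    sa = set(a)                        # one-time O(len(a)) conversion
    return any(i in sa for i in b)     # each membership test is O(1) on average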

If you don't care what the overlapping element might be, you can simply compare the len of the combined list with the len of the lists combined as a set. If there are overlapping elements, the set will be shorter:
len(set(a+b+c)) == len(a+b+c) returns True if there is no overlap.
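A quick illustration with made-up lists (note that this trick also reports an "overlap" when a single list contains internal duplicates, since those shrink the set as well):

a = [1, 2]
b = [3, 4]
c = [5, 6]
len(set(a + b + c)) == len(a + b + c)   # True: no value appears twice overall

b = [3, 2]                              # 2 now overlaps with a
len(set(a + b + c)) == len(a + b + c)   # False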

I'll throw another one in with a functional programming style:
any(map(lambda x: x in a, b))
Explanation:
map(lambda x: x in a, b)
returns a list of booleans (in Python 2; in Python 3 map returns a lazy iterator) indicating which elements of b are found in a. That result is then passed to any, which simply returns True if any element is True.
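For example, with small made-up lists:

a = [1, 2, 3]
b = [7, 8, 2]

any(map(lambda x: x in a, b))   # True: 2 is in a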


Enforcing inequality of lists?

For a given CSP I used a variety of viewpoints, one of which is a somewhat more exotic boolean model that uses a variable array of size NxNxN. I then enforce inequality of various subarrays with this snippet:
( foreach(X, List1),
  foreach(Y, List2),
  foreach((X #\= Y), Constraints)
  do true
),
1 #=< sum(Constraints).
The performance of the model is bad, so I was curious to know more about what happens behind the scenes. Is this a proper way to ensure that two given lists are different? Do I understand it correctly that every constraint (X #\= Y) in the Constraints list needs to be instantiated before the sum is calculated, meaning that all the corresponding variables need to be instantiated too?
The constraint library library(ic_global) is indeed missing a constraint here; it should provide lex_ne/2, analogous to lex_lt/2. This would have the same logical and operational behaviour as the code you have written, i.e. propagate when there is only a single variable left in its argument lists:
?- B#::0..1, lex_ne([1,0,1], [1,B,1]).
B = 1
For comparison, you can try the sound difference operator ~=/2 (called dif/2 in some Prologs). This is efficiently implemented, but it doesn't know about domains and will therefore not propagate; it simply waits until both sides are instantiated and then fails or succeeds:
?- B#::0..1, [1,0,1] ~= [1,B,1].
B = B{[0, 1]}
There is 1 delayed goal.
?- B#::0..1, [1,0,1] ~= [1,B,1], B = 0.
No (0.00s cpu)
Whether this is overall faster will depend on your application.

Checking if the difference between consecutive elements is the same

I am new to using arithmetic in Prolog.
I've done a few small programs, but mostly involving logic. I am trying to implement a function that will return true or false depending on whether the difference between every consecutive pair of elements is the same.
My input would look like this: sameSeqDiffs([3, 5, 7, 9], 2)
I feel like I need to split the first two elements from the list, find their difference, and add the result to a new list. Once all the elements have been processed, check if the elements of the new list are all the same.
I’ve been taught some Prolog with building relationships and querying those, but this doesn’t seem to fit in with Prolog.
Update1: This is what I've come up with so far. I am brand new to this syntax and am still getting an error on my code, but I hope it conveys the general idea of what I'm trying to do.
diff([X,Y|Rest], Result):-
    diff([Y,Z|Rest], Result2):-
        Result2 = Result,
        Z - Y = Result.
Update2: I know I still have much to do on this code, but here is where I will remain until this weekend; I have some other stuff to do. I think I understand the logic of it a bit more, and I think I need to figure out how to run the last line of the function only if there are at least two more items in the rest of the list to process.
diff([X,Y|Rest], Result):-
    number(Y),
    Y-X=Result,
    diff([Rest], Result).
Update3: I believe I have the function the way I want it. The only quirk I noticed is that when I run an input like sameSeqDiffs([3,5,7], 2). I get true returned, immediately followed by a false. Is this the correct behavior or am I still missing something?
sameSeqDiffs([X,Y], Result):-
    A is Y - X,
    A = Result.
sameSeqDiffs([X,Y,Z|T], Result):-
    sameSeqDiffs([Y,Z|T], Result).
Update 4: I posted a new question about this....here is the link: Output seems to only test the very last in the list for difference function
Prolog's syntax
The syntax is a bit off: normally a clause has a head like foo(X, Y, Z), then an arrow (:-), followed by a body. That body normally does not contain any further arrows, so the second :- does not make much sense.
Predicates and unification
Secondly, in Prolog predicates have no input or output: a predicate is true or false (well, it can also raise an error, or get stuck in an infinite loop, but that is behavior we typically want to avoid). It communicates answers by unifying variables. For example a call sameSeqDiffs([3, 5, 7, 9], X). can succeed by unifying X with 2, and then the predicate - given it is implemented correctly - will return true.
Inductive definitions
In order to design a predicate, one typically first aims to come up with an inductive definition: a definition that consists of one or more base cases, and one or more "recursive" cases (where the predicate is defined in terms of itself).
For example here we can say:
(base case) For a list of exactly two elements [X, Y], the predicate sameSeqDiffs([X, Y], D) holds, given D is the difference between Y and X.
In Prolog this will look like:
sameSeqDiffs([X, Y], D) :-
    ___.
(with the ___ to be filled in).
Now for the inductive case we can define sameSeqDiffs/2 in terms of itself, although not with the same parameters of course. In mathematics, one sometimes defines a function f such that, for example, f(i) = 2×f(i-1), with f(0) = 1 as the base case. We can define an inductive case for sameSeqDiffs/2 in a similar way:
(inductive case) For a list of more than two elements, all elements in the list have the same difference, given the first two elements have a difference D, and in the list of elements except the first element, all elements have that difference D as well.
In Prolog this will look like:
sameSeqDiffs([X, Y, Z|T], D) :-
    ___,
    sameSeqDiffs(___, ___).
Arithmetic in Prolog
A common mistake made by people who start programming in Prolog is to think that, as is common in many programming languages, Prolog adds semantics to certain functors.
For example, one can think that A - 1 will decrement A. For Prolog, however, this is just -(A, 1); it is not minus, or anything else, just a functor. As a result Prolog will not evaluate such expressions. So if you write X = A - 1, then X is simply bound to the term -(A, 1).
Then how can we perform numerical operations? Prolog systems have a predicate is/2 that evaluates the right-hand side by attaching arithmetic semantics to it. So the is/2 predicate will interpret these (+)/2, (-)/2, etc. functors ((+)/2 as plus, (-)/2 as minus, and so on).
So we can evaluate an expression like:
A = 4, is(X, A - 1).
and then X will be set to 3, not to 4-1. Prolog also allows writing is infix, like:
A = 4, X is A - 1.
Here you will need this to calculate the difference between two elements.
You were very close with your second attempt. It should have been
samediffs( [X, Y | Rest], Result):-
    Result is Y - X,
    samediffs( [Y | Rest], Result).
And you don't even need "to split the first two elements from the list". This will take care of itself.
How? Simple: when calling samediffs( List, D), on the first entry into the predicate, the not-yet-instantiated D = Result will be instantiated to the calculated difference between the second and the first element in the list, by the call Result is Y - X.
On each subsequent entry into the predicate, which is to say, for each subsequent pair of elements X, Y in the list, the call Result is Y - X will calculate the difference for that pair and check it for numerical equality against Result, which at this point holds the previously calculated value.
In case they aren't equal, the predicate will fail.
In case they are, the recursion will continue.
The only thing missing is the base case for this recursion:
samediffs( [_], _Result).
samediffs( [], _Result).
In case it was a singleton (or even empty) list all along, this will leave the differences argument _Result uninstantiated. It can be interpreted as a checking predicate in such a case: there are certainly no unequal differences between elements in a singleton (or, even more so, an empty) list.
In general, ......
recursion(A, B):- base_case( A, B).
recursion( Thing, NewThing):-
    combined( Thing, Shell, Core),
    recursion( Core, NewCore),
    combined( NewThing, Shell, NewCore).
...... Recursion!

Sympy lambdify array with shape (n,)

I have the following 'issue' with sympy at the moment:
I have a symbolic expression like M = matrix([pi*a, sin(1)*b]) which I want to lambdify and pass to a numerical optimizer. The issue is that the optimizer needs the function to input/output numpy arrays of shape (n,) and specifically NOT (n,1).
Now I have been able to achieve this with the following code (MWE):
import numpy as np
import sympy as sp
a, b = sp.symbols('a, b')
M = sp.Matrix([2*a, b])
f_tmp = sp.lambdify([[a,b]], M, 'numpy')
fun = lambda x: np.reshape( f_tmp(x), (2,))
Now, this is of course extremely ugly, since the reshape needs to be applied every time fun is evaluated (which might be LOTS of times). Is there a way to avoid this problem? The Matrix class is by definition always two-dimensional. I tried using sympy's MutableDenseNDimArray class, but it doesn't work in conjunction with lambdify (the symbolic variables don't get recognized).
One way is to convert a matrix to a nested list and take the first row:
fun = sp.lambdify([[a, b]], M.T.tolist()[0], 'numpy')
Now fun([2, 3]) is [4, 3]. This is a Python list, not a NumPy array, but optimizers (at least those in SciPy) should be okay with that.
One can also do
fun = sp.lambdify([[a, b]], np.squeeze(M), 'numpy')
which also returns a list.
In my test the above were equally fast, and faster than the version with a wrapping function (be it np.squeeze or np.reshape): about 6 µs vs 9 µs. It seems the gain is in eliminating one function call.
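Putting it together, here is a minimal end-to-end sketch; scipy.optimize.fsolve is used purely as an example consumer that expects shape-(n,) input and output, not something prescribed by the question:

import numpy as np
import sympy as sp
from scipy.optimize import fsolve

a, b = sp.symbols('a, b')
M = sp.Matrix([2*a, b])

# lambdify the flattened expression so no reshape wrapper is needed
fun = sp.lambdify([[a, b]], M.T.tolist()[0], 'numpy')

print(fun([2, 3]))                         # [4, 3]
print(fsolve(fun, np.array([1.0, 1.0])))   # roots of 2*a = 0, b = 0, i.e. ~[0, 0]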

Match all numbers present in two arrays efficiently [Python]

I want to match numbers that are present in two arrays (not of equal length) and output them to another array if there is a match. The numbers are floating point.
Currently I have a working program in Python but it is slow when I run it for large datasets. What I've done is two nested for loops.
The first nested for loop runs through array1 and checks if any numbers from array2 are in array1. If there is a match, I write it to an array called arrayMatch1.
I then check array2 against arrayMatch1 and output the final result to arrayFinal.
arrayFinal will have all numbers that exist within both array1, array2.
My problem:
Two nested for loops give me a complexity of O(n^2). This method works fine for datasets with array lengths under 25000, but slows down significantly for anything larger. How can I make it more efficient? The numbers are floating point and are always in this format: ######.###
I want to speed up my program but keep using Python because of the simplicity. Are there better ways to find matches between two arrays?
Why not just find the intersection of the two lists?
a = [1,2,3,4.3,5.7,9,11,15]
b = [4.3,5.7,6.3,7.9,8.1]
def intersect(a, b):
    return list(set(a) & set(b))

print intersect(a, b)
Output:
[5.7, 4.3]
Gotten from this question.
So what you're basically trying to do is find the intersection (the logically correct term) of the two lists.
First you need to eliminate the duplicates from each list; a set is a great way to do that. Then you can just & those sets and you will be good to go.
a = [23.3213,23.123,43.213,12.234] #List First
b = [12.234,23.345,34.224] #List Second

def intersect(a, b):
    return list(set(a) & set(b))

print intersect(a, b)
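Since the question states the values always have the ######.### format, one might also round both lists to three decimals before building the sets, so that values which should match are not missed because of tiny floating-point differences (a sketch, not part of the original answers):

a = [123456.123, 234567.456, 345678.789]
b = [234567.45600000001, 456789.012]      # 234567.456 with float noise

def intersect_rounded(a, b, digits=3):
    # normalise both lists to the known 3-decimal precision before intersecting
    return list(set(round(x, digits) for x in a) & set(round(x, digits) for x in b))

print(intersect_rounded(a, b))   # [234567.456]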

Slice like functionality from a List in F#

With an array let foo = [|1;2;3;4|] I can use any of the following to return a slice from an array.
foo.[..2]
foo.[1..2]
foo.[2..]
How can I do the same thing for a List, let foo2 = [1;2;3;4]? When I try the same syntax as with the array I get error FS00039: The field, constructor or member 'GetSlice' is not defined.
What's the preferred method of getting a subsection of a List and why aren't they built to support GetSlice?
What's the preferred method of getting a subsection of a List and why aren't they built to support GetSlice?
Let's make the last question first and the first question last:
Why lists don't support GetSlice
Lists are implemented as linked lists, so we don't have efficient indexed access to them. Comparatively speaking, foo.[m..n] takes O(n-m) time for arrays, while the equivalent syntax would take O(n) time on lists. This is a pretty big deal, because it prevents us from using slicing syntax efficiently in the vast majority of cases where it would be useful.
For example, we can cut up an array into equal sized pieces in linear time:
let foo = [|1 .. 100|]
let size = 4
let fuz = [|for a in 0 .. size .. 100 do yield foo.[a..a+size] |]
But what if we were using a list instead? Each call to foo.[a..a+size] would take longer and longer and longer; the whole operation is O(n^2), making it pretty unsuitable for the job.
Most of the time, slicing a list is the wrong approach. We normally use pattern matching to traverse and manipulate lists.
Preferred method for slicing a list?
Wherever possible, use pattern matching if you can. Otherwise, you can fall back on Seq.skip and Seq.take to cut up lists and sequences for you:
> [1 .. 10] |> Seq.skip 3 |> Seq.take 5 |> Seq.toList;;
val it : int list = [4; 5; 6; 7; 8]
F# 4.0 will allow slicing syntax for lists (link).
Rationale is here:
The F# list type already supports an index operator, xs.[3]. This is done despite the fact that lists are linked lists in F# - lists are just so commonly used in F# that in F# 2.0 it was decided to support this.
Since an index syntax is supported, it makes sense to also support the F# slicing syntax, e.g. xs.[3..5]. It is very strange to have to switch to an array type to use slicing, but you don't have to make that switch for indexing.
Still, Juliet's answer, saying that most of the time slicing a list is the wrong approach, holds true. So be wise when using this feature.