Match all numbers present in two arrays efficiently [Python]

Match all numbers present in two arrays efficiently [Python] - python-2.7

I want to match numbers that are present in two arrays (not of equal length) and output them to another array if there is a match. The numbers are floating point.
Currently I have a working program in Python but it is slow when I run it for large datasets. What I've done is two nested for loops.
The first nested for loop runs through array1 and checks if any numbers from array2 are in array 1. If there is a match, I write it to an array called arrayMatch1.
I then check array2 and see if there is a match with arrayMatch1. And output the final result to arrayFinal.
arrayFinal will have all numbers that exist within both array1, array2.
My problem:
Two nested for loops give me a complexity of O(n^2). This method works fine for data sets under an array length of 25000 but slows down significant if greater. How can I make it more efficient. The numbers are floating point and always are in this format ######.###
I want to speed up my program but keep using Python because of the simplicity. Are there better ways to find matches between two arrays?

Why not just find the interesection of two lists?
a = [1,2,3,4.3,5.7,9,11,15]
b = [4.3,5.7,6.3,7.9,8.1]
def intersect(a, b):
return list(set(a) & set(b))
print intersect(a, b)
Output:
[5.7, 4.3]
Gotten from this question.

So what you're basically trying to do is find intersection(logically correct term) of 2 list.
First you need to eliminate the duplicate form the list itself, set is great way to do that, then you can just & those lists and you will be good to go.
a = [23.3213,23.123,43.213,12.234] #List First
b = [12.234,23.345,34.224] #List Second
def intersect(a, b):
return list(set(a) & set(b))
print intersect(a, b)

Related

How can I calculate the length of a list containing lists in OCAML

i am a beginner in ocaml and I am stuck in my project.
I would like to count the number of elements of a list contained in a list.
Then test if the list contains odd or even lists.
let listoflists = [[1;2] ; [3;4;5;6] ; [7;8;9]]
output
l1 = even
l2 = even
l3 = odd
The problem is that :
List.tl listoflists
Gives the length of the rest of the list
so 2
-> how can I calculate the length of the lists one by one ?
-> Or how could I get the lists and put them one by one in a variable ?
for the odd/even function, I have already done it !
Tell me if I'm not clear
and thank you for your help .

Unfortunately it's not really possible to help you very much because your question is unclear. Since this is obviously a homework problem I'll just make a few comments.
Since you talk about putting values in variables you seem to have some programming experience. But you should know that OCaml code tends to work with immutable variables and values, which means you have to look at things differently. You can have variables, but they will usually be represented as function parameters (which indeed take different values at different times).
If you have no experience at all with OCaml it is probably worth working through a tutorial. The OCaml.org website recommends the first 6 chapters of the OCaml manual here. In the long run this will probably get you up to speed faster than asking questions here.
You ask how to do a calculation on each list in a list of lists. But you don't say what the answer is supposed to look like. If you want separate answers, one for each sublist, the function to use is List.map. If instead you want one cumulative answer calculated from all the sublists, you want a fold function (like List.fold_left).
You say that List.tl calculates the length of a list, or at least that's what you seem to be saying. But of course that's not the case, List.tl returns all but the first element of a list. The length of a list is calculated by List.length.
If you give a clearer definition of your problem and particularly the desired output you will get better help here.

Use List.iter f xs to apply function f to each element of the list xs.
Use List.length to compute the length of each list.
Even numbers are integrally divisible by two, so if you divide an even number by two the remainder will be zero. Use the mod operator to get the remainder of the division. Alternatively, you can rely on the fact that in the binary representation the odd numbers always end with 1 so you can use land (logical and) to test the least significant bit.
If you need to refer to the position of the list element, use List.iteri f xs. The List.iteri function will apply f to two arguments, the first will be the position of the element (starting from 0) and the second will be the element itself.

Duplicate values in Julia with Function

I need writing a function which takes as input
a = [12,39,48,36]
and produces as output
b=[4,4,4,13,13,13,16,16,16,12,12,12]
where the idea is to repeat one element three times or two times (this should be variable) and divided by 2 or 3.
I tried doing this:
c=[12,39,48,36]
a=size(c)
for i in a
repeat(c[i]/3,3)
end

You need to vectorize the division operator with a dot ..
Additionally I understand that you want results to be Int - you can vectorizing casting to Int too:
repeat(Int.(a./3), inner=3)

Przemyslaw's answer, repeat(Int.(a./3), inner=3), is excellent and is how you should write your code for conciseness and clarity. Let me in this answer analyze your attempted solution and offer a revised solution which preserves your intent. (I find that this is often useful for educational purposes).
Your code is:
c = [12,39,48,36]
a = size(c)
for i in a
repeat(c[i]/3, 3)
end
The immediate fix is:
c = [12,39,48,36]
output = Int[]
for x in c
append!(output, fill(x/3, 3))
end
Here are the changes I made:
You need an array to actually store the output. The repeat function, which you use in your loop, would produce a result, but this result would be thrown away! Instead, we define an initially empty output = Int[] and then append! each repeated block.
Your for loop specification is iterating over a size tuple (4,), which generates just a single number 4. (Probably, you misunderstand the purpose of the size function: it is primarily useful for multidimensional arrays.) To fix it, you could do a = 1:length(c) instead of a = size(c). But you don't actually need the index i, you only require the elements x of c directly, so we can simplify the loop to just for x in c.
Finally, repeat is designed for arrays. It does not work for a single scalar (this is probably the error you are seeing); you can use the more appropriate fill(scalar, n) to get [scalar, ..., scalar].

negation before brackets

This is a follow-up question to this answer. I am trying to build a loop that produces a set of three random numbers until they match a particular pre-defined set of three arbitrary chosen numbers.
I'm still trying to figure out what operators to use for the program to accept the random numbers in any order but without any results.
I tried to your
!(first==one && second==two && third==three)
but it doesn't seem to work in c++. Thanks for your answer.

The condition that you tried implies that first, second, and third are in the same specific order as one, two, and three. You could try all six permutations, but that would make for a rather unreadable program. A better solution would be to add values to vectors, sort them, and then compare for equality, like this:
vector<int> a;
a.push_back(first);
a.push_back(second);
a.push_back(third);
vector<int> b;
b.push_back(one);
b.push_back(two);
b.push_back(three);
sort(a.begin(), a.end());
sort(b.begin(), b.end());
if (a == b) ... // values match
Here is a link to this snippet on ideone.

Simple spell checking algorithm

I've been tasked with creating a simple spell checker for an assignment but have given next to no guidance so was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guildlines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here so while I think I know a way to get this work, I'm really not too sure about how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then for each correctly spelled word that's loaded in, add the correctly spelled work in as a key in the map and the populate the vector to be all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled word. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, cause there would surely be thousands of permutations for each word? It also seems like it'd be very very slow if my initial dictionary of correctly spelled words was large?
I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).

The simpler way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while the removal of a letter creates few "bad words", for the addition or substitution you have many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein Distance
The solution is described in incremental step, normally the search speed should continuously improve at each idea and I have tried to organize them with the simpler ideas (in term of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein Distance algorithm
Store the dictionnary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would be better performance wise)
Keys Points:
The Levenshtein Distance compututation uses an array, at each step the next row is computed solely with the previous row
The minimum distance in a row is always superior (or equal) to the minimum in the previous row
The latter property allow a short-circuit implementation: if you want to limit yourself to 2 errors (treshold), then whenever the minimum of the current row is superior to 2, you can abandon the computation. A simple strategy is to return the treshold + 1 as the distance.
1. First Tentative
Let's begin simple.
We'll implement a linear scan: for each word we compute the distance (short-circuited) and we list those words which achieved the smaller distance so far.
It works very well on smallish dictionaries.
2. Improving the data structure
The levenshtein distance is at least equal to the difference of length.
By using as a key the couple (length, word) instead of just word, you can restrict your search to the range of length [length - edit, length + edit] and greatly reduce the search space.
3. Prefixes and pruning
To improve on this, we can remark than when we build the distance matrix, row by row, one world is entirely scanned (the word we look for) but the other (the referent) is not: we only use one letter for each row.
This very important property means that for two referents that share the same initial sequence (prefix), then the first rows of the matrix will be identical.
Remember that I ask you to store the dictionnary sorted ? It means that words that share the same prefix are adjacent.
Suppose that you are checking your word against cartoon and at car you realize it does not work (the distance is already too long), then any word beginning by car won't work either, you can skip words as long as they begin by car.
The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefix (many words to skip)
How long is "long" depends on your dictionary and you'll have to measure. I would go with the binary search to begin with.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then when reading carwash you need to determine the length of the common prefix (here car) and you can keep the first 4 rows of the matrix (corresponding to void, c, a, r).
Therefore, when begin to computing carwash, you in fact begin iterating at w.
To do this, simply use an array allocated straight at the beginning of your search, and make it large enough to accommodate the larger reference (you should know what is the largest length in your dictionary).
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a Trie or a Patricia Tree to store the dictionary. However it's not a STL data structure and you would need to augment it to store in each subtree the range of words length that are stored so you'll have to make your own implementation. It's not as easy as it seems because there are memory explosion issues which can kill locality.
This is a last resort option. It's costly to implement.

You should have a look at this explanation of Peter Norvig on how to write a spelling corrector .
How to write a spelling corrector
EveryThing is well explain in this article, as an example the python code for the spell checker looks like this :
import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
Hope you can find what you need on Peter Norvig website.

for spell checker many data structures would be useful for example BK-Tree. Check Damn Cool Algorithms, Part 1: BK-Trees I have done implementation for the same here
My Earlier code link may be misleading, this one is correct for spelling corrector.

off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
should be quite fast but i'm not sure about the code itself, i'm not very well-versed in c++

Test if lists share any items in python

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.
In [78]: a = [1, 2, 3, 4, 5]
In [79]: b = [8, 7, 6]
In [80]: c = [8, 7, 6, 5]
In [81]: def lists_overlap(a, b):
....: for i in a:
....: if i in b:
....: return True
....: return False
....:
In [82]: lists_overlap(a, b)
Out[82]: False
In [83]: lists_overlap(a, c)
Out[83]: True
In [84]: def lists_overlap2(a, b):
....: return len(set(a).intersection(set(b))) > 0
....:

Short answer: use not set(a).isdisjoint(b), it's generally the fastest.
There are four common ways to test if two lists a and b share any items. The first option is to convert both to sets and check their intersection, as such:
bool(set(a) & set(b))
Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But
it must first create sets out of the lists, which can take a non-negligible amount of time, and
it supposes that hashing collisions are sparse among your data.
The second way to do it is using a generator expression performing iteration on the lists, such as:
any(i in a for i in b)
This allows to search in-place, so no new memory is allocated for intermediary variables. It also bails out on the first find. But the in operator is always O(n) on lists (see here).
Another proposed option is an hybridto iterate through one of the list, convert the other one in a set and test for membership on this set, like so:
a = set(a); any(i in a for i in b)
A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:
not set(a).isdisjoint(b)
If the elements you search are near the beginning of an array (e.g. it is sorted), the generator expression is favored, as the sets intersection method have to allocate new memory for the intermediary variables:
from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974
Here's a graph of the execution time for this example in function of list size:
Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.
On the other hand, as the search begins with the beginning for the hybrid and generator expression, if the shared element are systematically at the end of the array (or both lists does not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668
It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well when when no elements are shared, and is the best case for the disjoint and set intersection approaches.
Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):
High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].
Up to now, this analysis supposed both lists are of the same size. In case of two lists of different sizes, for example a is much smaller, isdisjoint() is always faster:
Make sure that the a list is the smaller, otherwise the performance decreases. In this experiment, the a list size was set constant to 5.
In summary:
If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest on large list sizes;
Test the set intersection with not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
The hybrid "iterate through list, test on set" a = set(a); any(i in a for i in b) is generally slower than other methods.
The generator expression and the hybrid are much slower than the two other approaches when it comes to lists without sharing elements.
In most cases, using the isdisjoint() method is the best approach as the generator expression will take much longer to execute, as it is very inefficient when no elements are shared.

def lists_overlap3(a, b):
return bool(set(a) & set(b))
Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an if statement, just use if set(a) & set(b):

def lists_overlap(a, b):
sb = set(b)
return any(el in sb for el in a)
This is asymptotically optimal (worst case O(n + m)), and might be better than the intersection approach due to any's short-circuiting.
E.g.:
lists_overlap([3,4,5], [1,2,3])
will return True as soon as it gets to 3 in sb
EDIT: Another variation (with thanks to Dave Kirby):
def lists_overlap(a, b):
sb = set(b)
return any(itertools.imap(sb.__contains__, a))
This relies on imap's iterator, which is implemented in C, rather than a generator comprehension. It also uses sb.__contains__ as the mapping function. I don't know how much performance difference this makes. It will still short-circuit.

You could also use any with list comprehension:
any([item in a for item in b])

In python 2.6 or later you can do:
return not frozenset(a).isdisjoint(frozenset(b))

You can use the any built in function /w a generator expression:
def list_overlap(a,b):
return any(i for i in a if i in b)
As John and Lie have pointed out this gives incorrect results when for every i shared by the two lists bool(i) == False. It should be:
return any(i in b for i in a)

This question is pretty old, but I noticed that while people were arguing sets vs. lists, that no one thought of using them together. Following Soravux's example,
Worst case for lists:
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234
And the best case for lists:
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707
So even faster than iterating through two lists is iterating though a list to see if it's in a set, which makes sense since checking if a number is in a set takes constant time while checking by iterating through a list takes time proportional to the length of the list.
Thus, my conclusion is that iterate through a list, and check if it's in a set.

if you don't care what the overlapping element might be, you can simply check the len of the combined list vs. the lists combined as a set. If there are overlapping elements, the set will be shorter:
len(set(a+b+c))==len(a+b+c) returns True, if there is no overlap.

I'll throw another one in with a functional programming style:
any(map(lambda x: x in a, b))
Explanation:
map(lambda x: x in a, b)
returns a list of booleans where elements of b are found in a. That list is then passed to any, which simply returns True if any elements are True.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js