Fastest way to remove N random objects - python-2.7

My question is as follows: I am currently working with a generated list of length m. However, the list is supposed to be the result of an algorithm that takes n as an argument for the final length. m is always much larger than n. Currently I am running a while loop where m is the result of len(list).
i.e.:
from numpy import random as rnd

m = 400000
n = 3000
list = range(0, m)
while len(list) > n:
    rmi = rnd.randint(0, len(list))
    del list[rmi]
    print('%s/%s' % (len(list), n))
This approach certainly works but takes an incredibly long time to run. Is there a more efficient and less time consuming way of removing m-n random entries from my list? The entries removed must be random or the resulting list will no longer represent what it should be.
edit:
Later in my code I have two arrays of size n which need to be shortened to size b. The caveat here is that both lists need to have their elements removed randomly, but the removed elements must also share the same index. ie:
from numpy import random as rnd
import random

n = 3000
b = 500
list1 = range(0, n)
list2 = random.sample(xrange(10000), n)  # n distinct values; stdlib random.sample (numpy's rnd.sample means something else)
while len(list1) > b:
    rmi = rnd.randint(0, len(list1))
    del list1[rmi]
    del list2[rmi]
    print('%s/%s' % (len(list1), b))
alvis' answer below answers the first part of my question; however, it does not work for the second part.

Try numpy.random.choice, which draws a random sample from your list (pass replace=False so that no element is picked twice):
https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.random.choice.html
import numpy as np
...
np.random.choice(range(0, m), size=n, replace=False)
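For the second part of the question (shortening two parallel sequences while removing the same positions from both), a minimal sketch along the same lines, assuming the data can live in NumPy arrays; arr1, arr2 and b below are stand-ins for the variables in the edit:
import numpy as np

n, b = 3000, 500
arr1 = np.arange(n)                           # plays the role of list1
arr2 = np.random.randint(0, 10000, size=n)    # plays the role of list2

# Draw b distinct positions once, then index both arrays with the same
# positions so the surviving elements still line up pairwise.
keep = np.random.choice(n, size=b, replace=False)
keep.sort()   # optional: preserve the original relative order
short1 = arr1[keep]
short2 = arr2[keep]
This avoids deleting elements one at a time, which is what makes the while loop slow.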

Related

use elements from a large list till the list becomes empty (python)

I'm new to Python and I have a looping issue while taking chunks of data from a list.
I have a large list that I need to use chunk by chunk until it is entirely empty.
Let's say I have a list like:
a = range(4000)  # range 100 - 9k
n = 99
while a:
    x = a[:n]        # want to use first 100 elements
    # some insertion work of (x) in dB
    a = a[n+1:]      # reducing first 100 elements from main list
but this method is not working.
Can anybody suggest a proper approach for this?
Thanks
a[:n] when n is 99 gets the first 99 elements - so change n to 100.
a = a[n+1:] will miss an element - so change n+1 to n
The full code:
a = range(4000)
n = 100
while a:
    x = a[:n]
    # some insertion work of (x) in dB
    a = a[n:]  # reducing first 100 elements from main list
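Note also that each a = a[n:] copies the whole remaining tail, so on a very large list the loop spends most of its time copying. One possible alternative (just a sketch, not part of the original answer) is to walk the list by index and slice out each chunk directly:
a = range(4000)
n = 100

for start in xrange(0, len(a), n):
    x = a[start:start + n]  # the next chunk of up to n elements
    # some insertion work of (x) in dB goes here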

[Leetcode]448. Find All Numbers Disappeared in an Array

problem link from leetcode
I came up with two solutions written in Python, but they did not pass and I do not know why.
Given an array of integers where 1 ≤ a[i] ≤ n (n = size of array),
some elements appear twice and others appear once.
Find all the elements of [1, n] inclusive that do not appear in this
array.
Here is my first solution:
class Solution(object):
    def findDisappearedNumbers(self, nums):
        nums = sorted(list(set(nums)))
        for x in range(1, nums[-1] + 1):
            if x in nums:
                nums.remove(x)
            else:
                nums.append(x)
        return nums
The result is "Runtime Error Message: Line 4: IndexError: list index out of range", but I do not understand why.
The second solution:
return [x for x in range(1, len(nums) + 1) if x not in nums]
The result is "Time Limit Exceeded",still,confused.
Both solutions works okay in my Pycharm with python 2.7.11.Maybe there are some test cases my solutions did not pass but I can not find it.
First of all, try to use xrange instead of range, as this uses less space when nums is very large. Also, you are trying to iterate over and delete/append values in the same array at the same time. This is most likely the reason why you are getting the error.
Also, removing a value from the middle of a list takes a lot of time, because all the elements after it need to be moved.
From the first solution: DO NOT modify the list you are iterating over. It always brings problems. Better to copy the list and modify the copy!
class Solution(object):
    def findDisappearedNumbers(self, nums):
        nums = sorted(list(set(nums)))
        nums_copy = list(nums)  # work on a copy so the list being iterated stays untouched
        for x in range(1, nums[-1] + 1):
            if x in nums:
                nums_copy.remove(x)
            else:
                nums_copy.append(x)
        return nums_copy
On the other hand, if nums is very large (has many elements), range can bring problems because it creates the whole list first (and VERY large lists occupy a LOT of memory). For these cases it is better to use xrange, which produces the values lazily instead of building the list up front.
This does not happen in Python 3, where the default range is already lazy.
You can build a set from nums, which removes all the duplicates and gives constant-time membership tests. Then you can run a loop over 1..n and append every number not present in the set to the output array.
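A short sketch of that set-based idea (assuming the same Solution class layout as in the question; not run against LeetCode's judge):
class Solution(object):
    def findDisappearedNumbers(self, nums):
        seen = set(nums)  # duplicates collapse, and membership tests are O(1)
        return [x for x in xrange(1, len(nums) + 1) if x not in seen]
This is essentially the question's second solution, except that x not in seen is constant time instead of a scan of the whole list, which is what caused the Time Limit Exceeded.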
Your first solution will fail if the test input is an empty list, as nums[-1] would give an index out of bounds error.
Your second solution will be slow, as it has to scan the whole list for every candidate number. Would the solution below work? Set operations are optimised, but is the space complexity OK for you?
ret = set(range(1, len(nums)+1))
ret = ret - set(nums)
return list(ret)

Why is python's built in sum function slow when used to flatten a list of lists?

When trying to flatten a list of lists using Python 2.7's built-in sum function, I ran across some performance issues: the computation was surprisingly slow, and a simple iterative approach yielded much faster results.
The short code below seems to illustrate this performance gap:
import timeit

def sum1(arrs):
    return sum(arrs, [])

def sum2(arrs):
    s = []
    for arr in arrs:
        s += arr
    return s

def main():
    array_of_arrays = [[0] for _ in range(1000)]
    print timeit.timeit(lambda: sum1(array_of_arrays), number=100)
    print timeit.timeit(lambda: sum2(array_of_arrays), number=100)

if __name__ == '__main__':
    main()
On my laptop, I get as output:
>> 0.247241020203
>> 0.0043830871582
Could anyone explain to me why is it so?
Your sum2 uses +=:
for arr in arrs:
    s += arr
sum does not use +=. sum is defined to use +. The difference is that s += arr is allowed to perform the operation by mutating the existing s list, while s = s + arr must construct a new list, copying the buffers of the old lists.
With +=, Python can use an efficient list resizing strategy that requires an amount of copying proportional to the size of the final list. For N lists of length K each, this takes time proportional to N*K.
With +, Python cannot do that. For every s = s + arr, Python must copy the entire s and arr lists to construct the new s. For N lists of size K each, the total time spent copying is proportional to N**2 * K, much worse.
Because of this, you should pretty much never use sum to concatenate sequences.
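If the goal is simply a fast flat list, one idiomatic option (a sketch for comparison, not part of the timing experiment above) is itertools.chain.from_iterable, which does the same linear amount of work as sum2:
import itertools

def sum3(arrs):
    # Lazily chains the sublists and materializes one flat list; no repeated copying of s.
    return list(itertools.chain.from_iterable(arrs))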

Subset sum variant with a non-zero target sum

I have an array of integers and need to apply a variant of the subset sum algorithm on it, except that instead of finding a set of integers whose sum is 0 I am trying to find a set of integers whose sum is n. I am unclear as to how to adapt one of the standard subset sum algorithms to this variant and was hoping for any insight into the problem.
This is the subset sum problem, which is NP-complete (there is no known efficient solution to NP-complete problems), but if your numbers are relatively small integers there is an efficient pseudo-polynomial solution that follows the recurrence:
D(x, i) = false                               if x < 0
D(0, i) = true
D(x, 0) = false                               if x != 0
D(x, i) = D(x, i-1) OR D(x - arr[i], i-1)     otherwise
Later, you need to trace back through your choices on the generated matrix, seeing where you decided to "reduce" (take the element) and where you decided not to (skip it).
This thread and this thread discuss how to get the elements for similar problems.
Here is Python code (taken from the thread I linked to) that does the trick.
If you are not familiar with Python, read it as pseudo-code; it's pretty easy to understand!
from random import randint

arr = [1, 2, 4, 5]
n = len(arr)
SUM = 6

# pre processing:
D = [[True] * (n+1)]
for x in range(1, SUM+1):
    D.append([False] * (n+1))

# DP solution to populate D:
for x in range(1, SUM+1):
    for i in range(1, n+1):
        D[x][i] = D[x][i-1]
        if x >= arr[i-1]:
            D[x][i] = D[x][i] or D[x-arr[i-1]][i-1]
print D

# get a random solution:
if D[SUM][n] == False:
    print 'no solution'
else:
    sol = []
    x = SUM
    i = n
    while x != 0:
        possibleVals = []
        if D[x][i-1] == True:
            possibleVals.append(x)
        if x >= arr[i-1] and D[x-arr[i-1]][i-1] == True:
            possibleVals.append(x - arr[i-1])
        # by here possibleVals contains 1 or 2 options, depending on how many choices we have;
        # choose one of them randomly
        r = possibleVals[randint(0, len(possibleVals)-1)]
        # if we decided to take the element:
        if r != x:
            sol.append(x - r)
        # modify i and x accordingly
        x = r
        i = i - 1
    print sol
You can solve this by using dynamic programming.
Let's assume that:
N is the required sum (your first input).
M is the number of summands available (your second input).
a1...aM are the summands available.
f[x] is true when you can reach the sum x, and false otherwise.
Now the solution:
Initially f[0] = true and f[1..N] = false - we can reach only the sum of zero without taking any summand.
Now you can iterate over all ai, where i is in [1..M], and for each of them perform the following update:
f[x + ai] = f[x + ai] || f[x], for each x from N - ai down to 0 - the decreasing order of processing matters, otherwise the same summand could be used more than once!
Finally you output f[N].
This solution has a complexity of O(N*M), so it is not very useful when you have either large input numbers or a large number of summands.
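A small Python sketch of that 1-D table (my own transcription of the description above, written as f[x] |= f[x - ai], which is the same update viewed from the other end):
def subset_sum(summands, target):
    # f[x] is True when some subset of the summands processed so far sums to x.
    f = [False] * (target + 1)
    f[0] = True
    for a in summands:
        # Decreasing x ensures each summand is used at most once.
        for x in xrange(target, a - 1, -1):
            f[x] = f[x] or f[x - a]
    return f[target]

print subset_sum([1, 2, 4, 5], 6)  # True, e.g. 2 + 4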

How can I remove similar but not duplicate items from a list?

I have a list:
values = [[6.23234121,6.23246575],[1.352672,1.352689],[6.3245,123.35323,2.3]]
What is a way I can go through this list and remove all items that are within, say, 0.01 of other elements in the same list?
I know how to do it for a specific set of lists using del, but I want it to be general for when values holds n lists and each list has n elements.
What I want to happen is perform some operation on this list
values = [[6.23234121,6.23246575],[1.352672,1.352689],[6.3245,123.35323,2.3]]
and get this output
new_values = [[6.23234121],[1.352672],[6.3245,123.35323,2.3]]
I'm going to write a function to do this for a single list, e.g.
>>> compact([6.23234121,6.23246575], tol=.01)
[6.23234121]
You can then get it to work on your nested structure through just [compact(l) for l in lst].
Each of these methods will keep the first element that doesn't have anything closer to it in the list; for #DSM's example of [0, 0.005, 0.01, 0.015, 0.02] they'd all return [0, 0.015] (or, if you switch > to >=, [0, 0.01, 0.02]). If you want something different, you'll have to define exactly what it is more carefully.
First, the easy approach, similar to David's answer. This is O(n^2):
def compact(lst, tol):
    new = []
    for el in lst:
        # keep el only if nothing already kept is within tol of it
        if all(abs(el - x) > tol for x in new):
            new.append(el)
    return new
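For example, applying it across the nested structure as suggested above reproduces the desired output:
values = [[6.23234121, 6.23246575], [1.352672, 1.352689], [6.3245, 123.35323, 2.3]]
new_values = [compact(l, tol=0.01) for l in values]
# [[6.23234121], [1.352672], [6.3245, 123.35323, 2.3]]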
On three-element lists, that's perfectly nice. If you want to do it on three million-element lists, though, that's not going to cut it. Let's try something different:
import collections
import math

def compact(lst, tol):
    round_digits = int(-math.log10(tol)) - 1
    seen = collections.defaultdict(set)
    new = []
    for el in lst:
        rounded = round(el, round_digits)  # the index bin for this element
        if all(abs(el - x) > tol for x in seen[rounded]):
            seen[rounded].add(el)
            new.append(el)
    return new
If your tol is 0.01, then round_digits is 1. So 6.23234121 is indexed in seen as just 6.2. When we then see 6.23246575, we round it to 6.2 and look that up in the index, which should contain all numbers that could possibly be within tol of the number we're looking up. Then we still have to check distances to those numbers, but only on the very few numbers that are in that index bin, instead of the entire list.
This approach is O(n k), where k is the average number of elements that'll fall within one such bin. It'll only be helpful if k << n (as it typically would be, but that depends on the distribution of the numbers you're using relative to tol). Note that it also uses probably more than twice as much memory as the other approach, which could be an issue for very large lists.
Another option would be to sort the list first; then you only have to look at the previous and following elements to check for a conflict.
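A rough sketch of that sorted variant (note it processes elements in sorted order, so which member of a close cluster survives can differ from the order-preserving versions above):
def compact_sorted(lst, tol):
    out = []
    for el in sorted(lst):
        # After sorting, only the most recently kept element can still be within tol.
        if not out or el - out[-1] > tol:
            out.append(el)
    return out
This is O(n log n) for the sort plus a linear pass, and needs no extra index structure.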