Python performance issue while searching in a huge list

I need to speed up (dramatically) the search in a "huge" single-dimension list of unsigned values. The list has 389,114 elements, and I need to perform a check before adding an item to make sure it doesn't already exist.
I do this check 15 million times...
Of course, it takes too much time.
The fastest way I found was:
if this_item in my_list:
    i = my_list.index(this_item)
else:
    my_list.append(this_item)
    i = len(my_list)
...
I am building a dataset from time series logs.
One column of these (huge) logs is a text message, which is very redundant.
To dramatically speed up the process, I transform this text into an unsigned integer with Adler32(), and get a unique numeric value, which is great.
Then I store the messages in a PostgreSQL database, with this value as index.
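For reference, a minimal sketch of that hashing step, assuming the messages are Python strings and that Adler32() refers to zlib's adler32 (the actual helper is not shown in the question):

import zlib

def message_id(message):
    # Adler-32 of the UTF-8 bytes, as an unsigned 32-bit integer.
    # Note: Adler-32 is fast, but collisions are possible, so uniqueness is an assumption.
    return zlib.adler32(message.encode('utf-8')) & 0xffffffff

print(message_id("disk /dev/sda1 is at 90% capacity"))  # example message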
For each line of my log files (15 million altogether), I need to update my database of unique messages (389,114 unique messages).
It means that for each line, I need to check whether the message ID belongs to my in-memory list.
I tried "... in list", the same with dictionaries, numpy arrays, transforming the list into a string and using string.search(), SQL queries against the database with a good index...
Nothing beats "if item in list" when the list is loaded into memory (very fast):
if this_item in my_list:
    i = my_list.index(this_item)
else:
    my_list.append(this_item)
    i = len(my_list)
For 15 million iterations with some other work and NO search in the list:
- It takes 8 minutes to generate 2 tables of 15 million lines (features and targets).
- When I activate the code above to check whether a message ID already exists, it takes 1 hour 35 min...
How could I optimize this?
Thank you for your help

If your code is, roughly, this:
my_list = []
for this_item in collection:
    if this_item in my_list:
        i = my_list.index(this_item)
    else:
        my_list.append(this_item)
        i = len(my_list)
    ...
Then it will run in O(n^2) time since the in operator for lists is O(n).
You can achieve linear time if you use a dictionary (which is implemented with a hash table) instead:
my_list = []
table = {}
for this_item in collection:
    i = table.get(this_item)
    if i is None:
        i = len(my_list)
        my_list.append(this_item)
        table[this_item] = i
    ...
Of course, if you don't care about processing the items in the original order, you can just do:
for i, this_item in enumerate(set(collection)):
    ...
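A slightly more compact variant of the same idea relies on dict.setdefault and on dicts preserving insertion order (Python 3.7+); this is a sketch, not part of the original answer:

collection = ["a", "b", "a", "c", "b"]   # example input

table = {}
for this_item in collection:
    # setdefault returns the existing index, or stores and returns the next free one.
    i = table.setdefault(this_item, len(table))

print(table)        # {'a': 0, 'b': 1, 'c': 2}
print(list(table))  # the unique items in first-seen order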

Related

Django request.POST loop

How can I loop through the request data and post it as one row into the database? The user can submit multiple descriptions, lengths and so on. The problem I have is that in the DB it creates massive amounts of rows just to reach the correct format in the last one, A1, but the user could submit A1,1,1,1,1; A2,2,2,8,100 and so on, as it is a dynamic add form.
descriptions = request.POST.getlist('description')
lengths = request.POST.getlist('lengthx')
widths = request.POST.getlist('widthx')
depths = request.POST.getlist('depthx')
quantitys = request.POST.getlist('qtyx')

for description in descriptions:
    for lengt in lengths:
        for width in widths:
            for depth in depths:
                for quantity in quantitys:
                    newquoteitem = QuoteItem.objects.create(
                        qdescription=description,
                        qlength=lengt,
                        qwidth=width,
                        qdepth=depth,
                        qquantity=quantity,
                        quote_number=quotenumber,
                    )
[Images in the original post: "bottom entry is correct", "post system"]
First solution
Use formsets. That is exactly what they are meant to handle.
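A minimal sketch of the formset approach, assuming a hypothetical QuoteItemForm (a form exposing qdescription, qlength, etc.) and the quotenumber from the original view; the answer does not show these details:

from django.forms import formset_factory

# QuoteItemForm is assumed, not shown in the original answer.
QuoteItemFormSet = formset_factory(QuoteItemForm, extra=1)

def add_quote_items(request, quotenumber):
    formset = QuoteItemFormSet(request.POST or None)
    if request.method == 'POST' and formset.is_valid():
        for form in formset:
            if form.cleaned_data:  # skip empty extra forms
                QuoteItem.objects.create(quote_number=quotenumber, **form.cleaned_data)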
Second solution
descriptions = request.POST.getlist('description') returns a list of all descriptions, so let's say there are 5: the loop iterates 5 times. Now lengths = request.POST.getlist('lengthx') is a list of all lengths, again 5 of them, so it will iterate 5 times, and since it is nested within the descriptions for loop, that's 25 times!
So, although I still think formsets are the way to go, you can try the following:
descriptions = request.POST.getlist('description')
lengths = request.POST.getlist('lengthx')
widths = request.POST.getlist('widthx')
depths = request.POST.getlist('depthx')
quantitys = request.POST.getlist('qtyx')

for i in range(len(descriptions)):
    newquoteitem = QuoteItem.objects.create(
        qdescription=descriptions[i],
        qlength=lengths[i],
        qwidth=widths[i],
        qdepth=depths[i],
        qquantity=quantitys[i],
        quote_number=quotenumber,
    )
Here, if there are 5 descriptions, then len(descriptions) will be 5, and there is one loop, which will iterate 5 times in total.
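As a side note, zip keeps the parallel lists in step without manual indexing; a sketch of the same loop, under the same assumptions as above:

for description, length, width, depth, quantity in zip(descriptions, lengths, widths, depths, quantitys):
    QuoteItem.objects.create(
        qdescription=description,
        qlength=length,
        qwidth=width,
        qdepth=depth,
        qquantity=quantity,
        quote_number=quotenumber,
    )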

find index based on first element in a nested list

I have a list that contains sublists. The sequence of the sublist is fixed, as are the number of elements.
schedule = [['date1', 'action1', beginvalue1, endvalue1],
['date2', 'action2', beginvalue2, endvalue2],
...
]
Say I have a date and I want to find what I have to do on that date, meaning I need the contents of the entire sublist, given only the date.
I did the following (which works): I created an intermediate list with all the first values of the sublists. Based on the index, I was able to retrieve the entire sublist, as follows:
dt = 'date150' # To just have a value to make underlying code more clear
ls_intermediate = [item[0] for item in schedule]
index = ls_intermediate.index(dt)
print(schedule[index])
It works, but it just does not seem like the Python way to do this. How can I improve this piece of code?
To be complete: there are no double 'date' entries in the list. Every date is unique and appears only once.
Learning Python, and having quite a journey in front of me...
thank you!
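Since every date is unique, one idiomatic option is to build a dictionary keyed by date once, which also makes each lookup O(1); a minimal sketch (not taken from an answer in this thread):

# Build the lookup once: date -> full sublist.
by_date = {row[0]: row for row in schedule}

dt = 'date150'
print(by_date[dt])       # raises KeyError if the date is missing
print(by_date.get(dt))   # returns None instead of raising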

View execute time is very long (above one minute)

In our Django project, there is a view which creates multiple objects (from 5 to even 100). The problem is that creating phase takes a very long time.
I don't know why that is so, but I suppose it could be because for n objects there are n database lookups and commits.
For example 24 objects takes 67 seconds.
I want to speed up this process.
There are two things I think may be worth considering:
To create these objects in one query so only one commit is executed.
Create a ThreadPool and create these objects parallel.
This is the part of the view which causes problems (we use Postgres on localhost, so the connection is not a problem):
@require_POST
@login_required
def heu_import(request):
    ...
    ...
    product = Product.objects.create(user=request.user, name=name, manufacturer=manufacturer, category=category)
    product.groups.add(*groups)
    occurences = []
    counter = len(urls_xpaths)
    print 'Occurences creating'
    start = datetime.now()
    eur_currency = Currency.objects.get(shortcut='eur')
    for url_xpath in urls_xpaths:
        counter -= 1
        print counter
        url = url_xpath[1]
        xpath = url_xpath[0]
        occ = Occurence.objects.create(product=product, url=url, xpath=xpath, active=True if xpath else False, currency=eur_currency)
        occurences.append(occ)
    print 'End'
    print datetime.now() - start
    ...
    return render(request, 'main_app/dashboard/new-product.html', context)
Output:
Occurences creating
24
.
.
.
0
End
0:01:07.727000
EDIT:
I tried to put the for loop into the with transaction.atomic(): block but it seems to help only a bit (47 seconds instead of 67).
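For reference, the pattern described above is roughly the following (a sketch wrapping the loop from the view):

from django.db import transaction

with transaction.atomic():
    for url_xpath in urls_xpaths:
        # ... same loop body as in the view above ...
        occ = Occurence.objects.create(product=product, url=url, xpath=xpath,
                                       active=True if xpath else False, currency=eur_currency)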
EDIT2:
I'm not sure, but it seems that the SQL queries themselves are not the problem.
Please use bulk_create for inserting multiple objects.
occurences = []
for url_xpath in urls_xpaths:
    counter -= 1
    print counter
    url = url_xpath[1]
    xpath = url_xpath[0]
    occurences.append(Occurence(product=product, url=url, xpath=xpath, active=True if xpath else False, currency=eur_currency))
Occurence.objects.bulk_create(occurences)
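If the list becomes very large, bulk_create also accepts a batch_size argument that splits the insert into smaller statements (the value below is only an example):

Occurence.objects.bulk_create(occurences, batch_size=500)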

Organize items of a list

How do I organize a list's items? Suppose I have a list l = ['a','b','c','d','e','f','g','h','i'].
The requirement is to get a,b,c into one list, d,e,f into another and g,h,i into another list. The current implementation is:
l = ['a','b','c','d','e','f','g','h','i']
m = list()
for i in l:
    if (i.find("a")>=0) or (i.find("b")>=0) or (i.find("c")>=0):
        m.append(i)
print m
and so on for the next items. Is there any better logic for this? With the current implementation the cyclomatic complexity is high.
In your example, you should not use find on the list because:
- you don't really need the index, so you could just use if "a" in l
- find, or even in, on a list has linear (O(n)) complexity, so this is not optimal. That is not noticeable on a small list, but with a million elements it is.
I would use a set instead, which has constant-time lookups, and loop on the searched items instead of the list itself.
In a set, elements are hashed (and must therefore be unique), ensuring much better search performance (and insert performance too, but that's not the point here).
l = set(['a','b','c','d','e','f'])
m = list()
for i in ['a','b','z','c']:  # I have introduced an extra element
    if i in l:
        m.append(i)
print(m)
result:
['a', 'b', 'c']
What is funny about the above code is that it works with a set but also with a list, because in is shared by all collection objects. Only the performance varies.
You could replace the first line with l = ['a','b','c','d','e','f'] and it would still work, but you would get bad performance (well, not for 6 items, of course), just like the example in the question.
As proof for people still doubting the power of the set object, here is a test that checks whether an item is in the list. I have chosen the worst case for the list, but it can be done with another value.
import time

data = list(range(1000000))
start_time = time.time()
for i in range(1, 1000):
    999999 in data
print("list elapsed %f" % (time.time() - start_time))

data = set(data)
start_time = time.time()
for i in range(1, 1000):
    999999 in data
print("set elapsed %f" % (time.time() - start_time))
result:
list elapsed 17.284000
set elapsed 0.000000
Not even close :) You can also lower the searched value; the list time will decrease (but the set will always show 0).
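Coming back to the grouping asked for in the question (a,b,c / d,e,f / g,h,i), consecutive slices are a simple way to do it; a minimal sketch:

l = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

# Split into consecutive chunks of three.
chunks = [l[i:i + 3] for i in range(0, len(l), 3)]
print(chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]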

Comparing large list of hashes to other large list of hashes

I have a list of 100,000 hashes (list a) that I'd like to compare to a list of 15,000,000 hashes (list b).
Each hash is taken from list a. If it exists in list b, do nothing. If it does not exist in list b, write it to a file.
Here is the logic I have so far:
def compareHashes(map, hashdb, out):
    output_file = openFile(out)
    line_cnt = 0
    total_lines = len(map)
    for m in map:
        if m not in hashdb:
            writeToFile(m + "\r\n", output_file)
        sys.stdout.write("\r" + str(round(percentage(line_cnt, total_lines), 2)) + "%")
        sys.stdout.flush()
        line_cnt = line_cnt + 1
    output_file.close()
It works, but it takes an extremely long time. Can I get some suggestions on how to increase the performance of this? The box running the script has 60 GB of RAM and 8 cores. I don't think all the cores are being utilized, because Python is not multithreading here. Any ideas how I could increase the throughput?
First, you state that you'd like to write to file if an element in list a doesn't exist in list b. This can be represented in code as:
for a in list_a:
    if a not in list_b:
        writeFile(...)
Using the infix operator in on a list is an O(n) computation. Instead, use a set, a hash-based (unordered) collection with item lookup in O(1) average time.
set_b = set(list_b)
for a in list_a:
    if a not in set_b:
        writeFile(...)
You can also find all the items in list_a that aren't in list_b and then only perform actions on those items:
a_disjoint_b = set(list_a) - set(list_b)
for a in list_a:
    if a in a_disjoint_b:
        writeFile(...)
Or, if the order of items in list_a doesn't matter, and all items in list_a are unique:
for a in set(list_a) - set(list_b):
    writeFile(...)
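As a side note, set.difference accepts any iterable, so the second set does not have to be built explicitly (a sketch, with the same writeFile placeholder as above):

for a in set(list_a).difference(list_b):
    writeFile(...)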