Organize items of a list - python-2.7

How do I organize the items of a list? Suppose I have a list l = ['a','b','c','d','e','f','g','h','i'].
The requirement is to get a, b, c into one list, d, e, f into another, and g, h, i into a third. My current implementation is:
l = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
m = list()
for i in l:
    if (i.find("a") >= 0) or (i.find("b") >= 0) or (i.find("c") >= 0):
        m.append(i)
print m
and similarly for the next groups of items. Is there better logic for this? With the current implementation the cyclomatic complexity is high.
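If the goal is simply to split the list into consecutive chunks of three, slicing with a step does it directly. A minimal sketch (the chunk size of 3 is taken from the example above):

l = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

# Take a slice of 3 items starting at every third index
chunks = [l[i:i + 3] for i in range(0, len(l), 3)]
print(chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]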

In your example, you should not use find on each list item, because:
- you don't really need the index, so you could just test membership with if "a" in l
- find, and even in on a list, has linear (O(n)) complexity, so this is not optimal. That's not noticeable on a small list, but with a million elements, it is.
Instead, put the wanted items in a set and loop on the searched items, testing membership against the set: in a set, elements are hashed (and must therefore be unique), ensuring much better search performance (and insert performance too, but that's not the point).
l = set(['a', 'b', 'c', 'd', 'e', 'f'])
m = list()
for i in ['a', 'b', 'z', 'c']:  # I have introduced an extra element
    if i in l:
        m.append(i)
print(m)
result:
['a', 'b', 'c']
What is interesting about the above code is that it works not only with a set but also with a list, because in is supported by all collection objects. Only the performance varies.
You could replace the first line with l = ['a','b','c','d','e','f'] and it would still work, but you would get bad performance (well, not for 6 items, of course), just like the example in the question.
As proof for anyone still doubting the power of the set object, here's a test that checks whether an item is in the data. I have chosen the worst case for the list, but it can be repeated with another value.
import time

data = range(1000000)
start_time = time.time()
for i in range(1, 1000):
    999999 in data
print("list elapsed %f" % (time.time() - start_time))

data = set(data)
start_time = time.time()
for i in range(1, 1000):
    999999 in data
print("set elapsed %f" % (time.time() - start_time))
result:
list elapsed 17.284000
set elapsed 0.000000
Not even close :) And if you lower the searched value, the list time will decrease (but the set time will always show roughly 0).
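For more reliable numbers, the standard timeit module can repeat the same membership test for you. A minimal sketch along the same lines (the repetition count of 100 is arbitrary):

import timeit

# Same worst-case membership test, timed for both containers
print(timeit.timeit("999999 in data",
                    setup="data = list(range(1000000))", number=100))
print(timeit.timeit("999999 in data",
                    setup="data = set(range(1000000))", number=100))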

Related

Finding the number of items in a list that are not duplicated

I have a list of fruits and there are some duplicates in the list. I am not after the unique items in the list, nor the count of unique items; if I wanted those details I would simply use a set.
I want instead to count the number of fruits in the list that are not duplicated, i.e. lemon, orange, tomato, passionfruit = 4.
Here's my code, which works OK but is slow because it uses nested loops.
fruits = ['apple', 'pear', 'pear', 'apple', 'strawberry', 'lemon',
          'orange', 'strawberry', 'tomato', 'passionfruit']
fruits_len = len(fruits)
uniq = 0
for loop1 in range(0, fruits_len):
    flag = 0
    for loop2 in range(0, fruits_len):
        if loop1 == loop2:
            continue
        if fruits[loop1] == fruits[loop2]:
            flag = 1
            break
    if flag == 0:
        uniq += 1
print(f'There are {uniq} fruits not duplicated in the fruits list')
The code yields the following correct result.
There are 4 fruits not duplicated in the fruits list
The only problem is that on much larger lists my code would be slow. I want to solve this in the most efficient way possible. I tried to devise a solution using a list comprehension, but it's awkward because the two nested loops have code in between them.
How can I make my code run much faster? (Bragging rights to whoever comes up with the fastest solution.)
Thanks, Peter
You could use a dict to store the counts of the elements from the list and select the elements that have only one occurrence:
fruits = ['apple', 'pear', 'pear', 'apple', 'strawberry', 'lemon',
          'orange', 'strawberry', 'tomato', 'passionfruit']
histogram = {}
for item in fruits:
    if item not in histogram:
        histogram[item] = 1
    else:
        histogram[item] += 1
print(len([i for i in histogram if histogram[i] == 1]))
Outputs
4
Time
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 6.2 µs
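The same histogram can be built with collections.Counter from the standard library, which keeps the code shorter. A minimal sketch:

from collections import Counter

fruits = ['apple', 'pear', 'pear', 'apple', 'strawberry', 'lemon',
          'orange', 'strawberry', 'tomato', 'passionfruit']

# Count every fruit, then count how many appear exactly once
counts = Counter(fruits)
print(sum(1 for c in counts.values() if c == 1))  # 4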

python performance issue while searching in a huge list

I need to speed up (dramatically) the search in a "huge" single-dimension list of unsigned values. The list has 389,114 elements, and I need to perform a check before I add an item, to make sure it doesn't already exist.
I do this check 15 million times...
Of course, it takes too much time.
The fastest way I found was:
if this_item in my_list:
    i = my_list.index(this_item)
else:
    my_list.append(this_item)
    i = len(my_list)
...
I am building a dataset from time-series logs.
One column of these (huge) logs is a text message, which is very redundant.
To dramatically speed up the process, I transform this text into an unsigned integer with Adler32() and get a unique numeric value, which is great.
Then I store the messages in a PostgreSQL database, with this value as the index.
For each line of my log files (15 million altogether), I need to update my database of unique messages (389,114 unique messages).
It means that for each line, I need to check whether the message ID belongs to my in-memory list.
I tried "... in list", the same with dictionaries, numpy arrays, transforming the list into a string and using string.search(), SQL queries against the database with a good index...
Nothing beats "if item in list" when the list is loaded into memory (very fast):
if this_item in my_list:
    i = my_list.index(this_item)
else:
    my_list.append(this_item)
    i = len(my_list)
For 15 million iterations with some other work and NO search in the list:
- It takes 8 minutes to generate 2 tables of 15 million lines (features and targets)
- When I activate the code above to check whether a message ID already exists, it takes 1 hour 35 min...
How could I optimize this?
Thank you for your help.
If your code is, roughly, this:
my_list = []
for this_item in collection:
    if this_item in my_list:
        i = my_list.index(this_item)
    else:
        my_list.append(this_item)
        i = len(my_list)
    ...
Then it will run in O(n^2) time since the in operator for lists is O(n).
You can achieve linear time if you use a dictionary (which is implemented with a hash table) instead:
my_list = []
table = {}
for this_item in collection:
    i = table.get(this_item)
    if i is None:
        i = len(my_list)
        my_list.append(this_item)
        table[this_item] = i
    ...
Of course, if you don't care about processing the items in the original order, you can just do:
for i, this_item in enumerate(set(collection)):
    ...
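As a quick sanity check, here is the dictionary-based pattern above run on a tiny hypothetical collection (the message strings are made up for illustration):

collection = ['msg-a', 'msg-b', 'msg-a', 'msg-c', 'msg-b']

my_list = []
table = {}
for this_item in collection:
    i = table.get(this_item)
    if i is None:
        i = len(my_list)          # index the new item will occupy
        my_list.append(this_item)
        table[this_item] = i
    print("%s -> %d" % (this_item, i))  # repeats map back to their first index

# my_list is now ['msg-a', 'msg-b', 'msg-c']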

Finding the max value of a list of tuples, (applying max to the second value of the tuple)

So I have a list of tuples which I created by zipping two lists, like this:
zipped = list(zip(neighbors, cv_scores))
max(zipped) produces
(49, 0.63941769316909292), where 49 is the max value.
However, I'm interested in finding the max among the latter values of the tuples (the 0.63941...).
How can I do that?
The problem is that Python compares tuples lexicographically: it orders on the first item, and only if these are equivalent does it compare the second, and so on.
You can however use the key= parameter of the max(..) function to compare on the second element:
max(zipped, key=lambda x: x[1])
Note 1: you do not have to construct a list(..) if you are only interested in the maximum value. You can use
max(zip(neighbors, cv_scores), key=lambda x: x[1])
Note 2: finding the max(..) runs in O(n) (linear time), whereas sorting a list runs in O(n log n).
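To see the lexicographic comparison described above, here is a tiny hypothetical example:

pairs = [(49, 0.1), (3, 0.9)]

print(max(pairs))                      # (49, 0.1): compares first items
print(max(pairs, key=lambda x: x[1]))  # (3, 0.9): compares second items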
max(zipped)[1]
# returns the second element of the lexicographically largest tuple
If you also want your data sorted, you can sort descending on the second element with itemgetter and take the first entry for the maximum:
from operator import itemgetter

zipped.sort(key=itemgetter(1), reverse=True)
print(zipped[0][1])  # the maximum
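If you only need the maximum, itemgetter also works directly as the key for max(..), which avoids the O(n log n) sort. A minimal sketch with made-up (neighbor, score) pairs:

from operator import itemgetter

zipped = [(3, 0.21), (49, 0.639), (7, 0.55)]  # hypothetical data

best = max(zipped, key=itemgetter(1))
print(best)     # (49, 0.639)
print(best[1])  # 0.639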

How to repeat a function in Python (complete beginner - first lines of code ever)

I have the following code which I have to build upon (i.e. it can't be written a different way). I know there are other better ways of achieving the end result, but I want to use this code and then repeat it to make a list.
from random import choice
number_list = range(1,1001) # Creates a list from 1 to 1000
random_from_list = choice(number_list) # Chooses a random number from the list
I want to now repeat the choice function above 100 times, then print that list of 100 random numbers that I have chosen from my list of 1000 numbers. I have read up on "for" loops but I can't see how to apply it here.
If you don't need to build up your list you could just print them one at a time:
for _ in range(100):
    print(choice(number_list))
If you want to build your list first you can use a "list comprehension":
choices = [choice(number_list) for _ in range(100)]
print(choices)
for i in range(100):
    print(choice(number_list))
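Equivalently, with the explicit "for" loop the question asks about, you can build the list yourself with append. A minimal sketch reusing the question's setup:

from random import choice

number_list = range(1, 1001)

choices = []
for _ in range(100):
    choices.append(choice(number_list))
print(choices)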

Comparing large list of hashes to other large list of hashes

I have a list of 100,000 hashes (list a) that I'd like to compare to a list of 15,000,000 hashes (list b).
The hash is taken from list a. If it exists in list b, do nothing. If it does not exist in list b, write it to a file.
Here is the logic I have so far:
import sys

def compareHashes(map, hashdb, out):
    # openFile, writeToFile and percentage are helper functions defined elsewhere
    output_file = openFile(out)
    line_cnt = 0
    total_lines = len(map)
    for m in map:
        if m not in hashdb:
            writeToFile(m + "\r\n", output_file)
        sys.stdout.write("\r" + str(round(percentage(line_cnt, total_lines), 2)) + "%")
        sys.stdout.flush()
        line_cnt = line_cnt + 1
    output_file.close()
It works, but it takes an extremely long time. Can I get some suggestions on how to improve the performance? The box running the script has 60 GB of RAM and 8 cores. I don't think all the cores are being utilized, because Python is not multithreading. Any ideas how I could increase the throughput?
First, you state that you'd like to write to file if an element in list a doesn't exist in list b. This can be represented in code as:
for a in list_a:
    if a not in list_b:
        writeFile(...)
Using the infix operator in on a list is an O(n) computation. Instead, use a set, a hash-based (unordered) collection with O(1) average-time item lookup:
set_b = set(list_b)
for a in list_a:
    if a not in set_b:
        writeFile(...)
You can also find all the items in list_a that aren't in list_b and then only perform actions on those items:
a_disjoint_b = set(list_a) - set(list_b)
for a in list_a:
    if a in a_disjoint_b:
        writeFile(...)
Or, if the order of items in list_a doesn't matter, and all items in list_a are unique:
for a in set(list_a) - set(list_b):
    writeFile(...)
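Putting that together as a runnable sketch (the file name and hash values here are made up for illustration):

list_a = ['h1', 'h2', 'h3', 'h4']   # stands in for the 100,000-hash list
list_b = ['h2', 'h4', 'h5']         # stands in for the 15,000,000-hash list

# Hashes in list_a that do not appear in list_b
missing = set(list_a) - set(list_b)

with open('missing_hashes.txt', 'w') as output_file:
    for h in missing:
        output_file.write(h + "\r\n")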