I got a nice answer to my earlier question about de/serialization, which led me to create a method that either deserializes a defaultdict(list) from a file if it exists, or creates the dictionary itself if the file does not exist.
After implementing some simple code:
try:
    # deserialize - this takes about 6 seconds
    with open('dict.flat') as stream:
        for line in stream:
            vals = line.split()
            lexicon[vals[0]] = vals[1:]
except IOError:
    # create new - this takes about 40 seconds
    for word in lexicon_file:
        word = word.lower()
        for n in (2, 3):                            # letter bi- and trigrams
            for i in range(len(word) - n + 1):
                lexicon[word[i:i + n]].append(word)
    # serialize - about 6 seconds
    with open('dict.flat', 'w') as stream:
        stream.write('\n'.join([' '.join([k] + v) for k, v in lexicon.iteritems()]))
I was a little shocked at the amount of RAM my script takes when deserializing from a file.
(The lexicon_file contains about 620 000 words and the processed defaultdict(list) contains 25 000 keys, each mapping to a list of between 1 and 133 000 strings (average 500, median 20).
Each key is a letter bi/trigram and its values are the words that contain that letter n-gram.)
When the script creates the lexicon anew, the whole process doesn't use much more than 160 MB of RAM - the serialized file itself is a little over 129 MB.
When the script deserializes the lexicon, the amount of RAM taken by python.exe jumps up to 500 MB.
When I try to emulate the method of creating a new lexicon in the deserialization process with
#deserialize one by one - about 15 seconds
with open('dict.flat') as stream:
    for line in stream:
        vals = line.split()
        for item in vals[1:]:
            lexicon[vals[0]].append(item)
The results are exactly the same - except this code snippet runs significantly slower.
What is causing such a drastic difference in memory consumption? My first thought was that since a lot of elements in the resulting lists are exactly the same, Python somehow creates the dictionary more efficiently memory-wise with references - something there is no time for when deserializing and mapping whole lists to keys. But if that is the case, why is this problem not solved by appending the items one by one, exactly like when creating a new lexicon?
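A tiny illustration of that suspicion (Python 2): equal strings parsed from separate lines are distinct objects, so every duplicate word costs its own memory unless it is interned.

a = 'night knight'.split()[1]
b = 'dark knight'.split()[1]
print a == b                   # True  - same value
print a is b                   # False - two separate string objects
print intern(a) is intern(b)   # True  - interning collapses them to one shared object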
edit: This topic was already discussed in this question (how have I missed it?!). Python can be forced to create the dictionary from references by using the intern() function:
#deserialize with intern - 45 seconds
with open('dict.flat') as stream:
    for line in stream:
        vals = line.split()
        for item in vals[1:]:
            lexicon[intern(vals[0])].append(intern(item))
This reduces the amount of RAM taken by the dictionary to the expected value (160 MB), but the trade-off is that the computation time is back to roughly the same value as creating the dict anew, which completely negates the point of serialization.
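One alternative that might be worth measuring (a sketch, not something benchmarked here; the file name dict.pkl is made up): pickle memoizes objects by identity, so the word strings that end up shared between many lists when the lexicon is built anew are written only once and come back as shared references on load.

import cPickle as pickle  # Python 2; plain pickle in Python 3

# serialize the defaultdict directly; shared word strings are stored only once
with open('dict.pkl', 'wb') as stream:
    pickle.dump(lexicon, stream, protocol=pickle.HIGHEST_PROTOCOL)

# deserialize; the shared references (and the smaller footprint) are restored
with open('dict.pkl', 'rb') as stream:
    lexicon = pickle.load(stream)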
Related
I need to speed up (dramatically) the search in a "huge" single-dimension list of unsigned values. The list has 389,114 elements, and I need to perform a check before I add an item to make sure it doesn't already exist.
I do this check 15 million times...
Of course, it takes too much time.
The fastest way I found was:
if this_item in my_list:
    i = my_list.index(this_item)
else:
    my_list.append(this_item)
    i = len(my_list)
...
I am building a dataset from time series logs
One column of these (huge) logs is a text message, which is very redundant
To dramatically speed up the process, I transform this text into an unsigned integer with Adler32() and get a unique numeric value, which is great
Then I store the messages in a PostgreSQL database, with this value as index
For each line of my log files (15 million altogether), I need to update my database of unique messages (389,114 unique messages)
It means that for each line, I need to check if the message ID belongs to my in memory list
I tried "... in list", the same with dictionaries, numpy arrays, transforming the list into a string and using string.search(), SQL queries against the database with a good index...
Nothing better than "if item in list" when the list is loaded into memory (very fast)
For 15 million iterations with some stuff and NO search in the list:
- It takes 8 minutes to generate 2 tables of 15 million lines (features and targets)
- When I activate the code above to check if a message ID already exists, it takes 1 hour 35 min...
How could I optimize this?
Thank you for your help
If your code is, roughly, this:
my_list = []
for this_item in collection:
    if this_item in my_list:
        i = my_list.index(this_item)
    else:
        my_list.append(this_item)
        i = len(my_list)
    ...
Then it will run in O(n^2) time since the in operator for lists is O(n).
You can achieve linear time if you use a dictionary (which is implemented with a hash table) instead:
my_list = []
table = {}
for this_item in collection:
    i = table.get(this_item)
    if i is None:
        i = len(my_list)
        my_list.append(this_item)
        table[this_item] = i
    ...
Of course, if you don't care about processing the items in the original order, you can just do:
for i, this_item in enumerate(set(collection)):
    ...
I have been trying to implement the Stupid Backoff language model (the description is available here, though I believe the details are not relevant to the question).
The thing is, the code works and produces the expected result, but it runs slower than I expected. I figured out that the part slowing everything down is here (and NOT in the training part):
def compute_score(self, sentence):
    length = len(sentence)
    assert length <= self.n
    if length == 1:
        word = tuple(sentence)
        return float(self.ngrams[length][word]) / self.total_words
    else:
        words = tuple(sentence[::-1])
        count = self.ngrams[length][words]
        if count == 0:
            return self.alpha * self.compute_score(sentence[1:])
        else:
            return float(count) / self.ngrams[length - 1][words[:-1]]

def score(self, sentence):
    """ Takes a list of strings as argument and returns the log-probability of the
        sentence using your language model. Use whatever data you computed in train() here.
    """
    output = 0.0
    length = len(sentence)
    for idx in range(length):
        if idx < self.n - 1:
            current_score = self.compute_score(sentence[:idx+1])
        else:
            current_score = self.compute_score(sentence[idx-self.n+1:idx+1])
        output += math.log(current_score)
    return output
self.ngrams is a nested dictionary that has n entries. Each of these entries is a dictionary mapping tuples of the form (word_i, word_i-1, word_i-2, ..., word_i-n) to the count of that combination.
self.alpha is a constant that defines the penalty for backing off to the (n-1)-gram.
self.n is the maximum length of the tuple that the program looks for in the dictionary self.ngrams. It is set to 3 (though setting it to 2 or even 1 doesn't change anything). It's weird, because the unigram and bigram models work just fine in fractions of a second.
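For concreteness, a made-up illustration of the structure described above (invented words and counts), with each key tuple stored most-recent-word first, as in compute_score():

ngrams = {
    1: {('the',): 1024, ('cat',): 37},
    2: {('cat', 'the'): 12},            # the bigram "the cat"
    3: {('sat', 'cat', 'the'): 3},      # the trigram "the cat sat"
}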
The answer that I am looking for is not a refactored version of my own code, but rather a tip which part of it is the most computationally expensive (so that I could figure out myself how to rewrite it and get the most educational profit from solving this problem).
Please, be patient, I am but a beginner (two months into the world of programming). Thanks.
UPD:
I timed the running time with the same data using time.time():
Unigram = 1.9
Bigram = 3.2
Stupid Backoff (n=2) = 15.3
Stupid Backoff (n=3) = 21.6
(These timings are on somewhat bigger data than originally, because of time.time's poor precision.)
If the sentence is very long, most of the code that's actually running is here:
def score(self, sentence):
    for idx in range(len(sentence)): # should use xrange in Python 2!
        self.compute_score(sentence[idx-self.n+1:idx+1])

def compute_score(self, sentence):
    words = tuple(sentence[::-1])
    count = self.ngrams[len(sentence)][words]
    if count == 0:
        self.compute_score(sentence[1:])
    else:
        self.ngrams[len(sentence) - 1][words[:-1]]
That's not meant to be working code--it just removes the unimportant parts.
The flow in the critical path is therefore:
For each word in the sentence:
- Call compute_score() on that word plus the following 2. This creates a new list of length 3. You could avoid that with itertools.islice().
- Construct a 3-tuple with the words reversed. This creates a new tuple. You could avoid that by passing the -1 step argument when making the slice outside this function.
- Look up in self.ngrams, a nested dict, with the first key being a number (might be faster if this level were a list; there are only three keys anyway?), and the second being the tuple just created.
- Recurse with the first word removed, i.e. make a new tuple (sentence[2], sentence[1]), or
- Do another lookup in self.ngrams, implicitly creating another new tuple (words[:-1]).
In summary, I think the biggest problem you have is the repeated and nested creation and destruction of lists and tuples.
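As a toy, self-contained sketch of the reversed-key tip above (the sentence and n here are invented, and this is not the refactor the OP asked for): build each reversed key tuple in one step, instead of slicing the sentence into a new list and then reversing it.

sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
n = 3
for idx in range(len(sentence)):
    start = max(0, idx - n + 1)
    # walk backwards from idx down to start: one tuple, no intermediate list
    key = tuple(sentence[i] for i in range(idx, start - 1, -1))
    print key   # e.g. ('sat', 'cat', 'the') for idx = 2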
I have just started doing my first research project, and I have just begun programming (approximately 2 weeks ago). Excuse me if my questions are naive. I might be using python very inefficiently. I am eager to improve here.
I have experimental data that I want to analyse. My goal is to create a python script that takes the data as input, and that for output gives me graphs, where certain parameters contained in text files (within the experimental data folders) are plotted and fitted to certain equations. This script should be as generalizable as possible so that I can use it for other experiments.
I'm using the Anaconda Python 2.7 distribution, which means I have access to various libraries/modules related to science and mathematics.
I am stuck trying to use for and while loops (for the first time).
The data files are structured like this (I am using regex brackets here):
.../data/B_foo[1-7]/[1-6]/D_foo/E_foo/text.txt
What I want to do is cycle through all 7 top directories and each of their 6 subdirectories (named 1, 2, 3 ... 6). Furthermore, within these 6 subdirectories a text file can be found (always with the same filename, text.txt), which contains the data I want to access.
The 'text.txt' files are structured something like this:
1 91.146 4.571 0.064 1.393 939.134 14.765
2 88.171 5.760 0.454 0.029 25227.999 137.883
3 88.231 4.919 0.232 0.026 34994.013 247.058
4 ... ... ... ... ... ...
The table continues down. Every other row is empty. I want to extract information from 13 rows starting from the 8th line, and I'm only interested in the 2nd, 3rd and 5th columns. I want to put them into lists 'parameter_a' and 'parameter_b' and 'parameter_c', respectively. I want to do this from each of these 'text.txt' files (of which there is a total of 7*6 = 42), and append them to three large lists (each with a total of 7*6*13 = 546 items when everything is done).
This is my attempt:
First, I made a list, 'list_B_foo', containing the seven different 'B_foo' directories (this part of the script is not shown). Then I made this:
parameter_a = []
parameter_b = []
parameter_c = []

j = 7 # The script starts reading 'text.txt' after the j:th line.
k = 35 # The script stops reading 'text.txt' after the k:th line.

x = 0
while x < 7:
    for i in range(1, 7):
        path = str(list_B_foo[x]) + '/%s/D_foo/E_foo/text.txt' % i
        m = open(path, 'r')
        line = m.readlines()
        while j < k:
            line = line[j]
            info = line.split()
            print 'info:', info
            parameter_a.append(float(info[1]))
            parameter_b.append(float(info[2]))
            parameter_c.append(float(info[5]))
            j = j + 2
    x = x + 1

parameter_a_vect = np.array(parameter_a)
parameter_b_vect = np.array(parameter_b)
parameter_c_vect = np.array(parameter_c)

print 'a_vect:', parameter_a_vect
print 'b_vect:', parameter_b_vect
print 'c_vect:', parameter_c_vect
I have tried to fiddle around with the indentation without getting it to work (receiving either syntax errors or indentation errors). Currently, I get this output:
info: ['1', '90.647', '4.349', '0.252', '0.033', '93067.188', '196.142']
info: ['.']
Traceback (most recent call last):
File "script.py", line 104, in <module>
parameter_a.append(float(info[1]))
IndexError: list index out of range
I don't understand why I get the "list index out of range" message. If anyone knows why this is the case, I would be happy to hear you out.
How do I solve this problem? Is my approach completely wrong?
EDIT: I went for a pure while-loop solution, taking RebelWithoutAPulse and CamJohnson26's suggestions into account. This is how I solved it:
parameter_a = []
parameter_b = []
parameter_c = []

k = 35 # The script stops reading 'text.txt' after the k:th line.

x = 0
while x < 7:
    y = 1
    while y < 7:
        j = 7
        path = str(list_B_foo[x]) + '/%s/pdata/999/dcon2dpeaks.txt' % (y)
        m = open(path, 'r')
        lines = m.readlines()
        while j < k:
            line = lines[j]
            info = line.split()
            parameter_a.append(float(info[1]))
            parameter_b.append(float(info[2]))
            parameter_c.append(float(info[5]))
            j = j + 2
        y = y + 1
    x = x + 1
Meta: I am not sure if I should give the answer to the person who answered the quickest and helped me finish my task, or to the person whose answer I learned the most from. I am sure this is a common issue that I can find an answer to by reading the rules or going to Stack Exchange Meta. Until I've read up on the recommendations, I will hold off on marking the question as answered by either of you two.
Welcome to stack overflow!
The error is due to a name collision that you have inadvertently created. Note the output before the exception occurs:
info: ['1', '90.647', '4.349', '0.252', '0.033', '93067.188', '196.142']
info: ['.']
Traceback (most recent call last):
...
The info[1] lookup cannot work - there is no element at index 1 in a list containing only '.' - in Python, list indices start at 0.
This happens in your nested loop,
while j < k:
where you redefine the very line variable you created previously:
line = m.readlines()
while j < k:
    line = line[j]
    info = line.split()
    ...
So what happens is: on the first run of the loop, you read the lines of the file into the line list, then you take one line from that list, assign it to line again, and continue with the loop. At this point line contains a string.
On the next run, reading from line via the specified index returns the character at the j-th position of that string, and the code malfunctions.
You could fix this with different naming.
P.S. I would suggest using the with ... as ... syntax while working with files; it is briefly described here. This is called a context manager, and it takes care of opening and closing the files for you.
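For example, a minimal sketch using the variable names from the question:

with open(path, 'r') as m:
    lines = m.readlines()
# the file is closed automatically here, even if an exception was raised inside the block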
P.P.S. I would also suggest reading the naming conventions.
Looks like you are overwriting the line list with one line of the file. You call line = m.readlines(), which sets line equal to a list of lines. You then set line = line[j], so now the line variable is no longer a list, it's a string equal to
1 91.146 4.571 0.064 1.393 939.134 14.765
This pass works fine, but the next pass through the loop will treat line as a sequence of characters, index into it, and get just a period. That explains why the info variable only has one element on the second pass through the loop.
To solve this, just use 2 line variables instead of one. Call one lines and the other line.
lines = m.readlines()
while j < k:
    line = lines[j]
    info = line.split()
There may be other errors too, but that should get you started.
I am connecting to an API that sends streaming data in string format. This stream runs for about 9 hours per day. Each string that is sent (about 1-5 forty-character strings per second) needs to be parsed and certain values need to be extracted. Those values are then stored in a list, which is parsed and read at various intervals to create another list, which in turn needs to be parsed. What is the best way to accomplish this with multiprocessing and queues? Is there a better way?
from multiprocessing import Process
import requests

data_stream = requests.get("http://myDataStreamUrl", stream=True)
lines = data_stream.iter_lines()
first_line = next(lines)

for line in lines:
    # find what I need and append to first_list
    pass

def parse_first_list_and_create_second_list():
    # find what I need in first_list to create second list
    pass

def parse_second_list():
    # find what I need
    pass
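A hedged sketch of one possible layout (not a vetted design, and the parse logic is a placeholder): one process reads the stream and pushes extracted values onto a Queue, while a second process drains the queue and maintains the lists.

from multiprocessing import Process, Queue
import requests

def extract_values(line):
    # placeholder parser: in the real script this would pull out the needed fields
    return line.split()

def reader(q):
    stream = requests.get("http://myDataStreamUrl", stream=True)
    for line in stream.iter_lines():
        if line:
            q.put(extract_values(line))

def consumer(q):
    first_list = []
    while True:
        first_list.append(q.get())   # blocks until the reader sends a value
        # ... at intervals, build and parse the second list from first_list ...

if __name__ == '__main__':
    q = Queue()
    Process(target=reader, args=(q,)).start()
    Process(target=consumer, args=(q,)).start()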
I've been reading stackoverflow questions all morning, trying different approaches with no headway.
I'm trying to automate the process of reading data from 318 qdp files for plotting onto a graph (plus a few other things). The reason I'm doing this and not using qdp is because it's not helpful for what I'm trying to do.
The qdp file is just like any other .txt file, except with \n after each line and \t between each data entry, so reading from it in a Pythonic way should be straightforward. However, the file format is giving me a headache.
A typical file has the following format:
Header - 8 lines
blank line
qdp code line
datatype header     \
data column header   |  Data Group 1
data - 6 columns    /
qdp code line
datatype header     \
data column header   |  Data Group 2
data - 6 columns    /
This seems straightforward enough; however, each file has a varying number of data groups (between 1 and 3), of which I want to extract only one. So sometimes the data I want is the first group, sometimes it's the second, and sometimes there isn't a data group after the one I want, and thus the extra qdp code line isn't there.
Each line (except the data) has a varying number of columns, so np.genfromtxt doesn't work. I've tried telling it to ignore every line until it finds the specific datatype header which heads the data I want and then extract from there, but I can't seem to figure out how to do that. I've tried reading the file, assigning each line an index and then going back to find the index of the datatype header and going from there, but with no success either.
Like my previous questions, it seems like such a trivial issue, and yet I can't figure it out.
Appreciate the help.
So after more reading and trying all sorts of solutions, I've come up with a rather inelegant solution.
datatemp = []
data_start = 0
data_end = 0
EOF = 0
index = 0
with open(file, "r") as f:
    for line in f:
        temp = line.strip()
        datatemp.append(temp)
        index += 1
        if temp == "datatype header":          # stands for the actual header of the group I want
            data_start = index + 2
        elif temp == "next datatype header":   # header of the following group, if any
            data_end = index - 3
EOF = index   # line count of the whole file
if data_end == 0:
    data_end = EOF
Thus, when there is a data group after the one I want to extract, it uses that group's header to mark the end of the data to be extracted, and when the group I want is the last one in the file, it uses the EOF marker instead.
After this I split datatemp into 6 columns, assigning each to a list. Finally I can manipulate the data I wanted, and the program runs through all 318 files. Yay!
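For what it's worth, a more compact sketch of the same idea (the header text is a placeholder, and it assumes the data rows are the only lines after the wanted header with exactly 6 numeric fields): skip everything until the wanted group's header, then collect the 6-column numeric rows until the next group, or the end of the file, is reached.

def read_wanted_group(path, wanted_header="datatype header"):
    rows = []
    in_group = False
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if not in_group:
                in_group = (stripped == wanted_header)
                continue
            fields = stripped.split()
            try:
                values = [float(x) for x in fields]
            except ValueError:
                values = None
            if values is not None and len(values) == 6:
                rows.append(values)
            elif rows:          # first non-data line after the rows: next group begins
                break
    return rows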