Why does the output order of a dictionary change with the number of entries in the dictionary? - python-2.7

I have to use Python 2.7.3 for this script and upgrading to a newer version is not possible.
If I put 4 entries (key:value) into a dictionary and print it, everything works as expected.
The entries are printed in order, from the first to the last:
`parameters = {"ifname": None, "speed": "auto", "bcast_unit": "kps", "max_bcast": "300"}
print(parameters)
C:\Python27\python.exe C:/myscripts/Python/Projects/comware/get_arguments.py
{'ifname': None, 'speed': 'auto', 'bcast_unit': 'kps', 'max_bcast': '300'}`
==
If I put more than 4 entries (key:value) into this dictionary, things get strange:
the order changes, and the last key:value pair is printed first.
`parameters = {"ifname": None, "speed": "auto", "bcast_unit": "kps", "max_bcast": "300"}
print(parameters)
C:\Python27\python.exe C:/myscripts/Python/Projects/comware/get_arguments.py
{'ifname': None, 'speed': 'auto', 'bcast_unit': 'kps', 'max_bcast': '300'}`
What am I doing wrong here? Or is it a limitation of this old Python version?
I used a dictionary with 4 entries and printed it; the output had the expected order of entries.
With more than 4 entries, Python 2.7.3 prints the elements in a different order.
Under Python 3.9.2, a dictionary with more than 4 entries is printed in the correct, expected order.
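For what it's worth, this is not a bug in the script: before CPython 3.7 (3.6 as an implementation detail), a plain dict does not preserve insertion order; the printed order depends on the hash-table layout, and growing the table (for example by adding a fifth entry) can shuffle it. On Python 2.7, collections.OrderedDict preserves insertion order. A minimal sketch; the fifth key "extra" is made up for illustration:

```python
from collections import OrderedDict

# On Python 2.7 a plain dict orders entries by hash-table layout, so the
# printed order can change when the table grows. OrderedDict remembers
# insertion order on every Python version.
parameters = OrderedDict()
parameters["ifname"] = None
parameters["speed"] = "auto"
parameters["bcast_unit"] = "kps"
parameters["max_bcast"] = "300"
parameters["extra"] = "value"  # hypothetical fifth entry; order is kept

print(list(parameters.keys()))
# ['ifname', 'speed', 'bcast_unit', 'max_bcast', 'extra']
```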

Related

How to find the log line with the most recent time stamp?

Currently I am looping through a directory to find/store/display the latest log line containing the version number information. I am finding the log lines with version numbers by using Regex, and I am trying to find the log line with the latest time stamp by comparing them with parse_version.
For example, the log lines in the files of my folder look like this:
2018-05-08T15:47:27.752Z 00000000-0000-0000-0000-000000000000 > LVL:2 RC: version: 2.12.1.10452
2018-05-08T21:27:14.2049217Z> <INFO >: Version: 2.10.0.23960
2018-05-08T21:18:53.0428568Z> <INFO >: Version: 2.12.1.26051
These are just a few examples of the thousands of log lines in the files of my folder, and I am trying to find the single latest log line with information regarding the version number. In this case, I would hope to select the second line, even though it has a lower version number, because it has a more recent time stamp.
Below is my code, I didn't include the code on looping through for folders for the sake of simplicity.
for line in f:  # For simplicity's sake, I won't include my code above this line because it's just for looping through the folder to find the log lines
    # Strip out \x00 from the read content, in case it's encoded differently
    line = line.replace('\x00', '')
    # Regular expressions for finding the log lines in the folder
    RE2 = r"^.+INFO.+Version.+"
    RE3 = r"^.+RC: version"
    previous_version_line = '0'
    version_to_display = '00'
    # Find the general matches, and get the version line with the latest time stamp
    pattern2 = re.compile('(' + RE2 + '|' + RE3 + ')', re.IGNORECASE)
    for match2 in pattern2.finditer(line):
        if parse_version(line) > parse_version(previous_version_line):
            version_to_display = line
            previous_version_line = line
        else:
            version_to_display = previous_version_line
    print(version_to_display)
Right now the problem seems to be with the parse_version comparison: although the log lines found through the regex should compare higher than '0', the if statement always evaluates to false and I just print a bunch of 0's.
Thanks in advance!
Find every row with 'version' in it, sort by time, and print the latest time along with the log message:
data = """
2018-05-08T15:47:27.752Z 00000000-0000-0000-0000-000000000000 > LVL:2 RC: version: 2.12.1.10452
2018-05-08T21:27:14.2049217Z> <INFO >: Version: 2.10.0.23960
2018-05-08T21:18:53.0428568Z> <INFO >: Version: 2.12.1.26051
"""

import re
from datetime import datetime

data_new = []
for (d, log) in re.findall(r'([\d\-:T\.]+Z)>?\s+(.*)', data):
    if not re.search('version', log, flags=re.I):
        continue
    parts = d.split('.')
    if len(parts[1]) >= 8:
        d = parts[0] + '.' + parts[1][:6] + 'Z'
    data_new.append((datetime.strptime(d, '%Y-%m-%dT%H:%M:%S.%fZ'), log))

data_new = sorted(data_new, reverse=True)
if data_new:
    t = data_new[0][0].strftime('%Y-%m-%dT%H:%M:%S.%fZ')
    print(f'Latest version to display:\ntime=[{t}] msg=[{data_new[0][1]}]')
Prints:
Latest version to display:
time=[2018-05-08T21:27:14.204921Z] msg=[<INFO >: Version: 2.10.0.23960]
Caveat:
The Python datetime class accepts microseconds only up to 6 digits (so this program truncates them).

Why do these two seemingly equivalent scripts take significantly different times to run?

I was doing some experiments in the interpreter and at some point I had to load a huge line-separated file and concatenate its lines together. I first tried this:
strand = ['', '']
for i in range(1, 3):
    for line in open('chr2_strand_' + str(i) + '.fa').readlines()[1:]:
        strand[i] += line.strip()
Which was still running when I killed it after a few minutes. Next I tried this:
strand = ['', '']
for i in range(1, 3):
    s = ''
    for line in open('chr2_strand_' + str(i) + '.fa').readlines()[1:]:
        s += line.strip()
    strand[i - 1] = s
Which took less than a minute to run.
There are two files being read in the script; both are nearly 300 MB, with each line having 100 characters.
The difference is in directly updating the strand array rather than assigning the final concatenated result to it. I don't understand why this should affect the runtime, as the array's size remains constant (two references to strings), which means it doesn't need to be reallocated.
Python version is Python 2.7.12 (default, Dec 4 2017, 14:50:18).
Any ideas?
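A likely explanation, for what it's worth: CPython special-cases `s += line` when `s` is a local name holding the only reference to the string, resizing it in place; `strand[i] += line` keeps a second reference alive inside the list, so each `+=` copies the whole string, making the loop quadratic. The usual idiom sidesteps the issue entirely; a sketch:

```python
# Collect the stripped lines and join them once at the end; this is
# O(total length) regardless of where the result is stored afterwards.
def concat_lines(lines):
    return ''.join(line.strip() for line in lines)

strand = [concat_lines(["ACGT\n", "TTAA\n"]),
          concat_lines(["GGCC\n", "AATT\n"])]
print(strand)  # ['ACGTTTAA', 'GGCCAATT']
```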

Why does this Python script not work?

Update 1: the last line of code, sorted_xlist = sorted(xlist).extend(sorted(words_cp)), should be changed to:
sorted_xlist.extend(sorted(xlist))
sorted_xlist.extend(sorted(words_cp))
Update 1: the code is updated to solve the problem of the changing length of the words list.
This exercise on list functions is from Google's Python Introduction course. I don't know why the code doesn't work in Python 2.7. The goal of the code is explained in the comment block.
# B. front_x
# Given a list of strings, return a list with the strings
# in sorted order, except group all the strings that begin with 'x' first.
# e.g. ['mix', 'xyz', 'apple', 'xanadu', 'aardvark'] yields
# ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
# Hint: this can be done by making 2 lists and sorting each of them
# before combining them.
def front_x(words):
    words_cp = []
    words_cp.extend(words)
    xlist = []
    sorted_xlist = []
    for i in range(0, len(words)):
        if words[i][0] == 'x':
            xlist.append(words[i])
            words_cp.remove(words[i])
    print sorted(words_cp)  # For debugging
    print sorted(xlist)  # For debugging
    sorted_xlist = sorted(xlist).extend(sorted(words_cp))
    return sorted_xlist
Update 1: Now error message is gone.
front_x
['axx', 'bbb', 'ccc']
['xaa', 'xzz']
X got: None expected: ['xaa', 'xzz', 'axx', 'bbb', 'ccc']
['aaa', 'bbb', 'ccc']
['xaa', 'xcc']
X got: None expected: ['xaa', 'xcc', 'aaa', 'bbb', 'ccc']
['aardvark', 'apple', 'mix']
['xanadu', 'xyz']
X got: None expected: ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
The splitting of the original list works fine, but the merging doesn't work.
You're iterating over a sequence while you're changing its length.
Imagine if you start off with an array
arr = ['a','b','c','d','e']
When you remove the first two items from it, you now have:
arr = ['c','d','e']
But you're still iterating over the length of the original array. Eventually you get to i > 2 in my example above, which raises an IndexError.
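As a sketch, the whole function can avoid both the remove-while-iterating problem and the extend-returns-None problem by partitioning the input with two comprehensions instead of mutating a copy:

```python
def front_x(words):
    # Partition rather than removing while iterating: build the two
    # groups separately, sort each, then concatenate.
    x_words = sorted(w for w in words if w.startswith('x'))
    others = sorted(w for w in words if not w.startswith('x'))
    return x_words + others

print(front_x(['mix', 'xyz', 'apple', 'xanadu', 'aardvark']))
# ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
```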

for information retrieval course using python, accessing given tf-idf weight

I am working on this Python program where I have to access the given tf-idf weights.
This is what I am trying to achieve with my code: return a dict mapping doc_id to length, computed as sqrt(sum(w_i**2)), where w_i is the tf-idf weight for each term in the document.
E.g., in the sample index below, document 0 has two terms: 'a' (with tf-idf weight 3) and 'b' (with tf-idf weight 4). Its length is therefore 5 = sqrt(9 + 16).
>>> lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0,4]]})
>>> lengths[0]
5.0
The code I have is this:
templist = []
for iter in index.values():
    templist.append(iter)
d = defaultdict(list)
for i, l in templist[1]:
    d[i].append(l)
lent = defaultdict()
for m in d:
    lo = math.sqrt(sum(lent[m]**2))
return lo
So, if I'm understanding you correctly, we have to transform the input dictionary:
ind = {'a':[ [1,3] ], 'b': [ [1,4 ] ] }
To the output dictionary:
{1:5}
Where the 5 is calculated as the Euclidean norm of the value portion of the input dictionary (the vector [3,4] in this case). Correct?
Given that information, the answer becomes a bit more straightforward:
from math import sqrt

def calculate_length(ind):
    # First, transform the dictionary into a list of [doc_id, tl_idf] pairs: [[doc_id_1, tl_idf_1], ...]
    data = [entry[0] for entry in ind.itervalues()]  # use just ind.values() in Python 3.x
    # Next, split that list into two: one for doc_ids, one for tl_idfs
    doc_ids, tl_idfs = zip(*data)
    # We just assume that all the doc_ids are the same; you could check that here if you wanted
    doc_id = doc_ids[0]
    # Next, calculate the length as per our formula
    length = sqrt(sum(t**2 for t in tl_idfs))
    # Finally, return the output dictionary
    return {doc_id: length}
Example:
>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1:5.0}
There are a couple of places in here where you could optimize this to remove the intermediary lists (this method can be two lines of operation and a return), but I'll leave that to you to find out since this is a homework assignment. I also hope you take the time to actually understand what this code does, rather than just copying it wholesale.
Also note that this answer makes the very large assumption that all doc_id values are the same, and that there will only ever be a single doc_id, tl_idf list at each key in the dictionary! If that's not true, then your transform becomes more complicated. But you did not provide sample input nor a textual explanation indicating that's the case (though, based on the data structure, I'd think it quite likely).
Update
In fact, it's really bothering me because I definitely think that's the case. Here is a version that solves the more complex case:
from itertools import chain
from collections import defaultdict
from math import sqrt

def calculate_length(ind):
    # We want to transform this first into a dict of {doc_id: [tl_idf_a, ...]}
    # First we transform it into a generator of ([doc_id, tl_idf], ...)
    tf_gen = chain.from_iterable(ind.itervalues())
    # which we then use to generate our transformed dictionary
    tf_dict = defaultdict(list)
    for doc_id, tl_idf in tf_gen:
        tf_dict[doc_id].append(tl_idf)
    # Now we proceed mostly as before, but we can do it in one line
    return dict((doc_id, sqrt(sum(t**2 for t in tl_idfs)))
                for doc_id, tl_idfs in tf_dict.iteritems())
Example use:
>>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1: 5.0}
>>> calculate_length({'a':[ [1,3],[2,3] ], 'b': [ [1,4 ], [2,1] ] })
{1: 5.0, 2: 3.1622776601683795}

Python3 max function using value in defaultdict as key not working

Suppose model is a defaultdict, and num is a set
>>> model
>>> defaultdict(<function <lambda> at 0x11076f758>, {1: 3, 2: 2, 4: 1})
>>> num
>>> {1, 2, 3, 4, 5, 6}
I want to get the item from num that has the maximum value in model, and the following code works fine in Python 2:
>>> # python 2.7.6
>>> max(num, key=model.get)
>>> 1
But it doesn't work in Python3,
>>> # python 3.3.3
>>> max(num, key=model.get)
>>> TypeError: unorderable types: NoneType() > int()
I can use max(num, key=lambda k: model[k]) to make it work in Python 3, but if an item in num is not in model, it will be added, which modifies model.
I am wondering why model.get doesn't work in Python 3, and how I can do this without modifying model.
Use key=lambda x: model.get(x, 0).
defaultdict.get by default returns None if the item is not found. Python 2 allows ordered comparisons (like less-than and greater-than) on different types, but Python 3 doesn't. When Python 3 tries to find the max, it tries to see if the value for one key is greater than another. If one of the values is None, it fails with the error you saw. The solution is to make your key function return zero instead of None for missing values.
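A minimal demonstration of that fix, using a plain dict in place of the original defaultdict:

```python
model = {1: 3, 2: 2, 4: 1}
num = {1, 2, 3, 4, 5, 6}

# Missing keys now compare as 0 instead of None, and nothing is ever
# inserted into model, because get() does not trigger the default factory.
best = max(num, key=lambda x: model.get(x, 0))
print(best)  # 1
```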