Slicing a Python OrderedDict - python-2.7

In my code I frequently need to take a subset range of keys+values from a Python OrderedDict (from collections package). Slicing doesn't work (throws TypeError: unhashable type) and the alternative, iterating, is cumbersome:
from collections import OrderedDict
o = OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
# want to do:
# x = o[1:3]
# need to do:
x = OrderedDict()
for idx, key in enumerate(o):
    if 1 <= idx < 3:
        x[key] = o[key]
Is there a better way to get this done?

You can use the itertools.islice function, which takes an iterable and yields only its first stop elements (or, given both a start and a stop, the elements in between). This helps because iterators don't support the usual slice syntax, and you don't need to build the whole list of items from the OrderedDict first.
from collections import OrderedDict
from itertools import islice
o = OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
sliced = islice(o.items(), 3) # o.iteritems() in Python 2.7 is o.items() in Python 3
sliced_o = OrderedDict(sliced)
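For readers on Python 3, here is a minimal sketch of the same approach that reproduces the o[1:3] slice from the question. The start and stop values 1 and 3 come from the question; the rest follows the answer above.

```python
from collections import OrderedDict
from itertools import islice

o = OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])

# islice(iterable, start, stop) skips `start` items, then yields items
# up to (but not including) position `stop`, without building a list.
sliced_o = OrderedDict(islice(o.items(), 1, 3))
print(sliced_o)
```

Note that islice does not accept negative indices, so this only covers slices counted from the front.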

The ordered dict in the standard library doesn't provide that functionality, even though libraries that do have it (and that provide essentially a superset of OrderedDict) existed for a few years before collections.OrderedDict: voidspace odict and ruamel.ordereddict (I am the author of the latter package, which is a reimplementation of odict in C):
from odict import OrderedDict as odict
p = odict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
print p[1:3]
In ruamel.ordereddict you can relax the ordered input requirement (AFAIK you cannot ask a derivative of dict whether its keys are ordered; it would be a good addition to ruamel.ordereddict to recognise collections.OrderedDict instances):
from ruamel.ordereddict import ordereddict
q = ordereddict(o, relax=True)
print q[1:3]
r = odict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
print r[1:3]
If you want (or have to) stay within the standard library you can subclass collections.OrderedDict and override __getitem__:
class SlicableOrderedDict(OrderedDict):
    def __getitem__(self, k):
        if not isinstance(k, slice):
            return OrderedDict.__getitem__(self, k)
        x = SlicableOrderedDict()
        for idx, key in enumerate(self.keys()):
            if k.start <= idx < k.stop:
                x[key] = self[key]
        return x
s = SlicableOrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
print s[1:3]
Of course, you could use Martijn's or Jimmy's shorter versions to build the slice that needs returning:
from itertools import islice
class SlicableOrderedDict(OrderedDict):
    def __getitem__(self, k):
        if not isinstance(k, slice):
            return OrderedDict.__getitem__(self, k)
        return SlicableOrderedDict(islice(self.viewitems(), k.start, k.stop))
t = SlicableOrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
print t[1:3]
Or, if you just want to smarten up all existing OrderedDicts without subclassing:
def get_item(self, k):
    if not isinstance(k, slice):
        return OrderedDict._old__getitem__(self, k)
    return OrderedDict(islice(self.viewitems(), k.start, k.stop))
OrderedDict._old__getitem__ = OrderedDict.__getitem__
OrderedDict.__getitem__ = get_item
u = OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
print u[1:3]
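The subclassing idea above can also be sketched for Python 3. This is only an illustration under a few assumptions: the class name SliceableOrderedDict is made up for the example, and it slices a list of the keys (rather than using islice) so that open-ended and negative slices work too, at the cost of building that key list.

```python
from collections import OrderedDict


class SliceableOrderedDict(OrderedDict):
    """Hypothetical OrderedDict subclass that accepts slices by position."""

    def __getitem__(self, k):
        if not isinstance(k, slice):
            # Ordinary key lookup, unchanged.
            return OrderedDict.__getitem__(self, k)
        # Slicing a list of keys handles None, negative bounds and steps.
        keys = list(self)[k]
        return type(self)(
            (key, OrderedDict.__getitem__(self, key)) for key in keys
        )


s = SliceableOrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
middle = s[1:3]
last = s[-1:]
print(middle, last)
```

Building the key list costs O(n) per slice; for very large dictionaries where only a leading range is needed, the islice variant shown earlier avoids that.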

In Python 2, you can slice the keys:
x.keys()[1:3]
and to support both Python 2 and Python 3, you'd convert to a list first:
list(x)[1:3]
The Python 2 OrderedDict.keys() implementation does exactly that.
In both cases you are given a list of keys in correct order. If creating a whole list first is an issue, you can use itertools.islice() and convert the iterable it produces to a list:
from itertools import islice
list(islice(x, 1, 3))
All of the above also can be applied to the items; use dict.viewitems() in Python 2 to get the same iteration behaviour as Python 3 dict.items() provides. You can pass the islice() object straight to another OrderedDict() in this case:
OrderedDict(islice(x.items(), 1, 3)) # x.viewitems() in Python 2

I was able to slice an OrderedDict using the following:
list(myordereddict.values())[start:stop]
I didn't test the performance.

I wanted to slice using a key, since I didn't know the index in advance:
o = OrderedDict(zip(list('abcdefghijklmnopqrstuvwxyz'),range(1,27)))
stop = o.keys().index('e') # -> 4
OrderedDict(islice(o.items(),stop)) # -> OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
or to slice from start to stop:
start = o.keys().index('c') # -> 2
stop = o.keys().index('e') # -> 4
OrderedDict(islice(o.iteritems(),start,stop)) # -> OrderedDict([('c', 3), ('d', 4)])
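In Python 3, o.keys() returns a view without an index() method, so the lookup above needs a list of the keys first. A minimal sketch, assuming Python 3; the helper name slice_by_key is made up for this example:

```python
from collections import OrderedDict
from itertools import islice


def slice_by_key(od, start_key, stop_key):
    """Return the items from start_key up to (not including) stop_key."""
    keys = list(od)  # key views have no .index() in Python 3
    start = keys.index(start_key)
    stop = keys.index(stop_key)
    return OrderedDict(islice(od.items(), start, stop))


o = OrderedDict(zip('abcdefghijklmnopqrstuvwxyz', range(1, 27)))
result = slice_by_key(o, 'c', 'e')
print(result)
```

As with the original, list.index raises ValueError if either key is missing.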

def slice_odict(odict, start=None, end=None):
    return OrderedDict([
        (k,v) for (k,v) in odict.items()
        if k in list(odict.keys())[start:end]
    ])
This allows for:
>>> x = OrderedDict([('a',1), ('b',2), ('c',3), ('d',4)])
>>> slice_odict(x, start=-1)
OrderedDict([('d', 4)])
>>> slice_odict(x, end=-1)
OrderedDict([('a', 1), ('b', 2), ('c', 3)])
>>> slice_odict(x, start=1, end=3)
OrderedDict([('b', 2), ('c', 3)])
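The function above rebuilds and slices the key list once per item, which gets slow for large dictionaries. A possible variant that slices the item list once instead is sketched below; the name slice_odict_once is an assumption for this example, and the behaviour is intended to match the calls shown above.

```python
from collections import OrderedDict


def slice_odict_once(odict, start=None, end=None):
    # Build the item list a single time, then slice it directly.
    # Negative and open-ended bounds work as they do for lists.
    return OrderedDict(list(odict.items())[start:end])


x = OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
tail = slice_odict_once(x, start=-1)
head = slice_odict_once(x, end=-1)
middle = slice_odict_once(x, start=1, end=3)
print(tail, head, middle)
```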

x = OrderedDict(o.items()[1:3])

Related

Check if all first elements in tuple list satisfies condition

I want to check if a list is a subset of another, based on the first element in its tuple.
subset(List(('a', 1), ('b', 2), ('c', 3)), List(('a', 4), ('b', 5))) // True
subset(List(('a', 1), ('b', 2), ('c', 3)), List(('a', 4), ('b', 5), ('f', 6))) // False
The size of the lists does not have to be the same. I've tried something like this, but with no luck
x.forall((char: Char, num: Int) => {y.contains((_,num))})
You can map over the input lists to retain only the first element of each tuple, then use set functionality to check whether one is a subset of the other:
def subset(a: List[(Char, Int)], b: List[(Char, Int)]): Boolean = {
  val a_ = a.map(_._1).toSet
  val b_ = b.map(_._1).toSet
  b_.subsetOf(a_)
}
Update: Simplified based on suggestion from Luis

unnest list in pyspark

I am trying to use combineByKey to find the median per key for my assignment (using combineByKey is a requirement of the assignment) and I'm planning to use the following function to return (k, v) pairs where v = a list of all values associated with the same key. After that, I plan to sort the values and then find the median.
data = sc.parallelize([('A',2), ('A',4), ('A',9), ('A',3), ('B',10), ('B',20)])
rdd = data.combineByKey(lambda value: value, lambda c, v: median1(c,v), lambda c1, c2: median2(c1,c2))
def median1 (c,v):
    list = [c]
    list.append(v)
    return list
def median2 (c1,c2):
    list2 = [c1]
    list2.append(c2)
    return list2
However, my code gives output like this:
[('A', [[2, [4, 9]], 3]), ('B', [10, 20])]
where the value is a nested list. Is there any way that I can unnest the values in pyspark to get
[('A', [2, 4, 9, 3]), ('B', [10, 20])]
Or are there other ways I can find the median per key using combineByKey? Thanks!
It's way easier to use collect_list on a dataframe column.
from pyspark.sql.functions import collect_list
df = data.toDF(['key', 'values'])  # data is the original RDD of (key, value) pairs
key_lists = df.groupBy('key').agg(collect_list('values').alias('value_list'))
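You just didn't make a good combiner out of the value: createCombiner has to wrap the first value in a list, and both merge functions have to return a flat list.
Here is your answer:
data = sc.parallelize([('A',2), ('A',4), ('A',9), ('A',3), ('B',10), ('B',20)])
def createCombiner(value):
    return [value]
def mergeValue(c, value):
    # list.append() returns None, so append first and then return the list
    c.append(value)
    return c
def mergeCombiners(c1, c2):
    return c1 + c2
rdd = data.combineByKey(createCombiner, mergeValue, mergeCombiners)
rdd.collect()  # the order of values within each list depends on partitioning
[('A', [9, 4, 2, 3]), ('B', [10, 20])]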
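Once the values are collected into flat lists, the remaining median step from the question could look like the following plain-Python sketch. It works on the collected output shown above rather than on a live RDD; in Spark the same function could presumably be applied per key with mapValues. The variable names are assumptions for this example.

```python
from statistics import median

# Collected output of the combineByKey step shown above.
collected = [('A', [9, 4, 2, 3]), ('B', [10, 20])]

# median() sorts internally and averages the two middle values
# when a list has an even number of elements.
medians = [(key, median(values)) for key, values in collected]
print(medians)
```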

Can somebody give me an example for the zip() function in Python?

Python's documentation says the following about the zip function:
"The left-to-right evaluation order of the iterables is guaranteed. This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n)."
I have difficulty understanding the zip(*[iter(s)]*n) idiom. Can anybody give me an example of when we should use that idiom?
Thank you very much!
I don't know what documentation you're using, but this version of the zip() documentation has this example:
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> x2, y2 = zip(*zipped)
>>> x == list(x2) and y == list(y2)
True
It pairs up the elements of two lists, in order, and it also has an "unzip" feature.
And since you asked, here's a slightly more understandable example:
>>> friends = ["Amy", "Bob", "Cathy"]
>>> orders = ["Burger", "Pizza", "Hot dog"]
>>> friend_order_pairs = zip(friends, orders)
>>> friend_order_pairs
[('Amy', 'Burger'), ('Bob', 'Pizza'), ('Cathy', 'Hot dog')]
It's 2020, but let me leave this here for reference.
The zip(*[iter(s)]*n) idiom is used to split a flat list into chunks.
For example:
>>> mylist = [1, 2, 3, 'a', 'b', 'c', 'first', 'second', 'third']
>>> list(zip(*[iter(mylist)]*3))
[(1, 2, 3), ('a', 'b', 'c'), ('first', 'second', 'third')]
The idiom is analyzed here.
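A short sketch of why the idiom works, assuming Python 3: [iter(s)]*n is a list holding the same iterator n times, so each tuple that zip builds pulls n consecutive items from that one iterator. The variable names below are made up for the example.

```python
from itertools import zip_longest

s = [1, 2, 3, 4, 5, 6, 7]

# Spelled out: three references to ONE shared iterator.
it = iter(s)
chunks = list(zip(it, it, it))

# The idiom itself gives the same result.
idiom_chunks = list(zip(*[iter(s)] * 3))

# zip stops at the first exhausted argument, so the leftover 7 is dropped.
# zip_longest keeps the incomplete group and pads it with None instead.
padded = list(zip_longest(*[iter(s)] * 3))
print(chunks, padded)
```

If the list were [iter(s), iter(s), iter(s)] instead, each iterator would be independent and zip would produce (1, 1, 1), (2, 2, 2), and so on.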
zip() is for sticking two or more lists together.
names=['bob','tim','larry']
ages=[15,36,50]
zip(names,ages)
Out: [('bob', 15), ('tim', 36), ('larry', 50)]
I use it to create dictionaries when I have separate lists of keys and values:
>>> keys = ('pi', 'c', 'e')
>>> values = (3.14, 3*10**8, 1.6*10**-19)
>>> dict(zip(keys, values))
{'c': 300000000, 'pi': 3.14, 'e': 1.6000000000000002e-19}
Here is how to iterate over two lists and their indices using enumerate() together with zip():
alist = ['a1', 'a2', 'a3']
blist = ['b1', 'b2', 'b3']
for i, (a, b) in enumerate(zip(alist, blist)):
    print i, a, b
zip() basically combines two or more iterables into a list of tuples that is as long as the shortest input:
>>> alist = ['a1', 'a2', 'a3']
>>> blist = ['b1', 'b2', 'b3']
>>>
>>> zip(alist, blist)
[('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
>>>
Use izip instead.
When working with very large data sets, you can use izip, which returns a lazy iterator and only computes results when they are requested. That keeps memory usage low and often performs much better. I usually use the iterator-based variants of Python modules when possible.
Imagine an example like this:
from itertools import islice,izip
w = xrange(9000000000000000000)
x = xrange(2000000000000000000)
y = xrange(9000000000000000000)
z = xrange(9000000000000000000)
# The following only returns a generator that holds an iterator for the first 100 items
# without loading that large mess of numbers into memory
first_100_items_generator = islice(izip(w,x,y,z), 100)
# Iterate through the generator and return only what you need - first 100 items
first_100_items = list(first_100_items_generator)
print(first_100_items)
Output:
[ (0, 0, 0, 0),
(1, 1, 1, 1),
(2, 2, 2, 2),
(3, 3, 3, 3),
(4, 4, 4, 4),
(5, 5, 5, 5),
(6, 6, 6, 6),
(7, 7, 7, 7),
(8, 8, 8, 8),
(9, 9, 9, 9),
(10, 10, 10, 10),
(11, 11, 11, 11)
...
...
]
So here I have four very large ranges of numbers; I used izip to zip the values and then islice to pick out the first 100 items.
The nice thing about xrange, izip and islice is that they are all lazy: nothing is computed until the final list() call consumes them.
It's a bit of a digression into generators but good to know when you start doing large data processing in python.
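For Python 3 readers: izip and xrange no longer exist there, because the built-in zip and range are already lazy. A minimal sketch of the same example under that assumption, with two ranges instead of four for brevity:

```python
from itertools import islice

w = range(9000000000000000000)
x = range(2000000000000000000)

# zip() builds nothing up front; islice() stops after 100 tuples,
# so only those 100 tuples are ever created.
first_100_items = list(islice(zip(w, x), 100))
print(first_100_items[:3])
```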

How can I print the values of keys returned by heapq?

I am trying to obtain the nine keys with the highest values from a large (14m keys) dictionary.
I am using the following to return the nine keys:
import heapq
def dict_nlargest(d,n):
    return heapq.nlargest(n ,d, key = lambda k: d[k])
print dict_nlargest(mydict,9)
This works, but I would also like to print the values of those keys. Is there a way to do that using this method?
Normally, iterating over a dict iterates over its keys, so only those will be in the heap. You can change that by using items() or (preferably) iteritems(). You then iterate over (key, value) tuples. The key (for comparison) should be only the value, which can be achieved with lambda x: x[1] or (slightly faster) using operator.itemgetter.
import heapq
from operator import itemgetter
def dict_nlargest_items(d,n):
    return heapq.nlargest(n, d.iteritems(), key=itemgetter(1))
mydict = {'a': 1, 'b': 2, 'c': 3}
print dict_nlargest_items(mydict, 2) # [('c', 3), ('b', 2)]
Of course, there is no real need to make this adjustment. Once you have the key, you can always look up the value:
print [(k, mydict[k]) for k in dict_nlargest(mydict, 2)] # [('c', 3), ('b', 2)]
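On Python 3, iteritems() is gone and items() already returns a lazy view, so the same approach could be sketched as follows, using the small example dictionary from above:

```python
import heapq
from operator import itemgetter


def dict_nlargest_items(d, n):
    # Rank the (key, value) pairs by their value, i.e. element 1 of each pair.
    return heapq.nlargest(n, d.items(), key=itemgetter(1))


mydict = {'a': 1, 'b': 2, 'c': 3}
top_two = dict_nlargest_items(mydict, 2)
print(top_two)
```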

How to add items in a list of tuples if the item is the same

I'm trying to multiply two polynomials in Python3 ((2x^3-3x^2+4x) * (2x^2-3) = 4x^5-6x^4+2x^3+9x^2-12x), and to represent each polynomial I'm using a list of (exponent, coefficient) tuples, so the operation I described above would be: [(3,2), (2,-3), (1,4)] * [(2,2), (0, -3)]
And I got the following list as an answer: [(5, 4), (3, -6), (4, -6), (2, 9), (3, 8), (1, -12)]
That would represent: 4x^5-6x^3-6x^4+9x^2+8x^3-12x
But my problem is that I can't find a way to 'add' the tuples that have the same first element as you can see with the -6x^3 (3, -6) and 8x^3 (3, 8).
Is there a "Pythonic" way to achieve this?
I would switch from lists to dictionaries. To make addition easier, I'd use defaultdict:
from collections import defaultdict
poly = defaultdict(int)
And then add those tuples into the dictionary:
for exponent, variable in poly_list:
    poly[exponent] += variable
It sort of works:
>>> from collections import defaultdict
>>>
>>> poly = defaultdict(int)
>>>
>>> for poly_list in [[(1, 1)], [(1, 1)]]:
...     for exponent, variable in poly_list:
...         poly[exponent] += variable
...
>>> poly
defaultdict(<type 'int'>, {1: 2})
>>> poly.items()
[(1, 2)]
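Applied to the list from the question, the defaultdict approach might look like this sketch (Python 3; the names terms and combined are made up for the example):

```python
from collections import defaultdict

# The uncombined product from the question, as (exponent, coefficient) pairs.
terms = [(5, 4), (3, -6), (4, -6), (2, 9), (3, 8), (1, -12)]

poly = defaultdict(int)
for exponent, coefficient in terms:
    # Coefficients that share an exponent are summed.
    poly[exponent] += coefficient

# Highest exponent first, as a polynomial is usually written.
combined = sorted(poly.items(), reverse=True)
print(combined)
```

The result corresponds to 4x^5-6x^4+2x^3+9x^2-12x, the polynomial given in the question.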
Although personally, I would just make a Polynomial class:
class Polynomial(object):
    def __init__(self, terms=None):
        # terms maps exponent -> coefficient; a dict or a list of
        # (exponent, coefficient) pairs is accepted and copied
        self.terms = dict(terms or {})
    def copy(self):
        return Polynomial(self.terms)
    def __add__(self, other):
        result = self.copy()
        for e, c in other.terms.items():
            result.terms[e] = result.terms.get(e, 0) + c
        return result
    def __mul__(self, other):
        result = Polynomial()
        for e1, c1 in self.terms.items():
            for e2, c2 in other.terms.items():
                # accumulate: several products can land on the same exponent
                result.terms[e1 + e2] = result.terms.get(e1 + e2, 0) + c1 * c2
        return result
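Without a class, the multiplication itself can accumulate like terms as it goes, so no separate combining pass is needed. A minimal sketch for Python 3, where poly_mul is a hypothetical helper name and polynomials are lists of (exponent, coefficient) pairs as in the question:

```python
from collections import defaultdict


def poly_mul(p, q):
    """Multiply two polynomials given as (exponent, coefficient) pairs."""
    result = defaultdict(int)
    for e1, c1 in p:
        for e2, c2 in q:
            # Exponents add, coefficients multiply; equal exponents are summed.
            result[e1 + e2] += c1 * c2
    return sorted(result.items(), reverse=True)


product = poly_mul([(3, 2), (2, -3), (1, 4)], [(2, 2), (0, -3)])
print(product)
```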
This could be done in one line using itertools.groupby():
>>> [(exponent, sum(value for _, value in values)) for exponent, values in groupby(sorted(l, key=itemgetter(0)), key=itemgetter(0))]
[(1, -12), (2, 9), (3, 2), (4, -6), (5, 4)]
Breaking it down into something more readable (readability counts)...
Import the tools:
>>> from itertools import groupby
>>> from operator import itemgetter
>>>
Declaring the input (you've already done this bit):
>>> l = [(5, 4), (3, -6), (4, -6), (2, 9), (3, 8), (1, -12)]
>>>
Before we can group, we need to sort (on the first item in the tuple):
>>> l_sorted = sorted(l, key=itemgetter(0))
>>>
And then group (again, by that first item):
>>> l_grouped = groupby(l_sorted, key=itemgetter(0))
>>>
Then create a list comprehension, summing the values in the group (ignoring the key):
>>> [(exponent, sum(v for _,v in values)) for exponent, values in l_grouped]
[(1, -12), (2, 9), (3, 2), (4, -6), (5, 4)]