numpy recarray append_fields: can't append numpy array of datetimes - python-2.7

I have a recarray containing various fields and I want to append an array of datetime objects on to it.
However, it seems like the append_fields function in numpy.lib.recfunctions won't let me add an array of objects.
Here's some example code:
import numpy as np
import datetime
import numpy.lib.recfunctions as recfun
dtype= np.dtype([('WIND_WAVE_HGHT', '<f4'), ('WIND_WAVE_PERD', '<f4')])
obs = np.array([(0.1,10.0),(0.2,11.0),(0.3,12.0)], dtype=dtype)
dates = np.array([datetime.datetime(2001,1,1,0),
datetime.datetime(2001,1,1,0),
datetime.datetime(2001,1,1,0)])
# This doesn't work:
recfun.append_fields(obs,'obdate',dates,dtypes=np.object)
I keep getting the error TypeError: Cannot change data-type for object array.
It seems to only be an issue with np.object arrays as I can append other fields ok. Am I missing something?

The problem
In [143]: recfun.append_fields(obs,'test',np.array([None,[],1]))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-143-5c3de23b09f7> in <module>()
----> 1 recfun.append_fields(obs,'test',np.array([None,[],1]))
/usr/local/lib/python3.5/dist-packages/numpy/lib/recfunctions.py in append_fields(base, names, data, dtypes, fill_value, usemask, asrecarray)
615 if dtypes is None:
616 data = [np.array(a, copy=False, subok=True) for a in data]
--> 617 data = [a.view([(name, a.dtype)]) for (name, a) in zip(names, data)]
618 else:
619 if not isinstance(dtypes, (tuple, list)):
/usr/local/lib/python3.5/dist-packages/numpy/lib/recfunctions.py in <listcomp>(.0)
615 if dtypes is None:
616 data = [np.array(a, copy=False, subok=True) for a in data]
--> 617 data = [a.view([(name, a.dtype)]) for (name, a) in zip(names, data)]
618 else:
619 if not isinstance(dtypes, (tuple, list)):
/usr/local/lib/python3.5/dist-packages/numpy/core/_internal.py in _view_is_safe(oldtype, newtype)
363
364 if newtype.hasobject or oldtype.hasobject:
--> 365 raise TypeError("Cannot change data-type for object array.")
366 return
367
TypeError: Cannot change data-type for object array.
So the problem is in this a.view([(name, a.dtype)]) expression. It tries to make a single field structured array from a. That works with dtypes like int and str, but fails with object. That failure is in the core view handling, so isn't likely to change.
In [148]: x=np.arange(3)
In [149]: x.view([('test', x.dtype)])
Out[149]:
array([(0,), (1,), (2,)],
dtype=[('test', '<i4')])
In [150]: x=np.array(['one','two'])
In [151]: x.view([('test', x.dtype)])
Out[151]:
array([('one',), ('two',)],
dtype=[('test', '<U3')])
In [152]: x=np.array([[1],[1,2]])
In [153]: x
Out[153]: array([[1], [1, 2]], dtype=object)
In [154]: x.view([('test', x.dtype)])
...
TypeError: Cannot change data-type for object array.
The fact that recfunctions has to be imported separately suggests that it is something of a backwater: not widely used and not under active development. I haven't examined the code in detail, but I suspect a fix would be a kludge.
A fix
Here's a way of adding a new field from scratch. It performs the same basic actions as append_fields:
Define a new dtype, using the obs and the new field name and dtype:
In [158]: obs.dtype.descr
Out[158]: [('WIND_WAVE_HGHT', '<f4'), ('WIND_WAVE_PERD', '<f4')]
In [159]: obs.dtype.descr+[('TEST',object)]
Out[159]: [('WIND_WAVE_HGHT', '<f4'), ('WIND_WAVE_PERD', '<f4'), ('TEST', object)]
In [160]: dt1 =np.dtype(obs.dtype.descr+[('TEST',object)])
Make an empty target array, and fill it by copying data by field name:
In [161]: newobs = np.empty(obs.shape, dtype=dt1)
In [162]: for n in obs.dtype.names:
     ...:     newobs[n]=obs[n]
In [167]: dates
Out[167]:
array([datetime.datetime(2001, 1, 1, 0, 0),
datetime.datetime(2001, 1, 1, 0, 0),
datetime.datetime(2001, 1, 1, 0, 0)], dtype=object)
In [168]: newobs['TEST']=dates
In [169]: newobs
Out[169]:
array([( 0.1 , 10., datetime.datetime(2001, 1, 1, 0, 0)),
( 0.2 , 11., datetime.datetime(2001, 1, 1, 0, 0)),
( 0.30000001, 12., datetime.datetime(2001, 1, 1, 0, 0))],
dtype=[('WIND_WAVE_HGHT', '<f4'), ('WIND_WAVE_PERD', '<f4'), ('TEST', 'O')])
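The steps above can be wrapped in a small helper. This is only a sketch: `append_object_field` is a name made up here, not part of numpy, and it assumes a plain (unaligned, non-nested) structured dtype so that `dtype.descr` can be reused directly.

```python
import datetime
import numpy as np


def append_object_field(base, name, values):
    """Return a copy of structured array `base` with an extra object field.

    Hypothetical helper: mirrors the manual steps (new dtype, empty target,
    copy field by field) rather than calling append_fields.
    """
    new_dtype = np.dtype(base.dtype.descr + [(name, object)])
    out = np.empty(base.shape, dtype=new_dtype)
    for field in base.dtype.names:
        out[field] = base[field]  # copy existing columns by name
    out[name] = values            # fill the new object column
    return out


obs = np.array([(0.1, 10.0), (0.2, 11.0), (0.3, 12.0)],
               dtype=[('WIND_WAVE_HGHT', '<f4'), ('WIND_WAVE_PERD', '<f4')])
dates = np.array([datetime.datetime(2001, 1, 1, 0)] * 3, dtype=object)
newobs = append_object_field(obs, 'obdate', dates)
```

Unlike append_fields, this does nothing about fill values, masked arrays or recarrays; it only copies the existing fields and adds one new column.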
datetime64 alternative
With the native numpy datetimes, append works
In [179]: dates64 = dates.astype('datetime64[D]')
In [180]: recfun.append_fields(obs,'test',dates64,usemask=False)
Out[180]:
array([( 0.1 , 10., '2001-01-01'),
( 0.2 , 11., '2001-01-01'), ( 0.30000001, 12., '2001-01-01')],
dtype=[('WIND_WAVE_HGHT', '<f4'), ('WIND_WAVE_PERD', '<f4'), ('test', '<M8[D]')])
append_fields has some bells-n-whistles that my version doesn't - fill values, masked arrays, recarray, etc.
structured dates array
I could create a structured array with the dates
In [197]: sdates = np.array([(i,) for i in dates],dtype=[('test',object)])
In [198]: sdates
Out[198]:
array([(datetime.datetime(2001, 1, 1, 0, 0),),
(datetime.datetime(2001, 1, 1, 0, 0),),
(datetime.datetime(2001, 1, 1, 0, 0),)],
dtype=[('test', 'O')])
There must be a function that merges fields of existing arrays, but I'm not finding it.
previous work
This felt familiar:
https://github.com/numpy/numpy/issues/2346
TypeError when appending fields to a structured array of size ONE
Adding datetime field to recarray

Related

storing a numpy arrays in a list

I'd like to automate the process of loading some ASCII datafiles in numpy in order to plot them. The filenames are given to the program via terminal and the content is loaded and saved in a list. So basically the idea is to have list that contains numpy arrays that I can call later via indexing to plot each individual data.
The problem I have is that indexing is not working with these lists I make
subplots_array = [[0,0],[0,0],[0,0],[0,0]]
subplots_axes = [0,0,0,0] # this list will hold a subplot for each of the above datasets
fig = plt.figure()
counter = 0
for x in arguments_list:
for filename in glob.glob(x):
mydata = np.loadtxt(filename)
subplots_array[counter] = mydata # This loads the data from files
#specified in arguments argv into a subplot array as numpy sub-array
counter += 1
counter = 0
for x in subplots_array:
subplots_axes[counter] = fig.add_subplot(counter+1, 1, 1)
subplots_axes[counter].scatter(subplots_array[counter][:, 0], subplots_array[counter][:, 1], s = 12, marker = "x")
counter = counter + 1
This is the error I get. The funny thing is if I substitute "counter" with a numerical index like 0 or 1 or 2 etc, the data is plotted correctly, despite counter being defined as an index as well. So I am out of ideas.
Traceback (most recent call last):
File "FirstTrial.py", line 89, in <module>
subplots_axes[counter].scatter(subplots_array[counter][:,0], subplots_array[counter][:, 1], s = 12, marker = "x")
TypeError: list indices must be integers or slices, not tuple
I hope this is enough description to help me solve the issue.
Your indentation is off, and without your files I can't reproduce your code. But here's what I think is happening:
In [1]: subplots_array = [[0,0],[0,0],[0,0],[0,0]]
In [2]: subplots_array[0]=np.ones((2,3)) # one or more counter loops
indexing with counter==0 works:
In [4]: subplots_array[0][:,0]
Out[4]: array([1., 1.])
but you get your error with the next counter value:
In [5]: subplots_array[1][:,0]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-0b654866bdee> in <module>
----> 1 subplots_array[1][:,0]
TypeError: list indices must be integers or slices, not tuple
Here I replaced one element of subplots_array with a 2d array, but left the others alone. They were initialized as lists:
In [6]: subplots_array
Out[6]:
[array([[1., 1., 1.],
[1., 1., 1.]]), [0, 0], [0, 0], [0, 0]]
So the problem isn't with the counter type itself, but with the next level of indexing.
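One way to avoid the leftover placeholder lists is to start with an empty list and append one array per file. The sketch below is only an illustration under assumptions: the `io.StringIO` objects stand in for the real files found by glob, and the name `datasets` is made up here.

```python
import io
import numpy as np

# Stand-ins for the files matched by glob; real code would pass filenames.
sources = [io.StringIO("1 2\n3 4\n"), io.StringIO("5 6\n7 8\n9 10\n")]

datasets = []  # start empty instead of pre-filling with [0, 0] placeholders
for source in sources:
    datasets.append(np.loadtxt(source))

# Every element is now a 2d array, so column indexing works for all of them.
for index, data in enumerate(datasets):
    x, y = data[:, 0], data[:, 1]
```

Because the list only ever contains what was actually loaded, looping over it never reaches a plain `[0, 0]` list.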

Reading Data from CSV and fill Empty Values Python

I am reading in a CSV file with the general schema of
,abv,ibu,id,name,style,brewery_id,ounces
14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
I am running into problems where fields are not existing such as in object 0 where it is lacking an IBU. I would like to be able to insert a value such as 0.0 that would work as a float for values that require floats and an empty string for ones that require strings.
My code is along the lines of
import csv
import numpy as np
def dataset(path, filter_field, filter_value):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        if filter_field:
            for row in filter(lambda row: row[filter_field]==filter_value, reader):
                yield row
def main(path):
    data = [(row["ibu"], float(row["ibu"])) for row in dataset(path, "style", "American Pale Lager")]
As of right now my code would throw an error since there are empty values in the "ibu" column for object 0.
How should one go about solving this problem?
You can do the following:
add a default dictionary argument that supplies values for missing fields,
and also update it under certain conditions, such as when ibu is empty.
This is your implementation, changed to do what you need. If I were you I would use pandas ...
import csv, copy
def dataset(path, filter_field, filter_value, default={'brewery_id':-1, 'style': 'unknown style', ' ': -1, 'name': 'unknown name', 'abv':0.0, 'id': -1, 'ounces':-1, 'ibu':0.0}):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if row is None:
                break
            # skip rows that do not match the filter
            if row[filter_field].strip() != filter_value:
                continue
            default_row = copy.copy(default)
            default_row.update(row)
            # you might want to add conditions
            if default_row["ibu"] == "":
                default_row["ibu"] = default["ibu"]
            yield default_row
data = [(row["ibu"], float(row["ibu"])) for row in dataset('test.csv', "style", "American Pale Lager")]
print data
>> [(0.0, 0.0)]
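The same idea can be sketched with the standard library only, filling blank numeric fields with a float default and stripping stray whitespace from strings. The names `FLOAT_FIELDS` and `clean_rows` are made up for this illustration, and it assumes every row has as many fields as the header.

```python
import csv
import io

# Hypothetical defaults for the float columns; adjust to the real schema.
FLOAT_FIELDS = {"abv": 0.0, "ibu": 0.0, "ounces": 0.0}


def clean_rows(lines):
    """Yield rows with whitespace stripped and blank float fields filled."""
    for row in csv.DictReader(lines):
        cleaned = {key: value.strip() for key, value in row.items()}
        for field, fallback in FLOAT_FIELDS.items():
            text = cleaned[field]
            cleaned[field] = float(text) if text else fallback
        yield cleaned


sample = io.StringIO(
    ",abv,ibu,id,name,style,brewery_id,ounces\n"
    "14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0\n"
    "0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0\n"
)
rows = list(clean_rows(sample))
```

String columns that are blank simply stay as empty strings after stripping, which matches the "empty string for strings, 0.0 for floats" behaviour asked for.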
Why don't you use
import pandas as pd
df = pd.read_csv(data_file)
The following is the result:
In [13]: df
Out[13]:
Unnamed: 0 abv ibu id name style \
0 14 0.061 60.0 1979 Bitter Bitch American Pale Ale (APA)
1 0 0.050 NaN 1436 Pub Beer American Pale Lager
brewery_id ounces
0 177 12.0
1 408 12.0
Simulating your file with a text string:
In [48]: txt=b""" ,abv,ibu,id,name,style,brewery_id,ounces
...: 14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
...: 0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
...: """
I can load it with numpy genfromtxt.
In [49]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=None,skip_header=1,filling_values=0)
In [50]: data
Out[50]:
array([ (14, 0.061, 60., 1979, b'Bitter Bitch', b'American Pale Ale (APA)', 177, 12.),
( 0, 0.05 , 0., 1436, b' Pub Beer', b' American Pale Lager', 408, 12.)],
dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<i4'), ('f4', 'S12'), ('f5', 'S23'), ('f6', '<i4'), ('f7', '<f8')])
In [51]:
I had to skip the header line because it is incomplete (a blank for the 1st field). The result is a structured array - a mix of ints, floats and strings (bytestrings in Py3).
After correcting the header line, and using names=True, I get
array([ (14, 0.061, 60., 1979, b'Bitter Bitch', b'American Pale Ale (APA)', 177, 12.),
( 0, 0.05 , 0., 1436, b' Pub Beer', b' American Pale Lager', 408, 12.)],
dtype=[('f0', '<i4'), ('abv', '<f8'), ('ibu', '<f8'), ('id', '<i4'), ('name', 'S12'), ('style', 'S23'), ('brewery_id', '<i4'), ('ounces', '<f8')])
genfromtxt is the most powerful csv reader in numpy. See its docs for more parameters. The pandas reader is faster and more flexible - but of course produces a data frame, not an array.

how can I use estimators not in sklearn for model pipeline

I tried to use an ARIMA model in the GridSearchCV function, but it returns
"TypeError: Cannot clone object '' (type ): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods."
import numpy as np
import pandas as pd
from sklearn.grid_search import GridSearchCV
from statsmodels.tsa.arima_model import ARIMA
df_original = pd.DataFrame({"date_col": ['2016-08-01', '2016-08-02', '2016-08-03', '2016-08-04', '2016-08-05',
'2016-08-06', '2016-08-07', '2016-08-08', '2016-08-09', '2016-08-10',
'2016-08-11'],
'sum_base_revenue_cip': [1, 2, 7, 5, 1, 2, 5, 10, 9, 0, 1]})
df_original["sum_base_revenue_cip"] = np.log(df_original["sum_base_revenue_cip"] + 1e-6)
df_original_ts = df_original.copy(deep=True)
df_original_ts['date_col'] = pd.to_datetime(df_original['date_col'])
df_original_ts = df_original_ts.set_index('date_col')
print df_original_ts
estimator = ARIMA(df_original_ts,order=(1,1,0))
params = {
'order': ((2, 1, 0), (0, 2, 1), (1, 0, 0))
}
grid_search = GridSearchCV(estimator,
params,
n_jobs=-1,
verbose=True)
grid_search.fit(df_original_ts)
You have a few options:
You can find an sklearn wrapper for it.
You can write your own, inheriting from BaseEstimator and meeting all the requirements for an sklearn estimator, e.g. all parameters have to be explicitly mentioned in the signature of __init__.
You can roll your own grid search by just looping through the parameters.
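The last option, looping through the parameters yourself, might look roughly like the sketch below. Everything in it is an assumption for illustration: `manual_grid_search` and `toy_score` are made-up names, and `toy_score` merely stands in for fitting a real model and returning a lower-is-better score such as an information criterion.

```python
from itertools import product


def manual_grid_search(fit_and_score, param_grid):
    """Try every parameter combination; return (best_params, best_score).

    `fit_and_score` is a hypothetical callable that fits a model with the
    given keyword arguments and returns a score where lower is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for fitting a model: the "score" is just the sum of the order.
def toy_score(order):
    return sum(order)


grid = {"order": [(2, 1, 0), (0, 2, 1), (1, 0, 0)]}
best_params, best_score = manual_grid_search(toy_score, grid)
```

With a real model, the body of `toy_score` would be replaced by fitting on the training data and returning whatever criterion you want to minimise.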

Regarding arranging or sorting a dictionary in ascending order using python [duplicate]

I have a dictionary of values read from two fields in a database: a string field and a numeric field. The string field is unique, so that is the key of the dictionary.
I can sort on the keys, but how can I sort based on the values?
Note: I have read Stack Overflow question here How do I sort a list of dictionaries by a value of the dictionary? and probably could change my code to have a list of dictionaries, but since I do not really need a list of dictionaries I wanted to know if there is a simpler solution to sort either in ascending or descending order.
Python 3.7+ or CPython 3.6
Dicts preserve insertion order in Python 3.7+. Same in CPython 3.6, but it's an implementation detail.
>>> x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
>>> {k: v for k, v in sorted(x.items(), key=lambda item: item[1])}
{0: 0, 2: 1, 1: 2, 4: 3, 3: 4}
or
>>> dict(sorted(x.items(), key=lambda item: item[1]))
{0: 0, 2: 1, 1: 2, 4: 3, 3: 4}
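For reuse, the one-liner can be wrapped in a small function with a switch for descending order. This is a sketch for Python 3.7+ (where dicts keep insertion order); the name `sort_dict_by_value` is made up here.

```python
def sort_dict_by_value(mapping, reverse=False):
    """Return a new dict whose items are ordered by value."""
    return dict(sorted(mapping.items(), key=lambda item: item[1], reverse=reverse))


x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
ascending = sort_dict_by_value(x)
descending = sort_dict_by_value(x, reverse=True)
```

Note that comparing two dicts with `==` ignores order, so to check the ordering you need to look at `list(ascending.items())`.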
Older Python
It is not possible to sort a dictionary, only to get a representation of a dictionary that is sorted. Dictionaries are inherently orderless, but other types, such as lists and tuples, are not. So you need an ordered data type to represent sorted values, which will be a list—probably a list of tuples.
For instance,
import operator
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(1))
sorted_x will be a list of tuples sorted by the second element in each tuple. dict(sorted_x) == x.
And for those wishing to sort on keys instead of values:
import operator
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(0))
In Python 3, tuple parameter unpacking in a lambda is not allowed, so we can use
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=lambda kv: kv[1])
If you want the output as a dict, you can use collections.OrderedDict:
import collections
sorted_dict = collections.OrderedDict(sorted_x)
As simple as: sorted(dict1, key=dict1.get)
Well, it is actually possible to do a "sort by dictionary values". Recently I had to do that in a Code Golf (Stack Overflow question Code golf: Word frequency chart). Abridged, the problem was of the kind: given a text, count how often each word is encountered and display a list of the top words, sorted by decreasing frequency.
If you construct a dictionary with the words as keys and the number of occurrences of each word as value, simplified here as:
from collections import defaultdict
d = defaultdict(int)
for w in text.split():
    d[w] += 1
then you can get a list of the words, ordered by frequency of use with sorted(d, key=d.get) - the sort iterates over the dictionary keys, using the number of word occurrences as a sort key .
for w in sorted(d, key=d.get, reverse=True):
    print(w, d[w])
I am writing this detailed explanation to illustrate what people often mean by "I can easily sort a dictionary by key, but how do I sort by value" - and I think the original post was trying to address such an issue. The solution is to sort a list of the keys, based on the values, as shown above.
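Put together as a self-contained snippet, the word-frequency approach could look like this. The sample `text` is invented for the illustration, and the names `counts` and `by_frequency` are not from the original answer.

```python
from collections import defaultdict

text = "the cat and the hat and the bat"  # made-up sample input

# Count how often each word occurs.
counts = defaultdict(int)
for word in text.split():
    counts[word] += 1

# Sort the keys, using each key's count as the sort key, most frequent first.
by_frequency = sorted(counts, key=counts.get, reverse=True)

for word in by_frequency:
    print(word, counts[word])
```

The dictionary itself is never reordered; only the list of keys is.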
You could use:
sorted(d.items(), key=lambda x: x[1])
This will sort the dictionary by the values of each entry within the dictionary from smallest to largest.
To sort it in descending order just add reverse=True:
sorted(d.items(), key=lambda x: x[1], reverse=True)
Input:
d = {'one':1,'three':3,'five':5,'two':2,'four':4}
a = sorted(d.items(), key=lambda x: x[1])
print(a)
Output:
[('one', 1), ('two', 2), ('three', 3), ('four', 4), ('five', 5)]
Dicts can't be sorted, but you can build a sorted list from them.
A sorted list of dict values:
sorted(d.values())
A list of (key, value) pairs, sorted by value:
from operator import itemgetter
sorted(d.items(), key=itemgetter(1))
In recent Python 2.7, we have the new OrderedDict type, which remembers the order in which the items were added.
>>> d = {"third": 3, "first": 1, "fourth": 4, "second": 2}
>>> for k, v in d.items():
...     print "%s: %s" % (k, v)
...
second: 2
fourth: 4
third: 3
first: 1
>>> d
{'second': 2, 'fourth': 4, 'third': 3, 'first': 1}
To make a new ordered dictionary from the original, sorting by the values:
>>> from collections import OrderedDict
>>> d_sorted_by_value = OrderedDict(sorted(d.items(), key=lambda x: x[1]))
The OrderedDict behaves like a normal dict:
>>> for k, v in d_sorted_by_value.items():
...     print "%s: %s" % (k, v)
...
first: 1
second: 2
third: 3
fourth: 4
>>> d_sorted_by_value
OrderedDict([('first', 1), ('second', 2), ('third', 3), ('fourth', 4)])
Using Python 3.5
Whilst I found the accepted answer useful, I was also surprised that it hasn't been updated to reference OrderedDict from the standard library collections module as a viable, modern alternative - designed to solve exactly this type of problem.
from operator import itemgetter
from collections import OrderedDict
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = OrderedDict(sorted(x.items(), key=itemgetter(1)))
# OrderedDict([(0, 0), (2, 1), (1, 2), (4, 3), (3, 4)])
The official OrderedDict documentation offers a very similar example too, but using a lambda for the sort function:
# regular unsorted dictionary
d = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}
# dictionary sorted by value
OrderedDict(sorted(d.items(), key=lambda t: t[1]))
# OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])
Pretty much the same as Hank Gay's answer:
sorted([(value,key) for (key,value) in mydict.items()])
Or optimized slightly as suggested by John Fouhy:
sorted((value,key) for (key,value) in mydict.items())
As of Python 3.6 the built-in dict will be ordered
Good news, so the OP's original use case of mapping pairs retrieved from a database with unique string ids as keys and numeric values as values into a built-in Python v3.6+ dict, should now respect the insert order.
If say the resulting two column table expressions from a database query like:
SELECT a_key, a_value FROM a_table ORDER BY a_value;
would be stored in two Python tuples, k_seq and v_seq (aligned by numerical index and with the same length of course), then:
k_seq = ('foo', 'bar', 'baz')
v_seq = (0, 1, 42)
ordered_map = dict(zip(k_seq, v_seq))
This allows output later as:
for k, v in ordered_map.items():
    print(k, v)
yielding in this case (for the new Python 3.6+ built-in dict!):
foo 0
bar 1
baz 42
in the same ordering per value of v.
Where in the Python 3.5 install on my machine it currently yields:
bar 1
foo 0
baz 42
Details:
As proposed in 2012 by Raymond Hettinger (cf. the python-dev mail with subject "More compact dictionaries with faster iteration") and announced in 2016 in a python-dev mail by Victor Stinner with subject "Python 3.6 dict becomes compact and gets a private version; and keywords become ordered", the fix for issue 27350 "Compact and ordered dict" in Python 3.6 means we can now use a built-in dict to maintain insertion order!
Hopefully this will lead to a thin-layer OrderedDict implementation as a first step. As @JimFasarakis-Hilliard indicated, some see use cases for the OrderedDict type in the future as well. I think the Python community at large will carefully inspect whether this will stand the test of time, and what the next steps will be.
Time to rethink our coding habits to not miss the possibilities opened by stable ordering of:
Keyword arguments and
(intermediate) dict storage
The first because it eases dispatch in the implementation of functions and methods in some cases.
The second as it encourages to more easily use dicts as intermediate storage in processing pipelines.
Raymond Hettinger kindly provided documentation explaining "The Tech Behind Python 3.6 Dictionaries" - from his San Francisco Python Meetup Group presentation 2016-DEC-08.
And quite a few highly decorated Stack Overflow question and answer pages will probably receive variants of this information, and many high-quality answers will need a per-version update too.
Caveat Emptor (but also see below update 2017-12-15):
As @ajcr rightfully notes: "The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon." (from the whatsnew36). Not to nitpick, but the citation was cut a bit pessimistically ;-). It continues: "(this may change in the future, but it is desired to have this new dict implementation in the language for a few releases before changing the language spec to mandate order-preserving semantics for all current and future Python implementations; this also helps preserve backwards-compatibility with older versions of the language where random iteration order is still in effect, e.g. Python 3.5)."
So as in some human languages (e.g. German), usage shapes the language, and the will now has been declared ... in whatsnew36.
Update 2017-12-15:
In a mail to the python-dev list, Guido van Rossum declared:
Make it so. "Dict keeps insertion order" is the ruling. Thanks!
So, the version 3.6 CPython side-effect of dict insertion ordering is now becoming part of the language spec (and not anymore only an implementation detail). That mail thread also surfaced some distinguishing design goals for collections.OrderedDict as reminded by Raymond Hettinger during discussion.
It can often be very handy to use namedtuple. For example, you have a dictionary of 'name' as keys and 'score' as values and you want to sort on 'score':
import collections
Player = collections.namedtuple('Player', 'score name')
d = {'John':5, 'Alex':10, 'Richard': 7}
sorting with lowest score first:
worst = sorted(Player(v,k) for (k,v) in d.items())
sorting with highest score first:
best = sorted([Player(v,k) for (k,v) in d.items()], reverse=True)
Now you can get the name and score of, let's say the second-best player (index=1) very Pythonically like this:
player = best[1]
player.name
'Richard'
player.score
7
I had the same problem, and I solved it like this:
WantedOutput = sorted(MyDict, key=lambda x : MyDict[x])
(People who answer "It is not possible to sort a dict" did not read the question! In fact, "I can sort on the keys, but how can I sort based on the values?" clearly means that he wants a list of the keys sorted according to the value of their values.)
Please notice that the order is not well defined (keys with the same value will be in an arbitrary order in the output list).
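If the order among equal values matters, one option is to make the sort key a tuple so that ties are broken by the key itself. This is a small sketch with made-up data; `scores` and `by_value_then_key` are illustrative names.

```python
scores = {"carol": 2, "alice": 1, "bob": 2, "dave": 1}

# Sort keys by value first, then alphabetically by key to break ties.
by_value_then_key = sorted(scores, key=lambda name: (scores[name], name))
print(by_value_then_key)  # ['alice', 'dave', 'bob', 'carol']
```

This only works when the keys are themselves comparable with each other, as strings are here.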
If values are numeric you may also use Counter from collections.
from collections import Counter
x = {'hello': 1, 'python': 5, 'world': 3}
c = Counter(x)
print(c.most_common())
>> [('python', 5), ('world', 3), ('hello', 1)]
Starting from Python 3.6, dict objects are now ordered by insertion order. It's officially in the specifications of Python 3.7.
>>> words = {"python": 2, "blah": 4, "alice": 3}
>>> dict(sorted(words.items(), key=lambda x: x[1]))
{'python': 2, 'alice': 3, 'blah': 4}
Before that, you had to use OrderedDict.
Python 3.7 documentation says:
Changed in version 3.7: Dictionary order is guaranteed to be insertion
order. This behavior was implementation detail of CPython from 3.6.
In Python 2.7, simply do:
from collections import OrderedDict
# regular unsorted dictionary
d = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}
# dictionary sorted by key
OrderedDict(sorted(d.items(), key=lambda t: t[0]))
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])
# dictionary sorted by value
OrderedDict(sorted(d.items(), key=lambda t: t[1]))
OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])
copy-paste from : http://docs.python.org/dev/library/collections.html#ordereddict-examples-and-recipes
Enjoy ;-)
This is the code:
import operator
origin_list = [
{"name": "foo", "rank": 0, "rofl": 20000},
{"name": "Silly", "rank": 15, "rofl": 1000},
{"name": "Baa", "rank": 300, "rofl": 20},
{"name": "Zoo", "rank": 10, "rofl": 200},
{"name": "Penguin", "rank": -1, "rofl": 10000}
]
print ">> Original >>"
for foo in origin_list:
    print foo
print "\n>> Rofl sort >>"
for foo in sorted(origin_list, key=operator.itemgetter("rofl")):
    print foo
print "\n>> Rank sort >>"
for foo in sorted(origin_list, key=operator.itemgetter("rank")):
    print foo
Here are the results:
Original
{'name': 'foo', 'rank': 0, 'rofl': 20000}
{'name': 'Silly', 'rank': 15, 'rofl': 1000}
{'name': 'Baa', 'rank': 300, 'rofl': 20}
{'name': 'Zoo', 'rank': 10, 'rofl': 200}
{'name': 'Penguin', 'rank': -1, 'rofl': 10000}
Rofl
{'name': 'Baa', 'rank': 300, 'rofl': 20}
{'name': 'Zoo', 'rank': 10, 'rofl': 200}
{'name': 'Silly', 'rank': 15, 'rofl': 1000}
{'name': 'Penguin', 'rank': -1, 'rofl': 10000}
{'name': 'foo', 'rank': 0, 'rofl': 20000}
Rank
{'name': 'Penguin', 'rank': -1, 'rofl': 10000}
{'name': 'foo', 'rank': 0, 'rofl': 20000}
{'name': 'Zoo', 'rank': 10, 'rofl': 200}
{'name': 'Silly', 'rank': 15, 'rofl': 1000}
{'name': 'Baa', 'rank': 300, 'rofl': 20}
Try the following approach. Let us define a dictionary called mydict with the following data:
mydict = {'carl':40,
'alan':2,
'bob':1,
'danny':3}
If one wanted to sort the dictionary by keys, one could do something like:
for key in sorted(mydict.iterkeys()):
    print "%s: %s" % (key, mydict[key])
This should return the following output:
alan: 2
bob: 1
carl: 40
danny: 3
On the other hand, if one wanted to sort a dictionary by value (as is asked in the question), one could do the following:
for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)
The result of this command (sorting the dictionary by value) should return the following:
bob: 1
alan: 2
danny: 3
carl: 40
You can create an "inverted index", also
from collections import defaultdict
inverse= defaultdict( list )
for k, v in originalDict.items():
    inverse[v].append( k )
Now your inverse has the values; each value has a list of applicable keys.
for k in sorted(inverse):
    print k, inverse[k]
You can use the collections.Counter. Note, this will work for both numeric and non-numeric values.
>>> x = {1: 2, 3: 4, 4:3, 2:1, 0:0}
>>> from collections import Counter
>>> #To sort in reverse order
>>> Counter(x).most_common()
[(3, 4), (4, 3), (1, 2), (2, 1), (0, 0)]
>>> #To sort in ascending order
>>> Counter(x).most_common()[::-1]
[(0, 0), (2, 1), (1, 2), (4, 3), (3, 4)]
>>> #To get a dictionary sorted by values
>>> from collections import OrderedDict
>>> OrderedDict(Counter(x).most_common()[::-1])
OrderedDict([(0, 0), (2, 1), (1, 2), (4, 3), (3, 4)])
The collections solution mentioned in another answer is absolutely superb, because you retain a connection between the key and value which in the case of dictionaries is extremely important.
I don't agree with the number one choice presented in another answer, because it throws away the keys.
I used the solution mentioned above (code shown below) and retained access to both keys and values and in my case the ordering was on the values, but the importance was the ordering of the keys after ordering the values.
from collections import Counter
x = {'hello':1, 'python':5, 'world':3}
c=Counter(x)
print( c.most_common() )
>> [('python', 5), ('world', 3), ('hello', 1)]
You can also use a custom function that can be passed to parameter key.
def dict_val(x):
    return x[1]
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=dict_val)
You can use a skip dict which is a dictionary that's permanently sorted by value.
>>> data = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
>>> SkipDict(data)
{0: 0.0, 2: 1.0, 1: 2.0, 4: 3.0, 3: 4.0}
If you use keys(), values() or items() then you'll iterate in sorted order by value.
It's implemented using the skip list datastructure.
Of course, remember, you need to use OrderedDict because regular Python dictionaries don't keep the original order.
from collections import OrderedDict
a = OrderedDict(sorted(originalDict.items(), key=lambda x: x[1]))
If you do not have Python 2.7 or higher, the best you can do is iterate over the values in a generator function. (There is an OrderedDict for 2.4 and 2.6 here, but
a) I don't know about how well it works
and
b) You have to download and install it of course. If you do not have administrative access, then I'm afraid the option's out.)
def gen(originalDict):
    for x, y in sorted(zip(originalDict.keys(), originalDict.values()), key=lambda z: z[1]):
        yield (x, y)
#Yields as a tuple with (key, value). You can iterate with conditional clauses to get what you want.
for bleh, meh in gen(myDict):
    if bleh == "foo":
        print(myDict[bleh])
You can also print out every value
for bleh, meh in gen(myDict):
    print(bleh, meh)
Please remember to remove the parentheses after print if not using Python 3.0 or above
from django.utils.datastructures import SortedDict
def sortedDictByKey(self,data):
    """Sorted dictionary order by key"""
    sortedDict = SortedDict()
    if data:
        if isinstance(data, dict):
            sortedKey = sorted(data.keys())
            for k in sortedKey:
                sortedDict[k] = data[k]
    return sortedDict
Here is a solution using zip on d.values() and d.keys(). A few lines down this link (on Dictionary view objects) is:
This allows the creation of (value, key) pairs using zip(): pairs = zip(d.values(), d.keys()).
So we can do the following:
d = {'key1': 874.7, 'key2': 5, 'key3': 8.1}
d_sorted = sorted(zip(d.values(), d.keys()))
print d_sorted
# prints: [(5, 'key2'), (8.1, 'key3'), (874.7, 'key1')]
As pointed out by Dilettant, Python 3.6 will now keep the order! I thought I'd share a function I wrote that eases the sorting of an iterable (tuple, list, dict). In the latter case, you can sort either on keys or values, and it can take numeric comparison into account. Only for >= 3.6!
When you try using sorted on an iterable that holds e.g. strings as well as ints, sorted() will fail. Of course you can force string comparison with str(). However, in some cases you want to do actual numeric comparison where 12 is smaller than 20 (which is not the case in string comparison). So I came up with the following. When you want explicit numeric comparison you can use the flag num_as_num which will try to do explicit numeric sorting by trying to convert all values to floats. If that succeeds, it will do numeric sorting, otherwise it'll resort to string comparison.
Comments for improvement welcome.
def sort_iterable(iterable, sort_on=None, reverse=False, num_as_num=False):
    def _sort(i):
        # sort by 0 = keys, 1 = values, None for lists and tuples
        try:
            if num_as_num:
                if i is None:
                    _sorted = sorted(iterable, key=lambda v: float(v), reverse=reverse)
                else:
                    _sorted = dict(sorted(iterable.items(), key=lambda v: float(v[i]), reverse=reverse))
            else:
                # no numeric comparison requested: fall through to string comparison
                raise TypeError
        except (TypeError, ValueError):
            if i is None:
                _sorted = sorted(iterable, key=lambda v: str(v), reverse=reverse)
            else:
                _sorted = dict(sorted(iterable.items(), key=lambda v: str(v[i]), reverse=reverse))
        return _sorted

    if isinstance(iterable, list):
        sorted_list = _sort(None)
        return sorted_list
    elif isinstance(iterable, tuple):
        sorted_list = tuple(_sort(None))
        return sorted_list
    elif isinstance(iterable, dict):
        if sort_on == 'keys':
            sorted_dict = _sort(0)
            return sorted_dict
        elif sort_on == 'values':
            sorted_dict = _sort(1)
            return sorted_dict
        elif sort_on is not None:
            raise ValueError(f"Unexpected value {sort_on} for sort_on. When sorting a dict, use 'keys' or 'values'")
    else:
        raise TypeError(f"Unexpected type {type(iterable)} for iterable. Expected a list, tuple, or dict")
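The difference between string and numeric comparison that motivates num_as_num can be seen with plain sorted(). This is only an illustration with made-up values, not part of the function above:

```python
values = ['12', '3', '20']

# String comparison goes character by character, so '12' < '20' < '3'.
as_strings = sorted(values, key=str)
# Converting to float first compares the numbers themselves.
as_numbers = sorted(values, key=float)

print(as_strings)  # ['12', '20', '3']
print(as_numbers)  # ['3', '12', '20']
```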
I just learned a relevant skill from Python for Everybody.
You may use a temporary list to help you to sort the dictionary:
# Assume dictionary to be:
d = {'apple': 500.1, 'banana': 1500.2, 'orange': 1.0, 'pineapple': 789.0}
# Create a temporary list
tmp = []
# Iterate through the dictionary and append each tuple into the temporary list
for key, value in d.items():
    tmptuple = (value, key)
    tmp.append(tmptuple)
# Sort the list in ascending order
tmp = sorted(tmp)
print(tmp)
If you want to sort the list in descending order, simply change the original sorting line to:
tmp = sorted(tmp, reverse=True)
Using list comprehension, the one-liner would be:
# Assuming the dictionary looks like
d = {'apple': 500.1, 'banana': 1500.2, 'orange': 1.0, 'pineapple': 789.0}
# One-liner for sorting in ascending order
print (sorted([(v, k) for k, v in d.items()]))
# One-liner for sorting in descending order
print (sorted([(v, k) for k, v in d.items()], reverse=True))
Sample Output:
# Ascending order
[(1.0, 'orange'), (500.1, 'apple'), (789.0, 'pineapple'), (1500.2, 'banana')]
# Descending order
[(1500.2, 'banana'), (789.0, 'pineapple'), (500.1, 'apple'), (1.0, 'orange')]
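If the goal is a dictionary rather than a list of tuples, the sorted pairs can be fed back into dict(). A minimal sketch, assuming Python 3.7+ (or CPython 3.6), where dicts preserve insertion order:

```python
d = {'apple': 500.1, 'banana': 1500.2, 'orange': 1.0, 'pineapple': 789.0}

# Sort the (key, value) pairs by value, then rebuild a dict from them.
d_by_value = dict(sorted(d.items(), key=lambda kv: kv[1]))
print(list(d_by_value))  # ['orange', 'apple', 'pineapple', 'banana']
```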
Use ValueSortedDict from dicts:
from dicts.sorteddict import ValueSortedDict
d = {1: 2, 3: 4, 4:3, 2:1, 0:0}
sorted_dict = ValueSortedDict(d)
print sorted_dict.items()
[(0, 0), (2, 1), (1, 2), (4, 3), (3, 4)]
Iterate through a dict and sort it by its values in descending order:
$ python --version
Python 3.2.2
$ cat sort_dict_by_val_desc.py
dictionary = dict(siis = 1, sana = 2, joka = 3, tuli = 4, aina = 5)
for word in sorted(dictionary, key=dictionary.get, reverse=True):
    print(word, dictionary[word])
$ python sort_dict_by_val_desc.py
aina 5
tuli 4
joka 3
sana 2
siis 1
If your values are integers, and you use Python 2.7 or newer, you can use collections.Counter instead of dict. The most_common method will give you all items, sorted by the value.
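A short sketch of the Counter approach, reusing the sample words from the previous answer:

```python
from collections import Counter

counts = Counter(siis=1, sana=2, joka=3, tuli=4, aina=5)

# most_common() with no argument returns every (key, count) pair,
# ordered from the highest count to the lowest.
all_items = counts.most_common()
# Pass n to keep only the n highest counts.
top_two = counts.most_common(2)

print(all_items)  # [('aina', 5), ('tuli', 4), ('joka', 3), ('sana', 2), ('siis', 1)]
print(top_two)    # [('aina', 5), ('tuli', 4)]
```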
This works in 3.1.x:
import operator
# slovar is the dictionary being sorted
slovar_sorted=sorted(slovar.items(), key=operator.itemgetter(1), reverse=True)
print(slovar_sorted)
For the sake of completeness, I am posting a solution using heapq. Note that this method works for both numeric and non-numeric values.
>>> import heapq
>>> import operator
>>> x = {1: 2, 3: 4, 4:3, 2:1, 0:0}
>>> x_items = list(x.items())  # heapify needs a list; items() is a view on Python 3
>>> heapq.heapify(x_items)
>>> #To sort in reverse order
>>> heapq.nlargest(len(x_items),x_items, operator.itemgetter(1))
[(3, 4), (4, 3), (1, 2), (2, 1), (0, 0)]
>>> #To sort in ascending order
>>> heapq.nsmallest(len(x_items),x_items, operator.itemgetter(1))
[(0, 0), (2, 1), (1, 2), (4, 3), (3, 4)]
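heapq tends to be most useful when only the few largest or smallest items are needed rather than a full sort. A minimal sketch of that case, using the same dictionary:

```python
import heapq
import operator

x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}

# Take only the two items with the largest values; nlargest accepts any
# iterable, so no heapify call is needed here.
top_two = heapq.nlargest(2, x.items(), key=operator.itemgetter(1))
print(top_two)  # [(3, 4), (4, 3)]
```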

Renaming index values in multiindex dataframe

Creating my dataframe:
from pandas import *
from numpy.random import randn
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = MultiIndex.from_tuples(tuples, names=['first','second'])
data = DataFrame(randn(8,2),index=index,columns=['c1','c2'])
data
Out[68]:
                    c1        c2
first second
bar   one     0.833816 -1.529639
      two     0.340150 -1.818052
baz   one    -1.605051 -0.917619
      two    -0.021386 -0.222951
foo   one     0.143949 -0.406376
      two     1.208358 -2.469746
qux   one    -0.345265 -0.505282
      two     0.158928  1.088826
I would like to rename the "first" index values, such as "bar"->"cat", "baz"->"dog", etc. However, every example I have read either operates on a single-level index and/or loops through the entire index to effectively re-create it from scratch. I was thinking something like:
data = data.reindex(index={'bar':'cat','baz':'dog'})
but this does not work, nor do I really expect it to work on multiple indexes. Can I do such a replacement without looping through the entire dataframe index?
Begin edit
I am hesitant to update to 0.13 until release, so I used the following workaround:
index = data.index.tolist()
for r in xrange(len(index)):
    index[r] = (codes[index[r][0]], index[r][1])
index = pd.MultiIndex.from_tuples(index,names=data.index.names)
data.index = index
Here codes is a previously defined dictionary of code:string pairs. This actually isn't as big of a performance hit as I was expecting (it takes a couple of seconds to operate over ~1.1 million rows). It is not as pretty as a one-liner, but it does work.
End Edit
Use the set_levels method (new in version 0.13.0):
data.index.set_levels([[u'cat', u'dog', u'foo', u'qux'],
[u'one', u'two']], inplace=True)
yields
                    c1        c2
first second
cat   one    -0.289649 -0.870716
      two    -0.062014 -0.410274
dog   one     0.030171 -1.091150
      two     0.505408  1.531108
foo   one     1.375653 -1.377876
      two    -1.478615  1.351428
qux   one     1.075802  0.532416
      two     0.865931 -0.765292
To remap a level based on a dict, you could use a function such as this:
def map_level(df, dct, level=0):
    index = df.index
    index.set_levels([[dct.get(item, item) for item in names] if i==level else names
                      for i, names in enumerate(index.levels)], inplace=True)
dct = {'bar':'cat', 'baz':'dog'}
map_level(data, dct, level=0)
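A dict-based rename of a single level can also be sketched with DataFrame.rename and its level argument, which returns a new frame instead of modifying the index in place. This may be the easier route on later pandas releases, where set_levels no longer accepts inplace. The sketch assumes a pandas version whose rename supports level; the variable names renamed and first_labels are illustrative:

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
data = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['c1', 'c2'])

dct = {'bar': 'cat', 'baz': 'dog'}

# Rename only the labels of the 'first' level; labels missing from dct
# ('foo', 'qux') are left unchanged.
renamed = data.rename(index=dct, level='first')

first_labels = list(renamed.index.get_level_values('first').unique())
print(first_labels)  # ['cat', 'dog', 'foo', 'qux']
```

The original data frame keeps its old index; assign the result back (data = data.rename(...)) to replace it.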
Here's a runnable example:
import numpy as np
import pandas as pd
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = zip(*arrays)
index = pd.MultiIndex.from_tuples(tuples, names=['first','second'])
data = pd.DataFrame(np.random.randn(8,2),index=index,columns=['c1','c2'])
data2 = data.copy()
data.index.set_levels([[u'cat', u'dog', u'foo', u'qux'],
[u'one', u'two']], inplace=True)
print(data)
#                     c1        c2
# first second
# cat   one     0.939040 -0.748100
#       two    -0.497006 -1.185966
# dog   one    -0.368161  0.050339
#       two    -2.356879 -0.291206
# foo   one    -0.556261  0.474297
#       two     0.647973  0.755983
# qux   one    -0.017722  1.364244
#       two     1.007303  0.004337
def map_level(df, dct, level=0):
    index = df.index
    index.set_levels([[dct.get(item, item) for item in names] if i==level else names
                      for i, names in enumerate(index.levels)], inplace=True)
dct = {'bar':'wolf', 'baz':'rabbit'}
map_level(data2, dct, level=0)
print(data2)
#                      c1        c2
# first  second
# wolf   one     0.939040 -0.748100
#        two    -0.497006 -1.185966
# rabbit one    -0.368161  0.050339
#        two    -2.356879 -0.291206
# foo    one    -0.556261  0.474297
#        two     0.647973  0.755983
# qux    one    -0.017722  1.364244
#        two     1.007303  0.004337
The set_levels method was causing my new column names to be out of order. So I found a different solution that isn't very clean, but works well. The method is to print df.index (or equivalently df.columns) and then copy and paste the output with the desired values changed. For example:
print data.index
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']], labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
data.index = MultiIndex(levels=[['new_bar', 'new_baz', 'new_foo', 'new_qux'],
['new_one', 'new_two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
We can have full control over names by editing the labels as well. For example:
data.index = MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'],
['one', 'twooo', 'three', 'four',
'five', 'siz', 'seven', 'eit']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 3, 4, 5, 6, 7]],
names=['first', 'second'])
Note that in this example we have already done something like from pandas import MultiIndex or from pandas import *.
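In pandas 0.24 and later, the labels argument of the MultiIndex constructor is named codes. A sketch of the same construction under that assumption (new_index and first_rows are illustrative names):

```python
import pandas as pd

# Same levels as above; each code is a position into the matching level list.
new_index = pd.MultiIndex(
    levels=[['new_bar', 'new_baz', 'new_foo', 'new_qux'],
            ['new_one', 'new_two']],
    codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
    names=['first', 'second'],
)

first_rows = new_index.tolist()[:2]
print(first_rows)  # [('new_bar', 'new_one'), ('new_bar', 'new_two')]
```

Assigning new_index to data.index then relabels the rows, provided the index has the same length as the frame.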