Handle possible null values in python dictionary - python-2.7

I use the following code to count the number of occurrences of two values (1 and -1):
import numpy as np
a = np.empty(0, dtype=np.int)
tmp = [-1,1,1,1,1,1,-1, 1, -1]
a = np.append(a, tmp)
unique, counts = np.unique(a, return_counts=True)
r = dict(zip(unique, counts))
print r
if r.values()[0] > r.values()[1]:
    print r.keys()[0]
else:
    print r.keys()[1]
The problem is that tmp can sometimes be all 1s or all -1s, which causes the printing to fail. The possible solution I can think of is to add a null-like key with a value of zero. For instance, when tmp=[1,1,1,1], r should be {1: 4, -1: 0} and vice versa. How can I modify this code to do so?
Thank you

One trick, given that the input list/array contains only -1 and 1, is to use an offset array (offset by 1 so -1s become 0s and 1s become 2s) for binned counting with np.bincount, then slice with a step size of 2 to get the counts for -1 and 1 -
dict(zip([-1,1],np.bincount(a+1,minlength=3)[::2]))
Sample runs -
In [954]: a = np.array([-1,1,1,1,1,1,-1,1,-1])
In [955]: dict(zip([-1,1],np.bincount(a+1,minlength=3)[::2]))
Out[955]: {-1: 3, 1: 6}
In [956]: a = np.array([-1,-1,-1,-1])
In [957]: dict(zip([-1,1],np.bincount(a+1,minlength=3)[::2]))
Out[957]: {-1: 4, 1: 0}
In [958]: a = np.array([1,1,1,1])
In [959]: dict(zip([-1,1],np.bincount(a+1,minlength=3)[::2]))
Out[959]: {-1: 0, 1: 4}
If you just need to know which one of -1 or 1 has the bigger count, simply do -
np.bincount(a+1,minlength=3).argmax()-1
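As a quick sanity check of the argmax trick, continuing the sample runs above (these extra runs are illustrative, not part of the original answer):
In [960]: a = np.array([1,1,1,1])
In [961]: np.bincount(a+1,minlength=3)   # counts for -1, 0 and 1 after the offset
Out[961]: array([0, 0, 4])
In [962]: np.bincount(a+1,minlength=3).argmax()-1
Out[962]: 1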

As a bonus, say you have
>>> uvalues = [-1,1]
which stands for the list of values to count.
What about doing {uvalue:r.get(uvalue,0) for uvalue in uvalues}
Use case
>>> a = np.array([-1,-1, -1, -1])
>>> unique, counts = np.unique(a, return_counts=True)
>>> r = dict(zip(unique, counts))
>>> r
{-1: 4}
>>> {uvalue:r.get(uvalue,0) for uvalue in uvalues}
{1: 0, -1: 4}

Another natural (and fast) solution uses collections.Counter:
from collections import Counter
tmp = [1,1,1,1,1,1]
c=Counter({1:0,-1:0}) # init
c.update(tmp)
#Counter({-1: 0, 1: 6})
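With both keys guaranteed to be present, picking the majority value no longer depends on dictionary ordering; a minimal sketch of that follow-up step (the variable names are illustrative, not from the original answer):
from collections import Counter

tmp = [1, 1, 1, 1, 1, 1]
c = Counter({1: 0, -1: 0})   # seed both keys so neither can be missing
c.update(tmp)

majority_value, majority_count = c.most_common(1)[0]
print majority_value, majority_count   # -> 1 6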

Related

Best way to shift a list in Python?

I have a list of numbers, let's say :
my_list = [2, 4, 3, 8, 1, 1]
From this list, I want to obtain a new list. The new list would start at the maximum value and run to the end, with the first part (from the beginning up to just before the maximum) appended after it, like this:
my_new_list = [8, 1, 1, 2, 4, 3]
(basically it corresponds to a horizontal graph shift...)
Is there a simple way to do so? :)
Apply these as many times as you want.
To the left:
my_list.append(my_list.pop(0))
To the right:
my_list.insert(0, my_list.pop())
How about something like this:
max_idx = my_list.index(max(my_list))
my_new_list = my_list[max_idx:] + my_list[0:max_idx]
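With my_list = [2, 4, 3, 8, 1, 1] from the question, max_idx is 3, so my_new_list comes out as [8, 1, 1, 2, 4, 3], exactly the shift the question asks for.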
Alternatively you can do something like the following,
import itertools

def shift(l, n):
    return itertools.islice(itertools.cycle(l), n, n + len(l))

my_list = [2, 4, 3, 8, 1, 1]
list(shift(my_list, 3))
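With the sample list this also produces [8, 1, 1, 2, 4, 3]; note that shift returns an iterator (an islice object), which is why it is wrapped in list(...).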
Elaborating on Yasc's solution for shifting the order of the list values, here's a way to rotate the list so it starts with the maximum value:
# Find the max value:
max_value = max(my_list)
# Move the last value from the end to the beginning,
# until the max value is the first value:
while my_list[0] != max_value:
    my_list.insert(0, my_list.pop())
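With my_list = [2, 4, 3, 8, 1, 1], this loop rotates the list one element at a time until it reads [8, 1, 1, 2, 4, 3].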

Calculate two dimensional pairwise distance on a large numpy three dimensional array

I have a numpy array of 3 million points in the form [pt_id, x, y, z]. The goal is to return all pairs of points whose Euclidean distance lies between two numbers, min_d and max_d.
The Euclidean distance is computed on x and y only, not on z. However, I'd like to preserve the array with pt_id_from, pt_id_to, distance attributes.
I'm using scipy's dist to calculate the distances:
import numpy as np
import scipy.spatial.distance
coords_arr = np.array([['pt1', 2452130.000, 7278106.000, 25.000],
                       ['pt2', 2479539.000, 7287455.000, 4.900],
                       ['pt3', 2479626.000, 7287458.000, 10.000],
                       ['pt4', 2484097.000, 7292784.000, 8.800],
                       ['pt5', 2484106.000, 7293079.000, 7.300],
                       ['pt6', 2484095.000, 7292891.000, 11.100]])
dists = scipy.spatial.distance.pdist(coords_arr[:,1:3], 'euclidean')
np.savetxt('test.out', scipy.spatial.distance.squareform(dists), delimiter=',')
What should I do to return an array of form: [pt_id_from, pt_id_to, distance]?
You simply create a new array from the data by looping through all the possible combinations. The itertools module is excellent for this.
import itertools

n = coords_arr.shape[0]                          # number of points
D = scipy.spatial.distance.squareform(dists)     # distance matrix

data = []
for i, j in itertools.combinations(range(n), 2):
    pt_a = coords_arr[i, 0]
    pt_b = coords_arr[j, 0]
    d_ab = D[i, j]
    data.append([pt_a, pt_b, d_ab])

result_arr = np.array(data)
If memory is a problem, you might want to change the distance lookup from using the huge matrix D to looking up the value directly in dists using the i and j index.
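For reference, pdist stores the distance for the pair (i, j) with i < j at a fixed offset in the condensed array, so the lookup can skip squareform entirely; a minimal sketch of that idea (the helper condensed_index is illustrative, not part of the original answer):
def condensed_index(n, i, j):
    # Position of the pair (i, j), with i < j, inside the condensed
    # distance array returned by scipy.spatial.distance.pdist for n points.
    return n * i - i * (i + 1) // 2 + (j - i - 1)

# then: d_ab = dists[condensed_index(n, i, j)] instead of d_ab = D[i, j]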
Well, ['pt1', 'pt2', distance_as_number] is not exactly possible. The closest you can get with mixed datatypes is a structured array but then you can't do things like result[:2,0]. You'll have to index field names and array indices separately like: result[['a','b']][0].
Here is my solution:
import numpy as np
import scipy.spatial.distance
coords_arr = np.array([['pt1', 2452130.000, 7278106.000, 25.000],
                       ['pt2', 2479539.000, 7287455.000, 4.900],
                       ['pt3', 2479626.000, 7287458.000, 10.000],
                       ['pt4', 2484097.000, 7292784.000, 8.800],
                       ['pt5', 2484106.000, 7293079.000, 7.300],
                       ['pt6', 2484095.000, 7292891.000, 11.100]])
dists = scipy.spatial.distance.pdist(coords_arr[:,1:3], 'euclidean')
# Create a shortcut for `coords_arr.shape[0]` which is basically
# the total amount of points, hence `n`
n = coords_arr.shape[0]
# `a` and `b` contain the indices of the points which were used to compute the
# distances in dists. In this example:
# a = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
# b = [1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5, 5]
a = np.arange(n).repeat(np.arange(n-1, -1, -1))
b = np.hstack([range(x, n) for x in xrange(1, n)])
min_d = 1000
max_d = 10000
# Find out which distances are in range.
in_range = np.less_equal(min_d, dists) & np.less_equal(dists, max_d)
# Define the datatype of the structured array which will be the result.
dtype = [('a', '<f8', (3,)), ('b', '<f8', (3,)), ('dist', '<f8')]
# Create an empty array. We fill it later because it makes the code cleaner.
# Its size is given by the sum over `in_range` which is possible
# since True and False are equivalent to 1 and 0.
result = np.empty(np.sum(in_range), dtype=dtype)
# Fill the resulting array.
result['a'] = coords_arr[a[in_range], 1:4]
result['b'] = coords_arr[b[in_range], 1:4]
result['dist'] = dists[in_range]
print(result)
# In case you don't want a structured array at all, this is what you can do:
result = np.hstack([coords_arr[a[in_range], 1:],
                    coords_arr[b[in_range], 1:],
                    dists[in_range, None]]).astype('<f8')
print(result)
The structured array:
[([2479539.0, 7287455.0, 4.9], [2484097.0, 7292784.0, 8.8], 7012.389393067102)
([2479539.0, 7287455.0, 4.9], [2484106.0, 7293079.0, 7.3], 7244.7819152821985)
([2479539.0, 7287455.0, 4.9], [2484095.0, 7292891.0, 11.1], 7092.75912462844)
([2479626.0, 7287458.0, 10.0], [2484097.0, 7292784.0, 8.8], 6953.856268287403)
([2479626.0, 7287458.0, 10.0], [2484106.0, 7293079.0, 7.3], 7187.909362255481)
([2479626.0, 7287458.0, 10.0], [2484095.0, 7292891.0, 11.1], 7034.873843929257)]
The ndarray:
[[2479539.0, 7287455.0, 4.9, 2484097.0, 7292784.0, 8.8, 7012.3893],
[2479539.0, 7287455.0, 4.9, 2484106.0, 7293079.0, 7.3, 7244.7819],
[2479539.0, 7287455.0, 4.9, 2484095.0, 7292891.0, 11.1, 7092.7591],
[2479626.0, 7287458.0, 10.0, 2484097.0, 7292784.0, 8.8, 6953.8562],
[2479626.0, 7287458.0, 10.0, 2484106.0, 7293079.0, 7.3, 7187.9093],
[2479626.0, 7287458.0, 10.0, 2484095.0, 7292891.0, 11.1, 7034.8738]]
You can use np.where to get the indices of distances within a range, then generate a new list in your format, filtering out duplicate pairs. Like this:
>>> import scipy.spatial.distance
>>> import numpy as np
>>> coords_arr = np.array([['pt1', 2452130.000, 7278106.000, 25.000],
... ['pt2', 2479539.000, 7287455.000, 4.900],
... ['pt3', 2479626.000, 7287458.000, 10.000],
... ['pt4', 2484097.000, 7292784.000, 8.800],
... ['pt5', 2484106.000, 7293079.000, 7.300],
... ['pt6', 2484095.000, 7292891.000, 11.100]])
>>>
>>> dists = scipy.spatial.distance.pdist(coords_arr[:,1:3], 'euclidean')
>>> dists = scipy.spatial.distance.squareform(dists)
>>> x, y = np.where((dists >= 8000) & (dists <= 30000))
>>> [(coords_arr[x[i]][0], coords_arr[y[i]][0], dists[y[i]][x[i]]) for i in xrange(len(x)) if x[i] < y[i]]
[('pt1', 'pt2', 28959.576688895162), ('pt1', 'pt3', 29042.897927032005)]

How do i check for duplicate values present in a Dictionary?

I want to write a function that takes a dictionary as the input and returns a list of keys.
The list must contain only those keys whose values are unique in the dictionary.
So, this is what I have done.
bDict = {}
for key, value in aDict.items():
    if bDict.has_key(value) == False:
        bDict[value] = key
    else:
        bDict.pop(value, None)
This is the output :
>>> aDict.keys()
Out[4]: [1, 3, 6, 7, 8, 10]
>>> aDict.values()
Out[5]: [1, 2, 0, 0, 4, 0]
>>> bDict.keys()
Out[6]: [0, 1, 2, 4]
>>> bDict.values()
Out[7]: [10, 1, 3, 8]
But the expected output for bDict.values() should be: [1, 3, 8]
This may help.
CODE
aDict = { 1:1, 3:2, 6:0, 7:0, 8:4, 10:0, 11:0}
bDict = {}
for i, j in aDict.items():
    if j not in bDict:
        bDict[j] = [i]
    else:
        bDict[j].append(i)

print map(lambda x: x[0], filter(lambda x: len(x) == 1, bDict.values()))
OUTPUT
[1, 3, 8]
So it appears you're creating a new dictionary with the keys and values inverted, keeping pairs where the value is unique. You can figure out which of the items are unique first, then build a dictionary off of that.
def distinct_values(d):
    from collections import Counter
    counts = Counter(d.itervalues())
    return {v: k for k, v in d.iteritems() if counts[v] == 1}
This yields the following result:
>>> distinct_values({ 1:1, 3:2, 6:0, 7:0, 8:4, 10:0 })
{1: 1, 2: 3, 4: 8}
Here is a solution (with two versions of aDict, to test a case which failed in another solution):
#aDict = { 1:1, 3:2, 6:0, 7:0, 8:4, 10:0}
aDict = { 1:1, 3:2, 6:0, 7:0, 8:4, 10:0, 11:2}
seenValues = {}
uniqueKeys = set()
for aKey, aValue in aDict.items():
    if aValue not in seenValues:
        # Store the key of the value, and assume it is unique
        seenValues[aValue] = aKey
        uniqueKeys.add(aKey)
    elif seenValues[aValue] in uniqueKeys:
        # The value has been seen before, and the assumption of
        # it being unique was wrong, so remove it
        uniqueKeys.remove(seenValues[aValue])
        print "Remove non-unique key/value pair: {%d, %d}" % (aKey, aValue)
    else:
        print "Non-unique key/value pair: {%d, %d}" % (aKey, aValue)
print "Unique keys: ", sorted(uniqueKeys)
And this produces the output:
Remove non-unique key/value pair: {7, 0}
Non-unique key/value pair: {10, 0}
Remove non-unique key/value pair: {11, 2}
Unique keys: [1, 8]
Or with original version of aDict:
Remove non-unique key/value pair: {7, 0}
Non-unique key/value pair: {10, 0}
Unique keys: [1, 3, 8]
As a python 2.7 one-liner,
[k for k,v in aDict.iteritems() if aDict.values().count(v) == 1]
Note that the above:
- calls aDict.values() many times, once for each entry in the dictionary, and
- calls aDict.values().count(v) multiple times for each replicated value.
This is not a problem if the dictionary is small. If the dictionary isn't small, the creation and destruction of those duplicative lists and the duplicative calls to count() may be costly. It may help to cache the value of aDict.values(), and it may also help to create a dictionary that maps each value in the dictionary to its number of occurrences.
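A minimal sketch of that caching idea, counting each value once with collections.Counter (the variable names are illustrative):
from collections import Counter

value_counts = Counter(aDict.itervalues())   # value -> number of occurrences
unique_keys = [k for k, v in aDict.iteritems() if value_counts[v] == 1]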

How to count the number of zeros in Python?

My code is currently written as:
convert = {0:0,1:1,2:2,3:3,4:0,5:1,6:2,7:1}
rows = [[convert[random.randint(0,7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - rows.count(0)
print numgood
>> 25
It always comes out as 25, so it's not just that rows contains no 0's.
Have you printed rows?
It's [[0, 1, 0, 0, 2], [1, 2, 0, 1, 2], [3, 1, 1, 1, 1], [1, 0, 0, 1, 0], [0, 3, 2, 0, 1]], so you have a nested list there.
If you want to count the number of 0's in those nested lists, you could try:
import random
convert = {0:0, 1:1, 2:2, 3:3, 4:0, 5:1, 6:2, 7:1}
rows = [[convert[random.randint(0, 7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - sum(e.count(0) for e in rows)
print numgood
Output:
18
rows doesn't contain any zeroes; it contains lists, not integers.
>>> row = [1,2,3]
>>> type(row)
<type 'list'>
>>> row.count(2)
1
>>> rows = [[1,2,3],[4,5,6]]
>>> rows.count(2)
0
>>> rows.count([1,2,3])
1
To count the number of zeroes in any of the lists in rows, you could use a generator expression:
>>> rows = [[1,2,3],[4,5,6], [0,0,8]]
>>> sum(x == 0 for row in rows for x in row)
2
You could also use numpy:
import numpy as np
import random
convert = {0:0,1:1,2:2,3:3,4:0,5:1,6:2,7:1}
rows = [[convert[random.randint(0,7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - np.count_nonzero(rows)
print numgood
Output:
9

Pandas Dataframe ValueError: Shape of passed values is (X, ), indices imply (X, Y)

I am getting an error and I'm not sure how to fix it.
The following seems to work:
import numpy as np
import pandas

def random(row):
    return [1,2,3,4]

df = pandas.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df.apply(func = random, axis = 1)
and my output is:
[1,2,3,4]
[1,2,3,4]
[1,2,3,4]
[1,2,3,4]
However, when I change one of the columns to a value such as 1 or None:
def random(row):
    return [1,2,3,4]

df = pandas.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df['E'] = 1
df.apply(func = random, axis = 1)
I get the error:
ValueError: Shape of passed values is (5,), indices imply (5, 5)
I've been wrestling with this for a few days now and nothing seems to work. What is interesting is that when I change
def random(row):
    return [1,2,3,4]
to
def random(row):
    print [1,2,3,4]
everything seems to work normally.
This question is a clearer way of asking this question, which I feel may have been confusing.
My goal is to compute a list for each row and then create a column out of that.
EDIT: I originally start with a dataframe that has one column. I add 4 columns in 4 different apply steps, and then when I try to add another column I get this error.
If your goal is to add a new column to the DataFrame, just write your function as a function returning a scalar value (not a list), something like this:
>>> def random(row):
... return row.mean()
and then use apply:
>>> df['new'] = df.apply(func = random, axis = 1)
>>> df
A B C D new
0 0.201143 -2.345828 -2.186106 -0.784721 -1.278878
1 -0.198460 0.544879 0.554407 -0.161357 0.184867
2 0.269807 1.132344 0.120303 -0.116843 0.351403
3 -1.131396 1.278477 1.567599 0.483912 0.549648
4 0.288147 0.382764 -0.840972 0.838950 0.167222
I don't know if it is possible for your new column to contain lists, but it is definitely possible for it to contain tuples ((...) instead of [...]):
>>> def random(row):
... return (1,2,3,4,5)
...
>>> df['new'] = df.apply(func = random, axis = 1)
>>> df
A B C D new
0 0.201143 -2.345828 -2.186106 -0.784721 (1, 2, 3, 4, 5)
1 -0.198460 0.544879 0.554407 -0.161357 (1, 2, 3, 4, 5)
2 0.269807 1.132344 0.120303 -0.116843 (1, 2, 3, 4, 5)
3 -1.131396 1.278477 1.567599 0.483912 (1, 2, 3, 4, 5)
4 0.288147 0.382764 -0.840972 0.838950 (1, 2, 3, 4, 5)
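If the goal really is to turn the per-row list into several new columns, one common pattern (not shown in the answers above, so treat this as a sketch) is to return a pandas Series from the applied function and concatenate the resulting DataFrame back on:
>>> import pandas as pd
>>> def random(row):
...     return pd.Series([1, 2, 3, 4])
...
>>> new_cols = df.apply(func = random, axis = 1)   # one column per list element
>>> df = pd.concat([df, new_cols], axis = 1)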
I use the code below and it works just fine:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(your_data), columns=columns)