Integer linear program in PuLP package generates an error

I'm trying to run sample code for a transportation problem in a Jupyter notebook, but it generates the error
TypeError: list indices must be integers or slices, not str. What is the problem here, and how do I solve it? Thanks!
from pulp import *
# Creates a list of all the supply nodes
Warehouses = ["A", "B"]
# Creates a dictionary for the number of units of supply for each supply node
supply = {"A": 1000,
          "B": 4000}
# Creates a list of all demand nodes
Bars = ["1", "2", "3", "4", "5"]
# Creates a dictionary for the number of units of demand for each demand node
demand = {"1": 500,
          "2": 900,
          "3": 1800,
          "4": 200,
          "5": 700}
costs = [  # Bars
    # 1  2  3  4  5
    [2, 4, 5, 2, 1],  # A  Warehouses
    [3, 1, 3, 2, 3]   # B
]
# Creates the prob variable to contain the problem data
prob = LpProblem("Beer Distribution Problem", LpMinimize)
# Creates a list of tuples containing all the possible routes for transport
Routes = [(w, b) for w in Warehouses for b in Bars]
# A dictionary called route_vars is created to contain the referenced variables (the routes)
route_vars = LpVariable.dicts("Route", (Warehouses, Bars), 0, None, LpInteger)
# The objective function is added to prob first
prob += lpSum([route_vars[w][b] * costs[w][b] for (w, b) in Routes]), "Sum of Transporting Costs"
# The supply maximum constraints are added to prob for each supply node (warehouse)
for w in Warehouses:
    prob += lpSum([route_vars[w][b] for b in Bars]) <= supply[w], "Sum of Products out of Warehouse %s" % w
# The demand minimum constraints are added to prob for each demand node (bar)
for b in Bars:
    prob += lpSum([route_vars[w][b] for w in Warehouses]) >= demand[b], "Sum of Products into Bars %s" % b

This is more of a basic Python question than a PuLP question.
w and b are strings, so in your code you are evaluating costs['A']['1']. If you type this in directly, you see the same error message. To be able to use string indices, you need to use a dict instead of a list (array).
Solution: make costs a dict
One way to do this is:
costs = {'A': {'1': 2, '2': 4, '3': 5, '4': 2, '5': 1},
         'B': {'1': 3, '2': 1, '3': 3, '4': 2, '5': 3}}
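If you prefer to keep the original list layout, here is a minimal sketch (assuming the Warehouses and Bars lists defined above) that builds the same nested dict from it:
costs_list = [[2, 4, 5, 2, 1],   # A
              [3, 1, 3, 2, 3]]   # B
costs = {w: dict(zip(Bars, row)) for w, row in zip(Warehouses, costs_list)}
For what it's worth, PuLP's own beer-distribution example uses its makeDict helper to do essentially the same conversion.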

Single endpoint POST of an array with multiple types of dictionaries

I am creating RESTful endpoints to support a frontend payload.
The payload is an order that can contain both a build-your-own dish and a ready-made single dish.
Problem:
The frontend developer wants to send everything in a single POST. That means the submitted list will contain two types of dictionaries:
one for build-your-own and one for the ready-made single dish.
IMO:
He could POST twice, once for each type of payload. That way each endpoint does one thing, and I prefer that approach.
He has only one reason for POSTing everything to a single endpoint.
Question:
What is your best practice for this sort of problem?
Build Your Own Payload:
In short I call it BYO.
1. base_bowl dictates the size and price of the item.
2. base_bowl also determines the allowed number of fishes, toppings, and sauces, because each base_bowl size (S, M, or L) has a different quota.
For example:
Size S can have 1 scoop of fish (size S) and 2 scoops of toppings (size S).
Size M can have 2 scoops of fish (size M) and 3 scoops of toppings (size M). If the customer would like to add more than the quota, he must add it in extra_fishes / extra_toppings.
Items are referenced by Price id, since the quantity is determined by the number of members in each list.
{
    "base_bowl": salad.id,  # require=True, Price id
    "fishes": [salmon.id, tuna.id],
    "extra_fishes": [tofu.id],
    "toppings": [tamago.id, mango.id],
    "extra_toppings": [rambutan.id],
    "premium_toppings": [ikura.id],
    "sauces": [shoyu.id, spicy_kimchi.id],
    "extra_sauces": [],
    "sprinkles": [sesame.id, fried_shalots.id],
    "dish_order": 1,  # require=True
    "note": {
        'msg': 'eat here',
    },
}
The backend will validate the input and INSERT it into Order and OrderItem.
Ready-Made Dish:
This is very straightforward because it has no implicit logic like BYO; it just adds OrderItems to the Order.
The Menu id, size, and qty determine the price, because the customer is free to choose:
{
    'order_items': [
        {
            'menu_id': has_poink_menu.id,
            'size': Price.MenuSize.XL,  # 27, 37, 47, 52
            'qty': 2,  # amount = 52 * 2
        },
        {
            'menu_id': no_poink_menu.id,
            'size': Price.MenuSize.L,  # 20, 30, 40, 45
            'qty': 1  # amount = 40 * 1
        }
    ]
}
My answer is opinionated, but to me a RESTful design stays much clearer when endpoints are specific and well defined. So in your case there might be a BYODishViewSet and a ReadyMadeDishViewSet mapped to /api/byodish/ and /api/readymadedish/.
However, if this is part of a larger single model, say an Order model, then you may want to consider using a nested (writable) serializer to wrap up an Order as a single API request-response.
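As a rough sketch of that second option, assuming Django REST Framework and the hypothetical Order/OrderItem models from the question (all field names below are illustrative, not a definitive implementation):
from rest_framework import serializers

class OrderItemSerializer(serializers.Serializer):
    # illustrative fields only; a real serializer would mirror your OrderItem model
    menu_id = serializers.IntegerField()
    size = serializers.CharField(required=False)
    qty = serializers.IntegerField(default=1)

class OrderSerializer(serializers.Serializer):
    order_items = OrderItemSerializer(many=True)

    def create(self, validated_data):
        # one Order row plus one OrderItem row per nested entry
        items = validated_data.pop('order_items')
        order = Order.objects.create(**validated_data)      # hypothetical Order model
        for item in items:
            OrderItem.objects.create(order=order, **item)   # hypothetical OrderItem model
        return order
The nested serializer validates the whole payload in one request-response, which is the trade-off against the two-endpoint design above.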

How can I remove the indices of non-max values that correspond to duplicate values of a separate list from both lists?

I have two lists, the first of which represents times of observation and the second of which represents the observed values at those times. I am trying to find the maximum observed value and the corresponding time given a rolling window of various length. For example-sake, here are the two lists.
# observed values
linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]
# times that correspond to observed values
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]
# actual dataset is of size ~ 11,000
The missing times (e.g., 3.0) correspond to an observed value of zero, whereas duplicate times correspond to multiple observations at the same floored time. Since my window will be rolling over time_count (e.g., max value in the first 2 hours, the next 2 hours, the 2 hours after that; max value in the first 4 hours, the next 4 hours, ...), I plan to use an array-reshaping routine. However, it's important to set everything up properly beforehand, which entails finding the maximum value among duplicate times. To solve this problem, I tried the code just below.
def list_duplicates(data_list):
    seen = set()
    seen_add = seen.add
    seen_twice = set(x for x in data_list if x in seen or seen_add(x))
    return list(seen_twice)

# check for duplicate values
dups = list_duplicates(time_count)
print(dups)
>> [8.0, 10.0]
# get index of duplicates
for dup in dups:
    print(time_count.index(dup))
>> 2
>> 4
When checking for the index of the duplicates, it appears that this code will only return the index of the first occurrence of the duplicate value. I also tried using OrderedDict via module collections for reasons concerning code efficiency/speed, but dictionaries have a similar problem. Given duplicate keys for non-duplicate observation values, the first instance of the duplicate key and corresponding observation value is kept while all others are dropped from the dict. Per this SO post, my second attempt is just below.
for dup in dups:
    indexes = [i for i, x in enumerate(time_count) if x == dup]
print(indexes)
>> [4, 5, 6]  # indices correspond to duplicate time 10.0 but not duplicate time 8.0
I should be getting [2, 3] for time_count = 8.0 and [4, 5, 6] for time_count = 10.0. Among the duplicate time_counts, 475.2 is the max linspeed corresponding to duplicate time_count 8.0, and 400.9 is the max linspeed corresponding to duplicate time_count 10.0, meaning that the other linspeeds at the leftover indices of the duplicate time_counts would be removed.
I'm not sure what else to try. How can I adapt this (or find a new approach) to find all of the indices that correspond to duplicate values in an efficient manner? Any advice would be appreciated. (PS - I added numpy as a tag because I think there is a way to do this via numpy that I haven't figured out yet.)
Without going into the details of how to implement an efficient rolling-window-maximum filter: reducing the duplicate values can be seen as a grouping problem, for which the numpy_indexed package (disclaimer: I am its author) provides efficient and simple solutions:
import numpy_indexed as npi
unique_time, unique_speed = npi.group_by(time_count).max(linspeed)
For large input datasets (i.e., where it matters), this should be a lot faster than any non-vectorized solution. Memory consumption is linear and performance is in general O(N log N); but since time_count appears to be sorted already, performance should be linear too.
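For the sample lists above, this grouping should yield (matching the vectorized result in the plain-numpy answer below):
unique_time
Out[]: array([ 4.,  6.,  8., 10., 14., 16.])
unique_speed
Out[]: array([280. , 275. , 475.2, 400.9, 323.8, 289.7])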
OK, if you want to do this with numpy, the best approach is to turn both of your lists into arrays:
import numpy as np
l = np.array(linspeed)
tc = np.array(time_count)
Now, finding unique times is just an np.unique call:
u, i, c = np.unique(tc, return_inverse = True, return_counts = True)
u
Out[]: array([ 4., 6., 8., 10., 14., 16.])
i
Out[]: array([0, 1, 2, 2, 3, 3, 3, 4, 5], dtype=int32)
c
Out[]: array([1, 1, 2, 3, 1, 1])
Now you can either build your maximums with a for loop
m = np.array([np.max(l[i == j]) if c[j] > 1 else l[i == j][0] for j in range(u.size)])
m
Out[]: array([280. , 275. , 475.2, 400.9, 323.8, 289.7])
Or try some 2d method. This could be faster, but it would need to be optimized. This is just the basic idea.
np.max(np.where(i[None, :] == np.arange(u.size)[:, None], linspeed, 0),axis = 1)
Out[]: array([ 280. , 275. , 475.2, 400.9, 323.8, 289.7])
Now your m and u vectors are the same length and include the output you want.
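Another vectorized option, offered here only as a sketch built on the same u, i, and l arrays, is np.maximum.at, which accumulates a per-group maximum in place without an explicit Python loop:
# -inf is the identity element for max, so it is a safe initial value
m2 = np.full(u.size, -np.inf)
np.maximum.at(m2, i, l)   # m2 should now equal array([280. , 275. , 475.2, 400.9, 323.8, 289.7])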

How do I include a list type feature in sklearn.svm.libsvm.fit() classifier?

I'm trying to loop through a number of text documents and create a feature set by recording:
Positions of each keyphrase in the text (as a list)
Part of speech of keyphrase
Length of each keyphrase (number of words in it)
Frequency of each keyphrase
Code snippet for extracting features:
#Take list of Keywords
keyword_list = [line.split(':')[1].lower().strip() for line in keywords.splitlines() if ':' in line]
#Position
position_list = [[m.start()/float(len(document)) for m in re.finditer(re.escape(kw), document, flags=re.IGNORECASE)] for kw in keyword_list]
#Part of Speech
pos_list = []
for key in keyword_list:
    pos_list.append([pos for w, pos in nltk.pos_tag(nltk.word_tokenize(key))])
#Length of each keyword
len_list = [len(k.split(' ')) for k in keyword_list]
#Text Frequency
freq_list = [len(pos)/float(len(document)) for pos in position_list]

target.extend(keyword_list)
for i in range(0, len(keyword_list)):
    data.append([position_list[i], pos_list[i], len_list[i], freq_list[i]])
Where
target : list of keywords
data : list of features
I passed this through a classifier :
from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data,target,test_size=0.25,random_state = 42)
import numpy as np
X_train = np.array(X_train)
y_train = np.array(y_train)
from sklearn import svm
cls = svm.SVC(gamma=0.001,C=100) # Parameter values Matter!
cls.fit(X_train,y_train)
predictions = cls.predict(X_test)
But I get an error :
Traceback (most recent call last):
  File "supervised_3.py", line 113, in <module>
    cls.fit(X_train,y_train)
  File "/Library/Python/2.7/site-packages/sklearn/svm/base.py", line 150, in fit
    X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence
So, I removed all the list items by changing
data.append([position_list[i],pos_list[i],len_list[i],freq_list[i]])
to
data.append([len_list[i],freq_list[i]])
It worked.
But I need to include position_list and pos_list
I thought it wasn't working because these 2 are lists. So, I tried converting them to arrays :
data.append([np.array(position_list[i]),np.array(pos_list[i]),len_list[i],freq_list[i]])
but I still get the same error.
In the last for loop of the feature extraction code you are trying to append to data a list of four elements, namely position_list[i], pos_list[i], len_list[i], freq_list[i]. The problem is that the first two elements are lists themselves, but individual features have to be scalars (this is why the issue is not solved by converting the sublists to numpy arrays). Each of them requires a different workaround:
position_list[i]
This is a list of float numbers. You could replace this list by some statistics computed from it, for example the mean and the standard deviation.
pos_list[i]
This is a list of tags extracted from the list of tuples of the form (token, tag)* yielded by nltk.pos_tag. The tags (which are strings) can be converted into numbers in a straightforward way by counting their number of occurrences. To keep things simple, I will just add the frequency of 'NN' and 'NNS' tags**.
To get your code working you just need to change the last for loop to:
for i in range(0, len(keyword_list)):
    positions_i = position_list[i]
    tags_i = pos_list[i]
    len_tags_i = float(len(tags_i))
    m = np.mean(positions_i)
    s = np.std(positions_i)
    nn = tags_i.count('NN')/len_tags_i
    nns = tags_i.count('NNS')/len_tags_i
    data.append([m, s, nn, nns, len_list[i], freq_list[i]])
By doing so the resulting feature vector becomes 6-dimensional. Needless to say, you could use a higher or lower number of statistics and/or tag frequencies, or even a different tagset.
* The identifiers w,pos you use in the for loop that creates pos_list are a bit misleading.
** You could utilize a collections.Counter to count the number of occurrences of each tag more efficiently.
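As a small sketch of that Counter suggestion, reusing the tags_i and len_tags_i names from the loop above:
from collections import Counter

tag_counts = Counter(tags_i)          # counts every tag in one pass
nn = tag_counts['NN'] / len_tags_i    # a Counter returns 0 for missing tags
nns = tag_counts['NNS'] / len_tags_i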

For an information retrieval course using Python, accessing a given tf-idf weight

I am working on a Python program, and this is what I am trying to achieve with my code: return a dict mapping doc_id to length, computed as sqrt(sum(w_i**2)), where w_i is the tf-idf weight for each term in the document.
E.g., in the sample index below, document 0 has two terms 'a' (with tf-idf weight 3) and 'b' (with tf-idf weight 4). Its length is therefore 5 = sqrt(9 + 16).
>>> lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0,4]]})
>>> lengths[0]
5.0
The code I have is this:
templist = []
for iter in index.values():
    templist.append(iter)
d = defaultdict(list)
for i, l in templist[1]:
    d[i].append(l)
lent = defaultdict()
for m in d:
    lo = math.sqrt(sum(lent[m]**2))
return lo
So, if I'm understanding you correctly, we have to transform the input dictionary:
ind = {'a': [[1, 3]], 'b': [[1, 4]]}
to the output dictionary:
{1: 5}
where the 5 is calculated as the Euclidean length of the value portion of the input dictionary (the vector [3, 4] in this case). Correct?
Given that information, the answer becomes a bit more straightforward:
from math import sqrt

def calculate_length(ind):
    # First, let's transform the dictionary into a list of [doc_id, tl_idf] pairs: [[doc_id_1, tl_idf_1], ...]
    data = [entry[0] for entry in ind.itervalues()]  # use just ind.values() in Python 3.x
    # Next, let's split that list into two, one for doc_ids, one for tl_idfs
    doc_ids, tl_idfs = zip(*data)
    # We can just assume that all the doc_ids are the same. You could check that here if you wanted.
    doc_id = doc_ids[0]
    # Next, we calculate the length as per our formula
    length = sqrt(sum(t**2 for t in tl_idfs))
    # Finally, we return the output dictionary
    return {doc_id: length}
Example:
>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1:5.0}
There are a couple of places in here where you could optimize this to remove the intermediary lists (this method can be two lines of operations and a return), but I'll leave that to you to find out since this is a homework assignment. I also hope you take the time to actually understand what this code does, rather than just copying it wholesale.
Also note that this answer makes the very large assumption that all doc_id values are the same, and that there will only ever be a single doc_id, tl_idf list at each key in the dictionary! If that's not true, then your transform becomes more complicated. But you did not provide sample input nor a textual explanation indicating that's the case (though, based on the data structure, I'd think it quite likely).
Update
In fact, it's really bothering me because I definitely think that's the case. Here is a version that solves the more complex case:
from itertools import chain
from collections import defaultdict
from math import sqrt

def calculate_length(ind):
    # We want to transform this first into a dict of {doc_id: [tl_idf_a, ...]}
    # First we transform it into a generator of ([doc_id, tl_idf], ...)
    tf_gen = chain.from_iterable(ind.itervalues())
    # which we then use to generate our transformed dictionary
    tf_dict = defaultdict(list)
    for doc_id, tl_idf in tf_gen:
        tf_dict[doc_id].append(tl_idf)
    # Now we proceed mostly as before, but we can do it in one line
    return dict((doc_id, sqrt(sum(t**2 for t in tl_idfs))) for doc_id, tl_idfs in tf_dict.iteritems())
Example use:
>>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1: 5.0}
>>> calculate_length({'a':[ [1,3],[2,3] ], 'b': [ [1,4 ], [2,1] ] })
{1: 5.0, 2: 3.1622776601683795}
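If you are on Python 3, here is a minimal sketch of the same multi-document computation (the iter* methods are gone, so values() and items() are used instead):
from collections import defaultdict
from math import sqrt

def calculate_length(ind):
    # group the tf-idf weights by doc_id across all terms
    weights = defaultdict(list)
    for postings in ind.values():
        for doc_id, w in postings:
            weights[doc_id].append(w)
    # Euclidean length per document
    return {doc_id: sqrt(sum(w ** 2 for w in ws)) for doc_id, ws in weights.items()}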

add_edges_from three tuples networkx

I am trying to use networkx to create a DiGraph. I want to use add_edges_from(), and I want the edges and their data to be generated from three tuples.
I am importing the data from a CSV file. I have three columns: one for ids (first set of nodes), one for a set of names (second set of nodes), and another for capacities (no headers in the file). So, I created a dictionary for the ids and capacities.
dictionary = dict(zip(id, capacity))
Then I zipped the tuples containing the edge data:
List = zip(id, name, capacity)
but when I execute the next line, it gives me an assertion error.
G.add_edges_from(List, 'weight': 1)
Can someone help me with this problem? I have been trying for a week with no luck.
P.S. I'm a newbie in programming.
EDIT:
So, I found the following solution. I am honestly not sure how it works, but it did the job!
Here is the code:
import networkx as nx
import csv

G = nx.DiGraph()
capacity_dict = dict(zip(zip(id, name), capacity))
List = zip(id, name, capacity)
G.add_edges_from(capacity_dict, weight=1)
for u, v, d in List:
    G[u][v]['capacity'] = d
Now when I run:
G.edges(data=True)
The result will be:
[(2.0, 'First', {'capacity': 1.0, 'weight': 1}), (3.0, 'Second', {'capacity': 2.0, 'weight': 1})]
I am using the network simplex. Now I am trying to find a way to make the output of the flowDict more understandable, because it only shows the ids in the flow. (Maybe I'll try to put them into a database and return the whole row of data instead of using the ids only.)
A few improvements on your version. (1) NetworkX algorithms assume that the weight is 1 unless you specifically set it differently, so there is no need to set it explicitly in your case. (2) Using a generator allows the capacity attribute to be set explicitly and other attributes to also be set once per record. (3) Using a generator to process each record as it comes through saves you from having to iterate through the whole list twice. The performance improvement is probably negligible on small datasets, but it still feels more elegant. Having said that -- your method clearly works!
import networkx as nx
import csv

# simulate a csv file.
# This makes a multi-line string behave as a file.
from StringIO import StringIO
filehandle = StringIO('''a,b,30
b,c,40
d,a,20
''')

# process each row in the file
# and generate an edge from each
def edge_generator(fh):
    reader = csv.reader(fh)
    for row in reader:
        row[-1] = float(row[-1])  # convert capacity to float
        # add other attributes to the dict() below as needed...
        # e.g. you might add weights here as well.
        yield (row[0],
               row[1],
               dict(capacity=row[2]))

# create the graph
G = nx.DiGraph()
G.add_edges_from(edge_generator(filehandle))
print G.edges(data=True)
Returns this:
[('a', 'b', {'capacity': 30.0}),
('b', 'c', {'capacity': 40.0}),
('d', 'a', {'capacity': 20.0})]
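Note that the snippet above is written for Python 2 (the StringIO module and the print statement). A minimal Python 3 sketch of the same generator idea would look like this:
import csv
import networkx as nx
from io import StringIO  # Python 3 location of StringIO

filehandle = StringIO('a,b,30\nb,c,40\nd,a,20\n')

def edge_generator(fh):
    # yield (source, target, attribute-dict) triples, one per CSV row
    for row in csv.reader(fh):
        yield (row[0], row[1], dict(capacity=float(row[-1])))

G = nx.DiGraph()
G.add_edges_from(edge_generator(filehandle))
print(list(G.edges(data=True)))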