Linear problem for chemical composition of a formulation - linear-programming

Dears,
I´m writing a Linear Program for optimization.
One of my goals is to recommend to my supplier which raw material mix to use in its product formulation in order to optimally fulfill my nutrient needs in several locations. Example:
Sets:
SET_RAW_MATERIALS = ['Ureia', 'Ammonium Sulphate']
SET_NUTRIENTS = ['Nitrogen', 'Sulfur'] # Two nutrients that are supplied through the available raw materials
SET_LOCATIONS = ['Location A', 'Location B'] # Two locations for having nutrient demands fulfilled
SET_FORMULATIONS = ['Formulation 1'] # I have only 1 formulation to propose to my supplier
Parameters:
PARAM_NUTRIENT_DEMAND = [
('Location A', 'Nitrogen'): 10,
('Location A', 'Sulfur'): 2,
('Location B', 'Nitrogen'): 10
]
PARAM_RAW_MATERIAL_NUTRIENT_CONTENT = [
('Ammonium Sulphate', 'Nitrogen'): 10,
('Ammonium Sulphate', 'Sulfur'): 5,
('Urea', 'Nitrogen'): 2
]
Variables:
VAR_CHEMICAL_COMPOSITION('Formulation', 'Raw_material') # For each formulation I should find the optimal content of each raw_material (ranging from 0 to 1) VAR_VOLUME_FORMULATION('Formulation', 'Location') # The volume that i should use on each location for the proposed formulation
Constraints:
(1) Nutrient demand (for each location and nutrient):
PARAM_NUTRIENT_DEMAND['Location', 'Nutrient'] <= sum(
VAR_VOLUME_FORMULATION['Formulation', 'Location'] * VAR_CHEMICAL_COMPOSITION['Formulation', 'Raw Material'] * PARAM_RAW_MATERIAL_NUTRIENT_CONTENT['raw_material', 'nutrient'] for all Raw Materials)
This is the only formulation I was able to come up with, but it is obviously non-linear. Is there any way to rewrite this constraint to have a LINEAR PROBLEM?
Thank you for the help.,
p.S.: I know the problem is not complete, I´ve brought only the information necessary to show the problem

you have a blending example with docplex at https://github.com/sumeetparashar/IBM-Decision-Optimization-Introductory-notebook/blob/master/Oil-blend-student-copy.ipynb
I also recommend diet example in
CPLEX_Studio1210\python\examples\mp\modeling

Related

decision trees using R, rpart, fragile families

So, I am utilizing the fragile families challenge for my dataset to see which individual and family level predictors predict adolescent academic performance (measured by GPA). Information about my dataset:
FFCWS is a longitudinal panel study in which baseline interviews were conducted in 1998-
2000 with both the mothers and the fathers. Follow-up interviews were conducted when children were aged 1, 3, 5, 9, and 15. Interviews with the parent, primary caregiver(s),
teachers, and children were conducted either in-home or via telephone (FFCWS, 2021). In the
15th year, children/adolescents are asked to report their grades in four subjects- history,
mathematics, English, and science. These grades are averaged for each student to measure their individual academic performance at age 15. A series of individual-level and family-level
predictors that are known to impact the academic performance as mentioned earlier, are also captured at different time points in the life of the child.
I am very new to machine learning and need some guidance. In order to do this, I first create a dataset that contains all the theoretically relevant variables. It is 4,898x15. My final datasets look like this (all are continuous except:
final <- ffc %>% select(Gender, PPVT, WJ10, Grit, Self-control, Attention, Externalization, Anxiety, Depression, PCG_Income, PCG_Education, Teen_Mom, PCG_Exp, School_connectedness, GPA)
Then, I split into test and train as follows:
final_split <- initial_split(final, prop = .7) final_train <- training(final_split) final_test <- testing(final_split)
Next, I run the models:
train <- rpart(GPA ~.,method = "anova", data = final_train, control=rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10)) test <- rpart(GPA ~.,method = "anova", data = final_test, control=rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
Next, I visualize cross validation results:
rpart.plot(train, type = 3, digits = 3, fallen.leaves = TRUE) rpart.plot(test, type = 3, digits = 3, fallen.leaves = TRUE)
Next, I run predictions:
pred_train <- predict(train, ffc.final1_train) pred_test <- predict(test, ffc.final1_test)
Next, I calculate accuracy:
MAE <- function(actual, predicted) {mean(abs(actual - predicted)) } MAE(train$GPA, pred_train) MAE(test$GPA, pred_test)
Following are my questions:
Now, I am not sure if I should use rpart or random forest or XG Boost so my first question is that how do I decide which algorithm to use. I decided upon rpart but I want to have a sound reasoning for the same.
Are these steps in the right order? What is the point of splitting my dataset into training and testing? I ultimately get two trees (one for train and the other for test). Which ones should I be using? What do I make out of these? A step-by-step procedure after understanding my dataset would be quite helpful. Thanks!

Multinomial Logit Fixed Effects: Stata and R

I am trying to run a multinomial logit with year fixed effects in mlogit in Stata (panel data: year-country), but I do not get standard errors for some of the models. When I run the same model using multinom in R I get both coefficients and standard errors.
I do not use Stata frequently, so I may be missing something or I may be running different models in Stata and R and should not be comparing them in the first place. What may be happening?
So a few details about the simple version of the model of interest:
I created a data example to show what the problem is
Dependent variable (will call it DV1) with 3 categories of -1, 0, 1 (unordered and 0 as reference)
Independent variables: 2 continuous variables, 3 binary variables, interaction of 2 of the 3 binary variables
Years: 1995-2003
Number of observations in the model: 900
In R I run the code and get coefficients and standard errors as below.
R version of code creating data and running the model:
## Fabricate example data
library(fabricatr)
data <- fabricate(
N = 900,
id = rep(1:900, 1),
IV1 = draw_binary(0.5, N = N),
IV2 = draw_binary(0.5, N = N),
IV3 = draw_binary(0.5, N = N),
IV4 = draw_normal_icc(mean = 3, N = N, clusters = id, ICC = 0.99),
IV5 = draw_normal_icc(mean = 6, N = N, clusters = id, ICC = 0.99))
library(AlgDesign)
DV = gen.factorial(c(3), 1, center=TRUE, varNames=c("DV"))
year = gen.factorial(c(9), 1, center=TRUE, varNames=c("year"))
DV = do.call("rbind", replicate(300, DV, simplify = FALSE))
year = do.call("rbind", replicate(100, year, simplify = FALSE))
year[year==-4]= 1995
year[year==-3]= 1996
year[year==-2]= 1997
year[year==-1]= 1998
year[year==0]= 1999
year[year==1]= 2000
year[year==2]= 2001
year[year==3]= 2002
year[year==4]= 2003
data1=cbind(data, DV, year)
data1$DV1 = relevel(factor(data1$DV), ref = "0")
## Save data as csv file (to use in Stata)
library(foreign)
write.csv(data1, "datafile.csv", row.names=FALSE)
## Run multinom
library(nnet)
model1 <- multinom(DV1 ~ IV1 + IV2 + IV3 + IV4 + IV5 + IV1*IV2 + as.factor(year), data = data1)
Results from R
When I run the model using mlogit (without fixed effects) in Stata I get both coefficients and standard errors.
So I tried including year fixed effects in the model using Stata three different ways and none worked:
femlogit
factor-variable and time-series operators not allowed
depvar and indepvars may not contain factor variables or time-series operators
mlogit
fe option: fe not allowed
used i.year: omits certain variables and/or does not give me standard errors and only shows coefficients (example in code below)
* Read file
import delimited using "datafile.csv", clear case(preserve)
* Run regression
mlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2 i.year, base(0) iterate(1000)
Stata results
xtmlogit
error - does not run
error message: total number of permutations is 2,389,461,218; this many permutations require a considerable amount of memory and can result in long run times; use option force to proceed anyway, or consider using option rsample()
Fixed effects and non-linear models (such as logits) are an awkward combination. In a linear model you can simply add dummies/demean to get rid of a group-specific intercept, but in a non-linear model none of that works. I mean you could do it technically (which I think is what the R code is doing) but conceptually it is very unclear what that actually does.
Econometricians have spent a lot of time working on this, which has led to some work-arounds, usually referred to as conditional logit. IIRC this is what's implemented in femlogit. I think the mistake in your code is that you tried to include the fixed effects through a dummy specification (i.year). Instead, you should xtset your data and then run femlogit without the dummies.
xtset year
femlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2
Note that these conditional logit models can be very slow. Personally, I'm more a fan of running two one-vs-all linear regressions (1=1 and 0/-1 set to zero, then -1=1 and 0/1 set to zero). However, opinions are divided (Wooldridge appears to be a fan too, many others very much not so).

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation for the data I have and I have generated the unigrams and their respective probabilities (they are normalized as the sum of total probabilities of the data is 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have. The same format is followed for about 1000s of lines. The total probabilities (second column) summed gives 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify this. The sample code I have here is from the nltk documentation and I don't know what to do now. Please help on what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
if word in unigram:
N += 1
perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)
#here you construct the unigram language model
def unigram(tokens):
model = collections.defaultdict(lambda: 0.01)
for f in tokens:
try:
model[f] += 1
except KeyError:
model [f] = 1
continue
N = float(sum(model.values()))
for word in model:
model[word] = model[word]/N
return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
#computes perplexity of the unigram model on a testset
def perplexity(testset, model):
testset = testset.split()
perplexity = 1
N = 0
for word in testset:
N += 1
perplexity = perplexity * (1/model[word])
perplexity = pow(perplexity, 1/float(N))
return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model that has less perplexity with regards to a certain test set is more desirable than one with a bigger perplexity. In the first test set, the word Monty was included in the unigram model, so the respective number for perplexity was also smaller.
Thanks for the code snippet! Shouldn't:
for word in model:
model[word] = model[word]/float(sum(model.values()))
be rather:
v = float(sum(model.values()))
for word in model:
model[word] = model[word]/v
Oh ... I see was already answered ...

Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.
Variable
FO.variable
Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX
Please let me know if there's a better way of sharing this data.
All I want to do is create an additional binary keep column which should be:
1 if FO.variable is inside Variable
0 if FO.Variable is not inside Variable
Seems like a simple thing...in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to get this and R and failing.
Here's what I've tried:
Using grepl for pattern matching. I've used grepl before but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to force grepl and ifelse resulting in grepl using the first value in the column instead of the entire thing.
My next attempt was to use transform and grep based off another post on SO. I didn't think this would give me my exact answer but I figured it would get me close enough for me to figure it out from there...the code ran for a while than errored because invalid subscript.
transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])
My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?
kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))
EDIT: Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for-loops before as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right not sure if I screwed up the syntax:
for(i in 1:nrow(dd)){
if(dd[i,4] %in% dd[i,2])
dd$test[i] <- 1
}
As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".
A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?
This turned out to not nearly be as simple as I would of thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how to best approach this.
I read the data
df = dget("http://dpaste.com/2SJ6DPX.txt")
then split the 'variable' column into its parts and figured out the lengths of each entry
v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v) ## sapply(v, length) in R-3.1.3
Then I unlisted v and created an index that maps the unlisted v to the row from which it came from
uv = unlist(v)
idx = rep(seq_along(v), len)
Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable
test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
f0 = function(dd) {
v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
len = lengths(v)
uv = unlist(v)
idx = rep(seq_along(v), len)
keep = logical(nrow(dd))
keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
keep
}
(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)
f1 = function(dd)
mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)
f1a = function(dd)
mapply(grepl, as.character(dd$FO.variable),
as.character(dd$variable), fixed=TRUE)
f2 = function(dd)
apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))
with
> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
expr min lq mean median uq max neval
f0(df) 57.559 64.6940 70.26804 69.4455 74.1035 98.322 100
f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183 100
f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115 100
f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704 100
Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.
I would go with a simple mapply in your case, as you correctly said, by row operations will be very slow. Also, (as suggested by Martin) setting fixed = TRUE and apriori converting to character will significantly improve performance.
transform(dd, Keep = mapply(grepl,
as.character(FO.variable),
as.character(variable),
fixed = TRUE))
# VisitorIDTrue variable value FO.variable FO.value Keep
# 22 44888657 Direct / Unknown,Organic Search 1 Direct / Unknown 1 TRUE
# 2 44888657 Direct / Unknown,System Email 1 Direct / Unknown 1 TRUE
# 6 44888657 Direct / Unknown,TV 1 Direct / Unknown 1 TRUE
# 10 44888657 Organic Search,System Email 1 Direct / Unknown 1 FALSE
# 18 44888657 Organic Search,TV 1 Direct / Unknown 1 FALSE
# 14 44888657 System Email,TV 1 Direct / Unknown 1 FALSE
# 24 44888657 Direct / Unknown,Organic Search 1 Organic Search 1 TRUE
# 4 44888657 Direct / Unknown,System Email 1 Organic Search 1 FALSE
...
Here is a data.table approach that I think is very similar in spirit to Martin's:
require(data.table)
dt <- data.table(df)
dt[,`:=`(
fch = as.character(FO.variable),
rn = 1:.N
)]
dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]
dt[,c("fch","rn"):=NULL]
The idea is to
identify all pairs of rn & variable (saved in dtvars) and
see which of these pairs match with rn & F0.variable pairs (in the original table, dt).

Storing a curve in a django model as manytomany?

I have a list of products (say diodoes) which have a curve associated to them.
For example,
Diode 1: curve 1: [(0,1),(1,3),(2,10), ...., (100,0.5)]
Diode 2: curve 2: [(0,2),(1,4),(2.1,19), ..., (100,0)]
So for each product there is a curve (with the same x-axis values range(1,100)) but different y-axis values.
My question is what is the best practice to store such data (using Django + PostgreSql) given that I want to calculate things with it later in the views (say the area under the curve, or that curve times another one, etc). I will also be charting it, so the view will have to pull the values.
My first attempts have had various limitations:
Naive attempt 1
# model.py
for i in range(101):
name_sects = ["x", str(i+1)]
attrs["".join(name_sects)] = models.DecimalField(_("".join([str(i+1),' A'])), max_digits=6)
attrs['intensity'] = model.DecimalField(_('Diode Intensity'))
Diode = type('Diode', (models.Model,), attrs)
Ok, that creates a field for each "x", x1, x2,... etc, and I can fill each "y" in the admin ... but it's not obvious how to manipulate it in the view or the template. (and a pain to fill in, obviously)
Naive attempt 2
#model.py
class Curve(models.Model)
x_axis = models.PositiveIntegerField( ...)
y_axis = models.DecimalField( ...)
class Diode(models.Model)
name = blah, blah
intensity = model.DecimalField(_('Diode Intensity'), blah, blah)
characteristic_curve = model.ManyToManyField(Curve)
Is ManyToMany the way forward? Even if to each diode corresponds one single curve? (but many points, possibly two diodes sharing a same point).
Any advice, tips or links to tools for it are very appreciated.
If you want to improve speed (because 100 entries for each product, it's really huge and it would be slow if you have to fetch 100 products and theirs points), I would use the pickle module and store your list of tuples in a TextField (or maybe CharField if the length of the string doesn't change).
>>> a = [(1,2),(3,4),(5,6),(7,8)]
>>> pickle.dumps(a)
'(lp0\n(I1\nI2\ntp1\na(I3\nI4\ntp2\na(I5\nI6\ntp3\na(I7\nI8\ntp4\na.'
>>> b = pickle.dumps(a)
>>> pickle.loads(b)
[(1, 2), (3, 4), (5, 6), (7, 8)]
Just store b in your TextField and you can get back your list really easily.
And even better, as Robert Smith says, use http://pypi.python.org/pypi/django-picklefield
I like your second approach but just a minor suggestion.
class Plot(models.Model):
x_axis = models.PositiveIntegerField( ...)
y_axis = models.DecimalField( ...)
class Curve(models.Model)
plots = models.ManyToManyField(Plot)
class Diode(models.Model)
name = blah, blah
intensity = model.DecimalField(_('Diode Intensity'), blah, blah)
curve = models.ForeignKey(Curve)
Just a minor suggestion for flexibility