I have a list of products (say diodes), each of which has a curve associated with it.
For example,
Diode 1: curve 1: [(0,1),(1,3),(2,10), ...., (100,0.5)]
Diode 2: curve 2: [(0,2),(1,4),(2.1,19), ..., (100,0)]
So for each product there is a curve (with the same x-axis values range(1,100)) but different y-axis values.
My question is: what is the best practice to store such data (using Django + PostgreSQL), given that I want to calculate things with it later in the views (say the area under the curve, or one curve times another, etc.)? I will also be charting it, so the view will have to pull the values.
My first attempts have had various limitations:
Naive attempt 1
# model.py
attrs = {}  # collected dynamically and passed to type() below
for i in range(101):
    name_sects = ["x", str(i + 1)]
    attrs["".join(name_sects)] = models.DecimalField(_("".join([str(i + 1), ' A'])), max_digits=6)
attrs['intensity'] = models.DecimalField(_('Diode Intensity'))
Diode = type('Diode', (models.Model,), attrs)
OK, that creates a field for each "x" (x1, x2, ... etc.), and I can fill in each "y" in the admin ... but it's not obvious how to manipulate it in the view or the template (and it's a pain to fill in, obviously).
Naive attempt 2
#model.py
class Curve(models.Model):
    x_axis = models.PositiveIntegerField(...)
    y_axis = models.DecimalField(...)

class Diode(models.Model):
    name = blah, blah
    intensity = models.DecimalField(_('Diode Intensity'), blah, blah)
    characteristic_curve = models.ManyToManyField(Curve)
Is ManyToMany the way forward, even if each diode corresponds to a single curve (but many points, with two diodes possibly sharing the same point)?
Any advice, tips, or links to tools are much appreciated.
If you want to improve speed (100 entries for each product is really a lot, and it would be slow if you had to fetch 100 products and their points), I would use the pickle module and store your list of tuples in a TextField (or maybe a CharField if the length of the string doesn't change).
>>> a = [(1,2),(3,4),(5,6),(7,8)]
>>> pickle.dumps(a)
'(lp0\n(I1\nI2\ntp1\na(I3\nI4\ntp2\na(I5\nI6\ntp3\na(I7\nI8\ntp4\na.'
>>> b = pickle.dumps(a)
>>> pickle.loads(b)
[(1, 2), (3, 4), (5, 6), (7, 8)]
Just store b in your TextField and you can get back your list really easily.
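For instance, a minimal sketch of wrapping this in a model (the field and method names here are made up; on Python 3, pickle.dumps returns bytes, so you would want a BinaryField or django-picklefield instead):
import pickle
from django.db import models

class Diode(models.Model):
    name = models.CharField(max_length=100)
    curve_pickled = models.TextField(blank=True)   # the whole curve as one pickled string

    def set_curve(self, points):
        # points is a list of (x, y) tuples, as in the question
        self.curve_pickled = pickle.dumps(points)

    def get_curve(self):
        # str() guards against the DB driver handing back unicode on Python 2
        return pickle.loads(str(self.curve_pickled))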
And even better, as Robert Smith says, use http://pypi.python.org/pypi/django-picklefield
I like your second approach, but here's a minor suggestion.
class Plot(models.Model):
    x_axis = models.PositiveIntegerField(...)
    y_axis = models.DecimalField(...)

class Curve(models.Model):
    plots = models.ManyToManyField(Plot)

class Diode(models.Model):
    name = blah, blah
    intensity = models.DecimalField(_('Diode Intensity'), blah, blah)
    curve = models.ForeignKey(Curve)
Just a minor suggestion for flexibility
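With that layout, computations in a view become simple loops over the related Plot rows. As a rough sketch (the helper name is made up), the area under a diode's curve via the trapezoidal rule could look like:
def area_under_curve(diode):
    # (x, y) pairs ordered by x
    points = list(diode.curve.plots.order_by('x_axis').values_list('x_axis', 'y_axis'))
    area = 0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area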
I have the following problem:
I must identify whether a data point is an outlier or not (we don't have labels). I have different unsupervised models to identify outliers. I normalize the outlier scores and combine them via a weighted average. Since I don't have information about the models' accuracy, I use the same weight for each model.
Now, suppose that I have a small fraction of the dataset with also the label.
How can I update the weights according to the new information?
If you have any resources on this, please share them, because I couldn't find any.
Thank you in advance.
I looked for resources about Bayesian model averaging, but I don't know if it is the correct approach. I have also implemented an idea, but I'm not sure it is correct.
import numpy as np

def bayesian_update(anomaly, weight, prob):
    # posterior = prob(anomaly | model) * prior weight
    posterior = np.zeros(len(anomaly))
    for i in range(len(anomaly)):
        if anomaly[i] == 1:
            posterior[i] = prob[i] * weight
        else:
            posterior[i] = (1 - prob[i]) * weight
    return posterior
np.random.seed(0)
n_observations = 100
n_models = 4

# simulated anomaly probabilities, one column per model
models_probs = np.random.rand(n_observations, n_models)
anomaly = np.where(models_probs[:, 0] > 0.5, 1, 0)

posterior_sum = np.zeros(n_models)
for i in range(n_models):
    posterior_sum[i] = np.sum(bayesian_update(anomaly, 0.25, models_probs[:, i]))

new_weight = posterior_sum / np.sum(posterior_sum)
print(new_weight)
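One possible direction (this is only a sketch, not taken from any reference): once labels exist for a small subset, weight each model by its Bernoulli log-likelihood on that labeled subset under a uniform prior, which is a crude form of Bayesian model averaging:
labels = anomaly                         # stand-in for the small labeled subset
prior = np.full(n_models, 1.0 / n_models)
log_lik = np.zeros(n_models)
for i in range(n_models):
    p = np.clip(models_probs[:, i], 1e-12, 1 - 1e-12)
    # log-likelihood of the observed labels under model i
    log_lik[i] = np.sum(np.where(labels == 1, np.log(p), np.log(1 - p)))
unnorm = np.log(prior) + log_lik
unnorm -= unnorm.max()                   # numerical stability before exponentiating
updated_weights = np.exp(unnorm) / np.exp(unnorm).sum()
print(updated_weights)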
So, I am using the Fragile Families Challenge dataset to see which individual- and family-level predictors predict adolescent academic performance (measured by GPA). Information about my dataset:
FFCWS is a longitudinal panel study in which baseline interviews were conducted in 1998-2000 with both the mothers and the fathers. Follow-up interviews were conducted when children were aged 1, 3, 5, 9, and 15. Interviews with the parent, primary caregiver(s), teachers, and children were conducted either in-home or via telephone (FFCWS, 2021). In the 15th year, children/adolescents are asked to report their grades in four subjects: history, mathematics, English, and science. These grades are averaged for each student to measure their individual academic performance at age 15. A series of individual-level and family-level predictors that are known to impact academic performance, as mentioned earlier, are also captured at different time points in the life of the child.
I am very new to machine learning and need some guidance. To do this, I first create a dataset that contains all the theoretically relevant variables; it is 4,898 x 15. My final dataset looks like this (all variables are continuous except ...):
final <- ffc %>% select(Gender, PPVT, WJ10, Grit, Self-control, Attention, Externalization, Anxiety, Depression, PCG_Income, PCG_Education, Teen_Mom, PCG_Exp, School_connectedness, GPA)
Then, I split into test and train as follows:
final_split <- initial_split(final, prop = .7)
final_train <- training(final_split)
final_test <- testing(final_split)
Next, I run the models:
train <- rpart(GPA ~ ., method = "anova", data = final_train,
               control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
test <- rpart(GPA ~ ., method = "anova", data = final_test,
              control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
Next, I visualize cross validation results:
rpart.plot(train, type = 3, digits = 3, fallen.leaves = TRUE)
rpart.plot(test, type = 3, digits = 3, fallen.leaves = TRUE)
Next, I run predictions:
pred_train <- predict(train, ffc.final1_train)
pred_test <- predict(test, ffc.final1_test)
Next, I calculate accuracy:
MAE <- function(actual, predicted) { mean(abs(actual - predicted)) }
MAE(train$GPA, pred_train)
MAE(test$GPA, pred_test)
Following are my questions:
Now, I am not sure whether I should use rpart, random forest, or XGBoost, so my first question is: how do I decide which algorithm to use? I decided on rpart, but I want to have sound reasoning for that choice.
Are these steps in the right order? What is the point of splitting my dataset into training and testing? I ultimately get two trees (one for train and the other for test). Which one should I be using, and what do I make of these? A step-by-step procedure after understanding my dataset would be quite helpful. Thanks!
I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation on the data, and I have generated the unigrams and their respective probabilities (they are normalized, as the sum of all probabilities over the data is 1).
My unigrams and their probabilities look like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have; the same format continues for thousands of lines. The probabilities in the second column sum to 1.
I am a budding programmer. This ngram.py belongs to the nltk package, and I am confused as to how to fix this. The sample code I have here is from the nltk documentation, and I don't know what to do now. Please help me figure out what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
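For a test set W = w1 w2 ... wN, that means:
PP(W) = P(w1 w2 ... wN)^(-1/N) = (1/P(w1) * 1/P(w2) * ... * 1/P(wN))^(1/N)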
Now you say you have already constructed the unigram model, meaning that for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that provides the probability of each word in the corpus. You also need a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used so I can adapt my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1 / unigram[word])
perplexity = pow(perplexity, 1 / float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk

# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word] / N
    return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
# computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1 / model[word])
    perplexity = pow(perplexity, 1 / float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model with lower perplexity on a given test set is more desirable than one with higher perplexity. In the first test set, the word Monty was included in the unigram model, so the respective perplexity value is also smaller.
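To see where the second number comes from: none of the three words in testset2 occurs in the corpus, so each falls back to the smoothed probability 0.01, giving (1/0.01 * 1/0.01 * 1/0.01)^(1/3) = (10^6)^(1/3) = 100, which matches the 99.99999999999997 above up to floating-point error.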
Thanks for the code snippet! Shouldn't:
for word in model:
    model[word] = model[word] / float(sum(model.values()))
be rather:
v = float(sum(model.values()))
for word in model:
    model[word] = model[word] / v
Oh ... I see it was already answered ...
I need the correct datatype for geo points.
I will fetch and display them with the Google Maps API, so the format is like:
42.761819,11.104863
41.508577,-101.953125
Use case:
the user clicks on the map
Django saves this point with additional data
on the next visit, Django displays these points on the map
So, no distances between points or other such tricks are needed.
DB: postgres 8
Django: 1.4
Check out GeoDjango and see if it helps you. If your stack is configured to run GeoDjango, you can do this.
Your model will look like this:
from django.contrib.gis.db import models

class LocationPoint(models.Model):
    point = models.PointField(srid=4326, dim=3)
    accuracy = models.FloatField(default=0.0)
    objects = models.GeoManager()
To save the point to the database, all you have to do is:
from django.contrib.gis.geos import Point
point = Point(x=12.734534, y=77.2342, z=0, srid=4326)
location = LocationPoint()
location.point = point
location.save()
GeoDjango gives you extended abilities to do geospatial queries, which you might be interested in down the road, such as finding the distance between points or finding the nearest locations around a point.
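As a rough sketch of the kind of query this enables (assuming the LocationPoint model above and a PostGIS-backed database), you could fetch all saved points within 5 km of a clicked point:
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

# x is longitude, y is latitude
origin = Point(x=11.104863, y=42.761819, z=0, srid=4326)
nearby = LocationPoint.objects.filter(point__distance_lte=(origin, D(km=5)))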
Here's the link to GeoDjango
From django documentation about DecimalField:
DecimalField.max_digits
The maximum number of digits allowed in the number. Note that this number must be greater than or equal to decimal_places.
DecimalField.decimal_places
The number of decimal places to store with the number.
which refers to the Python Decimal type.
To make a good choice of data type and precision, you should consider:
what is the minimum possible value (latitude can range from -90 up to +90 degrees),
what is the maximum possible value (longitude can range from -180 up to +180 degrees),
what accuracy (decimal_places) you wish. Please note that it has an impact on the zoom level on Google Maps.
By the way, for better understanding, it is good to know how the calculation is done (Python code):
def deg_to_dms(deg):
    d = int(deg)
    md = abs(deg - d) * 60
    m = int(md)
    sd = (md - m) * 60
    return [d, m, sd]

def decimal(deg, min, sec):
    if deg < 0:
        dec = -1.0 * deg + 1.0 * min / 60.0 + 1.0 * sec / 3600.0
        return -1.0 * dec
    else:
        dec = 1.0 * deg + 1.0 * min / 60.0 + 1.0 * sec / 3600.0
        return dec
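For instance, with the latitude from the question (output rounded):
>>> deg_to_dms(42.761819)
[42, 45, 42.5484]
>>> decimal(42, 45, 42.5484)
42.761819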
It looks like you're going to be storing a latitude and a longitude. I would go with a DecimalField for this, and store each number separately.
I use longitude and latitude in my django setup.
My model includes:
long_position = models.DecimalField (max_digits=8, decimal_places=3)
lat_position = models.DecimalField (max_digits=8, decimal_places=3)
For more precision, you may want decimal_places to be larger.
When you want to display it with the Google Maps API, you would reference your model and write Python code to output it like this:
output = some_long_position + "," + some_lati_position
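For example, a small sketch (the model instance name is hypothetical; the fields are the ones defined above), using the lat,long order shown in the question:
position = DevicePosition.objects.get(pk=1)  # hypothetical model holding the two DecimalFields
output = "{0},{1}".format(position.lat_position, position.long_position)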
I'm using this in my model:
latitude = models.DecimalField(max_digits=11, decimal_places=7, null=True, blank=True)
longitude = models.DecimalField(max_digits=11, decimal_places=7, null=True, blank=True)
The Django documentation gives this example of associating extra data with an M2M relationship. Although that is straightforward, now that I am trying to make use of the extra data in my views it feels very clumsy (which typically means "I'm doing it wrong").
For example, using the models defined in the linked document above I can do the following:
# Some people
ringo = Person.objects.create(name="Ringo Starr")
paul = Person.objects.create(name="Paul McCartney")
me = Person.objects.create(name="Me the rock Star")
# Some bands
beatles = Group.objects.create(name="The Beatles")
my_band = Group.objects.create(name="My Imaginary band")
# The Beatles form
m1 = Membership.objects.create(person=ringo, group=beatles,
                               date_joined=date(1962, 8, 16),
                               invite_reason="Needed a new drummer.")
m2 = Membership.objects.create(person=paul, group=beatles,
                               date_joined=date(1960, 8, 1),
                               invite_reason="Wanted to form a band.")
# My Imaginary band forms
m3 = Membership.objects.create(person=me, group=my_band,
                               date_joined=date(1980, 10, 5),
                               invite_reason="Want to be a star.")
m4 = Membership.objects.create(person=paul, group=my_band,
                               date_joined=date(1980, 10, 5),
                               invite_reason="Wanted to form a better band.")
Now if I want to print a simple table that for each person gives the date that they joined each band, at the moment I am doing this:
bands = Group.objects.all().order_by('name')
for person in Person.objects.all():
    print person.name,
    for band in bands:
        print band.name,
        try:
            m = person.membership_set.get(group=band.pk)
            print m.date_joined,
        except:
            print 'NA',
    print ""
This feels very ugly, especially the m = person.membership_set.get(group=band.pk) bit. Am I going about this whole thing wrong?
Now, say I wanted to order the people by the date that they joined a particular band (say the Beatles), is there any order_by clause I can put on Person.objects.all() that would let me do that?
Any advice would be greatly appreciated.
You should query the Membership model instead:
members = Membership.objects.select_related('person', 'group').all().order_by('date_joined')
for m in members:
    print m.group.name, m.person.name, m.date_joined
Using select_related here, we avoid the 1 + N queries problem, as it tells the ORM to do the join and select everything in a single query.
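For the second part of the question (ordering people by the date they joined a particular band), a possible sketch in the same spirit is to filter and order Membership rather than Person.objects.all():
beatles_memberships = (Membership.objects
                       .filter(group=beatles)
                       .select_related('person')
                       .order_by('date_joined'))
people_in_join_order = [m.person for m in beatles_memberships]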