I am given a text file that contains many lines like the following, with a lot of miscellaneous information:
Spiaks Restaurant|42.74|-73.70|2 Archibald St+Watervliet, NY 12189|http://www.yelp.com/biz/spiaks-restaurant-watervliet|Italian|4|5|4|3|3|4|4
For example, Spiaks Restaurant is in position 0, 42.74 is in position 1, -73.70 is in position 2, and so on; Italian is in position 5.
The trailing 4|5|4|3|3|4|4 is another list, so basically a list within a list: after splitting, the first 4 is in position 6, the 5 in position 7, etc.
I have to prompt the user, and the interaction should look like this:
What type of restaurant would you like => Italian
What is the minimum rating => 3.6
The result should be:
Name: Spiaks Restaurant; Rating 3.86
Name: Lo Portos; Rating 4.00
Name: Verdiles Restaurant; Rating 4.00
Found 3 restaurants.
Here is my code:
rest_type = raw_input("What type of restaurant would you like => ")
min_rate = float(raw_input("What is the minimum rating => "))

def parse_line(text_file):
    count = 0.0
    a_strip = text_file.strip('\n')
    b_split = a_strip.split('|')
    for i in range(6, len(b_split)):
        b_split[i] = int(b_split[i])  # convert the current entry in the list to an integer
        count += b_split[i]           # add up all of the integer values to get the total rating
    avg_rate = count/len(b_split[6:len(b_split)])  # total rating divided by the number of ratings
    # The above basically calculates the average of the numbers like 4|5|4|3|3|4|4
    if rest_type == b_split[5] and avg_rate >= min_rate:
        print b_split[0], avg_rate
The problem is that all I get in the output is:
None
I know this is a very, very long question, but if someone could give me some insight, I would appreciate it!
Have you tried printing out all of the information you aggregated?
Find out where the error is occurring: is it when you try to filter for specific restaurants, or do you parse nothing at all and end up with a blank list?
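Also, printing None usually means the caller does something like print parse_line(line) on a function that only prints and returns nothing. Here is a minimal sketch of the whole flow, assuming the data lives in a file named restaurants.txt (a hypothetical name) and using the expected output format from the question:

rest_type = raw_input("What type of restaurant would you like => ")
min_rate = float(raw_input("What is the minimum rating => "))

found = 0
infile = open("restaurants.txt")  # hypothetical file name
for line in infile:
    fields = line.strip().split('|')
    ratings = [int(r) for r in fields[6:]]         # the trailing 4|5|4|3|3|4|4 part
    avg_rate = sum(ratings) / float(len(ratings))  # average rating
    if fields[5] == rest_type and avg_rate >= min_rate:
        print "Name: %s; Rating %.2f" % (fields[0], avg_rate)
        found += 1
infile.close()
print "Found %d restaurants." % found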
I have entries like these:
id  user   number
1   Peter  1
2   Jack   3
3   Kate   2
4   Carla  3
The name of my table is User, and I would like to get only the users with the highest number, but in some cases I don't know what that number is.
I thought to do something like that :
max_users = User.objects.filter(number=3)
But the problem is that in this case I assume the highest number is 3, whereas that is not always the case. Could you help me, please?
Thank you very much!
Try the following snippet:
from django.db.models import Max
max_number = User.objects.aggregate(Max('number'))['number__max'] # Returns the highest number.
max_users = User.objects.filter(number=max_number) # Filter all users by this number.
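Side note: if a single top user is enough (ignoring ties), you can also sort and take the first row with standard queryset methods; the aggregate version above is the right choice when you want every user tied at the maximum.

top_user = User.objects.order_by('-number').first()  # the user with the highest number, or None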
I'm trying to run sample code for a transportation problem in a Jupyter notebook, but it generates the error
TypeError: list indices must be integers or slices, not str. What's the problem here? How do I solve it? Thanks!
from pulp import *

# Creates a list of all the supply nodes
Warehouses = ["A", "B"]

# Creates a dictionary for the number of units of supply for each supply node
supply = {"A": 1000,
          "B": 4000}

# Creates a list of all demand nodes
Bars = ["1", "2", "3", "4", "5"]

# Creates a dictionary for the number of units of demand for each demand node
demand = {"1": 500,
          "2": 900,
          "3": 1800,
          "4": 200,
          "5": 700}

costs = [  # Bars
    # 1  2  3  4  5
    [2, 4, 5, 2, 1],  # A  Warehouses
    [3, 1, 3, 2, 3]   # B
]

# Creates the prob variable to contain the problem data
prob = LpProblem("Beer Distribution Problem", LpMinimize)

# Creates a list of tuples containing all the possible routes for transport
Routes = [(w, b) for w in Warehouses for b in Bars]

# A dictionary called route_vars is created to contain the referenced variables (the routes)
route_vars = LpVariable.dicts("Route", (Warehouses, Bars), 0, None, LpInteger)

# The objective function is added to prob first
prob += lpSum([route_vars[w][b]*costs[w][b] for (w, b) in Routes]), "Sum of Transporting Costs"

# The supply maximum constraints are added to prob for each supply node (warehouse)
for w in Warehouses:
    prob += lpSum([route_vars[w][b] for b in Bars]) <= supply[w], "Sum of Products out of Warehouse %s" % w

# The demand minimum constraints are added to prob for each demand node (bar)
for b in Bars:
    prob += lpSum([route_vars[w][b] for w in Warehouses]) >= demand[b], "Sum of Products into Bars %s" % b
This is more of a basic Python question than a PuLP question.
w and b are strings. So in your code you are evaluating costs['A']['1']. If you type that in, you see the same error message. To be able to use string indices, you need to use a dict instead of a list (array).
Solution: make costs a dict
One way to do this is:
costs = {'A': {'1': 2, '2': 4, '3': 5, '4': 2, '5': 1},
         'B': {'1': 3, '2': 1, '3': 3, '4': 2, '5': 3}}
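For what it's worth, PuLP also provides a makeDict helper (its official beer-distribution example uses it) that builds this nested dict from the list, so the rest of the code runs unchanged; a sketch:

from pulp import makeDict

# Turn the nested cost list into nested dicts keyed by the string labels,
# so that costs["A"]["1"] == 2 and the objective line works as written.
costs = makeDict([Warehouses, Bars],
                 [[2, 4, 5, 2, 1],
                  [3, 1, 3, 2, 3]], 0)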
I am having trouble combining multiple for loops. I will give an example with two loops that I would like to combine; if I know how to do it with two, I will also be able to do it with more.
If anyone knows how to write this as an lapply function, that would also be nice.
require(ncdf4)

#### Download files from this link to a directory (I just downloaded them manually; two files are sufficient for the example):
#### ftp://rfdata:forceDATA@ftp.iiasa.ac.at/WFDEI/LWdown_daily_WFDEI/
setwd("C:/place_where_I_have_downloaded_my_files_from_link/")
temp = list.files(pattern="*.nc") # list imported netcdf files
list2env(
  lapply(setNames(temp, make.names(gsub("*.nc$", "", temp))),
         nc_open), envir = .GlobalEnv) # import all parameter lists into the global environment

#### First loop: select a parameter out of the netcdf files and combine into a list of 2
list_temp <- list() # create empty list before loop
for (t in temp[1:2]){
  list_temp[t] <- list(data.frame(LWdown=ncvar_get(nc_open(t),"LWdown")[428,176,], xcoor=176, ycoor=428))
}
LW_bind <- do.call(rbind, list_temp)
rownames(LW_bind) <- NULL

#### Second loop: select a parameter out of one netcdf file per x-coordinate and combine into a list of 2
list_temp <- list() # create empty list before loop
for (x in 176:177){
  list_temp[x] <- list(data.frame(LWdown=ncvar_get(nc_open(temp[1]),"LWdown")[428,x,], xcoor=x, ycoor=428))
}
LW_bind <- do.call(rbind, list_temp)
rownames(LW_bind) <- NULL
Here is how I tried to combine them, but it didn't work:
#### combined loops
list_temp <- list()
for (t in temp[1:2]){
  for (x in 176:177){
    #ncin <- list()
    ncin <- nc_open(t)
    list_temp[x][t] <- list(data.frame(LWdown=ncvar_get(ncin,"LWdown")[428,x,], x=x, y=428))
  }
}
LWdown_1to2 <- do.call(rbind, list_temp)
rownames(LWdown_1to2) <- NULL
I already solved my problem; see below. But I am still curious how one could combine the two for loops as described above, so I will leave the question open and unanswered.
Here is my solution:
require(arrayhelpers); require(stringr); require(plyr); require(ncdf4)

# Store all files from ftp://rfdata:forceDATA@ftp.iiasa.ac.at/WFDEI/ in the following folder:
setwd("C:/folder")
temp = list.files(pattern="*.nc") # list all the file names
param <- gsub("_\\S+","",temp,perl=T) # extract the parameter from the file name
xcoord = seq(176,180,by=1) # the x-coordinates you are interested in
ycoord = seq(428,433,by=1) # the y-coordinates you are interested in
list_var <- list() # make an empty list
for (t in 1:length(temp)){
  temp_year <- str_sub(temp[],-9,-6) # characters -9 to -6 of the file name give the year
  temp_month <- str_sub(temp[],-5,-4) # characters -5 to -4 of the file name give the month
  temp_netcdf <- nc_open(temp[t])
  temp_day <- rep(seq(1:length(ncvar_get(temp_netcdf,"day"))), length(xcoord)*length(ycoord)) # make a vector of day numbers the same length as the number of values
  dim.order <- sapply(temp_netcdf[["var"]][[param[t]]][["dim"]], function(x) x$name) # gives the name of each level of the array
  start <- c(lon = 428, lat = 176, tstep = 1) # the starting value of each dimension
  count <- c(lon = 6, lat = 5, tstep = length(ncvar_get(temp_netcdf,"day"))) # how many values of each dimension to read, starting from start
  tempstore <- ncvar_get(temp_netcdf, param[t], start = start[dim.order], count = count[dim.order]) # array with parameter values
  df_temp <- array2df(tempstore, levels = list(lon=ycoord, lat=xcoord, day=NA), label.x = "value") # convert the array to a data frame
  Add_date <- sort(as.Date(paste(temp_year[t],"-",temp_month[t],"-",temp_day,sep=""),"%Y-%m-%d"), decreasing=FALSE) # make a vector with the dates
  list_var[t] <- list(data.frame(Add_date, df_temp, parameter=param[t])) # add the dates to the data frame and store it in a list of all output files
  ### nc_close(temp_netcdf) # close the nc file to prevent data loss and errors
}
All_NetCDF_var_in1df <- do.call(rbind, list_var)
I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error:
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
    cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation on my data, and I have generated the unigrams and their respective probabilities (they are normalized, as the total probability of the data sums to 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have; the same format is followed for thousands of lines, and the probabilities in the second column sum to 1.
I am a budding programmer. This ngram.py belongs to the nltk package, and I am confused as to how to fix this. The sample code I have here is from the nltk documentation, and I don't know what to do now. Please help me figure out what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:

PP(W) = P(w_1 w_2 ... w_N)^(-1/N) = ( ∏_{i=1}^{N} 1/P(w_i) )^(1/N)
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk

# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
# computes the perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model that has lower perplexity on a given test set is more desirable than one with higher perplexity. In the first test set, the word Monty was included in the unigram model, so the respective perplexity value was also smaller.
Thanks for the code snippet! Shouldn't:
for word in model:
    model[word] = model[word]/float(sum(model.values()))
rather be:
v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v
Oh... I see, it was already answered...
My question is more mathematical. There is a post on the site, and users can like and dislike it; below the post something like -5 dislikes and +23 likes is shown. Based on these values I want to compute a rating in the range 0-10 (or -10 to 0 and 0 to 10). How do I do this correctly?
This may not answer your question exactly, since you need a rating between [-10, 10], but this blog post describes the best way to score items that have positive and negative ratings (in your case, likes and dislikes).
A simple method like
(Positive ratings) - (Negative ratings), or
(Positive ratings) / (Total ratings)
will not give optimal results: under the second formula, for example, an item with 1 like and 0 dislikes (100%) would outrank one with 99 likes and 1 dislike (99%).
Instead he uses a method based on the binomial proportion confidence interval.
The relevant part of the blog post is copied below:
CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter
Say what: We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the "real" fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings (i.e. not a 5-star scale), the lower bound on the proportion of positive ratings is given by:
    (p̂ + z²/(2n) ± z·√[ (p̂(1−p̂) + z²/(4n)) / n ]) / (1 + z²/n)
(source: evanmiller.org)
(Use minus where it says plus/minus to calculate the lower bound.) Here p̂ is the observed fraction of positive ratings, z_(α/2) is the (1−α/2) quantile of the standard normal distribution, and n is the total number of ratings.
Here it is, implemented in Ruby, again from the blog post.
require 'statistics2'

def ci_lower_bound(pos, n, confidence)
  if n == 0
    return 0
  end
  z = Statistics2.pnormaldist(1 - (1 - confidence)/2)
  phat = 1.0*pos/n
  (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)
end
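For completeness, here is the same lower bound in Python, scaled to the 0-10 range the question asks for (z is hard-coded for a 95% confidence level, and the function names are my own):

import math

def ci_lower_bound(pos, n, z=1.96):
    # Lower bound of the Wilson score interval for a Bernoulli parameter;
    # z = 1.96 corresponds to a 95% confidence level.
    if n == 0:
        return 0.0
    phat = float(pos) / n
    return (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

def rating_0_to_10(likes, dislikes):
    # Map the lower bound (a fraction between 0 and 1) onto a 0-10 scale.
    return round(10 * ci_lower_bound(likes, likes + dislikes))

For the question's example, 23 likes and 5 dislikes gives a lower bound of about 0.64, i.e. a rating of 6, while a single like with no dislikes only reaches about 0.21, i.e. a rating of 2.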
This is an extension to Shepherd's answer.
total_votes = num_likes + num_dislikes;
rating = round(10*num_likes/total_votes);
It depends on the number of visitors to your app. Let's say you expect about 100 users to rate your app. When the first user clicks dislike, we rate it 0 based on the approach above. But this is not logically right, since our sample is much too small to call it a zero. The same goes for a single positive vote: our app gets a 10 rating.
A better approach is to add a constant value to the numerator and denominator. Let's say our app has 100 visitors; it is safe to assume that until we get about 10 ups/downs, we should not go to the extremes (neither a 0 nor a 10 rating). So just add 5 to both the likes and the dislikes:
num_likes = num_likes + 5;
num_dislikes = num_dislikes + 5;
total_votes = num_likes + num_dislikes;
rating = round(10*(num_likes)/(total_votes));
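To see the effect of those pseudo-counts, here is a quick sketch (the function name is my own):

def smoothed_rating(likes, dislikes, pseudo=5):
    # Pseudo-counts keep tiny samples near the middle of the scale.
    return round(10.0 * (likes + pseudo) / (likes + dislikes + 2 * pseudo))

print smoothed_rating(0, 1)    # 5.0, not 0: one dislike barely moves the rating
print smoothed_rating(90, 10)  # 9.0: with many votes the data outweighs the prior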
It sounds like what you want is basically a percentage liked/disliked. I would do 0 to 10 rather than -10 to 10, because negative numbers could be confusing. So on a 0-to-10 scale, 0 would be "all dislikes" and 10 would be "all likes":
total_votes = num_likes + num_dislikes;
rating = round(10*num_likes/total_votes);
And that's basically it.