Proven-correct receipt module - OCaml

I'm working on a register which produces receipts when customers buy articles. As an exercise, I'm thinking about making a receipt module in Coq which cannot produce erroneous receipts. In short, the articles and the payments on the receipt should always sum to 0 (where articles have price > 0 and payments have amount < 0). Is this doable, or sensible?
As a quick sketch, a receipt would consist of receipt items and payments, like
type receipt = {
  items : item list;
  payments : payment list;
}
And there would be functions to add articles and payments
val add_article : receipt -> article -> int -> receipt
val add_payment : receipt -> payment -> receipt
In real life, this procedure is of course more complicated, adding different types of discounts etc.
Certainly it's easy to add a boolean function check_receipt that confirms that the receipt is correct. And since articles and payments are always added at runtime, maybe this is the only way?
Example:
let receipt = {
  items = [{ name = "article1"; price = 10; quantity = 2 }];
  payments = [{ payment_type = Cash; amount = -20 }];
}
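In other words, the invariant every receipt should satisfy is

$$\sum_{i \in \text{items}} \text{price}_i \cdot \text{quantity}_i \;+\; \sum_{p \in \text{payments}} \text{amount}_p \;=\; 0$$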
Is it clear what I want?
Or maybe it's more interesting to prove VAT calculations correct. There are a couple of such properties that could be proven.

You can use Coq to prove that your program has properties like that, but I don't think it applies to the particular example you have presented. Articles and payments are added at different times, so there's no way of guaranteeing the balance is 0 all the time. You could check at the end that the balance is 0, but the program already has to do that anyway. I don't think there's a way of moving that check from execution time to compile time, even with a proof assistant.
I think it would make more sense to use Coq to prove that an optimized and a naive implementation of an algorithm obey the same input/output relation. If there were a way to simplify your program, perhaps at the cost of performance, you could compare the two versions using Coq. Then you would be confident you didn't introduce a bug during optimization.
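As a tiny illustration of that naive-vs-optimized pattern (sketched in Lean 4 rather than Coq, with made-up function names; the Coq statement would be analogous):

-- Naive, obviously correct version: direct structural recursion
def sumNaive : List Int → Int
  | [] => 0
  | x :: xs => x + sumNaive xs

-- "Optimized" version: delegate to the library fold
def sumOpt (l : List Int) : Int :=
  l.foldr (· + ·) 0

-- The proof that the two implementations agree on every input
theorem sumNaive_eq_sumOpt (l : List Int) : sumNaive l = sumOpt l := by
  induction l with
  | nil => rfl
  | cons x xs ih =>
    show x + sumNaive xs = x + sumOpt xs
    exact congrArg (x + ·) ih

Your receipt properties would be proved the same way, against whatever simple reference implementation you trust.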
That's all I can say without looking at any of your code.

Related

LP warehouse problem with a certain truck capacity

The problem I need to solve is almost like the basic LP warehouse problem, where you have n warehouses, each with a certain amount of a product, and m shops, each demanding a certain amount of that product. The goal is to minimize the number of kilometres driven by the trucks that deliver the products from the warehouses to the shops.
This was the easy part, I already identified the constraints and the objective function.
The part that I can't get my head around is that the truck that delivers the products has a certain capacity, C. Every single truck has the same capacity. I can't tell if that piece of information is really relevant and should be included in some kind of constraint. I would really appreciate a hint, because I've been stuck on this part for a while now and couldn't find any example of this exact type of problem on the Internet.
The number of trucks needed on each route can be bounded by
numtrucks(i,j) * capacity >= shipment(i,j)
Add a term to the objective that minimizes the number of trucks.
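Spelled out, with x_ij the amount shipped from warehouse i to shop j, a_i the stock at warehouse i, b_j the demand at shop j, d_ij the distance, C the truck capacity, and y_ij the integer number of truck trips on route (i, j), the whole model could look like this (the variable names are mine, not from the question):

$$
\begin{aligned}
\min\;& \sum_{i,j} d_{ij}\, y_{ij} && \text{(kilometres driven by trucks)} \\
\text{s.t.}\;& \sum_{j} x_{ij} \le a_i && \text{(stock at warehouse } i\text{)} \\
& \sum_{i} x_{ij} \ge b_j && \text{(demand at shop } j\text{)} \\
& C\, y_{ij} \ge x_{ij} && \text{(enough truck trips to carry the shipment)} \\
& x_{ij} \ge 0, \quad y_{ij} \in \mathbb{Z}_{\ge 0}
\end{aligned}
$$

Note that the capacity only matters through the linking constraint: without it, a single trip could carry everything and the kilometre count would be meaningless.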

Django filter exact m2m objects

Let's say I have a team model, and a team has members.
So
class Team(models.Model):
    team_member = models.ManyToManyField('Employee')

class Employee(models.Model):
    ....
Let's say I have a list of employee ids, like team_members = [1001, 1003, 1004], and I want to find the Team that is made up of exactly those three members.
I don't want the team that has [1001, 1003, 1004, 1005] or the team that has [1001, 1003].
Only team [1001, 1003, 1004].
This is what I'm doing now:
teams = Team.objects.all()
team = None
for t in teams:
    if set(x.id for x in t.team_member.all()) == set(team_members):
        team = t
if not team:
    team = Team.objects.create()
    team.team_member = team_members
But it seems a bit ham-handed. Is there a cleaner way, with fewer nested loops?
The short answer
No, I don't know of a much simpler way in terms of code appearance.
However there are some things you could do to make your code a little more graceful and potentially a lot faster. Plus it is possible to do the work in the database, albeit quite inefficiently for large team sizes.
The DB option listed below is pretty much as ham-handed as the for loop you provided, but could be more efficient depending on your data set, DB, etc.
Longer answer: ways to be less 'ham-handed'
There are a couple of places I'd clean up the style here.
Plus, in my experience with Django, loops like the one you built do tend to become pretty expensive on large data sets. If you end up loading, say, 10,000 teams into memory, having the ORM convert them to Team objects, and then iterating over them, you'll probably see some significant slowdown.
A few things to try for speed & grace:
Use Team.objects.values_list('team_member') for your in-Python filter loop, which skips the step where Django organizes all of the SQL data into Model objects. I've found this to save lots of time instantiating objects (sometimes around an order of magnitude).
Straighten out your set() calls. Currently you're re-converting team_members to a set() on every iteration, plus you're implicitly turning t.team_member into Employee objects (as they're fetched from the DB), then into a list of ids, and then into a set. For the first part, just make a team_members_set = set(team_members) up front and reuse it. For the second, you can do set(t.team_member.values_list('id', flat=True)), which skips the heaviest ORM step of instantiating Employee objects (which could be as bad as O(n^2) in your example, depending on the data set and Django's caching). See the sketch after this list.
Use Team.objects.all().iterator() so the Teams aren't all loaded into memory at once. This will help if you're running into memory issues.
But as with any performance optimization, test your perf with real or real-ish data to be sure you're not making things worse!
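Putting those items together, the tightened loop might look something like this (a sketch against the models from the question; untested):

team_members_set = set(team_members)  # convert once, outside the loop

team = None
for t in Team.objects.all().iterator():  # don't hold every Team in memory
    # values_list skips instantiating Employee objects
    member_ids = set(t.team_member.values_list('id', flat=True))
    if member_ids == team_members_set:
        team = t
        break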
Longer answer: the DB option
After trying all manner of Q() manipulation and other approaches listed in the answers here, to no avail, I found this answer by @Todor.
Basically you need to do repeated filter()s, one for each team member. On top of that, you use a Count annotation to make sure you don't end up choosing a Team with a superset of the desired members.
from functools import reduce
from django.db import models

desired_members = [1001, 1003, 1004]

initial_queryset = Team.objects.annotate(
    cnt=models.Count('team_member')
).filter(cnt=len(desired_members))

matching_teams = reduce(  # can of course use a for loop if you prefer that to reduce()
    lambda queryset, member: queryset.filter(team_member=member),
    desired_members,
    initial_queryset,
)
Note that the resulting query will likely have perf issues for large teams, since it will do one JOIN for every one of your desired_members. It would be nice to avoid that, but I don't know of another way to do this entirely in the database without changing your data structure. I'd love to learn a better way, and if you end up doing some perf testing I'd be curious to hear what you learn!
Maybe you can use annotate to count team_member. Can you try this?
from django.db.models import Count

Team.objects.filter(
    team_member__pk__in=team_members
).annotate(
    num_team=Count('team_member')
).filter(num_team=len(team_members))
To get the Team with those exact three members you can try:
Team.objects.get(team_member__pk__in=team_members)  # this code was untested
You could also try with a list of Employee objects:
# team_members = Employee.objects.filter(pk__in=team_members)
team_members = [<Employee: Employee object>, <Employee: Employee object>, <Employee: Employee object>]
Team.objects.get(team_member__in=team_members)

Understanding "Real world modelling" program

A few days ago I got a new project to do, related to a "real world modelling" program.
Here's how it looks:
A visit to a psychologist (use a queue). Experts provide psychological advice: n of them form therapeutic groups of k people (GrT - duration of group therapy in hours), and the other m take individual patients (InT - duration of individual therapy in hours). Each newly arrived patient (a new patient appears with probability p1; recurring patients come back after a period of time h) can choose to go to a psychologist providing individual therapy, or to group therapy. If a group therapy session is full, patients wishing to participate in group sessions must wait. Recurring patients wishing to go to group sessions can start a session with a smaller group, but can't attend the same session as newly arrived patients. It has been observed that patients who take individual therapy recover faster than those who choose group sessions (they need fewer sessions), but there are exceptions: due to the social-interaction factor, some patients (probability p2) recover h percent faster than they would in individual therapy. An individual session costs InC, a group session GrC. You need to assess which therapeutic approach a patient should choose to make the best use of their resources, and how many specialists, and of what kind, a health care facility should hire.
Here's my approach to this problem:
Read a text file containing names, surnames, and the money each patient is willing to spend, and place everything in a queue structure.
Find which option is better for the patient by generating a random number for the p2 probability; using it, we'll find whether the patient recovers faster in individual or group therapy. IMO the factor sequence here is: money (checking whether the patient can afford individual therapy sessions) > p2 (whether the patient should take group sessions if that's better for them).
By looking at how many patients there are in the queue, we can work out how many psychologists we'll need. (Are there any other factors here? What if we're short of experts?)
Problems I can't get my head around: how do I implement the probability p1 of a new patient appearing if I write every patient into a text file and put them in a queue? And how many therapy sessions does it take for a patient to recover (a static number?)?
Am I missing something? Basically it's an open question and there may be no right answer. If anyone has suggestions on how to make this program better, I'd be glad to hear them!
Programming language I'm using: C++
If you want to break up a task, analyse it and prepare it for coding, you could:
First make a block diagram representing program flow control.
Then follow with a pseudocode implementation.
P.S. Update the question following the above, and when you reach the "code stage" there will definitely be more help.
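For the p1 sub-question specifically: rather than encoding the appearance probability in the text file, the usual approach is to treat the file as a pool of potential patients and sample arrivals once per simulated time step. A minimal sketch (in Python for brevity; the same logic ports directly to C++ with <random>), where P1 and the data structures are assumed names:

import random
from collections import deque

P1 = 0.3  # hypothetical probability that a new patient shows up in one time step

def simulate_step(waiting_queue: deque, patient_pool: list):
    """Advance the simulation clock by one step."""
    # New-patient appearance is a Bernoulli trial with probability p1
    if patient_pool and random.random() < P1:
        waiting_queue.append(patient_pool.pop(0))
    # ... assign waiting patients to individual/group sessions here ...

The sessions-to-recover question can be handled the same way: start with a fixed number, then later draw it from a distribution per patient.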

How does data mining actually work?

Suppose I want to do some data mining on the database of a supermarket. What does that actually mean?
1) What will the output/results be like?
2) Will the output be different every day or change over time?
3) Before applying data mining, do I need to know what I want or will data mining give everything I want automatically?
Data Mining is a general category of techniques that can be applied to different kinds of datasets, just like programming is a general category of techniques that can be applied using different languages to do different things.
So, asked at that level of generality, your questions don't really have answers.
A1: Data mining will give you accurate reports for your queries on the supermarket's database.
A2: Sure, because data mining depends on analysis over time; in this case it depends on the problems or goals you want to reach. If your database is very big and you have built your data warehouse the right way, you will get different output over time.
A3: Yes, you should determine which problems you have to mine, then use data mining tools to get the results or indicators automatically.
To answer your first question: for the case of supermarket customer data, I could imagine the following questions:
how many products X are usually sold on Fridays ?
(helps you to determine how many X you should have in stock)
which customers bought product X often in the last month/year ?
Useful when you introduce a new X-like product: send advertising material (which has a given cost) only to those customers.
given a customer buys product X (e.g. beer), what's the probability that he/she also buys product Y (e.g. chips)?
useful for the following: make sure X and Y are never on promotional offer at the same time (X and Y are often bought together). Get the customers into the store by offering a rebate on X, knowing they'll also buy Y at the same time. Or: put a high-priced X-like product right next to Y, and put the cheaper X somewhere else.
which neighborhoods have the smallest number of customers ?
helps to find out which neighborhoods you could target with advertising to bring more customers into the store.
Often, by 'asking certain questions to the data' one discovers some features and comes up with new questions.
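The beer-and-chips question above is the classic association-rule case; the probability it asks for is what data miners call the rule's confidence. A minimal sketch of computing it from raw baskets, with made-up transaction data:

# Hypothetical transactions: each set is one customer's basket
baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "bread"},
    {"milk", "bread"},
]

def confidence(x, y, baskets):
    """Estimate P(y in basket | x in basket) from the transaction log."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

print(confidence("beer", "chips", baskets))  # 2 of the 3 beer baskets contain chips -> ~0.67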
Data mining is a set of techniques. It refers to discovering interesting and unexpected patterns in data.
If you want to apply some data mining techniques, you need to know which one and you should know why. The answer to questions 1, 2 and 3 depends on the techniques that you choose.
For example, if I want to find associations between items sold in a supermarket, I may use association rule mining. If I want to find groups of similar customers, I might use a clustering algorithm, etc.
There is not just ONE technique in data mining.

Collaborative Filtering: Ways to determine implicit scores for products for each user?

Having implemented an algorithm to recommend products with some success, I'm now looking at ways to calculate the initial input data for this algorithm.
My objective is to calculate a score for each product that a user has some sort of history with.
The data I am currently collecting:
User order history
Product pageview history for both anonymous and registered users
All of this data is timestamped.
What I'm looking for
There are a couple of things I'm looking for suggestions on, and ideally this question should be treated as a discussion rather than as having a single 'right' answer.
Any additional data I can collect for a user that can directly imply an interest in a product
Algorithms/equations for turning this data into scores for each product
What I'm NOT looking for
Just to avoid this question being derailed with the wrong kind of answers, here is what I'm doing once I have this data for each user:
Generating a number of user clusters (21 at the moment) using the k-means clustering algorithm, with Pearson's coefficient for the distance score
For each user (on demand), calculating a graph of similar users by looking for their most and least similar users within their cluster, and repeating to an arbitrary depth.
Calculating a score for each product based on the preferences of other users within the user's graph
Sorting the scores to return a list of recommendations
Basically, I'm not looking for ideas on what to do once I have the input data (I may need further help with that later, but it's not the point of this question), just for ideas on how to generate this input data in the first place
Here's a haymaker of a response:
time spent looking at a product
semantic interpretation of comments left about the product
make a discussion page about a product, brand, or product category and semantically interpret the comments
whether they shared a product page (email, del.icio.us, etc.)
browser (mobile might make them spend less time on the page vis-à-vis a laptop while still indicating great interest) and connection speed (affects the amount of time spent on the page)
facebook profile similarity
heatmap data (e.g. à la kissmetrics)
What kind of products are you selling? That might help us answer you better. (Since this is an old question, I am addressing both @Andrew Ingram and anyone else who has the same question and found this thread through search.)
You can allow users to explicitly state their preferences, the way Netflix allows users to assign stars.
You can assign a positive numeric value for all the stuff they bought, since you say you do have their purchase history. Assign zero for stuff they didn't buy
You could do some sort of weighted value for stuff they bought, adjusted for what's popular (if nearly everybody bought a product, it doesn't tell you much about a person that they also bought it). See "term frequency-inverse document frequency".
You could also assign some lesser numeric value for items that users looked at but did not buy.
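Pulling the purchase, popularity, and pageview ideas above together, a rough sketch of such a scoring function; the weights and the popularity adjustment are made-up starting points to tune, not established values:

import math

BUY_WEIGHT = 1.0
VIEW_WEIGHT = 0.3  # hypothetical: a pageview implies less interest than a purchase

def implicit_score(bought, views, buyers_of_product, total_users):
    """Score one (user, product) pair from order and pageview history."""
    score = 0.0
    if bought:
        # Popularity adjustment a la inverse document frequency:
        # buying what everyone buys says little about this user
        score += BUY_WEIGHT * math.log(total_users / (1 + buyers_of_product))
    # Diminishing returns on repeat views
    score += VIEW_WEIGHT * math.log1p(views)
    return score

print(implicit_score(bought=True, views=4, buyers_of_product=50, total_users=10000))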