Suppose you want to implement an algorithm that works in the following way:
You read in from a file that contains values of the form:
Mark Buy 20 1 100
Bob Sell 20 2 90
Where the input takes the form:
<name> <buy or sell> <quantity> <time> <company> <buy maximum or sell minimum>
What's the fastest way to match buyers and sellers for some company, where a buyer and a seller are matched only if the highest buy price for that company is greater than the lowest sell price for that company? The buy or sell that is top-most (earliest) determines which price to use.
So in the example given we'd have "Mark, at time 1, bought 20 of Google for $100 from Bob, at time 2."
How can we optimize this algorithm for speed? Would reading in the entire file first be an optimal solution?
What you need is two priority queues per commodity: one for active buy bids (prioritized on max-price), and one for active sell bids (prioritized on min-price), plus an overall queue for bid creation/expiry events (prioritized on time). (If your bids are in a batch file as described, rather than a causal/online sequence, you can just sort the creation/expiry events, but you still need the buy and sell queues)
Using priority queues is the crux; everything after that is plumbing:
foreach bid creation/expiry event, in chronological order:
    if the event is an expiry:
        delete the bid from the appropriate queue
    else, the event is a creation:
        add the bid to the appropriate queue
    repeat until no further transaction can be performed:
        find max-active-buy and min-active-sell bids for the given commodity
        if they match:
            execute (and record) the transaction
            update partially fulfilled bids, and remove completely fulfilled ones
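To make the plumbing concrete, here is a minimal Python sketch of that loop for a single commodity, using heapq (buy prices are negated so the min-heap behaves like a max-heap). The tuple layout, the >= matching test, and the helper name match_bids are illustrative assumptions, not part of the original answer:

import heapq

def match_bids(events):
    """events: (time, name, side, quantity, price) tuples, already sorted by time.
    Returns executed trades as (buyer, seller, quantity, price)."""
    buys, sells = [], []   # heaps; quantity is kept in a list so it can be mutated in place
    trades = []
    for time, name, side, qty, price in events:
        if side == "Buy":
            heapq.heappush(buys, (-price, time, name, [qty], price))  # negate for max-heap behaviour
        else:
            heapq.heappush(sells, (price, time, name, [qty], price))
        # repeat until no further transaction can be performed
        while buys and sells and -buys[0][0] >= sells[0][0]:
            _, bt, buyer, bqty, bprice = buys[0]
            _, st, seller, sqty, sprice = sells[0]
            trade_price = bprice if bt <= st else sprice   # the top-most (earlier) bid sets the price
            traded = min(bqty[0], sqty[0])
            trades.append((buyer, seller, traded, trade_price))
            bqty[0] -= traded                              # update partially fulfilled bids
            sqty[0] -= traded
            if bqty[0] == 0:                               # remove completely fulfilled ones
                heapq.heappop(buys)
            if sqty[0] == 0:
                heapq.heappop(sells)
    return trades

# match_bids([(1, "Mark", "Buy", 20, 100), (2, "Bob", "Sell", 20, 90)])
# -> [("Mark", "Bob", 20, 100)]

For multiple commodities, keep one pair of heaps per commodity in a dictionary keyed by company.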
When running this as a batch operation, you could simplify things a bit by splitting the input by commodity and processing each commodity separately. However, this will not work if the markets interact in any way (such as checking for a sufficient account balance).
Priority queue operations can have asymptotic performance of O(log N) in the number of items, and there are many fast, practical priority queue data structures that achieve this bound.
Since you are evaluating an entire file as a batch, you may want to look into priority queues with amortized performance guarantees -- but if you expect to use your code in a real-time setting, you should probably stick to priority queues with strict per-query guarantees.
I wanted to know how the system handles concurrent bids. There is a possibility that two bidders might bid the same amount at the same time, even down to the millisecond (considering that a large number of users are bidding for the item). In that case, how will the system manage the bid?
For example:
Suppose a pendant is placed up for bid. The current bid is $3.75, so the next bidder must place a bid of $4 or more. Now two (or more) bidders bid $4 at the same time, since they are both seeing the current bid of $3.75. If one of them had placed the bid slightly earlier, the other would automatically have had to bid a little more than that; but in this case both bidders happened to bid the same amount at the same time for the same item.
Whose bid is considered to be the current bid?
The simplest thing you could probably come up with is a DB transaction (which will probably slow down your users' experience), but I think it's worth a shot to try Redis in this case, since Redis is single-threaded.
I also think you could make use of its INCR functionality, since IMHO that's the best way.
Note:
Redis also supports transactions (MULTI/EXEC), but please be aware that Redis transactions don't support rollback.
Read more about Redis Transaction and INCR.
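To illustrate the INCR idea (this sketch is mine, not from the answer above): every incoming bid atomically draws a sequence number, so even bids that arrive "at the same time" get a strict order, and among equal amounts the lower sequence number (the earlier bid) is treated as the current bid. The key names and helper functions are assumptions.

import redis

r = redis.Redis(decode_responses=True)

def place_bid(item_id, bidder, amount):
    """Record a bid; Redis is single-threaded, so INCR gives every bid a
    unique, strictly increasing arrival number even under heavy concurrency."""
    seq = r.incr(f"item:{item_id}:bid_seq")                      # atomic counter
    r.zadd(f"item:{item_id}:bids", {f"{bidder}:{seq}": amount})  # sorted set scored by amount
    return seq

def current_bid(item_id):
    """Highest amount wins; among equal amounts the earliest sequence wins."""
    bids = r.zrevrange(f"item:{item_id}:bids", 0, -1, withscores=True)
    if not bids:
        return None
    top_amount = bids[0][1]
    tied = [member for member, score in bids if score == top_amount]
    winner = min(tied, key=lambda m: int(m.rsplit(":", 1)[1]))
    return winner, top_amount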
The Amazon DynamoDB documentation focuses on uniform distribution of the partition key as the most important point in designing a correct DB architecture.
On the other hand, when it comes to real numbers, you may find that your app will never grow beyond one partition. That is, according to the doc:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
partition calculation formula is
( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
So you need a demand of more than 1,000 writes per second (for 1 KB items) to grow beyond one partition. But according to my calculations, most small applications don't even need the default 5 writes per second; 1 is enough. (To be precise, you can also outgrow one partition if your data exceeds 10 GB, but that too is a big number.)
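To make that concrete, plugging the default throughput mentioned above (5 read and 5 write capacity units, for 1 KB items) into the formula gives:
( 5 / 3,000 ) + ( 5 / 1,000 ) ≈ 0.007, which rounds up to 1 initial partition.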
The question becomes more important when you realize that creating any additional index requires allocating additional writes per second.
Just imagine I have some data related to a particular user, for example, "posts".
I create a "posts" data table and then, according to the Amazon guidelines, I choose the following key format:
partition: id, // post id like uuid
sort: // don't need it
Since no two posts have the same id, we don't need a sort key here. But then you realize that the most common operation you have is requesting the list of posts for a particular user. So you need to create a secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires additional read/write units, so the cost of this decision is doubled!
On the other hand, keeping in mind that you have only one partition, you could use this primary key from the start:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
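For illustration only (not part of the original question), the composite key above could be declared with boto3 roughly like this; the table name, attribute types and throughput values are assumptions:

import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="posts",
    AttributeDefinitions=[
        {"AttributeName": "userId", "AttributeType": "S"},
        {"AttributeName": "id", "AttributeType": "S"},     # post id, e.g. a uuid
    ],
    KeySchema=[
        {"AttributeName": "userId", "KeyType": "HASH"},    # partition key
        {"AttributeName": "id", "KeyType": "RANGE"},       # sort key
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

# Listing one user's posts then needs no secondary index:
dynamodb.query(
    TableName="posts",
    KeyConditionExpression="userId = :u",
    ExpressionAttributeValues={":u": {"S": "some-user-id"}},
)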
So the question is: have I missed something? Maybe a partition key is much more effective than a sort key, even inside one partition?
Addition: you may say "OK, using userId as the partition key for posts is fine now, but when you have 100,000 users in your app you'll run into scaling trouble." But in reality the trouble can arise only in some "transition" case, when you have only a few partitions, with one group of active users' posts all in one partition and the inactive ones in another. If you have thousands of users, it's natural that many of them have active posts; the impact of any one user is negligible, and statistically their posts end up evenly distributed across many partitions simply because of the large numbers involved.
I think it's absolutely fine as long as you make sure you won't exceed the partition limits, either through increased RCU/WCU or through growth of your data. Moreover, the best practices say:
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.
A few days ago I got a new project to do, related to a "real world modelling" program.
Here's what it looks like:
A visit to a psychologist (use a queue). Experts provide psychological advice: some of them (n) form therapeutic groups of k people (GrT - duration of group therapy in hours), other experts (m) take individual patients (InT - duration of individual therapy in hours). Each newly arrived patient (a new patient appears with probability p1; recurring patients come back after a period of time h) can choose to go to a psychologist providing individual therapy, or to group therapy. If a group therapy session is full, patients wishing to participate in group sessions must wait. Recurring patients wishing to go to group sessions can start a session with a smaller group, but can't attend the same session as newly arrived patients. It has been observed that patients who take individual therapy recover faster than those who choose group sessions (they need fewer sessions), but there are exceptions: due to the social interaction factor, some patients (with probability p2) recover h percent faster than those who choose individual therapy. An individual session costs InC, a group session GrC. You need to assess which therapeutic approach a patient should choose to make the best use of their resources, and how many specialists of which kind a health care facility should hire.
Here's my approach to this problem:
Read a text file containing names, surnames, and the money each patient is willing to spend, and place everything in a queue structure.
Find which option is better for a patient by generating a random number for the p2 probability; using it, we'll determine whether the patient recovers faster in individual or group therapy. IMO the factor order here is: money (checking whether the patient can afford individual therapy sessions) > p2 (whether the patient should take group sessions if that's better for him).
By looking at how many patients there are in the queue, we can work out how many psychologists we'll need. (Are there any other factors here? What if we are short of experts?)
Problems I can't figure out: how do I implement the probability p1 of a new patient appearing if I write every patient into a text file and put them in a queue? How many therapy sessions does it take for a patient to recover (a static number?)?
Am I missing something? Basically it's an open question and there may be no single right answer. If anyone has suggestions on how to make this program better, I'd be glad to hear them!
Programming language I'm using: C++
If you want to break up a task, analyse it and prepare it for coding, you could:
First make a block diagram representing program flow control.
Then follow it with a pseudocode implementation.
P.S. Update the question following the above, and when you reach the "code stage" there will definitely be more help.
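On the p1 question specifically, one common approach is to run the simulation in discrete time steps and draw a Bernoulli trial each step, so patients don't need to pre-exist in the text file; the file only supplies names. A minimal Python sketch follows (the structure maps directly to C++ with std::queue and <random>; everything beyond the parameter p1 is my own illustrative assumption):

import random
from collections import deque

def simulate(p1, steps, seed=42):
    """Each time step, a new patient appears with probability p1 and joins
    the waiting queue; session dispatch and recovery are left as stubs."""
    rng = random.Random(seed)
    waiting = deque()
    next_id = 0
    for t in range(steps):
        if rng.random() < p1:            # Bernoulli trial: a new patient arrives this step
            waiting.append({"id": next_id, "arrived": t})
            next_id += 1
        # ... here you would dispatch waiting patients to individual or group
        # therapy, track sessions until recovery, and accumulate costs ...
    return waiting

# e.g. len(simulate(p1=0.3, steps=100)) patients arrived over 100 steps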
I have a dataset of 1 million customers. They are split into categories such as electronics customers, food and beverage customers, etc. The group names represent the customers' profiles.
Each customer behaves differently. For instance, suppose an electronics customer buys at least one electronic device whenever he goes shopping. This transaction repeats randomly or continuously, so I represent each transaction by numerical codes:
(value of transaction, volume of transaction, transaction type, etc.) = (100, 200, 1)
For each transaction I have the vector above; it means every customer has a different trading behaviour.
I want to find out whether each customer has a pattern, and whether we have outliers.
It is basically a profiling problem.
Which analysis do you recommend?
Can you be more specific? What are you trying to get out of the analysis exactly? Buying patterns, customers that are outliers, purchases that are outliers?
If you want to determine which items are bought together, group the transactions, just listing the items purchased at the same time, and do shopping basket analysis using the Apriori algorithm or similar.
If you want to find similar customers, use k-nearest neighbours or k-means against a vector representing each customer's buying patterns (probably just the items bought). You can do this on individual transactions as well, to compare transactions.
To determine outliers, you can use a density-based clustering algorithm (e.g. DBSCAN) to cluster customers that are close to one another, and look at the customers that fall outside any cluster to identify outliers.
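For the outlier part, a minimal scikit-learn sketch, assuming each customer has already been summarized into a fixed-length numeric vector (the toy feature vectors and the DBSCAN parameters below are illustrative assumptions):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# one row per customer, e.g. [total value, total volume, number of transactions]
customers = np.array([
    [100.0, 200.0, 12],
    [110.0, 210.0, 11],
    [ 95.0, 190.0, 13],
    [900.0,  10.0,  1],   # looks very different from the rest
])

X = StandardScaler().fit_transform(customers)          # put features on one scale
labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(X)

outliers = np.where(labels == -1)[0]                   # DBSCAN marks noise points as -1
print("cluster labels:", labels)
print("outlier customer indices:", outliers)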
I've been doing some work for my exams in a few days and I'm going through some past papers but unfortunately there are no corresponding answers. I've answered the question and I was wondering if someone could tell me if I am correct.
My question is
(c) A transactional dataset, T, is given below:
t1: Milk, Chicken, Beer
t2: Chicken, Cheese
t3: Cheese, Boots
t4: Cheese, Chicken, Beer
t5: Chicken, Beer, Clothes, Cheese, Milk
t6: Clothes, Beer, Milk
t7: Beer, Milk, Clothes
Assume that minimum support is 0.5 (minsup = 0.5).
(i) Find all frequent itemsets.
Here is how I worked it out:
Item : Amount
Milk : 4
Chicken : 4
Beer : 5
Cheese : 4
Boots : 1
Clothes : 3
Now, because minsup is 0.5, you eliminate Boots and Clothes and form pairs of the remaining items, giving:
{items} : Amount
{Milk, Chicken} : 2
{Milk, Beer} : 4
{Milk, Cheese} : 1
{Chicken, Beer} : 3
{Chicken, Cheese} : 3
{Beer, Cheese} : 2
Which leaves {Milk, Beer} as the only frequent itemset then, as it is the only one above the minsup?
I agree you should go for the Apriori Algorithm.
The Apriori algorithm is based on the idea that for a pair of items to be frequent, each individual item must also be frequent.
If the hamburger-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.
So for the algorithm, a threshold X is established to define what is or is not frequent. If an item appears more than X times, it is considered frequent.
The first step of the algorithm is to pass over each item in each basket and calculate its frequency (count how many times it appears).
This can be done with a hash of size N, where position y of the hash holds the frequency of item y.
If item y has a frequency greater than X, it is said to be frequent.
In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that we compute this only for items that are individually frequent: if item y and item z are each frequent on their own, we then compute the frequency of the pair. This condition greatly reduces the number of pairs to compute and the amount of memory needed.
Once this is calculated, the pairs with frequency greater than the threshold form the frequent itemsets.
(http://girlincomputerscience.blogspot.com.br/2013/01/frequent-itemset-problem-for-mapreduce.html)
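To make those two passes concrete, here is a short Python sketch (mine, not from the linked post) run against the transactions from the question above. With minsup = 0.5 over 7 transactions, the support-count threshold is 3.5, so an itemset must appear at least 4 times; the sketch reports the four frequent single items plus the pair {Milk, Beer}:

from itertools import combinations
from collections import Counter

transactions = [
    {"Milk", "Chicken", "Beer"},
    {"Chicken", "Cheese"},
    {"Cheese", "Boots"},
    {"Cheese", "Chicken", "Beer"},
    {"Chicken", "Beer", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Beer", "Milk"},
    {"Beer", "Milk", "Clothes"},
]
minsup = 0.5
threshold = minsup * len(transactions)     # 3.5 -> need a count of at least 4

# Pass 1: count single items and keep the frequent ones
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c >= threshold}

# Pass 2: count only pairs whose members are both individually frequent
pair_counts = Counter(
    pair
    for t in transactions
    for pair in combinations(sorted(t & frequent_items), 2)
)
frequent_pairs = {p for p, c in pair_counts.items() if c >= threshold}

print(sorted(frequent_items))   # ['Beer', 'Cheese', 'Chicken', 'Milk']
print(frequent_pairs)           # {('Beer', 'Milk')}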
There are two ways to solve the problem:
using the Apriori algorithm
using FP-Growth (frequent pattern growth)
Assuming that you are using Apriori, the answer you got is correct: with 7 transactions and minsup = 0.5, an itemset needs a support count of at least 0.5 × 7 = 3.5, i.e. it must appear in at least 4 transactions.
The algorithm is simple:
First you count frequent 1-itemsets and exclude the itemsets below minimum support.
Then you count frequent 2-itemsets by combining frequent items from the previous iteration, again excluding the itemsets below the support threshold.
The algorithm continues until no itemsets exceed the threshold.
In the problem given to you, only one 2-itemset is above the threshold, so you can't go any further.
There is a solved example of further steps on Wikipedia here.
You can refer "Data Mining Concepts and Techniques" by Han and Kamber for more examples.
OK, to start you must first understand that data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an AT&T massively parallel system with 483 processors running a centralized database. Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.
Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule. Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. This is, however, seldom the case in real-life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause a combinatorial explosion, because the frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. One proposed solution is to let the user specify multiple minimum supports, reflecting the natures of the items and their varied frequencies in the database; different rules then need to satisfy different minimum supports depending on which items they contain.
Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).
I hope that once you understand the very basics of data mining that the answer to this question shall become apparent.