ThreadPoolExecutor consuming from multiple PriorityBlockingQueues - concurrency

I have the task of scheduling & executing a lot of web-requests, in Java 8, with the following conditions:
each web-request belongs to exactly one of several distinct groups
the group a web-request belongs to is an immutable and deterministic property of the request (i.e. is not a result of some (pseudo-)random logic), e.g. imagine the user on whose behalf the request is being made
the web-service is tracking the quota usage of web-requests it receives, for each of these distinct groups
a web-request may receive an HTTP 429 (Too Many Requests) at any given moment, indicating that the quota for that group is full
when this happens, no more web-requests of the same group are allowed to be made until the time indicated by the Retry-After header of the response; such futile requests still count against the quota
web-requests of a non-throttled group can, and should be processed regardless of some other groups being throttled
some requests are more equal than others, therefore eligible requests should be processed in some priority order
the number of these distinct quota groups is in the few hundreds (for now)
at any given moment a new group may be born, e.g. a new user joins the organization
I've been collecting some ideas, none of which I am satisfied with:
The most obvious is that each group could be handled by its very own ThreadPoolExecutor consuming from a respective PriorityBlockingQueue
Simplicity has its virtues, but I kind of dislike running hundreds of instances of ThreadPoolExecutor (even if each and every one of them is using a single thread for execution).
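For illustration, a minimal sketch of this per-group variant, assuming a made-up Request type that carries a priority:

import java.util.Map;
import java.util.concurrent.*;

class PerGroupExecutors {

    // illustrative request type: Comparable, so a PriorityBlockingQueue can order it
    static class Request implements Runnable, Comparable<Request> {
        final String group;
        final int priority; // lower value = higher priority
        Request(String group, int priority) { this.group = group; this.priority = priority; }
        @Override public void run() { /* perform the actual web-request */ }
        @Override public int compareTo(Request other) { return Integer.compare(priority, other.priority); }
    }

    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    void submit(Request request) {
        // lazily create one single-threaded executor per group; execute() is used
        // on purpose, because submit() would wrap the task in a FutureTask, which
        // is not Comparable and would make the PriorityBlockingQueue throw
        executors.computeIfAbsent(request.group, g ->
                new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                        new PriorityBlockingQueue<>()))
                .execute(request);
    }
}

One redeeming quality of this layout: on a 429 the group's single worker thread can simply sleep until the Retry-After deadline without stalling any other group.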
I could (try to) go down the tedious and quite error-prone path of implementing my own BlockingQueue, with a PriorityQueue maintained for each group
BlockingQueue itself does not declare that many methods, but the designers of that concurrency library saw fit to extend the Collection and Queue interfaces, and the sheer time needed to implement all those methods, and to test them too, sounds like a (dangerous) waste to me
I could also go and relax the goal of letting non-throttled groups & requests progress, and just block all requests until the stated time
this may not be as bad as it sounds, I will still have to check how easy it is to hit the quota limit and what the time penalty is - a 5-minute blackout every other week sounds almost acceptable, half an hour every midnight definitely is not
Another idea is to have a ThreadPoolExecutor with a single PriorityBlockingQueue (PBQ) and a map of throttled group -> request-lists on the side
on several occasions (on submit, on consuming from the main PBQ, and even on just having received an HTTP 429 response), the group of the request would be tested for being throttled, and if that's the case, the request would be put into that throttled group -> request-list map
but normally, requests would just be consumed by the ThreadPoolExecutor
of course, whenever some throttling period indicated by the Retry-After header of the HTTP 429 response has ended, the respective group would wake up and all its requests would be re-submitted to the main PBQ
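A rough sketch of this last combination, assuming a made-up Request abstraction (group(), priority(), call() and retryAfterSeconds() are illustrative, not an existing API); note that it knowingly leaves a small race between parking and waking up that a real implementation would have to close:

import java.util.Queue;
import java.util.concurrent.*;

class ThrottlingDispatcher {

    // illustrative request abstraction
    interface Request {
        String group();
        int priority();           // lower value = higher priority
        int call();               // performs the web-request, returns the HTTP status
        long retryAfterSeconds(); // parsed from the Retry-After header of a 429
    }

    // wrapper keeping the queue's elements Comparable, which a
    // PriorityBlockingQueue inside a ThreadPoolExecutor requires
    private class Task implements Runnable, Comparable<Task> {
        final Request request;
        Task(Request request) { this.request = request; }
        @Override public int compareTo(Task o) { return Integer.compare(request.priority(), o.request.priority()); }
        @Override public void run() {
            if (parkIfThrottled(request)) return; // re-check on consumption
            if (request.call() == 429) {
                parked.computeIfAbsent(request.group(), g -> new ConcurrentLinkedQueue<>()).add(request);
                timer.schedule(() -> { // wake the group once Retry-After has passed
                    Queue<Request> woken = parked.remove(request.group());
                    if (woken != null) woken.forEach(ThrottlingDispatcher.this::submit);
                }, request.retryAfterSeconds(), TimeUnit.SECONDS);
            }
        }
    }

    private final ExecutorService pool =
            new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<>());
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    // throttled group -> requests parked until the Retry-After deadline
    private final ConcurrentMap<String, Queue<Request>> parked = new ConcurrentHashMap<>();

    void submit(Request request) {
        if (parkIfThrottled(request)) return; // test on submit as well
        pool.execute(new Task(request));
    }

    private boolean parkIfThrottled(Request request) {
        Queue<Request> q = parked.get(request.group());
        if (q != null) { q.add(request); return true; }
        return false;
    }
}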
I've also been reading up on RxJava, but none of the delay, throttle or backpressure facilities are suitable - or at least I couldn't see how.
At this point I really don't expect any pseudo-code (well, unless it's actually shorter that way), but I am most interested in better ideas or perhaps pointers to existing facilities.
(Btw, the web-service is the Microsoft Graph API, in case anyone wonders.)

Related

Taking a survey not more than once on Amazon Mechanical Turk

How can I make sure that participants taking the survey I designed are not allowed to take it more than once on Amazon Mechanical Turk?
If you create a HIT, a given worker can only take that HIT once. If you have, e.g., multiple HITs that are all the same study (either different conditions you launch simultaneously or multiple HITs that you post over time), then workers will have access to each version. Of course, someone might have multiple accounts or something (but that is rare and against Terms of Use). So, as long as you only have one HIT (with however many assignments you need - one assignment being one worker), then you will be fine.
While it's true that posting a HIT once means a Turker can only take it once, many people find that some participants malingered, satisficed, etc., and so have to repost their HITs a second or third time. Requesters also sometimes realize they need more responses, and therefore post their HITs again. In these situations your solution is the DoesNotExist qualifier: http://mechanicalturk.typepad.com/blog/2014/07/new-qualification-comparators-add-greater-flexibility-to-qualifications-.html

How to get the 1 millionth click of a website

I often heard this question coming from different sources, but never got a good idea of the technologies to achieve this. Can anyone shed some light? The question is: you have a website with a high volume of user accesses per day. Your website is deployed in a distributed manner, with multiple webservers and load balancers responding to incoming requests from lots of locations. How do you get the 1000000th user access, and show him a special page saying "congrats, you are our 1000000th visitor!", assuming you have a distributed backend?
You could do it with jQuery, for example:
$("#linkOfInterest").click(function() { //code for updating a variable/record that contains the current number of clicks });
CSS:
a#linkOfInterest {
/* style goes here */
}
Somewhere in the HTML:
<a id="linkOfInterest" href="somepage.htm"></a>
You are going to have to trade off performance or accuracy. The simplest way to do this would be to have a memcached instance keep track of your visitor counts, or some other datastore with an atomic increment operation. Since there is only a single source of truth, only 1 visitor will get the message. This will delay the loading of your page by the roundtrip to the store at minimum.
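For illustration, the same single-source-of-truth idea in-process (across servers, memcached's atomic incr would play this role); the class and threshold are made up:

import java.util.concurrent.atomic.AtomicLong;

class VisitorCounter {
    private static final AtomicLong visits = new AtomicLong();

    // incrementAndGet is atomic, so exactly one request in this process
    // observes the value 1,000,000, no matter how many threads race
    static boolean isMillionthVisitor() {
        return visits.incrementAndGet() == 1_000_000L;
    }
}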
If you can't afford the delay, then you will have to trade off accuracy. A distributed data store will not be able to atomically increment the field any faster than a single instance. Every web server can read and write to a local node, but another node at another datacenter may also reach a count of 1 million users before the transactions are reconciled. In that case 2 or more people may get the 1 millionth user message.
It is possible to do so after the fact. Eventually, the data store will reconcile the increments, and your application can decide on a strict ordering. However, if you have already decided that a single atomic request takes too long, then this logic will take place too late to render your page.

Camel + ActiveMQ: Handling Two Distinct Concurrency Constraints With Competing Consumers

Problem:
Process a backlog of messages where each message has three headers "service", "client", and "stream". I want to process the backlog of messages with maximum concurrency, but I have some requirements:
Only 10 messages with the same service can be processing at once.
Only 4 messages with the same service AND client can be processing at once.
All messages with the same service AND client AND stream must be kept in order.
Additional Information:
I've been playing around with "maxConcurrentConsumers" along with the "JMSXGroupID" in a ServiceMix (Camel + ActiveMQ) context, and I seem to be able to get 2 out of 3 of my requirements satisfied.
For example, if I do some content-based routing to split the backlog up into separate "service" queues (one queue for each service), then I can set the JMSXGroupID to (service + client + stream), and set maxConcurrentConsumers=10 on routes consuming from each queue. This solves the first and last requirements, but I may have too many messages for the same client processing at the same time.
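For illustration, this is roughly what that routing looks like in Camel's Java DSL (queue names and the example service are made up):

import org.apache.camel.builder.RouteBuilder;

public class BacklogRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // content-based split of the backlog into one queue per service,
        // stamping JMSXGroupID so that service+client+stream stays in order
        from("activemq:queue:backlog")
            .setHeader("JMSXGroupID",
                simple("${header.service}-${header.client}-${header.stream}"))
            .recipientList(simple("activemq:queue:service-${header.service}"));

        // at most 10 concurrent consumers per service queue (requirement 1);
        // JMSXGroupID pins each group to a single consumer (requirement 3)
        from("activemq:queue:service-foo?maxConcurrentConsumers=10")
            .to("bean:messageProcessor");
    }
}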
Please note that if a solution requires a separate queue and route for every single combination of service+client, that would become unmanageable because there could be 10s of thousands of combinations.
Any feedback is greatly appreciated! If my question is unclear, please feel free to suggest how I can improve it.
To my knowledge, this would be very hard to achieve if you have 10k+ combos.
You can get around one queue per service/client combo by using consumers and selectors. That would, however, be almost equally hard to deal with (you simply can't create 10k+ selector consumers without significant performance implications), unless you can somehow predict a limited set of service/client combinations being active at once.
Can you elaborate on the second requirement? Do you need it to make sure there is some sense of fairness among your clients? Please elaborate and I'll update if I can think of anything else.
Update:
Instead of consuming by just listening to messages, you could possibly do a browse on the queue, looping through the messages and picking one that "has free slots". You can probably figure out whether the limits have been reached via some shared variable that keeps track, given you run in a single instance.
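A minimal sketch of such shared bookkeeping, assuming a single instance; one Semaphore per service and one per service+client enforce the two limits (all names are illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class SlotTracker {
    private final Map<String, Semaphore> serviceSlots = new ConcurrentHashMap<>();
    private final Map<String, Semaphore> clientSlots = new ConcurrentHashMap<>();

    // try to claim a slot for a browsed message; if false, skip it for now
    boolean tryAcquire(String service, String client) {
        Semaphore svc = serviceSlots.computeIfAbsent(service, k -> new Semaphore(10));
        Semaphore cli = clientSlots.computeIfAbsent(service + "/" + client, k -> new Semaphore(4));
        if (!svc.tryAcquire()) return false;
        if (!cli.tryAcquire()) { svc.release(); return false; } // roll back the service slot
        return true;
    }

    // must be called once the message has finished processing
    void release(String service, String client) {
        serviceSlots.get(service).release();
        clientSlots.get(service + "/" + client).release();
    }
}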

Approach to large object transfers in web services

I have to implement a SOA solution with web services. I have to transfer large objects (e.g. invoices of 25~30 MB of XML data) and I wonder what's the best approach...
Should I:
A. transfer parts of these objects separately (e.g. the header first, then the items one by one, even though there could be 1000 of them) in several WS calls, and then organize them on the server side, dealing with retries and errors.
Or ...
B. Should I transfer the entire payload in one single call and try to optimize it (and not to "burn" HTTP connections)?
I'm using .NET's WCF to expose the services layer. Recommended readings and considerations are welcome.
The idea would be to maximize the payload per call and minimize the number of calls. This isn't always simple since - in a one-shot call - firewalls or the web service itself could limit the payload size and your message might not make it, while - in the case of multiple calls - as you mentioned yourself, you have to deal with errors and retries (basically reinventing WS-ReliableMessaging).
So perhaps, instead of concentrating on the message of a usual call, you might try changing how you perform the call itself, and have a look at MTOM (Message Transmission Optimization Mechanism) with WCF, or maybe use streaming.

Document Server: Handling Concurrent Saves

I'm implementing a document server. Currently, if two users open the same document, then modify it and save the changes, the document's state will be undefined (either the first user's changes are saved permanently, or the second's). This is entirely unsatisfactory. I considered two possibilities to solve this problem:
The first is to lock the document when it is opened by someone for the first time, and unlock it when it is closed. But if the network connection to the server is suddenly interrupted, the document would stay in a forever-locked state. The obvious solution is to send regular pings to the server. If the server doesn't receive K pings in a row (K > 1) from a particular client, documents locked by this client are unlocked. If that client re-appears, the documents are locked again, provided someone else hasn't locked them in the meantime. This also helps if the client application (running in a web browser) is terminated unexpectedly, making it impossible to send a 'quitting, unlock my documents' signal to the server.
The second is to store multiple versions of the same document saved by different users. If changes to the document are made in rapid succession, the system would offer either to merge versions or to select a preferred version. To optimize storage space, only document diffs should be kept (just like source control software).
What method should I choose, taking into consideration that the connection to the server might sometimes be slow and unresponsive? How should the parameters (ping interval, rapid succession interval) be determined?
P.S. Unfortunately, I can't store the documents in a database.
The first option you describe is essentially a pessimistic locking model whilst the second is an optimistic model.
Which one to choose really comes down to a number of factors but essentially boils down to how the business wants to work. For example, would it unduly inconvenience the users if a document they needed to edit was locked by another user? What happens if a document is locked and someone goes on holiday with their client connected? What is the likely contention for each document, i.e. how likely is it that the same document will be modified by two users at the same time? How localised are the modifications likely to be within a single document? (If the same section is modified regularly then performing a merge may take longer than simply making the changes again.)
Assuming the contention is relatively low and/or the size of each change is fairly small then I would probably opt for an optimistic model that resolves conflicts using an automatic or manual merge. A version number or a checksum of the document's contents can be used to determine if a merge is required.
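A hedged sketch of such a version-number check, assuming an in-memory store; the Document and ConflictException types are made up for illustration:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class DocumentStore {

    static class Document {
        final long version;
        final String content;
        Document(long version, String content) { this.version = version; this.content = content; }
    }

    static class ConflictException extends Exception {}

    private final ConcurrentMap<String, Document> documents = new ConcurrentHashMap<>();

    // the save succeeds only if the caller edited the version currently stored;
    // on a conflict the caller must merge (automatically or manually) and retry
    void save(String id, long editedVersion, String newContent) throws ConflictException {
        Document current = documents.get(id);
        if (current != null && current.version != editedVersion) throw new ConflictException();
        Document next = new Document(editedVersion + 1, newContent);
        boolean stored = (current == null)
                ? documents.putIfAbsent(id, next) == null
                : documents.replace(id, current, next);
        if (!stored) throw new ConflictException(); // lost a race with another save
    }
}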
My suggestion would be something like your first one. When the first user (Bob) opens the document, he acquires a lock so that other users can only read the current document. If he saves the document while he is using it, he keeps the lock. Only when he exits the document is it unlocked, so that other people can edit it.
If the second user (Kate) opens the document while Bob has the lock on it, Kate will get a message saying the document is uneditable, but she can read it until the lock has been released.
So what happens when Bob acquires the lock, maybe saves the document once or twice but then exits the application leaving the lock hanging?
As you said yourself, requiring the client with the lock to send pings at a certain frequency is probably the best option. If you don't get a ping from the client for a set amount of time, this effectively means their client is not responding anymore. If this is a web application you can use JavaScript for the pings. The lock on the document is then released, and Kate can acquire it.
A ping can contain the name of the document that the client has a lock on, and the server can calculate when the last ping for that document was received.
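A small sketch of such a ping-based lock registry; the timeout constant and method names are assumptions, not part of the original design:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class LockRegistry {

    private static final long TIMEOUT_MILLIS = 30_000; // roughly K missed pings

    private static class Lock {
        final String owner;
        volatile long lastPingMillis;
        Lock(String owner, long now) { this.owner = owner; this.lastPingMillis = now; }
    }

    private final ConcurrentMap<String, Lock> locks = new ConcurrentHashMap<>();

    // returns true if clientId now holds the lock on the document
    synchronized boolean tryLock(String documentId, String clientId) {
        long now = System.currentTimeMillis();
        Lock lock = locks.get(documentId);
        if (lock == null || now - lock.lastPingMillis > TIMEOUT_MILLIS) {
            locks.put(documentId, new Lock(clientId, now)); // free, or expired
            return true;
        }
        return lock.owner.equals(clientId);
    }

    // each ping names the locked document, as suggested above
    void ping(String documentId, String clientId) {
        Lock lock = locks.get(documentId);
        if (lock != null && lock.owner.equals(clientId)) {
            lock.lastPingMillis = System.currentTimeMillis();
        }
    }

    void unlock(String documentId, String clientId) {
        Lock lock = locks.get(documentId);
        if (lock != null && lock.owner.equals(clientId)) locks.remove(documentId);
    }
}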
Currently documents are published by a limited group of people, each of them working on a separate subject. So, the inconvenience introduced by locks is minimized.
People mostly extend existing documents and correct mistakes in them.
Speaking about the pessimistic model, the 'left the client connected for N days' scenario could be avoided by setting the lock expiry date to, say, one day after the lock start date. Because the documents being edited are by no means mission-critical, and are modified by multiple users quite rarely, that could be enough.
Now consider the optimistic model. How should the differences be detected, if the documents have some regular (say, hierarchical) structure? And if they don't? What are the chances of a successful automatic merge in these cases?
The situation becomes more complicated because some of the documents (edited by the 'admins' user group) contain important configuration information (the document global index, user roles, etc.). To my mind, locks are more advantageous for precisely this kind of information, because it's not changed on an everyday basis. So some hybrid solution might be acceptable.
What do you think?