mechanical turk architecture for streaming an endless lists of tasks - amazon-web-services

How should we architect a solution that uses Amazon Mechanical Turk API to process a stream of tasks instead of a single batch of bulk tasks?
Here's more info:
Our app receives a stream of about 1,000 photos and videos per day. Each picture or video contains 6-8 numbers (it's the serial number of an electronic device) that need to be transcribed, along with a "certainty level" for the transcription (e.g. "Certain", "Uncertain", "Can't Read"). The transcription will take under 10 seconds per image and under 20 seconds per video and will require minimal skill or training.
Our app will get uploads of these images continuously throughout the day and we want to turn them into numbers within a few minutes. The ideal solution would be for us to upload new tasks every minute (under 20 per minute during peak periods) and download results every minute too.
Two questions:
To ensure a good balance of fast turnaround time, accuracy, and cost effectiveness, should we submit one task at a time, or is it best to batch tasks? If so, what variables should we consider when setting a batch size?
Are there libraries or hosted services that wrap the MTurk API to more easily handle use-cases like ours where HIT generation is streaming and ongoing rather than one-time?
Apologies for the newbie questions, we're new to Mechanical Turk.

Streaming tasks one at a time to Turk
You can stream tasks individually through mechanical turk's api by using the CreateHIT operation. Every time you receive an image in your app, you can call the CreateHIT operation to immediately send the task to Turk.
You can also setup notifications through the api, so you can be alerted as soon as a task is completed. Turk Notification API Docs
Batching vs Streaming
As for batching vs streaming, you're better off streaming to achieve a good balance of turnaround time and cost. Batching won't drive down costs too much and improving accuracy is largely dependent on vetting, reviewing, and tracking worker performance either manually or implementing automated processes.
Libraries and Services
Most libraries offer all of the operations available in the api, so you can just google or search Github for a library in your programming language. (We use the Ruby library rturk)
A good list of companies that offer hosted solutions can be found under the Metaplatforms section of a answer on Quora to the question: What are some crowdsourcing services similar to Amazon Mechanical Turk? (Disclaimer: my company, Houdini is one of the solutions listed there.)

Related

Google Speech streaming recognition slow response time

What is the fastest expected response time of the Google Speech API with streaming audio data? I am sending an audio stream to the API and am receiving the interim results with a 2000ms delay, of which I was hoping I could drop to below 1000ms. I have tested different sampling rates and different voice models.
I'm afraid that response time can't be measured or guaranteed because of the nature of the service. We don't know what is done under the hood, in fact there is no SLA for response time even though there is SLA for availability.
Something that can help you is working on building a good request:
Reducing 100-miliseconds frame size, for example, could ensure a good tradeoff between latency and efficiency.
Following Best Practices will help you to make a clean request so that the latency can be reduced.
You may want to check following links on specific uses cases to know how they addressed latency issues:
Realtime audio streaming to Google Speech engine
How to speed up google cloud speech
25s Latency in Google Speech to Text
If you really care about response time you'd better use Kaldi-based service on your own infrastructure. Something like https://github.com/alumae/kaldi-gstreamer-server together with https://github.com/Kaljurand/dictate.js
Google Cloud Speech itself works pretty fast, you can check how quick your microphone gets transcribed https://cloud.google.com/speech-to-text/.
You may probably experience buffering issue on your side, the tool you are using may buffer data before sending(buffer flush) to underlying device(stream).
You can find out how to decrease output buffer of that tool to lower values e.g. 2Kb, so data will reach Node app and Google service faster. Google recommends to send data that equals to 100ms buffer size.

Simple task queue using Google Cloud Platform : issue with Google PubSub

My task : I cannot speak openly about what the specifics of my task are, but here is an analogy : every two hours, I get a variable number of spoken audio files. Sometimes only 10, sometimes 800 or more. Let's say I have a costly python task to perform on these files, for example Automatic Speech Recognition. I have a Google Intance managed group that can deploy any number of VMs for executing this task.
The issue : right now, I'm using Google PubSub. Every two hours, a topic is filled with audio ids. Instances of the managed group can be deployed depending on the size of queue. The problem is, only one worker get all the messages from the PubSub subscription, while the others are not receiving any, perhaps because the queue is not that long (maximum ~1000 messages). This issue is reported in a few cases in the python Google Cloud github, and it is not clear if it is the intended purpose of PubSub, or just a bug.
How could I implement the equivalent of a simple serverless task queue in Python and Google Cloud, and can spawn instances based on a given metric, for example the size of the queue ? Is this the intended purpose of PubSub ?
Thanks in advance.
In App Engine you can create push queues and set rate/concurrency limits and Google will handle the rest for you. App Engine will scale as needed (e.g. increase Python instances).
If you're outside of App Engine (e.g. GKE), the pubsub Python client library may be pulling many messages at once. We had a hard time controlling this (for google-cloud-pubsub==0.34.0) as well so we ended up writing a small adjustment on top of google-cloud-pubsub calling SubscriberClient.pull with max_messages set). The server side pubsub API does adhere to max_messages.

Display real time data on website that scales?

I am starting a project where I want to create a website which will display LIVE flight information and status. We all have seen this at airport. An example is given here - http://www.computronics.biz/productimages/prodairport4.jpg. As you can see this information changes continuously. The website will talk to a backend api and the this backend api will talk to database. Now the important part is that the flight information in the database will be updated by the airline itself. There could be several airlines and they will update their data respectively. I have drawn a diagram and uploaded here - https://imgur.com/a/ssw1S.
Now those airlines will obviously have an interface (website talking to some backend API) through which they will update the database.
Now here is my attempt to solve it. We need to have some sort of trigger such that if any airline updates a flight detail in the database between current time - 1 hour to current + 4 hours (website will only display few hours of flights), we need to call the web api and then send the update to the website in the real time. The user must not refresh the page at all. At the same time the website needs to scale well i.e. if 1 million users are on the website, and there is an update in the database in the correct time range, all 1 million user's website should get updated within a decent amount of time.
I did some research and it looks like we need to have an event based approach. For example - we need to create a function (AWS lambda or Azure function) that should be called whenever there is an update in the database (Dynamo DB for example) within the correct time range. This function then should call an API which should then update the website through web socket technology for example.
I am not looking for any code but just some alternative suggestions on how this can be solved in a scalable way. Also how do we test scalability?
Dont use serverless functions(Lambda/Azure functions)
Although I am a huge fan of serverless functions, and currently running a full web app in Lambda, I don't think its needed for your use case and doesn't make sense economically. As you've answered in the comments, each airline will not write directly to the database, they'll push to an API, meaning you are explicitly told when flights have changed. When an airline has sent you new data you can simply propagate this to all the browser endpoints via websockets. This keeps the design very simple. There is no need to artificially create a database event that then triggers a function that will then tell you a flight has been updated. Thats like removing your doorbell and replacing it with a motion detector that triggers a doorbell :)
Cost
Money always deserves its own section. Lambda is more of an economic break through than a technological one. You have to know when its cost effective. You pay per request so if your dealing with a process that handles 10,000 operations a month, or something that only fires 1,000 times a day, than lambda is dirt cheap and practically free. You also pay for the length of time the function is executing and the memory consumed while executing. Generally, it makes sense to use lambda functions where a dedicated server would be sitting idle for most of the time. So instead of a whole EC2 instance, AWS provides you with a container on demand. There are points at which high requests rates and constantly running processes makes lambda more expensive than EC2. This article discusses how generally its cheaper to use lambda up to a point -> https://www.trek10.com/blog/lambda-cost/ The same applies to Azure functions and googles equivalent. They are all just containers offered on demand.
If you're dealing with flight information I would imagine you will have thousands of flights being updated every minute so your lambda functions will be firing constantly as if you were running an EC2 instance. You will end up paying a lot more than EC2. When you have a service that needs to stay up 24/7 and run 24/7 with high activity that is most certainly a valid use case for a dedicated server or servers.
Proposed Solution
These are the components I would use below:
Message Queue of some sort (RabbitMQ or AWS SQS with SNS perhaps)
Web Socket Backend (The choice will depend on programming language)
Airline input API (REST,GraphQL, or maybe AWS Kinesis Data Firehose)
The airlines publish their data to a back-end api. The updates are stored on a message queue and the web applicaton that actually displays the results to users, via websockets, reads from the queue.
Scalability
For scalability you can run the websocket application on multiple EC2 instances (all reading from the same queuing service) in an autoscaling group, so with extra load more instances will be created automatically hence the name "autoscaling". And those instances can sit behind an elastic load balancer. Lots of AWS documentation on how to do this and its their flagship design pattern. If you use AWS SQS you don't have to manage the scalability details yourself, aws handles that. The only real components to scale are your websocket application and the flight data input endpoint. You can run the flight api in an autoscaling group as well but AWS does offer an additional tool for high traffic data processing. I detail that below.
Testing Scalability
It would be fairly easy to have a mock airline blast your service with thousands and thousands of fake updates and on the other end you can easily run multiple threads of selenium tests simulating browser clicks and validating that the UI is still operational.
Additional tools
If it ends up being large amounts of data, rather than using a conventional REST api for your flight update service you could consider a service AWS offers specifically for dealing with large amounts of real time updates (Kinessis Data Firehose) https://aws.amazon.com/kinesis/data-firehose/ But I've never used it.
First, please don't over think this. This is a trivial problem to solve and doesn't require any special techniques, technologies or trendy patterns & frameworks.
You actually have three functional areas you can address almost separately.
Ingestion - Collection and normalization of the data from the various sources. For this, you'll need a process and transformation engine, LogicApps or such.
Your databases. You'll quickly learn that not all flights are the same ;). While it might seem so, the amount of data isn't that much. Instances of MySQL/SQL Server tuned for a particular function will work just fine. Hint, you don't need to have data for every movement ready to present all the time.
Presentation. The data API and UIs. This, really, is the easy part. I would suggest you use basic polling at first. For reasons you will never have any control over, the SLA for flight data is ~5 minutes so a real-time client notification system is time you should spend elsewhere at first.

High latency issue of online prediction

I've deployed a linear model for classification on Google Machine Learning Engine and want to predict new data using online prediction.
When I called the APIs using Google API client library, it took around 0.5s to get the response for a request with only one instance. I expected the latency should be less than 10 microseconds (because the model is quite simple) and 0.5s was way too long. I also tried to make predictions for the new data offline using the predict_proba method. It took 8.2s to score more than 100,000 instances, which is much faster than using Google ML engine. Is there a way I can reduce the latency of online prediction? The model and server which sent the request are hosted in the same region.
I want to make predictions in real-time (the response is returned immediately after the APIs gets the request). Is Google ML Engine suitable for this purpose?
Some more info would be helpful:
Can you measure the network latency from the machine you are accessing the service to gcp? Latency will be lowest if you are calling from a Compute Engine instance in the same region that you deployed the model to.
Can you post your calling code?
Is this the latency to the first request or to every request?
To answer your final question, yes, cloud ml engine is designed to support a high queries per second.

What good alternatives are there to Copperegg for monitoring EC2 instances?

I've been using Copperegg for a while now and have generally been happy with it until lately, where I have had a few issues. It's being used to monitor a number of EC2 instances that must be up 24/7.
Last week I was getting phantom alerts that servers had gone down when they hadn't, which I can cope with, but also I didn't get an alert when I should have done. One server had high CPU for over 5 mins when the alert should be triggered after 1 minute. The Copperegg support weren't not all that helpful, merely agreeing that an alert should have been triggered.
The latter of those problems is unacceptable and if it were to happen again outside of working hours then serious problems will follow.
So, I'm looking for alternative services that will do that same job. I've looked at Datadog and New Relic, but both have a significant problem in that they will only alert me of a problem 5 minutes after it's occurred, rather than the 1 minute I can get with Copperegg.
What else is out there that can do the same job and will also integrate with Pager Duty?
tl;dr : Amazon CloudWatch will do what you want and probably much much more.
I believe that Amazon actually offers a service that would accomplish your goal - CloudWatch (pricing). I'm going to take your points one by one. Note that I haven't actually used it before, but the documentation is fairly clear.
One server had high CPU for over 5 mins when the alert should be triggered after 1 minute
It looks like CloudWatch can be configured to send an alert (which I'll get to) after one minute of a condition being met:
One can actually set conditions for many other metrics as well - this is what I see on one of my instances, and I think that detailed monitoring (I use free), might have even more:
What else is out there that can do the same job and will also integrate with Pager Duty?
I'm assuming you're talking about this. It turns out the Pager Duty has a helpful guide just for integrating CloudWatch. How nice!
Pricing
Here's the pricing page, as you would probably like to parse it instead of me telling you. I'll give a brief overview, though:
You don't want basic monitoring, as it only gives you metrics once per five minutes (which you've indicated is unacceptable.) Instead, you want detailed monitoring (once every minute).
For an EC2 instance, the price for detailed monitoring is $3.50 per instance per month. Additionally, every alarm you make is $0.10 per month. This is actually very cheap if compared to CopperEgg's pricing - $70/mo versus maybe $30 per month for 9 instances and copious amounts of alarms. In reality, you'll probably be paying more like $10/mo.
Pager Duty's tutorial suggests you use SNS, which is another cost. The good thing: it's dirt cheap. $0.60 per million notifications. If you ever get above a dollar in a year for SNS, you need to perform some serious reliability improvements on your servers.
Other shiny things!
You're not just limited to Amazon's pre-packaged metrics! You can actually send custom metrics (time it took to complete a cronjob, whatever) to Cloudwatch via a PUT request. Quite handy.
Submit Custom Metrics generated by your own applications (or by AWS resources not mentioned above) and have them monitored by Amazon CloudWatch. You can submit these metrics to Amazon CloudWatch via a simple Put API request.
(from here)
Conclusion
So all in all: CloudWatch is quite cheap, can do 1-minute frequency stats, and will integrate with Pager Duty.
tl;dr: Server Density will do what you want, on top of that it has web checks and custom metrics too.
In short Server Density is a monitoring tool that will monitor all the relevant server metrics. You can take a look at this page where it’s all described.
One server had high CPU for over 5 mins when the alert should be triggered after 1 minute
Server Density’s open source agent collects and posts the data to their server every minute and you can decide yourself when that alert should be triggered. In the alert below you can see that the alert will alert 1 person after 1 minute and then repeatedly alert every 5 minutes.
There is a lot of other metrics that you can alert on too.
What else is out there that can do the same job and will also integrate with Pager Duty?
Server Density also integrates with PagerDuty. The only thing you need to do is to generate an api key at PagerDuty and then provide that in the settings.
Just provide the API key in the settings and you can then in check pagerduty as one of the alert recipients.
Pricing
You can find the pricing page here. I’ll give you a brief overview of it. The pricing starts at $10 for one server plus one web check and then get’s cheaper per server the more servers you add.
Everything will be monitored once every minute and there is no fees added for the amount of alerts added or triggered, even if that is an SMS to your phone number. The cost is slightly more expensive than the Cloudwatch example, but the support is good. If you used copperegg before they have a migration tool too.
Other shiny things!
Server Density allows you to monitor all the things! Then only thing you need to do is to send us custom metrics which you can do with a plugin written by yourself or by someone else.
I have to say that the graphs that Server Density provides is somewhat akin to eye candy too. Most other monitoring solutions I’ve seen out there have quite dull dashboards.
Conclusion
It will do the job for you. Not as cheap as CloudWatch, but doesn’t lock you in into AWS. It’ll give you 1 minute frequency metrics and integrate with pagerduty + a lot more stuff.