Taking a survey not more than once on Amazon Mechanical Turk

How can I make sure that participants who take my survey on Amazon Mechanical Turk are not allowed to take it more than once?

If you create a HIT, a given worker can only take that HIT once. If you have, e.g., multiple HITs that are all part of the same study (either different conditions you launch simultaneously, or multiple HITs that you post over time), then workers will have access to each version. Of course, someone might have multiple accounts (but that is rare and against the Terms of Use). So, as long as you only have one HIT (with however many assignments you need, one assignment being one worker), you will be fine.

While it's true that posting a HIT once means a Turker can only take it once, many requesters find that some participants malinger, satisfice, etc., and so have to repost their HITs a second or third time. Requesters also sometimes realize they need more responses, and therefore post their HITs again. In these situations your solution is a qualification with the DoesNotExist comparator: http://mechanicalturk.typepad.com/blog/2014/07/new-qualification-comparators-add-greater-flexibility-to-qualifications-.html
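To make that concrete, here is a rough sketch using the boto3 MTurk client in Python (the linked post predates boto3, but the comparator works the same way); the qualification name, the HIT parameters, and the question_xml and previous_worker_ids variables are illustrative:

    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    # Create a qualification type marking workers who already took the study.
    qual = mturk.create_qualification_type(
        Name="AlreadyTookMySurvey",  # illustrative name
        Description="Held by workers who have completed this survey",
        QualificationTypeStatus="Active",
    )
    qual_id = qual["QualificationType"]["QualificationTypeId"]

    # Tag every worker who responded to the earlier posting.
    for worker_id in previous_worker_ids:  # collected from earlier assignments
        mturk.associate_qualification_with_worker(
            QualificationTypeId=qual_id,
            WorkerId=worker_id,
            IntegerValue=1,
            SendNotification=False,
        )

    # When reposting, exclude anyone who holds the qualification.
    mturk.create_hit(
        Title="My survey (repost)",  # illustrative parameters throughout
        Description="A short research survey",
        Reward="0.50",
        AssignmentDurationInSeconds=1800,
        LifetimeInSeconds=86400,
        MaxAssignments=100,
        Question=question_xml,  # your ExternalQuestion/HTMLQuestion XML
        QualificationRequirements=[{
            "QualificationTypeId": qual_id,
            "Comparator": "DoesNotExist",
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        }],
    )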

Related

Since vs updated_time vs created_time

I'm writing an app to collect Facebook posts matching a certain search term, and I'm trying to fetch only new or updated posts from the graph.facebook.com/search endpoint. I've concluded from debugging that this particular endpoint uses time-based pagination (since, until), so here's my process:
1. Fetch new posts using the most recent 'since' time (defaulting to now - 5 mins at start).
2. Update my 'since' time to the most recent created_time or updated_time from the list of returned posts.
3. Sleep X seconds, repeat.
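(For illustration, a minimal Python sketch of that loop, assuming the requests library; handle and parse_fb_time are hypothetical helpers, and the token and search term are placeholders:)

    import time
    import requests

    ACCESS_TOKEN = "..."  # placeholder: user-level token from the developer tools
    SEARCH_TERM = "..."   # placeholder search term

    since = int(time.time()) - 300  # default to now - 5 mins at start
    while True:
        resp = requests.get(
            "https://graph.facebook.com/search",
            params={"q": SEARCH_TERM, "type": "post",
                    "since": since, "access_token": ACCESS_TOKEN},
        ).json()
        for post in resp.get("data", []):
            handle(post)  # hypothetical: store/process the matched post
            # advance 'since' to the newest created_time/updated_time seen
            ts = post.get("updated_time") or post.get("created_time")
            since = max(since, parse_fb_time(ts))  # hypothetical ISO-8601-to-epoch parser
        time.sleep(30)  # sleep X seconds, repeat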
However, I can't even see my own newly created posts. I do get some results, but they seem random in terms of why they match my search and not my own. For testing purposes, I'm using a user-level access token generated using the FB developer tools, so I should definitely not have any permissions issues restricting me from seeing my own content.
What gives?
Edit: More testing reveals that I can randomly receive SOME of my own posts, but there appears to be no rhyme or reason why one post shows up and the others don't. For example, I just posted 3 posts and received the second one via my app. The first and third are nowhere to be found.
I think what you are seeing here is an artefact of the consistency model Facebook is using. You can see another example of this when you look at your feed from two different devices: if I look at my feed on my smartphone and then check it on my PC, sometimes I see the same items, and sometimes there are items I saw on one device that I didn't see on the other. This is because Facebook uses eventual consistency.
In simple terms, this means that given enough time, all data clusters will be consistent, but this is not guaranteed at any given point in time. The bad news: there is not much you can do about it. It's just a fact of life when working with very large distributed systems (and Facebook is one of the largest in the world). At this scale, it is just not practical, where technology is today, to keep all copies of the data completely in sync at all times. What I think you are seeing is two requests serviced by two clusters that are currently not 100% in sync.
Here is an interesting read on the subject.
And here is something from Facebook. You can skip to the Consistency section of the page (although I would recommend reading the entire post; it is a very interesting overview of Facebook's architecture).

How to get the 1 millionth click of a website

I have often heard this question from different sources, but never got a good idea of the technologies used to achieve it. Can anyone shed some light? The question: you have a website with a high volume of user accesses per day. Your website is deployed in a distributed manner, with multiple web servers and load balancers responding to incoming requests from many locations. How do you detect the 1,000,000th user access and show them a special page saying "congrats, you are our 1,000,000th visitor!"? Assume you have a distributed backend.
You could do it with jQuery, for example:

    $("#linkOfInterest").click(function() {
        // code for updating a variable/record that contains the current number of clicks
    });

CSS:

    a#linkOfInterest {
        /* style goes here */
    }

And somewhere in the HTML:

    <a id="linkOfInterest" href="somepage.htm"></a>
You are going to have to trade off performance or accuracy. The simplest way to do this would be to have a memcached instance keep track of your visitor counts, or some other datastore with an atomic increment operation. Since there is only a single source of truth, only one visitor will get the message. This will delay the loading of your page by at least the round trip to the store.
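A minimal sketch of that idea in Python, using Redis as the single source of truth (its INCR is atomic; any store with an atomic increment works the same way, and the key name and page-rendering helpers are illustrative):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def on_visit():
        # INCR is atomic, so exactly one request ever observes 1,000,000.
        count = r.incr("visitor_count")
        if count == 1_000_000:
            return render_special_page()  # hypothetical: the "congrats" page
        return render_normal_page()       # hypothetical: the usual page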
If you can't afford the delay, then you will have to trade off accuracy. A distributed data store will not be able to atomically increment the field any faster than a single instance. Every web server can read and write to a local node, but another node in another datacenter may also reach a count of 1 million users before the transactions are reconciled. In that case, two or more people may get the 1 millionth user message.
It is possible to decide the winner after the fact. Eventually the data store will reconcile the increments, and your application can settle on a strict ordering. However, if you have already decided that a single atomic request takes too long, then this logic will happen too late to render the message into your page.

Multiple HITs or ExternalQuestion in Mechanical Turk?

First of all, I must say I am totally new to MTurk, so forgive me if I am thinking about this in totally the wrong way.
I have to create a task where workers classify a sentence: is it spam, or does it fall into a certain category? I will have about 2,500 sentences to classify per day.
What is the best way to use the API to do this? I understand how to create a HIT using the API, but it is my understanding that I can't create a recurring HIT that changes itself once each sentence is classified. Do I need to create 2,500 HITs?
I researched and found out about the ExternalQuestion, which I can set up on my server and make change with each form submit.
In that case, will it be just 1 HIT? Is that the correct way to do this?
I am confused about the dynamic part of MTurk.
Any tips, up-to-date documentation, or suggestions will be appreciated.
Thanks!
You likely want to create separate HITs.
If you create a single External HIT (hosted on your server), an MTurk Worker who takes your HIT will not be eligible to take another task (e.g., another classification), since Workers are not allowed to take a single HIT more than once. However, if you create separate HITs, a Worker can take as many of them as they wish, which is probably what you want.
You are also correct that you cannot change a HIT dynamically unless it is run on your own server.
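A rough sketch of the separate-HITs approach with the boto3 MTurk client in Python; the HIT parameters are illustrative, and build_question_xml is a hypothetical helper that wraps a sentence in question XML:

    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    for sentence in sentences:  # your ~2,500 sentences for the day
        mturk.create_hit(
            Title="Classify a sentence",  # illustrative parameters throughout
            Description="Mark a sentence as spam or assign it a category",
            Reward="0.02",
            AssignmentDurationInSeconds=300,
            LifetimeInSeconds=86400,
            MaxAssignments=3,  # e.g. 3 workers per sentence for majority voting
            Question=build_question_xml(sentence),  # hypothetical helper
        )

With one HIT per sentence, MTurk also handles the assignment bookkeeping for you, whereas a single External HIT would force your server to hand out sentences and track who has classified what.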

Determine unique visitors to site

I'm creating a Django website with Apache2 as the server. I need a way to determine the number of unique visitors to my website (specifically, to each individual page) in a foolproof way. Unfortunately, users will have high incentives to try to "game" the tracking system, so I'm trying to make it foolproof.
Is there any way of doing this?
Currently I'm trying to use IP & Cookies to determine unique visitors, but this system can be easily fooled with a headless browser.
Unless it's necessary that the data be integrated into your Django database, I'd strongly recommend "outsourcing" your traffic tracking to another provider. I'm very happy with Google Analytics.
Failing that, there's really little you can do to keep someone from gaming the system. You could limit based on IP address but then of course you run into the problem that often many unique visitors share IPs (say, via a university, organization, or work site). Cookies are very easy to clear out, so if you go that route then it's very easy to game.
One thing that's harder to get rid of is files stored in the appcache, so one possible solution that would work on modern browsers is to store a file in the appcache. You'd count the first load as the unique visit, and after that, since the file is cached, they don't get counted again.
Of course, since you presumably need this to be backwards compatible, it leaves the system open to exactly the sorts of tools most likely to be used for gaming it, such as curl.
You can certainly block non-browser-like user agents, which makes gaming slightly more difficult for those who don't know about spoofing user-agent strings (though most will quickly learn).
Really, the best solution might be to ask: what is the desired outcome of a visit to a page? If it is, for example, selling a product, then don't reward the people who have the most page views; reward the people whose hits generate the most sales, or whatever time-consuming action someone might take on the page.
Possible solution:
If you're willing to ignore people with JavaScript disabled, you could choose to count only people who access the page and then stay on it for a given window of time (say, 1 minute). On the server side, store a dictionary called stay_until in request.session, with a key for each unique page; after a minute or so, run an Ajax request back to the server. If the value of stay_until[page_id] is less than or equal to the current time, count them as an active visitor; otherwise don't. If someone tried to game this by changing their cookie and loading multiple tabs at once, it wouldn't work, because they'd need to keep the same cookie to register that they'd been on the page long enough. I honestly can't see a way to game that. It also means it would take someone at least 20 minutes to generate 20 unique visits, and as long as you make the payoff worth less than the time consumed, that is a strong disincentive.
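A rough Django sketch of that idea; the view names, the counting helper, and the page-rendering helper are illustrative:

    import time
    from django.http import JsonResponse

    STAY_SECONDS = 60  # illustrative window

    def page_view(request, page_id):
        # Record when this visit becomes old enough to count.
        stay_until = request.session.setdefault("stay_until", {})
        stay_until[page_id] = time.time() + STAY_SECONDS
        request.session.modified = True  # the nested dict changed in place
        return render_page(request, page_id)  # hypothetical: normal rendering

    def confirm_visit(request, page_id):
        # The page calls this via Ajax once STAY_SECONDS have elapsed.
        stay_until = request.session.get("stay_until", {})
        if page_id in stay_until and stay_until[page_id] <= time.time():
            record_unique_visit(page_id)  # hypothetical: bump the counter once
            del stay_until[page_id]       # don't count this session again
            request.session.modified = True
        return JsonResponse({"counted": True})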
I'd even make it more explicit: at the bottom of the page, in a noscript tag, put "Your access was not counted. Turn on JavaScript to be counted", linking to a page that lays out the tracking process.
As HTTP requests are stateless and you have no control over the user's behavior on the client side, there is no bulletproof way.
The only way you're going to be able to track "unique" visitors in a fool-proof way is to make it contingent on some controlled factor such as a login. Anything else can and will fail to be completely accurate.

SQL Query minimizing/caching in a C++ application

I'm writing a project in C++/Qt that is able to connect to any type of SQL database supported by QtSQL (http://doc.qt.nokia.com/latest/qtsql.html). This includes both local and external servers.
However, when the database in question is external, the speed of the queries starts to become a problem (slow UI, ...). The reason: Every object that is stored in the database is lazy-loaded and as such will issue a query every time an attribute is needed. On average about 20 of these objects are to be displayed on screen, each of them showing about 5 attributes. This means that for every screen that I show about 100 queries get executed. The queries execute quite fast on the database server itself, but the overhead of the actual query running over the network is considerable (measured in seconds for an entire screen).
I've been thinking about a few ways to solve the issue; the most promising approaches seem to me to be:
Make fewer queries
Make queries faster
Tackling (1)
I could find some way to delay the actual fetching of the attributes (start a transaction), and then, when the programmer writes endTransaction(), the database tries to fetch everything in one go (with a SQL UNION, a loop, ...). This would probably require quite a bit of modification to the way the lazy objects work, but if people confirm that it is a decent solution, I think it could be worked out elegantly. If this speeds things up enough, an elaborate caching scheme might not even be necessary, saving a lot of headaches.
I could try pre-loading the attribute data by fetching it all in one query for all the objects that are requested, effectively making them non-lazy. Of course, in that case I will have to worry about stale data. How would I detect stale data without sending at least one query to the external db? (Note: sending a query to check for stale data on every attribute check would give a best-case performance increase of zero and a worst-case 2x performance decrease when the data is actually found to be stale.)
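To make the pre-loading idea concrete, here is a sketch (in Python for brevity, assuming a DB-API-style connection); the table and column names are made up, and the fetch timestamp is what a later staleness check would compare against:

    import time

    def preload_attributes(db, object_ids, attributes):
        # Fetch the given attributes for a whole batch of objects in one
        # round trip, instead of one query per attribute per object.
        placeholders = ", ".join("?" * len(object_ids))
        cols = ", ".join(["id"] + attributes)
        rows = db.execute(  # 'objects' is an illustrative table name
            "SELECT %s FROM objects WHERE id IN (%s)" % (cols, placeholders),
            object_ids,
        )
        fetched_at = time.time()  # remember when we fetched, for staleness checks
        cache = {}
        for row in rows:
            obj_id, values = row[0], row[1:]
            cache[obj_id] = {"fetched_at": fetched_at,
                             **dict(zip(attributes, values))}
        return cache  # lazy getters consult this before going over the network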
Tackling (2)
Queries could be made faster, for example, by keeping a locally synchronized copy of the database running. However, I don't really have the option of running, for example, exactly the same database type on the client machines as on the server, so the local copy would likely be an SQLite database. This would also mean that I couldn't use a db-vendor-specific solution. What are my options here? What has worked well for people in these kinds of situations?
Worries
My primary worries are:
Stale data: there are plenty of queries imaginable that change the db in such a way that it prohibits an action that would seem possible to a user with stale data.
Maintainability: How loosely can I couple in this new layer? It would obviously be preferable if it didn't have to know everything about my internal lazy object system and about every object and possible query
Final question
What would be a good way to minimize the cost of making a query? Good meaning some combination of: maintainable, easy to implement, not too application-specific. If it comes down to picking any two, then so be it. I'd like to hear people talk about their experiences and what they did to solve it.
As you can see, I've thought of some problems and ways of handling them, but I'm at a loss for what would constitute a sensible approach. Since it will probably involve quite a lot of work and intensive changes to many layers in the program (hopefully as few as possible), I thought about asking all the experts here before making a final decision on the matter. It is also possible I'm just overlooking a very simple solution, in which case a pointer to it would be much appreciated!
Assume that all relevant server-side tuning has been done (for example: MySQL query cache, best possible indexes, ...).
Note: I've checked questions from users with similar problems that didn't entirely answer mine, for example "Suggestion on a replication scheme for my use-case?" and "Best practice for a local database cache?".
If any additional information is necessary to provide an answer, please let me know and I will duly update my question. Apologies for any spelling/grammar errors; English is not my native language.
Note about "lazy"
A small example of what my code looks like (simplified of course):
QList<MyObject> myObjects = database->getObjects(20, 40); // fetch and construct objects 20 to 40 from the db
// ...some time later
// screen filling time!
foreach (const MyObject &o, myObjects) {
    o.getInt("status", 0);                  // == db request
    o.getString("comment", "no comment!");  // == db request
    // about 3 more of these
}
At first glance it looks like you have two conflicting goals: query speed, but always using up-to-date data. You should probably fall back on your actual requirements to help decide here.
1) Your database is nearly static compared to how the application uses it. In this case, use your option 1b and preload all the data. If there's a slim chance that the data may change underneath you, just give the user an option to refresh the cache (fully, or for a particular subset of the data). This way the slow access is in the hands of the user.
2) The database changes fairly frequently. In this case, perhaps an SQL database isn't right for your needs; you may need a higher-performance dynamic database that pushes updates rather than requiring a pull. That way your application would be notified when the underlying data changed and could respond quickly. If that doesn't work, you want to construct your queries to minimize the number of DB library and I/O calls. For example, if you execute a sequence of select statements, your results should contain all the appropriate data in the order you requested it; you just have to keep track of which select statement each result corresponds to. Alternatively, if you can use looser query criteria so that a single query returns more than one row, that ought to help performance as well.