I'm writing an app to collect Facebook posts matching a certain search term, and I'm trying to fetch only the posts that are new or updated since my last request to the graph.facebook.com/search endpoint. I've concluded from debugging that this particular endpoint uses time-based pagination (since, until), so here's my process:
1. Fetch new posts using the most recent 'since' time (defaulting to now - 5 mins at start)
2. Update my 'since' time to the most recent created_time or updated_time from the list of returned posts
3. Sleep X seconds, repeat
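The loop described above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: `fetch_posts` is a hypothetical callable standing in for the request to the search endpoint, and the timestamp format is an assumption about how created_time and updated_time are returned.

```python
from datetime import datetime, timezone

# Assumed timestamp format for created_time / updated_time,
# e.g. "2013-11-20T12:03:00+0000".
FB_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def latest_timestamp(posts, fallback):
    """Return the newest created_time/updated_time in posts, or fallback."""
    newest = fallback
    for post in posts:
        for field in ("created_time", "updated_time"):
            raw = post.get(field)
            if raw is None:
                continue
            stamp = datetime.strptime(raw, FB_TIME_FORMAT)
            if stamp > newest:
                newest = stamp
    return newest


def poll_once(fetch_posts, since):
    """Run one iteration: fetch posts newer than since, then advance since.

    fetch_posts is a hypothetical callable that takes the current 'since'
    value and returns a list of post dictionaries.
    """
    posts = fetch_posts(since)
    return posts, latest_timestamp(posts, since)
```

The caller would sleep X seconds between calls to `poll_once`, passing the returned timestamp back in as the next 'since' value.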
However, I can't even see my own newly created posts. I do get some results, but it seems random why those posts match my search while my own do not. For testing purposes, I'm using a user-level access token generated using the FB developer tools, so I should definitely not have any permissions issues restricting me from seeing my own content.
What gives?
Edit: More testing reveals that I can randomly receive SOME of my own posts, but there appears to be no rhyme or reason why one post shows up and the others don't. For example, I just posted 3 posts and received the second one via my app. The first and third are nowhere to be found.
I think what you are seeing here is an artefact of the consistency model Facebook is using. You can see another example of this when you look at your feed from two different devices. If I look at my feed on my smartphone and then check it on my PC, sometimes I see the same items and sometimes there are items I saw on one device that I didn't see on the other. This is because Facebook uses eventual consistency.
In simple terms this means that given enough time, all data clusters will be consistent, but this is not guaranteed at any given time point. The bad news is: there is not much you can do about this. It's just a fact-of-life when working with very large distributed systems (and Facebook is one of the largest in the world). At this scale it is just not practical, where technology is today, to keep all copies of the data completely in sync at all times. What I think you are seeing is two requests serviced by two clusters which are currently not 100% in sync.
Here is an interesting read on the subject.
And here is something from Facebook. You can skip to the Consistency section of the page (although I would recommend reading the entire post; it is a very interesting overview of Facebook's architecture).
I'm having a problem with Advanced segments in Google Mobile App Analytics.
A condition has been set up to include all screens that match the regex "/01-12-2013/", but it's also showing me screens which do not contain this string. For example, I'm getting a screen name containing "/11-11-2013/", which I would have expected to be filtered out.
The segment seems to return different results depending on which tab I'm in in Google Mobile App Analytics, if that helps at all.
In "Audience Overview" it's returning 48.02% of all screen views. In "Behavior Overview" it's returning 71.51% of all screen views.
Here are some screenshots to illustrate the problem.
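As a sanity check on the pattern itself, the expectation in the question can be reproduced outside Analytics. This is only a sketch: it assumes the segment applies an unanchored regex match (modelled here with Python's `re.search`), and the screen names are made-up examples.

```python
import re


def screen_matches(pattern, screen_name):
    """Return True if the pattern matches anywhere in the screen name."""
    return re.search(pattern, screen_name) is not None


# Hypothetical screen names, used for illustration only.
included = screen_matches(r"/01-12-2013/", "/schedule/01-12-2013/overview")
excluded = screen_matches(r"/01-12-2013/", "/schedule/11-11-2013/overview")
```

Under that assumption, a screen name containing "/11-11-2013/" does not match, so the pattern alone should not explain the extra screens.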
This is going to sound a bit ridiculous but, when creating advanced segments, after you've created them, I'd give them an hour or two before relying on the data they provide. I still have yet to find a solid answer as to why this is, but across a wide range of sites over the past year or two I've found similar issues. I've noticed that when I create an advanced segment to filter specific pages, invariably the initial results still show irrelevant pages I specifically filtered out. The only thing I've been able to attribute this to is some sort of "lag", on Google's side, in updating the Analytics data/property/view/segment. In almost every case, I've simply waited an hour or two after creating an advanced segment, and the data that was filtered usually displays correctly by then.
An interesting thing I've also seen was that a third party reporting platform I use, that has a Google Analytics integration, actually displayed the correct advanced segmented data BEFORE it showed up properly in Analytics. Strange.
The autocomplete API allows us to retrieve lists of all countries, regions, and locales by leaving out the query string and setting the result limit to a large number, but this feature isn't available at the city level.
Is there a way that we can retrieve a full list of all targetable cities and their IDs? If not, can we cache the autocomplete data for cities to build up such a list?
That functionality is probably not supported because of the massive amount of data that fetching all the cities in the world would return, even with paging. Limiting the response by country (by using country_list=["ca"]) and then fetching all cities doesn't sound too far-fetched, but it is not implemented either.
To me, it sounds like you have two options.
Create a bug report using our bug tool to request a wishlist feature (doesn't guarantee anything, but at least we can track it if we choose to implement it and can serve as a way to gauge interest in the feature)
IANAL, but part 2 of section 2 of the FB Platform Policies states:
You may cache data you receive through use of the Facebook API in order to improve your application’s user experience, but you should try to keep the data up to date. This permission does not give you any rights to such data.
That sounds like you can cache the autocomplete data, since it will improve the UX of your app; just remember that you do not have the rights to the data. I would be cautious about this, as it would really suck if you worked hard to build all the caching functionality only to have FB say that it's not allowed. I would consult some experts before pursuing this path.
I'm creating a Django website with Apache2 as the server. I need a way to determine the number of unique visitors to my website (specifically, to each individual page) in a foolproof way. Unfortunately, users will have high incentives to try to "game" the tracking system, so I'm trying to make it foolproof.
Is there any way of doing this?
Currently I'm trying to use IP & Cookies to determine unique visitors, but this system can be easily fooled with a headless browser.
Unless it's necessary that the data be integrated into your Django database, I'd strongly recommend "outsourcing" your traffic to another provider. I'm very happy with Google Analytics.
Failing that, there's really little you can do to keep someone from gaming the system. You could limit based on IP address but then of course you run into the problem that often many unique visitors share IPs (say, via a university, organization, or work site). Cookies are very easy to clear out, so if you go that route then it's very easy to game.
One thing that's harder to get rid of is files stored in the appcache, so one possible solution that would work on modern browsers is to store a file in the appcache. You'd count the first time it was loaded in as the unique visit, and after that since it's cached they don't get counted again.
Of course, since you presumably need this to be backwards compatible, it remains open to exactly the sorts of tools most likely to be used for gaming the system, such as curl.
You can certainly block non-browserlike user agents, which makes it slightly more difficult if some gamers don't know about spoofing browser agent strings (which most will quickly learn).
Really, the best solution might be -- what is the outcome from a visit to a page? If it is, for example, selling a product, then don't award people who have the most page views; award the people whose hits generate the most sales. Or whatever time-consuming action someone might take at the page.
Possible solution:
If you're willing to ignore people with JavaScript disabled, you could choose to count only people who access the page and then stay on that page for a given window of time (say, 1 minute). After a given period of time, do an Ajax request back to the server. So if they tried to game by changing their cookie and loading multiple tabs at once, it wouldn't work because they'd need to have the same cookie in order to register that they'd been on that page long enough. I actually think this might work; I can't honestly see a way to game that. Basically on the server side you store a dictionary called stay_until in request.session with keys for each unique page and after 1 minute or so you run an Ajax call back to the server. If the value for stay_until[page_id] is less than or equal to the current time, then they're an active user, otherwise they're not. This means that it will take someone at least 20 minutes to generate 20 unique visitors, and so long as you make the payoff worth less than the time consumed that will be a strong disincentive.
I'd even make it more explicit: on the bottom of the page in a noscript tag, put "Your access was not counted. Turn on JavaScript to be counted" with a page that lays out the tracking process.
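The stay_until idea can be sketched as below. This is a minimal sketch under stated assumptions: `session` is any dict-like object standing in for request.session, and the function names, the 60-second window, and the explicit `now` argument (used to keep the example testable) are all hypothetical.

```python
import time

# Assumed minimum time a visitor must stay on a page to be counted.
STAY_SECONDS = 60


def register_page_load(session, page_id, now=None):
    """Record the earliest time at which this page view may be counted."""
    now = time.time() if now is None else now
    stay_until = session.setdefault("stay_until", {})
    stay_until[page_id] = now + STAY_SECONDS
    # With a real Django session, a nested change like this also needs
    # request.session.modified = True so that it gets saved.


def confirm_visit(session, page_id, now=None):
    """Handle the Ajax call: return True if the view should be counted."""
    now = time.time() if now is None else now
    deadline = session.get("stay_until", {}).get(page_id)
    if deadline is None:
        # No matching page load was recorded for this session.
        return False
    if deadline > now:
        # The Ajax call arrived before the window elapsed.
        return False
    # Remove the entry so the same page load is not counted twice.
    del session["stay_until"][page_id]
    return True
```

Because the deadline lives in the server-side session, a client that swaps its cookie between the page load and the Ajax call has no recorded deadline to confirm against.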
As HTTP requests are stateless and you have no control over the user's behavior on the client side, there is no bulletproof way.
The only way you're going to be able to track "unique" visitors in a fool-proof way is to make it contingent on some controlled factor such as a login. Anything else can and will fail to be completely accurate.
I have a Windows Phone 7 app that (currently) calls an OData service to get data, and throws the data into a listbox. It is horribly slow right now. The first cause I can think of is that OData returns way more data than I actually need.
What are some suggestions/best practices for speeding up the fetching of data in a Windows Phone 7 app? Is there anything I could be doing in the app to speed up the retrieval of data and put it in front of the user faster?
Sounds like you've already got some clues about what to chase.
Some basic things I'd try are:
Make your HTTP requests as small as possible - if possible, only fetch the entities and fields you absolutely need.
Consider using multiple HTTP requests to fetch the data incrementally instead of fetching everything in one go (this can, of course, actually make the app slower, but generally makes the app feel faster)
For large text transfers, make sure that the content is being zipped for transfer (this should happen at the HTTP level)
Be careful that the XAML rendering the data isn't too bloated - large XAML structure repeated in a list can cause slowness.
When optimising, never assume you know where the speed problem is - always measure first!
Be careful when inserting images into a list - the MS MarketPlace app often seems to stutter on my phone - and I think this is caused by the image fetch and render process.
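To illustrate the first point in the list above, OData's $select and $top query options restrict which fields and how many entities come back. The sketch below only builds the request URL; it is written in Python for brevity rather than in the app's own language, and the service URL, entity set, and field names are hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(base_url, fields, top):
    """Build an OData query URL that requests only some fields and rows."""
    query = {
        "$select": ",".join(fields),  # only the fields the list actually shows
        "$top": str(top),             # only the first page of entities
    }
    # Keep "$" and "," unescaped so the query options stay readable.
    return base_url + "?" + urlencode(query, safe="$,")


# Hypothetical service and entity set, for illustration only.
url = build_query_url("https://example.invalid/odata/Products", ["Id", "Name"], 20)
```

Fetching a first page this way and requesting further pages later also matches the suggestion to load data incrementally.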
In addition to Stuart's great list, also consider the format of the data that's sent.
Check out this blog post by Rob Tiffany. It discusses performance based on data formats. It was written specifically with WCF in mind but the points still apply.
As an extension to Stuart's list:
In fact there are 3 areas: communication, parsing, and UI. Measure them separately:
Do just the communication, with the processing switched off.
Measure the parsing of a fixed OData-formatted string.
Believe it or not, it can also be the UI.
For example, bad usage of ProgressBar can result in a dramatic decrease in processing speed. (In general you should not use any UI animations, as explained here.)
Also, make sure that the UI processing does not block the data communication.
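The "measure each area separately" advice can be sketched with a small timing helper. This is a language-neutral illustration written in Python, not Windows Phone code; the helper name and the stage labels are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


# Time one stage in isolation; summing numbers stands in for real parsing.
results = {}
with timed("parsing", results):
    sum(range(1000))
```

Wrapping the communication, parsing, and UI update steps in separate timed blocks shows which of the three dominates before any optimisation starts.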
I'm very curious to know how this process works. These sites (http://www.sharkscope.com and http://www.pokertableratings.com) data mine thousands of hands per day from secure poker networks, such as PokerStars and Full Tilt.
Do they have a farm of servers running applications that open hundreds of tables (windows) and then somehow spider/datamine the hands that are being played?
How does this work, programming wise?
There are a few options. I've been researching it since I wanted to implement some of this functionality in a web app I'm working on. I'll use PokerStars for example, since they have, by far, the best security of any online poker site.
First, realize that there is no way for a developer to rip real time information from the PokerStars application itself. You can't access the API. You can, though, do the following:
Screen Scraping/OCR
PokerStars does its best to sabotage screen/text scraping of their application (by doing simple things like pixel level color fluctuations) but with enough motivation you can easily get around this. Google AutoHotkey combined with ImageSearch.
API Access and XML Feeds
PokerStars doesn't offer public access to its API. But it does offer an XML feed to developers who are pre-approved. This XML feed offers:
PokerStars Site Summary - shows player, table, and tournament counts
PokerStars Current Tournament data - files with information about upcoming and active tournaments. The data is provided in two files:
PokerStars Static Tournament Data - provides tournament information that does not change frequently, and
PokerStars Dynamic Tournament Data - provides frequently changing tournament information
PokerStars Tournament Results - provides information about completed tournaments. The data is provided in two files:
PokerStars Tournament Results – provides basic information about completed tournaments, and
PokerStars Tournament Expanded Results – provides expanded information about completed tournaments.
PokerStars Tournament Leaders Board - provides information about top PokerStars players ranked using PokerStars Tournament Ranking System
PokerStars Tournament Leaders Board BOP - provides information about top PokerStars players ranked using PokerStars Battle Of Planets Ranking System
Team PokerStars – provides information about Team PokerStars players and their online activity
It's highly unlikely that these sites have access to the XML feed (or an improved one which would provide all the functionality they need) since PokerStars isn't exactly on good terms with most of these sites.
This leaves two options: scraping the network connection for said data, which I think is borderline impossible (I don't have experience with this, but I've heard the traffic is heavily encrypted and not easy to tinker with), and, as mentioned above, screen scraping/OCR.
Option #2 is easy enough to implement and, with some work, can avoid detection. From what I've been able to gather, this is the only way they could be doing such massive data mining of PokerStars (I haven't looked into other sites but I've heard security on anything besides PokerStars/Full Tilt is quite horrendous).
[edit]
Reread your question and realized I didn't unambiguously answer it.
Yes, they likely have a large number of servers watching all currently running tables, tournaments, etc. Realize that there is a decent amount of money in what they're doing.
This, for instance, could be how they do it (speculation):
Said bot applications watch the tables and data mine all information that gets "posted" to the chat log. They do this by already having a table of images that correspond to, for example, all letters of the alphabet (since PokerStars doesn't post their text as... text. All text in their software is actually an image). So, the bot then rips an image of the chat log, matches it against the store, converts the data to a format they can work with, and throws it in a database. Done.
[edit]
No, the data isn't sold to them by the poker sites themselves. This would be a PR nightmare if it ever got out, which it would. And it wouldn't account for the functionality of these sites (OPR, Sharkscope, etc.), which appears to be instantaneous. There are, without a doubt, applications running that rip the data in real time from the poker software, likely using the methods I listed.
Maybe I can help.
I play poker, run a HUD, look at the stats and am a software developer.
I've seen a few posts on this suggesting it's done by OCR software grabbing the screen. Well, that's really difficult and processor hungry, so a programmer wouldn't choose to do that unless there were no other options.
Also, because you can open multiple windows, the poker window can be hidden or partially obscured by other things on the screen, so you couldn't guarantee to be able to capture the screen.
In short, they read the log files that are output by the poker software.
When you install a HUD like Sharkscope or Jivaro, it runs client software on your PC. It reads the log files and updates its own servers with every hand you play.
Most poker software is similar, but let's start with PokerStars, as that's where I play. The poker software writes to local log files for every action you or it makes. It shows your cards, any opponents' cards that you see, and what you do, e.g. which button you pressed and how much you or they bet. It posts these updates in near real time and timestamps the log file.
You can look at your own files to see this in action.
On a PC do this (not sure what you do on a Mac, but will be similar)
1. Load File Explorer
2. Select VIEW from the menu
3. Select HIDDEN ITEMS so that you can see the hidden data files
4. Go to C:\Users\Dave\AppData\Local\PokerStars.UK (you may not be called DAVE...)
5. Open the PokerStars.log.0 file in NOTEPAD
6. In Notepad, SEARCH for updateMyCard
7. It will show your card numerically
3c for 3 of Clubs
14d for Ace of Diamonds
You can see your opponents' cards only where you saw them at the table.
Here are a few example lines from the log file.
OnTableData() round -2
:::TableViewImpl::updateMyCard() 8s (0) [2A0498]
:::TableViewImpl::updateMyCard() 13h (1) [2A0498]
:::TableViewImpl::updatePlayerCard() 7s (0) [2A0498]
:::TableViewImpl::updatePlayerCard() 14s (1) [2A0498]
[2015/12/13 12:19:34]
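Lines like the ones above can be decoded with a short script. This is a minimal sketch based only on the excerpt shown: the regular expression is inferred from those few lines, and mapping 11, 12 and 13 to Jack, Queen and King is an assumption (only 14 for Ace is stated above).

```python
import re

# Pattern inferred from the example lines above; not an official format.
CARD_LINE = re.compile(r"update(My|Player)Card\(\)\s+(\d{1,2})([cdhs])\b")

# 14 = Ace comes from the post; 11-13 are assumed to be the other face cards.
RANK_NAMES = {11: "Jack", 12: "Queen", 13: "King", 14: "Ace"}
SUIT_NAMES = {"c": "Clubs", "d": "Diamonds", "h": "Hearts", "s": "Spades"}


def parse_card_line(line):
    """Return (owner, card description) for a card line, or None otherwise."""
    match = CARD_LINE.search(line)
    if match is None:
        return None
    owner = "me" if match.group(1) == "My" else "player"
    rank = int(match.group(2))
    suit = match.group(3)
    return owner, f"{RANK_NAMES.get(rank, rank)} of {SUIT_NAMES[suit]}"


# One of the example lines from the excerpt above.
decoded = parse_card_line(":::TableViewImpl::updateMyCard() 13h (1) [2A0498]")
```

Lines that do not describe a card, such as the timestamp line, simply return None.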
cheers, hope this helps
Dave
I've thought about this, and have two theories:
The "sniffer" sites have every table open, AND:
Are able to pull the hand data from the network stream. (or:)
Are obtaining the hand data from the GUI (screen scraping, pulling stuff out via the GUI API).
Alternately, they may have developed/modified clients to log everything for them, but I think one of the above solutions is likely simpler.
Well, they have two choices:
they spider/grab the data without consent. Then they risk being shut down at any time, since the poker site can easily detect monitoring at this scale and block it. They also risk a lawsuit for breach of the terms of service, which probably disallow the use of robots.
they pay for getting the data directly. This saves a lot of bandwidth (e.g. not having to load the full pages, extraction, updates with html changes etc.) and makes their business much less risky (legally and technically).
Guess which one they more likely chose; at least if the site has been around for some time without being shut down every now and then.
I'm not sure how it works, but I have an application id and a key, which you get as a gold or silver subscriber. Sign up for a month and send them an email, and you will get access and the API documentation.