How exactly do Sharkscope or PTR data mine all those hands? - data-mining

I'm very curious to know how this process works. These sites (http://www.sharkscope.com and http://www.pokertableratings.com) data mine thousands of hands per day from secure poker networks, such as PokerStars and Full Tilt.
Do they have a farm of servers running applications that open hundreds of tables (windows) and then somehow spider/datamine the hands that are being played?
How does this work, programming wise?

There are a few options. I've been researching this since I wanted to implement some of this functionality in a web app I'm working on. I'll use PokerStars as my example, since they have, by far, the best security of any online poker site.
First, realize that there is no way for a developer to rip real-time information from the PokerStars application itself. You can't access the API. You can, though, do the following:
Screen Scraping/OCR
PokerStars does its best to sabotage screen/text scraping of its application (through simple tricks like pixel-level color fluctuations), but with enough motivation you can easily get around this. Google AutoHotkey combined with ImageSearch.
API Access and XML Feeds
PokerStars doesn't offer public access to its API. But it does offer an XML feed to developers who are pre-approved. This XML feed offers:
PokerStars Site Summary - shows player, table, and tournament counts
PokerStars Current Tournament data - files with information about upcoming and active tournaments. The data is provided in two files:
PokerStars Static Tournament Data - provides tournament information that does not change frequently, and
PokerStars Dynamic Tournament Data - provides frequently changing tournament information
PokerStars Tournament Results - provides information about completed tournaments. The data is provided in two files:
PokerStars Tournament Results – provides basic information about completed tournaments, and
PokerStars Tournament Expanded Results – provides expanded information about completed tournaments.
PokerStars Tournament Leaders Board - provides information about top PokerStars players ranked using PokerStars Tournament Ranking System
PokerStars Tournament Leaders Board BOP - provides information about top PokerStars players ranked using PokerStars Battle Of Planets Ranking System
Team PokerStars – provides information about Team PokerStars players and their online activity
It's highly unlikely that these sites have access to the XML feed (or an improved one that would provide all the functionality they need), since PokerStars isn't exactly on good terms with most of them.
This leaves two options: sniffing the network connection for said data, which I suspect is borderline impossible (I have no experience with this, but I've heard the traffic is heavily encrypted and not easy to tinker with), and, as mentioned above, screen scraping/OCR.
The screen-scraping option is easy enough to implement and, with some work, can avoid detection. From what I've been able to gather, this is the only way they could be doing such massive data mining of PokerStars (I haven't looked into other sites, but I've heard security anywhere besides PokerStars/Full Tilt is quite horrendous).
[edit]
I reread your question and realized I didn't unambiguously answer it.
Yes, they likely have a massive number of servers watching all currently running tables, tournaments, etc. Realize that there is a decent amount of money in what they're doing.
This, for instance, could be how they do it (speculation):
Said bot applications watch the tables and mine all the information that gets "posted" to the chat log. They do this by keeping a table of images that correspond to, for example, every letter of the alphabet (since PokerStars doesn't render its text as... text; all text in the software is actually an image). The bot rips an image of the chat log, matches it against the store, converts the data into a format they can work with, and throws it in a database. Done.
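As a rough, speculative illustration of that glyph-matching idea, here is a minimal C++ sketch using OpenCV's template matching. The file names and the 0.8 threshold are made up for the example; nothing here is confirmed to be what these sites actually run.

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    // chat.png: a captured screenshot of the chat-log region.
    // glyph_A.png: a pre-cut reference image of the letter 'A'
    // (one such image per character in the "store" described above).
    cv::Mat chat = cv::imread("chat.png", cv::IMREAD_GRAYSCALE);
    cv::Mat glyph = cv::imread("glyph_A.png", cv::IMREAD_GRAYSCALE);

    // Slide the glyph over the screenshot and score each position.
    cv::Mat scores;
    cv::matchTemplate(chat, glyph, scores, cv::TM_CCOEFF_NORMED);

    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(scores, &minVal, &maxVal, &minLoc, &maxLoc);

    // Normalised correlation tolerates small pixel-level colour jitter,
    // which is what defeats the anti-scraping fluctuations mentioned earlier.
    if (maxVal > 0.8)
        std::cout << "'A' found at (" << maxLoc.x << ", " << maxLoc.y << ")\n";
    return 0;
}

Repeating the match for every glyph image and sorting the hits by position reconstructs the chat text, which can then be parsed into hands and stored.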
[edit]
No, the data isn't sold to them by the poker sites themselves. That would be a PR nightmare if it ever got out, which it would. And it wouldn't account for the functionality of these sites (OPR, Sharkscope, etc.), which appears to be instantaneous. There are, without a doubt, applications running that rip the data in real time from the poker software, likely using the methods I listed.

Maybe I can help.
I play poker, run a HUD, look at the stats, and am a software developer.
I've seen a few posts on this suggesting it's done by OCR software grabbing the screen. Well, that's really difficult and processor-hungry, so a programmer wouldn't choose to do it unless there were no other options.
Also, because you can open multiple windows, the poker window can be hidden or partially obscured by other things on the screen, so you couldn't guarantee being able to capture it.
In short, they read the log files that are output by the poker software.
When you install a HUD like Sharkscope or Jivaro, it runs client software on your PC. That software reads the log files and updates the vendor's servers with every hand you play.
Most poker software is similar, but let's start with PokerStars, as that's where I play. The software writes to local log files for every action you or it makes. It shows your cards, any opponents' cards that you see, plus what you do, e.g. which button you pressed and how much you/they bet. It posts these updates in near real time and timestamps the log file.
You can look at your own files to see this in action.
On a PC, do this (not sure what you do on a Mac, but it will be similar):
1. Load File Explorer
2. Select VIEW from the menu
3. Select HIDDEN ITEMS so that you can see the hidden data files
4. Go to C:\Users\Dave\AppData\Local\PokerStars.UK (you may not be called DAVE...)
5. Open the PokerStars.log.0 file in NOTEPAD
6. In Notepad, SEARCH for updateMyCard
7. It will show your cards numerically:
3c for 3 of Clubs
14d for Ace of Diamonds
You can see your opponents cards only where you saw them at the table.
Here are a few example lines from the log file.
OnTableData() round -2
:::TableViewImpl::updateMyCard() 8s (0) [2A0498]
:::TableViewImpl::updateMyCard() 13h (1) [2A0498]
:::TableViewImpl::updatePlayerCard() 7s (0) [2A0498]
:::TableViewImpl::updatePlayerCard() 14s (1) [2A0498]
[2015/12/13 12:19:34]
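Here is a minimal sketch of how a HUD-style client might scan that log for card updates. The path is the example one from the steps above; the rank/suit decoding follows the 3c / 14d convention described earlier, and everything else is illustrative.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Ranks 2-10 are literal; 11 = Jack, 12 = Queen, 13 = King, 14 = Ace.
std::string rankName(int rank) {
    switch (rank) {
        case 11: return "Jack";
        case 12: return "Queen";
        case 13: return "King";
        case 14: return "Ace";
        default: return std::to_string(rank);
    }
}

std::string suitName(char suit) {
    switch (suit) {
        case 'c': return "Clubs";
        case 'd': return "Diamonds";
        case 'h': return "Hearts";
        case 's': return "Spades";
        default:  return "?";
    }
}

int main() {
    // Example path from the steps above; adjust the user name for your install.
    std::ifstream log(R"(C:\Users\Dave\AppData\Local\PokerStars.UK\PokerStars.log.0)");
    std::string line;
    while (std::getline(log, line)) {
        std::size_t pos = line.find("updateMyCard()");   // your own hole cards
        if (pos == std::string::npos)
            pos = line.find("updatePlayerCard()");       // opponents' shown cards
        if (pos == std::string::npos)
            continue;
        // The token after the ')' looks like "8s" or "14d".
        std::istringstream rest(line.substr(line.find(')', pos) + 1));
        std::string card;
        rest >> card;
        int rank = std::stoi(card);      // stoi stops at the suit letter
        char suit = card.back();
        std::cout << rankName(rank) << " of " << suitName(suit) << '\n';
    }
    return 0;
}

A real tracker would tail the file as it grows and forward the parsed hands to a server, but the extraction itself is about this straightforward.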
cheers, hope this helps
Dave

I've thought about this, and have two theories:
The "sniffer" sites have every table open, AND:
Are able to pull the hand data from the network stream. (or:)
Are obtaining the hand data from the GUI (screen scraping, pulling stuff out via the GUI API).
Alternately, they may have developed/modified clients to log everything for them, but I think one of the above solutions is likely simpler.

Well, they have two choices:
They spider/grab the data without consent. Then they risk being shut down at any time: the poker site can easily detect such monitoring at this scale and block it, and they even risk a lawsuit for breach of the terms of service, which probably disallow the use of robots.
They pay to get the data directly. This saves a lot of bandwidth (e.g. not having to load full pages, no extraction, no updates when the HTML changes) and makes their business much less risky, both legally and technically.
Guess which one they more likely chose; at least if the site has been around for some time without being shut down every now and then.

I'm not sure how it works internally, but I have an application ID and a key, which you get as a gold or silver subscriber. Sign up for a month and send them an email, and you will get access and the API documentation.

Related

How did these big companies start from these three main points?

So this is something that I've been thinking about lately, and it basically is: how did big music web apps or websites like Spotify, YouTube, or Anghami (if you know that one) start? I was actually thinking about three things. First: how did they get these huge music libraries? Second: did each of those big companies need to buy special servers to hold the website data and music library, and if so, how much does such a server cost? And third: how did they sort out copyright with all of these creators, authors, publishers, or whatever they're called, the copyright owners in this case?
1. Tracks are uploaded by the artists/creators. I'd imagine pre-release Spotify would have had a library already put together by working with the artists.
2. Yes. They cost a lot. There are hundreds of millions of users and terabytes upon terabytes of data, spread around the world. Server costs will be in the millions. Starting out, the upfront cost to set up infrastructure would be very high too.
3. This is definitely not the place to ask this kind of question. I would Google information on how copyright deals with artists usually work.

Can I create a somewhat complex Mechanical Turk HIT without much programming experience?

I have a task that seems well-suited to MTurk. I've never used the service before, however, and despite reading through some of the documentation, I'm having a difficult time judging how hard it would be to set up a task. I'm a strong beginner or weak intermediate in R. I've messed around with a project that involved a little understanding of XML. Otherwise, I have no programming or web development skills (I'm a statistician/epidemiologist). I'm hoping someone can give me an idea of what would be involved in creating my task so I can decide if it is worth the effort to learn how to create a HIT.
Essentially, I have recurring projects that require many graphs to be digitized (i.e. converted from images to x,y coordinates). The automatic digitization software that I've tried isn't great for this task because some of the graphs are from old journal articles with gray-scale lines that cross each other multiple times. Figuring out which line is which requires a little human judgement. The workflow for the HIT would have each MTurker:
Download a properly named empty Excel workbook.
Download a JPEG of the graphs.
Download a free plot digitization program.
Open the graph in the plot digitization software, calibrate the axes, trace the outline of each curve, paste the coordinates into the corresponding Excel workbook that I have given them, extract some numbers off the graph into a second sheet of the same workbook.
Send me the Excel files.
I'd have these done in duplicate to make sure that there is acceptable agreement between the two Mturkers who did each graph.
Is this a reasonable task to accomplish via Mechanical Turk? If so, can a somewhat intelligent person who isn't a programmer/web developer pull it off? I've poked around the internet a bit but I still can't tell if I just haven't found the right resource to teach me how to do this or if I'd need 5 years of experience as a web developer to pull it off. Thanks.
No, this really isn't a task for Mechanical Turk at all. Not only are you requiring workers to download a bunch of stuff, which they won't do, but it's way too complex for them to have confidence that they are doing it right and will get paid. Pay is binary, so a worker could go through all of that for nothing.
You are also probably violating the terms of service if workers have to divulge personal info to use the programs.
If you have a continuous need for this, then MAYBE you can prequalify people by creating a qualification on the service and then using just those workers.

C++ Usage Statistics

I have made an application in C++ and would like to know how to go about implementing a usage statistics system so that I may gather some data regarding how users use the program.
E.g. IP address, number of hours spent in the application, and OS used.
In theory I know I can code this myself if I must, but I was wondering if there is a framework available to make this easier. Unfortunately, I was unable to find anything on Google.
Though there is no framework of that kind, you can reduce the work needed to retrieve all this information by using some approaches and techniques, which I have tried to describe below. Please, anybody, feel free to correct me.
Let's summarise what groups of information we need to complete the task:
User Environment Information. I suggest you look at Microsoft's WMI infrastructure, in particular the WMI classes: Desktop, File System, Networking, etc. Using these classes in your application can help you retrieve almost every kind of system information (a minimal WMI query sketch follows after this list). If that doesn't satisfy you, see #2.
Application and System Performance. Under these terms I mean overall system performance, processor count, processes running in the OS, etc. To retrieve this data you can use the NtQuerySystemInformation function. With its help you can access detailed SystemProcessInformation and SystemProcessorPerformanceInformation (per-processor performance) data, and much more.
User Related Information. It's hard to find a framework to do such things, so I suggest you simply start writing code, having in mind your requirements:
counting how many times each button was pressed, each text field was changed, etc.
measuring the delay between consecutive actions in predefined sequences (for example, if you have a settings GUI form and expect the user to fill in all required text fields quickly, time-delay measurements can tell you whether the user acted as expected or stalled after TextBox2 for five minutes).
anything else that could be of interest to you.
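For item 1, here is a minimal WMI query in C++, using the standard COM boilerplate from the WMI documentation. Error handling is omitted for brevity; a real program should check every HRESULT.

#include <comdef.h>
#include <wbemidl.h>
#include <iostream>
#pragma comment(lib, "wbemuuid.lib")

int main() {
    // Initialise COM and connect to the local WMI namespace.
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    CoInitializeSecurity(nullptr, -1, nullptr, nullptr,
        RPC_C_AUTHN_LEVEL_DEFAULT, RPC_C_IMP_LEVEL_IMPERSONATE,
        nullptr, EOAC_NONE, nullptr);

    IWbemLocator* locator = nullptr;
    CoCreateInstance(CLSID_WbemLocator, nullptr, CLSCTX_INPROC_SERVER,
        IID_IWbemLocator, reinterpret_cast<void**>(&locator));

    IWbemServices* services = nullptr;
    locator->ConnectServer(_bstr_t(L"ROOT\\CIMV2"), nullptr, nullptr,
        nullptr, 0, nullptr, nullptr, &services);
    CoSetProxyBlanket(services, RPC_C_AUTHN_WINNT, RPC_C_AUTHZ_NONE, nullptr,
        RPC_C_AUTHN_LEVEL_CALL, RPC_C_IMP_LEVEL_IMPERSONATE, nullptr, EOAC_NONE);

    // Ask WMI for the OS name; other classes (Win32_Processor,
    // Win32_LogicalDisk, ...) are queried the same way.
    IEnumWbemClassObject* results = nullptr;
    services->ExecQuery(_bstr_t(L"WQL"),
        _bstr_t(L"SELECT Caption FROM Win32_OperatingSystem"),
        WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY, nullptr, &results);

    IWbemClassObject* row = nullptr;
    ULONG returned = 0;
    while (results->Next(WBEM_INFINITE, 1, &row, &returned) == S_OK && returned) {
        VARIANT value;
        row->Get(L"Caption", 0, &value, nullptr, nullptr);
        std::wcout << value.bstrVal << L"\n";   // e.g. the OS product name
        VariantClear(&value);
        row->Release();
    }

    results->Release();
    services->Release();
    locator->Release();
    CoUninitialize();
    return 0;
}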
So, how could you implement the requirements of the last item (User Related Information)? Personally, I'd do something like the following (some of it may seem very hard to implement or pointless):
- creating a kind of base Counter class and deriving some controls from it (buttons, edits, etc.).
- using Windows hooks for the mouse or keyboard while getting a child handle (to recognize a control, for example).
- using a Callback class which can do all the "dirty" work (counting, measuring, performing additional actions).
You could store all this information in a text file, an SQLite database, or wherever you prefer.
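A bare-bones sketch of the counting idea: UsageTracker and the file format are made up for illustration, and you would call trackEvent() from your button/edit handlers or hook procedures.

#include <chrono>
#include <fstream>
#include <map>
#include <string>

class UsageTracker {
public:
    // Call from each control's handler, e.g. trackEvent("OkButton.Click").
    void trackEvent(const std::string& name) { ++counts_[name]; }

    // Milliseconds since the previous event, for the delay measurements
    // described above (e.g. how long the user stalled after TextBox2).
    long long msSinceLastEvent() {
        auto now = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      now - last_).count();
        last_ = now;
        return ms;
    }

    // Dump counts to a tab-separated text file on exit;
    // swap in SQLite if you prefer a database.
    void flush(const std::string& path) const {
        std::ofstream out(path);
        for (const auto& entry : counts_)
            out << entry.first << '\t' << entry.second << '\n';
    }

private:
    std::map<std::string, long long> counts_;
    std::chrono::steady_clock::time_point last_ =
        std::chrono::steady_clock::now();
};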
I would recommend taking a look at DeskMetrics. This StackOverflow post summarizes the issue.
Building your own framework could take months of development (apart from maintenance). With something like Trackerbird Software Analytics you can integrate a DLL with your app, start tracking within 30 minutes, and get all the cool real-time visualizations.
Disclaimer: I am affiliated with the company.

Audio Subtitle Transcription - C++

I'm on a project that among other video related tasks should eventually be capable of extracting the audio of a video and apply some kind of speech recognition to it and get a transcribed text of what's said on the video. Ideally it should output some kind of subtitle format so that the text is linked to a certain point on the video.
I was thinking of using the Microsoft Speech API (aka SAPI), but from what I could see it is rather difficult to use. The very few examples I found for speech recognition (most are for text-to-speech, which is much easier) didn't perform very well (they don't recognize a thing). For example, this one: http://msdn.microsoft.com/en-us/library/ms717071%28v=vs.85%29.aspx
Some examples use something called grammar files that are supposed to define the words the recognizer is waiting for, but since I haven't trained Windows Speech Recognition thoroughly, I think that might be adulterating the results.
So my question is... what's the best tool for something like this? Could you list both paid and free options? I believe the best "free" option (as it comes with Windows) is SAPI; everything else would be paid, but if a tool is really good it might be worth it. Also, if you have any good tutorials for using SAPI (or another API) in a context similar to this, that would be great.
On the whole this is a big ask!
The issue with any speech recognition system is that it functions best after training. It needs context (what words to expect) and some kind of audio benchmark (what each voice sounds like). This might be possible in some cases, such as a TV series, if you wanted to churn through hours of speech (separated for each character) to train it. There's a lot of work there, though. For something like a film there's probably no hope of training a recogniser unless you can get hold of the actors.
Most film and TV production companies just hire media companies to transcribe the subtitles based on either direct transcription using a human operator, or converting the script. The fact that they still need humans in the loop for these huge operations suggests that automated systems just aren't up to it yet.
In video you have a plethora of things that make your life difficult, pretty much spanning huge swathes of current speech technology research:
-> Multiple speakers -> "Speaker Identification" (can you tell characters apart? Also, subtitles normally have different coloured text for different speakers)
-> Multiple simultaneous speakers -> The "cocktail party problem" - can you separate the two voice components and transcribe both?
-> Background noise -> Can you pick the speech out from any soundtrack/foley/exploding helicopters.
The speech algorithm will need to be extremely robust as different characters can have different gender/accents/emotion. From what I understand of the current state of recognition you might be able to get a single speaker after some training, but asking a single program to nail all of them might be tough!
--
There is no "subtitle" format that I'm aware of. I would suggest saving an image of the text using a font like Tiresias Screenfont that's specifically designed for legibility in these circumstances, and use a lookup table to cross-reference images against video timecode (remembering NTSC/PAL/Cinema use different timing formats).
--
There's a bunch of proprietary speech recognition systems out there. If you want the best you'll probably want to license a solution off one of the big boys like Nuance. If you want to keep things free the universities of RWTH and CMU have put some solutions together. I have no idea how good they are or how well they might be suited to the problem.
--
The only solution I can think of similar to what you're aiming at is the subtitling you can get on news channels here in the UK "Live Closed Captioning". Since it's live, I assume they use some kind of speech recognition system trained to the reader (although it might not be trained, I'm not sure). It's got better over the past few years, but on the whole it's still pretty poor. The biggest thing it seems to struggle with is speed. Dialogue is normally really fast, so live subtitling has the extra issue of getting everything done in time. Live closed captions quite frequently get left behind and have to miss a lot of content out to catch up.
Whether you have to deal with this depends on whether you'll be subtitling "live" video or if you can pre-process it. To deal with all the additional complications above I assume you'll need to pre-process it.
--
As much as I hate citing the big W there's a goldmine of useful links here!
Good luck :)
This falls into the category of dictation, which is a very large vocabulary task. Products like Dragon NaturallySpeaking are amazingly good, and Dragon has a SAPI interface for developers. But it's not such a simple problem.
Normally a dictation product is meant to be single speaker and the best products adapt automatically to that speaker, thereby improving the underlying acoustic model. They also have sophisticated language modeling which serves to constrain the problem at any given moment by limiting what is known as the perplexity of the vocabulary. That's a fancy way of saying the system is figuring out what you're talking about and therefore what types of words and phrases are likely or not likely to come next.
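For reference, perplexity has a standard textbook definition that makes this precise (this is general background, not something specific to these products). For a test sequence of words w_1 ... w_N:

% Perplexity of a language model P over a test sequence w_1 ... w_N.
% Lower perplexity = the model finds the next word more predictable,
% which is exactly the constraint described above.
\[
  \mathrm{PP}(w_1,\dots,w_N)
    = P(w_1,\dots,w_N)^{-1/N}
    = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i \mid w_1,\dots,w_{i-1})}
\]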
It would be interesting though to apply a really good dictation system to your recordings and see how well it does. My suggestion for a paid system would be to get Dragon Naturally Speaking from Nuance and get the developer API. I believe that provides a SAPI interface, which has the benefit of allowing you to swap in the Microsoft speech or any other ASR engine that supports SAPI. IBM would be another vendor to look at but I don't think you will do much better than Dragon.
But it won't work well! After all the work of integrating the ASR engine, what you will probably find is that you get a pretty high error rate (maybe half). That would be due to a few major challenges in this task:
1) multiple speakers, which will degrade the acoustic model and adaptation.
2) background music and sound effects.
3) mixed speech - people talking over each other.
4) lack of a good language model for the task.
For 1), if you had a way of separating each actor onto a separate track, that would be ideal. But there's no reliable way of separating speakers automatically that would be good enough for a speech recognizer. If each speaker were at a distinctly different pitch, you could try pitch detection (there's some free software out there for that) and separate based on that, but this is a sophisticated and error-prone task. The best approach would be hand-editing the speakers apart, but you might as well just manually transcribe the speech at that point! If you could get the actors on separate tracks, you would need to run the ASR using different user profiles.
For music (2), you'd either have to hope for the best or try to filter it out. Speech is more band-limited than music, so you could try a bandpass filter that attenuates everything except the voice band. You would want to experiment with the cutoffs, but I would guess 100 Hz to 2-3 kHz would keep the speech intelligible.
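A crude sketch of that idea on raw float samples: a one-pole high-pass at 100 Hz cascaded with a one-pole low-pass at 3 kHz. The cutoffs are the guesses from above, and a serious implementation would use proper biquad sections instead.

#include <cstddef>
#include <vector>

// Crude voice-band filter: attenuate below ~100 Hz and above ~3 kHz.
std::vector<float> voiceBandpass(const std::vector<float>& in, float sampleRate) {
    const float kPi = 3.14159265f;
    const float lowCut = 100.0f, highCut = 3000.0f;
    const float dt = 1.0f / sampleRate;
    const float rcHp = 1.0f / (2.0f * kPi * lowCut);
    const float rcLp = 1.0f / (2.0f * kPi * highCut);
    const float aHp = rcHp / (rcHp + dt);   // one-pole high-pass coefficient
    const float aLp = dt / (rcLp + dt);     // one-pole low-pass coefficient

    std::vector<float> out(in.size());
    float hp = 0.0f, prevIn = 0.0f, lp = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        hp = aHp * (hp + in[i] - prevIn);   // remove rumble and music bass
        prevIn = in[i];
        lp += aLp * (hp - lp);              // remove energy above the voice band
        out[i] = lp;
    }
    return out;
}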
For (3), there's no solution. The ASR engine should return confidence scores, so at best you could tag the low-scoring segments and then go back and manually transcribe those bits of speech.
(4) is a sophisticated task for a speech scientist. Your best bet would be to search for an existing language model made for the topic of the movie. Talk to Nuance or IBM, actually. Maybe they could point you in the right direction.
Hope this helps.

Windows Phone 7 - Best Practices for Speeding up Data Fetch

I have a Windows Phone 7 app that (currently) calls an OData service to get data, and throws the data into a listbox. It is horribly slow right now. The first thing I can think of is that OData returns way more data than I actually need.
What are some suggestions/best practices for speeding up the fetching of data in a Windows Phone 7 app? Is there anything I could be doing in the app to speed up the retrieval of data and put it in front of the user faster?
Sounds like you've already got some clues about what to chase.
Some basic things I'd try are:
Make your HTTP requests as small as possible - if possible, only fetch the entities and fields you absolutely need.
Consider using multiple HTTP requests to fetch the data incrementally instead of fetching everything in one go (this can, of course, actually make the app slower, but generally makes the app feel faster)
For large text transfers, make sure that the content is being zipped for transfer (this should happen at the HTTP level; see the sketch after this list).
Be careful that the XAML rendering the data isn't too bloated - large XAML structure repeated in a list can cause slowness.
When optimising, never assume you know where the speed problem is - always measure first!
Be careful when inserting images into a list - the MS Marketplace app often seems to stutter on my phone, and I think this is caused by the image fetch and render process.
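WP7 itself uses the .NET HTTP stack, but to make the "small requests, compressed transfer" points concrete in a language-neutral way, here is a libcurl sketch in C++. The OData URL is hypothetical; $select and $top are standard OData query options that trim the payload server-side.

#include <curl/curl.h>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (curl) {
        // Hypothetical service: fetch only two fields of the first 20 items
        // instead of every field of every entity.
        curl_easy_setopt(curl, CURLOPT_URL,
            "https://example.com/MyService.svc/Items?$select=Id,Name&$top=20");
        // An empty string tells libcurl to advertise every encoding it
        // supports (gzip, deflate) and decompress the response transparently.
        curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "");
        curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return 0;
}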
In addition to Stuart's great list, also consider the format of the data that's sent.
Check out this blog post by Rob Tiffany. It discusses performance based on data formats. It was written specifically with WCF in mind but the points still apply.
As an extension to Stuart's list:
In fact there are 3 areas - communication, parsing, UI. Measure them separately:
Do just the communication with the processing switched off.
Measure parsing of a fixed OData-formatted string.
Believe it or not, it can also be the UI.
For example, bad usage of a ProgressBar can result in a dramatic decrease in processing speed. (In general you should not use any UI animations, as explained here.)
Also, make sure that the UI processing does not block the data communication.