Not getting expected results from complex view - mapreduce

This is a somewhat involved question as the data I am working with is a little large.
I have the following document structure: https://gist.github.com/gaigepr/5b28a7c67ced0cd71e4e
and the following map function: https://gist.github.com/gaigepr/a721bcc8ef6f681f3807
A little description: this function goes through the example document and collects every combination of 1 to 5 characters, emitting each with a 1 or 0 to indicate a win or a loss for that particular combination of characters. This is accomplished by taking the powerset of the team, ignoring the empty set, and emitting the array key along with the win/loss integer.
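For illustration, here is a rough sketch of that powerset idea, written in C++ rather than the CouchDB JavaScript of the actual map function in the gist; the team IDs here are made up:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Emit every non-empty subset of a team's character IDs with 1 (win) or 0 (loss).
void emitSubsets(const std::vector<int> &team, int winOrLoss) {
    // Each bitmask from 1 to 2^n - 1 selects one non-empty subset.
    for (unsigned mask = 1; mask < (1u << team.size()); ++mask) {
        std::vector<int> combo;
        for (std::size_t i = 0; i < team.size(); ++i)
            if (mask & (1u << i))
                combo.push_back(team[i]);
        // In the real view this would be emit(combo, winOrLoss); printed here instead.
        for (int id : combo)
            std::cout << id << ' ';
        std::cout << "-> " << winOrLoss << '\n';
    }
}

int main() {
    emitSubsets({1, 18, 23}, 1);   // hypothetical winning team of three characters
}
```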
The problem I am having is with reducing the data. My goal is to get the win rate of a particular group of characters in the game this data is from. The view takes a key formatted as such: [1] and should output the win rate and the number of games played by that group of characters.
So my reduce function should be something like this:
However, when I do this I do not actually get all the games played by that pair in the reduction. In my test database there are 96 games played by the above pair [1, 18], but when I run the map and reduce with that key, I get that there were only 2 games played and null for the win rate.
A note: This seems to only happen inconsistently. With my data, when I query with the key [1, 18] I get accurate results.
I am a little bit at a loss for what to do to debug this and would appreciate some help. I am happy to add more details, gists, even pictures of the futon output if that would be helpful.
I do not have much evidence or confirmation for this yet, but it seems that the data passed to the reduce function is not formatted the way I expect, and I am not sure why that is.

Related

Artificial Neural Network with large inputs & outputs

I've been following Dave Miller's ANN C++ Tutorial, and I've been having some problems getting it to function as expected.
You can view the code I'm working with here. It's an XCode project, but includes the main.cpp and data set file.
Previously, this program would only give outputs between -1 and 1, I presume due to the use of the tanh function. I've manipulated the data so that I can feed in my much larger input values and still get valid outputs. I've done this simply by multiplying the input values by 0.0001 and multiplying the output values by 10000.
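For what it's worth, that scaling can be isolated into two tiny helpers so it is applied consistently on the way in and the way out; a sketch using the question's factors (whether those factors actually fit the data is the open question):

```cpp
#include <vector>

// Shrink raw inputs into a range tanh can represent, and blow the net's
// outputs back up afterwards. 0.0001 and 10000 are the factors from the question.
std::vector<double> scaleInputs(const std::vector<double> &raw) {
    std::vector<double> scaled;
    scaled.reserve(raw.size());
    for (double v : raw)
        scaled.push_back(v * 0.0001);   // into roughly [-1, 1]
    return scaled;
}

double unscaleOutput(double netOutput) {
    return netOutput * 10000.0;         // back into the original units
}
```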
The training data I'm using is the included CSV file. The last column is the expected output, the rest are inputs. Am I using the wrong mathematical function for these data?
Would you say that this is actually learning? This whole thing has stressed me out so much; I understand the theory behind ANNs but just can't implement it from scratch myself.
The net recent average error definitely gets smaller and smaller, which to me would say it is learning.
I'm sorry if I haven't explained myself very well; I'm very new to ANNs and this whole thing is very confusing to me. My university lecturers are useless when it comes to the practical side; they only teach us the theory of it.
I've been playing around with the eta and alpha values, along with the number of hidden layers.
You explained yourself quite well. If the net recent average error is getting lower and lower, it probably means that the network is actually learning, but here is my suggestion for how to be completely sure.
Take your CSV file and split it into two files: one should contain about 10% of all the data and the other the remaining 90%.
Start with an untrained network, run your 10% file through the net, and for each line save the difference between the actual output and the expected result.
Then train the network only on the 90% of the CSV file, and finally run the first 10% file through the net again and compare the differences you had on the first run with the latest ones.
You should find that the new results are much closer to the expected values than the first time, and this would be the final proof that your network is learning.
Does this make any sense? If not, please share some code or send me a link to the exercise you are running and I will try to explain it in code.
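A minimal sketch of that check, assuming a Net class with feedForward/backProp/getResults methods like the one in Dave Miller's tutorial (the 90/10 split point and the single training pass are simplifications):

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

struct Sample { std::vector<double> inputs; std::vector<double> targets; };

// Mean absolute error of the net over a set of samples.
template <typename Net>
double meanError(Net &net, const std::vector<Sample> &samples) {
    double total = 0.0;
    for (const Sample &s : samples) {
        net.feedForward(s.inputs);
        std::vector<double> out;
        net.getResults(out);
        for (std::size_t i = 0; i < out.size(); ++i)
            total += std::fabs(out[i] - s.targets[i]);
    }
    return total / samples.size();
}

template <typename Net>
void holdOutCheck(Net &net, const std::vector<Sample> &all) {
    // Hold out roughly the last 10% as a test set, train on the rest.
    std::size_t cut = all.size() * 9 / 10;
    std::vector<Sample> train(all.begin(), all.begin() + cut);
    std::vector<Sample> test(all.begin() + cut, all.end());

    double before = meanError(net, test);   // error of the untrained net
    for (const Sample &s : train) {         // one training pass over the 90%
        net.feedForward(s.inputs);
        net.backProp(s.targets);
    }
    double after = meanError(net, test);    // error on the same 10% after training
    std::cout << "test error before: " << before << ", after: " << after << "\n";
}
```

If the network is learning, the error on the held-out 10% should drop clearly between the two measurements.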

Neural Network seems to work fine until used for processing data (all of the results are practically the same)

I have recently implemented a typical 3 layer neural network (input -> hidden -> output) and I'm using the sigmoid function for activation. So far, the host program has 3 modes:
Creation, which seems to work fine. It creates a network with a specified number of input, hidden and output neurons, and initializes the weights to either random values or zero.
Training, which loads a dataset, computes the output of the network then backpropagates the error and updates the weights. As far as I can tell, this works ok. The weights change, but not extremely, after training on the dataset.
Processing, which seems to work ok. However, the output for the dataset which was used for training, or any other dataset for that matter, is very bad. It's usually either just a continuous stream of 1's with an occasional 0.999999, or every output value for every input is 0.9999 with only the last digits differing between inputs. As far as I could tell there was no correlation between those last two digits and what was supposed to be output.
How should I go about figuring out what's not working right?
You need to find a set of parameters (number of neurons, learning rate, number of iterations for training) that works well for classifying previously unseen data. People often achieve this by separating their data into three groups: training, validation and testing.
Whatever you decide to do, just remember that it really doesn't make sense to be testing on the same data with which you trained, because any classification method close to reasonable should be getting everything 100% right under such a setup.
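As a hedged sketch of that parameter search (the candidate values and the trainAndEvaluate helper are made up for illustration):

```cpp
#include <limits>
#include <vector>

struct Params {
    int hiddenNeurons;
    double learningRate;
    int iterations;
};

// trainAndEvaluate is a hypothetical callable: build a net with these settings,
// train it on the training set, and return its error on the validation set.
template <typename TrainAndEvaluate>
Params pickParams(TrainAndEvaluate trainAndEvaluate) {
    const std::vector<int> hidden = {4, 8, 16};
    const std::vector<double> rates = {0.05, 0.15, 0.5};
    const std::vector<int> iters = {100, 1000, 10000};

    Params best{};
    double bestValidationError = std::numeric_limits<double>::max();
    for (int h : hidden)
        for (double r : rates)
            for (int it : iters) {
                double err = trainAndEvaluate(Params{h, r, it});
                if (err < bestValidationError) {
                    bestValidationError = err;
                    best = Params{h, r, it};
                }
            }
    return best;   // finally, measure error once on the untouched test set
}
```

The important part is the separation: parameters are chosen by validation error, and the test set is only used once at the very end.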

Poor Man's Huffman Compression

I'm trying to better understand how the Huffman decoder works. I've got a code table, but I'm having a hard time understanding how the decoder would work because of ambiguity in the binary string.
(I'm learning this in prep for my final year at uni.)
My table:
Data, Hcode
0, 0
1, 1
2, 10
3, 11
17, 100
18, 101
19, 110
29, 111
If I have a Huffman code string like 010011, it can decode to many different combinations of data, so how can I discriminate?
I understand the Huffman logic in its BST representation: you follow a path to a given leaf, and that path is the code for the value at the leaf (between 0 and 255, ASCII). But I still don't know how you can discriminate between returning data 0, 1, 0 and data 0, 17.
Do I really have to enforce 2-bit codes on data 0 and 1 (00 and 01)?
I hope I've explained this the best I can XD
If you're wondering how I generated the table - you're going to kill me, because I didn't use tree logic to generate it. Although I sorted the data (random bytes) by frequency, I generated the Hcodes by converting each element's position number into binary (hence why I called this post Poor Man's Huffman).
Many thanks for any advice.
The code table is wrong. Huffman codes are supposed to be prefix free. This is necessary in order to decode them afterwards without ambiguities.
If you used a binary tree to create the codes, this would automatically ensure the "prefix freeness". See: http://en.wikipedia.org/wiki/Huffman_coding
And now, I am going to kill you ... ;)
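A minimal sketch of the tree-based construction; the frequencies here are made up, and only the symbols match the table in the question:

```cpp
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node {
    int symbol;        // -1 for internal nodes
    unsigned freq;
    Node *left;
    Node *right;
};

// Walk the tree; the path of 0s (left) and 1s (right) to each leaf is its code.
void collectCodes(const Node *n, const std::string &prefix,
                  std::map<int, std::string> &codes) {
    if (!n->left && !n->right) {
        codes[n->symbol] = prefix.empty() ? "0" : prefix;
        return;
    }
    collectCodes(n->left, prefix + "0", codes);
    collectCodes(n->right, prefix + "1", codes);
}

int main() {
    // Made-up frequencies for the question's symbols.
    std::vector<std::pair<int, unsigned>> freqs =
        {{0, 40}, {1, 30}, {2, 12}, {3, 8}, {17, 4}, {18, 3}, {19, 2}, {29, 1}};

    auto byFreq = [](const Node *a, const Node *b) { return a->freq > b->freq; };
    std::priority_queue<Node *, std::vector<Node *>, decltype(byFreq)> pq(byFreq);
    for (const auto &f : freqs)
        pq.push(new Node{f.first, f.second, nullptr, nullptr});   // leaks ignored in this sketch

    // Repeatedly merge the two least frequent nodes; rare symbols end up deep,
    // frequent ones shallow, and no code is ever a prefix of another.
    while (pq.size() > 1) {
        Node *a = pq.top(); pq.pop();
        Node *b = pq.top(); pq.pop();
        pq.push(new Node{-1, a->freq + b->freq, a, b});
    }

    std::map<int, std::string> codes;
    collectCodes(pq.top(), "", codes);
    for (const auto &c : codes)
        std::cout << c.first << " -> " << c.second << "\n";
}
```

With codes built this way, decoding a bit string like 010011 is unambiguous: you walk the tree from the root, output a symbol each time you hit a leaf, and restart at the root, with no guessing needed.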
Not only is the code table wrong, the lengths of the codes are also wrong. If you have two one-bit codes, you have already used up all of the code space, and can have no other codes. What you have shown is not only not a Huffman code and not a prefix code -- it is in fact not a code at all.

Logical Programming Problem

I've been trying to solve this problem for quite some time, but I am having trouble with it.
Let's say on a trigger, you receive values.
First trigger: You get 1
Second trigger: You get 1, 2
Third trigger: You get 1, 2, 3
So, I store 1.
For the 2nd trigger, I store 2 since 1 already exists.
For the 3rd trigger, I store 3 since 1 and 2 already exist.
So in total I have stored 1, 2, 3.
As you can see, we can easily check for new values, if old != new.
Here's come the problem:
Fourth trigger: You get 1, 2, 4
For the 4th trigger, I store 1, 2 because they already exist,
but how do I check against 3, remove 3 from the store, and check whether 4 is new?
If you are having problems understanding this, feel free to clarify. Thanks!
Use a std::set<int> container. When a trigger arrives, clear it and insert all the values from the trigger. This should be fine if you work with just a few numbers (about ten or so); with more, a somewhat more sophisticated approach might be required.
Hard to tell what you're asking exactly, but see the std::set data structure if your main problem is maintaining a set of unique numbers and efficiently checking for existence in the set.
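A minimal sketch combining the two suggestions above: keep the previous trigger's values in a std::set, use std::set_difference to see what appeared and what disappeared, then replace the store (the values are the ones from the example triggers):

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

int main() {
    std::set<int> store = {1, 2, 3};          // state after the third trigger
    std::vector<int> trigger = {1, 2, 4};     // fourth trigger arrives

    std::set<int> incoming(trigger.begin(), trigger.end());

    std::vector<int> added, removed;
    // Values in the trigger that were not stored before (here: 4).
    std::set_difference(incoming.begin(), incoming.end(),
                        store.begin(), store.end(), std::back_inserter(added));
    // Values that were stored but are no longer sent (here: 3).
    std::set_difference(store.begin(), store.end(),
                        incoming.begin(), incoming.end(), std::back_inserter(removed));

    store.swap(incoming);                     // the trigger becomes the new store

    for (int v : added)   std::cout << "new: "  << v << "\n";
    for (int v : removed) std::cout << "gone: " << v << "\n";
}
```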
Your logic changed between 1,2,3 and 1,2,4
(only stored 3 on former, but stored 1,2,4 on latter)
In that case, ignore data received that already exists and only store new data, unless some old data was not sent, in which case you'll create a new set of data to store.
But, I'm guessing that's not what you had in mind at all :)
edit
I see it's been edited now, so my answer is invalid
edit-2
The fastest way is to drop all stored data on each iteration, as the comparisons will take as long as (if not longer than) a complete save of the sent data.
Your approach sounds like it is better served by using some basic set theory. A couple of answers already point you to STL sets for that matter. With that, you'll need to iterate through the reported values to test for membership in the set.
However, is there an opportunity to affect what is reported with each "trigger"? For example, if this is something like a select poll, you could just put whatever it is that you're polling into a different state so that it is not reported as ready in subsequent triggers.

Autocorrelation returns random results with mic input (using a high pass filter)

Sorry to ask a similar question to the one I asked before (FFT Problem (Returns random results)), but I've looked up pitch detection and autocorrelation and have found some code for pitch detection using autocorrelation.
I'm trying to do pitch detection of a user's singing. Problem is, it keeps returning random results. I've got some code from http://code.google.com/p/yaalp/ which I've converted to C++ and modified (below). My sample rate is 2048, and the data size is 1024. I'm detecting the pitch of both a sine wave and mic input. The frequency of the sine wave is 726.0, and it's detected as 722.950820 (which I'm ok with), but the pitch of the mic input is detected as a random number from around 100 to around 1050.
I'm now using a high-pass filter to remove the DC offset, but it's not working. Am I doing it right, and if so, what else can I do to fix it? Any help would be greatly appreciated!
(Fixed)
Thanks,
Niall.
Edit: Changed the code to implement a high-pass filter with a cutoff of 30 Hz (from What Are High-Pass and Low-Pass Filters?; can anyone tell me how to convert the low-pass filter using convolution into a high-pass one?), but it's still returning random results. Plugging it into a VST host and using VST plugins to compare spectra isn't an option for me, unfortunately.
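For reference, a common way to remove a DC offset without convolution is a one-pole DC-blocking filter; a minimal sketch (this is not the code from the question, and the 0.995 coefficient is an assumption that sets the very low cutoff):

```cpp
#include <cstddef>
#include <vector>

// One-pole DC blocker: y[n] = x[n] - x[n-1] + R * y[n-1].
// It passes everything except the lowest frequencies, which removes DC offset.
std::vector<float> removeDcOffset(const std::vector<float> &input) {
    std::vector<float> output(input.size());
    const float R = 0.995f;          // closer to 1.0 means a lower cutoff
    float prevIn = 0.0f, prevOut = 0.0f;
    for (std::size_t n = 0; n < input.size(); ++n) {
        output[n] = input[n] - prevIn + R * prevOut;
        prevIn = input[n];
        prevOut = output[n];
    }
    return output;
}
```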
Edit: Fixed, thanks for everyone's help, but I never got this code to work; I'm now using new code.
I am no sound expert, but if you are sampling at 44100 (I guess samples per second) and use 1024 data points, you are working with about 1/40th of a second's worth of data. It doesn't surprise me that the detected pitch varies a lot depending on which piece you pick. If you want to find the average or main pitch of a voice, I'd expect to need about 1 second's worth of data.
At 44.1 kHz sampling frequency, 1024 samples is only a little bit over 23 ms worth of data. Isn't it possible that this is simply insufficient data in order to compute the pitch of a human singer?
I mean, the sound I can make that lasts for 23 ms is probably not something I have a lot of pitch control over; I would expect this kind of measurement to be done over slightly longer periods of time.
The problem is in your findBestCandidates() function:
Inside this function you access the 'inputs' array from 0 up to 'length - 1'.
When you call this function inside the detectPitchCalculation() function, 'inputs' is 'results' and 'length' is 'nHiPeriodInSamples'.
But 'results' is only allocated and filled up to 'nHiPeriodInSamples - nLowPeriodInSamples - 1'.
So if 'nLowPeriodInSamples' is greater than 0, you access unallocated, random memory inside the findBestCandidates() function!
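The shape of that bug in isolation (a sketch only; the names follow the answer, and the real function bodies are in the question's code, which is not reproduced here):

```cpp
#include <vector>

void findBestCandidates(const double *inputs, int length) {
    for (int i = 0; i < length; ++i) {
        double v = inputs[i];   // reads inputs[0 .. length - 1]
        (void)v;
    }
}

void detectPitchCalculation(int nLowPeriodInSamples, int nHiPeriodInSamples) {
    // 'results' only has nHiPeriodInSamples - nLowPeriodInSamples entries...
    std::vector<double> results(nHiPeriodInSamples - nLowPeriodInSamples);
    // ...but is read as if it had nHiPeriodInSamples entries, which goes
    // out of bounds whenever nLowPeriodInSamples > 0.
    findBestCandidates(results.data(), nHiPeriodInSamples);
}
```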
EDIT:
Another bug is that you only fill every 'nResolution'-th entry of the 'results' array in the detectPitchCalculation() function, but access every entry in the findBestCandidates() function (via the 'inputs' argument). Since you call detectPitchCalculation() with 'nResolution=1', this does not explain your specific problem... so I will look a little bit more. But it would definitely be a problem if you called it with higher resolutions.
I don't see the problem in your code, but I'm not good at C. I'd try the following to find the problem:
run it with data where the result is known, e.g. with sin(x) as input
run it with a small data size (e.g. 2)
Compare the results with known correct ones. You should be able to find those on the internet, or work them out by hand.
If random means same input, different output, you most probably have a bug in the initialisation of variables. Use a debugger and known input to check that all variables, especially all elements of arrays, are properly initialized.