Scientific explanation why word2vec models perform poorly on small data - word2vec

I would like to know if there is any scientific explanation for why word2vec models like CBOW perform poorly on small data. Here's what I tested:
data = [[context1], [context2], [context3], ..., [contextn]]
model = word2vec model trained on data
model.most_similar('word')
output = [the expected word is not even in the top 10]
I retrained the model with 10 times the dataset.
model.most_similar('word')
output = [the expected word is among the 10 most similar words]
Is there any scientific reason for the improvement in performance as the data size increases, other than the increase in word counts that comes with more data?

Words start word2vec-training in random places, and Word2Vec can only "deduce" that a pair of words are alike if it sees many instructive examples that, gradually over all training, nudge those words into similar positions. It can only arrange many, many words so that all those pairwise similarities are simultaneously upheld if it gets many, many, many varied examples and thus many, many, many chances to incrementally nudge them all into better places.
If you have a tiny dataset:
(1) There might be no, or very few, examples where the desired-to-be-alike words are in similar nearby-word contexts. With no examples where there are shared nearby-words, there's little basis for nudging the pair to the same place – the internal Word2Vec task, of predicting nearby words, doesn't need them to be near each other, if they're each predicting completely different words. But even with a few examples, there are problems, like...
(2) Words need to move a lot from their original random positions to eventually reach the useful 'constellation' you get from a successful word2vec session. If a word only appears in your dataset 10 times, and you train over the dataset for 10 iterations, that word gets just 100 CBOW nudges. If instead the word appears 1000 times, then 10 training iterations give it 10,000 CBOW updates - a much better chance to move from an initially-arbitrary location to a useful location. As a result, you can sometimes squeeze slightly-better results from smaller datasets by increasing your training-iterations; that somewhat simulates a larger corpus. But there's still no real variety in those repeated contexts, so you still have the problem of...
(3) Smaller datasets, with largish models (lots of 'free parameters', like the independent coordinates of every context-word and predicted target-word), give rise to 'overfitting'. That is, they can become really good at the training task, essentially by 'memorizing' the idiosyncrasies of the data, without achieving the sort of compression/generalization that's usually wanted for applying the results to novel, varied data. For example, within a tiny corpus, with lots of internal parameters, maybe 'happy' and 'ecstatic' don't need to be near each other - their contexts only overlap a little, and there's lots of 'space' inside the model to put them in totally different places but still do well predicting their neighboring words. You can sometimes squeeze out whatever generalization is possible by shrinking the model, for example making the vector size parameter much smaller, when the dataset is tiny. Then it's still forced to make use of whatever (few) context-similarities exist.
But the essence of Word2Vec is using bulk data, and bulk training, to force the discovery of generalizable patterns - so tricks to stretch small corpuses into something that just barely works aren't using the algorithm where it is strongest.

Related

How to organize data (writing your own profiler)

I was thinking about using reflection to generate a profiler. Let's say I am generating code without a problem; how do I properly measure or organize the results? I'm mostly concerned about CPU time, but memory suggestions are welcome.
There are lots of bad ways to write profilers.
I wrote what I thought was a pretty good one over 20 years ago.
That is, it made a decent demo, but when it came down to serious performance tuning I concluded there really is nothing that works better, and gives better results, than the dumb old manual method, and here's why.
Anyway, if you're writing a profiler, here's what I think it should do:
It should sample the stack at unpredictable times, and each stack sample should contain line number information, not just functions, in the code being tuned. It's not so important to have that in system functions you can't edit.
It should be able to sample during blocked time like I/O, sleeps, and locking, because those are just as likely to result in slowness as CPU operations.
It should have a hot-key that the user can use, to enable the sampling during the times they actually care about (like not when waiting for the user to do something).
Do not assume it is necessary to get measurement precision, which would necessitate a large number of frequent samples. This is incredibly basic, and it is a major reversal of common wisdom. The reason is simple - it doesn't do any good to measure problems if the price you pay is failing to find them.
That's what happens with profilers - speedups hide from them, so the user is content with finding maybe one or two small speedups while giant ones get away.
Giant speedups are the ones that take a large percentage of time, and the number of stack samples it takes to find them is inversely proportional to the time they take. If the program spends 30% of its time doing something avoidable, it takes (on average) 2/0.3 = 6.67 samples before it is seen twice, and that's enough to pinpoint it.
To answer your question, if the number of samples is small, it really doesn't matter how you store them. Print them to a file if you like - whatever.
It doesn't have to be fast, because you don't sample while you're saving a sample.
What does allow those speedups to be found is when the user actually looks at and understands individual samples. Profilers have all kinds of UI - hot spots, call counts, hot paths, call graphs, call trees, flame graphs, phony 3-digit "statistics", blah, blah.
Even if it's well done, that's only timing information.
It doesn't tell you why the time is spent, and that's what you need to know.
Make eye candy if you want, but let the user see the actual samples.
... and good luck.
ADDED: A sample looks something like this:
main:27, myFunc:16, otherFunc:9, ..., someFunc:132
That means main is at line 27, calling myFunc. myFunc is at line 16, calling otherFunc, and so on. At the end, it's in someFunc at line 132, not calling anything (or calling something you can't identify).
No need for line ranges.
(If you're tempted to worry about recursion - don't. If the same function shows up more than once in a sample, that's recursion. It doesn't affect anything.)
You don't need a lot of samples.
When I did it, sampling was not automatic at all.
I would just have the user press both shift keys simultaneously, and that would trigger a sample.
So the user would grab like 10 or 20 samples, but it is crucial that the user take the samples during the phase of the program's execution that annoys the user with its slowness,
like between the time some button is clicked and the time the UI responds.
Another way is to have a hot-key that runs sampling on a timer while it is pressed.
If the program is just a command-line app with no user input, it can just sample all the time while it executes.
The frequency of sampling does not have to be fast.
The goal is to get a moderate number of samples during the program phase that is subjectively slow.
If you take too many samples to look at, then when you look at them you need to select some at random.
The thing to do when examining a sample is to look at each line of code in the sample so you can fully understand why the program was spending that instant of time.
If it is doing something that might be avoided,
and if you see a similar thing on another sample, you've found a speedup.
How much of a speedup? This much (the math is here):
For example, if you look at three samples, and on two of them you see avoidable code, fixing it will give you a speedup - maybe less, maybe more, but on average 4x.
(That's what I mean by giant speedup. The way you get it is by studying individual samples, not by measuring anything.)
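For the curious, here is one way to make that "on average 4x" figure precise - a sketch only, assuming each stack sample independently lands in the avoidable code with some unknown probability x, and a uniform prior on x. If the avoidable code shows up on s of n samples (with s < n), the posterior for x is Beta(s+1, n-s+1), and the expected speedup factor from removing it is

E[1/(1-x)] = B(s+1, n-s) / B(s+1, n-s+1) = (n+1)/(n-s)

With n = 3 samples and s = 2 sightings, that is (3+1)/(3-2) = 4 - a 4x speedup on average, though any particular case may be smaller or much larger.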
There's a video here.

Recommended samples for performance benchmarks?

I'm writing performance benchmarks for some of my code. This is both to compare my own implementations as I develop/experiment, and to compare against "competing" implementations. I have no problem writing these, and getting usable results.
It's very well established that more samples are a good thing, as they reduce the impact of erroneous data and give a more accurate result.
So, if I'm profiling a given function/procedure/whatever, how many samples does it seem reasonable to get?
I'm currently doing about 1 million samples for each test. These are individual operations, the results rarely take longer than 10s per item, even on an old laptop. Most are under a hundredth of a second.
Actually, it is not well established that more samples are a good thing.
It is nothing more than common wisdom.
I think you are sharing in a general confusion about the reason for profiling, whether the purpose is to measure performance or to find speedups.
For measuring performance, you don't need samples at all.
What you need is a stopwatch, whether in software or not.
If your process runs too quickly for the resolution of the stopwatch, just run your process 10^3 or 10^6 times, measure it, and divide by that number.
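For example, here is a minimal C++ stopwatch sketch of exactly that. The function work() and the repetition count are placeholders for whatever you are measuring, not anything from your benchmark:

#include <chrono>
#include <cstdio>

// Stand-in for the operation under test.
static void work() {
    volatile double x = 0.0;
    for (int i = 0; i < 100; ++i) x = x + i * 0.5;
}

int main() {
    const long N = 1000000;  // run it 10^6 times so the clock's resolution stops mattering
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i)
        work();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::printf("average per call: %.9f seconds\n", elapsed.count() / N);
    return 0;
}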
For finding speedups, sampling the call stack is very effective, provided the samples contain line-level or instruction-level call site information.
How many samples do you need?
Well, if you see it doing something that could be removed on one sample, that probably doesn't mean much.
But if you see it on two samples, that estimates its cost at a time fraction F of about 2/N, where N is the number of samples.
Example: if you see it twice in 10 samples, that means it costs roughly 20% of time.
In general, if the speedup is going to save you fraction F of time, it takes on average 2/F samples to see it twice.
Example: if it is going to save 30% of time (F = 0.3) you need on average 2/0.3 = 6.67 samples to see it twice.
Of course, if you see it more than twice, all the better.
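(A one-line justification of the 2/F rule, assuming independent samples: each sample lands in the avoidable activity with probability F, so the number of samples up to and including the r-th sighting follows a negative binomial distribution with mean r/F; for r = 2 that is 2/F.)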
Bottom line, for finding speedups, you don't need a lot of samples.
What you do need is to examine each one for activity that could be removed.
What you don't need is to mush them together into "statistics" (like most profilers do).
Many people understand this.
If you want a bit more rigorous explanation, look here.

Neural Networks training on multiple cores

Straight to the facts.
My Neural network is a classic feedforward backpropagation.
I have a historical dataset that consists of:
time, temperature, humidity, pressure
I need to predict the next values based on historical data.
This dataset is about 10MB large, therefore training it on one core takes ages. I want to go multicore with the training, but I can't understand what happens with the training data for each core, and what exactly happens after the cores finish working.
According to: http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation
The training data is broken up into equally large batches for each of
the threads. Each thread executes the forward and backward
propagations. The weight and threshold deltas are summed for each of
the threads. At the end of each iteration all threads must pause
briefly for the weight and threshold deltas to be summed and applied
to the neural network.
'Each thread executes forward and backward propagations' - this means each thread just trains itself with its part of the dataset, right? How many iterations of the training per core?
'At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network' - What exactly does that mean? When the cores finish training with their datasets, what does the main program do?
Thanks for any input into this!
Complete training by backpropagation is often not the thing one is really looking for, the reason being overfitting. In order to obtain a better generalization performance, approaches such as weight decay or early stopping are commonly used.
Against this background, consider the following heuristic approach: split the data into parts corresponding to the number of cores and set up a network for each core (each having the same topology). Train each network completely separately from the others (I would use some common parameters for the learning rate, etc.). You end up with N trained networks f_i(x), i = 1..N.
Next, you need a scheme to combine the results. Choose F(x) = sum_{i=1}^{N} alpha_i f_i(x), then use least squares to adapt the parameters alpha_i such that sum_{j=1}^{M} (F(x_j) - y_j)^2 is minimized. This involves a singular value decomposition, which scales linearly in the number of measurements M and thus should be feasible on a single core. Note that this heuristic approach also bears some similarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply try to average the weights, see below.
Moreover, see these answers here.
Regarding your questions:
As Kris noted, it will usually be one iteration. However, in general it can also be a small number chosen by you. I would play around with choices roughly between 1 and 20 here. Note that the above suggestion uses infinity, so to speak, but then replaces the recombination step with something more appropriate.
This step simply does what it says: it sums up all weights and deltas (what exactly depends on your algorithm). Remember, what you aim for is a single trained network in the end, and one uses the split data to estimate it.
To collect the results, one often does the following:
(i) In each thread, use your current (global) network weights to estimate the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but that works only for a single backpropagation iteration in the threads). Now start again with (i), using the newly averaged weights in each thread. Do this until you reach convergence. (A small sketch of this loop appears after the list of variations below.)
This is a form of iterative optimization. Variations of this algorithm:
Instead of always using the same split, use random splits at each iteration step (or at every n-th iteration). Or, in the spirit of random forests, use only a subset.
Play around with the number of iterations in a single thread (as mentioned in point 1. above).
Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training-error, or some kind of least squares as above).
... plus many more choices as in each complex optimization ...
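To make steps (i) and (ii) concrete, here is a minimal C++ sketch of the split / train-locally / average loop. It only illustrates the control flow: the "network" is replaced by a stand-in linear model taking a single gradient step per round, and every name, data value and learning rate is made up for the example (compile with -std=c++11 -pthread or later):

#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

struct Sample { std::vector<double> x; double y; };

// Stand-in for "one backpropagation pass over a shard": a single gradient step
// of a linear model with squared loss, starting from the current global weights.
static std::vector<double> local_step(std::vector<double> w,
                                      const std::vector<Sample>& shard,
                                      double lr) {
    std::vector<double> grad(w.size(), 0.0);
    for (const Sample& s : shard) {
        double pred = 0.0;
        for (std::size_t j = 0; j < w.size(); ++j) pred += w[j] * s.x[j];
        for (std::size_t j = 0; j < w.size(); ++j) grad[j] += (pred - s.y) * s.x[j];
    }
    for (std::size_t j = 0; j < w.size(); ++j)
        w[j] -= lr * grad[j] / shard.size();
    return w;
}

int main() {
    // Toy data, already split into one shard per "core".
    std::vector<std::vector<Sample>> shards = {
        {{{1.0, 0.5}, 2.0}, {{0.5, 1.0}, 1.5}},
        {{{2.0, 1.0}, 4.0}, {{1.0, 2.0}, 3.0}},
    };
    std::vector<double> global_w = {0.0, 0.0};          // shared starting point

    for (int round = 0; round < 100; ++round) {
        std::vector<std::vector<double>> local_w(shards.size());
        std::vector<std::thread> threads;
        // (i) each thread updates a copy of the current global weights on its own shard
        for (std::size_t t = 0; t < shards.size(); ++t)
            threads.emplace_back([&, t] { local_w[t] = local_step(global_w, shards[t], 0.1); });
        for (std::thread& th : threads) th.join();
        // (ii) average the thread-local weights to get the new global weights
        for (std::size_t j = 0; j < global_w.size(); ++j) {
            double sum = 0.0;
            for (const std::vector<double>& w : local_w) sum += w[j];
            global_w[j] = sum / local_w.size();
        }
    }
    std::printf("w = (%f, %f)\n", global_w[0], global_w[1]);
    return 0;
}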
For multicore parallelization it makes no sense to think about splitting the training data over threads etc. If you implement that stuff on your own you will most likely end up with a parallelized implementation that is slower than the sequential implementation because you copy your data too often.
By the way, in the current state of the art, people usually use mini-batch stochastic gradient descent for optimization. The reason is that you can simply forward propagate and backpropagate mini-batches of samples in parallel but batch gradient descent is usually much slower than stochastic gradient descent.
So how do you parallelize the forward propagation and backpropagation? You don't have to create threads manually! You can simply write down the forward propagation with matrix operations and use a parallelized linear algebra library (e.g. Eigen) or you can do the parallelization with OpenMP in C++ (see e.g. OpenANN).
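As a rough illustration of the matrix-operations approach (a sketch, not code from any particular library; the layer sizes, row-major layout and sigmoid activation are assumptions made for the example), a forward pass over a mini-batch parallelized with OpenMP could look like this (compile with -fopenmp):

#include <cmath>
#include <cstdio>
#include <vector>

// Forward-propagate a mini-batch X (n x d, row-major) through one layer with
// weights W (d x h, row-major) and a sigmoid activation. The rows of the
// batch are independent, so OpenMP can split them across cores.
static std::vector<double> forward_layer(const std::vector<double>& X, int n, int d,
                                         const std::vector<double>& W, int h) {
    std::vector<double> Z(n * h);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < h; ++j) {
            double s = 0.0;
            for (int k = 0; k < d; ++k)
                s += X[i * d + k] * W[k * h + j];
            Z[i * h + j] = 1.0 / (1.0 + std::exp(-s));   // sigmoid
        }
    }
    return Z;
}

int main() {
    const int n = 4, d = 3, h = 2;                        // tiny made-up sizes
    std::vector<double> X(n * d, 0.5), W(d * h, 0.1);
    std::vector<double> Z = forward_layer(X, n, d, W, h);
    std::printf("Z[0] = %f\n", Z[0]);
    return 0;
}

With a parallelized linear algebra library, the whole loop nest above collapses into a single matrix product and the library decides how to spread the work over cores.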
Today, leading edge libraries for ANNs don't do multicore parallelization (see here for a list). You can use GPUs to parallelize matrix operations (e.g. with CUDA) which is orders of magnitude faster.

Best of breed indexing data structures for Extremely Large time-series

I'd like to ask fellow SO'ers for their opinions regarding best of breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).
Two basic types of time-series exist based on the sampling/discretisation characteristic:
Regular discretisation (every sample is taken with a common frequency)
Irregular discretisation (samples are taken at arbitrary time-points)
Queries that will be required:
All values in the time range [t0,t1]
All values in the time range [t0,t1] that are greater/less than v0
All values in the time range [t0,t1] that are in the value range [v0,v1]
The data sets consist of summarized time-series (which sort of gets over the Irregular discretisation), and multivariate time-series. The data set(s) in question are about 15-20TB in size, hence processing is performed in a distributed manner - because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.
Distributed processing in this context also means dispatching the required data specific computation along with the time-series query, so that the computation can occur as close to the data as is possible - so as to reduce node to node communications (somewhat similar to map/reduce paradigm) - in short proximity of computation and data is very critical.
Another issue that the index should be able to cope with is that the overwhelming majority of data is static/historic (99.999...%); however, on a daily basis new data is added - think of "in the field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCH models, etc.) with as low a latency as possible; some of these running calculations require historical data, some of which will be more than what can reasonably be cached.
I've already considered HDF5; it works well/efficiently for smaller datasets but starts to drag as the datasets become larger, and there are no native parallel-processing capabilities from the front-end.
Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)
You would probably want to use some type of large, balanced tree. Like Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU into these new "cache oblivious B-trees". For some discussion of the implementation of these things, look up Tokutek DB, they've got a number of good presentations like the following:
http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
Questions 2 and 3 are in general a lot harder, since they involve higher dimensional range searching. The standard data structure for doing this would be the range tree (which gives O(log^{d-1}(n)) query time, at the cost of O(n log^{d-1}(n)) storage). You generally would not want to use a k-d tree for something like this. While it is true that k-d trees have optimal, O(n), storage costs, it is a fact that you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d=2, this would be O(sqrt(n)) time complexity; and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?)
Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then as a post-process go through and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty,+infty] is orders of magnitude larger than the number of points in [t0,t1] x [v0, v1].
General ideas:
Problem 1 is fairly common: create an index that fits into your RAM and has links to the data on secondary storage (data structure: B-tree family).
Problems 2 and 3 are quite complicated since your data is so large. You could partition your data into time ranges and calculate the min/max for each time range. Using that information, you can filter out time ranges (e.g. if the max value for a range is 50 and you search for v0 > 60, then that interval can be skipped). The rest needs to be searched by going through the data. The effectiveness greatly depends on how fast the data is changing.
You can also do multiple indices by combining the time ranges of lower levels to do the filtering faster.
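A small C++ sketch of that min/max filtering idea (all type and field names here are invented for the illustration; the point is just that the per-range summaries are tiny, stay in RAM, and let you skip most of the raw data):

#include <cstdint>
#include <cstdio>
#include <vector>

// Per-time-range summary kept in RAM; the raw points live on secondary storage.
struct RangeSummary {
    double t_min, t_max;     // time span covered by this range
    double v_min, v_max;     // smallest / largest value seen in this range
    std::uint64_t offset;    // where the raw data for this range is stored
};

// Return the storage offsets of the ranges that *might* contain points with
// time in [t0, t1] and value in [v0, v1]; every other range is skipped
// without touching the raw data.
std::vector<std::uint64_t> candidate_ranges(const std::vector<RangeSummary>& index,
                                            double t0, double t1,
                                            double v0, double v1) {
    std::vector<std::uint64_t> out;
    for (const RangeSummary& r : index) {
        bool time_overlap  = r.t_max >= t0 && r.t_min <= t1;
        bool value_overlap = r.v_max >= v0 && r.v_min <= v1;
        if (time_overlap && value_overlap)
            out.push_back(r.offset);   // still has to be scanned point by point
    }
    return out;
}

int main() {
    std::vector<RangeSummary> index = {
        {0.0, 9.9, 10.0, 50.0, 0}, {10.0, 19.9, 55.0, 80.0, 4096},
    };
    std::vector<std::uint64_t> hits = candidate_ranges(index, 5.0, 15.0, 60.0, 70.0);
    std::printf("%zu candidate range(s)\n", hits.size());   // only the second range qualifies
    return 0;
}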
It is going to be really time-consuming and complicated to implement this yourself. I recommend you use Cassandra.
Cassandra can give you horizontal scalability, redundancy and allow you to run complicated map reduce functions in future.
To learn how to store time series in Cassandra, please take a look at:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
and http://www.youtube.com/watch?v=OzBJrQZjge0.

how to improve multi-dimensional bit array comparison performance in c or c++

I have the following three-dimensional bit array (for a bloom filter):
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS];
The P_ROWS dimension represents independent two-dimensional bit arrays (i.e., P_ROWS[0], P_ROWS[1], P_ROWS[2] are independent bit arrays), could be as large as 100MBs, and contains data which are populated independently. The data that I am looking for could be in any of these P_ROWS, and right now I am searching through them independently, which is P_ROWS[0], then P_ROWS[1], and so on until I get a positive or until the end (P_ROWS[n-1]). This implies that if n is 100 I have to do this search (bit comparison) 100 times (and this search is done very often). Somebody suggested that I can improve the search performance if I could do bit grouping (use a column-major order on the row-major order array -- I DON'T KNOW HOW).
I really need to improve the performance of the search because the program does a lot of it.
I will be happy to give more details of my bit table implementation if required.
Sorry for the poor language.
Thanks for your help.
EDIT:
The bit grouping could be done in the following format:
Assume the array to be :
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS]={{(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
{(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
{(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)}};
As you can see, all the rows -- on the third dimension -- have similar data. What I want after the grouping is this: all the a1's are in one group (as just one entity, so that I can compare them with another bit to check whether they are on or off), all the b1's are in another group, and so on.
Re-use Other People's Algorithms
There are a ton of bit-calculation optimizations out there including many that are non-obvious, like Hamming Weights and specialized algorithms for finding the next true or false bit, that are rather independent of how you structure your data.
Reusing algorithms that other people have written can really speed up computation and lookups, not to mention development time. Some algorithms are so specialized and use computational magic that will have you scratching your head: in that case, you can take the author's word for it (after you confirm their correctness with unit tests).
Take Advantage of CPU Caching and Multithreading
I personally reduce my multidimensional bit arrays to one dimension, optimized for expected traversal.
This way, there is a greater chance of hitting the CPU cache.
In your case, I would also think deeply about the mutability of the data and whether you want to put locks on blocks of bits. With 100MBs of data, you have the potential of running your algorithms in parallel using many threads, if you can structure your data and algorithms to avoid contention.
You may even have a lockless model if you divide up ownership of the blocks of data by thread so no two threads can read or write to the same block. It all depends on your requirements.
Now is a good time to think about these issues. But since no one knows your data and usage better than you do, you must consider design options in the context of your data and usage patterns.
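To illustrate the "bit grouping" asked about in the question, here is a small C++ sketch (an illustration only; the GroupedBloom name and sizes are made up, and it assumes at most 64 independent filters so that one bit position of every filter fits in a single 64-bit word). Instead of probing the P_ROWS tables one after another, store bit position p of every table in the same machine word: each query position is then tested in all tables with one load, and ANDing the words for the k probe positions leaves a mask of the tables that pass every position.

#include <cstdint>
#include <cstdio>
#include <vector>

// grouped[p] holds bit p of every filter: bit t of grouped[p] is filter t's
// bit at position p (assumes at most 64 filters).
struct GroupedBloom {
    std::vector<std::uint64_t> grouped;
    explicit GroupedBloom(std::size_t num_positions) : grouped(num_positions, 0) {}

    void set(std::size_t filter, std::size_t pos) {
        grouped[pos] |= std::uint64_t{1} << filter;
    }

    // Returns a mask of the filters whose bits are set at *all* query positions,
    // i.e. the filters that report "possibly present".
    std::uint64_t query(const std::vector<std::size_t>& positions) const {
        std::uint64_t mask = ~std::uint64_t{0};
        for (std::size_t p : positions)
            mask &= grouped[p];          // one load tests this position in every filter
        return mask;
    }
};

int main() {
    GroupedBloom gb(1024);               // made-up table size
    // Pretend the hash positions of some key are 3, 97 and 511, set in filter 5.
    gb.set(5, 3); gb.set(5, 97); gb.set(5, 511);
    std::uint64_t hits = gb.query({3, 97, 511});
    std::printf("filters that may contain the key: 0x%llx\n",
                static_cast<unsigned long long>(hits));   // bit 5 is set
    return 0;
}

If you have more than 64 tables, you can simply use several words per position; the traversal stays sequential in memory, which also helps with the CPU-cache point above.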