I have several enormous spreadsheets of numbers, and I want to search for combinations of cells that sum to a certain value. Using a brute-force approach that tries all possible combinations, I am searching over hundreds of millions of combinations, and the process is taking too long for my needs. Many of those combinations aren't even close to reasonable (e.g. is 1 + 2 + 3 == 4,000?). Is there some way to limit the space of combinations searched that is more computationally efficient? I can't just throw out "1", because 1 + 3,999 does indeed == 4,000.
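For concreteness, here is a minimal Python sketch of the brute force I'm doing now, plus the kind of pruning I'm hoping exists (this particular prune assumes the cell values are non-negative, so a branch can be abandoned as soon as its partial sum passes the target):

    from itertools import combinations

    def brute_force(values, target):
        # Try every subset of every size: exponential in len(values).
        return [c for r in range(1, len(values) + 1)
                  for c in combinations(values, r) if sum(c) == target]

    def pruned_search(values, target):
        # Depth-first search over sorted values; abandons a branch as soon as
        # its partial sum exceeds the target (valid only for non-negative values).
        values = sorted(values)
        hits = []

        def recurse(start, partial, chosen):
            if partial == target and chosen:
                hits.append(tuple(chosen))
            for i in range(start, len(values)):
                if partial + values[i] > target:
                    break   # later values are at least as large, so stop here
                chosen.append(values[i])
                recurse(i + 1, partial + values[i], chosen)
                chosen.pop()

        recurse(0, 0, [])
        return hits

    print(pruned_search([1, 2, 3, 3999], 4000))   # [(1, 3999)]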
I have read about many random number generators and the problems most of them have (repeatability, non-uniform distribution, floating-point precision, modulus issues, and so on).
I'm a game developer, and I'm wondering: why not generate 'random' numbers from the time? I know they won't be truly random, but at least they can't be predicted, and I'm happy for them to merely feel random to the players.
For example, let's say that at every frame we take 5 digits out of the current time and use them to generate random numbers.
Let's say the time is available as a float ss.mmmuuunnn, where ss = seconds, mmm = milliseconds, uuu = microseconds and nnn = nanoseconds. We can take just the muuun part and use it to generate our very own random numbers. I have investigated these digits a bit, and they look and feel pretty random, and I can come up with plenty of formulas to play around with those 5 digits and get new numbers.
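For example, a rough sketch of what I mean, in Python for brevity (assuming Python 3.7+ so time.time_ns() is available):

    import time

    def time_digits():
        # Keep 5 digits of the current time: the last millisecond digit,
        # the three microsecond digits and the first nanosecond digit (muuun).
        return (time.time_ns() // 100) % 100_000

    def rand01():
        # Scale to [0, 1); meant to 'feel' random, not to be random.
        return time_digits() / 100_000.0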
Does anyone see anything wrong with this, or anything that could perform miserably?
Reminder, I'm just looking for an easy way to generate numbers that 'FEEL' randomly distributed and are unpredictable.
Is this an easy and decent way to give players a sense of randomness?
Let us assume, for the sake of argument, that you are making on average one call to your random function every 0.1 milliseconds (since you need it to be fast, you are calling it often, right?) and that each call is equally likely to fall anywhere within that interval. In other words, the uun part is assumed to be completely random, but everything above it only changes slowly from call to call and is thus mostly predictable.
That is 1,000 possible outcomes, or ~10 bits of randomness. There are 1,056,964,608 normal floats between 0 and 1 - not equally distributed, of course. That is roughly 30 bits' worth, about six orders of magnitude more values than you are producing, which sounds like "poor randomness" to me. Similarly, spreading your 10 bits across the 32 bits of an int (no matter how fancy your function) won't actually add any randomness.
Also note that none of this deals with the (very likely) possibility that your calls will be extremely periodic and/or come in short bursts, nor with the fact that your system time function might not have a high enough resolution (or might significantly increase the power consumption of the system). Both of these further reduce the randomness of the obtained time, and the power-consumption side effect can be very undesirable in its own right.
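A quick experiment makes this concrete. Here is a rough Python sketch (assuming Python 3.7+ for time.time_ns(); the exact numbers depend on your clock's real resolution and how fast the loop runs):

    import time

    # Sample the 5 'muuun' time digits in quick succession, as a game loop would.
    samples = [(time.time_ns() // 100) % 100_000 for _ in range(10)]
    print(samples)

    # Consecutive differences expose the structure: successive calls tend to
    # advance by a small, nearly constant step rather than jumping around,
    # which is what makes the second of two quick calls predictable.
    print([(b - a) % 100_000 for a, b in zip(samples, samples[1:])])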
Reminder, I'm just looking for an easy way to generate numbers that 'FEEL' randomly distributed and are unpredictable.
That is extremely unspecific. Humans are terrible at judging randomness and will likely "feel" that an evenly-spread, close-to-uniform sequence is more random than a truly random one - especially when it comes to streaks.
Unpredictability also exists on way too many levels, from "the player can't manually predict what the enemy will do" to "cryptographically secure until the end of time". For example, given the above assumptions, it might be possible to predict the result of the second of two random calls that happen in quick succession with a success rate of anywhere from 0.1% to 100%. How this pattern emerges in a game is hard to tell, but humans are exceedingly good at spotting patterns.
I would like to know if there is any scientific explanation for why word2vec models like CBOW perform poorly on small data. Here's what I tested:
data = [[context1], [context2], [context3], ..., [contextn]]
model = trained word2vec model
model.most_similar('word')
output = [the word is not even in the top 10]
I retrained the model with 10 times the dataset.
model.most_similar(word)
output=[word in the 10 most similar words]
Is there any scientific reason for the improvement in performance as the data size increased, other than the increase in word counts that comes with more data?
Words start Word2Vec training at random positions, and the model can only "deduce" that a pair of words is alike if it sees many instructive examples that, gradually over the course of training, nudge those words into similar positions. It can only arrange many, many words so that all those pairwise similarities are simultaneously upheld if it gets many, many, many varied examples, and thus many, many, many chances to incrementally nudge them all into better places.
If you have a tiny dataset:
(1) There might be no, or very few, examples where the desired-to-be-alike words are in similar nearby-word contexts. With no examples where there are shared nearby-words, there's little basis for nudging the pair to the same place – the internal Word2Vec task, of predicting nearby words, doesn't need them to be near each other, if they're each predicting completely different words. But even with a few examples, there are problems, like...
(2) Words need to move a lot from their original random positions to eventually reach the useful 'constellation' you get from a successful word2vec session. If a word only appears in your dataset 10 times, and you train over the dataset for 10 iterations, that word gets just 100 CBOW nudges. If instead the word appears 1,000 times, then 10 training iterations give it 10,000 CBOW updates – a much better chance to move from an initially-arbitrary location to a useful one. As a result, you can sometimes squeeze slightly better results from smaller datasets by increasing your training iterations, which somewhat simulates a larger corpus (see the sketch after this list). But there's still no real variety in those repeated contexts, so you still have the problem of...
(3) Smaller datasets, with largish models (lots of 'free parameters', like the independent coordinates of every context word and predicted target word), give rise to 'overfitting'. That is, the model can become really good at the training task, essentially by 'memorizing' the idiosyncrasies of the data, without achieving the sort of compression/generalization that's usually wanted for applying the results to novel, varied data. For example, within a tiny corpus, with lots of internal parameters, maybe 'happy' and 'ecstatic' don't need to be near each other - their contexts only overlap a little, and there's lots of 'space' inside the model to put them in totally different places and still do well at predicting their neighboring words. You can sometimes squeeze out whatever generalization is possible by shrinking the model when the dataset is tiny, for example by making the vector-size parameter much smaller; then the model is still forced to make use of whatever (few) context similarities exist.
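As a concrete illustration of points (2) and (3), here is a hypothetical gensim sketch, assuming gensim 4.x parameter names (vector_size, epochs; older releases called these size and iter) and the data list from the question:

    from gensim.models import Word2Vec

    # data is a list of tokenized contexts, e.g. [["the", "cat", "sat"], ...]
    # For a tiny corpus: more passes over the data (more nudges per word) and a
    # much smaller vector size (fewer free parameters, so less room to overfit).
    small_corpus_model = Word2Vec(
        data,
        vector_size=25,   # much smaller than the default of 100
        window=5,
        min_count=2,      # drop words that appear only once
        epochs=50,        # many more passes than the default of 5
        sg=0,             # sg=0 selects CBOW
    )

    print(small_corpus_model.wv.most_similar('word', topn=10))

Whether this helps is corpus-dependent; it only stretches the small-data case a little, for the reasons given below.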
But the essence of Word2Vec is using bulk data, and bulk training, to force the discovery of generalizable patterns – so tricks to stretch small corpora into something that just barely works aren't using it in the way it is strongest.
Question: Which data structure is more efficient for calculating the n most frequent words in a text file, a hash table or a priority queue?
I've previously asked a question related to this subject, but after the creative responses I got confused, and I've decided on two data structures that I can actually implement easily: a hash table vs. a priority queue.
Priority queue confusion: To be honest, I've watched a lecture on YouTube about priority queues and understood every component of them, but when it comes to applicability I get confused. Using a binary heap I can easily implement the priority queue; my challenge is mapping its components onto the frequency problem.
My hash table idea: Since deciding on the hash table's size was a bit uncertain, I went with what made the most sense to me: 26, for the number of letters in the alphabet. In addition, with a good hash function it would be efficient. However, reaching into the linked lists again and again (I use separate chaining for collisions) and incrementing a word's integer count by 1 wouldn't, in my opinion, be efficient.
Sorry for the long post, but as fellow programmers, which one would you recommend? If the priority queue, can you give me ideas for relating it to my problem? If the hash table, could anything be done to make it even more efficient?
A hash table would be the faster of the two choices offered, besides making more sense. Rather than choosing a size of 26: if you have an estimate of the total number of unique words (most people's vocabularies, outside of specialized technical terms, are not much bigger than 10,000; 20,000 is really big, and 30,000 is for people who make a hobby of collecting words), make the size big enough that you don't expect to ever fill it, so that the probability of a collision is low - not more than 25%. If you want to be more conservative, implement a function that rehashes the contents of the table into a table of roughly twice the original size (and make the new size a prime, so it is only approximately twice the original size).
Now since this is tagged C++, you might ask yourself why you aren't just using a multiset straight out of the standard template library. It will keep a count of how many of each word you enter into it.
In either case you'll need to make a separate pass to find which of the words are the n most frequent, as you only have the frequencies, not the rank order of the frequencies.
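To make the two-phase idea concrete, here is a small sketch in Python (the question is tagged C++, but the structure carries over directly: a hash map for the counting pass, then a heap-based selection of the top n; the file name is just a placeholder):

    from collections import Counter
    import heapq
    import re

    def top_n_words(path, n):
        # Pass 1: hash-table counting, word -> frequency.
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(re.findall(r"[a-z']+", line.lower()))

        # Pass 2: heap-based selection of the n most frequent words,
        # O(m log n) for m distinct words instead of fully sorting them.
        return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

    # e.g. print(top_n_words("input.txt", 10))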
Why don't you use a generic/universal string hashing function? After all, you don't want to bucket words by their first letter; you want to count over all possible words. I'd also keep the bucket count dynamic; if not, you will need to do insane amounts of linked-list traversal.
I'd like to ask fellow SO'ers for their opinions on the best-of-breed data structures to use for indexing time series (a.k.a. column-wise, a.k.a. flat linear data).
Two basic types of time series exist, based on their sampling/discretisation characteristics:
Regular discretisation (every sample is taken at a common frequency)
Irregular discretisation (samples are taken at arbitrary time points)
Queries that will be required:
All values in the time range [t0,t1]
All values in the time range [t0,t1] that are greater/less than v0
All values in the time range [t0,t1] that are in the value range [v0,v1]
The data sets consist of summarised time series (which sort of gets around the irregular discretisation) and multivariate time series. The data sets in question are about 15-20 TB in size, hence processing has to be performed in a distributed manner, because some of the queries described above will produce result sets larger than the physical memory available on any one system.
Distributed processing in this context also means dispatching the required data-specific computation along with the time-series query, so that the computation can occur as close to the data as possible and node-to-node communication is reduced (somewhat similar to the map/reduce paradigm) - in short, proximity of computation and data is critical.
Another issue the index should be able to cope with is that the overwhelming majority of the data is static/historic (99.999...%), yet new data is added on a daily basis; think "in-the-field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCH estimates, etc.) with as low a latency as possible; some of these running calculations require historical data, some of which will be more than can reasonably be cached.
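To illustrate what I mean by low-latency updates to running calculations: ideally each daily append only touches a small piece of state rather than re-reading history. A trivial Python sketch of the pattern, using a running mean as a stand-in for the real calculations:

    class RunningMean:
        # Incrementally updated mean: O(1) state and O(1) work per new sample,
        # so daily appends don't require re-reading the historical data.
        def __init__(self):
            self.n = 0
            self.mean = 0.0

        def update(self, x):
            self.n += 1
            self.mean += (x - self.mean) / self.n
            return self.mean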
I've already considered HDF5; it works well/efficiently for smaller datasets, but it starts to drag as the datasets become larger, and there are no native parallel-processing capabilities from the front end.
Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)
You would probably want to use some type of large, balanced tree. As Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU on "cache-oblivious B-trees". For some discussion of how these are implemented, look up Tokutek DB; they have a number of good presentations, like the following:
http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
Questions 2 and 3 are in general a lot harder, since they involve higher-dimensional range searching. The standard data structure for this is the range tree, which gives O(log^{d-1}(n)) query time (with fractional cascading) at the cost of O(n log^{d-1}(n)) storage. You generally would not want to use a k-d tree for something like this. While it is true that k-d trees have optimal O(n) storage costs, you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d = 2, this is O(sqrt(n)) time complexity, and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?).
Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. What you could do is simply use a range query on time to pull the interval of points, then apply the value constraints pointwise as a post-process. This would be the first thing I would try (after getting a good database implementation), and if it works, then you are done! It only really makes sense to optimise the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty, +infty] is orders of magnitude larger than the number of points in [t0, t1] x [v0, v1].
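In code, the "pull the time interval, then filter on value" idea amounts to something like this minimal in-memory Python sketch over one time-sorted series (in practice the first step would be a B-tree/database range scan rather than bisect):

    import bisect

    def query(times, values, t0, t1, v0=float("-inf"), v1=float("inf")):
        # times is sorted ascending; values[i] is the sample taken at times[i].
        # Step 1: range query on time alone (cheap and index-friendly).
        lo = bisect.bisect_left(times, t0)
        hi = bisect.bisect_right(times, t1)
        # Step 2: filter the retrieved points on the value constraint.
        return [(t, v) for t, v in zip(times[lo:hi], values[lo:hi])
                if v0 <= v <= v1]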
General ideas:
Problem 1 is fairly common: create an index that fits into your RAM and has links to the data on secondary storage (data structure: the B-tree family).
Problems 2 and 3 are quite complicated, since your data is so large. You could partition your data into time ranges and precompute the min/max for each range. Using that information you can filter out whole time ranges (e.g. if the max value for a range is 50 and you search for v0 > 60, that interval is out). Whatever remains needs to be searched by going through the data; the effectiveness depends greatly on how quickly the data changes.
You can also build multiple levels of indices, combining the time ranges of lower levels, to make the filtering faster.
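A minimal sketch of the block min/max filtering idea (Python, assuming fixed-width time partitions whose summaries are small enough to keep next to the index):

    def build_block_summaries(times, values, block_size):
        # Precompute (t_start, t_end, v_min, v_max, offset) per fixed-size block.
        blocks = []
        for i in range(0, len(times), block_size):
            ts, vs = times[i:i + block_size], values[i:i + block_size]
            blocks.append((ts[0], ts[-1], min(vs), max(vs), i))
        return blocks

    def candidate_blocks(blocks, t0, t1, v0, v1):
        # Skip whole blocks whose time range or value envelope cannot match;
        # only the survivors need to be scanned point by point.
        for t_start, t_end, v_min, v_max, offset in blocks:
            if t_end < t0 or t_start > t1:
                continue      # entirely outside the time range
            if v_max < v0 or v_min > v1:
                continue      # value envelope rules the whole block out
            yield offset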
It is going to be really time-consuming and complicated to implement this yourself, so I recommend you use Cassandra.
Cassandra can give you horizontal scalability and redundancy, and will allow you to run complicated map/reduce functions in the future.
To learn how to store time series in Cassandra, please take a look at:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
and http://www.youtube.com/watch?v=OzBJrQZjge0.
We have an application which is written entirely in C. For table access inside the code, such as fetching some values from a table, we use Pro*C, and to increase the performance of the application we also preload some tables used for fetching data. In general, we take some input fields and fetch the output fields from the table.
We usually have around 30,000 entries in the table, and at most it sometimes reaches 0.1 million (100,000).
But if the number of table entries increases to around 10 million, I think it will seriously hurt the performance of the application.
Am I wrong somewhere? If it really affects the performance, is there any way to keep the performance of the application stable?
What is the possible workaround if the number of rows in the table increases to 10 million considering the way the application works with tables?
If you are not sorting the table, you'll get a proportional increase in search time... assuming you are iterating over the table incrementally (i++ style) and nothing else is coded wrong, in your example (30K vs. 10M entries) you'll get roughly 333x longer search times.
However, if it's somehow possible to keep the table sorted, then you can greatly reduce search times. That's because a search over sorted information doesn't have to examine every element until it reaches the sought one: it can use auxiliary structures (trees, hashes, etc.), which are usually much faster to search, to pinpoint the sought element, or at least get a much closer estimate of where it sits in the master table.
Of course, that will come at the expense of having to sort the table, either when you insert or remove elements from it, or when you perform a search.
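The difference in miniature, as a Python sketch (the same reasoning applies to an in-memory C table preloaded via Pro*C):

    import bisect

    def linear_lookup(rows, key):
        # Unsorted table: scans about half the rows on average; cost grows
        # linearly with the number of entries.
        for k, v in rows:
            if k == key:
                return v
        return None

    def sorted_lookup(keys, values, key):
        # Sorted table: binary search touches about log2(n) entries, so going
        # from 30,000 to 10,000,000 rows is ~15 -> ~24 comparisons, not 333x more.
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        return None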
Maybe you can take a look at 'google hash' and their implementation? Although it is in C++.
It might be that you are getting too many cache misses once the data grows beyond 1 MB, or whatever your cache size is.
If you iterate over the table multiple times, or access its elements randomly, you can also hit a lot of cache misses.
http://en.wikipedia.org/wiki/CPU_cache#Cache_Misses
Well, it really depends on what you are doing with the data. If you have to load the whole kit and caboodle into memory, then a reasonable approach would be to use a large bulk size, so that the number of Oracle round trips that need to occur is small.
If you don't have the memory resources to hold the whole result set, a large bulk size will still help with the Oracle overhead: get a reasonably sized chunk of records into memory, process them, then get the next chunk.
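A rough illustration of that chunked pattern, shown with Python's DB-API purely for shape (in Pro*C the equivalent would be fetching into host arrays; the table and column names here are hypothetical):

    CHUNK = 5000   # rows fetched per round trip

    def process_in_chunks(cursor, process_row):
        cursor.arraysize = CHUNK
        cursor.execute("SELECT in_key, out_value FROM lookup_table")
        while True:
            rows = cursor.fetchmany(CHUNK)
            if not rows:
                break            # no more data
            for row in rows:     # only one chunk is in memory at a time
                process_row(row)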
Without more information about your actual run time environment, and business goals, that is about as specific as anyone can get.
Can you tell us more about the issue?