Preprocessing in Recommender systems with apriori/fpgrowth algorithms - data-mining

I am trying to apply the apriori and fpgrowth algorithms to some characterisation data that I have. The data are already binarised and consist of 1's (passes), 0's (fails) and Null values.
I want to check whether my preprocessing pipeline would be good enough in practice. I have already removed rows/columns from the dataset whose ENTIRE contents are Null values, and I am still left with some Null values.
I was thinking of applying categorical PCA to reduce the size of the dataset even further, but I don't believe that would be good practice, as it requires imputing the missing values with something else, and I don't want that because it would affect the final results.
So, what I am actually doing to address the Null values is to fill them with a 0. I do this because the algorithms above measure the frequency of items that appear in the database, and I guess the 1's are the datapoints that contribute to that frequency; hence the rest should be 0.
But I am still not sure whether that is good practice, because it looks like I am filling the Null values with a 0 (failure) as if the value had actually been measured.
Any help on whether I am tackling this problem correctly, or whether I should try something else, would be very much appreciated. :)
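To make my reasoning concrete: as far as I understand, the miners only look at which items are present in each transaction, so if only the 1's become items, a Null cell and a 0 cell end up identical for the support counts. A rough sketch of that conversion in C++ (any language works; the cell representation and the toTransactions helper are made up for illustration):

#include <cstddef>
#include <iostream>
#include <optional>
#include <vector>

// Hypothetical cell type: 1 = pass, 0 = fail, nullopt = not measured.
typedef std::optional<int> Cell;
typedef std::vector<Cell>  Row;

// Only cells equal to 1 become items; 0 and Null both end up as "item absent",
// which is what filling Null with 0 amounts to for support counting.
std::vector<std::vector<std::size_t> > toTransactions(const std::vector<Row>& data)
{
    std::vector<std::vector<std::size_t> > transactions;
    for (const Row& row : data) {
        std::vector<std::size_t> items;
        for (std::size_t col = 0; col < row.size(); ++col)
            if (row[col].has_value() && *row[col] == 1)
                items.push_back(col);          // the column index acts as the item id
        transactions.push_back(items);
    }
    return transactions;
}

int main()
{
    std::vector<Row> data = { { 1, 0, std::nullopt },   // Null and 0 are treated the same
                              { 1, 1, 1 } };
    for (const std::vector<std::size_t>& t : toTransactions(data)) {
        for (std::size_t item : t) std::cout << item << ' ';
        std::cout << '\n';
    }
    return 0;
}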

Related

Compressing an array of two values

What is the cheapest way to compress an array into a byte[] array, given that it contains only two possible values?
Array length has no limit.
The best idea I have seen so far is to store the number of times each value repeats, so for example the array "11111001" is compressed to "521" (five 1's, two 0's, one 1).
I wonder if there is a better way.
Thanks.
First off, convert to a bit array. Right off the bat it will take 1/8th the space.
Then you'd need to examine your data to see if there is any apparent redundancy. If not, you're done. If there is redundancy, you'd need to figure out a way to model it, and then compress it. Run-length coding as you propose is useful only if long runs of zeros and/or ones are common in the data.
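For concreteness, here is a minimal sketch of the bit-packing step, assuming the input is a 0/1 byte array (the names are made up):

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a 0/1 byte array into bits: 8 input values per output byte.
std::vector<std::uint8_t> packBits(const std::vector<std::uint8_t>& values)
{
    std::vector<std::uint8_t> packed((values.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < values.size(); ++i)
        if (values[i])
            packed[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    return packed;
}

// To recover value i later: (packed[i / 8] >> (i % 8)) & 1.
// You would also store the original length, since the packed size alone loses it.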

What's the most efficient way to store a subset of column indices of a big matrix in C++?

I am working with a very big matrix X (say, 1,000-by-1,000,000). My algorithm goes as follows:
1. Scan the columns of X one by one and, based on some filtering rules, identify only the subset of columns that are needed. Denote the subset of column indices by S. Its size depends on the filter, so it is unknown before the computation and will change if the filtering rules change.
2. Loop over S and do some computation with column x_i if i is in S. This step needs to be parallelized with OpenMP.
3. Repeat 1 and 2 a hundred times with changed filtering rules, defined by a parameter.
I am wondering what the best way is to implement this procedure in C++. Here are two ways I can think of:
(a) Use a 0-1 array (of length 1,000,000) to indicate the needed columns in Step 1 above; then in Step 2 loop over 1 to 1,000,000 and use an if-else check on the indicator, doing the computation only when the indicator is 1 for that column;
(b) Use an std::vector for S and push_back the column index whenever a column is identified as needed; then loop only over S, each time extracting a column index from S and doing the computation. (I thought about using this approach, but I've heard push_back is expensive even when just storing integers.)
Since my algorithm is very time-consuming, I assume even a small time saving in this basic step would mean a lot overall. So my question is: should I try (a), (b), or some even better way to get better performance (and to work well with OpenMP)?
Any suggestions/comments for achieving a better speedup are very much appreciated. Thank you very much!
To me, it seems that step #1 really does not matter much: at the end of the day, you're going to wind up with a set of columns, however it is represented.
What's really going to matter is what happens when you unleash the parallelized step #2.
An array of ones and zeros, however large, should be fairly simple for parallelization, while a more 'advanced' data structure might well, in this case, just get in the way.
A million bits (about 125 KB), these days? Sure. Done. No problem. (And if not, a simple array of bit-sets.) However many simultaneously executing entities there are, they should be able to navigate such a data structure in parallel with a minimum of conflict. Therefore, to my gut, big bit-sets win.
I think you will find std::vector easier to use. Regarding push_back, the cost is incurred when the vector reallocates (and copies) its data. To avoid that (if it matters), you can reserve a capacity of 1,000,000 up front via vector::reserve. Your vector is then about 8 MB, insignificant compared to your problem size, and only an order of magnitude or two bigger than a bitmap would be, while being a lot simpler to deal with: if we call your vector S, then the i-th interesting column is just x[S[i]].
(Based on my gut feeling) I'd probably go for pushing back into a vector, but the answer is quite simple: measure both methods (they are both trivial to implement). Most likely you won't see a noticeable difference.
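A sketch of option (b) with the reallocation worry handled up front (passesFilter and processColumn are placeholders):

#include <cstddef>
#include <vector>

bool passesFilter(std::size_t j) { return j % 2 == 0; }   // stand-in filtering rule

void runOnePass()
{
    const std::size_t numCols = 1000000;

    // Step 1: collect the indices of the needed columns.
    std::vector<std::size_t> S;
    S.reserve(numCols);                       // push_back will never reallocate
    for (std::size_t j = 0; j < numCols; ++j)
        if (passesFilter(j))
            S.push_back(j);

    // Step 2: loop only over the selected columns, in parallel.
    #pragma omp parallel for schedule(static)
    for (long long k = 0; k < static_cast<long long>(S.size()); ++k)
    {
        // processColumn(X, S[k]);            // placeholder per-column computation
    }
}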

"Tricks" to filling a "rolling" time-series array absent brute-force pushback of all values each iteration

My application is financial, in C++, with Visual Studio 2003.
I'm trying to maintain an array of the last (x) values for an observation, and as each new value arrives I currently have a loop that first pushes all of the other values back and then adds the new value at the front.
It's computationally intensive, and I've been trying to be clever and come up with some way around this. I've probably either stated an oxymoronic problem or reached the limit of my intellect, or both.
Here's an idea I have:
Suppose it's 60 seconds of data, and new data arrive each second. Suppose we keep an integer between 0 and 59 that serves to index an element of the array. Each second, when the data arrive, we first increment the integer (wrapping from 59 back to 0) and then overwrite the element of the array at that index with the new data. Then, in our calculations, we refer to the same integer as the base and work backwards from it, wrapping around the end of the array when we reach index 0. The formulas in the math would be a bit more tedious to write. But my application does a lot of these push-back/fills of arrays, each second, for several data points, with each array holding 3600 elements per data series (one hour of seconds).
Does the board think this is a good idea? Or am I being silly here?
What you're describing is nothing more than a circular buffer. There's an implementation in Boost (boost::circular_buffer), probably in other libraries as well, and a good description of the algorithm on Wikipedia (http://en.wikipedia.org/wiki/Circular_buffer). And yes, it's a good solution for the problem you describe.
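A bare-bones sketch of that idea (the class and member names are made up; the capacity would be 3600 in your case):

#include <cstddef>
#include <vector>

// Minimal fixed-size ring buffer holding the last N observations.
class RollingWindow
{
public:
    explicit RollingWindow(std::size_t capacity)
        : data_(capacity, 0.0), head_(0), count_(0) {}

    void push(double value)                    // O(1): overwrite the oldest slot
    {
        data_[head_] = value;
        head_ = (head_ + 1) % data_.size();
        if (count_ < data_.size()) ++count_;
    }

    // at(0) is the newest value, at(1) the one before it, and so on (ago < capacity).
    double at(std::size_t ago) const
    {
        return data_[(head_ + data_.size() - 1 - ago) % data_.size()];
    }

    std::size_t size() const { return count_; }

private:
    std::vector<double> data_;
    std::size_t head_;                         // slot the next push will overwrite
    std::size_t count_;
};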
You could use modulo as you suggested (hint: x % y is the syntax for x modulo y), or you could maintain two buffers and swap which one holds the current data to be read and which holds the stale data to be overwritten. For copying large quantities of plain-old-data (POD), you should take a look at memcpy. Of course, in quantitative trading, people will do all sorts of things to get a speed edge, including custom hardware that allows one to bypass several layers of copying.
Are you sure you are talking about arrays? They don't have a "push" operation - it sounds more like an std::vector.
Anyway, here is a solution for what I think you want:
If I understood it right, you want a collection of 3600 elements where each second the oldest element drops off and a new element is added.
So you should use a linked-list queue for that task; its operations are performed in O(1).
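If you go the queue route, the pattern looks roughly like this (std::deque stands in here for a hand-rolled linked-list queue; the push/pop idea is the same):

#include <cstddef>
#include <deque>

// Keep only the most recent `capacity` observations.
void pushObservation(std::deque<double>& window, double value, std::size_t capacity)
{
    window.push_back(value);           // newest at the back
    if (window.size() > capacity)
        window.pop_front();            // drop the oldest; amortised O(1)
}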

How to handle very large matrices (e.g. 1000000 by 1000000)

My question is very general, and it's not a duplicate.
When we declare something like int mat[1000000][1000000];
it is certain to give an error saying the matrix size is too large.
I have seen many problems on competitive programming websites where we need to declare a 2D matrix with 10^6 rows and columns, and I know there is always some trick associated with it to reduce the matrix size.
So I just want to ask: what are the possible ways or tricks we can use in such cases to minimize the size? I mean, which types of algorithms are generally required to solve it, like DP or something else?
In DP, if the current row depends only on the previous row, you can use int mat[2][1000000];. After calculating the current row, you can immediately discard the previous row and swap current and previous.
Sometimes it is possible to use std::map instead of a 2D array.
I have encountered many such questions in programming contests, and the solution differs from case to case, so if you mention a specific case I can possibly give you a better targeted solution.
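To make the two-row trick concrete, here is a small self-contained example; lattice-path counting is just a stand-in recurrence in which row r depends only on row r-1:

#include <vector>

// Count monotone lattice paths in an R-by-C grid using only two DP rows.
long long countPaths(int R, int C)
{
    std::vector<std::vector<long long> > dp(2, std::vector<long long>(C, 1));
    for (int r = 1; r < R; ++r)
    {
        int cur = r % 2, prev = 1 - cur;
        dp[cur][0] = 1;
        for (int c = 1; c < C; ++c)
            dp[cur][c] = dp[prev][c] + dp[cur][c - 1];   // the previous row is all we need
    }
    return dp[(R - 1) % 2][C - 1];
}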
That depends very much on the specific task. There is no universal "trick" that will always work. You'll have to look for something in the particular problem that allows you to solve it in a different way.
That said, if I could really see no other way, I'd start thinking about how many elements of that matrix will really be non-zero (perhaps I can use a sparse array or a map (dictionary) instead). Or maybe I don't need to store all the elements in memory, but can instead re-calculate them every time I need them.
At any rate, a matrix that large (or any kind of workaround representation of it) will NOT be useful. Not just because you don't have enough memory, but also because filling such an array with data will take anywhere from several hours to many months. That should be your first concern: figuring out how to solve the task with less data and computation. Once you figure that out, you'll also see which data structure is appropriate.
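If the matrix really is mostly zeros, a sparse representation along these lines is often enough (the int value type and the zero default are assumptions):

#include <map>
#include <utility>

// Store only the non-zero cells of a conceptually 10^6-by-10^6 matrix.
typedef std::map<std::pair<int, int>, int> SparseMatrix;

int getCell(const SparseMatrix& m, int row, int col)
{
    SparseMatrix::const_iterator it = m.find(std::make_pair(row, col));
    return it == m.end() ? 0 : it->second;    // absent cells are treated as 0
}

void setCell(SparseMatrix& m, int row, int col, int value)
{
    if (value == 0)
        m.erase(std::make_pair(row, col));    // keep the map truly sparse
    else
        m[std::make_pair(row, col)] = value;
}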

C++ Complicated look-up table

I have around 400,000 "items".
Each "item" consists of 16 double values.
At runtime I need to compare items with each other. Therefore I am multiplying their double values. This is quite time-consuming.
I have made some tests, and I found out that there are only 40,000 possible return values, no matter which items I compare with each other.
I would like to store these values in a look-up table so that I can easily retrieve them without doing any real calculation at runtime.
My question would be how to efficiently store the data in a look-up table.
The problem is that if I create a look-up table, it gets amazingly huge, for example like this:
item-id   item-id   compare return value
1         1         499483.49834
1         2         -0.0928
1         3         499483.49834
(...)
It would sum up to around 120 million combinations.
That just looks too big for a real-world application.
But I am not sure how to avoid that.
Can anybody please share some cool ideas?
Thank you very much!
Assuming I understand you correctly, you have two inputs with 400K possibilities each, so 400K * 400K = 160B entries. Assuming you indexed them sequentially and stored your 40K possible results in a way that allowed 2 octets each, you're looking at a table size of roughly 300 GB, which is pretty clearly beyond everyday computing. So you might instead research whether there is any correlation between the 400K "items", and if so, whether you can assign some kind of function to that correlation that gives you a clue (read: hash function) as to which of the 40K results might/could/should come out. Clearly your hash function and lookup need to be faster than just doing the multiplication in the first place. Or maybe you can reduce the comparison time with some kind of intelligent reduction, like knowing the result under certain scenarios. Or perhaps some of your math can be optimized using integer math or boolean comparisons. Just a few thoughts...
To speed things up, you should probably compute all of the possible answers ahead of time and store the inputs that lead to each answer.
Then I would recommend making some sort of look-up table that uses the answer as the key (since the answers will all be unique) and stores, for each key, all of the possible inputs that produce that result.
To help visualize:
Say you have a table 'Table'. Inside Table you have keys, and associated with those keys are values. You make the keys have the type of whatever format your answers are in (the keys will be all of your answers). Now give each of your 400k inputs a unique identifier. You then store the pair of identifiers for a multiplication as one value associated with the key for its answer. When you compute that same answer again, you just add the new pair as another set of inputs that can produce that key.
Example:
// For illustration, assume IDType identifies an item and AnswerType holds a comparison result:
#include <map>
#include <vector>
typedef int    IDType;     // placeholder item-identifier type
typedef double AnswerType; // placeholder result type

Define Input like:
struct Input { IDType one; IDType two; };
where one 'Input' might hold the IDs 12384 and 128, meaning that the objects identified by 12384 and 128, when multiplied, give that answer.
The table itself is then (any associative container keyed by the answer works):
std::map<AnswerType, std::vector<Input> > Table;
So, in your lookup, you'll have something that looks like:
bool Contains(const std::vector<Input>& inputs, IDType first, IDType second);  // defined below

AnswerType lookup(IDType first, IDType second)
{
    // Walk every answer (key) and ask whether this pair of inputs produces it.
    for (const auto& entry : Table)
    {
        if (Contains(entry.second, first, second))
            return entry.first;              // entry.first is the answer (key)
    }
    return AnswerType();                     // pair not found; handle as appropriate
}

// Defined elsewhere
bool Contains(const std::vector<Input>& inputs, IDType first, IDType second)
{
    // The pair may have been stored in either order.
    for (const Input& i : inputs)
    {
        if ((i.one == first && i.two == second) ||
            (i.two == first && i.one == second))
            return true;
    }
    return false;
}
This is only a rough cut as-is, but it might be a place to start.
While the outer loop is still a linear search over the keys, you can make the 'Contains' method run a binary search by sorting how the inputs are stored (see the sketch below).
In all, you're looking at a run-once precomputation that will take O(n^2) time, and a lookup that will run in roughly n log(n). I'm not entirely sure how the memory will look after all of that, though. Of course, I don't know much about the math behind it, so you might be able to speed up the linear search if you can somehow sort the keys as well.
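As a follow-up to the sorting idea: if each stored pair is normalised so the smaller ID comes first and each vector is kept sorted, 'Contains' can use std::binary_search. A rough sketch, reusing the Input and IDType definitions above:

#include <algorithm>
#include <vector>

// Strict weak ordering over normalised (one <= two) pairs.
bool inputLess(const Input& a, const Input& b)
{
    return a.one != b.one ? a.one < b.one : a.two < b.two;
}

// Assumes every stored Input already has one <= two and the vector was
// sorted with std::sort(inputs.begin(), inputs.end(), inputLess).
bool containsSorted(const std::vector<Input>& inputs, IDType first, IDType second)
{
    Input key = { std::min(first, second), std::max(first, second) };
    return std::binary_search(inputs.begin(), inputs.end(), key, inputLess);
}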