I am generating a bunch of hash values for the same key using MurmurHash, like the following. It outputs 50 different hash values for 50 different seeds.
for(int seed=0; seed<50; seed++)
MurmurHash3_x86_32(key, strlen(key), seed, &hash);
But this is not very time efficient when I have a large number of keys, e.g. 10 million keys. Is there any other way to make it faster?
It's been a while, but I saw this and figured I would contribute an answer for anyone else who needs it.
NOTE -- I can't remember where I read this, or who to attribute the formula to. If someone knows, please let me know so I can properly attribute this, and so that others can verify the correctness of this solution.
You can generate n independent hashes by combining 2 independent hashes:
for (i = 0; i < n; i++) {
    hash[i] = (hash1 + i * hash2) % max_hash_size;
}
Addition, multiplication, and taking the modulus should be faster than generating an entirely new hash.
I use MurmurHash with two different seeds, and then combine them to generate the extra hash values that I need.
I wish I could remember where I read this....
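As a concrete sketch of that idea in C++ (the header name and the multiHash/num_hashes names are just illustrative; MurmurHash3_x86_32 is the same function used in the question), the key only has to be hashed twice, no matter how many values you need:

#include <cstdint>
#include <cstring>
#include <vector>
#include "MurmurHash3.h"   // MurmurHash3 reference implementation (smhasher)

// Generates num_hashes values for one key from just two MurmurHash3 calls,
// using h_i(key) = (hash1 + i * hash2) mod max_hash_size.
std::vector<uint32_t> multiHash(const char* key, int num_hashes, uint32_t max_hash_size) {
    uint32_t hash1, hash2;
    MurmurHash3_x86_32(key, std::strlen(key), /*seed=*/0, &hash1);
    MurmurHash3_x86_32(key, std::strlen(key), /*seed=*/1, &hash2);

    std::vector<uint32_t> hashes(num_hashes);
    for (int i = 0; i < num_hashes; ++i)
        hashes[i] = (hash1 + static_cast<uint32_t>(i) * hash2) % max_hash_size;
    return hashes;
}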
I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and ordering them according to the sum of all diffs with another list:
def diff(l1, l2):
return sum([abs(v1-v2) for v1, v2 in zip(l1, l2)])
list2 = [0.3, 0, 1, 0.5]
entries = list(Model.objects.all())
entries.sort(key=lambda t: diff(t.values, list2))
This works fast if my number of entries is small. But I'm afraid that with a large number of entries, the comparison and sorting of all the entries will get slow, since they all have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over the list about 4 times!
Although this approach looks pretty, it's not good.
One thing that you can do is:
Have a variable called last_diff and set it to 0.
Iterate through all entries.
Iterate through each entry.values:
    From i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i]).
    Sum these values in a variable called new_diff.
    If new_diff > last_diff, break from the inner loop and push the entry into its right place (it's called Insertion Sort, check it out!).
This way, in the average scenario, the time complexity is much lower than what you are doing now!
And maybe you need to be creative too. I'm going to share some ideas; check them for yourself to make sure that they are fine.
Assuming that:
values list elements are always positive floats.
list2 is always the same for all entries.
then you may be able to say: the bigger the sum over the elements in values, the bigger the diff value is going to be, no matter what the elements in list2 are.
Then you might be able to just forget about the whole diff function. (Test this!)
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to actually write a query for that in almost pure Django:
class Unnest(models.Func):
function = 'UNNEST'
class Abs(models.Func):
function = 'ABS'
class SubquerySum(models.Subquery):
template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'
x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
pairdiff=Abs(Unnest('values')-Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The unnest function turns each element of values into a row. In this case it happens twice, but the two resulting columns are immediately subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds: the column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as with zip, i.e. when one array is longer than the other, the remainder is ignored.
Further improvements
While it will already be a lot faster when you don't have to retrieve all rows and loop over them in Python, it still results in a full table scan: all rows have to be processed, every single time. You can do better, though. Have a look into the cube extension. Use it to calculate the L1 distance (at least, that seems to be what you're calculating) directly with the <#> operator. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change the type from float[] to cube. In the latter case, you might have to implement your own CubeField too; I haven't found any package yet that provides it. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed and hence blazing fast.
Using the ContentSearch Linq API in Sitecore 7, how might I go about efficiently taking a random selection of, say, 3 search results from around 1500 potential results?
So far I'm considering using the API to return an entire list of IDs (seeing as 1500 results isn't that large), and then doing the rest in code.
Can somebody point me in the right direction of what I'd need to do to be able to achieve this directly from Lucene?
If you're dealing with a smaller subset of items, it might be easiest for you to randomly shuffle the resultset of SkinnyItems using Fisher-Yates or any other shuffling algorithm.
To shuffle an array a of n elements (indices 0..n-1):
for i from n − 1 downto 1 do
j ← random integer with 0 ≤ j ≤ i
exchange a[j] and a[i]
Source
I'm not too familiar with Sitecore 7 yet, so if there's an easier way to do it I hope someone can provide it.
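For reference, here is a minimal sketch of that partial shuffle; it's shown in C++ purely to illustrate the algorithm (the Sitecore code would be C#), and the pickRandom name and the 1500 placeholder IDs are just for the example. Only the first 3 swaps of the Fisher-Yates shuffle are needed when you only want 3 results:

#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Picks 'count' random elements by running only the first 'count' steps of
// a Fisher-Yates shuffle, swapping each chosen element to the back.
template <typename T>
std::vector<T> pickRandom(std::vector<T> items, std::size_t count) {
    std::mt19937 rng(std::random_device{}());
    std::vector<T> picked;
    for (std::size_t i = items.size(); i > 0 && picked.size() < count; --i) {
        std::uniform_int_distribution<std::size_t> dist(0, i - 1);
        std::swap(items[dist(rng)], items[i - 1]);
        picked.push_back(items[i - 1]);
    }
    return picked;
}

int main() {
    std::vector<int> ids(1500);
    for (int i = 0; i < 1500; ++i) ids[i] = i;      // stand-in for the ~1500 result IDs
    for (int id : pickRandom(ids, 3)) std::cout << id << '\n';
}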
You can try the custom sort option as described here: Lucene 2.9.2: How to show results in random order?
This, however, did not perform any better than randomizing all the results, in our experience...
For that there are several options: Linq to Entities, random order.
Stevie, have a read of this question and answer which might give you some inspiration as to how to go about it.
I'd also recommend reading this article on the Sitecore Community, as suggested by Stephen Pope.
I have some constraints of the form Sizes[i1] + Sizes[i2] + Sizes[i3]<=1, which I add by
model.add(Sizes[i1] + Sizes[i2] + Sizes[i3]<=1)
for some specific indices i1, i2, i3. Later I want to add, for all other index combinations, the constraints
model.add(Sizes[k1] + Sizes[k2] + Sizes[k3]>1)
Is there some nice way to do this, e.g. to check if the constraint already exists in the model?
Maybe I can store the handle which is returned by the IloModel::add function (e.g. as an IloExtractableArray or even an IloConstraintArray?), but even then I don't know how to check if the constraint already exists.
Thank you
I don't think that there is really an easy way to get this back from a CPLEX model. I have previously had to do something similar in several projects, so I give my two suggestions below.
(1) If you know you always have the same number of things in each constraint, then you can create a structure to hold that information, like:
struct IndexTriple {   // a plain C++ struct; 'tuple' would clash with std::tuple
    int index1;
    int index2;
    int index3;
};
and then you can just create one for each constraint you add, and keep them in a list or array or similar.
(2) If you know about the possible values of the indices, then maybe you can create a hash-code or similar from the indices. If done right, this can also solve the issue of symmetry due to permuting the indices - (Sizes[a] + Sizes[b] + Sizes[c]) is the same as (Sizes[b] + Sizes[a] + Sizes[c]).
Then as above, you can keep the hash codes in a list or array for the constraints that you added.
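A minimal sketch of suggestion (2), assuming plain int indices; sorting the triple makes it a canonical key, which handles the permutation symmetry, and a std::set of these keys records what has been added (the makeKey/needsConstraint names are just illustrative):

#include <algorithm>
#include <array>
#include <set>

// Canonical key for an index triple: sorting makes (a,b,c) and (b,a,c) identical.
std::array<int, 3> makeKey(int i1, int i2, int i3) {
    std::array<int, 3> key{i1, i2, i3};
    std::sort(key.begin(), key.end());
    return key;
}

std::set<std::array<int, 3>> addedConstraints;

// Returns true if no constraint on these three indices has been recorded yet,
// and records it; call this right before model.add(...).
bool needsConstraint(int i1, int i2, int i3) {
    return addedConstraints.insert(makeKey(i1, i2, i3)).second;
}

The lookup is O(log n) per triple, and later you can enumerate all index combinations and add the "> 1" constraint only for the triples not present in the set.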
I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence of all. The log files are too large to be held in main memory at once.
Sample log file:
User ID Page ID
A 1
A 2
A 3
B 2
B 3
C 1
B 4
A 4
Corresponding results:
A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence
My idea is to use two hash tables. The first hashes on user ID and stores its sequence; the second hashes three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
However, since I have to use two hash tables, memory cannot hold everything at once, and I have to use disk. Accessing the disk that often is not efficient.
How can I do this better?
If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop least recently used entries.
If you want an exact result, use an external sort procedure to sort the logs by user ID, then combine every 3 consecutive entries per user and sort again, this time by page IDs.
Update (sort by timestamp)
Some preprocessing may be needed to properly use the log files' timestamps:
If the log files are already sorted by timestamp, no preprocessing is needed.
If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all these files and merge them as you read (a merge sort over the files).
If the files are almost sorted by timestamp (as when several independent processes write logs to a single file), use a binary heap to get the data into the correct order.
If the files are not sorted by timestamp (which is not likely in practice), use an external sort by timestamp.
Update 2 (improving the approximate method)
The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may have different patterns at different times of day, or may be different on weekends. The original approach may give poor results for such data. To improve this, a hierarchical LRU queue may be used.
Partition the LRU queue into log(N) smaller queues, with sizes N/2, N/4, ... The largest one may contain any element, the next one only elements seen at least 2 times, the next one only elements seen at least 4 times, and so on. When an element is evicted from a sub-queue, it is added to the next one down, so it lives in all sub-queues lower in the hierarchy before it is completely removed. Such a priority structure is still O(1) per update, but allows a much better approximation of the most popular pages.
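For the basic (non-hierarchical) version, here is a minimal sketch of such a bounded LRU counter in C++, assuming the page triples are encoded as strings (the class and method names are illustrative); the hierarchical variant would chain several of these with different capacities:

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Approximate counter: keeps at most 'capacity' keys and evicts the least
// recently seen key when a new one would exceed the limit.
class LruCounter {
public:
    explicit LruCounter(std::size_t capacity) : capacity_(capacity) {}

    void add(const std::string& key) {
        auto it = entries_.find(key);
        if (it != entries_.end()) {
            ++it->second.count;
            order_.splice(order_.begin(), order_, it->second.pos);  // mark as recently used
            return;
        }
        if (entries_.size() >= capacity_) {          // full: drop the least recently used key
            entries_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(key);
        entries_[key] = Entry{1, order_.begin()};
    }

private:
    struct Entry {
        long count;
        std::list<std::string>::iterator pos;
    };
    std::size_t capacity_;
    std::list<std::string> order_;                   // most recently used key at the front
    std::unordered_map<std::string, Entry> entries_;
};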
There are probably syntax errors galore here, but this should take a limited amount of RAM for a virtually unlimited-length log file.
#include <array>
#include <fstream>
#include <iostream>
#include <limits>
#include <map>
#include <unordered_map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence;   // [0] = newest page, [2] = oldest
typedef long sequence_count;

const int num_pages = 1000;   // where 1-1000 inclusive are valid pageids
const int num_passes = 4;

int main() {
    std::unordered_map<userid, sequence> userhistory;
    std::map<sequence, sequence_count> visits;   // std::map, since std::array has no default hash
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) {    // have to go in four passes
        std::ifstream logfile("log.log");
        userhistory.clear();   // forget stale histories from the previous pass
        visits.clear();        // each sequence is counted entirely within one pass
        pageid minpage = num_pages / num_passes * pass;   // where the oldest page of the triple must fall
        pageid maxpage = num_pages / num_passes * (pass + 1) + 1;
        if (pass == num_passes - 1)   // if it's the last pass, fix rounding errors
            maxpage = std::numeric_limits<pageid>::max();
        while (logfile >> curuser >> curpage) {            // read in a line
            sequence& curhistory = userhistory[curuser];   // find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage;                       // push back the new page for that user
            // only count triples whose oldest page falls in this pass's range;
            // this also skips incomplete triples, since empty slots are 0
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count& count = visits[curhistory];   // times this sequence was hit
                ++count;                                      // and increase it
                if (count > max_count) {       // if that's a new max
                    max_count = count;         // update the max
                    max_sequence = curhistory; // std::array copies element-wise
                }
            }
        }
    }
    std::cout << "The sequence visited the most is:\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
}
Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.
[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.
If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.
On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.
Clearly the sequence counters dominate the memory problem.
If you do only have 1000 pages then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to always have the most used cases cached in memory.
So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.
I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having got that, you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if it's likely you rarely get there. Here are some ideas:
Evict the least commonly used sequences to file, keeping the most common in memory. You'd need two passes over the table: one to determine what level is below average, and another to do the eviction. Somehow you'd need to know where you put each entry whenever you get a hash miss, which might prove tricky.
Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.
Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
I think you only have to store the most recently seen triple for each userid, right?
So you have two hash tables. The first has a key of userid and a value of the most recently seen triple, so its size is equal to the number of userids.
EDIT: this assumes the file is already sorted by timestamp.
The second hash table has a key of userid:page-triple, and a value of the number of times it was seen.
I know you said C++, but here's some awk which does this in a single pass (it should be pretty straightforward to convert to C++):
# $1 is userid, $2 is pageid
{
old = ids[$1]; # map with id, most-recently-seen triple
split(old,oldarr,"-");
oldarr[1]=oldarr[2];
oldarr[2]=oldarr[3];
oldarr[3] = $2;
ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
tripleid = $1":"ids[$1]; # build a triple-id of userid:triple
if (oldarr[1] != "") { # don't accumulate incomplete triples
triples[tripleid]++; } # count this triple-id
}
END {
  MAXCOUNT = 0;
  for (tid in triples) {
    print tid" "triples[tid];
    if (triples[tid] > MAXCOUNT) { MAXCOUNT = triples[tid]; MAXID = tid; }
  }
  print "MAX is->" MAXID" seen "MAXCOUNT" times";
}
If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:
1. sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column).
2. Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script (see the sketch below). So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
3. sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
This approach might not have optimal performance, but it shouldn't run out of memory.
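For step 2, a minimal C++ sketch; reading from stdin and writing to stdout is an assumption (so it can be dropped into a pipeline), and the input is expected to be "userid pageid" per line, already grouped by user:

#include <iostream>
#include <string>

// Reads the user-sorted log from stdin and writes one "p1-p2-p3" triplet per
// line to stdout. Because the input is grouped by user, only the current
// user's last two pages need to be remembered.
int main() {
    std::string user, prevuser;
    long page, p1 = 0, p2 = 0;   // the two previous pages of the current user
    int seen = 0;                // how many pages of the current user we've seen
    while (std::cin >> user >> page) {
        if (user != prevuser) {  // new user: restart the sliding window
            prevuser = user;
            seen = 0;
        }
        if (++seen >= 3)
            std::cout << p1 << '-' << p2 << '-' << page << '\n';
        p1 = p2;
        p2 = page;
    }
}

It would sit in the middle of the pipeline, e.g. sort -k1,1 -s logfile | ./triplets > triplets, before the final sort | uniq -c step.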
Here is a recursive function that I'm trying to create that finds all the subsets of an STL set passed in. The two params are an STL set to search for subsets, and a number i >= 0 which specifies how big the subsets should be. If the integer is bigger than the set, return an empty subset.
I don't think I'm doing this correctly. Sometimes it's right, sometimes it's not. The STL set gets passed in fine.
list<set<int> > findSub(set<int>& inset, int i)
{
list<set<int> > the_list;
list<set<int> >::iterator el = the_list.begin();
if(inset.size()>i)
{
set<int> tmp_set;
for(int j(0); j<=i;j++)
{
set<int>::iterator first = inset.begin();
tmp_set.insert(*(first));
the_list.push_back(tmp_set);
inset.erase(first);
}
the_list.splice(el,findSub(inset,i));
}
return the_list;
}
From what I understand, you are actually trying to generate all subsets of 'i' elements from a given set, right?
Modifying the input set is going to get you into trouble, you'd be better off not modifying it.
I think that the idea is simple enough, though I would say that you got it backwards. Since it looks like homework, I won't give you a C++ algorithm ;)
generate_subsets(set, sizeOfSubsets) # I assume sizeOfSubsets cannot be negative
# use a type that enforces this for god's sake!
if sizeOfSubsets is 0 then return {}
else if sizeOfSubsets is 1 then
result = []
for each element in set do result <- result + {element}
return result
else
result = []
baseSubsets = generate_subsets(set, sizeOfSubsets - 1)
for each subset in baseSubsets
for each element in set
if element not in subset then result <- result + { subset + element }
return result
The key points are:
generate the subsets of lower rank first, as you'll have to iterate over them
don't try to insert an element in a subset if it is already there; it would give you a subset of incorrect size
Now, you'll have to understand this and transpose it to 'real' code.
I have been staring at this for several minutes and I can't figure out what your train of thought is for thinking that it would work. You are permanently removing several members of the input set before exploring every possible subset that they could participate in.
Try working out the solution you intend in pseudo-code and see if you can see the problem without the STL interfering.
It seems (I'm not a native English speaker) that what you could do is compute the power set (the set of all subsets) and then select only the subsets matching your condition from it.
You can find methods for calculating the power set on the Wikipedia Power set page and on Math Is Fun (the link is in the External links section of that Wikipedia page, named "Power Set from Math Is Fun"; I cannot post it here directly because of the spam prevention mechanism). On Math Is Fun, see mainly the section "It's binary".
I also can't see what this is supposed to achieve.
If this isn't homework with specific restrictions, I'd simply suggest testing against a temporary std::set with std::includes().
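A minimal sketch of that check (the particular sets here are just an example): both containers are sorted, so std::includes can answer whether every element of a candidate is contained in the input set.

#include <algorithm>
#include <iostream>
#include <set>

int main() {
    std::set<int> inset = {1, 2, 3, 4, 5};   // the input set
    std::set<int> candidate = {2, 4};        // a temporary set to test

    // std::includes requires sorted ranges, which std::set guarantees.
    bool isSubset = std::includes(inset.begin(), inset.end(),
                                  candidate.begin(), candidate.end());
    std::cout << std::boolalpha << isSubset << '\n';   // prints: true
}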