input

1 - - GET hm_brdr.gif
2 - - GET s102382.gif
3 - - GET bg_stars.gif
3 - - GET phrase.gif

map-reduce->

("1", {"- - GET hm_brdr.gif"})
("2", {"- - GET s102382.gif"})
("3", {"- - GET bg_stars.gif", "- - GET phrase.gif"})
I want to anonymize the first-column values (1, 2, 3, ...) using random integers, but the mapping has to be consistent: 1 must not become x on one line and t on another. So my solution is to replace the keys with random integers (rand(1)=x, rand(2)=y, ...) in the reduce step, then ungroup the values with their new keys and write them to files again, as shown below.
output file
x - - GET hm_brdr.gif
y - - GET s102382.gif
z - - GET bg_stars.gif
z - - GET phrase.gif
My question is: is there a better way of doing this in terms of running time?
If you want to assign a random integer to a key value then you'll have to do that in a reducer, where all key/value pairs for that key are gathered in one place. As @jason pointed out, you don't want to assign a random number, since there's no guarantee that a particular random number won't be chosen for two different keys. What you can do instead is increment a counter held as an instance variable on the reducer to get the next available number to associate with a key. If you have a small amount of data then a single reducer can be used and the numbers will be unique. If you're forced to use multiple reducers then you'll need a slightly more complicated technique. Use
Context.getTaskAttemptID().getTaskID().getId()
to get a unique reducer number with which to calculate an overall unique number for each key.
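For example, here is a small hypothetical sketch (plain C++ for concreteness, not Hadoop API) of how the local counter and that reducer number can be combined so that two reducers can never hand out the same value:

long nextUniqueId(long& localCounter, int reducerId, int numReducers)
{
    // Numbers emitted by different reducers fall in disjoint residue classes
    // mod numReducers, so no two reducers can ever produce the same value.
    return (localCounter++) * numReducers + reducerId;
}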
There is no way this is a bottleneck to your MapReduce job. More precisely, the runtime of your job is dominated by other concerns (network and disk I/O, etc.). A quick little key function? Meh.
But that's not even the biggest issue with your proposal. The biggest issue with your proposal is that it's doomed to fail. What is a key fact about keys? They serve as unique identifiers for records. Do random number generators guarantee uniqueness? No.
In fact, pretend for just a minute that your random key space has 365 possible values. It turns out that if you generate a mere 23 random keys, you are more likely than not to have a key collision; welcome to the birthday paradox. And all of a sudden, you've lost the whole point of the keys in the first place, because you've started smashing together records: two records that should never share a key now do!
And you might be thinking, well, my key space isn't as small as 365 possible keys, it's more like 2^32 possible keys, so I'm, like, totally in the clear. No. After approximately 77,000 keys you're more likely than not to have a collision.
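Both figures come straight from the usual birthday-bound approximation: with d equally likely key values, a 50% chance of at least one collision is reached after roughly

n \approx \sqrt{2 d \ln 2}

keys, which gives \sqrt{2 \cdot 365 \cdot \ln 2} \approx 23 and \sqrt{2 \cdot 2^{32} \cdot \ln 2} \approx 77{,}000.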
Your idea is just completely untenable because it's the wrong tool for the job. You need unique identifiers. Random doesn't guarantee uniqueness. Get a different tool.
In your case, you need a function that is injective on your input key space (that is, it guarantees that f(x) != f(y) if x != y). You haven't given me enough details to propose anything concrete, but that's what you're looking for.
And seriously, there is no way that performance of this function will be an issue. Your job's runtime really will be completely dominated by other concerns.
Edit:
To respond to your comment:
Here I am actually trying to make the IP numbers anonymous in the log files, so if you think there is a better way I'll be happy to know.
First off, we have a serious XY problem here. You should have asked about that problem (or searched for existing answers to it) instead. Anonymizing IP addresses, or anything for that matter, is hard. You haven't even told us the criteria for a "solution" (e.g., who are the attackers?). I recommend taking a look at this answer on the IT Security Stack Exchange site.
Related
My main table, Users, stores information about users. I plan to have a UserId field as the primary key of the table. I have full control of creation and assignment of these keys, and I want to ensure that I assign keys in a way that provides good performance. What should I do?
You have a few options:
The most generic solution is to use UUIDs, as specified in RFC 4122.
For example, you could have a STRING(36) that stores UUIDs. Or you could store the UUID as a pair of INT64s or as a BYTE(16). There are some pitfalls to using UUIDs, so read the details of this answer.
If you want to save a bit of space and are absolutely sure that you will have fewer than a few billion users, then you could use an INT64 and assign UserIds using a random number generator. The reason you want to be sure you have fewer than a few billion users is the Birthday Problem: the odds that you get at least one collision are about 50% once you have roughly 4 billion users, and they increase very quickly from there. If you assign a UserId that has already been assigned to a previous user, then your insertion transaction will fail, so you'll need to be prepared for that (by retrying the transaction after generating a new random number).
If there's some column, MyColumn, in the Users table that you would like to have as the primary key (possibly because you know you'll want to look up entries using this column frequently), but you're not sure about the tendency of this column to cause hotspots (say, because it's generated sequentially or based on timestamps), then you have two other options:
3a) You could "encrypt" MyColumn and use this as your primary key. In mathematical terms, you could use an automorphism on the key values, which has the effect of chaotically scrambling them while still never assigning the same value multiple times. In this case, you wouldn't need to store MyColumn separately at all, but rather you would only store/use the encrypted version and could decrypt it when necessary in your application code. Note that this encryption doesn't need to be secure but instead just needs to guarantee that the bits of the original value are sufficiently scrambled in a reversible way. For example: If your values of MyColumn are integers assigned sequentially, you could just reverse the bits of MyColumn to create a sufficiently scrambled primary key. If you have a more interesting use-case, you could use an encryption algorithm like XTEA.
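As a concrete illustration of the bit-reversal example, here is a minimal C++ sketch (the function name and the 64-bit key width are illustrative assumptions):

#include <cstdint>

// Reverse the bits of a 64-bit key: sequentially assigned values get spread
// across the whole key space, and applying the function twice returns the
// original value, so the mapping is trivially reversible.
uint64_t reverseBits(uint64_t v)
{
    uint64_t result = 0;
    for (int i = 0; i < 64; ++i)
    {
        result = (result << 1) | (v & 1);
        v >>= 1;
    }
    return result;
}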
3b) Have a compound primary key where the first part is a ShardId, computed as hash(MyColumn) % numShards, and the second part is MyColumn. The hash function will ensure that you don't create a hotspot by allocating your rows to a single split. More information on this approach can be found here. Note that you do not need to use a cryptographic hash, although md5 or sha512 are fine functions. SpookyHash is a good option too. Picking the right number of shards is an interesting question and can depend upon the number of nodes in your instance; it's effectively a trade-off between hotspot-avoiding power (more shards) and read/scan efficiency (fewer shards). If you only have 3 nodes, then 8 shards is probably fine. If you have 100 nodes, then 32 shards is a reasonable value to try.
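And a small sketch of the ShardId computation from option 3b, using a splitmix64-style mixer purely as a compact stand-in for a real hash such as SpookyHash:

#include <cstdint>

// splitmix64-style finalizer: a compact stand-in for a real hash function.
uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// First component of the compound primary key (ShardId, MyColumn): neighbouring
// MyColumn values land on different shards, which avoids a write hotspot.
int64_t shardId(int64_t myColumn, uint64_t numShards)
{
    return static_cast<int64_t>(mix64(static_cast<uint64_t>(myColumn)) % numShards);
}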
I have around 400,000 "items".
Each "item" consists of 16 double values.
At runtime I need to compare items with each other. Therefore I am multiplying their double values. This is quite time-consuming.
I have made some tests, and I found out that there are only 40,000 possible return values, no matter which items I compare with each other.
I would like to store these values in a look-up table so that I can easily retrieve them without doing any real calculation at runtime.
My question would be how to efficiently store the data in a look-up table.
The problem is that if I create a look-up table, it gets amazingly huge, for example like this:
item-id  item-id  compare return value
1        1        499483.49834
1        2        -0.0928
1        3        499483.49834
(...)
It would sum up to around 120 million combinations.
That just looks too big for a real-world application.
But I am not sure how to avoid that.
Can anybody please share some cool ideas?
Thank you very much!
Assuming I understand you correctly, you have two inputs with 400K possibilities each, so 400K * 400K = 160B entries. Even if they're indexed sequentially and each of your 40K possible results is stored in 2 octets, you're looking at a table size of roughly 300 GB; pretty sure that's beyond current every-day computing. So, you might instead research whether there is any correlation between the 400K "items", and if so, whether you can capture that correlation in some kind of function that gives you a clue (read: hash function) as to which of the 40K results might/could/should come out. Clearly your hash function and lookup need to be cheaper than just doing the multiplication in the first place. Or maybe you can reduce the comparison time with some kind of intelligent shortcut, like knowing the result under certain scenarios. Or perhaps some of your math can be optimized using integer math or boolean comparisons. Just a few thoughts...
To speed things up, you should probably compute all of the possible answers, and store the inputs to each answer.
Then, I would recommend making some sort of look-up table that uses the answer as the key (since the answers will all be unique) and then stores all of the possible inputs that produce that result.
To help visualize:
Say you have a table, 'Table'. Inside Table you have keys, and associated with those keys are values. What you do is make the keys have the type of whatever format your answers are in (the keys will be all of your answers). Now, give each of your 400k inputs a unique identifier. You then store the pair of identifiers for a multiplication as one value associated with that particular key. When you compute that same answer again, you just add it as another pair of inputs that can produce that key.
Example:
std::map<AnswerType, std::vector<Input>> table;
Define Input like:
struct Input { IDType one; IDType two; };
Where one 'Input' might have IDs 12384 and 128, meaning that the objects identified by 12384 and 128, when multiplied, will give that answer.
So, in your lookup, you'll have something that looks like:
bool contains(const std::vector<Input>& inputs, IDType first, IDType second); // defined below

AnswerType lookup(IDType first, IDType second)
{
    // Walk every answer and check whether this pair of inputs produces it.
    for (const auto& entry : table)
    {
        if (contains(entry.second, first, second))
            return entry.first;
    }
    return AnswerType{}; // no matching pair was found
}

// Defined elsewhere: does this answer's list of input pairs contain
// (first, second), in either order?
bool contains(const std::vector<Input>& inputs, IDType first, IDType second)
{
    for (const Input& i : inputs)
    {
        if ((i.one == first && i.two == second) ||
            (i.two == first && i.one == second))
            return true;
    }
    return false;
}
IDType and AnswerType are placeholders for whatever identifier and answer types you use; this is only a rough cut as-is, but it might be a place to start.
While the outer loop is probably going to be limited to a linear search, you can make the contains check run as a binary search by keeping the stored inputs sorted.
In all, you're looking at a run-once application that will run in O(n^2) time, and a lookup that will run in O(n log n). I'm not entirely sure how the memory will look after all of that, though. Of course, I don't know much about the math behind it, so you might be able to speed up the linear search if you can somehow sort the keys as well.
I'm creating an item crafting system for a game and need to be able to take any randomly selected items that a player could pick and transform whatever items are selected into a hash_tag, which can then be compared against the hash_tags of all possible item mixes, searching for a correct match. This should be the simplest and fastest way to get the result I'm looking for, but of all the ways of doing this sort of thing (I have experience with just about all of them), hash tags are the one thing I've never even slightly touched. I have no idea where to even begin, and could use a lot of help with this.
Basically, what it needs to do is allow the player to select anything he or she has, combine the selected things into a hash_tag, and check the hash tag board for that number. Whether that number results in a "valid combination" or a "this is not a valid combination" doesn't matter, so long as all possible mixes are available on the hash tag board.
On the side there'll obviously be some code for picking things and removing them if there's a valid match and adding in the new item instead, but that's not what I need help with.
(Although anyone with suggestions on this I'll be glad to hear them!)
From what I have gathered so far, you have an ordered list of inputs (the items being crafted) and are looking for a function that returns a hash (probably for easy comparison and storage) and also has the property of being reversible.

Such a thing cannot exist for the general case: as long as your hash has fewer bits than your input data, hashing will produce collisions, and with those collisions the backward transformation will be impossible.

A good start would be to just choose unique identifiers for each item and use a list of those identifiers (ordered by size if order is irrelevant to the crafting) as the hash. Comparison will still be reasonably fast.
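A small C++ sketch of that suggestion, with hypothetical item IDs and recipe table (sorting the IDs makes the key independent of the order in which the player selected the items):

#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Build an order-independent key from the selected item IDs by sorting them
// and joining them into a string.
std::string comboKey(std::vector<int> itemIds)
{
    std::sort(itemIds.begin(), itemIds.end());
    std::string key;
    for (int id : itemIds)
        key += std::to_string(id) + "-";
    return key;
}

// Hypothetical recipe table: maps a combination key to the item it produces.
std::map<std::string, int> recipes;

bool isValidCombination(const std::vector<int>& selected)
{
    return recipes.count(comboKey(selected)) > 0;
}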
I'm thinking about generating random strings without producing any duplicates.
My first thought was to use a binary tree: insert each new string and search the tree for a duplicate, if any.
But this may not be very efficient.
My second thought was to use an MD5-like hash method that creates values based only on time, but this may introduce another problem: different machines have different time accuracy.
And on a modern processor, more than one string could be created within a single timestamp.
Is there any better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
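A minimal sketch of that approach (the string format and generator ID are illustrative only):

#include <algorithm>
#include <random>
#include <string>
#include <vector>

// Generate n distinct strings deterministically, then shuffle them so they are
// handed out in random order. Mixing a generator ID into each string keeps
// separate generators from ever colliding with one another.
std::vector<std::string> makeUniqueStrings(int n, const std::string& generatorId)
{
    std::vector<std::string> out;
    out.reserve(n);
    for (int i = 0; i < n; ++i)
        out.push_back(generatorId + "-" + std::to_string(i));

    std::mt19937 rng(std::random_device{}());
    std::shuffle(out.begin(), out.end(), rng);
    return out;
}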
Beware of MD5: there's no guarantee that two different Strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc. Two solutions off the top of my head:
1. Generate UUIDs, then turn them into Strings with a binary representation or a Base64 algorithm.
2. Simply generate random Strings and put them in a searchable structure (a HashMap) so that you can find out very quickly (O(1) to O(log n)) whether a generated String already has a duplicate, in which case it is discarded.
A tree probably won't be the most efficient choice, especially for insertions, as it will have to constantly re-balance itself (a somewhat "expensive" operation).
I'd recommend using a HashSet type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are constant-time. Insert all your Strings into the Set. If you create a new String, check to see if it already exists in the Set.
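In C++ terms the same idea looks roughly like this (a sketch, with std::unordered_set playing the role of the HashSet; generateRandomString is a hypothetical helper assumed to exist elsewhere):

#include <string>
#include <unordered_set>

std::string generateRandomString(); // hypothetical helper, defined elsewhere

// Remember every string handed out so far and retry until we draw a fresh one.
std::unordered_set<std::string> issued;

std::string nextUniqueString()
{
    std::string s;
    do {
        s = generateRandomString();
    } while (!issued.insert(s).second); // insert() reports whether the string was new
    return s;
}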
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify in what programming language you're coding. For instance, in Java this will work nicely: UUID.randomUUID().toString(). UUID identifiers are unique in practice, as stated on Wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here - no rebalancing necessary, because your strings are random, and it's on random data that binary trees work their best. However, it's still O(log(n)) for lookup and addition.
But if you know in advance how many random strings you'll need and don't mind a little probability in the mix, it may be more efficient to use a bloom filter.
Bloom filters give an efficient, probabilistic set membership test with memory requirements as low as one bit per element saved in the set. Basically, a bloom filter can say with 100% certainty that a member does not belong to a set, but with a high but not quite 100% certainty that a member is in a set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature shouldn't hurt a bit.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
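For the curious, here's an untuned Bloom-filter sketch that leans on std::hash and derives its probe positions by mixing a seed into one base hash; real filters choose their hash functions more carefully:

#include <functional>
#include <string>
#include <vector>

class BloomFilter
{
public:
    BloomFilter(std::size_t bitCount, std::size_t hashCount)
        : bits_(bitCount, false), hashes_(hashCount) {}

    void add(const std::string& s)
    {
        for (std::size_t i = 0; i < hashes_; ++i)
            bits_[index(s, i)] = true;
    }

    // false means "definitely not in the set"; true means "probably in the set".
    bool probablyContains(const std::string& s) const
    {
        for (std::size_t i = 0; i < hashes_; ++i)
            if (!bits_[index(s, i)])
                return false;
        return true;
    }

private:
    std::size_t index(const std::string& s, std::size_t seed) const
    {
        std::size_t h = std::hash<std::string>{}(s);
        return static_cast<std::size_t>((h ^ (seed * 0x9e3779b97f4a7c15ULL)) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t hashes_;
};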
For a while, I listed treaps here, but that's silly - they do a lot of operations in O(log(n)) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/
I am implementing a memcached client library. I want it to support several servers and so I wish to add some load-balancing system.
Basically, you can do two operations on a server:
Store a value given its key.
Get a value given its key.
Let us say I have N servers (from 0 to N - 1). I'd like to have a repartition function which, from a given key and the server count N, would give me an index in the [0, N) range.
unsigned int getServerIndex(const std::string& key, unsigned int serverCount);
The function should be as fast and simple as possible and must respect the following constraint:
getServerIndex(key, N) == getServerIndex(key, N); //aka. No random return.
I wish I could do this without using an external library (like OpenSSL and its hashing functions). What are my options here?
Side notes:
Obviously, the basic implementation:
unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
return 0;
}
Is not a valid answer as this is not exactly a good repartition function :D
Additional information:
Keys will usually be any possible string, within the ANSI charset (mostly [a-zA-Z0-9_-]). The size may be anything from a one-char-key to whatever-size-you-want.
A good repartition algorithm is an algorithm for which the probability of returning a is equal to (or not too far from) the probability of returning b, for two different keys. The number of servers might change (rarely, though), and if it does, it is acceptable that the returned index for a given key changes as well.
You're probably looking for something that implements consistent hashing. The easiest way to do this is to assign a random ID to each memcache server, and allocate each item to the memcache server which has the closest ID to the item's hash, by some metric.
A common choice for this - and the one taken by distributed systems such as Kademlia - would be to use the SHA1 hash function (though the particular hash is not important), and to compare distances by XORing the hash of the item with the hash of the server and interpreting the result as a magnitude. All you need, then, is a way of making each client aware of the list of memcache servers and their IDs.
When a memcache server joins or leaves, it need only generate its own random ID, then ask its new neighbours to send it any items that are closer to its hash than to their own.
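A rough sketch of the selection step, with std::hash standing in for SHA1 purely to keep it dependency-free, and with the list of server IDs assumed to have been generated and shared with clients already:

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Return the index of the server whose ID is closest to the item's hash,
// where "closest" means the smallest XOR of the two values.
// Assumes serverIds is non-empty.
std::size_t pickServer(const std::string& key, const std::vector<std::uint64_t>& serverIds)
{
    std::uint64_t itemHash = std::hash<std::string>{}(key);
    std::size_t best = 0;
    std::uint64_t bestDistance = itemHash ^ serverIds[0];
    for (std::size_t i = 1; i < serverIds.size(); ++i)
    {
        std::uint64_t distance = itemHash ^ serverIds[i];
        if (distance < bestDistance)
        {
            bestDistance = distance;
            best = i;
        }
    }
    return best;
}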
I think the hashing approach is the right idea. There are many simplistic hashing algorithms out there.
With the upcoming C++0x and the newly standard unordered_map, the hash of strings is becoming a standard operation. Many compilers are already delivered with a version of the STL which features a hash_map and thus already have a pre-implemented hash function.
I would start with those... but it would be better if we had more information on your strings: are they somehow constrained to a limited charset, or is it likely that there will be many similar strings?
The problem is that a "standard" hash might not produce a uniform distribution if the input is not uniformly distributed to begin with...
EDIT:
Given the information, I think the hash function already shipped with most STL implementations should work, since your keys do not seem to be concentrated in a narrow area. However, I am in no way an expert in probabilities, so take it with a grain of salt (and experiment).
What about something very simple like
hash(key) % serverCount
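With the std::hash mentioned above, that could be as small as this sketch (assuming serverCount > 0):

#include <functional>
#include <string>

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
    // std::hash gives a well-spread value for arbitrary strings; the modulo
    // maps it into the [0, serverCount) range.
    return static_cast<unsigned int>(std::hash<std::string>{}(key) % serverCount);
}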