Is there any way for mappers to share data? - mapreduce

I want my mappers to use a hash map for lookups. The hash map's contents are the same for every mapper, but each mapper loads its own copy, and together the copies consume all available memory. I want the hash map to be loaded once and used by all mappers. Is there any way to do that? I know that each mapper runs in a separate JVM.
Thanks all

There is something called the distributed cache. Even with the distributed cache you will not be able to share memory (your hash map) between two mapper JVM processes.
However, the distributed cache does serve the purpose of distributing small files, such as lookup files, to all the nodes in the cluster. You still have to construct your hash map separately in each mapper process on a node.

#vahid Map/Reduce is explicitly not set up that way. What happens if one of the mappers fails and needs to be restarted? Maybe you should look into something like MPI. Small amounts of signaling info can be shared via counters - which are transmitted along with each heartbeat.

Define the hash map as a static member variable of your Mapper class.
Define a static boolean init_once = false next to it.
Override the setup function and initialize the hash map there. setup is called once in each task, but all map tasks that run in the same JVM can share the static hash map, so guard the initialization with the flag:
if (!init_once)
{
    init_once = true;
    // your init code here
}
Note: this is not thread safe. If Hadoop runs the map tasks in multi-threaded mode, a mutex should be used to make sure the check-and-initialize operation is atomic.
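The note about thread safety can be illustrated with a small, self-contained sketch. It is plain Java with no Hadoop dependency: the class name SharedLookup, the load() helper and its sample entries are hypothetical, and LOAD_COUNT exists only to make the behavior observable. In a real job the get() call would sit inside the Mapper's setup method, and the sharing still only applies to tasks that run inside the same JVM.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Holds one lookup table per JVM; every task running in that JVM reuses it. */
final class SharedLookup {
    // Counts how often the expensive load ran (only here to observe the behavior).
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // null until the first caller has finished loading
    private static volatile Map<String, String> table;

    private SharedLookup() {}

    /** Call this from each task's setup(); only the first caller pays for the load. */
    static Map<String, String> get() {
        Map<String, String> local = table;
        if (local == null) {
            synchronized (SharedLookup.class) { // the mutex the note asks for
                local = table;
                if (local == null) {            // re-check under the lock
                    local = load();
                    table = local;
                }
            }
        }
        return local;
    }

    // Hypothetical loader: a real job would read the lookup file here.
    private static Map<String, String> load() {
        LOAD_COUNT.incrementAndGet();
        Map<String, String> m = new HashMap<>();
        m.put("k1", "v1");
        m.put("k2", "v2");
        return Collections.unmodifiableMap(m); // read-only, so concurrent readers are safe
    }
}
```

Publishing the map as read-only after the load means that only the initialization needs the lock; later lookups from any thread go straight to the map.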

Related

In Flink is it possible to use state with a non keyed stream?

Let's assume that I have an input DataStream and want to implement some functionality that requires "memory", so I need a ProcessFunction that gives me access to state. Is it possible to do this directly on the DataStream, or is the only way to keyBy the initial stream and work in a keyed context?
I'm thinking that one solution would be to keyBy the stream with a hardcoded unique key so that the whole input stream ends up in the same group. Then technically I have a KeyedStream and can use keyed state as usual, as shown below with keyBy(x -> 1). But is this a good solution?
DataStream<Integer> inputStream = env.fromSource(...);
DataStream<Integer> outputStream = inputStream
    .keyBy(x -> 1)
    .process(...); // I have access to state here
As I understand it, that's not a common use case, because the main purpose of Flink is to partition the stream, process the partitions separately and then merge the results. In my scenario that's exactly what I'm doing, but the problem is that the merge step requires state to produce the final "global" result. What I actually want to do is something like this:
DataStream<Integer> inputStream = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8, 9);
// two groups: group1=[1,2,3,4] & group2=[5,6,7,8,9]
DataStream<Integer> partialResult = inputStream
    .keyBy(val -> val / 5)
    .process(<..stateful processing..>);
// Can't do stateful processing here because partialResult is not a KeyedStream
DataStream<Integer> outputStream = partialResult
    .process(<..stateful processing..>);
outputStream.print();
But Flink doesn't seem to allow me to do the final "merge partial results" operation, because I can't get access to state in the process function when partialResult is not a KeyedStream.
I'm a beginner with Flink, so I hope what I'm writing makes sense.
In general, I haven't found a good way to do the "merging" step, especially when it involves complex logic.
I hope someone can give me some info or tips, or correct me if I'm missing something.
Thank you for your time
Is "keyBy the stream with a hardcoded unique key" a good idea? Well, normally no, since it forces all data to flow through a single sub-task, so you get no benefit from the full parallelism in your Flink cluster.
If you want to get a global result (e.g. the "best" 3 results, from any results generated in the preceding step) then yes, you'll have to run all records through a single sub-task. So you could have a fixed key value, and use a global window. But note (as the docs state) you need to come up with some kind of "trigger condition", otherwise with a streaming workflow you never know when you really have the best N results, and thus you'd never emit any final result.
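To make the "trigger condition" point concrete, here is a minimal sketch in plain Java rather than Flink API code. It models what a single sub-task would do for the "best 3 results" example: keep the N largest values seen so far and emit them only when a trigger fires. The class name TopNWithCountTrigger and the choice of a count-based trigger are assumptions made for illustration; a time-based or end-of-input trigger would follow the same shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.PriorityQueue;

/** Keeps the N largest values seen so far and emits them every triggerEvery elements. */
final class TopNWithCountTrigger {
    private final int n;
    private final int triggerEvery;
    // Min-heap: the smallest of the current best N sits on top and is evicted first.
    private final PriorityQueue<Integer> best = new PriorityQueue<>();
    private int seen = 0;

    TopNWithCountTrigger(int n, int triggerEvery) {
        this.n = n;
        this.triggerEvery = triggerEvery;
    }

    /** Feeds one element; returns the current top N (largest first) only when the trigger fires. */
    Optional<List<Integer>> add(int value) {
        best.add(value);
        if (best.size() > n) {
            best.poll(); // drop the smallest so only N values are retained
        }
        seen++;
        if (seen % triggerEvery != 0) {
            return Optional.empty(); // trigger has not fired: emit nothing yet
        }
        List<Integer> out = new ArrayList<>(best);
        out.sort(Collections.reverseOrder());
        return Optional.of(out);
    }
}
```

Without the trigger, the accumulator would keep updating its state forever and never produce output, which is the situation the answer warns about for unbounded streams.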

Allocating datastore id using PRNG

Google Cloud Datastore documents that if an entity id needs to be pre-allocated, then one should use the allocateIds method:
https://cloud.google.com/datastore/docs/best-practices#keys
That method seems to make a REST or RPC call, which has latency. I'd like to avoid that latency by using a PRNG in my Kubernetes Engine application. Here's the Scala code:
import java.security.SecureRandom
class RandomFactory {
  protected val r = new SecureRandom
  def randomLong: Long = r.nextLong
  // Returns a value in [min, max): min is inclusive, max is exclusive.
  def randomLong(min: Long, max: Long): Long =
    // Unfortunately, Java didn't make Random.internalNextLong public,
    // so we have to get to it in an indirect way.
    r.longs(1, min, max).toArray.head
  // id may be any value in the range [1, MAX_SAFE_INTEGER),
  // so that it can be represented in Javascript.
  // TODO: randomId is used in production, and might be susceptible to
  // TODO: blocking if /dev/random does not contain entropy.
  // TODO: Keep an eye on this concern.
  def randomId: Long =
    randomLong(1, RandomFactory.MAX_SAFE_INTEGER)
}
object RandomFactory extends RandomFactory {
  // MAX_SAFE_INTEGER is es6 Number.MAX_SAFE_INTEGER
  val MAX_SAFE_INTEGER = 9007199254740991L
}
I also plan to install haveged in the pod to help with entropy.
I understand that allocateIds ensures that an ID is not already in use. But in my particular use case, two factors mitigate that concern:
Based on entity count, the chance of a conflict is 1 in 100 million.
This particular entity type is non-essential, and can afford a "once in a blue moon" conflict.
I am more concerned about even distribution in the keyspace, because that is a concern in normal use.
Will this approach work, particularly with regard to even distribution in the keyspace? Is the allocateIds method essential, or does it just help developers avoid simple mistakes?
To get rid of collisions, use more bits: for all practical purposes, 128 bits (see the statistics behind UUID v4) will never generate a collision.
Another technique is to insert new entities with a shorter random number and handle the error Cloud Datastore returns if they already exist by trying again with a new ID (until you happen upon one that isn't currently in use).
As far as key distribution goes: keys that are randomly distributed within the key space will keep Cloud Datastore happy.
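The insert-and-retry technique mentioned above can be sketched as follows. This is plain Java against an in-memory map that stands in for Cloud Datastore; it does not use the Datastore client library. The class name RetryingInserter, the maxAttempts parameter and the injected ID source are assumptions made for illustration. In a real application, the putIfAbsent call would be replaced by an insert that fails when the entity already exists.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

/** Inserts entities under random IDs and retries with a fresh ID on a collision. */
final class RetryingInserter {
    // In-memory stand-in for the datastore: id -> entity payload.
    private final ConcurrentMap<Long, String> store = new ConcurrentHashMap<>();
    private final LongSupplier idSource;

    RetryingInserter(LongSupplier idSource) {
        this.idSource = idSource;
    }

    /** Tries candidate IDs until one is free; gives up after maxAttempts collisions. */
    long insert(String entity, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            long id = idSource.getAsLong();
            // putIfAbsent is atomic: it returns null only if this call created the entry.
            if (store.putIfAbsent(id, entity) == null) {
                return id;
            }
        }
        throw new IllegalStateException("no free ID after " + maxAttempts + " attempts");
    }

    int size() {
        return store.size();
    }
}
```

A caller could pass something like `() -> ThreadLocalRandom.current().nextLong(1, 9007199254740991L)` as the ID source to match the range used in the question. With collisions as rare as described there, the loop almost always succeeds on the first attempt, so the retry path costs nothing in the common case.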
Given that you don't want the entity identifier to be based on an external value, you should allow Cloud Datastore to allocate IDs for you. This way you won't have any conflicts. The IDs allocated by Cloud Datastore will be appropriately scattered through the key space.

Persist an entity without attaching it to the EntityManager

I want to bulk-import Doctrine entities from an XML file.
The XML file can be very large (up to 1 million entities), so I can't persist all my entities the traditional way:
$em->beginTransaction();
while ($entity = $xmlReader->readNextEntity()) {
    $em->persist($entity);
}
$em->flush();
$em->commit();
I would soon exceed my memory limit, and Doctrine is not really designed to handle that many managed entities.
I don't need to track changes to the persisted entities, just to persist them; therefore I don't want them to be managed by the EntityManager.
Is it possible to persist entities without getting them managed by the EntityManager?
The first option that comes to my mind is to detach it immediately after persisting it:
$em->beginTransaction();
while ($entity = $xmlReader->readNextEntity()) {
    $em->persist($entity);
    $em->flush($entity);
    $em->detach($entity);
}
$em->commit();
But this is quite expensive in Doctrine, and would slow down the import.
The other option would be to directly insert the data into the database using the Connection object and a prepared statement, but I like the abstraction of the entity and would ideally like to store the object directly.
Instead of using detach and flush after each insert, you can call clear (which detaches all entities from the manager) and flush in batches, which should be significantly faster:
Bulk inserts in Doctrine are best performed in batches, taking
advantage of the transactional write-behind behavior of an
EntityManager. The following code shows an example for inserting 10000
objects with a batch size of 20. You may need to experiment with the
batch size to find the size that works best for you. Larger batch
sizes mean more prepared statement reuse internally but also mean more
work during flush.
https://doctrine-orm.readthedocs.org/projects/doctrine-orm/en/latest/reference/batch-processing.html
If possible, I recommend avoiding transactions for bulk operations as they tend to slow things down:
//$em->beginTransaction();
$i = 0;
while ($entity = $xmlReader->readNextEntity()) {
    $em->persist($entity);
    if (++$i % 20 == 0) {
        $em->flush();
        $em->clear(); // detaches all entities
    }
}
$em->flush(); // persist the objects that did not make up an entire batch
$em->clear();
//$em->commit();
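The batching pattern itself is independent of Doctrine. As a rough illustration, here is the same idea in plain Java: buffer items, hand over a full batch, then drop the references so memory stays bounded. BatchWriter and its sink callback are hypothetical names; the sink plays the role of flush(), and clearing the buffer plays the role of clear().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers items and writes them out in fixed-size batches. */
final class BatchWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // receives one batch per call
    private final List<T> pending = new ArrayList<>();

    BatchWriter(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one item and flushes automatically once a full batch has accumulated. */
    void persist(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Writes whatever is pending; call once more after the loop for the final partial batch. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // hand over a copy of the batch
        pending.clear();                       // release references to the written items
    }
}
```

As in the PHP loop, the final flush() after the loop matters: without it, the items that did not fill a whole batch would never be written.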

How should I implement simple caches with concurrency on Redis?

Background
I have a 2-tier web service - just my app server and an RDBMS. I want to move to a pool of identical app servers behind a load balancer. I currently cache a bunch of objects in-process. I hope to move them to a shared Redis.
I have a dozen or so caches of simple, small-sized business objects. For example, I have a set of Foos. Each Foo has a unique FooId and an OwnerId.
One "owner" may own multiple Foos.
In a traditional RDBMS this is just a table with an index on the PK FooId and one on OwnerId. I'm caching this in one process simply:
Dictionary<int,Foo> _cacheFooById;
Dictionary<int,HashSet<int>> _indexFooIdsByOwnerId;
Reads come straight from here, and writes go here and to the RDBMS.
I usually have this invariant:
"For a given group [say by OwnerId], the whole group is in cache or none of it is."
So when I cache miss on a Foo, I pull that Foo and all the owner's other Foos from the RDBMS. Updates make sure to keep the index up to date and respect the invariant. When an owner calls GetMyFoos I never have to worry that some are cached and some aren't.
What I did already
The first/simplest answer seems to be to use plain ol' SET and GET with a composite key and json value:
SET( "ServiceCache:Foo:" + theFoo.Id, JsonSerialize(theFoo));
I later decided I liked:
HSET( "ServiceCache:Foo", theFoo.FooId, JsonSerialize(theFoo));
That lets me get all the values in one cache as HVALS. It also felt right - I'm literally moving hashtables to Redis, so perhaps my top-level items should be hashes.
This works to first order. If my high-level code is like:
UpdateCache(myFoo);
AddToIndex(myFoo);
That translates into:
HSET ("ServiceCache:Foo", theFoo.FooId, JsonSerialize(theFoo));
var myFoos = JsonDeserialize( HGET ("ServiceCache:FooIndex", theFoo.OwnerId) );
myFoos.Add(theFoo.FooId);
HSET ("ServiceCache:FooIndex", theFoo.OwnerId, JsonSerialize(myFoos));
However, this is broken in two ways.
Two concurrent operations can read/modify/write at the same time. The latter "wins" the final HSET and the former's index update is lost.
Another operation could read the index in between the first and second lines. It would miss a Foo that it should find.
So how do I index properly?
I think I could use a Redis set instead of a json-encoded value for the index.
That would solve part of the problem since the "add-to-index-if-not-already-present" would be atomic.
I also read about using MULTI as a "transaction" but it doesn't seem like it does what I want. Am I right that I can't really MULTI; HGET; {update}; HSET; EXEC since it doesn't even do the HGET before I issue the EXEC?
I also read about using WATCH and MULTI for optimistic concurrency, then retrying on failure. But WATCH only works on top-level keys. So it's back to SET/GET instead of HSET/HGET. And now I need a new index-like-thing to support getting all the values in a given cache.
If I understand it right, I can combine all these things to do the job. Something like:
while (!succeeded)
{
    WATCH( "ServiceCache:Foo:" + theFoo.FooId );
    WATCH( "ServiceCache:FooIndexByOwner:" + theFoo.OwnerId );
    WATCH( "ServiceCache:FooIndexAll" );
    MULTI();
    SET ("ServiceCache:Foo:" + theFoo.FooId, JsonSerialize(theFoo));
    SADD ("ServiceCache:FooIndexByOwner:" + theFoo.OwnerId, theFoo.FooId);
    SADD ("ServiceCache:FooIndexAll", theFoo.FooId);
    EXEC();
    //TODO somehow set succeeded properly
}
Finally I'd have to translate this pseudocode into real code depending how my client library uses WATCH/MULTI/EXEC; it looks like they need some sort of context to hook them together.
All in all this seems like a lot of complexity for what has to be a very common case;
I can't help but think there's a better, smarter, Redis-ish way to do things that I'm just not seeing.
How do I lock properly?
Even if I had no indexes, there's still a (probably rare) race condition.
A: HGET - cache miss
B: HGET - cache miss
A: SELECT
B: SELECT
A: HSET
C: HGET - cache hit
C: UPDATE
C: HSET
B: HSET ** this is stale data that's clobbering C's update.
Note that C could just be a really-fast A.
Again I think WATCH, MULTI, retry would work, but... ick.
I know in some places people use special Redis keys as locks for other objects. Is that a reasonable approach here?
Should those be top-level keys like ServiceCache:FooLocks:{Id} or ServiceCache:Locks:Foo:{Id}?
Or make a separate hash for them - ServiceCache:Locks with subkeys Foo:{Id}, or ServiceCache:Locks:Foo with subkeys {Id} ?
How would I work around abandoned locks, say if a transaction (or a whole server) crashes while "holding" the lock?
For your use case, you don't need to use WATCH. Simply wrap the writes in a MULTI + EXEC block and you will have eliminated the race condition.
In pseudocode:
MULTI();
SET ("ServiceCache:Foo:" + theFoo.FooId, JsonSerialize(theFoo));
SADD ("ServiceCache:FooIndexByOwner:" + theFoo.OwnerId, theFoo.FooId);
SADD ("ServiceCache:FooIndexAll", theFoo.FooId);
EXEC();
This is sufficient because MULTI makes the following promise:
"It can never happen that a request issued by another client is served in the middle of the execution of a Redis transaction"
You don't need the watch and retry mechanism because you are not reading and writing in the same transaction.
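To see why queued commands behave this way, the following sketch models the queue-then-execute behavior in plain Java. It is an in-memory stand-in, not a Redis client and not how Redis is implemented: TinyStore, Tx and their methods are hypothetical names, and a Java lock plays the role of the server executing the transaction without interleaving other requests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the server side of MULTI/EXEC; NOT a Redis client. */
final class TinyStore {
    private final Map<String, String> strings = new HashMap<>();
    private final Map<String, Set<String>> sets = new HashMap<>();

    /** Commands are only queued here; nothing touches the store until exec(). */
    final class Tx {
        private final List<Runnable> queued = new ArrayList<>();

        Tx set(String key, String value) {
            queued.add(() -> strings.put(key, value));
            return this;
        }

        Tx sadd(String key, String member) {
            queued.add(() -> sets.computeIfAbsent(key, k -> new HashSet<>()).add(member));
            return this;
        }

        /** Applies all queued commands under the store lock, so no reader sees a partial update. */
        void exec() {
            synchronized (TinyStore.this) {
                queued.forEach(Runnable::run);
                queued.clear();
            }
        }
    }

    Tx multi() {
        return new Tx();
    }

    synchronized String get(String key) {
        return strings.get(key);
    }

    synchronized Set<String> smembers(String key) {
        return new HashSet<>(sets.getOrDefault(key, new HashSet<>()));
    }
}
```

The model also shows why a read cannot feed a later write inside the same block: queued commands produce no results until exec() runs, which is the limitation the question noticed with MULTI; HGET; {update}; HSET; EXEC.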

Optimisation on lookup question

The requirement is that a number of 'clients' select a range of resources which they wish to control and listen to events on. Typically there would be 10 or so clients and 100 or so resources. However, the number of clients and resources could each exceed 1000.
This is currently implemented as a map indexed by client ID, with the client object as the value. The client object contains a list of the resources it has selected. The problem is that if there is an event for a resource, say resource A, then the code has to cycle through each client and then through each list within a client. I am concerned about performance.
Is there a more efficient algorithm to handle this possible bottleneck?
Angus
Your structure looks like {client:[resource]}, but for efficient event delivery you need {resource:[client]}.
It seems you want to do a reverse lookup, so take a look at Boost.Bimap, which supports this.
Insert standard disclaimer: premature optimisation is bad, so don't complicate things before you know you have a performance problem.
How about just having a second, inverse data structure: a hash map of resources, each containing a list of interested clients? It's a bit more work as clients and resources change, but probably worth it.
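A minimal sketch of the second, inverse structure suggested above, written in plain Java. The class name Subscriptions, the method names and the use of String IDs are assumptions made for illustration. Both maps are updated together, so an event for a resource is delivered with one hash lookup instead of a scan over every client.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Keeps client->resources and the inverse resource->clients index in sync. */
final class Subscriptions {
    private final Map<String, Set<String>> resourcesByClient = new HashMap<>();
    private final Map<String, Set<String>> clientsByResource = new HashMap<>();

    /** Records that a client selected a resource, updating both directions. */
    void subscribe(String clientId, String resourceId) {
        resourcesByClient.computeIfAbsent(clientId, k -> new HashSet<>()).add(resourceId);
        clientsByResource.computeIfAbsent(resourceId, k -> new HashSet<>()).add(clientId);
    }

    /** Removes a client and cleans up every inverse entry that referenced it. */
    void removeClient(String clientId) {
        Set<String> resources = resourcesByClient.remove(clientId);
        if (resources == null) {
            return;
        }
        for (String resourceId : resources) {
            Set<String> clients = clientsByResource.get(resourceId);
            clients.remove(clientId);
            if (clients.isEmpty()) {
                clientsByResource.remove(resourceId); // avoid keeping empty sets around
            }
        }
    }

    /** Clients to notify for an event on this resource: one hash lookup, no scan. */
    Set<String> listeners(String resourceId) {
        return Collections.unmodifiableSet(
                clientsByResource.getOrDefault(resourceId, Collections.emptySet()));
    }
}
```

The extra cost is the bookkeeping in subscribe and removeClient, which is the "bit more work" the answer mentions; this sketch is not thread safe and would need external synchronization if clients change concurrently with event delivery.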
Did you run a profiler? Did the profiler register that as the real bottleneck?
10 clients and 100 resources are nothing for a modern PC. A simple std::map would make that lookup very fast.
Something like this :
struct Resource
{
// data 2
};
struct Client
{
// data 2
};
std::map< Client, std::vector< Resource > > mappingClientToResources;
This is just an idea, and some things are missing before it would work (for example, an ordering criterion for Client, which std::map requires for its keys).
Or you can store each client's resource list as a map as well, with the resource as the key and a boolean as the value.
Something like
{ client1 : { resource1 : true, resource2 : true, resource3 : true }, ... }
instead of your current
{ client1 : [resource1, resource2, resource3], ... }
Checking whether a client has selected a given resource then becomes a map lookup instead of a list scan.