I have many map fields defined in my protocol buffer messages. The messages are populated in C++ and received in a different C++ component, which reads their content using the Descriptor and Reflection APIs.
Given a map field, say:
map <int32, int32> my_map = 1;
This is transported in the same way as something like this:
message my_map_entry {
  int32 key = 1;
  int32 value = 2;
}
repeated my_map_entry my_map = 1;
As I understand the current limitations of the Descriptor and Reflection APIs, I have to perform lookups by iterating over the received data. Of course, I could copy the data into a more suitable structure, such as a std::unordered_map, if I wanted to do many lookups in the received map field, but I generally do only one lookup per received map field.
Can I assume anything about the order in which the data is received? Are the repeated my_map_entry messages perhaps ordered because of the underlying data structure used in the protocol buffer implementation? If so, a lookup for an integer key could stop as soon as a larger key is found, which would be a potential optimization when processing the received map fields in my application.
You cannot assume that the entries of a map arrive in any particular order after serialization.
The following quote is taken from the protobuf website:
Wire format ordering and map iteration ordering of map values is
undefined, so you cannot rely on your map items being in a particular
order
In general, protobuf makes no guarantee about the order in which fields are serialized.
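A minimal sketch of what this means for the lookup, written in Python for brevity rather than against the C++ Reflection API. The entries are modelled as plain (key, value) tuples, and both function names are made up for illustration:

```python
from typing import Iterable, Optional, Tuple

Entry = Tuple[int, int]  # stand-in for one received my_map_entry


def find_value(entries: Iterable[Entry], key: int) -> Optional[int]:
    """Scan every entry; makes no assumption about ordering."""
    for entry_key, entry_value in entries:
        if entry_key == key:
            return entry_value
    return None


def find_value_assuming_sorted(entries: Iterable[Entry], key: int) -> Optional[int]:
    """The proposed optimization: stop at the first larger key.

    Only correct if entries are sorted by key, which is not guaranteed.
    """
    for entry_key, entry_value in entries:
        if entry_key == key:
            return entry_value
        if entry_key > key:
            return None  # early exit; wrong when the order is arbitrary
    return None


# Entries as they might arrive, in no particular order.
received = [(42, 1), (7, 2), (19, 3)]
print(find_value(received, 19))                  # 3
print(find_value_assuming_sorted(received, 19))  # None, the entry is missed
```

With a single lookup per received field, the full scan is still linear in the number of entries, so the early exit would save at most a constant factor even if the order could be relied on.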
Suppose I have a proto structure that looks like the following:
message TMessage {
  optional TDictionary dictionary = 1;
  optional int specificField1 = 2;
  optional TOtherMessage specificField2 = 3;
  ...
}
Suppose I am using C++. This is the message stub that the master process uses to send information over the network to a bunch of nodes. In particular, the dictionary field is 1) pretty heavy and 2) common to all the serialized messages, while the specific fields that follow carry a relatively small amount of information specific to the destination node.
Of course, the dictionary is built only once, but it turns out that most of the running time is spent serializing the common dictionary part again and again for each new node.
The obvious optimization would be to pre-serialize the dictionary into a byte string and put it into TMessage as a bytes field, but this looks a bit nasty to me.
Am I right that there is no built-in way to pre-serialize a message field without ruining the message structure? It sounds like an idea for a good plugin for proto compiler.
Protobuf is designed such that concatenation === composition, at least for the root message. That means that you can serialize an object with just the dictionary, and snapshot the bytes somewhere. Now for each of the real messages you can paste down that snapshot, and then serialize an object with just the other fields - just whack it straight after: no additional syntax is required. This is semantically identical to serializing them all at the same time. In fact, since it will retain the field order, it should actually be identical bytes too.
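To make the concatenation idea concrete, here is a rough sketch in Python that hand-encodes the two wire types involved instead of using the protobuf library. The helper names are invented for this example, the dictionary payload is a placeholder, and specificField1 is assumed to be a non-negative int32:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def length_delimited_field(field_number: int, payload: bytes) -> bytes:
    """Wire type 2: tag, payload length, payload (submessage, bytes, string)."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def varint_field(field_number: int, value: int) -> bytes:
    """Wire type 0: tag, then the value as a varint."""
    return encode_varint(field_number << 3) + encode_varint(value)


# Placeholder for the already-serialized TDictionary contents.
dictionary_bytes = b"\x0a\x05hello"

# Snapshot of "a TMessage with only the dictionary set"; built once.
snapshot = length_delimited_field(1, dictionary_bytes)


def message_for_node(specific_field1: int) -> bytes:
    # Per node, only the small node-specific part is encoded and appended.
    return snapshot + varint_field(2, specific_field1)


# Encoding everything in one go gives the same bytes.
all_at_once = length_delimited_field(1, dictionary_bytes) + varint_field(2, 300)
print(message_for_node(300) == all_at_once)  # True
```

With the real C++ API, the equivalent would presumably be to serialize one TMessage that has only dictionary set, keep those bytes, and append the serialization of a second TMessage that holds only the node-specific fields.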
It helps that you used "optional" throughout :)
Marc's answer is perfect for your use case. Here is just another option:
The field must be a submessage, like your TDictionary is.
Have another variant of the outer message, with bytes in place of the submessage you want to preserialize:
message TMessage_preserialized {
  optional bytes dictionary = 1;
  ...
}
Now you can serialize the TDictionary separately and put the resulting data in the bytes field. In the protobuf wire format, submessages and bytes fields are written out the same way. This means you can serialize as TMessage_preserialized and still deserialize as a normal TMessage.
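A small sketch of why this works, again in Python with a hand-written wire-format reader rather than the protobuf library. The single `size` field inside TDictionary is purely hypothetical, and the parser only handles the two wire types used here:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int):
    """Return (value, position after the varint)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(data: bytes) -> dict:
    """Parse varint (0) and length-delimited (2) fields into {number: value}."""
    fields, pos = {}, 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[field_number] = value
    return fields


# Writer side: TDictionary serialized on its own (hypothetical `size = 1`, set to 150)...
dictionary_bytes = encode_varint(1 << 3) + encode_varint(150)
# ...then stored as the `bytes dictionary = 1` field of TMessage_preserialized.
outer = encode_varint((1 << 3) | 2) + encode_varint(len(dictionary_bytes)) + dictionary_bytes

# Reader side: field 1 is length-delimited either way, so its payload
# can be parsed as the TDictionary submessage.
inner_fields = parse_fields(parse_fields(outer)[1])
print(inner_fields)  # {1: 150}
```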
I am performing some particle simulations in C++ and I need to keep a list of contact information between particles. A contact is a struct containing data related to that contact. Each particle is identified by a unique ID. Once a contact is lost, it is deleted from the list. The bottleneck of the simulation is computing the force (a routine inside the contacts), and I have found that the way the contact list is organised has an important impact on overall performance.
Currently, I am using a C++ std::unordered_map (hash map), whose key is a single integer obtained from a pairing function applied to the two unique IDs of the particles, and whose value is the contact itself.
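For concreteness, the keying scheme is roughly like the following sketch (shown in Python to keep it short). The bit-packing pairing function and the assumption that IDs fit in 32 bits are illustrative choices only, and the contact is reduced to a small dict:

```python
def contact_key(id_a: int, id_b: int) -> int:
    """Order-independent key for a pair of particle IDs (assumed to be < 2**32)."""
    low, high = (id_a, id_b) if id_a < id_b else (id_b, id_a)
    return (high << 32) | low  # pack both IDs into one integer


# A dict is a hash map, playing the role of std::unordered_map here.
contacts = {}

contacts[contact_key(3, 17)] = {"overlap": 0.01}  # new contact
contacts[contact_key(17, 3)]["overlap"] = 0.02    # same contact, IDs swapped
print(len(contacts))                              # 1

del contacts[contact_key(3, 17)]                  # contact lost
print(len(contacts))                              # 0
```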
I would like to know if there is a better approach to this problem (efficiently organising the list of contacts while keeping track of the particles they relate to), since I chose my approach only because I read that a hash map is fast for both insertion and deletion.
Thanks in advance.
I want to write a map-side join and include reducer code as well. I have a smaller data set which I will send via the distributed cache.
Can I write the map-side join with reducer code?
Yes!! Why not? The reducer is meant for aggregating the key-value pairs emitted by the map. So you can always have a reducer in your code whenever you want to aggregate your result (say you want to count, find an average, or do any other numerical summarization) based on criteria that you've set in your code or that follow from the problem statement. The map is just for filtering the data and emitting some useful key-value pairs out of a LOT of data. A map-side join is only applicable when one of the datasets is small enough to fit in the memory of a commodity machine. By the way, a reduce-side join serves your purpose too!!
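Here is a toy sketch of the flow in plain Python, not Hadoop code. The datasets, field names, and function names are all made up, and the grouping loop only stands in for the shuffle phase:

```python
from collections import defaultdict

# Small dataset: what each mapper would load into memory from the distributed cache.
departments = {"d1": "Engineering", "d2": "Sales"}

# Large dataset: (employee, department_id, salary) records fed to the mappers.
employees = [("ann", "d1", 100), ("bob", "d2", 80), ("cyd", "d1", 120), ("dee", "d9", 50)]


def map_join(record):
    """Map-side join: look up the in-memory table and emit (key, value) pairs."""
    name, dept_id, salary = record
    dept_name = departments.get(dept_id)
    if dept_name is not None:  # inner join: drop records without a match
        yield dept_name, salary


def reduce_average(key, values):
    """Reducer: aggregate all values that were emitted for one key."""
    values = list(values)
    return key, sum(values) / len(values)


# Stand-in for the shuffle: group the mapper output by key.
grouped = defaultdict(list)
for record in employees:
    for key, value in map_join(record):
        grouped[key].append(value)

results = dict(reduce_average(key, values) for key, values in grouped.items())
print(results)  # {'Engineering': 110.0, 'Sales': 80.0}
```

The join happens entirely in the map step; the reducer only sees already-joined records and does the aggregation.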
I'm trying to add an index to an attribute inside of a map object in DynamoDB and can't seem to find a way to do so. Is this something that is supported, or are indexes really only allowed on scalar values? The documentation around this seems to be quite sparse. I'm hoping that the indexing functionality is similar to MongoDB, but so far the approaches I've taken, referencing the attribute to index using dot syntax, have not been successful. Any help or additional info that can be provided is appreciated.
Indexes can be built only on top-level JSON attributes. In addition, range keys must be scalar values in DynamoDB (one of String, Number, Binary, or Boolean).
From http://aws.amazon.com/dynamodb/faqs/:
Q: Is querying JSON data in DynamoDB any different?
No. You can create a Global Secondary Index or Local Secondary Index
on any top-level JSON element. For example, suppose you stored a JSON
document that contained the following information about a person:
First Name, Last Name, Zip Code, and a list of all of their friends.
First Name, Last Name and Zip code would be top-level JSON elements.
You could create an index to let you query based on First Name, Last
Name, or Zip Code. The list of friends is not a top-level element,
therefore you cannot index the list of friends. For more information
on Global Secondary Indexing and its query capabilities, see the
Secondary Indexes section in this FAQ.
Q: What data types can be indexed?
All scalar data types (Number, String, Binary, and Boolean) can be
used for the range key element of the local secondary index key. Set,
list, and map types cannot be indexed.
I have tried computing hash(str(object)) and storing it alongside the object, which I store separately. This hash gives me an integer (Number), and I am able to use a secondary index on it. Below is a sample in Python. It is important to use a hash function that generates the same hash every time for a given value (Python's built-in hash() is salted per process for strings), so I am using SHA-1.
# Generate a small integer hash:
import hashlib

def hash_8_digits(source):
    # SHA-1 is deterministic across runs; keep only the last 8 decimal digits
    return int(hashlib.sha1(source.encode()).hexdigest(), 16) % (10 ** 8)
The idea is to keep the indexed value small while leaving the entity intact. That is, rather than serializing the object to a string and changing the whole way the object is used, I store a smaller hash value along with the actual list or map.
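A sketch of how the item might be prepared before writing it, using plain dicts and no AWS SDK calls. The attribute names and the `with_index_attribute` helper are hypothetical, and json.dumps with sort_keys=True is swapped in for str() so that the string form does not depend on key order:

```python
import hashlib
import json


def hash_8_digits(source: str) -> int:
    """Deterministic small integer hash (last 8 decimal digits of SHA-1)."""
    return int(hashlib.sha1(source.encode()).hexdigest(), 16) % (10 ** 8)


def with_index_attribute(item: dict, nested_name: str) -> dict:
    """Return a copy of item with a top-level numeric hash of one nested attribute."""
    # Canonical string form: the same map always yields the same text.
    canonical = json.dumps(item[nested_name], sort_keys=True)
    indexed = dict(item)
    indexed[nested_name + "_hash"] = hash_8_digits(canonical)
    return indexed


item = {"id": "user-1", "address": {"city": "Oslo", "zip": "0150"}}
to_store = with_index_attribute(item, "address")

# The nested map is untouched; the new top-level Number is what gets indexed.
print(sorted(to_store))  # ['address', 'address_hash', 'id']
```

Because the hash is reduced to 8 digits, different values can collide, so a query on the index would still need to compare the stored object itself.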
I need to know an efficient mechanism for handling data structures in socket programming. Let's consider the example of car manufacturing on an assembly line.
Initially the conveyor is empty, then I start adding different parts dynamically. How can I transmit my data to the server using TCP/UDP? What can I do so that my server can recognize when I add a new part dynamically? After its calculation, the server should return the data to the client in the same structure, so that the client can put the calculated data at the exact position of the component.
Is it possible to arrange this data using a B-tree or B+ tree structure? Is it possible to reconstruct the same tree on the server side? What other alternative approaches are possible?
You need to serialize whatever data you need to send to the server into some text or binary blob. Yes, it's possible to serialize an interrelated data structure, e.g. by assigning an ID to each item and then referencing items by that ID. For C++ serialization I would recommend having a look at Boost.Serialization.
The simplest ID is the memory address on the serializer (sender) side, a kind of unique identifier that is ready to use. Of course, on the deserializer side it must be treated as just an ID and not as a memory address.
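A rough illustration of the ID approach in Python rather than C++/Boost. The Part class and the JSON layout are invented for this sketch, and id() is used because in CPython it is the object's memory address:

```python
import json


class Part:
    """One component on the conveyor; children are the parts attached to it."""

    def __init__(self, name):
        self.name = name
        self.children = []


def serialize(root: Part) -> str:
    """Flatten the structure into records that refer to each other by ID."""
    records, seen, stack = [], set(), [root]
    while stack:
        part = stack.pop()
        if id(part) in seen:  # shared parts are written only once
            continue
        seen.add(id(part))
        records.append({
            "id": id(part),  # sender-side address, used purely as an ID
            "name": part.name,
            "children": [id(child) for child in part.children],
        })
        stack.extend(part.children)
    return json.dumps({"root": id(root), "parts": records})


def deserialize(text: str) -> Part:
    """Rebuild the structure; the IDs are opaque keys here, not addresses."""
    data = json.loads(text)
    parts = {record["id"]: Part(record["name"]) for record in data["parts"]}
    for record in data["parts"]:
        parts[record["id"]].children = [parts[child_id] for child_id in record["children"]]
    return parts[data["root"]]


car = Part("car")
engine, wheel = Part("engine"), Part("wheel")
engine.children.append(Part("piston"))
car.children.extend([engine, wheel])

rebuilt = deserialize(serialize(car))
print([child.name for child in rebuilt.children])  # ['engine', 'wheel']
```

The string returned by serialize is what would go over the socket. If the server keeps the IDs in its reply, the client can match each result back to the original component.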