I'm using the XLA C++ API, and I've managed to run a simple addition, but I've no idea if I'm doing it right. There seem to be an awful lot of classes that I've not used. Here's my example
auto builder = new XlaBuilder("XlaBuilder");
auto one = ConstantR0(builder, 1);
auto two = ConstantR0(builder, 2);
auto res = one + two;
ValueInferenceMode value_inf_mode = ValueInferenceMode::kValue; // analyze the actual value
auto value_inf = new ValueInference(builder);
auto lit = value_inf
->AnalyzeConstant(res, value_inf_mode)
->GetValue()
->Clone();
// I'm using `untyped_data` because I can't express arbitrary array types.
// I guess I could use `data<int32>` in this simple case
auto data = lit.untyped_data();
std::cout << ((int32*) data)[0] << std::endl; // prints 3
I suspect I didn't actually run that computation through XLA. Here's a different approach based on a sample harness in the XLA source code
XlaComputation computation = res.builder()->Build().ConsumeValueOrDie();
ExecutionProfile profile;
Literal lit = ClientLibrary::LocalClientOrDie()
->ExecuteAndTransfer(computation, {}, nullptr, &profile)
.ConsumeValueOrDie();
data = lit.untyped_data();
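For this scalar case, the typed accessor mentioned in the comment above avoids the manual cast (a minimal sketch, assuming the returned literal really holds a rank-0 S32 value):
// data<int32>() yields a typed span over the literal's elements,
// so no cast from untyped_data() is needed.
std::cout << lit.data<int32>()[0] << std::endl; // prints 3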
I've been hacking an online game client, which usually consumes 1 GB+ of memory at runtime.
For example, I want to find a specific string in the client's memory, using both Cheat Engine and the native API ReadProcessMemory().
When using Cheat Engine, it takes less than one second to find candidate addresses of the string.
However, when using ReadProcessMemory(), it takes more than 60 seconds to traverse all memory regions in the client's memory. Even when the code is injected into the target process, it takes 10 to 20 seconds.
The question is: why can Cheat Engine scan memory so fast? Judging by Cheat Engine's memory usage, it does not read a whole memory region at a time (which would reduce the number of calls to ReadProcessMemory()).
Below is my actual code; its purpose is to traverse the memory and find the Python object with type "UIRoot". mrg means memory region (std::pair<uint64_t base, uint64_t size>). The executable is built with the -O2 option. It works, but runs slowly.
#pragma omp parallel for
for (int i = 0; i < mrgs.size(); i++) {
    auto& mrg = mrgs[i];
    // Walk the region in 8-byte strides, one ReadBytes (i.e. one
    // ReadProcessMemory) call per candidate pointer.
    for (uint64_t o = 8; o < mrg.second; o += 8) {
        auto toab = memory_reader_->ReadBytes(mrg.first + o, 8);
        if (toab) {
            // Interpret the 8 bytes as a pointer, then read the type-name
            // pointer stored 24 bytes past it.
            auto toa = Convert::ToInt64(toab->Raw(), 0);
            auto tonab = memory_reader_->ReadBytes(toa + 24, 8);
            if (tonab) {
                auto tona = Convert::ToInt64(tonab->Raw(), 0);
                auto ton = ReadNullTerminatedAsciiStringFromAddressUpTo255(tona, 7);
                if (ton == "UIRoot") {
                    // do something
                }
            }
        }
    }
}
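For comparison, the usual way to cut this overhead is to copy each region into the local process with one large ReadProcessMemory call and then scan the local copy, instead of issuing one call per 8 bytes. A minimal sketch of that pattern (hProcess, a handle to the game client, is assumed; error handling is illustrative):

#include <windows.h>
#include <cstdint>
#include <vector>

// Copy one whole region into a local buffer with a single
// ReadProcessMemory call; the pointer-chasing scan can then
// run entirely on local memory.
std::vector<uint8_t> ReadWholeRegion(HANDLE hProcess, uint64_t base, uint64_t size) {
    std::vector<uint8_t> buf(size);
    SIZE_T bytesRead = 0;
    if (!ReadProcessMemory(hProcess, reinterpret_cast<LPCVOID>(base),
                           buf.data(), size, &bytesRead)) {
        bytesRead = 0;  // region may be unreadable or partially released
    }
    buf.resize(bytesRead);
    return buf;
}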
We have an urban legend regarding chained Armadillo operations having "issues". Here's a comment from a recent code change:
// calculate coefficients
// Note: yes, we could write this more efficiently in one line of code.
// e.g., a = (s.t() * s).i() * s.t() * p
// However we have had exception issues with armadillo that seem to have been solved
// by un-chaining blocks of code, which we will do here:
The code, as implemented, was functionally equivalent to the chained version, due to the use of auto:
auto st = s.t();
auto sts = st * s;
auto stsi = sts.i();
auto stsist = stsi * st;
arma::vec a = stsist * p;
This works fine in a single-threaded program. However, when running multiple threads (each thread operating on its own instances, so there should be no concurrency issues), the final statement hangs.
The fix is to explicitly assign the intermediate steps to arma::mat:
arma::mat st = s.t();
arma::mat sts = st * s;
arma::mat stsi = sts.i();
arma::mat stsist = stsi * st;
arma::vec a = stsist * p;
Now, all threads run just fine.
What is going on that Armadillo can't do chained operations on different objects concurrently?
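For reference, with auto the intermediates deduce to Armadillo's lazy expression-template types rather than arma::mat, so nothing is evaluated until the final assignment. A minimal way to observe the difference (the printed type names are compiler-mangled):

#include <armadillo>
#include <iostream>
#include <typeinfo>

int main() {
    arma::mat s(4, 3, arma::fill::randu);
    auto st_lazy = s.t();      // deduces to an expression template, not a matrix
    arma::mat st_mat = s.t();  // forces evaluation into a real matrix
    std::cout << typeid(st_lazy).name() << "\n"
              << typeid(st_mat).name() << std::endl;
    return 0;
}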
I am running a Monte Carlo simulation of a polymer. The entire configuration of the current state of the system is given by the object called Grid. This is my definition of Grid:
class Grid{
public:
std::vector <Polymer> PolymersInGrid; // all the polymers in the grid
int x; // length of x-edge of grid
int y; // length of y-edge of grid
int z; // length of z-edge of grid
double kT; // energy factor
double Emm_n ; // monomer-monomer interaction when not aligned
double Emm_a ; // monomer-monomer interaction when aligned
double Ems; // monomer-solvent interaction
double Energy; // energy of grid
std::map <std::vector <int>, Particle> OccupancyMap; // a map that gives the particle given the location
Grid(int xlen, int ylen, int zlen, double kT_, double Emm_a_, double Emm_n_, double Ems_): x (xlen), y (ylen), z (zlen), kT (kT_), Emm_n(Emm_n_), Emm_a (Emm_a_), Ems (Ems_) { // Constructor of class
// this->instantiateOccupancyMap();
};
// Destructor of class
~Grid(){
};
// assignment operator that allows for a correct transfer of properties. Important to functioning of program.
Grid& operator=(Grid other){
std::swap(PolymersInGrid, other.PolymersInGrid);
std::swap(Energy, other.Energy);
std::swap(OccupancyMap, other.OccupancyMap);
return *this;
}
.
.
.
};
I can go into the details of the object Polymer and Particle, if required.
In my driver code, this is what I am doing:
1. Define the maximum number of iterations.
2. Define a complete Grid G.
3. Create a copy of G called G_.
4. Perturb the configuration of G_.
5. If the perturbation of G_ is accepted per the Metropolis criterion, assign G_ to G (G = G_).
6. Repeat steps 4 and 5 until the maximum number of iterations is reached.
This is my driver code:
auto start = std::chrono::high_resolution_clock::now();
Grid G_ (G);
int acceptance_count = 0;
for (int i{1}; i< (Nmov+1); i++){
// choose a move
G_ = MoveChooser(G, v);
if ( MetropolisAcceptance (G.Energy, G_.Energy, G.kT) ) {
// accepted
// replace old config with new config
acceptance_count++;
std::cout << "Number of acceptances is " << acceptance_count << std::endl;
G = G_;
}
else {
// continue;
}
if (i % dfreq == 0){
G.dumpPositionsOfPolymers (i, dfile) ;
G.dumpEnergyOfGrid(i, efile, call) ;
}
// G.PolymersInGrid.at(0).printChainCoords();
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds> (stop-start);
std::cout << "\n\nTime taken for simulation: " << duration.count() << " milliseconds" << std::endl;
This is the interesting part: if I run the simulation under conditions that do not produce many acceptances (low temperatures, bad solvent), the simulation runs pretty fast. However, if there is a large number of acceptances, the simulation gets incredibly slow. My hypothesis is that my assignment operator = is slowing down my simulation.
I ran some tests:
number of acceptances = 25365, wall-clock time = 717770 milliseconds (!)
number of acceptances = 2165, wall-clock time = 64412 milliseconds
number of acceptances = 3000, wall-clock time = 75550 milliseconds
And the trend continues.
Could anyone advise me on how to make this more efficient? Is there a way to bypass the slowdown I am experiencing, which I think is due to the = operator?
I would really appreciate any advice you have for me!
One thing that you can certainly do to improve performance is to force moving G_ rather than copying it to G:
G = std::move(G_);
After all, at this stage you don't need G_ any more.
Side remark: the fact that you don't need to copy all member data in operator= indicates that your design of Grid is far from perfect, but, well, keep it if the program is small and you're sure you control everything. Anyway, rather than using operator=, you should define and use a member function with a meaningful name, like "fast_and_dirty_swap", etc. :-) Then you can define operator= the way suggested by @Jarod42, that is, using = default.
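A minimal sketch of that split, to be placed inside Grid, assuming its remaining members are safe to copy by default:

// Defaulted assignment copies every member, as a reader would expect.
Grid& operator=(const Grid& other) = default;

// The fast partial exchange keeps its old behavior under an honest name.
void fast_and_dirty_swap(Grid& other) {
    std::swap(PolymersInGrid, other.PolymersInGrid);
    std::swap(Energy, other.Energy);
    std::swap(OccupancyMap, other.OccupancyMap);
}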
An alternative approach, which I used before C++11, is to operate on pointers. In this scenario one would have two Grids, one "real" and one treated as a buffer, or sandbox, and on acceptance one would simply swap the pointers, so that the "buffer" filled by MoveChooser would become the real, current Grid.
In pseudocode:
Create two buffers, previous and current, each capable of storing a simulation state
Initialize current
Create two pointers, p_prev = &previous, p_curr = &current
For as many steps as you wish
compute the next state from *p_curr and store it in *p_prev (e.g. monte_carlo_step(p_curr, p_prev))
swap the pointers: now the current system state is at p_curr and the previous at p_prev.
analyze the results stored at *p_curr
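A minimal C++ sketch of this double-buffer scheme; monte_carlo_step is a hypothetical stand-in for the MoveChooser/Metropolis logic above, assumed to return true on acceptance:

#include <utility>  // std::swap

// Hypothetical step function: reads the current state from *curr,
// writes the proposed state into *next, returns true on acceptance.
bool monte_carlo_step(const Grid* curr, Grid* next);

void run(Grid& previous, Grid& current, int n_steps) {
    Grid* p_prev = &previous;
    Grid* p_curr = &current;
    for (int i = 0; i < n_steps; ++i) {
        if (monte_carlo_step(p_curr, p_prev)) {
            // Accepted: swap the pointers instead of copying whole Grids.
            std::swap(p_prev, p_curr);
        }
        // analyze the results stored at *p_curr ...
    }
}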
I'm trying to optimize the performance of a C++ program using the TBB library.
My program only contains a couple of small for loops, so I know it can be a challenge to optimize time complexity in this case, but I have to use TBB.
As such, I tried to use a partitioner, which made the TBB version twice as fast as without the partitioner, but it's still slower than the original program without any parallelism.
In my code, I print the id when a loop iteration starts and ends, to see if there is parallelism. The output shows that the loop is in fact executed sequentially, for example: start 1 end 1, start 2 end 2, etc. (the list has 200 elements). The output of the ids isn't interleaved as you would expect from a parallelized program.
Here is an example of how I used the library:
tbb::global_control c(tbb::global_control::max_allowed_parallelism, 1000);
size_t grainsize = 1000;
size_t changes = 0;
tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, list.size(), grainsize),
    [&](const tbb::blocked_range<std::size_t>& r) {
        for (size_t id = r.begin(); id < r.end(); ++id) {
            std::cout << "start:" << id << std::endl;
            double disto = std::numeric_limits<double>::max();
            size_t cluster_id = 0;
            const Point& point = points.at(id);
            for (size_t i = 0; i < short_list.size(); i++) {
                const Point& origin = originss[i];
                double disto2 = point.dist(origin);
                if (disto2 < disto) {
                    disto = disto2;
                    cluster_id = i;
                }
            }
            if (m[id] != cluster_id) {
                m[id] = cluster_id;
                changes++;  // shared counter: would need an atomic if iterations ran in parallel
            }
            disto_list[id] = disto;
            std::cout << "end:" << id << std::endl;
        }
    }
);
Is there a way to improve the performance of a C++ program composed of multiple small for loops using the TBB library? And why are the loops not parallelized?
If you are using task_scheduler_init in your program, then TBB uses the same number of threads throughout the program, until the task_scheduler_init objects are destroyed.
As you are passing max_allowed_parallelism as a parameter to global_control: if it is set to 1, your application will run sequentially.
You can refer to the below link:
https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html
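For illustration, a minimal sketch of that effect (the limit of 1 here is deliberate):

#include <tbb/global_control.h>
#include <tbb/parallel_for.h>

int main() {
    // Cap TBB at one worker thread: every parallel_for below
    // will effectively execute sequentially.
    tbb::global_control limit(tbb::global_control::max_allowed_parallelism, 1);
    tbb::parallel_for(0, 200, [](int i) {
        // loop body runs on a single thread
    });
    return 0;
}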
It would be helpful if you could provide a complete reproducer, so we can figure out where exactly the issue occurs.
I decided to figure out which of Protobuf, FlatBuffers and Cap'n Proto would be the best/fastest serialization for my application: in my case, sending some kind of byte/char array over a network (the reason I serialize to that format). So I made simple implementations for all three, where I serialize and deserialize a string, a float and an int. This gave unexpected results: Protobuf was the fastest. I call them unexpected since both Cap'n Proto and FlatBuffers "claim" to be the faster options. Before I accept this, I would like to see if I unintentionally cheated in my code somehow. If I did not cheat, I would like to know why Protobuf is faster (exactly why is probably impossible to know). Could the messages be too simple for Cap'n Proto and FlatBuffers to really make them shine?
My timings:
Time taken flatbuffers: 14162 microseconds
Time taken capnp: 60259 microseconds
Time taken protobuf: 12131 microseconds
(time from one machine. Relative comparison might be relevant.)
UPDATE: The above numbers are not representative of CORRECT usage, at least not for capnp -- see answers & comments.
flatbuffer code:
int main (int argc, char *argv[]){
std::string s = "string";
float f = 3.14;
int i = 1337;
std::string s_r;
float f_r;
int i_r;
flatbuffers::FlatBufferBuilder message_sender;
int steps = 10000;
auto start = high_resolution_clock::now();
for (int j = 0; j < steps; j++){
auto autostring = message_sender.CreateString(s);
auto encoded_message = CreateTestmessage(message_sender, autostring, f, i);
message_sender.Finish(encoded_message);
uint8_t *buf = message_sender.GetBufferPointer();
int size = message_sender.GetSize();
message_sender.Clear();
//Send stuffs
//Receive stuffs
auto received_message = GetTestmessage(buf);
s_r = received_message->string_()->str();
f_r = received_message->float_();
i_r = received_message->int_();
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
cout << "Time taken flatbuffer: " << duration.count() << " microseconds" << endl;
return 0;
}
cap'n proto code:
int main (int argc, char *argv[]){
char s[] = "string";
float f = 3.14;
int i = 1337;
const char * s_r;
float f_r;
int i_r;
::capnp::MallocMessageBuilder message_builder;
Testmessage::Builder message = message_builder.initRoot<Testmessage>();
int steps = 10000;
auto start = high_resolution_clock::now();
for (int j = 0; j < steps; j++){
//Encoding
message.setString(s);
message.setFloat(f);
message.setInt(i);
kj::Array<capnp::word> encoded_array = capnp::messageToFlatArray(message_builder);
kj::ArrayPtr<char> encoded_array_ptr = encoded_array.asChars();
char * encoded_char_array = encoded_array_ptr.begin();
size_t size = encoded_array_ptr.size();
//Send stuffs
//Receive stuffs
//Decoding
kj::ArrayPtr<capnp::word> received_array = kj::ArrayPtr<capnp::word>(reinterpret_cast<capnp::word*>(encoded_char_array), size/sizeof(capnp::word));
::capnp::FlatArrayMessageReader message_receiver_builder(received_array);
Testmessage::Reader message_receiver = message_receiver_builder.getRoot<Testmessage>();
s_r = message_receiver.getString().cStr();
f_r = message_receiver.getFloat();
i_r = message_receiver.getInt();
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
cout << "Time taken capnp: " << duration.count() << " microseconds" << endl;
return 0;
}
protobuf code:
int main (int argc, char *argv[]){
std::string s = "string";
float f = 3.14;
int i = 1337;
std::string s_r;
float f_r;
int i_r;
Testmessage message_sender;
Testmessage message_receiver;
int steps = 10000;
auto start = high_resolution_clock::now();
for (int j = 0; j < steps; j++){
message_sender.set_string(s);
message_sender.set_float_m(f);
message_sender.set_int_m(i);
int len = message_sender.ByteSize();
char encoded_message[len];
message_sender.SerializeToArray(encoded_message, len);
message_sender.Clear();
//Send stuffs
//Receive stuffs
message_receiver.ParseFromArray(encoded_message, len);
s_r = message_receiver.string();
f_r = message_receiver.float_m();
i_r = message_receiver.int_m();
message_receiver.Clear();
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
cout << "Time taken protobuf: " << duration.count() << " microseconds" << endl;
return 0;
}
I'm not including the message definition files since they are simple and most likely have nothing to do with it.
In Cap'n Proto, you should not reuse a MessageBuilder for multiple messages. The way you've written your code, every iteration of your loop will make the message bigger, because you're actually adding on to the existing message rather than starting a new one. To avoid memory allocation with each iteration, you should pass a scratch buffer to MallocMessageBuilder's constructor. The scratch buffer can be allocated once outside the loop, but you need to create a new MallocMessageBuilder each time around the loop. (Of course, most people don't bother with scratch buffers and just let MallocMessageBuilder do its own allocation, but if you choose that path in this benchmark, then you should also change the Protobuf benchmark to create a new message object for every iteration rather than reusing a single object.)
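A minimal sketch of that scratch-buffer pattern (the buffer size is illustrative; it just needs to cover the typical message):

// Scratch space allocated once, outside the loop, zero-initialized
// as the builder expects.
capnp::word scratch[1024] = {};
for (int j = 0; j < steps; j++) {
    // A fresh builder per message, reusing the same scratch buffer.
    ::capnp::MallocMessageBuilder message_builder(kj::arrayPtr(scratch, 1024));
    Testmessage::Builder message = message_builder.initRoot<Testmessage>();
    message.setString(s);
    message.setFloat(f);
    message.setInt(i);
    // ... serialize / send as before ...
}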
Additionally, your Cap'n Proto code is using capnp::messageToFlatArray(), which allocates a whole new buffer to put the message into and copies the entire message over. This is not the most efficient way to use Cap'n Proto. Normally, if you were writing the message to a file or socket, you would write directly from the message's original backing buffer(s) without making this copy. Try doing this instead:
kj::ArrayPtr<const kj::ArrayPtr<const capnp::word>> segments =
message_builder.getSegmentsForOutput();
// Send segments
// Receive segments
capnp::SegmentArrayMessageReader message_receiver_builder(segments);
Or, to make things more realistic, you could write the message out to a pipe and read it back in, using capnp::writeMessageToFd() and capnp::StreamFdMessageReader. (To be fair, you would need to make the protobuf benchmark write to / read from a pipe as well.)
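A minimal sketch of that pipe round-trip (POSIX pipe(); error handling omitted):

#include <unistd.h>
#include <capnp/serialize.h>

// Assumes message_builder holds the message built in the benchmark loop.
void roundTripThroughPipe(::capnp::MallocMessageBuilder& message_builder) {
    int fds[2];
    pipe(fds);  // fds[0] = read end, fds[1] = write end
    capnp::writeMessageToFd(fds[1], message_builder);
    close(fds[1]);
    capnp::StreamFdMessageReader message_reader(fds[0]);
    Testmessage::Reader received = message_reader.getRoot<Testmessage>();
    // read fields from `received` here
    close(fds[0]);
}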
(I'm the author of Cap'n Proto and Protobuf v2. I'm not familiar with FlatBuffers so I can't comment on whether that code has any similar issues...)
On benchmarks
I've spent a lot of time benchmarking Protobuf and Cap'n Proto. One thing I've learned in the process is that most simple benchmarks you can create will not give you realistic results.
First, any serialization format (even JSON) can "win" given the right benchmark case. Different formats will perform very, very differently depending on the content. Is it string-heavy, number-heavy, or object-heavy (i.e. with deep message trees)? Different formats have different strengths here (Cap'n Proto is incredibly good at numbers, for example, because it doesn't transform them at all; JSON is incredibly bad at them). Is your message size incredibly short, medium-length, or very large? Short messages will mostly exercise the setup/teardown code rather than body processing (but setup/teardown is important -- sometimes real-world use cases involve lots of small messages!). Very large messages will bust the L1/L2/L3 cache and tell you more about memory bandwidth than parsing complexity (but again, this is important -- some implementations are more cache-friendly than others).
Even after considering all that, you have another problem: Running code in a loop doesn't actually tell you how it performs in the real world. When run in a tight loop, the instruction cache stays hot and all the branches become highly predictable. So a branch-heavy serialization (like protobuf) will have its branching cost swept under the rug, and a code-footprint-heavy serialization (again... like protobuf) will also get an advantage. This is why micro-benchmarks are only really useful to compare code against other versions of itself (e.g. to test minor optimizations), NOT to compare completely different codebases against each other. To find out how any of this performs in the real world, you need to measure a real-world use case end-to-end. But... to be honest, that's pretty hard. Few people have the time to build two versions of their whole app, based on two different serializations, to see which one wins...