What techniques does cheat engine use for scanning memory? - c++

I've been hacking an online game client, which usually consumes 1GB+ memory at runtime.
For example, I want to find a specific string in the client's memory, using both Cheat Engine and the native API ReadProcessMemory().
When using Cheat Engine, it takes less than one second to find candidate addresses of the string;
however, when using ReadProcessMemory(), it takes more than 60 seconds to traverse all memory regions in the client's memory. Even when the code is injected into the target process, it takes up to 10 to 20 seconds.
The question is: why can Cheat Engine scan memory so fast? According to Cheat Engine's memory usage, it does not read a whole memory region at one time (which often reduces calls to ReadProcessMemory()).
Below is my actual code; basically its purpose is to traverse the memory and find the Python object with type "UIRoot". mrg means memory region (std::pair<uint64_t base, uint64_t size>). The executable is built with the -O2 option. It works, but runs slowly.
#pragma omp parallel for
for (int i = 0; i < mrgs.size(); i++) {
    auto& mrg = mrgs[i];                        // mrg.first = base, mrg.second = size
    for (uint64_t o = 8; o < mrg.second; o += 8) {
        // candidate object pointer: one remote 8-byte read per word of the region
        auto toab = memory_reader_->ReadBytes(mrg.first + o, 8);
        if (toab) {
            auto toa = Convert::ToInt64(toab->Raw(), 0);
            // type-object pointer lives at +24: a second remote read
            auto tonab = memory_reader_->ReadBytes(toa + 24, 8);
            if (tonab) {
                auto tona = Convert::ToInt64(tonab->Raw(), 0);
                // type name string: a third remote read
                auto ton = ReadNullTerminatedAsciiStringFromAddressUpTo255(tona, 7);
                if (ton == "UIRoot") {
                    //do something
                }
            }
        }
    }
}
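The same object search run against a per-region snapshot, as an added example that is not the question's code: each region is copied once into a local buffer and the pointer chase (word, then +24, then the name) is resolved inside the snapshot, so the three remote reads per candidate word disappear. This is the snapshot-then-scan approach memory-forensics tooling uses as well. Region, Snapshot, find_objects and the constructed 256-byte sample image are this example's own assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// One region of the target, read once: base address plus a private copy of its bytes.
struct Region { uint64_t base; std::vector<uint8_t> bytes; };

struct Snapshot {
    std::vector<Region> regions;
    // Resolve [addr, addr+len) inside the snapshot, never touching the live process.
    const uint8_t* at(uint64_t addr, size_t len) const {
        for (const auto& r : regions)
            if (addr >= r.base && addr + len <= r.base + r.bytes.size())
                return r.bytes.data() + (addr - r.base);
        return nullptr;
    }
};

static uint64_t rd64(const uint8_t* p) { uint64_t v; std::memcpy(&v, p, 8); return v; }

// Same search as the question's loop: every 8-byte word is a candidate object pointer,
// the type-object pointer sits at +24, and that object's name is compared with type_name.
// All three dereferences are served from the snapshot instead of three remote reads.
std::vector<uint64_t> find_objects(const Snapshot& s, const std::string& type_name) {
    std::vector<uint64_t> hits;
    for (const auto& r : s.regions) {
        for (size_t o = 8; o + 8 <= r.bytes.size(); o += 8) {
            uint64_t obj = rd64(r.bytes.data() + o);
            const uint8_t* tp = s.at(obj + 24, 8);
            if (!tp) continue;
            const uint8_t* name = s.at(rd64(tp), type_name.size() + 1);
            if (name && std::memcmp(name, type_name.c_str(), type_name.size() + 1) == 0)
                hits.push_back(r.base + o);   // address of the slot holding the pointer
        }
    }
    return hits;
}

// Constructed sample image (not a real dump): one 256-byte region at 0x10000 holding
// the name "UIRoot" at 0x10080, an object at 0x10040 whose +24 points at the name,
// and a slot at 0x10008 that points at the object.
Snapshot make_sample() {
    Region r{0x10000, std::vector<uint8_t>(256, 0)};
    std::memcpy(r.bytes.data() + 0x80, "UIRoot", 7);
    uint64_t name_ptr = 0x10080, obj = 0x10040;
    std::memcpy(r.bytes.data() + 0x40 + 24, &name_ptr, 8);
    std::memcpy(r.bytes.data() + 0x08, &obj, 8);
    Snapshot s;
    s.regions.push_back(std::move(r));
    return s;
}
```

On a real target the regions would be filled by one bulk read per region; the scan logic itself does not change.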

Related

Performance of C++ program is 2x slower with the use of TBB

I'm trying to optimize the performance of a C++ program by using the TBB library.
My program only contains a couple of small for loops, so I know it can be a challenge to optimize time complexity in this case, but I have to use TBB.
As such, I tried to use a partitioner, which made the program 2 times faster with TBB than without the partitioner, but it's still slower than the original program without the use of parallelism.
In my code, I print when a loop starts and ends with the id to see if there is parallelism. The output shows that the loop is in fact executed sequentially, for example: start 1 end 1, start 2 end 2, etc. (it's a list of size 200). The output of the ids isn't random like you would expect from a parallelized program.
Here is an example of how I used the library:
tbb::global_control c(tbb::global_control::max_allowed_parallelism, 1000);
size_t grainsize = 1000;   // larger than list.size() == 200, so the range is never split
size_t changes = 0;        // shared counter written from the body: a data race once several chunks run
tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, list.size(), grainsize),
    [&](const tbb::blocked_range<std::size_t>& r) {
        for (size_t id = r.begin(); id < r.end(); ++id) {
            std::cout << "start:" << id << std::endl;
            double disto = std::numeric_limits<double>::max();
            size_t cluster_id = 0;
            const Point& point = points.at(id);
            // nearest origin in short_list
            for (size_t i = 0; i < short_list.size(); i++) {
                const Point& origin = short_list[i];
                double disto2 = point.dist(origin);
                if (disto2 < disto) {
                    disto = disto2;
                    cluster_id = i;
                }
            }
            if (m[id] != cluster_id) {   // assignment changed since the last pass
                m[id] = cluster_id;
                changes++;
            }
            disto_list[id] = disto;
            std::cout << "end:" << id << std::endl;
        }
    }
);
Is there a way to improve the performance of a C++ program composed of multiple small for loops with the use of the TBB library? And why are the loops not parallelized?
If you are using task_scheduler_init in your program, then TBB uses the same thread throughout the program until the task_scheduler_init objects are destroyed.
As you are passing max_allowed_parallelism as a parameter for global_control, if it is set to 1 then it will make your application run in a sequential way.
You can refer to the link below:
https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html
It would be helpful if you provided the complete reproducer to figure out where exactly the issue takes place.
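An added illustration of the range-splitting rule that decides how many body invocations parallel_for can hand out (example code, not from the question or the answer): a blocked_range is divisible only while size() > grainsize and is split at its midpoint, so the question's 200 items with grainsize 1000 form one chunk and run as a single body call whatever max_allowed_parallelism says. count_chunks and chunks_in_question are this example's own names.

```cpp
#include <algorithm>
#include <cstddef>

// Mirror of tbb::blocked_range's splitting rule: a range is divisible only while
// its size exceeds the grainsize, and a divisible range is split at its midpoint.
// Returns how many leaf sub-ranges (body invocations) parallel_for can produce.
std::size_t count_chunks(std::size_t size, std::size_t grainsize) {
    if (size <= grainsize) return 1;          // is_divisible() == false
    std::size_t left = size / 2;              // [begin, mid) and [mid, end)
    return count_chunks(left, grainsize) + count_chunks(size - left, grainsize);
}

// Upper bound on the number of threads that can be busy at once: no more threads
// than chunks, no more chunks than the scheduler is allowed threads.
std::size_t max_concurrency(std::size_t size, std::size_t grainsize, std::size_t threads) {
    return std::min(count_chunks(size, grainsize), threads);
}

// The question's setting: 200 points with grainsize 1000 yields exactly one chunk.
std::size_t chunks_in_question() { return count_chunks(200, 1000); }
```

A grainsize of, say, 50 would split the 200 items into four chunks and give four threads something to do.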

What is the fastest way to initialize newly allocated memory in C++?

I'm trying to allocate and use 100 MB of data (size_t s = 100 * 1024 * 1024) and measured different ways in C++:
Raw allocation without init:
// < 0.01 ms
auto p = new char[s];
C++ zero-init:
// 20 ms
auto p = new char[s]();
Manual zero-init:
// 20 ms
auto p = new char[s];
for (size_t i = 0; i < s; ++i)
    p[i] = 0;
This is not a limitation of my memory as demonstrated by writing the memory again:
// 3 ms
std::memset(p, 0xFF, s);
I also tried std::malloc and std::calloc but they show the same behavior. calloc returns memory that is zero-initialised but if I do a memset afterwards that still takes 20 ms.
As far as I understand it, uninitialized memory is fast to allocate because it doesn't actually touch the memory. Only when I access it are the pages allocated to my program. The 3 ms for setting 100 MB correspond to ~35GB/s, which is OK-ish. The 20 ms seem to be overhead from triggering page faults.
The funny thing is that this seems to be compute overhead. If I initialize it with multiple threads it gets faster:
// 6-10 ms
auto p = new char[s];
#pragma omp parallel for
for (size_t i = 0; i < s; ++i)
    p[i] = 0;
My question: is there a way to not only allocate memory but also immediately allocate all pages so that no further page faults arise when accessing it?
I would like to avoid using huge pages if possible.
(Measurement with std::chrono::high_resolution_clock)
This was done on my desktop system (5Ghz i9, 3600 MHz DDR4, Linux Mint 19, 4.15.0-45-generic kernel) with Clang 7 (-O2 -march=native) though, looking at the assembly, the compiler is not the problem.
EDIT: This is a simplified example, in my actual application I need to initialize it with a different value than 0 but that doesn't change the timing at all.
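One way to have all pages faulted in at allocation time on Linux, as an added example rather than anything from the question: an anonymous mmap with MAP_POPULATE, which makes the kernel materialise (and zero) every page during the mmap() call itself, so the first write over the buffer no longer takes one page fault per page. alloc_anon, fill_ms, checksum and selftest are this example's own names; the measured milliseconds are for comparison only.

```cpp
#include <sys/mman.h>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#ifndef MAP_POPULATE
#define MAP_POPULATE 0   // not Linux: the flag becomes a no-op and pages still fault lazily
#endif
// Anonymous zero-filled mapping. With prefault == true the kernel faults every page in
// during mmap() itself (MAP_POPULATE), so the first pass over the buffer does not take
// one page fault per 4 KiB page. Returns nullptr on failure; release with munmap().
char* alloc_anon(std::size_t size, bool prefault) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | (prefault ? MAP_POPULATE : 0);
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}

// Fill the buffer and return the elapsed time in milliseconds.
double fill_ms(char* p, std::size_t size, unsigned char value) {
    auto t0 = std::chrono::steady_clock::now();
    std::memset(p, value, size);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Byte sum: verifies a fill and keeps the compiler from eliding it.
std::uint64_t checksum(const char* p, std::size_t size) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < size; ++i) s += static_cast<unsigned char>(p[i]);
    return s;
}
// Self-check on a small prefaulted mapping: fresh pages read as zero and a fill is visible.
bool selftest(std::size_t n) {
    char* p = alloc_anon(n, true);
    if (!p) return false;
    bool ok = checksum(p, n) == 0;
    fill_ms(p, n, 1);
    ok = ok && checksum(p, n) == n;
    munmap(p, n);
    return ok;
}
int main() {
    const std::size_t s = 100 * 1024 * 1024;   // the question's 100 MB
    char* lazy = alloc_anon(s, false);
    char* eager = alloc_anon(s, true);
    if (!lazy || !eager) return 1;
    double lazy_ms = fill_ms(lazy, s, 0x5A);    // pays the page faults here
    double eager_ms = fill_ms(eager, s, 0x5A);  // pages were populated by mmap()
    std::printf("first write: lazy %.2f ms, prefaulted %.2f ms\n", lazy_ms, eager_ms);
    munmap(lazy, s);
    munmap(eager, s);
    return 0;
}
```

The zeroing work moves into the mmap() call rather than vanishing; what is removed is the per-page fault overhead on first touch.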

MongoDB C driver efficiency

I'm trying to write a program whose job it is to go into shared memory, retrieve a piece of information (a struct 56 bytes in size), then parse that struct lightly and write it to a database.
The catch is that it needs to do this several dozens of thousands of times per second. I'm running this on a dedicated Ubuntu 14.04 server with dual Xeon X5677's and 32GB RAM. Also, Mongo is running PerconaFT as its storage engine. I am making an uneducated guess here, but say worst case load scenario would be 100,000 writes per second.
Shared memory is populated by another program that is reading information from a real-time data stream, so I can't necessarily reproduce scenarios.
First... is Mongo the right choice for this task?
Next, this is the code that I've got right now. It starts with creating a list of collections (the list of items I want to record data points on is fixed) and then retrieving data from shared memory until it catches a signal.
int main()
{
    // these deal with navigating shared memory
    unsigned latestNotice = 0, latestTurn = 0, latestPQ = 0, latestPQturn = 0;
    symbol_data *notice = nullptr;
    bool done = false;
    // this is our 56 byte struct
    pq item;
    uint64_t today_at_midnight; // since epoch, in milliseconds
    {
        time_t seconds = time(NULL);
        today_at_midnight = seconds / (60 * 60 * 24);
        today_at_midnight *= (60 * 60 * 24 * 1000);
    }
    // connect to shared memory
    infob = info_block_init();
    uint32_t used_symbols = infob->used_symbols;
    getPosition(latestNotice, latestTurn);
    // fire up mongo
    mongoc_client_t *client = nullptr;
    // a std::vector rather than a variable-length array, which is not standard C++
    std::vector<mongoc_collection_t *> collections(used_symbols, nullptr);
    mongoc_collection_t *collection = nullptr;
    bson_error_t error;
    bson_t *doc = nullptr;
    mongoc_init();
    client = mongoc_client_new("mongodb://localhost:27017/");
    for (uint32_t symbol = 0; symbol < used_symbols; symbol++)
    {
        collections[symbol] = mongoc_client_get_collection(client, "scribe",
                                                           (infob->sd + symbol)->getSymbol());
    }
    // this will be used later to sleep one millisecond
    struct timespec ts;
    ts.tv_sec = 0;
    ts.tv_nsec = 1000000;
    while (continue_running) // becomes false if a signal is caught
    {
        // check that new info is available in shared memory
        // sleep 1ms if it isn't
        while (!getNextNotice(&notice, latestNotice, latestTurn)) nanosleep(&ts, NULL);
        // get the new info
        done = notice->getNextItem(item, latestPQ, latestPQturn);
        if (done) continue;
        // just some simple array math to make sure we're on the right collection
        collection = collections[notice - infob->sd];
        // switch on the item type and parse it accordingly
        switch (item.tp)
        {
        case pq::pq_event:
            doc = BCON_NEW(
                // decided to use this instead of std::chrono
                "ts", BCON_DATE_TIME(today_at_midnight + item.ts),
                // item.pr is a uint64_t, and the guidance I've read on mongo
                // advises using strings for those values
                "pr", BCON_UTF8(std::to_string(item.pr).c_str()),
                "sz", BCON_INT32(item.sz),
                "vn", BCON_UTF8(venue_labels[item.vn]),
                "tp", BCON_UTF8("e")
            );
            if (!mongoc_collection_insert(collection, MONGOC_INSERT_NONE, doc, NULL, &error))
            {
                LOG(1, "Mongo Error: " << error.message << endl);
            }
            // release the document here: every BCON_NEW allocates a fresh bson_t, and
            // destroying only the last one at shutdown leaks one per inserted item
            bson_destroy(doc);
            doc = nullptr;
            break;
        // obviously, several other cases go here, but they all look the
        // same, using BCON macros for their data.
        default:
            LOG(1, "got unknown type = " << item.tp << endl);
            break;
        }
    }
    // clean up once we break from the while()
    if (doc != nullptr) bson_destroy(doc);
    for (uint32_t symbol = 0; symbol < used_symbols; symbol++)
    {
        collection = collections[symbol];
        mongoc_collection_destroy(collection);
    }
    if (client != nullptr) mongoc_client_destroy(client);
    mongoc_cleanup();
    return 0;
}
My second question is: is this the fastest way to do this? The retrieval from shared memory isn't perfect, but this program is getting way behind its supply of data, far more so than I need it to be. So I'm looking for obvious mistakes with regard to efficiency or technique when speed is the goal.
Thanks in advance. =)

QtConcurrent slowdown with long-lived object pointers

I'm in the process of adding multithreading to several CPU-intensive processes on a list of long-lived object pointers. Roughly 60 million of these objects were created and added to a primary list on the main processing thread.
All of the work occurs in two lambda functors, one to process the data (myMap) and one to collect the results (myReduce). The main list gets divided into four sub-lists of roughly 15 million each and sent to QtConcurrent::mappedReduced to do work. Here's some example code:
//main thread
const int count = 60000000;
QList<MyObject*> list;
for(int i = 0; i < count; ++i) {
    MyObject* obj = new MyObject;
    obj->readFromFile(path);   // obj is a pointer, so -> rather than .
    list << obj;
}
QList<QList<MyObject*> > sublists;
for(int i = 0; i < count; i += count/4) {
    sublists << list.mid(i, count/4);
}
QThreadPool::globalInstance()->setMaxThreadCount(1); //slowdown when set to 4??
Result results_total;
std::function<Result (const QList<MyObject*>&)>
    myMap = [](const QList<MyObject*>& m) -> Result {
        //do lots of work on individual MyObjects, querying and modifying them
    };
auto myReduce = [&results_total](bool& /*noreturn*/, const Result& result) {
    results_total.count += result.count;
    results_total.othernumber += result.othernumber;
};
QFutureWatcher<void> fw;
fw.setFuture(QtConcurrent::mappedReduced<bool>(
    sublists, myMap, myReduce,
    QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce));
fw.waitForFinished();
Here's the kicker: When I setMaxThreadCount to 4 instead of 1, the procedure slows down by 10% instead of speeding up 200-400%. I used the exact same methodology (split a list into fourths and run it through QtConcurrent) on another procedure and ran it on the exact same dataset for a roughly 4x speed boost as expected by using 4 threads instead of 1.
Googling around suggests that there must be a shared resource in the myRun functor somewhere, but I can't find anything at all that's shared between the processing threads other than the original list of MyObjects that exist on the main thread.
So here's the question: Does the fact that MyObject was created in a different thread than the processing thread matter if I can guarantee that there are no synchronization issues? This link suggests it doesn't matter, but that heap memory block seems to be the only thing both threads share.
I'm running Qt 4.8.6 on Windows 7 Pro x64 with an i7 processor.

Improve OpenMP/SSE parallelization effect

I tried to improve performance in some routine via OpenMP (parallel for) and SSE intrinsics:
void Tester::ProcessParallel() // ProcessParallel is a member of the Tester class
{
    // Initialize: copy the members into locals so they can be firstprivate below
    auto OutMapLen = this->_OutMapLen;
    auto KernelBatchLen = this->_KernelBatchLen;
    auto OutMapHeig = this->_OutMapHeig;
    auto OutMapWid = this->_OutMapWid;
    auto InpMapWid = this->_InpMapWid;
    auto NumInputMaps = this->_NumInputMaps;
    auto InpMapLen = this->_InpMapLen;
    auto KernelLen = this->_KernelLen;
    auto KernelHeig = this->_KernelHeig;
    auto KernelWid = this->_KernelWid;
    auto input_local = this->input;
    auto output_local = this->output;
    auto weights_local = this->weights;
    auto biases_local = this->biases;
    auto klim = this->_klim;
    #pragma omp parallel for firstprivate(OutMapLen,KernelBatchLen,OutMapHeig,OutMapWid,InpMapWid,NumInputMaps,InpMapLen,KernelLen,KernelHeig,KernelWid,input_local,output_local,weights_local,biases_local,klim)
    for(auto i=0; i<_NumOutMaps; ++i) // one output map per parallel iteration
    {
        auto output_map = output_local + i*OutMapLen;
        auto kernel_batch = weights_local + i*KernelBatchLen;
        auto bias = biases_local + i;
        for(auto j=0; j<OutMapHeig; ++j)
        {
            auto output_map_row = output_map + j*OutMapWid;
            auto inp_row_idx = j*InpMapWid;
            for(auto k=0; k<OutMapWid; ++k)
            {
                auto output_nn = output_map_row + k;
                *output_nn = *bias;
                auto inp_cursor_idx = inp_row_idx + k;
                for(int _i=0; _i<NumInputMaps; ++_i)
                {
                    auto input_cursor = input_local + _i*InpMapLen + inp_cursor_idx;
                    auto kernel = kernel_batch + _i*KernelLen;
                    for(int _j=0; _j<KernelHeig; ++_j)
                    {
                        auto kernel_row_idx = _j*KernelWid;
                        auto inp_row_cur_idx = _j*InpMapWid;
                        int _k=0;
                        for(; _k<klim; _k+=4)//unroll and vectorize
                        {
                            float buf;
                            __m128 wgt = _mm_loadu_ps(kernel+kernel_row_idx+_k);
                            __m128 inp = _mm_loadu_ps(input_cursor+inp_row_cur_idx+_k);
                            // mask 0xf1: multiply all four lanes, sum them, write the sum to lane 0
                            __m128 prd = _mm_dp_ps(wgt, inp, 0xf1);
                            _mm_store_ss(&buf, prd);
                            *output_nn += buf;
                        }
                        for(; _k<KernelWid; ++_k)//residual loop
                            *output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
                    }
                }
            }
        }
    }
}
Pure unrolling and SSE vectorization (without OpenMP) of the last nested loop improves total performance ~1.3 times - it's a pretty nice result. However, pure OpenMP parallelization (without unrolling/vectorization) of the external loop gives only a ~2.1 times performance gain on an 8-core processor (Core i7 2600K). In total, both SSE vectorization and OpenMP parallel for show a 2.3-2.7 times performance gain. How can I boost the OpenMP parallelization effect in the code above?
Interesting: if I replace the "klim" variable - the bound of the unrolled last loop - with a scalar constant, say, 4, the total performance gain rises to 3.5.
Vectorisation and threading do not work orthogonally (in respect to speeding up the calculations) in most cases, i.e. their speed-ups do not necessarily add up. What's worse is that this happens mostly in cases like yours, where data is being processed in a streaming fashion. The reason for that is simple - finite memory bandwidth. A very simple measure of whether this is the case is the so-called computational intensity (CI), defined as the amount of data processing (usually in FLOPS) performed over a byte of input data. In your case you load two XMM registers, which makes 32 bytes of data in total, then perform one dot product operation. Let's have your code running on a 2 GHz Sandy Bridge CPU. Although DPPS takes full 12 cycles to complete on SNB, the CPU is able to overlap several such instructions and retire one every 2 cycles. Therefore at 2 GHz each core could perform 1 billion dot products per second in a tight loop. It would require 32 GB/s of memory bandwidth to keep such a loop busy. The actual bandwidth needed in your case is less since there are other instructions in the loop, but still the main idea remains - the processing rate of the loop is limited by the amount of data that the memory is able to feed to the core. As long as all the data fits into the last-level cache (LLC), performance would more or less scale with the number of threads as the LLC usually provides fairly high bandwidth (e.g. 300 GB/s on Xeon 7500's as stated here). This is not the case once data grows big enough not to fit into the cache as the main memory usually provides an order of magnitude less bandwidth per memory controller. In the latter case all cores have to share the limited memory speed and once it is saturated, adding more threads would not result in increase of the speed-up. Only adding more bandwidth, e.g. having a system with several CPU sockets, would result in an increased processing speed.
There is a theoretical model, called the Roofline model, that captures this in a more formal way. You can see some explanations and applications of the model in this presentation.
The bottom line is: both vectorisation and multiprocessing (e.g. threading) increase the performance but also increase the memory pressure. As long as the memory bandwidth is not saturated, both result in increased processing rate. Once the memory becomes the bottleneck, performance does not increase any more. There are even cases when multithreaded performance drops because of the additional pressure put by vectorisation.
Possibly an optimisation hint: the store to *output_nn might not get optimised since output_nn ultimately points inside a shared variable. Therefore you might try something like:
for(auto k=0; k<OutMapWid; ++k)
{
    auto output_nn = output_map_row + k;
    auto _output_nn = *bias;
    auto inp_cursor_idx = inp_row_idx + k;
    for(int _i=0; _i<NumInputMaps; ++_i)
    {
        ...
        for(int _j=0; _j<KernelHeig; ++_j)
        {
            ...
            for(; _k<klim; _k+=4)//unroll and vectorize
            {
                ...
                _output_nn += buf;
            }
            for(; _k<KernelWid; ++_k)//residual loop
                _output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
        }
    }
    *output_nn = _output_nn;
}
But I guess your compiler is smart enough to figure it by itself. Anyway, this would only matter in the single-threaded case. Once you are into the saturated memory bandwidth region, no such optimisations would matter.
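The bandwidth arithmetic from the answer above carried through as an added example (not code from the answer or the question): the per-core demand of the DPPS loop at the answer's 2 GHz, one dot product retired every 2 cycles and 32 bytes per iteration, the roofline bound min(peak, bandwidth × intensity), and the number of cores a shared bandwidth can keep fed. The 300 GB/s LLC figure is the answer's; the 20 GB/s DRAM figure used in the usage note is an assumed example value.

```cpp
#include <algorithm>
#include <cmath>

// Bytes per second one core consumes in a streaming loop that retires one
// iteration every cycles_per_iter cycles and touches bytes_per_iter bytes.
double core_demand_bps(double ghz, double cycles_per_iter, double bytes_per_iter) {
    return ghz * 1e9 / cycles_per_iter * bytes_per_iter;
}

// Computational intensity: operations performed per byte of input.
double intensity(double ops_per_iter, double bytes_per_iter) {
    return ops_per_iter / bytes_per_iter;
}

// Roofline: attainable ops/s is capped both by the compute peak and by
// bandwidth * intensity; whichever is lower is the roof you sit under.
double roofline_ops(double peak_ops, double bandwidth_bps, double ci) {
    return std::min(peak_ops, bandwidth_bps * ci);
}

// How many cores a shared bandwidth can feed at full rate before it saturates.
int cores_fed(double bandwidth_bps, double per_core_demand_bps) {
    return static_cast<int>(std::floor(bandwidth_bps / per_core_demand_bps));
}

// Predicted thread speed-up for the streaming loop: linear until the shared
// bandwidth is saturated, flat after that.
double predicted_speedup(int threads, double bandwidth_bps, double per_core_demand_bps) {
    return std::min(static_cast<double>(threads), bandwidth_bps / per_core_demand_bps);
}

// The answer's numbers: 2 GHz, DPPS retiring every 2 cycles, 32 bytes per iteration,
// i.e. 32 GB/s per core just to keep the tight loop busy.
double dpps_core_demand() { return core_demand_bps(2.0, 2.0, 32.0); }
```

With the answer's 300 GB/s LLC figure nine such cores stay fed, while an assumed 20 GB/s DRAM link cannot even keep one core at full rate, which is the regime where extra threads stop helping.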