C++ and pqxx accumulating a transaction

TL;DR - I am accumulating a transaction in postgresql (pqxx) and suspect I'm doing something overly tortuous to get where I'm going.
In the first part of the question I show the naive way one might do a series of inserts and explain that it is (first) slow and (second) can lead to an error. Then I show where reading the docs has led me: I'm pretty sure it's an abusive way to interact with libpqxx, but I'm not sure what the intended pattern is.
Intro: the naive approach to doing thousands of inserts
I have some code that wants to write a bunch of stuff to the database, typically INSERT ... ON CONFLICT UPDATE. Roughly, this code is looping over some container and generating the needed SQL to insert each container object that needs inserting.
The naive way to do this (skipping error/exception handling for the moment) is thus:
pqxx::work txn(conn);
ostringstream sql_stmt;
Container rows;
for (const auto& row : rows) {
    if (row.IsDirty()) {
        RowToSQL(sql_stmt, row, txn);
        txn.exec(sql_stmt.str());
        // Clear sql_stmt here.
    }
}
txn.commit();
The function RowToSQL() takes a transaction object so that it can quote strings appropriately using txn.quote().
This is inefficient, however: calling exec() over and over turns out to be quite slow.
So instead I build up a bunch of statements in the ostringstream, thus:
pqxx::work txn(conn);
ostringstream sql_stmt;
Container rows;
int counter = 0;  // rows accumulated since the last exec()
int count = 0;    // rows written since the last commit()
for (const auto& row : rows) {
    if (row.IsDirty()) {
        RowToSQL(sql_stmt, row, txn);
        ++counter;
        ++count;
        if (counter > kExecuteThreshold || sql_stmt.IsTooBig()) {
            txn.exec(sql_stmt.str());
            // Clear sql_stmt, reset counter here.
        }
        if (count > kCommitThreshold) {
            txn.commit();
            // Reset statement counter here.
        }
    }
}
// Final commit here.
I chose the two thresholds based on performance testing in our environment; think 100 and 10,000 for order of magnitude.
This worked until it didn't, because re-using the transaction this way leads to conflicts.
Attempt to activate transaction<READ COMMITTED> which is already closed.
This SO question mostly addresses that error.
This led me to write something that has a weird code smell, and so I suspect I've misunderstood how postgresql / libpqxx are intended to be used.
Adding back error/exception handling:
Less naive but weird code smell
pqxx::nontransaction non_txn(conn);
ostringstream sql_stmt;
Container rows;
vector<string> transactions;
int counter = 0;  // rows accumulated since the last batch was queued
int count = 0;    // rows queued since the last commit
for (const auto& row : rows) {
    if (row.IsDirty()) {
        RowToSQL(sql_stmt, row, non_txn);
        ++counter;
        ++count;
        if (counter > kExecuteThreshold || sql_stmt.IsTooBig()) {
            transactions.push_back(sql_stmt.str());
            // Clear sql_stmt, reset counter here.
        }
        if (count > kCommitThreshold) {
            try {
                pqxx::work txn(conn);
                for (const string& trans : transactions) {
                    txn.exec(trans);
                }
                txn.commit();
                transactions.clear();
                // Reset statement counter here.
            } catch (const exception& e) {
                // txn aborts automatically on destruction if commit() was not reached.
                YellowAlert(); // Something appropriate.
            }
        }
    }
}
// Final commit here.
It seems quite wrong to me that I should build up this vector of things to execute and maintain these custom-tuned execution/transaction thresholds rather than using some facility of libpqxx. That is, this pattern seems common enough to me that the complexity I'm starting to see strikes me as my own misunderstanding.
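For concreteness, the shape I would naively expect to be the intended one is a fresh pqxx::work per batch, committed and then replaced rather than reused. A minimal sketch (it reuses Container and RowToSQL from above; WriteDirtyRows and the threshold are names I made up for illustration):
#include <memory>
#include <pqxx/pqxx>
#include <sstream>

void WriteDirtyRows(pqxx::connection& conn, const Container& rows) {
    constexpr int kCommitThreshold = 10000;  // illustrative, tune for your environment
    std::ostringstream sql_stmt;
    int counter = 0;
    auto txn = std::make_unique<pqxx::work>(conn);
    for (const auto& row : rows) {
        if (!row.IsDirty()) continue;
        RowToSQL(sql_stmt, row, *txn);       // quoting via txn->quote() as before
        txn->exec(sql_stmt.str());
        sql_stmt.str("");                    // clear the statement buffer
        if (++counter >= kCommitThreshold) {
            txn->commit();
            txn = std::make_unique<pqxx::work>(conn);  // never reuse a committed transaction
            counter = 0;
        }
    }
    txn->commit();                           // final commit
}
Is that per-batch construction of pqxx::work the intended pattern, or is there a batching facility in libpqxx that I'm missing?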
Any pointers much appreciated.

Related

Most efficient paradigm for checking if a key exists in a c++ std::unordered_map?

I am relatively new to modern c++ and working with a foreign code base. There is a function that takes a std::unordered_map and checks to see if a key is present in the map. The code is roughly as follows
uint32_t getId(std::unordered_map<uint32_t, uint32_t> &myMap, uint32_t id)
{
    if (myMap.contains(id))
    {
        return myMap.at(id);
    }
    else
    {
        std::cerr << "\n\n\nOut of Range error for map: " << id << "\t not found" << std::flush;
        exit(74);
    }
}
It seems like calling contains() followed by at() is inefficient since it requires a double lookup. So, my question is, what is the most efficient way to accomplish this? I also have a followup question: assuming the map is fairly large (~60k elements) and this method gets called frequently, how problematic is the above approach?
After some searching, it seems like the following paradigms are more efficient than the above, but I am not sure which would be best.
Calling myMap.at() inside of a try-catch construct
Pros: at automatically throws an error if the key does not exist
Cons: try-catch is apparently fairly costly and also constrains what the optimizer can do with the code
Use find
Pros: One call, no try-catch overhead
Cons: Involves using an iterator; more overhead than just returning the value
auto findit = myMap.find(id);
if (findit == myMap.end())
{
    //error message;
    exit(74);
}
else
{
    return findit->first;
}
You can do
// stuff before
{
    auto findit = myMap.find(id);
    if ( findit != myMap.end() ) {
        return findit->first;
    } else {
        exit(74);
    }
}
// stuff after
or with the new C++17 init statement syntax
// stuff before
if ( auto findit = myMap.find(id); findit != myMap.end() ) {
    return findit->first;
} else {
    exit(74);
}
// stuff after
Both define the iterator only in local scope. As the iterator is almost certainly optimized away, I would go with this approach. Doing a second hash calculation will almost surely be slower.
Also note that findit->first returns the key, not the value. I was not sure what you expect the code to do, but one of the code snippets in the question returns the value, while the other one returns the key.
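If the intent is to return the value (as getId in the question does), the same C++17 pattern with ->second would be, roughly:
// Sketch: same single lookup, but returning the mapped value rather than the key.
if (auto findit = myMap.find(id); findit != myMap.end()) {
    return findit->second;   // ->second is the value, ->first is the key
} else {
    exit(74);                // or print the error message first, as in the question
}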
In case removing the extra lookup alone doesn't give you enough speedup, and there are millions of calls to getId in a multi-threaded program, you can use an N-way map to parallelize the id checks:
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

template<int N>
class NwayMap
{
public:
    NwayMap(uint32_t hintMaxSize = 60000)
    {
        // hint about max size to optimize initial allocations
        for (int i = 0; i < N; i++)
            shard[i].reserve(hintMaxSize / N);
    }

    void addIdValuePairThreadSafe(const uint32_t id, const uint32_t val)
    {
        // select shard
        const uint32_t selected = id % N; // can do id&(N-1) for power-of-2 N value
        std::lock_guard<std::mutex> lg(mut[selected]);
        auto it = shard[selected].find(id);
        if (it == shard[selected].end())
        {
            shard[selected].emplace(id, val);
        }
        else
        {
            // already added, update?
        }
    }

    uint32_t getIdMultiThreadSafe(const uint32_t id)
    {
        // select shard
        const uint32_t selected = id % N; // can do id&(N-1) for power-of-2 N value
        // lock only the selected shard, others can work in parallel
        std::lock_guard<std::mutex> lg(mut[selected]);
        auto it = shard[selected].find(id);
        // we expect it to be found, so get it quicker
        // without going "else"
        if (it != shard[selected].end())
        {
            return it->second;
        }
        else
        {
            exit(74);
        }
    }
private:
    std::unordered_map<uint32_t, uint32_t> shard[N];
    std::mutex mut[N];
};
Pros:
if you serve each shard's getId calls from its own CPU thread, then you benefit from N times the L1 cache size.
even in a single-threaded use case, you can still interleave multiple id-check operations and benefit from instruction-level parallelism, because checking id 0 takes an independent code path from checking id 1, and the CPU can execute them out of order (if the pipeline is long enough).
Cons:
if a lot of checks from different threads collide, their operations are serialized and the locking mechanism causes extra latency
when id values are mostly strided, the parallelization is not efficient due to unbalanced emplacement
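Usage of the NwayMap above would look roughly like this (the shard count 8 is arbitrary; a power of 2 lets you use id&(N-1) as noted in the comments):
NwayMap<8> idMap;                             // 8 shards, each with its own mutex
idMap.addIdValuePairThreadSafe(42u, 7u);      // safe to call from multiple threads
uint32_t v = idMap.getIdMultiThreadSafe(42u); // v == 7; exits with code 74 if the id were missing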
Calling myMap.at() inside of a try-catch construct
Pros: at automatically throws an error if the key does not exist
Cons: try-catch is apparently fairly costly and also constrains what the optimizer can do with the code
Your implementation of getId terminates the application, so who cares about exception overhead?
Please note that most compilers (AFAIK all) implement C++ exceptions so that they cost nothing when no exception is thrown. The cost appears when an exception is thrown: the stack has to be unwound and a matching handler found. I read somewhere that the penalty when an exception is thrown is about 40x compared to unwinding the stack with simple returns (with possible error codes).
Since you want to just terminate the application, this overhead is negligible.
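So the at()-in-a-try-catch version is perfectly fine here; a sketch (I have only added the headers it needs):
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <unordered_map>

uint32_t getId(std::unordered_map<uint32_t, uint32_t> &myMap, uint32_t id)
{
    try {
        return myMap.at(id);  // single lookup; throws std::out_of_range if absent
    } catch (const std::out_of_range&) {
        std::cerr << "\n\n\nOut of Range error for map: " << id << "\t not found" << std::flush;
        exit(74);             // we terminate anyway, so the cost of the throw is irrelevant
    }
}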

How to optimize heavy map insertion in C++ regarding CPU and memory

I am iterating over a map and need to add elements to that map when a condition is met, namely that an element is not found (it could be any other condition).
My main problem is that with a large number of updates to be added, the application takes up all of the CPU and memory.
State Class:
class State {
    int id;
    int timeStamp;
    int state;
};
Method in State:
void State::updateStateIfTimeStampIsHigher(const State& state) {
    if (this->id == state.getId() && state.getTimeStamp() > this->getTimeStamp()) {
        this->timeStamp = state.getTimeStamp();
        this->state = state.getState();
    }
}
Loop Code:
std::map<int, State> data;
const std::map<int, State>& update;
for (auto const& updatePos : update) {
    if (updatePos.first != this->toNodeId) {
        std::map<int, State>::iterator message = data.find(updatePos.first);
        if (message != data.end() && message->first) {
            message->second.updateStateIfTimeStampIsHigher(updatePos.second);
        } else {
            data.insert(std::make_pair(updatePos.first, updatePos.second));
        }
    }
}
Looking at the KCacheGrind data, it looks like the data.insert() line takes the most time / memory. I am new to KCacheGrind, but this line seemed to account for around 72% of the cost.
Do you have any suggestions on how to improve this?
Your question is quite general, but I see two things that could make it run faster:
Use hinted insertion / emplacement. When you add a new element, its iterator is returned. Assuming that both maps are ordered in the same fashion, you can tell where the last one was inserted, so the lookup for the next insertion should be faster (this could use some benchmarking).
Use emplace_hint for faster insertion
Sample code here:
std::map<int, long> data;
const std::map<int, long> update;
auto recent = data.begin();
for (auto const& updatePos : update) {
    if (updateElemNotFound) {   // placeholder for the "not found" condition
        recent = data.emplace_hint(recent, updatePos);
    }
}
Also, if you want to trade memory for CPU you could use unordered_map (Is there any advantage of using map over unordered_map in case of trivial keys?), but then the first point would no longer apply.
I was able to find a satisfying answer thanks to researching the comments on the question. Changing from map to unordered_map did help a little, but I still got unsatisfying results.
I ended up using Google's sparsehash, which provides better resource usage despite some drawbacks when erasing entries (which I need to do).
The code solution is as follows. First I include the required library:
#include <sparsehash/sparse_hash_map>
Then, my new data definition looks like:
struct eqint {
    bool operator()(int i1, int i2) const {
        return i1 == i2;
    }
};

google::sparse_hash_map<int, State, std::tr1::hash<int>, eqint> data;
Since I have to use "erase" I have to do this after the sparsemap construction:
data.clear_deleted_key();
data.set_deleted_key(-1);
Finally my loop code changes very little:
for (auto const& updatePos : update) {
    if (updatePos.first != this->toNodeId) {
        google::sparse_hash_map<int, State, std::tr1::hash<int>, eqint>::iterator msgIt = data.find(updatePos.first);
        if (msgIt != data.end() && msgIt->first) {
            msgIt->second.updateStateIfTimeStampIsHigher(updatePos.second);
        } else {
            data[updatePos.first] = updatePos.second;
        }
    }
}
The time before making the changes for a whole application run under specific parameters was:
real 0m28,592s
user 0m27,912s
sys 0m0,676s
And the time after making the changes for the whole application run under the same specific parameters is:
real 0m37,464s
user 0m37,032s
sys 0m0,428s
I ran it with other cases and the results were similar (from a qualitative point of view). The system time and resource usage (CPU and memory) decrease and the user time increases.
Overall I am satisfied with the tradeoff since I was more concerned about resource usage than execution time (the application is a simulator and it was not able to finish and get results under really heavy load and now it does).

How to make a string into a reference?

I have looked into this, but it's not what I wanted: Convert string to variable name or variable type
I have code that reads an INI file, stores the data in a QHash table, and checks the value for each hash key (see below); if a value is "1", the corresponding rule is added to World.
Code Examples:
World theWorld;
AgentMove AgentMovement(&theWorld);
if(rules.value("AgentMovement") == "1")
theWorld.addRule(&AgentMovement);
INI file:
AgentMovement=1
What I want to do is dynamically read from the INI file and turn the string into a reference to a hard-coded variable.
for (int j = 0; j < ck.size(); j++)
    if (rules.value(ck[j]) == "1")
        theWorld.addRule("&" + ck[j]); // <-- want this to become &AgentMovement
How would you make a string into a reference as noted above?
This is a common theme in programming: a value which can only be one of a set (could be an enum, one of a finite set of ints, a set of possible string values, or even a number of buttons in a GUI) is used as a criterion to perform some kind of action. The simplistic approach is to use a switch (for atomic types) or an if/else chain for complex types. That is what you are currently doing, and there is nothing wrong with it as such:
if(rules.value(ck[j]) == "1") theWorld.addRule(&AgentMovement);
else if(rules.value(ck[j]) == "2") theWorld.addRule(&AgentEat);
else if(rules.value(ck[j]) == "3") theWorld.addRule(&AgentSleep);
// etc.
else error("internal error: weird rules value %s\n", rules.value(ck[j]));
The main advantage of this pattern is, in my experience, that it is crystal clear: anybody, including you in a year, understands immediately what's going on and can see immediately which criterion leads to which action. It is also trivial to debug, which can be a surprising advantage: you can break at a specific action, and only at that action.
The main disadvantage is maintainability. If the same criteria (enum or whatever) is used to switch between different things in various places, all these places have to be maintained, for example when a new enum value is added. An action may come with a sound, an icon, a state change, a log message, and so on. If these do not happen at the same time (in the same switch), you'll end up switching multiple times over the action enum (or if/then/else over the string values). In that case it's better to bundle all information connected to an action in a data structure and put the structures in a map/hash table with the actions as keys. All the switches collapse to single calls. The compile-time initialization of such a map could look like this:
struct ActionDataT { Rule rule; Icon icon; Sound sound; };
map<string, ActionDataT> actionMap
    = {
        {"1", {AgentMovement, moveIcon, moveSound} },
        {"2", {AgentEat,      eatIcon,  eatSound } },
        // ...
      };
The usage would be like
for (int j = 0; j < ck.size(); j++)
    theWorld.addRule(actionMap[rules.value(ck[j])].rule);
And elsewhere, for example:
if(actionFinished(action)) removeIcon(actionMap[action].icon);
This is fairly elegant. It demonstrates two principles of software design: 1. "All problems in computer science can be solved by another level of indirection" (David Wheeler), and 2. There is often a choice between more data or more code. The simplistic approach is code-oriented, the map approach is data oriented.
The data-centrist approach is indispensable if switches occur in more than one situation, because coding them out each time would be a maintenance nightmare.
Note that with the data-centrist approach none of the places where an action is used has to be touched when a new action is added. This is essential. The mechanism resembles (in principle and implementation, actually) the call of a virtual member function. The calling code doesn't know and isn't really interested in what is actually done. Responsibility is transferred to the object. The calling code may perform actions later in the life cycle of a program which didn't exist when it was written. By contrast, compare it to a program with many explicit switches where every single use must be examined when an action is added.
The indirection involved in the data-centrist approach is its disadvantage though, and the only problem which cannot be solved by another level of indirection, as Wheeler remarked. The code becomes more abstract and hence less obvious and harder to debug.
You have to provide the mapping from the names to the objects yourself. I would wrap it in a class, something like this:
template <typename T>
struct ObjectMap {
    void addObject(const std::string& name, T* obj) {
        m[name] = obj;
    }
    // Not const: it returns a non-const reference to dummy in the fallback case.
    T& getRef(const std::string& name) {
        auto x = m.find(name);
        if (x != m.end()) { return *(x->second); }
        else { return dummy; }
    }
private:
    std::map<std::string, T*> m;
    T dummy;
};
The problem with this approach is that you have to decide what to do if an object is requested that is not actually in the map. A reference always has to refer to something (in contrast to a pointer, which can be null). I decided to return a reference to a dummy object. However, you might want to consider using pointers instead of references. Another option would be to throw an exception in case the object is not in the map.
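Usage for the example from the question might then look roughly like this (a sketch; it assumes the hard-coded rule objects share a base type, here called Rule, that World::addRule accepts; those names come from the question, the registration map is mine):
ObjectMap<Rule> ruleMap;
ruleMap.addObject("AgentMovement", &AgentMovement); // register each hard-coded object once
// ... register the other rules the same way ...

for (int j = 0; j < ck.size(); j++)
    if (rules.value(ck[j]) == "1")
        theWorld.addRule(&ruleMap.getRef(ck[j]));   // string -> reference -> pointer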

C++ SDL Breaking out of while loop

I've been messing around with C++ SDL for a few days now and I've come across an interesting problem.
SDL_Event event1;
while (SDL_WaitEvent(&event1))
{
    for (size_t i = 0; i < MainMenuOptions.size(); i++)
    {
        if (event1.button.x > MainMenuOptions.at(i).GetX() && event1.button.x < (MainMenuOptions.at(i).GetX() + MainMenuOptions.at(i).GetWidth())
            && event1.button.y > MainMenuOptions.at(i).GetY() && event1.button.y < (MainMenuOptions.at(i).GetY() + MainMenuOptions.at(i).GetHeight()))
        {
            break;
        }
    }
}
When I use break in the for loop, it breaks out of the for loop instead of the while loop. How would I break out of the while loop instead, without using a goto statement? (goto is bad programming practice, I've heard.)
The common solution is to put this stuff into its own function and return from that:
inline SDL_Event do_it()
{
    SDL_Event event;
    while (SDL_WaitEvent(&event))
        for (std::size_t i = 0; i < MainMenuOptions.size(); ++i)
            if (/*...*/)
                return event;
    return event; // or whatever else suits, I know too little about your code
}
There's another answer to that, and I think I should say it before everyone downvotes me.
Using a variable is certainly a "good" way to do it. However, creating an additional variable just to jump out of the loop seems a bit of an overkill, right?
So yes, this time goto is a perfect solution. It's perfectly clear what you are doing with it, you are not introducing another variable, and the code remains short, maintainable and readable.
The statement that goto is bad practice is mostly a remnant of the BASIC days, when it was practically the only way of changing control flow. However, now we "know better", and saying that goto or any other construct is simply bad doesn't cut it. It can be bad for one particular problem you are trying to solve with it (and that's the case with most of the problems people try to solve with goto). However, given the right circumstances (like here), it's OK. I don't want to start a debate here, of course. goto is like a very powerful tool (a sledgehammer, for example): it has its uses, and you can't say a tool is totally bad; it's the user who uses it the wrong way.
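For completeness, the goto version of the loop from the question is simply:
SDL_Event event1;
while (SDL_WaitEvent(&event1))
{
    for (size_t i = 0; i < MainMenuOptions.size(); i++)
    {
        if (/* hit test as in the question */)
        {
            goto done;   // leaves both loops at once
        }
    }
}
done:
; // continue here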
Use a variable to indicate the need to exit:
bool exit_program = false;
while( !exit_program && SDL_WaitEvent(&event1) )
{
    for( /* ... */ )
    {
        exit_program = true;
    }
}
First point: IMO, you're trying to wrap too much up into a single place, and ending up with something that's fairly difficult to understand -- somebody has to read through that entire long set of comparisons before they can understand any of what this is supposed to accomplish at all.
Second point: using an explicit loop to iterate over a standard collection is usually a mistake -- and this is no exception. The standard library already has an algorithm to accomplish the same basic thing as your loop. It's better to use that than write it again yourself.
#include <algorithm>   // for std::any_of

template <class T>
bool in_range(T a, T b, T c) {
    return (a > b) && (a < b + c);
}

class in_rect {
    point p;   // "point": whatever x/y type matches event1.button
public:
    in_rect(point const &p) : p(p) {}

    // Not sure of the type of objects in MainMenuOptions, so a template for now.
    //
    template <class U>
    bool operator()(U const &m) const {
        return in_range(p.x, m.GetX(), m.GetWidth())
            && in_range(p.y, m.GetY(), m.GetHeight());
    }
};

SDL_Event event1;
while (SDL_WaitEvent(&event1))
    if (std::any_of(MainMenuOptions.begin(), MainMenuOptions.end(),
                    in_rect(event1.button)))
        break;
Once we fix the other problems, there's simply no longer any need (or even use) for the goto. We haven't taken any steps explicitly intended to remove it, but when the other problems have been fixed (especially, replacing the loop with an appropriate algorithm), the use for it has disappeared.
I suppose I should preemptively comment on the increase in the total number of lines of code: yes, there are more lines of code. What of it? If we really wanted to, we could use the same basic approach, but instead of defining in_rect and in_range, we'd basically just take the condition from the original if statement and stuff it into a lambda. While I'm very happy that lambdas have been added to C++, in this case I'm not excited about using it. It would get rid of the goto, but in general the code would be almost as unreadable as it started out.
Simply put, the number of lines isn't a good way to measure much of anything.
A solution without an additional variable or goto:
while (SDL_WaitEvent(&event1))
{
    size_t i;
    for (i = 0; i < MainMenuOptions.size(); i++)
    {
        if (/* ... */)
        {
            break;
        }
    }
    if (i < MainMenuOptions.size())
        break;
}

Is throwing an exception a healthy way to exit?

I have a setup that looks like this.
class Checker
{
    // member data
    Results m_results; // see below
public:
    bool Check();
private:
    bool Check1();
    bool Check2();
    // .. so on
};
Checker is a class that performs lengthy check computations for engineering analysis. Each type of check has a resultant double that the checker stores (see below).
bool Checker::Check()
{
    // initialisations etc.
    Check1();
    Check2();
    // ... so on
}
A typical Check function would look like this:
bool Checker::Check1()
{
    double result;
    // lots of code
    m_results.SetCheck1Result(result);
}
And the results class looks something like this:
class Results
{
    double m_check1Result;
    double m_check2Result;
    // ...
public:
    void SetCheck1Result(double d);
    double GetOverallResult()
    { return max(m_check1Result, m_check2Result, ...); }
};
Note: all code is oversimplified.
The Checker and Results classes were initially written to perform all checks and return an overall double result. There is now a new requirement where I only need to know whether any of the results exceeds 1. If it does, subsequent checks need not be carried out (it's an optimisation). To achieve this, I could either:
Modify every CheckN function to check the result and return early. The parent Check function would keep checking m_results. OR
In the Results::SetCheckNResults(), throw an exception if the value exceeds 1 and catch it at the end of Checker::Check().
The first is tedious, error prone and sub-optimal because every CheckN function further branches out into sub-checks etc.
The second is non-intrusive and quick. One disadvantage I can think of is that the Checker code may not necessarily be exception-safe (although no other exception is thrown anywhere else). Is there anything else that's obvious that I'm overlooking? What about the cost of throwing exceptions and stack unwinding?
Is there a better 3rd option?
I don't think this is a good idea. Exceptions should be limited to, well, exceptional situations. Yours is a question of normal control flow.
It seems you could very well move all the redundant code dealing with the result out of the checks and into the calling function. The resulting code would be cleaner and probably much easier to understand than non-exceptional exceptions.
Change your CheckX() functions to return the double they produce and leave dealing with the result to the caller. The caller can more easily do this in a way that doesn't involve redundancy.
If you want to be really fancy, put those functions into an array of function pointers and iterate over that. Then the code for dealing with the results would all be in a loop. Something like:
bool Checker::Check()
{
    for( std::size_t idx = 0; idx < sizeof(check_tbl)/sizeof(check_tbl[0]); ++idx ) {
        double result = check_tbl[idx]();
        if( result > 1 )
            return false; // or whichever way your logic is (an enum might be better)
    }
    return true;
}
Edit: I had overlooked that you need to call any of N SetCheckResultX() functions, too, which would be impossible to incorporate into my sample code. So either you can shoehorn this into an array, too, (change them to SetCheckResult(std::size_t idx, double result)) or you would have to have two function pointers in each table entry:
struct check_tbl_entry {
    check_fnc_t checker;
    set_result_fnc_t setter;
};

check_tbl_entry check_tbl[] = { { &Checker::Check1, &Checker::SetCheck1Result }
                              , { &Checker::Check2, &Checker::SetCheck2Result }
                              // ...
                              };
bool Checker::Check()
{
    for( std::size_t idx = 0; idx < sizeof(check_tbl)/sizeof(check_tbl[0]); ++idx ) {
        double result = check_tbl[idx].checker();
        check_tbl[idx].setter(result);
        if( result > 1 )
            return false; // or whichever way your logic is (an enum might be better)
    }
    return true;
}
(And, no, I'm not going to attempt to write down the correct syntax for a member function pointer's type. I've always had to look this up and still never got it right the first time... But I know it's doable.)
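For reference, the pointer types could be spelled roughly like this (a sketch; it assumes, as the table above does, that the checks return double and that the setters live on Checker):
typedef double (Checker::*check_fnc_t)();             // e.g. &Checker::Check1
typedef void   (Checker::*set_result_fnc_t)(double);  // e.g. &Checker::SetCheck1Result

// Inside Checker::Check(), member pointers would then be invoked through this:
//   double result = (this->*check_tbl[idx].checker)();
//   (this->*check_tbl[idx].setter)(result);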
Exceptions are meant for cases that shouldn't happen during normal operation. They're hardly non-intrusive; their very nature involves unwinding the call stack, calling destructors all over the place, yanking the control to a whole other section of code, etc. That stuff can be expensive, depending on how much of it you end up doing.
Even if it were free, though, using exceptions as a normal flow control mechanism is a bad idea for one other, very big reason: exceptions aren't meant to be used that way, so people don't use them that way, so they'll be looking at your code and scratching their heads trying to figure out why you're throwing what looks to them like an error. Head-scratching usually means you're doing something more "clever" than you should be.