I'm trying to write a program whose job it is to go into shared memory, retrieve a piece of information (a struct 56 bytes in size), then parse that struct lightly and write it to a database.
The catch is that it needs to do this several dozens of thousands of times per second. I'm running this on a dedicated Ubuntu 14.04 server with dual Xeon X5677's and 32GB RAM. Also, Mongo is running PerconaFT as its storage engine. I am making an uneducated guess here, but say worst case load scenario would be 100,000 writes per second.
Shared memory is populated by another program who's reading information from a real time data stream, so I can't necessarily reproduce scenarios.
First... is Mongo the right choice for this task?
Next, this is the code that I've got right now. It starts with creating a list of collections (the list of items I want to record data points on is fixed) and then retrieving data from shared memory until it catches a signal.
int main()
{
//these deal with navigating shared memory
uint latestNotice=0, latestTurn=0, latestPQ=0, latestPQturn=0;
symbol_data *notice = nullptr;
bool done = false;
//this is our 56 byte struct
pq item;
uint64_t today_at_midnight; //since epoch, in milliseconds
{
time_t seconds = time(NULL);
today_at_midnight = seconds/(60*60*24);
today_at_midnight *= (60*60*24*1000);
}
//connect to shared memory
infob=info_block_init();
uint32_t used_symbols = infob->used_symbols;
getPosition(latestNotice, latestTurn);
//fire up mongo
mongoc_client_t *client = nullptr;
mongoc_collection_t *collections[used_symbols];
mongoc_collection_t *collection = nullptr;
bson_error_t error;
bson_t *doc = nullptr;
mongoc_init();
client = mongoc_client_new("mongodb://localhost:27017/");
for(uint32_t symbol = 0; symbol < used_symbols; symbol++)
{
collections[symbol] = mongoc_client_get_collection(client, "scribe",
(infob->sd+symbol)->getSymbol());
}
//this will be used later to sleep one millisecond
struct timespec ts;
ts.tv_sec=0;
ts.tv_nsec=1000000;
while(continue_running) //becomes false if a signal is caught
{
//check that new info is available in shared memory
//sleep 1ms if it isn't
while(!getNextNotice(¬ice,latestNotice,latestTurn)) nanosleep(&ts, NULL);
//get the new info
done=notice->getNextItem(item, latestPQ, latestPQturn);
if(done) continue;
//just some simple array math to make sure we're on the right collection
collection = collections[notice - infob->sd];
//switch on the item type and parse it accordingly
switch(item.tp)
{
case pq::pq_event:
doc = BCON_NEW(
//decided to use this instead of std::chrono
"ts", BCON_DATE_TIME(today_at_midnight + item.ts),
//item.pr is a uint64_t, and the guidance I've read on mongo
//advises using strings for those values
"pr", BCON_UTF8(std::to_string(item.pr).c_str()),
"sz", BCON_INT32(item.sz),
"vn", BCON_UTF8(venue_labels[item.vn]),
"tp", BCON_UTF8("e")
);
if(!mongoc_collection_insert(collection, MONGOC_INSERT_NONE, doc, NULL, &error))
{
LOG(1,"Mongo Error: "<<error.message<<endl);
}
break;
//obviously, several other cases go here, but they all look the
//same, using BCON macros for their data.
default:
LOG(1,"got unknown type = "<<item.tp<<endl);
break;
}
}
//clean up once we break from the while()
if(doc != nullptr) bson_destroy(doc);
for(uint32_t symbol = 0; symbol < used_symbols; symbol++)
{
collection = collections[symbol];
mongoc_collection_destroy(collection);
}
if(client != nullptr) mongoc_client_destroy(client);
mongoc_cleanup();
return 0;
}
My second question is: is this the fastest way to do this? The retrieval from shared memory isn't perfect, but this program is getting way behind its supply of data, far moreso than I need it to be. So I'm looking for obvious mistakes with regards to efficiency or technique when speed is the goal.
Thanks in advance. =)
Related
I've coded some relatively simple communication protocol using shared memory, and shared mutexes. But then I wanted to expand support to communicate between two .dll's having different run-time in use. It's quite obvious that if you have some std::vector<__int64> and two dll's - one vs2010, one vs2015 - they won't work politely with each other. Then I've thought - why I cannot serialize in ipc manner structure on one side and de-serialize it on another - then vs run-times will work smoothly with each other.
Long story short - I've created separate interface for sending next chunk of data and for requesting next chunk of data. Both are working while decoding happens - meaning if you have vector with 10 entries, each string 1 Mb, and shared memory is 10 Kb - then it would require 1*10*1024/10 times to transfer whole data. Each next request is followed by multiple outstanding function calls - either by SendChunk or GetNextChunk depending on transfer direction.
Now - I wanted to encode and decode happen simultaneously but without any threading - then I've came up with solution of using setjmp and longjmp. I'm attaching part of code below, just for you to get some understanding of what is happening in whole machinery.
#include "..."
#include <setjmp.h> //setjmp
class Jumper: public IMessageSerializer
{
public:
char lbuf[ sizeof(IpcCommand) + 10 ];
jmp_buf jbuf1;
jmp_buf jbuf2;
bool bChunkSupplied;
Jumper() :
bChunkSupplied(false)
{
memset( lbuf, 0 , sizeof(lbuf) );
}
virtual bool GetNextChunk( bool bSend, int offset )
{
if( !bChunkSupplied )
{
bChunkSupplied = true;
return true;
}
int r = setjmp(jbuf1);
((_JUMP_BUFFER *)&jbuf1)->Frame = 0;
if( r == 0 )
longjmp(jbuf2, 1);
bChunkSupplied = true;
return true;
}
virtual bool SendChunk( bool bLast )
{
bChunkSupplied = false;
int r = setjmp(jbuf2);
((_JUMP_BUFFER *)&jbuf2)->Frame = 0;
if( r == 0 )
longjmp(jbuf1, 1);
return true;
}
bool FlushReply( bool bLast )
{
return true;
}
IpcCommand* getCmd( void )
{
return (IpcCommand*) lbuf;
}
int bufSize( void )
{
return 10;
}
}; //class Jumper
Jumper jumper;
void main(void)
{
EncDecCtx enc(&jumper, true, true);
EncDecCtx dec(&jumper, false, false);
CString s;
if( setjmp(jumper.jbuf1) == 0 )
{
alloca(16*1024);
enc.encodeString(L"My testing my very very long string.");
enc.FlushBuffer(true);
} else {
dec.decodeString(s);
}
wprintf(L"%s\r\n", s.GetBuffer() );
}
There are couple of issues here. After first call to setjmp I'm using alloca() - which allocates memory from stack, it will be autofreed on return. alloca can happen only during first jump, because any function call always uses callstack (to save return address) and it can corrupt second "thread" call stack.
There are multiple articles discussing about how dangerous setjmp and longjmp are, but this is now somehow working solution. The stack size (16 Kb) is reservation for next function calls to come - decodeString and so on - it can be adjusted to bigger if not enough.
After trying out this code I've noticed that x86 code was working fine, but 64-but did not work - I've got similar problem to what is described here:
An invalid or unaligned stack was encountered during an unwind operation
Like article suggested I've added ((_JUMP_BUFFER *)&jbuf1)->Frame = 0; kind of resetting - and after that 64-bit code started to work. Currently library is not using any exception mechanism and I'm not planning to use any (will try-catch everything if needed in encode* decode* function calls.
So questions:
Is it acceptable solution to disable unwinding in code ? (((_JUMP_BUFFER *)&jbuf1)->Frame = 0;) What unwinding really means in context setjmp / longjmp ?
Do you see any potential problem with given code snipet?
Currently, I am working on real time interface with Visual Studio C++.
I faced problem is, when buffer is running for data store, that time .exe is not responding at the point data store in buffer. I collect data as 130Hz from motion sensor. I have tried to increase virtual memory of computer, but problem was not solved.
Code Structure:
int main(){
int no_data = 0;
float x_abs;
float y_abs;
int sensorID = 0;
while (1){
// Define Buffer
char before_trial_output_data[][8 * 4][128] = { { { 0, }, }, };
// Collect Real Time Data
x_abs = abs(inchtocm * record[sensorID].y);
y_abs = abs(inchtocm * record[sensorID].x);
//Save in buffer
sprintf(before_trial_output_data[no_data][sensorID], "%d %8.3f %8.3f\n",no_data,x_abs,y_abs);
//Increment point
no_data++;
// Break While loop, Press ESc key
if (GetAsyncKeyState(VK_ESCAPE)){
break;
}
}
//Data Save in File
printf("\nSaving results to 'RecordData.txt'..\n");
FILE *fp3 = fopen("RecordData.dat", "w");
for (i = 0; i<no_data-1; i++)
fprintf(fp3, output_data[i][sensorID]);
fclose(fp3);
printf("Complete...\n");
}
The code you posted doesn't show how you allocate more memory for your before_trial_output_data buffer when needed. Do you want me to guess? I guess you are using some flavor of realloc(), which needs to allocate ever-increasing amount of memory, fragmenting your heap terribly.
However, in order for you to save that data to a file later on, it doesn't need to be in continuous memory, so some kind of list will work way better than an array.
Also, there is no provision in your "pseudo" code for a 130Hz reading; it processes records as fast as possible, and my guess is - much faster.
Is your prinf() call also a "pseudo code"? Otherwise you are looking for trouble by having mismatch of the % format specifications and number and type of parameters passed in.
I have a socket program which acts like both client and server.
It initiates connection on an input port and reads data from it. On a real time scenario it reads data on input port and sends the data (record by record ) on to the output port.
The problem here is that while sending data to the output port CPU usage increases to 50% while is not permissible.
while(1)
{
if(IsInputDataAvail==1)//check if data is available on input port
{
//condition to avoid duplications while sending
if( LastRecordSent < LastRecordRecvd )
{
record_time temprt;
list<record_time> BufferList;
list<record_time>::iterator j;
list<record_time>::iterator i;
// Storing into a temp list
for(i=L.begin(); i != L.end(); ++i)
{
if((i->recordId > LastRecordSent) && (i->recordId <= LastRecordRecvd))
{
temprt.listrec = i->listrec;
temprt.recordId = i->recordId;
temprt.timestamp = i->timestamp;
BufferList.push_back(temprt);
}
}
//Sending to output port
for(j=BufferList.begin(); j != BufferList.end(); ++j)
{
LastRecordSent = j->recordId;
std::string newlistrecord = j->listrec;
newlistrecord.append("\n");
char* newrecord= new char [newlistrecord.size()+1];
strcpy (newrecord, newlistrecord.c_str());
if ( s.OutputClientAvail() == 1) //check if output client is available
{
int ret = s.SendBytes(newrecord,strlen(newrecord));
if ( ret < 0)
{
log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
--connected;
return;
}
}
else
{
log1.AddLogFormatFatal("Nice Send Thread : Nice Client Timedout..connection closed");
--connected; //if output client not available disconnect after a timeout
return;
}
}
}
}
// Sleep(100); if we include sleep here CPU usage is less..but to send data real time I need to remove this sleep.
If I remove Sleep()...CPU usage goes very high while sending data to out put port.
}//End of while loop
Any possible ways to maintain real time data transfer and reduce CPU usage..please suggest.
There are two potential CPU sinks in the listed code. First, the outer loop:
while (1)
{
if (IsInputDataAvail == 1)
{
// Not run most of the time
}
// Sleep(100);
}
Given that the Sleep call significantly reduces your CPU usage, this spin-loop is the most likely culprit. It looks like IsInputDataAvail is a variable set by another thread (though it could be a preprocessor macro), which would mean that almost all of that CPU is being used to run this one comparison instruction and a couple of jumps.
The way to reclaim that wasted power is to block until input is available. Your reading thread probably does so already, so you just need some sort of semaphore to communicate between the two, with a system call to block the output thread. Where available, the ideal option would be sem_wait() in the output thread, right at the top of your loop, and sem_post() in the input thread, where it currently sets IsInputDataAvail. If that's not possible, the self-pipe trick might work in its place.
The second potential CPU sink is in s.SendBytes(). If a positive result indicates that the record was fully sent, then that method must be using a loop. It probably uses a blocking call to write the record; if it doesn't, then it could be rewritten to do so.
Alternatively, you could rewrite half the application to use select(), poll(), or a similar method to merge reading and writing into the same thread, but that's far too much work if your program is already mostly complete.
if(IsInputDataAvail==1)//check if data is available on input port
Get rid of that. Just read from the input port. It will block until data is available. This is where most of your CPU time is going. However there are other problems:
std::string newlistrecord = j->listrec;
Here you are copying data.
newlistrecord.append("\n");
char* newrecord= new char [newlistrecord.size()+1];
strcpy (newrecord, newlistrecord.c_str());
Here you are copying the same data again. You are also dynamically allocating memory, and you are also leaking it.
if ( s.OutputClientAvail() == 1) //check if output client is available
I don't know what this does but you should delete it. The following send is the time to check for errors. Don't try to guess the future.
int ret = s.SendBytes(newrecord,strlen(newrecord));
Here you are recomputing the length of the string which you probably already knew back at the time you set j->listrec. It would be much more efficient to just call s.sendBytes() directly with j->listrec and then again with "\n" than to do all this. TCP will coalesce the data anyway.
can I get a C++ code to read windows perfmon counter (category, counter name and instance name)?
It's very easy in c# but I needed c++ code.
Thanks
As Doug T. pointed out earlier, I posted a helper class awhile ago to query the performance counter value. The usage of the class is pretty simple, all you have to do is to provide the string for the performance counter.
http://askldjd.wordpress.com/2011/01/05/a-pdh-helper-class-cpdhquery/
However, the code I posted on my blog has been modified in practice. From your comment, it seems like you are interested in querying just a single field.
In this case, try adding the following function to my CPdhQuery class.
double CPdhQuery::CollectSingleData()
{
double data = 0;
while(true)
{
status = PdhCollectQueryData(hQuery);
if (ERROR_SUCCESS != status)
{
throw CException(GetErrorString(status));
}
PDH_FMT_COUNTERVALUE cv;
// Format the performance data record.
status = PdhGetFormattedCounterValue(hCounter,
PDH_FMT_DOUBLE,
(LPDWORD)NULL,
&cv);
if (ERROR_SUCCESS != status)
{
continue;
}
data = cv.doubleValue;
break;
}
return data;
}
For e.g.
To get processor time
counter = boost::make_shared<CPdhQuery>(std::tstring(_T("\\Processor Information(_Total)\% Processor Time")));
To get file read bytes / sec:
counter = boost::make_shared<CPdhQuery>(std::tstring(_T("\\System\\File Read Bytes/sec")));
To get % Committed Bytes:
counter = boost::make_shared<CPdhQuery>(std::tstring(_T("\\Memory\\% Committed Bytes In Use")));
To get the data, do this.
double data = counter->CollectSingleData();
I hope this helps.
... Alan
Some of the commonly used performance values have API calls to get them directly. For example, the total processor time can be obtained from GetSystemTimes, and you can calculate the percentage yourself.
If this isn't an option then the Performance Data Helper library provides a moderately simple interface to performance data.
I would like to create web-service with some problems and judge system. The user should write solution in C++ and then upload the source code. Then code should be compiled and test to determinate if the solution is right(like TopCoder does when you submit your code). There should be time and memory limits for solutions to run. Is there some ready-to-use judge systems to create such a service on Linux?
Do you mean online judge system like acm.timus.ru? Then you could try open source project on code.google.com and I guess that they use it here. I studied before this code a little bit: basically it prevents some system calls and measures spent time and memory. In detail, have a look at here.
I used DomJudge system which is ready for online competitions. http://domjudge.sourceforge.net/
For just checking memory and time caps, just add the code into their project automatically. Also, see Limit physical memory per process for memory caps as well.
For example, to check memory, add:
(EDITED FOR LINUX EXAMPLE)
#include "sys/types.h"
#include "sys/sysinfo.h"
#inlcude <stdio.h>
#include <pthread.h>
public long long getMemoryUsage()
{
sysinfo (&memInfo);
long long totalVirtualMem = memInfo.totalram; //Total physical memory
}
And add a new Thread to their Main() that calls a function as follow:
private const long long MAXMEMORY = 1024; // Memory cap
private const int LIMITAPPLICATION = 100; // 10 seconds
private void TimeLimitToApplication()
{
//LIMITAPPLICATION represents how many 100ms's must
//pass until time limit is reached.
while(COUNTER < LIMITAPPLICATION)
{
sleep(100);
/* Include if including code block below
if(CheckProcessThreadsIsComplete()) break;
*/
//Check memory usage every 100ms
if(getMemoryUsage() > MAXMEMORY)
break;
COUNTER++;
}
pgid = getpgid();
kill(pgid, 15);
}
You might also want to add another check to see if all your threads are completed:
(I only have a C#.NET sample of this, but I'm sure you could find something similar.)
public static bool CheckProcessThreadsIsComplete()
{
Process[] allProcs = Process.GetProcesses();
foreach(Process proc in allProcs)
{
ProcessThreadCollection myThreads = proc.Threads;
foreach(ProcessThread pt in myThreads)
{
//Ignore thread that you started to do time and memory checks
if(pt.Id == TIMELIMIT_AND_MEMORYCHECK_THREAD_ID) continue;
if(pt.ThreadState != (ThreadState.Unstarted | ThreadState.Stopped | ThreadState.WaitSleepJoin | ThreadState.Aborted)
return true; // all threads are not completed
}
}
return false; // all threads are completed
}
And add that to your checks in TimeLimitToApplication();