processing large data that sequential with tbb

processing large data that sequential with tbb - c++

I'm working on c++ app to process large amounts of quote data eg. (MSFT, AMZN, etc) with tbb. And was wondering how I would structure it. I'm been looking at parallel_for and pipeline and concurrent_queue.
The process would basically parse the data, process it and output to file. Parsing and processing can be done in parallel, but output should be in order for each symbol.
Eg. Input:
- Msg #1 - AMZN #1
- Msg #2 - AMZN #2
- Msg #3 - IBM #1
- Msg #4 - AMZN #3
- Msg #5 - CSCO #1
- Msg $6 - IBM #2
I would like to use lock-free solution or minimum locking, but it seems like I have keep in concurrent_queue to keep the order.
Any ideas would be helpful
Thanks,
David

If you use the pipeline pattern (tbb::pipeline class or tbb::parallel_pipeline() function), you can use ordered filters to ensure the output will appear in exactly the same order as the input was received. And you will not need any locks in your code for ordering.

Does your quote data either have a timestamp or a sequence number
Otherwise add a sequence number from the producer thread and sort the data based on squence number after parsing it - the resorting can be done then either in a batch or just before the writing of the files.

You can create an output structure (hash or list) where a key is a position of the displayed element (1st, 2nd, ...) and the value is the data to be displayed. Then when all the elements are ready, you can output the structure in the desired order.
This way you don't care about which thread finishes first.

Related

Query on boost interprocess::file_lock on NFS

We have an application which user can run to generate some data at user specified path. This unique output data is generated with respect to one unique input data-set - this input data is provided by the user.
When we initially developed the application, we never anticipated that number of unique input data-set will be large (due to nature of application). Our expectation was number of unique input data-set could be of order of 10 where as one user has this as 1000. So, that particular user started 1000 jobs of our application on grid and all writing data to same path. Note - these 1000 jobs are not fired from our application and rather he spawned 1000 processes of our application on different machines.
Now this lead to some collision and data loss.
To guard against it, I am planning to synchronization using boost::interprocess. This is what I am planning:
// usual processing of input data ...
boost::filesystem::path reportLockFilePath(boost::filesystem::system_complete(userDir));
rerportLockFilePath.append("report.lock");
// if lock file does not exist, create one
if (!boost::filesystem::exists(reportLockFilePath) {
boost::interprocess::named_mutex reportLockMutex(boost::interprocess::open_or_create, "report_mutex");
boost::interprocess::scoped_lock< boost::interprocess::named_mutex > lock(reportLockMutex);
std::ofstream lockStrm(reportLockFilePath.string().c_str());
lockStrm << "## report lock file ##" << std::endl;
lockStrm.flush();
}
boost::interprocess::file_lock reportFileLock(reportLockFilePath.string().c_str());
boost::interprocess::scoped_lock< boost::interprocess::file_lock > lock(reportFileLock);
// usual reporting code that we already have ...
Now, questions are -
If this is correct synchronization for the problem at hand
If this synchronization scheme will work, when jobs are on different machines and path is on NFS
If on NFS etc., this is not going to work, what are the C++ alternatives? I prefer to avoid lower level C functions to avoid race condition due to lock being held when one instance of execution crashes etc.

I just removed the named mutex part (as that was causing problem on few machines due to permission issue - probably related to umask issue discussed in this context in some other post) and replaced with
std::ofstream lockStrm(reportLockFilePath.string().c_str(), std::ios_base::app);
And it worked at least in our internal testing.

TopologyTestDriver with streaming groupByKey.windowedBy.reduce not working like kafka server [duplicate]

I'm trying to play with Kafka Stream to aggregate some attribute of People.
I have a kafka stream test like this :
new ConsumerRecordFactory[Array[Byte], Character]("input", new ByteArraySerializer(), new CharacterSerializer())
var i = 0
while (i != 5) {
testDriver.pipeInput(
factory.create("input",
Character(123,12), 15*10000L))
i+=1;
}
val output = testDriver.readOutput....
I'm trying to group the value by key like this :
streamBuilder.stream[Array[Byte], Character](inputKafkaTopic)
.filter((key, _) => key == null )
.mapValues(character=> PersonInfos(character.id, character.id2, character.age) // case class
.groupBy((_, value) => CharacterInfos(value.id, value.id2) // case class)
.count().toStream.print(Printed.toSysOut[CharacterInfos, Long])
When i'm running the code, I got this :
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 1
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 2
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 3
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 4
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 5
Why i'm getting 5 rows instead of just one line with CharacterInfos and the count ?
Doesn't groupBy just change the key ?

If you use the TopologyTestDriver caching is effectively disabled and thus, every input record will always produce an output record. This is by design, because caching implies non-deterministic behavior what makes itsvery hard to write an actual unit test.
If you deploy the code in a real application, the behavior will be different and caching will reduce the output load -- which intermediate results you will get, is not defined (ie, non-deterministic); compare Michael Noll's answer.
For your unit test, it should actually not really matter, and you can either test for all output records (ie, all intermediate results), or put all output records into a key-value Map and only test for the last emitted record per key (if you don't care about the intermediate results) in the test.
Furthermore, you could use suppress() operator to get fine grained control over what output messages you get. suppress()—in contrast to caching—is fully deterministic and thus writing a unit test works well. However, note that suppress() is event-time driven, and thus, if you stop sending new records, time does not advance and suppress() does not emit data. For unit testing, this is important to consider, because you might need to send some additional "dummy" data to trigger the output you actually want to test for. For more details on suppress() check out this blog post: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers

Update: I didn't spot the line in the example code that refers to the TopologyTestDriver in Kafka Streams. My answer below is for the 'normal' KStreams application behavior, whereas the TopologyTestDriver behaves differently. See the answer by Matthias J. Sax for the latter.
This is expected behavior. Somewhat simplified, Kafka Streams emits by default a new output record as soon as a new input record was received.
When you are aggregating (here: counting) the input data, then the aggregation result will be updated (and thus a new output record produced) as soon as new input was received for the aggregation.
input record 1 ---> new output record with count=1
input record 2 ---> new output record with count=2
...
input record 5 ---> new output record with count=5
What to do about it: You can reduce the number of 'intermediate' outputs through configuring the size of the so-called record caches as well as the setting of the commit.interval.ms parameter. See Memory Management. However, how much reduction you will be seeing depends not only on these settings but also on the characteristics of your input data, and because of that the extent of the reduction may also vary over time (think: could be 90% in the first hour of data, 76% in the second hour of data, etc.). That is, the reduction process is deterministic but from the resulting reduction amount is difficult to predict from the outside.
Note: When doing windowed aggregations (like windowed counts) you can also use the Suppress() API so that the number of intermediate updates is not only reduced, but there will only ever be a single output per window. However, in your use case/code you the aggregation is not windowed, so cannot use the Suppress API.
To help you understand why the setup is this way: You must keep in mind that a streaming system generally operates on unbounded streams of data, which means the system doesn't know 'when it has received all the input data'. So even the term 'intermediate outputs' is actually misleading: at the time the second input record was received, for example, the system believes that the result of the (non-windowed) aggregation is '2' -- its the correct result to the best of its knowledge at this point in time. It cannot predict whether (or when) another input record might arrive.
For windowed aggregations (where Suppress is supported) this is a bit easier, because the window size defines a boundary for the input data of a given window. Here, the Suppress() API allows you to make a trade-off decision between better latency but with multiple outputs per window (default behavior, Suppress disabled) and longer latency but you'll get only a single output per window (Suppress enabled). In the latter case, if you have 1h windows, you will not see any output for a given window until 1h later, so to speak. For some use cases this is acceptable, for others it is not.

When trying to execute an Siddhi APP using an event-stream generated by JMeter the RAM usage gets out of control

When trying to simulate an event-stream with JMeter and use it as source on siddhi it works for a little time but ends with RAM being overused and the execution of the program stops.
I tried executing the code with a database,without a database, using a partition to get the events one by one.
This is the stream code:
#Source(type = 'http',
receiver.url='http://172.23.3.22:8007/insertSweetProduction',
basic.auth.enabled='false',
#map(type='json', #attributes( tipoDato='$.tipoDato', fecha='$.fecha', valor='$.valor', servicio='$.servicio')))
define stream insertSweetProduction (tipoDato string, fecha string, valor double, servicio string);
This is the sink stream:
#Sink(type='file',
#map(type='json'),
#attributes( tipoDato='$.tipoDato', fecha='$.fecha', valor='$.valor', servicio='$.servicio'),
file.uri='/dev/null')
define stream fileSweetProduction (tipoDato string, fecha string, valor double, servicio string);
And this is the query executed to copy from one stream to another:
#info(name='query2')
from insertSweetProduction
select tipoDato,fecha,valor,servicio
insert into fileSweetProduction;
The expected results are that the wso2worker would show that all the events were processed and inserted on the sink stream.
On JMeter I am simulating 1 user introducing 6000 events during 1 hour and it looks like the memory ends up overused and the simulation stops.
Tried with partition and the memory usage improved a lot but still ended up in failure.
All I can think is that is a coding problem but cant seem to find anything that could cause this.
//Sorry for the poor english, not my first language//

The recommended heap size is 2 Gb[1]. How much have you allocated to the worker profile?
[1] https://docs.wso2.com/display/SP430/Installation+Prerequisites

rdkafka consumer query about partition size

Let's say I don't have access to the set of producers that commits on the partition of interest, but just have control over a bunch of C++ consumers.
Since I'm running benchmarks over a complex program, I'd like to know the spread between the offset my consumers are fetching and the total offset stored in the partition.
e.g., >> reading message #1234 of 5678 total in partition 0 of topic foo
I misunderstood the purpose of RdKafka::Consumer->outq_len() and RdKafka::Topic->OFFSET_END, because they seem always equal to 0 and -1, respectively.
How can I acquire the 5678 value of my example?

You need to subscribe to librdkafka's statistics to get an updated view of your consumer's lag.
Register an Event callback class and regularily call poll() on your handle, check for EVENT_STATS and then parse the corresponding JSON message and look for lo_offset, hi_offset and consumer_lag.

Parse very large CSV files with C++

My goal is to parse large csv files with C++ in a QT project in OSX environment.
(When I say csv I mean tsv and other variants 1GB ~ 5GB ).
It seems like a simple task , but things get complicated when file sizes get bigger. I don't want to write my own parser because of the many edge cases related to parsing csv files.
I have found various csv processing libraries to handle this job, but parsing 1GB file takes about 90 ~ 120 seconds on my machine which is not acceptable. I am not doing anything with the data right now, I just process and discard the data for testing purposes.
cccsvparser is one of the libraries I have tried . But the the only fast enough library was fast-cpp-csv-parser which gives acceptable results: 15 secs on my machine, but it works only when the file structure is known.
Example using: fast-cpp-csv-parser
#include "csv.h"
int main(){
io::CSVReader<3> in("ram.csv");
in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
std::string vendor; int size; double speed;
while(in.read_row(vendor, size, speed)){
// do stuff with the data
}
}
As you can see I cannot load arbitrary files and I must specifically define variables to match my file structure. I'm not aware of any method that allows me to create those variables dynamically in runtime .
The other approach I have tried is to read csv file line by line with fast-cpp-csv-parser LineReader class which is really fast (about 7 secs to read whole file), and then parse each line with cccsvparser lib that can process strings, but this takes about 40 seconds until done, it is an improvement compared to the first attempts but still unacceptable.
I have seen various Stack Overflow questions related to csv file parsing none of them takes large file processing in to account.
Also I spent a lot of time googling to find a solution to this problem, and I really miss the freedom that package managers like npm or pip offer when searching for out of the box solutions.
I will appreciate any suggestion about how to handle this problem.
Edit:
When using #fbucek's approach, processing time reduced to 25 seconds, which is a great improvement.
can we optimize this even more?

I am assuming you are using only one thread.
Multithreading can speedup your process.
Best accomplishment so far is 40 sec. Let's stick to that.
I have assumed that first you read then you process -> ( about 7 secs to read whole file)
7 sec for reading
33 sec for processing
First of all you can divide your file into chunks, let's say 50MB.
That means that you can start processing after reading 50MB of file. You do not need to wait till whole file is finished.
That's 0.35 sec for reading ( now it is 0.35 + 33 second for processing = cca 34sec )
When you use Multithreading, you can process multiple chunks at a time. That can speedup process theoretically up to number of your cores. Let's say you have 4 cores.
That's 33/4 = 8.25 sec.
I think you can speed up you processing with 4 cores up to 9 s. in total.
Look at QThreadPool and QRunnable or QtConcurrent
I would prefer QThreadPool
Divide task into parts:
First try to loop over file and divide it into chunks. And do nothing with it.
Then create "ChunkProcessor" class which can process that chunk
Make "ChunkProcessor" a subclass of QRunnable and in reimplemented run() function execute your process
When you have chunks, you have class which can process them and that class is QThreadPool compatible, you can pass it into
It could look like this
loopoverfile {
whenever chunk is ready {
ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
}
}
You can use std::share_ptr to pass processed data in order not to use QMutex or something else and avoid serialization problems with multiple thread access to some resource.
Note: in order to use custom signal you have to register it before use
qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");
Edit: (based on discussion, my answer was not clear about that)
It does not matter what disk you use or how fast is it. Reading is single thread operation.
This solution was suggested only because it took 7 sec to read and again does not matter what disk it is. 7 sec is what's count. And only purpose is to start processing as soon as possible and not to wait till reading is finished.
You can use:
QByteArray data = file.readAll();
Or you can use principal idea: ( I do not know why it take 7 sec to read, what is behind it )
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
return;
QByteArray* data = new QByteArray;
int count = 0;
while (!file.atEnd()) {
++count;
data->append(file.readLine());
if ( count > 10000 ) {
ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
data = new QByteArray;
count = 0;
}
}
One file, one thread, read almost as fast as read by line "without" interruption.
What you do with data is another problem, but has nothing to do with I/O. It is already in memory.
So only concern would be 5GB file and ammout of RAM on the machine.
It is very simple solution all you need is subclass QRunnable, reimplement run function, emit signal when it is finished, pass processed data using shared pointer and in main thread joint that data into one structure or whatever. Simple thread safe solution.

I would propose a multi-thread suggestion with a slight variation is that one thread is dedicated to reading file in predefined (configurable) size of chunks and keeps on feeding data to a set of threads (more than one based cpu cores). Let us say that the configuration looks like this:
chunk size = 50 MB
Disk Thread = 1
Process Threads = 5
Create a class for reading data from file. In this class, it holds a data structure which is used to communicate with process threads. For example this structure would contain starting offset, ending offset of the read buffer for each process thread. For reading file data, reader class holds 2 buffers each of chunk size (50 MB in this case)
Create a process class which holds a pointers (shared) for the read buffers and offsets data structure.
Now create driver (probably main thread), creates all the threads and waiting on their completion and handles the signals.
Reader thread is invoked with reader class, reads 50 MB of the data and based on number of threads creates offsets data structure object. In this case t1 handles 0 - 10 MB, t2 handles 10 - 20 MB and so on. Once ready, it notifies processor threads. It then immediately reads next chunk from disk and waits for processor thread to completion notification from process threads.
Processor threads on the notification, reads data from buffer and processes it. Once done, it notifies reader thread about completion and waits for next chunk.
This process completes till the whole data is read and processed. Then reader thread notifies back to the main thread about completion which sends PROCESS_COMPLETION, upon all threads exits. or main thread chooses to process next file in the queue.
Note that offsets are taken for easy explanation, offsets to line delimiter mapping needs to be handled programmatically.

If the parser you have used is not distributed obviously the approach is not scalable.
I would vote for a technique like this below
chunk up the file into a size that can be handled by a machine / time constraint
distribute the chunks to a cluster of machines (1..*) that can meet your time/space requirements
consider dealing at block sizes for a given chunk
Avoid threads on same resource (i.e given block) to save yourself from all thread related issues.
Use threads to achieve non competing (on a resource) operations - such as reading on one thread and writing on a different thread to a different file.
do your parsing (now for this small chunk you can invoke your parser).
do your operations.
merge the results back / if can distribute them as they are..
Now, having said that, why can't you use Hadoop like frameworks?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js