Boost.Log - managing repeated consecutive log messages and preventing duplicate printing - C++

I am new to Boost.Log and am using it to develop a logger library as a wrapper on top of Boost.Log. My problem is finding a convenient way to count consecutive repeated log messages and print that count instead of printing the same message multiple times.
For example (the ADD_LOG() method is in my library and calls BOOST_LOG_SEV(...)):
ADD_LOG("Hello");
ADD_LOG("Hello");
ADD_LOG("Hello");
ADD_LOG("Some Different Hello");
I want the output log file to look like this (sample_0.log):
........................................................
[TimeStamp] [Filter] Hello
[TimeStamp] [Filter] Skipped 2 duplicate messages!
[TimeStamp] [Filter] Some Different Hello
.......................................................
I am using this example with the text file backend, and the TimeStamp and Filter attributes are OK. My problem is skipping duplicates, maybe by setting filters or something else.
I think syslog on Linux has this feature with some configuration.

Boost.Log does not implement such log record accumulation; you will have to implement it yourself. You can do this by implementing a sink backend that buffers the last log record message and compares it with the next one. Note that you should not buffer the whole record or the formatted string, because it will likely differ due to timestamps, record counters and other frequently changing attribute values that you might use.
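A minimal sketch of such a backend, assuming a synchronous frontend and a plain std::ofstream as the target (the class name, the file name and the exact "Skipped N duplicate messages!" wording are illustrative, not part of Boost.Log):

#include <boost/log/core.hpp>
#include <boost/log/expressions.hpp>
#include <boost/log/sinks/basic_sink_backend.hpp>
#include <boost/log/sinks/sync_frontend.hpp>
#include <boost/make_shared.hpp>
#include <fstream>
#include <string>

namespace logging = boost::log;
namespace sinks = boost::log::sinks;
namespace expr = boost::log::expressions;

// Collapses consecutive records whose message bodies are identical.
class dedup_backend
    : public sinks::basic_formatted_sink_backend<char, sinks::synchronized_feeding>
{
public:
    explicit dedup_backend(const std::string& file_name) : file_(file_name) {}

    // Called by the frontend with the already formatted line.
    void consume(logging::record_view const& rec, string_type const& formatted)
    {
        // Compare only the raw message text, not the formatted line,
        // since timestamps and counters change from record to record.
        auto msg = rec[expr::smessage];
        if (msg && *msg == last_message_)
        {
            ++skipped_;
            return;
        }
        if (skipped_ > 0)
        {
            file_ << "Skipped " << skipped_ << " duplicate messages!\n";
            skipped_ = 0;
        }
        file_ << formatted << '\n';
        if (msg)
            last_message_ = *msg;
    }

private:
    std::ofstream file_;
    std::string last_message_;
    std::size_t skipped_ = 0;
};

// Registration (formatter setup omitted for brevity):
// auto sink = boost::make_shared<sinks::synchronous_sink<dedup_backend>>(
//     boost::make_shared<dedup_backend>("sample_0.log"));
// logging::core::get()->add_sink(sink);

You would still set a formatter on the frontend so that non-duplicate lines carry the [TimeStamp] [Filter] prefix; for simplicity the skip notice above is written without attributes, and any pending count should also be flushed when the sink is destroyed.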

Related

What will happen if the power shuts down while we are inserting into a database?

I was recently asked a question in an interview, and I'm hoping someone can help me figure it out.
Suppose we have 100 files, and a process reads a file, parses it, and writes the data into a database.
Now let's say the process was at file number 60 when the power went off. How would you design the system so that when the power comes back up, the process resumes writing data into the database where it left off before the shutdown?
This would be one way:
Loop over:
Pick up a file
Check it hasn't been processed with a query to the database.
Process the file
Update the database
Update the database with a log of the file processed
Commit
Move the file out of the non-processed queue
You can also log the file entry to some other persistent resource.
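A rough sketch of that loop, assuming SQLite, a hypothetical processed_files log table living in the same database as the parsed data, and a parse_and_insert placeholder for the real per-file work:

#include <sqlite3.h>
#include <filesystem>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Run a simple SQL statement or throw.
static void exec(sqlite3* db, const char* sql)
{
    char* err = nullptr;
    if (sqlite3_exec(db, sql, nullptr, nullptr, &err) != SQLITE_OK)
    {
        std::string msg = err ? err : "sqlite error";
        sqlite3_free(err);
        throw std::runtime_error(msg);
    }
}

// Has this file already been committed? (Query against the log table.)
static bool already_processed(sqlite3* db, const std::string& name)
{
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT 1 FROM processed_files WHERE name = ?1;",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, name.c_str(), -1, SQLITE_TRANSIENT);
    const bool found = sqlite3_step(stmt) == SQLITE_ROW;
    sqlite3_finalize(stmt);
    return found;
}

static void mark_processed(sqlite3* db, const std::string& name)
{
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO processed_files(name) VALUES (?1);",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, name.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}

void process_queue(sqlite3* db, const fs::path& queue_dir, const fs::path& done_dir)
{
    for (const auto& entry : fs::directory_iterator(queue_dir))
    {
        if (!entry.is_regular_file())
            continue;
        const std::string name = entry.path().filename().string();
        if (already_processed(db, name))
            continue;                               // skipped after a power loss

        exec(db, "BEGIN;");
        try
        {
            // parse_and_insert(db, entry.path());  // hypothetical per-file work
            mark_processed(db, name);
            exec(db, "COMMIT;");                    // data and log become visible together
        }
        catch (...)
        {
            exec(db, "ROLLBACK;");
            throw;
        }
        fs::rename(entry.path(), done_dir / name);  // move out of the queue last
    }
}

Because the parsed data and the log entry commit in the same transaction, a power loss leaves either both or neither, so the restart check stays consistent.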
Q: What if there are many files? Doesn't writing to logs slow down the process?
A: Probably not much, it's just one entry into the database per file. It's the cost of resilience.
Q: What if the files are so small it's almost only updating one row per file?
A: Make your update query idempotent. Don't log, but ensure that files are removed from the queue once the transaction is complete.
Q: What if there are many lines in a file. Do you really want to restart with the first line of a file?
A: It depends on the cost/benefit. You could split the file into smaller ones prior to processing and process each sub-file. If power outages happen all the time, then that's a good compromise. If they happen very rarely, the extra work by the system may not be worth it.
Q: What if there is a mix of small and large files?
A: Put the files into separate queues that handle them accordingly.
The UPS idea by @TimBiegeleisen is very good, though:
Well actually it is about that, because unplugging a database in the middle of a lengthy transaction might result in corrupted data. – Tim Biegeleisen Feb 22 '20 at 10:24
I've experienced the failure of one such UPS, so you'll need two.
I think you must:
Store a reference to the file somewhere (an ID or an index of the processed file - it depends on the case, really).
You have to define the boundaries of a single transaction - let it be the full processing of one file: read the file, parse it, store the data in the database and update the reference to the file you processed. If all of that succeeds, you can commit the transaction to the database.
Your main task, which processes all the files, should look into the reference table and, based on its state, fetch the next file.
In this case you create a transaction around the processing of a single file. If anything goes wrong there, you can always rerun the processing job and it will start where it left off.
Please be aware that this is a very simple example; in most scenarios you want to keep transactions as thin as possible.

Parsing the date from the header stream of an .msg file

I'm trying to obtain the send date of an .msg email message file. After endless searching, I've concluded that the send date is not kept in its own stream within the file (but please correct me if I'm wrong). Instead, it appears that the date must be obtained from the stream containing the standard email headers (a stream named __substg1.0_007D001F).
So I've managed to obtain the email header stream and store it in a buffer. At this point, I need to find and parse the Date field from the headers. I'm finding this difficult, because I don't believe I can use a standard email-parsing C++ library. After all, I only have a header stream--not an entire, standard email file.
I'm currently trying a regex, perhaps something like this:
std::wregex regexDate(L"^Date:(.*)\r\n");
std::wsmatch match;
if (std::regex_search(strHeader, match, regexDate)) {
    // ...
}
But I'm reluctant to use regex (I'm concerned that it'll be error-prone), and I'm wondering if there's a more robust, accepted approach to parsing headers. Perhaps splitting the header string on new lines and finding the one that begins with Date:? Any guidance would be greatly appreciated.
One other consideration: I'm not sure it's possible to read in the header stream line by line, because IStream doesn't have a get line method.
(Side note: I've also tried obtaining message data using C++ Outlook automation, but that seems to involve some security and compatibility issues, so it won't work out.)
The Send Date is stored in an .msg file, but as you note, it is not in its own stream. As a short, fixed-width value, it can be found in the __properties_version1.0 stream object under the root entry (or under an attachment object for embedded messages), with the property tag 0x00390040, the PidTagClientSubmitTime property, which is described in the MS-OXOMSG documentation as
Contains the current time, in UTC, when the email message is submitted.
MS-OXCMAIL Section 2.2.3.2.2: Sent time elaborates on this:
To set the value of the PidTagClientSubmitTime property ([MS-OXOMSG] section 2.2.3.11), clients MUST set the Date header value, as specified in [RFC2822].
This has the property type 0x0040, PtypTime, which, per the list of Property Data Types:
8 bytes; a 64-bit integer representing the number of 100-nanosecond intervals since January 1, 1601
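Putting that together, here is a sketch of scanning the raw __properties_version1.0 bytes for the tag and converting the value. The 16-byte fixed-width entry layout (4-byte tag, 4-byte flags, 8-byte value, little-endian) and the 32-byte header under the root entry should be double-checked against MS-OXMSG, and the helper names are mine:

#include <cstdint>
#include <cstring>
#include <ctime>
#include <vector>

// Convert a PtypTime value (100-nanosecond intervals since 1601-01-01 UTC,
// i.e. a Windows FILETIME) to a Unix time_t.
std::time_t filetime_to_time_t(std::uint64_t filetime_100ns)
{
    // 100-ns intervals between 1601-01-01 and 1970-01-01.
    constexpr std::uint64_t kEpochDifference = 116444736000000000ULL;
    return static_cast<std::time_t>((filetime_100ns - kEpochDifference) / 10000000ULL);
}

// Look for PidTagClientSubmitTime (property tag 0x00390040) in the
// properties stream and return the send date via 'out'.
bool find_submit_time(const std::vector<std::uint8_t>& stream,
                      std::size_t header_size, std::time_t& out)
{
    for (std::size_t pos = header_size; pos + 16 <= stream.size(); pos += 16)
    {
        std::uint32_t tag = 0;
        std::memcpy(&tag, &stream[pos], sizeof(tag));
        if (tag == 0x00390040u)
        {
            std::uint64_t value = 0;
            std::memcpy(&value, &stream[pos + 8], sizeof(value));
            out = filetime_to_time_t(value);
            return true;
        }
    }
    return false;
}

This avoids parsing the header text entirely, so the regex question goes away.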

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example, data about users comes in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another, to see if the customer in source system 1 is the same as the customer in source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
An Apache Beam DAG is by definition acyclic, but I could just republish the data through Pub/Sub to the same DAG to make the algorithm cyclic.
I could create a PCollection of hashmaps that continuously do a self-join against all other elements, but this seems like an inefficient method.
Immutability of a PCollection is a problem if I want to be constantly modifying things as the data goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Is there any way you know of in Dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
Drools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many, many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like Neo4j or Datomic would be nice, since joins are at the record level rather than row/column scans, but I don't know if it's possible in Beam to do something similar.
BigQuery or Spanner: Brute-forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferable to keep the graph of all records in memory and compute in memory. We could also try to concatenate all columns and run multiple compare-and-update passes across all columns.
Or maybe there's a more standard way of solving this class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do); a few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
Then the main question, how you can efficiently match a record in general, will probably look different under different assumptions and requirements as well. For example, I would think about:
can you fit all data in memory;
are your rules dynamic? Do they change at all, and what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?
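To make the "quick elimination" point concrete, here is a small in-memory sketch (plain C++ rather than Beam code, with made-up field and function names): records are indexed under cheap blocking keys, and the expensive rule set only runs against candidates that share at least one key.

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// A record is just a bag of fields from whichever source system sent it.
using Record = std::unordered_map<std::string, std::string>;

// Cheap keys used only to narrow the candidate set (e.g. normalised last
// name, e-mail hash, phone digits). Each record may emit several keys.
std::vector<std::string> blocking_keys(const Record& r)
{
    std::vector<std::string> keys;
    auto it = r.find("last_name");
    if (it == r.end()) it = r.find("lname");
    if (it != r.end()) keys.push_back("ln:" + it->second);
    // ... add more keys (email, phone, postcode, ...) as the rules require.
    return keys;
}

// Stand-in for the ~100 domain rules deciding "same customer or not";
// here just a trivial last-name comparison so the sketch is self-contained.
bool rules_say_same(const Record& a, const Record& b)
{
    auto last = [](const Record& r) {
        auto it = r.find("last_name");
        if (it == r.end()) it = r.find("lname");
        return it == r.end() ? std::string() : it->second;
    };
    const std::string la = last(a), lb = last(b);
    return !la.empty() && la == lb;
}

class Matcher
{
public:
    // Returns the index of the master record this record was merged into.
    std::size_t add(const Record& rec)
    {
        // 1. Quick elimination: collect only candidates sharing a blocking key.
        std::unordered_set<std::size_t> candidates;
        for (const std::string& key : blocking_keys(rec))
            for (std::size_t idx : index_[key])
                candidates.insert(idx);

        // 2. Run the expensive rules only against those candidates.
        for (std::size_t idx : candidates)
            if (rules_say_same(masters_[idx], rec))
            {
                merge_into(masters_[idx], rec);  // rules may infer/add fields
                reindex(idx);                    // keys may change after a merge
                return idx;
            }

        // 3. No match: the record becomes a new master.
        masters_.push_back(rec);
        const std::size_t idx = masters_.size() - 1;
        reindex(idx);
        return idx;
    }

private:
    void merge_into(Record& master, const Record& rec)
    {
        for (const auto& kv : rec)
            master.emplace(kv.first, kv.second);  // keep existing values
    }

    void reindex(std::size_t idx)
    {
        // Duplicate index entries are harmless because 'candidates' is a set.
        for (const std::string& key : blocking_keys(masters_[idx]))
            index_[key].push_back(idx);
    }

    std::vector<Record> masters_;
    std::unordered_map<std::string, std::vector<std::size_t>> index_;
};

In Beam/Dataflow terms, the index and the masters would live in per-key state (e.g. a stateful DoFn keyed by blocking key) rather than in a single process, but the shape of the matching step is the same.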

Chronicle Queue - reader modifying msg

We are preparing to use Chronicle Queue (SingleChronicleQueue) to record our messages. The prototype is working now. However, we have some problems.
Can readers modify the messages? We use a Chronicle Map to record the indices read, in order to remove duplicate messages after a restart. In case this doesn't work, we want to tag messages as read on the reader side. Actually, we already do that. The problem is that sometimes we get error messages like "15c77d8be (62) was 8000003f is now 3f", and we suspect this is because writes across cache-line boundaries are not atomic. What is the recommended way to solve this? Currently we add a one-byte tag before the message; will adding 3 bytes of padding solve the problem?
Can we use our own roll policy? We'd like to use an hourly policy, but the hourly policy limits a file to fewer than 256 million entries. Can we use a custom roll cycle? Are there any caveats?
One common approach is to record your consumers' read indices in another output queue. On restart, simply read backwards from the end of the output queue to determine what each consumer's read sequence should be.
Without seeing your code it is a little difficult to determine what the problem might be with trying to modify existing records. Note that records inserted into a queue are supposed to be immutable; modifying them from a reader thread is not supported.
With regards to your RollCycle requirements, the LARGE_HOURLY cycle was recently added, allowing ~2 billion entries per cycle:
https://github.com/OpenHFT/Chronicle-Queue/blob/master/src/main/java/net/openhft/chronicle/queue/RollCycles.java#L27

What is an efficient method to compare file lists on a client and a remote server?

I have the below situation, which needs to be addressed efficiently.
I'm syncing files from client devices to a server. Sometimes a file from one device doesn't get fetched to another device from the server due to some issue with the server. I need to make sure that all the files on the server are synced to all the client devices, using a separate thread. I am using C++ for development and libcurl for client-to-server communication.
On the client device, we have an entry for each downloaded file in an SQLite database. Likewise, the server has similar records in its database (MySQL). I need to list all the available files on the client device, send that list to the server, and compare it with the list taken from the server database to find the missing files.
I did a rough estimation that a list of 1 million files (file name with full path) is about 85 MB in size. Upon compression it goes down to about 10 MB. So transferring this entire file list (even after compression) from client to server is not a good idea. I planned to implement a Bloom filter for this, as follows:
Fetch the file list from the client-side database and convert it to a Bloom filter data structure.
Transfer only the Bloom filter data structure from the client to the server.
Fetch the file list from the server-side database, compare it with the Bloom filter received from the client, and find the missing files.
Please note that the above process, initiated from the client, should run in a thread at a regular interval, say every hour or so.
The problem with Bloom filters is the false positive rate, even if it is very low. I don't want to miss even a single file. Is there any better way of doing this?
As you've noticed, this isn't a problem for which Bloom filters are appropriate. With a Bloom filter, when you get a hit you must then check the authoritative source to differentiate between a false positive and a true positive - they're useful in situations where most queries against the filter are expected to give a negative result, which is the opposite of your case.
What you could do is have each side build a partial prefix tree in memory of all the filenames known to that side. It wouldn't be a full prefix tree - once the number of filenames below a node drops below a certain level, you'd just include the full list of those filenames in that node. You then synchronise those prefix trees using a recursive algorithm starting at the roots of the trees:
Each side creates a hash of all the sorted, concatenated filenames below the current node.
If the hashes are equal then this node and all descendants are synchronised - return.
If there are no child nodes, send the (short) list of filenames at this terminal node from one side to the other to synchronise and return.
Otherwise, recursively synchronise the child nodes and return.
The hash should be at least 128 bits, and make sure that when you concatenate the filenames for the hash you do so in a reversible manner (i.e. separate them with a character that can't appear in filenames, like \0, or prefix each one with its length).
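A rough one-directional sketch of that recursion, with both trees held in one process for clarity (in the real system each tree lives on its own machine and every comparison becomes a small request/response); the hash stand-in below should be replaced by a real hash of at least 128 bits, such as MD5 or SHA-1:

#include <algorithm>
#include <iterator>
#include <map>
#include <memory>
#include <string>
#include <vector>

// One node of the partial prefix tree: inner nodes have children keyed by the
// next path fragment; terminal nodes hold a short, sorted filename list.
struct Node
{
    std::map<std::string, std::unique_ptr<Node>> children;
    std::vector<std::string> filenames;   // only used in terminal nodes
};

// Collect every filename stored anywhere below 'node'.
void collect(const Node& node, std::vector<std::string>& out)
{
    out.insert(out.end(), node.filenames.begin(), node.filenames.end());
    for (const auto& c : node.children)
        collect(*c.second, out);
}

// Stand-in for the real hash: returns the reversible, length-prefixed
// concatenation of all sorted filenames below the node.
std::string subtree_hash(const Node& node)
{
    std::vector<std::string> all;
    collect(node, all);
    std::sort(all.begin(), all.end());
    std::string joined;
    for (const std::string& f : all)
        joined += std::to_string(f.size()) + ':' + f;
    return joined;
}

// Append to 'missing_on_b' every filename present under 'a' but not under 'b'.
// Running it with the arguments swapped gives the other direction.
void sync_node(const Node& a, const Node& b, std::vector<std::string>& missing_on_b)
{
    // 1. Equal hashes: this node and everything below it is already in sync.
    if (subtree_hash(a) == subtree_hash(b))
        return;

    // 2. Terminal node on either side: diff the (short) filename lists directly.
    if (a.children.empty() || b.children.empty())
    {
        std::vector<std::string> a_all, b_all;
        collect(a, a_all);
        collect(b, b_all);
        std::sort(a_all.begin(), a_all.end());
        std::sort(b_all.begin(), b_all.end());
        std::set_difference(a_all.begin(), a_all.end(), b_all.begin(), b_all.end(),
                            std::back_inserter(missing_on_b));
        return;
    }

    // 3. Otherwise recurse; a child missing entirely on one side is all missing.
    for (const auto& c : a.children)
    {
        auto it = b.children.find(c.first);
        if (it != b.children.end())
            sync_node(*c.second, *it->second, missing_on_b);
        else
            collect(*c.second, missing_on_b);
    }
}

Because identical subtrees are pruned at step 1, the amount of data exchanged is proportional to the differences rather than to the million-entry file list.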
For file/pathname compression, I've found prefix-suffix compression to work better, even on its own, than generic (bz2) compression. When combined, the filename list can be reduced even more.
The trick is to use escape codes (e.g. values < 32) to indicate the number of characters shared with the previous row, then use regular characters for the unique part, and finally (optionally) encode the number of characters shared at the end of the string.
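A minimal sketch of that front coding over a sorted path list, assuming the filenames contain no control characters. The shared-prefix count is emitted as one or more bytes below 32 (so prefixes longer than 31 characters still work), and the optional shared-suffix count at the end of the string is omitted:

#include <algorithm>
#include <string>
#include <vector>

// Encode: each entry is [one or more count bytes < 32 summing to the length of
// the prefix shared with the previous path][the unique remainder, bytes >= 32].
std::string front_encode(std::vector<std::string> paths)
{
    std::sort(paths.begin(), paths.end());
    paths.erase(std::unique(paths.begin(), paths.end()), paths.end());  // duplicates would yield empty remainders
    std::string out, prev;
    for (const std::string& p : paths)
    {
        std::size_t shared = 0;
        while (shared < p.size() && shared < prev.size() && p[shared] == prev[shared])
            ++shared;
        std::size_t n = shared;
        while (n >= 31) { out.push_back(static_cast<char>(31)); n -= 31; }
        out.push_back(static_cast<char>(n));
        out.append(p, shared, std::string::npos);   // unique part, regular characters
        prev = p;
    }
    return out;
}

// Decode: a maximal run of bytes < 32 gives the shared-prefix length, the
// following bytes >= 32 give the unique remainder.
std::vector<std::string> front_decode(const std::string& in)
{
    std::vector<std::string> paths;
    std::string prev;
    std::size_t i = 0;
    while (i < in.size())
    {
        std::size_t shared = 0;
        while (i < in.size() && static_cast<unsigned char>(in[i]) < 32)
            shared += static_cast<unsigned char>(in[i++]);
        std::string p = prev.substr(0, shared);
        while (i < in.size() && static_cast<unsigned char>(in[i]) >= 32)
            p.push_back(in[i++]);
        paths.push_back(p);
        prev = std::move(p);
    }
    return paths;
}

On long, similar paths the count bytes typically replace dozens of repeated characters, which is why this already beats generic compression on its own and still shrinks further when bz2 is applied on top.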