Parsing the date from the header stream of an .msg file - c++

I'm trying to obtain the send date of an .msg email message file. After endless searching, I've concluded that the send date is not kept in its own stream within the file (but please correct me if I'm wrong). Instead, it appears that the date must be obtained from the stream containing the standard email headers (a stream named __substg1.0_007D001F).
So I've managed to obtain the email header stream and store it in a buffer. At this point, I need to find and parse the Date field from the headers. I'm finding this difficult, because I don't believe I can use a standard email-parsing C++ library. After all, I only have a header stream, not an entire, standard email file.
I'm currently trying a regex, perhaps something like this:
std::wregex regexDate(L"^Date:(.*)\r\n",
    std::regex_constants::ECMAScript | std::regex_constants::multiline); // C++17: lets ^ match at each line start
std::wsmatch match;
if (std::regex_search(strHeader, match, regexDate)) {
    // match[1] holds the raw date-header value
}
But I'm reluctant to use regex (I'm concerned that it'll be error-prone), and I'm wondering if there's a more robust, accepted approach to parsing headers. Perhaps splitting the header string on new lines and finding the one that begins with Date:? Any guidance would be greatly appreciated.
One other consideration: I'm not sure it's possible to read in the header stream line by line, because IStream doesn't have a getline method.
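For what it's worth, here is a minimal sketch of the splitting approach, assuming the stream contents have already been read into a std::wstring (strHeader) as described above. FindDateHeader is a hypothetical helper name, and folded (continuation) header lines would need extra handling:

#include <sstream>
#include <string>

// Returns the raw value of the Date header, or L"" if not found.
std::wstring FindDateHeader(const std::wstring& strHeader)
{
    std::wistringstream iss(strHeader);
    std::wstring line;
    while (std::getline(iss, line)) {
        if (!line.empty() && line.back() == L'\r')
            line.pop_back();                    // strip the CR of CRLF
        if (line.compare(0, 5, L"Date:") == 0)
            return line.substr(5);              // everything after "Date:"
    }
    return L"";
}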
(Side note: I've also tried obtaining message data using C++ Outlook automation, but that seems to involve some security and compatibility issues, so it won't work out.)

The send date is stored in an .msg file, but as you note, it is not in its own stream. As a short, fixed-width value, it can be found in the __properties_version1.0 stream object under the root entry (or under an attachment object for embedded messages), with the property tag 0x00390040 (property ID 0x0039, type 0x0040), the PidTagClientSubmitTime property, which is described in the MS-OXOMSG documentation as
Contains the current time, in UTC, when the email message is submitted.
MS-OXCMAIL Section 2.2.3.2.2: Sent time elaborates on this:
To set the value of the PidTagClientSubmitTime property ([MS-OXOMSG] section 2.2.3.11), clients MUST set the Date header value, as specified in [RFC2822].
This has the property type 0x0040, PtypTime, which, per the list of Property Data Types:
8 bytes; a 64-bit integer representing the number of 100-nanosecond intervals since January 1, 1601
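For illustration, a minimal sketch of pulling that value out, assuming the .msg is already open as an IStorage (e.g. via StgOpenStorageEx) and eliding most error handling; GetClientSubmitTime is a hypothetical helper name. Note the 32-byte header offset applies to the top-level properties stream; attachment and recipient property streams use smaller headers:

#include <windows.h>
#include <cstring>
#include <vector>

// Scans __properties_version1.0 for PidTagClientSubmitTime (0x00390040).
bool GetClientSubmitTime(IStorage* pStorage, FILETIME& ft)
{
    IStream* pStream = nullptr;
    if (FAILED(pStorage->OpenStream(L"__properties_version1.0", nullptr,
                                    STGM_READ | STGM_SHARE_EXCLUSIVE, 0, &pStream)))
        return false;

    STATSTG stat = {};
    pStream->Stat(&stat, STATFLAG_NONAME);
    std::vector<unsigned char> buf(static_cast<size_t>(stat.cbSize.QuadPart));
    ULONG cbRead = 0;
    pStream->Read(buf.data(), static_cast<ULONG>(buf.size()), &cbRead);
    pStream->Release();

    // After the 32-byte header, fixed-width properties are 16-byte entries:
    // 4-byte property tag, 4-byte flags, 8-byte value (all little-endian).
    for (ULONG off = 32; off + 16 <= cbRead; off += 16) {
        DWORD tag = 0;
        std::memcpy(&tag, &buf[off], sizeof tag);
        if (tag == 0x00390040) {               // PidTagClientSubmitTime, PtypTime
            std::memcpy(&ft, &buf[off + 8], sizeof ft);
            return true;
        }
    }
    return false;
}

FileTimeToSystemTime (and, if needed, SystemTimeToTzSpecificLocalTime) can then turn the 64-bit value into something printable.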

Related

How to extract data of an unknown length from a string with random data

This question has been stumping me for quite a while now. I have a binary file containing the contents of an SNMP trap sent by a server. I am able to read the file and output its data as a UTF-8 encoded string: '0‚Œcommunity¤‚} +…"Õ#À¨V ÁC8o[0‚W0+…"ÕCPU00020A+…"Õ/Message Detailing The Event Triggering The SNMP Trap (BIST).0+…"Õ0+…"Õ7K882L30+…"Õ server-name0+…"Õ0+…"ÕN/A0+…"Õ"1"0+…"Õ 7K882L30%+…"Õ Main System Chassis0&+…"ÕiDRAC-server.example.com' The issue is, I'm trying to extract certain elements of this data, such as community and server-name, and store them in variables. However, the program needs to work with lots of different SNMP traps whose values may differ in length and content, and since I don't know where the data will be located within the string, I can't come up with code to reliably sort the data into the specific variables. The only thing I have to go on is that I know the sequential order of the data.
I'm not sure where to begin with approaching this problem; many thanks for any help.
To clarify, the data is currently stored in a variable char buffer[1025] = DATA;
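One possible angle: an SNMP trap is BER-encoded ASN.1 rather than text, so instead of matching on the mangled string, you can walk the tag-length-value triplets and collect the OCTET STRING values in the order they appear, which matches the sequential order already known. A rough sketch, with no error recovery and ignoring multi-byte tags (which SNMPv1 doesn't use):

#include <cstddef>
#include <string>
#include <vector>

void collectOctetStrings(const unsigned char* p, std::size_t len,
                         std::vector<std::string>& out)
{
    std::size_t i = 0;
    while (i + 2 <= len) {
        unsigned char tag = p[i++];
        std::size_t vlen = p[i++];
        if (vlen & 0x80) {                    // long-form length
            std::size_t n = vlen & 0x7F;
            vlen = 0;
            while (n-- && i < len)
                vlen = (vlen << 8) | p[i++];
        }
        if (i + vlen > len)
            break;                            // truncated or malformed
        if (tag & 0x20)                       // constructed type: recurse
            collectOctetStrings(p + i, vlen, out);
        else if (tag == 0x04)                 // primitive OCTET STRING
            out.emplace_back(reinterpret_cast<const char*>(p + i), vlen);
        i += vlen;
    }
}

On data like the above, the community string would be the first value collected; OIDs and integers are skipped, so the string-typed varbind values (server-name and so on) follow in their original order.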

Server side Validation for DateTime Stamp

In my application, server-side date validations were done through IsDate, which is very inconsistent in its behavior. I used isValid("USdate", DateVar); that works fine with incoming dates, but it fails when DateVar is a date-time stamp. The values coming in DateVar could be anything: a date, a time, a date & time, or even some invalid data. If I use a Date mask with isValid, it behaves like IsDate and is of no use. How can I accomplish this?
All "dates" that will be arriving via a request - be they via a URL parameter, a form submission, a cookie, etc - will be strings, not dates.
What you need to do is to work out what string formats you will allow, and validate them accordingly.
EG: you might decide that yyyy-mm-dd is OK, but you won't accept m/d/yy. You might pass them as three separate components for y, m and d. But you really oughtn't try to accept any old format, as you will need to have a validator for each format, and there's a law of diminishing returns there: people won't expect to use any format they like; they'll be expecting you to guide them. You also need to be mindful that if you ask me to type in today's date, I'd give you 4/5/2015. But to you that might represent April 5.
Given various month-length and leap-year rules, really the easiest and most reliable way to see if an input string represents a date in an acceptable format is to do this:
Validate the format mask, eg: if you're accepting yyyy-mm-dd, then the input needs to be \d{4}-\d{2}-\d{2}. Then at least you know the string has been formed properly.
Then extract the components from the string, and attempt to create a date object with them. If it doesn't error: it's OK.
Last: check any date boundaries within / outwith which the date needs to fall.
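To make those steps concrete, here is a sketch (shown in C++ for consistency with the rest of this page, though the question is CFML; assumes C++20 for std::chrono::year_month_day and yyyy-mm-dd as the agreed mask):

#include <chrono>
#include <regex>
#include <string>

bool isValidIsoDate(const std::string& s)
{
    // 1. Validate the format mask: yyyy-mm-dd.
    static const std::regex mask(R"(^(\d{4})-(\d{2})-(\d{2})$)");
    std::smatch m;
    if (!std::regex_match(s, m, mask))
        return false;

    // 2. Extract the components and build a date object from them.
    std::chrono::year_month_day ymd{
        std::chrono::year{std::stoi(m[1])},
        std::chrono::month{static_cast<unsigned>(std::stoi(m[2]))},
        std::chrono::day{static_cast<unsigned>(std::stoi(m[3]))}};
    if (!ymd.ok())                  // rejects Feb 30, month 13, etc.
        return false;

    // 3. Check whatever boundaries the application requires.
    return ymd.year() >= std::chrono::year{1900};
}

So isValidIsoDate("2015-02-30") fails at step 2 because ymd.ok() rejects it, even though the string matches the mask.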

Excel international date formatting

I am having problems formatting Excel datetimes, so that it works internationally. Our program is written in C++ and uses COM to export data from our database to Excel, and this includes datetime fields.
If we don't supply a formatting mask, some installations of Excel display these dates as serial numbers (days since 1900-01-01, followed by the time as a 24-hour fraction). This is unreadable to a human, so we have found that we MUST supply a date formatting mask to be sure that it displays readably.
The problem, as I see it, is that Excel uses international formatting masks. For example, the UK datetime format mask might be "YYYY-MM-DD HH:MM".
But if that format mask is sent to an Excel installation in Sweden, it fails, since the Swedish version of Excel uses "ÅÅÅÅ-MM-DD tt:mm".
It's highly impractical to have 150 different national datetime formatting masks in our application to support different countries.
Is there a way to write formatting masks so that they include locale, such that we would be allowed to use ONE single mask?
Unless you are using the date functionality in Excel, the easiest way to handle this is to decide on a format and then create a string yourself in that format and set the cell accordingly.
This comic: http://xkcd.com/1179/ might help you choose a standard to go with. Otherwise, clients that open your file in different countries will have differently formatted data. Just pick a standard and force your data to that standard.
Edited to add: There are libraries that can make this really easy for you as well... http://www.libxl.com/read-write-excel-date-time.html
Edited to add part2: Basically what I'm trying to get at is to avoid asking for the mask and just format the data yourself (if that makes sense).
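As a sketch of that suggestion (formatIso8601 is a hypothetical helper; the resulting string would go to your existing COM cell-setting code in place of a raw date value plus a mask):

#include <cstdio>
#include <string>

// Build a locale-independent "YYYY-MM-DD HH:MM" string yourself.
std::string formatIso8601(int year, int month, int day, int hour, int minute)
{
    char buf[17];
    std::snprintf(buf, sizeof buf, "%04d-%02d-%02d %02d:%02d",
                  year, month, day, hour, minute);
    return buf;
}

Since the cell then holds plain text, it looks identical in every locale; the trade-off is that Excel's date arithmetic and date sorting no longer apply to it.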
I recommend doing the following: create an Excel file with date formatting applied to a specific cell, and save this file for your program to use.
When the program runs, it will open this Excel file and retrieve the local date formatting from the specified cell.
When you have multiple formats to save, just use different cells for them.
It is not a nice way, but it will work AFAIK.
Alternatively, you could consider creating an xla(m) file that uses VBA to feed back the local formatting characters through a function like:
Public Function localChar(charIn As Range) As String
    localChar = charIn.NumberFormatLocal
End Function
Also not a very clean method, but it might do the trick for you.

How to deserialize a file containing multiple records

I've written a Thrift definition and used this definition to serialize multiple records into one file (I've added the size of the whole record at the beginning of each record). That is, in short, what I have done.
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
myClass->write(protocol.get());
const std::string & data(transport->getBufferAsString());
Afterwards I just write the string data out in binary mode. Now I want to deserialize this file again. I wouldn't have any problem if there were only one record in the file; unfortunately I have to handle multiple records, so I guess I have to work with offsets based on the sizes I saved in the file along with the records themselves. However, I can't seem to find any example I can use to achieve my goal, and the official documentation is quite lacking. Has anyone any tips for me? If I'm missing some information, just ask.
Further Informations:
Of course I want to use Thrift to deserialize. However, one file can contain multiple records. For example: imagine I have defined a struct in a Thrift definition file that contains car information. Now I serialize multiple car structs into one output file. Serializing is no problem, as I just append the data. If I want to deserialize, however, I have to know where one record ends and the next begins. That is my problem. I don't know how to tell Thrift where one record begins and ends. I've searched the internet, but can't seem to find an example for C++ (I got one for Python so far, but am not able to translate it to C++). The structure of one file can be described as follows: [lengthOfRecord1][record1][lengthOfRecord2][record2][...]
Thanks in Advance
Michael
How about having a list<records> that you de/serialize as a whole? Or is it an absolute requirement to read them independently and randomly? If yes, I see 1.5 (one and a half) possible solutions:
Have a second file as an index. This holds a map<recordNumber, offset>, or simply a sorted list of integer pairs, to quickly locate records (see the sketch after this list). Since these data are much smaller than the records, you can probably cache them in memory the whole time.
The half solution: iff the record size is fixed, any record's position can easily be calculated by multiplying recordSize * (recordNr - 1). This way you don't even need the size prefix. If you have strings in the record or other variable-sized entities, this will not work, unless you force a fixed record size by reserving a buffer for each record with a predefined (maximum) size. It's a little ugly, thus the "half" solution, but you don't need the index file.
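A rough sketch of building that index by scanning the existing size prefixes once (assuming the 4-byte native-endian length prefix from the question; buildIndex is a hypothetical helper). The resulting map could then be written out to the side file:

#include <cstdint>
#include <fstream>
#include <map>

// Maps recordNumber -> file offset of the record body.
std::map<uint32_t, std::streamoff> buildIndex(std::istream& in)
{
    std::map<uint32_t, std::streamoff> index;
    uint32_t size = 0;
    uint32_t recordNr = 0;
    while (in.read(reinterpret_cast<char*>(&size), sizeof size)) {
        index[recordNr++] = in.tellg();    // record body starts here
        in.seekg(size, std::ios::cur);     // jump over it to the next prefix
    }
    return index;
}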
Although maybe not the perfect solution, this seems to work for me:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
transport->resetBuffer((uint8_t*) buffer, sizeOfEntry);
Buffer is a char array containing the desired record (I used seekg for the offset) and sizeOfEntry is the record's size. Afterwards I can go on with the automatically generated read method of my Thrift-generated class. In fact I had this solution earlier; I had just messed up my offset, so it didn't work.
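For completeness, a sketch of the whole read loop built around that resetBuffer trick (assuming the 4-byte native-endian length prefix from the question and a generated class named Car as in the example above; the include paths are for Thrift 0.9-era layouts and may differ):

#include <cstdint>
#include <fstream>
#include <vector>
#include <boost/shared_ptr.hpp>
#include <thrift/protocol/TBinaryProtocol.h>
#include <thrift/transport/TBufferTransports.h>
#include "gen-cpp/car_types.h"   // hypothetical generated header

void readAllRecords(const char* path, std::vector<Car>& out)
{
    std::ifstream in(path, std::ios::binary);
    uint32_t size = 0;
    while (in.read(reinterpret_cast<char*>(&size), sizeof size)) {
        std::vector<uint8_t> buf(size);
        if (!in.read(reinterpret_cast<char*>(buf.data()), size))
            break;                                   // truncated record
        boost::shared_ptr<apache::thrift::transport::TMemoryBuffer>
            transport(new apache::thrift::transport::TMemoryBuffer);
        apache::thrift::protocol::TBinaryProtocol protocol(transport);
        transport->resetBuffer(buf.data(), size);
        Car record;
        record.read(&protocol);                      // thrift-generated read
        out.push_back(record);
    }
}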

XML or CSV for "Tabular Data"

I have "Tabular Data" to be sent from server to client --- I am analyzing should I be going for CSV kind of formate or XML.
The data which I send can be in MB's, server will be streaming it and client will read it line by line to start paring the output as it gets (client can't wait for all data to come).
As per my present thought CSV would be good --- it will reduce the data size and can be parsed faster.
XML is a standard -- I am concerned with parsing data as it comes to system(live parsing) and data size.
What would be the best solution?
Thanks for all valuable suggestions.
If it is "Tabular data" and the table is relatively fixed and regular, I would go for a CSV-format. Especially if it is one server and one client.
XML has some advantage if you have multiple clients and want to validate the file format before using the data. On the other hand, XML has cornered the market for "code bloat", so the amount transfered will be much larger.
I would use CSV, with a header which indicate the id of each field.
id, surname, givenname, phone-number
0, Doe, John, 555-937-911
1, Doe, Jane, 555-937-911
As long as you do not forget the header, you should be fine if the data format ever changes. Of course the client needs to be updated before the server starts sending new streams.
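A small sketch of what reading that header buys you: resolve column positions by name once, so added or reordered columns don't break the client (naive splitting with basic whitespace trimming, no quoting support):

#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line on commas, trimming surrounding spaces.
std::vector<std::string> splitCsvLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) {
        const auto first = field.find_first_not_of(' ');
        const auto last  = field.find_last_not_of(' ');
        fields.push_back(first == std::string::npos
                             ? std::string()
                             : field.substr(first, last - first + 1));
    }
    return fields;
}

// Map column name -> index from the header row, so the client can
// look up cols["surname"] instead of hard-coding column 1.
std::map<std::string, std::size_t> mapHeader(const std::string& headerLine)
{
    std::map<std::string, std::size_t> cols;
    const auto names = splitCsvLine(headerLine);
    for (std::size_t i = 0; i < names.size(); ++i)
        cols[names[i]] = i;
    return cols;
}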
If not all clients can be updated easily, then you need a more lenient messaging system.
Google Protocol Buffers was designed for this kind of backward/forward compatibility issue, and combines that with excellent (fast & compact) binary encoding to reduce the message sizes.
If you go with this, then the idea is simple: each message represents a line. If you want to stream them, you need a simple "message size | message blob" structure.
Personally, I have always considered XML bloated by design. If you ever go with human-readable formats, then at least select JSON; you'll cut the tag overhead by half.
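A sketch of that framing, independent of the serialization library (assuming a 4-byte native-endian size prefix; the blob is whatever your message's serialize call produced):

#include <cstdint>
#include <istream>
#include <ostream>
#include <string>

// Write one "message size | message blob" frame.
void writeFrame(std::ostream& out, const std::string& blob)
{
    const uint32_t size = static_cast<uint32_t>(blob.size());
    out.write(reinterpret_cast<const char*>(&size), sizeof size);
    out.write(blob.data(), size);
}

// Read the next frame; returns false on a clean end of stream.
bool readFrame(std::istream& in, std::string& blob)
{
    uint32_t size = 0;
    if (!in.read(reinterpret_cast<char*>(&size), sizeof size))
        return false;
    blob.resize(size);
    return static_cast<bool>(in.read(&blob[0], size));
}

With protobuf, blob would come from SerializeToString on the way out and feed ParseFromString on the way in.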
I would suggest you go for XML.
There are plenty of libraries available for parsing.
Moreover, if the data format changes later, the parsing logic in the XML case won't change; only the business logic may need to change.
But in the case of CSV, the parsing logic itself might need a change.
The CSV format will be smaller, since you only have to declare the headers on the first row and then send the rows of data below, with only the commas in between adding extra characters to the stream size.