Storing PCRE compiled regexes in C/C++


Is there an efficient way to store the compiled regexes (compiled via regcomp(), PCRE) in a binary file, so that later I can just read from the file and call regexec()?
Or is it just a matter of dumping the compiled regex_t structs to the file and reading them back when needed?

Unless the regex is extremely complex, I see little advantage in serializing the compiled form; the compilation time shouldn't be that big. Or are you on a very constrained embedded system?
In any case, dumping the structure might indeed be a solution; at least you can try...
[EDIT] I just looked at the source I have (6.7) and, as I feared, it is not so simple: the structure starts with a void *. You can't serialize pointers as raw bytes; the data they point to has to be rebuilt when the file is loaded.
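Since the compiled structure holds pointers, one workable fallback is to store the pattern text and recompile it on load. The following is only a minimal sketch of that idea using the POSIX regcomp()/regexec() calls mentioned in the question. The names save_patterns, load_and_compile, matches and RegexPtr are hypothetical helpers, and the one-pattern-per-line file layout is an assumption (it breaks if a pattern contains a newline).

```cpp
#include <regex.h>    // POSIX regcomp / regexec / regfree

#include <istream>
#include <memory>
#include <ostream>
#include <sstream>    // std::stringstream is handy for in-memory round trips
#include <stdexcept>
#include <string>
#include <vector>

// Frees the internal buffers of a compiled regex, then the struct itself.
struct RegexDeleter {
    void operator()(regex_t* re) const {
        regfree(re);
        delete re;
    }
};
using RegexPtr = std::unique_ptr<regex_t, RegexDeleter>;

// Store the pattern sources, one per line (assumption: no embedded newlines).
void save_patterns(std::ostream& out, const std::vector<std::string>& patterns) {
    for (const std::string& p : patterns) out << p << '\n';
}

// Read the pattern sources back and recompile each one. The regex_t is
// rebuilt in the loading process, so no pointer ever goes through the file.
std::vector<RegexPtr> load_and_compile(std::istream& in) {
    std::vector<RegexPtr> compiled;
    std::string pattern;
    while (std::getline(in, pattern)) {
        std::unique_ptr<regex_t> raw(new regex_t);
        if (regcomp(raw.get(), pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0) {
            // After a failed regcomp the struct must not be passed to regfree.
            throw std::runtime_error("cannot compile pattern: " + pattern);
        }
        compiled.push_back(RegexPtr(raw.release()));
    }
    return compiled;
}

// True if the text matches the compiled pattern.
bool matches(const regex_t& re, const std::string& text) {
    return regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
}
```

The same functions work with a std::ofstream and std::ifstream in place of the in-memory stream. Whether the recompilation cost matters is something to measure for the actual pattern set.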

Related

Is it a good idea to save/load an array of struct?

I was wondering if it was a good idea to load/save an array of a certain type of structure using fstream. Note, I am talking about loading/saving to a binary file. Should I be loading/saving independent variables such as int, float, boolean rather than a struct? The reason I ask is that I've heard that a structure might have some type of padding which might offset the save/load.

A structure may contain padding, which will be written to the file. This is no big deal if the file is going to be read back on the same platform, using code emitted by the same compiler that did the write. However, this is difficult to guarantee, and if you cannot guarantee it, you should normally write the data in some textual format, such as XML or JSON.
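If a binary file is still wanted, padding can be sidestepped by writing each member separately in a fixed byte order instead of dumping the whole struct. This is a minimal sketch only: the Sample struct and the helper names are hypothetical, the file layout (1 byte, then two little-endian 32-bit words) is an arbitrary choice, and it assumes float is a 32-bit IEEE-754 value on both the writing and the reading side.

```cpp
#include <cstdint>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical record; in memory the compiler may insert padding after `flag`.
struct Sample {
    std::uint8_t  flag;
    std::uint32_t id;
    float         value;
};

// Write a 32-bit value as 4 bytes, least significant byte first.
void write_u32_le(std::ostream& out, std::uint32_t v) {
    char b[4];
    for (int i = 0; i < 4; ++i) b[i] = static_cast<char>((v >> (8 * i)) & 0xFF);
    out.write(b, 4);
}

// Read 4 little-endian bytes back into a 32-bit value.
bool read_u32_le(std::istream& in, std::uint32_t& v) {
    unsigned char b[4];
    if (!in.read(reinterpret_cast<char*>(b), 4)) return false;
    v = 0;
    for (int i = 0; i < 4; ++i) v |= static_cast<std::uint32_t>(b[i]) << (8 * i);
    return true;
}

// Always emits exactly 9 bytes, whatever sizeof(Sample) is.
void write_sample(std::ostream& out, const Sample& s) {
    std::uint32_t bits;
    static_assert(sizeof(bits) == sizeof(s.value), "assumes a 32-bit float");
    std::memcpy(&bits, &s.value, sizeof bits);  // copy the float's bit pattern
    out.put(static_cast<char>(s.flag));
    write_u32_le(out, s.id);
    write_u32_le(out, bits);
}

bool read_sample(std::istream& in, Sample& s) {
    char flag;
    std::uint32_t bits;
    if (!in.get(flag) || !read_u32_le(in, s.id) || !read_u32_le(in, bits)) return false;
    s.flag = static_cast<std::uint8_t>(flag);
    std::memcpy(&s.value, &bits, sizeof bits);
    return true;
}
```

An array is then written by calling write_sample once per element, ideally preceded by an element count written the same way.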

Without serialization, your binary data will not be portable across different platforms (and compilers). So if you need portability, you need to serialize the data before storing it in a file as binary.
Have a look at these:
Boost Serialization Tutorial
Boost Serializable Concept

It's not deprecated (it's not part of any formal spec, so where would it be deprecated?), but it's highly non-portable and probably the worst way to go about serialising stuff. Use Boost.Serialization, or a similar library.

As you pointed out in your answer, this will happen when writing structs this way. If you want your files to be portable across platforms, e.g. a file written on Linux i686 and opened on Solaris on SPARC, then even writing individual floats won't work.
Try writing your data as text or XML, and then zip/tar the files to make one document out of them.

As Neil said, prefer textual representation of data. The XML format may be overkill. Simpler versions are Comma Separated Value (CSV) and one value per text line.
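As a rough illustration of the CSV suggestion, the sketch below writes one record per line and reads it back. The Point struct and the function names are hypothetical, and the format is deliberately naive: numeric fields only, no quoting, default "C" locale.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type used only for this example.
struct Point {
    int    id;
    double x;
    double y;
};

// One record per line, fields separated by commas.
void save_csv(std::ostream& out, const std::vector<Point>& pts) {
    out.precision(17);  // enough significant digits to round-trip an IEEE-754 double
    for (const Point& p : pts) out << p.id << ',' << p.x << ',' << p.y << '\n';
}

// Parse the lines back; malformed lines are skipped.
std::vector<Point> load_csv(std::istream& in) {
    std::vector<Point> pts;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Point p;
        char c1 = 0, c2 = 0;
        if (fields >> p.id >> c1 >> p.x >> c2 >> p.y && c1 == ',' && c2 == ',') {
            pts.push_back(p);
        }
    }
    return pts;
}
```

Because every value goes through text, the file does not depend on struct padding, integer width or byte order of the machine that wrote it.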

How to save/serialize compiled regular expression (std::regex) to a file?

I'm using <regex> from Visual Studio 2010.
I understand that when I create a regex object, it is compiled. There is no compile method like in other languages and libraries, but I think that's how it works, am I right?
I need to store a large number of these compiled regexes in a file, so that I could just read a chunk of memory and get my compiled regex.
I can't figure out how to do this. I found that it is possible in PCRE, but it's a Linux library. There is a Windows version, but it's 3 years old and I would like to use a more high-level approach (there isn't a C++ wrapper in the Windows version).
So is it possible to save a std::regex or boost::regex (it's the same, right?) as a chunk of memory and then simply reuse it later?
Or is there other simple library for Windows that allows to do this?
EDIT:
Thanks for great answers. I'll simply check if it would be sufficient to simply store a regex as a string and then if it would still be slow I'll test and compare it with this old PCRE library.

You can use the regex strings themselves as the 'serialized' regex - just save those to a file, then when you want to reconstitute the regex objects, just pass the saved strings to the regex constructor.
The only drawbacks I can think of:
it might take some more time to 'reconstitute' the regex database, but I really don't know how much (I suspect that the time would be dominated by I/O anyway, so the difference may not be significant; I don't know how much overhead regex compilation has in the Boost implementation)
if you want the stored regexes obfuscated, you'll have to do that yourself instead of relying on the compiled-binary state to be unreadable
The advantages to this are:
it's 100% supported, so it's not fragile/brittle
it's portable across compiler versions and platforms (i.e., not fragile/brittle)
Is the time to compile the regex database (excluding I/O) really significant enough to warrant trying to save the compiled state?
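One way to answer that question for a concrete pattern set is to rebuild the regex objects from their saved strings and time only the construction. This is a sketch under assumptions: LoadResult and load_regexes are hypothetical names, patterns are stored one per line, and the default ECMAScript grammar of std::regex is used.

```cpp
#include <chrono>
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Reconstituted regexes plus the time spent purely on compiling them.
struct LoadResult {
    std::vector<std::regex> regexes;
    std::chrono::microseconds compile_time{0};
};

// Reads one pattern per line and passes each to the std::regex constructor.
// Only the constructor calls are timed, so stream I/O is excluded.
LoadResult load_regexes(std::istream& in) {
    LoadResult result;
    std::string pattern;
    while (std::getline(in, pattern)) {
        if (pattern.empty()) continue;  // ignore blank lines
        auto start = std::chrono::steady_clock::now();
        result.regexes.emplace_back(pattern);  // throws std::regex_error on bad syntax
        result.compile_time += std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start);
    }
    return result;
}
```

Comparing compile_time with the total load time on the real data shows whether saving a compiled form would buy anything at all.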

I don't think it can be done without modifying the boost library to support it.
I don't know specifically how the boost regex library is implemented, but most regex libraries compile things to a binary blob that's then interpreted later as a series of instructions for a sort of limited virtual machine.
If boost's regex library is implemented in this way, serializing it would be relatively easy. Just get at the binary blob somehow and dump it to disk. The existence of the POSIX regex API for the boost library tells me that this is probably how it's implemented.
OTOH, another way to implement it (and a not so common way) is by generating something like an abstract syntax tree for the regex. This means that the individual pieces of the regex would be represented by their own objects and those objects would be linked together into some larger structure that represented the whole regex.
If boost does it this way then serialization will be very complex.
This is not possible with C++, but what I really wish happened is that boost could compile constant string regular expressions at compile time with template meta-programming. The reason this is not possible is that it isn't possible to iterate over the contents of a string (even a constant string) with a template.

I'm not sure, but did you take a look at boost::serialization, which can serialize a C++ object?

How to extract ALL typedefs and structs and unions from c++ source

I have inherited a Visual Studio project that contains hundreds of files.
I would like to extract all the typedefs, structs and unions from each .h/.cpp file and put the results in a file.
Each typedef/struct/union should be on one line in the results file. This would make sorting much easier.
typedef int myType;
struct myFirstStruct { char a; int b;...};
union Part_Number_Serial_Number_Part_2_Response_Message_Type {struct{Message_Response_Head_Type Head; Part_Num_Serial_Num_Part_2_Report_Array Part_2_Report; Message_Tail_Type Tail;} Data; BYTE byData[140];}myUnion;
struct { bool c; int d;...}mySecondStruct;
My problem is, I do not know what to look for (grammar of typedef/structs/unions) using a regular expression.
I cannot believe that nobody has done this before (I googled and have not found anything on this).
Does anyone know the regular expressions for these? (Note some are commented out using // others /* */)
Or a tool to accomplish this.
Edit:
I am toying with the idea of autogenerating source code and/or dialogs for modifying messages that use the underlying typedef/struct/union. I was going to use the output to generate an XML file that could be used for this reason.
The source for these are in C/C++ and used in almost all my projects. These projects are usually NOT in C/C++. By using the XML version I would only need to update/add the typedef/struct/union only in one place and all the projects would be able to autogen the source and/or dialogs.

I can't imagine a purpose for this except for some sort of documentation effort. If that is what you're looking for I would suggest doxygen.
To answer your question, I seriously doubt any amount of regular expressions will be sufficient. What you need to do is actually parse the code. I have heard of a library out there for building compilers and C++ tools that would provide the parsing aspect but I'm sorry to say I have forgotten the name. I know it's out there though so I'd start searching for that.

You will NOT be able to accomplish this with a regular expression. The only way to actually do this will be to get hold of a lexer and parser for the C++ grammar and write the code yourself to dump the interesting bits to a file or database upon encountering one of the structures you're interested in. And unfortunately, C++ parsing is rather hard.

Parsing C++ is ... difficult. Instead of killing yourself trying to parse it, there are a few options.
gcc-xml
cscope
ctags
global
Each of these will parse C++ code and grab the info you're after. If you want to dump it to a file in the format you requested, you'd be a lot better off parsing their data files than parsing raw C++.
I recommend you skip all of this and just use doxygen. It won't be in your preferred format but you'll be better off getting used to doxygen's layout.

How to decompress a file in fortran77?

I have a compressed file.
Let's ignore the tar command because I'm not sure it is compressed with that.
All I know is that it is compressed in fortran77 and that is what I should use to decompress it.
How can I do it?
Is decompression a one way road or do I need a certain header file that will lead (direct) the decompression?
It's not a .Z file. It ends at something else.
What do I need to decompress it? I know the format of the final decompressed archive.
Is it possible that the file is compressed thru a simple way but it appears with a different extension?

First, let's get the "fortran" part out of the equation. There is no standard (and by that, I mean the Fortran standard) way to either compress or decompress files, since Fortran doesn't have a compression utility as part of the language. Maybe someone has written their own, but that's entirely up to them.
So, you're stuck with publicly available compression utilities, and such. On systems which have those available, and on compilers which support it (it varies), you can use the SYSTEM function, which executes the system command by passing a command string to the operating system's command interpreter (I know it exists in cvf, probably ivf ... you should probably look it up in help of your compiler).
Since you asked a similar question already, I assume you're still having problems with this. You mentioned that "it was compressed with fortran77". What do you mean by that? That someone built a compression utility in f77 and used it? So that would make it a custom solution?
If it's some kind of a custom solution, then it can practically be anything, since a lot of algorithms can serve as "compression algorithms" (writing file as binary compared to plain text will save a few bytes; voila, "compression")
Or have I misunderstood something ? Please, elaborate this a little.

My guess is that you have a binary file, which is output by a Fortran program. These can look like compressed files because they are not readable in a text editor.
Fortran allows you to write the in-memory data out to a file without formatting it, so that you can reload it later without having to parse it. The problem, however, is that you need that original source code in order to see what types of variables are written in the file.
If you have no access to the fortran source code, but a lot of time to spare, you could write some simple fortran program and guess what types of variables are being used. I wouldn't advise it, though, as Fortran is not very forgiving.
If you want some simple source code to try, look at this page which details binary read and write in Fortran, and includes a code sample. Just start by replacing reclength=reclength*4 with reclength=reclength*2 for a double precision real.
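For probing such a file from outside Fortran, the sketch below reads one record of a sequential unformatted file. It rests on an assumption that must be checked against the compiler that produced the file: each record is framed by a 4-byte length marker before and after the payload, in the byte order of the reading machine. Several compilers use that layout, but it is not defined by the Fortran standard. The function names and the max_bytes limit are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <vector>

// Reads one record framed as [4-byte length][payload][4-byte length].
// Returns false at end of file, on a truncated record, on mismatched
// markers, or when the length exceeds max_bytes (a guard against garbage).
bool read_fortran_record(std::istream& in, std::vector<char>& payload,
                         std::uint32_t max_bytes = 1u << 30) {
    std::uint32_t head = 0, tail = 0;
    if (!in.read(reinterpret_cast<char*>(&head), sizeof head)) return false;
    if (head > max_bytes) return false;
    payload.resize(head);
    if (head > 0 && !in.read(payload.data(), head)) return false;
    if (!in.read(reinterpret_cast<char*>(&tail), sizeof tail)) return false;
    return head == tail;
}

// Reinterprets the payload as an array of T. Which T is right is exactly
// the guess described above; leftover bytes that do not fill a T are dropped.
template <typename T>
std::vector<T> as_values(const std::vector<char>& payload) {
    std::vector<T> values(payload.size() / sizeof(T));
    if (!values.empty()) {
        std::memcpy(values.data(), payload.data(), values.size() * sizeof(T));
    }
    return values;
}
```

Matching head and tail markers across several records are a good hint that the framing assumption holds; trying as_values<float>, as_values<double> and as_values<std::int32_t> on the payload and looking for plausible numbers is the guessing step.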

There is no standard decompression method, there are tons. You will need to know the method used to compress it in order to decompress it.

You said that the file extension was not .Z, but something else. What was that something else?
If it's .gz (which is very common on Unix systems), "gunzip" is the proper command. If it's .tgz, you can gunzip and untar it. (Or you can read the man page for tar(1), since it probably has the ability to gunzip and extract together.)
If it's on Windows, see if Windows can read it directly, as the file system itself appears to support the ZIP format.
If something else, please just list the file name (or, if there are security implications, the file name beginning with the first period), and we might be able to figure it out.

You can check to see if it's a known compressed file type with the file command. Assuming file returns something like "binary file" then you're almost certainly looking at plain binary data.
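Where the file command is not available, a few of the same checks can be done by hand, since common compressed formats start with fixed signature bytes. This is a small sketch covering only a handful of well-known signatures; read_head and guess_format are hypothetical names, and "unknown" only means none of these few signatures matched.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Reads up to n leading bytes of a file (empty string if it cannot be opened).
std::string read_head(const std::string& path, std::size_t n = 512) {
    std::ifstream in(path, std::ios::binary);
    std::string head(n, '\0');
    in.read(&head[0], static_cast<std::streamsize>(n));
    head.resize(static_cast<std::size_t>(in.gcount()));
    return head;
}

// Compares the leading bytes with a few well-known signatures.
std::string guess_format(const std::string& head) {
    auto starts_with = [&head](const char* magic, std::size_t n) {
        return head.size() >= n && head.compare(0, n, magic, n) == 0;
    };
    if (starts_with("\x1f\x8b", 2)) return "gzip";
    if (starts_with("\x1f\x9d", 2)) return "compress (.Z)";
    if (starts_with("BZh", 3)) return "bzip2";
    if (starts_with("PK\x03\x04", 4)) return "zip";
    // POSIX tar archives carry the string "ustar" at byte offset 257.
    if (head.size() >= 262 && head.compare(257, 5, "ustar") == 0) return "tar";
    return "unknown";
}
```

Calling guess_format(read_head(path)) on the mystery file gives a first hint regardless of its extension.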

create and stream large XML document in C++

I have some code that creates a fairly large XML DOM and writes it off to a file (up to 50-100MB). It basically creates the DOM, calls a toString on it and writes the result out with ofstream. Is there a way to get streaming output of the generated DOM, so that it doesn't create the whole structure in memory all at once and then copy it, etc.? I will not modify any node after I create it, so it can be written out and its memory freed right away. I could write my own XML class that does the XML construction, but I don't think that's a good idea, since I'll probably miss something when it comes down to escaping etc.

Ok, turns out libxml2 has a streaming API:
http://xmlsoft.org/examples/testWriter.c
It's a little old style (very C-ish) but you can write your wrapper around it.

I would recommend GenX as a streaming XML writer, I use this in Programmer's Notepad and it works a treat, you can see examples of use in the source code. Extremely fast, and it produces good UTF-8 XML. Memory usage while you use it should remain roughly constant.

Search for the keyword "C++ XML writer". XML writers generate XML to a file without building the entire DOM in memory, so they don't need very much memory at all. You didn't mention the platform, but Microsoft XmlLite has IXmlWriter.

There really isn't much to generating valid XML; the escaping that you worry about is trivial.
There's a library for streaming writing of XML here: https://code.google.com/p/xml-writer-cpp/ — if nothing else, it's useful for education purposes.
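To make the streaming idea concrete, here is a minimal sketch of a writer that emits each element as soon as it is started and keeps only the stack of open element names in memory. XmlStreamWriter is a hypothetical class, not the API of any library named above. It escapes the five predefined XML entities, but it does not validate element names, filter characters that XML 1.0 forbids, or handle encodings, which is part of why an existing library is usually the safer choice.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

class XmlStreamWriter {
public:
    using Attributes = std::vector<std::pair<std::string, std::string>>;

    explicit XmlStreamWriter(std::ostream& out) : out_(out) {}

    // Emits the start tag immediately; only the name is remembered.
    void start(const std::string& name, const Attributes& attrs = {}) {
        out_ << '<' << name;
        for (const auto& a : attrs) {
            out_ << ' ' << a.first << "=\"" << escape(a.second) << '"';
        }
        out_ << '>';
        open_.push_back(name);
    }

    // Emits escaped character data inside the current element.
    void text(const std::string& s) { out_ << escape(s); }

    // Closes the most recently opened element; does nothing if none is open.
    void end() {
        if (open_.empty()) return;
        out_ << "</" << open_.back() << '>';
        open_.pop_back();
    }

    // Replaces the five characters that have predefined XML entities.
    static std::string escape(const std::string& s) {
        std::string out;
        out.reserve(s.size());
        for (char c : s) {
            switch (c) {
                case '&':  out += "&amp;";  break;
                case '<':  out += "&lt;";   break;
                case '>':  out += "&gt;";   break;
                case '"':  out += "&quot;"; break;
                case '\'': out += "&apos;"; break;
                default:   out += c;
            }
        }
        return out;
    }

private:
    std::ostream& out_;
    std::vector<std::string> open_;  // names of elements not yet closed
};
```

Constructed over a std::ofstream, the writer keeps memory use proportional to the nesting depth rather than to the document size.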
