I wanted to do some regular expressions in C++ so I looked on the interwebz (yes, I am a beginner/intermediate with C++) and found this SO answer.
I really don't know which to choose: boost::regex or boost::xpressive. What are the pros and cons?
I also read that boost::xpressive, as opposed to boost::regex, is a header-only library. Is it hard to statically compile boost::regex on Linux and Windows (I almost always write cross-platform applications)?
I'm also interested in comparisons of compile time. I have a current implementation using boost::xpressive and I'm not too happy with the compile times (but I have no comparison with boost::regex).
Of course I'm open to other suggestions for regex implementations too. The requirements: free (as in beer) and compatible with http://nclabs.org/license.php.
One fairly important difference is that Boost.Regex can be linked against ICU for Unicode support (character classes, etc.); see the Boost Regex ICU Support documentation.
As far as I can tell, Boost Xpressive doesn't have this kind of support built-in.
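For illustration, here's a minimal sketch of the ICU-enabled interface (my example, not from the answer); it assumes Boost.Regex was built with ICU support and uses the u32regex API from boost/regex/icu.hpp:

#include <boost/regex/icu.hpp>
#include <string>

int main() {
    // \p{L} is a Unicode property class (any letter); it needs the
    // ICU-aware engine rather than plain boost::regex.
    boost::u32regex re = boost::make_u32regex("\\p{L}+");
    std::string s = "h\xC3\xA9llo";  // "héllo" as UTF-8 bytes
    return boost::u32regex_match(s, re) ? 0 : 1;
}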
Well, if you need to create a regular expression at runtime (e.g. letting the user type in a regular expression to search for), you can't use xpressive's static regexes, as they are fixed at compile time.
On the other hand, since it is a compile-time construct, it should benefit more from your optimizer than regex does.
I do enough stuff with Boost.MPL, StateChart, and Spirit that 220 KB of compiler warnings and errors don't really bother me much. If that sounds like hell to you, stick with Boost.Regex.
If you do use xpressive, I highly recommend turning on -Wfatal-errors as this will stop compilation (and further errors) after the first 'error:' line.
For compilation time, it's no contest. Boost.Regex will be faster*. The fact that xpressive uses MPL will cause compile times to be dramatically increased.
*This assumes you only build the dll/so once
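To make the static-versus-dynamic distinction from the answers above concrete, here is a minimal sketch (my example; the pattern is made up) of the same pattern in both libraries:

#include <boost/regex.hpp>
#include <boost/xpressive/xpressive.hpp>
#include <string>

int main() {
    std::string input = "item42";

    // Boost.Regex: the pattern is an ordinary string, parsed at runtime;
    // it could just as well have come from user input.
    boost::regex dyn("item(\\d+)");
    boost::smatch dm;
    bool a = boost::regex_match(input, dm, dyn);

    // Boost.Xpressive, static form: the pattern is an expression template,
    // built and checked at compile time.
    namespace xp = boost::xpressive;
    xp::sregex stat = xp::as_xpr("item") >> (xp::s1 = +xp::_d);
    xp::smatch xm;
    bool b = xp::regex_match(input, xm, stat);

    return (a && b) ? 0 : 1;
}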
When using the Boost libraries I tend to lean toward the header-only ones, due to cross-platform compatibility issues. The downside is that when your compiler reports an error related to your use of the library, the header-only output tends toward the arcane.
Assuming you're using a reasonably recent compiler, there's a pretty decent chance that it includes a regex package already. Try just doing #include <regex> and see if the compiler finds it.
The only trick is that it could be in either (or both) of two different namespaces. Regexes were included in TR1 of the C++ standard, and are also in (the final drafts of) C++11. The TR1 version is in a namespace named tr1, while the standard version is in std, just like the rest of the library.
FWIW, this is essentially the same as Boost regex, not Boost Xpressive.
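A minimal probe along those lines (my example); on a TR1-only implementation you'd swap std:: for std::tr1:: as noted in the comment:

#include <regex>
#include <string>

int main() {
    // C++11 and later: the regex machinery lives in namespace std.
    std::regex re("ab+c");
    bool ok = std::regex_match(std::string("abbc"), re);

    // On a TR1-era implementation the equivalent would be:
    //     std::tr1::regex re("ab+c");
    return ok ? 0 : 1;
}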
I would try to supplement the other answers by going deeper into the topic of compile-time regular expressions (CTR) versus run-time (dynamic) regular expressions (RTR) in a more theoretical way (IMHO this topic is implied indirectly by the OP's question). Run-time regexes are better known and more popular (they are what most languages' core libraries implement), I suppose for historical reasons. They are the right choice when the regular expression is only determined at run time, unlike CTR. Both kinds work on a finite-state-machine basis.
RTR are "compiled" and interpreted by some kind of universal finite state machine(universal means its kind of interpreter which scheme is given at run-time, "compiled" in some internal data structure - when you pass regex string, then interpreted at run-time).
But CTR is "compiled" at compile-time and are specific for particular regex, so you can't use them, when regex is given at run-time(applications like text editors, file/internet search engines).
But they are a priori more efficient (theoretically, at least), since a finite state machine customized at compile time should beat an interpreter driven by a preset table describing that machine (similar cases are reflective field access versus compile-time access, or a function specialized for some fixed parameter, as pointed out elsewhere). Another advantage is compile-time syntax checking. CTR can be implemented through meta-programming and/or code generation.
As for specific implementations: there are many RTR, but not so many CTR. For C++ they are the above-mentioned Boost implementations and the C++11 standard library. You may need CTR for optimizing regex performance, generated code size, or memory usage, which is mostly relevant for embedded systems or specific high-performance applications.
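To make the RTR/CTR distinction concrete, here is a toy sketch (mine, not from the post) of both styles for the fixed pattern a+b: a generic interpreter walking a transition table that could have been built at run time, versus the straight-line code a compile-time approach can in principle produce:

// RTR style: a generic interpreter driven by a transition table.
// Toy DFA for "a+b": state 0 = start, 1 = seen "a+", 2 = accept.
int run_table(const char* s) {
    static const int table[3][2] = {  // columns: input 'a', input 'b'
        { 1, -1},  // state 0: 'a' -> 1, 'b' -> reject
        { 1,  2},  // state 1: 'a' -> 1, 'b' -> accept
        {-1, -1},  // state 2: accept; any further input rejects
    };
    int state = 0;
    for (; *s; ++s) {
        int col = (*s == 'a') ? 0 : (*s == 'b') ? 1 : -1;
        if (col < 0 || (state = table[state][col]) < 0) return 0;
    }
    return state == 2;
}

// CTR style: the same machine specialized into straight-line code,
// the kind of output a compile-time regex lets the optimizer reach.
int run_specialized(const char* s) {
    if (*s != 'a') return 0;
    while (*s == 'a') ++s;
    return s[0] == 'b' && s[1] == '\0';
}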
SO question about CTR
Finding CTR implementations is harder. Examples I found are the Re2C code generator project, a Java CTR implementation, and a C# implementation featuring run-time compilation of regexes (into IL code, not an internal data structure) [there is an SO question about it].
P.S. Sorry, I couldn't post some relevant links due to reputation.
I am looking at various STL headers provided with compilers, and I can't imagine the developers actually writing all this code by hand.
All the macros and the weird names of variables and classes: they would have to remember all of them! It seems error-prone to me.
Are parts of the headers the result of some text preprocessing or generation?
I've maintained Visual Studio's implementation of the C++ Standard Library for 7 years (VC's STL was written by and licensed from P.J. Plauger of Dinkumware back in the mid-90s, and I work with PJP to pick up new features and maintenance bugfixes), and I can tell you that I do all of my editing "by hand" in a plain text editor. None of the STL's headers or sources are automatically generated (although Dinkumware's master sources, which I have never seen, go through automated filtering in order to produce customized drops for Microsoft), and the stuff that's checked into source control is shipped directly to users without any further modification (now, that is; previously we ran them through a filtering step that caused lots of headaches).

I am notorious for not using IDEs/autocomplete, although I do use Source Insight to browse the codebase (especially the underlying CRT, whose guts I am less familiar with), and I extensively rely on grep. (And of course I use diff tools; my favorite is an internal tool named "odd".)

I do engage in very, very careful cut-and-paste editing, but for the opposite reason as novices: I do this when I understand the structure of the code completely, and I wish to exactly replicate parts of it without accidentally leaving things out. (For example, different containers need very similar machinery to deal with allocators; it should probably be centralized, but in the meantime when I need to fix basic_string I'll verify that vector is correct and then copy its machinery.)

I've generated code perhaps twice: once when stamping out the C++14 transparent operator functors that I designed (plus<>, multiplies<>, greater<>, etc. are highly repetitive), and again when implementing/proposing variable templates for type traits (recently voted into the Library Fundamentals Technical Specification, probably destined for C++17). IIRC, I wrote an actual program for the operator functors, while I used sed for the variable templates. The plain text editor that I use (Metapad) has search-and-replace capabilities that are quite useful, although weaker than outright regexes; I need stronger tools if I want to replicate chunks of text (e.g. is_same_v<T, U> = is_same<T, U>::value).
How do STL maintainers remember all this stuff? It's a full time job. And of course, we're constantly consulting the Standard/Working Paper for the required interfaces and behavior of code. (I recently discovered that I can, with great difficulty, enumerate all 50 US states from memory, but I would surely be unable to enumerate all STL algorithms from memory. However, I have memorized the longest name, as a useless bit of trivia. :->)
The weird look is deliberate, in a sense. The standard library and the code in it need to avoid conflicts with names used in user programs, including macros, and there are almost no restrictions as to what can appear in a user program.
They are most probably hand-written, and as others have mentioned, if you spend some time looking at them you will figure out what the coding conventions are, how variables are named, and so on. One of the few restrictions is that user code cannot use identifiers starting with _ followed by a capital letter, or with __ (two consecutive underscores), so you will find many names in the standard headers that look like _M_xxx or __yyy. It might surprise you at first, but after some time you just ignore the prefix...
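A minimal sketch (my example) of why those reserved prefixes matter: user code may legally define macros over almost any ordinary identifier before including a standard header, so the header's internals must hide behind names that users may not touch:

// A legal, if rude, user macro; it would break any header that used
// the plain name "elems" internally.
#define elems 42

// Still fine: the implementation's internals use only reserved names
// such as _M_size or __elems, which user code must not define.
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};
    return static_cast<int>(v.size());
}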
Is it (Boost.Spirit) theoretically up to the task of parsing C++?
Can it be done practically, and would the resulting parser have sufficient performance and output (say, LLVM IR or GCC's GIMPLE) to be integrated into a competing compiler?
I'm sorry. I talked to its author, and he said he won't make it parse C++ fully, but admits that he accepts that it parses certain constructs as ambiguous.
So this is not an answer anymore!!
I recommend having a look at Scalpel. From its homepage:
Scalpel stands for source code analysis, libre and portable library. This is a C++ library which aims to perform full syntax and semantic analysis of any given C++ program.
And
What makes me think Scalpel could be accepted into Boost
Scalpel itself uses several Boost libraries: Spirit, Wave, shared_ptr (now in C++0x's standard library), Optional, Test, etc. Actually, it exclusively uses Boost libraries and the C++ standard library, which is required by Boost.
Besides, Boost already provides a Spirit-based C++ source code preprocessing library: Wave. Including a C++ source code analysis library seems to be a natural evolution.
No. C++ is too hard to parse for most automatic tools, and in practice is usually parsed by hand-written parsers.
[Edit 1-Mar-2015: Added 'most' and 'usually'.]
Among the hard problems are:
A * B; which could be either the definition of a variable B with type A*, or just the multiplication of two variables A and B (see the sketch after this list).
A < B > C > D: where does the template A<> end? The usual 'max-munch' rule for parsing expressions will not work here.
vector<shared_ptr<int>>, where the >> closes two templates, which is hard to handle when the lexer produces a single >> token (and a space between the two > characters is also allowed there). But in 1>>15 the >> is a shift operator, and no space inside it is allowed.
And I bet that this list is far from complete.
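Here is the promised sketch of the first ambiguity (my example): the very same token sequence is either a declaration or an expression depending on what A names, which is why the parser needs semantic knowledge:

struct A {};        // here A names a type...

int main() {
    A * B;          // ...so this declares B as "pointer to A"
    (void)B;

    // If A were a variable instead:
    //     int A = 2, B = 3;
    //     A * B;   // ...the same tokens would be a multiplication
    return 0;
}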
Addition: The grammar is available, but is ambiguous and thus not valid as input to tools like Spirit.
Update 1-Mar-2015: As Ira Baxter, a well-known expert in this field, points out in the comments, there are some parser generators that can produce a parser that yields the full parse forest. As far as I know, selecting the right parse still requires a semantic phase. I'm not aware of any non-commercial parser generators that can do this for C++'s grammar. For more information, see this answer.
For "any other language", I once tried creating a shell script parser with Spirit. It turned out to be theoretically possible (I believe it would work), but it was not compilable on a machine with 1 GB memory, so eventually I gave up.
I'm using <regex> from Visual Studio 2010.
I understand that when I create a regex object, it gets compiled. There is no compile method like in other languages and libraries, but I think that's how it works; am I right?
I need to store a large number of these compiled regexes in a file, so that I could just grab a chunk of memory and get my compiled regex back.
I can't figure out how to do this. I found that it is possible in PCRE, but that's a Linux library. There is a Windows version, but it's 3 years old and I would like to use a more high-level approach (there is no C++ wrapper in the Windows version).
So is it possible to save std::regex or boost::regex (it's the same, right?) as a chunk of memory and then simply reuse it later?
Or is there other simple library for Windows that allows to do this?
EDIT:
Thanks for the great answers. I'll simply check whether storing a regex as a string is sufficient, and if it's still slow I'll test and compare against that old PCRE library.
You can use the regex strings themselves as the 'serialized' regex: just save those to a file, then when you want to reconstitute the regex objects, pass the saved strings to the regex constructor (the sketch below gives the idea).
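A minimal sketch of that approach (my example; the one-pattern-per-line file format is an assumption and only works if patterns contain no newlines):

#include <fstream>
#include <regex>
#include <string>
#include <vector>

void save_patterns(const std::vector<std::string>& patterns,
                   const std::string& path) {
    std::ofstream out(path);
    for (const auto& p : patterns)
        out << p << '\n';            // store only the pattern text
}

std::vector<std::regex> load_patterns(const std::string& path) {
    std::vector<std::regex> regexes;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
        regexes.emplace_back(line);  // recompile each pattern on load
    return regexes;
}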
The only drawbacks I can think of:
it might take some more time to 'reconstitute' the regex database, but I really don't know how much (I suspect the time would be dominated by I/O anyway, and I don't know how much overhead regex compilation has in the boost library's implementation, so the difference may not be significant)
if you want the stored regexes obfuscated, you'll have to do that yourself instead of relying on the compiled-binary state to be unreadable
The advantages to this are:
it's 100% supported, so it's not fragile/brittle
it's portable across compiler versions and platforms (ie., not fragile/brittle)
Is the time to compile the regex database (excluding I/O) really significant enough to warrant trying to save the compiled state?
I don't think it can be done without modifying the boost library to support it.
I don't know specifically how the boost regex library is implemented, but most regex libraries compile things to a binary blob that's then interpreted later as a series of instructions for a sort of limited virtual machine.
If boost's regex library is implemented in this way, serializing it would be relatively easy. Just get at the binary blob somehow and dump it to disk. The existence of the POSIX regex API for the boost library tells me that this is probably how it's implemented.
OTOH, another way to implement it (a less common way) is by generating something like an abstract syntax tree for the regex. This means the individual pieces of the regex would be represented by their own objects, linked together into some larger structure that represents the whole regex.
If boost does it this way then serialization will be very complex.
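To see why the AST case is messier, here is a toy sketch (purely illustrative, not boost's actual representation) of the linked structure described above; serializing it means walking the tree and inventing an encoding for every node kind, rather than dumping one flat blob:

#include <memory>
#include <vector>

struct RegexNode {
    enum Kind { Literal, Concat, Alternate, Star } kind;
    char ch = 0;                                   // used when kind == Literal
    std::vector<std::unique_ptr<RegexNode>> kids;  // sub-expressions otherwise
};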
This is not possible with C++, but what I really wish happened is that boost could compile constant string regular expressions at compile time with template meta-programming. The reason this is not possible is that it isn't possible to iterate over the contents of a string (even a constant string) with a template.
I'm not sure, but did you take a look at boost::serialization, which can serialize a C++ object?
A friend and I were discussing imaginary and real languages, and a question that came up was: if one of us wanted to generate headers for another language (perhaps D, which already has a tool), what would be an easy and very good way to do this?
One of us suggested scanning C files and headers, ignoring function bodies and only counting the braces within to figure out when a function is finished. The counter to that was typedefs, defines (which can contain braces, though defines were considered a trivial problem), and templates + specialization.
Another solution was to read the binaries produced: not the actual exe, but the object files the linker uses. The counter to that was the format and complexity. None of us knew anything about any object format, so we couldn't estimate (we were thinking of gcc and VS C++).
What do you guys think? Which is easier? This should be backed up with reasonable logic and fact.
It would be useful if someone could link to a helpful project: one that parses C files/headers and outputs the result, or an example project that reads in ELF data and displays the info. I tried googling, but I didn't know what it would be called. I found libelf, but at this moment I couldn't get it to compile. I might be able to soon.
You can use the clang libraries to parse C/C++ source code and extract any information you want, in particular function prototypes.
Due to clang's library-based architecture, it is easy to reuse just the parts you need. In your case these are the frontend libraries (liblex, libparse, libsema). I think this is a more feasible approach than a hand-written scanner, considering the difficulties you mentioned (typedefs, defines, etc.).
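As a concrete starting point, here is a minimal sketch (my example) using libclang, clang's stable C API (rather than the C++ frontend libraries named above), to list the function declarations in a file; compile and link with -lclang:

#include <clang-c/Index.h>
#include <stdio.h>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, argv[1], nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    clang_visitChildren(
        clang_getTranslationUnitCursor(tu),
        [](CXCursor c, CXCursor, CXClientData) {
            // Print every function declaration we encounter.
            if (clang_getCursorKind(c) == CXCursor_FunctionDecl) {
                CXString name = clang_getCursorSpelling(c);
                printf("function: %s\n", clang_getCString(name));
                clang_disposeString(name);
            }
            return CXChildVisit_Recurse;
        },
        nullptr);
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(index);
    return 0;
}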
clang can also be used as a tool to parse the source code and output the AST in XML form. For example, if you have the file test.cpp:
void foo() {}

int main()
{
    foo();
}
and invoke clang++ -Xclang -ast-print-xml -fsyntax-only test.cpp, you'll get the file test.xml similar to the following (irrelevant parts skipped for brevity):
<?xml version="1.0"?>
<CLANG_XML>
<TranslationUnit>
<Function id="_1D" file="f2" line="1" col="6" context="_2"
name="foo" type="_12" function_type="_1E" num_args="0">
</Function>
<Function id="_1F" file="f2" line="3" col="5" context="_2"
name="main" type="_21" function_type="_22" num_args="0">
</Function>
</TranslationUnit>
<ReferenceSection>
<Types>
<FunctionType result_type="_12" id="_1E"/>
<FundamentalType kind="int" id="_21"/>
<FundamentalType kind="void" id="_12"/>
<FunctionType result_type="_21" id="_22"/>
<PointerType type="_12" id="_10"/>
</Types>
<Files>
<File id="f2" name="test.cpp"/>
</Files>
</ReferenceSection>
</CLANG_XML>
I don't think that extracting this information from binaries is possible, at least for symbols with C linkage, because they don't have name mangling and thus carry no type information in the symbol name.
ctags' output is quite easy to read/parse
if you want to simply generate a binding, try SWIG
What you're talking about is compiling: the act of transforming code in one formal language into another. There's good, solid science behind this which, if followed carefully, guarantees your program halts having produced correct analogous code.
Granted, you don't want to parse the whole of the C++ language (hooray for that!), so you just need to define the relevant grammar and treat everything else as acceptable noise or comments.
Don't use regular expressions. These won't do because C++ is not a regular language.
One way to do this is to define your interfaces in an abstract language (an IDL), and generate headers for all languages that you're interested in. You can limit the scope of your IDL to those features that are possible in each target language.
Windows takes this approach in its MIDL language, for example.
Doxygen can help with this. It's an advanced topic, and somewhat documented.
In a perfect world...
Each compiler of any programming language would emit such information as part of its output. It could be an ELF extension, or a new generally accepted file format that contains an ELF/COFF/whatever section.
This would spare thousands of man-hours across the globe spent generating "language bindings" for dozens of languages. Dynamic languages such as Common Lisp would not need FFI libraries: it would all happen under the hood, and a shared library in the new format could automatically be inspected on loading, with the functions in it made available on the fly, without any further ado.
And all that would actually have to be stored as extra information for each (exported) function is (a toy sketch follows the list):
Generic name
Return type
Argument list
Calling convention (there are not that many in practice)
A reference to the corresponding symbol table entry
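A toy sketch of what one such per-function record might look like (entirely hypothetical; no such section format exists, and every field here is made up):

// Imagined layout of one exported-function record in the proposed
// metadata section. Type codes index a shared type table that the
// same section would also have to carry.
struct ExportedFunctionInfo {
    const char*     generic_name;   // language-neutral name
    unsigned        return_type;    // index into the type table
    unsigned        num_args;
    const unsigned* arg_types;      // num_args indices into the type table
    unsigned char   calling_conv;   // small enum: cdecl, sysv, stdcall, ...
    unsigned        symtab_index;   // the ELF/COFF symbol entry referred to
};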
Why has it not been done during the past 30 years?
Because language designers still see the compiler output as their "private affair", whereas IMHO this should be part of the OS/runtime/loader.
Many workarounds exist, some listed here in other answers (IDL, binding generators, such as SWIG and a myriad of others, mostly ad hoc and of varying and typically insufficient quality).
The ELF guys are Unix guys, and as such C guys, who still live in a bubble, thinking C is some sort of "golden standard". But even they would have no problem using the information to generate their beloved header files.
Rust might not have invented crates in that perfect world, and you would not even care whether some shared library had been written in C or C++ or Rust or Zig or Haskell: anyone could just use it.
This domain is still dominated by "computer scientists", not engineers; while in theory there could be an infinite number of calling conventions, in practice those in active use are a countable few (LLVM supports many of those actually relevant).
Another reason it has not been done yet is the danger that some committee would create a monster, trying to make it "open, flexible, ...", and in the end no one would dare use it. DCE (the Distributed Computing Environment) gives that idea.
I recently came upon the need to have compile-time assertions in C++ to check that the sizes of two types were equal.
I found the following macro on the web (stated to have come from the Linux kernel):
#define X_ASSERT(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

Note that, like the kernel's BUILD_BUG_ON, it asserts that the condition is false: when the condition is true, the array size becomes -1 and compilation fails. I used it like so:

X_ASSERT(sizeof(Botan::byte) != sizeof(char));  // compile error if the sizes differ
This gets me curious - although this works, is there a cleaner way to do so? (obviously there's more than one way, as it is) Are there advantages or disadvantages to certain methods?
In C++0x, there is a new language feature, static_assert, which provides a standard way to generate compile-time assertions. For example,
static_assert(sizeof(Botan::byte) == 1, "byte type has wrong size");

(Note that sizeof(char) is 1 by definition, and static_assert fires when its condition is false, so the condition is the inverse of the one passed to X_ASSERT.)
Visual C++ 2010 supports static_assert, as do g++ 4.3 (and greater) and Intel C++ 11.0.
You might want to take a look at Boost StaticAssert. The internals aren't exactly clean (or weren't the last time I looked), but at least it's much more recognizable, so most people know what it means. It also takes some pains to produce more meaningful error messages, if memory serves.
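For reference, a minimal usage sketch (mine, not from the answer); BOOST_STATIC_ASSERT takes an integral constant expression and breaks the build when it is false:

#include <boost/static_assert.hpp>

// Usable at namespace scope; fails to compile if the condition is false.
BOOST_STATIC_ASSERT(sizeof(long) >= sizeof(int));

int main() { return 0; }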
Some other interesting options are here: http://www.jaggersoft.com/pubs/CVu11_3.html
Neat reading, as the author walks the C (not C++) spec looking for syntax that can be leveraged as compile-time assertions.
To do it right you need a C++0x-friendly compiler; see James McNellis's and Jerry Coffin's answers.
You can't do much with the 1998 or 2003 C++ standards. Take a look at these links for examples:
http://en.wikipedia.org/wiki/Assertion_(computing)#Static_assertions
http://ksvanhorn.com/Articles/ctassert.html
There's an excellent #error preprocessor directive (see here for a good essay about it), but I believe it needs to be within a #if rather than used free-standing as in your example.
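A small sketch of that pattern (my example); note that #if can only test preprocessor-visible integer constants, so sizeof is out of reach and this only works for limits like CHAR_BIT:

#include <climits>

#if CHAR_BIT != 8
#error "This code assumes 8-bit chars"
#endif

int main() { return 0; }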