How do I use pcre_study with pcrecpp?

I'm using Google's C++ interface to PCRE to match a single regex multiple times (possibly thousands of times). From reading the PCRE manual, it seems like a good idea to let PCRE 'study' (spend time optimizing) the regex; however, I can't find a way to do that with the C++ wrapper. pcrecpp.h doesn't mention studying at all.
Is using pcre_study() worthwhile, and if so, how can it be combined with pcrecpp and its RE class?

From a quick scan of the pcrecpp source code, it appears that "studying" is impossible with this API, because the compiled RE (pcre*) member of the RE wrapper object is private and there's no way to get it out or reset it.
If you want to know whether the studying optimization is worthwhile with your REs, the easiest option that I see is to copy pcrecpp.{cc,h} into your project and hack it in; the C++ API is just some thin wrapper code. You might even want to submit a patch upstream if, like me, you like to litter open source projects with your name and copyright ;)
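Whether studying pays off for a given set of REs is a measurement question, so a small timing harness helps when comparing the stock wrapper with a hacked one. A minimal sketch follows; `BenchResult`, `time_matches` and `bench_std_regex` are made-up names, and std::regex is used only as a stand-in matcher so the snippet builds without PCRE. With pcrecpp, the lambda would instead call `PartialMatch` on a `pcrecpp::RE` object constructed once outside the loop.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Result of one benchmark run: how many subjects matched and how long it took.
struct BenchResult {
    std::size_t matches;
    std::chrono::steady_clock::duration elapsed;
};

// Runs `matcher` over every subject `rounds` times and times the whole loop.
// `matcher` is any callable taking a const std::string& and returning bool.
template <typename Matcher>
BenchResult time_matches(Matcher matcher,
                         const std::vector<std::string>& subjects,
                         int rounds) {
    assert(rounds >= 0);
    std::size_t matches = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; ++r) {
        for (const std::string& s : subjects) {
            if (matcher(s)) ++matches;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return BenchResult{matches, stop - start};
}

// Stand-in matcher (assumption: std::regex instead of pcrecpp).
// The pattern is compiled once and reused for every match.
BenchResult bench_std_regex(const std::string& pattern,
                            const std::vector<std::string>& subjects,
                            int rounds) {
    const std::regex re(pattern);
    return time_matches(
        [&re](const std::string& s) { return std::regex_search(s, re); },
        subjects, rounds);
}
```

Run it on representative subjects with enough rounds to smooth out noise, once per variant of the wrapper, and compare the `elapsed` values.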

For the benefit of people hitting on this question from a web search, I will point out that the ability to "study" an RE has been removed from PCRE2:
Explicit "studying" of compiled patterns has been abolished - it now always
happens automatically. JIT compiling is done by calling a new function,
pcre2_jit_compile() after a successful return from pcre2_compile().
Reference: https://lists.exim.org/lurker/message/20150105.162835.0666407a.en.html

Related

What's the easiest way to parse C++ for code generation?

I would like to generate some wrapper code based on C++ types. I basically would like to parse some C++ headers, get the types, classes and their fields defined in the headers, and generate some code based on them.
What would be the easiest way to parse C++ and get type information? I thought about using the Clang C++ parser, but I couldn't make a working hello world in a couple of hours, so I gave up for the time being.
Could you advise any other way to parse C++, or if Clang is the easiest solution, could you point me to a simple getting started guide to be able to parse C++ types with it?
(basically any technology would be ok, C++, Java, C#, etc., this would be part of a command line tool)

Clang is definitely the easiest option. Consider using the cindex Python bindings; they're pretty straightforward. Alternatively, you could get an older version of Clang that still features an XML backend.
EDIT: the link above seems to be down, so here is a link to the google cache of it.
Another link suggested in the comments: http://www.altdevblogaday.com/2014/03/05/implementing-a-code-generator-with-libclang/

Unless your object is to verify correctness, or the code involves advanced template stuff, consider using the XML output of Doxygen or GCC-XML. Alternatively, consider Clang, even if that's what you found too complex. Note that for Clang it might be best to work in *nix-land.

If your generation tool is in Java, consider using the parser from the Eclipse CDT.
My set of dependencies is:
com.ibm.icu_4.4.2.v20110823.jar
org.eclipse.cdt.core_5.3.2.201202111925.jar
org.eclipse.equinox.common_3.6.0.v20110523.jar
(These are from an old Eclipse version, because I have a dependency on old Java class versions, but taking them from the latest CDT will do.)
Parsing involves:
// Load the header or source file from disk.
FileContent reader = FileContent.createForExternalFileLocation(fullPath);
// Preprocessor macros and include search paths for the scanner.
IScannerInfo info = new ScannerInfo(definedSymbols, includePaths);
return GPPLanguage.getDefault().getASTTranslationUnit(reader, info, FilesProvider.getInstance(), null, 0, log);
This returns an IASTTranslationUnit that can be accessed through a Visitor pattern (ASTVisitor).
I cannot comment on the accuracy of the parsing in corner scenarios, because so far I've been generating code based on simple C++ structure definitions.

How to save/serialize compiled regular expression (std::regex) to a file?

I'm using <regex> from Visual Studio 2010.
I understand that when I create a regex object, it gets compiled. There is no explicit compile method as in other languages and libraries, but I think that's how it works. Am I right?
I need to store a large number of these compiled regexes in a file, so that I can later just load a block of memory and get my compiled regex back.
I can't figure out how to do this. I found that it is possible in PCRE, but that's a Linux library. There is a Windows version, but it's 3 years old and I would like to use a more high-level approach (there isn't a C++ wrapper in the Windows version).
So is it possible to save a std::regex or boost::regex (they're the same, right?) as a chunk of memory and then simply reuse it later?
Or is there another simple library for Windows that allows this?
EDIT:
Thanks for the great answers. I'll first check whether it is sufficient to simply store each regex as a string; if that is still too slow, I'll test and compare it with this old PCRE library.

You can use the regex strings themselves as the 'serialized' regex - just save those to a file, then when you want to reconstitute the regex objects, just pass the saved strings to the regex constructor.
The only drawbacks I can think of:
- it might take some more time to 'reconstitute' the regex database, but I really don't know how much (I suspect that the time would be dominated by I/O anyway, so I'm not sure if the difference would be significant - I really don't know how much overhead there is in regex compilation by the boost library's implementation)
- if you want the stored regexes obfuscated, you'll have to do that yourself instead of relying on the compiled-binary state to be unreadable
The advantages to this are:
- it's 100% supported, so it's not fragile/brittle
- it's portable across compiler versions and platforms (i.e., not fragile/brittle)
Is the time to compile the regex database (excluding I/O) really significant enough to warrant trying to save the compiled state?
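The string-based approach can be sketched in a few lines. This assumes one pattern per line (so patterns must not contain newlines) and uses std::regex; the function names and the file layout are illustrative only.

```cpp
#include <cassert>
#include <fstream>
#include <regex>
#include <string>
#include <vector>

// Writes one pattern per line. Returns false if the file could not be written.
bool save_patterns(const std::string& path,
                   const std::vector<std::string>& patterns) {
    std::ofstream out(path);
    if (!out) return false;
    for (const std::string& p : patterns) {
        // Precondition of this simple format: no embedded newlines.
        assert(p.find('\n') == std::string::npos);
        out << p << '\n';
    }
    out.flush();
    return static_cast<bool>(out);
}

// Reads the patterns back and recompiles each one.
// The std::regex constructor throws std::regex_error on an invalid pattern.
std::vector<std::regex> load_patterns(const std::string& path) {
    std::vector<std::regex> compiled;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) compiled.emplace_back(line);
    }
    return compiled;
}
```

Timing `load_patterns` on the real pattern set is a direct way to answer whether compilation cost matters at all.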

I don't think it can be done without modifying the boost library to support it.
I don't know specifically how the boost regex library is implemented, but most regex libraries compile things to a binary blob that's then interpreted later as a series of instructions for a sort of limited virtual machine.
If boost's regex library is implemented in this way, serializing it would be relatively easy. Just get at the binary blob somehow and dump it to disk. The existence of the POSIX regex API for the boost library tells me that this is probably how it's implemented.
OTOH, another way to implement it (and a not so common way) is by generating something like an abstract syntax tree for the regex. This means that the individual pieces of the regex would be represented by their own objects and those objects would be linked together into some larger structure that represented the whole regex.
If boost does it this way then serialization will be very complex.
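To illustrate the first design (a flat instruction blob run by a small virtual machine), here is a toy sketch. It makes no claim about Boost's actual representation; the opcodes, the `Inst` layout and the hand-assembled program for `a+b` are all invented for the example. Because every instruction is a fixed-size record without pointers, the whole program can be copied out as bytes and loaded back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// One fixed-size instruction of a toy regex VM (hypothetical format).
enum class Op : std::uint8_t { Char, Any, Split, Jmp, Match };

struct Inst {
    Op op;
    char c;          // literal to match for Op::Char
    std::int32_t x;  // jump target, or first branch of Op::Split
    std::int32_t y;  // second branch of Op::Split
};
static_assert(std::is_trivially_copyable<Inst>::value,
              "instructions must be copyable as raw bytes");

// Backtracking interpreter: true if the program matches a prefix of s.
bool run(const std::vector<Inst>& prog, std::size_t pc, const char* s) {
    for (;;) {
        assert(pc < prog.size());
        const Inst& in = prog[pc];
        switch (in.op) {
        case Op::Char:
            if (*s != in.c) return false;
            ++pc; ++s; break;
        case Op::Any:
            if (*s == '\0') return false;
            ++pc; ++s; break;
        case Op::Jmp:
            pc = static_cast<std::size_t>(in.x); break;
        case Op::Split:  // try the first branch, fall back to the second
            if (run(prog, static_cast<std::size_t>(in.x), s)) return true;
            pc = static_cast<std::size_t>(in.y); break;
        case Op::Match:
            return true;
        }
    }
}

// Hand-assembled program for the pattern a+b.
std::vector<Inst> program_a_plus_b() {
    return {
        {Op::Char, 'a', 0, 0},
        {Op::Split, 0, 0, 2},  // loop back to 0, or continue at 2
        {Op::Char, 'b', 0, 0},
        {Op::Match, 0, 0, 0},
    };
}

// "Serialization" is a plain byte copy of the instruction array.
std::vector<unsigned char> to_bytes(const std::vector<Inst>& prog) {
    std::vector<unsigned char> bytes(prog.size() * sizeof(Inst));
    if (!bytes.empty()) std::memcpy(bytes.data(), prog.data(), bytes.size());
    return bytes;
}

std::vector<Inst> from_bytes(const std::vector<unsigned char>& bytes) {
    std::vector<Inst> prog(bytes.size() / sizeof(Inst));
    if (!prog.empty())
        std::memcpy(prog.data(), bytes.data(), prog.size() * sizeof(Inst));
    return prog;
}
```

A blob like this is only valid for the same build on the same platform, since struct padding and byte order are baked in, so it is less portable than the pattern string itself.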
This is not possible with C++, but what I really wish happened is that boost could compile constant string regular expressions at compile time with template meta-programming. The reason this is not possible is that it isn't possible to iterate over the contents of a string (even a constant string) with a template.
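That limitation applied to the template machinery available at the time. Since C++14, a constexpr function may loop over the characters of a string literal during compilation, which is the basic building block a compile-time regex compiler needs. A minimal sketch, with a made-up helper that merely counts unescaped opening parentheses rather than compiling anything:

```cpp
#include <cstddef>

// Counts unescaped '(' characters in a pattern. When the argument is a
// string literal, the loop can be evaluated entirely by the compiler.
template <std::size_t N>
constexpr std::size_t count_open_parens(const char (&pattern)[N]) {
    std::size_t count = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {  // N includes the trailing '\0'
        if (pattern[i] == '\\') {
            ++i;  // skip the escaped character
            continue;
        }
        if (pattern[i] == '(') ++count;
    }
    return count;
}

// Checked at compile time: the escaped "\(" is not counted.
static_assert(count_open_parens("(a|b)+\\((c)") == 2,
              "pattern is inspected during compilation");
```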

I'm not sure, but did you take a look at boost::serialization, which can serialize a C++ object?

Python code to parse and inspect c++

Is there a library for Python that will allow me to parse c++ code?
For example, let's say I want to parse some c++ code and find the names of all classes and their member functions/variables.
I can think of a few ways to hack it together using regular expressions, but if there is an existing library it would be more helpful.

In the past I've used for such purposes gccxml (a C++ parser that emits easily-parseable XML) -- I hacked up my own Python interfaces to it, but now there's a pygccxml which should package that up nicely for you.

Parsing C++ accurately is light-years from something you can do with a regular expression.
You need a full C++ parser, and they're pretty hard to build. I've been involved in building one over several years, and track who is doing it; I don't know of any being attempted in Python.
The one I work on is DMS C++ Front End.
It provides not only parsing, but full name and type resolution. After parsing, you can basically extract detailed information about the code at whatever level of detail you like, including arbitrary details about function content.
You might consider using GCCXML, which does contain a parser, and will produce, I believe, the names of all classes, functions, and top-level variables. GCCXML won't give you any information about what's inside a function.
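A small sketch makes the regular-expression problem concrete. The helper below (an illustrative name, not part of any library) treats every identifier after the keyword `class` as a class name, which is roughly what a first regex attempt looks like:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Naive approach: any identifier that follows the word "class" is reported.
std::vector<std::string> naive_class_names(const std::string& source) {
    static const std::regex pattern(R"(\bclass\s+(\w+))");
    std::vector<std::string> names;
    for (std::sregex_iterator it(source.begin(), source.end(), pattern), end;
         it != end; ++it) {
        assert(it->size() == 2);  // whole match plus one capture group
        names.push_back((*it)[1].str());
    }
    return names;
}
```

Fed a comment saying "this class is deprecated", the declaration `template <class T> struct Box {};` and `class Widget {};`, it reports `is`, `T` and `Widget`, and misses `Box` entirely. Patching the pattern for comments, strings, templates, macros and nested scopes quickly turns into writing a parser.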

This is a little outside your question's scope perhaps... but depending on what you're trying to achieve, perhaps Exuberant Ctags is worth looking at.

Have not tried, but using the Python bindings from LLVM's Clang parser may work; see here.

How about pyparsing?

Reading/Understanding third-party code

When you get a third-party library (c, c++), open-source (LGPL say), that does not have good documentation, what is the best way to go about understanding it to be able to integrate into your application?
The library usually has some example programs and I end up walking through the code using gdb. Any other suggestions/best practices?
For an example, I just picked one from sourceforge.net, but it's just a broad engineering/programming question:
http://sourceforge.net/projects/aftp/

I frequently use a couple of tools to help me with this:
GNU Global. It generates cross-referencing databases and can produce hyperlinked HTML from source code. Clicking function calls will take you to their definitions, and you can see lists of all references to a function. Only works for C and perhaps C++.
Doxygen. It generates documentation from Javadoc-style comments. If you tell it to generate documentation for undocumented methods, it will give you nice summaries. It can also produce hyperlinked source code listings (and can link into the listings provided by htags).
These two tools, along with just reading code in Emacs and doing some searches with recursive grep, are how I do most of my source reverse-engineering.

One of the better ways to understand it is to attempt to document it yourself. By going and trying to document it yourself, it forces you to really dive in and test and test and test and make sure you know what each statement is doing at what times. Then you can really start to understand what the previous developer may have been thinking (or not thinking for that matter).

Great question. I think that this should be addressed thoroughly, so I'm going to try to make my answer as thorough as possible.
One thing that I do when approaching large projects that I've either inherited or am contributing to is automatically generate their sources, UML diagrams, and anything else that can ease the various amounts of A.D.D. encountered when learning a new project :)
I believe someone here already mentioned Doxygen, that's a great tool! You should look into it and write a small bash script that will automatically generate sources for the application you're developing in some tree structure you've setup.
One thing that I haven't seen people mention is BOUML! It's fantastic and free! It automatically generates reverse UML diagrams from existing sources and it supports a variety of languages. I use this as a way to really capture the big picture of what's going on in terms of architecture and design before I start reading code.
If you've got the money to spare, look into Understand for %language-here%. It's absolutely great and has helped me in many ways when inheriting legacy code.
EDIT:
Try out ack (betterthangrep.com), it is a pretty convenient script for searching source trees:)

Familiarize yourself with the information available in the headers. The functions you call will be declared there. Then try to identify the valid arguments and pre-/post-conditions of the functions, as those are your primary guidance (even if they are not documented!). The example programs are your next bet.

If you have code completion/IntelliSense, I like opening up the library, typing '.' or 'namespace::' and seeing what comes up. I always find it helpful: you can navigate through the objects/namespaces and see what functionality they have. This is of course assuming it's an OOP library with relatively good naming of functions/objects.

There really isn't a silver bullet other than just rolling up your sleeves and digging into the code.
This is where we earn our money.

Three things:
(1) Try to run the test or example apps available, set low debug levels, and walk through the logs.
(2) Use a source navigator tool / cscope (available on both Windows and Linux) and browse the code to understand the flow.
(3) In parallel, use gdb to step into the code while running the test/example apps.

Autocompletion in Vim

In a nutshell, I'm searching for a working autocompletion feature for the Vim editor. I've argued before that Vim completely replaces an IDE under Linux and while that's certainly true, it lacks one important feature: autocompletion.
I know about Ctrl+N, Exuberant Ctags integration, Taglist, cppcomplete and OmniCppComplete. Alas, none of these fits my description of “working autocompletion:”
Ctrl+N works nicely (only) if you've forgotten how to spell class, or while. Oh well.
Ctags gives you the rudiments but has a lot of drawbacks.
Taglist is just a Ctags wrapper and as such, inherits most of its drawbacks (although it works well for listing declarations).
cppcomplete simply doesn't work as promised, and I can't figure out what I did wrong, or if it's “working” correctly and the limitations are by design.
OmniCppComplete seems to have the same problems as cppcomplete, i.e. auto-completion doesn't work properly. Additionally, the tags file once again needs to be updated manually.
I'm aware of the fact that not even modern, full-blown IDEs offer good C++ code completion. That's why I've accepted Vim's lack in this area until now. But I think a fundamental level of code completion isn't too much to ask, and is in fact required for productive usage. So I'm searching for something that can accomplish at least the following things.
Syntax awareness. cppcomplete promises (but doesn't deliver for me) correct, scope-aware auto-completion of the following:
variableName.abc
variableName->abc
typeName::abc
And really, anything else is completely useless.
Configurability. I need to specify (easily) where the source files are, and hence where the script gets its auto-completion information from. In fact, I've got a Makefile in my directory which specifies the required include paths. Eclipse can interpret the information found therein, why not a Vim script as well?
Up-to-dateness. As soon as I change something in my file, I want the auto-completion to reflect this. I do not want to manually trigger ctags (or something comparable). Also, changes should be incremental, i.e. when I've changed just one file it's completely unacceptable for ctags to re-parse the whole directory tree (which may be huge).
Did I forget anything? Feel free to update.
I'm comfortable with quite a lot of configuration and/or tinkering but I don't want to program a solution from scratch, and I'm not good at debugging Vim scripts.
A final note, I'd really like something similar for Java and C# but I guess that's too much to hope for: ctags only parses code files and both Java and C# have huge, precompiled frameworks that would need to be indexed. Unfortunately, developing .NET without an IDE is even more of a PITA than C++.

Try YouCompleteMe. It uses Clang through the libclang interface, offering semantic C/C++/Objective-C completion. It's much like clang_complete, but substantially faster and with fuzzy-matching.
In addition to the above, YCM also provides semantic completion for C#, Python, Go, TypeScript etc. It also provides non-semantic, identifier-based completion for languages for which it doesn't have semantic support.

There’s also clang_complete which uses the clang compiler to provide code completion for C++ projects. There’s another question with troubleshooting hints for this plugin.
The plugin seems to work fairly well as long as the project compiles, but is prohibitively slow for large projects (since it attempts a full compilation to generate the tags list).

As requested, here is the comment I gave earlier:
Have a look at this:
Vim integration to MonoDevelop
for .NET stuff at least.
OmniCompletion
This link should help you if you want to use MonoDevelop on Mac OS X.
Good luck and happy coding.

I've just found the project Eclim linked in another question. This looks quite promising, at least for Java integration.

I'm a bit late to the party but autocomplpop might be helpful.

Is what you are looking for something like IntelliSense?
insevim seems to address the issue.
link to screenshots here

Did someone mention code_complete?
http://www.vim.org/scripts/script.php?script_id=1764
But you did not like ctags, so this is probably not what you are looking for...
