Indexing string literals for c++ project - c++

I have a huge c++ project and I find myself rgrep-ing for patterns that I know are in string literals. Is there a way to get clang or xtags or cscope or whatever to build a file with a mapping of each string literal in the project to the file and line where it was found?

I don't know of a way to make cscope or friends do this. You could almost certainly write a custom Starscope extractor that would do this, if you don't mind writing a dozen or so lines of Ruby (starscope: https://github.com/eapache/starscope, adding an extractor: https://github.com/eapache/starscope/blob/master/doc/LANGUAGE_SUPPORT.md#how-to-add-another-language)
Alternatively it may just be enough to use something like ag instead, which is grep-like but generally a lot faster: https://github.com/ggreer/the_silver_searcher
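If neither of those fits, a rough scanner is not much code either. The following is only a sketch of the idea in C++, not something cscope, Starscope or ag provide: find_string_literals is a made-up name, and it assumes the source contains no raw string literals (R"(...)"), no digit separators such as 1'000, and no line continuations inside literals. It skips comments and character literals and keeps escape sequences verbatim.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (line number, literal contents) pairs for
// every ordinary "..." literal found in the given source text.
std::vector<std::pair<int, std::string>> find_string_literals(const std::string& src) {
    std::vector<std::pair<int, std::string>> found;
    int line = 1;
    std::size_t i = 0;
    const std::size_t n = src.size();
    while (i < n) {
        const char c = src[i];
        if (c == '\n') {
            ++line;
            ++i;
        } else if (c == '/' && i + 1 < n && src[i + 1] == '/') {
            // Line comment: skip to the end of the line.
            while (i < n && src[i] != '\n') ++i;
        } else if (c == '/' && i + 1 < n && src[i + 1] == '*') {
            // Block comment: skip to the closing */ while still counting lines.
            i += 2;
            while (i < n && !(src[i] == '*' && i + 1 < n && src[i + 1] == '/')) {
                if (src[i] == '\n') ++line;
                ++i;
            }
            i = (i < n) ? i + 2 : n;
        } else if (c == '"' || c == '\'') {
            // String or character literal: read up to the matching unescaped quote.
            const char quote = c;
            const int start_line = line;
            std::string text;
            ++i;
            while (i < n && src[i] != quote) {
                if (src[i] == '\\' && i + 1 < n) {
                    text += src[i];  // keep the backslash of the escape sequence
                    ++i;
                }
                if (src[i] == '\n') ++line;
                text += src[i];
                ++i;
            }
            if (i < n) ++i;  // step over the closing quote
            if (quote == '"') found.emplace_back(start_line, text);
        } else {
            ++i;
        }
    }
    return found;
}
```

Run over every file in the project and written out as "literal, file, line" records, the result could then be searched with ordinary tools.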

Related

Regular Expression for whole word

First of all, I use C# 4.0 to parse the code of a VB6 application.
I have some old VB6 code and about 500+ copies of it. And I use a regular expression to grab all kinds of global variables from the code. The code is described as "Yuck" and some poor victim still has to support this. So I'm hoping to help this poor sucker a bit by generating overviews of specific constants. (And yes, it should be rewritten but it ain't broke, so...)
This is a sample of a code line I need to match, in this case all boolean constants:
Public Const gDemo = False 'Is this a demo version
And this is the regular expression I use at this moment:
Public\s+Const\s+g(?'Name'[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?'Value'[0-9]*)
And I think it too is yucky, because of the * at the end of the boolean group. But if I don't use it, it will only return 'T' or 'F'. I want the whole word.
Is this the proper RegEx to use as solution or is there an even nicer-looking option?
FYI, I use similar regexs to find all string constants and all numeric constants. Those work just fine. And basically the same .BAS file is used for all 50 copies but with different values for all these variables. By parsing all files, we have a good overview of how every version is configured.
And again, yes, we need to rebuild the whole project from scratch since it becomes harder to maintain these days. But it works and we need the manpower for other tasks. It just needs the occasional tweaks...
You can use: Public\s+Const\s+g(?<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?<Value>False|True)
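For readers doing the same thing from C++ rather than C#, here is a minimal sketch of that pattern with std::regex. One assumption to note: the default ECMAScript grammar of std::regex has no named groups, so Name and Value become groups 1 and 2. parse_bool_const is a name invented for this example.

```cpp
#include <regex>
#include <string>
#include <utility>

// Returns {name, value} for a VB6 boolean constant line, or two empty
// strings if the line does not match. Group 1 is Name, group 2 is Value.
std::pair<std::string, std::string> parse_bool_const(const std::string& line) {
    static const std::regex re(
        R"(Public\s+Const\s+g([a-zA-Z][a-zA-Z0-9]*)\s+=\s+(False|True))");
    std::smatch m;
    if (std::regex_search(line, m, re)) {
        return {m[1].str(), m[2].str()};
    }
    return {};  // no match: both strings empty
}
```

As in the original expression, the leading g sits outside the capture group, so the sample line yields the name Demo rather than gDemo.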

Parsing char array in C++

I'm coding a httpserver and I'm stuck at parsing GET requests.
How would I parse a char buffer with something like
GET /images/logo.png HTTP/1.1
So that I only get the path and file extension but ignore the other parts?
You don't say specifically what sort of storage this string is in - simple char* or some string class.
So, in general, you could either do it the rather simple and dirty way, by splitting the string on the space character, and taking the second or middle section. Or, a better approach would be to get familiar with Regular Expressions. C++ has several regex libraries - Boost is well regarded.
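The "simple and dirty" split could look roughly like this. It is a sketch under a few assumptions: the buffer is a non-null, NUL-terminated char array holding just the request line, anything after a ? is treated as a query string and dropped, and RequestTarget and parse_request_line are names made up for the example.

```cpp
#include <sstream>
#include <string>

struct RequestTarget {
    std::string path;       // e.g. "/images/logo.png"
    std::string extension;  // e.g. "png", empty if the path has none
};

// Hypothetical helper: splits the request line on whitespace and keeps the
// middle token. Returns false if fewer than three tokens are present.
bool parse_request_line(const char* buffer, RequestTarget& out) {
    std::istringstream in(buffer);
    std::string method, target, version;
    if (!(in >> method >> target >> version)) return false;

    // Drop a query string, if any (assumption: it is not wanted here).
    out.path = target.substr(0, target.find('?'));

    // The extension is whatever follows the last dot of the last path segment.
    const std::string::size_type slash = out.path.rfind('/');
    const std::string::size_type dot = out.path.rfind('.');
    const bool has_ext = dot != std::string::npos &&
                         (slash == std::string::npos || dot > slash);
    out.extension = has_ext ? out.path.substr(dot + 1) : std::string();
    return true;
}
```

A real server would also want to check the method and validate the path, which this sketch leaves out.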

C++ - Splitting Filename and File Extension

Ok, first of all I don't want to use Boost, or any external libraries. I just want to use the C++ Standard Library. I can easily split strings with a given delimiter with my split() function:
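#include <sstream>
#include <string>
#include <vector>

// Split `input` on `delim`, appending each piece to `tokens`.
void split(const std::string &input, std::vector<std::string> &tokens, char delim) {
    std::string piece;
    std::stringstream stream(input);
    while (std::getline(stream, piece, delim))
        tokens.push_back(piece);
}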
I do this on filenames. But there's a problem. There are files that have extensions like: tar.gz, tar.bz2, etc. Also there are some filenames that have extra dots, such as Some.file.name.tar.gz. I wish to separate Some.file.name and tar.gz. Note: The number of dots in a filename isn't constant.
I also tried PathFindExtension but no luck. Is this possible? If so, please enlighten me. Thank you.
Edit: I'm very sorry about not specifying the OS. It's Windows.
I think you could use std::string find_last_of to get the index of the last ., and substr to cut the string (although the "complex extensions" involving multiple dots will require additional work).
There is no way of doing what you want that does not involve a database of extensions for your purpose. There's nothing magical about extensions, they are just part of a filename (if you gunzip foo.tar.gz you'll likely get a foo.tar, so for this application .gz actually is "the extension"). So, in order to do what you want, build a database of extensions that you want to look for and fall back on "last dot" if you don't find one.
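A minimal sketch of that approach, using only the standard library: check a small table of multi-dot extensions first, then fall back to the last dot. The table here is illustrative and deliberately incomplete, the comparison is case-sensitive, and split_filename is a name chosen for this example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Returns {stem, extension}. The extension is returned without its leading
// dot and is empty when the name has none.
std::pair<std::string, std::string> split_filename(const std::string& name) {
    // Assumed "database" of compound extensions; extend as needed.
    static const std::vector<std::string> compound = {"tar.gz", "tar.bz2", "tar.xz"};

    for (const std::string& ext : compound) {
        const std::string suffix = "." + ext;
        if (name.size() > suffix.size() &&
            name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0) {
            return {name.substr(0, name.size() - suffix.size()), ext};
        }
    }

    // Fallback: split at the last dot. A leading dot (e.g. ".bashrc") is
    // treated as part of the name rather than as an extension.
    const std::string::size_type dot = name.rfind('.');
    if (dot == std::string::npos || dot == 0) return {name, std::string()};
    return {name.substr(0, dot), name.substr(dot + 1)};
}
```

With this table, Some.file.name.tar.gz splits into Some.file.name and tar.gz, while a name such as photo.jpeg falls through to the last-dot rule.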
There's nothing in the C++ standard library for this (that is, it's not in the Standard), but every operating system I know of provides this functionality in a variety of ways.
In Windows you can use _splitpath(), and in Linux you can use dirname() & basename()
The problem is indeed filenames like *.tar.gz, which can not be split consistently, due to the fact that (at least in Windows) the .tar part isn't part of the extension. You'll either have to keep a list for these special cases and use a one-dot string::rfind for the rest or find some pre-implemented way. Note that the .tar.* extensions aren't infinite, and very much standardized (there's about ten of them I think).
You could create a look-up table of file extensions that you think you might encounter. And also add a command line option to add a new one to the look-up table if you encounter anything new. Then parse through the file name to see if any entry in the look-up table is a sub-string in the file name.
EDIT: You can also refer to this question: C++/STL string: How to mimic regex like function with wildcards?

C++/STL string: How to mimic regex like function with wildcards?

I would like to compare 4 character string using wildcards.
For example:
std::string wildcards[]=
{"H? ", "RH? ", "H[0-5] "};
/*in the last one I need to check if string is "H0 ",..., and "H5 " */
Is it possible to do this using only the STL?
Thanks,
Arman.
EDIT:
Can we do it without boost.regex?
Or should I add yet another library dependency to my project? :)
Use Boost.Regex
No - you need boost::regex
Regular expressions were made for this sort of thing. I can understand your reluctance to add a dependency, but in this case it's probably justified.
You might check your C++ compiler to see if it includes any built-in regular expression library. For example, Microsoft includes CAtlRegExp.
Barring that, your problem doesn't look too difficult to write custom code for.
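As a rough illustration of how small that custom code can be, here is a sketch of a matcher for just the two wildcard forms in the question. The semantics are assumptions read off the examples: ? matches any single character, [a-b] matches one character in an inclusive range, everything else must match literally, and the whole string has to be consumed. wildcard_match is a made-up name. Compilers with C++11 support also ship std::regex, which would cover this without an extra library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: matches `text` against `pattern`, where '?' is any
// single character and "[a-b]" is a single character in the range a..b.
bool wildcard_match(const std::string& pattern, const std::string& text) {
    std::size_t p = 0;
    std::size_t t = 0;
    while (p < pattern.size()) {
        if (t >= text.size()) return false;  // pattern wants more characters
        const char c = text[t];
        if (pattern[p] == '?') {
            p += 1;
        } else if (pattern[p] == '[' && p + 4 < pattern.size() &&
                   pattern[p + 2] == '-' && pattern[p + 4] == ']') {
            // Range of the form [lo-hi], five pattern characters long.
            if (c < pattern[p + 1] || c > pattern[p + 3]) return false;
            p += 5;
        } else {
            if (c != pattern[p]) return false;
            p += 1;
        }
        ++t;
    }
    return t == text.size();  // no trailing characters left over
}
```

Checking a string against the whole wildcards[] array is then a simple loop over the patterns.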
You can do it without introducing a new library dependency, but to do so you'd end up writing a regular expression engine yourself (or at least a subset of one).
Is there some reason you don't want to use a library for this?

Are there particular cases where native text manipulation is more desirable than regex?

Are there particular cases where native text manipulation is more desirable than regex?
In particular .net?
Note:
Regex appears to be a highly emotive subject, so I am wary of asking such a question. This question is not inviting personal/professional opinions on regex, only specific situations where a solution including its use is not as good as language native commands (including those which have underlying code using regex) and why.
Also, note that Desirable can mean performance, can mean code-readability; it does not mean panacea, as each solution for a problem has its benefits and limitations.
Apologies if this is a duplicate, I have searched SO for a similar question.
I prefer text manipulation over regular expressions to parse delimited string input. It's far simpler (for me at least) to issue a string split than to manage a regular expression.
Given some text:
value1, value2, value3
You can parse the line easily:
var values = myString.Split(',');
I'm sure there's a better way but with regular expressions you'd have to do something like:
var match = Regex.Match(myString, "^([^,]*),([^,]*),([^,]*)$");
var value1 = match.Groups[1].Value;
...
When you can do it simply with native text manipulation, it is usually preferable (simpler to read & better performance) not to use regex.
Personal rule of thumb: if it's tricky or noticeably longer to do it "manually" and the performance gain is negligible, don't. Else do.
Don't examples:
split
simple find & replace
long text
loop
existing native functions (like, in PHP, strrchr, ucwords...)
Using a regex basically means embedding a tiny program, written in a different programming language, in the middle of your program. I'll ignore the inefficiency of using a regex over native string manipulation, because it probably isn't relevant in most cases.
I prefer native text manipulation over regex any time native text manipulation will be easier to follow for other people. Which is true quite frequently, since plenty of the people around me are not strongly familiar with regex. Unless working with something that is very much about parsing (via regex) they should not need to be!
Regular expressions are usually slower, less readable, and harder to debug than native string manipulation.
The main case where I'll prefer regex over string manipulation is when I want to be able to have different ways to parse strings depending on the source, and the types of sources will increase over time. Native string manipulation is not really practical in this case. I've had cases where I've stuck a regex column in a database...
RegEx's are very flexible and powerful, because they are in many ways similar to an eval() statement. That being said, depending on the implementation, they can be a bit slow. Normally, this is not an issue, however, if they can be avoided in a particularly costly loop, that can boost performance.
That being said, I tend to use them, and only worry about performance when the app is "done" and I have real benchmarks to prove I need to tweak performance, i.e., avoid premature optimization.
Whenever the same result can be achieved with a reasonable amount of code.
Regular expressions are very powerful, but they tend to get hard to read. If you can do the same with simple string operations that usually means that the code gets easier to manage and maintain.
There is some overhead in setting up the object and parsing the expression. For simpler string manipulation you can get better performance with simple string methods.
Example:
Getting the file name from a file path (yes, I know that the Path class should be used for that, it's just an example...)
string name = Regex.Match(path, @"([^\\]+)$").Groups[0].Value;
vs.
string name = path.Substring(path.LastIndexOf('\\') + 1);
The second solution is straightforward and does the minimal work needed to get the result. The regular expression solution produces the same result, but it does more work to parse the string, and it produces a bunch of objects that are not needed for the result.
Regex parsing and execution requires the host language to defer processing to its regex "engine". This adds overhead, so for any instance where native string manipulation could be used it is preferable for speed (and readability!).
I'll usually just use text manipulation for simple string replacements (e.g. replacing tokens in a template with actual values). You could certainly do this with Regex, but plain string replacements are much easier.
Yes. Example:
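#include <string.h>

/* Returns a pointer to the part of path after the last '/', or path itself. */
const char* basename (const char* path)
{
    const char* p = strrchr(path, '/');
    return (p != NULL) ? (p+1) : path;
}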