C++, Boost regex: replace value as a function of the matched value?

Specifically, I have an array of strings called val, and want to replace all instances of "%{n}%" in the input with val[n]. More generally, I want the replace value to be a function of the match value. This is in C++, so I went with Boost, but if another common regex library matches my needs better let me know.
I found some .NET (C#, VB.NET) solutions, but I don't know if I can use the same approach here (or, if I can, how to do so).
I know there is this ugly solution: have an expression of the form "(%{0}%)|(%{1}%)..." and then have a replace pattern like "(?1" + val[0] + ")(?2" + val[1] ... + ")".
But I'd like to know if what I'm trying to do can be done more elegantly.
Thanks!

I don't believe boost::regex has an easy way to do this. The most straightforward way that I can think of would be to do a regex_search using the "(%{[0-9]+}%)" pattern and then iterate over the sub-matches in the returned match_results object. You'll need to build a new string by concatenating the text from between each match (the match_results::position method will be your friend here) with the result of converting sub-matches to the values from your val array.
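As a rough illustration of the loop described above, here is a minimal sketch. It is written in Python rather than C++ purely for brevity. The function name expand_placeholders and the escaped pattern with the digits in their own capture group are assumptions of this sketch, not part of the original answer.

```python
import re

# Assumed pattern: literal "%{", one or more digits (captured), literal "}%".
PLACEHOLDER = re.compile(r"%\{([0-9]+)\}%")

def expand_placeholders(text, values):
    """Replace every %{n}% in text with values[n] (hypothetical helper)."""
    parts = []
    last = 0  # end position of the previous match
    for m in PLACEHOLDER.finditer(text):
        parts.append(text[last:m.start()])      # text between matches
        parts.append(values[int(m.group(1))])   # replacement derived from the match
        last = m.end()
    parts.append(text[last:])                   # tail after the final match
    return "".join(parts)

print(expand_placeholders("a %{0}% b %{1}%!", ["x", "y"]))  # a x b y!
```

An index outside the values list raises IndexError here; a real implementation would have to decide how to treat such placeholders.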

Related

Regex - match all strings with specific exceptions

I have a lot of strings which have similar values.
I need to write a regex that will keep all values except those that start with a specific substring. Does anyone know how I can do this?
For example, assume my string values are:
foo_bar
foo_baz
foo_bar_baz
foo_baz_bar
bar_baz
bar_foo
I can write a regex that will capture all of the above strings easily:
(foo_.*|bar_.*)
But suppose I have reasons for dropping anything that contains "foo_baz" and keeping all the others,
i.e. my results would be:
foo_bar
foo_bar_baz
bar_baz
bar_foo
Is there any easy way to achieve this without explicitly listing each of the strings I want to keep?
Thanks.
You can use a negative lookahead:
^(?!foo_baz).*$
See https://regex101.com/r/jBCSjR/1
Or, depending on your programming language, it could be easier to filter out values using startsWith() or any equivalent.
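Both suggestions can be compared side by side in a small sketch. Python is used here only as an example language; the variable names are made up for the illustration.

```python
import re

values = ["foo_bar", "foo_baz", "foo_bar_baz", "foo_baz_bar", "bar_baz", "bar_foo"]

# Negative lookahead: match only when the string does not start with "foo_baz".
keep = re.compile(r"^(?!foo_baz).*$")
kept_by_regex = [v for v in values if keep.match(v)]

# Plain prefix test, the non-regex alternative mentioned above.
kept_by_prefix = [v for v in values if not v.startswith("foo_baz")]

print(kept_by_regex)   # ['foo_bar', 'foo_bar_baz', 'bar_baz', 'bar_foo']
print(kept_by_prefix)  # same list
```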

Is there an efficient way to find a string fulfilling given regex?

Let's say I have a regex such as (Python notation) r'^namespace/(\w+)/([0-9]+)/'. Is there a way to reverse this regex and find a string that satisfies it?
By reversing I don't mean manually constructing 'namespace/' + 'a_1' + '/' + '1', but a systematic way to reverse any regular expression consisting of some special characters, so that for every regex I can generate (any) string that satisfies it.
The only thing that comes to my mind is to parse the given regex with some other regexes, but that does not seem like an acceptable solution. Although I expect the whole operation to have huge complexity, I am still looking for a somewhat more sophisticated way to do it.
"The only thing that comes to my mind is to parse the given regex with some other regexs, but it does not seem acceptable solution"
You don't need to parse the regex with regexes, but yes, you will need to parse it. Once you have an AST of the regular expression, you can easily traverse it and build a possible match in linear time (for plain regular expressions, with nothing too fancy like lookaround).
See "Enumerating Regular Languages" for example code and further links.
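The traversal step can be sketched as follows, under loud assumptions: the tuple-based AST format and the function name sample are invented for this illustration, and the AST for the example regex is written out by hand because the parser itself is out of scope here.

```python
import re
import string

# Hypothetical mini-AST, one tuple per node:
#   ("lit", "abc")       literal text
#   ("class", "abc")     any single character from the given set
#   ("cat", [nodes])     concatenation
#   ("alt", [nodes])     alternation
#   ("star", node)       zero or more repetitions
#   ("plus", node)       one or more repetitions

def sample(node):
    """Return one string matched by the AST, visiting each node at most once."""
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "class":
        return node[1][0]             # any member will do; take the first
    if kind == "cat":
        return "".join(sample(child) for child in node[1])
    if kind == "alt":
        return sample(node[1][0])     # any branch will do; take the first
    if kind == "star":
        return ""                     # zero repetitions is always valid
    if kind == "plus":
        return sample(node[1])        # exactly one repetition
    raise ValueError("unknown node kind: %r" % (kind,))

# Hand-built AST for r'^namespace/(\w+)/([0-9]+)/' (anchors and groups omitted).
word_chars = string.ascii_letters + string.digits + "_"
ast = ("cat", [
    ("lit", "namespace/"),
    ("plus", ("class", word_chars)),
    ("lit", "/"),
    ("plus", ("class", string.digits)),
    ("lit", "/"),
])

candidate = sample(ast)
print(candidate)  # namespace/a/0/
```

Checking the result with re.match against the original pattern is a cheap sanity test for both the AST and the traversal.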

How to concisely regex match any portion of a unique string?

Context:
Say I have a set of strings that are all distinct, though they may share starting sequences, i.e. apple, banana, bpple, canana, applf.
How best would I use a regex to match on a string that can contain any left-starting subset of one of those strings? For example apple and banana would obviously match. So would banan, ba, bp and c. b and appl would be ambiguous (and therefore should not match).
Using generated character classes in dynamically-built regexes (slow and ugly), I can make a match engine for this. However, it's complicated to the point that when I try, I end up doing most of the matching logic in Python/pick-your-language and ditching regex altogether. Is there some succinct way to make this work with regular expressions?
The simplest way to do this might be to break out each possible string (apple, banana etc) into a list and match against each one in sequence, but curiosity and stubbornness make me wonder if there isn't some way to do it with regex alone/primarily.
TL;DR:
Is there a way, using regex, to match: if and only if the string supplied is a unique and left-starting substring of only one of a given set of strings?
Don't use regular expressions. You are asking for the leaves in a trie.
If you absolutely have to use regular expressions, then they could be built like this:
(a(p(p(le?)?)?)?|b(a(n(a(na?)?)?)?)? ...)
It is easy to write some code that constructs this, but you won't be able to find out what actually matched (e.g. the user enters 'app' and you probably want to know that this matches 'apple'). Also, modifying this to ensure that there is no more than one match is really cumbersome. The code that constructs the regex will be much more complicated than just creating a trie (in fact, you probably have to create something equivalent to a trie in order to create the regex you are asking for).
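A minimal sketch of the trie approach, assuming the words are distinct as stated in the question. The names build_trie and unique_match and the dictionary-based node layout are choices made for this illustration.

```python
def build_trie(words):
    """Build a trie whose nodes count how many words pass through them."""
    root = {"children": {}, "count": 0, "word": None}
    for word in words:
        node = root
        for ch in word:
            node = node["children"].setdefault(
                ch, {"children": {}, "count": 0, "word": None})
            node["count"] += 1    # one more word shares this prefix
            node["word"] = word   # only meaningful while count == 1
    return root

def unique_match(root, query):
    """Return the single word that starts with query, or None."""
    if not query:
        return None
    node = root
    for ch in query:
        node = node["children"].get(ch)
        if node is None:
            return None           # no word has this prefix
    return node["word"] if node["count"] == 1 else None

trie = build_trie(["apple", "banana", "bpple", "canana", "applf"])
print(unique_match(trie, "banan"))  # banana
print(unique_match(trie, "b"))      # None (ambiguous)
```

Unlike the generated regex, this reports which word matched. One limitation of the sketch: if one word is a prefix of another, typing the shorter word in full is treated as ambiguous.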

How to create regular expression to get all functions from code

I have a problem with my regular expression. I need to find all functions in a text. I have this regular expression: \w*\([^(]*\). It works fine as long as the text does not contain brackets without a function name. For example, for the string 'hello world () testFunction()' it returns () and testFunction(), but I need only testFunction(). I want to use it in my C# application to parse a string passed to my method. Can anybody help me?
Thanks!
Programming languages have a hierarchical (nested) structure, which means that they cannot be parsed by simple regular expressions in the general case. If you want to write correct code that always works, you need to use a real parser (for example, an LR parser). If you simply want a hack that will pick up most functions, use something like:
\w+\([^)]*\)
But keep in mind that this will fail in some cases. E.g. it cannot differentiate between a function definition (signature) and a function call, because it does not look at the context.
Try \w+\([^(]*\)
Here I have changed \w* to \w+. This means that the match will need to contain at least one word character before the opening bracket.
Hope that helps
Change the * to + (if it exists in your regex implementation, otherwise do \w\w*). This will ensure that \w is matched one or more times (rather than the zero or more that you currently have).
It largely depends on your definition of "function name". For example, based on your description you only want to filter out the "empty" names rather than find all valid names.
If your current solution is otherwise good enough and your only problem is these empty names, then try changing the * to a +, which requires at least one word character right before the bracket.
\w+\([^(]*\)
OR
\w\w*\([^(]*\)
Depending on your regexp application's syntax.
(\w+)\(
The capture group will hold the function names without any parentheses; you can add them back later if you want. I assumed you don't need the parameters.
If you do need the parameters then use:
\w+\(.*\)
for a greedy regex (it matches nested function calls, but it runs on to the last closing parenthesis on the line)
or...
\w+\([^)]*\)
for a regex that stops at the first closing parenthesis (it does not handle nested function calls: for outer(inner(x)) it matches only outer(inner(x) and leaves the final parenthesis out)
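The differences between these patterns are easy to check on the example string from the question. The sketch below uses Python's re module instead of C#, and the nested-call sample string is made up for the illustration.

```python
import re

text = "hello world () testFunction()"

# Original pattern: \w* also accepts an empty name, so the bare "()" matches.
with_star = re.findall(r"\w*\([^(]*\)", text)

# Suggested fix: \w+ requires at least one word character before "(".
with_plus = re.findall(r"\w+\([^(]*\)", text)

# Capturing only the name, without parentheses or parameters.
names_only = re.findall(r"(\w+)\(", text)

# The [^)]* variant stops at the first ")" and so cuts a nested call short.
nested = re.findall(r"\w+\([^)]*\)", "outer(inner(x))")

print(with_star)   # ['()', 'testFunction()']
print(with_plus)   # ['testFunction()']
print(names_only)  # ['testFunction']
print(nested)      # ['outer(inner(x)']
```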

Do regex implementations actually need a split() function?

Is there any application for a regex split() operation that could not be performed by a single match() (or search(), findall() etc.) operation?
For example, instead of doing
subject.split('[|]')
you could get the same result with a call to
subject.findall('[^|]*')
And in nearly all regex engines (except .NET and JGSoft), split() can't do some things like "split on | unless they are escaped \|" because you'd need to have unlimited repetition inside lookbehind.
So instead of having to do something quite unreadable like this (nested lookbehinds!)
splitArray = Regex.Split(subjectString, @"(?<=(?<!\\)(?:\\\\)*)\|");
you can simply do (even in JavaScript which doesn't support any kind of lookbehind)
result = subject.match(/(?:\\.|[^|])*/g);
This has led me to wondering: Is there anything at all that I can do in a split() that's impossible to achieve with a single match()/findall() instead? I'm willing to bet there isn't, but I'm probably overlooking something.
(I'm defining "regex" in the modern, non-regular sense, i.e., using everything that modern regexes have at their disposal like backreferences and lookaround.)
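The match-instead-of-split idea from the question can be tried out in Python as a rough sketch. One assumption differs from the JavaScript line above: the pattern uses + instead of *, so empty fields are dropped rather than returned as empty strings.

```python
import re

subject = r"a|b\|c|d"  # the second pipe is escaped with a backslash

# A naive split cuts at every pipe, including the escaped one.
naive = re.split(r"\|", subject)

# Matching runs of "escaped character or non-pipe" keeps the escaped pipe
# inside its field, with no lookbehind needed.
fields = re.findall(r"(?:\\.|[^|])+", subject)

print(naive)   # ['a', 'b\\', 'c', 'd']
print(fields)  # ['a', 'b\\|c', 'd']
```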
The purpose of regular expressions is to describe the syntax of a language. These regular expressions can then be used to find strings that match the syntax of these languages. That’s it.
What you actually do with the matches depends on your needs. If you're looking for all matches, repeat the find process and collect the matches. If you want to split the string, repeat the find process and split the input string at the positions where the matches were found.
So basically, regular expression libraries can only do one thing: perform a search for a match. Everything else is just an extension.
A good example of this is JavaScript, where RegExp.prototype.exec actually performs the match search. Any other method that accepts a regular expression (e.g. RegExp.prototype.test, String.prototype.match, String.prototype.search) just uses the basic functionality of RegExp.prototype.exec somehow:
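// pseudo-implementations (simplified: they ignore the g flag and lastIndex)
RegExp.prototype.test = function(str) {
  return this.exec(str) !== null;          // true if any match is found
};
String.prototype.match = function(pattern) {
  return new RegExp(pattern).exec(this);   // match array, or null
};
String.prototype.search = function(pattern) {
  var m = new RegExp(pattern).exec(this);
  return m ? m.index : -1;                 // index of first match, or -1
};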
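The claim that splitting is just repeated searching can be sketched in a few lines. This is an illustration in Python, not how any particular library implements split; the name split_by_search is hypothetical, and the sketch assumes the pattern never matches the empty string.

```python
import re

def split_by_search(pattern, text):
    """Split text at every match of pattern, using only repeated searches."""
    pieces = []
    last = 0  # end of the previous separator
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])  # text before this separator
        last = m.end()
    pieces.append(text[last:])               # text after the final separator
    return pieces

print(split_by_search(r"\|", "a|b||c"))  # ['a', 'b', '', 'c']
print(re.split(r"\|", "a|b||c"))         # ['a', 'b', '', 'c']
```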