QRegExp in C++ to capture part of string - c++

I am attempting to use Qt to execute a regex in my C++ application.
I have done similar regular expressions with Qt in C++ before, but this one is proving difficult.
Given a string with optional _# at the end of the string, I want to extract the part of the string before that.
Examples:
"blue_dog" should result "blue_dog"
"blue_dog_1" should result "blue_dog"
"blue_dog_23" should result "blue_dog"
This is the code I have so far, but it does not work yet:
QString name = "blue_dog_23";
QRegExp rx("(.*?)(_\\d+)?");
rx.indexIn(name);
QString result = rx.cap(1);
I have even tried the following additional options in many variations without luck. My code above always results with "":
rx.setMinimal(TRUE);
rx.setPatternSyntax(QRegExp::RegExp2);

Sometimes it's easier not to pack everything in a single regexp. In your case, you can restrict manipulation to the case of an existing _# suffix. Otherwise the result is name:
QString name = "blue_dog_23";
QRegExp rx("^(.*)(_\\d+)$");
QString result = name;
if (rx.indexIn(name) == 0)
result = rx.cap(1);
Alternatively, you can split the last bit and check if it is a number. A compact (but maybe not the most readable) solution:
QString name = "blue_dog_23";
int i = name.lastIndexOf('_');
bool isInt = false;
QString result = (i >= 0 && (name.mid(i+1).toInt(&isInt) || isInt)) ? name.left(i) : name;

The following solution should work as you want it to!
^[^\s](?:(?!_\d*\n).)*/gm
Basically, that is saying match everything up to, but not including, _\d*\n. Here, _\d*\n means match the _ char, then match any number of digits \d* until a new line marker, \n is reached. ?! is a negative lookahead, and ?: is a non-capturing group. Basically, the combination means that the sequence after the ?: is the group representing the non-inclusive end point of the what should be captured.
The ^[^\s] tells the expression to match starting at the start of a line, as long as the first character isn't a white space.
The /gm sets the global flag (allowing more than one match to be returned) and the mutli-line flag (which allows sequences to match past a single line.

Related

RegEx for matching a string after a string up to a comma

Here is a sample string.
"BLAH, blah, going to the store &^5, light Version 12.7(2)L6, anyway
plus other stuff Version 3.3.4.6. Then goes on an on for several lines..."
I want to capture only the first version number without including the word version if possible but not include the periods and parenthesis. The result would stop when it encounters a comma. The result would be:
"1272L6"
I don't want it to include other instances of version in the text. Can this be done?
I've tried (?<=version)[^,]* I know it does not address removing the periods and parens and does not address the subsequent versions.
This exact RegEx, maybe not the best solution, but it might help you to get 1272L6:
([0-9]{2})\.([0-9]{1})\(([0-9]{1})\)([A-Z]{1}[0-9]{1})
It creates four groups (where $1$2$3$4 is your target 1272L6) and passes ., ) and (.
You might change {1} to other numbers of repetitions, such as {1,2}.
Assuming the version number is fixed on format but not on the specific digits or letters, you could do this.
String s = "this is a test 12.7(2)L6, 13.7(2)L6, 14.7(2)L6";
String reg = "(\\d\\d\\.\\d\\(\\d\\)[A-Z]\\d),";
Matcher m = Pattern.compile(reg).matcher(s);
if (m.find()) { // should only find first one
System.out.println(m.group(1).replaceAll("[.()]", ""));
}

Combining look aheads not matching

I have the following test string I'm working with:
__level__:,Undergraduate,;__subject__:,Maths,Art,;
This is actually a stringified object of { level: ["Undergraduate"], subject: ["Maths", "Art"] } that I figured converting to a string and using a regular expression might be quicker than looping through each level|subject and each value within those properties.
I can match a single value within a list of a property (e.g. level) like so:
(?=(__subject__:[^;]*(,Maths,).*?;))
And I can match two like so:
(?=(__subject__:[^;]*(,Maths,).*?;))(?=(__subject__:[^;]*(,Art,).*?;))
However, I can't guarantee the order that level and subject lists will be. Below is also possible:
__subject__:,Maths,Art,;__level__:,Undergraduate,;
Notice I've put subject before level now. Now the regular expression doesn't match. I'm pretty new to look aheads so I can't figure out what I've done wrong. Would appreciate any help on the matter.
I also want to combine the properties being matched, so something like:
(?=(__level__:[^;]*(,Undergraduate,).*?;))(?=(__subject__:[^;]*(,Maths,).*?;))(?=(__subject__:[^;]*(,Art,).*?;))
..doesn't work for me either but I'm trying to match two values from the subject property and a value from the level property. Again, I can't guarantee the order of properties (e.g. level, subject) and/or values (e.g. Maths, Art OR Art, Maths)
Class \[A-Z\] & Positive Lookahead (?=)
The targets are letters [A-Z]+? and to exclude the words surrounded by underscore use the positive lookahead to ensure the target is followed by a comma (?=,)
/([A-Z]+?)(?=,)/gi;
Demo
let str = `__level__:,Undergraduate,;__subject__:,Maths,Art,;`;
let rgx = /([A-Z]+?)(?=,)/gi;
let mch = rgx.exec(str);
let res = [];
while (mch !== null) {
res.push(mch[0]);
mch = rgx.exec(str);
}
console.log(res.join(', '));

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
and so, I want to retrieved, 45.72643,4.91203 and also hereanotherdata
As they are both between characters : and /.
I tried with this syntax in a easier string where there is only 1 time the pattern,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern happens once. If I use it in string containing multiples times the pattern, I get all the string between the first : and the last /.
How can I mention that the pattern will occur multiple time, and how can I retrieve it?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should will break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains text before a ":"), loop the rest of the array and regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(#(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now B has cells that start with ':', and has two cells in each cell that contain '/' also. You can extract it with checking where B has more than one cell, and take the first of each of them:
C=cellfun(#(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in 16b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(#"(\d+)");
if (regex.IsMatch(m_Key)) {
string value = "";
int length;
var matches = regex.Matches(m_Key);
foreach (var match in matches) {
if (match.Length >= length) {
value = match.Value;
length = match.Length;
}
}
var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
m_KeyCounter = value;
if (split.Length >= 1) m_KeyPrefix = split(0);
if (split.Length >= 2) m_KeySuffix = split(1);
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
You're input isn't regular so, a regex won't do. I would iterate over the all groups of digits via (\d+) and find the longest and then build a new regex in the form of (.*)<number>(.*) to find your prefix/suffix.
Or if you're comfortable with string operations you can probably just find the start and end of the target group and use substr to find the pre/suf fix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind, that both, NFA and DFA, of almost every Regexp engine are greedy, this means, that a (\d+) will always find the longest match, when it "stumbles" over it.
Now, what I can get from your example, is you always need middle portion of a number, try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
The now look at variables $1, $2, $3. Not all of them will exist: if there are all three of them, $2 will hold your number in question, the other vars, parts of the prefix. when one of the prefixes is missing, only variable $1 and $2 will be set, you have to see for yourself, which one is the integer. If both prefix and suffix are missing, $1 will hold the number.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the modifier /gis present, you can loop through all available combinations, that the machine finds, you can then simply take the one you like most or something.
This example is in PCRE, but I'm sure .NET has a compatible mode.

Regex: How to match a string that is not only numbers

Is it possible to write a regular expression that matches all strings that does not only contain numbers? If we have these strings:
abc
a4c
4bc
ab4
123
It should match the four first, but not the last one. I have tried fiddling around in RegexBuddy with lookaheads and stuff, but I can't seem to figure it out.
(?!^\d+$)^.+$
This says lookahead for lines that do not contain all digits and match the entire line.
Unless I am missing something, I think the most concise regex is...
/\D/
...or in other words, is there a not-digit in the string?
jjnguy had it correct (if slightly redundant) in an earlier revision.
.*?[^0-9].*
#Chad, your regex,
\b.*[a-zA-Z]+.*\b
should probably allow for non letters (eg, punctuation) even though Svish's examples didn't include one. Svish's primary requirement was: not all be digits.
\b.*[^0-9]+.*\b
Then, you don't need the + in there since all you need is to guarantee 1 non-digit is in there (more might be in there as covered by the .* on the ends).
\b.*[^0-9].*\b
Next, you can do away with the \b on either end since these are unnecessary constraints (invoking reference to alphanum and _).
.*[^0-9].*
Finally, note that this last regex shows that the problem can be solved with just the basics, those basics which have existed for decades (eg, no need for the look-ahead feature). In English, the question was logically equivalent to simply asking that 1 counter-example character be found within a string.
We can test this regex in a browser by copying the following into the location bar, replacing the string "6576576i7567" with whatever you want to test.
javascript:alert(new String("6576576i7567").match(".*[^0-9].*"));
/^\d*[a-z][a-z\d]*$/
Or, case insensitive version:
/^\d*[a-z][a-z\d]*$/i
May be a digit at the beginning, then at least one letter, then letters or digits
Try this:
/^.*\D+.*$/
It returns true if there is any simbol, that is not a number. Works fine with all languages.
Since you said "match", not just validate, the following regex will match correctly
\b.*[a-zA-Z]+.*\b
Passing Tests:
abc
a4c
4bc
ab4
1b1
11b
b11
Failing Tests:
123
if you are trying to match worlds that have at least one letter but they are formed by numbers and letters (or just letters), this is what I have used:
(\d*[a-zA-Z]+\d*)+
If we want to restrict valid characters so that string can be made from a limited set of characters, try this:
(?!^\d+$)^[a-zA-Z0-9_-]{3,}$
or
(?!^\d+$)^[\w-]{3,}$
/\w+/:
Matches any letter, number or underscore. any word character
.*[^0-9]{1,}.*
Works fine for us.
We want to use the used answer, but it's not working within YANG model.
And the one I provided here is easy to understand and it's clear:
start and end could be any chars, but, but there must be at least one NON NUMERICAL characters, which is greatest.
I am using /^[0-9]*$/gm in my JavaScript code to see if string is only numbers. If yes then it should fail otherwise it will return the string.
Below is working code snippet with test cases:
function isValidURL(string) {
var res = string.match(/^[0-9]*$/gm);
if (res == null)
return string;
else
return "fail";
};
var testCase1 = "abc";
console.log(isValidURL(testCase1)); // abc
var testCase2 = "a4c";
console.log(isValidURL(testCase2)); // a4c
var testCase3 = "4bc";
console.log(isValidURL(testCase3)); // 4bc
var testCase4 = "ab4";
console.log(isValidURL(testCase4)); // ab4
var testCase5 = "123"; // fail here
console.log(isValidURL(testCase5));
I had to do something similar in MySQL and the following whilst over simplified seems to have worked for me:
where fieldname regexp ^[a-zA-Z0-9]+$
and fieldname NOT REGEXP ^[0-9]+$
This shows all fields that are alphabetical and alphanumeric but any fields that are just numeric are hidden. This seems to work.
example:
name1 - Displayed
name - Displayed
name2 - Displayed
name3 - Displayed
name4 - Displayed
n4ame - Displayed
324234234 - Not Displayed