Combining look aheads not matching - regex

I have the following test string I'm working with:
__level__:,Undergraduate,;__subject__:,Maths,Art,;
This is actually a stringified object of { level: ["Undergraduate"], subject: ["Maths", "Art"] } that I figured converting to a string and using a regular expression might be quicker than looping through each level|subject and each value within those properties.
I can match a single value within a list of a property (e.g. level) like so:
(?=(__subject__:[^;]*(,Maths,).*?;))
And I can match two like so:
(?=(__subject__:[^;]*(,Maths,).*?;))(?=(__subject__:[^;]*(,Art,).*?;))
However, I can't guarantee the order that level and subject lists will be. Below is also possible:
__subject__:,Maths,Art,;__level__:,Undergraduate,;
Notice I've put subject before level now. Now the regular expression doesn't match. I'm pretty new to look aheads so I can't figure out what I've done wrong. Would appreciate any help on the matter.
I also want to combine the properties being matched, so something like:
(?=(__level__:[^;]*(,Undergraduate,).*?;))(?=(__subject__:[^;]*(,Maths,).*?;))(?=(__subject__:[^;]*(,Art,).*?;))
..doesn't work for me either but I'm trying to match two values from the subject property and a value from the level property. Again, I can't guarantee the order of properties (e.g. level, subject) and/or values (e.g. Maths, Art OR Art, Maths)

Class [A-Z] & Positive Lookahead (?=)
The targets are runs of letters, [A-Z]+?. To exclude the property names wrapped in underscores, use a positive lookahead that ensures the target is followed by a comma: (?=,)
/([A-Z]+?)(?=,)/gi;
Demo
let str = `__level__:,Undergraduate,;__subject__:,Maths,Art,;`;
let rgx = /([A-Z]+?)(?=,)/gi;
let mch = rgx.exec(str);
let res = [];
while (mch !== null) {
  res.push(mch[0]);
  mch = rgx.exec(str);
}
console.log(res.join(', '));
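As for why the combined lookaheads in the question fail: every lookahead is tested at the same starting position, so a lookahead that begins with __level__ and one that begins with __subject__ can never both succeed there. Prefixing each lookahead with .* and anchoring once at the start lets each one scan the whole string on its own, so the order of properties and values no longer matters. A minimal sketch in Python, where the helper name has_values and the shape of the wanted mapping are made up for illustration:

```python
import re

def has_values(text, wanted):
    """Return True if every requested value appears in its property's list.

    `wanted` maps a property name to the values that must all be present,
    e.g. {"level": ["Undergraduate"], "subject": ["Maths", "Art"]}.
    (Hypothetical helper, not part of the original question.)
    """
    lookaheads = []
    for prop, values in wanted.items():
        for value in values:
            # .* lets this lookahead find its property anywhere in the string;
            # [^;]* keeps the search inside that property's own list.
            lookaheads.append(
                r"(?=.*__{}__:[^;]*,{},)".format(re.escape(prop), re.escape(value))
            )
    pattern = re.compile("^" + "".join(lookaheads))
    return pattern.search(text) is not None

wanted = {"level": ["Undergraduate"], "subject": ["Art", "Maths"]}
print(has_values("__level__:,Undergraduate,;__subject__:,Maths,Art,;", wanted))
print(has_values("__subject__:,Maths,Art,;__level__:,Undergraduate,;", wanted))
```

Both calls should report a match, since each lookahead is satisfied independently of where its property sits in the string.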

Related

Regex pattern for number or three letters

I need to sort a table field with different kind of values:
number from 0 to 999+
group of three letters like AAA, AAB, AAC, AAD, etc.
StupidTable.js enables me to add a custom alphanumeric data type, but I'm not able to define the regex pattern.
I tried this code:
$("table").stupidtable({
"alphanum":function(a,b){
console.log(a,b)
var pattern = "^[a-zA-Z0-9_.-]*$";
var re = new RegExp(pattern);
var aNum = re.exec(a).slice(1);
var bNum = re.exec(b).slice(1);
return parseInt(aNum,10) - parseInt(bNum,10);
}
})
but it doesn't work. You can check the issue on this page by clicking on the "nr" tab: Test
Try something like this:
const regexPattern = /^[\d\w]{3}/gm;
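This pattern matches the first three word characters at the start of each line, so it captures a 3-digit number or a 3-letter code. Note that \w already includes digits (and the underscore), so [\d\w] is equivalent to \w, and without a trailing $ the pattern does not check that the value is exactly three characters long. If you want to capture 0 and not 000, you will need to change {3} to {1,3}, but this will also capture A instead of AAA.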
You might also consider normalizing your data in some ways, e.g. converting A to AAA and 0 to 000. This could be helpful for a number of reasons assuming your variable type is a string and not actually a number type. Does that make sense?
You can see how I've created this pattern at the link below, and try some tweaks to make it work well for you. I use this tool a lot and it will also generate some code for you in different languages. Good luck with your project, let me know how it goes.
Regex101.com
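To illustrate the sorting logic itself, independent of StupidTable.js, here is a small Python sketch of a sort key that treats purely numeric values as numbers and three-letter codes as text. The function name alphanum_key and the decision to place numbers before letter codes are assumptions, not something stated in the question:

```python
import re

def alphanum_key(value):
    """Sort key: numbers first (by numeric value), then 3-letter codes (A-Z order).

    Hypothetical helper; the ordering of numbers before letters is an assumption.
    """
    value = value.strip()
    if re.fullmatch(r"\d+", value):
        return (0, int(value), "")
    if re.fullmatch(r"[A-Za-z]{3}", value):
        return (1, 0, value.upper())
    # Anything else is pushed to the end and compared as plain text.
    return (2, 0, value)

values = ["AAB", "12", "7", "AAA", "100"]
print(sorted(values, key=alphanum_key))  # ['7', '12', '100', 'AAA', 'AAB']
```

The same idea carries over to a comparator function: classify each value first, then compare numerically or alphabetically within the class.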

QRegExp in C++ to capture part of string

I am attempting to use Qt to execute a regex in my C++ application.
I have done similar regular expressions with Qt in C++ before, but this one is proving difficult.
Given a string with optional _# at the end of the string, I want to extract the part of the string before that.
Examples:
"blue_dog" should result "blue_dog"
"blue_dog_1" should result "blue_dog"
"blue_dog_23" should result "blue_dog"
This is the code I have so far, but it does not work yet:
QString name = "blue_dog_23";
QRegExp rx("(.*?)(_\\d+)?");
rx.indexIn(name);
QString result = rx.cap(1);
I have even tried the following additional options in many variations without luck. My code above always results in "":
rx.setMinimal(TRUE);
rx.setPatternSyntax(QRegExp::RegExp2);
Sometimes it's easier not to pack everything in a single regexp. In your case, you can restrict manipulation to the case of an existing _# suffix. Otherwise the result is name:
QString name = "blue_dog_23";
QRegExp rx("^(.*)(_\\d+)$");
QString result = name;
if (rx.indexIn(name) == 0)
result = rx.cap(1);
Alternatively, you can split the last bit and check if it is a number. A compact (but maybe not the most readable) solution:
QString name = "blue_dog_23";
int i = name.lastIndexOf('_');
bool isInt = false;
QString result = (i >= 0 && (name.mid(i+1).toInt(&isInt) || isInt)) ? name.left(i) : name;
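The same idea (only touch the string when it ends in an underscore followed by digits) can be sketched outside Qt as well. The Python version below is only an illustration of the technique, not a drop-in replacement for the QRegExp code, and the function name strip_numeric_suffix is made up:

```python
import re

def strip_numeric_suffix(name):
    """Remove a trailing "_<digits>" suffix if present; otherwise return name unchanged."""
    # Anchoring with $ means an underscore-number pair in the middle is left alone.
    return re.sub(r"_\d+$", "", name)

for example in ["blue_dog", "blue_dog_1", "blue_dog_23"]:
    print(example, "->", strip_numeric_suffix(example))
```

Anchoring the optional part at the end of the string avoids the problem in the question, where a lazy group followed by an optional group is allowed to match nothing at all.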
The following solution should work as you want it to!
/^[^\s](?:(?!_\d*\n).)*/gm
Basically, that says: match everything up to, but not including, _\d*\n. Here, _\d*\n means match the _ character, then any number of digits (\d*), followed by a newline (\n). (?! ... ) is a negative lookahead and (?: ... ) is a non-capturing group. Together they consume one character at a time for as long as the suffix does not start at the current position, so the match stops just before the suffix.
The ^[^\s] tells the expression to match starting at the start of a line, as long as the first character isn't a white space.
The /gm sets the global flag (allowing more than one match to be returned) and the multi-line flag (which makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole input).

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(@"(\d+)");
if (regex.IsMatch(m_Key)) {
    string value = "";
    int length = 0;  // must be initialized before the comparison below
    var matches = regex.Matches(m_Key);
    // Name the element type: MatchCollection enumerates as object on older .NET versions
    foreach (System.Text.RegularExpressions.Match match in matches) {
        // >= keeps the last run when several share the maximum length
        if (match.Length >= length) {
            value = match.Value;
            length = match.Length;
        }
    }
    var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
    m_KeyCounter = value;
    if (split.Length >= 1) m_KeyPrefix = split[0];
    if (split.Length >= 2) m_KeySuffix = split[1];
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
Your input isn't regular, so a regex won't do. I would iterate over all the groups of digits via (\d+), find the longest, and then build a new regex in the form of (.*)<number>(.*) to find your prefix/suffix.
Or if you're comfortable with string operations you can probably just find the start and end of the target group and use substr to find the prefix and suffix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
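The approach these answers describe (collect every run of digits, keep the longest, then cut the string around it) can be sketched in a few lines. The version below is in Python rather than .NET purely for illustration; the function name split_key is made up, and resolving ties in favour of the last run is an assumption chosen to agree with the "123AAA 001_01" example in the question:

```python
import re

def split_key(key):
    """Return (prefix, number, suffix) around the longest run of digits in key."""
    runs = list(re.finditer(r"\d+", key))
    if not runs:
        return key, "", ""
    # max() keeps the first maximal item it sees; iterating in reverse
    # therefore resolves ties in favour of the last run (an assumption).
    longest = max(reversed(runs), key=lambda m: len(m.group()))
    # Slicing by match position avoids splitting on repeated occurrences of the number.
    return key[:longest.start()], longest.group(), key[longest.end():]

for example in ["0001", "1-0001", "AAA001", "AAA 001.01", "1_00001-01", "123AAA 001_01"]:
    print(repr(example), "->", split_key(example))
```

Using the match positions rather than a split on the matched text also keeps the prefix and suffix in the right slots when one of them is empty.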
This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind that the quantifiers in almost every regexp engine, NFA or DFA, are greedy, which means that (\d+) will always take the whole run of digits when it "stumbles" over one.
From your examples, it looks like you always need the middle run of digits, so try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
Then look at the variables $1, $2 and $3. Capture groups are numbered by the position of their opening parenthesis, so $2 always holds the number; $1 holds the prefix and $3 the suffix, and either one is left unset when that part is missing. Be aware that the leading .* is greedy, so when the string contains several runs of digits $2 ends up holding the last run, which is not necessarily the longest.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the /g modifier is present, you can loop through all the combinations that the machine finds and then simply take the one you like most.
This example is in PCRE, but I'm sure .NET has a compatible mode.

RegEx : Replace parts of dynamic strings

I have a string
IsNull(VSK1_DVal.RuntimeSUM,0),
I need to remove IsNull part, so the result would be
VSK1_DVal.RuntimeSUM,
I'm absolutely new to RegEx, but it wouldn't be a problem if not for one thing:
VSK1 is the dynamic part; it can be any combination of A-Z and 0-9, of any length. How do I replace strings with RegEx? I use MSSQL 2k5, I think it uses general set of RegEx rules.
EDIT : I forgot to say, that I'm doing replacement in SSMS Query window's Replace Box (^H) - not building RegEx query
br
marius
Here's a regex that should work:
[^(]+\(([^,]+),[^)]\)
Then use $1 capture group to extract the part that you need.
I did a sanity check in ruby:
orig = "IsNull(VSK1_DVal.RuntimeSUM,0),"
regex = /[^(]*\(([^,]+),[^)]\)/
result = orig.sub(regex){$1} # result => VSK1_DVal.RuntimeSUM,
It gets trickier if you have a prefix that you want to retain. Like if you have this:
"somestuff = IsNull(VSK1_DVal.RuntimeSUM,0),"
In this case, you need some way to identify the start of the pattern. Maybe you can use '=' to identify the start of the pattern? If so, this should work:
orig = "somestuff = IsNull(VSK1_DVal.RuntimeSUM,0),"
regex = /=\s*\w+\(([^,]+),[^)]\)/
result = orig.sub(regex){$1} # result => somestuff = VSK1_DVal.RuntimeSUM,
But then the case where you don't have an equals sign will fail. Maybe you can use 'IsNull' to identify the start of the pattern? If so, try this (note the '/i' representing case insensitive matching):
orig = "somestuff = isnull(VSK1_DVal.RuntimeSUM,0),"
regex = /IsNull\(([^,]+),[^)]\)/i
result = orig.sub(regex){$1} # result => somestuff = VSK1_DVal.RuntimeSUM,
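For comparison, the IsNull-anchored variant can be written in Python as below. This is only a sketch of the same substitution: the function name strip_isnull is made up, the second argument is allowed to be longer than one character here (an extension beyond the patterns above), and nested parentheses inside the arguments are not handled:

```python
import re

# Group 1 is the first argument; [^)]* skips a second argument of any length.
ISNULL = re.compile(r"IsNull\(([^,]+),[^)]*\)", re.IGNORECASE)

def strip_isnull(sql):
    """Replace each IsNull(x, y) with x, leaving the rest of the text untouched."""
    return ISNULL.sub(r"\1", sql)

print(strip_isnull("IsNull(VSK1_DVal.RuntimeSUM,0),"))
print(strip_isnull("somestuff = isnull(VSK1_DVal.RuntimeSUM,123465),"))
```

Because only the IsNull(...) call itself is matched, the trailing comma and any prefix such as "somestuff = " are left in place.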
/IsNull\(([A-Z0-9]+_DVal\.RuntimeSUM),0\)/i
Then pick group match number 1.
Here's a very useful site: http://www.regexlib.com/RETester.aspx
They have a tester and a cheatsheet that are very useful for quick testing of this sort.
I tested the solution by Dave and it works fine except it also removes the trailing comma you wanted retained. Minor thing to fix.
Try this:
IsNULL\((.*,)0\)
You say in your question:
"I use MSSQL 2k5, I think it uses general set of RegEx rules."
This is not true unless you enable CLR and compile and install an assembly. You can use its native pattern matching syntax and LIKE for this as below.
WITH T(C) AS
(
SELECT 'IsNull(VSK1_DVal.RuntimeSUM,0),' UNION ALL
SELECT 'IsNull(VSK1_DVal.RuntimeSUM,123465),' UNION ALL
SELECT 'No Match'
)
SELECT SUBSTRING(C,8,1+LEN(C)-8-CHARINDEX(',',REVERSE(C),2))
FROM T
WHERE C LIKE 'IsNull(%,_%),'

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the use wildcards feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        String test = "filename_author";
        // @"..." is a verbatim string literal, so the pattern needs no extra escaping
        String result = Regex.Replace(test, @"([^_]*)_(.*)", "$2_$1");
    }
}
Here is a Python example:
from re import sub
test = "filename_author"
result = sub('([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see Add power to Word searches with regular expressions for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end of each word, respectively. They ensure that the search returns a single word.
The parentheses and the space between them divide the words into distinct groups: (first word) (second word). The parentheses also indicate the order in which you want search to evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.