RegEx to find words with characters - regex

I've found answers to many of my questions here but this time I'm stuck. I've looked at 100's of questions but haven't found an answer that solves my problem so I'm hoping for your help :D
Considering the following list of words:
iris
iridium
initialization
How can I use regex to find words in this list when I am looking using exactly the characters u, i, i? I'm expecting the regex to find "iridium" only because it is the only word in the list that has two i's and one u.
What I've tried
I've been searching both here and elsewhere but haven't come across any that helps me.
[i].*[i].*[u]
matches iridium, as expected, and not iris nor initialization. However, the characters i, i, u must be in that sequence in the word, which may or may not be the case. So trying with a different sequence
[u].*[i].*[i]
This does not match iridium (but I want it to, iridium contains u, i, i) and I'm stuck for what to do to make it match. Any ideas?
I know I could try all sequences (in the example above it would be iiu; iui; uii) but that gets messy when I'm looking for more characters (say 6, tnztii which would match initialization).
[t].*[n].*[z].*[t].*[i].*[i]
[t].*[z].*[n].*[t].*[i].*[i]
[t].*[z].*[n].*[i].*[t].*[i]
..... (long list until)
[i].*[n].*[i].*[t].*[z].*[t] (the first matching sequence)
Is there a way to use regex to find the word, irrespective of the sequence of the characters?

I don't think there's a way to solve this with regular expressions that doesn't end in a horribly convoluted expression. It might be possible with lookahead and lookbehind assertions, but I think it's probably faster and less messy to simply solve this programmatically.
Split the string on whitespace, then iterate over the words and count how many times each of your characters appears in each word. To speed things up, discard all words that are shorter than the number of characters you require.
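A minimal sketch of that counting approach, in Python purely for illustration. The function name words_with_chars is made up for this example, and it treats the required characters as "at least this many of each", which is an assumption about what the question needs:

```python
from collections import Counter

def words_with_chars(words, chars):
    """Return the words containing every character of `chars` at least as
    often as it occurs in `chars`; the order inside the word is ignored."""
    required = Counter(chars)
    needed = sum(required.values())
    result = []
    for word in words:
        if len(word) < needed:
            continue  # too short to hold all required characters
        counts = Counter(word)
        if all(counts[c] >= n for c, n in required.items()):
            result.append(word)
    return result

words = "iris iridium initialization".split()
print(words_with_chars(words, "uii"))     # ['iridium']
print(words_with_chars(words, "tnztii"))  # ['initialization']
```

Because the check is a count comparison, the number of required characters does not change the amount of code, unlike listing every ordering in a regex.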

Is this an academic exercise, or can you use more than a single regular expression? Is there a language wrapped around this? The simplest way to do what you want is to have a regexp that matches just i or u, and examine (count) the matches. Using python, it could be a one-liner. What are you using?
The part you haven't gotten around to yet is that there might be additional i's or u's in the word. So instead of matching on .*, match on [^iu]*.

Here's what I would do:
Array.prototype.findItemsByChars = function(charGroup) {
    console.log('charGroup:', charGroup);
    // Sort the characters so equal ones are adjacent, then split into runs: 'iui' -> ['ii', 'u']
    charGroup = charGroup.toLowerCase().split('').sort().join('');
    charGroup = charGroup.match(/(.)\1*/g);
    for (var i = 0; i < charGroup.length; i++) {
        charGroup[i] = {char: charGroup[i].charAt(0), count: charGroup[i].length};
        console.log('{char:' + charGroup[i].char + ' ,count:' + charGroup[i].count + '}');
    }
    var matches = [];
    for (var w = 0; w < this.length; w++) {
        var charMatch = 0;
        for (var j = 0; j < charGroup.length; j++) {
            // match() returns null when the character does not occur at all.
            // The character is used as a regex, so metacharacters would need escaping.
            var found = this[w].match(new RegExp(charGroup[j].char, 'g'));
            if (!found) break;
            if (found.length >= charGroup[j].count) {
                if (++charMatch == charGroup.length) matches.push(this[w]);
            }
        }
    }
    return matches.length ? matches : false;
};
var words = ['iris','iridium','initialization','ulisi'];
var matches = words.findItemsByChars('iui');
console.log('matches:',matches);
EDIT: Let me know if you need any explanation.

I know this is a really old post, but I found this topic really interesting and thought people might look for a similar answer some day.
So the goal is to match all words that contain a specific set of characters, in any order. There is a simple way to do this using lookaheads:
\b(?=(?:[^i\W]*i){2})(?=[^u\W]*u)\w+\b
Here is how it works:
We use one lookahead (?=...) for each letter to be matched.
In it, we put [^x\W]*x, where x is the letter that must be present.
We then make this pattern occur n times, where n is the number of times that x must appear in the word, using (?:...){n}.
The resulting regex for a letter x having to appear n times in the word is then (?=(?:[^x\W]*x){n}).
All you have to do then is add this pattern for each letter and add \w+ at the end to match the word!
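The recipe above can be generated from the list of letters instead of being typed by hand. Here is a rough Python sketch; build_word_pattern is a hypothetical helper name, and it assumes every requested character is a plain letter or digit. Note that each lookahead checks for at least n occurrences, not exactly n:

```python
import re
from collections import Counter

def build_word_pattern(chars):
    """Build a regex with one lookahead per distinct letter in `chars`.

    Assumes every character in `chars` is a plain word character.
    """
    lookaheads = []
    for letter, n in sorted(Counter(chars).items()):
        x = re.escape(letter)
        # (?=(?:[^x\W]*x){n}) : x occurs at least n times before the word ends
        lookaheads.append(r"(?=(?:[^%s\W]*%s){%d})" % (x, x, n))
    return re.compile(r"\b" + "".join(lookaheads) + r"\w+\b")

text = "iris iridium initialization"
print(build_word_pattern("uii").findall(text))     # ['iridium']
print(build_word_pattern("tnztii").findall(text))  # ['initialization']
```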

How to Get substring from given QString in Qt

I have a QString like this:
QString fileData = "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication"
What I need to do is to create substrings as follow:
SoftwareName = MY_DISPLAY_OS //text between '=' and ':'
Version = 10.25.10086-1
Release = 2022-3
I tried using QString QString::sliced(qsizetype pos, qsizetype n) const, but it didn't work because I'm using Qt 5.9 and sliced() is only available from Qt 6.0.
QString fileData = "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication";
QString SoftwareName = fileData.sliced(fileData.lastIndexOf(':'), fileData.indexOf('.'));
Please help me to code this in Qt.
Use QString::split 3 times:
Split by QLatin1Char('=') to two parts:
SOFT_PACKAGES.ABC
MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication
Next, split the 2nd part by QLatin1Char(':'), probably again into just 2 parts (at the first colon only), so that the 2nd part can still contain colons:
MY_DISPLAY_OS
MY-Display-OS.2022-3.10.25.10086-1.myApplication
Finally, split 2nd part of previous step by QLatin1Char('.'):
MY-Display-OS
2022-3
10
25
10086-1
myApplication
Now just assemble your required output strings from these parts. If the exact number of parts is unknown, you can get Version = 10.25.10086-1 by removing the first two elements and the last element from the final list above, and then joining the rest with QLatin1Char('.'). If the indexes are known and fixed, you can just use QStringLiteral("%1.%2.%3").arg(....
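The three splits can be tried out quickly outside Qt. This is only a language-neutral sketch in Python; parse_package is a name invented for the example, and str.split with a limit of 1 stands in for splitting into exactly two parts:

```python
file_data = "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication"

def parse_package(line):
    """Split on '=', then ':', then '.', and reassemble the wanted fields."""
    _, value = line.split("=", 1)     # drop "SOFT_PACKAGES.ABC"
    name, rest = value.split(":", 1)  # name is everything before the first ':'
    parts = rest.split(".")           # ['MY-Display-OS', '2022-3', '10', '25', '10086-1', 'myApplication']
    release = parts[1]
    version = ".".join(parts[2:-1])   # drop the first two and the last element
    return name, version, release

print(parse_package(file_data))  # ('MY_DISPLAY_OS', '10.25.10086-1', '2022-3')
```

Joining parts[2:-1] rather than fixed indexes keeps the version intact even if it gains or loses a numeric component.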
One way is using
QString::mid(int startIndex, int howManyChar);
so you probably want something like this:
QString fileData = "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication";
QString SoftwareName = fileData.mid(fileData.indexOf('=') + 1, fileData.lastIndexOf(':') - fileData.indexOf('=') - 1);
To extract the other parts you requested, and if the number of '.' characters stays the same in all the strings you want to parse, you can use the second argument of indexOf to shift the starting location and skip a known number of occurrences of '.'. For example, the version starts after the third '.' and ends at the last one:
int startIndex = -1;
for (int i = 0; i < 3; ++i) {
    startIndex = fileData.indexOf('.', startIndex + 1); // advance to the next '.'
}
++startIndex; // first character after the third '.'
int endIndex = fileData.lastIndexOf('.'); // the '.' before "myApplication"
should give the right indices to be cut out with
QString SoftwareVersion = fileData.mid(startIndex, endIndex - startIndex);
If the strings to be parsed are less consistent than this, try switching to regular expressions, which are the more flexible approach.
In my experience, using regular expressions for these types of tasks is generally simpler and more robust. You can do this with a regular expression as follows:
// Create the regular expression.
// Using C++ raw string literal to reduce use of escape characters.
QRegularExpression re(R"(.+=([\w_]+):[\w-]+\.(\d+-\d+)\.(\d+\.\d+\.\d+-?\d+))");
// Match against your string
auto match = re.match("SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication");
// Now extract the portions you are interested in
// match.captured(0) is always the full string that matched the entire expression
const auto softwareName = match.captured(1);
const auto version = match.captured(3);
const auto release = match.captured(2);
Of course for this to make sense, you have to understand regex, so here is my explanation of the regex used here:
.+=([\w_]+):[\w-]+\.(\d+-\d+)\.(\d+\.\d+\.\d+-?\d+)
.+=
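match everything up to and including an equals sign (.+ is greedy, so with several '=' it would run to the last one that still lets the rest of the pattern match)
([\w_]+)
capture one or more word characters (letters, digits, or underscores; \w already includes the underscore, so the explicit _ is redundant but harmless)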
:
a colon
[\w-]+\.
one or more alphanumeric or dash characters followed by a single period
(\d+-\d+)
capture one or more of digits followed by a dash followed by one or more digits
\.
a single period
(\d+\.\d+\.\d+-?\d+)
capture three sets of digits with periods in between, then an optional dash, and one or more digits (without a dash, the third set gives up its last digit to this final \d+, so the last set then needs at least two digits)
I think it is generally easier to make a regex that handles changes to the input (let's say the version becomes 10.25.10087) than to parse things manually by index.
Regex is a powerful tool once you get used to it, but it can certainly seem daunting at first.
Example of this regex on regex101.com: https://regex101.com/r/dj3Z4U/1
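For anyone who wants to experiment with the pattern without a Qt build, here is a small Python sketch using the same expression. Python's re module accepts these constructs unchanged; the parse function and the dictionary keys are names chosen just for this example:

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r".+=([\w_]+):[\w-]+\.(\d+-\d+)\.(\d+\.\d+\.\d+-?\d+)")

def parse(line):
    """Return the captured fields, or None when the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return {"name": m.group(1), "release": m.group(2), "version": m.group(3)}

line = "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication"
print(parse(line))
# {'name': 'MY_DISPLAY_OS', 'release': '2022-3', 'version': '10.25.10086-1'}
```

Checking for None before reading the groups mirrors what you would do with QRegularExpressionMatch::hasMatch() in the Qt version.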

C++: Regex pattern

I have a regex pattern: (~[A-Z]){10,30} (thanks to KekuSemau). I need to edit it so that it skips every other letter, as shown below.
Input: CABBYCRDCEBFYGGHQIPJOK
Output: A B C D E F G H I J K
Just match two letters each iteration but only capture the second part.
(?:~[A-Z](~[A-Z])){5,15}
live: https://regex101.com/r/pIAxH8/1
I cut the repetition count (the bit inside the {}'s) by half since the new regex is matching two at a time.
The ?: in (?:...) bit disables capturing of the group.
In regex only, there is no way you can achieve this directly.
But you can do this in code:
Use following regex:
(.(?<pick>[A-Z]))+
and in code loop over the captures of the desired group, for example in C#:
string value = "";
for (int i = 0; i < match.Groups["pick"].Captures.Count; i++)
{
    value += match.Groups["pick"].Captures[i].Value;
}
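In engines that do not keep every capture of a repeated group, the same "consume two, keep the second" idea can be applied one pair at a time. A small Python sketch follows; it uses . for the skipped character as in the answer above, leaves out the ~ from the original pattern (an assumption based on the sample input), and every_second_letter is a made-up name:

```python
import re

def every_second_letter(text):
    """Consume two characters per match and keep only the second one,
    which must be an uppercase letter."""
    return re.findall(r".([A-Z])", text)

print(" ".join(every_second_letter("CABBYCRDCEBFYGGHQIPJOK")))  # A B C D E F G H I J K
```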

Regex to match all occurrences that begin with n characters in sequence

I'm not sure if it's even possible for a regular expression to do this. Let's say I have a list of the following strings:
ATJFH
ABHCNEK
BKDFJEE
NCK
ABH
ABHCNE
KDJEWRT
ABHCN
EGTI
And I want to match all strings that consist of the first n characters, for any n, of the following string: ABHCNEK
The matches would be:
ABH
ABHCN
ABHCNE
ABHCNEK
I tried things like ^[A][B][H][C][N][E][K] and ^A[B[H[C[N[E[K]]]]]], but I can't seem to get it to work...
Can this be done in regex? If so, what would it be?
The simplest can be
^(?:ABHCNEK|ABHCNE|ABHCN|ABHC|ABH|AB|A)$
See demo.
https://regex101.com/r/eB8xU8/6
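A character class is shorter, but note that it accepts those letters in any order and any number of times, so it also matches NCK from the list:
^[ABHCNEK]+$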
You haven't said how you want to use it, but one option doesn't require regex. Loop through the strings and check whether each one is a prefix of your test string:
var strings = ['ATJFH', 'ABHCNEK', 'BKDFJEE', 'NCK', 'ABH', 'ABHCNE', 'KDJEWRT', 'ABHCN', 'EGTI'];
var test = 'ABHCNEK';
for (var i = 0; i < strings.length; i++) {
    // indexOf(...) === 0 means strings[i] occurs at the very start of test
    if (test.indexOf(strings[i]) === 0) {
        console.log(strings[i]);
    }
}
This returns:
ABHCNEK
ABH
ABHCNE
ABHCN
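The nested form the question was reaching for with ^A[B[H[C[N[E[K]]]]]] can be written with optional non-capturing groups instead of brackets. Here is a sketch in Python that builds it for any target; prefix_regex is a hypothetical helper and it assumes the target string is not empty:

```python
import re

def prefix_regex(target):
    """Return a compiled regex matching any non-empty prefix of `target`."""
    nested = ""
    # Wrap from the last character inward: ...(?:E(?:K)?)?
    for ch in reversed(target[1:]):
        nested = "(?:" + re.escape(ch) + nested + ")?"
    return re.compile("^" + re.escape(target[0]) + nested + "$")

strings = ['ATJFH', 'ABHCNEK', 'BKDFJEE', 'NCK', 'ABH', 'ABHCNE', 'KDJEWRT', 'ABHCN', 'EGTI']
pattern = prefix_regex("ABHCNEK")
print(pattern.pattern)                           # ^A(?:B(?:H(?:C(?:N(?:E(?:K)?)?)?)?)?)?$
print([s for s in strings if pattern.match(s)])  # ['ABHCNEK', 'ABH', 'ABHCNE', 'ABHCN']
```

Each optional group can only match if the one enclosing it matched, which is what enforces the left-to-right order.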

RegExp JS regarding sequential pattern matching

P.S.: I know there is an easy solution to my needs, and I could do it that way, but I am looking for a different solution for learning's sake and for the challenge. So this is just about solving an algorithm in a less traditional way.
I am working on solving an algorithm, and thought I had everything working well but one use case is failing. That is because I am building a regexp dynamically - now, my issue is this.
I need to match letters sequentially up until one doesn't match, then I just "match" what did match sequentially.
so... lets say I was matching this:
"zaazizz"
with this: /\bz[a]?[z]?/
"zizzi".match(/\bz[z]?[i]?/)
Currently, that matches "zi", but the match should only be "z".
zzi should only match "z" from the front of "zizzi", taking the letters in the order z, z, i. I know I am using [z]? etc., so each letter is optional, but what I really need is to match sequentially: a later letter should only match if every letter before it in my regex matched. So I need some sort of lookahead, or something else? I tried ?= and != with no luck.
I still think a non-regex approach is best here. Have a look at the following JS code:
var match = "abcdef";
var input = "abcxdef";
var mArray = match.split("");
var inArray = input.split("");
// Compare up to the length of the shorter string
var max = Math.min(mArray.length, inArray.length);
for (var i = 0; i < max; i++) {
    if (mArray[i] != inArray[i]) { break; }
}
input.substring(0, i);
Where match is the string to be partially matched, input is the input and input.substring(0, i) is the result of the matching part. And you can change match as often as you like.
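The same character-by-character comparison can be expressed without an explicit index. A short Python sketch, with common_prefix as an invented name:

```python
from itertools import takewhile

def common_prefix(pattern, text):
    """Return the longest run of leading characters that `pattern` and `text` share."""
    # zip stops at the shorter string; takewhile stops at the first mismatch
    pairs = takewhile(lambda pair: pair[0] == pair[1], zip(pattern, text))
    return "".join(a for a, _ in pairs)

print(common_prefix("abcdef", "abcxdef"))  # abc
print(common_prefix("zzi", "zizzi"))       # z
```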

Regex permutations without repetition [duplicate]

I need a regex to check whether an expression can be found in a string.
For the string "abc" I would like to match the first appearance of any of its permutations without repetition, in this case 6: abc, acb, bac, bca, cab, cba.
For example, in the string "adesfecabefgswaswabdcbaes" it would find a match at position 7.
I also need the same for a string with a repeated character, such as "abbc". There are 12 cases: acbb, abcb, abbc, cabb, cbab, cbba, bacb, babc, bcab, bcba, bbac, bbca
For example, in the string "adbbcacssesfecabefgswaswabdcbaes" it would find a match at position 3.
Also, I would like to know how this would work for similar cases.
EDIT
I'm not looking for a way to generate the permutations; I already have those. What I'm looking for is a way to check whether any of those permutations occurs in a given string.
EDIT 2
I think this regex covers my first question:
([abc])(?!\1)([abc])(?!\2|\1)[abc]
It can find all 6 permutations of "abc" in any sequence of characters.
Now I need to do the same when I have a repeated character, as in abbc (12 permutations).
([abc])(?!\1)([abc])(?!\2|\1)[abc]
You can use this without the g flag to get the position. See the demo. The position of the first group is what you want.
https://regex101.com/r/nS2lT4/41
https://regex101.com/r/nS2lT4/42
The only reason you might "need a regex" is if you are working with a library or tool which only permits specifying certain kinds of rules with a regex. For instance, some editors can be customized to color certain syntactic constructs in a particular way, and they only allow those constructs to be specified as regular expressions.
Otherwise, you don't "need a regex", you "need a program". Here's one:
// are two arrays equal?
function array_equal(a1, a2) {
    return a1.length === a2.length && a1.every(function(chr, i) { return chr === a2[i]; });
}
// are two strings permutations of each other?
function is_permutation(s1, s2) {
    return array_equal(s1.split('').sort(), s2.split('').sort());
}
// make a function which finds permutations in a string
function make_permutation_finder(chars) {
    var len = chars.length;
    return function(str) {
        // <= so that a permutation at the very end of str is also checked
        for (var i = 0; i <= str.length - len; i++) {
            if (is_permutation(chars, str.slice(i, i + len))) return i;
        }
        return -1;
    };
}
> finder = make_permutation_finder("abc");
> console.log(finder("adesfecabefgswaswabdcbaes"));
< 6
Regexps are far from being powerful enough to do this kind of thing.
However, there is an alternative, which is to precompute the permutations and build a regexp dynamically to find them. You did not provide a language tag, but here's an example in JS. Assuming you have the permutations and don't have to worry about escaping special regexp characters, that's just
regexp = new RegExp(permutations.join('|'));
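If the permutations are not at hand yet, both steps can be combined. The following Python sketch generates the distinct permutations, escapes them, and joins them into one alternation; permutation_regex and first_permutation_index are names made up for this example. It also covers the repeated-character case from the question, since duplicates are dropped before the pattern is built:

```python
import re
from itertools import permutations

def permutation_regex(chars):
    """Compile an alternation of all distinct permutations of `chars`."""
    # The set removes duplicates caused by repeated characters ("abbc" -> 12, not 24).
    distinct = sorted({"".join(p) for p in permutations(chars)})
    return re.compile("|".join(re.escape(p) for p in distinct))

def first_permutation_index(chars, text):
    """Return the 0-based index of the first permutation found, or -1."""
    m = permutation_regex(chars).search(text)
    return m.start() if m else -1

print(first_permutation_index("abc", "adesfecabefgswaswabdcbaes"))          # 6
print(first_permutation_index("abbc", "adbbcacssesfecabefgswaswabdcbaes"))  # 2
```

The indexes are 0-based, so they are one less than the positions quoted in the question. The number of alternatives grows factorially with the length of chars, so this is only reasonable for short strings.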