RegExp JS regarding sequential patttern matching - regex

P.S: --> I know there is an easy solution to my needs, and I can do it that way but, -- I am looking for a "diff" solution for learning sake & challenge sake. So, this is just to solve an algorithm in a lesser traditional way.
I am working on solving an algorithm, and thought I had everything working well but one use case is failing. That is because I am building a regexp dynamically - now, my issue is this.
I need to match letters sequentially up until one doesn't match, then I just "match" what did match sequentially.
so... lets say I was matching this:
"zaazizz"
with this: /\bz[a]?[z]?/
"zizzi".match(/\bz[z]?[i]?/)
currently, that is matching with a : [zi], but the match should only be [z]
zzi only matches "z" from the front of "zizzi", in that order zzi - I now I am using [z]? etc... so it is optional.. but what I really need is match sequentially.. I'd only get "zi" IF from the front, it matched: zzi per my regex.... so, some sort of lookahead or ?. I tried ?= and != no luck.

I still think a non-regex-approach is best here. Have a look at the following JS-Code:
var match = "abcdef";
var input = "abcxdef";
var mArray = match.split("");
var inArray = input.split("");
var max = Math.min(mArray.length, inArray.length) - 1;
for (var i = 0; i < max; i++) {
if (mArray[i] != inArray[i]) { break; }
}
input.substring(0, i);
Where match is the string to be partially matched, input is the input and input.substring(0, i) is the result of the matching part. And you can change match as often as you like.

Related

Regex to match all occurrences that begin with n characters in sequence

I'm not sure if it's even possible for a regular expression to do this. Let's say I have a list of the following strings:
ATJFH
ABHCNEK
BKDFJEE
NCK
ABH
ABHCNE
KDJEWRT
ABHCN
EGTI
And I want to match all strings that begin with any number of characters for the following string: ABHCNEK
The matches would be:
ABH
ABHCN
ABHCNE
ABHCNEK
I tried things like ^[A][B][H][C][N][E][K] and ^A[B[H[C[N[E[K]]]]]], but I can't seem to get it to work...
Can this be done in regex? If so, what would it be?
The simplest can be
^(?:ABHCNEK|ABHCNE|ABHCN|ABHC|ABH|AB|A)$
See demo.
https://regex101.com/r/eB8xU8/6
Use this regular expression:
^[ABHCNEK]+$
You haven't said how you want to use it, but one option doesn't require regex. Loop through the various strings and check for a match within your test string:
var strings = ['ATJFH', 'ABHCNEK', 'BKDFJEE', 'NCK', 'ABH', 'ABHCNE', 'KDJEWRT', 'ABHCN', 'EGTI'];
var test = 'ABHCNEK';
for (var i = 0; i < strings.length; i++) {
if (test.match(strings[i])) {
console.log(strings[i]);
}
}
This returns:
ABHCNEK
ABH
ABHCNE
ABHCN

Google sheet : REGEXREPLACE match everything except a particular pattern

I would try to replace everything inside this string :
[JGMORGAN - BANK2] n° 10 NEWYORK, n° 222 CAEN, MONTELLIER, VANNES / TARARTA TIs
1303222074, 1403281851 & 1307239335 et Cloture TIs 1403277567,
1410315029
Except the following numbers :
1303222074
1403281851
1307239335
1403277567
1410315029
I have built a REGEX to match them :
1[0-9]{9}
But I have not figured it out to do the opposite that is everything except all matches ...
google spreadsheet use the Re2 regex engine and doesn't support many usefull features that can help you to do that. So a basic workaround can help you:
match what you want to preserve first and capture it:
pattern: [0-9]*(?:[0-9]{0,9}[^0-9]+)*(?:([0-9]{9,})|[0-9]*\z)
replacement: $1 (with a space after)
demo
So probably something like this:
=TRIM(REGEXREPLACE("[JGMORGAN - BANK2] n° 10 NEWYORK, n° 222 CAEN, MONTELLIER, VANNES / TARARTA TIs 1303222074, 1403281851 & 1307239335 et Cloture TIs 1403277567, 1410315029"; "[0-9]*(?:[0-9]{0,9}[^0-9]+)*(?:([0-9]{9,})|[0-9]*\z)"; "$1 "))
You can also do this with dynamic native functions:
=REGEXEXTRACT(A1,rept("(\d{10}).*",counta(split(regexreplace(A1,"\d{10}","#"),"#"))-1))
basically it is first split by the desired string, to figure out how many occurrences there are of it, then repeats the regex to dynamically create that number of capture groups, thus leaving you in the end with only those values.
First of all thank you Casimir for your help. It gave me an idea that will not be possible with a built-in functions and strong regex lol.
I found out that I can make a homemade function for my own purposes (yes I'm not very "up to date").
It's not very well coded and it returns doublons. But rather than fixing it properly, I use the built in UNIQUE() function on top of if to get rid of them; it's ugly and I'm lazy but it does the job, that is, a list of all matches of on specific regex (which is: 1[0-9]{9}). Here it is:
function ti_extract(input) {
var tab_tis = new Array();
var tab_strings = new Array();
tab_tis.push(input.match(/1[0-9]{9}/)); // get the TI and insert in tab_tis
var string_modif = input.replace(tab_tis[0], " "); // modify source string (remove everything except the TI)
tab_strings.push(string_modif); // insert this new string in the table
var v = 0;
var patt = new RegExp(/1[0-9]{9}/);
var fin = patt.test(tab_strings[v]);
var first_string = tab_strings[v];
do {
first_string = tab_strings[v]; // string 0, or the string with the first removed TI
tab_tis.push(first_string.match(/1[0-9]{9}/)); // analyze the string and get the new TI to put it in the table
var string_modif2 = first_string.replace(tab_tis[v], " "); // modify the string again to remove the new TI from the old string
tab_strings.push(string_modif2);
v += 1;
}
while(v < 15)
return tab_tis;
}

RegEx to find words with characters

I've found answers to many of my questions here but this time I'm stuck. I've looked at 100's of questions but haven't found an answer that solves my problem so I'm hoping for your help :D
Considering the following list of words:
iris
iridium
initialization
How can I use regex to find words in this list when I am looking using exactly the characters u, i, i? I'm expecting the regex to find "iridium" only because it is the only word in the list that has two i's and one u.
What I've tried
I've been searching both here and elsewhere but haven't come across any that helps me.
[i].*[i].*[u]
matches iridium, as expected, and not iris nor initialization. However, the characters i, i, u must be in that sequence in the word, which may or may not be the case. So trying with a different sequence
[u].*[i].*[i]
This does not match iridium (but I want it to, iridium contains u, i, i) and I'm stuck for what to do to make it match. Any ideas?
I know I could try all sequences (in the example above it would be iiu; iui; uii) but that gets messy when I'm looking for more characters (say 6, tnztii which would match initialization).
[t].*[n].*[z].*[t].*[i].*[i]
[t].*[z].*[n].*[t].*[i].*[i]
[t].*[z].*[n].*[i].*[t].*[i]
..... (long list until)
[i].*[n].*[i].*[t].*[z].*[t] (the first matching sequence)
Is there a way to use regex to find the word, irrespective of the sequence of the characters?
I don't think there's a way to solve this with RegularExpressions which does not end in a horribly convoluted expression - might be possible with LookForward and LookBehind expressions, but I think it's probably faster and less messy if you simply solve this programmatically.
Chop the string up by its whitespaces and then iterate over all the words and count the instances your characters appear inside this word. To speed things up, discard all words with a length less than your character number requirement.
Is this an academic exercise, or can you use more than a single regular expression? Is there a language wrapped around this? The simplest way to do what you want is to have a regexp that matches just i or u, and examine (count) the matches. Using python, it could be a one-liner. What are you using?
The part you haven't gotten around to yet is that there might be additional i's or u's in the word. So instead of matching on .*, match on [^iu].
Here's what I would do:
Array.prototype.findItemsByChars = function(charGroup) {
console.log('charGroup:',charGroup);
charGroup = charGroup.toLowerCase().split('').sort().join('');
charGroup = charGroup.match(/(.)\1*/g);
for (var i = 0; i < charGroup.length; i++) {
charGroup[i] = {char:charGroup[i].substr(0,1),count:charGroup[i].length};
console.log('{char:'+charGroup[i].char+' ,count:'+charGroup[i].count+'}');
}
var matches = [];
for (var i = 0; i < this.length; i++) {
var charMatch = 0;
//console.log('word:',this[i]);
for (var j = 0; j < charGroup.length; j++) {
try {
var count = this[i].match(new RegExp(charGroup[j].char,'g')).length;
//console.log('\tchar:',charGroup[j].char,'count:',count);
if (count >= charGroup[j].count) {
if (++charMatch == charGroup.length) matches.push(this[i]);
}
} catch(e) { break };
}
}
return matches.length ? matches : false;
};
var words = ['iris','iridium','initialization','ulisi'];
var matches = words.findItemsByChars('iui');
console.log('matches:',matches);
EDIT: Let me know if you need any explanation.
I know this is a really old post, but I found this topic really interesting and thought people might look for a similar answer some day.
So the goal is to match all words with a specific set of characters in any order. There is a simple way to do this using lookaheads :
\b(?=(?:[^i\W]*i){2})(?=[^u\W]*u)\w+\b
Here is how it works :
We use one lookahead (?=...) for each letter to be matched
In this, we put [^x\W]*x where x is the the letter that must be present.
We then make this pattern occur n times, where n is the number of times that x must appear in th word using (?:...){n}
The resulting regex for a letter x having to appear n times in the word is then (?=(?:[^x\W]*x){n})
All you have to do then is to add this pattern for each letter and add \w+ at the end to match the word !

regex how can I split this word?

I have a list of several phrases in the following format
thisIsAnExampleSentance
hereIsAnotherExampleWithMoreWordsInIt
and I'm trying to end up with
This Is An Example Sentance
Here Is Another Example With More Words In It
Each phrase has the white space condensed and the first letter is forced to lowercase.
Can I use regex to add a space before each A-Z and have the first letter of the phrase be capitalized?
I thought of doing something like
([a-z]+)([A-Z])([a-z]+)([A-Z])([a-z]+) // etc
$1 $2$3 $4$5 // etc
but on 50 records of varying length, my idea is a poor solution. Is there a way to regex in a way that will be more dynamic? Thanks
A Java fragment I use looks like this (now revised):
result = source.replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
result = result.substring(0, 1).toUpperCase() + result.substring(1);
This, by the way, converts the string givenProductUPCSymbol into Given Product UPC Symbol - make sure this is fine with the way you use this type of thing
Finally, a single line version could be:
result = source.substring(0, 1).toUpperCase() + source(1).replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
Also, in an Example similar to one given in the question comments, the string hiMyNameIsBobAndIWantAPuppy will be changed to Hi My Name Is Bob And I Want A Puppy
For the space problem it's easy if your language supports zero-width-look-behind
var result = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "(?<=[a-z])([A-Z])", " $1");
or even if it doesn't support them
var result2 = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "([a-z])([A-Z])", "$1 $2");
I'm using C#, but the regexes should be usable in any language that support the replace using the $1...$n .
But for the lower-to-upper case you can't do it directly in Regex. You can get the first character through a regex like: ^[a-z] but you can't convet it.
For example in C# you could do
var result4 = Regex.Replace(result, "^([a-z])", m =>
{
return m.ToString().ToUpperInvariant();
});
using a match evaluator to change the input string.
You could then even fuse the two together
var result4 = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "^([a-z])|([a-z])([A-Z])", m =>
{
if (m.Groups[1].Success)
{
return m.ToString().ToUpperInvariant();
}
else
{
return m.Groups[2].ToString() + " " + m.Groups[3].ToString();
}
});
A Perl example with unicode character support:
s/\p{Lu}/ $&/g;
s/^./\U$&/;

VB.Net Matching and replacing the contents of multiple overlapping sets of brackets in a string

I am using vb.net to parse my own basic scripting language, sample below. I am a bit stuck trying to deal with the 2 separate types of nested brackets.
Assuming name = Sam
Assuming timeFormat = hh:mm:ss
Assuming time() is a function that takes a format string but
has a default value and returns a string.
Hello [[name]], the time is [[time(hh:mm:ss)]].
Result: Hello Sam, the time is 19:54:32.
The full time is [[time()]].
Result: The full time is 05/06/2011 19:54:32.
The time in the format of your choice is [[time([[timeFormat]])]].
Result: The time in the format of your choice is 19:54:32.
I could in theory change the syntax of the script completely but I would rather not. It is designed like this to enable strings without quotes because it will be included in an XML file and quotes in that context were getting messy and very prone to errors and readability issues. If this fails I could redesign using something other than quotes to mark out strings but I would rather use this method.
Preferably, unless there is some other way I am not aware of, I would like to do this using regex. I am aware that the standard regex is not really capable of this but I believe this is possible using MatchEvaluators in vb.net and some form of recursion based replacing. However I have not been able to get my head around it for the last day or so, possibly because it is hugely difficult, possibly because I am ill, or possibly because I am plain thick.
I do have the following regex for parts of it.
Detecting the parentheses: (\w*?)\((.*?)\)(?=[^\(+\)]*(\(|$))
Detecting the square brackets: \[\[(.*?)\]\](?=[^\[+\]]*(\[\[|$))
I would really appreciate some help with this as it is holding the rest of my project back at the moment. And sorry if I have babbled on too much or not put enough detail, this is my first question on here.
Here's a little sample which might help you iterate through several matches/groups/captures. I realize that I am posting C# code, but it would be easy for you to convert that into VB.Net
//these two may be passed in as parameters:
string tosearch;//the string you are searching through
string regex;//your pattern to match
//...
Match m;
CaptureCollection cc;
GroupCollection gc;
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
m = r.Match(tosearch);
gc = m.Groups;
Debug.WriteLine("Number of groups found = " + gc.Count.ToString());
// Loop through each group.
for (int i = 0; i < gc.Count; i++)
{
cc = gc[i].Captures;
counter = cc.Count;
int grpnum = i + 1;
Debug.WriteLine("Scanning group: " + grpnum.ToString() );
// Print number of captures in this group.
Debug.WriteLine(" Captures count = " + counter.ToString());
if (cc.Count >= 1)
{
foreach (Capture cap in cc)
{
Debug.WriteLine(string.format(" Capture found: {0}", cap.ToString()));
}
}
}
Here is a slightly simplified version of the code I wrote for this. Thanks for the help everyone and sorry I forgot to post this before. If you have any questions or anything feel free to ask.
Function processString(ByVal scriptString As String)
' Functions
Dim pattern As String = "\[\[((\w+?)\((.*?)\))(?=[^\(+\)]*(\(|$))\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processFunction(match)))
' Variables
pattern = "\[\[([A-Za-z0-9+_]+)\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processVariable(match)))
Return scriptString
End Function
Function processFunction(ByVal match As Match)
Dim nameString As String = match.Groups(2).Value
Dim paramString As String = match.Groups(3).Value
paramString = processString(paramString)
Select Case nameString
Case "time"
Return getLocalValueTime(paramString)
Case "math"
Return getLocalValueMath(paramString)
End Select
Return ""
End Function
Function processVariable(ByVal match As Match)
Try
Return moduleDictionary("properties")("vars")(match.Groups(1).Value)
Catch ex As Exception
End Try
End Function