A little misunderstanding about this regex pattern - regex

Let H be column 1, E be column 2, L column 3, P 4
I understand where the H comes from.
I also see how the L works.
But I am a bit confused on E and P.
If we look horizontally, the regex HE|LL|O+ only matches {HE, LL, O (1 or more times)}
The regex EP|IP|EF matches {EP, IP, EF}
How is it that the string E matches both of these conditions?
Similarly, [PLEASE]+ matches any combination of the letters {P, L, E, A, S}, and the only overlap with the vertical regex is EP, so why is there just a P?
Am I reading this incorrectly? This was taken from regexcrossword

I think you misunderstand the nature of the crossword.
The string HE matches HE|LL|O+
The string LP matches [PLEASE]+
The string HL matches [^SPEAK]+
The string EP matches EP|IP|EF
Each row and column matches its regex, so the solution is valid.
Like, the following statement doesn't make sense...
How is it that the string E matches both of these conditions?
There is no string E. There are two strings, HE and EP.
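The check described above can be reproduced with a short script. This is only a sketch in Python: the 2x2 grid (H and E on the top row, L and P on the bottom row) and the four patterns come from the question, while the helper name check_grid and the use of re.fullmatch are my own choices for illustration.

```python
import re

# 2x2 grid from the question: top row "HE", bottom row "LP".
grid = ["HE", "LP"]

row_patterns = [r"HE|LL|O+", r"[PLEASE]+"]
col_patterns = [r"[^SPEAK]+", r"EP|IP|EF"]


def check_grid(grid, row_patterns, col_patterns):
    """Return True if every row and every column fully matches its regex."""
    rows = grid
    cols = ["".join(chars) for chars in zip(*grid)]  # "HL", "EP"
    rows_ok = all(re.fullmatch(p, s) for p, s in zip(row_patterns, rows))
    cols_ok = all(re.fullmatch(p, s) for p, s in zip(col_patterns, cols))
    return rows_ok and cols_ok


print(check_grid(grid, row_patterns, col_patterns))
```

Each regex is applied to a whole row or column string, never to a single cell, which is why the cell E only has to be part of HE (row) and EP (column).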

Regex capture required and optional characters in any position only

I would like to match words made up only of a given set of characters, in any order, where one of those letters is required.
Example:
Optional letters: yujkfec
Required letter: d
Matches: duck dey feed yudekk dude jude dedededy jejeyyyjd
No matches (do not contain required): yuck feck
No matches (contain letters outside of set): sucked shock blah food bard
I've tried ^[d]+[yujkfec]*$ but this only matches when the required letter is in the front. I've tried positive lookaheads but this didn't do much.
You can use
\b[yujkfec]*d[dyujkfec]*\b
See the regex demo. Note that the d is also included in the second character class.
Details:
\b - word boundary
[yujkfec]* - zero or more occurrences of y, u, j, k, f, e or c
d - a d char
[dyujkfec]* - zero or more occurrences of y, u, j, k, f, e, c or d.
\b - a word boundary.
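As a quick check, the pattern can be run against the sample words from the question. This is a sketch using Python's re module; the variable names are illustrative only.

```python
import re

# Required letter d, optional letters yujkfec (from the question).
pattern = re.compile(r"\b[yujkfec]*d[dyujkfec]*\b")

text = (
    "duck dey feed yudekk dude jude dedededy jejeyyyjd "
    "yuck feck sucked shock blah food bard"
)

# findall returns every whole word that the pattern accepts.
matches = pattern.findall(text)
print(matches)
```

Only the eight words from the "Matches" line should be returned; the words without a d, or with letters outside the set, fail because the word boundaries force the whole word to fit the pattern.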

regex with all components optionals, how to avoid empty matches

I have to process a comma-separated string which contains triplets of values and translate them to runtime types. The input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case. The expression almost works, but it yields twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
It works better than the first version, but it still matches the empty string, and I would first have to tokenize the input and pass each token to the expression so that the test string can match the begin (^) and end ($) anchors.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading about (and trying to understand) the regex lookahead concept, and with the help of Casimir et Hippolyte's answer, I've tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! It detects complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h", which weren't intended to be matched at all). Unfortunately it captures an empty match every time an invalid token is found, so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688". So I modified the lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}, so the empty matches just before "adfank" and "12322134445688" are gone, but the ones just before "1s1m1h" and "1h1h1h" are still detected.
So the question is: is there any regular expression which matches a triplet of values in a given order, where every component is optional but at least one must be present, and which doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the beginning to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (that is, without tokenizing first), you can remove the anchors and use a more explicit subpattern in the lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positives (since you are looking for very small strings that can be part of something else), you can add word boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
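The anchored variant can be tried on the test string from the question. The sketch below uses Python's re module rather than C++11 std::regex, assumes the input is tokenized on commas first, and picks the Time letters (h, m, s) for A, B and C; the variable names are illustrative.

```python
import re

# Time variant of the pattern: A, B, C replaced by h, m, s.
# (?=.) requires at least one character, so the empty string is rejected.
time_re = re.compile(r"^(?=.)((?:\d+h)?(?:\d+m)?(?:\d+s)?)$")

test = (
    "48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,"
    "1s,1m,1443s,adfank,12322134445688,48h"
)

# Tokenize on commas, then keep only the tokens the anchored pattern accepts.
valid = [tok for tok in test.split(",") if time_re.match(tok)]
print(valid)
```

Because each token must match from ^ to $, tokens with wrong ordering ("1s1m1h") or repeated components ("1h1h1h") are rejected outright instead of producing an empty match.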
I think this might do the trick.
I am keying on either the beginning of the string (^) or the comma separator (,) to fix the start of each match: (?:^|,).
Example:
#include <iostream>
#include <regex>
#include <string>

// Each token starts at the beginning of the string or right after a comma.
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");

int main()
{
    std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
    std::sregex_iterator iter(test.begin(), test.end(), r);
    std::sregex_iterator end_iter;
    for (; iter != end_iter; ++iter)
        std::cout << iter->str(1) << '\n'; // capture group 1 holds the token
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then, as far as I can tell, you have to spell out every non-empty combination like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");
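The same alternation can be generated instead of written by hand. The following is a sketch in Python, not a port of the C++ code; the names parts, alternatives and token_re are made up for illustration, and itertools.combinations is used because it keeps the components in their original order.

```python
import re
from itertools import combinations

# Same three components as the C++ snippet above.
parts = [r"(?:\d+[xrh])", r"(?:\d+[ygm])", r"(?:\d+[zbs])"]

# All non-empty, order-preserving subsets, longest first:
# ABC, AB, AC, BC, A, B, C
alternatives = [
    "".join(combo)
    for size in range(len(parts), 0, -1)
    for combo in combinations(parts, size)
]

token_re = re.compile(r"(?:^|,)(" + "|".join(alternatives) + r")")

print(token_re.findall("1x2y3z,80r160g255b,48h30m50s,1x3z,255b"))
print(token_re.findall("1x,,3z"))
```

Since every alternative contains at least one component, an empty token between two commas yields no match. Note that, like the original, this pattern has no trailing anchor, so a wrongly ordered token can still match on its leading component.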

Regular Expression: search multiple string with linefeed delimited by ";"

I have a string like this that describes a structured data source:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
Every field...
... is starting with "FieldName"
... is ending with ";"
... may contain linefeed
I need a regular expression that finds the values of SampleTestPlan, which appears twice. So...
1st value is:
2
a b
c d
2nd value is
3
e f
g h
i l
I've performed several attempts with such search string:
/SampleTestPlan(.\s)/gm
/SampleTestPlan(.\s);/gm
/SampleTestPlan(.*);/gm
but I need to understand much better how regular expressions work, as I'm definitely a newbie with them and have a lot to learn.
Thanks in advance to anyone that may help me!
Stefano, Milan, ITALY
You could use the following regex:
(?<=\w\b)[^;]+(?=;)
See it working live here on regex101!
How it works:
It matches everything that is:
preceded by a word character sitting at a word boundary, i.e. the end of a field name: \w\b
followed by a ;
contains anything (at least one character) except a ; (including newlines).
For example, for that input:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
It matches 5 times:
whocares
then:
2
a b
c d
then:
abc
then:
3
e f
g h
i l
then:
01
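The five matches listed above can be reproduced with a few lines of Python. This is a sketch that assumes the sample input is stored in a string called data (a name chosen here); the raw matches start with the whitespace that follows the field name, so they are stripped before printing.

```python
import re

data = """Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
"""

# The lookbehind anchors the match at the end of a field name;
# [^;]+ then consumes everything (newlines included) up to the next ";".
raw = re.findall(r"(?<=\w\b)[^;]+(?=;)", data)
field_values = [value.strip() for value in raw]
print(field_values)
```

EndOfFile produces no match because nothing stands between the field name and its semicolon.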
Assuming your input is always well formatted like the sample, try this:
/SampleTestPlan(\s+\d+.*?);/sg
Here, the /s modifier means that the dot also matches newline characters.
You can try this online.
That would be /SampleTestPlan([^;]+)/g. [^abc] means any character which is not a, b or c.
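Both SampleTestPlan-specific suggestions can be compared side by side. The sketch below is in Python, where re.DOTALL plays the role of the /s modifier and re.findall replaces the /g flag; the field name is spelled SampleTestPlan as in the question, and the variable names are illustrative.

```python
import re

data = """Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
"""

# Variant 1: lazy dot with DOTALL, stopping at the first ";".
with_dotall = re.findall(r"SampleTestPlan(\s+\d+.*?);", data, flags=re.DOTALL)

# Variant 2: negated character class, which needs no flag.
with_negated_class = re.findall(r"SampleTestPlan([^;]+)", data)

# Drop the whitespace that separates the field name from its value.
values = [v.strip() for v in with_negated_class]
print(values)
```

On this input the two variants capture the same text; the negated class is the simpler of the two because it does not depend on how the dot treats newlines.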

Regular Expression - All words that begin and end in different letters

I'm having trouble with this regular expression:
Construct a regular expression defining the following language over alphabet
Σ = { a,b }
L6 = {All words that begin and end in different letters}
Here are some examples of regular expressions I was able to solve:
1. L1 = {all words of even length ending in ab}
(aa + ab + ba + bb)*(ab)
2. L2 = {all words that DO NOT have the substring ab}
b*a*
Would this work:
(a.*b)|(b.*a)
Or said in Kleene way:
a(a+b)*b+b(a+b)*a
This should do it:
"^((a.*b)|(b.*a))$"
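One way to gain confidence in such an expression is to compare it against the definition of the language for every short word. The sketch below does this in Python for all words up to length 8; the helper names in_language and all_words are hypothetical, and the dot is written as [ab] since the alphabet is {a, b}.

```python
import re
from itertools import product

# Over the alphabet {a, b}, "." can be written as the class [ab].
pattern = re.compile(r"a[ab]*b|b[ab]*a")


def in_language(word):
    """Reference definition: non-empty word whose first and last letters differ."""
    return len(word) > 0 and word[0] != word[-1]


def all_words(max_len):
    """Yield every word over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)


# Words on which the regex and the definition disagree (expected: none).
mismatches = [
    w for w in all_words(8)
    if bool(pattern.fullmatch(w)) != in_language(w)
]
print(mismatches)
```

An empty list of mismatches is not a proof, but it does rule out simple mistakes such as accepting the empty word or a single letter.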
1- Write a regular expression for each of the following languages: (a) the language of all strings which end with the substring 'ab' and have odd length; (b) the language of all strings which do not contain the substring 'abb'.
2- Construct a deterministic FSA for each of the following languages: (a) the language of all strings in which the second-to-last symbol is 'b'; (b) the language of all strings whose length is odd but which contain an even number of b's.
(aa+ab+ba+bb)*(a+b)ab
The first part matches any even-length prefix over {a, b}, (a+b) adds one more symbol, and the word then ends with ab, so the total length is odd.

Regular expression circular matching

Using regular expressions (in any language), is there a way to match a pattern that wraps around from the end to the beginning of a string? For example, if I want to match the pattern:
"street"
against the string:
m = "et stre"
it would match m[3:] + m[:2]
You can't do that directly in the regexp. What you can do is some arithmetic. Append the string to itself:
m = "et stre"
n = m + m //n = "et street stre"
If there is an odd number of matches in n (in this case, 1), the match was 'circular'. If not, there were no circular matches, and the number of matches in n is double the number of matches in m.
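The doubling trick can be sketched in Python. Note that this sketch does not count matches as described above; instead it tries a match at every start position inside the original string and reports whether the match runs past the original end. The function name circular_matches and its return format are assumptions made for illustration.

```python
import re


def circular_matches(pattern, m):
    """Return (start, text, wraps) for matches of pattern in m, allowing wrap-around."""
    doubled = m + m
    compiled = re.compile(pattern)
    results = []
    for start in range(len(m)):
        match = compiled.match(doubled, start)
        # Ignore empty matches and matches longer than the original string.
        if match and 0 < len(match.group()) <= len(m):
            wraps = match.end() > len(m)
            results.append((start, match.group(), wraps))
    return results


print(circular_matches("street", "et stre"))
```

For the example from the question, the match starts at index 3 and wraps, which corresponds to m[3:] + m[:2]. A greedy pattern that overshoots the original length is simply discarded here, so this is a starting point rather than a complete solution.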