Regexp to search names of books (Delphi 7 & DiRegEx 8.8.1) - regex

I am using Delphi 7 and this is first time I am using the library DiRegEx.
What I need to do is collect names of the books which are in a list. The list is long but to have an idea it looks like this:
2 Tesalonickým 3:14
2 Tesalonickým 3:15
2 Tesalonickým 3:16
2 Tesalonickým 3:17
2 Tesalonickým 3:18
1 Timoteovi 1:1
1 Timoteovi 1:2
1 Timoteovi 1:3
1 Timoteovi 1:4
So what I want to find by RegEx.Match is the '2 Tesalonickým' and '1 Timoteovi' strings. So I want to search for ^some string\d\d?\d?:\d\d?\d? ...
My code is:
var
contents : TStringList;
RegEx: TDIRegEx;
WordCount: Integer;
s:string;
begin
Contents := TStringList.Create;
RegEx := TDIPerlRegEx.Create{$IFNDEF DI_No_RegEx_Component}(nil){$ENDIF};
Contents.LoadFromFile('..\reference dlouhé CS.txt');
for i:=0 to Contents.count-1 do
begin
Contents[i];
try
RegEx.SetSubjectStr(Contents[i]);
RegEx.MatchPattern := '\w+';
WordCount := 0;
if RegEx.Match(0) >= 0 then
begin
repeat
Inc(WordCount);
s := RegEx.MatchedStr;
WriteLn(WordCount, ' - ', s);
until RegEx.MatchNext < 0;
end;
finally
RegEx.Free;
end; // end try
end; // end for
end;
And I need to modify the regex so the \d\d?\d?:\d\d?\d? won't be in the result, but should be an "anchor" or a "needle". How to make the regexp?
Result:
This is a complete list of 66 books of bible in UTF-8. There were some problems with the \w pattern because this dos not include characters like Ž or š.
Genesis;Exodus;Leviticus;Numeri;Deuteronomium;Jozue;Soudců;Rút;1 Samuelova;2 Samuelova;1 Královská;2 Královská;1 Paralipomenon;2 Paralipomenon;Ezdráš;Nehemjáš;Ester;Jób;Žalmy;Přísloví;Kazatel;Píseň písní;Izajáš;Jeremjáš;Pláč;Ezechiel;Daniel;Ozeáš;Jóel;Ámos;Abdijáš;Jonáš;Micheáš;Nahum;Abakuk;Sofonjáš;Ageus;Zacharjáš;Malachiáš;Matouš;Marek;Lukáš;Jan;Skutky apoštolské;Římanům;1 Korintským;2 Korintským;Galatským;Efezským;Filipským;Koloským;1 Tesalonickým;2 Tesalonickým;1 Timoteovi;2 Timoteovi;Titovi;Filemonovi;Židům;Jakubův;1 Petrův;2 Petrův;1 Janův;2 Janův;3 Janův;Judův;Zjevení Janovo;

You may use
(*UCP)^(?:\d+\s+)?\w+(?=\s+\d\d?\d?:\d)
Or
(*UCP)^(?:\d+\s+)?\w+(?=\s+\d{1,3}:\d)
A (*UCP) at the pattern start (PCRE verb) to make all shorthands Unicode-aware.
The patterns match
^ - start of the string
(?: - start of a non-capturing group
\d+ - 1+ digits,
\s+ - 1+ whitespaces and
)? - end of non-capturing group, 1 or 0 occurrences (? makes it optional)
\w+ - 1+ word chars...
(?=\s+\d{1,3}:\d) - followed with 1+ whitespaces, 1 to 3 digits, : and a digit.
See the regex demo.
The \w might need replacing with \p{L} if you only need to match letters.

Related

How to match in a single/common Regex Group matching or based on a condition

I would like to extract two different test strings /i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
and
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8
with a single RegEx and in Group-1.
By using this RegEx ^.[i,na,fm,d]+\/(.+([,\/])?(\/|.+=.+,\/).+\/[,](live.([^,]).).+_)?.+(640).*$ I can get the second string to match the desired result int/2021/11/25/,live_20211125_215206_
but the first string does not match in Group-1 and the missing expected test string 1 extraction is int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45
Any pointers on this is appreciated.
Thanks!
If you want both values in group 1, you can use:
^/(?:[id]|na|fm)/([^/\s]*/\d{4}/\d{2}/\d{2}/\S*?)(?:/,|[^_]+_)640(?:\D|$)
The pattern matches:
^ Start of string
/ Match literally
(?:[id]|na|fm) Match one of i d na fm
/ Match literally
( Capture group 1
[^/\s]*/ Match any char except a / or a whitespace char, then match /
\d{4}/\d{2}/\d{2}/ Match a date like pattern
\S*? Match optional non whitespace chars, as few as possible
) Close group 1
(?:/,|[^_]+_) Match either /, or 1+ chars other than _ and then match _
640 Match literally
(?:\D|$) Match either a non digits or assert end of string
See a regex demo and a go demo.
We can't know all the rules of how the strings your are matching are constructed, but for just these two example strings provided:
package main
import (
"fmt"
"regexp"
)
func main() {
var re = regexp.MustCompile(`(?m)(\/i/int/\d{4}/\d{2}/\d{2}/.*)(?:\/,|_[\w_]+)640`)
var str = `
/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8`
match := re.FindAllStringSubmatch(str, -1)
for _, val := range match {
fmt.Println(val[1])
}
}

Regex that does not accept sub strings of more than two 'b'

I need a regex that accepts all the strings consisting only of characters a and b, except those with more than two 'b' in a row.
For example, these should not match:
abb
ababbb
bba
bbbaa
bbb
bb
I came up with this, but it's not working
[a-b]+b{2,}[a-b]*
Here is my code:
int main() {
string input;
regex validator_regex("\b(?:b(?:a+b?)*|(?:a+b?)+)\b");
cout << "Hello, "<<endl;
while(regex_match(input,validator_regex)==false){
cout << "please enter your choice of regEx :"<<endl;
cin>>input;
if(regex_match(input,validator_regex)==false)
cout<<input+" is not a valid input"<<endl;
else
cout<<input+" is valid "<<endl;
}
}
Your pattern [a-b]+b{2,}[a-b]* matches 1 or more a or b chars until you match bb which is what you don't want. Also note that the string should be at least 3 characters long due to this part [a-b]+b{2,}
To not match 2 b chars in a row you can exclude those matches using a negative lookahead by matching optional chars a or b until you encounter bb
Note that [a-b] is the same as [ab]
\b(?![ab]*?bb)[ab]+\b
\b A word boundary
(?![ab]*?bb) Negative lookahead, assert not 0+ times a or b followed by bb to the right
[ab]+ Match 1+ occurrences of a or b
\b A word boundary
Regex demo
Without using lookarounds, you can match the strings that you don't want by matching a string that contains bb, and capture in group 1 the strings that you want to keep:
\b[ab]*bb[ab]*\b|\b([ab]+)\b
Regex demo
Or use an alternation matching either starting with b and optional repetitions of 1+ a chars followed by an optional b, or match 1+ repetitions of starting with a followed by an optional b
\b(?:b(?:a+b?)*|(?:a+b?)+)\b
Regex demo
The simplest regex is:
^(?!.*bb)[ab]+$
See live demo.
This regex works by adding a negative look ahead (anchored to start) for bb appearing anywhere within input consisting of a or b.
If zero length input should match, change [ab]+ to [ab]*.

Find if either followed by non number or end of file

I want to match the string b5 with optional $ in front of the b or tha 5 :
=b5
b$5
= $b$5
($b5)
But the 5 can't be followed by any number . And the b can't be preceded by any alphabet. So this should return false :
b55
ab5
I tried this :
\W\$*b\$*5\W
it works fine. i will match X=($b$5) but the problem is : it won't match anymore if the '5' is the last character in the line.
because 5 is last character
You can use
(?:\W|^)\$*b\$*5(?:\W|$)
(?:\W|^)\$*b\$*5\b
See the RE2 regex demo.
Details
(?:\W|^) - a non-capturing group matching either a non-word char or start of string
\$* - zero or more $ chars
b - a b char
\$* - zero or more $ chars
5 - a 5 char
(?:\W|$) - a non-capturing group matching either a non-word char or end of string or
\b - a word boundary.

Using regexp to find a reoccurring pattern in MATLAB

input = ' 12Z taj 20501 jfdjda OCNL jtjajd ptpa 23Z jfdakdkf tjajdfk OCNL fdkadja 02Z fdjafsdk fkdsafk OCNL fdkafk dksakj = '
using regexp
regexp(input,'\s\d{2,4}Z\s.*(OCNL)','match')
I'm trying to get the output
[1,1] = 12Z taj 20501 jfdjda OCNL jtjajd ptpa
[1,2] = 23Z jfdakdkf tjajdfk OCNL fdkadja
[1,3] = 02Z fdjafsdk fkdsafk OCNL fdkafk dksakj
You may use
(?<!\S)\d{2,4}Z\s+.*?\S(?=\s\d{2,4}Z\s|\s*=\s*$)
See the regex demo.
Details
(?<!\S) - there must be a whitespace or start of string immediately to the left of the current location
\d{2,4} - 2, 3 or 4 digits
Z - a Z letter
\s+ - 1+ whitespaces
.*?\S - any zero or more chars as few as possible and then a non-whitespace
(?=\s\d{2,4}Z\s|\s*=\s*$) - there must be either of the two patterns immediately to the right of the current location:
\s\d{2,4}Z\s - a whitespace, 2, 3 or 4 digits, Z and a whitespace
| - or
\s*=\s*$ - a = enclosed with 0+ whitespace chars at the end of the string.

Replacement of strings within 2 strings in regex

I have a string:
dkj a * & &*(&(*(
//#HELLO
^%#&UJNWDUK()C*(v 8*J DK*9
//#HE#$^&&(akls#$98akdjl ak##sjdkja
//
%^&*(//#HELLO//#BYE<><>
//#BYE
^%#&UJNWDUK()C*(v 8*J DK*90K )
//#HELLO
&*^J$XUK 8j8 j jk kk8(&*(
//#BYE
and I need to have 2 groups such as each group must start with //HELLO then there should be a next line and any type of text can follow (.*) but it will end with a //BYE preceded by a line:
1)
//#HELLO
^%#&UJNWDUK()C*(v 8*J DK*9
//#HE#$^&&(akls#$98akdjl ak##sjdkja
//
%^&*(//#HELLO//#BYE<><>
//#BYE
2)
//#HELLO
&*^J$XUK 8j8 j jk kk8(&*(
//#BYE
and replaces the original string to this: (basically adding // to each line of each group)
dkj a * & &*(&(*(
////#HELLO
//^%#&UJNWDUK()C*(v 8*J DK*9
////#HE#$^&&(akls#$98akdjl ak##sjdkja
////
//%^&*(//#HELLO//#BYE<><>
////#BYE
^%#&UJNWDUK()C*(v 8*J DK*90K )
////#HELLO
//&*^J$XUK 8j8 j jk kk8(&*(
////#BYE
Here is my current progress:
I have
\/\/#HELLO\n.*?\/\/#BYE[\n$]
However im not sure how to go about the replacement, I'm thinking separating each line per group using \G after the //#HELLO and ending with //#BYE
It's a bit complex, but this will do it:
Search: (?m)(//#HELLO[\r\n]+|\G(?://#BYE|(?=(?:[^#]|#(?!HELLO[\r\n]+))*#BYE)[^\r\n]*[\r\n]*))
Replace: //$1
In Groovy:
String resultString = subjectString.replaceAll(/(?m)(\/\/#HELLO[\r\n]+|\G(?:\/\/#BYE|(?=(?:[^#]|#(?!HELLO[\r\n]+))*#BYE)[^\r\n]*[\r\n]*))/, '//$1');
For grouping into separate lines use the following regex:
//#HELLO\r(.*[\n\r]+)*//#BYE\r?
\r - Newline character
[\n\r] - Enter characters
*? - Non-greedy match
?- Match 1 or 0 times
You can take out the ? at the end if it always ends with a newline.
You can then use the group (The value inside the brackets) to search and replace.