How to match a repeated word separated by a delimiter - regex

I want to create a regexp that accepts these values:
number:number, optionally followed by [P], [K], or both, and this whole item can repeat, separated by the delimiter [ + ]. For example, valid values are:
15:15
1:0
1:2 K
1:3 P
1:4 P K
3:4 + 3:2
34:14 P K + 3:1 P
What I created is this:
([0-9]+:[0-9]+( [K])?( [P])?( [+] )?)+
This example has just one mistake. It accepts the value:
15:15 K P +
which shouldn't be allowed.
How should I change it?
UPDATE:
I forgot to mention that it can be K P or P K, so values like this are also valid:
1:4 K P

Try this regex:
^([0-9]+:[0-9]+(?: P)?(?: K)?(?: \+ [0-9]+:[0-9]+(?: P)?(?: K)?)*)$
Online tryout
UPDATE: Based on your comment, you can use this one to allow either order, but it will also match P P or K K:
^([0-9]+:[0-9]+(?: [KP]){0,2}(?: \+ [0-9]+:[0-9]+(?: [KP]){0,2})*)$

This regex supports any order for K and P:
^[0-9]+:[0-9]+( P| K| K P| P K)?( \+ [0-9]+:[0-9]+( P| K| K P| P K)?)*$
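A minimal Python sketch of the same idea, assuming Python's re module rather than whatever engine the question targets. The item pattern is written once and reused for the repeats, and fullmatch stands in for the ^...$ anchors. The names ITEM, PATTERN and is_valid are illustrative only.

```python
import re

# One item: number:number, optionally followed by P, K, "K P" or "P K".
ITEM = r"[0-9]+:[0-9]+(?: P| K| K P| P K)?"

# One item, then zero or more " + item" repeats; a trailing " + " cannot match.
PATTERN = re.compile(rf"{ITEM}(?: \+ {ITEM})*")


def is_valid(value: str) -> bool:
    """Return True when the whole string is a valid sequence of items."""
    return PATTERN.fullmatch(value) is not None


for sample in ["15:15", "1:4 K P", "34:14 P K + 3:1 P", "15:15 K P + "]:
    print(repr(sample), is_valid(sample))
```

Because the delimiter is only accepted when another item follows it, the value with a dangling " + " is rejected.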

How about:
^(\d+:\d+(?:(?: P)?(?: K)?|(?: K)?(?: P)?)?)(?:\s\+\s(?1))?$
Explanation:
^ : start of string
( : start capture group 1
\d+:\d+ : digits followed by colon followed by digits
(?: : non capture group
(?: P)? : P in a non capture group optional
(?: K)? : K in a non capture group optional
| : OR
(?: K)? : K in a non capture group optional
(?: P)? : P in a non capture group optional
)? : optional
) : end of group 1
(?: : non capture group
\s\+\s : space plus space
(?1) : same regex as group 1
)? : end of non capture group optional
$ : end of string

You can use this pattern:
^(?:[0-9]+:[0-9]+(?:( [KP])(?!\1)){0,2}(?: \+ |$))+$
pattern details:
^
(?: # this group describes one item with the optional +
[0-9]+:[0-9]+
(?: # describes the KP part
( [KP])(?!\1) # captures the current K or P and checks that it is not followed by itself
){0,2} # repeat zero, one or two times
(?: \+ |$) # the item ends with + or the end of the string
)+$ # repeat the item group
in Java style:
^(?:[0-9]+:[0-9]+(?:( [KP])(?!\\1)){0,2}(?: \\+ |$))+$

Related

How to get lines until an empty newline

I want to get a block of lines that contains a < or > operator, up to the next empty line.
I tried this regex: .*[<>][^,\r\n]+?\(.*\S.*,.*\S.*\).*(?:(\n).*)
You can find my example here: https://regex101.com/r/UQYLB5/1/
Expected Result :
MATCH 1 :
BAR18>17M(3,5.2)V
MATCH 2 :
BAR19>1.243037M(3,5.2)V
INFORMATION PROCESS
TAKE B/F: 19V[1]
LIGHT PC CARD:
MATCH 3 :
TEFAL17>1.262259M(4.5,5.5)V
SISS17 : 1789-ID
LIGHT 19/17
MAPPING NICE :
MATCH 4 :
MASCARPONE19>493.818969M(3,5.2)V
BATA17 : CDER78945 -- 1875
LEFT ERREUR - CAME BACK
MATCH 5 :
REPAR_178>748.515487M(4.5,5.5)V
CHAN1 / STEREO MIX
If you don't want to match lines that consist of spaces only, you could match either < or > on the first line and require at least one non-whitespace char \S in each of the following lines:
^[^<>\r\n]*[<>].*(?:\r?\n[^\r\n\S]*\S.*)*
The pattern will match:
^ Start of string
[^<>\r\n]* Match any char except <, > or a newline, 0+ times
[<>].* Match either < or > and the rest of the line
(?: Non capture group
\r?\n Match a newline
[^\r\n\S]* Match 0+ whitespace chars except newlines
\S.* Match a non whitespace char and the rest of the line
)* Close the group and repeat 0+ times
Regex demo
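As a rough illustration, the pattern can be tried in Python as follows. This is a sketch under the assumption that Python's re module behaves like the demo engine here; re.MULTILINE is needed so that ^ matches at the start of every line, and the sample text is a shortened, partly made-up version of the question's data.

```python
import re

# Same pattern as above; re.MULTILINE lets ^ match at the start of every line.
BLOCK = re.compile(r"^[^<>\r\n]*[<>].*(?:\r?\n[^\r\n\S]*\S.*)*", re.MULTILINE)

# Hypothetical sample, shortened from the data in the question.
text = (
    "BAR18>17M(3,5.2)V\n"
    "\n"
    "BAR19>1.243037M(3,5.2)V\n"
    "INFORMATION PROCESS\n"
    "TAKE B/F: 19V[1]\n"
    "\n"
    "HEADER WITHOUT OPERATOR\n"
    "\n"
    "REPAR_178>748.515487M(4.5,5.5)V\n"
    "CHAN1 / STEREO MIX\n"
)

blocks = [m.group(0) for m in BLOCK.finditer(text)]
for number, block in enumerate(blocks, start=1):
    print(f"MATCH {number}:\n{block}\n")
```

The line without < or > never starts a block, and each block stops at the first empty line.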
If the first line should also contain a , after matching < or >:
^[^<>\r\n]*[<>][^\r\n,]*,.*(?:\r?\n[^\r\n\S]*\S.*)*
Regex demo

Regexp to search names of books (Delphi 7 & DiRegEx 8.8.1)

I am using Delphi 7 and this is the first time I am using the DiRegEx library.
What I need to do is collect the names of the books that are in a list. The list is long, but to give an idea, it looks like this:
2 Tesalonickým 3:14
2 Tesalonickým 3:15
2 Tesalonickým 3:16
2 Tesalonickým 3:17
2 Tesalonickým 3:18
1 Timoteovi 1:1
1 Timoteovi 1:2
1 Timoteovi 1:3
1 Timoteovi 1:4
So what I want to find by RegEx.Match is the '2 Tesalonickým' and '1 Timoteovi' strings. So I want to search for ^some string\d\d?\d?:\d\d?\d? ...
My code is:
var
  Contents: TStringList;
  RegEx: TDIRegEx;
  WordCount: Integer;
  i: Integer;
  s: string;
begin
  Contents := TStringList.Create;
  RegEx := TDIPerlRegEx.Create{$IFNDEF DI_No_RegEx_Component}(nil){$ENDIF};
  try
    Contents.LoadFromFile('..\reference dlouhé CS.txt');
    for i := 0 to Contents.Count - 1 do
    begin
      // search the current line for words
      RegEx.SetSubjectStr(Contents[i]);
      RegEx.MatchPattern := '\w+';
      WordCount := 0;
      if RegEx.Match(0) >= 0 then
      begin
        repeat
          Inc(WordCount);
          s := RegEx.MatchedStr;
          WriteLn(WordCount, ' - ', s);
        until RegEx.MatchNext < 0;
      end;
    end; // end for
  finally
    // free both objects once, after all lines are processed
    RegEx.Free;
    Contents.Free;
  end; // end try
end;
And I need to modify the regex so the \d\d?\d?:\d\d?\d? won't be in the result, but should be an "anchor" or a "needle". How to make the regexp?
Result:
This is a complete list of the 66 books of the Bible in UTF-8. There were some problems with the \w pattern because it does not include characters like Ž or š.
Genesis;Exodus;Leviticus;Numeri;Deuteronomium;Jozue;Soudců;Rút;1 Samuelova;2 Samuelova;1 Královská;2 Královská;1 Paralipomenon;2 Paralipomenon;Ezdráš;Nehemjáš;Ester;Jób;Žalmy;Přísloví;Kazatel;Píseň písní;Izajáš;Jeremjáš;Pláč;Ezechiel;Daniel;Ozeáš;Jóel;Ámos;Abdijáš;Jonáš;Micheáš;Nahum;Abakuk;Sofonjáš;Ageus;Zacharjáš;Malachiáš;Matouš;Marek;Lukáš;Jan;Skutky apoštolské;Římanům;1 Korintským;2 Korintským;Galatským;Efezským;Filipským;Koloským;1 Tesalonickým;2 Tesalonickým;1 Timoteovi;2 Timoteovi;Titovi;Filemonovi;Židům;Jakubův;1 Petrův;2 Petrův;1 Janův;2 Janův;3 Janův;Judův;Zjevení Janovo;
You may use
(*UCP)^(?:\d+\s+)?\w+(?=\s+\d\d?\d?:\d)
Or
(*UCP)^(?:\d+\s+)?\w+(?=\s+\d{1,3}:\d)
The (*UCP) PCRE verb at the start of the pattern makes all shorthand classes Unicode-aware.
The patterns match
^ - start of the string
(?: - start of a non-capturing group
\d+ - 1+ digits,
\s+ - 1+ whitespaces and
)? - end of non-capturing group, 1 or 0 occurrences (? makes it optional)
\w+ - 1+ word chars...
(?=\s+\d{1,3}:\d) - followed with 1+ whitespaces, 1 to 3 digits, : and a digit.
See the regex demo.
The \w might need replacing with \p{L} if you only need to match letters.
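For comparison, here is a hedged Python sketch of the same lookahead idea. It is not DIRegEx code: Python's re module has no (*UCP) verb, but its str patterns are Unicode-aware by default. The optional (?:\s+\w+)*? part is an assumed extension so that multi-word names such as "Píseň písní" are kept whole, and book_names is a hypothetical helper.

```python
import re

# str patterns in Python 3 are Unicode-aware by default, so \w already matches
# letters such as Ž or š and no (*UCP)-style switch is needed.
# Assumed extension: (?:\s+\w+)*? lets multi-word names like "Píseň písní" match.
BOOK = re.compile(r"^(?:\d+\s+)?\w+(?:\s+\w+)*?(?=\s+\d{1,3}:\d)")


def book_names(lines):
    """Collect each distinct book name in order of first appearance."""
    names = []
    for line in lines:
        match = BOOK.match(line)
        if match and match.group(0) not in names:
            names.append(match.group(0))
    return names


references = [
    "2 Tesalonickým 3:14",
    "2 Tesalonickým 3:15",
    "1 Timoteovi 1:1",
    "Píseň písní 1:1",
]
print(book_names(references))
```

The chapter:verse part is only checked by the lookahead, so it acts as the "anchor" and never appears in the result.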

R regex: how to remove "*" only in between a group of variables

I have a group of variables var:
> var
[1] "a1" "a2" "a3" "a4"
Here is what I want to achieve: use a regex to change strings such as this:
3*a1 + a1*a2 + 4*a3*a4 + a1*a3
to
3a1 + a1*a2 + 4a3*a4 + a1*a3
Basically, I want to remove any "*" that is not between two of the values in var. Thank you in advance.
You can find (?<![\da-z])(\d+)\* and replace with $1
(?<! [\da-z] )
( \d+ ) # (1)
\*
Or ((?:[^\da-z]|^)\d+)\* for engines that lack lookbehind assertions
( # (1 start)
(?: [^\da-z] | ^ )
\d+
) # (1 end)
\*
Leading assertions are bad anyways.
Benchmark
Regex1: (?<![\da-z])(\d+)\*
Options: < none >
Completed iterations: 100 / 100 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 1.09 s, 1087.84 ms, 1087844 µs
Regex2: ((?:[^\da-z]|^)\d+)\*
Options: < none >
Completed iterations: 100 / 100 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 0.77 s, 767.04 ms, 767042 µs
You can create a dynamic regex out of the var to match and capture *s that are inside your variables, and reinsert them back with a backreference in gsub, and remove all other asterisks:
var <- c("a1","a2","a3","a4")
s = "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"
block = paste(var, collapse="|")
pat = paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
gsub(pat, "\\1", s, perl=TRUE)
## "3a1 + a1*a2 + 4a3*a4 + a1*a3"
See the IDEONE demo
Here is the regex:
\b((?:a1|a2|a3|a4)\*)(?=\b(?:a1|a2|a3|a4)\b)|\*
Details:
\b - leading word boundary
((?:a1|a2|a3|a4)\*) - Group 1 matching
(?:a1|a2|a3|a4) - either one of your variables
\* - asterisk
(?=\b(?:a1|a2|a3|a4)\b) - a lookahead check that there must be one of your variables (otherwise, no match is returned, the * is matched with the second branch of the alternation)
| - or
\* - a "wild" literal asterisk to be removed.
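The same dynamic-pattern approach can be sketched in Python. This is an illustrative port, not the R code above: the function name strip_coefficient_stars is made up, and re.escape is added as a precaution in case a variable name ever contains regex metacharacters.

```python
import re

variables = ["a1", "a2", "a3", "a4"]
block = "|".join(re.escape(name) for name in variables)

# Branch 1 captures "variable*" when another variable follows; branch 2 matches
# any other asterisk, which is then dropped.
pattern = re.compile(rf"\b((?:{block})\*)(?=\b(?:{block})\b)|\*")


def strip_coefficient_stars(expression: str) -> str:
    # Keep group 1 when branch 1 matched; otherwise replace with nothing.
    return pattern.sub(lambda m: m.group(1) or "", expression)


print(strip_coefficient_stars("3*a1 + a1*a2 + 4*a3*a4 + a1*a3"))
```

A replacement function is used instead of a backreference string so that the unmatched group in the second branch is handled explicitly.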
Taking the equation as a string, one option is
gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', '3*a1 + a1*a2 + 4*a3*a4 + a1*a3')
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"
which looks for
a captured group of characters, ( ... )
containing a non-capturing group, (?: ... )
containing the beginning of the line ^
or, |
a space (or \\s)
followed by a digit 0-9, \\d.
The capturing group is followed by an asterisk, \\*,
followed by another capturing group ( ... )
containing an alphanumeric character \\w.
It replaces the above with
the first captured group, \\1,
followed by the second captured group, \\2.
Adjust as necessary.
Thanks @alistaire for offering a solution with a non-capturing group. However, the solution relies on there being a space between the coefficient and the "+" in front of it. Here's my modified solution based on his suggestion:
> ss <- "3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"
# my modified version
> gsub('((?:^|\\s|\\+|\\-)\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4a3*a4 +2a1*a3+ 4a2*a3"
# alistaire's
> gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4a2*a3"
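For readers outside R, the modified substitution might look like this in Python. It is a sketch with the same pattern and a hypothetical function name; only the escaping differs, since a raw string removes the need for doubled backslashes.

```python
import re

# Separator before the coefficient: start of string, whitespace, "+" or "-".
COEFFICIENT_STAR = re.compile(r"((?:^|\s|\+|-)\d)\*(\w)")


def drop_coefficient_stars(expression: str) -> str:
    # Re-emit both captured groups without the asterisk between them.
    return COEFFICIENT_STAR.sub(r"\1\2", expression)


print(drop_coefficient_stars("3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"))
```

Note that \d matches a single digit, so a multi-digit coefficient such as 12*a1 is left untouched; \d+ would be one way to cover that case.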

PCRE regex for multiple decimal coordinates using [lon,lat] format

I am trying to create a regex for [lon,lat] coordinates.
The code first checks if the input starts with '['.
If it does we check the validity of the coordinates via a regex
/([\[][-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)[\]][\;]?)+/gm
The regex tests for [lon,lat] with 15 decimals [+- 180degrees, +-90degrees]
it should match :
single coordinates :
[120,80];
[120,80]
multiple coordinates
[180,90];[180,67];
[180,90];[180,67]
with newlines
[123,34];[-32,21];
[12,-67]
it should not match:
semicolon separator missing - single
[25,67][76,23];
semicolon separator missing - multiple
[25,67]
[76,23][12,90];
I currently have problems with the ; between coordinates (see 4 & 5)
jsfiddle equivalent here : http://regex101.com/r/vQ4fE0/4
You can try with this (human readable) pattern:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<lon> [+-]?
(?:
180 (?:\.0{1,15})?
|
(?: 1(?:[0-7][0-9]?|[89])? | [2-9][0-9]? | 0 )
(?:\.[0-9]{1,15})?
)
)
(?<lat> [+-]?
(?:
90 (?:\.0{1,15})?
|
(?: [1-8][0-9]? | 9 | 0 )
(?:\.[0-9]{1,15})?
)
)
)
\A
\[ \g<lon> , \g<lat> ] (?: ; \n? \[ \g<lon> , \g<lat> ] )* ;?
\z
~x
EOD;
explanations:
When you have to deal with a long pattern in which the same subpatterns are repeated several times, you can use several features to make it more readable.
The best known is the free-spacing mode (the x modifier), which lets you indent the pattern as you want (all spaces are ignored) and optionally add comments.
The second is to define subpatterns in a definition section (?(DEFINE)...), in which you can define named subpatterns to be used later in the main pattern.
Since I don't want to repeat the large subpatterns that describe the longitude number and the latitude number, I have created two named patterns, "lon" and "lat", in the definition section. To use them in the main pattern, I only need to write \g<lon> and \g<lat>.
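In engines without (?(DEFINE)...), such as Python's re module, a similar effect can be had by building the pattern from named string pieces. The following is only a sketch under that assumption: the number ranges are taken from the question's own pattern, and LON, LAT, PAIR, COORDS and is_valid are illustrative names.

```python
import re

# Sub-patterns written once, playing the role of the "lon" and "lat" definitions.
LON = r"[+-]?(?:180(?:\.0{1,15})?|(?:1[0-7][0-9]|[1-9]?[0-9])(?:\.[0-9]{1,15})?)"
LAT = r"[+-]?(?:90(?:\.0{1,15})?|[1-8]?[0-9](?:\.[0-9]{1,15})?)"

# Plain concatenation is used instead of f-strings because the quantifiers
# contain braces.
PAIR = r"\[" + LON + "," + LAT + r"\]"

# One pair, then any number of ";" (optionally followed by a newline) plus pair,
# then an optional trailing ";".
COORDS = re.compile(PAIR + r"(?:;\n?" + PAIR + r")*;?")


def is_valid(text: str) -> bool:
    """Return True when the whole input is a valid coordinate list."""
    return COORDS.fullmatch(text) is not None


for sample in ["[120,80];", "[180,90];[180,67]", "[25,67][76,23];"]:
    print(sample, is_valid(sample))
```

Because the separator is part of the repeated group, two pairs with no ; between them cannot match.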
javascript version:
var lon_sp = '(?:[+-]?(?:180(?:\\.0{1,15})?|(?:1(?:[0-7][0-9]?|[89])?|[2-9][0-9]?|0)(?:\\.[0-9]{1,15})?))';
var lat_sp = '(?:[+-]?(?:90(?:\\.0{1,15})?|(?:[1-8][0-9]?|9|0)(?:\\.[0-9]{1,15})?))';
var coo_sp = '\\[' + lon_sp + ',' + lat_sp + '\\]';
var regex = new RegExp('^' + coo_sp + '(?:;\\n?' + coo_sp + ')*;?$');
var coordinates = new Array('[120,80];',
'[120,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]');
for (var i = 0; i<coordinates.length; i++) {
console.log("\ntest "+(i+1)+": " + regex.test(coordinates[i]));
}
fiddle
Try this out:
^(\[([+-]?(?!(180\.|18[1-9]|19\d{1}))\d{1,3}(\.\d{1,15})?,[+-]?(?!(90\.|9[1-9]))\d{1,2}(\.\d{1,15})?(\];$|\]$|\];\[)){1,})
Demo: http://regex101.com/r/vQ4fE0/7
Explanation
^(\[
Must start with a bracket
[+-]?
May or may not contain +- in front of the number
(?!(180\.|18[1-9]|19\d{1}))
Should not contain 180., 181-189 nor 19x
\d{1,3}(\.\d{1,15})?
Otherwise, any number containing 1 to 3 digits, with or without decimals (up to 15), is allowed
(?!(90\.|9[1-9]))
The 90 check is similar, but here we are not allowing 90. nor 91-99
\d{1,2}(\.\d{1,15})?
Otherwise, any number containing 1 or 2 digits, with or without decimals (up to 15), is allowed
(\];$|\]$|\];\[)
The ending of a bracket body must have a ; separating two bracket bodies, otherwise it must be the end of the line.
{1,}
The brackets can exist 1 or multiple times
Hope this was helpful.
This might work. Note that you have a lot of capture groups, none of which
will give you good information because they sit inside a quantified group.
# /^(\[[-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)\](?:;\n?|$))+$/
^
( # (1 start)
\[
[-+]?
( # (2 start)
180
( \. 0{1,15} )? # (3)
|
( # (4 start)
( 1 [0-7] \d ) # (5)
|
( [1-9]? \d ) # (6)
) # (4 end)
( \. \d{1,15} )? # (7)
) # (2 end)
,
[-+]?
( # (8 start)
[1-8]? \d
( \. \d{1,15} )? # (9)
|
90
( \. 0{1,15} )? # (10)
) # (8 end)
\]
(?: ; \n? | $ )
)+ # (1 end)
$
Try a function approach, where the function can do some of the splitting for you, as well as delegating the number comparisons away from the regex. I tested it here: http://repl.it/YyG/3
//represents regex necessary to capture one coordinate, which
// looks like 123 or 123.13532
// the decimal part is a non-capture group ?:
var oneCoord = '(-?\\d+(?:\\.\\d+)?)';
//console.log("oneCoord is: "+oneCoord+"\n");
//one coordinate pair is represented by [x,x]
// check start/end with ^, $
var coordPair = '^\\['+oneCoord+','+oneCoord+'\\]$';
//console.log("coordPair is: "+coordPair+"\n");
//the full regex string consists of one or more coordinate pairs,
// but we'll do the splitting in the function
var myRegex = new RegExp(coordPair);
//console.log("my regex is: "+myRegex+"\n");
function isPlusMinus180(x)
{
return -180.0<=x && x<=180.0;
}
function isPlusMinus90(y)
{
return -90.0<=y && y<=90.0;
}
function isValid(s)
{
//if there's a trailing semicolon, remove it
if(s.slice(-1)==';')
{
s = s.slice(0,-1);
}
//remove all newlines and split by semicolon
var all = s.replace(/\n/g,'').split(';');
//console.log(all);
for(var k=0; k<all.length; ++k)
{
var match = myRegex.exec(all[k]);
if(match===null)
return false;
console.log(" match[1]: "+match[1]);
console.log(" match[2]: "+match[2]);
//break out if one pair is bad
if(! (isPlusMinus180(match[1]) && isPlusMinus90(match[2])) )
{
console.log(" one of matches out of bounds");
return false;
}
}
return true;
}
var coords = new Array('[120,80];',
'[120.33,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]',
'[190,33.33]',
'[180.33,33]',
'[179.87,90]',
'[179.87,91]');
var s;
for (var i = 0; i<coords.length; i++) {
s = coords[i];
console.log((i+1)+". ==== testing "+s+" ====");
console.log(" isValid? => " + isValid(s));
}

Replacement of strings within 2 strings in regex

I have a string:
dkj a * & &*(&(*(
//#HELLO
^%#&UJNWDUK()C*(v 8*J DK*9
//#HE#$^&&(akls#$98akdjl ak##sjdkja
//
%^&*(//#HELLO//#BYE<><>
//#BYE
^%#&UJNWDUK()C*(v 8*J DK*90K )
//#HELLO
&*^J$XUK 8j8 j jk kk8(&*(
//#BYE
and I need to have 2 groups, such that each group starts with //#HELLO followed by a newline, then any text can follow (.*), and it ends with a line containing only //#BYE:
1)
//#HELLO
^%#&UJNWDUK()C*(v 8*J DK*9
//#HE#$^&&(akls#$98akdjl ak##sjdkja
//
%^&*(//#HELLO//#BYE<><>
//#BYE
2)
//#HELLO
&*^J$XUK 8j8 j jk kk8(&*(
//#BYE
and replace the original string with this (basically adding // to each line of each group):
dkj a * & &*(&(*(
////#HELLO
//^%#&UJNWDUK()C*(v 8*J DK*9
////#HE#$^&&(akls#$98akdjl ak##sjdkja
////
//%^&*(//#HELLO//#BYE<><>
////#BYE
^%#&UJNWDUK()C*(v 8*J DK*90K )
////#HELLO
//&*^J$XUK 8j8 j jk kk8(&*(
////#BYE
Here is my current progress:
I have
\/\/#HELLO\n.*?\/\/#BYE[\n$]
However, I'm not sure how to go about the replacement. I'm thinking of separating each line per group using \G after the //#HELLO and ending with //#BYE.
It's a bit complex, but this will do it:
Search: (?m)(//#HELLO[\r\n]+|\G(?://#BYE|(?=(?:[^#]|#(?!HELLO[\r\n]+))*#BYE)[^\r\n]*[\r\n]*))
Replace: //$1
In Groovy:
String resultString = subjectString.replaceAll(/(?m)(\/\/#HELLO[\r\n]+|\G(?:\/\/#BYE|(?=(?:[^#]|#(?!HELLO[\r\n]+))*#BYE)[^\r\n]*[\r\n]*))/, '//$1');
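A different way to get the same result, sketched here in Python rather than Groovy, is to match each whole block first and then prefix its lines in a callback. This is not the \G technique above. It assumes plain \n line endings and that //#HELLO and //#BYE each sit alone on their line; comment_out_blocks is a hypothetical name.

```python
import re

# A block starts at a line that is exactly //#HELLO and runs lazily to the
# first line that is exactly //#BYE (DOTALL lets . cross line breaks).
BLOCK = re.compile(r"^//#HELLO\n.*?^//#BYE$", re.MULTILINE | re.DOTALL)


def comment_out_blocks(text: str) -> str:
    def prefix_lines(match):
        # Prepend // to every line of the matched block.
        return "\n".join("//" + line for line in match.group(0).split("\n"))

    return BLOCK.sub(prefix_lines, text)


sample = "\n".join([
    "keep",
    "//#HELLO",
    "a //#HELLO//#BYE b",
    "//",
    "//#BYE",
    "keep too",
    "//#HELLO",
    "x",
    "//#BYE",
])
print(comment_out_blocks(sample))
```

Markers that appear in the middle of a line, as in the third sample line, do not open or close a block because of the ^ and $ anchors.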
For grouping into separate lines use the following regex:
//#HELLO\r(.*[\n\r]+)*//#BYE\r?
\r - Carriage return (line-break character)
[\n\r] - Line-break characters (line feed or carriage return)
*? - Non-greedy match
?- Match 1 or 0 times
You can take out the ? at the end if it always ends with a newline.
You can then use the group (The value inside the brackets) to search and replace.