Regex named capture groups in Delphi XE

I have built a match pattern in RegexBuddy which behaves exactly as I expect. But I cannot transfer it to Delphi XE, at least not with the latest built-in TRegEx or TPerlRegEx.
My real-world code has 6 capture groups, but I can illustrate the problem with a simpler example. This code shows "3" in the first dialog and then raises an exception (-7 index out of bounds) when executing the second dialog.
var
Regex: TRegEx;
M: TMatch;
begin
Regex := TRegEx.Create('(?P<time>\d{1,2}:\d{1,2})(?P<judge>.{1,3})');
M := Regex.Match('00:00 X1 90 55KENNY BENNY');
ShowMessage(IntToStr(M.Groups.Count));
ShowMessage(M.Groups['time'].Value);
end;
But if I use only one capture group
Regex := TRegEx.Create('(?P<time>\d{1,2}:\d{1,2})');
The first dialog shows "2" and the second dialog shows the time "00:00" as expected.
It would be rather limiting if only one named capture group were allowed, but that's not the case. If I change the capture group name to, for example, "atime":
var
Regex: TRegEx;
M: TMatch;
begin
Regex := TRegEx.Create('(?P<atime>\d{1,2}:\d{1,2})(?P<judge>.{1,3})');
M := Regex.Match('00:00 X1 90 55KENNY BENNY');
ShowMessage(IntToStr(M.Groups.Count));
ShowMessage(M.Groups['atime'].Value);
end;
I get "3" and "00:00", just as expected. Are there reserved words I cannot use? I don't think so, because in my real example I've tried completely random names. I just cannot figure out what causes this behaviour.

When pcre_get_stringnumber does not find the name, it returns PCRE_ERROR_NOSUBSTRING.
RegularExpressionsAPI defines it as PCRE_ERROR_NOSUBSTRING = -7.
Some testing shows that pcre_get_stringnumber returns PCRE_ERROR_NOSUBSTRING for every name whose first letter is in the range k to z, and that this range depends on the first letter of judge. Changing judge to something else changes the range.
As I see it, there are at least two bugs involved here: one in pcre_get_stringnumber, and one in TGroupCollection.GetItem, which should raise a proper exception instead of SRegExIndexOutOfBounds.
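For comparison, here is a minimal sketch of the name-to-number lookup that a working engine is expected to perform for this pattern. It uses Python's re module, which shares the (?P<name>...) syntax; the helper group_number and its -7 fallback are hypothetical and only mimic PCRE_ERROR_NOSUBSTRING for illustration.

```python
import re

# Same pattern as in the question; Python shares the (?P<name>...) syntax.
PATTERN = re.compile(r'(?P<time>\d{1,2}:\d{1,2})(?P<judge>.{1,3})')


def group_number(name):
    """Return the group number for a name, or -7 if the name is unknown.

    The -7 only mimics PCRE_ERROR_NOSUBSTRING; it is not a Python convention.
    """
    return PATTERN.groupindex.get(name, -7)


match = PATTERN.search('00:00 X1 90 55KENNY BENNY')
print(group_number('time'), group_number('judge'))  # 1 2
print(match.group('time'))                          # 00:00
```

A correct lookup returns 1 for time and 2 for judge, which is what the pcre unit shipped with TPerlRegEx is reported to do later in this thread.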

The bug seems to be in the RegularExpressionsAPI unit that wraps the PCRE library, or in the PCRE OBJ files that it links. If I run this code:
program Project1;
{$APPTYPE CONSOLE}
uses
SysUtils, RegularExpressionsAPI;
var
myregexp: Pointer;
Error: PAnsiChar;
ErrorOffset: Integer;
Offsets: array[0..300] of Integer;
OffsetCount, Group: Integer;
begin
try
myregexp := pcre_compile('(?P<time>\d{1,2}:\d{1,2})(?P<judge>.{1,3})', 0, @Error, @ErrorOffset, nil);
if (myregexp <> nil) then begin
offsetcount := pcre_exec(myregexp, nil, '00:00 X1 90 55KENNY BENNY', Length('00:00 X1 90 55KENNY BENNY'), 0, 0, @Offsets[0], High(Offsets));
if (offsetcount > 0) then begin
Group := pcre_get_stringnumber(myregexp, 'time');
WriteLn(Group);
Group := pcre_get_stringnumber(myregexp, 'judge');
WriteLn(Group);
end;
end;
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
ReadLn;
end.
It prints -7 and 2 instead of 1 and 2.
If I remove RegularExpressionsAPI from the uses clause and add the pcre unit from my TPerlRegEx component, then it does correctly print 1 and 2.
The RegularExpressionsAPI in Delphi XE is based on my pcre unit, and the RegularExpressionsCore unit is based on my PerlRegEx unit. Embarcadero did make some changes to both units. They also compiled their own OBJ files from the PCRE library that are linked by RegularExpressionsAPI.
I have reported this bug as QC 92497
I have also created a separate report QC 92498 to request that TGroupCollection.GetItem raise a more sensible exception when requesting a named group that does not exist. (This code is in the RegularExpressions unit which is based on code written by Vincent Parrett, not myself.)

Date format substitution in PL/SQL. Example: from 5y 6m 20d to 050620

I am writing a query where I need to perform a date format transformation to meet the specified requirements.
In the database I have to search, the dates look like the example: 5y 6m 10d, with spaces in between and with optional parts (10y 30d; 1m 23d; 6m are also valid). The parts are always ordered (first years, then months, then days).
The format transformation should be the following:
10y 6m 10d => 100610
1y 10m 1d => 011001
6m 2d => 000602
So that the output is always a 6-digit number.
I tried writing regular expressions with REGEXP_SUBSTR to isolate the tokens and then concatenate them, along the lines of SELECT REGEXP_SUBSTR(text_source, '(\d+)*y') FROM database, and I also tried the REGEXP_REPLACE function. However, I cannot pad each token to two digits without spaces, nor replace one pattern by another; I can only replace the pattern by a fixed string.
Although I can output the separated tokens without spaces with the function above, I cannot get the whole transformation. Is it possible to write a regex and combine it with any of the PL/SQL functions to transform the dates listed above? I am also open to solutions that do not involve regex; I just thought it was sensible to make proper use of them here.
Here is a simple solution in SQL:
Get the values for year, month and day, e.g. with regexp_substr.
With nvl, set the value to 0 if it is null.
Left-pad it to two digits with lpad.
with tab as(
select '10y 6m 10d' as str from dual union all
select '1y 10m 1d ' as str from dual union all
select '6m 2d ' as str from dual
)
select lpad(nvl(y,0), 2,'0') ||lpad(nvl(m,0), 2,'0')|| lpad(nvl(d,0), 2,'0')
from (
select rtrim(regexp_substr(str, '[0-9]{1,2}y', 1),'y') as y
,rtrim(regexp_substr(str, '[0-9]{1,2}m', 1),'m') as m
,rtrim(regexp_substr(str, '[0-9]{1,2}d', 1),'d') as d
from tab
)
;
LPAD(N
------
100610
011001
000602
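The same extract, default and pad steps can be sketched outside the database. This is a minimal Python illustration, not Oracle code; the function name to_six_digits is hypothetical.

```python
import re


def to_six_digits(text):
    """Turn e.g. '1y 10m 1d' into '011001'; missing parts become '00'."""
    parts = []
    for unit in 'ymd':
        # One or two digits directly in front of the unit letter.
        m = re.search(r'(\d{1,2})' + unit, text)
        # Default to '0' when the part is absent, then left-pad to two digits.
        parts.append((m.group(1) if m else '0').zfill(2))
    return ''.join(parts)


for sample in ('10y 6m 10d', '1y 10m 1d', '6m 2d'):
    print(sample, '=>', to_six_digits(sample))
```

Like the SQL above, it assumes each part has at most two digits.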
I hope it works
declare
myDate_ varchar2(50) := REPLACE('1y 10m 81d',' ','');
year_ varchar2(50);
month_ varchar2(50);
day_ varchar2(50);
begin
if instrb(myDate_,'y',1,1)>0 then
year_ := lpad(regexp_substr(substr(myDate_,0,instrb(myDate_,'y',1,1)), '[^y]+',1 , 1),2,0);
end if;
if instrb(myDate_,'m',1,1)>0 then
month_ := lpad(regexp_substr(substr(myDate_,instrb(myDate_,'y',1,1)+1,instrb(myDate_,'m',1,1)), '[^m]+',1 , 1),2,0);
end if;
if instrb(myDate_,'d',1,1)>0 then
day_ := lpad(regexp_substr(substr(myDate_,instrb(myDate_,'m',1,1)+1,instrb(myDate_,'d',1,1)), '[^d]+',1 , 1),2,0);
end if;
dbms_output.put_line(year_||month_||day_);
end;

Using Regex to parse ASCII protocol

I'm working on a simple application that interacts with a device via a Telnet session using an ASCII-based protocol.
There will be a lot of interaction with the device, so I'm looking for a fast way to parse the incoming string. The manufacturer was kind enough to release their regex scheme, but since regex is very new to me, I don't understand how to retrieve the value. I know how to match, but when I match I want to get the value out of it.
Regex scheme
NameAndValue := [A-Z_]+:("(\\.|[^"\\])*"|(\\.|[^\s"\\])*)
Value := ("(\\.|[^"\\])*"|(\\.|[^\s"\\])*)
ValueUnquoted := (\\.|[^\s"\\])*
ValueQuoted := "(\\.|[^"\\])*"
CharQuoted := (\\.|[^"\\])
CharUnquoted := (\\.|[^\s"\\])
EscapedChar := \\.
CharCommon := [^\s"\\]
CharEscape := \\
CharQuote := "
CharSpace := \s
Example of a response
CMD1:"string value" CMD2:1 CMD3:"string value again" <LF> or <CR>+<LF>
I've read a lot of documentation and tried lots of approaches, but I would appreciate it if someone could point me in the right direction.
I did write a simple parser that finds the index positions of the commands and their values and then uses a substring to retrieve only the value. It works, but I would prefer a "nicer" way that uses the power of regex.
--------- EDIT 18-10-2017 ---------
At the request of @VBobCat, here is a more detailed "parsing" requirement.
Let's say I have an object with the properties Foo and Bar, and a second object with the properties Cat and Dog.
When I receive the string via Telnet, I have to parse it into one of those objects. Luckily, the string always begins with the kind of object it holds: say X for the object with Foo and Bar, and Animal for the object with Cat and Dog.
With the provided regex, I want to parse the values in the string into the properties of the object. Something like:
X CMD1_Foo:1 CMD2_Bar:"string value" <LF> or <CR>+<LF>
Object X.Foo = CMD1_Foo.value
Object X.Bar = CMD2_Bar.value
OR
Animal CMD1_Cat:"Miauw" CMD2_Dog:"woef" <LF> or <CR>+<LF>
Object X.Cat = CMD1_Cat.value
Object X.Dog = CMD2_Dog.value
If all your samples are consistent with your example, this could work:
Function ParseTelnet(input As String) As DataTable
Dim retTable As New DataTable
retTable.Columns.Add("command", GetType(String))
retTable.Columns.Add("value", GetType(String))
Dim entries = System.Text.RegularExpressions.Regex.Split(input, "\s+(?=\w+:)")
Dim pairs = entries.Select(
Function(entry) If(entry, "").Trim(Chr(9), Chr(10), Chr(13), Chr(32)).Split({":"c}, 2)).Where(
Function(pair) pair.Count = 2)
For Each pair In pairs
If pair(1).StartsWith("""") AndAlso pair(1).EndsWith("""") Then
retTable.Rows.Add(pair(0), pair(1).Substring(1, pair(1).Length - 2))
Else
retTable.Rows.Add(pair(0), pair(1))
End If
Next
Return retTable
End Function
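An alternative is to apply the manufacturer's NameAndValue scheme directly and read the value from a capture group. Below is a minimal Python sketch of that idea. Two assumptions that are not in the published scheme: the name part is widened from [A-Z_]+ to [A-Za-z0-9_]+ so that names like CMD1 and CMD2_Bar from the examples match, and escapes are undone by simply dropping the backslash. The function name parse_pairs is hypothetical.

```python
import re

# NameAndValue with named groups. The name class is widened (assumption)
# because the sample names contain digits and lowercase letters.
PAIR = re.compile(
    r'''(?P<name>[A-Za-z0-9_]+):(?P<value>"(?:\\.|[^"\\])*"|(?:\\.|[^\s"\\])*)'''
)


def parse_pairs(line):
    """Return a dict of name -> value for one response line."""
    result = {}
    for m in PAIR.finditer(line):
        value = m.group('value')
        # Strip the surrounding quotes of a ValueQuoted.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        # Assumption: an EscapedChar stands for the character after the backslash.
        value = re.sub(r'\\(.)', r'\1', value)
        result[m.group('name')] = value
    return result


print(parse_pairs('CMD1:"string value" CMD2:1 CMD3:"string value again"\r\n'))
```

The leading object word (X or Animal in the examples) has no colon after it, so it is skipped by the pattern and can be read separately to decide which object to fill.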

gst regular expression mismatch of group generates exception

I have a simple example in GNU Smalltalk 3.2.5 of attempting to group match on a key value setting:
st> m := 'a=b' =~ '(.*?)=(.*)'
MatchingRegexResults:'a=b'('a','b')
The above example works just as expected. However, if there is no match to the second group (.*), an exception is generated:
st> m := 'a=' =~ '(.*?)=(.*)'
Object: Interval new "<-0x4ce2bdf0>" error: Invalid index 1: index out of range
SystemExceptions.IndexOutOfRange(Exception)>>signal (ExcHandling.st:254)
SystemExceptions.IndexOutOfRange class>>signalOn:withIndex: (SysExcept.st:660)
Interval>>first (Interval.st:245)
Kernel.MatchingRegexResults>>at: (Regex.st:382)
Kernel.MatchingRegexResults>>printOn: (Regex.st:305)
Kernel.MatchingRegexResults(Object)>>printString (Object.st:534)
Kernel.MatchingRegexResults(Object)>>printNl (Object.st:571)
I don't understand this behavior. I would have expected the result to be ('a', nil) and that m at: 2 to be nil. I tried a different approach as follows:
st> 'a=' =~ '(.*?)=(.*)' ifMatched: [ :m | 'foo' printNl ]
'foo'
'foo'
Which determines properly that there's a match to the regex. But I still can't check if a specific group is nil:
st> 'a=' =~ '(.*?)=(.*)' ifMatched: [ :m | (m at: 2) ifNotNil: [ (m at: 2) printNl ] ]
Object: Interval new "<-0x4ce81b58>" error: Invalid index 1: index out of range
SystemExceptions.IndexOutOfRange(Exception)>>signal (ExcHandling.st:254)
SystemExceptions.IndexOutOfRange class>>signalOn:withIndex: (SysExcept.st:660)
Interval>>first (Interval.st:245)
Kernel.MatchingRegexResults>>at: (Regex.st:382)
optimized [] in UndefinedObject>>executeStatements (a String:1)
Kernel.MatchingRegexResults>>ifNotMatched:ifMatched: (Regex.st:322)
Kernel.MatchingRegexResults(RegexResults)>>ifMatched: (Regex.st:188)
UndefinedObject>>executeStatements (a String:1)
nil
st>
That is how it works in every other language I've used regex in, which makes me think my syntax may be wrong.
My question is: do I have the correct syntax for matching ASCII key-value pairs like this (for example, when parsing environment settings)? If I do, why is an exception generated, and is there a way to get a result that I can check without triggering an exception?
I found a related issue reported at gnu.org from Dec 2013 with no responses.
The issue had been fixed in master after the above report was received. The commit can be seen here. A stable release is currently blocked by the glib event loop integration.
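For reference, this is how another engine treats the same pattern. The sketch below uses Python's re module and says nothing about gst internals: the second group does take part in the match, so it yields an empty string, while a group that takes no part at all yields None. The helper key_value is hypothetical.

```python
import re


def key_value(line):
    """Split 'key=value' with the pattern from the question."""
    m = re.match(r'(.*?)=(.*)', line)
    return m.groups() if m else None


print(key_value('a=b'))  # ('a', 'b')
print(key_value('a='))   # ('a', '')

# A group that does not participate in the match is reported as None.
unmatched = re.match(r'(a)|(b)', 'a')
print(unmatched.group(2))  # None
```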
ValidationExpression="[0-9]{2}[(a-z)(A-Z)]{5}\d{4}[(a-z)(A-Z)]{1}\d{1}Z\d{1}"
SetFocusOnError="true" ControlToValidate="txtGST" Display="Dynamic" runat="server" ErrorMessage="Invalid GST No." ValidationGroup="Add" ForeColor="Red"></asp:RegularExpressionValidator>

PL/SQL optimize searching a date in varchar

I have a table that contains a date field (let it be date s_date) and a description field (varchar2(n) desc). I need to write a script (or a single query, if possible) that parses the desc field and, if it contains a valid Oracle date, extracts this date and updates s_date when s_date is null.
There is one more condition: there must be exactly one occurrence of a date in desc. If there are 0 or more than 1, nothing should be updated.
So far I have come up with this rather ugly solution using regular expressions:
----------------------------------------------
create or replace function to_date_single( p_date_str in varchar2 )
return date
is
l_date date;
pRegEx varchar(150);
pResStr varchar(150);
begin
pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)((.|\n|\t|\s)*((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d))?';
pResStr := regexp_substr(p_date_str, pRegEx);
if not (length(pResStr) = 10)
then return null;
end if;
l_date := to_date(pResStr, 'dd.mm.yyyy');
return l_date;
exception
when others then return null;
end to_date_single;
----------------------------------------------
update myTable t
set t.s_date = to_date_single(t.desc)
where t.s_date is null;
----------------------------------------------
But it runs extremely slowly (more than a second per record, and I need to update about 30000 records). Is it possible to optimize the function somehow? Maybe there is a way to do this without regexp? Any other ideas?
Any advice is appreciated :)
EDIT:
OK, maybe it'll be useful for someone. The following regular expression performs check for valid date (DD.MM.YYYY) taking into account the number of days in a month, including the check for leap year:
(((0[1-9]|[12]\d|3[01])\.(0[13578]|1[02])\.((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\.(0[13456789]|1[012])\.((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\.02\.((19|[2-9]\d)\d{2}))|(29\.02\.((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))
I used it with the query suggested by @David (see accepted answer), but I tried a select instead of an update (so it's one regexp fewer per row, because we don't do regexp_substr) just for "benchmarking" purposes.
Numbers probably won't tell much here, cause it all depends on hardware, software and specific DB design, but it took about 2 minutes to select 36K records for me. Update will be slower, but I think It'll still be a reasonable time.
I would refactor it along the lines of a single update query.
Use two regexp_instr() calls in the where clause to find rows for which a first occurrence of the match occurs and a second occurrence does not, and regexp_substr() to pull the matching characters for the update.
update my_table
set my_date = to_date(regexp_substr(desc,...),...)
where regexp_instr(desc,pattern,1,1) > 0 and
regexp_instr(desc,pattern,1,2) = 0
You might get even better performance with:
update my_table
set my_date = to_date(regexp_substr(desc,...),...)
where case regexp_instr(desc,pattern,1,1)
when 0 then 'N'
else case regexp_instr(desc,pattern,1,2)
when 0 then 'Y'
else 'N'
end
end = 'Y'
... as it only evaluates the second regexp if the first is non-zero. The first query might also do that but the optimiser might choose to evaluate the second predicate first because it is an equality condition, under the assumption that it's more selective.
Or reordering the Case expression might be better -- it's a trade-off that's difficult to judge and probably very dependent on the data.
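The "first occurrence exists, second occurrence does not" test can be sketched outside SQL as follows. This is a Python illustration under the assumption that the pattern is the single-date part of the regex from the question; single_candidate is a hypothetical name. Like the nested CASE, it stops looking as soon as a second match is found.

```python
import re
from itertools import islice

# Single-date part of the pattern from the question (dd.mm.yyyy, 19xx or 20xx).
DATE = re.compile(
    r'(?:0[1-9]|[12][0-9]|3[01])\.(?:0[1-9]|1[012])\.(?:19|20)[0-9]{2}'
)


def single_candidate(text):
    """Return the matched text if the pattern occurs exactly once, else None."""
    # Pull at most two matches: one to use, one to detect a duplicate.
    hits = list(islice(DATE.finditer(text), 2))
    return hits[0].group() if len(hits) == 1 else None


print(single_candidate('due 12.05.2013 ok'))          # 12.05.2013
print(single_candidate('a 12.05.2013 b 13.05.2013'))  # None
```

Note that this only counts pattern matches; it does not check that the matched text is a real calendar date.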
I think there's no way to improve this task. Actually, in order to achieve what you want it should get even slower.
Your regular expression matches text like 31.02.2013, 31.04.2013 outside the range of the month. If you put year in the game,
it gets even worse. 29.02.2012 is valid, but 29.02.2013 is not.
That's why you have to test if the result is a valid date.
Since there isn't a full regular expression for that, you really have to do it in PL/SQL.
In your to_date_single function you return null when an invalid date is found.
But that doesn't mean there won't be other valid dates further on in the text.
So you have to keep trying until you either find two valid dates or hit the end of the text:
create or replace function fn_to_date(p_date_str in varchar2) return date is
l_date date;
pRegEx varchar(150);
pResStr varchar(150);
vn_findings number;
vn_loop number;
begin
vn_findings := 0;
vn_loop := 1;
pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)';
loop
pResStr := regexp_substr(p_date_str, pRegEx, 1, vn_loop);
if pResStr is null then exit; end if;
begin
l_date := to_date(pResStr, 'dd.mm.yyyy');
vn_findings := vn_findings + 1;
-- your crazy requirement :)
if vn_findings = 2 then
return null;
end if;
exception when others then
null;
end;
-- you have to keep trying :)
vn_loop := vn_loop + 1;
end loop;
return l_date;
end;
Some tests:
select fn_to_date('xxxx29.02.2012xxxxx') c1 --ok
, fn_to_date('xxxx29.02.2012xxx29.02.2013xxx') c2 --ok, 2nd is invalid
, fn_to_date('xxxx29.02.2012xxx29.02.2016xxx') c3 --null, both are valid
from dual
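The keep-trying loop can also be sketched in Python, where the calendar check is done by datetime.strptime instead of to_date. This is only an illustration of the logic, not a replacement for the PL/SQL function; single_valid_date and the loose candidate pattern are assumptions.

```python
import re
from datetime import datetime

# Loose candidate pattern; real validation is left to strptime.
CANDIDATE = re.compile(r'[0-9]{2}\.[0-9]{2}\.[0-9]{4}')


def single_valid_date(text):
    """Return the only valid dd.mm.yyyy date in text, or None for zero or several."""
    found = None
    for m in CANDIDATE.finditer(text):
        try:
            parsed = datetime.strptime(m.group(), '%d.%m.%Y').date()
        except ValueError:
            continue  # not a real calendar date, keep trying
        if found is not None:
            return None  # second valid date: reject the whole text
        found = parsed
    return found


print(single_valid_date('xxxx29.02.2012xxxxx'))
print(single_valid_date('xxxx29.02.2012xxx29.02.2013xxx'))
print(single_valid_date('xxxx29.02.2012xxx29.02.2016xxx'))
```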
As you are going to have to do trial and error anyway, one idea would be to use a simpler regular expression.
Something like \d\d[.]\d\d[.]\d\d\d\d would suffice. That would depend on your data, of course.
Using @David's idea you could reduce the number of rows to which you apply your to_date_single function (because it's slow),
but regular expressions alone won't do what you want:
update my_table
set my_date = fn_to_date( )
where regexp_instr(desc,pattern,1,1) > 0

Delphi - User specified string manipulation

I have a problem in Delphi7. My application creates mpg video files according to a set naming convention i.e.
\000_A_Title_YYYY-MM-DD_HH-mm-ss_Index.mpg
In this filename the following rules are enforced:
The 000 is the video sequence. It is incremented whenever the user presses stop.
The A (or B,C,D) specifies the recording camera - so video files are linked with up to four video streams all played simultaneously.
Title is a variable length string. In my application it cannot contain a _.
The YYYY-MM-DD_HH-mm-ss is the starting time of the video sequence (not the single file)
The Index is the zero based ordering index and is incremented within 1 video sequence. That is, video files are a maximum of 15 minutes long, once this is reached a new video file is started with the same sequence number but next index. Using this, we can calculate the actual start time of the file (Filename decoded time + 15*Index)
Using this method my application can extract the starting time that the video file started recording.
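As a concrete illustration of the rules above, here is a minimal Python sketch that decodes a filename and applies "decoded time + 15*Index" in minutes. The regular expression and the name file_start_time are assumptions made for the example, not the application's actual code.

```python
import re
from datetime import datetime, timedelta

# Sequence, camera, title (no underscore), sequence start time, index.
NAME = re.compile(
    r'(?P<seq>\d{3})_(?P<camera>[A-D])_(?P<title>[^_]*)_'
    r'(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})_(?P<index>\d+)\.mpg$'
)


def file_start_time(filename):
    """Return the start time of this file, or None if the name does not fit."""
    m = NAME.search(filename)
    if m is None:
        return None
    sequence_start = datetime.strptime(m.group('stamp'), '%Y-%m-%d_%H-%M-%S')
    # Each file is at most 15 minutes long, so file N starts 15*N minutes in.
    return sequence_start + timedelta(minutes=15 * int(m.group('index')))


print(file_start_time('\\000_A_Title_2011-03-08_10-00-00_2.mpg'))
```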
Now we have a further requirement to handle arbitrarily named video files. The only thing I know for certain is that there will be a YYYY-MM-DD HH-mm-ss somewhere in the filename.
How can I allow the user to specify the filename convention for the files he is importing? Something like regular expressions? I understand there must be a pattern to the naming scheme.
So if the user inputs ?_(Camera)_*_YYYY-MM-DD_HH-mm-ss_(Index).mpg into a text box, how would I go about getting the start time? Is there a better solution? Or do I just have to handle every single possibility as we come across them?
(I know this is probably not the best way to handle such a problem, but we cannot change the issue - the new video files are recorded by another company)
I'm not sure if you're trying to parse the user input into components ('?_(Camera)_*_YYYY-MM-DD_HH-mm-ss_(Index).mpg'), but if you're just trying to grab the date and time, something like this would do; the date is in group 1 and the time in group 2:
(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})
Otherwise, I'm not sure what you're trying to do.
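A minimal Python sketch of that pattern in use follows. One assumption beyond the answer: the separator between date and time is allowed to be either an underscore or a space, since the question mentions both forms. The name sequence_start is hypothetical.

```python
import re
from datetime import datetime

# Date in group 1, time in group 2; separator may be '_' or ' ' (assumption).
STAMP = re.compile(r'(\d{4}-\d{2}-\d{2})[_ ](\d{2}-\d{2}-\d{2})')


def sequence_start(filename):
    """Return the first timestamp found in the filename, or None."""
    m = STAMP.search(filename)
    if m is None:
        return None
    # strptime raises ValueError if the digits are not a real date and time.
    return datetime.strptime(m.group(1) + ' ' + m.group(2), '%Y-%m-%d %H-%M-%S')


print(sequence_start('cam2 2011-03-08 14-05-09 final.mpg'))
```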
Possibly you can use the underscores "_" as your positional indicator since you smartly don't allow them in the title.
In your example of a filename convention:
?_(Camera)_*_YYYY-MM-DD_HH-mm-ss_(Index).mpg
you can parse this user-specified string to see that the date YYYY-MM-DD is always between the 3rd and 4th underscore and the time HH-mm-ss is between the 4th and 5th.
Then it becomes a simple matter when getting the actual filenames following this convention, to find the 3rd underscore and know the date and time follow it.
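The positional idea can be sketched like this: split the user's template on underscores to learn where the date and time fields sit, then read the same positions from each real filename. This is a Python illustration that assumes the template contains the literal tokens YYYY-MM-DD and HH-mm-ss as separate fields; the function names are hypothetical.

```python
from datetime import datetime


def field_positions(template):
    """Return the underscore-separated positions of the date and time tokens."""
    fields = template.split('_')
    return fields.index('YYYY-MM-DD'), fields.index('HH-mm-ss')


def start_time(template, filename):
    """Read date and time from the positions dictated by the template."""
    date_pos, time_pos = field_positions(template)
    fields = filename.split('_')
    stamp = fields[date_pos] + ' ' + fields[time_pos]
    return datetime.strptime(stamp, '%Y-%m-%d %H-%M-%S')


template = '?_(Camera)_*_YYYY-MM-DD_HH-mm-ss_(Index).mpg'
print(field_positions(template))  # (3, 4)
print(start_time(template, '000_A_Title_2011-03-08_10-00-00_2.mpg'))
```

This only works while no other field can contain an underscore, which is the constraint the answer relies on.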
If you want phone-calls 24/7, then you should go for the RegEx-thing and let the user freely enter some cryptography in a TEdit.
If you want happy users and a good night sleep, then be creative and drop the boring RegEx-approach. Create your own filename-decoder by using an Angry bird approach.
Here's the idea:
Create some birds with different string manipulation personalities.
Let the user select and arrange these birds.
Execute the user generated string manipulation.
Sample code:
program AngryBirdFilenameDecoder;
{$APPTYPE CONSOLE}
uses
SysUtils;
procedure PerformEatUntilDash(var aStr: String);
begin
if Pos('-', aStr) > 0 then
Delete(aStr, 1, Pos('-', aStr));
WriteLn(':-{ > ' + aStr);
end;
procedure PerformEatUntilUnderscore(var aStr: String);
begin
if Pos('_', aStr) > 0 then
Delete(aStr, 1, Pos('_', aStr));
WriteLn(':-/ > ' + aStr);
end;
function FetchDate(var aStr: String): String;
begin
Result := Copy(aStr, 1, 10);
Delete(aStr, 1, 10);
WriteLn(':-) > ' + aStr);
end;
var
i: Integer;
FileName: String;
TempFileName: String;
SelectedBirds: String;
MyDate: String;
begin
Write('Enter a filename to decode (eg. ''01-ThisIsAText-Img_01-Date_2011-03-08.png''): ');
ReadLn(FileName);
if FileName = '' then
FileName := '01-ThisIsAText-Img_01-Date_2011-03-08.png';
repeat
TempFileName := FileName;
WriteLn('Now, select some birds:');
WriteLn('Bird No.1 :-{ ==> I''ll eat letters until I find a dash (-)');
WriteLn('Bird No.2 :-/ ==> I''ll eat letters until I find a underscore (_)');
WriteLn('Bird No.3 :-) ==> I''ll remember the date before I eat it');
WriteLn;
Write('Chose your birds: (eg. 112123):');
ReadLn(SelectedBirds);
if SelectedBirds = '' then
SelectedBirds := '112123';
for i := 1 to Length(SelectedBirds) do
case SelectedBirds[i] of
'1': PerformEatUntilDash(TempFileName);
'2': PerformEatUntilUnderscore(TempFileName);
'3': MyDate := FetchDate(TempFileName);
end;
WriteLn('Bird No.3 found this date: ' + MyDate);
WriteLn;
WriteLn;
Write('Check filename with some other birds? (Y/N): ');
ReadLn(SelectedBirds);
until (Length(SelectedBirds)=0) or (Uppercase(SelectedBirds[1])<>'Y');
end.
When you do this in Delphi with a GUI, you'll add more birds and more checking, of course. And find some nice bird glyphs.
Use two list boxes: one on the left with all possible birds, and one on the right with all the selected birds. Drag and drop birds from left to right. Rearrange (and remove) birds in the list on the right.
The user should be able to test the setup by entering a filename and seeing the result of the process. Internally you store the script using enumerations etc.