Removing LEADING and TRAILING keywords from a String in PL/SQL - regex

I need to remove certain keywords from an input string and return the new string. The keywords are stored in another table, e.g. MR, MRS, DR, PVT, PRIVATE, CO, COMPANY, LTD, LIMITED. There are two kinds of keywords: LEADING (MR, MRS, DR) and TRAILING (PVT, PRIVATE, CO, COMPANY, LTD, LIMITED etc.).
If a keyword is LEADING it has to be removed from the beginning of the string, and if it is TRAILING it has to be removed from the end. For example, MR Jones MRS COMPANY should return JONES MRS, and MR MRS Jones PVT COMPANY should return MRS JONES PVT (MR and COMPANY are trimmed, which leaves MRS JONES PVT). Only the very first occurrence of a reserved keyword is removed at each end: if several LEADING keywords appear at the beginning, only the first one is removed, as in the example above, and the same applies to TRAILING keywords at the end.
I have written the function below. It works, but it is not efficient, and I believe its performance can be improved a lot (maybe by using a regular expression). Below is the function:
CREATE OR REPLACE FUNCTION replace_keyword (p_in_name IN VARCHAR2)
RETURN VARCHAR2
IS
l_name VARCHAR2 (4000);
l_keyword_found BOOLEAN;
CURSOR c IS
SELECT *
FROM RSRV_KEY_WORDS
WHERE ACTIVE = 'Y'
AND upper(POSITION) in ('LEADING', 'TRAILING');
BEGIN
--Remove the leading and trailing blank spaces
l_name := TRIM (UPPER (p_in_name));
--remove LEADING keywords
l_keyword_found := false;
for rec in c LOOP
IF UPPER (rec.POSITION) = 'LEADING'
AND SUBSTR(l_name, 1,INSTR(l_name,' ',1) - 1) = rec.key_word
AND l_keyword_found = false
THEN
l_name := SUBSTR(l_name,INSTR(l_name,' ',1)+1);
l_keyword_found := true;
END IF;
EXIT WHEN (l_keyword_found);
END LOOP;
--Remove multiple spaces in a word and replace with single blank space
l_name := REGEXP_REPLACE (l_name, '[[:space:]]{2,}', ' ');
--Remove the leading and trailing blank spaces
l_name := TRIM (l_name);
--remove TRAILING keywords
l_keyword_found := false;
for rec in c LOOP
IF UPPER (rec.POSITION) = 'TRAILING'
AND SUBSTR(l_name, INSTR(l_name,' ',-1) + 1) = rec.key_word
AND l_keyword_found = false
THEN
l_name := SUBSTR(l_name,1,INSTR(l_name,' ',-1)-1);
l_keyword_found := true;
END IF;
EXIT WHEN (l_keyword_found);
END LOOP;
--Remove multiple spaces in a word and replace with single blank space
l_name := REGEXP_REPLACE (l_name, '[[:space:]]{2,}', ' ');
--Remove the leading and trailing blank spaces
l_name := TRIM (l_name);
return l_name;
EXCEPTION
WHEN OTHERS
THEN
raise_application_error (
-20001,
'An error was encountered - ' || SQLCODE || ' -ERROR- ' || SQLERRM);
END;
/

I can't really say if this will be faster, but I would give it a try:
Assuming the keywords in RSRV_KEY_WORDS do not change very often, I would create a function that builds a regular expression from the table and have Oracle cache the result:
create or replace function get_lead_and_trail_regexp return varchar2
result_cache relies_on (RSRV_KEY_WORDS) is
CURSOR c IS
SELECT ( SELECT listagg(key_word,'|') within group (order by 1)
FROM RSRV_KEY_WORDS
WHERE ACTIVE = 'Y'
AND upper(POSITION) = 'LEADING' ) as leading,
( SELECT listagg(key_word,'|') within group (order by 1)
FROM RSRV_KEY_WORDS
WHERE ACTIVE = 'Y'
AND upper(POSITION) = 'TRAILING' ) as trailing
FROM dual;
begin
for rec in c loop
return '(^[ ]*('||rec.leading||')[ ]+)|([ ]+('||rec.trailing||')[ ]*$)';
end loop;
return null; -- Not very likely
end get_lead_and_trail_regexp;
You can then use the regular expression to remove first leading and first trailing keywords in one stroke:
l_name := REGEXP_REPLACE (l_name, get_lead_and_trail_regexp , ' ');
and then carry on with removing any duplicate spaces.
I have tested the regular expression with java.lang.String.replaceAll as I do not currently have an Oracle database available, but I believe it will work with REGEXP_REPLACE too.
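To try the idea outside the database, here is a minimal Python sketch of the same approach (it is not PL/SQL and not the poster's tested code). The keyword lists are hard-coded stand-ins for the rows of RSRV_KEY_WORDS, and build_pattern and strip_keywords are hypothetical helper names. Python's re module differs from Oracle's regular expressions in places, so treat it as an illustration of the pattern shape only.

```python
import re

# Stand-ins for the active rows of RSRV_KEY_WORDS (assumed sample data).
LEADING = ["MR", "MRS", "DR"]
TRAILING = ["PVT", "PRIVATE", "CO", "COMPANY", "LTD", "LIMITED"]

def build_pattern(leading, trailing):
    """Keyword at the very start, or keyword at the very end."""
    lead = "|".join(re.escape(k) for k in leading)
    trail = "|".join(re.escape(k) for k in trailing)
    return re.compile(rf"^(?:{lead}) +| +(?:{trail})$")

def strip_keywords(name, pattern):
    # Upper-case, trim, and collapse runs of whitespace first.
    cleaned = re.sub(r"\s+", " ", name.strip().upper())
    # ^ and $ can each match only once, so at most one keyword per end is removed.
    return pattern.sub("", cleaned).strip()

PATTERN = build_pattern(LEADING, TRAILING)
print(strip_keywords("MR Jones MRS COMPANY", PATTERN))
print(strip_keywords("MR MRS Jones PVT COMPANY", PATTERN))
```

Because both alternatives are anchored, a single substitution pass removes at most one leading and one trailing keyword, which is the behaviour the question asks for.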

Related

Find out if a string contains only ASCII characters

I need to know whether a string contains only ASCII characters. So far I use this REGEX:
DECLARE
str VARCHAR2(100) := 'xyz';
BEGIN
IF REGEXP_LIKE(str, '^[ -~]+$') THEN
DBMS_OUTPUT.PUT_LINE('Pure ASCII');
END IF;
END;
/
Pure ASCII
' ' (space) and ~ are the first and last printable characters in ASCII, respectively.
Problem is, this REGEXP_LIKE fails on certain NLS-Settings:
ALTER SESSION SET NLS_SORT = 'GERMAN';
DECLARE
str VARCHAR2(100) := 'xyz';
BEGIN
IF REGEXP_LIKE(str, '^[ -~]+$') THEN
DBMS_OUTPUT.PUT_LINE('Pure ASCII');
END IF;
END;
/
ORA-12728: invalid range in regular expression
ORA-06512: at line 4
Does anybody know a solution which works independently of the current user's NLS settings? Is this behavior on purpose or should it be considered a bug?
You can use TRANSLATE to do this. Basically, translate away all the ASCII printable characters (there aren't that many of them) and see what you have left.
Here is a query that does it:
WITH input ( p_string_to_test) AS (
SELECT 'This this string' FROM DUAL UNION ALL
SELECT 'Test this ' || CHR(7) || ' string too!' FROM DUAL UNION ALL
SELECT 'xxx' FROM DUAL)
SELECT p_string_to_test,
case when translate(p_string_to_test,
chr(0) || q'[ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~]',
chr(0)) is null then 'Yes' else 'No' END is_ascii
FROM input;
+-------------------------+----------+
| P_STRING_TO_TEST | IS_ASCII |
+-------------------------+----------+
| This this string | Yes |
| Test this string too! | No |
| xxx | Yes |
+-------------------------+----------+
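The "translate everything away and see what is left" idea can also be sketched in Python, purely as an illustration of the technique and not as a replacement for the Oracle query. The name is_printable_ascii is a hypothetical helper, and the printable range 32..126 is the same space-to-tilde range used in the question.

```python
# All printable ASCII characters, from space (32) to tilde (126).
PRINTABLE = "".join(chr(code) for code in range(32, 127))

# Translation table that deletes every printable ASCII character.
_DROP_PRINTABLE = str.maketrans("", "", PRINTABLE)

def is_printable_ascii(text):
    """True when nothing is left after removing printable ASCII characters."""
    return text.translate(_DROP_PRINTABLE) == ""

for candidate in ["Test this string", "Test this \x07 string too!", "xxx"]:
    print(repr(candidate), is_printable_ascii(candidate))
```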
ASCII function with upper limit of 127 may be used :
declare
str nvarchar2(100) := '\xyz~*-=)(/&%+$#£>|"éß';
a nvarchar2(1);
b number := 0;
begin
for i in 1..length(str)
loop
a := substrc(str,i,1);
b := greatest(ascii(a),b);
end loop;
if b < 128 then
dbms_output.put_line('String is composed of Pure ASCII characters');
else
dbms_output.put_line('String has non-ASCII characters');
end if;
end;
I think I will go for one of these two
IF CONVERT(str, 'US7ASCII') = str THEN
DBMS_OUTPUT.PUT_LINE('Pure ASCII');
END IF;
IF ASCIISTR(REPLACE(str, '\', '/')) = REPLACE(str, '\', '/') THEN
DBMS_OUTPUT.PUT_LINE('Pure ASCII');
END IF;

PL/SQL. Parse clob UTF8 chars with regexp_like regular expressions

I want to check whether any line of my clob has strange characters like (ñ§). These characters are read from a csv-file with an unexpected encoding (UTF-8), which converts some of them.
I tried to filter each line using a regular expression but it's not working as intended. Is there a way to know the encoding of a csv-file when read?
How could I fix the regular expression to allow lines with only these characters? a-zA-Z 0-9 .,;:"'()-_& space tab.
Clob example read from csv:
l_clob clob :='
"exp","objc","objc","OBR","031110-5","S","EXAMPLE","NAME","08/03/2018",,"122","3","12,45"
"xp","objc","obj","OBR","031300-5","S","EXAMPLE","NAME","08/03/2018",,"0","0","0"
';
Another clob:
DECLARE
l_clob CLOB
:= '"exp","objc","objc","OBR","031110-5","S","EXAMPLE","NAME","08/03/2018",,"122","3","12,45"
"xp","objc","obj","OBR","031300-5","S","EXAMPLE","NAME","08/03/2018",,"0","0","0"';
l_offset PLS_INTEGER := 1;
l_line VARCHAR2 (32767);
csvregexp CONSTANT VARCHAR2 (1000)
:= '^([''"]+[-&\s(a-z0-9)]*[''"]+[,:;\t\s]?)?[''"]+[-&\s(a-z0-9)]*[''"]+' ;
l_total_length PLS_INTEGER := LENGTH (l_clob);
l_line_length PLS_INTEGER;
BEGIN
WHILE l_offset <= l_total_length
LOOP
l_line_length := INSTR (l_clob, CHR (10), l_offset) - l_offset;
IF l_line_length < 0
THEN
l_line_length := l_total_length + 1 - l_offset;
END IF;
l_line := SUBSTR (l_clob, l_offset, l_line_length);
IF REGEXP_LIKE (l_line, csvregexp, 'i')
THEN -- i (case insensitive matches)
DBMS_OUTPUT.put_line ('Ok');
DBMS_OUTPUT.put_line (l_line);
ELSE
DBMS_OUTPUT.put_line ('Error');
DBMS_OUTPUT.put_line (l_line);
END IF;
l_offset := l_offset + l_line_length + 1;
END LOOP;
END;
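If you only want to allow specific characters you can use this regex. The hyphen is placed last in the bracket expression so that it is taken literally; written as )-_ it would be read as a range from ) to _. Note that the dates in your sample rows contain /, which is not in your list, so add it to the set if such rows should pass.
Your Regex
csvregexp CONSTANT VARCHAR2 (1000) := '^[a-zA-Z0-9 .,;:"''()_&-]+$' ;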
Regex-Details
^ Start of your string - no chars before this - prevents partial match
[] a set of allowed chars
[]+ a set of allowed chars. Has to be one char minimum up to inf. (* instead of + would mean 0-inf.)
[a-zA-Z]+ 1 to inf. letters
[a-zA-Z0-9]+ 1 to inf. letters and numbers
$ end of your string - no chars behind this - prevents partial match
I think you can work it out with this ;-)
If you know there could be another encoding in your input, you could try to convert and check against the regex again.
Example-convert
select convert('täst','us7ascii', 'utf8') from dual;
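As a quick way to experiment with the character whitelist, here is a small Python sketch (not PL/SQL). ALLOWED_LINE and rejected_lines are hypothetical names, the sample text is made up from fragments of the question's rows, and the tab is added to the set because the question lists it.

```python
import re

# Characters from the question: letters, digits, . , ; : " ' ( ) - _ & space and tab.
# The hyphen comes last so it is a literal '-' rather than part of a range.
ALLOWED_LINE = re.compile(r"""[a-zA-Z0-9 \t.,;:"'()_&-]+""")

def rejected_lines(text):
    """Return the non-empty lines containing any character outside the allowed set."""
    return [line for line in text.splitlines()
            if line and not ALLOWED_LINE.fullmatch(line)]

sample = '"exp","objc","12,45"\n"xp","ñ§","0"\n"NAME","08/03/2018"'
print(rejected_lines(sample))
```

The last sample line is rejected only because of the / in the date. A class written with )-_ in the middle would accept it by accident, since / falls inside that range.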

Largest "separation" of patterns for Delphi regex?

Update
As Graymatter has observed, regex fails to match when there are at least 2 extra line breaks before the second target. That is to say, changing the concatenation loop to "for I := 0 to 1" will make the regex-match fail.
As shown in the code below, without the concatenation, the program can get the two values using regex. However, with the concatenation, the program cannot get the two values.
Could you comment on the reason and suggest a workaround?
program Project1;
{$APPTYPE CONSOLE}
uses
// www.regular-expressions.info/delphi.html
// http://www.regular-expressions.info/download/TPerlRegEx.zip
PerlRegEx,
SysUtils;
procedure Test;
var
Content: UTF8String;
Regex: TPerlRegEx;
GroupIndex: Integer;
I: Integer;
begin
Regex := TPerlRegEx.Create;
Regex.Regex := 'Value1 =\s*(?P<Value1>\d+)\s*.*\s*Value2 =\s*(?P<Value2>\d*\.\d*)';
Content := '';
for I := 0 to 10000000 do
begin
// Uncomment here to see effect
// Content := Content + 'junkjunkjunkjunkjunk' + sLineBreak;
end;
Regex.Subject := 'junkjunkjunkjunkjunk' +
sLineBreak + ' Value1 = 1' +
sLineBreak + 'junkjunkjunkjunkjunk' + Content +
sLineBreak + ' Value2 = 1.23456789' +
sLineBreak + 'junkjunkjunkjunkjunk';
if Regex.Match then
begin
GroupIndex := Regex.NamedGroup('Value1');
Writeln(Regex.Groups[GroupIndex]);
GroupIndex := Regex.NamedGroup('Value2');
Writeln(Regex.Groups[GroupIndex]);
end
else
begin
Writeln('No match');
end;
Regex.Free;
end;
begin
Test;
Readln;
end.
Adding this line works.
Regex.Options := [preSingleLine];
From the documentation:
preSingleLine
Normally, dot (.) matches anything but a newline (\n). With preSingleLine, dot (.) will match anything, including newlines. This allows a multiline string to be regarded as a single entity. Equivalent to Perl's /s modifier. Note that preMultiLine and preSingleLine can be used together.
When there is only one line break before the second target, the regex can match even without preSingleLine. The reason is that \s can match a line break, so the \s* on either side of .* absorbs the breaks around a single line of junk.
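The same effect can be reproduced with Python's re module, where re.DOTALL plays the role of Perl's /s modifier. This is only a sketch in a different regex engine, using \n as the line break and shortened junk lines; find_values is a hypothetical helper name.

```python
import re

PATTERN = r"Value1 =\s*(?P<Value1>\d+)\s*.*\s*Value2 =\s*(?P<Value2>\d*\.\d*)"

def find_values(subject, single_line=False):
    """Return (Value1, Value2) or None; single_line lets '.' match newlines."""
    flags = re.DOTALL if single_line else 0
    match = re.search(PATTERN, subject, flags)
    return (match.group("Value1"), match.group("Value2")) if match else None

one_junk_line = "junk\n Value1 = 1\njunk\n Value2 = 1.23456789\njunk"
two_junk_lines = "junk\n Value1 = 1\njunk\njunk\n Value2 = 1.23456789\njunk"

print(find_values(one_junk_line))                      # matches without DOTALL
print(find_values(two_junk_lines))                     # no match: '.' stops at newline
print(find_values(two_junk_lines, single_line=True))   # matches with DOTALL
```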

Word wrap a Delphi string to a certain length using commas not spaces?

I am generating a comma separated list of names in a string eg
Mr John Blue, Miss A Green, Mr Posh Hyphenated-Surname, Mr Fred Green, Miss Helen Red, Ms Jean Yellow
I now want to display them in a memo box that will hold 50 characters on each line so that as many names as possible (and their trailing comma) appear on each line.
so the above should look like
Mr John Blue, Miss A Green,
Mr Posh Hyphenated-Surname, Mr Fred Green,
Miss Helen Red, Ms Jean Yellow
I've played with
Memo1.text := WrapText(Mystring,50)
but it broke lines at spaces between forename and surnames and I tried
Memo1.text := WrapText(MyString, slinebreak, ',' ,50)
to force it to break only after a comma but that broke at spaces as well as commas. Both also tended to break at a hyphen and I note from Rob Kennedy's reply to a similar question that embedded quotes cause problems with Wrap() so a name like Mr John O'Donald would cause problems.
I even tried rolling my own function by counting characters and looking for commas but got bogged down in multiple nested IFs (too embarrassed to show the dreadful code for that!)
Can anyone offer any help or code showing how this can be done?
PS
I have looked at
'Word wrap in TMemo at a plus (+) char'
'How do I split a long string into “wrapped” strings?'
'Find a certain word in a string, and then wrap around it'
and other similar posts but none seem to match what I am looking for.
Set Memo1.WordWrap:=False;
There are many solutions, I show here just one.
But take care :
If you are using it with large amounts of data then the execution is quite slow
procedure TForm1.AddTextToMemo(needle,xsSrc:string);
var
xsNew:string;
mposOld,mposNew:integer;
start:byte;
begin
xsNew:=xsSrc;
mposNew:=0; // local variables are not initialized automatically
repeat
repeat
mposOld:=mposNew;
mposNew:=Pos(needle,xsSrc);
if mposNew>0 then xsSrc[mposNew]:='*';
until (mposNew > 50) OR (mposNew = 0);
if mposOld > 0 then begin
if xsNew[1] = ' ' then start := 2 else start := 1;
if mposNew = 0 then mposOld:=Length(xsNew);
Memo1.Lines.Add(copy(xsNew,start,mposOld));
if mposNew = 0 then exit;
xsNew:=copy(xsNew,mposOld+1,Length(xsNew)-mposOld);
xsSrc:=xsNew;
mposNew:=0;
end else xsSrc:='';
until xsSrc = '';
end;
procedure TForm1.Button1Click(Sender: TObject);
begin
Memo1.Clear;
AddTextToMemo(',','Mr John Blue, Miss A Green, Mr Posh Hyphenated-Surname, '+
'Mr Fred Green, Miss Helen Red, Ms Jean Yellow');
end;
UPDATE
If you have a small amount of data, here is a version that is fast and easy to read.
...
var
Form1: TForm1;
NameList: TStrings;
...
NameList := TStringList.Create;
...
procedure TForm1.AddTextToMemoB(needle,xsSrc:string);
var
xsNew:string;
i:integer;
sumLen:byte;
begin
xsNew:=''; sumLen:=0;
nameList.Text:=StringReplace(xsSrc,needle,needle+#13#10,[rfReplaceAll]);
for i := 0 to nameList.Count - 1 do begin
sumLen:=SumLen+Length(nameList[i]);
if i < nameList.Count - 1 then begin
if (sumLen + Length(nameList[i+1]) > 50) then begin
if xsNew='' then xsNew:=nameList[i];
Memo1.Lines.Add(xsNew);
xsNew:='';
sumLen:=0;
end else if xsNew='' then xsNew:=nameList[i]+nameList[i+1] else
xsNew:=xsNew+nameList[i+1];
end else Memo1.Lines.Add(xsNew);
end; // for
end;
I haven't tested it, but something along the following lines ought to do the trick.
for LCh in S do
begin
case LCh of
',' : //Comma completes a word
begin
LWord := LWord + LCh;
if (LLine <> '') and //Don't wrap if we haven't started a line
((Length(LLine) + Length(LWord)) > ALineLimit) then
begin
//Break the current line if the new word makes it too long
AStrings.Add(LLine);
LLine := '';
end;
if (LLine <> '') then LLine := LLine + ' '; //One space between words
LLine := LLine + LWord;
LWord := '';
end;
else
if (LWord = '') and (LCh in [' ', #9]) then
begin
//Ignore whitespace at start of word.
//We'll explicitly add one space when needed.
//This might remove some extraneous spaces.
//Consider it a bonus feature.
end else
begin
LWord := LWord + LCh;
end;
end;
end;
//Add the remainder
if (LLine <> '') and //Don't wrap if we haven't started a line
((Length(LLine) + Length(LWord)) > ALineLimit) then
begin
//Break the current line if the new word makes it too long
AStrings.Add(LLine);
LLine := '';
end;
if (LLine <> '') then LLine := LLine + ' '; //One space between words
LLine := LLine + LWord;
AStrings.Add(LLine);
Of course you may have noted the duplication that should be moved to a sub-routine.
Tweak away to your heart's content.
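For comparison, the greedy "fill the line until the next name would not fit" logic can be sketched in a few lines of Python. This is an illustration of the algorithm rather than Delphi code; wrap_at_commas is a hypothetical name, and the width of 50 comes from the question.

```python
def wrap_at_commas(text, width=50):
    """Break a comma separated list into lines of at most `width` characters,
    splitting only after a comma (a single item longer than `width` is kept whole)."""
    items = [part.strip() for part in text.split(",") if part.strip()]
    # Re-attach the trailing comma to every item except the last one.
    words = [item + "," for item in items[:-1]] + items[-1:]

    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if current and len(candidate) > width:
            lines.append(current)   # the new item does not fit: start a new line
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines

names = ("Mr John Blue, Miss A Green, Mr Posh Hyphenated-Surname, "
         "Mr Fred Green, Miss Helen Red, Ms Jean Yellow")
for line in wrap_at_commas(names):
    print(line)
```

Since the text is split only at commas, spaces, hyphens and apostrophes inside a name never cause a break.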

How to skip quoted text in regex (or How to use HyperStr ParseWord with Unicode text ?)

I need regex help to create a delphi function to replace the HyperString ParseWord function in Rad Studio XE2. HyperString was a very useful string library that never made the jump to Unicode. I've got it mostly working but it doesn't honor quote delimiters at all. I need it to be an exact match for the function described below:
function ParseWord(const Source,Table:String;var Index:Integer):String;
Sequential, left to right token parsing using a table of single
character delimiters. Delimiters within quoted strings are ignored.
Quote delimiters are not allowed in Table.
Index is a pointer (initialize to '1' for first word) updated by the
function to point to next word. To retrieve the next word, simply
call the function again using the prior returned Index value.
Note: If Length(Resultant) = 0, no additional words are available.
Delimiters within quoted strings are ignored. (my emphasis)
This is what I have so far:
function ParseWord( const Source, Table: String; var Index: Integer):string;
var
RE : TRegEx;
match : TMatch;
Table2,
chars : string;
begin
if index = length(Source) then
begin
result:= '';
exit;
end;
// escape the special characters and wrap in a Group
Table2 :='['+TRegEx.Escape(Table, false)+']';
RE := TRegEx.create(Table2);
match := RE.Match(Source,Index);
if match.success then
begin
result := copy( Source, Index, match.Index - Index);
Index := match.Index+match.Length;
end
else
begin
result := copy(Source, Index, length(Source)-Index+1);
Index := length(Source);
end;
while ( Length(result)= 0) and (Index<length(Source)) do
begin
Inc(Index);
result := ParseWord(Source,Table, Index);
end;
end;
cheers and thanks.
I would try this regex for Table2:
Table2 := '''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, false) + ']+';
Demo:
This demo is more a POC since I was unable to find an online delphi regex tester.
The delimiters are the space (ASCII code 32) and pipe (ASCII code 124) characters.
The test sentence is:
toto titi "alloa toutou" 'dfg erre' 1245|coucou "nestor|delphi" "" ''
http://regexr.com?32i81
Discussion:
I assume that a quoted string is a string enclosed by either two single quotes (') or two double quotes ("). Correct me if I am wrong.
The regex will match either:
a single quoted string
a double quoted string
a string not composed by any passed delimiters
Known bug:
Since I didn't know how ParseWord handle quote escaping inside string, the regex doesn't support this feature.
For instance :
How to interpret this 'foo''bar' ? => Two tokens : 'foo' and 'bar' OR one single token 'foo''bar'.
What about this case too : "foo""bar" ? => Two tokens : "foo" and "bar" OR one single token "foo""bar".
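The alternation above can be tried out quickly in Python, whose re syntax is close enough for this pattern. This is only a sketch: tokenize is a hypothetical helper, it returns all tokens at once instead of tracking an Index, and like the regex it does not handle doubled quotes inside a quoted string.

```python
import re

def tokenize(source, table):
    """Split `source` on the single-character delimiters in `table`,
    keeping single- or double-quoted strings together."""
    pattern = (
        "'[^']+'"                           # a single-quoted string
        + '|"[^"]+"'                        # or a double-quoted string
        + "|[^" + re.escape(table) + "]+"   # or a run of non-delimiter characters
    )
    return re.findall(pattern, source)

sentence = ("toto titi \"alloa toutou\" 'dfg erre' 1245|coucou "
            "\"nestor|delphi\" \"\" ''")
for token in tokenize(sentence, " |"):
    print(token)
```

Empty quoted strings such as "" fall through to the third alternative, because the first two require at least one character between the quotes.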
In my original code I was looking for the delimiter and taking everything up to that as my next match, but that concept didn't carry over when looking for something within quotes. @Stephan's suggestion of negating the search eventually led me to something that works. An additional complication that I never mentioned earlier is that HyperStr can use anything as a quoting character. The default is double quote but you can change it with a function call.
In my solution I've explicitly hardcoded the QuoteChar as double quote, which suits my own purposes, but it would be trivial to make QuoteChar a global and set it within another function. I've also successfully tested it with single quote (ascii 39), which would be the tricky one in Delphi.
function ParseWord( const Source, Table: String; var Index: Integer):string;
var
RE : TRegEx;
match : TMatch;
Table2: string;
Source2 : string;
QuoteChar : string;
begin
if index = length(Source) then
begin
result:= '';
exit;
end;
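// build the pattern: a run of non-delimiter, non-quote characters, or a quoted string
// #39 is the single quote used for testing; use '"' for the HyperStr default
QuoteChar := #39;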
Table2 :='[^'+TRegEx.Escape(Table, false)+QuoteChar+']*|'+QuoteChar+'.*?'+QuoteChar ;
Source2 := copy(Source, Index, length(Source)-index+1);
match := TRegEx.Match(Source2,Table2);
if match.success then
begin
result := copy( Source2, match.index, match.length);
Index := Index + match.Index + match.Length-1;
end
else
begin
result := copy(Source, Index, length(Source)-Index+1);
Index := length(Source);
end;
while ( Length(result)= 0) and (Index<length(Source)) do
begin
Inc(Index);
result := ParseWord(Source,Table, Index);
end;
end;
This solution doesn't strip the quote chars from around quoted strings, but I can't tell from my own existing code if it should or not, and I can't test using Hyperstr. Maybe someone else knows?