Unexpected RegEx behavior in Delphi XE - regex

Delphi XE, using Delphi's own RegularExpressions unit.
I'm attempting to correct some bad RTF code, where 'bookmark' tags cross the boundaries of a table cell. Seems simple enough. The code I'm using is below. Here's the general idea.
Given this text
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}
Look for a match to this pattern (there should be exactly one in the given text):
{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}
When found, replace it with this (non-RegEx) string:
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}
The expected result is that the first string is replaced with the last string, e.g.:
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell} *becomes*
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}
However, the result I'm actually getting is this:
{\*\bkmkstart BM0}\plain{\*\bkmkstart bm0}\plain\f0\fs24\cf0 ^\cell}\fs24\cf0 ^{\*\bkmkend BM0}\plain{\*\bkmkstart bm0}\plain\f0\fs24\cf0 ^\cell}\fs24\cf0 \cell}
It looks as if the RegEx parser is getting horribly confused somehow, but I can't even characterize what is happening. It's not a mere double replacement, or an insertion instead of replacement. The 'ReplaceWith' string does seem to be the source of the confusion, though. If I use a nice simple 'XXXX' for the ReplaceWith string, instead of the RTF, it works exactly as it should.
So, any ideas how/why the RegEx search/replace is breaking so strangely here?
Here is the code I'm using:
procedure TfrmMain.btnProcessClick(Sender: TObject);
const
  SourceString = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}';
  RegExFind = '{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}';
  ReplaceWith = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}';
var
  ResultStr: string;
  MyRegEx: TRegEx;
begin
  MyRegEx := TRegEx.Create (RegExFind);
  ResultStr := MyRegEx.Replace (SourceString, ReplaceWith);
  ShowMessage (ResultStr);
end;

You need to escape the \ characters in your replacement string:
ReplaceWith = '{\\*\\bkmkstart BM0}\\plain\\f0\\fs24\\cf0 ^{\\*\\bkmkend BM0}\\plain\\f0\\fs24\\cf0 \\cell}';
When you make this change the output is:
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}
In fact, for your replacement string, you only need to escape the backslash in \f0 which, as it happens, appears twice. Personally I think it's just easier to escape the backslash indiscriminately.
By combining regular expressions and RTF you've mixed your own special backslash soup, so tread carefully. Just be thankful you aren't using C or older versions of C++ that do not support raw strings. That backslash soup would be completely unpalatable!
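The same pitfall exists outside Delphi. As a rough illustration only (this is Python's re module, not Delphi's TRegEx, and the helper name escape_replacement is made up for this sketch), the replacement text can either have its backslashes doubled or be supplied through a function so that it is never parsed as a template:

```python
import re

# The strings from the question, written as Python raw strings.
source = r"{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}"
pattern = r"\{\\\*\\bkmkstart BM0\}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell\}"
replacement = r"{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}"


def escape_replacement(text: str) -> str:
    """Double every backslash so the regex engine treats it as a literal."""
    return text.replace("\\", "\\\\")


# Option 1: escape the backslashes indiscriminately, as the answer suggests.
escaped_result = re.sub(pattern, escape_replacement(replacement), source)

# Option 2: pass a function; its return value is inserted literally.
literal_result = re.sub(pattern, lambda match: replacement, source)
```

Both variables should end up holding the intended RTF string; which option reads better is a matter of taste.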

Related

String formatting in Xpath expression

I have a function as follows. I need to find all the links that contain a particular search term.
def parse(search_term):
    response.xpath("//a[contains(.,search_term)]/@href").extract()
I believe the code above gives me all the anchor links regardless of search_term.
If I replace search_term with "Energy" or any other literal string, it gives the right result, e.g.:
def parse(search_term):
    response.xpath("//a[contains(.,'Energy')]/@href").extract()
The code above gives me the links that have 'Energy' in their text.
Is this a string formatting issue?
XPath expressions are regular Python strings, so you have to "interpolate" them explicitly:
def parse(search_term):
    response.xpath("//a[contains(.,'{}')]/@href".format(search_term)).extract()
Note that this only works for search terms that contain no ' characters; if the term does contain one, you'll need some tricks to escape it.
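One possible escaping trick, sketched in plain Python: XPath 1.0 string literals have no escape character, so a term containing both quote types has to be assembled with concat(). The helper names xpath_string_literal and link_xpath are invented for this sketch and are not part of Scrapy.

```python
def xpath_string_literal(value: str) -> str:
    """Return an XPath 1.0 expression that evaluates to `value`."""
    if "'" not in value:
        return "'{}'".format(value)
    if '"' not in value:
        return '"{}"'.format(value)
    # Both quote types present: split on ' and glue the pieces back
    # together with a double-quoted single quote between them.
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'{}'".format(p) for p in parts) + ")"


def link_xpath(search_term: str) -> str:
    """Build the query from the answer with a safely quoted search term."""
    return "//a[contains(., {})]/@href".format(xpath_string_literal(search_term))
```

The resulting string can then be passed to response.xpath(...) as before. Some selector libraries also accept XPath variables, which would avoid manual quoting altogether; check the documentation for the version you use.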

regex to match everything except character

I have a payload that contains the following:
����\�p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������x�SMB2
I'm looking to extract the file name of patrick-test-file.txt
I can get close by using this, but it continues to include everything (including ascii characters)
[\\\\](.*?)x�SMB2
With a result of this: �p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������ for the capture group.
How would I just match the characters of the file name, which could be anything of variable length, and could contain alphanumeric characters? Is this possible with pure regex?
Any help is much appreciated.
Sometimes you just can't do a single language-agnostic Regular Expression to accomplish something. And sometimes (usually) it is more performant to do a series of string functions.
I wouldn't personally accept any solution which has hard-coded values, such as x�SMB2.
If you want to use Regular Expressions only, you can first select the File-Name portion like so: (([-\w\d.\\]+)[^-\w\d.\\]?)+, then go ahead and replace [^-\w\d.\\] with nothing "".
Honestly, given the limited detail, the best function is like so:
var fileName = @"\patrick-test-file.txt";
But half-joking aside, and with that limited detail, your best bet is to do a couple string functions:
var yuckyString = @"����\�p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������x�SMB2";
var fileNameArea = yuckyString.Split(new[] { "��" }, StringSplitOptions.RemoveEmptyEntries)[0];
var fileName = fileNameArea.Replace("�", "");
Granted, there was no language listed, so I'm using C#. Also, the answer would change if there were irregularities with those special characters. With the limited info, the pattern seems clear.
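For comparison, the two-step regex idea mentioned above (select the file-name run, then delete everything that is not a file-name character) might look like this in Python. This is only a sketch: the function name is made up, and it assumes every filler character is something other than a letter, digit, underscore, hyphen, dot, or backslash.

```python
import re

# Characters allowed in the file-name area, written for use inside [...].
FILENAME_CHARS = r"-\w.\\"


def extract_file_name(payload: str) -> str:
    """Return the first run of file-name characters with the filler removed."""
    # Step 1: a run of file-name characters, each group optionally
    # followed by a single filler character.
    run = re.search(rf"(?:[{FILENAME_CHARS}]+[^{FILENAME_CHARS}]?)+", payload)
    if run is None:
        return ""
    # Step 2: drop every character that is not a file-name character.
    return re.sub(rf"[^{FILENAME_CHARS}]", "", run.group(0))
```

Like the C# version, this relies on the filler pattern being regular; two consecutive filler characters end the run.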

Regex URI portion: Remove hyphens

I have to split URIs on the second portion:
/directory/this-part/blah
The issue I'm facing is that I have 2 URIs which logically need to be one
/directory/house-&-home/blah
/directory/house-%26-home/blah
This comes back as:
house-&-home and house-%26-home
So logically I need a regex to retrieve the second portion but also remove everything between the hyphens.
I have this, so far:
/[^(/;\?)]*/([^(/;\?)]*).*
(?<=directory\/)(.+?)(?=\/)
Does this solve your issue? This returns:
house-&-home and house-%26-home
If you want to get the result:
house--home
then you should use a replace method. Because I am not sure what language you are using, I will give my example in Java:
String regex = "(?<=directory/)(.+?)(?=/)";
String str = "/directory/house-&-home/blah";
Matcher m = Pattern.compile(regex).matcher(str);
// Extract the second portion, then strip the & symbol from it
String result = m.find() ? m.group(1).replace("&", "") : "";
The replace method lets you replace a certain piece of text (here the & symbol) with nothing ("").
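The two URIs in the question differ only in percent-encoding (%26 is an encoded &), so one way to make them collapse to the same value is to decode the extracted segment before stripping the ampersand. A minimal Python sketch, assuming the first segment is always the literal "directory" as in the example; the function name is hypothetical:

```python
import re
from urllib.parse import unquote

# Same lookaround pattern as above: the text between "directory/" and the next "/".
SECOND_SEGMENT = re.compile(r"(?<=directory/)(.+?)(?=/)")


def normalized_segment(uri: str) -> str:
    """Return the second URI portion with '&' (raw or percent-encoded) removed."""
    match = SECOND_SEGMENT.search(uri)
    if match is None:
        return ""
    # Decode %26 to & first, so both spellings are handled by one replace.
    return unquote(match.group(1)).replace("&", "")
```

With this, both example URIs yield house--home.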

Regular expression to trim a string

In my application, I am trying to get the name of a file from a string retrieved from a 'content-header' tag from a server. The filename looks like \"uploads/2014/03/filename.zip\" (quotations included in value).
I have tried using Path.GetFileName(string); to get just the file name, but it throws an exception stating that there are illegal characters in the path.
What should I use to get just filename.zip returned? Is a regex the best way to trim this string, or is there a better one?
The \"uploads/2014/03/ part will always be the same length. The filename.zip can be any filename and extension; I'm just using that as an example. But the numbers may vary. It sounds like a job for a regex to me, but I have no idea how to use regular expressions.
You can try something like this:
var inputString = @"\""uploads/2014/03/filename.zip\""";
var result = inputString.Trim('\\', '"').Split('/')[3];
This should work if the format is always like \"uploads/someNumber/someOtherNumber/filename\". To make it safer, you might want to use the Enumerable.Last method after Split:
var result = inputString.Trim('\\', '"').Split('/').Last();
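The same trim-then-take-the-last-segment idea, sketched in Python for readers outside .NET. The function name is invented for illustration; posixpath.basename is used so that only forward slashes count as separators on any platform.

```python
import posixpath


def file_name_from_header(value: str) -> str:
    """Strip surrounding backslashes and quotes, then keep the last path segment."""
    cleaned = value.strip('\\"')
    return posixpath.basename(cleaned)
```

Because it takes the last segment rather than a fixed index, it does not depend on the number of directories in front of the file name.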

freepascal regexp replace

Is there an easy way to do a RegExp replace in FreePascal/Lazarus?
Hunting around I can see that I can do a match fairly easily, but I'm struggling to find functions to do a search and replace.
What I'm trying to achieve is as follows.
I have an XML file loaded into a SynEdit component.
The XML file has a declaration at the start.
The DTD is held in a separate file.
I don't want to combine the two in one file, but I do want to validate the XML as it is being edited.
I'm reading the XML into a string variable and I want to insert the DTD between the declaration and the XML content in a temporary string variable (to create a compliant XML with a self-contained DTD) that can be parsed and validated.
So essentially I have:
<?Line1?>
Line2
Line3
And I want to do a RegExp-type search and replace for '<?Line1?>', replacing it with '<?Line1?>\n<![DTD\nINFO WOULD\nGO HERE\n!]' to give me:
<?Line1?>
<![DTD
INFO WOULD
GO HERE
!]
Line2
Line3
For example in PHP I would use:
preg_replace('/(<\?.*\?>)/im','$1
<![DTD
INFO WOULD
GO HERE
!]',$sourcestring);
But there doesn't seem to be an equivalent set of regexp functions for FreePascal / Lazarus - just a simple/basic RegExp match function.
Or is there an easier way without using regular expressions - I don't want to assume that the declaration is always there in the correct position on Line 1 though - just to complicate things.
Thanks,
FM
As far as I know, the PerlRegEx unit isn't compatible with Free Pascal. But you can use the RegExpr unit, which comes with Free Pascal.
If I understand correctly, you want a replacement with substitution. Here is a simple example that you can adapt to your need.
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  regexpr;
var
  s: string;
begin
  s := 'My name is Bond.';
  s := ReplaceRegExpr(
    'My name is (\w+?)\.',
    s,
    'His name is $1.',
    TRUE // Use substitution
  );
  WriteLn(s); // His name is Bond.
  ReadLn;
end.
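For the DTD case in the question, the same substitution idea can be illustrated in Python (a sketch only, not Free Pascal code; the function name is made up). It matches the first <?...?> declaration wherever it occurs and appends the DTD text after it. A function is used as the replacement so that any backslashes or $ signs in the DTD are inserted literally.

```python
import re


def insert_after_declaration(xml_text: str, dtd_text: str) -> str:
    """Insert dtd_text on a new line after the first <?...?> declaration."""
    declaration = re.compile(r"<\?.*?\?>", re.DOTALL)
    # count=1 limits the change to the first declaration found; if there
    # is none, the text is returned unchanged.
    return declaration.sub(
        lambda match: match.group(0) + "\n" + dtd_text, xml_text, count=1
    )
```

Applied to the three-line example from the question, this produces the seven-line result shown there.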