freepascal regexp replace


Is there an easy way to do a RegExp replace in FreePascal/Lazarus?
Hunting around I can see that I can do a match fairly easily, but I'm struggling to find functions to do a search and replace.
What I'm trying to achieve is as follows.
I have an XML file loaded into a SynEdit component.
The XML file has a declaration at the start.
The DTD is held in a separate file.
I don't want to combine the two in one file, but I do want to validate the XML as it is being edited.
I'm reading the XML into a string variable and I want to insert the DTD between the declaration and the XML content in a temporary string variable (to create a compliant XML with a self-contained DTD) that can be parsed and validated.
So essentially I have:
<?Line1?>
Line2
Line3
And I want to do a RegExp-type search and replace for '<?Line1?>', replacing it with '<?Line1?>\n<![DTD\nINFO WOULD\nGO HERE\n!]' to give me:
<?Line1?>
<![DTD
INFO WOULD
GO HERE
!]
Line2
Line3
For example in PHP I would use:
preg_replace('/(<\?.*\?>)/im','$1
<![DTD
INFO WOULD
GO HERE
!]',$sourcestring);
But there doesn't seem to be an equivalent set of regexp functions for FreePascal / Lazarus - just a simple/basic RegExp match function.
Or is there an easier way without using regular expressions? To complicate things, I don't want to assume that the declaration is always in the correct position on line 1.
Thanks,
FM

As far as I know, the PerlRegEx unit isn't compatible with Free Pascal. But you can use the RegExpr unit, which comes with Free Pascal.
If I understand correctly, you want a replacement with substitution. Here is a simple example that you can adapt to your needs.
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  regexpr;
var
  s: string;
begin
  s := 'My name is Bond.';
  s := ReplaceRegExpr(
    'My name is (\w+?)\.',
    s,
    'His name is $1.',
    TRUE // Use substitution
  );
  WriteLn(s); // His name is Bond.
  ReadLn;
end.
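For comparison with the PHP call in the question, here is a minimal Python sketch of the same capture-and-reinsert idea applied to the sample data. It is only an illustration of the pattern, not FreePascal code; `insert_after_declaration` is a hypothetical helper name, and it assumes the declaration fits on one line.

```python
import re

def insert_after_declaration(xml_text, dtd_text):
    """Insert dtd_text on a new line after the first <?...?> declaration."""
    # A function replacement keeps any backslashes in dtd_text literal.
    return re.sub(
        r"(<\?.*?\?>)",
        lambda m: m.group(1) + "\n" + dtd_text,
        xml_text,
        count=1,  # only the first declaration, wherever it appears
    )

source = "<?Line1?>\nLine2\nLine3"
dtd = "<![DTD\nINFO WOULD\nGO HERE\n!]"
result = insert_after_declaration(source, dtd)
print(result)
```

If no declaration is found, the text comes back unchanged, so the caller can compare input and output to detect that case.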

Related

Compare words in Prestashop Smarty tpl file (Cyrillic symbols)

I almost found a solution in another question.
But as far as I understand, {if $haystack1|strstr:"_thestring_"}Found!{/if} does not work with non-Latin characters.
The problem: I need to check whether the string 'терминалы' exists in the $payment_method.desc variable.
Here is the Smarty code
(the variable $payment_method.desc contains the text 'Оплата наличными через кассы и терминалы'):
{assign "desc" $payment_method.desc}
{assign "var_1" "терминалы"}
{if $desc|@mb_stristr:$var_1|@var_dump}Found!{/if}
{if $desc|@mb_strstr:$var_1|@var_dump}Found!{/if}
{if $desc|@strstr:$var_1|@var_dump}Found!{/if}
The same code works if I use Latin characters.

Smarty variable declarations use PHP's internal encoding.
You should check the last parameter of the mb_* functions, which sets the encoding. See the documentation for mb_strstr.
This post could help you too: "php case-insensitive comparison of russian characters".
If you are sure that the string contains Russian characters, you should consider converting it from the "Windows-1251" encoding.
Any PHP function can be called from Smarty, so you can test with all of them.
Good luck.
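To illustrate why the encoding matters, here is a small Python sketch (not Smarty or PHP) using the two strings from the question. It assumes, purely for illustration, that the description arrives as Windows-1251 bytes while the search word is UTF-8.

```python
needle = "терминалы"
desc = "Оплата наличными через кассы и терминалы"

# Same text, two different byte encodings: a byte-level search fails.
desc_cp1251 = desc.encode("cp1251")
needle_utf8 = needle.encode("utf-8")
bytes_found = needle_utf8 in desc_cp1251

# Decode to text first, then compare case-insensitively.
text_found = needle.casefold() in desc_cp1251.decode("cp1251").casefold()

print(bytes_found, text_found)
```

Once both values are text in the same encoding, the substring check behaves the same for Cyrillic as for Latin characters.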

Using multiple Perl regular expressions to find and replace

I'm a Perl and regex newcomer in need of your expertise.
I need to process text files that include placeholder lines like Foo Bar1.jpg and replace those with corresponding URLs like https:/baz/qux/Foo_Bar1.jpg.
As you may have guessed, I'm working with HTML. The placeholder text refers to the filename, which is the only thing available when writing the document. That's why I have to use placeholder text. Ultimately, of course, I want to replace the filename with the URL (after I upload file to my CMS to get the URL). At that point, I have all of the information at hand — the filename and the URL. Of course, I could just paste the URLs over the placeholder names in the HTML document. In fact, I've done that. But I'm certain that there's a better way.
In short, I have placeholder lines like this:
Foo Bar1.jpg
Foo Bar2.jpg
Foo Bar3.jpg
And I also have URL lines like this:
https:/baz/qux/Foo_Bar1.jpg
https:/baz/qux/Foo_Bar2.jpg
https:/baz/qux/Foo_Bar3.jpg
I want to find the placeholder string and capture a differentiator like Bar1 with a regex. Then I want to use the captured part like Bar1 to perform another regex search that matches part of the corresponding URL string, i.e. https:/baz/qux/Foo_Bar1.jpg. After a successful match, I want to replace the Foo Bar1.jpg line with https:/baz/qux/Foo_Bar1.jpg.
Ultimately, I want to do that for every permutation, so that https:/baz/qux/Foo_Bar2.jpg also replaces Foo Bar2.jpg and so on.
I've written regular expressions that match both the placeholder and the URL. That's not my problem, as far as I can tell. I can find the strings I need to process. For example, /[a-z]+\s([a-z0-9]+)\.jpg/ successfully matches what I'm calling the placeholder text and captures what I'm calling the differentiator.
However, though I've spent an embarrassing number of hours over the past week reading through Stack Overflow, various other sites and O'Reilly books on Perl and Perl regular expressions, I can't wrap my mind around how to process what I can find.

I think the piece you are missing is the idea of using Perl's internal grep function, for searching a list of URL lines based on what you are calling your "differentiator".
Slurp your URL lines into a Perl array (assuming there are a finite manageable number of them, so that memory is not clobbered):
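open(my $urls_fh, '<', 'theUrlFile.txt') or die "Cannot open: $!\n";
chomp(my @urls = <$urls_fh>);
Then within the loop over your file containing "placeholders" (with the current line in $_):
# /i is added here because the sample names start with capital letters
if (my ($key) = /[a-z]+\s([a-z0-9]+)\.jpg/i) {
    # keep only the URLs that contain the captured differentiator
    my @matches = grep { /\Q$key\E/ } @urls;
    if (@matches) {
        s/[a-z]+\s\Q$key\E\.jpg/$matches[0]/i;
    }
}
You may also want to insert error/warning messages if @matches != 1.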
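The same lookup can be sketched in Python with a dictionary keyed by the differentiator instead of a repeated grep. This is a minimal sketch under assumptions: `replace_placeholders` is a hypothetical name, each placeholder occupies a whole line, and each URL ends in `_<differentiator>.jpg` as in the samples.

```python
import re

PLACEHOLDER = re.compile(r"^[A-Za-z]+\s([A-Za-z0-9]+)\.jpg$")

def replace_placeholders(lines, urls):
    """Swap each placeholder line for the URL ending in _<key>.jpg."""
    # Index URLs by differentiator, e.g. "Bar1" for ".../Foo_Bar1.jpg".
    by_key = {}
    for url in urls:
        m = re.search(r"_([A-Za-z0-9]+)\.jpg$", url)
        if m:
            by_key[m.group(1)] = url
    out = []
    for line in lines:
        m = PLACEHOLDER.match(line)
        # Leave the line alone if it is not a placeholder or no URL matches.
        out.append(by_key.get(m.group(1), line) if m else line)
    return out

lines = ["Foo Bar1.jpg", "unrelated text", "Foo Bar2.jpg"]
urls = ["https:/baz/qux/Foo_Bar2.jpg", "https:/baz/qux/Foo_Bar1.jpg"]
replaced = replace_placeholders(lines, urls)
print(replaced)
```

Building the index once avoids rescanning the URL list for every placeholder, and the order of the URL lines no longer matters.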

RegEx to extract first XML element name with optional namespace prefix

I have to extract the first element name in the XML with a regex (ignoring the optional namespace prefix).
Here is the sample XML1:
<ns1:Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">
<foodType>
<vegtables>
<carrots>1</carrots>
</vegtables>
<foodType>
</ns1:Monkey>
And here is similar XML that is without namespace, XML2:
<Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">
<foodType>
<vegtables>
<carrots>1</carrots>
</vegtables>
<foodType>
</Monkey>
I need a regex that will return "Monkey" for either XML1 or XML2.
So far I have tried the regex <(\w+:)(\w+), which works for XML1, but I don't know how to make it work for XML2.

Since it seems to be a one-time job and you really do not have access to an XML parser, you can use either of these two regexps (they will work only for XML files like the samples you provided):
<(\w+:)?(\w+)(?=\s*xmlns="http://myurlisrighthereheremonkey\.com/monkeynamespace")
Or (if you check the whole single file contents with the regex):
^\s*<(\w+:)?(\w+)
There are two main changes:
(\w+:)? - adding the ? quantifier makes the first capturing group optional.
^\s* anchors the match at the beginning of the string (I assume you do not have an XML declaration there); alternatively, the (?=\s*xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace") look-ahead allows a match only if it is followed by optional whitespace and the literal xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace".
However, you really should think about switching to code that supports XML parsing; it will make your life, and the lives of those who maintain the code, easier.
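As a quick check of the second pattern, here is a minimal Python sketch that runs it against the opening lines of both samples. `first_element_name` is a hypothetical helper name, and the pattern is used exactly as given in the answer.

```python
import re

FIRST_ELEMENT = re.compile(r"^\s*<(\w+:)?(\w+)")

def first_element_name(xml_text):
    """Return the first element's local name, or None if nothing matches."""
    m = FIRST_ELEMENT.search(xml_text)
    # Group 1 is the optional "prefix:"; group 2 is the name itself.
    return m.group(2) if m else None

xml1 = '<ns1:Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">'
xml2 = '<Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">'
print(first_element_name(xml1), first_element_name(xml2))
```

Because the pattern is anchored at the start of the string, input that begins with an XML declaration yields no match rather than a wrong name.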

Unexpected RegEx behavior in Delphi XE

Delphi XE, using Delphi's own RegularExpressions unit.
I'm attempting to correct some bad RTF code, where 'bookmark' tags cross the boundaries of a table cell. Seems simple enough. The code I'm using is below. Here's the general idea.
Given this text
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}
Look for a match to this pattern (there should be exactly one in the given text):
{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}
When found, replace it with this (non-RegEx) string:
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}
The expected result is that the first string should be replaced with the last string, e.g.:
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell} *becomes*
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}
However, the result I'm actually getting is this:
{\*\bkmkstart BM0}\plain{\*\bkmkstart bm0}\plain\f0\fs24\cf0 ^\cell}\fs24\cf0 ^{\*\bkmkend BM0}\plain{\*\bkmkstart bm0}\plain\f0\fs24\cf0 ^\cell}\fs24\cf0 \cell}
It looks as if the RegEx parser is getting horribly confused somehow, but I can't even characterize what is happening. It's not a mere double replacement, or an insertion instead of replacement. The 'ReplaceWith' string does seem to be the source of the confusion, though. If I use a nice simple 'XXXX' for the ReplaceWith string, instead of the RTF, it works exactly as it should.
So, any ideas how/why the RegEx search/replace is breaking so strangely here?
Here is the code I'm using:
procedure TfrmMain.btnProcessClick(Sender: TObject);
const
  SourceString = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}';
  RegExFind = '{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}';
  ReplaceWith = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}';
var
  ResultStr: string;
  MyRegEx: TRegEx;
begin
  MyRegEx := TRegEx.Create (RegExFind);
  ResultStr := MyRegEx.Replace (SourceString, ReplaceWith);
  ShowMessage (ResultStr);
end;

You need to escape the \ characters in your replacement string:
ReplaceWith = '{\\*\\bkmkstart BM0}\\plain\\f0\\fs24\\cf0 ^{\\*\\bkmkend BM0}\\plain\\f0\\fs24\\cf0 \\cell}';
When you make this change the output is:
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}
In fact, for your replacement string, you only need to escape the backslash in \f0 which, as it happens, appears twice. Personally I think it's just easier to escape the backslash indiscriminately.
By combining regular expressions and RTF you've mixed your own special backslash soup — tread carefully. Just be thankful you aren't using C or older versions of C++ that do not support raw strings. That backslash soup would be completely unpalatable!
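The same pitfall exists in other engines. As an illustration only, here is a Python sketch: Python's `re` module is a different engine from Delphi's TRegEx, but it also interprets backslashes in replacement text. It shows two ways to get a literal replacement, using the strings from the question (with the braces in the pattern escaped for Python).

```python
import re

source = r"{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}"
pattern = r"\{\\\*\\bkmkstart BM0\}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell\}"
replacement = (
    r"{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^"
    r"{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}"
)

# Option 1: double every backslash so the engine reads each one as a literal.
escaped = re.sub(pattern, replacement.replace("\\", "\\\\"), source)

# Option 2: the return value of a function replacement is inserted verbatim.
verbatim = re.sub(pattern, lambda m: replacement, source)

print(escaped == replacement, verbatim == replacement)
```

Escaping every backslash, as the answer recommends, is the approach that carries over between engines; the function form is a Python convenience.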

Article spinner with 2 tiers

I made an article spinner that used regex to find words in this syntax:
{word1|word2}
And then split them up at the "|", but I need a way to make it support tier 2 brackets, such as:
{{word1|word2}|{word3|word4}}
What my code does when presented with such a line, is take "{{word1|word2}" and "{word3|word4}", and this is not as intended.
What I want is when presented with such a line, my code breaks it up as "{word1|word2}|{word3|word4}", so that I can use this with the original function and break it into the actual words.
I am using C#.
Here is pseudocode for how it might look:
Check string for regex match to "{{word1|word2}|{word3|word4}}" pattern
If found, store each one as "{word1|word2}|{word3|word4}" in MatchCollection (mc1)
Split the word at the "|" but not the one inside the brackets, and select a random one (aka, "{word1|word2}" or "{word3|word4}")
Store the new results aka "{word1|word2}" and "{word3|word4}" in a new MatchCollection (mc2)
Now search the string again, this time looking for "{word1|word2}" only and ignore the double "{{" "}}"
Store these in mc2.
I can not split these up normally
Here is the regex I use to search for "{word1|word2}":
Regex regexObj = new Regex(@"\{.*?\}", RegexOptions.Singleline);
MatchCollection m = regexObj.Matches(originalText); // How I store them
Hopefully someone can help, thanks!
Edit: I solved this using a recursive method. I was building an article spinner btw.

That is not parsable using a regular expression; instead, you have to use a recursive descent parser. Alternatively, map it to JSON by replacing:
{ with [ (and } with ])
| with ,
wordX with "wordX" (regex \w+)
Then your input
{{word1|word2}|{word3|word4}}
becomes valid JSON
[["word1","word2"],["word3","word4"]]
and will map directly to PHP arrays when you call json_decode.
In C#, the same should be possible with JavaScriptSerializer.
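The mapping can be sketched in Python as follows. This is a minimal sketch, assuming every option is a single `\w+` word with no spaces or punctuation; `spintax_to_nested_lists` is a hypothetical name.

```python
import json
import re

def spintax_to_nested_lists(text):
    """Map {a|b} syntax onto JSON arrays and parse the result."""
    # Quote each word first, while the braces and pipes are still intact.
    quoted = re.sub(r"\w+", lambda m: json.dumps(m.group(0)), text)
    as_json = quoted.replace("{", "[").replace("}", "]").replace("|", ",")
    return json.loads(as_json)

nested = spintax_to_nested_lists("{{word1|word2}|{word3|word4}}")
print(nested)
```

The JSON parser does the recursive descent, so any nesting depth is handled without extra code.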

I'm really not completely sure WHAT you're asking for, but I'll give it a go:
If you want to get {word1|word2}|{word3|word4} out of any occurrence of {{word1|word2}|{word3|word4}} but not {word1|word2} or {word3|word4}, then use this:
@"\{(\{[^}]*\}\|\{[^}]*\})\}"
...which will match {{word1|word2}|{word3|word4}}, but with {word1|word2}|{word3|word4} in the first matching group.
I'm not sure if this will be helpful or even if it's along the right track, but I'll try to check back every once in a while for more questions or clarifications.

s = "{Spinning|Re-writing|Rotating|Content spinning|Rewriting|SEO Content Machine} is {fun|enjoyable|entertaining|exciting|enjoyment}! try it {for yourself|on your own|yourself|by yourself|for you} and {see how|observe how|observe} it {works|functions|operates|performs|is effective}."
print spin(s)
If you want to use the [square|brackets|syntax] use this line in the process function:
'/\[(((?>[^\[\]]+)|(?R))*)\]/x',
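The `spin` function called above is not shown in this excerpt. A minimal Python sketch of one way to write it, resolving the innermost groups first and repeating until none remain, might look like this. The `rng` parameter is an assumption added so that the random choice can be controlled.

```python
import random
import re

# A group that contains no further braces, i.e. an innermost {a|b|c}.
INNERMOST = re.compile(r"\{([^{}]*)\}")

def spin(text, rng=random):
    """Resolve {a|b|c} groups from the inside out until none remain."""
    while True:
        text, count = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split("|")), text
        )
        if count == 0:
            return text

print(spin("{{word1|word2}|{word3|word4}} is {fun|enjoyable}!"))
```

Each pass removes one level of nesting, so the two-tier input from the question is resolved in two passes and deeper nesting needs no special handling.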
