Perl script running slowly, taking a minute to apply a simple regex replacement - regex

My Perl script is running slowly: it takes about a minute to apply a regex replacement to the following:
$str = '<![CDATA[$..$]]>';
I have a file that contains <![CDATA[$..$]]> sections (at least 1000 occurrences) with LaTeX/TeX code inside the CDATA. I need to turn each one into comment tags plus a processing instruction, like <!--<![CDATA[--><?processingInstruction $..$?><!--]]>-->.
$SqrBrLoopMany = qw/((?:[^\[\]]*(?:{(?:[^\[\]]*(?:{[^\[\]]*})*[^\[\]]*)*})*[^\[\]]*)*)/; # This is for using `\[ <whatever> \]` Square bracket.
$str=~s/(\<\!\[CDATA\[)$SqrBrLoopMany(\]\]>)/<\!\-\-$1\-\-><\?processingInstruction $2\?><\!\-\-$3\-\->/sg;
This is the regex I am using, but the script takes a minute to produce the output.
Output should be:
<!--<![CDATA[--><?processingInstruction $..$?><!--]]>-->
I would appreciate any help with this.

Simplest possible:
s/<!\[CDATA\[(.*?)]]>/<!--<![CDATA[--><?processingInstruction $1?><!--]]>-->/sg
CDATA cannot contain any nested structures, so the pattern just looks for the starting <![CDATA[ and the closest ending ]]>, and matches everything in between.
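As a rough cross-check outside Perl, the same non-greedy idea can be tried with Python's re module. This is only a sketch: the names CDATA_RE and wrap_cdata and the sample string are made up for illustration, and re.DOTALL stands in for Perl's /s flag.

```python
import re

# Non-greedy match from the opening <![CDATA[ to the closest ]]>.
# re.DOTALL lets "." match newlines, like the /s flag in Perl.
CDATA_RE = re.compile(r'<!\[CDATA\[(.*?)\]\]>', re.DOTALL)


def wrap_cdata(text):
    """Wrap the CDATA markers in comments and the body in a processing instruction."""
    # A function replacement is used so that backslashes in the TeX body
    # are inserted literally rather than read as replacement escapes.
    return CDATA_RE.sub(
        lambda m: '<!--<![CDATA[--><?processingInstruction '
                  + m.group(1)
                  + '?><!--]]>-->',
        text,
    )


sample = '<![CDATA[$..$]]>'  # hypothetical input, taken from the question
print(wrap_cdata(sample))
```

Because each section is matched in a single lazy pass, the running time should grow roughly linearly with the file size rather than blowing up on stray brackets.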
The reason your pattern runs slowly is that you are matching non-brackets ([^\[\]]) between the braces { ... }. If the CDATA section contains a [ or ] that is not part of the closing ]]>, the match fails and the engine backtracks through each of the [^\[\]]* in turn, leading to quintic (O(x^5)) execution time.
If square brackets are required to be balanced for it to match, you could do
s/<!\[CDATA\[(([^][]|\[(?2)*?])*?)]]>/<!--<![CDATA[--><?processingInstruction $1?><!--]]>-->/sg
The (?2) will recursively match the second subpattern/capture group again. This should work in both Perl and PCRE based regex engines.
Demo: https://regex101.com/r/LmClY9/2

Thanks to Markus Jarderot for showing the way to achieve this:
$str=~s/(\<\!\[CDATA\[)([^\]\]>]*)(\]\]>)/<\!\-\-$1\-\-><\?xmltex $2\?><\!\-\-$3\-\->/sg;
<!\[CDATA\[(.*?)]]> Instead of <!\[CDATA\[([^\]\]>]*)]]>

Related

How can I write a regex that will match a nested [quote] BB tag?

As part of a forum that uses BBCode to store posts, I'm trying to write a way to detect mentions and quotes, in order to notify the users.
I have it working for all cases except nested quotes.
This is my regex so far (Python 2.7):
regex = r'\[url=.*?\/users\/(.*?)\/\]#.*?\[\/url\]|\[quote="(.*?)"\].*?\[\/quote\]'
These are my test cases:
# This works fine, I get the `user1` group.
Hello [url=/users/user1/]#Foo Bar[/url]
# This works fine, I get the `user2` and `user3` groups.
[quote="user2"]Test message[/quote] OK [quote="user3"]Test message[/quote]
# This doesn't work as I'd like. I only get the `user4` group, but not `user5`.
[quote="user4"][quote="user5"]Test message[/quote][/quote]
How can I modify the regular expression to match also the third test with the nested [quote] block?
Here's a link to regex101 for your convenience: https://regex101.com/r/Ov5SI1/1
Thank you!
A minor change in the original regex will solve your problem. Here is the original regex:
\[url=.*?\/users\/(.*?)\/\]#.*?\[\/url\]|\[quote="(.*?)"\].*?\[\/quote\]
Error
Consider the input string:
[quote="user4"][quote="user5"]Test message[/quote][/quote]
The last alternation tries to match it and it does succeed. However, the first match is
[quote="user4"][quote="user5"]Test message[/quote]
Now the next match starts after the [/quote]. It will not start anywhere before since all the previous text is already part of a successful match.
Correction
Solution 1:
Changing the .*?\[\/quote\] part of the original regex to a lookahead lets both user4 and user5 match, because the lookahead checks that a closing tag follows without consuming the text in between.
\[quote=\"(.*?)\"\](?=.*?\[\/quote\])
final regex: \[url=.*?\/users\/(.*?)\/\]#.*?\[\/url\]|\[quote=\"(.*?)\"\](?=.*?\[\/quote\])
Solution 2:
Focusing on just the right-hand part of the alternation, \[quote="(.*?)"\].*?\[\/quote\]:
Only \[quote="(.*?)"\] is necessary if you want to find any pattern of the form [quote="..."]. The remaining portion is unnecessary.
Here is the final regex:
\[url=.*?\/users\/(.*?)\/\]#.*?\[\/url\]|\[quote=\"(.*?)\"\]
Please do remember that the regex must be applied globally to find all the matches.
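To illustrate Solution 1, here is a small Python sketch. The helper name find_mentions is hypothetical, and re.findall is used as the "global" application of the regex; the sample strings are the test cases from the question.

```python
import re

# Solution 1: the closing [/quote] is only asserted with a lookahead,
# so the text after the opening tag stays available for later matches.
MENTION_RE = re.compile(
    r'\[url=.*?/users/(.*?)/\]#.*?\[/url\]'
    r'|\[quote="(.*?)"\](?=.*?\[/quote\])'
)


def find_mentions(post):
    """Return the user names found in [url] mentions and [quote] tags."""
    # findall yields (url_user, quote_user) tuples; exactly one is non-empty.
    return [url_user or quote_user
            for url_user, quote_user in MENTION_RE.findall(post)]


print(find_mentions('Hello [url=/users/user1/]#Foo Bar[/url]'))
print(find_mentions('[quote="user4"][quote="user5"]Test message[/quote][/quote]'))
```

Note that "." does not match newlines by default in Python, so a quote whose closing tag is on a later line would need the re.DOTALL flag.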

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can only describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is. They believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one live (rather than in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature, and as mentioned before, I'm not sure whether it is Python. I appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
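A minimal sketch of these two operations, assuming the engine really is Python and that a replace step is available (the question suggests it may not be). The function name find_ids is made up for illustration.

```python
import re


def find_ids(text):
    """Strip the ?NN positioners, then collect every T followed by 7 digits."""
    # Step 1: remove each positioner (a question mark plus two digits).
    cleaned = re.sub(r'\?[0-9]{2}', '', text)
    # Step 2: match the identifiers in the cleaned text.
    return re.findall(r'T[0-9]{7}', cleaned)


sample = "T123?214567\nT?211234567\nT0000001"
print(find_ids(sample))
```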
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Demo
Then you may try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (\1).
Finally, replace # with ?21 or anything else you like.
A Python script may look like this:
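import re
ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')  # the positioner
# The trailing \s means the last line, which has no whitespace after it, is not matched.
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# First pass: print the matches with the # placeholder still in place.
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Second pass: restore the positioner in each match.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))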
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then, in the replacement, use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
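A possible Python sketch of this capture-group approach, assuming a replace function is available. The names POSITIONER_RE and drop_positioner are hypothetical.

```python
import re

# Group 1: T plus any digits before the positioner.
# The positioner itself (?NN) is optional and not captured.
# Group 2: the digits after it.
POSITIONER_RE = re.compile(r'\b(T\d*)(?:\?\d{2})?(\d+)\b')


def drop_positioner(text):
    """Rebuild each identifier from group 1 and group 2, skipping the ?NN part."""
    return POSITIONER_RE.sub(r'\1\2', text)


for line in ["T123?214567", "T?211234567", "T0000001"]:
    print(drop_positioner(line))
```

Since group 2 requires at least one digit after the positioner, a positioner at the very end of an identifier is not covered by this pattern.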
Is this what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

Regex to match all words containing a question mark

I need some help writing a regex that will match all words in a SQL dump that contain a ? somewhere within the word. Words can be on the same line, and ideally I'll get a full list of these words so I can count the instances.
Sample
test
test??test
test?test word ss?dd ?dddd
term_exists?? term_exists??
test
test?test
aaa? aaaa???
I should see a list as follows
test??test
test?test
ss?dd
?dddd
term_exists??
term_exists??
test?test
aaa?
aaaa???
Basically all words that have a ? in them.
Any help would be great.. been banging my head on this for hours.
Give this a try:
.*?([a-z_]*\?+[a-z_]*).*?
Replace with \1 (or $1 depending on the language you're using).
In action: https://regex101.com/r/Kr776J/1
For best results, enable "single line" mode if possible (Add (?s) to your pattern or use your language's options to turn the flag on).
Given your input string, this will accurately match all of your desired substrings:
\w*\?[^\s]*
or more literally: [a-z_]*\?[^\s]*
or most literally: [a-z_]*\?[a-z?]*
Demo Link (all just 90 steps)
CAustin's takes 159 steps.
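Since the goal is a list that can be counted, here is a short Python sketch using the \w*\?[^\s]* pattern (written with the equivalent \S). The names QMARK_WORD_RE and words_with_question_mark are made up for illustration; the sample text is the one from the question.

```python
import re
from collections import Counter

# Optional word characters, a literal ?, then everything up to the next whitespace.
QMARK_WORD_RE = re.compile(r'\w*\?\S*')


def words_with_question_mark(text):
    """Return every whitespace-delimited word that contains a ?."""
    return QMARK_WORD_RE.findall(text)


sample = """test
test??test
test?test word ss?dd ?dddd
term_exists?? term_exists??
test
test?test
aaa? aaaa???"""

words = words_with_question_mark(sample)
counts = Counter(words)  # number of occurrences per distinct word
print(words)
print(counts)
```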

JSP Tag Spacing Regex

We are supposed to migrate all our apps from one type of server to another. The new servers do not accept invalid JSP tags in which there is no space between the attributes. For example, the following is rejected:
<input type="text"name="myField" />
The following regex was given to us to use, but it seems to not be perfect.
[\w.-]+[\s]*=[\s]*"[^"]+"[^\s/%>]
For example, it returns string assignments like the following.
span.style.fontWeight = "bold";
Can anyone suggest a better regex for locating just the invalid JSP code?
UPDATE
I want this regex to work using the Eclipse Search > File functionality.
Try simply this regex: (<.+?[^" ]+?="[^"]+?")([^ ]+?)(.+?>). It will locate all "tags" with a " not followed by a space. Then you can replace the captured groups like this: $1 $2$3 to add a space.
Tenub's answer is nearly correct, but as Rachel G. mentioned, it will return false positives when the closing bracket immediately follows the closing quotation mark.
(<[^?%].+?[^" ]+?="[^"]+?")([^/ >]+?)([^>]*(?:/|\?|%)?>)
Should give you the results you're after.
Disclaimer: This is not a strict checker. You could have a tag such as <..." asdf/> go undetected, but as the tags are presumably well formed enough to work under the old system, this should be sufficient.
Simple version:
Find: (=\s*"[^"]*")(\w)
Replace with: $1 $2
Explanation
The find regex looks for = followed by optional whitespace followed by "...", immediately followed by a single alphanumeric character or underscore.
It's separated out into two capturing groups, which are represented by $1 and $2 in the replace expression - with a space inserted between them.
[Minor Issue: This won't work for attribute values that include escaped double quotation marks. Haven't addressed this as am assuming it is pretty unlikely. However, it justifies doing a manual find/replace rather than "replace all" just in case.]
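The simple version can be tried out in Python before running it in Eclipse. This is a sketch only: the names MISSING_SPACE_RE and add_missing_spaces are hypothetical, and Python writes the group references as \1 \2 rather than $1 $2.

```python
import re

# = plus optional whitespace, a double-quoted value, then a word character
# immediately after the closing quote (i.e. the missing space).
MISSING_SPACE_RE = re.compile(r'(=\s*"[^"]*")(\w)')


def add_missing_spaces(markup):
    """Insert a space between a quoted attribute value and the next attribute name."""
    return MISSING_SPACE_RE.sub(r'\1 \2', markup)


print(add_missing_spaces('<input type="text"name="myField" />'))
print(add_missing_spaces('span.style.fontWeight = "bold";'))
```

The second call shows the string assignment from the question: the closing quote is followed by a semicolon rather than a word character, so the line is left alone.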

Regex for matching last two parts of a URL

I am trying to figure out the best regex to simply match only the last two strings in a url.
For instance with www.stackoverflow.com I just want to match stackoverflow.com
The issue I have is that some strings can contain a large number of periods. For instance,
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLS I am working with does not have any of the path information so one can assume the last part of the string is always .org or .com or something of that nature.
What regular expression will return stackoverflow.com when run against www.stackoverflow.com, and yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com, under the conditions above?
You don't have to use regex, instead you can use a simple explode function.
So you're looking to split your URL at the periods, so something like
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);
And then you need to get the last two elements, so you can echo them out from the array created.
//this will return the second to last element, yimg
echo $url_split[count($url_split)-2];
//this will echo the period
echo ".";
//this will return the last element, com
echo $url_split[count($url_split)-1];
So in the end you'll get yimg.com as the final output.
Hope this helps.
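For readers not using PHP, the same split-and-take-the-last-two idea might look like this in Python. The function name last_two_labels is made up for illustration.

```python
def last_two_labels(host):
    """Split the host name at the periods and rejoin the last two pieces."""
    return ".".join(host.split(".")[-2:])


print(last_two_labels("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"))
print(last_two_labels("www.stackoverflow.com"))
```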
I don't know what you have tried so far, but I can offer the following solution:
/.*?([\w]+\.[\w]+)$/
There are a couple of tricks here:
Use $ to match till the end of the string. This way you'll be sure your regex engine won't catch the match from the very beginning.
Use grouping inside (...). It means the following: match a word that contains at least one letter, then a dot (backslashed because the dot has a special meaning in regex and we want it as is), and then again a series of at least one letter.
Use a reluctant (lazy) search at the beginning of the pattern, because otherwise it will match everything greedily. For example, if your text is:
abc.def.gh
the greedy match will give f.gh in your group, and that's not what you want.
I assumed that your host names contain only word characters (\w matches letters, digits, and underscores; your real data may need something more complicated, for example to allow hyphens).
I post here a working groovy example, you didn't specify the language you use but the engine should be similar.
def s = "abc.def.gh"
def m = s =~/.*?([\w]+\.[\w]+)$/
println m[0][1] // outputs the first (and the only you have) group in groovy
Hope this helps
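The same pattern can be checked in Python as well. This is just a sketch with a hypothetical helper name, last_two_parts; it returns None when the input has no dot-separated pair at the end.

```python
import re

# Lazy prefix, then "word.word" anchored to the end of the string.
LAST_TWO_RE = re.compile(r'.*?(\w+\.\w+)$')


def last_two_parts(host):
    """Return the last two dot-separated parts of a host name, or None."""
    m = LAST_TWO_RE.match(host)
    return m.group(1) if m else None


print(last_two_parts("abc.def.gh"))
print(last_two_parts("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"))
```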
If you need a solution using Perl-compatible regular expressions, which work in a number of languages, you can use something like the following. The example is in PHP:
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
preg_match('|[a-zA-Z-0-9]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);
This regex fetches the domain name plus the last part of the URL, as long as that last part is two or three letters long. For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces
yimg.com
as output, and with www.stackoverflow.com (with or without the leading www) it gives you
stackoverflow.com
as the result.
A shorter version (note that the match includes the leading dot, e.g. .yimg.com):
/(\.[^\.]+){2}$/