Matching String Wrapped In Symbol For Regex Replace - regex

I'm trying to figure out how to implement Regex on my WordPress blog.
The Problem
I'd like to replace certain content with some inline styles, and I'm using Regex to accomplish this.
My idea is as follows: find the string wrapped in a particular symbol, i.e. "~string~" and dynamically replace this with a span that has a particular class.
I'm going for a similar effect to SO's inline code highlighting feature, but instead of using backticks, I'm using "~" as my symbol of choice (since WordPress already identifies "`" as code).
Quick Example
Original Text
This is a demo paragraph with a wrapped string ~here~, with another string ~~here~~.
After Regex Replacement
This is a demo paragraph with a wrapped string <span class="classOne">here</span>, with another string <span class="classTwo">here</span>.
What I'm Struggling With
The regex I'm using is this: /~(.*?)~/, and it's working fine for finding strings such as "~demo~", but I'm not sure how to extend it to be able to find strings with multiple delimiters, like: "~~demo~~".
The tricky part for me is that it needs to distinguish between just one "~" versus two of them because I'd like to assign different replacements to each result.
Any help would be appreciated! Thanks in advance.

You can use
~~([\s\S]*?)~~(?!~)|~([^~]*)~
See the regex demo. Details:
~~([\s\S]*?)~~(?!~) - ~~, then a capturing group #1 matching any zero or more chars but as few as possible, and then a ~~ substring not followed with another ~ char
| - or
~([^~]*)~ - a ~, then a capturing group #2 matching zero or more chars other than ~, and then a ~
If you use it in PHP, you may use the pattern with preg_replace_callback where you may define separate replacement logic when a specific capturing group is matched.
See a PHP demo:
$html = 'This is a demo paragraph with a wrapped string ~here~, with another string ~~here~~.';
echo preg_replace_callback('/~~([\s\S]*?)~~(?!~)|~([^~]*)~/', function ($m) {
    return !empty($m[1]) ? '<span class="classTwo">' . $m[1] . '</span>' : '<span class="classOne">' . $m[2] . '</span>';
}, $html);
// => This is a demo paragraph with a wrapped string <span class="classOne">here</span>, with another string <span class="classTwo">here</span>.
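For readers not working in PHP, the same callback idea can be sketched in Python. This is only an illustration of the pattern above; the function name wrap_in_span is made up for the example, and the class names are the ones from the question.

```python
import re

# Same pattern as above: group 1 is the ~~...~~ body, group 2 the ~...~ body.
PATTERN = re.compile(r'~~([\s\S]*?)~~(?!~)|~([^~]*)~')


def wrap_in_span(match):
    # In Python, a group that did not take part in the match is None.
    if match.group(1) is not None:
        return '<span class="classTwo">' + match.group(1) + '</span>'
    return '<span class="classOne">' + match.group(2) + '</span>'


html = 'This is a demo paragraph with a wrapped string ~here~, with another string ~~here~~.'
result = PATTERN.sub(wrap_in_span, html)
print(result)
```

Testing group 1 against None rather than for emptiness keeps an empty ~~~~ pair in the classTwo branch.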

To make it a little more generic, you can try (~+)([^~]+?)(~+). This needs an additional check in code on the number of ~ characters captured by the 1st or the 3rd group. Based on that count, decide in code between classOne, classTwo, classThree, etc.
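A minimal sketch of that generic approach in Python, assuming a delimiter-length-to-class mapping (CLASS_BY_TILDES and both function names are hypothetical, not from the answer). Text whose opening and closing delimiters differ in length is left untouched here; that is a design choice of this sketch, not something the answer specifies.

```python
import re

# Hypothetical mapping from delimiter length to CSS class.
CLASS_BY_TILDES = {1: 'classOne', 2: 'classTwo', 3: 'classThree'}


def replace_by_delimiter_length(match):
    opening, body, closing = match.groups()
    css_class = CLASS_BY_TILDES.get(len(opening))
    # Leave the text alone if the delimiters are unbalanced or unmapped.
    if len(opening) != len(closing) or css_class is None:
        return match.group(0)
    return '<span class="%s">%s</span>' % (css_class, body)


def highlight(text):
    return re.sub(r'(~+)([^~]+?)(~+)', replace_by_delimiter_length, text)


print(highlight('one ~a~, two ~~b~~, three ~~~c~~~'))
```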

Related

How can I match multiple hits between 2 delimiters?

Hi, my fellow RegEx'ers ;)
I'm trying to match multiple Texts between every two quotes
Here's my text:
...random code
someArray[] = ["Come and",
"get me,",
"or fail",
"trying!",
"Yours truly"]
random code...
So far, I managed to get the correct matches with two patterns, executed after each other:
(?s)someArray\[\].*?=.*?\[(.*?)\]
this extracts the text between the two brackets and on the result, I use this one:
"(.*?)"
This is working just fine, but I'd love to get the Texts in one regex.
Any help is highly appreciated!
Consider using \G. With its help, you may match "(.*?)" preceded by either someArray[] = [ or previous match of "(.*?)" (well, strictly speaking previous match of entire regex). Then just grab first capture groups from all matches:
(?:(?s).*someArray\[\].*?=.*?\[|\G[^"\]]+)"(.*?)"
Demo: https://regex101.com/r/eBQWdU/3
How you grab the first capture groups depends on the language you're using the regex in. For example, in PHP you may do something like this:
preg_match_all('/(?:(?s).*someArray\[\].*?=.*?\[|\G[^"\]]+)"(.*?)"/', $input, $matches);
$array_items = $matches[1];
Demo: https://ideone.com/mZgU1x
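Not every engine supports \G; Python's built-in re module is one that does not. There, the two-pass approach from the question remains a reasonable fallback. A rough sketch, using the sample text from the question:

```python
import re

source = '''...random code
someArray[] = ["Come and",
"get me,",
"or fail",
"trying!",
"Yours truly"]
random code...'''

# Pass 1: isolate the text between the brackets that follow "someArray[] =".
block = re.search(r'someArray\[\].*?=.*?\[(.*?)\]', source, re.DOTALL)

# Pass 2: pull every quoted string out of that block.
items = re.findall(r'"(.*?)"', block.group(1)) if block else []
print(items)
```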

Regex split and concatenate path base and pattern with filename deleting part of path between them

I have a URL like this:
a) <a href=\"http://example.com/path-pattern-to-match/subPath/onemoreSubpath/arbitrary-number-of-subpaths/someArticle1\">
or:
b) <a href=\"http://example.com/path-pattern-to-match/someArticle2\">
I need to split off the path pattern together with its base URL and the start of the <a> tag, and concatenate it with its someArticle. Everything in between needs to be deleted.
Case 'b' remains untouched. Case 'a' needs to become:
<a href=\"http://example.com/path-pattern-to-match/someArticle1\">
Please answer with a RegEx, that is what I need. Other solutions using Perl or a bash script could be interesting if well explained, but please avoid suggesting a programming module or function to parse it only to say that RegEx is not the best solution, without offering a real solution.
PS: I need to parse a non multiline file.
someArticle is variable.
If you have look-behind support, use
(?<=<a href=\\"http:\/\/example\.com\/path-pattern-to-match\/)(?:[^\/]+\/)*([^\/>"]*)(?=\\">)
See demo
EXPLANATION
(?<=<a href=\\"http:\/\/example\.com\/path-pattern-to-match\/) - a fixed width lookbehind making sure we have <a href=\"http://example.com/path-pattern-to-match/ literal text in front of...
(?:[^\/]+\/)* - 0 or more sequences of 1 or more characters other than / ([^\/]+) followed with a literal / (i.e. subpaths)
([^\/>"]*) - A capturing group that matches our keyword "someArticle" (0 or more characters other than ", >, or /).
(?=\\">) - A positive lookahead checking if there is a \"> right after the preceding subpattern.
Using the $1 replacement string, you can remove the subpaths and keep the "someArticle" part.
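As an illustration only, here is the same replacement sketched with Python's re module, where the replacement string is written \1 instead of $1. The lookbehind is fixed-width, which Python requires. The two sample lines are cases a and b from the question, shortened to two subpaths.

```python
import re

PATTERN = re.compile(
    r'(?<=<a href=\\"http://example\.com/path-pattern-to-match/)'
    r'(?:[^/]+/)*([^/>"]*)(?=\\">)'
)

# Raw strings keep the literal backslash before each quote.
line_a = r'<a href=\"http://example.com/path-pattern-to-match/subPath/onemoreSubpath/someArticle1\">'
line_b = r'<a href=\"http://example.com/path-pattern-to-match/someArticle2\">'

fixed_a = PATTERN.sub(r'\1', line_a)  # subpaths removed
fixed_b = PATTERN.sub(r'\1', line_b)  # already in the target form
print(fixed_a)
print(fixed_b)
```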

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
    $values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
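The same four changes carry over to other languages. A small Python sketch, using the sample text from the question with 8467 lengthened to 84677:

```python
import re

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;84677;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# ^ with re.MULTILINE anchors at each line start; \d{4,} accepts 4+ digits;
# the parentheses capture the value we want to keep.
values = re.findall(r'^bdfg34f;\d{4,};(\d{0,9})', text, re.MULTILINE)
print(values)
```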
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
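For instance, a Python version of that field pattern might look like the sketch below; third_field is a made-up helper name. It applies the pattern to one line at a time, because [^;] would otherwise also match newlines.

```python
import re

# Two "not-semicolons then semicolon" fields, then capture the third.
FIELD_PATTERN = re.compile(r'[^;]+;[^;]+;([^;]+);')


def third_field(line):
    # match() anchors at the start of the line.
    m = FIELD_PATTERN.match(line)
    return m.group(1) if m else None


print(third_field('bdfg34f;846777;72719;7;6637912;05072009;7;'))
```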
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
These variants need an engine that supports variable-length lookbehind (.NET is one). To match the first number with a minimum of 4 digits:
(?<=bdfg34f;\d{4,};)\d{0,9}
and to match the first number with 1 or more length
(?<=bdfg34f;\d+;)\d{0,9}
or to match the first number only if the length is between 4 and 6
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
    $bits = explode(";", $line);
    echo $bits[2]; // third column (indexes start at 0)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
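To illustrate the no-regex route outside PHP, here is a sketch using Python's csv module, which plays a role similar to fgetcsv. It assumes the data is semicolon-delimited as in the question; the in-memory string stands in for a real file.

```python
import csv
import io

data = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# io.StringIO stands in for an open file; index 2 is the third column.
rows = csv.reader(io.StringIO(data), delimiter=';')
third_column = [row[2] for row in rows if len(row) > 2]
print(third_column)
```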

Regular Expression: Start from second one

I want to find the second <BR> tag and start the search from there. How can I do it using regular expressions?
<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>
Prepend <BR>[^<]*(?=<BR>) to your regex, or remove the lookahead part if you want to start after the second <BR>, such as: <BR>[^<]*<BR>.
Find text after the second <BR> but before the third: <BR>[^<]*<BR>([^<]*)<BR>
This finds "waldo" in <BR>404<BR>waldo<BR>.
Note: I specifically used the above instead of the non-greedy .*? because once the above starts not working for you, you should stop parsing HTML with regex, and .*? will hide when that happens. However, the non-greedy quantifier is also not as well-supported, and you can always change to that if you want.
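A quick way to try the capturing pattern is a few lines of Python; text_after_second_br is just an illustrative helper name.

```python
import re

# Capture the text between the second and third <BR>.
PATTERN = re.compile(r'<BR>[^<]*<BR>([^<]*)<BR>')


def text_after_second_br(html):
    m = PATTERN.search(html)
    return m.group(1) if m else None


print(text_after_second_br('<BR>404<BR>waldo<BR>'))
print(text_after_second_br('<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>'))
```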
Assuming you are using PHP, you can split your string on <BR> using explode:
$str='<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>';
$s = explode("<BR>",$str,3);
$string = end($s);
print $string;
output
$ php test.php
Abdurrahman<BR><SMALL>Fathers Name</SMALL>
you can then use "$string" variable and do whatever you want.
The steps above can be done with other languages as well by using the string splitting methods your prog language has.
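In Python, for example, the equivalent is str.split with a maximum of two splits, which yields at most three pieces (matching the limit of 3 passed to explode). A short sketch:

```python
s = '<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>'

# maxsplit=2 produces at most three parts; the last one is everything
# after the second <BR>.
parts = s.split('<BR>', 2)
rest = parts[-1]
print(rest)
```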
This regular expression should match the first two <br />s:
/(\s*<br\s*\/?>\s*){2}/i
so you should either replace them with nothing or use preg_match or RegExp.prototype.match to extract the arguments.
In JavaScript:
var afterReplace = str.replace( /(\s*<br\s*\/?>\s*){2}/i, '' );
In PHP
$afterReplace = preg_replace( '/(\s*<br\s*\/?>\s*){2}/i', '', $str );
I'm only sure it'll work in PHP / JavaScript, but it should work in everything...
The usual solution to this sort of problem is to use a "capturing group". Most regular expression systems allow you to extract not only the entire matching sequence, but also sub-matches within it. This is done by grouping a part of the expression within ( and ). For instance, if I use the following expression (this is in JavaScript; I'm not sure what language you want to be working in, but the basic idea works in most languages):
var string = "<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>";
var match = string.match(/<BR>.*?<BR>([a-zA-Z]*)/);
Then I can get either everything that matched using match[0], which is "<BR>like <BR>Abdurrahman", or I can get only the part inside the parentheses using match[1], which gives me "Abdurrahman".

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
the ? means non greedy, if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while ($text =~ m/###(.+?)###/g) {
    print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
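The same loop can be sketched in Python. Since the question asks for the delimiters too, this example collects both forms: group 0 is the whole match including the ###, group 1 is just the token.

```python
import re

text = 'text ###token1### text text ###token2### text text'

# finditer walks the matches one by one, like the Perl while loop.
with_delimiters = [m.group(0) for m in re.finditer(r'###(.+?)###', text)]

# findall returns only the captured group when the pattern has one group.
tokens = re.findall(r'###(.+?)###', text)

print(with_delimiters)
print(tokens)
```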
Assuming you want to match ###token2### as well...
/###.+###/
Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 in the replacement expression (\1 for the first set, \2 for the second), assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.
When you are using delimiters such as this, you basically grab the first one, then anything that does not match the ending delimiter, followed by the ending delimiter. One caution: in cases like the example above, [^#] would not work as the check that the end delimiter is not there, since a single # inside the token would cause the regex to fail (e.g. "###foo#bar###"). In the case above, the regex to parse it would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
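A small Python check of that behaviour. The group is made non-capturing here (an adjustment for this sketch) so that findall returns whole matches, and the sample string is invented for the example.

```python
import re

# A token body may contain # or ##, but never three in a row.
ROBUST = re.compile(r'###(?:[^#]|#[^#]|##[^#])*###')
# The naive version forbids any # inside the token.
NAIVE = re.compile(r'###[^#]*###')

sample = 'text ###foo#bar### text ###token2###'

robust_hits = ROBUST.findall(sample)
naive_hit = NAIVE.search('###foo#bar###')

print(robust_hits)
print(naive_hit)
```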