Regular Expression: Start from second one - regex

I want to find the second <BR> tag and start the search from there. How can I do it using regular expressions?
<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>

Prepend <BR>[^<]*(?=<BR>) to your regex, or remove the lookahead part if you want to start after the second <BR>, such as: <BR>[^<]*<BR>.
Find text after the second <BR> but before the third: <BR>[^<]*<BR>([^<]*)<BR>
This finds "waldo" in <BR>404<BR>waldo<BR>.
Note: I specifically used the above instead of the non-greedy .*? because once the above starts not working for you, you should stop parsing HTML with regex, and .*? will hide when that happens. However, the non-greedy quantifier is also not as well-supported, and you can always change to that if you want.
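For illustration, here is a minimal Python sketch of the second pattern above. Python is only my choice here (the question does not name a language), and the helper name text_after_second_br is made up for the example.

```python
import re

# Pattern from the answer above: skip to the second <BR>, then capture
# everything up to (but not including) the third <BR>.
SECOND_BR = re.compile(r"<BR>[^<]*<BR>([^<]*)<BR>")


def text_after_second_br(html):
    """Return the text between the 2nd and 3rd <BR>, or None if nothing matches."""
    m = SECOND_BR.search(html)
    return m.group(1) if m else None


print(text_after_second_br("<BR>404<BR>waldo<BR>"))
print(text_after_second_br("<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>"))
```

Because [^<]* refuses to cross any tag, the function returns None as soon as other markup appears between the <BR> tags, which is the "stop parsing HTML with regex" signal mentioned in the note.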

Assuming you are using PHP, you can split your string on <BR> using explode:
$str='<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>';
$s = explode("<BR>",$str,3);
$string = end($s);
print $string;
output
$ php test.php
Abdurrahman<BR><SMALL>Fathers Name</SMALL>
You can then use the $string variable and do whatever you want with it.
The steps above can be done in other languages as well, using whatever string-splitting method your programming language provides.
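As a rough Python counterpart to the explode approach (a sketch only; the function name after_second_br is invented for this example), str.split with a maxsplit of 2 plays the role of explode with a limit of 3:

```python
def after_second_br(html, tag="<BR>"):
    """Return everything after the second occurrence of tag."""
    # maxsplit=2 yields at most three pieces, like PHP's explode($sep, $str, 3).
    parts = html.split(tag, 2)
    # The last piece is whatever follows the second tag.
    return parts[-1]


s = "<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>"
print(after_second_br(s))
```

Note that if the input contains fewer than two tags, the last piece is simply the remainder after the last split that did happen, so you may want to check len(parts) first.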

This regular expression should match the first two <br />s, provided only whitespace separates them:
/(\s*<br\s*\/?>\s*){2}/i
So you should either replace them with nothing or use preg_match or RegExp.prototype.match to extract the arguments.
In JavaScript:
var afterReplace = str.replace( /(\s*<br\s*\/?>\s*){2}/i, '' );
In PHP
$afterReplace = preg_replace( '/(\s*<br\s*\/?>\s*){2}/i', '', $str );
I'm only sure it'll work in PHP / JavaScript, but it should work in everything...

The usual solution to this sort of problem is to use a "capturing group". Most regular expression systems allow you to extract not only the entire matching sequence, but also sub-matches within it. This is done by grouping a part of the expression within ( and ). For instance, if I use the following expression (this is in JavaScript; I'm not sure what language you want to be working in, but the basic idea works in most languages):
var string = "<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>";
var match = string.match(/<BR>.*?<BR>([a-zA-Z]*)/);
Then I can get either everything that matched using match[0], which is "<BR>like <BR>Abdurrahman", or I can get only the part inside the parentheses using match[1], which gives me "Abdurrahman".

Related

Matching String Wrapped In Symbol For Regex Replace

I'm trying to figure out how to implement Regex on my WordPress blog.
The Problem
I'd like to replace certain content with some inline styles, and I'm using Regex to accomplish this.
My idea is as follows: find the string wrapped in a particular symbol, i.e. "~string~" and dynamically replace this with a span that has a particular class.
I'm going for a similar effect to SO's inline code highlighting feature, but instead of using backticks, I'm using "~" as my symbol of choice (since WordPress already identifies "`" as code).
Quick Example
Original Text
This is a demo paragraph with a wrapped string ~here~, with another string ~~here~~.
After Regex Replacement
This is a demo paragraph with a wrapped string <span class="classOne">here</span>, with another string <span class="classTwo">here</span>.
What I'm Struggling With
The regex I'm using is this: /~(.*?)~/, and it's working fine for finding strings such as "~demo~", but I'm not sure how to extend it to be able to find strings with multiple delimiters, like: "~~demo~~".
The tricky part for me is that it needs to distinguish between just one "~" versus two of them because I'd like to assign different replacements to each result.
Any help would be appreciated! Thanks in advance.
You can use
~~([\s\S]*?)~~(?!~)|~([^~]*)~
Details:
~~([\s\S]*?)~~(?!~) - ~~, then a capturing group #1 matching any zero or more chars but as few as possible, and then a ~~ substring not followed with another ~ char
| - or
~([^~]*)~ - a ~, then a capturing group #2 matching zero or more chars other than ~, and then a ~
If you use it in PHP, you may use the pattern with preg_replace_callback where you may define separate replacement logic when a specific capturing group is matched.
See a PHP demo:
$html = 'This is a demo paragraph with a wrapped string ~here~, with another string ~~here~~.';
echo preg_replace_callback('/~~([\s\S]*?)~~(?!~)|~([^~]*)~/', function ($m) {
return !empty($m[1]) ? '<span class="classTwo">' . $m[1] . '</span>' : '<span class="classOne">' . $m[2] . '</span>';
},$html);
// => This is a demo paragraph with a wrapped string <span class="classOne">here</span>, with another string <span class="classTwo">here</span>.
To make it a little more generic, you can try this: (~+)([^~]+?)(~+). This needs an additional check on the number of characters in the 1st or the 3rd group, which match the ~ characters. Based on that count, decide in code between classOne, classTwo, classThree, etc.
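The generic pattern plus the length check could be sketched in Python roughly as follows. The mapping CLASS_BY_TILDE_COUNT and the function name wrap_tilde_spans are assumptions made for this example, and leaving unbalanced or unknown delimiters untouched is my own choice, not something the answer specifies.

```python
import re

# Hypothetical mapping from delimiter length to CSS class.
CLASS_BY_TILDE_COUNT = {1: "classOne", 2: "classTwo"}

TILDE_SPAN = re.compile(r"(~+)([^~]+?)(~+)")


def wrap_tilde_spans(text):
    def repl(m):
        opening, body, closing = m.group(1), m.group(2), m.group(3)
        css_class = CLASS_BY_TILDE_COUNT.get(len(opening))
        # Leave the text alone if the delimiters are unbalanced or unknown.
        if css_class is None or len(opening) != len(closing):
            return m.group(0)
        return '<span class="%s">%s</span>' % (css_class, body)

    return TILDE_SPAN.sub(repl, text)


print(wrap_tilde_spans("a wrapped string ~here~, with another string ~~here~~."))
```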

regular expression replacement of numbers

Using regular expression how do I replace 1,186.55 with 1186.55?
My search string is
\b[1-9],[0-9][0-9][0-9].[0-9][0-9]
which works fine. I just can't seem to get the replacement part to work.
Your question is sparse on details, so I'll try to answer as generally as possible.
You can shorten the regex a bit by using quantifiers. As a first step, I would write it as
\b[1-9],[0-9]{3}\.[0-9]{2}
(Note that I also escaped the dot: an unescaped . matches any character, not just a literal period.)
Most probably you can also replace [0-9] with \d, which is more readable IMO.
\b\d,\d{3}\.\d{2}
Now we can go to the replacement part. Here you need to store the parts you want to keep. You can do that by putting those parts into capturing groups, i.e. by placing parentheses around them. This would be your search pattern:
\b(\d),(\d{3}\.\d{2})
Now you can access the matched content of those capturing groups in the replacement string. The first opening parenthesis starts the first group, the second opening parenthesis starts the second group, and so on.
There are two possibilities here, depending on your tool: you reference that content either as \1 or as $1.
Your replacement string would then be
\1\2
OR
$1$2
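To make the search-and-replace concrete, here is a small Python sketch (Python's re module uses the \1 style in replacement strings). The function name strip_thousands_comma is made up, and the dot is escaped as \. so that it only matches a literal period.

```python
import re

# Group 1: the leading digit; group 2: the three digits, the period and the cents.
NUMBER = re.compile(r"\b(\d),(\d{3}\.\d{2})")


def strip_thousands_comma(text):
    # \1\2 re-inserts both captured groups and drops the comma between them.
    return NUMBER.sub(r"\1\2", text)


print(strip_thousands_comma("1,186.55"))
```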
Python:
def repl(initstr, unwanted=','):
    res = set(unwanted)
    return ''.join(r for r in initstr if r not in res)
Using regular expressions:
import re
regex = re.compile(r'([\d\.])')
print(''.join(regex.findall('1,186.55')))
Using str.split() method:
num = '1,186.55'
print(''.join(num.split(',')))
Using str.replace() method:
num = '1,186.55'
print(num.replace(',', ''))
If you just want to remove the comma, you can do this in C# (strings are immutable, so assign the result):
str = str.Replace(",", "");
(In Java the method name is lowercase: replace.)
Or in Perl:
s/(\d+),(\d+)/$1$2/

Regular Expression: How to replace a string that does NOT start with something?

I need to replace a root relative URL with a different root relative URL:
/Images/filename.jpg
should be replaced with:
/new/images-dir/filename.jpg
I started by using PHP's str_replace function:
$newText = str_replace('/Images/', '/new/images-dir/', $text);
...but then I realized that it was replacing my absolute URLs that I don't want replaced:
http://sub.domain.com/something/Images/filename.jpg
#...is being replaced with...
http://sub.domain.com/something/new/images-dir/filename.jpg
So then I switched to using PHP's preg_replace function so I can use a regular expression to selectively replace only the root relative URLs and not the absolute URLs. However, I can't seem to figure out the syntax to do this:
$text = 'There is a root relative URL here: <img src="/Images/filename.jpg">'
. 'and an absolute here: <img src="http://sub.domain.com/something/Images/filename.jpg">'
. 'and one not in quotes: /Images/filename.jpg';
$newText = preg_replace('#/Images/#', '/new/images-dir/', $text);
How can I write my regular expression so that it ignores any absolute URLs and only replaces the root relative URLs?
After taking three edits to come up with a correct regex, I concluded that my first answer was best. PHP's string functions are better suited than regular expressions for this task:
Using str_replace():
function match($value)
{
    // The second condition is probably unnecessary,
    // unless your path argument is incorrectly formatted
    if( ($value[0] != "/") || (stristr($value, "http:") != FALSE) )
    {
        return $value;
    }
    return str_replace("/Images/", "/new/images-dir/", $value);
}
The advantage of str_replace() is readability.
If the reader doesn't understand regular expressions, they can still clearly see criteria for matching: the input string must begin with '/' and must not contain "http:".
Furthermore, both the search key and replacement string are clearly represented in plain-text.
Using preg_replace():
function match($value)
{
    $pattern = "/^(\/((.+?)\/)*?)Images\//";
    // Assuming value is a root-relative path, everything
    // before "Images/" should be captured into back-reference 1;
    // the replacement string re-inserts it before "new/images-dir/"
    return preg_replace($pattern, "\\1new/images-dir/", $value);
}
The regular expression tries to match the following:
the beginning of the string with ^,
followed by a forward slash to indicate a root-relative URL,
followed by zero or more lazily quantified repetitions of the group ((.+?)/), which consists of one or more lazily quantified characters and another forward slash,
followed by the string "Images" and a final forward slash.
Both match() functions operate the same when tested as follows:
match("http://test/more/Images/file"); // Returns original argument
match("/test/more/Images/file"); // Returns with match replaced
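For comparison, the plain string-function version translates to Python along these lines. This is a sketch, not part of the original answer; the name replace_root_relative is invented, and the case-insensitive "http:" check mirrors stristr.

```python
def replace_root_relative(value):
    """Rewrite /Images/ only in paths that start with "/" and contain no "http:"."""
    # Same two criteria as the str_replace() version: the value must begin
    # with "/" and must not contain "http:" (compared case-insensitively).
    if not value.startswith("/") or "http:" in value.lower():
        return value
    return value.replace("/Images/", "/new/images-dir/")


print(replace_root_relative("http://test/more/Images/file"))
print(replace_root_relative("/test/more/Images/file"))
```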
According to the PHP documentation on Lookbehind assertions:
Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions.
Using this syntax, I was able to get this to work:
$text = preg_replace('#(?<!http\://sub.domain.com/something)/Images/#', '/new/images-dir/', $text);
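The same negative lookbehind works in Python's re module, which (like PCRE) requires the lookbehind to have a fixed width. The sketch below reuses the sample text from the question; the function name relocate_images is made up, and I escaped the dots in the host name, which the PHP version above does not.

```python
import re

# (?<!...) is a negative lookbehind: "/Images/" matches only when it is NOT
# immediately preceded by the absolute-URL prefix.
ROOT_RELATIVE_IMAGES = re.compile(r"(?<!http://sub\.domain\.com/something)/Images/")


def relocate_images(text):
    return ROOT_RELATIVE_IMAGES.sub("/new/images-dir/", text)


text = ('There is a root relative URL here: <img src="/Images/filename.jpg">'
        'and an absolute here: <img src="http://sub.domain.com/something/Images/filename.jpg">'
        'and one not in quotes: /Images/filename.jpg')
print(relocate_images(text))
```

This only protects the one hard-coded prefix; any other absolute URL containing /Images/ would still be rewritten.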
Root-relative links generally are within quotes, as you've shown. So match on the quote and put it back in the replacement.
$text = 'There is a root relative image here: <img src="/Images/filename.jpg">';
$newText = preg_replace('#"/Images/#', '"/new/images-dir/', $text);
Update
If you have two different cases, try two different and specific replaces rather than trying to engineer one perfect one. Let us know what the other case(s) are.
If you need to match more than that, then you are looking for a "negative lookbehind assertion" so you make sure that it doesn't match the "http://blah" part before it. The problem with lookbehind is that it requires a static string match... it can't have variable length. http://www.php.net/manual/en/regexp.reference.assertions.php
Something like this might work if your links mostly point to .net and .com domains and the Images part is at the root:
$text = 'There is a root relative image here: <img src="/Images/filename.jpg">';
$newText = preg_replace('#(?<=.net|.com|.org|.cc)/Images/#', '/new/images-dir/', $text);

regular expression for find and replace

I've got strings like:
('Michael Herold','Michael Herold'),
but I need to remove the last parts so I end up with:
('Michael Herold'),
I'm still new to Regular Expressions so they confuse me. I'm using Notepad++.
find: \('([^']*)','\1'\)
Replace: ('\1')
So the actual function you use will depend on the language. Notepad++ is a text editor, not a language.
The regular expression that you will want will be ",'Michael Herold'" and you'll replace any matches with "", the empty string.
So in PHP for example, you'll have
$source = "('Michael Herold','Michael Herold')";
$pattern = "/(,'Michael Herold')+/";
$newString = preg_replace($pattern, "", $source);
Do the equivalent in whatever language you use.
I'm not sure what flavor of regular expressions Notepad++ uses, but try replacing this expression:
\('([^']*)','\1'\)
with this one:
('$1')
The \1 matches whatever was found in the first set of single quotes (Michael Herold in your example), and $1 is replaced with that same string. (Try \1 if $1 doesn't work in Notepad++.)
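If you ever need to do the same replacement outside the editor, a Python sketch of the same find/replace pair might look like this (the function name drop_duplicate_name is invented; Python uses \1 in both the pattern and the replacement):

```python
import re

# \1 inside the pattern is a backreference: the second quoted string must be
# identical to the first one for the match to succeed.
DUPLICATE_PAIR = re.compile(r"\('([^']*)','\1'\)")


def drop_duplicate_name(line):
    return DUPLICATE_PAIR.sub(r"('\1')", line)


print(drop_duplicate_name("('Michael Herold','Michael Herold'),"))
```

Pairs whose two values differ are left untouched, since the backreference fails to match.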

Regular expression replace a word by a link

I want to write a regular expression that will replace the word Paris with a link, but only where the word is not already part of a link.
Example:
i'm living in Paris, near Paris Gare du Nord, i love Paris.
would become
i'm living.........near Paris..........i love Paris.
This is hard to do in one step. Writing a single regex that does that is virtually impossible.
Try a two-step approach.
Put a link around every "Paris" there is, regardless if there already is another link present.
Find all incorrectly nested links (<a href="...">Paris</a>), and eliminate the inner link.
Regex for step one is dead-simple:
\bParis\b
Regex for step two is slightly more complex:
(<a[^>]+>.*?(?!:</a>))<a[^>]+>(Paris)</a>
Use that one on the whole string and replace it with the content of match groups 1 and 2, effectively removing the surplus inner link.
Explanation of regex #2 in plain words:
Find every link (<a[^>]+>), optionally followed by anything that is not itself followed by a closing link (.*?(?!:</a>)). Save it into match group 1.
Now look for the next link (<a[^>]+>). Make sure it is there, but do not save it.
Now look for the word Paris. Save it into match group 2.
Look for a closing link (</a>). Make sure it is there, but don't save it.
Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.
The approach assumes these side conditions:
Your input HTML is not horribly broken.
Your regex flavor supports non-greedy quantifiers (.*?) and zero-width negative look-ahead assertions ((?!:...)).
You wrap the word "Paris" only in a link in step 1, no additional characters. Every "Paris" becomes "<a href="...">Paris</a>", or step two will fail (until you change the second regex).
BTW: regex #2 explicitly allows for constructs like this:
in the <b>capital of France</b>, <a href="">Paris</a>
The surplus link comes from step one, replacement result of step 2 will be:
in the <b>capital of France</b>, Paris
You could search for this regular expression:
(<a[^>]*>.*?</a>)|Paris
This regex matches a link, which it captures into the first (and only) capturing group, or the word Paris.
Replace the match with your link only if the capturing group did not match anything.
E.g. in C#:
resultString =
Regex.Replace(
subjectString,
"(<a[^>]*>.*?</a>)|Paris",
new MatchEvaluator(ComputeReplacement));
public String ComputeReplacement(Match m) {
    // Group 1 matched: this is an existing link, so return it unchanged
    if (m.Groups[1].Success) {
        return m.Groups[1].Value;
    } else {
        return "Paris";
    }
}
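The same "match a whole link or the bare word, and only replace the bare word" idea can be sketched in Python with a replacement callback. The link target PARIS_LINK and the function name link_paris are placeholders I made up for the example.

```python
import re

# Hypothetical replacement link; substitute whatever URL you actually need.
PARIS_LINK = '<a href="https://example.com/paris">Paris</a>'

LINK_OR_WORD = re.compile(r"(<a[^>]*>.*?</a>)|Paris")


def link_paris(text):
    def repl(m):
        # Group 1 is set only when an existing link was matched: keep it as-is.
        if m.group(1) is not None:
            return m.group(1)
        return PARIS_LINK

    return LINK_OR_WORD.sub(repl, text)


print(link_paris('I live in Paris, near <a href="/gare">Paris Gare du Nord</a>.'))
```

Because the link alternative comes first and consumes the whole element, any "Paris" inside an existing link is never seen by the second alternative.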
The traditional answer to such a question: use a real HTML parser, because REs aren't really good at operating in a context. And HTML is complex: an 'a' tag can have attributes or not, in any order, can have HTML in the link or not, etc.
Regular expression:
!(<a.*</a>.*)*Paris!isU
Replacement:
$1Paris
$1 refers to the first sub-pattern (at least in PHP). Depending on the language you use it could be slightly different.
This should replace all occurrences of "Paris" with the link in the replacement. It just checks whether all opening a tags were closed before "Paris".
PHP example:
<?php
$s = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
$regex = '!(<a.*</a>.*)*Paris!isU';
$replace = '$1Paris';
$result = preg_replace( $regex, $replace, $s);
?>
Addition:
This is not the best solution. One situation where this regex won't work is when you have an img tag that is not within an a element. If the title attribute of that image is set to "Paris", this "Paris" will be replaced too, and that's not what you want.
Nevertheless I see no way to solve your problem completely with a simple regular expression.
If you weren't limited to using Regular expressions in this case, XSLT is a good choice for a language in which you can define this replacement, because it 'understands' XML.
You define two templates:
One template finds links and removes those links that don't have "Paris" as the body text. Another template finds everything else, splits it into words and adds tags.
$pattern = 'Paris';
$text = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
// 1. Collect two arrays:
// $matches[1] - links that already contain the keyword
// $matches[2] - bare occurrences of the keyword
preg_match_all('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)|(?<!\pL)('.$pattern.')(?!\pL)#', $text, $matches);
// Is there anything to replace? Find the first keyword that is not inside an <a> tag
$number = array_search($pattern, $matches[2]);
// A bare keyword exists, so proceed
if ($number !== FALSE) {
    // Replace each existing link with a temporary placeholder
    foreach ($matches[1] as $k => $tag) {
        $text = preg_replace('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)#', 'KEYWORD_IS_ALREADY_LINK_'.$k, $text, 1);
    }
    // Replace the remaining keywords with the link
    $text = preg_replace('/(?<!\pL)('.$pattern.')(?!\pL)/', ''.$pattern.'', $text);
    // Put the original links back
    foreach ($matches[1] as $k => $tag) {
        $text = str_replace('KEYWORD_IS_ALREADY_LINK_'.$k, $tag, $text);
    }
    // Done
    echo $text;
}
Regexes don't replace. Languages do.
Languages and libraries would also read from the database or file that holds the list of words you care about, and associate a URL with their name. Here's the easiest substitution I can imagine being possible with a single regex (Perl is used for the replacement syntax):
s/([a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/i
Proper names might work better:
s/([A-Z][a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/gi;
Of course "Baton Rouge" would become two links for:
Baton
Rouge
In Perl, you can do this:
my $barred_list_of_cities
    = join( '|'
          , sort { ( length($a) <=> length($b) ) || ( $a cmp $b ) } keys %url_for_city_of
          );
s/($barred_list_of_cities)/<a href="$url_for_city_of{$1}">$1<\/a>/g;
But again, it's a language that implements a set of operations for regexes; regexes don't do a thing. (In reality, it's such a common application that I'd be surprised if there isn't a CPAN module out there somewhere that does this, and you just need to load the hash.)
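A rough Python equivalent of the hash-driven substitution is sketched below. The dictionary URL_FOR_CITY and its contents are hypothetical sample data, and sorting the names longest-first is an assumption of this sketch, made so that "Baton Rouge" wins over a shorter entry such as "Baton".

```python
import re

# Hypothetical lookup table; in practice this would be loaded from a file or database.
URL_FOR_CITY = {
    "Paris": "https://en.wikipedia.org/wiki/Paris",
    "Baton": "https://en.wikipedia.org/wiki/Baton",
    "Baton Rouge": "https://en.wikipedia.org/wiki/Baton_Rouge",
}

# Longest names first so the alternation prefers the most specific match.
names = sorted(URL_FOR_CITY, key=lambda name: (-len(name), name))
CITY_RE = re.compile("|".join(re.escape(name) for name in names))


def link_cities(text):
    def repl(m):
        city = m.group(0)
        return '<a href="%s">%s</a>' % (URL_FOR_CITY[city], city)

    return CITY_RE.sub(repl, text)


print(link_cities("From Paris to Baton Rouge"))
```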