Regular expression replace a word by a link - regex

I want to write a regular expression that will replace the word Paris by a link, for only the word is not ready a part of a link.
Example:
i'm living in Paris, near Paris Gare du Nord, i love Paris.
would become
i'm living.........near Paris..........i love Paris.

This is hard to do in one step. Writing a single regex that does that is virtually impossible.
Try a two-step approach.
Put a link around every "Paris" there is, regardless if there already is another link present.
Find all incorrectly nested links (<a href="...">Paris</a>), and eliminate the inner link.
Regex for step one is dead-simple:
\bParis\b
Regex for step two is slightly more complex:
(<a[^>]+>.*?(?!:</a>))<a[^>]+>(Paris)</a>
Use that one on the whole string and replace it with the content of match groups 1 and 2, effectively removing the surplus inner link.
Explanation of regex #2 in plain words:
Find every link (<a[^>]+>), optionally followed by anything that is not itself followed by a closing link (.*?(?!:</a>)). Save it into match group 1.
Now look for the next link (<a[^>]+>). Make sure it is there, but do not save it.
Now look for the word Paris. Save it into match group 2.
Look for a closing link (</a>). Make sure it is there, but don't save it.
Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.
The approach assumes these side conditions:
Your input HTML is not horribly broken.
Your regex flavor supports non-greedy quantifiers (.*?) and zero-width negative look-ahead assertions ((?!:...)).
You wrap the word "Paris" only in a link in step 1, no additional characters. Every "Paris" becomes "<a href"...">Paris</a>", or step two will fail (until you change the second regex).
BTW: regex #2 explicitly allows for constructs like this:
in the <b>capital of France</b>, <a href="">Paris</a>
The surplus link comes from step one, replacement result of step 2 will be:
in the <b>capital of France</b>, Paris

You could search for this regular expression:
(<a[^>]*>.*?</a>)|Paris
This regex matches a link, which it captures into the first (and only) capturing group, or the word Paris.
Replace the match with your link only if the capturing group did not match anything.
E.g. in C#:
resultString =
Regex.Replace(
subjectString,
"(<a[^>]*>.*?</a>)|Paris",
new MatchEvaluator(ComputeReplacement));
public String ComputeReplacement(Match m) {
if (m.groups(1).Success) {
return m.groups(1).Value;
} else {
return "Paris";
}
}

Traditional answer for such question: use a real HTML parser. Because REs aren't really good at operating in a context. And HTML is complex, a 'a' tag can have attributes or not, in any order, can have HTML in the link or not, etc.

Regular expression:
!(<a.*</a>.*)*Paris!isU
Replacement:
$1Paris
$1 referes to the first sub-pattern (at least in PHP). Depending on the language you use it could be slightly different.
This should replace all occurencies of "Paris" with the link in the replacement. It just checks whether all opening a-Tags were closed before "Paris".
PHP example:
<?php
$s = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
$regex = '!(<a.*</a>.*)*Paris!isU';
$replace = '$1Paris';
$result = preg_replace( $regex, $replace, $s);
?>
Addition:
This is not the best solution. One situation where this regex won't work is when you have a img-Tag, which is not within an a-Element. When you set the title-Attribute of that image to "Paris", this "Paris" will be replaced, too. And that's not what you want.
Nevertheless I see no way to solve your problem completely with a simple regular expression.

If you weren't limited to using Regular expressions in this case, XSLT is a good choice for a language in which you can define this replacement, because it 'understands' XML.
You define two templates:
One template finds links and removes those links that don't have "Paris" as the body text. Another template finds everything else, splits it into words and adds tags.

$pattern = 'Paris';
$text = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
// 1. Define 2 arrays:
// $matches[1] - array of links with our keyword
// $matches[2] - array of keyword
preg_match_all('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)|(?<!\pL)('.$pattern.')(?!\pL)#', $text, $matches);
// Exists keywords for replace? Define first keyword without tag <a>
$number = array_search($pattern, $matches[2]);
// Keyword exists, let's go rock
if ($number !== FALSE) {
// Replace all link with temporary value
foreach ($matches[1] as $k => $tag) {
$text = preg_replace('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)#', 'KEYWORD_IS_ALREADY_LINK_'.$k, $text, 1);
}
// Replace our keywords with link
$text = preg_replace('/(?<!\pL)('.$pattern.')(?!\pL)/', ''.$pattern.'', $text);
// Return link
foreach ($matches[1] as $k => $tag) {
$text = str_replace('KEYWORD_IS_ALREADY_LINK_'.$k, $tag, $text);
}
// It's work!
echo $text;
}

Regexes don't replace. Languages do.
Languages and libraries would also read from the database or file that holds the list of words you care about, and associate a URL with their name. Here's the easiest substitution I can imagine possible my a single regex (perl is used for the replacement syntax.)
s/([a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/i
Proper names might work better:
s/([A-Z][a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/gi;
Of course "Baton Rouge" would become two links for:
Baton
Rouge
In Perl, you can do this:
my $barred_list_of_cities
= join( '|'
, sort { ( length $a <=> $b ) || ( $a cmp $b ) } keys %url_for_city_of
);
s/($barred_list_of_cities)/<a href="$url_for_city_of{$1}">$1<\/a>/g;
But again, it's a language that implements a set of operations for regexes, regexes don't do a thing. (In reality, it's such a common application, that I'd be surprised if there isn't a CPAN module out there somewhere that does this, and you just need to load the hash.

Related

Force first letter of regex matched value to uppercase

I am trying to get better at regular expressions. I am using regex101.com. I have a regular expression that has two capturing groups. I am then using substitution to incorporate my captured values into another location.
For example I have a list of values:
fat dogs
thin cats
skinny cows
purple salamanders
etc...
and this captures them into two variables:
^([^\s]+)\s+([^\s;]+)?.*
which I then substitute into new sentences using $1 and $2. For example:
$1 animals like $2 are a result of poor genetics.
(obviously this is a silly example)
This works and I get my sentences made but I'm stumped trying to force $1 to have an uppercase first letter. I can see all sorts of examples on MATCHING uppercase or lowercase but not transforming to uppercase.
It seems I need to do some sort of "function" processing. I need to pass $1 to something that will then break it into two pieces...first letter and all the other letters....transform piece one to uppercase...then smash back together and return the result.
Add to that error checking...and while it is unlikely $1 will have numeric values we should still do a safety check of some sort.
So if someone can just point me to the reading material I would appreciate it.
A regular expression will only match what is there. What you are doing is essentially:
Match item
Display matches
but what you want to be doing is:
Match item
Modify matches
Display modified matches
A regular expression doesn't do any 'processing' on the matches, it is just a syntax for finding the matches in the first place.
Most languages have string processing, for instance, if you had you matches in the variables $1 and $2 as above, you would want to do something along the lines of:
$1 = upper(substring($1, 0, 1)) + substring($1, 1)
assuming the upper() function if you language's strung uppercasing function, and substring() returns a sub-string (zero indexed).
Put very simply, regex can only replace from what is in your original string. There is no capital F in fat dogs so you can't get Fat dogs as your output.
This is possible in Perl, however, but only because Perl processes the text after the regex substitution has finished, it is not a feature of the regex itself. The following is a short Perl program (sans regex) that performs case transformation if run from the command line:
#!/usr/bin/perl -w
use strict;
print "fat dogs\n"; # fat dogs
print "\ufat dogs\n"; # Fat dogs
print "\Ufat dogs\n"; # FAT DOGS
The same escape sequences work in regexs too:
#!/usr/bin/perl -w
use strict;
my $animal = "fat dogs";
$animal =~ s/(\w+) (\w+)/\u$1 \U$2/;
print $animal; # Fat DOGS
Let me repeat though, it is Perl doing this, not the regex.
Depending on your real world example you may not have to change the case of the letter. If your input is Fat dogs then you will get the desired result. Otherwise, you will have to process $1 yourself.
In PHP you can use preg_replace_callback() to process the entire match, including captured groups, before returning the substitution string. Here is a similar PHP program:
<?php
$animal = "fat dogs";
print(preg_replace_callback('/(\w+) (\w+)/', 'my_callback', $animal)); // Fat DOGS
function my_callback($match) {
return ucfirst($match[1]) . ' ' . strtoupper($match[2]);
}
?>
I think it can be very simple based on your language of choice. You can firs loop over the list of values and find your match then put the groups within your string by using a capitalize method for first matched :
for val in my_list:
m = match(^([^\s]+)\s+([^\s;]+)?.*,val)
print "%sanimals like %s are a result of poor genetics."%(m.group(1).capitalize(), m.group(1))
But if you want to dot it all with regex It's very unlikely to be possible because you need to modify your string and this is generally not a regex a suitable task for regex.
So in the end the answer is that you CAN'T use regex to transform...that's not it's job. Thanks to the input by others I was able to adjust my approach and still accomplish the objective of this self inflicted academic assignment.
First from the OP you'll recall that I had a list and I was capturing two words from that list into regex variables. Well I modified that regex capture to get three capture groups. So for example:
^(\S)(\S+)\s+_(\S)?.*
//would turn fat dogs into
//$1 = f, $2 = at, $3 = dogs
So then using Notepad++ I then replaced with this:
\u$1$2 animals like $3 are a result of poor genetics.
In this way I was able to transform the first letter to uppercase..but as others pointed out this is NOT regex doing the transform but another process. (In this case notepad ++ but could be your c#, perl, etc).
Thank You everyone for helping the newbie.

Regular expression tag matching

I have a very simple Perl function that returns the content of a tag in custom XML code I need to parse. However, if there are line returns inside of the tags, then it returns an empty value and I'm not sure how to fix it:
sub in_tag
{
my ($text, $tag) = #_;
my ($content) = $text =~ m/<$tag.*>(.*)<\/$tag>/;
$content = $content . "";
return $content;
}
# works
print in_tag("<item><creation type=\"date\">2014-01-03</creation><name type=\"word\">John Doe</name><id type=\"number\">67</id></item>", "name");
# doesnt work
print in_tag("<item><creation type=\"date\">2014-01-03</creation><name type=\"word\">John\nDoe</name><id type=\"number\">67</id></item>", "name");
To make the . regex metacharacter match a newline, you need to use the /s flag:
m/..../s;
You also want to use non-greedy quantifiers in your regular expression. Put a ? after the * to still match zero or more, but with the provision that it doesn't go beyond text that would match the next part of the pattern:
m/<$tag.*?>(.*?)<\/$tag>/
I don't mind this simple sort of extraction for quick programs or small, uncomplicated inputs, but beyond that I like XML::Twig. It takes a bit to get used to, but once you get the hang of it you'll be able to do all sorts of fancy things with almost no effort.

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
to match the first number with a minimum of 4 digits
(?<=bdfg34f;\d{4,};)\d{0,9}
and to match the first number with 1 or more length
(?<=bdfg34f;\d+;)\d{0,9}
or to match the first number only if the length is between 4 and 6
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[3]; // third column
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.

How to return the first five digits using Regular Expressions

How do I return the first 5 digits of a string of characters in Regular Expressions?
For example, if I have the following text as input:
15203 Main Street
Apartment 3 63110
How can I return just "15203".
I am using C#.
This isn't really the kind of problem that's ideally solved by a single-regex approach -- the regex language just isn't especially meant for it. Assuming you're writing code in a real language (and not some ill-conceived embedded use of regex), you could do perhaps (examples in perl)
# Capture all the digits into an array
my #digits = $str =~ /(\d)/g;
# Then take the first five and put them back into a string
my $first_five_digits = join "", #digits[0..4];
or
# Copy the string, removing all non-digits
(my $digits = $str) =~ tr/0-9//cd;
# And cut off all but the first five
$first_five_digits = substr $digits, 0, 5;
If for some reason you really are stuck doing a single match, and you have access to the capture buffers and a way to put them back together, then wdebeaum's suggestion works just fine, but I have a hard time imagining a situation where you can do all that, but don't have access to other language facilities :)
it would depend on your flavor of Regex and coding language (C#, PERL, etc.) but in C# you'd do something like
string rX = #"\D+";
Regex.replace(input, rX, "");
return input.SubString(0, 5);
Note: I'm not sure about that Regex match (others here may have a better one), but basically since Regex itself doesn't "replace" anything, only match patterns, you'd have to look for any non-digit characters; once you'd matched that, you'd need to replace it with your languages version of the empty string (string.Empty or "" in C#), and then grab the first 5 characters of the resulting string.
You could capture each digit separately and put them together afterwards, e.g. in Perl:
$str =~ /(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)/;
$digits = $1 . $2 . $3 . $4 . $5;
I don't think a regular expression is the best tool for what you want.
Regular expressions are to match patterns... the pattern you are looking for is "a(ny) digit"
Your logic external to the pattern is "five matches".
Thus, you either want to loop over the first five digit matches, or capture five digits and merge them together.
But look at that Perl example -- that's not one pattern -- it's one pattern repeated five times.
Can you do this via a regular expression? Just like parsing XML -- you probably could, but it's not the right tool.
Not sure this is best solved by regular expressions since they are used for string matching and usually not for string manipulation (in my experience).
However, you could make a call to:
strInput = Regex.Replace(strInput, "\D+", "");
to remove all non number characters and then just return the first 5 characters.
If you are wanting just a straight regex expression which does all this for you I am not sure it exists without using the regex class in a similar way as above.
A different approach -
#copy over
$temp = $str;
#Remove non-numbers
$temp =~ s/\D//;
#Get the first 5 numbers, exactly.
$temp =~ /\d{5}/;
#Grab the match- ASSUMES that there will be a match.
$first_digits = $1
result =~ s/^(\d{5}).*/$1/
Replace any text starting with a digit 0-9 (\d) exactly 5 of them {5} with any number of anything after it '.*' with $1, which is the what is contained within the (), that is the first five digits.
if you want any first 5 characters.
result =~ s/^(.{5}).*/$1/
Use whatever programming language you are using to evaluate this.
ie.
regex.replace(text, "^(.{5}).*", "$1");

Regular Expression: Start from second one

I want to find the second <BR> tag and to start the search from there. How can i do it using regular expressions?
<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>
Prepend <BR>[^<]*(?=<BR>) to your regex, or remove the lookahead part if you want to start after the second <BR>, such as: <BR>[^<]*<BR>.
Find text after the second <BR> but before the third: <BR>[^<]*<BR>([^<]*)<BR>
This finds "waldo" in <BR>404<BR>waldo<BR>.
Note: I specifically used the above instead of the non-greedy .*? because once the above starts not working for you, you should stop parsing HTML with regex, and .*? will hide when that happens. However, the non-greedy quantifier is also not as well-supported, and you can always change to that if you want.
assuming you are using PHP, you can split your string on <BR> using explode
$str='<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>';
$s = explode("<BR>",$str,3);
$string = end($s);
print $string;
output
$ php test.php
Abdurrahman<BR><SMALL>Fathers Name</SMALL>
you can then use "$string" variable and do whatever you want.
The steps above can be done with other languages as well by using the string splitting methods your prog language has.
this regular expression should math the first two <br />s:
/(\s*<br\s*/?>\s*){2}/i
so you should either replace them with nothing or use preg_match or RegExp.prototype.match to extract the arguments.
In JavaScript:
var afterReplace = str.replace( /(\s*<br\s*\/?>\s*){2}/i, '' );
In PHP
$afterReplace = preg_replace( '/(\s*<br\s*\/?>\s*){2}/i', '', $str );
I'm only sure it'll work in PHP / JavaScript, but it should work in everything...
The usual solution to this sort of problem is to use a "capturing group". Most regular expression systems allow you to extract not only the entire matching sequence, but also sub-matches within it. This is done by grouping a part of the expression within ( and ). For instance, if I use the following expression (this is in JavaScript; I'm not sure what language you want to be working in, but the basic idea works in most languages):
var string = "<BR>like <BR>Abdurrahman<BR><SMALL>Fathers Name</SMALL>";
var match = string.match(/<BR>.*?<BR>([a-zA-Z]*)/);
Then I can get either everything that matched using match[0], which is "<BR>like <BR>Abdurrahman", or I can get only the part inside the parentheses using match[1], which gives me "Abdurrahman".