How to limit my regex that is detecting too much? - regex

I have a regex that is attempting to detect title & link markup:
[title](http://link.com)
So far I have:
(\[)(.*?)(\])(\(((http[s]?)|ftp):\/\/)(.*?)(\))
This matches too much when an untitled link precedes the titled one:
[http://google.com] [Digg](http://digg.com)
[Internal Page] Random other text [Digg](http://digg.com)
How can I limit the regex to just the titled link?
Full PHP for titled & untitled links:
// Titled Links
// [Digg](http://digg.com)
// [Google](http://google.com)
$text = preg_replace_callback(
'/(\[)(.*?)(\])(\(((http[s]?)|ftp):\/\/)(.*?)(\))/',
function ($match) {
$link = trim($match[7]);
$ret = "<a target='_blank' href='" . strtolower($match[5]) . "://" . $link . "'>" . trim($match[2]) . "</a>";
if (strtolower($match[5]) == "http") {
$ret .= "<img src='/images/link_http.png' class='link' />";
} else if (strtolower($match[5]) == "https") {
$ret .= "<img src='/images/link_https.png' class='link' />";
} else if (strtolower($match[5]) == "ftp") {
$ret .= "<img src='/images/link_ftp.png' class='link' />";
}
return $ret;
},
$text
);
// Untitled Links
// [Internal Page]
// [http://google.com]
$text = preg_replace_callback(
'/(\[)(.*?)(\])/',
function ($match) {
$link = trim($match[2]);
$ret = "";
if ($this->startsWith(strtolower($link), "https")) {
$ret = "<a target='_blank' href='" . $link . "'>" . $link . "</a>";
$ret .= "<img src='/images/link_https.png' class='link' />";
} else if ($this->startsWith(strtolower($link), "http")) {
$ret = "<a target='_blank' href='" . $link . "'>" . $link . "</a>";
$ret .= "<img src='/images/link_http.png' class='link' />";
} else if ($this->startsWith(strtolower($link), "ftp")) {
$ret = "<a target='_blank' href='" . $link . "'>" . $link . "</a>";
$ret .= "<img src='/images/link_ftp.png' class='link' />";
} else {
$link = str_replace(" ", "_", $link);
$ret = "<a href='" . $link . "'>" . trim($match[2]) . "</a>";
}
return $ret;
},
$text
);

If you're trying to go through Markdown links, you'll probably want to grab the regex and logic straight from the source:
https://github.com/michelf/php-markdown/blob/lib/Michelf/Markdown.php#L510
https://github.com/tanakahisateru/js-markdown-extra/blob/master/js-markdown-extra.js#L630

Make the title optional by appending a '?' to the group that matches the title.

Instead of (.*?), try a class that excludes something you really don't want, such as whitespace, e.g. ([^\s]+).
Also, the whole of the second part is optional (if you can have an untitled link), so add the ? as @Arnout suggests, e.g.
(\(((http[s]?)|ftp):\/\/)([^\s]+)(\))?
May I also suggest using the whitespace flag (x), which does appear to be supported in PHP regex, and breaking the pattern up over a few lines for readability:
/
(
\[
)
(.*?)
(
\]
)
(
\(
(
(http[s]?)
|
ftp
)
:\/\/
)
(.*?)
(
\)
)
/x
That is a lot clearer, and it's easier to see:
The [s]? could just be s?
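The scheme group can be simplified to (https?|ftp). The alternation already covers the whole of ftp, but the inner (http[s]?) group adds an unneeded capture. Note that dropping it shifts the later group numbers down by one.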
You may possibly be able to comment on it too, within the regex (again, I'm unsure you can with PHP).

This works but doesn't have the parentheses for all the groups you're trying to match.
\[[\w\s]+\]\((https?|ftp)://[^)]+\)
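
The suggestions in the answers above can be combined into one pattern. Here is a minimal sketch in Python rather than PHP, so the names TITLED_LINK and find_titled_links are made up for illustration. It assumes titles never contain square brackets and URLs never contain whitespace or a closing parenthesis. It uses the verbose flag for readability and negated character classes instead of (.*?), so a match cannot start at an earlier untitled [link].

```python
import re

# Verbose-mode pattern: whitespace and "#" comments are ignored by the engine.
TITLED_LINK = re.compile(
    r"""
    \[ ([^\[\]]+) \]      # title: anything except square brackets
    \(
    (https?|ftp) ://      # scheme: http, https or ftp
    ([^)\s]+)             # rest of the URL: no whitespace, no closing paren
    \)
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_titled_links(text):
    """Return (title, scheme, rest_of_url) for each titled link in text."""
    return [
        (m.group(1).strip(), m.group(2).lower(), m.group(3))
        for m in TITLED_LINK.finditer(text)
    ]


if __name__ == "__main__":
    print(find_titled_links("[http://google.com] [Digg](http://digg.com)"))
    print(find_titled_links("[Internal Page] Random other text [Digg](http://digg.com)"))
```

Because the title class cannot cross a ], the engine has to give up on [http://google.com] and [Internal Page] and only matches from [Digg] onwards. The same pattern should carry over to PHP's preg functions with the x and i modifiers, though that has not been tested here.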

Related

Replacing sub strings with regex in powershell

I have the following regex code in my PowerShell script to identify URLs that I need to update:
'href[\s]?=[\s]?\"[^"]*(https:\/\/oursite.org\/[^"]*News and Articles[^"]*)+\"'
'href[\s]?=[\s]?\"[^"]*(https:\/\/oursite.org\/[^"]*en\/News-and-Articles[^"]*)+\"'
These are getting me the results I need to update, now I need to know how to replace the values "News and Articles" with "news-and-articles" and "en" with "news-and-articles".
I have some code that has a replacement url like so:
$newUrl = 'href="https://oursite.org/"' #replaced value
So the beginning result would be:
https://www.oursite.org/en/News-and-Articles/2017/11/article-name
to be replaced with
https://www.oursite.org/news-and-articles/2017/11/article-name
Here is the function that is going through all the articles and doing a replacement:
function SearchItemForMatch
{
param(
[Data.Items.Item]$item
)
Write-Host "------------------------------------item: " $item.Name
foreach($field in $item.Fields) {
#Write-Host $field.Name
if($field.Type -eq "Rich Text") {
#Write-Host $field.Name
if($field.Value -match $pattern) {
ReplaceFieldValue -field $field -needle $pattern -replacement $newUrl
}
#if($field.Value -match $registrationPattern) {
# ReplaceFieldValue -field $field -needle $registrationPattern -replacement $newRegistrationUrl
#}
if($field.Value -match $noenpattern){
ReplaceFieldValue -field $field -needle $noenpattern -replacement $newnoenpattern
}
}
}
}
Here is the replacement method:
Function ReplaceFieldValue
{
param (
[Data.Fields.Field]$field,
[string]$needle,
[string]$replacement
)
Write-Host $field.ID
$replaceValue = $field.Value -replace $needle, $replacement
$item = $field.Item
$item.Editing.BeginEdit()
$field.Value = $replaceValue
$item.Editing.EndEdit()
Publish-Item -item $item -PublishMode Smart
$info = [PSCustomObject]@{
"ID"=$item.ID
"PageName"=$item.Name
"TemplateName"=$item.TemplateName
"FieldName"=$field.Name
"Replacement"=$replacement
}
[void]$list.Add($info)
}
Forgive me if I'm missing something, but it seems to me that all you really want to achieve is to get rid of the /en part and finally convert the whole URL to lowercase.
Given your example url, this could be as easy as:
$url = 'https://www.oursite.org/en/News-and-Articles/2017/11/article-name'
$replaceValue = ($url -replace '/en/', '/').ToLower()
Result:
https://www.oursite.org/news-and-articles/2017/11/article-name
If it involves more elaborate replacements, then please edit your question and give us more examples and desired output.
Try Regex: (?<=oursite\.org\/)(?:en\/)?News-and-Articles(?=\/)
Replace with news-and-articles
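
To see what the lookbehind/lookahead pattern does, here is a small sketch in Python rather than PowerShell. The function name normalize_news_url is hypothetical, and the [- ] classes are an added assumption so that the "News and Articles" spelling from the question is covered too.

```python
import re

# Match an optional "en/" plus the section name, but only directly after
# "oursite.org/" and directly before the next "/". The lookarounds keep the
# host and the rest of the path out of the replaced text.
SECTION = re.compile(r"(?<=oursite\.org/)(?:en/)?News[- ]and[- ]Articles(?=/)")


def normalize_news_url(url):
    """Rewrite the section part of the URL to 'news-and-articles'."""
    return SECTION.sub("news-and-articles", url)


if __name__ == "__main__":
    print(normalize_news_url(
        "https://www.oursite.org/en/News-and-Articles/2017/11/article-name"
    ))
```

Only the section segment is touched, so the year, month and article name keep their original case, unlike the ToLower() approach.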

Perl Regex Match and loop HTML Comments

I have a log file with data in format :
<!-- 12/15/16 01:02:27:950.125
DATA1 -->
<!-- 12/15/16 01:02:27:950.373
DATA2 -->
<!-- 12/15/16 01:02:27:950.921
DATA3: Text1 -->
<!-- 12/15/16 01:02:27:951.066
DATA4: Text2 -->
I need to extract and loop all the data inside the comments.
I am reading the file and saving data as one string.
I have tried a few solutions but keep getting "undef" on match:
use strict;
use Data::Dumper;
use File::Basename;
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );
use Date::Format;
use DateTime;
use warnings;
.
.
.
if ( open(ORIGFILE, $filepath) ) {
my @wrp_record_content = <ORIGFILE>;
# my $content = join('', @wrp_record_content);
# my #matches = $content =~ s/<!--(.*)-->//g;
# my $data;
# while ( <ORIGFILE> ) {
# $data .= $_;
# }
# while ( $data =~ m/<!--(.*)-->/g ) {
# print Dumper('===DATA===');
# print Dumper($data);
# }
my $content = join('', @wrp_record_content);
#print Dumper('------CONTENT------');
#print Dumper($content);
#print Dumper('------ CONTENT ENDED ------');
my @matches;
while ($content =~ /<!--.*?-->/gs) {
push @matches, $1;
}
foreach my $m (@matches) {
print Dumper('===MATCH===', "\n");
print Dumper($m);
}
}
Can someone please guide on where it is going wrong?
There is nothing in $1. You must add capturing parentheses to your regex pattern
$content =~ /<!--(.*?)-->/gs
You have done it correctly in the loop that you commented out!
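
For comparison, the same idea as a short Python sketch (not the asker's Perl; extract_comments is a made-up name). The capturing group plays the role of $1, and the DOTALL flag corresponds to Perl's /s so that . also matches the newline inside each comment.

```python
import re

# Non-greedy capture of everything between "<!--" and the nearest "-->".
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def extract_comments(text):
    """Return the body of every HTML comment, with outer whitespace trimmed."""
    return [body.strip() for body in COMMENT.findall(text)]


if __name__ == "__main__":
    log = (
        "<!-- 12/15/16 01:02:27:950.125\nDATA1 -->\n"
        "<!-- 12/15/16 01:02:27:950.373\nDATA2 -->\n"
        "<!-- 12/15/16 01:02:27:950.921\nDATA3: Text1 -->\n"
    )
    for body in extract_comments(log):
        print(repr(body))
```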

regular expression preg replace omit a character

I use this function to replace relative links with absolute ones and pass them as parameters to the page, which streams them with file_get_contents. I think there is a problem in my regular expression: it drops a character.
Here is the function:
$pattern = "/<a([^>]*) " .
"href=\"[^http|ftp|https|mailto]([^\"]*)\"/";
$replace = "<a\${1} href=\"?u=" . $base . "\${2}\"";
$text = preg_replace($pattern, $replace, $text);
$pattern = "/<a([^>]*) " .
"href='[^http|ftp|https|mailto]([^\']*)'/";
$replace = "<a\${1} href=\"?u=" . $base . "\${2}\"";
$text = preg_replace($pattern, $replace, $text);
$pattern = "/<img([^>]*) " .
"src=\"[^http|ftp|https]([^\"]*)\"/";
$replace = "<img\${1} src=\"" . $base . "\${2}\"";
$text = preg_replace($pattern, $replace, $text);
$pattern = "/<a([^>]*) " .
"href=\"([^\"]*)\"/";
$replace = "<a\${1} href=\"?u=" . "\${2}\"";
$text = preg_replace($pattern, $replace, $text);
So this link:
"UsersList.aspx?dir=09"
with this $base URL:
http://www.some-url.com/Members/
should be replaced to
"?u=http://www.some-url.com/Members/UsersList.aspx?dir=09"
but I get
"?u=http://www.some-url.com/Members/sersList.aspx?dir=09"
I don't know what the problem in my regular expression is or how to fix it.
I guess your <a> tag has a relative href such as "UsersList.aspx?dir=09",
and that will not give your desired result with this pattern:
$pattern = "/<a([^>]*) " . "href=\"[^http|ftp|https|mailto]([^\"]*)\"/";
In that pattern,
[^http|ftp|https|mailto] is a negated character class: it matches exactly one character that is not one of the listed letters or |, so the 'U' is consumed and goes missing.
Try removing it, like this:
$pattern = "/<a([^>]*) " . "href=\"([^\"]*)\"/";
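
Removing the class also removes the check that skips absolute links. One possible alternative, sketched here in Python rather than PHP, is a negative lookahead, which tests what follows without consuming a character. The names BASE, BROKEN, FIXED and rewrite are made up for this example, and the scheme list in the lookahead is an assumption based on the question.

```python
import re

BASE = "http://www.some-url.com/Members/"

# The negated class consumes one character (the "U"), so it is lost.
BROKEN = re.compile(r'<a([^>]*) href="[^http|ftp|https|mailto]([^"]*)"')

# The negative lookahead consumes nothing; it only rejects absolute URLs.
FIXED = re.compile(r'<a([^>]*) href="(?!https?:|ftp:|mailto:)([^"]*)"')


def rewrite(pattern, html):
    """Prefix relative hrefs with ?u=BASE using the given pattern."""
    return pattern.sub(
        lambda m: '<a{} href="?u={}{}"'.format(m.group(1), BASE, m.group(2)),
        html,
    )


if __name__ == "__main__":
    html = '<a class="nav" href="UsersList.aspx?dir=09">Users</a>'
    print(rewrite(BROKEN, html))  # first letter of the file name is dropped
    print(rewrite(FIXED, html))   # file name is kept intact
```

PCRE supports the same (?!...) syntax, so the lookahead should be usable in the preg_replace patterns as well.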

Perl - Parsing Arguments/Options with REGEX

I'm creating a Perl script to convert a list of commands in a template file and write them to an output file in a different format.
The commands in the template file will look as follows:
command1 --max-size=2M --type="some value"
I'm having some problems extracting the options and values from this string. So far I have:
m/(\s--\w*=)/ig
Which will return:
" --max-size="
" --type="
However I have no idea how to return both the option and value as a separate variable or how to accommodate for the use of quotes.
Could anyone steer me in the right direction?
Side note: I'm aware that Getopt does an awesome job at doing this from the command line, but unfortunately these commands are passed as strings :(
Getopt::Std or Getopt::Long?
Have you looked at this option or this one?
Seems like there's no reason to reinvent the wheel.
The code below produces
@args = ('command1', '--max-size=2M', '--type=some value');
That is suitable to pass to GetOptions as follows:
local @ARGV = @args;
GetOptions(...) or die;
Finally, the code:
for ($cmd) {
my @args;
while (1) {
last if /\G \s* \z /xgc;
/\G \s* /xgc;
my $arg;
while (1) {
if (/\G ([^\\"'\s]) /xgc) {
$arg .= $1;
}
elsif (/\G \\ /xgc) {
/\G (.) /sxgc
or die "Incomplete escape";
$arg .= $1;
}
elsif (/\G (?=") /xgc) {
/\G " ( (?:[^"\\]|\\.)* ) " /sxgc
or die "Incomplete double-quoted arging";
my $quoted = $1;
$quoted =~ s/\\(.)/$1/sg;
$arg .= $quoted;
}
elsif (/\G (?=') /xgc) {
/\G ' ( [^']* ) ' /xgc
or die "Incomplete single-quoted arging";
$arg .= $1;
}
else {
last;
}
}
push @args, $arg;
}
@args
or die "Blank command";
...
}
use Data::Dumper;
$_ = 'command1 --max-size=2M a=ignore =ignore --switch --type="some value" --x= --z=1';
my $args = {};
while (/((?<=\s--)[a-z\d-]+)(?:="?|(?=\s))((?<![="])|(?<=")[^"]*(?=")|(?<==)(?!")\S*(?!"))"?(?=\s|$)/ig) {
$args->{$1} = $2;
}
print Dumper($args);
---
$VAR1 = {
'switch' => '',
'x' => '',
'type' => 'some value',
'z' => '1',
'max-size' => '2M'
};
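
As a point of comparison, a language with a shell-style tokenizer in its standard library makes the quoting part trivial. The sketch below uses Python's shlex.split instead of a hand-written Perl tokenizer; parse_command is a hypothetical helper, and treating only tokens that start with -- as options is an assumption taken from the examples above.

```python
import shlex


def parse_command(line):
    """Split a command string into (command_name, {option: value})."""
    tokens = shlex.split(line)  # honours quotes and backslash escapes
    if not tokens:
        raise ValueError("Blank command")
    name, options = tokens[0], {}
    for token in tokens[1:]:
        if token.startswith("--"):
            # "--key=value" -> ("key", "value"); "--switch" -> ("switch", "")
            key, _, value = token[2:].partition("=")
            options[key] = value
    return name, options


if __name__ == "__main__":
    print(parse_command('command1 --max-size=2M --type="some value"'))
```

Tokens such as a=ignore are skipped rather than rejected, which mirrors the output shown above.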

Regular expression to match links containing "Google"

I want to use PHP regular expressions to match out all the links which contain the word google. I've tried this:
$url = "http://www.google.com";
$html = file_get_contents($url);
preg_match_all('/<a.*(.*?)".*>(.*google.*?)<\/a>/i',$html,$links);
echo '<pre />';
print_r($links); // it should return 2 links 'About Google' & 'Go to Google English'
However it returns nothing. Why?
It is better to use XPath here:
$url="http://www.google.com";
$html=file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = "//a[contains(translate(text(), 'GOOGLE', 'google'), 'google')]";
// or just:
// $query = "//a[contains(text(),'Google')]";
$links = $xpath->query($query);
$links will be a DOMNodeList you can iterate.
You should use a dom parser, because using regex for html documents can be "painfully" error prone.
Try something like this
//Disable displaying errors
libxml_use_internal_errors(TRUE);
$url="http://www.google.com";
$html=file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$n=0;
foreach ($doc->getElementsByTagName('a') as $a) {
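//check if the anchor's href contains the word 'google' and print it out
//(strpos returns 0 for a match at the start, so compare against false explicitly)
if ($a->hasAttribute('href') && strpos($a->getAttribute('href'),'google') !== false) {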
echo "Anchor" . ++$n . ': '. $a->getAttribute('href') . '<br>';
}
}
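
The parse-then-filter approach is not specific to PHP. Below is a minimal sketch with Python's standard html.parser module. The LinkCollector class and links_mentioning function are invented for this example, and it filters on the link text rather than the href. It works on an HTML string, so fetching the page is left out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_mentioning(html, word):
    """Return the links whose text contains word, ignoring case."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    needle = word.lower()
    return [(href, text) for href, text in parser.links if needle in text.lower()]


if __name__ == "__main__":
    page = (
        '<p><a href="/intl/en/about.html">About Google</a> '
        '<a href="/ncr">Go to Google English</a> '
        '<a href="/ads">Advertising</a></p>'
    )
    print(links_mentioning(page, "google"))
```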