URL rewrite regex conversion

I'm having trouble learning how to express this URL as a regex for a rewrite rule. I've tried various regex sandboxes to figure it out on my own, but they won't accept an unescaped '/' when I paste in an expression for testing.
I've got a custom post type (publications) with 2 taxonomies (magazine, issue) which I'm trying to create a good looking URL for.
After many hours I've come here to find out how I can convert this:
index.php?post_type=publications&magazine=test-mag&issue=2016-aug
into a templated regex (publications, magazine and issue are constant) that produces:
http://example.com/publications/test-mag/2016-aug/
Hopefully with room to extend if an article is followed through from that page.
Thanks in advance.
EDIT 1:
I've got this for my rule:
^publications/([^/]*)/([^/]*)/?$
and this for my match:
^index.php?post_type=publications&magazine=$matches[1]&issue=$matches[2]$
and testing with this:
http://localhost/publications/test-mag/2016-aug/
but it's giving me a 404. What's the problem?

^index\.php\?post_type=publications&magazine=([^&]+)&issue=([^&]+)$
^ start of string
index\.php\?post_type=publications&magazine= literal text
([^&]+) one or more non-ampersand characters (will get all text up to the next URL parameter). This is captured as a group
&issue= literal text
([^&]+) one or more non-ampersand characters. also captured
$ end of string
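A minimal sketch of the same pattern, written in Python rather than PHP, purely to show what the two groups capture. The helper name to_pretty_url and the base URL are assumptions made for illustration, not part of the original answer.

```python
import re

# Same pattern as in the breakdown above: two capture groups,
# each stopping at the next "&".
QUERY_RE = re.compile(
    r"^index\.php\?post_type=publications&magazine=([^&]+)&issue=([^&]+)$"
)

def to_pretty_url(query, base="http://example.com"):
    """Hypothetical helper: turn the query-string form into the pretty form."""
    m = QUERY_RE.match(query)
    if m is None:
        return None  # not a publications query in the expected shape
    magazine, issue = m.groups()
    return f"{base}/publications/{magazine}/{issue}/"

print(to_pretty_url("index.php?post_type=publications&magazine=test-mag&issue=2016-aug"))
```

Because each group is [^&]+, a missing or empty parameter makes the whole match fail instead of capturing an empty slug.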

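// Pull the magazine and issue slugs out of the query string.
$str = 'index.php?post_type=publications&magazine=test-mag&issue=2016-aug';
preg_match('/magazine=([\w-]+?)&issue=([\w-]+)/', $str, $matches);
// Rebuild the pretty URL, keeping the constant "publications" segment.
$res = 'http://example.com/publications/' . $matches[1] . '/' . $matches[2] . '/';
echo $res; // => http://example.com/publications/test-mag/2016-aug/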

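You can use the add_rewrite_rule() function from the WordPress Rewrite API to accomplish this. WordPress matches rules against the request path without its leading slash, so the pattern should start with the publications segment. The second argument is a plain query string, not a regex, so it takes no ^ or $ anchors. Flush the rewrite rules after adding the rule (for example by re-saving Settings > Permalinks), otherwise the new rule is not picked up.
add_rewrite_rule('^publications/([^/]*)/([^/]*)/?$','index.php?post_type=publications&magazine=$matches[1]&issue=$matches[2]','top');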
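To see what such a rule does without a WordPress install, here is a rough simulation in Python. It only imitates the match-and-substitute step, using the rule from the question's edit; apply_rule is a made-up helper, not part of WordPress, and the sketch assumes the request path is matched without its leading slash.

```python
import re

RULE = r"^publications/([^/]*)/([^/]*)/?$"
REDIRECT = "index.php?post_type=publications&magazine=$matches[1]&issue=$matches[2]"

def apply_rule(path, rule=RULE, redirect=REDIRECT):
    """Hypothetical helper: imitate substituting $matches[n] into the redirect."""
    m = re.match(rule, path)
    if m is None:
        return None  # rule does not apply to this path
    # Replace each $matches[n] placeholder with the n-th captured group.
    return re.sub(
        r"\$matches\[(\d+)\]",
        lambda p: m.group(int(p.group(1))),
        redirect,
    )

print(apply_rule("publications/test-mag/2016-aug/"))
print(apply_rule("/publications/test-mag/2016-aug/"))  # leading slash: no match
```

The second call shows why a pattern that begins with a slash, or a path that keeps its leading slash, never lines up with the rule.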

Related

Matching URLs with other characters around

I need a regex pattern to match URLs in a complicated environment.
A URL would be in this position:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
(That's just a sample URL)
I need to match the URL up to the colon; the colon and the code after it should be ignored. There are so many kinds of URLs out there, and I'm not experienced enough to create a pattern that matches everything from http:// up to the :
As I said, everything else should be ignored, left away, except the URL which I need to store in a variable.
Could someone help me create such a pattern? My tries were matching the URL above, but when I put in more complicated URLs, they wouldn't match.
This is the pattern I've created. It works with simple URLs, but not with the complicated ones:
http(s)?://[A-Za-z0-9.,/_-]+
I'm not very good at regex; I'm still learning.
Thank you.
This regex should do it for you.
\[url=(.*?):[a-zA-Z0-9]*\]
Run against your test data:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
This will return the URL in capture group 1.
Assuming PHP (since your test URL is for the PHP manual), you'd use this with preg_match like this:
$value = "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]";
$pattern = "/\[url=(.*?):[a-zA-Z0-9]*\]/";
preg_match($pattern, $value, $matches);
echo $matches[1];
Output:
http://www.php.net/manual/en/function.preg-replace.php
This will also work against URLs which contain colons in them, such as:
http://www.php.net:8080/manual/en/function.preg-replace.php
http://www.php.net/manual/us:en/function.preg-replace.php
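The claim about embedded colons can be checked with a short sketch. It uses Python's re module instead of preg_match, with the same pattern; the two extra sample strings are assumptions built by wrapping the URLs above in the same [url=...] markup.

```python
import re

# The lazy (.*?) expands only until ":<alphanumerics>]" can match,
# so earlier colons (scheme, port, path) stay inside the captured URL.
BBCODE_URL = re.compile(r"\[url=(.*?):[a-zA-Z0-9]*\]")

samples = [
    "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]",
    "[url=http://www.php.net:8080/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]",
    "[url=http://www.php.net/manual/us:en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]",
]

urls = [BBCODE_URL.search(s).group(1) for s in samples]
for url in urls:
    print(url)
```

A colon such as the one in :8080/ is skipped because the characters after it are not alphanumerics running straight up to the closing bracket.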
How about this:
^(http(s)?:\/\/)?[^]^(^)^ ]+
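The regex below will match up to the first colon after the scheme. Note that the match itself begins at [url=, and that it will cut off URLs containing a port or a colon in the path: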
\[url=((http|https)?://)?[^\:]+

How to Regex Multiple URLs From Same Variable In Perl

I'm trying to search a field in a database to extract URLs. Sometimes there will be more than 1 URL in a field and I would like to extract those in to separate variables (or an array).
I know my regex isn't going to cover all possibilities. As long as I flag on anything that starts with http and ends with a space I'm ok.
The problem I'm having is that my efforts either seem to get only 1 URL per record or they get only the last letter from each URL. I've tried a couple of different techniques based on solutions others have posted, but I haven't found one that works for me.
Sample input line:
Testing http://marko.co http://tester.net Just about anything else you'd like.
Output goal
$var[0] = http://marko.co
$var[1] = http://tester.net
First try:
if ( $status =~ m/http:(\S)+/g ) {
print "$&\n";
}
Output:
http://marko.co
Second try:
@statusurls = ($status =~ m/http:(\S)+/g);
print "@statusurls\n";
Output:
o t
I'm new to regex, but since I'm using the same regex for each attempt, I don't understand why it's returning such different results.
Thanks for any help you can offer.
I've looked at these posts and either didn't find what I was looking for or didn't understand how to implement it:
This one seemed the most promising (and it's where I got the 2nd attempt from), but it didn't return the whole URL, just the last letter: How can I store regex captures in an array in Perl?
This has some great stuff in it. I'm curious if I need to look at the URL as a word since it's bookended by spaces: Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?
This one offers similar suggestions as the first two. How can I store captures from a Perl regular expression into separate variables?
Solution:
@statusurls = ($status =~ m/(http:\S+)/g);
print "@statusurls\n";
Thanks!
I think that you need to capture more than just one character. Try this regex instead:
m/http:(\S+)/g
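The difference between the attempts comes down to where the parentheses sit. Python's re.findall behaves much like a Perl global match in list context (it returns the captured groups when there are any), so a small Python sketch can show all three variants side by side. The variable names are illustrative only.

```python
import re

status = "Testing http://marko.co http://tester.net Just about anything else you'd like."

# Group wraps a single \S and is repeated: only the last repetition is kept.
last_chars = re.findall(r"http:(\S)+", status)

# Group wraps the whole URL: each match yields the full URL.
full_urls = re.findall(r"(http:\S+)", status)

# Group wraps only the part after "http:", so the scheme is dropped.
without_scheme = re.findall(r"http:(\S+)", status)

print(last_chars)
print(full_urls)
print(without_scheme)
```

The first variant reproduces the "o t" output from the question, and the third shows that moving the quantifier inside the group is not enough on its own if the scheme should be part of the result.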

Regex - Extract TwitterUsername from URL

I'm looking for a universal regular expression which extracts the Twitter username from a URL.
Sample URLS
http://www.twitter.com/#!/donttrythis
http://twitter.com/KimKardashian
http://www.twitter.com/#!/KourtneyKardash/following
http://twitter.com/#!/jasonterry31/lists/memberships
There are a couple more test cases to make a universal regexp.
https URLs are also valid
URLs like twitter.com/@username also go to username's profile
This should do the trick in PHP
preg_match("|https?://(www\.)?twitter\.com/(#!/)?@?([^/]*)|", $twitterUrl, $matches);
If preg_match returns 1 (a match) then the result is in $matches[3]
Try this:
^https?://(www\.)?twitter\.com/(#!/)?(?<name>[^/]+)(/\w+)*$
The sub group "name" will contain the twitter username.
This regex assumes that each URL is on its own line.
To use it in JS, use this:
^https?://(www\.)?twitter\.com/(#!/)?([^/]+)(/\w+)*$
The result is in the sub group $3.
this regex works fine in jQuery
$('#inputTwitter').blur(function() {
var twitterUserName = $(this).val();
$(this).val(twitterUserName.match(/https?:\/\/(www\.)?twitter\.com\/(#!\/)?@?([^\/]*)/)[3])
});
This one is based on Lombo's answer, works without http(s) too, is less greedy (it does not keep whitespace after the username) and returns the username in the first capture group.
Check it in action: https://regex101.com/r/xI2vF3/3
For js:
(?:https?:\/\/)?(?:www\.)?twitter\.com\/(?:#!\/)?@?([^\/\?\s]*)
Lombo's answer is my favorite, but it will glom any query string in with the result:
http://www.twitter.com/#!/donttrythis?source=internet
will result in a username of "donttrythis?source=internet"
I'd modify it to be:
preg_match("|https?://(www\.)?twitter\.com/(#!/)?@?([^/\?]*)|", $twitterUrl, $matches);
Adding \? to the excluded character class after the username ensures the query string is excluded.
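As a quick cross-check of the variants above, here is a hedged Python sketch that combines the optional scheme, the optional hashbang and the query-string exclusion. The function name extract_username is made up for illustration, and accepting either @ or # directly before the name is an assumption about the URL forms that need to be tolerated.

```python
import re

# Optional scheme and www, optional "#!/" hashbang, optional "@" or "#" prefix,
# then the username up to the next "/", "?" or whitespace.
TWITTER_RE = re.compile(
    r"(?:https?://)?(?:www\.)?twitter\.com/(?:#!/)?[@#]?([^/?\s]*)"
)

def extract_username(url):
    """Hypothetical helper: return the username, or None if nothing matches."""
    m = TWITTER_RE.search(url)
    return m.group(1) if m and m.group(1) else None

for url in [
    "http://www.twitter.com/#!/donttrythis",
    "http://twitter.com/KimKardashian",
    "http://www.twitter.com/#!/KourtneyKardash/following",
    "http://twitter.com/#!/jasonterry31/lists/memberships",
    "http://www.twitter.com/#!/donttrythis?source=internet",
]:
    print(extract_username(url))
```

Using a non-capturing group for every optional part keeps the username in group 1 regardless of which prefixes are present.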
This regex matches all four given URLs. The user name is present in $1
m[twitter\.com/+(?:#!/+)?(\w+)]
Use this to check
perl -le '$_="<url>"; m[twitter\.com/+(?:#!/+)?(\w+)]; print $1'
This one works for me (in PHP): /twitter\.com(?:\/\#!)?\/(\w+)/i
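I found Lombo's answer to work the best, except that it would not work if the URL was www.twitter.com/example. The following works for me on www as well. Note that str_replace applies the search strings in order, so the longer prefixes have to come before the shorter ones they contain.
$dirty_twitter = array( 'https://www.twitter.com/', 'http://www.twitter.com/', 'https://twitter.com/', 'http://twitter.com/', 'www.twitter.com/', 'twitter.com/' );
$clean_twitter = str_replace( $dirty_twitter, '', $twitter_url ); // $twitter_url holds the input URL (name chosen for illustration)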

Odd Perl Regex Behavior with Parens

I'm pulling in some Wikipedia markup and I'm wanting to match the URLs in relative (on Wikipedia) links. I don't want to match any URL containing a colon (not counting the protocol colon), to avoid special pages and the like, so I have the following code:
while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) {
my $url = $+{url};
print "$url\n";
}
Unfortunately, this code is not working quite as expected. Any URL that contains a parenthetical [e.g. /wiki/Eon_(geology)] is getting truncated just before the opening paren, so that URL would match as /wiki/Eon_. I've been looking at the code for a bit and I cannot figure out what I'm doing wrong. Can anyone provide some insight?
There isn't anything wrong in this code as it stands, so long as your Perl is new enough to support these RE features. Tested with Perl 5.10.1.
$body = <<"__ENDHTML__";
Body Blah blah
Body
__ENDHTML__
while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) {
my $url = $+{url};
print "$url\n";
}
Are you using an old Perl?
You didn't anchor the RE to the end of the string. Put a " afterwards.
While that is a problem, it isn't the problem he was trying to solve. The problem he was trying to solve was that there was nothing to match the method/hostname (http://en.wiki...) in the RE. Adding a .*? would help that, before the "(?"
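A quick way to check two of the points raised above (parentheses, and the missing closing quote) is to run the pattern against a small sample. The sketch below uses Python's re module with the equivalent pattern; the sample markup and its link targets are made up for illustration.

```python
import re

# Hypothetical sample markup with a parenthesised title and a special page.
body = (
    '<p><a href="/wiki/Eon_(geology)">Eon</a> '
    '<a href="/wiki/Special:Random">Random</a> '
    '<a href="/wiki/Perl">Perl</a></p>'
)

# Pattern as posted: nothing after the character class.
unanchored = re.compile(r'<a href="(?P<url>/wiki/[^:"]+)', re.IGNORECASE | re.DOTALL)

# Same pattern with the closing quote required after the URL.
anchored = re.compile(r'<a href="(?P<url>/wiki/[^:"]+)"', re.IGNORECASE | re.DOTALL)

unanchored_urls = [m.group("url") for m in unanchored.finditer(body)]
anchored_urls = [m.group("url") for m in anchored.finditer(body)]

print(unanchored_urls)
print(anchored_urls)
```

In this sketch the parentheses are matched without trouble in both cases. The unanchored pattern does not skip the colon URL but truncates it at the colon, while requiring the closing quote drops it entirely.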

Regex to match URL not followed by " or <

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.
For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com
www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>
I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.
I can't see what I'm doing wrong... any ideas?
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])
Here is a simpler example too:
((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])
I looked into this issue last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). That link is a test page for the JavaScript solution, with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and JavaScript, is not simple (but neither is the problem, as it turns out). For more information I would also recommend reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Note also that John Gruber's regex has a component that can go into the realm of catastrophic backtracking (the part which matches one level of matching parentheses).
Yeah, it's actually trivial to make it work if you just want to exclude trailing characters: make your expression 'independent' (an atomic group), and no backtracking will occur in that segment.
(?>\b ...)(?!["<])
A perl test:
use strict;
use warnings;
my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';
while ($str =~ m~
(?>
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
)
(?!["<])
~xg)
{
print "$1\n";
}
Output:
www.foo.com
http://www.some.com
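The same effect can be reproduced outside Perl. The sketch below uses Python with the simpler pattern from the question and emulates the independent group with a capturing lookahead followed by a backreference, a common workaround that does not rely on atomic-group syntax (Python 3.11 and later also accept (?>...) directly). The sample string is an assumption: one bare URL, one inside an href, one used as link text, and one trailing URL.

```python
import re

# Simplified URL pattern from the question (the dot after "www" is escaped here).
URL = r"(?:(?:ht|f)tps?://|www\.)[a-zA-Z0-9_\-.:#/~}?]+"

# Assumed sample input, chosen to cover each case.
text = (
    'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>'
    "http://www.some.com"
)

# Plain negative lookahead: the engine backtracks one character and
# accepts the URL minus its final "m".
naive = re.findall(URL + r'(?!["<])', text)

# Emulated atomic group: the lookahead captures the full greedy match,
# \1 consumes exactly that text, and the engine cannot give a character back.
atomic = re.findall(r"(?=(" + URL + r"))\1" + r'(?!["<])', text)

print(naive)
print(atomic)
```

The first list shows the truncated http://www.bar.co and http://www.baz.co matches described in the question; the second keeps only the URLs that are not followed by a quote or an angle bracket.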