Regex for URL to get ID - regex

I am new to Regex. I have a few easy expressions written for urls. I am having some trouble writing one if the URL has a category and then a random html page after the ID:
http://www.website.com/MajorCategory/MinorCateory/118326/title-of-page.html
I am trying to match the 6-digit ID in the example above. Any help on getting this to match would be awesome.

You can use:
'~[0-9]+(?=/[^/]*$)~'
Working Demo: http://regex101.com/r/wY4mH3
PHP Code:
if ( preg_match('~[0-9]+(?=/[^/]*$)~', $str, $m) )
echo ($m[0]);
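For readers who want to try the same lookahead outside PHP, here is a minimal Python sketch. The function name `extract_id` is made up for illustration; the pattern is the one from the answer above: a run of digits that must be followed by a slash and a final segment with no further slashes.

```python
import re

# Digits followed by "/" and a last path segment that contains no slash.
ID_PATTERN = re.compile(r"[0-9]+(?=/[^/]*$)")

def extract_id(url):
    """Return the numeric ID that precedes the last path segment, or None."""
    match = ID_PATTERN.search(url)
    return match.group(0) if match else None

url = "http://www.website.com/MajorCategory/MinorCateory/118326/title-of-page.html"
print(extract_id(url))
```

Because the pattern uses `+` rather than `{6}`, it accepts IDs of any length, not only 6-digit ones.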

Regex: Find username inside a url

I'm struggling with creating the correct REGEX pattern to find a username string in the middle of a url. In short, I'm working in Powershell and pulling down a webpage and scraping out the "li" elements. I write this to a file so I have a bunch of lines like this:
<LI>Smith, Jimmy
The string I need is the "jimmysmith" part, and every line will have a different username, no longer than eight alpha characters. My current pattern is this:
(<(.|\n)+?>)|( )
and I can use a "-replace $pattern" in my code to grab the "Smith, Jimmy" part. I have no idea what I'm doing, and any success in getting what I did get was face-roll-luck.
After using several online regex helpers I'm still stuck on how to get just the string after the third "/", up to but not including the last quote.
Thank you for any assistance you can give me.
I suggest you use an HTML parser instead. Try:
$html = New-Object -ComObject "HTMLFile"
$source = '<LI>Smith, Jimmy '
$html.IHTMLDocument2_write($source)
$html.links | % nameprop
jimmysmith
Try the following regex:
[^\/"]+(?=">.*<\/A>)
This will capture the last string in href attribute of <a> tag.
Alternatively, simply strip the redundant parts with -replace:
'<LI>Smith, Jimmy ' -replace ".*user/|`"\>.*"
If you have multiple lines, try this:
'<LI>Smith, Jimmy ' -replace "^\<LI.*user/|`"\>.*"
Both work, tested.
The answer to my question was contained in this response by Sergio.
Try the following regex:
[^\/"]+(?=">.*<\/A>)
This will capture the last string in href attribute of <a> tag.
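A rough Python sketch of how that lookahead pattern behaves. The input line below is a made-up stand-in, assuming each line holds an uppercase `<A>` tag whose `HREF` ends in the username; `extract_username` is a hypothetical helper name.

```python
import re

# Characters other than "/" and '"' that sit directly before the closing
# quote of the href, with a closing </A> somewhere later on the line.
USERNAME_PATTERN = re.compile(r'[^/"]+(?=">.*</A>)')

def extract_username(line):
    """Return the last path segment of the href, or None if no link is found."""
    match = USERNAME_PATTERN.search(line)
    return match.group(0) if match else None

# Hypothetical input line, not taken from the original page.
sample = '<LI><A HREF="/people/user/jimmysmith">Smith, Jimmy</A>'
print(extract_username(sample))
```

The pattern is case-sensitive, so a lowercase `</a>` would need the `re.IGNORECASE` flag.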

Regular expression to match string from url

I want to match the shop name from a URL. It's for URL redirection in a WordPress application.
See the examples given below
http://example.com/outlets/19-awok?page=2
http://example.com/outlets/19-awok
http://example.com/outlets/159-awok?page=3
In all cases I need to get only awok from the URL. It will be the text coming after '-' and before the query string.
I tried the rule below and it's not working:
/outlets/(\d+)-(.*)? => /shop/$2
Any help will be greatly appreciated.
You can use this regex:
/outlets/\d+-([^?]+)?
The trailing ? is used to strip the previous query string.
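To check the capture group against the sample URLs, here is a small Python sketch. The helper names `shop_name` and `to_shop_path` are assumptions for illustration, and the `/shop/` target is taken from the rule in the question.

```python
import re

# Capture everything after "<digits>-" up to, but not including, a "?".
SHOP_PATTERN = re.compile(r"/outlets/\d+-([^?]+)")

def shop_name(url):
    """Return the shop name from an /outlets/<id>-<name> URL, or None."""
    match = SHOP_PATTERN.search(url)
    return match.group(1) if match else None

def to_shop_path(url):
    """Build the redirect target /shop/<name>, or None if the URL does not match."""
    name = shop_name(url)
    return "/shop/" + name if name else None

for url in (
    "http://example.com/outlets/19-awok?page=2",
    "http://example.com/outlets/19-awok",
    "http://example.com/outlets/159-awok?page=3",
):
    print(shop_name(url), to_shop_path(url))
```

The original attempt used `(.*)`, which also swallows the query string; `[^?]+` stops at the first `?`.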

URL rewrite regex conversion

I'm having trouble trying to learn how to write this URL into a regex template to add in as a rewrite. I've tried various regex sandboxes to figure it out on my own but they won't allow a '/' for instance when I copy an expression from here for testing:
I've got a custom post type (publications) with 2 taxonomies (magazine, issue) which I'm trying to create a good looking URL for.
After many hours I've come here to find out how I can convert this.
index.php?post_type=publications&magazine=test-mag&issue=2016-aug
To a templated regex expression (publication, magazine and issue are constant) that can output.
http://example.com/publications/test-mag/2016-aug/
Hopefully with room to extend if an article is followed through from that page.
Thanks in advance.
EDIT 1:
I've got this for my rule:
^publications/([^/]*)/([^/]*)/?$
and this for my match:
^index.php?post_type=publications&magazine=$matches[1]&issue=$matches[2]$
and testing with this:
http://localhost/publications/test-mag/2016-aug/
but it's giving me a 404. What's the problem?
^index\.php\?post_type=publications&magazine=([^&]+)&issue=([^&]+)$
^ start of string
index\.php\?post_type=publications&magazine= literal text
([^&]+) one or more non-ampersand characters (will get all text up to the next URL parameter); this is captured as a group
&issue= literal text
([^&]+) one or more non-ampersand characters. also captured
$ end of string
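The breakdown above can be written as a commented regex and run directly. This is a Python sketch using `re.VERBOSE`; the function name `parse_query` is made up for illustration.

```python
import re

# Same pattern as above, split across lines with one comment per part.
QUERY_PATTERN = re.compile(r"""
    ^index\.php\?post_type=publications   # start of string, then literal text
    &magazine=([^&]+)                     # group 1: one or more non-ampersand characters
    &issue=([^&]+)$                       # group 2: the same, up to the end of the string
""", re.VERBOSE)

def parse_query(query):
    """Return (magazine, issue) from the query string, or None if it does not match."""
    match = QUERY_PATTERN.match(query)
    return match.groups() if match else None

print(parse_query("index.php?post_type=publications&magazine=test-mag&issue=2016-aug"))
```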
$str = 'index.php?post_type=publications&magazine=test-mag&issue=2016-aug';
preg_match('/magazine=([\w-]+?)&issue=([\w-]+)/', $str, $matches);
$res = 'http://example.com/' . $matches[1] . '/' . $matches[2] . '/';
echo $res; // => http://example.com/test-mag/2016-aug/
You can use the add_rewrite_rule function in the WordPress Rewrite API to accomplish this.
add_rewrite_rule('^publications/([^/]*)/([^/]*)/?$','index.php?post_type=publications&magazine=$matches[1]&issue=$matches[2]','top');
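To see the direction of the mapping (pretty path in, query string out), here is a Python sketch that only imitates the regex step. It is not WordPress code: `rewrite`, `RULE` and `TARGET` are hypothetical names, and it assumes the path is matched without its leading slash.

```python
import re

# The rule from the question: publications/<magazine>/<issue>/ with an optional trailing slash.
RULE = re.compile(r"^publications/([^/]*)/([^/]*)/?$")
TARGET = "index.php?post_type=publications&magazine={0}&issue={1}"

def rewrite(path):
    """Map a pretty path to the internal query string, or None if the rule does not apply."""
    match = RULE.match(path.lstrip("/"))  # assumption: leading slash is dropped before matching
    return TARGET.format(*match.groups()) if match else None

print(rewrite("/publications/test-mag/2016-aug/"))
```

Note that the target side is a plain template, not a regex, so it carries no `^` or `$` anchors.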

Matching URLs with other characters around

I need a regex pattern to match URLs in a complicated environment.
A URL would be in this position:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
(That's just a sample URL)
I need to match the URL until the colon, the colon and the code after that should be ignored. There are so many URLs out there and I'm not that experienced to create a pattern to match everything from http:// to :
As I said, everything else should be ignored, left away, except the URL which I need to store in a variable.
Could someone help me create such a pattern? My tries were matching the URL above, but when I put in more complicated URLs, they wouldn't match.
This is the pattern I've created. It works with simple URLs, but not with the complicated ones:
http(s)?://[A-Za-z0-9.,/_-]+
I'm not very good at regex, I'm still learning.
Thank you.
This regex should do it for you.
\[url=(.*?):[a-zA-Z0-9]*\]
Run against your test data:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
This will return the URL in capture group 1.
Assuming PHP (since your test URL is for the PHP manual), you'd use this with preg_match like this:
$value = "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]";
$pattern = "/\[url=(.*?):[a-zA-Z0-9]*\]/";
preg_match($pattern, $value, $matches);
echo $matches[1];
Output:
http://www.php.net/manual/en/function.preg-replace.php
This will also work against URLs which contain colons in them, such as:
http://www.php.net:8080/manual/en/function.preg-replace.php
http://www.php.net/manual/us:en/function.preg-replace.php
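The same pattern can be tried in Python to confirm the colon behaviour. This is a sketch; `extract_urls` is a made-up helper, and the second test string reuses the port example from above inside a hypothetical tag.

```python
import re

# Lazy (.*?) grows until it reaches a colon that is followed only by
# alphanumerics and the closing bracket, so earlier colons stay in the URL.
BBCODE_URL = re.compile(r"\[url=(.*?):[a-zA-Z0-9]*\]")

def extract_urls(text):
    """Return every URL found in [url=...:code] tags, in order of appearance."""
    return BBCODE_URL.findall(text)

text = "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]"
print(extract_urls(text))
```

The colon after `http` and a port such as `:8080` are skipped because the text following them does not run straight into `]`.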
How about this:
^(http(s)?:\/\/)?[^]^(^)^ ]+
The regex below will give you the URL part before the colon:
\[url=((http|https)?://)?[^\:]+

Regex - Extract TwitterUsername from URL

I'm looking for a universal regular expression which extracts the Twitter username from a URL.
Sample URLs
http://www.twitter.com/#!/donttrythis
http://twitter.com/KimKardashian
http://www.twitter.com/#!/KourtneyKardash/following
http://twitter.com/#!/jasonterry31/lists/memberships
There are a couple more test cases to make a universal regexp.
https URLs are also valid
URLs like twitter.com/#username also go to username's profile
This should do the trick in PHP:
preg_match("|https?://(www\.)?twitter\.com/(#!/)?#?([^/]*)|", $twitterUrl, $matches);
If preg_match returns 1 (a match), the result is in $matches[3].
Try this:
^https?://(www\.)?twitter\.com/(#!/)?(?<name>[^/]+)(/\w+)*$
The sub group "name" will contain the twitter username.
This regex assumes that each URL is on its own line.
To use it in JS, use this:
^https?://(www\.)?twitter\.com/(#!/)?([^/]+)(/\w+)*$
The result is in the sub group $3.
This regex works fine in jQuery:
$('#inputTwitter').blur(function() {
var match = $(this).val().match(/https?:\/\/(www\.)?twitter\.com\/(#!\/)?#?([^\/]*)/);
// Replace the field value only when the input looks like a Twitter URL.
if (match) { $(this).val(match[3]); }
});
This one is based on Lombo's answer. It also works without http(s), is less greedy (it does not keep spaces after the username), and returns the username in the first capture group.
Check it in action: https://regex101.com/r/xI2vF3/3
For js:
(?:https?:\/\/)?(?:www\.)?twitter\.com\/(?:#!\/)?#?([^\/\?\s]*)
Lombo's answer is my favorite, but it will glom any query string in with the result:
http://www.twitter.com/#!/donttrythis?source=internet
will result in a username of "donttrythis?source=internet"
I'd modify it to be:
preg_match("|https?://(www\.)?twitter\.com/(#!/)?#?([^/\?]*)|", $twitterUrl, $matches);
Adding \? to the excluded character class after the username ensures the query string is excluded.
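A quick way to check these variants against the sample URLs is a small Python sketch. It uses the pattern with non-capturing groups and the `?` and whitespace exclusions; `twitter_username` is a hypothetical helper name.

```python
import re

# Optional scheme and www, optional "#!/" or "#", then the username:
# everything up to the next "/", "?" or whitespace.
TWITTER_USER = re.compile(
    r"(?:https?://)?(?:www\.)?twitter\.com/(?:#!/)?#?([^/?\s]*)"
)

def twitter_username(url):
    """Return the username from a Twitter profile URL, or None if none is found."""
    match = TWITTER_USER.search(url)
    return match.group(1) if match and match.group(1) else None

for url in (
    "http://www.twitter.com/#!/donttrythis",
    "http://twitter.com/KimKardashian",
    "http://www.twitter.com/#!/KourtneyKardash/following",
    "http://twitter.com/#!/jasonterry31/lists/memberships",
    "http://www.twitter.com/#!/donttrythis?source=internet",
):
    print(twitter_username(url))
```

Since all the optional parts are non-capturing, the username is always group 1, whichever URL shape was matched.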
This regex matches all four given URLs. The user name is present in $1
m[twitter\.com/+(?:#!/+)?(\w+)]
Use this to check
perl -le '$_="<url>"; m[twitter\.com/+(?:#!/+)?(\w+)]; print $1'
This one works for me (in PHP): /twitter\.com(?:\/\#!)?\/(\w+)/i
I found Lombo's answer to work the best, except that it would not work if the URL was www.twitter.com/example. The following works for me with www as well.
// Longer prefixes come first so that a shorter one cannot strip part of a longer one.
$dirty_twitter = array( 'https://www.twitter.com/', 'http://www.twitter.com/', 'https://twitter.com/', 'http://twitter.com/', 'www.twitter.com/', 'twitter.com/' );
$clean_twitter = str_replace( $dirty_twitter, '', $clean_twitter );