Regex for BBCode with optional parameters - regex

I'm currently stuck on a regex. I'm trying to fetch the contents of a BBCode tag that has optional params and can be written in several notations:
[tag]https://example.com/1[/tag]
[tag='https://example.com/2'][/tag]
[tag="http://another-example.com/whatever"][/tag]
[tag=ftp://an-ftp-host][/tag]
[tag='https://example.com/3',left][/tag]
[tag="https://example.com/4",right][/tag]
[tag=https://example.com/5][/tag]
[tag=https://example.com/i-need-this-one,right]http://example.com/i-dont-need-this-one[/tag]
The 2nd param can only be left or right. If it is given, I need the URL from the first param; otherwise, I need the one between the tags.
A URL given as a param can be wrapped in ' or ", or not quoted at all.
My current regular expression is this:
~\[tag(?|=[\'"]?+([^]"\']++)[\'"]?+]([^[]++)|](([^[]++)))\[/tag]~i
However, this one also includes the 2nd param in the match list, along with a lot of other things that I don't want to match.
Any suggestions?

I've made some changes to do what you want. I've included your version here for easy comparison:
Yours: http://regex101.com/r/dE4aE4/1
\[tag(?:=[\'"]?(.*)[\'"]?)?]([^]]*)?\[/tag]
Mine: http://regex101.com/r/dE4aE4/3
\[tag(?:=[\'"]?([^,]*?)(?:,[^]'"]+)?[\'"]?)?]([^\[]+)?\[/tag]
Note that I changed the first group so it captures the URL without the comma (,): from (.*) to ([^,]*?)(?:,[^]'"]+)?
I've also fixed the content part: from ([^]]*)? to ([^\[]+)?
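As a rough illustration, here is how the same idea might look with Python's re module. This is only a sketch and not the pattern behind the links above: the names TAG_RE and extract_urls are made up for the example, and the pattern is a variation that matches the closing quote immediately after the URL (through a backreference), so a quoted URL followed by ,left or ,right is captured without its quote. It returns the URL from the first param when there is one and falls back to the text between the tags otherwise.

```python
import re

# Hypothetical pattern for illustration; not the regex101 pattern above.
# group 1      : optional opening quote (' or "), may be empty
# group "url"  : URL given as the first parameter
# \1           : the same quote again (or nothing)
# group "body" : text between [tag...] and [/tag]
TAG_RE = re.compile(
    r"""\[tag(?:=(['"]?)(?P<url>[^\]'",]+)\1(?:,(?:left|right))?)?\](?P<body>[^\[]*)\[/tag\]""",
    re.IGNORECASE,
)


def extract_urls(text):
    """Return each tag's parameter URL, or the tag body if no parameter is given."""
    return [m.group("url") or m.group("body") for m in TAG_RE.finditer(text)]


sample = """
[tag]https://example.com/1[/tag]
[tag='https://example.com/3',left][/tag]
[tag=https://example.com/i-need-this-one,right]http://example.com/i-dont-need-this-one[/tag]
"""
print(extract_urls(sample))
```

Using named groups keeps the calling code independent of how many helper groups the pattern needs.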

Related

kimonolabs >Text before comma

I'm trying to scrape a piece of text from a website using Kimonolabs. The text is successfully scraped using the advanced setting:
div > div > ul > li.location > span.value
The text being scraped using this CSS selector is:
Cityname, streetname 1
However, I wish to delete everything from the comma onward so that only this remains:
Cityname
I wish to do this with regex, but I'm totally ignorant about it. What I do know is that it has to consist of 3 blocks when using Kimonolabs: https://help.kimonolabs.com/hc/en-us/articles/203043464-Manually-input-regular-expressions
Can anybody help me set up the correct regex? All I have so far is the following, but it's not the correct markup for Kimonolabs (the dashboard doesn't accept it):
^(.+?),
See the docs you referred to:
The regular expression pattern in kimono is defined in three parts. It's important that any custom regular expression you produce retains the three part notation, with the surrounding ( ) for each part. The first part refers to the pattern to the left of the desired content. The middle part refers to the pattern that the desired content must match and the third part refers to the pattern to the right of the desired content.
So, you seem to need:
/^()([^,]+)()/
Or, /(^)([^,]+)(,)/ (it should be equivalent), and the 2nd capture group (the middle part) should capture the Cityname.
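To check the grouping outside of Kimono, the two patterns can be tried in Python. This is only a sanity check under the assumption that Kimono's engine treats these simple constructs the same way as Python's re module; the helper name middle_part is made up for the example.

```python
import re


def middle_part(pattern, text):
    """Return capture group 2, the 'desired content' part of a three-part pattern."""
    match = re.search(pattern, text)
    return match.group(2) if match else None


text = "Cityname, streetname 1"

# Three-part layout: (left context)(desired content)(right context).
for pattern in (r"^()([^,]+)()", r"(^)([^,]+)(,)"):
    print(pattern, "->", middle_part(pattern, text))
```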

Regex to find arguments in text

There's undoubtedly a better way to do this but this is the way my requirements need me to do this.
I'm creating a search form for my web application. I want to use a tagged based search. So I'm using regex to make it work.
So I have a search string: 'c:john customer:15478'
The regex needs to find the tag (c:) and the argument (john), drop the tag, and give me the argument -- and it needs to do so for all of the instances of a tag and their arguments. The regex I have comes close, but it doesn't work correctly. It doesn't grab every argument, or drop the tags in a consistent way. So the question: what's wrong with my regex that needs to be fixed in order to achieve the correct results?
Currently it finds the first tag, grabs its argument, and everything else after it. I need it to stop the match after it finds an argument. i.e. in the case above it will match john customer:15478
Maybe a better question is how do I make VB's regex return everything between the first colon, and the beginning of the next tag (which is followed by another colon) or otherwise stop matching at the beginning of the next tag?
Regex:
(?<=({0}({1})??:)+?)(\S+\s*\S*)(?=\s+?\b\w+:.+?)??
The {0} and the {1} are placeholders for a String.format call using a string, say Customer (but it could be anything), to define the tag. The {0} is the first character, and the {1} is the rest of the characters. This regex will match anything that comes after the tag, including another tag and its argument if it exists. So for the string
"c:5401 4664 c:john smith p:joam d:domain.com p:1548 c:215-548-5487 d:""192.168.0.1"""
The matches would be
'5401 4664, john smith, 215-548-5487 d:"192.168.0.1"'
'domain.com p:1548, "192.168.0.1"'
'joam d:domain.com, 1548 c:215-548-5487'
given the tags I have defined. The regex fails to stop its matching at the start of the next tag.
If I understood you correctly, this should solve the problem in general:
/\w+:([^:]+)(?:\s|$)/g
https://regex101.com/r/vN6fH1/1
and with defined tag it would look like this:
/{0}({1})?:([^:]+)(?:\s|$)/g
but this still relies on the colon, not on the tag name
(so it won't match at all if the tag name you pass in does not occur in the string)
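The question is about VB, but the behaviour of these two patterns can be sketched in Python, whose re module handles them the same way for this input as far as I can tell. The helper name arguments_for is hypothetical, and the leading \b is an addition that is not in the pattern above; it keeps a one-letter tag from matching at the end of a longer word.

```python
import re

query = 'c:5401 4664 c:john smith p:joam d:domain.com p:1548 c:215-548-5487 d:"192.168.0.1"'

# General form: any tag, then everything up to the whitespace before the next tag.
ALL_TAGS = re.compile(r"\w+:([^:]+)(?:\s|$)")


def arguments_for(tag, text):
    """Collect the arguments of one tag, written as its first letter or in full."""
    first = re.escape(tag[0])
    rest = "(?:{0})?".format(re.escape(tag[1:])) if len(tag) > 1 else ""
    # \b is an assumption added for this sketch, see the note above.
    pattern = r"\b" + first + rest + r":([^:]+)(?:\s|$)"
    return re.findall(pattern, text, re.IGNORECASE)


print(ALL_TAGS.findall(query))
print(arguments_for("customer", query))
```

The argument stops before the next tag because [^:]+ cannot cross a colon, so the engine backtracks to the last whitespace in front of the next tag name.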

regexp subgroup matching for codeigniter

I'm using CodeIgniter and am trying to capture multiple URL segments via the $route array.
For example, my url will look like this:
/segment-1/segment-2/keyword/
I have been trying to use just this regexp:
$route['([\w-?]+){1,3}'] = "my/method";
but that only returns this subgroup match:
segment-1
Then, when I try this route:
(\/[\w-?]+){1,3}
it returns this as the subgroup match:
/keyword
So I have been explicitly writing out every route I want, to make sure I capture all the instances:
$route['([\w-?]+)'] = 'my/method/$1';
$route['([\w-?]+)/([\w-?]+)'] = 'my/method/$1/$2';
$route['([\w-?]+)/([\w-?]+)/([\w-?]+)'] = 'my/method/$1/$2/$3';
which obviously is rather verbose.
Ultimately, I would like to capture all segments in one regexp.
Thoughts?
It is generally impossible to have arbitrarily many capture groups in one regular expression. As you can see from your own attempts, if you repeat one capture group, you will just get the last match. However, you could solve it up to a certain number of path elements by making additional "directories" optional. Say 3 is your maximum:
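$route['([\w-?]+)(/[\w-?]+(/[\w-?]+)?)?'] = 'my/method/$1$2';
Because the third group is nested inside the second one, $2 already contains both optional segments, so $3 must not be appended again. You could add a few more nested (/[\w-?]+)? at the end to allow for deeper paths. Otherwise you can fill your $route array automatically with a loop, but that, too, will only work to a fixed depth.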
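The two effects described here (a repeated group keeping only its last iteration, and nested optional groups each getting their own number) are not specific to PHP. A small Python sketch shows them; the path is the example from the question, and the character class is written [\w?-] because that is the unambiguous spelling in Python.

```python
import re

path = "segment-1/segment-2/keyword"

# A repeated capture group only remembers its last iteration.
repeated = re.fullmatch(r"(/[\w?-]+){1,3}", "/" + path)
print(repeated.group(1))

# Nested optional groups: group 2 spans everything after the first segment,
# group 3 is the innermost (last) optional segment.
nested = re.fullmatch(r"([\w?-]+)(/[\w?-]+(/[\w?-]+)?)?", path)
print(nested.groups())
```

Note that group 2 includes the text of group 3, which matters when the groups are concatenated in a replacement string.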

Regex to match anything after /

I basically don't have a clue about regex, but I need a regex statement that will recognise anything after the / in a URL.
Basically, I'm developing a site for someone and a page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site) which have the URL (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.
To do what you ask, you need to use groups. In regular expressions, groups allow you to isolate parts of the whole match.
For example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it will be group 1, since it is the first one in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
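The same example can be reproduced in a few lines of Python, shown here purely as an illustration of group numbering:

```python
import re

match = re.match(r"a*(b*)", "aaaaaaaabbbbcccc")
print("group 0:", match.group(0))  # the complete match
print("group 1:", match.group(1))  # the first parenthesised group
```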
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
The more precise you are with the pattern before the group, the less likely you are to get a false positive.
If what you really want is to match anything after the last / you can take another approach:
possible other solution: /([^/]*)
The pattern above will find a / followed by a string of characters that are NOT another /, and keep that string in group 1. The issue here is that you could match things that do not have sweets in the URL.
Note if you do not mind the / at the beginning then just remove the ( and ) and you do not have to worry about groups.
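Both patterns can be tried against the example URL from the question. This is a Python sketch; the trailing $ on the second pattern is an addition, not part of the pattern above, and it is what pins the match to the last / when the pattern is used with a search function.

```python
import re

url = "http://localhost/sweettemptations/sweets/sweet-name"

# Everything after "/sweets/".
after_sweets = re.search(r"/sweets/(.*)", url)
# Everything after the last slash (the $ anchor is an assumption of this sketch).
last_segment = re.search(r"/([^/]*)$", url)

print(after_sweets.group(1))
print(last_segment.group(1))
```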
I like to use http://regexpal.com/ to test my regex. It marks the different matches in different colors.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without appending the part matched by your * at the end), I would use a regular expression to match the pattern in the URL, but then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
When doing this, I would avoid hard-coding the "localhost" part in the pattern. I am also assuming that (http://) really means an optional http:// in front, since (http://) is not a valid protocol prefix.
So if that is what you want, then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression matches an optional http:// followed by any host (be it localhost, an IP address or a host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
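A minimal sketch of that "match, then replace the whole URL" approach in Python. The names SWEETS_URL, TARGET and redirect are invented for the example, and in a real WordPress site the redirect would be configured elsewhere; this only shows the matching logic.

```python
import re

# The pattern from above: optional http://, any host, then the sweets path.
SWEETS_URL = re.compile(r"(http://)?[^/]+/sweettemptations/sweets/.*")
TARGET = "http://localhost/sweettemptations/available-sweets"


def redirect(url):
    """Return the fixed target for any URL under /sweets/, else the URL unchanged."""
    return TARGET if SWEETS_URL.fullmatch(url) else url


print(redirect("http://localhost/sweettemptations/sweets/somethingmore.html"))
print(redirect("http://localhost/sweettemptations/available-sweets"))
```

Since the whole URL is replaced, no capture group is read; the parentheses around http:// only serve to make the prefix optional.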
Use this regular expression: (?<=://).+

Find the first occurrence of the character ~ in the middle of the URL

I need to retrieve a certain portion of a URL using regex. The URL looks like this:
/xxxx/bbbb/good/city/games_in_the_city.~cccccc~dddddd~eeeee.html
I need to retrieve games_in_the_city. I have already removed the first portion, up to the last /. Now I need to find the first occurrence of ~ in the string so that the rest can be removed as well.
The regex that I have right now, (.*\/good\/city\/)(.*)(\.html), gets games_in_the_city.~cccccc~dddddd~eeeee
How can I modify my regex so that ~cccccc~dddddd~eeeee is removed as well? The final output should be games_in_the_city
I do not know in advance how many ~ (tilde) characters will appear in the URL; there can be anywhere from one to n.
You can use the regex:
(?:.*\/good\/city\/)(.*?)\.?(?:~[^~]+)*(?:\.html)
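For illustration, this is how the pattern behaves in Python. The helper name slug is made up for the example, and the slashes are left unescaped because Python does not use / as a regex delimiter. The lazy (.*?) stops as soon as the rest of the pattern (an optional dot, any number of ~ parts, then .html) can match, which is what cuts the result off at the first tilde.

```python
import re

# Same pattern as above, with the slashes left unescaped.
SLUG = re.compile(r"(?:.*/good/city/)(.*?)\.?(?:~[^~]+)*(?:\.html)")


def slug(url):
    """Return the part of the file name before the first ~ (hypothetical helper)."""
    match = SLUG.search(url)
    return match.group(1) if match else None


print(slug("/xxxx/bbbb/good/city/games_in_the_city.~cccccc~dddddd~eeeee.html"))
```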