preg_match Part of a url - regex

I have a link that looks like this http://site.com/numbers_and_letters/This_is_what-I-need_to-retrieve.html
I basically need to retrieve this part: This_is_what-I-need_to-retrieve
I also need to replace the dashes and underscores with spaces, so it would end up looking like this: This is what I need to retrieve
I'm new to regex, so this is what I'm using:
(it works but has poor performance)
function clean($url)
{
$cleaned = preg_replace("/http:\/\/site.com\/.+\//", '', $url);
$cleaned = preg_replace("/[-_]/", ' ', $cleaned);
//remove the html extension
$cleaned = substr($cleaned, 0,-4);
return $cleaned;
}

What you've got isn't that bad. But maybe you can try comparing its performance to this:
preg_match('~[^/]+$~', $url, $match);
$cleaned = preg_replace('~[-_]~', ' ', $match[0]);
EDIT:
If all you have is a hammer, everything looks like a nail.
How about avoiding regex altogether? (I presume each input is a valid URL.)
$cleaned = strtr(substr($url, strrpos($url, '/') + 1, -5), '-_', '  ');
This even removes the .html extension! (I'm making all the same assumptions you already seem to be making, i.e. that all links end in .html.) A brief explanation:
strtr translates each character of one set, e.g. -_, to the character at the same position in another set, e.g. two spaces. The two strings must be the same length, because any extra characters in the longer one are ignored; that is why the replacement is two spaces rather than one. (I imagine it'd be more efficient than invoking the entire regex engine.)
substr, you must know, but note that if the last argument is negative, e.g. -5, it indicates the number of characters from the end to ignore. Handy for this case, and again, probably more efficient than regex.
strrpos, of course, finds the last position of a character in a string, e.g. /.
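For comparison, the same regex-free idea can be sketched in Python. This is only an illustration under the answer's assumptions (every link ends in .html); the function name clean mirrors the question, and the sample URL is the one from the question.

```python
def clean(url: str) -> str:
    """Return the last path segment without .html, with - and _ as spaces."""
    # Everything after the final slash (the analogue of strrpos + substr).
    last_segment = url.rsplit("/", 1)[-1]
    # Drop the extension only if it is really there.
    if last_segment.endswith(".html"):
        last_segment = last_segment[: -len(".html")]
    # Map each of '-' and '_' to a space (the analogue of strtr).
    return last_segment.translate(str.maketrans("-_", "  "))


print(clean("http://site.com/numbers_and_letters/This_is_what-I-need_to-retrieve.html"))
```

Unlike the fixed -5 offset, this version leaves the name intact when the extension is missing.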

Related

How to extract part of a string with slash constraints?

Hello I have some strings named like this:
BURGERDAY / PPA / This is a burger fest
I've tried using regex to get it but I can't seem to get it right.
The output should just get the final string of This is a burger fest (without the first whitespace)
Here, we can capture the desired output after the last slash followed by one or more whitespace characters:
.+\/\s+(.+)
where (.+) collects what we wish to return.
const regex = /.+\/\s+(.+)/gm;
const str = `BURGERDAY / PPA / This is a burger fest`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log(result);
Advice
Based on revo's advice, we can also use this expression, which is much better:
\/ +([^\/]*)$
According to Bohemian's advice, escaping the forward slash may not be required, depending on the language. In JavaScript this applies when the pattern is passed to the RegExp constructor; inside a /.../ regex literal the slash still has to be escaped:
.+/\s+(.+)
Also, we assume the target content does not contain a forward slash; otherwise we can adjust the constraints to fit other possible inputs.
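As a rough sketch, revo's expression can be tried in Python, where the forward slash needs no escaping. The helper name last_part is made up for illustration, and the same assumption holds: the wanted text contains no forward slash.

```python
import re

# revo's pattern: a slash, one or more spaces, then a slash-free tail
# that runs to the end of the string.
LAST_PART = re.compile(r"/ +([^/]*)$")


def last_part(text: str):
    """Return the text after the last 'slash + spaces', or None if absent."""
    match = LAST_PART.search(text)
    return match.group(1) if match else None


print(last_part("BURGERDAY / PPA / This is a burger fest"))
```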
Note: This is a Python answer (my mistake). I'll leave it for its value, as the approach could apply to many languages.
Another approach is to split it and then rejoin it.
data = 'BURGERDAY / PPA / This is a burger fest'
Here it is in four steps:
parts = data.split('/') # break into a list by '/'
parts = parts[2:] # get a new list excluding the first 2 elements
output = '/'.join(parts) # join them back together with a '/'
output = output.strip() # strip spaces from each side of the output
And in one concise line:
output = str.join('/', data.split('/')[2:]).strip()
Note: I feel that str.join(..., ...) is more readable than '...'.join(...) in some contexts. It is the identical call though.

regex Match a capture group's items only once

So I'm trying to split a string in several options, but those options are allowed to occur only once. I've figured out how to make it match all options, but when an option occurs twice or more it matches every single option.
Example string: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again
Regex: /-{1,2}(split1|split2|split3) [\w|\s]+/g
Right now it is matching all cases and I want it to match --split1, --split2 and --split3 only once (so --split1 split1 again will not be matched).
I'm probably missing something really straightforward, but would anyone care to help out? :)
Edit:
I decided to handle the extra occurrences in a script rather than through regex, since that makes error handling easier. Thanks for the help!
EDIT: Somehow I ended up here from the PHP section, hence the PHP code. The same principles apply to any other language, however.
I realise that OP has said they have found a solution, but I am putting this here for future visitors.
function splitter(string $str, int $splits, $split = "--split")
{
$a = array();
for ($i = $splits; $i > 0; $i--) {
if (strpos($str, "$split{$i} ") !== false) {
$a[] = substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} "));
$str = substr($str, 0, strpos($str, "$split{$i} "));
}
}
return array_reverse($a);
}
This function will take the string to be split, as well as how many segments there will be. Use it like so:
$array = splitter($str, 3);
It will split the string around the $split parameter and return the pieces as an array.
The parameters are used as follows:
$str
The string that you want to split. In your instance it is: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again.
$splits
This is how many elements of the array you wish to create. In your instance, there are 3 distinct splits.
If a split is not found, then it will be skipped. For instance, if you were to have --split1 and --split3 but no --split2 then the array will only be split twice.
$split
This is the string that will be used as the delimiter. Note that it must follow the format specified in the question. This means that if you want to split using --myNewSplit, the function will append a number from 1 to $splits to that string.
All elements end with a space since the function looks for $split and you have a space before each split. If you don't want to have the trailing whitespace then you can change the code to this:
$a[] = trim(substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} ")));
Also, notice that strpos looks for a space after the delimiter. Again, if you don't want the space then remove it from the string.
The reason I have used a function is that it will make it flexible for you in the future if you decide that you want to have four splits or change the delimiter.
Obviously, if you no longer want a numerically changing delimiter then the explode function exists for this purpose.
-{1,2}((split1)|(split2)|(split3)) [\w|\s]+
Something like this? This will, in this case, create 3 arrays which all will have an array of elements of the same name in them. Hope this helps
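Handling the repeats in a script, as the asker settled on, could look roughly like this in Python. It is a sketch, not the asker's actual code: the function name first_occurrences is invented, the pattern is the question's regex with the stray | dropped from the character class, and values are assumed not to contain hyphens.

```python
import re

# An option name followed by its value (word characters and whitespace only,
# so the value stops at the next hyphen).
OPTION = re.compile(r"-{1,2}(split1|split2|split3) ([\w\s]+)")


def first_occurrences(text: str) -> dict:
    """Map each option to the value of its first occurrence; ignore repeats."""
    seen = {}
    for match in OPTION.finditer(text):
        name, value = match.group(1), match.group(2).strip()
        if name not in seen:  # later duplicates are skipped
            seen[name] = value
    return seen


example = ("--split1 testsplit 1 --split2 test split 2 "
           "--split3 t e s t split 3 --split1 split1 again")
print(first_occurrences(example))
```

Keeping the duplicate check outside the regex also leaves room to report repeated options as errors instead of silently dropping them.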

Pattern matching in postgres 9.1

I am trying to extract the camera make & model from exifdata.
The exifdata itself is quite long, 4 lines follow:
JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'
I use the following regexs, but they do not match. Are the patterns correct?
make := substring( meta from 'Make\\s+=\\s+(.*)');
model := substring( meta from 'Model\\s+=\\s+(.*)');
substring([str] from [pattern]) doesn't work like you seem to think it does. You can find the details of how it works here: 9.7.2. SIMILAR TO Regular Expressions. That's the regular expression syntax your call uses.
For starters, there's this relevant bit of info:
As with SIMILAR TO, the specified pattern must match the entire data string, or else the function fails and returns null.
Your regular expressions clearly don't match the entire string.
Second is the next sentence:
To indicate the part of the pattern that should be returned on success, the pattern must contain two occurrences of the escape character followed by a double quote (")
This isn't quite the standard regex, so you need to be aware of it.
Rather than trying to get substring([str] from [pattern]) working, I'm going to recommend an alternative: regexp_matches. This function uses standard POSIX regex syntax, and it returns a text[] containing all the captured groups from the match. Here's a quick test to show that it works:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, '(Make)') m;
(I'm using dollar quoting for your example string, in case you're not familiar with that syntax.)
This gives back the array {Make}.
Second, your regex actually doesn't work, as I found out in my testing. You have two problems:
The double slashes are incorrect. You don't need to escape the \ as PostgreSQL doesn't treat it as an escape character by default. You can read about escaping in strings here; the most relevant section is probably 4.1.2.2. String Constants with C-style Escapes. That section describes what you thought was happening by default, but it actually requires an E prefix to enable.
Fixing that improves the result:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, 'Make\s+=\s+(.*)') m;
now gives an array containing this string:
'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'
This brings us to...
The (.*) is matching everything to the end of the string, not the end of the line. You can actually fix this by doing something you probably want to do anyway: get the single quote marks out of the match. You can use this pattern to do that:
$$Make\s+=\s+'([^']+)'$$
I've used dollar quoting again, this time to avoid the ugliness of escaping all those single quote marks. Now the query is:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, $$Make\s+=\s+'([^']+)'$$) m;
which gives you pretty much exactly what you want: an array containing just the string Canon. You'll need to extract the result from the array, of course, but I'll leave that as an exercise for you.
That should be enough info for you to get the second expression working, too.
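The shape of the final pattern (anchor on the field name, then capture only what sits between the single quotes) can be checked outside the database. The following Python sketch is an illustration only: Python's re dialect is not PostgreSQL's, and the helper name exif_field and the leading dot before the field name are assumptions added here.

```python
import re

meta = """JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'"""


def exif_field(text: str, name: str):
    """Return the quoted value of the given field, or None if it is missing."""
    # [^']+ stops at the closing quote, so the match cannot run on
    # into the following lines.
    match = re.search(rf"\.{re.escape(name)}\s+=\s+'([^']+)'", text)
    return match.group(1) if match else None


print(exif_field(meta, "Make"))
print(exif_field(meta, "Model"))
```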
P.S. PostgreSQL's truly fine manual is a blessing.

Why would regex to separate filename from extension not work in ColdFusion?

I'm trying to retrieve a filename without the extension in ColdFusion. I am using the following function:
REMatchNoCase( "(.+?)(\.[^.]*$|$)" , "Doe, John 8.15.2012.docx" );
I would like this to return an array like: ["Doe, John 8.15.2012","docx"]
but instead I always get an array with one element - the entire filename:["Doe, John 8.15.2012.docx"]
I tried the regex string above on rexv.org and it works as expected, but not on ColdFusion. I got the string from this SO question: Regex: Get Filename Without Extension in One Shot?
Does ColdFusion use a different syntax? Or am I doing something wrong?
Thanks.
Why you're not getting expected results...
The reason you are getting a one-item array with the whole filename is because your pattern matches the entire filename, and matches once.
It is capturing the two groups, but rematch returns arrays of matches, not arrays of the captured groups, so you don't see those groups.
How to solve the problem...
If you are dealing with simple files (i.e. no .htaccess or similar), then the simplest solution is to just use...
ListLast( filename , '.' )
...to get only the file extension. To get the name without the extension, you can do...
rematch( '.+(?=\.[^.]+$)' , filename )
This uses a lookahead to ensure there is a . followed by at least one non-. at the end of the string, but (since it's a lookahead) it is excluded from the match (so you only get the pre-extension part in your match).
To deal with non-extensioned files (e.g. .htaccess or README) you can modify the above regex to .+(?=(?:\.[^.]+)?$) which basically does the same thing except making the extension optional. However, there isn't a trivial way to update the ListLast method for these (I guess you'd need to check len(extension) LT len(filename)-1 or similar).
(optional) Accessing captured groups...
If you want to get at the actual captured groups, the closest native way to do this in CF is using the refind function, with the fourth argument set to true - however, this only gives you positions and lengths - requiring that you use mid to extract them yourself.
For this reason (amongst many others), I've created an improved regex implementation for CF, called cfRegex, which lets you return the group text directly (i.e. no messing around with mid).
If you wanted to use cfRegex, you can do so with your original pattern like so:
RegexMatch( '(.+?)(\.[^.]*$|$)' , filename , 1 , 0 , 'groups' )
Or with named arguments:
RegexMatch( pattern='(.+?)(\.[^.]*$|$)' , text=filename , returntype='groups' )
And you get returned an array of matches, within each element being an array of the captured groups for that match.
If you're doing lots of regex work dealing with captured groups, cfRegex is definitely better than doing it with CF's re methods.
If all you care about is getting the extension and/or the filename with extension excluded then the previous examples above are sufficient.
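The difference between the whole match and its captured groups is not specific to ColdFusion. As a loose illustration in Python (not CF, and not cfRegex), the original pattern matches the entire filename once, while the two groups hold the parts the asker was after:

```python
import re

filename = "Doe, John 8.15.2012.docx"
match = re.search(r"(.+?)(\.[^.]*$|$)", filename)

whole = match.group(0)   # the entire matched text, like an element from rematch
groups = match.groups()  # only the captured groups

print(whole)
print(groups)
```

Note that the second group includes the leading dot, because the dot sits inside the parentheses in the pattern.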
@Peter's response is great, however the approach is perhaps a bit longer-winded than necessary. One can do this with reMatch() with a slight tweak to the regex.
<cfscript>
param name="URL.filename";
sRegex = "^.+?(?=(?:\.[^.]+?)?$)";
aMatch = reMatch(sRegex, URL.filename);
writeDump(aMatch);
</cfscript>
This works on the following filename patterns:
foo.bar
foo
.htaccess
John 8.15.2012.docx
Explanation of the regex:
^ From the beginning of the string
.+? One or more (+) characters (.), but the fewest (?) that will work with the rest of the regex. This is the file name.
(?=) Look ahead. Make sure the stuff in here appears in the string, but don't actually match it. This is the key bit to NOT return any file extension that might be present.
(?: Group this stuff together, but don't remember it for a back reference.
\. A literal dot (escaped). This is the separator between file name and file extension.
[^.]+? One or more (+) single ([]) non-dot characters (^.), again matching the fewest possible (?) that will allow the regex as a whole to work.
? (This is the one after the (?:) group). Zero or one of those groups: ie: zero or one file extensions.
$ To the end of the string
I've only tested with those four file name patterns, but it seems to work OK. Other people might be able to fine-tune it.
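The same regex can be exercised against those four file names in Python, whose re module supports the lookahead and non-capturing group used here. This is a sketch for checking the pattern, not ColdFusion code; the helper name base_name is invented.

```python
import re

# Same pattern as above: lazily take characters from the start, stopping
# where an optional ".extension" followed by end-of-string can be seen ahead.
BASE_NAME = re.compile(r"^.+?(?=(?:\.[^.]+?)?$)")


def base_name(filename: str):
    """Return the file name without its extension, or None for empty input."""
    match = BASE_NAME.search(filename)
    return match.group(0) if match else None


for name in ["foo.bar", "foo", ".htaccess", "John 8.15.2012.docx"]:
    print(name, "->", base_name(name))
```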
A few more ways of achieving the same result. They all execute in roughly the same amount of time.
<cfscript>
str = 'Doe, John 8.15.2012.docx';
// sans regex
arr1 = [
reverse( listRest( reverse( str ), '.' ) ),
listLast( str, '.' )
];
// using Java String lastIndexOf()
arr2 = [
str.substring( 0, str.lastIndexOf( '.' ) ),
str.substring( str.lastIndexOf( '.' ) + 1 )
];
// using listToArray with non-filename safe character replace
arr3 = listToArray( str.replaceAll( '\.([^\.]+)$', '|$1' ), '|' );
</cfscript>

In Actionscript, how to match / in infinitive structures like to cross out/off?

I'm using the following regular expression to find the exact occurrences in infinitives. Flag is global.
(?!to )(?<!\w) (' + word_to_search + ') (?!\w)
To give example of what I'm trying to achieve
looking for out should not bring : to outlaw
looking for out could bring : to be out of line
looking for to should not bring : to etc. just because it matches the first to
I've already done these steps, however, to cross out/off should be in the result list too. Is there any way to create an exception without compromising what I have achieved?
Thank you.
I'm still not sure I understand the question. You want to match something that looks like an infinitive verb phrase and contains the whole word word_to_search? Try this:
"\\bto\\s(?:\\w+[\\s/])*" + word_to_search + "\\b"
Remember, when you create a regex in the form of a string literal, you have to escape the backslashes. If you tried to use "\b" to specify a word boundary, it would have been interpreted as a backspace.
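To see how that pattern behaves on the examples from the question, here is a small Python sketch (Python rather than ActionScript, so raw strings are used instead of doubled backslashes). The function name infinitive_pattern is made up, and re.escape is an extra precaution in case the search word contains regex metacharacters.

```python
import re


def infinitive_pattern(word: str):
    """Build the 'to ... word' pattern for a whole-word search term."""
    # "to", then any number of words each followed by whitespace or a slash,
    # then the search word ending on a word boundary.
    return re.compile(r"\bto\s(?:\w+[\s/])*" + re.escape(word) + r"\b")


print(bool(infinitive_pattern("out").search("to outlaw")))
print(bool(infinitive_pattern("out").search("to be out of line")))
print(bool(infinitive_pattern("off").search("to cross out/off")))
print(bool(infinitive_pattern("to").search("to etc.")))
```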
I know about the OR operator, but the question was rather how to organize the structure so it can look ahead and behind. I'm going to explain what I have done so far.
// Backslashes are doubled because the pattern is built from string literals.
var strPattern:String = '(?!to )(?<!\\w) (' + word_to_search + ') (?!\\w)|';
strPattern+='(?!to )(?<!\\w) (' + word_to_search + '/)|';
strPattern+='(?!to )(/' + word_to_search + ')';
var pattern:RegExp = new RegExp(strPattern, "g");
The first line is the same as the one in my question; it searches structures like to bail out for cases where you type out. The second line matches structures like to cross out/off. But we need something else to match to cross out/off if the word is off, so the third line adds that extra condition.