Purpose of {1} in this regular expression to match url protocols - regex

I was reading this question about how to parse URLs out of web pages and had a question about the accepted answer which offered this solution:
((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)
The solution was offered by csmba and he credited it to regexlib.com. Whew. Credits done.
I think this is a fairly naive regular expression but it's a fine starting point for building something better. But, my question is this:
What is the point of {1}? It means "exactly one of the previous grouping", right? Isn't that the default behavior of a grouping in a regular expression? Would the expression be changed in any way if the {1} were removed?
If I saw this from a coworker I would point out his or her error but as I write this the response is rated at a 6 and the expression on regexlib.com is rated a 4 of 5. So maybe I'm missing something?

#Rob: I disagree. To enforce what you are asking for I think you would need to use negative-look-behind, which is possible but is certainly not related to use {1}. Neither version of the regexp address that particular issue.
To let the code speak:
tibook 0 /home/jj33/swap > cat text
Text this is http://example.com text this is
Text this is http://http://example.com text this is
tibook 0 /home/jj33/swap > cat p
#!/usr/bin/perl
my $re1 = '((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)';
my $re2 = '((mailto\:|(news|(ht|f)tp(s?))\://)\S+)';
while (<>) {
print "Evaluating: $_";
print "re1 saw \$1 = $1\n" if (/$re1/);
print "re2 saw \$1 = $1\n" if (/$re2/);
}
tibook 0 /home/jj33/swap > cat text | perl p
Evaluating: Text this is http://example.com text this is
re1 saw $1 = http://example.com
re2 saw $1 = http://example.com
Evaluating: Text this is http://http://example.com text this is
re1 saw $1 = http://http://example.com
re2 saw $1 = http://http://example.com
tibook 0 /home/jj33/swap >
So, if there is a difference between the two versions, it's doesn't seem to be the one you suggest.

I don't think the {1} has any valid function in that regex.
(**mailto:|(news|(ht|f)tp(s?))://){1}**
You should read this as: "capture the stuff in the parens exactly one time". But we don't really care about capturing this for use later, eg $1 in the replacement. So it's pointless.

#Jeff Atwood, your interpretation is a little off - the {1} means match exactly once, but has no effect on the "capturing" - the capturing occurs because of the parens - the braces only specify the number of times the pattern must match the source - once, as you say.
I agree with #Marius, even if his answer is a little terse and may come off as being flippant. Regular expressions are tough, if one's not used to using them, and the {1} in the question isn't quite error - in systems that support it, it does mean "exactly one match". In this sense, it doesn't really do anything.
Unfortunately, contrary to a now-deleted post, it doesn't keep the regexp from matching http://http://example.org, since the \S+ at the end will match one or more non-whitespace characters, including the http://example.org in http://http://example.org (verified using Python 2.5, just in case my regexp reading was off). So, the regexp given isn't really the best. I'm not a URL expert, but probably something limiting the appearance of ":"s and "//"s after the first one would be necessary (but hardly sufficient) to ensure good URLs.

I don't think it has any purpose. But because RegEx is almost impossible to understand/decompose, people rarely point out errors. That is probably why no one else pointed it out.

Related

Raku Regex to capture and modify the LFM code blocks

Update: Corrected code added below
I have a Leanpub flavored markdown* file named sample.md I'd like to convert its code blocks into Github flavored markdown style using Raku Regex
Here's a sample **ruby** code, which
prints the elements of an array:
{:lang="ruby"}
['Ian','Rich','Jon'].each {|x| puts x}
Here's a sample **shell** code, which
removes the ending commas and
finds all folders in the current path:
{:lang="shell"}
sed s/,$//g
find . -type d
In order to capture the lang value, e.g. ruby from the {:lang="ruby"} and convert it into
```ruby
I use this code
my #in="sample.md".IO.lines;
my #out;
for #in.kv -> $key,$val {
if $val.starts-with("\{:lang") {
if $val ~~ /^{:lang="([a-z]+)"}$/ { # capture lang
#out[$key]="```$0"; # convert it into ```ruby
$key++;
while #in[$key].starts-with(" ") {
#out[$key]=#in[$key].trim-leading;
$key++;
}
#out[$key]="```";
}
}
#out[$key]=$val;
}
The line containing the Regex gives
Cannot modify an immutable Pair (lang => True) error.
I've just started out using Regexes. Instead of ([a-z]+) I've tried (\w) and it gave the Unrecognized backslash sequence: '\w' error, among other things.
How to correctly capture and modify the lang value using Regex?
the LFM format just estimated
Corrected code:
my #in="sample.md".IO.lines;
my \len=#in.elems;
my #out;
my $k = 0;
while ($k < len) {
if #in[$k] ~~ / ^ '{:lang="' (\w+) '"}' $ / {
push #out, "```$0";
$k++;
while #in[$k].starts-with(" ") {
push #out, #in[$k].trim-leading;
$k++; }
push #out, "```";
}
push #out, #in[$k];
$k++;
}
for #out {print "$_\n"}
TL;DR
TL? Then read #jjemerelo's excellent answer which not only provides a one-line solution but much more in a compact form ;
DR? Aw, imo you're missing some good stuff in this answer that JJ (reasonably!) ignores. Though, again, JJ's is the bomb. Go read it first. :)
Using a Perl regex
There are many dialects of regex. The regex pattern you've used is a Perl regex but you haven't told Raku that. So it's interpreting your regex as a Raku regex, not a Perl regex. It's like feeding Python code to perl. So the error message is useless.
One option is to switch to Perl regex handling. To do that, this code:
/^{:lang="([a-z]+)"}$/
needs m :P5 at the start:
m :P5 /^{:lang="([a-z]+)"}$/
The m is implicit when you use /.../ in a context where it is presumed you mean to immediately match, but because the :P5 "adverb" is being added to modify how Raku interprets the pattern in the regex, one has to also add the m.
:P5 only supports a limited set of Perl's regex patterns. That said, it should be enough for the regex you've written in your question.
Using a Raku regex
If you want to use a Raku regex you have to learn the Raku regex language.
The "spirit" of the Raku regex language is the same as Perl's, and some of the absolute basic syntax is the same as Perl's, but it's different enough that you should view it as yet another dialect of regex, just one that's generally "powered up" relative to Perl's regexes.
To rewrite the regex in Raku format I think it would be:
/ ^ '{:lang="' (<[a..z]>+) '"}' $ /
(Taking advantage of the fact whitespace in Raku regexes is ignored.)
Other problems in your code
After fixing the regex, one encounters other problems in your code.
The first problem I encountered is that $key is read-only, so $key++ fails. One option is to make it writable, by writing -> $key is copy ..., which makes $key a read-write copy of the index passed by the .kv.
But fixing that leads to another problem. And the code is so complex I've concluded I'd best not chase things further. I've addressed your immediate obstacle and hope that helps.
This one-liner seems to solve the problem:
say S:g /\{\: "lang" \= \" (\w+) \" \} /```$0/ given "text.md".IO.slurp;
Let's try and explain what was going on, however. The error was a regular expression grammar error, caused by having a : being followed by a name, and all that inside a curly. {} runs code inside a regex. Raiph's answer is (obviously) correct, by changing it to a Perl regular expression. But what I've done here is to change it to a Raku's non-destructive substitution, with the :g global flag, to make it act on the whole file (slurped at the end of the line; I've saved it to a file called text.md). So what this does is to slurp your target file, with given it's saved in the $_ topic variable, and printed once the substitution has been made. Good thing is if you want to make more substitutions you can shove another such expression to the front, and it will act on the output.
Using this kind of expression is always going to be conceptually simpler, and possibly faster, than dealing with a text line by line.

How to grab certain number of characters before and after a perl regex match?

I am crafting regexes that match certain terms best within html code. I'm doing this in an iterative process to whittle down matches to exclude things I don't want. So I craft a regex, run it, and spit out data that I then look through to see how well my match is working. For example, if I am looking for the term "tema" (the name of a trade association that provides standards) I might notice that it also matches "sitemap" and alter my regex in some way to exclude the unwanted items.
To make this easier, I want to print out my match along with some context, say 20 charcters before and after the match, rather than the entire line, to make it easier to scan through the results. This is proving frustratingly hard to accomplish in a simple fashion.
For example, I would think this would work:
$line =~ /(.{,20}tema.{,20})/i;
That is, I want to match up to 20 of anything before and after my keyword and include it in the "context" I print out for scanning.
But it doesn't. Am I missing something here? If a{,20} will match up to 20 'a' characters, why won't .{,20} match 20 of anything that '.' will match?
Scratching my head.
Syntax:
atom{n} (exactly n)
atom{n,} (n or more)
atom{n,m} (n or more, but no more than m)
So,
say $1 if $line =~ /(.{0,20}tema.{0,20})/i;
Or if you're using /g and might get overlapping matches:
say "$1$2$3" while $line =~ /(.{0,20})\K(tema)(?=(.{0,20}))/ig;
(a{,20} doesn't "match up to 20 a characters.")
How about searching with m/^(.*)tema(.*)$/ then use substr or similar to get the last characters of $1 and the first from $2.

Regex for matching last two parts of a URL

I am trying to figure out the best regex to simply match only the last two strings in a url.
For instance with www.stackoverflow.com I just want to match stackoverflow.com
The issue i have is some strings can have a large number of periods for instance
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLS I am working with does not have any of the path information so one can assume the last part of the string is always .org or .com or something of that nature.
What regular expresion will return stackoverflow.com when run against www.stackoverflow.com and will return yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
under the condtions above?
You don't have to use regex, instead you can use a simple explode function.
So you're looking to split your URL at the periods, so something like
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);
And then you need to get the last two elements, so you can echo them out from the array created.
//this will return the second to last element, yimg
echo $url_split[count($url_split)-2];
//this will echo the period
echo ".";
//this will return the last element, com
echo $url_split[count($url_split)-1];
So in the end you'll get yimg.com as the final output.
Hope this helps.
I don't know what did you try so far, but I can offer the following solution:
/.*?([\w]+\.[\w]+)$/
There are a couple of tricks here:
Use $ to match till the end of the string. This way you'll be sure your regex engine won't catch the match from the very beginning.
Use grouping inside (...). In fact it means the following: match word that contains at least one letter then there should be a dot (backslashed because dot has a special meaning in regex and we want it 'as is' and then again series of letters with at least one of letters).
Use reluctant search in the beginning of the pattern, because otherwise it will match everything in a greedy manner, for example, if your text is :
abc.def.gh
the greedy match will give f.gh in your group, and its not what you want.
I assumed that you can have only letters in your host (\w matches the word, maybe in your example you will need something more complicated).
I post here a working groovy example, you didn't specify the language you use but the engine should be similar.
def s = "abc.def.gh"
def m = s =~/.*?([\w]+\.[\w]+)$/
println m[0][1] // outputs the first (and the only you have) group in groovy
Hope this helps
if you needed a solution in a Perl Regular Expression compatible way that will work in a number of languages, you can use something like that - the example is in PHP
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
preg_match('|[a-zA-Z-0-9]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);
This regex guarantees you to fetch the last part of the url + domain name. For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces
yimg.com
as an output, and with www.stackoverflow.com (with or without preceding triple w) it gives you
stackoverflow.com
as a result
A shorter version
/(\.[^\.]+){2}$/

Regex which ignores comments

being a regex beginner, I need some help writing a regex. It should match a particular pattern, lets say "ABC". But the pattern shouldn't be matched when it is used in comment (' being the comment sign). So XYZ ' ABC
shouldn't match. x("teststring ABC") also shouldn't match. But ABC("teststring ' xxx") has to match to end, that is xxx not being cut off.
Also does anybody know a free Regex application that you can use to "debug" your regex? I often have problems recognizing whats wrong with my tries. Thanks!
Some will swear by RegexBuddy. I've never used the debugger, but I advise you to steer away from the regex generator it provides. It's just a bad idea.
You may be able to pull this off with whatever regex flavor you're using, but in general I think you're going to find it easier and more maintainable to do this the "hard" way. Regular expressions are for regular languages, and nested anything usually means that regexes aren't a good idea. Modern extensions to regex syntax means it may be doable, but it's not going to be pretty, and you sure won't remember what happened in the morning. And one place where regular expressions fail quite spectacularly (even with modern non-regular extensions) is parsing nested structures - trying to parse any mixture comments, quoted strings, and parenthesis quickly devolves into an incomprehensible and unmaintainable mess. Don't get me wrong - I'm a fan of regular expressions in the right places. This isn't one of them.
On the topic of good regex tools, I really like RegexBuddy, but it's not free.
Other than that, a regex is the wrong tool for the job if you need to check inside string delimiters and all sorts too. You need a finite-state machine.
Odd that lots of people recommend their favorite tools, but nobody provides a solution for the problem at hand. (I'm the developer of RegexBuddy, so I'll refrain from recommending any tools.)
There's no good way of matching Y unless it's part of XYZ with a single regular expression. What you can do is write a regex that matches both Y and XYZ: Y|XYZ. Then use a bit of extra code to process the matches for Y, and ignore those for XYZ. One way to do that is with a capturing group: (Y)|XYZ. Now you can process the matches of the first capturing group. When XYZ matches, the capturing group doesn't match anything.
To do this for your VB-style comments, you can use the regex:
'.*|(ABC)
This regex matches a single quote and everything up to the end of the line, or ABC. This regex will match all comments (whether those include ABC or not). The capturing group will match all occurrences of ABC, except those in comments.
If you want your regex to both skip comments and strings, you can add strings to your regex:
'.*|"[^"\r\n]*"|(ABC)
I find the best 'debugger' for regexes is just messing around in an interactive environment trying lots of small bits out. For Python, ipython is great; for Ruby, irb, for command-line type stuff, sed...
Just try out little pieces at a time, make sure you understand them, then add an extra little bit. Rinse and repeat.
For NET development you might as well try RegexDesigner, this tool can generate code(VB/C#) for you. It is a very good tool for us Regex starters.
link text
Here is my solution to this problem:
1. Find a store all your comments in hash
2. Do your regexp replacement
3. Bring comments back to file
Save your time :-)
string fileTextWithComments = "Some tetx file contents";
Dictionary<string, string> comments = new Dictionary<string, string>();
// 1. Find a store all your comments in hash
Regex rc = new Regex("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)");
MatchCollection matches = rc.Matches(fileTextWithComments);
int index = 0;
foreach (Match match in matches)
{
string key = string.Format("/*Comment#{0}*/", index++);
comments.Add(key, match.Value);
fileTextWithComments = fileTextWithComments.Replace(match.Value, key);
}
// 2. Do your regexp replacement
Regex r = new Regex("YOUR REGEXP PATTERN");
fileTextWithComments = r.Replace(fileTextWithComments, "NEW STRING");
// 3. Bring comments back to file :-)
foreach (string key in comments.Keys)
{
string comment = comments[key];
fileTextWithComments = fileTextWithComments.Replace(key, comment);
}
Could you clarify? I read it thrice, and I think you want to match a given pattern when it appears as a literal. As in not as part of a comment or a string.
What your asking for is pretty tricky to do as a single regexp. Because you want to skip strings. Multiple strings in one line would complicate matters.
I wouldn't even try to do it in one regexp. Instead, I'd pass each line through a filter first, to remove strings, and then comments in that order. And then try and match your pattern.
In Perl because of it's regexp processing power. Assuming #lines is a list of lines you want to match, and $pattern is the pattern you want to match.
#matches =[];
for (#lines){
$line = $_;
$line ~= s/"[^"]*?(?<!\)"//g;
$line ~= s/'.*//g;
push #matches, $_ if $line ~= m/$pattern/;
}
The first substitution finds any pattern that starts with a double quotation mark and ends with the first unescaped double quote. Using the standard escape character of a backspace.
The next strips comments. If the pattern still matches, it adds that line to the list of matches.
It's not perfect because it can't tell the difference between "a\\" and "a\" The first is usually a valid string, the later is not. Either way these substitutions will continue to look for another ", if one isn't found the string isn't thrown out. We could use another substitution to replace all double backslashes with something else. But this will cause problems if the pattern you're looking for contains a backslash.
You can use a zero width look-behind assertion if you only have single line comments, but if you're using multi-line comments, it gets a little trickier.
Ultimately, you really need to solve this kind of issue with some sort of parser, given that the definition of a comment is really driven by a grammar.
This answer to a different but related question looks good too...
If you have Emacs, there is a built-in regex tool called "regexp-builder". I don't really understand the specifics of your regex question well enough to suggest an answer to that.
RegEx1: (-user ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -user "test user"
Regex2: (-daterange ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -daterange "1/4/13 1/20/13"
RegEx3: (-date )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -date 1/4/13 -
RegEx4: (-day )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -day monday -
Search for the quoted value first if not found, search for the no quotes parameter. This expects only one occurrence of the parameter. It also expects the command to either; use quotes to encapsulate a string with no quotes inside, or; use any character other than a quote in the first position, have no occurrence of ' -' until the next parameter, and have a trailing ' -' (add it onto the string before the regex).

Regex matching in ColdFusion OR condition

I am attempting to write a CF component that will parse wikiCreole text. I am having trouble getting the correct matches with some of my regular expression though. I feel like if I can just get my head around the first one the rest will just click. Here is an example:
The following is sample input:
You can make things **bold** or //italic// or **//both//** or //**both**//.
Character formatting extends across line breaks: **bold,
this is still bold. This line deliberately does not end in star-star.
Not bold. Character formatting does not cross paragraph boundaries.
My first attempt was:
<cfset out = REreplace(out, "\*\*(.*?)\*\*", "<strong>\1</strong>", "all") />
Then I realized that it would not match where the ** is not given, and it should end where there are two carriage returns.
So I tried this:
<cfset out = REreplace(out, "\*\*(.*?)[(\*\*)|(\r\n\r\n)]", "<strong>\1</strong>", "all") />
and it is close but for some reason it gives you this:
You can make things <strong>bold</strong>* or //italic// or <strong>//both//</strong>* or //<strong>both</strong>*//.
Character formatting extends across line breaks: <strong>bold,</strong>
this is still bold. This line deliberately does not end in star-star.
Not bold. Character formatting does not cross paragraph boundaries.
Any ideas?
PS: If anyone has any suggestions for better tags, or a better title for this post I am all ears.
The [...] represents a character class, so this:
[(\*\*)|(\r\n\r\n)]
Is effectively the same as this:
[*|\r\n]
i.e. it matches a single "*" and the "|" isn't an alternation.
Another problem is that you replace the double linefeed. Even if your match succeeded you would end up merging paragraphs. You need to either restore it or not consume it in the first place. I'd use a positive lookahead to do the latter.
In Perl I'd write it this way:
$string =~ s/\*\*(.*?)(?:\*\*|(?=\n\n))/<strong>$1<\/strong>/sg;
Taking a wild guess, the ColdFusion probably looks like this:
REreplace(out, "\*\*(.*?)(?:\*\*|(?=\r\n\r\n))", "<strong>\1</strong>", "all")
You really should change your
(.*?)
to something like
[^*]*?
to match any character except the *. I don't know if that is the problem, but it could be the any-character . is eating one of your stars. It also a generally accepted "best practice" when trying to balance matching characters like the double star or html start/end tags to explicitly exclude them from your match set for the inner text.
*Disclaimer, I didn't test this in ColdFusion for the nuances of the regex engine - but the idea should hold true.
I know this is an older question but in response to where Ryan Guill said "I tried the $1 but it put a literal $1 in there instead of the match" for ColdFusion you should use \1 instead of $1
I always use a regex web-page. It seems like I start from scratch every time I used regex.
Try using '$1' instead of \1 for this one - the replace is slightly different... but I think the pattern is what you need to get working.
Getting closer with this:
**(.?)**|//(.?)//
The tricky part is the //** or **//
Ok, first checking for //bold//
then //bold// then bold, then
//bold//
**//(.?)//**|//**(.?)**//|**(.?)**|//(.?)//
I find this app immensely helpful when I'm doing anything with regex:
http://www.gskinner.com/RegExr/desktop/
Still doesn't help with your actual issue, but could be useful going forward.