regex for n characters or at least m characters - regex

This should be a pretty simple regex question, but I couldn't find any answers anywhere. How would one make a regex that matches either EXACTLY 2 characters or at least 4 characters? Here is my current method of doing it (ignore the regex itself, that's beside the point):
[A-Za-z0_9_]{2}|[A-Za-z0_9_]{4,}
However, this method takes twice the time (and is approximately 0.3s slower for me on a 400 line file), so I was wondering if there was a better way to do it?

Optimize the beginning, and anchor it.
^[A-Za-z0-9_]{2}(?:|[A-Za-z0-9_]{2,})$
(Also, you did say to ignore the regex itself, but I guessed you probably wanted 0-9, not 0_9)
EDIT Hm, I was sure I read that you want to match lines. Remove the anchors (^$) if you want to match inside the line as well. If you do match full lines only, anchors will speed you up (well, the front anchor ^ will, at least).

Your solution looks pretty good. As an alternative, you can try something like this:
[A-Za-z0-9_]{2}(?:[A-Za-z0-9_]{2,})?
Btw, I think you want hyphen instead of underscore between 0 and 9, don't you?

The solution you present is correct.
If you're trying to optimize the routine, and the number of strings of 2 or more characters is much smaller than the number of strings that are not, consider accepting all strings of length 2 or greater, then tossing those of length 3. This may boost performance by checking the regex only once, and the second check need not even be a regular expression; checking a string length is usually an extremely fast operation.
As always, you really need to run tests on real-world data to verify if this would give you a speed increase.
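A minimal Python sketch of the filter-by-length idea above, assuming whole-run semantics are acceptable. The names `WORD` and `find_words` are made up for illustration. Note that the behavior differs slightly from the unanchored alternation in the question: a three-character run such as `abc` is dropped entirely instead of yielding `ab`.

```python
import re

# One pass with a single quantifier: any run of 2 or more word characters.
WORD = re.compile(r"[A-Za-z0-9_]{2,}")


def find_words(line):
    """Return runs of length 2 or length >= 4; runs of length 3 are discarded."""
    # The second check is a plain length comparison, not another regex.
    return [w for w in WORD.findall(line) if len(w) != 3]


if __name__ == "__main__":
    print(find_words("ab abc bcad a as asdfa"))
```

Whether this is actually faster than the alternation depends on the data, so it is worth timing both on a real file.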

So basically you want to match words of length either 2 or 2+2+N, N>=0:
([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9])*)
Working example:
#!/usr/bin/perl
while (<STDIN>)
{
    chomp;
    my @matches = ($_ =~ /([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9])*)/g);
    for my $m (@matches) {
        print "match: $m\n";
    }
}
input file:
cat in.txt
ab abc bcad a as asdfa
aboioioi i i abc bcad a as asdfa
output:
perl t.pl <in.txt
match: ab
match: ab
match: bcad
match: as
match: asdf
match: aboioioi
match: ab
match: bcad
match: as
match: asdf

Related

Difference between ? and * in regular expressions - match same input?

I am not able to understand the practical difference between ? and * in regular expressions. I know that ? means to check if previous character/group is present 0 or 1 times and * means to check if the previous character/group is present 0 or more times.
But this code
while(<>) {
    chomp($_);
    if(/hello?/) {
        print "metch $_ \n";
    }
    else {
        print "naot metch $_ \n";
    }
}
gives the same output for both hello? and hello*. The external file that is given to this Perl program contains
hello
helloooo
hell
And the output is
metch hello
metch helloooo
metch hell
for both hello? and hello*. I am not able to understand the exact difference between ? and *
In Perl (and unlike Java), the m// match operator is not anchored by default.
As such, all of the input is trivially matched by both /hello?/ and /hello*/. That is, these will match any string that contains "hell" (as both quantifiers make the "o" optional) anywhere.
Compare with /^hello?$/ and /^hello*$/, respectively. Since these employ anchors the former will not match "helloo" (as at most one "o" is allowed) while the latter will.
Under Regexp Quote-like Operators:
m/PATTERN/ searches [anywhere in] a string for a pattern match, and in scalar context returns true if it succeeds, false if it fails.
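The anchoring point can be checked outside Perl too. The sketch below uses Python's `re.search`, which, like Perl's `m//`, looks for the pattern anywhere in the string; the helper name `matching` is made up for illustration, and the three input lines are the ones from the question.

```python
import re

lines = ["hello", "helloooo", "hell"]


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [s for s in candidates if re.search(pattern, s)]


# Unanchored: both patterns only need to find "hell" somewhere.
unanchored_q = matching(r"hello?", lines)
unanchored_star = matching(r"hello*", lines)

# Anchored: the quantifier now decides how many trailing "o"s are allowed.
anchored_q = matching(r"^hello?$", lines)
anchored_star = matching(r"^hello*$", lines)

if __name__ == "__main__":
    print(unanchored_q, unanchored_star)
    print(anchored_q, anchored_star)
```

Only the anchored `?` version rejects `helloooo`; the other three accept every line.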
What is confusing you is that, without anchors like ^ and $ a regex pattern match checks only whether the pattern appears anywhere in the target string.
If you add something to the pattern after the hello, like
if (/hello?, Ashwin/) { ... }
Then the strings
hello, Ashwin
and
hell, Ashwin
will match, but
helloooo, Ashwin
will not, because there are too many o characters between hell and the comma ,.
However, if you use a star * instead, like
if (/hello*, Ashwin/) { ... }
then all three strings will match.
? means the last item is optional. * means it is optional and can also be repeated any number of times.
i.e.
hello? matches hell, hello
hello* matches hell, hello, helloo, hellooo, ....
But not using either ^ or $ means these matches can occur anywhere in the string.
Here's an example I came up with that makes it quite clear:
What if you wanted to only match up to tens of people and your data was like below:
2 people. 20 people. 200 people. 2000 people.
Only ? would be useful in that case, whereas * would incorrectly capture larger numbers.
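One way to read that example, sketched in Python. This assumes "up to tens" means at most two digits; the leading `\b` and the exact patterns are my own additions, not part of the answer above.

```python
import re

text = "2 people. 20 people. 200 people. 2000 people."

# One digit plus an optional second digit: at most two digits in total.
up_to_two_digits = re.findall(r"\b\d\d? people", text)

# One digit plus any number of further digits: no upper limit.
any_digits = re.findall(r"\b\d\d* people", text)

if __name__ == "__main__":
    print(up_to_two_digits)
    print(any_digits)
```

The `\b` matters here: without it, `\d\d? people` would still find `00 people` inside `200 people`, which is the same unanchored-match effect discussed earlier in this thread.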

Negative lookahead assertion with the * modifier in Perl

I have the (what I believe to be) negative lookahead assertion <#> *(?!QQQ) that I expect to match if the tested string is a <#> followed by any number of spaces (including zero) and then not followed by QQQ.
Yet, if the tested string is <#> QQQ the regular expression matches.
I fail to see why this is the case and would appreciate any help on this matter.
Here's a test script
use warnings;
use strict;
my @strings = ('something <#> QQQ',
               'something <#> RRR',
               'something <#>QQQ' ,
               'something <#>RRR' );
print "$_\n" for map {$_ . " --> " . rep($_) } (@strings);
sub rep {
    my $string = shift;
    $string =~ s,<#> *(?!QQQ),at w/o ,;
    $string =~ s,<#> *QQQ,at w/ QQQ,;
    return $string;
}
This prints
something <#> QQQ --> something at w/o QQQ
something <#> RRR --> something at w/o RRR
something <#>QQQ --> something at w/ QQQ
something <#>RRR --> something at w/o RRR
And I'd have expected the first line to be something <#> QQQ --> something at w/ QQQ.
It matches because zero is included in "any number". So no spaces, followed by a space, matches "any number of spaces not followed by a Q".
You should add another lookahead assertion that the first thing after your spaces is not itself a space. Try this (untested):
<#> *(?!QQQ)(?! )
ETA Side note: changing the quantifier to + would have helped only when there's exactly one space; in the general case, the regex can always grab one less space and therefore succeed. Regexes want to match, and will bend over backwards to do so in any way possible. All other considerations (leftmost, longest, etc) take a back seat - if it can match more than one way, they determine which way is chosen. But matching always wins over not matching.
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
One problem of yours here is that you are viewing the two regexes separately. You first ask to replace the string without QQQ, and then to replace the string with QQQ. This is actually checking the same thing twice, in a sense. For example: if (X==0) { ... } elsif (X!=0) { ... }. In other words, the code may be better written:
unless ($string =~ s,<#> *QQQ,at w/ QQQ,) {
    $string =~ s,<#> *,at w/o ,;
}
You always have to be careful with the * quantifier. Since it matches zero or more times, it can also match the empty string, which basically means: it can match any place in any string.
A negative look-around assertion has a similar quality, in the sense that it needs to only find a single thing that differs in order to match. In this case, it matches the part "<#> " as <#> + no space + space, where space is of course "not" QQQ. You are more or less at a logical impasse here, because the * quantifier and the negative look-ahead counter each other.
I believe the correct way to solve this is to separate the regexes, like I showed above. There is no sense in allowing the possibility of both regexes being executed.
However, for theoretical purposes, a working regex that allows both any number of spaces, and a negative look-ahead would need to be anchored. Much like Mark Reed has shown. This one might be the simplest.
<#>(?! *QQQ) # Add the spaces to the look-ahead
The difference is that now the spaces and Qs are anchored to each other, whereas before they could match separately. To drive home the point of the * quantifier, and also solve a minor problem of removing additional spaces, you can use:
<#> *(?! *QQQ)
This will work because either of the quantifiers can match the empty string. Theoretically, you can add as many of these as you want, and it will make no difference (except in performance): / * * * * * * */ is functionally equivalent to / */. The difference here is that spaces combined with Qs may not exist.
The regex engine will backtrack until it finds a match, or until finding a match is impossible. In this case, it found the following match:
                        +--------------- Matches "<#>".
                        |   +----------- Matches "" (empty string).
                        |   |    +--- Doesn't match " QQQ".
                        |   |    |
                        --- ---- -------
'something <#> QQQ' =~ /<#> [ ]* (?!QQQ)/x
All you need to do is shuffle things around. Replace
/<#>[ ]*(?!QQQ)/
with
/<#>(?![ ]*QQQ)/
Or you can make it so the regex will only match all the spaces:
/<#>[ ]*+(?!QQQ)/
/<#>[ ]*(?![ ]|QQQ)/
/<#>[ ]*(?![ ])(?!QQQ)/
PS — Spaces are hard to see, so I use [ ] to make them more visible. It gets optimised away anyway.
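The behavior discussed in this thread can be reproduced with a short Python sketch. It compares the original pattern with the "spaces inside the look-ahead" fix on the four test strings from the question; the variable names and the `found` helper are made up for illustration. The possessive `*+` variant is left out because older Python versions do not support it.

```python
import re

samples = [
    "something <#> QQQ",
    "something <#> RRR",
    "something <#>QQQ",
    "something <#>RRR",
]

# Original: the spaces are optional, so the engine may match zero of them
# and then look ahead at " QQQ", which does not start with "QQQ".
original = re.compile(r"<#> *(?!QQQ)")

# Fix: the optional spaces are part of the look-ahead itself.
fixed = re.compile(r"<#>(?! *QQQ)")


def found(pattern, strings):
    """Return one boolean per string: did the pattern match somewhere?"""
    return [pattern.search(s) is not None for s in strings]


if __name__ == "__main__":
    for s, a, b in zip(samples, found(original, samples), found(fixed, samples)):
        print(f"{s!r}: original={a} fixed={b}")
```

The only string on which the two patterns disagree is the first one, which is exactly the case the question asks about.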

Anyone see anything wrong with my regex for port numbers?

I made a regex for port numbers (before you say this is a bad idea, it's going into a bigger regex for URLs, which is much harder than it sounds).
My coworker said this is really bad and isn't going to catch everything. I disagree.
I believe this thing catches everything from 0 to 65535 and nothing else, and I'm looking for confirmation of this.
Single-line version (for computers):
/(^[0-9]$)|(^[0-9][0-9]$)|(^[0-9][0-9][0-9]$)|(^[0-9][0-9][0-9][0-9]$)|((^[0-5][0-9][0-9][0-9][0-9]$)|(^6[0-4][0-9][0-9][0-9]$)|(^65[0-4][0-9][0-9]$)|(^655[0-2][0-9]$)|(^6553[0-5]$))/
Human readable version:
/(^[0-9]$)| # single digit
(^[0-9][0-9]$)| # two digit
(^[0-9][0-9][0-9]$)| # three digit
(^[0-9][0-9][0-9][0-9]$)| # four digit
((^[0-5][0-9][0-9][0-9][0-9]$)| # five digit (up to 59999)
(^6[0-4][0-9][0-9][0-9]$)| # (up to 64999)
(^65[0-4][0-9][0-9]$)| # (up to 65499)
(^655[0-2][0-9]$)| # (up to 65529)
(^6553[0-5]$))/ # (up to 65535)
Can someone confirm that my understanding is correct (or otherwise)?
You could shorten it considerably:
^0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$
no need to repeat the anchors every single time
no need for lots of capturing groups
no need to spell out repetitions.
Drop the leading 0* if you don't want to allow leading zeroes.
This regex is also better because it matches the special cases (65535, 65001 etc.) first and thus avoids some backtracking.
Oh, and since you said you want to use this as part of a larger regex for URLs, you should then replace both ^ and $ with \b (word boundary anchors).
Edit: @ceving asked if the repetition of 6553, 655, 65 and 6 is really necessary. The answer is no - you can also use a nested regex instead of having to repeat those leading digits. Let's just consider the section
6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}
This can be rewritten as
6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))
I would argue that this makes the regex even less readable than it already was. Verbose mode makes the differences a bit clearer. Compare
6553[0-5] |
655[0-2][0-9] |
65[0-4][0-9]{2} |
6[0-4][0-9]{3}
with
6
(?:
  [0-4][0-9]{3}
  |
  5
  (?:
    [0-4][0-9]{2}
    |
    5
    (?:
      [0-2][0-9]
      |
      3[0-5]
    )
  )
)
Some performance measurements: Testing each regex against all numbers from 1 through 99999 shows a minimal, probably irrelevant performance benefit for the nested version:
import timeit
r1 = """import re
regex = re.compile(r"0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""
r2 = """import re
regex = re.compile(r"0*(?:6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""
stmt = """for i in range(1,100000):
    regex.match(str(i))"""
print(timeit.timeit(setup=r1, stmt=stmt, number=100))
print(timeit.timeit(setup=r2, stmt=stmt, number=100))
Output:
7.7265428834649
7.556472630353351
Personally I would match just a number and then I would check with code that the number is in range.
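A minimal sketch of that approach in Python: the regex only checks the shape (one to five ASCII digits) and ordinary integer comparison checks the range. The names `PORT_DIGITS` and `is_valid_port` are hypothetical, and allowing leading zeroes such as "00080" is an assumption that you may want to change.

```python
import re

# Shape only: one to five ASCII digits, nothing else.
PORT_DIGITS = re.compile(r"[0-9]{1,5}")


def is_valid_port(text):
    """Return True if text is 1-5 digits whose value lies in 0..65535."""
    if PORT_DIGITS.fullmatch(text) is None:
        return False
    # The range check is a plain integer comparison, not a regex.
    return 0 <= int(text) <= 65535


if __name__ == "__main__":
    for candidate in ["0", "80", "65535", "65536", "123456", "80a", ""]:
        print(repr(candidate), is_valid_port(candidate))
```

This does not help when the port has to be validated inside one larger URL regex, which is the situation described in the question.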
Well, it's easy to prove that it will validate any correct port: just generate each valid string and test that it passes. Making sure it doesn't allow anything that it shouldn't is harder though - obviously you can't test absolutely every invalid string. You should definitely test simple cases and anything which you think might pass incorrectly (or which would pass incorrectly with a lesser regex - "65536" being an example).
It will allow some slightly odd port specifications though - such as "0000". Do you want to allow leading zeroes?
You might also want to consider whether you actually need to specify ^ and $ separately for each case, or whether you could use ^(?:(case 1)|(case 2)|...)$, with the alternatives grouped so that the anchors apply to all of them. Oh, and quantifiers could simplify the "1 to 4 digits" case too: ([0-9]{1,4}) will find between 1 and 4 digits.
(You might want to work on sounding a little less arrogant, by the way. If you're working with other people, communicating in a less aggressive way is likely to do more to improve everyone's day than just proving your regex is correct...)
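The "generate and test" idea above can be sketched in a few lines of Python. The pattern is the single-line regex from the question, copied as is; the function name `mismatches` and the upper limit of 100000 are my own choices, and the check only covers plain decimal strings without leading zeroes, so it is evidence rather than proof.

```python
import re

# The single-line regex from the question, split across lines for width only.
ASKER = re.compile(
    r"(^[0-9]$)|(^[0-9][0-9]$)|(^[0-9][0-9][0-9]$)|(^[0-9][0-9][0-9][0-9]$)|"
    r"((^[0-5][0-9][0-9][0-9][0-9]$)|(^6[0-4][0-9][0-9][0-9]$)|"
    r"(^65[0-4][0-9][0-9]$)|(^655[0-2][0-9]$)|(^6553[0-5]$))"
)


def mismatches(limit=100000):
    """Return every n below limit where the regex disagrees with 0 <= n <= 65535."""
    bad = []
    for n in range(limit):
        accepted = ASKER.search(str(n)) is not None
        if accepted != (n <= 65535):
            bad.append(n)
    return bad


# Leading zeroes are a separate question: does "0000" get through?
leading_zero_ok = ASKER.search("0000") is not None

if __name__ == "__main__":
    print("mismatches:", mismatches())
    print("accepts 0000:", leading_zero_ok)
```

An empty mismatch list supports the asker's claim for this range, while the second check shows the leading-zero behavior mentioned above.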
What's wrong with parsing it into a number and working with integer comparisons? (regardless of whether or not this will be part of a "larger" regex).
If I were to use regex, I would just use:
\d{1,5}
Nope, it doesn't check for "valid" port numbers (neither does yours). But it's much more legible and for practical purposes I'd say it's "good enough."
PS: I'd work on being more humble.
A style note:
Repeating [0-9] over and over again is silly - something like [0-9][0-9][0-9] is much better written as \d{3}.
/^(?:(6553[0-5])|(655[0-2]\d)|(65[0-4]\d{2})|(6[0-4]\d{3})|([1-5]\d{4})|([1-9]\d{1,3})|(\d))$/
Regex has many implementations; which platform is this for? Try the one below (remove the blanks from the readable version):
^([1-5]?\d{1,4}|6([0-4]\d{3}|5([0-4]\d{2}|5([0-2]\d|3[0-5]))))$
Readable:
^(
  [1-5]?\d{1,4}|
  6(
    [0-4]\d{3}|
    5(
      [0-4]\d{2}|
      5(
        [0-2]\d|
        3[0-5]
      )
    )
  )
)$
I would use this one:
6(?:[0-4]\d{3}|5(?:[0-4]\d{2}|5(?:[0-2]\d|3[0-5])))|(?:[1-5]\d{0,3}|[6-9]\d{0,2})?\d
The following Perl script tests some numbers:
#! /usr/bin/perl
use strict;
use warnings;
my $port = qr{
  6(?:[0-4]\d{3}|5(?:[0-4]\d{2}|5(?:[0-2]\d|3[0-5])))|(?:[1-5]\d{0,3}|[6-9]\d{0,2})?\d
}x;
sub test {
    my ($label, $regexp, $start, $stop) = @_;
    my $matches = 0;
    my $tests = 0;
    foreach my $n ($start..$stop) {
        $tests++;
        $matches++ if "$n" =~ /^$regexp$/;
        $tests++;
        $matches++ if "0$n" =~ /^$regexp$/;
    }
    print "$label [$start $stop] => $matches matches in $tests tests\n";
}
test "Port", $port, 0, 2**16;
The output is:
Port [0 65536] => 65536 matches in 131074 tests

RegEx: How can I replace with $n instances of a string?

I'm trying to replace numbers of the form 4.2098234e-3 with 00042098234. I can capture the component parts ok with:
(-?)(\d+)\.(\d+)e-(\d+)
but what I don't know how to do is to repeat the zeros at the start $4 times.
Any ideas?
Thanks in advance,
Ross
Ideally, I'd like to be able to do this with the find/replace feature of TextMate, if that's of any consequence. I appreciate that there are better tools than RegEx for this problem, but it's still an interesting question (to me).
You can't do it purely in regular expressions, because the replace string is just a string with backreferences -- you can't use repetition there.
In most programming languages, you have regex replace with a callback, which would be able to do it. However, it's not something that a text editor can do (unless it has some scripting support).
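As a rough illustration of the callback approach, here is a Python sketch using `re.sub` with a function as the replacement. The pattern is a cleaned-up form of the one in the question; the names `SCI`, `expand` and `rewrite` are invented for this example, and only negative exponents are handled, as in the question.

```python
import re

# sign, integer part, fractional part, exponent (negative exponents only)
SCI = re.compile(r"(-?)(\d+)\.(\d+)e-(\d+)")


def expand(match):
    """Build the replacement: exponent-many zeroes, then the digits."""
    sign, whole, frac, exponent = match.groups()
    # The repetition happens in code, which a plain replacement string cannot do.
    return sign + "0" * int(exponent) + whole + frac


def rewrite(text):
    """Replace every number of the form d.ddde-n found in text."""
    return SCI.sub(expand, text)


if __name__ == "__main__":
    print(rewrite("4.2098234e-3"))
```

The regex still does the capturing; the callback only supplies the repetition that the replacement syntax lacks.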
This isn't something that should be done with regex. That said, you can do something like this, but it's not really worth the effort: the regex is complicated, and the capability is limited.
Here's an illustrative example of replacing a digit [0-9] with that many zeroes.
// generate the regex and the replacement strings
String seq = "123456789";
String regex = seq.replaceAll(".", "(?=[$0-9].*(0)\\$)?") + "\\d";
String repl = seq.replaceAll(".", "\\$$0");
// let's see what they look like!!!
System.out.println(repl); // prints "$1$2$3$4$5$6$7$8$9"
System.out.println(regex); // prints oh my god just look at the next section!
// let's see if they work...
String input = "3 2 0 4 x 11 9";
System.out.println(
    (input + "0").replaceAll(regex, repl)
); // prints "000 00  0000 x 00 000000000"
// it works!!!
The regex is (as seen on ideone.com) (slightly formatted for readability):
(?=[1-9].*(0)$)?
(?=[2-9].*(0)$)?
(?=[3-9].*(0)$)?
(?=[4-9].*(0)$)?
(?=[5-9].*(0)$)?
(?=[6-9].*(0)$)?
(?=[7-9].*(0)$)?
(?=[8-9].*(0)$)?
(?=[9-9].*(0)$)?
\d
But how does it work??
The regex relies on positive lookaheads. It matches \d, but before doing that, it tries to see if it's [1-9]. If so, \1 goes all the way to the end of the input, where a 0 has been appended, to capture that 0. Then the second assertion checks if it's [2-9], and if so, \2 goes all the way to the end of the input to grab 0, and so on.
The technique works, but beyond a cute regex exercise, it probably has no real practicability.
Note also that 11 is replaced with 00. That is, each 1 is replaced with 1 zero. It's probably possible to recognize 11 as a number and put 11 zeroes instead, but it'd only make the regex more convoluted.

regex to match a maximum of 4 spaces

I have a regular expression to match a person's name.
So far I have ^([a-zA-Z\'\s]+)$ but I'd like to add a check to allow for a maximum of 4 spaces. How do I amend it to do this?
Edit: what I meant was 4 spaces anywhere in the string
Don't attempt to regex validate a name. People are allowed to call themselves whatever they like. This can include ANY character. Just because you live somewhere that only uses English doesn't mean that all the people who use your system will have English names. We have even had to make the name field in our system Unicode. It is the only Unicode type in the database.
If you care, we actually split the name at " " and store each name part as a separate record, but we have some very specific requirements that mean this is a good idea.
PS. My step mum has 5 spaces in her name.
^ # Start of string
(?!\S*(?:\s\S*){5}) # Negative look-ahead for five spaces.
([a-zA-Z\'\s]+)$ # Original regex
Or in one line:
^(?!(?:\S*\s){5})([a-zA-Z\'\s]+)$
If there are five or more spaces in the string, five will be matched by the negative lookahead, and the whole match will fail. If there are four or fewer, the original regex will be matched.
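To see the look-ahead at work, here is a small Python sketch built on the one-line pattern above. The names `NAME` and `is_acceptable` are made up for illustration, and the sample names are arbitrary. Keep in mind that `\s` counts any whitespace character, not only the space character.

```python
import re

# Fail early if five whitespace characters can be found; otherwise apply
# the original character-class check from the question.
NAME = re.compile(r"^(?!(?:\S*\s){5})([a-zA-Z'\s]+)$")


def is_acceptable(name):
    """Return True for letters, apostrophes and at most four whitespace characters."""
    return NAME.match(name) is not None


if __name__ == "__main__":
    for candidate in ["Mary Ann de la Cruz", "a b c d e f", "O'Neil", "R2 D2"]:
        print(repr(candidate), is_acceptable(candidate))
```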
Screw the regex.
Using a regex here seems to be creating a problem for a solution instead of just solving a problem.
This task should be 'easy' for even a novice programmer, and the novel idea of regex has polluted our minds!.
1: Get Input
2: Trim White Space
3: If this makes sense, trim out any 'bad' characters.
4: Use the "split" utility provided by your language to break it into words
5: Return the first 5 Words.
ROCKET SCIENCE.
replies
What do you mean, screw the regex? You're obviously a VB programmer.
Regex is the most efficient way to work with strings. Learn them.
No. PHP, toyed a bit with Ruby, now going manically into Perl.
There are some things (like this case) where the regex-based alternative is computationally and logically far too complex for the task.
I've parsed entire PHP source files with regex; I'm not exactly a novice in their use.
But there are many cases, such as this, where you're employing a logging company to prune your rose bush.
I could do all steps 2 to 5 with regex of course, but they would be simple and atomic regex, with no weird backtracking syntax or potential for recursive searching.
The steps 1 to 5 I list above have a known scope, known range of input, and there's no ambiguity to how it functions. As to your regex, the fact you have to get contributions of others to write something so simple is proving the point.
I see somebody marked my post as offensive, I am somewhat unhappy I can't mark this fact as offensive to me. ;)
Proof Of Pudding:
sub getNames{
    my @args = @_;
    my $text = shift @args;
    my $num = shift @args;
    # Trim Whitespace from Head/End
    $text =~ s/^\s*//;
    $text =~ s/\s*$//;
    # Trim Bad Characters (??)
    $text =~ s/[^a-zA-Z\'\s]//g;
    # Tokenise By Space
    my @words = split( /\s+/, $text );
    #return 0..n
    return @words[ 0 .. $num - 1 ];
} ## end sub getNames
print join ",", getNames " Hello world this is a good test", 5;
>> Hello,world,this,is,a
If there is anything ambiguous to anybody how that works, I'll be glad to explain it to them. Noted that I'm still doing it with regexps. Other languages I would have used their native "trim" functions provided where possible.
Bollocks -->
I first tried this approach. This is your brain on regex. Kids, don't do regex.
This might be a good start
/([^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+|)
|)
|)
|)
)/
( Linebroken for clarity )
/([^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+|)|)|)|))/
( Actual )
I've used [^\s]+ here instead of your A-Z combo for succinctness, but the point here is the nested optional groups
ie:
(Hello( this( is( example))))
(Hello( this( is( example( two)))))
(Hello( this( is( better( example))))) three
(Hello( this( is())))
(Hello( this()))
(Hello())
( Note: this, while being convoluted, has the benefit that it will match each name into its own group )
If you want readable code:
$word = '[^\s]+';
$regex = "/($word(\s$word(\s$word(\s$word(\s$word|)|)|)|)|)/";
( it anchors around the (capture|) mantra of "get this, or get nothing" )
@Sir Psycho : Be careful about your assumptions here. What about hyphenated names? Dotted names (e.g. Brian R. Bondy) and so on?
Here's the answer that you're most likely looking for:
^[a-zA-Z']+(\s[a-zA-Z']+){0,4}$
That says (in English): "From start to finish, match one or more letters, there can also be a space followed by another 'name' up to four times."
BTW: Why do you want them to have apostrophes anywhere in the name?
^([a-zA-Z']+\s){0,4}[a-zA-Z']+$
This assumes you want 4 spaces inside this string (i.e. you have trimmed it)
Edit: If you want 4 spaces anywhere I'd recommend not using regex - you'd be better off using a substr_count (or the equivalent in your language).
I also agree with pipTheGeek that there are so many different ways of writing names that you're probably best off trusting the user to get their name right (although I have found that a lot of people don't bother using capital letters on ecommerce checkouts).
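For the "count instead of regex" suggestion above, a Python equivalent of `substr_count` is `str.count`. A minimal sketch, with a hypothetical function name and assuming only the plain space character should be counted:

```python
def has_at_most_four_spaces(name, limit=4):
    """Return True if name contains no more than `limit` space characters."""
    # str.count counts non-overlapping occurrences of the substring.
    return name.count(" ") <= limit


if __name__ == "__main__":
    for candidate in ["Mary Ann de la Cruz", "a b c d e f", "  padded  "]:
        print(repr(candidate), has_at_most_four_spaces(candidate))
```

This counts spaces anywhere in the string, including leading and trailing ones, which matches the edit to the question.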
Match multiple whitespace followed by two characters at the end of the line.
Related problem ----
From a string, remove trailing 2 characters preceded by multiple white spaces... For example, if the column contains this string -
" 'This is a long string with 2 chars at the end AB "
then, AB should be removed while retaining the sentence.
Solution ----
select 'This is a long string with 2 chars at the end AB' as "C1",
regexp_replace('This is a long string with 2 chars at the end AB',
'[[:space:]]+[a-zA-Z][a-zA-Z]$') as "C2" from dual;
Output ----
C1
This is a long string with 2 chars at the end AB
C2
This is a long string with 2 chars at the end
Analysis ----
The regular expression says: match and remove one or more whitespace characters ([[:space:]]+) followed by a combination of two letters ([a-zA-Z][a-zA-Z]) at the end of the line ($).
Hope this is useful.