How can I make this regex more compact?


Let's say I have a line of text like this
Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56
I want to capture the last six numbers and ignore the Small and the first two groups of numbers.
For this exercise, let's ignore the fact that it might be easier to just do some sort of string-split instead of a regular expression.
I have this regex that works but is kind of horrible looking
^(Small).*?[0-9.]+.*?[0-9.]+.*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)
Is there some way to compact that?
For example, is it possible to combine the check for the last 6 numbers into a single statement that still stores the results as 6 separate group matches?

If you want to keep each match in a separate backreference, you have no choice but to "spell it out" - if you use repetition, you can either catch all six groups "as one" or only the last one, depending on where you put the capturing parentheses. So no, it's not possible to compact the regex and still keep all six individual matches.
A somewhat more efficient (though not beautiful) regex would be:
^Small\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)
since it matches the spaces explicitly. Your regex will result in a lot of backtracking. My regex matches in 28 steps, yours in 106.
Just as an aside: In Python, you could simply do a
>>> pieces = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56".split()[-6:]
>>> print(pieces)
['1.49', '25.71', '41.05', '12.31', '0.00', '80.56']
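
The point about repetition is easy to check. Below is a minimal Python sketch (the variable names are illustrative and not part of the question): a quantified capturing group keeps only its final repetition, whereas matching one number at a time with `re.findall` and slicing recovers all six values.

```python
import re

line = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56"

# A capturing group inside a quantifier only remembers its last repetition.
repeated = re.match(r"^Small(?:\s+[\d.]+){2}(?:\s+([\d.]+)){6}\s*$", line)
last_only = repeated.groups()

# Matching one number at a time and slicing keeps all six values.
last_six = re.findall(r"[\d.]+", line)[-6:]
```

This is not a single regex with six groups, so it does not contradict the answer; it only shows what the repetition actually captures.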

Here is the shortest I could get:
^Small\s+(?:[\d.]+\s+){2}([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$
It must be long because each capture must be specified explicitly. No need to capture "Small", though. But it is better to be specific (\s instead of .) when you can, and to anchor on both ends.

For usability, you should use string substitution to build regex from composite parts.
$d = "[0-9.]+";
$s = ".*?";
$re = "^(Small)$s$d$s$d$s($d)$s($d)$s($d)$s($d)$s($d)$s($d)";
At least then you can see the structure past the pattern, and changing one part changes them all.
If you wanted to get really ANSI, you could use a short metasyntax and make it even easier to read:
$re = "^(Small)_#D_#D_(#D)_(#D)_(#D)_(#D)_(#D)_(#D)";
$re = str_replace('#D','[0-9.]+',$re);
$re = str_replace('_', '.*?' , $re );
( This way it also makes it trivial to change the definition of what a space token is, or what a digit token is )
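
The same idea of composing the pattern from named parts can be sketched in Python. This is only an illustration: `NUMBER` and `GAP` are names made up here, and the sketch assumes whitespace separators (`\s+`) rather than `.*?`.

```python
import re

NUMBER = r"[\d.]+"  # what counts as a number token (assumption: digits and dots)
GAP = r"\s+"        # what counts as a separator (assumption: whitespace only)

skipped = rf"(?:{NUMBER}{GAP}){{2}}"          # two numbers that are not captured
captured = GAP.join([rf"({NUMBER})"] * 6)      # six separately captured numbers
pattern = re.compile(rf"^Small{GAP}{skipped}{captured}\s*$")

line = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56"
groups = pattern.match(line).groups()
```

The compiled pattern still spells out six groups; the repetition lives in the code that builds it, so changing the definition of a number or a separator is a one-line edit.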

Related

Using RegEx how do I remove the trailing zeros from a decimal number

I'm needing to write some regex that takes a number and removes any trailing zeros after a decimal point. The language is Actionscript 3. So I would like to write:
var result:String = theStringOfTheNumber.replace( [ the regex ], "" );
So for example:
3.04000 would be 3.04
0.456000 would be 0.456 etc
I've spent some time looking at various regex websites and I'm finding this harder to resolve than I initially thought.

Regex:
^(\d+\.\d*?[1-9])0+$
OR
(\.\d*?[1-9])0+$
Replacement string:
$1
Code:
var result:String = theStringOfTheNumber.replace(/(\.\d*?[1-9])0+$/g, "$1" );

What worked best for me was
^([\d,]+)$|^([\d,]+)\.0*$|^([\d,]+\.[0-9]*?)0*$
For example,
s.replace(/^([\d,]+)$|^([\d,]+)\.0*$|^([\d,]+\.[0-9]*?)0*$/, "$1$2$3");
This changes
1.10000 => 1.1
1.100100 => 1.1001
1.000 => 1
1 => 1

What about stripping the trailing zeros before a \b boundary if there's at least one digit after the .
(\.\d+?)0+\b
And replace with what was captured in the first capture group.
$1
See test at regexr.com

(?=.*?\.)(.*?[1-9])(?!.*?\.)(?=0*$)|^.*$
Try this. Grab the capture. See demo.
http://regex101.com/r/xE6aD0/11

Other answers didn't consider numbers without fraction (like 1.000000 ) or used a lookbehind function (sadly, not supported by implementation I'm using). So I modified existing answers.
Match using ^-?\d+(\.\d*[1-9])? - Demo (see matches). This will not work with numbers in text (like sentences).
Replace(with \1 or $1) using (^-?\d+\.\d*[1-9])(0+$)|(\.0+$) - Demo (see substitution). This one will work with numbers in text (like sentences) if you remove the ^ and $.
Both demos with examples.
Side note: Replace the \. with decimal separator you use (, - no need for slash) if you have to, but I would advise against supporting multiple separator formats within such regex (like (\.|,)). Internal formats normally use one specific separator like . in 1.135644131 (no need to check for other potential separators), while external tend to use both (one for decimals and one for thousands, like 1.123,541,921), which would make your regex unreliable.
Update: I added -? to both regexes to add support for negative numbers, which is not in demo.

If your regular expressions engine doesn't support "lookaround" feature then you can use this simple approach:
fn:replace("12300400", "([^0])0*$", "$1")
Result will be: 123004

I know I am kind of late but I think this can be solved in a far more simple way.
Either I miss something or the other repliers overcomplicate it, but I think there is a far more straightforward yet resilient solution RE:
([0-9]*[.]?([0-9]*[1-9]|[0]?))[0]*
By backreferencing the first group (\1) you can get the number without trailing zeros.
It also works with .XXXXX... and ...XXXXX. type number strings. For example, it will convert .45600 to .456 and 123. to 123. as well.
More importantly, it leaves integer number strings intact (numbers without decimal point). For example, it will convert 12300 to 12300.
Note that if there is a decimal point and there are only zeroes after it, exactly one trailing zero is kept. For example, for 42.0000 you get 42.0.
If you want to eliminate the leading zeroes too, then use this RE (just put a [0]* at the start of the former):
[0]*([0-9]*[.]?([0-9]*[1-9]|[0]?))[0]*

I tested a few answers from the top:
^(\d+\.\d*?[1-9])0+$
(\.\d*?[1-9])0+$
(\.\d+?)0+\b
None of them works when there are only zeroes after the ".", as in 45.000 or 450.000.
A modified version that handles that case: (\.\d*?[1-9]|)\.?0+$
The replacement again has to be '$1', like:
preg_replace('/(\.\d*?[1-9]|)\.?0+$/', '$1', $value);
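
A quick way to compare behaviour on edge cases is a small Python check. The sketch below is an assumption-laden illustration rather than a drop-in answer: `strip_trailing_zeros` is a hypothetical helper, and it uses a two-alternative pattern that only ever matches starting at a decimal point. It also shows that the modified pattern above, translated to Python, shortens a plain integer such as 100.

```python
import re

def strip_trailing_zeros(text: str) -> str:
    """Remove trailing zeros after a decimal point; leave integers alone."""
    # First alternative: keep everything up to the last non-zero decimal digit.
    # Second alternative: the fraction is all zeros, so drop the point as well.
    return re.sub(r"(\.\d*?[1-9])0+$|\.0+$", lambda m: m.group(1) or "", text)

# The modified pattern from the answer above, applied to an integer.
integer_case = re.sub(r"(\.\d*?[1-9]|)\.?0+$", r"\1", "100")
```

Because both alternatives begin with `\.`, input without a decimal point is returned unchanged.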

try this
^(?!0*(\.0+)?$)(\d+|\d*\.\d+)$
And read this
http://www.regular-expressions.info/numericranges.html it might be helpful.

I know it's not what the original question is looking for, but anyone who is looking to format money and would only like to remove two consecutive trailing zeros, like so:
£30.00 => £30
£30.10 => £30.10 (and not £30.1)
30.00€ => 30€
30.10€ => 30.10€
Then you should be able to use the following regular expression, which identifies two trailing zeros (together with the non-digit character before them) that are either followed by a non-digit or sit at the end of the string.
([^\d]00)(?=[^\d]|$)
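
As a rough check of the four examples, here is a minimal Python sketch. `drop_zero_cents` is a hypothetical helper name, and the sketch assumes the match is replaced with an empty string.

```python
import re

# The pattern from the answer: a non-digit, two zeros, then a non-digit or the end.
TWO_ZEROS = re.compile(r"([^\d]00)(?=[^\d]|$)")

def drop_zero_cents(amount: str) -> str:
    """Remove a '.00' style suffix but keep amounts such as '.10' intact."""
    return TWO_ZEROS.sub("", amount)
```

Note that the separator itself is part of the match, so it is removed together with the two zeros.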

I'm a bit late to the party, but here's my solution:
(((?<=(\.|,)\d*?[1-9])0+$)|(\.|,)0+$)
My regular expression will only match the trailing 0s, making it easy to do a .replaceAll(..) type function.
Breaking it down, part one: ((?<=(\.|,)\d*?[1-9])0+$)
(?<=(\.|,): A positive lookbehind. The decimal must contain a . or a , (commas are used as a decimal point in some countries). Because it's a lookbehind, it is not included in the matched text, but it still must be present.
\d*?: Matches any number of digits lazily
[1-9]: Matches a single non-zero character (this will be the last digit before trailing 0s)
0+$: Matches 1 or more 0s that occur between the last non-zero digit and the line end.
This works great for everything except the case where trailing 0s begin immediately, like in 1.0 or 5.000. The second part fixes this (\.|,)0+$:
(\.|,): Matches a . or a , that will be included in matched text.
0+$ matches 1 or more 0s between the decimal point and the line end.
Examples:
1.0 becomes 1
5.0000 becomes 5
5.02394900022000 becomes 5.02394900022

Is it really necessary to use regex? Why not just check the last digits in your numbers? I am not familiar with Actionscript 3, but in python I would do something like this:
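decinums = ['1.100', '0.0', '1.1', '10']
for d in decinums:
    # Only touch numbers that contain a decimal point.
    # (d.find('.') returns -1 when there is none, which is truthy.)
    if '.' in d:
        while d.endswith('0'):
            d = d[:-1]
        if d.endswith('.'):
            d = d[:-1]
    print(d)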
The result will be:
1.1
0
1.1
10

Perl - Generate All Matching String To A Regex

I am kind of new to Perl, and I wanted to know if there is a way to generate all the strings that match a regex.
What is the best way to generate all the strings matching:
05[0,2,4,7][\d]{7}
thanks in advance.

While you cannot just take any regex and produce any strings it might fit, in this case you can easily adapt and overcome.
You can use glob to generate combinations:
perl -lwe "print for glob '05{0,2,4,7}'"
050
052
054
057
However, I should not have to tell you that \d{7} actually means quite a few million combinations, right? Generating a list of numbers is trivial, formatting them can be done with sprintf:
my @nums = map sprintf("%07d", $_), 0 .. 9_999_999;
That is assuming you are only looking for 0-9 numericals.
Take those nums and combine them with the globbed ones: Tada.

No there is no way to generate all matches for a certain regex. Consider this one:
a+
There is an infinite number of matches for that regex, thus you cannot list them all.
By the way, I think you want your regex to look like this:
05[0247]\d{7}

2012 answer
String::Random
Regexp::Genex - generates random strings that match the regexp; not all the possible strings, even for finite patterns like [class]
Parse::RandGen
§6.5 regex string generation in HOP

Then there is a way to generate all (forty million of) the matches for this certain regex, viz., 05[0247]\d{7}:
use Modern::Perl;
for my $x (qw{0 2 4 7}) {
say "05$x" . sprintf '%07d', $_ for 0 .. 9999999;
}
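
For comparison, the same enumeration can be sketched in Python as a generator, so the strings are produced lazily instead of being held in memory. The function name `matching_strings` is made up for this illustration.

```python
import re
from itertools import islice

def matching_strings():
    """Yield each string made of 05, one of 0/2/4/7, then seven digits."""
    for third_digit in "0247":
        for n in range(10_000_000):
            yield f"05{third_digit}{n:07d}"

# 4 choices for the third digit times 10**7 seven-digit suffixes.
total = 4 * 10**7
first_three = list(islice(matching_strings(), 3))
```

Only the first few values are materialised here; iterating over the whole generator would walk through all forty million strings.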

How to use a REGEX pattern to remove a specific word "THE" only if at beginning of text string?

I have a text input field for titles of various things, and to help minimize false negatives in search results (internal search is not the best), I need a REGEX pattern which looks at the first four characters of the input string and removes the word the (and the space after it) if it is there, at the beginning only.
For example, if we are talking about the names of bands and someone enters The Rolling Stones, what I need is for the entry to say only Rolling Stones.
Can a regex be used to automatically strip these 4 characters?

Applying the regex
^(?:\s*the\s*)?(.*)$
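will match any string and capture it in backreference no. 1, unless it starts with the (optionally surrounded by whitespace), in which case backref no. 1 will contain whatever follows. Note that because \s* also matches zero whitespace characters, a title such as They Might Be Giants loses its leading The as well; writing \s+ after the avoids that.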
You need to set the case-insensitive option in your regex engine for this to work.
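
A minimal Python sketch of this approach follows. It is a variant rather than the exact pattern above: it assumes at least one whitespace character must follow the word, and `strip_leading_the` is a hypothetical helper name.

```python
import re

# Optional leading whitespace, the word "the", then at least one whitespace character.
LEADING_THE = re.compile(r"^\s*the\s+", re.IGNORECASE)

def strip_leading_the(title: str) -> str:
    """Drop a leading 'the ' (any case); leave every other title untouched."""
    return LEADING_THE.sub("", title, count=1)
```

Requiring `\s+` after the word is what keeps titles that merely start with the same three letters intact.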

You can use the ^ identifier to match a pattern at the beginning of a line, however for what you are using this for, it can be considered overkill.
A lot of languages support string manipulations, which is a more suitable choice. I can provide an example to demonstrate in Python,
>>> def func(n):
...     n = n[4:] if n[0:4] == "The " else n
...     return n
>>> func("The Rolling Stones")
'Rolling Stones'
>>> func("They Might Be Giants")
'They Might Be Giants'

As you don't specify which language you are using, here is a solution in Perl:
my $str = "The Rolling Stones";
$str =~ s/^the //i;
say $str; # Rolling Stones

Anyone see anything wrong with my regex for port numbers?

I made a regex for port numbers (before you say this is a bad idea, it's going into a bigger regex for URLs, which is much harder than it sounds).
My coworker said this is really bad and isn't going to catch everything. I disagree.
I believe this thing catches everything from 0 to 65535 and nothing else, and I'm looking for confirmation of this.
Single-line version (for computers):
/(^[0-9]$)|(^[0-9][0-9]$)|(^[0-9][0-9][0-9]$)|(^[0-9][0-9][0-9][0-9]$)|((^[0-5][0-9][0-9][0-9][0-9]$)|(^6[0-4][0-9][0-9][0-9]$)|(^65[0-4][0-9][0-9]$)|(^655[0-2][0-9]$)|(^6553[0-5]$))/
Human readable version:
/(^[0-9]$)| # single digit
(^[0-9][0-9]$)| # two digit
(^[0-9][0-9][0-9]$)| # three digit
(^[0-9][0-9][0-9][0-9]$)| # four digit
((^[0-5][0-9][0-9][0-9][0-9]$)| # five digit (up to 59999)
(^6[0-4][0-9][0-9][0-9]$)| # (up to 64999)
(^65[0-4][0-9][0-9]$)| # (up to 65499)
(^655[0-2][0-9]$)| # (up to 65529)
(^6553[0-5]$))/ # (up to 65535)
Can someone confirm that my understanding is correct (or otherwise)?

You could shorten it considerably:
^0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$
no need to repeat the anchors every single time
no need for lots of capturing groups
no need to spell out repetitions.
Drop the leading 0* if you don't want to allow leading zeroes.
This regex is also better because it matches the special cases (65535, 65001 etc.) first and thus avoids some backtracking.
Oh, and since you said you want to use this as part of a larger regex for URLs, you should then replace both ^ and $ with \b (word boundary anchors).
Edit: @ceving asked if the repetition of 6553, 655, 65 and 6 is really necessary. The answer is no - you can also use a nested regex instead of having to repeat those leading digits. Let's just consider the section
6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}
This can be rewritten as
6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))
I would argue that this makes the regex even less readable than it already was. Verbose mode makes the differences a bit clearer. Compare
6553[0-5] |
655[0-2][0-9] |
65[0-4][0-9]{2} |
6[0-4][0-9]{3}
with
6
(?:
  [0-4][0-9]{3}
  |
  5
  (?:
    [0-4][0-9]{2}
    |
    5
    (?:
      [0-2][0-9]
      |
      3[0-5]
    )
  )
)
Some performance measurements: Testing each regex against all numbers from 1 through 99999 shows a minimal, probably irrelevant performance benefit for the nested version:
import timeit
r1 = """import re
regex = re.compile(r"0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""
r2 = """import re
regex = re.compile(r"0*(?:6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""
stmt = """for i in range(1,100000):
    regex.match(str(i))"""
print(timeit.timeit(setup=r1, stmt=stmt, number=100))
print(timeit.timeit(setup=r2, stmt=stmt, number=100))
Output:
7.7265428834649
7.556472630353351

Personally I would match just a number and then I would check with code that the number is in range.
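
A minimal sketch of that split in Python, assuming the port arrives as a string; `is_valid_port` is a hypothetical name. The regex only checks the shape, and an integer comparison checks the range.

```python
import re

def is_valid_port(text: str) -> bool:
    """Accept 1 to 5 ASCII digits whose numeric value is at most 65535."""
    # Shape first (so int() cannot fail), then the range check in plain code.
    return re.fullmatch(r"[0-9]{1,5}", text) is not None and int(text) <= 65535
```

As written this accepts leading zeros such as 00080; whether that is desirable depends on the application.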

Well, it's easy to prove that it will validate any correct port: just generate each valid string and test that it passes. Making sure it doesn't allow anything that it shouldn't is harder though - obviously you can't test absolutely every invalid string. You should definitely test simple cases and anything which you think might pass incorrectly (or which would pass incorrectly with a lesser regex - "65536" being an example).
It will allow some slightly odd port specifications though - such as "0000". Do you want to allow leading zeroes?
You might also want to consider whether you actually need to specify ^ and $ separately for each case, or whether you could use ^(?:(case 1)|(case 2)|...)$. The extra group matters: without it, ^ applies only to the first alternative and $ only to the last. Oh, and quantifiers could simplify the "1 to 4 digits" case too: ([0-9]{1,4}) will find between 1 and 4 digits.
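
The exhaustive check suggested at the start of this answer could be sketched as follows in Python. The pattern used here is the shortened alternation from another answer, without the optional leading zeros; the list of near misses is an arbitrary sample chosen for illustration.

```python
import re

PORT = re.compile(
    r"6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}"
    r"|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9]"
)

# fullmatch anchors the whole alternation at both ends of the string.
accepted = {n for n in range(100_000) if PORT.fullmatch(str(n))}

# A few strings that a weaker pattern might let through.
near_misses = ["65536", "00", "-1", "", "80 "]
rejected = [s for s in near_misses if PORT.fullmatch(s) is None]
```

Checking every number up to 99999 proves the valid side; the invalid side can only ever be sampled.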
(You might want to work on sounding a little less arrogant, by the way. If you're working with other people, communicating in a less aggressive way is likely to do more to improve everyone's day than just proving your regex is correct...)

What's wrong with parsing it into a number and work with integer comparisons? (regardless of whether or not this will be part of a "larger" regex).
If I were to use regex, I would just use:
\d{1,5}
Nope, it doesn't check for "valid" port numbers (neither does yours). But it's much more legible and for practical purposes I'd say it's "good enough."
PS: I'd work on being more humble.

A style note:
Repeating [0-9] over and over again is silly - something like [0-9][0-9][0-9] is much better written as \d{3}.

/^(?:(6553[0-5])|(655[0-2]\d)|(65[0-4]\d{2})|(6[0-4]\d{3})|([1-5]\d{4})|([1-9]\d{1,3})|(\d))$/

Regex has many implementations; which platform are you on? Try the one below (remove the blanks from the readable version):
^([1-5]?\d{1,4}|6([0-4]\d{3}|5([0-4]\d{2}|5([0-2]\d|3[0-5]))))$
Readable:
^(
  [1-5]?\d{1,4}|
  6(
    [0-4]\d{3}|
    5(
      [0-4]\d{2}|
      5(
        [0-2]\d|
        3[0-5]
      )
    )
  )
)$

I would use this one:
6(?:[0-4]\d{3}|5(?:[0-4]\d{2}|5(?:[0-2]\d|3[0-5])))|(?:[1-5]\d{0,3}|[6-9]\d{0,2})?\d
The following Perl script tests some numbers:
#! /usr/bin/perl
use strict;
use warnings;
my $port = qr{
6(?:[0-4]\d{3}|5(?:[0-4]\d{2}|5(?:[0-2]\d|3[0-5])))|(?:[1-5]\d{0,3}|[6-9]\d{0,2})?\d
}x;
sub test {
    my ($label, $regexp, $start, $stop) = @_;
    my $matches = 0;
    my $tests = 0;
    foreach my $n ($start..$stop) {
        $tests++;
        $matches++ if "$n" =~ /^$regexp$/;
        $tests++;
        $matches++ if "0$n" =~ /^$regexp$/;
    }
    print "$label [$start $stop] => $matches matches in $tests tests\n";
}
test "Port", $port, 0, 2**16;
The output is:
Port [0 65536] => 65536 matches in 131074 tests

regex to match a maximum of 4 spaces

I have a regular expression to match a persons name.
So far I have ^([a-zA-Z\'\s]+)$ but I'd like to add a check to allow for a maximum of 4 spaces. How do I amend it to do this?
Edit: what I meant was 4 spaces anywhere in the string

Don't attempt to regex validate a name. People are allowed to call themselves what ever they like. This can include ANY character. Just because you live somewhere that only uses English doesn't mean that all the people who use your system will have English names. We have even had to make the name field in our system Unicode. It is the only Unicode type in the database.
If you care, we actually split the name at " " and store each name part as a separate record, but we have some very specific requirements that mean this is a good idea.
PS. My step mum has 5 spaces in her name.

^ # Start of string
(?!\S*(?:\s\S*){5}) # Negative look-ahead for five spaces.
([a-zA-Z\'\s]+)$ # Original regex
Or in one line:
^(?!(?:\S*\s){5})([a-zA-Z\'\s]+)$
If there are five or more spaces in the string, five will be matched by the negative lookahead, and the whole match will fail. If there are four or less, the original regex will be matched.
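
A short Python sketch to see the lookahead at work; the sample names are invented for the test and are not from the question.

```python
import re

# One-line version from above: reject as soon as a fifth whitespace character exists.
NAME = re.compile(r"^(?!(?:\S*\s){5})([a-zA-Z'\s]+)$")

four_spaces = NAME.match("Mary Jane van der Berg")   # 4 spaces
five_spaces = NAME.match("Mary Jane van der Berg Jr")  # 5 spaces
```

The lookahead only counts whitespace; which characters are allowed is still decided by the original character class.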

Screw the regex.
Using a regex here seems to be creating a problem for a solution instead of just solving a problem.
This task should be 'easy' for even a novice programmer, and the novel idea of regex has polluted our minds!
1: Get Input
2: Trim White Space
3: If this makes sense, trim out any 'bad' characters.
4: Use the "split" utility provided by your language to break it into words
5: Return the first 5 Words.
ROCKET SCIENCE.
replies
what do you mean screw the regex? you're obviously a VB programmer.
Regex is the most efficient way to work with strings. Learn them.
No. PHP, toyed a bit with Ruby, now going manically into Perl.
There are some things (like this case) where the regex-based alternative is computationally and logically overly complex for the task.
I've parsed entire PHP source files with regex; I'm not exactly a novice in their use.
But there are many cases, such as this, where you're employing a logging company to prune your rose bush.
I could do all steps 2 to 5 with regex of course, but they would be simple and atomic regex, with no weird backtracking syntax or potential for recursive searching.
The steps 1 to 5 I list above have a known scope, known range of input, and there's no ambiguity to how it functions. As to your regex, the fact you have to get contributions of others to write something so simple is proving the point.
I see somebody marked my post as offensive, I am somewhat unhappy I can't mark this fact as offensive to me. ;)
Proof Of Pudding:
sub getNames {
    my @args = @_;
    my $text = shift @args;
    my $num  = shift @args;
    # Trim Whitespace from Head/End
    $text =~ s/^\s*//;
    $text =~ s/\s*$//;
    # Trim Bad Characters (??)
    $text =~ s/[^a-zA-Z\'\s]//g;
    # Tokenise By Space
    my @words = split( /\s+/, $text );
    # return 0..n
    return @words[ 0 .. $num - 1 ];
} ## end sub getNames
print join ",", getNames " Hello world this is a good test", 5;
>> Hello,world,this,is,a
If there is anything ambiguous to anybody how that works, I'll be glad to explain it to them. Noted that I'm still doing it with regexps. Other languages I would have used their native "trim" functions provided where possible.
Bollocks -->
I first tried this approach. This is your brain on regex. Kids, don't do regex.
This might be a good start
/([^\s]+
  (\s[^\s]+
    (\s[^\s]+
      (\s[^\s]+
        (\s[^\s]+|)
      |)
    |)
  |)
)/
( Linebroken for clarity )
/([^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+|)|)|)|))/
( Actual )
I've used [^\s]+ here instead of your A-Z combo for succinctness, but the point here is the nested optional groups
ie:
(Hello( this( is( example))))
(Hello( this( is( example( two)))))
(Hello( this( is( better( example))))) three
(Hello( this( is()))))
(Hello( this()))
(Hello())
( Note: this, while being convoluted, has the benefit that it will match each name into its own group )
If you want readable code:
$word = '[^\s]+';
$regex = "/($word(\s$word(\s$word(\s$word(\s$word|)|)|)|)|)/";
( it anchors around the (capture|) mantra of "get this, or get nothing" )

@Sir Psycho : Be careful about your assumptions here. What about hyphenated names? Dotted names (e.g. Brian R. Bondy) and so on?

Here's the answer that you're most likely looking for:
^[a-zA-Z']+(\s[a-zA-Z']+){0,4}$
That says (in English): "From start to finish, match one or more letters, there can also be a space followed by another 'name' up to four times."
BTW: Why do you want them to have apostrophes anywhere in the name?

^([a-zA-Z']+\s){0,4}[a-zA-Z']+$
This assumes you want 4 spaces inside this string (i.e. you have trimmed it)
Edit: If you want 4 spaces anywhere I'd recommend not using regex - you'd be better off using a substr_count (or the equivalent in your language).
I also agree with pipTheGeek that there are so many different ways of writing names that you're probably best off trusting the user to get their name right (although I have found that a lot of people don't bother using capital letters on ecommerce checkouts).
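
The counting approach suggested in the edit above needs no regex at all. A minimal Python sketch, where `has_at_most_four_spaces` is a hypothetical name and only the plain space character is counted:

```python
def has_at_most_four_spaces(name: str) -> bool:
    """True when the name contains four or fewer space characters anywhere."""
    return name.count(" ") <= 4
```

Tabs and other whitespace are not counted here; that would need a different check.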

Match multiple whitespace followed by two characters at the end of the line.
Related problem ----
From a string, remove trailing 2 characters preceded by multiple white spaces... For example, if the column contains this string -
" 'This is a long string with 2 chars at the end AB "
then, AB should be removed while retaining the sentence.
Solution ----
select 'This is a long string with 2 chars at the end AB' as "C1",
regexp_replace('This is a long string with 2 chars at the end AB',
'[[:space:]]+[a-zA-Z][a-zA-Z][[:space:]]*$') as "C2" from dual;
Output ----
C1
This is a long string with 2 chars at the end AB
C2
This is a long string with 2 chars at the end
Analysis ----
The regular expression matches one or more whitespace characters ([[:space:]]+) followed by exactly two letters ([a-zA-Z][a-zA-Z]) and any trailing whitespace ([[:space:]]*) at the end of the line ($). Because no replacement string is given, regexp_replace removes the match.
Hope this is useful.
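
For readers outside Oracle, the same removal can be sketched with Python's `re` module. `strip_trailing_code` is a made-up name, and `\s` stands in for `[[:space:]]`.

```python
import re

def strip_trailing_code(text: str) -> str:
    """Remove whitespace plus exactly two letters (and trailing blanks) at the end."""
    return re.sub(r"\s+[a-zA-Z]{2}\s*$", "", text)

cleaned = strip_trailing_code("This is a long string with 2 chars at the end     AB ")
```

A sentence that simply ends in a two-letter word is trimmed in the same way, so this assumes the column really carries such a suffix.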
