Identify an address number in a string - regex

I have a list of addresses, currently quite unclean. They take the format:
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
What I would like to do is split up the street name and street number. I need a regex expression that is true for
955 - 959
95-99
4-9
95 - 99
99
I currently have this:
^[0-9][0-9]\s*+(\s*-\s*[0-9][0-9]+)
which works for the two digit addresses but does not work for the three or one digit addresses.
Thanks

I'm not sure what you're trying to do here \s*+ but you basically had the answer with the last part [0-9][0-9]+ that would find 2+ digits on the end.
Maybe try this (it's more concise). This searches for 1+ digits instead of 2+
\d+(\s*-\s*\d+)?

You can use braces {2,3} for 2-3 numbers - but also *+ isn't right.
/^(([0-9]{1,3}\s-\s)?[0-9]{1,3})\s/
I nested the braces so you only want the first result from the regex.
it breaks up like this
([0-9]{1,3}\s-\s)?
first, Is there a 1-3 digit number with a space-dash-space - OPTIONAL
then.. does it end in a 1-3 digit number followed by a space.

Starting from your regex:
^[0-9][0-9]\s*+(\s*-\s*[0-9][0-9]+)
You got an extra white space matcher in the second block:
^[0-9][0-9]\s*+(-\s*[0-9][0-9]+)
I would suggest you replace [0-9] with \d
^[\d][\d]\s*+(-\s*[\d][\d]+)
Use a + instead o 2 copies of \d meaning at least one number:
^[\d]+\s*+(-\s*[\d]+)
Make the last block optional, so it matches 99 Fake Address:
^[\d]+\s*+(-\s*[\d]+)?
If you know there's only going to be 1 white space, you could replace \s* with \s?:
^[\d]+\s?(-\s?[\d]+)?
That should match all of them :D

For your example, you can do:
/^(\d+[-\s\d]*)\s/gm
Demo
Explanation:
/^(\d+[-\s\d]*)\s/gm
^ start of line
^ at least 1 digit and as many digits as possible
^ any character of the set -, space, digit
^ zero or more
^ trailing space
^ multiline for the ^ start of line assertion

Another way could be
In [83]: s = '955 - 959 Fake Street'
In [84]: s1 = '95-99 Fake Street'
In [85]: s2 = '95 - 99 Fake Street'
In [86]: s3 = '99 Fake Street'
In [87]: d = re.search(r'^[0-9]+[ ]*(-[ ]*[0-9]+){0,1}', s3)
In [88]: d.group()
Out[88]: '99 '
In [89]: d = re.search(r'^[0-9]+[ ]*(-[ ]*[0-9]+){0,1}', s2)
In [90]: d.group()
Out[90]: '95 - 99'
In [91]: d = re.search(r'^[0-9]+[ ]*(-[ ]*[0-9]+){0,1}', s1)
In [92]: d.group()
Out[92]: '95-99'
In [93]: d = re.search(r'^[0-9]+[ ]*(-[ ]*[0-9]+){0,1}', s)
In [94]: d.group()
Out[94]: '955 - 959'
the character set 0-9 cab be represented by \d like this
d = re.search(r'^[\d]+[ ]*(-[ ]*[\d]+){0,1}', s)
Here, in all the examples, we are searching at the beginning of the string, for a sequence of at least one digit followed by zero or more spaces and optionally followed by at most one sequence of only one - symbol followed by zero or more spaces and at least one or more digits.

Related

Regex to enter a decimal number digit by digit

I have a requirement where user can input only between 0.01 to 100.00 in a textbox. I am using regex to limit the data entered. However, I cannot enter a decimal point, like 95.83 in the regex. Can someone help me fix the below regex?
(^100([.]0{1,2})?)$|(^\d{1,2}([.]\d{1,2})?)$
if I copy paste the value, it passes. But unable to type a decimal point.
Please advice.
Link to regex tester: https://regex101.com/r/b2BF6A/1
Link to demo: https://stackblitz.com/edit/react-9h2xsy
The regex
You can use the following regex:
See regex in use here
^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$
How it works
^(?:...|...|...)$ this anchors the pattern to ensure it matches the entire string
^ assert position at the start of the line
(?:...|...|...) non-capture group - used to group multiple alternations
$ assert position at the end of the line
(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})? first option
(?:\d?[1-9]|[1-9]0) match either of the following
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
(?:\.\d{0,2})? optionally match the following
\. this character . literally
\d{0,2} match any digit between 0 and 2 times
0{0,2}\.(?:\d?[1-9]|[1-9]0) second option
0{0,2} match 0 between 0 and 2 times
\. match this character . literally
(?:\d?[1-9]|[1-9]0) match either of the following options
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
10{2}(?:\.0{0,2})? third option
10{2} match 100
(?:\.0{0,2})? optionally match ., followed by 0 between 0 and 2 times
How it works (in simpler terms)
With the above descriptions for each alternation, this is what they will match:
Any two-digit number other than 0 or 00, optionally followed by any two-digit decimal.
In terms of a range, it's 1.00-99.99 with:
Optional leading zero: 01.00-99.99
Optional decimal: 01-99, or 01.-99, or 01.0-01.99
Any two-digit decimal other than 0 or 00
In terms of a range, it's .01-.99 with:
Optional leading zeroes: 00.01-00.99 or 0.01-0.99
Literally 100, followed by optional decimals: 100, or 100., or 100.0, or 100.00
The code
RegExp vs /pattern/
In your code, you can use either of the following options (replacing pattern with the pattern above):
new RegExp('pattern')
/pattern/
The first option above uses a string literal. This means that you must escape the backslash characters in the string in order for the pattern to be properly read:
^(?:(?:\\d?[1-9]|[1-9]0)(?:\\.\\d{0,2})?|0{0,2}\\.(?:\\d?[1-9]|[1-9]0)|10{2}(?:\\.0{0,2})?)$
The second option above allows you to avoid this and use the regex as is.
Here's a fork of your code using the second option.
Usability Issues
Please note that you'll run into a couple of usability issues with your current method of tackling this:
The user cannot erase all the digits they've entered. So if the user enters 100, they can only erase 00 and the 1 will remain. One option to resolving this is to make the entire non-capture group (with the alternations) optional by adding a ? after it. Whilst this does solve that issue, you now need to keep two regular expression patterns - one for user input and the other for validation. Alternatively, you could just test if the input is an empty string to allow it (but not validate the form until the field is filled.
The user cannot enter a number beginning with .. This is because we don't allow the input of . to go through your validation steps. The same rule applies here as the previous point made. You can allow it though if the value is . explicitly or add a new alternation of |\.
Similarly to my last point, you'll run into the issue for .0 when a user is trying to write something like .01. Again here, you can run the same test.
Similarly again, 0 is not valid input - same applies here.
An change to the regex that covers these states (0, ., .0, 0., 0.0, 00.0 - but not .00 alternatives) is:
^(?:(?:\d?[1-9]?|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]?|[1-9]0)|10{2}(?:\.0{0,2})?)$
Better would be to create logic for these cases to match them with a separate regex:
^0{0,2}\.?0?$
Usability Fixes
With the changes above in mind, your function would become:
See code fork here
handleChange(e) {
console.log(e.target.value)
const r1 = /^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$/;
const r2 = /^0{0,2}\.?0?$/
if (r1.test(e.target.value)) {
this.setState({
[e.target.name]: e.target.value
});
} else if (r2.test(e.target.value)) {
// Value is invalid, but permitted for usability purposes
this.setState({
[e.target.name]: e.target.value
});
}
}
This now allows the user to input those values, but also allows us to invalidate them if the user tries to submit it.
Using the range 0.01 to 100.00 without padding is this (non-factored):
0\.(?:0[1-9]|[1-9]\d)|[1-9]\d?\.\d{2}|100\.00
Expanded
# 0.01 to 0.99
0 \.
(?:
0 [1-9]
| [1-9] \d
)
|
# 1.00 to 99.99
[1-9] \d? \.
\d{2}
|
# 100.00
100 \.
00
It can be made to have an optional cascade if incremental partial form
should be allowed.
That partial is shown here for the top regex range :
^(?:0(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|[1-9]\d?(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
The code line with stringed regex :
const newRegExp = new RegExp("^(?:0(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|[1-9]\\d?(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$");
_________________________
The regex 'partial' above requires the input to be blank or to start
with a digit. It also doesn't allow 1-9 with a preceding 0.
If that is all to be allowed, a simple mod is this :
^(?:0{0,2}(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|(?:[1-9]\d?|0[1-9])(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
which allows input like the following:
(It should be noted that doing this requires allowing the dot . as
a valid input but could be converted to 0. on the fly to be put
inside the input box.)
.1
00.01
09.90
01.
01.11
00.1
00
.
Stringed version :
"^(?:0{0,2}(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|(?:[1-9]\\d?|0[1-9])(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$"

IBAN Regex design [duplicate]

This question already has answers here:
IBAN Validation check
(11 answers)
Closed 4 years ago.
Help me please to design Regex that will match all IBANs with all possible whitespaces. Because I've found that one, but it does not work with whitespaces.
[a-zA-Z]{2}[0-9]{2}[a-zA-Z0-9]{4}[0-9]{7}([a-zA-Z0-9]?){0,16}
I need at least that formats:
DE89 3704 0044 0532 0130 00
AT61 1904 3002 3457 3201
FR14 2004 1010 0505 0001 3
Just to find the example IBAN's from those countries in a text :
Start with 2 letters then 2 digits.
Then allow a space before every 4 digits, optionally ending with 1 or 2 digits:
\b[A-Z]{2}[0-9]{2}(?:[ ]?[0-9]{4}){4}(?!(?:[ ]?[0-9]){3})(?:[ ]?[0-9]{1,2})?\b
regex101 test here
Note that if the intention is to validate a complete string, that the regex can be simplified.
Since the negative look-ahead (?!...) won't be needed then.
And the word boundaries \b can be replaced by the start ^ and end $ of the line.
^[A-Z]{2}[0-9]{2}(?:[ ]?[0-9]{4}){4}(?:[ ]?[0-9]{1,2})?$
Also, it can be simplified even more if having the 4 groups of 4 connected digits doesn't really matter.
^[A-Z]{2}(?:[ ]?[0-9]){18,20}$
Extra
If you need to match an IBAN number from accross the world?
Then the BBAN part of the IBAN is allowed to have up to 30 numbers or uppercase letters. Reference
And can be written with either spaces or dashes or nothing in between.
For example: CC12-XXXX-12XX-1234-5678-9012-3456-7890-123
So the regex pattern to match a complete string with a long IBAN becomes a bit longer.
^([A-Z]{2}[ \-]?[0-9]{2})(?=(?:[ \-]?[A-Z0-9]){9,30}$)((?:[ \-]?[A-Z0-9]{3,5}){2,7})([ \-]?[A-Z0-9]{1,3})?$
regex101 test here
Also note, that a pure regex solution can't do calculations.
So to actually validate an IBAN number then extra code is required.
Example Javascript Snippet:
function smellsLikeIban(str){
return /^([A-Z]{2}[ \-]?[0-9]{2})(?=(?:[ \-]?[A-Z0-9]){9,30}$)((?:[ \-]?[A-Z0-9]{3,5}){2,7})([ \-]?[A-Z0-9]{1,3})?$/.test(str);
}
function validateIbanChecksum(iban) {
const ibanStripped = iban.replace(/[^A-Z0-9]+/gi,'') //keep numbers and letters only
.toUpperCase(); //calculation expects upper-case
const m = ibanStripped.match(/^([A-Z]{2})([0-9]{2})([A-Z0-9]{9,30})$/);
if(!m) return false;
const numbericed = (m[3] + m[1] + m[2]).replace(/[A-Z]/g,function(ch){
//replace upper-case characters by numbers 10 to 35
return (ch.charCodeAt(0)-55);
});
//The resulting number would be to long for javascript to handle without loosing precision.
//So the trick is to chop the string up in smaller parts.
const mod97 = numbericed.match(/\d{1,7}/g)
.reduce(function(total, curr){ return Number(total + curr)%97},'');
return (mod97 === 1);
};
var arr = [
'DE89 3704 0044 0532 0130 00', // ok
'AT61 1904 3002 3457 3201', // ok
'FR14 2004 1010 0505 0001 3', // wrong checksum
'GB82-WEST-1234-5698-7654-32', // ok
'NL20INGB0001234567', // ok
'XX00 1234 5678 9012 3456 7890 1234 5678 90', // only smells ok
'YY00123456789012345678901234567890', // only smells ok
'NL20-ING-B0-00-12-34-567', // stinks, but still a valid checksum
'XX22YYY1234567890123', // wrong checksum again
'droid#i.ban' // This Is Not The IBAN You Are Looking For
];
arr.forEach(function (str) {
console.log('['+ str +'] Smells Like IBAN: '+ smellsLikeIban(str));
console.log('['+ str +'] Valid IBAN Checksum: '+ validateIbanChecksum(str))
});
Here is a suggestion that may works for the patterns you provided:
[A-Z]{2}\d{2} ?\d{4} ?\d{4} ?\d{4} ?\d{4} ?[\d]{0,2}
Try it on regex101
Explanation
[A-Z]{2}\d{2} ? 2 capital letters followed by 2 digits (optional space)
\d{4} ? 4 digits, repeated 4 times (optional space)
[\d]{0,2} 0 to 2 digits
You can use a regex like this:
^[A-Z]{2}\d{2} (?:\d{4} ){3}\d{4}(?: \d\d?)?$
Working demo
This will match only those string formats
It's probably best to look up the specifications for a correct IBAN number. But if you want to have a regex similar to your existing one, but with spaces, you can use the following one:
^[a-zA-Z]{2}[0-9]{2}\s?[a-zA-Z0-9]{4}\s?[0-9]{4}\s?[0-9]{3}([a-zA-Z0-9]\s?[a-zA-Z0-9]{0,4}\s?[a-zA-Z0-9]{0,4}\s?[a-zA-Z0-9]{0,4}\s?[a-zA-Z0-9]{0,3})?$
Here is a live example: https://regex101.com/r/ZyIPLD/1

Regular expression for A123ABC

I have a string in the format A123ABC
First letter cannot contain <I,O,Q,U,Z>
Next 3 digits (0-9) from 21-998
Last 3 letters cannot include <I,Q,Z>
I used the following expression [A-HJ-NPR-TV-Y]{1}[0-9]{2,3}[A-HJ-PR-Y]{3}
But I am not able to restrict the number in the range 21-998.
Your letter part is fine, below is just the numbers portion:
regex = "(?:2[1-9]|[3-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-8])"
(?:...) group, but do not capture.
2[1-9] covers 21-29
[3-9][0-9] covers 30-99
[1-8][0-9][0-9] covers 100-899
9[0-8][0-9] covers 900-989
99[0-8] covers 990-998
| stands for "or"
Note: [0-9] may be replaced by \d. So, a more concise representation would be:
regex = "(?:2\d|[3-9]\d|[1-8]\d{2}|9[0-8]\d|99[0-8])"
One option would be matching (\d+) and checking if that falls in the range 21 - 998 outside a regex, in the language you're using, if possible.
If that is not feasible, you have to break it up (just showing the middle part):
(2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])
Breakdown:
2[1-9] matches 21 - 29
[3-9]\d matches 30 - 99
[1-8]\d\d matches 100 - 899
9[0-8]\d matches 900 - 989
99[0-8] matches 990 - 998
Also, the {1} is superfluous and can be omitted, making the complete regex
[A-HJ-NPR-TV-Y](2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])[A-HJ-PR-Y]{3}
Assuming the numbers between 21 and 99 are displayed with three digits (ie. : 021, 055, 099), here's a solution for the number part :
((02[1-9])|(0[3-9][0-9])|([1-8][0-9]{2})|(9([0-8][0-9])|(9[0-8])))
Entire regex :
[A-HJ-NPR-TV-Y]{1}((02[1-9])|(0[3-9][0-9])|([1-8][0-9]{2})|(9([0-8][0-9])|(9[0-8])))[A-HJ-PR-Y]{3}
There are probably easier ways to do this, but one way would be to use:
^((?=[^IOQUZ])([A-Z]))((02[^0])|(0[3-9]\d)|([1-8]\d\d)|(9[0-8]\d)|(99[0-8]))((?=[^IQZ])([A-Z])){3}$
To explain:
^ denotes the beginning of the string.
((?=[^IOQUZ])([A-Z])) would give you any capital letter not in <I, O, Q, U, Z>.
((02[^0])|(0[3-9]\d)|([1-8]\d\d)|(9[0-8]\d)|(99[0-8])) denotes any number between ((21 to 29) or (30 to 99) or (100 to 899) or (900 to 989) or (990 to 998)).
((?=[^IQZ])([A-Z])){3} would match any three capital letters not in <I, Q, Z>.
$ would denote the end of the string.

Reg-ex, Find x then N characters if N+1 == x

OK here is what I have:
(24(?:(?!24).)*)
its works in the fact it finds from 24 till the next 24 but not the 2nd 24... (wow some logic).
like this:
23252882240013152986400000006090000000787865670000004524232528822400513152986240013152986543530000452400
it finds from the 1st 24 till the next 24 but does not include it, so the strings it finds are:
23252882 - 2400131529864000000060900000007878656700000045 - 2423252882 - 2400513152986 - 24001315298654353000045 - 2400
that is half of what I want it to do, what I need it to find is this:
23252882 - 2400131529864000000060900000007878656700000045 - 2423252882240051315298624001315298654353000045 - 2400
lets say:
x = 24
n = 46
I need to:
find x then n characters if the n+1 character == x
so find the start take then next 46, and the 45th must be the start of the next string, including all 24's in that string.
hope this is clear.
Thanks in advance.
EDIT
answer = 24.{44}(?=24)
You're almost there.
First, find x (24):
24
Then, find n=46 characters, where the 46 includes the original 24 (hence 44 left):
.{44}
The following character must be x (24):
(?=24)
All together:
24.{44}(?=24)
You can play around with it here.
In terms of constructing such a regex from a given x, n, your regex consists of
x.{n-number_of_characters(x)}(?=x)
where you substitute in x as-is and calculate n-number_of_characters(x).
Try this:
(?(?=24)(.{46})|(.{25})(.{24}))
Explanation:
<!--
(?(?=24)(.{46})|(.{25})(.{24}))
Options: case insensitive; ^ and $ match at line breaks
Do a test and then proceed with one of two options depending on the result of the text «(?(?=24)(.{46})|(.{25})(.{24}))»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=24)»
Match the characters “24” literally «24»
If the test succeeded, match the regular expression below «(.{46})»
Match the regular expression below and capture its match into backreference number 1 «(.{46})»
Match any single character that is not a line break character «.{46}»
Exactly 46 times «{46}»
If the test failed, match the regular expression below if the test succeeded «(.{25})(.{24})»
Match the regular expression below and capture its match into backreference number 2 «(.{25})»
Match any single character that is not a line break character «.{25}»
Exactly 25 times «{25}»
Match the regular expression below and capture its match into backreference number 3 «(.{24})»
Match any single character that is not a line break character «.{24}»
Exactly 24 times «{24}»
-->

Regular expression for bit strings with even number of 1s

Let L= { w in (0+1)* | w has even number of 1s}, i.e. L is the set of all bit strings with even number of 1s. Which one of the regular expressions below represents L?
A) (0*10*1)*
B) 0*(10*10*)*
C) 0*(10*1)* 0*
D) 0*1(10*1)* 10*
According to me option D is never correct because it does not represent the bit string with zero 1s. But what about the other options? We are concerned about the number of 1s(even or not) not the number of zeros doesn't matter.
Then which is the correct option and why?
A if false. It doesn't get matched by 0110 (or any zeros-only non-empty string)
B represents OK. I won't bother proving it here since the page margins are too small.
C doesn't get matched by 010101010 (zero in the middle is not matched)
D as you said doesn't get matched by 00 or any other # with no ones.
So only B
To solve such a problem you should
Supply counterexample patterns to all "incorrect" regexps. This will be either a string in L that is not matched, or a matched string out of L.
To prove the remaining "correct" pattern, you should answer two questions:
Does every string that matches the pattern belong to L? This can be done by devising properties each of matched strings should satisfy--for example, number of occurrences of some character...
Is every string in L matched by the regexp? This is done by dividing L into easily analyzable subclasses, and showing that each of them matches pattern in its own way.
(No concrete answers due to [homework]).
Examining the pattern B:
^0*(10*10*)*$
^ # match beginning of string
0* # match zero or more '0'
( # start group 1
10* # match '1' followed by zero or more '0'
10* # match '1' followed by zero or more '0'
)* # end group 1 - match zero or more times
$ # end of string
Its pretty obvious that this pattern will only match strings who have 0,2,4,... 1's.
Look for examples that should match but don't. 0, 11011, and 1100 should all match, but each one fails for one of those four
C is incorrect because it does not allow any 0s between the second 1 of one group and the first 1 of the next group.
This answer would be best for this language
(0*10*10*)
a quick python script actually eliminated all the possibilities:
import re
a = re.compile("(0*10*1)*")
b = re.compile("0*(10*10*)*")
c = re.compile("0*(10*1)* 0*")
d = re.compile("0*1(10*1)* 10*")
candidates = [('a',a),('b',b),('c',c),('d',d)]
tests = ['0110', '1100', '0011', '11011']
for test in tests:
for candidate in candidates:
if not candidate[1].match(test):
candidates.remove(candidate)
print "removed %s because it failed on %s" % (candidate[0], test)
ntests = ['1', '10', '01', '010', '10101']
for test in ntests:
for candidate in candidates:
if candidate[1].match(test):
candidates.remove(candidate)
print "removed %s because it matched on %s" % (candidate[0], test)
the output:
removed c because it failed on 0110
removed d because it failed on 0110
removed a because it matched on 1
removed b because it matched on 10