Regex for UK registration number - regex

I've been playing with creating a regular expression for UK registration numbers but have hit a wall when it comes to restricting overall length of the string in question. I currently have the following:
^(([a-zA-Z]?){1,3}(\d){1,3}([a-zA-Z]?){1,3})
This allows for an optional string (lower or upper case) of between 1 and 3 characters, followed by a mandatory numeric of between 1 and 3 characters and finally, another optional string (lower or upper case) of between 1 and 3 characters.
This works fine but I then want to apply a max length of 7 characters to the entire string but this is where I'm failing. I tried adding a 1,7 restriction to the end of the regex but the three 1,3 checks are superseding it and therefore allowing a max length of 9 characters.
Examples of registration numbers that need to pass are as follows:
A1
AAA111
AA11AAA
A1AAA
A11AAA
A111AAA
In the examples above, the A's represent any letter, upper or lower case, and the 1's represent any digit. The max length is the only restriction that appears not to be working. I disable the entry of spaces, so they can be assumed never to be present in the string.

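If you know what lengths you are after, I'd recommend you check the string length directly (most languages expose something like a .length property). If this is not an option, you could add a lookahead that is anchored to the end of the string: ^(?=.{1,7}$)(([a-zA-Z]?){1,3}(\d){1,3}([a-zA-Z]?){1,3})$. Note the $ inside the lookahead: without it, (?=.{1,7}) only asserts that at least one character follows and does not cap the length at 7.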
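A minimal Python sketch of the lookahead approach, to check it against the examples from the question. The names PLATE and is_valid_plate are made up for illustration, and ([a-zA-Z]?){1,3} is written as the shorter [a-zA-Z]{0,3}, which should accept the same strings.

```python
import re

# Hypothetical names for illustration. The lookahead (?=.{1,7}$) caps the
# total length at 7; the rest is letters, then digits, then letters.
PLATE = re.compile(r"(?=.{1,7}$)[a-zA-Z]{0,3}\d{1,3}[a-zA-Z]{0,3}")


def is_valid_plate(text: str) -> bool:
    """Return True if the whole string matches the registration pattern."""
    return PLATE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["A1", "AAA111", "AA11AAA", "A1AAA", "A11AAA", "A111AAA",
                      "AAA111AA", "AAA111AAA"]:
        print(candidate, is_valid_plate(candidate))
```

The last two candidates are 8 and 9 characters long, so they should be rejected even though each individual part is within its own 1 to 3 limit.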


Open Refine regex for alphabets

I want to extract only the alphabetic characters from my cell.
What I have done:
value.match(/.*?(\^[a-zA-Z]*$).*?/)
but it returns null.
I am trying to clean the address column in my data set. The following are sample addresses:
H3656 GALI#4 BLOCK-D, AREA 1
H#36/17 SECTOR 5D AREA 2
AREA 3 BLOCK-B NORTH NAZIMABAD
GERMANY AL JANNAT BENQUET SECTOR 16 Area 2 with short name
So I first tried to remove all the numbers from my string.
If you want to remove all the numbers, the most direct approach is probably:
value.replace(/\d+/, "")
If for any reason you want to find only the alphabetic characters, as indicated by the title of your question, this will be more effective than a value.match() :
value.find(/\p{L}\s?/).join("")
(\p{L} is a Java regular expression class (OpenRefine is written in Java) that plays the same role as [a-zA-Z] but also matches non-ASCII letters such as accented characters.)
In general, you should avoid using the .match() method unless you know exactly what you are doing. In 90% of cases, it is actually .find() that is desired.
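For readers who want to try the two ideas outside OpenRefine, here is a rough Python analogue (not GREL). The function names are invented for this sketch, and since Python's re module has no \p{L}, the class [^\W\d_] is used as an approximation of "any letter".

```python
import re


def remove_digits(value: str) -> str:
    """Drop every run of digits, like replacing /\\d+/ with an empty string."""
    return re.sub(r"\d+", "", value)


def letters_only(value: str) -> str:
    """Keep each letter plus one optional whitespace character right after it.

    [^\\W\\d_] approximates \\p{L}: word characters minus digits and underscore.
    """
    return "".join(re.findall(r"[^\W\d_]\s?", value))


if __name__ == "__main__":
    sample = "AREA 3 BLOCK-B"
    print(repr(remove_digits(sample)))
    print(repr(letters_only(sample)))
```

Note that the second function drops whitespace that does not directly follow a letter, which mirrors what the \p{L}\s? pattern does.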

Regular expression string division, prioritize the part lengths

I have this string
0Sc-a+nn1.ed_AI&AO1301#89
That has to be split in three parts
0Sc-a+nn1.ed_AI&AO
1301
89
I am using this RE (?P<prefix>[a-z\.\_\-\+(\&)]+\W?)(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?)) in Python, but for now I am testing it on https://regex101.com/.
I am having a problem identifying the first part. If I try "Sc-a+nn.ed_AI&AO1301#89" it works fine, but once the first part contains digits, as in the example, it doesn't.
How can I make the second and third parts (around the #) take the maximum length allowed, while letting the first part contain digits at the beginning and in the middle (never at the end, because those digits belong to the second part)? The ? is there because the preceding element sometimes doesn't exist.
Use [a-zA-Z]{2} to capture the letters after & and, if you know them, specify the length of each part, e.g. [\d]{4}:
(?P<prefix>[A-Za-z0-9._\-+&;]+[a-zA-Z]{2}?)(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?))
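A quick Python check of the pattern above, as a sketch. The helper name split_reference is hypothetical; the pattern is copied as given, and the whole string is required to match. It relies on the prefix ending in two letters, as in both sample strings.

```python
import re

# Pattern taken from the answer above; split_reference is a made-up helper.
PATTERN = re.compile(
    r"(?P<prefix>[A-Za-z0-9._\-+&;]+[a-zA-Z]{2}?)"
    r"(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?))"
)


def split_reference(text: str):
    """Return (prefix, ref_num, subpart_num), or None if the text does not match."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    return match["prefix"], match["ref_num"], match["subpart_num"]


if __name__ == "__main__":
    print(split_reference("0Sc-a+nn1.ed_AI&AO1301#89"))
    print(split_reference("Sc-a+nn.ed_AI&AO1301"))
```

When the #subpart is absent, the third element is None.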

String Finding Alg w/ Lowest Freq Char

I have 3 text files. One with a set of text to be searched through
(ex. ABCDEAABBCCDDAABC)
One contains a number of patterns to search for in the text
(ex. AB, EA, CC)
And the last containing the frequency of each character
(ex.
A 4
B 4
C 4
D 3
E 1
)
I am trying to write an algorithm to find the least frequently occurring character for each pattern and search a string for those occurrences, then check the surrounding letters to see if the string is a match. Currently, I have the characters and frequencies in their own vectors, respectively (where i=0 in each vector would be A and 4, respectively).
Is there a better way to do this? Maybe a faster data structure? Also, what are some efficient ways to check the pattern string against the piece of the text string once the least frequent letter is found?
You can run the Aho-Corasick algorithm. Its complexity (once the preprocessing - whose complexity is unrelated to the text - is done), is Θ(n + p), where
n is the length of the text
p is the total number of matches found
This is essentially optimal. There is no point in trying to skip over letters that appear to be frequent:
If the letter is not part of a match, the algorithm takes unit time.
If the letter is part of a match, then the match includes all letters, irrespective of their frequency in the text.
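For reference, a compact Python sketch of Aho-Corasick (a textbook-style version, not tuned for speed). The function names build_automaton and find_all are chosen for this example only.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists for the patterns."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # failure link of each node
    out = [[]]    # indices of patterns that end at each node
    for idx, pattern in enumerate(patterns):
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)
    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes
    # inherit from the failure link of their parent.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            matches.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return matches


if __name__ == "__main__":
    print(find_all("ABCDEAABBCCDDAABC", ["AB", "EA", "CC"]))
```

The character frequency file is not needed here: the automaton reads each text character once regardless of how common it is.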
You could run an iteration loop that keeps a count of instances and has a check to see if a character has appeared more than a percentage of times based on total characters searched for and total length of the string. i.e. if you have 100 characters and 5 possibilities, any character that has appeared more than 20% of the hundred can be discounted, increasing efficiency by passing over any value matching that one.

Integer range and multiple of

I have a number of fields I want to validate on text entry with a regex for both matching a range (0..120) and must be a multiple of 5.
For example, 0, 5, 25, 120 are valid. 1, 16, 123, 130 are not valid.
I think I have the regex for multiple of 5:
^\d*\d?((5)|(0))\.?((0)|(00))?$
and the regex for the range:
120|1[01][0-9]|[2-9][0-9]
However, I don't know how to combine these; any help is much appreciated!
You can't do that with a simple regex. At least not the range-part (especially if the range should be generic/changeable).
And even if you manage to write the regex, it will be very complex and unreadable.
Write the validation on your own, using a parseStringToInt() function of your language and simple < and > checks.
Update: added another regex (see below) to be used when the range of values is not 0..120 (it can even be dynamic).
The second regex in the question does not match numbers smaller than 20. You can change it to match smaller numbers as well, and to require a last digit of 0 or 5 so that only multiples of 5 match:
\b(120|(1[01]|[0-9])?[05])\b
How it works (starting from inside):
(1[01]|[0-9])? matches 10, 11 or any one-digit number (0 to 9); these are the hundreds and tens in the final number; the question mark (?) after the sub-expression makes it match 0 or 1 times; this way the regex can also match numbers having only one digit (0..9);
[05] that follows matches 0 or 5 as the last digit (the units); only the numbers that end in 0 or 5 are multiples of 5;
everything is enclosed in parentheses because | has the lowest precedence; without them, each \b would apply only to the first or the last alternative;
the outer \b match word boundaries; they prevent the regex from matching only 1..3 digits of a longer number, or a number that is embedded in a string; for example, they prevent it from matching 15 in 150 or 120 in abc120.
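A small Python sketch to sanity-check the expression above against every integer from 0 to 199. The names FIVE_STEP and in_range_multiple_of_5 are made up for this check.

```python
import re

# The expression from the answer; fullmatch makes the whole field count.
FIVE_STEP = re.compile(r"\b(120|(1[01]|[0-9])?[05])\b")


def in_range_multiple_of_5(text: str) -> bool:
    """True if text is a multiple of 5 between 0 and 120 according to the regex."""
    return FIVE_STEP.fullmatch(text) is not None


if __name__ == "__main__":
    accepted = [n for n in range(200) if in_range_multiple_of_5(str(n))]
    print(accepted == list(range(0, 121, 5)))
```

Be aware that the expression also accepts zero-padded forms such as 05, which may or may not be wanted for a text entry field.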
Using dynamic range of values
The regex above is not very complex and it can be used to match numbers between 0 and 120 that are multiples of 5. When the range of values is different it cannot be used any more. It can be modified to match, let's say, numbers between 20 and 120 (as the OP asked in a comment below) but it will become harder to read.
Moreover, if the range of allowed values is dynamic, then a regex cannot practically be used to check that the values are inside the range. Checking that a number is a multiple of 5, however, can still be done with a regex :-)
For dynamic range of values that are multiple of 5 you can use this expression:
\b([1-9][0-9]*)?[05]\b
Parse the matched string as integer (the language you use probably provides such a function or a library that contains it) then use the comparison operators (<, >) of the host language to check if the matched value is inside the desired range.
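A minimal Python sketch of this two-step check, assuming the whole field must be a single number. The function name is_valid and its low/high parameters are illustrative, not part of the question.

```python
import re

# Digits only, no leading zero unless the number is a single digit, ends in 0 or 5.
MULTIPLE_OF_5 = re.compile(r"([1-9][0-9]*)?[05]")


def is_valid(text: str, low: int, high: int) -> bool:
    """Regex checks the 'multiple of 5' shape; plain comparisons check the range."""
    if MULTIPLE_OF_5.fullmatch(text) is None:
        return False
    return low <= int(text) <= high


if __name__ == "__main__":
    for value in ["0", "5", "25", "120", "1", "16", "123", "130"]:
        print(value, is_valid(value, 0, 120))
```

Because the bounds are ordinary arguments, the same function covers 20..120 or any other range without touching the regex.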
At the risk of being painfully obvious
120|1[01][05]|[2-9][05]
Also, why the 2?

find a string with at least n matching elements

I have a list of numbers that I want to find at least 3 of...
here is an example
I have a large list of numbers in a sql database in the format of (for example)
01-02-03-04-05-06
06-08-19-24-25-36
etc etc
basically 6 random numbers between 0 and 99.
Now I want to find the strings where at least 3 of a set of given numbers occurs.
For example:
given: 01-02-03-10-11-12
return the strings that have at least 3 of those numbers in them.
eg
01-05-06-09-10-12 would match
03-08-10-12-18-22 would match
03-09-12-18-22-38 would not
I am thinking that there might be some algorithm or even regular expression that could match this... but my lack of computer science textbook experience is tripping me up I think.
No - this is not a homework question! This is for an actual application!
I am developing in ruby, but any language answer would be appreciated
You can use a string replacement to replace - with | to turn 01-02-03-10-11-12 into 01|02|03|10|11|12. Then wrap it like this:
((01|02|03|10|11|12).*){3}
This will find any of the digit pairs, then ignore any number of characters... 3 times. If it matches, then success.
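The same idea sketched in Python rather than Ruby (the regex itself should carry over). The helper names build_pattern and has_enough are invented here, and the sketch assumes every number is stored as exactly two digits separated by hyphens, as in the examples.

```python
import re


def build_pattern(given: str, minimum: int = 3):
    """Turn '01-02-03' into a regex that needs `minimum` of those pairs."""
    alternatives = "|".join(re.escape(part) for part in given.split("-"))
    return re.compile("(?:(?:" + alternatives + ").*){" + str(minimum) + "}")


def has_enough(candidate: str, pattern) -> bool:
    """True if the candidate contains at least the required number of pairs."""
    return pattern.search(candidate) is not None


if __name__ == "__main__":
    pattern = build_pattern("01-02-03-10-11-12")
    for row in ["01-05-06-09-10-12", "03-08-10-12-18-22", "03-09-12-18-22-38"]:
        print(row, has_enough(row, pattern))
```

If the rows are already loaded in memory, comparing sets of numbers (splitting on the hyphen and counting the intersection) would be an alternative that avoids regex backtracking.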