Regex to detect filling character length with periods

I'm trying to build some regex that would detect when someone is trying to "fill out" their username with dots.
There are a few other requirements:
username must contain only letters, numbers and dots
username must start and end with a letter or number
username must not contain two or more consecutive dots
minimum of 6 characters (letters and numbers)
e.g.:
a.b.c.d.e.6 is allowed (not caught) because it has 6 characters
a.b.c.d.5 is not (is caught) because it does not have the prerequisite 6 characters
The regex is built so that a match means the username is rejected.
What I have thus far is:
/[^a-z0-9.]|^\.|\.$|\.{2,}|\S{31,}|^\S{0,5}$/i
This catches:
any characters that aren't letters, numbers, dots
can't start with a dot
can't end with a dot
can't have 2 or more consecutive dots
can't have 31 or more characters
can't have 5 or fewer characters
I've tried dozens of different ways to get that last check in place, but they've all either broken the entire check, included the allowable (a.b.c.d.e.6) or just not worked.
The one that I've come closest with is:
(\.{1}[a-z0-9]{1,}){1,3}\S{1,}$
The problem with this is that it's also catching 123.456 (which should be allowed / not caught)
Other examples of character strings that it should catch:
asdf.g
a.sdfg
a.sdf.g
as.df.g
I'm trying to do this using only regex, without having to pre-format it using JS.

Ok, after much experimentation I've actually found the answer. It turns out that finding the non-permitted strings was actually easier (for me anyway):
/^(\w\.?){4}\w$/
same expression expanded:
/^\w\.?\w\.?\w\.?\w\.?\w$/
This will catch any string made up of exactly 5 characters (letters, numbers or underscores) with optional single dots between them.
The full regex that I'm using also catches:
Strings of 31 or more characters (alphanumeric and periods).
Any characters that are not alphanumeric and periods.
Any string starting with a period
Any string ending with a period
Any string that has 2 or more consecutive periods
And a newcomer to the list: Any string that has 8 or more numeric digits without any alpha.
/^(\w\.?){4}\w$|^\w{0,5}$|\w{31,}|[^a-z0-9.]|^\.|\.$|\.{2,}|\d{8,}/i
I've tested this with all the possible combinations that I can think of on regex101 here: https://regex101.com/r/xI7wZ3/1
And it works! (yay)
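A quick way to sanity-check the combined expression outside regex101 is to run the examples from the question through it. The sketch below uses Python's re module, which is an assumption (the question targets JavaScript, but the pattern uses no engine-specific syntax), and the names is_rejected and SHORT_FILL are hypothetical. It also exposes one gap: a dotted string with fewer than 5 letters and digits, such as a.b.c, is not caught, because (\w\.?){4}\w requires exactly 5 word characters. Relaxing the quantifier to {0,4} is one possible fix, not part of the original answer.

```python
import re

# Combined pattern from the answer: a match means the username is rejected.
REJECT = re.compile(
    r"^(\w\.?){4}\w$|^\w{0,5}$|\w{31,}|[^a-z0-9.]|^\.|\.$|\.{2,}|\d{8,}",
    re.IGNORECASE,
)

# Assumed variant (not from the answer): {0,4} also catches 1 to 4 characters
# padded with single dots, not only exactly 5.
SHORT_FILL = re.compile(r"^(\w\.?){0,4}\w$")


def is_rejected(username: str) -> bool:
    """Return True if the answer's pattern matches anywhere in the username."""
    return REJECT.search(username) is not None


for name in ["a.b.c.d.e.6", "123.456", "a.b.c.d.5", "asdf.g", "a.b.c"]:
    verdict = "rejected" if is_rejected(name) else "allowed"
    short = "yes" if SHORT_FILL.search(name) else "no"
    print(f"{name}: {verdict} by the answer's pattern, caught by SHORT_FILL: {short}")
```

If the variant is adopted, it would replace the first alternative of the combined pattern; the other alternatives stay as they are.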

Related

Regex to remove unwanted text in gene sequences

I have gene sequences that can have actual string text in them I want to remove with regex. I would like to try to remove the errant text in a generic way with regex. I'd like to remove all characters up to 10 chars between any invalid characters. I am assuming that anything between invalid chars up to 10 chars apart is part of the invalid text.
Example:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
Valid sequence characters are ATCG. Can we create a regex to reduce the above string to
GATCATCGGCCCATGCATGCGGGGATCGCCCCTTTAAAAT?
I understand that the G at the beginning of this final sequence is the last character of the word BEGINNING, which is the "bad" text at the beginning of the string. I realize that with regex it is impossible to identify words, so I am willing to live with this limitation. The same goes for the T at the end, which is the first letter of "THIS".
I've tried to do something with repeated capture groups that allow for a certain number of chars between bad characters, but I can't seem to make it work right. Maybe someone can help me...
This regex does not quite work to capture everything.
([^ACTG].{1,10}[^ACTG])+
Initial string:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
String after replacing non-ACGT:
-A-T--TATT----G-----GATCATCGGCCCATGCAT-----A-T--T--T--------GCGGGGATCGCCCCTTTAAAAT---------T--TATT-------A-T-------
For this sample, a run of up to four ACGT characters can appear in the unwanted text. Examining other samples may give a sensible upper bound.
Perhaps "starts and ends with invalid character and contains no long runs of valid characters" is a better measure to use than "1 to 10 characters, starting and ending with invalid character"?
A regex for this is:
[^ACGT]((?![ACGT]{5,}).)*[^ACGT]
and matches:
BADTEXTATTHEBEGINNIN
MOREBADTEXTINTHEMIDDLE
HISISSOMETEXTATTHEENDIWANTREMOVED
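To see the effect on the sample, the pattern can be applied as a substitution. This is a minimal sketch in Python (the language is an assumption, since the question does not name one); the variable names are illustrative only.

```python
import re

# Pattern from the answer: starts and ends with a non-ACGT character and never
# steps onto a position that begins a run of 5 or more valid bases.
BAD_TEXT = re.compile(r"[^ACGT]((?![ACGT]{5,}).)*[^ACGT]")

raw = (
    "BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLE"
    "GCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED"
)

# Collect the unwanted spans, then delete them from the sequence.
matches = [m.group(0) for m in BAD_TEXT.finditer(raw)]
cleaned = BAD_TEXT.sub("", raw)

print(matches)
print(cleaned)
```

The threshold of 5 comes from this one sample, where the longest run of valid characters inside the unwanted text is 4; other data may need a different bound.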

How to write a Regex that identifies specific letters plus a minimum amount of numbers

I'm trying to write a regex that can locate IDs in a body of text. The ID starts with "DW" and has a minimum of 5 numbers after that. It will only have numbers and no other characters following that.
Correct Examples
DW40056
DW4000057
Wrong Examples
DW4005
DW405679fg
Use word boundaries around DW followed by 4 digits then one or more digits:
\bDW\d{4}\d+\b
The word boundaries prevent matches with input such as ABCDW12345XYZ etc.
Although you could code the digits part as \d{5,}, which is simpler than \d{4}\d+, not all engines support open-ended quantity ranges. Since you haven’t indicated the language/tool you’re using, this regex is going to work in more situations.
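As a rough illustration of the word-boundary version, here is a short Python sketch (Python is an assumption, since the question names no tool) that pulls IDs out of a made-up sentence containing the examples from the question.

```python
import re

# Word boundaries on both sides, "DW", then at least five digits.
ID_PATTERN = re.compile(r"\bDW\d{4}\d+\b")

# Hypothetical body of text mixing the valid and invalid examples.
text = "Valid: DW40056 and DW4000057. Invalid: DW4005, DW405679fg, ABCDW12345XYZ."

found = ID_PATTERN.findall(text)
print(found)
```

Only the two valid IDs are returned: the short one lacks a fifth digit, and the other two fail a word-boundary check.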
Try this pattern: DW\d{5,}$
Explanation:
DW is the two characters that the ID starts with
\d matches the digits 0-9
{5,} means \d must appear five or more times
$ means the end of the string, so the pattern only accepts strings that end with the digits (no other characters after them)
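One thing worth checking with this pattern is how the $ anchor behaves when the ID sits in the middle of a longer text. The Python sketch below is only an illustration (the language and the sample strings are assumptions): the pattern accepts an ID at the very end of the input, rejects one followed by more text, and, having no leading boundary, still matches inside a longer word.

```python
import re

# Pattern from this answer: anchored at the end only, no leading boundary.
END_ANCHORED = re.compile(r"DW\d{5,}$")

samples = [
    "DW40056",                # ID is the whole string
    "DW4005",                 # too few digits
    "DW405679fg",             # letters after the digits
    "ID DW40056 was logged",  # ID in the middle of a sentence
    "ABCDW12345",             # ID glued to preceding letters
]

for s in samples:
    m = END_ANCHORED.search(s)
    print(repr(s), "->", m.group(0) if m else None)
```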

Verifying that a string starts with a number (easy) OR exactly 3 letters?

I'm trying to make a RegEx expression to verify that a field starts with either the number 3 - the easy part - or with exactly three letters, and then continues with numbers only.
My expression so far is
^((3)[\d])|([a-zA-Z]{3}[\d])$
The expression stops you from doing anything BELOW 3, but it still lets you go over...
I've done some searching and can't find a topic that relates to the issue of having an exact amount of characters
And I'm having trouble with limiting it to exactly 3 letter characters. Unfortunately what I'm working with, it HAS to be RegEx and not another language.
^(?:3|[a-zA-Z]{3})\d+$
This verifies that your string starts with either 3 or three letters and is then followed only by numbers (at least one) until the end of the string.
See https://regex101.com/r/tD2nK4/3 for some positive and negative examples
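For a quick local check, the following Python sketch (Python is an assumption; the question only says it must be a regex) compares this pattern with the expression from the question. My reading of why the original lets extra letters through is that the | splits the whole expression, so ^ applies only to the first alternative and $ only to the second.

```python
import re

# Pattern from this answer: the anchors wrap the whole alternation.
FIELD = re.compile(r"^(?:3|[a-zA-Z]{3})\d+$")

# Expression from the question, kept for comparison.
ORIGINAL = re.compile(r"^((3)[\d])|([a-zA-Z]{3}[\d])$")

for value in ["3123", "abc123", "abcd123", "ab123", "4123", "abcd1", "35xyz"]:
    print(value, bool(FIELD.search(value)), bool(ORIGINAL.search(value)))
```

The last two values show the difference: the original accepts both, while the anchored non-capturing group rejects them.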
This regex should do exactly what you want:
^((3)[\d])|([a-zA-Z]{3}[^a-zA-Z])
Please note that this regex can only cope with the ASCII alphabet.

Identifying number sequences with optional punctuation

I am trying to identify account numbers in different formats using a single regex. The following are the different formats I need to detect:
12-34-56-78-9
12-3456-78-9
123-456-789
1-23-45678-9
We need to detect "-" in between the digits of a 9-digit number, but there is no clue where the "-" could appear. As of now, I am creating a regex for each individual format and detecting it. Is there a simple regex to detect all of the above in a single shot?
Here you go, that's a pretty simple pattern:
^(?:\d-?){8}\d$
It simply means: find a digit (\d), optionally followed by a hyphen (-?), 8 times in a row ({8}), then the last digit (\d). This prevents a hyphen from being the first or last character, and it also prevents two hyphens in a row.
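The behaviour described above can be tried out with a few lines of Python (the language is an assumption, as the question does not name one). The invalid samples are made up to exercise the leading-hyphen, trailing-hyphen, double-hyphen and wrong-digit-count cases.

```python
import re

# Eight "digit plus optional hyphen" units, then the ninth digit.
ACCOUNT = re.compile(r"^(?:\d-?){8}\d$")

valid = ["12-34-56-78-9", "12-3456-78-9", "123-456-789", "1-23-45678-9", "123456789"]
invalid = ["-123456789", "123456789-", "12--3456789", "12-34-56-78", "1234567890"]

for number in valid + invalid:
    print(number, bool(ACCOUNT.search(number)))
```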

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the information before attempting the match. So it looks for between 3-10 characters in the class of [\d-] followed by a dash and a digit. And then after that you have the actual match to confirm that the format of your string is actually digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
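The lists above can be reproduced with a short script. This is a sketch in Python (an assumption; the question was tested on RegExr) that runs the lookahead pattern against a selection of the strings mentioned in the question and in this answer.

```python
import re

# Lookahead caps the overall length; the main part enforces digits-dash-digits-dash-digit.
LOOKAHEAD = re.compile(r"(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b")

should_match = ["22-22-1", "4444-55555-1", "1-1-1", "22-7777777-1", "1-88888888-1"]
should_fail = ["22-22-22", "666666-7777777-1", "88888888-88888888-1", "22-333-", "333-22"]

for text in should_match + should_fail:
    print(text, bool(LOOKAHEAD.search(text)))
```

The lookahead allows at most 10 characters before the final dash and digit, which works out to 12 characters in total: 10 digits plus the two hyphens.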
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design, these two patterns match at least three digits separated by hyphens, so there is no need to put a {5,n} quantifier anywhere.
The patterns are also built to fail fast:
I have chosen to start them with a single digit \d, so every line start or word boundary that is not followed by a digit is discarded immediately. Consuming exactly one digit also means I know how much of the string may remain.
Then I test the upper limit of the string length with a negative lookahead that checks whether there is one more character than the maximum allowed (if there are 12 characters after this position, the string has at least 13 characters). There is no need for anything more descriptive than the dot metacharacter here; the goal is only to test the length quickly.
Finally, I describe the rest of the string without doing anything special. That is probably the slowest part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.
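As a rough check of the multiline version, the sample text from the question can be scanned in one pass. The sketch below uses Python's re.MULTILINE flag as the equivalent of the m modifier (the choice of Python is an assumption).

```python
import re

# Sample lines from the question, joined into one block of text.
SAMPLES = "\n".join([
    "22-22-1", "22-22-22", "333-333-1", "333-4444-1", "4444-4444-1",
    "4444-55555-1", "55555-4444-1", "666666-7777777-1",
    "88888888-88888888-1", "1-1-1", "88888888-88888888-22",
    "22-333-", "333-22",
])

# One leading digit, a length check via negative lookahead, then the format.
LINE_PATTERN = re.compile(r"^\d(?!.{12})\d*-\d+-\d$", re.MULTILINE)

accepted = LINE_PATTERN.findall(SAMPLES)
print(accepted)
```

Because the dot does not cross line breaks by default, the .{12} check only ever measures the current line.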