Regex: Accepting space anywhere but at the beginning - c++

I'm working with Python bindings for Qt4.8 on OS X.
I want to accept any digit and a few other chars AND white space.
The string can be empty or of any length.
What I don't want is for the string to begin or end with white space.
My working example: '[0-9pqw\+\-\*\#\(\)\.][0-9pqw\+\-\*\# \(\)\.]*'
However, I don't want to repeat two blocks, one containing a space and one not. There should be a better way, I guess, employing [^ ], but how?
Second question:
If I want to limit the string's total length, how would I do it?
Thank you.

You could use negative lookarounds at the beginning and end of the pattern:
^(?![ ])[0-9pqw+*# ().-]*(?<![ ])$
Note that the square brackets around the space are not necessary but aid readability. Neither are any of your escapes inside the character class (as long as you put the - at the end).
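As a rough illustration, the lookaround pattern can be tried with Python's built-in re module. This is an assumption: the question targets Qt, and its regex engine may not accept the same syntax. The helper name make_validator and its max_len parameter are hypothetical, and the optional length lookahead is just one possible way to approach the second question; it is not part of the answer above.

```python
import re

# Character class taken from the answer above; '-' is last so it needs no escape.
ALLOWED = r"[0-9pqw+*# ().-]"

def make_validator(max_len=None):
    """Return a function that checks a string against the lookaround pattern.

    max_len is a hypothetical extra: when given, a lookahead rejects
    strings longer than max_len characters.
    """
    length_check = "" if max_len is None else r"(?=.{0,%d}\Z)" % max_len
    pattern = re.compile(
        length_check      # optional total-length limit
        + r"(?![ ])"      # must not start with a space
        + ALLOWED + "*"   # any number of allowed characters, spaces included
        + r"(?<![ ])"     # must not end with a space
    )
    # fullmatch anchors at both ends, so ^ and $ are not needed here.
    return lambda text: pattern.fullmatch(text) is not None

is_valid = make_validator()
is_short = make_validator(max_len=5)
```

With these definitions, is_valid("12 (3)") and is_valid("") accept the input, while is_valid(" 12") and is_valid("12 ") reject it; is_short additionally rejects anything longer than five characters.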

Does this not do what you want?
import re
re.match(r'^[^\W].*[^\W]$', ' aaa ')
(Where the last argument is your test string.)
If you want to ensure the length is less than a certain amount, use curly braces. Two characters are already spent testing the first and last characters of the test string with the [^\W] notation, so .{1,2} leaves room for one or two more. In this example, there is a match when the first and last characters are word characters (so not spaces) and the test string is 3 or 4 characters long.
re.match(r'^[^\W].{1,2}[^\W]$', 'aaaa')

Related

Regex to remove unwanted text in gene sequences

I have gene sequences that can have stray text embedded in them, and I would like to remove that text in a generic way with regex. I'd like to remove all characters, up to 10 chars, between any invalid characters. I am assuming that anything between invalid chars up to 10 chars apart is part of the invalid text.
Example:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
Valid sequence characters are ATCG. Can we create a regex to reduce the above string to
GATCATCGGCCCATGCATGCGGGGATCGCCCCTTTAAAAT?
I understand that the G at the beginning of this final sequence is the last character of the word BEGINNING, which is the "bad" text at the beginning of the string. I realize that with regex it is impossible to identify words, so I am willing to live with this limitation. The same goes for the T at the end, which is the first letter of "THIS".
I've tried to do something with repeated capture groups that allow for a certain number of chars between bad characters, but I can't seem to make it work right. Maybe someone can help me...
This regex does not quite work to capture everything.
([^ACTG].{1,10}[^ACTG])+
Initial string:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
String after replacing non-ACGT:
-A-T--TATT----G-----GATCATCGGCCCATGCAT-----A-T--T--T--------GCGGGGATCGCCCCTTTAAAAT---------T--TATT-------A-T-------
For this sample, a run of up to four ACGT characters can appear in the unwanted text. Examining other samples may give a sensible upper bound.
Perhaps "starts and ends with invalid character and contains no long runs of valid characters" is a better measure to use than "1 to 10 characters, starting and ending with invalid character"?
A regex for this is:
[^ACGT]((?![ACGT]{5,}).)*[^ACGT]
and matches:
BADTEXTATTHEBEGINNIN
MOREBADTEXTINTHEMIDDLE
HISISSOMETEXTATTHEENDIWANTREMOVED
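A minimal sketch of how that pattern could be applied in Python (an assumption, since the question names no language). The names BAD_TEXT and strip_bad_text are hypothetical.

```python
import re

# Pattern from the answer: a span that starts and ends with an invalid
# character and never passes over a run of five or more valid characters.
BAD_TEXT = re.compile(r"[^ACGT]((?![ACGT]{5,}).)*[^ACGT]")

def strip_bad_text(sequence):
    """Remove every span matched by BAD_TEXT from the sequence."""
    return BAD_TEXT.sub("", sequence)

raw = ("BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLE"
       "GCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED")

removed = [m.group(0) for m in BAD_TEXT.finditer(raw)]
cleaned = strip_bad_text(raw)
print(removed)
print(cleaned)
```

Because the pattern needs an invalid character at both ends, a single isolated invalid character would not be matched; it would need separate handling. The threshold of five is tuned to this one sample and may need adjusting for other data.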

Regular expression to check strings containing a set of words separated by a delimiter

As the title says, I'm trying to build up a regular expression that can recognize strings with this format:
word!!cat!!DOG!! ... Phone!!home!!
where !! is used as a delimiter. Each word must have a length between 1 and 5 characters. Empty words are not allowed, i.e. no strings like !!,!!!! etc.
A word can only contain alphabetical characters between a and z (case insensitive). After each word I expect to find the special delimiter !!.
I came up with the solution below, but since I need to add other checks (e.g. words can contain spaces) I would like to know if I'm on the right track.
(([a-zA-Z]{1,5})([!]{2}))+
Also note that empty strings are not allowed, hence the use of +
Help and advice are very welcome since I just started learning how to build regular expressions. I ran some tests using http://regexr.com/ and it seems to be okay, but I want to be sure. Thank you!
Examples that shouldn't match:
a!!b!!aaaaaa!!
a123!!b!!c!!
aAaa!!bbb
aAaa!!bbb!
Splitting the string and using the values between the !!
It depends on what you want to do with the regular expression. If you want to match the values between the !!, here are two ways:
Matching with groups
([^!]+)!!
[^!]+ requires at least 1 character other than !
!! instead of [!]{2} because it is the same but much more readable
Matching with lookahead
If you only want to match the actual word (and not the two !), you can do this by using a positive lookahead:
[^!]+(?=!!)
(?=) is a positive lookahead. It requires everything inside, i.e. here !!, to be directly after the previous match. It however won't be in the resulting match.
Validating the string
If you however want to check the validity of the whole string, then you need something like this:
^([^!]+!!)+$
^ start of the string
$ end of the string
It requires the whole string to consist only of ([^!]+!!), repeated one or more times.
If [^!] does not fit your requirements, you can of course replace it with [a-zA-Z] or similar.
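For the rules stated in the question (1 to 5 letters per word, each followed by !!), the validation and lookahead ideas could be combined roughly as follows. This is a sketch in Python, which is an assumption as the question names no language; is_valid and words are hypothetical helper names.

```python
import re

# Whole-string check: one or more words of 1-5 letters, each followed by "!!".
WORD_LIST = re.compile(r"(?:[a-zA-Z]{1,5}!!)+")
# Extraction: a word that is directly followed by "!!" (lookahead, not consumed).
WORD = re.compile(r"[a-zA-Z]{1,5}(?=!!)")

def is_valid(text):
    """True when the whole string follows the word!!word!! format."""
    return WORD_LIST.fullmatch(text) is not None

def words(text):
    """Return the words of a valid string, or an empty list otherwise."""
    return WORD.findall(text) if is_valid(text) else []

print(words("word!!cat!!DOG!!"))
print(is_valid("a!!b!!aaaaaa!!"))
```

Validating first and extracting second keeps the extraction pattern simple, since it only ever runs on strings already known to be well formed.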

Regex exact length of whole string

I want to match a string that is exactly 3 characters long. I am using the following regex:
("\\d?[A-Za-z]{2,3}\\d?")
Here the string can have 1 digit either at the start or at the end, or it can have 3 letters. Is there any way to define the length of the matching string, like:
("(\\d?[A-Za-z]{2,3}\\d?){3}") // it does not work
I have another solution of it.
("(\\d[A-Za-z]{2})|([A-Za-z]{2}\\d)|([A-Za-z]{3})")
But I just want to know if there is any way to define length of whole matching string.
^.{3}$
If this isn't really your answer, you need to specify the problem better. You have zero solutions, not several. What exactly are you trying to match? Give a couple of examples.
http://www.regexplanet.com/advanced/java/index.html
^(\d[a-zA-Z]{2}|[a-zA-Z]{2}\d|[a-zA-Z]{3})$
If you want that letters and numbers thing.
If you want to allow more text after the three characters, without the string ending there, you can just look for the whitespace that follows.
^(\d[a-zA-Z]{2}|[a-zA-Z]{2}\d|[a-zA-Z]{3})\s
From the comments:
So it's
^[^\s]{3}\s\d{7}\s.\d{6}
? -- '^' start of line, '[^\s]' not a space. '{3}' three of those. '\s' a space. '\d' a digit. '{7}' seven of those. '\s' a space. '.' some character. '\d' a digit. '{6}' six of those.
Regex is basically just a programmatic way of describing what you're looking for. If you can state precisely what you want to match, it's easy to write that directly as a regex.
Your three solutions will also match longer strings. I suggest using a word boundary (\b) or line boundaries (^ and $):
\b([a-zA-Z]{2}\d|\d[a-zA-Z]{2}|[a-zA-Z]{3})\b
or
^([a-zA-Z]{2}\d|\d[a-zA-Z]{2}|[a-zA-Z]{3})$
based on the specific usage.
EDIT: fixed the regex, matching also 3 digits.
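To compare the approaches, here is a small Python sketch (the thread itself uses Java string literals, so the language is an assumption). ALTERNATION is the anchored pattern from the answers; WITH_LOOKAHEAD is a variant not given in the thread that keeps the asker's original \d?[A-Za-z]{2,3}\d? and pins the total length with a lookahead. The name is_code is hypothetical.

```python
import re

# Anchored alternation from the answers (fullmatch supplies the anchoring).
ALTERNATION = re.compile(r"\d[a-zA-Z]{2}|[a-zA-Z]{2}\d|[a-zA-Z]{3}")

# Variant, not from the thread: the lookahead asserts that exactly three
# characters remain before the end, then the original pattern does the rest.
WITH_LOOKAHEAD = re.compile(r"(?=.{3}\Z)\d?[A-Za-z]{2,3}\d?")

def is_code(text, pattern=ALTERNATION):
    """True when the whole string is three characters of the allowed shape."""
    return pattern.fullmatch(text) is not None

for sample in ["1ab", "ab1", "abc", "1ab2", "abcd", "a1b", "123"]:
    print(sample, is_code(sample), is_code(sample, WITH_LOOKAHEAD))
```

Both patterns should agree on every input; the lookahead form mainly helps when the body of the pattern is too complex to enumerate as alternatives.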

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the overall length before attempting the match. It looks for between 3 and 10 characters in the class [\d-], followed by a dash and a digit. After that comes the actual match, which confirms that the format of your string is digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so there is no need to put a {5,n} quantifier somewhere; it would be useless).
The patterns are also built to fail fast:
I have chosen to start them with a digit \d; this way, each beginning of a line or word boundary not followed by a digit is immediately discarded. Also, having consumed exactly one digit, I know how much of the string may remain.
Then I test the upper limit of the string length with a negative lookahead that tests whether there is one more character than the maximum length (if there are 12 characters at this position, there are at least 13 characters in the string). There is no need to use anything more descriptive than the dot meta-character here; the goal is to quickly test the length.
Finally, I describe the end of the string without doing anything in particular. That is probably the slowest part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.
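The length-limiting lookahead can be checked against the sample lines from the question with a short Python sketch (the language is an assumption; the thread only uses online testers). ID_PATTERN and SAMPLES are hypothetical names, and fullmatch stands in for the ^...$ anchors.

```python
import re

# One digit, then a negative lookahead that fails if 12 or more characters
# follow (13+ characters in total), then the digit-dash-digit-dash-digit shape.
ID_PATTERN = re.compile(r"\d(?!.{12})\d*-\d+-\d")

SAMPLES = [
    "22-22-1", "22-22-22", "333-333-1", "333-4444-1", "4444-4444-1",
    "4444-55555-1", "55555-4444-1", "666666-7777777-1",
    "88888888-88888888-1", "1-1-1", "88888888-88888888-22",
    "22-333-", "333-22",
]

matches = [s for s in SAMPLES if ID_PATTERN.fullmatch(s)]
print(matches)
```

A total of 12 characters corresponds to the limit of 10 digits plus the two hyphens, which is why the lookahead rejects anything with 12 or more characters after the first digit.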

How to make a regular expression looking for a list of extensions separated by a space

I want to be able to take a string of text from the user that should be formatted like this:
.ext1 .ext2 .ext3 ...
Basically, I am looking for a dot, a string of alphanumeric characters of any length, a space, and rinse and repeat. I am a little confused about how to say "I need a period, a string of characters and a space". Also, the last extension could be followed by nothing, by a space, or by a series of spaces. And I guess the extensions in between could be followed by any number of spaces?
EDIT: I made it clearer what I was looking for.
Thanks!
Try this:
^(?:\.[A-Za-z0-9]+ +)*\.[A-Za-z0-9]+ *$
(Rubular)
In a Java string literal you need to escape the backslashes:
"^(?:\\.[A-Za-z0-9]+ +)*\\.[A-Za-z0-9]+ *$"
(\.\w+)\s* Match this and get your results.
^((\.\w+)\s*)*$ Check this and if it's true, your String is exactly what you want.
For the last pattern thing, you can't (AFAIK) do both getting all extensions (separated) and checking that the last is followed by other things. Either you check your string, or you extract the extensions from it.
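One way to do both is in two steps: validate the whole string first, then extract. A minimal Python sketch follows (the answers use Java string literals, so the language is an assumption). It borrows the stricter pattern that requires at least one space between extensions; parse_extensions is a hypothetical name.

```python
import re

# Whole-string check: extensions separated by one or more spaces,
# with optional trailing spaces after the last one.
EXT_LIST = re.compile(r"(?:\.[A-Za-z0-9]+ +)*\.[A-Za-z0-9]+ *")
# Single extension, used for extraction once the string is known to be valid.
EXT = re.compile(r"\.[A-Za-z0-9]+")

def parse_extensions(text):
    """Return the list of extensions, or None if the format is wrong."""
    if EXT_LIST.fullmatch(text) is None:
        return None
    return EXT.findall(text)

print(parse_extensions(".ext1 .ext2 .ext3"))
print(parse_extensions(".ext1.ext2"))
```

Returning None for malformed input, rather than an empty list, lets the caller tell "invalid" apart from "valid but empty", although this pattern never accepts an empty string.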
I'd start with something like: ^\.[a-z0-9]+([\t\n\v ]+\.[a-z0-9]+)*$