Regex to remove unwanted text in gene sequences

I have gene sequences that can contain stray text, which I want to remove with a regex. I would like to remove the errant text in a generic way: remove all characters between any two invalid characters that are up to 10 characters apart. I am assuming that anything between invalid characters up to 10 characters apart is part of the invalid text.
Example:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
Valid sequence characters are ATCG. Can we create a regex to reduce the above string to
GATCATCGGCCCATGCATGCGGGGATCGCCCCTTTAAAAT?
I understand that the G at the beginning of this final sequence is the last character of the word BEGINNING, which is the "bad" text at the beginning of the string. I realize that a regex cannot identify words, so I am willing to live with this limitation. The same goes for the T at the end, which is the first letter of "THIS".
I've tried to do something with repeated capture groups that allow for a certain number of chars between bad characters, but I can't seem to make it work right. Maybe someone can help me...
This regex does not quite work to capture everything.
([^ACTG].{1,10}[^ACTG])+

Initial string:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
String after replacing non-ACGT:
-A-T--TATT----G-----GATCATCGGCCCATGCAT-----A-T--T--T--------GCGGGGATCGCCCCTTTAAAAT---------T--TATT-------A-T-------
For this sample, a run of up to four ACGT characters can appear in the unwanted text. Examining other samples may give a sensible upper bound.
Perhaps "starts and ends with invalid character and contains no long runs of valid characters" is a better measure to use than "1 to 10 characters, starting and ending with invalid character"?
A regex for this is:
[^ACGT]((?![ACGT]{5,}).)*[^ACGT]
and matches:
BADTEXTATTHEBEGINNIN
MOREBADTEXTINTHEMIDDLE
HISISSOMETEXTATTHEENDIWANTREMOVED
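As a minimal sketch, the pattern above can be applied with Python's re module to check it against the sample. The names BAD_TEXT and strip_bad_text are hypothetical, the inner group is made non-capturing so that findall returns whole matches, and the threshold of five is an assumption taken from this single sample.

```python
import re

# Pattern from the answer above, with the inner group made non-capturing.
# The threshold of 5 is an assumption based on this one sample.
BAD_TEXT = re.compile(r"[^ACGT](?:(?![ACGT]{5,}).)*[^ACGT]")


def strip_bad_text(seq: str) -> str:
    """Remove stretches that start and end with a non-ACGT character
    and contain no run of five or more ACGT characters."""
    return BAD_TEXT.sub("", seq)


sample = "BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED"

print(BAD_TEXT.findall(sample))
print(strip_bad_text(sample))
```

Note that the pattern needs two invalid characters to anchor a match, so a single invalid character sitting between two long valid runs would be left in place and would need a separate clean-up step.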

Related

Regular Expression: Two words in any order but with a string between?

I want to use positive lookaheads so that the regex picks up two words from two different sets in any order, with a string of 1 to 20 characters always between them.
It is already case-insensitive and allows any number of characters (including zero) before the first word found and after the second word found. I am unsure whether it would be more correct to terminate with $.
Without the any-order matching, this is what I have so far:
(?i:.*(new|launch|releas)+.{1,20}(product1|product2)+.*)
I have attempted to add any order matching with the following but it only picks up the first word:
(?i:.*(?=new|launch|releas)+.{1,20}(?=product1|product2)+.*)
I thought perhaps this was because of the +.{1,20} in the middle, but I am unsure how it could work if I added this to both sets instead. For instance, it could cause a problem if the first word is at the very start of the source text, with no character before it.
I have seen examples where \b is used with lookaheads, but that also seems like it may cause a problem, as I want it to match both when the first word is at the start of the source text and when it is not.
How should I edit my regex here, please?

Regular expression to check strings containing a set of words separated by a delimiter

As the title says, I'm trying to build up a regular expression that can recognize strings with this format:
word!!cat!!DOG!! ... Phone!!home!!
where !! is used as a delimiter. Each word must have a length between 1 and 5 characters. Empty words are not allowed, i.e. no strings like !!,!!!! etc.
A word can only contain alphabetical characters between a and z (case insensitive). After each word I expect to find the special delimiter !!.
I came up with the solution below but since I need to add other controls (e.g. words can contain spaces) I would like to know if I'm on the right way.
(([a-zA-Z]{1,5})([!]{2}))+
Also note that empty strings are not allowed, hence the use of +
Help and advice are very welcome since I have just started learning how to build regular expressions. I ran some tests using http://regexr.com/ and it seems to be okay, but I want to be sure. Thank you!
Examples that shouldn't match:
a!!b!!aaaaaa!!
a123!!b!!c!!
aAaa!!bbb
aAaa!!bbb!
Splitting the string and using the values between the !!
It depends on what you want to do with the regular expression. If you want to match the values between the !!, here are two ways:
Matching with groups
([^!]+)!!
[^!]+ requires at least 1 character other than !
!! instead of [!]{2} because it is the same but much more readable
Matching with lookahead
If you only want to match the actual word (and not the two !), you can do this by using a positive lookahead:
[^!]+(?=!!)
(?=...) is a positive lookahead. It requires its contents (here, !!) to appear directly after the text matched so far, but it does not include them in the resulting match.
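A small Python sketch of the two approaches, assuming the input looks like the sample in the question. The variable names are hypothetical.

```python
import re

text = "word!!cat!!DOG!!"

# Capture group: each match includes "!!", but the group holds only the word.
with_group = re.findall(r"([^!]+)!!", text)

# Lookahead: "!!" must follow the word but is not part of the match.
with_lookahead = re.findall(r"[^!]+(?=!!)", text)

print(with_group)
print(with_lookahead)
```

Both calls return the same list of words here. The difference only shows in the full match, which includes the delimiter in the first pattern and excludes it in the second.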
Validating the string
If you however want to check the validity of the whole string, then you need something like this:
^([^!]+!!)+$
^ start of the string
$ end of the string
It requires the whole string to consist only of ([^!]+!!), repeated one or more times.
If [^!] does not fit your requirements, you can of course replace it with [a-zA-Z] or similar.
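As a sketch of the validation variant, the following Python snippet swaps [^!]+ for the question's [a-zA-Z]{1,5} and checks it against the examples from the question. The names WORD_LIST and is_valid are hypothetical, and fullmatch is used in place of explicit ^ and $ anchors.

```python
import re

# One or more words of 1 to 5 letters, each followed by the "!!" delimiter.
WORD_LIST = re.compile(r"(?:[a-zA-Z]{1,5}!!)+")


def is_valid(s: str) -> bool:
    """Return True only if the whole string is a sequence of word!! items."""
    return WORD_LIST.fullmatch(s) is not None


for candidate in ["word!!cat!!DOG!!Phone!!home!!",
                  "a!!b!!aaaaaa!!",
                  "a123!!b!!c!!",
                  "aAaa!!bbb",
                  "aAaa!!bbb!"]:
    print(candidate, is_valid(candidate))
```

Allowing spaces inside words, as mentioned in the question, would mean widening the character class, for example to [a-zA-Z ]. Whether a word may then consist only of spaces is a decision the question leaves open.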

trim spaces on captured group regex

After searching everywhere, "How can whitespace be trimmed from a Regex capture group?" seems to be the closest thing, and I can almost taste it, but still no cigar...
So here goes, for my very first post to StackOverflow...
I am trying to capture exactly 27 characters of a string (for reasons that really don't matter).
So I used regex: .{27}
on this string "Joey Went to the Store and found some great steaks!"
the result was "Joey Went to the Store and "
Bingo exactly 27 characters.
But that result is causing errors in my program because of the trailing space. So now I also need to trim the space after "and", so that the final result is "Joey Went to the Store and".
Here's the kicker: I need it all to work from a single regex, because the application can only apply one regex (really dumb program, but I'm stuck with it).
Take a look at this regex:
^.{26}[^\s]?
It will match 26 characters starting from the beginning of the line and will match the 27th only if it is not a whitespace character.
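A quick way to check this behaviour is the Python sketch below. The function name first_27_trimmed is hypothetical, and \S is used as the equivalent of [^\s].

```python
import re


def first_27_trimmed(s: str):
    """Return the first 27 characters, or only 26 if the 27th is whitespace.
    Returns None when the string has fewer than 26 characters."""
    m = re.match(r".{26}\S?", s)  # re.match anchors at the start, like ^
    return m.group(0) if m else None


sentence = "Joey Went to the Store and found some great steaks!"
print(repr(first_27_trimmed(sentence)))
```

This only drops a space in position 27. If position 26 is also whitespace, the result still ends in a space, so the pattern fits the example but is not a general trim.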

Regular expression for extracting substring in the middle of a string

Looking for a regular expression that extracts multiple characters, at different locations, in my string. For example, the string I'm working with is 5490028400316201600008 and it will always be this same length, but the numbers can change.
I would like to extract the first 9 characters, then skip the next 8, extract the next 4, then ignore the last character. The resulting string would be 5490028400000 in this case. I can't seem to find an easy way to do this and I'm fairly new to regular expressions. Thanks in advance for your advice/help.
First of all, this seems more appropriate for substring functions; they are usually faster and less error-prone. However, for learning purposes, you could come up with something like:
(.{9}).{8}(.{4}).
This matches any 9 characters (not only digits; use \d instead of . to restrict it to digits) and saves them in the first group, matches another 8 characters that are not saved, matches another 4 characters into the second group, and finally matches the last character without saving it.
Concatenate $1 and $2 (5490028400000 in your case) and you should be fine.
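Both routes can be compared side by side in a short Python sketch. It assumes the input is always 22 digits, as stated in the question, and uses \d rather than . so that non-digit input is rejected.

```python
import re

code = "5490028400316201600008"

# Regex route: keep 9 digits, skip 8, keep 4, ignore the last one.
m = re.fullmatch(r"(\d{9})\d{8}(\d{4})\d", code)
via_regex = m.group(1) + m.group(2) if m else None

# Substring route: the same positions expressed as slices.
via_slices = code[:9] + code[17:21]

print(via_regex)
print(via_slices)
```

The slice version does no validation, so it is only equivalent when the length and content of the input are already known to be correct.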

How to make a regular expression looking for a list of extensions separated by a space

I want to be able to take a string of text from the user that should be formatted like this:
.ext1 .ext2 .ext3 ...
Basically, I am looking for a dot, a string of alphanumeric characters of any length, a space, and rinse and repeat. I am a little confused about how to say "I need a period, a string of characters, and a space". Also, the last extension could be followed by nothing, a space, or a series of spaces. And I guess the extensions could be separated by any number of spaces?
EDIT: I made it clearer what I was looking for.
Thanks!
Try this:
^(?:\.[A-Za-z0-9]+ +)*\.[A-Za-z0-9]+ *$
In a Java string literal you need to escape the backslashes:
"^(?:\\.[A-Za-z0-9]+ +)*\\.[A-Za-z0-9]+ *$"
(\.\w+)\s* Match this and get your results.
^((\.\w+)\s*)*$ Check this and if it's true, your String is exactly what you want.
As for the last requirement, you can't (AFAIK) both get all the extensions separately and check what follows the last one with a single pattern. Either you check your string, or you extract the extensions from it.
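The check-then-extract split can be sketched in Python as two separate patterns. The validation pattern is the one from the first answer, minus the anchors because fullmatch is used; the names VALID, EXT and parse_extensions are hypothetical.

```python
import re

# Whole-string check: dot-prefixed alphanumeric runs separated by one or
# more spaces, with optional trailing spaces.
VALID = re.compile(r"(?:\.[A-Za-z0-9]+ +)*\.[A-Za-z0-9]+ *")

# Extraction: a single extension.
EXT = re.compile(r"\.[A-Za-z0-9]+")


def parse_extensions(s: str):
    """Return the list of extensions, or None if the string is malformed."""
    if VALID.fullmatch(s) is None:
        return None
    return EXT.findall(s)


print(parse_extensions(".ext1 .ext2 .ext3"))
print(parse_extensions(".txt.md"))
```

Validating first means the extraction pattern can stay simple, since it only ever runs on input already known to have the right shape.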
I'd start with something like: ^\.[a-z0-9]+([\t\n\v ]+\.[a-z0-9]+)*$