Regular expression for extracting substring in the middle of a string - regex

Looking for a regular expression that extracts multiple characters, at different locations, in my string. For example, the string I'm working with is 5490028400316201600008 and it will always be this same length, but the numbers can change.
I would like to extract the first 9 characters, then skip the next 8, extract the next 4, then ignore the last character. The resulting string would be 5490028400000 in this case. I can't seem to find an easy way to do this and I'm fairly new to regular expressions. Thanks in advance for your advice/help.

First of all, this seems more appropriate for substring functions, which are usually faster and less error-prone. For learning purposes, however, you could come up with something like:
(.{9}).{8}(.{4}).
This matches any character 9 times (not only digits; use \d instead of . to restrict it to digits) and saves them in the first group, matches another 8 characters without saving them, captures the next 4 characters into the second group, and finally consumes the last character.
Concatenate $1 and $2 (5490028400000 in your case) and you should be fine.
See this demo on regex101.com.
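For comparison, here is a minimal Python sketch of both approaches, the regex with two groups and plain slicing. The function names are made up for illustration, and \d is used in place of . on the assumption that the input is always 22 digits.

```python
import re

# 9 digits (group 1), 8 skipped digits, 4 digits (group 2), 1 ignored digit.
PATTERN = re.compile(r"(\d{9})\d{8}(\d{4})\d")

def extract_with_regex(s: str) -> str:
    """Concatenate group 1 and group 2, like $1$2 in a replacement."""
    m = PATTERN.fullmatch(s)
    if m is None:
        raise ValueError("expected exactly 22 digits")
    return m.group(1) + m.group(2)

def extract_with_slices(s: str) -> str:
    """Same result with substring operations: indices 0-8 and 17-20."""
    return s[:9] + s[17:21]

sample = "5490028400316201600008"
print(extract_with_regex(sample))
print(extract_with_slices(sample))
```

Both calls should print 5490028400000 for the sample string. The slicing version does no validation, so it is only safe if the length really is fixed.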

Related

Regex to remove unwanted text in gene sequences

I have gene sequences that can contain stray text, and I would like to remove that errant text in a generic way with regex. I'd like to remove every span that starts and ends with an invalid character and has at most 10 characters between consecutive invalid characters. I am assuming that invalid characters up to 10 characters apart belong to the same piece of invalid text.
example :
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
Valid sequence characters are ATCG. Can we create a regex to reduce the above string to
GATCATCGGCCCATGCATGCGGGGATCGCCCCTTTAAAAT?
I understand that the G at the beginning of this final sequence is the last character of the word BEGINNING, which is the "bad" text at the beginning of the string. I realize that with regex it is impossible to identify words, so I am willing to live with this limitation. The same goes for the T at the end, which is the first letter of "THIS".
I've tried to do something with repeated capture groups that allow for a certain number of chars between bad characters, but I can't seem to make it work right. Maybe someone can help me...
This regex does not quite work to capture everything.
([^ACTG].{1,10}[^ACTG])+
Initial string:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
String after replacing non-ACGT:
-A-T--TATT----G-----GATCATCGGCCCATGCAT-----A-T--T--T--------GCGGGGATCGCCCCTTTAAAAT---------T--TATT-------A-T-------
For this sample, a run of up to four ACGT characters can appear in the unwanted text. Examining other samples may give a sensible upper bound.
Perhaps "starts and ends with invalid character and contains no long runs of valid characters" is a better measure to use than "1 to 10 characters, starting and ending with invalid character"?
A regex for this is:
[^ACGT]((?![ACGT]{5,}).)*[^ACGT]
and matches:
BADTEXTATTHEBEGINNIN
MOREBADTEXTINTHEMIDDLE
HISISSOMETEXTATTHEENDIWANTREMOVED
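A minimal Python sketch of applying that regex as a substitution. The threshold of 5 is the one derived from this single sample and may need adjusting for other data; the function name is hypothetical, and the inner group is made non-capturing so that findall returns whole matches.

```python
import re

# Starts and ends with a non-ACGT character; in between, no position may
# begin a run of 5 or more valid characters.
BAD_TEXT = re.compile(r"[^ACGT](?:(?![ACGT]{5,}).)*[^ACGT]")

def strip_bad_text(seq: str) -> str:
    """Delete every span the pattern identifies as unwanted text."""
    return BAD_TEXT.sub("", seq)

raw = (
    "BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLE"
    "GCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED"
)

print(BAD_TEXT.findall(raw))
print(strip_bad_text(raw))
```

For the sample this should leave GATCATCGGCCCATGCATGCGGGGATCGCCCCTTTAAAAT, including the stray G and T mentioned in the question.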

Regex extract first portion of four numbers starting from specific position

How do I extract the four digits that follow the first 8 digits (the values are dynamic) from the following strings using regex?
20190715171712904_10008_file_activate_10.20.30.4000233223456_name.unl
20190715141712904_10008_runco_activate_10.20.30.40_name.unl
From first string i want 1717
From second string i want 1417
I have tried writing regex queries on https://regex101.com/, e.g. ^\d{8}([0-9]{4})$, but it is not working.
Drop the $. It forces the expression to look for the end of the string right after your 4 digits, and the string does not end there. The answer will be in the first capture group. Note that you can write \d in place of [0-9] as well.
If your language supports look-behinds, you can capture your digits as the main capture, instead of a subgroup:
(?<=^\d{8})\d{4}
This is really not a problem for a regular expression, though: taking the substring from index 8 to index 11 inclusive (0-indexed) is basic and faster in any language.
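A short Python sketch of both options, the capture group without the trailing $ and the plain substring. The function names are invented for the example.

```python
import re

names = [
    "20190715171712904_10008_file_activate_10.20.30.4000233223456_name.unl",
    "20190715141712904_10008_runco_activate_10.20.30.40_name.unl",
]

def four_digits_regex(name: str):
    # re.match anchors at the start, so no explicit ^ is needed here.
    m = re.match(r"\d{8}(\d{4})", name)
    return m.group(1) if m else None

def four_digits_slice(name: str) -> str:
    # Indices 8..11 inclusive; performs no check that they are digits.
    return name[8:12]

for n in names:
    print(four_digits_regex(n), four_digits_slice(n))
```

The regex version returns None when the name does not start with 12 digits, whereas the slice returns whatever happens to sit at those positions.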

Putting a group within a group [123[a-u]]

I'm having a lot more difficulty than I anticipated in creating a simple regex to match any specific characters, including a range of characters from the alphabet.
I've been playing with regex101 for a while now, but every combination seems to result in no matches.
Example expression:
[\n\r\t\s\(\)-]
Preferred expression:
[[a-z][a-Z]\n\r\t\s\(\)-]
Example input:
(123) 241()-127()()() abc ((((((((
Ideally the expression will capture every character except the digits
I know I could always manually input "abcdefgh".... but there has to be an easier way. I also know there are easier ways to capture numbers only, but there are some special characters and letters which I may eventually need to include as well.
With regex you can match a range of characters, as in [a-z] from your example above, which matches any single lowercase letter from a to z. To match more than one character, add a + to it; to fix the number of characters matched, use {n}, where n is the number of characters you want. So [a-z]+ matches one or more lowercase letters, and [a-z]{4} matches exactly four consecutive lowercase letters.
You can use partial intervals. For example, [a-j] will match all characters from a to j. So, [a-j]{2} for string a6b7cd will match only cd. You can also use several intervals within the same character class, like this: [a-j4-6]{4}. This regex will match ab44 but not ab47.
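The interval examples above can be checked quickly in Python (a small sketch; the variable names are only for illustration):

```python
import re

# [a-j]{2} needs two adjacent letters from a..j; in "a6b7cd" only "cd" qualifies.
pairs = re.findall(r"[a-j]{2}", "a6b7cd")

# Two intervals in one class: letters a..j and digits 4..6.
matches_ab44 = re.fullmatch(r"[a-j4-6]{4}", "ab44") is not None
matches_ab47 = re.fullmatch(r"[a-j4-6]{4}", "ab47") is not None

print(pairs)          # ['cd']
print(matches_ab44)   # True
print(matches_ab47)   # False, because 7 is outside 4-6
```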
Overlooked a pretty small character. The term I was looking for was "alternation", apparently.
[\r\t\n]|[a-z] with the missing element being the | character. This lets a character match either the first class or the second one.
At least that's my conclusion when testing this specific example.
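For what it's worth, a single character class can also hold several ranges next to each other, so the alternation is not strictly required. The sketch below compares both forms in Python on the example input. Note that [a-Z] is not a valid range, since a sorts after Z, and that \s already covers \n, \r and \t.

```python
import re

sample = "(123) 241()-127()()() abc (((((((("

# One class: letters via two ranges, whitespace via \s, then literal ( ) and -.
# The hyphen comes last so it is taken literally rather than as a range operator.
single_class = re.findall(r"[a-zA-Z\s()-]", sample)

# The alternation from the post, extended to the same set of characters.
alternation = re.findall(r"[\s()-]|[a-zA-Z]", sample)

# [a-Z] is rejected because 'a' (97) sorts after 'Z' (90).
try:
    re.compile(r"[a-Z]")
    a_to_Z_is_valid = True
except re.error:
    a_to_Z_is_valid = False

print("".join(single_class))
print(single_class == alternation)
print(a_to_Z_is_valid)
```

Both patterns should pick up every character of the sample except the digits.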

Regular expression to check strings containing a set of words separated by a delimiter

As the title says, I'm trying to build up a regular expression that can recognize strings with this format:
word!!cat!!DOG!! ... Phone!!home!!
where !! is used as a delimiter. Each word must have a length between 1 and 5 characters. Empty words are not allowed, i.e. no strings like !!,!!!! etc.
A word can only contain alphabetical characters between a and z (case insensitive). After each word I expect to find the special delimiter !!.
I came up with the solution below, but since I need to add other checks (e.g. words can contain spaces) I would like to know if I'm on the right track.
(([a-zA-Z]{1,5})([!]{2}))+
Also note that empty strings are not allowed, hence the use of +
Help and advice are very welcome since I just started learning how to build regular expressions. I ran some tests using http://regexr.com/ and it seems to be okay, but I want to be sure. Thank you!
Examples that shouldn't match:
a!!b!!aaaaaa!!
a123!!b!!c!!
aAaa!!bbb
aAaa!!bbb!
Splitting the string and using the values between the !!
It depends on what you want to do with the regular expression. If you want to match the values between the !!, here are two ways:
Matching with groups
([^!]+)!!
[^!]+ requires at least 1 character other than !
!! instead of [!]{2} because it is the same but much more readable
Matching with lookahead
If you only want to match the actual word (and not the two !), you can do this by using a positive lookahead:
[^!]+(?=!!)
(?=) is a positive lookahead. It requires its contents, here !!, to follow directly after the text matched so far, but it does not consume them, so they won't be part of the resulting match.
Here is a live example.
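A quick Python sketch of the two extraction variants, using a shortened version of the example string:

```python
import re

text = "word!!cat!!DOG!!"

# Variant 1: capture group; findall returns the group contents.
with_group = re.findall(r"([^!]+)!!", text)

# Variant 2: lookahead; the delimiter is checked but not consumed.
with_lookahead = re.findall(r"[^!]+(?=!!)", text)

print(with_group)       # ['word', 'cat', 'DOG']
print(with_lookahead)   # ['word', 'cat', 'DOG']
```

Neither variant validates the string as a whole; they only pull out whatever is followed by !!.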
Validating the string
If, however, you want to check the validity of the whole string, then you need something like this:
^([^!]+!!)+$
^ start of the string
$ end of the string
It requires the whole string to consist only of ([^!]+!!), repeated one or more times.
If [^!] does not fit your requirements, you can of course replace it with [a-zA-Z] or similar; for the 1 to 5 letter words from the question, that would be ^([a-zA-Z]{1,5}!!)+$.
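As a sketch, here is the whole-string check in Python with the question's constraints (letters only, 1 to 5 per word) plugged in, run against the examples that shouldn't match. The is_valid name is hypothetical; fullmatch plays the role of ^ and $.

```python
import re

# One or more words of 1-5 letters, each followed by the !! delimiter.
VALID = re.compile(r"(?:[a-zA-Z]{1,5}!!)+")

def is_valid(s: str) -> bool:
    return VALID.fullmatch(s) is not None

should_match = ["word!!cat!!DOG!!", "Phone!!home!!"]
should_fail = ["a!!b!!aaaaaa!!", "a123!!b!!c!!", "aAaa!!bbb", "aAaa!!bbb!", "", "!!"]

for s in should_match + should_fail:
    print(repr(s), is_valid(s))
```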

Extended Search

So I've got a big text file which looks like the following:
text;text;text;text;text - 5 words
text;text;text;text;text;text - 6 words
text;text;text;text;text;text;text - 7 words
How can I search for lines with 6, 7, ... words?
I tried searching with (.*);(.*);(.*);(.*);(.*);(.*); but it does not work :(
Note: Notepad++ does choke on my existing regex, but the OP adapted it to suit his need, see the comments for more.
First of all, you should be doing a regular expression search, not an extended search.
Here's the regex. Basically you match the first 5 words, then match at least one more after the first 5 (if you don't need to match the last semicolon, take out the ;?):
(.*);(.*);(.*);(.*);(.*)(;(.*))+;?
(You cannot use (.*)(;(.*)){5,} as Notepad++ doesn't support that syntax.)
Don't abuse the *. If you are trying to match at least one character, .+ is less ambiguous. In fact, if ; is the separator, you can try [^;]+ to be even more pedantic.
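Outside Notepad++, the same idea with [^;]+ and a counted repetition can be sketched in Python as below. This is Python's re syntax, not something tested in Notepad++, and the sample lines are made up.

```python
import re

# One field, then at least five more ";field" parts: six or more fields total.
# \n is excluded so a field can never span two lines.
SIX_OR_MORE = re.compile(r"[^;\n]+(?:;[^;\n]+){5,}")

lines = [
    "text;text;text;text;text",            # 5 fields
    "text;text;text;text;text;text",       # 6 fields
    "text;text;text;text;text;text;text",  # 7 fields
]

matching = [ln for ln in lines if SIX_OR_MORE.fullmatch(ln)]
print(matching)
```

Only the 6-field and 7-field lines should be printed. Empty fields (two semicolons in a row) are rejected by this pattern, which may or may not be what you want.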