RegEx replaceAll but ignore newlines - regex

Looking for some regex help! If this can be done in another way / using another tool - please let me know.
Here's a snippet from my data set (there are ~10 million rows in total). Every new sequence starts with a '>'.
Note: The line numbers are not in the actual text file.
01 >M00707:15:000000000-AEN4L:1:1101:13198:1037_PairEnd_SUB_SUB merged_sample={14.3: 1}; count=1; 2:N:0:1
02 ctcccggaaaaatttgagcctccagagtagcatataaccgacacgttgccgcctgaaaat
03 acattttccaggtcttnnnnnaaannnggaagcgcgcaccgacgagctttnnannacaag
04 tgtggctctagtgctcggtatttgcaactttttaagtannatgnnngtcgnnnnngaggn
05 nnnnnnnnntaaccnnncaccttcaagcaagtctaagttctcgactaatcaaactataaa
06 tccgctacacggacccagatctcccgccncgtgcannttaaagcaagtctacgttattga
07 agatagaaactattatatcgctaaacgtagctctganncacgctcgccttgactccgact
08 ctgtcaatgtctacgaccaattgaggtggaacatgtgcacatgtgtttcagancattgga
09 ggaattccgggaaaataaattgaggcacaancgaacggtgatctnnnnnnnttagattct
10 gccatgttttttggcacgaacacaattgggcaaatactgttgggatgtggatggat
11 >M00707:15:000000000-AEN4L:1:1101:10949:1045_PairEnd_SUB_SUB_CMP merged_sample={13.3: 1}; count=1; 2:N:0:1
12 atgacatattaatgattcagcccacattccttaatataccacatatgacttacttttcta
13 tatcaacnnnnnnntactttccacaggtatatacatactatgtttaatactcattaattt
14 acttgncactatattattacattatatgattaatccacatttctataacatattagactt
15 tcctcaactagatattat(first)tttcgt(first)aattattatgcagttgtatgacatattactgaatca
16 gccaacattccttaataaaccncatacgactactctgttatcgtatgtgttttatggtct
17 tgattcttagtaatgggtatgacatattattgattcagccnnnattgttnannannnnac
18 atnnancttactnntcttnttcaactctaatatactttccacaggtatatacatactatg
19 ttnaat(last)actcattaat(last)ttacttgccaatatatcattnnnntatatgattaatccacattt
20 ctataacatattagactttcctcaactagatattattttcgtaattattatgcag
I want to cut out everything between the character sequences "tttcgt" and "actcattaat" (but only when they occur in that order), replace it with nothing, and preserve everything else in its format (with the line breaks etc).
A big challenge is that I also need to find tttcgt and actcattaat when either of them has a line break in the middle, i.e. it starts at the end of one line, is interrupted by a line break plus line number plus space, and continues on the next line. (Thanks to @CBroe for pointing that out.)
I wrapped "(first)" around the tttcgt chars - see line number 15
I wrapped "(last)" around the actcattaat chars - see line number 19
So far I've come up with this: (?<=tttcgt).*?(?=actcattaat). But how can I make my expression ignore newlines?

To make the dot in .* match newlines as well, you need to set the s (single-line) modifier. How you set it depends on the regex implementation.
In Python it's the DOTALL flag.
A single match can't skip over text in the middle of the input, but you can capture the two sequences in separate groups and join them in the replacement, or simply replace the sequence to be removed with an empty string.
Example:
import re
data = """>M00707:15:000000000-AEN4L:1:1101:13198:1037_PairEnd_SUB_SUB merged_sample={14.3: 1}; count=1; 2:N:0:1
ctcccggaaaaatttgagcctccagagtagcatataaccgacacgttgccgcctgaaaat
acattttccaggtcttnnnnnaaannnggaagcgcgcaccgacgagctttnnannacaag
tgtggctctagtgctcggtatttgcaactttttaagtannatgnnngtcgnnnnngaggn
nnnnnnnnntaaccnnncaccttcaagcaagtctaagttctcgactaatcaaactataaa
tccgctacacggacccagatctcccgccncgtgcannttaaagcaagtctacgttattga
agatagaaactattatatcgctaaacgtagctctganncacgctcgccttgactccgact
ctgtcaatgtctacgaccaattgaggtggaacatgtgcacatgtgtttcagancattgga
ggaattccgggaaaataaattgaggcacaancgaacggtgatctnnnnnnnttagattct
gccatgttttttggcacgaacacaattgggcaaatactgttgggatgtggatggat
>M00707:15:000000000-AEN4L:1:1101:10949:1045_PairEnd_SUB_SUB_CMP merged_sample={13.3: 1}; count=1; 2:N:0:1
atgacatattaatgattcagcccacattccttaatataccacatatgacttacttttcta
tatcaacnnnnnnntactttccacaggtatatacatactatgtttaatactcattaattt
acttgncactatattattacattatatgattaatccacatttctataacatattagactt
tcctcaactagatattat(first)tttcgt(first)aattattatgcagttgtatgacatattactgaatca
gccaacattccttaataaaccncatacgactactctgttatcgtatgtgttttatggtct
tgattcttagtaatgggtatgacatattattgattcagccnnnattgttnannannnnac
atnnancttactnntcttnttcaactctaatatactttccacaggtatatacatactatg
ttnaat(last)actcattaat(last)ttacttgccaatatatcattnnnntatatgattaatccacattt
ctataacatattagactttcctcaactagatattattttcgtaattattatgcag"""
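# Lazy .*? stops at the first "actcattaat" after "tttcgt"; a greedy .* would run on to the last one in the data.
output = re.sub(r'(tttcgt).*?(actcattaat)', r'\1\2', data, flags=re.DOTALL)
print(output)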
EDIT: made the code preserve the starting and ending sequences instead of removing them from output.
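The question also asks for the two markers to be found when a line break falls inside them. A minimal sketch of one way to do that, assuming the real file contains no line numbers (as the question notes) and that '>' only ever appears at the start of a header line. The helper names split_tolerant and remove_between are made up for this illustration.

```python
import re

def split_tolerant(marker):
    """Build a pattern for `marker` that allows one newline between any two characters."""
    return r"\n?".join(re.escape(ch) for ch in marker)

def remove_between(text, first="tttcgt", last="actcattaat"):
    # [^>] also matches newlines, so no DOTALL flag is needed, and the match
    # cannot run past the '>' that starts the next sequence.
    pattern = re.compile(
        "(" + split_tolerant(first) + ")[^>]*?(" + split_tolerant(last) + ")"
    )
    # Keep both markers (including any newline inside them), drop what lies between.
    return pattern.sub(r"\1\2", text)

sample = ">seq1\naaattt\ncgtXXXX\nYYYact\ncattaatbbb\n>seq2\ncccc"
print(remove_between(sample))
```

For a file of this size it may be more practical to apply the function to one '>' record at a time rather than to the whole text in memory.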

Related

Regex for month with optional leading 0

I am trying to match various months, that may be in the form of:
01
1
12
13
09
All of the above inputs are valid except for 13.
The current regex I have for this is:
0?(?#optional leading 0, for example 04)
\d(?#followed by any number, 01, 2, 09, etc.)
|(?#or 10,11,12)
1[012]
What's wrong with the above regex? Here's an example link: https://regex101.com/r/cujCmD/1
I would phrase the regex as:
^(?:0?[1-9]|1[012])$
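The group and the anchors are needed so that whichever alternative is chosen has to match the entire input. Without them, the first alternative 0?\d happily matches just the "1" in 13 (and \d also accepts 0, so 00 would pass).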
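A quick way to check both patterns is to run them over the inputs from the question. This is only a sketch in Python; ORIGINAL is the question's pattern with its comments and line breaks removed, and the function name is_valid_month is made up for the example.

```python
import re

ORIGINAL = re.compile(r"0?\d|1[012]")            # question's pattern, unanchored
ANCHORED = re.compile(r"^(?:0?[1-9]|1[012])$")   # pattern from the answer

def is_valid_month(text):
    """True if the whole string is a month number 1-12, with optional leading 0."""
    return ANCHORED.match(text) is not None

for candidate in ["01", "1", "12", "13", "09"]:
    print(candidate, is_valid_month(candidate))

# The unanchored pattern still finds a match inside "13": just the leading "1".
print(ORIGINAL.search("13").group())
```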

Regular expression to validate 2 character hex string

I have a source of data that was converted from an Oracle database and loaded into Hadoop storage. One of the columns was a BLOB and therefore contains lots of control characters and unreadable ASCII characters outside the available character set. I am using Impala to write a regex replace function to strip the characters that the regex library cannot handle. I would like to remove the offending two-character hex codes BEFORE I use the unhex function so that I can do the rest of the regex parsing on a "clean" string.
Here's the code I've used so far, which doesn't quite work:
'[2-7]{1}([A-Fa-f]|[0-9]{1})'
I've determined that I only need to keep \u0020-\u007f, or, written as two-digit hex, 20-7F.
If my string looks like this:
010A000000153020405C00000000143020405CBC000000F53320405C4C010000E12F204058540100002D01
I would like to be able to read two characters at a time (e.g. 01, 0A, 00), evaluate whether that pair falls in the acceptable hex range mentioned above, and return only the pairs that do.
The correct output should be:
30 20 40 5C 30 20 40 5C 33 20 40 5C 4C 2F 20 40 58 and 54
However, my expression finds the first acceptable number in my first range (5) and starts the capture from there which returns the position or indexing wrong for the rest of the string... and this is the return from my expression -
010A0000001**53**0**20****40****5C**000000001**43**0**20****40****5C**BC000000F**53****32**0**40****5C****4C**010000E1**2F****20****40****58****54**010000**2D**01
I just don't know how to evaluate only two characters at a time in a mixed-length string. And, if they don't fit the expression, iterate to the next two characters. But only in two character increments.
My example: https://regex101.com/r/BZL7t0/1
I have added a positive lookbehind to it. It starts at the beginning of the string and then matches two characters at a time. This ensures that the pair you're matching is always preceded by a whole number of two-character groups.
Positive lookbehind:
(?<=^(..)*)
Updated regex:
(?<=^(..)*)([2-7]{1}[A-Fa-f0-9]{1})

How to copy a regular expression matched in notepad++ and the inverse selection of it

Example for Context
How do you search for the first two characters in every row, select them and then copy them, in Notepad++?
And vice versa: find everything that is NOT the first two characters, select it and be able to cut/copy it.
The specific goal is to auto-select the matching regex result so that the found text can be copied to the clipboard. Notepad++, to my knowledge, is only able to "Mark" the results found (apply slightly different coloration for visual distinction). It seems counterintuitive to me not to be able to "Select" the results found as well.
Any help here would be greatly appreciated.
In the following list:
09 - ExtraCare Stockpiler
01 - Food & Family Loyalist
04 - ExtraCare Enthusiast
09 - ExtraCare Stockpiler
The regex should return:
09
01
04
09
INVERTED
The same list should return:
- ExtraCare Stockpiler
- Food & Family Loyalist
- ExtraCare Enthusiast
- ExtraCare Stockpiler
Once the above is sorted out, what is the method to select the results so that they can be copied to the clipboard?
Note: Block selection (ALT + Click drag) is not an option because there are 180,000+ rows.
The way I would do this is as follows.
Use a regular expression replace to keep the wanted characters and delete the unwanted ones. Copy the entire buffer, which now contains only the wanted characters, and paste it into the destination. Then either undo the replace, reload the file (menu => File => Reload from Disk), or just discard the buffer. A minor variation is to copy the whole of the original buffer, or just the relevant section, into a temporary buffer; do the replacement; copy; paste; then discard the temporary buffer.
To keep only the first two characters of each line of a buffer: Replace ^(..).*$ with \1. To keep everything except the first two characters of each line of a buffer: Replace ^..(.*)$ with \1. In both cases ensure that ". matches newline" is not selected.
The question does not say how lines with fewer than two characters should be processed. The replacements in the previous paragraph will not alter or delete those lines (a line of exactly two characters is kept whole by the first replacement and emptied by the second). Hence it may be necessary to precede the above replacements with something to filter out the short lines.
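To see what the two replacements do before running them on 180,000+ rows, here is a small sketch of the same patterns in Python. It only illustrates the regexes, not Notepad++ itself; re.MULTILINE stands in for Notepad++ treating ^ and $ as line anchors, and the function names are made up.

```python
import re

def keep_first_two(text):
    """Keep only the first two characters of each line (lines shorter than two are untouched)."""
    return re.sub(r"^(..).*$", r"\1", text, flags=re.MULTILINE)

def drop_first_two(text):
    """Remove the first two characters of each line (lines shorter than two are untouched)."""
    return re.sub(r"^..(.*)$", r"\1", text, flags=re.MULTILINE)

rows = "09 - ExtraCare Stockpiler\n01 - Food & Family Loyalist\n04 - ExtraCare Enthusiast"
print(keep_first_two(rows))
print(drop_first_two(rows))
```

The dot does not match a newline here, which corresponds to leaving ". matches newline" unticked in the Replace dialog.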

How to capture regex in vim and do the substitution

What would be the regular expression for deleting all the text inside the "" quotation marks in vim?
I am having trouble using a capture group in this case.
INPUT:
16
17
18
19
OUTPUT:
16
17
18
19
This should work for your example:
%s/"\zs[^"]*//
If you like, you can also record a macro to achieve that (using fewer keystrokes):
(assume your cursor is at line 1)
qq0di"j@qq
Then type @q to replay the macro for all lines in your buffer.
Note that the recursive macro just saves you from typing 999@q.
Try this:
:%s/HREF="[^"]*"/HREF=""/
By using "[^"]*" instead of just ".*" you avoid matching all the way past the first closing quote to the later closing quote of the second attribute.
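The difference between the two patterns is easy to see on a line with two quoted attributes. The sketch below uses Python's re.sub rather than vim, and the HTML line is a made-up example since the question's input is not shown.

```python
import re

# Hypothetical input line with two quoted attributes.
line = '<A HREF="page.html" TITLE="Home">'

# Greedy .* runs to the LAST double quote on the line and swallows TITLE too.
greedy = re.sub(r'HREF=".*"', 'HREF=""', line)

# [^"]* cannot cross a double quote, so it stops at HREF's own closing quote.
negated = re.sub(r'HREF="[^"]*"', 'HREF=""', line)

print(greedy)
print(negated)
```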

Regex: How to match a unix datestamp?

I'd like to be able to match this entire line (to highlight this sort of thing in vim): Fri Mar 18 14:10:23 ICT 2011. I'm trying to do it by finding a line that contains ICT 20 (the time zone plus the first two digits of the year), like this: syntax match myDate /^*ICT 20*$/, but I can't get it working. I'm very new to regex. Basically what I want to say is: find a line that contains "ICT 20", with anything on either side of it, and match that whole line. Is there an easy way to do this?
.*ICT 20.*
should do the trick. . is a wildcard that matches any character, and * means you can have 0 or more of the pattern it follows. (i.e. ba(na)* will match ba, banana, bananananana and so on)
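Both claims can be checked quickly. The snippet below is a small sketch using Python's re module; fullmatch is used so that the pattern has to cover the whole line, which is what the question asks for.

```python
import re

DATE_LINE = re.compile(r".*ICT 20.*")

# The whole datestamp line is matched because .* soaks up both sides of "ICT 20".
print(bool(DATE_LINE.fullmatch("Fri Mar 18 14:10:23 ICT 2011")))
print(bool(DATE_LINE.fullmatch("Fri Mar 18 14:10:23 UTC 2011")))

# (na)* means zero or more repetitions of "na".
for word in ["ba", "banana", "bananananana", "ban"]:
    print(word, bool(re.fullmatch(r"ba(na)*", word)))
```

The same pattern should carry over to the command from the question as syntax match myDate /.*ICT 20.*/, since . and * have the same meaning in Vim's default pattern syntax.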