I have a source of data that was converted from an oracle database and loaded into a hadoop storage point. One of the columns was a BLOB and therefore had lots of control characters and unreadable/undetectable ascii characters outside of the available codeset. I am using Impala to write regex replace function to parse some of the unicode characters that the regex library cannot understand. I would like to remove the offending 2 character hex codes BEFORE I use the unhex query function so that I can do the rest of the regex parsing with a "clean" string.
Here's the code I've used so far, which doesn't quite work:
'[2-7]{1}([A-Fa-f]|[0-9]{1})'
I've determined that I only need to capture \u0020-\u007f - or represented in the two bit hex - 20-7f
If my string looks like this:
010A000000153020405C00000000143020405CBC000000F53320405C4C010000E12F204058540100002D01
I would like to be able to capture 2 characters at a time (e.g. 01,0A,00) evaluate whether or not that fits the acceptable range of 2 byte hex I mentioned above and return only what is acceptable.
The correct output should be:
30 20 40 5C 30 20 40 5C 33 20 40 5C 4C 2F 20 40 58 and 54
However, my expression finds the first acceptable number in my first range (5) and starts the capture from there which returns the position or indexing wrong for the rest of the string... and this is the return from my expression -
010A0000001**53**0**20****40****5C**000000001**43**0**20****40****5C**BC000000F**53****32**0**40****5C****4C**010000E1**2F****20****40****58****54**010000**2D**01
I just don't know how to evaluate only two characters at a time in a mixed-length string. And, if they don't fit the expression, iterate to the next two characters. But only in two character increments.
My example: https://regex101.com/r/BZL7t0/1
I have added a Positieve Lookbehind to it. Which starts at the beginning of the string and then matches 2 characters at the time. This ensures that the group you're matching always has groups of 2 characters before it.
Positieve Lookbehind:
(?<=^(..)*)
Updated regex:
(?<=^(..)*)([2-7]{1}[A-Fa-f0-9]{1})
Preview:
Regex101
Related
Given the string below,
ay bee ceefooh deefoo38 ee 37 ef gee38 aitch 38 eye19 jay38 kay 99 el88 em38 en 29 ou38 38 pee 12 q38 arr 999 esss 555
the goal is to match every word such that the suffix is a number that matches the number that appears after foo (which happens to be 38 in this case).
There is only one substring that begins with foo and ends with a number. The expected matches all exist after said substring.
Expected matches:
gee38
jay38
em38
ou38
q38
I've tried foo(\d+).*?(\w+\1)\b and foo(\d+).*(\w+\1)\b, but they fail to match all, because they either match the first one (gee38) or the last one (q38).
Is it possible to match all with just a single regex and, importantly, in just a single run?
The PCRE2 engine that I use behaves in the same way as https://regex101.com/r/uFEDOE/1. So, if the regex can match multiple substrings on regex101, then the engine that I use can too.
(?:foo|\G(?!^))(\d+).*?(?=(\w+))\w+(?=\1\b)
Demo
It could be some size or performance optimization.
#Niko Gambt, say if any optimization is important for you.
I'm trying to get the regular expression to work (using jQuery) for a specific pattern I need.
I need following pattern:
First two character
s of the string need to be numbers (0-9) but maximum number is 53. for numbers below 10 a leading 0 is required
Character on position 3 needs to be a .
the next 4 characters need to be a number between 0-9, minimum number should be 2010, maximum 2050
so, Strings like 01.2020, 21.2020, or 45.2020 have to match but 54.2020 or 04.2051 must not.
I tried to write the regex without the min and max requirement first and I'm testing the string using regex101.com but I'm unable to get it to work.
acording to the definition /^[0-9]{2}\.\d[0-9]{4}$/ should allow me to insert the strings in the format NN.NNNN.
thankful for any input.
2 numbers from 00 to 53 can be matched using this : (?:[0-4][0-9]|5[0-3]) (00 -> 49 or 50 -> 53)
Character on position 3 needs to be a . : you've already got the \.
a number between 2010 and 2050 -> 20(?:[1-4][0-9]|50) (20 followed by either 10 -> 49 or 50)
This gives :
(?:[0-4][0-9]|5[0-3])\.20(?:[1-4][0-9]|50)
Looking for some regex help! If this can be done in another way / using another tool - please let me know.
Here's a snippet from my data set (there are ~10million rows in total). Every new sequence starts with a '>'.
Note: The line numbers are not in the actual textfile
01 >M00707:15:000000000-AEN4L:1:1101:13198:1037_PairEnd_SUB_SUB merged_sample={14.3: 1}; count=1; 2:N:0:1
02 ctcccggaaaaatttgagcctccagagtagcatataaccgacacgttgccgcctgaaaat
03 acattttccaggtcttnnnnnaaannnggaagcgcgcaccgacgagctttnnannacaag
04 tgtggctctagtgctcggtatttgcaactttttaagtannatgnnngtcgnnnnngaggn
05 nnnnnnnnntaaccnnncaccttcaagcaagtctaagttctcgactaatcaaactataaa
06 tccgctacacggacccagatctcccgccncgtgcannttaaagcaagtctacgttattga
07 agatagaaactattatatcgctaaacgtagctctganncacgctcgccttgactccgact
08 ctgtcaatgtctacgaccaattgaggtggaacatgtgcacatgtgtttcagancattgga
09 ggaattccgggaaaataaattgaggcacaancgaacggtgatctnnnnnnnttagattct
10 gccatgttttttggcacgaacacaattgggcaaatactgttgggatgtggatggat
11 >M00707:15:000000000-AEN4L:1:1101:10949:1045_PairEnd_SUB_SUB_CMP merged_sample={13.3: 1}; count=1; 2:N:0:1
12 atgacatattaatgattcagcccacattccttaatataccacatatgacttacttttcta
13 tatcaacnnnnnnntactttccacaggtatatacatactatgtttaatactcattaattt
14 acttgncactatattattacattatatgattaatccacatttctataacatattagactt
15 tcctcaactagatattat(first)tttcgt(first)aattattatgcagttgtatgacatattactgaatca
16 gccaacattccttaataaaccncatacgactactctgttatcgtatgtgttttatggtct
17 tgattcttagtaatgggtatgacatattattgattcagccnnnattgttnannannnnac
18 atnnancttactnntcttnttcaactctaatatactttccacaggtatatacatactatg
19 ttnaat(last)actcattaat(last)ttacttgccaatatatcattnnnntatatgattaatccacattt
20 ctataacatattagactttcctcaactagatattattttcgtaattattatgcag
I want to cut out everything between the order of characters "tttcgt" and "actcattaat" (but only in that specific order), then replace it with nothing and preserve everything else in its format (with the line breaks etc).
A big challenge to this is also that i need to find tttcgt and actcattaat even if either of those had a line break in between, ie. goes from the end of one line, line break plus line number plus space, and then continued on the next line. (Thanks for #CBroe for pointing that out)
I wrapped "(first)" around the tttcgt chars - see line number 15
I wrapped "(last)" around the actcattaat chars - see line number 19
So far I've mustered up this thinggy (?<=tttcgt).*?(?=actcattaat) - but how can I make my expression ignore newlines?
To make your regex dot match .* include newlines, you need to specify the s modifier. Modifier depends on the implementation of regex.
In python it's the DOTALL flag.
You can't regex a non-consecutive capture group (with characters missing from between input), but you can concat the two capture groups later on, or just string replace the sequence to be removed with an empty string.
Example:
import re;
data = """>M00707:15:000000000-AEN4L:1:1101:13198:1037_PairEnd_SUB_SUB merged_sample={14.3: 1}; count=1; 2:N:0:1
ctcccggaaaaatttgagcctccagagtagcatataaccgacacgttgccgcctgaaaat
acattttccaggtcttnnnnnaaannnggaagcgcgcaccgacgagctttnnannacaag
tgtggctctagtgctcggtatttgcaactttttaagtannatgnnngtcgnnnnngaggn
nnnnnnnnntaaccnnncaccttcaagcaagtctaagttctcgactaatcaaactataaa
tccgctacacggacccagatctcccgccncgtgcannttaaagcaagtctacgttattga
agatagaaactattatatcgctaaacgtagctctganncacgctcgccttgactccgact
ctgtcaatgtctacgaccaattgaggtggaacatgtgcacatgtgtttcagancattgga
ggaattccgggaaaataaattgaggcacaancgaacggtgatctnnnnnnnttagattct
gccatgttttttggcacgaacacaattgggcaaatactgttgggatgtggatggat
>M00707:15:000000000-AEN4L:1:1101:10949:1045_PairEnd_SUB_SUB_CMP merged_sample={13.3: 1}; count=1; 2:N:0:1
atgacatattaatgattcagcccacattccttaatataccacatatgacttacttttcta
tatcaacnnnnnnntactttccacaggtatatacatactatgtttaatactcattaattt
acttgncactatattattacattatatgattaatccacatttctataacatattagactt
tcctcaactagatattat(first)tttcgt(first)aattattatgcagttgtatgacatattactgaatca
gccaacattccttaataaaccncatacgactactctgttatcgtatgtgttttatggtct
tgattcttagtaatgggtatgacatattattgattcagccnnnattgttnannannnnac
atnnancttactnntcttnttcaactctaatatactttccacaggtatatacatactatg
ttnaat(last)actcattaat(last)ttacttgccaatatatcattnnnntatatgattaatccacattt
ctataacatattagactttcctcaactagatattattttcgtaattattatgcag"""
output = re.sub(r'(tttcgt).*(actcattaat)', r'\1\2', data, 0, flags=re.DOTALL)
print output
EDIT: made the code preserve the starting and ending sequences instead of removing them from output.
I am trying to validate the input of a LineEdit widget in Qt. I am using regular expressions for the first time so I could use some help. I want to allow 1 to 32 hex bytes separated by a space, for example this should be valid:
"0a 45 bc 78 e2 34 71 af"
And here are some examples of invalid input:
"1 34 bc 4e" -> They need to be written in pairs, so 1 must be 01.
"8a cb3 58 11" -> cb3 invalid.
"56 f2 a8 69 " -> No trailing space is allowed.
After some head scratching I came up with this regex which seems to work:
"([0-9A-Fa-f]{2}[ ]){0,31}([0-9A-Fa-f]{2})"
Now on to my questions:
Do you see any problems with my regex that my tests have failed to show? If so how can I improve it?
Is there a cleaner way to write it?
Thanks in advance
I am not sure what method you used for validation, but one possible problem is that the method searches the string for substring that matches the pattern rather than checking the string matches the pattern. Use exactMatch for validate a string against a regular expression.
In any case, adding anchors ^ and $ is safer (not necessary when exactMatch is used, though):
"^([0-9A-Fa-f]{2}[ ]){0,31}([0-9A-Fa-f]{2})$"
Since you are doing validation, you don't need capturing. And you don't need to put space in []
"^(?:[0-9A-Fa-f]{2} ){0,31}[0-9A-Fa-f]{2}$"
You can set case-sensitivity with setCaseSensitivity method. If you set it to Qt::CaseInsensitive, you can shorten the regex a bit:
"^(?:[0-9a-f]{2} ){0,31}[0-9a-f]{2}$"
I needed a utililty function earlier today to strip some data out of a file and wrote an appaling regular expresion to do it. The input was a file with lots of line with the format:
<address> <11 * ascii character value> <11 characters>
00C4F244 75 6C 74 73 3E 3C 43 75 72 72 65 ults><Curre
I wanted to strip out everything bar the 11 characters at the end and used the following expression:
"^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}"
This matched to the bits I didn't want which I then removed from the original string. I'd like to see how you'd do this but the particular areas I couldn't get working were:
1: having the regex engine return the characters I wanted rather than the characters I didn't and
2: finding a way of repeating the match on a single ascii value followed by the space (eg "75 " = [0-9A-F]{2}[\s]{1}?) and repeating that 11 times rather than grabbing 34 characters.
Looking at it again the easiest thing to do would be to match to the last 11 characters of each input line but this isn't very flexible and in the interests of learning regex I would like to see how you can match through from the start of the sequence.
Edit: Thanks guys, this is what I wanted:
"(?:^[0-9A-F]{8} )(?:[0-9A-F]{2} ){11} (.*)"
Wish I could turn more than one of you green.
As the file has a fixed format, you could use this regular expression to just match the last 11 characters.
^.{44}(.{11})
Last eleven is:
...........$
or:
.{11}$
Matching a hex byte + space and repeat eleven times:
([0-9A-Fa-f]{2} ){11}
1) ^[0-9A-F+]{8}[\s]{2}[0-9A-F\s]{34}(.*)
Parens are used for grouping with extraction. How you retrieve it depends on your language context, but now some sort of $1 is set to everything after the initial pattern.
2) ^[0-9A-F+]{8}[\s]{2}(?:[0-9A-F\s]){11}\s(.*)
(?:) is grouping without extraction. So (?:[0-9A-F\s]){11} considers the subpattern there as a unit and looks for it repeated 11 times.
I'm assuming PCRE here, by the way.
The address and ascii char value are all hex so:
^[0-9A-F\s]{42}
Matching the end of the line would be
.{11}$
To match only the end, you can use a positive look behind.
"(?<=(^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}))(.*?)$"
This would match any character until the end of the line, providing that it is preceded by the "look behind" expression.
(?<=....) defines a condition that must be met before matching is possible.
I am a bit short of time, but if you look on the net for any tutorial that contain the words "regex" and "lookbehind", you will find good stuff (if a regex tutorial covers look ahead/behind, it will usually be pretty complete and advanced).
Another advice is to get a regex training tool and play with it. Have a look at this excellent Regex designer.
If you're using Perl, you could also use unpack(), to get each element.
my #data;
open my $fh, '<', $filename or die;
for my $line(<$fh>){
my($address,#list) = unpack 'a8xx(a2x)11xa11', $line;
my $str = pop #list;
# unpack the hexadecimal bytes
my $data = join '', map { pack 'H2',$_ } #list;
die unless $data eq $str;
push #data, [$address,$data,$str];
}
close $fh;
I also went ahead and converted the 11 hexadecimal codes back into a string, using pack().