Regular expression hex bytes in string - regex

I am trying to validate the input of a LineEdit widget in Qt. I am using regular expressions for the first time so I could use some help. I want to allow 1 to 32 hex bytes separated by a space, for example this should be valid:
"0a 45 bc 78 e2 34 71 af"
And here are some examples of invalid input:
"1 34 bc 4e" -> They need to be written in pairs, so 1 must be 01.
"8a cb3 58 11" -> cb3 invalid.
"56 f2 a8 69 " -> No trailing space is allowed.
After some head scratching I came up with this regex which seems to work:
"([0-9A-Fa-f]{2}[ ]){0,31}([0-9A-Fa-f]{2})"
Now on to my questions:
Do you see any problems with my regex that my tests have failed to show? If so how can I improve it?
Is there a cleaner way to write it?
Thanks in advance

I am not sure which method you used for validation, but one possible problem is that the method searches the string for a substring that matches the pattern rather than checking that the whole string matches the pattern. Use exactMatch to validate a string against a regular expression.
In any case, adding anchors ^ and $ is safer (not necessary when exactMatch is used, though):
"^([0-9A-Fa-f]{2}[ ]){0,31}([0-9A-Fa-f]{2})$"
Since you are doing validation, you don't need capturing groups, and you don't need to put the space in []:
"^(?:[0-9A-Fa-f]{2} ){0,31}[0-9A-Fa-f]{2}$"
You can set case-sensitivity with setCaseSensitivity method. If you set it to Qt::CaseInsensitive, you can shorten the regex a bit:
"^(?:[0-9a-f]{2} ){0,31}[0-9a-f]{2}$"

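To illustrate the difference between searching for a substring and matching the whole string, here is a minimal sketch in Python rather than Qt. It assumes that Python's re.search and re.fullmatch behave like an unanchored search and like exactMatch respectively; the function names are made up for this example.

```python
import re

# 1 to 32 two-digit hex bytes separated by single spaces, no trailing space.
HEX_BYTES = re.compile(r"(?:[0-9A-Fa-f]{2} ){0,31}[0-9A-Fa-f]{2}")


def is_valid_hex_bytes(text: str) -> bool:
    """True only if the whole string matches the pattern."""
    return HEX_BYTES.fullmatch(text) is not None


def contains_hex_bytes(text: str) -> bool:
    """Unanchored search: True if any substring matches (too lenient here)."""
    return HEX_BYTES.search(text) is not None


if __name__ == "__main__":
    for sample in ["0a 45 bc 78 e2 34 71 af", "1 34 bc 4e",
                   "8a cb3 58 11", "56 f2 a8 69 "]:
        print(repr(sample), is_valid_hex_bytes(sample), contains_hex_bytes(sample))
```

The unanchored search accepts every invalid sample from the question because each one contains at least one valid pair, which is why the whole-string check (or the ^ and $ anchors) matters.
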
Related

PCRE2 - Match every word whose suffix matches a backreference

Given the string below,
ay bee ceefooh deefoo38 ee 37 ef gee38 aitch 38 eye19 jay38 kay 99 el88 em38 en 29 ou38 38 pee 12 q38 arr 999 esss 555
the goal is to match every word such that the suffix is a number that matches the number that appears after foo (which happens to be 38 in this case).
There is only one substring that begins with foo and ends with a number. The expected matches all exist after said substring.
Expected matches:
gee38
jay38
em38
ou38
q38
I've tried foo(\d+).*?(\w+\1)\b and foo(\d+).*(\w+\1)\b, but they fail to match all, because they either match the first one (gee38) or the last one (q38).
Is it possible to match all with just a single regex and, importantly, in just a single run?
The PCRE2 engine that I use behaves in the same way as https://regex101.com/r/uFEDOE/1. So, if the regex can match multiple substrings on regex101, then the engine that I use can too.
(?:foo|\G(?!^))(\d+).*?(?=(\w+))\w+(?=\1\b)
Demo
There may be room for some size or performance optimization.
@Niko Gambt, say if any optimization is important for you.
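For comparison only, here is a sketch of the same result in Python. Python's re module has no \G anchor, so this is not the single-regex, single-run solution the question asks for. It swaps in a two-step approach: find the number after foo, then collect the words ending in that number. The function name is hypothetical.

```python
import re


def words_with_foo_suffix(text: str) -> list[str]:
    """Return words after 'foo<number>' whose suffix is that same number."""
    anchor = re.search(r"foo(\d+)", text)
    if anchor is None:
        return []
    number = anchor.group(1)
    # At least one word character must precede the number, so a bare "38"
    # is not reported as a match.
    pattern = re.compile(r"\w+" + re.escape(number) + r"\b")
    # Only scan the part of the string that follows the foo<number> substring.
    return pattern.findall(text, anchor.end())


if __name__ == "__main__":
    sample = ("ay bee ceefooh deefoo38 ee 37 ef gee38 aitch 38 eye19 jay38 "
              "kay 99 el88 em38 en 29 ou38 38 pee 12 q38 arr 999 esss 555")
    print(words_with_foo_suffix(sample))
```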

Regular expression to validate 2 character hex string

I have a source of data that was converted from an Oracle database and loaded into a Hadoop storage point. One of the columns was a BLOB and therefore had lots of control characters and unreadable/undetectable ASCII characters outside of the available codeset. I am using Impala to write a regex replace function to parse some of the Unicode characters that the regex library cannot understand. I would like to remove the offending 2-character hex codes BEFORE I use the unhex query function so that I can do the rest of the regex parsing with a "clean" string.
Here's the code I've used so far, which doesn't quite work:
'[2-7]{1}([A-Fa-f]|[0-9]{1})'
I've determined that I only need to capture \u0020-\u007f, or, represented as two-digit hex, 20-7f.
If my string looks like this:
010A000000153020405C00000000143020405CBC000000F53320405C4C010000E12F204058540100002D01
I would like to be able to capture 2 characters at a time (e.g. 01, 0A, 00), evaluate whether or not each pair fits the acceptable range of two-digit hex I mentioned above, and return only what is acceptable.
The correct output should be:
30 20 40 5C 30 20 40 5C 33 20 40 5C 4C 2F 20 40 58 and 54
However, my expression finds the first acceptable number in my first range (5) and starts the capture from there which returns the position or indexing wrong for the rest of the string... and this is the return from my expression -
010A0000001**53**0**20****40****5C**000000001**43**0**20****40****5C**BC000000F**53****32**0**40****5C****4C**010000E1**2F****20****40****58****54**010000**2D**01
I just don't know how to evaluate only two characters at a time in a mixed-length string. And, if they don't fit the expression, iterate to the next two characters. But only in two character increments.
My example: https://regex101.com/r/BZL7t0/1
I have added a positive lookbehind to it, which starts at the beginning of the string and then matches 2 characters at a time. This ensures that the pair you're matching always has a whole number of 2-character groups before it.
Positive lookbehind:
(?<=^(..)*)
Updated regex:
(?<=^(..)*)([2-7]{1}[A-Fa-f0-9]{1})
Preview:
Regex101
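Not every engine accepts a variable-length lookbehind such as (?<=^(..)*); Python's re module, for one, rejects it. As a rough sketch of the same idea with a different technique, the string can be cut into 2-character pairs first, and each pair tested on its own. The function and constant names below are made up for illustration.

```python
import re

# A pair is acceptable when it lies in the 20-7F range from the question.
PRINTABLE_HEX = re.compile(r"[2-7][0-9A-Fa-f]")


def printable_pairs(hex_string: str) -> list[str]:
    """Split into 2-character pairs and keep only those in the 20-7F range."""
    pairs = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
    return [pair for pair in pairs if PRINTABLE_HEX.fullmatch(pair)]


if __name__ == "__main__":
    data = ("010A000000153020405C00000000143020405CBC000000F53320405C"
            "4C010000E12F204058540100002D01")
    print(" ".join(printable_pairs(data)))
```

Because the pairs are cut at even offsets before any matching happens, a misaligned hit such as the 53 that straddles 15 and 30 cannot occur.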

Regex Validate French mobile number

I'm trying to validate French mobile numbers.
I have already removed all non-numeric characters and any leading 00, and the rules are:
start with 06 or 07 or 09
is 10 digits long
thus:
/^0(6|7|9)\d{8}$/
But it seems that if the country code (33) is present, the leading zero has to be dropped, and at this point I cannot create the right regex. With the number:
33614444444
/^(33|0)?(6|7|9)\d{8}$/
it works, but works also with
614444444
while it should not
Can you suggest a solution?
You can do it using the regex:
^(33|0)(6|7|9)\d{8}$
see the regex101 demo
Why don't you simply use /^(33|0)(6|7|9)\d{8}$/ ?
I do not think you need the quantifier ?.
When you add ? after (33|0), it means that either none of them is present or one of 33 or 0 is present. It would match all of the following:
614444444 // none present
0614444444 // 0 present
33614444444 // 33 present
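A small Python sketch of the regex without the ? quantifier, checked against the three numbers above. It uses a non-capturing group and a character class, which are equivalent to (33|0)(6|7|9) for validation purposes; the function name is made up for this example.

```python
import re

# Prefix 33 or 0 is mandatory, then 6, 7 or 9, then exactly 8 more digits.
FR_MOBILE = re.compile(r"(?:33|0)[679][0-9]{8}")


def is_french_mobile(digits: str) -> bool:
    """True if the whole string is a mobile number in either accepted form."""
    return FR_MOBILE.fullmatch(digits) is not None


if __name__ == "__main__":
    for number in ["614444444", "0614444444", "33614444444"]:
        print(number, is_french_mobile(number))
```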

Adding an AND clause to a regex

I have this simple regex,
[\d]{1,5}
that matches any integer between 0 and 99999.
How would I modify it so that it didn't match 0, but matches 01 and 10, etc?
I know there is a way to do an OR like so...
[\d]{1,5}|[^0]{1}
(doesn't make much sense)
Is there a way to do an AND?
You're probably better off with something like:
0*[1-9]+[\d]{0,4}
If I'm right that translates to "zero or more zeros followed by at least one of the characters included in '1-9' and then up to 4 trailing decimal characters"
Mike
I think the simplest way would be:
[1-9]\d{0,4}
Put that between ^ and $ if it makes sense in your case, and if so, add a 0* to the beginning:
^0*[1-9]\d{0,4}$
My vote is to keep the regex simple and do that as a separate compare outside the regex. If the regex passes, convert it to an int and make sure the converted value is > 0.
But I know that sometimes one regex in a config file or validation property on a control is all you get.
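The "simple regex plus a separate compare" idea could look roughly like this in Python. This is a sketch with a made-up function name: the regex checks only the shape, and the integer comparison rules out all-zero input.

```python
import re


def is_positive_int_up_to_5_digits(text: str) -> bool:
    """Shape check with a simple regex, then a numeric check outside it."""
    if re.fullmatch(r"[0-9]{1,5}", text) is None:
        return False
    # "0", "00", ... pass the regex but convert to 0, so they fail here.
    return int(text) > 0


if __name__ == "__main__":
    for sample in ["0", "00000", "01", "10", "99999", "123456"]:
        print(sample, is_positive_int_up_to_5_digits(sample))
```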
How about an OR between single digit numbers you will accept and multiple-digit numbers:
^[1-9]$|^\d{2,5}$
I think a negative lookahead would work. Try this:
#!/bin/perl -w
while (<>)
{
chomp;
print "OK: $_\n" if m/^(?!0+$)\d{1,6}$/;
}
Example trace:
0
00
000
0000
00000
000000
0000001
000001
OK: 000001
101
OK: 101
01
OK: 01
00001
OK: 00001
1000
OK: 1000
101
OK: 101
By using look-aheads you can achieve the effect of AND.
^(?=regex1)(?=regex2)(?=regex3).*
There is, however, a bug in Internet Explorer that sometimes doesn't treat (?= ) as zero-width:
http://blog.stevenlevithan.com/archives/regex-lookahead-bug
In your case:
^(?=\d{1,5}$)(?=.*?[1-9]).*
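Here is a quick Python check of that two-lookahead pattern, as a sketch. The pattern is the one given above, and the function name is made up. Both lookaheads start at the same position, so both conditions must hold for the same string.

```python
import re

# Condition 1: the whole string is 1 to 5 digits.
# Condition 2: somewhere in it there is a digit other than 0.
AND_PATTERN = re.compile(r"^(?=\d{1,5}$)(?=.*?[1-9]).*")


def matches_both(text: str) -> bool:
    """True only when both lookahead conditions are satisfied."""
    return AND_PATTERN.match(text) is not None


if __name__ == "__main__":
    for sample in ["0", "00000", "01", "10", "99999", "123456"]:
        print(sample, matches_both(sample))
```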
It looks like you are searching for 2 different conditions. Why not break it out to 2 expressions? It might be simpler and more readable.
var str = user_string;
if (str !== '0' && /^\d{1,5}$/.test(str)) {
// code for match
}
or the following if a string of 0's is not valid as well
var str = user_string;
if (!/^0+$/.test(str) && /^\d{1,5}$/.test(str)) {
// code for match
}
Just because you can do it all in one regex doesn't mean that you should.
^([1-9][0-9]{0,4}|[0-9]{,1}[1-9][0-9]{,3}|[0-9]{,2}[1-9][0-9]{,2}|[0-9]{,3}[1-9][0-9]|[0-9]{,4}[1-9])$
Not pretty, but it should work. This is more of a brute force approach. There's a better way to do it via grouping as well, but I'm drawing a blank on the actual implementation at the moment.

regex tutorial, How can I improve this

I needed a utility function earlier today to strip some data out of a file and wrote an appalling regular expression to do it. The input was a file with lots of lines with the format:
<address> <11 * ascii character value> <11 characters>
00C4F244 75 6C 74 73 3E 3C 43 75 72 72 65 ults><Curre
I wanted to strip out everything bar the 11 characters at the end and used the following expression:
"^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}"
This matched to the bits I didn't want which I then removed from the original string. I'd like to see how you'd do this but the particular areas I couldn't get working were:
1: having the regex engine return the characters I wanted rather than the characters I didn't and
2: finding a way of repeating the match on a single ascii value followed by the space (eg "75 " = [0-9A-F]{2}[\s]{1}?) and repeating that 11 times rather than grabbing 34 characters.
Looking at it again the easiest thing to do would be to match to the last 11 characters of each input line but this isn't very flexible and in the interests of learning regex I would like to see how you can match through from the start of the sequence.
Edit: Thanks guys, this is what I wanted:
"(?:^[0-9A-F]{8} )(?:[0-9A-F]{2} ){11} (.*)"
Wish I could turn more than one of you green.
As the file has a fixed format, you could use this regular expression to just match the last 11 characters.
^.{44}(.{11})
Last eleven is:
...........$
or:
.{11}$
Matching a hex byte + space and repeat eleven times:
([0-9A-Fa-f]{2} ){11}
1) ^[0-9A-F+]{8}[\s]{2}[0-9A-F\s]{34}(.*)
Parens are used for grouping with extraction. How you retrieve it depends on your language context, but now some sort of $1 is set to everything after the initial pattern.
2) ^[0-9A-F+]{8}[\s]{2}(?:[0-9A-F\s]){11}\s(.*)
(?:) is grouping without extraction. So (?:[0-9A-F\s]){11} considers the subpattern there as a unit and looks for it repeated 11 times.
I'm assuming PCRE here, by the way.
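As a sketch of "match from the start, capture the part you want", here is the same idea in Python. It assumes each line has an 8-digit hex address, whitespace, eleven hex bytes each followed by whitespace, and then exactly 11 characters of text; the function name is made up for this example.

```python
import re
from typing import Optional

# (?:...){11} repeats the "hex byte + whitespace" unit without capturing it;
# the only capturing group is the 11 trailing characters.
LINE = re.compile(r"^[0-9A-F]{8}\s+(?:[0-9A-F]{2}\s+){11}(.{11})$")


def trailing_text(line: str) -> Optional[str]:
    """Return the 11 characters at the end of a dump line, or None."""
    match = LINE.match(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre"
    print(trailing_text(sample))
```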
The address and ascii char value are all hex so:
^[0-9A-F\s]{42}
Matching the end of the line would be
.{11}$
To match only the end, you can use a positive look behind.
"(?<=(^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}))(.*?)$"
This would match any character until the end of the line, providing that it is preceded by the "look behind" expression.
(?<=....) defines a condition that must be met before matching is possible.
I am a bit short of time, but if you look on the net for any tutorial that contains the words "regex" and "lookbehind", you will find good stuff (if a regex tutorial covers look ahead/behind, it will usually be pretty complete and advanced).
Another piece of advice is to get a regex training tool and play with it. Have a look at this excellent Regex designer.
If you're using Perl, you could also use unpack(), to get each element.
my @data;
open my $fh, '<', $filename or die;
for my $line (<$fh>) {
my ($address, @list) = unpack 'a8xx(a2x)11xa11', $line;
my $str = pop @list;
# unpack the hexadecimal bytes
my $data = join '', map { pack 'H2', $_ } @list;
die unless $data eq $str;
push @data, [$address, $data, $str];
}
close $fh;
I also went ahead and converted the 11 hexadecimal codes back into a string, using pack().
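For readers without Perl, a similar fixed-width approach can be sketched in Python with plain slicing. The column offsets are an assumption taken from the question's own regex: 8 address characters, 2 spaces, 34 characters of hex bytes and spaces, then 11 text characters. The function name is hypothetical.

```python
def parse_dump_line(line: str) -> tuple[str, str, str]:
    """Split a fixed-width dump line into (address, decoded bytes, text)."""
    address = line[0:8]     # 8 hex digits
    hex_part = line[10:43]  # eleven "XX " groups after the 2-space gap
    text = line[44:55]      # the 11 characters at the end
    # Drop the spaces and convert the hex digits back into characters.
    decoded = bytes.fromhex("".join(hex_part.split())).decode("ascii")
    return address, decoded, text


if __name__ == "__main__":
    sample = "00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre"
    address, decoded, text = parse_dump_line(sample)
    print(address, decoded, decoded == text)
```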