regex tutorial, How can I improve this - regex

I needed a utility function earlier today to strip some data out of a file and wrote an appalling regular expression to do it. The input was a file with lots of lines in the format:
<address> <11 * ascii character value> <11 characters>
00C4F244 75 6C 74 73 3E 3C 43 75 72 72 65 ults><Curre
I wanted to strip out everything bar the 11 characters at the end and used the following expression:
"^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}"
This matched to the bits I didn't want which I then removed from the original string. I'd like to see how you'd do this but the particular areas I couldn't get working were:
1: having the regex engine return the characters I wanted rather than the characters I didn't and
2: finding a way of repeating the match on a single ascii value followed by the space (eg "75 " = [0-9A-F]{2}[\s]{1}?) and repeating that 11 times rather than grabbing 34 characters.
Looking at it again the easiest thing to do would be to match to the last 11 characters of each input line but this isn't very flexible and in the interests of learning regex I would like to see how you can match through from the start of the sequence.
Edit: Thanks guys, this is what I wanted:
"(?:^[0-9A-F]{8} )(?:[0-9A-F]{2} ){11} (.*)"
Wish I could turn more than one of you green.

As the file has a fixed format, you could use this regular expression to just match the last 11 characters.
^.{44}(.{11})

Last eleven is:
...........$
or:
.{11}$
Matching a hex byte + space and repeat eleven times:
([0-9A-Fa-f]{2} ){11}

1) ^[0-9A-F+]{8}[\s]{2}[0-9A-F\s]{34}(.*)
Parens are used for grouping with extraction. How you retrieve it depends on your language context, but now some sort of $1 is set to everything after the initial pattern.
2) ^[0-9A-F+]{8}[\s]{2}(?:[0-9A-F]{2}\s){11}\s(.*)
(?:) is grouping without extraction. So (?:[0-9A-F]{2}\s){11} treats the subpattern (two hex digits followed by one whitespace character) as a unit and looks for it repeated 11 times.
I'm assuming PCRE here, by the way.
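To make the capture-group idea concrete, here is a minimal Python sketch. The column layout (8 hex digits, two spaces, eleven "XX " pairs, one extra space, then the text) is an assumption based on the 8 + 2 + 34 character counts in the question, and the names LINE_RE and extract_text are made up for illustration.

```python
import re

# Assumed layout: 8 hex digits, two spaces, eleven "XX " byte pairs,
# one extra space, then the text column we want to keep.
LINE_RE = re.compile(r"^[0-9A-F]{8}\s{2}(?:[0-9A-F]{2}\s){11}\s(.*)")

def extract_text(line):
    """Return the trailing text column, or None if the line does not match."""
    m = LINE_RE.match(line)
    return m.group(1) if m else None

# Build a sample line from the bytes shown in the question.
hex_bytes = "75 6C 74 73 3E 3C 43 75 72 72 65"
sample = "00C4F244" + "  " + hex_bytes + "  " + "ults><Curre"
print(extract_text(sample))
```

The non-capturing group handles the repetition, and only the final (.*) is returned, so nothing has to be stripped from the original string afterwards.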

The address and ascii char value are all hex so:
^[0-9A-F\s]{42}

Matching the end of the line would be
.{11}$
To match only the end, you can use a positive look behind.
"(?<=(^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}))(.*?)$"
This would match any character until the end of the line, providing that it is preceded by the "look behind" expression.
(?<=....) defines a condition that must be met before matching is possible.
I am a bit short of time, but if you search the net for any tutorial that contains the words "regex" and "lookbehind", you will find good stuff (if a regex tutorial covers lookahead/lookbehind, it will usually be pretty complete and advanced).
Another piece of advice: get a regex training tool and play with it. Have a look at this excellent Regex designer.

If you're using Perl, you could also use unpack(), to get each element.
my @data;
open my $fh, '<', $filename or die;
for my $line (<$fh>) {
    # a8 = address, xx = skip 2, (a2x)11 = eleven hex pairs, x = skip 1, a11 = text
    my ($address, @list) = unpack 'a8xx(a2x)11xa11', $line;
    my $str = pop @list;
    # unpack the hexadecimal bytes
    my $data = join '', map { pack 'H2', $_ } @list;
    die unless $data eq $str;
    push @data, [$address, $data, $str];
}
close $fh;
I also went ahead and converted the 11 hexadecimal codes back into a string, using pack().
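For readers who don't use Perl, roughly the same fixed-width approach can be sketched in Python. The slice positions mirror the 'a8xx(a2x)11xa11' template above; parse_line is a hypothetical helper name, and the sample line is built by hand under that layout.

```python
def parse_line(line):
    """Split a fixed-width dump line into (address, decoded bytes, text)."""
    address = line[0:8]        # a8
    hex_field = line[10:43]    # xx skipped, then eleven "XX " pairs (33 chars)
    text = line[44:55]         # one more skipped char, then a11
    # bytes.fromhex ignores the spaces between the hex pairs
    decoded = bytes.fromhex(hex_field).decode("ascii")
    return address, decoded, text

sample = "00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre"
address, decoded, text = parse_line(sample)
print(address, decoded, text)
```

As in the Perl version, comparing the decoded bytes with the text column is a cheap sanity check that the columns were sliced correctly.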

Related

REGEX: Put a space every 3 digits without using " "

Hello !
I've been looking for more than a day now but I can't find an answer, so I'm asking here!
Explanation:
I created a game using a Discord bot (Atlas) that provides many functions, one of which is the one I want to talk about: replace. What I'm trying to do is use a regex to put a space every three digits, to format numbers like this:
Base number:
25
321
54500
78545515201
After formatting:
25
321
54 500
78 545 515 201
But in the replacement section, spaces " " are trimmed from the front and back, so I can't use "$1 " as the replacement. However, if I use $1 $2, the space between the two arguments is kept.
So what I'm looking for is a way to format my numbers using $1 $2 as the replacement so that the space is kept.
If anyone has a solution, I would be really grateful!
EDIT: here is the link about the replace function: https://atlas.bot/documentation/tags/replace
You can make use of an empty capture group to assert a position without a char capture so that your replacement can be $1 $2:
(\d)()(?=(\d{3})+(?!\d))
Here it is in JS:
https://regex101.com/r/virtsL/1/
But it's also compatible in PHP (PCRE), Python, and Java.
Attribution: regex originally from https://coderwall.com/p/uccfpq/formatting-currency-via-regular-expression and I just added the empty capture group.
Per your comments, here is a working version of your attempt; slightly modified:
(\d)()(?=(\d\d\d)+(\D|$))
https://regex101.com/r/McrHgj/1/
const inputStr = `
25
321
54500
78545515201
`
const res = inputStr.replace(/(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))/g, " ")
console.log(res)
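The empty-capture-group trick can also be tried in Python, which one of the answers lists as compatible. This is only a sketch: SPACER and group_digits are illustrative names, and Python writes the replacement as \1 \2 rather than $1 $2.

```python
import re

# (\d)   a digit to keep
# ()     empty group, so the replacement can be "\1 \2" with an inner space
# (?=(\d{3})+(?!\d))  only where the digits that follow come in groups of three
SPACER = re.compile(r"(\d)()(?=(\d{3})+(?!\d))")

def group_digits(text):
    """Insert a space between every group of three digits, from the right."""
    return SPACER.sub(r"\1 \2", text)

for number in ["25", "321", "54500", "78545515201"]:
    print(group_digits(number))
```

Because the space sits between two group references, it survives a replacement field that trims leading and trailing whitespace.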

Regular expression to validate 2 character hex string

I have a source of data that was converted from an oracle database and loaded into a hadoop storage point. One of the columns was a BLOB and therefore had lots of control characters and unreadable/undetectable ascii characters outside of the available codeset. I am using Impala to write regex replace function to parse some of the unicode characters that the regex library cannot understand. I would like to remove the offending 2 character hex codes BEFORE I use the unhex query function so that I can do the rest of the regex parsing with a "clean" string.
Here's the code I've used so far, which doesn't quite work:
'[2-7]{1}([A-Fa-f]|[0-9]{1})'
I've determined that I only need to capture \u0020-\u007f, or, represented as two-character hex, 20-7f
If my string looks like this:
010A000000153020405C00000000143020405CBC000000F53320405C4C010000E12F204058540100002D01
I would like to be able to capture 2 characters at a time (e.g. 01,0A,00), evaluate whether or not each pair fits the acceptable range of two-character hex values I mentioned above, and return only what is acceptable.
The correct output should be:
30 20 40 5C 30 20 40 5C 33 20 40 5C 4C 2F 20 40 58 and 54
However, my expression finds the first acceptable number in my first range (5) and starts the capture from there which returns the position or indexing wrong for the rest of the string... and this is the return from my expression -
010A0000001**53**0**20****40****5C**000000001**43**0**20****40****5C**BC000000F**53****32**0**40****5C****4C**010000E1**2F****20****40****58****54**010000**2D**01
I just don't know how to evaluate only two characters at a time in a mixed-length string. And, if they don't fit the expression, iterate to the next two characters. But only in two character increments.
My example: https://regex101.com/r/BZL7t0/1
I have added a positive lookbehind to it, which starts at the beginning of the string and then matches 2 characters at a time. This ensures that the pair you're matching is always preceded by a whole number of 2-character groups.
Positive lookbehind:
(?<=^(..)*)
Updated regex:
(?<=^(..)*)([2-7]{1}[A-Fa-f0-9]{1})
Preview:
Regex101
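Not every engine accepts a variable-length lookbehind like the one above; Python's built-in re module, for example, only allows fixed-width lookbehind. As an alternative sketch (a different technique, not the answer's regex), you can split the string into pairs first and then test each pair. printable_pairs is a made-up name, and the sample is just the first 20 characters of the string in the question.

```python
import re

def printable_pairs(hex_string):
    """Keep only the 2-character hex codes in the range 20-7F."""
    # Walk the string strictly two characters at a time.
    pairs = re.findall(r"..", hex_string)
    # A pair qualifies when its first digit is 2-7 and its second is any hex digit.
    return [p for p in pairs if re.fullmatch(r"[2-7][0-9A-Fa-f]", p)]

print(printable_pairs("010A000000153020405C"))
```

Chunking before filtering keeps the pair boundaries fixed, so a match can never start in the middle of a byte.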

Regular expression to capture first n digits from comma separated strings

I quickly found a working multi-line regular expression for my needs, but I'm having trouble converting it to work on a single line.
So, consider this input with regex /^[2-9]\d{1}(?:\s){0}/gm applied:
4126-54D429-001,
5149-A42102-002,
9251-Z48910-003
...
However, when I turn it into one line, I'm getting only the first two digits in the output:
4126-54D429-001, 5149-A42102-002, 9251-Z48910-003 ...
How can this regexp be written to get this capture:
4126-54D429-001, 5149-A42102-002, 9251-Z48910-003 ... ?
This Should Work.
REGEXP
\b\d{2}(?=\d{2})
INPUT
4126-54D429-001, 5149-A42102-002, 9251-Z48910-003, 7851-Z48910-003
OUTPUT
41
51
92
78
The comma is not essential.
If this helps, please mark it as correct and upvote.
This will capture the first two digits of each in groups:
(\d{2})[^,]*
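A quick way to check the first pattern is a small Python sketch like the one below; first_two_digits is an illustrative name and the input is the one from the answer above.

```python
import re

def first_two_digits(text):
    """Return the first two digits of every item that starts with four digits."""
    # \b anchors at the start of an item; the lookahead requires two more digits.
    return re.findall(r"\b\d{2}(?=\d{2})", text)

line = "4126-54D429-001, 5149-A42102-002, 9251-Z48910-003, 7851-Z48910-003"
print(first_two_digits(line))
```

The lookahead is what keeps segments such as 54D429 and 001 from matching, since neither has four leading digits.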

Regular expression hex bytes in string

I am trying to validate the input of a LineEdit widget in Qt. I am using regular expressions for the first time so I could use some help. I want to allow 1 to 32 hex bytes separated by a space, for example this should be valid:
"0a 45 bc 78 e2 34 71 af"
And here are some examples of invalid input:
"1 34 bc 4e" -> They need to be written in pairs, so 1 must be 01.
"8a cb3 58 11" -> cb3 invalid.
"56 f2 a8 69 " -> No trailing space is allowed.
After some head scratching I came up with this regex which seems to work:
"([0-9A-Fa-f]{2}[ ]){0,31}([0-9A-Fa-f]{2})"
Now on to my questions:
Do you see any problems with my regex that my tests have failed to show? If so how can I improve it?
Is there a cleaner way to write it?
Thanks in advance
I am not sure what method you used for validation, but one possible problem is that the method searches the string for a substring that matches the pattern rather than checking that the whole string matches the pattern. Use exactMatch to validate a string against a regular expression.
In any case, adding anchors ^ and $ is safer (not necessary when exactMatch is used, though):
"^([0-9A-Fa-f]{2}[ ]){0,31}([0-9A-Fa-f]{2})$"
Since you are doing validation, you don't need capturing. And you don't need to put space in []
"^(?:[0-9A-Fa-f]{2} ){0,31}[0-9A-Fa-f]{2}$"
You can set case-sensitivity with setCaseSensitivity method. If you set it to Qt::CaseInsensitive, you can shorten the regex a bit:
"^(?:[0-9a-f]{2} ){0,31}[0-9a-f]{2}$"

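To see the whole-string check in action outside Qt, here is a minimal Python sketch. re.fullmatch plays a role similar to exactMatch here (that comparison is my own analogy), and is_valid_hex_bytes is a hypothetical helper name.

```python
import re

# 0 to 31 "XX " pairs followed by one final "XX": 1 to 32 bytes in total.
HEX_BYTES = re.compile(r"(?:[0-9A-Fa-f]{2} ){0,31}[0-9A-Fa-f]{2}")

def is_valid_hex_bytes(text):
    """True if text is 1 to 32 two-digit hex bytes separated by single spaces."""
    return HEX_BYTES.fullmatch(text) is not None

for candidate in ["0a 45 bc 78 e2 34 71 af", "1 34 bc 4e", "8a cb3 58 11", "56 f2 a8 69 "]:
    print(repr(candidate), is_valid_hex_bytes(candidate))
```

With a substring search instead of a full match, "8a cb3 58 11" would be accepted because "8a" on its own satisfies the pattern.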
How can I access capture buffers in brackets with quantifiers?

#!/usr/local/bin/perl
use warnings;
use 5.014;
my $string = '12 34 56 78 90';
say $string =~ s/(?:(\S+)\s){2}/$1,$2,/r;
# Use of uninitialized value $2 in concatenation (.) or string at ./so.pl line 7.
# 34,,56 78 90
With @LAST_MATCH_START and @LAST_MATCH_END it works*, but the line gets too long.
Doesn't work, look at TLP's answer.
*The proof of the pudding is in the eating isn't always right.
say $string =~ s/(?:(\S+)\s){2}/substr( $string, $-[0], length($-[0]-$+[0]) ) . ',' . substr( $string, $-[1], length($-[1]-$+[1]) ) . ','/re;
# 12,34,56 78 90
You can't access all previous values of the first capturing group, only the last value (or the current at the match end, as you can see it) will be saved in $1 (unless you want to use a (?{ code }) hack).
For your example you could use something like:
s/(\S+)\s+(\S+)\s+/$1,$2,/
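The same behaviour shows up in other engines. The Python sketch below (variable names are illustrative only) shows that a quantified group keeps just its last iteration, and that writing the group out twice gives two separately addressable captures.

```python
import re

s = "12 34 56 78 90"

# The group runs twice, but group(1) only remembers the final iteration.
m = re.match(r"(?:(\S+)\s){2}", s)
last_only = m.group(1)

# Spelling out two groups makes both values available to the replacement.
fixed = re.sub(r"(\S+)\s+(\S+)\s+", r"\1,\2,", s, count=1)

print(last_only)
print(fixed)
```

The count=1 argument mimics a Perl substitution without the /g flag, which replaces only the first match.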
The statement that you say "works" has a bug in it.
length($-[0]-$+[0])
This does not return the length of your match; it returns the length of a negative number. $-[0] and $+[0] are the offsets of the start and end of the whole match in the string, and $-[1] and $+[1] are those of the first capture group. Start minus end is therefore the negative of the matched length: here 0 - 6 = -6 for the whole match and 3 - 5 = -2 for the group. length() of either number is 2, because each prints as two characters.
So, what you are doing is taking the first two characters of the match 12 34, and the first two characters of the match 34 and concatenating them with a comma in the middle. It works by coincidence, not because of capture groups.
It sounds as though you are asking us to solve the problems you have with your solution, rather than asking us about the main problem.