How can I access capture buffers in brackets with quantifiers?

#!/usr/local/bin/perl
use warnings;
use 5.014;
my $string = '12 34 56 78 90';
say $string =~ s/(?:(\S+)\s){2}/$1,$2,/r;
# Use of uninitialized value $2 in concatenation (.) or string at ./so.pl line 7.
# 34,,56 78 90
With @LAST_MATCH_START and @LAST_MATCH_END (@- and @+) it appears to work*, but the line gets too long.
*Edit: it doesn't actually work; look at TLP's answer. "The proof of the pudding is in the eating" isn't always right.
say $string =~ s/(?:(\S+)\s){2}/substr( $string, $-[0], length($-[0]-$+[0]) ) . ',' . substr( $string, $-[1], length($-[1]-$+[1]) ) . ','/re;
# 12,34,56 78 90

You can't access every value that a quantified capturing group matched. Only the value from the last iteration is kept in $1 (unless you want to use a (?{ code }) hack), and because the pattern contains only one group, $2 is never set.
For your example you could write the group out twice instead:
s/(\S+)\s+(\S+)\s+/$1,$2,/
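To see the difference in isolation, here is a minimal sketch. It uses Python's re module rather than Perl (an assumption made purely for illustration; the group behaviour is the same in both): a repeated group keeps only its last iteration, while two explicit groups keep both values.

```python
import re

s = '12 34 56 78 90'

# A quantified group is still a single group: after two iterations
# it holds only what was captured on the last pass.
m = re.match(r'(?:(\S+)\s){2}', s)
last_only = m.groups()

# Writing the group out twice gives two separate captures.
result = re.sub(r'(\S+)\s+(\S+)\s+', r'\1,\2,', s, count=1)

print(last_only)
print(result)
```

The first print shows a one-element tuple, which is why there is nothing for $2 to refer to in the original substitution.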

The statement that you say "works" has a bug in it.
length($-[0]-$+[0])
will always return the length of the string form of a negative number, not the length of your match. $-[0] and $+[0] are the offsets of the start and end of the whole match in the string, and $-[1] and $+[1] are those of the first capture group. Here the whole match, "12 34 ", is six characters long, so the start offset minus the end offset is -6, and length(-6) is 2. For the capture group, 34, the difference is -2, and length(-2) is also 2.
So, what you are doing is taking the first two characters of the match 12 34 and the first two characters of the capture 34, and joining them with commas. It works by coincidence, not because of capture groups.
It sounds as though you are asking us to solve the problems you have with your solution, rather than asking us about the main problem.
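The coincidence can be reproduced step by step. The following is a sketch in Python, not the original Perl: m.start(n) and m.end(n) stand in for $-[n] and $+[n], and the helper names are made up for this illustration.

```python
import re

PATTERN = re.compile(r'(?:(\S+)\s){2}')

def buggy_piece(s, n):
    # Mirrors substr($string, $-[n], length($-[n] - $+[n])):
    # the "length" is the character count of a negative number's text.
    m = PATTERN.match(s)
    start, end = m.start(n), m.end(n)
    width = len(str(start - end))
    return s[start:start + width]

def correct_piece(s, n):
    # Takes the real span of the match (n = 0) or of group n.
    m = PATTERN.match(s)
    return s[m.start(n):m.end(n)]

print(buggy_piece('12 34 56 78 90', 0), buggy_piece('12 34 56 78 90', 1))
print(buggy_piece('123 4567 89', 0), buggy_piece('123 4567 89', 1))
```

With two-digit fields the buggy version happens to return the right pieces; with longer fields it still returns two characters each time.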

Related

How can I identify a number of variable length with a regex?

I need a Perl regex to pull a number of between six and ten digits out of a string. The number will always follow a particular word followed by a space (case unknown).
For example, if the word I was looking for is 'string':
some random text blah blah blahSTRING 1234567890some more random text
Desired output:
1234567890
Another example:
yet more random textra ra rastring 654321hey hey my my
Desired output:
654321
I want to load the result into a variable.

/string ([0-9]{6,10})/i
string matches both STRING and string, because the expression ends with i (case-insensitive matching)
the space matches a literal space
( starts a capture group for the number you are trying to get
[0-9]{6,10} matches a run of 6 to 10 digits
https://regex101.com/r/mB1zF4/1
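As a rough sketch of how the captured number could be loaded into a variable, here is the same pattern in Python (an assumption for illustration; the question itself is about Perl, where the capture would be in $1). The function name is hypothetical.

```python
import re

# re.IGNORECASE plays the role of the trailing /i.
NUMBER_AFTER_WORD = re.compile(r'string ([0-9]{6,10})', re.IGNORECASE)

def extract_number(text):
    # Returns the digits after the word, or None when nothing matches.
    m = NUMBER_AFTER_WORD.search(text)
    return m.group(1) if m else None

first = extract_number('some random text blah blah blahSTRING 1234567890some more random text')
second = extract_number('yet more random textra ra rastring 654321hey hey my my')
print(first, second)
```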

Group 1 should contain your number with
/^.*string (\d+).*$/i

Thanks everyone, between all the responses and a bit of googling I ended up with
#!/usr/local/bin/perl -w
use strict;
my $string = 'sgtusadl;fdsas;adlhstring 12345678daf;slkdfja;dflk';
my ( $number ) = $string =~ m/string\s\d{6,10}/gi;
$number =~ s/[^0-9]//g;
print "number is $number\n";
exit 0;

Regex: match when string has repeated letter pattern

I'm using the Regex interpreter found in XYplorer file browser. I want to match any string (in this case a filename) that has repeated groups of 'several' characters. More specifically, I want a match on the string:
jack johnny - mary joe ken johnny bill
because it has 'johnny' at least twice. Note that it has spaces and a dash too.
It would be nice to be able to specify the length of the group to match, but in general 4, 5 or 6 will do.
I have looked at several previous questions here, but either they are for specific patterns or involve some language as well. The one that almost worked is:
RegEx: words with two letters repeated twice (eg. ABpoiuyAB, xnvXYlsdjsdXYmsd)
where the answer was:
\b\w*(\w{2})\w*\1
However, this fails when there are spaces in the strings.
I'd also like to limit my searches to .jpg files, but XYplorer has a built-in filter to only look at image files so that isn't so important to me here.
Any help will be appreciated, thanks.
EDIT -
The regex by OnlineCop below answered my original question, thanks very much:
(\b\w+.*\b).*(\1)
I see that it matches words, not arbitrary string chunks, but that works for my present need. And I am not interested in capturing anything, just in detecting a match.
As a refinement, I wonder if it can be changed or extended to allow me to specify the length of words (or string chunks) that must be the same in order to declare a match. So, if I specified a match length of 5 and my filenames are:
1) jack john peter paul mary johnnie.jpg
2) jack johnnie peter paul mary johnnie.jpg
the first one would not match since no substring of five characters or more is repeated. The second one would match since 'johnnie' is repeated and is more than 5 chars long.

Do you wish to capture the word 'johnny' or the stuff between them (or both)?
This example shows that it selects everything from the first 'johnny' to the last, but it does not capture the stuff between:
Re: (\b\w+\b).*(\1)
Result: jack bill
This example allows some whitespace between names/words:
Re: (\b\w+.*\b).*(\1)
String: Jackie Chan fought The Dragon who was fighting Jackie Chan
Result: Jackie Chan Jackie Chan
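The refinement asked for in the question (a minimum length for the repeated word) could be approached by putting a length quantifier inside the group. This is only a sketch, checked in Python's re module; whether XYplorer's regex engine accepts the same syntax is an assumption, and the function name is made up.

```python
import re

def repeated_word(text, min_len=1):
    # \b(\w{n,})\b captures a whole word of at least min_len characters;
    # \b\1\b requires that same whole word to appear again later.
    pattern = r'\b(\w{%d,})\b.*\b\1\b' % min_len
    m = re.search(pattern, text)
    return m.group(1) if m else None

print(repeated_word('jack johnny - mary joe ken johnny bill'))
print(repeated_word('jack john peter paul mary johnnie.jpg', 5))
print(repeated_word('jack johnnie peter paul mary johnnie.jpg', 5))
```

The \b on both sides of \1 keeps john from counting as a repeat inside johnnie.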

Use perl:
#!/usr/bin/perl
use strict;
use warnings;
while ( my $line = <STDIN> ) {
    chomp $line;
    my @words = split ( /\s+/, $line );
    my %seen;    # words already encountered on this line
    foreach my $word ( @words ) {
        if ( $seen{$word} ) { print "Match: $line\n"; last }
        $seen{$word}++;
    }
}
And yes, it's not as neat as a one line regexp, but it's also hopefully a bit clearer what's going on.

Remove empty spaces and period

I cannot get this regex to work:
"4. 182 ex" (number, period, 2 blank spaces, 3 numbers, blank space, 2 characters)
The regex syntax should return "4182" and remove period, blank spaces, and characters.
Can you help me please?
EDIT!!!
Thanks everyone but I missed the key question:
a) the regex shall only find the value (4182) when the same line contains a specific text for example "magic", so for example:
"Magic 4. 182 ex"
b) the regex shall "only" find the value (4182) when the table contains a specific text for example "Magic":
"Magic 4. 182 ex
Lisefeo 2. 123 fg
Nioos 3. 124 df"
specific text = an exact match, or a string that contains those characters
This is the regex I've tried so far, but does it work for a whole table (not just a single line)?
(Magic.*?(\d).\s\s(\d{3})\s\w\w)

Just remove all characters that are not digit:
Perl:
$string =~ s/\D+//g;
or
php:
$string = preg_replace('/\D+/', '', $string);
According to your updated question, you could do (allowing whitespace after Magic and after the period):
$string =~ s/^Magic\s+(\d+)\.\s+(\d{3})\b.*$/$1$2/
or, with php:
$string = preg_replace('/^Magic\s+(\d+)\.\s+(\d{3})\b.*$/', '$1$2', $string);
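For the table case in the question, one option is to anchor the pattern at the start of each line. The sketch below uses Python (an assumption for illustration; the answer above uses Perl and PHP), assumes one or more whitespace characters between the fields, and uses a hypothetical function name.

```python
import re

# re.MULTILINE lets ^ match at the start of every line of the table.
ROW = re.compile(r'^Magic\s+(\d+)\.\s+(\d{3})\b', re.MULTILINE)

def magic_value(table):
    # Returns the joined digits from the first "Magic" row, else None.
    m = ROW.search(table)
    return m.group(1) + m.group(2) if m else None

table = "Magic 4. 182 ex\nLisefeo 2. 123 fg\nNioos 3. 124 df"
print(magic_value(table))
```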

For it to match exactly what you said, use:
(\d)\.\s\s(\d{3})\s\w\w
You'll get the result in two groups: the single digit and the three-digit group.
RegEx101 example
Regards.

^([\d]+)\.[\s]+([\d]+)[\s]..
Tested with perl:
> echo "4. 182 ex" | perl -lne 'print $1,$2 if(/^([\d]+)\.[\s]+([\d]+)[\s]../)'
4182

regular expression about numbers

Is this regular expression valid if I want to accept only numbers up to 31?
'[^0-9>31]+
Or will it also return alphabetic characters, so that I must somehow exclude them too?

Your regex accepts one or more characters, each of which is not one of the following
0 1 2 3 4 5 6 7 8 9 >
What you want is:
/^(?:[0-9]|[12][0-9]|3[01])$/
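A quick way to check both patterns is to run them over sample inputs. This sketch uses Python's re module (an assumption; the question does not name a language) and made-up variable names.

```python
import re

# The character class from the question: anything except digits and '>'.
CHAR_CLASS = re.compile(r'[^0-9>31]+')

# The alternation from the answer: 0-9, 10-29, 30-31.
UP_TO_31 = re.compile(r'^(?:[0-9]|[12][0-9]|3[01])$')

accepted = [n for n in range(100) if UP_TO_31.match(str(n))]

print(bool(CHAR_CLASS.fullmatch('abc')), bool(CHAR_CLASS.fullmatch('12')))
print(accepted[0], accepted[-1], len(accepted))
```

The first pattern accepts letters and rejects digits, which is the opposite of what the question wants.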

Regular expressions are not the sonic screwdriver of text, able to magically do everything you could possibly want. There is nothing in regular expressions that will check the value of a number.
What you need to do is two steps, written here in Perl.
$ok = ($s =~ /^\d{1,2}$/) && ($s <= 31);
That checks the value of $s for start of the string (^), one or two digits (\d{1,2}) and then the end of the string ($). If that is true, then it also checks that the numeric value of $s is no greater than 31.
Yes, you can use a complex regex like this from Ray Toal's answer:
/^(?:[0-9]|[12][0-9]|3[01])$/
but that is far less readable.
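The two-step idea carries over to other languages. Here is a sketch in Python (chosen for illustration; the function name and the limit parameter are made up) that keeps the shape check and the value check separate.

```python
import re

def is_up_to(s, limit=31):
    # Step 1: shape check, one or two digits and nothing else.
    if not re.fullmatch(r'[0-9]{1,2}', s):
        return False
    # Step 2: value check, done with ordinary arithmetic.
    return int(s) <= limit

print(is_up_to('31'), is_up_to('32'), is_up_to('07'), is_up_to('ab'))
```

One difference worth noting: this version accepts a leading zero such as 07, which the single alternation regex rejects.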

regex tutorial, How can I improve this

I needed a utility function earlier today to strip some data out of a file and wrote an appalling regular expression to do it. The input was a file with lots of lines in the format:
<address> <11 * ascii character value> <11 characters>
00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre
I wanted to strip out everything bar the 11 characters at the end and used the following expression:
"^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}"
This matched to the bits I didn't want which I then removed from the original string. I'd like to see how you'd do this but the particular areas I couldn't get working were:
1: having the regex engine return the characters I wanted rather than the characters I didn't and
2: finding a way of repeating the match on a single ascii value followed by the space (eg "75 " = [0-9A-F]{2}[\s]{1}?) and repeating that 11 times rather than grabbing 34 characters.
Looking at it again the easiest thing to do would be to match to the last 11 characters of each input line but this isn't very flexible and in the interests of learning regex I would like to see how you can match through from the start of the sequence.
Edit: Thanks guys, this is what I wanted:
"(?:^[0-9A-F]{8}  )(?:[0-9A-F]{2} ){11} (.*)"
Wish I could turn more than one of you green.

As the file has a fixed format, you could use this regular expression to just match the last 11 characters.
^.{44}(.{11})

Last eleven is:
...........$
or:
.{11}$
Matching a hex byte + space and repeat eleven times:
([0-9A-Fa-f]{2} ){11}

1) ^[0-9A-F+]{8}[\s]{2}[0-9A-F\s]{34}(.*)
Parens are used for grouping with extraction. How you retrieve it depends on your language context, but now some sort of $1 is set to everything after the initial pattern.
2) ^[0-9A-F+]{8}[\s]{2}(?:[0-9A-F]{2}\s){11}\s(.*)
(?:) is grouping without extraction. So (?:[0-9A-F]{2}\s){11} treats "two hex digits plus one whitespace character" as a unit and looks for it repeated 11 times.
I'm assuming PCRE here, by the way.
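Here is a small sketch of the grouped-repetition idea, written in Python rather than PCRE (an assumption for illustration). It also assumes the layout implied by the question's own pattern: two spaces after the address and two spaces before the trailing text. The function name is hypothetical.

```python
import re

# 8 hex digits, 2 spaces, 11 x (hex byte + space), 1 more space, then the text.
LINE = re.compile(r'^[0-9A-F]{8}\s{2}(?:[0-9A-F]{2}\s){11}\s(.*)$')

sample = '00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre'

def trailing_text(line):
    # Group 1 is the only capturing group; (?:...) does not capture.
    m = LINE.match(line)
    return m.group(1) if m else None

print(trailing_text(sample))
```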

The address and ascii char value are all hex so:
^[0-9A-F\s]{42}

Matching the end of the line would be
.{11}$
To match only the end, you can use a positive look behind.
"(?<=(^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}))(.*?)$"
This would match any character until the end of the line, providing that it is preceded by the "look behind" expression.
(?<=....) defines a condition that must be met before matching is possible.
I am a bit short of time, but if you search the net for any tutorial that contains the words "regex" and "lookbehind", you will find good material (if a regex tutorial covers lookahead and lookbehind, it will usually be fairly complete and advanced).
Another piece of advice is to get a regex training tool and play with it. Have a look at this excellent Regex designer.
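The lookbehind approach can be tried out as follows. This is a sketch in Python, which only accepts fixed-width lookbehinds (the prefix here is always 44 characters, so that holds); the sample line assumes two spaces after the address and before the text, and the function name is made up.

```python
import re

# The lookbehind asserts the 44-character prefix without consuming it,
# so the match itself is only the trailing text.
TAIL = re.compile(r'(?<=^[0-9A-F]{8}\s{2}[0-9A-F\s]{34}).*$')

sample = '00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre'

def tail_of(line):
    m = TAIL.search(line)
    return m.group(0) if m else None

print(tail_of(sample))
```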

If you're using Perl, you could also use unpack(), to get each element.
my @data;
open my $fh, '<', $filename or die;
for my $line(<$fh>){
    my($address,@list) = unpack 'a8xx(a2x)11xa11', $line;
    my $str = pop @list;
    # unpack the hexadecimal bytes
    my $data = join '', map { pack 'H2',$_ } @list;
    die unless $data eq $str;
    push @data, [$address,$data,$str];
}
close $fh;
I also went ahead and converted the 11 hexadecimal codes back into a string, using pack().
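For readers without Perl, the same fixed-width approach can be sketched with plain string slicing in Python. The column positions assume the layout described by the unpack template above (8-character address, 2 spaces, eleven hex bytes separated by spaces, 2 spaces, 11 characters); the function name is hypothetical.

```python
def parse_dump_line(line):
    address = line[0:8]
    hex_part = line[10:42]      # "75 6C ... 65", eleven space-separated bytes
    text = line[44:55]          # the 11 trailing characters
    # bytes.fromhex ignores the spaces between byte pairs.
    decoded = bytes.fromhex(hex_part).decode('latin-1')
    return address, decoded, text

sample = '00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre'
print(parse_dump_line(sample))
```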
