Match numbers in square brackets - regex

Please I try to match the following strings in a file:
test/test/abc_xyz[2][0]/abc
test/test/abc_xyz[2]/abc
test/test/abc_xyz/abc
I tried the following using regexp within a TCL script:
set match [regexp {^.*\/*\_xyz(|\[[0-9]{1,}\]|\[[0-9]{1,}\]\[[0-9]{1,}\])\/abc} $line extracted_string ]
Using this regular expression I managed to extract these lines:
"test/test/abc_xyz[2]/abc"
"test/test/abc_xyz/abc"
But Couldn't by any way extract lines similar to this:
test/test/abc_xyz[2][0]/abc
Could anybody tell me what may I be missing?

In this case, you've got this sequence — [, digit(s), ] — that you want to match zero or more times. That leads to this: (?:\[\d+\])* as a general RE. Using that in place of your more complicated piece, we can get:
set match [regexp {^.*\/*\_xyz((?:\[\d+\])*)\/abc} $line extracted_string substring ]
That's shorter, and much easier to understand. I've also added substring in there to get the bracketed sequence. Dealing with that might be best done with a judicious string map of the brackets to spaces:
set numbers [string map {"[" " " "]" " "} $substring]
Now you can use list operations (llength, lindex, foreach, etc.) on it…

Try this: update ^.*\/*\_xyz(\[\d+\])?\/abc
explain:
your problem has here (|\[[0-9]{1,}\]|\[[0-9]{1,}\]\[[0-9]{1,}\]) block.
i think you want to [05] have one time after xyz & before /abc
(\[\d+\]) for select [05]
? for one or more time have
it may helpful. demo

Related

Extract data between square brackets "[]" using Perl

I was using a regex for extracting data from curved brackets (or "parentheses") like extracting a,b from (a,b) as shown below. I have a file in which every line will be like
this is the range of values (a1,b1) and [b1|a1]
this is the range of values (a2,b2) and [b2|a2]
this is the range of values (a3,b3) and [b3|a3]
I'm using the following string to extract a1,b1, a2,b2, etc...
#numbers = $_ =~ /\((.*),(.*)\)/
However, if I want to extract the data from square brackets [], how can I do it? For example
this is the range of values (a1,b1) and [b1|a1]
this is the range of values (a1,b1) and [b2|a2]
I need to extract/match only the data in square brackets and not the curved brackets.
[Update] In the meantime, I've written a blog post about the specific issue with .* I describe below: Why Using .* in Regular Expressions Is Almost Never What You Actually Want
If your identifiers a1, b1 etc. never contain commas or square brackets themselves, you should use a pattern along the lines of the following to avoid backtracking hell:
/\[([^,\]]+),([^,\]]+)\]/
Here's a working example on Regex101.
The issue with greedy quantifiers like .* is that you'll very likely consume too much in the beginning so that the regex engine has to do extensive backtracking. Even if you use non-greedy quantifiers, the engine will do more attempts to match than necessary because it'll only consume one character at a time and then try to advance the position in the pattern.
(You could even use atomic groups to make the matching even more performant.)
#!/usr/bin/perl
# your code goes here
my #numbers;
while(chomp(my $line=<DATA>)){
if($line =~ m|\[(.*),(.*)\]|){
push #numbers, ($1,$2);
}
}
print #numbers;
__DATA__
this is the range of values [a1,b1]
this is the range of values [a2,b2]
this is the range of values [a3,b3]
Demo
You can match it using non-greedy quantifier *?
my #numbers = $_ =~ /\[(.*?),(.*?)\]/g;
or
my #numbers = /\[(.*?),(.*?)\]/g;
for short.
UPDATE
my #numbers = /\[(.*?)\|(.*?)\]/g;
Use the below code
$_ =~ /\[(.*?)\|(.*?)\]/g;
Now if the pattern is successfully matched, the extracted values would be stored in $1 and $2
.
I know I am a little late here but none of the answers correctly answered OP's question and the one that does actually matches the entire thing along with the square brackets []. Clearly the OP wants to match what is inside the brackets.
To match everything inside square brackets along with the brackets. Example
\[[^\[\]]*]
To match everything inside square brackets excluding the brackets themselves use a
positive look-head and look-behind. Example
(?<=\[)[^\[\]]*(?=\])

What regular expression can I use to find the Nᵗʰ entry in a comma-separated list?

I need a regular expression that can be used to find the Nth entry in a comma-separated list.
For example, say this list looks like this:
abc,def,4322,mail#mailinator.com,3321,alpha-beta,43
...and I wanted to find the value of the 7th entry (alpha-beta).
My first thought would not be to use a regular expression, but to use something that splits the string into an array on the comma, but since you asked for a regex.
most regexes allow you to specify a minimum or maximum match, so something like this would probably work.
/(?:[^\,]*,){5}([^,]*)/
This is intended to match any number of character that are not a comma followed by a comma six times exactly (?:[^,]*,){5} - the ?: says to not capture - and then to match and capture any number of characters that are not a comma ([^,]+). You want to use the first capture group.
Let me know if you need more info.
EDIT: I edited the above to not capture the first part of the string. This regex works in C# and Ruby.
You could use something like:
([^,]*,){$m}([^,]*),
As a starting point. (Replace $m with the value of (n-1).) The content would be in capture group 2. This doesn't handle things like lists of size n, but that's just a matter of making the appropriate modifications for your situation.
#list = split /,/ => $string;
$it = $list[6];
or just
$it = (split /,/ => $string)[6];
Beats writing a pattern with a {6} in it every time.

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
to match the first number with a minimum of 4 digits
(?<=bdfg34f;\d{4,};)\d{0,9}
and to match the first number with 1 or more length
(?<=bdfg34f;\d+;)\d{0,9}
or to match the first number only if the length is between 4 and 6
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[3]; // third column
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.

Regex: delete contents of square brackets

Is there a regular expression that can be used with search/replace to delete everything occurring within square brackets (and the brackets)?
I've tried \[.*\] which chomps extra stuff (e.g. "[chomps] extra [stuff]")
Also, the same thing with lazy matching \[.*?\] doesn't work when there is a nested bracket (e.g. "stops [chomping [too] early]!")
Try something like this:
$text = "stop [chomping [too] early] here!";
$text =~ s/\[([^\[\]]|(?0))*]//g;
print($text);
which will print:
stop here!
A short explanation:
\[ # match '['
( # start group 1
[^\[\]] # match any char except '[' and ']'
| # OR
(?0) # recursively match group 0 (the entire pattern!)
)* # end group 1 and repeat it zero or more times
] # match ']'
The regex above will get replaced with an empty string.
You can test it online: http://ideone.com/tps8t
EDIT
As #ridgerunner mentioned, you can make the regex more efficiently by making the * and the character class [^\[\]] match once or more and make it possessive, and even by making a non capturing group from group 1:
\[(?:[^\[\]]++|(?0))*+]
But a real improvement in speed might only be noticeable when working with large strings (you can test it, of course!).
This is technically not possible with regular expressions because the language you're matching does not meet the definition of "regular". There are some extended regex implementations that can do it anyway using recursive expressions, among them are:
Greta:
http://easyethical.org/opensource/spider/regexp%20c++/greta2.htm#_Toc39890907
and
PCRE
http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
See "Recursive Patterns", which has an example for parentheses.
A PCRE recursive bracket match would look like this:
\[(?R)*\]
edit:
Since you added that you're using Perl, here's a page that explicitly describes how to match balanced pairs of operators in Perl:
http://perldoc.perl.org/perlfaq6.html#Can-I-use-Perl-regular-expressions-to-match-balanced-text%3f
Something like:
$string =~ m/(\[(?:[^\[\]]++|(?1))*\])/xg;
Since you're using Perl, you can use modules from the CPAN and not have to write your own regular expressions. Check out the Text::Balanced module that allows you to extract text from balanced delimiters. Using this module means that if your delimiters suddenly change to {}, you don't have to figure out how to modify a hairy regular expression, you only have to change the delimiter parameter in one function call.
If you are only concerned with deleting the contents and not capturing them to use elsewhere you can use a repeated removal from the inside of the nested groups to the outside.
my $string = "stops [chomping [too] early]!";
# remove any [...] sequence that doesn't contain a [...] inside it
# and keep doing it until there are no [...] sequences to remove
1 while $string =~ s/\[[^\[\]]*\]//g;
print $string;
The 1 while will basically do nothing while the condition is true. If a s/// matches and removes a bracketed section the loop is repeated and the s/// is run again.
This will work even if your using an older version of Perl or another language that doesn't support the (?0) recursion extended pattern in Bart Kiers's answer.
You want to remove only things between the []s that aren't []s themselves. IE:
\[[^\]]*\]
Which is a pretty hairy mess of []s ;-)
It won't handle multiple nested []s though. IE, matching [foo[bar]baz] won't work.

Using regex to find any last occurrence of a word between two delimiters

Suppose I have the following test string:
Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop
where _ means any characters, eg: StartaGetbbGetcccGetddddStopeeeeeStart....
What I want to extract is any last occurrence of the Get word within Start and Stop delimiters. The result here would be the three bolded Get below.
Start__Get__Get__Get__Stop__Start__Get__Get__Stop__Start__Get__Stop
I precise that I'd like to do this only using regex and as far as possible in a single pass.
Any suggestions are welcome
Thanks'
Get(?=(?:(?!Get|Start|Stop).)*Stop)
I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
I would have done it with two passes. The first pass find the word "Get", and the second pass count the number of occurrences of it.
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get
Something like this, maybe:
(?<=Start(?:.Get)*)Get(?=.Stop)
That requires variable-length lookbehind support, which not all regex engines support.
It could be made to have a max length, which a few more (but still not all) support, by changing the first * to {0,99} or similar.
Also, in the lookahead, possibly the . should be a .+ or .{1,2} depending on if the double underscore is a typo or not.
With Perl, i'd do :
my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;
output:
Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop
You should adapt to your regex flavour.