Extract data between square brackets "[]" using Perl - regex

I was using a regex for extracting data from curved brackets (or "parentheses") like extracting a,b from (a,b) as shown below. I have a file in which every line will be like
this is the range of values (a1,b1) and [b1|a1]
this is the range of values (a2,b2) and [b2|a2]
this is the range of values (a3,b3) and [b3|a3]
I'm using the following string to extract a1,b1, a2,b2, etc...
@numbers = $_ =~ /\((.*),(.*)\)/
However, if I want to extract the data from square brackets [], how can I do it? For example
this is the range of values (a1,b1) and [b1|a1]
this is the range of values (a1,b1) and [b2|a2]
I need to extract/match only the data in square brackets and not the curved brackets.

[Update] In the meantime, I've written a blog post about the specific issue with .* I describe below: Why Using .* in Regular Expressions Is Almost Never What You Actually Want
If your identifiers a1, b1 etc. never contain commas or square brackets themselves, you should use a pattern along the lines of the following to avoid backtracking hell:
/\[([^,\]]+),([^,\]]+)\]/
Here's a working example on Regex101.
The issue with greedy quantifiers like .* is that they will very likely consume too much at the start, so the regex engine has to backtrack extensively. Even with non-greedy quantifiers, the engine makes more match attempts than necessary, because it consumes only one character at a time and then tries to advance in the pattern.
(You could even use atomic groups to make the matching even more performant.)
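As a rough illustration of the negated-class idea, here is a minimal sketch in Python rather than Perl, adapted to the `|`-separated pairs from the question. The names `BRACKET_PAIR` and `extract_pairs` are made up for this example.

```python
import re

# Sample lines taken from the question.
LINES = [
    "this is the range of values (a1,b1) and [b1|a1]",
    "this is the range of values (a2,b2) and [b2|a2]",
    "this is the range of values (a3,b3) and [b3|a3]",
]

# Each negated class stops at the separator or the closing bracket,
# so the engine never has to backtrack out of an over-long match.
BRACKET_PAIR = re.compile(r"\[([^|\]]+)\|([^|\]]+)\]")


def extract_pairs(lines):
    """Return one (first, second) tuple per line that contains [x|y]."""
    pairs = []
    for line in lines:
        match = BRACKET_PAIR.search(line)
        if match:
            pairs.append(match.groups())
    return pairs


if __name__ == "__main__":
    print(extract_pairs(LINES))
```

The parenthesised (a1,b1) part is ignored because the pattern only starts matching at a literal [.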

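#!/usr/bin/perl
use strict;
use warnings;

my @numbers;
# Read the line first, then chomp it, so a last line without a newline is not skipped
while (my $line = <DATA>) {
    chomp $line;
    if ($line =~ m|\[(.*),(.*)\]|) {
        push @numbers, ($1, $2);
    }
}
print "@numbers\n";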
__DATA__
this is the range of values [a1,b1]
this is the range of values [a2,b2]
this is the range of values [a3,b3]

You can match it using the non-greedy quantifier *?
my @numbers = $_ =~ /\[(.*?),(.*?)\]/g;
or
my @numbers = /\[(.*?),(.*?)\]/g;
for short.
UPDATE
my @numbers = /\[(.*?)\|(.*?)\]/g;

Use the below code
$_ =~ /\[(.*?)\|(.*?)\]/g;
Now if the pattern is successfully matched, the extracted values will be stored in $1 and $2.

I know I am a little late here, but none of the answers correctly answers the OP's question, and the one that comes closest matches the entire thing, square brackets [] included. Clearly the OP wants to match what is inside the brackets.
To match everything inside square brackets along with the brackets, use for example:
\[[^\[\]]*]
To match everything inside square brackets excluding the brackets themselves, use a positive lookahead and lookbehind. Example:
(?<=\[)[^\[\]]*(?=\])
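To compare the two ways of leaving out the brackets, here is a small sketch in Python rather than Perl. It assumes the sample line from the question; the constant and function names are invented for illustration. A capture group gives the same result as the lookarounds and is often easier to read.

```python
import re

TEXT = "this is the range of values (a1,b1) and [b1|a1]"

# Lookarounds assert the brackets without consuming them,
# so the match itself is only the inner text.
INNER_LOOKAROUND = re.compile(r"(?<=\[)[^\[\]]*(?=\])")

# Alternative: consume the brackets but capture only the inner text.
INNER_CAPTURE = re.compile(r"\[([^\[\]]*)\]")


def inner_with_lookaround(text):
    """Return the text inside each [...] using lookbehind/lookahead."""
    return INNER_LOOKAROUND.findall(text)


def inner_with_capture(text):
    """Return the text inside each [...] using a capture group."""
    return INNER_CAPTURE.findall(text)


if __name__ == "__main__":
    print(inner_with_lookaround(TEXT))
    print(inner_with_capture(TEXT))
```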

Related

RegEx improvement recommendations

Given a string like
Some text and [A~Token] and more text and [not a token] and
[another~token]
I need to extract the "tokens" for later replacement. The tokens are defined as two identifiers separated by a ~ and enclosed in [ ]. What I have been doing is using $string -match "\[.*?~.*?\]", which works. And, as I understand it I am escaping both brackets, doing any character zero or more times and forced lazy, then the ~ and then the same any character sequence. So, my first improvement was to replace .*? with .+?, as I want 1 or more, not zero or more. Then I moved to $string -match "\[[A-Za-z0-9]+~[A-Za-z0-9]+\]", which limits the two identifiers to alpha numerics, which is a big improvement.
So, first question is:
Is this last solution the best approach, or are there further improvements to be made?
Also, currently I only get a single token returned, so I am looping through the string, replacing tokens as they are found, and looping till there are no tokens. But, my understanding is that RegEx is greedy by default, and so I would have expected this last version to return two tokens, and I could loop through the dictionary rather than using a While loop.
So, second question is:
What am I doing wrong that I am only getting one match back? Or am I misunderstanding how greedy matching works?
EDIT:
to clarify, I am using $matches, as shown here, and still only getting a count of 1.
if ($string -match "\[[A-Za-z0-9]+~[A-Za-z0-9]+\]") {
Write-Host "new2: $($matches.count)"
foreach ($key in $matches.keys) {
Write-Host "$($matches.$key)"
}
}
Also, I can't really use a direct replace at the point of identifying the token, because there are a TON of potential replacements. I take the token, strip the square brackets, then split on the ~ to arrive at prefix and suffix values, which then identify a specific replacement value, which I can do with a dedicated -replace.
And one last clarification, the number of tokens is variable. It could just be one, it could be three or four. So my solution has to be pretty flexible.
To list all tokens and use the values you can use code like this:
$tokenMatches = Select-String '\[(\w+)~(\w+)\]' -InputObject $string -AllMatches | ForEach-Object { $_.Matches }
foreach ($value in $tokenMatches) {
$fullToken = $value.Value;
$firstPart = $value.Groups[1].Value;
$secondPart = $value.Groups[2].Value;
echo "full token found: '$fullToken' first part: '$firstPart' second part: '$secondPart'";
}
Note that the parts of the regex grouped with () are capture groups, which give you access to the individual parts of your token.
In this loop you can use firstPart and secondPart to look up the value that should replace fullToken.
As for \[.*?~.*?\] not working properly: it tries to match, and succeeds, with the text [not a token] and [another~token], because this regex allows the characters ] and [ inside the token parts. \[[^\]\[]*?~[^\]\[]*?\] (where ^ negates the class, so it reads "any character except ] and [") would also be fine, but it is not very readable with all the brackets. If \w is good enough, you should use it.
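The overreach of the lazy pattern is easy to reproduce. The sketch below uses Python rather than PowerShell, with the example string from the question; `LAZY`, `STRICT` and `find_tokens` are names chosen for this illustration.

```python
import re

TEXT = "Some text and [A~Token] and more text and [not a token] and [another~token]"

# Lazy dots may still cross ] and [ while searching for the next ~.
LAZY = re.compile(r"\[.*?~.*?\]")

# \w cannot match brackets or spaces, so each match stays inside one pair.
STRICT = re.compile(r"\[\w+~\w+\]")


def find_tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(find_tokens(LAZY, TEXT))
    print(find_tokens(STRICT, TEXT))
```

With the lazy pattern the second match starts at [not and runs through to the end of [another~token]; the \w version returns just the two real tokens.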
You can use \w to match a word character (letter, digit, underscore).
That results in the pattern \[\w+~\w+\].
Now you can create a regex object with that pattern:
$rgx = [Regex]::new($pattern)
and replace all occurrences of that pattern with the Replace method:
$rgx.Replace($inputstring, $replacement)
Maybe it's also worth noting that the regex object has a .Match method, which returns the first occurrence of the pattern, and a .Matches method, which returns all occurrences of the pattern.
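Since the question mentions that each token's prefix and suffix select a specific replacement, one possible shape for that is a replace-with-callback. This is only a sketch in Python, not PowerShell, and the `REPLACEMENTS` table is entirely hypothetical, because the real replacement values are not given.

```python
import re

TOKEN = re.compile(r"\[(\w+)~(\w+)\]")

# Hypothetical lookup table keyed by (prefix, suffix); the real
# replacement values are not part of the question.
REPLACEMENTS = {
    ("A", "Token"): "first value",
    ("another", "token"): "second value",
}


def replace_tokens(text, table):
    """Replace every [prefix~suffix] token that has an entry in table."""

    def lookup(match):
        key = (match.group(1), match.group(2))
        # Tokens without an entry are left exactly as they were.
        return table.get(key, match.group(0))

    return TOKEN.sub(lookup, text)


if __name__ == "__main__":
    sample = "Some text and [A~Token] and more text and [not a token] and [another~token]"
    print(replace_tokens(sample, REPLACEMENTS))
```

This handles any number of tokens in one pass, so no outer while loop is needed.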
Taking your example line
$String = "Some text and [A~Token] and more text and [not a token] and [another~token]"
This RegEx with capture groups
$RegEx = [RegEx]"\[(\w+~\w+)\][^\[]+\[[^\]]+\][^\[]+\[(\w+~\w+)\]"
if ($string -match $RegEX){
"First token={0} Second token={1}" -f $matches[1],$matches[2]
}
returns:
First token=A~Token Second token=another~token
See the above RegEx explained on https://regex101.com/r/tp6b9e/1
The text between the two tokens is matched by alternating negated character classes for [ and ] with the literal characters [ and ].

Regex to match hours and time

I'm still learning Perl regular expressions and I need to match a string that represents the time.
However, there are instances where multiple times get entered. Instead of '9AM' I will sometimes get '9AM5PM' or '09AM05PM' and so on. Fortunately, it always starts with one or two digits and ends with 'AM' or 'PM' (upper- or lowercase).
Here's what I have so far:
$string =~ /^((([1-9])|(1[0-2]))*(A|P)M)$/i;
Any help would be greatly appreciated!
The only problem I can see with your own code is that the hours field is optional (because you use a *) but you don't say what issues you're having.
You do have a lot of unnecessary captures. Every part of the pattern that is enclosed in parentheses will capture the corresponding part of the target string in internal variables called $1, $2 etc. Unless you really need those captures, it is best to use non-capturing parentheses (?: ... ) instead of the plain ones ( ... ).
Character classes like [1-9] are a single entity and don't need enclosing in parentheses. You also haven't accounted for a leading zero on values less than ten, and you should use a character class [AP] instead of an alternation (?:A|P)
It looks like you need
/\d{1,2}[AP]M/i
But you don't say what you want to do with the times once you have found them.
This snippet of code demonstrates the functionality by putting all the times that it finds in a string into array @times and then printing it with space separators.
use strict;
use warnings;
for my $string (qw/ 9AM 9AM5PM 09AM05PM /) {
my @times = $string =~ /\d{1,2}[AP]M/ig;
print "@times\n";
}
output
9AM
9AM 5PM
09AM 05PM
If you really want to verify that the hour value is in range (are you likely to come across 35pm?) then you could write
my @times = $string =~ / (?: 1[012] | 0?[1-9] ) [AP]M /igx;
Note that the /x modifier makes whitespace insignificant within regular expressions, so that it can be used to clarify the form of the pattern.
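One caveat with an unanchored search is that it still finds 5pm inside 35pm. If out-of-range input should be rejected, the whole string has to be validated first. The sketch below does that in Python rather than Perl; `HOUR`, `ONE_TIME`, `ALL_TIMES` and `split_times` are names invented for this example.

```python
import re

# Hours 1-12, with an optional leading zero for 1-9.
HOUR = r"(?:1[0-2]|0?[1-9])"

# A single time such as 9AM or 05pm.
ONE_TIME = re.compile(HOUR + r"[AP]M", re.IGNORECASE)

# One or more times back to back; used with fullmatch to validate.
ALL_TIMES = re.compile(r"(?:" + HOUR + r"[AP]M)+", re.IGNORECASE)


def split_times(text):
    """Return the individual times, or [] if the string as a whole is invalid."""
    if not ALL_TIMES.fullmatch(text):
        return []
    return ONE_TIME.findall(text)


if __name__ == "__main__":
    for sample in ("9AM", "9AM5PM", "09AM05PM", "35pm"):
        print(sample, split_times(sample))
```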
You can try something like:
$string =~ /^((0?\d|1[0-2])[AP]M)+$/i;
As you can see here. Or:
$string =~ /^((0?\d|1[0-2])[AP]M){1,2}$/i;
If you want it to be just up to 2 hours together.

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
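The same capture-group approach carries over to other engines. Here is a minimal sketch in Python, whose `re` module also requires fixed-width lookbehind; the sample text is the one from the question, and `VALUE` and `extract_values` are illustrative names.

```python
import re

TEXT = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Match the prefix normally and capture only the wanted field.
# re.MULTILINE lets ^ match at the start of every line.
VALUE = re.compile(r"^bdfg34f;\d{4,};(\d{0,9})", re.MULTILINE)


def extract_values(text):
    """Return the captured third field of every bdfg34f line."""
    return VALUE.findall(text)


if __name__ == "__main__":
    print(extract_values(TEXT))
    print(extract_values(TEXT.replace("8467;", "846777;")))
```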
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
If your regex engine supports variable-length lookbehind, you can keep the lookbehind. To match the number that follows a field of at least 4 digits:
(?<=bdfg34f;\d{4,};)\d{0,9}
To match it when the preceding field has 1 or more digits:
(?<=bdfg34f;\d+;)\d{0,9}
Or to match it only if the preceding field has between 4 and 6 digits:
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[2], "\n"; // third column (index 2, because arrays are zero-based)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
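For comparison, the split-based approach might look like this in Python (used here instead of PHP). It assumes the semicolon-separated sample from the question; `third_fields` is a hypothetical helper name.

```python
TEXT = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""


def third_fields(text, first_field=None, sep=";"):
    """Return the third field of each line, optionally filtered by first field."""
    results = []
    for line in text.splitlines():
        fields = line.split(sep)
        if len(fields) < 3:
            continue  # skip lines that are too short to have a third field
        if first_field is None or fields[0] == first_field:
            results.append(fields[2])  # index 2 is the third column
    return results


if __name__ == "__main__":
    print(third_fields(TEXT))
    print(third_fields(TEXT, first_field="bdfg34f"))
```

Because the fields are split on the separator, the length of the second field no longer matters.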

Regex: delete contents of square brackets

Is there a regular expression that can be used with search/replace to delete everything occurring within square brackets (and the brackets)?
I've tried \[.*\] which chomps extra stuff (e.g. "[chomps] extra [stuff]")
Also, the same thing with lazy matching \[.*?\] doesn't work when there is a nested bracket (e.g. "stops [chomping [too] early]!")
Try something like this:
$text = "stop [chomping [too] early] here!";
$text =~ s/\[([^\[\]]|(?0))*]//g;
print($text);
which will print:
stop here!
A short explanation:
\[ # match '['
( # start group 1
[^\[\]] # match any char except '[' and ']'
| # OR
(?0) # recursively match group 0 (the entire pattern!)
)* # end group 1 and repeat it zero or more times
] # match ']'
The regex above will get replaced with an empty string.
You can test it online: http://ideone.com/tps8t
EDIT
As @ridgerunner mentioned, you can make the regex more efficient by making the character class [^\[\]] match once or more, making both it and the * possessive, and turning group 1 into a non-capturing group:
\[(?:[^\[\]]++|(?0))*+]
But a real improvement in speed might only be noticeable when working with large strings (you can test it, of course!).
This is technically not possible with regular expressions because the language you're matching does not meet the definition of "regular". There are some extended regex implementations that can do it anyway using recursive expressions, among them are:
Greta:
http://easyethical.org/opensource/spider/regexp%20c++/greta2.htm#_Toc39890907
and
PCRE
http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
See "Recursive Patterns", which has an example for parentheses.
A PCRE recursive bracket match would look like this:
\[(?:[^\[\]]|(?R))*\]
edit:
Since you added that you're using Perl, here's a page that explicitly describes how to match balanced pairs of operators in Perl:
http://perldoc.perl.org/perlfaq6.html#Can-I-use-Perl-regular-expressions-to-match-balanced-text%3f
Something like:
$string =~ m/(\[(?:[^\[\]]++|(?1))*\])/xg;
Since you're using Perl, you can use modules from the CPAN and not have to write your own regular expressions. Check out the Text::Balanced module that allows you to extract text from balanced delimiters. Using this module means that if your delimiters suddenly change to {}, you don't have to figure out how to modify a hairy regular expression, you only have to change the delimiter parameter in one function call.
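Where neither a recursive regex nor a module is available, a single pass with a depth counter does the same job. The following is a minimal sketch in Python rather than Perl; `strip_bracketed` is a name made up for this example, and it assumes that an unmatched ] should be kept while an unmatched [ swallows the rest of the string.

```python
def strip_bracketed(text):
    """Remove every [...] section, including nested ones, in one pass."""
    depth = 0
    kept = []
    for char in text:
        if char == "[":
            depth += 1  # entering a (possibly nested) bracketed section
        elif char == "]" and depth > 0:
            depth -= 1  # leaving the innermost open section
        elif depth == 0:
            kept.append(char)  # only keep text outside all brackets
    return "".join(kept)


if __name__ == "__main__":
    print(strip_bracketed("stop [chomping [too] early] here!"))
```

Note that the spaces on both sides of a removed section are kept, so the result of the example contains two consecutive spaces.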
If you are only concerned with deleting the contents and not capturing them to use elsewhere you can use a repeated removal from the inside of the nested groups to the outside.
my $string = "stops [chomping [too] early]!";
# remove any [...] sequence that doesn't contain a [...] inside it
# and keep doing it until there are no [...] sequences to remove
1 while $string =~ s/\[[^\[\]]*\]//g;
print $string;
The 1 while construct does nothing in the loop body; it simply repeats for as long as the condition is true. Whenever the s/// matches and removes a bracketed section, the loop repeats and the s/// is run again.
This will work even if you're using an older version of Perl or another language that doesn't support the (?0) recursion used in Bart Kiers's answer.
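The same inside-out loop can be written in other languages. Here is a sketch in Python using `re.subn`, which returns the new string together with the number of substitutions made; `INNERMOST` and `remove_nested` are illustrative names.

```python
import re

# A [...] section that contains no other brackets, i.e. an innermost one.
INNERMOST = re.compile(r"\[[^\[\]]*\]")


def remove_nested(text):
    """Strip innermost [...] sections repeatedly until none are left."""
    count = 1
    while count:
        # subn reports how many sections were removed in this pass
        text, count = INNERMOST.subn("", text)
    return text


if __name__ == "__main__":
    print(remove_nested("stops [chomping [too] early]!"))
```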
You want to remove only things between the []s that aren't []s themselves, i.e.:
\[[^\[\]]*\]
Which is a pretty hairy mess of []s ;-)
It won't handle nested []s in a single pass, though; i.e., matching the whole of [foo[bar]baz] won't work.

How to return the first five digits using Regular Expressions

How do I return the first 5 digits of a string of characters in Regular Expressions?
For example, if I have the following text as input:
15203 Main Street
Apartment 3 63110
How can I return just "15203".
I am using C#.
This isn't really the kind of problem that's ideally solved by a single-regex approach -- the regex language just isn't especially meant for it. Assuming you're writing code in a real language (and not some ill-conceived embedded use of regex), you could do perhaps (examples in perl)
# Capture all the digits into an array
my @digits = $str =~ /(\d)/g;
# Then take the first five and put them back into a string
my $first_five_digits = join "", @digits[0..4];
or
# Copy the string, removing all non-digits
(my $digits = $str) =~ tr/0-9//cd;
# And cut off all but the first five
$first_five_digits = substr $digits, 0, 5;
If for some reason you really are stuck doing a single match, and you have access to the capture buffers and a way to put them back together, then wdebeaum's suggestion works just fine, but I have a hard time imagining a situation where you can do all that, but don't have access to other language facilities :)
It would depend on your flavor of regex and coding language (C#, Perl, etc.), but in C# you'd do something like
string rX = @"\D+";
string digits = Regex.Replace(input, rX, "");
return digits.Substring(0, 5);
Note: I'm not sure about that regex (others here may have a better one), but the idea is this: a regex by itself doesn't "replace" anything, it only matches patterns, so you look for any non-digit characters, replace each match with your language's version of the empty string (string.Empty or "" in C#), and then grab the first 5 characters of the resulting string.
You could capture each digit separately and put them together afterwards, e.g. in Perl:
$str =~ /(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)/;
$digits = $1 . $2 . $3 . $4 . $5;
I don't think a regular expression is the best tool for what you want.
Regular expressions are to match patterns... the pattern you are looking for is "a(ny) digit"
Your logic external to the pattern is "five matches".
Thus, you either want to loop over the first five digit matches, or capture five digits and merge them together.
But look at that Perl example -- that's not one pattern -- it's one pattern repeated five times.
Can you do this via a regular expression? Just like parsing XML -- you probably could, but it's not the right tool.
Not sure this is best solved by regular expressions since they are used for string matching and usually not for string manipulation (in my experience).
However, you could make a call to:
strInput = Regex.Replace(strInput, @"\D+", "");
to remove all non-digit characters and then just return the first 5 characters.
If you are wanting just a straight regex expression which does all this for you I am not sure it exists without using the regex class in a similar way as above.
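The two ideas discussed above (strip the non-digits and slice, or stop after the first five digit matches) can be sketched as follows. This is Python rather than C#, and the function names are invented for the example.

```python
import re
from itertools import islice

ADDRESS = "15203 Main Street\nApartment 3 63110"


def first_five_digits(text):
    """Remove every non-digit, then keep at most the first five characters."""
    digits = re.sub(r"\D+", "", text)
    return digits[:5]


def first_five_lazy(text):
    """Collect digit matches one at a time and stop after five."""
    matches = islice(re.finditer(r"\d", text), 5)
    return "".join(match.group() for match in matches)


if __name__ == "__main__":
    print(first_five_digits(ADDRESS))
    print(first_five_lazy(ADDRESS))
```

Both functions return a shorter string when the input has fewer than five digits; the slice does not raise an error in that case.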
A different approach -
# Copy over
$temp = $str;
# Remove all non-digits (/g is needed, otherwise only the first one is removed)
$temp =~ s/\D//g;
# Get the first 5 digits; the parentheses capture them into $1
$temp =~ /^(\d{5})/;
# Grab the match. ASSUMES that there was a match.
$first_digits = $1;
$result =~ s/^(\d{5}).*/$1/s
This replaces text that starts with exactly five digits (\d{5}) followed by any number of any characters (.*) with $1, which is whatever was captured by the (), that is, the first five digits. The /s modifier lets . match newlines too, so the rest of a multi-line input is removed as well.
If you want the first 5 characters of any kind:
$result =~ s/^(.{5}).*/$1/s
Use whatever programming language you are using to evaluate this.
ie.
regex.replace(text, "^(.{5}).*", "$1");