Unexpected regex pattern matching - regex

I'd like to match only Python-like floats, for example: 0.1, .1 or 0. (with a trailing dot). I wrote this regex: r"^(\d+\.?\d*)|(\.\d+)" and found out that it also matches "(.6", which I didn't intend. My guess is that it has something to do with grouping with parentheses, but I haven't escaped any parentheses with a backslash.
I'm using version 1.7.1 of regex crate and cargo 1.67.0.
use regex::Regex;

fn main() {
    let pattern = Regex::new(r"^(\d+\.?\d*)|(\.\d+)").unwrap();
    assert!(pattern.is_match("(.6"));
}

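Alternation has the lowest precedence, so the regex ^(\d+\.?\d*)|(\.\d+) matches either ^(\d+\.?\d*) or (\.\d+); the ^ anchor applies only to the first alternative.
You want either ^(\d+\.?\d*)|^(\.\d+), or another group around both alternatives: ^((\d+\.?\d*)|(\.\d+))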
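The precedence issue can be reproduced in a few lines. This is a minimal sketch using Python's re module rather than the Rust regex crate from the question (an assumption made purely for illustration; both treat alternation as the lowest-precedence operator). The helper name matches is hypothetical.

```python
import re

# Alternation binds more loosely than the anchor, so "^" only guards the
# first branch here and the second branch may match anywhere in the text.
unanchored_branch = re.compile(r"^(\d+\.?\d*)|(\.\d+)")

# Wrapping both branches in one group makes "^" apply to each of them.
grouped = re.compile(r"^((\d+\.?\d*)|(\.\d+))")


def matches(pattern, text):
    """Return True if the pattern matches somewhere in text (like is_match)."""
    return pattern.search(text) is not None


print(matches(unanchored_branch, "(.6"))  # True: ".6" is found by the second branch
print(matches(grouped, "(.6"))            # False: the text does not start with a float
print(matches(grouped, ".6"))             # True
```

Neither pattern is anchored at the end, so trailing characters after the float are still accepted; adding $ would be a separate decision.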

Negative lookahead preceded by .*

I want to select all text within {}, but only if there is no \status…{} in there.
Examples that should match:
\subsection{Hello} -> "\subsection", "Hello"
\section{Foobar} -> "\section", "Foobar"
\subsubsection{This is a Triumph} -> "\subsubsection", "This is a Triumph"
Examples that should not match:
\subsection{Hello\statusdone{}}
\section{Hello World\statuswip{}}
\section{Everything\statusproofreading{}}
I thought negative lookaheads would be perfect for this:
(\\.*section)\{(.*)(?!\\status.*)\}
but they match:
\subsection{Hello\statusdone{}} -> "\subsection", "Hello\statusdone{}"
\section{Hello World\statuswip{}} -> "\section", "Hello World\statuswip{}"
\section{Everything\statusproofreading{}} -> "\section", "Everything\statusproofreading{}"
I suspect it is because of the .* preceding the negative lookahead. If I replace it with, e.g., Hello in the following regex:
(\\.*section)\{(Hello)(?!\\status.*)\}
It correctly does not match the first negative example \subsection{Hello\statusdone{}}.
How do I work around that?
You should move the negative lookahead earlier in the pattern, so that it checks for the presence of that substring before the entire string (.*) is consumed.
You can use:
\\.*section\{((?!.*\\status.*\{\})[^}]+)}
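To check the suggested pattern against the examples from the question, here is a small sketch in Python (an assumed language choice; the question does not say which engine is used). The pattern is copied from the answer unchanged, and the name PATTERN is just for illustration.

```python
import re

# The negative lookahead sits right after "{", so it inspects the whole
# remainder of the line before [^}]+ consumes anything.
PATTERN = re.compile(r"\\.*section\{((?!.*\\status.*\{\})[^}]+)}")

should_match = [
    r"\subsection{Hello}",
    r"\section{Foobar}",
    r"\subsubsection{This is a Triumph}",
]
should_not_match = [
    r"\subsection{Hello\statusdone{}}",
    r"\section{Hello World\statuswip{}}",
    r"\section{Everything\statusproofreading{}}",
]

for line in should_match:
    print(PATTERN.search(line).group(1))  # Hello / Foobar / This is a Triumph

for line in should_not_match:
    print(PATTERN.search(line))  # None for each line
```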
Regex has no "needle not inside haystack" test (or at least no common implementation has one).
You're confusing the way zero-width assertions work. A lookahead is tested at one position at a time, and the match succeeds as soon as any position passes; it does not have to hold at every position. In your pattern the greedy .* backtracks until the lookahead sits right before the final }, where no \status follows, so the match goes through.
You have a two-pass job ahead of you. The first problem is that LaTeX (or whatever this is) is not a regular language, which means regular expressions aren't going to work well for arbitrary text.
Consider nested input such as \section{\math{\ref{\status{asfd}}}}: which closing "}" should end your match?
You need a parser to do that right, not regex. Sorry.
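As a rough idea of what the non-regex route involves, the sketch below finds the closing brace that balances a given opening brace by counting nesting depth. It is not a LaTeX parser: braced_content is a hypothetical helper, and it deliberately ignores escaped braces and comments.

```python
def braced_content(text, open_index):
    """Return the text between the brace at open_index and its matching close.

    Hypothetical helper for illustration only; escaped braces and
    comments are not handled.
    """
    if text[open_index] != "{":
        raise ValueError("open_index must point at '{'")
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                # Found the brace that closes the one at open_index.
                return text[open_index + 1:i]
    raise ValueError("unbalanced braces")


line = r"\section{\math{\ref{\status{asfd}}}}"
content = braced_content(line, line.index("{"))
print(content)                  # \math{\ref{\status{asfd}}}
print(r"\status" in content)    # True, so this heading would be skipped
```

Once the balanced content is extracted, the "contains \status" check becomes a plain substring test, which is the second pass the answer refers to.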

Regex - Exclude some string after matches [duplicate]

I need to extract from a string a set of characters which are included between two delimiters, without returning the delimiters themselves.
A simple example should be helpful:
Target: extract the substring between square brackets, without returning the brackets themselves.
Base string: This is a test string [more or less]
If I use the following reg. ex.
\[.*?\]
The match is [more or less]. I need to get only more or less (without the brackets).
Is it possible to do it?
Easy done:
(?<=\[)(.*?)(?=\])
Technically that's using lookaheads and lookbehinds. See Lookahead and Lookbehind Zero-Width Assertions. The pattern consists of:
is preceded by a [ that is not captured (lookbehind);
a non-greedy captured group. It's non-greedy to stop at the first ]; and
is followed by a ] that is not captured (lookahead).
Alternatively you can just capture what's between the square brackets:
\[(.*?)\]
and return the first captured group instead of the entire match.
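Both variants can be compared side by side. The following is a small sketch in Python (an assumed language, since the question does not name one); the variable names are illustrative.

```python
import re

text = "This is a test string [more or less]"

# Option 1: lookbehind and lookahead keep the brackets out of the match itself.
lookaround = re.search(r"(?<=\[)(.*?)(?=\])", text)

# Option 2: match the brackets too, then read only capture group 1.
captured = re.search(r"\[(.*?)\]", text)

print(lookaround.group(0))  # more or less
print(captured.group(0))    # [more or less]
print(captured.group(1))    # more or less
```

The capture-group form also works in engines without lookbehind support, which is why several answers here prefer it.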
If you are using JavaScript, the solution provided by cletus, (?<=\[)(.*?)(?=\]) won't work because JavaScript doesn't support the lookbehind operator.
Edit: actually, since ES2018 it's possible to use the lookbehind operator. Just wrap the pattern in / delimiters to define a regex literal, like this:
var regex = /(?<=\[)(.*?)(?=\])/;
Old answer:
Solution:
var regex = /\[(.*?)\]/;
var strToMatch = "This is a test string [more or less]";
var matched = regex.exec(strToMatch);
It will return:
["[more or less]", "more or less"]
So, what you need is the second value. Use:
var matched = regex.exec(strToMatch)[1];
To return:
"more or less"
Here's a general example with obvious delimiters (X and Y):
(?<=X)(.*?)(?=Y)
Here it's used to find the string between X and Y.
You just need to 'capture' the bit between the brackets.
\[(.*?)\]
To capture you put it inside parentheses. You do not say which language this is using. In Perl for example, you would access this using the $1 variable.
my $string ='This is the match [more or less]';
$string =~ /\[(.*?)\]/;
print "match:$1\n";
Other languages will have different mechanisms. C#, for example, exposes captured groups through the Groups collection of the Match class, I believe.
[^\[] Match any character that is not [.
+ Match 1 or more of the anything that is not [. Creates groups of these matches.
(?=\]) Positive lookahead ]. Matches a group ending with ] without including it in the result.
Done.
[^\[]+(?=\])
Proof.
http://regexr.com/3gobr
Similar to the solution proposed by null. But the additional \] is not required. As an additional note, it appears \ is not required to escape the [ after the ^. For readability, I would leave it in.
Does not work in the situation in which the delimiters are identical. "more or less" for example.
Most updated solution
If you are using JavaScript, the best solution that I came up with is to use the match method instead of exec.
Then iterate over the matches and strip the delimiters by replacing each match with its first group, $1.
const text = "This is a test string [more or less], [more] and [less]";
const regex = /\[(.*?)\]/gi;
const resultMatchGroup = text.match(regex); // [ '[more or less]', '[more]', '[less]' ]
const desiredRes = resultMatchGroup.map(match => match.replace(regex, "$1"))
console.log("desiredRes", desiredRes); // [ 'more or less', 'more', 'less' ]
As you can see, this is useful for multiple delimiters in the text as well
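For comparison, the same multi-match extraction can be sketched in Python (an assumed language, not part of the original answer). With exactly one capture group in the pattern, re.findall returns the group contents directly, so no second replace step is needed.

```python
import re

text = "This is a test string [more or less], [more] and [less]"

# With a single capture group, findall returns a list of that group's
# contents for every non-overlapping match.
inner = re.findall(r"\[(.*?)\]", text)

print(inner)  # ['more or less', 'more', 'less']
```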
PHP:
$string ='This is the match [more or less]';
preg_match('#\[(.*)\]#', $string, $match);
var_dump($match[1]);
This one specifically works for javascript's regular expression parser /[^[\]]+(?=])/g
just run this in the console
var regex = /[^[\]]+(?=])/g;
var str = "This is a test string [more or less]";
var match = regex.exec(str);
match;
To match the brackets together with their contents (for example, to remove both), use:
\[.+\]
I had the same problem using regex with bash scripting.
I used a 2-step solution using pipes with grep -o applying
'\[(.*?)\]'
first, then
'\b.*\b'
Obviously not as efficient as the other answers, but an alternative.
I wanted to find a string between / and #, but # is sometimes optional. Here is the regex I use:
(?<=\/)([^#]+)(?=#*)
Here is how I got the text without '[' and ']' in C#:
var text = "This is a test string [more or less]";
// Getting only the string between '[' and ']'
Regex regex = new Regex(@"\[(.+?)\]");
var matchGroups = regex.Matches(text);
for (int i = 0; i < matchGroups.Count; i++)
{
    Console.WriteLine(matchGroups[i].Groups[1]);
}
The output is:
more or less
If you need to extract the text without the brackets, you can use awk in bash:
echo " [hola mundo] " | awk -F'[][]' '{print $2}'
result:
hola mundo

regex: ignore potential matches after a "#"

EDIT: Although I've marked this question with the java tag, I don't want a solution that requires java code. I just would like the pattern to be compatible with Java's regex implementation if possible (which unfortunately is not quite PCRE compatible). What I would like is just a single regex that produces the matches I want.
Suppose I have this string:
foo bar foo bar # foo bar foo bar
I'd like to match instances of "foo", but only if they are not after any "#" symbol (if one is present). In other words, I want this result:
foo bar foo bar # foo bar foo bar
^^^     ^^^
I tried using a negative look-behind like this:
(?<!#.*)\bfoo\b
...but this doesn't work because a look-behind cannot be of variable length. Any suggestions?
This one should do the job:
(?=.*#) is a lookahead that succeeds only while a "#" still lies ahead, so only text before the last "#" can match
the global flag "g" repeats the pattern
/(?=.*#)(\bfoo\b)/g
You can use the replaceFirst method to remove the text after # and then do a simple word match:
final Pattern pattern = Pattern.compile("\\bfoo\\b");
final Matcher matcher = pattern.matcher(input.replaceFirst("#.*$", ""));
while (matcher.find()) {
    System.err.printf("Found Match: %s%n", matcher.group());
}
Java regex is not powerful enough for doing it with a single regex.
Lookbehind is fixed width, so that's not a solution.
Lookahead is only applicable when you can be sure that there is a # in the string.
Java does not allow failing a match and then continuing searching at the end (like with SKIP/FAIL in PCRE). It always continues at the character after the last matching start.
#.*|(\bfoo\b) and then checking if the first matching group is defined would be a workaround here, but there's no pure way to just match \bfoo\b sequences.
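The "consume the comment, capture the word" workaround can be sketched as follows. Python is used here only for illustration (the question targets Java's regex flavor, but the same pattern and group check carry over); foo_before_hash is a hypothetical helper name.

```python
import re


def foo_before_hash(line):
    """Return (start, end) spans of whole-word "foo" occurring before any "#"."""
    spans = []
    for m in re.finditer(r"#.*|(\bfoo\b)", line):
        # When the "#.*" branch matched, group 1 is unset: that match
        # swallowed the rest of the line and is simply ignored.
        if m.group(1) is not None:
            spans.append(m.span(1))
    return spans


print(foo_before_hash("foo bar foo bar # foo bar foo bar"))  # [(0, 3), (8, 11)]
print(foo_before_hash("foo bar foo"))                        # [(0, 3), (8, 11)]
```

Unlike the lookahead approach, this also behaves sensibly when the line contains no "#" at all.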
There is no way to do it with a single regex, as others have said already. But there is a workaround.
Select # and everything after it:
#.*
Copy the highlighted part and paste it inside the parentheses in place of HERE:
foo(?=.*\QHERE\E)

regular expression replacement of numbers

Using regular expression how do I replace 1,186.55 with 1186.55?
My search string is
\b[1-9],[0-9][0-9][0-9].[0-9][0-9]
which works fine. I just can't seem to get the replacement part to work.
Your question is short on details, so I'll try to answer as generally as possible.
You can shorten the regex a bit by using quantifiers. I would do this as a first step:
\b[1-9],[0-9]{3}.[0-9]{2}
Most probably you can also replace [0-9] with \d, which is also more readable IMO.
\b\d,\d{3}.\d{2}
Now we can move on to the replacement part. Here you need to store the parts you want to keep. You can do that by putting those parts into capturing groups, that is, by placing parentheses around them. This would be your search pattern:
\b(\d),(\d{3}.\d{2})
So, now you can access the matched content of those capturing groups in the replacement string. Groups are numbered by their opening parenthesis: the first opening parenthesis starts group 1, the second starts group 2, and so on.
There are two possible syntaxes, depending on your tool: you refer to that content either with \1 or with $1.
Your replacement string would then be
\1\2
OR
$1$2
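As a concrete instance of the group-based replacement, here is a minimal sketch in Python, which uses the \1 style of backreference in replacement strings. One deliberate change from the pattern above: the dot is escaped as \. so that it matches only a literal decimal point (an assumption about the intent, since an unescaped dot matches any character).

```python
import re

# Group 1 keeps the thousands digit, group 2 keeps the remaining digits.
# The comma between them is matched but not captured, so the replacement
# drops it. The escaped dot is an assumption about what was intended.
pattern = re.compile(r"\b(\d),(\d{3}\.\d{2})")

print(pattern.sub(r"\1\2", "1,186.55"))             # 1186.55
print(pattern.sub(r"\1\2", "total: 1,186.55 USD"))  # total: 1186.55 USD
```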
Python:
def repl(initstr, unwanted=','):
    # Drop every character that appears in `unwanted`
    res = set(unwanted)
    return ''.join(r for r in initstr if r not in res)
Using regular expressions:
from re import compile
regex = compile(r'([\d\.])')
print(''.join(regex.findall('1,186.55')))
Using str.split() method:
num = '1,186.55'
print(''.join(num.split(',')))
Using str.replace() method:
num = '1,186.55'
print(num.replace(',', ''))
If you just want to remove the comma, you can do this (in Java or C#); strings are immutable in both, so assign the result:
str = str.Replace(",", "");
(in Java the method is spelled replace)
Or in Perl:
s/(\d+),(\d+)/$1$2/

Regex AND operator

Based on this answer
Regular Expressions: Is there an AND operator?
I tried the following on http://regexpal.com/ but was unable to get it to work. What am I missing? Does JavaScript not support it?
Regex: (?=foo)(?=baz)
String: foo,bar,baz
It is impossible for both (?=foo) and (?=baz) to match at the same time. It would require the next character to be both f and b simultaneously which is impossible.
Perhaps you want this instead:
(?=.*foo)(?=.*baz)
This says that foo must appear anywhere and baz must appear anywhere, not necessarily in that order and possibly overlapping (although overlapping is not possible in this specific case because the letters themselves don't overlap).
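The difference between the two patterns is easy to verify. Below is a short sketch in Python (an assumed language; the question uses a JavaScript tester, and lookaheads behave the same way for this case).

```python
import re

text = "foo,bar,baz"

# Both lookaheads are tested at the same position, so the text there
# would have to start with "foo" and "baz" at once: never possible.
print(re.search(r"(?=foo)(?=baz)", text))  # None

# Adding .* lets each lookahead scan ahead independently.
both = re.search(r"(?=.*foo)(?=.*baz)", text)
print(both is not None)  # True

# If one of the words is missing, the combined condition fails.
print(re.search(r"(?=.*foo)(?=.*baz)", "foo,bar") is None)  # True
```

Note that the successful match is zero-width: it only confirms that both words occur, it does not return them.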
Example of a Boolean (AND) plus Wildcard search, which I'm using inside a javascript Autocomplete plugin:
String to match: "my word"
String to search: "I'm searching for my funny words inside this text"
You need the following regex: /^(?=.*my)(?=.*word).*$/im
Explaining:
^ assert position at start of a line
?= Positive Lookahead
.* matches any character (except newline)
() Groups
$ assert position at end of a line
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
Test the Regex here: https://regex101.com/r/iS5jJ3/1
So, you can create a javascript function that:
Replace regex reserved characters to avoid errors
Split your string at spaces
Encapsulate your words inside regex groups
Create a regex pattern
Execute the regex match
Example:
function fullTextCompare(myWords, toMatch) {
    // Escape regex reserved characters
    myWords = myWords.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
    // Split your string at spaces
    var arrWords = myWords.split(" ");
    // Wrap each word in its own lookahead group
    arrWords = arrWords.map(function (n) {
        return "(?=.*" + n + ")";
    });
    // Create a regex pattern
    var sRegex = new RegExp("^" + arrWords.join("") + ".*$", "im");
    // Execute the regex match
    return toMatch.match(sRegex) !== null;
}
// Using it:
console.log(
    fullTextCompare("my word", "I'm searching for my funny words inside this text")
);
// Wildcards:
console.log(
    fullTextCompare("y wo", "I'm searching for my funny words inside this text")
);
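The same steps (escape, split, wrap each word in a lookahead, join, match) can be sketched in Python for comparison. This is an illustrative port under the assumption that re.escape is an acceptable stand-in for the manual escaping; full_text_compare is a hypothetical name.

```python
import re


def full_text_compare(words, text):
    """Return True if every whitespace-separated word occurs on one line of text."""
    # One lookahead per word; re.escape neutralizes regex metacharacters.
    lookaheads = "".join(f"(?=.*{re.escape(w)})" for w in words.split())
    pattern = re.compile(f"^{lookaheads}.*$", re.IGNORECASE | re.MULTILINE)
    return pattern.search(text) is not None


sentence = "I'm searching for my funny words inside this text"
print(full_text_compare("my word", sentence))  # True
print(full_text_compare("y wo", sentence))     # True (partial words count too)
print(full_text_compare("my cat", sentence))   # False
```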
Maybe you are looking for something like this. If you want to select the complete line when it contains both "foo" and "baz", this regex will do that:
.*(foo)+.*(baz)+.*|.*(baz)+.*(foo)+.*
Maybe just an OR operator | could be enough for your problem:
String: foo,bar,baz
Regex: (foo)|(baz)
Result: ["foo", "baz"]