I have a regex formula that I'm using to find specific patterns in my data. Specifically, it starts by looking for characters between "{}" brackets, and looks for "p. " and grabs the number after. I noticed that, in some instances, if there's not a "p. " value shortly after the brackets, it will continue to go through the next brackets and grab the number after.
For example, here is my sample data:
{Hello}, [1234] (Test). This is sample data used to answer a question {Hello2} [Ch.8 p. 87 gives more information about...
Here is my code:
\{(.*?)\}(.*?)p\. ([0-9]+)
I want it to return this only:
{Hello2} [Ch.8 p. 87
But it returns this:
{Hello}, [123:456] (Test). This is sample data used to answer a
question {Hello2} [Ch.8 p. 87
Is there a way to exclude strings that contain, let's say, "{"?
Your pattern first matches from { to }, and then the non-greedy .*? keeps expanding one character at a time until it can match a p, a dot, a space and one or more digits.
It can do that because the dot also matches { and }, so the match is free to run past the next pair of brackets.
You could use a negated character class, [^{}], so that neither brace can be matched in between:
\{[^{}]*\}[^{}]+p\. [0-9]+
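To see the difference on the sample data from the question, here is a small JavaScript sketch that runs both the original lazy pattern and the negated-class pattern. The variable names are only illustrative.

```javascript
// Sample data copied from the question.
const sample = "{Hello}, [1234] (Test). This is sample data used to answer a question {Hello2} [Ch.8 p. 87 gives more information about...";

// Original pattern: the dot may cross into the next pair of braces.
const lazy = /\{(.*?)\}(.*?)p\. ([0-9]+)/;
// Negated classes: neither "{" nor "}" may appear between the brackets and "p. ".
const negated = /\{[^{}]*\}[^{}]+p\. [0-9]+/;

const lazyMatch = sample.match(lazy)[0];
const negatedMatch = sample.match(negated)[0];

console.log(lazyMatch);    // starts at {Hello} and runs through to "p. 87"
console.log(negatedMatch); // {Hello2} [Ch.8 p. 87
```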
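Your expression seems to work fine. My guess is that you want to capture only the desired output and leave the rest uncaptured, which a slight modification of your original expression can do. The greedy [\s\S]* consumes as much as possible, so the capturing group starts at the last { that still allows a match: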
(?:[\s\S]*)(\{(.*?)\}(.*?)p\. [0-9]+)
or this expression:
(?:[\s\S]*)(\{.*)
Test
const regex = /(?:[\s\S]*)(\{.*)/gm;
const str = `{Hello}, [123:456] (Test). This is sample data used to answer a
question {Hello2} [Ch.8 p. 87`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Here's how you do it in Java. The regex should be fairly universal.
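import java.util.regex.Matcher;
import java.util.regex.Pattern;

String test = "{Hello2} [Ch.8 p. 87 gives more information about..";
String pat = "(\\{.*?\\}.*p.*?\\d+)";
Matcher m = Pattern.compile(pat).matcher(test);
if (m.find()) {
    // group(1) is the whole bracket-to-page-number span
    System.out.println(m.group(1));
}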
More specific patterns can be provided if more is known about your data. For example, does each {} block of information start on a separate line? What does the data look like, and what do you want to ignore?
Based on your example text, you may be able to simplify your regex a bit and avoid matching a second open curly brace before you match the page number (unless you have some other purpose for the capture groups). For example:
{[^{]*p\.\s\d+
{ match an open curly brace
[^{]* match all following characters except for another open curly brace
p\.\s\d+ match "p" followed by period, space and one or more digits
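If you do need the bracket label and the page number separately, the same idea (never let the gap cross another brace) can be combined with capture groups. This is only a sketch: the test string and the names `citation` and `citations` are made up for illustration, and the pattern is a variation on the one above, not the exact expression from this answer.

```javascript
// Hypothetical input with three brace blocks; only two are followed by a page number.
const text = "{A} no page here {B} [Ch.8 p. 87 and {C} see p. 12";

// Group 1: text inside the braces. Group 2: digits after "p. ".
// [^{}]* keeps the gap from running into the next brace block.
const citation = /\{([^{}]*)\}[^{}]*p\.\s(\d+)/g;

const citations = [...text.matchAll(citation)].map(
  ([, label, page]) => ({ label, page: Number(page) })
);

console.log(citations); // {A} is skipped because no "p. <digits>" follows it before the next "{"
```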
Related
I am trying to capture parts of an equation using a regular expression.
Equation: 1×2÷3×4
Regular expression: \d+(×|÷)\d+
I expect this to result in:
1×2
2÷3
3×4
But it only returns:
1×2
3×4
I assume this has something to do with the structure, but I'm not even sure where to start or what to google to find the answer.
After your regex matches something, the search resumes from the end of that match, which is why you get only two matches: the 2 is consumed by 1×2 and cannot also start 2÷3. You can use a positive lookahead, (?=...), to check that an operator ([×÷]) and a digit (\d) follow, and capture them without consuming them.
You can use
/\d(?=([×÷])(\d))/g
The code below is JavaScript-specific:
const regex = /\d(?=([×÷])(\d))/g;
const str = "1×2÷3×4";
const results = [...str.matchAll(regex)].map((arr) => {
    return `${arr[0]}${arr[1]}${arr[2]}`;
});
console.log(results);
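The pattern above assumes single-digit operands. If the numbers can have several digits, a variation with \d+ inside and outside the lookahead seems to work. This is a sketch with a hypothetical helper name (`overlappingPairs`), not part of the original answer.

```javascript
// Returns every "number operator number" pair, including overlapping ones.
function overlappingPairs(expression) {
  // Only the left operand is consumed; the operator and the right operand
  // are captured inside the lookahead, so the right operand stays available
  // as the left operand of the next match.
  const pair = /\d+(?=([×÷])(\d+))/g;
  return [...expression.matchAll(pair)].map(
    ([left, op, right]) => `${left}${op}${right}`
  );
}

console.log(overlappingPairs("1×2÷3×4"));  // [ '1×2', '2÷3', '3×4' ]
console.log(overlappingPairs("12×34÷5"));  // [ '12×34', '34÷5' ]
```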
Each part of the string will only be matched to the pattern one time: once the substring "1×2" has matched the regular expression, the '2' won't be re-used in subsequent matches. Consider the string "×2÷3×4" (i.e. drop the first '1'); in this case the first (and only) match is "2÷3".
I would expect this line of JavaScript:
"foo bar baz".match(/^(\s*\w+)+$/)
to return something like:
["foo bar baz", "foo", " bar", " baz"]
but instead it returns only the last captured match:
["foo bar baz", " baz"]
Is there a way to get all the captured matches?
When you repeat a capturing group, in most flavors, only the last capture is kept; any previous capture is overwritten. In some flavors, e.g. .NET, you can get all intermediate captures, but this is not the case with JavaScript.
That is, in JavaScript, if you have a pattern with N capturing groups, you can capture at most N strings per match, even if some of those groups were repeated.
So generally speaking, depending on what you need to do:
If it's an option, split on delimiters instead
Instead of matching /(pattern)+/, maybe match /pattern/g, perhaps in an exec loop
Do note that these two aren't exactly equivalent, but it may be an option
Do multilevel matching:
Capture the repeated group in one match
Then run another regex to break that match apart
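Applied to the string from the question, the second option could look like the sketch below: one regex checks the overall shape, and a second, global regex pulls out the individual pieces. The variable names are illustrative only.

```javascript
const input = "foo bar baz";

// Same pattern as in the question: validates the whole string,
// but its repeated group only remembers the last word.
const wholeShape = /^(\s*\w+)+$/;
// The inner part of that pattern, applied globally instead of repeated.
const onePiece = /\s*\w+/g;

const pieces = wholeShape.test(input) ? input.match(onePiece) : [];

console.log([input, ...pieces]); // [ 'foo bar baz', 'foo', ' bar', ' baz' ]
```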
References
regular-expressions.info/Repeating a Capturing Group vs Capturing a Repeating Group
Javascript flavor notes
Example
Here's an example of matching <some;words;here> in a text, using an exec loop, and then splitting on ; to get individual words (see also on ideone.com):
var text = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";
var r = /<(\w+(;\w+)*)>/g;
var match;
while ((match = r.exec(text)) != null) {
print(match[1].split(";"));
}
// c,d,e,f
// xx,yy,zz
The pattern used is:
      _2__
     /    \
<(\w+(;\w+)*)>
 \__________/
      1
This matches <word>, <word;another>, <word;another;please>, etc. Group 2 is repeated to capture any number of words, but it can only keep the last capture. The entire list of words is captured by group 1; this string is then split on the semicolon delimiter.
Related questions
How do you access the matched groups in a javascript regex?
How's about this? "foo bar baz".match(/(\w+)+/g)
Unless you have a more complicated requirement for how you're splitting your strings, you can split them, and then return the initial string with them:
var data = "foo bar baz";
var pieces = data.split(' ');
pieces.unshift(data);
try using 'g':
"foo bar baz".match(/\w+/g)
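You can make the group optional instead of repeating it.
So, instead of repeating the group with + (which keeps only the last capture), try ? together with the global flag, so that each word becomes its own match. Note that ? directly after a group means "zero or one"; it is not the lazy modifier here.
REGEX: (\s*\w+)?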
RESULT:
Match 1: foo
Match 2: bar
Match 3: baz
For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to an incorrect output, because
12\x
still gets detected as a correct string, even though the order is wrong. It needs to be in the order specified below, and in no other order:
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!
To match a \ followed by x, followed by 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
To make it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, or preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()) {
    res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [0-9A-Fa-f] to [a-f0-9] if we add a case insensitive modifier (?i) to the start of the pattern, or just use \p{XDigit}, which matches any hexadecimal digit in a Java regex.
The String#matches method always anchors the regex at both ends, so we do not need the leading ^ and trailing $ anchors when using the pattern inside it.
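For comparison, the anchored validation pattern can be checked against the four sample strings from the question. This is a JavaScript sketch; unlike Java's String#matches, `RegExp.prototype.test` does not anchor the pattern, so ^ and $ are written out. The names `blockPattern`, `inputs` and `verdicts` are illustrative.

```javascript
// One or more "\xHH" blocks and nothing else.
const blockPattern = /^(?:\\x[0-9A-Fa-f]{2})+$/;

// The four examples from the question (backslashes doubled for the string literals).
const inputs = [
  "\\xyz\\yzx",                    // wrong format
  "x\\12",                         // wrong format
  "12\\x",                         // wrong format
  "\\x12\\x13\\x14\\x00\\xff\\xff" // correct format
];

const verdicts = inputs.map((s) => blockPattern.test(s));
console.log(verdicts); // [ false, false, false, true ]
```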
I've been playing with a regex definition (language "Basic") but cannot get it to work.
I will delete my previous post on the same matter when I get a solution.
The regex shall:
MATCH:
"400:-"
"200:-"
"588:-"
"999:-"
BUT NO MATCH:
"1 200:-"
"o 100:-"
"1400:-"
"y 800:-"
"400"
"i 588:-"
Why does this regex not work?
(^[0-9]?[0-9]?[0-9]:-$)
Just try with following regex:
^\d{3}:-$
Your regular expression mostly works, but the optional quantifiers also let it accept one- or two-digit values such as "40:-". Remove the optional quantifier ? and place the beginning/ending line anchors outside of the capturing group. It can be simplified to the following:
^([0-9]{3}:-)$
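A quick way to confirm the anchored pattern against every example in the question is to run it over both lists. This is a JavaScript sketch (the question does not pin down the exact engine), and the variable names are illustrative.

```javascript
const threeDigits = /^[0-9]{3}:-$/;

const shouldMatch = ["400:-", "200:-", "588:-", "999:-"];
const shouldNotMatch = ["1 200:-", "o 100:-", "1400:-", "y 800:-", "400", "i 588:-"];

// every() is true only if all strings in the list behave as expected.
const allAccepted = shouldMatch.every((s) => threeDigits.test(s));
const allRejected = shouldNotMatch.every((s) => !threeDigits.test(s));

console.log(allAccepted, allRejected); // true true
```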
Try
"^[0-9]{3}:-"
That tells it to find any digit from 0 to 9 three times, at the beginning of the string, followed immediately by ":-"
If you don't want it to check just the beginning, drop the ^ anchor:
bool check;
Regex reg = new Regex("[0-9]{3}:-");
check = reg.IsMatch("400:-"); // true
check = reg.IsMatch("40:-"); // false
check = reg.IsMatch("asdf400:-"); // true
But this will make it match the ones you don't want matched.
I have the following string:
123322
In theory, the regex 1.*2 should match the following:
12 (because * can be zero characters)
12332
123322
If I use the regex 1.*2 it matches 123322.
Using 1.*?2, it will match 12.
Is there a way to match 12332 too?
The perfect thing would be to get all possible matches in the string (no matter if one match is a substring of another).
No, unless there is something else added to the regex to clarify what it should do it will either be greedy or non-greedy. There is no in-betweeny ;)
1(.*?2)*$
With this, you will have multiple captures, which you can concatenate to form all possible matches. This only works in an engine that keeps every capture of a repeated group.
You would need a separate expression for each case, depending on the number of twos you want to match:
1(.*?2){1} #same as 1.*?2
1(.*?2){2}
1(.*?2){3}
...
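Rather than writing each expression by hand, the family of patterns above can be generated in a loop until one of them stops matching. This is a JavaScript sketch on the string from the question; the loop and the variable names are my own illustration, and a non-capturing group is used since only the whole match is needed.

```javascript
const text = "123322";
const results = [];

// Try 1(.*?2){1}, 1(.*?2){2}, ... and stop at the first n that no longer matches.
for (let n = 1; ; n++) {
  const m = text.match(new RegExp(`1(?:.*?2){${n}}`));
  if (m === null) break;
  results.push(m[0]);
}

console.log(results); // [ '12', '12332', '123322' ]
```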
Generally, this isn't possible. A regex matching engine isn't really designed to find overlapping matches. A quick solution is simply to check the pattern on all substrings manually:
string text = "1123322";
for (int start = 0; start < text.Length - 1; start++)
{
    for (int length = 0; length <= text.Length - start; length++)
    {
        string subString = text.Substring(start, length);
        if (Regex.IsMatch(subString, "^1.*2$"))
            Console.WriteLine("{0}-{1}: {2}", start, start + length, subString);
    }
}
Working example: http://ideone.com/aNKnJ
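The same brute-force idea can be sketched in JavaScript, here on the string from the question. The function name `allMatches` is hypothetical; the pattern passed in must be anchored with ^ and $ and must not carry the g flag, otherwise `test` would keep state between calls.

```javascript
// Test every substring against an anchored pattern and collect the hits.
function allMatches(text, anchoredPattern) {
  const found = [];
  for (let start = 0; start < text.length; start++) {
    for (let end = start + 1; end <= text.length; end++) {
      const candidate = text.slice(start, end);
      if (anchoredPattern.test(candidate)) {
        found.push(candidate);
      }
    }
  }
  return found;
}

const found = allMatches("123322", /^1.*2$/);
console.log(found); // [ '12', '12332', '123322' ]
```

This checks a quadratic number of substrings, so it is only reasonable for short inputs.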
Now, is it possible to get a whole-regex solution? Mostly, the answer is no. However, .NET does have a few tricks up its sleeve to help us: it allows variable-length lookbehind, and allows each capturing group to remember all captures (most engines only return the last match of each group). Abusing these, we can simulate the same for loop inside the regex engine:
string text = "1123322!";
string allMatchesPattern = @"
(?<=^                       # Starting at the local end position, look all the way to the back
(
    (?=(?<Here>1.*2\G))?    # on each position from the start until here (\G),
    .                       # *try* to match our pattern and capture it,
)*                          # but advance even if you fail to match it.
)
";
MatchCollection matches = Regex.Matches(text, allMatchesPattern,
    RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
foreach (Match endPosition in matches)
{
    foreach (Capture startPosition in endPosition.Groups["Here"].Captures)
    {
        Console.WriteLine("{0}-{1}: {2}", startPosition.Index,
            endPosition.Index - 1, startPosition.Value);
    }
}
Note that currently there's a small bug there: the engine doesn't try to match the last ending position ($), so you lose a few matches. For now, adding a ! at the end of the string solves that issue.
working example: http://ideone.com/eB8Hb