Match whole word (Visual Studio style) - regex

I am trying to add a Match Whole Word search to my small application.
I want it to behave the same way Visual Studio does.
For example, the code below should run without throwing:
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();
        String input = "[ abc() *abc ]";
        Match(input, "abc", 2);
        Match(input, "abc()", 1);
        Match(input, "*abc", 1);
        Match(input, "*abc ", 1);
    }
    private void Match(String input, String pattern, int expected)
    {
        String escapedPattern = Regex.Escape(pattern);
        MatchCollection mc = Regex.Matches(input, @"\b" + escapedPattern + @"\b", RegexOptions.IgnoreCase);
        if (mc.Count != expected)
        {
            throw new Exception("match whole word isn't working");
        }
    }
}
Searching for "abc" works fine but other patterns return 0 results.
I think \b is inadequate, but I am not sure what to use instead.
Any help would be appreciated.
Thanks

The \b metacharacter matches at a word boundary: a position between a word character (letter, digit, or underscore) and a non-word character, or between a word character and the start or end of the string. Search terms that begin or end with a non-word character therefore fail to match; \b is working as designed.
To perform a proper whole word match that supports both types of data you need to:
use \b before or after any alphanumeric character
use \B (capital B) before or after any non-alphanumeric character
not use \B if the first or last character of the pattern is intentionally a non-alphanumeric character, such as your final example with a trailing space
Based on these points, you need additional logic that inspects the incoming search term and shapes it into the appropriate pattern. \B is the opposite of \b: it matches wherever \b does not. Without \B you can end up with partial matches. For example, the *abc inside foo*abc would incorrectly be matched by the pattern @"\*abc\b".
To demonstrate:
string input = "[ abc() *abc foo*abc ]";
string[] patterns =
{
    @"\babc\b",      // 3
    @"\babc\(\)\B",  // 1
    @"\B\*abc\b",    // 1, \B prefix ensures whole word match, "foo*abc" not matched
    @"\*abc\b",      // 2, no \B prefix so it matches "foo*abc"
    @"\B\*abc "      // 1
};
foreach (var pattern in patterns)
{
    Console.WriteLine("Pattern: " + pattern);
    var matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: " + matches.Count);
    foreach (Match match in matches)
    {
        Console.WriteLine(" " + match.Value);
    }
    Console.WriteLine();
}
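The three rules above could be folded into a small helper that looks at the first and last character of the search term. This is only a sketch, written in Python rather than C# (\b and \B behave the same way on this input). The names whole_word_pattern and _edge_assertion are hypothetical, and treating a whitespace edge as "intentional" (no assertion added) is my reading of the third rule, not something the rule spells out.

```python
import re

def _edge_assertion(ch):
    """Pick the assertion to place next to an edge character of the term."""
    if ch.isspace():
        return ""       # assumption: whitespace at an edge is intentional, add nothing
    if re.match(r"\w", ch):
        return r"\b"    # word character: require a word boundary
    return r"\B"        # other non-word character: require a non-boundary

def whole_word_pattern(term):
    """Build a whole-word regex for a literal, non-empty search term."""
    return _edge_assertion(term[0]) + re.escape(term) + _edge_assertion(term[-1])

text = "[ abc() *abc foo*abc ]"
counts = {term: len(re.findall(whole_word_pattern(term), text, re.IGNORECASE))
          for term in ["abc", "abc()", "*abc", "*abc "]}
print(counts)
```

On this input the counts agree with the hand-written patterns above (3, 1, 1, 1). An empty search term is not handled and would need a guard.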

I think this is what you're looking for:
@"(?<!\w)" + escapedPattern + @"(?!\w)"
\b is defined in terms of the presence or absence of "word" characters both before and after the current position. You only care about what's before the first character and what's after the last one.
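To check the lookaround pattern against the counts expected in the question, here is a minimal sketch in Python (the same lookbehind/lookahead syntax is accepted there). The function name count_whole_word is made up for illustration.

```python
import re

def count_whole_word(text, term):
    """Count occurrences of a literal term not touching a word character on either side."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text, re.IGNORECASE))

text = "[ abc() *abc ]"
for term in ["abc", "abc()", "*abc", "*abc "]:
    print(repr(term), count_whole_word(text, term))
```

Because the lookarounds only inspect the characters outside the term, no special-casing of the term's first and last characters is needed.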

The \b is a zero-width assertion that matches between a word character and a non-word character.
Letters, digits and underscores are word characters. *, space, and parentheses are non-word characters. Therefore, when you use \b\*abc\b as your pattern, it does not match your input: there is no word boundary between the preceding space and the *, since both are non-word characters. The same applies to your pattern ending in parentheses.
To solve this, eliminate the \b in cases where your (unescaped) search term begins or ends with a non-word character.
public void Run()
{
    String input = "[ abc() *abc ]";
    Match(input, @"\babc\b", 2);
    Match(input, @"\babc\(\)", 1);
    Match(input, @"\*abc\b", 1);
    Match(input, @"\*abc\b ", 1);
}
private void Match(String input, String pattern, int expected)
{
    MatchCollection mc = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);
    Console.WriteLine((mc.Count == expected) ? "PASS ({0}=={1})" : "FAIL ({0}!={1})",
        mc.Count, expected);
}

Related

Pattern match for (length)%code with before length

I have a pattern like x%c, where x is a single digit integer and c is an alphanumeric code of length x. % is just a token separator of length and code
For instance 2%74 is valid since 74 is of 2 digits. Similarly, 1%8 and 4%3232 are also valid.
I have tried a regex of the form ^([0-9])(%)([A-Z0-9]){\1}, where I am trying to limit the length by the value of group 1. It does not work, apparently because the backreference is treated as a string, not a number.
If I change the above regex to ^([0-9])(%)([A-Z0-9]){2}, it works for 2%74, but that is of no use since the length has to be controlled by the first group, not by a fixed digit.
If this is not possible with a regex, is there a better approach in Java?
One way could be using 2 capture groups, and convert the first group to an int and count the characters for the second group.
\b(\d+)%(\d+)\b
\b Word boundary
(\d+) Capture group 1, match 1+ digits
% Match literally
(\d+) Capture group 2, match 1+ digits
\b Word boundary
For example
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "\\b(\\d+)%(\\d+)\\b";
Pattern pattern = Pattern.compile(regex);
String[] strings = { "2%74", "1%8", "4%3232", "5%123456", "6%0" };
for (String s : strings) {
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
        // Group 1 is the declared length, group 2 is the code itself
        if (Integer.parseInt(matcher.group(1)) == matcher.group(2).length()) {
            System.out.println("Match for " + s);
        } else {
            System.out.println("No match for " + s);
        }
    }
}
Output
Match for 2%74
Match for 1%8
Match for 4%3232
No match for 5%123456
No match for 6%0
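The pattern above only accepts digits after the %, while the question describes an alphanumeric code with a single-digit length. A variant that follows the question's wording more closely might look like the sketch below. It is written in Python, the name is_valid is hypothetical, and the character class [A-Z0-9] is taken from the question's own attempt.

```python
import re

# One length digit, a literal %, then the code itself
LENGTH_CODE = re.compile(r"([0-9])%([A-Z0-9]+)")

def is_valid(s):
    """True if s looks like x%c and the code c has exactly x characters."""
    m = LENGTH_CODE.fullmatch(s)
    return m is not None and int(m.group(1)) == len(m.group(2))

for s in ["2%74", "1%8", "4%3232", "3%A1B", "5%123456", "6%0"]:
    print(s, is_valid(s))
```

The regex only checks the shape; the length comparison still happens in ordinary code, since a backreference cannot be used as a repetition count.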

How to do a camel case to sentence case in dart

Something is wrong with my attempt:
String camelToSentence(String text) {
  var result = text.replaceAll(RegExp(r'/([A-Z])/g'), r" $1");
  var finalResult = result[0].toUpperCase() + result.substring(1);
  return finalResult;
}
void main(){
  print(camelToSentence("camelToSentence"));
}
It just prints "CamelToSentence" instead of "Camel To Sentence".
Looks like the problem is here r" $1"; but I don't know why.
You can use
String camelToSentence(String text) {
  return text.replaceAllMapped(RegExp(r'^([a-z])|[A-Z]'),
      (Match m) => m[1] == null ? " ${m[0]}" : m[1]!.toUpperCase());
}
Here,
^([a-z])|[A-Z] - matches and captures into Group 1 a lowercase ASCII letter at the start of the string, or just matches an uppercase letter anywhere in the string
(Match m) => m[1] == null ? " ${m[0]}" : m[1]!.toUpperCase() returns as the replacement the uppercased Group 1 value (if it was matched), or a space + the matched value otherwise. The ! is needed under Dart's null safety because m[1] has the nullable type String?.
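For comparison, the same "replacement computed from the match" idea can be sketched in Python, where re.sub accepts a callback much like replaceAllMapped does. The function names are illustrative only.

```python
import re

def camel_to_sentence(text):
    """Uppercase a leading lowercase letter and put a space before every capital."""
    def replace(m):
        if m.group(1) is not None:
            return m.group(1).upper()   # group 1 matched: first letter of the string
        return " " + m.group(0)         # otherwise: an uppercase letter inside the string
    return re.sub(r"^([a-z])|[A-Z]", replace, text)

print(camel_to_sentence("camelToSentence"))
```

Note that a string that already starts with a capital would get a leading space with this pattern, since the capital is matched by the second alternative.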
You should not use the / and /g in the pattern.
About the replaceAll method:
Notice that the replace string is not interpreted. If the replacement
depends on the match (for example on a RegExp's capture groups), use
the replaceAllMapped method instead.
As it does not match, result[0] returns c and result.substring(1) contains amelToSentence, so you are concatenating an uppercased c with amelToSentence, giving CamelToSentence.
You can also use lookarounds
(?<!^)(?=[A-Z])
(?<!^) Assert not the start of the string
(?=[A-Z]) Assert an uppercase char A-Z to the right
For example
String camelToSentence(String text) {
  var result = text.replaceAll(RegExp(r'(?<!^)(?=[A-Z])'), r" ");
  var finalResult = result[0].toUpperCase() + result.substring(1);
  return finalResult;
}
void main() {
  print(camelToSentence("camelToSentence"));
}
Output
Camel To Sentence
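The lookaround pattern is not specific to Dart. As a rough cross-check, here is the same approach sketched in Python; the function name is made up for the example.

```python
import re

def camel_to_sentence_lookaround(text):
    """Insert a space at every position (except the start) that is followed by A-Z."""
    spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", text)
    # Uppercase the first character; slicing keeps this safe for an empty string
    return spaced[:1].upper() + spaced[1:]

print(camel_to_sentence_lookaround("camelToSentence"))
```

Since the pattern matches a position rather than a character, nothing is consumed and the replacement can be a plain string. Runs of capitals (acronyms) get split letter by letter, which may or may not be what you want.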

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
And I have such sample below.
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
    std::cout << match.size() << "match: " << match[0] << '\n';
    s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why the first occurrence, 1mySymbol1, also matches my regex.
How do I create a proper regex that ignores such strings?
UPD
If I do it like this
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I only find the substring in the middle
1match: , /_mySymbol_
and I don't find the substrings at the beginning and at the end.
The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
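Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). Note that with this approach the match won't include the characters before/after mysymbol. Be aware that std::regex with its default ECMAScript grammar supports lookahead but not lookbehind, so in C++11 the (?<!...) part needs a different engine (Boost.Regex, for example); with std::regex alone, use the anchor-based option instead.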
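To see what the lookaround option returns on the two sample strings from the question, here is a small sketch in Python, whose re module accepts both lookahead and lookbehind. The helper name find_standalone is hypothetical.

```python
import re

def find_standalone(text, word):
    """Find word where it is not preceded or followed by an ASCII letter or digit."""
    pattern = r"(?<![a-zA-Z0-9])" + re.escape(word) + r"(?![a-zA-Z0-9])"
    return re.findall(pattern, text, re.IGNORECASE)

print(find_standalone("1mySymbol1, /_mySymbol_ mysymbol", "mysymbol"))
print(find_standalone("mySymbol, /_mySymbol_ mysymbol", "mysymbol"))
```

The first call skips 1mySymbol1 and returns the other two occurrences; the second returns all three, including the ones at the very beginning and end of the string.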
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.

c# regex split or replace. here's my code i did

I am trying to replace a certain group to "" by using regex.
I was searching and doing my best, but it's over my head.
What I want to do is,
string text = "(12je)apple(/)(jj92)banana(/)cat";
string resultIwant = {apple, banana, cat};
Inside the first pair of parentheses there must be exactly 4 characters, which may include digits,
and '(/)' closes the item.
Here's my code (I was using the Matches function):
string text = @"(12dj)apple(/)(88j1)banana(/)cat";
string pattern = @"\(.{4}\)(?<value>.+?)\(/\)";
Regex rex = new Regex(pattern);
MatchCollection mc = rex.Matches(text);
if (mc.Count > 0)
{
    foreach (Match str in mc)
    {
        print(str.Groups["value"].Value.ToString());
    }
}
However, the result was
apple
banana
So I think I should use replace or something else instead of Matches.
The regex below captures the word characters that come right after a ):
(?<=\))(\w+)
Your C# code would be:
{
    string str = "(12je)apple(/)(jj92)banana(/)cat";
    Regex rgx = new Regex(@"(?<=\))(\w+)");
    foreach (Match m in rgx.Matches(str))
        Console.WriteLine(m.Groups[1].Value);
}
Explanation:
(?<=\)) Positive lookbehind. It asserts that the match starts right after a ) symbol.
() Capturing group.
\w+ Matches one or more word characters. It stops before the following ( symbol because that isn't a word character.
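The same lookbehind can be tried quickly outside C#. Below is a minimal sketch in Python; extract_items is a name invented for the example.

```python
import re

def extract_items(text):
    """Return every run of word characters that starts right after a ')'."""
    return re.findall(r"(?<=\))\w+", text)

print(extract_items("(12je)apple(/)(jj92)banana(/)cat"))
```

This picks up the trailing cat as well, because it only requires a ) before the word and nothing after it. It does not check that the opening tag has exactly 4 characters.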

Regular expression that matches string equals to one in a group

E.g. I want to match a string that has the same word at the end as at the beginning, so that the following strings match:
aaa dsfj gjroo gnfsdj riier aaa
sdf foiqjf skdfjqei adf sdf sdjfei sdf
rew123 jefqeoi03945 jq984rjfa;p94 ajefoj384 rew123
This one could do the job:
/^(\w+\b).*\b\1$/
explanation:
/ : regex delimiter
^ : start of string
( : start capture group 1
\w+ : one or more word character
\b : word boundary
) : end of group 1
.* : any number of any char
\b : word boundary
\1 : group 1
$ : end of string
/ : regex delimiter
M42's answer is OK except for degenerate cases: it will not match a string that contains only one word. To accept those within a single regexp, use:
/^(?:(\w+\b).*\b\1|\w+)$/
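A quick way to sanity-check this pattern against the sample lines is the sketch below. It is in Python, where the regex is written without the / delimiters, and edge_words_match is a hypothetical name.

```python
import re

# Either "word ... same word" or a single word on its own
EDGE_WORDS = re.compile(r"^(?:(\w+\b).*\b\1|\w+)$")

def edge_words_match(line):
    """True if the line starts and ends with the same word (or is a single word)."""
    return EDGE_WORDS.search(line) is not None

samples = [
    "aaa dsfj gjroo gnfsdj riier aaa",
    "sdf foiqjf skdfjqei adf sdf sdjfei sdf",
    "rew123 jefqeoi03945 jq984rjfa;p94 ajefoj384 rew123",
    "aaa",
    "aaa bbb",
    "aaa baaa",
]
for line in samples:
    print(repr(line), edge_words_match(line))
```

The \b in front of \1 is what rejects "aaa baaa", where the first word only appears as the tail of the last word.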
Also, matching only the necessary parts may be significantly faster on very large strings. Here are my solutions in JavaScript:
RegExp:
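function areEdgeWordsTheSame(str) {
    var m = str.match(/^(\w+)\b/);
    if (!m) return false; // the string does not start with a word
    // \b stops the first word from matching as a mere suffix, e.g. "aaa baaa"
    return (new RegExp('\\b' + m[1] + '$')).test(str);
}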
String:
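function areEdgeWordsTheSame(str) {
    var idx = str.indexOf(' ');
    if (idx < 0) return true; // a single word trivially matches itself
    // compare the first word with everything after the last space
    return str.slice(0, idx) === str.slice(str.lastIndexOf(' ') + 1);
}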
I don't think a regular expression is the right choice here. Why not split the line into an array and compare the first and the last item?
In C#:
string[] words = line.Split(' ');
return words.Length >= 2 && words[0] == words[words.Length - 1];
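For reference, the split-and-compare idea translates almost directly to other languages. Here is a sketch in Python with a made-up function name. One difference to be aware of: split() with no argument splits on any run of whitespace, whereas Split(' ') in the C# snippet splits on single spaces only.

```python
def edge_words_match_split(line):
    """True if the line has at least two words and the first equals the last."""
    words = line.split()
    return len(words) >= 2 and words[0] == words[-1]

print(edge_words_match_split("aaa dsfj gjroo gnfsdj riier aaa"))
print(edge_words_match_split("aaa bbb"))
```

Like the C# version, this treats a single-word line as a non-match; drop the length check if a lone word should count.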