How to find the whole word in kotlin using regex - regex

I want to find the whole word in string. But I don't know how to find the all word in kotlin. My finding word is [non alpha]cba[non alpha]. My example code is bellows.
val testLink3 = """cba#cba cba"""
val word = "cba"
val matcher = "\\b[^a-zA-Z]*(?i)$word[^a-zA-Z]*\\b".toRegex()
val ret = matcher.find(testLink3)?.groupValues
But output of my source code is "cba"
My expected value is string array such as "{cba, cba, cba}".
How to find this value in kotlin language.

You may use
val testLink3 = """cBa#Cba cbA123"""
val word = "cba"
val matcher = "(?i)(?<!\\p{L})$word(?!\\p{L})".toRegex()
println(matcher.findAll(testLink3).map{it.value}.toList() )
println(matcher.findAll(testLink3).count() )
// => [cBa, Cba, cbA]
// => 3
See the online Kotlin demo.
To obtain the list of matches, you need to run the findAll method on the regex object, map the results to values and cast to list:
.findAll(testLink3).map{it.value}.toList()
To count the matches, you may use
matcher.findAll(testLink3).count()
Regex demo
(?i) - case insensitive modifier
(?<!\\p{L}) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a letter
$word - your word variable value
(?!\\p{L}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a letter.

Related

Overlapping matches in Regex - Scala

I'm trying to extract all posible combinations of 3 letters from a String following the pattern XYX.
val text = "abaca dedfd ghgig"
val p = """([a-z])(?!\1)[a-z]\1""".r
p.findAllIn(text).toArray
When I run the script I get:
aba, ded, ghg
And it should be:
aba, aca, ded, dfd, ghg, gig
It does not detect overlapped combinations.
The way consists to enclose the whole pattern in a lookahead to consume only the start position:
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
p.findAllIn(text).matchData foreach {
m => println(m.group(1))
}
The lookahead is only an assertion (a test) for the current position and the pattern inside doesn't consume characters. The result you are looking for is in the first capture group (that is needed to get the result since the whole match is empty).
You need to capture the whole pattern and put it inside a positive lookahead. The code in Scala will be the following:
object Main extends App {
val text = "abaca dedfd ghgig"
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
val allMatches = p.findAllMatchIn(text).map(_.group(1))
println(allMatches.mkString(", "))
// => aba, aca, ded, dfd, ghg, gig
}
See the online Scala demo
Note that the backreference will turn to \2 as the group to check will have ID = 2 and Group 1 will contain the value you need to collect.

scala matching optional set of characters

I am using scala regex to extract a token from a URL
my url is http://www.google.com?x=10&id=x10_23&y=2
here I want to extract the value of x10 in front of id. note that _23 is optional and may or may not appear but if it appears it must be removed.
The regex which I have written is
val regex = "^.*id=(.*)(\\_\\d+)?.*$".r
x match {
case regex(id) => print(id)
case _ => print("none")
}
this should work because (\\_\\d+)? should make the _23 optional as a whole.
So I don't understand why it prints none.
Note that your pattern ^.*id=(.*)(\\_\\d+)?.*$ actually puts x10_23&y=2 into Group 1 because of the 1st greedy dot matching subpattern. Since (_\d+)? is optional, the first greedy subpattern does not have to yield any characters to that capture group.
You can use
val regex = "(?s).*[?&]id=([^\\W&]+?)(?:_\\d+)?(?:&.*)?".r
val x = "http://www.google.com?x=10&id=x10_23&y=2"
x match {
case regex(id) => print(id)
case _ => print("none")
}
See the IDEONE demo (regex demo)
Note that there is no need defining ^ and $ - that pattern is anchored in Scala by default. (?s) ensures we match the full input string even if it contains newline symbols.
Another idea instead of using a regular expression to extract tokens would be to use the built-in URI Java class with its getQuery() method. There you can split the query by = and then check if one of the pair starts with id= and extract the value.
For instance (just as an example):
val x = "http://www.google.com?x=10&id=x10_23&y=2"
val uri = new URI(x)
uri.getQuery.split('&').find(_.startsWith("id=")) match {
case Some(param) => println(param.split('=')(1).replace("_23", ""))
case None => println("None")
}
I find it simpler to maintain that the regular expression you have, but that's just my thought!

regular expression to contain all strings that don't contain a pattern

I have a pattern 'NewTree' and I want to get all strings that don't contain this pattern 'NewTree'. How do I use regex to do the filter?
So if I have 1.BoostKite 2.SetTree 3. ComeNewTreeNow
Then the output should be BoostKite and SetTree.
Any suggestions? I wanted regex that can work anywhere and not use any language specific function.
You can try using a Negative Lookahead if you want to use a regular expression.
^(?!.*NewTree).*$
Live Demo
Alternatively you can use the alternation operator in context placing what you want to exclude on the left, ( saying throw this away, it's garbage ) and place what you want to match in a capturing group on the right side.
\w*NewTree\w*|([a-zA-Z]+)
Live Demo
In Python:
( The strings being in list context, as you commented 'array' above )
>>> import re
>>> regex = re.compile(r'^(?!.*NewTree).*$')
>>> mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
>>> matches = [x for x in mylst if regex.match(x)]
['BoostKite', 'SetTree']
If it is just a long string of multiple words and you want to ignore the words that contain NewTree
>>> s = '1.BoostKite 2.SetTree 3. ComeNewTreeNow 4. foo 5. bar'
>>> filter(None, re.findall(r'\w*NewTree\w*|([a-zA-Z]+)', s))
['BoostKite', 'SetTree', 'foo', 'bar']
You can do this without a regular expression as well.
>>> mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
>>> matches = [x for x in mylst if "NewTree" not in x]
['BoostKite', 'SetTree']
Match each word with the regex \w+NewTree\b. It returns true if it ends with NewTree
Use i modifier for case insensitive match (ignores case of [a-zA-Z])
Use \w* instead of \w+ in above regex if you want to match for NewTree word as well.
If you are looking for contains NewTree then try this regex \w*NewTree\w*\b
I think you can do this in general in the manner of the following example for your specific case:
^(([^N]|N[^e]|Ne[^w]|New[^T]|NewT[^r]|NewTr[^e]|NewTre[^e])+)?(.|..|...|....|.....)?$
So far what I have here is a near miss. It will not match any string that has substring NewTree. But it will not match every string that is free of the substring NewTree. In particular it will not match Nvwxyz.

Python RegEx query missing overlapping substrings

Python3.3, OS X 7.5
I am attempting to locate all instances of a 4-character substring defined as follows:
First character = 'N'
Second character = Anything but 'P'
Third character = 'S' or 'T'
Fourth character = Anything but 'P'
My query looks like this:
re.findall(r"\N[A-OQ-Z][ST][A-OQ-Z]", text)
This is working except in one particular case where two substrings overlap. That case involves the following 5character substring:
'...NNTSY...'
The query catches the first 4-character substring ('NNTS'), but not the second 4-character substring ('NTSY').
This is my first attempt at regular expressions, and obviously I'm missing something.
You can do this if the re engine does not consume characters as it matches them, which is possible with lookahead assertions:
import re
text = '...NNTSY...'
for m in re.findall(r'(?=(N[A-OQ-Z][ST][A-OQ-Z]))', text):
print(m)
Output:
NNTS
NTSY
Having everything within the assertion works but also feels weird. Another way is taking the N out of the assertion:
for m in re.findall(r'(N(?=([A-OQ-Z][ST][A-OQ-Z])))', text):
print(''.join(m))
From the Python 3 documentation (emphasis added):
$ python3 -c 'import re; help(re.findall)'
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
If you want overlapping instances, use regex.search() in a loop. You have to compile the regular expression because the API for non-compiled regular expressions doesn't take a parameter to specify the starting position.
def findall_overlapping(pattern, string, flags=0):
"""Find all matches, even ones that overlap."""
regex = re.compile(pattern, flags)
pos = 0
while True:
match = regex.search(string, pos)
if not match:
break
yield match
pos = match.start() + 1
(N[^P](?:S|T)[^P])
Edit live on Debuggex

Regular Expressions: about Greediness, Laziness and Substrings

I have the following string:
123322
In theory, the regex 1.*2 should match the following:
12 (because * can be zero characters)
12332
123322
If I use the regex 1.*2 it matches 123322.
Using 1.*?2, it will match 12.
Is there a way to match 12332 too?
The perfect thing would be to get all possible matchess in the string (no matter if one match is substring of another)
No, unless there is something else added to the regex to clarify what it should do it will either be greedy or non-greedy. There is no in-betweeny ;)
1(.*?2)*$
you will have multiple captures which you can concatenate to form all possible matches
see here:regex tester
click on 'table' and expand the captures tree
You would need a separate expression for each case, depending on the number of twos you want to match:
1(.*?2){1} #same as 1.*?2
1(.*?2){2}
1(.*?2){3}
...
Generally, this isn't possible. A regex matching engine isn't really designed to find overlapping matches. A quick solution is simply to check the pattern on all substrings manually:
string text = "1123322";
for (int start = 0; start < text.Length - 1; start++)
{
for (int length = 0; length <= text.Length - start; length++)
{
string subString = text.Substring(start, length);
if (Regex.IsMatch(subString, "^1.*2$"))
Console.WriteLine("{0}-{1}: {2}", start, start + length, subString);
}
}
Working example: http://ideone.com/aNKnJ
Now, is it possible to get a whole-regex solution? Mostly, the answer is no. However, .Net does has a few tricks in its sleeve to help us: it allows variable length lookbehind, and allows each capturing group to remember all captures (most engines only return the last match of each group). Abusing these, we can simulate the same for loop inside the regex engine:
string text = "1123322!";
string allMatchesPattern = #"
(?<=^ # Starting at the local end position, look all the way to the back
(
(?=(?<Here>1.*2\G))? # on each position from the start until here (\G),
. # *try* to match our pattern and capture it,
)* # but advance even if you fail to match it.
)
";
MatchCollection matches = Regex.Matches(text, allMatchesPattern,
RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
foreach (Match endPosition in matches)
{
foreach (Capture startPosition in endPosition.Groups["Here"].Captures)
{
Console.WriteLine("{0}-{1}: {2}", startPosition.Index,
endPosition.Index - 1, startPosition.Value);
}
}
Note that currently there's a small bug there - the engine doesn't try to match the last ending position ($), so you loose a few matches. For now, adding a ! at the end of the string solves that issue.
working example: http://ideone.com/eB8Hb