Scala Anchored Regex acts as unanchored - regex

So for some reason in Scala 2.11, my anchored regex patterns act as unanchored regex patterns.
scala> """something\.com""".r.anchored findFirstIn "app.something.com"
res66: Option[String] = Some(something.com)
scala> """^.something\.com$""".r.anchored findFirstIn "app.something.com"
res65: Option[String] = None
I thought the first expression would evaluate as None like the second (manually entered anchors) but it does not.
Any help would be appreciated.

The findFirstIn method un-anchors the regex automatically.
You can see that the example code also matches only "A":
"""\w+""".r findFirstIn "A simple example." foreach println // prints "A"
BTW, once you create a regex like "pattern".r, it is anchored by default, but that only matters when you use the regex in a match block. Inside findAllIn or findFirstIn, this anchoring is simply ignored.
So, to make sure the regex matches the whole string, always add ^ and $ (or \A and \z) anchors if you are not sure where you are going to use the regexes.
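A minimal sketch of that difference, assuming Scala 2.11 or later. The names `re`, `loose` and `matchesWhole` are made up for illustration; `findFirstIn` finds a substring either way, while the extractor used in a match block respects anchoring.

```scala
import scala.util.matching.Regex

val input = "app.something.com"

// A Regex built with .r is anchored by default.
val re: Regex = """something\.com""".r
val loose: Regex = re.unanchored

// findFirstIn searches for the first matching substring; anchoring is ignored.
val found: Option[String] = re.findFirstIn(input)

// Hypothetical helper: the extractor pattern r() honours anchoring.
def matchesWhole(r: Regex, s: String): Boolean = s match {
  case r() => true
  case _   => false
}

println(found)                              // Some(something.com)
println(matchesWhole(re, input))            // false
println(matchesWhole(loose, input))         // true
println(matchesWhole(re, "something.com"))  // true
```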

I think it is only supposed to work with match:
val reg = "o".r.anchored
"foo" match {
case reg() => "Yes!"
case _ => "No!"
}
... returns "No!".
This doesn't seem very useful, because just "o".r is anchored by default anyway. The only use of this I can imagine is if you made some regex unanchored (by accident? :)) and then want to undo it, or if you want to match both cases, but separately:
val reg = "o".r.unanchored
"foo" match {
case reg.anchored() => "Anchored!"
case reg() => "Unanchored"
case _ => "I dunno"
}

Related

Scala pattern match regex

In my scala program, I want to use a pattern match to test whether there is a valid .csv file in the input path.
path ="\DAP\TestData\test01.csv"
val regex=""".csv$""".r.unanchored
I tried the previous regex against the string directly and it worked, but it does not work when I use it in a pattern match:
path ="\DAP\TestData\test01.csv"
val regex="""\.csv$""".r.unanchored
path match {
case regex(type) =>println(s"$type matched")
case _ =>println("something else happened")
}
I need to successfully print information like ".csv matched".
Could anyone help me with this issue? I'm really confused by it.
Thanks
It's not clear which part of the path you want to capture and report. But in any case you'll probably want a capture group in the regex pattern.
val path = raw"\DAP\TestData\test01.csv"
val re = """(.*\.csv)$""".r.unanchored
path match {
case re(typ) => println(s"$typ matched") //"\DAP\TestData\test01.csv matched"
case _ => println("something else happened")
}
You can also use the capture group to capture any one of many different target patterns.
val re = ".*\\.((?i:json|xml|csv))$".r
raw"\root\test31.XML" match {
case re(ext) => println(s"$ext matched") //"XML matched"
case _ => println("something else happened")
}
You can try it like this:
val regex = """(\.csv)$""".r.unanchored
path match {
case regex(fileType) => println(s"$fileType matched")
case _ => println("something else happened")
}
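The capture-group approach can be wrapped in a small helper. This is only a sketch: `ExtPattern` and `extensionOf` are illustrative names, and the list of extensions is an assumption, not something from the question.

```scala
// One capturing group, so the extractor binds exactly one variable.
// (?i:...) makes only the extension part case-insensitive.
val ExtPattern = """.*\.((?i:csv|json|xml))""".r

// Hypothetical helper: returns the lower-cased extension if it is one we know.
def extensionOf(path: String): Option[String] = path match {
  case ExtPattern(ext) => Some(ext.toLowerCase)
  case _               => None
}

println(extensionOf("""\DAP\TestData\test01.csv"""))  // Some(csv)
println(extensionOf("""\root\test31.XML"""))          // Some(xml)
println(extensionOf("""\root\notes.txt"""))           // None
```

No .unanchored is needed here because the leading .* already lets the anchored pattern cover the whole path.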

Why does Scala regexp work differently in pattern matching

I have a simple regular expression val emailRegex = "\\w+@\\w+\\.\\w+".r that matches simple emails (not for production, of course :)). When I run the following code:
println(email match {
case emailRegex(_) => "cool"
case _ => "not cool"
})
println(emailRegex.pattern.matcher(email).matches())
It prints not cool and true. Adding anchors doesn't help either: "^\\w+@\\w+\\.\\w+$".r gives the same result. But when I add parentheses, "(\\w+@\\w+\\.\\w+)".r, it prints cool and true.
Why does this happen?
The number of arguments to a regex pattern should match the number of capturing groups in the regex. Your regex does not have any capturing groups, so there should be zero arguments:
println(email match {
case emailRegex() => "cool"
case _ => "not cool"
})
println(emailRegex.pattern.matcher(email).matches())
Because pattern matching with a regex is about capturing regex groups:
val email = "foo@foo.com"
val slightlyDifferentEmailRegex = "(\\w+)@\\w+\\.\\w+".r // just add a group with two brackets
println(email match {
case slightlyDifferentEmailRegex(g) => "cool" + s" and here's the captured group: $g"
case _ => "not cool"
})
prints:
cool and here's the captured group: foo
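To see why the number of variables has to equal the number of groups, it may help to call unapplySeq directly, since that is what a match block uses. A small sketch, assuming addresses of the form user@host.tld; `noGroup`, `oneGroup` and `whole` are illustrative names.

```scala
val noGroup  = """\w+@\w+\.\w+""".r     // zero capturing groups
val oneGroup = """(\w+)@\w+\.\w+""".r   // one capturing group

val email = "foo@foo.com"

// unapplySeq returns one list entry per capturing group.
println(noGroup.unapplySeq(email))    // Some(List())
println(oneGroup.unapplySeq(email))   // Some(List(foo))

// With zero groups, bind the whole input with @ instead of a group variable.
val whole = email match {
  case all @ noGroup() => all
  case _               => "no match"
}
println(whole)                        // foo@foo.com
```

So emailRegex(_) fails against a zero-group regex simply because a one-element pattern cannot match an empty list.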

How to filter a list in Kotlin using Regex [duplicate]

I've created a very simple match-all Regex with Regex.fromLiteral(".*").
According to the documentation: "Returns a literal regex for the specified literal string."
But I don't really get what "for the specified literal string" is supposed to mean.
Consider this example:
fun main(args: Array<String>) {
val regex1 = ".*".toRegex()
val regex2 = Regex.fromLiteral(".*")
println("regex1 matches abc: " + regex1.matches("abc"))
println("regex2 matches abc: " + regex2.matches("abc"))
println("regex2 matches .* : " + regex2.matches(".*"))
}
Output:
regex1 matches abc: true
regex2 matches abc: false
regex2 matches .* : true
So apparently (and contrary to my expectations), Regex.fromLiteral() and String.toRegex() behave completely differently (I've tried dozens of different arguments to regex2.matches() - the only one that returned true was .*)
Does this mean that a Regex created with Regex.fromLiteral() always matches only the exact string it was created with?
If yes, what are possible use cases for such a Regex? (I can't think of any scenario where that would be useful)
Yes, it does indeed create a regex that matches the literal characters in the String. This is handy when you're trying to match symbols that would be interpreted in a regex - you don't have to escape them this way.
For example, if you're looking for strings that contain .*[](1)?[2], you could do the following:
val regex = Regex.fromLiteral(".*[](1)?[2]")
regex.containsMatchIn("foo") // false
regex.containsMatchIn("abc.*[](1)?[2]abc") // true
Of course you can do almost anything you can do with a Regex with just regular String methods too.
val literal = ".*[](1)?[2]"
literal == "foo" // equality checks
literal in "abc.*[](1)?[2]abc" // containment checks
"some string".replace(literal, "new") // replacements
But sometimes you need a Regex instance as a parameter, so the fromLiteral method can be used in those cases. Performance of these different operations for different inputs could also be interesting for some use cases.
Regex.fromLiteral() instantiates a regex object while escaping the special regex metacharacters. The pattern you get is equivalent to \.\*, and since you used matches(), which requires a full string match, you can only match a .* string with it (with find() you could match it anywhere inside a string).
See the source code:
public fun fromLiteral(literal: String): Regex = Regex(escape(literal))
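For comparison, here is a rough Scala sketch of the same idea using java.util.regex.Pattern.quote. I'm assuming it behaves like Kotlin's escaping for this purpose; `literal` and `re` are just illustrative names.

```scala
import java.util.regex.Pattern

val literal = ".*[](1)?[2]"

// Pattern.quote wraps the text so that every character is taken literally.
val re = Pattern.quote(literal).r

// Substring search finds the literal text anywhere in the input.
println(re.findFirstIn("abc.*[](1)?[2]abc"))    // Some(.*[](1)?[2])
println(re.findFirstIn("foo"))                  // None

// A full match succeeds only for the exact literal string.
println(re.pattern.matcher(literal).matches())  // true
println(re.pattern.matcher("abc").matches())    // false
```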

scala matching optional set of characters

I am using a Scala regex to extract a token from a URL.
My URL is http://www.google.com?x=10&id=x10_23&y=2
Here I want to extract the value x10 that follows id=. Note that _23 is optional and may or may not appear, but if it appears it must be removed.
The regex I have written is
val regex = "^.*id=(.*)(\\_\\d+)?.*$".r
x match {
case regex(id) => print(id)
case _ => print("none")
}
this should work because (\\_\\d+)? should make the _23 optional as a whole.
So I don't understand why it prints none.
Note that your pattern ^.*id=(.*)(\\_\\d+)?.*$ actually puts x10_23&y=2 into Group 1 because of the 1st greedy dot matching subpattern. Since (_\d+)? is optional, the first greedy subpattern does not have to yield any characters to that capture group.
You can use
val regex = "(?s).*[?&]id=([^\\W&]+?)(?:_\\d+)?(?:&.*)?".r
val x = "http://www.google.com?x=10&id=x10_23&y=2"
x match {
case regex(id) => print(id)
case _ => print("none")
}
See the IDEONE demo (regex demo)
Note that there is no need defining ^ and $ - that pattern is anchored in Scala by default. (?s) ensures we match the full input string even if it contains newline symbols.
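A quick way to check what each pattern captures is to call unapplySeq directly. This is just a sketch; `original` and `fixed` are names I picked. It also shows a second reason the question's code prints none: the original pattern has two capturing groups, so case regex(id) with a single variable can never match it.

```scala
val url = "http://www.google.com?x=10&id=x10_23&y=2"

// The pattern from the question: two capturing groups.
val original = "^.*id=(.*)(\\_\\d+)?.*$".r

// Group 1 greedily takes the rest of the string; group 2 does not participate (null).
println(original.unapplySeq(url))   // Some(List(x10_23&y=2, null))

// The suggested pattern: one lazy capturing group, then an optional non-capturing suffix.
val fixed = "(?s).*[?&]id=([^\\W&]+?)(?:_\\d+)?(?:&.*)?".r
println(fixed.unapplySeq(url))      // Some(List(x10))
```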
Another idea, instead of using a regular expression to extract tokens, would be to use the built-in Java URI class with its getQuery() method. You can split the query by & and then check whether one of the pairs starts with id= and extract the value.
For instance (just as an example):
import java.net.URI

val x = "http://www.google.com?x=10&id=x10_23&y=2"
val uri = new URI(x)
// Split the query into key=value pairs and look for the id parameter.
uri.getQuery.split('&').find(_.startsWith("id=")) match {
case Some(param) => println(param.split('=')(1).replace("_23", ""))
case None => println("None")
}
I find it simpler to maintain than the regular expression you have, but that's just my thought!
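If the suffix is not always _23, the same approach can strip any trailing _digits instead. A hedged sketch: `idToken` is a hypothetical helper, and it assumes the parameter is literally named id.

```scala
import java.net.URI

// Hypothetical helper: returns the id value without an optional _<digits> suffix.
def idToken(url: String): Option[String] = {
  // getQuery returns null when the URL has no query part.
  val query = Option(new URI(url).getQuery).getOrElse("")
  query.split('&').collectFirst {
    case p if p.startsWith("id=") => p.stripPrefix("id=").replaceAll("_\\d+$", "")
  }
}

println(idToken("http://www.google.com?x=10&id=x10_23&y=2"))  // Some(x10)
println(idToken("http://www.google.com?x=10&id=x10&y=2"))     // Some(x10)
println(idToken("http://www.google.com?x=10"))                // None
```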

Exclusive Or in Regular Expression

Looking for a bit of regex help.
I'd like to design an expression that matches a string with "foo" OR "bar", but not both "foo" AND "bar"
If I do something like...
/((foo)|(bar))/
It'll match "foobar". Not what I'm looking for. So, how can I make regex match only when one term or the other is present?
Thanks!
This is what I use:
/^(foo|bar){1}$/
See: http://www.regular-expressions.info/quickstart.html under repetition
If your regex language supports it, use negative lookaround:
(?<!foo|bar)(foo|bar)(?!foo|bar)
This will match "foo" or "bar" that is not immediately preceded or followed by "foo" or "bar", which I think is what you wanted.
It's not clear from your question or examples if the string you're trying to match can contain other tokens: "foocuzbar". If so, this pattern won't work.
Here are the results of your test cases ("true" means the pattern was found in the input):
foo: true
bar: true
foofoo: false
barfoo: false
foobarfoo: false
barbar: false
barfoofoo: false
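The same test cases can be reproduced on the JVM, whose regex engine accepts this kind of bounded lookbehind. A small Scala sketch; `exactlyOne`, `cases` and `results` are illustrative names.

```scala
val exactlyOne = """(?<!foo|bar)(foo|bar)(?!foo|bar)""".r

val cases = List("foo", "bar", "foofoo", "barfoo", "foobarfoo", "barbar", "barfoofoo")

// true means the pattern was found somewhere in the input.
val results: List[(String, Boolean)] =
  cases.map(s => s -> exactlyOne.findFirstIn(s).isDefined)

results.foreach { case (s, found) => println(s"$s: $found") }

// The limitation mentioned above: non-adjacent tokens still produce a match.
println(exactlyOne.findFirstIn("foocuzbar").isDefined)  // true
```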
You can do this with a single regex but I suggest for the sake of readability you do something like...
(/foo/ and not /bar/) || (/bar/ and not /foo/)
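In Scala that check could look roughly like this, using the Boolean ^ (xor) operator. `fooRe`, `barRe` and `exactlyOneKind` are names I made up for the sketch.

```scala
val fooRe = "foo".r
val barRe = "bar".r

// Hypothetical helper: true when exactly one of the two words occurs anywhere.
def exactlyOneKind(s: String): Boolean =
  fooRe.findFirstIn(s).isDefined ^ barRe.findFirstIn(s).isDefined

println(exactlyOneKind("foo"))        // true
println(exactlyOneKind("foobar"))     // false
println(exactlyOneKind("foocuzbar"))  // false
println(exactlyOneKind("baz"))        // false
println(exactlyOneKind("foofoo"))     // true: repeats of the same word still count
```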
This will take 'foo' and 'bar' but not 'foobar' and not 'blafoo' and not 'blabar':
/^(foo|bar)$/
^ = mark start of string (or line)
$ = mark end of string (or line)
This will take 'foo' and 'bar' and 'foo bar' and 'bar-foo' but not 'foobar' and not 'blafoo' and not 'blabar':
/\b(foo|bar)\b/
\b = mark word boundary
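A short Scala sketch comparing the two patterns on the inputs listed above; `wholeString`, `wholeWord` and `hit` are illustrative names.

```scala
import scala.util.matching.Regex

val wholeString = """^(foo|bar)$""".r    // the entire input must be foo or bar
val wholeWord   = """\b(foo|bar)\b""".r  // foo or bar as a separate word

// Hypothetical helper: does the pattern occur anywhere in the input?
def hit(re: Regex, s: String): Boolean = re.findFirstIn(s).isDefined

println(hit(wholeString, "foo"))      // true
println(hit(wholeString, "foo bar"))  // false
println(hit(wholeWord, "foo bar"))    // true
println(hit(wholeWord, "bar-foo"))    // true
println(hit(wholeWord, "foobar"))     // false
println(hit(wholeWord, "blafoo"))     // false
```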
You haven't specified behaviour regarding content other than "foo" and "bar" or repetitions of one in the absence of the other. e.g., Should "food" or "barbarian" match?
Assuming that you want to match strings which contain only one instance of either "foo" or "bar", but not both and not multiple instances of the same one, without regard for anything else in the string (i.e., "food" matches and "barbarian" does not match), then you could use a regex which returns the number of matches found and only consider it successful if exactly one match is found. e.g., in Perl:
@matches = ($value =~ /(foo|bar)/g); # @matches now holds all foos or bars present
if (scalar(@matches) == 1) { # exactly one match found
...
}
If multiple repetitions of that same target are allowed (i.e., "barbarian" matches), then this same general approach could be used by then walking the list of matches to see whether the matches are all repeats of the same text or if the other option is also present.
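Both variants of this counting approach can be sketched in Scala with findAllIn. The helper names `exactlyOnce` and `singleKind` are hypothetical.

```scala
val fooOrBar = "foo|bar".r

// Exactly one occurrence of either word in the whole string.
def exactlyOnce(s: String): Boolean = fooOrBar.findAllIn(s).size == 1

// Repeats allowed, as long as every occurrence is the same word.
def singleKind(s: String): Boolean =
  fooOrBar.findAllIn(s).toList.distinct.size == 1

println(exactlyOnce("food"))       // true
println(exactlyOnce("barbarian"))  // false
println(singleKind("barbarian"))   // true
println(singleKind("foobar"))      // false
```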
You might want to consider the ? conditional test.
(?(?=regex)then|else)
Regular Expression Conditionals
If you want a true exclusive or, I'd just do that in code instead of in the regex. In Perl:
/foo/ xor /bar/
But your comment:
Matches: "foo", "bar" nonmatches:
"foofoo" "barfoo" "foobarfoo" "barbar"
"barfoofoo"
indicates that you're not really looking for exclusive or. You actually mean
"Does /foo|bar/ match exactly once?"
my $matches = 0;
while (/foo|bar/g) {
last if ++$matches > 1;
}
my $ok = ($matches == 1);
I know this is a late entry, but just to help others who may be looking:
(\b(?:(?:(?!foo)bar)|(?:(?!bar)foo))\b)
I'd use something like this. It just checks for space around the words, but you could use the \b or \B to check for a border if you use \w. This would match " foo " or " bar ", so obviously you'd have to replace the whitespace as well, just in case. (Assuming you're replacing anything.)
/\s((foo)|(bar))\s/
I don't think this can be done with a single regular expression. And boundaries may or may not work depending on what you're matching against.
I would match against each regex separately, and do an XOR on the results.
import re
has_foo = re.search("foo", text) is not None
has_bar = re.search("bar", text) is not None
if has_foo ^ has_bar:
    pass  # do something...
I tried with Regex Coach against:
x foo y
x bar y
x foobar y
If I check the g option, indeed it matches all three words, because it searches again after each match.
If you don't want this behavior, you can anchor the expression, for example matching only on word boundaries:
\b(foo|bar)\b
Giving more context on the problem (what the data looks like) might give better answers.
\b(foo)\b|\b(bar)\b
And use only the first capture group.
Using the word boundaries, you can get the single word...
me@home ~
$ echo "Where is my bar of soap?" | egrep "\bfoo\b|\bbar\b"
Where is my bar of soap?
me@home ~
$ echo "What the foo happened here?" | egrep "\bfoo\b|\bbar\b"
What the foo happened here?
me@home ~
$ echo "Boy, that sure is foobar\!" | egrep "\bfoo\b|\bbar\b"