Why is my regex failing on certain strings that otherwise succeed? - regex

I have code written in F# that iterates over an array of strings, using a regex to extract part of each string. The problem is that the regex appears to succeed on some strings and fail on others at random, even on exact duplicates from the same list where it previously succeeded. What am I missing? Is this some sort of regex issue that I am not aware of?
Regex Pattern:
(?i)/(.*?/v\d/.*?((?=\?)|(?=\d)|(?=\n)))
F# code:
[<Literal>]
let ApiPattern = @"(?i)/(.*?/v\d/.*?((?=\?)|(?=\d)|(?=\n)))"
let parseOutEndpoints (inputs : (int * string) array) =
    let regEx = new Regex(ApiPattern, RegexOptions.Compiled)
    inputs |> Array.map (fun (id, path) -> [|id.ToString(); path|]) |> Array.collect (fun x -> x)
    |> writeRawPathsToFile
    File.ReadAllLines(RawPathsFile)
    |> Array.map(fun (x) ->
        let m = regEx.Match(x)
        if m.Success
        then
            let endpoint = Domain.Endpoint(m.Value)
            endpoint
        else
            let line = $"{x}"
            File.AppendAllLines(FailedRegexMatches, [line], Encoding.UTF8)
            Domain.NoEndpoint
    )
Sample string array Data:
All of these should return a match, but many don't: the list of successful matches that comes back is significantly shorter than this original list.
/enterprise-review/v9/choose?rr=Straight&pr=1%2E35239
/review-id-service/v1/business-id
/orderout/v1/vendor/shipping
/vendor-service/v1/Product/PartnerId/35310108
/Inspect/v1/Recommendation/Products/LaneId/0002,519188,13148,16939,7348,195982
/bin-inventory/v1/vendor?el=1%2E35239
/u-future/v1/fone?fhid=3028
/decline-summary/v1/details/card/65821974
/provide-service/v8/proDetails
/monetary-points/v1/sum/wins/681197
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data

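This pattern should resolve your issue:
/(.*?/v\d/.*?((?=[\?\d\s])|$))
The probable cause: the original lookaheads only accept ?, a digit, or \n after the endpoint, so the match fails when the endpoint is followed by \r (Windows carriage return) or other whitespace, or when it runs to the end of the string (written as $ in regex).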
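A quick way to see the difference between the two patterns is to run both over the same path with different endings. The following is a minimal sketch in Python rather than F#, on the assumption that the lookaheads and `$` behave the same way in both engines for these inputs; `extract` is a hypothetical helper, not part of the original code.

```python
import re

# Pattern from the question: the endpoint must be followed by '?', a digit, or '\n'.
ORIGINAL = re.compile(r"(?i)/(.*?/v\d/.*?((?=\?)|(?=\d)|(?=\n)))")
# Pattern from the answer: also accepts any whitespace (including '\r') or end of string.
FIXED = re.compile(r"/(.*?/v\d/.*?((?=[\?\d\s])|$))")


def extract(pattern, line):
    """Return the matched endpoint text, or None when the pattern does not match."""
    m = pattern.search(line)
    return m.group(0) if m else None


samples = [
    "/comment/v1/data",    # nothing after the path (line terminator already stripped)
    "/comment/v1/data\r",  # stray Windows carriage return
    "/comment/v1/data\n",  # trailing newline still attached
    "/bin-inventory/v1/vendor?el=1%2E35239",
]

for s in samples:
    print(repr(s), "->", extract(ORIGINAL, s), "|", extract(FIXED, s))
```

Since File.ReadAllLines returns each line without its terminator, the end-of-string case is probably the one that matters in the question's code.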

Here's your regex and input in Regex Storm, a .NET regex tester:
regex storm
I'd have made this a comment, but RS's share URLs contain the full regex and input, so it's too long for a comment (and SO doesn't allow URL shorteners in comments).
So, my question is: does this look right to you? Are all the highlighted matches what you're expecting to match? If so, as RS's engine is .NET based, I don't think there is a problem with the regex part of your code.

Related

regex pattern works in online tool, parses in NSRegularExpression, but fails to match anything

I am trying to match roman numerals from test strings like:
Series Name.disk_V.Episode_XI.Episode_name.avi
Series Name.Season V.Episode XI.Part XXV.Episode_name.avi
and a real-world example in which the XIII should not match:
XIII: The Series season II episode V.mp4
Following the logic in this fantastic thread and many experiments in an online regex debugger I came up with this:
(?<=d|dvd|disc|disk|s|se|season|e|ep|episode)[\s._-]\KM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(?=[\s._-])
The last example returns two matches, "II" and "V", ignoring the XIII in the name part. Yay!
So then I tried it in a Swift playground:
import Foundation
let file = "Series Name.disk_V.Episode_XI.Episode_name.avi"
let p = #"(?<=d|dvd|disc|disk|s|se|season|e|ep|episode)[\s._-]\KM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(?=[\s._-])"#
let r = try NSRegularExpression(pattern: p, options: [.caseInsensitive])
let nsString = file as NSString
let results = r.matches(in: file, options: [], range: NSMakeRange(0, nsString.length))
The pattern parses without error but returns no matches. I found that it works if I remove the \K, although that leaves the leading separator in the match. According to this thread, Obj-C (which I assume means NSRegex) supports \K, so I'm not sure why this fails.
There are a number of similar-sounding threads here on SO, but they invariably have to do with patterns that fail to parse, mostly due to escaping. This is not the case here, it parses fine and I can see the pattern is correct (ie, no double-slashes) if you print(r). It just doesn't match.
Can anyone offer some insight or an alternative regex that does not use \K?
TheFourthBird's idea is the solution. I modified the pattern by removing the \K and making the entire roman section a named group:
(?<=d|dvd|disc|disk|s|se|season|e|ep|episode)[\s._-](?<roman>M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))(?=[\s._-])
To parse it, everything as above to start but then look for the matching items like this:
for result in results {
    let nameRange = result.range(withName: "roman")
    print(nsString.substring(with: nameRange))
}
Output:
V
XI
Bingo!
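The same idea, capturing the numeral in a named group instead of relying on \K, can be sketched outside of Swift. This is a rough Python illustration under one stated assumption: Python's `re` only accepts fixed-width lookbehind, so the keyword prefix is written as a non-capturing group here rather than as the lookbehind used above. `find_numerals` is a hypothetical helper name.

```python
import re

# Assumption: the keyword prefix is a non-capturing group, because Python's `re`
# rejects a lookbehind whose alternatives have different lengths.
ROMAN = re.compile(
    r"(?:d|dvd|disc|disk|s|se|season|e|ep|episode)[\s._-]"
    r"(?P<roman>M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))"
    r"(?=[\s._-])",
    re.IGNORECASE,
)


def find_numerals(name):
    """Return the text of the 'roman' group for every match in the file name."""
    return [m.group("roman") for m in ROMAN.finditer(name)]


print(find_numerals("Series Name.disk_V.Episode_XI.Episode_name.avi"))
print(find_numerals("XIII: The Series season II episode V.mp4"))
```

Reading the named group leaves the separator out of the result without needing \K, which is the point of the modification described above.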

Elixir: How to count urls in a string

Suppose I have a string:
content = "Please visit https://www.google.com...\nOr visit http://my.website.io\nhttp://myfriends.website.com\nOr https://www.myneigborsite.com, http://visit.me.com"
There are 5 urls in the string.
How do I count the URLs?
I have tried using Regex.scan/2 |> Enum.count/1, or String.split/2 |> Enum.count/1 with a regex, but I always get the wrong output.
I have also tried every http/https regex I found on the internet, but I still can't get the correct output.
Here's one that I've tried.
iex> content
...> |> String.split(~r/^(https?):\/\/[^\s$.?#].[^\s]*$/)
...> |> Enum.count()
...> |> Kernel.-(1)
-1
Another one with the same regex..
iex> Regex.scan(~r/^(https?):\/\/[^\s$.?#].[^\s]*$/, content) |> Enum.count()
0
but when I check if the regex matches some of the urls
iex> Regex.match?(~r/^(https?):\/\/[^\s$.?#].[^\s]*$/, "https://www.google.com")
true
iex(48)> Regex.match?(~r/^(https?):\/\/[^\s$.?#].[^\s]*$/, "http://my.website.io")
true
It does match.
I can't figure out what's the problem. Please help me.
You need to only count urls, which means you don’t need an overcomplicated regular expression.
~r|https?://[\w.-]+|
|> Regex.scan(content)
|> Enum.count()
#⇒ 5
Your attempts failed because you put ^ and $, the start- and end-of-string matchers, in the expressions; these obviously cannot match when a URL sits in the middle of the string rather than spanning all of it.
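To make the effect of the anchors concrete, here is a small sketch in Python rather than Elixir, assuming the two patterns behave equivalently on this input. It runs the anchored pattern from the question and the unanchored one from the answer over the same string.

```python
import re

content = (
    "Please visit https://www.google.com...\n"
    "Or visit http://my.website.io\n"
    "http://myfriends.website.com\n"
    "Or https://www.myneigborsite.com, http://visit.me.com"
)

# Anchored pattern from the question: '^' and '$' pin the match to the whole string.
anchored = re.findall(r"^(https?)://[^\s$.?#].[^\s]*$", content)

# Unanchored pattern from the answer: matches a URL wherever it occurs.
urls = re.findall(r"https?://[\w.-]+", content)

print(len(anchored))  # 0
print(len(urls))      # 5
```

The anchored pattern still matches a string that consists of a single URL, which is why the Regex.match? checks in the question returned true.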

F#: Detecting errors in regex patterns

I am new to programming and F# is my first .NET language.
As a beginner's project, I would like to write an application asking the user to enter a regex pattern and then flagging any errors.
I have looked through the Regex API on MSDN but there doesn't seem to be any methods that would automatically detect any errors in regex patterns. Will more experienced programmers kindly share with me how they would go about accomplishing this?
Thank you in advance for your help.
If you need to check whether a regex compiles, simply use a try-with block. If you need to check whether a regex pattern matches your input string, use IsMatch() or .Success. That is quite enough.
An example with code taken from another SO post, but with an error in regex pattern where I replaced (http:\/\/\S+) with (http:\/\/\S+:
open System.Text.RegularExpressions
let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"
let matches input =
    try
        // The pattern is deliberately malformed: the closing parenthesis is missing
        Regex.Matches(input, "(http:\/\/\S+")
        |> Seq.cast<Match>
        |> Seq.groupBy (fun m -> m.Value)
        |> Seq.map (fun (value, groups) -> value, (groups |> Seq.length))
        |> Seq.toList
        |> Some
    with
    | :? System.ArgumentException as ex -> printfn "Exception! %s " (ex.Message); None
More on F# exception raising can be found here or here.
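The "try to compile it and catch the failure" approach is not specific to .NET. As a minimal sketch of the same technique in Python (not F#), the hypothetical helper `check_pattern` below returns the parser's error message for a malformed pattern and None for a valid one.

```python
import re


def check_pattern(pattern):
    """Return None if the pattern compiles, otherwise the error message."""
    try:
        re.compile(pattern)
        return None
    except re.error as ex:
        return str(ex)


print(check_pattern(r"(http:\/\/\S+)"))  # valid pattern
print(check_pattern(r"(http:\/\/\S+"))   # missing closing parenthesis
```

An application that prompts the user for a pattern could call a helper like this on each input and show the returned message whenever it is not None.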

F# Mapping Regular Expression Matches with Active Patterns

I found this useful article on using Active Patterns with Regular Expressions:
http://www.markhneedham.com/blog/2009/05/10/f-regular-expressionsactive-patterns/
The original code snippet used in the article was this:
open System.Text.RegularExpressions
let (|Match|_|) pattern input =
    let m = Regex.Match(input, pattern) in
    if m.Success then Some (List.tl [ for g in m.Groups -> g.Value ]) else None
let ContainsUrl value =
    match value with
    | Match "(http:\/\/\S+)" result -> Some(result.Head)
    | _ -> None
Which would let you know if at least one url was found and what that url was (if I understood the snippet correctly)
Then in the comment section Joel suggested this modification:
Alternative, since a given group may or may not be a successful match:
List.tail [ for g in m.Groups -> if g.Success then Some g.Value else None ]
Or maybe you give labels to your groups and you want to access them by name:
(re.GetGroupNames()
 |> Seq.map (fun n -> (n, m.Groups.[n]))
 |> Seq.filter (fun (n, g) -> g.Success)
 |> Seq.map (fun (n, g) -> (n, g.Value))
 |> Map.ofSeq)
After trying to combine all of this I came up with the following code:
let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"
let (|Match|_|) pattern input =
    let re = new Regex(pattern)
    let m = re.Match(input) in
    if m.Success then Some ((re.GetGroupNames()
                                |> Seq.map (fun n -> (n, m.Groups.[n]))
                                |> Seq.filter (fun (n, g) -> g.Success)
                                |> Seq.map (fun (n, g) -> (n, g.Value))
                                |> Map.ofSeq)) else None
let GroupMatches stringToSearch =
    match stringToSearch with
    | Match "(http:\/\/\S+)" result -> printfn "%A" result
    | _ -> ()
GroupMatches testString;;
When I run my code in an interactive session this is what is output:
map [("0", "http://www.bob.com"); ("1", "http://www.bob.com")]
The result I am trying to achieve would look something like this:
map [("http://www.bob.com", 2); ("http://www.b.com", 1); ("http://www.bill.com", 1);]
Basically a mapping of each unique match found followed by the count of the number of times that specific matching string was found in the text.
If you think I'm going down the wrong path here please feel free to suggest a completely different approach. I'm somewhat new to both Active Patterns and Regular Expressions so I have no idea where to even begin in trying to fix this.
I also came up with this which is basically what I would do in C# translated to F#.
let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"
let matches =
    let matchDictionary = new Dictionary<string,int>()
    for mtch in (Regex.Matches(testString, "(http:\/\/\S+)")) do
        for m in mtch.Captures do
            if(matchDictionary.ContainsKey(m.Value)) then
                matchDictionary.Item(m.Value) <- matchDictionary.Item(m.Value) + 1
            else
                matchDictionary.Add(m.Value, 1)
    matchDictionary
Which returns this when run:
val matches : Dictionary<string,int> = dict [("http://www.bob.com", 2); ("http://www.b.com", 1); ("http://www.bill.com", 1)]
This is basically the result I am looking for, but I'm trying to learn the functional way to do this, and I think that should include active patterns. Feel free to try to "functionalize" this if it makes more sense than my first attempt.
Thanks in advance,
Bob
Interesting stuff, I think everything you are exploring here is valid. (Partial) active patterns for regular expression matching work very well indeed. Especially when you have a string which you want to match against multiple alternative cases. The only thing I'd suggest with the more complex regex active patterns is that you give them more descriptive names, possibly building up a collection of different regex active patterns with differing purposes.
As for your C# to F# example, you can have functional solution just fine without active patterns, e.g.
let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"
let matches input =
    Regex.Matches(input, "(http:\/\/\S+)")
    |> Seq.cast<Match>
    |> Seq.groupBy (fun m -> m.Value)
    |> Seq.map (fun (value, groups) -> value, (groups |> Seq.length))
//FSI output:
> matches testString;;
val it : seq<string * int> =
  seq
    [("http://www.bob.com", 2); ("http://www.b.com", 1);
     ("http://www.bill.com", 1)]
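For comparison, the group-and-count pipeline above can be sketched in Python with a counter over the matches. This is an illustration of the same idea rather than a translation of the F# code; `count_matches` is a hypothetical name.

```python
import re
from collections import Counter

test_string = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"


def count_matches(text):
    """Map each distinct matched URL to the number of times it occurs."""
    return Counter(m.group(0) for m in re.finditer(r"http://\S+", text))


print(dict(count_matches(test_string)))
```

As in the F# version, no mutable dictionary is managed by hand: the matches are produced as a sequence and the grouping step does the counting.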
Update
The reason why this particular example works fine without active patterns is because 1) you are only testing one pattern, 2) you are dynamically processing the matches.
For a real world example of active patterns, let's consider a case where 1) we are testing multiple regexes, 2) we are testing for one regex match with multiple groups. For these scenarios, I use the following two active patterns, which are a bit more general than the first Match active pattern you showed (I do not discard first group in the match, and I return a list of the Group objects, not just their values -- one uses the compiled regex option for static regex patterns, one uses the interpreted regex option for dynamic regex patterns). Because the .NET regex API is so feature filled, what you return from your active pattern is really up to what you find useful. But returning a list of something is good, because then you can pattern match on that list.
let (|InterpretedMatch|_|) pattern input =
    if input = null then None
    else
        let m = Regex.Match(input, pattern)
        if m.Success then Some [for x in m.Groups -> x]
        else None
///Match the pattern using a cached compiled Regex
let (|CompiledMatch|_|) pattern input =
    if input = null then None
    else
        let m = Regex.Match(input, pattern, RegexOptions.Compiled)
        if m.Success then Some [for x in m.Groups -> x]
        else None
Notice also how these active patterns consider null a non-match, instead of throwing an exception.
OK, so let's say we want to parse names. We have the following requirements:
Must have first and last name
May have middle name
First, optional middle, and last name are separated by a single blank space in that order
Each part of the name may consist of any combination of one or more letters or numbers
Input may be malformed
First we'll define the following record:
type Name = {First:string; Middle:option<string>; Last:string}
Then we can use our regex active pattern quite effectively in a function for parsing a name:
let parseName name =
    match name with
    | CompiledMatch @"^(\w+) (\w+) (\w+)$" [_; first; middle; last] ->
        Some({First=first.Value; Middle=Some(middle.Value); Last=last.Value})
    | CompiledMatch @"^(\w+) (\w+)$" [_; first; last] ->
        Some({First=first.Value; Middle=None; Last=last.Value})
    | _ ->
        None
Notice one of the key advantages we gain here, which is the case with pattern matching in general, is that we are able to simultaneously test that an input matches the regex pattern, and decompose the returned list of groups if it does.
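The "test and decompose in one step" idea can be approximated in languages without active patterns by unpacking the groups of a successful match. The following is a rough Python sketch of the name parser under the requirements listed above; `Name` and `parse_name` are illustrative names, and note that `\w` in Python also accepts underscores.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Name:
    first: str
    middle: Optional[str]
    last: str


def parse_name(text):
    """Return a Name for 'First Last' or 'First Middle Last', else None."""
    if text is None:
        return None  # treat a missing input as a non-match, as the F# patterns do
    m = re.fullmatch(r"(\w+) (\w+) (\w+)", text)
    if m:
        first, middle, last = m.groups()
        return Name(first, middle, last)
    m = re.fullmatch(r"(\w+) (\w+)", text)
    if m:
        first, last = m.groups()
        return Name(first, None, last)
    return None


print(parse_name("John Q Public"))
print(parse_name("John Public"))
print(parse_name("John"))
```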

How to do Erlang pattern matching using regular expressions?

When I write Erlang programs which do text parsing, I frequently run into situations where I would love to do a pattern match using a regular expression.
For example, I wish I could do something like this, where ~ is a "made up" regular expression matching operator:
my_function(String ~ ["^[A-Za-z]+[A-Za-z0-9]*$"]) ->
....
I know about the regular expression module (re) but AFAIK you cannot call functions when pattern matching or in guards.
Also, I wish matching strings could be done in a case-insensitive way. This is handy, for example, when parsing HTTP headers, I would love to do something like this where "Str ~ {Pattern, Options}" means "Match Str against pattern Pattern using options Options":
handle_accept_language_header(Header ~ {"Accept-Language", [case_insensitive]}) ->
...
Two questions:
How do you typically handle this using just standard Erlang? Is there some mechanism / coding style which comes close to this in terms of conciseness and easiness to read?
Is there any work (an EEP?) going on in Erlang to address this?
You really don't have much choice other than to run your regexps in advance and then pattern match on the results. Here's a very simple example that approaches what I think you're after, but it does suffer from the flaw that you need to repeat the regexps twice. You could make this less painful by using a macro to define each regexp in one place.
-module(multire).
-compile(export_all).
multire([],_) ->
    nomatch;
multire([RE|RegExps],String) ->
    case re:run(String,RE,[{capture,none}]) of
        match ->
            RE;
        nomatch ->
            multire(RegExps,String)
    end.
test(Foo) ->
    test2(multire(["^Hello","world$","^....$"],Foo),Foo).
test2("^Hello",Foo) ->
    io:format("~p matched the hello pattern~n",[Foo]);
test2("world$",Foo) ->
    io:format("~p matched the world pattern~n",[Foo]);
test2("^....$",Foo) ->
    io:format("~p matched the four chars pattern~n",[Foo]);
test2(nomatch,Foo) ->
    io:format("~p failed to match~n",[Foo]).
A possibility could be to use Erlang Web-style annotations (macros) combined with the re Erlang module. An example is probably the best way to illustrate this.
This is what your final code will look like:
[...]
?MATCH({Regexp, Options}).
foo(_Args) ->
    ok.
[...]
The MATCH macro would be executed just before your foo function. The flow of execution will fail if the regexp pattern is not matched.
Your match function will be declared as follows:
?BEFORE.
match({Regexp, Options}, TgtMod, TgtFun, TgtFunArgs) ->
    String = proplists:get_value(string, TgtFunArgs),
    case re:run(String, Regexp, Options) of
        nomatch ->
            {error, {TgtMod, match_error, []}};
        {match, _Captured} ->
            {proceed, TgtFunArgs}
    end.
Please note that:
The BEFORE says that macro will be executed before your target function (AFTER macro is also available).
The match_error is your error handler, specified in your module, and contains the code you want to execute if you fail a match (maybe nothing, just block the execution flow)
This approach has the advantage of keeping the regexp syntax and options uniform with the re module (avoid confusion).
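The "check a regexp before the target function runs" idea is close to what a decorator does in other languages. As a loose analogy only, here is a Python sketch; it does not use or imitate the Erlang Web annotation engine, and `require_match` and `MatchError` are hypothetical names.

```python
import functools
import re


class MatchError(Exception):
    """Raised when the guarded argument does not match (hypothetical error type)."""


def require_match(pattern, flags=0):
    """Run the regexp check before the wrapped function, like a BEFORE annotation."""
    compiled = re.compile(pattern, flags)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(string, *args, **kwargs):
            if compiled.search(string) is None:
                # Block the execution flow when the pattern is not matched.
                raise MatchError(f"{string!r} does not match {pattern!r}")
            return func(string, *args, **kwargs)

        return wrapper

    return decorator


@require_match(r"^Accept-Language", re.IGNORECASE)
def handle_accept_language_header(header):
    return "ok"


print(handle_accept_language_header("accept-language: en"))
```

The pattern and its options stay in the form the regex library expects, which mirrors the uniformity argument made above for the re module.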
More information about the Erlang Web annotations here:
http://wiki.erlang-web.org/Annotations
and here:
http://wiki.erlang-web.org/HowTo/CreateAnnotation
The software is open source, so you might want to reuse their annotation engine.
You can use the re module:
re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$").
re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$", [caseless]).
EDIT:
match(String, Regexps) ->
    case lists:dropwhile(
           fun({Regexp, Opts}) -> re:run(String, Regexp, Opts) =:= nomatch;
              (Regexp) -> re:run(String, Regexp) =:= nomatch end,
           Regexps) of
        [R|_] -> R;
        _ -> nomatch
    end.
example(String) ->
    Regexps = ["$RE1^", {"$RE2^", [caseless]}, "$RE3"],
    case match(String, Regexps) of
        nomatch -> handle_error();
        Regexp -> handle_regexp(String, Regexp)
    ...
For strings, you could use the 're' module: afterwards, you iterate over the result set. I am afraid there isn't another way to do it AFAIK: that's why there are regexes.
For the HTTP headers, since there can be many, I would consider iterating over the result set to be a better option than writing a (potentially) very long expression.
EEP work: I do not know.
Erlang does not handle regular expressions in patterns.
No.
You can't pattern match on regular expressions, sorry. So you have to do
my_function(String) -> Matches = re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$"),
...