How to check String equality in Google Truth assertions?

Truth.assertThat(actual).matches(expected) or Truth.assertThat(actual).isEqualTo(expected) ?
The docs say that the matches() method takes a String in the form of a regex, but I'm not sure whether a plain string literal works as well. That's what got me confused.

It sounds like you want isEqualTo(expected), which performs an exact equality assertion.
As you say, matches accepts a regex, which lets you do things like assertThat("foo").matches("f.*"). But regexes can interfere with exact matching. For example, assertThat("$5").matches("$5") will fail because the $ in the regex means "end of string." But assertThat("$5").isEqualTo("$5") will pass.
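The same pitfall can be reproduced outside Truth. Here is a minimal sketch in Python, assuming that re.fullmatch is a fair stand-in for a whole-string matches() check; the helper name matches is made up for the illustration and is not part of Truth.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: whole-string regex match, like a matches() check."""
    return re.fullmatch(pattern, text) is not None

actual = "$5"

# Exact equality involves no metacharacters at all.
equal = (actual == "$5")

# As a regex, "$" is an end-of-string anchor, so the pattern "$5" cannot match "$5".
as_regex = matches("$5", actual)

# Escaping the literal yields a regex that matches only that exact text.
escaped = matches(re.escape("$5"), actual)

# A genuine regex use case, as in the answer above.
wildcard = matches("f.*", "foo")

print(equal, as_regex, escaped, wildcard)
```

If you only have a literal and must go through a regex API, escaping it first is the usual workaround, but a plain equality check says what you mean more directly.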

Related

Difference between non-greedy search and negated character set

Is there any difference between these two regular expression patterns (assuming single-line mode is enabled): a.*?b and a[^b]*b ? What about in terms of performance?
a.*?b has to check at each consumed character if it matches the pattern (i.e. if the next one is a b). This is called backtracking.
With the string a12b the execution would look like this:
Consume a
Consume the following 0 characters. Is the next one a b? No.
Consume the following character (a1). Is the next one a b? No.
Consume the following character (a12). Is the next one a b? Yes!
Consume b
Match
a[^b]*b consumes anything that isn't a b without asking itself questions and is much faster for longer strings because of that.
With the string a12b the execution would look like this:
Consume a
Consume anything that follows that isn't a b. (a12)
Consume b
Match
RegexHero has a benchmark feature that will demonstrate that with the .NET regex engine.
Other than the performance difference, they match the same strings in your example.
However, there are situations where the two differ. In the string aa111b111b, (?<=aa.*?)b matches both b's, while (?<=aa[^b]*)b matches only the first one.
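To compare the two patterns yourself, here is a small sketch using Python's re module. It is an assumption that Python behaves like the engine in the question; re.DOTALL stands in for "single-line mode", and Python's re does not support the variable-length lookbehinds used above, so the last pair is a different illustration of my own where extra pattern text follows the b.

```python
import re

LAZY = re.compile(r"a.*?b", re.DOTALL)   # non-greedy dot, single-line mode
NEGATED = re.compile(r"a[^b]*b")         # negated character set

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

# On their own, the two patterns find the same first match.
for text in ["a12b", "xxa12bzzb", "ab", "a\nb", "a12"]:
    print(repr(text), first_match(LAZY, text), first_match(NEGATED, text))

# With more pattern after the b, they can diverge: the lazy dot may
# step over a b, while [^b]* never can.
lazy_c = re.search(r"a.*?bc", "abbc", re.DOTALL)
negated_c = re.search(r"a[^b]*bc", "abbc")
print(lazy_c.group(0) if lazy_c else None, negated_c)
```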
I have tested both of your regexes, naming them:
NONGREEDY = /a.*?b/;
GREEDY = /a[^b]*b/;
I named the negated-set regex GREEDY, but that is just a name.
You can check test-non-greedy-vs-greedy-performance on JsPerf and run the tests to see for yourself. Feel free to modify the string to try different test cases.
You can also look at the different tests that others have added; the benchmark results vary depending on the input string.
The test below is for the string: ab
The test below is for the string: axb
The test below is for the string: afdkjsklfjsdlkfjsdlkfjsdlkjflskdjflsdfjjflksdjfb
After these tests, the performance seems to vary depending on the string you are parsing.
I hope this test helps answer the question.

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.
Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234
The reason you haven't found anything is probably that this is a problem of serious complexity, given the number of combinations certain expressions would allow. Some regular expressions could even allow infinitely many matches.
Consider the following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest using a more naive approach than a regular expression.
For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches is not limited.
One method for locating all the possibilities is to detect where in the regular expression there are choices. For each possible choice, generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method will proceed as follows:
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"
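The expansion above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: it handles nothing but alphanumeric literals and simple sets like [BC] (no ranges, no negation, no quantifiers), and it uses itertools.product instead of the recursive rewriting described above, which gives the same result for this subset. The function name get_all_matches just mirrors the pseudocode.

```python
import itertools
import re

# One token is either a simple set like [BC] or a single alphanumeric literal.
TOKEN = re.compile(r"\[([A-Za-z0-9]+)\]|([A-Za-z0-9])")

def get_all_matches(pattern: str) -> list:
    """Expand a pattern made of literals and simple sets into all its matches."""
    choices = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            # Anything outside the tiny supported subset is rejected.
            raise ValueError(f"unsupported syntax at position {pos}")
        choices.append(m.group(1) or m.group(2))
        pos = m.end()
    # Pick one character from each position, in every possible combination.
    return ["".join(combo) for combo in itertools.product(*choices)]

print(get_all_matches("A[BC][DE]F"))
print(get_all_matches("AB[CD]1234"))
```

Rejecting everything outside the supported subset is deliberate: it keeps the set of matches finite by construction.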
It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.
Well, you could convert the regular expression into an equivalent finite state machine (this is relatively simple and can be done algorithmically) and then recursively follow every possible path through that FSM, outputting the followed paths through the machine. It's neither very hard nor computationally intensive per output (you will normally get a HUGE amount of output, however). You should, however, take care to disallow potentially infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted.
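A rough sketch of that traversal in Python, assuming the state machine has already been built. Converting a regex into the machine is not shown; the transition table below is written by hand for A[BC]*Z, and the names enumerate_matches and TRANSITIONS are made up for the example.

```python
def enumerate_matches(transitions, start, accepting, max_len):
    """Yield every accepted string of length <= max_len by depth-first search."""
    def walk(state, prefix):
        if state in accepting:
            yield prefix
        if len(prefix) == max_len:
            return  # abort this path: the length cap stops loops like [BC]*
        for symbol, next_state in transitions.get(state, {}).items():
            yield from walk(next_state, prefix + symbol)
    yield from walk(start, "")

# Hand-built machine for A[BC]*Z: state 0 is the start, state 2 is accepting.
TRANSITIONS = {
    0: {"A": 1},
    1: {"B": 1, "C": 1, "Z": 2},
}

results = sorted(enumerate_matches(TRANSITIONS, 0, {2}, max_len=4))
print(results)
```

Even with this tiny machine the output grows quickly as max_len increases, which is the "HUGE amount of output" caveat in practice.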
A regular expression is intended to do nothing more than match a pattern. That being said, the regular expression will never 'list' anything, only match. If you want to get a list of all matches, I believe you will need to do it on your own.
Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?
It may be possible to find some code to list all possible matches for something as simple as what you are doing. But for most regular expressions you would not even want to attempt listing all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.
I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/

Meaning of "match" as related to Regular Expressions

I'm writing a term paper on regular expressions and I'm a bit confused regarding the way one uses the word "match" when referring to regexes. Which of the following is the correct wording to use:
"The regular expression matches the string"
or
"The string matches the regular expression"
Or are they both correct? All opinions on this are welcome! I really want to get this right and I think it would help my understanding greatly to get this clarified.
I think both are correct. It depends on what you're focusing on. If your focus is on the regular expression itself, to see whether it works on a given string or set of strings, then use the first sentence. Conversely, if you are more interested in a set of strings that match certain criteria, the second one applies. A match implies some equivalence under certain conditions, so both sentences sound equivalent to me.
The string is being matched to the regular expression pattern, so I would say the latter is more accurate.
When two things match, it is (from a logical perspective at least) irrelevant in which order you mention them.
So it depends on what you want to put focus on.
The string matches the regular expression: Focus is on the string.
The regular expression matches the string: Focus is on the regex.
The latter sounds better to me. The regex specifies a pattern that the string may match. But there's nothing really wrong with either.
If you said either one to me, I would understand what you're saying. I'm sure people have said both to me, and I never thought either one needed to be corrected.
I agree that the string matches (or not) the regular expression. To make it clear why I'd say: the regular expression defines a grammar, and a given string is either well-formed according to that grammar or not.
"The regular expression matches the string"
True if the RE matches the whole string (eg. using ^ $ or just happening to match everything). Otherwise, I would write: the regular expression has match(es) in the string.
"The string matches the regular expression"
Again, true if the regex matches everything, otherwise it sounds a bit odd.
But indeed, in the case of a whole match, the two sentences are equivalent.
Since you're looking for a regular expression within a string, it's more correct to say that you've found the regular expression since that's a one-way relationship.
But as to which matches which, that's a two-way relationship and it doesn't really matter (in English, anyway; I can't vouch for other languages), so either would be correct.
My preference would be to say that the string matches the regular expression, since the RE is the invariant part and the string changes. But that's a personal preference and is unlikely to have any bearing on reality :-)
"The string matches the regular expression" seems to be shorthand for "the string is in the language defined by and isomorphic to the regular expression."
"The regular expression matches the string" seems to be shorthand for "a parser automaton compiled from the regular expression will parse the string and halt in a final state."
I'd say:
At design time a user/developer creates a regular expression that matches a string.
At run time a regular expression engine finds a string that matches the regular expression.
(Not intended to be a definition, just an example of common usage.)
Since a regular expression represents a possibly infinite set of finite strings, I would say that it is most correct to write that "string s matches regular expression r". You could also say that "string s is a member of the set generated by regular expression r".
Also, you should consider using the words accept and reject, especially if you intend to discuss finite automata in your paper.

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
but this doesn't seem to work. If I have the regex
/[a|b]./
the string "abc" returns false with both my regex and the converse suggested by zijab:
/^((?!^[a|b].).)*$/
Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression))
// do something...
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
      ^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
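Python's re module exposes both behaviours, which makes the difference easy to see. This is a sketch by analogy only: re.search is assumed to play the role of the Perl-style unanchored match, and re.fullmatch the role of Java's matches().

```python
import re

pattern = re.compile(r"[ab].")

# Unanchored: succeeds because "ab" occurs somewhere inside "abc".
found_anywhere = pattern.search("abc") is not None

# Anchored at both ends: fails because "abc" is three characters long.
whole_abc = pattern.fullmatch("abc") is not None

# Anchored at both ends: succeeds on a string of exactly the right shape.
whole_ab = pattern.fullmatch("ab") is not None

print(found_anywhere, whole_abc, whole_ab)
```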
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
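Here is a small Python sketch of that construction, checked against a plain "no match found" test. The helper name invert is made up, and the sketch assumes a simple pattern and input without line breaks; as noted above, it should not be expected to work for every regex.

```python
import re

def invert(pattern: str) -> str:
    """Wrap a pattern so the result matches only strings that do NOT contain it."""
    # Before consuming each character, assert that no match starts here.
    return rf"^(?:(?!{pattern}).)*$"

original = r"[ab]."
inverted = invert(original)
print(inverted)

for text in ["abc", "xyz", "", "xa", "ba", "b"]:
    contains = re.search(original, text) is not None
    accepted = re.search(inverted, text) is not None
    # The inverted regex should accept exactly the strings the original rejects.
    print(repr(text), contains, accepted)
```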
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
You want to match the literal foo. An expression that matches anything except a string containing foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three-character string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together this may work:
^[^fo]*(f+($|[^o]|o($|[^fo]*)))*$
But note: this applies only to foo.
You can also do this in Python by using re.split and splitting on your regular expression, which returns all the parts that don't match the regex (see "how to find the converse of a regex").
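A minimal example of the re.split approach; the sample text and the pattern are made up for the illustration.

```python
import re

text = "ab12cd345ef"

# Splitting on the pattern leaves the pieces the pattern did not match.
parts = re.split(r"\d+", text)
print(parts)  # ['ab', 'cd', 'ef']

# A match at the very start or end produces an empty string at that edge.
edge_parts = re.split(r"\d+", "12ab")
print(edge_parts)  # ['', 'ab']
```

Note that if the pattern contains capturing groups, re.split also returns the captured text, so use non-capturing groups when you only want the non-matching parts.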
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java regexes have an interesting way of doing this: create a possessive optional match for the string you want to exclude, and then match data after it. If the optional match fails, it doesn't matter because it's optional. If it succeeds, the second expression still needs some extra data to match, and so the overall match fails.
It looks counter-intuitive, but works.
E.g., (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but couldn't get it to work myself (they seem more willing to backtrack if the second match fails?)

Regular expression that rejects all input?

Is it possible to construct a regular expression that rejects all input strings?
Probably this:
[^\w\W]
\w - word character (letter, digit, etc)
\W - opposite of \w
[^\w\W] - should always fail, because any character should belong to one of the character classes - \w or \W
Other snippets:
$.^
$ - assert position at the end of the string
^ - assert position at the start of the line
. - any char
(?#it's just a comment inside of empty regex)
Empty lookahead/behind should work:
(?<!)
The best standard regexes (i.e., no lookahead or back-references) that reject all inputs are (after @aku above)
.^
and
$.
These are flat contradictions: "a string with a character before its beginning" and "a string with a character after its end."
NOTE: It's possible that some regex implementations would reject these patterns as ill-formed (it's pretty easy to check that ^ comes at the beginning of a pattern and $ at the end... with a regular expression), but the few I've checked do accept them. These also won't work in implementations that allow ^ and $ to match newlines.
(?=not)possible
?= is a positive lookahead. They're not supported in all regexp flavors, but in many.
The expression will look for "not", then look for "possible" starting at the same position (since lookaheads don't move forward in the string).
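A quick way to try several of the candidates from this thread is to run them against a handful of sample strings. The sketch below uses Python's re module, so it only says something about that flavor; the sample list is arbitrary and passing it is evidence, not proof. The last lines illustrate the caveat about ^ matching at newlines, using re.MULTILINE together with re.DOTALL.

```python
import re

CANDIDATES = [r"[^\w\W]", r"$.^", r".^", r"$.", r"(?=not)possible"]
SAMPLES = ["", "a", "not possible", "notpossible", "line1\nline2", "$.^", "trailing\n"]

def rejects_all(pattern, samples, flags=0):
    """True if the pattern finds no match anywhere in any of the samples."""
    return not any(re.search(pattern, s, flags) for s in samples)

for pattern in CANDIDATES:
    print(pattern, rejects_all(pattern, SAMPLES))

# Once ^ may match after a newline and . may match the newline itself,
# the pattern .^ stops being a contradiction.
multiline_hit = re.search(r".^", "a\nb", re.MULTILINE | re.DOTALL)
print(multiline_hit is not None)
```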
One example of why such a thing could be needed is when you want to filter some input with regexes and you pass the regex as an argument to a function.
In the spirit of functional programming, for algebraic completeness, you may want some trivial primitive regexes like "everything is allowed" and "nothing is allowed".
To me it sounds like you're attacking the problem the wrong way. What exactly are you trying to solve?
You could do a regular expression that catches everything and negate the result.
E.g., in JavaScript:
if (! str.match( /./ ))
but then you could just do
if (!foo)
instead, as @jan-hani said.
If you're looking to embed such a regex in another regex, you might be looking for $ or ^ instead, or use lookaheads like @henrik-n mentioned.
But as I said, this looks like a "I think I need x, but what I really need is y" problem.
Why would you even want that? Wouldn't a simple if statement do the trick? Something along the lines of:
if ( inputString != "" )
    doSomething ()
[^\x00-\xFF]
It depends on what you mean by "regular expression". Do you mean regexps in a particular programming language or library? In that case the answer is probably yes, and you can refer to any of the above replies.
If you mean the regular expressions as taught in computer science classes, then the answer is no. Every regular expression matches some string. It could be the empty string, but it always matches something.
In any case, I suggest you edit the title of your question to narrow down what kinds of regular expressions you mean.
[^]+ should do it.
In answer to aku's comment attached to this, I tested it with an online regex tester (http://www.regextester.com/), and so assume it works with JavaScript. I have to confess to not testing it in "real" code. ;)
EDIT:
[^\n\r\w\s]
Well,
I am not sure I understood, since I always thought of regular expressions as a way to match strings. I would say the best shot you have is not using a regex.
But you can also use a regexp that matches empty lines, like ^$, or a regexp that does not match words/spaces, like [^\w\s] ...
Hope it helps!