So I did a problem earlier that said:
L(r) = {w in {a,b}* : w contains at least 2 a's}
For that one I answered {a^2n, b}, because that guarantees a string like aab or aabaab, etc. I'm not sure how to approach the one I posted about in the title. A possible solution might be a^2n b^2m, so the total is always even, but two odd exponents, as in a^n b^3m, also always give an even total. Am I allowed to set bounds like n >= m?
Thank you!
You correctly observe that n and m must either be both even or both odd. It only needs to be added that an odd number is one more than an even number.
A simple regular expression for "an even number of a's" ({a^2n : n ≥ 0}) is (aa)*, while "an odd number of a's" is (aa)*a.
Building on that, we can write two cases for the original question: (aa)*(bb)* and (aa)*a(bb)*b, which can be combined into (aa)*(ab+ε)(bb)*. (Assuming you are using + for alternation and ε for the empty string.)
r = ((a+b)^2)*. I think this regular expression also gives the right answer.
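Both proposed expressions can be checked by brute force. The sketch below assumes the target language is {a^n b^m : n + m even}, which is inferred from the answer above rather than stated in the thread. The helper names `in_language` and `disagreements` are hypothetical, and Python's `|` and `?` stand in for + and ε.

```python
import re
from itertools import product

COMBINED = re.compile(r"(aa)*(ab)?(bb)*")    # (aa)*(ab+ε)(bb)*
EVEN_LENGTH = re.compile(r"((a|b)(a|b))*")   # ((a+b)^2)*


def in_language(w):
    """True if w = a^n b^m with n + m even (the assumed target language)."""
    n = len(w) - len(w.lstrip("a"))          # number of leading a's
    rest = w[n:]
    return set(rest) <= {"b"} and len(w) % 2 == 0


def disagreements(pattern, max_len=8):
    """Strings over {a, b} where the pattern and in_language differ."""
    bad = []
    for length in range(max_len + 1):
        for letters in product("ab", repeat=length):
            w = "".join(letters)
            if bool(pattern.fullmatch(w)) != in_language(w):
                bad.append(w)
    return bad


print(disagreements(COMBINED))
print(disagreements(EVEN_LENGTH)[:3])
```

Under that assumption, the combined expression agrees with the language on every string up to the tested length, while ((a+b)^2)* also accepts strings such as ba, where a b comes before an a. A bounded search like this is evidence, not a proof.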
I need to find the word "Judgment" or "Judgement" or "JUDGMENT" or "JUDGEMENT" or "J U D G M E N T" from a document or any permutation/combination of those characters in both upper/lower cases (in that particular order). Is there a regex function that could help me out?
The problem is, I am applying the code to different documents and every document contains a different form of that word. My code needs to recognize the word in all instances.
I used your question itself as the test string, because it contains all the variants you listed; try the pattern on your other variants too. Leave a comment if this regex does not work.
>>> import re
>>>
>>> pattern = re.compile(r'(j\s*u\s*d\s*g[em\s]*n\s*t)', re.IGNORECASE)
>>> string = """I need to find the word "Judgment" or "Judgement" or "JUDGMENT" or "JUDGEMENT" or "J U D G M E N T" from a document or any permutation/combination of those characters in both upper/lower cases (in that particular order). Is there a regex function that could help me out? The problem is, I am applying the code to different documents and every document contains a different form of that word. My code needs to recognize the word in all instances."""
>>>
>>> pattern.findall(string)
['Judgment', 'Judgement', 'JUDGMENT', 'JUDGEMENT', 'J U D G M E N T']
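The pattern above accepts any mix of e, m and whitespace between the g and the n. If a tighter match is wanted, one option (a sketch, not the answerer's method) is to spell out the optional e and allow whitespace between every pair of letters. The names `spaced` and `JUDGMENT` are hypothetical.

```python
import re


def spaced(word):
    """Join the letters of word with optional whitespace between them."""
    return r"\s*".join(re.escape(ch) for ch in word)


# j u d g, an optional e, then m e n t, with optional whitespace throughout.
JUDGMENT = re.compile(
    r"\b" + spaced("judg") + r"\s*(?:e\s*)?" + spaced("ment") + r"\b",
    re.IGNORECASE,
)

text = 'Judgment, Judgement, JUDGMENT, JUDGEMENT, "J U D G M E N T", judge, gem'
print(JUDGMENT.findall(text))
```

Because the pattern has no capturing groups, `findall` returns the full matched text for each hit.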
You would probably want to preprocess your text data first. Otherwise a single regular expression covering every variant may be impractical, given its time complexity, if it is possible at all.
Permutations might be feasible, since the order of the letters stays the same; combinations would be much more complicated, and would also match words such as get, gem, Meg, and many others.
If you want a very loose lower bound, this expression might be worth looking into:
\b([judgment\s]+)\b
It fails by over-matching: any run made only of those letters and whitespace matches, including words such as gem or tent.
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
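To see the over-matching concretely, here is a small Python check of that loose expression. The candidate words are made-up samples, and case-insensitive matching is an assumption based on the question.

```python
import re

# The loose expression from the answer, matched case-insensitively.
LOOSE = re.compile(r"\b([judgment\s]+)\b", re.IGNORECASE)

# Intended spellings and unrelated words built from the same letters.
candidates = ["Judgment", "J U D G M E N T", "gem", "tent", "Meg"]

for candidate in candidates:
    print(candidate, bool(LOOSE.fullmatch(candidate)))
```

Every candidate is accepted, which is why this expression is only useful as a first, permissive filter.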
I'm trying to understand the equivalence between regular expressions α and β defined below, but I'm losing my mind over conflicting information.
a+b: a or b
ab: concatenation of a and b
$: empty string
α = (1*+0)+(1*+0)(0+1)*($+0+1)
β = (1*+0)(0+1)*($+0+1)
https://ivanzuzak.info/noam/webapps/regex_simplifier/ says, that α is equivalent to β.
My school however teaches that concatenation has stronger binding than union, meaning that:
11*+0 =/= 1(1*+0)
Which would mean that my α looks like this with parentheses:
α = (1*+0) + ( (1*+0)(0+1)*($+0+1) )
and that
α =/= ( (1*+0) + (1*+0) ) (0+1)*($+0+1)
I hope it's clear what my problem is, I'd appreciate any kind of help. Thanks.
Usually, two regular expressions are considered equivalent when they match the same set of words.
How they match it is not relevant. Therefore it doesn't matter which of the operators has greater precedence.
Note the subtle difference between being equal (in written form) and being equivalent (having the same effect).
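Alright, it turns out that I had misunderstood why b+b <=> b.
It's because L1 ∪ L2 = L2 if L1 is a subset of L2. Here (1*+0) is a subset of (1*+0)(0+1)*($+0+1), since the trailing factors can both match the empty string, so the union in α collapses to β.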
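The equivalence can also be checked by brute force over short strings. In this Python sketch, `|` replaces +, an optional group replaces ($+0+1), and `first_difference` is a hypothetical helper. Python gives concatenation higher precedence than alternation, which matches the reading taught in the question.

```python
import re
from itertools import product

ALPHA = re.compile(r"(1*|0)|(1*|0)(0|1)*(0|1)?")
BETA = re.compile(r"(1*|0)(0|1)*(0|1)?")

# The precedence example from the question: 11*+0 versus 1(1*+0).
UNION_LAST = re.compile(r"11*|0")
GROUPED = re.compile(r"1(1*|0)")


def first_difference(p, q, alphabet="01", max_len=8):
    """Shortest string matched by exactly one of p and q, or None."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            w = "".join(letters)
            if bool(p.fullmatch(w)) != bool(q.fullmatch(w)):
                return w
    return None


print(first_difference(ALPHA, BETA))
print(first_difference(UNION_LAST, GROUPED))
```

No difference is found between α and β up to the tested length, while 11*+0 and 1(1*+0) already differ on the string 0. So precedence does decide which language an expression denotes, even though two differently structured expressions can denote the same one.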
What is the regular expression that generates the language where every odd position in the string is an a? (Please answer with the shortest possible regex: minimal parentheses, no spaces and any piped strings in alphabetical order!)
I assume I'm working with only a's and b's.
(a(a|b))+ would only cover even strings: a(a|b), a(a|b)a(a|b), etc.
How do I also cover the case that the string is odd? ex: a(a|b)a
Note: not using programming syntax
Edit: some valid strings would be: a, aa, aaa, aaaa, aaaaa, ab, aba, abab, ababa, etc.
EDIT: Solution
My instructor gave the answer (aa|ab)*. This is incorrect because it misses case(s), for example "a".
I think this suits your requirement:
^a(.a)*.?$
Position 1 must be "a": ^a
Repetitions of any character + a, making a sequence where odds are "a"'s: (.a)*
Allowing for a termination not ending in "a", ex abab: .?$
You can check it here: regex101
^(a.)*a?$
Allows empty values ("")
From start to end of line (^...$)
Every odd place (1,3,5,...) is an a followed by any letter/number
There may or may not be an a in the end
One character shorter than Jorge's answer, but allows empty values
See regex101 example here
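The two expressions above, and the instructor's (aa|ab)*, can be compared against the definition directly. This is a sketch that assumes the alphabet is {a, b}; `odd_positions_are_a` and `words` are hypothetical helper names.

```python
import re
from itertools import product

NON_EMPTY = re.compile(r"a(.a)*.?")    # ^a(.a)*.?$
WITH_EMPTY = re.compile(r"(a.)*a?")    # ^(a.)*a?$
INSTRUCTOR = re.compile(r"(aa|ab)*")   # the instructor's answer


def odd_positions_are_a(w):
    """True if every odd (1-based) position of w holds an a."""
    return all(ch == "a" for ch in w[0::2])


def words(max_len=7):
    """All strings over {a, b} up to max_len, shortest first."""
    for length in range(max_len + 1):
        for letters in product("ab", repeat=length):
            yield "".join(letters)


# Strings that satisfy the definition but are rejected by the instructor's answer.
missed = [w for w in words() if odd_positions_are_a(w) and not INSTRUCTOR.fullmatch(w)]
print(missed[:5])
```

The instructor's expression only produces even-length strings, so every odd-length string in the language, starting with a, shows up in `missed`.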
I think this might help you
Case 1: even length strings (1(0+1))*
Case 2: odd length strings 1((0+1)1)*
Finally the answer is Case 1 + Case 2
I'm writing a small function in R as follows:
tags.out <- as.character(tags.out)
tags.out.unique <- unique(tags.out)
z <- NROW(tags.out.unique)
for (i in 1:10) {
l <- length(grep(tags.out.unique[i], x = tags.out))
tags.count <- append(x = tags.count, values = l) }
Basically I'm looking to take each element of the unique character vector (tags.out.unique) and count its occurrences in the vector prior to the unique function.
This above section of code works correctly, however, when I replace for (i in 1:10) with for (i in 1:z) or even some number larger than 10 (18000 for example) I get the following error:
Error in grep(tags.out.unique[i], x = tags.out) :
invalid regular expression 'c++', reason 'Invalid use of repetition operators'
I would be extremely grateful if anyone were able to help me understand what's going on here.
Many thanks.
The "+" in "c++" (which you're passing to grep as a pattern string) has a special meaning. However, you want the "+" to be interpreted literally as the character "+", so instead of
grep(pattern="c++", x="this string contains c++")
you should do
grep(pattern="c++", x="this string contains c++", fixed=TRUE)
If you google [regex special characters] or something similar, you'll see that "+", "*" and many others have a special meaning. In your case you want them to be interpreted literally -- see ?grep.
It would appear that one of the elements of tags.out.unique is c++, which is (as the error message plainly states) an invalid regular expression.
Your current approach is also inefficient. The R Inferno is worth a read, noting especially that growing objects is generally bad form; it can be extremely inefficient in some cases. If you are going to have a blanket rule, then "not growing objects" is a better one than "avoid loops".
Given that you are simply trying to count the number of times each value occurs, there is no need for the loop or the regex:
counts <- table(tags.out)
# the unique values
names(counts)
should give you the results you want.
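For readers working in Python rather than R, the same two ideas (tabulate exact values, or search for the tag literally) can be sketched as follows. This is an analogue, not a translation of the code above: the `tags` list is made-up sample data, `collections.Counter` plays the role of `table()`, and `re.escape` plays the role of `fixed=TRUE`.

```python
import re
from collections import Counter

# Made-up sample data standing in for tags.out.
tags = ["c++", "r", "c++", "regex", "c", "r"]

# Exact-value counts: no loop and no regex needed.
counts = Counter(tags)

# Literal substring search: re.escape neutralises '+', so the pattern
# means the three characters c, +, + rather than a repetition operator.
literal = re.compile(re.escape("c++"))
containing_cpp = [t for t in tags if literal.search(t)]

# Substring search is not the same as counting exact values:
# "r" also occurs inside "regex".
containing_r = [t for t in tags if "r" in t]

print(counts["c++"], containing_cpp)
print(counts["r"], containing_r)
```

The last two lines show why pattern-based counting can overcount: a search for r also hits regex, whereas the tabulated count only includes exact matches.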
a) Start and end with a number
b) Hyphen should start and end with a number
c) Comma should start and end with a number
d) Range of number should be from 1-31
[Edit: Need this rule in the regex, thanks Ed-Heal!]
e) If a number starts with a hyphen (-), it cannot end with any other character other than a comma AND follow all rules listed above.
E.g. 2-2,1 OR 2,2-1 is valid while 1-1-1-1 is not valid
E.g.
a) 1-5,5,15-29
b) 1,28,1-31,15
c) 15,25,3 [Edit: Replaced 56 with 3, thanks for pointing it out Brian!]
d) 1-24,5-6,2-9
Tried this but it passes even if the string starts with a comma:
/^[0-9]*(?:-[0-9]+)*(?:,[0-9]+)*$/
How about this? This will check rules a, b and c, at least, but does not check rule d.
/^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$/
If you need to ensure that all the numbers are in the range 1-31, then the expression will get a whole lot uglier:
/^([1-9]|[12][0-9]|3[01])(-([1-9]|[12][0-9]|3[01]))?(,([1-9]|[12][0-9]|3[01])(-([1-9]|[12][0-9]|3[01]))?)*$/
Note that your example c contains a number, 56, that does not fall within the range 1-31, so it will not pass the second expression.
try this
^\d+(-\d+)?(,\d+(-\d+)?)*$
Here are my workings
Numbers:
0|([1-9][0-9]*) (call this expression A). Note that this expression treats zero as a special case and rejects numbers with leading zeros, e.g. 0000001234.
Number or a range:
A|(A-A) (call this expression B), i.e. (0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*)))
Comma operator
B(,B)*
Putting this together should do the trick, and we get
((0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*))))(,((0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*)))))*
You can abbreviate this with \d for [0-9]
The other approaches have not restricted the allowed range of numbers. This allows 1 through 31 only, and seems simpler than some of the monstrosities people have come up with ...
^([12][0-9]?|3[01]?|[4-9])([-,]([12][0-9]?|3[01]?|[4-9]))*$
There is no check for sensible ranges; adding that would make the expression significantly more complex. In the end you might be better off with a simpler regex and implementing sanity checks in code.
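A minimal sketch of that split, in Python: a simple regex checks the shape (numbers, optional ranges, comma separators), and plain code checks the 1-31 bound. The function name `is_valid_day_list` and its `low`/`high` parameters are hypothetical. Reversed ranges such as 2-1 are accepted, because the question lists 2,2-1 as valid.

```python
import re

# Shape only: number or number-number, separated by commas.
SHAPE = re.compile(r"[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*")


def is_valid_day_list(text, low=1, high=31):
    """Check the shape with a regex, then the numeric bounds in code."""
    if not SHAPE.fullmatch(text):
        return False
    for item in text.split(","):
        # Each item is either "n" or "n-m" once the shape check has passed.
        for part in item.split("-"):
            if not low <= int(part) <= high:
                return False
    return True


for sample in ["1-5,5,15-29", "1,28,1-31,15", "15,25,3", "1-1-1-1", ",1", "15,25,56"]:
    print(sample, is_valid_day_list(sample))
```

Note that this sketch accepts leading zeros (01 is read as 1); tightening the shape regex to [1-9][0-9]* would reject them.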
I propose the following regex:
(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$
It looks awful, but it isn't :) The construction (?<name>...){0} defines a named subpattern without matching anything at the point where it is defined. So I defined a pattern for numbers called number, and a pattern called thing for what is either a range or a single number. Since your expression is a comma-separated sequence of those things, I use \g<thing> to build it, which gives (\g<thing>,)*\g<thing>. That's easy to read and understand. If you enable extended mode (the x flag in Ruby), where whitespace in the pattern is not significant, you could even lay it out like this:
(?<number>[1-9]|[12]\d|3[01]){0}
(?<thing>\g<number>-\g<number>|\g<number>){0}
^(\g<thing>,)*\g<thing>$
I tested it with Ruby 1.9.2. Your regex engine should support named groups to allow that kind of clarity.
irb(main):001:0> s1 = '1-5,5,15-29'
=> "1-5,5,15-29"
irb(main):002:0> s2 = '1,28,1-31,15'
=> "1,28,1-31,15"
irb(main):003:0> s3 = '15,25,3'
=> "15,25,3"
irb(main):004:0> s4 = '1-24,5-6,2-9'
=> "1-24,5-6,2-9"
irb(main):005:0> r = /(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$/
=> /(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$/
irb(main):006:0> s1.match(r)
=> #<MatchData "1-5,5,15-29" number:"29" thing:"15-29">
irb(main):007:0> s2.match(r)
=> #<MatchData "1,28,1-31,15" number:"15" thing:"15">
irb(main):008:0> s3.match(r)
=> #<MatchData "15,25,3" number:"3" thing:"3">
irb(main):009:0> s4.match(r)
=> #<MatchData "1-24,5-6,2-9" number:"9" thing:"2-9">
irb(main):010:0> '1-1-1-1'.match(r)
=> nil
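Python's built-in re module has no \g<name> subroutine calls, but the same readability can be approximated by composing the pattern from named string fragments. This is a sketch of that alternative, not the Ruby technique itself; `NUMBER`, `THING` and `DAY_LIST` are hypothetical names.

```python
import re

# A day number from 1 to 31.
NUMBER = r"(?:[1-9]|[12][0-9]|3[01])"
# A "thing" is either a range or a single number.
THING = rf"(?:{NUMBER}-{NUMBER}|{NUMBER})"
# The whole input is a comma-separated sequence of things.
DAY_LIST = re.compile(rf"(?:{THING},)*{THING}")

samples = ["1-5,5,15-29", "1,28,1-31,15", "15,25,3", "1-24,5-6,2-9", "1-1-1-1", "32"]
for sample in samples:
    print(sample, bool(DAY_LIST.fullmatch(sample)))
```

Using `fullmatch` takes the place of the ^ and $ anchors in the Ruby version.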
Using the same logic as in my previous answer, but limiting the range to 1-31:
A becomes [1-9]|[12]\d|3[01]
B becomes ([1-9]|[12]\d|3[01])|(([1-9]|[12]\d|3[01])-([1-9]|[12]\d|3[01]))
Overall expression
(([1-9]|[12]\d|3[01])|(([1-9]|[12]\d|3[01])-([1-9]|[12]\d|3[01])))(,(([1-9]|[12]\d|3[01])|(([1-9]|[12]\d|3[01])-([1-9]|[12]\d|3[01]))))*
A compact regex for this could be:
^(?'int'[12]\d|3[01]|[1-9])((,\g'int')|(-\g'int'(?=$|,)))*$