Simplify regular expression - regex

I want to simplify this regular expression:
0*|0*1(ε|0*1)*00*
I used this identity:
(R|S)*=(R*S*)*=(R*|S*)*
and couldn't get better than this:
0*|0*1(0*1)*00* [(ε|0*1)*=(ε*0*1)*=(0*1)*]
Can this regular expression be simplified even more, and how? I have no clue what else to do. :)
EDIT 1: I changed + to |, because + could also be read as "one or more times"; alternation is now denoted by |.
Explanation of notation:
1) ε stands for empty word
2) * is Kleene star
3) AB is just a concatenation of languages of regular expressions A and B.
EDIT 2: Formal proof that this reduces to (0*1)*0+|ε:
0*|0*1(ε|0*1)*00* =
= 0*|0*1(0*1)*0+ =
= 0*|(0*1)+0+ =
= 0+|ε|(0*1)+0+ =
= ε0+|(0*1)+0+|ε
= (ε|(0*1)+)0+|ε
= (0*1)*0+|ε
Is there any way to reduce it further to (0|1)*0|ε?

I think it reduces to this: (0*1)*0+|ε

(Update: See edit history for long, sad story of previous incorrect attempts).
I (now) believe this reduces to:
ε|(0|1)*0
in other words, either:
The empty string
Any string of ones and zeros ending in 0
Proving this is another matter altogether. ;-)

I managed to formally reduce given regular expression to ε|(0|1)*0.
This is the proof:
0*|0*1(ε|0*1)*00* =
= 0*|0*1(0*1)*0+ =
= 0*|(0*1)+0+ =
= 0+|ε|(0*1)+0+ =
= ε0+|(0*1)+0+|ε =
= (ε|(0*1)+)0+|ε =
= (0*1)*0+|ε =
= (0*1)*0*0|ε = #
= (0|1)*0|ε
The trick was to use the identity (A*B)*A* = (A|B)* of which I wasn't aware when the question was asked, in the step marked with #.
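The chain of equalities above can be sanity-checked by brute force. The following is a minimal Python sketch, not a proof: it assumes Python's re syntax (ε is written as an optional group, [01] stands for (0|1)), and the helper name same_language and the length bound are illustrative choices that do not come from the original posts.

```python
import re
from itertools import product

def same_language(patterns, alphabet="01", max_len=10):
    """Return True if all patterns accept exactly the same strings up to max_len."""
    compiled = [re.compile(p) for p in patterns]
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            # One verdict per pattern; more than one distinct verdict means disagreement.
            verdicts = {bool(c.fullmatch(s)) for c in compiled}
            if len(verdicts) > 1:
                return False
    return True

# Steps of the derivation, translated to Python syntax.
steps = [
    r"0*|0*1(?:0*1)*00*",   # 0*|0*1(0*1)*00*
    r"(?:(?:0*1)*0+)?",     # (0*1)*0+|ε
    r"(?:(?:0*1)*0*0)?",    # (0*1)*0*0|ε
    r"(?:[01]*0)?",         # (0|1)*0|ε
]
print(same_language(steps))

# The identity used in the step marked with #: (A*B)*A* = (A|B)* with A=0, B=1.
print(same_language([r"(?:0*1)*0*", r"[01]*"]))
```

Agreement on all strings up to the bound only supports the algebraic proof; a disagreement would point to a wrong step.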

Related

How to remove anything after a non-slash character in a string?

The problem I am encountering is strange. Suppose I have:
a = "www.XXXXXXX.com"
b = "www.XXXXXXX.com/laskdfj/=*&9809f/12-613"
c = "www.XXXX.comllkjldfjlsadjfjldsf"
d = "http://www.XXXX.CoMmasldfjl"
e = "www.XXX.us/sdf"
f = "www.XXX.us0948klsdf"
If whatever follows ".com" or ".us" does not start with a slash, it should be removed. So the result would be:
a = "www.XXXXXXX.com"
b = "www.XXXXXXX.com/laskdfj/=*&9809f/12-613"
c = "www.XXXX.com"
d = "http://www.XXXX.CoM"
e = "www.XXX.us/sdf"
f = "www.XXX.us"
Regular expressions are new to me. I have read several blog posts about them, but none seem to explain how to use an if-statement to handle my situation. Any hints?
You can utilize sub for this task:
sub('(.*\\.(?i:com|us))[^/]+', '\\1', x)
If you're wanting a more general approach, you can use:
sub('(.*\\.[[:alpha:]]{2,3})[^/]*', '\\1', x)
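For comparison, here is a minimal Python sketch of the first sub call above. It is an analogue rather than a translation of the R code: the function name trim_after_tld is made up for illustration, and case-insensitivity is applied through re.IGNORECASE instead of the inline (?i:...) group.

```python
import re

# Keep everything up to ".com" or ".us" (any case) and drop a trailing run of
# non-slash characters. If a slash follows the suffix, [^/]+ cannot match
# there and the string is left unchanged.
TRAILING_JUNK = re.compile(r"(.*\.(?:com|us))[^/]+", re.IGNORECASE)

def trim_after_tld(url):
    # count=1 mirrors R's sub(), which replaces only the first match.
    return TRAILING_JUNK.sub(r"\1", url, count=1)

urls = [
    "www.XXXXXXX.com",
    "www.XXXXXXX.com/laskdfj/=*&9809f/12-613",
    "www.XXXX.comllkjldfjlsadjfjldsf",
    "http://www.XXXX.CoMmasldfjl",
    "www.XXX.us/sdf",
    "www.XXX.us0948klsdf",
]
for u in urls:
    print(trim_after_tld(u))
```

No explicit if-statement is needed: the "only when no slash follows" condition is expressed by requiring at least one non-slash character after the suffix.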

Why can I transform the regular expression 1*0 + 1*0(0+1)*(0+1) to 1*0(0+1)*?

I can't quite understand why I can transform the regex 1*0 + 1*0(0+1)*(0+1) to 1*0(0+1)*. Can anyone help me?
You can use the law of distributivity:
(1*0)+(1*0(0+1)*(0+1))
= (1*0ε)+(1*0(0+1)*(0+1))
= (1*0)(ε+(0+1)*(0+1))
and then apply the definition of the Kleene star, a* = ε+a*a (here with a = (0+1)):
= (1*0)((0+1)*)
= 1*0(0+1)*
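A quick way to convince yourself of both the overall equality and the Kleene star step is to enumerate short strings. This is a minimal Python sketch under stated assumptions: alternation + is written as | or as the character class [01], ε is modelled with an optional group, and the helper name accepted is hypothetical.

```python
import re
from itertools import product

def accepted(pattern, max_len=8):
    """Set of binary strings up to max_len that the pattern matches in full."""
    rx = re.compile(pattern)
    return {
        "".join(bits)
        for n in range(max_len + 1)
        for bits in product("01", repeat=n)
        if rx.fullmatch("".join(bits))
    }

# 1*0 + 1*0(0+1)*(0+1)  versus  1*0(0+1)*
lhs = accepted(r"1*0|1*0[01]*[01]")
rhs = accepted(r"1*0[01]*")

# ε + (0+1)*(0+1)  versus  (0+1)*, i.e. the star unrolled once
unrolled = accepted(r"(?:[01]*[01])?")
star = accepted(r"[01]*")

print(lhs == rhs, unrolled == star)
```

This checks the identities only up to the chosen length; the distributivity argument above is what establishes them in general.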

Mysterious no-match in regular expression

Imagine I have a cell array with two filenames:
filenames{1,1} = 'SMCSx0noSat48VTFeLeakTrace.txt';
filenames{2,1} = 'SMCSx0NoSat48VTrace.txt';
I want to get the filename which starts with 'SMCSx0' and contains the filterword 'NoSat48VTrace':
%// case 1
expression = 'SMCSx0';
filterword = 'NoSat48VTrace';
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
mask = ~cellfun(@isempty,regs);
file = filenames(mask)
it works, I get:
file =
'SMCSx0NoSat48VTrace.txt'
But for some reason, changing the filterword to 'noSat48VTFeLeakTrace' does not get me the other file:
%// case 2
expression = 'SMCSx0';
filterword = 'noSat48VTFeLeakTrace';
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
mask = ~cellfun(@isempty,regs);
file = filenames(mask)
which is exactly the same as before, but I get:
file =
Empty cell array: 0-by-1
I've actually been using these lines in a function for months without problems. But now I have added some files to my folder that are not found, even though their names are similar to the earlier ones. Any hints?
It is actually supposed to work without including Trace into the filterword, which it does for the first case, that's why I put .*\ into the regex.
%// case 1
expression = 'SMCSx0';
filterword = 'NoSat48V';
... works
'^' expression '.*\'
The \ at the end of this fragment is followed directly by the first letter of the filterword, so with the second filterword \n is interpreted as a newline character:
SMCSx0.*\noSat48VTFeLeakTrace.*\.txt$
This worked fine with the other filterword because NoSat48VTrace has an upper case N and \N is interpreted as simply N.
Get rid of the \, you don't need it.
You have an extra backslash in there, right after the first .* :
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
                                          ^^^
                                          |||
Remove it and it should give the expected result.
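The same pitfall exists in other regex flavours. The following minimal Python sketch (an analogue, not MATLAB code) shows a stray backslash turning n into a newline escape, and how escaping the inserted text avoids the problem. The function name find_files is hypothetical. In MATLAB, regexptranslate('escape', str) plays a role similar to re.escape here.

```python
import re

filenames = ["SMCSx0noSat48VTFeLeakTrace.txt", "SMCSx0NoSat48VTrace.txt"]

def find_files(names, prefix, filterword):
    # re.escape makes both inserted parts literal, so no stray backslash
    # can fuse with the next character into an escape such as \n.
    pattern = re.compile(
        "^" + re.escape(prefix) + ".*" + re.escape(filterword) + r".*\.txt$"
    )
    return [name for name in names if pattern.search(name)]

print(find_files(filenames, "SMCSx0", "NoSat48VTrace"))
print(find_files(filenames, "SMCSx0", "noSat48VTFeLeakTrace"))
print(find_files(filenames, "SMCSx0", "NoSat48V"))

# The original bug in miniature: "\n" in a pattern means newline, not the letter n,
# so this search finds nothing even though the text contains "noSat48V".
print(re.search(r"^SMCSx0.*\noSat48V", filenames[0]))
```

Building the pattern from escaped pieces keeps the wildcard parts (.*) explicit and separate from the user-supplied text.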

Shortcut to get a statement with certain pattern in R

I have to write the following as it is.
('trial1' = Ozone1, 'trial2' = Ozone2, trial3 = Ozone3,...........trial1000 = Ozone1000)
I want to write this with one command in R. How do I do it?
I tried it using paste0
Let us take only 5 as number of repetitions:
paste0("trial",1:5,"= Ozone", 1:5)
I get this as result.
"trial1= Ozone1" "trial2= Ozone2" "trial3= Ozone3" "trial4= Ozone4" "trial5= Ozone5"
But that is not what I want. I want the output exactly as written below, not wrapped in double quotes:
('trial1' = Ozone1, 'trial2' = Ozone2, 'trial3' = Ozone3, 'trial4' = Ozone4, 'trial5' = Ozone5)
In other words, the output should not be displayed as a quoted string ("........").
How do I do it?
This will generate the string you want...
paste0('(',paste0("'trial",1:1000,"' = Ozone",1:1000,collapse=', '),')')
This will print the string without quotes...
print(paste0('(',paste0("'trial",1:10,"' = Ozone",1:10,collapse=', '),')'), quote=FALSE)
I hope it answered your question...
Use the collapse argument of paste0. Escaping the single quotes (\') is optional inside a double-quoted string:
paste0("(", paste0("\'trial",1:5,"\' = Ozone",1:5, collapse=", "), ")")
[1] "('trial1' = Ozone1, 'trial2' = Ozone2, 'trial3' = Ozone3, 'trial4' = Ozone4, 'trial5' = Ozone5)"
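The same "build the pieces, then collapse them with a separator" idea looks like this in Python. It is a minimal sketch for comparison only; the function name trial_list is made up, and print is used because it shows the text without surrounding quotes.

```python
def trial_list(n):
    # One "'trialK' = OzoneK" piece per index, joined with ", " and wrapped in parentheses.
    pairs = [f"'trial{i}' = Ozone{i}" for i in range(1, n + 1)]
    return "(" + ", ".join(pairs) + ")"

print(trial_list(5))
```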

Recursive tricks with regexp in Matlab

I tried to use regexprep to solve a problem: I'm given a string that represents a function. It contains patterns like 'sin(arcsin(f))', where f is any substring, and I need to replace each of them with just the inner substring (f_2 in the example below). regexprep worked fine until I ran into a string like this:
str = 'sin(arcsin(sin(arcsin(f_2))))*x^2';
str = regexprep(str, 'sin\(arcsin\((\w*)\)\)','$1');
it returns
str =
sin(arcsin(f_2))*x^2
But I want it to be
str =
f_2*x^2
Is there any way to solve this (other than the obvious solution with for-loops)?
I was not able to test this, but I think I found an expression that you can call multiple times to do what you asked for; each time it will "strip" one sin(arcsin()) pair out of your equation. Once the string stops changing, you're done.
(.*)sin\(arcsin\((.*(?:\(.*?\))*)\)\)(.*)$
Here is some Matlab code that shows how this might work:
str = 'sin(arcsin(sin(arcsin(f_2))))*x^2';
% Token 3 captures the rest of the string so that the replacement keeps it
regex = '(.*)sin\(arcsin\((.*(?:\(.*?\))*)\)\)(.*)$';
oldlength = 0;
newlength = length(str);
while (newlength ~= oldlength)
oldlength = newlength;
str = regexprep(str, regex,'$1$2$3');
newlength = length(str);
end
As I said - I could not test this. Let me know if you have any problems with this.
Demo of the regular expression:
http://regex101.com/r/bR9gC7
Change your pattern to search for 1 or more (+) nested sin(arcsin( occurrences:
str = 'sin(arcsin(sin(arcsin(f_2))))*x^2';
str2 = regexprep(str, '(sin\(arcsin\()+(\w*)(\)\))+','$2')
str2 =
f_2*x^2
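Both ideas from the answers, repeating a simple replacement until nothing changes and matching all nesting levels in one pattern, can be sketched in Python as follows. This is an illustrative analogue of the MATLAB calls, not a tested MATLAB solution; the function names are hypothetical, and like the original patterns it only handles an inner expression made of word characters (\w*).

```python
import re

INNERMOST = re.compile(r"sin\(arcsin\((\w*)\)\)")
NESTED = re.compile(r"(?:sin\(arcsin\()+(\w*)(?:\)\))+")

def strip_nested_loop(expr):
    """Strip one sin(arcsin(...)) layer per pass until the string stops changing."""
    while True:
        reduced = INNERMOST.sub(r"\1", expr)
        if reduced == expr:
            return expr
        expr = reduced

def strip_nested_once(expr):
    """Match one or more nested layers in a single pattern.

    Note: the regex does not check that the numbers of opening and
    closing layers are equal.
    """
    return NESTED.sub(r"\1", expr)

s = "sin(arcsin(sin(arcsin(f_2))))*x^2"
print(strip_nested_loop(s))
print(strip_nested_once(s))
```

The loop version is slower but makes no assumption about how deep the nesting goes; the single-pattern version relies on the opening and closing runs being balanced in the input.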