How does one define a "language" using a regular expression?


I would like to define "some language" using a regular expression. The requirements are:
(1) The language must contain an infinite number of strings.
(2) The underlying alphabet must have at least three different characters.
(3) I also need to draw a deterministic finite state automaton that accepts the strings of that language.
(4) Give two character strings that are accepted by that finite state automaton and two that are not.
Given this set of requirements, I have so far (based on my 20-year-old memory of set theory and the math associated with it) come up with the following, and would appreciate some input from an expert in set theory, regular expressions and formal language definition (I know there are many of you who have a deeply vested interest in this subject).
Does the following come even close to fulfilling (1) and (2) at least? What does (4) actually imply? For instance, if the set can hold infinite strings (in theory), as per requirement (1), then how can we fulfill requirement (4) which says "Given 2 strings that are accepted by the (FSA) and TWO THAT ARE NOT"???
My current (rather fallible) solution is:
Alphabet:
∑ = {s,a,e,t,n}
Language:
L* = { Ø , ∈ , taste, set, ate, sane, ….}
OR (using regular expression)
L* = [saetn]*
Any takers?
Thanks.

First of all, the regular expression [saetn]* would accept all strings over the alphabet you chose (the language would be L = Σ*), so you would be unable to find two strings that are not in the language, and it can't satisfy requirement (4).
L = { Ø , ε , taste, set, ate, sane, ...}
is not a valid language, because a language cannot contain Ø. The empty set is not a string (and a language is a set of strings, not a set of sets). Let's remove the Ø.
L = { ε , taste, set, ate, sane, ...}
Does the following come even close to fulfilling (1) and (2) at least?
It doesn't fulfil (1), as there is no reasonable pattern for the ... to have any meaning. The language looks finite.
L = { ε , taste, set, ate, sane }
Would be a valid finite language where ε denotes the empty string. All finite languages are regular, since you can create an expression that is an OR of all the strings in the language (|taste|set|ate|sane).
It does fulfil (2), as you picked the alphabet ∑ = {s,a,e,t,n}, which has 5 elements.
What does (4) actually imply?
It means that the language can't contain all strings over the alphabet. There must be at least two strings in Σ* that are not in the language, and you must show what they are. That doesn't prevent the language from being infinite.
An example of an infinite language would be:
L = { ε, s, a, t, ss, aa, tt, sss, aaa, ttt, ssss, ... }
That language (over the alphabet {s, a, t}) contains all strings which have no more than one distinct character. One regular expression that would accept that language would be s*|a*|t*. The language is clearly infinite, and any strings which contain two different symbols, like at or sat are not in the language. That language satisfies all the requirements. There are many other languages that satisfy all the requirements.
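As a quick sanity check of that example, here is a minimal Python sketch (Python is my choice, not something from the question) that tests membership with re.fullmatch. The helper name in_language and the sample word lists are made up for illustration.

```python
import re

# Regex from the answer above: a (possibly empty) run of a single symbol.
PATTERN = re.compile(r"s*|a*|t*")

def in_language(word: str) -> bool:
    """Return True if the whole word matches s*|a*|t*."""
    return PATTERN.fullmatch(word) is not None

accepted = ["", "ssss"]   # two strings in the language
rejected = ["at", "sat"]  # two strings over {s, a, t} that are not

for word in accepted + rejected:
    print(repr(word), in_language(word))
```

fullmatch is used instead of search or match so that the whole string has to fit the pattern, which is what language membership means here.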
I will leave the drawing of the DFA to you. If you have any questions about it, feel free to comment on my answer.
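If it helps to check a drawing, one possible DFA for s*|a*|t* can be simulated in a few lines of Python. This is only a sketch: the state names (start, only_s, only_a, only_t, dead) are hypothetical labels, and other equivalent automata exist.

```python
ALPHABET = "sat"
START = "start"
# Every state except the trap state "dead" is accepting.
ACCEPTING = {"start", "only_s", "only_a", "only_t"}

def step(state: str, symbol: str) -> str:
    """Transition function of the DFA."""
    if state == "start":
        return "only_" + symbol   # the first symbol fixes the run
    if state == "only_" + symbol:
        return state              # same symbol again: stay in place
    return "dead"                 # a second distinct symbol, or already dead

def accepts(word: str) -> bool:
    state = START
    for symbol in word:
        if symbol not in ALPHABET:
            return False          # not a string over the alphabet at all
        state = step(state, symbol)
    return state in ACCEPTING

print(accepts("ttt"), accepts("sat"))
```

The trap state makes the transition function total, which is what makes the automaton deterministic in the strict sense.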

Related

How to match a string which begins with at least one of the k and can contain multiple of the keywords in any order

k1, k2, ..., kn keywords. For example, given k1, k2, k3 I need to match all following occurrences.
k1
k2
k3
k1k2
k1k3
k2k1
k2k3
k3k1
k3k2
k1k2k3
k1k3k2
k2k1k3
k2k3k1
k3k1k2
k3k2k1
Logic I have is to create regex for each permutation of k1, k2, ..., kn (n being variable). However this leads to factorial number of regexes - 3! in above example, k1(k2)?(k3)?, k1(k3)?(k2)?, k2(k1)?(k3)?, k2(k3)?(k1)?, k3(k1)?(k2)?, k3(k2)?(k1)? when run sequentially on the same string will get me all above matches.
How can this be made more efficient?

However this leads to factorial number of regexes - 3! in above example, k1(k2)?(k3)?, k1(k3)?(k2)?, k2(k1)?(k3)?, k2(k3)?(k1)?, k3(k1)?(k2)?, k3(k2)?(k1)? when run sequentially on the same string will get me all above matches.
That is true.
How can this be made more efficient?
Use a proper programming language / script to do the job. There you can use loops and generate the needed combinations "easily", without the hassles of regexes.
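For instance, a minimal Python sketch of that idea might consume the string keyword by keyword while tracking which keywords are still unused. The function name matches_distinct_keywords and the sample keywords are assumptions for illustration; backtracking is included in case one keyword is a prefix of another.

```python
def matches_distinct_keywords(text: str, keywords) -> bool:
    """True if text is a concatenation of one or more distinct keywords."""
    all_keywords = frozenset(keywords)

    def consume(pos: int, remaining: frozenset) -> bool:
        if pos == len(text):
            # Accept only if at least one keyword was used.
            return len(remaining) < len(all_keywords)
        for kw in remaining:
            # Try every unused keyword that fits at the current position.
            if text.startswith(kw, pos) and consume(pos + len(kw), remaining - {kw}):
                return True
        return False

    return consume(0, all_keywords)

KEYWORDS = ["k1", "k2", "k3"]
print(matches_distinct_keywords("k2k1k3", KEYWORDS))
print(matches_distinct_keywords("k1k1", KEYWORDS))
```

No permutation is ever written out explicitly; the set of unused keywords carries all the state that is needed.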
Note: Regexes were not created as a one-for-all tool, and definitely not for complex, algorithmic tasks.

Regular expressions recognize regular languages. Your language is finite, so it's regular by definition (you can write a regular expression for it by concatenating all the words with | between them), but what characterizes regular languages is repetitions of patterns.
A finite language cannot have arbitrary repetitions, which means that your regexp cannot have any * in it. So, it's not a very traditional regular language.
In some cases, the regular expression for a language, in particular a finite language, cannot be much simpler than simply listing all the strings of the language. This is one of those cases. The language has a structure, but it's not a structure based on repetitions, so the power of regular expressions is just not aligned with the task.
If you look at the complexity you need in your regular expression (or finite state machine, another way to match regular languages) in order to recognize the strings of your language, you can look at the information you need to remember after seeing any prefix of the string.
To recognize k1k2k3k4 and reject k1k2k3k1, k1k2k3k2, and k1k2k3k3, the information you need to remember after seeing k1k2k3 is that you have seen k1, k2, and k3.
So, for any sequence of keywords, you must remember the exact subset of keywords that has been seen so far. The number of possible subsets is exponential in the number of keywords (n keywords have 2^n subsets).
If you have 100 keywords, after seeing 50 of them, you need to remember which 50, and there are K(100,50) possible combinations (aka. 100891344545564193334812497256). That's where the factorial comes from (K(100,50) is 100!/(50!*50!)).
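The figure can be reproduced with Python's standard library (math.comb needs Python 3.8 or later). The variable names are illustrative only.

```python
import math

n, k = 100, 50

# "K(100,50)" in the notation above: ways to choose which 50 keywords were seen.
subsets_of_size_k = math.comb(n, k)

# The same number written out as 100! / (50! * 50!).
via_factorials = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# All subsets of the 100 keywords, of any size.
all_subsets = 2 ** n

print(subsets_of_size_k)
print(subsets_of_size_k == via_factorials)
```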
Your regular expression needs to be able to distinguish that many states, because for any two, there is a suffix which will be allowed by one and rejected by the other.

What is the difference between (a+b)* and (a*b*)*?

Assuming that Σ = {a, b}, I want to find out the regular expression (RE) Σ* (that being the set of all possible strings over the alphabet Σ).
I came up with below two possibilities:
(a+b)*
(a*b*)*
However, I can't decide by myself which RE is correct, or if both are bad. So, please tell me the correct answer.

The + operator is typically used to indicate union (|, "or") in academic regular expressions, not "one or more" as it typically means in non-academic settings (such as most regex implementations).
So, a+b means [ab] or a|b, thus (a+b)* means any string of length 0 or more, containing any number of as and bs in any order.
Likewise, (a*b*)* also means any string of length 0 or more, containing any number of as and bs in any order.
The two expressions are different ways of expressing the same language.
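A brute-force check can make this concrete. The sketch below is only evidence for short strings, not a proof; it translates the academic + into Python's | and compares the two patterns on every string over {a, b} up to a chosen length. The names UNION_STAR, NESTED_STAR and agree_up_to are made up for this example.

```python
import re
from itertools import product

UNION_STAR = re.compile(r"(a|b)*")     # academic (a+b)*
NESTED_STAR = re.compile(r"(a*b*)*")   # academic (a*b*)*

def agree_up_to(max_len: int) -> bool:
    """True if both patterns accept exactly the same strings of length <= max_len."""
    for length in range(max_len + 1):
        for letters in product("ab", repeat=length):
            word = "".join(letters)
            in_first = UNION_STAR.fullmatch(word) is not None
            in_second = NESTED_STAR.fullmatch(word) is not None
            if in_first != in_second:
                return False
    return True

print(agree_up_to(6))
```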

In normal regular expression grammar, (a+b)* means zero or more of any sequence that starts with a, then has zero or more a, then a b. This rules out things like baa (it doesn't start with a), abba, and a (there must be exactly one b after each group of a's), so it is not correct.
(a*b*)* means zero or more of any sequence that contains zero or more a followed by zero or more b. This is more correct since it allows for either starting character, any order and quantity of characters, and so on. It also allows the empty string which I'm pretty certain should be allowed by Σ* (but I'll leave that up to you).
However, it may be better to opt for the much simpler [ab]* (or [ab]+ in the unlikely event you consider an empty string invalid). This is basically zero (one for the + variant) or more of any character drawn from the class [ab].
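To see the regex-engine reading in practice, here is a small Python sketch using the standard re module. The helper names are hypothetical; the sample words are the ones mentioned above plus aab.

```python
import re

PY_A_PLUS_B = re.compile(r"(a+b)*")   # engine reading: one or more a, then one b, repeated
CHAR_CLASS = re.compile(r"[ab]*")     # any mix of a and b, including nothing

def engine_a_plus_b(word: str) -> bool:
    return PY_A_PLUS_B.fullmatch(word) is not None

def any_ab(word: str) -> bool:
    return CHAR_CLASS.fullmatch(word) is not None

for word in ["", "aab", "baa", "abba", "a"]:
    print(repr(word), engine_a_plus_b(word), any_ab(word))
```

Every sample word passes [ab]*, while only the empty string and aab pass the engine reading of (a+b)*.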
However, it's possible, since you're using Σ, that you may be discussing formal language theory (where Σ is common) rather than regex grammar (where it tends not to be).
If that is the case then you should understand that there are variants of the formal language where the a | b expression (effectively [ab] in regex grammar) can instead be rendered as one of a ∪ b, a ∨ b or a + b, with each of those operator symbols representing "logical or".
That would mean that (a+b)* is actually correct (as it is equivalent to the regex grammar I gave above) for what you need since it basically means any character from the set {a, b}, repeated zero or more times.
Additionally, that's also covered by your (a*b*)* option but it's almost always better to choose the simplest one that does the job :-)
And just something else to keep in mind for the formal language case. In English (for example), "a" is a word but you'd struggle to find anyone supporting the possibility that "" is also a word. Try looking it up in a dictionary :-)
In other words, any regular expression that allows an empty sequence of the language characters (such as (a+b)*) may not be suitable. You may find that (a+b)(a+b)* is a better option. This depends on whether Σ* allows for the empty sequence.

According to the algebraic properties of regular expressions,
(a*b*)* = (a+b)*
Therefore (a+b)* = (a*b*)*
Extra information:
(a+b)* = L(a+b)*
= (L(a+b))*
= (L(a) U L(b))*
= ({a} U {b})*
= {a,b}*
= {ε, a, b, aa, bb, ab, abab, aba, bbba,...}

Given Two Regex, Determine if One is a Complement of Other

I'd like to know how you can tell if some regular expression is the complement of another regular expression. Let's say I have 2 regular expressions r_1 and r_2. I can certainly create a DFA out of each of them and then check to make sure that L(r_1) != L(r_2). But that doesn't necessarily mean that r_1 is the complement of r_2 and vice versa. Also, it seems to me that many different regular expressions could describe the same complement of a single regular expression.
So I'm wondering how, given two regular expressions, I can determine if one is the complement of another. This is also new to me, so perhaps I'm missing something that should be apparent.
Edit: I should point out that I am not simply trying to find the complement of a regular expression. I am given two regular expressions, and I am to determine if they are the complement of each other.

Here is one approach that is conceptually simple, if not terribly efficient (not that there is necessarily a more efficient solution...):
1. Construct NFAs M and N for regular expressions r and s, respectively. You can do this using the construction introduced in the proof that finite automata and regular expressions describe the same languages.
2. Determinize M and N to get M' and N'. We might as well go ahead and minimize them at this point, giving M'' and N''.
3. Construct a machine C using the Cartesian product machine construction on machines M'' and N''. Acceptance will be determined by the symmetric difference, or XOR, criterion: accepting states in the product machine correspond to pairs of states (m, n) where exactly one of the two states is accepting in its automaton.
4. Minimize C and call the result C'.
5. If L(r) = L(s)', then the initial state of C' will be accepting and C' will have all transitions originating in the initial state also terminating in the initial state. If this is the case, r and s are complements of each other; otherwise they are not.
Why should this work? The symmetric difference of two sets is the set of everything in exactly one (not both, not neither). If L(s) and L(r) are complementary, then it is not difficult to see that the symmetric difference includes all strings (by definition, the complement of a set contains everything not in the set). Suppose now there were non-complementary sets whose symmetric difference were the universe of all strings. The sets are not complementary, so either (1) their intersection is non-empty or (2) their union is not the universe of all strings. In case (1), the symmetric difference will not include the shared element; in case (2), the symmetric difference will not include the missing strings. So, only complementary sets have the symmetric difference equal to the universe of all strings; and a minimal DFA for the set of all strings will always have an accepting initial state with self-loops.
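The product-machine idea can be sketched in Python if we assume the two regular expressions have already been turned into complete DFAs. The dictionary layout (start, accept, delta) and the two sample automata are hypothetical. Instead of minimizing the product, the sketch checks the equivalent condition that every reachable pair of states has exactly one accepting component.

```python
from collections import deque

def are_complements(dfa1, dfa2, alphabet) -> bool:
    """Explore the reachable part of the product machine with the XOR criterion."""
    start = (dfa1["start"], dfa2["start"])
    seen = {start}
    queue = deque([start])
    while queue:
        p, q = queue.popleft()
        # The symmetric difference must accept here, so exactly one side accepts.
        if (p in dfa1["accept"]) == (q in dfa2["accept"]):
            return False
        for symbol in alphabet:
            nxt = (dfa1["delta"][p, symbol], dfa2["delta"][q, symbol])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Sample DFAs over {a, b}: even number of a's, and odd number of a's.
PARITY_DELTA = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
EVEN_A = {"start": 0, "accept": {0}, "delta": PARITY_DELTA}
ODD_A = {"start": 0, "accept": {1}, "delta": PARITY_DELTA}

print(are_complements(EVEN_A, ODD_A, "ab"))
print(are_complements(EVEN_A, EVEN_A, "ab"))
```

Skipping the minimization step is a design shortcut: a product DFA accepts every string exactly when all of its reachable states are accepting, which is the same condition the minimized machine expresses with a single accepting state and self-loops.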

For complement: L(r_1) == !L(r_2)

Pattern matching language knowledge, pattern matching approach

I am trying to implement a pattern matching "syntax" and language.
I know of regular expressions, but these aren't enough for my purposes.
I have identified some "mathematical" operators.
In the examples that follow I will suppose that the subjects of pattern matching are character strings, but that isn't necessary.
After reading the description below, my question is: does anybody know of a mathematical theory that formalizes this, or of any language that takes the same approach in its implementation? I would like to look at it to get ideas!
Description of approach:
At first we have characters. Characters may be aggregated to form strings.
A pattern is:
a) a single character
b) an ordered group of patterns with the operator matchAny
c) an ordered group of patterns with the operator matchAll
d) other various operators to see later on.
Explanation:
We have a subject character string and a starting position.
If we check for a match of a single character, then if it matches it moves the current position forward by one position.
If we check for a match of an ordered group of patterns with the operator matchAny then it will check each element of the group in sequence and we will have a proliferation of starting positions that will get multiplied by the number of possible matches being advanced by the length of the match.
E.G suppose the group of patterns is { "a" "aba" "ab" "x" "dd" } and the string under examination is:
"Dabaxddc" with current position 2 ( counting from 1 ).
Then applying matchAny with the previous group, "a" matches, "aba" matches and "ab" matches, while "x" and "dd" do not match.
After those matches there are 3 starting positions, 3 4 5 (corresponding to "a" "ab" "aba").
We may continue our pattern matching by accepting more than one starting position. So now we may continue to the next case under examination and check for a matchAll.
matchAll means that all patterns must match sequentially and are applied sequentially.
subcases of matchAll are match0+ match1+ etc.
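The description so far could be sketched in Python by passing a set of current positions through each operator. This is only one reading of the approach: the function names are made up, patterns are limited to literal strings, and positions are 0-based here, so the example's position 2 becomes 1 and the results 3 4 5 become 2 3 4.

```python
def match_literal(text: str, positions: set, literal: str) -> set:
    """Advance every position at which the literal matches."""
    return {p + len(literal) for p in positions if text.startswith(literal, p)}

def match_any(text: str, positions: set, literals) -> set:
    """Union of the positions reached by each alternative (matchAny)."""
    result = set()
    for literal in literals:
        result |= match_literal(text, positions, literal)
    return result

def match_all(text: str, positions: set, literals) -> set:
    """Apply the patterns one after another (matchAll)."""
    for literal in literals:
        positions = match_literal(text, positions, literal)
    return positions

text = "Dabaxddc"
after_any = match_any(text, {1}, ["a", "aba", "ab", "x", "dd"])
after_all = match_all(text, after_any, ["x", "dd"])
print(sorted(after_any), sorted(after_all))
```

An empty result set means the match failed; keeping a set rather than a list means positions reached along different paths are merged automatically.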
I should add that the very act of trying to ask the question has already helped me and cleared some things up.
But I would like to know of similar approaches in order to study them.
Please suggest only languages you have actually used, not a bibliography!

I suggest you have a look at the paper "Parsing Permutation Phrases". It deals with recognizing a set of things in any order where the "things" can be recognizers themselves. The presentation in the paper might be a little different than what you expect; they don't compile to a finite automaton. However, they do give an implementation in a functional language and that should be helpful to you.

Your description of matching strings against patterns is exactly what a compiler does. In particular, your description of multiple potential matches is highly reminiscent of the way an LR parser works.
If the patterns are static and can be described by an EBNF, then you could use an LR parser generator (such as YACC) to generate a recogniser.
If the patterns are dynamic but can still be formulated as EBNF there are other tools that can be applied. It just gets a bit more complicated.
[In Australia at least, Computer Science was a University course in 1975, when I did mine. YACC dates from around 1970 in its original form. EBNF is even older.]

Proving that a language is regular by giving a regular expression

I am stumped by this practice problem (not for marks):
{w is an element of {a,b}* : the number of a's is even and the number of b's is even }
I can't seem to figure this one out.
In this case 0 is considered even.
A few acceptable strings: {}, {aa}, {bb}, {aabb}, {abab}, {bbaa}, {babaabba}, and so on
I've done similar examples where the a's must be a prefix, where the answer would be:
(aa)*(bb)*
but in this case they can be in any order.
Kleene stars (*), unions (U), intersects (&), and concatenation may be used.
Edit: Also have trouble with this one
{w is an element of {0,1}* : w = 1^r 0 1^s 0 for some r,s >= 1}

This is kind of ugly, but it should work:
ε U ( (aa) U (bb) U ((ab) U (ba)) ((aa) U (bb))* ((ab) U (ba)) )*
For the second one:
11*011*0
Generally I would use a+ instead of aa* here.
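For the second language, a short Python sketch can compare both spellings against the definition w = 1^r 0 1^s 0 on all short binary strings. The helper names are assumptions, and the check is exhaustive only up to the chosen length.

```python
import re
from itertools import product

WITH_STAR = re.compile(r"11*011*0")   # the expression given above
WITH_PLUS = re.compile(r"1+01+0")     # the same thing written with +

def by_definition(word: str) -> bool:
    """w = 1^r 0 1^s 0 with r, s >= 1, checked without any regex."""
    if not set(word) <= {"0", "1"}:
        return False
    parts = word.split("0")            # runs of 1s between the 0s
    return len(parts) == 3 and parts[0] != "" and parts[1] != "" and parts[2] == ""

def all_agree(max_len: int) -> bool:
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            word = "".join(bits)
            expected = by_definition(word)
            if (WITH_STAR.fullmatch(word) is not None) != expected:
                return False
            if (WITH_PLUS.fullmatch(word) is not None) != expected:
                return False
    return True

print(all_agree(8))
```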

Edit: Undeleted re: the comments in NullUserException's answer.
1) I personally think this one is easier to conceptualize if you first construct a DFA that can accept the strings. I haven't written it down, but off the top of my head I think you can do this with 4 states and one accept state. From there you can create an equivalent regex by removing states one at a time using an algorithm such as this one. This is possible because DFAs and regexes are provably equivalent.
2) Consider the fact that the Kleene star only applies to the nearest regular expression. Hence, if you have two individual ungrouped atoms (an atom itself is a regex!), it only applies to the second one (as in, ab* would match a single a and then any number - including 0 - b's). You can use this to your advantage in a case where you want something to exist, but you're not sure of how many there are.
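The 4-state DFA mentioned in point 1 can be sketched in Python by tracking the parity of each letter. The regex in the sketch is one possible expression for the same language, written in Python syntax with | for union; it is an assumption for illustration, compared with the DFA by brute force on short strings only.

```python
import re
from itertools import product

def dfa_accepts(word: str) -> bool:
    """4 states: (parity of a's, parity of b's); start and accept state is (0, 0)."""
    a_parity, b_parity = 0, 0
    for symbol in word:
        if symbol == "a":
            a_parity ^= 1
        elif symbol == "b":
            b_parity ^= 1
        else:
            return False               # not a string over {a, b}
    return (a_parity, b_parity) == (0, 0)

# Candidate regex: even blocks, or an odd pair ... odd pair with even blocks between.
EVEN_EVEN = re.compile(r"(aa|bb|(ab|ba)(aa|bb)*(ab|ba))*")

def regex_matches_dfa(max_len: int) -> bool:
    for length in range(max_len + 1):
        for letters in product("ab", repeat=length):
            word = "".join(letters)
            if (EVEN_EVEN.fullmatch(word) is not None) != dfa_accepts(word):
                return False
    return True

print(regex_matches_dfa(8))
```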
