Indent code inside brackets with regex

I would like to indent a code string with tabs. The rule is simple: I must insert a tab after each line feed inside braces "{}".
My trouble is with nested braces: there, the number of tabs must equal the nesting depth.
Do you think this is possible with a regex replace?

It is impossible to do this with a regex [at least with standard regex, which stands for regular expressions for regular languages], because the language you are describing is not regular!
With a regular language it is even impossible to know whether a given string contains the same number of { as }.
We can show this by contradiction: if this language were regular, then using a homomorphism we could create the language L = {a^n b^n}, which is known not to be regular.
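Since a regex replace cannot track the depth, a small loop with a counter can. Here is a minimal Python sketch, assuming the rule is taken literally (after every line feed, insert one tab per currently open brace). The function name indent_by_braces and the tab parameter are made up for illustration, and braces inside strings or comments are not treated specially.

```python
def indent_by_braces(code: str, tab: str = "\t") -> str:
    """Insert `depth` tabs after every line feed, where depth is the
    number of braces currently open at that point (hypothetical helper)."""
    out = []
    depth = 0
    for ch in code:
        if ch == "{":
            depth += 1
        elif ch == "}":
            # Never go below zero on unbalanced input.
            depth = max(depth - 1, 0)
        out.append(ch)
        if ch == "\n":
            out.append(tab * depth)
    return "".join(out)


if __name__ == "__main__":
    print(indent_by_braces("a{\nb{\nc\n}\n}\n"))
```

Note that with this literal reading a closing brace on its own line is still indented at the inner depth, because the brace is only counted after the line feed before it has been processed.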

Related

Regular Expression groups ignoring comma inside parenthesis

I know that there are plenty of regular expressions around here similar to what I am going to ask, but I couldn't find one that actually helps me.
This one got close, but it uses the Java split method, and I need to capture the values using only regular expressions:
Java: splitting a comma-separated string but ignoring commas in quotes
So, what I need to do is, given the below input:
string,string([a-zA-Z]{0,9}),integer
I would like to capture 3 matches:
string
string([a-zA-Z]{0,9})
integer
Note that inside the parentheses we can have a regular expression, which means almost any characters, even commas.
I can't use split here, because I am not using Java, but an internal declarative programming language that uses ICU regular expressions and has an API for capturing groups, but no regex-based split method.
Any help would be appreciated. I am really sorry if there are other posts that this one duplicates, but I have spent a few hours looking around, and even played with the post I mentioned, but couldn't get to a solution.
Thanks
EDIT
The input I provided is just an example, but other inputs are also possible.
Besides, after #sin comments, I have reviewed the input, and we can actually assume we'll have quotes inside the parentheses, like this:
string("[\w]{0,9}"),integer,string

PCRE repetition based on captured number -- (\d)(.{\1})

1xxx captures x
2xxx captures xx
3xxx captures xxx
I thought maybe this simple pattern would work:
(\d)(.{\1})
But no.
I know this is easy in Perl, but I'm using PCRE in Julia which means it would be hard to embed code to change the expression on-the-fly.
Note that regular expressions are usually compiled to a state machine before being executed, and are not naively interpreted.
Technically, n X^n (where n is a number and X a rule matching any character) isn't a regular language, and it isn't a context-free language either (see the Chomsky hierarchy). While PCRE regexes can match all context-free languages (if expressed suitably), the engine can only match a very limited subset of context-sensitive languages. We have a big problem on our hands that can be solved neither by regular expressions nor by regexes with all the PCRE extensions.
The solution here usually is to separate tokenization, parsing, and semantic validation when trying to parse some input. Here:
read the number (possibly using a regex)
read the following characters (possibly using a regex)
validate that the length of the character string is equal to the given number.
Obviously this isn't going to work in this specific case without implementing backtracking or similar strategies, so we will have to write a parser ourselves that can handle the input:
read the number (possibly using a regex)
then read that number of characters at that position (possibly using a substr-like function).
Regexes are awesome, but they are simply not the correct tool for every problem. Sometimes, writing the program yourself is easier.
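The two-step parser described above could be sketched as follows. This is Python rather than Julia, purely for illustration; the function name read_counted is hypothetical, and the count is assumed to be a single digit, as in the question's (\d).

```python
import re

DIGIT = re.compile(r"\d")  # assumption: the count is one decimal digit


def read_counted(text: str, pos: int = 0):
    """Read a digit n at `pos`, then exactly n following characters.

    Returns (chunk, next_position), or None if the input is too short
    or does not start with a digit.
    """
    m = DIGIT.match(text, pos)
    if m is None:
        return None
    n = int(m.group())
    start = m.end()
    chunk = text[start:start + n]  # substr-like step
    if len(chunk) != n:
        return None
    return chunk, start + n


if __name__ == "__main__":
    for sample in ("1xxx", "2xxx", "3xxx"):
        print(sample, "->", read_counted(sample))
```

The regex is only used for the tokenizing step; the length check is ordinary code, which is exactly the part a regex cannot express.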
It can't be done in general. For the particular example you gave, you can use the following:
1.{1}|2.{2}|3.{3}
If you have a long but fixed list of numbers, you can generate the pattern programmatically.
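Generating that alternation could look like the sketch below (Python for illustration; counted_pattern is a made-up name, and it assumes single-digit counts from 1 to at most 9). The resulting string is plain regex syntax, so it should be usable with PCRE as well.

```python
import re


def counted_pattern(max_n: int) -> str:
    """Build '1.{1}|2.{2}|...' for counts 1..max_n (single digits only)."""
    if not 1 <= max_n <= 9:
        raise ValueError("this sketch only handles single-digit counts")
    return "|".join(f"{n}.{{{n}}}" for n in range(1, max_n + 1))


if __name__ == "__main__":
    pattern = counted_pattern(3)
    print(pattern)
    m = re.match(pattern, "2xxx")
    # The match includes the leading digit; slice it off to get the payload.
    print(m.group()[1:])
```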

Is there a regular language to represent regular expressions?

Specifically, I noticed that the language of regular expressions itself isn't regular. So, I can't use a regular expression to parse a given regular expression. I need to use a parser since the language of the regular expression itself is context free.
Is there any way regular expressions can be represented in a way that the resulting string can be parsed using a regular expression?
Note: My question isn't about whether there is a regexp to match the current syntax of regexes, but whether there exists a "representation" for regular expressions as we know them today (maybe not as neat as what we know today) that can be parsed using regular expressions. Also, could someone please remove the dup flag, since it isn't a dup. I'm asking something completely different. I already know that the current language of regular expressions isn't regular (it is how I started my original question).
Depending on what you mean by "represent", the answer is "yes" or "no":
If you want a language that (homomorphically) maps 1:1 to the usual basic regular expression language, the answer is no, because a regular language cannot be isomorphic to a non-regular language, and the standard regular expression language is non-regular. This is because the syntax requires matching opening and closing parentheses of arbitrary depth.
If "represent" only means another method of specifying regular languages, the answer is yes, and right now I can think of at least three ways to achieve this:
The "dumbest" and easiest way is to define some surjective mapping f : ℕ -> RegEx from the natural numbers onto the set of all valid standard regular expressions. You can define the natural numbers using the regular expression 0|1[01]*, and the regular language denoted by a (string representing the) natural number n is the regular language denoted by f(n).
Of course, the meaning attached to a natural number would not be obvious to a human reader at all, so this "regular expression language" would be utterly useless.
As parentheses are the only non-regular part of simple regular expressions, the easiest human-interpretable method would be to extend the standard simple regular expression syntax to allow dangling parentheses and to define semantics for them.
The obvious choice would be to ignore non-matching opening parentheses and to interpret non-matching closing parentheses as matching the beginning of the regex. This essentially amounts to implicitly inserting as many opening parentheses at the beginning and as many closing parentheses at the end of the regex as necessary. Additionally, (* would have to be interpreted as repetition of the empty string. If I didn't miss anything, this definition should turn any string into a "regular expression" with a specified meaning, so .* defines this "regular expression language".
This variant even has the same abstract syntax as standard regular expressions.
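To make the "implicit insertion" reading concrete, here is a small Python sketch. The name balance_parens is hypothetical, and escaped parentheses and character classes are deliberately ignored, so it only illustrates the counting idea.

```python
def balance_parens(regex: str) -> str:
    """Prepend '(' for every unmatched ')' and append ')' for every
    unmatched '(' (escapes such as \\( are not handled in this sketch)."""
    open_now = 0       # '(' seen so far that are still unclosed
    missing_open = 0   # ')' seen with nothing left to close
    for ch in regex:
        if ch == "(":
            open_now += 1
        elif ch == ")":
            if open_now == 0:
                missing_open += 1
            else:
                open_now -= 1
    return "(" * missing_open + regex + ")" * open_now


if __name__ == "__main__":
    print(balance_parens("a)b(c"))
```

The counting itself is done by ordinary code here; the point of the variant is only that every input string receives some well-defined balanced reading.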
Another variant would be to specify the NFA that recognizes the language directly using a regular language, e.g.: ([a-z]+,([^,]|\\,|\\\\)+,[a-z]+\$?;)*.
The idea is that [a-z]+ is used as a label for states, and the expression is a list of transition triples (s, c, t) from source state s to target state t consuming character c, with a $ indicating accepting transitions (cf. note below). In c, backslashes are used to escape commas or backslashes. I assumed that you use the same alphabet as for standard regular expressions, but of course you can replace the middle component with any other regular language of symbols denoting characters of any alphabet you wish.
The first source state mentioned is the (single) initial state. An empty expression defines the empty language.
Above, I wrote "accepting transition", not "accepting state" because that would make the regex above a bit more complex. You can interpret a triple containing a $ as two transitions, namely one transition consuming c from s to a new, unique state, and an ε-transition from that state to t. This should allow any NFA to be represented, by replacing each transition to an accepting state with a $ triple and each transition to a non-accepting state with a non-$ triple.
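As an illustration of this third variant, the following Python sketch validates such a description with a regular expression and then simulates the NFA. Several things here are assumptions and not part of the answer: the names parse_nfa and accepts, the restriction of the middle component to exactly one (possibly escaped) symbol, and a slightly tightened character class so that a lone backslash is not a valid symbol.

```python
import re

# One triple: source, symbol, target, optional '$', terminated by ';'.
TRIPLE = re.compile(r"([a-z]+),([^,\\]|\\,|\\\\),([a-z]+)(\$?);")
WHOLE = re.compile(r"(?:[a-z]+,(?:[^,\\]|\\,|\\\\),[a-z]+\$?;)*")


def parse_nfa(desc: str):
    """Return a list of (source, char, target, accepting) tuples."""
    if not WHOLE.fullmatch(desc):
        raise ValueError("not a valid NFA description")
    triples = []
    for m in TRIPLE.finditer(desc):
        src, sym, dst, flag = m.groups()
        # sym[-1] drops the escaping backslash, if any.
        triples.append((src, sym[-1], dst, flag == "$"))
    return triples


def accepts(triples, text: str) -> bool:
    """Accept if the transition taken on the last character can be a '$' one."""
    if not triples:
        return False  # empty description: empty language
    current = {triples[0][0]}  # first source state is the initial state
    accepted = False
    for ch in text:
        nxt, accepted = set(), False
        for src, sym, dst, is_acc in triples:
            if src in current and sym == ch:
                nxt.add(dst)
                accepted = accepted or is_acc
        current = nxt
    return accepted


if __name__ == "__main__":
    nfa = parse_nfa("q,a,q;q,b,r$;")  # describes the language a*b
    print(accepts(nfa, "aab"), accepts(nfa, "aa"))
```

The description itself is checked by a regular expression alone; only the interpretation of the triples needs ordinary code.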
One note that might make the "yes" part look more intuitive: Assembly languages are regular, and those are even Turing-complete, so it would be unexpected if it wasn't possible to specify "mere" regular languages using a regular language.
The answer is probably NO.
As you have pointed out, the set of all possible regular expressions is itself not a regular set. Any TRUE regular expression (not the extended ones) can be converted into a finite automaton (FA). If regular expressions could be represented in a form that can be parsed by a regular expression, then FAs could be parsed by a regular expression as well.
But that's not possible as far as I know. An RE can be reduced to three basic operations (according to the Dragon Book):
concatenation: e.g. ab
alternation: e.g. a|b
Kleene closure: e.g. a*
The Kleene closure can match an unbounded number of characters, but it cannot keep track of how many characters it has matched.
Just consider this case: you want to match 3 consecutive a's. Then the corresponding regular expression is /aaa/. But what if you want to match 4, 5, 6... a's? A parser with only one RE cannot know the exact number of a's, so it fails to give the right match for arbitrary expressions. However, the RE parser has to match infinitely many different forms of REs. So, as you put it, a regular expression cannot match all the possibilities.
Well, the only difference with an RE parser is that it does not need a tokenizer (which is probably why REs are used in lexical analysis). Every character in an RE is a token (excluding escape characters). But to parse an RE, whatever it is converted to, one has to deal with NFA/DFA/tree structures, all equivalent structures that cannot be parsed by an RE itself.

Construction of pattern that doesn't contain binary string

I was trying to write a pattern that matches strings which don't contain a given binary string (let's assume 101). I know that such expressions cannot be written using Regular Expression considering http://en.wikipedia.org/wiki/Regular_language.
I tried writing the pattern for the above problem using Regular Expression though and it seems to be working.
\b(?!101)\w+\b
What I wanted to ask is: can a regular expression be written for my problem, and why? And if yes, is my regular expression correct?
To match a whole string that doesn't contain 101:
^(?!.*101).*$
Lookaheads are indeed an easy way to check a condition on a string through regex, but your regex will only match alphanumeric words that do not start with 101.
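The difference between the two patterns can be checked quickly. The snippet below uses Python's re module for illustration; the helper name lacks_101 is made up, and the inputs are assumed to be single-line strings.

```python
import re

WHOLE = re.compile(r"^(?!.*101).*$")    # pattern from this answer
WORD = re.compile(r"\b(?!101)\w+\b")    # pattern from the question


def lacks_101(s: str) -> bool:
    """True if the single-line string s does not contain 101 anywhere."""
    return WHOLE.search(s) is not None


if __name__ == "__main__":
    for s in ("1001", "0101", "1011"):
        print(s, lacks_101(s), bool(WORD.search(s)))
```

For "0101" the question's pattern still finds a match, because the lookahead is only tested at the start of the word, while the anchored pattern rejects the string.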
You wrote
I know that such expressions cannot be written using Regular
Expression considering http://en.wikipedia.org/wiki/Regular_language.
In that Wikipedia article, you seem to have missed the following:
Note that the "regular expression" features provided with many
programming languages are augmented with features that make them
capable of recognizing languages that can not be expressed by the
formal regular expressions (as formally defined below).
The negative lookahead construct is such a feature.
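As a side note on the "can it be written" part: regular languages are closed under complement, so "binary strings that do not contain 101" is itself a regular language and can also be described without lookahead. The pattern below is one hand-derived candidate (not taken from this thread), so the sketch checks it by brute force against a plain substring test instead of asking you to trust it. The names NO_101 and agrees_up_to are hypothetical.

```python
import re
from itertools import product

# Idea: a 0 that follows a 1 must either start a run of at least two
# zeros or be the very last character of the string.
NO_101 = re.compile(r"0*(?:1+00+)*1*0?")


def agrees_up_to(max_len: int) -> bool:
    """Compare the pattern with a substring test on all binary strings
    of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if bool(NO_101.fullmatch(s)) != ("101" not in s):
                return False
    return True


if __name__ == "__main__":
    print(agrees_up_to(10))
```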

regex matching pair of brackets

I'm trying to write a Sublime Text 2 syntax highlighter for Simulink's Target Language Compiler (TLC) files. This is a scripting language for auto-generating code. In TLC, the syntax to expand the contents of a token (similar to dereferencing a pointer in C or C++) is
%<token>
The regular expression I wrote to match this is
%<.+?>
This works for most cases, but fails for the following statement
%<LibAddToCommonIncludes("<string.h>")>
Making the regular expression greedy fixes this if the statement is by itself on a line, but fails in several other cases, so that is not an option.
For that line, the highlighting stops at the first > instead of the second. How can I modify the regular expression to handle this case?
It'd be great if there was a general expression that could handle any number of nested <> pairs; for example
%<...<...>...<...<...>...>...>
where the dots are optional characters. The entire expression above should be a single match.
A generic solution with regular expressions is difficult, as explained very well in this thread.
You can try to specifically match two levels of < characters with a regex, something like %<.+?<.+?>.+?>.
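Outside of a regex-only highlighter, arbitrary nesting is easy to handle with a depth counter. The following Python sketch is only meant to show that idea; find_expansions is a made-up name, it is not something a Sublime Text syntax definition can call, and it does not treat < or > inside string literals specially.

```python
def find_expansions(text: str) -> list[str]:
    """Return every balanced %<...> span, allowing nested <> pairs."""
    spans = []
    i = 0
    while True:
        start = text.find("%<", i)
        if start == -1:
            break
        depth = 0
        end = -1
        # Start counting at the '<' right after the '%'.
        for j in range(start + 1, len(text)):
            if text[j] == "<":
                depth += 1
            elif text[j] == ">":
                depth -= 1
                if depth == 0:
                    end = j
                    break
        if end == -1:
            i = start + 2  # unbalanced: skip this opener and keep looking
            continue
        spans.append(text[start:end + 1])
        i = end + 1
    return spans


if __name__ == "__main__":
    line = 'x = %<LibAddToCommonIncludes("<string.h>")> + %<token>'
    print(find_expansions(line))
```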