Strings that can never be parsed with regular expression [closed] - regex

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I am teaching regular expression to some good programmers. They are good at programming but hardly use regex. My task is to train them up so they know when to use regex and when not.
After showing most regular expression features, I found they are parsing everything with regex. This is not what I want. I want they know that there are some texts that can never be parsed with regex.
But I am out of luck. I know regular expression can parse regular language. If its a non regular language it can not parse it. So I am looking for non regular language example.
My target is when they fail to parse it, they will come up with some custom parser.
So, could you provide some good example of such non-regular language?

The best example is the parsing html
Show your students this:
<div>
<div>some shit</div>
<div>
This is some shit again
<div>
Really? Is this parsable?
</div>
</div>
</div>
and ask them to match the content of the inner most div, provided the html is dynamic.
In general, ask your students not to parse any other language using regex.
The best way to teach them that is by making them read this answer
In other words:
Use regex only when there is a uniform pattern of something
Also,
You cannot parse palindromes
You cannot parse another regex
You cannot match people's names and emails as they vary.(Email can be matched, but is a overkill)

A simple and understandable example of a non-regular language would be the language of palindromes, or in other words, strings that are equal to their reverses. It's pretty easy to demonstrate its non-regularity with the pumping lemma (see Wikipedia: http://en.wikipedia.org/wiki/Pumping_lemma)
Mind though, that in practical computing, the distinction isn't quite as clear, as many regular expression engines support features such as back references that allow recognition of certain non-regular languages. A regex engine with back references can match, for example, the language of squares or repetitions ("PonyPony", "123123", "gg" etc): (.*)\1 which isn't possible without back references.

Related

Is it OK to mix string parsing while learning reg ex? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I'm performing some regular expression exercises in Python 2.7.3, on Windows 7. Some of these exercises have me looking for similar patterns on the same line. For example, I want to use regex to capture name1 and name2...
<XML tag><more tags>[name1]</XML tag><XML2 tag>[name2]<XML2 tag></more tags>
Would it be "cheating" or "missing the point" if I used any string parsing to capture name2? I feel like using regex the correct way alone should be able to capture both of those names, but string parsing is what I've always been familiar with.
An analogy would be like someone studying recursion in C++, but using a While loop. Recursion should NOT have any While loops (although of course it may be part of some other grand design).
Good question! Many beginners come into it believing they should be able do everything with one regex match. After all, people are always saying how powerful regexes are, and what you're trying to do is so simple...
But no, the regex is responsible for finding the next match, that's all. Retrieving the substring that it matched, or finding multiple matches, or performing substitutions, that's all external to the act of matching the regex. That's why languages provide methods like Python's findall() and sub(); to do the kind of "string parsing" operations you're talking about, so you don't have to.
It occurred to me a while back that the process of mastering regexes is one of learning everything you can't do with them, and why not. Understanding which parts of the regex matching operation are performed by the regex engine, and which parts are the responsibility of the enclosing language or tool, is a good start.

Transliteration between different writing systems [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I need to learn how to change a transliteration of a text to another writing system. Apparently the best way would somehow involve regular expressions and perl, probably from command line? I've been using regular expressions earlier in Notepad++ and TextWrangler, so I know some basics already. If there is some really good (and relatively easy and customizable) way to do this in Ruby or something else, I can start learning that as well. There is a constant need to transliterate linguistic sample texts in my field in Uralic linguistics, where many different variants of transliteration systems are used. So it is worth investing some time.
So the material I have now consists of lines with a sentence on each line. Some lines have other data like numbers, but those should stay as they are. I want to keep the punctuation marks as they are, this is just about converting one set of unicode letter characters to another. I searched the site but a lot was about converting from ascii to unicode and so on - this is not the problem here.
So the original text is like this (in broad Finno-Ugric Transcription):
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.
And I would need it in a form like this:
мӧдiс иван велӧччыны печораӧ щӧтӧвӧднэй курс вылӧ.
This continues for some thousand lines.
There is a clear correspondence between characters used, but it is sometimes complex and involves dealing first with some digraphs and consonant + vowel combinations, etc. As you see from the example, in some situations latin i corresponds to cyrillic и but in some positions can remain as i. Different texts have different solutions, so I would need to adjust the rules in each case. I understand I would need to run a long series of regular expressions in a very specific order to make it work. This order I will figure out myself, but I need to know into what kind of tool I have feed these rules in and how to do it.
I also have often situations where I would like to have the original sentence and transliterated one separated by a tab, so that the lines would have a form like this:
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö. мӧдiс иван
велӧччыны печораӧ щӧтӧвӧдней курс вылӧ.
Of course there are many more questions, but after learning these basics I think I can move forward independently. Learning this would help me a lot. Thanks in advance!
Niko

Inverted index for regex? Regex search engine? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I was wondering if it would be possible at all to build an inverted index for all possible regular expressions... I have had a few ideas, but they are extremely vague at the moment.
My reasoning behind this is because I think that a search engine that uses regex would be pretty useful (I'm sure many people would agree), although the problem with a search engine is that there is quite a lot of things to search. This is why there are inverted indexes, I guess.
Maybe something similar? I don't really know.
Here's a description of my idea:
The search engine should be a regex search engine. Instead of being like a normal search engine which only matches words, this will match specific regex specified by the user.
an example of a search: [^ ]*ell[^ ]* .*\.
something like that, for example. the reasoning behind this is that sometimes i want to search something that can't be found due to the limitedness of normal search engines.
it'll be a simple sed-like regex, maybe a bit javascripty.
they are all similar anyway (with the basics)
Edit: I've seen regular expression search engine, but it's not what I am asking. I'm wondering if it's possible to build one.
Edit 2: Maybe an inverted index that has bits of words, and numbers (and their length), etc. Maybe some kind of table where I can quickly pick things out, so if I have a number of a certain length in my regex, I can quickly filter all the numbers that i have indexed that have that length?
If I combine those ideas, I just realized that maybe multiple searches, but with a shrinking data source, until everything that is left is what matches the regex? Eg: ell.\*\\. would search for everything with e, then everything with a l following the a, then everything with another l following the el, and then any number of characters followed by a ..

What is the best way to define grammars for a text editor? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
Improve this question
I'm masochistically writing an open-source text editor for Mac and have finally reached the point at which I want to add syntax highlighting. I've been going back and forth on various solutions for the past few days, and I've finally decided to open the question to a wider audience.
Here are the options I see:
Define languages basically with a series of regex pattern matching (similar to how TextMate defines its languages)
Define languages with a formal grammar like BNF or PEG
Using regex pattern matching seems less than ideal as it cannot formally represent a language nearly as well as a formal grammar; however, some less formal languages will have a hard time fitting into BNF (i.e. Markdown -- though I know there's a great PEG implementation).
What are the performance tradeoffs for live syntax highlighting? What about flexibility for a wide range of languages?
If I go the BNF route, Todd Ditchendorf created the awesome ParseKit framework which would work nicely out-of-the-box. Anyone know of any anything similar for PEG's?
Unless you want to fight the battle of getting a full-context free (or worse, a full context-sensitive) grammar completely correct for every language you want to process (or worse, for every dialect of the language you want to process... how many kinds of C++ are there?), for the purposes of syntax highlighting you're probably better giving up on complete correctness and accept that sometimes you'll get it wrong. In that case, regexps seem like an extremely good answer. They can also be very fast, so they won't interfere with the person doing the editing.
If you insist on doing full syntax checking/completion (I don't think you are), then you'll need that full grammar. You'll also be a very long time in producing editors for real languages.
Sometimes it is better not to be too serious. A 98% solution that you can get is better than a 100% solution that never materializes.
It might not be exactly what you need since you are writing the editor yourself, but there is an awesome framework called Xtext that will actually generate a complete editor with syntax coloring, customizable outline view and auto-completion etc., based on a grammar for your language: http://eclipse.org/Xtext

Regex Performance Optimization Tips and Tricks [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
After reading a pretty good article on regex optimization in java I was wondering what are the other good tips for creating fast and efficient regular expressions?
Use the non-capturing group (?:pattern) when you need to repeat a grouping but don't need to use the captured value that comes from a traditional (capturing) group.
Use the atomic group (or non-backtracking subexpression) when applicable (?>pattern).
Avoid catastrophic backtracking like the plague by designing your regular expressions to terminate early for non-matches.
I created a video demonstrating these techniques. I started with the very poorly written regular expression in the catastrophic backtracking article (x+x+)+y. And then I made it 3 million times faster after a series of optimizations, benchmarking after every change. The video is specific to .NET but many of these things apply to most other regex flavors as well:
.NET Regex Lesson: #5: Optimization
Use the any (dot) operator sparingly, if you can do it any other way, do it, dot will always bite you...
i'm not sure whether PCRE is NFA and i'm only familiar with PCRE but + and * are usually greedy by default, they will match as much as possible to turn this around use +? and *? to match the least possible, bear these two clauses in mind while writing your regexp.
Know when not to use a regular expression -- sometimes a hand coded solution is more efficient and more understandable.
Example: suppose you want to match an integer that's evenly divisible by 3. It's trivial to design a finite state machine to accomplish this, and therefore a corresponding regex must exist, but writing it out is not so trivial -- and I'd sure hate to have to debug it!