Is it OK to mix string parsing while learning reg ex? [closed] - regex

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm performing some regular expression exercises in Python 2.7.3, on Windows 7. Some of these exercises have me looking for similar patterns on the same line. For example, I want to use regex to capture name1 and name2...
<XML tag><more tags>[name1]</XML tag><XML2 tag>[name2]<XML2 tag></more tags>
Would it be "cheating" or "missing the point" if I used any string parsing to capture name2? I feel like using regex the correct way alone should be able to capture both of those names, but string parsing is what I've always been familiar with.
An analogy would be like someone studying recursion in C++, but using a While loop. Recursion should NOT have any While loops (although of course it may be part of some other grand design).

Good question! Many beginners come into it believing they should be able to do everything with one regex match. After all, people are always saying how powerful regexes are, and what you're trying to do is so simple...
But no, the regex is responsible for finding the next match, that's all. Retrieving the substring that it matched, finding multiple matches, or performing substitutions is all external to the act of matching the regex. That's why languages provide methods like Python's findall() and sub(): they do the kind of "string parsing" operations you're talking about, so you don't have to.
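As a rough illustration of that division of labor, here is a minimal sketch using the sample line from the question. It assumes the square brackets around the names are literal characters in the data, which may not be true of the real input.

```python
import re

# The sample line from the question, with the bracketed names treated as
# literal text (an assumption; the real data may look different).
line = "<XML tag><more tags>[name1]</XML tag><XML2 tag>[name2]<XML2 tag></more tags>"

# The pattern describes ONE match: a '[', a run of non-']' characters, a ']'.
name_pattern = re.compile(r"\[([^\]]+)\]")

# findall() does the looping: it applies the pattern repeatedly and
# collects the captured group from every match.
names = name_pattern.findall(line)

# sub() applies the same single-match pattern at every match position.
redacted = name_pattern.sub("[hidden]", line)

print(names)
print(redacted)
```

The regex itself never "knows" there are two names; the repetition lives in findall() and sub().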
It occurred to me a while back that the process of mastering regexes is one of learning everything you can't do with them, and why not. Understanding which parts of the regex matching operation are performed by the regex engine, and which parts are the responsibility of the enclosing language or tool, is a good start.

Related

Recognize pattern of characters [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I am a college student and I need some help understanding this pattern of characters: what will it match? Or maybe somebody can explain how it works?
" (k[abc]*p)+ "
Thank you for any help.
This is a bit of a vague question, as you're basically asking how regular expressions work.
First of all, I would recommend 'Mastering Regular Expressions', which is a pretty great O'Reilly book on regex.
Also, for a regex playground, I really like to use Rubular (http://www.rubular.com/). Although it is meant for Ruby, it can give you a good understanding of regular expressions in general, and it comes with a nice quick reference guide.
Taking some time to figure this out yourself will be very helpful; regular expressions are not going away.
In this case, your expression evaluates everything inside the () as one chunk. So it's looking for a k, then zero or more (*) of the characters a, b, or c ([abc]), followed by a p, with that whole chunk occurring at least one time (+).
So things like kp, kap, kabcp, and kapkbp will match.
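If you want to check this yourself, a small sketch like the following works. It uses Python's re module, and the test strings are made up for illustration.

```python
import re

# The pattern from the question, without the surrounding quotes and spaces.
pattern = re.compile(r"(k[abc]*p)+")

def matches_whole(text):
    # fullmatch() succeeds only if the entire string fits the pattern.
    return pattern.fullmatch(text) is not None

# Strings made of one or more "k, any run of a/b/c, p" chunks are accepted.
accepted = [s for s in ["kp", "kap", "kabcp", "kapkbbp"] if matches_whole(s)]

# A missing p, a letter outside [abc], a missing k, or an empty string fail.
rejected = [s for s in ["k", "kdp", "ap", ""] if not matches_whole(s)]

print(accepted)
print(rejected)
```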

Strings that can never be parsed with regular expression [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am teaching regular expressions to some good programmers. They are good at programming but hardly ever use regex. My task is to train them so they know when to use regex and when not to.
After I showed them most regular expression features, I found they were parsing everything with regex. This is not what I want. I want them to know that there are some texts that can never be parsed with regex.
But I am out of luck. I know regular expressions can parse regular languages, and that a non-regular language cannot be parsed with them. So I am looking for an example of a non-regular language.
My goal is that when they fail to parse it, they will come up with a custom parser.
So, could you provide some good examples of such non-regular languages?
The best example is parsing HTML.
Show your students this:
<div>
<div>some shit</div>
<div>
This is some shit again
<div>
Really? Is this parsable?
</div>
</div>
</div>
and ask them to match the content of the innermost div, given that the HTML is dynamic.
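For contrast, here is a minimal sketch of what a non-regex approach might look like, using Python's standard html.parser module. The class name DeepestDivText and the sample markup are hypothetical and only mirror the shape of the snippet above; a real page would need more care (other tags, several divs at the same depth, and so on).

```python
from html.parser import HTMLParser

class DeepestDivText(HTMLParser):
    """Record the first text found at the greatest <div> nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current <div> nesting depth
        self.best_depth = 0   # deepest level at which text was seen so far
        self.best_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when it sits deeper than anything seen before.
        if text and self.depth > self.best_depth:
            self.best_depth = self.depth
            self.best_text = text

html = """<div>
<div>some text</div>
<div>
This is some text again
<div>
Really? Is this parsable?
</div>
</div>
</div>"""

parser = DeepestDivText()
parser.feed(html)
parser.close()
innermost = parser.best_text
print(innermost)
```

The depth counter is the piece a classical regular expression lacks: it has no way to remember how many divs are currently open.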
In general, ask your students not to parse any other language using regex.
The best way to teach them that is by making them read this answer
In other words:
Use regex only when there is a uniform pattern of something
Also,
You cannot parse palindromes
You cannot parse another regex
You cannot match people's names and emails, as they vary. (Email can be matched, but it is overkill.)
A simple and understandable example of a non-regular language would be the language of palindromes, or in other words, strings that are equal to their reverses. It's pretty easy to demonstrate its non-regularity with the pumping lemma (see Wikipedia: http://en.wikipedia.org/wiki/Pumping_lemma)
Mind though, that in practical computing, the distinction isn't quite as clear, as many regular expression engines support features such as back references that allow recognition of certain non-regular languages. A regex engine with back references can match, for example, the language of squares or repetitions ("PonyPony", "123123", "gg" etc): (.*)\1 which isn't possible without back references.
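A short sketch of the back-reference point, using Python's re module. The helper names is_square and is_palindrome are made up for this example; the palindrome check is plain string code, included only to show how little a hand-written check needs.

```python
import re

# \1 must repeat exactly what group 1 captured, so the whole string has to
# be some substring written twice. fullmatch() anchors the pattern at both
# ends of the string.
square = re.compile(r"(.*)\1", re.DOTALL)

def is_square(text):
    return square.fullmatch(text) is not None

def is_palindrome(text):
    # A string is a palindrome when it equals its own reverse.
    return text == text[::-1]

print([s for s in ["PonyPony", "123123", "gg", "Pony", "abc"] if is_square(s)])
print(is_palindrome("racecar"), is_palindrome("regex"))
```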

Inverted index for regex? Regex search engine? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I was wondering if it would be possible at all to build an inverted index for all possible regular expressions... I have had a few ideas, but they are extremely vague at the moment.
My reasoning behind this is that I think a search engine that uses regex would be pretty useful (I'm sure many people would agree), although the problem with a search engine is that there are quite a lot of things to search. This is why there are inverted indexes, I guess.
Maybe something similar? I don't really know.
Here's a description of my idea:
The search engine should be a regex search engine. Instead of being like a normal search engine which only matches words, this will match specific regex specified by the user.
an example of a search: [^ ]*ell[^ ]* .*\.
Something like that, for example. The reasoning behind this is that sometimes I want to search for something that can't be found due to the limitations of normal search engines.
It'll be a simple sed-like regex, maybe a bit JavaScript-y.
They are all similar anyway (with the basics).
Edit: I've seen regular expression search engine, but it's not what I am asking. I'm wondering if it's possible to build one.
Edit 2: Maybe an inverted index that has bits of words, and numbers (and their length), etc. Maybe some kind of table where I can quickly pick things out, so if I have a number of a certain length in my regex, I can quickly filter all the numbers that i have indexed that have that length?
If I combine those ideas, I just realized that maybe I could run multiple searches, each on a shrinking data source, until everything that is left is what matches the regex? E.g. ell.*\. would search for everything with an e, then everything with an l following the e, then everything with another l following the el, and then any number of characters followed by a '.'.
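One way the "shrinking data source" idea could be sketched is with an inverted index over three-character substrings (trigrams): the index narrows the candidates, and the real regex runs only on what is left. Everything here is an assumption for illustration, including the function names, the sample documents, and the fact that the required literal ("ell") is supplied by hand instead of being extracted from the regex automatically.

```python
import re
from collections import defaultdict

def trigrams(text):
    # Every three-character substring of the text.
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(documents):
    # Inverted index: trigram -> set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def search(documents, index, pattern, required_literal):
    # required_literal is a substring every match must contain; here it is
    # given by hand, not derived from the pattern.
    candidates = set(range(len(documents)))
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    # Only the surviving candidates are checked with the actual regex.
    regex = re.compile(pattern)
    return sorted(i for i in candidates if regex.search(documents[i]))

documents = [
    "She said hello to everyone.",
    "Nothing to see here.",
    "The bell rang",
    "A jelly roll",
]
index = build_index(documents)
hits = search(documents, index, r"[^ ]*ell[^ ]* .*\.", "ell")
print(hits)
```

The index can only rule documents out; the final regex pass is still needed to confirm a match.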

Converting Regular Expressions For Use With VBA [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
Forgive me if this seems basic, but I'm new to regular expressions, and I can't seem to find general information that isn't example driven.
Basically, I'd like to know what to keep in mind when converting regular expressions found around the internet for use with vba. Is it as simple as "oh, just change the non-greedy operator to this ...", or is it involved on a level that I really need to find VBA specific expressions, or get better at writing my own.
My specific example involves converting this phone number pattern
^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$
But I'm really more interested in either a general approach, or an affirmation of its futility. Thanks!
Following up on Cor_Blimey and funkwurm's comments, I have used VBScript 5.5 regexes in Office 2003 and Office 2007 VBA with good success. Details here. For anything more complicated than () and [] groups, I have found writing from scratch more effective than trying to modify an existing expression.
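Before porting a pattern, it can help to pin down what it accepts in a flavor you can test quickly. The sketch below does that for the phone number pattern from the question using Python's re module, not VBA; the sample strings are made up. The pattern uses only anchors, groups, alternation, \d, and counted repetition, which are common to most flavors, but that should still be verified in the target environment.

```python
import re

# The phone number pattern from the question, unchanged.
phone = re.compile(r"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$")

samples = [
    "555-1234",        # no area code
    "(212) 555-1234",  # parenthesized area code, with a space
    "(212)555-1234",   # parenthesized area code, no space
    "212-555-1234",    # dashed area code
    "212 555 1234",    # spaces instead of dashes
    "5551234",         # no separators at all
]

# Keep the samples the pattern accepts, so the same list can be
# re-checked against the ported version.
valid = [s for s in samples if phone.match(s)]
print(valid)
```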

Regex Performance Optimization Tips and Tricks [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
After reading a pretty good article on regex optimization in Java, I was wondering what other good tips there are for creating fast and efficient regular expressions?
Use the non-capturing group (?:pattern) when you need to repeat a grouping but don't need to use the captured value that comes from a traditional (capturing) group.
Use the atomic group (or non-backtracking subexpression) when applicable (?>pattern).
Avoid catastrophic backtracking like the plague by designing your regular expressions to terminate early for non-matches.
I created a video demonstrating these techniques. I started with the very poorly written regular expression in the catastrophic backtracking article (x+x+)+y. And then I made it 3 million times faster after a series of optimizations, benchmarking after every change. The video is specific to .NET but many of these things apply to most other regex flavors as well:
.NET Regex Lesson: #5: Optimization
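A small sketch of two of these tips in Python (not .NET, and not taken from the video). The first pair shows that a non-capturing group repeats just like a capturing one but stores nothing. The second pair rewrites (x+x+)+y as xx+y, which accepts the same strings ("two or more x, then y") without the nested quantifiers; the rewrite is my own illustration, and the inputs are kept short on purpose so the slow pattern stays harmless.

```python
import re

# Both patterns repeat the "ab" chunk; only the first stores a capture.
capturing = re.compile(r"(ab)+c")
non_capturing = re.compile(r"(?:ab)+c")

captured = capturing.fullmatch("ababc").groups()
not_captured = non_capturing.fullmatch("ababc").groups()

# Nested quantifiers give a backtracking engine exponentially many ways to
# split a run of x's before it can report a failed match. The flat version
# has only one way to consume the run.
slow = re.compile(r"(x+x+)+y")
fast = re.compile(r"xx+y")

samples = ["xy", "xxy", "xxxxy", "xxxx", "y"]
slow_results = [slow.fullmatch(s) is not None for s in samples]
fast_results = [fast.fullmatch(s) is not None for s in samples]

print(captured, not_captured)
print(slow_results == fast_results)
```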
Use the any (dot) operator sparingly. If you can do it any other way, do it; the dot will always bite you...
I'm only familiar with PCRE (a backtracking, NFA-style engine), but + and * are usually greedy by default: they will match as much as possible. To turn this around, use +? and *? to match as little as possible. Bear both behaviors in mind while writing your regexp.
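The difference is easy to see on a small input. This sketch uses Python's re module, whose + and * are also greedy by default; the sample text is made up. The third pattern shows the "avoid the dot" tip: a negated character class states what may appear between the tags instead of relying on backtracking.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .+ runs to the end of the line, then backs up to the LAST </b>.
greedy = re.findall(r"<b>.+</b>", text)

# Lazy: .+? grows one character at a time and stops at the FIRST </b>.
lazy = re.findall(r"<b>.+?</b>", text)

# Negated class: no dot at all, so the match cannot run past a '<'.
negated = re.findall(r"<b>[^<]+</b>", text)

print(greedy)
print(lazy)
print(negated)
```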
Know when not to use a regular expression -- sometimes a hand coded solution is more efficient and more understandable.
Example: suppose you want to match an integer that's evenly divisible by 3. It's trivial to design a finite state machine to accomplish this, and therefore a corresponding regex must exist, but writing it out is not so trivial -- and I'd sure hate to have to debug it!
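To make the finite state machine concrete, here is a minimal hand-coded sketch. The function name divisible_by_three is made up; the state is simply the remainder modulo 3 of the digits read so far, so there are three states and the accepting state is 0.

```python
def divisible_by_three(digits):
    """Run a three-state machine over a string of decimal digits."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty string of decimal digits")

    state = 0  # remainder (mod 3) of the number read so far
    for ch in digits:
        # Appending a digit d turns the number n into 10*n + d, so the
        # new remainder is (10*state + d) mod 3.
        state = (state * 10 + int(ch)) % 3
    return state == 0

print(divisible_by_three("123456"))
print(divisible_by_three("1000000"))
```

The equivalent regex has to spell out every path through those three states, which is why writing it by hand is so much harder to read and debug than the loop.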