Regular Expression library in C/C++ - c++

I want to write a regular expression library in C/C++.
What is a good starting point? Are there any books or articles?
I know there are many libraries available, but I want to write my own version.

A good starting point is to use existing implementations and criticize them.
Pay attention to data structures and design decisions you don't like.
Avoid them when you write your version.

[Edit 16-Jan-2015] I recently encountered this beautiful book Beautiful Code. I recommend you go through Chapter 1, "A Regular Expression Matcher" by Brian Kernighan.
You can read the classic paper by Ken Thompson, "Regular expression search algorithm" ... http://portal.acm.org/citation.cfm?doid=363347.363387 ... this paper should give you a good understanding on how regular expressions are matched using finite automata.
This is another page giving some detailed information by Russ Cox ... http://swtch.com/~rsc/regexp/
Hope these help you get started.
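To make the finite-automaton idea from Thompson's paper a bit more concrete, here is a minimal sketch in Python (chosen only for brevity; nothing above prescribes a language). The NFA for (a|b)*abb is hand-built rather than compiled from the pattern, and the names TRANSITIONS, ACCEPTING and nfa_accepts are made up for this illustration. It tracks the set of all states the machine could be in after each character, so it never has to backtrack.

```python
# Hand-built NFA for the pattern (a|b)*abb (an illustrative assumption,
# not produced by a regex compiler).
# State 0 is the start state and loops on a and b; state 3 is accepting.
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(text):
    """Return True if the whole of text is accepted by the NFA."""
    current = {START}
    for ch in text:
        following = set()
        for state in current:
            # A missing entry means "no move", so that path simply dies out.
            following |= TRANSITIONS.get((state, ch), set())
        current = following
    return bool(current & ACCEPTING)
```

Each input character is looked at once, so the work grows with the length of the text times the number of states. A real library would build the transition table from the pattern instead of writing it by hand.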

I don't know a book that will help you with the implementation details -- and I'm sure there are tons of details to make it efficient. However, the book Languages and Machines, by Thomas A. Sudkamp, will be of help to understand the ideas behind an implementation.
I think what you'll need to do is compile a regular expression into a finite automaton. If you don't know much about grammars and automata, then Part II of that book, "Grammars, Automata, and Languages", will be of great help.
The book Compilers: Principles, Techniques, & Tools, by Alfred Aho, Monica Lam, Ravi Sethi and Jeffrey Ullman (also referred to as the Dragon Book), may also be of help. It's oriented towards writing a compiler for a programming language, not for a regular expression language. However, you'll probably find it helpful, especially the part about parsing, as it is more practical in nature (as opposed to Languages and Machines, which is very theoretical).
Anyway, if I were to write a regular expression library, those would be my starting points. I recommend borrowing both from a library you have access to. Other than that, you should take a look at working implementations. I'm just guessing here, but there is probably good documentation on Perl's regular expression implementation, seeing that Perl regexes are so popular and work so well.
Good luck.

Related

get started with regular expression [closed]

I’m always afraid whenever I see a regular expression. I think they are very difficult to understand. But fear is not the solution. I’ve decided to start learning regex, so can someone advise me on how to start? And is there an easy tutorial?
☝ Getting Started with /Regexes/
Regular expressions are a form of declarative programming. If you are used to imperative, functional, or object-oriented programming, then they are a very different way of thinking. It’s a rules-based approach with subtle backtracking issues. I daresay a background in Prolog might actually do you some good with these, which certainly isn’t something I commonly advise.
Normally I would just have people play around with the grep command from their shell, then advance to using regexes for searching and replacing in their editor.
But I’m guessing you aren’t coming from a Unix background, because if you were, you would have come across regexes all over, from the very most basic grep command to pattern-matching in the vi or emacs editors. You can look at the grep manpage by typing
% man grep
on your BSD, Linux, Apple, or Sun systems, just to name a few.
                                     ☹ ¡ʇɟoƨoɹɔᴉƜ ʇnoqɐ əɯ ʞƨɐ ʇ ̦uop əƨɐəld ʇƨnɾ  ☹
☟ (?: Book Learnin’? )
If you ran into regular expressions at school or university, it was probably in the context of automata theory. They come up when discussing regular languages. If you have suffered through such classes, you may remember that regular expressions are the user-friendly face of messy finite automata. What they probably did not teach you, however, is that outside of the ivory tower, the regular expressions people actually use in the real world are far, far beyond "regular" in the rarefied, theoretical, and highly irregular sense of that otherwise commonplace word. This means that modern regular expressions (call them patterns if you prefer) can do much more than the traditional regular expressions taught in computer science classes. There just isn’t any REGULAR left in modern regular expressions outside the classroom, but this is a good thing.
I say “modern”, but in fact regular expressions haven’t been regular since Ken Thompson first put back references into his backtracking NFA, back when he was famously proving NFA–DFA equivalence. So unless you actually are using a DFA engine, it might be best to just forget any book-learnin’ nonsense about REGULARness of regexes. It just doesn’t apply to the way we really use them every day in the real world.
Modern regular expressions allow for much more than just back references though, as you will find once you delve into them. They’re their own wonderful world, even if that world is a bit surreal at times. They can let you replace pages and pages of code with just one line. They can also make you lose hair over their crazy behavior. Sometimes they make your computer seem like it’s hung, because it’s actually working very hard in a race between it and the heat-death of the universe in some awful O(2ⁿ) algorithm, or even worse. It can easily be much worse, actually. That’s what having this sort of power in your hands can do. There are no training wheels and no slow lane. Regexes are a power tool par excellence.
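As a rough illustration of that kind of blow-up, the sketch below models (it does not instrument a real engine) how a naive backtracking matcher would try the pattern (a+)+b against a run of n letters a with no b at the end. The function name count_failed_attempts and the model itself are assumptions made for this example, and it only counts attempts that start at the first character.

```python
def count_failed_attempts(n):
    """Count how often a naive backtracker reaches the final 'b' of
    (a+)+b and fails, when matching n copies of 'a' from position 0."""
    text = "a" * n
    failures = 0

    def try_groups(pos):
        nonlocal failures
        # Let the next (a+) group cover text[pos:end] for every possible end.
        for end in range(pos + 1, len(text) + 1):
            failures += 1      # try to match 'b' at end: never succeeds here
            try_groups(end)    # or start yet another (a+) group from end

    try_groups(0)
    return failures


for n in (4, 8, 16):
    print(n, count_failed_attempts(n))
```

Under this model the count works out to 2**n - 1, so every extra character doubles the work. That is the shape of the problem, even though real engines differ in the details.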
/☕✷⅋⋙$⚣™‹ª∞¶⌘̤℈⁑‽#♬˘$π❧/
Just one more thing before I give you a big list of helpful references. As I’ve already said today elsewhere, regexes do not have to be ugly, and they do not have to be hard. REMEMBER: If you create ugly regexes, it is only a reflection on you, not on them.
That’s absolutely no excuse for creating regexes that are hard to read. Oh, there’s plenty like that out there all right, but they shouldn’t be and they needn’t be. Even though regexes are (for the most part) a form of declarative programming, all the software engineering techniques that one uses in other forms of programming STILL APPLY HERE!
A regex should never look like a dense row of punctuation that’s impossible to decipher. Any language would be a disaster if you removed all the alphabetical identifiers, removed all whitespace and indentation, removed all comments, and removed every last trace of top-down programming. So of course they look like cr#p if you do that. Don’t do that!
So use all of those basic tools, including aesthetically pleasing code layout, careful problem decomposition, named subroutines, decoupling the declaration from the execution (including ordering!), unit testing, plus all the rest, whenever you’re creating regexes. These are all critical steps in making your patterns maintainable.
It’s one thing to write /(.)\1/, but quite another to write something like mǁ☕⅋⚣⁑™∞¶⌘℈‽#♬❧ǁ. Those are regexes from the Dark Ages: don’t just reject them: burn them at the stake! It’s programming, after all, not line-noise or golf!
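For a taste of what laying out a pattern like code can look like, here is a small sketch using Python's re module with the re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments. The answer is written from a Perl point of view, so take the language choice and the names DOUBLED and ISO_DATE as assumptions made for the example.

```python
import re

# The same idea as /(.)\1/, but with a named group instead of a number.
DOUBLED = re.compile(r"""
    (?P<char> . )   # any single character ...
    (?P=char)       # ... followed by that same character again
""", re.VERBOSE)

# A date such as 2015-01-16, one commented piece per line.
ISO_DATE = re.compile(r"""
    (?P<year>  [0-9]{4} )   # four-digit year
    -
    (?P<month> [0-9]{2} )   # two-digit month
    -
    (?P<day>   [0-9]{2} )   # two-digit day
""", re.VERBOSE)

print(DOUBLED.search("bookkeeper").group())
m = ISO_DATE.fullmatch("2015-01-16")
print(m.group("year"), m.group("month"), m.group("day"))
```

The pieces are pulled out by name afterwards, so nobody has to count parentheses to find out which group is the month.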
☞ Regex References
The Wikipedia page on regular expressions is a decent enough overview.
IBM has a nice introduction to regexes in their Speaking Unix series.
Russ Cox has a very nice list of classic regular expressions references. You might want to check out the original Version 8 regular expressions, here found in a Perl manpage, but these were the original, most basic patterns that everybody grew up with back in olden days.
Mastering Regular Expressions from O’Reilly, by Jeffrey Friedl.
Jan Goyvaerts’s regular-expressions.info site and his Regular Expression Cookbook, also from O’Reilly.
I’m a native speaker of Perl, so let me say four words about it. Chapter 5 of the Perl Cookbook and Chapter 6 of Programming Perl, both somewhat embarrassingly by yours truly et alios, also from O’Reilly, are devoted to regular expressions in Perl. Perl was the language that originated most regex features found in modern regular expressions, and it continues to lead the pack. Perl’s Unicode support for regexes is especially rich and remarkably simple to use — in comparison with other languages’. You can download all the code examples from those two books from the O’Reilly site, or see the next item. The perldoc.org site has quite a bit on pattern matching, including the perlre and perluniprops manpages, just to take a couple of starting points.
Apropos the Perl Cookbook, the PLEAC project has reïmplemented the Perl Cookbook code in a dizzying number of diverse languages, including ada, common lisp, groovy, guile, haskell, java, merd, ocaml, php, pike, python, rexx, ruby, and tcl. If you look at what each language does for their equivalent of PCB’s regex chapter, you will learn a tremendously huge amount about how that language deals with regular expressions. It’s a marvellous resource and quite an eye-opener, even if some of the solutions are, um, suboptimal.
Java Regular Expressions by Mehran Habibi from Apress. It’s certainly better than trying to figure anything out by reading Sun’s documentation on the Pattern class. Java is probably the worst possible language for learning regexes in; it is very clumsy and often completely stupid. I speak from painful personal experience, not from ignorance, and I am hardly alone in this appraisal. If you have to use a JVM language, I recommend Groovy or perhaps Scala. Unfortunately, both are based on the standard Java pattern matching classes, so share their inadequacies.
If you need Unicode and you’re using Java or C⁺⁺ instead of Perl, then I recommend looking into the ICU library. They handle Unicode in Java much better than Sun does, but it still feels too much like assembler for my tastes. Perl and Java appear to have the best support for Unicode and multiple encodings. Java is still kinda warty, but other languages often have this even worse. Be warned that languages with regexes bolted on the side are always clumsier to use them in than those with regexes built in.
If you’re using C, then I would probably skip over the system-supplied regex library and jump right into PCRE by Philip Hazel. A bonus is that PCRE can be built to handle Unicode reasonably well. It is also the basic regex library used by several other languages and tools, including PHP.
regular-expressions.info is a gold-mine of information and tutorials about regular expressions. From beginner to expert, there's not much out there that is better than this site when it comes to the study of regular expressions.
regular-expressions.info has a good tutorial here
http://www.regular-expressions.info/tutorial.html
Regular expressions by themselves are not of much use unless combined with text manipulation, either through a scripting tool (sed/awk) or a programming language such as Perl. Try installing RegexBuddy, a nice standalone tool that lets you run regular expressions against files you point it to.
So yes, you can learn the basics of their structure, syntax, and semantics, if I may call them that, but also read the regular expression tutorials for Perl, Vim, ... and do some example string/text manipulation in those contexts, programmatically.
-AD.
While learning at: regular-expressions.info, the Regular Expressions Cheat Sheet (V2) is something you definitely want to have.
http://www.gskinner.com/RegExr/ exists both as an online version and as an AIR application.
The cool thing about this app (besides that it works like a charm) is that you can save your expressions or share them with the community right from the app.
Say you need an e-mail regex: you can just search for e-mail and you will get back a rated list of expressions.
Another helpful feature is the interpretation of your expressions into human readable form. This makes it easier to learn and master.
For the tutorial part this article is very easy to consume.
This book saved my ass when I was starting out with awk and sed.

C++ for Ruby scripters

I am a fairly capable Ruby scripter/programmer, but have been feeling pressure to branch out into C++. I haven't been able to find any sites along the lines of "C++ for Ruby Programmers". This site exists for Python (which is quite similar, I know). Does anyone know of a guide that can help me translate my Ruby 'thoughts' into C++?
I don't think that language introductions written specifically for migrants from a certain language have considerable advantage over traditional "independent" introductory books. Reading as a cognitive process has a great feature: reading speed varies greatly. That means that you should take any good C++ book (I'm sure you'll find excellent recommendations here on SO) and your reading speed will be greatly affected by your previous programming knowledge - reading about things you already know will become almost skimming-fast, others will take some time. In the end, you will spend practically the same amount of time as you would if you read a specific migrant course, with the difference of having read a book that you will be able to use as a language reference at any given time in the future, unlike the "transitional guide", which is always kind of "one-time read".
On the other hand, from the perspective of a writer, it's pretty thankless to assume (and rely on) such a thing as the reader's knowledge of a topic. When someone says he knows Ruby, is that really a guarantee that he knows OOP thoroughly, for example? Or has he just been using it without understanding the internals (which is really easy for a Rails programmer, for example)?
So a general book is a safe bet both for a writer and a reader. :)
I agree with others. Your skills in Ruby will certainly help you to learn C++ in a way, but they are quite different. A great online book to learn C++: Thinking in C++
Bruce Eckel's books are a really good start with an adapted learning curve. Simple to begin but going quite deep into the language. Recommended.
my2c
If you want to learn C++, start here.
Once you have learned the fundamentals, you should also look into these:
Effective C++,
More Effective C++,
Effective STL, and
Exceptional C++

Regular Expressions and Assembly

I know 8086 Assembly and am learning MIPS Assembly. I'm also learning Regular Expressions, so I want to know:
How can I use Regular Expressions on them?
This is a challenging problem to pull off in assembly from scratch. No assembly language would support regular expressions as a first-class construct because there's too much of a difference in the abstraction level to make it a useful inclusion. That means you need to build it yourself.
Supporting regular expressions is essentially like having a compiler inside your program that translates the expression into a sequence of matching instructions. You will have to build all of the constituent pieces: a translation engine, a series of transformation rules, a DFA assembler, and a matching engine.
That said, it's not impossible! Start small, supporting tiny subsets of the real language you want to support, and then work your way up. Check out chapter 16 of Assembly Language Programming for a detailed walkthrough of how you might build your own regular expression engine. You'll need a good understanding of how they work (which this chapter will give you) and a solid understanding of assembly as well (see the earlier chapters for that).
Try this:
AsmRegEx - regular expression engine
It's written in FASM. Unfortunately, it seems that the project won't progress anymore...
The set of articles here describes how to build a very simple but powerful regex engine from scratch. It uses C++ but explains the theory in detail and the code can be translated to ASM without too much effort by an experienced programmer.
That said, I don't think it's a particularly interesting exercise, neither for learning ASM nor for learning regular expressions. You'll just get too bogged down by the details.
Regular expressions do not exist in assembly. That seems a slightly bizarre question, as regexes are by nature a higher-level language construct; they do not exist at the nuts-and-bolts level...
Edit: Nathan, here is the link that might be of interest to you. Scroll down to the bottom of the page ;)
Hope this helps,
Best regards,
Tom.
Start off with very simple regular expressions, for example recognising sequences of alphabetic characters and numeric characters, and work your way up from there. You will need to consider carefully how your code is going to deliver its results.
It may be a good idea to create a regex parser in C first, as more people in this forum will be able to help you. Once you have got it working, you can translate it to assembler code. Again, more people here will be familiar with 8086 assembly language programming than with MIPS, so it may be a good idea to use 8086 even though the CPU architecture is not very nice.
Not sure if you want to know how to implement a regex engine in assembler or just how to easily use regular expressions on your null-terminated strings from assembly language. If it's the former, you have been given some pointers. If it's the latter, it depends on your platform, but the easiest way is to call a C-coded library from your assembly. Unix variants have POSIX regular expressions already available in the libc, and you can call them from your assembly by following the appropriate calling conventions.

How are regular expressions processed?

How are regular expressions processed?
A regular expression describes a ruleset for a state machine. It generally moves over a string one character at a time, making decisions based on what happened on previous characters and what is described in the regex.
Any regex can also be written as a loop over a string one character at a time. Some of these could be fairly simple, but the power of regex is found when what appears to be a simple regex, with a few lookbehinds and subgroups would take a thousand lines of code to reproduce in your own state machine.
Regular expressions can be modeled as a Deterministic Finite State Machine. That would probably be a good place to start if you wanted to "process" one.
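A minimal sketch of that state-machine view, written in Python for brevity: a hand-coded deterministic machine for the pattern [+-]?[0-9]+ that walks the string one character at a time. The state names and the function matches_integer are invented for this example.

```python
DIGITS = "0123456789"


def matches_integer(text):
    """Hand-written DFA for [+-]?[0-9]+ applied to the whole string."""
    state = "start"
    for ch in text:
        if state == "start" and ch in "+-":
            state = "sign"      # optional sign, allowed only at the start
        elif ch in DIGITS:
            state = "digits"    # a digit is valid from any state
        else:
            return False        # no transition defined: reject
    # Only the "digits" state is accepting, so "" and "+" are rejected.
    return state == "digits"
```

Three states are easy to manage by hand. The point made above is that this stops being true quickly as the pattern grows.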
This question is very broad. This is not a complete answer but Jeff Moser has an excellent write-up on his blog that walks through .NET's regex process: How .NET Regular Expressions Really Work
I suspect other answers will shed light on other areas of regular expressions unless your question is updated to be more specific.
This will depend on which regex implementation you're referring to.
There are 2 common but different techniques used in regex engines:
Nondeterministic Finite State Machine
Deterministic Finite State Machine
This MSDN article explains several techniques implemented in various engines and then goes on to explain .NET's implementation and why Microsoft chose what they chose for .NET.
They go even more in-depth in the various articles you see listed here.
http://www.moserware.com/2009/03/how-net-regular-expressions-really-work.html
Despite what everyone here says about state machines, you can write a remarkably simple regex recogniser using recursive techniques with very little state. There are examples of these in two of Brian Kernighan's books Software Tools In Pascal and The Practice Of Programming.
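The matcher in The Practice of Programming is written in C. What follows is a loose Python port of that recursive idea, so treat the details as a sketch rather than the book's code. It supports only literal characters, ., ^, $ and *, and the function names are chosen for this example.

```python
def match(regexp, text):
    """Search for regexp anywhere in text."""
    if regexp.startswith("^"):
        return match_here(regexp[1:], text)
    # Try every starting position, including the empty suffix at the end.
    for start in range(len(text) + 1):
        if match_here(regexp, text[start:]):
            return True
    return False


def match_here(regexp, text):
    """Match regexp at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return text == ""
    if text and regexp[0] in (".", text[0]):
        return match_here(regexp[1:], text[1:])
    return False


def match_star(c, regexp, text):
    """Match zero or more occurrences of c, followed by regexp."""
    i = 0
    while True:
        if match_here(regexp, text[i:]):
            return True
        if i < len(text) and c in (".", text[i]):
            i += 1
        else:
            return False
```

For example, match("^ab*c$", "abbbc") succeeds while match("^b", "abc") does not. The only state is the current position in the pattern and in the text, carried in the recursive calls.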

How important is knowing Regexs?

My personal experience is that regexs solve problems that can't be efficiently solved any other way, and are so frequently required in a world where strings are as important as they are that not having a firm grasp of the subject would be sufficient reason for me to consider not hiring you as a senior programmer (a junior is always allowed the leeway of training).
However.
A number of responses on the recurrent "What's the regex for this?" type-questions suggest that a great deal of coders find them somewhere between unintelligible and opaque.
This is not about whether a simple indexOf or substring is a better solution, that's a technical matter, and sometimes the simple way is correct, sometimes a regex is, and sometimes neither (looking at you html parser questions).
This is about how important it is to understand Regexs and whether the anti-Regex opinion (that trite "...now they have two problems" thing) is merited or FUD.
Should a programmer be expected to understand Regexs? Is this a required skill?
edit: just in case it isn't clear, I'm not asking whether I need to learn them (I'm a defender of the faith) but whether the anti-camp are an evolutionary dead end or whether it's an unnecessary niche skill like InstallShield.
REs let you solve relatively complex problems that would otherwise require you to code up full parsers with backtracking and all that messy sort of stuff. I liken the use of REs to using chainsaws to chop down a tree instead of trying to do it with a piece of celery.
Once you've learned how to use the chainsaw safely, you'll never go back. People who continue to spout anti-RE propaganda will never be as productive as those of us who have learned to love them.
So yes, you should know how to use REs, even if you understand only the basic constructs. They're a tool just like any other.
There are some tasks where regular expressions are the best tool to use.
There are some tasks where regular expressions are pointlessly obscure.
There are some tasks where they're reasonably appropriate, but a different approach may be more readable.
In general, I think of using a regular expression when an actual pattern is involved. If you're just looking for a specific string, I wouldn't generally use a regex. As an example of a grey area, someone once asked on a newsgroup the best way to check whether one string contained any of a number of other strings. The two ways which came up were:
Build a regex with alternatives and perform a single match.
Test each string in turn with string.Contains.
Personally I think the latter way is much simpler - it doesn't require any thought about escaping the strings you're looking for, or any other knowledge of regular expressions (and their different flavours across different platforms).
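The two approaches can be put side by side in a short sketch. It is in Python rather than the .NET that string.Contains suggests, and the function names are made up for the example. Note the re.escape call: it is exactly the "thought about escaping" that the regex version needs and the plain version does not.

```python
import re


def contains_any_regex(text, needles):
    """Build one regex with alternatives and perform a single search."""
    if not needles:
        return False  # an empty alternation would match everything
    pattern = "|".join(re.escape(needle) for needle in needles)
    return re.search(pattern, text) is not None


def contains_any_simple(text, needles):
    """Test each string in turn with a plain substring check."""
    return any(needle in text for needle in needles)


text = "price: $5 (approx.)"
print(contains_any_regex(text, ["$5", "n/a"]), contains_any_simple(text, ["$5", "n/a"]))
```

Without the escaping, a needle such as "a.prox" would wrongly match "approx", and "$5" would never match at all.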
As an example of somewhere that regular expressions are quite clearly the wrong choice, someone seriously proposed using a regular expression to test whether or not a string is three characters long. Their regular expression didn't even work, despite them claiming that the reason they thought of regular expressions first is because they'd been using them for so long, and that they naturally sort of "thought" in regular expressions.
There are, however, plenty of examples where regular expressions really do make life easier - as I say, when you're actually matching patterns: "I want one letter, then three digits, then another letter" or whatever. I don't find myself using regular expressions very often, but when I do use them, they save a lot of work.
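Both cases fit in a few lines of Python. The sketch assumes ASCII letters and digits, and the names CODE, is_code and is_three_long are hypothetical.

```python
import re

# One letter, then three digits, then another letter (ASCII assumed).
CODE = re.compile(r"[A-Za-z][0-9]{3}[A-Za-z]")


def is_code(text):
    """A real pattern: this is where a regex earns its keep."""
    return CODE.fullmatch(text) is not None


def is_three_long(text):
    """No pattern involved, so no regex needed."""
    return len(text) == 3
```

Writing is_code by hand would take a loop or five separate character checks, whereas is_three_long gains nothing from a regex.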
In short, I believe it's good to know regular expressions - but equally to be careful about when to use them. It's easy to end up with write-only code which could be made simpler to understand by rewriting with simple string operations, even if the resulting code is slightly longer.
EDIT: In response to the edit of the question...
I don't think it's a good idea to be evangelical about them - in my experience, that tends to lead to using them where an alternative would be simpler, and that just makes you look bad. On the other hand, if you come across someone writing complicated code to avoid using a regular expression, it's fine to point out that a regex would make the code simpler.
Personally I like to comment my regular expressions in quite a detailed way, splitting them up onto several lines with a comment between each line. That way they're easier to maintain, and it doesn't look like you're just trying to be "hard core" geeky (which can be the impression, even if it's not the actual intended aim).
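One way this can look, sketched in Python: adjacent string literals are joined by the compiler, so each piece of the pattern can sit on its own line with its own comment. The 24-hour time format and the name TIME_PATTERN are assumptions picked for the example.

```python
import re

TIME_PATTERN = re.compile(
    r"(?:[01][0-9]|2[0-3])"   # hours: 00 to 23
    r":"                      # literal separator
    r"[0-5][0-9]"             # minutes: 00 to 59
)

print(TIME_PATTERN.fullmatch("23:59") is not None)
print(TIME_PATTERN.fullmatch("24:00") is not None)
```

The regex the engine sees is still the compact one-liner. Only the source layout changes.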
I think the most important thing is to remember that short != readable. Never claim that using a regex is better because it requires less code - claim that it's better when it's genuinely simpler and easier to understand (or where there's a significant performance benefit, of course).
As a developer you should know the pros and cons of as many tools as possible that could provide pre-made solutions for your problems. Every developer should know how to work with regular expressions and have a feeling when they should be used and when it is better to use simple string functions to achieve a goal.
Rejecting them outright because they are hard to read is no option in my opinion. A developer who thinks so strips himself of a valuable tool for searching and validating complex string patterns.
I have really mixed feelings. I have used them and know the bones of the syntax and something in me loves their conciseness. However they are not commonly understood and are a highly obfuscated form of code. I too would like to see performance comparisons against similar operations in plain code. There is no question that the exploded code will be more maintainable and more easily and widely understood, which is a serious consideration in any commercial software project.
Even if they turn out to be more performant, the argument for them taken to its logical conclusion would see us all embedding assembler into our code for important loops - perhaps we should. Neat and concise and very fast, but almost un-maintainable.
On balance I think that until the regex syntax becomes mainstream they probably cause more trouble than they solve and should be used only very carefully.
In Steve Yegge's article, Five Essential Phone Screen Questions, you should read the section "Area Number Three: Scripting and Regular Expressions".
Steve Yegge has some interesting points. He gives real world problems he has encountered with clients having to parse 50,000 files for a particular pattern of a phone number. The applicants who know regular expressions tear through the problem in a few minutes while those who don't write monster multi-hundred line programs that are very unwieldy. This article convinced me I should learn regular expressions.
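To give a feel for how short the regex route is, here is a sketch in Python. The two phone formats are my assumption for the example, not quoted from the article, and the names PHONE and find_phone_numbers are hypothetical.

```python
import re

# Assumed formats for this sketch: 206-555-0100 and (206) 555-0100.
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{3}-[0-9]{4}")


def find_phone_numbers(lines):
    """Return (line_number, matched_text) pairs, numbering lines from 1."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in PHONE.finditer(line):
            hits.append((number, match.group()))
    return hits


sample = ["call 206-555-0100 now", "nothing here", "fax (425) 555-0199"]
print(find_phone_numbers(sample))
```

The function also accepts an open file object, since iterating over a file yields its lines. Running it over 50,000 files then only takes a loop around it.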
Not a brilliant answer but everywhere I've worked the following holds true
0 < Number of people who (fully) understand regex < 1
If I knew how to do it I'd write that previous expression as a regex, but I can't. The best I could come up with on the fly is s/fully/a little/g - that's my limit (and that's probably not a regex).
A more serious answer is that the right regex will solve all kinds of problems, with one(ish) line of code. But you'll have real problems debugging it if it goes wrong. Therefore IMHO a complex regex however 'clean/clever' is a liability, if it takes ten lines of code to replicate it, why's that a problem, is memory/disk space suddenly expensive again?
BTW I'd love to know if regexs are fast compared to code equivalent.
It is not clear what kind of answer you are expecting.
I can imagine roughly three kinds of answer to this question:
Regexen are essential to the education of professional programmers. They enable the use of the powerful unix shell tools, and regex-based search-replace can dramatically cut down on the text-munging handiwork that is a part of a programmer's life. Programmers that do not know regexen are just intellectually lazy, which is a very bad trait for a programmer.
Regexps are kinda useful depending on the application domain. Surely, knowing how to write regexps is a valuable tool in a programmer's chest, but most of the time you can do fine without using them. Also, regexps tend to be very hard to read, so abuse must be strongly discouraged.
Some nutcases like to put regexs in everything (I'm looking at you, the perl guy who implemented a regex-based tetris in perl). But really, they are just a bit of computer science trivia whose only practical use is in writing parsers. They are widely taught because they make a good teaching topic on which to evaluate students, and like most such topics they can be forgotten the second you step out of the exam room.
You will notice the careful use of the plural forms "regexen" (pro), "regexps" (careful neutral) and "regexs" (con).
Personally, I am of the first kind. Good programmers like to learn new languages, and they hate repetitive handiwork.
When you have to parse something (ranging from simple date strings to programming languages) you should know your tools and regular expressions are one of them.
But you should also know what you can do with regexes and what not. At this point it comes in handy if you know the Chomsky hierarchy. Otherwise you end up trying to use regular expressions to parse context-sensitive languages and wonder why you can't get your regex right.
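A small illustration of where that line falls, as a Python sketch: arbitrarily nested parentheses are a standard example of a language that is not regular, because the matcher has to count, yet a few lines of ordinary code handle it. The function name is_balanced is made up for the example.

```python
def is_balanced(text):
    """Check that every ( has a matching ) at any nesting depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a closing parenthesis with nothing open
    return depth == 0
```

The unbounded counter is exactly what a finite automaton lacks. Some engines add recursion extensions that can handle this, but at that point the pattern is no longer a regular expression in the textbook sense.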
The fact that all languages support regexs should mean something !
I think knowing regex is quite an important skill. While the usage of regex in a programming environment/language is a question of maintainable code, I find the knowledge of regex to be useful with some commands (say egrep) and editors (vim, emacs etc.). Using a regex to do a find and replace in vim is very handy when you have a text file and you want to do some formatting once in a while.
I find it very useful to know regular expressions. They are a very powerful tool, and in my opinion there are problems that you simply can't solve without these.
I would however not take regular expressions as a killing criterion for "hiring you as a senior programmer". They are like the wealth of other tools in the world. You should really know them in a problem domain where you need them, but you cannot presume that someone already knows all of these.
"a junior is always allowed the leeway of training"
If a senior isn't, then I would not hire him!
To the ones that argue how complex and unreadable a regular expression is: If the regexp solution to a problem is complex and unreadable, then probably the problem itself is! Good luck in solving it in another way...
What does the following do?
"([A-Za-z][A-Za-z0-9+.-]{1,120}:A-Za-z0-9/{1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:#&~=%-]{0,1000}))?)"
How long did it take you to figure out? to debug?
Regexs are awesome for single-use throwaway programs, but long hairy regexps are not the best choice for programs that other people will need to maintain over the years.
I find that regex's can be very helpful depending on the type of programming that you do. However, I probably write less than one regex a month, and because of this long interval between requiring regex's I forget a lot about how they work.
I should probably go through Mastering Regular Expressions or something similar someday.
Knowing when to use a regexp and the basics of how they work and what their limitations are is important. But filling your head with a lot of syntax rules that you probably won't need very often is just a pointless academic exercise.
A regexp crib sheet can be written on one sheet of A4 paper or a couple of pages in a textbook, so there is no need to know this stuff by heart. If you use it every day it will stick. If you don't use it very often then the brain cells are probably better used for something else.
A developer thought he had one problem and tried to solve it using regex. Now he has 2 problems.
I agree with pretty much everything said here, and just need to include the mandatory quip:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
(attributed to Jamie Zawinski)
Like most jokes, it contains a kernel of truth.