Regex match limited number of words - regex

I need to match a limited number of words with regex. For example matching at most 2 words from a sentence.
EDIT: I'm using CoffeeScript and I tried
^([a-zA-Z0-9]+[^a-zA-Z0-9]*){1,3}
which seems to be working on http://rubular.com/r/ncNgZBo6Lq but not on my script. So probably it's not supported on this implementation.

It depends on which regex implementation you are using and on exactly what you mean by "word". One interpretation of your request would afford this regex as a solution:
/(\w+)\W+(\w+)/
To many regex engines, the \w represents a "word character", and the \W represents any character that is not a "word character". In contrast to simply looking for whitespace, that will pick up words separated only by punctuation as separate words, such as in series-of-hyphenated-words. Be careful, however, as what "word character" means to your regex engine may not be exactly what you want it to mean. For example, the above probably counts contractions such as "don't" as two words (but perhaps that's ok).
More generally, if you can create a regex that matches every individual "word" (whatever that means to you) but not anything else, then you can form a regex as
/(one-word-regex)regex-for-what-can-separate-words(one-word-regex)/
.

Related

RegEx: Non-repeating patterns?

I'm wrestling with how to write a specific regex, and thought I'd come here for a little guidance.
What I'm looking for is an expression that does the following:
Character length of 7 or more
Any single character is one of four patterns (uppercase letters, lowercase letters, numbers and a specific set of special characters. Let's say #$%#).
(Now, here's where I'm having problems):
Another single character would also match with one of the patterns described above EXCEPT for the pattern that was already matched. So, if the first pattern matched is an uppercase letter, the second character match should be a lowercase letter, number or special character from the pattern.
To give you an example, the string AAAAAA# would match, as would the string AAAAAAa. However, the string AAAAAAA, nor would the string AAAAAA& (as the ampersand was not part of the special character pattern).
Any ideas? Thanks!
If you only need two different kinds of characters, you can use the possessive quantifier feature (available in Objective C):
^(?:[a-z]++|[A-Z]++|[0-9]++|[#$%#]++)[a-zA-Z0-9#$%#]+$
or more concise with an atomic group:
^(?>[a-z]+|[A-Z]+|[0-9]+|[#$%#]+)[a-zA-Z0-9#$%#]+$
Since each branch of the alternation is a character class with a possessive quantifier, you can be sure that the first character matched by [a-zA-Z0-9#$%#]+ is from a different class.
About the string size, check it first separately with the appropriate function, if the size is too small, you will avoid the cost of a regex check.
First you need to do a negative lookahead to make sure the entire string doesn't consist of characters from a single group:
(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)
Then check that it does contain at least 7 characters from the list of legal characters (and nothing else):
^[a-zA-Z0-9#$%#]{7,}$
Combining them (thanks to Shlomo for pointing that out):
^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$

Regex for one word in sentence

So I am trying to match a (any) word(s) that would have:
At least one upper case letter
At least one lower case letter
At least one number
I currently got to this using lookaheads
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).+$
But I am not able to get this to match on one word. I tried to use \b around the lookaheads but it doesn't work. The thing the word that I am trying to match on can have the above conditions in any order. Example: aB5 OR Ba5 OR 5Ba etc.. Need some pointers.
The main problem is that . includes spaces. You need to change your .'s to be restricted to word-characters only, i.e. \w. Note that \w is (mostly) [A-Za-z0-9_], if you wish to exclude some of these or include more, you should make the appropriate changes.
Another thing is that if you're looking for words in a string, you need to remove ^ and $ because these mean the start and end of the string respectively.
Since all your requirements are "at least" (as opposed to "at most"), you don't really need \b because of matching happens left-to-right, so you can never get part of a word.
Regex:
(?=\w*\d)(?=\w*[a-z])(?=\w*[A-Z])\w+
Test.
I currently got to this using lookaheads
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).+$
But I am not able to get this to match on one word.
Lookaheads are the correct approach, but if you want to find single words only you must not allow every character (.) in between but only word-characters (like \w). So
/(?=\w*\d)(?=\w*[a-z])(?=.\w[A-Z])\w+/g
should do it. Of course you're free to allow more letters than only \w, maybe even \S.

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could workaround this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter. For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character, if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA_Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA_Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this let's you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Well, another thing along with what others mentioned with the negative lookahead, you can match consecutive characters (e.g. you can negate ui while with [^...], you cannot negate ui but either u or i and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.

A pattern matching an expression that doesn't end with specific sequence

I need a regex pattern which matches such strings that DO NOT end with such a sequence:
\.[A-z0-9]{2,}
by which I mean the examined string must not have at its end a sequence of a dot and then two or more alphanumeric characters.
For example, a string
/home/patryk/www
and also
/home/patryk/www/
should match desired pattern and
/home/patryk/images/DSC002.jpg should not.
I suppose this has something to do with lookarounds (look aheads) but still I have no idea how to make it.
Any help appreciated.
Old Answer
You can use a negative lookbehind at the end if your regex flavor supports it:
^.*+(?<!\.\w{2,})$
This will match a string that has an end anchor not preceded by the icky sequence you don't want.
Note that as m.buettner has pointed out, this uses an indefinite length lookbehind, which is a feature unique to .NET
New Answer
After a bit of digging around, however, I've found that variable length look-aheads are pretty widely supported, so here is a version that uses those:
^(?:(?!\.\w{2,}$).)++$
In a comment on an answer, you have stated you wanted to not match strings with forward slashes at the end, which is accomplished by simply adding a forward slash to the lookahead.
^(?:(?!(\.\w{2,}|/)$).)++$
Note that I am using \w for succinctness, but it lets underscores through. If this is important, you could replace it with [^\W_].
Asad's version is very convenient, but only .NET's regex engine supports variable-length lookbehinds (which is one of the many reasons why every regex question should include the language or tool used).
We can reduce this to a fixed-length lookbehind (which is supported in most engines except for JavaScrpit) if we think about the possible cases which should match. That would be either one or zero letters/digits at the end (whether preceded by . or not) or two or more letters/digits that are not preceded by a dot.
^.*(?:(?<![a-zA-Z0-9])[a-zA-Z0-9]?|(?<![a-zA-Z0-9.])[a-zA-Z0-9]{2,})$
This should do it:
^(?:[^.]+|\.(?![A-Za-z0-9]{2,}$))+$
It alternates between matching one or more of anything except a dot, or a dot if it's not followed by two or more alphanumeric characters and the end of the string.
EDIT: Upgrading it to meet the new requirement is just more of the same:
^(?:[^./]+|/(?=.)|\.(?![A-Za-z0-9]{2,}$))+$
Breaking that down, we have:
[^./]+ # one or more of any characters except . or /
/(?=.) # a slash, as long as there's at least one character following it
\.(?![A-Za-z0-9]{2,}$) # a dot, unless it's followed by two or more alphanumeric characters followed by the end of the string
On another note: [A-z] is an error. It matches all the uppercase and lowercase ASCII letters, but it also matches the characters [, ], ^, _, backslash and backtick, whose code points happen to lie between Z and a.
Variable length look behinds are rarely supported, but you don't need one:
^.*(?<!\.[A-z0-9][A-z0-9]?)$

Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What should be the regular expression for parsing the string inside "". In the above sample it is 'SystemTemperatureOutOfSpec'
In JavaScript, this regexp:
/"([^"]*)"/
ex.
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.
try this
string Exp = "\"!\"";
I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s
(?<=<A\s")(?<content>.*)(?="\s>)
Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters which you have to be aware of (*, ., etc). Anything that's not a special character matches itself, and " is one of those characters. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a range. A range is square brackets with a list of allowable characters. If the list begins with ^ it means it is a list of not-allowed characters. We want any characters other than a quote, so that means [^"].
"Some"
That range just means any one of the characters in the range, but we want "some". "Some" usually means either zero-or-more, or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the range [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parenthesis. To capture a part of a regular expression, put it in parenthesis. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
Are you sure you need regular expression matching here? Looking at your "string" you might be better off using a Xml parser?