Regex expression testing - regex

Hello, I am trying to use a regex to make sure the user has entered at least three dot points. I think I am close, but at the moment my expression returns unexpected results.
Here is my expression as of 30/01/17
/(•([\s]*[\w]+|[\.\,]*)*|\n){3,}/g
and here is the text snippet I am testing:
blahblahblahblahblahblahblah
blahblahblahblahblahblah
blahblahblahblahblahblah.
• blahblahblahblahblahblah.
•blahblahblahblahblahblah.
blahblahblahblahblahblah.
.•blahblah, blahblahblahblah.
NOTE: I put the full stop in place in the third dot point as it is the easiest way to trigger the unexpected result.
Thanks in advance for any feedback.

You can look for non dots [^.] then a dot [.] followed by non dots [^.] three times:
/(?:[^.]*[.][^.]){3,}/
Demo
You can use the same approach with • if that is the character you are referring to.

Use a look ahead:
^(?=(.*\.){3}).*
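As a rough illustration of the lookahead above, here is a minimal Python sketch. The helper name has_three_dots and the sample strings are made up for the example, and it checks one line at a time because . does not cross newlines by default.

```python
import re

# The lookahead asserts that three dots exist somewhere on the line;
# the trailing .* then consumes the line itself.
THREE_DOTS = re.compile(r"^(?=(.*\.){3}).*")

def has_three_dots(line):
    """Return True when the line contains at least three '.' characters."""
    return THREE_DOTS.match(line) is not None

print(has_three_dots("one. two. three."))  # True
print(has_three_dots("one. two."))         # False
```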

You can use this regex ^ *•.*$, and if it finds three matches, then you have three lines starting with a bullet point. You have to use the multiline (m) flag, so that ^ and $ also match the start and end of each line respectively. I can show you code if you tell me what language you are using.
If you only want to make sure the text contains three bullet points (ex: this ••• can work), then don't use regex at all; your language most likely has a function to count occurrences of a character in a string.
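Since no language was named, here is a small Python sketch of both suggestions, offered as one possible reading rather than the definitive one. The function name has_min_bullets and the sample text are assumptions made for the example.

```python
import re

# ^ and $ match at every line boundary because of re.MULTILINE.
BULLET_LINE = re.compile(r"^ *•.*$", re.MULTILINE)

def has_min_bullets(text, minimum=3):
    """True when at least `minimum` lines start with a bullet point."""
    return len(BULLET_LINE.findall(text)) >= minimum

sample = "intro\n• one.\n•two.\nplain line.\n.•three, four."

# Only two lines *start* with a bullet; the last one starts with a full stop.
print(has_min_bullets(sample))   # False
# Plain character counting sees all three bullets, wherever they are.
print(sample.count("•") >= 3)    # True
```

The two checks disagree on this sample, which is the difference between "three lines starting with a bullet" and "three bullets anywhere".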
Don't forget to upvote! ;)

Related

How to write a regular expression inside awk to IGNORE a word as a whole? [duplicate]

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" does not occur anywhere after the current position.
.*? is a non-greedy match of all characters up to the next "END". It's needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
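To see what the non-greedy pattern does in practice, here is a short Python check against the string from the question. The second input, with two START...END pairs, is an invented example that shows a limitation: because the lookahead rejects any START that has another START after it, only the last pair is found.

```python
import re

LAST_START = re.compile(r"START(?!.*START).*?END")

# The string from the question: the first START is skipped because
# another START follows it.
innermost = LAST_START.search("abcSTARTabcSTARTabcENDabc").group(0)
print(innermost)  # STARTabcEND

# With several pairs, only the final START can pass the lookahead.
only_last = LAST_START.findall("STARTaENDSTARTbEND")
print(only_last)  # ['STARTbEND']
```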
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this by adding S to [^ST] and by adding S* before the non-optional S, to allow for arbitrary repetitions of S otherwise.
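The lookahead-free pattern is hard to check by eye, so here is a small Python sketch that runs the first (greedy) version against the string from the question and against an SSTART input of the kind mentioned in the bug fix. The second test string is made up for the example, and this only spot-checks two inputs rather than proving the pattern correct.

```python
import re

# The "pedestrian" pattern, copied from the answer; no lookahead is used.
NO_LOOKAHEAD = re.compile(
    r"START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END"
)

first = NO_LOOKAHEAD.search("abcSTARTabcSTARTabcENDabc").group(0)
print(first)   # STARTabcEND

# An S directly before START must not let the match run across START.
second = NO_LOOKAHEAD.search("STARTaSSTARTbEND").group(0)
print(second)  # STARTbEND
```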
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
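The difference between the two quantifiers can be reproduced in Python as well as in .NET. This is a minimal sketch with an invented input string:

```python
import re

text = "xSTART END ENDy"

# Greedy: runs to the last END that is reachable without crossing a START.
greedy = re.search(r"START(?:(?!START).)*END", text).group(0)
# Lazy: stops at the first END.
lazy = re.search(r"START(?:(?!START).)*?END", text).group(0)

print(greedy)  # START END END
print(lazy)    # START END
```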
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments, this would not work. I was forgetting that the excluded character is still consumed, so you would need something such as ...|STA(?![^R])| to still allow that character to be part of END; as written, the pattern fails on something such as STARTSTAEND. The lookahead answer is therefore clearly the better choice; the following shows the proper way to use the capture groups.]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group. I have written every group except the middle one as (?:x), which marks it as non-capturing, so only the middle group captures. You could also capture the START/END tokens if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.
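For comparison, the same search-and-replace can be sketched in Python. Note that Python's re.sub writes the group reference as \1 in the replacement string rather than $1; the variable names here are only for illustration.

```python
import re

# Only the middle group captures; START and END are matched but not captured.
PAIRED = re.compile(r"(?:START)((?!.*START).*)(?:END)")

source = "abcSTARTdefSTARTghiENDjkl"
inner = PAIRED.search(source).group(1)
replaced = PAIRED.sub(r"BEGIN\1FINISH", source)

print(inner)     # ghi
print(replaced)  # abcSTARTdefBEGINghiFINISHjkl
```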

How does one go about de-composing a regular expression?

Is there a concept of scope in regular expressions?
In this
^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$
for matching a 10 digit North American telephone number, with or without parentheses, hyphens or dots (another one of my attempts at understanding regular expressions)
I'm having trouble understanding, when you go about decomposing an expression like this, how do you go about it? How do you tell what is scoped from this to that?
Okay, it starts with a ^ and ends with a $, the two ends of the line.
Just before the end there is a three digit number followed by an optional dot or hyphen, and a four digit number. That part is clear.
So that leaves us with
(\(\d{3}\)|^\d{3}[.-]?)?
What is the purpose here of the caret, if we already had one at the beginning?
And what does this tell us, apart from the fact that the first three digit number can be in parentheses, or without them and followed by a dot or a hyphen?
I'm trying to figure out a sort of systematic way, when I find an unknown expression somewhere, to decompose it and see what it does.
Edit: From what others suggested in the comments, the second caret seems to be unnecessary. Testing it in RegexPal confirmed that on the following
^(\(\d{3}\)|\d{3}[.-]?)?\d{3}[.-]?\d{4}$
^(123)456.7890
^123.456.7890
^456.7890
but not
^ (123)456.7890
^ 123.456.7890
(caret designating the beginning of the line). Can anyone think of an example where the second caret would be needed?
Answering this question now, as it looks like you had some unresolved questions.
Is there a concept of scope in regular expressions?
Can anyone think of an example where the second caret would be needed?
Yes, kind of, and yes. Let's start with the second.
Multiple Carets
You can have multiple carets, and that can be quite helpful.
Here is a simple example (demo here):
(?<=^|\b)dog|^cat
This regex either matches:
1. dog, if the lookbehind (?<=^|\b) can successfully assert that it is either at the beginning of a line, or preceded by a word boundary (therefore, dog in hotdog will not match), or...
2. cat, if it is at the beginning of a line.
In this particular example, you could rearrange the grouping to rewrite this as ^(?:dog|cat)|\bdog, but that is not the point. The point is that multiple carets are possible and potentially useful: at several points in the regex, you may want to assert that the engine is currently positioned at the beginning of the string or line (which the ^ anchor does).
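A quick way to try this is the rearranged form ^(?:dog|cat)|\bdog in Python. The sketch below uses that form rather than the lookbehind version, and assumes re.MULTILINE so that ^ matches at the start of each line; the sample text is invented.

```python
import re

# One caret, but it guards two alternatives; \bdog covers dog elsewhere.
ANIMALS = re.compile(r"^(?:dog|cat)|\bdog", re.MULTILINE)

text = "dog cat\ncat hotdog dog"
found = ANIMALS.findall(text)

# "cat" on line 1 is not at a line start, and "dog" inside "hotdog"
# has no word boundary before it, so neither is reported.
print(found)  # ['dog', 'cat', 'dog']
```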
Scope
You are wanting to ask about the "scope" of the caret. The scope of any token t is the exact position p in the string where the engine is currently positioned. The engine can only match the token there. If it fails to match t at that position, and there is no backtracking possible, then the match fails. Next, the engine attempts a whole new match starting one position in the string further from the one where it started the match attempt. (Usually that is unrelated to p, unless t was the pattern's first token.) During that match attempt, if the engine manages to match all the tokens before t, then once again the scope of t will be the position of the engine in the string at that time. That position may or may not be the same as the earlier p.
Hope this helps, let me know if any questions remain. :)
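On the more general question of how to decompose an unknown expression, one approach that may help is to rewrite it with one token per line and a comment on each. The sketch below does this for the phone pattern (the version without the redundant second caret) using Python's re.VERBOSE flag; the comments are my own reading of the pattern, not something stated in the thread.

```python
import re

PHONE = re.compile(r"""
    ^                  # start of the string
    (                  # optional area code, in one of two forms:
        \(\d{3}\)      #   three digits in parentheses, e.g. (123)
      |                #   or
        \d{3}[.-]?     #   three bare digits plus an optional dot or hyphen
    )?
    \d{3}[.-]?         # three digits plus an optional dot or hyphen
    \d{4}              # four digits
    $                  # end of the string
""", re.VERBOSE)

for candidate in ["(123)456.7890", "123.456.7890", "456.7890", " 123.456.7890"]:
    print(repr(candidate), bool(PHONE.match(candidate)))
```

Read this way, the "scope" of the | is simply the parenthesised group it sits in, and the ? after the closing parenthesis applies to that whole group.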

Regex: Search for verb roots

I've seen the results for classifying verbs by their endings. But I want to use Regular Expressions to find verb roots for regular verbs in Spanish.
I'm using this fancy site: http://regexpal.com/
Which I suspect may not be compatible with my end use, but will be a great starting point.
From what I have seen, the caret should identify all strings after it based on your supplied string-pattern.
So, to me:
^gust
Should find "gusta", "gustan", "gustamos", "gustas","gustar".
I know that I'm way off, but looking at many of the pages and tutorials and examples, I don't see anything that looks similar to what I want to do.
When you look for regex matching you'll get only the matching part, meaning, in case you have the word "gustan" and you're trying to match it with ^gust like you suggested, the output of the matcher will be "gust" - which is not what you want (you want the whole word).
So instead of matching to ^gust try matching to ^gust\w*$ which means anything that starts with "gust" and has zero or more word characters following it, up to the end of the line.
^(gust[a-zA-Z]*)$
^ denotes the start of the line
[a-zA-Z] letters only
* means zero or more
() is called a capture group
$ is the end of the line
If you want to edit with different words you could do this...
^((?:gust|otherwords)[a-zA-Z]*)$
all you have to change/edit is |otherwords this will allow you to add more words that you want to match.
Please read more about regex here and use debuggex.com to experiment.
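Since the eventual target environment was not stated, here is one way the whole-word match could look in Python. The function name words_with_root and the word list are assumptions for the example.

```python
import re

def words_with_root(root, words):
    """Return the words that start with `root`, matched as whole words."""
    # re.escape guards against roots that contain regex metacharacters.
    pattern = re.compile(rf"^{re.escape(root)}\w*$")
    return [word for word in words if pattern.match(word)]

candidates = ["gusta", "gustan", "gustamos", "disgusto", "gustar", "gas"]
matches = words_with_root("gust", candidates)
print(matches)  # ['gusta', 'gustan', 'gustamos', 'gustar']
```

"disgusto" is rejected because the ^ anchor requires the root at the very start of the word.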

regular expression to remove the first word of each line

I am trying to make a regular expression that grabs the first word (including possible leading white space) of each line. Here it is:
/^([\s]+[\S]*).*$/\1//
This code does not seem to be working (see http://regexr.com?34o6m). The code is supposed to
Begin at the start of the line
Create a capturing group where it places the first word (with possible leading white space)
Grab the rest of the line
Substitute the entire line with just the inside of the first capturing group
I tried another version also:
/\S(?<=\s).*^//
It looks like this one fails too (http://regexr.com?34o6s). The goal here was to
Find the first non-whitespace character.
Look behind to make sure it has a whitespace character behind it (i.e. not the first letter of the line).
Grab the rest of the line.
Erase everything the expression just grabbed.
Any insight to what is going wrong would be greatly appreciated. Thanks!
Try this regular expression
^(\s*.*?\s).*
Demo: gskinner
You mixed up your + and *.
/^([\s]*[\S]+).*$/\1/
This means zero or more spaces followed by one or more non-spaces.
You might also want to use $1 instead of \1:
/^([\s]*[\S]+).*$/$1/
Okay, well this seems to work using replace() in Javascript:
/^([\s]*[\S]+).*$/
I tested it on www.altastic.com/regexinator, which as far as I know is accurate [I made it though, so it may not be ;-) ]
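The corrected pattern can also be tried in Python. This sketch makes one small change that is my own assumption, not part of the answers above: the leading whitespace is written as [ \t]* instead of [\s]*, so that with re.MULTILINE the match cannot start on an empty line and spill into the next one. Python uses \1 for the group reference in the replacement.

```python
import re

# Group 1: optional leading spaces/tabs plus the first run of non-whitespace.
FIRST_WORD = re.compile(r"^([ \t]*\S+).*$", re.MULTILINE)

def keep_first_word(text):
    """Replace every line with just its first word (and leading blanks)."""
    return FIRST_WORD.sub(r"\1", text)

sample = "  hello world\nfoo bar baz"
print(repr(keep_first_word(sample)))  # '  hello\nfoo'
```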
Remove the first two words:
#"^.*? .*? "
This works for me.
When I first posted this, the asterisk signs did not show, and I have no idea why. So, spelled out: if you want to remove only the first word, build the regex as follows
a dot sign
an asterisk sign
a question mark
a space
and replace the match with ""
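Spelled out in Python, the recipe above might look like the following sketch. It assumes words are separated by single spaces and that the line has no leading space; with a leading space, the first "word" removed would be empty.

```python
import re

line = "hello big world"

# A dot, an asterisk, a question mark, a space: the shortest run up to a space.
without_first = re.sub(r"^.*? ", "", line)
# The same step twice removes the first two words.
without_first_two = re.sub(r"^.*? .*? ", "", line)

print(without_first)      # big world
print(without_first_two)  # world
```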

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
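As an illustration, this is how the pattern could be used to filter lines in Python. The sample lines are invented, and the plain substring test at the end is included only as the simpler alternative when the excluded text is a fixed string.

```python
import re

# Every character on the line must not be the start of "Andrea".
NO_ANDREA = re.compile(r"^(?:(?!Andrea).)*$")

lines = ["Hello Andrea", "Hello Robert", "andrea in lowercase"]

kept = [line for line in lines if NO_ANDREA.match(line)]
print(kept)  # ['Hello Robert', 'andrea in lowercase']

# For a fixed string, a substring test gives the same result without a regex.
same = [line for line in lines if "Andrea" not in line]
print(same == kept)  # True
```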
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile(r'(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character": letters, digits and the underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
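The two-step approach could be sketched as follows. It rests on one interpretation of the question that the thread does not confirm: that "a string of six letters" means a standalone six-letter word. The helper name wanted is hypothetical.

```python
import re

# Step 1: find standalone six-letter words.
SIX_LETTERS = re.compile(r"\b[A-Za-z]{6}\b")

def wanted(line):
    """True when the line has a six-letter word other than 'Andrea'."""
    # Step 2: ordinary program logic handles the exclusion.
    return any(word != "Andrea" for word in SIX_LETTERS.findall(line))

print(wanted("Hello Andrea"))  # False
print(wanted("Hello Robert"))  # True
print(wanted("no match"))      # False
```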
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
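A self-contained version of that one-liner, with an invented pattern and input, might look like this:

```python
import re

def strip_matches(pattern, text):
    """Return `text` with everything that matches `pattern` removed."""
    return re.sub(pattern, "", text)

# Everything that is NOT a run of digits is what remains.
leftover = strip_matches(r"\d+", "abc123def456")
print(leftover)  # abcdef
```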
(?! is useful in practice, although strictly speaking a lookahead is not part of regular expressions as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
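Here is a small Python sketch of the before/after idea, using the fixed string Andrea in place of yourRegex. Python's re module only accepts fixed-width lookbehinds, so this form works for a literal like this one but not for patterns of variable length, which may be related to the quantifier problem mentioned above. The helper name around is made up.

```python
import re

BEFORE = re.compile(r"^(?:(?!Andrea).)*")  # everything up to the first match
AFTER = re.compile(r"(?<=Andrea).*")       # everything after the first match

def around(text):
    """Return (text before 'Andrea', text after 'Andrea')."""
    before = BEFORE.match(text).group(0)
    hit = AFTER.search(text)
    after = hit.group(0) if hit else ""
    return before, after

print(around("Hi Andrea, welcome"))  # ('Hi ', ', welcome')
print(around("nobody here"))         # ('nobody here', '')
```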
In Perl you can do:
process($line) if ($line !~ /Andrea/);