I need to parse an array-like text with a regular expression and get the match groups.
One example of the text I want to parse is this:
['red','green', 'blue']
I want to use match groups, because I want to extract them.
I am using this regular expression, but the groups found by it are not like what I expected:
\[ *('.+?')( *, *('.+?'))* *\]
The idea is to parse in this order:
A square bracket
Any number of spaces
A group with:
  Single quote
  Any character
  Single quote
Zero or more groups of:
  Any number of spaces
  A comma
  Any number of spaces
  A group with:
    Single quote
    Any character
    Single quote
Any number of spaces
A square bracket
And get one group with each parsed array element.
Can you help me?
Hint: an easy way to test a regexp is the site http://rubular.com
This isn't going to be a definitive answer, but note that " *" only matches literal space characters; to match any whitespace you would normally use \s*, and the details may depend on the language you're using.
Here's a C# regex example that shows some of the language requirements to check for whitespace: regex check for white space in middle of string
Edit: I see you added Ruby as your language. Unfortunately I'm not well versed in Ruby, so I can't help you with the specifics, sorry.
Edit2: Seeing as you're forcing yourself into Ruby to debug your regex statement, might I suggest: http://www.debuggex.com/ which tries to stay language independent?
Try this regex: '([^']+)', it should give you the following match groups red, green, blue according to rubular.com
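A note on why the pattern in the question yields fewer groups than expected: in most engines, a capturing group inside a repetition keeps only its last iteration. The sketch below uses Python's re module purely for illustration (the question targets Ruby, and the variable names are made up) to compare that behaviour with collecting every element via the pattern suggested above:

```python
import re

text = "['red','green', 'blue']"

# Pattern from the question: group 3 sits inside a repeated group, so it is
# overwritten on every iteration and only the last value survives.
original = re.fullmatch(r"\[ *('.+?')( *, *('.+?'))* *\]", text)
first_item, last_item = original.group(1), original.group(3)

# The suggested pattern, applied repeatedly, gives one capture per element.
items = re.findall(r"'([^']+)'", text)

print(first_item, last_item)
print(items)
```

Note that the second pattern on its own does not check the brackets or commas; it only extracts the quoted parts.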
You can match an arbitrary number of groups with one regex:
^\[\s*|(?:\G'([^']+)'\s*(?:,\s*|]$))+
or like this (should be more performant):
^\[\s*+|(?>\G'([^']++)'\s*+(?>,\s*+|]$))++
This works in Ruby, as asked; I don't know whether it works in Delphi.
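Python's built-in re module has no \G anchor, so the patterns above cannot be used there as written. A rough alternative, sketched below under the assumption that two passes are acceptable, is to validate the overall shape first and then extract the elements. The names ARRAY, ITEM and parse_array are hypothetical:

```python
import re

# Whole-string shape: [ 'a' , 'b' , ... ] with optional spaces.
ARRAY = re.compile(r"\[ *'[^']+'(?: *, *'[^']+')* *\]")
# A single quoted element; group 1 is the text between the quotes.
ITEM = re.compile(r"'([^']+)'")

def parse_array(text):
    """Return the list of elements, or None if text is not a valid array."""
    if not ARRAY.fullmatch(text):
        return None
    return ITEM.findall(text)

print(parse_array("['red','green', 'blue']"))
print(parse_array("['red' 'green']"))  # missing comma, so rejected
```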
I am trying to create a bunch of YAML files, mostly composed of strings of text. Now when using apostrophes in words, they must be escaped by typing a double apostrophe, because I’m using apostrophes to wrap the strings.
I want to create a regex that will check for apostrophes in the text that aren’t double. What I have is this:
^([^'\n]*?)'(([^'\n]*?)'(?!')([^'\n]+?))*?'$\n
https://regex101.com/r/v4nUTn/3
My issue is that as soon as my string has a double apostrophe, but also has an apostrophe which isn’t a double apostrophe, it doesn’t match because my negative lookahead doesn’t match as soon as it sees the double apostrophe. (for example the string t''e'st won’t match even though it is missing a double apostrophe after the e)
How can I make it so that my negative lookahead will not fail as soon as it sees one double apostrophe?
This regex should work:
\w'\w
Test here.
My guess is that maybe an expression similar to
('[^'\r\n]*'|[^\r\n\w']+)|([\w']*)
would be an option to look into.
If the second capturing group captures anything, then the string is undesired.
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
One suggestion would be to do this in two steps.
For example, if every 'candidate' value looks like this: - 'something here' (where you want to test the apostrophes in the something here content of the string), then first isolate that content via:
/^\s*- '(.+)'$/im
And then make sure all apostrophes appear as you want them to appear within match group 1 of the result.
Then, replace the original match with your 'sanitised' match.
Doing this means you don't have to be concerned with the bounding apostrophes causing complications to the check for apostrophes in the value.
Note: there may well be a perfect one-step regex to do this, but understanding that you can break tasks into several steps is useful if you spend a lot of time with regular expressions, and can help you sidestep 'perfect regex paralysis'.
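The two-step idea could look roughly like the following in Python. This is only a sketch: the second step (removing every doubled apostrophe and checking whether any apostrophe is left) is an assumption about what "as you want them to appear" means, and the function name is made up:

```python
import re

# Step 1: isolate the content between the bounding apostrophes of each
# "- '...'" line (same pattern as above, multiline mode).
LINE = re.compile(r"^\s*- '(.+)'$", re.MULTILINE)

def values_with_lone_apostrophes(text):
    """Return the values that contain an apostrophe which is not doubled."""
    bad = []
    for match in LINE.finditer(text):
        content = match.group(1)
        # Step 2: drop every '' pair; anything left over is unescaped.
        if "'" in content.replace("''", ""):
            bad.append(content)
    return bad

sample = "- 'it''s fine'\n- 't''e'st'\n- 'plain'"
print(values_with_lone_apostrophes(sample))
```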
If you want your string to match when there is at least one lone single quote between the surrounding single quotes, the regex should be allowed to consume either text that contains no single quote or a pair of single quotes. To do that, add |'' as an alternative in your regex, so that it consumes either non-single-quote text or a doubled single quote.
Try this updated regex demo and see if this works like you wanted?
https://regex101.com/r/v4nUTn/4
I'm hoping we have some regular expression gurus here that might be able to help me - a regex newbie - solve a problem.
I know some people will want to know some background info on this issue:
Regex Flavor: Basic Regex, being used in a Vertica Database using the REGEXP_REPLACE function.
The regex I am using is working great with one exception.
I have a rule that I'm trying to implement, related to stripping the numbers from text, where any number that is part of a word, e.g. table5, go2market, 33monroe, room222, etc. is ignored and NOT filtered.
Here is what I started with for detecting numbers:
[-+]?[0-9]*\.?[0-9]
That seems to work pretty well, including handling directly adjacent commas and parentheses for example.
But all cases where a number is part of alphabetic text are also being detected, which fails the rule that it cannot be part of a word, and by word, I mean any alphabetic text.
So, in searching for solutions, I happened upon this regex that seems to work well detecting those specific cases where numbers appear next to, or in, any string of characters:
((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
My thought was that maybe I could add this as an INVERTED match to my original regex, to allow it to still select standalone numbers while ignoring those that were a part of a word, like so:
[-+]?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)*\.?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
Unfortunately however, it breaks the original detection of standalone numbers.
:(
I'm hoping there is someone here that can spot what I'm doing wrong, and help me identify the right solution?
Thanks in advance!
According to Vertica documentation, the regex flavour seems to follow the Perl syntax. In this case you can use negative lookarounds and in particular a negative lookbehind: (?<!\w) (not preceded with a word character.)
Lookarounds are only tests and don't consume characters.
You can also use a negative lookahead to test the right part, (?!\w) (not followed by a word character), but it's simpler to use a word boundary since the pattern ends with a digit (that is also a word character):
(?<!\w)[-+]?\d*\.?\d+\b
In the worst case, if you have something like v1.0 in your string and you want to avoid it, you can try to use the backtracking control verbs (*SKIP) and (*FAIL). (*FAIL) forces the pattern to fail and (*SKIP) skips all the already matched positions before it. I hope Vertica supports these Perl regex features.
Something like:
\p{L}+[-+]?\d*\.?\d+(*SKIP)(*FAIL)|[-+]?\d*\.?\d+(*SKIP)(?!\p{L})
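To see what the simpler lookbehind pattern does, here is a small Python sketch (Python rather than Vertica, so treat it only as an approximation of how REGEXP_REPLACE would behave; the sample text and function name are invented). Python's re module does not support (*SKIP)/(*FAIL), so only the first pattern is shown:

```python
import re

# Not preceded by a word character, optional sign, digits with an optional
# decimal point, then a word boundary so "33monroe" is left alone.
STANDALONE_NUMBER = re.compile(r"(?<!\w)[-+]?\d*\.?\d+\b")

def find_standalone_numbers(text):
    return STANDALONE_NUMBER.findall(text)

sample = "table5 costs 42, (3.5) go2market 33monroe room222 -7"
print(find_standalone_numbers(sample))
```

As the answer warns, a token such as v1.0 is still partly matched by this pattern (the 0 after the dot), which is the case the control verbs are meant to handle.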
I am trying to form a regular expression that will match strings that do NOT end with a DOT FOLLOWED BY A NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?
You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$ - which adds an anchor after the number, so the rejected suffix must be a dot followed only by digits at the very end of the string (not that you specified that in your question).
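As a quick check, the edited pattern can be run against the eight sample strings from the question. This is a Python sketch; the question does not say which engine is in use, so behaviour elsewhere may differ slightly:

```python
import re

NO_TRAILING_DOT_NUMBER = re.compile(r"^((?!\.[\d]+$)[\w.])+$")

samples = [
    "abcd1", "abcdf12", "abcdf124",
    "abcd1.0", "abcd1.134", "abcdf12.13", "abcdf124.2", "abcdf124.21",
]

# Keep only the strings that do not end in a dot followed by digits.
accepted = [s for s in samples if NO_TRAILING_DOT_NUMBER.match(s)]
print(accepted)
```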
If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*\.?[^0-9]*$
You may omit anchoring metasymbols ^ and $ if you're using function/tool that matches against whole string.
Explanation: This regex allows any symbols except a dot until an (optional) dot is found, after which all non-numerical symbols are allowed. It won't work for numbers in improper format, like in the strings abcd1...3 or abcd1.fdfd2. It also won't work correctly for some strings with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambiguous).
Philosophical explanation: When using regular expressions, you often don't need to do exactly what your task seems to require, so even a simple regex will do the job. An abstract example: you have a file whose lines are either numbers or some complicated names (without digits), and you want to filter out all the numbers; then simple filtering on the first character, e.g. grep '^[^0-9]', will do the job.
But if your task is more complex and requires validation of format and doing other fancy stuff on data, why not use a simple script(say, in awk, python, perl or other language)? Or a short "hand-written" function, if you're implementing stand-alone application. Regexes are cool, but they are often not the right tool to use.
I would just use a simple negative look-behind anchored at the end:
.*(?<!\\.\\d+)$
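Not every engine accepts a variable-length lookbehind such as the one above (the doubled backslashes suggest it was written as a string literal). Python's re module, for example, only allows fixed-width lookbehind. The sketch below shows that, and an equivalent check written with a lookahead instead; the helper name is invented for illustration:

```python
import re

# Python's re rejects a lookbehind whose width is not fixed, so this
# pattern (written with single backslashes) fails to compile.
try:
    re.compile(r".*(?<!\.\d+)$")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Same idea with a lookahead from the start of the string:
# "there is no way to reach the end through a dot followed by digits".
NOT_DOT_NUMBER_AT_END = re.compile(r"^(?!.*\.\d+$).*$")

def ends_without_dot_number(s):
    return NOT_DOT_NUMBER_AT_END.match(s) is not None

print(lookbehind_supported)
print([s for s in ["abcd1", "abcd1.0", "a b.c"] if ends_without_dot_number(s)])
```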
When you address a regex capture, things can get tricky when digits follow the capture. In PCRE, I can write
${1}000
to substitute the capture of Group 1 followed by three zeroes.
Does anyone know the equivalent syntax in Dreamweaver replace operations, if any?
If we had a series of "A"s instead of zeroes, we could use:
$1AAAA
But these:
$10000
${1}0000
do not work.
I believe the regex flavor is ECMAScript. Just cannot find the information.
This may not be addressed in the syntax. If so, that would be good to know.
Thank you!
Edit: I should add that this is not a matter of life and death as I have a number of grep tools at my fingertips. I would just like to know.
Dreamweaver's regular expression find and replace is supposed to be based on JavaScript's implementation of RegExp. You should be able to just use $1000 in the replacement text. However, like you've found, the replacement groups ($ + group number) are not properly recognized when the replacement text has digits immediately after the grouping token.
FWIW: I've logged a bug on this at http://adobe.ly/DWwish
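For comparison only, this is how the same ambiguity is avoided in Python's re module, which offers an explicit \g<1> form for a group reference followed by digits. It says nothing about Dreamweaver's behaviour and is just a sketch of the idea behind PCRE's ${1}000:

```python
import re

# \g<1> names group 1 unambiguously, so the following zeroes are literal.
result = re.sub(r"(\w+)", r"\g<1>000", "abc")
print(result)
```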
I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^((?!(some text)).)*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
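A short Python sketch of the same pattern applied line by line (the sample lines are invented):

```python
import re

# At every position, check that "Andrea" does not start here, then
# consume one character; the whole line must be consumed this way.
NO_ANDREA = re.compile(r"^(?:(?!Andrea).)*$")

lines = ["Hello Andrea", "Hello Robert", "Andreas was here", "no names here"]
kept = [line for line in lines if NO_ANDREA.match(line)]
print(kept)
```

Because the lookahead is evaluated before every character, this does noticeably more work than a plain substring test on long lines.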
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile(r'(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character": a letter, digit, or underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
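The program-logic route might look like this in Python. It is a sketch that assumes "a string of six letters" means a whole six-letter word; the function name and sample lines are hypothetical:

```python
import re

# Whole words made of exactly six ASCII letters.
SIX_LETTERS = re.compile(r"\b[A-Za-z]{6}\b")

def has_six_letter_word_other_than(line, excluded="Andrea"):
    """True if the line contains a six-letter word that is not `excluded`."""
    return any(word != excluded for word in SIX_LETTERS.findall(line))

lines = ["Hello Andrea", "Hello Robert", "short", "Andrea met Joseph"]
kept = [line for line in lines if has_six_letter_word_other_than(line)]
print(kept)
```

Splitting the test this way keeps the regex trivial and leaves the "not Andrea" part to ordinary comparison, which is usually easier to read and adjust.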
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
(?! is useful in practice, although strictly speaking, lookahead is not part of regular expressions as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if ($line !~ /Andrea/);