regex: match until the first ] character is found

I'm trying to match up to the first occurrence of ], but I can't get it to work. Could someone help me figure this out?
The string I'm matching against:
[plugin:tabs][tab title="test"]Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam sit amet nisl nisl. Ut interdum libero vitae quam ultricies et lacinia elit aliquet. Praesent tincidunt, sem tempus feugiat feugiat, turpis tellus scelerisque erat, sit amet feugiat neque arcu ac lectus. Sed at mi et elit interdum scelerisque vitae eu felis.[/tab][/plugin]
What it should match:
[plugin:tabs]
What it keeps matching:
[plugin:tabs][tab title="test"]
The regex:
(\[plugin:(?<identifier>[^\s]+)(?<parameters>.*?)\])
EDIT:
What it should also match:
[plugin:tabs test="test"]

You just need to add a ? to make the quantifier lazy, so that it matches as few characters as possible:
(\[plugin:(?<identifier>[^\s]+?)(?<parameters>.*?)\])
Although the (?<parameters>.*?) part is unnecessary then.
So your final regex would look like this:
(\[plugin:(?<identifier>[^\s]+?)\])
Edit: See @stema's answer.

Try this:
(\[plugin:(?<identifier>[^\]\s]+)(?<parameters>.*?)\])
See it here on Regexr
In addition to whitespace characters, this also excludes the ] character from the first named group, so the identifier can no longer run past the closing bracket.
If you don't need the outer capturing group, you can make it a non-capturing group by adding ?: right after the opening parenthesis.
(?:\[plugin:(?<identifier>[^\]\s]+)(?<parameters>.*?)\])
To keep the space between the identifier and the parameters out of the second group, match optional whitespace between the two groups:
(?:\[plugin:(?<identifier>[^\]\s]+)\s*(?<parameters>.*?)\])
See it here on Regexr
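As a quick check of the last pattern, here is a minimal Python sketch. It assumes Python's `re` module, which spells named groups `(?P<name>...)` instead of `(?<name>...)`; the helper name `parse_plugin_tag` is made up for this illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for named groups.
# [^\]\s]+ stops the identifier at the first ] or whitespace character.
PLUGIN_TAG = re.compile(
    r"\[plugin:(?P<identifier>[^\]\s]+)\s*(?P<parameters>.*?)\]"
)


def parse_plugin_tag(text):
    """Return (identifier, parameters) for the first plugin tag, or None."""
    match = PLUGIN_TAG.search(text)
    if match is None:
        return None
    return match.group("identifier"), match.group("parameters")


print(parse_plugin_tag('[plugin:tabs][tab title="test"]Lorem[/tab][/plugin]'))
print(parse_plugin_tag('[plugin:tabs test="test"]'))
```

The first call stops at the first ] and yields an empty parameters group; the second yields the attribute string without the leading space.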

With any language that supports lookbehinds that will be your easiest solution.
/(^(?<!])*)/

Related

Ignore a substring in RegEx pattern

I want to ignore a certain substring inside a match, rather than reject the match when the substring is present.
For example
I have the text:
Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-
ty egeet qwewerty lectus. Proinera risus massa, placerat in q-
werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-
erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,
euismod at augue at, dapibus feugiat qweerty
I need to find all qwerty, even if it contains -\n.
My solution is to add (?:-\n)? after every character:
/q(?:-\n)?w(?:-\n)?e(?:-\n)?r(?:-\n)?t(?:-\n)?y/gm
But it looks bulky (even for this example, which has only 6 characters), and it makes the regex hard to modify later. Is there a trick to make the regex shorter?

No, regex is not good at this kind of match. The easiest way would be to remove - and \n first.
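A minimal Python sketch of that preprocessing step, assuming the text is available as a single string; the function name `count_word` and the shortened sample are made up for this illustration.

```python
import re


def count_word(text, word):
    """Count occurrences of `word` after undoing end-of-line hyphenation."""
    # Remove every hyphen that is immediately followed by a line break.
    joined = text.replace("-\n", "")
    return len(re.findall(re.escape(word), joined))


sample = (
    "adipiscing eliti qwer-\n"
    "ty egeet qwewerty lectus. placerat in q-\n"
    "werty sed, dolor qwerty qw-\n"
    "erty, molestie"
)
print(count_word(sample, "qwerty"))  # 4
```

One caveat with this approach: match positions refer to the cleaned-up string, not the original, and a genuine hyphenated compound that happens to break at the end of a line is joined as well.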

Regex to capture everything before pattern in Google Sheets

I’m having a hard time figuring out the regex code in Google Sheets to check a cell then return everything including new lines \n and returns \r before a certain pattern \*+.
A little more background: I'm using REGEXEXTRACT(A:A,"...") format inside a bigger ArrayFormula so that it automatically updates when a new row is added. This one’s working properly. It’s only the regex part I’m having trouble with.
So, for the purpose of this question, let's say I'm only worried about extracting the data from the A1 cell before a certain pattern and return that value in cell B1. Which brings us to this code in cell B1:
REGEXEXTRACT(A1,"...")
For example, this is what my A1 cell looks like:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan risus id ex dapibus sodales.
Curabitur dui lacus, tincidunt vel ligula quis, volutpat mattis eros.
In quis metus at ex auctor lobortis. Aliquam sed nisi purus. Sed cursus odio erat, ut tristique sapien interdum interdum. Morbi vel sollicitudin ante, non pellentesque libero.
***********
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Aenean egestas urna facilisis massa posuere, quis accumsan erat ornare.
Curabitur at dapibus nibh. Nam nec vestibulum ligula. Phasellus bibendum mi urna, ac hendrerit libero interdum non. Suspendisse semper non elit aliquam auctor.
Morbi vel sem tortor. Donec a sapien quis erat condimentum consequat in ut sem. Quisque in tellus sed est lobortis ultricies sed vitae enim.
I want to return this value in B1:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan risus id ex dapibus sodales.
Curabitur dui lacus, tincidunt vel ligula quis, volutpat mattis eros.
In quis metus at ex auctor lobortis. Aliquam sed nisi purus. Sed cursus odio erat, ut tristique sapien interdum interdum. Morbi vel sollicitudin ante, non pellentesque libero.
Which is basically anything before the pattern *******. In Python, I can add the re.DOTALL to the .* but I can't get this to work in Google Sheets.

To make a dot match line breaks, you need to add (?s) to the pattern. To match any char, you may use a .. To match up to the leftmost occurrence, use lazy quantifier, *?. To actually extract a substring you need, wrap the part of the pattern you are interested in getting with capturing parentheses.
So, to match up to the first ******* substring, you may use
(?s)^(.*?)\*\*\*\*\*\*\*
or (?s)^(.*?)\*{7}. See the regex demo (note that Go regex engine is also RE2, so you may test your patterns there, at regex101.com).
(?s) - a DOTALL modifier
^ - start of string
(.*?) - Group 1: any 0+ chars as few as possible
\*\*\*\*\*\*\* - 7 literal asterisk symbols.
Note you cannot rely on a negated character class (that matches line breaks) if your substring may contain * chars, that is, ^([^*]*)\*\*\*\*\*\*\* won't work in those cases.
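The difference between the lazy pattern and the negated character class can be seen in a small Python sketch. Python's `re` is not RE2, but the inline `(?s)` flag and the lazy quantifier behave the same way for this pattern; the sample cell text below is invented for the illustration.

```python
import re

# Hypothetical cell content with a stray * before the separator line.
cell = (
    "First line with a stray * in it.\n"
    "Second line.\n"
    "***********\n"
    "Text after the separator."
)

# (?s) lets . match line breaks; .*? stops at the first run of 7 asterisks.
lazy = re.search(r"(?s)^(.*?)\*{7}", cell)

# The negated class cannot get past the stray *, so there is no match at all.
negated = re.search(r"^([^*]*)\*{7}", cell)

print(repr(lazy.group(1)))
print(negated)  # None
```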
If you just want to match any chars up to the first * in the string, your regex will simplify greatly to
^([^*]+)
It matches
^ - start of string
([^*]+) - Capturing group 1: one or more chars other than *.

The re.DOTALL flag in Python corresponds to the (?s) single-line mode flag in RE2.
Python:
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
re2:
Flags: s let . match \n (default false)
So,
=REGEXEXTRACT(A1,"(?s)(.*?)\*")
This corresponds to re.findall()

This isn't regex, but it might suit someone who wants the same result and is less particular about the method:
=ArrayFormula(LEFT(A1:A,Find("***********",A1:A)-3))

If you really only want to match everything before the first *:
=REGEXEXTRACT(A1;"[^*]*")
If you want to allow a single star in the text and only stop at multiple (2 or more) stars (possibly divided by newlines) at the beginning of a line, you could try:
=REGEXEXTRACT(A1;"(?s)^(.*)\n(\*\n?){2,}")
But you would have to strip the stars. E.g.
=REGEXREPLACE(REGEXEXTRACT(A1;"(?s)^(.*)\n(\*\n?){2,}"); "\n(\*\n?){2,}"; "")
A lookahead does not seem to work in Google Sheets.

regex - multiple occurrences within match

Is it possible to find multiple matching groups within the full match using ONLY regex?
Given the text below
{1234} Lorem ipsum dolor sit amet, consectetur adipiscing elit. ** Sed
iaculis nisi et dapibus consectetur. Vestibulum ** feugiat sapien, sed
sagittis magna. Phasellus euismod tempor augue, ** eget dictum mi
sagittis sit amet. Quisque sit amet diam vel magna imperdiet pulvinar
vel ac lectus. {4321} Lorem ipsum....
I'm trying to group all the occurrences of ** between the numbers.
I came up with the following:
\{\d+\}.+?(\*\*)+.+\{\d.+\}
https://regex101.com/r/s746be/2
As you can see, it only captures the first occurrence because of the lazy question mark, or only the last one if I remove the question mark.

Why not break it down to a few simple steps instead of using a really big inefficient Regex?
You can try doing something like this:
1) Grab the text between {1234} and {4321} using a simple regex like this:
/\{\d+\}(.*?)\{\d+\}/
2) Extract the matched text between these two delimiters
3) Run a second global regex search on this matched inner text using a simple regex pattern like so:
/\*\*/g
Hope this helps
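The steps above can be sketched in Python roughly as follows. The `re.DOTALL` flag is an addition of this sketch, needed because the sample text spans several lines; the function name and the shortened sample are likewise made up.

```python
import re


def stars_between_markers(text):
    """Return every ** found between the first two {digits} markers."""
    # Step 1: capture the text between the two markers.
    # re.DOTALL lets . cross line breaks (an assumption of this sketch).
    outer = re.search(r"\{\d+\}(.*?)\{\d+\}", text, re.DOTALL)
    if outer is None:
        return []
    # Step 2: extract the inner text.
    inner = outer.group(1)
    # Step 3: run a second, global search on the inner text only.
    return re.findall(r"\*\*", inner)


sample = (
    "{1234} Lorem ipsum ** Sed\n"
    "iaculis nisi ** feugiat sapien, ** eget\n"
    "vel ac lectus. {4321} Lorem ** ipsum"
)
print(len(stars_between_markers(sample)))  # 3
```

The ** after {4321} is not counted because it lies outside the captured inner text.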

sed: How to substitute within a match?

Using regexr, I wrote the expression /[\.!?] [A-Z]/g to match sentences using 3 assumptions:
Sentences end with punctuation: [.,!?] (I'm not sure how to match double punctuation marks or combinations...)
One or more spaces always follow the punctuation mark.
The next sentence begins with a CAPITAL letter. (True 99% of the time, except for lowercase nouns such as iDevices)
Using sed, I'd like to take these matches, and substitute the space(s) with a \n character. I can do an after match $' and a before match $`, but how can I replace within a match?
If there is a better way of splitting texts into one sentence per line, I'm open to alternatives.
No bashisms: for Linux, OS X, and BSD
Input:
Vivamus fermentum semper porta. Nunc diam velit, adipiscing ut
tristique vitae, sagittis vel odio. Maecenas convallis ullamcorper
ultricies. Curabitur ornare, ligula semper consectetur sagittis, nisi
diam iaculis velit, id fringilla sem nunc vel mi.
Output:
Vivamus fermentum semper porta.
Nunc diam velit, adipiscing ut tristique vitae, sagittis vel odio.
Maecenas convallis ullamcorper ultricies.
Curabitur ornare, ligula semper consectetur sagittis, nisi diam iaculis velit, id fringilla sem nunc vel mi.

You can use this replacement:
sed 's/\([.!?][.!?]*\) *\([A-Z]\)/\1\n\2/g;' file
\(...\) delimits a capture group, and \1 is a reference to the captured content.
The OS X version of sed doesn't interpret \n as a newline; you must instead use the sequence \1'$'\n\\2 as the replacement string.
A more POSIX-compliant way is to write:
sed 's/\([.!?][.!?]*\) *\([A-Z]\)/\1\
\2/g;' file
with an escaped newline, as suggested by @cliffordheath.
Note that the dot doesn't need to be escaped inside a character class.
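For readers open to alternatives outside sed, the same capture-group idea can be sketched in Python. This is only an illustration, not a replacement for the sed commands above; joining the wrapped lines first is an extra step added here so that the result matches the one-sentence-per-line output in the question, and the function name is made up.

```python
import re


def one_sentence_per_line(text):
    """Put each sentence on its own line using the question's three assumptions."""
    # Undo the hard line wraps so every sentence ends up on a single line.
    flat = " ".join(text.split())
    # Group 1 keeps the punctuation, group 2 keeps the capital letter;
    # only the spaces between them are replaced by a newline.
    return re.sub(r"([.!?]+) +([A-Z])", r"\1\n\2", flat)


sample = (
    "Vivamus fermentum semper porta. Nunc diam velit, adipiscing ut\n"
    "tristique vitae, sagittis vel odio. Maecenas convallis ullamcorper\n"
    "ultricies. Curabitur ornare, ligula semper consectetur sagittis, nisi\n"
    "diam iaculis velit, id fringilla sem nunc vel mi."
)
print(one_sentence_per_line(sample))
```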

You need to use capture groups with \( and \) to re-insert the punctuation and initial letter. This example allows a following sentence to start with any alphanumeric (but requires at least one space to avoid messing up decimal numbers):
$ sed -e 's/\([.!?]\)  *\([[:alnum:]]\)/\1\
\2/g'
foo. bat! baz? foo, bar.
foo.
bat!
baz?
foo, bar.
I hope this helps.

Simple email address check using preg_match

I have a simple regular expression for finding email addresses in a text, but even though I don't see an error, it doesn't work.
$addr=array();
$t='Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean fermentum risus id tortor. Morbi leo mi, nonummy eget tristique non, rhoncus non leo. Donec quis nibh at felis congue commodo. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos hymenaeos aaa@bbb.com. Aliquam ccc@ddd.net ornare wisi eu metus.';
if(preg_match_all('~[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}~',$t,$addr, PREG_SET_ORDER)){
echo 'found';
}
I have also tried this version I found, but it didn't work either:
if(preg_match_all('/^[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}$/',$t,$addr, PREG_SET_ORDER)){

Your pattern only matches addresses written entirely in uppercase. Either use [A-Za-z] in the character classes, or set the case-insensitive flag (the i modifier) on the pattern passed to preg_match_all.
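The effect of the flag can be illustrated with a short Python sketch, where `re.IGNORECASE` plays the role of the i modifier. It assumes the addresses use the usual @ separator; the sample text is a shortened version of the one in the question.

```python
import re

text = "per inceptos hymenaeos aaa@bbb.com. Aliquam ccc@ddd.net ornare wisi."
pattern = r"[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# Without the flag, the uppercase-only classes find nothing in lowercase text.
print(re.findall(pattern, text))  # []

# With the flag, A-Z also matches a-z.
print(re.findall(pattern, text, re.IGNORECASE))  # ['aaa@bbb.com', 'ccc@ddd.net']
```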

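This would also work as the match pattern and is shorter:
[0-z.%-]+@[0-z.-]+\.[A-z]{2,4}
This works because the range 0-z covers A-Z, a-z, 0-9 and _, and A-z covers A-Z and a-z. Be aware that both ranges also include the ASCII punctuation that sits between those blocks (for example [ \ ] ^ and `), so the pattern is more permissive than the explicit classes.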
I develop in ruby and you can see this 'in action' at http://rubular.com/r/PdbH1BjWMs
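To see exactly what those shorthand ranges admit besides letters and digits, here is a small Python sketch; the helper name `non_alphanumeric_in_range` is made up for this illustration.

```python
def non_alphanumeric_in_range(first, last):
    """List the characters between first and last (ASCII order) that are not letters or digits."""
    return "".join(
        chr(code)
        for code in range(ord(first), ord(last) + 1)
        if not chr(code).isalnum()
    )


print(non_alphanumeric_in_range("0", "z"))
print(non_alphanumeric_in_range("A", "z"))
```

Whether accepting these extra characters is acceptable depends on how strict the check needs to be.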

Include the lowercase letters in each character class:
if(preg_match_all('~[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}~',$t,$addr, PREG_SET_ORDER)){
echo 'found';
}
Note the added a-z.
