Regex Conditionals - regex

I would like to control orphans in InDesign by applying a "No Break" character style based on a GREP expression. Basically, I need to target the last 2 words of a paragraph (That is to say: The last 2 strings of characters separated by a space).
I found a solution for my English publications where (\H+?\h?){2}$ works like a charm.
The problem is with my French publications, where some punctuation marks require a space before them. I am trying to choose the matching pattern based on the last character of the paragraph: if it is a ?, ! or :, I match the last 3 "words" using (\H+?\h?){3}$; if not, then I match the last 2.
I thought the following expression would work:
(?(?=[\?!:]$)((\H+?\h?){3}$)|(\H+?\h?){2}$)
but somehow it always defaults to the "else" branch.
Can someone tell me where I went wrong?

Maybe you want option (A) below.
Let me check that I understand correctly.
The requirements are:
Capture the last two words
Even if the paragraph ends with ?, ! or :
(A) Use this to capture as group: https://regexr.com/4lr6h
(\w*)(?:\s*)(\w*)(?:\s*)(\w*)(?:[\?!:]|$)
(B) Use this to capture only words: https://regexr.com/4lr84
\w*\s\w*(?=(?:$|[\?!:]))
(C) Use this to capture the last three words with the punctuation mark: https://regexr.com/4lr87
\w*\s\w*[\?!:]?$

regex to match specific pattern of string followed by digits

Sample input:
___file___name___2000___ed2___1___2___3
DIFFERENT+FILENAME+(2000)+1+2+3+ed10
Desired output (e.g., all letters, 4-digit numbers, and the literal 'ed' followed immediately by digits of arbitrary length):
file name 2000 ed2
DIFFERENT FILENAME 2000 ed10
I am using:
[A-Za-z]+|[\d]{4}|ed\d+ which only returns:
file name 2000 ed
DIFFERENT FILENAME 2000 ed
I see that there is a related Q+A here: Regular Expression to match specific string followed by number?
e.g., using ed[0-9]* would match ed#, but I am unsure why it does not match in the above.
As written, your regex is valid. Remember, however, that an alternation tries its alternatives from left to right. Your ed\d+ is never going to match, because the ed was already consumed by your [A-Za-z]+ alternative. Reorder your regex and it'll work just fine:
ed\d+|[a-zA-Z]+|\d{4}
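To see the effect of the ordering, here is a minimal sketch in Python. Python's re module is an assumption on my part (the question does not name a language), and the helper name tokens is made up for illustration. It runs both alternations over the two sample inputs from the question.

```python
import re

# The two sample inputs from the question.
SAMPLES = [
    "___file___name___2000___ed2___1___2___3",
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10",
]

# Original order: the letters alternative is tried first and consumes "ed".
ORIGINAL = re.compile(r"[A-Za-z]+|\d{4}|ed\d+")
# Reordered: "ed" followed by digits is tried before plain letters.
REORDERED = re.compile(r"ed\d+|[A-Za-z]+|\d{4}")


def tokens(pattern, text):
    """Join all matches with single spaces (hypothetical helper)."""
    return " ".join(pattern.findall(text))


for sample in SAMPLES:
    print(tokens(ORIGINAL, sample), "|", tokens(REORDERED, sample))
```

The left-hand column should show the truncated ed, the right-hand column the full ed2 and ed10.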
Nick's answer is right, but because in-order matching can be a less readable "gotcha", the best (order-insensitive) ways to do this kind of search are 1) with specified delimiters, and 2) by making each search term unique.
Jan's answer handles #1 well. But you would have to specify each specific delimiter, including its length (e.g. ___). It sounds like you may have some unusual delimiters, so this may not be ideal.
For #2, then, you can make each search term unique. (That is, you want the thing matching "file" and "name" to be distinct from the thing matching "2000", and both to be distinct from the thing matching "ed2".)
One way to do this is [A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+. This is saying that for the first type of search term, you want an alphabetic string which is not followed by an alphanumeric character. This keeps it distinct from the third search term, which is an alphabetic string followed by some digit(s). This also allows you to specify any range of delimiters inside of that negative lookahead.
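A quick way to check the order-insensitivity claim is to try every ordering of the three alternatives. This is only a sketch in Python (my choice of language, not the asker's); the names ALTERNATIVES, EXPECTED and extract are made up for the example.

```python
import re
from itertools import permutations

# The three search terms, with the lookahead on the letters-only term.
ALTERNATIVES = [r"[A-Za-z]+(?![0-9a-zA-Z])", r"\d{4}", r"ed\d+"]

# Sample inputs from the question, mapped to the desired tokens.
EXPECTED = {
    "___file___name___2000___ed2___1___2___3": ["file", "name", "2000", "ed2"],
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10": ["DIFFERENT", "FILENAME", "2000", "ed10"],
}


def extract(alternatives, text):
    """Build the alternation in the given order and return all matches."""
    return re.findall("|".join(alternatives), text)


# Every one of the six orderings should give the same tokens.
all_orders_agree = all(
    extract(order, text) == expected
    for order in permutations(ALTERNATIVES)
    for text, expected in EXPECTED.items()
)
print(all_orders_agree)
```

The lookahead makes the letters-only term fail on ed2 and ed10, so it no longer matters where ed\d+ sits in the alternation.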
You might very well use (just grab the first capturing group):
(?:^|___|[+(]) # delimiter before
([a-zA-Z0-9]{2,}) # the actual content
(?=$|___|[+)]) # delimiter afterwards
See a demo on regex101.com
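For reference, the commented pattern above can be used as-is in an engine with a verbose (free-spacing) mode. A minimal sketch, assuming Python and its re.VERBOSE flag; the function name contents is hypothetical.

```python
import re

# Same three parts as above; re.VERBOSE lets the comments stay in the pattern.
PATTERN = re.compile(
    r"""
    (?:^|___|[+(])        # delimiter before
    ([a-zA-Z0-9]{2,})     # the actual content (first capturing group)
    (?=$|___|[+)])        # delimiter afterwards
    """,
    re.VERBOSE,
)


def contents(text):
    """Return the first capturing group of every match."""
    return PATTERN.findall(text)


print(contents("___file___name___2000___ed2___1___2___3"))
print(contents("DIFFERENT+FILENAME+(2000)+1+2+3+ed10"))
```

Note that the single digits are dropped only because of the {2,} length requirement, so this relies on real tokens being at least two characters long.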

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string, except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a contiguous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is solved very easily by replacing the unwanted words with empty strings, like this:
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
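As one possible equivalent, here is a small sketch using Python's re.sub (the language is my assumption, and strip_unwanted is a made-up helper name). It uses the \b variant of the substitution.

```python
import re

# The \b variant: remove "production" / "public" plus one adjacent hyphen,
# but only when they stand as whole words.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")


def strip_unwanted(text):
    """Replace every unwanted word (and its hyphen) with an empty string."""
    return UNWANTED.sub("", text)


print(strip_unwanted("california-public-local-card-production"))
print(strip_unwanted("production-nevada-public"))
print(strip_unwanted("subproduction-x"))  # left untouched thanks to \b
```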
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code found in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
The idea is that the first group contains the delimiters and the 2nd group the content. I assume the content consists of word characters (letters, digits, and underscore). You then have to grab the content of every capturing group.
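Grabbing the content of every capturing group could look roughly like this. It is a sketch in Python (not necessarily the asker's language), and the tokenize helper and the "delimiter"/"content" labels are invented for the example.

```python
import re

# Group 1 holds a delimiter, group 2 holds a run of word characters.
TOKEN = re.compile(r"(=/|=>|&,1)|(\w+)")


def tokenize(text):
    """Label each match by the group that actually participated."""
    result = []
    for match in TOKEN.finditer(text):
        if match.group(1) is not None:
            result.append(("delimiter", match.group(1)))
        else:
            result.append(("content", match.group(2)))
    return result


for kind, value in tokenize("=/A9999XYZ=>100T0479&,1Blah"):
    print(kind, value)
```

Only one of the two groups is set per match, which is why the code checks group 1 for None before falling back to group 2.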
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
Code the delimiters longest first, e.g. "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative lookahead to stop matching past the next delimiter.
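Here is a rough Python sketch of both points (Python is my assumption). The second input string and the deliberately mis-ordered SHORT_FIRST alternation are made up to show what goes wrong when "=" is listed before "=,".

```python
import re

# Delimiters listed longest first, so "=," is tried before "=".
DELIMS = r"=/|=>|=,|=|&,1"
# Group 1: delimiter. Group 2: characters up to the next delimiter,
# enforced by the negative lookahead after each character.
PAIR = re.compile(rf"({DELIMS})((?:.(?!{DELIMS}))+.)")

# Deliberately wrong order, for comparison only.
SHORT_FIRST = r"=|=,"
BAD_PAIR = re.compile(rf"({SHORT_FIRST})((?:.(?!{SHORT_FIRST}))+.)")


def pairs(text):
    """Return (delimiter, value) tuples."""
    return PAIR.findall(text)


print(pairs("=/A9999XYZ=>100T0479&,1Blah"))
print(pairs("=/A-99=,B.22=C_3x"))   # hypothetical input with punctuation
print(BAD_PAIR.findall("=,B22"))    # the comma leaks into the value
```

One thing to keep in mind: because of the trailing `.` after the repeated group, this pattern needs values of at least two characters.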

Regex Group not starting with

I'm having trouble combining two regexes into one (used to deal with .ini files).
I've got this one (I suggest using Rubular with these examples to follow along):
^(?<key>[^=;\r\n]+)=((?<value>\"*.*;*.*\"[^;\r\n]*);?(?<comment>.*)[^\r\n]*)
to match :
This="isnot;acomment"
This="isa";comment
This="isa;special";case
And I've got this one :
^(?<key>[^=;\r\n]+)=(?<value>[^;\r\n]*);?(?<comment>[^\r\n]*)
to match
This=isasimplecase
This=isasimple;comment
I'm trying to merge the two regexes, but sadly I can't manage to say "if my value group starts with \", use the first one; otherwise use the second one".
Right now I've got this:
^(?<key>[^=;\r\n]+)=(((?<value>\"*.*;*.*\"[^;\r\n]*);?(?<comment>.*)[^\r\n]*)|(?<value>[^;\r\n]*);?(?<comment>[^\r\n]*))
But it creates two more unnamed groups for the simple, unquoted case. I was thinking of adding a condition that the first character of the value group in the simple case must not be \", but I didn't manage to do it.
PS: I suggest using Rubular to better understand my problem. Sorry if I wasn't clear enough.
How about this?
^(?<key>[^=;\r\n]+)=(?<value>"[^"]*"|[^;\n\r]*);?(?<comment>.*)
(?<key>[^=;\r\n]+) matches the part before the = symbol.
"[^"]*" matches a string within double quotes, e.g. "foobar". If there is no ", the regex engine moves on to the next alternative, [^;\n\r]*, which matches up to the first ;, newline, or \r character. The matched characters are stored in a named group called value.
;? Optional semicolon.
(?<comment>.*) Remaining characters are stored into the comment group.
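To try the combined pattern on all five sample lines from the question, here is a minimal Python sketch. Note the adaptation: Python's re module spells named groups (?P<name>...) rather than (?<name>...), and parse_line is a made-up helper name.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
INI_LINE = re.compile(
    r'^(?P<key>[^=;\r\n]+)=(?P<value>"[^"]*"|[^;\r\n]*);?(?P<comment>.*)'
)


def parse_line(line):
    """Return the key, value and comment groups as a dict, or None."""
    match = INI_LINE.match(line)
    return match.groupdict() if match else None


for line in [
    'This="isnot;acomment"',
    'This="isa";comment',
    'This="isa;special";case',
    "This=isasimplecase",
    "This=isasimple;comment",
]:
    print(parse_line(line))
```

The quoted alternative is listed first, so a ; inside double quotes stays part of the value instead of starting the comment.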

regex with 3 backreferences but one optional

I have a regular expression that captures three backreferences though one (the 2nd) may be null.
Given the following string:
http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajonathonoat.es&source=web&cd=1&ved=0CC8QFjAA&url=http%3A%2F%2Fjonathonoat.es%2Fbritish-mozcast%2F&ei=MQj9UKejDYeS0QWruIHgDA&usg=AFQjCNHy1cDoWlIAwyj76wjiM6f2Rpd74w&bvm=bv.41248874,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1
I wish to capture the TLD (in this case .co.uk), q param and cd param.
I'm using the following RegEx:
/.*\.google([a-z\.]*).*q=(.*[^&])?.*cd=(\d*).*/i
This works, except that the 2nd backreference includes the other parameters up to the cd param. I currently get this:
["http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajo…,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1 ", ".co.uk", "site%3Ajonathonoat.es&source=web", "1", index: 0, input: "http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajo…,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1"]
The 1st backreference is correct, it's .co.uk and so is the 3rd; it's 1. I want the 2nd backreference to be either null (or undefined or whatever) or just the q param, in this example site%3Ajonathonoat.es. It currently includes the source param too (site%3Ajonathonoat.es&source=web).
Any help would be much appreciated, thanks!
I've added a JSFiddle of the code, look in your browser console for the output, thanks!
When negating character classes, I always add a quantifier to the class itself:
/.*\.google([a-z\.]*).*q=([^&]*?)?.*cd=(\d*).*/i
I also recommend not using * or +, as they are "greedy"; always use *? or +? when you are going to find delimiters inside your string. For more on greediness, check J. E. F. Friedl's Mastering Regular Expressions or simply here
You want the middle group to be:
q=([^&]*)
This will capture characters other than ampersand. This also allows zero characters, so you can remove the optional group (?).
Working example: http://rubular.com/r/AJkXxgeX5K
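The same middle group can be checked outside the browser too. This is a sketch in Python rather than the question's JavaScript, using a shortened copy of the URL from the question; the second URL with an empty q param and the extract helper are made up for the example.

```python
import re

# Shortened version of the URL from the question.
URL = (
    "http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajonathonoat.es"
    "&source=web&cd=1&ved=0CC8QFjAA"
)

# Middle group is [^&]*: everything up to (not including) the next ampersand.
PATTERN = re.compile(r".*\.google([a-z.]*).*q=([^&]*).*cd=(\d*).*", re.IGNORECASE)


def extract(url):
    """Return (tld, q, cd) or None if the URL does not match."""
    match = PATTERN.match(url)
    return match.groups() if match else None


print(extract(URL))
print(extract("http://www.google.com/url?q=&cd=3"))  # hypothetical empty q
```

The greedy negated class stops at the ampersand on its own. By contrast, a lazy ([^&]*?) that is immediately followed by .* is free to stop early, so it may capture far less than the whole parameter.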