Sublime Text regex with optional groups - regex

Example Data (including newlines)
Alpha1
100
Bravo2
something else
Charlie3
200
Delta4
A==1
Echo5
300
Foxtrot6
I would like to get the output of:
Alpha1 100 Bravo2 something else
Charlie3 200 Delta4 A==1
Echo5 300 Foxtrot6
The pattern is:
AlphaNumeric
Numeric
AlphaNumeric
value that is not a single alphanumeric "word"
The first three parts are easy -- (\w+)\s+(\d+)\s+(\w+)\s+ -- but I don't know how to have the conditional fourth group. Is this possible? If so, how?

This pattern worked for me:
(\w+)\s+(\d+)\s+(\w+)(\s+([\w\s]+ \w+$))?
Before:
After:
Of course, \w includes underscores, so replace \w with [a-z0-9] if necessary.
Update
This pattern is more specific and should be more reliable:
(\w+)\n(\d+)\n(\w+)(\n([^\n]*[^\w\n][^\n]*))?
Before:
After:

Related

Using Data Validation in Google Sheets to only allow numbers, commas and spaces

I am creating an input google sheet to accept just numbers, commas & spaces - examples listed below. - At a basic level, I just want to exclude the use of A-Z / a-z.
80092382
800
800,876
98672102,20192210
I would like it to exclude anything like:
Hey 01
987 blue
black 1 white
orange
I am getting stuck early on, where I'm trying to only allow text with only numbers in, or excluding anything with letters in.
I have tried the following lines of code within the RegexMatch formula but it either accepts lines with text, or rejects my numbers.
=RegexMatch(L5,"\d")
- This one rejects the numbers.
=RegexMatch(to_text(L5),"\d")
- This rejects only where there are no numbers in the cell - So 'Hey 01' is accepted.
=RegexMatch(to_text(L5),"^\d")
- Same issue, if the cell starts with a number '987 blue' then its accepted
I've attempted a few other ways, such as using the NOT function at the beginning & using other regular expressions. If anyone can point me in the right direction then that would be much appreciated.
Test sheet
You can use
=ARRAYFORMULA(IF(REGEXMATCH(TO_TEXT(C2:C9), "^\d+(,\s*\d+)*$"), "Good", "Error"))
The regex matches
^ - start of string
\d+ - one or more digits
(,\s*\d+)* - zero or more repetitions of
, - a comma
\s* - zero or whitespace
\d+ - one or more digits
$ - end of string.
You can test the content with this type of regex:
=arrayformula(if(regexmatch(to_text(A1:A),"([^\d.,])"),"Error",))

How to create a matching regex pattern for "greater than 10-000-000 and lower than 150-000-000"?

I'm trying to make
09-546-943
fail in the below regex pattern.
​^[0-9]{2,3}[- ]{0,1}[0-9]{3}[- ]{0,1}[0-9]{3}$
Passing criteria is
greater than 10-000-000 or 010-000-000 and
less than 150-000-000
The tried example "09-546-943" passes. This should be a fail.
Any idea how to create a regex that makes this example a fail instead of a pass?
You may use
^(?:(?:0?[1-9][0-9]|1[0-4][0-9])-[0-9]{3}-[0-9]{3}|150-000-000)$
See the regex demo.
The pattern is partially generated with this online number range regex generator, I set the min number to 10 and max to 150, then merged the branches that match 1-8 and 9 (the tool does a bad job here), added 0? to the two digit numbers to match an optional leading 0 and -[0-9]{3}-[0-9]{3} for 10-149 part and -000-000 for 150.
See the regex graph:
Details
^ - start of string
(?: - start of a container non-capturing group making the anchors apply to both alternatives:
(?:0?[1-9][0-9]|1[0-4][0-9]) - an optional 0 and then a number from 10 to 99 or 1 followed with a digit from 0 to 4 and then any digit (100 to 149)
-[0-9]{3}-[0-9]{3} - a hyphen and three digits repeated twice (=(?:-[0-9]{3}){2})
| - or
150-000-000 - a 150-000-000 value
) - end of the non-capturing group
$ - end of string.
This expression or maybe a slightly modified version of which might work:
^[1][0-4][0-9]-[0-9]{3}-[0-9]{3}$|^[1][0]-[0-9]{3}-[0-9]{2}[1-9]$
It would also fail 10-000-000 and 150-000-000.
In this demo, the expression is explained, if you might be interested.
This pattern:
((0?[1-9])|(1[0-4]))[0-9]-[0-9]{3}-[0-9]{3}
matches the range from (0)10-000-000 to 149-999-999 inclusive. To keep the regex simple, you may need to handle the extremes ((0)10-000-000 and 150-000-000) separately - depending on your need of them to be included or excluded.
Test here.
This regex:
((0?[1-9])|(1[0-4]))[0-9][- ]?[0-9]{3}[- ]?[0-9]{3}
accepts (space) or nothing instead of -.
Test here.

Regular Expressions in R

I found somewhat similar questions
R - Select string text between two values, regex for n characters or at least m characters,
but I'm still having trouble
say I have a string in r
testing_String <- "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"
And I need to be able to pull anything between the first element in the string that contains 2 characters (AK) and PADK,ADK. PADK and ADK will change in character but will always be 4 and 3 characters in length respectively.
So I would need to pull
ADAK NAS
I came up with this but its picking up everything from AK to ADK
^[A-Za-z0_9_]{2}(.*?) +[A-Za-z0_9_]{4}|[A-Za-z0_9_]{3,}
If I understood your question correctly, this should do the trick:
\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b
Demo
You'll have to switch the perl = TRUE option (to use a decent regex engine).
\b means word boundary. So this pattern looks for a match starting with a 2-letter word and ending with a 4 letter word followed by a 3 letter word. Your value will be in the first group.
Alternatively, you can write the following to avoid using the capturing group:
\b[A-Z]{2}\s+\K.+?(?=\s+[A-Z]{4}\s+[A-Z]{3}\b)
But I'd prefer the first method because it's easier to read.
Lookbehind is supported for perl=TRUE, so this regex will do what you want:
(?<=\w{2}\s).*?(?=\s+[^\s]{4}\s[^\s]{2})

Regex failing to match number and dash with letter (or space and letter)

In the tester this works ... but not in PostgreSQL.
My data is like this -- usually a series of letters, followed by 2 numbers and a POSSIBLE '-' or 'space' with only ONE letter following. I am trying to isolate the 2 numbers and the Possible '-" or 'space' AND the ONE letter with my regex:
For ex:
AJ 50-R Busboys ## should return 50-R
APPLES 30 F ## should return 30 F
FOOBAR 30 Apple ## should return 30
Regex's (that have worked in the tester, but not in PostgreSQL) that I've tried:
substring(REF from '([0-9]+)-?([:space:])?([A-Za-z])?')
&
substring(REF from '([0-9]+)-?([A-Za-z])?')
So far everything tests out in the tester...but not the PostgreSQL. I just keep getting the numbers returns -- AND NOTHING AFTER IT.
What I am getting now(for ex):
AJ 50-R Busboys ## returns as "50" NOT as "50-R"
Your looking for: substring(REF from '([0-9]+(-| )([A-Za-z]\y)?)')
In SQLFiddle. Your primary problem is that substring returns the first or outermost matching group (ie., pattern surrounded with ()), which is why you get 50 for your '50-R'. If you were to surround the entire pattern with (), this would give you '50-R'. However, the pattern you have fails to return what you want on the other strings, even after accounting for this issue, so I had to modify the entire regex.
This matches your description and examples.
Your description is slightly ambiguous. Leading letters are followed by a space and then two digits in your examples, as opposed to your description.
SELECT t, substring(t, '^[[:alpha:] ]+(\d\d(:?[\s-]?[[:alpha:]]\M)?)')
FROM (
VALUES
('AJ 50-R Busboys') -- should return: 50-R
,('APPLES 30 F') -- should return: 30 F
,('FOOBAR 30 Apple') -- should return: 30
,('FOOBAR 30x Apple') -- should return: 30x
,('sadfgag30 D 66 X foo') -- should return: 30 D - not: 66 X
) r(t);
->SQLfiddle
Explanation
^ .. start of string (last row could fail without anchoring to start and global flag 'g'). Also: faster.
[[:alpha:] ]+ .. one or more letters or spaces (like in your examples).
( .. capturing parenthesis
\d\d .. two digits
(:? .. non-capturing parenthesis
[\s-]? .. '-' or 'white space' (character class), 0 or 1 times
[[:alpha:]] .. 1 letter
\M .. followed by end of word (can be end of string, too)
)? .. the pattern in non-capturing parentheses 0 or 1 times
Letters as defined by the character class alpha according to the current locale! The poor man's substitute [a-zA-Z] only works for basic ASCII letters and fails for anything more. Consider this simple demo:
SELECT substring('oö','[[:alpha:]]*')
,substring('oö','[a-zA-Z]*');
More about character classes in Postgres regular expressions in the manual.
It's because of the parentheses.
I've looked everywhere in the documentation and found an interesting sentence on this page:
[...] if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.
I took your first expression:
([0-9]+)-?([:space:])?([A-Za-z])?
and wrapped it in parentheses:
(([0-9]+)-?([:space:])?([A-Za-z])?)
and it works fine (see SQLFiddle).
Update:
Also, because you're looking for - or space, you could rewrite your middle expression to [-|\s]? (thanks Matthew for pointing that out), which leads to the following possible REGEX:
(([0-9]+)[-|\s]?([A-Za-z])?)
(SQLFiddle)
Update 2:
While my answer provides the explanation as to why the result represented a partial match of your expression, the expression I presented above fails your third test case.
You should use the regex provided by Matthew in his answer.

Regular expression= tabspace+STRING+tabspace

How can I write this as a regular expression?
tabspaceSTRINGtabspace
My data looks like this:
12345 adsadasdasdasd 30
34562 adsadasdasdasd asdadaads<adasdad 30
12313 adsadasdasdasd asdadas dsaads 313123<font="TNR">adsada 30
1232131 adsadasdasdasd asdadaads<adasdad"asdja <div>asdjaıda 30
I want to get
12345 30
34562 30
12313 30
1232131 30
\t*\t doesn't work.
try the following regular expression
\t.+\t
The problem there is your definition of String...
If you use something like the suggested above, it'll match
tabspaceSTRINGtabspacetabspace
You get the picture. This might be acceptable, if not, you need to limit your "STRING" definition, like:
\t\w+\t
or:
\t[a-zA-Z]+\t
What characters are allowed in your string?
\t\w+\t
\w would allow letters, digits and the underscore (depending on your regex engine ASCII or Unicode)
See it here on Regexr, a good platform to test regular expressions.
Your "regex" \t*\t would match 0 or more tabs and then one tab. The * is a quantifier meaning 0 or more and is referring to the character or group before (here to your \t)
If your whitespace are not tabs, try this
\s+.+\s+30
\s is a whitespace character (space, tab, newline (not important for Notepad++)).
If you are not sure about the strings you are looking for except that they are separated by tabs it is a good approach to describe such a string as everything but a tab: (^\t*)
[^\t]*\t([^\t]*)\t[^\t]*
You can test it on regexpad.com.