Regex capture group ( ) within character set [ ] - regex

I would like to match space characters ( ) only if they are followed by a hash (#).
This is what ( #) below is trying to do, written as a capture group. (I escaped the parentheses, because otherwise they are not recognised properly inside the character set.) However, this is not working.
The below regex
/#[a-zA-Z\( #\)]+/g
matches all of the below
#CincoDeMayo #Derby party with UNLIMITED #seafood towers
while I would like to match #CincoDeMayo #Derby and separately #seafood
Is there any way to specify capture groups () within a character set []?

Character classes are meant to match a single character, thus, it is not possible to define a character sequence inside a character class.
I think you want to match specific consecutive hashtags. Use
/#[a-zA-Z]+(?: +#[a-zA-Z]+)*/g
or
/#[a-zA-Z]+(?:\s+#[a-zA-Z]+)*/g
See the regex demo.
Details
#[a-zA-Z]+ - a # followed with 1+ ASCII letters
(?: - start of a non-capturing group...
\s+ - 1+ whitespaces
#[a-zA-Z]+ - a # followed with 1+ ASCII letters
)* - ... that repeats 0 or more times.
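As a quick check of the second pattern, here is a minimal sketch in Python (the question uses JavaScript-style /.../g syntax; Python's re.findall plays the role of the g flag here, and the variable names are only illustrative):

```python
import re

# Pattern from the answer: one hashtag, then zero or more
# whitespace-separated hashtags directly following it.
HASHTAG_RUN = re.compile(r"#[a-zA-Z]+(?:\s+#[a-zA-Z]+)*")

text = "#CincoDeMayo #Derby party with UNLIMITED #seafood towers"

# With no capturing groups, findall returns the full matches.
runs = HASHTAG_RUN.findall(text)
print(runs)
```

The two consecutive hashtags come back as one match and the isolated one as a separate match, which is what the question asked for.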

Related

Regex match specific strings

I want to capture all the strings from multi-line data. The expected result is shown below, along with my pattern, which does not work.
Pattern: ^XYZ/[0-9|ALL|P] I’m lost with this part. Can anyone help?
Result
XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/ALL
XYZ/P1
XYZ/P2,3
XYZ/P4,5-7
XYZ/P1-4,5-7,8-9
Changed to
XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/A12345 after the slash limited to 6 alphanumeric chars
XYZ/LH-1234567890 after the /LH- limited to 10 numeric chars
The pattern could be:
^XYZ\/(?:ALL|P?[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)$
The pattern in parts matches:
^ Start of string
XYZ\/ Match XYZ/ (You don't have to escape the / depending on the pattern delimiters)
(?: Outer non capture group for the alternatives
ALL Match literally
| Or
P? Match an optional P
[0-9]+(?:-[0-9]+)? Match 1+ digits with an optional - and 1+ digits
(?: Non capture group to match as a whole
,[0-9]+(?:-[0-9]+)? Match , and 1+ digits and optional - and 1+ digits
)* Close the non capture group and optionally repeat it
) Close the outer non capture group
$ End of string
Regex demo
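To try the pattern against the original sample lines, a small Python sketch could look like this (the two extra lines XYZ/PALL and XYZ/1, are made-up negative cases, not from the question; re.MULTILINE stands in for the m flag):

```python
import re

# The first answer's pattern; / needs no escaping in a Python string.
XYZ_LINE = re.compile(
    r"^XYZ/(?:ALL|P?[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)$",
    re.MULTILINE,
)

data = """XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/ALL
XYZ/P1
XYZ/P2,3
XYZ/P4,5-7
XYZ/P1-4,5-7,8-9
XYZ/PALL
XYZ/1,"""

# Each match is a whole line because of the ^ and $ anchors.
matches = XYZ_LINE.findall(data)
for line in matches:
    print(line)
```

Only the nine lines from the question are expected to match; the two hypothetical lines are rejected because ALL cannot follow P and a trailing comma must be followed by digits.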
You can use this regex pattern to match those lines
^XYZ\/(?:P|ALL|[0-9])[0-9,-]*$
Use the global g and multiline m flags.
Btw, [P|ALL] doesn't match the word "ALL".
It only matches a single character that's a P or A or L or |.
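The difference between a character class and an alternation can be seen in a short Python sketch (the test string is just an illustration):

```python
import re

char_class = re.compile(r"[P|ALL]")     # one character: P, |, A or L
alternation = re.compile(r"(?:P|ALL)")  # the string P or the string ALL

sample = "ALL|P"

class_hits = char_class.findall(sample)
alt_hits = alternation.findall(sample)

print(class_hits)  # single characters, including the literal |
print(alt_hits)    # whole alternatives
```

Inside [ ] the | is an ordinary character, so the class matches it literally instead of treating it as "or".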

greedy-but-not-too-greedy regex: need to exclude last occurrence of optional character

(it must be something trivial and answered many times already - but I can't formulate the right search query, sorry!)
From text like prefix start.then.123.some-more.text. All the rest I need to extract start.then.123.some-more.text, i.e. a string that has no spaces, has periods in the middle, and may or may not have a trailing period (which should not be included). I struggle to build a regex that catches both cases:
prefix (start[0-9a-zA-Z\.\-]+)\..* - this works correctly only if there's a trailing period,
prefix (start[0-9a-zA-Z\.\-]+)\.?.* - I thought adding ? after \. will make it optional - but it doesn't...
P.S. My environment is MS VBA script, I'm using CreateObject("vbscript.regexp") - but I guess the question is relevant to other regex engines as well.
If you don’t want to include “prefix” in the match, and your engine supports lookbehind (VBScript's RegExp does not), you can use:
(?<=prefix )\S*?(?=\.?\s)
Demo
EDIT:
Even simpler, without lookbehinds or lookaheads, if you're using capturing groups anyway:
prefix (\S*\w)
This will stop at the last letter, number, or underscore. If you want to be able to capture a hyphen as the last character, you can change \w above to [\w-].
Demo 2
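A minimal Python sketch of the capturing-group version, run on the input with and without the trailing period (the question targets VBScript; Python is used here only to illustrate the backtracking behaviour):

```python
import re

PREFIXED = re.compile(r"prefix (\S*\w)")

with_dot = "prefix start.then.123.some-more.text. All the rest"
without_dot = "prefix start.then.123.some-more.text All the rest"

# \S* first grabs everything up to the space, then backtracks
# until the last character of the group is a word character.
value_with_dot = PREFIXED.search(with_dot).group(1)
value_without_dot = PREFIXED.search(without_dot).group(1)

print(value_with_dot)
print(value_without_dot)
```

Both inputs should give the same captured value, because the trailing period is not a word character and gets handed back during backtracking.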
You could match prefix, and use a capturing group to first match chars A-Za-z0-9.
Then you can repeat the previous pattern in a group preceded by either a . or - using a character class.
prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)
In parts
prefix Match literally
( Capture group 1
[0-9a-zA-Z]+ Match 1+ times any of the listed chars
(?: Non capture group
[.-][0-9a-zA-Z]+ match either a . or - and again match 1+ times any of the listed chars
)+ Close group and repeat 1+ times to match at least a dot or hyphen
) Close group
Regex demo
If the value in the capturing group should begin with start:
prefix (start(?:[.-][0-9a-zA-Z]+)+)
Regex demo
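The stricter pattern can be checked the same way. This is a Python sketch under the assumption that the value always begins with start; the third input, which ends right after the period, is a made-up edge case:

```python
import re

STRICT = re.compile(r"prefix (start(?:[.-][0-9a-zA-Z]+)+)")

inputs = [
    "prefix start.then.123.some-more.text. All the rest",
    "prefix start.then.123.some-more.text All the rest",
    "prefix start.then.123.some-more.text.",  # hypothetical: ends on the period
]

# A separator (. or -) only counts when alphanumerics follow it,
# so a dangling trailing period can never be part of the capture.
values = [STRICT.search(s).group(1) for s in inputs]
print(values)
```

Unlike the \S*\w variant, this one spells out which characters are allowed, so it will not pick up other punctuation.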

C# Regex Expression to extract field name and values in SQL Condition

Consider following 2 SQL conditions.
1.) AssetView.[PROPTYPE] NOT IN ('B15/30','SFD','SFA')
2.) AssetView.[FICO] IN (500,600,700)
I want to break this SQL using RegEx so that I can have table name, field name, function type and field values into 4 different parts.
e.g.
Table Name - AssetView
Field Name - PROPTYPE
Function - NOT IN
Field Values (Together or separate): B15/30, SFD, SFA
Here is the regex I tried (https://rubular.com/r/WGiyz0oGrooyiA) but I am not able to split TableName, Field Name and Function type into its own group.
(.*?)[^=]['(]+(.*?)[')]
In your pattern (.*?)[^=]['(]+(.*?)[')] you use the character classes ['(] and [')], which match any one of the listed characters, so the pattern can also match an opening ' and then a closing )
For your example data, you might use:
(\w+)\.\[(\w+)\] +(\w+(?: \w+)*) +\(([^)\n]+)\)
(\w+) Capture 1+ word chars in group 1
\. Match a dot
\[(\w+)\] + Capture 1+ word chars between square brackets in group 2 and 1+ spaces
(\w+(?: \w+)*) + Capture 1+ word chars followed by repeating 0+ times matching a space and 1+ word chars in group 3 and 1+ spaces
\(([^)\n]+)\) Capture 1+ times not a closing parenthesis or newline between parentheses in group 4
Rubular regex | .NET regex (click on the Table tab)
If you want to allow more characters to match than \w you could extend that using a character class.
For example, if you also want to allow a hyphen and a space use [\w -]+, or if you want to match everything between the brackets you could use a negated character class, for example \[([^\]]+)\]
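The four groups can be pulled apart as in the following sketch. It is written in Python rather than C# purely for illustration; parse_condition is a hypothetical helper, and splitting the value list on commas is a simplifying assumption that would break for quoted values containing commas:

```python
import re

SQL_CONDITION = re.compile(r"(\w+)\.\[(\w+)\] +(\w+(?: \w+)*) +\(([^)\n]+)\)")

def parse_condition(condition):
    """Return (table, field, function, values), or None if nothing matches."""
    m = SQL_CONDITION.search(condition)
    if m is None:
        return None
    table, field, function, raw_values = m.groups()
    # Naive split: assumes no commas inside the individual values.
    values = [v.strip().strip("'") for v in raw_values.split(",")]
    return table, field, function, values

first = parse_condition("AssetView.[PROPTYPE] NOT IN ('B15/30','SFD','SFA')")
second = parse_condition("AssetView.[FICO] IN (500,600,700)")
print(first)
print(second)
```

In .NET the same groups would be read from Match.Groups[1] through Match.Groups[4].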

Regex which does not allow leading space and any character from (^\\/:*?"<>|)

I have a regex like "^[a-zA-Z]:(\\\\+[^\\/:*?"<>|]+)*([\\\\]+)?$" which is responsible for file path validation.
It successfully validates paths like C:\Users\data and C:\\Users\\data
I want the string which comes after "C:\" to not start with space and not have (^\\/:*?"<>|) characters in it.
You could match the start of the string up to the colon and then use your negated character class to exclude the unwanted characters right after the backslash. You could add a space or \s to that character class to exclude leading whitespace as well.
Also, you might use a capturing group and a backreference so that whichever backslash variant is used first (\\ or \) is also used in the rest of the path.
After that you could use a repeating pattern and specify which characters to allow for the rest of the string.
^[a-zA-Z]:(\\+)(?:[^\\/:*?"<>|\s][\w&]+(?: [\w&]+)*(?:\1[a-zA-Z&]+)*)?$
Regex demo
That will match:
^ Start of the string
[a-zA-Z]: Match a single character a-zA-Z and a colon
(\\+) Capture in a group 1+ times a backslash to reference it
(?: Non capturing group
[^\\/:*?"<>|\s] Negated character class that matches a single character other than the ones listed (Added \s but you could also just use a space)
[\w&]+(?: [\w&]+)* Match 1+ times a word char and repeat 0+ times matching a space and 1+ times a word char. Note that you can extend the character class to match what you want.
(?: Non capturing group
\1[a-zA-Z&]+ Match backreference to what is captured in group 1 followed by 1+ times a-zA-Z (You can add to the character class what you would like to match as well)
)* Close non capturing group and repeat it 0+ times
)? Close non capturing group and make it optional
$ End of the string
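The pattern can be exercised with a few paths in Python. This is a sketch only; is_valid_path is a hypothetical helper and the rejected paths are made-up examples:

```python
import re

# Single-quoted raw string so the " inside the class needs no escaping.
PATH = re.compile(
    r'^[a-zA-Z]:(\\+)(?:[^\\/:*?"<>|\s][\w&]+(?: [\w&]+)*(?:\1[a-zA-Z&]+)*)?$'
)

def is_valid_path(path):
    return PATH.match(path) is not None

accepted = [r"C:\Users\data", r"C:\\Users\\data"]
rejected = [
    r"C:\ Users",        # leading space after the backslash
    r"C:\Users\\data",   # mixes \ and \\, so the backreference fails
    r"C:\Us<ers",        # forbidden character
]

for p in accepted + rejected:
    print(p, is_valid_path(p))
```

Note that, as written, the first folder name needs at least two characters (one from the negated class, one or more from [\w&]+), and later folder names only allow what is in [a-zA-Z&], so the classes may need extending for real paths.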
As said here
Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u)
So you can combine it with an if-then-else conditional, in engines that support lookaround conditions, like (?(?!your_pattern_in_regex)match_then|match_else)
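The q(?!u) example from the quote can be tried in Python as a small sketch. Python's re module only supports conditionals that test whether a group matched, not lookaround conditions, so only the lookahead part is shown; the sample strings are illustrative:

```python
import re

q_not_u = re.compile(r"q(?!u)")      # lookahead: consumes only the q
q_then_other = re.compile(r"q[^u]")  # class: must consume one more character

sample = "Iraq quit qat"
lookahead_starts = [m.start() for m in q_not_u.finditer(sample)]

# At the end of a string there is no character for [^u] to consume.
class_hit_at_end = q_then_other.search("Iraq")
lookahead_hit_at_end = q_not_u.search("Iraq")

print(lookahead_starts)
print(class_hit_at_end, lookahead_hit_at_end)
```

The negated class fails on a q at the very end of the input, while the lookahead succeeds because it asserts without consuming anything.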

Use class content inside REGEX

I want to parse a nested structure like this one in MATLAB :
structure NAME_PART_1
Some content
block NAME_PART_2
Some other content
end NAME_PART_2
block NAME_PART_3
subblock NAME_PART_4
Some content++
end NAME_PART_4
end NAME_PART_3
end NAME_PART_1
structure
NAME_PART_5
end NAME_PART_5
First, I would like to extract the content of each structure. It's quite easy because a structure content is always between "structure NAME" and "end NAME".
So, I would like to use regex. But I don't know in advance what the structure name will be.
So, I wrote my regex like this :
\bstructure\s+([\w.-]*)((?:\s|.)*)\bend\b\s+XXXX
But, I don't know by what I should replace "XXXX", in order to "reference" the content of the first class of this regex. But is that even possible?
Try this Regex:
structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1
Click for Demo
Explanation:
structure - matches structure
\s+ - matches 1+ occurrences of a white-space
([\w.-]+) - matches 1+ occurrences of either a word character or a . or a -. This sub-match which contains the structure name is captured in Group 1.
\s* - matches 0+ occurrences of a white-space
((?:(?!end\s+\1)[\s\S])*) - Tempered Greedy Token - Matches 0+ occurrences of any character [\s\S], each at a position that does not start the sequence end followed by the Group 1 contents \1, i.e. the structure name. This sub-match is captured in Group 2, which contains the contents of the structure
end\s+\1 - matches the word end followed by 1+ white-spaces followed by Structure Name contained in Group 1 \1.
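To see the tempered greedy token at work, here is a Python sketch on a shortened version of the sample input (the question is about MATLAB, whose regexp syntax differs; Python is used only to illustrate the pattern):

```python
import re

STRUCTURE = re.compile(r"structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1")

text = """structure NAME_PART_1
Some content
block NAME_PART_2
Some other content
end NAME_PART_2
end NAME_PART_1
structure
NAME_PART_5
end NAME_PART_5"""

found = [(m.group(1), m.group(2)) for m in STRUCTURE.finditer(text)]
names = [name for name, _ in found]
first_body = found[0][1]

print(names)
print(first_body)
```

The inner "end NAME_PART_2" stays inside the first body, because the lookahead only stops at "end" followed by the name captured in Group 1.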
Apart from using a backreference \1 to refer to what is captured, you might replace the alternation in the capturing group ((?:\s|.)*) with a repeated group that matches a newline followed by 0+ characters: ((?:\n.*)+)
Also, you might omit the word boundary after end in end\b\s+, since 1+ whitespace characters must follow end anyway, and instead add a word boundary at the very end so that \1 is not part of a longer name.
\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b
Regex demo
Explanation
\bstructure\s+ Match structure followed by 1+ whitespace chars
([\w.-]+) Capture in a group repeating 1+ times any of the listed chars
( Capturing group
(?:\n.*)+ Match newline followed by 0+ times any char except a newline
) Close capturing group
\bend Match end
\s+\1\b Match 1+ times a whitespace char followed by a backreference to group 1 and end with a word boundary.
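The effect of the trailing word boundary can be shown with a short Python sketch. The names A1 and A10 are made-up; the point is that without \b the backreference can match a prefix of a longer name:

```python
import re

WITH_BOUNDARY = re.compile(r"\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b")
WITHOUT_BOUNDARY = re.compile(r"\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1")

# Hypothetical input: the only closing line belongs to a different name.
mismatched = "structure A1\ncontent\nend A10"

# Input where the names agree.
matched = "structure A1\ncontent\nend A1"

strict_on_mismatch = WITH_BOUNDARY.search(mismatched)
loose_on_mismatch = WITHOUT_BOUNDARY.search(mismatched)
strict_on_match = WITH_BOUNDARY.search(matched)

print(strict_on_mismatch)
print(loose_on_mismatch.group(0))
print(strict_on_match.group(1), repr(strict_on_match.group(2)))
```

Because (?:\n.*)+ is greedy, this pattern extends to the last matching "end NAME" in the input, which is worth keeping in mind if the same structure name can appear more than once.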