Match a code string in some text - regex

I have some code strings that I need to extract some data from, but specific data at that.
The strings are always in the same format.
I need to extract the text at the beginning, between the opening ( and the next (, so it would extract List Options here.
I also need to extract the text near the end, #groups here.
The string I need at the end will always start with #.
(List Options((join ", ", #groups))
I have tried:
^\((\w+).*, (#\w+)\)\)$
But it only gives me the word List

This should work well for you.
^\(([^(]+)[^#]+([^)]+)\)+$
Regular expression:
^ the beginning of the string
\( look and match '('
( group and capture to \1:
[^(]+ any character except: '(' (1 or more times)
) end of \1
[^#]+ any character except: '#' (1 or more times)
( group and capture to \2:
[^)]+ any character except: ')' (1 or more times)
) end of \2
\)+ ')' (1 or more times)
$ before an optional \n, and the end of the string
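As a quick check of the breakdown above, here is a minimal Python sketch that applies the same pattern to the sample string from the question. The variable names are only illustrative.

```python
import re

# Pattern from the answer above: group 1 is the text after the first "(",
# group 2 is the "#..." token just before the closing parentheses.
PATTERN = re.compile(r'^\(([^(]+)[^#]+([^)]+)\)+$')

sample = '(List Options((join ", ", #groups))'
match = PATTERN.match(sample)
if match:
    label, token = match.groups()
    print(label)  # List Options
    print(token)  # #groups
```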

Try this one:
^\(([^\(]+?)\(.*?#([^\)]+?)\)
Or, if you need the # sign captured as well, just move it inside the second capturing group:
^\(([^\(]+?)\(.*?(#[^\)]+?)\)
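The difference between the two lazy variants can be seen in a short Python sketch, assuming the sample string from the question; only the position of the # relative to the second group changes.

```python
import re

sample = '(List Options((join ", ", #groups))'

# Variant 1: the "#" sits outside group 2, so it is not captured.
without_hash = re.match(r'^\(([^\(]+?)\(.*?#([^\)]+?)\)', sample)

# Variant 2: the "#" sits inside group 2, so it is captured.
with_hash = re.match(r'^\(([^\(]+?)\(.*?(#[^\)]+?)\)', sample)

print(without_hash.groups())  # ('List Options', 'groups')
print(with_hash.groups())     # ('List Options', '#groups')
```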

How do I get the contents of the second or third set of brackets with only Regular Expression?

If I have a string like this:
Report - No Adj - Direct Deposit (1) (64402117-acdd-44f9-a9de-53a5b83b961a) (2014-08-20).dotx
I can get the contents of the first set of brackets using:
\(([^)]*)\)
I'm using a third party program which only lets me pass in a RegEx string.
What Regular Expression would give me the contents of the second and third sets of brackets?
To match only the contents of the third set of brackets, you can use the following regex:
(?:\([^()]+\).*?){2}\(([^()]+)\)
Explanation:
(?: # Begin group
\( # Match '('
[^()]+ # Match any character that is not a '(' or ')'
\) # Match ')'
.*? # Content outside the brackets (until the next '(')
){2} # Repeat the group exactly 2 times
\( # Match '('
( # Begin first capturing group
[^()]+ # Match any character that is not a '(' or ')'
) # End first capturing group
\) # Match ')'
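The same idea generalizes to any set of brackets by changing the repeat count: skipping one set first gives the second set, skipping two gives the third. A minimal Python sketch, where the helper name nth_bracket is hypothetical and not part of the original answer:

```python
import re

def nth_bracket(text, n):
    """Return the contents of the n-th (1-based) bracket pair, or None."""
    # Skip n-1 bracket pairs (plus whatever follows each), then capture the next.
    pattern = r'(?:\([^()]+\).*?){%d}\(([^()]+)\)' % (n - 1)
    match = re.search(pattern, text)
    return match.group(1) if match else None

name = ("Report - No Adj - Direct Deposit (1) "
        "(64402117-acdd-44f9-a9de-53a5b83b961a) (2014-08-20).dotx")

print(nth_bracket(name, 2))  # 64402117-acdd-44f9-a9de-53a5b83b961a
print(nth_bracket(name, 3))  # 2014-08-20
```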
What you need to do is make your regex global (the g flag), like:
http://regex101.com/r/rR8rR5/1
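Where the host program allows collecting every match, the global approach needs only the original single-bracket pattern. In Python, re.findall plays the role of the g flag; this is a sketch under that assumption, not something the third-party program is known to support.

```python
import re

name = ("Report - No Adj - Direct Deposit (1) "
        "(64402117-acdd-44f9-a9de-53a5b83b961a) (2014-08-20).dotx")

# findall returns the captured group of every non-overlapping match, in order.
contents = re.findall(r'\(([^)]*)\)', name)

print(contents[1])  # second set of brackets
print(contents[2])  # third set of brackets
```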

Sublime SQL REGEX highlighting

I'm trying to modify an existing language definition in Sublime Text.
It's for SQL.
Currently (\#|##)(\w)* is being used to match against local declared parameters (e.g. #MyParam) and also system parameters (e.g. ##Identity or ##FETCH_STATUS)
I'm trying to split these into 2 groups
System parameters I can get like (\#\#)(\w)* but I'm having problems with the local parameters.
(\#)(\w)* matches both #MyParam and ##Identity
(^\#)(\w)* only highlights the first match (i.e. #MyParam but not #MyParam2).
Can someone help me with the correct regex?
Try the below regex to capture local parameters and system parameters into two separate groups.
(?<=^| )(#\w*)(?= |$)|(?<=^| )(##\w*)(?= |$)
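To illustrate the split into local and system parameters outside the editor, here is a rough Python sketch. It is an adaptation rather than the exact pattern above: Python's re module has no \K and does not accept a look-behind whose alternatives differ in width, such as (?<=^| ), so the whitespace-based lookarounds (?<!\S) and (?!\S) are swapped in. The sample SQL text is made up for the example.

```python
import re

# Assumption: parameters are delimited by whitespace or line boundaries.
# (?<!\S) means "not preceded by a non-space"; (?!\S) means "not followed by one".
LOCAL_PARAM = re.compile(r'(?<!\S)#\w*(?!\S)')
SYSTEM_PARAM = re.compile(r'(?<!\S)##\w*(?!\S)')

sql = "SET #MyParam = ##Identity\nSELECT #MyParam2 WHERE ##FETCH_STATUS = 0"

local_params = LOCAL_PARAM.findall(sql)
system_params = SYSTEM_PARAM.findall(sql)

print(local_params)   # ['#MyParam', '#MyParam2']
print(system_params)  # ['##Identity', '##FETCH_STATUS']
```

Because the local pattern must be followed by whitespace or the end of the text, the first # of ##Identity cannot match on its own, which is what keeps the two groups apart.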
Update:
Sublime Text 2 supports \K (which discards everything matched so far from the reported match), so you can also use:
(?m)(?:^| )\K(#\w*)(?= |$)|(?:^| )\K(##\w*)(?= |$)
Explanation:
(?m) set flags for this block (with ^ and $ matching start and end of line) (case-sensitive) (with . not matching \n) (matching whitespace and # normally)
(?: group, but do not capture:
^ the beginning of a "line"
| OR
' '
) end of grouping
\K '\K' (resets the starting point of the reported match)
( group and capture to \1:
# '#'
\w* word characters (a-z, A-Z, 0-9, _) (0 or more times)
) end of \1
(?= look ahead to see if there is:
' '
| OR
$ before an optional \n, and the end of a "line"
) end of look-ahead
| OR
(?: group, but do not capture:
^ the beginning of a "line"
| OR
' '
) end of grouping
\K '\K' (resets the starting point of the reported match)
( group and capture to \2:
## '##'
\w* word characters (a-z, A-Z, 0-9, _) (0 or more times)
) end of \2
(?= look ahead to see if there is:
' '
| OR
$ before an optional \n, and the end of a "line"
) end of look-ahead

Remove characters after space before a comma

I have a string:
stuff.more AS field1, stuff.more AS field2, blah.blah AS field3
Is there a way I can use regex to extract anything to the right of a space, up-to and including a comma leaving:
field1, field2, field3
I cannot get the proper regex syntax to work for me.
(\w+)(?:,|$)
\w is an alphanumeric character or underscore (you can replace this with [^ ] if you want any character except a space)
+ means one or more of the preceding character
?: makes the group non-capturing
,|$ means the field is followed either by a , or by the end of the line
Note: () denotes a capture group.
Please read more about regex and use debuggex.com to experiment.
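A minimal Python sketch of this pattern applied to the string from the question. Since the pattern has exactly one capture group, re.findall returns just the captured field names.

```python
import re

s = "stuff.more AS field1, stuff.more AS field2, blah.blah AS field3"

# The comma (or end of string) is matched by the non-capturing group,
# so only the word in front of it is returned.
fields = re.findall(r'(\w+)(?:,|$)', s)

print(", ".join(fields))  # field1, field2, field3
```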
Is there a way I can use regex to extract anything to the right of a space up-to and including a comma...
You could do this with either a non capturing group for your , or use a look ahead.
([^\s]+)(?=,|$)
Regular expression:
( group and capture to \1:
[^\s]+ any character except: whitespace (\n,
\r, \t, \f, and " ") (1 or more times)
) end of \1
(?= look ahead to see if there is:
, a comma ','
| OR
$ before an optional \n, and the end of the string
) end of look-ahead
/[^ ]+(,|$)/
should do it. (,|$) allows for your last entry in the line without a comma.
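The practical difference between the look-ahead version and the version that consumes the comma shows up in the full match text. A short Python sketch, assuming the string from the question:

```python
import re

s = "stuff.more AS field1, stuff.more AS field2, blah.blah AS field3"

# Look-ahead: the comma is checked but never becomes part of the match.
with_lookahead = [m.group(0) for m in re.finditer(r'([^\s]+)(?=,|$)', s)]

# Consuming group: the comma is part of the overall match.
with_comma = [m.group(0) for m in re.finditer(r'[^ ]+(,|$)', s)]

print(with_lookahead)  # ['field1', 'field2', 'field3']
print(with_comma)      # ['field1,', 'field2,', 'field3']
```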

I can't find proper regexp

I have the following file (in this format, but much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
There are 2 words (e.g. LSE ZTX) on every line, with optional spaces and/or tabs at the beginning and at the end, and always some in between.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to say that there could be either spaces or tabs (or both of them mixed, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that capturing parentheses are not required when using the /g modifier in list context, because the match then returns every matched substring. Use an array instead if you want to capture all existing matches.
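For comparison, the same "every run of non-whitespace" idea in Python, as a small sketch; the sample line with mixed tabs and spaces is made up for the example.

```python
import re

line = " \t LSE \t ZTX  \n"

# Each maximal run of non-whitespace characters is one word.
words = re.findall(r'\S+', line)
word1, word2 = words[:2]

print(word1, word2)  # LSE ZTX
```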
Since there are always two words, you don't need to match the entire line, so the simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
You will find the first code of a row in group 1 and the second in group 2. \s is a whitespace character; it includes spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
\s also matches tabs, so your regex could look like:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
the first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever's more convenient with your needs.
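A line-by-line Python sketch of the same pattern, assuming each line has already been read separately; the list of sample lines (with added tabs and spaces) is illustrative only.

```python
import re

CODES = re.compile(r'^\s*([A-Z]+)\s+([A-Z]+)')

lines = ["LSE ZTX", "  SWX\tZURN  ", "\tLSE \t ZYT", "NYSE CGI"]

pairs = []
for line in lines:
    match = CODES.match(line)
    if match:
        # group 1 is the first code, group 2 the second
        pairs.append(match.groups())

print(pairs)
```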
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally):
^                        the beginning of the string
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
(                        group and capture to \1:
[A-Z]+                   any character of: 'A' to 'Z' (1 or more times (matching the most amount possible))
)                        end of \1
\s+                      whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
(                        group and capture to \2:
[A-Z]+                   any character of: 'A' to 'Z' (1 or more times (matching the most amount possible))
)                        end of \2
)                        end of grouping
----------------------------------------------------------------------
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
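A sketch of the multiline, named-group approach in Python. Two adjustments are assumptions of this example rather than part of the answer: Python spells named groups (?P<name>...), and [ \t] is used instead of \s around the words so that a match can never run across a line break.

```python
import re

PAIR = re.compile(
    r'^[ \t]*(?P<word1>\S+)[ \t]+(?P<word2>\S+)[ \t]*$',
    re.MULTILINE,  # ^ and $ match at every line start and end
)

text = "  LSE \t ZTX\nSWX ZURN  \n LSE ZYT\nNYSE CGI"

pairs = [(m.group("word1"), m.group("word2")) for m in PAIR.finditer(text)]

print(pairs)
```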
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})\s*$
What this does:
^ // Matches the beginning of the string
\s* // Matches zero or more whitespace characters (spaces, tabs)
([A-Z]{3,4}) // Matches 3 or 4 letters A-Z and captures them to $1
\s+ // Then matches at least one whitespace character
([A-Z]{3,4}) // Matches 3 or 4 letters A-Z and captures them to $2
\s* // Allows the optional trailing whitespace
$ // Matches the end of the string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
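The split approach carries over directly to Python, where str.split() with no arguments also splits on any run of whitespace and ignores leading and trailing whitespace. A minimal sketch with the sample data inlined:

```python
data = """\
LSE ZTX
  SWX \t ZURN
LSE ZYT
NYSE CGI
"""

pairs = []
for line in data.splitlines():
    parts = line.split()  # no regex needed; whitespace runs are collapsed
    if len(parts) >= 2:
        pairs.append((parts[0], parts[1]))

for word1, word2 in pairs:
    print(f"({word1}, {word2})")
```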
Assuming the spaces at the start of the line are what you use to identify the codes you want, split your string at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some whitespace, followed by two words that are each followed by whitespace, so the line must also end with whitespace.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some whitespace) x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
This will match each of your codes (add the /g modifier to collect them all):
/[A-Z]+/

What is this regex doing?

/^([a-z]:)?\//i
I don't quite understand what the ? does in this regex. If I had to explain it from what I understand:
Match at the beginning; group 1 is a to z followed by :; then a ? outside the group (I don't get what it's doing); then \/, which matches /; and the option /i means "case insensitive".
I realize that this will return 0 or 1, but I'm not quite sure why because of the ?.
Is this meant to match a directory path or something?
If I test it:
$var = 'test' would get 0 while $var ='/test'; would get 1 but $var = 'test/' gets 0
So anything that begins with / will get 1 and anything else 0.
Can someone explain this regex in elementary terms to me?
See YAPE::Regex::Explain:
#!/usr/bin/perl
use strict; use warnings;
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/^([a-z]:)?\//i)->explain;
The regular expression:
(?i-msx:^([a-z]:)?/)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?i-msx:                 group, but do not capture (case-insensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally):
^                        the beginning of the string
(                        group and capture to \1 (optional (matching the most amount possible)):
[a-z]                    any character of: 'a' to 'z'
:                        ':'
)?                       end of \1 (NOTE: because you're using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
/                        '/'
)                        end of grouping
----------------------------------------------------------------------
It matches a lower- or upper case letter ([a-z] with the i modifier) positioned at the start of the input string (^) followed by a colon (:) all optionally (?), followed by a forward slash \/.
In short:
^ # match the beginning of the input
( # start capture group 1
[a-z] # match any character from the set {'A'..'Z', 'a'..'z'} (with the i-modifier!)
: # match the character ':'
)? # end capture group 1 and match it once or none at all
\/ # match the character '/'
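The effect of the optional group can be seen by inspecting group 1 in a small Python sketch: when the drive-letter part is absent, the match still succeeds and the group is simply unset.

```python
import re

ROOTED = re.compile(r'^([a-z]:)?/', re.IGNORECASE)

unix_match = ROOTED.match('/home/user')
windows_match = ROOTED.match('C:/Applications')

print(unix_match.group(1))     # None: the optional group matched zero times
print(windows_match.group(1))  # C:
```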
? will match one or none of the preceding pattern.
? Match 1 or 0 times
See also: perldoc perlre
Explanation:
/.../i # case insensitive
^(...) # match at the beginning of the string
[a-z]: # one character between 'a' and 'z' followed by a colon
(...)? # zero or one time of the group, enclosed in ()
So in English: match anything which begins with a / (slash), or with some letter followed by a colon followed by a /. This looks like it matches pathnames across Unix and Windows, e.g.
it would match:
/home/user
and
C:/Applications
etc.
It looks like it is looking for a "rooted" path. It will successfully match any string that either starts with a forward slash (/test), or a drive letter followed by a colon, followed by a forward slash (c:/test).
Specifically, the question mark makes something optional. It applies to the part in parentheses, which is a letter followed by a colon.
Things that will match:
C:/
a:/
/
(That last item above is why the ? is there)
Things that will not match:
C:
a:
ab:/
a/
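The two lists above, plus the test values from the question, can be checked with a short Python sketch; the helper name is_rooted is hypothetical.

```python
import re

ROOTED = re.compile(r'^([a-z]:)?/', re.IGNORECASE)

def is_rooted(path):
    """True if path starts with '/' or with a drive letter, ':' and '/'."""
    return ROOTED.match(path) is not None

should_match = ["C:/", "a:/", "/", "/test", "c:/test"]
should_not_match = ["C:", "a:", "ab:/", "a/", "test", "test/"]

for path in should_match + should_not_match:
    print(repr(path), is_rooted(path))
```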