I have a string:
stuff.more AS field1, stuff.more AS field2, blah.blah AS field3
Is there a way I can use regex to extract anything to the right of a space, up-to and including a comma leaving:
field1, field2, field3
I cannot get the proper regex syntax to work for me.
(\w+)(?:,|$)
\w matches a word character: a letter, digit, or underscore (you can replace it with [^ ] if you want any character except a space)
+ means one or more of the preceding character
?: at the start of a group makes it a non-capturing group
,|$ means the match must be followed by either a , or the end of the line
note: () denotes a capture group
Please read more about regex, and use debuggex.com to experiment.
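As a quick check of the pattern above, here is a minimal sketch in Python. The question does not name a language or engine, so Python's re module is an assumption. With a single capture group, re.findall returns only the captured text, so the trailing comma is dropped:

```python
import re

text = "stuff.more AS field1, stuff.more AS field2, blah.blah AS field3"

# (\w+) captures the alias; (?:,|$) requires a comma or the end of the
# string right after it, without capturing the comma.
fields = re.findall(r"(\w+)(?:,|$)", text)

print(", ".join(fields))  # field1, field2, field3
```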
Is there a way I can use regex to extract anything to the right of a space up-to and including a comma...
You could do this with either a non capturing group for your , or use a look ahead.
([^\s]+)(?=,|$)
Regular expression:
(         group and capture to \1:
[^\s]+    any character except: whitespace (\n, \r, \t, \f, and " ") (1 or more times)
)         end of \1
(?=       look ahead to see if there is:
,         a comma ','
|         OR
$         before an optional \n, and the end of the string
)         end of look-ahead
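The practical difference between the two options (non-capturing group versus look-ahead) is whether the comma becomes part of the overall match. A small sketch, again assuming Python's re module, which the question does not specify:

```python
import re

text = "stuff.more AS field1, stuff.more AS field2, blah.blah AS field3"

# Non-capturing group: the comma is consumed, so the full match includes it.
with_group = [m.group(0) for m in re.finditer(r"(\w+)(?:,|$)", text)]

# Look-ahead: the comma is only checked, never consumed.
with_lookahead = [m.group(0) for m in re.finditer(r"([^\s]+)(?=,|$)", text)]

print(with_group)      # ['field1,', 'field2,', 'field3']
print(with_lookahead)  # ['field1', 'field2', 'field3']
```

In both cases group 1 holds the bare field name; only the full match differs.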
/[^ ]+(,|$)/
should do it. (,|$) allows for your last entry in the line without a comma.
I need some assistance, I have been at this for hours now. I am not winning.
I need to match a space only if it is followed by a non-numeric character (which I will replace with an empty string to remove it from the string).
I have tried this ^[^\s+]+\D and it works to some extent.
If I have the string " JLABCD-1 836397-BTD56517", it correctly returns "JLABCD-1 836397-BTD56517" without the leading space, which is what I want.
If I have " BefhMS JLZARL-1 836397-BTD56517", it returns "JLZARL-1 836397-BTD56517".
But if I don't have a space before the first word, I want it to ignore all other spaces.
If I have "_JLABCD-1 836397-BTD56517", I want to return "JLABCD-1 836397-BTD56517" or the original string as it is. Not "836397-BTD56517" which is what I am getting at the moment.
Is this possible with Regex?
Use a look ahead:
"^ +(?=\D)"
but it seems you just want to match any leading spaces. If so, just use:
"^ +"
The negated (due to its first character being ^) character class [^\s+] in your regex matches anything not whitespace or a +.
Use
^\s+(\D)
Replace with $1, it is a backreference to the capturing group (\D). Or \1 if $1 does not work.
Explanation
^      the beginning of the string
\s+    whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
(      group and capture to \1:
\D     non-digits (all but 0-9)
)      end of \1
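Both suggestions can be tried side by side. The following is a sketch in Python (an assumption, since the question does not say which language is used); the function names are made up for illustration:

```python
import re

def strip_leading_ws(s: str) -> str:
    # ^\s+(\D): leading whitespace followed by a non-digit. The non-digit
    # is captured and put back by \1 in the replacement.
    return re.sub(r"^\s+(\D)", r"\1", s)

def strip_leading_ws_lookahead(s: str) -> str:
    # Same idea with a look-ahead: the non-digit is only checked, so the
    # replacement can simply be empty.
    return re.sub(r"^ +(?=\D)", "", s)

print(strip_leading_ws(" JLABCD-1 836397-BTD56517"))  # JLABCD-1 836397-BTD56517
print(strip_leading_ws("_JLABCD-1 836397-BTD56517"))  # _JLABCD-1 836397-BTD56517
```

Because both patterns are anchored with ^, spaces later in the string are never touched, which is what the second example relies on.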
I have the following text on lines in a text file.
34233-XR43-2343
34234-GH33-3434
23567-RF32-5266
234552667
09223-23RF-5237
I want to tab and duplicate that text and place it in the same line with additional text at the beginning and end.
For example: (Consider white space to be a tab)
34233-XR43-2343 ~/images/products/34233-XR43-2343.jpg
34234-GH33-3434 ~/images/products/34234-GH33-3434.jpg
23567-RF32-5266 ~/images/products/23567-RF32-5266.jpg
234552667 ~/images/products/234552667.jpg
09223-23RF-5237 ~/images/products/09223-23RF-5237.jpg
Given your format, you can use the following regular expression to do this.
Find: ^(\S+)
Replace: \1\t~/images/products/\1.jpg
Explanation:
^ # the beginning of the string
( # group and capture to \1:
\S+ # non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times)
) # end of \1
In the replacement, we use \1 to reference what was matched and captured by capturing group #1.
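If the same find/replace is run from a script instead of an editor, it could look like the sketch below. Python is an assumption here (the question only mentions a text file); \g<1> is Python's unambiguous spelling of the \1 backreference, and re.MULTILINE makes ^ match at the start of every line:

```python
import re

text = "34233-XR43-2343\n234552667\n09223-23RF-5237"

result = re.sub(
    r"^(\S+)",                                # the code at the start of each line
    r"\g<1>\t~/images/products/\g<1>.jpg",    # code, tab, path built from the code
    text,
    flags=re.MULTILINE,
)

print(result)
```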
I have a file with a list of usernames and email addresses and I need two expressions. One to get the email addresses (they always end in .com or .net or .org) and one to get the usernames.
I only want the usernames as one expression and domain portions as the other, I don't want the # sign.
#stackoverflow.com
#google.com
#example.com
I tried
^#.*?..*?$
Users
#Perl
#Python
#PHP
I tried
^#.*?$
Any suggestions are good.
Your first expression would match if you escaped the dot (\.) before the last .*?. Your second expression simply matches every whole line. To match the text but exclude the #, you could use a capturing group.
For the domains use:
^#(\S+\.[^\s]+)$
Regular expression:
^ the beginning of the string
# '#'
( group and capture to \1:
\S+ non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times)
\. '.'
[^\s]+ any character except: whitespace (1 or more times)
) end of \1
$ before an optional \n, and the end of the string
For the users use:
^#([^\s.]+)$
Regular expression:
^ the beginning of the string
# '#'
( group and capture to \1:
[^\s.]+ any character except: whitespace or '.' (1 or more times)
) end of \1
$ before an optional \n, and the end of the string
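To see the two expressions separate the sample data, here is a minimal sketch assuming Python's re module (the question does not name an engine). re.MULTILINE lets ^ and $ anchor at each line of the file contents:

```python
import re

text = "#stackoverflow.com\n#google.com\n#example.com\n#Perl\n#Python\n#PHP"

# Lines that contain a dot after the '#': the domain portions.
domains = re.findall(r"^#(\S+\.[^\s]+)$", text, flags=re.MULTILINE)

# Lines with no dot at all after the '#': the usernames.
users = re.findall(r"^#([^\s.]+)$", text, flags=re.MULTILINE)

print(domains)  # ['stackoverflow.com', 'google.com', 'example.com']
print(users)    # ['Perl', 'Python', 'PHP']
```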
You could do something like this for domains:
^#[^.]+\.[^.]+$
This will match the start of the string, followed by an #, followed by one or more of any character other than ., followed by a ., followed by one or more of any character other than ., followed by the end of the string.
But this will not capture domains with more than two parts (e.g. #meta.stackoverflow.com). If that's important you might try this instead:
^#[^.]+(\.[^.]+)+$
This will match the start of the string, followed by an #, followed by one or more of any character other than ., followed by a group which consists of a ., followed by one or more of any character other than ., where this group may be repeated one or more times, followed by the end of the string.
And this for users:
^#[^.]+$
This will match the start of the string, followed by an #, followed by one or more of any character other than ., followed by the end of the string.
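A short sketch of these three patterns, assuming Python and assuming the file is processed one line at a time. That second assumption matters: [^.] also matches newline characters, so running these patterns over the whole file at once could let one match span several lines.

```python
import re

TWO_PART = re.compile(r"^#[^.]+\.[^.]+$")
MULTI_PART = re.compile(r"^#[^.]+(\.[^.]+)+$")
USER = re.compile(r"^#[^.]+$")

lines = ["#stackoverflow.com", "#meta.stackoverflow.com", "#Perl"]

two_part = [line for line in lines if TWO_PART.match(line)]
multi_part = [line for line in lines if MULTI_PART.match(line)]
users = [line for line in lines if USER.match(line)]

print(two_part)    # ['#stackoverflow.com']
print(multi_part)  # ['#stackoverflow.com', '#meta.stackoverflow.com']
print(users)       # ['#Perl']
```

Note that these patterns match the whole line including the #; wrap the part after # in a capture group if only that part is wanted.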
try this
USERS
(?<=#)\w+
EMAIL
(?<=#)\w+\.(?:com|net|org)
EDIT:
You didn't stipulate which regex engine you're running on. This is PCRE based, but any engine with lookbehind will most likely use the same syntax.
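Python's re module also supports this kind of fixed-width lookbehind, so the idea can be sketched there (Python is an assumption, and the dot is written as \. so it only matches a literal dot). One caveat: a bare (?<=#)\w+ also matches the first label of a domain line, so this sketch adds a trailing $ to the user pattern, which is not part of the answer above.

```python
import re

lines = ["#stackoverflow.com", "#google.com", "#Perl", "#PHP"]

EMAIL = re.compile(r"(?<=#)\w+\.(?:com|net|org)")
USER = re.compile(r"(?<=#)\w+$")  # trailing $ added in this sketch only

domains = [m.group(0) for line in lines if (m := EMAIL.search(line))]
users = [m.group(0) for line in lines if (m := USER.search(line))]

# Without the $, the user pattern picks up part of a domain line.
unanchored = re.search(r"(?<=#)\w+", "#google.com").group(0)

print(domains)     # ['stackoverflow.com', 'google.com']
print(users)       # ['Perl', 'PHP']
print(unanchored)  # google
```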
I have some code strings that I need to extract some data from, but specific data at that.
The strings are always in the same format.
I need to extract the text at the beginning between the ( and ), so it would extract List Options here.
I need to extract the text near the end #groups here.
The string I need at the end will always start with #
(List Options((join ", ", #groups))
I have tried:
^\((\w+).*, (#\w+)\)\)$
But it only gives me the word List
This should work well for you.
^\(([^(]+)[^#]+([^)]+)\)+$
Regular expression:
^ the beginning of the string
\( look and match '('
( group and capture to \1:
[^(]+ any character except: '(' (1 or more times)
) end of \1
[^#]+ any character except: '#' (1 or more times)
( group and capture to \2:
[^)]+ any character except: ')' (1 or more times)
) end of \2
\)+ ')' (1 or more times)
$ before an optional \n, and the end of the string
try this one
^\(([^\(]+?)\(.*?#([^\)]+?)\)
or if you need the # sign also captured, just move it inside the 2nd capturing group
^\(([^\(]+?)\(.*?(#[^\)]+?)\)
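Both suggested patterns can be checked against the sample string with a few lines of Python (an assumption; the question does not say which language the strings are processed in). The original attempt stopped at List because \w+ cannot match the space in List Options, whereas [^(]+ can.

```python
import re

s = '(List Options((join ", ", #groups))'

# Negated character classes; group 2 starts at '#', so '#' is included.
m1 = re.match(r"^\(([^(]+)[^#]+([^)]+)\)+$", s)

# Lazy quantifiers; '#' sits outside group 2, so it is not captured.
m2 = re.match(r"^\(([^\(]+?)\(.*?#([^\)]+?)\)", s)

print(m1.groups())  # ('List Options', '#groups')
print(m2.groups())  # ('List Options', 'groups')
```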
I have the following file(like this scheme, but much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
There are 2 words (e.g. LSE ZTX) in every line, with optional spaces and/or tabs at the beginning and at the end, and always some in between.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to say that there could be either spaces or tabs (or both of them mixed, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that parentheses are not required when using the /g modifier. Use an array instead if you want to capture all existing matches.
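For readers not using Perl, the same "collect every run of non-whitespace" idea can be sketched in Python (an assumption; the thread itself uses Perl). re.findall plays the role of the /g match in list context:

```python
import re

line = "  LSE \t ZTX  "

words = re.findall(r"\S+", line)   # every run of non-whitespace characters
word1, word2 = words[:2]           # raises ValueError if fewer than two words

print(word1, word2)  # LSE ZTX
```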
Since there are always two words, you don't need to match the entire line, so the simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
You will find the first code of a row in group 1 and the second in group 2. \s is a whitespace character; it includes spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
\s also matches tabs, so your regex can look like this:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
the first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever's more convenient with your needs.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE        EXPLANATION
(?-imsx:    group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally):
^           the beginning of the string
\s*         whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
(           group and capture to \1:
[A-Z]+      any character of: 'A' to 'Z' (1 or more times (matching the most amount possible))
)           end of \1
\s+         whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
(           group and capture to \2:
[A-Z]+      any character of: 'A' to 'Z' (1 or more times (matching the most amount possible))
)           end of \2
)           end of grouping
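The anchored two-group pattern translates directly to other engines. A sketch in Python (an assumption, since the thread uses Perl), with a hypothetical helper parse_pair and re.IGNORECASE standing in for Perl's /i:

```python
import re

PAIR = re.compile(r"^\s*([A-Z]+)\s+([A-Z]+)", re.IGNORECASE)

def parse_pair(line: str):
    """Return (code1, code2) for a line, or None if it has fewer than two codes."""
    m = PAIR.match(line)
    return m.groups() if m else None

print(parse_pair("\tSWX \t ZURN \n"))  # ('SWX', 'ZURN')
print(parse_pair("LSE"))               # None
```

The pattern has no trailing anchor, so whitespace (or anything else) after the second code is simply ignored.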
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
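The (?<name>...) syntax above is what engines such as .NET and PCRE accept. In Python, used here only as an illustration, named groups are written (?P<name>...), and the "Multiline" option corresponds to re.MULTILINE:

```python
import re

text = "LSE ZTX\n  SWX\tZURN  \nLSE ZYT\n\tNYSE  CGI"

PAIR = re.compile(r"^\s*(?P<word1>\S+)\s+(?P<word2>\S+)\s*$", re.MULTILINE)

# One match per line, each exposing the two named groups.
pairs = [(m["word1"], m["word2"]) for m in PAIR.finditer(text)]

print(pairs)
# [('LSE', 'ZTX'), ('SWX', 'ZURN'), ('LSE', 'ZYT'), ('NYSE', 'CGI')]
```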
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})\s*$
What this does
^ // Matches the beginning of a string
\s* // Matches a space/tab character zero or more times
([A-Z]{3,4}) // Matches 3 or 4 letters A-Z and captures them to $1
\s+ // Then matches at least one tab or space
([A-Z]{3,4}) // Matches 3 or 4 letters A-Z and captures them to $2
\s* // Allows the optional trailing spaces/tabs mentioned in the question
$ // Matches the end of a string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
Assuming the spaces at the start of the line are what you use to identify your codes you want, try this:
Split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some space, followed by a (word - some space - word), then end with some space.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some space)x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
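The difference between the strict pattern and the capturing one shows up on lines without surrounding whitespace. A small sketch in Python (an assumption; the answer above is written for Perl):

```python
import re

STRICT = re.compile(r"^\s+(\w+\s+){2}$")     # needs leading AND trailing whitespace
LOOSE = re.compile(r"^\s*(\w+)\s+(\w+)")     # captures the two codes

padded = "  LSE ZTX  "
bare = "LSE ZTX"

strict_padded = STRICT.match(padded) is not None
strict_bare = STRICT.match(bare) is not None
loose_bare = LOOSE.match(bare).groups()

print(strict_padded, strict_bare)  # True False
print(loose_bare)                  # ('LSE', 'ZTX')
```

A trailing newline counts as whitespace, so lines read from a file without stripping the newline can still satisfy the strict pattern as long as they start with whitespace.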
This will match all your codes
/[A-Z]+/