extract usernames or email/domain after # sign - regex

I have a file with a list of usernames and email addresses and I need two expressions. One to get the email addresses (they always end in .com or .net or .org) and one to get the usernames.
I want the usernames from one expression and the domain portions from the other, in both cases without the # sign.
#stackoverflow.com
#google.com
#example.com
I tried
^#.*?..*?$
Users
#Perl
#Python
#PHP
I tried
^#.*?$
Any suggestions are good.

Your first expression would match if you escaped the dot (\.) before the last .*?. Your second expression simply matches each whole line. To match the text but exclude the #, put everything after it in a capture group.
For the domains use:
^#(\S+\.[^\s]+)$
Regular expression:
^ the beginning of the string
# '#'
( group and capture to \1:
\S+ non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times)
\. '.'
[^\s]+ any character except: whitespace (1 or more times)
) end of \1
$ before an optional \n, and the end of the string
See live demo
For the users use:
^#([^\s.]+)$
Regular expression:
^ the beginning of the string
# '#'
( group and capture to \1:
[^\s.]+ any character except: whitespace or '.' (1 or more times)
) end of \1
$ before an optional \n, and the end of the string
See live demo
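A minimal Python sketch of both patterns, assuming the file has one #-prefixed entry per line as in the question. The sample text and the variable names are made up for illustration; re.MULTILINE is needed so that ^ and $ anchor at every line.

```python
import re

# Sample input assumed from the question: one '#'-prefixed entry per line.
text = """#stackoverflow.com
#google.com
#example.com
#Perl
#Python
#PHP"""

# re.MULTILINE makes ^ and $ match at the start and end of each line.
DOMAIN_RE = re.compile(r"^#(\S+\.[^\s]+)$", re.MULTILINE)
USER_RE = re.compile(r"^#([^\s.]+)$", re.MULTILINE)

# With a single capture group, findall returns the group contents,
# so the leading '#' is not part of the results.
domains = DOMAIN_RE.findall(text)
users = USER_RE.findall(text)

print(domains)  # ['stackoverflow.com', 'google.com', 'example.com']
print(users)    # ['Perl', 'Python', 'PHP']
```

The user pattern skips the domain lines because its character class excludes the dot.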

You could do something like this for domains:
^#[^.]+\.[^.]+$
This will match the start of the string, followed by an #, followed by one or more of any character other than ., followed by a ., followed by one or more of any character other than ., followed by the end of the string.
But this will not capture domains with more than two parts (e.g. #meta.stackoverflow.com). If that's important you might try this instead:
^#[^.]+(\.[^.]+)+$
This will match the start of the string, followed by a #, followed by one or more of any character other than ., followed by a group which consists of a ., followed by one or more of any character other than ., where this group may be repeated one or more times, followed by the end of the string.
And this for users:
^#[^.]+$
This will match the start of the string, followed by an #, followed by one or more of any character other than ., followed by the end of the string.
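To see the difference between the two-part and multi-part domain patterns, here is a small Python sketch. The classify helper is hypothetical and expects one line at a time, because [^.] would otherwise happily match across newlines. These patterns have no capture group, so the # is stripped by slicing.

```python
import re

TWO_PART = re.compile(r"^#[^.]+\.[^.]+$")
MULTI_PART = re.compile(r"^#[^.]+(\.[^.]+)+$")
USER = re.compile(r"^#[^.]+$")

def classify(line):
    """Return ('domain' | 'user' | 'other', value without the '#')."""
    if MULTI_PART.match(line):
        return "domain", line[1:]
    if USER.match(line):
        return "user", line[1:]
    return "other", line

# The two-part pattern rejects a three-part domain...
print(TWO_PART.match("#meta.stackoverflow.com"))  # None
# ...while the repeated group accepts it.
print(classify("#meta.stackoverflow.com"))  # ('domain', 'meta.stackoverflow.com')
print(classify("#Perl"))                    # ('user', 'Perl')
```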

try this
USERS
(?<=#)\w+
EMAIL
(?<=#)\w+\.(?:com|net|org)
EDIT:
You didn't say which regex engine you're using. These patterns are PCRE-style, but most engines that support lookbehind use the same syntax.
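Python's re module also supports this kind of fixed-width lookbehind, so the idea can be sketched as below. The dot is written as \. so that it only matches a literal period. One thing to watch for: the bare user pattern also matches the first label of every domain line. The trailing $ with re.MULTILINE in USER_RE is my own addition to avoid that, not part of the answer.

```python
import re

text = "#stackoverflow.com\n#google.com\n#Perl\n#Python\n"

# Domains: lookbehind for '#', then word characters, a literal dot and a TLD.
DOMAIN_RE = re.compile(r"(?<=#)\w+\.(?:com|net|org)")
# Users as given in the answer; this also hits 'stackoverflow' and 'google'.
LOOSE_USER_RE = re.compile(r"(?<=#)\w+")
# Assumption: anchoring at end of line keeps only entries without a dot.
USER_RE = re.compile(r"(?<=#)\w+$", re.MULTILINE)

domains = DOMAIN_RE.findall(text)
loose_users = LOOSE_USER_RE.findall(text)
users = USER_RE.findall(text)

print(domains)      # ['stackoverflow.com', 'google.com']
print(loose_users)  # ['stackoverflow', 'google', 'Perl', 'Python']
print(users)        # ['Perl', 'Python']
```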

Related

Regex to capture optional characters

I want to pull out a base string (Wax) or (noWax) from a longer string, along with potentially any data before and after if the string is Wax. I'm having trouble getting the last item in my list below (noWax) to match.
Can anyone flex their regex muscles? I'm fairly new to regex so advice on optimization is welcome as long as all matches below are found.
What I'm working with in Regex101:
/(?<Wax>Wax(?:Only|-?\d+))/mg
Original string              need to extract in a capturing group
Loc3_341001_WaxOnly_S212     WaxOnly
Loc4_34412-a_Wax4_S231       Wax4
Loc3a_231121-a_Wax-4-S451    Wax-4
Loc3_34112_noWax_S311        noWax
Here is one way to do so, using a conditional:
(?<Wax>(no)?Wax(?(2)|(?:Only|-?\d+)))
See the online demo.
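(no)?: Optional capture group (group 2, because the named group Wax is group 1).
(?: If.
(2): Test whether capture group 2 ((no)) took part in the match. If it did, match nothing more.
|: Or (group 2 did not match).
(?:Only|-?\d+): Require 'Only', or an optional '-' followed by one or more digits.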
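The same conditional can be tried in Python, with the caveat that Python's re module spells named groups (?P<name>...) rather than (?<name>...). This is only a sketch against the four sample strings from the question; the extract helper is hypothetical.

```python
import re

# Group 1 is the named group 'Wax', group 2 is the optional (no).
# (?(2)|...) means: if group 2 matched, match nothing more,
# otherwise require 'Only' or an optional '-' plus digits.
WAX_RE = re.compile(r"(?P<Wax>(no)?Wax(?(2)|(?:Only|-?\d+)))")

def extract(s):
    """Return the 'Wax' group for the first match, or None."""
    m = WAX_RE.search(s)
    return m.group("Wax") if m else None

samples = {
    "Loc3_341001_WaxOnly_S212": "WaxOnly",
    "Loc4_34412-a_Wax4_S231": "Wax4",
    "Loc3a_231121-a_Wax-4-S451": "Wax-4",
    "Loc3_34112_noWax_S311": "noWax",
}

for original, expected in samples.items():
    print(original, "->", extract(original), "(expected", expected + ")")
```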
I assume the following match is desired.
the match must include 'Wax'
'Wax' is to be preceded by '_' or by '_no'. If the latter, 'no' is included in the match.
'Wax' may be followed by:
'Only' followed by '_', in which case 'Only' is part of the match, or
one or more digits, followed by '_', in which case the digits are part of the match, or
'-' followed by one or more digits, followed by '-', in which case
'-' followed by one or more digits is part of the match.
If these assumptions are correct the string can be matched against the following regular expression:
(?<=_)(?:(?:no)?Wax(?:(?:Only|\d+)?(?=_)|\-\d+(?=-)))
Demo
The regular expression can be broken down as follows.
(?<=_) # positive lookbehind asserts previous character is '_'
(?: # begin non-capture group
(?:no)? # optionally match 'no'
Wax # match literal
(?: # begin non-capture group
(?:Only|\d+)? # optionally match 'Only' or >=1 digits
(?=_) # positive lookahead asserts next character is '_'
| # or
\-\d+ # match '-' followed by >= 1 digits
(?=-) # positive lookahead asserts next character is '-'
) # end non-capture group
) # end non-capture group
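The breakdown above maps directly onto Python's re.VERBOSE mode, which ignores whitespace and allows comments inside the pattern. A sketch, checked only against the four sample strings from the question; the extract helper is hypothetical.

```python
import re

WAX_RE = re.compile(r"""
    (?<=_)               # previous character must be '_'
    (?:                  # begin non-capture group
      (?:no)?            # optionally match 'no'
      Wax                # match literal
      (?:                # begin non-capture group
        (?:Only|\d+)?    # optionally match 'Only' or >= 1 digits
        (?=_)            # next character must be '_'
        |                # or
        -\d+             # match '-' followed by >= 1 digits
        (?=-)            # next character must be '-'
      )                  # end non-capture group
    )                    # end non-capture group
""", re.VERBOSE)

def extract(s):
    """Return the whole match (there is no capture group), or None."""
    m = WAX_RE.search(s)
    return m.group(0) if m else None

for s in ("Loc3_341001_WaxOnly_S212", "Loc4_34412-a_Wax4_S231",
          "Loc3a_231121-a_Wax-4-S451", "Loc3_34112_noWax_S311"):
    print(s, "->", extract(s))
```

Because the lookarounds consume nothing, the overall match is exactly the text to extract and no capture group is needed.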

Regex to validate cookie string (Key value paired)

So far I tried this regex but no luck.
([^=;]+=[^=;]+(;(?!$)|$))+
Valid Strings:
something=value1;another=value2
something=value1 ; anothe=value2
Invalid Strings:
something=value1 ;;;name=test
some=value=3;key=val
somekey=somevalue;
You might use an optional repeating group to get the matches.
If you don't want to cross newline boundaries, you might add \n or \r\n to the negated character class.
^[^=;\n]+=[^=;\n]+(?:;[^=;\n]+=[^=;\n]+)*$
Explanation
^ Start of string
[^=;\n]+=[^=;\n]+ Match the key and value using a negated character class
(?: Non capture group
;[^=;\n]+=[^=;\n]+ Match a semicolon followed by the same key=value pattern
)* Close group and repeat 0+ times
$ End of string
Regex demo
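A quick way to check the pattern against the strings from the question, sketched in Python. The names PAIR and is_valid_cookie_string are made up here. re.fullmatch stands in for the ^ and $ anchors, which also avoids Python's $ accepting a trailing newline.

```python
import re

# One key=value pair: neither side may contain '=', ';' or a newline.
PAIR = r"[^=;\n]+=[^=;\n]+"
# A first pair, then zero or more ';'-prefixed pairs.
COOKIE_RE = re.compile(rf"{PAIR}(?:;{PAIR})*")

def is_valid_cookie_string(s):
    # fullmatch requires the whole string to match, like ^...$.
    return COOKIE_RE.fullmatch(s) is not None

valid = ["something=value1;another=value2",
         "something=value1 ; anothe=value2"]
invalid = ["something=value1 ;;;name=test",
           "some=value=3;key=val",
           "somekey=somevalue;"]

for s in valid + invalid:
    print(repr(s), is_valid_cookie_string(s))
```

Note that spaces around the semicolon are accepted only because they count as part of the neighbouring key or value.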

Regex catching adjacent characters with a single character set

I am trying to construct a regex statement that matches a string conforming to the following conditions:
3-63 lowercase alphanumeric characters, plus "." and "-"
May not start or end with . or -
Dashes and periods cannot be adjacent to each other.
abc-123.xyz <- should match
abc123-.xyz <- should not match
I have been able to put this regex together, but it does not enforce the third requirement. I've tried adding another negative lookahead/lookbehind, e.g. (?!.-|-.), but it's still matching strings with adjacent periods and dashes. Here's the regex I came up with that fulfills conditions 1 and 2:
^(?!\.|-)([a-z0-9]|\.|-){3,63}(?<!\.|-)$
FYI, this regex is for validating input when specifying an AWS S3 bucket name in a CloudFormation template.
How about:
^(?=.{3,63}$)[a-z0-9]+(?:[-.][a-z0-9]+)*$
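This pattern splits the job in two: the lookahead (?=.{3,63}$) checks the total length, and the rest requires runs of alphanumerics separated by single . or - characters, which rules out a separator at either end and two separators in a row. A Python sketch, with is_valid_name as a made-up helper; fullmatch is used so a trailing newline is not silently accepted.

```python
import re

BUCKET_RE = re.compile(r"^(?=.{3,63}$)[a-z0-9]+(?:[-.][a-z0-9]+)*$")

def is_valid_name(name):
    return BUCKET_RE.fullmatch(name) is not None

print(is_valid_name("abc-123.xyz"))  # True
print(is_valid_name("abc123-.xyz"))  # False: '-' and '.' are adjacent
print(is_valid_name("ab"))           # False: shorter than 3 characters
print(is_valid_name("-abc"))         # False: starts with '-'
print(is_valid_name("a" * 64))       # False: longer than 63 characters
```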
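Use this pattern:
^(?!.*[.-](?=[.-]))[^.-][a-z0-9.-]{1,61}[^.-]$
Note that [^.-] accepts any character other than . or -, so replace it with [a-z0-9] if the first and last characters must also be lowercase alphanumeric.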
# ^(?!.*[.-](?=[.-]))[^.-][a-z0-9.-]{1,61}[^.-]$
^ # Start of string/line
(?! # Negative Look-Ahead
. # Any character except line break
* # (zero or more)(greedy)
[.-] # Character in [.-] Character Class
(?= # Look-Ahead
[.-] # Character in [.-] Character Class
) # End of Look-Ahead
) # End of Negative Look-Ahead
[^.-] # Character not in [.-] Character Class
[a-z0-9.-] # Character in [a-z0-9.-] Character Class
{1,61} # (repeated {1,61} times)
[^.-] # Character not in [.-] Character Class
$ # End of string/line
^[a-z0-9](?:[a-z0-9]|[.\-](?=[a-z0-9])){2,62}$
We match a lowercase alphanumeric character, followed by between 2 and 62 repetitions of either:
a lowercase alphanumeric character, or
a . or - (which must be followed by a lowercase alphanumeric character).
The last restriction makes sure that you can't have two ./- characters in a row, or a ./- at the end of the string.
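Here is that last pattern exercised in Python, as a sketch. The test names are invented apart from the two from the question. The length limit falls out of the quantifier: one leading character plus 2 to 62 repetitions gives 3 to 63 characters in total.

```python
import re

NAME_RE = re.compile(r"^[a-z0-9](?:[a-z0-9]|[.\-](?=[a-z0-9])){2,62}$")

def is_valid_name(name):
    # fullmatch so that a trailing newline does not slip past '$'.
    return NAME_RE.fullmatch(name) is not None

cases = ["abc-123.xyz",  # valid
         "abc123-.xyz",  # '-' followed by '.', lookahead fails
         "a..b",         # two '.' in a row
         "abc-",         # separator at the end
         ".abc",         # separator at the start
         "a.b"]          # valid, exactly 3 characters

for name in cases:
    print(name, is_valid_name(name))
```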

Regex to allow a comma separated list of codes

I have an input form that I need to validate. The list must follow these rules:
comma separated
each code can either
begin with a single letter followed by a single underscore only, followed by any number of letters or
a group of numbers
the list must not end with a trailing comma
Valid example data
A_AAAAA,B_BBBBB,122334,D_DFDFDF
12345,123567,123456,A_BBBBB,C_DDDDD,1234567
Invalid example data
RR_RRR,12345
1_111,AVSFFF,
A_SDDF,,123342
I am using http://www.regexr.com and have got as far as this: [A-Z_]_[A-Z],|[0-9],
The problem with this is the last code in each valid data example is not selected so the line does not pass the regex pattern
Try this:
^(?:(?:[A-Za-z]_[A-Za-z]*|\d+)(?:,|$))+(?<!,)$
regex101 demo.
Explanation:
^ start of string
(?: this group matches a single element in the list:
(?:
[A-Za-z] a single letter
_ underscore
[A-Za-z]* any number of letters (including 0)
| or
\d+ digits
)
(?: followed by either a comma
,
| or the end of the string
$
)
)+ match one or more list elements
(?<! make sure there's no trailing comma
,
)
$ end of string
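The pattern can be checked against the example data from the question with a few lines of Python. A sketch only: is_valid_list is a hypothetical helper, and fullmatch is used so that Python's $ does not accept a trailing newline.

```python
import re

LIST_RE = re.compile(r"^(?:(?:[A-Za-z]_[A-Za-z]*|\d+)(?:,|$))+(?<!,)$")

def is_valid_list(s):
    return LIST_RE.fullmatch(s) is not None

valid = ["A_AAAAA,B_BBBBB,122334,D_DFDFDF",
         "12345,123567,123456,A_BBBBB,C_DDDDD,1234567"]
invalid = ["RR_RRR,12345",     # two letters before the underscore
           "1_111,AVSFFF,",    # digit before underscore, trailing comma
           "A_SDDF,,123342"]   # empty element

for s in valid + invalid:
    print(s, is_valid_list(s))
```

Because [A-Za-z]* allows zero letters, a code such as A_ also passes; change * to + if at least one letter is required after the underscore.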
Try this -
^(?:[A-Z]_[A-Z]*|[0-9]+)(?:,(?:[A-Z]_[A-Z]*|[0-9]+))*$
Demo
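This second pattern has the shape "one code, then zero or more comma-plus-code groups", which rules out empty elements and a trailing comma without needing a lookbehind. In Python the repetition can be made explicit by building the pattern from a single code sub-pattern. The names CODE and is_valid_list are made up for this sketch. Note that this version only accepts uppercase letters, as written in the answer.

```python
import re

# One code: a letter, an underscore and any number of letters, or digits.
CODE = r"(?:[A-Z]_[A-Z]*|[0-9]+)"
# First code, then zero or more ',code' groups.
LIST_RE = re.compile(rf"{CODE}(?:,{CODE})*")

def is_valid_list(s):
    # fullmatch plays the role of the ^ and $ anchors.
    return LIST_RE.fullmatch(s) is not None

print(is_valid_list("A_AAAAA,B_BBBBB,122334,D_DFDFDF"))  # True
print(is_valid_list("A_SDDF,,123342"))                   # False
print(is_valid_list("1_111,AVSFFF,"))                    # False
print(is_valid_list("a_bbb"))                            # False: lowercase
```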

Remove characters after space before a comma

I have a string:
stuff.more AS field1, stuff.more AS field2, blah.blah AS field3
Is there a way I can use regex to extract anything to the right of a space, up-to and including a comma leaving:
field1, field2, field3
I cannot get the proper regex syntax to work for me.
(\w+)(?:,|$)
\w is an alphanumeric character or underscore (you can replace this with [^ ] if you want any character except a space)
+ means one or more of the preceding character
?: makes the group non-capturing
,|$ means the field must be followed by either a , or the end of the line
note: () denotes a capture group
please read more about regex here and use debuggex.com to experiment.
Is there a way I can use regex to extract anything to the right of a space up-to and including a comma...
You could do this with either a non capturing group for your , or use a look ahead.
([^\s]+)(?=,|$)
Regular expression:
( group and capture to \1:
[^\s]+ any character except: whitespace (\n, \r, \t, \f, and " ") (1 or more times)
) end of \1
(?= look ahead to see if there is:
, a comma ','
| OR
$ before an optional \n, and the end of the string
) end of look-ahead
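Both the non-capturing-group pattern and the lookahead pattern can be tried with re.findall in Python, which returns the contents of the single capture group. A sketch using the string from the question; the variable names are arbitrary.

```python
import re

s = "stuff.more AS field1, stuff.more AS field2, blah.blah AS field3"

# Non-capturing group: the comma is consumed but not captured.
with_group = re.findall(r"(\w+)(?:,|$)", s)
# Lookahead: the comma is only checked, never consumed.
with_lookahead = re.findall(r"([^\s]+)(?=,|$)", s)

print(with_group)      # ['field1', 'field2', 'field3']
print(with_lookahead)  # ['field1', 'field2', 'field3']

# Rebuild the comma-separated output asked for in the question.
result = ", ".join(with_lookahead)
print(result)          # field1, field2, field3
```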
/[^ ]+(,|$)/
should do it. (,|$) allows for your last entry in the line without a comma.
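One catch when trying this last pattern in Python: because (,|$) is a capture group, re.findall returns only that group. A sketch that uses finditer and group(0) to get the whole match, comma included, as the question asked:

```python
import re

s = "stuff.more AS field1, stuff.more AS field2, blah.blah AS field3"
FIELD_RE = re.compile(r"[^ ]+(,|$)")

# findall returns just the captured ',' or empty string.
groups_only = FIELD_RE.findall(s)
# group(0) is the whole match, including the trailing comma.
matches = [m.group(0) for m in FIELD_RE.finditer(s)]

print(groups_only)        # [',', ',', '']
print(matches)            # ['field1,', 'field2,', 'field3']
print(" ".join(matches))  # field1, field2, field3
```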