Regex to match words and those with an apostrophe

Update: As per comments regarding the ambiguity of my question, I've increased the detail in the question.
(Terminology: by words I am referring to any succession of alphanumeric characters.)
I'm looking for a regex to match the following, verbatim:
Words.
Words with one apostrophe at the beginning.
Words with any number of non-contiguous apostrophes throughout the middle.
Words with one apostrophe at the end.
I would like to match the following, however not verbatim, rather, removing the apostrophes:
Words with an apostrophe at the beginning and at the end would be matched to the word, without the apostrophes. So 'foo' would be matched to foo.
Words with more than one contiguous apostrophe in the middle would be resolved to two different words: the fragment before the contiguous apostrophes and the fragment after the contiguous apostrophes. So, foo''bar would be matched to foo and bar.
Words with more than one contiguous apostrophe at the beginning or at the end would be matched to the word, without the apostrophes. So, ''foo would be matched to foo and ''foo'' to foo.
Examples
These would be matched verbatim:
'bout
it's
persons'
But these would be ignored:
'
''
And, for 'open', open would be matched.

Try using this:
(?=.*\w)^(\w|')+$
'bout # pass
it's # pass
persons' # pass
' # fail
'' # fail
Regex Explanation
NODE        EXPLANATION
(?=         look ahead to see if there is:
  .*          any character except \n (0 or more times (matching the most amount possible))
  \w          word characters (a-z, A-Z, 0-9, _)
)           end of look-ahead
^           the beginning of the string
(           group and capture to \1 (1 or more times (matching the most amount possible)):
  \w          word characters (a-z, A-Z, 0-9, _)
  |           OR
  '           '\''
)+          end of \1 (NOTE: because you're using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
$           before an optional \n, and the end of the string
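As a quick check of the pattern above, here is a minimal Python sketch that runs it against the examples from the question. The helper name is_word is made up for illustration, and Python's re module is only an assumed stand-in for whichever engine you actually use.

```python
import re

# The pattern from the answer: the lookahead demands at least one word
# character, then the whole string must consist of word characters or
# apostrophes.
PATTERN = re.compile(r"(?=.*\w)^(\w|')+$")

def is_word(s):
    """Hypothetical helper: True if the whole string passes the pattern."""
    return PATTERN.match(s) is not None

for sample in ["'bout", "it's", "persons'", "'", "''"]:
    print(repr(sample), "pass" if is_word(sample) else "fail")
```

Note that this only validates a whole string. It accepts 'open' with both apostrophes still attached, so stripping them would need a separate step.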

/('\w+)|(\w+'\w+)|(\w+')|(\w+)/
'\w+ Matches a ' followed by one or more word characters, OR
\w+'\w+ Matches one or more word characters followed by a ' followed by one or more word characters, OR
\w+' Matches one or more word characters followed by a ', OR
\w+ Matches one or more word characters
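To see how the four alternatives behave on running text, here is a small Python sketch. The helper find_words and the sample sentence are assumptions for illustration only.

```python
import re

# The alternatives are tried left to right at each starting position.
PATTERN = re.compile(r"('\w+)|(\w+'\w+)|(\w+')|(\w+)")

def find_words(text):
    """Hypothetical helper: return the full text of every match."""
    return [m.group(0) for m in PATTERN.finditer(text)]

# The lone ' and '' at the end contain no word characters, so they are skipped.
print(find_words("'bout it's persons' open ' ''"))
```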

How about this?
'?\b[0-9A-Za-z']+\b'?
EDIT: the previous version doesn't include apostrophes on the sides.

I submitted this second answer because the question has changed quite a bit and my previous answer is no longer valid. Anyway, if all the conditions are now listed, try this:
(((?<!')')?\b[0-9A-Za-z]+\b('(?!'))?|\b[0-9A-Za-z]+('[0-9A-Za-z]+)*\b)

This works fine
('*)(?:'')*('?(?:\w+'?)+\w+('\b|'?[^']))(\1)
on this data no problem
'bou
it's
persons'
'open'
open
foo''bar
''foo
bee''
''foo''
'
''
on this data you should strip the results (remove the spaces from the matches)
'bou it's persons' 'open' open foo''bar ''foo ''foo'' ' ''
(tested in The Regulator, results in $2)

Related

Regex for no single quote and newline character in between single quotes

So far I have this: '.[^ \n']*'(?!') with a negative lookahead after the last quote.
Unfortunately, this does allow ''' (three single quotes).
The regex should match these strings
'abc'
'abc##$%^xyz'
The regex shouldn't match these strings
'\n'
'abc#'#$%^xyz'
'''
'
My current regex uses a negative lookahead for a single quote. I am trying to find a way to make it more general, so that it doesn't match if the string has an odd number of single quotes.
If your patterns always occur alone on a line, you could use this:
^'[^\n']*'$
If you want to find matching pairs of single quotes in a bigger text, I think regex is not the solution for you.
You could use:
^'[^\n']*(?:'[^\n']*')*[^\n']*'$
Explanation
^ Start of string
' Match a single quote
[^\n']* Match 0+ chars other than a newline or a single quote
(?: Non capture group to repeat as a whole part
'[^\n']*' Match from ' to ' without matching newlines in between
)* Close the non capture group and optionally repeat it
[^\n']* Match 0+ chars other than a newline or a single quote
' Match a single quote
$ End of string
See a regex101 demo.
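Both patterns can be checked against the strings from the question with a short Python sketch. The names SIMPLE and PAIRED are made up here, and '\n' in the question is assumed to mean a real newline between the quotes.

```python
import re

# Exactly one pair of quotes with no quote or newline in between.
SIMPLE = re.compile(r"^'[^\n']*'$")
# An outer pair of quotes that may contain further complete pairs.
PAIRED = re.compile(r"^'[^\n']*(?:'[^\n']*')*[^\n']*'$")

should_match = ["'abc'", "'abc##$%^xyz'"]
should_fail = ["'\n'", "'abc#'#$%^xyz'", "'''", "'"]

for s in should_match + should_fail:
    print(repr(s), bool(SIMPLE.search(s)), bool(PAIRED.search(s)))
```

The two patterns differ on a string such as 'a'b'c' (four quotes): the first rejects it, while the second accepts any even number of quotes.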

Notepad++: Delete everything after a number of characters in string

The following is an example of 24 characters per line in Notepad++. I need to limit the characters per line to 14 characters.
Hell, how is she today ?
I need it to look like the following:
Hell, how is
I used this code
Find what: ^(.{1,14}).*
Replace with: $1
However, it shows "Hell, how is s", which cuts a word in half.
How can I limit each line to 14 characters in Notepad++ and delete the last, partial word?
This should work for you:
Find what: ^(.{1,14}(?<=\S)\b).*$
Replace with: $1
so for Hell, how is she today ? the output is: Hell, how is
^ # The beginning of the string
( # Group and capture to \1:
.{1,14} # Any character except \n (between 1 and 14 times (matching the most amount possible))
(?<=\S) # This lookbehind makes sure the last char is not a space
\b # The boundary between a word char (\w) and a non-word char. It matches the end of a word in this case
) # End of \1
.*$ # Match any char till the end of the line
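The same find/replace can be tried outside Notepad++ with a minimal Python sketch. Python's re module is an assumption here (Notepad++ uses its own engine), and the function name truncate is hypothetical, but the constructs used in this pattern are supported by both.

```python
import re

# Same pattern as in "Find what"; re.MULTILINE makes ^ and $ work per line.
PATTERN = re.compile(r"^(.{1,14}(?<=\S)\b).*$", re.MULTILINE)

def truncate(text):
    """Hypothetical helper: keep at most 14 characters per line, ending on a whole word."""
    return PATTERN.sub(r"\1", text)

# .{1,14} first grabs "Hell, how is s", then backtracks until the last
# character is a non-space that is followed by a word boundary.
print(truncate("Hell, how is she today ?"))  # Hell, how is
```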
Is that what you want:
Find what: ^(.{1,14}) .*$
Replace with: $1
This will truncate at 14 characters or less if there is a space.
You could also use \K as a variable-length lookbehind and replace with nothing:
^.{0,13}\w\b\K.*
\w matches a word character, \b a word boundary
Test at regex101.com

extract usernames or email/domain after # sign

I have a file with a list of usernames and email addresses and I need two expressions. One to get the email addresses (they always end in .com or .net or .org) and one to get the usernames.
I only want the usernames as one expression and domain portions as the other, I don't want the # sign.
#stackoverflow.com
#google.com
#example.com
I tried
^#.*?..*?$
Users
#Perl
#Python
#PHP
I tried
^#.*?$
Any suggestions are good.
In your first expression, it would match if you escaped the dot (\.) before the last .*?. Your second expression simply matches whole lines. To match but exclude the #, you could capture everything after it.
For the domains use:
^#(\S+\.[^\s]+)$
Regular expression:
^ the beginning of the string
# '#'
( group and capture to \1:
\S+ non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times)
\. '.'
[^\s]+ any character except: whitespace (1 or more times)
) end of \1
$ before an optional \n, and the end of the string
See live demo
For the users use:
^#([^\s.]+)$
Regular expression:
^ the beginning of the string
# '#'
( group and capture to \1:
[^\s.]+ any character except: whitespace or '.' (1 or more times)
) end of \1
$ before an optional \n, and the end of the string
See live demo
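A short Python sketch of the two expressions applied to the sample list. The variable names and the combined sample text are assumptions; re.MULTILINE is used so that ^ and $ apply to each line of the file.

```python
import re

DOMAINS = re.compile(r"^#(\S+\.[^\s]+)$", re.MULTILINE)
USERS = re.compile(r"^#([^\s.]+)$", re.MULTILINE)

text = "#stackoverflow.com\n#google.com\n#example.com\n#Perl\n#Python\n#PHP"

# With a single capture group, findall returns the group, so the # is excluded.
print(DOMAINS.findall(text))  # ['stackoverflow.com', 'google.com', 'example.com']
print(USERS.findall(text))    # ['Perl', 'Python', 'PHP']
```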
You could do something like this for domains:
^#[^.]+\.[^.]+$
This will match the start of the string, followed by an #, followed by one or more of any character other than ., followed by a ., followed by one or more of any character other than ., followed by the end of the string.
But this will not capture domains with more than two parts (e.g. #meta.stackoverflow.com). If that's important you might try this instead:
^#[^.]+(\.[^.]+)+$
This will match the start of the string, followed by an #, followed by one or more of any character other than ., followed by a group which consists of a ., followed by one or more of any character other than ., where this group may be repeated one or more times, followed by the end of the string.
And this for users:
^#[^.]+$
This will match the start of the string, followed by an #, followed by one or more of any character other than ., followed by the end of the string.
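The difference between these three patterns can be sketched in Python as follows. The classify helper is hypothetical, and each line is tested on its own because [^.] would also match a newline. Since these patterns have no capture group for the name, the leading # is sliced off by hand.

```python
import re

TWO_PART = re.compile(r"^#[^.]+\.[^.]+$")       # exactly one dot
MULTI_PART = re.compile(r"^#[^.]+(\.[^.]+)+$")  # one or more dots
USER = re.compile(r"^#[^.]+$")                  # no dot at all

def classify(line):
    """Hypothetical helper: label one line and return it without the #."""
    if MULTI_PART.match(line):
        return ("domain", line[1:])
    if USER.match(line):
        return ("user", line[1:])
    return ("neither", line)

for line in ["#google.com", "#meta.stackoverflow.com", "#Perl"]:
    print(classify(line))
```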
try this
USERS
(?<=#)\w+
EMAIL
(?<=#)\w+\.(?:com|net|org)
EDIT:
Uhm, you didn't stipulate which regex engine you're running on; this is PCRE-based, but any engine with lookbehind will most likely have the same syntax.
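A minimal Python sketch of the lookbehind approach, assuming the dot in the email pattern is escaped so that it matches only a literal period. The variable names and sample text are made up for illustration.

```python
import re

USERS = re.compile(r"(?<=#)\w+")
EMAIL = re.compile(r"(?<=#)\w+\.(?:com|net|org)")

text = "#stackoverflow.com\n#google.com\n#Perl\n#PHP"

# The lookbehind checks for the # without making it part of the match.
print(EMAIL.findall(text))  # ['stackoverflow.com', 'google.com']
print(USERS.findall(text))  # ['stackoverflow', 'google', 'Perl', 'PHP']
```

As the second result shows, the users pattern on its own also picks up the first label of each domain line, so it only separates the two kinds of entry if users and domains are kept in different inputs.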

cryptic perl expression

I find the following statement in a perl (actually PDL) program:
/\/([\w]+)$/i;
Can someone decode this for me, an apprentice in perl programming?
Sure, I'll explain it from the inside out:
\w - matches a single character that can be used in a word (alphanumeric, plus '_')
[...] - matches a single character from within the brackets
[\w] - matches a single character that can be used in a word (kinda redundant here)
+ - matches the previous character, repeating as many times as possible, but must appear at least once.
[\w]+ - matches a group of word characters, many times over. This will find a word.
(...) - grouping. remember this set of characters for later.
([\w]+) - match a word, and remember it for later
$ - end-of-line. match something at the end of a line
([\w]+)$ - match the last word on a line, and remember it for later
\/ - a single slash character '/'. it must be escaped by backslash, because slash is special.
\/([\w]+)$ - match the last word on a line, after a slash '/', and remember the word for later. This is probably grabbing the directory/file name from a path.
/.../ - match syntax
/.../i - i means case-insensitive.
All together now:
/\/([\w]+)$/i; - match the last word on a line and remember it for later; the word must come after a slash. Basically, grab the filename from an absolute path. The case insensitive part is irrelevant, \w will already match both cases.
More details about Perl regex here: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
And as JRFerguson pointed out, YAPE::Regex::Explain is useful for tokenizing regex, and explaining the pieces.
You will find the YAPE::Regex::Explain module worth installing.
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this script is named 'rexplain' do:
$ ./rexplain '/\/([\w]+)$/i'
...to obtain:
The regular expression:
(?-imsx:/\/([\w]+)$/i)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\w]+ any character of: word characters (a-z,
A-Z, 0-9, _) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
/i '/i'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
UPDATE:
See also: https://stackoverflow.com/a/12359682/1015385 . As noted there and in the module's documentation:
There is no support for regular expression syntax added after Perl version 5.6, particularly any
constructs added in 5.10.
/\/([\w]+)$/i;
It is a regex, and if it is a complete statement, it is applied to the $_ variable, like so:
$_ =~ /\/([\w]+)$/i;
It looks for a slash \/, followed by an alphanumeric string \w+, followed by end of line $. It also captures () the alphanumeric string, which ends up in the variable $1. The /i on the end makes it case-insensitive, which has no effect in this case.
While it doesn't help "explain" a regex, once you have a test case, Damian's new Regexp::Debugger is a cool utility to watch what actually occurs during the matching. Install it and then do rxrx at the command line to start the debugger, then type in /\/([\w]+)$/ and '/r' (for example), and finally m to start the matching. You can then step through the debugger by hitting enter repeatedly. Really cool!
This is comparing $_ to a slash followed by one or more word characters (case-insensitively) at the end of the string, and storing those characters in $1:
$_ value then $1 value
------------------------------
"/abcdes" | "abcdes"
"foo/bar2" | "bar2"
"foobar" | undef # no slash so doesn't match
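The table above can be reproduced with a short Python sketch. This is a translation for illustration, not the original Perl: the function name is hypothetical, and None plays the role of undef.

```python
import re

# Same pattern as /\/([\w]+)$/i; the slash needs no escaping in Python.
PATTERN = re.compile(r"/([\w]+)$", re.IGNORECASE)

def last_word_after_slash(s):
    """Hypothetical helper: return capture group 1, or None if there is no match."""
    m = PATTERN.search(s)
    return m.group(1) if m else None

for value in ["/abcdes", "foo/bar2", "foobar"]:
    print(repr(value), "->", repr(last_word_after_slash(value)))
```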
The Online Regex Analyzer deserves a mention. Here's a link to explain what your regex means, and pasted here for the record.
Sequence: match all of the followings in order
/ (slash)
--+
Repeat | (in GroupNumber:1)
AnyCharIn[ WordCharacter] one or more times |
--+
EndOfLine

Why does "Year 2010" =~ /([0-4]*)/ result in empty string in $1?

If I run
"Year 2010" =~ /([0-4]*)/;
print $1;
I get empty string.
But
"Year 2010" =~ /([0-4]+)/;
print $1;
outputs "2010". Why?
You get an empty match right at the start of the string "Year 2010" for the first form because the * will immediately match 0 digits. The + form will have to wait until it sees at least one digit before it matches.
Presumably if you can go through all the matches of the first form, you'll eventually find 2010... but probably only after it finds another empty match before the 'e', then before the 'a' etc.
The first regular expression successfully matches zero digits at the start of the string, which results in capturing the empty string.
The second regular expression fails to match at the start of the string, but it does match when it reaches 2010.
The first matches the zero-length string at the beginning (before Y) and returns it. The second searches for one-or-more digits and waits until it finds 2010.
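The same behaviour can be observed in Python, whose re.search also returns the first position at which the pattern succeeds. This is a sketch for illustration rather than Perl itself; the variable names are made up.

```python
import re

text = "Year 2010"

star = re.search(r"([0-4]*)", text)
plus = re.search(r"([0-4]+)", text)

# The * form succeeds immediately with zero digits at offset 0.
print(repr(star.group(1)), star.span())  # '' (0, 0)
# The + form cannot succeed until it reaches the first digit.
print(repr(plus.group(1)), plus.span())  # '2010' (5, 9)

# Walking through every match of the * form shows the empty matches
# that come before "2010" is finally found.
all_star = re.findall(r"[0-4]*", text)
print(all_star)
```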
You can also use YAPE::Regex::Explain to get an explanation of a regular expression, like this:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new('([0-4]*)')->explain();
print YAPE::Regex::Explain->new('([0-4]+)')->explain();
output:
The regular expression:
(?-imsx:([0-4]*))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]* any character of: '0' to '4' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The regular expression:
(?-imsx:([0-4]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]+ any character of: '0' to '4' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The star quantifier matches zero or more symbols from the given set (in theory, the set {x,y}* consists of the empty string and all possible finite sequences made of x and y). It will therefore match exactly zero characters (the empty string) at the beginning of the string, zero characters after the first character, zero characters after the second character, and so on. Only then will it find 2 and match the whole of 2010.
The plus quantifier matches one or more characters from the given set ({x,y}+ consists of all possible finite sequences made of x and y, without the empty string, as opposed to {x,y}*). So the first matching character encountered is 2; next 0 is checked, then 1, then another 0, and then the string ends, so the captured group is '2010'.
It is standard behavior for regular expressions, defined in formal language theory. I strongly suggest learning a bit of theory about regular expressions; it can't hurt, and it can help :)
We have this as a trick question in Learning Perl. Any regex that can match zero characters will match zero characters at the beginning of the string if it can't match anything longer there.
The Perl regex engine returns the leftmost match: it takes the first position in the string where the pattern can match at all, and only then do greedy quantifiers make that match as long as they can. Not all regex engines work like that, though. If you want all of the technical details, read Mastering Regular Expressions, which explains how regex engines work and find matches.
To make your first regex match 2010, anchor it to the end of the string with '$':
"Year 2010" =~ /([0-4]*)$/;
print $1;
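The effect of the anchor can be checked with an equivalent Python sketch (an illustration in Python, not Perl; the variable name is made up). Because of the $, an empty match at the start of the string no longer succeeds, so the engine keeps moving right until the digits run up to the end.

```python
import re

anchored = re.search(r"([0-4]*)$", "Year 2010")
print(anchored.group(1))  # 2010
```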