RegEx for first letters that MIGHT precede # - regex

I'm looking for the regex to capture the first letters of a string that might be an email address. If it is an email address, just the first letter of words before the #. In other words, the first letter of words that may or may not be followed by a #, and if there is an # present, ignore the text after it. For example (captured letters in bold and explanation given on the first 3):
first.last (captures f and l when no # is present)
first (captures f)
first.last#exampledomain.com (captures f and l but stops capturing when it encounters an #)
first#example.com
first#sub.example.com
first.middle.last#example.com
The regex I have so far is /\b[a-z](?=.*#)/g but it only works if there is an # present.
For background, I'm trying to capitalize the first letter of names used in an email address. Anything after the # should be left as is. That's why I really just need to capture lowercase letters at the start of a word. I'm using actionscript which uses the same conventions as javascript.
SOLUTION:
Since actionscript doesn't support lookbehind, I ended up using this code to return the string with capitalized results of the regex:
var pattern:RegExp;
if (string.indexOf("#") == -1) {
    // no # in string, so just find first letters of words
    pattern = /\b[a-z]/g;
} else {
    // # exists so just find first letters of words before #
    pattern = /\b[a-z](?=.*#)/g;
}
// return the string, capitalizing the results of the regex
return string.replace( pattern, function():String { return String(arguments[0]).toUpperCase(); } );

I think the best way for you to do this is to do 2 searches:
The first would be to capture everything before # character with:
[^#]+
then to search through THAT list to capture all the first letters:
\b[a-z]
I of course don't know how you are actually implementing any of this (posting some real code may help me help you, FYI) but this seems to be the best option.
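If you were using an engine that supports variable-length lookbehind (.NET, for example), I would suggest this:
(?<!.*#.*)\b[a-z]
which makes use of lookbehinds, but alas, JS did not have lookbehinds when this was written (they were added in ES2018).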
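The two-step idea from this answer (isolate the part before the #, then match first letters only inside it) can be sketched in Python. This is an illustration under assumptions, not the asker's ActionScript: the function name capitalize_local_part is made up, Python's re module stands in for the ActionScript engine, and str.partition replaces the [^#]+ search for the first step.

```python
import re

def capitalize_local_part(address):
    """Capitalize the first letter of each word before the first '#'."""
    # Step 1: split off everything before the first '#', if there is one.
    head, sep, tail = address.partition("#")
    # Step 2: uppercase every lowercase letter that starts a word in that part.
    head = re.sub(r"\b[a-z]", lambda m: m.group(0).upper(), head)
    # Whatever follows the '#' is appended unchanged.
    return head + sep + tail
```

Because the part after the # never reaches the regex, no lookahead or lookbehind is needed.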

Maybe this could help you.
^([a-zA-Z]{1})[a-zA-Z0-9]+\.?([a-zA-Z]{1})?
https://regex101.com/r/eL3dK4/1

I don't know how your captures work. Maybe this is closer/useful.
^([A-Za-z])[^.#]*(?:[.]([A-Za-z])[^.#]*)*#
Or maybe you just want to cap the matches at three?
^([A-Za-z])[^.#]*(?:[.]([A-Za-z])[^.#]*)?(?:[.]([A-Za-z])[^.#]*)?#
https://regex101.com/r/eL3dK4/1#javascript

Related

Dart regex for capturing groups but ignoring certain similar patterns

I'm trying to capture a group from a string with ~, ~~ and ~~~ symbols. I was successful with extracting single symbols but it doesn't ignore the other occurrences in the string.
This is my code I tried experimenting with:
String f = '~the calculator is on and working~I entered 50 into the calculator' +
    '~~I press add button~~holding equal button ~~~The result should be 50';
List<String> givens = f.split(RegExp(r'~+'));
List<String> whens = f.split(RegExp(r'~~+'));
List<String> thens = f.split(RegExp(r'~~~+'));
for (String ss in givens) {
  print(ss);
}
print('xxxxxxxxxxxx');
for (String ss in whens) {
  print(ss);
}
print('xxxxxxxxxxxx');
for (String ss in thens) {
  print(ss);
}
Which will result with:
The givens capture group also captured the ones with ~~ and ~~~ which is not intended.
The whens capture group also captured the ones single ~ which made it very confusing.
Lastly, the thens capture group also captured the others which is also not intended.
I only need to capture the strings starting with the specific pattern but will stop when they see a different one.
Example: givens should only capture 'the calculator is on and working' and 'I entered 50 into the calculator' only.
Any hints or help is greatly appreciated!
I think the problem is that you started off by splitting the string into pieces. It might be easier to search for the elements with a pattern that looks for some text preceded by one, two or three ~ chars.
This can be done with regex positive lookbehind patterns.
Typically, if you want to find a string preceded by exactly one tilde, you have to make sure it does not match when there are other tildes before it.
Find givens
(?<=(?:[^~]|^)~)[^~]+ would be the pattern to find only givens.
Test it here: https://regex101.com/r/9WLbM3/2
Explanation
[^~] means search for any character which is not a ~. This is because [abc] means any char which is in the list, so a, b or c. If you add the ^ char at the beginning of the list then it means "not these chars".
[^~]+ means search for one or more characters that are not ~. This will capture the phrases between the tildes.
A positive lookbehind is written (?<=something present). We want to search for a tilde, so we would put (?<=~) as the positive lookbehind. But the problem is that it will also match the phrases with several tildes in front. To avoid that, we can say that the tilde should be preceded either by ^ (meaning the beginning of the string) or by [^~] (meaning not a tilde). To say "either this or that", we use the syntax (this|that|or even that). But using parentheses will capture the content, and we don't need that. To disable group capturing we can add ?: at the beginning of the group, leading finally to (?:[^~]|^), meaning either a non-tilde char or the beginning of the string, without capturing it.
Find whens and thens
The regular expression is almost the same. It's just that we replace ~ by ~{2} or ~{3}.
Pattern for whens: (?<=(?:[^~]|^)~{2})[^~]+
Pattern for thens: (?<=(?:[^~]|^)~{3})[^~]+
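For readers trying this outside Dart, here is a hedged Python sketch of the same idea. Python's re module only accepts fixed-width lookbehind, so the (?:[^~]|^) alternative above cannot be used there as written. The sketch swaps in a negative lookbehind (?<!~) plus a capture group instead. The helper name steps is hypothetical.

```python
import re

text = ("~the calculator is on and working~I entered 50 into the calculator"
        "~~I press add button~~holding equal button ~~~The result should be 50")

def steps(tildes, s):
    """Return the phrases introduced by exactly `tildes` tilde characters."""
    # (?<!~) rejects a run that has more tildes in front of it;
    # [^~]+ must start right after the run, which rejects longer runs.
    pattern = rf"(?<!~)~{{{tildes}}}([^~]+)"
    return re.findall(pattern, s)

givens = steps(1, text)
whens = steps(2, text)
thens = steps(3, text)
```

Note that the captured phrases keep any trailing whitespace from the input, so a strip() may be wanted afterwards.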

Regex to find string starting with % and ending with .DESCR

I have a very large file of source code loaded in Notepad++, and I am trying to use its regex search capabilities to find all places where a property is used.
I need to find all places where a property DESCR is set. I tried searching for just .DESCR without regex, but there are far too many results for me to sift through. I know that the code I am looking for will either be prefaced with %This. or & and some variable name, followed by .DESCR =.
I've tried using RegExr to construct the regex, but it isn't finding the strings I want. I've looked here to try to understand regex more, but I am missing something still.
EDIT: More descriptions
Here are examples of something I would be looking for:
%This.oPosition.DESCR = &DATAREC.Y_BUSINESS_TITLE.Value;
%This.data.DESCR = "";
&data.DESCR = "Analyst";
&oPosition.DESCR = &DATAREC.DESCR.Value;
It should not, however, match on these:
&P_NODE_PIN_DESCR = &NODE_PIN_DESCR;
&qLang.Descr = &sDescr;
I know that I am way off base, but here is what I have tried:
(\%This\.|\&[A-Z]+)\.DESCR = This doesn't pick up anything.
\%This.|\&(A-Z)+.DESCR This picks up on %This but nothing following, and doesn't find anything prefaced by &.
\%This.\w.DESCR =|\&\w+.DESCR = It looks like it's working on RegExr, but it doesn't match properly in Notepad++ (It matches on things like &ACCT_DESCR =)
I'm just not familiar enough with regex to understand what I am missing.
EDIT:
Notepad++ search settings:
You can search for (?:%this\.|&)\w+\.DESCR = according to your description. Please untick match case in the search dialog (unless you are only searching for This, but not for this or similar).
(?:%this\.|&) matches either %this. or & both literally (but case insensitive)
\w+ matches one or more word characters, thus letters, numbers or underscore. You could also use [a-z]+ to be stricter and only consider letters - or [a-zA-Z]+ when searching case sensitive
\.DESCR = matches .DESCR = literally. If you only want to match DESCR case sensitive, you can use an inline modifier for case sensitivity: \.(?-i)DESCR =
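The pattern can be checked against the question's sample lines with a short Python sketch. Assumptions: Python's re module stands in for the Notepad++ engine, and a scoped (?i:...) group replaces the "match case" checkbox so that only %this is case-insensitive while DESCR stays case-sensitive. The name DESCR_ASSIGNMENT is made up for the example.

```python
import re

# (?i:...) makes only "%this" case-insensitive; DESCR must be upper case.
DESCR_ASSIGNMENT = re.compile(r"(?:(?i:%this)\.|&)\w+\.DESCR =")

should_match = [
    "%This.oPosition.DESCR = &DATAREC.Y_BUSINESS_TITLE.Value;",
    '%This.data.DESCR = "";',
    '&data.DESCR = "Analyst";',
    "&oPosition.DESCR = &DATAREC.DESCR.Value;",
]
should_not_match = [
    "&P_NODE_PIN_DESCR = &NODE_PIN_DESCR;",
    "&qLang.Descr = &sDescr;",
]

# Keep only the lines in which the pattern is found somewhere.
matched = [line for line in should_match + should_not_match
           if DESCR_ASSIGNMENT.search(line)]
```

The underscore names are rejected because \w+ swallows the whole identifier and no literal dot follows it.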
Here's why your attempts didn't work:
[A-Z] checks for uppercase letters only. You need [a-zA-Z] or the case-insensitive modifier /i (in this case represented by the "match case" check box).
When you use the or symbol |, it applies to everything around it until it reaches the start or end of the pattern or an enclosing parenthesis.
Here's the regex you need
(\%This\.|\&)[A-Za-z]+\.DESCR
If you want to capture only .DESCR you can use non-capturing groups like so:
(?:(?:\%This\.|\&)[A-Za-z]+)(\.DESCR)
You can then use the back-reference $1 or \1 to replace .DESCR in these specific appearances
https://regex101.com/r/fW9lZ2/2

Interesting easy looking Regex

I am re-phrasing my question to clear confusions!
I want to match if a string has certain letters for this I use the character class:
[ACD]
and it works perfectly!
but I want to match if the string has those letter(s) 2 or more times either repeated or 2 separate letters
For example:
[AKL] should match:
ABCVL
AAGHF
KKUI
AKL
But the above should not match the following:
ABCD
KHID
LOVE
because those are there but only once!
that's why I was trying to use:
[ACD]{2,}
But it's not working, so it's probably not the right regex. Can a regex guru help me solve this puzzle?
Thanks
PS: I will use it on MySQL. A different approach is also welcome, but I like to use regex for a smarter and shorter query!
To ensure that a string contains at least two occurrences of letters from a set (let's say A, K, L as in your example), you can write something like this:
[AKL].*[AKL]
Since the MySQL regex engine is a DFA, there is no need to use a negated character class like [^AKL] in place of the dot to avoid backtracking, or a lazy quantifier that is not supported at all.
example:
SELECT 'KKUI' REGEXP '[AKL].*[AKL]';
will return 1
You can follow this link that speaks on the particular subject of the LIKE and the REGEXP features in MySQL.
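The same pattern can be tried against the question's examples in Python. This is only a sketch: Python's re module is a backtracking engine, not MySQL's, and the helper name has_two is made up.

```python
import re

# One letter from the set, anything in between, then another letter from the set.
AT_LEAST_TWO = re.compile(r"[AKL].*[AKL]")

def has_two(s):
    """True if s contains at least two letters from the set A, K, L."""
    return AT_LEAST_TWO.search(s) is not None

matching = [s for s in ["ABCVL", "AAGHF", "KKUI", "AKL"] if has_two(s)]
rejected = [s for s in ["ABCD", "KHID", "LOVE"] if not has_two(s)]
```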
If I understood you correctly, this is quite simple:
[A-Z].*?[A-Z]
This looks for something in your set, [A-Z], and then lazily matches characters until it (potentially) comes across the set, [A-Z], again.
As #Enigmadan pointed out, a lazy match is not necessary here: [A-Z].*[A-Z]
The expression you are using searches for characters between 2 and unlimited times with these characters ACDFGHIJKMNOPQRSTUVWXZ.
However, your expression excludes Y ([...UVWXZ]), so Z cannot be matched because it is not adjacent to another character from your set. The same applies to A, because B is also excluded ([ACD...]). For example, Z and A would both match in a string like ZABCDEFGHIJKLMNOPQRSTUVWXYZA.
If those were not excluded on purpose probably better can be to use ranges like [A-Z]
If you want 2 or more of a match on [AKL], then you may use just [AKL] and may have match >= 2.
I am not good at SQL regex, but may be something like this?
check (dbo.RegexMatch( ['ABCVL'], '[AKL]' ) >= 2)
To put it in simple English, use [AKL] as your regex, and check that the number of matches in the string is at least 2. Here's how I would do it in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private boolean search2orMore(String string) {
    Matcher matcher = Pattern.compile("[AKL]").matcher(string);
    int counter = 0;
    // count every single-character match from the set
    while (matcher.find()) {
        counter++;
    }
    return (counter >= 2);
}
You can't use [ACD]{2,} because it requires two or more consecutive characters from the set, so it fails when the matching characters are separated by other characters.
your question is not very clear, but here is my trial pattern
\b(\S*[AKL]\S*[AKL]\S*)\b
Demo
pretty sure this should work in any case
(?<l>[^AKL\n]*[AKL]+[^AKL\n]*[AKL]+[^AKL\n]*)[\n\r]
Replacing AKL with the letters you need can easily be done dynamically; tell me if you need it.
Is this what you are looking for?
".*(.*[AKL].*){2,}.*" (without quotes)
It matches if there are at least two occurrences of your characters surrounded by anything.
It is .NET regex, but should be same for anything else
Edit
Overall, MySQL regular expression support is pretty weak.
If you only need to match your capture group a minimum of two times, then you can simply use:
select * from ... where ... regexp('([ACD].*){2,}') #could be `2,` or just `2`
If you need to match your capture group more than two times, then just change the number:
select * from ... where ... regexp('([ACD].*){3}')
#This number should match the number of matches you need
If you needed a minimum of 7 matches and you were using your previous capture group [ACDF-KM-XZ]
e.g.
select * from ... where ... regexp('([ACDF-KM-XZ].*){7,}')
Response before edit:
Your regex is trying to find at least two characters from the set[ACDFGHIJKMNOPQRSTUVWXZ].
([ACDFGHIJKMNOPQRSTUVWXZ]){2,}
The reason A and Z are not being matched in your example string (ABCDEFGHIJKLMNOPQRSTUVWXYZ) is because you are looking for two or more characters that are together that match your set. A is a single character followed by a character that does not match your set. Thus, A is not matched.
Similarly, Z is a single character preceded by a character that does not match your set. Thus, Z is not matched.
The following characters of ABCDEFGHIJKLMNOPQRSTUVWXYZ do not match your set: B, E, L and Y.
If you were to do a global search in that string, only these runs would be matched: CD, FGHIJK and MNOPQRSTUVWX.
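A global search like the one described can be reproduced with Python's re.findall, used here as a stand-in for whatever tool the asker tested with. It lists the runs of two or more adjacent set members.

```python
import re
import string

alphabet = string.ascii_uppercase  # "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# {2,} demands at least two adjacent characters from the set, so the
# isolated A (before B) and Z (after Y) are never part of a match.
runs = re.findall(r"[ACDFGHIJKMNOPQRSTUVWXZ]{2,}", alphabet)
```

The isolated A and Z are left out of the result, which is exactly the behaviour the question ran into.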

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
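Building that pattern dynamically could look like the following Python sketch. The function name find_ignoring_whitespace is hypothetical, and the search term is assumed to be a literal string, so each character is escaped before joining.

```python
import re

def find_ignoring_whitespace(word, text):
    """Return the (start, end) span of `word` in `text`, allowing
    whitespace between its characters, or None if it is not found."""
    # Escape each character, then allow optional whitespace between them.
    pattern = r"\s*".join(re.escape(ch) for ch in word)
    match = re.search(pattern, text)
    return match.span() if match else None

span = find_ignoring_whitespace("cats", "the c ats sat")
```

The span refers to the original string, whitespace included, which is what the question needs for highlighting.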
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate this
(the following exemplary solution is in python, although obviously it can be ported to any language):
you can strip the whitespace beforehand AND save the positions of non-whitespace characters so you can use them later to find out the matched string boundary positions in the original string like the following:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it).
    start = char_positions[match.start()]  # in the original string
    # match.end() is exclusive, so map the last matched char and add 1
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases as it will strip both the regex and test string of whitespace. The test will be conducted as though both have no whitespace.

Regex to detect one of several strings

I've got a list of email addresses belonging to several domains. I'd like a regex that will match addresses belonging to three specific domains (for this example: foo, bar, & baz)
So these would match:
a#foo
a#bar
b#baz
This would not:
a#fnord
Ideally, these would not match either (though it's not critical for this particular problem):
a#foobar
b#foofoo
Abstracting the problem a bit: I want to match a string that contains at least one of a given list of substrings.
Use the pipe symbol to indicate "or":
/a#(foo|bar|baz)\b/
If you don't want the capture-group, use the non-capturing grouping symbol:
/a#(?:foo|bar|baz)\b/
(Of course I'm assuming "a" is OK for the front of the email address! You should replace that with a suitable regex.)
^(a|b)#(foo|bar|baz)$
if you have such a strictly defined list. The start and end anchors ensure that only those exact strings match.
Use:
/#(foo|bar|baz)\.?$/i
Note the differences from other answers:
\.? - matching 0 or 1 dots, in case the domains in the e-mail address are "fully qualified"
$ - to indicate that the string must end with this sequence,
/i - to make the test case insensitive.
Note, this assumes that each e-mail address is on a line on its own.
If the string being matched could be anywhere in the string, then drop the $, and replace it with \s+ (which matches one or more white space characters)
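As a quick check of this pattern against the question's examples, here is a Python sketch. The helper name in_listed_domain is made up, and re.IGNORECASE plays the role of the /i flag.

```python
import re

# '#', one of the three domains, an optional trailing dot, then end of string.
DOMAIN = re.compile(r"#(foo|bar|baz)\.?$", re.IGNORECASE)

def in_listed_domain(address):
    """True if the address ends in #foo, #bar or #baz (optionally with a dot)."""
    return DOMAIN.search(address) is not None

accepted = [a for a in ["a#foo", "a#bar", "b#baz", "A#FOO."] if in_listed_domain(a)]
refused = [a for a in ["a#fnord", "a#foobar", "b#foofoo"] if not in_listed_domain(a)]
```

The $ anchor is what rejects a#foobar and b#foofoo, since nothing but an optional dot may follow the domain.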
should be more generic, the a shouldn't count, although the # should.
/#(foo|bar|baz)(?:\W|$)/
Here is a good reference on regex.
edit: change ending to allow end of pattern or word break. now assuming foo/bar/baz are full domain names.
If the previous (and logical) answers about '|' don't suit you, have a look at
http://metacpan.org/pod/Regex::PreSuf
module description : create regular expressions from word lists
Ok I know you asked for a regex answer.
But have you considered just splitting the string with the '#' char
taking the second array value (the domain)
and doing a simple match test
if (splitString[1] == "foo" || splitString[1] == "bar" || splitString[1] == "baz")
{
//Do Something!
}
Seems to me that RegEx is overkill. Of course my assumption is that your case is really as simple as you have listed.
You don't need a regex to find whether a string contains at least one of a given list of substrings. In Python:
def contain(string_, substrings):
    return any(s in string_ for s in substrings)
The above is slow for a large string_ and many substrings. GNU fgrep can efficiently search for multiple patterns at the same time.
Using regex
import re
def contain(string_, substrings):
    regex = '|'.join("(?:%s)" % re.escape(s) for s in substrings)
    return re.search(regex, string_) is not None
Related
Multiple Skip Multiple Pattern Matching Algorithm (MSMPMA) [pdf]