Matching letter grades within a body of text - regex

I am trying to write a regular expression for matching letter grades embedded within a string; however, I am having difficulty with certain characters: commas, backslashes, forward slashes, and apostrophes at word boundaries.
These strings may consist of either just a letter grade, or a mixture of a letter grade and notes left by an instructor. The valid range for these grades is anything from an A+ to D-, with an F reserved for failures. For a particular letter such as C the valid grades are: C+, C, or C-. Grades will never appear embedded within another word. Examples of some of these strings are as follows:
string1: "A+"
string2: "B. Submitted with deferral"
string3: "F. Could not read M/C answer sheet."
string4: "C+"
string5: "Received a B- with late submission penalty."
The expression that I have tried thus far is as follows:
((\b[A-D]\b[+-]?)|\bF\b)
For string1 and string2, this produces the expected matches, A+ and B respectively:
"A+"
"B. Submitted with deferral"
For string3, this expression should match only the leading F in
F. Could not read M/C answer sheet.
but it also matches the C in M/C.
Any assistance would be much appreciated.
Edit:
For clarity a substring is a letter grade if and only if:
It is of the form A+, A, A-, B+, B, B-, ..., D+, D, D-, with F (without a sign) reserved for a failing grade
It is not embedded in a word, for example FOA+O would not match A+. Likewise substrings such as AC or FB should produce no matches
Letters separated by characters such as \ / ?' should not be matched, for example A/C, B+'C, F\D should not produce matches, whereas A, C or A,C should match both letters.
Letters separated by periods, such as B.A., should not result in matches, whereas a letter occurring at the end of a sentence, such as A., may be considered a match.
Consider the following example strings
string1: "A-- A-C, A\D, F/A, D'C, A,C, B+D, C-C, AB, XA, B.A. C C,
Cat, F, C+, B-."
string2: " A "
string3: "B+."
string4: "X"
string5: "F"
in these strings the only valid matches should be
string1: "A-- A-C, A\D, F/A, D'C, A,C, B+D, C-C, AB, XA, B.A. C
C, Cat, F, C+, B-."
string2: " A "
string3: "B+."
string5: "F"

I'm not sure which regex engine you're using but the following regex works for all of the test cases you presented:
(?<=^|[\s,])(?:[A-D][-+]?|F)(?=[-+.]\B|[\s,]|$)
(?<=^|[\s,]) Lookbehind ensuring what precedes is either of the following options:
^ Asserts position at the start of the line.
[\s,] Match any whitespace character or the comma character.
(?:[A-D][-+]?|F) Match either of the following options:
[A-D][-+]? Match the following:
[A-D] Match any character in the range from A to D in the ASCII table (ABCD).
[-+]? Optionally match any character in the set (- or +)
F Match this literally.
(?=[-+.]\B|[\s,]|$) Lookahead ensuring what follows is one of the following options:
[-+.]\B Matches any character in the set (-+.), followed by an assertion that the position is not a word boundary (ensures the character after it is not a word character such as a letter).
[\s,] Matches any whitespace character or the comma character.
$ Asserts position at the end of the line.
Alternatives
Fixed-width lookbehind:
(?:^|(?<=[\s,]))(?:[A-D][-+]?|F)(?=[-+.]\B|[\s,]|$)
Without lookbehind (uses capture group instead):
(?:^|[\s,])([A-D][-+]?|F)(?=[-+.]\B|[\s,]|$)
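As a rough check, the fixed-width-lookbehind variant above can be run against the question's sample strings. This is a minimal sketch assuming Python's re module, which rejects the variable-width lookbehind (?<=^|[\s,]); the names GRADE, samples and grades are illustrative and not part of the original answer.

```python
import re

# Fixed-width-lookbehind variant from the answer. Python's re module does not
# accept the variable-width form (?<=^|[\s,]), so this variant is used instead.
GRADE = re.compile(r"(?:^|(?<=[\s,]))(?:[A-D][-+]?|F)(?=[-+.]\B|[\s,]|$)")

# Sample strings taken from the question.
samples = [
    "A+",
    "B. Submitted with deferral",
    "F. Could not read M/C answer sheet.",
    "C+",
    "Received a B- with late submission penalty.",
    " A ",
    "B+.",
    "X",
]

# Map each sample to the list of grades found in it.
grades = {text: GRADE.findall(text) for text in samples}
for text, found in grades.items():
    print(repr(text), "->", found)
```

In this sketch the C in M/C is left alone because it is preceded by a slash rather than by whitespace, a comma, or the start of the string.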

The "C" in "M/C" is matched because \b considers the "/" a valid word boundary.
(?<=^|\s)[A-F][+-]{0,1}(?=\W)
This regular expression will match letter grades that are either at the beginning of the line (^), or are preceded with whitespace (\s). The positive lookbehind (?<=) ensures that the leading whitespace is not considered part of the match.
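After the letter grade, we have (?=\W), which requires one non-word character, using a positive lookahead to exclude that boundary character from the match. Note that a grade at the very end of the string has nothing after it to satisfy this lookahead, so "A+" on its own matches only "A"; the class [A-F] also admits E, which is not a valid grade.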
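The effect of \b around the slash can be reproduced with a short sketch. It assumes Python's re module, so the lookbehind is written in the fixed-width form (?:^|(?<=\s)) rather than (?<=^|\s), and it assumes the asker's pattern was meant to have balanced parentheses; the variable names are illustrative.

```python
import re

text = "F. Could not read M/C answer sheet."

# The asker's approach: \b sees a boundary between "/" and "C".
with_word_boundaries = re.findall(r"\b[A-D]\b[+-]?|\bF\b", text)
print(with_word_boundaries)  # ['F', 'C']

# Requiring start-of-string or whitespace before the letter instead.
# (?:^|(?<=\s)) is a fixed-width rewrite of (?<=^|\s) for Python.
with_whitespace_check = re.findall(r"(?:^|(?<=\s))[A-F][+-]?(?=\W)", text)
print(with_whitespace_check)  # ['F']
```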

Your original expression is just fine, but the following one adds a start anchor (or a preceding-whitespace check), which might help here:
(?<=^|\s)\b[A-DF]\b[+-]?
Or with capturing group:
(?<=^|\s)(\b[A-DF]\b[+-]?)
Or without lookarounds, these might work:
(?:^|\s)(\b[A-DF]\b[+-]?)
(^|\s)(\b[A-DF]\b[+-]?)
^(\b[A-DF]\b[+-]?)|\s(\b[A-DF]\b[+-]?)

Related

Regular Expressions - A word with only one capitalized letter and which doesn't contain numbers

I am new to RegExp. I have a sentence and I would like to pull out a word which satisfies the following -
It must contain only one capitalized letter
It must consist of only characters/letters without numbers
For instance -
"appLe", "warDrobe", "hUsh"
The words that do not fit - "sf_dsfsdF", "331ffsF", "Leopard1997", "mister_Ram" et cetera.
How would you resolve this problem?
The following regex should work:
will find words that have only one capital letter
will only find words with letters (no numbers or special characters)
will match the entire word
\b(?=[A-Z])[A-Z][a-z]*\b|\b(?=[a-z])[a-z]+[A-Z][a-z]*\b
Matches:
appLe
hUsh
Harry
suSan
I
Rejects
HarrY - has TWO capital letters
warDrobeD - has TWO capital letters
sf_dsfsdF - has SPECIAL characters
331ffsF - has NUMBERS
Leopd1997 - has NUMBERS
mistram - does not have a CAPITAL LETTER
Note:
If the capital letter is OPTIONAL, then you will need to add a ? after each [A-Z] like this:
\b(?=[A-Z])[A-Z]?[a-z]*\b|\b(?=[a-z])[a-z]+[A-Z]?[a-z]*\b
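To see the first (one mandatory capital) pattern pull words out of a sentence, here is a small sketch assuming Python's re module. The sentence is made up from the words listed above, and ONE_CAPITAL, sentence and words are illustrative names.

```python
import re

# The "exactly one capital letter, letters only" pattern from the answer.
ONE_CAPITAL = re.compile(
    r"\b(?=[A-Z])[A-Z][a-z]*\b|\b(?=[a-z])[a-z]+[A-Z][a-z]*\b"
)

# A made-up sentence mixing accepted and rejected words.
sentence = "appLe HarrY hUsh 331ffsF Harry sf_dsfsdF I mistram"

# The pattern has no capture groups, so findall returns whole words.
words = ONE_CAPITAL.findall(sentence)
print(words)  # ['appLe', 'hUsh', 'Harry', 'I']
```

Digits and underscores count as word characters, so \b never falls inside 331ffsF or sf_dsfsdF, which is why no fragment of them is returned.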
You can do this by using character sets ([a-z] & [A-Z]) with appropriate quantifiers (use ? for one or zero capitals), wrapped in () to capture, surrounded by word boundaries \b.
If the capital is optional and can appear anywhere use:
/\b([a-z]*[A-Z]?[a-z]*)\b/ // still matches the empty string, so check the match length
If you always want one capital appearing anywhere use:
/\b([a-z]*[A-Z][a-z]*)\b/ // does not match empty string
If you always want one capital that must not be the first or last character use:
/\b([a-z]+[A-Z][a-z]+)\b/ // does not match empty string
Here is a working snippet demonstrating the second regex from above in JavaScript:
const exp = /\b([a-z]*[A-Z][a-z]*)\b/;
const strings = ["appLe", "warDrobe", "hUsh", "sf_dsfsdF", "331ffsF", "Leopard1997", "mister_Ram", ""];
// Log each test string together with whether the pattern matches it.
for (const str of strings) {
  console.log(str, exp.test(str));
}
Regex101 is great for dev & testing!
RegExp:
/\b[a-z\-]*[A-Z][a-z\-]*\b/g
Explanation
\b[a-z\-]* - Find a point where a non-word character is adjacent to a word character (\w, i.e. [A-Za-z0-9_]), then match zero or more lowercase letters and hyphens (the hyphen is escaped as \- so that it is read literally inside the character class)
[A-Z] - Find a single uppercase letter
[a-z\-]*\b - Match zero or more lowercase letters and hyphens, then find a point where a non-word character is adjacent to a word character

I feel like this regex pattern should work but it doesn't

I'm new to regex and honestly not that experienced.
I got this regex pattern that I want to try and use.
/(a..e.)([a-zA-Z])/gi
The plan is that it should match any word that follows the pattern, so I can loop over a list of words with A locked in the first slot and E in the second-to-last slot, and find all the words that match. However, I've run into an issue: I expect it to match the word ADDER, but it doesn't. When I remove the last period, so that the pattern becomes
/(a..e)([a-zA-Z])/gi
It does work. Shouldn't these two basically be the same? Since we're using a wildcard dot?
Using the https://regexr.com/
The (a..e.)([a-zA-Z]) pattern looks for an a, after which there must be any two chars (other than line break chars), then an e letter, and then any single char other than line break chars. This pattern neither guarantees you match a whole word, nor that the chars matched with . will be letters.
/(a..e.)([a-zA-Z])/gi is not equivalent to /(a..e)([a-zA-Z])/gi, as they match and consume strings of different lengths. The first pattern needs six characters: a, two arbitrary chars, e, one more arbitrary char, and then a letter. ADDER has only five, so it cannot match. The second pattern has no . after e, so it needs only five characters (a, two arbitrary chars, e, and a letter) and matches ADDER.
To match words starting with the a letter, followed with two more letters, then an e letter, and then one more letter you can use
/\ba[a-z]{2}e[a-z]\b/gi
Details:
/gi - match all occurrences (g) in a case insensitive way (i)
\b - matches a word boundary
a - a / A
[a-z]{2} - two ASCII letters
e - an e letter
[a-z] - any ASCII letter
\b - matches a word boundary.
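The length argument can be checked directly. The following is a minimal sketch assuming Python's re module, with re.IGNORECASE standing in for the i flag and findall standing in for the g flag; the word list is made up for illustration.

```python
import re

word = "ADDER"

# Needs six characters (a, two chars, e, one char, one letter): no match.
six_chars = re.search(r"(a..e.)([a-zA-Z])", word, re.IGNORECASE)
print(six_chars)  # None

# Needs five characters (a, two chars, e, one letter): matches ADDER.
five_chars = re.search(r"(a..e)([a-zA-Z])", word, re.IGNORECASE)
print(five_chars.groups())  # ('ADDE', 'R')

# The suggested whole-word pattern, applied to a made-up word list.
whole_word = re.compile(r"\ba[a-z]{2}e[a-z]\b", re.IGNORECASE)
found = whole_word.findall("adder amber alien anthem ADDER")
print(found)  # ['adder', 'amber', 'alien', 'ADDER']
```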

Regex to extract 2 lists of connected words

I want to extract 2 lists of words that are connected by the sign =. The regex code works for separate lists but not in combination.
Example string: bla word1="word2" blabla abc="xyz" bla bla
One output shall contain the words directly left of =, i.e. word1, abc and the other output shall contain the words directly right of =, i.e. word2, xyz without quotes.
\w+(?==\"(?:(?!\").)*\")
extracts the words left of =, i.e. word1,abc
=\"(?:(?!\").)*\" extracts the words right of = including quotes and =, i.e. ="word2",="xyz"
How can I combine these 2 queries to a single regex-expression that outputs 2 groups? Quotes and equal signs shall not be outputted.
You can use
([^\s=]+)="([^"]*)"
Details:
([^\s=]+) - Group 1: one or more occurrences of a char other than whitespace and = char
=" - a =" substring
([^"]*) - Group 2: zero or more chars other than " char
" - a " char.
Note: \w+ only matches one or more letters, digits and underscores, and won't match if the keys contain, say, hyphens. The (?:(?!\").)* tempered greedy token is not efficient, and does not match line break chars. As the negative lookahead only contains a single char pattern (\"), it is more efficient to write it as a negated character class, [^"]*. It also matches line break chars. If you do not want that behavior, just add \r\n into the negated character class.
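As an illustration of how the two groups turn into two lists, here is a short sketch assuming Python's re module and the example string from the question; PAIR, pairs, keys and values are illustrative names.

```python
import re

# Group 1: key (no whitespace, no "="); group 2: value between the quotes.
PAIR = re.compile(r'([^\s=]+)="([^"]*)"')

text = 'bla word1="word2" blabla abc="xyz" bla bla'

# With two groups, findall returns a list of (key, value) tuples.
pairs = PAIR.findall(text)
keys = [key for key, _ in pairs]
values = [value for _, value in pairs]

print(keys)    # ['word1', 'abc']
print(values)  # ['word2', 'xyz']
```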
If you are looking for lhs and rhs from lhs="rhs", this should work (sorry, this is what I understood from your question):
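import re

test_str = 'abc="def" ghi'
# Raw string so that \w reaches the regex engine unchanged.
ans = re.search(r'(\w+)="(\w+)"', test_str)
print(ans.group(1))  # left-hand side: abc
print(ans.group(2))  # right-hand side: def
my_list = list(ans.groups())
print(my_list)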
This should do what you want:
(?: (\w*)=)(?:\"(\w*)\")
This is for a python regex.

regex to match entire words containing only certain characters

I want to match entire words (or strings, really) that contain only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)
This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""
The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+
Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are either at the beginning of the string or preceded by whitespace, and either at the end of the string or followed by whitespace. If you can't use lookbehind (for example, in older JavaScript engines that predate lookbehind support) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
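The difference between checking the trailing whitespace and consuming it can be seen in a small sketch. It assumes Python's re module (which supports both lookarounds used here) and the sentence from the question; the variable names are illustrative.

```python
import re

sentence = "dog god ogd, dogs o"

# Lookarounds inspect the neighbouring characters without consuming them.
with_lookarounds = re.findall(r"(?<!\S)[dog]+(?!\S)", sentence)
print(with_lookarounds)  # ['dog', 'god', 'o']

# Consuming the trailing whitespace takes away the separator that the next
# word needs as its leading condition, so "god" is skipped.
consuming = re.findall(r"(?:^|\s)([dog]+)(?:$|\s)", sentence)
print(consuming)  # ['dog', 'o']
```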
Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Anything between square brackets ([]) is a character class: it matches any single character listed between the brackets. You can also use ranges such as [0-9] or [a-z], but the class still matches only one character. The + and * are quantifiers: + matches one or more repetitions, while * matches zero or more. You can specify an explicit repetition count with curly brackets ({}): {2} matches exactly 2 repetitions, while {1,3} matches between 1 and 3.
Anything between parentheses () is captured as a group, which can be used in callbacks or replacements, say when you want to return the matched values or reuse them in the string. The ?! is a negative lookahead: the match fails if the pattern inside it (here the character class [\w,]) matches immediately after the current position, so words followed by one of those characters are not matched.
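The behaviour of the curly-bracket quantifiers can be probed with a tiny sketch, assuming Python's re module; the test strings and the names range_ok and exact_ok are made up for illustration.

```python
import re

# {1,3}: one, two, or three repetitions are accepted; zero or four are not.
range_ok = {n: bool(re.fullmatch(r"[0-9]{1,3}", "7" * n)) for n in range(5)}
print(range_ok)  # {0: False, 1: True, 2: True, 3: True, 4: False}

# {2}: exactly two repetitions.
exact_ok = [bool(re.fullmatch(r"[a-z]{2}", s)) for s in ("a", "ab", "abc")]
print(exact_ok)  # [False, True, False]
```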

Reg Ex question

What does the following reg ex code mean?
'/^\w{4,20}$/'
It means that string should contain from 4 to 20 word characters (letters, digits, and underscores). Here:
^ (caret) matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well
$ (dollar) matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break
\w shorthand character class matching word characters (letters, digits, and underscores). Can be used inside and outside character classes.
{n,m} where n >= 0 and m >= n Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times
Let me show you a usage example. Say, we have the file with the following contents:
[spongebob#conductor /tmp]$ cat file.txt
between4and20
therearetoomanyalphanumcharacters
foo
okay
Now you want to get only those strings which match your pattern '/^\w{4,20}$/':
[spongebob#conductor /tmp]$ grep -E '^\w{4,20}$' file.txt
between4and20
okay
On output you see only those lines, which fulfil your regular expression.
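The same filtering can be sketched without grep. This assumes Python's re module and an in-memory list holding the four lines of file.txt shown above; the names pattern, lines and kept are illustrative.

```python
import re

# Same pattern as in the grep call: the whole line must be 4 to 20 word chars.
pattern = re.compile(r"^\w{4,20}$")

# The contents of file.txt from the example, one entry per line.
lines = [
    "between4and20",
    "therearetoomanyalphanumcharacters",
    "foo",
    "okay",
]

kept = [line for line in lines if pattern.search(line)]
print(kept)  # ['between4and20', 'okay']
```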
Also, don't confuse ^ (caret) as an anchor with ^ immediately after the opening [. The latter negates the character class, causing it to match a single character not listed in the class; placed anywhere else inside the class, it is a literal caret. For example, [^a-d] matches x (any character except a, b, c or d).
It means:
^ Between the beginning,
$ and the end of a given string,
\w{4,20} there should be only 4-20 alphanumeric characters (like a, b, c, d, 1, 2, 3, etc., and also _)
I think you'll find Wikipedia's page on Regular Expressions a big, big help while learning regexes.
And just so there is no confusion, ^ and $ don't necessarily need each other,
If the regex was:
'/^\w{4,20}/'
That'd mean: The match should be at the start of the string, followed by 4-20 alphanumeric characters.
Example: in "Foobar baz" the match is "Foobar"
And if the regex pattern was:
'/\w{4,20}$/'
That'd mean: The match should end at the end of the string, which is preceded by 4-20 alphanumeric characters
Example: in "Foo barbaz" the match is "barbaz"
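Both one-sided anchors can be tried out in a few lines. This is a minimal sketch assuming Python's re module, where the patterns are written without the / delimiters; the variable names are illustrative.

```python
import re

# Anchored at the start only: the match must begin the string.
start_only = re.search(r"^\w{4,20}", "Foobar baz")
print(start_only.group())  # Foobar

# Anchored at the end only: the match must finish the string.
end_only = re.search(r"\w{4,20}$", "Foo barbaz")
print(end_only.group())  # barbaz

# "Foo" is only three word characters, so the start-anchored pattern fails.
too_short = re.search(r"^\w{4,20}", "Foo barbaz")
print(too_short)  # None
```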
/ = opening delimiter
^ = start of string
\w = word character
{x,y} = min, max repetitions
$ = end of string
/ = closing delimiter