How to Write a Regular Expression for Indian Names [closed]

How do I write a regular expression for validating Indian names?
Indian names mostly don't contain a surname or last name. They contain only an initial followed by the name, or vice versa.
The name may be in upper case, lower case, or a mix of both. White space and full stops are also part of the name.
The name should start with an upper- or lower-case letter.
I need a regular expression that validates the names listed below.
List of Test Case Valid Names:
B. Bala Manigandan
B.Balamanigandan
B. Balamanigandan
G.S. Sakthivel
G. S. Sakthivel
Sakthivel M.G.
GS Sakthivel
G.S. SAKTHIVEL
BALA MANIGANDAN .B
Sakthivel .M.G

If you want to catch all strings containing only upper and lowercase characters, periods and spaces, you can use
^[a-zA-Z\. ]+$
Here's a breakdown of how the regex works
^ matches the beginning of the string
[a-zA-Z\. ] matches any character a-z, A-Z, or a period or space
    + makes the above match any string with 1 or more characters
$ matches the end of the string
You can test this regex here, at regex101.com.
The regex could be tightened, though, to ensure that the string does not start with a period or a space and does not end with a space
^(?![\. ])[a-zA-Z\. ]+(?<! )$
This is essentially the same as the regex above, except that it uses a negative lookahead and a negative lookbehind to make sure the string does not start with a period or space, or end with a space
( Beginning of the group
    ?! makes the group a negative lookahead
     [\. ] matches either a period or a space (because this is inside a negative lookahead, the regex makes sure the next character is not one of these)
) End of the group
( Beginning of the second group
    ?<! makes the group a negative lookbehind
     The space matches a literal space (because this is inside a negative lookbehind, the regex makes sure the preceding character is not a space)
) End of the second group
You can test this regex here on regex101.com
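As a quick check, here is a minimal Python sketch that runs the stricter pattern against the names listed in the question. The names `NAME_RE`, `VALID_NAMES` and `is_valid_name` are made up for this illustration, and `fullmatch` is used so that a trailing newline is not silently accepted.

```python
import re

# Stricter pattern from the answer: only letters, periods and spaces,
# not starting with a period or space, not ending with a space.
NAME_RE = re.compile(r"^(?![\. ])[a-zA-Z\. ]+(?<! )$")

# Test cases copied from the question.
VALID_NAMES = [
    "B. Bala Manigandan",
    "B.Balamanigandan",
    "B. Balamanigandan",
    "G.S. Sakthivel",
    "G. S. Sakthivel",
    "Sakthivel M.G.",
    "GS Sakthivel",
    "G.S. SAKTHIVEL",
    "BALA MANIGANDAN .B",
    "Sakthivel .M.G",
]


def is_valid_name(name: str) -> bool:
    """Return True if the whole string satisfies the pattern."""
    return NAME_RE.fullmatch(name) is not None


for name in VALID_NAMES:
    print(f"{name!r}: {is_valid_name(name)}")
```

Note that the pattern only checks the character set and the first and last characters, so a string such as "A..  .b" also passes.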


Find matches ending with a letter that is not a starting letter of the next match

Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two-to-four numbers then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, never followed by a number. You can write a negative lookahead to capture this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know about how R will handle it.
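The same idea can be tried outside R. Below is a small Python sketch, assuming codes use upper-case letters only, so `[A-Z]` replaces `\w` (which would also match digits and underscores). The names `CODE_RE` and `extract_codes` are hypothetical.

```python
import re

# A letter, 2 to 4 digits, then an optional letter that is
# not immediately followed by a digit.
CODE_RE = re.compile(r"[A-Z]\d{2,4}(?:[A-Z](?!\d))?")


def extract_codes(text: str) -> list:
    """Return all non-overlapping codes found in text, left to right."""
    return CODE_RE.findall(text)


print(extract_codes("F328AG560F33"))  # ['F328A', 'G560', 'F33']
```

Because the trailing letter is only consumed when no digit follows it, the `F` after `G560` is left available as the start of the next code.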
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
See the regex demo
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"
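For comparison, a rough Python equivalent of the lookahead-with-capturing-group approach might look like the sketch below. It assumes Python's `re.findall`, which returns the contents of group 1 when the pattern has exactly one capturing group; `LOOKAHEAD_RE` is a name chosen for this example.

```python
import re

# The whole pattern is a zero-width lookahead; the code itself
# is captured in group 1.
LOOKAHEAD_RE = re.compile(
    r"(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))",
    re.IGNORECASE,
)

codes = LOOKAHEAD_RE.findall("F328AG560F33")
print(codes)  # ['F328A', 'G560', 'F33']
```

The lookahead is attempted at every position, but it only succeeds where a letter is followed by at least two digits, so positions inside a code produce no match.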

Regex to separate multiple items on a single line [closed]

I have a set of results that I would like to parse out using Regex and I can't seem to find an expression that works. On each line in a txt file, there are 2 entries each containing a quantity up to 100 followed by an item name of varying lengths and spaces.
Example:
7 BALLS OF STRING 13 CARDBOARD BOXES
14 ROCKS 12 PENCILS
I would like to match the 1st entry with its quantity in group 1, and the 2nd entry with its quantity in group 2.
You can use the following regular expression pattern and use it while reading the file, line per line:
^(\d*\s[A-Z\s]*)\s(\d*\s[A-Z\s]*)$
Here is a live example: https://regex101.com/r/18dege/1
Here are a few details:
^ matches the beginning of the string, $ the end of it
\d* greedily matches any number (0 or more) of numeric characters (equal to [0-9]*)
\s matches a white space character (e.g. tab, space, etc.)
[A-Z\s]* greedily matches any number (0 or more) of uppercase characters and whitespace
() creates a matching group (to extract some parts of the string)
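To see the two groups in practice, here is a minimal Python sketch that applies the pattern to one line at a time. `LINE_RE` and `split_entries` are illustrative names, not part of the original answer.

```python
import re

# Two entries per line: quantity, space, upper-case item name.
LINE_RE = re.compile(r"^(\d*\s[A-Z\s]*)\s(\d*\s[A-Z\s]*)$")


def split_entries(line: str):
    """Return (first_entry, second_entry), or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groups() if match else None


print(split_entries("7 BALLS OF STRING 13 CARDBOARD BOXES"))
print(split_entries("14 ROCKS 12 PENCILS"))
```

The greedy `[A-Z\s]*` in the first group first runs up to the second quantity and then gives back one space so that the separating `\s` can match.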
According to the comment below, uppercase letters can be followed by lowercase letters, which should not be matched. An example for this would be:
7 BALLS OF STRING 13 CARDBOARD BOXES
14 ROCKS 12 PENCILS
18 TABLES 3 BLANKETS sewn with patches
To match this pattern, you can use the following regular expression:
^(\d*\s[A-Z\s]*?)[a-z\s]*\s(\d*\s[A-Z\s]*?)[a-z\s]*$
As an update to the above pattern, I've added the following:
[a-z\s]* between the statements (not in the group) and after the second statement, to match a lowercase string
(\d*\s[A-Z\s]*?) I've added a question mark ?, to make the matching non-greedy. This prevents adding the white space between the uppercase and the lowercase part to the matching group. It is now required to have an end of string character $ at the end of the pattern, otherwise, the second group would not match enough characters.
Here is a live example: https://regex101.com/r/18dege/2
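A short Python sketch of the updated pattern, run over the three sample lines, could look like this. The names `MIXED_RE` and `parse_lines` are assumptions made for the example.

```python
import re

# Lazy upper-case groups, each optionally followed by lower-case text
# that is matched but not captured.
MIXED_RE = re.compile(r"^(\d*\s[A-Z\s]*?)[a-z\s]*\s(\d*\s[A-Z\s]*?)[a-z\s]*$")

SAMPLE_LINES = [
    "7 BALLS OF STRING 13 CARDBOARD BOXES",
    "14 ROCKS 12 PENCILS",
    "18 TABLES 3 BLANKETS sewn with patches",
]


def parse_lines(lines):
    """Return a list of (first_entry, second_entry) tuples for matching lines."""
    results = []
    for line in lines:
        match = MIXED_RE.match(line)
        if match:
            results.append(match.groups())
    return results


for first, second in parse_lines(SAMPLE_LINES):
    print(first, "|", second)
```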

Filter lines based on range of value, using regex [closed]

What regex will work to match only certain rows which have a value range (e.g. 20-25 days) in the text raw data (sample below):
[product-1][arbitrary-text][expiry-17days]
[product-2][arbitrary-text][expiry-22days]
[product-3][arbitrary-text][expiry-29days]
[product-4][arbitrary-text][expiry-25days]
[product-5][arbitrary-text][expiry-10days]
[product-6][arbitrary-text][expiry-12days]
[product-7][arbitrary-text][expiry-20days]
[product-8][arbitrary-text][expiry-26days]
'product' and 'expiry' text is static (doesn't change), while their corresponding values change.
'arbitrary-text' is also different for each line/product. So in the sample above, the regex should only match/return lines which have the expiry between 20-25 days.
Expected regex matches:
[product-2][arbitrary-text][expiry-22days]
[product-4][arbitrary-text][expiry-25days]
[product-7][arbitrary-text][expiry-20days]
Thanks.
Please check the following regex:
/(.*-2[0-5]days\]$)/gm
( # start capturing group
.* # matches any character (except newline)
- # matches hyphen character literally
2 # matches digit 2 literally
[0-5] # matches any digit between 0 to 5
days # matches the character days literally
\] # matches the character ] literally
$ # assert position at end of a line
) # end of the capturing group
Do note the use of -2[0-5]days to make sure that it doesn't match:
[product-7][arbitrary-text][expiry-222days] # won't match this
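As an illustration, the pattern can be applied to the sample data with Python's `re` module. This is only a sketch: `EXPIRY_RE`, `RAW_DATA` and `matching_lines` are names invented here, and `re.MULTILINE` stands in for the `m` flag so that `$` matches at the end of each line.

```python
import re

# Same pattern as above; findall returns the captured line for each match.
EXPIRY_RE = re.compile(r"(.*-2[0-5]days\]$)", re.MULTILINE)

RAW_DATA = "\n".join([
    "[product-1][arbitrary-text][expiry-17days]",
    "[product-2][arbitrary-text][expiry-22days]",
    "[product-3][arbitrary-text][expiry-29days]",
    "[product-4][arbitrary-text][expiry-25days]",
    "[product-5][arbitrary-text][expiry-10days]",
    "[product-6][arbitrary-text][expiry-12days]",
    "[product-7][arbitrary-text][expiry-20days]",
    "[product-8][arbitrary-text][expiry-26days]",
])

matching_lines = EXPIRY_RE.findall(RAW_DATA)
for line in matching_lines:
    print(line)
```

For ranges that do not line up with digit boundaries (say 18 to 31 days), capturing the number and comparing it as an integer is usually easier than encoding the range in the regex.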
I tested this one and it works as expected:
/[2-2]+[0-5]/g
[2-2] matches a digit between 2 and 2 (that is, a literal 2), which keeps the match within the twenties.
[0-5] requires the second digit to be between 0 and 5.
{2} limit to 2 digits.
Edit: to match the entire line character for character, this should do it for you.
\[\w*\-\d*\]\s*\[\w*\-[2-2]+[0-5]\w*\]
Edit2: updated to account for Arbitrary text ...
\[(\w*-\d*)\]+\s*\[(\w*\-\w*)\]\s*\[(\w*\-[2-2]+[0-5]\w*)\]
edit3: Updated to match any character for the arbitrary-text.
\[(\w*-\d*)\]\s*\[(.*)\]\s*\[(\w*\-[2-2][0-5]\w*)\]
.*\D2[0-5]d.*
.* matches everything.
\D prevents numbers like 123 and 222 from being valid matches.
2[0-5] covers the range.
d so it doesn't match the product number.
I pasted your sample text into http://regexr.com
It's a useful tool for building regular expressions.
You can try this one :
/(.*-2[0-5]days\]$)/gm

matching a number pattern between special characters [closed]

I am trying to write a regular expression to extract only the number 120068018
!false!|!!|!!|!!|!!|!120068018!|!!|!false!
I am rather new to regular expressions and am finding this a daunting task.
I have tried using the pattern as the numbers always start with 1200
'/^1200+!\$/'
but that does not seem to work.
If your number always starts with 1200, you can do
/1200\d*/
This matches a string that starts with 1200 plus "however many digits follow". Depending on the 'dialect' of regex that you use, you might find
/1200[0-9]*/
more robust; or
/1200[[:digit:]]*/
which is more "readable"
Any of these can be "captured" with parentheses. Again, depending on the dialect, you may or may not need to escape these (add \ in front). Example
echo '||!!||!!||!!120012345||##!!##!!||' | sed 's/^.*\(1200[0-9]*\).*$/\1/'
produces
120012345
just as you wanted. Explanation:
sed stream editor command
s substitute
^.* everything from start of string until
\( start capture group
1200 the number 1200
[0-9]* followed by any number of digits
\) end capture
.*$ everything from here to end of line
/\1/ and replace with the contents of the first capture group
Use this:
/1200\d+/
\d is a meta-character that will match any digits.
Your regular expression didn't work because:
^ matches start of a string. You didn't have your number there.
$ matches end of the string. Again, your number is in the middle of the string.
+ applies to the immediately preceding character, meta-character or group. So 1200+ means 120 followed by 1 or more zeroes.
This is the regular expression you need:
/[0-9]+/
[0-9] Matches any number from 0 to 9
The + sign means 1 or more times.
If you want to get any number starting with 1200, then it would be
/1200[0-9]*/
Here I am using * because it allows zero or more. Otherwise, the number 1200 wouldn't be captured.
If you want to capture (extract) the string, surround it with parentheses:
/(1200[0-9]*)/
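To make the difference concrete, here is a small Python sketch using the string from the question. The helper name `extract_number` is hypothetical; the second `print` shows what `1200+` matches on its own.

```python
import re
from typing import Optional

raw = "!false!|!!|!!|!!|!!|!120068018!|!!|!false!"


def extract_number(text: str) -> Optional[str]:
    """Return the first run of digits that starts with 1200, or None."""
    match = re.search(r"1200\d*", text)
    return match.group(0) if match else None


print(extract_number(raw))  # 120068018

# In 1200+ the + repeats only the final 0, so the match
# stops as soon as a different digit appears.
print(re.search(r"1200+", raw).group(0))  # 1200
```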

How to extract internal words using regex

I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
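Combining the lookbehind from the question with this lookahead gives a pattern that can be tried in Python as sketched below. `STREET_RE` and `street_name` are names chosen for the example, and the character class `[\w\s']` is a compact form of the question's `(\w|\s|\')`.

```python
import re

# Lookbehind: preceded by a digit and a space.
# Lookahead: followed by a space, a final word and an optional period.
STREET_RE = re.compile(r"(?<=\d\s)[\w\s']+(?=\s\w+\.?$)")


def street_name(address: str):
    """Return the street name without house number and final word, or None."""
    match = STREET_RE.search(address)
    return match.group(0) if match else None


for address in ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]:
    print(street_name(address))
```

The greedy class first swallows the last word as well and then backtracks until the lookahead succeeds, which is what drops that final word from the match.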
Why not just match what you want? If I have understood well you need to get all the words after the numbers excluding the last word. Words are separated by space so just get everything between numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+)  Note: there's a space at the end.
That will leave what you want in \1 for each address.
If you want the match itself to be exactly that text, you can use \K (not supported by every regex engine). Example:
\d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = " ".join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)
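A runnable version of the split-based idea might look like the following sketch. The function name `street_from_split` is made up, and it assumes the first whitespace-separated token is always the house number and the last token is always the street type.

```python
def street_from_split(address: str) -> str:
    """Drop the first and last whitespace-separated words and rejoin the rest."""
    words = address.split()
    # words[1:-1] is a list, so join it back into a single string.
    return " ".join(words[1:-1])


for address in ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]:
    print(street_from_split(address))
```

This avoids regular expressions entirely, at the cost of returning an empty string for addresses with fewer than three words.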