Given the following:
"John Smith"
"John Smith (123)"
"John Smith (123) (456)"
I'd like to capture:
"John Smith"
"John Smith", "123"
"John Smith (123)", "456"
What Java regex would allow me to do that?
I've tried (.+)\s\((\d+)\)$ and it works fine for "John Smith (123)" and "John Smith (123) (456)" but not for "John Smith". How can I change the regex to work for the first input as well?
You may turn the first .+ lazy, and wrap the later part with a non-capturing optional group:
(.+?)(?:\s\((\d+)\))?$
See the regex demo
Actually, if you are using the regex with String#matches() the last $ is redundant.
Details:
(.+?) - Group 1 capturing one or more characters other than line break characters, as few as possible (thus allowing the subsequent optional subpattern to match the trailing number, if there is one)
(?:\s\((\d+)\))? - an optional sequence of a whitespace, (, Group 2 capturing 1+ digits and a )
$ - end of string anchor.
A Java demo:
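import java.util.regex.Matcher;
import java.util.regex.Pattern;

String[] lst = {"John Smith", "John Smith (123)", "John Smith (123) (456)"};
// No trailing $ is needed: matches() requires the whole string to match
Pattern p = Pattern.compile("(.+?)(?:\\s\\((\\d+)\\))?");
for (String s : lst) {
    Matcher m = p.matcher(s);
    if (m.matches()) {
        System.out.println(m.group(1));
        // Group 2 is null when the optional "(digits)" part is absent
        if (m.group(2) != null) {
            System.out.println(m.group(2));
        }
    }
}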
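For comparison, the same lazy group plus optional non-capturing group can be sketched in Python. This is only an illustrative port, not part of the original answer: the helper name split_trailing_number is made up here, and re.fullmatch() stands in for String#matches().

```python
import re

# Same pattern as the Java demo; fullmatch() anchors both ends, like String#matches()
PATTERN = re.compile(r"(.+?)(?:\s\((\d+)\))?")

def split_trailing_number(s):
    """Hypothetical helper: return (name, number); number is None when absent."""
    m = PATTERN.fullmatch(s)
    return (m.group(1), m.group(2)) if m else None

for s in ["John Smith", "John Smith (123)", "John Smith (123) (456)"]:
    print(split_trailing_number(s))
```

Because the first group is lazy, it only grows as far as needed for the optional "(digits)" part to reach the end of the string, so only the last parenthesized number is split off.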
I would like to take a string like "My String 2022-01-07" and extract the date part into a named capture.
I've tried the following regex, but it only works when there's an exact match:
# Does not work
iex> Regex.named_captures(~r/(?<date>\$?(\d{4}-\d{2}-\d{2})?)/, "My String 2021-01-01")
%{"date" => ""}
# Works
iex> Regex.named_captures(~r/(?<date>\$?(\d{4}-\d{2}-\d{2})?)/, "2021-01-01")
%{"date" => "2021-01-01"}
I've also tried this without luck:
iex> Regex.named_captures(~r/([a-zA-Z0-9 ]+?)(?<date>\$?(\d{4}-\d{2}-\d{2})?)/, "My String 2021-01-01")
%{"date" => ""}
Is there a way to use named captures to extract the date part of a string when you don't care about the characters surrounding the date?
I think I'm looking for a regex that will work like this:
iex> Regex.named_captures(REGEX???, "My String 2021-01-01 Other Parts")
%{"date" => "2021-01-01"}
You want
Regex.named_captures(~r/(?<date>\$?\d{4}-\d{2}-\d{2})/, "My String 2021-01-01")
Your regex - (?<date>\$?(\d{4}-\d{2}-\d{2})?) - represents a named capturing group with date as a name and a \$?(\d{4}-\d{2}-\d{2})? as a pattern. The \$?(\d{4}-\d{2}-\d{2})? pattern matches
\$? - an optional $ char
(\d{4}-\d{2}-\d{2})? - an optional sequence of four digits, -, two digits, -, two digits.
Since the pattern is not anchored (does not have to match the whole string) and both consecutive pattern parts are optional and thus can match an empty string, the ~r/(?<date>\$?(\d{4}-\d{2}-\d{2})?)/ regex matches the first empty location (empty string) at the start of the "My String 2021-01-01" string.
Rule of thumb: if you do not want to match an empty string, make sure your pattern contains at least one mandatory part that must match at least one char.
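The empty-match behaviour is not specific to Elixir. As a rough illustration (a sketch in Python, where named groups are written (?P<name>...); the variable names are made up for this example), the all-optional pattern stops at index 0 with an empty match, while the pattern with mandatory digits finds the date:

```python
import re

text = "My String 2021-01-01 Other Parts"

# Every part is optional, so the first (empty) match is found at index 0
optional = re.search(r"(?P<date>\$?(\d{4}-\d{2}-\d{2})?)", text)

# The digits are mandatory, so the search has to advance to the real date
required = re.search(r"(?P<date>\$?\d{4}-\d{2}-\d{2})", text)

print(repr(optional.group("date")), optional.start())
print(repr(required.group("date")))
```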
Extract Date only:
void main() {
  String inputString = "Your String 1/19/2023 9:29:11 AM";
  RegExp dateRegex = RegExp(r"(\d{1,2}\/\d{1,2}\/\d{4})");
  Iterable<RegExpMatch> matches = dateRegex.allMatches(inputString);
  for (RegExpMatch m in matches) {
    print(m.group(0));
  }
}
This will output:
1/19/2023
Extract Date and time:
void main() {
  String inputString = "Your String 1/19/2023 9:29:11 AM";
  RegExp dateTimeRegex = RegExp(r"(\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)");
  Iterable<RegExpMatch> matches = dateTimeRegex.allMatches(inputString);
  for (RegExpMatch m in matches) {
    print(m.group(0));
  }
}
This will output: 1/19/2023 9:29:11 AM
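The same two patterns carry over to other regex flavours almost unchanged. A minimal Python sketch, assuming the same sample string (the variable names are illustrative only):

```python
import re

input_string = "Your String 1/19/2023 9:29:11 AM"

# Date only: M/D/YYYY with one- or two-digit month and day
date_re = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")
# Date followed by a 12-hour time and an AM/PM marker
datetime_re = re.compile(r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M")

dates = date_re.findall(input_string)
datetimes = datetime_re.findall(input_string)
print(dates)
print(datetimes)
```

Note that these patterns only check the shape of the text: something like 99/99/2023 would also match, so parse the result with a date library if validity matters.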
I'd like to sanitize a string so that all whitespace is removed, except whitespace between words and whitespace surrounding hyphens:
1234 - Text | OneWord , Multiple Words | Another Text , 456 -> 1234 - Text|OneWord,Multiple Words|Another Text,456
std::regex regex(R"(\B\s+|\s+\B)"); // mark whitespace that is not between words with a placeholder
auto newStr = std::regex_replace(str, regex, "*");
newStr = std::regex_replace(newStr, std::regex(R"(\*-\*)"), " - "); // restore spaces around hyphens
newStr = std::regex_replace(newStr, std::regex(R"(\*)"), ""); // drop the remaining placeholders
This is what I currently use, but it is rather ugly and I'm wondering if there is a regex I can use to do this in one go.
You can use
(\s+-\s+|\b\s+\b)|\s+
Replace with $1, backreference to the captured substrings in Group 1. See the regex demo. Details:
(\s+-\s+|\b\s+\b) - Group 1: a - with one or more whitespaces on both sides, or one or more whitespaces in between word boundaries
| - or
\s+ - one or more whitespaces.
See the C++ demo:
std::string s("1234 - Text | OneWord , Multiple Words | Another Text , 456");
std::regex reg(R"((\s+-\s+|\b\s+\b)|\s+)");
std::cout << std::regex_replace(s, reg, "$1") << std::endl;
// => 1234 - Text|OneWord,Multiple Words|Another Text,456
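The "capture what you want to keep, match what you want to drop" idea is not tied to C++. A small Python sketch of the same replacement (the function name sanitize is made up for this example) could look like this:

```python
import re

# Group 1 = whitespace to keep; the bare \s+ alternative = whitespace to drop
KEEP_OR_DROP = re.compile(r"(\s+-\s+|\b\s+\b)|\s+")

def sanitize(s):
    # When group 1 did not participate in the match, substitute an empty string
    return KEEP_OR_DROP.sub(lambda m: m.group(1) or "", s)

s = "1234 - Text | OneWord , Multiple Words | Another Text , 456"
print(sanitize(s))
```

Kept whitespace is put back exactly as it was matched, so a run of several spaces between two words is preserved rather than collapsed to one.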
I have a list of strings, e.g.
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
I want to remove the '-' when it is the first character and is followed by letters rather than digits. If the '-' is preceded by a digit or letter and followed by letters, it should be replaced with a space.
So for the list slist I want the output as
["args", "-111111", "20 args", "20 - 20", "20-10", "args deep"]
I have tried
import re
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
nlist = list()
for estr in slist:
    nlist.append(re.sub("((^-[a-zA-Z])|([0-9]*-[a-zA-Z]))", "", estr))
print(nlist)
and I get the output
['rgs', '-111111', 'rgs', '20 - 20', '20-10', 'argseep']
You may use
nlist.append(re.sub(r"-(?=[a-zA-Z])", " ", estr).lstrip())
or
nlist.append(re.sub(r"-(?=[^\W\d_])", " ", estr).lstrip())
Result: ['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
See the Python demo.
The -(?=[a-zA-Z]) pattern matches a hyphen before an ASCII letter (-(?=[^\W\d_]) matches a hyphen before any letter), and replaces the match with a space. Since - may be matched at the start of a string, the space may appear at that position, so .lstrip() is used to remove the space(s) there.
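Put together as a runnable snippet (a sketch only; the helper name fix_hyphens is not from the answer), the lookahead version looks like this:

```python
import re

slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]

def fix_hyphens(s):
    # Replace a hyphen that is directly followed by a letter with a space,
    # then drop the space this leaves behind when the hyphen was at the start
    return re.sub(r"-(?=[^\W\d_])", " ", s).lstrip()

nlist = [fix_hyphens(s) for s in slist]
print(nlist)
```

One caveat: .lstrip() also removes any leading whitespace that was already in the input, which may or may not be wanted.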
Here, we might just want to capture the first letter after a starting -, then replace it with that letter only, maybe with an i flag expression similar to:
^-([a-z])
DEMO
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^-([a-z])"
test_str = ("-args\n"
"-111111\n"
"20-args\n"
"20 - 20\n"
"20-10\n"
"args-deep")
subst = "\\1"
# The number of replacements can be limited with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /^-([a-z])/gmi;
const str = `-args
-111111
20-args
20 - 20
20-10
args-deep`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
One option could be to do two replacements. First, match the hyphen at the start of the string when it is followed only by letters:
^-(?=[a-zA-Z]+$)
Regex demo
In the replacement use an empty string.
Then capture one or more letters or digits in group 1, match the -, and capture one or more letters in group 2.
^([a-zA-Z0-9]+)-([a-zA-Z]+)$
Regex demo
In the replacement use r"\1 \2"
For example
import re
regex1 = r"^-(?=[a-zA-Z]+$)"
regex2 = r"^([a-zA-Z0-9]+)-([a-zA-Z]+)$"
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
slist = list(map(lambda s: re.sub(regex2, r"\1 \2", re.sub(regex1, "", s)), slist))
print(slist)
Result
['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
Python demo
As the title says, how do I match a person whose:
Last name starts with the letter "S"
First name does NOT start with the letter "S"
The expression should match the entire last name, not just the first letter, and first name should NOT be matched.
Input string is like the following:
(Last name) (First name)
Duncan, Jean
Schmidt, Paul
Sells, Simon
Martin, Jane
Smith, Peter
Stephens, Sheila
This is my regular expression:
/([S].+)(?:, [^S])/
Here is the result I have got:
Schmidt, P
Smith, P
The result includes the ",", the space and the letter "P", which should be excluded.
The ideal match would be
Schmidt
Smith
You can try this pattern: ^S\w+(?=, [A-RT-Z]).
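^S\w+ matches a word (the last name in your case) that starts with S at the beginning of the string (or of each line, if multiline mode is enabled),
(?=, [A-RT-Z]) - a positive lookahead that makes sure what follows is a comma, a space and an uppercase letter other than S, i.e. a first name that does not start with S ([A-RT-Z] includes all ASCII capitals except S).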
Demo
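As a quick check of this pattern, here is a Python sketch. It assumes the names arrive as one multi-line string, which is why re.MULTILINE is passed so that ^ matches at the start of every line; the variable names are illustrative.

```python
import re

names = """Duncan, Jean
Schmidt, Paul
Sells, Simon
Martin, Jane
Smith, Peter
Stephens, Sheila"""

# The lookahead checks the first name's initial without consuming it
matches = re.findall(r"^S\w+(?=, [A-RT-Z])", names, flags=re.MULTILINE)
print(matches)
```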
I did something similar to catch the initials. I've just updated the code to fit your need. Check it:
using System;
using System.Linq;

public static void Main(string[] args)
{
    // Your code goes here
    Console.WriteLine(ValidateName("FirstName LastName", 'L'));
}

private static string ValidateName(string name, char letter)
{
    // Split the name on spaces
    string[] names = name.Split(new string[] {" "}, StringSplitOptions.RemoveEmptyEntries);
    if (names.Count() > 0)
    {
        var firstInitial = names.First().ToUpper().First();
        var lastInitial = names.Last().ToUpper().First();
        // Return the last name only when it starts with the letter and the first name does not
        if (!firstInitial.Equals(letter) && lastInitial.Equals(letter))
        {
            return names.Last();
        }
    }
    return string.Empty;
}
In your current regex you capture the last name in a capturing group and match the rest in a non-capturing group.
If you change your non capturing group (?: into a positive lookahead (?= you would only capture the lastname.
([S].+)(?=, [^S]) or a bit shorter S.+(?=, [^S])
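The difference between the two forms can be seen by comparing the whole match with group 1. A small Python sketch (variable names are illustrative):

```python
import re

line = "Schmidt, Paul"

# Non-capturing group: the trailing ", P" is still consumed by the overall match
original = re.search(r"([S].+)(?:, [^S])", line)
# Lookahead: the trailing ", P" is only checked, not consumed
lookahead = re.search(r"S.+(?=, [^S])", line)

print(original.group(0))   # whole match
print(original.group(1))   # capturing group only
print(lookahead.group(0))  # whole match
```

So the original regex already holds the last name in group 1; the lookahead version is only needed when the whole match itself must be the last name.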
Your regex worked for me fine
$array = ["Duncan, Jean","Schmidt, Paul","Sells, Simon","Martin, Jane","Smith, Peter","Stephens, Sheila"];
foreach ($array as $el) {
    if (preg_match('/([S].+)(?:,)( [^S].+)/', $el, $matches)) {
        echo $matches[2]."<br/>";
    }
}
The Answer I got is
Paul
Peter
I would like to select all long words from a string: re.findall("[a-z]{3,}", text)
However, for certain reasons I can only use substitution. Hence I need to replace everything except words of three or more letters with a space (e.g. abc de1 fgh ij -> abc fgh).
What would such a regex look like?
The result should be all "[a-z]{3,}" concatenated by spaces. However, you can use substitution only.
Or in Python: Find a regex such that
re.sub(regex, " ", text) == " ".join(re.findall("[a-z]{3,}", text))
Here are some test cases
import re
solution_regex = "..."
for test_str in ["aaa aa aaa aa",
                 "aaa aa11",
                 "11aaa11 11aa11",
                 "aa aa1aa aaaa"
                 ]:
    expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
    print(test_str, "->", expected_str)
    if re.sub(solution_regex, " ", test_str) != expected_str:
        print("ERROR")
->
aaa aa aaa aa -> aaa aaa
aaa aa11 -> aaa
11aaa11 11aa11 -> aaa
aa aa1aa aaaa -> aaaa
Note that space is no different than any other symbol.
\b(?:[a-zA-Z_]{1,2}|\w*\d+\w*)\b
Explanation:
\b means that the substring you are looking for starts and ends at a word boundary
(?: ) - non-capturing group
\w*\d+\w* Any word that contains at least one digit and consists of digits, '_' and letters
Here you can see the test.
You can use the regex
(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)
and replace with an empty string. Here is Python code for the same:
import re
regex = r"(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)"
test_str = "abcd abc ad1r ab a11b a1 11a 1111 1111abcd a1b2c3d"
subst = ""
# The number of replacements can be limited with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
Here is a demo
In Autoit this works for me
#include <Array.au3>
$a = StringRegExp('abc de1 fgh ij 234234324 sdfsdfsdf wfwfwe', '(?i)[a-z]{3,}', 3)
ConsoleWrite(_ArrayToString($a, ' ') & @CRLF)
Result ==> abc fgh sdfsdfsdf wfwfwe
import re
regex = r"(?:^|\s)[^a-z\s]*[a-z]{0,2}[^a-z\s]*(?:\s|$)"
text = "abc de1 fgh ij"  # named text to avoid shadowing the built-in str
subst = " "
result = re.sub(regex, subst, text)
print(result)
Output:
abc fgh
Explanation:
(?:^|\s) : non capture group, start of string or space
[^a-z\s]* : 0 or more any character that is not letter or space
[a-z]{0,2} : 0, 1 or 2 letters
[^a-z\s]* : 0 or more any character that is not letter or space
(?:\s|$) : non capture group, space or end of string
With the other ideas posted here, I came up with an answer. I can't believe I missed that:
([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+
https://regex101.com/r/IIxkki/2
Match either non-letters, or up to two letters bounded by non-letters.
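To see how this behaves on the test strings from the question, here is a small harness sketch (the check helper and the results list are made up for this example). Note one assumption: the raw substitution can leave a single leading or trailing space, so the comparison below strips it, which goes slightly beyond the "substitution only" constraint.

```python
import re

solution_regex = r"([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+"

tests = ["aaa aa aaa aa", "aaa aa11", "11aaa11 11aa11", "aa aa1aa aaaa"]

def check(text):
    # Hypothetical helper: return (expected, actual) for one test string
    expected = " ".join(re.findall("[a-z]{3,}", text))
    actual = re.sub(solution_regex, " ", text)
    return expected, actual

results = [check(t) for t in tests]
for t, (expected, actual) in zip(tests, results):
    # repr() makes any leading or trailing space visible
    print(repr(t), "->", repr(actual), "expected", repr(expected))
```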