How to capture repeated group up to N times? - c++

I would like to capture chains of digits in a string, but only up to 3 times.
Any chain of digits afterwards should be ignored. For instance:
T441_S45/1 => 441 45 1
007_S4 => 007 4
41_445T02_74 => 41 445 02
I've tried (\d+){1,3} but that doesn't seem to work...
Any hint?

You may match and capture the first three chunks of digits separated with any amount of non-digits and the rest of the string, and replace with the backreferences to those groups:
^\D*(\d+)(?:\D+(\d+))?(?:\D+(\d+))?.*
Or, if the string can be multiline,
^\D*(\d+)(?:\D+(\d+))?(?:\D+(\d+))?[\s\S]*
The replacement string will look like $1 $2 $3.
Details
^ - start of string
\D* - 0+ non-digits
(\d+) - Group 1: one or more digits
(?:\D+(\d+))? - an optional non-capturing group matching:
\D+ - 1+ non-digits
(\d+) - Group 2: one or more digits
(?:\D+(\d+))? - another optional non-capturing group matching:
\D+ - one or more non-digits
(\d+) - Group 3: one or more digits
[\s\S]* - the rest of the string.
See the regex demo.
C++ demo:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main() {
std::vector<std::string> strings;
strings.push_back("T441_S45/1");
strings.push_back("007_S4");
strings.push_back("41_445T02_74");
std::regex reg(R"(^\D*(\d+)(?:\D+(\d+))?(?:\D+(\d+))?[\s\S]*)");
for (size_t k = 0; k < strings.size(); k++)
{
std::cout << "Input string: " << strings[k] << std::endl;
std::cout << "Replace result: "
<< std::regex_replace(strings[k], reg, "$1 $2 $3") << std::endl;
}
return 0;
}
Output:
Input string: T441_S45/1
Replace result: 441 45 1
Input string: 007_S4
Replace result: 007 4
Input string: 41_445T02_74
Replace result: 41 445 02
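If you would rather have the digit chains as separate strings than as one replaced string, iterating over the matches and stopping after N hits is another option. The following is only a sketch of that idea: the helper name first_digit_runs and its max_count parameter are made up for illustration and are not part of the answer above.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer above): returns at most
// max_count runs of digits from input, in order of appearance.
std::vector<std::string> first_digit_runs(const std::string& input, std::size_t max_count) {
    // Assumption of this sketch: asking for zero runs is treated as a caller bug.
    assert(max_count > 0);
    static const std::regex digits(R"(\d+)");
    std::vector<std::string> runs;
    // Walk over every \d+ match and stop as soon as max_count runs are collected.
    for (std::sregex_iterator it(input.begin(), input.end(), digits), end;
         it != end && runs.size() < max_count; ++it) {
        runs.push_back(it->str());
    }
    return runs;
}
```

Calling first_digit_runs("41_445T02_74", 3) returns the three strings 41, 445 and 02; the trailing 74 is never collected because the loop stops once three runs are stored.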

Related

Remove only non-leading and non-trailing spaces from a string in Ruby?

I'm trying to write a Ruby method that will return true only if the input is a valid phone number, which means, among other rules, it can have spaces and/or dashes between the digits, but not before or after the digits.
In a sense, I need a method that does the opposite of String#strip! (remove all spaces except leading and trailing spaces), plus the same for dashes.
I've tried using String#gsub!, but when I try to match a space or a dash between digits, then it replaces the digits as well as the space/dash.
Here's an example of the code I'm using to remove spaces. I figure once I know how to do that, it will be the same story with the dashes.
def valid_phone_number?(number)
phone_number_pattern = /^0[^0]\d{8}$/
# remove spaces
number.gsub!(/\d\s+\d/, "")
return number.match?(phone_number_pattern)
end
What happens is if I call the method with the following input:
valid_phone_number?(" 09 777 55 888 ")
I get false because line 5 transforms the number into " 0788 ", i.e. it gets rid of the digits around the spaces as well as the spaces. What I want it to do is just to get rid of the inner spaces, so as to produce " 0977755888 ".
I've tried
number.gsub!(/\d(\s+)\d/, "") and number.gsub!(/\d(\s+)\d/) { |match| "" } to no avail.
Thank you!!
If you want to return a boolean, you might for example use a pattern that accepts leading and trailing spaces, and matches 10 digits (as in your example data) where there can be optional spaces or hyphens in between.
^ *\d(?:[ -]*\d){9} *$
For example
def valid_phone_number?(number)
phone_number_pattern = /^ *\d(?:[ -]*\d){9} *$/
return number.match?(phone_number_pattern)
end
See a Ruby demo and a regex demo.
To remove spaces and hyphens in between digits, try:
(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)
See an online regex demo
(?: - Open non-capture group;
\d+ - Match 1+ digits;
| - Or;
\G(?!^)\d+ - Assert the position at the end of the previous match (the negative lookahead (?!^) rules out the start of a line), then match 1+ digits;
)\K - Close non-capture group and reset matching point;
[- ]+ - Match 1+ space/hyphen;
(?=\d) - Assert position is followed by digits.
p " 09 777 55 888 ".gsub(/(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)/, '')
Prints: " 0977755888 "
Using a very simple regex (/\d/ tests for a digit):
str = " 09 777 55 888 "
r = str.index(/\d/)..str.rindex(/\d/)
str[r] = str[r].delete(" -")
p str # => " 0977755888 "
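The range idea above (find the first and the last digit, then clean only what lies between them) does not depend on Ruby. For comparison with the C++ threads on this page, here is a rough C++ rendering of it; the function name strip_inner_separators is hypothetical, and the sketch assumes plain ASCII digits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, a C++ rendering of the Ruby range idea above:
// everything before the first digit and after the last digit is kept
// untouched, and spaces and hyphens in between are dropped.
std::string strip_inner_separators(const std::string& s) {
    const std::string digits = "0123456789";
    const std::size_t first = s.find_first_of(digits);
    if (first == std::string::npos) {
        return s;  // no digits at all, so nothing counts as "inner"
    }
    const std::size_t last = s.find_last_of(digits);

    std::string out = s.substr(0, first);  // leading part, kept as-is
    for (std::size_t i = first; i <= last; ++i) {
        if (s[i] != ' ' && s[i] != '-') {
            out += s[i];
        }
    }
    out += s.substr(last + 1);  // trailing part, kept as-is
    // Sanity check: characters are only ever removed, never added.
    assert(out.size() <= s.size());
    return out;
}
```

With the input " 09 777 55 888 " this keeps the leading and trailing space and yields " 0977755888 ", the same string the Ruby snippet prints.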
Passing a block to gsub is another option, with the capture groups available as globals:
>> str = " 09 777 55 888 "
# simple, easy to understand
>> str.gsub(/(^\s+)([\d\s-]+?)(\s+$)/){ "#$1#{$2.delete('- ')}#$3" }
=> " 0977755888 "
# a different take on @steenslag's answer, to avoid using range.
>> s = str.dup; s[/^\s+([\d\s-]+?)\s+$/, 1] = s.delete("- "); s
=> " 0977755888 "
Benchmark, not that it matters that much:
require 'benchmark'
n = 1_000_000
puts(Benchmark.bmbm do |x|
# just a match
x.report("match") { n.times {str.match(/^ *\d(?:[ -]*\d){9} *$/) } }
# use regex in []=
x.report("[//]=") { n.times {s = str.dup; s[/^\s+([\d\s-]+?)\s+$/, 1] = s.delete("- "); s } }
# use range in []=
x.report("[..]=") { n.times {s = str.dup; r = s.index(/\d/)..s.rindex(/\d/); s[r] = s[r].delete(" -"); s } }
# block in gsub
x.report("block") { n.times {str.gsub(/(^\s+)([\d\s-]+?)(\s+$)/){ "#$1#{$2.delete('- ')}#$3" }} }
# long regex
x.report("regex") { n.times {str.gsub(/(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)/, "")} }
end)
Rehearsal -----------------------------------------
match 0.997458 0.000004 0.997462 ( 0.998003)
[//]= 1.822698 0.003983 1.826681 ( 1.827574)
[..]= 3.095630 0.007955 3.103585 ( 3.105489)
block 3.515401 0.003982 3.519383 ( 3.521392)
regex 4.761748 0.007967 4.769715 ( 4.772972)
------------------------------- total: 14.216826sec
user system total real
match 1.031670 0.000000 1.031670 ( 1.032347)
[//]= 1.859028 0.000000 1.859028 ( 1.860013)
[..]= 3.074159 0.003978 3.078137 ( 3.079825)
block 3.751532 0.011982 3.763514 ( 3.765673)
regex 4.634857 0.003972 4.638829 ( 4.641259)

regex to match all whitespace except those between words and surrounding hyphens?

I'd like to sanitize a string so all whitespace is removed, except those between words, and surrounding hyphens
1234 - Text | OneWord , Multiple Words | Another Text , 456 -> 1234 - Text|OneWord,Multiple Words|Another Text,456
std::regex regex(R"(\B\s+|\s+\B)"); //get rid of whitespaces except between words
auto newStr = std::regex_replace(str, regex, "*");
newStr = std::regex_replace(newStr, std::regex(R"(\*-\*)"), " - ");
newStr = std::regex_replace(newStr, std::regex(R"(\*)"), "");
This is what I currently use, but it is rather ugly, and I'm wondering if there is a regex I can use to do this in one go.
You can use
(\s+-\s+|\b\s+\b)|\s+
Replace with $1, backreference to the captured substrings in Group 1. See the regex demo. Details:
(\s+-\s+|\b\s+\b) - Group 1: a - with one or more whitespaces on both sides, or one or more whitespaces in between word boundaries
| - or
\s+ - one or more whitespaces.
See the C++ demo:
std::string s("1234 - Text | OneWord , Multiple Words | Another Text , 456");
std::regex reg(R"((\s+-\s+|\b\s+\b)|\s+)");
std::cout << std::regex_replace(s, reg, "$1") << std::endl;
// => 1234 - Text|OneWord,Multiple Words|Another Text,456
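The $1 replacement works because Group 1 only participates when the first alternative matched; when the bare \s+ alternative matched, the group is unset and $1 expands to an empty string. Spelled out as an explicit loop, the same logic might look like the sketch below. The function name sanitize is an assumption made for illustration; the pattern is the one from the answer.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper that spells out what regex_replace(..., "$1") does:
// whitespace captured in Group 1 is kept, any other whitespace is dropped.
std::string sanitize(const std::string& text) {
    static const std::regex ws(R"((\s+-\s+|\b\s+\b)|\s+)");
    std::string out;
    auto last = text.begin();  // end of the previous match
    for (std::sregex_iterator it(text.begin(), text.end(), ws), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);  // copy the text between two matches
        if (m[1].matched) {
            out += m[1].str();         // Group 1 took part: keep this whitespace
        }                              // otherwise the match is simply dropped
        last = m[0].second;
    }
    out.append(last, text.end());      // copy the tail after the final match
    // Sanity check: whitespace is only ever removed, never added.
    assert(out.size() <= text.size());
    return out;
}
```

In practice the one-line std::regex_replace call from the answer is the shorter way to get this result; the loop is only meant to show which branch keeps what.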

Pattern match for (length)%code with before length

I have a pattern like x%c, where x is a single-digit integer and c is an alphanumeric code of length x. The % is just a separator between the length and the code.
For instance, 2%74 is valid since 74 has 2 digits. Similarly, 1%8 and 4%3232 are also valid.
I have tried a regex of the form ^([0-9])(%)([A-Z0-9]){\1}, where I am trying to limit the length by the value of group 1. It does not work, apparently because the group is treated as a string, not a number.
If I change the above regex to ^([0-9])(%)([A-Z0-9]){2}, it works for 2%74, but that is of no use since the length has to be controlled by the first group, not by a fixed digit.
If it is not possible with a regex, is there a better approach in Java?
One way is to use 2 capture groups, convert the first group to an int, and compare it with the number of characters in the second group.
\b(\d+)%(\d+)\b
\b Word boundary
(\d+) Capture group 1, match 1+ digits
% Match literally
(\d+) Capture group 2, match 1+ digits (use [A-Z0-9]+ instead if the code may contain letters, as described in the question)
\b Word boundary
Regex demo | Java demo
For example
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String regex = "\\b(\\d+)%(\\d+)\\b";
Pattern pattern = Pattern.compile(regex);
String[] strings = { "2%74", "1%8", "4%3232", "5%123456", "6%0" };
for (String s : strings) {
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
if (Integer.parseInt(matcher.group(1)) == matcher.group(2).length()) {
System.out.println("Match for " + s);
} else {
System.out.println("No match for " + s);
}
}
}
Output
Match for 2%74
Match for 1%8
Match for 4%3232
No match for 5%123456
No match for 6%0
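The same two-step idea (let the regex split the token, then compare the declared length with the real one in code) carries over to other languages. Below is a rough C++ sketch of it, written in C++ only to stay in line with the other threads on this page. The function name length_matches_code is hypothetical, and the pattern follows the question's description (a single-digit length and an alphanumeric code) rather than the \d+ groups used in the Java answer.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: true when token looks like x%c, where x is a single
// digit and c is an alphanumeric code of exactly x characters.
bool length_matches_code(const std::string& token) {
    static const std::regex pattern(R"((\d)%([A-Za-z0-9]+))");
    std::smatch m;
    // regex_match requires the whole token to fit the pattern.
    if (!std::regex_match(token, m, pattern)) {
        return false;
    }
    assert(m.size() == 3);  // whole match plus the two capture groups
    // Group 1 is exactly one digit, so it converts without a parsing call.
    const std::size_t declared = static_cast<std::size_t>(m[1].str()[0] - '0');
    return declared == static_cast<std::size_t>(m[2].length());
}
```

The regex only checks the shape of the token; the length comparison happens in ordinary code, because a quantifier such as {\1} cannot take its count from a captured group.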

c++11 (MSVS2012) regex looking for file names in multiple line std::string

I have been trying to search for a clear answer on this one, but not been able to find it.
So let's say I have the string below (where \n could be \r\n; I want to handle both, and I am not sure whether that is relevant):
"4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54"
Then I want to get matches:
a_file_123.xml
a_file_j34.xml
Here is my test code:
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
std::smatch matches;
if (std::regex_search(s, matches, std::regex("a_file_(.*)\\.xml")))
{
std::cout << "total: " << matches.size() << std::endl;
for (unsigned int i = 0; i < matches.size(); i++)
{
std::cout << "match: " << matches[i] << std::endl;
}
}
Output is:
total: 2
match: a_file_123.xml
match: 123
I don't quite understand why match 2 is just "123"...
You only have one match, not two, as the regex_search method returns a single match. What you printed is two group values, Group 0 (the whole match, a_file_123.xml here) and Group 1 (the capturing group value, here, 123 that is a substring captured with a capturing group you defined as (.*) in the pattern).
If you want to match multiple strings, you need to use the regex iterator, not just a regex_search that only returns the first match.
Besides, .* is too greedy and will return weird results if you have more than 1 match on the same line. It seems you want to match letters or digits, so .* can be replaced with \w+. If there can really be anything, use the lazy .*? instead.
Use
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
const std::regex rx("a_file_\\w+\\.xml");
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), rx),
std::sregex_token_iterator());
std::cout << "Number of matches: " << results.size() << std::endl;
for (auto result : results)
{
std::cout << result << std::endl;
}
See the C++ demo yielding
Number of matches: 2
a_file_123.xml
a_file_j34.xml
Notes on regex
a_file_ - a literal substring
\\w+ - 1+ word chars (letters, digits, _) (note you may use [^.]*? here instead of \\w+ if you want to match any char other than a dot, 0 or more repetitions, as few as possible, up to the first .xml)
\\. - a dot (if you do not escape it, it will match any char except line break chars)
xml - a literal substring.
See the regex demo
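If the captured part of each file name (123 and j34 here) is needed as well, std::sregex_iterator gives access to every match together with its groups. The following sketch assumes a capturing group is put back around \w+; the function name find_xml_files is made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns one (whole match, Group 1) pair per file name.
std::vector<std::pair<std::string, std::string>> find_xml_files(const std::string& text) {
    static const std::regex rx(R"(a_file_(\w+)\.xml)");
    std::vector<std::pair<std::string, std::string>> found;
    // Each iterator step is one match; str(0) is the whole match and
    // str(1) is the part captured by (\w+).
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it) {
        assert(it->size() == 2);  // whole match plus one capture group
        found.emplace_back(it->str(0), it->str(1));
    }
    return found;
}
```

For the string from the question this yields two pairs, a_file_123.xml with 123 and a_file_j34.xml with j34. The pattern itself contains nothing that depends on the line ending, so \n and \r\n input should behave the same.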

Using RegEx to filter wrong Input?

Look at this example:
string str = "January 19934";
The Outcome should be
Jan 1993
I think I have created the right RegEx, ([A-z]{3}).*([\d]{4}), to use in this case, but I do not know what I should do now.
How can I extract what I am looking for using RegEx? Is there a way to receive 2 variables, the first one being the result of the first RegEx bracket, ([A-z]{3}), and the second one being the result of the 2nd bracket, ([\d]{4})?
Your regex contains a common typo: [A-z] matches more than just ASCII letters. Also, the .* will grab the whole string up to its end, and backtracking will force \d{4} to match the last 4 digits. You need to use a lazy quantifier with the dot, *?.
Then, use regex_search and concat the 2 group values:
#include <regex>
#include <sstream>
#include <string>
#include <iostream>
using namespace std;
int main() {
regex r("([A-Za-z]{3}).*?([0-9]{4})");
string s("January 19934");
smatch match;
std::stringstream res("");
if (regex_search(s, match, r)) {
res << match.str(1) << " " << match.str(2);
}
cout << res.str(); // => Jan 1993
return 0;
}
See the C++ demo
Pattern explanation:
([A-Za-z]{3}) - Group 1: three ASCII letters
.*? - any 0+ chars other than line break symbols as few as possible
([0-9]{4}) - Group 2: 4 digits
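To see the difference between the greedy and the lazy dot on the sample input, the two patterns can be run side by side. This is only a sketch; the helper name month_and_year and its lazy flag are assumptions made for the comparison.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns "<Group 1> <Group 2>", or an empty string
// when there is no match. The lazy flag switches between .*? and .*
std::string month_and_year(const std::string& input, bool lazy) {
    const std::regex rx(lazy ? "([A-Za-z]{3}).*?([0-9]{4})"
                             : "([A-Za-z]{3}).*([0-9]{4})");
    std::smatch m;
    if (!std::regex_search(input, m, rx)) {
        return "";
    }
    assert(m.size() == 3);  // whole match plus the two capture groups
    return m.str(1) + " " + m.str(2);
}
```

For "January 19934" the lazy version returns "Jan 1993", while the greedy version returns "Jan 9934", because .* first runs to the end of the string and backtracks only far enough to leave the last four digits for Group 2.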
This could work.
([A-Za-z]{3})([a-z ])+([\d]{4})
Note the space after a-z is important to catch space.