Regex on numbers and spaces - regex

I'm trying to match numbers surrounded by spaces, like this string:
" 1 2 3 "
I'm puzzled why the regex \s[0-9]\s matches 1 and 3 but not 2. Why does this happen?

Because the space has already been consumed:
\s[0-9]\s
This matches "space, digit, space", so let's go through the process:
" 1 2 3 "
 ^
 |
 Matches at the very first character, consume " 1 "
"2 3 "
 ^
 |
 No match, "2" is not whitespace
"2 3 "
  ^
  |
  Matches, consume " 3 "
You want a lookaround:
(?<=\s)\d(?=\s)
This is very different, as it looks for \d and then asserts that it is preceded by, and followed by, a space. These assertions are "zero width", which means that the spaces aren't consumed by the engine.
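To see the difference concretely, here is a minimal sketch in Python. The language is an assumption, since the question does not name one; Python's re module supports both lookahead and fixed-width lookbehind. The variable names are only illustrative.

```python
import re

text = " 1 2 3 "

# Consuming version: each match eats its surrounding whitespace, so the
# space after "1" is no longer available to start a match around "2".
consuming = re.findall(r"\s[0-9]\s", text)

# Lookaround version: the whitespace is only asserted, never consumed.
lookaround = re.findall(r"(?<=\s)\d(?=\s)", text)

print(consuming)   # [' 1 ', ' 3 ']
print(lookaround)  # ['1', '2', '3']
```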

More precisely, the regex \s[0-9]\s does not match 2 only when you go through all matches in the string " 1 2 3 " one by one. If you were to try to start matching at positions 1 or 2, " 2 " would be matched.
The reason for this is that \s consumes part of the input, namely the spaces around the digit. When you match " 1 ", the space between 1 and 2 is already taken; the regex engine is now looking at the tail of the string, which is "2 3 ". At this point, there is no space in front of 2 that the engine could consume, so it goes straight to finding " 3 ".
To fix this, put spaces into zero-length look-arounds, like this:
(?<=\s)[0-9](?=\s)
Now the engine ensures that there are spaces in front and behind the digit without consuming these spaces as part of the match. This lets the engine treat the space between 1 and 2 as a space behind 1 and also as a space in front of 2, thus returning both matches.
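The point about start positions can be checked with a small sketch, assuming Python's re module, where a compiled pattern's search method accepts a start index as its second argument. Indexes here are 0-based.

```python
import re

pattern = re.compile(r"\s[0-9]\s")
text = " 1 2 3 "

# Scanning from the beginning finds " 1 " first.
from_start = pattern.search(text)

# Scanning from index 2 (the space between 1 and 2) finds " 2 ",
# because that space has not been consumed by an earlier match.
from_two = pattern.search(text, 2)

print(repr(from_start.group()), from_start.span())  # ' 1 ' (0, 3)
print(repr(from_two.group()), from_two.span())      # ' 2 ' (2, 5)
```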

The whitespace is consumed by the first match, so the subsequent matches won't match. You can use a lookahead to fix this:
\s+\d+(?=\s+)

The expression \s[0-9]\s matches " 1 " and " 3 ". As the space after the 1 is matched, it can't also be used to match " 2 ".
You can use a positive lookbehind and a positive lookahead to match digits that are surrounded by spaces:
(?<= )(\d+)(?= )
Demo: https://regex101.com/r/hT1dT6/1
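One practical difference between this pattern and the single-digit lookaround versions is that \d+ also picks up multi-digit numbers. A minimal sketch, assuming Python's re module and a made-up input string:

```python
import re

text = " 12 3 45 "  # hypothetical input with multi-digit numbers

# \d allows exactly one digit between the two asserted spaces.
single_digit = re.findall(r"(?<=\s)\d(?=\s)", text)

# \d+ allows a run of digits; findall returns the contents of group 1.
multi_digit = re.findall(r"(?<= )(\d+)(?= )", text)

print(single_digit)  # ['3']
print(multi_digit)   # ['12', '3', '45']
```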

Related

Regex- finding a string after the first " / " till end of line or next " / "

I would like to extract data after the first occurrence of " / " until the end of line or the next instance of " / "
For example:
Test1 / Test 2 / Test 3
Test1 / Test2
Test 1 / Test2 / Test3
Test 1 / Test 2
Would return the output:
Test 2
Test2
Test2
Test 2
So far I have come up with (?:[^/\n]+/\s){1}(.*?(?:$|\s/)) which returns the results (see regex101 demo):
Test 2 /
Test2
Test2 /
Test 2
I am still learning regex and having a tough time figuring out how to exclude " /" from line 1 and 3 of results so any help or guidance on this would be appreciated.
You can use
^[^\/\n]*\/\s*([^\/\n]*?)(?=\s*(?:\/|$))
See the regex demo. When dealing with standalone strings, feel free to remove \n from the pattern.
Details:
^ - start of string
[^\/\n]* - zero or more chars other than a newline and /
\/ - a / char
\s* - zero or more whitespaces
([^\/\n]*?) - Group 1: any zero or more chars other than / and newline as few as possible
(?=\s*(?:\/|$)) - a positive lookahead that requires a / or end of string after zero or more whitespaces immediately to the right of the current location.
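As a quick check of the pattern above against the sample lines, here is a sketch in Python. The language is an assumption, as the question does not name one; re.MULTILINE is used so that ^ and $ apply to each line of the multi-line input.

```python
import re

lines = """Test1 / Test 2 / Test 3
Test1 / Test2
Test 1 / Test2 / Test3
Test 1 / Test 2"""

pattern = re.compile(r"^[^\/\n]*\/\s*([^\/\n]*?)(?=\s*(?:\/|$))", re.MULTILINE)

# With a single capturing group, findall returns the group 1 values.
results = pattern.findall(lines)
print(results)  # ['Test 2', 'Test2', 'Test2', 'Test 2']
```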

RegEx select all between two character

Example:
I want to extract everything between "Item:" until " * "
Item: *Sofa (1 SET), 2 × Mattress, 3 × Baby Mattress, 5
Seaters Car (Fabric)*
Total price: 100.00
Subtotal: 989.00
But I only managed to extract "Item: *" and " Seaters Car (Fabric)* " by using (.*?)\*
After matching Item:, match anything but a colon with [^:]+, and then lookahead for a newline, ensuring that the match ends at the end of a line just before another label (like Total price:) starts:
Item: ([^:]+)(?=\n)
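A small sketch of how this behaves on the sample, assuming Python's re module and assuming each line of the input ends with a newline. Because [^:]+ is greedy, it first runs up to the colon after "Total price" and then backtracks to the last newline before it.

```python
import re

text = (
    "Item: *Sofa (1 SET), 2 × Mattress, 3 × Baby Mattress, 5\n"
    "Seaters Car (Fabric)*\n"
    "Total price: 100.00\n"
    "Subtotal: 989.00\n"
)

match = re.search(r"Item: ([^:]+)(?=\n)", text)
item = match.group(1)

# The captured text spans both lines of the item, including the line break.
print(item)
```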

C++11 Regex search - Exclude empty submatches

From the following text I want to extract the number and the unit of measurement.
I have 2 possible cases:
This is some text 14.56 kg and some other text
or
This is some text kg 14.56 and some other text
I used | to match both cases.
My problem is that it produces empty submatches, which gives me an incorrect number of matches.
This is my code:
std::smatch m;
std::string myString = "This is some text kg 14.56 and some other text";
const std::regex myRegex(
    R"(([\d]{0,4}[\.,]*[\d]{1,6})\s+(kilograms?|kg|kilos?)|\s+(kilograms?|kg|kilos?)(\s+[\d]{0,4}[\.,]*[\d]{1,6}))",
    std::regex_constants::icase
);
if (std::regex_search(myString, m, myRegex)) {
    std::cout << "Size: " << m.size() << std::endl;
    for (std::size_t i = 0; i < m.size(); i++)
        std::cout << m[i].str() << std::endl;
}
else
    std::cout << "Not found!\n";
OUTPUT:
Size: 5
kg 14.56
kg
14.56
I want an easy way to extract those 2 values, so my guess is that I want the following output:
WANTED OUTPUT:
Size: 3
kg 14.56
kg
14.56
This way I can always directly extract the 2nd and 3rd submatches, but in this case I would also need to check which one is the number. I know how to do it with 2 separate searches, but I want to do it the right way, with a single search, without using C++ to check whether a submatch is an empty string.
Using this regex, you just need the contents of Group 1 and Group 2
((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))\s*((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
Explanation:
((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
(?:kilograms?|kilos?|kg) - matches kilograms or kilogram or kilos or kilo or kg
| - OR
(?:\d{0,4}(?:\.\d{1,6})) - matches 0 to 4 digits followed by a dot and a decimal part of 1 to 6 digits
\s* - matches 0+ whitespaces
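To illustrate what Group 1 and Group 2 contain for each input, here is a sketch in Python rather than C++ (an assumption made for brevity; the UNIT and NUMBER names are only there to assemble the same pattern readably). Note that the groups come back in the order they appear in the text, so which one holds the number depends on the input.

```python
import re

UNIT = r"(?:kilograms?|kilos?|kg)"
NUMBER = r"(?:\d{0,4}(?:\.\d{1,6}))"

# Same pattern as above, built from the two pieces.
pattern = re.compile(rf"({UNIT}|{NUMBER})\s*({UNIT}|{NUMBER})", re.IGNORECASE)

number_first = pattern.search("This is some text 14.56 kg and some other text")
unit_first = pattern.search("This is some text kg 14.56 and some other text")

print(number_first.groups())  # ('14.56', 'kg')
print(unit_first.groups())    # ('kg', '14.56')
```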
You can try this out:
((?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)))
As shown here: https://regex101.com/r/9O99Fz/3
USAGE -
As I've shown in the 'substitution' section, to reference the numeral part of the quantity, you have to write $2$5, and for the unit, write: $3$4
Explanation -
There are two alternatives we could possibly need: the first one, (?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg))), matches the number followed by the unit,
and the other, (?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)), matches the unit followed by the number.
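The $2$5 and $3$4 idea can be sketched outside a substitution as well: only one alternative participates in a match, so exactly one group of each pair is set. The following uses Python instead of C++ (an assumption, chosen because its re module accepts the lookbehind used here); extract is a hypothetical helper name.

```python
import re

pattern = re.compile(
    r"((?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))"
    r"|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)))",
    re.IGNORECASE,
)

def extract(text):
    """Return (number, unit), or None if nothing matches (hypothetical helper)."""
    m = pattern.search(text)
    if m is None:
        return None
    number = m.group(2) or m.group(5)  # groups 2 and 5 hold the number
    unit = m.group(3) or m.group(4)    # groups 3 and 4 hold the unit
    return number, unit

print(extract("This is some text 14.56 kg and some other text"))  # ('14.56', 'kg')
print(extract("This is some text kg 14.56 and some other text"))  # ('14.56', 'kg')
```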

R Regex number followed by punctuation followed by space

Suppose I had a string like so:
x <- "i2: 32390. 2093.32: "
How would I return a vector that would give me the positions of where a number is followed by a : or a . followed by a space?
So for this string it would be
"2: ","0. ","2: "
The regex you need is just '\\d[\\.:]\\s'. Using stringr's str_extract_all to quickly extract matches:
library(stringr)
str_extract_all("i2: 32390. 2093.32: ", '\\d[\\.:]\\s')
produces
[[1]]
[1] "2: " "0. " "2: "
You can use it with R's built-in functions, and it should work fine, as well.
What it matches:
\\d matches a digit
[ ... ] sets up a character class, i.e. a set of characters to match
\\. matches a period
: matches a colon
\\s matches a whitespace character
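Since the question asks for positions as well as the matched text, here is a sketch of the same pattern in Python. The language is an assumption, used only to illustrate the regex; in a Python raw string the backslashes are not doubled. Note that Python indexes are 0-based, whereas R counts from 1.

```python
import re

x = "i2: 32390. 2093.32: "
pattern = re.compile(r"\d[.:]\s")

# finditer yields match objects, which carry both the text and the offset.
matches = [m.group() for m in pattern.finditer(x)]
starts = [m.start() for m in pattern.finditer(x)]

print(matches)  # ['2: ', '0. ', '2: ']
print(starts)   # [1, 8, 17]
```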

Regex Matching optional numbers

I have a text file that is currently parsed with a regular expression, and it's working well. The file format is well defined: 2 numbers, separated by any whitespace, followed by an optional comment.
Now, we have a need to add an additional (but optional) 3rd number to this file, making the format 2 or 3 numbers separated by whitespace, with an optional comment.
I've got a regex object that at least matches all the necessary line formats, but I am not having any luck with actually capturing the 3rd (optional) number even if it is present.
Code:
#include <iostream>
#include <regex>
#include <vector>
#include <string>
#include <cassert>
using namespace std;
bool regex_check(const std::string& in)
{
    std::regex check{
        "[[:space:]]*?" // eat leading spaces
        "([[:digit:]]+)" // capture 1st number
        "[[:space:]]*?" // eat second set of spaces
        "([[:digit:]]+)" // capture 2nd number
        "[[:space:]]*?" // eat more spaces
        "([[:digit:]]+|[[:space:]]*?)" // optionally, capture 3rd number
        "!*?" // Anything after '!' is a comment
        ".*?" // eat rest of line
    };
    std::smatch match;
    bool result = std::regex_match(in, match, check);
    for(auto m : match)
    {
        std::cout << " [" << m << "]\n";
    }
    return result;
}
int main()
{
    std::vector<std::string> to_check{
        " 12 3",
        " 1 2 ",
        " 12 3 !comment",
        " 1 2 !comment ",
        "\t1\t1",
        "\t 1\t 1\t !comment \t",
        " 16653 2 1",
        " 16654 2 1 ",
        " 16654 2 1 ! comment",
        "\t16654\t\t2\t 1\t ! comment\t\t",
    };
    for(auto s : to_check)
    {
        assert(regex_check(s));
    }
    return 0;
}
This gives the following output:
[ 12 3]
[12]
[3]
[]
[ 1 2 ]
[1]
[2]
[]
[ 12 3 !comment]
[12]
[3]
[]
[ 1 2 !comment ]
[1]
[2]
[]
[ 1 1]
[1]
[1]
[]
[ 1 1 !comment ]
[1]
[1]
[]
[ 16653 2 1]
[16653]
[2]
[]
[ 16654 2 1 ]
[16654]
[2]
[]
[ 16654 2 1 ! comment]
[16654]
[2]
[]
[ 16654 2 1 ! comment ]
[16654]
[2]
[]
As you can see, it's matching all of the expected input formats, but never is able to actually capture the 3rd number, even if it is present.
I'm currently testing this with GCC 5.1.1, but the actual target compiler will be GCC 4.8.2, using boost::regex instead of std::regex.
Let's do a step-by-step processing on the following example.
16653 2 1
^
^ is the currently matched offset. At this point, we're here in the pattern:
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
^
(I've simplified [[:space:]] to \s and [[:digit:]] to \d for brevity.)
\s*? matches, and then (\d+) matches. We end up in the following state:
16653 2 1
     ^
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
         ^
Same thing: \s*? matches, and then (\d+) matches. The state is:
16653 2 1
       ^
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  ^
Now, things get trickier.
You have a \s*? here, a lazy quantifier. The engine tries to not match anything, and sees if the rest of the pattern will match. So it tries the alternation.
The first alternative is \d+, but it fails, since you don't have a digit at this position.
The second alternative is \s*?, and there are no other alternatives after that. It's lazy, so let's try to match the empty string first.
The next token is !*?, but it also matches the empty string, and it is then followed by .*?, which will match everything up to the end of the string (it does so because you're using regex_match - it would have matched the empty string with regex_search).
At this point, you've reached the end of the pattern successfully, and you got a match, without being forced to match \d+ against the string.
The thing is, this whole part of the pattern ends up being optional:
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  \__________________/
So, what can you do? You can rewrite your pattern like so:
\s*?(\d+)\s+(\d+)(?:\s+(\d+))?\s*(?:!.*)?
Demo (with added anchors to mimic regex_match behavior)
This way, you're forcing the regex engine to consider \d and not get away with lazy-matching on the empty string. No need for lazy quantifiers since \s and \d are disjoint.
!*?.*? was also suboptimal, since !*? is already covered by the following .*?. I rewrote it as (?:!.*)? to require a ! at the start of a comment; if it's not there, the match will fail.
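The difference between the two patterns can be reproduced with a short sketch. It uses Python's re.fullmatch as a stand-in for regex_match, which is an assumption: the question targets std::regex and boost::regex, and while the backtracking behaviour described above is the same here, the engines are not identical. A group that did not participate in the match shows up as None in Python.

```python
import re

original = re.compile(r"\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?")
rewritten = re.compile(r"\s*?(\d+)\s+(\d+)(?:\s+(\d+))?\s*(?:!.*)?")

# A few of the inputs from the question.
samples = [
    " 12 3",
    " 1 2 !comment ",
    " 16653 2 1",
    "\t16654\t\t2\t 1\t ! comment\t\t",
]

old_groups = [original.fullmatch(s).groups() for s in samples]
new_groups = [rewritten.fullmatch(s).groups() for s in samples]

for s, old, new in zip(samples, old_groups, new_groups):
    # The original always leaves group 3 empty; the rewrite fills it
    # whenever a third number is present.
    print(repr(s), old, new)
```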