Regex_match doesn't match in a file - regex

I want to recognize some lines in a text file using regex, but regex_match doesn't match any line, even when I use regex patron(".*"):
string dirin = "/home/user/in.srt";
string dirout = "/home/user/out.srt";
ifstream in(dirin.c_str());
ofstream out(dirout.c_str());
string line;
// regex patron("(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s-->\\s(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})");
regex patron(".*");
smatch m;
while (getline(in, line)) {
if (regex_match(line, m, patron)) {
out << "ok";
};
out << line;
}
in.close();
out.close();
The code always prints line to the out.srt file, but never the string "ok" from inside the if (regex_match(line, m, patron)).
I'm testing it with the following lines:
1
00:01:00,708 --> 00:01:01,800
You look at that river
2
00:01:02,977 --> 00:01:04,706
gently flowing by.
3
00:01:06,213 --> 00:01:08,238
You notice the leaves

Note that when the file uses CRLF line endings and they are not translated on reading, getline() returns each line with a trailing carriage return (CR, \r). In the ECMAScript grammar, the . pattern does not match CR, because CR is treated as a line terminator.
regex_match requires that the whole string matches the pattern.
Thus, you need to account for an optional carriage return at the end of the pattern. You can do it by appending \r? or \s* at the end of the pattern:
regex patron("(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s-->\\s(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s*");
or
regex patron(".*\\s*");
Also, consider using raw string literals if your C++ version allows it:
regex patron(R"((\d{2}):(\d{2}):(\d{2}),(\d{3})\s-->\s(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*)");

Related

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
And I have such sample below.
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
std::cout << match.size() << "match: " << match[0] << '\n';
s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why the first occurrence, 1mySymbol1, also matches my regex.
How to create a proper regex that will ignore such strings?
UPD
If I do like this
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I find only the substring in the middle
1match: , /_mySymbol_
and don't find the substrings at the beginning and at the end.
The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
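Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). With this approach the match won't include the characters before/after mysymbol. Note, however, that the ECMAScript grammar used by std::regex supports lookahead but not lookbehind, so this pattern needs an engine that has lookbehind (for example Boost.Regex or PCRE). With std::regex, keep the (?:[^a-zA-Z0-9]+|^) prefix from the first option and put mysymbol in a capturing group.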
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.

Using Regex to remove leading/trailing whitespaces, except for quotes

I am trying to write a regular expression which recognises whitespaces from a user input string, except for between quotation marks ("..."). For example, if the user enters
   #load     "my folder/my files/    program.prog"     ;
I want my regex substitution to transform this into
#load "my folder/my files/    program.prog" ;
So far I've implemented the following.
#include <iostream>
#include <string>
#include <regex>
int main(){
// Variables for user input
std::string input_line;
std::string program;
// User prompt
std::cout << ">>> ";
std::getline(std::cin, input_line);
// Remove leading/trailing whitespaces
input_line = std::regex_replace(input_line, std::regex("^ +| +$|( ) +"), "$1");
// Check result
std::cout << input_line << std::endl;
return 0;
}
But this removes whitespaces between quotes too. Is there any way I can use regex to ignore spaces between quotes?
You may add another alternative to match and capture double quoted string literals and re-insert it into the result with another backreference:
input_line = std::regex_replace(
input_line,
std::regex(R"(^ +| +$|(\"[^\"\\]*(?:\\[\s\S][^\"\\]*)*\")|( ) +)"),
"$1$2");
The "[^"\\]*(?:\\[\s\S][^"\\]*)*" part matches a ", then 0+ chars other than \ and ", then 0 or more occurrences of an escaped char (\ followed by any char, matched with [\s\S]) followed by 0+ chars other than \ and ", and finally the closing ".
Note I used a raw string literal R"(...)" to avoid having to escape regex escape backslashes (R"([\s\S])" = "[\\s\\S]").

Remove lines beginning with the same semi-colon delimited part with regex

I would like to use Notepad++ to remove lines with duplicate beginning of line. For example, I have a semi-colon separated file like below:
string at the beginning of line 1;second string line 1; final string line1;
string at the beginning of line 2;second string line 2; final string line2;
string at the beginning of line 1;second string line 3; final string line3;
string at the beginning of line 1;second string line 4; final string line4;
I would like to remove the third and fourth lines as they have the same first substring as the first line and get the following result:
string at the beginning of line 1;second string line 1; final string line1;
string at the beginning of line 2;second string line 2; final string line2;
You can try using the following regex:
^(([^;]*;).*\R(?:.*\R)*?)\2.*
Or
^(([^;]*;).*\R(?:.*\R)*?)\2.*(?:$|\R)
And replace with $1.
The idea is to capture the first field of a line, i.e. the non-semicolon characters up to and including the first ; (group 2, ([^;]*;)), then match the rest of that line (.*\R) and as few following lines as possible ((?:.*\R)*?) until a line that starts with the same text as group 2 (\2); that duplicate line is matched to its end. Everything before the duplicate line is captured in group 1, so replacing with $1 removes only the duplicate line.
The drawback is that you will have to click Replace All several times until no match is found.
Thanks go to @nhahtdh, who noticed a bug in my previous regex, ^(([^;]*).*\R(?:.*\R)*?)\2.*, which can overfire.

C++ Regex getting all matches on line

When reading line by line, I call this function on each line, looking for function calls (names). I use this function to match any valid characters (a-z, 0-9 and _) followed by '('. My problem is that I do not fully understand C++-style regex and how to get it to look through the entire line for possible matches. This regex is simple and straightforward, but it does not work as expected; I'm learning that this is the C++ norm.
void readCallbacks(const std::string lines)
{
std::string regxString = "[a-z0-9]+\(";
regex regx(regxString, std::regex_constants::icase);
smatch result;
if(regex_search(lines.begin(), lines.end(), result, regx, std::regex_constants::match_not_bol))
{
cout << result.str() << "\n";
}
}
You need to escape the backslash or use a raw string literal:
std::regex pattern("[a-z0-9]+\\(", std::regex_constants::icase); // "\\(" in the literal gives the regex \(
or
std::regex pattern(R"([a-z0-9]+\()", std::regex_constants::icase); // R"( and )" are the raw string delimiters
Also, your character range doesn't contain the desired underscore (_).

Tokenizing boost::regex matches

I have created a regex to match the lines of a file which have the following structure: string int int
int main()
{
std::string line;
boost::regex pat("\\w\\s\\d\\s\\d");
while (std::cin)
{
std::getline(std::cin, line);
boost::smatch matches;
if (boost::regex_match(line, matches, pat))
std::cout << matches[2] << std::endl;
}
}
I would like to save those numbers into a pair<string,pair<int,int>>. How can I tokenize the match of the boost::regex to achieve this?
First of all, your regular expression accepts "one word character, then one whitespace character, then one digit, then one whitespace character, then one digit"; I bet this is not what you want. Most probably you want your expression to look like:
\w+\s+\d+\s+\d+
where \w+ now means "one or more word characters". If you are sure that there is only one whitespace character between tokens, you can omit the + after \s, but keeping it is safer.
Then you need to select parts of your expression. These are called sub-expressions in RE:
(\w+)\s+(\d+)\s+(\d+)
This way, what is matched by (\w+) (one or more word characters) will be in matches[1], the first (\d+) in matches[2] and the second (\d+) in matches[3]. Of course, you need to double each \ when you put the expression in a C++ string literal.