C++ regex to search file paths in a string - c++

I'm trying to parse strings that can contain file paths.
I'm using C++ with the regex library. I'm not very good with regex; here it's the ECMAScript grammar.
I don't know why the string:
"C:\Windows\explorer.exe C:\titi\toto.exe"
doesn't match the pattern the way I expect (it actually only finds the first path):
(?:[a-zA-Z]\:|\\)(?:\\[a-z_\-\s0-9]+)+
Do you have a better idea for finding every match?
Thanks!
Here's my code:
wsmatch matches;
regex_constants::match_flag_type fl = regex_constants::match_default ;
regex_constants::syntax_option_type st = regex_constants::icase //Case insensitive
| regex_constants::ECMAScript
| regex_constants::optimize;
wregex pattern(L"(?:[a-zA-Z]\\:|\\\\)(?:\\\\[a-z_\\-\\s0-9]+)+", st);
// Look if matches pattern
printf("--> %ws\n", path.c_str());
if (regex_search(path, matches, pattern, fl)
&& matches.size() > 0)
{
for (u_int i = 0 ; i < matches.size() ; i++)
{
wssub_match sub_match = matches[i];
wstring sub_match_str = sub_match.str();
printf("%ws\n", sub_match_str.c_str());
}
}

You could use something like this:
.?:(\\[a-zA-Z 0-9]*)*.[a-zA-Z]*
I tested it with http://regexpal.com/ and it extracts all file paths.

Although the regex provided by @mspoerr works for the example in the question, it didn't hold up for me in more complex scenarios, so I wrote my own.
Regex:
(\w:)?([\\\w\s0-9_]*)\.\w+
Advanced test string:
C:\Wi ndows\explorer.exe asdasds
: ad C:\titi\toto.Heexe
HELLOO : qwefqwfqwf c:\aa.
(it matches only two valid file paths)

Related

RegEx to select everything After search string, excluding search string [duplicate]

I'm new to using regex. I've been going through a rake of tutorials, but I haven't found one that applies to what I want to do.
I want to search for something and return everything following it, but not the search string itself.
e.g. "Some lame sentence that is awesome"
search for "sentence"
return "that is awesome"
Any help would be much appreciated
This is my regex so far
sentence(.*)
but it returns: sentence that is awesome
Pattern pattern = Pattern.compile("sentence(.*)");
Matcher matcher = pattern.matcher("some lame sentence that is awesome");
boolean found = false;
while (matcher.find())
{
System.out.println("I found the text: " + matcher.group().toString());
found = true;
}
if (!found)
{
System.out.println("I didn't find the text");
}
You can do this with "just the regular expression" as you asked for in a comment:
(?<=sentence).*
(?<=sentence) is a positive lookbehind assertion. This matches at a certain position in the string, namely at a position right after the text sentence without making that text itself part of the match. Consequently, (?<=sentence).* will match any text after sentence.
This is quite a nice feature of regex. However, in Java this will only work for finite-length subexpressions, i.e. (?<=sentence|word|(foo){1,4}) is legal, but (?<=sentence\s*) isn't.
Your regex "sentence(.*)" is right. To retrieve the contents of the group in parentheses, you would call:
Pattern p = Pattern.compile( "sentence(.*)" );
Matcher m = p.matcher( "some lame sentence that is awesome" );
if ( m.find() ) {
String s = m.group(1); // " that is awesome"
}
Note the use of m.find() in this case (it attempts to find a match anywhere in the string) and not m.matches() (which would fail because of the prefix "some lame"; in that case the regex would need to be ".*sentence(.*)").
If the Matcher is initialized with str, you can get the part after the match with:
str.substring(matcher.end())
Sample Code:
final String str = "Some lame sentence that is awesome";
final Matcher matcher = Pattern.compile("sentence").matcher(str);
if(matcher.find()){
System.out.println(str.substring(matcher.end()).trim());
}
Output:
that is awesome
You need to use the group(int) of your matcher - group(0) is the entire match, and group(1) is the first group you marked. In the example you specify, group(1) is what comes after "sentence".
You just need to put "group(1)" instead of "group()" in the following line and the return will be the one you expected:
System.out.println("I found the text: " + matcher.group(1).toString());

c++ regex get folder from a file path

I have a file name like this
/mnt/opt/storage/ssd/subtitles/8/vtt/2011022669-5126858992107.vtt
how to replace the file name with * using regex so I get
/mnt/opt/storage/ssd/subtitles/8/vtt/*?
I know the simple for loop split or boost::filesystem approach, I'm looking for a regex_replace approach.
You don't need regexp for this:
string str = "/mnt/opt/storage/ssd/subtitles/8/vtt/2011022669-5126858992107.vtt";
auto lastSlash = str.find_last_of('/');
str.replace(str.begin() + lastSlash + 1, str.end(), "*");
Try this pattern:
(([\w+\-])+)(?=(\.\w{3}))
Tested in Notepad++.
(?=...) is a lookahead, so ([\w+\-])+ matches only if it is followed by an extension of the form .xxx (\.\w{3}; use \.\w{2,3} to also accept .xx).
In C++ you then just replace the group with *, something like
replace(string, $1, '*') -- I don't know the C++ replace function, so this is only an assumption.
$1, $2, $3... are group numbers; in this case $1 is (([\w+\-])+).
Below is a solution with regex_replace:
std::string path = "/mnt/opt/storage/ssd/subtitles/8/vtt/2011022669-5126858992107.vtt";
std::regex re(R"(\/[^\/]*?\..+$)");
std::cout << path << '\n';
std::cout << std::regex_replace(path, re, "/*") << '\n';
outputs:
/mnt/opt/storage/ssd/subtitles/8/vtt/2011022669-5126858992107.vtt
/mnt/opt/storage/ssd/subtitles/8/vtt/*
That said, a regex seems a bit heavyweight for such a simple replacement.

Generate regex expression from series of input

Is it somehow possible to generate a regular expression from a series of inputs?
I am not sure if this is even possible, hence I am posting this question here.
Is there any tool or website that does this?
Update:
Say I enter inputs like
www.google.com
google.com
http://www.google.com
It should somehow give me a regular expression that accepts this type of input. Is this possible?
For your URL Example, here's something that I just threw together in C#. I think it'll help you out.
// Input "pattern" should consist of a string with ONLY the following tags:
// <protocol> <web> <website> <DomainExtension> <RestOfPath>
// Ex) GenerateRegexFor("<protocol><web><website><domainextension>") will match http://www.google.com
public string GenerateRegexFor(string pattern)
{
string regex = ProcessNextPart(pattern, "");
return regex;
}
public string ProcessNextPart(string pattern, string regex)
{
pattern = pattern.ToLower();
if (pattern.ToLower().StartsWith("<protocol>"))
{
regex += @"[a-zA-Z]+://";
pattern = pattern.Replace("<protocol>", "");
}
else if (pattern.ToLower().StartsWith("<web>"))
{
regex += @"www\d?"; //\d? in case of www2
pattern = pattern.Replace("<web>", "");
}
else if (pattern.ToLower().StartsWith("<website>"))
{
regex += @"([a-zA-Z0-9\-]*\.)+";
pattern = pattern.Replace("<website>", "");
}
else if (pattern.ToLower().StartsWith("<domainextension>"))
{
regex += "[a-zA-Z]{2,}";
pattern = pattern.Replace("<domainextension>", "");
}
else if (pattern.ToLower().StartsWith("<restofpath>"))
{
regex += @"(/[a-zA-Z0-9\-]*)*(\.[a-zA-Z]*/?)?";
pattern = pattern.Replace("<restofpath>", "");
}
if (pattern.Length > 0 && pattern != "")
return ProcessNextPart(pattern, regex);
return regex;
}
Depending on the style of URL you'd like to match, I think this should match just about anything and everything. You may want to make it a little more picky if there will be text that is similar to URLs but not URLs.
You'd use it like this:
//to match something like "www.google.com/images/whatever"
// \
// \ |www||.google.||----com------||/images/whatever
// \ | | | |
// \/ V V V V
string regex = GenerateRegexFor("<web><website><domainextension><restofpath>");
//to match something like "http://www.google.com/images/whatever"
string regex = GenerateRegexFor("<protocol><web><website><domainextension><restofpath>");
You can use any of those tags, in any order (though some of them wouldn't make much sense). Feel free to build on this, too. You could add as many tags as you wanted for it to represent any number of patterns.
Oh, and +1 for giving me something to do at work.

Conditionally replace regex matches in string

I am trying to replace certain patterns in a string with different replacement patterns.
Example:
string test = "test replacing \"these characters\"";
What I want to do is replace all ' ' with '_' and all other non-alphanumeric characters with an empty string. I have created the following regex and it seems to tokenize correctly, but I am not sure how to (if possible) perform a conditional replace using regex_replace.
string test = "test replacing \"these characters\"";
regex reg("(\\s+)|(\\W+)");
expected result after replace would be:
string result = "test_replacing_these_characters";
EDIT:
I cannot use boost, which is why I left it out of the tags. So please no answer that includes boost. I have to do this with the standard library. It may be that a different regex would accomplish the goal or that I am just stuck doing two passes.
EDIT2:
I did not remember what characters were included in \w at the time of my original regex, after looking it up I have further simplified the expression. Again the goal is anything matching \s+ should be replaced with '_' and anything matching \W+ should be replaced with empty string.
The C++ (0x, 11, TR1) regular expressions do not really work in every case (for GCC, look up "regex" on its library status page), so it is better to use Boost for the time being.
You can check whether your compiler supports the regular expressions needed:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main(int argc, char * argv[]) {
string test = "test replacing \"these characters\"";
regex reg("[^\\w]+");
test = regex_replace(test, reg, "_");
cout << test << endl;
}
The above works in Visual Studio 2012 RC.
Edit 1: To replace by two different strings in one pass (depending on the match), I'd think this won't work here. In Perl, this could easily be done within evaluated replacement expressions (/e switch).
Therefore, you'll need two passes, as you already suspected:
...
string test = "test replacing \"these characters\"";
test = regex_replace(test, regex("\\s+"), "_");
test = regex_replace(test, regex("\\W+"), "");
...
Edit 2:
If it were possible to pass a callback function tr() to regex_replace, you could decide on the substitution there, like:
string output = regex_replace(test, regex("\\s+|\\W+"), tr);
with tr() doing the replacement work:
string tr(const smatch &m) { return m[0].str()[0] == ' ' ? "_" : ""; }
and the problem would be solved. Unfortunately, std::regex_replace in C++11 has no overload that takes a callback, but Boost has one. The following works with Boost in a single pass:
...
#include <boost/regex.hpp>
using namespace boost;
...
string tr(const smatch &m) { return m[0].str()[0] == ' ' ? "_" : ""; }
...
string test = "test replacing \"these characters\"";
test = regex_replace(test, regex("\\s+|\\W+"), tr); // <= works in Boost
...
Maybe some day this will work with C++11 or whatever number comes next.
Regards
rbo
This is commonly done by using four backslashes in the string literal, so that the regex still sees an escaped backslash after the C++ compiler has processed its own escapes. You then need a second pass for the quotation marks, escaping them in the regex only at that point.
string tet = "test replacing \"these characters\"";
//regex reg("[^\\w]+");
regex reg("\\\\"); //--AS COMMONLY TAUGHT AND EXPLAINED
tet = regex_replace(tet, reg, " ");
cout << tet << endl;
regex reg2("\""); //--AS SHOWN
tet = regex_replace(tet, reg2, " ");
cout << tet << endl;
And in a single pass, use:
string tet = "test replacing \"these characters\"";
//regex reg("[^\\w]+");
regex reg3("\\\""); //--AS EXPLAINED
tet = regex_replace(tet, reg3, "");
cout << tet << endl;

Regular expression library that returns all matches for multiple patterns in one run for C++?

I'm looking for a regular expression (or something else) library for C++ that would allow me to specify a number of patterns, run on a string and return the matching locations of all patterns.
For example:
Patterns {"abcd", "abce"}
String {"abcd abce abcd"}
Result:
abcd matches: 0-3, 10-13
abce matches: 5-8
Does anyone know of such a library?
I recommend boost::xpressive http://www.boost.org/doc/libs/1_39_0/doc/html/xpressive.html.
One possible solution:
string text = "abcd abce abcd";
static const sregex abcd = as_xpr("abcd"); // static - faster
sregex abce = sregex::compile( "abce" ); // compiled
sregex all = *(keep(abcd) | keep(abce));
smatch what;
// note: regex_match must consume the whole input, including the spaces
if( regex_match( text, what, all ) )
{
    smatch::nested_results_type::const_iterator it = what.nested_results().begin();
    smatch::nested_results_type::const_iterator end = what.nested_results().end();
    for(; it != end; ++it)
    {
        if(it->regex_id() == abcd.regex_id())
        {
            // you matched abcd
            // use it->position() and it->length()
            continue;
        }
        if(it->regex_id() == abce.regex_id())
        {
            // you matched abce
            continue;
        }
    }
}
I don't think this is the best solution; you could check “Semantic Actions and User-Defined Assertions” in the documentation.
Regular expressions are part of the TR1 standard library extension and are implemented in a number of standard libraries (e.g. Dinkumware).
I think it's very straightforward to write the surrounding code yourself.
Doesn't it work with a simple alternation?
"abcd|abce"
which is a valid regular expression.