Regex Replacing &#58; to ":" etc - c++

I've got a bunch of strings like:
"Hello,&#32;here's a test colon&#58;. Here's a test semi-colon&#59;"
I would like to replace that with
"Hello, here's a test colon:. Here's a test semi-colon;"
And so on for all printable ASCII values.
At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).
Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.
Thanks,
Dom

The big advantage of using a regex is that it deals with tricky cases like &#38;#38;. Entity replacement isn't iterative; it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.
I'd say a regex was the right choice.
Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32; to &#126;. For that reason, you could consider &#1?\d\d;.
As for replacing the content, I'd use the basic algorithm described for boost::regex::replace :
For each match // Using regex_iterator<>
Print the prefix of the match
Remove the first 2 and last character of the match (&#;)
lexical_cast the result to int, then truncate to char and append.
Print the suffix of the last match.
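The steps above can be sketched roughly as follows. This is Python rather than C++, purely to keep the illustration short. The name decode_printable_ncrs is made up for this example, and leaving out-of-range references untouched is an assumption about the desired behaviour, not something the question states.

```python
import re

# One decimal numeric character reference, e.g. "&#58;"
NCR = re.compile(r"&#([0-9]+);")

def decode_printable_ncrs(text):
    """Decode references to printable ASCII (32..126) in a single pass."""
    pieces = []
    last_end = 0
    for match in NCR.finditer(text):
        # Copy the text between the previous match and this one (the "prefix").
        pieces.append(text[last_end:match.start()])
        code = int(match.group(1))
        # Assumption: references outside 32..126 are copied through unchanged.
        pieces.append(chr(code) if 32 <= code <= 126 else match.group(0))
        last_end = match.end()
    # Copy whatever follows the last match (the "suffix").
    pieces.append(text[last_end:])
    return "".join(pieces)
```

Because every reference is decoded exactly once, &#38;#38; comes out as &#38; and is not decoded a second time.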

This will probably earn me some down votes, seeing as this is not a c++, boost or regex response, but here's a SNOBOL solution. This one works for ASCII. Am working on something for Unicode.
NUMS = '1234567890'
MAIN LINE = INPUT :F(END)
SWAP LINE ? '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
OUTPUT = LINE :(MAIN)
END

* Repaired SNOBOL4 Solution
* &#38; -> &
digit = '0123456789'
main line = input :f(end)
result =
swap line arb . l
+ '&#' span(digit) . n ';' rem . line :f(out)
result = result l char(n) :(swap)
out output = result line :(main)
end

I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.
Here's a Python implementation:
import re
s = "Hello,&#32;here's a test colon&#58;. Here's a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)
Producing:
"Hello, here's a test colon:. Here's a test semi-colon;"
I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.

The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:
dd = "0123456789"
ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end

Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.
echo "Hello,&#32;here's a test colon&#58;. Here's a test semi-colon&#59; Further test &#65;. abc.&#126;.def." | perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) ) { return sprintf("%c",$x); } else { return "&#".$x.";"; } } while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'
Pretty-printing that:
#!/usr/bin/perl -w
sub translate
{
    my $x=$_[0];
    if ( ($x >= 32) && ($x <= 126) )
    {
        return sprintf( "%c", $x );
    }
    else
    {
        return "&#" . $x . ";" ;
    }
}
while (<>)
{
    s/&#(1?\d\d);/&translate($1)/ge;
    print;
}
Though perl being perl, I'm sure there's a much better way to write that...
Back to C code:
You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
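For comparison, here is a rough sketch of what such a hand-rolled state machine might look like, written in Python for brevity. The state names and the function name decode_ncrs_fsm are invented for this illustration, and the 32..126 range check is carried over from the Perl version above.

```python
def decode_ncrs_fsm(text):
    """Decode &#NN; references to printable ASCII with an explicit state machine."""
    out = []
    state = "TEXT"   # TEXT -> AMP (saw '&') -> DIGITS (saw '&#')
    pending = ""     # characters of a possible reference seen so far
    for ch in text:
        if state == "TEXT":
            if ch == "&":
                state, pending = "AMP", "&"
            else:
                out.append(ch)
        elif state == "AMP":
            if ch == "#":
                state, pending = "DIGITS", "&#"
            elif ch == "&":
                out.append(pending)      # flush the earlier '&', keep the new one
                pending = "&"
            else:
                out.append(pending + ch)
                state, pending = "TEXT", ""
        else:  # DIGITS
            if ch in "0123456789":
                pending += ch
            elif ch == ";" and len(pending) > 2 and 32 <= int(pending[2:]) <= 126:
                out.append(chr(int(pending[2:])))
                state, pending = "TEXT", ""
            elif ch == "&":
                out.append(pending)      # not a reference after all; start over
                state, pending = "AMP", "&"
            else:
                out.append(pending + ch)
                state, pending = "TEXT", ""
    out.append(pending)                  # flush an unfinished reference at the end
    return "".join(out)
```

Every edge case (a stray &, a reference cut off at the end of input, an out-of-range number) needs its own branch, which is where the maintenance cost comes from.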

Here's another Perl one-liner (see @mrree's answer):
a test file:
$ cat ent.txt
Hello,  here's a test colon&#58;.
Here's a test semi-colon&#59; '&#131;'
the one-liner:
$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt
or using more specific regex:
$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt
both one-liners produce the same output:
Hello,  here's a test colon:.
Here's a test semi-colon; '&#131;'

The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.
// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>
int main() {
  using namespace BOOST_SPIRIT_CLASSIC_NS;
  std::string line;
  while (std::getline(std::cin, line)) {
    // call parse() outside assert() so it still runs when NDEBUG is defined
    const bool full = parse(line.begin(), line.end(),
        // match "&#(\d+);" where 32 <= $1 <= 126 or any char
        *(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
          | anychar_p[&putchar])).full;
    assert(full);
    (void)full; // avoid an unused-variable warning in release builds
    putchar('\n');
  }
}
compile:
$ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
run:
$ echo "Hello,  here's a test colon&#58;." | spirit_ncr2a
output:
"Hello,  here's a test colon:."

I thought I was pretty good at regex, but I have never seen lambdas being used in a regex; please enlighten me!
I'm currently using python and would have solved it with this oneliner:
''.join([x.isdigit() and chr(int(x)) or x for x in re.split('&#(\d+);',THESTRING)])
Does that make any sense?
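It mostly does, with one catch worth spelling out. When the pattern has a capture group, re.split returns plain text at the even indices and the captured digits at the odd indices, so testing isdigit() also converts plain text that happens to consist only of digits. A small sketch of the difference (the function names are made up for this comparison):

```python
import re

def decode_by_isdigit(text):
    # The one-liner from above: any all-digit piece is treated as a reference.
    return ''.join([x.isdigit() and chr(int(x)) or x
                    for x in re.split(r'&#([0-9]+);', text)])

def decode_by_position(text):
    # Only the odd-indexed pieces are captured groups, i.e. real references.
    parts = re.split(r'&#([0-9]+);', text)
    return ''.join(chr(int(p)) if i % 2 else p for i, p in enumerate(parts))
```

For the input "12&#58;30", decode_by_position gives "12:30", while decode_by_isdigit turns the literal "12" and "30" into control characters.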

Here's an NCR scanner created using Flex:
/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
/**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}
To make an executable:
$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl
Example:
$ echo "Hello,  here's a test colon&#58;.
> Here's a test semi-colon&#59; '&#131;'
> &#38;#59; <-- may be recursive" \
> | ncr2a
It prints for non-recursive version:
Hello,  here's a test colon:.
Here's a test semi-colon; '&#131;'
&#59; <-- may be recursive
And the recursive one produces:
Hello,  here's a test colon:.
Here's a test semi-colon; '&#131;'
; <-- may be recursive

The original problem statement apparently isn't very complete, but if you really want to trigger only on cases that produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in range and others are not).
dd = "0123456789"
ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
It would not be particularly difficult to handle that case as well (e.g. "&#131;&#58;" produces "&#131;:"):
dd = "0123456789"
ccp = "#" (span(dd) $ n ";") $ enc
+ *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = replace(line,char(10),"#") :(rdl)
done
end

Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.
#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>
int main()
{
  boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
  const int subs[] = {-1, 1}; // non-match & subexpr
  boost::sregex_token_iterator end;
  std::string line;
  while (std::getline(std::cin, line)) {
    boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);
    for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
      if (isncr) { // convert NCR e.g., '&#58;' -> ':'
        const int d = boost::lexical_cast<int>(*tok);
        assert(32 <= d && d < 127);
        std::cout << static_cast<char>(d);
      }
      else
        std::cout << *tok; // output as is
    }
    std::cout << '\n';
  }
}

Related

Finding newlines between $$ $$ or $ $

I want to replace all "\r\n" with two backslashes + newline "\\ \r\n", except the "\r\n" inside "$$ $$", "$ $" or "\[ \]". (This is LaTeX syntax.)
The following text
1.$$ Test
2.
3.$$ $
4. $
5. Test $
6.
7. $
8.
9. Test
should be
1.$$ Test
2.
3.$$ $
4. $ \\
5. Test $
6.
7. $ \\
8. \\
9. Test
One of my trials:
First I have replaced new lines between $$ $$ or $ $ or \[ \] with --newline--
Then I have replaced all new lines with double new lines (in latex \\ equals double new line).
Then I have replaced --newline-- with new line.
private static String replaceNewLines(String original) {
String text = original;
text = replaceBetween(text, "\\[", "\\]");
text = replaceBetween(text, "$$", "$$");
text = replaceBetween(text, "$", "$");
text = text.replace("\r\n", "\r\n\r\n").replace("--newline--", "\r\n");
return text;
}
private static String replaceBetween(String text, String start, String end) {
int i = text.indexOf(start);
while (i >= 0) {
int j = text.indexOf(end, i + 1);
String before = text.substring(0, i);
String after = text.substring(j);
text = before + text.substring(i, j).replace("\r\n", "--newline--")
+ after;
i = text.indexOf(start, j + 1);
}
return text;
}
I would suggest going through the file in a single pass with a flag that records whether you are in math mode or not. Depending on the flag, you either replace the newline or leave it alone.
In the more general case, where nesting is possible, I would suggest using a stack.
Deque<String> queue = new ArrayDeque<>(Collections.emptyList());
In this case you can go through the file in one run adding appropriate strings to the stack when entering into math mode and removing them when leaving it. Again depending on the mode (i.e. on the string which is on the top of stack) replace newline or not.
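A minimal sketch of the flag-based single pass, written in Python to keep it short (the same loop translates directly to Java). The function name break_lines_outside_math is invented here, and the sketch assumes no nesting and no escaped \$; it only knows the three delimiter pairs from the question.

```python
def break_lines_outside_math(text, newline="\r\n"):
    """Append ' \\\\' before every newline that is outside math mode."""
    closers = {"$$": "$$", "$": "$", "\\[": "\\]"}   # opening -> closing delimiter
    out = []
    closer = None   # closing delimiter we are waiting for; None means text mode
    i = 0
    while i < len(text):
        if closer is not None:
            # Math mode: copy everything verbatim until the closer shows up.
            if text.startswith(closer, i):
                out.append(closer)
                i += len(closer)
                closer = None
            else:
                out.append(text[i])
                i += 1
            continue
        for opener in ("$$", "$", "\\["):   # "$$" must be tried before "$"
            if text.startswith(opener, i):
                closer = closers[opener]
                out.append(opener)
                i += len(opener)
                break
        else:
            # Text mode and no delimiter here: rewrite newlines, copy the rest.
            if text.startswith(newline, i):
                out.append(" \\\\" + newline)
                i += len(newline)
            else:
                out.append(text[i])
                i += 1
    return "".join(out)
```

Replacing the single closer variable with a stack of pending closers is the natural next step once nesting has to be supported.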
You may ask when nesting could appear in LaTeX. Look at this rough example of a definition of the Dirichlet function:
\[
\mathbb{1}(x)
=
\begin{cases}
1&\text{when $x
$ is rational number}\\
0&\text{when $
x$ is not rational number}
\end{cases}
\]
Here $ $ is inside \[ \]. Additionally, you have to take into account \text{ and } as something that causes nesting. Things get complicated.
Finally, I think that you should also take into account the pair \( and \), which is equivalent to $ $.
Apart from that in LaTeX there are also environments so if you have a real LaTeX source then you have to deal with \begin{equation} etc.
BTW \\ in LaTeX just breaks the line. Double \r\n starts a new paragraph. This is not the same.
You mentioned regex in tags. Achieving the same in regex is a longer story and it depends a bit on specific regex flavour. You can read about it on the page https://www.regular-expressions.info/balancing.html

C++ regex to search file paths in a string

I'm trying to parse strings which can contain file paths.
I'm using C++ with the regex library, here in its ECMAScript flavor, and I'm not that good with regex.
I don't know why the string:
"C:\Windows\explorer.exe C:\titi\toto.exe"
doesn't match the pattern completely (actually, it only finds the first path):
(?:[a-zA-Z]\:|\\)(?:\\[a-z_\-\s0-9]+)+
Do you have a better idea for finding every match?
Thanks!
Here's my code:
wsmatch matches;
regex_constants::match_flag_type fl = regex_constants::match_default ;
regex_constants::syntax_option_type st = regex_constants::icase //Case insensitive
| regex_constants::ECMAScript
| regex_constants::optimize;
wregex pattern(L"(?:[a-zA-Z]\\:|\\\\)(?:\\\\[a-z_\\-\\s0-9]+)+", st);
// Look if matches pattern
printf("--> %ws\n", path.c_str());
if (regex_search(path, matches, pattern, fl)
&& matches.size() > 0)
{
for (u_int i = 0 ; i < matches.size() ; i++)
{
wssub_match sub_match = matches[i];
wstring sub_match_str = sub_match.str();
printf("%ws\n", sub_match_str.c_str());
}
}
You could use something like this:
.?:(\\[a-zA-Z 0-9]*)*.[a-zA-Z]*
I tested it with http://regexpal.com/ and it extracts all file paths.
Although the regex provided by @mspoerr handles the example in the question, it wasn't great for me in more complex scenarios, so I wrote my own.
Regex:
(\w:)?([\\\w\s0-9_]*)\.\w+
Advanced test string:
C:\Wi ndows\explorer.exe asdasds
: ad C:\titi\toto.Heexe
HELLOO : qwefqwfqwf c:\aa.
(it matches only two valid file paths)
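On the "it only finds the first one" part of the question: std::regex_search reports a single match per call, and the loop over matches[i] walks the capture groups of that one match, not further matches. To get every path you iterate instead (std::regex_iterator in C++, std::wsregex_iterator for std::wstring). The idea is sketched below in Python with re.findall. The pattern is a variation of the one in the question and an assumption on my part: it drops \s and adds . to the character class so that the extension is included and a path stops at the first space.

```python
import re

# Drive letter (or leading backslash) followed by one or more \component parts.
PATH = re.compile(r"(?:[a-z]:|\\)(?:\\[a-z0-9_.\-]+)+", re.IGNORECASE)

def find_paths(text):
    """Return every non-overlapping path-like match, not just the first."""
    return PATH.findall(text)
```

With this pattern, paths containing spaces are cut short, so it is only a starting point.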

Conditionally replace regex matches in string

I am trying to replace certain patterns in a string with different replacement patterns.
Example:
string test = "test replacing \"these characters\"";
What I want to do is replace all ' ' with '_' and all other non letter or number characters with an empty string. I have the following regex created and it seems to tokenize correctly, but I am not sure how to (if possible) perform a conditional replace using regex_replace.
string test = "test replacing \"these characters\"";
regex reg("(\\s+)|(\\W+)");
expected result after replace would be:
string result = "test_replacing_these_characters";
EDIT:
I cannot use boost, which is why I left it out of the tags. So please no answer that includes boost. I have to do this with the standard library. It may be that a different regex would accomplish the goal or that I am just stuck doing two passes.
EDIT2:
I did not remember what characters were included in \w at the time of my original regex, after looking it up I have further simplified the expression. Again the goal is anything matching \s+ should be replaced with '_' and anything matching \W+ should be replaced with empty string.
The c++ (0x, 11, tr1) regular expressions do not really work (stackoverflow) in every case (look up the phrase regex on this page for gcc), so it is better to use boost for a while.
You may try if your compiler supports the regular expressions needed:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main(int argc, char * argv[]) {
string test = "test replacing \"these characters\"";
regex reg("[^\\w]+");
test = regex_replace(test, reg, "_");
cout << test << endl;
}
The above works in Visual Studio 2012Rc.
Edit 1: To replace by two different strings in one pass (depending on the match), I'd think this won't work here. In Perl, this could easily be done within evaluated replacement expressions (/e switch).
Therefore, you'll need two passes, as you already suspected:
...
string test = "test replacing \"these characters\"";
test = regex_replace(test, regex("\\s+"), "_");
test = regex_replace(test, regex("\\W+"), "");
...
Edit 2:
If it would be possible to use a callback function tr() in regex_replace, then you could modify the substitution there, like:
string output = regex_replace(test, regex("\\s+|\\W+"), tr);
with tr() doing the replacement work:
string tr(const smatch &m) { return m[0].str()[0] == ' ' ? "_" : ""; }
the problem would have been solved. Unfortunately, there's no such overload in some C++11 regex implementations, but Boost has one. The following would work with boost and use one pass:
...
#include <boost/regex.hpp>
using namespace boost;
...
string tr(const smatch &m) { return m[0].str()[0] == ' ' ? "_" : ""; }
...
string test = "test replacing \"these characters\"";
test = regex_replace(test, regex("\\s+|\\W+"), tr); // <= works in Boost
...
Maybe some day this will work with C++11 or whatever number comes next.
Regards
rbo
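For readers who want to see the callback idea in isolation, here is how it looks in a language whose regex library accepts a function as the replacement. This is a Python sketch (re.sub accepts a callable), not a C++ standard library solution. The name sanitize is made up, and the second alternative is written as [^\w\s]+ instead of \W+, which is my own adjustment so that punctuation followed by a space does not swallow the space.

```python
import re

def sanitize(text):
    """Whitespace runs become '_', other non-word runs are dropped, in one pass."""
    def pick(match):
        # Decide the replacement from the first character of the match.
        return "_" if match.group(0)[0].isspace() else ""
    return re.sub(r"\s+|[^\w\s]+", pick, text)
```

The order of the alternatives does not matter with this pattern, because the two character sets no longer overlap.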
This has commonly been accomplished by using four backslashes, so that the backslash survives the escaping of the C++ string literal itself. You will then need to make a second pass for the quotation marks, escaping them in your regex then and only then.
string tet = "test replacing \"these characters\"";
//regex reg("[^\\w]+");
regex reg("\\\\"); //--AS COMMONLY TAUGHT AND EXPLAINED
tet = regex_replace(tet, reg, " ");
cout << tet << endl;
regex reg2("\""); //--AS SHOWN
tet = regex_replace(tet, reg2, " ");
cout << tet << endl;
And in a single pass use;
string tet = "test replacing \"these characters\"";
//regex reg("[^\\w]+");
regex reg3("\\\""); //--AS EXPLAINED
tet = regex_replace(tet, reg3, "");
cout << tet << endl;

How can I extract a substring after a match position?

I have a requirement to grep a string or pattern (say, around 200 characters before and after the string or pattern) from a file that consists of one extremely long line. The file contains streams of data (market trading data) coming from a remote server and getting appended onto this line of the file.
I know that I can match lines containing a specific pattern using grep (or other tools), but once I have such lines, how can I extract a portion of the line? I want to grab the part of the line with the pattern plus roughly 200 characters before and after the pattern. I would be especially interested in answers using...(supply tools or languages you're comfortable with here).
If what you need is the 200 characters before and after the expression plus the expression itself, then you are looking at:
/.{200}aaa.{200}/
If you need captures for each (allowing you to extract each part as a unit), then you use this regexp:
/(.{200})(aaa)(.{200})/
If your grep has -o then that will output only the matched part.
echo "abc def ghi jkl mno pqr" | egrep -o ".{4}ghi.{4}"
produces:
def ghi jkl
(.{0,200}(pattern).{0,200}), or something?
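Note that .{200} on both sides fails to match when fewer than 200 characters are available, which is what the {0,200} variant works around. Another way to get the same effect is to find the match first and then slice around it, clamping at the ends of the line. A small Python sketch (the function name context_around is invented for this example):

```python
import re

def context_around(line, pattern, padding=200):
    """Return each match of pattern together with up to `padding` characters on both sides."""
    results = []
    for match in re.finditer(pattern, line):
        start = max(0, match.start() - padding)      # clamp at the start of the line
        end = min(len(line), match.end() + padding)  # clamp at the end of the line
        results.append(line[start:end])
    return results
```

If two matches are closer together than the padding, their contexts overlap; whether that is acceptable depends on what you do with the output.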
Is this what you want (in C)?
If it is, feel free to adapt to your specific needs.
#include <stdio.h>
#include <string.h>
void prt_grep(const char *haystack, const char *needle, int padding) {
char *ptr, *start, *finish;
ptr = strstr(haystack, needle);
if (!ptr) return;
start = (ptr - padding);
if (start < haystack) start = haystack;
finish = ptr + strlen(needle) + padding;
if (finish > haystack + strlen(haystack)) finish = haystack + strlen(haystack);
for (ptr = start; ptr < finish; ptr++) putchar(*ptr);
}
int main(void) {
const char *longline = "123456789 ASDF 123456789";
const char *pattern = "ASDF";
prt_grep(longline, pattern, 5); /* you want 200 */
return 0;
}
I think I might approach the problem by matching the part of the string I need, then using the match position as the starting point for the substring extraction. In Perl, once your regex succeeds, the pos built-in tells you where you left off (pos is only set for matches made with the /g flag):
if( $long_string =~ m/$regex/g ) {
    $substring = substr( $long_string, pos( $long_string ), 200 );
}
I tend to write my programs in Perl instead of doing everything in the regular expression. There's nothing particularly special about Perl in this case.
I think this may be more basic than everybody is thinking; correct me if I'm wrong...
Do you want to print what comes before and after the string, excluding the string itself?
awk -F "ASDF" '{print "Before ASDF" $1 "\n" "After ASDF" $2}' $FILE
This will print something like:
Before ASDF blablabla
After ASDF blablablabla
Change it to match your needs; remove the "\n" and/or the "Before..." and "After..." labels.
Do you want to suppress the string in the file?
This will replace the string with a blank space, again, change it to whatever you need.
sed -i 's/ASDF/\ /' longstring.txt
HTH

How to parse a command line with regular expressions?

I want to split a command-line-like string into its individual string parameters. What would the regular expression for it look like? The problem is that the parameters can be quoted. For example:
"param 1" param2 "param 3"
should result in:
param 1, param2, param 3
You should not use regular expressions for this. Write a parser instead, or use one provided by your language.
I don't see why I get downvoted for this. This is how it could be done in Python:
>>> import shlex
>>> shlex.split('"param 1" param2 "param 3"')
['param 1', 'param2', 'param 3']
>>> shlex.split('"param 1" param2 "param 3')
Traceback (most recent call last):
[...]
ValueError: No closing quotation
>>> shlex.split('"param 1" param2 "param 3\\""')
['param 1', 'param2', 'param 3"']
Now tell me that racking your brain about how a regex will solve this problem is ever worth the hassle.
I tend to use regexlib for this kind of problem. If you go to: http://regexlib.com/ and search for "command line" you'll find three results which look like they are trying to solve this or similar problems - should be a good start.
This may work:
http://regexlib.com/Search.aspx?k=command+line&c=-1&m=-1&ps=20
("[^"]+"|[^\s"]+)
what i use
C++
#include <iostream>
#include <iterator>
#include <string>
#include <regex>
void foo()
{
std::string strArg = " \"par 1\" par2 par3 \"par 4\"";
std::regex word_regex( "(\"[^\"]+\"|[^\\s\"]+)" );
auto words_begin =
std::sregex_iterator(strArg.begin(), strArg.end(), word_regex);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i)
{
std::smatch match = *i;
std::string match_str = match.str();
std::cout << match_str << '\n';
}
}
Output:
"par 1"
par2
par3
"par 4"
Without regard to implementation language, your regex might look something like this:
("[^"]*"|[^"]+)(\s+|$)
The first part "[^"]*" looks for a quoted string that doesn't contain embedded quotes, and the second part [^"]+ looks for a sequence of non-quote characters. The \s+ matches a separating sequence of spaces, and $ matches the end of the string.
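One thing to watch with that pattern is that [^"]+ also matches spaces, so unquoted parameters are not separated from each other. A variation that excludes whitespace from the unquoted branch, and captures the quoted text without its quotes, could look like the Python sketch below. The pattern and the name split_args are my own illustration, and embedded or escaped quotes are deliberately not handled.

```python
import re

# Either a quoted run (group 1, quotes excluded) or a run without spaces/quotes (group 2).
TOKEN = re.compile(r'"([^"]*)"|([^\s"]+)')

def split_args(command_line):
    """Split a command-line-like string into parameters, honouring double quotes."""
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in TOKEN.finditer(command_line)]
```

An unterminated quote is silently skipped here, whereas shlex.split raises an error for it.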
Regex: /[\/-]?((\w+)(?:[=:]("[^"]+"|[^\s"]+))?)(?:\s+|$)/g
Sample: /P1="Long value" /P2=3 /P3=short PwithoutSwitch1=any PwithoutSwitch2
This regex can parse a parameter list built by the following rules:
Parameters are separated by one or more spaces.
A parameter can contain a switch symbol (/ or -).
A parameter consists of a name and a value, separated by = or :.
The name can be a set of alphanumerics and underscores.
The value can be absent.
If the value exists, it can be any set of symbols, but if it contains a space, the value should be quoted.
This regex has three groups:
the first group contains whole parameters without switch symbol,
the second group contains name only,
the third group contains value (if it exists) only.
For sample above:
Whole match: /P1="Long value"
Group#1: P1="Long value",
Group#2: P1,
Group#3: "Long value".
Whole match: /P2=3
Group#1: P2=3,
Group#2: P2,
Group#3: 3.
Whole match: /P3=short
Group#1: P3=short,
Group#2: P3,
Group#3: short.
Whole match: PwithoutSwitch1=any
Group#1: PwithoutSwitch1=any,
Group#2: PwithoutSwitch1,
Group#3: any.
Whole match: PwithoutSwitch2
Group#1: PwithoutSwitch2,
Group#2: PwithoutSwitch2,
Group#3: absent.
Most languages have other functions (either built-in or provided by a standard library) which will parse command lines far more easily than building your own regex, plus you know they'll do it accurately out of the box. If you edit your post to identify the language that you're using, I'm sure someone here will be able to point you at the one used in that language.
Regexes are very powerful tools and useful for a wide range of things, but there are also many problems for which they are not the best solution. This is one of them.
This will split an exe from its parameters, stripping the quotes from the exe; it assumes clean data:
^(?:"([^"]+(?="))|([^\s]+))["]{0,1} +(.+)$
You will have two matches at a time, out of three match groups:
The exe if it was wrapped in quotes
The exe if it was not wrapped in quotes
The clump of parameters
Examples:
"C:\WINDOWS\system32\cmd.exe" /c echo this
Match 1: C:\WINDOWS\system32\cmd.exe
Match 2: $null
Match 3: /c echo this
C:\WINDOWS\system32\cmd.exe /c echo this
Match 1: $null
Match 2: C:\WINDOWS\system32\cmd.exe
Match 3: /c echo this
"C:\Program Files\foo\bar.exe" /run
Match 1: C:\Program Files\foo\bar.exe
Match 2: $null
Match 3: /run
Thoughts:
I'm pretty sure that you would need to create a loop to capture a possibly infinite number of parameters.
This regex could easily be looped over its third match until the match fails, meaning there are no more parameters.
If it's just the quotes you are worried about, then just write a simple loop that copies the string character by character, ignoring the quotes.
Alternatively if you are using some string manipulation library, you can use it to remove all quotes and then concatenate them.
there's a python answer thus we shall have a ruby answer as well :)
require 'shellwords'
Shellwords.shellsplit '"param 1" param2 "param 3"'
#=> ["param 1", "param2", "param 3"] or :
'"param 1" param2 "param 3"'.shellsplit
This answer is not regex-specific, but it covers Python command-line argument parsing:
dash and double dash flags
int/float conversion based on SO answer
import sys
def parse_cmd_args():
    _sys_args = sys.argv
    _parts = {}
    _key = "script"
    _parts[_key] = [_sys_args.pop(0)]
    for _part in _sys_args:
        # Parse numeric values float and integers
        if _part.replace("-", "1", 1).replace(".", "1").replace(",", "").isdigit():
            _part = int(_part) if '.' not in _part and float(_part)/int(_part) == 1 else float(_part)
            _parts[_key].append(_part)
        elif "=" in _part:
            _part = _part.split("=")
            _parts[_part[0].strip("-")] = _part[1].strip().split(",")
        elif _part.startswith(("-")):
            _key = _part.strip("-")
            _parts[_key] = []
        else:
            _parts[_key].extend(_part.split(","))
    return _parts
Something like:
"(?:(?<=")([^"]+)"\s*)|\s*([^"\s]+)
or a simpler one:
"([^"]+)"|\s*([^"\s]+)
(just for the sake of finding a regexp ;) )
Apply it several times; the parameter ends up in group 1 if it was surrounded by double quotes and in group 2 otherwise.
If you are looking to parse the command and the parameters I use the following (with ^$ matching at line breaks aka multiline):
(?<cmd>^"[^"]*"|\S*) *(?<prm>.*)?
In case you want to use it in your C# code, here it is properly escaped:
try {
Regex RegexObj = new Regex("(?<cmd>^\\\"[^\\\"]*\\\"|\\S*) *(?<prm>.*)?");
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
It will parse the following and know what is the command versus the parameters:
"c:\program files\myapp\app.exe" p1 p2 "p3 with space"
app.exe p1 p2 "p3 with space"
app.exe
Here's a solution in Perl:
#!/usr/bin/perl
use feature 'say';
sub parse_arguments {
    my $text = shift;
    my $i = 0;
    my @args;
    while ($text ne '') {
        $text =~ s{^\s*(['"]?)}{}; # look for (and remove) leading quote
        my $delimiter = ($1 || ' '); # use space if not quoted
        if ($text =~ s{^(([^$delimiter\\]|\\.|\\$)+)($delimiter|$)}{}) {
            $args[$i++] = $1; # acquired an argument; save it
        }
    }
    return @args;
}
my $line = <<'EOS';
"param 1" param\ 2 "pa\"ram' '3" 'pa\'ram" "4'
EOS
say "ARG: $_" for parse_arguments($line);
Output:
ARG: param 1
ARG: param\ 2
ARG: pa"ram' '3
ARG: pa'ram" "4
Note the following:
Arguments can be quoted with either " or ' (with the "other"
quote type treated as a regular character for that argument).
Spaces and quotes in arguments can be escaped with \.
The solution can be adapted to other languages. The basic approach is to (1) determine the delimiter character for the next string, (2) extract the next argument up to an unescaped occurrence of that delimiter or to the end-of-string, then (3) repeat until empty.
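As an illustration of adapting it, here is the same three-step approach sketched in Python without regexes. It is not a line-by-line port: the function removes the escaping backslash from the result, which is a design choice of this sketch, and an unquoted argument ends at any whitespace rather than only at a space.

```python
def parse_arguments(text):
    """Split text into arguments; supports ' and " quoting and backslash escapes."""
    args = []
    i, n = 0, len(text)
    while i < n:
        # Skip whitespace between arguments.
        while i < n and text[i].isspace():
            i += 1
        if i >= n:
            break
        # (1) Determine the delimiter for the next argument.
        if text[i] in "\"'":
            delimiter = text[i]
            i += 1
        else:
            delimiter = None   # unquoted: the argument ends at whitespace
        # (2) Collect characters up to an unescaped delimiter or the end of input.
        chars = []
        while i < n:
            ch = text[i]
            if ch == "\\" and i + 1 < n:
                chars.append(text[i + 1])   # keep the escaped character only
                i += 2
                continue
            if ch == delimiter or (delimiter is None and ch.isspace()):
                i += 1
                break
            chars.append(ch)
            i += 1
        args.append("".join(chars))
        # (3) Repeat until the input is used up.
    return args
```

A quote that is never closed simply runs to the end of the input here; raising an error instead, as shlex does, would be a reasonable alternative.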
\s*("[^"]+"|[^\s"]+)
that's it
(Reading your question again just prior to posting, I note you say command line LIKE string, so this information may not be useful to you, but as I have written it I will post it anyway. Please disregard it if I have misunderstood your question.)
If you clarify your question I will try to help, but from the general comments you have made I would say don't do that :-). You are asking for a regexp to split a series of parameters into an array. Instead of doing this yourself, I would strongly suggest you consider using getopt; there are versions of this library for most programming languages. Getopt will do what you are asking and scales to manage much more sophisticated argument processing should you require that in the future.
If you let me know what language you are using I will try to post a sample for you.
Here is a sample of the home pages:
http://www.codeplex.com/getopt
(.NET)
http://www.urbanophile.com/arenn/hacking/download.html
(java)
A sample (from the java page above)
Getopt g = new Getopt("testprog", argv, "ab:c::d");
//
int c;
String arg;
while ((c = g.getopt()) != -1)
{
switch(c)
{
case 'a':
case 'd':
System.out.print("You picked " + (char)c + "\n");
break;
//
case 'b':
case 'c':
arg = g.getOptarg();
System.out.print("You picked " + (char)c +
" with an argument of " +
((arg != null) ? arg : "null") + "\n");
break;
//
case '?':
break; // getopt() already printed an error
//
default:
System.out.print("getopt() returned " + c + "\n");
}
}