Escaping filename suffix in a regex? - regex

I'm trying to match a variable length string followed by the filetype suffix in an XML filename using a regex:
varrrrrriableLengthString.xml
Currently I'm using this regex with a greedy match, the second backslash is to escape the first, which is to escape the dot.
[A-Za-z0-9]+\\.[xX][mM][lL]
I've tested this on RegExr, and it matches with only one backslash. However my CPP parser requires the double backslash.
How can I properly escape the filename suffix?

You can also escape chars using the [] notation, in your case [.]. The main advantage is that there is no "one or two backslashes?" question anymore, and I find it more readable IMHO.
It just does not work with brackets, i.e. to escape a [ (or ]), you still have to use \[ (or \\[ for a string literal) and not [[].
Backslashes still have to be escaped using another backslash too.

Related

How to exclude part of string using regex and change add this part and the and of string?

I've got a little problem with regex.
I got few strings in one file looking like this:
TEST.SYSCOP01.D%%ODATE
TEST.SYSCOP02.D%%ODATE
TEST.SYSCOP03.D%%ODATE
...
What I need is to define correct regex and change those string name for:
TEST.D%%ODATE.SYSCOP.#01
TEST.D%%ODATE.SYSCOP.#02
TEST.D%%ODATE.SYSCOP.#03
Actually, I got my regex:
r".SYSCOP[0-9]{2}.D%%ODATE" - for finding this in file
But how should look like the changing regex? I need to have the numbers from a string at the and of new string name.
.D%%ODATE.SYSCOP.# - this is just string, no regex and It didn't work
Any idea?
Find: (SYSCOP)(\d+)\.(D%%ODATE)
Replace: $3.$1.#$2 or \3.\1.#\2 for Python
Demo
You may use capturing groups with backreferences in the replacement part:
s = re.sub(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)', r'\3\1.#\2', s)
See the regex demo
Each \X in the replacement pattern refers to the Nth parentheses in the pattern, thus, you may rearrange the match value as per your needs.
Note that . must be escaped to match a literal dot.
Please mind the raw string literal, the r prefix before the string literals helps you avoid excessive backslashes. '\3\1.#\2' is not the same as r'\3\1.#\2', you may print the string literals and see for yourself. In short, inside raw string literals, string escape sequences like \a, \f, \n or \r are not recognized, and the backslash is treated as a literal backslash, just the one that is used to build regex escape sequences (note that r'\n' and '\n' both match a newline since the first one is a regex escape sequence matching a newline and the second is a literal LF symbol.)

How to prepend a backslash using clojure.string/replace

I am trying to use clojure.string/replace to escape certain characters like asterisks and backticks with backslashes (like ex*mple -> ex\*mple), but I cannot make sense of the function's own escaping rules:
If I try (cs/replace "ex*mple" #"[\*`]" "\\$0"), it treats the $0 literally and returns ex$0mple.
If I try (cs/replace "ex*mple" #"[\*`]" "\\\\$0") it adds two slashes: ex\\*mple.
What is the right way to do it?
Your second approach, (cs/replace "ex*mple" #"[\*`]" "\\\\$0"), is correct. The reason you see two backslashes in the result is because that's how Clojure shows single backslashes in strings. If you print "ex\\*mple", you'll see ex\*mple.
Clojure uses backslash as an escape character in strings, so backslashes themselves have to be escaped. ex\*mple is not a valid string in Clojure because \* is an unsupported escape character.

Regex replace escaped characters

I would like to define a regex pattern which replaces escaped characters with the corresponding value.
For example the string
xy\tz\\x
Should be converted to
xy{tab}z\x
The problem is how to handle things like
xy\\\\\t
this string should become
xy\\{tab}
I don't know how to create a pattern which matches only odd backslashes.
This isn't something that can be accomplished using a single pattern. To start, strip out collections of backslashes:
s/\\\\/\\/g
This replaces two backslashes with a single one.
Then you can just apply one pattern per escaped character:
s/\\t/\t/g
The trick here is to escape the backslash you want to replace. What this'll do is replace the literal string "\t" with a tab character.

Regex match backslash star

Can't work this one out, this matches a single star:
// Escaped multiply
Text = Text.replace(new RegExp("\\*", "g"), '[MULTIPLY]');
But I need it to match \*, I've tried:
\\*
\\\\*
\\\\\*
Can't work it out, thanks for any help!
You were close, \\\\\\* would have done it.
Better use verbatim strings, that makes it easier:
RegExp(#"\\\*", "g")
\\ matches a literal backslash (\\\\ in a normal string), \* matches an asterisk (\\* in a normal string).
Remember that there are two 'levels' of escaping.
First, you are escaping your strings for the C# compiler, and you are also escaping your strings for the Regex engine.
If you want to match "\*" literally, then you need to escape both of these characters for the regex engine, since otherwise they mean something different. We escape these with backslashes, so you will have "\\\*".
Then, we have to escape the backslashes in order to write them as a literal string. This means replacing each backslash with two backslashes: "\\\\\\*".
Instead of this last part, we could use a "verbatim string", where no escapes are applied. In this case, you only need the result from the first escaping: #"\\\*".
Your syntax is completely wrong. It looks more like Javascript than C#.
This works fine:
string Text = "asdf*sadf";
Text = Regex.Replace(Text, "\\*", "[MULTIPLY]");
Console.WriteLine(Text);
Output:
asdf[MULTIPLY]sadf
To match \* you would use the pattern "\\\\\\*".

How can I match double-quoted strings with escaped double-quote characters?

I need a Perl regular expression to match a string. I'm assuming only double-quoted strings, that a \" is a literal quote character and NOT the end of the string, and that a \ is a literal backslash character and should not escape a quote character. If it's not clear, some examples:
"\"" # string is 1 character long, contains dobule quote
"\\" # string is 1 character long, contains backslash
"\\\"" # string is 2 characters long, contains backslash and double quote
"\\\\" # string is 2 characters long, contains two backslashes
I need a regular expression that can recognize all 4 of these possibilities, and all other simple variations on those possibilities, as valid strings. What I have now is:
/".*[^\\]"/
But that's not right - it won't match any of those except the first one. Can anyone give me a push in the right direction on how to handle this?
/"(?:[^\\"]|\\.)*"/
This is almost the same as Cal's answer, but has the advantage of matching strings containing escape codes such as \n.
The ?: characters are there to prevent the contained expression being saved as a backreference, but they can be removed.
NOTE: as pointed out by Louis Semprini, this is limited to 32kb texts due a recursion limit built into Perl's regex engine (that unfortunately silently returns a failure when hit, instead of crashing loudly).
How about this?
/"([^\\"]|\\\\|\\")*"/
matches zero or more characters that aren't slashes or quotes OR two slashes OR a slash then a quote
A generic solution(matching all backslashed characters):
/ \A " # Start of string and opening quote
(?: # Start group
[^\\"] # Anything but a backslash or a quote
| # or
\\. # Backslash and anything
)* # End of group
" \z # Closing quote and end of string
/xms
See Text::Balanced. It's better than reinvent wheel. Use gen_delimited_pat to see result pattern and learn form it.
RegExp::Common is another useful tool to be aware of. It contains regexps for many common cases, included quoted strings:
use Regexp::Common;
my $str = '" this is a \" quoted string"';
if ($str =~ $RE{quoted}) {
# do something
}
Here's a very simple way:
/"(?:\\?.)*?"/
Just remember if you're embedding such a regex in a string to double the backslashes.
Try this piece of code : (\".+")