Seeking quoted string generator - regex

Does anyone know of a free website or program which will take a string and render it quoted with the appropriate escape characters?
Let's say, for instance, that I want to quote
It's often said that "devs don't know how to quote nested "quoted strings"".
And I would like to specify whether that gets enclosed in single or double quotes. I don't personally care for any escape character other than backslash, but others might.

If none of the double quotes of the string is already escaped, you can simply do:
str = str.replace(/"/g, "\\\"");
Otherwise, you should check whether each quote is already escaped and replace it only if it isn't; you can use lookbehind for that. The following is what came to my mind first, but it would fail for strings like an escaped backslash followed by a quote, \\" :(
str = str.replace(/(?<!\\)"/g, "\\\"");
The following treats a quote as already escaped only if the backslash before it is not itself preceded by another backslash.
str = str.replace(/(?<!(^|[^\\])\\)"/g, "\\\"");
Update: Just remembered that JavaScript (before ES2018) doesn't support look-behind; you can use the same regex on an engine that does, like Perl/PHP/.NET etc.
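Where look-behind isn't available, one alternative is to consume each backslash escape as a whole unit and replace only the quotes that are left over. A minimal Python sketch of that idea; the name escape_bare_quotes is made up for illustration, and it assumes backslash is the only escape character in the input:

```python
import re

def escape_bare_quotes(text):
    """Escape double quotes that are not already backslash-escaped.

    Hypothetical helper: assumes backslash is the only escape character.
    """
    def fix(match):
        token = match.group(0)
        # A lone quote is bare; anything else is an existing escape pair.
        return '\\"' if token == '"' else token

    # Alternation order matters: a backslash plus the character after it is
    # consumed as one unit, so a quote is matched alone only when unescaped.
    return re.sub(r'\\.|"', fix, text, flags=re.DOTALL)
```

Because an escaped backslash is consumed as a pair, a quote that follows it is correctly seen as unescaped, which is the case the single-character look-behind gets wrong.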

Any decent regex library in any decent programming language will have a function to do this - not that it's hard to write one yourself (as the other answers have indicated). So having a separate website or program to do it would be mostly useless.
Perl has the quotemeta function
PCRE's C++ wrapper has a function RE::QuoteMeta (warning: giant file at that link) which does the same thing
PHP has preg_quote if you're using Perl-compatible regexes
Python's re module has an escape function
In Java, the java.util.regex.Pattern class has a quote method
Perl and most of the other regular expression engines based on Perl have metacharacters \Q...\E, meaning that whatever comes between \Q and \E is interpreted literally
Most tools that use POSIX regular expressions (e.g. grep) have an option that makes them interpret their input as a literal string (e.g. grep -F)
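As a quick illustration of one of the functions listed above, here is a small Python sketch using re.escape. The sample text and variable names are arbitrary:

```python
import re

literal = "1+1=2 (really?)"      # contains several regex metacharacters
pattern = re.escape(literal)     # backslash-escape them for use in a regex

# The escaped pattern matches the literal text exactly...
matches_itself = re.fullmatch(pattern, literal) is not None
# ...whereas the unescaped text is parsed as a regex and does not match itself.
matches_unescaped = re.fullmatch(literal, literal) is not None

print(matches_itself, matches_unescaped)
```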

In Python, for enclosing in single quotes:
import re
mystr = """It's often said that "devs don't know how to quote nested "quoted strings""."""
print("""'%s'""" % re.sub("'", r"\'", mystr))
Output:
'It\'s often said that "devs don\'t know how to quote nested "quoted strings"".'
You could easily adapt this into a more general form, and/or wrap it in a script for command-line invocation.
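One possible shape for that more general form, as a sketch only: the function name quote and its quote_char parameter are invented here, and it handles nothing beyond backslash escaping.

```python
def quote(text, quote_char="'"):
    """Wrap text in quote_char, backslash-escaping backslashes and quote_char.

    Hypothetical helper for illustration; only backslash escaping is handled.
    """
    # Escape backslashes first so the ones added for the quotes are not doubled.
    escaped = text.replace("\\", "\\\\").replace(quote_char, "\\" + quote_char)
    return quote_char + escaped + quote_char

mystr = """It's often said that "devs don't know how to quote nested "quoted strings""."""
print(quote(mystr, "'"))   # single-quoted form
print(quote(mystr, '"'))   # double-quoted form
```

Unlike the one-liner above, this also doubles any backslashes already present, so the quoted result can be unescaped unambiguously.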

So, I guess the answer is "no". Sorry, guys, but I didn't learn anything that I didn't already know. Probably my fault for not phrasing the question correctly.
+1 for everyone who posted

Related

Regex For Strings in C

I'm looking to make a regular expression for some strings in C.
This is what I have so far:
Strings in C are delimited by double quotes (") so the regex has to be surrounded by \" \".
The string may not contain newline characters so I need to do [^\n] ( I think ).
The string may also contain double quotes or back slash characters if and only if they're escaped. Therefore [\\ \"] (again I think).
Other than that anything else goes.
Any help is much appreciated; I'm kind of lost on how to start writing this regex.
A simple flex pattern to recognize string literals (including literals with embedded line continuations):
["]([^"\\\n]|\\.|\\\n)*["]
That will allow
"string with \
line continuation"
But not
"C doesn't support
multiline strings"
If you don't want to deal with line continuations, remove the \\\n alternative. If you need trigraph support, it gets more irritating.
Although that recognizes strings, it doesn't attempt to make sense of them. Normally, a C lexer will want to process strings with backslash sequences, so that "\"\n" is converted to the two characters "NL (0x22 0x0A). You might, at some point, want to take a look at, for example, Optimizing flex string literal parsing (although that will need to be adapted if you are programming in C).
Flex patterns are documented in the flex manual. It might also be worthwhile reading a good reference on regular expressions, such as John Levine's excellent book on Flex and Bison.
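To try the pattern outside flex, it can be transcribed into Python's re syntax more or less directly. This is a sketch, not a lexer: the names C_STRING and is_c_string_literal are made up, and like the flex pattern it only recognizes literals without interpreting the escapes.

```python
import re

# Same structure as the flex pattern: an opening quote, then any number of
# ordinary characters, backslash escapes, or backslash-newline continuations,
# then a closing quote.
C_STRING = re.compile(r'"(?:[^"\\\n]|\\.|\\\n)*"')

def is_c_string_literal(text):
    """Return True if text is, in its entirety, one C-style string literal."""
    return C_STRING.fullmatch(text) is not None
```

As in flex, . does not match a newline by default in Python, which is why the backslash-newline alternative is still needed for line continuations.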

Regexp languages and replacements in Emacs

When I use the regexp-builder, I need to escape things in a different way from the way I do it when using replace-regexp. Now, this thread explains that these two commands use a different syntax, but why is that?
Also, I went through this blog post: Re-builder: The Interactive Regexp Builder, and I added
(require 're-builder)
(setq reb-re-syntax 'string)
to my .emacs file following the advice on the site. However, I still need to type " around my regexp to make it work. I thought changing the syntax language would take care of this but it doesn't.
With this, my actual questions are:
Is it still the case that Emacs does not support PCRE? Are there any workarounds to this?
Once I have the right regexp in regex-builder, is there any way to directly send the regexp to replace-regexp and enter the replacement string?
There's a package in the MELPA repository called pcre2el that adds PCRE support to many parts of Emacs, including regexp-builder and replace-regexp.
Regarding question #2: No (at least not by default), but there's another way to do that without using re-builder.
Start by doing a regexp isearch for your pattern. Because it's an isearch, you'll see the matches interactively, a bit like re-builder (albeit without coloured groupings).
Still in isearch, once you're happy with the pattern, type C-M-% to call isearch-query-replace-regexp which will prompt you for the replacement.
You can of course simply copy your re-builder string from its buffer and yank it as a replacement string (but that's undoubtedly not news).
I was curious about the need for quotes in re-builder with string syntax. It seems that it's just a formality of the system, and reb-read-regexp returns everything between the first and last " when using that syntax. Maybe it's intended to ensure that leading or trailing whitespace can't confuse matters -- re-builder does use leading whitespace for improved visibility, and trailing whitespace would be harder to spot. Or maybe it just made some of the code more convenient/consistent.
No, Emacs doesn't support PCRE, and as far as I know there is no work-around for that.
I don't think so.
To answer your first question, why does re-builder use a different syntax than replace-regexp:
By default, re-builder uses the syntax that is appropriate for writing elisp programs. In the context of a written program, regexps are entered within strings. Inside a string, backslashes have a special meaning which conflicts with using the backslash as part of a regexp. Consequently, within a string, you need to double a backslash to use it to signify part of the regexp syntax.
replace-regexp, on the other hand, is designed to be used interactively by the user, and it explicitly expects the input to be a regexp. As a convenience, it interprets backslashes as regexp syntax, not as string escapes. Which is why you can use single backslashes in this context.

Is the syntax for writing regular expression standardized

Is the syntax for writing regular expressions standardized? That is, if I write a regular expression in C++, will it work in Python or JavaScript without any modifications?
No, there are several dialects of Regular Expressions.
They generally have many elements in common.
Some popular ones are listed and compared here.
Simple regular expressions, mostly yes. However, across the spectrum of programming languages, there are differences.
No, here are some differences that come to mind:
JavaScript lets you write inline regexes (where the \ in \s need not be escaped as \\s) that are delimited by the / character. You can specify flags after the closing /. JS also has a RegExp constructor that takes the escaped string as the first argument and an optional flag string as the second argument.
/^\w+$/i and new RegExp("^\\w+$", "i") are valid and the same.
In PHP, you can enclose the regex string inside a delimiter of your choice (not sure of the full set of characters that can be used as delimiters though). Again, you should escape backslashes here.
"|[0-9]+|" is the same as "#[0-9]+#"
Python and C# support raw strings (not limited to regex, but really helpful for writing regex) that let you write unescaped backslashes in your regex.
"\\d+\\s+\\w+" can be written as r'\d+\s+\w+' in Python and @"\d+\s+\w+" in C#
Anchors like \<, \A etc. are not universally supported.
JavaScript before ES2018 doesn't support lookbehind or the DOTALL flag.
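The point about string escaping versus raw strings can be checked directly in Python. A small sketch, with arbitrary variable names and sample text:

```python
import re

escaped = "\\d+\\s+\\w+"   # ordinary string literal: every backslash doubled
raw = r"\d+\s+\w+"         # raw string literal: backslashes written once

# Both literals produce the same characters, hence the same regex.
same_pattern = escaped == raw
match = re.fullmatch(raw, "42 apples")

# Flags are passed as a separate argument rather than after a delimiter.
word = re.compile(r"^\w+$", re.IGNORECASE)

print(same_pattern, match is not None)
```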

How can I convert a Perl regex to work with Boost::Regex?

What is the Boost::Regex equivalent of this Perl regex for words that end with ing or ed or en?
/ing$|ed$|en$/
...
The most important difference is that regexps in C++ are strings, so all regexp-specific backslash sequences (such as \w and \d) should have their backslashes doubled ("\\w" and "\\d").
/^[\.:\,()\'\`-]/
should become
"^[.:,()'`-]"
The special Perl regex delimiter / doesn't exist in C++, so regexes are just a string. In those strings, you need to take care to escape backslashes correctly (\\ for every \ in your original regex). In your example, though, all those backslashes were unnecessary, so I dropped them completely.
There are other caveats; some Perl features (like variable-length lookbehind) don't exist in the Boost library, as far as I know. So it might not be possible to simply translate any regex. Your examples should be fine, though. Although some of them are weird. .*[0-9].* will match any string that contains a number somewhere, not all numbers.
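The claim that those backslashes were unnecessary can be sanity-checked in any engine with Perl-style character classes. The sketch below uses Python's re module rather than Boost, so it only shows that the two spellings agree there; the sample strings are arbitrary.

```python
import re

with_backslashes = re.compile(r"^[\.:\,()\'\`-]")   # as in the Perl original
without = re.compile(r"^[.:,()'`-]")                # simplified form

samples = [".x", ":x", ",x", "(x", ")x", "'x", "`x", "-x", "x.", "\\x", ""]
# Both patterns should accept and reject exactly the same inputs.
agree = all(
    bool(with_backslashes.search(s)) == bool(without.search(s)) for s in samples
)

# The suffix test from the question, written as a plain string pattern.
suffix = re.compile(r"ing$|ed$|en$")

print(agree, bool(suffix.search("walking")))
```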

Regular expression opening and closing characters

When I learned regular expressions I learned they should start and end with a slash character (followed by modifiers).
For example /dog/i
However, in many examples I see them starting and ending with other characters, such as @, #, and |.
For example |dog|
What's the difference?
This varies enormously from one regex flavor to the next. For example, JavaScript only lets you use the forward-slash (or solidus) as a delimiter for regex literals, but in Perl you can use just about any punctuation character--including, in more recent versions, non-ASCII characters like « and ». When you use characters that come in balanced pairs like braces, parentheses, or the double-arrow quotes above, they have to be properly balanced:
m«\d+»
s{foo}{bar}
Ruby also lets you choose different delimiters if you use the %r prefix, but I don't know if that extends to the balanced delimiters or non-ASCII characters. Many languages don't support regex literals at all; you just write the regexes as string literals, for example:
r'\d+' // Python
@"\d+" // C#
"\\d+" // Java
Note the double backslash in the Java version. That's necessary because the string gets processed twice: once by the Java compiler and once by the compile() method of the Pattern class. Most other languages provide a "raw" or "verbatim" form of string literal that all but eliminates such backslash-itis.
And then there's PHP. Its preg regex functions are built on top of the PCRE library, which closely imitates Perl's regexes, including the wide variety of delimiters. However, PHP itself doesn't support regex literals, so you have to write them as if they were regex literals embedded in string literals, like so:
'/\d+/i' // match modifiers go after the slash but inside the quotes
"{\\d+}" // double-quotes may or may not require double backslashes
Finally, note that even those languages which do support regex literals don't usually offer anything like Perl's s/…/…/ construct. The closest equivalent is a function call that takes a regex literal as the first argument and a string literal as the second, like so:
s = s.replace(/foo/i, 'bar') // JavaScript
s.gsub!(/foo/i, "bar") // Ruby
Some RE engines will allow you to use a different character so as to avoid having to escape those characters when used in the RE.
For example, with sed, you can use either of:
sed 's/\/path\/to\/directory/xx/g'
sed 's?/path/to/directory?xx?g'
The latter is often more readable. The former is sometimes called "leaning toothpicks". With Perl, you can use either of:
$x =~ /#!\/usr\/bin\/perl/;
$x =~ m!#\!/usr/bin/perl!;
but I still contend the latter is easier on the eyes, especially as the REs get very complex. Well, as easy on the eyes as any Perl code could be :-)
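For comparison, in a language without regex delimiters the problem doesn't arise at all. A short Python sketch of the same two examples, with made-up sample text:

```python
import re

text = "cd /path/to/directory && ls /path/to/directory"

# No delimiter surrounds the pattern, so the slashes need no escaping.
result = re.sub(r"/path/to/directory", "xx", text)

# Likewise for the shebang check: neither # nor ! nor / is special here.
shebang = re.search(r"#!/usr/bin/perl", "#!/usr/bin/perl -w") is not None

print(result, shebang)
```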