How to write a regex that matches a string literal in Scala - regex

A string literal consists of zero or more characters enclosed in double quotes (").
Use escape sequences (listed below) to represent special characters within a string.
It is a compile-time error for a newline or EOF character to appear inside a string literal.
All the supported escape sequences are as follows:
\b backspace
\f formfeed
\r carriage return
\n newline
\t tab
\" double quote
\\ backslash
The following are valid examples of string literals:
" This is a string containing a tab \t"
" Hello stackoverflow \"\b"
Can you help me write a regex that matches such a string literal?
Thanks so much.

The most general way is to use the Pattern.quote() method, which returns a regular expression that matches the literal string passed as its argument. You can use it in Scala as well as in Java.

If you want to match, e.g., the string represented by the literal "contain tab \t", you would use the regexp "contain tab \t".r, so there is no need for any special handling of TAB inside the regexp.
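For the grammar described in the question itself, a pattern can be assembled directly from the rules: an opening quote, then any run of ordinary characters or listed escape sequences, then a closing quote. The sketch below uses Python to stay self-contained; the names STRING_LITERAL and is_string_literal are hypothetical, and the same pattern text should carry over to Scala's .r, though that is an assumption that has not been checked here.

```python
import re

# Opening quote, then zero or more of either
#   - any character except a double quote, a backslash, or a newline, or
#   - a backslash followed by one of the supported escape characters,
# then the closing quote.
STRING_LITERAL = re.compile(r'"(?:[^"\\\n]|\\[bfrnt"\\])*"')


def is_string_literal(text: str) -> bool:
    """Return True if the whole of `text` is one valid string literal."""
    return STRING_LITERAL.fullmatch(text) is not None


if __name__ == "__main__":
    # The two valid examples from the question (raw strings keep the backslashes).
    print(is_string_literal(r'" This is a string containing a tab \t"'))
    print(is_string_literal(r'" Hello stackoverflow \"\b"'))
    # An unsupported escape and a real newline inside the quotes are rejected.
    print(is_string_literal(r'"bad escape \q"'))
    print(is_string_literal('"line\nbreak"'))
```

Using fullmatch rather than search means a literal with trailing garbage is rejected; use search instead to find literals inside longer text.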

Related

Kotlin built-in regex: Escape all regex metacharacters in a string with backslash

As you probably all know, regular expressions have some metacharacters, such as \, |, ., ?, +, *,…. If you want to search for a substring including one of these characters without actually using the regex behaviour, you can escape it with a backslash.
So if you want to search for "Is it true?" in a string you would use the pattern
"Is it true\?".
I am using Kotlin and its built-in regular expressions. Is there a way in Kotlin (a function or something) to get a string from another string in which all of the special characters in the input string are escaped?
So if the input to such a function were "This is good." the output would be "This is good\.", and for "? a [+" it would be "\? a \[\+". → every regex special character in the output is escaped with a backslash.
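To make the requested behaviour concrete, here is a minimal sketch in Python rather than Kotlin. The helper name escape_meta and the exact set of characters treated as metacharacters are assumptions made for illustration. Python's own re.escape is not used because it also escapes some extra characters, such as spaces, so its output would differ from the examples in the question.

```python
import re

# Assumed set of regex metacharacters: \ . ^ $ | ? * + ( ) [ ] { }
_META = re.compile(r'([\\.^$|?*+()\[\]{}])')


def escape_meta(text: str) -> str:
    """Prefix every metacharacter in `text` with a backslash."""
    # In the replacement template, \\ is a literal backslash and \1 is the
    # metacharacter that was matched.
    return _META.sub(r'\\\1', text)


if __name__ == "__main__":
    print(escape_meta("This is good."))
    print(escape_meta("? a [+"))
    # The escaped text can be used as a pattern that matches the input literally.
    print(re.search(escape_meta("Is it true?"), "Well. Is it true? Yes.") is not None)
```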

How to exclude part of a string using regex and add this part at the end of the string?

I've got a little problem with regex.
I have a few strings in one file that look like this:
TEST.SYSCOP01.D%%ODATE
TEST.SYSCOP02.D%%ODATE
TEST.SYSCOP03.D%%ODATE
...
What I need is to define the correct regex and rename those strings to:
TEST.D%%ODATE.SYSCOP.#01
TEST.D%%ODATE.SYSCOP.#02
TEST.D%%ODATE.SYSCOP.#03
So far, I have this regex:
r".SYSCOP[0-9]{2}.D%%ODATE" - for finding this in the file
But what should the replacement pattern look like? I need to have the numbers from the string at the end of the new string name.
.D%%ODATE.SYSCOP.# - this is just a string, not a regex, and it didn't work
Any ideas?
Find: (SYSCOP)(\d+)\.(D%%ODATE)
Replace: $3.$1.#$2 or \3.\1.#\2 for Python
You may use capturing groups with backreferences in the replacement part:
s = re.sub(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)', r'\3\1.#\2', s)
Each \N in the replacement pattern refers to the Nth capturing group in the pattern, so you can rearrange the matched parts as needed.
Note that . must be escaped to match a literal dot.
Please mind the raw string literal, the r prefix before the string literals helps you avoid excessive backslashes. '\3\1.#\2' is not the same as r'\3\1.#\2', you may print the string literals and see for yourself. In short, inside raw string literals, string escape sequences like \a, \f, \n or \r are not recognized, and the backslash is treated as a literal backslash, just the one that is used to build regex escape sequences (note that r'\n' and '\n' both match a newline since the first one is a regex escape sequence matching a newline and the second is a literal LF symbol.)
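A short runnable version of the substitution, applied to the three sample names from the question. The variable and function names (PATTERN, rename) are only illustrative. The last lines show why the r prefix matters for the replacement string.

```python
import re

# Group 1: ".SYSCOP", group 2: the two digits, group 3: ".D%%ODATE".
PATTERN = re.compile(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)')


def rename(name: str) -> str:
    """Move the date part forward and append the number after '.#'."""
    return PATTERN.sub(r'\3\1.#\2', name)


names = [
    "TEST.SYSCOP01.D%%ODATE",
    "TEST.SYSCOP02.D%%ODATE",
    "TEST.SYSCOP03.D%%ODATE",
]
renamed = [rename(n) for n in names]

if __name__ == "__main__":
    for new_name in renamed:
        print(new_name)
    # Without the r prefix, \3, \1 and \2 become octal character escapes,
    # so the two replacement strings are different.
    print(len(r'\3\1.#\2'), len('\3\1.#\2'))
```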

Ignore escaped double quote characters swift

I am trying to validate a phone number using NSPredicate and regex. The only problem is when setting the regex Swift thinks that I am trying to escape part of it due to the backslashes. How can I get around this?
My code is as follows:
let phoneRegEx = "^((\(?0\d{4}\)?\s?\d{3}\s?\d{3})|(\(?0\d{3}\)?\s?\d{3}\s?\d{4})|(\(?0\d{2}\)?\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$"
In regular Swift string literals, you need to double the backslashes to define literal backslashes:
let phoneRegEx = "^((\\(?0\\d{4}\\)?\\s?\\d{3}\\s?\\d{3})|(\\(?0\\d{3}\\)?\\s?\\d{3}\\s?\\d{4})|(\\(?0\\d{2}\\)?\\s?\\d{4}\\s?\\d{4}))(\\s?#(\\d{4}|\\d{3}))?$"
Starting from Swift 5, you can use raw string literals and write regex escapes with a single backslash:
let phoneRegEx = #"^((\(?0\d{4}\)?\s?\d{3}\s?\d{3})|(\(?0\d{3}\)?\s?\d{3}\s?\d{4})|(\(?0\d{2}\)?\s?\d{4}\s?\d{4}))(\s?#(\d{4}|\d{3}))?$"#
Please refer to the Regular Expression Metacharacters table on the ICU Regular Expressions page to see what regex escapes should be escaped this way.
Please mind the difference between the regex escapes (in the above table) and string literal escape sequences used in the regular string literals that you may check, say, at Special Characters in String Literals:
String literals can include the following special characters:
The escaped special characters \0 (null character), \\ (backslash), \t (horizontal tab), \n (line feed), \r (carriage return), \" (double quotation mark) and \' (single quotation mark)
An arbitrary Unicode scalar value, written as \u{n}, where n is a 1–8 digit hexadecimal number (Unicode is discussed in Unicode below)
So, in regular string literals, "\"" is a " string written as a string literal, and you do not have to escape a double quotation mark for the regex engine, so the "\"" string literal is enough as a regex pattern to match a " char in a string. However, "\\\"", a string literal representing the \" literal string, will also match a " char, although you can already see how redundant this regex pattern is. Also, "\n" (an LF symbol) matches a newline in the same way as "\\n" does, as "\n" is a literal representation of the newline char and "\\n" is a regex escape defined in the ICU regex escape table.
In raw string literals, \ is just a literal backslash.
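The same doubled-versus-raw distinction exists outside Swift. As a rough illustration in Python (not Swift), the doubled-backslash form and the raw-string form below produce the identical pattern text. Only the first alternative of the phone pattern is used to keep the sketch short, and the sample phone numbers are made up.

```python
import re

# Regular string literal: every regex backslash has to be doubled.
doubled = "^\\(?0\\d{4}\\)?\\s?\\d{3}\\s?\\d{3}$"
# Raw string literal: regex escapes are written with a single backslash.
raw = r"^\(?0\d{4}\)?\s?\d{3}\s?\d{3}$"

if __name__ == "__main__":
    # Both literals denote exactly the same sequence of characters.
    print(doubled == raw)
    for candidate in ["01234 567 890", "(01234) 567890", "1234 567 890"]:
        print(candidate, bool(re.match(raw, candidate)))
```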

How are \\n and \\\n interpreted by the extended regular expression?

Within an ERE, a backslash character (\\, \a, \b, \f, \n,
\r, \t, \v) is considered to begin an escape sequence.
Then I see \\n and [\\\n]. I can guess that both \\n and [\\\n] here mean \ followed by a newline, but I'm confused by the exact process of interpreting such a sequence: how many \s are required at all?
UPDATE
I don't have a problem understanding regex in programming languages, so please keep the context within the lexer.
[root# ]# echo "test\
> hi"
This is dependent on the programming language and on its string handling options.
For example, in Java strings, if you need a literal backslash in a string, you need to double it. So the regex \n must be written as "\\n". If you plan to match a backslash using a regex, then you need to escape it twice - once for Java's string handler, and once for the regex engine. So, to match \, the regex is \\, and the corresponding Java string is "\\\\".
Many programming languages have special "verbatim" or "raw" strings where you don't need to escape backslashes. So the regex \n can be written as a normal Python string as "\\n" or as a Python raw string as r"\n". The Python string "\n" is the actual newline character.
This can become confusing, because sometimes not escaping the backslash happens to work. For example, the Python string "\d\n" happens to work as a regex that's intended to match a digit followed by a newline. This is because \d isn't a recognized character escape sequence in Python strings, so it's kept as a literal \d and fed that way to the regex engine. The \n is translated to an actual newline, but that happens to match the newline in the string that the regex is tested against.
However, if you forget to escape a backslash where the resulting sequence is a valid character escape sequence, bad things happen. For example, the regex \bfoo\b matches an entire word foo (but it doesn't match the foo in foobar). If you write the regex string as "\bfoo\b", the \bs are translated into backspace characters by the string processor, so the regex engine is told to match <backspace>foo<backspace> which obviously will fail.
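The \bfoo\b pitfall is easy to reproduce. The sketch below compares the regular and raw Python literals; the variable names and sample text are only illustrative.

```python
import re

text = "a foo here"

# Regular literal: \b is turned into a backspace character (U+0008)
# before the regex engine ever sees the pattern.
wrong = "\bfoo\b"
# Raw literal: the backslashes survive, so the regex engine sees \b
# and treats it as a word boundary.
right = r"\bfoo\b"

if __name__ == "__main__":
    print(len(wrong), len(right))                  # the literals differ in length
    print(re.search(wrong, text) is None)          # no backspaces in the text
    print(re.search(right, text) is not None)      # whole word foo is found
    print(re.search(right, "foobar") is None)      # foo inside foobar is not
```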
Solution: Always use verbatim strings where you have them (e.g. Python's r"...", .NET's @"...") or use regex literals where you have them (e.g. JavaScript's and Ruby's /.../). Or use RegexBuddy to automatically translate the regex for you into your language's special format.
To get back to your examples:
\\n as a regex means "Match a backslash, followed by n"
[\\\n] as a regex means "Match either a backslash or a newline character".
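These two readings can be checked with a small Python sketch. It uses Python's re module rather than a lexer, on the assumption that the two engines agree for these particular sequences. Raw strings are used so that the regex engine receives exactly the characters shown above.

```python
import re

# The regex \\n: an escaped backslash followed by the letter n.
backslash_then_n = re.compile(r'\\n')
# The regex [\\\n]: a character class holding a backslash and a newline.
backslash_or_newline = re.compile(r'[\\\n]')

if __name__ == "__main__":
    # '\\n' is the two-character string: backslash, n.
    print(backslash_then_n.fullmatch('\\n') is not None)
    # '\n' is a single newline character, which \\n does not match.
    print(backslash_then_n.fullmatch('\n') is None)
    # The character class accepts one backslash or one newline.
    print(backslash_or_newline.fullmatch('\\') is not None)
    print(backslash_or_newline.fullmatch('\n') is not None)
```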
Actually, a regex specified by a string literal is processed by two compilers: the programming language compiler and the regex compiler:
Original     Compiled       Regex compiled
"\n"         NL             NL
"\\n"        '\'+'n'        NL
"\\\n"       '\'+NL         NL
"\\\\n"      '\'+'\'+'n'    '\'+'n'
So the shortest form, "\n", is enough.
Code examples:
JavaScript:
'a\nb'.replace(RegExp("\n"),'<br>')
'a\nb'.replace(RegExp("\\n"),'<br>')
'a\nb'.replace(RegExp("\\\n"),'<br>')
but not:
'a\nb'.replace(/\\\n/,'<br>')
Java:
System.out.println("a\nb".replaceAll("\n","<br>"));
System.out.println("a\nb".replaceAll("\\n","<br>"));
System.out.println("a\nb".replaceAll("\\\n","<br>"));
Python:
import re
str.join('<br>',re.split('\n','a\nb'))
str.join('<br>',re.split('\\n','a\nb'))
str.join('<br>',re.split('\\\n','a\nb'))

How can I match double-quoted strings with escaped double-quote characters?

I need a Perl regular expression to match a string. I'm assuming only double-quoted strings, that a \" is a literal quote character and NOT the end of the string, and that a \\ is a literal backslash character and should not escape a quote character. If it's not clear, some examples:
"\"" # string is 1 character long, contains double quote
"\\" # string is 1 character long, contains backslash
"\\\"" # string is 2 characters long, contains backslash and double quote
"\\\\" # string is 2 characters long, contains two backslashes
I need a regular expression that can recognize all 4 of these possibilities, and all other simple variations on those possibilities, as valid strings. What I have now is:
/".*[^\\]"/
But that's not right - it won't match any of those except the first one. Can anyone give me a push in the right direction on how to handle this?
/"(?:[^\\"]|\\.)*"/
This is almost the same as Cal's answer, but has the advantage of matching strings containing escape codes such as \n.
The ?: characters are there to prevent the contained expression being saved as a backreference, but they can be removed.
NOTE: as pointed out by Louis Semprini, this is limited to 32kb texts due to a recursion limit built into Perl's regex engine (that unfortunately silently returns a failure when hit, instead of crashing loudly).
How about this?
/"([^\\"]|\\\\|\\")*"/
matches zero or more characters that aren't backslashes or quotes OR two backslashes OR a backslash then a quote
A generic solution(matching all backslashed characters):
/ \A " # Start of string and opening quote
(?: # Start group
[^\\"] # Anything but a backslash or a quote
| # or
\\. # Backslash and anything
)* # End of group
" \z # Closing quote and end of string
/xms
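As a quick check of this generic pattern against the four examples from the question, here is a sketch in Python instead of Perl. The name QUOTED is hypothetical, and re.DOTALL stands in for Perl's /s flag so that a backslash may also escape a newline.

```python
import re

# Opening quote, then any mix of non-backslash/non-quote characters and
# backslash-escaped characters, then the closing quote.
QUOTED = re.compile(r'"(?:[^\\"]|\\.)*"', re.DOTALL)

# The four examples from the question, written as raw strings so that the
# backslashes reach the regex engine unchanged.
samples = [r'"\""', r'"\\"', r'"\\\""', r'"\\\\"']

if __name__ == "__main__":
    for sample in samples:
        print(sample, QUOTED.fullmatch(sample) is not None)
    # The final quote here is escaped, so the string is unterminated.
    print(QUOTED.fullmatch(r'"abc\"') is None)
    # search finds a quoted string embedded in longer text.
    print(QUOTED.search(r'say "a \" b" now').group())
```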
See Text::Balanced. It's better than reinventing the wheel. Use gen_delimited_pat to see the resulting pattern and learn from it.
Regexp::Common is another useful tool to be aware of. It contains regexps for many common cases, including quoted strings:
use Regexp::Common;
my $str = '" this is a \" quoted string"';
if ($str =~ $RE{quoted}) {
# do something
}
Here's a very simple way:
/"(?:\\?.)*?"/
Just remember if you're embedding such a regex in a string to double the backslashes.
Try this piece of code : (\".+")