How does the dot metacharacter match newline characters? - regex

I thought that the dot . in regex will match any character, except the end-of-line character.
However, in R, I found that the dot can match anything, including the newline characters \n, \r or \r\n:
grep(c("\r","\n","\r\n"),pattern=".")
[1] 1 2 3
Can someone explain the contradiction?

The page here http://www.regular-expressions.info/dot.html explains how the rule that dot does not match the end-of-line character exists mostly for historic reasons:
The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain line breaks, so the dot could never match them.
However,
Modern tools and languages can apply regular expressions to very large strings or even entire files. Except for JavaScript and VBScript, all regex flavors discussed here have an option to make the dot match all characters, including line breaks.
Apparently, R is one such language where by default, dot will match every character. (I point you to Joshua's comment above, recommending you look at ?regex and the POSIX 1003.2 standard.)
The page I linked above also mentions Perl and suggests how under its default mode, dot will not match line breaks.
Notice how R's grep function has a perl option. If you turn it on, you do get a different output:
> grep(".", c("\r","\n","\r\n"), perl = TRUE)
[1] 1 3
This tells me that \n is treated as the line break character, but \r is not, which comparing cat("\r") and cat("\n") can confirm.
(I'm on macOS, if it makes any difference.)
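For comparison, the same distinction can be reproduced outside R. The following is a minimal sketch in Python (chosen only for illustration; it is not part of the original answer). It assumes Python's standard re module, where the dot skips only \n by default and re.DOTALL lifts that restriction. The names samples, default_hits and dotall_hits are made up for this example.

```python
import re

samples = ["\r", "\n", "\r\n"]

# Default mode: the dot matches any character except "\n".
default_hits = [i + 1 for i, s in enumerate(samples) if re.search(r".", s)]

# re.DOTALL makes the dot match every character, including "\n".
dotall_hits = [i + 1 for i, s in enumerate(samples) if re.search(r".", s, re.DOTALL)]

print(default_hits)  # [1, 3]
print(dotall_hits)   # [1, 2, 3]
```

The 1-based indices mirror R's output: the default result lines up with grep(..., perl = TRUE), and the DOTALL result with R's default behaviour shown in the question.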

Related

Line feed regular expression doesn't work in Geany

I want to detect line feed with Geany in Ubuntu. I used regular expressions such as \n , \r and \r\n, but it doesn't detect anything.
There are some line ending settings that I also tried changing to make it work, but still no success.
Finally, I also tried a different encoding from the Document → Set Encoding menu, but still no success.
I guess I am doing something wrong, but I still don't know what.
As Mohammad Yusuf Ghazi comments, you need to enable the Use multi-line matching option. See the Geany docs:
The Use multi-line matching dialog option enables multi-line regular expressions.
Multi-line regular expressions work just like single-line ones but a match can span several lines.
Besides, you may also use \R shorthand class for any line break sequence:
Newline sequences
Outside a character class, the escape sequence \R matches any Unicode newline sequence. This particular group matches either the two-character sequence CR followed by LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), NEL (next line, U+0085), LS (line separator, U+2028), or PS (paragraph separator, U+2029). The two-character sequence is treated as a single unit that cannot be split. Inside a character class, \R matches the letter "R".
In a simple search within the open file, select from the end of one line to the beginning of the next, then copy and paste the selection into the find box.
It will appear as a square box with four letters inside. This will detect each LF in Geany.
In regex, use $ instead of \r\n or \R. It will detect the end of line in multiline mode in Geany.

vim search replace including newline

I've been googling for 3 hours now without success.
I have a huge file which is a concatenation of many XML files.
Thus I want to search and replace all occurrences of <?xml [whatever is between]<body> (including those words).
And then same for </body>[whatever is between]</html> (including those words).
The closest I've come is
:%s/<?xml \(.*\n\)\{0,180\}\/head>//g
FYI If I try this:
:%s/<?xml \(\(.*\)\+\n\)\+\/head>\n//g
I get a E363: pattern uses more memory than 'maxmempattern'.
I've tried to follow this without success.
To match any number of symbols including a newline between <?xml and <body>, you can use
:%s/<?xml \_.*<\/head>//g
The \_.* can be used to match any symbols including a newline. To match as few symbols as possible, use \_.\{-} instead: :%s/<?xml \_.\{-}<\/head>//g.
See Vim wiki, Patterns including end-of-line section:
\_. Any character including a newline
And from the Vim regex help, 4.3 Quantifiers, Greedy and Non-Greedy section:
\{-}
matches 0 or more of the preceding atom, as few as possible
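To see why the non-greedy form matters for a file made of many concatenated documents, here is a rough analogue in Python rather than Vim (an assumption made for illustration only). Python's re.DOTALL plays the role of \_. and .*? plays the role of \{-}. The sample text and the names greedy and lazy are hypothetical.

```python
import re

text = (
    "<?xml version='1.0'?>\n<head>a</head>\n<body>one</body>\n"
    "<?xml version='1.0'?>\n<head>b</head>\n<body>two</body>\n"
)

# Greedy: a single match runs from the first "<?xml " to the LAST "</head>",
# so the first document's body is swallowed as well.
greedy = re.sub(r"<\?xml .*</head>", "", text, flags=re.DOTALL)

# Lazy: each match stops at the nearest "</head>", so both bodies survive.
lazy = re.sub(r"<\?xml .*?</head>", "", text, flags=re.DOTALL)

print(repr(greedy))  # '\n<body>two</body>\n'
print(repr(lazy))    # '\n<body>one</body>\n\n<body>two</body>\n'
```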
UPDATE
As far as escaping regex metacharacters is concerned, you can refer to Vim Regular Expression Special Characters: To Escape or Not To Escape help page. You can see } is missing on the list. Why? Because a regex engine is usually able to tell what kind of } it is from the context. It knows if it is preceded with \{ or not, and can parse the expression correctly. Thus, there is no reason to escape this closing brace, which keeps the pattern "clean".

regex: find a line somewhere after another line

I need a regular expression to find a specific line in a file that occurs somewhere after another line. For example, I may want to find the string "friend", but only when it occurs on a line after a line containing the string "hello". So for example:
hello there
how are you
my friend
should pass, but
how are you
my friend
hello
or
hello friend
how are you
should not pass.
The only thing I've thought of is something like hello[.\s]*\n[.\s]*friend, which does not work.
EDIT: I'm using a customized program that has a lot of limitations. I don't have access to switches or custom modes. I need a single regular expression that works for the standard python regex mode.
hello[.\s]*\n[.\s]*friend
First, note that a dot inside a character class matches a literal dot, not "any character", so you really want alternation rather than a character class for this. But also note that a "match all" dot will also match spaces, so you don't even need alternation.
So overall, you really just need this:
hello.*?friend
Now comes the problem with matching across new-line chars. By default the "match all" dot does not match new-line chars. You can flag/modifier it to match it, but how you do that depends on what language you are using. In php or perl, you can use the s modifier, e.g.
php:
preg_match('~hello.*?friend~s',$content);
edit:
If you are trying to use regex in something like an editor (or otherwise can't add flags/modifiers), most editors have an option to flag it as such. If not, you can try alternation with newline chars like so:
hello(.|\r?\n)*friend
You need to include two newline characters.
hello(?:.*\n)+.*friend
This expects at least one newline character in between.
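A quick way to check this pattern against the three samples from the question, sketched in Python's standard re module with no flags set (the question asks for the standard Python regex mode). The cases dictionary and its expected values are written here from the question's description.

```python
import re

# No flags: the dot does not cross "\n", so the explicit "\n" inside the
# group is what forces "friend" onto a later line than "hello".
pattern = re.compile(r"hello(?:.*\n)+.*friend")

# Sample text -> whether the question says it should pass.
cases = {
    "hello there\nhow are you\nmy friend": True,
    "how are you\nmy friend\nhello": False,
    "hello friend\nhow are you": False,
}

results = {text: bool(pattern.search(text)) for text in cases}
for text, expected in cases.items():
    print(repr(text), "matched:", results[text], "expected:", expected)
```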
I'm by no means a regex expert (particularly not in Python), but my RegexBuddy app thinks this will work:
(?s)hello.*\n+.*friend
The (?s) is apparently an inline way of specifying the "Dot matches newline" option, which seems to be necessary for the \n to work.

What is a cross platform regex for removal of line breaks?

I am sure this has been asked before, but I cannot find it.
Basically, assuming you are parsing a text file of unknown origin and want to replace line breaks with some other delimiter, is this the best regex, or is there another?
(\r\n)|(\n)|(\r)
Fletcher - this did get asked once before.
Here you go: Regular Expression to match cross platform newline characters
Spoiler Alert!
The regex I use when I want to be precise is "\r\n?|\n".
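As a minimal sketch of how that pattern could be used to swap line breaks for another delimiter, assuming Python's re module (the function name replace_line_breaks and the sample string are made up for this example):

```python
import re

# "\r\n?" covers Windows (\r\n) and classic Mac (\r); "\n" covers Unix.
NEWLINE = re.compile(r"\r\n?|\n")

def replace_line_breaks(text: str, delimiter: str = " ") -> str:
    """Replace every line break, whatever its style, with the delimiter."""
    return NEWLINE.sub(delimiter, text)

mixed = "unix\nwindows\r\nclassic mac\rend"
print(replace_line_breaks(mixed, "|"))  # unix|windows|classic mac|end
```

Because \r\n? is tried first, a Windows line ending is replaced by one delimiter rather than two.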
Do check if your regex engine supports \R as a shorthand character class and you will not need to be concerned with the various Unicode newline / linefeed combos. If implemented correctly, you can then match all the various ascii or Unicode line endings transparently using \R.
In Unicode you need to detect NEL (OS/390 line ending, \x85) LS (Line Separator, \x2028) and PS (Paragraph Separator, \x2029) if you want to be completely cross platform these days.
It is debatable whether LS, NEL, and PS should be treated as line breaks, line endings, or white space. The XML 1.0 standard, for example, does not recognize NEL as a line break character. ECMAScript treats LS and PS as line breaks but NEL as whitespace. Perl Unicode regexes will treat VT, FF, CR, CRLF, NEL, LS and PS as line breaks for the purpose of the ^ and $ regex metacharacters.
The Unicode Implementation Guide (section 5.8 and table 5.3) is probably the best bet for a definitive treatment of what a "newline" is.
If you are only concerned with ascii with the DOS/Windows/Unix/Mac classic variants, the regex equivalent to \R is (?>\r\n|[\r\n])
In Unicode, the equivalent to \R is (?>\r\n|\n|\x0b|\f|\r|\x85|\x2028|\x2029) The \x0b in there is a vertical tab; once again, this may or may not fit your definition of what a line break is, but it does match the recommendation of the Unicode Implementation Guide. (FF, or \x0C is not included in the regex since a Form Feed is a new page, not a new line in the definition.)
The regex to find any Unicode line terminator should be
(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
rather than as drewk wrote it, at least in Perl. It is taken directly from the perl 5.10.0 documentation (it was removed in later versions).
Note the braces after \x: U+2029 is \x{2029}, but \x2029 is an ASCII space (U+0020) followed by a digit 2 and a digit 9. Also, \n outside a character class is not guaranteed to match \x{0a}.
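For engines without \R, the same set of terminators can be spelled out by hand. The sketch below is a Python translation, not the Perl original. It assumes Python 3's re module, which writes code points above U+00FF as \uXXXX rather than \x{...}, and it leaves out the atomic group because the alternation here cannot backtrack into a different match. The names UNICODE_NEWLINE and sample are hypothetical.

```python
import re

# CR optionally followed by LF, or any single one of:
# LF, VT, FF (\x0a-\x0c), NEL (\x85), LS (\u2028), PS (\u2029).
UNICODE_NEWLINE = re.compile(r"\r\n?|[\x0a-\x0c\x85\u2028\u2029]")

sample = "a\r\nb\x85c\u2028d\u2029e\x0bf\rg\nh"
print(UNICODE_NEWLINE.sub("|", sample))  # a|b|c|d|e|f|g|h
```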
If your platform does not support the \R class as suggested by #dawg above, you may still be able to make a pretty elegant and robust solution if your platform supports negative lookaround or character class subtraction (e.g. in Java class subtraction is through the syntax [x&&[^y]]).
In most regular expression grammars, the dot character is defined to mean "any character except the newline character" (see for example, for JavaScript, here). So a newline is anything you match with both of the following characteristics:
not (any character except the newline character) → the newline character; and
is whitespace
Since I'm currently working in JavaScript, which AFAIK doesn't have the \R shorthand or character class subtraction, I can still use negative lookahead to get what I want. The following regular expression matches all newlines:
/((?!.)\s)+/g
And the following JavaScript code, at least when run in Chrome 42.0.2311.90m on Windows 7, wipes out all the kinds of newlines that JavaScript (i.e. the "ECMAScript" mentioned in #dawg's third paragraph) recognizes:
var input = "hello\r\n\f\v\u2028\u2029 world";
var output = input.replace(/((?!.)\s)+/g, "");
document.write(output); // hello world
Just replace /[\r\n]+/g with an empty string "".
It'll replace all \r and \n no matter what order they appear in the string.

How to deal with the new line character in the Silverlight TextBox

When using a multi-line TextBox (AcceptsReturn="True") in Silverlight, line feeds are recorded as \r rather than \r\n. This is causing problems when the data is persisted and later exported to another format to be read by a Windows application.
I was thinking of using a regular expression to replace any single \r characters with a \r\n, but I suck at regex's and couldn't get it to work.
Because there may be a mixture of line endings, just blindly replacing all \r with \r\n doesn't cut it.
So two questions really...
If regex is the way to go what's the correct pattern?
Is there a way to get Silverlight to respect its own Environment.NewLine character in TextBoxes and have it insert \r\n rather than just a single \r?
I don't know Silverlight, but I imagine (I hope!) there's a way to get it to respect Environment.NewLine; that would be a better approach. If there isn't, however, you can use a regex. I'll assume you have text which contains all of \r, \n, and \r\n, and never uses those as anything but line endings; you just want consistency. (If they show up as non-line-ending data, the regex solution becomes much harder, and possibly impossible.) You thus want to replace all occurrences of \r(?!\n)|(?<!\r)\n with \r\n. The first half of the regex matches any \r not followed by a \n; the second half matches a lone \n which wasn't preceded by a \r.
The fancy operators in this regex are termed lookaround: (?=...) is a positive lookahead, (?<=...) is a positive lookbehind, (?!...) is a negative lookahead, and (?<!...) is a negative lookbehind. Each of them is a zero-width assertion like ^ or $; they match successfully without consuming input if the given regex succeeds/fails (for positive/negative, respectively) to match after/before (for lookahead/lookbehind) the current location in the string.
I don't know Silverlight at all (and I find the behavior you're describing very strange), but perhaps you could try searching for \r(?!\n) and replacing that with \r\n.
\r(?!\n) means "match a \r if and only if it's not followed by \n".
If you also happen to have \n without preceding \rs and want to "normalize" those too, then search for \r(?!\n)|(?<!\r)\n and replace with \r\n.
(?<!\r)\n means "match a \n if and only if it's not preceded by \r".
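The lookaround pattern is not specific to .NET. As a minimal sketch of the same normalization, here it is in Python's re module (an assumption for illustration only; the question itself is about Silverlight, and the names LONE_BREAK and to_crlf are made up):

```python
import re

# A "\r" not followed by "\n", or a "\n" not preceded by "\r".
LONE_BREAK = re.compile(r"\r(?!\n)|(?<!\r)\n")

def to_crlf(text: str) -> str:
    """Turn lone \\r and lone \\n into \\r\\n, leaving existing \\r\\n pairs alone."""
    return LONE_BREAK.sub("\r\n", text)

print(repr(to_crlf("a\rb\nc\r\nd")))  # 'a\r\nb\r\nc\r\nd'
```

Because existing \r\n pairs never match, running the function a second time on its own output changes nothing.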