vim search replace including newline - regex

I've been googling for 3 hours now without success.
I have a huge file which is a concatenation of many XML files.
So I want to search and replace all occurrences of <?xml [whatever is between]<body> (including those words).
And then same for </body>[whatever is between]</html> (including those words).
The closest I came is
:%s/<?xml \(.*\n\)\{0,180\}\/head>//g
FYI If I try this:
:%s/<?xml \(\(.*\)\+\n\)\+\/head>\n//g
I get a E363: pattern uses more memory than 'maxmempattern'.
I've tried to follow this without success.

To match any number of symbols including a newline between <?xml and <body>, you can use
:%s/<?xml \_.*<\/head>//g
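\_. matches any character including a newline, so \_.* matches across lines. It is greedy, though: in a file made of many concatenated documents it runs from the first <?xml to the last </head>. To match as few characters as possible, use \_.\{-} instead: :%s/<?xml \_.\{-}<\/head>//g.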
See Vim wiki, Patterns including end-of-line section:
\_. Any character including a newline
And from the Vim regex help, 4.3 Quantifiers, Greedy and Non-Greedy section:
\{-}
matches 0 or more of the preceding atom, as few as possible
UPDATE
As far as escaping regex metacharacters is concerned, you can refer to the Vim Regular Expression Special Characters: To Escape or Not To Escape help page. You can see that } is missing from the list. Why? Because a regex engine is usually able to tell what kind of } it is from the context. It knows whether it is preceded by \{ or not, and can parse the expression correctly. Thus, there is no reason to escape this closing brace, which keeps the pattern "clean".
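For readers who want to see the greedy versus non-greedy difference outside Vim, here is a minimal sketch in Python. It is only an analogy: re.DOTALL plays the role of \_. and .*? plays the role of \{-}. The sample text is made up for illustration.

```python
import re

# Hypothetical sample: two concatenated documents, each with a prolog and a head.
text = (
    '<?xml version="1.0"?>\n<html>\n<head>\n<title>A</title>\n</head>\n<body>first</body>\n'
    '<?xml version="1.0"?>\n<html>\n<head>\n<title>B</title>\n</head>\n<body>second</body>\n'
)

# Greedy (like \_.*): runs from the first "<?xml " to the LAST "</head>".
greedy = re.sub(r"<\?xml .*</head>", "", text, flags=re.DOTALL)

# Lazy (like \_.\{-}): stops at the nearest "</head>" for every match.
lazy = re.sub(r"<\?xml .*?</head>", "", text, flags=re.DOTALL)

print(repr(greedy))
print(repr(lazy))
```

The greedy version swallows the first document's body along with everything else up to the last </head>, which is why the non-greedy form is the safer choice for a concatenated file.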

Related

How to check for multiple special characters using Regex in Perl

I am a beginner in Perl and I am looking for how to search for multiple special characters in a file using a regex. Basically, I have a closing tag /> which I need to verify in a file. I have read that special characters need to be preceded by '\'. But I have two special characters together and I am not sure how to do this check.
I am using something like the line below, with /> between the pattern delimiters, but it's not working:
$line =~ /\/>/
Could someone help me with this pattern matching using Regex?
I believe \Q..\E is your friend here. (\Y is not an escape listed in perlre, so it is left out.)
/text\Q$literal\Etext/
/\Q.*\Etext/
\Q..\E is listed as quote (disable) pattern metacharacters until \E. This means the usual regex special characters will be treated as literal symbols without needing to escape them.
Perl variables are still interpolated inside \Q..\E, and the interpolated value is then treated as literal text.
http://perldoc.perl.org/perlre.html
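As a rough cross-language illustration of the same idea (not Perl, and not part of the original answer), Python's re.escape does for a literal string what \Q..\E does inside a Perl pattern. The input line and variable names below are made up.

```python
import re

line = '<img src="a.png" />'  # hypothetical input line
closing = "/>"                # literal text to look for

# re.escape backslash-escapes regex metacharacters in the literal,
# which is the role \Q...\E plays inside a Perl pattern.
pattern = re.compile(re.escape(closing))
found = pattern.search(line) is not None

# A literal that contains real metacharacters shows why escaping matters.
dot_star = re.compile(re.escape(".*") + "text")
literal_hit = dot_star.search("a.*text") is not None   # contains a literal ".*"
literal_miss = dot_star.search("abctext") is not None  # no literal ".*" here

print(found, literal_hit, literal_miss)
```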

How does the dot metacharacter match newline characters?

I thought that the dot . in regex will match any character, except the end-of-line character.
However, in R, I found that the dot can match anything, including the newline characters \n, \r or \r\n:
grep(c("\r","\n","\r\n"),pattern=".")
[1] 1 2 3
Can someone explain the contradiction?
The page here http://www.regular-expressions.info/dot.html explains how the rule that dot does not match the end-of-line character exists mostly for historic reasons:
The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain line breaks, so the dot could never match them.
However,
Modern tools and languages can apply regular expressions to very large strings or even entire files. Except for JavaScript and VBScript, all regex flavors discussed here have an option to make the dot match all characters, including line breaks.
Apparently, R is one such language where by default, dot will match every character. (I point you to Joshua's comment above, recommending you look at ?regex and the POSIX 1003.2 standard.)
The page I linked above also mentions Perl and suggests how under its default mode, dot will not match line breaks.
Notice how R's grep function has a perl option. If you turn it on, you do get a different output:
> grep(".", c("\r","\n","\r\n"), perl = TRUE)
[1] 1 3
This is telling me that \n is the line break character, but \r is not, which comparing cat("\r") and cat("\n") can confirm.
(I'm on macOS, if it makes any difference.)
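The same experiment can be repeated in another engine as a sanity check. The sketch below uses Python's re module, which behaves like the perl = TRUE case by default and only lets the dot match \n when re.DOTALL is set. The 1-based indices are there just to mirror R's output.

```python
import re

samples = ["\r", "\n", "\r\n"]

# Default mode: "." matches any character except "\n".
default_hits = [i for i, s in enumerate(samples, start=1) if re.search(r".", s)]

# DOTALL mode: "." matches every character, including "\n".
dotall_hits = [i for i, s in enumerate(samples, start=1)
               if re.search(r".", s, flags=re.DOTALL)]

print(default_hits)  # [1, 3]
print(dotall_hits)   # [1, 2, 3]
```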

match first space on a line using sublime text and regular expressions

Regular expressions have always been tough for me. I'm getting frustrated trying to find a regular expression that will select the first whitespace on a line, so that I can use Sublime Text to replace it with a /
If you could give a quick explanation, that would help too.
In the spirit of @edi's answer, but with some explanation of what's happening. Match the beginning of the line with ^, then look for a sequence of characters that are not whitespace with [^\s]* or \S* (the former may work in more editors, libraries, etc. than the latter), then find the first whitespace character with \s. Putting these together, you have
^[^\s]*\s
You may want to group the non-whitespace and whitespace parts, so you can do the replacement you're talking about:
^([^\s]*)(\s)
Then the replacement pattern is just \1/
You can use this regex.
^([^\s]*)\s
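To try the capture-and-replace idea outside the editor, here is a small sketch in Python. The function name is made up, and it assumes it is called on one line at a time, so that ^ only needs to match at the start of the string.

```python
import re

def slash_first_space(line: str) -> str:
    """Replace the first whitespace character on a single line with '/'."""
    # ^(\S*) captures the leading run of non-whitespace; \s is the first whitespace.
    # \1 puts the captured text back, followed by the slash.
    return re.sub(r"^(\S*)\s", r"\1/", line, count=1)

print(slash_first_space("foo bar baz"))  # foo/bar baz
print(slash_first_space("nospace"))      # nospace (no whitespace, unchanged)
```

Only the first whitespace is touched, because the pattern is anchored at the start of the line and \S* cannot skip over a whitespace character.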

How to highlight words beginning with ‘@’ in Vim syntax?

I have a very simple Vim syntax file for personal notes. I would like to highlight people's names and I chose a Twitter-like syntax @jonathan.
I tried:
syntax match notesPerson "\<@\S\+"
To mean: words beginning with @ and having at least one non-whitespace character. The problem is that @ seems to be a special character in Vim regular expressions.
I tried to escape \@ and enclose in brackets [@], the usual tricks, but that didn't work. I could try something like (^|\s) (beginning of line or whitespace) but that's exactly the problem that word-boundary tries to solve.
Highlighting works on simplified regular expressions, so this is more a question of finding the right regex than anything else. What am I missing?
@ is a special character only if you have enabled the “very magic”
mode by having \v somewhere in the pattern prior to that @.
You have another problem here: @ does not start a new word. \< is
not just “word boundary” like perl/PCRE’s \b, but “left word
boundary” (in help: “beginning of the word”) meaning that \< must be
followed by some keyword character. As @ is not normally a keyword
character, pattern \<@ will never match. (And even if it was like
\b, it would match constructs like abc@def which is definitely not
what you want for the aforementioned reasons.)
You should use \k\@<!@\k\S* instead: \k\@<! ensures that @ is not preceded by any keyword character, \k\S* makes sure that the first character of the name is a keyword one (you could probably also use @\<\S\+).
There is another solution: include @ into 'iskeyword' option and leave the regex as is:
:setlocal iskeyword+=@-@
See :help 'isfname' for the explanation why @-@ is used here.
(The 'iskeyword' option has exactly the same syntax and will,
in fact, redirect you there for the explanation.)
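The word-boundary pitfall described above is not specific to Vim. As a hedged illustration in Python (whose \b behaves like the Perl/PCRE one), the sketch below compares a naive \b pattern with a negative lookbehind, which is the counterpart of Vim's \@<!. The function names and the marker parameter are hypothetical, and \w stands in for Vim's keyword class \k.

```python
import re

def find_naive(text: str, marker: str) -> list:
    # \b needs a word character on exactly one side, so it only "works"
    # when the marker is glued to a preceding word, as in abc@def.
    return re.findall(rf"\b{re.escape(marker)}\w+", text)

def find_tagged(text: str, marker: str) -> list:
    # (?<!\w) plays the role of Vim's \k\@<! : no word character before the marker.
    # \w\S* requires a word character first, then any run of non-whitespace.
    return re.findall(rf"(?<!\w){re.escape(marker)}\w\S*", text)

text = "met @jonathan and mail a@b.com, then @anna"
print(find_naive(text, "@"))   # ['@b']
print(find_tagged(text, "@"))  # ['@jonathan', '@anna']
```

The naive pattern finds only the unwanted a@b case, while the lookbehind version finds the two standalone names and skips the address.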

Regex to match any strings containing Cyrillic symbols, except comments marked with //, ///, ////, etc.

I want to find all strings containing at least 1 Cyrillic character (basically /.*[А-я].*/) but with exception of comments.
Comment is a string or part of a string which starts with 2 or more / characters.
Currently I have this regex, which does part of the trick:
^(?=^.*?[А-я]+).*?((?=[\/]{2,})|(^(?:(?![\/]{2,}).)*$))
But I'd like to get a less bloated and faster expression.
And as an additional question: could anyone explain why this one works? I put it together by trial and error, but I'm not sure I completely understand how it works, because when I try to change any part of it, it stops working.
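The following regex will match any Cyrillic character that is not preceded by a double forward slash anywhere earlier on the line
(?<!/{2}.*)[А-я]
It specifies that it should not be preceded by a double slash by using a negative lookbehind.
You haven't specified what flavour of regex you're using, but be aware that some flavours don't support all lookarounds. This pattern needs a variable-length lookbehind, which .NET supports; PCRE traditionally allows only fixed-length lookbehind, and older JavaScript engines have no lookbehind at all. You are using 3 lookaheads in your regex, so I presume lookarounds in general are OK.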
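For engines without variable-length lookbehind, the same condition can be expressed with a lookahead instead. The sketch below is one possible way to do it in Python, whose re module rejects variable-width lookbehind; the constant and function names are made up, and [А-я] is kept from the question as the definition of a Cyrillic character.

```python
import re

# Walk forward from the start of the line one character at a time, refusing
# to step onto a position where "//" begins, until a Cyrillic letter is found.
CYRILLIC_OUTSIDE_COMMENT = re.compile(r"^(?:(?!//).)*?[А-я]")

def has_cyrillic_code(line: str) -> bool:
    """True if the line has a Cyrillic letter before any // comment starts."""
    return CYRILLIC_OUTSIDE_COMMENT.search(line) is not None

print(has_cyrillic_code('x = "Привет"; // note'))    # True
print(has_cyrillic_code("x = 1; /// комментарий"))   # False
```

Because the pattern is anchored with ^ and the walk stops at the first //, each line is scanned at most once, which also addresses the request for something less bloated than the original expression. Note that a // inside a string literal is treated as a comment start by this sketch.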