How to highlight words beginning with ‘#’ in Vim syntax? - regex

I have a very simple Vim syntax file for personal notes. I would like to highlight people's name and I chose a Twitter-like syntax #jonathan.
I tried:
syntax match notesPerson "\<#\S\+"
To mean: words beginning with # and having at least one non-whitespace character. The problem is that # seems to be a special character in Vim regular expressions.
I tried to escape \# and enclose in brackets [#], the usual tricks, but that didn't work. I could try something like (^|\s) (beginning of line or whitespace) but that's exactly the problem that word-boundary tries to solve.
Highlighting works on simplified regular expressions, so this is more a question of finding the right regex than anything else. What am I missing?

# is a special character only if you have enabled the “very magic”
mode by having \v somewhere in the pattern prior to that #.
You have another problem here: # does not start a new word. \< is
not just “word boundary” like perl/PCRE’s \b, but “left word
boundary” (in help: “beginning of the word”) meaning that \< must be
followed by some keyword character. As # is not normally a keyword
character, pattern \<# will never match. (And even if it was like
\b, it would match constructs like abc#def which is definitely not
what you want for the aforementioned reasons.)
You should use \k\#<!#\k\S* instead: \k\#<! ensures that # is not preceded by any keyword character, \k\S* makes sure that first character of the name is a keyword one (you could probably also use #\<\S\+).
There is another solution: include # into 'iskeyword' option and leave the regex as is:
:setlocal iskeyword+=#-#
See :help 'isfname' for the explanation why #-# is used here.
(The 'iskeyword' option has exactly the same syntax and will,
in fact, redirect you there for the explanation.)

Related

regex: find a line somewhere after another line

I need a regular expression to find a specific line in a file that occurs somewhere after another line. for example, I may want to find the string "friend", but only when it occurs on a line after a line containing the string "hello". so for example:
hello there
how are you
my friend
should pass, but
how are you
my friend
hello
or
hello friend
how are you
should not pass.
The only thing I've thought of is something like hello[.\s]*\n[.\s]*friend, which does not work.
EDIT: I'm using a customized program that has a lot of limitations. I don't have access to switches or custom modes. I need a single regular expression that works for the standard python regex mode.
hello[.\s]*\n[.\s]*friend
First note that a dot inside a character class matches for a literal dot, not as a "match all" character, so you really want alternation, not character class for this. But also not that a "match all" dot will also match spaces, so you don't even need alternation.
So overall, you really just need this:
hello.*?friend
Now comes the problem with matching across new-line chars. By default the "match all" dot does not match new-line chars. You can flag/modifier it to match it, but how you do that depends on what language you are using. In php or perl, you can use the s modifier, e.g.
php:
preg_match('~hello.*?friend~s',$content);
edit:
If you are trying to use regex in something like an editor (or otherwise can't add flags/modifiers), most editors have an option to flag it as such. If not, you can try alternation with newline chars like so:
hello(.|\r?\n)*friend
You need to include two newline characters.
hello(?:.*\n)+.*friend
This expects atleast one newline character present inbetween.
I'm by no means a regex expert (particularly not in Python), but my RegexBuddy app thinks this will work:
(?s)hello.*\n+.*friend
The (?s) is apparently an inline way of specifying the "Dot matches newline" option, which seems to be necessary for the \n to work.

grep regex lookahead or start of string (or lookbehind or end of string)

I want to match a string which may contain a type of character before the match, or the match may begin at the beginning of the string (same for end of string).
For a minimal example, consider the text n.b., which I'd like to match either at the beginning of a line and end of a line or between two non-word characters, or some combination. The easiest way to do this would be to use word boundaries (\bn\.b\.\b), but that doesn't match; similar cases happen for other desired matches with non-word characters in them.
I'm currently using (^|[^\w])n\.b\.([^\w]|$), which works satisfactorily, but will also match the non-word characters (such as dashes) which appear immediately before and after the word, if available. I'm doing this in grep, so while I could easily pipe the output into sed, I'm using grep's --color option, which is disabled when piping into another command (for obvious reasons).
EDIT: The \K option (i.e. (\K^|[^\w])n\.b\.(\K[^\w]|$) seems to work, but it also does discard the color on the match within the output. While I could, again, invoke auxiliary tools, I'd love it if there was a quick and simple solution.
EDIT: I have misunderstood the \K operator; it simply removes all the text from the match preceding its use. No wonder it was failing to color the output.
If you're using grep, you must be using the -P option, or lookarounds and \K would throw errors. That means you also have negative lookarounds at your disposal. Here's a simpler version of your regex:
(?<!\w)n\.b\.(?!\w)
Also, be aware that (?<=...) and (?<!...) are lookbehinds, and (?=...) and (?!...) are lookaheads. The wording of your title suggests you may have gotten those mixed up, a common beginner's mistake.
Apparently matching beginning of string is possible inside lookahead/lookbehinds; the obvious solution is then (?<=^|[^\w])n\.b\.(?=[^\w]|$).

Basic Vim - Search and Replace text bounded by specific characters

Say I wanted to replace :
"Christoph Waltz" = "That's a Bingo";
"Gary Coleman" = "What are you talking about, dear Willis?";
to just have :
"Christoph Waltz"
"Gary Coleman"
i.e. I want to remove all the characters including and after the = and the ;
I thought the regex for finding the pattern would be \=.*?\;. In vim, I tried :
:%s/\=.*?\;$//g
but it gave me an Invalid Command error and Nothing after \=. How do I remove the above text? Apologies, but I'm new to this.
Vim's regular expression dialect is different; its escaping is optimized for text searches. See :help perl-patterns for a comparison with Perl regular expressions. As #EvergreenTree has noted, you can influence the escaping behavior with special atoms; cp. :help /\v.
For your example, the non-greedy match is .\{-}, not .*?, and, as mentioned, you mustn't escape several literal characters:
:%s/ =.\{-};$//
(The /g flag is superfluous, too; there can be only one match anchored to the end via $.)
This is because of vim's weird handling of regexes by default. Instead of \= interpreting as a literal =, it interprets it as a special regex character. To make vim's regex system work more normally, you can prefix it with \v, which is "very magic" mode. This should do the trick:
%s/\v\=.*\;$//g
I won't explain how vim enterprets every single character in very magic mode here, but you can read about it in this help topic:
:help /magic

Emacs regex wordWord boundary (specifically concerning underscores)

I am trying to replace all occurrences of a whole word on emacs (say foo) using M-x replace-regexp.
The problem is that I don't want to replace occurrences of foo in underscored words such as word_foo_word
If I use \bfoo\b to match foo then it will match the underscored strings; because as I understand emacs considers underscores to be part of word boundaries, which is different to other RegEx systems such as Perl.
What would be the correct way to proceed?
The regexp \<foo\> or \bfoo\b matches foo only when it's not preceded or followed by a word constituent character (syntax code w, usually alphanumerics, so it matches in foo_bar but not in foo1).
Since Emacs 22, the regexp \_<foo_bar\_> matches foo_bar only when it's not preceded or followed by a symbol-constituent character. Symbol constituents include not only word constituents (alphanumerics) but also punctuation characters that are allowed in identifiers, meaning underscores in most programming languages.
You wrote:
as I understand emacs considers underscores to be part of word boundaries, which is different to other regex systems
The treatment of underscores, like everything else in emacs, is configurable. This question:
How to make forward-word, backward-word, treat underscore as part of a word?
...asks the converse.
I think you could solve your problem by changing the syntax of underscores in the syntax table so that they are not part of words, and then doing the search/replace.
To do that, you need to know the mode you are using, and the name of the syntax table for that mode. In C++, it would be like this:
(modify-syntax-entry ?_ "." c++-mode-syntax-table)
The dot signifies "punctuation", which implies not part of a word. For more on that, try M-x describe-function on modify-syntax-entry.

What regex will capitalize any letters following whitespace?

I'm looking for a Perl regex that will capitalize any character which is preceded by whitespace (or the first char in the string).
I'm pretty sure there is a simple way to do this, but I don't have my Perl book handy and I don't do this often enough that I've memorized it...
s/(\s\w)/\U$1\E/g;
I originally suggested:
s/\s\w/\U$&\E/g;
but alarm bells were going off at the use of '$&' (even before I read #Manni's comment). It turns out that they're fully justified - using the $&, $` and $' operations cause an overall inefficiency in regexes.
The \E is not critical for this regex; it turns off the 'case-setting' switch \U in this case or \L for lower-case.
As noted in the comments, matching the first character of the string requires:
s/((?:^|\s)\w)/\U$1\E/g;
Corrected position of second close parenthesis - thanks, Blixtor.
Depending on your exact problem, this could be more complicated than you think and a simple regex might not work. Have you thought about capitalization inside the word? What if the word starts with punctuation like '...Word'? Are there any exceptions? What about international characters?
It might be better to use a CPAN module like Text::Autoformat or Text::Capitalize where these problems have already been solved.
use Text::Capitalize 0.2;
print capitalize_title($t), "\n";
use Text::Autoformat;
print autoformat{case => "highlight", right=>length($t)}, $t;
It sounds like Text::Autoformat might be more "standard" and I would try that first. Its written by Damian. But Text::Capitalize does a few things that Text::Autoformat doesn't. Here is a comparison.
You can also check out the Perl Cookbook for recipie 1.14 (page 31) on how to use regexps to properly capitalize a title or headline.
Something like this should do the trick -
s!(^|\s)(\w)!$1\U$2!g
This simply splits up the scanned expression into two matches - $1 for the blank/start of string and $2 for the first character of word. We then substitute both $1 and $2 after making the start of the word upper-case.
I would change the \s to \b which makes more sense since we are checking for word-boundaries here.
This isn't something I'd normally use a regex for, but my solution isn't exactly what you would call "beautiful":
$string = join("", map(ucfirst, split(/(\s+)/, $string)));
That split()s the string by whitespace and captures all the whitespace, then goes through each element of the list and does ucfirst on them (making the first character uppercase), then join()s them back together as a single string. Not awful, but perhaps you'll like a regex more. I personally just don't like \Q or \U or other semi-awkward regex constructs.
EDIT: Someone else mentioned that punctuation might be a potential issue. If, say, you want this:
...string
changed to this:
...String
i.e. you want words capitalized even if there is punctuation before them, try something more like this:
$string = join("", map(ucfirst, split(/(\w+)/, $string)));
Same thing, but it split()s on words (\w+) so that the captured elements of the list are word-only. Same overall effect, but will capitalize words that may not start with a word character. Change \w to [a-zA-Z] to eliminate trying to capitalize numbers. And just generally tweak it however you like.
If you mean character after space, use regular expressions using \s. If you really mean first character in word you should use \b instead of all above attempts with \s which is error prone.
s/\b(\w)/\U$1/g;
You want to match letters behind whitespace, or at the start of a string.
Perl can't do variable length lookbehind. If it did, you could have used this:
s/(?<=\s|^)(\w)/\u$1/g; # this does not work!
Perl complains:
Variable length lookbehind not implemented in regex;
You can use double negative lookbehind to get around that: the thing on the left of it must not be anything that is not whitespace. That means it'll match at the start of the string, but if there is anything in front of it, it must be whitespace.
s/(?<!\S)(\w)/\u$1/g;
The simpler approach in this exact case will probably be to just match the whitespace; the variable length restriction falls away, then, and include that in the replacement.
s/(\s|^)(\w)/$1\u$2/g;
Occasionally you can't use this approach in repeated substitutions because that what precedes the actual match has already been eaten by the regex, and it's good to have a way around that.
Capitalize ANY character preceded by whitespace or at beginning of string:
s/(^|\s)./\u$1/g
Maybe a very sloppy way of doing it because it's also uppercasing the whitespace now. :P
The advantage is that it works with letters with all possible accents (and also with special Danish/Swedish/Norwegian letters), which are problematic when you use \w and \b in your regex. Can I expect that all non-letters are untouched by the uppercase modifier?