Emacs syntax highlighting of nested regular expressions - regex

I'm trying to write an Emacs major mode for working with biological sequence data (i.e. DNA and peptides), and I want to implement syntax highlighting where different letters are colored differently. Since the mode needs to be able to differentiate DNA sequences and amino acid sequences and color them differently, I am putting each sequence in the files on a single line with a single-character prefix (+ or #) that should indicate how the following line should be highlighted.
So, for example, if the file contained a line that read:
+AAGATCCCAGATT
The "A"s should all be in one color that is different from the rest of the line.
I have tried the following as a test:
(setq dna-keyword
'(("^\+\\([GCT\-]*\\(A\\)\\)*" (2 font-lock-function-name-face))
)
)
(define-derived-mode bioseq-mode fundamental-mode
(setq font-lock-defaults '(dna-keyword))
(setq mode-name "bioseq mode")
)
But that only matches the last A instead of all of them.
My first thought was to try to match the whole line with one regexp and then use another regexp to match just the A's within that line, but I have no idea if that's possible in the context of font-lock-mode or how it would be accomplished. Any ideas on how to do something like that, or how to accomplish this in a different way?

Indeed, Emacs provides just what you need to do this with the "anchored match" feature of font-lock mode. The syntax is a bit hairy, but it allows you to specify additional "matchers" (basically a regexp, subexpression identifier and face name) which (by default) will be applied following the position where the main "matcher" regexp finished up to the end of the line. There are more complicated ways of customizing exactly what range of text they apply to, but that's the general idea.
Here's a simple example which also shows how you could define your own faces for the purpose:
(defface bioseq-mode-a
'((((min-colors 8)) :foreground "red"))
"Face for As in bioseq-mode")
(defface bioseq-mode-g
'((((min-colors 8)) :foreground "blue"))
"Face for Gs in bioseq-mode")
(setq dna-keyword
'(("^\\+" ("A" nil nil (0 'bioseq-mode-a)))
("^\\+" ("G" nil nil (0 'bioseq-mode-g)))))
You can also specify two or more anchored matchers for one main matcher (the main matcher here being the regexp "^\\+"). To make this work, each anchored matcher after the first needs to explicitly return to the beginning of the line before beginning its search; otherwise it would only begin highlighting after the last occurrence of the previous anchored matcher. This is accomplished by putting (beginning-of-line) in the PRE-MATCH-FORM slot (element 2 of the list; see below).
(setq dna-keyword
'(("^\\+"
("A" nil nil (0 'bioseq-mode-a))
("G" (beginning-of-line) nil (0 'bioseq-mode-g)))))
I think it's mostly a matter of taste which you prefer; the second way might be slightly clearer code if you have many different anchored matchers for a single line, but I doubt there's a significant performance difference.
Here's the relevant bit of the documentation for font-lock-defaults:
HIGHLIGHT should be either MATCH-HIGHLIGHT or MATCH-ANCHORED.
[....]
MATCH-ANCHORED should be of the form:
(MATCHER PRE-MATCH-FORM POST-MATCH-FORM MATCH-HIGHLIGHT ...)
where MATCHER is a regexp to search for or the function name to call to make
the search, as for MATCH-HIGHLIGHT above, but with one exception; see below.
PRE-MATCH-FORM and POST-MATCH-FORM are evaluated before the first, and after
the last, instance MATCH-ANCHORED's MATCHER is used. Therefore they can be
used to initialize before, and cleanup after, MATCHER is used. Typically,
PRE-MATCH-FORM is used to move to some position relative to the original
MATCHER, before starting with MATCH-ANCHORED's MATCHER. POST-MATCH-FORM might
be used to move back, before resuming with MATCH-ANCHORED's parent's MATCHER.
The above-mentioned exception is as follows. The limit of the MATCHER search
defaults to the end of the line after PRE-MATCH-FORM is evaluated.
However, if PRE-MATCH-FORM returns a position greater than the position after
PRE-MATCH-FORM is evaluated, that position is used as the limit of the search.
It is generally a bad idea to return a position greater than the end of the
line, i.e., cause the MATCHER search to span lines.
I always find that I have to read the font-lock documentation about three times before it starts to make sense to me ;-)

Related

Fastest Way in vbscript to Check if a String Contains a Word/Phrase from a List of Many Words/Phrases

I am implementing a function which is to check a blurb (e.g. a message/forum post, etc) against a (potentially long) list of banned words/phrases, and simply return true if any one or more of the words is found in the blurb, and false if not.
This is to be done in vbScript.
The old developer currently has a very large IF statement using instr() e.g.
If instr(ucase(contactname), "KORS") > 0 OR _
instr(ucase(contactname), "D&G") > 0 OR _
instr(ucase(contactname), "DOLCE") > 0 OR _
instr(ucase(contactname), "GABBANA") > 0 OR _
instr(ucase(contactname), "TIFFANY") > 0 OR _
'...
Then
I am trying to decide between two solutions to replace the above code:
Using regular expression to find matches, where the regex would be a simple (but potentially long) regex like this: "KORS|D&G|DOLCE|GABBANA|TIFFANY" and so on, and we would do a regular expression test to return true if any one or more of the words is found.
Using an array where each array item contains a banned word, and loop through each array item checking it against the blurb. Once a match is found the loop would terminate and a variable would be set to TRUE, etc.
It seems to me that the regular expression option is the best, since it is one "check" e.g. the blurb tested against the pattern. But I am wondering if the potentially very long regex pattern would add enough processing overhead to negate the simplicity and benefit of doing the one "check" vs. the many "checks" in the array looping scenario?
I am also open to additional options which I may have overlooked.
Thanks in advance.
EDIT - to clarify, this is for a SINGLE test of one "blurb" e.g. a comment, a forum post, etc. against the banned word list. It only runs one time during a web request. The benchmarking should test size of the word list and NOT the number of executions of the use case.
You could create a string that contains all of your words. Surround each word with a delimiter.
Const TEST_WORDS = "|KORS|D&G|DOLCE|GABBANA|TIFFANY|"
Then, test to see if your word (plus delimiter) is contained within this string:
If InStr(1, TEST_WORDS, "|" & contactname & "|", vbTextCompare) > 0 Then
' Found word
End If
No need for array loops or regular expressions.
Seems to me (without checking) that such complex regexp would be slower, and also evaluating such complex 'Or' statement wold be slow (VBS will evaluate all alternatives).
Should all alternatives be evaluated to know expression value - of course not.
What I would do, is to populate an array with banned words and then iterate through it, checking if the word is within text being searched - and if word is found discontinue iteration.
You could store the most 'popular' banned words on the top of the array (some kind of rank), so you would be most likely to find them in few first steps.
Another benefit of using array is that it is easier to manage its' values compared to 'hardcoded' values within if statement.
I just tested 1 000 000 checks with regexp ("word|anotherword") vs InStr for each word and it seems I was not right.
Regex check took 13 seconds while InStr 71 seconds.
Edited: Checking each word separately with regexp took 78 seconds.
Still I think that if you have many banned words checking them one by one and breaking if any is found would be faster (after last check I would consider joining them by (5? 10?) and checking not such complex regexp each time).

Emacs Brace and Bracket Highlighting?

When entering code Emacs transiently highlights the matching brace or bracket. With existing code however is there a way to ask it to highlight a matching brace or bracket if I highlight its twin?
I am often trying to do a sanity check when dealing with compiler errors and warnings. I do enter both braces usually when coding before inserting the code in between, but have on occasion unintentionally commented out one brace when commenting out code while debugging.
Any advice with dealing with brace and bracket matching with Emacs?
OS is mostly Linux/Unix, but I do use it also on OS X and Windows.
If you're dealing with a language that supports it, give ParEdit a serious look. If you're not using with a Lisp dialect, it's not nearly as useful though.
For general brace/bracket/paren highlighting, look into highlight-parentheses mode (which color codes multiple levels of braces whenever point is inside them). You can also turn on show-paren-mode through customizations (that is M-x customize-variable show-paren-mode); that one strongly highlights the brace/bracket/paren matching one at point (if the one at point doesn't match anything, you get a different color).
my .emacs currently contains (among other things)
(require 'highlight-parentheses)
(define-globalized-minor-mode global-highlight-parentheses-mode highlight-parentheses-mode
(lambda nil (highlight-parentheses-mode t)))
(global-highlight-parentheses-mode t)
as well as that show-paren-mode customization, which serves me well (of course, I also use paredit when lisping, but these are still marginally useful).
Apart from the answer straight from the manual or wiki, also have a look at autopair.
tried on emacs 26
(show-paren-mode 1)
(setq show-paren-style 'mixed)
enable showing parentheses
set the showing in such as highlit the braces char., or if either one invisible higlight what they enclose
for toggling the cursor position / point between both, put this script in .emacs
(defun swcbrace ()(interactive)
(if (looking-at "(")(forward-list)
(backward-char)
(cond
((looking-at ")")(forward-char)(backward-list))
((looking-at ".)")(forward-char 2)(backward-list))
)))
(global-set-key (kbd "<C-next>") 'swcbrace)
it works toggling by press Control-Pgdn
BTW, for the immediate question: M-x blink-matching-open will "re-blink" for an existing close paren, as if you had just inserted it. Another way to see the matching paren is to use M-C-b and M-C-f (which jump over matched pairs of parens), which are also very useful navigation commands.
I second ParEdit. it is very good atleast for lisp development.
FWIW I use this function often to go to matching paren (back and forth).
;; goto-matching-paren
;; -------------------
;; If point is sitting on a parenthetic character, jump to its match.
;; This matches the standard parenthesis highlighting for determining which
;; one it is sitting on.
;;
(defun goto-matching-paren ()
"If point is sitting on a parenthetic character, jump to its match."
(interactive)
(cond ((looking-at "\\s\(") (forward-list 1))
((progn
(backward-char 1)
(looking-at "\\s\)")) (forward-char 1) (backward-list 1))))
(define-key global-map [(control ?c) ?p] 'goto-matching-paren) ; Bind to C-c p
Declaimer: I am NOT the author of this function, copied from internet.
If you just want to check the balanced delimiters, be them parentheses, square brackets or curly braces, you can use backward-sexp (bound to CtrlAltB) and forward-sexp (bound to CtrlAltF) to skip backward and forward to the corresponding delimiter. These commands are very handy to navigate through source files, skipping structures and function definitions, without any buffer modifications.
You can set the below in your init.el:
(setq show-paren-delay 0)
(show-paren-mode 1)
to ensure matching parenthesis are highlighted.
Note that (setq show-paren-delay 0) needs to be set before (show-paren-mode 1) so that there's no delay in highlighting, as per the wiki.
If you want to do a quick check to see whether brackets in the current file are balanced:
M-x check-parens
Both options tested on Emacs 27.1

problem with using replace-regexp in lisp

in my file, I have many instances of ID="XXX", and I want to replace the first one with ID="0", the second one with ID="1", and so on.
When I use regexp-replace interactively, I use ID="[^"]*" as the search string, and ID="\#" as the replacement string, and all is well.
now I want to bind this to a key, so I tried to do it in lisp, like so:
(replace-regexp "ID=\"[^\"]*\"" "ID=\"\\#\"")
but when I try to evaluate it, I get a 'selecting deleted buffer' error. It's probably something to do with escape characters, but I can't figure it out.
Unfortunately, the \# construct is only available in the interactive call to replace-regexp. From the documentation:
In interactive calls, the replacement text may contain `\,'
followed by a Lisp expression used as part of the replacement
text. Inside of that expression, `\&' is a string denoting the
whole match, `\N' a partial match, `\#&' and `\#N' the respective
numeric values from `string-to-number', and `\#' itself for
`replace-count', the number of replacements occurred so far.
And at the end of the documentation you'll see this hint:
This function is usually the wrong thing to use in a Lisp program.
What you probably want is a loop like this:
(while (re-search-forward REGEXP nil t)
(replace-match TO-STRING nil nil))
which will run faster and will not set the mark or print anything.
Which then leads us to this bit of elisp:
(save-excursion
(goto-char (point-min))
(let ((count 0))
(while (re-search-forward "ID=\"[^\"]*\"" nil t)
(replace-match (format "ID=\"%s\"" (setq count (1+ count)))))))
You could also use a keyboard macro, but I prefer the lisp solution.

in emacs-lisp, how do I correctly use replace-regexp-in-string?

Given a string, I want to replace all links within it with the link's description. For example, given
this is a [[http://link][description]]
I would like to return
this is a description
I used re-builder to construct this regexp for a link:
\\[\\[[^\\[]+\\]\\[[^\\[]+\\]\\]
This is my function:
(defun flatten-string-with-links (string)
(replace-regexp-in-string "\\[\\[[^\\[]+\\]\\[[^\\[]+\\]\\]"
(lambda(s) (nth 2 (split-string s "[\]\[]+"))) string))
Instead of replacing the entire regexp sequence, it only replaces the trailing "]]". This is what it produces:
this is a [[http://link][descriptiondescription
I don't understand what's going wrong. Any help would be much appreciated.
UPDATE: I've improved the regex for the link. It's irrelevant to the question but if someone's gonna copy it they may as well get the better version.
Your problem is that split-string is clobbering the match data, which
replace-regexp-in-string is relying on being unchanged, since it is going to
go use that match data to decide which sections of the string to cut out. This
is arguably a doc bug in that replace-regexp-in-string does not mention that
your replacement function must preserve the match data.
You can work around by using save-match-data, which is a macro provided for
exactly this purpose:
(defun flatten-string-with-links (string)
(replace-regexp-in-string "\\[\\[[a-zA-Z:%#/\.]+\\]\\[[a-zA-Z:%#/\.]+\\]\\]"
(lambda (s) (save-match-data
(nth 2 (split-string s "[\]\[]+")))) string))

Change styling of If Statements in Emacs?

My company has recently added some new code styling rules, and I would was wondering if there is an easy way via emacs to change a couple things w/ a regex replace.
if statements now have to look like the following:
if (expression) {
where I have many that look like so:
if(expression){
lacking the spaces. Is there an easy way to fix this?
You might be able to regexp replace it if expression is always on one line, but I'd use a throw-away function just to be safe:
(defun my-fix-style ()
(interactive)
(save-excursion
(goto-char (point-min))
(while (re-search-forward "\\_<if(" nil t)
(backward-char)
(insert " ")
(forward-sexp)
(unless (looking-at "[ \t\n]")
(insert " ")))))
Just my two cents, but if I was going to do this with emacs, I'd probably avoid regular expressions and approach it in two parts. First I'd do the search to replace if( with if ( by doing a Meta-% "if(" "if (" The quote marks are just for delineation, they don't belong in the entered text. Then either answer each individual replacement query, or give it a "!" to tell it to do all replacements. Repeat the process for the closing ){ to ) {.
Off the top of my head, I'd expect the first substituion to work without issue. The second one will also get "){" combinations in loops, but if your new standard demands a space for if statements, I'd expect it to do so for loops as well, so that seems like it should be a good thing.