emacs major-mode define font-lock for line preceding regexp - regex

I'm working on making a custom Emacs major mode, but I'm completely unfamiliar with Lisp, so I'm struggling. I'm trying to add a font-lock rule such that a line of repeating '=' or '-' is highlighted, along with the line above it (so that I can use these as headings), i.e.
This is a Colored Heading
=========================
this is a differently-colored sub-heading
-----------------------------------------
I've tried to set this up with:
(font-lock-add-keywords nil '(("\\(.*\n=\{3,\}\\)"
1 font-lock-warning-face prepend)))
but it isn't working. I thought this meant:
'.*' any characters
'\n' followed by a newline
'=\{3,\}' followed by 3 or more '=' characters
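Where am I going wrong?

In an Emacs Lisp string, "\{" and "\}" are not escape sequences: the Lisp reader drops the backslash and hands a plain "{" or "}" to the regexp engine, which then treats the braces as literal characters instead of an interval.
You need to use "\\{" and "\\}" instead: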
(font-lock-add-keywords nil '(("\\(.*\n=\\{3,\\}\\)"
1 font-lock-warning-face prepend)))
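To see what the corrected pattern is meant to match, here is a minimal sketch in Python rather than Emacs Lisp. Python's re module writes groups and intervals without backslashes, so only the matching logic carries over, not the escaping. The sample text and the name HEADING are assumptions made for illustration.

```python
import re

# Same idea as the Emacs pattern: any line, a newline, then 3 or more '='.
HEADING = re.compile(r"^(.*\n={3,})$", re.MULTILINE)

text = (
    "This is a Colored Heading\n"
    "=========================\n"
    "plain body text\n"
)

# Each match spans two lines: the heading text and its underline.
matches = HEADING.findall(text)
print(matches)
```

Because the match spans a newline, the heading line and the underline are captured together as one group.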

Related

Regex which grabs everything between two characters at the end of a line

I'm looking to create a regex which grabs the text between two ":"s but only if it is the "last set", for example:
\--- org.codehaus.groovy.modules.http-builder:http-builder:0.7.1
should return:
http-builder
It should be noted that it's possible to get something like:
\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1
because the input does not necessarily follow conventions (based on the problem at hand) but the required information is ALWAYS in the last two ":"s.
I've tried some of the following (minus the end of line):
1) (?<=\:).*(?=\:)
2) [^(.*:)].*[^(:.*)]
3) :.*: (this was the most successful, although I got the ":"s with the result but there are issues when there is more than one set of ":"s)
Further information:
I need to use Groovy for this
I can read it using a stream or a file (in case that matters)
Thanks for reading and any help!
:([^:]*):[^:]*$
That means:
Sequence must start with a :
Then start capturing (
Capture all characters that are not colons [^:]*
End capturing ) ...
... at the next colon :
Then there's another sequence of chars [^:]*
And after that sequence the line must end $ (no more sequence)
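Or, if your engine supports non-greedy (lazy) matches, you can also use
:(.*?):[^:]*$
.* means capture as many characters as possible, while .*? means capture as few characters as possible. Not all regex implementations support that, though. Note that a lazy match still starts at the leftmost colon, so this variant only returns the right segment when the line contains exactly two colons; for input like the org::codehaus example, use the first pattern.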
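The two patterns can be compared side by side. The sketch below uses Python only for illustration (the question asks for Groovy, whose Java-based regex engine accepts the same two patterns); the helper name between_last_colons is hypothetical. It shows that the negated-class version handles both sample lines, while the lazy version captures too much once there are more than two colons.

```python
import re

LAST_PAIR = re.compile(r":([^:]*):[^:]*$")  # negated-class version
LAZY = re.compile(r":(.*?):[^:]*$")         # lazy version

lines = [
    r"\--- org.codehaus.groovy.modules.http-builder:http-builder:0.7.1",
    r"\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1",
]

def between_last_colons(line, pattern=LAST_PAIR):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(line)
    return m.group(1) if m else None

results = [between_last_colons(l) for l in lines]
lazy_results = [between_last_colons(l, LAZY) for l in lines]
print(results)
print(lazy_results)
```

The lazy quantifier keeps the capture short only relative to where the match started; it does not move the starting colon to the right.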
How about splitting on the : and grabbing the next-to-last segment?
['org.codehaus.groovy.modules.http-builder:http-builder:0.7.1',
/\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1/].each { line ->
assert 'http-builder' == line.split(':')[-2]
}

I need help with a regex

I have some MATLAB code, and one of the lines doesn't have a (;) at the end.
I want to find that line.
For a start:
sad= sdfsdf ; %this is comment
sad = awaww ;
n= sdfdsfd ;
m = (asd + adsf(asd,asd)) %this is comment
Let's say I want to find the 4th line, because it doesn't have a (;) at the end of the line.
So far I'm stuck at this:
/(^[-a-zA-Z0-9]+\s*=[-a-zA-Z0-9#:%,_\+.()~#?&//= ]+)(?!;)$/gim
This works fine: it finds the fourth line only.
But what if I want to allow a (;) in the middle of the line, as long as there is none at the end of the code or just before the comment?
w=sss (;)aaa **;** % i dont want this line to be selected
w=sss (;)aaa %i want this line to be selected
http://regexr.com/3cfor
Well, let's find all lines which end with a semicolon:
^.+?;
optionally followed by horizontal whitespace:
^.+?;[ \t]*
and an optional comment:
^.+?;[ \t]*(?:%.*)?
This expression easily matches all the lines you don't want. So, invert it:
^(?!.+?;[ \t]*(?:%.*)?$).+
Unfortunately, that's too easy. It fails to match lines which contain a semicolon in a comment. We could replace .+? with [^%\r\n]+? but this would fail on lines containing a % in a string.
If you need a more robust pattern, you'll have to account for all of this.
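As a quick check of the simple inverted pattern, here is a sketch in Python (chosen only for illustration; the pattern itself is unchanged). The last sample line is an assumption added to show the limitation described above: a semicolon at the end of a comment hides a missing terminator.

```python
import re

# Lines that do NOT end with ';' followed by optional blanks and a % comment.
MISSING_SEMI = re.compile(r"^(?!.+?;[ \t]*(?:%.*)?$).+", re.MULTILINE)

code = "\n".join([
    "sad= sdfsdf ; %this is comment",
    "sad = awaww ;",
    "n= sdfdsfd ;",
    "m = (asd + adsf(asd,asd)) %this is comment",
    "w=sss (;)aaa ; % should not be selected",
    "w=sss (;)aaa %should be selected",
    "y = 2 % no terminator; but the comment ends with one;",
])

flagged = MISSING_SEMI.findall(code)
for line in flagged:
    print(line)
```

The line starting with y = 2 is not flagged even though its code part has no semicolon, which is the weakness the more robust pattern addresses.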
So let's start the same way, by defining what a "correct" line should look like. I'll use the PCRE syntax for atomic grouping, so you'll have to use perl = TRUE.
A string is: '(?>[^']+|'')*'
Other code (except string, comments and semicolons) is covered by: [^%';\r\n]+
So "normal" code is:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?
Then, we add the required semicolon and optional comment:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$
Finally, we invert all of this:
^(?!(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$).+
And we have the final pattern.
You don't need to fully tokenize the input, you only have to recognize the different "lexer modes". I hope handling strings and comments is enough, but I didn't check the Matlab syntax thoroughly.
You could use this with other regex engines that do not support atomic groups by replacing (?> with (?: but you'll expose yourself to the catastrophic backtracking problem.
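If atomic groups are not available, the same "lexer modes" idea can be written as a small hand-rolled scanner instead of a regex. This is a swapped-in technique, not the pattern above, and it makes the same simplifying assumptions: strings are single-quoted with '' as an escaped quote, and % starts a comment. The function name lacks_semicolon is hypothetical, and unlike the regex it ignores comment-only lines.

```python
def lacks_semicolon(line: str) -> bool:
    """Return True if the code part of `line` is non-empty and does not end with ';'."""
    code = []
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            code.append(ch)
            if ch == "'":
                if line[i + 1:i + 2] == "'":
                    # '' inside a string is an escaped quote: consume both.
                    code.append("'")
                    i += 1
                else:
                    in_string = False
        elif ch == "'":
            in_string = True
            code.append(ch)
        elif ch == "%":
            break  # comment mode: the rest of the line is ignored
        else:
            code.append(ch)
        i += 1
    stripped = "".join(code).rstrip(" \t")
    return bool(stripped) and not stripped.endswith(";")


samples = [
    "sad= sdfsdf ; %this is comment",
    "m = (asd + adsf(asd,asd)) %this is comment",
    "s = 'a;b%c' % note;",
    "s = 'it''s'; % ok",
    "% only a comment",
]
flags = [lacks_semicolon(s) for s in samples]
print(flags)
```

A scanner like this runs in linear time, so it avoids the backtracking concern entirely, at the cost of being longer than a single pattern.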

emacs function re-search-forward interpreting \( \) group characters literally in regexp

I successfully used replace-regexp interactively to replace every instance of quoted text in the buffer shown below with a non-quoted version. The regexp I searched for was
\"\([^\"]*\)\"
and the NEWTEXT I inserted was \1.
* "PROJECT START"
:PROPERTIES:
:ID: 1
:Unique_ID: 17
:DURATION: "0 days"
:TYPE: "Fixed Work"
:OUTLINE_LEVEL: 1
:END:
Interactively, the above text was turned into the text below.
* PROJECT START
:PROPERTIES:
:ID: 1
:Unique_ID: 17
:DURATION: 0 days
:TYPE: Fixed Work
:OUTLINE_LEVEL: 1
:END:
I tried to do this same search and replace programmatically by inserting the following two lines
(while (re-search-forward "\"\([^\"]*\)\"" nil t)
(replace-match "\1" nil nil ))
at the top of the buffer and executing, but it simply returned nil without finding a single match.
When I omit the
\( \)
grouping and replace \1 with \&
(while (re-search-forward "\"[^\"]*\"" nil t)
(replace-match "\&" nil nil ))
I get every quoted string replaced with '&'.
* &
:PROPERTIES:
:ID: 1
:Unique_ID: 17
:DURATION: &
:TYPE: &
:OUTLINE_LEVEL: 1
:END:
Everything I've seen in the documentation for both of these functions indicates that they should recognize these special characters, and the examples of its use in responses to other questions on this forum use these special characters.
Can anyone help me understand why the grouping and \&, \N, \ characters aren't being interpreted correctly?
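You need to double the backslashes for "\(", "\)", and "\1", because the Lisp string reader consumes one level of backslashes before the regexp functions see the string. I.e.: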
(while (re-search-forward "\"\\([^\"]*\\)\"" nil t)
(replace-match "\\1" nil nil ))
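The same two layers of escaping (string literal first, regex second) exist in other languages. As a rough parallel only, here is a Python sketch; note that Python's regex syntax uses unescaped parentheses for groups, unlike Emacs, and that raw strings play the role that doubled backslashes play in Emacs Lisp. The sample text is shortened from the buffer in the question.

```python
import re

text = '* "PROJECT START"\n:DURATION: "0 days"\n:TYPE: "Fixed Work"'

# Raw strings pass the backslash through to the regex engine untouched,
# so \1 is seen as a back-reference to group 1.
unquoted = re.sub(r'"([^"]*)"', r"\1", text)

# Without the raw prefix, "\1" is already the single control character
# U+0001 by the time re.sub sees it, so no back-reference is made.
broken = re.sub(r'"([^"]*)"', "\1", text)

print(unquoted)
print(repr(broken))
```

In both languages the fix is the same in spirit: make sure the backslash survives the string literal so the regex layer can interpret it.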

Strip comments from text except for comment char between quotes

I'm trying to build a regexp for removing comments from a configuration file. Comments are marked with the ; character. For example:
; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment
The difficulty I have is ignoring the comment character when it's placed between quotes.
Any ideas?
You could try matching a semicolon only if it's followed by an even number of quotes:
;(?=(?:[^"]*"[^"]*")*[^"]*$).*
Be sure to use this regex with the Singleline option turned off and the Multiline option turned on.
In Python:
>>> import re
>>> t = """; This is a comment line
... keyword1 keyword2 ; comment
... keyword3 "key ; word 4" ; comment"""
>>> regex = re.compile(';(?=(?:[^"]*"[^"]*")*[^"]*$).*', re.MULTILINE)
>>> regex.sub("", t)
'\nkeyword1 keyword2 \nkeyword3 "key ; word 4" '
No regex :)
$ grep -E -v '^;' input.txt
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment
You may use a regexp to pull all quoted strings out first, replace them with placeholders, then simply cut off everything matching ;.* and finally put the strings back :)
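A minimal sketch of that placeholder approach in Python, assuming double-quoted strings that do not span lines and input that never contains a NUL character (used here as the placeholder delimiter). The function name strip_comments is hypothetical.

```python
import re

def strip_comments(text: str) -> str:
    strings = []

    def stash(match):
        # Remember the quoted string and leave a numbered placeholder behind.
        strings.append(match.group(0))
        return f"\x00{len(strings) - 1}\x00"

    masked = re.sub(r'"[^"\n]*"', stash, text)   # 1. hide quoted strings
    masked = re.sub(r";.*", "", masked)          # 2. cut comments to end of line
    # 3. restore the strings that survived the cut
    return re.sub(r"\x00(\d+)\x00", lambda m: strings[int(m.group(1))], masked)


config = (
    "; This is a comment line\n"
    "keyword1 keyword2 ; comment\n"
    'keyword3 "key ; word 4" ; comment'
)
cleaned = strip_comments(config)
print(cleaned)
```

Quoted text inside a comment is stashed too, but its placeholder is removed along with the comment, so it never comes back.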
Something like this:
("[^"]*")*.*(;.*)
First, match any number of quoted strings, then match a ;. If the ; is between quotes it will be matched by the first group, not by the second group.
I (somewhat accidentally) came up with a working regex:
replace(/^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm, '$1')
I wanted:
remove single line comments at start of line or end of line,
to use single and double quotes,
the ability to have just one quote in a comment: that's useful (but accept " as well)
(so matching on a balanced set (even number) of quotes after a comment-delimiter as in Tim Pietzcker's answer was not suitable),
leave comment-delimiter ; alone in correctly (closed) quoted 'strings'
mix quoting style
multiple quoted strings (and comments in/after comments)
nest single/double quotes in resp. double/single quoted 'strings'
data to work on is like valid ini-files (or assembly), as long as it doesn't contain escaped quotes or regex-literals etc.
Lacking look-back on javascript I thought it might be an idea to not match comments (and replace them with ''), but match on data preceding the comment and then replace the full match data with the sub-match data.
One could envision this concept on a line by line basis (so replace the full line with the match, thereby 'losing' the comment), BUT the multiline parameter doesn't seem to work exactly that way (at least in the browser).
[^'";]* starts eating any characters from the 'start' that are not '";.
(Completely counter-intuitive (to me), [^'";\r\n]* will not work.)
(?:'[^']*'|"[^"]*")? is a non-capturing group matching zero or one set of quote any chars quote (and (?:(['"])[^\2]*\2)? in /^((?:[^'";]*(?:(['"])[^\2]*\2)?)*)[ \t]*;.*$/gm or
(?:(['"])[^\2\r\n]*\2)? in /^((?:[^'";]*(?:(['"])[^\2\r\n]*\2)?)*)[ \t]*;.*$/gm (although mysteriously better) do not work (broke on db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as), but not adding another capturing group for re-use in the match is a good thing as they come with penalties anyway).
The above combo is placed in a non-capturing group which may repeat zero or more times, and its result is placed in capturing group 1 to pass along.
That leaves us with [ \t]*;.* which 'simply' matches zero or more spaces and tabs followed by a semicolon, followed by zero or more chars that are not a new line. Note how ; is NOT optional !!!
To get a better idea of how this (multi-line parameter) works, hit the exp button in the demo below.
function demo(){
var elms=document.getElementsByTagName('textarea');
var str=elms[0].value;
elms[1].value=str.replace( /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm
, '$1'
)
.replace( /[ \t]*$/gm, ''); //optional trim
}
function demo_exp(){
var elms=document.getElementsByTagName('textarea');
var str=elms[0].value;
elms[1].value=str.replace( /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm
, '**S**$1**E**' //to see start and end of match.
);
}
<textarea style="width:98%;height:150px" onscroll="this.nextSibling.scrollTop=this.scrollTop;">
; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment
"Text; in" and between "quotes; plus" semicolons; this is the comment
; This is a comment line
keyword1 keyword2 ; comment
keyword3 'key ; word 4' ; comment and one quote ' ;see it?
_b64decode:
db 0x83,0xc6,0x3A ; add si, b64decode_end - _b64decode ;39
push 'a'
pop di
cmp byte [si], 0x2B ; '+'
b64decode_end:
;append base64 data here
;terminate with printable character less than '+'
db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as;df'" ;'haha"
;"end'
</textarea><textarea style="width:98%;height:150px" onscroll="this.previousSibling.scrollTop=this.scrollTop;">
result here
</textarea>
<br><button onclick="demo()">remove comments</button><button onclick="demo_exp()">exp</button>
Hope this helps.
PS: Please comment with valid examples if and where this might break! Since I generally agree (from extensive personal experience) that it is impossible to reliably remove comments using regex (especially for higher-level programming languages), my gut still says this can't be fool-proof. However, I've been throwing existing data and crafted 'what-ifs' at it for over 2 hours and couldn't get it to break (which I'm usually very good at).

How to color @ (at symbol) in Emacs?

I can color keywords in emacs using the following lisp code in .emacs:
(add-hook 'c-mode-common-hook
(lambda () (font-lock-add-keywords nil
'(("\\<\\(bla[a-zA-Z1-9_]*\\)" 1 font-lock-warning-face t)))))
This code colors all keywords that start with "bla". Example: blaTest123_test
However when I try to add @ (the 'at' symbol) instead of "bla", it doesn't seem to work. I don't think @ is a special character for regular expressions.
Do you know how I can get emacs to highlight keywords starting with the @ symbol?
Your problem is the \< in your regexp, which
matches the empty string, but only at the beginning of a word. `\<' matches at the beginning of the buffer (or string) only if a word-constituent character follows.
and @ is not a word-constituent character.
See: M-: (info "(elisp) Regexp Backslash") RET
This unrestricted pattern will colour any @:
(font-lock-add-keywords nil
'(("@" 0 font-lock-warning-face t)))
And this will do something like what you want, by requiring either BOL or some white space immediately beforehand.
(font-lock-add-keywords nil
'(("\\(?:^\\|\\s-\\)\\(@[a-zA-Z1-9_]*\\)" 1 font-lock-warning-face t)))
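The same effect can be reproduced outside Emacs. The sketch below is in Python, which has no \< operator; a word boundary followed by a word character (\b(?=\w)) is used as a rough stand-in, and that equivalence is an assumption for illustration. It uses @, the 'at' symbol the question describes, and the sample string is made up.

```python
import re

sample = "@foo bar @baz a@b"

# Rough analogue of Emacs's \< (start of word): the position must be
# followed by a word character. '@' is not one, so this can never match.
at_word_start = re.findall(r"\b(?=\w)@\w*", sample)

# Requiring start of line or whitespace before '@' works instead,
# mirroring the \\(?:^\\|\\s-\\) prefix in the Emacs pattern.
after_space = re.findall(r"(?:^|\s)(@[A-Za-z0-9_]*)", sample)

print(at_word_start)
print(after_space)
```

The a@b case is skipped by the second pattern because the @ there is preceded by a letter rather than whitespace.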