Represent "less-than" symbol in Sublime Text regex in syntax definition - regex

I'm working with an ancient pre-XML markup that uses codes of the form "$=x", where x may be an alphabetic character or a symbol on the keyboard, such as ; (semicolon), ? (question mark), or < (left angle bracket, aka less-than). [Note after editing: the question as originally phrased said "right angle bracket, aka greater-than" here; that confusion goes to the heart of the problem. See my comment to the accepted answer. RS]
So I've modified a copy of XML.tmLanguage syntax definition file in my User folder to identify the eleven different categories that these codes represent, so I can easily see them in the large text files (which also contain XML markup) I'm working with.
For all the symbols except < I'm able to escape the symbol by preceding it with a backslash. But in the Boost regex engine that ST2 uses, \< is how you indicate that you want to match only at the start of a word. Consequently I've been unable to get this code to be properly recognized and highlighted.
I've looked everywhere for how to escape the < symbol in this circumstance. I've tried preceding it with 0, 1, 2, 3 and 4 back-slashes; and I also tried using the hexadecimal escape code \x{3009}. [Note: this is the code for greater-than instead of less-than.]
All in vain. (A few alternatives didn't generate an error message but also didn't highlight the code.)
Because the codes I'm working with need to be colored differently, I can't use a generic symbol in lieu of <, and I can't specify it either. How do I get this?

The tmLanguage file is written in XML, so Sublime Text feeds it through an XML parser first, before giving pieces to its regex engine's parser.
XML uses < to open tags such as <string>, so you can't use it directly as a character. Instead, there are these standard character references:
&amp; for & (ampersand)
&lt; for < (less than)
&gt; for > (greater than; not required)
&quot; for " (quote mark; only required in attribute values quoted with ")
&apos; for ' (apostrophe; only required in attribute values quoted with ')
So use <string>\$=&lt;</string> in the syntax file. When Sublime Text reads the file, its own XML parser will turn this into \$=< for the regex parser.
Backslash sequences don't help because the XML parser passes them through unchanged to the regex parser, which then sees \< or \\, neither of which is what you want.
\x{3008} is passed by the XML parser to the regex parser, where it's decoded to 〈, a character which looks somewhat similar to < but doesn't match it. \x3C would work though.
By the way, tmLanguage files use plist (property list) XML, so you can convert it to a format that's easier to edit, or use a plist editor such as http://tustin2121.github.io/jsPlistor/ (from Is there any online .plist editor?).
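
As a rough illustration of the two parsing stages described in this answer, here is a minimal Python sketch. It uses Python's plistlib and re modules as stand-ins for Sublime Text's own XML parser and Boost regex engine (an assumption made purely for demonstration), and the one-key plist is a made-up fragment, not a complete syntax definition.

```python
import plistlib
import re

# Hypothetical one-rule plist fragment; a real .tmLanguage file has many more keys.
PLIST = rb"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>match</key>
    <string>\$=&lt;</string>
</dict>
</plist>
"""

# Stage 1: the XML parser decodes &lt; into a literal <.
pattern = plistlib.loads(PLIST)["match"]

# Stage 2: the regex engine receives \$=< and matches the code $=<.
found = re.search(pattern, "some text $=< more text")

# The hexadecimal escape \x3C names the same character, so it matches too.
found_hex = re.search(r"\$=\x3C", "some text $=< more text")
```

The point of the sketch is only that the entity is resolved before the regex engine ever sees the pattern; the regex dialects of Python and Boost differ in other respects, such as the meaning of \<.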

Try using the entity &gt; in the syntax file (or &lt; for the less-than sign).

Related

gvim syntax highlight for different types of lines

I've written several syntax highlighting files for simple custom formats in the past (in effect, even changing the format a bit so that I could write the syntax file with the skills I have).
But this time I'm confused and would appreciate some help.
The file format is (obviously) a text file where every line contains three distinct elements separated by spaces. Each can be a "symbol" (a name made of alphanumeric characters plus hyphens) or a "string" (a series of any characters, spaces included, but not pipes).
Strings can appear only at the start or end of a line; the middle element can only be a symbol. A string is delimited by a pipe at the end if it is the first element, and by a pipe at the start if it is the last element.
But a line can also be all symbols, string first and the rest symbols, or string last and the rest symbols.
In other words, a first-element string is always followed by a pipe, and a last-element string is always prefixed by one.
Examples:
All symbols
this-is-a-symbol another-one and-another
First string
This is a string potentially containing any char| symbol symbol
Last string
symbol symbol |A string at the end of the line
First and last as strings
This is a string| now-we-have-a-symbol |And here another string
These four examples are the only valid forms.
The symbols need to be colored by position: one color for the first element, one for the second, and one for the third.
Strings, however, get a single distinct color regardless of position.
If the pipe chars can be "dimmed" with a color similar to (but not precisely the same as) the background, that would be a big plus. But I think I can manage this myself.
A line in the file that doesn't match one of the forms shown should be highlighted as an error (e.g. red background).
Some help?
ps: stackoverflow applies a sort of syntax highlighting to my examples, which can be misleading
I have found a simpler approach than what I initially thought was necessary in terms of regular expressions. In the end I just need to match the first element and the last; how did I not think of that... So this is my solution, which seems to work well for my needs. The only thing it doesn't do is highlight badly formatted lines. Good enough for now. Thanks for the patience and the attention.
" Vim syntax file
" Language: ff .txt
if exists("b:current_syntax")
  finish
endif
setlocal iskeyword+=:
syn match Asymbol /^[a-zA-Z0-9\-]* /
syn match Csymbol / [a-zA-Z0-9\-]*$/
syn match Astring /^.*| /
syn match Cstring / |.*$/
highlight link Asymbol Constant
highlight link Csymbol Statement
highlight link Astring Include
highlight link Cstring Comment
let b:current_syntax = "ff"
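
Since the syntax file above does not flag badly formatted lines, one possible complement is to check the file outside Vim. The following is a minimal Python sketch, not part of the original solution; the names SYMBOL, LINE and classify are made up for illustration, and it assumes that elements are separated by exactly one space and that strings are non-empty.

```python
import re

# A symbol: alphanumeric characters plus hyphens.
SYMBOL = r"[A-Za-z0-9-]+"

# First element: string followed by a pipe, or a symbol.
# Middle element: always a symbol.
# Last element: pipe followed by a string, or a symbol.
LINE = re.compile(
    rf"(?:(?P<first_string>[^|]+)\||(?P<first_symbol>{SYMBOL}))"
    rf" (?P<middle_symbol>{SYMBOL}) "
    rf"(?:\|(?P<last_string>[^|]+)|(?P<last_symbol>{SYMBOL}))"
)

def classify(line):
    """Return the named parts of a valid line, or None for an error line."""
    match = LINE.fullmatch(line)
    if match is None:
        return None
    return {name: text for name, text in match.groupdict().items() if text is not None}
```

Calling classify on each line and reporting those that return None gives a list of the lines that would need the error highlight.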

Catch a special character with regex

I have an XML file, and I have to match the characters < and > inside the tags and replace them, but I'm having difficulty catching them...
The XML is something like this:
<tag>text</tag>
<tag2>3 is > than 2</tag2>
<tag3>But 1 in < than 4</tag3>
I found a solution using this regex
(\s>\s|\s<\s)
that is, a whitespace, the character, and another whitespace... but what if there is no whitespace?
Edit: In fact I need to replace these symbols with &lt; and &gt;...
The XML fields are obtained from third-party software, which produces an output XML file like the one I've written above.
I know that the best approach would be for the software to encode < and > as &lt; and &gt; when it reads the data, but I hoped that there was a way to do it afterwards.
So basically you are receiving incorrectly formed XML and you want to replace < and > with &lt; and &gt;.
Bad news. It is not possible to do this with a regex in a way that works for XML in general. Try building a parser.
Good news. If you introduce some limitations (i.e. if the data you are receiving complies with some requirements), there may be some good solutions.
You need a way to distinguish which symbols are part of the tags, and which symbols are part of the content.
For example, if you assume that tag names contain only letters and numbers, with no spaces (or other symbols) in between, something like
(?<lt><)(?:(?!\/?[[:alnum:]]*>))|(?:\s[[:alnum:]]*)(?<gt>>)
could probably work. You can play with it in https://regex101.com/r/uF0iR2/2
It is the alternation (|) of two sub-patterns. The first matches a < that is not followed by the rest of a tag. The second matches a > that is preceded by something containing a space. We could drop the negative lookahead (?!...), but then the first sub-pattern could collide with the second. We cannot use a negative look-behind here because look-behinds cannot contain quantifiers.
Finally, unrelated, another possibility for (\s>\s|\s<\s) is (\s[<>]\s)
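
Under the same limitation (tag names consist only of letters and digits), the repair can also be sketched in Python by splitting the text on anything that looks like a tag and escaping the rest. This is a minimal sketch under that assumption, not a general XML fixer; TAG and escape_stray_brackets are hypothetical names, and content that itself looks like a tag (for example "a <b> c") would be left untouched.

```python
import re

# Assumption: a tag is <name> or </name>, where name is only ASCII letters and digits.
TAG = re.compile(r"(</?[A-Za-z0-9]+>)")

def escape_stray_brackets(xml_text):
    """Escape < and > everywhere except inside substrings that look like tags."""
    pieces = TAG.split(xml_text)
    # With one capturing group, split() alternates: text, tag, text, tag, ...
    for i in range(0, len(pieces), 2):
        pieces[i] = pieces[i].replace("<", "&lt;").replace(">", "&gt;")
    return "".join(pieces)
```

Because the brackets are found by position relative to the tags rather than by surrounding whitespace, this also handles input such as <t>1<2</t>.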

tinyXML lib cannot read ‘&’ properly

I have an XML file with some lines like the following:
<rule pat="&&&&&&&&&&&&&&&(?<B>B) ?(?<AND>&) ?(?<E>E)">
I use the TinyXML library in C++ to parse this XML file, but when I try to get the 'pat' attribute of such lines, TinyXML simply ignores every occurrence of the character &. That is, the result read by TinyXML looks like:
(?<B>B) ?(?<AND>) ?(?<E>E)
with all & missing!
This character is part of my regular expression pattern, so this leads to further errors in my program.
Does anyone have any idea why the character & is so SPECIAL that TinyXML cannot read it? Is even a standalone & dropped?
That's because that is not a valid XML file. You can't just stick an & character anywhere in XML. You have to escape it with entities:
&amp;
TinyXML will only read valid XML files (or at least mostly valid ones).
Similarly, you need to escape the < and > characters too, with &lt; and &gt;.
That's not well-formed XML. If you want an & character, you need to put &amp;.
In XML, & is represented as &amp;
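
To see what the escaped attribute would look like, here is a minimal Python sketch that uses the standard library instead of TinyXML (an assumption for illustration only; the pattern is a shortened version of the one in the question).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# A regex like the one in the question (shortened for illustration).
pattern = "(?<B>B) ?(?<AND>&) ?(?<E>E)"

# quoteattr() escapes &, < and > and adds the surrounding quotes.
xml_text = "<rule pat=" + quoteattr(pattern) + "/>"

# A conforming parser decodes the entities back to the original characters.
round_tripped = ET.fromstring(xml_text).get("pat")
```

Here xml_text is the form that belongs in the file, and round_tripped is what the program sees after parsing, which is the original pattern again.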

Can someone explain which characters need to be escaped in tinyxml?

I am using tinyxml to save input from a text ctrl. The user can copy whatever they like into the text box and it gets written to an xml file. I'm finding that the new lines don't get saved and neither do & characters. The weird part is that tinyxml just discards them completely without any warning. If I put a & into the textbox and save, the tag will look like:
<textboxtext></textboxtext>
newlines completely disappear as well. No characters whatsoever are stored. What's going on? Even if I need to escape them with &amp or something, why does it just discard everything? Also, I can't find anything on google regarding this topic. Any help?
EDIT:
I found this topic which suggest the discarding of these characters may be a bug.
TinyXML and preserving HTML Entities
It is, apparently, a bug in TinyXml.
The simple workaround is to escape anything that it might not like:
&, ", ', < and > get their regular XML entity encodings (&amp;, &quot;, &apos;, &lt; and &gt;)
strange characters (read: anything other than alphanumerics and regular punctuation) are best translated to their Unicode code point: &#....;
Remember that TinyXml is above all a lightweight XML library, not a full-fledged beast.
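
The workaround above could look roughly like this before the text is handed to the XML library. This is a minimal Python sketch rather than TinyXML code; to_xml_safe is a hypothetical helper, and treating newlines as "strange characters" that become &#10; is an assumption based on the question.

```python
from xml.sax.saxutils import escape

def to_xml_safe(text):
    """Escape XML metacharacters, then turn newlines and non-ASCII into &#N; references."""
    # escape() handles & first, so the references added for newlines are not re-escaped.
    escaped = escape(text, {'"': "&quot;", "'": "&apos;", "\n": "&#10;"})
    # Any remaining non-ASCII character becomes a decimal character reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

For example, to_xml_safe("a & b") gives "a &amp; b", which a conforming parser decodes back to the original text.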

Converting C++ code to HTML safe

I decided to try http://www.screwturn.eu/ wiki as a code snippet storage utility. So far I am very impressed, but what irks me is that when I copy-paste the code I want to save, '<'s and '[' (http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Character_references) invariably screw up the output, as the wiki interprets them as either wiki or HTML tags.
Does anyone know a way around this? Or failing that, know of a simple utility that would take C++ code and convert it to HTML safe code?
You can use the ##...## tag to escape the code and automatically wrap it in PRE tags.
Surround your code in <nowiki> .. </nowiki> tags.
I don't know of utilities, but I'm sure you could write a very simple app that does a find/replace. To display angle brackets, you just need to replace > and < with &gt; and &lt; respectively. As for the square brackets, that is a wiki specific problem with the markdown methinks.
Dario Solera wrote "You can use the ##...## tag to escape the code and automatically wrap it in PRE tags."
If you don't want it wrapped just use: <esc></esc>
List of characters that need escaping:
< (less-than sign)
& (ampersand)
[ (opening square bracket)
Have you tried wrapping your code in html pre or code tags before pasting? Both allow any special characters (such as '<') to be used without being interpreted as html. pre also honors the formatting of the contents.
example
<pre>
if (foo <= bar) {
    do_something();
}
</pre>
To post C++ code on a web page, you should convert it to valid HTML first, which will usually require the use of HTML character entities, as others have noted. This is not limited to replacing < and > with &lt; and &gt;. Consider the following code:
unsigned int maskedValue = value&mask;
Uh-oh, does the HTML DTD contain an entity called &mask;? Better replace & with &amp; as well.
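
A simple utility along these lines could be sketched in Python as follows. The helper name cpp_to_html is made up, and the handling of [ is an assumption: it guesses that the wiki accepts the numeric reference &#91; in place of a literal opening square bracket.

```python
import html

def cpp_to_html(code):
    """Return C++ source escaped for HTML and wrapped in <pre> tags."""
    # html.escape() replaces &, < and > (and quotes by default) with entities.
    escaped = html.escape(code)
    # Assumption: the wiki also treats [ as markup, so use a numeric reference for it.
    escaped = escaped.replace("[", "&#91;")
    return "<pre>" + escaped + "</pre>"
```

Escaping & first (which html.escape does internally) matters, because otherwise the & in the freshly produced entities would be escaped a second time.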
Going in an alternate direction, you can get rid of [ and ] by replacing them with the trigraphs ??( and ??). In C++, trigraphs and digraphs are character sequences that stand in for characters not available in all character sets. Most C++ programmers are unlikely to recognize them, though, and trigraphs were removed from the language in C++17.