Use alternate syntax highlighting in middle of TextMate2 comment - regex

By the very nature of a comment, this might not make sense.
On the other hand, what I'm trying to achieve is not too different from an escape character.
As a simple example, I want
# comment :break: comment
to show up more like
#comment
"break"
# comment
would, but without the second #, everything is on the same line, and instead of quotes I have some other escape character. Although, like quotes (and unlike escape characters that I'm familiar with [e.g., \]), I intend to explicitly indicate the beginning and the end of the interruption to the comment.
Thanks to @Graham P Heath, I was able to achieve alternate forms of comments in this question. What I'm after is an enhancement to what was achieved there. In my scenario, # is a comment in the language I'm using (R), and #' functions both as an R comment and as the start of code in another language. Now, I can get everything after the #' to take on syntax highlighting that is different from the typical R comment, but I'm trying to get a very modest amount of syntax highlighting in this sub-language (#' actually indicates the start of Markdown, and I want the "raw" syntax highlighting for text surrounded by a pair of backticks).
The piece of the language grammar that I'm trying to interrupt is as follows:
{ begin = '(^[ \t]+)?(?=#'' )';
end = '(?!\G)';
beginCaptures = { 1 = { name = 'punctuation.whitespace.comment.leading.r'; }; };
patterns = (
{ name = 'comment.line.number-sign-tick.r';
begin = "#' ";
end = '\n';
beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
},
);
},

I'm pretty sure I've figured it out. What I didn't understand previously was how the scoping worked. I still don't understand it fully, but I now know enough to create nested definitions (regex) for the begin and end of each type of syntax.
The scoping makes things so much easier! Previously I wanted to do regex like (?<=\A#'\s.*)(\$) to find a dollar sign within the #'-style comment ... but obviously that won't work because of the repetition with * (+ wouldn't work for the same reason). Via scoping, it's already implied that we have to be inside the \A#'\s match before \$ will be matched.
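To illustrate why scoping removes the need for a variable-length lookbehind, here is a rough JavaScript emulation of the two-level idea. It is not TextMate grammar syntax, only a sketch of the same principle: an outer pattern first establishes that we are inside a #' comment, and the inner backtick pattern only ever runs on that comment's body. The function name findRawSpans is hypothetical.

```javascript
// Hypothetical helper: emulates nested scopes with two separate passes.
function findRawSpans(line) {
  // Outer "scope": the line must be a #' comment (optionally indented).
  const outer = /^[ \t]*#' (.*)$/.exec(line);
  if (!outer) return []; // not inside the outer scope, so inner rules never run

  // Inner rule: a backtick not followed by whitespace, up to a backtick
  // not preceded by whitespace (mirrors the begin/end pair in the grammar).
  const inner = /`(?!\s)[^`]*(?<!\s)`/g;
  return [...outer[1].matchAll(inner)].map((m) => m[0]);
}

const inTickComment = findRawSpans("#' recognized by `knitr::spin()` as markdown");
const inPlainComment = findRawSpans("# a plain comment with `knitr` in it");
console.log(inTickComment);  // the backticked span is found
console.log(inPlainComment); // empty: the outer pattern did not match
```

Because the inner pattern never sees lines that failed the outer match, it does not need to encode "preceded by #' somewhere earlier on the line" itself.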
Here is the relevant portion of my Language Grammar:
{ begin = '(^[ \t]+)?(?=#\'' )';
end = '(?!\G)';
beginCaptures = { 1 = { name = 'punctuation.whitespace.comment.leading.r'; }; };
patterns = (
{ name = 'comment.line.number-sign-tick.r';
begin = "#' ";
end = '\n';
beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
patterns = (
// Markdown within Comment
{ name = 'comment.line.number-sign-tick-raw.r';
begin = '(`)(?!\s)'; // backtick not followed by whitespace
end = '(?<!\s)(`)'; // backtick not preceded by whitespace
beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
},
// Equation within comment
{ name = 'comment.line.number-sign-tick-eqn.r';
begin = '((?<!\G)([\$]{1,2})(?!\s))';
end = '(?<!\s)([\$]{1,2})';
beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
// Markdown within Equation
patterns = (
{ name = 'comment.line.number-sign-tick-raw.r';
begin = '(`)(?!\s)'; // backtick not followed by whitespace
end = '(?<!\s)(`)'; // backtick not preceded by whitespace
beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
},
);
},
);
},
);
},
Here is some R code:
# below is a `knitr` (note no effect of backticks) code chunk
#+ codeChunk, include=FALSE
# normal R comment, follow by code
data <- matrix(rnorm(6,3, sd=7), nrow=2)
#' This would be recognized as markdown by `knitr::spin()`, with the preceding portion as "raw" text
`note that this doesnt go to the 'raw' format ... it is normal code!`
#+ anotherChunk
# also note how the dollar signs behave normally
data <- as.list(data)
data$blah <- "blah"
`data`[[1]] # backticks behaving
#' I can introduce a Latex-style equation, filling in values from R using `knitr` code chunks: $\frac{top}{bottom}=\frac{`r topValue`}{`r botValue`}$ then continue on with markdown.
And here is what that looks like in TextMate2 after making these changes:
Pretty good, except the backticked pieces take on the italics when they're inside an equation. I can live with that. I can even convince myself that I wanted it that way ;) (by the way, I specified fontName='regular' for the courier new, so I don't know why that's getting overridden)


How to get last two words written in Regex in Javascript

I am trying to get data after a colon.
This is my code:
function myFunction() {
var withBreaks = "*Cats are:* cool Pets [CATS]"
var sheet = SpreadsheetApp.getActiveSheet()
if (withBreaks) {
var tmp;
tmp = withBreaks.match(/^[\*]Cats are:[\*][\s]([a-z]+[\s]+[A-Za-z].*)$/m); //
var username = (tmp && tmp[1]) ? tmp[1].trim() : 'No username';
sheet.appendRow([username])
}
};
So I'm trying to get information after the
*Cats are:*. This code works, but some sentences have the asterisks and others don't. I would like a single pattern that handles both cases.
What I would like to do is get everything after the colon, that is, anything after Cats are:, without specifying the asterisks. Do I have to specify them?
I suggest
/^\**Cats are:\**\s*([\s\S]*)/
Here, any text is captured into Group 1 with ([\s\S]*) and the asterisks are made optional with * quantifier meaning 0 or more repetitions.
See the regex demo
If the asterisks can appear 1 or 0 times, replace * with ?:
/^\*?Cats are:\*?\s*([\s\S]*)/
    ^           ^
See another regex demo.
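As a quick check of the second pattern, the sketch below runs it against the sample sentence with and without the asterisks. The helper name afterCatsAre is hypothetical, and the Apps Script sheet call from the question is left out so the snippet runs anywhere.

```javascript
// Hypothetical helper wrapping the suggested pattern.
function afterCatsAre(text) {
  // \*? makes each asterisk optional; group 1 captures everything after the colon.
  const m = text.match(/^\*?Cats are:\*?\s*([\s\S]*)/);
  return m ? m[1].trim() : null; // null when the prefix is absent
}

console.log(afterCatsAre("*Cats are:* cool Pets [CATS]")); // cool Pets [CATS]
console.log(afterCatsAre("Cats are: cool Pets [CATS]"));   // cool Pets [CATS]
console.log(afterCatsAre("Dogs are: loyal"));              // null
```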

Regex that will extract the string between two known strings [duplicate]

I want to match a portion of a string using a regular expression and then access that parenthesized substring:
var myString = "something format_abc"; // I want "abc"
var arr = /(?:^|\s)format_(.*?)(?:\s|$)/.exec(myString);
console.log(arr); // Prints: [" format_abc", "abc"] .. so far so good.
console.log(arr[1]); // Prints: undefined (???)
console.log(arr[0]); // Prints: format_undefined (!!!)
What am I doing wrong?
I've discovered that there was nothing wrong with the regular expression code above: the actual string which I was testing against was this:
"date format_%A"
Reporting that "%A" is undefined seems a very strange behaviour, but it is not directly related to this question, so I've opened a new one, Why is a matched substring returning "undefined" in JavaScript?.
The issue was that console.log takes its parameters like a printf statement, and since the string I was logging ("%A") had a special value, it was trying to find the value of the next parameter.
Update: 2019-09-10
The old way to iterate over multiple matches was not very intuitive. This led to the proposal of the String.prototype.matchAll method. This new method is in the ECMAScript 2020 specification. It gives us a clean API and solves multiple problems. It has been available in major browsers and JS engines since Chrome 73+ / Node 12+ and Firefox 67+.
The method returns an iterator and is used as follows:
const string = "something format_abc";
const regexp = /(?:^|\s)format_(.*?)(?:\s|$)/g;
const matches = string.matchAll(regexp);
for (const match of matches) {
console.log(match);
console.log(match.index)
}
As it returns an iterator, we can say it's lazy. This is useful when handling particularly large numbers of capturing groups, or very large strings. But if you need to, you can easily transform the result into an Array by using the spread syntax or the Array.from method:
function getFirstGroup(regexp, str) {
const array = [...str.matchAll(regexp)];
return array.map(m => m[1]);
}
// or:
function getFirstGroup(regexp, str) {
return Array.from(str.matchAll(regexp), m => m[1]);
}
In the meantime, while this proposal gains wider support, you can use the official shim package.
Also, the internal workings of the method are simple. An equivalent implementation using a generator function would be as follows:
function* matchAll(str, regexp) {
const flags = regexp.global ? regexp.flags : regexp.flags + "g";
const re = new RegExp(regexp, flags);
let match;
while (match = re.exec(str)) {
yield match;
}
}
A copy of the original regexp is created; this is to avoid side effects due to the mutation of the lastIndex property when going through the multiple matches.
Also, we need to ensure the regexp has the global flag to avoid an infinite loop.
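The lastIndex side effect mentioned above can be observed directly. In this small sketch (the sample string and variable names are only illustrative), matchAll leaves the caller's regexp untouched, while a direct exec call on the same object moves its lastIndex forward.

```javascript
const text = "something format_abc format_def";
const re = /format_(\w+)/g;

// matchAll works on an internal copy, so the caller's regexp is not mutated.
const groups = Array.from(text.matchAll(re), (m) => m[1]);
const indexAfterMatchAll = re.lastIndex;

// exec on the same object advances lastIndex past the first match.
re.exec(text);
const indexAfterExec = re.lastIndex;
re.lastIndex = 0; // reset before reusing the regexp elsewhere

console.log(groups);             // first-group captures of every match
console.log(indexAfterMatchAll); // 0
console.log(indexAfterExec > 0); // true
```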
I'm also happy to see that even this StackOverflow question was referenced in the discussions of the proposal.
original answer
You can access capturing groups like this:
var myString = "something format_abc";
var myRegexp = /(?:^|\s)format_(.*?)(?:\s|$)/g;
// Equivalent constructor form; backslashes must be doubled inside a string literal:
// var myRegexp = new RegExp("(?:^|\\s)format_(.*?)(?:\\s|$)", "g");
var matches = myRegexp.exec(myString);
console.log(matches[1]); // abc
And if there are multiple matches you can iterate over them:
var myString = "something format_abc";
var myRegexp = new RegExp("(?:^|\\s)format_(.*?)(?:\\s|$)", "g"); // backslashes doubled inside the string literal
var match = myRegexp.exec(myString);
while (match != null) {
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
console.log(match[0])
match = myRegexp.exec(myString);
}
Here’s a method you can use to get the nth capturing group for each match:
function getMatches(string, regex, index) {
index || (index = 1); // default to the first capturing group
var matches = [];
var match;
while (match = regex.exec(string)) {
matches.push(match[index]);
}
return matches;
}
// Example :
var myString = 'something format_abc something format_def something format_ghi';
var myRegEx = /(?:^|\s)format_(.*?)(?:\s|$)/g;
// Get an array containing the first capturing group for every match
var matches = getMatches(myString, myRegEx, 1);
// Log results
document.write(matches.length + ' matches found: ' + JSON.stringify(matches))
console.log(matches);
var myString = "something format_abc";
var arr = myString.match(/\bformat_(.*?)\b/);
console.log(arr[0] + " " + arr[1]);
The \b isn't exactly the same thing. (It works on --format_foo/, but doesn't work on format_a_b) But I wanted to show an alternative to your expression, which is fine. Of course, the match call is the important thing.
Last but not least, I found one line of code that worked fine for me (JS ES6):
let reg = /#([\S]+)/igm; // Get hashtags.
let string = 'mi alegría es total! ✌🙌\n#fiestasdefindeaño #PadreHijo #buenosmomentos #france #paris';
let matches = (string.match(reg) || []).map(e => e.replace(reg, '$1'));
console.log(matches);
This will return:
['fiestasdefindeaño', 'PadreHijo', 'buenosmomentos', 'france', 'paris']
In regards to the multi-match parentheses examples above, I was looking for an answer here after not getting what I wanted from:
var matches = mystring.match(/(?:neededToMatchButNotWantedInResult)(matchWanted)/igm);
After looking at the slightly convoluted function calls with while and .push() above, it dawned on me that the problem can be solved very elegantly with mystring.replace() instead (the replacing is NOT the point, and isn't even done, the CLEAN, built-in recursive function call option for the second parameter is!):
var yourstring = 'something format_abc something format_def something format_ghi';
var matches = [];
yourstring.replace(/format_([^\s]+)/igm, function(m, p1){ matches.push(p1); } );
After this, I don't think I'm ever going to use .match() for hardly anything ever again.
String#matchAll (see the Stage 3 Draft / December 7, 2018 proposal) simplifies access to all groups in the match object (mind that Group 0 is the whole match, while further groups correspond to the capturing groups in the pattern):
With matchAll available, you can avoid the while loop and exec with /g... Instead, by using matchAll, you get back an iterator which you can use with the more convenient for...of, array spread, or Array.from() constructs
This method yields a similar output to Regex.Matches in C#, re.finditer in Python, preg_match_all in PHP.
See a JS demo (tested in Google Chrome 73.0.3683.67 (official build), beta (64-bit)):
var myString = "key1:value1, key2-value2!!#key3=value3";
var matches = myString.matchAll(/(\w+)[:=-](\w+)/g);
console.log([...matches]); // All match with capturing group values
The console.log([...matches]) shows
You may also get match value or specific group values using
let matchData = "key1:value1, key2-value2!!#key3=value3".matchAll(/(\w+)[:=-](\w+)/g)
var matches = [...matchData]; // Note matchAll result is not re-iterable
console.log(Array.from(matches, m => m[0])); // All match (Group 0) values
// => [ "key1:value1", "key2-value2", "key3=value3" ]
console.log(Array.from(matches, m => m[1])); // All match (Group 1) values
// => [ "key1", "key2", "key3" ]
NOTE: See the browser compatibility details.
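The "not re-iterable" remark in the code comment above is easy to overlook, so here is a minimal demonstration. The input string is a shortened version of the one used in the demo.

```javascript
const iterator = "key1:value1, key2-value2".matchAll(/(\w+)[:=-](\w+)/g);

const firstPass = [...iterator];  // consumes the iterator
const secondPass = [...iterator]; // nothing left to yield

console.log(firstPass.length);  // 2
console.log(secondPass.length); // 0
```

This is why the answer spreads the iterator into an array once and then derives the group lists from that array.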
Terminology used in this answer:
Match indicates the result of running your RegEx pattern against your string like so: someString.match(regexPattern).
Matched patterns indicate all matched portions of the input string, which all reside inside the match array. These are all instances of your pattern inside the input string.
Matched groups indicate all groups to catch, defined in the RegEx pattern. (The patterns inside parentheses, like so: /format_(.*?)/g, where (.*?) would be a matched group.) These reside within matched patterns.
Description
To get access to the matched groups, in each of the matched patterns, you need a function or something similar to iterate over the match. There are a number of ways you can do this, as many of the other answers show. Most other answers use a while loop to iterate over all matched patterns, but I think we all know the potential dangers with that approach. It is necessary to match against a new RegExp() instead of just the pattern itself, which only got mentioned in a comment. This is because the .exec() method behaves similar to a generator function – it stops every time there is a match, but keeps its .lastIndex to continue from there on the next .exec() call.
Code examples
Below is an example of a function searchString which returns an Array of all matched patterns, where each match is an Array with all the containing matched groups. Instead of using a while loop, I have provided examples using both the Array.prototype.map() function as well as a more performant way – using a plain for-loop.
Concise versions (less code, more syntactic sugar)
These are less performant since they basically implement a forEach-loop instead of the faster for-loop.
// Concise ES6/ES2015 syntax
const searchString =
(string, pattern) =>
string
.match(new RegExp(pattern.source, pattern.flags))
.map(match =>
new RegExp(pattern.source, pattern.flags)
.exec(match));
// Or, if you prefer, with an ES5-style function expression instead of arrow functions
function searchString(string, pattern) {
return string
.match(new RegExp(pattern.source, pattern.flags))
.map(function (match) {
return new RegExp(pattern.source, pattern.flags)
.exec(match);
});
}
let string = "something format_abc",
pattern = /(?:^|\s)format_(.*?)(?:\s|$)/;
let result = searchString(string, pattern);
// [[" format_abc", "abc"], null]
// The trailing `null` disappears if you add the `global` flag
Performant versions (more code, less syntactic sugar)
// Performant ES6/ES2015 syntax
const searchString = (string, pattern) => {
let result = [];
const matches = string.match(new RegExp(pattern.source, pattern.flags));
for (let i = 0; i < matches.length; i++) {
result.push(new RegExp(pattern.source, pattern.flags).exec(matches[i]));
}
return result;
};
// Same thing, but with ES5 syntax
function searchString(string, pattern) {
var result = [];
var matches = string.match(new RegExp(pattern.source, pattern.flags));
for (var i = 0; i < matches.length; i++) {
result.push(new RegExp(pattern.source, pattern.flags).exec(matches[i]));
}
return result;
}
let string = "something format_abc",
pattern = /(?:^|\s)format_(.*?)(?:\s|$)/;
let result = searchString(string, pattern);
// [[" format_abc", "abc"], null]
// The trailing `null` disappears if you add the `global` flag
I have yet to compare these alternatives to the ones previously mentioned in the other answers, but I doubt this approach is less performant and less fail-safe than the others.
Your syntax probably isn't the best to keep. FF/Gecko defines RegExp as an extension of Function.
(FF2 went as far as typeof(/pattern/) == 'function')
It seems this is specific to FF -- IE, Opera, and Chrome all throw exceptions for it.
Instead, use either method previously mentioned by others: RegExp#exec or String#match.
They offer the same results:
var regex = /(?:^|\s)format_(.*?)(?:\s|$)/;
var input = "something format_abc";
regex(input); //=> [" format_abc", "abc"]
regex.exec(input); //=> [" format_abc", "abc"]
input.match(regex); //=> [" format_abc", "abc"]
There is no need to invoke the exec method! You can use "match" method directly on the string. Just don't forget the parentheses.
var str = "This is cool";
var matches = str.match(/(This is)( cool)$/);
console.log( JSON.stringify(matches) ); // will print ["This is cool","This is"," cool"] or something like that...
Position 0 has a string with all the results. Position 1 has the first match represented by parentheses, and position 2 has the second match isolated in your parentheses. Nested parentheses are tricky, so beware!
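On the point about nested parentheses: groups are numbered by the position of their opening parenthesis, counted from left to right. A short sketch with a made-up date string:

```javascript
// Group 1 wraps groups 2 and 3; group 4 follows them.
const m = "2019-09-10".match(/((\d{4})-(\d{2}))-(\d{2})/);

console.log(m[0]); // 2019-09-10 (whole match)
console.log(m[1]); // 2019-09    (outer group, opened first)
console.log(m[2]); // 2019
console.log(m[3]); // 09
console.log(m[4]); // 10
```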
With ES2018 you can use String.match() with named groups, which makes it more explicit what your regex is trying to do.
const url =
'https://stackoverflow.com/questions/432493/how-do-you-access-the-matched-groups-in-a-javascript-regular-expression?some=parameter';
const regex = /(?<protocol>https?):\/\/(?<hostname>[\w-\.]*)\/(?<pathname>[\w-\./]+)\??(?<querystring>.*?)?$/;
const { groups: segments } = url.match(regex);
console.log(segments);
and you'll get something like
{protocol: "https", hostname: "stackoverflow.com", pathname: "questions/432493/how-do-you-access-the-matched-groups-in-a-javascript-regular-expression", querystring: "some=parameter"}
A one-liner that is practical only if you have a single pair of parentheses:
while ( ( match = myRegex.exec( myStr ) ) && matches.push( match[1] ) ) {};
Using your code:
console.log(arr[1]); // prints: abc
console.log(arr[0]); // prints: format_abc
Edit: Safari 3, if it matters.
function getMatches(string, regex, index) {
index || (index = 1); // default to the first capturing group
var matches = [];
var match;
while (match = regex.exec(string)) {
matches.push(match[index]);
}
return matches;
}
// Example :
var myString = 'Rs.200 is Debited to A/c ...2031 on 02-12-14 20:05:49 (Clear Bal Rs.66248.77) AT ATM. TollFree 1800223344 18001024455 (6am-10pm)';
var myRegEx = /clear bal.+?(\d+\.?\d{2})/gi;
// Get an array containing the first capturing group for every match
var matches = getMatches(myString, myRegEx, 1);
// Log results
document.write(matches.length + ' matches found: ' + JSON.stringify(matches))
console.log(matches);
Your code works for me (FF3 on Mac) even if I agree with PhiLo that the regex should probably be:
/\bformat_(.*?)\b/
(But, of course, I'm not sure because I don't know the context of the regex.)
As @cms said, you can use matchAll, which is part of ECMAScript (ECMA-262). It returns an iterator, and putting it inside [...] (the spread operator) converts it to an array. (This regex extracts the URLs of files.)
let text = `File1 File2`;
let fileUrls = [...text.matchAll(/href="(http\:\/\/[^"]+\.\w{3})\"/g)].map(r => r[1]);
console.log(fileUrls);
/*Regex function for extracting object from "window.location.search" string.
*/
var search = "?a=3&b=4&c=7"; // Example search string
var getSearchObj = function (searchString) {
var match, key, value, obj = {};
var pattern = /(\w+)=(\w+)/g;
var search = searchString.substr(1); // Remove '?'
while (match = pattern.exec(search)) {
obj[match[0].split('=')[0]] = match[0].split('=')[1];
}
return obj;
};
console.log(getSearchObj(search));
You don't really need an explicit loop to parse multiple matches — pass a replacement function as the second argument as described in: String.prototype.replace(regex, func):
var str = "Our chief weapon is {1}, {0} and {2}!";
var params= ['surprise', 'fear', 'ruthless efficiency'];
var patt = /{([^}]+)}/g;
str=str.replace(patt, function(m0, m1, position){return params[parseInt(m1)];});
document.write(str);
The m0 argument represents the full matched substring {0}, {1}, etc. m1 represents the first matching group, i.e. the part enclosed in brackets in the regex which is 0 for the first match. And position is the starting index within the string where the matching group was found — unused in this case.
Inside a regular expression, we can refer back to a matched group by using a backslash followed by the number of the group:
/([a-z])\1/
Here \1 stands for whatever was matched by the first group, ([a-z]).
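A small sketch of that backreference in use: the pattern only matches where the same lowercase letter appears twice in a row. The helper name firstDoubledLetter is hypothetical.

```javascript
// Hypothetical helper: returns the first doubled lowercase letter, or null.
function firstDoubledLetter(word) {
  // \1 must repeat exactly what group 1 captured.
  const m = /([a-z])\1/.exec(word);
  return m ? m[0] : null;
}

console.log(firstDoubledLetter("hello")); // ll
console.log(firstDoubledLetter("world")); // null
```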
If you are like me and wish regex would return an object like this:
{
match: '...',
matchAtIndex: 0,
capturedGroups: [ '...', '...' ]
}
then snip the function from below
/**
* @param {string | number} input
* The input string to match
* @param {regex | string} expression
* Regular expression
* @param {string} flags
* Optional Flags
*
* @returns {array}
* [{
match: '...',
matchAtIndex: 0,
capturedGroups: [ '...', '...' ]
}]
*/
function regexMatch(input, expression, flags = "g") {
let regex = expression instanceof RegExp ? expression : new RegExp(expression, flags)
let matches = input.matchAll(regex)
matches = [...matches]
return matches.map(item => {
return {
match: item[0],
matchAtIndex: item.index,
capturedGroups: item.length > 1 ? item.slice(1) : undefined
}
})
}
let input = "key1:value1, key2:value2 "
let regex = /(\w+):(\w+)/g
let matches = regexMatch(input, regex)
console.log(matches)
One line solution:
const matches = (text,regex) => [...text.matchAll(regex)].map(([match])=>match)
So you can use this way (must use /g):
matches("something format_abc", /(?:^|\s)format_(.*?)(?:\s|$)/g)
result:
[" format_abc"]
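Just use the static properties RegExp.$1 ... RegExp.$n to read the nth group of the most recent match.
For example:
RegExp.$1 holds the first group.
RegExp.$2 holds the second group.
If you use three groups in the regex, read them like this (note: only after calling string.match(regex)):
RegExp.$1 RegExp.$2 RegExp.$3
Be aware that these static properties are a deprecated legacy feature.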
var str = "The rain in ${india} stays safe";
var res = str.match(/\${(.*?)\}/ig);
//i used only one group in above example so RegExp.$1
console.log(RegExp.$1)
// The easiest way is to read RegExp.$1 for the first group, and RegExp.$2
// for the second group if it exists; read them after calling match
var regex=/\${(.*?)\}/ig;
var str = "The rain in ${SPAIN} stays ${mainly} in the plain";
var res = str.match(regex);
for (const match of res) {
var res = match.match(regex);
console.log(match);
console.log(RegExp.$1)
}
Get all group occurrence
let m=[], s = "something format_abc format_def format_ghi";
s.replace(/(?:^|\s)format_(.*?)(?:\s|$)/g, (x,y)=> m.push(y));
console.log(m);
I thought you just want to grab all the words containing the abc substring and store the matched group/entries, so I made this script:
s = 'something format_abc another word abc abc_somestring'
console.log(s.match(/\b\w*abc\w*\b/igm));
\b - a word boundary
\w* - 0+ word chars
abc - your exact match
\w* - 0+ word chars
\b - a word boundary
References: Regex: Match all the words that contains some word
https://javascript.info/regexp-introduction

Flex RegEx to find string not starting with a pattern

I'm writing a lexer to scan a modified version of an INI file.
I need to recognize the declaration of variables, comments and strings (between double quotes) to be assigned to a variable. For example, this is correct:
# this is a comment
var1 = "string value"
I've successfully managed to recognize these tokens by forcing the # at the beginning of the comment regular expression and " at the end of the string regular expression, but I don't want to do this because later on, using Bison, the tokens I get are exactly # this is a comment and "string value". Instead I want this is a comment (without #) and string value (without ").
These are the regular expressions that I currently use:
[a-zA-Z][a-zA-Z0-9]* { return TOKEN_VAR_NAME; }
["][^\n\r]*["] { return TOKEN_STRING; }
[#][^\n\r]* { return TOKEN_COMMENT; }
Obviously there can be any number of white spaces, as well as tabs, inside the string, the comment and between the variable name and the =.
How could I achieve the result I want?
Maybe it will be easier if I show you a complete example of a correct input file and also the grammar rules I use with Flex and Bison.
Correct input file example:
[section1]
var1 = "string value"
var2 = "var1 = text"
# this is a comment
# var5 = "some text" this is also a valid comment
These are the regular expressions for the lexer:
"[" { return TOKEN::SECTION_START; }
"]" { return TOKEN::SECTION_END; }
"=" { return TOKEN::ASSIGNMENT; }
[#][^\n\r]* { return TOKEN::COMMENT; }
[a-zA-Z][a-zA-Z0-9]* { *m_yylval = yytext; return TOKEN::ID; }
["][^\n\r]*["] { *m_yylval = yytext; return TOKEN::STRING; }
And these are the syntax rules:
input : input line
| line
;
line : section
| value
| comment
;
section : SECTION_START ID SECTION_END { createNewSection($2); }
;
value : ID ASSIGNMENT STRING { addStringValue($1, $3); }
;
comment : COMMENT { addComment($1); }
;
To do that, you would have to treat " and # as separate tokens (so they get scanned individually, apart from the token you are scanning now) and use a %s or %x start condition to change which patterns the scanner accepts after reading them.
This has another drawback: you will receive # as an individual token before the comment, and " before and after the string contents, and you'll have to cope with that in your grammar. This complicates both the grammar and the scanner, so I'd discourage this approach.
There is a better solution: write a routine to unescape things, and keep the scanner simple by letting it return the whole matched input in yytext and simply do
m_yylval = unescapeString(yytext); /* drop the " chars */
return STRING;
or
m_yylval = uncomment(yytext); /* drop the # at the beginning */
return COMMENT; /* return EOL if you are trying the example at the end */
in the yylex(); function.
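The answer does not show what those two helper routines would do. As a rough illustration of the intended string handling only, here is a sketch written in JavaScript for consistency with the other examples on this page; in a real flex scanner they would be C or C++ functions, and their exact behaviour here (such as trimming the blanks after #) is an assumption.

```javascript
// Hypothetical counterparts of the helpers named in the answer.
function uncomment(lexeme) {
  // Drop the leading "#" and any spaces or tabs right after it.
  return lexeme.replace(/^#[ \t]*/, "");
}

function unescapeString(lexeme) {
  // Drop the opening and closing double quote, keep everything between.
  return lexeme.replace(/^"|"$/g, "");
}

console.log(uncomment("# this is a comment"));   // this is a comment
console.log(unescapeString('"string value"'));   // string value
console.log(unescapeString('"var1 = text"'));    // var1 = text
```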
Note
As comments are normally ignored, the best thing is to discard them using a rule like:
"#".* ; /* ignored */
in your flex file. With an empty action, the generated scanner does not return; it simply drops the token just read and continues scanning.
Note 2
You probably haven't taken into account that your parser will accept lines of the form:
var = "data"
in front of any
[section]
line, so you'll run into trouble calling addStringValue(...) when no section has been created yet. One possible solution is to modify your grammar to split the file into sections and force each of them to begin with a section line, like:
compilation: file comments ;
file: file section
| ; /* empty */
section: section_header section_body;
section_header: comments '[' ident ']' EOL ;
section_body: section_body comments assignment
| ; /* empty */
comments: comments COMMENT
| ; /* empty */
This is complicated by the fact that you want to process the comments. If you were to ignore them (by using ; as the action in the flex scanner), the grammar would be:
file: empty_lines file section
| ; /* empty */
empty_lines: empty_lines EOL
| ; /* empty */
section: header body ;
header: '[' IDENT ']' EOL ;
body: body assignment
| ; /* empty */
assignment: IDENT '=' strings EOL
| EOL ; /* empty lines or lines with comments */
strings:
strings unit
| unit ;
unit: STRING
| IDENT
| NUMBER ;
This way the first thing allowed in your file, apart from comments (which are ignored) and blank space, is a section header. (EOLs are not considered blank space: we cannot ignore them, because they terminate lines.)

Regex for custom parsing

Regex isn't my strongest point. Let's say I need a custom parser that strips a string of any letters and of every decimal point after the first.
For example, input string is "--1-2.3-gf5.47", the parser would return
"-12.3547".
I could only come up with variations of this :
string.replaceAll("[^(\\-?)(\\.?)(\\d+)]", "")
which removes the letters but retains everything else. Any pointers?
More examples:
Input: -34.le.78-90
Output: -34.7890
Input: df56hfp.78
Output: 56.78
Some rules:
Consider only the first negative sign before the first number, everything else can be ignored.
I'm trying to do this using Java.
Assume the -ve sign, if there is one, will always occur before the
decimal point.
Just tested this on ideone and it seemed to work. The comments should explain the code well enough. You can copy/paste this into Ideone.com and test it if you'd like.
It might be possible to write a single regex pattern for it, but you're probably better off implementing something simpler/more readable like below.
The three examples you gave print out:
--1-2.3-gf5.47 -> -12.3547
-34.le.78-90 -> -34.7890
df56hfp.78 -> 56.78
import java.util.*;
import java.lang.*;
import java.io.*;
/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
System.out.println(strip_and_parse("--1-2.3-gf5.47"));
System.out.println(strip_and_parse("-34.le.78-90"));
System.out.println(strip_and_parse("df56hfp.78"));
}
public static String strip_and_parse(String input)
{
//remove anything not a period or digit (including hyphens) for output string
String output = input.replaceAll("[^\\.\\d]", "");
//add a hyphen to the beginning of 'out' if the original string started with one
if (input.startsWith("-"))
{
output = "-" + output;
}
//if the string contains a decimal point, remove all but the first one by splitting
//the output string into two strings and removing all the decimal points from the
//second half
if (output.indexOf(".") != -1)
{
output = output.substring(0, output.indexOf(".") + 1)
+ output.substring(output.indexOf(".") + 1, output.length()).replaceAll("[^\\d]", "");
}
return output;
}
}
In terms of regex, the secondary, tertiary, etc., decimals seem tough to remove. However, this one should remove the additional dashes and alphas: (?<=.)-|[a-zA-Z]. (Hopefully the syntax is the same in Java; this is a Python regex but my understanding is that the language is relatively uniform).
That being said, it seems like you could just run a pretty short "finite state machine"-type piece of code to scan the string and rebuild the reduced string yourself like this:
a = "--1-2.3-gf5.47"
new_a = ""
dash = False
dot = False
nums = '0123456789'
for char in a:
    if char in nums:
        new_a = new_a + char  # record a match to nums
        dash = True  # since we saw a number first, turn on the dash flag, we won't use any dashes from now on
    elif char == '-' and not dash:
        new_a = new_a + char  # if we see a dash and haven't seen anything else yet, we append it
        dash = True  # activate the flag
    elif char == '.' and not dot:
        new_a = new_a + char  # take the first dot
        dot = True  # put up the dot flag
(Again, sorry for the syntax; I think in Java you need some curly braces around the statements, versus Python's indentation-only style.)
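For readers who want a brace-style version of that single-pass scan, here is a sketch of the same logic. It is written in JavaScript rather than Java so that it matches the other examples on this page; the function name stripToNumber is hypothetical, and the flags mirror the dash and dot flags above.

```javascript
// Hypothetical port of the flag-based scan to a brace-style language.
function stripToNumber(input) {
  let out = "";
  let signClosed = false; // true once a digit or the first "-" has been seen
  let dotSeen = false;    // true once the first "." has been kept
  for (const ch of input) {
    if (ch >= "0" && ch <= "9") {
      out += ch;          // keep every digit
      signClosed = true;  // a "-" after a digit is never a sign
    } else if (ch === "-" && !signClosed) {
      out += ch;          // keep only a leading "-"
      signClosed = true;
    } else if (ch === "." && !dotSeen) {
      out += ch;          // keep only the first "."
      dotSeen = true;
    }
    // everything else (letters, extra "-" or ".") is dropped
  }
  return out;
}

console.log(stripToNumber("--1-2.3-gf5.47")); // -12.3547
console.log(stripToNumber("-34.le.78-90"));   // -34.7890
console.log(stripToNumber("df56hfp.78"));     // 56.78
```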

Notepad++ How to replace everything in between certain lines

So I have a code like this
TitleManager:AddSubTitleMissionInfo_LUA({
m_iID = 10,
m_wstrDescription = "Professional Killer",
m_eClearType = TITLE_MISSION_CLEAR_TYPE.TMCT_MOB_KILL_COUNT,
m_bAutomaticDescription = True,
m_ClearCondition = {
m_eMobID = {68},
m_iMobKillCount = {1}
}
})
TitleManager:AddSubTitleMissionInfo_LUA({
m_iID = 20,
m_wstrDescription = "Sneaky Assassin",
m_eClearType = TITLE_MISSION_CLEAR_TYPE.TMCT_MOB_KILL_COUNT,
m_bAutomaticDescription = True,
m_ClearCondition = {
m_eMobID = {69},
m_iMobKillCount = {1}
}
})
TitleManager:AddSubTitleMissionInfo_LUA({
m_iID = 20,
m_wstrDescription = "Merciless Thug",
m_eClearType = TITLE_MISSION_CLEAR_TYPE.TMCT_MOB_KILL_COUNT,
m_bAutomaticDescription = True,
m_ClearCondition = {
m_eMobID = {70,71},
m_iMobKillCount = {1,1}
}
})
There are like a hundred of those, all different.
How do I replace everything in between the curly brackets, from
m_ClearCondition = {
}
to
m_ClearCondition = {
m_eMobID = {50},
m_iMobKillCount = {1}
}
I really hope someone could answer my question, I would be really grateful.
Description
Several things about the sample text in your post are unclear, but this expression works as follows:
each m_ClearCondition value is assumed to have no brackets nested more than one level deep.
each key name m_ClearCondition, together with its opening bracket, is captured into group 1.
the closing bracket is captured into group 2.
Regex:
^(\s+m_ClearCondition\s*=\s*\{)(?:\{[^}]*\}|[^}])*(\})
Replace with: $1\n m_eMobID = {50},\n m_iMobKillCount = {1}\n $2
Example:
Live demo of the regex: http://regexr.com?35n0l
In this example I'm using Notepad++ 6.4.2. There were known problems using regex in Notepad++ version 5 and lower.
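To see what the expression captures without opening Notepad++, the sketch below applies the same pattern in JavaScript to a shortened copy of the first block. This assumes the pattern behaves the same way in both engines for this input; the m flag is added so that ^ matches at line starts, and the indentation in the replacement is chosen only for illustration.

```javascript
const source = [
  "TitleManager:AddSubTitleMissionInfo_LUA({",
  "    m_iID = 10,",
  "    m_ClearCondition = {",
  "        m_eMobID = {68},",
  "        m_iMobKillCount = {1}",
  "    }",
  "})",
].join("\n");

// Group 1: the key and its opening bracket. Group 2: the closing bracket.
// The middle part skips one level of nested {...} pairs.
const pattern = /^(\s+m_ClearCondition\s*=\s*\{)(?:\{[^}]*\}|[^}])*(\})/gm;
const replacement =
  "$1\n        m_eMobID = {50},\n        m_iMobKillCount = {1}\n    $2";

const result = source.replace(pattern, replacement);
console.log(result);
```

Everything outside the m_ClearCondition block, including the final }), is left as it was.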
Regex is your friend. Ctrl-H (Find/replace) and enable regular expressions (bottom left of dialog).
\{\s*\}$
replace with:
{\nwhatever you want\n}\n
Regex Match Explanation:
{} are reserved, so escaping them will match the actual brace character
\s of course will match all whitespace (including new lines and tabs)
$ would be the end of the line; this just helps the match
\n in the replace would insert new lines where it makes sense; these are optional.