Regular expression set max length for string literal - c++

I am trying to figure out how to set the max length in a regular expression. My goal is to set my regular expression for string literals to a max length of 80.
Here is my expression if you need it:
["]([^"\\]|\\(.|\n))*["]|[']([^'\\]|\\(.|\n))*[']
I've tried adding {0,80} at both the front and the end of the expression, but so far either all the strings break down into smaller identifiers or none do.
Thanks in advance for the help!
Edit:
Here's a better explanation of what I am trying to accomplish.
Given "This string is over 80 characters long", when run through flex instead of being listed like:
line: 1, lexeme: |THIS STRING IS OVER 80 CHARACTERS LONG|, length: 81, token 4003
I would need it to be broken up like this:
line: 1, lexeme: |THIS|, length: 4, token 6000
line: 1, lexeme: |STRING|, length: 6, token 6000
line: 1, lexeme: |IS|, length: 2, token 6000
line: 1, lexeme: |OVER|, length: 4, token 6000
line: 1, lexeme: |80|, length: 2, token 6000
line: 1, lexeme: |CHARACTERS|, length: 10, token 6000
line: 1, lexeme: |LONG|, length: 4, token 6000
While string "THIS STRING IS NOT OVER 80 CHARACTERS LONG" would be shown as:
line: 1, lexeme: |THIS STRING IS NOT OVER 80 CHARACTERS LONG|, length: 50, token: 4003

I've tried adding {0,80} at both the front and the end of the expression
The brace operator is not a length limit; it's a range of repetition counts. It has to go where a repetition operator (*, + or ?) would go: immediately after the subpattern being repeated.
So in your case you might use (leaving out the ' alternative for clarity):
["]([^"\\\n]|\\(.|\n)){0,80}["]
Normally, I would advise you not to do this, or at least to do it with some care. (F)lex regular expressions are compiled into state transition tables, and the only way to compile a maximum repetition count is to copy the subpattern states once for each repetition. So the above pattern needs to make 80 copies of the state transitions for ([^"\\]|\\(.|\n)). (With a simple subpattern like that, the state blow-up might not be too serious. But with more complex subpatterns, you can end up with enormous transition tables.)
Edit: Split long strings into tokens as though they weren't quoted.
An edit to the question suggests that what is expected is to treat strings of length greater than 80 characters as though the quote marks had never been entered; that is, report them as individual word and number tokens without any intervening whitespace. That seems so idiosyncratic to me that I can't convince myself that I'm reading the requirement correctly, but in case I am, here's the outline of a possible approach.
Let's suppose that the intention is that short strings should be reported as single tokens, while long strings should be reinterpreted as a series of tokens (possibly but not necessarily the same tokens as would be produced by an unquoted input). If that's the case, there are really two lexical analyses which need to be specified, and they will not use the same pattern rules. (For one thing, the rescan needs to recognise a quotation mark as the end of a literal, causing the scanner to revert to normal processing, while the original scan considers a quotation mark to start a string literal.)
One possibility would be to just collect the entire long string and then use a different lexical scanner to break it into whatever parts seem useful, but that would require some complicated plumbing in order to record the resulting token stream and return it one token at a time to the yylex caller. (This would be reasonably easy if yylex were pushing tokens to a parser, but that's a whole other scenario.) So I'm going to discard this option other than to mention that it is possible.
So the apparently simpler option is to ensure that the original scan halts on the 81st character, so that it can change the lexical rules and back up the scan to apply the new rules.
(F)lex provides start conditions as a way of providing different lexical contexts. By using the BEGIN macro in a (f)lex action, it is possible to dynamically switch between start conditions, switching the scanner into a different context. (They're called "start conditions" because they change the scanner's state at the start of the token.)
Each start condition (except the default start condition, which is called INITIAL) needs to be declared in the flex prologue. In this case, we'll only need one extra start condition, which we'll call SC_LONG_STRING. (By convention, start condition names are written in ALL CAPS since they are translated into either C macros or enumeration values.)
Flex has two possible mechanisms for backing up a scan, either of which will work here. I'm going to show the explicit back-up because it's safer and more flexible; the alternative is to use the trailing context operator (/) which would work perfectly well in this solution, but not in other very similar contexts.
So we start by declaring our start condition and then the rules for handling quoted strings in the default (INITIAL) lexical context:
%x SC_LONG_STRING
%%
I'm only showing the double-quote rules since the single-quote rules are virtually identical. (Single-quote will require another start condition, because the termination pattern is different.)
The first rule matches strings where there are at most 80 characters or escape sequences in the literal, using a repetition operator as described above.
["]([^"\\\n]|\\(.|\n)){0,80}["] { yylval.str = strndup(yytext + 1, yyleng - 2);
return TOKEN_STRING;
}
The second rule matches exactly one additional non-quote character. It does not attempt to find the end of the string; that will be handled within the SC_LONG_STRING rules. The rule does two things:
Switch to a different start condition.
Tell the scanner to back up the scan, using the yyless(n) special action, which truncates the current token at n characters and causes the next token scan to restart at that point. So yyless(1) leaves only the " in the current token. Since we don't return at this point, the current token is immediately discarded.
["]([^"\\\n]|\\(.|\n)){81} { BEGIN(SC_LONG_STRING); yyless(1); }
The final rule is a fallback for unterminated strings; it will trigger if something that looks like a string was started but neither of the above rules matched it. That can only happen if a newline or end-of-file indicator is encountered before the closing quote:
["]([^"\\\n]|\\(.|\n)){0,80} { yyerror("Unterminated string"); }
Now, we need to specify the rules for SC_LONG_STRING. For simplicity, I'm going to assume that it is only desired to split the string into whitespace separated units; if you want to do a different analysis, you can change the patterns here.
Start conditions are identified by writing the name of the start condition inside angle brackets. The start condition name is considered part of the pattern, so it should not be followed by a space (space characters aren't allowed in lex patterns unless they are quoted). Flex is more flexible about this; see the section on start conditions in the flex manual for details.
The first rule simply returns to the INITIAL start condition when a double quote terminates the string. The second rule discards whitespace inside the long string, and the third rule passes the whitespace-separated components on to the caller. Finally, we need to consider the possible error of an unterminated long string, which will result in encountering a newline or end-of-file indicator.
<SC_LONG_STRING>["] { BEGIN(INITIAL); }
<SC_LONG_STRING>[ \t]+ ;
<SC_LONG_STRING>([^"\\ \n\t]|\\(.|\n))+ { yylval.str = strdup(yytext);
                                          return TOKEN_IDENTIFIER;
                                        }
<SC_LONG_STRING>\n |
<SC_LONG_STRING><<EOF>> { yyerror("Unterminated string"); }
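For concreteness, here is how these pieces might fit together in a minimal flex file. This is only a sketch: the token numbers, the yylval layout and the yyerror stub are stand-ins for whatever your parser actually provides, and the single-quote rules are still omitted.

%{
#include <stdio.h>
#include <string.h>

/* Hypothetical token codes and semantic value; in a real project these
   would come from the parser (e.g. a bison-generated header). */
enum { TOKEN_STRING = 4003, TOKEN_IDENTIFIER = 6000 };
union { char *str; } yylval;

static void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }
%}

%option noyywrap

%x SC_LONG_STRING

%%

["]([^"\\\n]|\\(.|\n)){0,80}["]   { yylval.str = strndup(yytext + 1, yyleng - 2);
                                    return TOKEN_STRING; }
["]([^"\\\n]|\\(.|\n)){81}        { BEGIN(SC_LONG_STRING); yyless(1); }
["]([^"\\\n]|\\(.|\n)){0,80}      { yyerror("Unterminated string"); }

<SC_LONG_STRING>["]               { BEGIN(INITIAL); }
<SC_LONG_STRING>[ \t]+            ;
<SC_LONG_STRING>([^"\\ \n\t]|\\(.|\n))+ { yylval.str = strdup(yytext);
                                          return TOKEN_IDENTIFIER; }
<SC_LONG_STRING>\n                |
<SC_LONG_STRING><<EOF>>           { yyerror("Unterminated string"); }

%%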
Original answer: Produce a meaningful error for long strings
You need to specify what you are planning to do if the user enters a string which is too long. If your scanner doesn't recognise it as a string, then it will return some kind of fall-back token which will probably induce a syntax error from the parser; that provides no useful feedback to the user, so they probably won't have a clue where the syntax error comes from. And you certainly cannot restart the lexical analysis in the middle of a string which happens to be too long: that will end up interpreting text which was supposed to be quoted as though it were tokens.
A much better strategy is to recognise strings of any length, and then check the length in the action associated with the pattern. As a first approximation, you could try this:
["]([^"\\]|\\(.|\n)){0,80}["] { if (yyleng <= 82) return STRING;
else {
yyerror("String literal exceeds 80 characters");
return BAD_STRING;
}
}
(Note: (F)lex sets the variable yyleng to the length of yytext, so there is never a need to call strlen(yytext). strlen needs to scan its argument to compute the length, so it's quite a bit less efficient. Moreover, even in cases where you need to call strlen, you should not use it to check if a string exceeds a maximum length. Instead, use strnlen, which will limit the length of the scan.)
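For instance, a bounded check built on strnlen might look like this. (A minimal sketch; the helper name is invented for illustration.)

#include <string.h>

/* Nonzero if s is longer than maxlen characters; strnlen scans at most
   maxlen + 1 bytes no matter how long s really is. */
static int exceeds_length(const char *s, size_t maxlen) {
    return strnlen(s, maxlen + 1) > maxlen;
}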
But that's just a first approximation, because it counts source characters, rather than the length of a string literal. So, for example, assuming you plan to allow hex escapes, the string literal "\x61" will be counted as though it has four characters, which could easily cause string literals containing escapes to be incorrectly rejected as too long.
That problem is ameliorated but not solved by the pattern with a limited repeat count, because the pattern itself does not fully parse escape sequences. In the pattern ["]([^"\\]|\\(.|\n)){0,80}["], the \x61 escape sequence will be counted as three repetitions (\x, 6, 1), which is still more than the single character it represents. As another example, splices (\ followed by a newline) will be counted as one repetition, whereas they don't contribute to the length of the literal at all since they are simply removed.
So if you want to correctly limit the length of a string literal, you will have to parse the source representation more precisely. (Or you would need to reparse it after identifying it, which seems wasteful.) This is usually done with a start condition.
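As a sketch of what that start-condition approach might look like, assuming the only escapes are simple escapes, one- or two-digit hex escapes and line splices, and reusing the STRING and BAD_STRING token names from above. Accumulating the decoded text into a buffer and handling end-of-file inside the literal are omitted.

%{
static int str_len;   /* characters of the literal seen so far */
%}
%x SC_STRING
%%
["]                    { str_len = 0; BEGIN(SC_STRING); }
<SC_STRING>["]         { BEGIN(INITIAL);
                         if (str_len > 80) {
                           yyerror("String literal exceeds 80 characters");
                           return BAD_STRING;
                         }
                         return STRING; }
<SC_STRING>\\x[0-9a-fA-F]{1,2}   ++str_len;         /* hex escape: one character */
<SC_STRING>\\\n                  ;                  /* splice: contributes nothing */
<SC_STRING>\\.                   ++str_len;         /* other escapes: one character */
<SC_STRING>[^"\\\n]+             str_len += yyleng; /* ordinary characters */
<SC_STRING>\n                    { yyerror("Unterminated string"); BEGIN(INITIAL); }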

In case you use the regex inside flex and need to monitor the length of the match, the easiest thing is to look at the matched text, which flex stores in yytext (with its length in yyleng):
["]([^"\\]|\\(.|\n))*["]|[']([^'\\]|\\(.|\n))*['] { if (yyleng > 82) { ... } }
I used 82 to account for the two enclosing quote characters.

XSD pattern restriction to exclude a substring at any location in the input string

I need to write an XSD schema with a restriction on a field, to ensure that
the value of the field does not contain the substring FILENAME at any location.
For example, all of the following must be invalid:
FILENAME
ORIGINFILENAME
FILENAMETEST
123FILENAME456
None of these values should be valid.
In a regular expression language that supports negative lookahead, I could do this by writing /^((?!FILENAME).)*$/, but the XSD pattern language does not support negative lookahead.
How can I implement an XSD pattern restriction with the same effect as /^((?!FILENAME).)*$/?
I need to use pattern, because I don't have access to XSD 1.1 assertions, which are the other obvious possibility.
The question XSD restriction that negates a matching string covers a similar case, but in that case the forbidden string is forbidden only as a prefix, which makes checking the constraint easier. How can the solution there be extended to cover the case where we have to check all locations within the input string, and not just the beginning?
OK, the OP has persuaded me that while the other question mentioned has an overlapping topic, the fact that the forbidden string is forbidden at all locations, not just as a prefix, complicates things enough to require a separate answer, at least for the XSD 1.0 case. (I started to add this answer as an addendum to my answer to the other question, and it grew too large.)
There are two approaches one can use here.
First, in XSD 1.1, a simple assertion of the form
not(matches($value, 'FILENAME'))
ought to do the job.
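In schema syntax that assertion might be attached to a simple type along these lines (the type name here is invented for illustration):
<xs:simpleType name="stringWithoutFILENAME">
  <xs:restriction base="xs:string">
    <xs:assertion test="not(matches($value, 'FILENAME'))"/>
  </xs:restriction>
</xs:simpleType>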
Second, if one is forced to work with an XSD 1.0 processor, one needs a pattern that will match all and only strings that don't contain the forbidden substring (here 'FILENAME').
One way to do this is to ensure that the character 'F' never occurs in the input. That's too drastic, but it does do the job: strings not containing the first character of the forbidden string do not contain the forbidden string.
But what of strings that do contain an occurrence of 'F'? They are fine, as long as no 'F' is followed by the string 'ILENAME'.
Putting that last point more abstractly, we can say that any acceptable string (any string that doesn't contain the string 'FILENAME') can be divided into two parts:
a prefix which contains no occurrences of the character 'F'
zero or more occurrences of 'F' followed by a string that doesn't match 'ILENAME' and doesn't contain any 'F'.
The prefix is easy to match: [^F]*.
The strings that start with F but don't match 'FILENAME' are a bit more complicated; just as we don't want to outlaw all occurrences of 'F', we also don't want to outlaw 'FI', 'FIL', etc. -- but each occurrence of such a dangerous string must be followed either by the end of the string, or by a letter that doesn't match the next letter of the forbidden string, or by another 'F' which begins another region we need to test. So for each proper prefix of the forbidden string, we create a regular expression of the form
(P([^FX][^F]*)?)
where P is the proper prefix and X is the character that follows P in the forbidden string. For example, the prefix 'FIL' (followed by 'E' in 'FILENAME') yields (FIL([^FE][^F]*)?).
Then we join all of those regular expressions with or-bars and allow the whole group to repeat.
The end result in this case is something like the following (I have inserted newlines here and there, to make it easier to read; before use, they will need to be taken back out):
[^F]*
((F([^FI][^F]*)?)
|(FI([^FL][^F]*)?)
|(FIL([^FE][^F]*)?)
|(FILE([^FN][^F]*)?)
|(FILEN([^FA][^F]*)?)
|(FILENA([^FM][^F]*)?)
|(FILENAM([^FE][^F]*)?))*
Two points to bear in mind:
XSD regular expressions are implicitly anchored; testing this with a non-anchored regular expression evaluator will not produce the correct results.
It may not be obvious at first why the alternatives in the choice all end with [^F]* instead of .*. Thinking about the string 'FEEFIFILENAME' may help. We have to check every occurrence of 'F' to make sure it's not followed by 'ILENAME'.

Tokenize parse option

Consider a slightly different toy example from my previous question:
. local string my first name is Pearly,, and my surname is Spencer
. tokenize "`string'", parse(",,")
. display "`1'"
my first name is Pearly
. display "`2'"
,
. display "`3'"
,
. display "`4'"
and my surname is Spencer
I have two questions:
Does tokenize work as expected in this case? I thought local macro 2 should be ,, rather than just ,; local macro 3 should then contain the rest of the string (and local macro 4 should be empty).
Is there a way to force tokenize to respect the double comma as a parsing
character?
tokenize -- and gettoken too -- won't, from what I can see, accept repeated characters such as ,, as a composite parsing character. ,, is not illegal as a specification of parsing characters, but is just understood as meaning that , and , are acceptable parsing characters. The repetition in practice is ignored, just as adding "My name is Pearly" after "My name is Pearly" doesn't add information in a conversation.
To back up: know that without other instructions (such as might be given by a syntax command) Stata will parse a string according to spaces, except that double quotes (or compound double quotes) bind harder than spaces separate.
tokenize -- and gettoken too -- will accept multiple parse characters pchars and the help for tokenize gives an example with space and + sign. (It's much more common, in my experience, to want to use space and comma , when the syntax for a command is not quite what syntax parses completely.)
A difference between space and the other parsing characters is that spaces are discarded but other parsing characters are not discarded. The rationale here is that those characters often have meaning you might want to take forward. Thus in setting up syntax for a command option, you might want to allow something like myoption( varname [, suboptions])
and so whether a comma is present and other stuff follows is important for later code.
With composite characters, so that you are looking for say ,, as a separator, I think you'd need to loop around using substr() or an equivalent. In practice an easier work-around might be first to replace your composite characters with some neutral single character and then apply tokenize. That relies on knowing that the neutral character does not otherwise occur in the string. Thus I often use # as a placeholder, because I know that it will not occur as part of variable or scalar names and it's not part of function names or an operator.
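For example, a minimal sketch of that work-around (assuming # does not otherwise occur in the string):
. local string my first name is Pearly,, and my surname is Spencer
. local work : subinstr local string ",," "#", all
. tokenize "`work'", parse("#")
. display "`1'"
my first name is Pearly
. display "`2'"
#
. display "`3'"
and my surname is Spencer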
For what it's worth, I note that in first writing split I allowed composite characters as separators. As I recall, a trigger to that was a question on Statalist which was about data for legal cases with multiple variations on VS (versus) to indicate which party was which. This example survives into the help for the official command.
On what is a "serious" bug, much depends on judgment. I think a programmer would just discover on trying it out that composite characters don't work as desired with tokenize in cases like yours.

Replace odd-length substrings of a character

I am struggling with a little problem concerning regular expressions.
I want to replace all odd length substrings of a specific character with another substring of the same length but with a different character.
All even sequences of the specified character should remain the same.
Simplified example: A string contains the letters a,b and y and all the odd length sequences of y's should be replaced by z's:
abyyyab -> abzzzab
Another possible example might be:
ycyayybybcyyyyycyybyyyyyyy
becomes
zczayybzbczzzzzcyybzzzzzzz
I have no problem matching all the sequences of odd length using a regular expression.
Unfortunately I have no idea how to incorporate the length information from these matches into the replacement string.
I know I have to use backreferences/capture groups somehow, but even after reading lots of documentation and Stack Overflow articles I still don't know how to pursue the issue correctly.
Concerning possible regex engines, I am working mainly with Emacs or Vim.
In case I have overlooked an easier general solution without a complicated regular expression (e.g. a small and fixed series of simple search and replace commands), this would help too.
Here's how I'd do it in vim:
:s/\vy@<!y(yy)*y@!/\=repeat('z', len(submatch(0)))/g
Explanation:
The regex we're using is \vy@<!y(yy)*y@!. The \v at the beginning turns on the very magic option, so we don't have to escape as much. Without it, we would have y\@<!y\(yy\)*y\@!.
The basic idea for this search is that we're looking for a 'y', y, followed by a run of pairs of 'y's, (yy)*, which keeps the total length odd. Then we add y@<! to guarantee there isn't a 'y' before our match, and y@! to guarantee there isn't a 'y' after our match.
Then we replace this using an expression as the replacement, i.e. \=. From :h sub-replace-\=:
*sub-replace-\=* *s/\=*
When the substitute string starts with "\=" the remainder is interpreted as an
expression.
The special meaning for characters as mentioned at |sub-replace-special| does
not apply except for "<CR>". A <NL> character is used as a line break, you
can get one with a double-quote string: "\n". Prepend a backslash to get a
real <NL> character (which will be a NUL in the file).
The "\=" notation can also be used inside the third argument {sub} of
|substitute()| function. In this case, the special meaning for characters as
mentioned at |sub-replace-special| does not apply at all. Especially, <CR> and
<NL> are interpreted not as a line break but as a carriage-return and a
new-line respectively.
When the result is a |List| then the items are joined with separating line
breaks. Thus each item becomes a line, except that they can contain line
breaks themselves.
The whole matched text can be accessed with "submatch(0)". The text matched
with the first pair of () with "submatch(1)". Likewise for further
sub-matches in ().
TL;DR, :s/foo/\=blah replaces foo with blah evaluated as vimscript code. So the code we're evaluating is repeat('z', len(submatch(0))), which simply makes one 'z' for each 'y' we've matched.
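Applying this to the second example from the question:
:s/\vy@<!y(yy)*y@!/\=repeat('z', len(submatch(0)))/g
turns ycyayybybcyyyyycyybyyyyyyy into zczayybzbczzzzzcyybzzzzzzz: the odd-length runs of 'y' become 'z's of the same length, while the two even-length runs (yy) are left alone.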

How can I use regular expressions to match a 'broken' string, or a proper string?

What I mean is that I need a regular expression that can match either something like this...
"I am a sentence."
or something like this...
"I am a sentence.
(notice the missing quotation mark at the end of the second one). My attempt at this so far is
["](\\.|[^"])*["]*
but that isn't working. Thanks for the help!
Edit for clarity: I am intending for this to be something like a C style string. I want functionality that will match with a string even if the string is not closed properly.
You could write the pattern as:
["](\\.|[^"\n])*["]?
which only has two small changes:
It excludes newline characters inside the string, so that the invalid string will only match to the end of the line. (. does not match newline, but a negated character class does, unless of course the newline is explicitly excluded.)
It makes the closing double quote optional rather than arbitrarily repeated.
However, it is hard to imagine a use case in which you would want to silently ignore the error, so I would recommend writing two rules:
["](\\.|[^"\n])*["] { /* valid string */ }
["](\\.|[^"\n])* { /* invalid string */ }
Note that for a valid string the first pattern is guaranteed to be the one chosen, because it matches one more character than the second and (f)lex always goes with the longer match.
Also, writing two overlapping rules like that does not cause any execution overhead, because of the way (f)lex compiles the patterns. In effect, the common prefix is automatically factored out.

How do I scan for a "string" constant in a flex scanner?

As part of a class assignment to create a flex scanner, I need to create a rule that recognizes a string constant. That is, a collection of characters between a set of quotation marks. How do I identify a bad string?
The only way a string literal can be "bad" is if it is missing the closing quote mark. Unfortunately, that is not easy to detect, since it is likely that there is another string literal in the program, and the opening quote of the following string literal will be taken as the missing close quote. Once the quote marks are out of synch, the lexical scan will continue incorrectly until the end of file is detected inside a supposed string literal, at which point an error can be reported.
Languages like the C family do not allow string literals to contain newline characters, which allows missing quotes to be detected earlier. In that case, a "bad" string literal is one which contains a newline. It's quite possible that the lexical scan will incorrectly include characters which were intended to be outside of the string literal, but error recovery is somewhat easier than in languages in which a missing quote effectively inverts the entire program.
It's worth noting that it is almost as common to accidentally fail to escape a quote inside a quoted string, which will result in the string literal being closed prematurely; the intended close quote will then be lexed as an open quote, and the eventual lexical error will again be delayed.
(F)lex uses the "longest match" rule to identify which pattern to recognize. If the string pattern doesn't allow newlines, as in C, it might be (in a simplified version, leaving out the complication of escapes) something like:
\"[^"]*\"
(remembering that in flex, . does not match a newline.) If the closing quote is not present in the line, this pattern will not match, and it is likely that the fallback pattern will succeed, matching only the open quote. That's good enough if immediate failure is acceptable, but if you want to do error recovery, you probably want to ignore the rest of the line. In that case, you might add a pattern such as
\"[^"]*
That will match every valid string as well, of course (not including the closing quote) but it doesn't matter because the valid string literal pattern's match will be longer (by one character). So the pattern without the closing quote will only match unterminated string literals.
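Putting the two patterns together as (f)lex rules (the token name and the recovery action are just illustrative):
\"[^"\n]*\"   { return STRING; /* valid string literal */ }
\"[^"\n]*     { yyerror("Unterminated string literal"); /* rest of the line has been consumed */ }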