How does flex match the beginning of line anchor? - regex

I've always wondered how the beginning of input anchor (^) was converted to a FSA in flex. I know that the end of line anchor ($) is matched by the expression r/\n where r is the expression to match. How's the beginning of input anchor matched? The only solution I see is to use start conditions. How can it be implemented in a program?

End of line marker $ is different from \n in that it matches EOF as well, even if the end-of-line marker \n or \r\n is not found at the end of the file.
I did not look at flex's implementation, but I would implement both ^ and $ using boolean flags. The ^ flag would be initially set, then reset to false after the first character in a line, then set back to true after the next end-of-line marker, and so on.
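The flag idea above can be sketched in a few lines of JavaScript. This is only an illustration of the bookkeeping, not how flex itself is written; the function name `markLineStarts` is made up for the example.

```javascript
// Hypothetical helper: for each character, report whether it sits at the
// beginning of a line, using a single boolean flag as described above.
function markLineStarts(input) {
  const marks = [];
  let atLineStart = true;          // set initially: the input starts a line
  for (const ch of input) {
    marks.push({ ch, atLineStart });
    atLineStart = (ch === '\n');   // set again only right after a newline
  }
  return marks;
}

const starts = markLineStarts('ab\ncd')
  .filter(m => m.atLineStart)
  .map(m => m.ch);
console.log(starts); // [ 'a', 'c' ]
```

A rule anchored with ^ would then be allowed to match only when the flag is set at the position where the match starts.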

If your scanner uses the ^ anchor, then every start condition needs two initial-state entries:
one for the beginning of a line, and
one for everywhere else.
Flex does this, and peeks behind the input pointer to determine which entry to consult.
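As a rough sketch of that selection step (the state numbering and the name `initialState` are invented for illustration, and the real generated scanner may lay out its tables differently), each start condition can own two consecutive entry states and the scanner picks one by looking at the previous character:

```javascript
// Sketch only: state numbers and names are assumptions, not flex internals.
// Each start condition owns two consecutive initial states:
//   base + 0 -> not at beginning of line
//   base + 1 -> at beginning of line (rules anchored with ^ are live here)
function initialState(startCondition, input, pos) {
  const base = 2 * startCondition;
  // "Peek behind" the input pointer: position 0, or just after '\n', is BOL.
  const atBol = pos === 0 || input[pos - 1] === '\n';
  return base + (atBol ? 1 : 0);
}

console.log(initialState(0, 'ab\ncd', 0)); // 1
console.log(initialState(0, 'ab\ncd', 1)); // 0
```

With this layout the DFA itself needs no special anchor transition: the ^ rules are simply reachable from only one of the two entry states.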

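The beginning of line anchor is matched by the pattern:
beginningOfLine ^.
(a caret followed by a dot, i.e. the first character of any non-empty line)
Example (numbering the lines of a text):
%{
#include <stdio.h>
int ln = 1;   /* number of the next line to label */
%}
beginningOfLine ^.
newline \n
%%
{beginningOfLine} { if (ln == 1) {
        /* the first line has no newline before it, so label it here */
        printf("%d \t", ln);
        ln++;
    }
    /* never pass yytext as the format string: input may contain % */
    printf("%s", yytext); }
{newline} { printf("\n");
    printf("%d \t", ln);   /* label the line that starts after this newline */
    ln++; }
%%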

Related

Character not at beginning of line; not followed or preceded by character

I'm trying to isolate a " character when (simultaneously):
it's not in the beginning of the line
it's not followed by the character ";"
it's not preceded by the character ";"
E.g.:
Line: "Best Before - NO MATCH
Line: Best Before"; - NO MATCH
Line: ;"Best "Before - NO MATCH
Line: Best "Before - MATCH
My best solution is (?<![;])([^^])(")(?![;]) but it's not working correctly.
I also tried (?<![;])(")(?![;]), but it's only partial (missing the "not at the beginning" part)
I don't understand why I'm spelling the "AND not at the beginning" wrong.
Where am I missing it?
If you want to allow partial matches, you can extend the lookbehind with an alternation not asserting the start of the string to the left.
The semicolon does not have to be placed between square brackets.
(?<!;|^)"(?!;)
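To see what "partial matches" means in practice, here is a small JavaScript check of the pattern above against the four sample lines from the question. It assumes an engine with lookbehind support (ES2018 or later); `quotePositions` is just a helper name for this example.

```javascript
// Pattern from the answer above, with the g flag so every hit is reported.
const quote = /(?<!;|^)"(?!;)/g;

// Hypothetical helper: indexes of all quotes the pattern accepts in a line.
function quotePositions(line) {
  return [...line.matchAll(quote)].map(m => m.index);
}

console.log(quotePositions('"Best Before'));   // []
console.log(quotePositions('Best Before";'));  // []
console.log(quotePositions(';"Best "Before')); // [ 7 ]
console.log(quotePositions('Best "Before'));   // [ 5 ]
```

Note that the third line still produces a hit on its second quote, because the pattern judges each quote on its own rather than rejecting the whole line.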
If you want to match the " only when there is no occurrence of ;" to its left or right, and an unbounded quantifier in a lookbehind assertion is allowed:
(?<!^.*;(?=").*|^)"(?!;|.*;")
In notepad++ you can use
^.*(?:;"|";).*$(*SKIP)(*F)|(?<!^)"
You can use the fact that not preceded by ; means that it's also not the first character on the line to simplify things
[^;]"(?:[^;]|$)
This gives you
Match a character that's not a ; (so there must be a character and thus the next character can't be the start of the line)
Match a "
Match a character that's not a ; or the end of the line
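One thing to keep in mind with this simpler pattern is that it consumes the neighbouring characters, so the match is wider than the quote itself. A quick JavaScript sketch (variable names are only for illustration):

```javascript
// Pattern from the answer above: the characters around the quote are
// part of the match, not just asserted.
const consuming = /[^;]"(?:[^;]|$)/;

const hit = 'Best "Before'.match(consuming);
console.log(JSON.stringify(hit[0])); // " \"B"
console.log(hit.index + 1);          // 5, the position of the quote itself

console.log('"Best Before'.match(consuming));  // null
console.log('Best Before";'.match(consuming)); // null
```

If only the quote should be replaced or highlighted, wrap it in a capture group or fall back to the lookaround version.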
I know you are asking for a regex solution, but, almost always, strings can also be filtered using string methods in whatever language you are working in.
For the sake of completeness, to show that regex is not your only available tool here, here is a short javascript using the string methods:
myString.charAt()
myString.includes()
Working Example:
const checkLine = (line) => {
  switch (true) {
    // DOUBLE QUOTES AT THE BEGINNING
    case (line.charAt(0) === '"'):
      return console.log(line, '// NO MATCH');
    // DOUBLE QUOTES IMMEDIATELY FOLLOWED BY SEMI-COLON
    case (line.includes('";')):
      return console.log(line, '// NO MATCH');
    // DOUBLE QUOTES IMMEDIATELY PRECEDED BY SEMI-COLON
    case (line.includes(';"')):
      return console.log(line, '// NO MATCH');
    default:
      return console.log(line, '// MATCH');
  }
}
checkLine('"Best Before');
checkLine('Best Before";');
checkLine(';"Best "Before');
checkLine('Best "Before');
Further Reading:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/includes

How to extract all characters after the 30th in simple string.

I have a text field that returns a string of characters from 38 to 40 long. I need to just extract from the 30th character to the end.
I used .{9}$ to grab the last nine characters, then realized that the original strings vary in length and only the first 29 characters are unneeded. Everything after that is the case number, which is what I need. The case number can be anywhere from 9 to 12 characters long.
Skip the First 29 and Extract the Rest
The string methods .slice(), .substring(), and .replace() all work here. A RegEx that skips the first 29 characters (whitespace included) and captures the rest is:
(?:[\s\S]{29})([\s\S]*)
Start non-capturing group: (?:
Any whitespace or non-whitespace character: [\s\S]
Twenty-nine times: {29}
End non-capturing group: )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next, start capturing group: (
Any whitespace or non-whitespace character: [\s\S]
Zero or more times, greedily, so the group runs to the end of the string: *
End capturing group: )
Demo
var str = '123456789㉈123456789㉉123456789㉊123456789㉋';
var rgx = /(?:[\s\S]{29})([\s\S]*)/;
console.log('var str = ' + str);
console.log('--------------------------------------------------');
console.log('str.slice(29) // ' + str.slice(29));
console.log('--------------------------------------------------');
console.log('str.substring(29) // ' + str.substring(29));
console.log('--------------------------------------------------');
console.log(`var rgx = ${rgx}`);
console.log('str.replace(rgx, "$1") // ' + str.replace(rgx, '$1'));
You can use the substring function:
[YourString].substring(29)
When using Python, simply do:
string[29:]
Python indices are zero-based, so this returns everything from the thirtieth character to the end.
In a regex flavor that supports \K (such as PCRE or Perl):
^.{29}\K.*$
^ asserts position at start of the string
.{29} matches any character (except for line terminators) exactly 29 times
\K resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match
.* matches any character (except for line terminators) zero or more times
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
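JavaScript's regex engine has no \K, so there the same idea is usually written with a capture group. A minimal sketch, assuming the case number is whatever follows the first 29 characters (the function name and the sample field are made up):

```javascript
// Hypothetical helper: skip 29 characters, capture the remainder.
// The s flag (ES2018) lets . match line terminators as well.
function caseNumber(field) {
  const m = /^.{29}(.*)$/s.exec(field);
  return m ? m[1] : null;   // null when the field is shorter than 29 chars
}

const sample = 'X'.repeat(29) + 'CASE-12345';
console.log(caseNumber(sample));  // CASE-12345
console.log(caseNumber('short')); // null
```

For plain fixed-offset extraction like this, `field.slice(29)` gives the same result without a regex; the pattern is mainly useful when the tool only accepts regular expressions.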

regular expression which should allow limited special characters

Can anyone tell me the regular expression for a text field that should reject the following characters but accept other special characters, letters, numbers and so on:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ # &
This rejects any string that contains any of the characters mentioned above, anywhere in the string.
^(?!.*[+\-&|!(){}[\]^"~*?:#&]+).*$
Brief Explanation
Assert position at the beginning of a line (at the beginning of the string or after a line break character): ^
Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!.*[+\-&|!(){}[\]^"~*?:#&]+)
Match any single character that is not a line break character: .
Between zero and unlimited times, as many times as possible, giving back as needed (greedy): *
Match a single character present in the list below: [+\-&|!(){}[\]^"~*?:#&]
Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
The character "+": +
A "-" character: \-
One of the characters &|!(){}[
A "]" character: \]
One of the characters ^"~*?:#&
Match any single character that is not a line break character: .
Between zero and unlimited times, as many times as possible, giving back as needed (greedy): *
Assert position at the end of a line (at the end of the string or before a line break character): $
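The same check can also be written without a lookahead by testing for a forbidden character and negating the result. The JavaScript sketch below does that; note that it additionally rejects the backslash, which appears in the question's list, and that the names `FORBIDDEN` and `isAllowed` are only illustrative.

```javascript
// Any single character from the question's blacklist, backslash included.
// No g flag, so test() keeps no state between calls.
const FORBIDDEN = /[+\-&|!(){}\[\]^"~*?:\\#]/;

// Hypothetical helper: true when the text contains none of those characters.
function isAllowed(text) {
  return !FORBIDDEN.test(text);
}

console.log(isAllowed('hello world_1.0')); // true
console.log(isAllowed('a+b'));             // false
console.log(isAllowed('back\\slash'));     // false
```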
It's usually better to whitelist the characters you allow than to blacklist the characters you don't, both from a security standpoint and from an ease-of-implementation standpoint.
If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.
http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07
If you want to whitelist all the accented characters, perhaps using Unicode ranges would help? Check out this link.
http://www.regular-expressions.info/unicode.html
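As a rough illustration of the whitelist approach with Unicode properties, here is a JavaScript sketch. The allowed set (letters, combining marks, digits, space, underscore, dot, comma) is an assumption chosen for the example, not something the question specifies, and the u flag requires ES2018 or later.

```javascript
// Assumed whitelist: any Unicode letter, mark or number, plus " _.,".
const ALLOWED = /^[\p{L}\p{M}\p{N} _.,]*$/u;

// Hypothetical helper: true when every character is on the whitelist.
function isWhitelisted(text) {
  return ALLOWED.test(text);
}

console.log(isWhitelisted('Cr\u00e8me br\u00fbl\u00e9e 42')); // true
console.log(isWhitelisted('a+b'));                            // false
```

Because anything not listed is rejected by default, characters nobody thought of in advance cannot slip through, which is the security argument made above.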
I recognize those as the characters which need to be escaped for Solr. If this is the case, and if you are coding in PHP, then you should use my PHP utility functions from Github. Here is one of the Solr functions from there:
/**
 * Escape values destined for Solr
 *
 * @author Dotan Cohen
 * @version 2013-05-30
 *
 * @param value to be escaped. Valid data types: string, array, int, float, bool
 * @return Escaped string, NULL on invalid input
 */
function solr_escape($str)
{
if ( is_array($str) ) {
foreach ( $str as &$s ) {
$s = solr_escape($s);
}
return $str;
}
if ( is_int($str) || is_float($str) || is_bool($str) ) {
return $str;
}
if ( !is_string($str) ) {
return NULL;
}
$str = addcslashes($str, "+-!(){}[]^\"~*?:\\");
$str = str_replace("&&", "\\&&", $str);
$str = str_replace("||", "\\||", $str);
return $str;
}
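For readers not working in PHP, the same escaping rules can be sketched in JavaScript. This is an unofficial port written for illustration, not part of the author's utility functions; it mirrors the character list and the two-character operators used in the PHP function above.

```javascript
// Illustrative port of the PHP function above; the name solrEscape is assumed.
function solrEscape(value) {
  if (Array.isArray(value)) return value.map(solrEscape);
  if (typeof value === 'number' || typeof value === 'boolean') return value;
  if (typeof value !== 'string') return null;
  return value
    // Prefix each single special character with a backslash.
    .replace(/[+\-!(){}\[\]^"~*?:\\]/g, '\\$&')
    // && and || are escaped as pairs, with one leading backslash.
    .replace(/&&/g, '\\&&')
    .replace(/\|\|/g, '\\||');
}

console.log(solrEscape('title:(a+b)')); // title\:\(a\+b\)
console.log(solrEscape('x && y'));      // x \&& y
```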

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
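The two problems described above (the stray | in the character class and the missing anchors) are easy to reproduce outside Perl. Here is a small JavaScript sketch; in JavaScript, $ without the m flag matches only at the very end of the input, so it plays the role of \z here. The names are illustrative.

```javascript
// The original, faulty pattern: unanchored, and | is a literal in a class.
const UNANCHORED = /[A|C|G|T]/i;

// Anchored version: the whole string must consist of A, C, G or T.
const DNA = /^[ACGT]+$/i;

function isDnaSequence(s) {
  return DNA.test(s);
}

console.log(UNANCHORED.test('AAYYGGTTA')); // true  (wrongly accepted)
console.log(UNANCHORED.test('|'));         // true  (the pipe is in the class)
console.log(isDnaSequence('AACGGGTTA'));   // true
console.log(isDnaSequence('AAYYGGTTA'));   // false
console.log(isDnaSequence('ACGT\n'));      // false
```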

Regular expression in Flex

I want to check if the string is not empty (having whitespaces only also counts as empty). How to compose the regular expression in actionscript?
The pattern should be something like /^\s*$/ (for a single line string); ^ and $ represent the start and end of the line and \s* means match zero or more whitespace characters. For example:
var s:String = /* ... */;
var allWhitespaceOrEmpty:RegExp = /^\s*$/;
if (allWhitespaceOrEmpty.test(s))
{
// is empty or all whitespace
}
else
{
// is non-empty with at least 1 non-whitespace char
}
Perhaps a simpler way as commenter Alexander Farber points out is to check for any character except a whitespace character, which is matched by \S in regex:
var nonWhitespaceChar:RegExp = /\S/;
if (nonWhitespaceChar.test(s))
{
// is non-empty with at least 1 non-whitespace char
}
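ActionScript's RegExp syntax follows the ECMAScript conventions, so both patterns can be tried out in plain JavaScript before putting them into a Flex project. A minimal sketch (the helper name `isBlank` is made up, and the snippet assumes the value is already a string):

```javascript
// Blank means: empty, or nothing but whitespace.
function isBlank(s) {
  return !/\S/.test(s);       // no non-whitespace character anywhere
}

console.log(isBlank(''));        // true
console.log(isBlank('  \t\n'));  // true
console.log(isBlank(' a '));     // false

// The anchored form gives the same answers for these inputs.
console.log(/^\s*$/.test('  \t\n')); // true
```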