JavaCC regular expression that matches all strings except certain ones

What is the regular expression that would match anything except these strings: => and | ?
If you're familiar with JavaCC, I'm trying to define my ANYTHING token as shown below:
TOKEN :
{
<ARROW: "=>" >
|
<ANYTHING: (["\u0001" - "\uffdc"])+>
|
<PIPE: "|">
|
<UPPER_CHAR: (["A"-"Z"])>
}
Thanks

Try something like this:
Test.jj
options {
  STATIC = false ;
}
PARSER_BEGIN(Test)
public class Test {
  public static void main(String[] args) throws Exception {
    Test parser = new Test(new java.io.StringReader("foo=>bar=baz|done"));
    parser.Parse();
  }
}
PARSER_END(Test)
TOKEN :
{
  < ARROW : "=>" >
| < PIPE : "|" >
| < ANYTHING : (~["=", "|"] | "=" ~[">"])+ >
}
void Parse() :
{}
{
  (Any())* <EOF>
}
void Any() :
{Token t;}
{
  ( t=<ARROW> {System.out.println("ARROW = '" + t.image + "'");}
  | t=<PIPE> {System.out.println("PIPE = '" + t.image + "'");}
  | t=<ANYTHING> {System.out.println("ANYTHING = '" + t.image + "'");}
  )
}
Generate the parser classes:
javacc Test.jj
Compile them and run the main method:
javac *.java
java Test
This prints the following for the input "foo=>bar=baz|done":
ANYTHING = 'foo'
ARROW = '=>'
ANYTHING = 'bar=baz'
PIPE = '|'
ANYTHING = 'done'

You are probably best off using lexical states. Here is an RE solution that might work:
("=")+ | ( ~["=","|"] | ("=")* ~["=","|",">"] )+ ("=")*
By the way, I changed the question a little so that the empty string is not an "anything": this RE should match any string that is not empty and contains neither a "|" nor a "=>".
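To try the RE above outside JavaCC, here is a minimal sketch that transliterates it to java.util.regex. The class and method names (AnythingReCheck, isAnything) are made up for illustration, and the second alternative uses =+ instead of =* so that the two alternatives do not overlap; the set of accepted strings should be the same.

```java
import java.util.regex.Pattern;

final class AnythingReCheck {
    // Transliteration of: ("=")+ | ( ~["=","|"] | ("=")* ~["=","|",">"] )+ ("=")*
    // (=+ replaces =* in the second alternative; the zero-"=" case is already
    // covered by the first alternative, so the accepted strings do not change.)
    static final Pattern ANYTHING = Pattern.compile("=+|(?:[^=|]|=+[^=|>])+=*");

    // True when the whole string would form a single ANYTHING token.
    static boolean isAnything(String s) {
        return ANYTHING.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"foo", "bar=baz", "==", ">=", "a=>b", "a|b", ""}) {
            System.out.println("'" + s + "' -> " + isAnything(s));
        }
    }
}
```

The first four inputs are accepted; "a=>b", "a|b" and the empty string are rejected.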

Related

Regex to match block of text inside square brackets that can be nested

I'm writing a parser in Scala that reads a string composed of repetitions of the characters '+', '-', '<', '>' and '.'. The string may also contain '[' and ']' characters, and inside them there is a repetition of the first group of characters.
I need a regex that matches everything inside square brackets; the problem is that the brackets can be nested.
I've already tried the regex \[.*\] and many others that I found on SO, but none seems to work.
The regex I'm looking for should work like this:
"[+++.]" should match "+++."
"[++[-]]" should match "++[-]"
Edit (added a use case):
"[+++.] [++[-]]" should NOT match "+++.] [++[-]" but give 2 matches, "+++." and "++[-]"

That would be pretty tough with a single regex, but with some post-processing you might get a bit closer.
def parse(s: String): Array[String] =
  "\\[(.*)\\]".r.unanchored
    .findAllMatchIn(s)
    .toArray
    .flatMap(_.group(1).split(raw"][^\[\]]+\["))
usage:
parse("+++.]")           //res0: Array[String] = Array()
parse("[+++.]")          //res1: Array[String] = Array("+++.")
parse("[++[-]]")         //res2: Array[String] = Array("++[-]")
parse("[+++.] [++[-]]")  //res3: Array[String] = Array("+++.", "++[-]")
parse("[++[-]--] [+]")   //res4: Array[String] = Array("++[-]--", "+")

After some research I think I may have found the solution; however, it is not usable in Scala. What is needed is a recursive regex that matches balanced constructs, in my case:
\[(?:[^\[\]]|(?R))*\]
As far as I know, recursive patterns of this kind are not supported in Scala (which uses java.util.regex), so I'll just leave this here in case someone needs it for another language.
In the end I solved my problem by implementing the parser in another way; I just thought that a regex like that would have been a simpler and smoother solution.
What I was implementing was a brainfuck language interpreter and here is my parser class:
import scala.util.parsing.combinator._
class brainfuck(var pointer: Int, var array: Array[Int]) extends JavaTokenParsers {
  // A program is a sequence of statements; each statement runs its side effect as it is parsed.
  def Program = rep(Statement) ^^ { _ => () }
  def Statement: Parser[Unit] =
    "+" ^^ { _ => array(pointer) = array(pointer) + 1 } |
    "-" ^^ { _ => array(pointer) = array(pointer) - 1 } |
    "." ^^ { _ => println("elem: " + array(pointer).toChar) } |
    "," ^^ { _ => array(pointer) = scala.io.StdIn.readChar().toInt } |
    ">" ^^ { _ => pointer = pointer + 1 } |
    "<" ^^ { _ => pointer = pointer - 1 } |
    // A loop: re-parse (and thereby re-run) the body while the current cell is non-zero.
    "[" ~> rep(block|squares) <~ "]" ^^ { items => while(array(pointer)!=0) { parseAll(Program,items.mkString) } }
  // A single non-bracket command, kept as text.
  def block =
    """[-+.,<>]""".r ^^ { b => b.toString() }
  // A nested loop, kept as text so that it can be re-parsed later.
  def squares: Parser[String] = "[" ~> rep(block|squares) <~ "]" ^^ { b => "[" + b.mkString + "]" }
}
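Since java.util.regex has no recursion, the original requirement (the contents of each top-level bracket pair, with nested pairs left inside) can also be met with a plain depth counter. The following is a minimal sketch in Java, not the approach used in the answers above; the names BracketBlocks and topLevelBlocks are made up for illustration, and unbalanced brackets are simply ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class BracketBlocks {
    // Returns the contents of every top-level [...] block.
    // Nested brackets stay inside the block that contains them.
    static List<String> topLevelBlocks(String s) {
        List<String> blocks = new ArrayList<>();
        int depth = 0;   // how many '[' are currently open
        int start = -1;  // index just after the '[' that opened the current top-level block
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '[') {
                if (depth == 0) {
                    start = i + 1;
                }
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
                if (depth == 0) {
                    blocks.add(s.substring(start, i));
                }
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(topLevelBlocks("[+++.] [++[-]]"));
    }
}
```

For the use case from the question, "[+++.] [++[-]]" yields the two blocks "+++." and "++[-]".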

How do I test this WML parser?

I had an assignment in which I was supposed to write a WML parser that obeys a given set of grammar rules (they appear as comments in the code below).
The professor posted the solution and I'm trying to test it inside IntelliJ, but it keeps giving me an error. Here's the code:
import WML.args
import scala.util.parsing.combinator._
import scala.io.Source
// -------------------------------------------------------------------------
// These classes form our AST
abstract class ASTNode(val kind : String)
// Outermost program structure
case class ASTProgram(val outer : List[ASTNode]) extends ASTNode("PROGRAM") {
override def toString : String = kind+" ( " + outer.mkString(" ") + " )"
}
// The other non-terminals
case class ASTInvoke(val name : ASTItext, val targs : ASTTargs) extends ASTNode("INVOKE") {
override def toString : String = kind + " ( " + name + " " + targs + " )"
}
case class ASTTargs(val targs : List[ASTItext]) extends ASTNode("TARGS") {
override def toString : String = kind + " ( " + targs.mkString(" ") + " )"
}
case class ASTItext(val itext : List[ASTNode]) extends ASTNode("ITEXT") {
override def toString : String = kind + " ( " + itext.mkString(" ") + " )"
}
case class ASTTvar(val name : String, val opt: ASTItext) extends ASTNode("TVAR") {
override def toString : String = kind + " ( " + "VNAME" + { if (opt==null) "" else " " +opt.toString } + " )"
}
case class ASTTdef(val name : ASTDtext, val dparams : ASTDparams, val body : ASTDtext ) extends ASTNode("DEFINE") {
override def toString : String = kind + " ( " + name + " " + dparams + " " + body + " )"
}
case class ASTDparams(val dparams : List[ASTDtext]) extends ASTNode("DPARAMS") {
override def toString : String = kind + " ( " + dparams.mkString(" ") + " )"
}
case class ASTDtext(val k:String, val dtext : List[ASTNode]) extends ASTNode(k) {
override def toString : String = kind + " ( " + dtext.mkString(" ") + " )"
}
// Our various baseline forms of plain text, Outertext, Inneritext, Innerdtext, Bodytext
// These are really just tokens, but we use nodes to represent them in order to keep track of the contents.
case class ASTText(val k : String,val s : String) extends ASTNode(k) {
override def toString : String = kind
}
// -------------------------------------------------------------------------
// Now the actual parser
class WMLParser extends RegexParsers {
// Tokens. We start with the fixed character sequences, just as strings.
val TSTART = "{{"
val TEND = "}}"
val DSTART = "{'"
val DEND = "'}"
val VSTART = "{{{"
val VEND = "}}}"
val PIPE = "|"
val PIPES = "||"
// We also have tokens for more complex sequences, forming the outermost text,
// as well as the inner text of invocations, definitions, and arguments.
// anything but TSTART or DSTART
val OUTERTEXT = "^([^{]|\\{(?!([{'])))+".r
// anything but TSTART, DSTART, VSTART, PIPE(s), TEND
val INNERITEXT = "^([^{|}]|\\{(?!([{|']))|\\}(?!\\}))+".r
// anything but TSTART, DSTART, VSTART, PIPE(s), DEND
val INNERDTEXT = "^([^{|']|\\{(?!([{|']))|'(?!\\}))+".r
// anything but TSTART, DSTART, VSTART, DEND
val BODYTEXT = "^([^{']|\\{(?!([{']))|'(?!\\}))+".r
// anything but PIPE or VEND
val VNAME = "^([^\\}|]|\\}(?!\\})|\\}\\}(?!\\}))+".r
// Some helper functions
// This deals with our optional component in Targs.
// Here we assume a missing optional part is the empty string---that's actually
// an over-assumption, but it is ok for now.
def convertStringItextToTargs(x:List[String ~ Option[ASTItext]]) : ASTTargs = {
new ASTTargs(x.map( (z) => z match {case s ~ Some(i) => i
case s ~ None => new ASTItext(new ASTText("INNERITEXT","")::Nil) } ))
}
// A helper to convert dtext string, node pairs into just nodes.
def convertStringItextToDparams(x:List[String ~ ASTDtext]) : ASTDparams = {
new ASTDparams(x.map( {case s ~ i => i}))
}
// Now the grammar rules.
// First a few trivial things. We don't really need these as rules per se, as each of
// these just converts a regular expression match into an ASTNode, but making separate
// rules for these makes the type conversion easier.
def textOuter: Parser[ASTNode] = OUTERTEXT ^^ { (x:String) => new ASTText("OUTERTEXT",x) }
def textInner: Parser[ASTNode] = INNERITEXT ^^ { (x:String) => new ASTText("INNERITEXT",x) }
def textInnerd: Parser[ASTNode] = INNERDTEXT ^^ { (x:String) => new ASTText("INNERDTEXT",x) }
def textBody: Parser[ASTNode] = BODYTEXT ^^ { (x:String) => new ASTText("BODYTEXT",x) }
// Our main entry point.
// <program> ::= (OUTERTEXT |<invoke>|<define>)*
def program: Parser[ASTNode] = rep( textOuter | invoke | define ) ^^ {
(x:List[ASTNode]) => { new ASTProgram(x) }
}
// <invoke> ::= TSTART <itext> <targs> TEND
def invoke: Parser[ASTInvoke] = TSTART ~ itext ~ targs ~ TEND ^^ {
case _ ~ i ~ a ~ _ => { new ASTInvoke(i,a) }
}
// <targs> ::= (PIPE <itext>?)*
def targs: Parser[ASTTargs] = rep( PIPE ~ opt(itext) ) ^^ { convertStringItextToTargs _ }
// Here, note that we list tvar before invoke, to give it preference in parsing
// <itext> ::= (INNERTEXT|<tvar>|<invoke>|<define>)*
def itext: Parser[ASTItext] = rep( textInner | tvar | invoke | define ) ^^ { (x:List[ASTNode]) => { new ASTItext(x) } }
// <tvar> ::= VSTART VNAME (PIPE itext)? VEND
def tvar: Parser[ASTTvar] = VSTART ~ VNAME ~ opt(PIPE ~ itext) ~ VEND ^^ {
case _ ~ n ~ None ~ _ => { new ASTTvar (n,null) }
case _ ~ n ~ Some(_ ~ i) ~ _ => { new ASTTvar (n,i) }
}
// <define> ::= DSTART <dtextn> <dparams> PIPES <dtext> DEND
def define: Parser[ASTTdef] = DSTART ~ dtextn ~ dparams ~ PIPES ~ dtextb ~ DEND ^^ {
case _ ~ d ~ p ~ _ ~ b ~ _ => { new ASTTdef(d,p,b) }
}
// <dparams> ::= (SPIPE <dtextp>)*
def dparams: Parser[ASTDparams] = rep( PIPE ~ dtextp ) ^^ { convertStringItextToDparams _ }
// We have 3 forms of dtext. Each of them allows inner invokes, definitions, and args.
// The template name can be empty, so we use * for repetition
// <dtextn> ::= (INNERDTEXT|<tvar>|<invoke>|<define>)*
def dtextn: Parser[ASTDtext] = rep( textInnerd | tvar | invoke | define ) ^^ { (x:List[ASTNode]) => { new ASTDtext("DTEXTN",x) } }
// The parameters cannot be empty, so we use + (ie rep1) for repetition
// <dtextp> ::= (INNERDTEXT|<tvar>|<invoke>|<define>)+
def dtextp: Parser[ASTDtext] = rep1( textInnerd | tvar | invoke | define ) ^^ { (x:List[ASTNode]) => { new ASTDtext("DTEXTP",x) } }
// Finally, the body itself can be empty, so we use * for repetition. Different from a name, however,
// the text itself can include pipe symbols.
// <dtextb> ::= (BODYTEXT|<tvar>|<invoke>|<define>)*
def dtextb: Parser[ASTDtext] = rep( textBody | tvar | invoke | define ) ^^ { (x:List[ASTNode]) => { new ASTDtext("DTEXTB",x) } }
// We do not want whitespace discarded, we will do that ourselves
override val whiteSpace = "".r
}
// And finally a program to invoke the parser on a file or input string
object WML extends App {
def main(args: Array[String]): Unit = {
def help(): Unit = {
println("Specify (-s string|filename)")
System.exit(1)
}
if (args.length==0) {
help()
}
// taking a string input was not required, but it is convenient for testing, so allow it with a -s specifier.
val source = args match {
case Array(_,"-s",s) => s
case Array("-s",s) => s
case Array(_,fn) => Source.fromFile(fn, "UTF-8").mkString
case Array(fn) => Source.fromFile(fn, "UTF-8").mkString
}
val p = new WMLParser
val result = p.parseAll(p.program,source);
if (result.successful) {
println(result.get)
} else {
println("Parse failure: " + result);
}
}
}
It asks me to override the main method and when I do, I get this error:
objc[12857]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home/bin/java (0x104e314c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x105e9c4e0). One of the two will be used. Which one is undefined.
Exception in thread "main" scala.MatchError: [Ljava.lang.String;@3551a94 (of class [Ljava.lang.String;)
at WML$.main(solution2.scala:176)
at WML.main(solution2.scala)
Here is what I have on line 176:
case Array(fn) => Source.fromFile(fn, "UTF-8").mkString
Thanks.

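Can you please try upgrading your JDK to 8u161?
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
The objc warning looks like a problem with IntelliJ rather than with your code. It is separate from the scala.MatchError, which is thrown because args matched none of the cases in the match expression.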

String replacing nested JSON in Scala

I have a Scala method that will be given a String like so:
"blah blah sediejdri \"foos\": {\"fizz\": \"buzz\"}, odedrfj49 blah"
And I need to strip the "foos JSON" out of it using pure Java/Scala (no external libs). That is, find the substring matching the pattern:
\"foos\" : {ANYTHING},
...and strip it out, so that the input string is now:
"blah blah sediejdri odedrfj49 blah"
The token to search for will always be \"foos\", but the content inside the JSON curly braces will always be different. My best attempt is:
// Ex: "blah \"foos\": { flim flam }, blah blah" ==> "blah blah blah", etc.
def stripFoosJson(toClean: String): String = {
  val regex = ".*\"foos\" {.*},.*"
  toClean.replaceAll(regex, "")
}
However, my regex is clearly not correct. Can anyone spot where I'm going awry?

Here are two solutions I came up with; I hope they help. I think you forgot to handle possible whitespace with \s* etc.
object JsonStrip extends App {
  // SOLUTION 1, the hard way; also handles nested braces:
  // Returns the index of the brace that closes the one at openPos, or openPos if there is none.
  def findClosingParen(text: String, openPos: Int): Int = {
    var closePos = openPos
    var parensCounter = 1 // if (parensCounter == 0) it's a match!
    while (parensCounter > 0 && closePos < text.length - 1) {
      closePos += 1
      val c = text(closePos)
      if (c == '{') {
        parensCounter += 1
      } else if (c == '}') {
        parensCounter -= 1
      }
    }
    if (parensCounter == 0) closePos else openPos
  }
  val str = "blah blah sediejdri \"foos\": {\"fizz\": \"buzz\"}, odedrfj49 blah"
  val indexOfFoos = str.indexOf("\"foos\"")
  val indexOfFooOpenBrace = str.indexOf('{', indexOfFoos)
  val indexOfFooCloseBrace = findClosingParen(str, indexOfFooOpenBrace)
  // here you would check whether the brace was actually found etc...
  // + 3 skips the closing brace, the comma and the single space after it
  val stripped = str.substring(0, indexOfFoos) + str.substring(indexOfFooCloseBrace + 3)
  println("WITH BRACE COUNT: " + stripped)
  // SOLUTION 2, with regex:
  val reg = "\"foos\"\\s*:\\s*\\{(.*)\\}\\s*,\\s*"
  println("WITH REGEX: " + str.replaceAll(reg, ""))
}
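The brace-counting idea above can be packaged as a single method that returns the input unchanged when no complete "foos" fragment is found. This is a minimal sketch in Java under a few assumptions: the names FoosStripper and stripFoos are made up for illustration, only the first occurrence is removed, and braces that appear inside JSON string values are not treated specially.

```java
final class FoosStripper {
    // Removes the first  "foos": { ... },  fragment from text.
    // Braces are counted so that nested objects are skipped as a whole.
    static String stripFoos(String text) {
        int start = text.indexOf("\"foos\"");
        if (start < 0) {
            return text; // no "foos" key at all
        }
        int open = text.indexOf('{', start);
        if (open < 0) {
            return text; // key without an object after it
        }
        int depth = 0;
        int close = -1;
        for (int i = open; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                close = i;
                break;
            }
        }
        if (close < 0) {
            return text; // braces never balance
        }
        // Skip optional whitespace, one optional comma, and whitespace after it.
        int end = close + 1;
        while (end < text.length() && Character.isWhitespace(text.charAt(end))) end++;
        if (end < text.length() && text.charAt(end) == ',') end++;
        while (end < text.length() && Character.isWhitespace(text.charAt(end))) end++;
        return text.substring(0, start) + text.substring(end);
    }

    public static void main(String[] args) {
        String in = "blah blah sediejdri \"foos\": {\"fizz\": \"buzz\"}, odedrfj49 blah";
        System.out.println(stripFoos(in));
    }
}
```

For the string from the question this prints "blah blah sediejdri odedrfj49 blah".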

The regex \\"foos\\": \{(.*?)\} should match what you want in most regex engines; depending on how the pattern is quoted, you might need to replace " with \". If your JSON can contain other curly brackets, you can use \\"foos\\": (\{(?>[^{}]|(?1))*\}), which uses recursion to match balanced pairs of brackets. Note that this one only works in engines that support recursion, such as PCRE; others, including java.util.regex, don't support it.

Couchbase xdcr regex - How do I exclude keys using regex?

I am trying to exclude certain documents from being transported to ES using XDCR.
I have the following regex that filters ABCD and IJ
https://regex101.com/r/gI6sN8/11
Now I want to use this regex in the XDCR filtering:
^(?!.*(ABCD|IJ)).*$
How do I exclude keys using regex?
EDIT:
What if I want to select everything that doesn't contain ABCDE and ABCHIJ?
I tried
https://regex101.com/r/zT7dI4/1

Edit:
Sorry, after looking at it further, this method is invalid. For instance, [^B] allows an A to get by, letting AABCD slip through (it matches AA first, then matches BCD with [^AI]+). Please disregard this post.
The demo here shows that the method below is invalid.
(Disregard this.)
You could use a POSIX-style trick to exclude words.
The pattern below excludes ABCD and IJ; you can get a sense of the general pattern from it.
Basically, you put all the first letters into a negative class as the first item in the alternation, then handle each word in a separate alternation.
^(?:[^AI]+|(?:A(?:[^B]|$)|AB(?:[^C]|$)|ABC(?:[^D]|$))|(?:I(?:[^J]|$)))+$
Demo
Expanded
^
(?:
     [^AI]+
  |
     (?:                  # Handle 'ABCD'
          A
          (?: [^B] | $ )
       |  AB
          (?: [^C] | $ )
       |  ABC
          (?: [^D] | $ )
     )
  |
     (?:                  # Handle 'IJ'
          I
          (?: [^J] | $ )
     )
)+
$
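The flaw described in the edit note can be reproduced with a short check. This is a sketch using java.util.regex as a stand-in, which may behave differently from the engine behind the XDCR filter; the class name ExclusionCheck and the lookahead pattern with its .* quantifiers are assumptions made for comparison.

```java
import java.util.regex.Pattern;

final class ExclusionCheck {
    // The alternation-based pattern from the answer above.
    static final Pattern ALTERNATION = Pattern.compile(
        "^(?:[^AI]+|(?:A(?:[^B]|$)|AB(?:[^C]|$)|ABC(?:[^D]|$))|(?:I(?:[^J]|$)))+$");

    // Lookahead-based exclusion, assumed here for comparison.
    static final Pattern LOOKAHEAD = Pattern.compile("^(?!.*(ABCD|IJ)).*$");

    // True when the pattern matches the whole key, i.e. the key is NOT excluded.
    static boolean accepts(Pattern p, String key) {
        return p.matcher(key).matches();
    }

    public static void main(String[] args) {
        for (String key : new String[] {"XYZ", "ABCD", "XIJ", "AABCD"}) {
            System.out.println(key
                + " alternation=" + accepts(ALTERNATION, key)
                + " lookahead=" + accepts(LOOKAHEAD, key));
        }
    }
}
```

Both patterns agree on XYZ, ABCD and XIJ, but the alternation pattern accepts AABCD even though it contains ABCD, while the lookahead pattern rejects it.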

Hopefully one day there will be built-in support for inverting the match expression. In the meantime, here's a Java 8 program that generates regular expressions for inverted prefix matching using only basic regex features supported by the Couchbase XDCR filter.
This should work as long as your key prefixes are somehow delimited from the remainder of the key. Make sure to include the delimiter in the input when modifying this code.
Sample output for red:, reef:, green: is:
^([^rg]|r[^e]|g[^r]|re[^de]|gr[^e]|red[^:]|ree[^f]|gre[^e]|reef[^:]|gree[^n]|green[^:])
File: NegativeLookaheadCheater.java
import java.util.*;
import java.util.stream.Collectors;
public class NegativeLookaheadCheater {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("red:", "reef:", "green:");
        System.out.println("^" + invertMatch(input));
    }
    private static String invertMatch(Collection<String> literals) {
        int maxLength = literals.stream().mapToInt(String::length).max().orElse(0);
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < maxLength; i++) {
            terms.addAll(terms(literals, i));
        }
        return "(" + String.join("|", terms) + ")";
    }
    private static List<String> terms(Collection<String> words, int index) {
        List<String> result = new ArrayList<>();
        Map<String, Set<Character>> prefixToNextLetter = new LinkedHashMap<>();
        for (String word : words) {
            if (word.length() > index) {
                String prefix = word.substring(0, index);
                prefixToNextLetter.computeIfAbsent(prefix, key -> new LinkedHashSet<>()).add(word.charAt(index));
            }
        }
        prefixToNextLetter.forEach((literalPrefix, charsToNegate) -> {
            result.add(literalPrefix + "[^" + join(charsToNegate) + "]");
        });
        return result;
    }
    private static String join(Collection<Character> collection) {
        return collection.stream().map(c -> Character.toString(c)).collect(Collectors.joining());
    }
}
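To see what the generated expression accepts, the sample output quoted above can be run against a few keys. This is a sketch that uses java.util.regex with find() as a stand-in for the XDCR filter, which is an assumption about how the filter applies the expression; the class name InvertedPrefixCheck and the sample keys are made up for illustration.

```java
import java.util.regex.Pattern;

final class InvertedPrefixCheck {
    // The sample output quoted above for the prefixes red:, reef: and green:.
    static final Pattern NOT_PREFIXED = Pattern.compile(
        "^([^rg]|r[^e]|g[^r]|re[^de]|gr[^e]|red[^:]|ree[^f]|gre[^e]|reef[^:]|gree[^n]|green[^:])");

    // True when the key passes the filter, i.e. it does not start with an excluded prefix.
    static boolean passesFilter(String key) {
        return NOT_PREFIXED.matcher(key).find();
    }

    public static void main(String[] args) {
        for (String key : new String[] {"red:1", "reef:2", "green:3", "blue:4", "reed:5", "red"}) {
            System.out.println(key + " -> " + passesFilter(key));
        }
    }
}
```

Keys starting with red:, reef: or green: are rejected, while blue:4 and reed:5 pass. A key that is only a proper prefix of an excluded prefix, such as "red", is rejected too, because every term needs a character at the position where it differs.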

How do I use a regex to find a duplicated string on the left and the right of an equals sign

I have a file with some translations and some missing translations, where the English key equals the translation.
...
/* comment1 */
"An unexpected error occurred." = "Ein unerwarteter Fehler ist aufgetreten.";
/* comment2 */
"Enter it here..." = "Enter it here...";
...
Is it possible to:
Find all occurrences of "X" = "X";?
Bonus: for all occurrences, delete the line, the comment line above it and the newline above that?

You'll need to use backreferences here, something along the lines of:
/"(.+)"\s*=\s*"\1"/
  ^            ^
  |            |
  |            backreference to first string
  |
  capture group for first string
Note that the syntax for backreferences varies between languages; the one above works for your case in Ruby, e.g.
❯ irb
2.2.2 :001 > r = /"(.+)"\s*=\s*"\1"/
=> /"(.+)"\s*=\s*"\1"/
2.2.2 :002 > r.match('"foo" = "foo"')
=> #<MatchData "\"foo\" = \"foo\"" 1:"foo">
2.2.2 :003 > r.match('"foo" = "bar"')
=> nil
In response to your comment about wanting to do it in a text editor, remove the leading/trailing slashes and the above regex should work fine in Sublime Text... YMMV in other editors.

For the Bonus question:
(\R\R)?+/\*[^*]*(?:\*+(?!/)[^*]*)*\*/\R("[^"]*") = \2;(?(1)|\R{0,2})
demo
(Works in Notepad++. It also removes the blank line above each entry or, for the first entry, the newlines that follow it.)
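The same idea can be tried outside an editor. Below is a rough Java sketch, not a port of the pattern above: it assumes each comment sits on a single line directly above its entry, and the names DuplicateEntryRemover and removeUntranslated are made up for illustration.

```java
final class DuplicateEntryRemover {
    // Optional blank line, a one-line /* ... */ comment, a line break,
    // then "X" = "X"; where \1 forces both sides to be identical.
    static final String ENTRY =
        "(?m)^\\R?/\\*.*?\\*/\\R\"([^\"]*)\"\\s*=\\s*\"\\1\";(?:\\R|\\z)";

    // Returns the file contents with every untranslated entry, its comment line
    // and the blank line above the comment removed.
    static String removeUntranslated(String fileContents) {
        return fileContents.replaceAll(ENTRY, "");
    }

    public static void main(String[] args) {
        String in =
            "/* comment1 */\n"
            + "\"An unexpected error occurred.\" = \"Ein unerwarteter Fehler ist aufgetreten.\";\n"
            + "\n"
            + "/* comment2 */\n"
            + "\"Enter it here...\" = \"Enter it here...\";\n";
        System.out.print(removeUntranslated(in));
    }
}
```

With the sample from the question, only the comment1 entry remains. If the very first entry of a file is untranslated, there is no blank line above it, so the blank line below it is left in place.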

You can find all the occurrences by matching each line against the following pattern: "(.*?)"\s*=\s*"\1". If a line matches, you can delete it.
Java working example
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class StackOverflow32507709 {
    // Matches "X" = "X": group 1 captures the key and \1 requires the same text as the value.
    public static String pattern;
    static {
        pattern = "\"(.*?)\"\\s*=\\s*\"\\1\"";
    }
    public static void main(String[] args) {
        String[] text = {
            "/* comment1 */",
            "\r\n",
            "\"An unexpected error occurred\" = \"German translation...\";\r\n",
            "\r\n",
            "\"Enter it here\" = \"Enter it here\";\r\n"
        };
        List<String> filteredTranslations = new ArrayList<String>();
        Pattern p = Pattern.compile(pattern);
        for (String line : text) {
            Matcher m = p.matcher(line);
            // Keep only the lines where key and value differ.
            if (!m.find()) {
                filteredTranslations.add(line);
            }
        }
        for (String filteredTranslation : filteredTranslations) {
            System.out.println(filteredTranslation);
        }
    }
}

You need to use a backreference, like this: http://www.regular-expressions.info/backref.html
I can't give you a full answer because you haven't said which programming language you are using, but I'm sure you can figure it out from there.
