Problems with an ambiguous grammar and PEG.js (no examples found) - regex

I want to parse a file with lines of the following content:
simple word abbr -8. (012) word, simple phrase, one another phrase - (simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; "It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose") This is the end of the line
so now explaining the parts (not all spaces are described, because of the markup here):
simple word is one or several words (phrase) separated by the whitespace
abbr - is a fixed part of the string (never changes)
8 - optional number
. - always included
word, simple phrase, one another phrase - one or several words or phrases separated by comma
- ( - fixed part, always included
simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; - (optional) one or several phrases separated by ;
"It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose" - (optional) one or several phrases with quotation marks "separated by ;
) This is the end of the line - always included
In the worst case there are no phrases in clause, but this is uncommon: there should be a phrase without augmenting quotation marks (phrase1 type) or with them (phrase2 type).
So the phrases are Natural Language sentences (with all the punctuation possible)...
BUT:
the internal content is irrelevant (i.e. I do not need to parse the Natural Language itself in the NLP meaning)
it is just required to mark it as a phrase1 or phrase2 types:
those without and with quotation marks, i.e., if the phrase, which is placed between ( and ; or ; and ; or ; and ) or even between ( and ) is augmented with quotation marks, then it is phrase2 type
otherwise, if the phrase either begins or ends without quotation marks, though it could have all the marks within the phrase, it is the phrase1 type
Since to write a Regex (PCRE) for such an input is an overkill, so I looked to the parsing approach (EBNF or similar). I ended up with a PEG.js parser generator. I created a basic grammar variants (even not handling the part with the different phrases in the clause):
start = term _ "abbr" _ "-" .+
term = word (_? word !(_ "abbr" _ "-"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
or (the difference is only in " abbr -" and "_ "abbr" _ "-"" ):
start = term " abbr -" .+
term = word (_? word !(" abbr -"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
But even this simple grammar cannot parse the beginning of the string. The errors are:
Parse Error Expected [A-Za-z] but " " found.
Parse Error Expected "abbr" but "-" found.
etc.
So it looks the problem is in the ambiguity: "abbr" is consumed withing the term as a word token. Although I defined the rule !(" abbr -"), which I thought has a meaning, that the next word token will be only consumed, if the next substring is not of " abbr -" kind.
I didn't found any good examples explaining the following expressions of PEG.js, which seems to me as a possible solution of the aforementioned problem [from: http://pegjs.majda.cz/documentation]:
& expression
! expression
$ expression
& { predicate }
! { predicate }
TL;DR:
related to PEG.js:
are there any examples of applying the rules:
& expression
! expression
$ expression
& { predicate }
! { predicate }
general question:
what is the possible approach to handle such complex strings with intuitive ambiguous grammars? This is still not a Natural Language and it looks like it has some formal structure, just with several optional parts. One of the ideas is to split the strings by preprocessing (with the help of Regular Expressions, in the places of fixed elements, i.e. "abbr -" ") This is the end of the line") and then create for each splited part a separate grammar. But it seems to have have performance issues and scalability problems (i.e. - what if the fixed elements will change a bit - e.g. there will be no - char anymore.)
Update1:
I found the rule which solves the problem with matching the "abbr -" ambiguity:
term = term:(word (!" abbr -" _? word))+ {return term.join("")}
but the result looks strange:
[
"simple, ,word",
" abbr -",
[
"8",
...
],
...
]
if removing the predicate: term = term:(word (!" abbr -" _? word))+:
[
[
"simple",
[
[
undefined,
[
" "
],
"word"
]
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
I expected something like:
[
[
"simple word"
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
or at least:
[
[
"simple",
[
" ",
"word"
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
The expression is grouped, so why is it separated in so many nesting levels and even undefined is included in the output? Are there any general rules to fold the result based on the expression in the rule?
Update2:
I created the grammar so that it parses as desired, though I didn't yet identified the clear process of such a grammar creation:
start
= (term:term1 (" abbr -" number "." _ "("number:number") "{return number}) terms:terms2 ((" - (" phrases:phrases ")" .+){return phrases}))
//start //alternative way = looks better
// = (term:term1 " abbr -" number "." _ "("number:number") " terms:terms2 " - (" phrases:phrases ")" .+){return {term: term, number: number, phrases:phrases}}
term1
= term1:(
start_word:word
(rest_words:(
rest_word:(
(non_abbr:!" abbr -"{return non_abbr;})
(space:_?{return space[0];}) word){return rest_word.join("");})+{return rest_words.join("")}
)) {return term1.join("");}
terms2
= terms2:(start_word:word (rest_words:(!" - (" ","?" "? word)+){rest_words = rest_words.map(function(array) {
return array.filter(function(n){return n != null;}).join("");
}); return start_word + rest_words.join("")})
phrases
// = ((phrase_t:(phrase / '"' phrase '"') ";"?" "?){return phrase_t})+
= (( (phrase:(phrase2 / phrase1) ";"?" "?) {return phrase;})+)
phrase2
= (('"'pharse2:(phrase)'"'){return {phrase2: pharse2}})
phrase1
= ((pharse1:phrase){return {phrase1: pharse1}})
phrase
= (general_phrase:(!(';' / ')' / '";' / '")') .)+ ){return general_phrase.map(function(array){return array[1]}).join("")}
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
number = digits:digit+{return digits.join("")}
digit = [0-9]
_ "whitespace"
= [ \t\n\r]*
It could be tested either on the PEG.js author's site: [http://pegjs.majda.cz/online] or on the PEG.js Web-IDE: [http://peg.arcanis.fr/]
If somebody has answers for the previous questions (i.e. general approach for disambiguation of the grammar, examples to the expressions available in PEG.js) as well as improvement advices to the grammar itself (this is I think far away from an ideal grammar now), I would very appreciate!

so why is it separated in so many nesting levels and even undefined is included in the output?
If you look at the documentation for PEG.js, you'll see almost every operator collects the results of its operands into an array. undefined is returned by the ! operator.
The $ operator bypasses all this nesting and just gives you the actual string that matches, eg: [a-z]+ will give an array of letters, but $[a-z]+ will give a string of letters.
I think most of the parsing here follows the pattern: "give me everything until I see this string". You should express this in PEG by first using ! to make sure you haven't hit the terminating string, and then just taking the next character. For example, to get everything up to " abbr -":
(!" abbr -" .)+
If the terminating string is a single character, you can use [^] as a short form of this, eg: [^x]+ is a shorter way of saying (!"x" .)+.
Parsing comma/semicolon separated phrases rather than comma/semicolon terminated phrases is slightly annoying, but treating them as optional terminators seems to work (with some triming).
start = $(!" abbr -" .)+ " abbr -" $num "." [ ]? "(012)"
phrase_comma+ "- (" noq_phrase_semi+ q_phrase_semi+ ")"
$.*
phrase_comma = p:$[^-,]+ [, ]* { return p.trim() }
noq_phrase_semi = !'"' p:$[^;]+ [; ]* { return p.trim() }
q_phrase_semi = '"' p:$[^"]+ '"' [; ]* { return p }
num = [0-9]+
gives
[
"simple word",
" abbr -",
"8",
".",
" ",
"(012)",
[
"word",
"simple phrase",
"one another phrase"
],
"- (",
[
"simply dummy text of the printing",
"Lorem Ipsum : \"Lorem\" - has been the industry's standard dummy text, ever since the 1500s!"
],
[
"It is a long established!",
"Sometimes by accident, sometimes on purpose (injected humour and the like)",
"sometimes on purpose"
],
")",
" This is the end of the line"
]

Related

Python regular expression to find and replace multiple matches

I have a string as follows which can have any number of spaces after the first [ or before the last ]:
my_string = " [ 0.53119281 1.53762345 ]"
I have a regular expression which matches and replaces each one individually as follows:
my_regex_start = "(\[\s+)" #Find square bracket and any number of white spaces
replaced_1 = re.sub(my_regex_start, '[', my_string) --> "[0.53119281 -0.16633733 ]"
my_regex_end = "(\s+\])" #Find any number of white spaces and a square bracket
replaced_2 = re.sub(my_regex_end, ']', my_string) -->" [ 0.53119281 -0.16633733]"
I have a regular expression which finds one OR the other:
my_regex_both = "(\[\s+)|(\s+\])" ##Find square bracket and any number of white spaces OR ny number of white spaces and a square bracket
How can I use this my_regex_both to replace the first one and OR the second one if any or both are found?
Instead of catching the brackets, you can replace the spaces that are preceded by [ or followed by ] with an empty string:
import re
my_string = "[ 0.53119281 1.53762345 ]"
my_regex_both = r"(?<=\[)\s+|\s+(?=\])"
replaced = re.sub(my_regex_both, '', my_string)
print(replaced)
Output:
[0.53119281 1.53762345]
Another option you can use aside from MrGeek's answer would be to use a capture group to catch everything between your my_regex_start and my_regex_end like so:
import re
string1 = " [ 0.53119281 1.53762345 ]"
result = re.sub(r"(\[\s+)(.*?)(\s+\])", r"[\2]", string1)
print(result)
I have just sandwiched (.*?) between your two expressions. This will lazily catch what is between which can be used with \2
OUTPUT
[0.53119281 1.53762345]

How to only replace the vowels of words that match the words in a given array with a "*"?

I need to create a ruby method that accepts a string and an array and if any of the words in the string matches the words in the given array then all the vowels of the matched words in the string should be replaced with a "*". I have tried to do this using regex and an "if condition" but I don't know why this does not work. I'd really appreciate if somebody could explain me where I have gone wrong and how I can get this code right.
def censor(sentence, arr)
if arr.include? sentence.downcase
sentence.downcase.gsub(/[aeiou]/, "*")
end
end
puts censor("Gosh, it's so hot", ["gosh", "hot", "shoot", "so"])
#expected_output = "G*sh, it's s* h*t"
are.include? sentence.downcase reads, “If one of the elements of arr equals sentence.downcase ...”, not what you want.
baddies = ["gosh", "it's", "hot", "shoot", "so"]
sentence = "Gosh, it's so very hot"
r = /\b#{baddies.join('|')}\b/i
#=> /\bgosh|it's|hot|shoot|so\b/i
sentence.gsub(r) { |w| w.gsub(/[aeiou]/i, '*') }
#=> "G*sh *t's s* very h*t"
In the regular expression, \b is a word break and #{baddies.join('|')} requires a match of one of the baddies. The word breaks are to avoid, for example, "so" matching "solo" or "possible". One could alternatively write:
/\b#{Regexp.union(baddies).source}\b/
#=> /\bgosh|it's|hot|shoot|so\b/
See Regexp::union and Regexp#source. source is needed because Regexp.union(baddies) is unaffected by the case-indifference modifier (i).
Another approach is split the sentence into words, manipulate each word, then rejoin all the pieces to form a new sentence. One difficulty with this approach concerns the character "'", which serves double-duty as a single quote and an apostrophe. Consider
sentence = "She liked the song, 'don't box me in'"
baddies = ["don't"]
the approach I've given here yields the correct result:
r = /\b#{baddies.join('|')}\b/i
#=> /\bdon't\b/i
sentence.gsub(r) { |w| w.gsub(/[aeiou]/i, '*') }
#=> "She liked the song 'd*n't box me in'"
If we instead divide up the sentence into parts we might try the following:
sentence.split(/([\p{Punct}' ])/)
#=> ["She", " ", "liked", " ", "", " ", "the", " ", "song", ",", "",
# " ", "", "'", "don", "'", "t", " ", "box", " ", "me", " ", "in", "'"]
As seen, the regex split "don't" into "don" and "'t", not what we want. Clearly, distinguishing between single quotes and apostrophes is a non-trivial task. This is made difficult by the the fact that words can begin or end with apostrophes ("'twas") and most nouns in the possessive form that end with "s" are followed by an apostrophe ("Chris' car").
Your code does not return any value if the condition is valid.
One option is to split words by spaces and punctuation, manipulate, then rejoin:
def censor(sentence, arr)
words = sentence.scan(/[\w'-]+|[.,!?]+/) # this splits the senctence into an array of words and punctuation
res = []
words.each do |word|
word = word.gsub(/[aeiou]/, "*") if arr.include? word.downcase
res << word
end
res.join(' ') # add spaces also before punctuation
end
puts censor("Gosh, it's so hot", ["gosh", "hot", "shoot", "so"])
#=> G*sh , it's s* h*t
Note that res.join(' ') add spaces also before punctuation. I'm not so good with regexp, but this could solve:
res.join(' ').gsub(/ [.,!?]/) { |punct| "#{punct}".strip }
#=> G*sh, it's s* h*t
This part words = sentence.scan(/[\w'-]+|[.,!?]+/) returns ["Gosh", ",", "it's", "so", "hot"]

VB.NET - Regex.Replace error with [ character

I want to remove some characters from a textbox. It works, but when i try to replace the "[" character it gives a error. Why?
Return Regex.Replace(html, "[", "").Replace(",", " ").Replace("]", "").Replace(Chr(34), " ")
When i delete the "[", "").Replace( part it works great?
Return Regex.Replace(html, ",", " ").Replace("]", "").Replace(Chr(34), " ")
The problem is that since the [ character has a special meaning in regex, It must be escaped in order to use it as part of a regex sequence, therefore to escape it all you have to do is add a \ before the character.
Therefore this would be your proper regex code Return Regex.Replace(html, "\[", "").Replace(",", " ").Replace("]", "").Replace(Chr(34), " ")
Because [ is a reserved character that regex patterns use. You should always escape your search patterns using Regex.Escape(). This will find all reserved characters and escape them with a backslash.
Dim searchPattern = Regex.Escape("[")
Return Regex.Replace(html, searchPattern, ""). 'etc...
But why do you need to use regex anyway? Here's a better way of doing it, I think, using StringBuilder:
Dim sb = New StringBuilder(html) _
.Replace("[", "") _
.Replace(",", " ") _
.Replace("]", "") _
.Replace(Chr(34), " ")
Return sb.ToString()

Using REGEX to find keywords

Hey all I have the following badwords that I want to check to see if they are in a string that I am passing:
Private Function injectionCheck(queryString As String) As Integer
Dim badWords() As String = {"EXEC", "EXECUTE", ";", "-", "*", "--", "#",
"UNION", "DROP", "DELETE", "UPDATE", "INSERT", "MASTER",
"TABLE", "XP_CMDSHELL", "CREATE", "XP_FIXEDDRIVES",
"SYSCOLUMNS", "SYSOBJECTS"}
Dim pattern As String = "\b(" + Regex.Escape(badWords(0))
For Each key In badWords.Skip(1)
pattern += "|" + Regex.Escape(key)
Next
pattern += ")\b"
Return Regex.Matches(queryString, pattern, RegexOptions.IgnoreCase).Count
End Function
For the pattern I get the following:
\b(EXEC|EXECUTE|;|-|\*|--|#|UNION|DROP|DELETE|UPDATE|INSERT|MASTER|TABLE|XP_CMDSHELL|
CREATE|XP_FIXEDDRIVES|SYSCOLUMNS|SYSOBJECTS)\b
Which looks correct to me. But every time I call it I get 0 as the response to this:
Dim blah As Integer = injectionCheck("select * from bob where something = 'you'")
So what am I leaving out that needs to be there since the above should not return 0 - It should return 2 since both * and ' are used that should not be used.
If you plan to match words as whole words, but the keywords may start/end with non-word characters, you might get into a similar trouble. The word boundary meaning depends on the context: \b--\b will match in X--X but not in , --,.
You need an unambiguous boundary matching. Use lookarounds (?<!\w) as leading and (?!\w) as a trailing word boundary.
Implement the changes as shown below:
Dim pattern As String = "(?<!\w)(" + Regex.Escape(badWords(0)) ' <== HERE
For Each key In badWords.Skip(1)
pattern += "|" + Regex.Escape(key)
Next
pattern += ")(?!\w)" ' <== AND HERE

Convert punctuation to space

I have a bunch of strings with punctuation in them that I'd like to convert to spaces:
"This is a string. In addition, this is a string (with one more)."
would become:
"This is a string In addition this is a string with one more "
I can go thru and do this manually with the stringr package (str_replace_all()) one punctuation symbol at a time (, / . / ! / ( / ) / etc. ), but I'm curious if there's a faster way I'd assume using regex's.
Any suggestions?
x <- "This is a string. In addition, this is a string (with one more)."
gsub("[[:punct:]]", " ", x)
[1] "This is a string In addition this is a string with one more "
See ?gsub for doing quick substitutions like this, and ?regex for details on the [[:punct:]] class, i.e.
‘[:punct:]’ Punctuation characters:
‘! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { |
} ~’.
have a look at ?regex
library(stringr)
str_replace_all(x, '[[:punct:]]',' ')
"This is a string In addition this is a string with one more "