Why ++ becomes -+-+-+- : string.gsub "strange" behavior

Why ++ becomes -+-+-+- ?
I'd like to clean up doubled operator signs in a string. How should I proceed?
String = "++"
print (String ) -- -> ++
String = string.gsub( String, "++", "+")
print (String ) -- -> + ok
String = string.gsub( String, "--", "+")
print (String ) -- -> +++ ?
String = string.gsub( String, "+-", "-")
print (String ) -- -> -+-+-+- ??
String = string.gsub( String, "-+", "-")
print (String ) -- -> -+-+-+- ??? ;-)

The core problem is that gsub operates on patterns (Lua's minimal regular expressions) and your string contains unescaped magic characters. However, even knowing that, I found myself surprised by your results.
It's easier to see what gsub is doing if we change the replacement string:
string.gsub('+', '--', '|') => |+|
string.gsub('+++', '--', '|') => |+|+|+|
- means "0 or more occurrences of the preceding atom". Unlike * and +, it's non-greedy, matching the fewest characters possible.
I just tested it, and apparently "fewest characters possible" mostly means 0 characters. For instance, take this call:
string.gsub('aaa','a-', '|')
My intuition was that the expression a- would match each a and replace it with '|', resulting in '|||'. In fact, it matches the 0-length gaps before and after each character, resulting in '|a|a|a|'.
With nothing after it in the pattern, it doesn't matter which atom precedes the -; it always matches the smallest length, 0:
string.gsub('aaa','x-', '|') => |a|a|a|
string.gsub('aaa','a-', '|') => |a|a|a|
string.gsub('aaa','?-', '|') => |a|a|a|
string.gsub('aaa','--', '|') => |a|a|a|
You can see that last one is your case and explains your results. Your next result is the exact same thing:
string.gsub('+++','+-','|') => |+|+|+|
Your final result is more straightforward:
string.gsub('-+-+-+-','-+','|') => |+|+|+|
In this case, you're matching "1 or more occurrences of the atom -", so you're just replacing the - characters, just as you'd expect.
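The zero-length matching is not specific to Lua. As a minimal sketch in Ruby (not Lua), assuming Ruby's lazy quantifier *? as the closest analogue of Lua's -, the snippet below reproduces the |a|a|a| result and then applies the four replacements from the question with the signs escaped. The names SIGN_RULES and clean_signs are made up for this illustration; in Lua itself the escape character is %, so the first pattern would be written "%+%+".

```ruby
# Ruby's lazy `*?` stands in for Lua's `-` here (an analogy, not Lua itself).
lazy_result   = 'aaa'.gsub(/a*?/, '|')  # lazy: only the empty gaps match
greedy_result = 'aaa'.gsub(/a+/, '|')   # greedy: the whole run matches once

# Hypothetical rule table: the four replacements from the question, as literal text.
SIGN_RULES = { '++' => '+', '--' => '+', '+-' => '-', '-+' => '-' }.freeze

# Apply each rule once, in order, escaping the pattern so that + and -
# lose their special meaning.
def clean_signs(text)
  SIGN_RULES.reduce(text) do |acc, (from, to)|
    acc.gsub(Regexp.new(Regexp.escape(from)), to)
  end
end

puts lazy_result        # |a|a|a|
puts greedy_result      # |
puts clean_signs('++')  # +
puts clean_signs('+-')  # -
```

Each rule is applied once, as in the question, so a longer run such as +++ may need a second pass.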

Related

Ruby, checking if an item exists in an array using Regular expression

I am attempting to search through an array of strings (new_string) and check whether it includes any of the 'operators'.
Where am I going wrong?
def example
operators = ["+", "-"]
string = "+ hi"
new_string = string.split(" ")
if new_string.include? Regexp.union(operators)
print "true"
else
print "false"
end
end

You can use any? instead, which takes a pattern:
pattern = Regexp.union(['+', '-']) #=> /\+|\-/
['foo', '+', 'bar'].any?(pattern) #=> true
But since you already have a string, you can skip the splitting and use match?:
'foo + bar'.match?(pattern) #=> true
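Putting the pieces together, the method from the question could be sketched as follows. The name contains_operator? and the default argument are assumptions for this example; match? needs Ruby 2.4 or later.

```ruby
# Returns true if the string contains any of the given operator characters.
def contains_operator?(string, operators = ['+', '-'])
  pattern = Regexp.union(operators) # union escapes each operator string
  string.match?(pattern)
end

puts contains_operator?('+ hi')  # true
puts contains_operator?('hi')    # false
```

The original include? call compares each array element to the Regexp object with ==, which is never true for a string, so it always printed false.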

You wish to determine if a string (string) contains at least one character in a given array of characters (operators). The fact that those characters are '+' and '-' is not relevant; the same methods would be used for any array of characters. There are many ways to do that. @Stefan gives one. Here are a few more. None of them mutate (modify) string.
string = "There is a + in this string"
operators = ["+", "-"]
The following is used in some calculations.
op_str = operators.join
#=> "+-"
#1
r = /[#{ op_str }]/
#=> /[+-]/
string.match?(r)
#=> true
[+-] is a character class. It matches any single character in the class, so match? returns true if the string contains a + or a - anywhere.
#2
string.delete(op_str).size < string.size
#=> true
See String#delete.
#3
string.tr(op_str, '').size < string.size
#=> true
See String#tr.
#4
string.count(op_str) > 0
#=> true
See String#count.
#5
(string.chars & operators).any?
#=> true
See Array#&.

How can I remove all trailing backslashes from a string in Scala?

I want to remove all trailing backslashes ('\') from a string.
For example:
"ab" -> "ab"
"ab\\\\" -> "ab"
"\\\\ab\\" -> "\\\\ab"
"\\" -> ""
I am able to do this using the code below, but it does not handle the case where the string consists only of backslash(es). Please let me know if this can be achieved with a different regex.
val str = """\\\\q\\"""
val regex = """^(.*[^\\])(\\+)$""".r
str match {
case regex(rest, slashes) => str.stripSuffix(slashes)
case _ => str
}

Converting my comment to an answer. This should remove all trailing backslashes:
val result = str.replaceFirst("\\\\+$", "")
\\\\+ matches 1+ backslashes (in a Java/Scala string literal, a single literal backslash in a regex is written as \\\\).
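For comparison outside the JVM, here is the same idea as a minimal Ruby sketch; the method name is made up for the example. Inside a Ruby regex literal a literal backslash is written \\, and \z anchors at the very end of the string.

```ruby
# Removes every trailing backslash; the rest of the string is left untouched.
def strip_trailing_backslashes(str)
  str.sub(/\\+\z/, '')
end

# In single-quoted Ruby strings, \\ denotes one backslash.
puts strip_trailing_backslashes('ab\\\\')    # ab
puts strip_trailing_backslashes('\\\\ab\\')  # \\ab
```

Because the pattern allows the run of backslashes to start at index 0, a string made only of backslashes becomes empty, which is the case the original regex missed.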

While not a regex, I suggest a simpler solution : str.reverse.dropWhile(_ == '\\').reverse

Not using a regex, but you could use String.lastIndexWhere(p: (Char) ⇒ Boolean) to get the position of the last character which is not a '\' in order to substring until this character:
str.substring(0, str.lastIndexWhere(_ != '\\') + 1)

If, for some reason, you're committed to a regex solution, it can be done.
val regex = """[^\\]?(\\*)$""".r.unanchored
str match {
case regex(slashes) => str.stripSuffix(slashes)
}

You can do the same with slice function
str.slice(0,str.lastIndexWhere(_ != '\\')+1)

In Scala how can I split a string on whitespaces accounting for an embedded quoted string?

I know Scala can split strings on regexes, like this simple split on whitespace:
myString.split("\\s+").foreach(println)
What if I want to split on whitespace, accounting for the possibility that there may be a quoted string in the input (which I want treated as one token)?
"""This is a "very complex" test"""
In this example I want the resulting substrings to be:
This
is
a
very complex
test

While handling quoted expressions with split can be tricky, doing so with Regex matches is quite easy. We just need to match all non-whitespace character sequences with ([^\\s]+) and all quoted character sequences with \"(.*?)\" (toList added in order to avoid reiteration):
import scala.util.matching._
val text = """This is a "very complex" test"""
val regex = new Regex("\"(.*?)\"|([^\\s]+)")
val matches = regex.findAllMatchIn(text).toList
val words = matches.map { _.subgroups.flatMap(Option(_)).fold("")(_ ++ _) }
words.foreach(println)
/*
This
is
a
very complex
test
*/
Note that the solution also counts the quote itself as a word boundary. If you want to inline quoted strings into surrounding expressions, you'll need to add [^\\s]* on both sides of the quoted case and adjust group boundaries correspondingly:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*\".*?\"[^\\s]*)|([^\\s]+)")
...
/*
This
is
a
["very complex"]
test
*/
You can also omit quote symbols when inlining a string by splitting a regex group:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)")
...
/*
This
is
a
[very complex]
test
*/

In more complex scenarios, when you have to deal with CSV strings, you'd better use a CSV parser (e.g. scala-csv).
For a string like the one in question, when you do not have to deal with escaped quotation marks, nor with any "wild" quotes appearing in the middle of the fields, you may adapt a known Java solution (see Regex for splitting a string using space when not surrounded by single or double quotes):
val text = """This is a "very complex" test"""
val p = "\"([^\"]*)\"|[^\"\\s]+".r
val allMatches = p.findAllMatchIn(text).map(
m => if (m.group(1) != null) m.group(1) else m.group(0)
)
println(allMatches.mkString("\n"))
Output:
This
is
a
very complex
test
The regex is rather basic as it contains 2 alternatives, a single capturing group and a negated character class. Here are its details:
\"([^\"]*)\" - ", followed with 0+ chars other than " (captured into Group 1) and then a "
| - or
[^\"\\s]+ - 1+ chars other than " and whitespace.
You only grab .group(1) if Group 1 participated in the match, else, grab the whole match value (.group(0)).
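The two-alternative pattern carries over to other regex engines. A minimal sketch in Ruby, with the hypothetical name split_quoted; as with the Scala version, it assumes there are no escaped or unbalanced quotes.

```ruby
# Splits on whitespace but keeps double-quoted runs together (quotes dropped).
def split_quoted(text)
  # Group 1 captures the inside of a quoted run; group 2 captures a bare word.
  # For each match exactly one of the two groups is non-nil.
  text.scan(/"([^"]*)"|([^"\s]+)/).map { |quoted, bare| quoted || bare }
end

puts split_quoted('This is a "very complex" test')
```

With two capture groups, scan returns one pair per match, so quoted || bare plays the role of the group(1) check in the Scala code.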

This should work when the quoted string consists of exactly two words:
val xx = """This is a "very complex" test"""
var x = xx.split("\\s+")
for(i <-0 until x.length) {
if(x(i) contains "\"") {
x(i) = x(i) + " " + x(i + 1)
x(i + 1 ) = ""
}
}
val newX= x.filter(_ != "")
for(i<-newX) {
println(i.replace("\"",""))
}

Rather than using split, I used a recursive approach. Treat the input string as a List[Char], then step through, inspecting the head of the list to see if it is a quote or whitespace, and handle accordingly.
def fancySplit(s: String): List[String] = {
def recurse(s: List[Char]): List[String] = s match {
case Nil => Nil
case '"' :: tail =>
val (quoted, theRest) = tail.span(_ != '"')
quoted.mkString :: recurse(theRest drop 1)
case c :: tail if c.isWhitespace => recurse(tail)
case chars =>
val (word, theRest) = chars.span(c => !c.isWhitespace && c != '"')
word.mkString :: recurse(theRest)
}
recurse(s.toList)
}
- If the list is empty, you've finished recursion.
- If the first character is a ", grab everything up to the next quote, and recurse with what's left (after throwing out that second quote).
- If the first character is whitespace, throw it out and recurse from the next character.
- In any other case, grab everything up to the next split character, then recurse with what's left.
Results:
scala> fancySplit("""This is a "very complex" test""") foreach println
This
is
a
very complex
test

Regex to match tokens in a string using string.gmatch

I need a regex to use in string.gmatch that matches sequences of alphanumeric characters and non-alphanumeric characters (quotes, brackets, colons and the like) as separate, single matches, so basically:
str = [[
function test(arg1, arg2) {
dosomething(0x12f, "String");
}
]]
for token in str:gmatch(regex) do
print(token)
end
Should print:
function
test
(
arg1
,
arg2
)
{
dosomething
(
0x12f
,
"
String
"
)
;
}
How can I achieve this? In standard regex I've found that ([a-zA-Z0-9]+)|([\{\}\(\)\";,]) works for me but I'm not sure on how to translate this to Lua's regex.

local str = [[
function test(arg1, arg2) {
dosomething(0x12f, "String");
}
]]
for p, w in str:gmatch"(%p?)(%w*)" do
if p ~= "" then print(p) end
if w ~= "" then print(w) end
end

You need a workaround involving a temporary char that is not used in your code. E.g., insert a § after each alphanumeric chunk and after each non-alphanumeric character:
str = str:gsub("%s*(%w+)%s*", "%1§") -- Trim chunks of 1+ alphanumeric characters and add a temp char after them
str = str:gsub("(%W)%s*", "%1§") -- Right trim the non-alphanumeric char one by one and add the temp char after each
for token in str:gmatch("[^§]+") do -- Match chunks of chars other than the temp char
print(token)
end
Note that %w in Lua is an equivalent of JS [a-zA-Z0-9], as it does not match an underscore, _.
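For comparison, the "standard regex" idea from the question (one alternative for alphanumeric runs, one for single punctuation characters) can be sketched in Ruby. The name tokenize is hypothetical, and the second alternative is broadened here to any character that is neither alphanumeric nor whitespace.

```ruby
# Returns alphanumeric runs and single non-alphanumeric characters as tokens.
def tokenize(source)
  source.scan(/[A-Za-z0-9]+|[^A-Za-z0-9\s]/)
end

code = <<~SRC
  function test(arg1, arg2) {
    dosomething(0x12f, "String");
  }
SRC

puts tokenize(code)
```

Lua patterns have no alternation operator, which is why the Lua answers above rely on two optional captures or on a temporary separator instead.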

Why is this regexp slow when the input line is long and has many spaces?

VBScript's Trim function only trims spaces. Sometimes I want to trim TABs as well. For this I've been using this custom trimSpTab function that is based on a regular expression.
Today I ran into a performance problem. The input consisted of rather long lines (several 1000 chars).
As it turns out
- the function is slow, only if the string is long AND contains many spaces
- the right-hand part of the regular expression is responsible for the poor performance
- the run time seems quadratic to the line length (O(n^2))
So why is this line trimmed fast
" aaa xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx bbb " '10000 x's
and this one trimmed slowly
" aaa bbb " '10000 spaces
Both contain only 6 characters to be trimmed.
Can you propose a modification to my trimSpTab function?
Dim regex
Set regex = new regexp
' TEST 1 - executes in no time
' " aaa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX bbb "
t1 = Timer
character = "X"
trimTest character
MsgBox Timer-t1 & " sec",, "with '" & character & "' in the center of the string"
' TEST 2 - executes in 1 second on my machine
' " aaa bbb "
t1 = Timer
character = " "
trimTest character
MsgBox Timer-t1 & " sec",, "with '" & character & "' in the center of the string"
Sub trimTest (character)
sInput = " aaa " & String (10000, character) & " bbb "
trimmed = trimSpTab (sInput)
End Sub
Function trimSpTab (byval s)
'trims spaces & tabs
regex.Global = True
regex.Pattern = "^[ \t]+|[ \t]+$" 'trim left+right
trimSpTab = regex.Replace (s, "")
End Function
I have tried this (with regex.Global = false) but to no avail
regex.Pattern = "^[ \t]+" 'trim left
s = regex.Replace (s, "")
regex.Pattern = "[ \t]+$" 'trim right
trimSpTab = regex.Replace (s, "")
UPDATE
I've come up with this alternative in the meantime. It processes a 100-million-character string in less than a second.
Function trimSpTab (byval s)
'trims spaces & tabs
regex.Pattern = "^[ \t]+"
s = strReverse (s)
s = regex.Replace (s, "")
s = strReverse (s)
s = regex.Replace (s, "")
trimSpTab = s
End Function

Solution
As mentioned in the question, your current solution is to reverse the string. However, this is not necessary, since .NET regex supports the RightToLeft matching option. With the same regex, the engine matches from right to left instead of the default left to right.
Below is sample code in C#, which I hope you can adapt to a VB solution (I don't know VB well enough to write sample code):
input = new Regex("^[ \t]+").Replace(input, "", 1);
input = new Regex("[ \t]+$", RegexOptions.RightToLeft).Replace(input, "", 1);
Explanation
The long run time is due to the engine trying to match [ \t]+ indiscriminately in the middle of the string and failing each time the run turns out not to be a trailing blank sequence.
The observation that the complexity is quadratic is correct.
We know that the regex engine starts matching from index 0. If there is a match, the next attempt starts at the end of the last match. Otherwise, the next attempt starts at (current index + 1). (To simplify things, I am ignoring the case where a zero-length match is found.)
The diagram below illustrates how far each attempt gets (some attempts match, some do not) when the engine matches the regex ^[ \t]+|[ \t]+$. _ denotes a space (or tab character) for clarity.
_____ab_______________g________
^----
     ^
      ^
       ^--------------
        ^-------------
         ^------------
              ...
                     ^
                      ^
                       ^-------
When there is a long sequence of spaces & tabs in the middle of the string (which will not produce a match), the engine attempts matching at every index in the long sequence of spaces & tabs. As a result, the engine ends up going through O(k^2) characters on a non-matching sequence of spaces & tabs of length k.
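The quadratic growth can be checked with a rough model. The sketch below (Ruby, hypothetical name blank_scan_cost) does not call any regex engine; it simply counts how many blank characters a naive left-to-right matcher would consume while trying [ \t]+$ at each start index, assuming no optimization for the $ anchor.

```ruby
# Counts blanks consumed by a naive matcher trying [ \t]+$ at every index.
def blank_scan_cost(text)
  cost = 0
  text.length.times do |start|
    stop = start
    stop += 1 while stop < text.length && " \t".include?(text[stop])
    cost += stop - start
    # A run that reaches the end of the string is a real match: stop there.
    break if stop > start && stop == text.length
  end
  cost
end

[100, 200, 400].each do |k|
  puts "#{k} blanks: #{blank_scan_cost('a' + ' ' * k + 'b')}"
end
# 100 blanks: 5050
# 200 blanks: 20100
# 400 blanks: 80200
```

Doubling the length of the run roughly quadruples the count, which matches the O(k^2) estimate.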

Your evidence proves that VBScript's RegExp implementation does not optimize for the $ anchor: It spends time (backtracking?) for each of the spaces in the middle of your test string. Without doubt, that's a fact good to know.
If this causes you real world problems, you'll have to find/write a better (R)Trim function. I came up with:
Function trimString(s, p)
Dim l : l = Len(s)
If 0 = l Then
trimString = s
Exit Function
End If
Dim ps, pe
For ps = 1 To l
If 0 = Instr(p, Mid(s, ps, 1)) Then
Exit For
End If
Next
For pe = l To ps Step -1
If 0 = Instr(p, Mid(s, pe, 1)) Then
Exit For
End If
Next
trimString = Mid(s, ps, pe - ps + 1)
End Function
It surely needs testing and benchmarks for long heads or tails of white space, but I hope it gets you started.
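The same two-index idea reads similarly in other languages. As a minimal sketch in Ruby (the name trim_chars and its default character set are assumptions), only the leading and trailing characters are inspected, so a long run of blanks in the middle costs nothing:

```ruby
# Strips any characters listed in `chars` from both ends of `text`.
def trim_chars(text, chars = " \t")
  first = 0
  first += 1 while first < text.length && chars.include?(text[first])
  last = text.length - 1
  last -= 1 while last >= first && chars.include?(text[last])
  # Same shape as Mid(s, ps, pe - ps + 1) in the VBScript version.
  text[first, last - first + 1]
end

puts trim_chars(" \t aaa   bbb \t ").inspect  # "aaa   bbb"
```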
