Positive lookbehind in Kotlin doesn't work in match - regex

I'm iterating on this file:
[INFO] com.demo:communication:jar:3.5.0-SNAPSHOT
[INFO] +- com.cellwize.optserver:optserver-admin:jar:3.5.0-SNAPSHOT:compile
[INFO] | +- org.apache.logging.log4j:log4j-api:jar:2.7:compile
[INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.7:compile
[INFO] | | \- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)
[INFO] | +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.7:compile
[INFO] | | +- org.slf4j:slf4j-api:jar:1.7.21:compile
[INFO] | | \- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)
I want to remove the prefix from every line: "[INFO] " / "[INFO] +- " / "[INFO] | | \- " etc.
I'm using this function I wrote on every line in the file:
private fun extractDependency(raw: String): Dependency {
    val uniqueDependencyRegex = Regex.fromLiteral("(?<=\\+- ).*")
    val duplicateDependencyRegex = Regex.fromLiteral("(?<=\\().+?(?=\\))")
    val projectRegex = Regex.fromLiteral("(?<=\\[INFO\\] ).*")
    when {
        uniqueDependencyRegex matches raw -> {
            val matchResult = uniqueDependencyRegex.matchEntire(raw)
            println(matchResult)
        }
        duplicateDependencyRegex matches raw -> {
            val matchResult = duplicateDependencyRegex.matchEntire(raw)
            println(matchResult)
        }
        projectRegex matches raw -> {
            val matchResult = projectRegex.matchEntire(raw)
            println(matchResult)
        }
        else -> {
            //TODO - throw exception
        }
    }
    return Dependency("test", "test", "test", "test")
}
I'm expecting it to work after I tested the regular expressions:
First Condition
Second Condition
Third Condition
The result I want is:
com.demo:communication:jar:3.5.0-SNAPSHOT
com.cellwize.optserver:optserver-admin:jar:3.5.0-SNAPSHOT:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile
org.apache.logging.log4j:log4j-core:jar:2.7:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate
org.apache.logging.log4j:log4j-slf4j-impl:jar:2.7:compile
org.slf4j:slf4j-api:jar:1.7.21:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate

You could either match [INFO] at the start of the line followed by one or more characters from the character class [| +\\(-], or match ) at the end of the string.
In the replacement, use an empty string.
^\[INFO\][| +\\(-]+|\)$
With the backslashes doubled for use in a string literal:
^\\[INFO\\][| +\\\\(-]+|\\)$
regex demo
A slightly more precise pattern repeatedly matches any of the occurring prefixes (|, +- or \-) and captures the content in group 1 between optional parentheses. Then use the group in the replacement.
^\[INFO\](?:(?: +(?:\||\+-|\\-))+)? +\(?(.*?)\)?$
Regex demo
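Here is a minimal sketch of both patterns in Java, which uses the same java.util.regex engine that Kotlin's Regex wraps on the JVM. The class and method names are made up for illustration. Note that Regex.fromLiteral treats its argument as a literal string rather than a pattern, and that matches / matchEntire require the whole input to match, so the lookbehind patterns from the question would need Regex(...) together with find to behave as tested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MavenPrefixStripper {
    // First pattern: the "[INFO] ..." tree prefix at the start, or a ")" at the end.
    private static final Pattern PREFIX_OR_PAREN =
            Pattern.compile("^\\[INFO\\][| +\\\\(-]+|\\)$");

    // Second, stricter pattern: the dependency text ends up in group 1.
    private static final Pattern CAPTURING =
            Pattern.compile("^\\[INFO\\](?:(?: +(?:\\||\\+-|\\\\-))+)? +\\(?(.*?)\\)?$");

    // Replace every match of the first pattern with an empty string.
    static String stripByReplace(String line) {
        return PREFIX_OR_PAREN.matcher(line).replaceAll("");
    }

    // Match the whole line and return group 1; fall back to the input if it does not match.
    static String stripByGroup(String line) {
        Matcher m = CAPTURING.matcher(line);
        return m.matches() ? m.group(1) : line;
    }

    public static void main(String[] args) {
        String[] lines = {
            "[INFO] com.demo:communication:jar:3.5.0-SNAPSHOT",
            "[INFO] | +- org.apache.logging.log4j:log4j-api:jar:2.7:compile",
            "[INFO] | | \\- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)"
        };
        for (String line : lines) {
            System.out.println(stripByReplace(line));
            System.out.println(stripByGroup(line));
        }
    }
}
```

Both methods should give the same result on the sample lines; the capturing version is stricter about what it accepts as a prefix.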


What's wrong with this cucumber regex?

I tried to define this custom parameter type:
ParameterType("complex", """(?x)
    | complex\( \s*
    | (?:
    |     ($realRegex)
    |     (?:
    |         \s* ([+-]) \s* ($realRegex) \s* i
    |     )?
    |     |
    |     ($realRegex) \s* i
    | )
    | \s* \)
    |""".trimMargin()) {
    realPartString: String?, joiner: String?, imaginaryPartString1: String?, imaginaryPartString2: String? ->
    // handling logic omitted as it doesn't matter yet
}
Cucumber is giving an error:
java.util.NoSuchElementException
at java.base/java.util.Spliterators$2Adapter.nextInt(Spliterators.java:733)
at java.base/java.util.PrimitiveIterator$OfInt.next(PrimitiveIterator.java:128)
at java.base/java.util.PrimitiveIterator$OfInt.next(PrimitiveIterator.java:86)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:18)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:21)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:21)
at io.cucumber.cucumberexpressions.TreeRegexp.match(TreeRegexp.java:78)
at io.cucumber.cucumberexpressions.Argument.build(Argument.java:15)
at io.cucumber.cucumberexpressions.CucumberExpression.match(CucumberExpression.java:144)
at io.cucumber.core.stepexpression.StepExpression.match(StepExpression.java:22)
at io.cucumber.core.stepexpression.ArgumentMatcher.argumentsFrom(ArgumentMatcher.java:30)
It sounds as if I have missed a closing group but I've been staring at it for hours and can't see what I've done wrong yet.
I thought maybe throwing it to StackOverflow would immediately get someone to spot the issue. LOL
In case it matters, the definition of $realRegex is as follows:
private const val doubleTermRegex = "\\d+(?:\\.\\d+)?"
private const val realTermRegex = "(?:-?(?:√?(?:$doubleTermRegex|π)|∞))"
const val realRegex = "(?:$realTermRegex(?:\\s*\\/\\s*$realTermRegex)?)"
That code has been around for much longer and is exercised by many tests, so I'm guessing the issue is in the new code, but I guess you never know.
Versions in use:
Cucumber-Java8 5.7.0
Kotlin-JVM 1.5.21
Java 11
In any case, here's the whole file for context.
package garden.ephemeral.rocket.util

import garden.ephemeral.rocket.util.RealParser.Companion.realFromString
import garden.ephemeral.rocket.util.RealParser.Companion.realRegex
import io.cucumber.java8.En

class ComplexStepDefinitions: En {
    lateinit var z1: Complex
    lateinit var z2: Complex

    init {
        ParameterType("complex", """(?x)
            | complex\( \s*
            | (?:
            |     ($realRegex)
            |     (?:
            |         \s* ([+-]) \s* ($realRegex) \s* i
            |     )?
            |     |
            |     ($realRegex) \s* i
            | )
            | \s* \)
            |""".trimMargin()) {
            realPartString: String?,
            joiner: String?,
            imaginaryPartString1: String?,
            imaginaryPartString2: String? ->
            if (realPartString != null) {
                if (imaginaryPartString1 != null) {
                    val imaginarySign = if (joiner == "-") -1.0 else 1.0
                    Complex(realFromString(realPartString), imaginarySign * realFromString(imaginaryPartString1))
                } else {
                    Complex(realFromString(realPartString), 0.0)
                }
            } else if (imaginaryPartString2 != null) {
                Complex(0.0, realFromString(imaginaryPartString2))
            } else {
                throw AssertionError("It shouldn't have matched the regex")
            }
        }
    }
}
You've used (?x) as a flag expression to enable whitespace and comments in the pattern, which is probably a good idea given the size of the monster.
When using Cucumber Expressions, Cucumber parses your regex itself, and in the version you are using that parser doesn't support flag expressions, which is the failure you are seeing now. Updating Cucumber to the latest version should fix that problem. You are welcome.
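If upgrading is not possible, one workaround that might help (this is my own assumption, not something taken from the Cucumber documentation) is to keep the readable layout in source, strip the whitespace yourself, and pass the compacted pattern without the (?x) flag. The sketch below uses plain java.util.regex and a simplified stand-in for realRegex; the class and method names are hypothetical. It is only safe when the pattern contains no literal whitespace and no # comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompactRegexDemo {
    // Simplified stand-in for the question's realRegex (assumption for this sketch).
    static final String REAL = "(?:\\d+(?:\\.\\d+)?)";

    // Same layout as the question's pattern, but without the leading (?x) flag.
    static final String LAID_OUT = String.join("\n",
            "complex\\( \\s*",
            "(?:",
            "    (" + REAL + ")",
            "    (?:",
            "        \\s* ([+-]) \\s* (" + REAL + ") \\s* i",
            "    )?",
            "    |",
            "    (" + REAL + ") \\s* i",
            ")",
            "\\s* \\)");

    // Removes literal whitespace only; "\s" tokens are a backslash plus 's' and survive.
    static String compact(String laidOut) {
        return laidOut.replaceAll("\\s+", "");
    }

    // Returns the four capture groups, or null when the input does not match.
    static String[] groups(String input) {
        Matcher m = Pattern.compile(compact(LAID_OUT)).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        for (String s : new String[] {"complex(1 + 2i)", "complex(3i)", "complex(4)"}) {
            System.out.println(s + " -> " + String.join(", ", java.util.Arrays.toString(groups(s))));
        }
    }
}
```

The compacted pattern has the same four capturing groups in the same order, so the lambda parameters in the step definition would not need to change.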

How to remove a whitespace character and insert a string in its place

I'm trying to take a string such as Hello World this is Bob and have it formatted to
:h: :e: :l: :l: :o: | :w: :o: :r: :l: :d: | :t: :h: :i: :s: | :i: :s: | :b: :o: :b:
This is where I'm having the issue:
text = text.scan(/\s/).join(' | ')
def to_clipboard(text)
  string = text.to_s
  IO.popen('pbcopy', 'w') {|t| t << text}
  string
end
#########################################
print "Enter message: "
text = gets.downcase.chomp!
# text = text.scan(/\s/).join(' | ')
formatted = text.scan(/\S[A-Za-z]{0,0}/).join(': :')
formatted.insert(0,':')
formatted[formatted.size, 0] = ':'
# verify expected format
puts formatted
to_clipboard(formatted)
def bannerize(str)
  str.downcase.gsub(/./) do |c|
    if c == ' '
      "| "
    else
      ":#{c}: "
    end
  end.rstrip
end
bannerize("Hello this is Mugsy")
#=> ":h: :e: :l: :l: :o: | :t: :h: :i: :s: | :i: :s: | :m: :u: :g: :s: :y:"
Alternatively,
def bannerize(str)
  h = Hash.new { |_,k| ":#{k}: " }.tap { |h| h[' '] = "| " }
  str.downcase.gsub(/./,h).rstrip
end
This uses the form of Hash::new that creates an empty hash h with a default proc, after which the key-value pair ' '=>"| " is added to the hash so that it becomes { ' '=>"| " }. The default proc causes h[k] to return ":#{k}: " if it does not have a key k; that is, if k is anything other than a space.
The form of String#gsub that employs a hash for making substitutions is then used with this hash h.
Try to split by space, make a replacement in each word and then combine results:
words = text.split(/\s/)
words = words.map {|s| s.gsub(/(.)/, ':\1: ')}  # map returns a new array, so keep the result
formatted = words.join('| ').rstrip
I suggest
text = 'Hello World this is Bob'
p text.strip.gsub(/\s+|(\S)/) { |m| m == $~[1] ? ":#{$~[1].downcase}: " : '| ' }.rstrip
## => ":h: :e: :l: :l: :o: | :w: :o: :r: :l: :d: | :t: :h: :i: :s: | :i: :s: | :b: :o: :b:"
See the Ruby demo online.
NOTES:
The string is first stripped of leading/trailing whitespace so that no pipe is added at the start.
gsub(/\s+|(\S)/) { |m| m == $~[1] ? ":#{$~[1].downcase}: " : '| ' } does the main job within one regex replace pass:
/\s+|(\S)/ - matches 1+ whitespace characters, or matches and captures into Group 1 any single non-whitespace character
The replacement is a block where m represents the match value. If the match value is equal to the Group 1 value (m == $~[1] ?), the replacement is :, then the Group 1 value in lower case (#{$~[1].downcase}), then : and a space; otherwise the replacement is | and a space.
Since there may be a trailing space after gsub, the string is rstripped.

Spark - extracting numeric values from an alphanumeric string using regex

I have an alphanumeric column named "Result" that I'd like to parse into 4 different columns: prefix, suffix, value, and pure_text.
I'd like to solve this using Spark SQL using RLIKE and REGEX, but also open to PySpark/Scala
pure_text: the value contains only letters, or, if numbers are present, they are joined by the special character "-", have multiple decimal points (e.g. 9.9.0), or are followed by a letter and then another number (e.g. 3x4u).
prefix: applies only to values that are not "pure_text". Any character(s) before the first digit [0-9] should be extracted.
suffix: applies only to values that are not "pure_text". Any character(s) after the last digit [0-9] should be extracted.
value: applies only to values that are not "pure_text". Extract the numerical value, including the decimal point.
Result
11 H
111L
.9
<.004
>= 0.78
val<=0.6
xyz 100 abc
1-9
aaa 100.3.4
a1q1
Expected Output:
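+-------------+--------+--------+-------+-------------+
| Result      | Prefix | Suffix | Value | Pure_Text   |
+-------------+--------+--------+-------+-------------+
| 11 H        |        | H      | 11    |             |
| 111L        |        | L      | 111   |             |
| .9          |        |        | 0.9   |             |
| <.004       | <      |        | 0.004 |             |
| >= 0.78     | >=     |        | 0.78  |             |
| val<=0.6    | val<=  |        | 0.6   |             |
| xyz 100 abc | xyz    | abc    | 100   |             |
| 1-9         |        |        |       | 1-9         |
| aaa 100.3.4 |        |        |       | aaa 100.3.4 |
| a1q1        |        |        |       | a1q1        |
+-------------+--------+--------+-------+-------------+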
Here's one approach using a UDF that applies pattern matching to extract the string content into a case class. The pattern matching centers around the numeric value with Regex pattern [+-]?(?:\d*\.)?\d+ to extract the first occurrence of numbers like "1.23", ".99", "-100", etc. A subsequent check of numbers in the remaining substring captured in suffix determines whether the numeric substring in the original string is legitimate.
import org.apache.spark.sql.functions._
import spark.implicits._

case class RegexRes(prefix: String, suffix: String, value: Option[Double], pure_text: String)

val regexExtract = udf{ (s: String) =>
  val pattern = """(.*?)([+-]?(?:\d*\.)?\d+)(.*)""".r
  s match {
    case pattern(pfx, num, sfx) =>
      if (sfx.exists(_.isDigit))
        RegexRes("", "", None, s)
      else
        RegexRes(pfx, sfx, Some(num.toDouble), "")
    case _ =>
      RegexRes("", "", None, s)
  }
}

val df = Seq(
  "11 H", "111L", ".9", "<.004", ">= 0.78", "val<=0.6", "xyz 100 abc", "1-9", "aaa 100.3.4", "a1q1"
).toDF("result")

df.
  withColumn("regex_res", regexExtract($"result")).
  select($"result", $"regex_res.prefix", $"regex_res.suffix", $"regex_res.value", $"regex_res.pure_text").
  show
// +-----------+------+------+-----+-----------+
// | result|prefix|suffix|value| pure_text|
// +-----------+------+------+-----+-----------+
// | 11 H| | H| 11.0| |
// | 111L| | L|111.0| |
// | .9| | | 0.9| |
// | <.004| <| |0.004| |
// | >= 0.78| >= | | 0.78| |
// | val<=0.6| val<=| | 0.6| |
// |xyz 100 abc| xyz | abc|100.0| |
// | 1-9| | | null| 1-9|
// |aaa 100.3.4| | | null|aaa 100.3.4|
// | a1q1| | | null| a1q1|
// +-----------+------+------+-----+-----------+
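To try the matching logic outside Spark, the same pattern and the "any digit left in the suffix?" check can be sketched in plain Java. The class name, method name and the String[] return shape are my own choices for illustration; an empty value string stands in for the None / null of the UDF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResultSplitter {
    // Same pattern as in the UDF: lazy prefix, first number, greedy remainder.
    private static final Pattern NUMERIC =
            Pattern.compile("(.*?)([+-]?(?:\\d*\\.)?\\d+)(.*)");

    // Returns {prefix, suffix, value, pureText}; value is "" when the input is pure text.
    static String[] split(String s) {
        Matcher m = NUMERIC.matcher(s);
        if (m.matches()) {
            String suffix = m.group(3);
            // A digit after the first number means the input is not a single clean value.
            boolean moreDigits = suffix.chars().anyMatch(Character::isDigit);
            if (!moreDigits) {
                double value = Double.parseDouble(m.group(2));
                return new String[] {m.group(1), suffix, String.valueOf(value), ""};
            }
        }
        return new String[] {"", "", "", s};
    }

    public static void main(String[] args) {
        String[] inputs = {"11 H", "<.004", "val<=0.6", "1-9", "aaa 100.3.4", "a1q1"};
        for (String in : inputs) {
            System.out.println(in + " -> " + String.join("|", split(in)));
        }
    }
}
```

As in the Scala version, the prefix and suffix keep their surrounding spaces (for example " H"), so trim them if that matters for the target columns.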

Check a given string is a math function or not in R [closed]

I have an expression like this: abs(iodlin_vod2*1e6)/(array*nfin*nrx*nfinger*weff). My code splits this expression using strsplit. For each word in the result, I need to check whether it is a math function; if it is not, I should perform some operation on it.
I am stuck at checking whether a word is a math function. Can someone please help?
Edit:
This is result of strsplit
sstemp
[1] "abs"          "iodlin_vod2 " " 1e+06"       "array "       " nfin "       " nrx "
[7] " nfinger "    " weff"
I want to eliminate abs and 1e+06 from further operations in my code.
First let's look at the call tree:
ttt <- "abs(str1*1e6)/(str2*str3)"
library(pryr)
call_tree(parse(text=ttt))
# \- ()
#   \- `/
#   \- ()
#     \- `abs
#     \- ()
#       \- `*
#       \- `str1
#       \-  1e+06
#   \- ()
#     \- `(
#     \- ()
#       \- `*
#       \- `str2
#       \- `str3
See also Hadley's book.
Now let's create this in a machine usable format and clean up a bit:
test <- gsub("\\\\\\-|\\s*|`", "",
             unlist(
               strsplit(
                 vapply(parse(text = ttt), pryr:::tree,
                        character(1), width = getOption("width")),
                 "\\n")
             )
)
#[1] "()" "/" "()" "abs" "()" "*" "str1" "1e+06" "()" "(" "()" "*" "str2" "str3"
Then we can test:
vapply(test, function(x) is.function(tryCatch(getFunction(x), error = function(cond) NA)), logical(1))
# () / () abs () * str1 1e+06 () ( () * str2 str3
#FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
As you see, there are four different functions: /, abs, *, and ( in this expression.
This solution will fail if you have a non-function object with a function name in your expression.

Regular Expression : Splitting a string of list of multivalues

My goal is to split this string with a regular expression:
AA(1.2,1.3)+,BB(125)-,CC(A,B,C)-,DD(QWE)+
into a list of:
AA(1.2,1.3)+
BB(125)-
CC(A,B,C)-
DD(QWE)+
Regards.
This regex works with your sample string:
,(?![^(]+\))
This splits on comma, but uses a negative lookahead to assert that the next bracket character is not a right bracket. It will still split even if there are no following brackets.
Here's some Java code demonstrating it working with your sample, plus some general input showing its robustness:
String input = "AA(1.2,1.3)+,BB(125)-,FOO,CC(A,B,C)-,DD(QWE)+,BAR";
String[] split = input.split(",(?![^(]+\\))");
for (String s : split) System.out.println(s);
Output:
AA(1.2,1.3)+
BB(125)-
FOO
CC(A,B,C)-
DD(QWE)+
BAR
I don't know what language you are working with, but this does it with grep:
$ grep -o '[A-Z]*([A-Z0-9.,]*)[^,]*' file
AA(1.2,1.3)+
BB(125)-
CC(A,B,C)-
DD(QWE)+
Explanation
[A-Z]*([A-Z0-9.,]*)[^,]*
^^^^^^ ^^^^^^^^^^^ ^^^^^
  |   ^     |     ^  |
  |   |     |     |  everything but a comma
  |   ( char|     ) char
  |         A-Z 0-9 . or , chars
list of chars from A to Z
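The same match-instead-of-split idea can be sketched in Java with Matcher.find, which plays the role of grep -o here. The class and method names are invented for the example. In Java the literal parentheses have to be escaped, whereas in grep's basic syntax an unescaped ( is already literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenFinder {
    // Letters, a parenthesised list of letters/digits/dots/commas, then anything up to the next comma.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Z]*\\([A-Z0-9.,]*\\)[^,]*");

    // Collects every non-overlapping match, left to right.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String t : tokens("AA(1.2,1.3)+,BB(125)-,CC(A,B,C)-,DD(QWE)+")) {
            System.out.println(t);
        }
    }
}
```

Unlike the split-based answer, this pattern only finds items that contain a parenthesised part, so a bare item such as FOO would be skipped.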