How do I replace a delimiter that appears only in between something? - regex

I have a use case with this data:
1. "apple+case"
2. "apple+case+10+cover"
3. "apple+case+10++cover"
4. "+apple"
5. "iphone8+"
Currently, I am doing this to replace the + with space as follows:
def normalizer(value: String): String = {
if (value == null) {
null
} else {
value.replaceAll("\\+", BLANK_SPACE)
}
}
val testUDF = udf(normalizer(_: String): String)
df.withColumn("newCol", testUDF($"value"))
But this is replacing all "+". How do I replace "+" that comes between strings while also handling use cases like: "apple+case+10++cover" => "apple case 10+ cover"?
The output should be
1. "apple case"
2. "apple case 10 cover"
3. "apple case 10+ cover"
4. "apple"
5. "iphone8+"

You can use regexp_replace to do this instead of a udf, it should be much faster. For most of the cases, you can use negative lookahead in the regexp, but for "+apple" you actually want to replace "+" with "" (and not a space). The easiest way is to simply use to regexps.
df.withColumn("newCol", regexp_replace($"value", "^\\+", ""))
.withColumn("newCol", regexp_replace($"newCol", "\\+(?!\\+|$)", " "))
This will give:
+--------------------+--------------------+
|value |newCol |
+--------------------+--------------------+
|apple+case |apple case |
|apple+case+10+cover |apple case 10 cover |
|apple+case+10++cover|apple case 10+ cover|
|+apple |apple |
|iphone8+ |iphone8+ |
+--------------------+--------------------+
To make this more modular and reusable, you can define it as a function:
def normalizer(c: String) = regexp_replace(regexp_replace(col(c), "^\\+", ""), "\\+(?!\\+|$)", " ")
df.withColumn("newCol", normalizer("value"))

You may try making two regex replacements:
df.withColumn("newCol", regexp_replace(
regexp_replace(testUDF("value"), "(?<=\d)\+(?!\+)", "+ "),
"(?<!\d)\+", " ")).show
The inner regex replacement would target the edge case of single plus preceded by a digit, which should be replaced by adding a space (but not deleting the plus). Example:
apple+case+10+cover --> apple+case+10+ cover
The outer regex replacement then targets all pluses which are not preceded by a digit, and replaces them with a space. Example, continuing from above:
apple+case+10+ cover --> apple case 10+ cover

Related

Extract the repetitive parts of a String by Regex pattern matching in Scala

I have this code for extracting the repetitive : separated sections of a regex, which does not give me the right output.
val pattern = """([a-zA-Z]+)(:([a-zA-Z]+))*""".r
for (p <- pattern findAllIn "it:is:very:great just:because:it is") p match {
case pattern("it", pattern(is, pattern(very, great))) => println("it: "+ is + very+ great)
case pattern(it, _,rest) => println( it+" : "+ rest)
case pattern(it, is, very, great) => println(it +" : "+ is +" : "+ very +" : " + great)
case _ => println("match failure")
}
What am I doing wrong?
How can I write a case expression which allows me to extract each : separated part of the pattern regex?
What is the right syntax with which to solve this?
How can the match against unknown number of arguments to be extracted from a regex be done?
In this case print:
it : is : very : great
just : because : it
is
You can't use repeated capturing group like that, it only saves the last captured value as the current group value.
You can still get the matches you need with a \b[a-zA-Z]+(?::[a-zA-Z]+)*\b regex and then split each match with ::
val text = "it:is:very:great just:because:it is"
val regex = """\b[a-zA-Z]+(?::[a-zA-Z]+)*\b""".r
val results = regex.findAllIn(text).map(_ split ':').toList
results.foreach { x => println(x.mkString(", ")) }
// => it, is, very, great
// just, because, it
// is
See the Scala demo. Regex details:
\b - word boundary
[a-zA-Z]+ - one or more ASCII letters
(?::[a-zA-Z]+)* - zero or more repetitions of
: - a colon
[a-zA-Z]+ - one or more ASCII letters
\b - word boundary

EBNF for capturing a comma between two optional values

I have two optional values, and when both are present, a comma needs to be in between them. If one or both values are present, there may be a trailing comma, but if no values are present, no comma is allowed.
Valid examples:
(first,second,)
(first,second)
(first,)
(first)
(second,)
(second)
()
Invalid examples:
(first,first,)
(first,first)
(second,second,)
(second,second)
(second,first,)
(second,first)
(,first,second,)
(,first,second)
(,first,)
(,first)
(,second,)
(,second)
(,)
(,first,first,)
(,first,first)
(,second,second,)
(,second,second)
(,second,first,)
(,second,first)
I have EBNF code (XML-flavored) that suffices, but is there a way I can simplify it? I would like to make it more readable / less repetitive.
tuple ::= "(" ( ( "first" | "second" | "first" "," "second" ) ","? )? ")"
If it’s easier to understand in regex, here’s the equivalent code, but I need a solution in EBNF.
/\(((first|second|first\,second)\,?)?\)/
And here’s a helpful railroad diagram:
This question becomes even more complex when we abstract it to three terms: "first", "second", and "third" are all optional, but they must appear in that order, separated by commas, with an optional trailing comma. The best I can come up with is a brute-force method:
"(" (("first" | "second" | "third" | "first" "," "second" | "first" "," "third" | "second" "," "third" | "first" "," "second" "," "third") ","?)? ")"
Clearly, a solution involving O(2n) complexity is not very desirable.
I found a way to simplify it, but not by much:
"(" ( ("first" ("," "second")? | "second") ","? )? ")"
For the three-term solution, take the two-term solution and prepend a first term:
"(" (("first" ("," ("second" ("," "third")? | "third"))? | "second" ("," "third")? | "third") ","?)? ")"
For any (n+1)-term solution, take the n-term solution and prepend a first term. This complexity is O(n), which is significantly better than O(2n).
This expression might help you to maybe design a better expression. You can do this with only using capturing groups and swipe from left to right and pass your possible inputs, maybe similar to this:
\((first|second|)(,|)(second|)([\)|,]+)
I'm just guessing that you wish to capture the middle comma:
This may not be the exact expression you want. However, it might show you how this might be done in a simple way:
^(?!\(,)\((first|)(,|)(second|)([\)|,]+)$
You can add more boundaries to the left and right of your expression, maybe similar to this expression:
This graph shows how the second expression would work:
Performance
This JavaScript snippet shows the performance of the second expression using a simple 1-million times for loop, and how it captures first and second using $1 and $3.
repeat = 1000000;
start = Date.now();
for (var i = repeat; i >= 0; i--) {
var string = "(first,second,)";
var regex = /^(?!\(,)\((first|second|)(,|)(second|)([\)|,]+)$/gms;
var match = string.replace(regex, "$1 and $3");
}
end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
I'm not familiar with EBNF but I am familiar with BNF and parser grammars. The following is just a variation of what you have based on my own regex. I am assuming the unquoted parens are not considered tokens and are used to group related elements.
tuple ::= ( "(" ( "first,second" | "first" | "second" ) ","? ")" ) | "()"
It matches on either (first,second or (first or (second
Then it matches on an optional ,
Followed by a closing parens. )
or the empty parens grouping. ()
But I doubt this is an improvement.
Here is my Java test code. The first two lines of strings in the test data match. The others do not.
String[] testdata = {
"(first,second,)", "(first,second)", "(first,)", "(first)",
"(second,)", "(second)", "()",
"(first,first,)", "(first,first)", "(second,second,)",
"(second,second)", "(second,first,)", "(second,first)",
"(,first,second,)", "(,first,second)", "(,first,)", "(,first)",
"(,second,)", "(,second)", "(,)", "(,first,first,)",
"(,first,first)", "(,second,second,)", "(,second,second)",
"(,second,first,)", "(,second,first)"
};
String reg = "\\(((first,second)|first|second),?\\)|\\(\\)";
Pattern p = Pattern.compile(reg);
for (String t : testdata) {
Matcher m = p.matcher(t);
if (m.matches()) {
System.out.println(t);
}
}

In Scala how can I split a string on whitespaces accounting for an embedded quoted string?

I know Scala can split strings on regex's like this simple split on whitespace:
myString.split("\\s+").foreach(println)
What if I want to split on whitespace, accounting for the possibility that there may be a quoted string in the input (which I wish to be treated as 1 thing)?
"""This is a "very complex" test"""
In this example I want the resulting substrings to be:
This
is
a
very complex
test
While handling quoted expressions with split can be tricky, doing so with Regex matches is quite easy. We just need to match all non-whitespace character sequences with ([^\\s]+) and all quoted character sequences with \"(.*?)\" (toList added in order to avoid reiteration):
import scala.util.matching._
val text = """This is a "very complex" test"""
val regex = new Regex("\"(.*?)\"|([^\\s]+)")
val matches = regex.findAllMatchIn(text).toList
val words = matches.map { _.subgroups.flatMap(Option(_)).fold("")(_ ++ _) }
words.foreach(println)
/*
This
is
a
very complex
test
*/
Note that the solution also counts quote itself as a word boundary. If you want to inline quoted strings into surrounding expressions, you'll need to add [^\\s]* from both sides of the quoted case and adjust group boundaries correspondingly:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*\".*?\"[^\\s]*)|([^\\s]+)")
...
/*
This
is
a
["very complex"]
test
*/
You can also omit quote symbols when inlining a string by splitting a regex group:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)")
...
/*
This
is
a
[very complex]
test
*/
In more complex scenarios, when you have to deal with CSV strings, you'd better use a CSV parser (e.g. scala-csv).
For a string like the one in question, when you do not have to deal with escaped quotation marks, nor with any "wild" quotes appearing in the middle of the fields, you may adapt a known Java solution (see Regex for splitting a string using space when not surrounded by single or double quotes):
val text = """This is a "very complex" test"""
val p = "\"([^\"]*)\"|[^\"\\s]+".r
val allMatches = p.findAllMatchIn(text).map(
m => if (m.group(1) != null) m.group(1) else m.group(0)
)
println(allMatches.mkString("\n"))
See the online Scala demo, output:
This
is
a
very complex
test
The regex is rather basic as it contains 2 alternatives, a single capturing group and a negated character class. Here are its details:
\"([^\"]*)\" - ", followed with 0+ chars other than " (captured into Group 1) and then a "
| - or
[^\"\\s]+ - 1+ chars other than " and whitespace.
You only grab .group(1) if Group 1 participated in the match, else, grab the whole match value (.group(0)).
This should work:
val xx = """This is a "very complex" test"""
var x = xx.split("\\s+")
for(i <-0 until x.length) {
if(x(i) contains "\"") {
x(i) = x(i) + " " + x(i + 1)
x(i + 1 ) = ""
}
}
val newX= x.filter(_ != "")
for(i<-newX) {
println(i.replace("\"",""))
}
Rather than using split, I used a recursive approach. Treat the input string as a List[Char], then step through, inspecting the head of the list to see if it is a quote or whitespace, and handle accordingly.
def fancySplit(s: String): List[String] = {
def recurse(s: List[Char]): List[String] = s match {
case Nil => Nil
case '"' :: tail =>
val (quoted, theRest) = tail.span(_ != '"')
quoted.mkString :: recurse(theRest drop 1)
case c :: tail if c.isWhitespace => recurse(tail)
case chars =>
val (word, theRest) = chars.span(c => !c.isWhitespace && c != '"')
word.mkString :: recurse(theRest)
}
recurse(s.toList)
}
If the list is empty, you've finished recursion
If the first character is a ", grab everything up to the next quote, and recurse with what's left (after throwing out that second quote).
If the first character is whitespace, throw it out and recurse from the next character
In any other case, grab everything up to the next split character, then recurse with what's left
Results:
scala> fancySplit("""This is a "very complex" test""") foreach println
This
is
a
very complex
test

Scala regex "starts with lowercase alphabets" not working

val AlphabetPattern = "^([a-z]+)".r
def stringMatch(s: String) = s match {
case AlphabetPattern() => println("found")
case _ => println("not found")
}
If I try,
stringMatch("hello")
I get "not found", but I expected to get "found".
My understanding of the regex,
[a-z] = in the range of 'a' to 'z'
+ = one more of the previous pattern
^ = starts with
So regex AlphabetPattern is "all strings that start with one or more alphabets in the range a-z"
Surely I am missing something, want to know what.
Replace case AlphabetPattern() with case AlphabetPattern(_) and it works. The extractor pattern takes a variable to which it binds the result. Here we discard it but you could use x or whatever.
edit: Further to Randall's comment below, if you check the docs for Regex you'll see that it has an unapplySeq rather than an unapply method, which means it takes multiple variables. If you have the wrong number, it won't match, rather like
list match { case List(a,b,c) => a + b + c }
won't match if list doesn't have exactly 3 elements.
There are some issues with the match statement. s match is matching on the value of s which is checked against AlphabetPattern and _ which always evaluates to _ since s is never equal to "^([a-z]+)".r. Use one of the find methods in Scala.Util.Regex to look for a match with the given `Regex.
For example, using findFirstIn to find the first match of a string in AlphabetPattern.
scala> AlphabetPattern.findFirstIn("hello")
res0: Option[String] = Some(hello)
The stringMatch method using findFirstIn and a case statement:
scala> def stringMatch(s: String) = AlphabetPattern findFirstIn s match {
| case Some(s) => println("Found: " + s)
| case None => println("Not found")
| }
stringMatch: (s:String)Unit
scala> stringMatch("hello")
Found: hello

How to parse a command line with regular expressions?

I want to split a command line like string in single string parameters. How look the regular expression for it. The problem are that the parameters can be quoted. For example like:
"param 1" param2 "param 3"
should result in:
param 1, param2, param 3
You should not use regular expressions for this. Write a parser instead, or use one provided by your language.
I don't see why I get downvoted for this. This is how it could be done in Python:
>>> import shlex
>>> shlex.split('"param 1" param2 "param 3"')
['param 1', 'param2', 'param 3']
>>> shlex.split('"param 1" param2 "param 3')
Traceback (most recent call last):
[...]
ValueError: No closing quotation
>>> shlex.split('"param 1" param2 "param 3\\""')
['param 1', 'param2', 'param 3"']
Now tell me that wrecking your brain about how a regex will solve this problem is ever worth the hassle.
I tend to use regexlib for this kind of problem. If you go to: http://regexlib.com/ and search for "command line" you'll find three results which look like they are trying to solve this or similar problems - should be a good start.
This may work:
http://regexlib.com/Search.aspx?k=command+line&c=-1&m=-1&ps=20
("[^"]+"|[^\s"]+)
what i use
C++
#include <iostream>
#include <iterator>
#include <string>
#include <regex>
void foo()
{
std::string strArg = " \"par 1\" par2 par3 \"par 4\"";
std::regex word_regex( "(\"[^\"]+\"|[^\\s\"]+)" );
auto words_begin =
std::sregex_iterator(strArg.begin(), strArg.end(), word_regex);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i)
{
std::smatch match = *i;
std::string match_str = match.str();
std::cout << match_str << '\n';
}
}
Output:
"par 1"
par2
par3
"par 4"
Without regard to implementation language, your regex might look something like this:
("[^"]*"|[^"]+)(\s+|$)
The first part "[^"]*" looks for a quoted string that doesn't contain embedded quotes, and the second part [^"]+ looks for a sequence of non-quote characters. The \s+ matches a separating sequence of spaces, and $ matches the end of the string.
Regex: /[\/-]?((\w+)(?:[=:]("[^"]+"|[^\s"]+))?)(?:\s+|$)/g
Sample: /P1="Long value" /P2=3 /P3=short PwithoutSwitch1=any PwithoutSwitch2
Such regex can parses the parameters list that built by rules:
Parameters are separates by spaces (one or more).
Parameter can contains switch symbol (/ or -).
Parameter consists from name and value that divided by symbol = or :.
Name can be set of alphanumerics and underscores.
Value can absent.
If value exists it can be the set of any symbols, but if it has the space then value should be quoted.
This regex has three groups:
the first group contains whole parameters without switch symbol,
the second group contains name only,
the third group contains value (if it exists) only.
For sample above:
Whole match: /P1="Long value"
Group#1: P1="Long value",
Group#2: P1,
Group#3: "Long value".
Whole match: /P2=3
Group#1: P2=3,
Group#2: P2,
Group#3: 3.
Whole match: /P3=short
Group#1: P3=short,
Group#2: P3,
Group#3: short.
Whole match: PwithoutSwitch1=any
Group#1: PwithoutSwitch1=any,
Group#2: PwithoutSwitch1,
Group#3: any.
Whole match: PwithoutSwitch2
Group#1: PwithoutSwitch2,
Group#2: PwithoutSwitch2,
Group#3: absent.
Most languages have other functions (either built-in or provided by a standard library) which will parse command lines far more easily than building your own regex, plus you know they'll do it accurately out of the box. If you edit your post to identify the language that you're using, I'm sure someone here will be able to point you at the one used in that language.
Regexes are very powerful tools and useful for a wide range of things, but there are also many problems for which they are not the best solution. This is one of them.
This will split an exe from it's params; stripping parenthesis from the exe; assumes clean data:
^(?:"([^"]+(?="))|([^\s]+))["]{0,1} +(.+)$
You will have two matches at a time, of three match groups:
The exe if it was wrapped in parenthesis
The exe if it was not wrapped in parenthesis
The clump of parameters
Examples:
"C:\WINDOWS\system32\cmd.exe" /c echo this
Match 1: C:\WINDOWS\system32\cmd.exe
Match 2: $null
Match 3: /c echo this
C:\WINDOWS\system32\cmd.exe /c echo this
Match 1: $null
Match 2: C:\WINDOWS\system32\cmd.exe
Match 3: /c echo this
"C:\Program Files\foo\bar.exe" /run
Match 1: C:\Program Files\foo\bar.exe
Match 2: $null
Match 3: /run
Thoughts:
I'm pretty sure that you would need to create a loop to capture a possibly infinite number of parameters.
This regex could easily be looped onto it's third match until the match fails; there are no more params.
If its just the quotes you are worried about, then just write a simple loop to dump character by character to a string ignoring the quotes.
Alternatively if you are using some string manipulation library, you can use it to remove all quotes and then concatenate them.
there's a python answer thus we shall have a ruby answer as well :)
require 'shellwords'
Shellwords.shellsplit '"param 1" param2 "param 3"'
#=> ["param 1", "param2", "param 3"] or :
'"param 1" param2 "param 3"'.shellsplit
Though answer is not RegEx specific but answers Python commandline arg parsing:
dash and double dash flags
int/float conversion based on SO answer
import sys
def parse_cmd_args():
_sys_args = sys.argv
_parts = {}
_key = "script"
_parts[_key] = [_sys_args.pop(0)]
for _part in _sys_args:
# Parse numeric values float and integers
if _part.replace("-", "1", 1).replace(".", "1").replace(",", "").isdigit():
_part = int(_part) if '.' not in _part and float(_part)/int(_part) == 1 else float(_part)
_parts[_key].append(_part)
elif "=" in _part:
_part = _part.split("=")
_parts[_part[0].strip("-")] = _part[1].strip().split(",")
elif _part.startswith(("-")):
_key = _part.strip("-")
_parts[_key] = []
else:
_parts[_key].extend(_part.split(","))
return _parts
Something like:
"(?:(?<=")([^"]+)"\s*)|\s*([^"\s]+)
or a simpler one:
"([^"]+)"|\s*([^"\s]+)
(just for the sake of finding a regexp ;) )
Apply it several time, and the group n°1 will give you the parameter, whether it is surrounded by double quotes or not.
If you are looking to parse the command and the parameters I use the following (with ^$ matching at line breaks aka multiline):
(?<cmd>^"[^"]*"|\S*) *(?<prm>.*)?
In case you want to use it in your C# code, here it is properly escaped:
try {
Regex RegexObj = new Regex("(?<cmd>^\\\"[^\\\"]*\\\"|\\S*) *(?<prm>.*)?");
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
It will parse the following and know what is the command versus the parameters:
"c:\program files\myapp\app.exe" p1 p2 "p3 with space"
app.exe p1 p2 "p3 with space"
app.exe
Here's a solution in Perl:
#!/usr/bin/perl
sub parse_arguments {
my $text = shift;
my $i = 0;
my #args;
while ($text ne '') {
$text =~ s{^\s*(['"]?)}{}; # look for (and remove) leading quote
my $delimiter = ($1 || ' '); # use space if not quoted
if ($text =~ s{^(([^$delimiter\\]|\\.|\\$)+)($delimiter|$)}{}) {
$args[$i++] = $1; # acquired an argument; save it
}
}
return #args;
}
my $line = <<'EOS';
"param 1" param\ 2 "pa\"ram' '3" 'pa\'ram" "4'
EOS
say "ARG: $_" for parse_arguments($line);
Output:
ARG: param 1
ARG: param\ 2
ARG: pa"ram' '3
ARG: pa'ram" "4
Note the following:
Arguments can be quoted with either " or ' (with the "other"
quote type treated as a regular character for that argument).
Spaces and quotes in arguments can be escaped with \.
The solution can be adapted to other languages. The basic approach is to (1) determine the delimiter character for the next string, (2) extract the next argument up to an unescaped occurrence of that delimiter or to the end-of-string, then (3) repeat until empty.
\s*("[^"]+"|[^\s"]+)
that's it
(reading your question again, just prior to posting I note you say command line LIKE string, thus this information may not be useful to you, but as I have written it I will post anyway - please disregard if I have missunderstood your question.)
If you clarify your question I will try to help but from the general comments you have made i would say dont do that :-), you are asking for a regexp to split a series of parmeters into an array. Instead of doing this yourself I would strongly suggest you consider using getopt, there are versions of this library for most programming languages. Getopt will do what you are asking and scales to manage much more sophisticated argument processing should you require that in the future.
If you let me know what language you are using I will try and post a sample for you.
Here are a sample of the home pages:
http://www.codeplex.com/getopt
(.NET)
http://www.urbanophile.com/arenn/hacking/download.html
(java)
A sample (from the java page above)
Getopt g = new Getopt("testprog", argv, "ab:c::d");
//
int c;
String arg;
while ((c = g.getopt()) != -1)
{
switch(c)
{
case 'a':
case 'd':
System.out.print("You picked " + (char)c + "\n");
break;
//
case 'b':
case 'c':
arg = g.getOptarg();
System.out.print("You picked " + (char)c +
" with an argument of " +
((arg != null) ? arg : "null") + "\n");
break;
//
case '?':
break; // getopt() already printed an error
//
default:
System.out.print("getopt() returned " + c + "\n");
}
}