Scala regex match and split - regex

So here's what I want to do:
input string: "abc From: blah"
I want to split this so that the result is
["abc" "From: blah"] or ["abc" "From" "blah"]
I have several other patterns to match
["abcd" "To:" "blah"] etc
So I have the following regex
val datePattern = """((.*>.*)|(.*(On).*(wrote:)$)|(.*(Date):.*(\+\d\d\d\d)?$)|(.*(From):.*(\.com)?(\]|>)?$))"""
val reg = datePattern.r
If I do a match the result comes out fine.
If I do a split on the same regex I get an empty list.
inputStr match {
case reg(_*) => return "Match"
case _ => return "Output: None"
}
on the input string :
"abc From: blah blah"
returns Match
Split
inputStr.split(datePattern)
returns an empty array. What am I possibly missing ?

Since the regexp matches the entire string, split treats the whole string as a single separator and removes it.
That leaves two empty strings (one before the match, one after). Because split discards trailing empty strings by default, the result is an empty array rather than an array of two empty strings.
https://stackoverflow.com/a/14602089/1287856
Concerning why your regex matches in its entirety, you might find this website useful (it concerns your example directly)
https://regex101.com/r/zY0lX9/1
Split finds every match of the regexp and removes those occurrences from the string, returning the pieces in between as an array. You may want to split on a zero-width lookahead such as "(?=From:)", which matches a position rather than characters, so nothing is removed.
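To make the difference concrete, here is a minimal sketch in Java (Scala's String.split is the same java.lang.String method). The class and method names are made up for illustration, and .*From:.* stands in for the longer pattern in the question.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on whitespace that is immediately followed by "From:".
    // The lookahead (?=From:) consumes nothing, so "From:" stays in the output.
    static String[] splitBeforeFrom(String input) {
        return input.split("\\s+(?=From:)");
    }

    public static void main(String[] args) {
        String input = "abc From: blah";

        // The pattern matches the entire input, so the whole string is one
        // separator; the two empty pieces around it are then discarded.
        System.out.println(input.split(".*From:.*").length);          // 0

        System.out.println(Arrays.toString(splitBeforeFrom(input)));  // [abc, From: blah]
    }
}
```

If the surrounding whitespace should be kept, splitting on "(?=From:)" alone leaves "abc " with its trailing space.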

Related

Ruby non-greedy modifier did not apply?

I have a regexp with a non-greedy modifier which does not seem to work. I have tried so many variations of the regexp and various other ways I could think of, without success, that I am losing my head
I want to remove all the empty strings embedded in the string s below. With my regexp I was expecting to remove all the things that matched something=""
s = 'a,b="cde",f="",g="hi",j=""'
puts s; puts s.gsub( /,.+?="",?/ , "," ).chomp(','); nil
Expected:
a,b="cde",g="hi"
What I get:
a,g="hi"
Why isn't the .+? non-greedy in the gsub regexp above?
It works if I constrain the . to a set of characters [\w\d_-], but that forces me to make assumptions about which characters can appear:
puts s; puts s.gsub( /,[\w\d_-]+?=""/ , "" ).chomp(','); nil
# outputs:
a,b="cde",f="",g="hi",j=""
a,b="cde",g="hi"
It also works if I do some sort of negative lookup like:
puts s; puts s.gsub( /,.+?="",?/ , "," ).chomp(','); nil
# outputs:
a,b="cde",f="",g="hi",j=""
a,g="hi"
But still I do not understand why it did not work in the first case.
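A regex engine scans from left to right and takes the first position where a match is possible. Your regex ,.+?="",? starts matching at the first comma in a,b="cde",f="",g="hi",j="", the one between a and b. The lazy .+? then expands one character at a time until ="" can follow, which first happens after f, so the match is ,b="cde",f="", and b="cde" is removed along with f="". Non-greedy only means the shortest match from that starting point, not the shortest match overall.
What you want is ,[^=]+?="",? which matches one or more characters that are not an equal sign before ="", and you'll get a,b="cde",g="hi" as the result.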
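The same behaviour can be reproduced outside Ruby. Below is a small sketch in Java, whose regex engine treats lazy quantifiers the same way for this input. The helper name is hypothetical, and the trailing-comma cleanup stands in for Ruby's chomp(',').

```java
class LazyQuantifierDemo {
    // Replaces every match of regex with a single comma, then drops one
    // trailing comma (the equivalent of Ruby's chomp(',')).
    static String stripEmptyAssignments(String s, String regex) {
        return s.replaceAll(regex, ",").replaceAll(",$", "");
    }

    public static void main(String[] args) {
        String s = "a,b=\"cde\",f=\"\",g=\"hi\",j=\"\"";

        // Lazy dot: the match starts at the first comma and grows until ="" fits,
        // so it swallows b="cde" as well.
        System.out.println(stripEmptyAssignments(s, ",.+?=\"\",?"));     // a,g="hi"

        // Negated class: the match cannot cross an '=', so only empty values match.
        System.out.println(stripEmptyAssignments(s, ",[^=]+?=\"\",?"));  // a,b="cde",g="hi"
    }
}
```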

What's the regex expression to split string by these special characters (such as square brackets [] and slashes / \)?

I want to split a string such as "[MX0149/M4200], and total\test//now"; this should output: MX0149 M4200 and total test now.
So far my regex expression is as follows: [\s#&.,;><_=()!?/$#+-]+ but I want it to also split on square brackets [] and slashes / \.
You can use \W+ to split your string and get your desired output. \W is the same as [^a-zA-Z0-9_], so \W+ matches any run of non-word characters, including square brackets and both kinds of slash. Note that, unlike your original class, it does not split on underscores.
Check this Java code,
String s = "[MX0149/M4200], and total\\test//now";
Arrays.stream(s.split("\\W+"))
.filter(x -> x.length() > 0) // remove empty strings; the first element is empty because the input starts with a delimiter
.forEach(System.out::println); // print each token on its own line
Prints,
MX0149
M4200
and
total
test
now
Let me know if this works for you.
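If you would rather keep your own character class (for example because underscores must stay separators), a sketch of the extended class could look like the following. The class name, constant and helper are made up for illustration; the brackets and the backslash are escaped inside the class, and the hyphen is kept last so it stays literal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DelimiterSplitDemo {
    // Original delimiter set plus '[', ']' and '\'.
    // As a regex this reads: [\s#&.,;><_=()!?/$+\[\]\\-]+
    static final String DELIMITERS = "[\\s#&.,;><_=()!?/$+\\[\\]\\\\-]+";

    // Splits on runs of delimiters and drops the empty leading token
    // that appears when the input starts with a delimiter.
    static List<String> tokens(String s) {
        return Arrays.stream(s.split(DELIMITERS))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("[MX0149/M4200], and total\\test//now"));
        // [MX0149, M4200, and, total, test, now]
    }
}
```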

Splitting String with " ' " but NOT " ?' "

I want to split a String wherever the ' character is present, unless there is a question mark in front of it (?'), in which case I don't want to split.
What's the best way to go about doing that?
I'm splitting like so (if the delimiter is a Char):
message.Split(New Char() {"'"C})
And if it's a String:
message.Split(New String() {"break"}, StringSplitOptions.None)
Do I then have to test each item in the given array to see if it ends with a ? char, and then concatenate the Strings again - that just doesn't seem like an optimal solution..?
Do you have to make a regular expression, and how would you do that in vb.net?
You will need a Regex.Split with a (?<!\?)' regex:
Regex.Split(message, "(?<!\?)'")
See the regex demo
The (?<!\?) negative lookbehind will fail the match if a literal ? appears immediately to the left of the apostrophe.
In VB.NET, you can use Linq to remove any empty strings you get with this regex split:
Dim message As String = "'sss?'ss'"
Dim my_result() As String = Regex.Split(message, "(?<!\?)'") _
.Where(Function(strn As String) String.IsNullOrWhiteSpace(strn) = False) _
.ToArray()
Console.WriteLine(String.Join(", ", my_result))
' => sss?'ss
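The same lookbehind works in other regex flavours too. Here is a rough Java equivalent, with a hypothetical helper name, that also shows a case where the string really is split into several parts:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ApostropheSplitDemo {
    // Splits on every apostrophe that is NOT immediately preceded by '?',
    // then discards empty pieces (e.g. when the message starts with ').
    static List<String> splitOnUnescapedApostrophe(String message) {
        return Arrays.stream(message.split("(?<!\\?)'"))
                .filter(part -> !part.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitOnUnescapedApostrophe("'sss?'ss'"));
        // [sss?'ss]
        System.out.println(splitOnUnescapedApostrophe("one'two?'still two'three"));
        // [one, two?'still two, three]
    }
}
```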

Removing commas used as thousand separators in dollar price amounts in longer strings using Scala regex

I am trying to remove the , in dollar values in a string. For example I have a string: val str = "Hello the cost is $323,999 and it has 3 modes 1,2, and 3"
I basically want to get the output: "Hello the cost is $323999 and it has 3 modes 1,2, and 3"
I used the regex:
val pattern = """\$([0-9]+(?:,[0-9]+)*)""".r
val replacedStr = pattern replaceAllIn (str, m => m.group(1).replace(",", ""))
The issue is that, because of the $3 in the regex match, Scala tries to find a group 3 in the match and throws java.lang.IndexOutOfBoundsException: No group 3
How do I get rid of this issue?
Add the dollar symbol back when replacing, but escape it with double backslashes:
val pattern = """\$([0-9]+(?:,[0-9]+)*)""".r
val replacedStr = pattern replaceAllIn (str, m => "\\$" + m.group(1).replace(",", ""))
// note the "\\$" prefix: an escaped, literal dollar sign
See IDEONE demo
In a replacement string, $ followed by a digit is read as a reference to a capture group, so a literal dollar sign has to be escaped with a backslash. Since this is an ordinary string literal, that backslash must itself be written as two backslashes.
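An alternative to escaping by hand is to let the library quote the replacement. A minimal sketch in Java follows, assuming Java 9 or later for the lambda overload of Matcher.replaceAll; the class and method names are invented for the example. From Scala, the same Matcher.quoteReplacement call can be applied to the string returned from the replaceAllIn function.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DollarAmountDemo {
    // A dollar sign followed by digits with optional comma-separated groups.
    private static final Pattern PRICE = Pattern.compile("\\$([0-9]+(?:,[0-9]+)*)");

    // Removes the commas inside each dollar amount. quoteReplacement escapes
    // '$' and '\' so the result is inserted literally, not read as "$3" etc.
    static String stripThousandSeparators(String text) {
        return PRICE.matcher(text).replaceAll(
                m -> Matcher.quoteReplacement("$" + m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        String str = "Hello the cost is $323,999 and it has 3 modes 1,2, and 3";
        System.out.println(stripThousandSeparators(str));
        // Hello the cost is $323999 and it has 3 modes 1,2, and 3
    }
}
```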

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in pseudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which does not have single-line mode):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, @"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
See it online: rubular
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
// ...
}