Split line at commas, only if commas not contained between quotes - regex

Is there any way to use the split function in scala so that it splits a line at commas but doesn't at commas contained within 2 double quotes?
For example, I have the following:
x: String = """"??", "hamburger", "ketchup, mayo, mustard", "pizza""""
and I tried this:
x.split(',') but it didn't work, because it also splits at the commas inside the quotes. I then thought about removing all the double quotes, but that doesn't solve my problem either.
Any help would be greatly appreciated!
EDIT:
Here's a snippet of my code to see how I can incorporate this:
val data1 = noheader1.map { line =>
val values = line._1.split(',') //This is what I am trying to change
val name = values(2).replaceAll("\"", "")
I am fairly new to Scala and even newer to regex, so could someone clarify how to write that regex in my code so that I obtain an Array of the comma-separated items of the line?

Try this!
(?>"(?>\\.|[^"])*?"|(,))
Regex101

Instead of split() you can use a regular expression with findAllIn(), like this:
val x = """"??", "hamburger", "ketchup, mayo, mustard", "pizza""""
""""[^"]+"""".r.findAllIn(x).toList
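This results in List("??", "hamburger", "ketchup, mayo, mustard", "pizza"), where each matched element still includes its surrounding double quotes.
Note: I am using triple quotes (""") in the example so that the double quotes inside the strings don't need escaping.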

Perhaps not as elegant as the regexes already suggested: treat the separator between items as the sequence quote, comma, whitespace, quote, and split on that:
x.split("\",\\s+\"")
Array("??, hamburger, ketchup, mayo, mustard, pizza")
Then, in the resulting array, apply stripPrefix("\"") to the first element ("??) and stripSuffix("\"") to the last element (pizza").
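The idea shared by these answers, splitting only at commas that sit outside a quoted item, can be sketched with a lookahead. This is a minimal illustration in Python rather than Scala, assuming the quotes are balanced and never escaped; the helper name split_outside_quotes is made up for the example. Since Java regular expressions also support lookahead, the same pattern should in principle work with Scala's split, but that is not tested here.

```python
import re

# Split at a comma (plus any whitespace after it) only when an even number
# of double quotes remains between that point and the end of the line,
# i.e. when the comma is not inside a quoted item.
SPLIT_OUTSIDE_QUOTES = re.compile(r',\s*(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_outside_quotes(line):
    # Hypothetical helper: split first, then drop the wrapping double quotes.
    return [item.strip().strip('"') for item in SPLIT_OUTSIDE_QUOTES.split(line)]

x = '"??", "hamburger", "ketchup, mayo, mustard", "pizza"'
items = split_outside_quotes(x)
print(items)  # ['??', 'hamburger', 'ketchup, mayo, mustard', 'pizza']
```

The lookahead rescans the rest of the line for every comma, so for very long lines a real CSV parser is probably the better choice.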


regular expression search backwards, How to deal with words with and without?

I tested this on https://regexr.com/
There are two sample file names:
BOND_aa_SB1_66-1.pdf
BOND_bb_SB2.pdf
I want to extract SB1 and SB2 from them, but my regular expression is not quite right.
This works up to a point:
(?<=BOND_.*_).*
But I am having trouble with the next step. I tried
(?<=BOND_.*_).*(?=(_|\.))
but for the first sample the result is 'SB1_66-1'.
I just want to extract SB1.
The part after SB1 may or may not exist; if it does, it is separated from SB1 by a leading _.
How should I fix it?
To extract the third underscore-separated term, we can use re.search as follows:
import re
inp = ["BOND_aa_SB1_66-1.pdf", "BOND_bb_SB2.pdf"]
# capture the third underscore-separated term, stopping at the next _ or .
output = [re.search(r'^BOND_[^_]+_([^_.]+)', x).group(1) for x in inp]
print(output)  # ['SB1', 'SB2']
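Alternatively, if the term always looks like SB followed by digits, re.findall is enough:
import re
s = "BOND_aa_SB1_66-1.pdf BOND_bb_SB2.pdf"
print(re.findall(r'SB\d+', s))  # ['SB1', 'SB2']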

How to get items into array from string with comma separated values in type script and any item has comma it will be in double quotes

I've been struggling to get all the items of the string below into an array in TypeScript:
abc,"de,f",hi,"hello","te,st&"
If an item contains a comma (,) or an ampersand (&), it is placed in double quotes.
I tried the split function, but it fails because my items can contain commas as well.
Any help in this regard is highly appreciated.
Thank you.
If you want to use regular expression matching, try a regex that matches quoted strings first and then unquoted ones, something like (\".+?\")|(^[^\"]+,)|(,[^\"]+,)
I don't know how well this carries over to TypeScript, but I'd guess you can take this pattern and iterate over the matches one by one.
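The "quoted items first, then unquoted items" idea can be tried out quickly. The following is a rough sketch in Python rather than TypeScript, using a simpler two-branch pattern than the one suggested above; it assumes items never contain escaped double quotes and that empty items do not need to be preserved. The names ITEM and parse_items are made up for the example.

```python
import re

# Branch 1 captures the content of a quoted item, branch 2 an unquoted item.
# Assumption: no escaped quotes inside items, and empty items can be dropped.
ITEM = re.compile(r'"([^"]*)"|([^,]+)')

def parse_items(line):
    # Each match fills exactly one of the two groups; the other is ''.
    return [quoted or bare for quoted, bare in ITEM.findall(line)]

items = parse_items('abc,"de,f",hi,"hello","te,st&"')
print(items)  # ['abc', 'de,f', 'hi', 'hello', 'te,st&']
```

Because the quoted branch is tried first, a comma inside quotes is consumed as part of the item and never treated as a separator.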
First of all, I think you are making things more complicated than they need to be by applying the following rule:
has comma (,) or ampersand (&) in it, it will be placed in double quotes.
Instead of doing it that way, you should consistently put every item inside double quotes:
abc,"de,f",hi,"hello","te,st&"
→
"abc","de,f","hi","hello","te,st&"
You will then have a uniform string to parse.
A regex like this one will do the job:
(?<=,")([^"]*)(?=",)|(?<=")([^"]*)(?=",)|(?<=")([^"]*)(?="$)
Using the capture groups $1, $2 and $3, you can extract your items.
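To illustrate why quoting every item makes parsing simpler, here is a small Python sketch, not TypeScript, that uses the standard csv module instead of the regex above. It writes the items with every field quoted and then reads the line back; the variable names are only for illustration.

```python
import csv
import io

items = ['abc', 'de,f', 'hi', 'hello', 'te,st&']

# Write every item inside double quotes, whether or not it needs them.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(items)
line = buffer.getvalue().rstrip('\r\n')  # drop the writer's line terminator
print(line)  # "abc","de,f","hi","hello","te,st&"

# Reading the line back with a CSV parser recovers the original items.
parsed = next(csv.reader([line]))
print(parsed)  # ['abc', 'de,f', 'hi', 'hello', 'te,st&']
```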
The regex /(?:^|,)(\"(?:[^\"])*\"|[^,]*)/ helped me get the required values.
var test = '"abc,123",test,123,456,"def:get"';
test.split(/(\"(?:[^\"])*\"|[^,]*)/);
It returns the array below.
["", "\"abc,123\"", ",", "test", ",", "123", ",", "456", ",", "\"def:get\"", ""]
For the values inside double quotes, I just trimmed the quotes to get the actual values, and I ignored the empty items of the array.
Split the string on the comma (this example is in Swift):
let fullName = "First,Last"
let fullNameArr = fullName.characters.split{$0 == ","}.map(String.init)
fullNameArr[0] // First
fullNameArr[1] // Last

Match return substring between two substrings using regexp

I have a list of records that are character vectors. Here's an example:
'1mil_0,1_1_1_lb200_ks_drivers_sorted.csv'
'1mil_0_1_lb100_ks_drivers_sorted.csv'
'1mil_1_1_lb2_100_100_ks_drivers_sorted.csv'
'1mil_1_1_lb100_ks_drivers_sorted.csv'
From these names I would like to extract whatever's between the two substrings 1mil_ and _ks_drivers_sorted.csv.
So in this case the output would be:
0,1_1_1_lb200
0_1_lb100
1_1_lb2_100_100
1_1_lb100
I'm using MATLAB so I thought to use regexp to do this, but I can't understand what kind of regular expression would be correct.
Or are there some other ways to do this without using regexp?
Let the data be:
x = {'1mil_0,1_1_1_lb200_ks_drivers_sorted.csv'
'1mil_0_1_lb100_ks_drivers_sorted.csv'
'1mil_1_1_lb2_100_100_ks_drivers_sorted.csv'
'1mil_1_1_lb100_ks_drivers_sorted.csv'};
You can use lookbehind and lookahead to find the two limiting substrings, and match everything in between:
result = cellfun(@(c) regexp(c, '(?<=1mil_).*(?=_ks_drivers_sorted\.csv)', 'match'), x);
Or, since the regular expression only produces one match, the following simpler alternative can be used (thanks @excaza for noticing):
result = regexp(x, '(?<=1mil_).*(?=_ks_drivers_sorted\.csv)', 'match', 'once');
In your example, either of the above gives
result =
4×1 cell array
'0,1_1_1_lb200'
'0_1_lb100'
'1_1_lb2_100_100'
'1_1_lb100'
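For comparison, the same lookbehind and lookahead pattern can be checked outside MATLAB. The sketch below uses Python's re module on the four names from the question; the variable names are only illustrative, and it assumes every name contains both limiting substrings.

```python
import re

names = [
    '1mil_0,1_1_1_lb200_ks_drivers_sorted.csv',
    '1mil_0_1_lb100_ks_drivers_sorted.csv',
    '1mil_1_1_lb2_100_100_ks_drivers_sorted.csv',
    '1mil_1_1_lb100_ks_drivers_sorted.csv',
]

# The lookbehind anchors the match right after '1mil_', and the lookahead
# stops it right before '_ks_drivers_sorted.csv'; neither is part of the match.
BETWEEN = re.compile(r'(?<=1mil_).*(?=_ks_drivers_sorted\.csv)')

# Assumption: every name matches, so search() never returns None here.
result = [BETWEEN.search(name).group(0) for name in names]
print(result)  # ['0,1_1_1_lb200', '0_1_lb100', '1_1_lb2_100_100', '1_1_lb100']
```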
For me, the easy way to do this is to replace the parts you don't need with nothing (an empty string); what remains is what you need.
If you have a list, you can use a loop to do this.
Example replacing "1mil_" with "" and "_ks_drivers_sorted.csv" with "":
newChr = strrep(chr, '1mil_', '');
newChr = strrep(newChr, '_ks_drivers_sorted.csv', '');
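The regex-free approach, applied to a whole list in a loop, might look like the following Python sketch. Unlike strrep, which replaces every occurrence anywhere in the string, this version only removes the two substrings when they appear at the very start and end; the helper name strip_ends is made up for the example.

```python
PREFIX = '1mil_'
SUFFIX = '_ks_drivers_sorted.csv'

def strip_ends(name):
    # Hypothetical helper: cut the prefix and suffix only where they
    # actually occur at the ends, leaving other names untouched.
    if name.startswith(PREFIX):
        name = name[len(PREFIX):]
    if name.endswith(SUFFIX):
        name = name[:-len(SUFFIX)]
    return name

names = [
    '1mil_0,1_1_1_lb200_ks_drivers_sorted.csv',
    '1mil_0_1_lb100_ks_drivers_sorted.csv',
    '1mil_1_1_lb2_100_100_ks_drivers_sorted.csv',
    '1mil_1_1_lb100_ks_drivers_sorted.csv',
]
stripped = [strip_ends(name) for name in names]
print(stripped)  # ['0,1_1_1_lb200', '0_1_lb100', '1_1_lb2_100_100', '1_1_lb100']
```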

Regex without brackets

I have the following tag from an XML file:
<msg><![CDATA[Method=GET URL=http://test.de:80/cn?OP=gtm&Reset=1(Clat=[400441379], Clon=[-1335259914], Decoding_Feat=[], Dlat=[0], Dlon=[0], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[7363], ntCoent-Length=[15783], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:270/CSI:-/Me:1/Total:271]]>
I am trying to extract Clon, Dlat, Dlon and Clat from this message.
So far I have created the following regex:
(?<=Clat=)[\[\(\d+\)\n\n][^)n]+]
The problem is that I would like to get only the numbers, without the brackets. I have tried some other expressions without success.
Do you know how I can extend this expression to get only the values without the brackets?
Thank you very much in advance.
Best regards
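The regex
(clon|dlat|dlon|clat)=\[(-?\d+)\]
captures the field name in group 1 and the number, without the brackets, in group 2. The names in the pattern are lowercase, so it has to be matched case-insensitively.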
As I stated before, if you use this regex to extract the information out of this CDATA element, that's okay. But you really want to get to the contents of that element using an XML parser.
Example usage
Regex r = new Regex(@"(clon|dlat|dlon|clat)=\[(-?\d+)\]", RegexOptions.IgnoreCase);
string s = ".. here's your cdata content .. ";
foreach (Match match in r.Matches(s))
{
    var name = match.Groups[1].Value; // will contain "clon", "dlat", "dlon" or "clat", in the casing found in the input
    var inner_value = match.Groups[2].Value; // will contain the value inside the square brackets, e.g. "400441379"
    // Do something with the matches
}
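To check what the pattern captures on the message from the question, here is a quick sketch in Python rather than C#. It runs the same regex case-insensitively over the relevant part of the CDATA text and collects the values into a dictionary; converting the numbers to int is an extra step added for the example.

```python
import re

# The first parenthesised part of the message from the question.
text = ('Method=GET URL=http://test.de:80/cn?OP=gtm&Reset=1(Clat=[400441379], '
        'Clon=[-1335259914], Decoding_Feat=[], Dlat=[0], Dlon=[0], '
        'Accept-Encoding=gzip, Accept=*/*)')

# Group 1 is the field name, group 2 the number without the brackets.
FIELD = re.compile(r'(clon|dlat|dlon|clat)=\[(-?\d+)\]', re.IGNORECASE)

values = {name: int(number) for name, number in FIELD.findall(text)}
print(values)  # {'Clat': 400441379, 'Clon': -1335259914, 'Dlat': 0, 'Dlon': 0}
```

Decoding_Feat=[] is skipped because the pattern requires at least one digit between the brackets.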

Replace multiple words in pig

I am new to Pig. In the script that I am writing I want to perform an operation similar to this:
foreach X GENERATE REPLACE(word,'.*abc.*','abc') OR REPLACE(word,'.*def.*','def').
If the first pattern matches, the word should be replaced with abc; otherwise, if the second pattern matches, it should be replaced with def. But I suppose the syntax is incorrect. Can someone help me with the syntax?
There are a few ways to do this, but since REPLACE just returns the string unchanged when the regex doesn't match, nesting the two calls is pretty compact:
Y = FOREACH X GENERATE REPLACE(REPLACE(word, '.*abc.*', 'abc'), '.*def.*', 'def');
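The behaviour of the nested call can be checked with a small Python sketch, where re.sub stands in for Pig's REPLACE. This is only an approximation of Pig's semantics, and the function name normalize is made up for the example.

```python
import re

def normalize(word):
    # The inner substitution runs first. When a pattern does not match,
    # re.sub returns its input unchanged, so the outer call then sees
    # either 'abc' or the original word.
    return re.sub(r'.*def.*', 'def', re.sub(r'.*abc.*', 'abc', word))

print(normalize('xxabcyy'))      # abc
print(normalize('my def word'))  # def
print(normalize('other'))        # other
print(normalize('abc and def'))  # abc (the first pattern wins)
```

The last case shows the "else if" ordering from the question: once the word has become abc, the second pattern can no longer match.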