I need to split a CSV file at commas, but the file can contain commas inside fields. For example:
one,two,tree,"four,five","six,seven"
Double quotes are used to escape such fields, but I could not work out how to handle them.
I tried the following regex, but got the error REGEX_TOO_COMPLEX:
data: lv_sep type string,
lv_rep_pat type string.
data(lv_row) = iv_row.
"Define a separator to replace commas in double quotes
lv_sep = cl_abap_conv_in_ce=>uccpi( uccp = 10 ).
concatenate '$1$2' lv_sep into lv_rep_pat.
"replace all commas that are separator with the new separator
replace all occurrences of regex '(?:"((?:""|[^"]+)+)"|([^,]*))(?:,|$)' in lv_row with lv_rep_pat.
split lv_row at lv_sep into table rt_cells.
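Use this regex instead: ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
It matches a comma only when an even number of double quotes follows it up to the end of the line, i.e. only when the comma is outside a quoted field.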
DATA: lv_sep TYPE string,
lv_rep_pat TYPE string.
DATA(lv_row) = 'one,two,tree,"four,five","six,seven"'.
"Define a separator to replace commas in double quotes
lv_sep = cl_abap_conv_in_ce=>uccpi( uccp = 10 ).
CONCATENATE '$1$2' lv_sep INTO lv_rep_pat.
"replace all commas that are separator with the new separator
REPLACE ALL OCCURRENCES OF REGEX ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)' IN lv_row WITH lv_rep_pat.
SPLIT lv_row AT lv_sep INTO TABLE data(rt_cells).
LOOP AT rt_cells into data(cells).
WRITE cells.
SKIP.
ENDLOOP.
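For readers without an ABAP system at hand, the same lookahead can be tried in Python. This is only a sketch: `split_outside_quotes` is a made-up helper name, and it assumes the quotes in a row are balanced and never escaped.

```python
import re

# A comma is a separator only if an even number of quotes follows it.
SEPARATOR = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_outside_quotes(row):
    """Split a row at commas that are not inside double quotes."""
    return SEPARATOR.split(row)


cells = split_outside_quotes('one,two,tree,"four,five","six,seven"')
print(cells)
```

The quotes stay attached to the cells; strip them afterwards if they are not wanted.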
I have never touched ABAP, so please treat this as pseudocode.
I'd recommend using a non-regex solution here:
data: checkedOffsetComma type i,
checkedOffsetQuotes type i,
baseOffset type i,
testString type string value 'val1, "val2, val21", val3'.
LOOP AT SomeFancyConditionYouDefine.
checkedOffsetComma = baseOffset.
checkedOffsetQuotes = baseOffset.
find FIRST OCCURRENCE OF ','(or end of line here) in testString match OFFSET checkedOffsetComma.
write checkedOffsetComma.
find FIRST OCCURRENCE OF '"' in testString match OFFSET checkedOffsetQuotes.
write checkedOffsetQuotes.
*if the next comma is closer than the next quotes
IF checkedOffsetComma < checkedOffsetQuotes.
REPLACE SECTION checkedOffsetComma 1 OF ',' WITH lv_rep_pat.
baseOffset = checkedOffsetComma.
ELSE.
*if we found quotes, we go to the next quotes afterwards and then continue as before after that position
find FIRST OCCURRENCE OF '"' in testString match OFFSET checkedOffsetQuotes.
write baseOffset.
ENDIF.
ENDLOOP.
This assumes that there are no quotes nested inside quoted fields. I did not test or validate it in any way. I'd be happy if this at least partly compiles :)
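The idea behind the pseudocode (walk through the row, and only treat a comma as a separator while not inside quotes) might look like this in Python. It is a sketch under the same assumption of no nested or escaped quotes; `split_csv_row` is a hypothetical name.

```python
def split_csv_row(row, sep=",", quote='"'):
    """Split a row at separators that are outside quoted sections."""
    cells = []
    current = []
    in_quotes = False
    for ch in row:
        if ch == quote:
            # Every quote toggles the "inside a quoted field" state.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == sep and not in_quotes:
            # Separator outside quotes: close the current cell.
            cells.append("".join(current))
            current = []
        else:
            current.append(ch)
    cells.append("".join(current))
    return cells


parts = split_csv_row('val1, "val2, val21", val3')
print(parts)
```

Whitespace and quotes are kept as they are, so the cells may still need trimming.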
I tried a solution for picking out commas outside quotes using regexp in MATLAB (Mac OS X):
str='"This string has comma , inside the quotes", 2nd string, 3rd string'
I expect these three tokens:
"This string has comma , inside the quotes"
2nd string
3rd string
I used the following but get an empty result:
regexp(str, '\^([^"]|"[^"]*")*?(,)\')
ans =
[]
What is the correct regexp syntax for this?
Without regular expressions
You could
Detect the positions of commas outside double-quotation marks: they are the commas that have an even (possibly zero) number of double-quotation marks to their left.
Split the string at those points.
Remove commas at the end of all substrings except the last.
Code:
pos = find(~mod(cumsum(str=='"'),2)&str==','); %// step 1
result = mat2cell(str, 1, diff([0 pos numel(str)])); %// step 2
result(1:end-1) = cellfun(@(x) x(1:end-1), result(1:end-1), 'uniformoutput', 0); %// step 3
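The same three steps can be sketched in Python, with a running quote count standing in for cumsum. The function name `split_by_quote_parity` is an assumption for illustration, and balanced quotes are assumed.

```python
from itertools import accumulate


def split_by_quote_parity(s):
    """Split at commas that have an even number of quotes to their left."""
    # Running count of quotes seen up to and including each position.
    quote_counts = list(accumulate(int(ch == '"') for ch in s))
    # Step 1: positions of commas outside quotes.
    pos = [i for i, ch in enumerate(s)
           if ch == "," and quote_counts[i] % 2 == 0]
    # Steps 2 and 3: slice between those positions, leaving the commas out.
    starts = [0] + [p + 1 for p in pos]
    ends = pos + [len(s)]
    return [s[a:b] for a, b in zip(starts, ends)]


tokens = split_by_quote_parity(
    '"This string has comma , inside the quotes", 2nd string, 3rd string')
print(tokens)
```

Leading spaces after each comma are preserved, as in the MATLAB version.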
With regular expressions
Split at commas preceded by an even (possibly zero) number of double-quotation marks:
result = regexp(str,'(?<=(".*").*),', 'split');
I'm trying to write a simple regex for Visual Basic .NET, but I'm not getting anywhere.
I'm parsing CSV files into a list of arrays, but the source CSVs are anything but pristine. There are extra or rogue quotes in just enough places to crash the program, and enough sets of quotes to make fixing the data manually cumbersome.
I've written a bunch of error checking, and it works about 99.99% of the time. However, with 10,000 lines to parse for each folder, that averages one error per set of CSV files, which means a crash. To get that last 0.01% parsed properly, I've created an If statement that pulls out lines that have an odd number of quotes and removes ALL of them, which triggers a manual error check. If there are zero quotes, the line is processed as usual. If there's an even number of quotes, the standard Split function cannot ignore delimiters between quotes without a regex.
Could someone help me figure out a regex that will ignore delimiters inside fields enclosed in quotes?
Here's the code I've come up with so far.
Thank you in advance.
Using filereader1 As New Microsoft.VisualBasic.FileIO.TextFieldParser(files_(i),
System.Text.Encoding.Default) 'system text decoding adds odd characters
filereader1.TextFieldType = FieldType.Delimited
'filereader1.Delimiters = New String() {","}
filereader1.SetDelimiters(",")
filereader1.HasFieldsEnclosedInQuotes = True
For Each c As Char In whole_string
If c = """" Then cnt = cnt + 1
Next
If cnt = 0 Then 'no quotes
split_string = Split(whole_string, ",") 'split by commas
ElseIf cnt Mod 2 = 0 Then 'even number of quotes
split_string = Regex.Split(whole_string, "(?=(([^""]|.)*""([^""]|.)*"")*([^""]|.)*$)")
ElseIf cnt <> 0 Then 'odd number of quotes
whole_string = whole_string.Replace("""", " ") 'delete all quotes
split_string = Split(whole_string, ",") 'split by commas
End If
In VB.NET, there are several ways to proceed.
Option 1
You can use this regex: ,(?![^",]*")
It matches commas that are not inside quotes: a comma , that is not followed (as asserted by the negative lookahead (?![^",]*") ) by characters that are neither a comma nor a quote then a quote.
In VB.NET, something like:
Dim MyRegex As New Regex(",(?![^"",]*"")")
ResultString = MyRegex.Replace(Subject, "|")
Option 2
This uses this beautifully simple regex: "[^"]*"|(,)
This is a more general solution that is easy to tweak. For a full description, I recommend you have a look at this question about regex-matching or replacing... except when.... It can make for a very tidy solution that is easy to maintain if you find other cases to handle.
The left side of the alternation | matches complete "quotes". We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This code should work:
Imports System
Imports System.Text.RegularExpressions
Imports System.Collections.Specialized
Module Module1
Sub Main()
Dim MyRegex As New Regex("""[^""]*""|(,)")
Dim Subject As String = "LIST,410210,2-4,""PUMP, HYDRAULIC PISTON - MAIN"",1,,,"
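Dim Replaced As String = MyRegex.Replace(Subject,
Function(m As Match)
If (m.Groups(1).Value = "") Then
' A complete quoted field matched: leave it untouched
Return m.Groups(0).Value
Else
' A comma outside quotes matched: mark it as a separator
Return "|"
End If
End Function)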
Console.WriteLine(Replaced)
Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
Console.ReadKey()
End Sub
End Module
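For comparison, the "match what you want to skip, capture what you want to keep" idea from Option 2 can be sketched in Python. The helper name `mark_separators` and the `|` marker are assumptions for illustration; the marker must be a character that never occurs in the data.

```python
import re

# Left branch consumes whole quoted fields; right branch captures bare commas.
PATTERN = re.compile(r'"[^"]*"|(,)')


def mark_separators(line, marker="|"):
    """Replace commas outside quotes with marker, keep quoted fields as is."""
    def repl(m):
        # Group 1 is set only when the bare-comma branch matched.
        return marker if m.group(1) is not None else m.group(0)
    return PATTERN.sub(repl, line)


subject = 'LIST,410210,2-4,"PUMP, HYDRAULIC PISTON - MAIN",1,,,'
replaced = mark_separators(subject)
fields = replaced.split("|")
print(replaced)
print(fields)
```

Because the quoted fields are consumed by the left branch, the commas inside them are never seen by the right branch.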
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
I am trying to remove commas inside double quotes from a csv file in notepad++, this is what I have:
1070,17,2,GN3-670,"COLLAR B, M STAY","2,606.45"
and I need this:
1070,17,2,GN3-670,"COLLAR B M STAY","2606.45"
I am trying to use the Notepad++ find/replace option with a regex pattern.
I tried all kinds of combinations but didn't manage to do it :( The file contains 1 million rows.
After a whole day of this, I am no longer sure whether a simple regex can do it. Maybe I should go with a script... Python?
mrki, this will do what you want (tested in N++):
Search: ("[^",]+),([^"]+")
Replace: $1$2 or \1\2
How does this work? The first parentheses capture the beginning of the string up to (but not including) the comma into Group 1. The second parentheses capture the end of the string after the comma into Group 2. The replacement substitutes the string with a concatenation of Group 1 and Group 2.
In more detail: in the first parentheses, we match the opening double quote, then any number of characters that are neither a double quote nor a comma. That is the meaning of [^",]+. In the second parentheses, we match any number of characters that are not a double quote with [^"]+, then the closing double quote.
Try the following
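import re
string = '1070,17,2,GN3-670,"COLLAR B, M STAY","2,606.45"'
# remove every comma that is followed by an odd number of double quotes
print(re.sub(r',(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)', "", string))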
This will remove comma between quotes
Just an update to @zx81's brilliant solution.
Let's say you have 2 commas between the quotes.
Then the search regex has to be modified as follows:
("[^",]+),([^",]+),([^"]+")
The replacement needs to be modified to
$1$2$3
and so on; modify it depending on the number of commas.
I explored whether a recursive regex could do this, but that does not seem to be possible as of now.
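If a script is an option, the number of commas per field stops mattering. A minimal Python sketch, assuming balanced quotes and no escaped quotes inside fields (`strip_commas_in_quotes` is a made-up name): match each quoted field as a whole and delete the commas inside it.

```python
import re


def strip_commas_in_quotes(line):
    """Remove every comma that sits inside a double-quoted field."""
    return re.sub(r'"[^"]*"',
                  lambda m: m.group(0).replace(",", ""),
                  line)


row = '1070,17,2,GN3-670,"COLLAR B, M STAY","2,606.45"'
cleaned = strip_commas_in_quotes(row)
print(cleaned)
```

For a file with a million rows, this can be applied line by line while reading and writing, so the whole file never has to be in memory.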
For a line with multiple instances of "comma within double quotes", I can think of the following perl script - you need to have a header line without this kind of instance so that you know how many comma-separated fields there should be.
#! /usr/bin/perl -w
use strict;
my $n_fields = "";
while (<>) {
s/\s+$//;
if (/^\#/) { # header line
my @t = split(/,/);
$n_fields = scalar(@t); # total number of fields
} else { # actual data
my $n_commas = $_ =~s/,/,/g; # total number of commas
foreach my $i (0 .. $n_commas - $n_fields) { # iterate ($n_commas - $n_fields + 1) times
s/(\"[^",]+),([^"]+\")/$1\\x2c$2/g; # single replacement per previous answers
}
s/\"//g; # removal of double quotes (if you want)
}
print "$_\n";
}
I have multiple lines in some text files such as
.model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p' passive=2
I want to extract the text between the single quotes in MATLAB.
Much help would be appreciated.
To get all of the text inside multiple '' blocks, regexp can be used as follows:
regexp(txt,'''(.[^'']*)''','tokens')
This says to get text surrounded by ' characters, which does not include a ' in the captured text. For example, consider this file with two lines (I made up different file name),
txt = ['.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2 ', char(10), ...
'.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'' passive=2']
>> stringCell = regexp(txt,'''(.[^'']*)''','tokens');
>> stringCell{:}
ans =
'../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'
ans =
'../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'
>>
Trivia:
char(10) gives a newline character because 10 is the ASCII code for newline.
The . character in a regexp (regex in the rest of the coding world) pattern usually does not match a newline, which would make this a safer pattern. In MATLAB, however, a dot in regexp does match a newline; to disable this, we could add 'dotexceptnewline' as the last input argument to regexp. This is a convenient way to ensure we don't capture the text outside of the quotes instead, but it is not needed here since the first match sets the precedent.
Instead of excluding a ' from the match with [^''], the match can be made non-greedy with ? as follows, regexp(txt,'''(.*?)''','tokens').
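As a cross-check outside MATLAB, both variants of the pattern (excluding the quote character, or matching non-greedily) can be sketched in Python, where `.` does not match a newline by default. The two-line sample text mirrors the one used above.

```python
import re

txt = (
    ".model sdata1 s tstonefile='../data/s_element/"
    "isdimm_rcv_via_2port_via_minstub.s50p' passive=2\n"
    ".model sdata1 s tstonefile='../data/s_element/"
    "isdimm_rcv_via_3port_via_minstub.s00p' passive=2"
)

# Variant 1: capture anything that is not a single quote.
by_exclusion = re.findall(r"'([^']*)'", txt)
# Variant 2: non-greedy match up to the next single quote.
by_lazy_match = re.findall(r"'(.*?)'", txt)

print(by_exclusion)
```

Both lists contain the two file paths without the surrounding quotes.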
If you plan to use textscan:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','''');
fclose(fid);
output = rawdata{:}(2)
As also used in other answers, the single apostrophe ' is represented by a doubled one, '', e.g. for delimiters.
Considering the comment:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','\n');
fclose(fid);
lines = rawdata{1,1};
L = size(lines,1);
output = cell(L,1);
for ii=1:L
temp = textscan(lines{ii},'%s','delimiter','''');
output{ii,1} = temp{:}(2);
end
One easy way is to split the string with single quote delimiter and take the even-numbered strings in the output:
str = fileread('test.txt');
out = regexp(str, '''', 'split');
out = out(2:2:end);
You can do this using regular expressions. Assuming that there is only one occurrence of text between quotation marks:
% select all chars between single quotation marks.
out = regexp(inputString,'''(.*)''','tokens','once');
After identifying which lines you want to extract info from, you could tokenize them or do something like this if they all have the same form:
test='.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2';
a=strfind(test,'''')
test=test(a(1)+1:a(2)-1) % text between the quotes, without the quotes themselves
[15-]
[41-(32)]
[48-(45)]
[70-15]
[40-(64)]
[(128)-42]
[(128)-56]
I have these values, from which I want to extract the value that is not in parentheses. If there is more than one, add them together.
What is the regular expression to do this?
So the solution would look like this:
[15-] -> 15
[41-(32)] -> 41
[48-(45)] -> 48
[70-15] -> 85
[40-(64)] -> 40
[(128)-42] -> 42
[(128)-56] -> 56
You would be overcomplicating things if you went for a regex approach (in this case, at least). Also, regular expressions do not support mathematical operations, as pointed out by @richardtallent.
You can use an approach as shown here to extract a substring which omits the initial and final square brackets, and then, use the Split (as shown here) and split the string in two using the dash sign. Lastly, use the Instr function (as shown here) to see if any of the substrings that the split yielded contains a bracket.
If any of the substrings contains a bracket, it is omitted from the addition; otherwise the substrings are added up.
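The split-based approach described above is sketched here in Python rather than VBA. It assumes each value has the form [a-b], where a and b are non-negative integers, parenthesized integers, or empty; `sum_unparenthesized` is a hypothetical name.

```python
def sum_unparenthesized(token):
    """Add up the parts of '[a-b]' that are not wrapped in parentheses."""
    inner = token.strip()[1:-1]   # drop the leading [ and trailing ]
    parts = inner.split("-")      # split at the dash
    # Skip empty parts and anything that contains a parenthesis.
    return sum(int(p) for p in parts if p and "(" not in p)


samples = ["[15-]", "[41-(32)]", "[48-(45)]", "[70-15]",
           "[40-(64)]", "[(128)-42]", "[(128)-56]"]
results = [sum_unparenthesized(s) for s in samples]
print(results)
```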
Regular expressions do not support performing math on the terms. You can loop through the groups that are matched and perform the math outside of the regex.
Here's the pattern to extract any number within the square brackets that is not in parentheses:
\[
(?:(?:\d+|\([^\)]*\))-)*
(\d+)
(?:-[^\]]*)*
\]
Each number will be returned in $1.
This works by looking for a number that is prefixed by any number of "words" separated by dashes, where the "words" are either numbers themselves or parenthesized strings, and followed by, optionally, a dash and some other stuff before hitting the closing bracket.
If VBA's RegEx doesn't support non-capturing groups (?:), remove all of the ?:'s and your captured number will be in $3 instead.
A simpler pattern also works:
\[
(?:[^\]]*-)*
(\d+)
(?:-[^\]]*)*
\]
This simply looks for numbers delimited by dashes and allowing for the number to be at the beginning or end.
Private Sub regEx()
Dim RegexObj As New VBScript_RegExp_55.RegExp
RegexObj.Pattern = "\[(\(?[0-9]*?\)?)-(\(?[0-9]*?\)?)\]"
Dim str As String
str = "[15-]"
Dim Match As Object
Set Match = RegexObj.Execute(str)
Dim result As Integer
Dim value1 As Integer
Dim value2 As Integer
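' InStr returns 0 when "(" is absent. "Not InStr(...)" is a bitwise Not and is never 0, so compare with 0 instead.
If InStr(1, Match.Item(0).submatches.Item(0), "(", 1) = 0 And Match.Item(0).submatches.Item(0) <> "" Then
value1 = Match.Item(0).submatches.Item(0)
End If
If InStr(1, Match.Item(0).submatches.Item(1), "(", 1) = 0 And Match.Item(0).submatches.Item(1) <> "" Then
value2 = Match.Item(0).submatches.Item(1)
End If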
result = value1 + value2
MsgBox (result)
End Sub
Replace [15-] with the other strings.
Ok! It's been 6 years and 6 months since the question was posted. Still, for anyone looking for something like that maybe now or in the future...
Step 1:
Trim Leading and Trailing Spaces, if any
Step 2:
Find/Search:
\]|\[|\(.*\)
Replace With:
<Leave this field Empty>
Step 3:
Trim Leading and Trailing Spaces, if any
Step 4:
Find/Search:
^-|-$
Replace With:
<Leave this field Empty>
Step 5:
Find/Search:
-
Replace With:
\+
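The five steps can also be chained in a script. Below is a rough Python sketch of the same find/replace sequence; the function names are made up, and it assumes each value contains at most one parenthesized group, since \(.*\) is greedy.

```python
import re


def to_sum_expression(token):
    """Apply the five find/replace steps and return e.g. '70+15'."""
    s = token.strip()                     # step 1: trim spaces
    s = re.sub(r"\]|\[|\(.*\)", "", s)    # step 2: drop [ ] and (...)
    s = s.strip()                         # step 3: trim spaces again
    s = re.sub(r"^-|-$", "", s)           # step 4: drop a leading or trailing dash
    return s.replace("-", "+")            # step 5: remaining dash becomes +


def evaluate(expr):
    """Add up the numbers of an expression such as '70+15'."""
    return sum(int(n) for n in expr.split("+") if n)


expressions = [to_sum_expression(t) for t in ["[15-]", "[70-15]", "[(128)-42]"]]
totals = [evaluate(e) for e in expressions]
print(expressions, totals)
```

The find/replace steps only produce the expression text; the addition itself still has to happen outside the regex, as in `evaluate`.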