TCL multi capture group for simplified csv string parsing with regexp - regex

I'm trying to parse a simplified CSV format with TCL regexp. I chose regexp over split to perform rudimentary format compliance test.
My problem is that I want to use a count quantifier but want to exclude the ',' from the match.
My test line:
set line "2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, 1/640, F/8.0, ISO 100, Partial 450D 0.0%"
So far I have:
regexp -all {(?:([^\,]*)\,){8}} $line dummy date tm off cam exp fnum iso com
My thought process is:
Get a match group for all characters that are not comma up to the next comma.
Now I want to match this 8 times, so I put it into a non-capturing group followed by a counting quantifier. But that defeats the purpose, as now nothing is matched. What I need is a way to make the match go through the CSV 8 times and capture the text but not the comma.
My CSV is simplified in the following.
No quoted strings in the CSV
No empty entries in CSV
I've checked google for csv matching but most hits were too blown up due to allowing special cases in the CSV content.
Thanks,
Gert

In the regexp command, the interaction between the -all switch and the match variables is that the values captured in the last iteration of matching are used to fill the variables. This means that you can't fill eight variables by having one capture group and iteratively matching it eight times.
Your regular expression doesn't match anyway, since it requires a comma after the last field.
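Both points can be reproduced outside Tcl. The following is a minimal Python sketch (Python's re module, not Tcl's regexp), assuming the test line from the question; a repeated capture group there also keeps only its last repetition, and the eight-comma pattern likewise fails on a line with seven commas. The variable names are made up for illustration.

```python
import re

line = ("2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, "
        "1/640, F/8.0, ISO 100, Partial 450D 0.0%")

# One capture group repeated by a quantifier: only the last repetition survives.
repeated = re.match(r"(?:([^,]*),){7}", line)
last_only = repeated.group(1)  # text of the seventh field only

# The pattern from the question needs a comma after each of 8 fields,
# but the line contains only 7 commas, so there is no match at all.
no_match = re.match(r"(?:([^,]*),){8}", line)
```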
For this particular example, you could use the invocation
% regexp -all -inline {[^,]+} $line
{2017/08/21 16:06:20.0} { REALTIME} { late by 0.3} { EOS450D} { 1/640} { F/8.0} { ISO 100} { Partial 450D 0.0%}
This means to match all groups of characters that aren't commas (note that the comma isn't special: you don't need to escape it) and return them as a list.
As you noted, this is the same as using
% split $line ,
(which is also about five times faster).
You didn't want to use split because you wanted to do some validation: it is unclear what forms of validation you wanted to do, but you can easily validate the number of fields found:
% set fields [split $line ,]
% if {[llength $fields] != 8} {puts stderr "wrong number of fields"}
You can store the fields in variables and validate them separately, which is a lot easier to get right than trying to validate them all at the same time while extracting them:
lassign $fields date tm off cam exp fnum iso com
if {![regexp {ISO\s+\d+} $iso]} {puts stderr "in search of valid ISO"}
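The same split-then-validate idea, as a minimal Python sketch rather than Tcl. It assumes the simplified format from the question (no quoted strings, no empty entries); the function name parse_simple_csv and the error messages are hypothetical.

```python
import re

def parse_simple_csv(line, expected=8):
    """Split a simplified CSV line and run basic compliance checks."""
    fields = line.split(",")
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    if any(field.strip() == "" for field in fields):
        raise ValueError("empty field")
    return fields

line = ("2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, "
        "1/640, F/8.0, ISO 100, Partial 450D 0.0%")

# Store the fields in separate variables, then validate each on its own.
date, tm, off, cam, exp, fnum, iso, com = parse_simple_csv(line)
iso_ok = re.search(r"ISO\s+\d+", iso) is not None
```

Validating each field after the split keeps every check small and independent, which is the point made above.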
The best method is still to split the data string using the csv package. Even if you just want to use this simplified CSV now, sooner than you think you might want to, say, allow fields with commas in them.
package require csv
set fields [::csv::split $line]
Documentation:
csv (package),
if,
lassign,
llength,
package,
puts,
regexp,
set,
split,
Syntax of Tcl regular expressions
ETA: Getting rid of leading/trailing whitespace. This is a bit unusual, since CSV data is usually arranged to be fields of strictly significant text separated by a separator character. If there is anything to be trimmed, it is usually done when saving the data.
A good way is to put the matched groups through an lmap/string trim filter:
lmap field [regexp -all -inline {[^,]+} $line] {string trim $field}
Another way is to get rid of whitespace around commas first, and then split:
split [regsub -all {\s*,\s*} $line ,] ,
You can use the Tcllib variant of split that splits by regular expression:
package require textutil
::textutil::splitx $line {\s*,\s*}
You can also swap out the earlier regular expression for [^\s,][^,]*[^\s,] (will not match fields of less than two characters). This is a regular expression that is on the verge of becoming too complex to be useful.
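For comparison, a short Python sketch of the same trimming options (not Tcl; the sample strings are made up). It also shows the limitation mentioned for the last pattern: a one-character field is silently skipped.

```python
import re

line = "2017/08/21 16:06:20.0, REALTIME, late by 0.3"

# Split first, then trim each field (the lmap/string trim idea).
trimmed = [field.strip() for field in line.split(",")]

# Split on a pattern that swallows whitespace around the comma (the splitx idea).
split_on_pattern = re.split(r"\s*,\s*", line)

# The two-ended pattern needs at least two characters per field,
# so the one-character field "a" is not returned.
missed = re.findall(r"[^\s,][^,]*[^\s,]", "a, bb, ccc")
```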

Related

Tcl Remove all characters after a string without removing the string

In tcl is there a way to trim out all character AFTER a designated string? I have seen lots of posts on removing all after and including the string but not what I am hoping to do. I have a script that searches for file names with the suffix .sv but some of them are .sv.**bunch of random stuff**. and I don't need the random stuff as it is not relevant to me.
I have experimented with different regsub and string trim commands but they always remove the .sv as well.
The results being appended to a list look like this:
test_module_1.sv.random_stuff
test_module_2.sv.random_stuff
test_module_3.sv.random_stuff
test_module_4.sv.random_stuff
test_module_5.sv.random_stuff
etc etc
You can put matched parts of a regex pattern back into the replacement when you use regsub. An example:
regsub {(\.sv).*} $str {\1} new
This removes .sv and anything after it, then puts back the first captured group (the part between parentheses, here .sv), so that an input of example.sv.random becomes example.sv.
However, you can also easily replace with .sv like so:
regsub {\.sv.*} $str {.sv} new
Or another approach not involving replacing would be to get the part of the string up until the .sv part:
string range $str 0 [expr {[string first ".sv" $str]+2}]
Here [string first ".sv" $str] gets the position of the . in .sv (the first occurrence, if there are several). Adding 2 moves the index to the v (sv after the . is 2 characters long), and string range returns all characters up to and including .sv.
Or if you want to stick with regexes:
regexp {.+?\.sv} $str match
$match will contain the result string. The expression used grabs all characters up to and including .sv.
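The index-based and regex-based approaches can be compared side by side in a small Python sketch (not Tcl; the helper name keep_through is hypothetical). Unlike a bare index calculation, this version returns the name unchanged when the marker is absent.

```python
import re

def keep_through(name, marker=".sv"):
    """Return name cut off right after the first occurrence of marker."""
    pos = name.find(marker)
    if pos == -1:
        return name  # marker not present: leave the name alone
    return name[:pos + len(marker)]

names = ["test_module_1.sv.random_stuff", "test_module_2.sv", "notes.txt"]

trimmed = [keep_through(n) for n in names]

# Regex variant: replace ".sv" plus everything after it with the captured ".sv".
by_regex = [re.sub(r"(\.sv).*", r"\1", n) for n in names]
```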

PCRE regex replace a text pattern within double quotes

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turns your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Naturally, the search and replace will fail if the input varies from the pattern described, as the search looks for those four groups of a's inside the two quoted strings.
Also I tested that regexp using Zeus which can create a regexp with more than 9 groups.
As you can see the regexp requires 10 groups.
I'm not familiar with Notepad++, so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
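Outside Notepad++, the "find within an existing match" step is easy to express in a scripting language. A minimal Python sketch, assuming (as the question states) that escaped quotes have already been replaced so that quotes always pair up; the function name dash_inside_quotes is made up for illustration:

```python
import re

def dash_inside_quotes(text, target="a", replacement="-"):
    """Replace target only inside "..." pairs, leaving the rest untouched."""
    def fix(match):
        # Second-stage replacement, applied only to the quoted span.
        return match.group(0).replace(target, replacement)
    # First stage: find each quoted span, quotes included.
    return re.sub(r'"[^"]*"', fix, text)

result = dash_inside_quotes('aa"bbabaavv"kdjhas"bbabaavv"x')
```

Because the outer pattern consumes both quotes of each pair, the text between two quoted strings (kdjhas here) is never treated as quoted.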
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working :) Basically, it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This leaves a trailing \t at the end of each row, which effectively sets these EOLs apart from ones that might exist within a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)
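If a script is an option instead of a Notepad++ macro, the same transformation can be sketched in Python. This swaps in a different technique: the standard csv module reads the tab-delimited clipboard text and handles the quoted multi-line cells, so no dummy column is needed. The function name to_values_list is hypothetical, and the input is assumed to be tab-delimited with " as the quote character.

```python
import csv
import io

def to_values_list(clipboard_text):
    """Turn tab-delimited text (quoted multi-line cells allowed) into SQL tuples."""
    reader = csv.reader(io.StringIO(clipboard_text, newline=""), delimiter="\t")
    tuples = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Double any single quote, then wrap each cell in single quotes.
        cells = ", ".join("'" + cell.replace("'", "''") + "'" for cell in row)
        tuples.append("(" + cells + ")")
    return "\n, ".join(tuples)

sample = 'line1\tline 2\n"line3\nline3b\nline3c"\tline 4\n'
values = to_values_list(sample)
```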

Perl Regex: How to remove quotes inside quotes from CSV line

I've got a line from a CSV file with " as field encloser and , as field separator as a string. Sometimes there are " in the data that break the field enclosers. I'm looking for a regex to remove these ".
My string looks like this:
my $csv = qq~"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""~;
I've looked at this but I don't understand how to tell it to only remove quotes that are
not at the beginning of the string
not at the end of the string
not preceded by a ,
not followed by a ,
I managed to tell it to remove 3 and 4 at the same time with this line of code:
$csv =~ s/(?<!,)"(?!,)//g;
However, I cannot fit the ^ and $ in there since the lookahead and lookbehind both do not like being written as (?<!(^|,)).
Is there a way to achieve this only with a regex besides splitting the string up and removing the quote from each element?
For manipulating CSV data I'd recommend using Text::CSV. There's a lot of potential complexity within CSV data, and while it is possible to construct code to handle it yourself, it isn't worth the effort when there's a tried and tested CPAN module to do it for you.
Don't use a regex to parse a CSV file; CPAN provides plenty of good modules. As nickifat suggests, use Text::CSV, or you can use Text::ParseWords like this:
use Text::ParseWords;
while (<DATA>) {
chomp;
my @f = quotewords ',', 0, $_;
print join "|" => @f;
}
__DATA__
"123456","024003","Stuff","",""28" stuff with more stuff","2"," 1.99 ","",""
Output:
123456|024003|Stuff||28 stuff with more stuff|2| 1.99 ||
This should work:
$csv =~ s/(?<=[^,])"(?=[^,])//g
1 and 2 imply that there must be at least one character before and after the quote, hence the positive lookarounds. 3 and 4 imply that these characters can be anything but a comma.
Thanks for the help here. I was having issues with badly formatted CSV with embedded double-quotes. I would make one slight addition to the lookahead portion of the regex otherwise null values at the end of the line will be corrupted:
(?<=[^,])\"(?=[^,\n])
Adding the \n will eliminate a match against the last double-quote at end-of-line.
The suggested
$csv =~ s/(?<=[^,])"(?=[^,])//g;
is probably the best answer. Without these advanced regex features, you could also do the same with
$csv =~ s/([^,])"([^,])/$1$2/g;
or
$csv = join (',', map {s/"//g;"\"$_\""} split (',', $csv));
I think you should be aware that your string is not well-formatted CSV. In a CSV file, double quotes inside values must be doubled (http://en.wikipedia.org/wiki/Comma-separated_values). With your format, values cannot contain quotes near commas.
CSV is not such a simple format. If you decide to use "real" CSV, you should use a module.
Otherwise, you should probably remove all the double quotes in order to simplify your code and clarify that you are not doing csv.
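The lookaround substitution discussed in this thread can be checked quickly in Python, whose re module accepts the same fixed-width lookbehind. This is only a sketch on the sample line from the question (Python, not Perl); it shows that the stray quote after 28 is removed while the field enclosers survive, and that adding \n to the lookahead protects a trailing empty field when the line ends in a newline.

```python
import re

csv_line = ('"123456","024003","Stuff","","28" stuff with more stuff",'
            '"2"," 1.99 ","",""')

# Remove a quote only if a non-comma character sits on both sides of it.
cleaned = re.sub(r'(?<=[^,])"(?=[^,])', "", csv_line)

# With a trailing newline, the last quote is followed by "\n" (not a comma),
# so the plain pattern would delete it; excluding "\n" avoids that.
damaged = re.sub(r'(?<=[^,])"(?=[^,])', "", csv_line + "\n")
cleaned_nl = re.sub(r'(?<=[^,])"(?=[^,\n])', "", csv_line + "\n")
```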

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
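The same four changes carry over to other engines that lack variable-length lookbehind. A minimal Python sketch (Python's re module also requires fixed-width lookbehind), using the sample text from the question with the second field lengthened to 84677:

```python
import re

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;84677;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Anchor at line start, match the prefix normally, capture only the wanted value.
values = re.findall(r"^bdfg34f;\d{4,};(\d{0,9})", text, flags=re.MULTILINE)
```

With a single capture group, findall returns just the captured text of each match.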
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
to match the first number with a minimum of 4 digits
(?<=bdfg34f;\d{4,};)\d{0,9}
and to match the first number with 1 or more length
(?<=bdfg34f;\d+;)\d{0,9}
or to match the first number only if the length is between 4 and 6
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[2]; // third column (arrays are zero-based)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.

Regular expression literal-text span

Is there any way to indicate to a regular expression a block of text that is to be searched for explicitly? I ask because I have to match a very, very long piece of text which contains all sorts of metacharacters (and has to match exactly), followed by some flexible stuff (enough to merit the use of a regex), followed by more text that has to be matched exactly.
Rinse, repeat.
Needless to say, I don't really want to have to run through the entire thing and have to escape every metacharacter. That just makes it a bear to read. Is there a way to wrap those portions so that I don't have to do this?
Edit:
Specifically, I am using Tcl, and by "metacharacters", I mean that there's all sorts of long strings like "**$^{*$%\)". I would really not like to escape these. I mean, it would add thousands of characters to the string. Does Tcl regexp have a literal-text span metacharacter?
The normal way of doing this in Tcl is to use a helper procedure to do the escaping, like this:
proc re_escape str {
# Every non-word char gets a backslash put in front
regsub -all {\W} $str {\\&}
}
set awkwardString "**$^{*$%\\)"
regexp "simpleWord *[re_escape $awkwardString] *simpleWord" $largeString
Where you have a whole literal string, you have two other alternatives:
regexp "***=$literal" $someString
regexp "(?q)$literal" $someString
However, both of these only permit patterns that are pure literals; you can't mix patterns and literals that way.
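For readers coming from other languages, the escape-helper approach corresponds to a library call in Python. A minimal sketch (Python, not Tcl), using the awkward string from the question; re.escape plays the role of the re_escape helper, and the haystack string is made up for illustration:

```python
import re

awkward = r"**$^{*$%\)"

# Escape only the literal part, then splice it between ordinary pattern pieces.
pattern = "simpleWord *" + re.escape(awkward) + " *simpleWord"

haystack = r"prefix simpleWord **$^{*$%\) simpleWord suffix"
found = re.search(pattern, haystack)
```

This keeps the source readable: the literal text appears once, unescaped, and the escaping happens at the point where the pattern is assembled.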
No, tcl does not have such a feature.
If you're concerned about readability you can use variables and commands to build up your expression. For example, you could do something like:
set fixed1 {.*?[]} ;# match the literal five-byte sequence .*?[]
set fixed2 {???} ;# match the literal three byte sequence ???
set pattern "this.*and.*that"
regexp "[re_escape $fixed1]$pattern[re_escape $fixed2]"
You would need to supply the definition for re_escape but the solution should be pretty obvious.
A Tcl regular expression can be specified with the q metasyntactical directive to indicate that the expression is literal text:
% set string {this string contains *emphasis* and 2+2 math?}
% puts [regexp -inline -all -indices {*} $string]
couldn't compile regular expression pattern: quantifier operand invalid
% puts [regexp -inline -all -indices {(?q)*} $string]
{21 21} {30 30}
This does however apply to the entire expression.
What I would do is to iterate over the returned indices, using them as arguments to [string range] to extract the other stuff you're looking for.
I believe Perl and Java support the \Q \E escape. so
\Q.*.*()\E
..will actually match the literal ".*.*()"
OR
Bit of a hack, but replace the literal section with some text which does not need escaping and that will not appear elsewhere in your searched string. Then build the regex using this metacharacter-free text. A 100-digit random sequence, for example. Then, when your regex matches at a certain position and length in the doctored string, you can calculate whereabouts it should appear in the original string and what length it should be.