Delete everything before a double quote - regex

I'm trying to clean a CSV file which has a column with contents like this:
Sometexthere1", "code"=>"47.51-2-01"}]
And I would like to remove everything before the first quote (") in order to keep just this:
Sometexthere1
I know that I can use $` to get everything before some match in regex, but I am not understanding how to keep just the string before the first double quote.

Parameter expansion does this well enough:
# Define a variable
s='Sometexthere1", "code"=>"47.51-2-01"}]'
# expand it, removing the longest possible match (from the end) for '"'*
result=${s%%'"'*}
# demonstrate that result by printing it
printf '%s\n' "$result"
...properly returns Sometexthere1.
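For readers not working in a shell, the same idea can be sketched in Python. This is only an illustration under the assumption that the cell is already available as a string; the variable names are made up for the example.

```python
import re

s = 'Sometexthere1", "code"=>"47.51-2-01"}]'

# str.partition splits at the first occurrence of the separator;
# element 0 is everything before it (the whole string if there is no quote).
before_quote = s.partition('"')[0]

# Regex alternative: delete the first double quote and everything after it.
# re.DOTALL lets "." also cover newlines inside the cell.
before_quote_re = re.sub(r'".*', '', s, flags=re.DOTALL)

print(before_quote)
print(before_quote_re)
```

Both variants leave the string unchanged when it contains no double quote at all.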

You probably mean "delete everything after a double quote"? In OpenRefine, you can use this GREL formula:
value.replace(/".+/, "")
> Result : Sometexthere1

Related

Tcl regular expression to detect square brackets

I have a .csv files where many rows have one of the field values like this:
scl[0]
scl[1]
scl[2]
sda[1]
sda[2]
sda[3]
I am storing them in a variable while reading the csv files in line by line format,like:
set string [$m get cell 0 1]
Now when I do regexp to check whether the cell has scl[0] I am unable to pass the square bracket to this regular expression:
I gave this syntax:
if{[regexp "scl\[0\]" $string]} {
...
}
But the if condition doesn't get executed.
In the case of scl(0), i.e. () instead of [] in the csv file, I gave {[regexp "scl\[(\]0\[)\]" $string]}, which worked. I tried to apply the same format to square brackets, but it still doesn't get evaluated.
Am I missing something?
Please help.
Thanks
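Note that \ has special meaning inside double quotes: Tcl removes the backslashes before regexp ever sees the pattern, so the regex engine receives scl[0], where [0] is a bracket expression matching the single character 0. So just do: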
regexp "scl\\[0\\]" $string
or:
regexp {scl\[0\]} $string
You could also use string equal: then you only need to worry about one level of quoting:
string equal {scl[0]} $string
Documentation:
string
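The same pitfall exists in most regex flavours. As a rough illustration in Python (not Tcl), the sketch below shows what an unescaped bracket does and two ways around it; `cell` is a hypothetical stand-in for the value read from the CSV.

```python
import re

cell = "scl[0]"

# Unescaped, [0] is a character class that matches the single character "0",
# so this pattern searches for the text "scl0" and finds nothing here.
unescaped = re.search("scl[0]", cell)

# re.escape backslash-escapes the brackets so they are matched literally.
escaped = re.search(re.escape("scl[0]"), cell)

# For an exact comparison no regex is needed at all.
exact = (cell == "scl[0]")

print(unescaped is None, escaped is not None, exact)
```

The plain comparison corresponds to the string equal suggestion above: with no pattern involved, there is nothing to escape.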

Removing parentheses as unwanted text in R using gsub

I'm trying to clean up a column in my data frame where the rows look like this:
1234, text ()
and I need to keep just the number in all the rows. I used:
df$column = gsub(", text ()", "", df$column)
and got this:
1234()
I repeated the operation with only the parentheses, but they won't go away. I wasn't able to find an example that deals specifically with parentheses being eliminated as unwanted text. sub doesn't work either.
Does anyone know why this isn't working?
Parentheses are metacharacters in regex. You should escape them using either \\ or [], or add fixed = TRUE. But in your case you just want to keep the number, so just remove everything else using \\D
gsub("\\D", "", "1234, text ()")
## [1] "1234"
If your column always follows the format described above:
1234, text ()
Something like the following should work:
string extractedNumber = Regex.Match(INPUT_COLUMN, @"^\d{4,}").Value;
Reads like: From the start of the string find four or more digits.
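To make the behaviour concrete, here is a small Python sketch of the same three ideas (unescaped parentheses, escaped parentheses, and keeping only the digits). It assumes Python's re module rather than R or .NET, but the regex semantics shown are the same.

```python
import re

value = "1234, text ()"

# "()" in a pattern is an empty group, so only ", text " is removed
# and the literal parentheses stay behind.
unescaped = re.sub(", text ()", "", value)

# Escaping the parentheses makes them literal characters.
escaped = re.sub(r", text \(\)", "", value)

# Or keep only the digits by deleting every non-digit character.
digits_only = re.sub(r"\D", "", value)

# Or extract the leading run of four or more digits.
m = re.match(r"\d{4,}", value)
leading = m.group(0) if m else ""

print(unescaped, escaped, digits_only, leading)
```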

Regex in perl and variable name

I have some issues with a regular expression in Perl. I'm trying to add a string at the beginning of another string (in fact, insert a string at the beginning of a file name). Before inserting that string, I want to check whether the file name already begins with it.
This is the code I have:
if ($ficheroSinExt !~ m/^$strCadena/){
# if it doesn't exist at the beginning, I insert it...
$ficheroSinExt = $strCadena . " " . $ficheroSinExt;
}
else{
print "---->It already exists!!!\n";
}
I'm testing it with two filenames with only one containing [Perl] at the beginning ("[Perl] File1.pdf" and "File2.pdf"), and $strCadena contains [Perl]. I end up adding [Perl] for both files, so their new names are "[Perl] [Perl] File1.pdf" and "[Perl] File2.pdf".
I think the problem comes from the ^$strCadena of the match operator, but I can't find a way to work around it. Could you please give me a hand?
Thanks in advance,
Diego
Quote the special characters:
if ($ficheroSinExt !~ m/^\Q$strCadena/){
# here __^
You want to disable pattern metacharacters (see perlre)
if ($ficheroSinExt !~ m/^\Q$strCadena\E/){
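The effect of quoting can be sketched in Python, where re.escape plays roughly the role of \Q...\E. The helper name add_prefix is made up for this illustration.

```python
import re

prefix = "[Perl]"
names = ["[Perl] File1.pdf", "File2.pdf"]

# Unescaped, "[Perl]" is a character class matching ONE of P, e, r, l,
# so neither file name matches at its first character.
broken = [bool(re.match(prefix, n)) for n in names]


def add_prefix(name, prefix):
    """Prepend prefix unless name already starts with it (hypothetical helper)."""
    # re.escape turns the brackets into literal characters.
    if re.match(re.escape(prefix), name):
        return name
    return prefix + " " + name


renamed = [add_prefix(n, prefix) for n in names]

print(broken)
print(renamed)
```

When the prefix is a fixed string rather than a pattern, a plain prefix test (name.startswith(prefix) in Python) avoids the regex entirely.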

R! remove element from list which start from specific letters

I create a list of files:
folder_GLDAS=dir(foldery[numeryfolderow],pattern="_OBC.asc",recursive=F,full.names=T)
Unfortunately there is one additional object which I would like to remove (its file name begins with "NOWY": NOWYevirainf_OBC.asc).
How can I find the index of this element in the list so that I can remove it by typing:
folder_GLDAS<-folder_GLDAS[-to_remove] ??
Filter by using a regular expression.
# full.names=T returns full paths, so test only the file name
folder_GLDAS <- folder_GLDAS[!grepl("^NOWY", basename(folder_GLDAS))]
(You can also swap grepl for str_detect in stringr.)
Assuming that your list is one-dimensional, something like this should work:
folder_GLDAS <- folder_GLDAS[substr(basename(folder_GLDAS),1,4)!='NOWY']
You can actually make a (rather complex) PERL regex pattern that matches all names that end in "_OBC.asc" but DO NOT start with "NOWY": "^(?!NOWY).*_OBC\\.asc$"
Unfortunately the PERL syntax is not recognized by dir. But you could do it with grep like this:
folder_GLDAS <- dir(foldery[numeryfolderow],recursive=F,full.names=T)
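folder_GLDAS <- folder_GLDAS[grep("^(?!NOWY).*_OBC\\.asc$", basename(folder_GLDAS), perl=T)]
Also note that the "." in "_OBC.asc" needs to be escaped; otherwise you'll match, for example, "_OBCXasc" as well. The pattern is applied to basename() because full.names=T returns full paths, which never start with "NOWY".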
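For comparison, the negative lookahead can be tried out in Python, whose re module supports the same (?!...) syntax. The paths below are invented for the example, and the pattern is applied to the file name only, not to the full path.

```python
import os
import re

# Hypothetical listing with full paths.
paths = [
    "data/GLDAS/NOWYevirainf_OBC.asc",
    "data/GLDAS/evirainf_OBC.asc",
    "data/GLDAS/rain_OBCXasc",
]

# ^(?!NOWY) rejects names starting with NOWY; \. matches a literal dot.
pattern = re.compile(r"^(?!NOWY).*_OBC\.asc$")

kept = [p for p in paths if pattern.search(os.path.basename(p))]

print(kept)
```

The third entry shows why the dot is escaped: with an unescaped dot, "rain_OBCXasc" would be kept as well.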

Regular expression to match CSV delimiters

I'm trying to create a PCRE that will match only the commas used as delimiters in a line from a CSV file. Assuming the format of a line is this:
1,"abcd",2,"de,fg",3,"hijk"
I want to match all of the commas except for the one between the 'e' and 'f'. Alternatively, matching just that one is acceptable, if that is the easier or more sensible solution. I have the sense that I need to use a negative lookahead assertion to handle this, but I'm finding it a bit too difficult to figure out.
See my post that solves this problem for more detail.
^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$ will match the whole line; then you can use match.Groups[1].Captures to get your data out (without the quotes). Also, I let "My name is ""in quotes""" be a valid string.
CSV parsing is a difficult problem, and has been well-solved. Whatever language you are using doubtless has a complete solution that takes care of it, without you having to go down the road of writing your own regex.
What language are you using?
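As one example of such a ready-made solution, Python ships a csv module in its standard library. The sketch below feeds it the sample line from the question; wrapping the line in io.StringIO is just a convenience for the demonstration.

```python
import csv
import io

line = '1,"abcd",2,"de,fg",3,"hijk"\n'

# csv.reader accepts any iterable of lines; StringIO stands in for a file.
rows = list(csv.reader(io.StringIO(line)))
fields = rows[0]

print(fields)
```

The comma inside "de,fg" stays part of the fourth field, and the enclosing quotes are removed.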
As you've already been told, a regular expression is really not appropriate; it is tricky to deal with the general case (doubly so if newlines are allowed in fields, and triply so if you might have to deal with malformed CSV data).
I suggest the tool CSVFIX as likely to do what you need.
To see how bad CSV can be, consider this data (with 5 clean fields, two of them empty):
"""",,"",a,"a,b"
Note that the first field contains just one double quote. Getting the two double quotes squished to one is really rather tough; you probably have to do it with a second pass after you've captured both with the regex. And consider this ill-formed data too:
"",,"",a",b c",
The problem there is that the field that starts with a contains a double quote; how to interpret it? Stop at the comma? Then the field that starts with b is similarly ill-formed. Stop at the next quote? So the field is a",b c" (or should the quotes be removed)? Etc...yuck!
This Perl gets pretty close to handling correctly both the above lines of data with a ghastly regex:
use strict;
use warnings;
my @list = ( q{"""",,"",a,"a,b"}, q{"",,"",a",b c",} );
foreach my $string (@list)
{
print "Pattern: <<$string>>\n";
while ($string =~ m/ (?: " ( (?:""|[^"])* ) " | ( [^,"] [^,]* ) | ( .? ) )
(?: $ | , ) /gx)
{
print "Found QF: <<$1>>\n" if defined $1;
print "Found PF: <<$2>>\n" if defined $2;
print "Found EF: <<$3>>\n" if defined $3;
}
}
Note that as written, you have to identify which of the three captures was actually used. With two-stage processing, you could just deal with one capture and then strip out enclosing double quotes and nested doubled-up double quotes. This regex assumes that if the field does not start with a double quote, then a double quote has no special meaning within the field. Have fun ringing the changes!
Output:
Pattern: <<"""",,"",a,"a,b">>
Found QF: <<"">>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a>>
Found QF: <<a,b>>
Found EF: <<>>
Pattern: <<"",,"",a",b c",>>
Found QF: <<>>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a">>
Found PF: <<b c">>
Found EF: <<>>
We can debate whether the empty field (EF) at the end of the first pattern is correct; it probably isn't, which is why I said 'pretty close'. OTOH, the EF at the end of the second pattern is correct.
Also, the extraction of two double quotes from the field """" is not the final result you want; you'd have to post-process the field to eliminate one of each adjacent pair of double quotes.
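The post-processing step could look roughly like the following Python sketch. It assumes the first pass has already produced the raw field texts, including any enclosing quotes; the function name unquote_field is invented for the example.

```python
def unquote_field(raw):
    """Return the logical value of one raw CSV field (hypothetical helper)."""
    # A quoted field: drop the enclosing quotes, then collapse each
    # doubled double quote into a single one.
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1].replace('""', '"')
    # An unquoted field is taken as-is.
    return raw


# Raw fields of the first sample line: """",,"",a,"a,b"
raw_fields = ['""""', '', '""', 'a', '"a,b"']
values = [unquote_field(f) for f in raw_fields]

print(values)
```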
Without thinking too hard, I would do something like [0-9]+|"[^"]*" to match everything except the comma delimiters. Would that do the trick?
Without context it's impossible to give a more specific solution.
Andy's right: correctly parsing CSV is a lot harder than you probably realise, and has all kinds of ugly edge cases. I suspect that it's mathematically impossible to correctly parse CSV with regexes, particularly those understood by sed.
Instead of sed, use a Perl script that uses the Text::CSV module from CPAN (or the equivalent in your preferred scripting language). Something like this should do it:
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new ( { binary => 1, eol => $/ } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
my $rows = $csv->getline_all(\*STDIN);
for my $row (@$rows) {
say join("\t", @$row);
}
That assumes that you don't have any tab characters embedded in your data, of course - perhaps it would be better to do the subsequent stages in a Real Scripting Language as well, so you could take advantage of proper lists?
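One way around the embedded-tab concern, sketched here in Python rather than Perl, is to let a CSV writer produce the tab-separated output as well, since it quotes any field that contains the delimiter. The in-memory streams are stand-ins; in a real script they would typically be sys.stdin and sys.stdout.

```python
import csv
import io

src = io.StringIO('1,"abcd",2,"de,fg",3,"hijk"\n')
out = io.StringIO()

# The writer quotes a field only when it contains a tab, a quote,
# or a line break, so ordinary fields come out bare.
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
for row in csv.reader(src):
    writer.writerow(row)

result = out.getvalue()
print(result, end="")
```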
I know this is old, but this RegEx works for me:
/(\"[^\"]+\")|[^,]+/g
It could potentially be used with any language. I tested it in JavaScript, so the g is just the global modifier. It works even with messed-up lines (extra quotes), but empty fields are not dealt with.
Just sharing, maybe this will help someone.
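A quick way to see both the behaviour and the limitation is to run the pattern through Python's re.findall. The capturing group is dropped here so that findall returns whole matches; otherwise this is the same pattern, and the second input line is made up to show the empty-field case.

```python
import re

# Same pattern without the capturing group and the redundant escapes.
field_re = re.compile(r'"[^"]+"|[^,]+')

fields = field_re.findall('1,"abcd",2,"de,fg",3,"hijk"')

# An empty field between two commas produces no match at all,
# so the field count no longer reflects the column positions.
with_empty = field_re.findall('a,,b')

print(fields)
print(with_empty)
```

Note that the quoted fields keep their enclosing quotes, so a second pass is still needed to strip them.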