Replace delimiter in a file without changing the value between quotes - regex

I have a csv file containing:
# Director, Movie Title, Year, Comment
Ethan Coen, No Country for Old Men, 2007, none
Ethan Coen, "O Brother, Where Art Thou?", 2000, none
Ethan Coen, The Big Lebowski, 1998, "uncredited (with his brother, Joel)"
I want to change the field separator from "," to "|", but I don't want to change a comma that is inside a quoted string. The result should look like this:
# Director| Movie Title| Year| Comment
Ethan Coen| No Country for Old Men| 2007| none
Ethan Coen| "O Brother, Where Art Thou?"| 2000| none
Ethan Coen| The Big Lebowski| 1998| "uncredited (with his brother, Joel)"
I tried this:
sed -e 's/(".)(.")/|\1 \2/g'
This is the result I am getting so far:
Ethan Coen, |"O Brother, Where Art Thou? ", 2000, none
Ethan Coen, The Big Lebowski, 1998, |"uncredited (with his brother, Joel) "

Approach: change the quoted commas into \r, replace the remaining commas, and change \r back.
The first attempt works with the given input, but is still wrong:
# Wrong
sed -E 's/("[^,]*),([\"]*)/\1\r\2/g; s/,/|/g;s/\r/,/g' file
It fails on lines with 2 commas in one field.
The first replacement should be repeated until all quoted commas are replaced:
sed -E ':a;s/("[^,"]*),([^"]*)"/\1\r\2"/g; ta; s/,/|/g;s/\r/,/g' file
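A different technique that avoids the \r placeholder is to walk each line character by character and track whether the current position is inside double quotes. The following is only a minimal sketch in POSIX awk: the function name swap_delim is hypothetical, and it assumes quotes are balanced on every line and are never escaped with a backslash.

```shell
# Hypothetical helper (not from the answers above): reads CSV lines on stdin
# and turns every comma outside double quotes into "|".
swap_delim() {
    awk '{
        out = ""; inq = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\"") inq = !inq            # toggle at each double quote
            else if (c == "," && !inq) c = "|"   # only swap unquoted commas
            out = out c
        }
        print out
    }'
}

printf '%s\n' 'Ethan Coen, "O Brother, Where Art Thou?", 2000, none' | swap_delim
```

Because the quote state is reset on every line, a stray unbalanced quote only affects the line it appears on.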

This might work for you (GNU sed):
sed -E 's#"[^"]*"#$(echo &|sed "y/,/\\n/;s/.*/\\\"\&\\\"/")#g;s/.*/echo "&"/e;y/,\n/|,/' file
The substitution translates ,'s between double quotes into newlines, then translates ,'s to |'s and \n's to ,'s.


Find the first name that starts with any letter other than S using regex

I am new to regex and I am trying to find the lines in a text file where the last name starts with S, followed by a comma and a space, and the first name doesn't start with S.
I am using the terminal on a MacBook.
This is my regex
^[S\w][,]?[' ']?[A-RT-Z]?
My full command
cat People.txt | grep -E ^[S\w][,]?[' ']?[A-RT-Z]?
The first name is the second word and the last name is the first word on each line.
The results I get:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
What I am expecting to get
Schmidt, Paul
Smith, Peter
The first rule of writing regular expressions in a shell script (or at the terminal) is "enclose the regular expression in single quotes" so that the shell doesn't try to interpret the metacharacters in the regex. You might sometimes use double quotes instead of single quotes if you need to match single quotes but not double quotes or if you need to interpolate a variable, but aim to use single quotes. Also, avoid UUoC — Useless Use of cat.
Your question currently shows two regular expressions:
^[S\w][,]?[' ']?[A-RT-Z]?
cat People.txt | grep -E ^[S\w][,]?[' ']?[P\w+]?
If written as suggested, these would become:
grep -E -e '^[Sw],? ?[A-RT-Z]?' People.txt
grep -E -e '^[Sw],? ?[Pw+]?' People.txt
The shell removes the backslashes in your rendition. The + in the character class matches a plus sign. You don't need square brackets around the comma (though they do no major harm). I use the -e option for explicitness, and so I can add extra arguments after the regex (-w or -l or -n or …) when editing commands via history. (I also dislike having options recognized after non-option arguments; I often run with $POSIXLY_CORRECT set in my environment. That's a personal quirk.)
The first of the two commands looks for a line starting S or w, followed by an optional comma, an optional blank, and an optional upper-case letter other than S. The second is similar except that it looks for an optional P or w. None of this bears much relationship to the question.
You need an expression more like one of these:
grep -E -e '^[S][[:alpha:]]*, [^S]' People.txt
grep -E -e '^[S][a-zA-Z]*, [^S]' People.txt
These allow single-character names — just S — but you can use + instead of * to require one or more letters.
There are lots of refinements possible, depending on how much you want to work, but this does the primary job of finding 'first word on the line starts with S, and is followed by a comma, a blank, and the second word does not start with S'.
Given a file People.txt containing:
Randall, Steven
Rogers, Timothy
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Titus, Persephone
Williams, Shirley
Someone
S
Your regular expressions produce the output:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S
My commands produce:
Schmidt, Paul
Smith, Peter
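The variant with + instead of * mentioned above could be wrapped up roughly as follows. This is a sketch only: the function name s_last_only is made up for illustration, and the sample lines are a subset of the test data plus one extra "S, Paul" line to show the effect of the +.

```shell
# Hypothetical helper: keep "Last, First" lines whose last name starts with S
# (at least two letters, because of the +) and whose first name does not.
s_last_only() {
    grep -E -e '^S[[:alpha:]]+, [^S]' "$@"
}

printf '%s\n' 'Schmidt, Paul' 'Sells, Simon' 'S, Paul' 'Someone' | s_last_only
```

With these four sample lines only "Schmidt, Paul" is printed; "S, Paul" is rejected because the + requires at least one letter after the S.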
Something like this seems to work fine:
^S.*, [^S].*$
^S.* - must start with S and start capturing everything
, [^S] - leading up to a comma, space, not S
.*$ - capture the rest of the string
https://regex101.com/r/76bfji/1

sed and its regex for optional symbols

I'm writing a script for splitting big FLAC audio files into small pieces according to a cue list. I use cueprint for determining tag information, but in some cases it does not provide what I need, and I have to use sed to dig the info directly out of the cue file. Now I'm fighting with the GENRE field. The trouble is that it comes in different forms:
REM GENRE "Gothic"
REM GENRE Gothic
Both seem to be within the standard, but they are difficult to parse.
For the second case something like
sed -nr -e "s/^(REM GENRE )(.*)\r/\2/p" *.cue
works perfectly and returns Gothic as expected. But for the first case "Gothic" (quotes included) is returned, which isn't what I want for further processing.
Well, you'll say: make the quotes optional in the first and third parts of the regex, like this
sed -nr -e "s/^(REM GENRE \"?)(.*)\"?\r/\2/p" *.cue
But this does not work as expected; the result is
Gothic"
with a trailing double quote.
Any ideas how to parse both quoted and unquoted strings with sed?
sed matches greedily. When you match (.*)"?, .* matches Gothic", and "? matches an empty string. You'll have to exclude double quotes from the .* string, e.g.
sed -nr 's/^REM GENRE "?([^"]*)"?\r?/\1/p' *.cue
Note that this will cause trouble with quoted strings that contain quotes, as in "Goth\"ic". To avoid this problem, slightly bigger guns are required. I'd suggest
sed -nr '/^REM GENRE "?(([^"]|\\")*)"?\r?/ { s//\1/; s/\\"/"/g; p; }'
That is
/^REM GENRE "?(([^"]|\\")*)"?\r?/ { # if a line contains the pattern
s//\1/ # isolate the capturing group
s/\\"/"/g # unescape quotes
p # then print.
}
Note the ([^"]|\\")* in the regex that matches non-quote characters and escaped quotes.
Change the (.*) in the middle to ([^\"]*) to exclude quotation marks.
You could use this,
sed -nr -e 's/^(REM GENRE )"?([^"\r]*)/\2/p' *.cue
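The answers above rely on \r inside the sed pattern, which not every sed understands. As a hedged alternative, the carriage returns can be stripped with tr before sed sees the line. The sketch below assumes the genre itself contains no double quotes; the function name cue_genre is hypothetical.

```shell
# Hypothetical helper: print the GENRE value from cue-sheet text on stdin,
# whether or not the value is wrapped in double quotes.
cue_genre() {
    # Drop CRs first so the sed pattern does not have to mention them.
    tr -d '\r' | sed -n -E 's/^REM GENRE "?([^"]*)"?$/\1/p'
}

printf 'REM GENRE "Gothic"\r\nTITLE "x"\r\n' | cue_genre
```

Anchoring the pattern with $ keeps it from silently accepting trailing garbage after the closing quote.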

Replace last occurrence of a character in a field with awk

I'm trying to replace the last occurrence of a character in a field with awk. Given is a file like this one:
John,Doe,Abc fgh 123,Abc
John,Doe,Ijk-nop 45D,Def
John,Doe,Qr s Uvw 6,Ghi
I want to replace the last space " " with a comma ",", basically splitting the field into two. The result is supposed to look like this:
John,Doe,Abc fgh,123,Abc
John,Doe,Ijk-nop,45D,Def
John,Doe,Qr s Uvw,6,Ghi
I've tried to create a variable with the number of occurrences of spaces in the field with
{var1=gsub(/ /,"",$3)}
and then integrate it in
{var2=gensub(/ /,",",var1,$4); print var2}
but the how-argument in gensub does not allow any characters besides numbers and G/g.
I've found a similar thread here but wasn't able to adapt the solution to my problem.
I'm fairly new to this so any help would be appreciated!
With GNU awk for gensub():
$ awk 'BEGIN{FS=OFS=","} {$3=gensub(/(.*) /,"\\1,",1,$3)}1' file
John,Doe,Abc fgh,123,Abc
John,Doe,Ijk-nop,45D,Def
John,Doe,Qr s Uvw,6,Ghi
Get the book Effective Awk Programming by Arnold Robbins.
Very well-written question btw!
Here is a short awk
awk '{$NF=RS$NF;sub(" "RS,",")}1' file
John,Doe,Abc fgh,123,Abc
John,Doe,Ijk-nop,45D,Def
John,Doe,Qr s Uvw,6,Ghi
Updated due to Ed's comment.
Or you can use the rev tool.
rev file | sed 's/ /,/' | rev
John,Doe,Abc fgh,123,Abc
John,Doe,Ijk-nop,45D,Def
John,Doe,Qr s Uvw,6,Ghi
Reverse the line, then replace the first space with a comma, then reverse again.
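Since gensub() is specific to GNU awk, here is a sketch of the same field-based idea using only match() and substr(), which other awk implementations should also provide. The function name split_last_space is made up for illustration, and the field number 3 is taken from the question.

```shell
# Hypothetical helper: replace the last space of field 3 with a comma.
split_last_space() {
    awk 'BEGIN { FS = OFS = "," }
    {
        # The greedy .* makes match() end at the last space in field 3,
        # so RLENGTH is the position of that space.
        if (match($3, /.* /))
            $3 = substr($3, 1, RLENGTH - 1) "," substr($3, RLENGTH + 1)
        print
    }'
}

printf '%s\n' 'John,Doe,Abc fgh 123,Abc' | split_last_space
```

Unlike the rev pipeline, this only touches the third field, so spaces in later fields are left alone.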

How to split a string or file that may be delimited by a combination of comments and spaces, tabs, newlines, commas, or other characters

If file: list.txt contains really ugly data like so:
aaaa
#bbbb
cccc, dddd; eeee
ffff;
#gggg hhhh
iiii
jjjj,kkkk ;llll;mmmm
nnnn
How do we parse/split that file, excluding the commented lines, delimiting it by all commas, semicolons, and all white-space (including tabs, spaces, and newline and carriage-return characters) with a bash script?
Using shell commands:
grep -v '^[[:space:]]*#' file | tr ';,' '\n\n' | awk '$1=$1'
It can be done with the following code:
#!/bin/bash
### read file:
file="list.txt"
array=()
### IFS= and -r keep each line intact; the || clause handles a missing final newline
while IFS= read -r line || [[ -n "$line" ]]; do
    ### skip lines that begin with a "#" or "<whitespace>#"
    match_pattern='^[[:space:]]*#'
    if [[ "$line" =~ $match_pattern ]]; then
        continue
    fi
    ### replace semicolons, commas and carriage returns with a space everywhere...
    temp_line=${line//[;,$'\r']/ }
    ### split the line at whitespace (IFS is still the default here)
    read -r -a split_line_arr <<< "$temp_line"
    ### push each word in the split_line_arr onto the final array
    array+=("${split_line_arr[@]}")
done < "$file"
echo "Array items:"
for item in "${array[@]}"; do
    printf " %s\n" "$item"
done
This was not posed as a question, but rather as a better solution to what others have touched upon when answering other related questions. The bit that is unique here is that those other questions/solutions did not really address how to split a string when it is delimited with a combination of spaces and characters and comments; this is one solution that addresses all three simultaneously...
Related questions:
How to split one string into multiple strings separated by at least one space in bash shell?
How do I split a string on a delimiter in Bash?
Additional notes:
Why do this with bash when other scripting languages are better suited for splitting? A bash script is more likely to have all the libraries it needs when running from a basic upstart or cron (sh) shell, compared with a perl program for example. An argument list is often needed in these situations and we should expect the worst from people who maintain those lists...
Hopefully this post will save bash newbies a lot of time in the future (including me)... Good luck!
sed 's/[# \t,]/REPLACEMENT/g' input.txt
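The above command replaces comment characters ('#'), spaces (' '), tabs ('\t'), and commas (',') with an arbitrary string ('REPLACEMENT').
To replace newlines as well, note that tr only maps single characters, so a multi-character replacement needs a different tool, for example:
sed 's/[# \t,]/REPLACEMENT/g' input.txt | awk '{printf "%sREPLACEMENT", $0}'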
If you have Ruby on your system:
File.open("file").each_line do |line|
  next if line[/^\s*#/]
  puts line.split(/\s+|[;,]/).reject{|c|c.empty?}
end
output
# ruby test.rb
aaaa
cccc
dddd
eeee
ffff
iiii
jjjj
kkkk
llll
mmmm
nnnn
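For comparison, the same skip-comments-then-split logic can be sketched in a single awk call, which avoids juggling IFS in bash. This is an illustration under assumptions, not a drop-in replacement: the function name split_words is hypothetical, and the delimiter set (semicolon, comma, space, tab, carriage return) is taken from the question.

```shell
# Hypothetical helper: print one word per line from stdin, skipping
# comment lines and splitting on ; , space, tab and carriage return.
split_words() {
    awk '!/^[ \t]*#/ {
        n = split($0, w, /[;, \t\r]+/)
        for (i = 1; i <= n; i++)
            if (w[i] != "") print w[i]    # drop empty pieces at line edges
    }'
}

printf '%s\n' 'aaaa' '#bbbb' 'cccc, dddd; eeee' 'ffff;' | split_words
```

To collect the words into a bash array, the output could be read with mapfile -t array < <(split_words < list.txt) on bash 4 or later.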

replacing doublequotes in csv

I've got nearly the following problem and didn't find the solution. This could be my CSV file structure:
1223;"B630521 ("L" fixed bracket)";"2" width";"length: 5"";2;alternate A
1224;"B630522 ("L" fixed bracket)";"3" width";"length: 6"";2;alternate B
As you can see there are some " written for inch and "L" in the enclosing ".
Now I'm looking for a UNIX shell script to replace the " (inch) and "L" double quotes with 2 single quotes, like the following example:
sed "s/$OLD/$NEW/g" $QFILE > $TFILE && mv $TFILE $QFILE
Can anyone help me?
Update (using perl it is easy, since you get full lookaround support)
perl -pe 's/(?<!^)(?<!;)"(?!(;|$))/'"'"'/g' file
Output
1223;"B630521 ('L' fixed bracket)";"2' width";"length: 5'";2;alternate A
1224;"B630522 ('L' fixed bracket)";"3' width";"length: 6'";2;alternate B
Using sed, grep only
Just by using grep, sed (and not perl, php, python etc) a not so elegant solution can be:
grep -o '[^;]*' file | sed 's/"/`/; s/"$/`/; s/"/'"'"'/g; s/`/"/g'
Output - for your input file it gives:
1223
"B630521 ('L' fixed bracket)"
"2' width"
"length: 5'"
2
alternate A
1224
"B630522 ('L' fixed bracket)"
"3' width"
"length: 6'"
2
alternate B
grep -o is basically splitting the input by ;
sed first replaces " at start of line by `
then it replaces " at end of line by another `
it then replaces all remaining double quotes " by a single quote '
finally it puts back all " at the start and end
Maybe this is what you want:
sed "s/\([0-9]\)\"\([^;]\)/\1''\2/g"
I.e.: Find double quotes (") following a number ([0-9]) but not followed by a semicolon ([^;]) and replace it with two single quotes.
Edit:
I can extend my command (it's becoming quite long now):
sed "s/\([0-9]\)\"\([^;]\)/\1''\2/g;s/\([^;]\)\"\([^;]\)/\1\'\2/g;s/\([^;]\)\"\([^;]\)/\1\'\2/g"
As you are using SunOS I guess you cannot use extended regular expressions (sed -r)? Therefore I did it that way: The first s command replaces all inch " with '', the second and the third s are the same. They substitute all " that are not a direct neighbor of a ; with a single '. I have to do it twice to be able to substitute the second " of e.g. "L" because there's only one character between both " and this character is already matched by \([^;]\). This way you would also substitute "" with ''. If you have """ or """" etc. you have to put one more (but only one more) s.
For the "L" try this:
sed "s/\"L\"/'L'/g"
For inches you can try:
sed "s/\([0-9]\)\"\"/\1''\"/g"
I am not sure it is the best option, but I have tried and it works. I hope this is helpful.
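Another way to look at the problem is field by field: split on ";", and inside every field that is wrapped in double quotes, turn the inner double quotes into single quotes. The sketch below uses plain awk and follows the output of the perl answer (one single quote per inner double quote). It assumes that no field contains a ";" and the function name fix_inner_quotes is hypothetical.

```shell
# Hypothetical helper: within each ;-separated field that starts and ends
# with a double quote, replace the inner double quotes with single quotes.
fix_inner_quotes() {
    awk 'BEGIN { FS = OFS = ";" }
    {
        for (i = 1; i <= NF; i++) {
            n = length($i)
            if (n >= 2 && substr($i, 1, 1) == "\"" && substr($i, n, 1) == "\"") {
                inner = substr($i, 2, n - 2)
                gsub(/"/, "\047", inner)      # \047 is the single quote
                $i = "\"" inner "\""
            }
        }
        print
    }'
}

printf '%s\n' '1224;"B630522 ("L" fixed bracket)";"3" width";"length: 6"";2;alternate B' | fix_inner_quotes
```

Unquoted fields such as 1224 or alternate B pass through untouched, because the substitution only runs on fields wrapped in double quotes.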