Replace all commas between two quotes in a bash script - regex

I need all "," between two " to be replaced with ";" within a bash script. I'm close, but hours on the internet and Stack Overflow led me to this:
echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' | sed -r ':a;s/(".*?),(.*?")/\1;\2/;ta'
With the result:
,,Lung,,"Lobular; each.|lungs; right.";false;;;;"organ; left.",,,,,
Correct would be:
,,Lung,,"Lobular; each.|lungs; right.",false,,,,"organ; left.",,,,,

Not sure how you want to deal with lines that have an odd number of double quotes (e.g., the double-quoted string spans multiple lines), but perhaps:
awk '!(NR%2){gsub(",",";")} 1' RS=\" ORS=\"
This simply treats " as the record separator and does the replacement only on even-numbered records, i.e. the text between quotes. Seems to work as desired. (Or, rather, it works as you seem to desire!)
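For example, with the line from the question (note how a stray closing quote shows up at the end):
$ echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' | awk '!(NR%2){gsub(",",";")} 1' RS=\" ORS=\"
,,Lung,,"Lobular; each.|lungs; right.",false,,,,"organ; left.",,,,,
"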
As oguz points out in a comment, this prints an extra " at the end. That can be fixed with:
awk '!(NR%2){gsub(",",";")} {printf RFS $0} {RFS="\""}' RS=\"
which is a bit uglier but more correct (or, rather, less incorrect!). If your input stream ends with a ", that quote will be truncated. If, however, your input is terminated by a newline rather than a ", this will do what you want.
OTOH, you might just want to do:
perl -wpE 'BEGIN{$/=\1}; y/,/;/ if $in; $in = ! $in if $_ eq "\""'
Which reads one character and uses a simple state machine. ($_ is the current character, so $in = ! $in changes state when a double quote is seen and the transliteration only happens when $in is non-zero.)
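Running it on the question's sample gives exactly the desired output:
$ echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' | perl -wpE 'BEGIN{$/=\1}; y/,/;/ if $in; $in = ! $in if $_ eq "\""'
,,Lung,,"Lobular; each.|lungs; right.",false,,,,"organ; left.",,,,,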

If you /really/ wanted to use sed, you could do a whole-line replace and include a clause like ^(([^"]*"[^"]*")*[^"]*) at the beginning of your existing expression in order to ensure that the matched quotes are "odd".
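Untested off the top of my head, but the looped version of that idea would look something like:
sed -r ':a; s/^(([^"]*"[^"]*")*[^"]*"[^",]*),/\1;/; ta'
Each pass replaces one comma that is preceded by an odd number of " characters (i.e. a comma inside quotes), and ta loops until no such comma is left.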

Related

I am in trouble with a regexp to remove some \n

I'm trying to define a regexp to remove some carriage returns in a file to be loaded into a DB.
Here is the fragment
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP
";"";Hamilton;"";;0;0;0;1;1;"";
This is the regexp I used in https://regex101.com/
(;"[[:alnum:] ]+)[\n]+([[:alnum:] ]*)"
Which should get two groups, one before and one after some newline.
Looking at regex101, it reports that the groups are correctly captured.
But the result is wrong, because it still introduces an invisible newline.
I also tried sed, but the result is exactly the same.
So, the question is: Where am I wrong?
sed is line based. It's possible to achieve what you want, but I'd rather use a more suitable tool. For example, Perl:
perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv
-p reads the input line by line, running the code for each line before outputting it;
The /e option interprets the replacement in a substitution as code, in this case replacing the final newline with the following line (<> reads the input);
tr/"// in numeric context returns the number of matches, i.e. the number of double quotes;
If the number is odd, we remove the newline (% is the modulo operator).
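With the fragment above saved as file.csv, the one-liner glues the broken record back together:
$ perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";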
The corresponding sed invocation would be
sed '/^\([^"]*"[^"]*"\)*[^"]*"[^"]*$/{N;s/\n//}' file.csv
on lines containing an unpaired double quote, read the next line into the pattern space (N) and remove the newline.
Update:
perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv
This should remove the newlines if they're not followed by a number and a semicolon. It keeps the previous line in the variable $p; if the current line doesn't start with a number followed by a semicolon, the newline is chomped from the previous line. Then, the previous line is printed and the current line is remembered. The last line needs to be printed separately, as there's no following line to trigger printing it.
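The same one-liner, spelled out with comments:
perl -ne '
    chomp $p if ! /^[0-9]+;/;  # current line does not start a new record: drop the newline kept in $p
    print $p;                  # emit the previous line (now possibly glued to the current one)
    $p = $_;                   # remember the current line
    END { print $p }           # the last line has no successor, so print it explicitly
' file.csv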
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for @{$_[1]}}))' file.csv
will remove trailing newlines from every field in the CSV (with sep ;) and spit out correct CSV (with sep ,). If you want ; in the output too, use
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for @{$_[1]}}),sep=>";")' file.csv
It's usually best to use an existing parser rather than writing your own.
I'd use the following Perl program:
perl -MText::CSV_XS=csv -e'
    csv
        in             => *ARGV,
        sep            => ";",
        blank_is_undef => 1,
        quote_empty    => 1,
        on_in          => sub { s/\n//g for @{ $_[1] } };
' old.csv >new.csv
Output:
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";
If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.

how to define a space in a regular expression (in awk)?

I want to print the text inside of " ". For example, I have the following strings:
gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
gfggf "kfdjfdgfhbg" "fhfghg" jhgj
jhfjhg "dfgdf" fgf
fgfdg "dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd" hgjghj
And I want to print only the following:
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"
I have tried awk with the following regular expression:
awk '{for(i = 1; i <= NF; i++) if($i ~ /^\"[A-Za-z.$]*([A-Za-z.$][[:space:]]*[A-Za-z.$])*\"$/) print $i}' sample.txt
but it prints everything before space and actually does not recognize the spaces I have defined in my regular expression. My current output is:
"jkfgh"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj
as you can see, only the ones without any space are printed correctly.
I have also tried [[:blank:]], \t and also ' ', but they did not work.
I appreciate if someone can tell me how to change this regular expression and include space.
The question's title is misleading and based on a fundamental misconception about awk.
The naïve answer is that a space can simply be represented as itself (a literal) in regular expressions in awk.
More generally, you can use [[:space:]] to match a space, a tab or a newline (GNU Awk also supports \s), and [[:blank:]] to match a space or a tab.
However, the crux of the problem is that Awk, by default, splits each input line into fields by whitespace, so that, by definition, no input field itself contains whitespace; any attempt to match a space in a field value will therefore invariably fail.
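You can see the default splitting at work on the first sample line:
$ echo 'gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj' | awk '{print NF; print $3}'
6
"jkfd
The quoted string "jkfd fdgj fd-" has already been broken into three fields before your regex ever sees it.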
The input at hand has fields that are a mix of unquoted and quoted strings, but POSIX Awk has no support for recognizing quoted strings as fields.
@fedorqui has made a valiant attempt to work around the problem by splitting input into fields by double quotes, but it's no substitute for proper recognition of quoted strings, because it doesn't preserve the true field boundaries.
If you have GNU Awk, you can approximate recognition of quoted strings using the special FPAT variable, which, rather than defining a separator to split lines by, allows defining a regex that describes fields (and ignores tokens not recognized as such):
re='[[:alpha:]][[:alpha:] ]*[[:alpha:]]' # aux. shell variable
gawk -v FPAT="\"$re\"|'$re'" '{
    for (i=1; i<=NF; ++i) printf "%s%s", $i, (i==NF ? "\n" : " ")
}' sample.txt
This will work with single- and double-quoted strings.
Explanation:
FPAT="\"$re\"|'$re'" defines fields to be either double- or single-quoted strings consisting only of letters and spaces, with at least one letter on either end (as in the OP's code).
Note that this automatically excludes the UNquoted tokens on each input line - they will not be reflected in $1, ... and NF.
Therefore, the loop for(i=1;i<=NF;++i) is already limited to enumerating only the matching fields.
Note that the restrictions placed on the contents of the quoted strings in this case luckily bypass a limitation inherent in this approach: its inability to deal with escaped nested quotes (of the same type).
If this limitation is acceptable, you can use the following idiom to tokenize input that is a mix of barewords (unquoted tokens) and quoted strings:
gawk -v "FPAT=[^[:blank:]]+|\"[^\"]*\"|'[^']*'" ...
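For instance, with a made-up input line:
$ echo 'foo "bar baz" qux' | gawk -v "FPAT=[^[:blank:]]+|\"[^\"]*\"|'[^']*'" '{ for (i=1; i<=NF; i++) print i": "$i }'
1: foo
2: "bar baz"
3: qux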
You are getting only the ones without any space because you loop through the fields, and fields are space-separated. Thus, you need an approach that handles the spaces differently. Assuming there are no nested quotes, you can use for example:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "\"%s\"", $i; print ""}' file
That is, use " as field separator and print the even fields.
This is equivalent to using FS more elegantly:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s", FS, $i, FS; print ""}' file
Note in the previous approaches the output has no space in between fields. If you need it, you can use:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
The trick (i>NF-2?"\n":" ") is a matter of printing the whole field together with a separator. If we are in the last field, we set it as new line; otherwise, as a space. More idiomatically, you can also say (i>NF-2?RS:OFS) using the default values of RS (record separator, new line) and OFS (output field separator, space).
Test
$ awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"

Why this single and double quote make so much difference in output

Why outputs of these two commands differ?
cat config.xml|perl -ne 'print $1,"\n" if /([0-9\.]+):161/'
cat config.xml|perl -ne "print $1,"\n" if /([0-9\.]+):161/"
The first works as expected, printing out the matched group, while the second prints the whole line.
I see two main things wrong with your command.
First off, double quotes allow shell interpolation, and $1 will be taken for a shell variable and replaced. Since it most likely doesn't exist, it will be replaced with an empty string. So instead of print $1, you get print, which is shorthand for print $_, and is probably why the entire line prints.
Second, you have unescaped double quotes inside your command, so you are in fact passing three strings to Perl:
print ,
\n
if /(....)/
As for why or how this works with your shell, I don't know, since I do not have access to your OS, nor know which one it is. In Windows, I get a Perl bareword warning for n (Unquoted string "n" may clash with future reserved word at -e line 1.) which means that the \n is interpreted as a string. Now, here's the tricky part. What we get is this:
print , \n if /.../
Which means that \n is no longer an argument to print; it is a statement that comes after print, and it is in void context, so it gets ignored. We can see this from this warning (which I had to fake in my shell):
Useless use of single ref constructor in void context at -e line 1.
(Note that you do not get these warnings as you do not use warnings -- the -w switch)
So what we are left with is
print if /.../
Which is exactly the code for the behaviour you described: It prints the whole line when a match is found.
What you can do to visualize the problem in your shell is add the -MO=Deparse switch to your one-liner, as shown here:
C:\perl>perl -MO=Deparse -ne"print ,"\n" if /a/"
LINE: while (defined($_ = <ARGV>)) {
print($_), \'n' if /a/;
}
-e syntax OK
Now we can clearly see that the print statement is separated from the newline, and that the newline is a reference to a string.
Solution:
However, your code has other problems, and if done right you can avoid all the shell difficulties. First, you have a UUOC (Useless Use of Cat). A file argument can be given to perl when using the -n switch on the command line. Secondly, you do not need to use variables for this, you can simply print the return value of your regex:
perl -nlwe 'print for /(...)/' config.xml
The -l switch will handle newlines for you, and in this case add a newline to the print. The for is necessary to avoid printing empty matches.
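For example, against a made-up line:
$ echo 'ip 10.0.0.1:161 up' | perl -nlwe 'print for /([0-9.]+):161/'
10.0.0.1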
Inside double quotes, some things are substituted ($variable, `command`, ...), while inside single quotes they remain as-is.
$ echo "$HOME"
/home/falsetru
$ echo '$HOME'
$HOME
$ echo "`echo 1`"
1
$ echo '`echo 1`'
`echo 1`
Nested quotes:
$ echo ""hello""
hello
$ echo '"hello"'
"hello"
$ echo "\"hello\""
"hello"
Escape the double quotes and $ to get the same result:
cat config.xml | perl -ne "print \$1,\"\n\" if /([0-9\.]+):161/"
Two things:
Nested quotes.
Variables expand differently.
The first command has one string that happens to contain some double quotes. The variable is not expanded.
The second command has two strings with an unquoted \n in between. The variable is expanded.
Let's say $1 contains "blah"
The first passes this string to perl:
print $1,"\n" if /([0-9\.]+):161/
the second, this:
print blah,\n if /([0-9\.]+):161/

Substitute words not in double quotes

$ cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want a unix sed command such that only the basic that is not in quotes is changed [change basic to ring].
Expected output:
$ cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (with {2}*). And then we match basic. Because we matched all of that stuff in the line before basic, we would replace that as well. That's why this part is wrapped in another pair of parentheses, thus capturing the line and writing it back in the replacement with \1 followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because regex matches cannot overlap. A solution would be a lookbehind, but that would be a variable-length lookbehind, which is only supported by the .NET regex engine. So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you wanna edit file in place use --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes
Assuming that there are no escaped quotes in the strings (such as "This is a trap \" hehe"), awk might be able to solve this problem:
awk -F\" 'BEGIN {OFS=FS}
{
for(i=1; i<=NF; i++){
if(i%2)
gsub(/basic/,"ring",$i)
}
print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
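Run as a one-liner against the sample file0, it produces the expected output:
$ awk -F\" 'BEGIN {OFS=FS} {for (i=1; i<=NF; i++) if (i%2) gsub(/basic/,"ring",$i); print}' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring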
If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate on why I can't just use multiple regexes: in the end I will have a regex with several of these groups (pictured below), so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (as they are all variable length).
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
    echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
    cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here is the solution using sed to fetch the right groups and split them up. You still have to use bash to read them. (And this way it only works if the groups themselves do not contain any spaces; otherwise we would have to use another divider character and adapt read by setting $IFS to that value.)
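For illustration, an untested sketch of that variant, using | as the divider (the loop body is elided):
sed -r -n -e " s%^.*$regex.*\$%\1|\2|\3%p " |
while IFS='|' read -a array
do
    ...
done
With plain space-separated groups, though, the actual script is: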
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
    cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forgot to mention that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Alternatively, you can add a < grep-result-test.txt after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text consists of the three parenthesized groups, separated by spaces.
The p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
17875 11 /tmp/FlashXXu8pvMg
17875 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array, and similarly for the others.
When the input ends, the loop ends, too.
This isn't possible using grep or another tool called from a shell prompt/script, because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via ${BASH_REMATCH[x]}, where x is the match group.
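For example:
$ var="text more text and yet more text"
$ re="yet more (text)"
$ [[ $var =~ $re ]] && echo "${BASH_REMATCH[1]}"
text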
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
    if [[ $REPLY =~ $regex ]]
    then
        echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
    fi
done
(If you upvote this, you should think about also upvoting Mark's answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As Mark said, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets its right operand (after parameter expansion) as an extended regular expression (just as we want), and matches the left operand against it. (We have again added a space at the front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop; a simple if read; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"