How can I format this data with bash script - regex

I want to format data from this
header1|header2|header3
"ID001"|"""TEST"""|"
TEST TEST TEST"|"TEST 4"
"ID002"|"TEST"|"TESTTESTTEST"|"TEST 5"
into
header1|header2|header3
"ID001"|"TEST"|"TEST TEST TEST"|"TEST 4"
"ID002"|"TEST"|"TESTTESTTEST"|"TEST 5"
So the logic is:
keep the header as it is
for every other line, if it does not start with ", append it to the end of the previous line
replace """ with "
I want to do this with a bash script.
I've written this script, but it still doesn't work:
#!/bin/bash
if [ $# -eq 0 ]
then
echo "No arguments supplied"
exit;
fi
FOLD=$1"*"
CHECK=$1"/bix.done"
if test -f $CHECK; then
date > /result.txt
echo "starting Convert.... "
echo "from folder : " $1
for file in $FOLD
do
if [[ $file != *History* ]]; then
if [[ $file == *.csv ]]; then
FILETEMP=$file".temp"
mv $file $FILETEMP
awk '/^"/ {if (f) print f; f=$0; next} {f=f FS $0} END {print f}' $FILETEMP > $file
#rm $FILETEMP
fi
fi
done
date > /home/result.txt
fi
#ls $1 -l

This might work for you (GNU sed):
sed '1b;:a;N;/\n"/!s/\n//;ta;s/"""/"/g;P;D' file
Always print the first header line. Append the next line to the current line and if that line does not begin with a " remove the newline and repeat until there is such a line. Now substitute a single " for """ globally, print the first line and repeat.
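For comparison, the three rules from the question can also be sketched in awk, which avoids depending on GNU sed. The function name join_csv_lines is made up for illustration, and the sketch assumes that every record starts with a double quote and that continuation lines never do.

```shell
# join_csv_lines FILE...: hypothetical helper, not from the original posts.
join_csv_lines() {
  awk '
    # Print the buffered record, collapsing each """ to a single " first.
    function emit() {
      if (have) { gsub(/"""/, "\"", buf); print buf }
    }
    NR == 1 { print; next }                       # header: passed through unchanged
    /^"/    { emit(); buf = $0; have = 1; next }  # a new record starts here
            { buf = buf $0; have = 1 }            # continuation: append with no separator
    END     { emit() }                            # flush the last record
  ' "$@"
}
```

Called as join_csv_lines file, this should print the header followed by one line per record; redirect the output to a new file rather than onto the input.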

Specific to joining the 2nd line and condensing the multiple double quotes to a single double quote, you could do:
sed '2{s/""*/"/g;h;N;s/\n//}' file
sed prints all lines by default; the commands in braces apply only to
the second line, where
s/""*/"/g substitutes a single double quote for each run of one or more double quotes,
h copies the pattern space to the hold space,
N appends the next line of input to the pattern space, and
s/\n// replaces the embedded '\n' with nothing, joining the two lines.
Example Use/Output
With your data in file you could do:
$ sed '2{s/""*/"/g;h;N;s/\n//}' file
header1|header2|header3
"ID001"|"TEST"|"TEST TEST TEST"|"TEST 4"
"ID002"|"TEST"|"TESTTESTTEST"|"TEST 5"
(note: if you need to condense multiple double quotes to single double quotes in all lines, you can turn the command around and use sed 's/""*/"/g;2{h;N;s/\n//}')

It's been resolved with the code below:
if test -f $CHECK; then
date > /home/startconvert.txt
echo "starting Convert.... "
echo "from folder : " $1
for file in $FOLD
do
if [[ $file != *History* ]]; then
if [[ $file == *.csv ]]; then
#FILETEMP=$file".temp"
#mv $file $FILETEMP
#awk '/^"/ {if (f) print f; f=$0; next} {f=f FS $0} END {print f}' $FILETEMP > $file
#rm $FILETEMP
perl -i -0777pe 's/\r\n([^"])/ $1/g' $file;
perl -i -0777pe 's/\n"""/"/' $file;
perl -i -0777pe 's/\r("\|)/ $1/g' $file;
sed -i -e 's/"""/"/g' $file;
perl -i -0777pe 's/\n([^"])/ $1/g' $file;
perl -i -0777pe 's/\n("\|)/ $1/g' $file;
sed -i -e 's/""-/-/g' $file;
perl -i -0777pe 's/\n([^"])/ $1/g' $file;
perl -i -0777pe 's/\r([^"])/ $1/g' $file;
perl -i -0777pe 's/\r\n([^"])/ $1/g' $file;
fi
fi
done
date > /home/endconvert.txt
fi
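The file loop above can be sketched in a more defensive form: quoted variables, a temporary file per CSV, and a single filter pass instead of repeated in-place edits. The function name filter_csv_files is hypothetical, the filter is assumed to read stdin and write stdout, and the History check here looks at the file name only (the original tests the whole path).

```shell
# filter_csv_files DIR COMMAND [ARGS...]: hypothetical helper for illustration.
# Runs COMMAND as a stdin-to-stdout filter over every *.csv in DIR,
# skipping files whose name contains "History".
filter_csv_files() {
  dir=$1
  shift
  for file in "$dir"/*.csv; do
    [ -f "$file" ] || continue                    # the glob matched nothing
    case ${file##*/} in *History*) continue ;; esac
    tmp=$(mktemp) || return 1
    if "$@" < "$file" > "$tmp"; then
      cat "$tmp" > "$file"                        # replace contents, keep permissions
    fi
    rm -f "$tmp"
  done
}
```

For example, filter_csv_files "$1" sed 's/"""/"/g' would apply one sed expression to each matching file; a file is only overwritten when the filter exits successfully.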

Not sure about the bash part, this expression though,
[\r\n]^([^"])
with a replacement of $1 might be somewhat close.
If you wish to explore, simplify, or modify the expression, it is explained in the top right panel of regex101.com, where you can also see how it matches against some sample inputs.

Related

Bash regex =~ doesn’t support multiline mode?

I'm using the =~ operator to match the output of a command and grab a group from it. The code is as follows:
Commandout=$(cmd)
Match='^hello world'
if [[ $Commandout =~ $Match ]]; then
echo something
fi
The command output has this form:
Something
Hello world
But the if statement is failing.
Does bash regex support multiline search, with every line starting at ^ and ending at $?
No, the =~ operator doesn't perform a multiline search. A newline must be matched literally:
string=$(cmd)
regexp='(^|'$'\n'')hello world'
if [[ $string =~ $regexp ]]; then
echo matches
fi
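The same idea can be wrapped in a small function, sketched here under the assumption that the prefix is itself a valid extended regular expression. The name any_line_starts_with is invented for this example, and the code needs bash (it uses [[ ]] and $'\n').

```shell
#!/usr/bin/env bash
# any_line_starts_with TEXT PREFIX_REGEX: hypothetical helper.
# Succeeds if PREFIX_REGEX matches at the start of TEXT or right after a newline.
any_line_starts_with() {
  local text=$1 prefix=$2
  local nl=$'\n'
  local re="(^|${nl})${prefix}"   # a line start is either string start or a newline
  [[ $text =~ $re ]]
}

output=$'Something\nhello world'
if any_line_starts_with "$output" 'hello world'; then
  echo matches
fi
```

Keeping the regex in a variable and leaving it unquoted on the right-hand side of =~ makes bash treat it as a pattern rather than a literal string.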
=~ would treat multiple lines as one line.
if [[ $(echo -e "abc\nd") =~ ^a.*d$ ]]; then
echo "find a string '$(echo -e "abc\nd")' that starts with a and ends with d"
fi
Output:
find a string 'abc
d' that starts with a and ends with d
P.S.
When processing multiple lines, it is common to use grep or read with either redirection or a pipeline.
For a grep and pipeline example:
# to find a line start with either a or e
echo -e "abc\nd\ne" | grep -E "^[ae]"
Output:
abc
e
For a read and redirect example:
while read -r line; do
if [[ $line =~ ^a ]]; then
echo "find a line '${line}' start with a"
fi
done <<< "$(echo -e "abc\nd\ne")"
Output:
find a line 'abc' start with a
P.S.
The -e option of echo enables interpretation of backslash escapes, so \n becomes a newline. The -E option of grep selects extended regular expressions for the match.

sed find with a regex and replace does not work [duplicate]

I'm trying to refine my code by getting rid of unnecessary white spaces, empty lines, and having parentheses balanced with a space in between them, so:
int a = 4;
if ((a==4) || (b==5))
a++ ;
should change to:
int a = 4;
if ( (a==4) || (b==5) )
a++ ;
It does work for the brackets and empty lines. However, it forgets to reduce the multiple spaces to one space:
int a = 4;
if ( (a==4) || (b==5) )
a++ ;
Here is my script:
#!/bin/bash
# Script to refine code
#
filename=read.txt
sed 's/((/( (/g' $filename > new.txt
mv new.txt $filename
sed 's/))/) )/g' $filename > new.txt
mv new.txt $filename
sed 's/ +/ /g' $filename > new.txt
mv new.txt $filename
sed '/^$/d' $filename > new.txt
mv new.txt $filename
Also, is there a way to make this script more concise, e.g. removing or reducing the number of commands?
If you are using GNU sed then you need to use sed -r which forces sed to use extended regular expressions, including the wanted behavior of +. See man sed:
-r, --regexp-extended
use extended regular expressions in the script.
The same holds if you are using OS X sed, but then you need to use sed -E:
-E Interpret regular expressions as extended (modern) regular expressions
rather than basic regular expressions (BRE's).
Without either option, you have to precede + with a \ (a GNU extension to basic regular expressions); otherwise sed tries to match the character + itself.
To make the script "smarter", you can accumulate all the expressions in one sed:
sed -e 's/((/( (/g' -e 's/))/) )/g' -e 's/ \+/ /g' -e '/^$/d' $filename > new.txt
Some implementations of sed even support the -i option that enables changing the file in place.
Sometimes, -r and -E won't work.
I'm using sed version 4.2.1 and they aren't working for me at all.
A quick hack is to use the * operator instead.
So let's say we want to replace all redundant space characters with a single space:
We'd like to do:
sed 's/ +/ /'
But we can use this instead:
sed 's/  */ /'
(note the double-space)
May not be the cleanest solution. But if you want to avoid -E and -r to remain compatible with both versions of sed, you can do a repeat character cc* - that's 1 c then 0 or more c's == 1 or more c's.
Or just use the BRE syntax, as suggested by @cdarke, to match a specific number of repetitions: c\{1,\}. The second number after the comma is omitted to mean 1 or more.
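Putting the interval form to work, here is a minimal sketch of the whole clean-up as one sed call that avoids both -r and -E. The function name refine_code is hypothetical; the expressions are the ones from the question, with the space rule written as a BRE interval.

```shell
# refine_code [FILE...]: hypothetical wrapper; reads stdin when no file is given.
refine_code() {
  sed -e 's/((/( (/g' \
      -e 's/))/) )/g' \
      -e 's/ \{1,\}/ /g' \
      -e '/^$/d' "$@"      # \{1,\} means one or more of the preceding space
}
```

Usage would look like refine_code read.txt > new.txt. Only the basic regular expression syntax is used, so the same command line should behave the same way in GNU and BSD sed.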
This might work for you:
sed -e '/^$/d' -e ':a' -e 's/\([()]\)\1/\1 \1/g' -e 'ta' -e 's/  */ /g' $filename >new.txt
on the bash front;
First I made a script test.sh
cat test.sh
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
echo "Text read from file: $line"
SRC=$(echo "$line" | awk '{print $1}')
DEST=$(echo "$line" | awk '{print $2}')
echo "moving $SRC to $DEST"
mv "$SRC" "$DEST" || { echo "move $SRC to $DEST failed"; exit 1; }
done < "$1"
then we make a data file and a test file aaa.txt
cat aaa.txt
<tag1>19</tag1>
<tag2>2</tag2>
<tag3>-12</tag3>
<tag4>37</tag4>
<tag5>-41</tag5>
then test and show results.
bash test.sh list.txt
Text read from file: aaa.txt bbb.txt
moving aaa.txt to bbb.txt

Why the following variations of s/g from command line are wrong?

I have a small file as follows:
$ cat text.txt
vacation
cat
This is a test
This command substitutes all occurrences of cat to CAT correctly as I wanted:
$ perl -p -i -e '
s/cat/CAT/g;
' text.txt
But why the following two mess the file up?
The following deletes the contents of the file
$ perl -n -i -e '
$var = $_;
$var =~ s/cat/CAT/g;
' text.txt
And this one just does not do the substitution correctly
$ perl -p -i -e '
$var = $_;
$var =~ s/cat/CAT/g;
' text.txt
$ cat text.txt
cation
cat
This is a test
Why? What am I messing up here?
-p prints each line automatically (the contents of $_, which holds the current line), and with -i that output repopulates the file. -n loops over the file like -p does, but it doesn't print automatically. You have to do that yourself; otherwise the file is overwritten with nothing. The -n flag lets you skip lines that you don't want written back to the file (amongst other things), whereas with -p every line is printed, even after next, so you'd have to empty $_ to drop a line.
perl -n -i -e '
$var = $_;
$var =~ s/cat/CAT/g;
print $var;
' text.txt
See perlrun.
In your last example, -p will only automatically print $_ (the original, unmodified line). It doesn't auto-print $var at all, so in that case, you'd have to print $var like in the example above, but then you'd get both the original line, and the modified one printed to the file.
You're better off not assigning $_ to anything if all you're doing is overwriting a file. Just use it as is. eg. (same as your first example):
perl -p -i -e '
s/cat/CAT/g;
' text.txt

bash regex multiple match in one line

I'm trying to process my text.
For example, I have:
asdf asdf get.this random random get.that
get.it this.no also.this.no
My desired output is:
get.this get.that
get.it
So the regexp should catch only this pattern (get.\w), but it has to do so repeatedly because of multiple occurrences in one line, so the easiest way with sed
sed 's/.*(REGEX).*/\1/'
does not work (it shows only the first occurrence).
Probably a good way is to use grep -o, but I have an old version of grep and the -o flag is not available.
This grep may give what you need:
grep -o "get[^ ]*" file
Try awk:
awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
You might need to tweak the regex between the slashes for your specific issue. Sample output:
$ awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
get.this
get.that
get.it
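If the matches should stay grouped per input line, as in the desired output, one option is to loop with awk's match() function, which also finds a match in the middle of a field. This is a sketch with a hypothetical function name; it spells out [A-Za-z0-9_] instead of \w so it does not rely on gawk, and it has no word-boundary check, so forget.me would yield get.me.

```shell
# extract_get_words [FILE...]: hypothetical helper; reads stdin when no file is given.
extract_get_words() {
  awk '{
    out = ""
    rest = $0
    # Repeatedly take the leftmost match, then continue after it.
    while (match(rest, /get\.[A-Za-z0-9_]+/)) {
      out = out (out == "" ? "" : " ") substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    if (out != "") print out      # skip lines without any match
  }' "$@"
}
```

Run as extract_get_words file.txt, it prints one output line for each input line that contains at least one match.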
With awk:
awk -v patt="^get" '{
for (i=1; i<=NF; i++)
if ($i ~ patt)
printf "%s%s", $i, OFS;
print ""
}' <<< "$text"
bash
while read -a words; do
for word in "${words[@]}"; do
if [[ $word == get* ]]; then
echo -n "$word "
fi
done
echo
done <<< "$text"
perl
perl -lane 'print join " ", grep {$_ =~ /^get/} @F' <<< "$text"
This might work for you (GNU sed):
sed -r '/\bget\.\S+/{s//\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1 /g;s/ $//}' file
or if you want one per line:
sed -r '/\n/!s/\bget\.\S+/\n&\n/g;/^get/P;D' file

In bash, how can I check a string for partials in an array?

If I have a string:
s='path/to/my/foo.txt'
and an array
declare -a include_files=('foo.txt' 'bar.txt');
how can I check the string for matches in my array efficiently?
You could loop through the array and use a bash substring check
for file in "${include_files[@]}"
do
if [[ $s = *${file} ]]; then
printf "%s\n" "$file"
fi
done
Alternatively, if you want to avoid the loop and you only care whether a file name matches or not, you could use the @ form of bash extended globbing. The following example assumes that the array's file names do not contain |.
shopt -s extglob
declare -a include_files=('foo.txt' 'bar.txt');
s='path/to/my/foo.txt'
printf -v pat "%s|" "${include_files[@]}"
pat="${pat%|}"
printf "%s\n" "${pat}"
#prints foo.txt|bar.txt
if [[ ${s##*/} = @(${pat}) ]]; then echo yes; fi
For an exact match to the file name:
#!/bin/bash
s="path/to/my/foo.txt";
ARR=('foo.txt' 'bar.txt');
for str in "${ARR[@]}";
do
# if [ "$(echo "$s" | awk -F"/" '{print $NF}')" = "$str" ]; then
if [ "$(basename "$s")" = "$str" ]; then # A better option than awk for sure...
echo "match";
else
echo "no match";
fi;
done
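When only a single yes/no answer is needed, the exact-match loop can be folded into a function that returns as soon as one entry matches. This is a sketch; basename_in_list is a hypothetical name, and it compares against the part of the path after the last slash.

```shell
# basename_in_list PATH NAME...: hypothetical helper.
# Succeeds if the last path component of PATH equals one of the NAMEs.
basename_in_list() {
  name=${1##*/}        # strip everything up to the last slash
  shift
  for candidate in "$@"; do
    [ "$name" = "$candidate" ] && return 0
  done
  return 1
}
```

With the array from the question this would be called as basename_in_list "$s" "${include_files[@]}" inside an if, which avoids printing "no match" once per non-matching element.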