Find and replace in same string in bash - regex

I have a string with some words in it, for example a=1 b=2 c=3 a=50. I want to parse it and create another string, a=50 b=2 c=3, which is the same as the original except that when the phrase before the = is encountered a second time, the earlier entry is overwritten with the latest one, so that in the end only unique phrases remain on the left of the =. Here is what I have so far:
a="a=1 b=2 c=3 a=50"
o=()
for i in $a
do
reg=${i%=*}
if [[ ${o[*]} == *"$reg"* ]]
then
o=$(echo ${o[*]} | sed -e "s/\$reg=\S/\$i")
else
o+=( $i )
fi
done
What am I doing wrong here?

I'd take an entirely different approach, not based on regular expressions or string rewriting.
declare -A values=( ) # Initialize an associative array ("hash", "map")
while IFS= read -r -d' ' word; do # iterate over input words, separated by spaces
if [[ $word = *=* ]]; then # ignore any word that doesn't have an "=" in it
values[${word%%=*}]=${word#*=} # add everything before the "=" as a key...
fi # ...with everything after the "=" as a value
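done <<<"$a " # assumes the input is in $a, as in the question; the trailing space terminates the last word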
for key in "${!values[@]}"; do # Then iterate over keys we found
value="${values[$key]}" # ...extract the values for each...
printf '%s=%s ' "$key" "$value" # ...and print the pairs.
done
echo # When done iterating, print a newline.
Because the words are being processed first-to-last through the string, updates take effect before the print loop is reached.
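If the original key order matters, the same idea can be extended with a second, indexed array that records each key the first time it is seen. This is only a sketch: dedupe_pairs is a hypothetical helper name, it needs bash 4 or later for associative arrays, and it assumes the pairs are separated by whitespace on a single line.

```bash
#!/usr/bin/env bash
# Sketch: keep the last value for each key, print keys in first-seen order.
# dedupe_pairs is a made-up name for illustration, not part of the answer above.
dedupe_pairs() {
  local -A values=()          # key -> latest value
  local -a order=() words=()  # keys in first-seen order; input split into words
  local word key out=
  read -ra words <<<"$1"      # split on whitespace without glob expansion
  for word in "${words[@]}"; do
    [[ $word == ?*=* ]] || continue                     # skip words without "key="
    key=${word%%=*}
    [[ -n ${values[$key]+set} ]] || order+=("$key")     # remember new keys
    values[$key]=${word#*=}                             # later values overwrite earlier ones
  done
  for key in "${order[@]}"; do
    out+="${out:+ }$key=${values[$key]}"
  done
  printf '%s\n' "$out"
}

dedupe_pairs "a=1 b=2 c=3 a=50"
```

Words without an = are skipped, just as in the loop above.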

Using awk
$ awk -F= -v RS=" |\n" '{a[$1]=$2} END{for (k in a) printf "%s=%s ",k,a[k]}' <<<"a=1 b=2 c=3 a=50"
a=50 b=2 c=3
How it works:
-F=
Set the field separator to be an equal sign.
-v RS=" |\n"
Set the record separator to be either a space or a newline.
a[$1]=$2
Update associative array a with the latest value.
END{for (k in a) printf "%s=%s ",k,a[k]}
In no particular order, print out the final values.
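Since for (k in a) gives no ordering guarantee, here is a hedged variant that remembers the order in which keys first appear. It is a sketch, not the answer's own method: dedupe_pairs_awk is a hypothetical wrapper name, and it uses tr to put one pair per line so that it does not depend on a regex RS (which not every awk supports).

```bash
#!/usr/bin/env bash
# Sketch: last value wins, keys printed in first-seen order.
# dedupe_pairs_awk is a made-up name for illustration.
dedupe_pairs_awk() {
  printf '%s\n' "$1" | tr -s ' ' '\n' | awk '
    { eq = index($0, "=") }               # position of the first "=" (0 if none)
    eq > 1 {
      key = substr($0, 1, eq - 1)
      if (!(key in val)) order[++n] = key # record a key the first time it is seen
      val[key] = substr($0, eq + 1)       # keep everything after the first "="
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s=%s%s", order[i], val[order[i]], (i < n ? " " : "\n")
    }'
}

dedupe_pairs_awk "a=1 b=2 c=3 a=50"
```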
Using bash
Like Charles Duffy's approach, this uses read -d" " to parse the string. This approach, however, uses IFS="=" to separate names and values.
Two loops are required. The first gathers the values. The second reassembles the updated values in the original order:
a="a=1 b=2 c=3 a=50"
declare -A b
while IFS== read -d" " name value
do
b["$name"]="$value"
done <<<"$a "
declare -A seen
while IFS== read -d" " name value
do
[ "${seen[$name]}" ] || o="$o $name=${b["$name"]}"
seen[$name]=1
done <<<"$a "
echo "$o"

Easily done with perl:
echo "a=1 b=2 c=3 a=50" \
| sed "s/ /\n/g" \
| perl -e '
my %hash = ();
while(<>){
$line = $_;
if($line =~ m/(\S+)=(\S+)/) {
$hash{$1} = $2;
}
}
for $key (sort keys %hash) {
print "$key=$hash{$key}\n";
}'
...or, all on one line:
echo "a=1 b=2 c=3 a=50" | sed "s/ /\n/g" | perl -e 'my %hash = (); while(<>){ $line = $_; if($line =~ m/(\S+)=(\S+)/) { $hash{$1} = $2; } } for $key (sort keys %hash) { print "$key=$hash{$key}\n"; }'

Related

Removing multiple delimiters between outside delimiters on each line

Using awk or sed in a bash script, I need to remove comma separated delimiters that are located between an inner and outer delimiter. The problem is that values end up in the wrong columns, where only 3 columns are desired.
For example, I want to turn this:
2020/11/04,Test Account,569.00
2020/11/05,Test,Account,250.00
2020/11/05,More,Test,Accounts,225.00
Into this:
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
I've tried a few things while testing regexes, but I cannot find a pattern that selects only the commas to be removed.
awk -F, '{ printf "%s,",$1;for (i=2;i<=NF-2;i++) { printf "%s ",$i };printf "%s,%s\n",$(NF-1),$NF }' file
Using awk, print the first comma-delimited field followed by a comma, then loop over the remaining fields up to the third from last, printing each one followed by a space. Finally, print the second-to-last field, a comma, and the last field.
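For comparison, the same "keep the first and last field, join the middle with spaces" idea can be sketched in plain bash with parameter expansion. merge_middle_fields is a hypothetical name, and the sketch assumes that lines with fewer than two commas should pass through unchanged.

```bash
#!/usr/bin/env bash
# Sketch: turn the commas between the first and last field into spaces.
# merge_middle_fields is a made-up name for illustration.
merge_middle_fields() {
  local line first last middle
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line != *,*,* ]]; then   # fewer than two commas: nothing to merge
      printf '%s\n' "$line"
      continue
    fi
    first=${line%%,*}               # text before the first comma
    last=${line##*,}                # text after the last comma
    middle=${line#*,}               # drop the first field...
    middle=${middle%,*}             # ...and the last field
    printf '%s,%s,%s\n' "$first" "${middle//,/ }" "$last"
  done
}

printf '%s\n' '2020/11/05,More,Test,Accounts,225.00' | merge_middle_fields
```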
With GNU awk for the 3rd arg to match():
$ awk -v OFS=, '{
match($0,/([^,]*),(.*),([^,]*)/,a)
gsub(/,/," ",a[2])
print a[1], a[2], a[3]
}' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
or with any awk:
$ awk '
BEGIN { FS=OFS="," }
{
n = split($0,a)
gsub(/^[^,]*,|,[^,]*$/,"")
gsub(/,/," ")
print a[1], $0, a[n]
}
' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
Use this Perl one-liner:
perl -F',' -lane 'print join ",", $F[0], "@F[1 .. ($#F-1)]", $F[-1];' in.csv
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
$F[0] : first element of the array @F (= first comma-delimited value).
$F[-1] : last element of @F.
@F[1 .. ($#F-1)] : elements of @F between the second from the start and the second from the end, inclusive.
"@F[1 .. ($#F-1)]" : the above elements, joined on blanks into a string.
join ",", ... : join the LIST "..." on a comma, and return the resulting string.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perl -pe 's{,\K.*(?=,)}{$& =~ y/,/ /r}e' file
sed -e ':a' -e 's/\(,[^,]*\),\([^,]*,\)/\1 \2/; t a' file
awk '{$1=$1","; $NF=","$NF; gsub(/ *, */,","); print}' FS=, file
awk '{for (i=2; i<=NF; ++i) $i=(i>2 && i<NF ? " " : ",") $i} 1' FS=, OFS= file
awk doesn't support lookarounds, but its match() function can achieve a similar effect. The following was written and tested with the shown samples in GNU awk.
awk '
match($0,/,.*,/){
val=substr($0,RSTART+1,RLENGTH-2)
gsub(/,/," ",val)
print substr($0,1,RSTART) val substr($0,RSTART+RLENGTH-1)
}
' Input_file
Yet another perl
$ perl -pe 's/(?:^[^,]*,|,[^,]*$)(*SKIP)(*F)|,/ /g' ip.txt
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
(?:^[^,]*,|,[^,]*$) matches first/last field along with the comma character
(*SKIP)(*F) this would prevent modification of preceding regexp
|, provide , as alternate regexp to be matched for modification
With sed (assuming \n is supported by the implementation, otherwise, you'll have to find a character that cannot be present in the input)
sed -E 's/,/\n/; s/,([^,]*)$/\n\1/; y/,/ /; y/\n/,/'
s/,/\n/; s/,([^,]*)$/\n\1/ replace first and last comma with newline character
y/,/ / replace all comma with space
y/\n/,/ change newlines back to comma
A similar answer to Timur's, in awk
awk '
BEGIN { FS = OFS = "," }
function join(start, stop, sep, str, i) {
str = $start
for (i = start + 1; i <= stop; i++) {
str = str sep $i
}
return str
}
{ print $1, join(2, NF-1, " "), $NF }
' file.csv
It's a shame awk doesn't ship with a join function builtin

Parsing a .csv-like file in bash

I have a file formatted as follows:
string1,string2,string3,...
...
I have to analyze the second column, counting the occurrences of each string, and producing a file formatted as follows:
"number of occurrences of x",x
"number of occurrences of y",y
...
I managed to write the following script, that works fine:
#!/bin/bash
> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
if [[ "$line" =~ $regExp ]]
then
printf "${BASH_REMATCH[1]},${BASH_REMATCH[2]}\n" >> output
fi
done <<< "`gawk -F , '!/^$/ {print $2}' $1 | sort | uniq -c`"
My question is:
Is there a better and simpler way to do the job?
In particular I don't know how to fix that:
gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'
The problem is that string2 can contain whitespaces and, if so, the second call on gawk will truncate the string.
Nor do I know how to print all the fields "from 2 to NF" while maintaining the delimiter, which can occur several times in succession.
Thank you very much,
Goodbye
EDIT:
As asked, here is some sample data:
(It is an exercise, sorry for the inventive)
Input:
*,*,*
test, test ,test
prova, * , prova
test,test,test
prova, prova ,prova
leonardo,da vinci,leonardo
in,o u t ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o u t ,pr
test, test ,test
, tabs ,
, tabs ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
, tabs ,
Output:
3, *
4,*
4,da vinci
2,o u t
3,po
1, prova
3, spaces
3, tabs
1,test
2, test
A one-liner in awk:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv
It stores the count for each 2nd column string in the associative array x, and in the end loops through the array and prints the results.
To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2
The only condition, of course, is that the 2nd column of each line doesn't contain a ,
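The same counting idea can be sketched in plain bash with an associative array. Assumptions in this sketch: count_second_column is a hypothetical name, input comes from stdin, completely empty lines are skipped (as in the question's !/^$/ filter), and bash 4 or later is available. Keys are stored with a "_" prefix so that an empty field or a field like * cannot be mistaken for a special array subscript.

```bash
#!/usr/bin/env bash
# Sketch: count occurrences of the 2nd comma-separated field, print "count,field".
# count_second_column is a made-up name for illustration.
count_second_column() {
  local -A count=()
  local line second key
  while IFS= read -r line; do
    [[ -n $line ]] || continue                 # ignore empty lines
    IFS=, read -r _ second _ <<<"$line"        # whitespace inside fields is preserved
    key="_$second"                             # prefix guards against empty or special keys
    count[$key]=$(( ${count[$key]:-0} + 1 ))
  done
  for key in "${!count[@]}"; do
    printf '%s,%s\n' "${count[$key]}" "${key#_}"
  done | LC_ALL=C sort -t, -k2                 # order by the field text, as in the answer
}

printf 'a,x y,1\nb, x,2\n\nc,x y,3\n' | count_second_column
```

As with the awk version, this only works when the second column itself contains no comma.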
You can make your final awk:
gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'
or use sed for this sort of thing:
sed 's/ *\([0-9]*\) /\1,/'
Here is a Perl one-liner, similar to Filipe's awk solution:
perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv
The output is sorted alphabetically according to the second column.
The @F autosplit array starts at index $F[0] while awk fields start with $1

Tokenize and capture with sed

Suppose we have a string like
"dir1|file1|dir2|file2"
and would like to turn it into
"-f dir1/file1 -f dir2/file2"
Is there an elegant way to do this with sed or awk for a general case of n > 2?
My attempt was to try
echo "dir1|file1|dir2|file2" | sed 's/\(\([^|]\)|\)*/-f \2\/\4 -f \6\/\8/'
An awk solution:
awk -F'|' '{ for (i=1;i<=NF;i+=2) printf "-f %s/%s%s", $i, $(i+1), ((i==NF-1) ? "\n" : " ") }' \
<<<"dir1|file1|dir2|file2"
-F'|' splits the input into fields by |
for (i=1;i<=NF;i+=2) loops over the field indices in increments of 2
printf "-f %s/%s%s", $i, $(i+1), ((i==NF-1) ? "\n" : " ") prints pairs of consecutive fields joined with / and prefixed with -f<space>
((i==NF-1) ? "\n" : " ") terminates each field-pair either with a space, if more fields follow, or a \n to terminate the overall output.
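The pairing logic described above can also be sketched without awk, splitting the string into a bash array and walking it two elements at a time. pairs_to_flags is a hypothetical helper name; the sketch assumes the input is a single line and silently ignores an unpaired trailing field.

```bash
#!/usr/bin/env bash
# Sketch: "dir1|file1|dir2|file2" -> "-f dir1/file1 -f dir2/file2".
# pairs_to_flags is a made-up name for illustration.
pairs_to_flags() {
  local -a parts=() out=()
  local i
  IFS='|' read -ra parts <<<"$1"            # IFS is changed only for this read
  for (( i = 0; i + 1 < ${#parts[@]}; i += 2 )); do
    out+=("-f" "${parts[i]}/${parts[i+1]}") # one "-f dir/file" pair per iteration
  done
  printf '%s\n' "${out[*]}"                 # join with single spaces
}

pairs_to_flags 'dir1|file1|dir2|file2'
```

Collecting the pieces in an array first avoids the special-casing of the last separator that the printf version needs.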
In a comment, the OP suggests a shorter variation, which may be of interest if you don't need/want the output to be \n-terminated:
awk -F'|' '{ for (i=1;i<=NF;++i) printf "%s", (i%2 ? " -f " $i : "/" $i ) }' \
<<<"dir1|file1|dir2|file2"
This might work for you (GNU sed):
sed 's/\([^|]*\)|\([^|]*\)|\?/-f \1\/\2 /g;s/ $//' file
This will work for dir1|file1|dir2|file2|dirn|filen type strings
The regexp forms two back references (\1,\2 used in the replacement part of the substitution command s/pattern/replacement/), the first is all non-|'s, then a |, the second is all non-|'s then an optional | i.e. for the first application of the substitution (N.B. the g flag is implemented and so the substitutions may be multiple) dir1 becomes \1 and file1 becomes \2. All that remains is to prepend -f and replace the first | by / and the second | by a space. The last space is not needed at the end of the line and is removed in the second substitution command.
$ awk -v RS='|' 'NR%2{p=$0;next} {printf " -f %s/%s", p, $0}' <<< 'dir1|file1|dir2|file2'
-f dir1/file1 -f dir2/file2
A gnu-awk solution:
s="dir1|file1|dir2|file2"
awk 'BEGIN{ FPAT="[^|]+\\|[^|]+" } {
for (i=1; i<=NF; i++) {
sub(/\|/, "/", $i);
if (i>1)
printf " ";
printf "-f " $i
};
print ""
}' <<< "$s"
-f dir1/file1 -f dir2/file2
FPAT is used for grabbing dir1|file1 into a single field.

How can I check the balance of ASCII images using bash?

I have some large ASCII images that I want to check are symmetrical. Say I have the following file:
***^^^MMM
*^**^^MMM
**^^^^^MMMMM
The first line is what I want: the sections are all separated and have the same number of characters in each (it doesn't have to be 3 of each every time though). The next two are not what I want. I want to count the number of *'s in a row, and then make sure the same number of ^'s and M's follow them. I'm trying to get some symmetry on each line, so this would be good:
**^^MM
**********^^^^^^^^^^MMMMMMMMMM
****^^^^MMMM
*^M
etc.
How can I scan through a file and maybe grep the problem lines?
I tried a few loops with cat ASCIIfile | sed 's/\^//g' | sed 's/M//g' | wc -c and assigning output to a variable and then comparing the count to the other char counts, but obviously this doesn't take into account order and lines like *^*^*M^MM were working.
Using perl:
perl -ne ' { $l=$_; chomp; ($v)=/^((.)\2*)/; $t=length($v); \
s/M{$t}//;s/\^{$t}//;s/\*{$t}//; \
print $l if length } ' input_file
Using bash/sed:
while IFS= read -r line; do # "$line" is quoted below so that * is not expanded as a glob
m=$(echo "$line" | sed 's/[^M]*\([M][M]*\)[^M]*/\1/' | wc -c)
s=$(echo "$line" | sed 's/[^*]*\([*][*]*\)[^*]*/\1/' | wc -c)
n=$(echo "$line" | sed 's/[^\^]*\([\^][\^]*\)[^\^]*/\1/' | wc -c)
if [[ $m -ne $s || $m -ne $n ]]; then
echo "- $line $m::$s::$n"
else
echo "+ $line $m::$s::$n"
fi
done < input_file
Pure Bash:
#!/bin/bash
for string in '***^^^MMM' '**^^MM' '****^^MMMM' '*^*M^'
do
flag=true
sym=true
patt=''
prevlen=${#string}
for c in '*' '^' 'M'
do
patt+="*\\$c"
sub="${string##$patt}"
sublen="${#sub}"
if $flag
then
flag=false
((compare = prevlen - sublen ))
else
if (( prevlen - sublen != compare ))
then
printf '%s\n' "$string is NOT symmetrical"
sym=false
break
fi
fi
prevlen=$sublen
done
if $sym
then
printf '%s\n' "$string IS symmetrical"
fi
done
To read from a file, change the first for loop to while read -r string and add < filename after the last done on the same line.
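A shorter alternative, offered as a sketch rather than a drop-in replacement, is to let bash's =~ operator check the order and then compare the lengths of the three captured runs. is_balanced and report_unbalanced are hypothetical names, and the sketch assumes a valid line consists of nothing but a run of *, a run of ^ and a run of M, in that order.

```bash
#!/usr/bin/env bash
# Sketch: a line is balanced if it is *...^...M... with three runs of equal length.
# is_balanced and report_unbalanced are made-up names for illustration.
is_balanced() {
  local re='^(\*+)(\^+)(M+)$'
  [[ $1 =~ $re ]] || return 1      # wrong characters or wrong order
  (( ${#BASH_REMATCH[1]} == ${#BASH_REMATCH[2]} &&
     ${#BASH_REMATCH[2]} == ${#BASH_REMATCH[3]} ))
}

# Print only the problem lines read from stdin.
report_unbalanced() {
  local line
  while IFS= read -r line; do
    is_balanced "$line" || printf '%s\n' "$line"
  done
}

printf '%s\n' '***^^^MMM' '*^**^^MMM' '**^^^^^MMMMM' | report_unbalanced
```

To check a file, redirect it into the function: report_unbalanced < ASCIIfile.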

How to switch/rotate every two lines with sed/awk?

I have been doing this by hand and I just can't do it anymore-- I have thousands of lines and I think this is a job for sed or awk.
Essentially, we have a file like this:
A sentence X
A matching sentence Y
A sentence Z
A matching sentence N
This pattern continues for the entire file. I want to flip every sentence and matching sentence so the entire file will end up like:
A matching sentence Y
A sentence X
A matching sentence N
A sentence Z
Any tips?
edit: extending the initial problem
Dimitre Radoulov provided a great answer for the initial problem. This is an extension of the main problem-- some more details:
Let's say we have an organized file (due to the sed line Dimitre gave, the file is organized). However, now I want to organize the file alphabetically but only using the language (English) of the second line.
watashi
me
annyonghaseyo
hello
dobroye utro!
Good morning!
I would like to organize alphabetically via the English sentences (every 2nd sentence). Given the above input, this should be the output:
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
For the first part of the question, here is one way to swap every other line with each other in sed without using regular expressions:
sed -n 'h;n;p;g;p'
The -n command-line option suppresses automatic printing. The h command copies the current line from the pattern space to the hold space, n reads the next line into the pattern space, and p prints it; g then copies the first line from the hold space back into the pattern space, and the final p prints it.
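The hold-and-print sequence can be mirrored in plain bash, which makes the handling of an unpaired last line explicit (with -n, the sed command appears to drop such a line, at least in GNU sed, because n exits when there is no next line). This is a sketch under that assumption; swap_pairs is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: swap every two lines read from stdin; keep an unpaired last line as is.
# swap_pairs is a made-up name for illustration.
swap_pairs() {
  local first second
  while IFS= read -r first; do          # like "h": remember the first line
    if IFS= read -r second; then        # like "n": fetch the next line
      printf '%s\n%s\n' "$second" "$first"
    else
      printf '%s\n' "$first"            # odd number of lines: nothing to swap with
    fi
  done
}

printf '%s\n' 'A sentence X' 'A matching sentence Y' | swap_pairs
```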
sed 'N;
s/\(.*\)\n\(.*\)/\2\
\1/' infile
N - append the next line of input into the pattern space
\(.*\)\n\(.*\) - save the matching parts of the pattern space
the one before and the one after the newline.
\2\\
\1 - exchange the two lines (\1 is the first saved part,
\2 the second). Use escaped literal newline for portability
With some sed implementations you could use the escape sequence
\n: \2\n\1 instead.
First question:
awk '{x = $0; getline; print; print x}' filename
next question: sort by 2nd line
paste - - < filename | sort -f -t $'\t' -k 2 | tr '\t' '\n'
which outputs:
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
Assuming an input file like this:
A sentence X
Z matching sentence Y
A sentence Z
B matching sentence N
A sentence Z
M matching sentence N
You could do both exchange and sort with Perl:
perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort keys %_;
}' infile
The output I get is:
% perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort keys %_;
}' infile
B matching sentence N
A sentence Z
M matching sentence N
A sentence Z
Z matching sentence Y
A sentence X
If you want to order by the first line (before the exchange):
perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile
So, if the original file looks like this:
% cat infile1
me
watashi
hello
annyonghaseyo
Good morning!
dobroye utro!
The output should look like this:
% perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile1
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
This version should handle duplicate records correctly:
perl -lne'
$_{ $_, $. } = $v unless $. % 2;
$v = $_;
END {
print substr( $_, 0, length() - 1) , $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile
And another version, inspired by the solution posted by Glenn (record exchange included and assuming the pattern _ZZ_ is not present in the text file):
sed 'N;
s/\(.*\)\n\(.*\)/\1_ZZ_\2/' infile |
sort |
sed 's/\(.*\)_ZZ_\(.*\)/\2\
\1/'