How to replace second pattern(dot) after pattern(comma) in bash - regex

How do I replace the second dot after a comma?
This is the closest I could get:
echo '0.592922148,0.821504176,1.174.129.731' | xargs -d ',' -n1 echo | sed 's/\([^\.]*\.[^\.]*\)\./\1/' | sed 's/\([^\.]*\.[^\.]*\)\./\1/'
Output :
0.592922148
0.821504176
1.174129731
Expected output :
0.592922148,0.821504176,1.174129731

You may use
sed -e ':a' -e 's/\(\.[^.,]*\)\./\1/' -e 't a'
See online sed demo:
s='0.592922148,0.821504176,1.174.129.731'
sed -e ':a' -e 's/\(\.[^.,]*\)\./\1/' -e 't a' <<< "$s"
Details
:a - label a
s/\(\.[^.,]*\)\./\1/ - finds and captures into Group 1 a dot, then any 0+ chars other than dot and comma, and then just matches a dot, and replaces this match with the value in Group 1 (thus, removing the second matched dot)
t a - if there was a successful replacement, jumps back to the a label in the script and tries the substitution again.

While I think the sed solution is your best choice, since you have tagged your question with both sed and awk, an awk solution is fairly straightforward as well using split() and basic string concatenation (just not nearly as short). For example you could do:
awk -v OFS=, -F, '{
for (i=1; i<=NF; i++) {
n=split ($i, a,".")
if (n > 2) {
s=a[1] "." a[2]
for (j=3; j<=n; j++)
s = s a[j]
$i=s
}
}
}1'
Here the input field separator and the output field separator are both set to ','. The loop then visits each field and calls split() to break it on '.' into array a. If split() returns more than 2 elements, the first two elements are put back together with the first '.' restored, and the remaining elements are simply concatenated onto the end. The 1 at the end is the default "print record" to print the updated record.
Example Use/Output
$ echo '0.592922148,0.821504176,1.174.129.731' |
> awk -v OFS=, -F, '{
> for (i=1; i<=NF; i++) {
> n=split ($i, a,".")
> if (n > 2) {
> s=a[1] "." a[2]
> for(j=3;j<=n;j++)
> s = s a[j]
> $i=s
> }
> }
> }1'
0.592922148,0.821504176,1.174129731

Could you please try following.
echo '0.592922148,0.821504176,1.174.129.731' |
awk '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
ind=index($i,".")
if(ind){
val1=substr($i,1,ind)
val2=substr($i,ind+1)
gsub(/\./,"",val2)
$i=val1 val2
}
}
val1=val2=""
}
1'
Explanation: Adding explanation for above code.
echo '0.592922148,0.821504176,1.174.129.731' | ##Printing values as per OP mentioned and using pipe to send its output as standard input for awk command.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program here.
FS=OFS="," ##Setting FS and OFS as comma for each line of Input_file here.
} ##Closing BEGIN BLOCK here.
{
for(i=1;i<=NF;i++){ ##Starting a for loop to traverse through fields of line..
ind=index($i,".") ##Checking index of DOT in current field and saving it into ind variable.
if(ind){ ##Checking condition if variable ind is NOT NULL.
val1=substr($i,1,ind) ##Creating variable val1 from sub-string in current field from 1 to ind value.
val2=substr($i,ind+1) ##Creating variable val2 from sub-string in current field from ind+1 value to till complete length of current field.
gsub(/\./,"",val2) ##Globally substituting DOTs with NULL in val2 variable.
$i=val1 val2 ##Re-creating current field with value of val1 val2.
} ##Closing BLOCK for if condition.
} ##Closing BLOCK for for loop.
val1=val2="" ##Nullifying val1 and val2 variables here.
} ##Closing main code BLOCK here.
1' ##Mentioning 1 will print edited/non-edited line.

An awk version:
echo '0.592922148,0.821504176,1.174.129.731' | awk -F, '{for (i=1;i<=NF;i++) {sub(/\./,"#",$i);gsub(/\./,"",$i);sub(/#/,".",$i);print $i}}'
0.592922148
0.821504176
1.174129731
It splits the line into multiple fields on ,. Then it replaces the first . with #, removes the remaining dots, and finally changes # back to . and prints the field.
Edit
awk -F, '{for (i=1;i<=NF;i++) {sub(/\./,"#",$i);gsub(/\./,"",$i);sub(/#/,".",$i);a=a (i==1?"":",")$i}print a; a=""}' file
0.592922148,0.821504176,1.174129731

Related

awk sub with a capturing group into the replacement

I am writing an awk oneliner for this purpose:
file1:
1 apple
2 orange
4 pear
file2:
1/4/2/1
desired output: apple/pear/orange/apple
addendum: Missing numbers should be best kept unchanged 1/4/2/3 = apple/pear/orange/3 to prevent loss of info.
Methodology:
Build an associative array key[$1] = $2 for file1
capture all characters between the slashes and replace them by matching to the key of associative array eg key[4] = pear
Tried:
gawk 'NR==FNR { key[$1] = $2 }; NR>FNR { r = gensub(/(\w+)/, "key[\\1]" , "g"); print r}' file1.txt file2.txt
#gawk because need to use \w+ regex
#gensub used because need to use a capturing group
Unfortunately, results are
1/4/2/1
key[1]/key[4]/key[2]/key[1]
Any suggestions? Thank you.
You may use this awk:
awk -v OFS='/' 'NR==FNR {key[$1] = $2; next}
{for (i=1; i<=NF; ++i) if ($i in key) $i = key[$i]} 1' file1 FS='/' file2
apple/pear/orange/apple
Note that numbers from file2 that don't exist in the key array are left unchanged, because a field is only replaced when ($i in key) is true.
file1 FS='/' file2 will keep default field separators for file1 but will use / as field separator while reading file2.
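The effect of placing FS='/' between the two file names can be checked with a small self-contained sketch. The function name demo_fs_switch and the temporary file names are only illustrative; the data is the sample from the question.

```shell
# Sketch: awk applies a command-line assignment such as FS='/' only when it
# reaches that argument, so file1 is split on whitespace and file2 on "/".
demo_fs_switch() {
  tmp=$(mktemp -d) || return 1
  printf '1 apple\n2 orange\n4 pear\n' > "$tmp/file1"
  printf '1/4/2/1\n' > "$tmp/file2"
  awk -v OFS='/' '
    NR == FNR { key[$1] = $2; next }   # first file: default whitespace splitting
    { for (i = 1; i <= NF; ++i) if ($i in key) $i = key[$i] } 1
  ' "$tmp/file1" FS='/' "$tmp/file2"
  rm -rf "$tmp"
}

demo_fs_switch
```

If FS='/' were passed with -F or -v instead, it would also apply to file1, and "1 apple" would no longer split into two fields.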
EDIT: In case a field in file2 has no match in file1 and you want to keep the original value as it is, then try the following:
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val=(val=="" ? "" : val FS) (($i in arr)?arr[$i]:$i)
}
print val
}
' file1 FS="/" file2
With your shown samples please try following.
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val = (val=="" ? "" : val FS) arr[$i]
}
print val
}
' file1 FS="/" file2
Explanation: Reading Input_file1 first and creating array arr with index of 1st field and value of 2nd field, then setting field separator as / and traversing through each field of file2 and saving its value in val; printing it at last for each line.
As @Sundeep notes in the comments, you can't use a backreference as an array index. You could mix match and gensub (well, I'm using sub below). Not that this would be a recommended method, but just as an example:
$ awk '
NR==FNR {
k[$1]=$2 # hash them
next
}
{
while(match($0,/[0-9]+/)) # keep doing it while it lasts
sub(/[0-9]+/,k[substr($0,RSTART,RLENGTH)]) # replace here
}1' file1 file2
Output:
apple/pear/orange/apple
And of course, if you have k[1]="word1", you'll end up with a neverending loop.
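One way to sidestep that endless loop, sketched here as an assumption rather than as part of the answer above, is to consume the line from left to right so that replaced text is never scanned again. The function name map_keys and the variables rest, out and tok are hypothetical; keys that are not found are kept as they are.

```shell
# Sketch: replace each number by its mapped value without rescanning the
# replacement text, so a value such as "word1" cannot trigger another match.
# $1 = key/value file, $2 = file to translate.
map_keys() {
  awk '
    NR == FNR { k[$1] = $2; next }      # hash the key/value pairs
    {
      rest = $0; out = ""
      while (match(rest, /[0-9]+/)) {
        tok = substr(rest, RSTART, RLENGTH)
        # copy the text before the match, then the mapped value (or the token itself)
        out = out substr(rest, 1, RSTART - 1) (tok in k ? k[tok] : tok)
        rest = substr(rest, RSTART + RLENGTH)   # continue after the match only
      }
      print out rest
    }
  ' "$1" "$2"
}
```

Called as map_keys file1 file2, it reads the same two files as the commands above.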
With perl (assuming key is always found):
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|$h{$&}|g; print }' f1 f2
apple/pear/orange/apple
if(!$#ARGV) to determine first file (assuming exactly two files passed)
$h{$F[0]}=$F[1] create hash based on first field as key and second field as value
[^/]+ match non / characters
$h{$&} get the value based on matched portion from the hash
If some keys aren't found, leave it as is:
$ cat f2
1/4/2/1/5
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|exists $h{$&} ? $h{$&} : $&|ge; print }' f1 f2
apple/pear/orange/apple/5
exists $h{$&} checks if the matched portion exists as key.
Another approach using awk without loop:
awk 'FNR==NR{
a[$1]=$2;
next
}
$1 in a{
printf("%s%s",FNR>1 ? RS: "",a[$1])
}
END{
print ""
}' f1 RS='/' f2
$ cat f1
1 apple
2 orange
4 pear
$ cat f2
1/4/2/1
$ awk 'FNR==NR{a[$1]=$2;next}$1 in a{printf("%s%s",FNR>1?RS:"",a[$1])}END{print ""}' f1 RS='/' f2
apple/pear/orange/apple

Removing multiple delimiters between outside delimiters on each line

Using awk or sed in a bash script, I need to remove comma separated delimiters that are located between an inner and outer delimiter. The problem is that wrong values end up in the wrong columns, where only 3 columns are desired.
For example, I want to turn this:
2020/11/04,Test Account,569.00
2020/11/05,Test,Account,250.00
2020/11/05,More,Test,Accounts,225.00
Into this:
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
I've tried to use a few things, testing regex:
But I cannot find a solution to only select the commas in order to remove.
awk -F, '{ printf "%s,",$1;for (i=2;i<=NF-2;i++) { printf "%s ",$i };printf "%s,%s\n",$(NF-1),$NF }' file
Using awk, print the first comma delimited field and then loop through the rest of the field up to the last but 2 field printing the field followed by a space. Then for the last 2 fields print the last but one field, a comma and then the last field.
With GNU awk for the 3rd arg to match():
$ awk -v OFS=, '{
match($0,/([^,]*),(.*),([^,]*)/,a)
gsub(/,/," ",a[2])
print a[1], a[2], a[3]
}' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
or with any awk:
$ awk '
BEGIN { FS=OFS="," }
{
n = split($0,a)
gsub(/^[^,]*,|,[^,]*$/,"")
gsub(/,/," ")
print a[1], $0, a[n]
}
' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
Use this Perl one-liner:
perl -F',' -lane 'print join ",", $F[0], "@F[1 .. ($#F-1)]", $F[-1];' in.csv
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
$F[0] : first element of the array @F (= first comma-delimited value).
$F[-1] : last element of @F.
@F[1 .. ($#F-1)] : elements of @F between the second from the start and the second from the end, inclusive.
"@F[1 .. ($#F-1)]" : the above elements, joined on blanks into a string.
join ",", ... : join the LIST "..." on a comma, and return the resulting string.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perl -pe 's{,\K.*(?=,)}{$& =~ y/,/ /r}e' file
sed -e ':a' -e 's/\(,[^,]*\),\([^,]*,\)/\1 \2/; t a' file
awk '{$1=$1","; $NF=","$NF; gsub(/ *, */,","); print}' FS=, file
awk '{for (i=2; i<=NF; ++i) $i=(i>2 && i<NF ? " " : ",") $i} 1' FS=, OFS= file
awk doesn't support lookarounds, but we can get a similar effect with awk's match function. Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
match($0,/,.*,/){
val=substr($0,RSTART+1,RLENGTH-2)
gsub(/,/," ",val)
print substr($0,1,RSTART) val substr($0,RSTART+RLENGTH-1)
}
' Input_file
Yet another perl
$ perl -pe 's/(?:^[^,]*,|,[^,]*$)(*SKIP)(*F)|,/ /g' ip.txt
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
(?:^[^,]*,|,[^,]*$) matches first/last field along with the comma character
(*SKIP)(*F) this would prevent modification of preceding regexp
|, provide , as alternate regexp to be matched for modification
With sed (assuming \n is supported by the implementation, otherwise, you'll have to find a character that cannot be present in the input)
sed -E 's/,/\n/; s/,([^,]*)$/\n\1/; y/,/ /; y/\n/,/'
s/,/\n/; s/,([^,]*)$/\n\1/ replace first and last comma with newline character
y/,/ / replace all comma with space
y/\n/,/ change newlines back to comma
A similar answer to Timur's, in awk
awk '
BEGIN { FS = OFS = "," }
function join(start, stop, sep, str, i) {
str = $start
for (i = start + 1; i <= stop; i++) {
str = str sep $i
}
return str
}
{ print $1, join(2, NF-1, " "), $NF }
' file.csv
It's a shame awk doesn't ship with a join function builtin

How to extract url paths recursively

I want to list all endpoints in a list of URLs like
https://test123.com/endpoint1/endpoint2/endpoint3
https://test456.com/endpoint1/endpoint2/endpoint3
https://test789.com/endpoint1/endpoint2/endpoint3
into output like
https://test123.com/
https://test123.com/endpoint1/
https://test123.com/endpoint1/endpoint2/
https://test123.com/endpoint1/endpoint2/endpoint3
https://test456.com/
https://test456.com/endpoint1/
https://test456.com/endpoint1/endpoint2/
https://test456.com/endpoint1/endpoint2/endpoint3
And so on, listing all endpoints recursively so that I can do something with each endpoint.
I tried to use this, but it prints each part on a separate line.
awk '$1=$1' FS="/" OFS="\n"
thanks
Could you please try following, written and tested with shown samples.
awk '
match($0,/http[s]?:\/\/[^/]*\//){
first=substr($0,RSTART,RLENGTH)
val=substr($0,RSTART+RLENGTH)
num=split(val,array,"/")
print first
for(i=1;i<=num;i++){
value=(value?value "/":"")array[i]
print first value
}
val=first=value=""
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/http[s]?:\/\/[^/]*\//){ ##Using match function which matches http OR https :// then till first occurrence of /
first=substr($0,RSTART,RLENGTH) ##Creating first with sub-string which starts from RSTART till RLENGTH value of current line.
val=substr($0,RSTART+RLENGTH) ##Creating val which has rest of line out of match function in 3rd line of code.
num=split(val,array,"/") ##Splitting val into array with delimiter / here.
print first ##Printing first here.
for(i=1;i<=num;i++){ ##Starting for loop till value of num from i=1 here.
value=(value?value "/":"")array[i] ##Creating value which has array[i] and keep adding in its previous value to it.
print first value ##Printing first and value here.
}
val=first=value="" ##Nullify variables val, first and value here.
}
' Input_file ##Mentioning Input_file name here.
With two loops:
awk '{
x=$1 OFS $2 OFS $3 # x contains the prefix up to the host, e.g. https://test123.com
for(i=3; i<=NF; i++) { # NF is number of last element
printf("%s", x) # print prefix
for(j=4; j<=i; j++){
printf("%s%s", OFS, $j) # print / and single element
}
print ""
}
}' FS='/' OFS='/' file
Output:
https://test123.com
https://test123.com/endpoint1
https://test123.com/endpoint1/endpoint2
https://test123.com/endpoint1/endpoint2/endpoint3
https://test456.com
https://test456.com/endpoint1
https://test456.com/endpoint1/endpoint2
https://test456.com/endpoint1/endpoint2/endpoint3
https://test789.com
https://test789.com/endpoint1
https://test789.com/endpoint1/endpoint2
https://test789.com/endpoint1/endpoint2/endpoint3
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
$ awk -F'/' '{ep=$1 FS FS; for (i=3;i<NF;i++) print ep=ep $i FS; print ep $NF}' file
https://test123.com/
https://test123.com/endpoint1/
https://test123.com/endpoint1/endpoint2/
https://test123.com/endpoint1/endpoint2/endpoint3
https://test456.com/
https://test456.com/endpoint1/
https://test456.com/endpoint1/endpoint2/
https://test456.com/endpoint1/endpoint2/endpoint3
https://test789.com/
https://test789.com/endpoint1/
https://test789.com/endpoint1/endpoint2/
https://test789.com/endpoint1/endpoint2/endpoint3
A solution using perl.
perl -F/ -le 'print; while (3 < @F) { pop @F; print join("/", @F, "") }' input_file
Gives the following for your sample input.
https://test123.com/endpoint1/endpoint2/endpoint3
https://test123.com/endpoint1/endpoint2/
https://test123.com/endpoint1/
https://test123.com/
https://test456.com/endpoint1/endpoint2/endpoint3
https://test456.com/endpoint1/endpoint2/
https://test456.com/endpoint1/
https://test456.com/
https://test789.com/endpoint1/endpoint2/endpoint3
https://test789.com/endpoint1/endpoint2/
https://test789.com/endpoint1/
https://test789.com/
See https://perldoc.perl.org/perlrun.html#Command-Switches look for -Fpattern.

How can I group unknown (but repeated) words to create an index?

I have to create a shell script that indexes a book (text file) by taking any words that are enclosed in angle brackets (<>) and making an index file out of that. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that was asked but required words inside of square brackets and tried to manipulate their code but am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of the angled brackets and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
Could you please try the following (if you are not worried about the sort order; in case you need sorted output, pipe the result of the following code to sort).
awk '
BEGIN{
FS=":"
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
for(key in name){
print key": "name[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=":" ##Setting field separator as : here.
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1 ##Creating array named name with index of $2 and value of $1 which is keep appending to its same index value.
}
END{ ##Starting END block of this code here.
for(key in name){ ##Traversing through name array here.
print key": "name[key] ##Printing key colon and array name value with index key
}
}
' Input_file ##Mentioning Input_file name here.
If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
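Where grep -P is not available, roughly the same lineNumber:word output can be sketched with POSIX awk's match(). This is an alternative technique, not the PCRE approach described above; the function name extract_tagged and the variable rest are hypothetical.

```shell
# Sketch: print "lineNumber:word" for every <word> found on each line,
# using match()/RSTART/RLENGTH instead of \K and a lookahead.
extract_tagged() {
  awk '{
    rest = $0
    while (match(rest, /<[^<>]+>/)) {
      # strip the surrounding < and > from the matched text
      print NR ":" substr(rest, RSTART + 1, RLENGTH - 2)
      rest = substr(rest, RSTART + RLENGTH)   # keep searching after this match
    }
  }' "$@"
}
```

It can be called as extract_tagged index.txt, or fed from a pipe when no file name is given.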
Something like this might be what you need. It outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
.
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
for (lineNr=1; lineNr<=NF; lineNr++) {
line = $lineNr
idx = 1
while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
word = substr(line,idx+RSTART,RLENGTH-2)
locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
idx += (RSTART + RLENGTH - 1)
}
}
}
END {
for (word in locs) {
print word, locs[word]
}
}
.
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data (Plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '
{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort

Dynamically generated regex for gsub not working

I have an input CSV file:
1,5,1
1,6,2
1,5,3
1,7,4
1,5,5
1,6,6
1,6,7
I need to create a string out of this as follows:
;5,1,3,5;6,2,6,7;7,4
So each character, except the first which is the value of the field $2, in the substring in between the ; denotes the row number of middle field; for example ;5,1,3,5 means that 5 is at row number 1,3,5.
I've been trying to use awk with gsub, trying to create the string MYSTR dynamically.
The regex inside the gsub is not working. I need a regex that will match ;$3 (the value of $3, which can be a two digit number) and replace it with ;$3,RowNO, if the pattern is not matched then add ;$3 at the end of the string.
This is what I have so far:
awk -F',' '{
print NR, $3;
noofchars=gsub(/;$3/,";"$3","NR,MYSTR);
print noofchars;
if ( noofchars == 1 )
;
else
MYSTR=MYSTR";"$3","NR;
print NR, $3;
print MYSTR;
}
END{print MYSTR;}' $1
The regex doesn't work because $3 isn't interpreted as the field #3 value but is seen as the anchor $ (that matches the end of the line) and a literal 3.
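If gsub is still wanted, the pattern can be given as a string expression instead of a /.../ literal, in which case awk evaluates the field first and then uses the result as a dynamic regex. The sketch below assumes the grouping value is in $2 (as in the desired output) and uses a hypothetical accumulated string s; the function name demo_dynamic_regex is illustrative only.

```shell
# Sketch: ";" $2 is a string concatenation, so gsub receives ";5" as a
# dynamic regex instead of the literal pattern /;$3/.
demo_dynamic_regex() {
  printf '1,5,1\n' | awk -F, '{
    s = ";5;6"                     # hypothetical string accumulated so far
    n = gsub(";" $2, "&," NR, s)   # & stands for the matched text
    print n, s
  }'
}

demo_dynamic_regex
```

Note that ";5" would also match the start of ";55", and any regex metacharacters in the field would need escaping, so grouping in an array is usually the more robust route.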
You can do it without gsub:
awk -F, '{a[$2]=a[$2]","NR}END{for (i in a){printf(";%d%s",i,a[i])}}'
Input
$ cat file
1,5,1
1,6,2
1,5,3
1,7,4
1,5,5
1,6,6
1,6,7
Output
$ awk -F, '{gsub(/[ ]+/,"",$3);a[$2] = ($2 in a ? a[$2]:$2) FS $3 }END{for(i in a)printf("%s%s",";",a[i]); print ""}' file
;5,1,3,5;6,2,6,7;7,4
Better Readable version
awk -F, '
{
gsub(/[ ]+/,"",$3); # suppress space char in third field
a[$2] = ($2 in a ? a[$2]:$2) FS $3 # array a is indexed by field 2; the value starts as field 2 itself, and FS plus field 3 is appended on every occurrence
}
END{
for(i in a) # loop through array a and print values
printf("%s%s",";",a[i]);
print ""
}
' file
@vsshekhar: Try the following too. It will output the values in the same order in which they appear in Input_file ($2).
awk -F, '{A[++i]=$2;B[A[i]]=B[A[i]]?B[A[i]] "," FNR:FNR} END{for(j=1;j<=i;j++){if(B[A[j]]){printf(";%s,%s",A[j],B[A[j]]);delete B[A[j]]}};print ""}' Input_file
Adding a non-one liner form of solution too now.
awk -F, '{
A[++i]=$2;
B[A[i]]=B[A[i]]?B[A[i]] "," FNR:FNR
}
END{
for(j=1;j<=i;j++){
if(B[A[j]]){
printf(";%s,%s",A[j],B[A[j]]);
delete B[A[j]]
}
};
print ""
}
' Input_file