Remove a specific part of a string split by a multi-character delimiter in bash - regex

I'm trying to process a string set up as below:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
I want to remove the part of the string (and its delimiter) that matches myvar after splitting on the delimiter \n; the result should be
\\[[123 one (/)\n\\[[789 three (/)
But I haven't found a solution yet.
If the delimiter is a single character like :, it can be done with a sed command:
mytext2='\\[[123 one (/):\\[[456 two (/):\\[[789 three (/)'
myvar2=456
echo $mytext2 | sed -E "s/[^:]*${myvar2}[^:]*(:|$)//g"
Result as expected: \\[[123 one (/):\\[[789 three (/)
How can it be done when the delimiter is multiple characters, as in this case?
Thanks.

Even if you use sed -E, you still lack support for lookarounds such as (?<=...), (?!...) and (?=...). I suggest you use perl, which supports Perl Compatible Regular Expressions.
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)';
myvar=456;
echo $mytext | perl -pe "s/(?<=\\\n).*${myvar}.*?(\\\n|$)//g";
Details:
(?<=\\\n): the match must start right after a \n (the extra \\ escapes the backslash).
.*${myvar}.*?(\\\n|$): matches the segment that contains the value of the variable myvar, up to the next \n or the end of the line.
Result:
\\[[123 one (/)\n\\[[789 three (/)

If you can find another single-character delimiter that does not occur in the data, for example :, you can first replace the original delimiter with it:
echo $mytext | tr '\n' ':' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g"
The full script looks like this
mytext='\\[[123 one (/):\\[[456 two (/):\\[[789 three (/)'
myvar=456
v=$(echo $mytext | tr '\n' ':' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g")
echo $v
If the multi-character delimiter is something like abcd, you can use sed to do that first replacement instead of tr.
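Applied to the \n delimiter from the question, that idea might look like this (just a sketch, assuming : never occurs in the data; the last sed turns the : back into \n):
echo "$mytext" | sed 's/\\n/:/g' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g" | sed 's/:/\\n/g'
# => \\[[123 one (/)\n\\[[789 three (/)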

I suggest awk in this case, since you can specify the literal multi-character delimiter pattern as the input/output field separator, iterate over the fields and discard those matching your value:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
awk -v myvar=$myvar 'BEGIN{FS=OFS="\\\\n"} {s="";
  for (i=1; i<=NF; i++) {
    if ($i !~ myvar) {s = s (i==1 ? "" : OFS) $i;}
  }
} END{print s}' <<< "$mytext"
# => \\[[123 one (/)\\n\\[[789 three (/)
NOTES:
BEGIN{FS=OFS="\\\\n"} - sets the input/output field separator to \n
-v myvar=$myvar passes the myvar to awk
s="" - assigns s to an empty string
for (i=1; i<=NF; i++) {...} - iterates over all fields
if ($i !~ myvar) {...} - if the current field value matches myvar...
s = s (i==1 ? "" : OFS) $i;} - append either the current field value to s (if is the first field) or output separator and the current field value (if it is not the first)
END{print s} - prints s after the field checks.
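If you prefer to stay in pure bash, here is a rough parameter-expansion sketch along the same lines (just an illustration, assuming the delimiter is the literal two-character sequence \n and that the value of myvar never spans two segments):
sep='\n'                      # the literal two-character delimiter
out=''
rest=$mytext$sep              # a trailing delimiter simplifies the loop
while [[ -n $rest ]]; do
  part=${rest%%"$sep"*}       # segment before the next \n
  rest=${rest#*"$sep"}        # drop that segment and its delimiter
  [[ $part == *"$myvar"* ]] || out+=${out:+$sep}$part
done
printf '%s\n' "$out"
# => \\[[123 one (/)\n\\[[789 three (/)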

Related

How to check last 3 chars of a string are alphabets or not using awk?

I want to check whether the last 3 characters in column 1 are alphabetic and print those rows. What am I doing wrong?
My code:
awk -F '|' ' {print str=substr( $1 , length($1) - 2) } END{if ($str ~ /^[A-Za-z]/ ) print}' file
cat file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731
expected output :
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
$ awk -F'|' '$1 ~ /[[:alpha:]]{3}$/' file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
Regarding what's wrong with your script (a corrected version is sketched after this list):
You're doing the test for alphabetic characters in the END section for the final line read instead of once per input line.
You're trying to use shell variable syntax $str instead of awk str.
You're testing for literal character ranges in the bracket expression instead of using a character class so YMMV on which characters that includes depending on your locale.
You're testing for a string that starts with a letter instead of a string that ends with 3 letters.
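Putting those fixes together, a version that stays close to your original substr-based approach might look like this (just a sketch; the one-liner above is simpler):
awk -F'|' '{str=substr($1, length($1)-2); if (str ~ /^[[:alpha:]]{3}$/) print}' file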
Use grep:
grep -P '^[^|]*[A-Za-z]{3}[|]' in_file > out_file
Here, GNU grep uses the following option:
-P : Use Perl regexes.
The regex means this:
^ : Start of the string.
[^|]* : Any non-pipe character, repeated 0 or more times.
[A-Za-z]{3} : 3 letters.
[|] : Literal pipe.
sed -n '/^[^|]*[A-Za-z][A-Za-z][A-Za-z]|/p' file
grep '^[^|]*[A-Za-z][A-Za-z][A-Za-z]|' file
{m,g}awk '!+FS<NF' FS='^[^|]*[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '$!_!~"[|]"' FS='[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '($!_~"[|]")<NF' FS='[A-Za-z][A-Za-z][A-Za-z][|]' # to play it safe
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287

How to use 'sed' to add dynamic prefix to each number in integer list?

How can I use sed to add a dynamic prefix to each number in an integer list?
For example:
I have a string "A-1,2,3,4,5", I want to transform it to string "A-1,A-2,A-3,A-4,A-5" - which means I want to add prefix of first integer i.e. "A-" to each number of the list.
If I have string like "B-1,20,300" then I want to transform it to string "B-1,B-20,B-300".
I am not able to use RegEx Capturing Groups because for global match they do not retain their value in subsequent matches.
When it comes to looping constructs in sed, I like to use newlines as markers for the places I have yet to process. This makes matching much simpler, and I know they're not in the input because my input is a text line.
For example:
$ echo A-1,2,3,4,5 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
A-1,A-2,A-3,A-4,A-5
This works as follows:
s/,/\n/g # replace all commas with newlines (insert markers)
:a # label for looping
s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/ # replace the next marker with a comma followed
# by the prefix
ta # loop unless there's nothing more to do.
The approach is similar to @potong's, but I find the regex much more readable -- \([^0-9]*\) captures the prefix, \([^\n]*\) captures everything up to the next marker (i.e. everything that's already been processed), and then it's just a matter of reassembling it in the substitution.
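For what it's worth, the same command also appears to handle the second example from the question:
$ echo B-1,20,300 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
B-1,B-20,B-300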
Don't use sed, just use the other standard UNIX text manipulation tool, awk:
$ echo 'A-1,2,3,4,5' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
A-1,A-2,A-3,A-4,A-5
$ echo 'B-1,20,300' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
B-1,B-20,B-300
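If the prefix is not always two characters long, a variant that derives it from the position of the first - might work (a sketch, assuming the prefix always ends at the first dash):
$ echo 'ABC-1,20,300' | awk '{p=substr($0,1,index($0,"-")); gsub(/,/,"&"p)}1'
ABC-1,ABC-20,ABC-300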
This might work for you (GNU sed):
sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta' file
Uses pattern matching and a loop to replace a number following a comma with the first column's prefix followed by that number.
Assuming this is for shell scripting, you can do so with 2 seds:
set string = "A1,2,3,4,5"
set prefix = `echo $string | sed 's/^\([A-Z]\).*/\1/'`
echo $string | sed 's/,\([0-9]\)/,'$prefix'-\1/g'
Output is
A-1,A-2,A-3,A-4,A-5
With
set string = "B-1,20,300"
Output is
B-1,B-20,B-300
Could you please try the following (if you are OK with awk)?
awk '
BEGIN{
  FS=OFS=","
}
{
  for(i=1;i<=NF;i++){
    if($i !~ /^A/ && $i !~ /\"A/){
      $i="A-"$i
    }
  }
}
1' Input_file
If your data is in a file named d, this was tried with GNU sed:
sed -E 'h;s/^(\w-).+/\1/;x;G;:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/;ts; s/\n.+//' d
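As far as I can tell, that command breaks down roughly as follows:
h                                  # save the line in the hold space
s/^(\w-).+/\1/                     # keep only the prefix (e.g. A-) in the pattern space
x;G                                # restore the line, append the prefix after a newline
:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/  # insert the prefix (\3) after the next comma that is
                                   # still directly followed by a number
ts                                 # repeat while a substitution was made
s/\n.+//                           # remove the trailing newline and prefix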

Linux Replace With Variable Containing Double Quotes

I have read the following:
How Do I Use Variables In A Sed Command
How can I use variables when doing a sed?
Sed replace variable in double quotes
I have learned that I can use sed "s/STRING/$var1/g" to replace a string with the contents of a variable. However, I'm having a hard time finding out how to replace with a variable that contains double quotes, brackets and exclamation marks.
Then, hoping to escape the quotes, I tried piping my result through sed 's/\"/\\\"/g', which gave me another error: sed: -e expression #1, char 7: unknown command: E'. I was hoping to escape the problematic characters and then do the variable replacement: sed "s/STRING/$var1/g". But I couldn't get that far either.
I figured you guys might know a better way to replace a string with a variable that contains quotes.
File1.txt:
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
Variable:
var1=$(cat file1.txt)
Example:
echo "STRING" | sed "s/STRING/$var1/g"
Desired output:
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
using awk
$ echo "STRING" | awk -v var="$var1" '{ gsub(/STRING/,var,$0); print $0}'
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
-v var="$var1": To use shell variable in awk
gsub(/STRING/,var,$0) : To globally substitute all occurances of "STRING" in whole record $0 with var
Special case : "If your var has & in it " say at the beginning of the line then it will create problems with gsub as & has a special meaning and refers to the matched text instead.
To deal with this situation we've to escape & as follows :
$ echo "STRING" | awk -v var="$var1" '{ gsub(/&/,"\\\\&",var); gsub(/STRING/,var,$0);print $0}'
&Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
The problem isn't the quotes. You're missing the "s" command, leading sed to treat /STRING/ as a line address, and the value of $var1 as a command to execute on matching lines. Also, $var1 has unescaped newlines and a / character that'll cause trouble in the substitution. So add the "s", and escape the relevant characters in $var1:
var1escaped="$(echo "$var1" | sed 's#[\/&]#\\&#; $ !s/$/\\/')"
echo "STRING" | sed "s/STRING/$var1escaped/"
...but realistically, @batMan's answer (using awk) is probably a better solution.
Here is one awk command that gets the replacement text from a file that may contain all kinds of special characters such as & or \:
awk -v pat="STRING" 'ARGV[1] == FILENAME {
# read replacement text from first file in arguments
a = (a == "" ? "" : a RS) $0
next
}
{
# now run a loop using index function and use substr to get the replacements
s = ""
while( p = index($0, pat) ) {
s = s substr($0, 1, p-1) a
$0 = substr($0, p+length(pat))
}
$0 = s $0
} 1' File1.txt <(echo "STRING")
To handle all kinds of special characters properly, this command avoids any regex-based functions and uses plain text functions such as index and substr instead.

awk - parse text having same character in fields as delimiter

Consider this source:
field1;field2;"data;data field3";field4;"data;data field5";field6
field1;"data;data field2";field3;field4;field5;"data;data field6"
As you can see, the field delimiter is being used inside certain fields, enclosed between ". I cannot directly parse with awk because there is no way of avoiding unwanted splitting, at least I haven't found a way. Moreover, those special fields have a variable position within a line and they can occur once, twice, 4 times etc.
I thought of a solution involving a pre-parsing step, where I replace the ; in those fields with a code of some sort. The problem is that sed / awk perform greedy REGEX match. So in the above example, I can only replace ; within the last field enclosed in quotes in each line.
How can I match each instance of quotes and replace the specific ; within them? I do not want to use perl or python etc.
Using GNU awk, you can use the special FPAT variable to define your fields with a regex.
You can use this command to replace all ; by | inside the double quotes:
awk -v OFS=';' -v FPAT='"[^"]*"|[^;]*' '{for (i=1; i<=NF; i++) gsub(/;/, "|", $i)} 1' file
field1;field2;"data|data field3";field4;"data|data field5";field6
field1;"data|data field2";field3;field4;field5;"data|data field6"
As an alternative to FPAT you can set the awk FS to be double quotes and then swap out your semicolon delimiter for every other field:
awk -F"\"" '{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/;/, "|", $i)}} {print $0}' yourfile
Here awk is:
Splitting the record on double quotes (-F'"') and keeping the double quote as the output separator (-v OFS='"') so it is restored when awk rebuilds the record
Looping through each field that it finds ({for(i=1;i<=NF;++i))
Testing whether the field's ordinal is even (if(i%2==0)), i.e. whether the field is inside quotes
If it's even then it swaps out the semicolons with pipes (gsub(/;/, "|", $i))
Prints out the transformed record ({print $0})
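With OFS set to the double quote as above, the command should produce the same transformed records as the FPAT approach:
field1;field2;"data|data field3";field4;"data|data field5";field6
field1;"data|data field2";field3;field4;field5;"data|data field6"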

Replace non-alphanumeric characters in substring

I am trying to replace any non-alphanumeric characters present in the first part (before the = sign) of a bunch of key-value pairs with a _:
Input
aa:cc:dd=foo-bar|17657V70YPQOV
ee-ff/gg=barFOO
Desired Output
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
I have tried patterns such as: s/\([^a-zA-Z]*\)=\(.*\)/\1=\2/g without much success. Any basic GNU/Linux tools can probably be used.
With awk
$ awk -F= -v OFS='=' '{gsub("[^a-zA-Z]", "_", $1)} 1' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
The input and output field separators are set to =, and gsub("[^a-zA-Z]", "_", $1) then substitutes all non-alphabetic characters with _ in the first field only
With perl
$ perl -pe 's/^[^=]+/$&=~s|[^a-z]|_|gir/e' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
^[^=]+ non = characters from start of line
$&=~s|[^a-z]|_|gir replace non-alphabet characters with _ only for the matched portion
Use perl -i -pe for inplace editing
Assuming your input is in a file called infile, you could do this:
while IFS== read -r key value; do
printf '%s=%s\n' "${key//[![:alnum:]]/_}" "${value}"
done < infile
with the output
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
This sets the IFS variable to = and reads your key/value pairs line by line into a key and a value variable.
The printf command prints them and adds the = back in; "${key//[![:alnum:]]/_}" substitutes all non-alphanumeric characters in key by an underscore.
Any POSIX-compliant awk:
$ cat f
aa:cc:dd=foo-bar|17657V70YPQOV
ee-ff/gg=barFOO
$ awk 'BEGIN{FS=OFS="="}gsub(/[^[:alnum:]]/,"_",$1)+1' f
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
Explanation
BEGIN{FS=OFS="="} Set input and Output field separator =
/[^[:alnum:]]/ Match a character not present in the list,
[:alnum:] matches a alphanumeric character [a-zA-Z0-9]
gsub(REGEXP, REPLACEMENT, TARGET)
This is similar to the sub function, except gsub replaces
all of the longest, leftmost, nonoverlapping matching
substrings it can find. The g in gsub stands for global, which means replace everywhere,The gsub function returns the number
of substitutions made
+1 It takes care of default operation {print $0} whenever gsub returns 0
Thought I would throw a little ruby in:
ruby -pe '$_.sub!(/[^=]+/){|m| m.gsub(/[^[:alnum:]]/,"_")}'
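A quick usage sketch (assuming the input is in ip.txt, as in the earlier answers):
$ ruby -pe '$_.sub!(/[^=]+/){|m| m.gsub(/[^[:alnum:]]/,"_")}' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO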