Replace non-alphanumeric characters in substring - regex

I am trying to replace every non-alphanumeric character in the first part (before the = sign) of a set of key-value pairs with an underscore (_):
Input
aa:cc:dd=foo-bar|17657V70YPQOV
ee-ff/gg=barFOO
Desired Output
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
I have tried patterns such as: s/\([^a-zA-Z]*\)=\(.*\)/\1=\2/g without much success. Any basic GNU/Linux tools can probably be used.

With awk
$ awk -F= -v OFS='=' '{gsub(/[^[:alnum:]]/, "_", $1)} 1' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
The input and output field separators are both set to =, and gsub(/[^[:alnum:]]/, "_", $1) then replaces every non-alphanumeric character with _ in the first field only.
With perl
$ perl -pe 's/^[^=]+/$&=~s|[^a-z0-9]|_|gir/e' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
^[^=]+ matches the non-= characters from the start of the line
$&=~s|[^a-z0-9]|_|gir replaces non-alphanumeric characters with _ in the matched portion only (i ignores case, r returns the result instead of modifying $&)
Use perl -i -pe for in-place editing

Assuming your input is in a file called infile, you could do this:
while IFS='=' read -r key value; do
printf '%s=%s\n' "${key//[![:alnum:]]/_}" "${value}"
done < infile
with the output
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
This sets the IFS variable to = and reads your key/value pairs line by line into a key and a value variable.
The printf command prints them and adds the = back in; "${key//[![:alnum:]]/_}" substitutes all non-alphanumeric characters in key by an underscore.
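For comparison, the same key-only substitution can be sketched in Python. This is only an illustration, not part of the answers here: the function name sanitize_key is made up, and [^A-Za-z0-9] is assumed to be an acceptable stand-in for [![:alnum:]] in the C locale.

```python
import re

def sanitize_key(line: str) -> str:
    """Replace non-alphanumeric characters in the part before the first '='."""
    # partition splits at the first '=' only, so any later '=' stays in the value.
    key, sep, value = line.partition("=")
    # Only the key is rewritten; the separator and value are passed through.
    return re.sub(r"[^A-Za-z0-9]", "_", key) + sep + value

for line in ["aa:cc:dd=foo-bar|17657V70YPQOV", "ee-ff/gg=barFOO"]:
    print(sanitize_key(line))
```

A line without any = is treated as all key, so you may want to skip such lines first.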

Any POSIX-compliant awk
$ cat f
aa:cc:dd=foo-bar|17657V70YPQOV
ee-ff/gg=barFOO
$ awk 'BEGIN{FS=OFS="="}gsub(/[^[:alnum:]]/,"_",$1)+1' f
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
Explanation
BEGIN{FS=OFS="="} sets the input and output field separators to =
/[^[:alnum:]]/ matches any single character that is not in the list
[:alnum:] matches an alphanumeric character ([a-zA-Z0-9] in the C locale)
gsub(REGEXP, REPLACEMENT, TARGET)
This is similar to the sub function, except that gsub replaces all of the longest, leftmost, nonoverlapping matching substrings it can find. The g in gsub stands for global, which means replace everywhere. The gsub function returns the number of substitutions made.
+1 makes the pattern always true, so the default action {print $0} still runs when gsub returns 0 (that is, when there was nothing to replace)
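To see why the return value matters, here is a small Python sketch of the same idea. It is my own illustration (underscore_key is a hypothetical name): re.subn, like gsub, reports how many substitutions were made, and a line with nothing to replace yields a count of 0, which is the case the +1 guards against in awk.

```python
import re

def underscore_key(line):
    """Return (rewritten line, number of substitutions made in the key)."""
    key, sep, value = line.partition("=")
    # re.subn returns both the new string and the substitution count,
    # comparable to what gsub returns in awk.
    new_key, count = re.subn(r"[^A-Za-z0-9]", "_", key)
    return new_key + sep + value, count

print(underscore_key("aa:cc:dd=foo-bar"))
print(underscore_key("plain=value"))
```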

Thought I would throw a little ruby in:
ruby -pe '$_.sub!(/[^=]+/){|m| m.gsub(/[^[:alnum:]]/,"_")}'

Related

Remove a specific part of string split by multiple-character delimiter in bash

I'm trying to process a string as follows:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
I want to split the string on the delimiter \n and remove the part (together with its delimiter) that matches myvar. The result should be
\\[[123 one (/)\n\\[[789 three (/)
but I have not found a solution yet.
If the delimiter is a single character such as :, it can be done with sed:
mytext2='\\[[123 one (/):\\[[456 two (/):\\[[789 three (/)'
myvar2=456
echo "$mytext2" | sed -E "s/[^:]*${myvar2}[^:]*(:|$)//g"
Result as expected: \\[[123 one (/):\\[[789 three (/)
How can this be done when the delimiter consists of multiple characters, as in this case?
Thanks.
Even with sed -E, you still lack lookaround support ((?<=...), (?!...), (?=...) and so on). I suggest using perl, which supports these constructs.
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)';
myvar=456;
echo $mytext | perl -pe "s/(?<=\\\n).*${myvar}.*?(\\\n|$)//g";
Details:
(?<=\\\n): a lookbehind requiring that the match starts right after a literal \n. Inside the double quotes, \\ is an escaped backslash.
.*${myvar}.*?(\\\n|$): matches the part that contains the value of myvar and ends with a literal \n or at the end of the line.
Result:
\\[[123 one (/)\n\\[[789 three (/)
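If you can pick a single character that never occurs in the data, for example :, you can first replace the delimiter with it and then reuse your sed command. Note that tr '\n' ':' translates real newline characters, so it only helps when the delimiter is an actual newline. For the literal two-character sequence \n (or any other multi-character delimiter such as abcd), replace it with sed first:
echo "$mytext" | sed 's/\\n/:/g' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g"
The full script, which also converts : back to \n at the end, looks like this
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
v=$(echo "$mytext" | sed 's/\\n/:/g' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g" | sed 's/:/\\n/g')
echo "$v"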
I suggest awk in this case, since you can specify the literal multi-character delimiter pattern as the input/output field separator, iterate over the fields and discard all those that match your value:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
awk -v myvar="$myvar" 'BEGIN{FS="\\\\n"; OFS="\\n"} {s=""; n=0;
for (i=1; i<=NF; i++) {
if ($i !~ myvar) {s = s (n++ ? OFS : "") $i;}
}
} END{print s}' <<< "$mytext"
# => \\[[123 one (/)\n\\[[789 three (/)
See the online awk demo.
NOTES:
BEGIN{FS="\\\\n"; OFS="\\n"} - sets the input field separator to a regex matching a literal \n and the output field separator to the literal string \n
-v myvar="$myvar" passes the shell variable myvar to awk
s=""; n=0 - resets s to an empty string and the count of kept fields to 0
for (i=1; i<=NF; i++) {...} - iterates over all fields
if ($i !~ myvar) {...} - if the current field value does not match myvar...
s = s (n++ ? OFS : "") $i;} - ...append the current field value to s, preceded by the output separator unless it is the first field kept
END{print s} - prints s after the field checks.
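The split, filter and rejoin logic in the notes above can also be sketched in Python. This is an illustration under assumptions of my own: drop_segments is a hypothetical name, and it uses a plain substring test where the awk version uses a regex match (!~).

```python
def drop_segments(text: str, needle: str, delimiter: str = "\\n") -> str:
    """Split on a literal multi-character delimiter, drop segments containing needle."""
    # The default delimiter is the two characters backslash + n, not a newline.
    kept = [seg for seg in text.split(delimiter) if needle not in seg]
    # Rejoin with the same literal delimiter so the format is preserved.
    return delimiter.join(kept)

# Raw string: the backslashes are kept exactly as typed.
mytext = r"\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)"
print(drop_segments(mytext, "456"))
```

Because the surviving segments are rejoined, no stray delimiter is left behind even when the first or last segment is the one removed.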

extract substring with SED

I have the following strings, for example:
input1 = abc-def-ghi-jkl
input2 = mno-pqr-stu-vwy
I want to extract the first word between two "-" characters.
For the first string I want to get: def
For the second string I want to get: pqr
I want to use sed. Could you help me, please?
Use
sed 's,^[^-]*-\([^-]*\).*,\1,' file
The string after the first - will be captured up to the second - and the rest will be matched, then the matched line will be replaced with the group text.
With bash:
var='input1 = abc-def-ghi-jkl'
var=${var#*-} # remove shortest prefix `*-`, this removes `input1 = abc-`
echo "${var%%-*}" # remove longest suffix `-*`, this removes `-ghi-jkl`
Or with awk:
awk -F'-' '{print $2}' <<<'input1 = abc-def-ghi-jkl'
Use - as input field separator and print the second field.
Or with cut:
cut -d'-' -f2 <<<'input1 = abc-def-ghi-jkl'
When you want to use sed, you can choose between solutions like
# Double processing
echo "$input1" | sed 's/[^-]*-//;s/-.*//'
# Normal approach
echo "$input1" | sed -r 's/^[^-]*-([^-]*).*/\1/'
# Funny alternative
echo "$input1" | sed -r 's/(^[^-]*-|-.*)//g'
The obvious "external" tool would be cut. You can also look at a Bash builtin solution like
[[ ${input1} =~ ([^-]*)-([^-]*) ]] && printf %s "${BASH_REMATCH[2]}"
grep solution (in my opinion this is the most natural approach, as you are only trying to find matches to a regular expression - you are not looking to edit anything, so there should be no need for the more advanced command sed)
grep -oP '^[^-]*-\K[^-]*(?=-)' << EOF
> abc-qrs-bobo-the-clown
> 123-45-6789
> blah-blah-blah
> no dashes here
> mahi-mahi
> EOF
Output
qrs
45
blah
Explanation
Look at the inputs first, included here for completeness as a heredoc (more likely you would name your file as the last argument to grep.) The solution requires at least two dashes to be present in the string; in particular, for mahi-mahi it will find no match. If you want to find the second mahi as a match, you can remove the lookahead assertion at the end of the regular expression (see below).
The regular expression does this. First note the command options: -o to return only the matched substring, not the entire line; and -P to use Perl extensions. Then, the regular expression: start from the beginning of the line (^); look for zero or more non-dash characters followed by dash, and then (\K) discard this part of the required match from the substrings found to match the pattern. Then look for zero or more non-dash characters again - this will be returned by the command. Finally, require a dash following this pattern, but do not include it in the match. This is done with a lookahead (marked by (?= ... )).
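If you want to experiment with the two variants (with and without the trailing lookahead), here is a rough Python sketch. Python's re module has no \K, so a capture group is used instead; the names STRICT, LOOSE and second_field are mine, not part of grep.

```python
import re

# Second dash-separated field, only when another dash follows (like (?=-)).
STRICT = re.compile(r"^[^-]*-([^-]*)(?=-)")
# Same pattern without the lookahead: a single dash is enough.
LOOSE = re.compile(r"^[^-]*-([^-]*)")

def second_field(line, pattern=STRICT):
    """Return the text after the first dash, or None if the pattern fails."""
    m = pattern.search(line)
    return m.group(1) if m else None

for s in ["abc-qrs-bobo-the-clown", "123-45-6789", "no dashes here", "mahi-mahi"]:
    print(s, "->", second_field(s), "/", second_field(s, LOOSE))
```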

Regular expression for not more than one occurrence of consecutive characters

I'm looking for a regular expression that matches only if the string contains exactly one occurrence of two consecutive identical characters.
for example:
1123456 - match
1122345 - not match
1121125 - not match
1234567 - not match
1112345 - not match
I currently have this regex: ([0-9])\1{1,} but it also matches 1122345, which is not what I need.
This awk does it, if you have minimal awk (mawk) or GNU awk (gawk):
awk -F "" '
{
d=0
for(i=1;i<NF;i++){
if ($i==$(i+1)) d++
}
if (d==1) print
}' file
Setting the field separator to the empty string ("") makes awk treat each character as a separate field, so the line can be read character by character. If character i equals character i+1, d is incremented. If d==1 at the end, the line is printed.
From your sample:
$ cat file
1123456
1122345
1121125
1234567
1112345
It outputs:
1123456
Important remark:
GNU awk manual says the use of empty string as field separator is a "dark corner", meaning that it is not standard and some implementations may handle it differently. If you want to be sure that it will work with any awk, go for
awk '
{
d=0
n=split($0,ch,"")
for(i=1;i<n;i++){
if (ch[i]==ch[i+1]) d++
}
if (d==1) print
}' file
It passed the gawk --posix test and yields the same result.
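The same counting idea, sketched in Python in case it helps to see it outside awk. The function name has_exactly_one_pair is just something I picked for the example.

```python
def has_exactly_one_pair(s: str) -> bool:
    """True if exactly one position i has s[i] == s[i+1]."""
    # zip(s, s[1:]) walks over every pair of adjacent characters.
    pairs = sum(1 for a, b in zip(s, s[1:]) if a == b)
    return pairs == 1

for s in ["1123456", "1122345", "1121125", "1234567", "1112345"]:
    print(s, has_exactly_one_pair(s))
```

As in the awk version, a run of three identical characters counts as two adjacent pairs, so 1112345 is rejected.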

regex replace and add hyphen before first zero

Input1: RC000030034
Replace1: RC-000030034
Input2: RC100003282
Replace2: RC1-00003282
Looking to add a hyphen before the first 0 in the string.
Input will always have 11 characters.
Final output will always be 12 characters.
Never will have alpha characters after the hyphen.
Based on this : "Looking to add a hyphen before the first 0 in the string."
example in javascript:
> var re = /^([^0]*)0(.*)$/
> "RC000030034".replace(re,"$1-0$2")
'RC-000030034'
in bash (echo + sed):
$ echo 'RC000030034' | sed -e 's/^\([^0]*\)0\(.*\)$/\1-0\2/'
RC-000030034
output = input.replace("0", "-0")
or equivalent code in your language
Most languages provide some kind of replace method for strings. The example above works in JavaScript, where it replaces only the first occurrence of '0' with '-0'.
In Python, replace() replaces all occurrences by default; there, you must pass the optional third argument to limit the number of replacements:
output = input.replace("0", "-0", 1)
In perl, you could use a regular expression:
$input =~ s/0/-0/;
Putting all your strings in a file (each string on a separate line) you can use the following shell command:
perl -ne 's/0/-0/; print' inputfile
Now, suppose your input allows for strings like RC0A00001234, where the hyphen should be before the second 0, behind the A (because we 'Never will have alpha characters after the hyphen').
Then the command has to change into:
perl -ne 's/(0\d*)$/-$1/; print' inputfile
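For what it's worth, the last perl substitution can be tried out in Python as well. This is only a sketch: add_hyphen is a hypothetical name, and [0-9] is used in place of \d to keep the match to ASCII digits.

```python
import re

def add_hyphen(code: str) -> str:
    """Insert '-' before the trailing run of digits that starts with a 0."""
    # (0[0-9]*)$ is the leftmost 0 from which only digits follow to the end.
    return re.sub(r"(0[0-9]*)$", r"-\1", code, count=1)

for code in ["RC000030034", "RC100003282", "RC0A00001234"]:
    print(code, "->", add_hyphen(code))
```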

Using awk or sed to merge / print lines matching a pattern (oneliner?)

I have a file that contains the following text:
subject:asdfghj
subject:qwertym
subject:bigger1
subject:sage911
subject:mothers
object:cfvvmkme
object:rjo4j2f2
object:e4r234dd
object:uft5ed8f
object:rf33dfd1
I am hoping to achieve the following result using awk or sed (as a oneliner would be a bonus! [Perl oneliner would be acceptable as well]):
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
I'd like to have each line that matches 'subject' and 'object' combined in the order that each one is listed, separated with a comma. May I see an example of this done with awk, sed, or perl? (Preferably as a oneliner if possible?)
I have tried a few awk commands to do this (I should add that I am still learning):
awk '{if ($0 ~ /subject/) pat1=$1; if ($0 ~ /object/) pat2=$2} {print $0,pat2}'
But it does not do what I thought it would, so I know I have the syntax wrong. Seeing an example would greatly help me learn.
not perl or awk but easier.
$ pr -2ts, file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
Explanation
-2 two columns
-t omit the page header and trailer (filename, date, page number, etc.)
-s, use a comma as the column separator
I'd do it something like this in perl:
#!/usr/bin/perl
use strict;
use warnings;
my @subjects;
while ( <DATA> ) {
m/^subject:(\w+)/ and push @subjects, $1;
m/^object:(\w+)/ and print "subject:",shift @subjects,",object:", $1,"\n";
}
__DATA__
subject:asdfghj
subject:qwertym
subject:bigger1
subject:sage911
subject:mothers
object:cfvvmkme
object:rjo4j2f2
object:e4r234dd
object:uft5ed8f
object:rf33dfd1
Reduced down to one liner, this would be:
perl -ne '/^(subject:\w+)/ and push @s, $1; /^object/ and print shift(@s), ",", $_' file
grep, paste and process substitution
$ paste -d , <(grep 'subject' infile) <(grep 'object' infile)
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
This treats the output of grep 'subject' infile and grep 'object' infile like files due to process substitution (<( )), then pastes the results together with paste, using a comma as the delimiter (indicated by -d ,).
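The same "filter twice, then paste side by side" idea can be sketched in Python with zip. This is an illustration only: merge_pairs is a made-up name, and it assumes the lines really start with subject: or object:.

```python
def merge_pairs(lines):
    """Pair the n-th subject line with the n-th object line, comma-separated."""
    # Two passes over the input, like the two grep calls.
    subjects = [l for l in lines if l.startswith("subject:")]
    objects = [l for l in lines if l.startswith("object:")]
    # zip plays the role of paste -d , and stops at the shorter list.
    return [f"{s},{o}" for s, o in zip(subjects, objects)]

sample = [
    "subject:asdfghj", "subject:qwertym",
    "object:cfvvmkme", "object:rjo4j2f2",
]
print("\n".join(merge_pairs(sample)))
```

Unlike paste, zip silently drops unpaired lines when the two groups differ in length.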
sed
The idea is to read and store all subject lines in the hold space, then for each object line fetch the hold space, get the proper subject and put the remaining subject lines back into hold space.
First the unreadable oneliner:
$ sed -rn '/^subject/H;/^object/{G;s/\n+/,/;s/^(.*),([^\n]*)(\n|$)/\2,\1\n/;P;s/^[^\n]*\n//;h}' infile
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
-r is for extended regex (no escaping of parentheses, + and |) and -n does not print by default.
Expanded, more readable and explained:
/^subject/H # Append subject lines to hold space
/^object/ { # For each object line
G # Append hold space to pattern space
s/\n+/,/ # Replace first group of newlines with a comma
# Swap object (before comma) and subject (after comma)
s/^(.*),([^\n]*)(\n|$)/\2,\1\n/
P # Print up to first newline
s/^[^\n]*\n// # Remove first line (can't use D because there is another command)
h # Copy pattern space to hold space
}
Remarks:
When the hold space is fetched for the first time, it starts with a newline (H adds one), so the newline-to-comma substitution replaces one or more newlines, hence the \n+: two newlines for the first time, one for the rest.
To anchor the end of the subject part in the swap, we use (\n|$): either a newline or the end of the pattern space – this is to get the swap also on the last line, where we don't have a newline at the end of the pattern space.
This works with GNU sed. For BSD sed as found in MacOS, there are some changes required:
The -r option has to be replaced by -E.
There has to be an extra semicolon before the closing brace: h;}
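To insert a newline in the replacement string (swap command), \n is not recognized; use a backslash followed by an actual newline instead, for example by replacing \n with \'$'\n'' in bash. Note that '"$(printf '\n')"' does not work, because command substitution strips trailing newlines.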
Since you specifically asked for a "oneliner" I assume brevity is far more important to you than clarity so:
$ awk -F: -v OFS=, 'NR>1&&$1!=p{f=1}{p=$1}f{print a[++c],$0;next}{a[NR]=$0}' file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1