Format 11 digit numbers with sed and RegEx

I'm writing a script that can convert numbers (with 11 digits) like this:
Example input:
11111111111
12345678900
Example output:
(11) 1 1111-1111
(12) 3 4567-8900
And here's the command I'm using:
echo "$(sed 's/\(..\)\(.\{5\}\)/(\1)\2-/g' $file)"
But the output is:
(11)11111-1111
(12)34567-8900
Can anyone help me isolate the third digit with spaces, as in the example output? All I can use is sed and RegEx. Thank you!

With your shown samples, please try the following, which uses sed's back-reference capability.
sed -E 's/^([0-9]{2})([0-9])([0-9]{4})([0-9]{4})$/(\1) \2 \3-\4/' Input_file
Explanation: The -E option enables extended regular expressions in sed. The pattern creates four capturing groups (of 2, 1, 4 and 4 digits respectively), and the replacement uses back-references to print them with the required punctuation: parentheses around the first group, a space on each side of the second, and a hyphen between the third and fourth, matching the OP's expected output.
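As a minimal sketch of how the four capture groups behave, the command can be wrapped in a function and fed a few test lines. The name format_number is hypothetical, and the short third input line is an assumption added only to show that lines which do not consist of exactly 11 digits pass through unchanged.

```shell
# format_number is a hypothetical helper name, not part of the original answer.
# Groups: \1 = 2 digits, \2 = 1 digit, \3 = 4 digits, \4 = 4 digits.
format_number() {
  sed -E 's/^([0-9]{2})([0-9])([0-9]{4})([0-9]{4})$/(\1) \2 \3-\4/' "$@"
}

# The last line has only 3 digits, so the anchored pattern does not match it
# and sed prints it unmodified.
printf '%s\n' 11111111111 12345678900 123 | format_number
```

Because the pattern is anchored with ^ and $, malformed lines are left alone rather than being partially reformatted.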

sed -E 's/^(..)(.)(....)(.*)$/(\1) \2 \3-\4/' "file"

Using sed and a regex seems needlessly complicated here. If the numbers are all completely uniform, just print the substrings with the adornments added.
awk 'BEGIN { split("(_) _ _-", f, /_/); split("2_1_4_4", l, /_/); }
{ n=1; for(i=1; i<=4; i++) {
printf "%s%s", f[i], substr($1, n, l[i]); n+=l[i] }
printf "\n" }' file
The BEGIN block creates two parallel data structures which drive the body of the script. The first f contains the prefix to add before the next slice of the substring, and the second l contains the lengths of the substrings we want to extract.
If you wanted to really make this simple, you could write it out as
awk '{
printf "(%s", substr($0, 1, 2);
printf ") %s", substr($0, 3, 1);
printf " %s", substr($0, 4, 4);
printf "-%s\n", substr($0, 8);
}' file
The regex solution is definitely more succinct, but spelling things out so you understand them is probably more important than notational compactness in most situations.

Related

grep regex how to get only results with one preceding word?

My string is :
www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com
I am trying to get only the results with "one" word before texas.com. This is what I expect from a regex grep:
mail.texas.com
www2.texas.com
So mail and www2 are the "one" word that I'm talking about. I tried:
grep "*.texas.com", but I get all of them in the results. Can someone please help?
You can use
grep '^[^.]*\.texas\.com'
Details:
^ - start of string
[^.]* - zero or more chars other than a . char
\.texas\.com - .texas.com string (literal . char must be escaped in the regex pattern).
See the online demo:
#!/bin/bash
s='www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com'
grep '^[^.]*\.texas\.com' <<< "$s"
Output:
mail.texas.com
www2.texas.com
With awk:
awk 'BEGIN{FS=OFS="."} /texas\.com$/ && NF==3' file
Output:
mail.texas.com
www2.texas.com
Set one dot as input and output field separator, check for texas.com at the end ($) of your line and check for three fields.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
With your shown samples, please try the following awk code.
awk -F'.' 'NF==3 && $2=="texas" && $3=="com"' Input_file
Explanation: Set the field separator to . for every line. Then, in the main program, check that NF (the number of fields in the current line) is 3, that the 2nd field is texas, and that the 3rd field is com. If all three conditions are met, print the line.
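A stricter grep variant is sketched below, under the assumption that the host name must end at .texas.com and that the leading label must not be empty. The function name one_label_hosts and the extra test host mail.texas.com.au are hypothetical, added only for illustration.

```shell
# one_label_hosts is a hypothetical helper name.
# ^[^.]+        exactly one non-empty label (no dots)
# \.texas\.com$ the literal suffix, anchored at the end of the line
one_label_hosts() {
  grep -E '^[^.]+\.texas\.com$' "$@"
}

printf '%s\n' www.abc.texas.com mail.texas.com subdomain.xyz.cc.texas.com \
  www2.texas.com mail.texas.com.au | one_label_hosts
```

The trailing $ is what rejects mail.texas.com.au, and + instead of * rejects a bare .texas.com.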

Find and append to Text Between Two Strings or Words using sed or awk

I am looking for a sed in which I can recognize all of the text in between two indicators and then replace it with a place holder.
For instance, the 1st indicator is a list of words
(no|noone|haven't)
and the 2nd indicator is a list of punctuation
(.|,|!)
From an input text such as
"Noone understands the plot. There is no storyline. I haven't
recommended this movie to my friends! Did you understand it?"
The desired result would be:
"Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I
haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX
friends_AFFIX! Did you understand it?"
I know that there is the following sed:
sed -n '/WORD1/,/WORD2/p' /path/to/file
which recognizes the content between two indicators. I have also found a lot of great information and resources here. However, I still cannot find a way to append the affix to each token of text that occurs between the two indicators.
I have also considered to use awk, such as
awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile
yet still, it does not allow me to append the affix.
Does anyone have a suggestion to do so, either with awk or sed?
A fairly compact awk version (it needs gawk, because of gensub):
$ awk 'BEGIN{RS=ORS=" ";s="_AFFIX"}
/[.,!]$/{f=0; $0=gensub(/(.)$/,s "\\1","g")}
f{$0=$0s}
/Noone|no|haven'\''t/{f=1}1' story
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
Perl to the rescue!
perl -pe 's/(?:no(?:one)?|haven'\''t)\s*\K([^.,!]+)/
join " ", map "${_}_AFFIX", split " ", $1/egi
' infile > outfile
\K matches what's on its left, but excludes it from the replacement. In this case, it verifies the 1st indicator. (\K needs Perl 5.10+.)
/e evaluates the replacement part as code. In this case, the code splits $1 on whitespace, map adds _AFFIX to each of the members, and join joins them back into a string.
Here is one verbose awk command for the same:
s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"
awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
a=0
for (i=2; i<=NF; i++) {
if ($(i-1) ~ "\\y(" kw ")\\y")
a=1
if (a && $i ~ "(" pct ")$") {
p = substr($i, length($i), 1)
$i = substr($i, 1, length($i)-1)
}
if (a)
$i=$i "_AFFIX" p
if(p) {
p=""
a=0
}
}
} 1' <<< "$s"
Output:
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
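For systems without gawk or Perl, the same idea can be sketched in POSIX awk. This is only a sketch: the function name affix_between is hypothetical, the start words must match a whole token (case-insensitively), and the state deliberately carries over from one input line to the next so that a span wrapped across lines is still handled.

```shell
# affix_between is a hypothetical helper name.
# q carries a single quote into awk so that "haven't" can be written
# inside the single-quoted awk program.
affix_between() {
  awk -v q="'" '
    BEGIN { kw = "^(no|noone|haven" q "t)$" }
    {
      for (i = 1; i <= NF; i++) {
        if (on) {
          last = substr($i, length($i))
          if (last ~ /[.,!]/) {
            # Closing punctuation: put the affix before it and stop.
            $i = substr($i, 1, length($i) - 1) "_AFFIX" last
            on = 0
          } else {
            $i = $i "_AFFIX"
          }
        } else if (tolower($i) ~ kw) {
          on = 1    # start tagging from the next token
        }
      }
      print
    }' "$@"
}

echo "Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?" | affix_between
```

Note that assigning to a field makes awk rebuild the line, so runs of whitespace are collapsed to single spaces.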

1-indexing to 0-indexing in grep/sed/awk

I'm parsing a template on the Linux command line using pipes. It has some 1-indexed pseudo-variables, but I need 0-indexing. Basically:
...{1}...{7}...
to become
...{0}...{6}...
I'd like to use one of grep, sed or awk in this order of preference. I guess grep can't do that but I'm not sure. Is such arithmetic operation even possible using any of these?
numbers used in the file are in range 0-9, so ignore problems like 23 becoming 12
no other numbers in the file, so you can even ignore the {}
I can do that using Python, Ruby, whatever, but I prefer not to, so stick to standard command-line utils
other command-line utils usable with pipes and regex that I don't know about are fine too
EDIT: Reworded first bullet point to clarify
If the input allows it, you may be able to get away with simply:
tr 123456789 012345678
Note that this will replace all instances of any digit, so may not be suitable. (For example, 12 becomes 01. It's not clear from the question if you have to deal with 2 digit values.)
If you do need to handle multi-digit numbers, you could do:
perl -pe 's/\{(\d+)\}/sprintf( "{%d}", $1-1)/ge'
You can use Perl.
perl -pe 's/(?<={)\d+(?=})/$&-1/ge' file.txt
If you are sure you can ignore {...}, then use
perl -pe 's/\d+/$&-1/ge' file.txt
And if index is always just one-digit number, then go with shorter one
perl -pe 's/\d/$&-1/ge' file.txt
With gawk version 4, you can write:
gawk '
{
n = split($0, a, /[0-9]/, seps)
for (i=1; i<n; i++)
printf("%s%d", a[i], seps[i]-1)
print a[n]
}
'
Older awks can use
awk '
{
while (match(substr($0,idx+1), /[0-9]/)) {
idx += RSTART
$0 = substr($0,1, idx-1) (substr($0,idx,1) - 1) substr($0,idx+1)
}
print
}
'
Both less elegant than the Perl one-liners.
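Since the question prefers sed and limits the indices to single digits, a plain sed sketch is also possible. It assumes the placeholders look exactly like {1} through {9}; the function name shift_indices is hypothetical. The substitutions run in ascending order, so a freshly written {1} is never picked up again by a later rule.

```shell
# shift_indices is a hypothetical helper name.
# In a basic regular expression an unescaped { is a literal brace.
shift_indices() {
  sed -e 's/{1}/{0}/g' -e 's/{2}/{1}/g' -e 's/{3}/{2}/g' \
      -e 's/{4}/{3}/g' -e 's/{5}/{4}/g' -e 's/{6}/{5}/g' \
      -e 's/{7}/{6}/g' -e 's/{8}/{7}/g' -e 's/{9}/{8}/g' "$@"
}

printf '%s\n' '...{1}...{7}...' | shift_indices
```

Unlike the tr approach, digits outside braces are left untouched.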

Can we really do without lazy quantifiers?

Many people say we can do without lazy quantifiers in regular expressions, but I've just run into a problem that I can't solve without them (I'm using sed here).
The string I want to process is composed of substrings separated by the word rate, for example:
anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate
I want to replace those substrings (apart from the word 'rate') with 3 dashes (---) each; the result should be:
---rate---rate---rate
From what I understand (I don't know Perl), it can be easily done using lazy quantifiers. In vim there are lazy quantifiers too; I did it using this command
:s/.\{-}rate/---rate/g
where \{-} tells vim to match as few as possible.
However, vim is a text editor and I need to run the script on many machines, some of which have no Perl installed. It could also be solved if you can tell the regex to not match an atomic grouping like .*[^(rate)]rate but that did not work.
Any ideas how to achieve this using POSIX regex, or is it impossible?
In a case like this, I would use split():
perl -n -e 'print join ("rate", ("---") x split /rate/)' [input-file]
Are there any characters that are guaranteed not to be in the input? For instance, if '!' can't occur, you could transform the input to substitute that unique character, and then do a global replace on the transformed input:
sed 's/ rate /!/g' < input | sed -e 's/[^!]*/---/g' -e 's/!/rate/g'
Another alternative is to use awk's split command in an analogous way to the perl suggestion above, assuming awk is any more reliably available than perl.
awk '
{ ans="---"
n=split($0, x, / rate /);
while ( n-- ) { ans = ans "rate---";}
print ans
}'
It's not easy without using lazy quantifiers or negative lookaheads (neither of which POSIX supports), but this seems to work.
([^r]*((r($|[^a]|a([^t]|$)|at([^e]|$))))?)+rate
I vaguely recall POSIX character classes being a bit persnickety. You may need to alter the character classes in that regex if they're not already POSIX-compliant.
The fact that you don't care about the contents of the substrings opens up a lot of options. For example, to add to Bob Lied's suggestion — even if '!' can occur in the input, you can start by changing it to something else:
sed -e 's/!/./g' -e 's/rate/!/g' -e 's/[^!]\+/---/g' -e 's/!/rate/g' <input >output
With awk:
awk -Frate '{
for (i = 0; ++i <= NF;)
$i = (i == 1 || i == NF) && $i == x ? x : "---"
}1' OFS=rate infile
Or, awk 'BEGIN {OFS=FS="rate"} {for (i=1; i<=NF-1; i++) {$i = "---"}; print}'
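The sentinel idea from the sed answers above can be sketched with a control character instead of '!'. This assumes the byte \001 never occurs in the input; the function name dash_between_rate is hypothetical. Requiring at least one non-sentinel character ([^x][^x]*) avoids replacing empty matches, so no stray --- is appended after the final rate.

```shell
# dash_between_rate is a hypothetical helper name.
# Assumption: the control character \001 never appears in the input.
dash_between_rate() {
  sentinel=$(printf '\001')
  sed -e "s/rate/${sentinel}/g" \
      -e "s/[^${sentinel}][^${sentinel}]*/---/g" \
      -e "s/${sentinel}/rate/g" "$@"
}

printf '%s\n' 'anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate' |
  dash_between_rate
```

The negated bracket expression does the work a lazy quantifier would do: it can never run past the next sentinel, so each run between two occurrences of rate is replaced separately.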

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
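-n suppresses the automatic printing of each line
-r enables extended regular expressions, so you don't have to escape the capture group parentheses () (on BSD/macOS sed, use -E instead)
\1 is the capture group match
/g applies the substitution globally
/p prints the result when a substitution was made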
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl. The -n option instructs Perl to read one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read and, if it matches, prints the contents of the first set of brackets ($1).
You can also do this with multiple file names on the end, e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
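The greedy behaviour and the negated-class workaround can be compared side by side in a small sketch. Both function names are hypothetical, and the patterns use only basic regular expressions ([0-9][0-9]* instead of +) so they should work in any POSIX sed.

```shell
# Hypothetical helper names, for illustration only.

# Greedy: the leading .* swallows as much as it can and leaves only
# the last digit for the capture group.
greedy_capture() {
  sed -n 's/.*\([0-9][0-9]*\).*/\1/p' "$@"
}

# Negated class: [^0-9]* cannot cross a digit, so the group starts at
# the first digit and takes the whole run.
first_number() {
  sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p' "$@"
}

printf '%s\n' abc12345xyz | greedy_capture
printf '%s\n' abc12345xyz | first_number
```

With the sample line, the first command prints only the final digit while the second prints the full number.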
With gawk, you can use match() with a third argument to access the captured group (the array argument is a gawk extension):
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
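Where only a POSIX awk is available, the two-argument match() still sets RSTART and RLENGTH, so the digits can be cut out with substr(). This sketch assumes the literal delimiters abc and xyz (3 characters each); the function name extract_between is hypothetical.

```shell
# extract_between is a hypothetical helper name.
# RSTART/RLENGTH describe the whole match "abc<digits>xyz";
# skip the 3-character prefix and drop the 3-character suffix.
extract_between() {
  awk 'match($0, /abc[0-9]+xyz/) { print substr($0, RSTART + 3, RLENGTH - 6) }' "$@"
}

printf '%s\n' a b c abc12345xyz a b c | extract_between
```

Only the first match on each line is reported, which is enough for the sample input.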
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This matches the pattern [0-9]+ only when it occurs between abc and xyz, and prints just the digits.
perl has the cleanest syntax, but if you don't have perl (it's not always there, I understand), then one way to use components of a regex in gawk is the gensub function. Keep abc and xyz in the gensub pattern: a bare leading .* is greedy and would leave only the last digit for the group.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/, "\\1", 1); }' < file
output of the sample input file will be
12345
Note: gensub replaces only the text that the regex (between the //) matches, so the pattern needs a leading and trailing .* to get rid of the text before and after the number in the substitution.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
For awk, I would use the following script, which prints every line that matches the pattern:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default: do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file