Sed: Searching for a string length and character specific - regex

I've been searching for a few hours now for a way to find a string containing 21 numeric characters and place a return in front of the string itself. Finally I found the solution using:
sed -r 's/\b[0-9]{21}\b/\n&/g'
Works great!
Now I have a new set of data containing 21 numeric characters, but in addition there are some alphabetic characters at the end of the string, with a variable length of 3 to 10 characters.
Sample input:
169349870913736210308ABC
168232727246529300209DEFGHI
166587299965005120122JKLMNOPQRS
162411281984306600005TUVWXYZ
What I would like is to have a space between the numeric and the alphabetical characters:
169349870913736210308 ABC
168232727246529300209 DEFGHI
166587299965005120122 JKLMNOPQRS
162411281984306600005 TUVWXYZ
Do note the 16 which every number starts with. I've tried using:
sed -r 's/^\b[0-9]{21}\+[A-Z]{3,10}\b/ /g' filename
But I couldn't get it to work, because I don't know and couldn't find how to search specifically for a string containing an exact number of numeric characters followed by alphabetical characters of a given length. I've found a lot of helpful questions on this website, but this one I couldn't find.

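Use capturing groups: the first group holds the 21 digits, the second the 3 to 10 letters, and the replacement puts a space between them.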
sed -r 's/^([0-9]{21})([A-Z]{3,10})$/\1 \2/' filename
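For comparison, the same capturing-group idea can be sketched in Python. This is only an illustration, not part of the original answer; the names PATTERN and split_id are made up for the example, and it assumes the letters are uppercase ASCII as in the sample input.

```python
import re

# Group 1: exactly 21 digits. Group 2: 3 to 10 uppercase letters.
PATTERN = re.compile(r"^([0-9]{21})([A-Z]{3,10})$")

def split_id(line: str) -> str:
    """Insert a space between the 21-digit prefix and the letter suffix.

    Lines that do not match the pattern are returned unchanged.
    """
    return PATTERN.sub(r"\1 \2", line)

for sample in ("169349870913736210308ABC", "166587299965005120122JKLMNOPQRS"):
    print(split_id(sample))
```

As with the sed version, the anchors ^ and $ mean a line with extra characters before or after is left alone.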

Search from left to right for the first non-numeric character ([^0-9]) and replace it with a space followed by the matched character (&):
sed 's/[^0-9]/ &/' file
Output:
169349870913736210308 ABC
168232727246529300209 DEFGHI
166587299965005120122 JKLMNOPQRS
162411281984306600005 TUVWXYZ

Why not just add a space after the first 21 characters, like so:
sed 's/^...................../& /g'

Regex match until third occurrence of a char is found, counting occurrence of said char starting from the end of string

Let's dive in. Input:
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
Desired output:
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
Starting from the beginning of my string, I need to match until the third occurrence of " _ " (underscore) is found, but I need to count " _ " (underscore) occurrence starting from end of string.
Any tips are appreciated,
Best regards
I believe this regex should do the trick!
^.*?(?=_[^_]*_[^_]*_[^_]*$)
Explanation:
^ the start of the line
.*? matches as few characters as possible (the ? makes the quantifier lazy)
(?=...) asserts that its contents follow our match
_[^_]*_[^_]*_[^_]* Looks for exactly three underscores after our match.
$ the end of the line
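To see the lookahead at work, here is a small Python sketch that applies the same pattern to some of the sample names. The helper name prefix_before_third_last_underscore is invented for this illustration, and returning the input unchanged when nothing matches is an assumption, not something the answer specifies.

```python
import re

# Lazy prefix, followed by a lookahead that requires exactly three
# underscores between the end of the prefix and the end of the string.
PREFIX = re.compile(r"^.*?(?=_[^_]*_[^_]*_[^_]*$)")

def prefix_before_third_last_underscore(name: str) -> str:
    """Return everything before the third underscore counted from the end."""
    match = PREFIX.match(name)
    return match.group(0) if match else name

for name in ("p9_rec_tonly_.cr_called.seg", "p9_tonly_.cr_called.seg"):
    print(prefix_before_third_last_underscore(name))
```

Because the lookahead is anchored with $, the lazy prefix keeps growing until exactly three underscores remain after it.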
You should think beyond regex to solve this problem. For example, if you are using Python just use rsplit with a limit of 3 and get the first resulting string:
>>> data = [
...     'p9_rec_tonly_.cr_called.seg',
...     'p9_tonly_.cr_called.seg',
...     'p10_nor_nor_.cr_called.seg',
...     'p10_rec_tn_.cr_called.seg',
...     'p10_tn_.cr_called.seg',
...     'p26_rec_nor_nor_.cr_called.seg',
...     'p26_rec_tn_.cr_called.seg',
...     'p26_tn_.cr_called.seg',
... ]
>>> for d in data:
...     print(d.rsplit('_', 3)[0])
...
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
Bash, you say? It's not a regular expression, but you can do pattern substitution (or suffix stripping) with bash:
while read var ; do echo ${var%_*_*_*} ; done <<EOT
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
EOT
${var%_*_*_*} expands the variable var, stripping the shortest suffix that matches _*_*_*.
Otherwise to perform regex operations in shell, you could normally ask a utility like sed for help and feed your lines through for instance this:
sed -e 's#_[^_]*_[^_]*_[^_]*$##'
or for short:
sed -e 's#\(_[^_]*\)\{3\}$##'
Find three groups of _ and zero or more characters of not _ at the end of line $ replacing them with nothing ('').

grep to search for a specified pattern

I want to grep all the lines in a file which contain symbols (non-alphanumeric characters), which start with a number, or which have spaces in them
grep -i "^[0-9]\|[^a-zA-Z0-9]\| "
I have written the grep command above, which works perfectly. However, I also want to include lines whose length falls outside a given range, for example all lines shorter than 3 or longer than 15 characters.
How can I include that length limit in the same command?
I tried using
{3,15}
and similar, but could not get the desired output.
Sample input
aa
9dsa
abcd
abc#$
ab d
Sample output
aa //because length less than 3
ab d //because has space in between
9dsa // because starts with a number
abc#$ //because has special symbols in it
For clarity, simplicity, robustness, portability, etc. just use awk instead of grep to search for non-trivial conditions:
$ awk 'length()<3 || length()>15 || /[^[:alnum:]]/ || /[[:space:]]/ || /^[0-9]/' file
aa
9dsa
abc#$
ab d
I mean seriously, that couldn't get much clearer/simpler and it will work in any POSIX awk and it's trivial to change if/when your requirements change.
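The same "list every condition separately" approach can be sketched in Python. This is an illustration only: the function name is_flagged is hypothetical, and it assumes ASCII letters and digits, whereas awk's [:alnum:] depends on the locale.

```python
import re

def is_flagged(line: str) -> bool:
    """Mirror the awk conditions: bad length, a non-alphanumeric
    character (which includes spaces), or a leading digit."""
    return (
        len(line) < 3
        or len(line) > 15
        or re.search(r"[^A-Za-z0-9]", line) is not None
        or re.match(r"[0-9]", line) is not None
    )

sample = ["aa", "9dsa", "abcd", "abc#$", "ab d"]
flagged = [line for line in sample if is_flagged(line)]
print(flagged)
```

Each condition is a separate clause, so adding or removing a rule does not require rewriting one large regular expression.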
The expression below should help you find the required lines. I am assuming you will use grep -E so that the alternation works properly:
^[[:digit:]]|[##$%^&*()]|^.{0,2}$|^.{16,}$
Below is the explanation for the regex:
^[[:digit:]] - Match a line that starts with a digit
[##$%^&*()] - Match a line containing any of the specified symbols.
Alternatively you can use [^[:alnum:]] if you want
to match any non-alphanumeric character.
Beware that a space, underscore, tab, quote, etc. are all
examples of non-alphanumeric characters
^.{0,2}$ - Match a line containing fewer than 3 characters
^.{16,}$ - Match a line containing more than 15 characters

Regular Expressions with multiple dots in Linux bash shell give strange results

I tried to match a substring containing a lot of dots, and it failed in the Debian Linux shell. I wrote a simple script to see how dots are processed and found that the behaviour breaks all the rules. I retried it in Bash, Perl, and the Ubuntu shell, and the result is always the same. The script and output are below.
#!/bin/sh
my_regex=u2734523abcABCB.C123.ABC.abc.1..2.34.2
Numbering=123456789_123456789_123456789_123456789
echo "$my_regex"
echo "$Numbering"
echo `expr index "$my_regex" '(ABC)'`
echo `expr index "$my_regex" '(ABC\.)'`
echo `expr index "$my_regex" '(\.\.)'`
echo `expr index "$my_regex" '(.)'`
echo `expr index "$my_regex" '(\.1)'`
Output:
u2734523abcABCB.C123.ABC.abc.1..2.34.2
123456789_123456789_123456789_123456789
12
12
16
16
16
The first regex should match ABC and return the position of its first character. It works.
The second one should find ABC followed by a dot, but it looks like it ignores the dot.
The third one should find two dots, but it finds the first occurrence of a single dot. Ignored again?
The fourth should find the first occurrence of any character, but it still finds the dot at position 16.
The fifth should find a dot followed by 1, but it still finds the first occurrence of a dot.
It seems like neither \ nor [ ] (I tried that too), nor the dot itself, works as in common regular expressions.
Why?
expr index has nothing to do with regular expressions.
expr index STRING CHARS outputs the index of the first occurrence of any of the CHARS in STRING. So your first search for '(ABC)' finds the first left parenthesis, A, B, C, or right parenthesis in your string. The first one is the A at position 12.
'(ABC\.)' does the same thing, except it's now also looking for a backslash or period. But the A is still the first match at position 12.
'(\.\.)' looks only for a parenthesis, backslash, or period. The first match is the period at position 16.
Likewise, all your other searches find the period at position 16, because none of the other characters you're listing come before that.
(On a side note, it's silly to capture the output with backticks only to immediately echo it. You'd get the same result by omitting the echo and backticks.)
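The behaviour described above can be reproduced with a few lines of Python. This is a sketch of the documented semantics of expr index, not its actual implementation; the function name expr_index is made up for the illustration.

```python
def expr_index(string: str, chars: str) -> int:
    """1-based position of the first character of `string` that
    occurs anywhere in `chars`, or 0 if there is none.

    `chars` is treated as a plain set of characters, not as a regex.
    """
    for position, character in enumerate(string, start=1):
        if character in chars:
            return position
    return 0

text = "u2734523abcABCB.C123.ABC.abc.1..2.34.2"
for chars in ("(ABC)", "(ABC\\.)", "(\\.\\.)", "(.)", "(\\.1)"):
    print(expr_index(text, chars))
```

Run against the string from the question, this prints the same 12, 12, 16, 16, 16 as the original script, which shows that no regular expression matching is involved.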
You are incorrectly using index function of expr. As per man expr:
index STRING CHARS - index in STRING where any CHARS is found, or 0
So 2 things to note here:
index doesn't do any regex matching
index finds the position of the first occurrence of any of the given chars in the string
If you want regex matching then use:
STRING : REGEXP
like this:
my_regex='u2734523abcABCB.C123.ABC.abc.1..2.34.2'
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*ABC'
24
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*ABC\.'
25
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*\.\.'
32
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*.'
38
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*\.1'
30
The number printed after each expr command is actually the length of the match.
There is no need to use echo here as expr anyway writes output on stdout.
You might want to take a look at BASH built-in =~ operator for regex matching.
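To check the match lengths shown above, here is a rough Python equivalent of expr STRING : REGEXP. It is only a sketch: expr uses POSIX basic regular expressions while Python's re module uses its own syntax, so it carries over directly only for simple patterns like these, and it ignores the case where expr prints a captured \( \) group instead of a length. The name expr_match_length is hypothetical.

```python
import re

def expr_match_length(string: str, pattern: str) -> int:
    """Length of the match anchored at the start of `string`, or 0."""
    match = re.match(pattern, string)  # re.match anchors at the start, like expr
    return len(match.group(0)) if match else 0

text = "u2734523abcABCB.C123.ABC.abc.1..2.34.2"
for pattern in (r".*ABC", r".*ABC\.", r".*\.\.", r".*.", r".*\.1"):
    print(pattern, expr_match_length(text, pattern))
```

The leading .* is greedy, so each result reflects the last place in the string where the rest of the pattern can match.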

Use SED Regex to replace certain letters with numbers

I'm after something that I guess is pretty complex, and I am pretty bad with regexes, so you guys might be able to help.
See this data source:
User ID:
a123456
a12345f
a1234e6
d123d56
b12c456
c1b3456
ba23456
Basically, what I want to do is use a regex/sed to replace all occurrences of letters with numbers EXCEPT the first letter. Letters will always match their alphabet position, e.g. a = 1, b = 2, c = 3 etc.
So the result set should look like this:
User ID:
a123456
a123456
a123456
d123456
b123456
c123456
b123456
There will also never be any letters other than a-j, and the string will always be 7 chars long.
Can anyone shed some light? Thanks! :)
Here's one way you could do it using standard tools cut, paste and tr:
$ paste -d'\0' <(cut -c1 file) <(cut -c2- file | tr 'abcdef' '123456')
a123456
a123456
a123456
d123456
b123456
c123456
b123456
This joins the first character of the line with the result of tr on the rest of the line, using the null string. tr replaces each element found in the first list with the corresponding element of the second list.
To replace a-j letters in a line by the corresponding digits except the first letter using perl:
$ perl -pe 'substr($_, 1) =~ tr/a-j/0-9/' input_file
a=0, not a=1 because j would be 10 (two digits) otherwise.
J = 0, and no, only numbers 0-9 are used, and letters simply replace their number counterpart, so there will never be a letter greater than j.
To make j=0 and a=1:
$ perl -pe 'substr($_, 1) =~ tr/ja-i/0-9/' input_file
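The same "translate everything after the first character" idea can be sketched in Python with str.translate. This is an illustration, not one of the posted answers; TABLE and normalize_id are names invented here, and the mapping j=0, a=1, ..., i=9 follows the clarification above.

```python
# Mapping from the tr/ja-i/0-9/ example: j -> 0, a -> 1, ..., i -> 9.
TABLE = str.maketrans("jabcdefghi", "0123456789")

def normalize_id(user_id: str) -> str:
    """Keep the first character and translate letters a-j in the rest."""
    return user_id[:1] + user_id[1:].translate(TABLE)

for user_id in ("a12345f", "d123d56", "ba23456"):
    print(normalize_id(user_id))
```

Slicing off the first character plays the role of substr($_, 1) in the perl one-liner.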
sed '/[a-j][0-9a-j]\{6\}$/{h;y/abcdefghij/1234567890/;G;s/.\(.\{6\}\).\(.\).*/\2\1/;}' YourFile
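Only act on lines that end with an ID (a letter a-j followed by six characters from 0-9 or a-j)
Save the line in the hold space (to keep the first letter)
Change all letters to digits (including the first one)
Append the original form of the line (as a second line in the pattern space)
Take the first letter of the second line and the last six characters of the first line, reorder them, and drop the remaining characters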
$ awk 'BEGIN{FS=OFS=""} NR>1{for (i=2;i<=NF;i++) if(p=index("jabcdefghi",$i)) $i=p-1} 1' file
User ID:
a123456
a123456
a123456
d123456
b123456
c123456
b123456
Note that the above reproduces the header line User ID: as-is. So far, best I can tell, all of the other posted solutions would change the header line to Us5r ID: since they would do the letter-to-number translation on it just like on all of the subsequent lines.
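The header problem described above can be illustrated with a short Python sketch that skips the first line, as NR>1 does in the awk command. The function name normalize_file_lines is hypothetical, and it assumes the header is always exactly the first line.

```python
# Same mapping as the awk index() lookup: j -> 0, a -> 1, ..., i -> 9.
TABLE = str.maketrans("jabcdefghi", "0123456789")

def normalize_file_lines(lines):
    """Translate letters after the first character on every line
    except the first one, which is passed through as the header."""
    result = []
    for number, line in enumerate(lines, start=1):
        if number == 1:
            result.append(line)  # header line is left untouched
        else:
            result.append(line[:1] + line[1:].translate(TABLE))
    return result

print(normalize_file_lines(["User ID:", "a12345f", "ba23456"]))
```

Without the line-number check, the e in "User ID:" would be translated as well, which is the "Us5r ID:" effect mentioned above.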
I don't see the complexity. Your samples look like you just want to replace six of seven characters with the numbers 1-6:
s/^\([a-j0-9]\)[a-j0-9]\{6\}/\1123456/
Since the numbers to put there are defined by position, we don't care what the letter was (or even if it was a letter). The downside here is that we don't preserve the numbers, but they never varied in your sample data.
If we want to replace only letters, the first method I can think of involves simply using multiple substitutions:
s/^\([a-j0-9]\{1\}\)[a-j]/\11/
s/^\([a-j0-9]\{2\}\)[a-j]/\12/
s/^\([a-j0-9]\{3\}\)[a-j]/\13/
s/^\([a-j0-9]\{4\}\)[a-j]/\14/
s/^\([a-j0-9]\{5\}\)[a-j]/\15/
s/^\([a-j0-9]\{6\}\)[a-j]/\16/
Replacing letters with specific digits, excluding the first letter:
s/\(.\)a/\11/g
This pattern replaces two-character sequences, preserving the first character, so it would have to be run twice for each letter. Using the hold space we could store the first character and use a simple transliteration. The tricky part is joining the two sections, whereupon sed injects an unwanted newline.
# Store in hold space
h
# Remove the first character
s/^.//
# Transliterate letters
y/jabcdefghi/0123456789/
# Exchange pattern and hold space
x
# Keep the first character
s/^\(.\).*$/\1/
# Print it
#P
# Join
G
# Remove the newline
s/^\(.\)./\1/
Still learning about sed's capabilities :)

Creating a regex to parse a build version

I'm trying to grab a build version from a file that contains the following line:
<Assembly: AssemblyVersion("004.005.0862")>
and I would like it to return
4.5.862
I'm using sed in DOS and got the following to spit out 004.005.0862
echo "<Assembly: AssemblyVersion("004.005.0862")>" | sed "s/[^0-9,.]//g"
How do I get rid of the leading zeros for each part of the build number?
The regex to do this in a single step looks like this:
^.*"0*([0-9]+\.)0*([0-9]+\.)0*([0-9]+).*
with sed-specific escaping and as a full expression, it becomes a little longer:
s/^.*"0*\([0-9]\+\.\)0*\([0-9]\+\.\)0*\([0-9]\+\).*/\1\2\3/g
The regex breaks down as
^ # start-of-string
.*" # anything, up to a double quote
0*([0-9]+\.) # any number of zeros, then group 1: at least 1 digit and a dot
0*([0-9]+\.) # any number of zeros, then group 2: at least 1 digit and a dot
0*([0-9]+) # any number of zeros, then group 3: at least 1 digit
.* # anything up to the end of the string
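To try the breakdown above outside of sed, here is a small Python sketch using the same three groups. It is only an illustration: VERSION and strip_leading_zeros are names made up here, and it assumes the version is the first quoted dotted triple on the line.

```python
import re

# Same structure as the regex above: skip to a double quote, then
# three numeric parts, each with its leading zeros left outside the group.
VERSION = re.compile(r'^.*"0*([0-9]+\.)0*([0-9]+\.)0*([0-9]+).*')

def strip_leading_zeros(line: str) -> str:
    """Replace the whole line with the version, minus leading zeros."""
    return VERSION.sub(r"\1\2\3", line)

print(strip_leading_zeros('<Assembly: AssemblyVersion("004.005.0862")>'))
```

Because each group requires at least one digit, a part made only of zeros collapses to a single 0 instead of disappearing.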
Maybe ... | sed "s/[^0-9]*0*([1-9][0-9,.]*)/\1/g". I'm using a subpattern to filter out the part you need, ignoring leading zeros and non-numeric characters.
There are probably many more clever ways, but one that works (and is reasonably easy to understand) is to pipe it through additional calls:
echo "version(004.005.0862)" | sed "s/[^0-9,.]//g" | sed "s/^0*//g" | sed "s/\.0*/./g"