sed regex pattern for some tricky line (ini section)

I want to parse something like a section entry in an *.ini file:
line=' [ fdfd fdf f ] '
What sed pattern (???) would extract
'fdfd fdf f'
from this line?
So:
echo "${line}" | sed -E 's/???/\1/g'
How can I describe all chars except [[:space:]], [ and ] ? This doesn't work for me: [^[[:space:]]\[]* .

When you use the [[:space:]] syntax, the outer brackets are normal "match one character from this list" brackets, the same as in [aeiou] but the inner brackets are part of [:space:] which is an indivisible unit.
So if you wanted to match a single character which either belongs to the space class or is an x you'd use [[:space:]x] or [x[:space:]]
When one of the characters you want to match is a ], it will terminate the bracketed character list unless you give it some special treatment. You've guessed that you need a backslash somewhere; a good guess but wrong. The way you include a ] in the list is to put it first. [ab]c] is a bracketed list containing the 2 characters ab, followed by 2 literal-match characters c], so it matches "ac]" or "bc]" but []abc] is a bracketed list of the 4 characters ]abc so it matches "a", "b", "c", or "]".
In a negated list the ] comes immediately after the ^.
So putting that all together, the way to match a single char from the set of all chars except the [:space:] class and the brackets is:
[^][:space:][]
The first bracket and the last bracket are a matching pair, even if you think it doesn't look like they should be.
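As a hedged illustration of that bracket expression, here is a minimal sketch applied to the line from the question. It assumes a sed that accepts -E (as the question's own command does); the variable name section is only for illustration.

```shell
line=' [ fdfd fdf f ] '

# [^][]*          : any run of characters other than ] and [
# [^][:space:][]  : one character that is not ], not whitespace, and not [
# The captured group must end on a non-space character, which trims trailing blanks.
section=$(printf '%s\n' "$line" |
  sed -E 's/^[^][]*\[[[:space:]]*([^][]*[^][:space:][])[[:space:]]*\].*$/\1/')

printf '%s\n' "$section"   # prints: fdfd fdf f
```

Ending the group with the negated class is what keeps the inner spaces of the section name while dropping the padding next to the brackets.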

$ echo "$line" | sed "s/^.*\[[[:space:]]*\([^]]*[^][:space:]]\)[[:space:]]*\].*$/'\1'/"
You can split the pattern into two:
$ echo "$line" | sed "s/^.*\[[[:space:]]*/'/; s/[[:space:]]*\].*$/'/"
awk works too:
$ echo "$line" | awk -F' *[[\]] *' -vQ="'" '{print Q$2Q}'

When you say split, do you mean split into an array, or do you mean filter out all spaces and brackets?
Assuming the value of line number 1 in file.ini is:
[ fdfd fdf f ]
If you mean array,
linenumber=1
array=($(sed -n "${linenumber}p" file.ini | sed 's/[][]*//g'))
will split line number 1 of file.ini into an array and return the values:
${array[0]} = fdfd
${array[1]} = fdf
${array[2]} = f
If you mean filter spaces and brackets,
linenumber=1
sed -n "${linenumber}p" file.ini | sed 's/[][ ]//g'
will return:
fdfdfdff
and if neither of those is what you meant, please specify the exact output you are looking to extract from the initial value so that we can address it correctly.

Related

regex for capturing a number with a range of digits in AWK

I'm trying to capture numbers inside a file using AWK. I can capture all of them, but I'm not able to capture only those with a certain number of digits. What am I doing wrong?
echo -e "$teste" | awk '/_OA/ { match($0,/\[\([:digit:]{4,13}\]/);oa = substr($0,RSTART,RLENGTH);print oa}'
File sample:
_OA ............. [6712227000168]
_OA Tasdsd, OA .. [91][355016]
_OA Tasdsd, DA .. [91][5512987000]
Expected:
6712227000168
355016
5512987000
Hint for the regex match answers:
Thanks so much for all the answers. I found a link explaining that I need to use the --posix option because of my awk version.
With your shown samples, please try the following awk solution. It simply sets the field separator to ] OR [ and, in the main block, checks whether the line starts with _OA, then prints the second-to-last field.
awk -F"[][]" '/^_OA /{print $(NF-1)}' Input_file
You could update the pattern and the values for RSTART and RLENGTH to not match the leading and trailing square brackets.
The digits part should be [[:digit:]] and there is a \( in the pattern that matches ( that should not be there.
awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}\]/);oa = substr($0,RSTART+1,RLENGTH-2);print oa}' <<< "$teste"
Output
6712227000168
355016
5512987000
As there are multiple occurrences of digits between square brackets, if you want to match multiple occurrences:
teste='_OA Tasdsd, OA .. [91][355016][123456789][1][9999]'
awk '/_OA/ {
while(match($0,/\[[[:digit:]]{4,13}]/)){
start=RSTART+1; len=RLENGTH-2
s=substr($0,start,len)
res=res?res","s:s
$0=substr($0,start+len)
}
print res
res = ""
}' <<< "$teste"
Output
355016,123456789,9999
Your regexp \[\([:digit:]{4,13}\] says:
1. \[ = the literal character [
2. \( = the literal character (
3. [:digit:] = a bracket expression containing a character set of the characters :, d, i, g, t
4. {4,13} = a regexp interval that's 4 to 13 repetitions of the preceding bracket expression
5. \] = the literal character ]
The 2 main issues with that which are causing your regexp to be unable to match any of your input are:
You don't have any (s in your input (from #2 above), and
To match digits you need a character class [:digit:] inside a bracket expression [[:digit:]], not a character set :digit: inside a bracket expression [:digit:] (from #3 above)
You also don't actually need to escape the ] at the end of the regexp as it's only a regexp metachar (end of bracket expression) if preceded by a matching unescaped [ (start of bracket expression).
So the regexp I think you wanted to write instead would have been:
\[[[:digit:]]{4,13}]
e.g.:
$ awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);oa = substr($0,RSTART,RLENGTH);print oa}' file
[6712227000168]
[355016]
[5512987000]
or to only print the numbers:
$ awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);oa = substr($0,RSTART+1,RLENGTH-2);print oa}' file
6712227000168
355016
5512987000
If you're not married to awk:
grep -Eo '[[:digit:]]{4,13}'
With GNU awk:
gawk 'match($0, /[[:digit:]]{4,13}/, m) {print m[0]}'
but that only matches the first such number in each record. To find them all:
gawk '{
line = $0
while (match(line, /[[:digit:]]{4,13}/, m)) {
print m[0]
line = substr(line, m[0,"start"] + m[0,"length"])
}
}'
Ref the match function in the gawk manual.
You can use
awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);print substr($0,RSTART+1,RLENGTH-2)}'
See the online demo:
#!/bin/bash
s='_OA ............. [6712227000168]
_OA Tasdsd, OA .. [91][355016]
_OA Tasdsd, DA .. [91][5512987000]'
awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);print substr($0,RSTART+1,RLENGTH-2)}' <<< "$s"
Output:
6712227000168
355016
5512987000
Details:
\[ - a [ char
[[:digit:]]{4,13} - four to thirteen digits (note that the [:digit:] POSIX character class must be used within [...], a bracket expression)
] - a ] char (it is not special, no need escaping)
And substr($0,RSTART+1,RLENGTH-2) means that we
$0 - take the whole input record
RSTART+1 - starting with the second char
RLENGTH-2 - and then as many characters as is the match length - 2 (thus getting rid of enclosing [ and ] chars)
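To make the RSTART and RLENGTH arithmetic concrete, here is a small sketch that prints both values next to the extracted number for one of the sample lines. It is an illustration only: the regexp spells out "four or more digits" without an interval expression (so it has no upper bound of 13), on the assumption that some awk versions need an extra option for intervals, as the asker found. The variable names line and result are made up for the example.

```shell
line='_OA Tasdsd, OA .. [91][355016]'

# Four or more digits between brackets, written without {4,13}.
result=$(printf '%s\n' "$line" | awk '{
  if (match($0, /\[[[:digit:]][[:digit:]][[:digit:]][[:digit:]]+\]/)) {
    # RSTART points at "[", RLENGTH counts both brackets
    print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 2)
  }
}')

printf '%s\n' "$result"   # prints: 23 8 355016
```

The match starts at the second opening bracket (position 23) and is 8 characters long, so skipping one character and taking 6 leaves just the digits.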

Gawk - Regexp - unable to get results

I have a two column file named names.csv. Field 1 has names with alphabet characters in them. I am trying to find out names where a character repeats e.g. Viijay (and not Vijay)
The command below works and returns all the rows in Field 1
gawk "$1 ~ /[a-z]/ {print $0}" names.csv
To meet the requirement stated above (viz. repeating characters), I have actually used the command below, which does not return any rows
gawk "$1 ~ /[a-z]{1,}/ {print $0}" names.csv
What is the correction needed to get what I am looking for?
To further elaborate, if the values in Column 1/Field 1 are Vijay, Viijay and Vijayini, i want only Viijay to be returned. That is, only values where a character ("i" in the example here) is repeated (not "recurring" as in Vijayini where the character "i" is recurring in the string but not clustered together.)
Requested sample data is:
Vijay 1
Viijay 2
Vijayini 3
and the expected output:
Viijay 2
As awk regex doesn't support backreferences in matching, you need to find the duplicated characters some other way. This one doubles every character in $1 and collects the results in a regex, which is then matched against the original string, i.e. Viijay -> re="(VV|ii|ii|jj|aa|yy)"; if($1~re)... (Note that it does not test whether an entry is already in re; you might want to add such a check. More considerations are in the comments.):
$ awk '
{ # you should test for empty $1
re="(" # reset re
for(i=1;i<=length($1);i++) # for each char in $1
re=re (i==1?"":"|") (b=substr($1,i,1)) b # generate duplicated re entry
re=re ")" # terminating )
if($1~re) # match
print # and print if needed
}' file
Output:
Viijay 2
Ironically (or exemplarily), the script above fails on Busybox awk, in which backreferences can be used:
$ busybox awk '$1~"(.)\\1" {print $0}' file
Viijay,2
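A variant that avoids building a regex at all is to compare each character of $1 with its neighbour. This is a different technique from the one in the answer above, sketched here under the assumption that the comparison should be case-sensitive and that the sample data is the three lines from the question; the variable name matches is illustrative.

```shell
matches=$(printf '%s\n' 'Vijay 1' 'Viijay 2' 'Vijayini 3' | awk '{
  n = length($1)
  for (i = 1; i < n; i++) {
    # compare each character of $1 with the one that follows it
    if (substr($1, i, 1) == substr($1, i + 1, 1)) {
      print
      break
    }
  }
}')

printf '%s\n' "$matches"   # prints: Viijay 2
```

Because no regex is generated from the data, there is nothing to escape and an empty $1 simply produces no output.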
Since awk doesn't support backreferences in a regexp you're better off using grep or sed for this:
$ grep '^[^[:space:]]*\([a-z]\)\1' file
Viijay 2
$ sed -n '/^[^[:space:]]*\([a-z]\)\1/p' file
Viijay 2
Backreferences such as \1 in basic regular expressions are specified by POSIX, so this should not be GNU-only.
With awk you'd have to do something like the following to first create a regexp that matches 2 repetitions of any character in your specific character set of a-z:
$ awk '{re=$1; gsub(/[^a-z]/,"",re); gsub(/./,"&{2}|",re); sub(/\|$/,"",re)} $1 ~ re' file
Viijay 2
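To see what kind of regexp that approach generates, the sketch below prints the generated string instead of matching with it. One deliberate difference, labeled here as an assumption: it writes each letter twice ("&&") rather than using the {2} interval, so that it does not depend on interval support in the awk being used. The variable name generated is illustrative.

```shell
generated=$(printf '%s\n' 'Vijay 1' 'Viijay 2' | awk '{
  re = $1
  gsub(/[^a-z]/, "", re)   # keep only the lowercase letters
  gsub(/./, "&&|", re)     # turn each letter x into "xx|"
  sub(/\|$/, "", re)       # drop the trailing |
  print $1 " -> " re
}')

printf '%s\n' "$generated"
# prints:
# Vijay -> ii|jj|aa|yy
# Viijay -> ii|ii|jj|aa|yy
```

Only Viijay contains one of its own alternatives (ii), which is why it is the only row the original one-liner prints.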
FYI to create a regexp from $1 that would match 2 repetitions of any character it contains, not just a-z, would be:
re=$1; gsub(/[^\\^]/,"[&]{2}|",re); gsub(/[\\^]/,"\\\\&{2}|",re); sub(/\|$/,"",re);
You have to handle ^ differently from other characters, as that's the only character that has a different meaning than literal when it's the first character in a bracket expression (i.e. negation), so you have to escape it with a backslash rather than putting it inside a bracket expression to make it literal. You have to handle \ differently because [\] means the same as [], which is an unterminated bracket expression: [ is the start, but ] is just the first character inside the bracket expression, not the ] needed to terminate it.

Can't understand this awk regex

I'm trying to understand a particular line of code from a Unix talk, and can't seem to understand what the awk portion is doing.
The full line is: man ls | col -b | grep '^[[:space:]]*ls \[' | awk -F '[][]' '{print $2}'. The text passed to awk (if for some reason you don't have the man program) is: ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]. Somehow, awk is able to just pull out the list of options to ls, but I can't really understand how this regex [][] actually works & what it matches for.
My best guess is that the outer brackets denote a character class whose contents contain ][. If that's the case, why can't the inner brackets be written as []. Is it because pairs of brackets [[]] have a different meaning in awk?
Thanks in advance!
In POSIX regular expressions [...] is called a bracket expression.
It is very similar to a character class in other regex flavors. One key difference is that the backslash is NOT a meta-character in a POSIX bracket expression.
If you want to include [ and ] in a bracket expression, they need to be placed correctly, i.e. ] right at the start and [ anywhere after it.
As per the linked article:
To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ].
In your example:
awk -F '[][]' '...'
awk sets (input) field separator as single literal [ or ] character.
If you had [[]] it would mean the bracket expression [[] (a single [) followed by a literal ], so the field separator would be the two-character string []:
$ echo a[]b | awk -F'[[]]' '{print $2}'
b
But then the brackets other way around:
$ echo a][b | awk -F'[][]' '{print $3}'
b
Now $2 is empty and $3 is b, because ] and [ each act as a separate single-character separator.
Your hunch about character classes is correct. If you want certain characters to be field separators, then you can list them between brackets. Using awk -F '[abc]' ... would specify the a and b and c characters as separators. Order is irrelevant; you could use awk -F '[cab]' ... and get the same results.
But what if you want the separating characters to be left and right brackets themselves? The documentation for regular expressions (man re_format on many systems) says this:
To include a literal `]' in the list, make it the first character ...
Which makes sense, given how the expression will be parsed. As the parser is scanning the expression, it's looking for the end, the right bracket. It doesn't care about seeing another left bracket or a comma or a space or whatever, but a right bracket would mark the end unless there's some way to tell the parser to take it literally. Since brackets with nothing between them, [], would be useless, a right bracket as the first character is defined to mean something else: this can't be the end, so take this right-bracket literally.
So if you want brackets as field-separating characters, you list [ and ] between brackets, but you put the right bracket first in the list so it'll be taken literally, per the instructions: [][]
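To check how the line is actually split, a quick sketch is to print every field with its index. The input is the ls synopsis quoted in the question; the variable names synopsis and fields and the <...> markers are only there to make empty and blank fields visible.

```shell
synopsis='ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]'

# Each single [ or ] is a separator, so the text between them becomes a field.
fields=$(printf '%s\n' "$synopsis" |
  awk -F '[][]' '{ for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i }')

printf '%s\n' "$fields"
# prints:
# 1=<ls >
# 2=<-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1>
# 3=< >
# 4=<file ...>
# 5=<>
```

Field 2 is the option list, which is why the original command prints $2.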

regex match square brackets once

There is a text file with the following info:
[[parent]]
[son]
[daughter]
How to get only [son] and [daughter]?
$0 ~ /\[([a-z])*\]/ ???
Your regex is almost right. Just put the * inside the round brackets (so that the whole text ends up in the single group) and remember to use the ^ and $ anchors (to avoid matching [[parent]]):
^\[([a-z]*)\]$
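As a quick check of the anchored pattern, here is a minimal sketch using grep -E on the three lines from the question. The capturing group is dropped because grep only filters lines here; the variable name kept is illustrative.

```shell
kept=$(printf '%s\n' '[[parent]]' '[son]' '[daughter]' | grep -E '^\[[a-z]*\]$')

printf '%s\n' "$kept"
# prints:
# [son]
# [daughter]
```

[[parent]] is rejected because the character after the first [ is another [, which [a-z]* cannot consume and \] cannot match.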
Match an opening square bracket at the beginning of the line where the next character is alphabetic.
awk '/^\[[a-z]/' file
You might want to add uppercase and/or numbers to the character class, depending on what your real requirements are. (Your examples show only lowercase, so I have assumed that's a valid generalization.)
You can use this awk command:
awk -F '[][]+' 'NF && !/\[\[/{print $2}' file
son
daughter
awk command breakup:
-F '[][]+' # set input field separator as 1 or more of [ or ]
NF # only if at least one field is found
!/\[\[/ # when input doesn't contain [[

Grep regular expression for digits in character string of variable length

I need some way to find words that contain any combination of characters and digits but exactly 4 digits only, and at least one character.
EXAMPLE:
a1a1a1a1 // Match
1234 // NO match (no characters)
a1a1a1a1a1 // NO match
ab2b2 // NO match
cd12 // NO match
z9989 // Match
1ab26a9 // Match
1ab1c1 // NO match
12345 // NO match
24 // NO match
a2b2c2d2 // Match
ab11cd22dd33 // NO match
To match a digit in grep you can use [0-9]. To match anything but a digit, you can use [^0-9]. Since there can be any number of non-digit characters, or none, you add a "*" (any number of the preceding). So what you'll want is logically
(anything not a digit or nothing)* (any single digit) (anything not a digit or nothing)* ....
until you have 4 "any single digit" groups. i.e. [^0-9]*[0-9]...
I find that with long grep patterns, especially with long strings of special chars that need to be escaped, it's best to build up slowly so you're sure you understand what's going on. For example,
#this will highlight your matches, and make it easier to understand
alias grep='grep --color=auto'
echo 'a1b2' | grep '[0-9]'
will show you how it's matching. You can then extend the pattern once you understand each part.
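Extended all the way, the pattern described above might look like the following sketch. It assumes grep -E for the {4} repetition and adds a second grep to require at least one non-digit, since the four-digit pattern alone would also accept 1234; the variable names words and found are illustrative.

```shell
words='a1a1a1a1 1234 a1a1a1a1a1 ab2b2 cd12 z9989 1ab26a9 1ab1c1 12345 24 a2b2c2d2 ab11cd22dd33'

# First grep : exactly four digits, each optionally preceded by non-digits.
# Second grep: at least one character that is not a digit.
found=$(printf '%s\n' "$words" | tr ' ' '\n' |
  grep -E '^([^0-9]*[0-9]){4}[^0-9]*$' |
  grep '[^0-9]')

printf '%s\n' "$found"
# prints:
# a1a1a1a1
# z9989
# 1ab26a9
# a2b2c2d2
```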
I'm not sure about all the other input you might take (i.e. is ax12ax12ax12ax12 valid?), but this works for strictly alternating input such as a1a1a1a1 (it requires exactly eight characters, so it will not match z9989 or 1ab26a9):
%> grep -P "^(?:\w\d){4}$" fileWithInput
With grep:
grep -iE '^([a-z]*[0-9]){4}[a-z]*$' | grep -vE '^[0-9]{4}$'
Do it in one pattern with Perl:
perl -ne 'print if /^(?!\d{4}$)([^\W\d_]*\d){4}[^\W\d_]*$/'
The funky [^\W\d_] character class is a cosmopolitan way to spell [A-Za-z]: it catches all letters rather than only the English ones.
If you don't mind using a little shell as well, you could do something like this:
echo "a1a1a1a1" |grep -o '[0-9]'|wc -l
which would display the number of digits found in the string. If you like, you could then test for a given number of matches:
max_match=4
[ "$(echo "a1da4a3aaa4a4" | grep -o '[0-9]'|wc -l)" -le $max_match ] || echo "too many digits."
Assuming you only need ASCII, and you can only access the (fairly primitive) regexp constructs of grep, the following should be pretty close:
grep '^[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*$' | grep '[a-zA-Z]'
You might try
[^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*
But this will match 1234. Why doesn't that match your criteria?
The regex for that is:
([A-Za-z]\d){4}
[A-Za-z] - a character class matching one letter
\d - a digit
You wrap them in () to group them, indicating the format: a letter followed by a digit
{4} - indicates that the group must repeat 4 times
You can use a plain shell script; there is no need for a complicated regex.
var=a1a1a1a1
alldigits=${var//[^0-9]/}
allletters=${var//[0-9]/}
case "${#alldigits}" in
4)
if [ "${#allletters}" -gt 0 ];then
echo "ok: 4 digits and letters: $var"
else
echo "Invalid: all numbers and exactly 4: $var"
fi
;;
*) echo "Invalid: $var";;
esac
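The same idea can be wrapped in a function and run over the sample words. This is a sketch under a few assumptions: the helper name has_four_digits is hypothetical, and it uses tr instead of the ${var//...} substitution so that it does not depend on bash-specific syntax.

```shell
# Hypothetical helper: succeeds when $1 has exactly 4 digits
# and at least one character that is not a digit.
has_four_digits() {
  digits=$(printf '%s' "$1" | tr -cd '0-9')   # keep only the digits
  others=$(printf '%s' "$1" | tr -d '0-9')    # keep everything else
  [ "${#digits}" -eq 4 ] && [ -n "$others" ]
}

accepted=''
for word in a1a1a1a1 1234 a1a1a1a1a1 ab2b2 cd12 z9989 1ab26a9 1ab1c1 12345 24 a2b2c2d2 ab11cd22dd33; do
  if has_four_digits "$word"; then
    accepted="$accepted$word "
  fi
done

printf '%s\n' "$accepted"   # prints: a1a1a1a1 z9989 1ab26a9 a2b2c2d2
```

Note that "others" here means any non-digit, so punctuation would count as a character too; tighten the tr sets if only letters should qualify.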
Thanks for your answers.
Finally I wrote a script, and it works perfectly:
./P ab2b2 cd12 z9989 1ab26a9 1ab1c1 1234 24 a2b2c2d2
#!/bin/bash
# "$*" expands to the arguments themselves ("$#" is only their count)
echo "$*" | tr -s " " "\n" > sorting
cat sorting | while read tostr
do
l=$(echo $tostr|tr -d "\n"|wc -c)
temp=$(echo $tostr|tr -d a-z|tr -d "\n" | wc -c)
if [ $temp -eq 4 ]; then
if [ $l -gt 4 ]; then
printf "%s " "$tostr"
fi
fi
done
echo