Unable to match multiple digits in regex

I am simply trying to print 5 or 6 digit number present in each line.
cat file.txt
Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers
cat file.txt | sed 's/^.*\([0-9]\{5,6\}\+\).*$/\1/'
Current Output
64763
78986
76347
Random string without numbers
Expected Output
64763
778986
676347
The regex doesn't seem to work as intended with 6 digit numbers. It skips the first number of the 6 digit number for some reason and it prints the last line which I don't need as it doesn't contain any 5 or 6 digit number whatsoever

grep is a better fit for this, with the -o option that prints only the matched string:
grep -Eo '[0-9]{5,6}' file
64763
778986
676347
-E is for enabling extended regex mode.
If you really want a sed, this should work:
sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p' file
64763
778986
676347
Details:
-n: Suppress normal output
(^|.*[^0-9]): Match start of line, or any text that ends with a non-digit (capture group #1)
([0-9]{5,6}): Match 5 or 6 digits in capture group #2
.*: Match remaining text
\2: Replacement that puts back only the digits matched in group #2
/p: Print the substituted text
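To see why the question's command drops the first digit, here is a minimal sketch that runs a simplified form of it (without the trailing \+) next to the anchored version on the sample input. The variable names input, greedy and anchored are only for illustration, and -E assumes a sed with extended regex support (GNU or BSD).

```shell
# Sample input from the question.
input='Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers'

# The leading .* is greedy: it consumes as much as it can and leaves
# only the minimum of five digits for the capture group.
greedy=$(printf '%s\n' "$input" | sed 's/^.*\([0-9]\{5,6\}\).*$/\1/')

# Requiring start of line or a non-digit before the group stops .*
# from swallowing leading digits; -n plus /p drops non-matching lines.
anchored=$(printf '%s\n' "$input" | sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p')

printf 'greedy:\n%s\n\nanchored:\n%s\n' "$greedy" "$anchored"
```

The greedy run reproduces the "Current Output" from the question, and the anchored run reproduces the expected one.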

With awk, you could try the following. The simple explanation: use awk's match function with a regex that matches 5 to 6 digits in each line, and if a match is found, print the matched part.
awk 'match($0,/[0-9]{5,6}/){print substr($0,RSTART,RLENGTH)}' Input_file

Related

Remove leading and trailing numbers from string, while leaving 2 numbers, using sed or awk

I have a file containing lines like:
353451word2423157
anotherword
7412yetanother1
3262andherese123anotherline4359013
5342512354325324523andherese123anotherline45913
532453andherese123anotherline413
I'd like to strip most of the leading and tailing numbers (0-9), while still leaving 2 leading and trailing numbers in place, if any...
To clarify, for the list above, the expected output would be:
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
Preferred tools would be sed or awk, but any other suggestions are welcome...
I've tried something like sed 's/[0-9]\+$//' | sed 's/^[0-9]\+//', but obviously this strips all leading and trailing numbers...
You may try this sed:
sed -E 's/^[0-9]+([0-9]{2})|([0-9]{2})[0-9]+$/\1\2/g' file
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
Command Details:
^[0-9]+([0-9]{2}): Match 1+ digits at start if that is followed by 2 digits (captured in a group) and replace with 2 digits in group #1.
([0-9]{2})[0-9]+$: Match 1+ digits at the end if that is preceded by 2 digits (captured in a group) and replace with 2 digits in group #2.
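As a quick check of the two alternatives described above, this sketch feeds a few of the question's sample lines through the same sed command. The variable name trimmed is only for illustration; -E assumes GNU or BSD sed.

```shell
# A few sample lines from the question, piped through the sed command.
trimmed=$(printf '%s\n' \
  '353451word2423157' \
  'anotherword' \
  '7412yetanother1' \
  '532453andherese123anotherline413' |
  sed -E 's/^[0-9]+([0-9]{2})|([0-9]{2})[0-9]+$/\1\2/g')

printf '%s\n' "$trimmed"
```

The g flag matters here: the leading run and the trailing run are two separate matches on the same line, and whichever group did not take part in a given match expands to an empty string.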
Here is an awk that trims digits to a max of 2 on each side of a string:
awk '{ match($0, /^[0-9]*/); lh=RLENGTH
s=substr($0, lh>2 ? lh-1 : 1)
match(s, /[0-9]*$/); rh=RLENGTH
print substr(s, 1, rh>2 ? length(s)-rh+2 : length(s))
}' file
Prints:
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
I suggest using perl:
perl -pe 's/^\d+(?=\d{2})|(\d{2})\d+$/$1/g' file
See the online demo and the regex demo.
Regex details:
^ - start of string
\d+ - one or more digits
(?=\d{2}) - on the right, there must be two digits (not added to the match as the lookahead is a non-consuming pattern)
| - or
(\d{2}) - two digits captured into Group 1 ($1)
\d+ - one or more digits
$ - end of string.
Using GNU awk gensub() function with parentheses in the regexp to mark the components and then specifying them in the replacement (here "\\2\\3")
awk '{print gensub(/^([[:digit:]]*)([[:digit:]]{2})|([[:digit:]]{2})([[:digit:]]*)$/,"\\2\\3","g",$0)}' file
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
I would use GNU AWK in the following way. Let the content of file.txt be
353451word2423157
anotherword
7412yetanother1
3262andherese123anotherline4359013
5342512354325324523andherese123anotherline45913
532453andherese123anotherline413
then
awk 'BEGIN{FPAT="[0-9]+|[^0-9]+";OFS=""}$1~/[0-9]+/{$1=substr($1,length($1)-1)}$NF~/[0-9]+/{$NF=substr($NF,1,2)}{print}' file.txt
output
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
Explanation: I instruct GNU AWK to split each line into fields which consist solely of digits or solely of non-digits using FPAT. If the 1st column ($1) consists of digits, I slice it to get the last 2 characters. If the last column ($NF) consists solely of digits, I slice it to get the first 2 characters. Finally the whole line is printed using an empty string as the output field separator (OFS).
(tested in gawk 4.2.1)

Gawk - Regexp - unable to get results

I have a two column file named names.csv. Field 1 has names with alphabet characters in them. I am trying to find out names where a character repeats e.g. Viijay (and not Vijay)
The command below works and returns all the rows in Field 1
gawk "$1 ~ /[a-z]/ {print $0}" names.csv
To meet the requirement stated above (viz. repeating characters), I have actually used the command below, which does not return any rows
gawk "$1 ~ /[a-z]{1,}/ {print $0}" names.csv
What is the correction needed to get what I am looking for?
To further elaborate, if the values in Column 1/Field 1 are Vijay, Viijay and Vijayini, I want only Viijay to be returned. That is, only values where a character ("i" in the example here) is repeated consecutively (not merely "recurring" as in Vijayini, where the character "i" appears several times in the string but not next to itself).
Requested sample data is:
Vijay 1
Viijay 2
Vijayini 3
and the expected output:
Viijay 2
As awk regex doesn't support backreferences in matching, you need to find the duplicated characters some other way. This one doubles every character in $1 and joins the results into a regex which is then matched against the original string, i.e. Viijay -> re="(VV|ii|ii|jj|aa|yy)"; if($1~re)... (notice that it does not test whether an entry is already in re; you might want to consider adding some checking, with more checking considerations in the comments):
$ awk '
{ # you should test for empty $1
re="(" # reset re
for(i=1;i<=length($1);i++) # for each char in $1
re=re (i==1?"":"|") (b=substr($1,i,1)) b # generate duplicated re entry
re=re ")" # terminating )
if($1~re) # match
print # and print if needed
}' file
Output:
Viijay 2
Ironically or exemplarily, it fails on Busybox awk, in which backreferences can be used:
$ busybox awk '$1~"(.)\\1" {print $0}' file
Viijay,2
Since awk doesn't support backreferences in a regexp you're better off using grep or sed for this:
$ grep '^[^[:space:]]*\([a-z]\)\1' file
Viijay 2
$ sed -n '/^[^[:space:]]*\([a-z]\)\1/p' file
Viijay 2
Backreferences in basic regular expressions (the \(...\) and \1 syntax used here) are specified by POSIX, so this should not be GNU-only.
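A small sketch of the grep command on the question's sample data follows. The fourth line, Ravi aa, is a hypothetical addition to show why the pattern is anchored to the first field; the variable name doubled is likewise only for illustration.

```shell
# Sample data from the question, plus one hypothetical line whose
# doubled letter sits in the second field rather than the first.
doubled=$(printf '%s\n' \
  'Vijay 1' \
  'Viijay 2' \
  'Vijayini 3' \
  'Ravi aa' |
  grep '^[^[:space:]]*\([a-z]\)\1')

printf '%s\n' "$doubled"
```

Because [^[:space:]]* cannot cross the space, the doubled "aa" in the second field of the hypothetical line is never reached, so only Viijay 2 is printed.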
With awk you'd have to do something like the following to first create a regexp that matches 2 repetitions of any character in your specific character set of a-z:
$ awk '{re=$1; gsub(/[^a-z]/,"",re); gsub(/./,"&{2}|",re); sub(/\|$/,"",re)} $1 ~ re' file
Viijay 2
FYI to create a regexp from $1 that would match 2 repetitions of any character it contains, not just a-z, would be:
re=$1; gsub(/[^\\^]/,"[&]{2}|",re); gsub(/[\\^]/,"\\\\&{2}|",re); sub(/\|$/,"",re);
You have to handle ^ differently from other characters, as it is the only character that takes on a non-literal meaning (negation) when it is the first character in a bracket expression, so you have to escape it with a backslash rather than putting it inside a bracket expression to make it literal. You have to handle \ differently because [\] means the same as [], which is an unterminated bracket expression: [ is the start, but ] is just the first character inside the bracket expression, not the ] needed to terminate it.

regex for exact number of colon separated fields

I am searching for a regex which would match exactly 7 occurrences of .*: (7 colon-separated fields).
Unfortunately, what I put together:
grep -E '(.*:){7}' ...
still prints the same lines when I decrease the number in {}.
How can I test for exactly 7 occurrences?
Your problem statement "match 7 colon separated fields" is a good fit for awk:
awk -F: 'NF == 7' file ...
Just in case you still want to use grep to solve the problem:
grep -E '^([^:]*:){7}[^:]*$' file
Details:
^ - start of line
([^:]*:){7} - exactly 7 occurrences of
[^:]* - zero or more chars other than :
: - a colon
[^:]* - zero or more chars other than :
$ - end of line.
See an online demo of the regex (do not rely on this service to check grep patterns validity!)
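The difference between the unanchored and anchored patterns, and between counting colons and counting fields, can be sketched as follows. The three sample lines and the variable names are hypothetical; they contain 6, 7 and 8 colons respectively.

```shell
# Hypothetical sample lines with 6, 7 and 8 colons.
lines='a:b:c:d:e:f:g
a:b:c:d:e:f:g:h
a:b:c:d:e:f:g:h:i'

# Unanchored: .* can span colons, so any line with 7 or more colons matches.
loose=$(printf '%s\n' "$lines" | grep -cE '(.*:){7}')

# Anchored with [^:]*: only the line with exactly 7 colons matches.
strict=$(printf '%s\n' "$lines" | grep -E '^([^:]*:){7}[^:]*$')

# awk counts fields, not colons: a line with 7 colons has NF == 8.
fields=$(printf '%s\n' "$strict" | awk -F: '{print NF}')

printf 'loose matches: %s\nstrict match: %s\nfields: %s\n' "$loose" "$strict" "$fields"
```

Note that the grep pattern counts colons while awk's NF counts fields, so awk -F: 'NF == 7' selects lines with 6 colons. Pick whichever interpretation of "7 fields" fits the data.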

Delete all lines which don't match a pattern

I am looking for a way to delete all lines that do not follow a specific pattern (from a txt file).
Pattern which I need to keep the lines for:
x//x/x/x/5/x/
x could be any amount of characters, numbers or special characters.
5 is always a combination of alphanumeric - 5 characters - e.g Xf1Lh, always appears after the 5th forward slash.
/ are actual forward slashes.
Input:
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
Desired output:
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
grep selects lines according to a regular expression and your x//x/x/x/5/x/ just needs minor changes to make it into a regular expression:
$ grep -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' file
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Explanation:
"x could be any amount of characters, numbers or special characters". In a regular expression that is .* where . means any character and * means zero or more of the preceding character (which in this case is .).
"5 is always a combination of alphanumeric - 5 characters". In POSIX regular expressions, [[:alnum:]] means any alphanumeric character. {5} means five of the preceding. [[:alnum:]] is unicode-safe.
Possible improvements
One issue is how x should be interpreted. In the above, x was allowed to be any character. As triplee points out, however, another reasonable interpretation is that x should be any character except /. In that case:
grep -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
Also, we might want this regex to match only complete lines. In that case, we can either surround the regex with ^ and $ or we can use grep's -x option:
grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
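To illustrate what -x changes, this sketch runs the stricter pattern with and without it on the question's input. The last sample line (with trailing text after the final slash) is a hypothetical addition, and the variable names are only for illustration.

```shell
# Input from the question, plus one hypothetical line with trailing text.
sample='abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
yh//x/x/x/5Fsaf/x/extra'

pattern='[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/'

# Without -x the pattern may match a prefix of the line.
loose_count=$(printf '%s\n' "$sample" | grep -cE "$pattern")

# With -x the whole line has to match the pattern.
kept=$(printf '%s\n' "$sample" | grep -xE "$pattern")

printf 'unanchored matches: %s\nwhole-line matches:\n%s\n' "$loose_count" "$kept"
```

The unanchored run also accepts the hypothetical line ending in "extra", while -x keeps only the two lines from the desired output.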
I was figuring out how to do it in awk at the same time as the other answer and came up with:
awk -F/ '$2=="" && $6~/^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$/ && NF==8' file
The awk I worked it out on didn't support the {5} regexp frob.
You can use the -P option for Perl-compatible regular expressions (PCRE), like
grep -P '^(?:[^/]*/){5}[A-Za-z0-9]{5}(?:/|$)' input
Output
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Regex Breakdown
^ #Start of line
(?: #Non capturing group
[^/]* #Match anything except /
/ #Match / literally
){5} #Repeat this 5 times
[A-Za-z0-9]{5} #Match alphanumerics. You can use \w if you want to allow _ along with [A-Za-z0-9]
(?: #Non capturing group
/ #Next character should be /
| #OR
$ #End of line
)
Using sed and in place edit to delete all lines that do not follow a specific pattern (from a txt file):
$ sed -i.bak -n "/.*\/\/.*\/.*\/.*\/[a-zA-Z0-9]\{5\}\/.*\//p" test.in
$ cat test.in
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
-i.bak edits the file in place, creating a test.in.bak backup file; -n is quiet mode (do not print non-matching lines to output); and the trailing /p prints the lines that match.

egrep command for lines that have one or more instance of 1234 but no other numbers?

So I'm fairly new to regular expressions and I'm wondering how this would be implemented as an egrep command.
I basically want to look for lines in a file that have one or more instances of "1234", but no other numbers. (non-digit characters are allowed).
Examples:
1234 - valid
12341234 - valid
12345 - invalid (since 5 is there)
You can use grep to extract the lines that contain 1234, then replace 1234 with something that doesn't appear in the input, then remove lines that still contain any digits, and replace the special string back by 1234:
< input-file grep 1234 \
| sed 's/1234/\x1/g' \
| grep -v '[0-9]' \
| sed 's/\x1/1234/g'
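The same idea can be sketched in a single awk process: delete every 1234 from a copy of the line, then keep the line only if something was deleted and no digit is left in the copy. This is an alternative under the same assumptions, not the pipeline above; the sample lines and the variable name result are only for illustration.

```shell
# n is the number of 1234 occurrences removed from the copy s;
# the line is printed only if n > 0 and s has no digits left.
result=$(printf '%s\n' \
  'this 1234 is valid' \
  '12341234 valid' \
  'not valid 12345' \
  'not 2 valid 1234 line' \
  'no numbers so not valid' |
  awk '{ s = $0; n = gsub(/1234/, "", s) } n > 0 && s !~ /[0-9]/')

printf '%s\n' "$result"
```

Working on a copy means the original line is printed untouched, so no placeholder character has to be swapped back afterwards.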
So, we want to select lines that have 1234 one or more times but no other digits:
grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
How it works
The regex begins with ^ and ends with $. That means that it must match the whole line.
Inside the regex are two parts:
([^[:digit:]]*1234)+ matches one or more 1234 with no other digits.
[^[:digit:]]* matches any non-digits that follows the last 1234.
In olden times, one would use [0-9] to match digits. With unicode, that is no longer reliable. So, we are using [:digit:] which is unicode safe.
Example
Let's use this test file:
$ cat file
this 1234 is valid
12341234 valid
not valid 12345
not 2 valid 1234 line
no numbers so not valid
Here is the result:
$ grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
this 1234 is valid
12341234 valid
If you want no other digit after your 1234 block:
egrep '\<(1234)+(\>|[^0-9])' *
\< and \> --> word delimiters
(1234) --> the word you're looking for
[^0-9] --> non-digit characters
+ --> one or more times
If you want only "words" made up by the "1234" block, then you can egrep this:
egrep '\<(1234)+\>' *
\< and \> --> word delimiters
(1234) --> the word you're looking for
+ --> one or more times.