regex for exact number of colon separated fields - regex

I am searching for a regex which would match exactly 7 occurrences of .*: (7 colon-separated fields).
Unfortunately, what I came up with:
grep -E '(.*:){7}' ...
also prints the same lines when I decrease the number in {}.
How do I test for exactly 7 occurrences?

Your problem statement "match 7 colon separated fields" is a good fit for awk:
awk -F: 'NF == 7' file ...

Just in case you still want to use grep to solve the problem:
grep -E '^([^:]*:){7}[^:]*$' file
Details:
^ - start of line
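([^:]*:){7} - exactly 7 occurrences of
  [^:]* - zero or more chars other than :
  : - a colon
[^:]* - zero or more chars other than :
$ - end of line.
Note that this requires exactly 7 colons, which awk counts as 8 fields (the last one may be empty); to select the same lines as awk -F: 'NF == 7', use {6} instead of {7}.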
See an online demo of the regex (do not rely on this service to check grep patterns validity!)
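To see why the unanchored pattern from the question keeps matching when the count is lowered, here is a minimal sketch. The sample lines and the count_matches helper are made up for illustration, and it assumes a POSIX shell with grep -E.

```shell
# Hypothetical sample: lines with 6, 7 and 8 colons.
sample=$(printf '%s\n' 'a:b:c:d:e:f:g' 'a:b:c:d:e:f:g:h' 'a:b:c:d:e:f:g:h:i')

# count_matches PATTERN: number of sample lines matched by grep -E PATTERN.
count_matches() {
    printf '%s\n' "$sample" | grep -Ec "$1"
}

unanchored=$(count_matches '(.*:){7}')          # at least 7 colons anywhere
anchored=$(count_matches '^([^:]*:){7}[^:]*$')  # exactly 7 colons

echo "unanchored: $unanchored, anchored: $anchored"
```

The unanchored pattern counts 2 lines (7 and 8 colons) because .* can absorb extra colons, while the anchored one counts only the line with exactly 7.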

Related

Remove leading and trailing numbers from string, while leaving 2 numbers, using sed or awk

I have a file containing lines like:
353451word2423157
anotherword
7412yetanother1
3262andherese123anotherline4359013
5342512354325324523andherese123anotherline45913
532453andherese123anotherline413
I'd like to strip most of the leading and trailing numbers (0-9), while still leaving 2 leading and trailing numbers in place, if any...
To clarify, for the list above, the expected output would be:
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
Preferred tools would be sed or awk, but any other suggestions are welcome...
I've tried something like sed 's/[0-9]\+$//' | sed 's/^[0-9]\+//', but obviously this strips all leading and trailing numbers...
You may try this sed:
sed -E 's/^[0-9]+([0-9]{2})|([0-9]{2})[0-9]+$/\1\2/g' file
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
Command Details:
^[0-9]+([0-9]{2}): Match 1+ digits at start if that is followed by 2 digits (captured in a group) and replace with 2 digits in group #1.
([0-9]{2})[0-9]+$: Match 1+ digits at the end if that is preceded by 2 digits (captured in a group) and replace with 2 digits in group #2.
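A quick way to check both branches is to also feed the command a line with fewer than three digits on each side, which should pass through unchanged. This is only a sketch: the trim_digits wrapper name and the second sample line are invented here, and it assumes a sed that accepts -E.

```shell
# trim_digits: hypothetical wrapper around the sed command above; reads stdin.
trim_digits() {
    sed -E 's/^[0-9]+([0-9]{2})|([0-9]{2})[0-9]+$/\1\2/g'
}

# A line from the question, and an invented line with short digit runs.
long=$(printf '%s\n' '353451word2423157' | trim_digits)
short=$(printf '%s\n' '12word3' | trim_digits)

echo "$long"    # 51word24
echo "$short"   # 12word3 (neither branch matches, so nothing changes)
```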
Here is an awk that trims digits to a max of 2 on each side of a string:
awk '{ match($0, /^[0-9]*/); lh=RLENGTH
s=substr($0, lh>2 ? lh-1 : 1)
match(s, /[0-9]*$/); rh=RLENGTH
print substr(s, 1, rh>2 ? length(s)-rh+2 : length(s))
}' file
Prints:
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
I suggest using perl:
perl -pe 's/^\d+(?=\d{2})|(\d{2})\d+$/$1/g' file
See the online demo and the regex demo.
Regex details:
^ - start of string
\d+ - one or more digits
(?=\d{2}) - on the right, there must be two digits (not added to the match as the lookahead is a non-consuming pattern)
| - or
(\d{2}) - two digits captured into Group 1 ($1)
\d+ - one or more digits
$ - end of string.
Using GNU awk gensub() function with parentheses in the regexp to mark the components and then specifying them in the replacement (here "\\2\\3")
awk '{print gensub(/^([[:digit:]]*)([[:digit:]]{2})|([[:digit:]]{2})([[:digit:]]*)$/,"\\2\\3","g",$0)}' file
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
I would use GNU AWK in the following way. Let the content of file.txt be
353451word2423157
anotherword
7412yetanother1
3262andherese123anotherline4359013
5342512354325324523andherese123anotherline45913
532453andherese123anotherline413
then
awk 'BEGIN{FPAT="[0-9]+|[^0-9]+";OFS=""}$1~/[0-9]+/{$1=substr($1,length($1)-1)}$NF~/[0-9]+/{$NF=substr($NF,1,2)}{print}' file.txt
output
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
Explanation: I instruct GNU AWK to split each line into fields which consist solely of digits or solely of non-digits using FPAT. If the 1st column ($1) consists of digits, I slice it to get the last 2 characters. If the last column ($NF) consists of digits, I slice it to get the first 2 characters. Finally the whole line is printed using an empty string as the output field separator (OFS).
(tested in gawk 4.2.1)

Unable to match multiple digits in regex

I am simply trying to print 5 or 6 digit number present in each line.
cat file.txt
Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers
cat file.txt | sed 's/^.*\([0-9]\{5,6\}\+\).*$/\1/'
Current Output
64763
78986
76347
Random string without numbers
Expected Output
64763
778986
676347
The regex doesn't seem to work as intended with 6 digit numbers. It skips the first number of the 6 digit number for some reason and it prints the last line which I don't need as it doesn't contain any 5 or 6 digit number whatsoever
grep is a better fit for this, with the -o option that prints only the matched string:
grep -Eo '[0-9]{5,6}' file
64763
778986
676347
-E is for enabling extended regex mode.
If you really want a sed, this should work:
sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p' file
64763
778986
676347
Details:
-n: Suppress normal output
(^|.*[^0-9]): Match start or anything that is followed by a non-digit
([0-9]{5,6}): Match 5 or 6 digits in capture group #2
.*: Match remaining text
\2: is replacement that puts matched digits back in replacement
/p prints substituted text
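The original attempt drops the first digit of a 6-digit number because the leading .* is greedy: it consumes as much as it can while still leaving the 5 digits that the {5,6} interval minimally needs. A short sketch comparing the two commands on one of the sample lines (the variable names are illustrative only):

```shell
line='Random2 Some String abc-778986'

# Greedy leading .* swallows the first digit, leaving only 5 for the group.
# (The redundant \+ after the interval from the question is dropped here.)
greedy=$(printf '%s\n' "$line" | sed 's/^.*\([0-9]\{5,6\}\).*$/\1/')

# Requiring a non-digit (or the line start) before the group prevents that.
fixed=$(printf '%s\n' "$line" | sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p')

echo "greedy: $greedy"
echo "fixed:  $fixed"
```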
With awk, you could try the following. In short: awk's match function is given a regex that matches 5 to 6 digits; if a match is found in a line, the matched part is printed.
awk 'match($0,/[0-9]{5,6}/){print substr($0,RSTART,RLENGTH)}' Input_file

Regex, select the line that starts with my condition, but take only the characters after space

I have a file that has content similar to the below:
ptrn: 435324kjlkj34523453
Note1: rtewqtiojdfgkasdktewitogaidfks
Note2: t4rwe3tewrkterqwotkjrekqtrtlltre
I am trying to get the characters after the space on the line that starts with "ptrn:". I am trying the command below:
>>> cat daily.txt | grep '^p.*$' > dailynew.txt
and I am getting the result in the new file:
ptrn: 435324kjlkj34523453
But I want only the characters after space, which are " 435324kjlkj34523453" to be written in the new file without "ptrn:" at the beginning.
The result should be like:
435324kjlkj34523453
How can I achieve this with an efficient regex?
You can use
grep -oP '^ptrn:\s*\K.*' daily.txt > dailynew.txt
awk '/^ptrn:/{print $2}' daily.txt > dailynew.txt
sed -n 's/^ptrn:[[:space:]]*\(.*\)/\1/p' daily.txt > dailynew.txt
See the online demo. All output 435324kjlkj34523453.
In the grep PCRE regex (enabled with -P option) the patterns match
^ - the start of string
ptrn: - a ptrn: substring
\s* - zero or more whitespaces
\K - match reset operator that clears the current match value
.* - the rest of the line.
In the awk command, ^ptrn: regex is used to find the line starting with ptrn: and then {print $2} prints the value after the first whitespace, from the second "column" (since the default field separator in awk is whitespace).
In sed, the command means
-n - suppresses the default line output
s - substitution command is used
^ptrn:[[:space:]]*\(.*\) - start of string, ptrn:, zero or more whitespace, and the rest of the line captured into Group 1
\1 - replaces the match with group 1 value
p - prints the result of the substitution.
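One difference between these commands is worth checking: the awk command prints only the second whitespace-separated field, while the sed command keeps everything after the prefix. The sample value with an embedded space below is invented to show this; it is a sketch, not part of the original data.

```shell
# Hypothetical line whose value contains a space.
line='ptrn: 4353 24kj'

with_awk=$(printf '%s\n' "$line" | awk '/^ptrn:/{print $2}')
with_sed=$(printf '%s\n' "$line" | sed -n 's/^ptrn:[[:space:]]*\(.*\)/\1/p')

echo "awk: $with_awk"   # only the second field
echo "sed: $with_sed"   # the rest of the line
```

For the data in the question, where the value has no spaces, both produce the same result.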
You can use this sed:
sed -nE 's/^ptrn: (.*)/\1/p' file > output_file.txt

Delete all lines which don't match a pattern

I am looking for a way to delete all lines that do not follow a specific pattern (from a txt file).
Pattern which I need to keep the lines for:
x//x/x/x/5/x/
x could be any amount of characters, numbers or special characters.
5 is always a combination of alphanumeric - 5 characters - e.g Xf1Lh, always appears after the 5th forward slash.
/ are actual forward slashes.
Input:
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
Desired output:
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
grep selects lines according to a regular expression and your x//x/x/x/5/x/ just needs minor changes to make it into a regular expression:
$ grep -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' file
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Explanation:
"x could be any amount of characters, numbers or special characters". In a regular expression that is .* where . means any character and * means zero or more of the preceding character (which in this case is .).
"5 is always a combination of alphanumeric - 5 characters". In POSIX regular expressions, [[:alnum:]] means any alphanumeric character. {5} means five of the preceding. [[:alnum:]] is unicode-safe.
Possible improvements
One issue is how x should be interpreted. In the above, x was allowed to be any character. As triplee points out, however, another reasonable interpretation is that x should be any character except /. In that case:
grep -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
Also, we might want this regex to match only complete lines. In that case, we can either surround the regex with ^ and $ or we can use grep's -x option:
grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
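The three variants can be compared side by side. This is a rough sketch with invented test lines (only the first follows the pattern from the question); the count helper is hypothetical and simply wraps grep -c.

```shell
# Three invented lines: one fully valid, one with an extra segment in the
# middle, one with an extra leading segment.
lines=$(printf '%s\n' \
    'yh//x/x/x/5Fsaf/x/' \
    'a//b/c/d/e/Xf1Lh/g/' \
    'zz/abc//a/123/gds:/4AdFg/f3dsg34/')

# count GREP_ARGS...: number of sample lines selected by grep with those args.
count() { printf '%s\n' "$lines" | grep -c "$@"; }

loose=$(count -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/')
strict=$(count -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/')
whole=$(count -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/')

echo "loose=$loose strict=$strict whole=$whole"
```

The .* version accepts all three lines because .* may span slashes, the [^/]* version rejects the extra middle segment, and only -x also rejects the extra leading segment.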
I was figuring out how to do it in awk at the same time as the other answer and came up with:
awk -F/ 'BEGIN{OFS=FS}$2==""&&$6~/^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$/&&NF==8'
The awk I worked it out on didn't support the {5} regexp frob.
You can use the -P option for Perl-compatible regex support, like
grep -P "^(?:[^/]*/){5}[A-Za-z0-9]{5}(?:/|$)" input
Output
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Regex Breakdown
^ #Start of line
(?: #Non capturing group
[^/]* #Match anything except /
/ #Match / literally
){5} #Repeat this 5 times
[A-Za-z0-9]{5} #Match alphanumerics. You can use \w if you want to allow _ along with [A-Za-z0-9]
(?: #Non capturing group
/ #Next character should be /
| #OR
$ #End of line
)
Using sed and in place edit to delete all lines that do not follow a specific pattern (from a txt file):
$ sed -i.bak -n "/.*\/\/.*\/.*\/.*\/[a-zA-Z0-9]\{5\}\/.*\//p" test.in
$ cat test.in
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
-i.bak in place edit creating a test.in.bak backup file, -n quiet, do not print non-matches to output
and ".../p" print matches.

Shell script linux, validating integer

This code checks whether a character is an integer or not (I think). I'm trying to understand what each part of that line means by checking the grep man pages, but it's really difficult for me. I found it on the internet. Could anyone explain the grep part, and what each thing in it means?
echo $character | grep -Eq '^(\+|-)?[0-9]+$'
Thanks people!!!
Analyse this regex:
'^(\+|-)?[0-9]+$'
^ - Line Start
(\+|-)? - Optional + or - sign at start
[0-9]+ - One or more digits
$ - Line End
Overall it matches strings like +123 or -98765 or just 9
Here -E is for extended regex support and -q is for quiet in grep command.
PS: btw you don't need grep for this check and can do this directly in pure bash:
re='^(\+|-)?[0-9]+$'
[[ "$character" =~ $re ]] && echo "its an integer"
I like this cheat sheet for regex:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
It is very useful, you could easily analyze the
'^(\+|-)?[0-9]+$'
as
^: Line must begin with...
(): grouping
\: escape character (because + means something ... see below)
+|-: plus OR minus signs
?: 0 or 1 repetition
[0-9]: range of numbers from 0-9
+: one or more repetitions
$: end of line (no more characters allowed)
so it accepts strings like: -312353243 or +1243 or 5678
but does not accept: 3 456 or 6.789 or 56$ (as dollar sign).
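The accepted and rejected examples above can be checked with a small loop. This is a minimal sketch; the is_integer helper name is hypothetical and just wraps the grep command from the question.

```shell
# is_integer: hypothetical helper; succeeds when $1 is an optionally signed integer.
is_integer() {
    printf '%s\n' "$1" | grep -Eq '^(\+|-)?[0-9]+$'
}

accepted=''
rejected=''
for value in '-312353243' '+1243' '5678' '3 456' '6.789' '56$'; do
    if is_integer "$value"; then
        accepted="${accepted}[$value]"
    else
        rejected="${rejected}[$value]"
    fi
done

echo "accepted: $accepted"
echo "rejected: $rejected"
```

Quoting "$1" inside the helper matters here: an unquoted value such as 3 456 would be word-split before it reaches grep.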