grep starts with capital and appears exactly three times - regex

I need to grep this: lines that start with a capital and that same capital has to appear EXACTLY 3 times in the line.
E.g. this is a good line :
'X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R'
(starts with X and X appears EXACTLY three times)
I tried this, but apparently the backreferences between the brackets don't work properly:
^\([A-Z]\)[^\1]*\1[^\1]*\1[^\1]*
Why doesn't this work and how should I do it?

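In grep's default (BRE) syntax, capture groups are written \(...\). The pattern below matches lines that start with a capital that occurs at least three times. It does not reject a fourth occurrence, so on its own it does not enforce "exactly three"; for that, lines containing a fourth occurrence have to be filtered out as well:
grep '^\([A-Z]\).*\1.*\1'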
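One way to get "exactly three" with plain grep is to chain two filters: keep lines with at least three occurrences, then drop lines with a fourth. This is only a sketch; the sample lines and the variable names are made up for illustration, and LC_ALL=C is set so that [A-Z] means ASCII capitals.

```shell
# Hypothetical sample data, one candidate line per row
sample='Xq2X46Xad
asAnAndA
YeYeYeY
CCC
EsE63Eu6u'

# Stage 1 keeps lines whose leading capital occurs at least three times.
# Stage 2 drops lines where that same capital occurs a fourth time.
exactly_three=$(printf '%s\n' "$sample" |
  LC_ALL=C grep '^\([A-Z]\).*\1.*\1' |
  LC_ALL=C grep -v '^\([A-Z]\).*\1.*\1.*\1')

printf '%s\n' "$exactly_three"
# prints:
# Xq2X46Xad
# CCC
# EsE63Eu6u
```

YeYeYeY passes the first stage but is removed by the second, because Y occurs four times.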

I'd use awk for this
$ cat ip.txt
Xq2X46Xad
asAnAndA
YeYeYeY
CCC
EsE63Eu6u
$ awk '/^[A-Z]/{c=substr($0,1,1); n=split($0,a,c); if(n==4)print}' ip.txt
Xq2X46Xad
CCC
EsE63Eu6u
/^[A-Z]/ if line starts with uppercase letter
c=substr($0,1,1) save that letter in a variable
n=split($0,a,c) use that letter to split the line and save number of fields so obtained in n
if there are four fields, then print the line
can be shortened to
$ awk '/^[A-Z]/ && split($0,a,substr($0,1,1))==4' ip.txt
$ # or, with GNU awk
$ gawk -v FS= '/^[A-Z]/ && split($0,a,$1)==4' ip.txt
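The n==4 test relies on the fact that splitting on a character that occurs k times yields k+1 fields. A small sketch to check that on a few lines; count_fields is a hypothetical helper name, not part of awk.

```shell
# Hypothetical helper: print how many fields result from splitting
# the line ($1) on the separator character ($2).
count_fields() {
  awk -v line="$1" -v sep="$2" 'BEGIN { print split(line, parts, sep) }'
}

three=$(count_fields 'Xq2X46Xad' X)   # X occurs 3 times
four=$(count_fields 'YeYeYeY' Y)      # Y occurs 4 times
only=$(count_fields 'CCC' C)          # C occurs 3 times, all fields empty

echo "$three $four $only"
# prints: 4 5 4
```

So a line qualifies exactly when the count is 4, including edge cases like CCC where every field is empty.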

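[^\1] doesn't mean the negation of backreference \1. Inside a bracket expression, the backslash and the 1 are just literal characters.
You have to use a negative lookahead, anchor the pattern at both the beginning and the end of the line, and use the -P option (for PCRE):
grep -P '^([A-Z])(?:(?!\1).)*\1(?:(?!\1).)*\1(?:(?!\1).)*$'
This matches lines whose first character is a capital that occurs exactly three times in the line.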
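To see what [^\1] really does, here is a small check with plain grep. The test strings are made up; the point is only that the bracket expression excludes a literal backslash and a literal 1, not the captured letter.

```shell
# Inside a bracket expression the backslash is literal, so [^\1] means
# "any character except a backslash or the digit 1".
pattern='^\([A-Z]\)[^\1]*$'

# X occurs twice, yet the line still matches: [^\1] did not exclude X.
repeat_hits=$(printf '%s\n' 'XaXb' | grep -c "$pattern" || true)

# A literal digit 1 is what actually gets excluded.
digit_hits=$(printf '%s\n' 'Xa1b' | grep -c "$pattern" || true)

echo "repeat_hits=$repeat_hits digit_hits=$digit_hits"
# prints: repeat_hits=1 digit_hits=0
```

The "|| true" is only there because grep -c exits non-zero when nothing matches.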

With an awk that splits input into chars when FS is null (e.g. GNU awk):
$ awk -F '' '/^[A-Z]/ && gsub($1,"&")==3' file
X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R
With any awk in any shell on any UNIX box:
$ awk '/^[A-Z]/ && gsub(substr($0,1,1),"&")==3' file
X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R
You might want to change A-Z to [:upper:] for portability to other locales.

Related

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
Demo script:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: awk keeps this simple. Written and tested with the shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: set the field separator to / or -, then print the 2nd field, a hyphen, and the 3rd field of the current line.
2nd solution: using awk's match function.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: using GNU grep. The -o option prints only the matched text, and -P enables PCRE (Perl-compatible regular expressions). The pattern matches .*/ and then uses \K to discard everything matched so far; [^-]*-[^-]* then matches the text up to, but not including, the 2nd hyphen.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match up to the first occurrence of /, clear the match buffer with \K, and then repeat the character class 1+ times on each side of a hyphen, so that at least one character is required before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using GNU grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If GNU awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
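For the "stream everything through one process" case mentioned in the last note, a minimal sketch could look like this. The input lines and variable names are hypothetical; the same assumptions as above apply (output is the text between the last slash and the 2nd hyphen).

```shell
# Hypothetical input: one path per line
paths='home/JOHNSMITH-4991-common-task-list
home/dir/JANEDOE-17-other-task
srv/x/BOB-3-a-b'

# A single awk process handles every line: strip everything through the
# last slash, then print the first two hyphen-separated fields.
ids=$(printf '%s\n' "$paths" |
  awk '{ sub(/.*\//, ""); split($0, f, "-"); print f[1] "-" f[2] }')

printf '%s\n' "$ids"
# prints:
# JOHNSMITH-4991
# JANEDOE-17
# BOB-3
```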

Limit grep to first match per line

I'm trying to grep for all lines that have the letter a before the first period in the line.
Here's an example file:
test123a.hello
example.more-test-a.xyz
stacka.tester.this
nothing.nothing.nothing
In the example above, I'd want to grep these 2 lines:
test123a.hello
stacka.tester.this
This is what I've tried:
grep ".*a\." test.txt
That gets the 2 lines I want, but it also gets this line, which I don't want, because the a is in front of the second period, not the first one:
example.more-test-a.xyz
How do I limit it to just get the lines with a before the first period?
$ grep '^[^.]*a\.' test.txt
test123a.hello
stacka.tester.this
^ to restrict matching at start of line
[^.]* to match any character other than . character, zero or more times
a literally match character a
\. literally match character .
You can also use awk here, which is more suited for field based processing
$ # 'a' as last character for first field
$ awk -F'.' '$1 ~ /a$/' test.txt
test123a.hello
stacka.tester.this
$ # 'a' as last character for second field
$ awk -F'.' '$2 ~ /a$/' test.txt
example.more-test-a.xyz
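To compare the question's pattern with the anchored one side by side, here is a quick count on the sample file's contents (inlined as a variable so the snippet is self-contained; the variable names are made up).

```shell
sample='test123a.hello
example.more-test-a.xyz
stacka.tester.this
nothing.nothing.nothing'

# Unanchored: ".*" may skip past earlier periods, so any "a." qualifies.
loose=$(printf '%s\n' "$sample" | grep -c '.*a\.')

# Anchored: "[^.]*" cannot cross a period, so "a." must end the first field.
strict=$(printf '%s\n' "$sample" | grep -c '^[^.]*a\.')

echo "loose=$loose strict=$strict"
# prints: loose=3 strict=2
```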
If there is a lot of output, you can page through it:
grep ".*a." test.txt | less
You can also try:
grep ".*a." test.txt | grep -v "\([^a]\.\)\{1,\}.*a."
This runs your first grep and then excludes any line where an "a." comes after an earlier dot.

How to capitalize the first letter of a word starting with numbers

I'd like to find a way to capitalize the first letter of a word starting with numbers.
Input:
2019donaldtrump
03012019paris
Expected result:
2019Donaldtrump
03012019Paris
Is there a way to modify this command
sed -e 's/^\(.\)/\U\1/g'
to make it look for the first actual letter of the word?
This appears to do what you want in my testing:
sed -e 's/\([[:digit:]]\)\([a-z]\)/\1\U\2/g' input.txt
Input:
2019donaldtrump
03012019paris
Output:
2019Donaldtrump
03012019Paris
Edit: As pointed out by Toto, the grouping is not actually necessary:
sed -e 's/[[:digit:]][a-z]/\U\0/g' input.txt
One in awk:
$ awk 'BEGIN{FS=OFS=""}/^[0-9]/ && match($0,/[a-z]/){$RSTART=toupper($RSTART)}1' file
Output:
2019Donaldtrump
03012019Paris
notstartingwith123
Explained:
$ awk 'BEGIN {
FS=OFS="" # separators to empty
}
/^[0-9]/ && match($0,/[a-z]/) { # line starts with a digit and contains a lowercase letter
$RSTART=toupper($RSTART) # capitalize the first letter
}1' file # output
Shorter, as match will return RSTART as its value, store and use that instead:
$ awk 'BEGIN{FS=OFS=""}/^[0-9]/&&r=match($0,/[a-z]/){$r=toupper($r)}1' file
This might work for you (GNU sed):
sed 's/\<[[:digit:]]\+[[:alpha:]]/\U&/' file
This will uppercase the first alphabetic character of a word beginning with digits.
With GNU sed for \U:
$ sed 's/[[:alpha:]]/\U&/' file
2019Donaldtrump
03012019Paris
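\U is a GNU sed extension, so the sed commands above may not work elsewhere. A sketch of the same idea using only match, substr and toupper, which any POSIX awk should have; capitalize_after_digits is a hypothetical function name.

```shell
# Hypothetical helper: uppercase the first letter that directly follows
# a leading run of digits; all other lines pass through unchanged.
capitalize_after_digits() {
  awk '
    match($0, /^[0-9]+[a-z]/) {
      # The match starts at column 1, so RLENGTH is the position of the letter.
      $0 = substr($0, 1, RLENGTH - 1) toupper(substr($0, RLENGTH, 1)) substr($0, RLENGTH + 1)
    }
    { print }
  '
}

printf '%s\n' 2019donaldtrump 03012019paris donald | capitalize_after_digits
# prints:
# 2019Donaldtrump
# 03012019Paris
# donald
```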
Try Perl
$ cat boby.txt
donald
2019donaldtrump
03012019paris
$ perl -pe ' s/\b\d+\K(.)(?=\S+)/uc $1/ge ' boby.txt
donald
2019Donaldtrump
03012019Paris
Explanation:
\b - Match word boundary
\d+ - Match digits
\K - Discard what has been matched so far, i.e. (\b\d+), so it is not replaced
(.) - Match a single character and store it in $1
(?=\S+) - Lookahead: require at least one non-space character to follow, without consuming it
uc $1 - Replace $1 with its upper-case form. Evaluating the replacement as code requires the "e" modifier, hence "/ge" at the end

Grep only for lowercase and spaces

I need to grep files for lines containing only lowercase letters and spaces. Both conditions must be met at least once and no other characters are allowed.
I know how to grep only for lowercase or only for space but I don't know how to join those two conditions in one regexp/command.
I have only this right now:
egrep "[[:space:]]" $DIR/$file | egrep -vq "[[:upper:]]"
which of course will display lines with digits and/or special characters as well which is not what I want.
Thanks.
This does what you need.
The -x option matches whole lines.
The first expression matches lines composed entirely of spaces and lowercase letters.
The second expression matches lines that have a space next to a lowercase letter, which is always the case when such a line contains both.
egrep -x '[[:lower:] ]*' "$DIR/$file" | egrep '( [[:lower:]])|([[:lower:]] )'
awk may be better to express such conditions:
awk '/^[ a-z]+$/ && /[a-z]/ && / /' file
That is, it checks that a line:
consists only of spaces and lowercase letters.
contains at least one lowercase letter.
contains at least one space.
Test
$ cat a
hello this is something simple
but SUDDENLY not
wah
wa ah
$ awk '/^[ a-z]+$/ && /[a-z]/ && / /' a
hello this is something simple
wa ah
First grep all lines that consist only of lowercase characters and whitespace, then keep those that contain at least one whitespace character and at least one lowercase letter.
egrep -x '[[:lower:][:space:]]+' "$DIR/$file" | egrep '[[:space:]]' | egrep '[[:lower:]]'
The [:space:] character class also matches tabs; it can be replaced with a plain space if desired.
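A line made up of spaces only is the edge case worth testing with this kind of pipeline. A small sketch with made-up input, using a plain space instead of [:space:] and one grep per condition:

```shell
# Three filters: only lowercase letters and spaces, at least one space,
# at least one lowercase letter.
matches=$(printf '%s\n' 'hello world' 'wah' '   ' 'Mixed case' 'a1 b' |
  grep -x '[[:lower:] ]*' |
  grep ' ' |
  grep '[[:lower:]]')

printf '%s\n' "$matches"
# prints: hello world
```

The spaces-only line survives the first two filters and is only removed by the third.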

grep or sed for word containing string

example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all the domains in the middle of each line. They all have a common part, a.site, that I can grep for, but I can't work out how to do it without returning the whole line.
Maybe sed or a regex is needed here, as a simple grep isn't enough?
You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/' input
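With sed, the leading .* is greedy, and that affects what the capture group gets. A quick check on the first sample line, assuming the domain is always preceded by a space; the variable names are only for illustration.

```shell
line='blahblah 123.a.site.com some-junk'

# No space before the group: ".*" swallows "123." and the group starts at "a.site".
greedy=$(printf '%s\n' "$line" | sed 's/.*\([^ ]*a\.site[^ ]*\).*/\1/')

# A literal space before the group forces it to start at the beginning of the word.
anchored=$(printf '%s\n' "$line" | sed 's/.* \([^ ]*a\.site[^ ]*\).*/\1/')

echo "$greedy"
echo "$anchored"
# prints:
# a.site.com
# 123.a.site.com
```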
Try this to extract the word in that position:
$ sed -r 's/.* ([0-9]*\.[^ ]*) .*/\1/' input
[0-9]* - Match a digit zero or more times.
.* - Match any character zero or more times.
\. - Match a literal dot.
[^ ]* - Match any character other than a space zero or more times.
() - Capture group: the text it matches can be reused in the replacement as \1, \2 ... \9 (only nine groups are available). In GNU sed, \0 stands for the whole match.