grep starts with capital and appears exactly three times

I need to grep this: lines that start with a capital and that same capital has to appear EXACTLY 3 times in the line.
E.g. this is a good line :
'X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R'
(starts with X and X appears EXACTLY three times)
I tried this, but apparently the backreferences between the brackets don't work properly:
^\([A-Z]\)[^\1]*\1[^\1]*\1[^\1]*
Why doesn't this work and how should I do it?

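In grep, you need to use \(...\) to create capture groups. To match lines that start with a capital and contain that same letter at least three times, you could do:
grep '^\([A-Z]\).*\1.*\1'
Note that this also matches lines where the letter appears four or more times, so on its own it does not enforce "exactly three".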
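As a quick check of what this pattern accepts, here is a small sketch. The sample lines and the variable names are made up for illustration, and LC_ALL=C is set on the assumption that [A-Z] should mean ASCII capitals only:

```shell
# Hypothetical sample: three X's, four Y's, three C's, and a lowercase start.
sample=$(printf '%s\n' 'Xq2X46Xad' 'YeYeYeY' 'CCC' 'asAnAndA')

# The backreference pattern only requires two further occurrences of the
# first letter; it puts no upper bound on the count.
at_least_three=$(printf '%s\n' "$sample" | LC_ALL=C grep '^\([A-Z]\).*\1.*\1')

printf '%s\n' "$at_least_three"
```

The line with four Y's is printed along with the two lines that have exactly three occurrences, which is why an extra restriction is needed.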

I'd use awk for this
$ cat ip.txt
Xq2X46Xad
asAnAndA
YeYeYeY
CCC
EsE63Eu6u
$ awk '/^[A-Z]/{c=substr($0,1,1); n=split($0,a,c); if(n==4)print}' ip.txt
Xq2X46Xad
CCC
EsE63Eu6u
/^[A-Z]/ if line starts with uppercase letter
c=substr($0,1,1) save that letter in a variable
n=split($0,a,c) use that letter to split the line and save number of fields so obtained in n
if there are four fields, then print the line
can be shortened to
$ awk '/^[A-Z]/ && split($0,a,substr($0,1,1))==4' ip.txt
$ # or, with GNU awk
$ gawk -v FS= '/^[A-Z]/ && split($0,a,$1)==4' ip.txt
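To see why the test is n==4, this sketch prints the number of pieces split() returns when a line is cut at every occurrence of its own first character. The helper name count_pieces is hypothetical:

```shell
# Print how many pieces split() produces when a line is cut at every
# occurrence of its own first character.
count_pieces() {
  printf '%s\n' "$1" | awk '{ print split($0, parts, substr($0, 1, 1)) }'
}

count_pieces 'Xq2X46Xad'   # X occurs 3 times
count_pieces 'YeYeYeY'     # Y occurs 4 times
```

A separator that occurs N times yields N+1 pieces (empty leading and trailing pieces are counted), so exactly three occurrences correspond to four fields.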

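[^\1] doesn't mean the negation of backreference \1. Inside a POSIX bracket expression, \ and 1 are literal characters, so [^\1] matches any character other than a backslash or the digit 1.
You have to use a negative lookahead, anchor both the beginning and the end of the line, and use the -P option (for PCRE):
grep -P '^([A-Z])(?:(?!\1).)*\1(?:(?!\1).)*\1(?:(?!\1).)*$'
This matches lines whose first character is a capital that appears exactly 3 times in the line.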
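If grep -P is not available, a rough alternative is to chain two basic-regex passes. This is a swapped-in technique, not the PCRE approach above, and the names bracket_demo and exactly_three are hypothetical. The first command also illustrates how [^\1] is actually read:

```shell
# Inside a bracket expression the backslash and the 1 are ordinary characters,
# so [^\1]* only rejects lines containing a backslash or the digit 1.
bracket_demo=$(printf '%s\n' 'abc' 'a1c' 'a\c' | LC_ALL=C grep -x '[^\1]*')

# Two-pass filter: keep lines with at least three occurrences of the leading
# capital, then drop those with four or more.
exactly_three() {
  LC_ALL=C grep '^\([A-Z]\).*\1.*\1' |
    LC_ALL=C grep -v '^\([A-Z]\).*\1.*\1.*\1'
}

printf '%s\n' "$bracket_demo"
printf '%s\n' 'Xq2X46Xad' 'YeYeYeY' 'CCC' 'asAnAndA' | exactly_three
```

The second grep uses -v with a four-occurrence pattern, so only lines with exactly three occurrences survive both passes.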

With an awk that splits input into chars when FS is null (e.g. GNU awk):
$ awk -F '' '/^[A-Z]/ && gsub($1,"&")==3' file
X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R
With any awk in any shell on any UNIX box:
$ awk '/^[A-Z]/ && gsub(substr($0,1,1),"&")==3' file
X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R
You might want to change A-Z to [:upper:] for portability to other locales.

Related

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?

You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991

1st solution: awk keeps this simple. Written and tested with your shown samples:
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: set the field separator to / OR -, then print the 2nd field, a hyphen, and the 3rd field of the current line.
2nd solution: Using match function of awk program here.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched text and the -P option enables PCRE (Perl-compatible regular expressions). The pattern uses .*/ followed by \K to discard what has been matched so far, then [^-]*-[^-]* to match everything up to, but not including, the 2nd occurrence of - in the line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'

Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991

You could match up to the first occurrence of /, then discard what has been matched so far with \K, and then repeat the character class 1+ times on each side of a hyphen, so that at least one character is required both before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using gnu grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991

Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
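The same steps can be folded into a small function. This is only a sketch under the assumptions listed above (at least two hyphens after the last slash); the name first_two_fields is hypothetical. It takes the second field from the basename rather than from the full string, so a hyphen in a directory name does not interfere:

```shell
# Print the text between the last slash and the 2nd hyphen of the argument.
first_two_fields() {
  local base="${1##*/}"       # drop everything through the last slash
  local first="${base%%-*}"   # text before the first hyphen
  local rest="${base#*-}"     # text after the first hyphen
  printf '%s-%s\n' "$first" "${rest%%-*}"
}

first_two_fields 'home/dir/JOHNSMITH-4991-common-task-list'
```

No sub-process is spawned for the string handling itself, which is the point of the parameter-substitution approach.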

Limit grep to first match per line

I'm trying to grep for all lines that have the letter a before the first period in the line.
Here's an example file:
test123a.hello
example.more-test-a.xyz
stacka.tester.this
nothing.nothing.nothing
In the example above, I'd want to grep these 2 lines:
test123a.hello
stacka.tester.this
This is what I've tried:
grep ".*a\." test.txt
That is getting the 2 lines I want, but it's also getting this line, which I don't want, because the a is in front of the second period, not the first one:
example.more-test-a.xyz
How do I limit it to just get the lines with a before the first period?

$ grep '^[^.]*a\.' test.txt
test123a.hello
stacka.tester.this
^ to restrict matching at start of line
[^.]* to match any character other than . character, zero or more times
a literally match character a
\. literally match character .
You can also use awk here, which is more suited for field based processing
$ # 'a' as last character for first field
$ awk -F'.' '$1 ~ /a$/' test.txt
test123a.hello
stacka.tester.this
$ # 'a' as last character for second field
$ awk -F'.' '$2 ~ /a$/' test.txt
example.more-test-a.xyz

If there is too much output to read at once, you can page through it with less:
grep ".*a." test.txt | less

You can try:
grep ".*a." test.txt | grep -v "\([^a]\.\)\{1,\}.*a."
This runs your first grep, then excludes any line where "a." is preceded by an earlier dot.

How to capitalize the first letter of a word starting with numbers

I'd like to find a way to capitalize the first letter of a word starting with numbers.
Input:
2019donaldtrump
03012019paris
Expected result:
2019Donaldtrump
03012019Paris
Is there a way to modify this command
sed -e 's/^\(.\)/\U\1/g'
to make it look for the first actual letter of the word?

This appears to do what you want in my testing:
sed -e 's/\([[:digit:]]\)\([a-z]\)/\1\U\2/g' input.txt
Input:
2019donaldtrump
03012019paris
Output:
2019Donaldtrump
03012019Paris
Edit: As pointed out by Toto, the grouping is not actually necessary:
sed -e 's/[[:digit:]][a-z]/\U\0/g' input.txt

One in awk:
$ awk 'BEGIN{FS=OFS=""}/^[0-9]/ && match($0,/[a-z]/){$RSTART=toupper($RSTART)}1' file
Output:
2019Donaldtrump
03012019Paris
notstartingwith123
Explained:
$ awk 'BEGIN {
FS=OFS="" # separators to empty
}
/^[0-9]/ && match($0,/[a-z]/) { # if there is starting digit and lower case letters
$RSTART=toupper($RSTART) # capitalize the first letter
}1' file # output
Shorter, as match will return RSTART as its value, store and use that instead:
$ awk 'BEGIN{FS=OFS=""}/^[0-9]/&&r=match($0,/[a-z]/){$r=toupper($r)}1' file
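Splitting into characters with an empty FS is not guaranteed by POSIX, so here is a hedged sketch of the same idea that relies only on match, substr and toupper. The function name capitalize_after_digits is hypothetical:

```shell
# Upper-case the first lowercase letter of lines that start with a digit.
capitalize_after_digits() {
  awk '/^[0-9]/ && match($0, /[a-z]/) {
    # Rebuild the line: text before the letter, the upper-cased letter, the rest.
    $0 = substr($0, 1, RSTART - 1) toupper(substr($0, RSTART, 1)) substr($0, RSTART + 1)
  }
  1'
}

printf '%s\n' '2019donaldtrump' '03012019paris' 'notstartingwith123' |
  capitalize_after_digits
```

Lines that do not start with a digit pass through unchanged, as in the versions above.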

This might work for you (GNU sed):
sed 's/\<[[:digit:]]\+[[:alpha:]]/\U&/' file
This will uppercase the first alphabetic character of a word beginning with digits.

With GNU sed for \U:
$ sed 's/[[:alpha:]]/\U&/' file
2019Donaldtrump
03012019Paris

Try Perl
$ cat boby.txt
donald
2019donaldtrump
03012019paris
$ perl -pe ' s/\b\d+\K(.)(?=\S+)/uc $1/ge ' boby.txt
donald
2019Donaldtrump
03012019Paris
Explanation:
\b - Match word boundary
\d+ - Match digits
\K - Discard the content matched so far, i.e. (\b\d+), from the match
(.) - Match a single character and store it in $1
(?=\S+) - Lookahead: require at least one non-space character to follow, without consuming it
uc $1 - Replace the match with the upper-cased $1. The "e" modifier (as in "/ge") makes the replacement part be evaluated as Perl code

Grep only for lowercase and spaces

I need to grep files for lines containing only lowercase letters and spaces. Both conditions must be met at least once and no other characters are allowed.
I know how to grep only for lowercase or only for space but I don't know how to join those two conditions in one regexp/command.
I have only this right now:
egrep "[[:space:]]" $DIR/$file | egrep -vq "[[:upper:]]"
which of course will display lines with digits and/or special characters as well which is not what I want.
Thanks.

This is what you require
The -x matches whole lines
The first expression matches lines composed entirely of spaces and lower case letters.
The second expression matches lines that have both a space and a lower case letter.
egrep -x '[[:lower:] ]*' $DIR/$file | egrep '( [[:lower:]])|([[:lower:]] )'

awk may be better to express such conditions:
awk '/^[ a-z]+$/ && /[a-z]/ && / /' file
That is, it checks that a line:
consists only of spaces and lowercase letters.
contains at least one lowercase letter.
contains at least one space.
Test
$ cat a
hello this is something simple
but SUDDENLY not
wah
wa ah
$ awk '/^[ a-z]+$/ && /[a-z]/ && / /' a
hello this is something simple
wa ah

First grep all lines that only consist of lowercase characters and whitespace, and then all those that contain at least one whitespace.
egrep -x '[[:lower:][:space:]]+' "$DIR/$file" | egrep '[[:space:]]+'
The [:space:] character class also matches tabs, and can be replaced with a plain space if desired.
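Both conditions can also be folded into one extended regex, since any line made only of lowercase letters and spaces that contains both must have a letter next to a space somewhere. A minimal sketch, with a hypothetical function name and made-up test lines, treating "space" as a plain space character:

```shell
# Whole-line match: only lowercase letters and spaces, with at least one
# place where a space and a lowercase letter are adjacent.
only_lower_and_space() {
  LC_ALL=C grep -E -x '[[:lower:] ]*( [[:lower:]]|[[:lower:]] )[[:lower:] ]*'
}

printf '%s\n' 'hello this is something simple' 'but SUDDENLY not' 'wah' 'wa ah' '   ' |
  only_lower_and_space
```

Lines with uppercase letters, lines without a space, and lines made of spaces only are all rejected.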

grep or sed for word containing string

example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all those domains in the middle of each line, they all have a common part a.site with which I can grep for, but I can't work out how to do it without returning the whole line?
Maybe sed or a regex is need here as a simple grep isn't enough?

You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/' input
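A quick run of the grep -o variant against the example lines from the question; the variable names input and domains are just for this sketch:

```shell
# Example lines copied from the question.
input=$(printf '%s\n' \
  'blahblah 123.a.site.com some-junk' \
  'yoyoyoyo 456.a.site.com more-junk' \
  'hihohiho 123.a.site.org junk-in-the-trunk' \
  'lalalala 456.a.site.org monkey-junk')

# -o prints only the matched part: the run of non-space characters
# surrounding the literal text a.site.
domains=$(printf '%s\n' "$input" | grep -o '[^ ]*a\.site[^ ]*')

printf '%s\n' "$domains"
```

Unlike the awk variant, this does not depend on the domain being the second field.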

Try this to capture whatever is in that position:
$ sed -r "s/.* ([0-9]*)\.(.*)\.(.*)/\2/g"
[0-9]* - Matches a digit zero or more times.
.* - Matches any character zero or more times.
\. - Matches a literal dot.
() - Captures the text matched by the enclosed expression, which can then be referenced as \1, \2 .. \9 (only nine groups are available). \0 refers to the whole matched text.
