Grep only for lowercase and spaces - regex

I need to grep files for lines containing only lowercase letters and spaces. Both conditions must be met at least once and no other characters are allowed.
I know how to grep only for lowercase or only for space but I don't know how to join those two conditions in one regexp/command.
I have only this right now:
egrep "[[:space:]]" $DIR/$file | egrep -vq "[[:upper:]]"
which of course will display lines with digits and/or special characters as well which is not what I want.
Thanks.

This is what you require
The -x matches whole lines
The first expression matches lines composed entirely of spaces and lower case letters.
The second expression matches lines that have both a space and a lower case letter.
egrep -x '[[:lower:] ]*' $DIR/$file | egrep '( [[:lower:]])|([[:lower:]] )'

awk may be better to express such conditions:
awk '/^[ a-z]+$/ && /[a-z]/ && / /' file
That is, it checks that a line:
consists only of spaces and lowercase letters;
contains at least one lowercase letter;
contains at least one space.
Test
$ cat a
hello this is something simple
but SUDDENLY not
wah
wa ah
$ awk '/^[ a-z]+$/ && /[a-z]/ && / /' a
hello this is something simple
wa ah

First grep all lines that consist only of lowercase characters and whitespace, then keep those that contain at least one whitespace character and at least one lowercase letter.
egrep -x '[[:lower:][:space:]]+' "$DIR/$file" | egrep '[[:space:]]' | egrep '[[:lower:]]'
The [:space:] character class also matches tabs; it can be replaced with a plain space if desired.
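The three conditions can also be written as a chain of single-purpose filters, which may be easier to read and adjust. This is only a minimal sketch: the function name lower_and_space and the sample lines are made up for illustration, and it assumes a plain space (not any whitespace) is what should be allowed.

```shell
# Hypothetical helper: reads lines on stdin and prints those that
#   1. consist only of lowercase letters and spaces,
#   2. contain at least one lowercase letter, and
#   3. contain at least one space.
lower_and_space() {
    grep -x '[[:lower:] ]*' | grep '[[:lower:]]' | grep ' '
}

# Made-up sample input; only "hello world" meets all three conditions.
printf '%s\n' 'hello world' 'hello' '   ' 'Hello world' 'abc 123' |
    lower_and_space
```

Each stage checks exactly one condition, so dropping or swapping a requirement means editing a single grep.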

grep starts with capital and appears exactly three times

I need to grep this: lines that start with a capital and that same capital has to appear EXACTLY 3 times in the line.
E.g. this is a good line :
'X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R'
(starts with X and X appears EXACTLY three times)
I tried this, but apparently the backreferences between the brackets don't work properly:
^\([A-Z]\)[^\1]*\1[^\1]*\1[^\1]*
Why doesn't this work and how should I do it?
In grep, you need to use \(...\) to create capture groups. For "starts with a capital and that same letter appears at least three times" (this pattern also accepts lines with more than three occurrences), you could do:
grep '^\([A-Z]\).*\1.*\1'
I'd use awk for this
$ cat ip.txt
Xq2X46Xad
asAnAndA
YeYeYeY
CCC
EsE63Eu6u
$ awk '/^[A-Z]/{c=substr($0,1,1); n=split($0,a,c); if(n==4)print}' ip.txt
Xq2X46Xad
CCC
EsE63Eu6u
/^[A-Z]/ if line starts with uppercase letter
c=substr($0,1,1) save that letter in a variable
n=split($0,a,c) use that letter to split the line and save number of fields so obtained in n
if there are four fields, then print the line
can be shortened to
$ awk '/^[A-Z]/ && split($0,a,substr($0,1,1))==4' ip.txt
$ # or, with GNU awk
$ gawk -v FS= '/^[A-Z]/ && split($0,a,$1)==4' ip.txt
[^\1] does not mean the negation of backreference \1; inside a bracket expression, \ and 1 are just ordinary characters.
You have to use a negative lookahead, anchor both the beginning and the end, and use the -P option (for PCRE):
grep -P '^([A-Z])(?:(?!\1).)*\1(?:(?!\1).)*\1(?:(?!\1).)*$'
This matches lines whose first character is a capital letter that appears exactly 3 times in the line.
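To see why [^\1] is not a negated backreference, it may help to test it in isolation: inside a bracket expression grep treats the backslash and the 1 as two ordinary characters, so [^\1] means "any character except a backslash or the digit 1". A small sketch with a made-up input string:

```shell
# Made-up input containing a backslash and a digit 1.
# %s keeps the backslash literal, so grep sees: a\b1c
printf '%s\n' 'a\b1c' | grep -o '[^\1]'
# Prints a, b and c on separate lines: only "\" and "1" are excluded.
```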
With an awk that splits input into chars when FS is null (e.g. GNU awk):
$ awk -F '' '/^[A-Z]/ && gsub($1,"&")==3' file
X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R
With any awk in any shell on any UNIX box:
$ awk '/^[A-Z]/ && gsub(substr($0,1,1),"&")==3' file
X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R
You might want to change A-Z to [:upper:] for portability to other locales.
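The difference between "at least three" and "exactly three" is easy to check side by side. A minimal sketch, using two lines from the sample file above and a hypothetical sample helper to feed them in:

```shell
# Two lines from the sample input: Y occurs four times in the first,
# X occurs three times in the second.
sample() { printf '%s\n' 'YeYeYeY' 'Xq2X46Xad'; }

# Backreference-only pattern: accepts three OR MORE occurrences,
# so both lines are printed.
sample | grep '^\([A-Z]\).*\1.*\1'

# Counting with gsub: accepts exactly three occurrences,
# so only Xq2X46Xad is printed.
sample | awk '/^[A-Z]/ && gsub(substr($0, 1, 1), "&") == 3'
```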

Why doesn't grep work in pattern with colon

I know a colon (:) should be literal, so I'm not clear why a grep matches all lines. Here's a file called "test":
cat test
123|4444
4546|4444
666666|5678
7777777|7890675::1
I need to match the line with ::1. Of course, the real case is more complicated, so I can't simply search for "::1". I tried many iterations, like
grep -E '^[0-9]|[0-9]:' test
grep -E '^[0-9]|[0-9]::1' test
But they return all lines:
123|4444
4546|4444
666666|5678
7777777|7890675::1
I am expecting to match just the last line. Any idea why that is?
This is GNU/Linux bash.
The pipe needs to be escaped and you need to allow repeated digits:
grep -E '^[0-9]+\|[0-9]+:' test
Otherwise ^[0-9] is all that needs to match for a line to be retained by the grep.
Given:
$ echo "$txt"
123|4444
4546|4444
666666|5678
7777777|7890675::1
Use repetition (+ means 'one or more') and character classes:
$ echo "$txt" | grep -E '^[[:digit:]]+[|][[:digit:]]+[:]+'
7777777|7890675::1
Since | is a regex meta character, it has to be either escaped (\|) or in a character class.
There are two issues:
The regex [0-9] matches any single digit. Since you have multiple digits, you need to replace those parts with [0-9]+, which matches one or more digits. If you want to allow an empty sequence with no digits, replace the + with a *, which means “zero or more”.
The pipe character | means alternation in an extended regex. What you provided will match either a digit at the start of the line, or a digit followed by a colon. Since every line satisfies at least one of those, you match every line. To get a literal | character, you can use either [|] or \|; the second form is the more common one.
Applying both of these, you get ^[0-9]+\|[0-9]+::1.
Another approach is to use a tool like awk that can process the fields of each line, and match lines where the 2nd field ends with "::1"
awk -F'|' '$2 ~ /::1$/' test
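The effect of the unescaped pipe can be reproduced with the four lines from the question. A short sketch; the sample helper is just a hypothetical stand-in for the "test" file:

```shell
# Stand-in for the "test" file from the question.
sample() {
    printf '%s\n' '123|4444' '4546|4444' '666666|5678' '7777777|7890675::1'
}

# Unescaped |: "digit at line start" OR "digit followed by ::1".
# Every line starts with a digit, so the count is 4.
sample | grep -Ec '^[0-9]|[0-9]::1'

# Escaped \| plus + for repeated digits: only the ::1 line is printed.
sample | grep -E '^[0-9]+\|[0-9]+::1'
```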

How can I match lines with exactly three numbers?

How can I use grep to match 3 numbers in a file? My file looks like this:
123
122
222
333443
fdsfs5454353
dsfsfjsk4654641
Note that some of the lines contain trailing spaces. I want to only match three digit numbers. I tried:
grep -E [0-9]{3} test.txt
grep -E '\<[0-9]{3}\>' test.txt
grep '^[0-9][0-9]*$' test|awk '{if(length($0) == 3) print $0}'
or if you have whitespace:
sed 's/[[:space:]]*$//' test|grep '^[0-9][0-9]*$'|awk '{if(length($0) == 3) print $0}'
(thanks #shellter)
Use Extended Regular Expressions with Bounds
I asked if you meant numbers with exactly three digits, or each three-digit match in a string. You replied that you wanted only lines that contained exactly three digits.
Extended grep provides an easy solution for this. Consider the following:
$ egrep '^[0-9]{3}\b' /tmp/corpus
123
122
222
This uses a bound (also known as a range) to look for exactly three digits at the start of each line, followed by a word boundary. The word boundary will match trailing space or the end of line, ensuring that you get the proper match in either case.
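If the word-boundary operator is not available in your grep, the same idea can be spelled out with an explicit "optional trailing whitespace, then end of line". A minimal sketch, assuming sample lines modelled on the question (the sample helper and the exact placement of the trailing space are made up):

```shell
# Sample modelled on the question; the second line has a trailing space.
sample() { printf '%s\n' '123' '122 ' '222' '333443' 'fdsfs5454353'; }

# Exactly three digits, optional trailing whitespace, then end of line.
# Prints the first three lines only.
sample | grep -E '^[0-9]{3}[[:space:]]*$'
```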

How to match only items preceded by a-z, A-Z, space, or the start of a line when searching with grep?

I need to display all lines in file.txt containing the character "鱼", but only those where "鱼" is immediately preceded by a-z, A-Z, a space, or a line break.
I tried using grep, like this:
grep "[a-zA-Z\s\n]鱼" file.txt
The regular expression [a-zA-Z\s\n] does not appear to work. How can I search for this character, when appearing after a-z, A-Z, a space, or a line break?
If you want to match a space with grep, use a space:
grep "[a-zA-Z ]鱼" file.txt
If you want to match any whitespace, you can use the Posix standard character class:
grep "[a-zA-Z[:space:]]鱼" file.txt
("Any whitespace" is space, newline, carriage return, form feed, tab and vertical tab. If you just want to match space and tab, you can use [:blank:].)
You might also want to use a standard class for letters. Unless you are in the Posix or "C" locale, the meanings of character ranges like A-Z are unpredictable.
grep "[[:alpha:][:space:]]鱼" file.txt
grep works line by line, so it will never see a newline. But using an "extended" pattern, you can also match at the beginning of the line:
egrep "(^|[[:alpha:][:space:]])鱼" file.txt
(You can use grep -E instead of egrep if you prefer. But you need one or the other for the above regular expression to work.)
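A quick way to check the alternation is to feed it a few hand-made lines. This is only a sketch: the sample lines are invented, and it assumes a grep and terminal that handle UTF-8 input.

```shell
# Invented test lines: letter, space, start of line, digit, digit again.
sample() { printf '%s\n' 'a鱼' ' 鱼' '鱼 first' '1鱼' 'x 2鱼'; }

# Keeps the first three lines; the last two have a digit before 鱼.
sample | grep -E '(^|[[:alpha:][:space:]])鱼'
```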
Grep does not support this by default
$ man grep | grep '\\s'
But awk does
$ man awk | grep '\\s'
\s Matches any whitespace character.
So perhaps use
awk '/[a-zA-Z\s\n]鱼/' file.txt
Use awk:
awk '/[A-Za-z \t]鱼/ || (NR > 1 && /^鱼/)' file
This prints a line if 鱼 comes after [A-Za-z \t], or if the line is not the first one and 鱼 is at its beginning: NR > 1 && /^鱼/.
If you simply want 鱼 at the beginning of a line or preceded by [A-Za-z \t], you can do this:
awk '/(^|[A-Za-z \t])鱼/' file
Or
grep -E '(^|[A-Za-z[:blank:]])鱼' file
Try this one:
^[a-zA-Z \n]{1,}鱼
{1,} ensures that 鱼 is preceded by at least one of these characters.
I would also suggest using awk in this particular case.

grep or sed for word containing string

example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all those domains in the middle of each line. They all share a common part, a.site, that I can grep for, but I can't work out how to do it without returning the whole line.
Maybe sed or a regex is needed here, as a simple grep isn't enough?
You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/' input
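One thing to watch with the sed form is that a leading .* is greedy: if nothing pins down where the word starts, .* swallows the beginning of the domain and the capture group keeps only the tail. A short sketch comparing the variants on two lines from the example file (the sample helper is a hypothetical stand-in for "input"):

```shell
# Two lines from the example file.
sample() {
    printf '%s\n' 'blahblah 123.a.site.com some-junk' \
                  'lalalala 456.a.site.org monkey-junk'
}

# grep -o prints only the matched word: 123.a.site.com, 456.a.site.org
sample | grep -o '[^ ]*a\.site[^ ]*'

# sed with a bare leading .* : the greedy .* eats "123." and "456.",
# leaving a.site.com and a.site.org
sample | sed -e 's/.*\([^ ]*a\.site[^ ]*\).*/\1/'

# sed with a space before the group: the whole word is captured again.
sample | sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/'
```

The space-anchored form assumes the domain is never the first word on the line; the grep -o and awk variants do not have that restriction.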
Try this to find anything in that position
$ sed -r "s/.* ([0-9]*)\.(.*)\.(.*)/\2/g"
[0-9]* - matches a digit zero or more times.
.* - matches any character zero or more times.
\. - matches a literal dot.
() - captures the text matched by the enclosed expression, which can then be referenced as \1, \2 ... \9 (only nine groups are available). In GNU sed, \0 (like &) stands for the whole match.