I need to grep files for lines containing only lowercase letters and spaces. Both conditions must be met at least once and no other characters are allowed.
I know how to grep only for lowercase or only for space but I don't know how to join those two conditions in one regexp/command.
I have only this right now:
egrep "[[:space:]]" $DIR/$file | egrep -vq "[[:upper:]]"
which of course will display lines with digits and/or special characters as well which is not what I want.
Thanks.
This is what you require
The -x option makes grep match only whole lines.
The first expression matches lines composed entirely of spaces and lower case letters.
The second expression matches lines that have both a space and a lower case letter.
egrep -x '[[:lower:] ]*' $DIR/$file | egrep '( [[:lower:]])|([[:lower:]] )'
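If you would rather have a single command, the two conditions can probably be folded into one pattern. This is only a sketch: the function name lower_and_space is made up for illustration, and LC_ALL=C is an assumption so that [[:lower:]] means just a-z.

```shell
# Hypothetical helper: print lines that consist only of lowercase letters
# and spaces, and that contain at least one of each.
# The middle group requires a letter next to a space somewhere in the line,
# which any line holding both kinds of character must have.
lower_and_space() {
    LC_ALL=C grep -Ex '[[:lower:] ]*([[:lower:]] | [[:lower:]])[[:lower:] ]*' "$@"
}

# Sample input (made up): only the first line should be printed.
printf '%s\n' 'hello world' 'hello' '   ' 'Hello world' 'abc 123' | lower_and_space
```

With no file argument the function reads standard input; otherwise pass the file, e.g. lower_and_space "$DIR/$file".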
awk may be better to express such conditions:
awk '/^[ a-z]+$/ && /[a-z]/ && / /' file
That is, it checks that a line:
consists only of spaces and lowercase letters;
contains at least one lowercase letter;
contains at least one space.
Test
$ cat a
hello this is something simple
but SUDDENLY not
wah
wa ah
$ awk '/^[ a-z]+$/ && /[a-z]/ && / /' a
hello this is something simple
wa ah
First grep all lines that consist only of lowercase characters and whitespace, then keep those that contain at least one whitespace character and at least one lowercase letter.
egrep -x '[[:lower:][:space:]]+' "$DIR/$file" | egrep '[[:space:]]' | egrep '[[:lower:]]'
The [:space:] class also matches tabs; it can be replaced with a plain space if desired.
I need to grep this: lines that start with a capital and that same capital has to appear EXACTLY 3 times in the line.
E.g. this is a good line :
'X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R'
(starts with X and X appears EXACTLY three times)
I tried this, but apparently the backreferences between the brackets don't work properly:
^\([A-Z]\)[^\1]*\1[^\1]*\1[^\1]*
Why doesn't this work and how should I do it?
In grep, you need to use \(...\) to create capture groups. For "starts with a capital and that same letter appears at least three times", you could do:
grep '^\([A-Z]\).*\1.*\1'
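To narrow this down to exactly three occurrences with plain grep, one option is to drop the lines that also have a fourth occurrence. A minimal sketch, assuming ASCII input; the function name exactly_three is hypothetical and LC_ALL=C is there so that [A-Z] means only the 26 ASCII capitals:

```shell
# Hypothetical helper: keep lines that start with a capital letter which
# occurs at least three times, then discard those where it occurs a
# fourth time. What remains has exactly three occurrences.
exactly_three() {
    LC_ALL=C grep '^\([A-Z]\).*\1.*\1' "$@" |
        LC_ALL=C grep -v '^\([A-Z]\).*\1.*\1.*\1'
}

# Sample input: YeYeYeY has four Y's and asAnAndA starts with a lowercase
# letter, so only the other three lines are printed.
printf '%s\n' 'Xq2X46Xad' 'asAnAndA' 'YeYeYeY' 'CCC' 'EsE63Eu6u' | exactly_three
```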
I'd use awk for this
$ cat ip.txt
Xq2X46Xad
asAnAndA
YeYeYeY
CCC
EsE63Eu6u
$ awk '/^[A-Z]/{c=substr($0,1,1); n=split($0,a,c); if(n==4)print}' ip.txt
Xq2X46Xad
CCC
EsE63Eu6u
/^[A-Z]/ if line starts with uppercase letter
c=substr($0,1,1) save that letter in a variable
n=split($0,a,c) use that letter to split the line and save number of fields so obtained in n
if there are four fields, then print the line
can be shortened to
$ awk '/^[A-Z]/ && split($0,a,substr($0,1,1))==4' ip.txt
$ # or, with GNU awk
$ gawk -v FS= '/^[A-Z]/ && split($0,a,$1)==4' ip.txt
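To see why four fields correspond to three occurrences, it may help to print what split returns for a few of the sample lines. This is just an illustration; count_fields is a made-up helper, and it assumes the separator is a single ordinary character rather than a regex metacharacter.

```shell
# Hypothetical helper: print how many pieces awk's split() produces when
# string $1 is split on the single character $2.
count_fields() {
    awk -v s="$1" -v c="$2" 'BEGIN { print split(s, a, c) }'
}

count_fields 'Xq2X46Xad' X   # 4: "", "q2", "46", "ad"   (three X's)
count_fields 'YeYeYeY'   Y   # 5: "", "e", "e", "e", ""  (four Y's)
count_fields 'CCC'       C   # 4: four empty pieces      (three C's)
```

In each case the number of pieces is one more than the number of separators, which is why the test is n==4.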
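[^\1] doesn't mean the negation of backreference \1. Inside a bracket expression, grep does not treat \1 as a backreference.
You have to use a negative lookahead, anchor the pattern at the beginning and end, and use the -P option (for PCRE):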
grep -P '^([A-Z])(?:(?!\1).)*\1(?:(?!\1).)*\1(?:(?!\1).)*$'
This matches lines whose first character is a capital letter that appears exactly 3 times in the line.
With an awk that splits input into chars when FS is null (e.g. GNU awk):
$ awk -F '' '/^[A-Z]/ && gsub($1,"&")==3' file
X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R
With any awk in any shell on any UNIX box:
$ awk '/^[A-Z]/ && gsub(substr($0,1,1),"&")==3' file
X^[<*'??X+BXK<:B7#;}V0|<|K!(P|HW}(1O#$JK_}}*.5H"Y&^A)D$QS97R
You might want to change A-Z to [:upper:] for portability to other locales.
I know a colon (:) should be literal, so I'm not clear why a grep matches all lines. Here's a file called "test":
cat test
123|4444
4546|4444
666666|5678
7777777|7890675::1
I need to match the line with ::1. Of course, the real case is more complicated, so I can't simply search for "::1". I tried many iterations, like
grep -E '^[0-9]|[0-9]:' test
grep -E '^[0-9]|[0-9]::1' test
But they return all lines:
123|4444
4546|4444
666666|5678
7777777|7890675::1
I am expecting to match just the last line. Any idea why that is?
This is GNU/Linux bash.
The pipe needs to be escaped and you need to allow repeated digits:
grep -E '^[0-9]+\|[0-9]+:' test
Otherwise ^[0-9] is all that needs to match for a line to be retained by the grep.
Given:
$ echo "$txt"
123|4444
4546|4444
666666|5678
7777777|7890675::1
Use repetition (+ means 'one or more') and character classes:
$ echo "$txt" | grep -E '^[[:digit:]]+[|][[:digit:]]+[:]+'
7777777|7890675::1
Since | is a regex meta character, it has to be either escaped (\|) or in a character class.
There are two issues:
The regex [0-9] matches any single digit. Since you have multiple digits, you need to replace those parts with [0-9]+, which matches one or more digits. If you want to allow an empty sequence with no digits, replace the + with a *, which means “zero or more”.
The pipe character | means alternation in regex. What you provided will match either a digit at the start of the line, or a digit followed by a colon. Since every line has at least one of those, you match every line. To get a literal | character, you can use either [|] or \|; the second option is preferred in most styles.
Applying both of these, you get ^[0-9]+\|[0-9]+::1.
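The difference between the two patterns can be checked directly on the sample data. A small sketch; the variable name sample is only for illustration:

```shell
# The four lines from the question's "test" file.
sample='123|4444
4546|4444
666666|5678
7777777|7890675::1'

# Unescaped |: alternation, so "^[0-9]" alone is enough for a line to match.
# Counts matching lines; prints 4.
printf '%s\n' "$sample" | grep -cE '^[0-9]|[0-9]::1'

# Escaped \|: digits, a literal pipe, digits, then "::1".
# Prints only 7777777|7890675::1
printf '%s\n' "$sample" | grep -E '^[0-9]+\|[0-9]+::1'
```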
Another approach is to use a tool like awk that can process the fields of each line, and match lines where the 2nd field ends with "::1"
awk -F'|' '$2 ~ /::1$/' test
How can I use grep to match 3 numbers in a file? My file looks like this:
123
122
222
333443
fdsfs5454353
dsfsfjsk4654641
Note that some of the lines contain trailing spaces. I want to only match three digit numbers. I tried:
grep -E [0-9]{3} test.txt
grep -E '\<[0-9]{3}\>' test.txt
grep '^[0-9][0-9]*$' test|awk '{if(length($0) == 3) print $0}'
or if you have trailing whitespace:
sed 's/[[:space:]]*$//' test|grep '^[0-9][0-9]*$'|awk '{if(length($0) == 3) print $0}'
(thanks #shellter)
Use Extended Regular Expressions with Bounds
I asked if you meant numbers with exactly three digits, or each three-digit match in a string. You replied that you wanted only lines that contained exactly three digits.
Extended grep provides an easy solution for this. Consider the following:
$ egrep '^[0-9]{3}\b' /tmp/corpus
123
122
222
This uses a bound (also known as an interval expression) to look for exactly three digits at the start of each line, followed by a word boundary (\b, a GNU extension). The word boundary will match before a trailing space or at the end of the line, ensuring that you get the proper match in either case.
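If your grep has no word-boundary support, roughly the same result can be had by spelling out the optional trailing whitespace. A sketch only; the function name three_digit_lines and the sample lines are made up for illustration:

```shell
# Hypothetical helper: print lines that hold exactly three digits,
# optionally followed by trailing whitespace, and nothing else.
three_digit_lines() {
    grep -E '^[0-9]{3}[[:space:]]*$' "$@"
}

# Prints 123, "122  " (trailing spaces preserved) and 222.
printf '%s\n' '123' '122  ' '222' '333443' 'fdsfs5454353' '12a' | three_digit_lines
```

Unlike the word-boundary version, this also rejects a line such as "123 abc", since nothing but whitespace may follow the digits.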
I need to display all lines in file.txt containing the character "鱼", but only those where "鱼" is immediately preceded by a-z, A-Z, a space, or a line break.
I tried using grep, like this:
grep "[a-zA-Z\s\n]鱼" file.txt
The regular expression [a-zA-Z\s\n] does not appear to work. How can I search for this character, when appearing after a-z, A-Z, a space, or a line break?
If you want to match a space with grep, use a space:
grep "[a-zA-Z ]鱼" file.txt
If you want to match any whitespace, you can use the POSIX standard character class:
grep "[a-zA-Z[:space:]]鱼" file.txt
("Any whitespace" is space, newline, carriage return, form feed, tab and vertical tab. If you just want to match space and tab, you can use [:blank:].)
You might also want to use a standard class for letters. Unless you are in the POSIX or "C" locale, the meanings of character ranges like A-Z are unpredictable.
grep "[[:alpha:][:space:]]鱼" file.txt
grep works line by line, so it will never see a newline. But using an "extended" pattern, you can also match at the beginning of the line:
egrep "(^|[[:alpha:][:space:]])鱼" file.txt
(You can use grep -E instead of egrep if you prefer. But you need one or the other for the above regular expression to work.)
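A quick way to try the last pattern is to feed it a few lines by hand. This is a sketch with made-up sample lines; the function name fish_after_letter_or_space is hypothetical.

```shell
# Hypothetical helper: print lines where 鱼 comes right after a letter or
# whitespace character, or sits at the very start of the line.
fish_after_letter_or_space() {
    grep -E '(^|[[:alpha:][:space:]])鱼' "$@"
}

# The first three lines should be printed; in "1鱼" and "x-鱼" the
# character before 鱼 is a digit or punctuation, so those are skipped.
printf '%s\n' 'a鱼' 'two 鱼' '鱼 at line start' '1鱼' 'x-鱼' | fish_after_letter_or_space
```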
Grep does not support this by default
$ man grep | grep '\\s'
But awk (GNU awk, judging by the man page below) does
$ man awk | grep '\\s'
\s Matches any whitespace character.
So perhaps use
awk '/[a-zA-Z\s\n]鱼/' file.txt
Use awk:
awk '/[A-Za-z \t]鱼/ || (NR > 1 && /^鱼/)' file
This prints a line if 鱼 comes right after [A-Za-z \t], or if the line is not the first one and 鱼 is at its beginning: NR > 1 && /^鱼/.
If you just want 鱼 to be at the beginning of the line or preceded by [A-Za-z \t], you can simply do this:
awk '/(^|[A-Za-z \t])鱼/' file
Or
grep -E '(^|[A-Za-z[:blank:]])鱼' file
Try this one:
^[a-zA-Z \n]{1,}鱼
{1,} ensures that 鱼 is preceded by at least one of these characters.
I would also suggest using awk in this particular case.
example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all those domains in the middle of each line, they all have a common part a.site with which I can grep for, but I can't work out how to do it without returning the whole line?
Maybe sed or a regex is needed here, as a simple grep isn't enough?
You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/' input
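For instance, the grep -o variant can be tried on the example file like this. It is a sketch; the function name extract_domains is made up, and -o is an extension found in GNU and BSD grep rather than part of POSIX.

```shell
# Hypothetical helper: print only the space-delimited word that
# contains "a.site", instead of the whole line.
extract_domains() {
    grep -o '[^ ]*a\.site[^ ]*' "$@"
}

# The example file from the question, fed on standard input.
printf '%s\n' \
    'blahblah 123.a.site.com some-junk' \
    'yoyoyoyo 456.a.site.com more-junk' \
    'hihohiho 123.a.site.org junk-in-the-trunk' \
    'lalalala 456.a.site.org monkey-junk' | extract_domains
```

This prints the four domains, one per line, starting with 123.a.site.com.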
Try this to find anything in that position
$ sed -r "s/.* ([0-9]*)\.(.*)\.(.*)/\2/g"
[0-9]* - matches a digit zero or more times.
.* - matches any character zero or more times.
\. - matches a literal dot.
() - captures the text matched by the enclosed expression, which can then be referenced as \1, \2 ... \9. Only nine groups are available. In GNU sed, \0 refers to the whole match, like &.
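To get a feel for how the numbered groups work, here is a small sketch that splits a host name into three captured parts and prints them separated by " | ". The function name split_host and the sample value are only for illustration, and -E is assumed to be available (it is in GNU and BSD sed).

```shell
# Hypothetical helper: capture the leading number, the middle part and
# the last dot-separated label of a host name, then print all three.
split_host() {
    sed -E 's/^([0-9]+)\.(.*)\.([^.]+)$/\1 | \2 | \3/'
}

# \1 is "123", \2 is "a.site" (the greedy .* runs up to the last dot),
# and \3 is "com", so this prints: 123 | a.site | com
echo '123.a.site.com' | split_host
```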