grep not the beginning of a line - regex

I want to find all lines in a file containing a number, but not at the beginning of a line. I tried the following:
grep -E '[^^][1-9]?[0-9]+' test.txt
However, it does not work: this expression also matches lines that start with a number of two or more digits. As I understand it, [^^] does not mean "any symbol except the beginning of a line". Why is that so, and how do I write this correctly?

Edited according to comment:
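This regex should do it: it matches one or more non-digit characters at the start of the line, then one or more digits.
^[^0-9]+?\d+
Note that \d and the lazy +? are Perl-style syntax, so with grep this needs -P; with grep -E, write ^[^0-9]+[0-9]+ instead.
You will need to set the 'multiline' option if you check multiple lines at one time.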

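Your issue is the [^^] part of your regex. That is a negated character class (a ^ as the first character inside [ ] negates what is inside the brackets), so [^^] matches any single character other than a literal ^. It always consumes one character and has nothing to do with the start of the line.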
Instead, I think you are looking for ^ outside of the brackets to state 'start of the line' and then a negated character class of [^0-9] for something other than a digit at the start of the line:
$ echo "1 line
line 2
3 line
line 4
no num" | grep '^[^0-9]'
line 2
line 4
no num
Then add .* for 'anything of any length' and [0-9] for at least one digit to filter for lines that have a digit in the line:
$ echo "1 line
line 2
3 line
line 4
no num" | grep '^[^0-9].*[0-9]'
line 2
line 4
Or, if you want to be locale aware, you can use POSIX character classes to the same result:
$ echo "1 line
line 2
3 line
line 4
no num" | grep '^[^[:digit:]].*[[:digit:]]'
line 2
line 4
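To see why the original pattern selects lines that start with a multi-digit number, it can help to print only the matched text with -o. This is a small sketch using made-up one-line inputs; the variable names are only for illustration.

```shell
# Print only the part of the line that the original pattern matches (-o).
# On "12 line", [^^] consumes the "1" and [0-9]+ then matches the "2".
matched=$(printf '12 line\n' | grep -oE '[^^][1-9]?[0-9]+')
echo "$matched"      # prints: 12

# On "1 line" no character is directly followed by a digit, so nothing
# matches (grep exits with status 1; "|| true" keeps "set -e" scripts alive).
unmatched=$(printf '1 line\n' | grep -oE '[^^][1-9]?[0-9]+' || true)
echo "[$unmatched]"  # prints: []
```

The bracket expression always needs a real character to consume, which is why it cannot stand in for the ^ anchor.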

Related

why grep '\s*' is not working, but grep '\S*' works

I am new to shell scripting.
I want to display the lines that start with whitespace or with non-whitespace in the file. grep '\S*' works, but grep '\s*' does not match any line.
And '\s' seems to work.
My grep version is 3.4. I am using WSL Ubuntu. The red color means it is matched. I tried [[:space:]]; the result is the same.
Can anyone help? Thanks
test.fa includes
ctatccagcaccagatagcatcattttactttcaagcctagaaattgcac
haha
ok
acttgtatataaaccaaccgaagatgaggattgagagttcatcttggtgg
* means "zero or more repetitions of the preceding expression". So \S* matches zero or more non-spaces while \s* matches zero or more spaces, and putting a ^ in front means match those at the start of a line (when the string being compared is a line, as is the case with grep by default).
So in your input file:
Line 1: ctatccagcaccagatagcatcattttactttcaagcctagaaattgcac
Line 2: haha
Line 3:
Line 4: ok
Line 5: acttgtatataaaccaaccgaagatgaggattgagagttcatcttggtgg
^\S* matches the following on each line:
Line 1: ctatccagcaccagatagcatcattttactttcaagcctagaaattgcac
Line 2: the null string before the leading blank
Line 3: the null string that is the whole line
Line 4: the null string before the leading blanks
Line 5: acttgtatataaaccaaccgaagatgaggattgagagttcatcttggtgg
while ^\s* matches the following on each line:
Line 1: the null string before ctatccagcaccagatagcatcattttactttcaagcctagaaattgcac
Line 2: the leading blank
Line 3: the null string that is the whole line
Line 4: the leading blanks
Line 5: the null string before acttgtatataaaccaaccgaagatgaggattgagagttcatcttggtgg
So both regexps match something on every line, and what is colored as matching is the printable (i.e. non-null and non-blank) chars from each matching string.
To display the lines that start with whitespace would be:
grep '^\s'
and to display the lines that start with non-whitespace would be:
grep '^\S'
and to display empty lines would be:
grep -v '.'
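If your grep doesn't support \s/\S then use [[:space:]]/[^[:space:]] instead if it's a POSIX grep, or a bracket expression containing a literal space and a literal tab character in any grep (note that grep does not treat \t inside brackets as a tab).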
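The difference between the starred and unstarred patterns can be checked on a small sample. The following is a sketch that assumes a hypothetical five-line input shaped like test.fa (shortened sequences, one line with a leading blank, one empty line, one line with two leading blanks) and uses only POSIX character classes.

```shell
# Hypothetical sample modelled on test.fa, with shortened sequences.
sample=$(printf 'ctat\n haha\n\n  ok\nacgt')

# "Zero or more" can match the null string, so every line is selected.
all_count=$(printf '%s\n' "$sample" | grep -c '^[[:space:]]*')

# Without the *, at least one leading whitespace character is required.
starts_ws=$(printf '%s\n' "$sample" | grep '^[[:space:]]')

# The negated class selects lines whose first character is not whitespace.
starts_non_ws=$(printf '%s\n' "$sample" | grep '^[^[:space:]]')

# Lines that contain no character at all.
empty_count=$(printf '%s\n' "$sample" | grep -c -v '.')

echo "$all_count"      # 5
echo "$starts_ws"      # the " haha" and "  ok" lines
echo "$starts_non_ws"  # the "ctat" and "acgt" lines
echo "$empty_count"    # 1
```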

sed insert spaces between digits for long numbers

How can one use sed to insert a space after every three digits, but only if a number is longer than 10 digits, i.e.:
blahaaaaaa goog sdd 234 3242423
ala el 213123123123
1231231313123 i 14124124141411
should turn into:
blahaaaaaa goog sdd 234 3242423
ala el 213 123 123 123
123 123 131 312 3 i 141 241 241 414 11
I can easily separate 3-digit groups using sed 's/[0-9]\{3\}/& /g' but cannot combine that with a condition on the length of the number.
A single (GNU) sed command could be enough:
sed -E 's/([0-9]{10,})/\n&\n/g; :a; s/([ \n])([0-9]{3})([0-9]+\n)/\1\2 \3/; ta; s/\n//g' file
Update:
Walter A suggested a bit more concise sed expression which works fine if I haven't overlooked something:
sed -E 's/([0-9]{10,})/&\n/g; :a; s/([0-9]{3})([0-9]+\n)/\1 \2/; ta; s/\n//g' file
Explanation:
The -E flag instructs sed to use the extended regular expression syntax (to get rid of the backslashes before the (){}+ characters).
s/([0-9]{10,})/&\n/g appends a new-line (\n) character to all digit sequences with 10 or more digits. This is in order to differentiate the digit sequences we are dealing with. The \n is a safe choice here because it cannot occur in the pattern space as read from the input line since it is the delimiter terminating the line. Notice that we are processing a single line per cycle (ie, since no multiline techniques are used, \n can be used as an anchor without interfering with other characters in the line).
:a; s/([0-9]{3})([0-9]+\n)/\1 \2/; ta This is a loop. :a is a label and could be any word (the : indicates the label). ta means jump to the label a if the last substitution (s command) was successful. The s command here repeatedly (because it is the body of the loop) replaces, from left to right, a 3-digit sequence with the same 3 digits followed by a space character, but only if this 3-digit sequence is immediately followed by one or more digits terminated by a \n character. The loop ends when no further substitution is possible.
s/\n//g removes all \n instances from the resultant pattern space. They have been used as an anchor, or marker, to delimit the end of the digit sequences of 10 or more digits. Their mission has been completed now.
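The loop described above can be tried on the sample input from the question. This sketch assumes GNU sed (it relies on \n in the replacement producing a newline) and splits the script into separate -e expressions so that the label sits on its own; the variable names are only for illustration.

```shell
# Sample input taken from the question.
input=$(printf '%s\n' \
  'blahaaaaaa goog sdd 234 3242423' \
  'ala el 213123123123' \
  '1231231313123 i 14124124141411')

# Mark the end of each 10+ digit run with \n, split off 3 digits at a time
# while more digits precede the marker, then remove the markers.
spaced=$(printf '%s\n' "$input" | sed -E \
  -e 's/[0-9]{10,}/&\n/g' \
  -e ':a' \
  -e 's/([0-9]{3})([0-9]+\n)/\1 \2/' \
  -e 'ta' \
  -e 's/\n//g')

printf '%s\n' "$spaced"
# blahaaaaaa goog sdd 234 3242423
# ala el 213 123 123 123
# 123 123 131 312 3 i 141 241 241 414 11
```

Because the second group requires at least one digit before the marker, the last chunk of each number never gets a trailing space.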
When you need to meet a complex set of requirements like this, it is more convenient to use perl:
perl -i -pe 's/\d{10,}/$&=~s|\d{3}(?=\d)|$& |gr/ge' file
Here,
\d{10,} - matches 10 or more consecutive digits
$&=~s|\d{3}(?=\d)|$& |gr - takes the whole match (the 10+ digit substring) and replaces every 3-digit chunk that is followed by another digit (matched with \d{3}(?=\d)) with this match (since $& is the placeholder for the whole match) and a space. The lookahead avoids a trailing space after the last chunk. g is used to perform as many replacements as there are matches in the input, and r is used to return the result of the substitution and leave the original string untouched.
ge - this flag combination means that all matches will be replaced (g), and e is necessary since the replacement here is a Perl expression to be evaluated.
Note that -i edits the file in place.
Preprocess and postprocess the file, so that sed sees each space-separated word on its own line:
tr "\n " "\r\n" < "${file}" | sed -r '/[0-9]{10}/ s/[0-9]{3}/& /g' | tr '\r\n' '\n '
This might work for you (GNU sed):
cat <<\!|sed -Ef - file
/[[:digit:]]{10,}/{
s//\n&\n/
h
s/.*\n(.*)\n.*/\1/
s/.{3}\B/& /g
G
s/(.*)(\n.*)\n.*\n/\2\1/
D
}
!
Determine if the current line has any 10 or more digit numbers and if so process them.
Surround the first such number by newlines.
Copy the whole line to the hold space (HS).
Remove everything except the number from the current line.
Space the number every 3 digits (only do so if there is a following digit).
Append the original line from HS to the current line.
Replace the original number by the spaced number and remove all introduced newlines except the first.
Delete the introduced newline and thus repeat the process.
N.B. The D command removes up to and including the first newline in the current line, i.e. the pattern space. If there is no newline, it acts the same as the d command. However, if there is a newline, once it has removed the text before and including the newline, if there is further text it begins a new cycle but does not read in another line from the input. Thus it treats whatever remains in the pattern space as if it had read in another line of input and starts the sed cycle again. Inserting a newline and then using the D command is therefore equivalent to :a;...;ba.
Or if you prefer:
sed -E '/[[:digit:]]{10,}/{s//\n&\n/;h;s/.*\n(.*)\n.*/\1/;s/.{3}\B/& /g;G;s/(.*)(\n.*)\n.*\n/\2\1/;D}' file
An alternative that just uses the pattern space:
sed -E '/[[:digit:]]{10,}/{s//\n&\n/;s/(.*)(\n.*)(\n.*)/\1\3\2/;:a;s/^(.*\n.*\n([[:digit:]]{3} )*[[:digit:]]{3}\B)/\1 /;ta;s/(.*)\n(.*)\n(.*)/\n\1\3\2/;D}' file

regex: match lines that don't end with '}' and match one of three words [duplicate]

I have this text:
NBA:red this line has a tab and ends with a curly braces}
some random text qwertyuiop
NBA:green this line must match
NBA:red this line has a tab and must match
NBA:response this line has spaces and must match
NBA:blue this line has a tab and ends with a curly braces}
some random text qwertyuiop
NBA:blue this line has spaces at the begining and ends with curly braces}
random text qwertyuiop
this line must not match}
this line must not match }
I want to match the lines that contain 'NBA:' followed by the word 'red', 'green' or 'blue', and that don't end with a curly brace '}'. This command matches only 'NBA:' and one of the three words:
$ egrep 'NBA:(red|green|blue)' myfile.txt
NBA:red this line has a tab and ends with a curly braces}
NBA:green this line must match
NBA:red this line has a tab and must match
NBA:blue this line has a tab and ends with a curly braces}
NBA:blue this line has spaces at the begining and ends with curly braces}
But I don't know how to match the lines that don't end with '}'.
I tried this but it doesn't work:
egrep 'NBA:(red|green|blue)*[^}]$' myfile.txt
But this works:
egrep 'NBA:(red|green|blue)' myfile.txt | egrep '[^}]$'
NBA:green this line must match
NBA:red this line has a tab and must match
I want to do it in just one command.
You were just one character off. This should work fine:
egrep 'NBA:(red|green|blue).*[^}]$'
# ^
# Note this bit.
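* doesn't mean the same thing in regex that it does in glob patterns. It means zero or more of the preceding item. In your attempt the preceding item was the group (red|green|blue), so nothing could stand between the color word and the last character of the line; in this answer the preceding item is ., any character.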
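The two patterns can be compared side by side. This is a sketch on a made-up five-line sample (not the asker's file), using grep -E, which is equivalent to egrep.

```shell
# Hypothetical sample lines, not taken from the question's file.
sample=$(printf '%s\n' \
  'NBA:red drop me}' \
  'NBA:green keep me' \
  'NBA:blue keep me too' \
  'NBA:yellow wrong word' \
  'no prefix here')

# Here * applies to the group: "NBA:", zero or more color words, exactly one
# non-} character, then end of line. No sample line has that shape.
broken_count=$(printf '%s\n' "$sample" | grep -cE 'NBA:(red|green|blue)*[^}]$' || true)

# .* lets anything sit between the color word and the final non-} character.
fixed=$(printf '%s\n' "$sample" | grep -E 'NBA:(red|green|blue).*[^}]$')

echo "$broken_count"  # 0
echo "$fixed"         # the "NBA:green" and "NBA:blue" lines
```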

Vimgrep before any empty line

I have a lot of files which start with some tags I defined.
Example:
=Title
#context
!todo
#topic
#subject
#etc
And some text (notice the blank line just before this text).
Foo
Bar
I'd like to write a Vim search command (with vimgrep) to match something before an empty line.
How do I grep only in the lines before the first blank line? Will that make the grep faster? Please, no need to mention :grep and external binaries like ag (The Silver Searcher).
I know \_.* matches everything including EOL. I know the negation [^foo]. I managed to match everything but empty lines with /[^^$]. But I didn't manage to compose my :vimgrep command. Thank you for your help!
If you want a general solution that works for any file content, then AFAIK you can't get one with that form of text. You may ask why.
Explanation:
:vimgrep takes a pattern argument and searches line by line, exactly as the :global command does.
For your case we need to get the first part preceding the first blank line. (It can be extended to: get the first non-blank text.)
Let's call:
A: every block of text not containing any blank line
x: blank lines
Only with these 5 forms of file content can you get the first A block with vimgrep (cases 2, 4 and 5 are obvious):
1 | 2 | 3 | 4 | 5
x | x | A | x | A
A | A | x | A | x
x | x | A |   |
A |   |   |   |
Your file has this form:
A
x
A
x
A
The middle block causes a problem; that's why you cannot split off the first A unless you delimit it by some known UNIQUE text.
So the only solution that I can come up with, for those 5 cases only, is:
:vimgrep /\_.\{-}\(\(\n\s*\n\)\+\)\#=/ %
AFAIK the most you can do with :vimgrep is use the \%<XXl atom to restrict the search to lines above a specific line number:
:vim /\%<20lfunction/ *.vim
That command will find all instances of function above line 20 in the given files.
See :help \%l.
[...] always matches a single character. [^^$] matches a character that is not ^ or $. This is not what you want.
One of the things you can do is:
/\%^\%(.\+\n\)\{-}.\{-}\zsfoo/
This matches
\%^ - the beginning of the file
\%( \) - a non-capturing group
\{-} - ... repeated 0 or more times (as few as possible)
.\+ - 1 or more non-newline characters
\n - a newline
.\{-} - 0 or more non-newline characters (as few as possible)
\zs - the official start of the match
This will find the first occurrence of foo, starting from the beginning of the file, searching only non-empty lines. But that's all it does: You can't use it to find multiple matches.
Alternatively:
/\%(^\n\_.*\)\#<!foo/
\%( \) - a non-capturing group
\#<! - not-preceded-by modifier
^ - beginning of line
\n - newline
\_.* - 0 or more of any character (including newlines)
This matches every occurrence of foo that is not preceded anywhere by an empty line (i.e. a beginning-of-line / newline combo).

what does (?=.*[^a-zA-Z]) mean

What does (?=.*[^a-zA-Z]) mean
I am a beginner in regex and don't understand what it means.
Is it that dot (.) means any character, so .* means any character any number of times, and [^a-zA-Z] means any one character except a-z and A-Z?
What string will match it?
Thanks,
Puneet
That is a positive lookahead assertion.
It asserts that there is at least one character that is not in a-zA-Z somewhere to the right of the current position, without consuming any characters.
Example:
$ echo 12abc | grep -P '2(?=.*[^a-zA-Z])'
$ echo 12abc. | grep -P '2(?=.*[^a-zA-Z])'
12abc.
In the first command, there is no character outside a-zA-Z after the 2, so the line is not shown.
In the second command, I've added a period to the end. Now there is a character outside a-zA-Z after the 2, so the line is found and shown.
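That the lookahead only checks and does not consume can be seen by printing just the matched text. This is a sketch that assumes a grep built with PCRE support (-P), as in the example above; the inputs are made up.

```shell
# -o prints only the consumed text: the lookahead finds the "1" further
# right, but the match itself is just the "a".
consumed=$(printf 'abc1\n' | grep -oP 'a(?=.*[^a-zA-Z])')
echo "$consumed"  # prints: a

# Only letters follow the "a", so the assertion fails and nothing matches.
none=$(printf 'abcd\n' | grep -oP 'a(?=.*[^a-zA-Z])' || true)
echo "[$none]"    # prints: []
```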