using grep in ubuntu - regex

I am trying to search for a pattern in a file named test using grep on Ubuntu.
The following is the content of test:
./foldera/[hello]this.mp4
./foldera/folderb/[hello]that.mp4
./folderc/[these]hello.mp4
On a regex simulator website, I used the following pattern and it works: all three lines match.
.*\/[A-Za-z0-9\[\]]+\.mp4
But when I run the following command in a terminal on Ubuntu, it doesn't work: nothing is printed.
timothy@ubuntu:~$ cat ~/Desktop/test
./foldera/[hello]this.mp4
./foldera/folderb/[hello]that.mp4
./folderc/[these]hello.mp4
timothy@ubuntu:~$ cat ~/Desktop/test | grep -E '.*\/[A-Za-z0-9\[\]]+\.mp4'
timothy@ubuntu:~$
Why can't grep match any of the lines in the file?

grep's extended regular expressions don't use a backslash to escape square brackets inside square brackets. The proper way to do it is to put ] as the first character in the square brackets; there it is treated as a literal character, because you can't have an empty character set.
grep -E '/[]A-Za-z0-9[]+\.mp4' test.txt
There's also no need for .* at the beginning. grep simply checks whether anything on the line matches the pattern, so adding a match for anything at the beginning or end is redundant (it's only necessary if you're using -o to print just the part of the line that matches).
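As a rough check of the point above, the sketch below counts matching lines for both spellings of the bracket expression. The sample lines are copied from the question; the variable names (sample, fixed_count, escaped_count) are only illustrative.

```shell
# Sample lines copied from the question.
sample='./foldera/[hello]this.mp4
./foldera/folderb/[hello]that.mp4
./folderc/[these]hello.mp4'

# ']' comes first inside the brackets, so it is literal; '[' needs no escape there.
fixed_count=$(printf '%s\n' "$sample" | grep -cE '/[]A-Za-z0-9[]+\.mp4' || true)

# The backslash-escaped class ends at its first ']', so the rest of the
# pattern demands a literal ']' right before ".mp4", which no line has.
escaped_count=$(printf '%s\n' "$sample" | grep -cE '/[A-Za-z0-9\[\]]+\.mp4' 2>/dev/null || true)

echo "fixed: $fixed_count, escaped: $escaped_count"
```

The -c option prints the number of matching lines instead of the lines themselves, which makes the difference between the two patterns easy to see.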

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7
With your shown samples, please try the following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: a detailed breakdown of the regex above.
^TIME\[ ## Match the literal string TIME[ at the start of the line.
\d+\.\d+ms\] ## Match digits (1 or more occurrences), a dot, digits (1 or more occurrences), then ms].
\s+-\(\d+\)-\.+ ## Match whitespace (1 or more occurrences), then -, digits (1 or more occurrences) in parentheses, -, and 1 or more dots.
\K ## PCRE's \K (available through GNU grep's -P) requires the preceding part to match but drops it from the output, so only what follows is printed.
.* ## Match the rest of the line.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: awk's match() function matches the regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+, which covers the text we want to remove from each line. substr($0,RSTART+RLENGTH) then prints everything after the matched part, which is what the OP actually needs.
This awk command uses the sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
sub() returns the number of substitutions it made, so the line is printed only when a replacement happened.
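A small sketch of that behaviour, using two hypothetical input lines (one in the TIME format, one unrelated). Because sub() is used as the awk pattern, only the line where a substitution happened is printed.

```shell
# Two hypothetical input lines: the first has the TIME prefix, the second does not.
out=$(printf '%s\n' 'TIME[1.0ms] -(3)-..."keep me"' 'unrelated line' |
  awk 'sub(/^TIME[[][^]]*].*\.+/, "")')

# Only the first line survives, with its prefix removed.
echo "$out"
```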
$ cut -d'"' -f2 file
TEXT I WANT TO KEEP
You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"
The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
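To make the BRE/ERE difference concrete, here is a minimal sketch that strips the prefix in both syntaxes. The sample line is modeled on the question; the two spaces after ] are an assumption taken from the asker's \s\s pattern, and [[:space:]]* / [[:space:]]+ is used so the exact count does not matter.

```shell
# Sample line modeled on the question (two spaces after ']' are an assumption).
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# BRE (sed's default): plain ( ) are literal characters and * is the repetition operator.
bre_out=$(printf '%s\n' "$line" | sed 's/^TIME\[[^][]*][[:space:]]*-(3)-\.*//')

# ERE (-E): ( ) must be backslashed to be literal, and + works as a repetition operator.
ere_out=$(printf '%s\n' "$line" | sed -E 's/^TIME\[[^][]*][[:space:]]+-\(3\)-\.+//')

printf '%s\n%s\n' "$bre_out" "$ere_out"
```

Both commands should leave only the quoted text; the only thing that changes between them is which characters need a backslash.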
A simpler one for sed:
sed 's/^[^"]*//' myfile.txt
If the "text you want to keep" is always surrounded by quotes like this, and it is the only quoted text on the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.
Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

Get only one instance of a regex instead of multiple

I have a text file that contains:
libpackage-example1.so.3.2.1,
libpackage-example2.so.3.2.1,
libpackage-example3.so.3.2.1,
libpackage-example4.so.3.2.1
I only want to get one instance of "3.2.1", but when I run the command below:
grep -Po '(?<=.so.)\d.\d.\d'
The result is
3.2.1
3.2.1
3.2.1
3.2.1
instead of just one "3.2.1". I think making it a lazy regex would work, but I do not know how to do that.
The regex is applied to each line. No matter how you change the regex, if the whole file contains multiple matching lines then all of them will be printed.
However, you can limit the number of matched lines using the -m option. Because -m counts lines rather than matches, -o -m 1 prints every match found on the first matching line and then exits. If a single line can contain multiple matches, use grep ... | head -n1 instead.
Also, keep in mind that . means any character. To specify a literal dot use \. or [.].
Perl regexes also support \K which makes writing easier. Only the part after the last \K will be printed.
grep -Pom1 '\.so\.\K\d\.\d\.\d'
The grep command has the -m N option, which makes grep stop after the first N matching lines.
In general, the way to only get the first line of output in unix is to send the output to the head command. To get just the first line of output, do:
grep -Po '(?<=.so.)\d.\d.\d' | head -n 1
That "1" can be any number.
Use
awk -F'[.]so[.]' '/^libpackage-/{sub(/,$/,"", $NF);print $NF; exit}'
Split with .so. separator, find the line beginning with libpackage-, remove a comma from the end of the last field, print it and stop processing.
Another way:
grep -m1 -Po '(?<=\.so\.)\d+\.\d+\.\d+'
-m1 gets the first instance. I updated the expression: literal periods should be escaped, and \d+ will match one or more digits.
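The answers above can be compared side by side with a short sketch. The sample lines come from the question; the first command assumes GNU grep built with PCRE support (-P), while the second is a hedged POSIX-only fallback using sed and head that is not part of the original answers.

```shell
# Sample lines copied from the question.
libs='libpackage-example1.so.3.2.1,
libpackage-example2.so.3.2.1,
libpackage-example3.so.3.2.1,
libpackage-example4.so.3.2.1'

# GNU grep with PCRE: stop after the first matching line, print only the version.
pcre_version=$(printf '%s\n' "$libs" | grep -m1 -Po '(?<=\.so\.)\d+\.\d+\.\d+')

# Fallback without -P: capture digits and dots after ".so.", keep the first result.
posix_version=$(printf '%s\n' "$libs" | sed -n 's/.*\.so\.\([0-9.]*\).*/\1/p' | head -n 1)

echo "$pcre_version $posix_version"
```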

sed does not match the regex

I've written this regex:
/_([^_+\n][\w]+)_/g
and I wanted to test it out on my terminal with
echo "HELLO ___ _HELO_WORLD_" | sed "/_([^_+\n][\w]+)_/g"
However, it outputs
HELLO ___ _HELO_WORLD_
which means sed does not match anything.
The result needs to be:
_HELO_WORLD_
I am using OS X, and I tried both -E and -e as suggested by other posts, but that didn't change anything. What am I doing wrong here?
sed is not particularly well suited for this task: it is good at applying patterns to lines, less so to words, which makes the regexes overly complicated.
word-oriented solution
Anyhow, here's an attempt, using two replacement patterns:
sed -e 's|\<[^_][^\> ]*[^_]\> *||g' -e 's|\<_*\> *||g'
The first expression replaces any word that neither starts nor ends with an underscore (and any trailing spaces) with nothing. \< indicates the beginning of a word and \> the end (GNU sed supports these; they are not in POSIX), so \<[^_][^\> ]*[^_]\> translates to: at the beginning of a word \< there is a character that is not an underscore [^_], followed by any number of characters that do not end the word [^\> ]* (inside brackets \> is not an anchor, so this class simply excludes backslash, > and space), followed by a character that is not an underscore [^_] right before the word ends \>.
The second expression is simpler and replaces any word consisting solely of underscores with nothing.
line oriented processing
if you can arrange for your data to be one expression per line you can use something like the following
$ cat data.txt
HELLO
___
_HELO_WORLD_
$ cat data.txt | sed -n -e '/_[^_+\s]\w*_/p'
_HELO_WORLD_
$
The sed expression is almost the one you gave (sed treats + as a literal character in its default basic syntax, so I use a workaround with * instead).
The basic trick is to use the -n flag to disable the default printing of lines and to use the p command to explicitly print matching lines.
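For a sed without the \s and \w shorthands, the same line-oriented filter can be sketched with POSIX character classes and -E. This variant is an assumption on my part, not the original answer's command; the three input lines are the ones from data.txt above.

```shell
# Same three lines as data.txt, fed through a pipe instead of a file.
portable=$(printf '%s\n' 'HELLO' '___' '_HELO_WORLD_' |
  sed -n -E '/_[^_+[:space:]][[:alnum:]_]+_/p')

# -n suppresses the default output; p prints only the lines the address matches.
echo "$portable"
```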
I am still not sure what you are asking, so I will answer what I guess you are asking. My guess is that you want to find strings surrounded by underscores with sed. The short answer is: no. The longer answer is: you cannot find overlapping string parts with sed, because it does not support lookahead.
If you take this string _HELLO_WORLD_ and the following pattern _[^_]*_, the pattern will match _HELLO_ and the remaining string is WORLD_, which will not match, because the leading underscore has already been consumed.
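The consumed-underscore effect is easy to reproduce with grep -o, which prints each non-overlapping match on its own line. This is only an illustration of the matching behaviour described above, using the same string and pattern.

```shell
# The middle underscore is used up by the first match, so "WORLD_" has
# no leading underscore left and produces no second match.
consumed=$(printf '%s\n' '_HELLO_WORLD_' | grep -o '_[^_]*_')

echo "$consumed"
```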
Sed is the wrong tool for this. Use Perl instead. This prints all strings surrounded by underscores:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/_([A-Z]+)(?=_)/print $1/ge'
HELOWORLD
Update reflecting your last comment:
If you want to find strings starting and ending with an underscore at word boundaries, use this one:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/\b_([A-Z]+[_A-Z]*[A-Z]*)_\b/print $1/ge'
HELO_WORLD
There are multiple problems:
Your sed command is only a condition. It should be an action, such as s/pattern/replacement/flags, or the condition should be followed by an action, e.g. /_([^_+\n][\w]+)_/p to print the line.
With sed, you either need to escape your parentheses and + or use the -r (--regexp-extended) flag.
[\w]: \w is already a character class by itself, so there is no need to enclose it in a class.
Finally, a shot at what I think you want with GNU grep :
grep -P -o "_[^_+\n\s]\w+_"
$ echo "HELLO ___ _HELO_WORLD_" | grep -P -o "_[^_+\n\s]\w+_"
_HELO_WORLD_
Using grep is enough and easier if you only need to match.
-o lets you retrieve only the matched part rather than the whole line
-P uses perl regexes so that you can use shorthand classes as \n and \s
I added \s to the negated class, because previously it could match the space before what you want to match, since \w can match the underscore.
If you can't use GNU grep, then it's back to sed, which is already answered by ceving.
As many answers and the downvotes suggest, sed doesn't look like the right tool to use for this question, so I ended up using Python, which worked out really well, so I will just post it here for anyone in the future who might have same problem.
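import re
text = "HELLO ___ _HELO_WORLD_"  # sample input from the question
p = re.compile(r'_([^_+\n][\w ]+)_')  # raw string keeps \w intact for the regex engine
result = p.findall(text)
print(result)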

Using Sed to delete lines which contain non alphabets

The following Regex works as expected in Notepad++:
^.*[^a-z\r\n].*$
However, when I try to use it with sed, it won't work.
sed -r 's/\(^.*[^a-z\r\n].*$\)//g' wordlist.txt
You could use:
sed -i '/[^a-z]/d' wordlist.txt
This will delete each line that has a non-alphabet character (no need to specify linefeeds)
EDIT:
Your regex doesn't work because you are trying to match
( bracket
^ beginning of line
...
$ end of line
) bracket
As you won't have a bracket and then the beginning of the line, your regex simply doesn't match anything.
Note, also an expression of
's/\(^.*[^a-z\r\n].*$\)//g'
wouldn't delete a line but replace it with a blank line
EDIT2:
Note that in sed the -r flag changes the behaviour of \( and \): without the -r flag they are group indicators, but with the -r flag they are just literal brackets.
Two things:
Sed is a stream editor. It processes one line of the input at a time. That means the search and replace commands, etc, can only see the current line. By contrast, Notepad++ has the whole file in memory and so its search expressions can span two or more lines.
Your command sed -r 's/\(^.*[^a-z\r\n].*$\)//g' wordlist.txt includes \( and \). With -r these mean real (i.e. literal) round brackets. So the command asks for a ( before the start of the line and a ) after the end of it, which can never match. Rewriting the command as sed -r 's/^.*[^a-z\r\n].*$//g' wordlist.txt should have the desired effect. You could also remove the \r\n to give sed -r 's/^.*[^a-z].*$//g' wordlist.txt. But neither of these is exactly the same as the Notepad++ command, as they leave empty lines behind. So you may find that the command sed -r '/^.*[^a-z].*$/d' wordlist.txt is closer to what you really want.
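The difference between deleting a line and replacing it with nothing can be seen on a tiny example. The three-word list below is hypothetical, and the variable names are only for illustration.

```shell
# Hypothetical word list: the middle entry contains a digit.
words='apple
b4nana
cherry'

# The d command removes the whole line, newline included.
deleted=$(printf '%s\n' "$words" | sed '/[^a-z]/d')

# s/// only empties the line, so a blank line is left behind.
blanked=$(printf '%s\n' "$words" | sed 's/^.*[^a-z].*$//')

printf 'deleted:\n%s\nblanked:\n%s\n' "$deleted" "$blanked"
```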

Matching a word followed by a space, followed by three numbers in BASH

I have an issue that I would imagine is simple but I have spent the past hour trying everything out there.
I'm trying to match a string followed by a space followed by 3 numbers in grep.
egrep hello\s\d{3}
I have also tried older styles:
grep hello[:blank:][0-9][0-9][0-9]
If I use grep with hello or the numbers in a row independently they work fine, but as soon as you try to combine it with a blank or a space, grep returns nothing.
Off by two characters (or four, if you count quotes):
grep 'hello[[:blank:]][0-9][0-9][0-9]'
If you're determining whether a variable (as opposed to a file or stream) matches, on the other hand, grep isn't the right tool; bash has regex evaluation built in:
str='hello 123'
re='^hello[[:blank:]][0-9]{3}$'
if [[ $str =~ $re ]]; then
  echo "Match!"
fi
You need to put [:blank:] into a character class.
$ grep 'hello[[:blank:]][0-9][0-9][0-9]' file
hello 123
OR
$ grep 'hello[[:blank:]][0-9]\{3\}' file
hello 123
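[0-9]\{3\} matches exactly three digits in a row; because the pattern is not anchored, a line with a longer run of digits still matches.
How do I know about the character class syntax? grep itself reports it: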
$ grep hello[:blank:][0-9][0-9][0-9] file
grep: character class syntax is [[:space:]], not [:space:]
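A short sketch of how the interval expression behaves with and without anchoring. The four input lines are hypothetical, and -x (match the whole line) is one possible way to enforce exactly three digits; it is not taken from the answers above.

```shell
# Hypothetical sample lines.
input='hello 123
hello 1234
hello 12
hello123'

# Unanchored: "hello 1234" also matches, because it contains "hello 123".
loose=$(printf '%s\n' "$input" | grep 'hello[[:blank:]][0-9]\{3\}')

# -x requires the whole line to match, so only exactly three digits pass.
strict=$(printf '%s\n' "$input" | grep -x 'hello[[:blank:]][0-9]\{3\}')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```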