Search with grep only lines that start with # - regex

Before I get my ass kicked: I checked several documents on "grep" and couldn't find what I'm looking for, or maybe my English is too limited to get the idea.
I have a lot of markdown documents. Each document contains a first-level heading (#), which is always on line 1.
I can search for ^# and that works, but how can I tell grep to look for certain words on the line that starts with #?
I want this:
grep 'some words' file.markdown
But also specify that the line starts with a #.

You may use
grep '^# \([^ ].*\)\{0,1\}some words' file.markdown
Or, using ERE syntax
grep -E '^# ([^ ].*)?some words' file.markdown
Details
^ - start of a line
# - a # char
\([^ ].*\)\{0,1\} - an optional sequence of patterns (\(...\) is a capturing group in BRE syntax; in ERE it is (...). \{0,1\} is an interval quantifier that matches the pattern it modifies 1 or 0 times):
[^ ] - any char but a space
.* - any 0+ chars
some words - some words text.
See an online grep demo:
s="# Get me some words here
#some words here I don't want
# some words here I need"
grep '^# \([^ ].*\)\{0,1\}some words' <<< "$s"
# => # Get me some words here
# # some words here I need
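As a rough alternative sketch (not part of the answer above), the same filter can be split into two greps: the first keeps only lines that start with "# ", the second searches for the words within those lines. The variable names s and headings are made up for illustration, and the sample text mirrors the demo.

```shell
# Sample data mirroring the demo above (variable names are illustrative).
s='# Get me some words here
#some words here I don'\''t want
# some words here I need'

# Step 1: keep only lines starting with "# " (hash plus a space).
# Step 2: search for the words inside those heading lines.
headings=$(printf '%s\n' "$s" | grep '^# ' | grep 'some words')
printf '%s\n' "$headings"
```

This trades the single regex for a pipeline that is easier to read; the "#some words" line is dropped because it has no space after the #.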

Related

Regex to match exact version phrase

I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not preview (doesn't have text after version number) , so result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
what I could be missing here?
To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If your input file is unsorted, then use grep | sort -rV | head:
grep -E '^v[0-9]+(\.[0-9]+){2}$' f | sort -rV | head -1
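A small sketch of that pipeline on unsorted input, assuming a sort that supports -V (GNU coreutils does). The sample list and the names unsorted and latest are invented for illustration.

```shell
# Hypothetical unsorted version list; v1.0.10 is added to show version ordering.
unsorted='v1.0.1
v1.0.3-preview2
v1.0.10
v1.0.3
v1.0.2'

# Keep plain vN.N.N lines, sort by version in descending order, take the first.
latest=$(printf '%s\n' "$unsorted" \
  | grep -E '^v[0-9]+(\.[0-9]+){2}$' \
  | sort -rV \
  | head -n 1)
printf '%s\n' "$latest"
```

A plain lexical sort would rank v1.0.3 above v1.0.10, which is why the version sort matters here.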
When you use ^ or $ inside [...], they are treated as literal characters, not as anchors (a leading ^ negates the bracket expression instead).
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ digits. Repeat this group 2 times
$: End
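To see why the bracketed pattern from the question still returns the preview version, here is a minimal sketch comparing it with the anchored pattern. It assumes a grep that supports -m (GNU and BSD grep do); the names versions, loose and strict are illustrative.

```shell
versions='v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1'

# The whole pattern is ONE bracket expression: it matches any line that
# contains a single character from the set (v, \, d, +, ., $).
loose=$(printf '%s\n' "$versions" | grep -m1 '[v\d+\.\d+.\d+$]')

# Anchored ERE: the entire line must be v<num>.<num>.<num>.
strict=$(printf '%s\n' "$versions" | grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$')

printf 'bracketed: %s\nanchored:  %s\n' "$loose" "$strict"
```

The bracketed form matches the first line simply because it contains a v.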
To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
Note that you don't need the [ and ] in your pattern, and that you need to escape the dot to match it literally.
With awk you could try following awk code.
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation of the awk code: the match function tests each line against the version regex; on the first matching line, awk prints that line and exits.
Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the pattern.
Keep in mind that $ has special meaning in double-quoted strings in shell scripts, so escape it or put the pattern in single quotes.
The anchors need to be outside of any bracket expression ([]).

And condition usage on grep command

I have two regular expressions and am trying to combine them with an AND condition.
What I have:
grep -E "/[1-9]{4,}/" file
grep -E '([0-9])(.*\1){3}' file
I tried taking the regular expression from each command and chaining multiple greps with a pipe:
cat file | grep pattern1 | grep pattern2
but it didn't work.
Can anyone show me how to use an AND condition in grep with these two patterns?
"/[1-9]{4,}/" '([0-9])(.*\1){3}'
sample input
Q4HXD/7100525/+wg4C54V2I4mh4Xh
aaaa/123/422444qjem,,qewriiafa
!##AVADFQWERASDFASDFQervzxcilh
expected output
Q4HXD/7100525/+wg4C54V2I4mh4Xh
which satisfies both conditions
You need to use [0-9] or [[:digit:]] to match any digit in a POSIX pattern and make sure both patterns are handled as POSIX ERE by passing -E option:
cat file | grep -E '/[0-9]{4,}/' | grep -E '([0-9])(.*\1){3}'
Else, you may use a PCRE pattern like
grep -P '^(?=.*/[0-9]{4,}/).*([0-9])(.*\1){3}' file
See an online grep demo
The latter pattern matches
^ - start of a string
(?=.*/[0-9]{4,}/) - a positive lookahead that makes sure there is /, 4 or more digits, / after any 0+ chars other than line break chars
.* - any 0+ chars other than line break chars, as many as possible
([0-9]) - Group 1: any digit
(.*\1){3} - three occurrences of any 0+ chars other than line break chars, as many as possible, and then the Group 1 value.
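A quick sketch of the piped version on the sample input, showing that each grep narrows the result and only the line satisfying both patterns survives. It assumes GNU grep, since back-references in -E patterns are an extension rather than standard POSIX ERE; the names input and both are illustrative.

```shell
input='Q4HXD/7100525/+wg4C54V2I4mh4Xh
aaaa/123/422444qjem,,qewriiafa
!##AVADFQWERASDFASDFQervzxcilh'

# Condition 1: a slash, four or more digits, a slash.
# Condition 2: some digit that occurs at least four times on the line.
both=$(printf '%s\n' "$input" \
  | grep -E '/[0-9]{4,}/' \
  | grep -E '([0-9])(.*\1){3}')
printf '%s\n' "$both"
```

The second sample line passes condition 2 on its own (the digit 4 occurs four times) but fails condition 1, so the pipeline drops it.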

How do I find all Twitter handles from a text file using grep?

Here's my dilemma:
I really like #somecrazytwitterhandle; he's so cool!
#somecrazytwitterhandle is the best! His email is cth1983#gmail.com.
So initially I thought I needed to search for the following -
"\ #[^\ ]*"
however this doesn't work because some Twitter ids. can start at the beginning of a line as seen above.
So then how do I search for the above? I wanted to do something like this, but I don't know the syntax ... "[^|\ ]#[^\ ]*" where the first bracket is an or ... for either at the beginning of a line or has a space before an "#" symbol.
You can use this grep -o with tr:
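grep -oE '(^|[[:blank:]])#[[:alnum:]_]+' f | tr -d '[:blank:]'
#somecrazytwitterhandle
#somecrazytwitterhandle
Regex #[[:alnum:]_]+ matches a text that starts with # followed by 1+ word characters.
tr -d '[:blank:]' strips the blanks (spaces and tabs) from the output. Note that tr takes the class as '[:blank:]'; with '[[:blank:]]' it would also delete literal [ and ] characters.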
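A self-contained sketch of this grep -o plus tr approach on the sample text from the question, assuming a grep that supports -o (GNU and BSD grep do). Here tr is given the class '[:blank:]', and the names s and handles are illustrative.

```shell
s='I really like #somecrazytwitterhandle; he'\''s so cool!
#somecrazytwitterhandle is the best! His email is cth1983#gmail.com.'

# Match a handle only at the start of a line or right after a blank,
# then delete the leading blank that was captured along with it.
handles=$(printf '%s\n' "$s" \
  | grep -oE '(^|[[:blank:]])#[[:alnum:]_]+' \
  | tr -d '[:blank:]')
printf '%s\n' "$handles"
```

The email address is skipped because its # is preceded by a digit rather than a blank or the start of the line.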
You may use a PCRE regex with a GNU grep like this:
grep -Po '(?<!\S)#\w+' file
The P option enables the PCRE regex engine and o makes it return only the matched texts.
The (?<!\S) negative lookbehind makes sure there is start of string or a whitespace immediately to the left of the current location.
The #\w+ will match a # and then 1+ letters, digits or _.
See the online grep demo:
s="I really like #somecrazytwitterhandle; he's so cool!
#somecrazytwitterhandle is the best!"
grep -Po '(?<!\S)#\w+' <<< "$s"
Output:
#somecrazytwitterhandle
#somecrazytwitterhandle
An alternative solution is to use \B:
grep -Po '\B#\w+' <<< "$s"
See this online demo. \B is a position other than a word boundary, and # must be preceded with a non-word char or start of string then.
#[\w]*?(?=[^\w]) will match Twitter handles, including ones with numbers and underscores.

Regex for uppercase matches with exclusions

I'm trying to come up with a regex for the following case: I need to find any matching paths using grep for the following paths:
Include all uppercase matching paths.
Example:
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
Notice the capital B in Bar.
Exclude all uppercase matching paths that only contain SNAPSHOT and have no other uppercase letters.
Example:
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar
Is this possible with grep?
Something like this might do:
grep -vE '^([^[:upper:]]*(SNAPSHOT)?)*$'
Breakdown:
-v inverts the match (shows all non-matching lines). -E enables Extended Regular Expressions.
^ # Start of line
( )* # Capturing group repeated zero or more times
[^[:upper:]]* # Match all but uppercase zero or more times
(SNAPSHOT)? # Followed by literal SNAPSHOT zero or one time
$ # End of line
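A minimal sketch running that inverted match against the two example paths from the question. The variable names paths and kept are illustrative.

```shell
paths='com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar'

# The regex describes lines whose only uppercase letters belong to SNAPSHOT;
# -v discards those lines and keeps everything else.
kept=$(printf '%s\n' "$paths" | grep -vE '^([^[:upper:]]*(SNAPSHOT)?)*$')
printf '%s\n' "$kept"
```

Only the path with the capital B in Bar is printed, because that B cannot be consumed by either part of the repeated group.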
Just use awk:
$ cat file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar
With GNU awk for gensub():
$ awk 'gensub(/SNAPSHOT/,"","g")~/[[:upper:]]/' file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
With other awks:
$ awk '{r=$0; gsub(/SNAPSHOT/,"",r)} r~/[[:upper:]]/' file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
Well, you need find to list all paths. Then you can do it with grep in two passes: the first keeps paths that contain a capital letter; the second excludes those that contain no capitals except SNAPSHOT:
find . | grep '[A-Z]' | grep -v '.*\/[^A-Z]*SNAPSHOT[^A-Z]*$'
I think only the last grep needs some explanation:
grep -v excludes the matching lines
.*\/ greedily matches everything up to the last slash that still lets the rest of the pattern match. There'll always be a slash due to find .
[^A-Z]* finds all characters that are non-capital letters. So we apply it before and after the SNAPSHOT literal, up to the end of the string.
Here you can play with it online.
If you only want the matching files, I would do it like this:
find . -type f -regex '.*[A-Z].*' | while read -r line; do echo "$line" | sed 's/SNAPSHOT//g' | grep -q '.*[A-Z].*' && echo "$line"; done

Return last [0-9]\{6\} from a string with sed

I want to pass a long list of filenames in the form
something_0230232_long_5160mK.csv
something_0230232_long-025160mK.csv
simething_0230342_lingk425460mK.csv
to sed (or similar Linux shell tools) and always get the last run of digits before mK on each line.
This works if there are exactly 6 digits. How can I extend it to n digits?
echo "something_0230232_long_025160mK.csv" | sed -e "s/S.*\([0-9]\{6\}\)mK\.csv/\1/p"
Solution using GNU grep:
$ grep -Po '[0-9]+(?=mK)' file
5160
025160
425460
Explanation:
-o show only the part of the line that matches.
-P use perl regexp.
[0-9]+ # Match a string of digits (at least one)
(?=mK) # Followed by mK (positive lookahead)
And with sed (since you asked):
sed -E 's/.*[^0-9]([0-9]+)mK.*/\1/' file
-E use extended regexp (alias for -r, but more portable).
s/ # Substitution -
.* # Match everything
[^0-9] # Then one character that is not a digit
([0-9]+) # Capture the last digit string
mK # Followed by the string mK
.* # Match everything left
/ # Replace with -
\1 # The captured digit string only
/ #
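A short sketch applying this sed -E substitution to the three filenames from the question, fed through a pipe instead of a file. The names names and digits are illustrative.

```shell
names='something_0230232_long_5160mK.csv
something_0230232_long-025160mK.csv
simething_0230342_lingk425460mK.csv'

# Replace each whole line with the digit run that directly precedes "mK".
digits=$(printf '%s\n' "$names" | sed -E 's/.*[^0-9]([0-9]+)mK.*/\1/')
printf '%s\n' "$digits"
```

The non-digit before the capture group is what lets the separator be _, - or a letter without changing the command.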
You're on the right track with your sed command:
echo "something_0230232_long_025160mK.csv" |
sed -e 's/^.*[^0-9]\([0-9]\{1,\}\)mK\.csv/\1/'
Differences:
Replace S with ^. This matches at the start (there is no S in the data, so the original would never match).
Replace 6 with 1,. This means 'one or more digits' given the context (strictly, one or more repeats of the previous regex, but the previous regex was [0-9]).
Insert the [^0-9] to stop the .* from being too greedy. When the number of digits matched was fixed (\{6\}), the rigidity prevented the .* from being too greedy. When you have two flexible ranges, the first will be the longest possible. Without the [^0-9], you get a 0 printed for the sample string.
Drop the 'p' so the value is printed once. Alternatively, keep the p and add -n as an option.
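The greediness point can be checked with a side-by-side sketch: the same substitution with and without the [^0-9] guard. The names name, greedy and guarded are illustrative.

```shell
name='something_0230232_long_025160mK.csv'

# Without the guard, .* swallows all but the last digit before mK.
greedy=$(printf '%s\n' "$name" | sed 's/^.*\([0-9]\{1,\}\)mK\.csv/\1/')

# With the guard, .* must stop at a non-digit, so the full run is captured.
guarded=$(printf '%s\n' "$name" | sed 's/^.*[^0-9]\([0-9]\{1,\}\)mK\.csv/\1/')

printf 'without guard: %s\nwith guard:    %s\n' "$greedy" "$guarded"
```

The first command yields 0 and the second 025160, matching the behaviour described in the list above.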
Reminder to self: test before (or shortly after) you post.
echo "something_0230232_long_025160mK.csv" | sed 's/^.*_//' | sed 's/mK.csv//'