I want to parse html files to extract strings between "{{_(" and ")}}" using GREP. I tried something like this:
grep '"[^{{_(|)}}$]"' *.html
but it didn't work.
Can someone help me please?
Thanks!
You may use
grep -oP '(?<={{_\().+?(?=\)}})' file
Details
-o - output only matched substrings
-P - enable the PCRE regex engine
(?<={{_\().+?(?=\)}}) match:
(?<={{_\() - a location that is immediately preceded by {{_(
.+? - any one or more chars other than line break chars, as few as possible
(?=\)}}) - a location that is immediately followed by )}}
See the regex demo.
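As a quick check of the pattern above, here is a minimal sketch run against a made-up one-line file (the file name and its contents are assumptions for illustration only). It assumes GNU grep built with PCRE support, since -P is not available everywhere.

```shell
# Hypothetical sample file, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' '<p>{{_("Hello")}} and {{_("Bye")}}</p>' > "$tmpdir/sample.html"

# -o prints each match on its own line; the lazy .+? stops at the
# first )}} so two translations on one line give two separate matches.
matches=$(grep -oP '(?<={{_\().+?(?=\)}})' "$tmpdir/sample.html")
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

With a greedy .+ instead of .+? the same line would yield a single match running from the first {{_( to the last )}}.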
@Wiktor Stribiżew's answer works really well. However, if you have multiple files, you will get output like this, where the file name is displayed with each match:
foo.html: content abc
foo.html: test 123
bar.html: first match
bar.html: second match
So, if you are only interested in the matching string as output, you can try sed instead
sed -n 's/.*{{_(\(.*\))}}.*/\1/p' *.html
You can also count the unique occurrence of matches and things like that...
Update:
Or just use -h (--no-filename) with the grep command that @Wiktor Stribiżew provided:
grep -h -oP '(?<={{_\().+?(?=\)}})' *.html
Or use the -c flag to display a count per file. Note that -c counts matching lines, not individual matches, so a line containing two matches is counted once:
grep -c -oP '(?<={{_\().+?(?=\)}})' *.html
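To compare the variants above, here is a small sketch on a made-up file (foo.html and its contents are assumptions). It illustrates two details worth knowing: -c reports matching lines rather than matches, and the greedy sed substitution keeps only the last match on each line. GNU grep with PCRE support is assumed.

```shell
# Hypothetical input: the first line holds two matches, the second holds one.
tmpdir=$(mktemp -d)
printf '%s\n' '<p>{{_(one)}} {{_(two)}}</p>' '<p>{{_(three)}}</p>' > "$tmpdir/foo.html"

# -c counts lines that contain at least one match.
line_count=$(grep -cP '(?<={{_\().+?(?=\)}})' "$tmpdir/foo.html")

# To count every match, print them with -o and count the output lines.
match_count=$(grep -ohP '(?<={{_\().+?(?=\)}})' "$tmpdir/foo.html" | wc -l | tr -d ' ')

# The sed version prints one result per line: the leading .* is greedy,
# so only the last {{_(...)}} on a line survives.
sed_out=$(sed -n 's/.*{{_(\(.*\))}}.*/\1/p' "$tmpdir/foo.html")

echo "lines: $line_count, matches: $match_count"
printf '%s\n' "$sed_out"

rm -rf "$tmpdir"
```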
As in the previous answers, it is also possible to grep the value of an HTML attribute.
placeholder="SOME TEXT_HERE" -> grep -> "SOME TEXT_HERE"
grep -oP '(?<=placeholder=").+?(?=")' *.html
Related
I have a bunch of LaTeX files that I want to search using grep pattern *.tex -n --color=always. It's important that I see:
The file's name that matched
The line number of the match
The full line, with the matched pattern highlighted
Furthermore, sometimes the pattern needs to be a full word match, so the command becomes grep pattern *.tex -n -w --color=always
I would like to modify this command to exclude commented lines in my *.tex files, which start with the % character. I am not interested in matches in comments.
Is this possible?
Maybe you can try the following command:
grep -n -v --color=always '^%' *.tex | grep content
EXPLANATION
The first grep invocation excludes (-v) lines starting with % (the regex ^% matches a % at the start of the line).
The output of the first grep invocation is passed as input to the second grep invocation.
The second grep invocation keeps only the lines matching your filter pattern (you can add the -w option if you need it).
I hope this can help you!
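Here is a minimal sketch of that two-stage pipeline on a made-up file (notes.tex and its contents are assumptions). Color is left out so the output is easy to compare, and -H is added so the file name is shown even for a single file.

```shell
# Hypothetical LaTeX-like input, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'content here' \
  '% content in a comment' \
  'nothing to see' \
  'more content % trailing note' > "$tmpdir/notes.tex"

# Stage 1 drops lines that start with %, stage 2 keeps whole-word matches.
# The second grep sees the "file:line:" prefix too, so a pattern anchored
# with ^ would not behave as it does on the raw file.
result=$(cd "$tmpdir" && grep -Hn -v '^%' notes.tex | grep -w 'content')
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Only lines that begin with % are filtered out. A match that sits after a % later in the line still gets through as long as the line itself is kept.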
Would this do? First some test stuff:
$ cat file
yes
no % yes commented out
yes % yes before comment
no
noyes
furthermore yes
furthermore yes% yes
furthermore no% yes
The grep:
$ grep -Hn "^[^%]*\(^\| \)yes\( \|$\|%\)" file
file:1:yes
file:3:yes % yes before comment
file:6:furthermore yes
file:7:furthermore yes% yes
Edit: For partial matches:
$ grep -Hn "^[^%]*yes" file
and to highlight only the search word you need to grep for it again:
$ grep -Hn "^[^%]*yes" file | grep yes
file:1:yes
file:3:yes % yes before comment
file:5:noyes
file:6:furthermore yes
file:7:furthermore yes% yes
How can I get an unknown substring with a regular expression? I know what's before and after the wanted string, but I don't want the known parts in the result.
Example text:
jhgjgjgvocher_SOMETHINGHERE.dbhjjkghjkg
vocher_SOMETHINGELSE.db
I'm looking for 'SOMETHINGHERE' and 'SOMETHINGELSE' only.
vocher_ and .db are always before and after the relevant part but should not be in the result.
A working solution is:
cat test | egrep -o "vocher_.*\.db" | cut -d "_" -f2 | cut -d "." -f1
… but you know it's ugly.
Is it possible to search exactly for an unknown part with regex (in this case only the .* part), or do I need to use something like sed? Is there a better solution?
A simple solution using perl is the following:
perl -ne 'if (/vocher_(.*)\.db/){ print "$1\n";}' test_file.txt
This iterates line-by-line over the file and only prints the desired portion.
Use the following grep approach:
grep -Po '(?<=vocher_).+(?=\.db)' test
-P - allows Perl regular expressions
-o - prints only matched substrings
The output will be like below:
SOMETHINGHERE
SOMETHINGELSE
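Since the question also asks about sed, here is a sketch of one possible sed equivalent. It is not taken from the answers above, and it uses the question's two example lines as input. It works without -P, so it also fits systems whose grep lacks PCRE support.

```shell
# Sample input copied from the question, written to a temporary file.
tmpdir=$(mktemp -d)
printf '%s\n' 'jhgjgjgvocher_SOMETHINGHERE.dbhjjkghjkg' 'vocher_SOMETHINGELSE.db' > "$tmpdir/test"

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded, replaced by the captured middle part.
with_sed=$(sed -n 's/.*vocher_\(.*\)\.db.*/\1/p' "$tmpdir/test")
printf '%s\n' "$with_sed"

rm -rf "$tmpdir"
```

Unlike grep -o, this prints at most one result per input line.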
Looking for a suggestion on how to use cat file | grep REGEX to get the lines with <version>anything</version>.
grep -F '<version>1.1.9-beta</version>' file
-F will match your pattern as literal text
you don't need that useless cat
If you really mean anything, try grep '<version>.*</version>' file or grep -P '<version>.*?</version>' file. However, searching XML with regex is a bad idea.
Use the -E option to enable extended regular expressions:
grep -E "<version>.*</version>" file
Refer to these rules for the regular expression: https://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html#Regular-Expressions
For example, to match the typical version format (3.14, or 13.14, or 0.1458) you can type:
grep -E "<version>[0-9]+\.[0-9]+</version>" file
You can do:
grep '<version>[^<]*</version>' file.xml
[^<]* will match zero or more characters up to the next <.
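The difference between .* and [^<]* shows up when a line holds more than one element. Here is a small sketch on a made-up one-line file (file.xml and its contents are assumptions), using -o to print only the matched text:

```shell
# Hypothetical input with two version elements on the same line.
tmpdir=$(mktemp -d)
printf '%s\n' '<version>1.0</version><version>2.0</version>' > "$tmpdir/file.xml"

# Greedy .* runs to the last </version>, so the whole line is one match.
greedy=$(grep -o '<version>.*</version>' "$tmpdir/file.xml")

# [^<]* cannot cross a <, so each element is matched on its own.
bounded=$(grep -o '<version>[^<]*</version>' "$tmpdir/file.xml")

printf 'greedy:\n%s\nbounded:\n%s\n' "$greedy" "$bounded"

rm -rf "$tmpdir"
```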
Here is the line in a file I want to extract information from:
backup-initiation-time="00:00" backup-directory-path="/store/backup" backup-retention-period-days="2"
My command is:
grep "backup-directory-path" test.txt | sed 's/.*backup-directory-path="\(.*?\)" /\1/'
I just want /store/backup that's it. I don't know what I'm doing wrong.
You can use \K (keep) and simply:
grep -oP '.*backup-directory-path=\K([^ ])+'
This will display only the captured part after the "keep".
In order to remove the quotes you have there, just modify it:
grep -oP '.*backup-directory-path="\K([^"])+'
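As a sketch of the \K approach on the line from the question (GNU grep with PCRE support is assumed): the leading .* is optional here, because \K discards everything matched before it either way.

```shell
# The sample line from the question.
line='backup-initiation-time="00:00" backup-directory-path="/store/backup" backup-retention-period-days="2"'

# Everything before \K must match but is dropped from the output;
# [^"]+ then stops at the closing quote.
path=$(printf '%s\n' "$line" | grep -oP 'backup-directory-path="\K[^"]+')
printf '%s\n' "$path"
```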
sed uses POSIX regular expressions (basic, or extended with -E), and those have no non-greedy quantifiers, so .*? does not do what you expect there.
Instead of your non-greedy match using .*? you could use [^"]* if you know the last or one past last character you want to match, in your case ".
This command produces the expected output:
grep "backup-directory-path" test.txt | sed 's|.* backup-directory-path="\([^"]*\)".*|\1|'
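The extra grep can be folded into sed itself. Here is a minimal sketch, assuming a test.txt that contains the line from the question plus one made-up unrelated line: with -n and the p flag, sed prints only the lines where the substitution succeeded.

```shell
# Hypothetical test.txt: the line from the question plus an unrelated line.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'some-other-setting="1"' \
  'backup-initiation-time="00:00" backup-directory-path="/store/backup" backup-retention-period-days="2"' \
  > "$tmpdir/test.txt"

# [^"]* stands in for the non-greedy match: it cannot run past the closing quote.
backup_path=$(sed -n 's|.*backup-directory-path="\([^"]*\)".*|\1|p' "$tmpdir/test.txt")
printf '%s\n' "$backup_path"

rm -rf "$tmpdir"
```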
I have a text file, which contains a date in the form of dd/mm/yyyy (e.g. 20/12/2012).
I am trying to use grep to parse the date and show it in the terminal, and it is successful,
until I meet a certain case:
These are my test cases:
grep -E "\d*" returns 20/12/2012
grep -E "\d*/" returns 20/12/2012
grep -E "\d*/\d*" returns 20/12/2012
grep -E "\d*/\d*/" returns nothing
grep -E "\d+" also returns nothing
Could someone explain to me why I get this unexpected behavior?
EDIT: I get the same behavior if I substitute the " (weak quotes) for ' (strong quotes).
The syntax you used (\d) is not recognised by grep's POSIX extended regex (-E).
Use grep -P instead which uses Perl regex (PCRE). For example:
grep -P "\d+/\d+/\d+" input.txt
grep -P "\d{2}/\d{2}/\d{4}" input.txt # more restrictive
Or, to stick with extended regex, use [0-9] in place of \d:
grep -E "[0-9]+/[0-9]+/[0-9]+" input.txt
grep -E "[0-9]{2}/[0-9]{2}/[0-9]{4}" input.txt # more restrictive
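To see the three spellings side by side, here is a small sketch on a made-up sentence (the input text is an assumption). -o is added so that only the date is printed. The -P line assumes GNU grep built with PCRE support.

```shell
# Hypothetical input line containing one date.
text='Report generated on 20/12/2012 at noon'

# Extended regex with an explicit digit range.
ere=$(printf '%s\n' "$text" | grep -Eo '[0-9]{2}/[0-9]{2}/[0-9]{4}')

# Extended regex with the POSIX character class.
posix=$(printf '%s\n' "$text" | grep -Eo '[[:digit:]]+/[[:digit:]]+/[[:digit:]]+')

# PCRE, where \d is understood.
pcre=$(printf '%s\n' "$text" | grep -Po '\d{2}/\d{2}/\d{4}')

printf '%s\n' "$ere" "$posix" "$pcre"
```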
You could also use -P instead of -E which allows grep to use the PCRE syntax
grep -P "\d+/\d+" file
does work too.
grep and egrep/grep -E don't recognize \d as a digit class. Your first three patterns only appear to work because the asterisk makes \d optional: the line matches even though \d itself never matches a digit.
Use [0-9] or [[:digit:]].
To help troubleshoot cases like this, the -o flag can be helpful as it shows only the matched portion of the line. With your original expressions:
grep -Eo "\d*" returns nothing - a clue that \d isn't doing what you thought it was.
grep -Eo "\d*/" returns / (twice) - confirmation that \d isn't matching while the slashes are.
As noted by others, the -P flag solves the issue by recognizing "\d", but to clarify Explosion Pills' answer, you could also use -E as follows:
grep -Eo "[[:digit:]]*/[[:digit:]]*/" returns 20/12/
EDIT: Per a comment by @shawn-chin (thanks!), --color can be used similarly to highlight the portions of the line that are matched while still showing the entire line:
grep -E --color "[[:digit:]]*/[[:digit:]]*/" returns 20/12/2012 (can't do color here, but the bold "20/12/" portion would be in color)