grep utf8/unicode support/ u modifier [duplicate] - regex

I'm trying to validate VTT files against a particular format. The regex works, but UTF-8 characters are causing issues. I tried using (?u), with no luck.
The regex I'm using is:
grep -P '(?m)^(\d+:\d+[.]\d+\s*-->\s*\d+:\d+[.]\d+|\s*[\w\s]+)|^\s*$' . -r -v
The u flag allows the regex to work as expected here, https://regex101.com/r/21HW2A/1, but I can't find a way to do that in grep. Do I need to swap the \w to all allowed alphanumeric chars or can the u modifier be used in grep somehow?

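\w can be replaced with \p{L}, which matches any Unicode letter and doesn't require the u modifier. Note that \p{L} covers letters only; unlike \w, it does not match digits or the underscore.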
Full solution:
grep -P '(?m)^(\d+:\d+[.]\d+\s*-->\s*\d+:\d+[.]\d+|\s*[\p{L}\s]+)|^\s*$' . -r -v

bash script sed to remove www or www3 or any other www prefix from string [duplicate]

I am trying to use \d in regex in sed but it doesn't work:
sed -re 's/\d+//g'
But this is working:
sed -re 's/[0-9]+//g'
\d is not a digit shorthand in sed. If you want a predefined class instead of the [0-9] expression, use the POSIX character class:
s/[[:digit:]]+//g
There is no such special character group in sed. You will have to use [0-9].
In GNU sed, \d introduces a decimal character code of one to three digits in the range 0-255.
As indicated in this comment.
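Make sure sed uses extended regular expressions by adding -E (GNU sed also accepts -r), so that + works as a quantifier.
\d is not recognized as a digit class in either basic or extended mode, so you still need [0-9] or [[:digit:]].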
-E Interpret regular expressions as extended (modern) regular expressions rather than basic regular expressions (BRE's). The re_format(7) manual page fully describes both formats.
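The portable alternatives mentioned in the answers above can be sketched as follows. This assumes a sed that accepts -E (GNU and BSD sed both do); the function name strip_digits is only illustrative and not part of any tool.

```shell
# Hypothetical helper: delete every run of digits from its argument.
# [[:digit:]] is a POSIX character class; [0-9] behaves the same for ASCII digits.
strip_digits() {
    printf '%s\n' "$1" | sed -E 's/[[:digit:]]+//g'
}

strip_digits 'abc123def45'   # prints: abcdef
```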

Can I perform a 'non-global' grep and capture only the first match found for each line of input?

I understand that what I'm asking can be accomplished using awk or sed; I'm asking here how to do this using GREP.
Given the following input:
.bash_profile
.config/ranger/bookmarks
.oh-my-zsh/README.md
I want to use GREP to get:
.bash_profile
.config/
.oh-my-zsh/
Currently I'm trying
grep -Po '([^/]*[/]?){1}'
Which results in output:
.bash_profile
.config/
ranger/
bookmarks
.oh-my-zsh/
README.md
Is there some simple way to use GREP to only get the first matched string on each line?
I think you can match the leading run of non-/ characters like this:
grep -Eo '^[^/]+'
On another SO site there is another similar question with solution.
You don't need grep for this at all.
cut -d / -f 1
The -o option says to print every substring which matches your pattern, instead of printing each matching line. Your current pattern matches every string which doesn't contain slashes (optionally including a trailing slash); but it's easy to switch to one which only matches this pattern at the beginning of a line.
grep -o '^[^/]*' file
Notice the addition of the ^ beginning-of-line anchor, and the omission of the -P option (which you were not really using anyway) as well as the redundant {1} quantifier.
(I should add that plain grep doesn't treat unescaped parentheses or braces as grouping or repetition operators; grep -E supports these constructs just fine, or you could switch to the POSIX BRE variation, which requires a backslash to use round or curly brackets as metacharacters. You can probably ignore these details and just use grep -E everywhere unless you really need the features of grep -P, though also be aware that -P is not portable.)
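Note that the anchored patterns above drop the trailing slash that the requested output keeps. A minimal sketch that keeps it, assuming a grep that supports -o (GNU and BSD grep do); the function name first_component is made up for illustration:

```shell
# Hypothetical helper: print the first path component of each input line,
# keeping one trailing slash when present.
#   ^[^/]*    everything up to the first slash, anchored at line start
#   /\{0,1\}  optionally one slash (POSIX BRE interval, so no -E or -P needed)
first_component() {
    grep -o '^[^/]*/\{0,1\}'
}

printf '%s\n' .bash_profile .config/ranger/bookmarks .oh-my-zsh/README.md |
    first_component
```

Because the pattern is anchored with ^, -o can report at most one match per line, which is what makes the search effectively non-global.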

Why is this white space character following a colon in the grep statement not working in Bash? [duplicate]

Why does the first grep statement below fail to return results, but the modified grep statement below that works? I have tried egrep as well with same results.
cat test
ALL: 192.168.0.0/255.255.0.0, 10.0.0.0/255.0.0.0
grep '^[\s]*ALL[\s]*:[\s]*192.168.0.0/255.255.0.0[\s]*' test
No results
grep '^[\s]*ALL[\s]*: 192.168.0.0/255.255.0.0[\s]*' test
ALL: 192.168.0.0/255.255.0.0, 10.0.0.0/255.0.0.0
Also, when I put a $ at the end, both fail.
grep '^[\s]*ALL[\s]*:[\s]*192.168.0.0/255.255.0.0[\s]*$' test
No results
grep '^[\s]*ALL[\s]*: 192.168.0.0/255.255.0.0[\s]*$' test
No results
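grep is only guaranteed to implement BRE (POSIX basic regular expressions), and \s is not meaningful in BRE. Some OS vendors extend the standard (GNU grep accepts \s outside brackets); some don't. Inside a bracket expression a backslash is literal, so [\s] matches a backslash or the letter s, never whitespace. That is why [\s]* cannot consume the space after the colon, while the pattern with a literal space matches.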
Use [[:space:]] instead to have something that works everywhere.
Adding $ to the end of your expression makes it fail because it matches the end of the line. Your line has an extra , 10.0.0.0/255.0.0.0 after the matching portion, so of course that doesn't match $. You could say .*$, but that would be redundant unless you had the -o/--only-matching flag enabled.
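As a quick check of the advice above, here is a sketch that runs the [[:space:]] version against the sample line from the question. The variable names are illustrative, and the dots are escaped here (an addition not in the original pattern) so that they match only literal dots.

```shell
# Sample line from the question.
line='ALL: 192.168.0.0/255.255.0.0, 10.0.0.0/255.0.0.0'

# [[:space:]] is a POSIX character class, so this works in plain BRE grep.
pattern='^[[:space:]]*ALL[[:space:]]*:[[:space:]]*192\.168\.0\.0/255\.255\.0\.0[[:space:]]*'

# Prints the sample line, because the pattern matches its beginning.
printf '%s\n' "$line" | grep "$pattern"
```

Appending $ to this pattern still yields no match, for the reason given above: the line continues with ", 10.0.0.0/255.0.0.0".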

Escaping # in BRE regex [duplicate]

I want to find files that are scripts and I need to get from these files the list of all interpreters like Bash, sh, etc.
To find that, I use:
grep "#!/bin/*" ./*
But it displays that:
-bash: !/bin/*": event not found
I assume I need to escape # symbol somehow, but I didn't find that symbol to be escaped in documentation of BRE.
And how I can find files that contain this pattern in regex only on the first line of the file?
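The # is not the problem; you need to escape the !. In Bash, ! starts a history expansion and must be followed by something: another ! for the previous command (!!), $ for its last argument (!$), or a number giving the index of a command in the history. (Thanks to Aaron for the correction.)
You may also want to change * into .*, because in a regex * means "zero or more of the preceding character" rather than "anything":
grep "#\!/bin/.*"
If you don't want to escape the !, use single quotes. Inside double quotes Bash keeps the backslash and passes it on to grep, so single quotes are also the cleaner option: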
grep '#!/bin' ....
You can also disable regex matching altogether by using -F.
Could you please try the following find command and let me know if it helps?
find -type f -exec grep -l '#!/bin*' {} \+
Escape the ! with \:
grep "#\!/bin/*" ./*
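The question also asks how to match the pattern only on the first line of each file, which the commands above do not do. One possible sketch, assuming a POSIX shell with head and grep available; the function name list_shebangs is hypothetical:

```shell
# Hypothetical helper: list regular files in a directory whose first line
# is a shebang, together with that line.
list_shebangs() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # head -n 1 restricts the search to the first line of the file;
        # single quotes keep Bash from treating ! as a history expansion.
        if line=$(head -n 1 "$f" | grep '^#!'); then
            printf '%s: %s\n' "$f" "$line"
        fi
    done
}

# Example call: list_shebangs .
```

Feeding only the first line to grep means a "#!" that appears later in a file is ignored.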

Grep regex to unscramble a word

I want to unscramble a word using the grep command.
I am using below code. I know there are other ways to do it, but I think I'm missing something here:
grep "^[yxusonlia]\{9\}$" /usr/share/dict/words
should produce one output:
anxiously
but it produces:
annulosan
innoxious
and many more. Basically I can't find how I should specify that characters can only be matched once, so that I get only one output.
I apologise if it seems very simple but I tried a lot and can't find anything.
You can use grep -P (PCRE regex) with a negative lookahead:
grep -P '^(?:([yxusonlia])(?!.*?\1)){9}$' /usr/share/dict/words
anxiously
Explanation:
This grep regex uses the negative lookahead (?!.*?\1) for each character matched by group #1, i.e. \1. Each character is matched only if the same character does not appear again anywhere in the rest of the string.
You can use lookaheads to make sure that each letter is matched exactly one time. It is verbose and requires a version of grep that supports lookaheads (e.g. via -P). It may be better to build the search string programmatically.
grep -P "^(?=.*y)(?=.*x)(?=.*u)(?=.*s)(?=.*o)(?=.*n)(?=.*l)(?=.*i)(?=.*a)[yxusonlia]{9}$" /usr/share/dict/words
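Building the lookahead pattern programmatically, as suggested above, might look like the following sketch. It assumes the letters are distinct and contain no regex or glob metacharacters; the function name build_anagram_pattern is made up for illustration.

```shell
# Hypothetical helper: build a PCRE that matches words using each of the
# given letters exactly once (assumes distinct, plain letters).
build_anagram_pattern() {
    letters=$1
    pattern='^'
    # fold -w 1 puts each letter on its own line so the loop sees one at a time.
    for c in $(printf '%s' "$letters" | fold -w 1); do
        pattern="${pattern}(?=.*${c})"
    done
    # Character class plus an exact length equal to the number of letters.
    printf '%s[%s]{%s}$\n' "$pattern" "$letters" "${#letters}"
}

# Prints the same pattern as the hand-written one above.
build_anagram_pattern yxusonlia
```

The result can then be passed to a grep that supports -P, for example grep -P "$(build_anagram_pattern yxusonlia)" /usr/share/dict/words.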