How to negate two specific words in regex?

I have a file containing words like these.
Words to keep:
művész-ként
luisz-ként
gravid-ként
chips-ként
bizottság-kent
Pannon-ként
Nagyostobafalva-kent
Words to remove:
font-size
line-height
X-Faktor
Calais-nál
What I need is to remove the words that contain a hyphen where the part after the hyphen is not 'ként' or 'kent'. The file also contains unhyphenated words that I have to keep (like "keresztül", "kod", ...).
This comes close, but it also eliminates the words that do not contain a hyphen.
grep -vE "\w+-(kent|ként) " file.txt

Perl's look-around assertions might simplify the solution:
perl -Mutf8 -CS -ne 'print unless /-(?!k[eé]nt)/' < file
-Mutf8 turns on UTF-8 in the source (i.e. makes the é work in the regex)
-CS turns UTF-8 on for the input and output
The regex says: dash not followed by kent or ként

Using grep, you can do:
grep -E '^(\w+-k[eé]nt|[^-]*)$' file
This will find hyphenated words ending with kent or ként or words with no hyphen.
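A quick way to check the grep filter is to feed it a few of the words from the question. This is only a minimal sketch, not part of the original answers: it swaps k[eé]nt for the alternation (kent|ként) and \w+ for [^-]+ so that the result does not depend on whether the current locale treats é or ű as letters, and the variable name kept is made up for the example.

```shell
#!/bin/sh
# Sample words from the question: hyphenated words to keep,
# hyphenated words to drop, and unhyphenated words to keep.
kept=$(printf '%s\n' \
    'művész-ként' 'bizottság-kent' \
    'font-size' 'X-Faktor' 'Calais-nál' \
    'keresztül' 'kod' |
  grep -E '^([^-]+-(kent|ként)|[^-]*)$')

# Show what survived the filter.
printf '%s\n' "$kept"
```

Only művész-ként, bizottság-kent, keresztül and kod should be printed; the other three hyphenated words are dropped.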

Related

Match a string using grep

I want to match the below string using a regular expression in grep command.
File name is test.txt,
Unknown Unknown
Jessica Patiño
Althea Dubravsky 45622
Monique Outlaw 49473
April Zwearcan 45758
Tania Horne 45467
I want to match the lines containing special characters alone from the above list of lines; the line which I exactly need is 'Jessica Patiño', which contains a non-ASCII character.
I used,
grep '[^0-9a-zA-Z]' test.txt
But it returns all lines.
The following command should return the lines you want:
grep -v '^[0-9a-zA-Z ]*$' test.txt
Explanation
[0-9a-zA-Z ] matches a space or any alphanumeric character.
Adding the asterisk matches any string containing only these characters.
Prepending the pattern with ^ and appending it with $ anchors the string to the beginning and end of line so that the pattern matches only the lines which contain only the desired characters.
Finally, the -v or --invert-match option to grep inverts the sense of matching, i.e., select non-matching lines.
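The steps above can be tried on a shortened copy of the sample data. This is a sketch under one assumption that is not in the original answer: LC_ALL=C is set for grep so that a-z and A-Z mean exactly the ASCII letters (in some locales a range such as a-z can also cover accented letters). The variable name non_ascii_lines is made up for the example.

```shell
#!/bin/sh
# A few lines from the question's test.txt, piped straight into grep.
# LC_ALL=C makes the bracket ranges cover only ASCII letters and digits.
non_ascii_lines=$(printf '%s\n' \
    'Unknown Unknown' \
    'Jessica Patiño' \
    'Althea Dubravsky 45622' \
    'Monique Outlaw 49473' |
  LC_ALL=C grep -v '^[0-9a-zA-Z ]*$')

# Only lines that contain something other than ASCII alphanumerics or spaces remain.
printf '%s\n' "$non_ascii_lines"
```

With this input the only line left is Jessica Patiño.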
The provided answers should work for the example text given. However, you're likely to come across people with hyphens or apostrophes in their names, etc. To search for all non-ASCII characters, this should do the trick:
grep -P "[\x00-\x1F\x7F-\xFF]" test.txt
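-P enables Perl-compatible regular expressions (a GNU grep feature) and allows hexadecimal character escapes. \x00-\x1F are the control characters, and \x7F-\xFF covers DEL and every value from 128 to 255. Note that in a UTF-8 locale the range is interpreted as code points, so characters above U+00FF are not matched; [^\x00-\x7F] is a more direct way to say "anything that is not ASCII".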
I would use:
grep '[^0-9a-zA-Z[:space:]]' test.txt
Or, even better (with GNU grep's -P, which understands \d and \s inside a character class):
grep -Pi '[^\da-z\s]' test.txt

Substitute words not in double quotes

$ cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want a Unix sed command that changes only the basic that is not in quotes (change basic to ring).
Expected output:
$ cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (with {2}*). And then we match basic. Because all of the text before basic is part of the match, it would be replaced as well. That's why this part is wrapped in another pair of parentheses, capturing it and writing it back in the replacement with \1, followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because regex matches cannot overlap. A lookbehind would solve this, but it would have to be a variable-length lookbehind, which is only supported by the .NET regex engine. So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
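The even-number-of-quotes idea and the "run it multiple times" advice can be combined in one sed call. The following is a sketch, not the answerer's exact command: it writes the repetition as (("[^"]*){2})* because POSIX leaves stacked quantifiers such as {2}* undefined, it uses -E instead of -r, and it adds a loop with a made-up label name (again) so the substitution repeats on the same line until nothing more matches. The last input line is an extra test case that is not in the question.

```shell
#!/bin/sh
# Replace "basic" only where an even number of double quotes precedes it.
# ':again' ... 't again' repeats the substitution on the current line
# for as long as the previous attempt replaced something.
result=$(printf '%s\n' \
    '"basic/strong/bold"' \
    '" /""?basic""/strong/bold"' \
    '"^/))basic"' \
    'basic' \
    'basic "basic" basic' |
  sed -E -e ':again' \
         -e 's/^([^"]*(("[^"]*){2})*)basic/\1ring/' \
         -e 't again')

printf '%s\n' "$result"
```

The three quoted lines come out unchanged, the bare basic becomes ring, and the extra line becomes ring "basic" ring.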
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you want to edit the file in place, use the --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes
Assuming that there are no escaped quotes in the strings (e.g. "This is a trap \" hehe"), awk might be able to solve this problem:
awk -F\" 'BEGIN {OFS=FS}
{
    for(i=1; i<=NF; i++){
        if(i%2)
            gsub(/basic/,"ring",$i)
    }
    print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
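For reference, a one-line form of the same idea might look like the sketch below. It steps through the odd-numbered fields directly (i += 2) instead of testing i%2, and the trailing 1 is the usual awk shorthand for "print the line". The input lines and the variable name result are only illustrative.

```shell
#!/bin/sh
# Split each line on double quotes; fields 1, 3, 5, ... lie outside quotes.
result=$(printf '%s\n' \
    '"basic/strong/bold"' \
    'basic' \
    'basic "basic" basic' |
  awk -F'"' 'BEGIN { OFS = FS } { for (i = 1; i <= NF; i += 2) gsub(/basic/, "ring", $i) } 1')

printf '%s\n' "$result"
```

Unlike the single sed substitution, this replaces every unquoted occurrence in one pass, so the third line becomes ring "basic" ring.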
If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0

How to do a non-greedy match in grep?

I want to grep the shortest match and the pattern should be something like:
<car ... model=BMW ...>
...
...
...
</car>
... means any character and the input is multiple lines.
You're looking for a non-greedy (or lazy) match. To get a non-greedy match in regular expressions you need to use the modifier ? after the quantifier. For example you can change .* to .*?.
By default grep doesn't support non-greedy modifiers, but you can use grep -P to use the Perl syntax.
Actually, .*? only works in Perl-style regexes. I am not sure what the equivalent grep extended regexp syntax would be. Fortunately you can use Perl syntax with grep, so grep -P would work, but grep -E, which is the same as egrep, would not (it would be greedy).
See also: http://blog.vinceliu.com/2008/02/non-greedy-regular-expression-matching.html
grep
For non-greedy match in grep you could use a negated character class. In other words, try to avoid wildcards.
For example, to fetch all links to jpeg files from the page content, you'd use:
grep -o '"[^" ]\+\.jpg"'
To deal with multiple lines, pipe the input through xargs first. For performance, use ripgrep.
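The difference between a greedy wildcard and a negated character class is easy to see on a line with two links. This is a small sketch with made-up sample input and variable names; it uses grep -E with + rather than the \+ form above, which behaves the same way in GNU grep.

```shell
#!/bin/sh
line='<img src="a.jpg"> <img src="b.jpg">'

# Greedy: .* runs to the last .jpg" on the line, so both links merge into one match.
greedy=$(printf '%s\n' "$line" | grep -oE '".*\.jpg"')

# Negated class: [^" ] cannot cross a quote or a space, so each link matches separately.
short=$(printf '%s\n' "$line" | grep -oE '"[^" ]+\.jpg"')

printf 'greedy: %s\n' "$greedy"
printf 'short:\n%s\n' "$short"
```

The greedy pattern yields a single match spanning both tags, while the negated class yields "a.jpg" and "b.jpg" on separate lines.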
My grep that works after trying out stuff in this thread:
echo "hi how are you " | grep -shoP ".*? "
Just make sure you append a space to each one of your lines
(Mine was a line by line search to spit out words)
Sorry I am 9 years late, but this might work for the viewers in 2020.
So suppose you have a line like "Hello my name is Jello".
Now you want to find the words that start with 'H' and end with 'o', with any number of characters in between. We don't want whole lines, just the words, so we add -o and use the expression:
grep -o "H[^ ]*o" file
This returns only the matching words. It works because [^ ] allows every character except the space in between, so a match cannot span multiple words on the same line.
You can replace the space with any other separator character.
Suppose the initial line was "Hello-my-name-is-Jello"; then you can get the words using the expression:
grep -o "H[^-]*o" file
The short answer is using the next regular expression:
(?s)<car .*? model=BMW .*?>.*?</car>
(?s) - makes the dot match newlines too, so a match can span multiple lines
.*? - matches any character, any number of times, lazily (minimal match)
A (little) more complicated answer is:
(?s)<([a-z\-_0-9]+?) .*? model=BMW .*?>.*?</\1>
This makes it possible to match car1 and car2 in the following text:
<car1 ... model=BMW ...>
...
...
...
</car1>
<car2 ... model=BMW ...>
...
...
...
</car2>
(..) represents a capturing group
\1 in this context matches the same text as most recently matched by capturing group number 1
I know that it's a bit of a dead post, but I just noticed that this works. It removed both clean-up and cleanup from my output.
> grep -v -e 'clean\-\?up'
> grep --version
grep (GNU grep) 2.20

How can I add characters at the beginning and end of every non-empty line in Perl?

I would like to use this:
perl -pi -e 's/^(.*)$/\"$1\",/g' /path/to/your/file
for adding " at beginning of line and ", at end of each line in text file. The problem is that some lines are just empty lines and I don't want these to be altered. Any ideas how to modify above code or maybe do it completely differently?
Others have already answered the regex syntax issue, so let's look at the style.
s/^(.*)$/\"$1\",/g
This regex suffers from "leaning toothpick syndrome" where /// makes your brain bleed.
s{^ (.+) $}{"$1",}x;
Use of balanced delimiters, the /x modifier to space things out and elimination of unnecessary backwhacks makes the regex far easier to read. Also the /g is unnecessary as this regex is only ever going to match once per line.
perl -pi -e 's/^(.+)$/\"$1\",/g' /your/file
.* matches 0 or more characters; .+ matches 1 or more.
You may also want to replace the .+ with .*\S.* to ensure that only lines containing a non-whitespace character are quoted.
Change .* to .+
In other words, lines must contain at least 1 character. .* represents zero or more characters.
You should be able to just replace the * (0 or more) with a + (1 or more), like so:
perl -pi -e 's/^(.+)$/\"$1\",/g' /path/to/your/file
All you are doing is adding something to the front and back of the line, so there is no need for a regex. Just print them out. A regex for such a task is expensive if your file is big.
gawk
$ awk 'NF{print "\042" $0 "\042,"}' file
or Perl
$ perl -ne 'chomp;print "\042$_\042,\n" if ($_ ne "") ' file
sed -r 's/(.+)/"\1",/' /path/to/your/file
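To see how .+ and the stricter "at least one non-whitespace character" variant differ, here is a small sketch using sed -E. The input lines and variable names are invented for the example, and [^[:space:]] stands in for Perl's \S.

```shell
#!/bin/sh
# Four input lines: text, an empty line, a line of three spaces, text.
input=$(printf '%s\n' 'alpha' '' '   ' 'beta gamma')

# .+ needs at least one character, so only the truly empty line is skipped.
any_char=$(printf '%s\n' "$input" | sed -E 's/^(.+)$/"\1",/')

# Requiring one non-whitespace character also skips the spaces-only line.
non_blank=$(printf '%s\n' "$input" | sed -E 's/^(.*[^[:space:]].*)$/"\1",/')

printf '%s\n---\n%s\n' "$any_char" "$non_blank"
```

With .+ the spaces-only line is wrapped as "   ", while the stricter pattern leaves it untouched; the empty line is left alone in both cases.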

Is there a truly universal wildcard in Grep? [duplicate]

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
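$ perl -l -0777 -ne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
The -0777 switch makes Perl read the whole file as a single record, so a match can cross blank lines (paragraph mode, -000, would split the input at every blank line); -l comes first so that print still appends a newline. The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.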
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
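As a rough illustration of the range behaviour, the sketch below runs the sed command on a tiny made-up page. It also shows one possible way to drop the delimiter lines themselves, using an awk flag variable; that variant is an addition for comparison and does not come from the answers here. All variable names are hypothetical.

```shell
#!/bin/sh
page=$(printf '%s\n' 'top' 'HEADER TEXT' 'body 1' '' 'body 2' 'FOOTER TEXT' 'bottom')

# sed range: prints the delimiter lines and everything between them.
with_delims=$(printf '%s\n' "$page" | sed -n '/HEADER TEXT/,/FOOTER TEXT/p')

# awk flag: clear the flag on the footer before the print rule runs and set it
# only after the header line has been passed, so neither delimiter is printed.
between=$(printf '%s\n' "$page" | awk '/FOOTER TEXT/ { f = 0 } f; /HEADER TEXT/ { f = 1 }')

printf '%s\n---\n%s\n' "$with_delims" "$between"
```

Both commands work line by line, so the blank line inside the body is kept, and both share the limitation that header and footer must be on different lines.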
To do what you want, I'd probably use Perl (but you could use Python if you prefer). I'd consider slurping the whole file and then using a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit' and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the switch (?s)
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed out elsewhere, grep will work for single-line stuff.
For multiple lines (in Ruby with Regexp::MULTILINE, or in Python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...
Here's one way to do it with gawk, if you have it:
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file