Grep Regular Expression in Cygwin

I am using Cygwin on Windows. I want to extract all the lines from a file which contain exactly 9 letters in the name.
To do this, I am using:
cat filename.txt | grep -P "[a-z]{9}"
However, this also returns words that contain uppercase letters and words longer than 9 letters.
I have even set the environment variable LC_ALL to C.
I am able to make this work though:
cat filename.txt | grep -P "^[a-z]*[a-z]$"
And this displays only words with lowercase characters.
Please note that I am running the commands in Cygwin and I have observed that there are certain differences between Cygwin and a Linux Distro. The commands do not work the same way.

Try
cat filename.txt | grep -P "^[a-z]{9}$"
^ = beginning of line
$ = end of line
Your regex is unanchored, so it matches any line that contains a run of nine consecutive lowercase letters anywhere. That includes longer words and words that also contain uppercase letters.
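A minimal sketch of the difference, using four made-up sample words rather than the asker's file. It uses grep -E, since this pattern needs no Perl features. The last command assumes the file might have Windows (CRLF) line endings, which the question does not state; if it does, the carriage return sits between the last letter and the end of the line and stops $ from matching.

```shell
# Hypothetical sample input: 9, 10, 10 (capitalised) and 8 letters.
words=$(printf '%s\n' abcdefghi abcdefghij Abcdefghij abcdefgh)

# Unanchored: any line containing nine consecutive lowercase letters matches.
unanchored=$(printf '%s\n' "$words" | LC_ALL=C grep -E '[a-z]{9}')

# Anchored: the whole line must be exactly nine lowercase letters.
anchored=$(printf '%s\n' "$words" | LC_ALL=C grep -E '^[a-z]{9}$')

# Assumed CRLF input: strip the carriage return before matching.
crlf_fixed=$(printf 'abcdefghi\r\n' | tr -d '\r' | LC_ALL=C grep -E '^[a-z]{9}$')

printf 'unanchored:\n%s\nanchored:\n%s\ncrlf_fixed:\n%s\n' \
  "$unanchored" "$anchored" "$crlf_fixed"
```

Only the anchored pattern rejects the 10-letter and capitalised words.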

Related

How to extract value from shell and regex

I have a string "12G 39G 24% /dev" . I have to extract the value '24'. I have used the below regex
grep '[0-9][0-9]%' -o
But I am getting output as 24%. I want only 24 as output and don't want '%' character. How to modify the regex script to extract only 24 as value?

One option would be to just grep again for the digits:
grep -o '[0-9][0-9]%' | grep -o '[0-9][0-9]'
However, if you want to accomplish this with a single regex, you can use the following:
grep -Po '[0-9]{2}(?=%)'
Note the -P option in this case; without it, grep uses POSIX basic or extended regular expressions, which do not support the (?=%) look-ahead.
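A small sketch of both routes applied to the sample string from the question. One deliberate change from the answer: it uses [0-9]+ instead of [0-9]{2}, on the assumption that values such as 5% or 100% may also occur. The second command is a -P-free variant for greps built without PCRE support.

```shell
line='12G 39G 24% /dev'

# PCRE look-ahead: match the digits only when a % follows, without consuming it.
with_lookahead=$(printf '%s\n' "$line" | grep -Po '[0-9]+(?=%)')

# Without -P: match digits plus %, then delete the % afterwards.
without_pcre=$(printf '%s\n' "$line" | grep -oE '[0-9]+%' | tr -d '%')

echo "$with_lookahead $without_pcre"
```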

The most common way not to capture something is using look-around assertions:
Use it like this
grep -oP '[0-9][0-9](?=%)'
It's worth noting that GNU grep supports the -P option to enable Perl-compatible regex syntax; however, GNU grep is not included with OS X. On Linux, it is usually available by default. A workaround would be to use ack instead.
But I'd still recommend using GNU grep on OS X. It can be installed using Homebrew with the command brew install grep (current Homebrew versions install it under the name ggrep).
Also, see How to match, but not capture, part of a regex?

You can use sed as an alternative:
sed -rn 's/(^.*)([[:digit:]]{2})(%.*$)/\2/p' <<< "12G 39G 24% /dev"
Enable extended regular expressions with -r or -E, then split the line into three groups delimited by parentheses. The substitution replaces the whole line with the second group, and the p flag prints the result; -n suppresses all other output.

Use awk:
awk '{print $3+0}'
The value you seek is in the third field, and adding a zero coerces the string to a number, so % is removed.
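A short illustration of the coercion on a few invented df-style lines (only the first comes from the question). It assumes the percentage is always the third whitespace-separated field.

```shell
# Hypothetical df-style lines; the percentage is always the third field.
sample=$(printf '%s\n' '12G 39G 24% /dev' '1G 2G 5% /run' '9G 0G 100% /boot')

# Adding 0 makes awk convert "24%" to a number, keeping only the leading digits.
percentages=$(printf '%s\n' "$sample" | awk '{print $3+0}')

echo "$percentages"
```

Unlike a fixed two-digit pattern, this handles one-digit and three-digit percentages as well.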

Bash Regex: Search for a maximum of 3 consecutive vowels

I am trying to search for a maximum of 3 consecutive vowels.
I tried
grep -E "([AEIOUaeiou]{3})" gpl3.txt
and got the results
What I want is to NOT get the (aaaaaaaaa) that you see in the first line of output. All other output is correct.
Any help is appreciated

If you want to avoid the -P option and lookaheads, you can use something like the following.
grep -iE '(^|[^aeiou])[aeiou]{3}([^aeiou]|$)' gpl3.txt
It just matches
the start of the line or a non-vowel
three vowels
a non-vowel or the end of the line
A test run:
IT070137 ~/tmp $ cat gpl3.txt
aaaaaaaaaaaaaaa
asdaiosd
aa
aaa
aaaa
this is a righteous queue
IT070137 ~/tmp $ grep -E '(^|[^aeiou])[aeiou]{3}([^aeiou]|$)' gpl3.txt
asdaiosd
aaa
this is a righteous queue

If you want to find all occurrences of exactly three vowels (no more, no less), then you can try this pattern:
grep -iP '(?<![aeiou])[aeiou]{3}(?![aeiou])'
Using option -P makes grep use Perl-compatible regular expressions (the PCRE library), which are more feature-rich than the standard POSIX regular expressions. For instance, PCRE knows the patterns (?<!something) and (?!something), which mean "must not be preceded by something" and "must not be followed by something", respectively. Using this I express the following:
»Find stuff which is three vowels long and not preceded by a vowel and not followed by a vowel.« This is another way of saying »exactly three vowels long«.
Concerning portability: Using this you need to use a grep which is capable of using Perl regular expressions. Today I guess this won't be an issue but if you happen to code for historical machines, you need to check this first.
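To see which runs the look-around pattern accepts, it can help to add -o so that grep prints only the matched text. The sample words below are invented for illustration, and the sketch assumes a grep built with PCRE support.

```shell
# Hypothetical sample words, one per line.
sample=$(printf '%s\n' beau queue righteous aaaa QUIET)

# -o prints only the matched text; -i makes the vowel classes case-insensitive.
runs=$(printf '%s\n' "$sample" | grep -ioP '(?<![aeiou])[aeiou]{3}(?![aeiou])')

echo "$runs"
```

"queue" and "aaaa" produce nothing because their vowel runs are four letters long.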

Try using a negative lookahead which asserts that four or more vowels do not appear consecutively:
grep -P "^(?!.*[AEIOUaeiou]{4,}).*$" gpl3.txt
We need to run this in Perl mode to use negative lookaheads.
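Note that this pattern prints every line that has no run of four or more vowels, including lines with no vowels at all. The sketch below is an alternative that avoids -P: the first command should select the same lines as the look-ahead, and the second additionally requires a run of three vowels. The sample lines are the ones from the test run earlier in the thread plus an invented vowel-free line, xyz.

```shell
sample=$(printf '%s\n' aaaaaaaaaaaaaaa asdaiosd aa aaa aaaa \
  'this is a righteous queue' xyz)

# Drop every line that contains four or more vowels in a row.
no_long_run=$(printf '%s\n' "$sample" | grep -ivE '[aeiou]{4}')

# Additionally require at least one run of three vowels somewhere on the line.
three_but_not_four=$(printf '%s\n' "$sample" \
  | grep -iE '[aeiou]{3}' | grep -ivE '[aeiou]{4}')

printf 'no_long_run:\n%s\nthree_but_not_four:\n%s\n' \
  "$no_long_run" "$three_but_not_four"
```

Both variants reject the "righteous queue" line because "queue" contains four vowels in a row, which differs from the per-run matching in the other answers.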

Egrep command hangs when passed a file for Regex patterns

NB: I'm using Cygwin.
Passing in a file into the egrep command to use patterns is running incredibly slowly (to the point where after the 4th word match, it was more than 5 minutes before I gave up).
The command I'm trying to run is:
cat words.txt | egrep ^"[A-Z]" | egrep -f words9.txt
words.txt is a dictionary (390K words), and words9.txt is a file (36,148 words) I created that contains all lowercase 9-letter words from words.txt.
This command should find any 10+ letter words that contain a 9-letter word from words9.txt.
I am new to regex and shell commands so it may be simply that this file dependency is an incredibly inefficient method, (having to search 36148 words for every word in words.txt). Is there a better way of tackling this?

If words9.txt doesn't have regexes try using a fixed string search (fgrep or grep -F) instead of using the extended regex search (egrep).
cat words.txt | egrep "^[A-Z]" | fgrep -f words9.txt
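A small runnable sketch of the same pipeline on made-up stand-ins for the two files. It spells the commands as grep -E and grep -F, because recent GNU grep releases warn that egrep and fgrep are obsolescent; the file contents are assumptions chosen only to show the behaviour.

```shell
# Work in a throwaway directory with tiny, made-up stand-ins for the real files.
dir=$(mktemp -d)
printf '%s\n' Indifferent Unimportant indifferent Zebra Apple > "$dir/words.txt"
printf '%s\n' different important beautiful > "$dir/words9.txt"

# -F treats every line of words9.txt as a literal string, not a regex.
matches=$(grep -E '^[A-Z]' "$dir/words.txt" | grep -F -f "$dir/words9.txt")

echo "$matches"
rm -r "$dir"
```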

So you want to improve on egrep ^"[A-Z]" words.txt | egrep -f words9.txt
Your words9.txt is not a file of regex patterns; it contains only fixed strings, so treating it as such (grep -F) will generally be much faster, as @KurzedMetal said.
Mind you, if its contents had a lot of overlapping near-duplicates, you could manually merge them by constructing regexes. Here's how you'd do that:
Get a list of all 9-letter words (using the Unix built-in word list):
awk 'length($0)==9' /usr/share/dict/words
Now say you wanted to merge all 9-letter words starting with the 5 characters 'inter' into one regex. First let's get them as a list. Piping the output above through grep "^inter" | paste -sd ',' - gives:
interalar,interally,interarch,interarmy,interaxal,interaxis,interbank,interbody,intercale,intercalm,intercede,intercept,intercity,interclub,intercome,intercrop,intercurl,interdash,interdict,interdine,interdome,interface,interfere,interflow,interflux,interfold,interfret,interfuse,intergilt,intergrow,interhyal,interject,interjoin,interknit,interknot,interknow,interlace,interlaid,interlake,interlard,interleaf,interline,interlink,interloan,interlock,interloop,interlope,interlude,intermaze,intermeet,intermelt,interment,intermesh,intermine,internals,internist,internode,interpage,interpave,interpeal,interplay,interplea,interpole,interpone,interpose,interpour,interpret,interrace,interroad,interroom,interrule,interrupt,intersale,intersect,intershop,intersole,intertalk,interteam,intertill,intertone,intertown,intertwin,intervale,intervary,intervein,intervene,intervert,interview,interweld,interwind,interwish,interword,interwork,interwove,interwrap,interzone
The regex would start with: inter(a(l(ar|ly)|r(ch|my)|x(al|is))|b(...)|c(...)|...). We're implementing a tree structure from L-to-R (there are other ways but this is the obvious way).
Testing it: grep "^inter" words9.txt | egrep '^intera(l(ar|ly)|r(ch|my)|x(al|is))'
interalar
interally
interarch
interarmy
interaxal
interaxis
Yay! But it may still be faster to just have a plain list of fixed-strings. Also, this regex will be harder to maintain, brittle etc. Impossible to easily filter or remove specific strings. Anyway you get the point. PS I'm sure there are automated tools out there that construct regexes for such wordlists.
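A different technique from either answer, offered only as a sketch: instead of matching patterns, load the 9-letter words into an awk array and test every 9-character window of each candidate word against it. This assumes every line of words9.txt is exactly nine characters long; the file names come from the question, while the contents below are invented.

```shell
dir=$(mktemp -d)
printf '%s\n' Indifferent Unimportant indifferent Zebra Apple > "$dir/words.txt"
printf '%s\n' different important beautiful > "$dir/words9.txt"

# First file: remember every 9-letter word. Second file: test each 9-character
# window of every capitalised word of 10+ letters against that set.
found=$(LC_ALL=C awk '
  NR == FNR { nine[$0]; next }
  /^[A-Z]/ && length($0) >= 10 {
    for (i = 1; i <= length($0) - 8; i++)
      if (substr($0, i, 9) in nine) { print; break }
  }' "$dir/words9.txt" "$dir/words.txt")

echo "$found"
rm -r "$dir"
```

Each word costs a handful of array lookups regardless of how many 9-letter words there are, which is the property the question was missing.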

Replace more than 150000 character with sed

I want to replace this LONG string with sed
I got the string from grep and stored it in the variable var.
Here is my grep command and my var variable:
var=$(grep -P -o "[^:]//.{0,}" /home/lazuardi/project/assets/static/admin/bootstrap3/css/bootstrap.css.map | grep -P -o "//.{0,}")
Here is the output from grep : string
Then I try to replace it with sed command
sed -i "s|$var||g" /home/lazuardi/project/assets/static/admin/bootstrap3/css/bootstrap.css.map
But it gives me the output bash: /bin/sed: Argument list too long
How can I replace it?
NB: That string has 183544 characters in one line.

What are you actually trying to accomplish here? sed is line-oriented, so you cannot replace a multi-line string this way, not even if you replace literal newlines with \n. (There are ways to write a sed script which effectively replaces a sequence of lines, but it gets tortuous quickly.)
bash$ var=$(head -n 2 /etc/mtab)
bash$ sed "s|$var||" /etc/mtab
sed: -e expression #1, char 25: unterminated `s' command
bash$ sed "s|${var//$'\n'/\\n}||" /etc/mtab | diff -u /etc/mtab -
bash$ # (didn't replace anything, so no output)
As a workaround, what you probably want could be approached by replacing the newlines in $var with \| (or possibly just |, depending on your sed dialect) similarly to what was demonstrated above, but you'd still be bumping into the ARG_MAX limit and have a bunch of other pesky wrinkles to iron out, so let's not go there.
However, what you are attempting can be magnificently completed by sed itself, all on its own. You don't need a list of the strings; after all, sed too can handle regular expressions (and nothing in the regex you are using actually requires Perl extensions, so the -P option is by and large superfluous).
sed -i 's%\([^:]\)//.*%\1%' file
There is a minor caveat -- if there are strings which occur both with and without : in front, your original command would have replaced them all (if it had worked), whereas this one will only replace the occurrences which do not have a colon in front. That means comments at beginning of line will not be touched -- if you want them removed too, just add a line anchor as an alternative; sed -i 's%\(^\|[^:]\)//.*%\1%' file
If you want the comments in var for other reasons, the grep can be cleaned up significantly, too. (Obviously, you'd run this before performing the replacement.)
var=$(grep -P -o '[^:]\K//.*' file)
(The \K extension is one which genuinely requires -P. And of course, the common, clear, standard, readable, portable, obvious, simple way to write {0,} is *.)
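A quick way to check both substitutions is to run them without -i on a few sample lines. The input below is invented, and the sketch uses -E with unescaped groups instead of the \( \| \) forms above, on the assumption that a sed supporting -E is available (GNU and BSD sed both do).

```shell
# Made-up input: a trailing comment, a whole-line comment, and a URL containing ://.
input=$(printf '%s\n' 'x = 1;// set x' '// whole line' 'url(http://example.com/a.css)')

# Strip // comments unless the slashes directly follow a colon.
trailing_only=$(printf '%s\n' "$input" | sed -E 's%([^:])//.*%\1%')

# Also strip comments that start at the beginning of the line.
with_line_start=$(printf '%s\n' "$input" | sed -E 's%(^|[^:])//.*%\1%')

printf 'trailing_only:\n%s\nwith_line_start:\n%s\n' "$trailing_only" "$with_line_start"
```

The URL survives in both cases because its slashes follow a colon; only the second command empties the whole-line comment.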

On most systems these days, the value of ARG_MAX is big enough to handle 150k without problems, but it is important to note that while the limit is called ARG_MAX and the error message indicates that the command line is too long, the real limit is the sum of the sizes of the arguments and all (exported) environment variables. Also, Linux imposes a limit of 128k (131,072 bytes) for a single argument string, which your 183,544-character string exceeds. Exceeding any of these limits triggers an error return of E2BIG, which is printed as "Argument list too long".
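The arithmetic can be checked directly. This sketch reads the total budget with getconf and compares the string length from the question against the 131,072-byte per-argument figure quoted above; that figure is hard-coded here as an assumption about Linux and is not queried from the system.

```shell
# Total budget for arguments plus environment, as reported by the system.
arg_max=$(getconf ARG_MAX)

# Per-argument limit on Linux cited above, and the length of the asker's string.
per_arg_limit=131072
string_length=183544

if [ "$string_length" -gt "$per_arg_limit" ]; then
  verdict='too long for a single argument'
else
  verdict='fits in a single argument'
fi
echo "ARG_MAX=$arg_max, verdict: $verdict"
```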
In any case, bash built-ins are exempt from the limit, so you should be able to feed the command into sed as a command file:
echo "s|$var||g" | sed -f - -i /home/lazuardi/project/assets/static/admin/bootstrap3/css/bootstrap.css.map
That may not help you much, though. Your variable is full of regex metacharacters, so it will not match the string itself. You'll need to clean it up in order to be able to use it as a regular expression.
There's probably a cleaner way to do that edit, though.
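One possible way to do the clean-up mentioned above, sketched on a tiny invented string rather than the real 183k one. It assumes var is a single line, escapes the basic-regex metacharacters plus the | delimiter, and writes the sed script to a file so the long string never appears on a command line. The names target.txt and script.sed are hypothetical.

```shell
dir=$(mktemp -d)

# Made-up stand-ins: a literal string full of regex metacharacters, and a target file.
var='//a.b*c'
printf '%s\n' 'x//a.b*cy' '//aXbbc' > "$dir/target.txt"

# Backslash-escape BRE metacharacters and the | delimiter so sed treats var literally.
escaped=$(printf '%s\n' "$var" | sed 's/[][\.*^$|]/\\&/g')

# printf is a builtin in bash and most shells, so the string is not passed to execve.
printf 's|%s||g\n' "$escaped" > "$dir/script.sed"

result=$(sed -f "$dir/script.sed" "$dir/target.txt")
echo "$result"
rm -r "$dir"
```

Without the escaping step, the pattern //a.b*c would also match the second line, //aXbbc, and delete it.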

How do I find broken NMEA log sentences with grep?

My GPS logger occasionally leaves "unfinished" lines at the end of the log files. I think they're only at the end, but I want to check all lines just in case.
A sample complete sentence looks like:
$GPRMC,005727.000,A,3751.9418,S,14502.2569,E,0.00,339.17,210808,,,A*76
The line should start with a $ sign, and end with an * and a two character hex checksum. I don't care if the checksum is correct, just that it's present. It also needs to ignore "ADVER" sentences which don't have the checksum and are at the start of every file.
The following Python code might work:
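import re
from path import path

# A complete sentence starts with "$" and ends with "*" plus a two-character hex checksum.
nmea = re.compile(r"^\$.+\*[0-9A-F]{2}$")

for log in path("gpslogs").files("*.log"):
    for line in log.lines():
        # Report lines that are neither complete sentences nor ADVER headers.
        if not nmea.match(line) and "ADVER" not in line:
            print("%s\n\t%s\n" % (log, line))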
Is there a way to do that with grep or awk or something simple? I haven't really figured out how to get grep to do what I want.
Update: Thanks @Motti and @Paul, I was able to get the following to do almost what I wanted, but had to use single quotes and remove the trailing $ before it would work:
grep -nvE '^\$.*\*[0-9A-F]{2}' *.log | grep -v ADVER | grep -v ADPMB
Two further questions arise, how can I make it ignore blank lines? And can I combine the last two greps?

The minimum of testing shows that this should do it:
grep -Ev '^\$.*\*[0-9A-Fa-f]{2}$' a.txt | grep -v ADVER
-E use extended regexp
-v Show lines that do not match
^ starts with
.* anything
\* an asterisk
[0-9A-Fa-f] hexadecimal digit
{2} exactly two of the previous
$ end of line
| grep -v ADVER weed out the ADVER lines
HTH, Motti.

@Motti's answer doesn't ignore ADVER lines, but you can easily pipe the results of that grep to another:
grep -Ev '^\$.*\*[0-9A-Fa-f]{2}$' a.txt | grep -v ADVER

@Tom (rephrased) I had to remove the trailing $ for it to work
Removing the $ means that the line may end with something else (e.g. the following will be accepted)
$GPRMC,005727.000,A,3751.9418,S,14502.2569,E,0.00,339.17,210808,,,A*76xxx
@Tom And can I combine the last two greps?
grep -Ev "ADVER|ADPMB"

@Motti: Combining the greps isn't working; it's having no effect.
I understand that without the trailing $ something else may follow the checksum and still match, but it didn't work at all with it, so I had no choice...
GNU grep 2.5.3 and GNU bash 3.2.39(1) if that makes any difference.
And it looks like the log files are using DOS line-breaks (CR+LF). Does grep need a switch to handle that properly?

@Tom
GNU grep 2.5.3 and GNU bash 3.2.39(1) if that makes any difference.
And it looks like the log files are using DOS line-breaks (CR+LF). Does grep need a switch to handle that properly?
I'm using grep (GNU grep) 2.4.2 on Windows (for shame!) and it works for me (and DOS line-breaks are naturally accepted) , I don't really have access to other OSs at the moment so I'm sorry but I won't be able to help you any further :o(
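One possible way to tie the open points together, sketched under the assumption that the carriage returns are what stopped the trailing $ from matching: delete them with tr first, then use a single second grep to drop ADVER lines, ADPMB lines and blank lines. The log content below is invented, including the format of the ADVER line.

```shell
# Made-up log with DOS line endings: a complete sentence, a truncated one,
# a blank line and an ADVER header.
log=$(mktemp)
printf '%s\r\n' \
  '$GPRMC,005727.000,A,3751.9418,S,14502.2569,E,0.00,339.17,210808,,,A*76' \
  '$GPRMC,005728.000,A,3751' \
  '' \
  '$ADVER,example' > "$log"

# Strip carriage returns so $ can anchor at the true end of line, drop complete
# sentences, then drop ADVER/ADPMB lines and blank lines in one more grep.
broken=$(tr -d '\r' < "$log" \
  | grep -vE '^\$.*\*[0-9A-Fa-f]{2}$' \
  | grep -vE 'ADVER|ADPMB|^$')

echo "$broken"
rm "$log"
```

Only the truncated sentence is reported. The single quotes matter: inside double quotes the shell removes the backslash before $, which changes the pattern.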
