simple regular expression question

How do I match aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab, where the number of a's must be at least 10?
I know I can do it this way:
[a][a][a][a][a][a][a][a][a][a][a][a][a]a*b
But there must be a more elegant method, for when the minimum number of a's becomes, say, 100.
What is it? I am trying to match something like (a^n)b, where n can be anything.
EDIT:
I forgot to mention that this is done with lex and yacc, where lex has to return a token to yacc.
%{
#include "y.tab.h"
%}
%%
aaaaaaaaaa[a]*b {return ok;}
\n {return '\n';}
. {return 0;}
%%

Try
a{10,}
which means "a, 10 or more times".
grep -E "a{10,}" filename
matches aaaaaaaaaaaaaaaaaaaaaaaaab but not aaaaaaaaab.
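
For a quick check outside grep, the same bounded quantifier can be tried in Python's re module. This is only a sketch; the helper name at_least_n_as is made up for illustration, and the trailing b from the question is included in the pattern.

```python
import re

# Hypothetical helper (not from the question): build a pattern for
# "at least n a's followed by a single b".
def at_least_n_as(n):
    return re.compile(r"a{%d,}b" % n)

pattern = at_least_n_as(10)

# fullmatch requires the whole string to fit the pattern.
print(bool(pattern.fullmatch("a" * 25 + "b")))  # 25 a's
print(bool(pattern.fullmatch("a" * 9 + "b")))   # 9 a's
```

Changing the minimum to 100 is then just at_least_n_as(100), which is the point of the {min,} form.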

If your lex is flex, you can use a{10,}.
If not, then according to "3. Lex Regular Expressions", you can use a{10}a* instead.

Footy,
[WARNING: This answer is COMPLETE BUNKUM!!!]
(if you mean soccer, we're sworn enemies ;-)
Ummm, no... As far as I know, that is not "standard" regular expression syntax as supported by sed, grep, nawk and the like... and no, not even egrep. As far as I know, the a{10,} syntax (which is exactly what you're hankering for) didn't emerge until Perl rewrote all the books on the capabilities of regular expressions... and (don't quote me on this) I don't think that happened until about version 5.
So yeah, if you're stuck with using nawk, then it's the aaaaaaaaardvarking hard way, dude. Sorry.
Cheers. Keith.
EDIT:
Hmmm... I seem to be the odd man out here... maybe everyone else's "standard operating environments" have been updated with "standard tools" that recognise later regular expression syntax extensions... Sooo... I tested this on my (three-year-old) Cygwin implementation of egrep... and it surprised me by actually working!!!
Administrator#snadbox3 ~
$ egrep 'a{3,}b' <<-eof
> ab
> aab
> aaab
> aaaab
> eof
aaab
aaaab
So I'm WRONG all ends up... looks like the "new" {min,[max]} syntax is reasonably well supported, and I'm getting old. Sigh.
Cheers. Keith.

Use the form a{n}a*b, replacing n with the minimum count you want (for example, a{10}a*b).

Related

Using Grep + Regex linux

I am trying to check whether PASS_MAX_DAYS in /etc/login.defs is set to 90 or less. I am testing on a SUSE 12 server, but the command below does not work.
grep "^PASS_MAX_DAYS\s*([0-9]|[1-8][0-9]|90)" /etc/login.defs
Thanks for your support and time

It is better to test the number against your value, and not test a string against any pattern (as already suggested in comments). For example like this:
awk '/^PASS_MAX_DAYS/ && $2<=90' /etc/login.defs
This way, you can easily modify your command, if your limit changes to 30 or to 365 days. Also I guess values like 090 are still valid for that configuration.
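
The same idea, comparing the value as a number instead of matching digits with a pattern, can be sketched in Python. The function name and the sample lines below are invented for illustration; they are not taken from a real login.defs.

```python
def pass_max_days_ok(lines, limit=90):
    """Return True if the first active PASS_MAX_DAYS entry is <= limit.

    Commented-out lines are skipped because their first field is "#"
    (or starts with "#"), not the keyword itself.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "PASS_MAX_DAYS" and fields[1].isdigit():
            # int() accepts leading zeros, so "090" compares as 90.
            return int(fields[1]) <= limit
    return False

# Made-up sample content, for illustration only.
sample = [
    "# PASS_MAX_DAYS   Maximum number of days a password may be used",
    "PASS_MAX_DAYS   090",
    "PASS_MIN_DAYS   0",
]
print(pass_max_days_ok(sample))
```

With a real file you would pass the open file object, for example pass_max_days_ok(open("/etc/login.defs")), and change limit when the policy changes.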

By default, grep doesn't understand extended regular expressions (the grouping and alternation used here), so pass -E:
grep -E "^PASS_MAX_DAYS\s*([0-9]|[1-8][0-9]|90)\s*$" /etc/login.defs
will give you SOME result.
That said, how many different entries for PASS_MAX_DAYS do you expect in that file?

Bash script to match segments in lines of source code

I'm trying to learn a new programming language, and it's big. Thousands of new terms to learn. I know programming, but don't know the name used for a certain procedure or constant in this language. But I have a script file that I put together that helps tremendously by searching through a large selection of source files, as long as I get a group of the characters right.
But now I want to use && to match up multiple segments in the same line, and I want to pass this whole expression to the script file as one argument, so I might pass it this with a read command:
moo && cow
And it would match this:
Moonlight over Moscow
But not this:
I heard a cow mooing.
If I wanted it either way I would pass it this:
moo && cow || cow && moo
It's tricky, and probably outside what you can normally do with the available syntax. But then I'm no expert, so I don't really know.
I'm flexible on what gets passed to the script, like single &s and |s, the use of brackets, and so on. I just need to understand the rules involved and which utility can do it for me. Or set of utilities if it comes to that.

If you only want to check for the two elements in order, simply match anything between them with .*:
my_str="moonlight over moscow"
if [[ $my_str =~ moo.*cow ]]; then
    echo "match"
fi
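
To handle the full "moo && cow || cow && moo" form from the question, one possible approach is to translate the expression into a single regex. The sketch below does that in Python; the function name compile_query and the exact operator semantics ("&&" means "followed later on the same line by", "||" separates alternatives) are assumptions based on the examples in the question.

```python
import re

def compile_query(query):
    """Translate 'moo && cow || cow && moo' into one case-insensitive regex."""
    alternatives = []
    for alt in query.split("||"):
        # Escape each segment so characters like '.' or '(' match literally.
        parts = [re.escape(p.strip()) for p in alt.split("&&") if p.strip()]
        if parts:
            # '.*' between segments enforces their left-to-right order.
            alternatives.append(".*".join(parts))
    return re.compile("|".join("(?:%s)" % a for a in alternatives), re.IGNORECASE)

ordered = compile_query("moo && cow")
either = compile_query("moo && cow || cow && moo")

print(bool(ordered.search("Moonlight over Moscow")))
print(bool(ordered.search("I heard a cow mooing.")))
print(bool(either.search("I heard a cow mooing.")))
```

The whole expression arrives as one string, so a script can read it with a single read and pass it through unchanged.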

Regular expression to search for Gadaffi [closed]

I'm trying to search for the word Gadaffi, which can be spelled in many different ways. What's the best regular expression to search for this?
This is a list of 30 variants:
Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi
My best attempt so far is:
\b[KG]h?add?af?fi$\b
But I still seem to be missing some variants. Any suggestions?

Easy... (Qadaffi|Khadafy|Qadafi|...)... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.
Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.

\b[KGQ]h?add?h?af?fi\b
Arabic transcription is (Wiki says) "Qaḏḏāfī", so maybe adding a Q. And one H ("Gadhafi", as the article (see below) mentions).
Btw, why is there a $ at the end of the regex?
Btw, nice article on the topic:
Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?.
EDIT
To match all the names in the article you've mentioned later, this should match them all. Let's just hope it won't match a lot of other stuff :D
\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b

One interesting thing to note from your list of potential spellings is that there's only 3 Soundex values for the contained list (if you ignore the outlier 'Kazzafi')
G310, K310, Q310
Now, there are false positives in there ('Godby' also is G310), but by combining the limited metaphone hits as well, you can eliminate them.
<?php
// Soundex and Metaphone codes shared by the known spellings.
$soundexMatch = array('G310','K310','Q310');
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');
$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";
$wordArray = preg_split('/[\s,.;-]+/',$text);
$matches = array();
foreach ($wordArray as $item){
    // Keep a word only when both phonetic codes hit.
    $rate = in_array(soundex($item),$soundexMatch) + in_array(metaphone($item),$metaphoneMatch);
    if ($rate > 1){
        // Escape the word so it is safe inside the pattern below.
        $matches[] = preg_quote($item, '/');
    }
}
if ($matches){
    $pattern = implode("|",$matches);
    $text = preg_replace("/($pattern)/","<b>$1</b>",$text);
}
echo $text;
?>
A few tweaks, and let's say some Cyrillic transliteration, and you'll have a fairly robust solution.

Using CPAN module Regexp::Assemble:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';   # say is not enabled by default
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add($_) for qw(Gadaffi Gadafi Gadafy Gaddafi Gaddafy
                    Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
                    Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi
                    Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
                    Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi
                    Qadthafi Qathafi Quathafi Qudhafi Kad'afi);
say $ra->re;
This produces the following regular expression:
(?-xism:(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi))

I think you're over complicating things here. The correct regex is as simple as:
\u0627\u0644\u0642\u0630\u0627\u0641\u064a
It matches the concatenation of the seven Arabic Unicode code points that forms the word القذافي (i.e. Gadaffi).

If you want to avoid matching things that no-one has used (ie avoid tending towards ".+") your best approach would be to create a regular expression that's just all the alternatives (eg. (Qadafi|Kadafi|...)) then compile that to a DFA, and then convert the DFA back into a regular expression. Assuming a moderately sensible implementation that would give you a "compressed" regular expression that's guaranteed not to contain unexpected variants.

If you've got a concrete listing of all 30 possibilities, just concatenate them all together with a bunch of "ors". Then you can be sure that it only matches the exact things you've listed, and no more. Your RE engine will probably be able to optimize it further, and, well, with 30 choices, even if it doesn't, it's still not a big deal. Trying to fiddle around with manually turning it into a "clever" RE can't possibly turn out better and may turn out worse.
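
As a rough illustration of the "just OR them together" approach, here is a Python sketch that builds the alternation from the 30 spellings listed in the question. The variable names are arbitrary, and the \b word boundaries are an added assumption carried over from the asker's own attempt.

```python
import re

# The 30 variants from the question.
VARIANTS = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi
Ghadafi Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi
Khadaffy Khadafy Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi
Qathafi Quathafi Qudhafi Kad'afi
""".split()

# re.escape keeps every spelling literal; the regex engine does the rest.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(v) for v in VARIANTS))

text = "Reports mention Gaddafi, Khadafy and Kad'afi, but not Gazzafy."
print(pattern.findall(text))
```

Adding or removing a spelling means editing the list, not the regex, which is the maintainability argument made above.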

(G|Gh|K|Kh|Q|Qh|Q|Qu)(a|au|e|u)(dh|zz|th|d|dd)(dh|th|a|ha|)(\x27|)(a|)(ff|f)(i|y)
Certainly not the most optimized version, split on syllables to maximize matches while trying to make sure we don't get false positives.

Well since you are matching small words why don't you try a similarity search engine with the Levenshtein distance? You can allow at most k insertions or deletions. This way you can change the distance function to other things that work better for your specific problem. There are many functions available in the simMetrics library.
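
A minimal sketch of the edit-distance idea in plain Python follows (it does not use the simMetrics library). The reference spelling "gaddafi" and the threshold k=2 are arbitrary choices for illustration, and this version also counts substitutions, not only insertions and deletions.

```python
def levenshtein(a, b):
    """Edit distance with insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def close_to(word, reference="gaddafi", k=2):
    """True if word is within k edits of the reference spelling."""
    return levenshtein(word.lower(), reference) <= k

for candidate in ["Qaddafi", "Kadafi", "godfrey"]:
    print(candidate, close_to(candidate))
```

A single reference word with a small k will not cover every variant in the list (some are further away), so in practice you would probably compare against several reference spellings or raise k and accept more false positives.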

A possible alternative is the online tool for generating regular expressions from examples: http://regex.inginf.units.it.
Give it a chance!

Why not do a mixed approach? Something between a list of all possibilities and a complicated Regex that matches far too much.
Regex is about pattern matching, and I can't see a pattern for all variants in the list. Trying to find one will also match things like "Gazzafy" or "Quud'haffi", which are most probably not used variants and definitely not on the list.
But I can see patterns for some of the variants, and so I ended up with this:
\b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b
At the beginning I list the ones where I can't see a pattern, then followed by some variants where there are patterns.
See it here on www.rubular.com

I know this is an old question, but...
Neither of these two regexes is the prettiest, but they are optimized and both match ALL the variations in the original post.
"Little Beauty" #1
(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)
"Little Beauty" #2
(?:(?:Gh|[GK])adaff|(?:(?:Gh|[GKQ])ad|(?:Ghe|(?:[GK]h|[GKQ])a)dd|(?:Gadd|(?:[GKQ]a|Q(?:adh|u))d|(?:Qad|(?:Qu|[GQ])a)t)h|Ka(?:zz|d'))af)i|(?:Khadaff|(?:(?:Kh|G)ad|Gh?add)af)y
Rest in Peace, Muammar.

Just an addendum: you should add "Gheddafi" as alternate spelling. So the RE should be
\b[KG]h?[ae]dd?af?fi$\b

[GQK][ahu]+[dtez]+\'?[adhz]+f{1,2}(i|y)
In parts:
[GQK]
[ahu]+
[dtez]+
\'?
[adhz]+
f{1,2}(i|y)
Note: Just wanted to give a shot at this.

What else starts with Q, G, or K, has a d, z or t in the middle, and ends in "fi" that people actually search for?
/\b[GQK].+[dzt].+fi\b/i
Done.
>>> import re
>>> a = re.compile(r"\b[GQK].+[dzt].+fi\b", re.I)
>>> print(re.search(a, "Gadasadasfiasdas") is not None)
False
>>> print(re.search(a, "Gadasadasfi") is not None)
True
>>> print(re.search(a, "Qa'dafi") is not None)
True
Interesting that I'm getting downvoted. Can someone leave some false positives in the comments?

grep replacement with extensive regular expression implementation

I have been using grepWin for general searching of files, and Wingrep when I want to do replacements or what-have-you.
grepWin has an extensive implementation of regular expressions, but doesn't do replacements (as mentioned above).
Wingrep does replacements, but its regular expression support is severely limited.
Does anyone know of any (preferably free) grep tools for Windows that do replacements AND have a reasonable implementation of regular expressions?
Thanks in advance.

I think perl at the command line is the answer you are looking for. Widely portable, powerful regex support.
Let's say that you have the following file:
foo
bar
baz
quux
you can use
perl -pne 's/quux/splat!/' -i /tmp/foo
to produce
foo
bar
baz
splat!
The magic is in Perl's command line switches:
-e: execute the next argument as a Perl command.
-n: execute the command on every line.
-p: print the results of the command, without issuing an explicit 'print' statement.
-i: make substitutions in place, overwriting the document with the output of your command... use with caution.
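
If Perl is not available but Python is, a comparable in-place replacement can be sketched as below. This is an illustrative alternative, not part of the Perl answer; the function name replace_in_file is made up, and it rewrites the whole file, so the same "use with caution" applies.

```python
import re
from pathlib import Path

def replace_in_file(path, pattern, replacement):
    """Apply re.subn to the whole file and write it back if anything changed.

    Returns the number of substitutions made.
    """
    target = Path(path)
    text = target.read_text(encoding="utf-8")
    new_text, count = re.subn(pattern, replacement, text)
    if count:
        # Overwrites the original file; keep a backup if that matters.
        target.write_text(new_text, encoding="utf-8")
    return count
```

For the example file above, replace_in_file("/tmp/foo", r"quux", "splat!") would correspond to the Perl one-liner.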

I use Cygwin quite a lot for this sort of task.
Unfortunately it has the world's most unintuitive installer, but once it's installed correctly it's very usable... well apart from a few minor issues with copy and paste and the odd issue with line-endings.
The good thing is that all the tools work like on a real GNU system, so if you're already familiar with Linux or similar, you don't have to learn anything new (apart from how to use that crazy installer).
Overall I think the advantages make up for the few usability issues.

If you are on Windows, you can use VBScript (requires no downloads). It comes with regex support, e.g. to change "one" to "ONE":
Set objFS=CreateObject("Scripting.FileSystemObject")
Set WshShell = WScript.CreateObject("WScript.Shell")
Set objArgs = WScript.Arguments
strFile = objArgs(0)
Set objFile = objFS.OpenTextFile(strFile)
strFileContents = objFile.ReadAll
Set objRE = New RegExp
objRE.Global = True
objRE.IgnoreCase = False
objRE.Pattern = "one"
strFileContents = objRE.Replace(strFileContents,"ONE") 'simple replacement
WScript.Echo strFileContents
output
C:\test>type file
one
two one two
three
C:\test>cscript //nologo test.vbs file
ONE
two ONE two
three
You can read the VBScript documentation to learn more about using regex.

Regular Expression to find files with various extensions like-ASPX,ASCX,.js,.rpt,.xml

Is there any way to write a RegEx which can be used to find files with different Extensions.

This works with GNU find (run from Bash):
find . -regex '.*\.\(pdf\|chm\|doc\)'

Assuming you have a list of files and you are looking for .pdf, .chm and .doc, you can check it with:
\.pdf$|\.chm$|\.doc$
The regex above should work if you check it against single filenames.

I'm sure there is, but the question you should be asking is "What's the best way to find files which have specific extensions?".
Regular expressions are not the best answer to every question.
I would suggest just getting a list of all files and passing them into a function like IsThisFileOneIWant(fileName,extensionList). That's far easier than trying to shoehorn the use of regular expressions into your problem.
Something like this should do it:
function IsThisFileOneIWant(fileName,extensionList):
    for each extension in extensionList:
        if fileName.endsWith (extension):
            return true
    return false
Done in pseudo-code since it should be simple enough to turn into any other language.
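
For instance, the pseudo-code above might translate to Python roughly as follows. The function name, the sample file names and the choice to compare case-insensitively are assumptions for illustration.

```python
def is_wanted_file(file_name, extensions):
    """Case-insensitive suffix check; each extension includes its leading dot."""
    # str.endswith accepts a tuple of suffixes, so no explicit loop is needed.
    return file_name.lower().endswith(tuple(ext.lower() for ext in extensions))

wanted = [".aspx", ".ascx", ".js", ".rpt", ".xml"]
names = ["Default.ASPX", "menu.ascx", "site.js", "notes.txt"]  # made-up names

print([n for n in names if is_wanted_file(n, wanted)])
```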
If you must have a regex, it's going to look something like (based on the values in your question):
"ASPX$|ASCX$|\.js$|\.rpt$|\.xml$"
but it depends entirely on the RE engine that you want to use. For example, here's the output from an egrep command in my work directory:
pax#paxbox1:~/work$ ls -1 | egrep '\.sh$|\.c$'
backup0.sh
backup1.sh
eclipse.sh
monbt.sh
qq.c
qq.sh
xx yy.sh
