regular expression to represent two of the same vowel in a row - regex

I am trying to write a regular expression that matches two of the same vowel in a row.
I have this pattern for a, but what about e, i, o, u?
(a[aeiou]{2})
Should I write the pattern like this to match two of the same vowel?
(a[aeiou]{2}|e[aeiou]{2}|i[aeiou]{2}|o[aeiou]{2}|u[aeiou]{2})

You can simply use a backreference to a capturing group:
([aeiou])\1
See demo https://regex101.com/r/dI9kB9/1
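As a rough illustration outside of grep, the same backreference pattern can be tried in Python's `re` module. The word list and the helper name `has_double_vowel` are made up for this sketch.

```python
import re

# ([aeiou]) captures one vowel; \1 then requires that exact same character again.
DOUBLE_VOWEL = re.compile(r"([aeiou])\1")

def has_double_vowel(word: str) -> bool:
    """Return True if the word contains the same vowel twice in a row."""
    return DOUBLE_VOWEL.search(word) is not None

# Hypothetical sample words, chosen only for illustration.
words = ["book", "bead", "skiing", "vacuum", "tree"]
doubled = [w for w in words if has_double_vowel(w)]
print(doubled)  # ['book', 'skiing', 'vacuum', 'tree']
```

Note that "bead" is skipped: it has two vowels in a row, but they are not the same vowel.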

Why not just do:
aa|ee|ii|oo|uu
The bar ( | ) is used for "or".
So this reads as:
aa OR ee OR ii OR oo OR uu
It is also known as "alternation".
See: http://www.regular-expressions.info/alternation.html
It has an example where you can search for dog|cat|mouse|fish, which I would read as "dog OR cat OR mouse OR fish".
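A small sketch, assuming Python's `re` module and an invented word list, suggests that the alternation and the backreference version pick out the same words:

```python
import re

ALTERNATION = re.compile(r"aa|ee|ii|oo|uu")
BACKREFERENCE = re.compile(r"([aeiou])\1")

# Hypothetical sample words; "audio" and "queue" have adjacent but different vowels.
samples = ["aardvark", "feet", "radii", "moon", "vacuum", "audio", "queue"]

alt_hits = [w for w in samples if ALTERNATION.search(w)]
ref_hits = [w for w in samples if BACKREFERENCE.search(w)]
print(alt_hits)              # ['aardvark', 'feet', 'radii', 'moon', 'vacuum']
print(alt_hits == ref_hits)  # True
```

The alternation is easier to read for five vowels; the backreference scales better if the set of characters grows.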

Related

Regex collating symbols

I tried to understand how matching with 'collating symbols' works, but I could not figure it out. I understood that it means matching an exact sequence instead of just the individual character(s), that is:
echo "ciiiao" | grep '[oa]' --> output 'ciiiao'
echo "ciiiao" | grep '[[.oa.]]' --> no output
echo "ciiiao" | grep '[[.ia.]]' --> output 'ciiiao'
However, the third command does not work. Am I wrong, or am I misinterpreting something?
I have read this regexp tutorial.
Collating symbols are typically used when a digraph is treated like a single character in a language. They are an element of the POSIX regular expression specification, and are not widely supported.
For example, the Welsh alphabet has a number of digraphs that are treated as a single letter (marked with a * below):
a b c ch* d dd* e f ff* g ng* h i j l ll* m n o p ph* r rh* s t th* u w y
Assuming the locale file defines it (a collating symbol will only work if it is defined in the current locale), the collating symbol [[.ng.]] is treated like a single character. Likewise, a single character expression like . or [^a] will also match "ff" or "th." This also affects sorting, so that [p-t] will include the digraphs "ph" and "rh" in addition to the expected single letters.
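Python's `re` module does not implement POSIX collating symbols, so the following is only a sketch of the underlying idea (a digraph counted as one letter), using plain alternation instead. The function name `split_letters` is invented here, and the approach ignores real-world ambiguities such as an n followed by a separate g.

```python
import re

# The eight Welsh digraphs from the alphabet listed above.
WELSH_DIGRAPHS = ["ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"]

# Digraphs come first in the alternation so they win over single letters.
LETTER = re.compile("|".join(WELSH_DIGRAPHS) + r"|[a-z]")

def split_letters(word: str) -> list:
    """Split a word into letters, counting each digraph as one letter."""
    return LETTER.findall(word.lower())

print(split_letters("llan"))    # ['ll', 'a', 'n']
print(split_letters("ffordd"))  # ['ff', 'o', 'r', 'dd']
```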

Multiple character lookup within square brackets with regex

I’m using regex in JavaScript for certain text replacements to convert legacy-encoded text to Unicode (it’s an Indic language). Suppose that anywhere I find any of a, b, c followed by any of x, y, z followed by e, I have to rewrite it so that e comes first. So I have code like this:
modified_substring = modified_substring.replace( /([abc])([xyz]*)e/g , "e$1$2" ) ;
Now let us say I want to modify this rule as a or b or c or klm followed by either of x,y,z followed by e. So what would the code be?
modified_substring = modified_substring.replace( /([abc]klm)([xyz]*)e/g , "e$1$2" ) ;
That apparently doesn’t work. Is there a way to do this?
You need to use the alternation operator | inside the first group:
modified_substring = modified_substring.replace( /([abc]|klm)([xyz]*)e/g , "e$1$2" ) ;
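For comparison, here is a sketch of the same rule in Python, where `\1` and `\2` play the role of `$1` and `$2`. The helper name `move_e_first` and the test strings are invented for illustration.

```python
import re

# Either one of a, b, c OR the literal sequence klm, then any run of x/y/z, then e.
RULE = re.compile(r"([abc]|klm)([xyz]*)e")

def move_e_first(text: str) -> str:
    """Move the trailing e in front of the matched base and modifiers."""
    return RULE.sub(r"e\1\2", text)

print(move_e_first("axe"))     # eax
print(move_e_first("klmyze"))  # eklmyz
print(move_e_first("dxe"))     # dxe (no a, b, c or klm, so unchanged)
```

A character class such as `[abc]` can only ever match one character, which is why the multi-character `klm` has to go into an alternation rather than inside the brackets.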

Grep for Pattern in File in R

In a document, I'm trying to look for occurrences of a 12-character string that contains letters and digits. A sample string is: "PXB111X2206"
I'm trying to get the line numbers that contain this string in R using the below:
FileInput = readLines("File.txt")
prot_pattern="([A-Z0-9]{12})";
prot_string<-grep(prot_pattern,FileInput)
prot_string
This worked fine until it hit a document containing all upper-case titles and returned a line containing the word "CONCENTRATIO"
The string I am trying to look for is: "PXB111X2206". I am expecting the grep to return the line numbers containing the string : "PXB111X2206". It however is returning the line number containing the word: "CONCENTRATIO"
What is wrong with my expression above? Any idea what I am doing wrong here?
Here is some sample input:
Each design objective described herein is significantly important, yet it is just one aspect of what it takes to achieve a successful project.
A successful project is one where project goals are identified early on and where the interdependencies of all building systems are coordinated concurrently from the planning and programming phase.
CONCENTRATION:
The areas of concentration for design objectives: accessible, aesthetics, cost effective, functional/operational, historic preservation, productive, secure/safe, and sustainable and their interrelationships must be understood, evaluated, and appropriately applied.
Each of these design objectives is presented in the design objectives document number. PXB111X2206.
Thanks & Regards,
Simak
You are using a very powerful tool for a very simple task, the expression
[A-Z0-9]{12}
will match any run of 12 uppercase letters or digits, for example the word "CONCENTRATIO". Your "PXB111X2206", however, is only 11 characters long, so it cannot be what is being matched. If you only want to match "PXB111X2206", you can use that string itself as the regular expression. For example, if your file contents are:
foo
CONCENTRATIO.
bazz
foo bar bazz PXB111X2206 foo bar bazz
foo
bar
bazz
and you use:
grep('PXB111X2206',readLines("File.txt"))
then R will only match line 4 as you would wish.
EDIT
If you are looking for that specific pattern try:
grep('[A-Z]{3}[0-9]{3}[A-Z]{1}[0-9]{4}',readLines("File.txt"))
That expression will match strings like 'AAADDDADDDD', where A is a capital letter and D a digit. The regular expression consists of character classes (the symbols inside square brackets), each followed by a quantifier (the number inside the curly braces) that tells how many times the preceding class must occur; if no quantifier is present, it is assumed to be 1.
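To see the difference between the two patterns without R, here is a rough Python equivalent of `grep()` returning 1-based line numbers. The helper `matching_line_numbers` is invented for this sketch, and the lines are the sample file contents from this answer.

```python
import re

def matching_line_numbers(pattern: str, lines: list) -> list:
    """Return 1-based numbers of the lines that contain a match, like R's grep()."""
    rx = re.compile(pattern)
    return [i for i, line in enumerate(lines, start=1) if rx.search(line)]

lines = [
    "foo",
    "CONCENTRATIO.",
    "bazz",
    "foo bar bazz PXB111X2206 foo bar bazz",
    "foo",
    "bar",
    "bazz",
]

# The original pattern only finds the 12-letter word, not the 11-character code.
print(matching_line_numbers(r"[A-Z0-9]{12}", lines))                   # [2]
# The shape-specific pattern finds the code.
print(matching_line_numbers(r"[A-Z]{3}[0-9]{3}[A-Z][0-9]{4}", lines))  # [4]
```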
Let's take a look at what your regular expression means. [A-Z0-9] means any capital letter or digit, and {12} means the previous expression must occur exactly 12 times. The string CONCENTRATIO is 12 capital letters, so it's no surprise that grep picks it up. If you want to take out the matches that consist of just letters or just numbers, you could try something like
allletters <- grep("[A-Z]{12}",strings)
allnumbers <- grep("[0-9]{12}",strings)
both <- grep("[A-Z0-9]{12}",strings)
the matches you wanted would then be something like
both <- both[!both %in% union(allletters,allnumbers)]
Someone with better regexfu might have a more elegant solution, but this will work too.
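The same filtering idea can be sketched in Python with sets of line numbers. The helper `lines_matching` and the sample strings are invented for illustration. One caveat of this approach is that a line containing both a mixed code and a 12-letter word would be dropped as well.

```python
import re

def lines_matching(pattern: str, strings: list) -> set:
    """Return the set of 1-based line numbers that contain a match."""
    rx = re.compile(pattern)
    return {i for i, s in enumerate(strings, start=1) if rx.search(s)}

# Hypothetical input lines.
strings = ["CONCENTRATION:", "code ABC123DEF456 here", "123456789012", "nothing"]

all_letters = lines_matching(r"[A-Z]{12}", strings)
all_numbers = lines_matching(r"[0-9]{12}", strings)
both = lines_matching(r"[A-Z0-9]{12}", strings)

# Keep only lines whose match is not explained by letters alone or digits alone.
mixed = sorted(both - (all_letters | all_numbers))
print(mixed)  # [2]
```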

named captures that match more than once (Perl)

When I run this code:
$_='xaxbxc';
if(/(x(?<foo>.))+/) {
say "&: ", $&;
say "0: ", $-{foo}[0];
say "1: ", $-{foo}[1];
}
I get:
&: xaxbxc
0: c
1:
I understand that this is how it's supposed to work, but I would like to be able to somehow get the list of all matches ('a', 'b', 'c') instead of just the last match (c). How can I do this?
In situations like these, using embedded code blocks provides an easy way out:
my @match;
$_='xaxbxc';
if(/((?:x(.)(?{push @match, $^N}))+)/) {
    say "\$1: ", $1;
    say "@match"
}
which prints:
$1: xaxbxc
a b c
I don't think there is a way to do this in general (please correct me if I am wrong), but there is likely to be a way to accomplish the same end-goal in specific situations. For example, this would work for your specific code sample:
$_='xaxbxc';
while (/x(?<foo>.)/g) {
say "foo: ", $+{foo};
}
What exactly are you trying to accomplish? Perhaps we could find a solution for your actual problem even if there is no way to do repeating captures.
Perl allows a regular expression to match multiple times with the "g" switch past the end. Each individual match can then be looped over, as described in the Global Matching subsection of the Using Regular Expressions in Perl section of the Perl Regex Tutorial:
while(/(x(?<foo>.))/g){
say "&: ", $&;
say "foo: ", $+{foo};
}
This will produce an iterated list:
&: xa
foo: a
&: xb
foo: b
&: xc
foo: c
Which still isn't what you want, but it's really close. Combining a global regex (/g) with your previous local regex will probably do it. Generally, make a capturing group around your repeated group, then re-parse just that group with a global regex that represents a single iteration of that group, and iterate over it or use it as a list.
It looks like a question fairly similar to this one, at least in answer if not in formulation, has been answered by someone much more competent at Perl than I: "Is there a Perl equivalent of Python's re.findall/re.finditer (iterative regex results)?" You might want to check the answers for that as well, with more details about the proper use of global regexes. (Perl isn't my language, I just have an unhealthy appreciation for regular expressions.)
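Since that other question mentions Python's `re.findall`, here is a minimal Python sketch of the capture-then-re-parse idea described in this answer: first match the whole repeated run, then scan just that run for the single-iteration pattern. The variable names are invented for illustration.

```python
import re

text = "xaxbxc"

# Step 1: match the whole repeated run (one or more "x" + any character).
outer = re.search(r"(?:x.)+", text)

# Step 2: re-parse only the matched run with the single-iteration pattern.
letters = re.findall(r"x(.)", outer.group(0)) if outer else []

print(outer.group(0))  # xaxbxc
print(letters)         # ['a', 'b', 'c']
```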
The %- variable is used when you have more than one of the same named group in the same pattern, not when a given group happens to be iterated.
That’s why /(.)+/ doesn’t load up $1 with each separate character, just with the last one. The same goes for /(?<x>.)+/. However, with /(?<x>.)(?<x>.)/ you have two different <x> groups, so $-{x} holds one entry per group. Consider:
% perl -le '"foobar" =~ /(?<x>.)(?<x>.)/; print "x#1 is $-{x}[0], x#2 is $-{x}[1]"'
x#1 is f, x#2 is o
% perl -le '"foobar" =~ /(?:(?<x>.)(?<x>.))+/; print "x#1 is $-{x}[0], x#2 is $-{x}[1]"'
x#1 is a, x#2 is r
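The "only the last iteration survives" behaviour is not specific to Perl. As a sketch, Python's `re` module shows the same thing with numbered groups (Python does not allow two groups with the same name, so the `%-` part has no direct counterpart here):

```python
import re

# A repeated group keeps only the text from its last iteration.
single = re.match(r"(.)+", "foobar")
print(single.group(1))  # r

# Two distinct groups inside a repeated non-capturing group:
# each one keeps what it matched on the final pass ("a" and "r").
pair = re.match(r"(?:(.)(.))+", "foobar")
print(pair.groups())  # ('a', 'r')
```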
I'm not sure that is exactly what you're looking for, but the following code should do the trick.
$_='xaxbxc';
@l = /x(?<foo>.)/g;
print join(", ", @l)."\n";
But, I'm not sure this would work with overlapping strings.

Format all IP-Addresses to 3 digits

I'd like to use the search & replace dialogue in UltraEdit (Perl Compatible Regular Expressions) to format a list of IPs into a standard Format.
The list contains:
192.168.1.1
123.231.123.2
23.44.193.21
It should be formatted like this:
192.168.001.001
123.231.123.002
023.044.193.021
The RegEx from http://www.regextester.com/regular+expression+examples.html for IPv4 in the PCRE-Format is not working properly:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]){3}$
I'm stuck. Does anybody have a proper solution that works in UltraEdit?
Thanks in advance!
Set the regular expression engine to Perl (in the advanced section) and replace this:
(?<!\d)(\d\d?)(?!\d)
with this:
0$1
twice. That should do it.
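To check why two passes are enough, the replacement can be simulated in Python, where `\1` stands in for UltraEdit's `$1`. The helper name `pad_once` is invented for this sketch; the input is the list from the question.

```python
import re

# One or two digits that are not part of a longer digit run.
SHORT_OCTET = re.compile(r"(?<!\d)(\d\d?)(?!\d)")

def pad_once(text: str) -> str:
    """Prefix every 1- or 2-digit number with a single zero."""
    return SHORT_OCTET.sub(r"0\1", text)

ips = "192.168.1.1\n123.231.123.2\n23.44.193.21"

# First pass: 1 -> 01 and 23 -> 023. Second pass: 01 -> 001, 023 is left alone.
padded = pad_once(pad_once(ips))
print(padded)
```

Three-digit numbers never match, because the lookarounds reject any one- or two-digit slice that has another digit next to it.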
If your input is a single IP address (per line) and nothing else (no other text), this approach will work:
I used "Replace All" with Perl style regular expressions:
Replace (?<!\d)(?=\d\d?(?=[.\s]|$))
with 0
Just replace as often as it matches. If there is other text, things will get more complicated. Maybe the "Search in Column" option is helpful here, in case you are dealing with CSV.
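The "replace as often as it matches" step can be sketched in Python as a loop around `re.subn`, which reports how many replacements were made. The function name `pad_until_stable` is invented for illustration.

```python
import re

# Zero-width position in front of a 1- or 2-digit number that is followed
# by a dot, whitespace, or the end of the text.
GAP = re.compile(r"(?<!\d)(?=\d\d?(?=[.\s]|$))")

def pad_until_stable(text: str) -> str:
    """Insert a 0 at every matching position until nothing matches any more."""
    while True:
        text, count = GAP.subn("0", text)
        if count == 0:
            return text

print(pad_until_stable("192.168.1.1\n23.44.193.21"))
```

Because the pattern matches an empty position rather than the digits themselves, the replacement text is just `0` and no capture group is needed.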
If this is just a one-off data cleaning job, I often just use Excel or OpenOffice Calc for this type of thing:
Open your text file and make sure there is only one IP address per line.
Open Excel or whatever and go to "Data|Import External Data" and import your text file using "." as the separator.
You should now have 4 columns in excel:
192 | 168 | 1 | 1
Right click and format each column as a number with 3 digits and leading zeroes.
In column 5 just do a string concatenation of the previous columns with a "." in between each column:
A1 & "." & B1 & "." & C1 & "." & D1
This obviously is a cheap and dirty fix and is not a programmatic way of dealing with this, but I find this sort of technique useful for cleaning up data every now and then.
I'm not sure how you can use a regular expression in the "Replace With" box in UltraEdit.
You can use this regular expression to find your string:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])$
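Outside UltraEdit, a script could use this find pattern and do the padding in code. The following Python sketch builds the same four-octet pattern and pads each captured octet with `str.zfill`; the function name `pad_ip` is invented, and lines that do not look like a valid IPv4 address are returned unchanged.

```python
import re

# The octet sub-pattern from the regular expression above, repeated four times.
OCTET = r"(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])"
IPV4 = re.compile(r"^" + r"\.".join([OCTET] * 4) + r"$")

def pad_ip(line: str) -> str:
    """Zero-pad each octet to three digits if the line is a valid IPv4 address."""
    m = IPV4.match(line)
    if m is None:
        return line  # not an IPv4 address: leave the line as it is
    return ".".join(octet.zfill(3) for octet in m.groups())

for ip in ["192.168.1.1", "123.231.123.2", "23.44.193.21"]:
    print(pad_ip(ip))
```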