Good Day,
I have a simple working routine in Perl that swaps two words:
i.e. John Doe -----> Doe John
Here it is:
sub SwapTokens()
{
my ($currentToken) = @_;
$currentToken =~ s/([A-Za-z]+) ([A-Za-z]+)/$2 $1/;
# $currentToken =~ s/(\u\L) (\u\L)/$2 $1/;
return $currentToken;
}
The following usage yields exactly what I want:
print &SwapTokens("John Doe");
But when I uncomment the line '$currentToken =~ s/(\u\L) (\u\L)/$2 $1/;'
I get an error. Am I missing something? It looks like my syntax is correct.
TIA,
coson
\u is not a regex atom that matches an uppercase letter. \L is not a regex atom that matches a number of lowercase letters. You're looking for
s/(\p{Lu}\p{Ll}+) (\p{Lu}\p{Ll}+)/$2 $1/;
\p{Lu} Uppercase letter.
\p{Ll} Lowercase letter.
$ unichars '\p{Lu}' | head -n 5
A U+0041 LATIN CAPITAL LETTER A
B U+0042 LATIN CAPITAL LETTER B
C U+0043 LATIN CAPITAL LETTER C
D U+0044 LATIN CAPITAL LETTER D
E U+0045 LATIN CAPITAL LETTER E
$ unichars '\p{Ll}' | head -n 5
a U+0061 LATIN SMALL LETTER A
b U+0062 LATIN SMALL LETTER B
c U+0063 LATIN SMALL LETTER C
d U+0064 LATIN SMALL LETTER D
e U+0065 LATIN SMALL LETTER E
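For comparison, the same Unicode general categories can be inspected without the unichars tool. This is only a sketch in Python using the standard unicodedata module; the helper name first_chars_in_category is made up for illustration and is not part of any tool mentioned above.

```python
import unicodedata

def first_chars_in_category(category, limit=5):
    """Return descriptions of the first `limit` code points in a general category.

    `category` is a two-letter Unicode general category such as "Lu" or "Ll".
    """
    found = []
    for code_point in range(0x110000):
        ch = chr(code_point)
        if unicodedata.category(ch) == category:
            found.append(f"{ch} U+{code_point:04X} {unicodedata.name(ch)}")
            if len(found) == limit:
                break
    return found

for line in first_chars_in_category("Lu"):
    print(line)
```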
Perhaps you're looking for something like this:
sub swap_the_words {
my ($processed_string) = @_;
$processed_string =~ s/([A-Z][A-Za-z]+) ([A-Z][A-Za-z]+)/$2 $1/;
return $processed_string;
}
print swap_the_words('John Doe'); # prints Doe John
As for \u and \L, they are good for modifying the string, not the regex. For example, you can slightly alter your script like this...
$processed_string =~ s/([a-z]+) ([a-z]+)/\u\L$2\E \u\L$1\E/i;
...
print swap_the_words('cOsOn hAcKeR'); # Hacker Coson
... so your words are not only swapped, but given the proper case as well. Note, though, that these modifiers are used in the replacement part of s/// operator.
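The same idea, where case changes happen in the replacement rather than in the pattern, can be sketched in Python for comparison. This is an illustrative translation, not the original Perl: the function name is reused from above, and str.capitalize() stands in for \u\L...\E.

```python
import re

def swap_the_words(text):
    """Swap the first pair of space-separated words and fix their case."""
    def swap(match):
        first, second = match.group(1), match.group(2)
        # capitalize() lowercases the word, then uppercases its first letter
        return f"{second.capitalize()} {first.capitalize()}"
    # count=1 mirrors s/// without the /g flag: only the first pair is swapped
    return re.sub(r"([A-Za-z]+) ([A-Za-z]+)", swap, text, count=1)

print(swap_the_words("cOsOn hAcKeR"))  # Hacker Coson
```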
\L means "lowercase till \E"; i.e., it needs to be followed at some point by \E. You do not have \E in your regex, thus it is not valid; adding \E after each \L gets the script to compile, though I have no idea what you are actually trying to accomplish there.
I want the regex that allows me to match words that have hyphen in the middle and start with uppercase letter + words that start with uppercase letter without hyphen.
Also, I want only the first letter to be uppercase and all the others lowercase; something like ENGLAND is not what I need, because all its letters are uppercase.
I will give examples for all the wanted words' structure:
Wilkes-Barre
California
I have tried:
[A-Z][a-z-]\+[A-Z][a-z]\+
but it only matches things like Wilkes-Barre; it doesn't match California
also tried
[A-Z][a-z-]\+
this one matches things like California, but it treats Wilkes-Barre as 2 words: Wilkes- and Barre
So can someone please help me find a regex that matches those 2 types of words, so that if I grep a file that has
Wilkes-Barre
California
ENGLAND
rome
It will only match the first 2 and it will give 2 matches not 3.
You do not specify whether a single upper-case letter should match. Let's assume the answer is yes. The following should do what you want:
$ grep -E '^((^|-)[A-Z][a-z]*)+$' data.txt
Wilkes-Barre
California
It matches entire lines (because of the leading ^ and trailing $) of one or more tokens (one or more because of the +) where each token is a hyphen or the beginning of the line ((^|-)) followed by a single upper case letter ([A-Z]) and zero or more lower case letters ([a-z]*).
If there must be at least one lower case letter after the upper case letter, just replace the * by a +:
grep -E '^((^|-)[A-Z][a-z]+)+$' data.txt
These regexes also match a line like -Foobar. If this is not wanted the following excludes lines that start with a hyphen:
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$' data.txt
or (if at least one lower case letter is required):
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$' data.txt
Finally, if there is at most one hyphen (no Foo-Bar-Baz):
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)?$' data.txt
or:
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$' data.txt
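As a quick way to experiment with these variants outside grep, here is a sketch in Python of the "at least one lower case letter, any number of hyphens" form. The names CAPITALISED and capitalised_lines are made up for this example; fullmatch() plays the role of the ^ and $ anchors.

```python
import re

# Capitalised token, optionally followed by more hyphen-joined capitalised tokens
CAPITALISED = re.compile(r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*")

def capitalised_lines(lines):
    """Keep only the lines that consist entirely of such a word."""
    return [line for line in lines if CAPITALISED.fullmatch(line)]

print(capitalised_lines(["Wilkes-Barre", "California", "ENGLAND", "rome"]))
```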
You can use
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$'
See the online demo:
#!/bin/bash
s='Wilkes-Barre
California'
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$' <<< "$s"
Output:
Wilkes-Barre
California
POSIX ERE pattern details:
^ - start of string
[[:upper:]] - an uppercase letter
[[:lower:]]+ - one or more lowercase letters
(-[[:upper:]][[:lower:]]*)? - an optional occurrence of a hyphen, an uppercase letter and then zero or more lowercase letters
$ - end of string.
NOTE: If you need to match strings with more than one hyphen, replace the last ? with *.
Normally the answer should be:
grep "^[A-Z][a-z-]+" test.txt
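However, on my system the plus sign is not recognised (plain grep uses basic regular expressions, where + is a literal character unless you pass -E or write \+), so I have to go for: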
grep "^[A-Z][a-z-][a-z-]*" test.txt
Explanation:
^ : start of the line
[A-Z] : all possible uppercase letters
[a-z-] : all possible lowercase letters or a hyphen
Edit after comment
This, however, only shows the first part of Wilkes-Barre. If you want both, you might try this:
egrep "^[A-Z][a-z-]+|^[A-Z][a-z-]+[A-Z][a-z-]+" test.txt
I am trying to extract words [a-zA-Z]+ with one constraint: a word must contain at least one lower case letter AND at least one upper case letter (in any position within the word). Example: if input is hello 123 worLD, the only match should be worLD.
I tried to use positive lookaheads like this:
echo "hello 123 worLD" | grep -oP "(?=.*[a-z])(?=.*[A-Z])[a-zA-Z]+"
hello
This is not correct: the only match is hello instead of worLD. Then I tried this:
echo "hello 123 worLD" | grep -oP "\K((?=.*[a-z])(?=.*[A-Z])[a-zA-Z]+)"
hello
worLD
This is still incorrect: hello should not be matched.
The .* in the lookaheads checks for the letter presence not only in the adjacent word, but later in the string. Use [a-zA-Z]*:
echo "hello 123 worLD" | grep -oP "\\b(?=[A-Za-z]*[a-z])(?=[A-Za-z]*[A-Z])[a-zA-Z]+"
See the demo online
I also added a word boundary \b at the start so that the lookahead check was only performed after a word boundary.
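The difference between the two kinds of lookahead can be checked with a short Python sketch. Python's re module is used here only as a stand-in for grep -P, and the pattern names are invented for illustration.

```python
import re

text = "hello 123 worLD"

# .* lets each lookahead scan to the end of the line, so "hello" passes
# because an uppercase letter exists later on, in "worLD"
LOOSE = re.compile(r"(?=.*[a-z])(?=.*[A-Z])[a-zA-Z]+")

# [A-Za-z]* keeps each lookahead inside the current word
STRICT = re.compile(r"\b(?=[A-Za-z]*[a-z])(?=[A-Za-z]*[A-Z])[a-zA-Z]+")

print(LOOSE.findall(text))
print(STRICT.findall(text))
```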
Answer:
echo "hello 123 worLD" | grep -oP "\b(?=[A-Z]+[a-z]|[a-z]+[A-Z])[a-zA-Z]*"
Demo: https://ideone.com/HjLH5o
Explanation:
First, check whether the word starts with one or more uppercase letters followed by a lowercase letter, or vice versa; then match any number of lowercase and uppercase letters in any order.
Performance:
This solution takes 31 steps to reach the match on the provided test string, while the accepted solution takes 47 steps.
I stumbled upon this seemingly trivial question, and I'm stuck on it. I have a string, in which I want to match in one regex all uppercase words only if somewhere in the string there's at least a lowercase letter.
Basically, I want each of these lines (we can consider I'll apply the regex to each line separately, no need for some multiline handling) to output:
ab ABC //matches or captures ABC
ab ABC 12 CD //matches or captures ABC, CD
ABC DE //matches or captures nothing (no lowercase)
ABC 23 DE EFG a //matches or captures ABC, DE, EFG
AB aF DE //matches or captures AB, DE
I am using PCRE as regex flavor (I know some other flavors allow for variable length look-behind).
Update after comments
Obviously, there are lots of easy solutions if I use multiple regex or the program language I'm using to call the regex (e.g. first validate the string by looking for a lowercase letter then match all uppercase words with two different regex).
My goal here is to find a way to do it with one regex.
I have no technical imperative for this constraint. Take it as an exercise of style if you have to, or curiosity, or me trying to up my regex skills: the task seemed (at first) so simple that I'd like to know if one regex alone can achieve it. If it can't, I'd like to understand why.
Or if it can but regex aren't designed for these kind of tasks, I wish I'd know why - or at least what are "these kind of unsuited tasks", so that I can choose the right solution when I meet them.
So, is it doable in one regex?
Update
So \G initially is set to a matched condition at position 0.
Which means in multi-line mode, BOS has to be a special case.
Even though BOString is a BOLine, if the assertion (?= ^ .* [a-z] ) fails,
\G is initially set as matched (default?) and UC words are found without being validated.
(?|(?=\A.*[a-z]).*?\b([A-Z]+)\b|(?!\A)(?:(?=^.*[a-z])|\G.*?\b([A-Z]+)\b))
Update 2 Posted for posterity.
After some discussion with @Robin, the above regex can be refactored to this:
# (?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b
(?:
(?= ^ .* [a-z] ) # BOL, check if line has lower case letter
| # or
(?! \A ) # Not at BOS (beginning of string, where \G is in a matched state)
\G # Start the match at the end of last match (if previous matched state)
)
.*? \b
( [A-Z]+ ) # (1), Found UC word
\b
Perl test case:
$/ = undef;
$str = <DATA>;
@ary = $str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg;
print "@ary", "\n-------------\n";
while ($str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg)
{
print "$1 ";
}
__DATA__
DA EFR
ab ABC
ab ABC 12 CD
ABC DE t
ABC 23 DE EFG a
Output >>
ABC ABC CD ABC DE ABC DE EFG
-------------
ABC ABC CD ABC DE ABC DE EFG
Silly questions deserve silly answers.
/(?{ @matches = m{\b\p{Lu}+\b}g if m{\p{Ll}} })/;
Test:
use strict;
use warnings;
use feature qw( say );
while (<DATA>) {
chomp;
local our @matches;
/(?{ @matches = m{\p{Lu}+}g if m{\p{Ll}} })/;
say "$_: ", join ', ', @matches;
}
__DATA__
ab ABC
ab ABC 12 CD
ABC DE
ABC 23 DE EFG a
And now for the silly answer I promised:
my @matches = /
\G
(?: (?! ^ )
| (?= .* \p{Ll} )
)
.*? ( \b \p{Lu}+ \b )
/sg;
which condenses to
my @matches = /\G(?:(?!^)|(?=.*\p{Ll})).*?(\b\p{Lu}+\b)/sg;
At the start of the string, it looks ahead for a lower-case. Anywhere else, there's no need to check since we already checked.
I'm not sure if that can be done, but here is some background information, explaining the "Why?" part a bit.
Regexes were designed to match regular languages, and originally, that's all they could do. In fact, regular grammars are among the simplest that aren't completely trivial; most modern computer languages use non-regular grammars, for instance. (See, especially, this section.)
So, there is a limit to what kind of languages a regex can describe, and it is far more limited than what you can describe with some simple English sentences, for example.
The Chomsky hierarchy is a way to classify languages into different levels of expressiveness. Note that regular grammars (Type-3) are all the way at the bottom, and most useful (programming) languages are either Type-2 (context-free), or borderline Type-2 (i.e. with a few Type-1 parts added in). This is due to a simple fact: our brains are quite capable of processing context-sensitive (Type-1) grammars, even complex ones (so we want programming languages to be powerful). However, computer parsers for context-sensitive grammars are quite a bit slower than those for Type-2 (so we want programming languages to be limited in power).
For regexes, which are expected to match very quickly, it's even more important to limit their overall expressiveness. But, by writing two or more regexes with some control-structure added, you are effectively expanding them to be more powerful than a regular expression parser.
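To make the "two regexes plus a control structure" idea concrete, here is a minimal sketch in Python. It is the plain two-step approach the question set aside, not a single-regex solution, and the function name uppercase_words is invented for this example.

```python
import re

def uppercase_words(line):
    """Return the all-uppercase words of `line`, but only if the line
    contains at least one lowercase letter somewhere."""
    if not re.search(r"[a-z]", line):      # step 1: validate the whole line
        return []
    return re.findall(r"\b[A-Z]+\b", line)  # step 2: collect uppercase words

for line in ["ab ABC", "ab ABC 12 CD", "ABC DE", "ABC 23 DE EFG a", "AB aF DE"]:
    print(line, "->", uppercase_words(line))
```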
Maybe we're overthinking things:
#! /usr/bin/env perl
#
use strict;
use feature qw(say);
use autodie;
use warnings;
use Data::Dumper;
while ( my $string = <DATA> ) {
chomp $string;
my @array;
say qq(String: "$string");
if ( @array = $string =~ /(\b[A-Z]+\b)/g ) {
say qq(String groups: ) . join( ", ", @array ) . "\n";
}
}
__DATA__
ab ABC
ab ABC 12 CD
ABC DE
ABC 23 DE EFG a
AB aF DE
ADSD asd ADSD
asd ADSDSD
SDSD SDD SD
SSDD SDS asds
The output:
String: "ab ABC"
String groups: ABC
String: "ab ABC 12 CD"
String groups: ABC, CD
String: "ABC DE"
String groups: ABC, DE
String: "ABC 23 DE EFG a"
String groups: ABC, DE, EFG
String: "AB aF DE"
String groups: AB, DE
String: "ADSD asd ADSD"
String groups: ADSD, ADSD
String: "asd ADSDSD"
String groups: ADSDSD
String: "SDSD SDD SD"
String groups: SDSD, SDD, SD
String: "SSDD SDS asds"
String groups: SSDD, SDS
Did I miss something?
One regex:
@words = split (/[a-z]+/, $_);
Could anybody help on a regular expression that I can use to validate if a string contains both digit and non-digit characters?
I'm using "\d+\D+" but it's not working. The test cases I have are:
a1
1a
a1b
1ab
ab1
1-2
12-
-12
The test cases I listed should all result in a match. I'm using JavaScript's RegExp.test(), so 999 or asdf or _+sdf should not match.
Your current regex only matches strings of one or more digits, followed by one or more non-digits. You could use a look-ahead to check for the existence of a digit:
"(?=.*\d).*\D.*"
The (?=.*\d) part means "somewhere after this, there must be zero or more of any character followed by a digit." This allows your digit to appear anywhere in the string.
The .*\D.* part means "match zero or more of any character, then a non-digit, then zero or more of any character," which will match a non-digit at any position in the string and the rest of the characters (digits or not) around it.
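The lookahead pattern can be tried against the question's test cases with a short sketch. Python's re module is used here instead of JavaScript's RegExp purely for illustration, and the function name is made up; note that . does not match a newline by default, so this assumes single-line input.

```python
import re

# A digit somewhere ahead, then a non-digit anywhere in the string
MIXED = re.compile(r"(?=.*\d).*\D.*")

def has_digit_and_non_digit(s):
    """True if `s` contains at least one digit and at least one non-digit."""
    return MIXED.fullmatch(s) is not None

for s in ["a1", "1a", "a1b", "1ab", "ab1", "1-2", "12-", "-12", "999", "asdf", "_+sdf"]:
    print(s, has_digit_and_non_digit(s))
```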
You can try using lookaheads:
.*(?=.*\d)(?=.*\D).*
But maybe you don't even need a regex? Depending on the language/tool you're using, you might be able to do something like this:
Let your input string be s. If s is empty, it is invalid.
If the first character of s is a digit:
Loop through the other characters of s until you find a non-digit. If you don't find a non-digit, s is invalid.
Otherwise:
Loop through the other characters of s until you find a digit. If you don't find a digit, s is invalid.
If you found the appropriate digit/non-digit, s is valid.
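The steps above could be written out roughly like this. It is only a sketch in Python; the function name is invented for this example, and "digit" is assumed to mean the ASCII digits 0-9.

```python
DIGITS = "0123456789"

def has_digit_and_non_digit(s):
    """Regex-free check: does `s` contain both a digit and a non-digit?"""
    if not s:
        return False                      # an empty string is invalid
    rest = s[1:]
    if s[0] in DIGITS:
        # first character is a digit: look for any non-digit after it
        return any(ch not in DIGITS for ch in rest)
    # first character is a non-digit: look for any digit after it
    return any(ch in DIGITS for ch in rest)

print(has_digit_and_non_digit("1-2"), has_digit_and_non_digit("999"))
```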
This here works for me.
It's ( match1 | match2 ) where | means OR.
(\d+[a-zA-Z]+|[a-zA-Z]+\d+)
If by digit and non-digit you mean a digit and any non-digit character, you can use the character classes \d for a digit and \D, which means [^\d]. There is no need for a lookaround here. If you mean a number and a letter, you can use the following. I'm exploding your string to get the comparison strings, and I'm using a group with an | operator to allow for a digit before a letter and vice versa.
<?php
$string = 'a1 1a a1b 1ab ab1 1-2 12- -12';
$strings = explode(' ',$string);
$pattern = '!([0-9][A-Za-z]|[A-Za-z][0-9])!';
foreach($strings as $tempString){
if(preg_match($pattern,$tempString)){
echo "$tempString matches\n";
} else {
echo "$tempString doesn't match\n";
}
}
?>
Output
a1 matches
1a matches
a1b matches
1ab matches
ab1 matches
1-2 doesn't match
12- doesn't match
-12 doesn't match
If we change to the \d\D character classes everything matches.
$pattern = '!(\d\D|\D\d)!';
Output
a1 matches
1a matches
a1b matches
1ab matches
ab1 matches
1-2 matches
12- matches
-12 matches
I am not a regex expert, but my request is simple: I need to match any string that has at least 3 or more characters that are matching.
So for instance, we have the string "hello world" and matching it with the following:
"he" => false // only 2 characters
"hel" => true // 3 characters match found
This is python regex, but it probably works in other languages that implement it, too.
I guess it depends on what you consider a character to be. If it's letters, numbers, and underscores:
\w{3,}
if just letters and digits:
[a-zA-Z0-9]{3,}
Python also has a regex method to return all matches from a string.
>>> import re
>>> re.findall(r'\w{3,}', 'This is a long string, yes it is.')
['This', 'long', 'string', 'yes']
Try this: .{3,} will match three or more of any character except a newline (\n).
If you want to match starting from the beginning of the word, use:
\b\w{3,}
\b: word boundary
\w: word character
{3,}: three or more times for the word character
I was looking for something similar to the first post in this topic.
For my needs, I found this:
http://answers.oreilly.com/topic/217-how-to-match-whole-words-with-a-regular-expression/
"\b[a-zA-Z0-9]{3}\b"
3 char words only
"iokldöajf asd alkjwnkmd asd kja wwda da aij ednm <.jkakla "
You could try simply three dots. Refer to the Perl code below:
$a =~ m /.../ #where $a is your string
For .NET usage:
\p{L}{3,}