Matching the pattern with foreign character - regex

Here I build a regular expression where _pattern is the list of team names and _name is the keyword I want to test against _pattern.
The result shows a match. I'm wondering how that is possible, because the keyword is completely different from everything in _pattern. I suspect it is related to the é symbol.
string _pattern = "Ipswich Town F.C.|Ipswich Town Football Club|Ipswich|The Blues||Town|The Tractor Boys|Ipswich Town";
string _name = "Estudiantes de Mérida";
regex = new Regex(@"(" + _pattern + @")", RegexOptions.IgnoreCase);
Match m = regex.Match(_name);
if (m.Success)
{
var g = m.Groups[1].Value;
break;
}

It has nothing to do with the é symbol. Let's go over a few things.
Is the double | in your pattern intentional?
The Blues||Town
The empty alternative between the two pipes matches the empty string, so the regex succeeds on any input, including "Estudiantes de Mérida". That is why m.Success is true here.
Also, the dot has a special meaning in a regex (it matches any character), so you should escape it:
Ipswich Town F\.C\.
The alternatives can each be enclosed in parentheses to make the grouping explicit:
(Ipswich Town F\.C\.)|(Ipswich Town Football Club)|(Ipswich)|
The outer parentheses in the following C# line are then not necessary:
regex = new Regex(@"(" + _pattern + @")"
The regex that I would write for your purposes is:
^((Ipswich Town F\.C\.)|(Ipswich Town Football Club)|(Ipswich)|(The Blues)|(Town)|(The Tractor Boys)|(Ipswich Town))$
As you can see, there are quite a few differences: the empty alternative is gone, the dots are escaped, and the ^ and $ anchors require the whole name to match.
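To see the effect of the empty alternative in isolation, here is a minimal Python sketch (Python's re module behaves the same way as .NET for this case). The helper name build_team_regex is a hypothetical illustration, not part of the original code, and the anchored pattern assumes the whole name has to match.

```python
import re

pattern = ("Ipswich Town F.C.|Ipswich Town Football Club|Ipswich|The Blues"
           "||Town|The Tractor Boys|Ipswich Town")
name = "Estudiantes de Mérida"

# The empty alternative produced by "||" matches the empty string at index 0,
# so the search "succeeds" even though no team name occurs in the input.
m = re.search("(" + pattern + ")", name, re.IGNORECASE)
matched_text = m.group(1)


def build_team_regex(names):
    """Hypothetical helper: escape each name, drop empty entries, anchor the input."""
    parts = [re.escape(n) for n in names if n]
    return re.compile("^(?:" + "|".join(parts) + ")$", re.IGNORECASE)


team_regex = build_team_regex(pattern.split("|"))

print(repr(matched_text))                            # ''
print(team_regex.match(name))                        # None
print(bool(team_regex.match("ipswich town f.c.")))   # True
```

Building the pattern from a list and escaping each entry avoids both problems at once: stray empty entries are filtered out, and characters such as the dot lose their special meaning.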

Related

Is it possible to combine two regex patterns into one without using OR | , pipe character

I am working on a regex that matches either (not both) of two conditions:
Digits following A, e.g. A123, regex (?<=A)\d+
Digits followed by Z, e.g. 123Z, regex \d+(?=Z)
The naive way is to use '|' to combine them like this:
(?<=A)\d+|\d+(?=Z)
This could be better for readability and maintainability, but I'm just curious whether there is another way to do it without |.
Here is some C# test code for that.
string pattern = @"(?<=A)\d+|\d+(?=Z)";
var currencies = new string[]{
"A123", // match
"123Z", // match
"123", // NOT match
};
foreach (var c in currencies)
{
Match m = Regex.Match(c, pattern, RegexOptions.IgnoreCase);
Console.WriteLine("Matched:" + m.Success + ". Value:" + m.Value);
}
The output of the code is:
Matched:True. Value:123
Matched:True. Value:123
Matched:False. Value:

Grab first 4 characters of two words RegEx

I would like to grab the first 4 characters of two words using RegEx. I have some RegEx experience, but a search did not yield any results.
So if I have Awesome Sauce I would like the end result to be AwesSauc
Use the Replace Text action with the following parameters:
Pattern: \W*\b(\p{L}{1,4})\w*\W*
Replacement text: $1
See the regex demo.
Pattern details:
\W* - 0+ non-word chars (trim from the left)
\b - a leading word boundary
(\p{L}{1,4}) - Group 1 (later referred to via $1 backreference) matching any 1 to 4 letters (incl. Unicode ones)
\w* - any 0+ word chars (to match the rest of the word)
\W* - 0+ non-word chars (trim from the right)
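The same replacement can be sketched in Python. Note the assumption: Python's built-in re module has no \p{L}, so this sketch substitutes [^\W\d_] (a word character that is neither a digit nor an underscore) as the letter class. The function name is made up for the example.

```python
import re


def first_four_of_each_word(text):
    # \W*  : skip leading non-word characters
    # \b   : start of a word
    # ([^\W\d_]{1,4}) : keep up to four letters (stand-in for \p{L})
    # \w*\W* : discard the rest of the word and trailing non-word characters
    return re.sub(r"\W*\b([^\W\d_]{1,4})\w*\W*", r"\1", text)


print(first_four_of_each_word("Awesome Sauce"))  # AwesSauc
```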
I think this RegEx should do the job
string pattern = @"\b\w{4}";
var text = "The quick brown fox jumps over the lazy dog";
Regex regex = new Regex(pattern);
var match = regex.Match(text);
while (match.Captures.Count != 0)
{
foreach (var capture in match.Captures)
{
Console.WriteLine(capture);
}
match = match.NextMatch();
}
// outputs:
// quic
// brow
// jump
// over
// lazy
Alternatively you could use patterns like:
\b\w{1,4} => The, quic, brow, fox, jump, over, the, lazy, dog
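\b[\w\d]{1,4} => same result as above, because \w already matches digits (and a | inside a character class would be a literal pipe, not an OR)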
Update:
added a full example for C# and modified the pattern slightly. Also added some alternative patterns.
one approach with Linq
var res = new string(input.Split().SelectMany((x => x.Where((y, i) => i < 4))).ToArray());
Try this expression
\b[a-zA-Z0-9]{1,4}
Using regex would in fact be more complex and totally unnecessary for this case. Just do it as either of the below.
var sentence = "Awesome Sau";
// With LINQ
var linqWay = string.Join("", sentence.Split(" ".ToCharArray(), options:StringSplitOptions.RemoveEmptyEntries).Select(x => x.Substring(0, Math.Min(4,x.Length))).ToArray());
// Without LINQ
var oldWay = new StringBuilder();
string[] words = sentence.Split(" ".ToCharArray(), options:StringSplitOptions.RemoveEmptyEntries);
foreach(var word in words) {
oldWay.Append(word.Substring(0, Math.Min(4, word.Length)));
}
Edit:
Updated code based on @Dai's comment. Math.Min check borrowed as is from his suggestion.
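For comparison, the split-and-truncate idea is very short in Python. This is only a sketch with an invented function name; slicing with [:4] plays the role of the Math.Min check, since a slice never runs past the end of a short word.

```python
def take_four(sentence):
    # split() without arguments splits on any whitespace and drops empty entries
    return "".join(word[:4] for word in sentence.split())


print(take_four("Awesome Sau"))  # AwesSau
```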

c# regex split or replace. here's my code i did

I am trying to replace a certain group with "" using regex.
I have been searching and doing my best, but it's over my head.
What I want to do is:
string text = "(12je)apple(/)(jj92)banana(/)cat";
string[] resultIwant = { "apple", "banana", "cat" };
Each opening tag is a pair of parentheses containing exactly 4 characters (letters or digits),
and '(/)' closes it.
Here's my code. (I was using the Matches function.)
string text = @"(12dj)apple(/)(88j1)banana(/)cat";
string pattern = @"\(.{4}\)(?<value>.+?)\(/\)";
Regex rex = new Regex(pattern);
MatchCollection mc = rex.Matches(text);
if(mc.Count > 0)
{
foreach(Match str in mc)
{
print(str.Groups["value"].Value.ToString());
}
}
However, the result was
apple
banana
So I think I should use replace or something else instead of Matches.
The regex below captures the word characters that come right after a ):
(?<=\))(\w+)
Your C# code would be:
{
string str = "(12je)apple(/)(jj92)banana(/)cat";
Regex rgx = new Regex(@"(?<=\))(\w+)");
foreach (Match m in rgx.Matches(str))
Console.WriteLine(m.Groups[1].Value);
}
Explanation:
(?<=\)) A positive lookbehind. It places the match position right after a ) symbol without consuming it.
() A capturing group.
\w+ Captures all the following word characters. It stops before the next ( symbol because that is not a word character.
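A small Python sketch shows both behaviours side by side: the original pattern needs a closing (/) after every value, so the trailing "cat" is missed, while the lookbehind version picks up every run of word characters that follows a ). The variable names are chosen for this example only.

```python
import re

text = "(12je)apple(/)(jj92)banana(/)cat"

# Original approach: a value must be followed by the closing tag "(/)".
with_closing_tag = re.findall(r"\(.{4}\)(.+?)\(/\)", text)

# Lookbehind approach: any word characters that directly follow a ")".
after_paren = re.findall(r"(?<=\))\w+", text)

print(with_closing_tag)  # ['apple', 'banana']
print(after_paren)       # ['apple', 'banana', 'cat']
```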

Developing Regular Expression to my needs

I'm really bad with regular expressions and find them to be too complex. However, I need to use them to do some string manipulation in classic asp.
Input String :
"James John Junior
S.D. Industrial Corpn
D-2341, Focal Point, Phase 4-a,
Sarsona, Penns
Japan
Phone : 92-161-4633248 Fax : 92-161-253214
email : swerte_60#laher.com"
Desired Output string:
"JXXXX JXXX JXXXXX
S.X. IXXXXXXXXX CXXXX
D-XXXX, FXXXX PXXXX, PXXXX 4-X,
SXXXXXX, PXXXX
JXXXX
PXXXX : 9X-XXX-XXXXXXX Fax : 9X-XXX-XXXXXX
eXXXX : sXXXXX_XX#XXXXX.XXX"
Note: We need to split the original string into words on single spaces. Then, in each word, we need to replace every letter (lower and upper case) and digit except the first character with an "X".
I know it's somewhat difficult, but I would think a seasoned RegEx expert could nail this pretty easily. No?
Edit:
I've made some progress. Found a function (http://www.addedbytes.com/lab/vbscript-regular-expressions/) that sort of does the job. But needs a little refinement, if anyone can help
function ereg_replace(strOriginalString, strPattern, strReplacement, varIgnoreCase)
' Function replaces pattern with replacement
' varIgnoreCase must be TRUE (match is case insensitive) or FALSE (match is case sensitive)
dim objRegExp : set objRegExp = new RegExp
with objRegExp
.Pattern = strPattern
.IgnoreCase = varIgnoreCase
.Global = True
end with
ereg_replace = objRegExp.replace(strOriginalString, strReplacement)
set objRegExp = nothing
end function
I'm calling it like so:
orgstr = ereg_replace(orgstr, "\w", "X", True)
However, the result looks like -
XXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXX.
XX, XXXXX XXXX, XXXXXX XXXXXX, XXXXXXX XXXXXXX, XXXXXXXXX
XXXXX : XXX-XXX-XXXX
XXX :
XXXXX : XXXXXX#XXXXXX.XX
I'd like this to show the first character in every word. Any help out there?
This approach gets close:
Function AnonymiseWord(m, p, s)
AnonymiseWord = Left(m, 1) & String(Len(m) - 1, "X")
End Function
Function AnonymiseText(input)
Dim rgx: Set rgx = new RegExp
rgx.Global = True
rgx.Pattern = "\b\w+?\b"
AnonymiseText = rgx.Replace(input, GetRef("AnonymiseWord"))
End Function
This may already be close enough to what you need. If not, the basic approach is sound, but you may need to adjust the pattern so that it matches exactly the stretches of text you want to pass through AnonymiseWord.
Well, in .NET it would be easy:
resultString = Regex.Replace(subjectString,
@"(?<= # Assert that there is before the current position...
\b # a word boundary
\w # one alphanumeric character (= first letter/digit/underscore)
[\w.#-]* # any number of alnum characters or ., # or -
) # End of lookbehind
[\p{L}\p{N}] # Match any letter or digit to be replaced",
"X", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
The result, though, would be slightly different than what you wrote:
"JXXXX JXXX JXXXXX
S.X. IXXXXXXXXX CXXXX
D-XXXX, FXXXX PXXXX, PXXXX 4-X,
SXXXXXX, PXXXX
JXXXX
PXXXX : 9X-XXX-XXXXXXX FXX : 9X-XXX-XXXXXX
eXXXX : sXXXXX_XX#XXXXX.XXX"
(observe that Fax has also been changed to FXX)
Without .NET, you could try something like
orgstr = ereg_replace(orgstr, "\b(\w)[\w.#-]*", "$1XXXX", True) ' VBScript refers to captured groups as $1, $2, ... in the replacement
which would give you
"JXXXX JXXXX JXXXX
SXXXX IXXXX CXXXX
DXXXX, FXXXX PXXXX, PXXXX 4XXXX,
SXXXX, PXXXX
JXXXX
PXXXX : 9XXXX FXXXX : 9XXXX
eXXXX : sXXXX"
You won't get it better than that with a single regex.
I have no idea about classic ASP, but if it does support (negative) lookbehinds and the only problem is the quantifier in the lookbehind, then why not turn it around and do it this way:
(?<!^)(?<!\s)[a-zA-Z0-9]
and replace with "X".
That is, replace every letter and digit unless it is preceded by whitespace or sits at the start of the string/line.
See it here on Regexr
Although I love regular expressions, you could do it without them, especially because VBScript does not support look behind.
Dim mystring, myArray, newString, i, j
Const forbiddenChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
myString = "James John Junior S.D. Industrial Corpn D-2341, Focal Point, Phase 4-a, Sarsona, Penns Japan Phone : 92-161-4633248 Fax : 92-161-253214 email : swerte_60#laher.com"
myArray = split(myString, " ")
For i = lbound(myArray) to ubound(myArray)
newString = left(myArray(i), 1)
For j = 2 to len(myArray(i))
If instr(forbiddenChars, mid(myArray(i), j, 1)) > 0 Then
newString = newString & "X"
else
newString = newString & mid(myArray(i), j, 1)
End If
Next
myArray(i) = newString
Next
myString = join(myArray, " ")
It doesn't cope with the VbNewLine character, but you will get the idea. You can do an extra split on the VbNewLine character, iterate through all elements and split each element on the space for example.
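The split-and-mask idea can be sketched compactly in Python, here as an illustration rather than a classic ASP solution. The function name mask_words is invented for the example. Matching \S+ instead of splitting on a single space means line breaks are handled as well, and, as noted above, "Fax" becomes "FXX".

```python
import re


def mask_words(text):
    def mask(match):
        word = match.group()
        # Keep the first character, replace every later letter or digit with X.
        return word[0] + re.sub(r"[A-Za-z0-9]", "X", word[1:])

    # \S+ treats each run of non-whitespace characters as one word.
    return re.sub(r"\S+", mask, text)


print(mask_words("James John Junior\nS.D. Industrial Corpn"))
print(mask_words("email : swerte_60#laher.com"))
```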

Regex capitalize first letter every word, also after a special character like a dash

I use this to capitalize the first letter of every word:
#(\s|^)([a-z0-9-_]+)#i
I want it also to capitalize the letter if it's after a special mark like a dash (-).
Now it shows:
This Is A Test For-stackoverflow
And I want this:
This Is A Test For-Stackoverflow
+1 for word boundaries, and here is a comparable Javascript solution. This accounts for possessives, as well:
var re = /(\b[a-z](?!\s))/g;
var s = "fort collins, croton-on-hudson, harper's ferry, coeur d'alene, o'fallon";
s = s.replace(re, function(x){return x.toUpperCase();});
console.log(s); // "Fort Collins, Croton-On-Hudson, Harper's Ferry, Coeur D'Alene, O'Fallon"
A simple solution is to use word boundaries:
#\b[a-z0-9-_]+#i
Alternatively, you can match for just a few characters:
#([\s\-_]|^)([a-z0-9-_]+)#i
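As a rough illustration of the second variant, the following Python sketch uppercases a letter only at the start of the text or directly after whitespace, a dash, or an underscore. The function name is made up for the example. Unlike a plain \b, this leaves the letter after an apostrophe alone.

```python
import re


def capitalize_words(text):
    # A lowercase letter is a "word start" at the beginning of the text or
    # right after whitespace, an underscore, or a dash (checked via lookbehind).
    return re.sub(r"(?:^|(?<=[\s_-]))[a-z]", lambda m: m.group().upper(), text)


print(capitalize_words("this is a test for-stackoverflow"))
# This Is A Test For-Stackoverflow
```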
If you want to use pure regular expressions you must use the \u.
To transform this string:
This Is A Test For-stackoverflow
into
This Is A Test For-Stackoverflow
You must put:
(.+)-(.+) to capture the values before and after the "-"
then to replace it you must put:
$1-\u$2
If it is in bash you must put:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
Actually, you don't need to match the full string. Just match the first lowercase letter after a word boundary, like this:
'~\b([a-z])~'
For JavaScript, here’s a solution that works across different languages and alphabets:
const originalString = "this is a test for-stackoverflow"
const processedString = originalString.replace(/(?:^|\s|[-"'([{])+\S/g, (c) => c.toUpperCase())
It matches any non-whitespace character \S that is preceded by the start of the string ^, whitespace \s, or any of the characters -"'([{, and replaces it with its uppercase variant.
my solution using javascript
function capitalize(str) {
var reg = /\b([a-zÁ-ú]{3,})/g;
return str.replace(reg, (w) => w.charAt(0).toUpperCase() + w.slice(1));
}
With ES6+ JavaScript:
const capitalize = str =>
str.replace(/\b([a-zÁ-ú]{3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
/<expression-here>/g
[a-zÁ-ú] matches lowercase ASCII letters plus the accented letters in the range Á-ú.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
[a-zÁ-ú]{3,} requires at least three such letters, which skips short words such as "de" and "e".
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
\b([a-zÁ-ú]{3,}) finally, \b anchors the match at a word boundary so that only complete words are selected. The parentheses group the expression.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
After that, I capitalize the first letter of each selected word:
w.charAt(0).toUpperCase() + w.slice(1); // output -> Output
joining the two
str.replace(/\b(([a-zÁ-ú]){3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
result:
Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
Python solution:
>>> import re
>>> the_string = 'this is a test for stack-overflow'
>>> re.sub(r'(((?<=\s)|^|-)[a-z])', lambda x: x.group().upper(), the_string)
'This Is A Test For Stack-Overflow'
read about the "positive lookbehind"
While this answer for a pure Regular Expression solution is accurate:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
it should be noted when using any Case-Change Operators:
\l Change case of only the first character to the right to lowercase. (Note: lowercase 'L')
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
the end delimiter should be used:
\E
so the end result should be:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\E\2/'
This will turn
r.e.a.c de boeremeakers
into
R.E.A.C De Boeremeakers
with the pattern
(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])
using
Dim matches As MatchCollection = Regex.Matches(inputText, "(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])")
Dim outputText As New StringBuilder
If matches(0).Index > 0 Then outputText.Append(inputText.Substring(0, matches(0).Index))
index = matches(0).Index + matches(0).Length
For Each Match As Match In matches
Try
outputText.Append(UCase(Match.Value))
outputText.Append(inputText.Substring(Match.Index + 1, Match.NextMatch.Index - Match.Index - 1))
Catch ex As Exception
outputText.Append(inputText.Substring(Match.Index + 1, inputText.Length - Match.Index - 1))
End Try
Next