I need to identify a string in a text and replace it with an empty string. The problem is that it does not always appear as a contiguous word: there may be a space character between any letters or groups of letters. For example:
For the word "Decent", I may encounter the following values:
D ec ent,
De ce nt,
De ce n t .
Is there a regular expression that can identify these strings, using the word "Decent" as input?
I am very new to regular expressions. Please help!!
TIA!
\bD\s*e\s*c\s*e\s*n\s*t\s*
so you match D ec ent, De ce nt, De ce n t and Decent (and decent, if you match case-insensitively)
but not blade centimeter
If you use
'D ?e ?c ?e ?n ?t ?'
it will match the word with at most one extra space after each letter
The expression "D\s*e\s*c\s*e\s*n\s*t" will do it. Each letter but the last is followed by zero or more whitespace characters (\s matches any whitespace, not just spaces). You could replace \s* with " *" (a space followed by an asterisk) if you just want literal spaces.
First, a bit of code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class WordsWithSpaces {
    public static void main(String[] args) {
        String test = "Descent D escent De s cent desce nd";
        String word = "descent";
        // allow any amount of whitespace after each letter of the word
        StringBuilder pattern = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            pattern.append(word.charAt(i)).append("\\s*");
        }
        System.err.println("pattern is: " + pattern);
        Pattern p = Pattern.compile(pattern.toString(), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(test);
        while (m.find()) {
            // group() returns the text of the current match
            System.err.println(m.group() + " matches");
        }
    }
}
Now for the explanation: \s is a character class for whitespace. In Java this includes spaces, tabs and line breaks. In this piece of code, I take every character of the word I am looking for and append "\s" followed by "*", which means zero or more occurrences.
Also, to avoid it being case sensitive, I set the CASE_INSENSITIVE flag on the pattern.
Character classes may not have the same name in your programming language of choice, but there should be one for whitespace. Check your documentation.
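Since the question asks for the matches to be replaced with an empty string, here is a minimal Python sketch of the same idea. The function names spaced_word_pattern and remove_word are made up for this example, and the \b anchors assume the word starts and ends with a letter or digit.

```python
import re

def spaced_word_pattern(word: str) -> re.Pattern:
    """Build a regex matching `word` with optional whitespace between its letters."""
    # re.escape protects characters that are regex metacharacters;
    # \s* between letters allows any amount of whitespace (or none).
    body = r"\s*".join(re.escape(ch) for ch in word)
    # \b on both sides keeps the match from starting or ending inside a longer word.
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def remove_word(text: str, word: str) -> str:
    """Replace every (possibly spaced-out) occurrence of `word` with the empty string."""
    return spaced_word_pattern(word).sub("", text)

print(remove_word("a D ec ent and De ce n t job, not blade centimeter", "Decent"))
```

Unlike the Java pattern above, this one does not swallow the whitespace after the last letter, so the surrounding spaces stay in the text.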
I have a string, myFile, that looks like: Name_2019-11-29_D_HPSeries.txt. I need to extract the letter D between the underscores...the letter could be any uppercase letter. Right now I am using a 2 step Regex code.
Dim bC As String = Regex.Match(myFile, "_[A-Z]+_").ToString
boatClass = Regex.Match(bC, "[A-Z]+").ToString
This works but I believe it could be done with one line. I tried the code below but it doesn't work.
boatClass = Regex.Replace(myFile, "_[A-Z]_", "[A-Z]").ToString
You can use positive lookarounds to avoid a 2-step process, checking that the characters before and after the letter are underscores without capturing them:
Dim myFile As String = "Name_2019-11-29_D_HPSeries.txt"
Dim bC As String = Regex.Match(myFile, "(?<=_)[A-Z](?=_)").ToString
Console.WriteLine(bC)
Output:
D
You were almost there with a single char A-Z, but you could wrap it in a capturing group and then use the Match.Groups property.
_([A-Z])_
For example
Dim myFile As String = "Name_2019-11-29_D_HPSeries.txt"
Dim bC As String = Regex.Match(myFile, "_([A-Z])_").Groups(1).Value
Console.WriteLine(bC)
Result
D
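For comparison, here is a small Python sketch of the two approaches above (lookarounds versus a capturing group). The variable names are only illustrative; the patterns are the ones from the answers.

```python
import re

my_file = "Name_2019-11-29_D_HPSeries.txt"

# Lookarounds: the underscores are checked but are not part of the match.
with_lookaround = re.search(r"(?<=_)[A-Z](?=_)", my_file)

# Capturing group: the underscores are matched, group 1 holds only the letter.
with_group = re.search(r"_([A-Z])_", my_file)

# re.search returns None when nothing matches, so guard before reading the result.
boat_class_a = with_lookaround.group(0) if with_lookaround else ""
boat_class_b = with_group.group(1) if with_group else ""
print(boat_class_a, boat_class_b)
```

Both variants extract the same letter; the capturing group version is usually easier to port to engines with limited lookbehind support.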
I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: here, an alphanumeric string is any string that contains at least one digit and one letter, plus any of the special characters # - _ [] () {} ç \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
#include <boost/regex.hpp>
#include <string>
#include <vector>
using std::string;
using std::vector;
// Returns every month name found in s (empty if there is none).
vector<string> findString(const string& s)
{
    vector<string> vec;
    boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
    boost::sregex_iterator begin {s.begin(), s.end(), rgx},
                           end {};
    for (boost::sregex_iterator i = begin; i != end; ++i)
    {
        vec.push_back(i->str());
    }
    return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function.
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson : ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.
You can use a negative lookahead, the same way you're using positive lookaheads (the pattern is split over two lines here for readability; keep the ^ anchor so the lookaheads are only tried at the start of the string):
^(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumeric strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uû]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
The i flag means it will match upper and lower case (so you can shorten your forbidden list), and \w is a shortcut for [0-9a-zA-Z_]. Careful if you copy-paste: there's a line break between (?! ) and [ ] for readability. Just add whatever special characters you want to match to the final [...].
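To see how the lookaheads combine, here is a rough Python sketch. The month list is deliberately shortened and the names MONTHS, ALNUM_NO_MONTH and is_alnum_without_month are made up for this example; substitute the full list from the question.

```python
import re

# Shortened, hypothetical month list (the real one is in the question).
MONTHS = r"janvier|f[eé]vrier|mars|jan|feb|mar"

ALNUM_NO_MONTH = re.compile(
    r"^(?=.*\d)"            # at least one digit somewhere
    r"(?=.*[a-zA-Z])"       # at least one letter somewhere
    rf"(?!.*(?:{MONTHS}))"  # no month name anywhere
    r".{3,90}$",            # total length between 3 and 90
    re.IGNORECASE,
)

def is_alnum_without_month(s: str) -> bool:
    """True if s has a digit and a letter, is 3-90 chars long, and has no month name."""
    return ALNUM_NO_MONTH.match(s) is not None

for candidate in ["2-hello-001", "12-Jan-2020", "hello", "a1"]:
    print(candidate, is_alnum_without_month(candidate))
```

All three lookaheads sit right after ^, so each one scans the whole string from the start before .{3,90} consumes anything.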
I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
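The same three pitfalls (the | inside the character class, the missing anchors, and $ accepting a trailing newline) also exist in Python, so here is a small sketch that demonstrates them. The names ACGT_ONLY and is_valid_sequence are just for illustration; re.fullmatch plays the role of \A ... \z.

```python
import re

ACGT_ONLY = re.compile(r"[ACGT]+", re.IGNORECASE)

def is_valid_sequence(user_input: str) -> bool:
    """True only if the whole string consists of one or more of A, C, G, T."""
    # fullmatch requires the match to span the entire string.
    return ACGT_ONLY.fullmatch(user_input) is not None

# Pitfall 1: an unanchored pattern succeeds as soon as one character matches.
unanchored_accepts_bad = re.search(r"[A|C|G|T]", "AAYYGGTTA") is not None

# Pitfall 2: inside [...] the | is a literal character, so "|" itself matches.
pipe_is_literal = re.search(r"[A|C|G|T]", "|") is not None

# Pitfall 3: $ also matches just before a trailing newline.
dollar_allows_newline = re.match(r"^[ACGT]+$", "ACGT\n") is not None

print(is_valid_sequence("AACGGGTTA"), is_valid_sequence("AAYYGGTTA"))
print(unanchored_accepts_bad, pipe_is_literal, dollar_allows_newline)
```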
I use this to capitalize the first letter of every word:
#(\s|^)([a-z0-9-_]+)#i
I want it also to capitalize the letter if it's after a special mark like a dash (-).
Now it shows:
This Is A Test For-stackoverflow
And I want this:
This Is A Test For-Stackoverflow
+1 for word boundaries, and here is a comparable Javascript solution. This accounts for possessives, as well:
var re = /(\b[a-z](?!\s))/g;
var s = "fort collins, croton-on-hudson, harper's ferry, coeur d'alene, o'fallon";
s = s.replace(re, function(x){return x.toUpperCase();});
console.log(s); // "Fort Collins, Croton-On-Hudson, Harper's Ferry, Coeur D'Alene, O'Fallon"
A simple solution is to use word boundaries:
#\b[a-z0-9-_]+#i
Alternatively, you can match for just a few characters:
#([\s\-_]|^)([a-z0-9-_]+)#i
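If you want to use pure regular expressions, you can use the \u case-change operator in the replacement (it is supported by some tools, such as GNU sed, but not by every regex engine).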
To transform this string:
This Is A Test For-stackoverflow
into
This Is A Test For-Stackoverflow
You must put:
(.+)-(.+) to capture the values before and after the "-"
then to replace it you must put:
$1-\u$2
In bash, with GNU sed, you can put:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
You don't actually need to match the full string; just match the first lowercase letter of each word, like this:
'~\b([a-z])~'
For JavaScript, here’s a solution that works across different languages and alphabets:
const originalString = "this is a test for-stackoverflow"
const processedString = originalString.replace(/(?:^|\s|[-"'([{])+\S/g, (c) => c.toUpperCase())
It matches any non-whitespace character \S that is preceded by the start of the string ^, whitespace \s, or any of the characters -"'([{, and replaces the match with its uppercase variant.
My solution using JavaScript:
function capitalize(str) {
    var reg = /\b([a-zÁ-ú]{3,})/g;
    return str.replace(reg, (w) => w.charAt(0).toUpperCase() + w.slice(1));
}
With ES6+ JavaScript:
const capitalize = str =>
    str.replace(/\b([a-zÁ-ú]{3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
/<expression-here>/g
[a-zÁ-ú] here I consider the lowercase letters a-z plus the accented letters in the range Á-ú (which also covers accented capitals).
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
[a-zÁ-ú]{3,} so I skip runs of letters that are shorter than three characters
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
\b([a-zÁ-ú]{3,}) lastly, I keep only the matches that start at a word boundary. The () group the expression.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
After achieving this, I apply the change only to the matched words, which start with a lowercase letter:
w.charAt(0).toUpperCase() + w.slice(1); // output -> Output
Joining the two:
str.replace(/\b(([a-zÁ-ú]){3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
result:
Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
Python solution:
>>> import re
>>> the_string = 'this is a test for stack-overflow'
>>> re.sub(r'(((?<=\s)|^|-)[a-z])', lambda x: x.group().upper(), the_string)
'This Is A Test For Stack-Overflow'
Read about "positive lookbehind" assertions to see how (?<=\s) works.
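As a variation on the snippet above, here is a sketch where the hyphen is also handled by the lookbehind, so only the letter itself is matched and replaced. The names WORD_START and capitalize_words are made up for this example, and it only handles ASCII a-z.

```python
import re

# A lowercase ASCII letter at the start of the string, or right after
# whitespace or a hyphen. The lookbehind checks the preceding character
# without consuming it, so the separator is never part of the match.
WORD_START = re.compile(r"(?:^|(?<=[\s-]))[a-z]")

def capitalize_words(text: str) -> str:
    """Uppercase the first letter of every whitespace- or hyphen-separated word."""
    return WORD_START.sub(lambda m: m.group().upper(), text)

print(capitalize_words("this is a test for-stackoverflow"))
# This Is A Test For-Stackoverflow
```

Python requires lookbehinds to have a fixed width, which is why the start-of-string case is written as a separate ^ alternative instead of being put inside the lookbehind.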
While this answer for a pure Regular Expression solution is accurate:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
it should be noted when using any Case-Change Operators:
\l Change case of only the first character to the right to lowercase. (Note: lowercase 'L')
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
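the end delimiter, which stops a conversion started by \L or \U, should be used after the converted text (\l and \u only affect a single character, so for them it is optional):
\E
so the end result should be:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2\E/'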
This will make
R.E.A.C De Boeremeakers
from
r.e.a.c de boeremeakers
with the pattern
(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])
used like this:
Dim matches As MatchCollection = Regex.Matches(inputText, "(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])")
Dim outputText As New StringBuilder
Dim index As Integer = 0 ' position in inputText that has been copied so far
For Each m As Match In matches
    ' copy the unchanged text before this match, then the upper-cased letter
    outputText.Append(inputText.Substring(index, m.Index - index))
    outputText.Append(m.Value.ToUpper())
    index = m.Index + m.Length
Next
' copy whatever follows the last match (the whole input if nothing matched)
outputText.Append(inputText.Substring(index))
What regex do I need to match these URLs?
Match:
1234
1234/
1234/article-name
Don't match:
1234absd
1234absd/article-name
1234/article.aspx
1234/any.dot.in.the.url
You can try:
^\d+(?:\/[\w-]*)?$
This matches a non-empty sequence of digits at the beginning of the string, followed by an optional suffix made of a / and a (possibly empty) sequence of word characters (letters, digits, underscore) or hyphens.
This matches (see on rubular):
1234
1234/
1234/article-name
42/section_13
But not:
1234absd
1234absd/article-name
1234/article.aspx
1234/any.dot.in.the.url
007/james/bond
No parentheses regex
You shouldn't need to do this, but if you can't use parentheses at all, you can always expand to alternation:
^\d+$|^\d+\/$|^\d+\/[\w-]*$
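A quick way to check the pattern from this answer against the examples is a small Python sketch like the following. The names ID_WITH_OPTIONAL_SLUG and is_valid_path are made up; re.fullmatch replaces the ^ and $ anchors, and re.ASCII restricts \d and \w to ASCII characters, which is an assumption about what the URLs may contain.

```python
import re

# Digits, optionally followed by "/" and a (possibly empty) run of
# word characters or hyphens.
ID_WITH_OPTIONAL_SLUG = re.compile(r"\d+(?:/[\w-]*)?", re.ASCII)

def is_valid_path(path: str) -> bool:
    """True if the whole path has the form <digits>[/<slug>]."""
    return ID_WITH_OPTIONAL_SLUG.fullmatch(path) is not None

should_match = ["1234", "1234/", "1234/article-name", "42/section_13"]
should_not_match = [
    "1234absd",
    "1234absd/article-name",
    "1234/article.aspx",
    "1234/any.dot.in.the.url",
    "007/james/bond",
]

for path in should_match + should_not_match:
    print(path, is_valid_path(path))
```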
^\d+(/[a-zA-Z-]*)?$
That may work. Hope it helps.
Hope this helps:
string data = "1234/article-name";
// digits, then optionally a single slash followed by letters and hyphens
Regex Constant = new Regex("^(?<NUMBERS>[0-9]+)(/(?<DATA>[a-zA-Z-]*))?$");
MatchCollection mc;
mc = Constant.Matches(data, 0);
if (mc.Count > 0)
{
    for (int l_nIndex = 0; l_nIndex < mc.Count; l_nIndex++)
    {
        string l_strNum = mc[l_nIndex].Groups["NUMBERS"].Value;  // "1234"
        string l_strData = mc[l_nIndex].Groups["DATA"].Value;    // "article-name"
    }
}