Java replace all method appending the replacement string instead of replacing - regex

I am trying to replace every word that starts with a vowel with "XXXXX" in my text file. I am using a regex for this, but when I call the replaceAll method, my replacement string is inserted in front of the word instead of replacing it.
Here is my text file, code and output.
Hello 12 I am John
How are you
I am good
Thank you 89767 0
#$%^
code:
String dest = data.replaceAll("\\b(?=[AEIOUaeiou])","XXXXX");
System.out.println(dest);
data is the string that contains all my file data.
output :
Hello 12 XXXXXI XXXXXam John
How XXXXXare you
XXXXXI XXXXXam good
Thank you 89767 0
#$%^
Please help me out in solving this issue. I have gone through some answers regarding replaceAll() method but I am not able to find answer to my problem.

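Your pattern contains only zero-width assertions: \\b matches a word-boundary position, and the positive lookahead (?=[AEIOUaeiou]) asserts that the next character is a vowel. Neither consumes any text, so every match is an empty string and replaceAll inserts "XXXXX" at that position instead of replacing the word.
Make the pattern consume the word. Use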
data = data.replaceAll("\\b[AEIOUaeiou]\\w*","XXXXX");
To match only letters, replace \w with \p{Alpha}.
Java demo:
String data = "Hello 12 I am John\nHow are you\nI am good\nThank you 89767 0\n#$%^";
data = data.replaceAll("\\b[AEIOUaeiou]\\p{Alpha}*","XXXXX");
System.out.println(data);
Output:
Hello 12 XXXXX XXXXX John
How XXXXX you
XXXXX XXXXX good
Thank you 89767 0
#$%^
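The difference between the two patterns can also be reproduced outside Java. The following is a minimal Python sketch, not the answerer's code: it assumes the same sample text, and it uses [A-Za-z] as a stand-in for \p{Alpha} because Python's re module does not support that class.

```python
import re

data = "Hello 12 I am John\nHow are you\nI am good\nThank you 89767 0\n#$%^"

# Zero-width pattern: every match is empty, so the replacement is inserted.
inserted = re.sub(r"\b(?=[AEIOUaeiou])", "XXXXX", data)

# Consuming pattern: the vowel and the letters after it are part of the match.
replaced = re.sub(r"\b[AEIOUaeiou][A-Za-z]*", "XXXXX", data)

print(inserted)
print(replaced)
```

The first print reproduces the "appended" output from the question, the second the expected output.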

Related

Find number of Instances for few words in string while ignoring other few words using regex

Hi, I am using regex in Matlab.
I need to find the number of hits for a few words while ignoring a few other words.
What I have tried so far:
String = 'Sunday:Monday:Tuesday:Wednesday:Thursday:Friday:Saturday:Sun:Mon:Tue:Wed:,Thu:,Fri:,Sat:';
Output = regexp( String,'^(?!.*(,Sun:|,Sunday:)).*(Sun:|Sunday:)' )
The output of the above regexp is true, but I need it to be 2, because there are 2 hits: one for Sun: and one for Sunday:.
In the next scenario:
String = 'Sunday:Monday:Tuesday:Wednesday:Thursday:Friday:Saturday:Sun:Mon:Tue:Wed:,Thu:,Fri:,Sat:';
Output = regexp( String,'^(?!.*(,Fri:|,Friday:)).*(Fri:|Friday:)' )
The output of the above regexp is false, but I need it to be 1, because there is 1 hit, for Friday:.
I also tried:
regexp( String,'^(?!.*(,Sun:|,Sunday:)).*(Sun:|Sunday:)' ,'match')
But it returns the whole string as output.
I am confused about how to get the number of hits while ignoring the other words. Help would be appreciated. As far as I know, regexp in Matlab works the same as regex elsewhere.
You can use
(?<!,)Fri(?:day)?:
It matches
(?<!,) - a location not immediately preceded with ,
Fri - Fri
(?:day)? - an optional day string
: - a colon.
If you allow some redundancy, you may build the pattern like this:
(?<!,)(Fri:|Sunday:)
It will match Fri: or Sunday: not immediately preceded with a comma.
Unless you really need to use regexp, something like this will be easier to maintain:
Output = sum(ismember(strsplit(String,':'),{'Sunday','Sun'}))
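To illustrate how the negative lookbehind turns into a hit count, here is a small Python sketch (Python rather than Matlab, and the helper name count_hits is made up for illustration). It assumes the sample string from the question and counts matches of the (?<!,)Word(?:suffix)?: pattern.

```python
import re

text = ("Sunday:Monday:Tuesday:Wednesday:Thursday:Friday:Saturday:"
        "Sun:Mon:Tue:Wed:,Thu:,Fri:,Sat:")

def count_hits(short, rest, s):
    """Count 'short:' or 'short' + 'rest' + ':' not immediately preceded by a comma."""
    pattern = r"(?<!,)" + re.escape(short) + r"(?:" + re.escape(rest) + r")?:"
    return len(re.findall(pattern, s))

sun_hits = count_hits("Sun", "day", text)  # Sunday: and Sun:
fri_hits = count_hits("Fri", "day", text)  # Friday: only; ,Fri: is skipped
print(sun_hits, fri_hits)
```

This should print 2 1, matching the counts asked for in the question.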

Look for any character that surrounds one of any character including itself

I am trying to write a regex code to find all examples of any character that surrounds one of any character including itself in the string below:
b9fgh9f1;2w;111b2b35hw3w3ww55
So ‘b2b’ and ‘111’ would be valid, but ‘3ww5’ would not be.
Could someone please help me out here?
Thanks,
Nikhil
You can use this regex, which matches three characters where the first and third are the same (enforced with a backreference), whereas the middle one can be any character:
(.).\1
Edit:
The regex above only gives you non-overlapping matches. Since you want all matches, including overlapping ones, you can use this regex based on a positive lookahead. It does not consume the next two characters but captures them in group 2, so to get your desired output you concatenate group 1 and group 2.
(.)(?=(.\1))
Here is Java code (I've never programmed in Ruby) demonstrating the regex; you can write the same logic in your favorite programming language.
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
String s = "b9fgh9f1;2w;111b2b35hw3w3ww55";
Pattern p = Pattern.compile("(.)(?=(.\\1))");
Matcher m = p.matcher(s);
while (m.find()) {
    // Group 1 is the consumed character; group 2 is the two characters seen by the lookahead.
    System.out.println(m.group(1) + m.group(2));
}
Prints all your intended matches,
111
b2b
w3w
3w3
w3w
Also, here is Python code that may help if you know Python:
import re
s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'
# findall returns a (group 1, group 2) tuple for each match
for m in re.findall(r'(.)(?=(.\1))', s):
    print(m[0] + m[1])
Prints all your expected matches,
111
b2b
w3w
3w3
w3w

Get some piece of string regex

I have the strings below and I just want to get the value that follows AttributeReferenceID. What do I need to do?
I tried [A]ttributeReferenceID (?<referenceID>\d+) without success. The string I want can be anywhere in the log: on the first line, the second, or the last.
String to get:
AttributeReferenceID 123
AttributeReferenceID 456
AttributeReferenceID 789
String to discard:
ISCCAttributeReferenceID 091281 [09123na0]
ISCCAttributeReferenceID 123012 [i1208221]
ISCCAttributeReferenceID 091221 [0oas9019]
If your regex flavor accepts positive lookbehinds, you can use the following:
(?<=^AttributeReferenceID\s{4})(\d+)
The lookbehind checks that the digits are preceded by AttributeReferenceID at the start of a line, followed by exactly 4 whitespace characters. If the amount of whitespace may vary, this will not work in flavors that require a fixed-width lookbehind. In that case, match the prefix and capture the digits instead. The following should normally work:
(?:^AttributeReferenceID\s+)(\d+)
And take the first group.
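As a rough illustration of the capture-group variant, here is a Python sketch. The log text is assembled from the sample lines in the question, and it assumes multiline mode so that ^ matches at the start of every line.

```python
import re

log = """ISCCAttributeReferenceID 091281 [09123na0]
AttributeReferenceID 123
ISCCAttributeReferenceID 123012 [i1208221]
AttributeReferenceID 456
AttributeReferenceID 789
"""

# ^ anchors at each line start, so the ISCC-prefixed lines are skipped.
# With a single capturing group, findall returns just the captured digits.
ids = re.findall(r"^AttributeReferenceID\s+(\d+)", log, flags=re.MULTILINE)
print(ids)
```

Without the MULTILINE flag, ^ would only match at the very beginning of the log, and only a value on the first line could be found.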

Regex hard to find

I'd like to build a regex which will be run on the following elements:
test.stuff;visibility:=reexport,
test.stuff;bundle-version="0.0.0";visibility:=reexport,
test.stuff;bundle-version="0.0.0"
test.stuff,
test.stuff
The aim of my regex is to either replace bundle-version="0.0.0" by bundle-version="1.2.3" or to add bundle-version="1.2.3".
After replacement it should produce the following elements:
test.stuff;bundle-version="1.2.3";visibility:=reexport,
test.stuff;bundle-version="1.2.3";visibility:=reexport,
test.stuff;bundle-version="1.2.3"
test.stuff;bundle-version="1.2.3",
test.stuff;bundle-version="1.2.3"
Currently I have the following regex:
(test.*?)([;]+bundle.*?)?([;,]+.*)
With this replacement pattern:
$1;bundle-version="1.2.3"$3
But it doesn't work for these two:
test.stuff;bundle-version="0.0.0" --> becomes test.stuff;bundle-version="1.2.3";bundle-version="0.0.0"
test.stuff --> not matched
Any help would be greatly appreciated, thanks!
EDIT: the regex should only match lines starting with "test.stuff"
This worked for me in C#/LinqPad:
string s = @"test.stuff;visibility:=reexport,
test.stuff;bundle-version=""0.0.0"";visibility:=reexport,
test.stuff;bundle-version=""0.0.0""
test.stuff,
test.stuff";
string pat = "(test[^;,\n]*)([;,]+bundle[^;,\n]*)?([;,]*.*)?";
string rep = "$1;bundle-version=\"1.2.3\"$3";
string result = Regex.Replace(s, pat, rep);
Edit: added \n to first group to avoid capturing a line after last "test.stuff" occurrence.
I would do it in two steps:
1. replace bundle-version="0.0.0" with bundle-version="1.2.3"
2. replace stuff(?!;bun) with stuff;bundle-version="1.2.3"
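For comparison, a single-pass variant can be sketched in Python. This is not either answerer's pattern: it assumes every entry starts a line with exactly test.stuff, matches an optional existing bundle-version attribute, and leaves the rest of the line untouched because it is never consumed.

```python
import re

lines = (
    'test.stuff;visibility:=reexport,\n'
    'test.stuff;bundle-version="0.0.0";visibility:=reexport,\n'
    'test.stuff;bundle-version="0.0.0"\n'
    'test.stuff,\n'
    'test.stuff'
)

# Group 1: the bundle name at the start of a line.
# The optional non-capturing group swallows an existing bundle-version attribute.
# The lookahead requires ';', ',' or end of line, so a longer name is not matched.
pattern = re.compile(
    r'^(test\.stuff)(?:;bundle-version="[^"]*")?(?=[;,]|$)',
    re.MULTILINE,
)
result = pattern.sub(r'\g<1>;bundle-version="1.2.3"', lines)
print(result)
```

Because the replacement always writes the new attribute right after the name, the same substitution covers both the "replace" and the "add" case.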

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in pseudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which had no single-line mode before ES2018 added the s flag, hence [\s\S] instead of the dot):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, @"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
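The same idea in Python, as a minimal sketch that assumes the sample input from the question. Python writes backreferences in the replacement as \1 and \2 rather than $1 and $2, and re.DOTALL plays the role of the single-line option.

```python
import re

text = "ABC 123 DEF"

# Match the whole input, keep only the two captured parts.
output = re.sub(r"^.*(ABC).*(DEF).*$", r"\1\2", text, flags=re.DOTALL)
print(output)
```

For the sample input this prints ABCDEF.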
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
// ...
}