I want to delete all invalid letters from a string which should represent a phone number. Only a '+' prefix and numbers are allowed.
I tried in Kotlin with
"+1234abc567+".replace("[^+0-9]".toRegex(), "")
It works nearly perfect, but it does not replace the last '+'.
How can I modify the regex to only allow the first '+'?
You could do a regex replacement on the following pattern:
(?<=.)\+|[^0-9+]+
Sample script:
String input = "+1234abc567+";
String output = input.replaceAll("(?<=.)\\+|[^0-9+]+", "");
System.out.println(input); // +1234abc567+
System.out.println(output); // +1234567
Here is an explanation of the regex pattern:
(?<=.)\+ match a literal + which is NOT first (i.e. preceded by >= 1 character)
| OR
[^0-9+]+ match one or more non digit characters, excluding +
You can use
^(\+)|\D+
Replace with the backreference to the first group, $1. See the regex demo.
Details:
^(\+) - a + at the start of string captured into Group 1
| - or
\D+ - one or more non-digit chars.
NOTE: a raw string literal delimited with """ allows the use of a single backslash to form regex escapes, such as \D, \d, etc. Using this type of string literals greatly simplifies regex definitions inside code.
See the Kotlin demo:
val s = "+1234abc567+"
val regex = """^(\+)|\D+""".toRegex()
println(s.replace(regex, "$1"))
// => +1234567
Related
I'm struggling with the following combination of characters that I'm trying to parse:
I have two types of text:
1. AF-B-W23F4-USLAMC-X99-JLK
2. LS-V-A23DF-SDLL--X22-LSM
I want to get the last two combination of characters devided by - within dash.
From the 1. X99-JLK and from the 2. X22-LSM
I accomplished the 2. with the following regex '--(.*-.*)'
How can I parse the 1. sample and is there any option to parse it at one time with something like OR operator?
Thanks for any help!
The pattern --(.*-.*) that you tried matches the second example because it contains -- and it matches the first occurrence.
Then it matches until the end of the string and backtracks to find another hyphen.
As .* can match any character (also -) and there are no anchors or boundaries set, this is a very broad match.
If there have to be 2 dashes, you can match the first one, and use a capture group for the part with the second one using a negated character class [^-]
The character class can also match a newline. If you don't want to match a newline you can use [^-\r\n] or also not matching spaces [^-\s] (as there are none in the example data)
-([^-]+-[^-]+)$
Explanation
- Match -
( Capture group 1
[^-]+-[^-]+ Match the second dash between chars other than -
) Close group 1
$ End of string
See a regex demo
For example using Javascript:
const regex = /-([^-]+-[^-]+)$/;
[
"AF-B-W23F4-USLAMC-X99-JLK",
"LS-V-A23DF-SDLL--X22-LSM"
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m[1]);
}
})
You can try lookahead to match the last pair before the new line. JavaScript example:
const str = `
AF-B-W23F4-USLAMC-X99-JLK
LS-V-A23DF-SDLL--X22-LSM
`;
const re = /[^-]*-[^-]*(?=\n)/g;
console.log(str.match(re));
I would like to mask the email passed in the maskEmail function. I'm currently facing a problem wherein the asterisk * is not repeating when i'm replacing group 2 and and 4 of my pattern.
Here is my code:
fun maskEmail(email: String): String {
return email.replace(Regex("(\\w)(\\w*)\\.(\\w)(\\w*)(#.*\\..*)$"), "$1*.$3*$5")
}
Here is the input:
tom.cat#email.com
cutie.pie#email.com
captain.america#email.com
Here is the current output of that code:
t*.c*#email.com
c*.p*#email.com
c*.a*#email.com
Expected output:
t**.c**#email.com
c****.p**#email.com
c******.a******#email.com
Edit:
I know this could be done easily with for loop but I would need this to be done in regex. Thank you in advance.
For your problem, you need to match each character in the email address that not is the first character in a word and occurs before the #. You can do that with a negative lookbehind for a word break and a positive lookahead for the # symbol:
(?<!\b)\w(?=.*?#)
The matched characters can then be replaced with *.
Note we use a lazy quantifier (?) on the .* to improve efficiency.
Demo on regex101
Note also as pointed out by #CarySwoveland, you can replace (?<!\b) with \B i.e.
\B\w(?=.*?#)
Demo on regex101
As pointed out by #Thefourthbird, this can be improved further efficiency wise by replacing the .*? with a [^\r\n#]* i.e.
\B\w(?=[^\r\n#]*#)
Demo on regex101
Or, if you're only matching single strings, just [^#]*:
\B\w(?=[^#]*#)
Demo on regex101
I suggest keeping any char at the start of string and a combination of a dot + any char, and replace any other chars with * that are followed with any amount of characters other than # before a #:
((?:\.|^).)?.(?=.*#)
Replace with $1*. See the regex demo. This will handle emails that happen to contain chars other than just word (letter/digit/underscore) and . chars.
Details
((?:\.|^).)? - an optional capturing group matching a dot or start of string position and then any char other than a line break char
. - any char other than a line break char...
(?=.*#) - if followed with any 0 or more chars other than line break chars as many as possible and then #.
Kotlin code (with a raw string literal used to define the regex pattern so as not to have to double escape the backslash):
fun maskEmail(email: String): String {
return email.replace(Regex("""((?:\.|^).)?.(?=.*#)"""), "$1*")
}
See a Kotlin test online:
val emails = arrayOf<String>("captain.am-e-r-ica#email.com","my-cutie.pie+here#email.com","tom.cat#email.com","cutie.pie#email.com","captain.america#email.com")
for(email in emails) {
val masked = maskEmail(email)
println("${email}: ${masked}")
}
Output:
captain.am-e-r-ica#email.com: c******.a*********#email.com
my-cutie.pie+here#email.com: m*******.p*******#email.com
tom.cat#email.com: t**.c**#email.com
cutie.pie#email.com: c****.p**#email.com
captain.america#email.com: c******.a******#email.com
I have a pipe delimited file which has a line
H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||
I want to substitute the date (28092017) with a regex "[0-9]{8}" if the first character is "H"
I tried the following example to test my understanding where Im trying to subtitute "a" with "i".
str = "|123||a|"
str.gsub /\|(.*?)\|(.*?)\|(.*?)\|/, "\|\\1\|\|\\1\|i\|"
But this is giving o/p as
"|123||123|i|"
Any clue how this can be achieved?
You may replace the first occurrence of 8 digits inside pipes if a string starts with H using
s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"
p s.gsub(/\A(H.*?\|)[0-9]{8}(?=\|)/, '\100000000')
# or
p s.gsub(/\AH.*?\|\K[0-9]{8}(?=\|)/, '00000000')
See the Ruby demo. Here, the value is replaced with 8 zeros.
Pattern details
\A - start of string (^ is the start of a line in Ruby)
(H.*?\|) - Capturing group 1 (you do not need it when using the variation with \K): H and then any 0+ chars as few as possible
\K - match reset operator that discards the text matched so far
[0-9]{8} - eight digits
(?=\|) - the next char must be |, but it is not added to the match value since it is a positive lookahead that does not consume text.
The \1 in the first gsub is a replacement backreference to the value in Group 1.
I am having problems constructing a regex that will allow the full range of UTF-8 characters with the exception of 2 characters: '_' and '?'
So the whitelist is: ^[\u0000-\uFFFF]
and the blacklist is: ^[^_%]
I need to combine these into one expression.
I have tried the following code, but does not work the way I had hoped:
String input = "this";
Pattern p = Pattern
.compile("^[\u0000-\uFFFF]+$ | ^[^_%]");
Matcher m = p.matcher(input);
boolean result = m.matches();
System.out.println(result);
input: this
actual output: false
desired output: true
You can use character class intersections/subtractions in Java regex to restrict a "generic" character class.
The character class [a-z&&[^aeiuo]] matches a single letter that is not a vowel. In other words: it matches a single consonant.
Use
"^[\u0000-\uFFFF&&[^_%]]+$"
to match all the Unicode characters except _ and %.
More about character class intersections/subtractions available in Java regex, see The Java™ Tutorials: Character Classes.
A test at the OCPSoft Visual Regex Tester showing there is no match when a % is added to the string:
And the Java demo:
String input = "this";
Pattern p = Pattern.compile("[\u0000-\uFFFF&&[^_%]]+"); // No anchors because `matches()` is used
Matcher m = p.matcher(input);
boolean result = m.matches();
System.out.println(result); // => true
Here is a sample code to exclude some of characters from a range using Lookahead and Lookbehind Zero-Length Assertions that actually do not consume characters in the string, but only assert whether a match is possible or not.
Sample code: (exclude m and n from range a-z)
String str = "abcdmnxyz";
Pattern p=Pattern.compile("(?![mn])[a-z]");
Matcher m=p.matcher(str);
while(m.find()){
System.out.println(m.group());
}
output:
a b c d x y z
In the same way you can do it.
Regex explanation (?![mn])[a-z]
(?! look ahead to see if there is not:
[mn] any character of: 'm', 'n'
) end of look-ahead
[a-z] any character of: 'a' to 'z'
You can divide the whole range in sub-ranges and can solve the above problem with ([a-l]|[o-z]) or [a-lo-z] regex also.
Your problem is the spaces either side of the pipe.
Neither of
" ^.*"
".*$ "
will match anything, because nothing appears before start or after end.
This has a chance:
^[\u0000-\uFFFF]+$|^[^_%]
I have a Groovy String s that contains a String variable name and follows a pattern like this:
name = "{someInformation}"
s = "0/$name|2/{moreInformation}|3/$name"
I'm trying to remove \d+/{someInformation}|* (both cases of it) from the string, but this doesn't work because \d+/{someInformation}|* is not the right pattern to match.
The following does not work:
s = s.replace ("\\d+/$name|", "")
because s.contains("\\d+/$name|") returns false. But I have been fiddling with the regular expression for a while now and I don't seem to be able to get it to match.
What do I need to put into my String.replace regex so that it can find and remove the parts, such that all I have left is
s == "2/{moreInformation}"
You may use the following regex based solution:
def name = "{someInformation}"
def s = "0/$name|2/{moreInformation}|3/$name"
println(s) // => 0/{someInformation}|2/{moreInformation}|3/{someInformation}
// => 0/{someInformation}|2/{moreInformation}|3/{someInformation}
def pat = "\\|*\\d+/${java.util.regex.Pattern.quote(name)}\\|*"
println(pat) // => \|*\d+/\Q{someInformation}\E\|*
println(s.replaceAll(pat, ""))
// => 2/{moreInformation}
See the Groovy demo
Here is the generated regex demo. Note that ${java.util.regex.Pattern.quote(name)} is used to escape all special regex metacharacters inside the name variable so that they were treated as literal symbols by the regex engine.
Details
\|* - 0+ | chars
\d+ - 1+ digits
/ - a / char
\Q{someInformation}\E - a literal {someInformation} substring
\|* - 0+ | chars