Scala regex match lines with special characters

I have a code segment that reads lines from a file, and I want to filter certain lines out. Basically, I want to filter out every line that does not have exactly three tab-separated columns, where the first column is a number and the other two columns can contain any character except tab and newline (DOS and Unix).
I already checked my regex on http://www.regexr.com/ and there it works.
scala> val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0#\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
scala> val myreg = "^[0-9]+(\t[^\t\r\n]+){2}(\n|\r\n)$"
scala> mystr.matches(myreg)
res2: Boolean = false
What I found out is that the problem is related to special characters. Here is a simpler example:
scala> val tabstr = """123456\t123456"""
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res3: Boolean = false
scala> val tabstr = "123456\t123456"
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res4: Boolean = true
It seems I must not use a raw string for my line (see mystr in the first code block). But if I use a regular string instead, Scala complains:
error: invalid escape character
So how can I deal with this messy input and still use my regex to filter out some lines?

You are using raw (triple-quoted) string literals. Inside a raw string literal, \ does not introduce escape sequences such as tab (\t) or newline (\n); the \n in a raw string literal is just two characters, a backslash followed by n.
In a regex, to match a literal \, you need 2 backslashes when the regex is written as a raw string literal and 4 backslashes when it is written as a regular string literal.
So, to match all your inputs, you can use the following regexps:
val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0#\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
val myreg = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
println(mystr.matches(myreg)) // => true
val tabstr = """123456\t123456"""
println(tabstr.matches("""[0-9]+\\t[0-9]+""")) // => true
val tabstr2 = "123456\t123456"
println(tabstr2.matches("""^[0-9]+(?:\\t|\t)[0-9]+$""")) // => true
The choice between capturing and non-capturing groups does not matter here, because you only check whether the string matches; capturing groups would work just as well. For the same reason you do not need ^ and $: matches requires the whole input string to match. If you later need to extract matches or groups, non-capturing groups give you a "cleaner" output structure, that is all.
The last two regexps are simple: \\t matches the two-character sequence \ followed by t, and (?:\\t|\t) matches either that sequence or a real tab character, because \t in a regex matches a tab.
The first one uses a tempered greedy token. This is the simplified form; a better one can be written with the unrolling-the-loop method: [0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n). The simplified regex breaks down as follows:
[0-9]+ - 1 or more digits
(?:\\t(?:(?!\\[trn]).)*){2} - 2 occurrences of the literal two-character sequence \t, each followed by a tempered greedy token: zero or more characters other than a newline, none of which starts a two-character sequence \t, \r or \n.
(?:\\r)? - 1 or 0 occurrences of the literal two-character sequence \r
(?:\\n) - one occurrence of a literal combination of \ and n.
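To make the difference between the two kinds of input concrete, here is a minimal Java sketch (Scala's String.matches calls the same java.util.regex engine). Java string literals used here are regular, not raw, so the regex that matches a literal backslash needs four backslashes, as noted above. The class and method names (TabEscapes, hasBackslashT, hasRealTab) are made up for illustration.

```java
import java.util.regex.Pattern;

final class TabEscapes {
    // Java source "[0-9]+\\\\t[0-9]+" is the regex [0-9]+\\t[0-9]+ :
    // digits, a literal backslash followed by the letter t, digits.
    static final Pattern BACKSLASH_T = Pattern.compile("[0-9]+\\\\t[0-9]+");

    // Java source "[0-9]+\\t[0-9]+" is the regex [0-9]+\t[0-9]+ :
    // digits, one real tab character, digits.
    static final Pattern REAL_TAB = Pattern.compile("[0-9]+\\t[0-9]+");

    static boolean hasBackslashT(String s) {
        return BACKSLASH_T.matcher(s).matches();
    }

    static boolean hasRealTab(String s) {
        return REAL_TAB.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Same characters as the Scala raw literal """123456\t123456""".
        String rawLike = "123456\\t123456";
        // Same characters as the Scala regular literal "123456\t123456".
        String escaped = "123456\t123456";

        System.out.println(rawLike.length());        // 14
        System.out.println(escaped.length());        // 13
        System.out.println(hasBackslashT(rawLike));  // true
        System.out.println(hasRealTab(rawLike));     // false
        System.out.println(hasBackslashT(escaped));  // false
        System.out.println(hasRealTab(escaped));     // true
    }
}
```

If the lines read from the file contain real tab characters rather than the two-character sequence, the tab-based pattern from the question is probably the one to keep; the backslash-based patterns are only needed when the data literally contains backslashes.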

Related

Regex FindAll not printing results Kotlin

I have a program that is using ML Kit to use Text recognition on a document and I am taking that data and only printing the prices. So I am taking the Text Recognition String and passing it through the regex below:
val reg = Regex("\$([0-9]*.[0-9]{2})")
val matches = reg.findAll(rec)
val prices = matches.map{it.groupValues[0]}.joinToString()
recogResult.text = prices
I have tested the regex on another website and it grabs all the right data. However, nothing is printed. After reg.findAll(rec) runs, matches = kotlin.sequences.GeneratorSequence@bd56ff3 and prices = "".
You can use
val reg = Regex("""\$[0-9]*\.[0-9]{2}""")
val matches = reg.findAll("Price: \$1234.56 and \$1.56")
val prices = matches.map{it.groupValues[0]}.joinToString()
Notes:
"""...""" is a triple quoted string literal where backslashes are parsed as literal \ chars and are not used to form string escape sequences
\$ - in a triple quoted string literal defines a \$ regex escape that matches a literal $ char
[0-9]*\.[0-9]{2} matches zero or more digits, . and two digits.
Note that you may use \p{Sc} to match any currency chars, not just $.
If you want to make sure no other digit follows the two fractional digits, add (?![0-9]) at the end of your regex.
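The likely reason the original pattern found nothing is that "\$" in a regular Kotlin string is just a $ character, so the regex engine sees an unescaped $, which is an end-of-input anchor rather than a literal dollar sign. The following Java sketch (Kotlin on the JVM uses the same regex engine) contrasts the two patterns; the class name PriceFinder and the helper findAll are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriceFinder {
    // Regex \$[0-9]*\.[0-9]{2}(?![0-9]): a literal dollar sign, optional digits,
    // a literal dot, exactly two digits, and no further digit afterwards.
    static final Pattern PRICE = Pattern.compile("\\$[0-9]*\\.[0-9]{2}(?![0-9])");

    // Regex $([0-9]*.[0-9]{2}): here $ is an anchor, not a character to match.
    static final Pattern UNESCAPED = Pattern.compile("$([0-9]*.[0-9]{2})");

    // Collects every match of the pattern in the text, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Price: $1234.56 and $1.56";
        System.out.println(findAll(PRICE, text));     // [$1234.56, $1.56]
        System.out.println(findAll(UNESCAPED, text)); // []
    }
}
```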

Regex to replace all non numbers but allow a '+' prefix

I want to delete all invalid characters from a string that should represent a phone number. Only a '+' prefix and digits are allowed.
I tried this in Kotlin:
"+1234abc567+".replace("[^+0-9]".toRegex(), "")
It works almost perfectly, but it does not remove the last '+'.
How can I modify the regex to allow only the first '+'?
You could do a regex replacement on the following pattern:
(?<=.)\+|[^0-9+]+
Sample script:
String input = "+1234abc567+";
String output = input.replaceAll("(?<=.)\\+|[^0-9+]+", "");
System.out.println(input); // +1234abc567+
System.out.println(output); // +1234567
Here is an explanation of the regex pattern:
(?<=.)\+ match a literal + which is NOT first (i.e. preceded by >= 1 character)
| OR
[^0-9+]+ match one or more non digit characters, excluding +
You can use
^(\+)|\D+
Replace with the backreference to the first group, $1, so that a leading + is put back and every other match is removed.
Details:
^(\+) - a + at the start of string captured into Group 1
| - or
\D+ - one or more non-digit chars.
NOTE: a raw string literal delimited with """ allows the use of a single backslash to form regex escapes, such as \D, \d, etc. Using this type of string literals greatly simplifies regex definitions inside code.
See the Kotlin demo:
val s = "+1234abc567+"
val regex = """^(\+)|\D+""".toRegex()
println(s.replace(regex, "$1"))
// => +1234567
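As a rough cross-check, the two patterns suggested above can be run side by side in Java. This is only a sketch: the class and method names (PhoneCleaner, cleanWithLookbehind, cleanWithCapture) are invented for the example, and the extra inputs are hypothetical edge cases rather than data from the question.

```java
final class PhoneCleaner {
    // Lookbehind variant: remove any '+' that has a character before it,
    // plus every run of characters that are neither digits nor '+'.
    static String cleanWithLookbehind(String input) {
        return input.replaceAll("(?<=.)\\+|[^0-9+]+", "");
    }

    // Capture variant: a leading '+' is captured and written back via $1;
    // every other run of non-digits is replaced by an empty string,
    // because group 1 did not take part in that match.
    static String cleanWithCapture(String input) {
        return input.replaceAll("^(\\+)|\\D+", "$1");
    }

    public static void main(String[] args) {
        String[] samples = {"+1234abc567+", "12+34", "++1", "abc+123"};
        for (String s : samples) {
            System.out.println(s + " -> " + cleanWithLookbehind(s)
                    + " / " + cleanWithCapture(s));
        }
    }
}
```

On these inputs both variants give the same result; in particular, a '+' that is not the very first character is dropped even if only invalid characters precede it.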

How to match a single character with multiple conditions in regex(regular expression) in linux grep? [duplicate]

I am having problems constructing a regex that will allow the full range of UTF-8 characters with the exception of 2 characters: '_' and '%'
So the whitelist is: ^[\u0000-\uFFFF]
and the blacklist is: ^[^_%]
I need to combine these into one expression.
I have tried the following code, but it does not work the way I had hoped:
String input = "this";
Pattern p = Pattern
.compile("^[\u0000-\uFFFF]+$ | ^[^_%]");
Matcher m = p.matcher(input);
boolean result = m.matches();
System.out.println(result);
input: this
actual output: false
desired output: true
You can use character class intersections/subtractions in Java regex to restrict a "generic" character class.
The character class [a-z&&[^aeiuo]] matches a single letter that is not a vowel. In other words: it matches a single consonant.
Use
"^[\u0000-\uFFFF&&[^_%]]+$"
to match all the Unicode characters except _ and %.
For more about the character class intersections/subtractions available in Java regex, see The Java™ Tutorials: Character Classes.
A test at the OCPSoft Visual Regex Tester shows that there is no match when a % is added to the string.
And the Java demo:
String input = "this";
Pattern p = Pattern.compile("[\u0000-\uFFFF&&[^_%]]+"); // No anchors because `matches()` is used
Matcher m = p.matcher(input);
boolean result = m.matches();
System.out.println(result); // => true
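The demo above only shows the accepting case. A small sketch of the rejecting cases follows, under the assumption that the whole input must consist of allowed characters; AllowedChars and isAllowed are hypothetical names. The escapes are written with doubled backslashes so that the regex engine, not the Java compiler, interprets them.

```java
import java.util.regex.Pattern;

final class AllowedChars {
    // One or more characters from the range U+0000..U+FFFF,
    // intersected with "anything except '_' and '%'".
    static final Pattern ALLOWED = Pattern.compile("[\\u0000-\\uFFFF&&[^_%]]+");

    // matches() requires the whole input to match, so no anchors are needed.
    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("this"));   // true
        System.out.println(isAllowed("th%is"));  // false
        System.out.println(isAllowed("th_is"));  // false
        System.out.println(isAllowed(""));       // false, + needs at least one character
    }
}
```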
Here is sample code that excludes some characters from a range using a negative lookahead, a zero-length assertion that does not consume characters in the string but only asserts whether a match is possible at the current position.
Sample code (exclude m and n from the range a-z):
String str = "abcdmnxyz";
Pattern p = Pattern.compile("(?![mn])[a-z]");
Matcher m = p.matcher(str);
while (m.find()) {
    System.out.println(m.group());
}
output (each letter is printed on its own line):
a b c d x y z
You can handle your case in the same way.
Regex explanation for (?![mn])[a-z]:
(?! look ahead to see if there is not:
[mn] any character of: 'm', 'n'
) end of look-ahead
[a-z] any character of: 'a' to 'z'
You can also split the whole range into sub-ranges and solve the same problem with ([a-l]|[o-z]) or [a-lo-z].
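The lookahead form, the sub-range form, and the character class intersection form should all select the same letters. The sketch below checks this on the sample string; ExcludeFromRange and matchedLetters are names chosen for the example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExcludeFromRange {
    // Concatenates every single-character match of the regex found in the input.
    static String matchedLetters(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "abcdmnxyz";
        String[] variants = {
            "(?![mn])[a-z]",  // negative lookahead in front of the range
            "[a-lo-z]",       // range split into two sub-ranges
            "[a-z&&[^mn]]"    // character class intersection
        };
        for (String regex : variants) {
            System.out.println(regex + " -> " + matchedLetters(regex, input)); // abcdxyz each time
        }
    }
}
```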
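Your problem is the spaces on either side of the pipe.
Neither of
" ^.*"
".*$ "
will match anything, because nothing can appear before the start or after the end of the input.
This has a chance:
^[\u0000-\uFFFF]+$|^[^_%]
Note that this only removes the spaces: the first alternative still matches strings that contain _ or %, so it does not exclude those characters by itself.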

Find the first to last alphabet in a string

I am new to Python and got pretty confused when reading the regex documentation. From what I understand, re.search searches everywhere in a string while re.match only searches the start of the string. But when do I have to use re.compile?
I tried playing around with regex but could not get it to work. If I have a string that is a mix of letters, punctuation, numbers and spaces, how can I obtain the part of the string made up of letters?
import re
a = "123,12 jlkjL kSljdf 12.2"
test = re.search('^[a-zA-Z]', a)
print test
The output I am trying to get is jlkjL kSljdf.
You may use re.compile to compile a regex into a pattern object before using it; this is optional and mainly useful when the same pattern is reused many times.
There are two options to achieve what you want: matching the letters and spaces and then stripping the redundant whitespace, or removing all non-letter characters from the start and end:
import re
a = "123,12 jlkjL kSljdf 12.2"
rg = re.compile(r'[a-zA-Z ]+')
mtch = rg.search(a)
if mtch:
    print(mtch.group().strip()) # => jlkjL kSljdf
# Stripping non-letters from the start/end
rx = re.compile(r'^[^a-zA-Z]+|[^a-zA-Z]+$')
print(rx.sub('', a)) # => jlkjL kSljdf
In the first approach, add a space to the character class and apply a + (1 or more occurrences) quantifier to it.
In the second approach, ^[^a-zA-Z]+ matches 1 or more (+) characters other than letters ([^a-zA-Z]) at the start of the string (^) OR (|) 1 or more chars other than letters at the end of the string ($).

Include AlphaNumeric, but Don't Match a Particular Word

How can I write a Regex for:
equals any upper-cased alphanumeric [0-9A-Z]+ one or more times, but not equal to FOO?
I've seen ^ to exclude any of the following characters, such as "exclude xyz":
scala> val blockXYZ = """[^XYZ]+""".r
blockXYZ: scala.util.matching.Regex = [^XYZ]+
scala> "XXXX".matches(blockXYZ.toString)
res26: Boolean = false
scala> "AAA".matches(blockXYZ.toString)
res27: Boolean = true
scala> "AAAX".matches(blockXYZ.toString)
res28: Boolean = false
But, I'm not sure how to not match a whole word and match on alphanumeric characters.
You need to use negative lookahead in your regex:
^(?!FOO$)[0-9A-Z]+$
(?!FOO$) is a negative lookahead: it makes the match fail if the input is exactly FOO, that is, FOO followed by the end of input. Any other input is then checked against [0-9A-Z]+.
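A quick way to see the effect of the lookahead is to try a few inputs. This Java sketch uses the same regex engine as Scala's String.matches; the class name NotFoo and the method isValid are illustrative, and the sample inputs are made up.

```java
import java.util.regex.Pattern;

final class NotFoo {
    // Upper-case alphanumerics, one or more, but not the exact word FOO.
    static final Pattern UPPER_ALNUM_NOT_FOO = Pattern.compile("^(?!FOO$)[0-9A-Z]+$");

    static boolean isValid(String s) {
        return UPPER_ALNUM_NOT_FOO.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("FOO"));     // false: exactly the excluded word
        System.out.println(isValid("FOOBAR"));  // true: FOO is not followed by end of input
        System.out.println(isValid("BAR1"));    // true
        System.out.println(isValid("foo"));     // false: lower case is outside [0-9A-Z]
        System.out.println(isValid(""));        // false: + needs at least one character
    }
}
```

The $ inside the lookahead is what limits the exclusion to the whole word; without it, any input that merely starts with FOO would be rejected as well.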
In addition to anubhava's answer, you can use another option:
\bFOO\b|([0-9A-Z]+)
Then use the capturing group to keep the content you want.