Scala. Regexp can't remove symbol ^ - regex

I need to split a sentence into words, removing redundant characters.
I prepared a regexp for that:
val wordCharacters = """[^A-z'\d]""".r
Now I have a rule that I can use for the task like this:
wordCharacters.split(words)
  .filterNot(_.isEmpty)
where words is any sentence I need to parse.
The issue is that when I try to handle "car: carpet, as,,, java: javascript!!&#$%^&" I get one extra word, ^. When I change my regex and drop the ^, I get many more issues in other cases...
Any ideas how to solve it?
P.S.
If somebody wants to play with it, please try the code below:
val wordCharacters = """[^A-z'\d]""".r
val stringToInt =
  wordCharacters.split("car: carpet, as,,, java: javascript!!&#$%^&")
    .filterNot(_.isEmpty)
    .toList
println(stringToInt)
Expected result is:
List(car, carpet, as, java, javascript)

The part A-z is not exactly what you want. Probably you assume that lower a comes immediately after upper Z, but there are some other characters in between, and one of them is ^.
So, correcting the regex as
"""[^A-Za-z'\d]""".r
would fix the issue.
Have a look at the order of characters:
https://en.wikipedia.org/wiki/List_of_Unicode_characters
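To see which characters fall between Z and a, and how the two character classes behave on the sample sentence, here is a minimal sketch in Python rather than Scala (the character ordering and the character-class syntax are the same in both; the names between, sentence and words are only illustrative):

```python
import re

# The characters that sit between 'Z' and 'a' in ASCII/Unicode order
between = "".join(chr(c) for c in range(ord("Z") + 1, ord("a")))
print(between)  # [\]^_`

sentence = "car: carpet, as,,, java: javascript!!&#$%^&"

def words(pattern, text):
    # Split on the separator class and drop the empty strings
    return [w for w in re.split(pattern, text) if w]

print(words(r"[^A-z'\d]", sentence))     # the A-z range keeps '^' as a "word"
print(words(r"[^A-Za-z'\d]", sentence))  # ['car', 'carpet', 'as', 'java', 'javascript']
```

Because the range A-z also covers [, \, ], _ and the backtick, any of those would survive the split as well, not only ^.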

I'd be tempted to start with \W and expand from there.
"\\W+".r.split("car: carpet, as,,, java: javascript!!&#$%^&")
//res0: Array[String] = Array(car, carpet, as, java, javascript)

Related

error: multiple repeat for regex in robot [duplicate]

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in effect the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group (at least before Python 3.11, which started treating ++ as a possessive quantifier). In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the . character.
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
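As a rough illustration of the fixes combined with the character-class suggestion, here is a minimal sketch. The helper name term_pattern is hypothetical, and the suffix is my own regrouping of the original alternation, not the asker's code:

```python
import re

def term_pattern(term):
    # Optional suffix: "s", "'s", runs of "!" or "?", or one punctuation mark
    suffix = r"(?:'?s|!+|\?+|[,.;:()\"])?"
    # re.escape makes characters such as "+" or "." in the term literal
    return re.compile(r"\s" + re.escape(term) + suffix + r"\s", re.IGNORECASE)

term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
print(bool(term_pattern("google").search("I love google!!! ")))  # True
print(bool(term_pattern("dog").search("I love dogs ")))          # True
print(bool(term_pattern(term).search(" " + term + " ")))         # True
```

Escaping the term matters even on Python versions where an unescaped ++ compiles, because the pattern would then no longer match the literal text.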
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In python simply write:
if term in string:
    # do whatever
I have example_str = "i love you c++", and searching for it with a regex raised the "multiple repeat" error. The cause is that the string contains "++", which the regex engine reads as two quantifiers in a row. My fix was to wrap the search word in re.escape; here is my code.
example_str = "i love you c++"
word_filter = "c++"
# No trailing \b: there is no word boundary between "+" and the end of the string.
regex_word = re.search(rf'\b{re.escape(word_filter)}', example_str)
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

Best way to test for FOO or BAR or Foo or Bar in a regex?

I am doing checking for keywords which are headers, and the input is totally out of my control.
So I've figured out that they will have the first letter capitalized, but also might be in all caps.
I can do a Java Pattern that is:
Pattern test = Pattern.compile("\\b(FOO|BAR|Foo|Bar)\\b");
And doing a Pattern matcher with that works fine. As in:
boolean ans = test.matcher(sometext).find();
However when I have 6 or 8 keywords to check for it starts to get kind of ugly to have all the keywords there twice.
Can anyone come up with a more elegant regex that might do this?
Thanks
ADDED 3/26/15
Let me re-emphasize: it's not as simple as just ignoring case completely, which is what was initially suggested. The first letter does need to be capitalized; it's the rest of the string that can be upper or lower case.
Use the "ignore case" flag (?i):
Pattern test = Pattern.compile("(?i)\\b(FOO|BAR)\\b");
You don't need \\b\\b, since anything that appears normally is treated as a word rather than as a character class.
Also use the i (ignore case) modifier.
Your regex should be:
(foo|bar)
Add the i modifier in whatever way your language provides.
Also, you say you only want "to test". Using a regex for that is overkill.
Do this instead:
String str = "Welcome to Foo bar ";
str = str.toLowerCase();
return str.contains("foo") || str.contains("bar"); // true or false
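For the requirement added to the question (first letter capitalized, the rest in either case), one possible approach is a scoped case-insensitive group. This is a sketch in Python, not the asker's Java; the helper header_pattern is hypothetical. Java's Pattern accepts the same (?i:...) syntax, so the generated pattern should carry over.

```python
import re

def header_pattern(keywords):
    # First letter must be upper case; the remainder is matched case-insensitively
    alts = [
        re.escape(k[0].upper()) + "(?i:" + re.escape(k[1:]) + ")"
        for k in keywords
    ]
    return re.compile(r"\b(?:" + "|".join(alts) + r")\b")

pattern = header_pattern(["foo", "bar"])
for text in ["FOO", "Foo", "Bar baz", "foo", "fOO"]:
    print(text, bool(pattern.search(text)))
```

Each keyword is listed only once, which avoids the duplication the question complains about. Note that this also accepts mixed forms such as FoO, which may or may not be acceptable.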

How to negate regex validation string?

I want to remove everything in a string except the #[anyword] parts.
I have a string like this:
yng nnti dkasih tau :)"#mazayalinda: Yg klo ada cenel busana muslim aku mau ikutan dong "#noviwahyu10: Model ! Pasti gk blh klo k
and #mazayalinda and #noviwahyu10 match my regex #\w*.
However, I need to get rid of the whole string except for those 2 words. That calls for a negation, but I am confused about how to combine the 2 regexes: the one that matches #[anyword] and the one that removes the rest of the sentence.
Any ideas?
It's not completely clear from the context if this is a viable solution, but when you want to replace everything except a certain pattern it sounds more like you want a regex search rather than a replacement. For example, in python it might look like:
>>> import re
>>> s = 'yng nnti dkasih tau :)"#mazayalinda: Yg klo ada cenel busana muslim aku mau ikutan dong "#noviwahyu10: Model ! Pasti gk blh klo k'
>>> re.findall(r'#\w+', s)
['#mazayalinda', '#noviwahyu10']
Edit: in js, something like this would be more appropriate:
var s = 'yng nnti dkasih tau :)"#mazayalinda: Yg klo ada cenel busana muslim aku mau ikutan dong "#noviwahyu10: Model ! Pasti gk blh klo k';
// code from http://www.activestate.com/blog/2008/04/javascript-refindall-workalike
var rx = new RegExp("#\\w+", "g");
var matches = [];
var match;
while ((match = rx.exec(s)) !== null) {
  matches.push(match[0]); // match[0] is the matched text
}
After this, matches contains all the matched strings. You can always join it back together if needed into a single string.
It seems to me that you want to use capturing groups, not exactly negate the rest of the string, a regex like:
[^#]*(#\w+)[^#]*
Will capture those entries in capturing groups, and then, depending on your language, you can access each of the captured strings: http://rubular.com/r/qHKb35OK3g
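As a rough illustration of the capturing-group idea, here is a sketch in Python using re.sub with a backreference. The shortened sample string and the name kept are my own, chosen only for the example:

```python
import re

s = 'tau :)"#mazayalinda: Yg klo ada "#noviwahyu10: Model ! Pasti'

# Replace each "junk, tag, junk" chunk with just the captured tag and a space
kept = re.sub(r"[^#]*(#\w+)[^#]*", r"\1 ", s).strip()
print(kept)  # #mazayalinda #noviwahyu10
```

If the input contains no # at all, the pattern never matches and the string is returned unchanged, so that case would need separate handling.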
Use the regex (?<=^|#\w+\b)[^#]+ and combine the matches.
Well, I'm not entirely sure I understand the question, but if you want to keep just the names and use the rest just "to get rid of it", why don't you simply save the names and ignore the rest ?
If you're keen on matching the "non-name pattern" - this seems to be an excerpt from some kind of conversation, where every message starts with ':'. If so, then using this should simply do the trick.
:[^#]*
You can use a zero-width negative lookahead assertion, e.g. #(?!mazayalinda|noviwahyu10)\w.
This requires a more sophisticated regular expression engine, such as those in Perl, Ruby, or Java.
If you have a classic engine, the approach @pcalcao describes is best.

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
It's odd, because if my pattern-to-match-but-not-into-output ("failures = ") came after the thing I want to extract ([0-9]+), there would be a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead. What you are looking for is a regex positive lookbehind, which is written (?<=expression)pattern, but this feature is not supported by all regex flavors. It depends on which language you are using.
More information is available at regular-expressions.info, which has a comparison of lookaround support across flavors.
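For instance, in a flavor that supports fixed-width lookbehind, such as Python's re, the extraction could look like the sketch below. Python is my choice here since the question does not name a language; the capturing-group variant is shown as a fallback for engines without lookbehind.

```python
import re

output = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3"""

# Lookbehind: the prefix must be present but is not part of the match
m = re.search(r"(?<=failures = )[0-9]+", output)
print(m.group())  # 3

# Capturing group: works in engines without lookbehind as well
m2 = re.search(r"failures = ([0-9]+)", output)
print(m2.group(1))  # 3
```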
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

Regex: How to match a string that is not only numbers

Is it possible to write a regular expression that matches all strings that do not contain only numbers? If we have these strings:
abc
a4c
4bc
ab4
123
It should match the four first, but not the last one. I have tried fiddling around in RegexBuddy with lookaheads and stuff, but I can't seem to figure it out.
(?!^\d+$)^.+$
This says lookahead for lines that do not contain all digits and match the entire line.
Unless I am missing something, I think the most concise regex is...
/\D/
...or in other words, is there a not-digit in the string?
jjnguy had it correct (if slightly redundant) in an earlier revision.
.*?[^0-9].*
@Chad, your regex,
\b.*[a-zA-Z]+.*\b
should probably allow for non letters (eg, punctuation) even though Svish's examples didn't include one. Svish's primary requirement was: not all be digits.
\b.*[^0-9]+.*\b
Then, you don't need the + in there since all you need is to guarantee 1 non-digit is in there (more might be in there as covered by the .* on the ends).
\b.*[^0-9].*\b
Next, you can do away with the \b on either end since these are unnecessary constraints (invoking reference to alphanum and _).
.*[^0-9].*
Finally, note that this last regex shows that the problem can be solved with just the basics, those basics which have existed for decades (eg, no need for the look-ahead feature). In English, the question was logically equivalent to simply asking that 1 counter-example character be found within a string.
We can test this regex in a browser by copying the following into the location bar, replacing the string "6576576i7567" with whatever you want to test.
javascript:alert(new String("6576576i7567").match(".*[^0-9].*"));
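The same "find one counter-example character" idea can be sketched in Python. The function name not_only_digits is hypothetical, and [^0-9] is used instead of \D so that only ASCII digits count as digits:

```python
import re

def not_only_digits(s):
    # True when at least one character is not an ASCII digit
    return re.search(r"[^0-9]", s) is not None

for s in ["abc", "a4c", "4bc", "ab4", "123"]:
    print(s, not_only_digits(s))
```

Using search means the surrounding .* parts are unnecessary: the engine only has to locate a single non-digit anywhere in the string. An empty string contains no non-digit, so it is reported as False.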
/^\d*[a-z][a-z\d]*$/
Or, case insensitive version:
/^\d*[a-z][a-z\d]*$/i
Digits may come first, then at least one letter, then any letters or digits.
Try this:
/^.*\D+.*$/
It matches if there is any symbol that is not a digit. It should work in most languages.
Since you said "match", not just validate, the following regex will match correctly
\b.*[a-zA-Z]+.*\b
Passing Tests:
abc
a4c
4bc
ab4
1b1
11b
b11
Failing Tests:
123
If you are trying to match words that contain at least one letter but are made up of digits and letters (or just letters), this is what I have used:
(\d*[a-zA-Z]+\d*)+
If we want to restrict valid characters so that string can be made from a limited set of characters, try this:
(?!^\d+$)^[a-zA-Z0-9_-]{3,}$
or
(?!^\d+$)^[\w-]{3,}$
/\w+/
Matches one or more word characters, that is, any letter, digit, or underscore.
.*[^0-9]{1,}.*
Works fine for us.
We wanted to use the accepted answer, but it does not work within a YANG model.
The one I provided here is easy to understand and clear:
the start and end can be any characters, but there must be at least one non-numeric character, which is what matters most.
I am using /^[0-9]*$/gm in my JavaScript code to see if a string is only numbers. If it is, the check fails; otherwise it returns the string.
Below is a working code snippet with test cases:
function isValidURL(string) {
  var res = string.match(/^[0-9]*$/gm);
  if (res == null)
    return string;
  else
    return "fail";
}
var testCase1 = "abc";
console.log(isValidURL(testCase1)); // abc
var testCase2 = "a4c";
console.log(isValidURL(testCase2)); // a4c
var testCase3 = "4bc";
console.log(isValidURL(testCase3)); // 4bc
var testCase4 = "ab4";
console.log(isValidURL(testCase4)); // ab4
var testCase5 = "123";
console.log(isValidURL(testCase5)); // fail
I had to do something similar in MySQL, and the following, whilst oversimplified, seems to have worked for me:
where fieldname REGEXP '^[a-zA-Z0-9]+$'
and fieldname NOT REGEXP '^[0-9]+$'
This shows all fields that are alphabetic or alphanumeric, while any fields that are purely numeric are hidden.
example:
name1 - Displayed
name - Displayed
name2 - Displayed
name3 - Displayed
name4 - Displayed
n4ame - Displayed
324234234 - Not Displayed