data mismatch with spark regexp_extract [duplicate] - regex

In Java RegEx, how to find out the difference between .(dot) the meta character and the normal dot as we using in any sentence. How to handle this kind of situation for other meta characters too like (*,+,\d,...)

If you want the dot or other characters with a special meaning in regexes to be a normal character, you have to escape it with a backslash. Since regexes in Java are normal Java strings, you need to escape the backslash itself, so you need two backslashes e.g. \\.

Solutions proposed by the other members don't work for me.
But I found this :
to escape a dot in java regexp write [.]

Perl-style regular expressions (which the Java regex engine is more or less based upon) treat the following characters as special characters:
.^$|*+?()[{\ have special meaning outside of character classes,
]^-\ have special meaning inside of character classes ([...]).
So you need to escape those (and only those) symbols depending on context (or, in the case of character classes, place them in positions where they can't be misinterpreted).
Needlessly escaping other characters may work, but some regex engines will treat this as syntax errors, for example \_ will cause an error in .NET.
Some others will lead to false results, for example \< is interpreted as a literal < in Perl, but in egrep it means "word boundary".
So write -?\d+\.\d+\$ to match 1.50$, -2.00$ etc. and [(){}[\]] for a character class that matches all kinds of brackets/braces/parentheses.
If you need to transform a user input string into a regex-safe form, use java.util.regex.Pattern.quote.
Further reading: Jan Goyvaert's blog RegexGuru on escaping metacharacters

Escape special characters with a backslash. \., \*, \+, \\d, and so on. If you are unsure, you may escape any non-alphabetical character whether it is special or not. See the javadoc for java.util.regex.Pattern for further information.

Here is code you can directly copy paste :
String imageName = "picture1.jpg";
String [] imageNameArray = imageName.split("\\.");
for(int i =0; i< imageNameArray.length ; i++)
{
system.out.println(imageNameArray[i]);
}
And what if mistakenly there are spaces left before or after "." in such cases? It's always best practice to consider those spaces also.
String imageName = "picture1 . jpg";
String [] imageNameArray = imageName.split("\\s*.\\s*");
for(int i =0; i< imageNameArray.length ; i++)
{
system.out.println(imageNameArray[i]);
}
Here, \\s* is there to consider the spaces and give you only required splitted strings.

I wanted to match a string that ends with ".*"
For this I had to use the following:
"^.*\\.\\*$"
Kinda silly if you think about it :D
Heres what it means. At the start of the string there can be any character zero or more times followed by a dot "." followed by a star (*) at the end of the string.
I hope this comes in handy for someone. Thanks for the backslash thing to Fabian.

If you want to end check whether your sentence ends with "." then you have to add [\.\]$ to the end of your pattern.

I am doing some basic array in JGrasp and found that with an accessor method for a char[][] array to use ('.') to place a single dot.

I was trying to split using .folder. For this use case, the solution to use \\.folder and [.]folder didn't work.
The following code worked for me
String[] pathSplited = Pattern.compile("([.])(folder)").split(completeFilePath);

Related

eregi_replace to preg_replace conversion stuff

Regular expressions are not strong point.
I can do simple stuff, but this one has just got my goat !!
So could someone give me a hand with this one.
Here's the comment in the code :
// If utf8 detection didnt work before, strip those weird characters for an underscore, as a last resort.
eregi_replace("[^a-z0-9 \-\.\(\)\/\\]","_",$str);
to (here's what I tried)
preg_replace("{[^a-z0-9 \-\.\(\)\/\\]}i","_",$str);
Any regex pros out there who give me a hand?
You need to specify regexp identifier such as # or /
preg_replace("#[^a-z0-9 \-\.\(\)\/\\]#i","_",$str);
So you should enclose your regular expression in those identifier characters.
First, I believe the { and } are fine as delimiters for the expression from the flags, but I know there are some regex flavors that don't support it, so it might be a good idea to just use something like ! or #
Second, I am not sure how the expression before worked, because AFAIK escaping with a \ character does not work with ERE expressions. You have to represent special characters like ^, -, and ] by their position within the class (^ cannot be the first character, ] must be the first character, and - must be either the first or the last character). The - character in the first expression would be interpreted as a range specifier (in this case a character in the range between \ and \). Additionally, the \ characters are treated literally, so you've got a confusing looking and largely redundant regex.
The replacement expression, however, needs to be in preg notation/flavor, so there are rule changes:
Very few things need to be escaped in a character class, even with the new rules
The \ character needs to be escaped twice - once for the string, and then one more time for the regex - otherwise, it will escape the closing bracket ]
Assuming you want to match a dash (or rather match something OTHER than a dash, it needs to be moved to the end of the class
So, here is some code (link) that I believe does what you need it to do:
$source = 'hello! ##$%^&* wazzup-dawg?.()/\\[]{}<>:"';
$blah = preg_replace('![^a-z0-9 .()/\\\\-]!i','_',$source);
print($blah);
preg_replace("{[^a-z0-9]-.()/\/}i","_",$str)
works just fine.
I tried it with all # and / and { and they all worked.

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\*st\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* inbetween every character in your search string so if you were looking for cat you would use c\s*a\s*t\s*s\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate this
(the following exemplary solution is in python, although obviously it can be ported to any language):
you can strip the whitespace beforehand AND save the positions of non-whitespace characters so you can use them later to find out the matched string boundary positions in the original string like the following:
def regex_search_ignore_space(regex, string):
no_spaces = ''
char_positions = []
for pos, char in enumerate(string):
if re.match(r'\S', char): # upper \S matches non-whitespace chars
no_spaces += char
char_positions.append(pos)
match = re.search(regex, no_spaces)
if not match:
return match
# match.start() and match.end() are indices of start and end
# of the found string in the spaceless string
# (as we have searched in it).
start = char_positions[match.start()] # in the original string
end = char_positions[match.end()] # in the original string
matched_string = string[start:end] # see
# the match WITH spaces is returned.
return matched_string
with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases as it will strip both the regest and test string of whitespace. The test will be conducted as though both have no whitespace.

Regular expression for at least 10 characters

I need a regular expression which checks that a string is at least 10 characters long. It does not matter what those character are.
Thanks
You can use:
.{10,}
Since . does not match a newline by default you'll have to use a suitable modifier( if supported by your regex engine) to make . match even the newline. Example in Perl you can use the s modifier.
Alternatively you can use [\s\S] or [\d\D] or [\w\W] in place of .
This will match a string of any 10 characters, including newlines:
[\s\S]{10,}
(In general, . does not match newlines.)
Does the language you're using not have a string length function (or a library with such a function)? Is it very difficult to implement your own? This seems overkill for regex, but you could just use something like .{10,} if you really wanted to. In langauges that have length functions, it might look something like if (str.length()>10) lenGeq10=true or if (length(str) > 10) lenGeq10=true, etc... and if whitespace is a concernt, many libraries also have triming functions to strip whitespace, example: if (length(trim(str)) > 10) lenGeq10=true...
A C# solution to see if the string matters as defined by your parameters..
string myString = "Hey it's my string!";
bool ItMatters;
ItMatters = myString.Length >= 10;
\S{10,}
disregard below:
extra characters in order to post.

Please help about regular expression

What's wrong with this expression?
^[a-zA-Z]+(([\''\-][a-zA-Z])?[a-zA-Z]*)*$
I want to allow alpha characters with space,-, and ' characters
for example O'neal;Jackson-Peter, Mary Jane
The following is all you need:
^[a-zA-Z' -]+$
The important thing is that the "-" is the last character in the group, otherwise it'd be interpreted as a range (unless you escaped it with "\")
How you actually input that expression as a string in your target language is different depending on the language. For C#, I usually use "#" strings, like so:
var regex = new Regex(#"^[a-zA-Z' -]+$");
This will match any string made up of at least one character, which can be alpha characters, hyphen or the single quote mark:
^[a-zA-Z-\']+$
This will also include empty strings:
^[a-zA-Z-\']*$
If it needs to begin and end with alpha characters (as names do):
^[a-zA-Z][a-zA-Z-\']*[a-zA-Z]$
Something like this?
^[a-zA-Z '\-,]*$

RegEx for String.Format

Hiho everyone! :)
I have an application, in which the user can insert a string into a textbox, which will be used for a String.Format output later. So the user's input must have a certain format:
I would like to replace exactly one placeholder, so the string should be of a form like this: "Text{0}Text". So it has to contain at least one '{0}', but no other statement between curly braces, for example no {1}.
For the text before and after the '{0}', I would allow any characters.
So I think, I have to respect the following restrictions: { must be written as {{, } must be written as }}, " must be written as \" and \ must be written as \.
Can somebody tell me, how I can write such a RegEx? In particular, can I do something like 'any character WITHOUT' to exclude the four characters ( {, }, " and \ ) above instead of listing every allowed character?
Many thanks!!
Nikki:)
I hate to be the guy who doesn't answer the question, but really it's poor usability to ask your user to format input to work with String.Format. Provide them with two input requests, so they enter the part before the {0} and the part after the {0}. Then you'll want to just concatenate the strings instead of use String.Format- using String.Format on user-supplied text is just a bad idea.
[^(){}\r\n]+\{0}[^(){}\r\n]+
will match any text except (, ), {, } and linebreaks, then match {0}, then the same as before. There needs to be at least one character before and after the {0}; if you don't want that, replace + with *.
You might also want to anchor the regex to beginning and end of your input string:
^[^(){}\r\n]+\{0}[^(){}\r\n]+$
(Similar to Tim's answer)
Something like:
^[^{}()]*(\{0})[^{}()]*$
Tested at http://www.regular-expressions.info/javascriptexample.html
It sounds like you're looking for the [^CHARS_GO_HERE] construct. The exact regex you'd need depends on your regex engine, but it would resemble [^({})].
Check out the "Negated Character Classes" section of the Character Class page at Regular-Expressions.info.
I think your question can be answered by the regexp:
^(((\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*(\{0\}))+(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*$
Explanation:
The expression is built up as follows:
^(allowed chars {0})+(allowed chars)*$
one or more sequences of allowed chars followed by a {0} with optional allowed chars at the end.
allowed chars is built of the 4 sequences you mentioned (I assumed the \ escape is \\ instead of \.) plus all chars that do not contain the escapes chars:
(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])
combined they make up the regexp I started with.