eregi_replace to preg_replace conversion stuff - regex

Regular expressions are not my strong point.
I can do simple stuff, but this one has just got my goat !!
So could someone give me a hand with this one.
Here's the comment in the code :
// If utf8 detection didnt work before, strip those weird characters for an underscore, as a last resort.
eregi_replace("[^a-z0-9 \-\.\(\)\/\\]","_",$str);
to (here's what I tried)
preg_replace("{[^a-z0-9 \-\.\(\)\/\\]}i","_",$str);
Any regex pros out there who can give me a hand?

You need to wrap the pattern in regexp delimiters such as # or /
preg_replace("#[^a-z0-9 \-\.\(\)\/\\]#i","_",$str);
So you should enclose your regular expression in a matching pair of those delimiter characters.

First, I believe { and } are fine as delimiters separating the expression from the flags, but I know there are some regex flavors that don't support them, so it might be a good idea to just use something like ! or #
Second, I am not sure how the expression before worked, because AFAIK escaping with a \ character does not work with ERE expressions. You have to represent special characters like ^, -, and ] by their position within the class (^ cannot be the first character, ] must be the first character, and - must be either the first or the last character). The - character in the first expression would be interpreted as a range specifier (in this case a character in the range between \ and \). Additionally, the \ characters are treated literally, so you've got a confusing looking and largely redundant regex.
The replacement expression, however, needs to be in preg notation/flavor, so there are rule changes:
Very few things need to be escaped in a character class, even with the new rules
The \ character needs to be escaped twice - once for the string, and then one more time for the regex - otherwise, it will escape the closing bracket ]
Assuming you want to match a dash (or rather, match something OTHER than a dash), the simplest fix is to move it to the end of the class, where it cannot be read as a range
So, here is some code (link) that I believe does what you need it to do:
$source = 'hello! ##$%^&* wazzup-dawg?.()/\\[]{}<>:"';
$blah = preg_replace('![^a-z0-9 .()/\\\\-]!i','_',$source);
print($blah);
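The same double-escaping rule applies in other languages with string-literal escapes. As a rough sketch in Java rather than PHP (the class and method names here are made up for illustration), the character class from the snippet above could be written like this:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original thread.
final class SanitizeDemo {
    // Keep letters, digits, space, . ( ) / \ and -; everything else becomes "_".
    // "\\\\" in Java source is the regex \\ (one literal backslash).
    // The dash sits last in the class so it cannot form a range.
    private static final Pattern UNSAFE =
            Pattern.compile("[^a-z0-9 .()/\\\\-]", Pattern.CASE_INSENSITIVE);

    static String sanitize(String s) {
        return UNSAFE.matcher(s).replaceAll("_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("wazzup-dawg?.()/\\[]"));
    }
}
```

If the backslash were escaped only once, the regex engine would see \] and treat the closing bracket as a literal, leaving the class unterminated.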

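preg_replace("{[^a-z0-9]-.()/\/}i","_",$str)
runs without error.
I tried it with #, / and { } as delimiters and they all worked.
Note that in this pattern the character class closes right after 0-9, so the rest (-.()/\/) is matched as a sequence following the class, not as part of the set of allowed characters.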


In Java regex, how do you distinguish between . (dot) as a metacharacter and a normal dot as used in any sentence? How do you handle this for other metacharacters too, such as *, +, \d, ...?
If you want the dot or other characters with a special meaning in regexes to be a normal character, you have to escape it with a backslash. Since regexes in Java are normal Java strings, you need to escape the backslash itself, so you need two backslashes e.g. \\.
Solutions proposed by the other members don't work for me.
But I found this :
to escape a dot in java regexp write [.]
Perl-style regular expressions (which the Java regex engine is more or less based upon) treat the following characters as special characters:
.^$|*+?()[{\ have special meaning outside of character classes,
]^-\ have special meaning inside of character classes ([...]).
So you need to escape those (and only those) symbols depending on context (or, in the case of character classes, place them in positions where they can't be misinterpreted).
Needlessly escaping other characters may work, but some regex engines will treat this as syntax errors, for example \_ will cause an error in .NET.
Some others will lead to false results, for example \< is interpreted as a literal < in Perl, but in egrep it means "word boundary".
So write -?\d+\.\d+\$ to match 1.50$, -2.00$ etc. and [(){}[\]] for a character class that matches all kinds of brackets/braces/parentheses.
If you need to transform a user input string into a regex-safe form, use java.util.regex.Pattern.quote.
Further reading: Jan Goyvaerts' blog RegexGuru on escaping metacharacters
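To make the two approaches concrete, here is a small sketch (class and method names are invented for illustration) that uses the hand-escaped pattern from above next to java.util.regex.Pattern.quote for arbitrary literal text:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answer.
final class QuoteDemo {
    // Hand-escaped: optional minus, digits, a literal dot, digits, a literal $.
    static boolean isPrice(String s) {
        return s.matches("-?\\d+\\.\\d+\\$");
    }

    // Pattern.quote wraps the input so every character is matched literally.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(isPrice("-2.00$"));
        System.out.println(containsLiteral("cost: 1.50$ each", "1.50$"));
    }
}
```

Pattern.quote is the safer choice when the text comes from a user, since you do not have to know which characters are special.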
Escape special characters with a backslash. \., \*, \+, \\d, and so on. If you are unsure, you may escape any non-alphabetical character whether it is special or not. See the javadoc for java.util.regex.Pattern for further information.
Here is code you can copy and paste directly:
String imageName = "picture1.jpg";
// Escape the dot so the split happens on a literal "." and not on "any character"
String[] imageNameArray = imageName.split("\\.");
for (int i = 0; i < imageNameArray.length; i++)
{
System.out.println(imageNameArray[i]);
}
And what if there are stray spaces before or after the "."? It's good practice to allow for those spaces as well.
String imageName = "picture1 . jpg";
// The dot still has to be escaped; \\s* on each side absorbs optional whitespace
String[] imageNameArray = imageName.split("\\s*\\.\\s*");
for (int i = 0; i < imageNameArray.length; i++)
{
System.out.println(imageNameArray[i]);
}
Here, \\s* is there to absorb the spaces, so you get only the split strings you need.
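A quick sketch of why the dot must stay escaped even with \\s* around it (the class name is made up for this example): an unescaped . matches every character, so every character becomes a separator and nothing is left over.

```java
import java.util.Arrays;

// Hypothetical demo class for illustration only.
final class SplitDemo {
    // Split on a literal dot, ignoring whitespace on either side of it.
    static String[] splitOnDot(String s) {
        return s.split("\\s*\\.\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnDot("picture1 . jpg")));
        // Unescaped dot: every character is a delimiter, so the result is empty.
        System.out.println("picture1.jpg".split(".").length);
    }
}
```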
I wanted to match a string that ends with ".*"
For this I had to use the following:
"^.*\\.\\*$"
Kinda silly if you think about it :D
Here's what it means: at the start of the string there can be any character zero or more times, followed by a dot ".", followed by a star (*) at the end of the string.
I hope this comes in handy for someone. Thanks for the backslash thing to Fabian.
If you want to check whether your sentence ends with ".", add \\.$ (or [.]$) to the end of your pattern.
I was working with basic arrays in jGRASP and found that, in an accessor method for a char[][] array, you use the char literal '.' to place a single dot.
I was trying to split using .folder. For this use case, the solution to use \\.folder and [.]folder didn't work.
The following code worked for me
String[] pathSplited = Pattern.compile("([.])(folder)").split(completeFilePath);

Trying to make more specific regex code

I understand the use of ^ for making a string specific regex, but I'm not sure why my code isn't picking it up. I've messed with changing [] and (), but with no luck :/
/.*^([\/php|.html|.css])$/
Tested with
wss://worker.com/sdfsd.css
wss://worker.com/php
html://worker.com/sdfsd/bob.html
My current non-working example; https://regex101.com/r/qkupPT/1
I don't think you understand what your current regex is doing. The ^ is for the start of a string. The [] creates a character class which allows one of the characters in it to be present (or a range if the - is used, e.g. a-z, 0-9). For multiple characters from the character class to be allowed you'd add a quantifier after the closing ], either + for 1 or more character, or * for 0 or more (meaning none of the characters are required). Something like:
[.\/](php|css|html)$
That should allow for the three examples you've listed. It allows the string to end with a . or a / followed by css, php, or html. Also note that a . outside a character class needs to be escaped; otherwise it matches any single character except a newline.
Demo: https://regex101.com/r/qkupPT/2
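For anyone who wants to check the suggested pattern outside regex101, here is a minimal sketch in Java (the thread does not say which language the regex runs in, so that choice and the names below are assumptions):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answer.
final class EndingDemo {
    // A "." or "/" followed by php, css or html at the very end of the input.
    // Inside [...] the dot is literal, so it needs no escape there.
    private static final Pattern ENDING = Pattern.compile("[./](php|css|html)$");

    static boolean hasKnownEnding(String url) {
        return ENDING.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(hasKnownEnding("wss://worker.com/sdfsd.css"));
        System.out.println(hasKnownEnding("wss://worker.com/php"));
        System.out.println(hasKnownEnding("html://worker.com/sdfsd/bob.html"));
    }
}
```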

Regular Expressions: Dealing with + (Plus) Sign

I am a little confused by regular expressions. My intent in the example below is to replace all 'NE_NS+' with 'NE_OS+_OE_NS'. With the code below, I don't see any issue with the replace results:
tempString1 = tempString1.replace (/\NE__NS_+/g,'NE__OS_+_OE__NS_');
With the code below, I do see issues. My intent here is to replace all < mn+> with < mn> [no space between < and m]:
tempString2 = tempString2.replace (/\<mn>+/g,'<mn>');
and the right code for the above replace seems to be
tempString3 = tempString3.replace (/\<mn>\+/g,'<mn>');
Why is the '+' not relevant in the tempString1 replace, while it is relevant in the tempString2 example and won't work until I change it as in tempString3?
I have a tough time understanding regex. Are there any books or articles that can help me understand it? I am a novice at regular expressions.
The problem
Let's take a closer look at your various regexes. You'll understand what's going on:
tempString1 \NE__NS_+
Clearly, one or more _ are expected at the end.
tempString2 \<mn>+
The \ is simply dropped: it is the escape character, and here it precedes <, which does not need to be escaped. Again, the + applies to the character just before it, so one or more > are expected.
tempString3 \<mn>\+
Here the + is escaped, indicating that it is not a meta-character but the plus sign that has to be matched from your temporary string.
The solution
To sum it up, if you want to match NE_NS+, the plus sign must be escaped.
So your regex will be:
NE_NS\+
If you want to match < mn+> with whitespace after the <, use \s, which matches a blank character (space, tab, carriage return, etc.); if there is no space there, simply leave the \s out. Again, you must escape + since it's a metacharacter.
So, you end up with:
<\smn\+>
There is more...
Use the powerful Debuggex to visualize your regex.
Secondly, use Regexr to quickly live test your regex against a given input text.
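The difference between the quantifier + and the literal \+ is easy to see side by side. The question uses JavaScript; the sketch below shows the same behaviour in Java, with a made-up class name and a placeholder replacement "X":

```java
// Hypothetical demo class for illustration only.
final class PlusDemo {
    // Unescaped: "+" is a quantifier on the preceding "S" (one or more S).
    static String unescaped(String s) {
        return s.replaceAll("NE_NS+", "X");
    }

    // Escaped: "\\+" in Java source is the regex \+, a literal plus sign.
    static String escaped(String s) {
        return s.replaceAll("NE_NS\\+", "X");
    }

    public static void main(String[] args) {
        System.out.println(unescaped("NE_NS+")); // the "+" in the input is left behind
        System.out.println(escaped("NE_NS+"));   // the "+" is part of the match
    }
}
```

This is why the first replace in the question can look correct: the unescaped pattern still matches the text before the plus sign, it just never consumes the plus itself.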

What does the (?i)\\. regular expression mean?

The code uses the following regular expression
img[src~=(?i)\\.(png|jpe?g)]
I'm not sure if the . is escaped or the \
the \ is escaped, which appears to be an error given what it's trying to do....
actually, you've taken that out of context. that's probably in a string. if it's in a string, then it's escaping the slash, and then that slash is escaping the dot.
the ~= means "ends with" and the (?i) switches it into case-insensitive mode.
errr... now that i think about it, that actually looks like a hybrid between a CSS selector (probably used in jquery) and a regex (being familiar with both syntaxes, I thought nothing of it!). The ~= doesn't do anything in a regex (they're literal chars) the [ and ] represent a character set though.
So...I don't know what the result of this is. I suspect someone got confused and tried mixing the two.
It means match case insensitively, any string that ends in:
\.png
\.jpeg
\.jpg
But this is dependent on context. If used in a context where \ needs to be escaped at a higher level, then it means match case-insensitively:
.png
.jpeg
.jpg
In this expression, the '\' is escaped, and the resulting '\' in turn escapes the '.'
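Assuming the expression lives inside a Java string literal (the thread never confirms the language, and the class name below is invented), the regex part can be tried on its own like this:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original code.
final class ImageExtDemo {
    // In Java source "\\." becomes the regex \. (a literal dot);
    // (?i) turns on case-insensitive matching for the rest of the pattern.
    static final Pattern IMAGE = Pattern.compile("(?i)\\.(png|jpe?g)");

    static boolean looksLikeImage(String src) {
        // find() succeeds if the pattern occurs anywhere in src.
        return IMAGE.matcher(src).find();
    }

    public static void main(String[] args) {
        System.out.println(IMAGE.pattern());
        System.out.println(looksLikeImage("photo.JPG"));
    }
}
```

Because the pattern has no $ anchor, find() also accepts a name such as a.png.txt; add $ at the end if the extension must come last.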

RegEx for String.Format

Hiho everyone! :)
I have an application, in which the user can insert a string into a textbox, which will be used for a String.Format output later. So the user's input must have a certain format:
I would like to replace exactly one placeholder, so the string should be of a form like this: "Text{0}Text". So it has to contain at least one '{0}', but no other statement between curly braces, for example no {1}.
For the text before and after the '{0}', I would allow any characters.
So I think, I have to respect the following restrictions: { must be written as {{, } must be written as }}, " must be written as \" and \ must be written as \.
Can somebody tell me, how I can write such a RegEx? In particular, can I do something like 'any character WITHOUT' to exclude the four characters ( {, }, " and \ ) above instead of listing every allowed character?
Many thanks!!
Nikki:)
I hate to be the guy who doesn't answer the question, but really it's poor usability to ask your user to format input to work with String.Format. Provide them with two input requests, so they enter the part before the {0} and the part after the {0}. Then you'll want to just concatenate the strings instead of use String.Format- using String.Format on user-supplied text is just a bad idea.
[^(){}\r\n]+\{0}[^(){}\r\n]+
will match any text except (, ), {, } and linebreaks, then match {0}, then the same as before. There needs to be at least one character before and after the {0}; if you don't want that, replace + with *.
You might also want to anchor the regex to beginning and end of your input string:
^[^(){}\r\n]+\{0}[^(){}\r\n]+$
(Similar to Tim's answer)
Something like:
^[^{}()]*(\{0})[^{}()]*$
Tested at http://www.regular-expressions.info/javascriptexample.html
It sounds like you're looking for the [^CHARS_GO_HERE] construct. The exact regex you'd need depends on your regex engine, but it would resemble [^({})].
Check out the "Negated Character Classes" section of the Character Class page at Regular-Expressions.info.
I think your question can be answered by the regexp:
^((\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*(\{0\}))+(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*$
Explanation:
The expression is built up as follows:
^(allowed chars {0})+(allowed chars)*$
one or more sequences of allowed chars followed by a {0} with optional allowed chars at the end.
allowed chars is built from the 4 sequences you mentioned (I assumed the \ escape is \\ instead of \.) plus any character that is not one of the escape characters:
(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])
combined they make up the regexp I started with.
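As a rough way to try this out, here is the same structure assembled in Java (the question targets .NET, so the language, the class name and the use of non-capturing groups are all assumptions made for this sketch):

```java
import java.util.regex.Pattern;

// Hypothetical validator for illustration; not the original poster's code.
final class FormatStringCheck {
    // One allowed unit: {{  }}  \"  \\  or any character other than { } " \
    private static final String ALLOWED =
            "(?:\\{\\{|\\}\\}|\\\\\"|\\\\\\\\|[^{}\"\\\\])";

    // ^(allowed* {0})+ allowed*$ : at least one {0}, nothing else in braces.
    private static final Pattern VALID =
            Pattern.compile("^(?:" + ALLOWED + "*\\{0\\})+" + ALLOWED + "*$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Text{0}Text"));
        System.out.println(isValid("Text{1}Text"));
    }
}
```

Building the pattern from a named ALLOWED fragment keeps the two repeated groups in sync, which is where the hand-written version is easiest to get wrong.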