How do I escape Regex to search on a period? - regex

I have a simple assignment that's been messing with me and I need another few sets of eyes. I'm sure I'm missing something simple. We have a directory of files whose names include all kinds of special characters, and I need to strip those out, leaving only alpha, numeric, dot (period) and underscore characters. I'm using regex within a PowerShell v2.0 script.
For example:
!foo12.log becomes foo12.log
foo1(bar)2.log becomes foo1bar2.log
[foo]bar_.log becomes foobar_.log
My strategy is to use an exclude list and replace everything else with "". Consider:
$bkpPath = "\\Server\foo"
gci $bkpPath | % {$_.name -replace "[^a-zA-Z_0-9]",""}
When I ran this, I ended up with foo12log, foo1bar2log and foobar_log, so I changed the regex to include .: [^a-zA-Z_\.0-9]. That doesn't remove any special characters. I've also tried [^a-zA-Z_\[\]\(\)\.0-9] with the same results as when I escape a period.
I suspect that there's an issue with my escape of the period \. and regex is reading it as a wildcard. If that's what's going on, how do I fix it? If that's not what's going on, what am I missing?

Because "." means "any character", using it as a wildcard inside square brackets would be pointless. So inside a character class the full stop loses its special meaning, and you don't have to put the "\" escape character before it.
Also, it's worth noting that:
\w means "any word character" (letter, number, underscore)
\W means "any non-word character" (Although this isn't a time-saver in this case, since you want to match full stops too.)
So in this case, your relevant bit of regex could just be:
[^\w.]

You don't need to escape a period inside a character class:
[^a-zA-Z_.0-9]
should work fine. If it doesn't, there may be something special about the PowerShell regex flavour.
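As a quick cross-check outside PowerShell, here is a minimal Python sketch of the same exclude-list idea. The helper name `clean_name` is hypothetical, and Python's `re` module stands in for the .NET engine that PowerShell uses; both treat an unescaped dot inside a character class as a literal period.

```python
import re

# Keep only ASCII letters, digits, underscore and period.
# Inside [...] the dot is literal, so no backslash is needed.
ALLOWED = re.compile(r"[^a-zA-Z_.0-9]")

def clean_name(name):
    """Strip every character outside the allowed set (hypothetical helper)."""
    return ALLOWED.sub("", name)

for original in ["!foo12.log", "foo1(bar)2.log", "[foo]bar_.log"]:
    print(original, "->", clean_name(original))
```

If the escaped and unescaped forms behave the same way in a test like this, the problem is more likely in how the result is used than in the pattern itself.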

Related

Match specific numbers and before after have or regex escaped

With a backslash (/\escape/) I can escape special regex characters, right? So why isn't this working?
I'm trying to match a specific number that starts with |something, has digits only ([0-9]) in the middle, and ends with | again.
There is also other text to the left and to the right, like so: left|something[0-9]|right
This is what I've tried, but it is not working:
/\|/234123[0-9]/\|/
\ only escapes the next character, so the second forward slash is ending the regular expression. Instead, you want this:
/\|something[0-9]\|/
You have to make sure that something is escaped correctly.
Note that if you need to match any number not just a digit, you need [0-9]+.
What would probably help you the most would be the right tool for the job:
https://regex101.com/r/ZdjhCE/2
You'll still have to set your language, as regex syntax is similar between languages but unfortunately not 100% identical.
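To make the escaping rule concrete, here is a small Python sketch. Python has no / delimiters, so only the pipes need a backslash; the sample string and the number 234123 are taken from the question purely for illustration.

```python
import re

# A backslash escapes only the character right after it, so each pipe
# gets its own backslash. [0-9]+ requires one or more trailing digits.
pattern = re.compile(r"\|234123[0-9]+\|")

sample = "left|2341237|right"  # made-up input for illustration
match = pattern.search(sample)
print(match.group(0) if match else None)
```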

Regex expression - single quote without comma

I need a regex to find single quotes that have a comma neither right before nor right after them. Also, the single quote should not be the first or the last character in the string, and it should have an alphanumeric character on each side.
Example: "Jane's book" should match, while "'apples','oranges'" should not.
Can anyone help?
You can use this regex with lookarounds:
(?<=[a-zA-Z0-9])'(?=[a-zA-Z0-9])
Something like:
(?<=[A-Za-z0-9])\'(?=[A-Za-z0-9])
should give you matches in the languages that support positive lookaheads and positive lookbehinds (older JavaScript engines only support lookaheads; lookbehind was added in ES2018). I didn't test the above, and I'm not sure you even need to escape the single quote...
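The lookaround pattern can be tried out with a short Python sketch. The function name `quote_positions` is hypothetical; the pattern is the one from the answers above, with the quote left unescaped.

```python
import re

# Match a single quote only when an alphanumeric character sits
# directly before it (lookbehind) and directly after it (lookahead).
QUOTE = re.compile(r"(?<=[A-Za-z0-9])'(?=[A-Za-z0-9])")

def quote_positions(text):
    """Return the index of every qualifying single quote (hypothetical helper)."""
    return [m.start() for m in QUOTE.finditer(text)]

print(quote_positions("Jane's book"))
print(quote_positions("'apples','oranges'"))
```

Because lookarounds consume no characters, the match itself is just the quote, which makes it easy to replace or count.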
You need your language-appropriate variation of:
.+'[^,]+.*
' finds you a single quote. You generally do not need to escape a single quotation mark.
[^,] allows any character but a comma and + indicates that you require at least one such character
.* says you can have as many of any character as you like, so putting it after what you care about says your expression can be followed by anything. The leading .+ means you must have at least one character (of any kind) before the quote, so the quote cannot be the first character in the string.
Note that I'm making some assumptions, like that you'll only have one ' in the string that you want to find. Also I'm assuming you don't care about , except for right after '. If that's not true, you need to be more specific about your requirements.

Regex expression to match all char inside

I'm trying to mass update a web app, I need to create a regex that matches:
lang::id(ALLCHARACTERS]
Can someone assist me with this? I'm not good with regex. I'm pretty sure it can start like:
lang\:\:\(WHAT GOES HERE\]
Something like this would work:
lang::id\([^]]*]
This will match a literal lang::id(, followed by zero or more of any character other than ], followed by a literal ].
Note that the only character that really needs to be escaped is the open parenthesis.
lang::id\(.*]
The . means any single character, and the * repeats it zero or more times. Make sure to escape the ( with \, since it is a special character in regex; otherwise the regex engine will probably complain about unbalanced parentheses. Note that .* is greedy, so if a line contains more than one ], the match runs up to the last one.
If you wanted it to not include all characters, you can add a smaller regex in place of the .*. This way you can break the regex down into smaller chunks which help make it easier to understand and develop for some complex rules.
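The difference between the two suggested patterns shows up when a line contains more than one occurrence. A minimal Python sketch, using a made-up input string:

```python
import re

text = "x = lang::id(foo] + lang::id(bar];"  # made-up input for illustration

# Negated class: stop at the first ] after the opening parenthesis.
negated = re.findall(r"lang::id\([^\]]*\]", text)

# Greedy .*: run to the last ] on the line.
greedy = re.findall(r"lang::id\(.*\]", text)

print(negated)
print(greedy)
```

For a mass update, the negated-class form is usually the safer choice because each occurrence is matched separately.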

regex for links - help to understand it

how do you read this regex?
#(http|https|ftp)://([A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+):?(d+)?/?#i
this is a regex for links, but i'm having trouble to understand it
Thanks
Depending on what language you're in, regexes need a delimiter. Seems the # (pound sign or hash) is used here. So,
#...actual regex goes here...#
In JavaScript regex literals, the delimiters are forward slashes (/..../).
Some regex engines allow you to pass flags that influence matching process. These appear after the closing delimiter:
#...actual regex goes here...#..flags go here..
In your example, there is one flag, the i, and I am guessing that means "case insensitive" (i for insensitive). Depending on the regex engine, you can have flags that influence the syntax you can use for the actual regex (for example, the dot can match either any character or any character except newlines, depending upon whether a flag was passed), flags that influence how the matching is done (for example, in JavaScript a g indicates the global flag, which means all matches in the string are found and state is preserved between calls), and flags that determine whether whitespace is allowed as indentation inside the regex. Some have an m flag indicating whether the regex will be applied on a line-by-line basis or on the entire text. There is AFAIK no standard set of flags, so check your regex engine documentation.
If you have multiple flags, you just concatenate them together to a string of flags and put them after the closing delimiter.
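As an illustration of how flags change matching, here is a short Python sketch. Python passes flags as a separate argument rather than after a delimiter, so this is an analogy to the #...#i notation, not the same syntax.

```python
import re

# By default "." does not match a newline; re.DOTALL changes that.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# re.MULTILINE makes ^ match at the start of every line.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# Several flags are combined with | (the equivalent of concatenating letters).
combined = re.findall(r"^one.two$", "ONE\nTWO", re.IGNORECASE | re.DOTALL)

print(dot_default, bool(dot_all), line_starts, combined)
```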
Now for the actual regex. First, you start with a parenthesized expression:
(...group...)
This is also called a group. In many regex engines, these groups have special meaning, because when a match is found you can access the bits of text that matched the expression inside the group using a special variable (or sometimes, the match is returned as an array, where each element represents a group). If you can access the bits inside groups, it is called a "capturing group".
In this particular case the group uses "alternation" or "choice" and this is indicated by the | (pipe). The pipe is part of the regex syntax and means "or". So,
(http|https|ftp)
means: match "http", or if that doesn't match, "https", or if that doesn't match, "ftp". This also brings up another reason for using parentheses: of all special regex syntax operators, the pipe has the lowest precedence, so if the parentheses had not been there, it would have meant: match "http" or "https" or "ftp://...etc"
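The precedence point can be checked with a tiny Python sketch; the URL is a made-up example.

```python
import re

url = "https://example.com"  # made-up input for illustration

# With the group, "://" must follow whichever scheme matched.
grouped = re.match(r"(http|https|ftp)://", url)

# Without the group, the alternatives are "http", "https" and "ftp://",
# so the first alternative wins and matching stops after "http".
ungrouped = re.match(r"http|https|ftp://", url)

print(grouped.group(0))
print(ungrouped.group(0))
```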
So far, we've seen these "special characters": | (pipe) and ( and ). After that we get
://
These are not special characters, and any non-special characters simply match themselves.
We then get another group, which makes up almost the rest of the regex:
([A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+)
Inside it, we see a bracketed expression:
[A-Z0-9]
The brackets [ and ] are special, and indicate a "character class". There are other ways to denote character classes, but in all cases a character class matches a single character. Which character depends on the nature of the class. In this case, the class is defined using two ranges:
A-Z
means characters A thru Z (and anything in between) and
0-9
means characters 0 thru 9 (and anything in between).
Basically, [A-Z0-9] matches any alpha-numeric character.
Note that the dash between the boundaries of the range is only a special character inside these bracketed expressions. Paradoxically, a dash inside the brackets can also simply mean a dash if it cannot be interpreted as a range.
This is followed by yet another character class:
[A-Z0-9_-]
Almost the same as the previous one, it just adds the underscore and the dash. This last dash cannot be interpreted as a range separator, so it simply means a dash. This character class will match any alpha-numeric character as well as underscore and dash.
This class is followed by a * (asterisk) and this is a special character indicating a cardinality. Cardinalities specify how often the immediately preceding element may occur. These are the common cardinalities:
* (asterisk) means zero or more times.
? (question mark) means zero or once.
+ (plus) means one or more times.
Now the entire bit starts to make sense:
[A-Z0-9][A-Z0-9_-]*
means: a sequence starting with one alphanumeric character, optionally followed by a string of "word" characters (that is, alphanumeric, dash and underscore).
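A small Python sketch of this sub-pattern on its own, with re.IGNORECASE standing in for the i flag. The name `LABEL` and the sample strings are made up for illustration.

```python
import re

# One alphanumeric character, then any number of alphanumerics, "_" or "-".
LABEL = re.compile(r"[A-Z0-9][A-Z0-9_-]*", re.IGNORECASE)

for candidate in ["mail", "my-host_1", "-mail", ""]:
    # fullmatch requires the whole string to fit the pattern.
    print(repr(candidate), bool(LABEL.fullmatch(candidate)))
```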
The following bit of the regex is this:
(?:.[A-Z0-9][A-Z0-9_-]*)+
I think this is trying to match the domain parts. So that if you have say:
https://mail.google.com
The .google and .com bits would be matched by this part. The initial (?: tells the regex engine not to capture this group, that is, not to create a "backreference" for it. This is not really my strong suit, maybe someone else can explain. But the rest of that group is quite clear and resembles what we saw before. I think there is a mistake though: the dot (.) that appears immediately before the bracketed character class usually means "match any character" or "match any non-newline character", not "match a literal dot". Typically if you want a literal dot, you need to escape it. This would be the syntax in JavaScript and I think Perl:
(\.[A-Z0-9][A-Z0-9_-]*)+
(note the backslash immediately before the dot to indicate a literal dot)
The final bits of the regex seem an attempt to match a port number:
:?(d+)?
However, the d+ bit is probably wrong: right now it matches "one or more d's". It should probably be:
:?(\d+)?
meaning: optionally match a colon (:), optionally followed by a bunch of digits. The \d is also a character class, but a predefined one. I think most regex engines use \d to denote a digit, but you should check the documentation of your engine to see the exact convention. So in say:
http://domain.server.extension:8080/
this part of the regex would match :8080 (provided you fix the d+ thing).
Finally, we see
/?
Meaning the entire thing can be followed optionally by a forward slash.
So, all in all, I don't think this matches a "link"; rather, it matches the initial part of a URL. To match an entire URL, you would need a bit more: at least I don't see any expression that could match the path, resource, hash and query bits that may occur in a proper URL.
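Putting the pieces together, here is a hedged Python sketch of the regex with the two suspected fixes applied (\. for the literal dot and \d+ for the port). The names `URL_PREFIX` and `parse_prefix` are hypothetical, and the corrections are the ones guessed at above, not a confirmed original.

```python
import re

URL_PREFIX = re.compile(
    r"(http|https|ftp)://"                              # scheme
    r"([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+)"  # host with at least one dot
    r":?(\d+)?/?",                                      # optional port and slash
    re.IGNORECASE,
)

def parse_prefix(url):
    """Return (scheme, host, port) or None (hypothetical helper)."""
    m = URL_PREFIX.match(url)
    return m.groups() if m else None

print(parse_prefix("http://domain.server.extension:8080/"))
print(parse_prefix("https://mail.google.com"))
```

The non-capturing (?:...) group is why the host comes back as a single value instead of one extra group per domain part.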
When you say you have trouble understanding it, does that mean you tried something and are stuck somewhere?
Please ask more specific questions.
I can give you some keywords that make it easier to look things up; a good place for that is regular-expressions.info
(http|https|ftp) is an alternation
[A-Z0-9] is a character class
*, + and ? are quantifiers
(...) is a (capturing) group, (?:...) is a non capturing group
The # at the start and end are regex delimiters, the i at the very end is a modifier/option (match case independent).
The (d+)? at the end would optionally match one or more letters "d". This is quite strange. I assume it should be (\d+)?, which would optionally match one or more digits.

What regex will capitalize any letters following whitespace?

I'm looking for a Perl regex that will capitalize any character which is preceded by whitespace (or the first char in the string).
I'm pretty sure there is a simple way to do this, but I don't have my Perl book handy and I don't do this often enough that I've memorized it...
s/(\s\w)/\U$1\E/g;
I originally suggested:
s/\s\w/\U$&\E/g;
but alarm bells were going off at the use of '$&' (even before I read @Manni's comment). It turns out that they're fully justified: using the $&, $` and $' variables causes an overall inefficiency in regexes.
The \E is not critical for this regex; it turns off the 'case-setting' switch \U in this case or \L for lower-case.
As noted in the comments, matching the first character of the string requires:
s/((?:^|\s)\w)/\U$1\E/g;
Corrected position of second close parenthesis - thanks, Blixtor.
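For readers without Perl at hand, the same idea can be sketched in Python. Python's replacement strings have no \U, so a function does the uppercasing instead; the name `capitalize_after_space` is hypothetical.

```python
import re

def capitalize_after_space(text):
    """Uppercase a word character at the start of the string or after whitespace."""
    # The match includes the leading whitespace, which upper() leaves unchanged.
    return re.sub(r"(?:^|\s)\w", lambda m: m.group(0).upper(), text)

print(capitalize_after_space("hello world\tfoo"))
```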
Depending on your exact problem, this could be more complicated than you think and a simple regex might not work. Have you thought about capitalization inside the word? What if the word starts with punctuation like '...Word'? Are there any exceptions? What about international characters?
It might be better to use a CPAN module like Text::Autoformat or Text::Capitalize where these problems have already been solved.
use Text::Capitalize 0.2;
print capitalize_title($t), "\n";
use Text::Autoformat;
print autoformat{case => "highlight", right=>length($t)}, $t;
It sounds like Text::Autoformat might be more "standard" and I would try that first. It's written by Damian Conway. But Text::Capitalize does a few things that Text::Autoformat doesn't. Here is a comparison.
You can also check out the Perl Cookbook for recipe 1.14 (page 31) on how to use regexps to properly capitalize a title or headline.
Something like this should do the trick -
s!(^|\s)(\w)!$1\U$2!g
This simply splits up the scanned expression into two matches - $1 for the blank/start of string and $2 for the first character of word. We then substitute both $1 and $2 after making the start of the word upper-case.
I would change the \s to \b which makes more sense since we are checking for word-boundaries here.
This isn't something I'd normally use a regex for, but my solution isn't exactly what you would call "beautiful":
$string = join("", map(ucfirst, split(/(\s+)/, $string)));
That split()s the string by whitespace and captures all the whitespace, then goes through each element of the list and does ucfirst on them (making the first character uppercase), then join()s them back together as a single string. Not awful, but perhaps you'll like a regex more. I personally just don't like \Q or \U or other semi-awkward regex constructs.
EDIT: Someone else mentioned that punctuation might be a potential issue. If, say, you want this:
...string
changed to this:
...String
i.e. you want words capitalized even if there is punctuation before them, try something more like this:
$string = join("", map(ucfirst, split(/(\w+)/, $string)));
Same thing, but it split()s on words (\w+) so that the captured elements of the list are word-only. Same overall effect, but will capitalize words that may not start with a word character. Change \w to [a-zA-Z] to eliminate trying to capitalize numbers. And just generally tweak it however you like.
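The split/ucfirst/join approach translates to Python roughly as follows. `ucfirst` here is a hypothetical helper that mimics Perl's built-in; Python's own str.capitalize() is avoided because it also lowercases the rest of the word.

```python
import re

def ucfirst(s):
    """Uppercase only the first character, like Perl's ucfirst."""
    return s[:1].upper() + s[1:]

def cap_by_whitespace(text):
    # The capturing group keeps the whitespace runs in the split result.
    return "".join(ucfirst(p) for p in re.split(r"(\s+)", text))

def cap_by_words(text):
    # Splitting on (\w+) isolates words, so leading punctuation is skipped.
    return "".join(ucfirst(p) for p in re.split(r"(\w+)", text))

print(cap_by_whitespace("hello  wORLD ...string"))
print(cap_by_words("hello ...string"))
```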
If you mean the character after a space, use a regular expression with \s. If you really mean the first character of a word, you should use \b instead of all the above attempts with \s, which are error-prone.
s/\b(\w)/\U$1/g;
You want to match letters behind whitespace, or at the start of a string.
Perl can't do variable length lookbehind. If it did, you could have used this:
s/(?<=\s|^)(\w)/\u$1/g; # this does not work!
Perl complains:
Variable length lookbehind not implemented in regex;
You can use double negative lookbehind to get around that: the thing on the left of it must not be anything that is not whitespace. That means it'll match at the start of the string, but if there is anything in front of it, it must be whitespace.
s/(?<!\S)(\w)/\u$1/g;
The simpler approach in this exact case is probably to just match the whitespace and include it in the replacement; the variable-length restriction then falls away.
s/(\s|^)(\w)/$1\u$2/g;
Occasionally you can't use this approach in repeated substitutions, because whatever precedes the actual match has already been consumed by the regex, and it's good to have a way around that.
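The practical difference between the \b version and the negative-lookbehind version can be seen in a short Python sketch (Python's `re` supports the same (?<!\S) construct). The function names are hypothetical.

```python
import re

def _upper(m):
    return m.group(0).upper()

def cap_word_boundary(text):
    # \b fires at every word boundary, including after "'" and "-".
    return re.sub(r"\b\w", _upper, text)

def cap_after_space(text):
    # (?<!\S): the previous character must not be a non-space,
    # which is true at the start of the string or after whitespace.
    return re.sub(r"(?<!\S)\w", _upper, text)

print(cap_word_boundary("don't stop"))
print(cap_after_space("don't stop"))
```

Which behavior is wanted depends on whether "first character of a word" or "character after whitespace" is the real requirement.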
Capitalize ANY character preceded by whitespace or at beginning of string:
s/((?:^|\s).)/\U$1/g
Maybe a very sloppy way of doing it because it's also uppercasing the whitespace now. :P
The advantage is that it works with letters with all possible accents (and also with special Danish/Swedish/Norwegian letters), which are problematic when you use \w and \b in your regex. Can I expect that all non-letters are untouched by the uppercase modifier?
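On the closing question, a Python sketch suggests the answer is yes in that language at least: str.upper() leaves characters without an uppercase form, such as digits and punctuation, unchanged. This says nothing definite about Perl's behavior under a given locale, and the function name `cap_any` is hypothetical.

```python
import re

def cap_any(text):
    """Uppercase any character at the start of the string or after whitespace."""
    return re.sub(
        r"(^|\s)(.)",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

print(cap_any("ærø 1st élan"))
```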