How do you regex match some unicode character followed by a bracket? - regex

I am not too familiar with regex and hope someone could help.
example:
This is a sentence with some_unicode[some other word] and other stuff.
After removing the characters and brackets, the result should be:
This is a sentence with and other stuff.
Thank you!!

Search for
some_unicode\[[^\]]*\]
and replace with nothing.
Explanation:
\[: Match a literal [.
[: Match a character class with the following properties (here [ is a metacharacter, starting a character class)...
^\]: "any character except a literal ]" (^ at the start of a character class negates its contents).
]*: ...zero or more times. Note again the unescaped ], ending the character class.
\]: Match a literal ].
This of course will only work if there can be no brackets inside brackets. How to actually format and use the regex is highly dependent on the language/tool you're doing this with; so if you add another tag to your question specifying the language, I can give you a code example.
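Since the question does not name a language, here is a minimal sketch of the search-and-replace in Python, which is an assumption on my part. The string some_unicode is the literal placeholder from the question, and strip_tagged is a hypothetical helper name.

```python
import re

# "some_unicode" is the placeholder from the question; substitute the real
# characters you need to match. The bracket part is the pattern shown above.
PATTERN = re.compile(r"some_unicode\[[^\]]*\]")


def strip_tagged(text: str) -> str:
    """Remove every 'some_unicode[...]' occurrence (no nested brackets)."""
    return PATTERN.sub("", text)


sentence = "This is a sentence with some_unicode[some other word] and other stuff."
print(strip_tagged(sentence))
```

The pattern removes only the tagged text, so the spaces on both sides of it remain and the result contains two consecutive spaces. If that matters, the pattern could be extended to consume one of them.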

[ and ] are metacharacters in regular expressions and must be escaped by a backslash, e.g. \[.

Related

Replace "advanced" pattern in sed

I can't figure out how to change this:
\usepackage{scrpage2}
\usepackage{pgf} \usepackage[latin1]{inputenc}\usepackage{times}\usepackage[T1]{fontenc}
\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}
to this using sed only
REPLACED
REPLACED REPLACEDREPLACEDREPLACED
REPLACED
I'm trying stuff like sed 's!\\.*\([.*]\)\?{.\+}!REPLACED!g' FILE
but that gives me
REPLACED
REPLACED
REPLACED
I think .* gets used and everything else in my pattern is just ignored, but I can't figure out how to go about this.
After I learned how to format a regex like that, my next step would be to change it to this:
\usepackage{scrpage2}
\usepackage{pgf}
\usepackage[latin1]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}
So I would appreciate any pointers in that direction too.
Here's some code that happens to work for the example you gave:
sed 's/\\[^\\[:space:]]\+/REPLACED/g'
I.e. match a backslash followed by one or more characters that are not whitespace or another backslash.
To make things more specific, you can use
sed 's/\\[[:alnum:]]\+\(\[[^][]*\]\)\?{[^{}]*}/REPLACED/g'
I.e. match a backslash followed by one or more alphanumeric characters, followed by an optional [ ] group, followed by a { } group.
The [ ] group matches [, followed by zero or more non-bracket characters, followed by ].
The { } group matches {, followed by zero or more non-brace characters, followed by }.
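For readers who want to experiment with the more specific pattern outside sed, the following is a rough Python translation. It is only a sketch: Python's re module has no [[:alnum:]] class, so [A-Za-z0-9] is used as an approximation, and the names USEPACKAGE and lines are made up for illustration.

```python
import re

# Backslash + command name, an optional [ ] group, then a { } group.
USEPACKAGE = re.compile(
    r"\\[A-Za-z0-9]+"      # backslash followed by alphanumerics
    r"(?:\[[^\[\]]*\])?"   # optional [ ... ] with no brackets inside
    r"\{[^{}]*\}"          # { ... } with no braces inside
)

# The three input lines from the question.
lines = [
    r"\usepackage{scrpage2}",
    r"\usepackage{pgf} \usepackage[latin1]{inputenc}\usepackage{times}\usepackage[T1]{fontenc}",
    r"\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}",
]

replaced = [USEPACKAGE.sub("REPLACED", line) for line in lines]
for line in replaced:
    print(line)
```

Because each group excludes its own delimiters, a match can never run past the closing } of the command it started in, which is what keeps the four commands on the second line separate.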
Perl to the rescue! It features the "frugal quantifiers":
perl -pe 's!\\.*?\.?{.+?}!REPLACED!g' FILE
Note that I removed the capturing group as you didn't use it anywhere. Also, [.*] matches either a dot or an asterisk, but you probably wanted to match a literal dot instead.

Which characters must be escaped in a Perl regex pattern

I'm trying to find files that look like this:
access_log-20160101
access_log-20160304
...
With a Perl regex I came up with something like this:
/^access_log-\d{8}$/
But I'm not sure about the "_" and the "-". Are these metacharacters?
What is the expression for this?
I read that "_" in regex is something like \w, but how do I use them in my expression?
/^access\wlog-\d{8}$/ ?
Underscore (_) is not a metacharacter and does not need to be quoted (though it won't change anything if you quote it).
Hyphen (-) IS a metacharacter that defines the range between two symbols inside a bracketed character class. However, in this particular position, it will be interpreted verbatim and doesn't need quoting since it is not inside [] with a symbol on both sides.
You can use your regexp as is; hyphens (-) might need quoting if your format changes in future.
Your regex pattern is exactly right.
Neither underscore _ nor hyphen - needs to be escaped. Outside a square-bracketed character class, the twelve Perl regex metacharacters are
Brackets ( ) [ {
Quantifiers * + ?
Anchors ^ $
Alternator |
Wild character .
The escape itself \
and only these must be escaped
If the pattern of your file names doesn't vary from what you have shown then the pattern that you are using
^access_log-\d{8}$
is correct, unless you need to validate the date string
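If the date part does need validating, one possible approach is to capture the eight digits and hand them to a date parser. The sketch below uses Python rather than Perl, purely as an illustration; parse_log_date is a hypothetical helper and the YYYYMMDD layout is assumed from the sample file names.

```python
import re
from datetime import datetime

# Same pattern as above, with the eight digits captured. re.ASCII restricts
# \d to 0-9 (Python's \d otherwise also matches other Unicode digits).
LOG_NAME = re.compile(r"^access_log-(\d{8})$", re.ASCII)


def parse_log_date(filename):
    """Return the date encoded in the file name, or None if it is not valid."""
    match = LOG_NAME.match(filename)
    if match is None:
        return None
    try:
        # Assumes the digits are laid out as YYYYMMDD.
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits, but not a real calendar date (e.g. 30 February).
        return None


for name in ["access_log-20160101", "access_log-20160230", "access-log-20160101"]:
    print(name, parse_log_date(name))
```

The regex alone accepts any eight digits; the parsing step is what rejects impossible dates.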
Within a character class like [A-F] you must escape the hyphen if you want it to be interpreted literally. As it stands, that class is equivalent to [ABCDEF]. If you mean just the three characters A, - or F then [A\-F] will do what you want, but it is usual to put the hyphen at the start or end of the class list to make it unambiguous. [-AF] and [AF-] are the same as [A\-F] and rather more readable.
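The hyphen rule is easy to check experimentally. The sketch below uses Python's re module, which treats a hyphen inside a character class the same way for these cases; the sample string is made up for illustration.

```python
import re

sample = "A-CF"

# [A-F] is a range: it matches A, B, C, D, E and F, but not the hyphen.
as_range = re.findall(r"[A-F]", sample)

# These three classes all mean "A, hyphen or F".
escaped = re.findall(r"[A\-F]", sample)
hyphen_first = re.findall(r"[-AF]", sample)
hyphen_last = re.findall(r"[AF-]", sample)

print(as_range)
print(escaped, hyphen_first, hyphen_last)
```

The range form picks up the C and skips the hyphen, while the three literal forms pick up the hyphen and skip the C.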

gvim search match multiple characters using regex

I am trying to search in gvim for the following pattern:
arrayA[*].entryx
hoping it would match the following:
arrayA[size].entryx
arrayA[i].entryx
arrayA[index].entryx
but it prints message saying Pattern not found even though the above lines are present in the file.
arrayA[.].entryx
only matches arrayA[i].entryx
i.e. with only one character between the [] brackets.
What should I do to match multiple characters between the [] brackets?
Here is the expression in detail:
/arrayA\[[^]]*]\.entryx/
         ^^^^^            # 0 or more characters before a ']'
       ^^      ^^         # Escaped '[' & '.'
              ^           # Closing ']' -- does not need to be escaped
 ^^^^^^          ^^^^^^   # Literal parts
If you want to look for arrayA[X].entryx where there is at least one character in the [],
you need to replace \[[^]]* with \[[^]]\+
PS: Note my edit -- I've changed the \* to just * -- you don't escape that either.
But you do need to escape the + :-)
Update on your comment:
While my comment answers your question on escaping ] broadly,
for more detail look at Perl Character Class details.
Specifically, the Special Characters Inside a Bracketed Character Class section.
Rules of what needs to be escaped change after a [ character starts a Character Class (CCL).
The * repeats the previous character; and [ starts a character class. So, you need something more like:
/arrayA\[[^]]*]\.entryx/
That looks for a literal [, a series of zero or more characters other than ], a literal ], a literal . and the entryx.
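The same negated-class idea can be tried outside Vim. Here is a small sketch in Python, where the syntax differs slightly: the ] inside the class is escaped for clarity and + is written without a backslash. The list of test lines, including the empty-bracket case, is an assumption added for illustration.

```python
import re

# Zero or more characters between the brackets.
ANY_INDEX = re.compile(r"arrayA\[[^\]]*\]\.entryx")
# At least one character between the brackets.
NONEMPTY_INDEX = re.compile(r"arrayA\[[^\]]+\]\.entryx")

lines = [
    "arrayA[size].entryx",
    "arrayA[i].entryx",
    "arrayA[index].entryx",
    "arrayA[].entryx",      # empty brackets
    "arrayA[i]Xentryx",     # no literal dot after the bracket
]

any_hits = [line for line in lines if ANY_INDEX.search(line)]
nonempty_hits = [line for line in lines if NONEMPTY_INDEX.search(line)]

print(any_hits)
print(nonempty_hits)
```

The * version accepts the empty brackets, while the + version does not, and neither accepts the line where the dot is replaced by another character.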
Always remember that in Vim you need to escape some special characters, such as [ and . (under the default 'magic' setting). As said before, the * repeats the previous character, so you can simply use /arrayA\[.*\]\.entryx, but * is greedy and may match some strange things. Add the following line to your file and you'll understand: arrayA[size].entryx = arrayB[].entryx
A "safer" Regular Expression would be:
/arrayA\[.\{-\}\]\.entryx
The .\{-\} matches any character in a non-greedy way, which is safer in some cases.
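To see the difference between the greedy and non-greedy forms on the example line, here is a short sketch in Python. Note that Python spells the non-greedy quantifier *? rather than Vim's \{-\}, so this illustrates the behaviour rather than the exact Vim syntax.

```python
import re

line = "arrayA[size].entryx = arrayB[].entryx"

# Greedy: .* grabs as much as possible, so the match runs to the last "].entryx".
greedy = re.search(r"arrayA\[.*\]\.entryx", line).group(0)

# Non-greedy: .*? stops at the first "].entryx" it can reach.
lazy = re.search(r"arrayA\[.*?\]\.entryx", line).group(0)

print(greedy)
print(lazy)
```

The greedy pattern swallows the whole line, including the arrayB part, while the non-greedy one stops after arrayA[size].entryx.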

How do you match [ ] with regex?

I thought I was doing \[/b\]
but the machine disagrees.
How do you match [ ] with regex?
\[ \] should do just fine. At least in the Java regular expression engine.
System.out.println("[ ]".matches("\\[ \\]")); // prints true
Not sure where you get the /b from. Perhaps you're after a "blank" character. The most common expression for whitespace characters is \s. I.e., you could do \[\s\].
(Matching balanced [ ] is another story though. A task which regular expressions are not very well suited for.)
It's hard to answer well without knowing which flavor of regex you're using, but:
If you're writing a regular expression literal in a language (like JavaScript) that has them, then just put a backslash in front of the [ and ]. E.g.:
var re = /\[\/b\]/;
...creates a regular expression that will match a [ followed by a / followed by a b followed by a ]. (I had to escape the / because in JavaScript regular expression literals, of course the / is the delimiter.)
In languages where you use a string to specify the regular expression (Java, for instance), escaping can be confusing, because you have to escape with a backslash, but of course backslashes are special in strings and so you have to escape them. You end up with lots of them:
Pattern p = Pattern.compile("\\[/b\\]");
That creates a regex that does what the one above does, but note how we had to escape the escapes.
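The "escape the escapes" problem shows up in any language that takes the pattern as a string. As a hedged illustration in Python (not one of the languages discussed above), a raw string avoids the doubled backslashes, and re.escape can build the escaped pattern from the literal text. The sample text is made up.

```python
import re

# Ordinary string literal: every regex backslash has to be doubled.
doubled = "\\[/b\\]"
# Raw string literal: backslashes are passed through to the regex as written.
raw = r"\[/b\]"
same_pattern = doubled == raw

CLOSE_TAG = re.compile(raw)
text = "some [b]bold[/b] text"
found = CLOSE_TAG.search(text).group(0)

# re.escape produces a pattern that matches the given text literally.
escaped = re.compile(re.escape("[/b]"))
literal_ok = escaped.fullmatch("[/b]") is not None

print(same_pattern, found, literal_ok)
```

Both spellings produce the same four-character-wide match; the raw string is simply easier to read.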
'^[a-z]' // Should do fine in UNIX and possibly PERL/PHP too.
Example one with the grep command (similar to find)
grep '^[A-Z].?' file.txt
Finds lines that begin with a capital letter. Note that .? matches at most one further character; to match any number of following characters, capital or not, use .* instead.
Hope that helps.
DL.

Does anyone know how to write this Regular Expression?

I want to create a regex pattern to match a string which might include (`) but not ('). For example: "This is Joe`s book", which is different from "This is Joe's book". I know how to match a string with (') but not with (`). So does anyone know how to write this Regular Expression?
Thanks!
This should do it...
^[^']+$
The caret inside a bracket expression [^ ] is the negation operator.
This captures strings from start ^ to end $ containing the character range in the square brackets. Note the back-tick at the end of the range.
^([a-zA-Z0-9 \.,;:\?\!`]+)$
[^']*[`][^']*
Accept any number of characters (including 0) that are not a single quote until you encounter a backtick, and then accept any characters (including 0) that are not a single quote after that.
If you are only wanting to test that the string has a back-tick:
/`/
Should work...
If you want to test for strings with backticks that don't contain apostrophes:
/^(?!.*').*`/
Should work...
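The two tests above can be tried quickly in Python, which is an assumed choice since the question names no language. The helper names are hypothetical, and the sketch assumes single-line strings because . does not match a newline by default.

```python
import re

HAS_BACKTICK = re.compile(r"`")
# Lookahead rejects any string containing an apostrophe, then a backtick
# must appear somewhere in it.
BACKTICK_NO_APOSTROPHE = re.compile(r"^(?!.*').*`")


def has_backtick(text: str) -> bool:
    return HAS_BACKTICK.search(text) is not None


def has_backtick_without_apostrophe(text: str) -> bool:
    return BACKTICK_NO_APOSTROPHE.search(text) is not None


for sample in ["This is Joe`s book", "This is Joe's book", "Joe`s and Ann's"]:
    print(sample, has_backtick(sample), has_backtick_without_apostrophe(sample))
```

Only the first sample passes both checks; the third contains a backtick but is rejected by the stricter pattern because it also contains an apostrophe.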