Regex PatternRepository pattern on BlackBerry 5 - how to ignore case - regex

I hope this title makes sense - I need case-insensitive regex matching on BlackBerry 5.
I have a regular expression defined as:
public static final String SMS_REG_EXP = "(?i)[(htp:/w\\.)]*cobiinteractive\\.com/[\\w|\\%]+";
It is intended to match "cobiinteractive.com/" followed by some text. The preceding (htp:w.) is just there because on my device I needed to override the internal link-recognition that the phone applies (shameless hack).
The app loads at start-up. The idea is that I want to pick up links to my site from sms & email, and process them with my app.
I add it to the PatternRepository using:
PatternRepository.addPattern(
ApplicationDescriptor.currentApplicationDescriptor(),
GlobalConstants.SMS_REG_EXP,
PatternRepository.PATTERN_TYPE_REGULAR_EXPRESSION,
applicationMenu);
On the os 4.5 / 4.7 simulators and on
a Curve 8900 device (running 4.5),
this works.
On the os 5 simulators and the Bold
9700 I tested, app fails to compile
the pattern with an
IllegalArgumentException("unrecognized
character after (?").
I have also tried (naively) to set the pattern to "/rockstar/i" but that only matches the exact string - this is possibly the correct direction to take, but if so, I don't know how to implement it on the BB.
How would I modify my regex in order to pick up case insensitive patterns using the PatternRepository as above?
PS: would the "correct" way be to use the [Cc][Oo][Bb][Ii]2... etc pattern? This is ok for a short string, but I am hoping for a more general solution if possible?

Well not a real solution for the general problem but this workaround is easy, safe and performant:
As your dealing here with URLs and they are not case-sensitive...
(it doesn't matter if we write google.com or GooGLE.COm or whatever)
The most simple solution (we all love KISS_principle) is to do first a lowercase (or uppercase if you like) on the input and than do a regex match where it doesn't matter whether it's case-sensitive or not because we know for sure what we are dealing with.

Since nobody else has answered this question relating to the PatternRepository class, I will self-answer so I can close it.
One way to do this would be to use a pattern like: [Cc][Oo][Bb][Ii]2[Nn][Tt][Ee][Rr][Aa][Cc][Tt][Ii][Vv][Ee]... etc where for each letter in the string, you put 2 options. Fortunately my string is short.
This is not an elegant solution, but it works. Unfortunately I don't know of a way to modify the string passed to PatternRepository and I think the crash when using the (?i) modifier is a bug in BB.

Use the port of the jakarta regex library:
https://code.google.com/p/regexp-me/
If you use unicode support, it's going to eat memory,
but if you just want case insensitive matching,
you simply need to pass the RE.MATCH_CASEINDEPENDENT flag when you compile your regex.
new RE("yourCaseInsensitivePattern", RE.MATCH_CASEINDEPENDENT | OTHER_FLAGS)

Related

Regex to match everything except a pattern

Regex noob here struggling with this, which I know it will be easy for some of you regex gods out there!
Given the following:
title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text
I would like a regex to match everything except the value of tags (ie the "<value to extract>" text).
For context, this is supposed to run on emacs (in case it matters).
EDIT: Just to clarify as per #phils question, all I care about extracting the tags value. However, this is via a package setting that asks for a regex string and I don't have much control over how it gets use. It seems to expect a regex to strip what I don't need from the string rather than matching what I do want, which is slightly annoying.. Also, the since it seems to match everything with \\(.\\), I'm guessing it's using the global flag?
Please let me know if any of this isn't clear.
Emacs regular expressions can't trivially express "not foo" for arbitrary values of foo. (The likes of PCRE have non-regular extensions for zero-width negative look-ahead/behind assertions, but in Emacs that sort of functionality is generally done with the support of lisp code1.)
You can still do it purely with regexp matching, but it's simply very cumbersome. An Emacs regexp which matches any line which does not begin with tags: is:
^\(?:$\|[^t]\|t[^a]\|ta[^g]\|tag[^s]\|tags[^:]\).*
or if you need to enter it in the elisp double-quoted read syntax for strings:
"^\\(?:$\\|[^t]\\|t[^a]\\|ta[^g]\\|tag[^s]\\|tags[^:]\\).*"
1 In lisp code you would instead simply check each line to see whether it does start with tags: and, if so, skip it (which is why Emacs generally gets away without the feature you're looking for, but of course that doesn't help you here).
After playing around with it for a bit and taken inspiration from #phils' answer, I've come up with the following:
"^\\(?:\\(#\\+\\)?\\(?:filetags:\s+\\|tags:\s+\\|title:.*\\|identifier:.*\\|date:.*\\)\\|.*\\)"
I've also added an extra \\(#\\+\\)? to account for org meta keys which would usually have the format #+key: value.

Regex that can handle an arbitrary number of asterisks in a word

I'm trying to write a regex for x509 CN/SAN validation and have just learned that apparently partial wildcards are possible in theory. How would I build a regex to handle this when I want to make sure that it captures all certificates that might be issued for example.org?
My naive approach would be
\**e\**x\**a\**m\**p\**l\**e\**.\**o\**r\**g\**
not including possible subdomains of course. This looks pretty bad though and really inflates the term longer than I'd like it to be. Is there a more concise way to get the behaviour I described?
Edit: I also just realised that my naive regex wouldn't even catch when someone uses the asterisk to replace a part of the domain, e.g. exa*.org.
Since I feel like there's a possibility that this is not easily expressible in a concise regex, I solved my use case within the Python code that surrounds my previous regex check.
Instead of mapping a regex to the domains appearing in a certificate, I instead convert the certificate domain into a regex pattern, replace the literal dots with escaped dots and the asterisk with [a-zA-Z0-9-]{0,63}. I then compare it to the list of domains I manage and if the regex matches, I know that the certificate is applicable to the managed domain.
If someone manages to express this in a concise regex I'd still be interested.

Filter by regex example

Could anyone provide an example of a regex filter for the Google Chrome Developer toolbar?
I especially need exclusion. I've tried many regexes, but somehow they don't seem to work:
It turned out that Google Chrome actually didn't support this until early 2015, see Google Code issue. With newer versions it works great, for example excluding everything that contains banners:
/^(?!.*?banners)/
It's possible -- at least in Chrome 58 Dev. You just need to wrap your regex with forward-slashes: /my-regex-string/
For example, this is one I'm currently using: /^(.(?!fallback font))+$/
It successfully filters out any messages that contain the substring "fallback font".
EDIT
Something else to note is that if you want to use the ^ (caret) symbol to search from the start of the log message, you have to first match the "fileName.js?someUrlParam:lineNumber " part of the string.
That is to say, the regex is matching against not just the log message, but also the stack-entry for the line which made the log.
So this is the regex I use to match all log messages where the actual message starts with "Dog":
/^.+?:[0-9]+ Dog/
The negative or exclusion case is much easier to write and think about when using the DevTool's native syntax. To provide the exclusion logic you need, simply use this:
-/app/ -/some\sother\sregex/
The "-" prior to the regex makes the result negative.
Your expression should not contain the forward slashes and /s, these are not needed for crafting a filter.
I believe your regex should finally read:
!(appl)
Depending on what exactly you want to filter.
The regex above will filter out all lines without the string "appl" in them.
edit: apparently exclusion is not supported?

CAtlRegExp for a regular expression that matches 4 characters max

Short version:
How can I get a regex that matches a#a.aaaa but not a#a.aaaaa using CAtlRegExp ?
Long version:
I'm using CAtlRegExp http://msdn.microsoft.com/en-us/library/k3zs4axe(VS.80).aspx to try to match email addresses. I want to use the regex
^[A-Z0-9._%+-]+#(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$
extracted from here.
But the syntax that CAtlRegExp accepts is different than the one used there. This regex returns the error REPARSE_ERROR_BRACKET_EXPECTED, you can check for yourself using this app: http://www.codeproject.com/KB/string/mfcregex.aspx
Using said app, I created this regex:
^[a-zA-Z0-9\._%\+\-]+#([a-zA-Z0-9-]+\.)+[a-zA-Z]$
But the problem is this matches a#a.aaaaa as valid, I need it to match 4 characters maximum for the op-level domain.
So, how can I get a regex that matches a#a.aaaa but not a#a.aaaaa ?
Try: ^[a-zA-Z0-9\._%\+\-]+#([a-zA-Z0-9-]+\.)+\c\c\c?\c?$
This expression replaces the [A-Z]{2,4} sequence which CAtlRegExp doesn't support with \c\c\c?\c?
\c serves as an abbreviation of [a-zA-Z]. The question marks after the 3rd and 4th \c's indicate they can match either zero or one characters. As a result, this portion of the expression matches 2, 3 or 4 characters, but neither more nor less.
You are trying to match email addresses, a very widely used critical element of internet communication.
To which I would say that this job is best done with the most widely used most correct regex.
Since email address format rules are described by RFC822, it seems useful to do internet searches for something like "RFC822 email regex".
For Perl the answer seems to be easy: use Mail::RFC822::Address: regexp-based address validation
RFC 822 Email Address Parser in PHP
Thus, to achieve the most correct handling of email addresses, one should either locate the most precise regex that there is out somewhere for the particular toolkit (ATL in your case) or - in case there's no suitable existing regex yet - adapt a very precise regex of another toolkit (Perl above seems to be a very complete albeit difficult candidate).
If you're trying to match a specific sub part of email addresses (as seems to be the case given your question), then it probably still makes sense to start with the most up-to-date/correct/universal regex and specifically limit it to the parts that you require.
Perhaps I stated the obvious, but I hope it helped.

Under what situations are regular expressions really the best way to solve the problem?

I'm not sure if Jeff coined it but it's the joke/saying that people who say "oh, I know I'll use regular expressions!" now have two problems. I've always taken this to mean that people use regular expressions in very inappropriate contexts.
However, under what circumstances are regular expressions really the best answer? What problems are they really the best or maybe only way to solve a situation?
RexExprs are good for:
Text Format Validations (email, url, numbers)
Text searchs/substitution.
Mappings (e.g. url pattern to function call)
Filtering some texts (related to substitution)
Lexical analysis during parsing.
They can be used to validate anything that have a pattern like :
Social Security Number
Telephone Number ( 555-555-5555 )
Email Address (something#example.com)
IP Address (but it's more complex to make sure it's valid)
All those have patterns and are easily verifiable by RegEx.
They are difficultly used for entry that have a logic instead of a pattern like a credit card number but they still can be used to do some client validation.
So the best ways?
To sanitize data entry on the client
side before sanitizing them on the
server.
To make "Search and Replace" of some
strings that contains pattern
I'm sure I am missing a lot of other cases.
Regular expressions are a great way to parse text that doesn't already have a parser (i.e. XML) I have used it to create a parser for the mod_rewrite syntax in the .htaccess file or in my URL Rewriter project http://www.codeplex.com/urlrewriter for example
they are really good when you want to be more specific than "*" or "?" like "3 letters then 2 numbers then a $ sign then a period"
The quote is from an anti-Perl rant from Jamie Zawinski. I think Perl used to do regex really badly but now it seems to be a standard engine for a lot of programs.
But the same sentiment still applies. If you don't know how to use regex, you better not try something real fancy other wise you get one of these tags too (see bronze list) ;o)
https://stackoverflow.com/users/730/keng
They are good for matching or finding text that takes a very specific and simple format. By "simple" I mean not nested and smaller than the entire html spec, for example.
They are primarily of value for highly structured text parsing. If you used named groups (and option in most mature regex systems), you have a phenomenally powerful and crisp way to handle the strings.
Here's an example. Consider that netstat in its various iterations on different linux OSes, and versions of netstat can return different results. Sometimes there is an extra column, sometimes there is a shift if the date/time format. Regexes give you a powerful way to handle that with a single expression. Couple that with named groups, and you can retrieve the data without hacks like:
1) split on spaces
2) ok, the netstat version is X so add I need to add 1 to all array references past column 5.
3) ok, the netstat version is Y so I need to make sure that I use multiple array references for the date info.
YUCK. Simple to fix in a Regex :-)