Expressing basic Access query criteria as regular expressions - regex

I'm familiar with Access's query and filter criteria, but I'm not sure how to express similar statements as regular expression patterns. I'm wondering if someone can help relate them to some easy examples that I understand.
If I were using regular expressions to match fields like Access, how would I express the following statements? Examples are similar to those found on this Access Query and Filter Criteria webpage. As in Access, case is insensitive.
"London"
Strings that match the word London exactly.
"London" or "Paris"
Strings that match either the words London or Paris exactly.
Not "London"
Any string but London.
Like "S*"
Any string beginning with the letter s.
Like "*st"
Any string ending with the letters st.
Like "*the*dog*"
Any strings that contain the words 'the' and 'dog' with any characters before, in between, or at the end.
Like "[A-D]*"
Any strings beginning with the letters A through D, followed by anything else.
Not Like "*London*"
Any strings that do not contain the word London anywhere.
Not Like "L*"
Any strings that don't begin with an L.
Like "L*" And Not Like "London*"
Any strings that begin with the letter L but not the word London.

Regex is much more powerful than any of the patterns you have been used to for creating criteria in Access SQL. If you limit yourself to these types of patterns, you will miss most of the really interesting features of regexes.
For instance, with Access's Like patterns you can't search for things like dates, extract IP addresses, do simple email or URL detection or validation, or perform basic reference code validation (such as asking whether an Order Reference code follows a mandated coding structure, say something like PO123/C456).
As @Smandoli mentioned, you'd better forget your preconceptions about pattern matching and dive into the regex language.
I found the book Mastering Regular Expressions to be invaluable, but tools are the best to experiment freely with regex patterns; I use RegexBuddy, but there are other tools available.
Basic matches
Now, regarding your list, and using fairly standardized regular expression syntax:
"London"
Strings that match the word London exactly.
^London$
"London" or "Paris"
Strings that match either the words London or Paris exactly.
^(London|Paris)$
Not "London"
Any string but London.
You match for ^London$ and invert the result (NOT)
Like "S*"
Any string beginning with the letter s.
^s
Like "*st"
Any string ending with the letters st.
st$
Like "*the*dog*"
Any strings that contain the words 'the' and 'dog' with any characters before, in between, or at the end.
the.*dog
Like "[A-D]*"
Any strings beginning with the letters A through D, followed by anything else.
^[A-D]
Not Like "*London*"
Any strings that do not contain the word London anywhere.
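Reverse the matching result for London. You can also use a negative lookahead:
^((?!London).)*$, which tests the lookahead before consuming each character, so strings that start with London are rejected too. The VBScript engine used from Access should accept lookahead, although it lacks lookbehind.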
Not Like "L*"
Any strings that don't begin with an L.
^[^L]
Negative matching for single characters is easier than negative matching for a whole word, as we've seen above.
Like "L*" And Not Like "London*"
Any strings that begin with the letter L but not the word London.
^L(?!ondon).*$
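To try the patterns in this list outside Access, here is a minimal Python sketch. It assumes Python's re module, which handles these particular constructs the same way; the helper names (matches, satisfies) and the sample strings are made up for illustration. The "Not" criteria are expressed by inverting the match result, as suggested above.

```python
import re

def matches(pattern, text):
    """Case-insensitive search, mirroring Access's case-insensitive criteria."""
    return re.search(pattern, text, re.IGNORECASE) is not None

def satisfies(pattern, text, negate):
    """Apply the pattern, inverting the result for the 'Not ...' criteria."""
    return matches(pattern, text) != negate

# (Access criterion, regex, sample text, invert the result?)
CASES = [
    ('"London"', r"^London$", "london", False),
    ('"London" or "Paris"', r"^(London|Paris)$", "PARIS", False),
    ('Not "London"', r"^London$", "Londonderry", True),
    ('Like "S*"', r"^s", "Stockholm", False),
    ('Like "*st"', r"st$", "Budapest", False),
    ('Like "*the*dog*"', r"the.*dog", "Walk the big dog now", False),
    ('Like "[A-D]*"', r"^[A-D]", "Denver", False),
    ('Not Like "*London*"', r"London", "Paris", True),
    ('Not Like "L*"', r"^[^L]", "Paris", False),
    ('Like "L*" And Not Like "London*"', r"^L(?!ondon).*$", "Lisbon", False),
]

for criterion, pattern, sample, negate in CASES:
    print(f"{criterion:35} {sample!r:25} {satisfies(pattern, sample, negate)}")
```

Every sample is chosen to satisfy its criterion, so each row should report True. Swapping in a sample such as "London Bridge" for the last row is a quick way to see the lookahead reject a string.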
Using Regexes in SQL Criteria
In Access, creating a user-defined function that can be used directly in SQL queries is easy.
To use regex matching in your queries, place this function in a module:
' ----------------------------------------------------------------------'
' Return True if the given string value matches the given Regex pattern '
' ----------------------------------------------------------------------'
Public Function RegexMatch(value As Variant, pattern As String) As Boolean
    If IsNull(value) Then Exit Function
    ' Using a static, we avoid re-creating the same regex object for every call '
    Static regex As Object
    ' Initialise the Regex object '
    If regex Is Nothing Then
        Set regex = CreateObject("vbscript.regexp")
        With regex
            .Global = True
            .IgnoreCase = True
            .MultiLine = True
        End With
    End If
    ' Update the regex pattern if it has changed since last time we were called '
    If regex.pattern <> pattern Then regex.pattern = pattern
    ' Test the value against the pattern '
    RegexMatch = regex.test(value)
End Function
Then you can use it in your query criteria, for instance to find, in a PartTable table, all parts whose description matches a variation of screw 18mm, such as Pan Head Screw length 18 mm or even SCREW18mm.
SELECT PartNumber, Description
FROM PartTable
WHERE RegexMatch(Description, "screw.*?\d+\s*mm")
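As a quick sanity check of that criterion outside Access, the sketch below runs the intended pattern (with \d+ for "one or more digits") through Python's re module. The sample descriptions come from the text above, except "Hex nut M8", which is an invented non-matching entry.

```python
import re

# The pattern intended by the query criterion: "screw", anything (lazily),
# one or more digits, optional whitespace, then "mm". Case is ignored,
# as in the RegexMatch function.
SCREW_MM = re.compile(r"screw.*?\d+\s*mm", re.IGNORECASE)

descriptions = ["Pan Head Screw length 18 mm", "SCREW18mm", "Hex nut M8"]
results = {d: SCREW_MM.search(d) is not None for d in descriptions}

for description, found in results.items():
    print(f"{description!r}: {found}")
```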
Caveat
Because the regex matching uses old scripting libraries, the flavour of regex language is a bit more limited than the one found in .NET or in other programming languages.
It's still fairly powerful, as it is more or less the same as the one used by JavaScript.
Read about the VBScript regex engine to check what you can and cannot do.
The worst part, though, is probably that regex matching using this library is fairly slow, so you should be very careful not to overuse it.
That said, it can be very useful sometimes. For instance, I used regexes to sanitize data input from users and detect entries with similar patterns that should have been normalised.
Well used, regexes can enhance data consistency, but use sparingly.

Regex is difficult to break into initially. Honestly, looking for spoon-fed examples is not going to help as much as "getting your hands dirty" with it. Also, MS Access is not a good springboard. Regex doesn't "cognate" well with the SQL query process -- not in application, and not in mental orientation. What you need is some text files to process, using a text editor.

Our solution was to open the Excel file in OpenCalc (part of Apache OpenOffice, https://www.openoffice.org/) which provides what seems like full regular expressions for both the find and replace.
We test the regular expressions at http://regexr.com/

Related

How to get the string that start after the last > by regular expression?

I am writing C# code that reads a webpage and greps content from it.
I spent a lot of time working out how to extract the content, and now I am stuck on this:
<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged
I want to get only "Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged".
I have been using "(?<=\">)(.*)" to get some content out successfully, but it does not fit every case.
Therefore, how can I use a regular expression to say that I want the text that starts after the last '>'?
Thank you.
If the substring that you want to match appears after the last > then the main thing you know about it is that it does not contain a >. This is matched with [^>]. If the string must contain at least one character then you'll want to use + as the quantifier; if it's allowed to be empty then you'll want to use * to allow for zero matches. Finally, you need to match the full remainder of the text, up to the end of the line, which you do with a $.
So the full expression is [^>]*$ (or [^>]+$ if it can't be zero length).
If you want to also require that the preceding text does have a >, you can make it a bit more complicated, using a positive look-behind, (?<=\>). This says to find a > but not to include it in the match. (The \> escape is optional here, since > is not a special character on its own; (?<=>) means the same thing.) The final expression would then be (?<=\>)[^>]*$. Now, C# strings also make use of \ for escaping, so in a regular string literal you have to double the backslash before passing it to the Regex constructor. So it becomes new Regex("(?<=\\>)[^>]*$").
The simpler version, [^>]*$, is probably sufficient for your needs.
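For comparison, here is the same idea as a short Python sketch rather than C#. It assumes the input is the single line from the question with no trailing newline; the variable names are illustrative only.

```python
import re

html = ('<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/'
        'irma-storm-updates.html">Latest Updates: 90 Percent of Houses in '
        'Florida Keys Are Damaged')

# Simple version: the longest run of non-'>' characters that reaches the end.
tail = re.search(r"[^>]*$", html).group(0)

# Stricter version: additionally require a '>' immediately before the match.
strict = re.search(r"(?<=>)[^>]*$", html)
strict_tail = strict.group(0) if strict else None

# Plain string handling gives the same result without any regex.
by_partition = html.rpartition(">")[2]

print(tail)
```

All three approaches should yield the headline text for this input. They differ only when the input contains no > at all: the simple regex and rpartition then return the whole string, while the look-behind version finds nothing.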
Finally, I would add that parsing XML or HTML with regular expressions is usually the wrong thing to do because there are lots of edge cases, and you will have to make assumptions about the formatting. For example, based on your example text, I assumed you are searching up to the end of the input text. It's usually better to parse XML with an XML parser, which won't have these problems.
This is the regex you need; there is a working example on RegexStorm.net:
>([^<>]+)
This says: find a closing angle bracket followed by text that doesn't include angle brackets. The [^<>] matches any character (letters, numbers, whitespace and so on) that is NOT an open or close angle bracket. The parentheses around [^<>]+ capture the text as a separate group. The + says to match one or more such characters.
Here is a C# example that uses it. The text you want is in Groups[1], the first capture group (Groups[0] is the whole match, including the >).
// Requires: using System; using System.Text.RegularExpressions;
void Main()
{
    string text = "<i class=\"icon\"></i><a href=\"https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html\">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged";
    // Capture the text that follows a '>' and contains no angle brackets
    Regex regex = new Regex(">([^<>]+)");
    // Matches() never returns null, so no null check is needed
    foreach (Match m in regex.Matches(text))
    {
        // Groups[1] is the first capture group: the text after the '>'
        Console.WriteLine(m.Groups[1].Value);
    }
}
RegexStorm.net is a good .Net test site. Regex101.com is a good site to learn different Regex tools.

How do I properly format this Regex search in R? It works fine in the online tester

In R, I have a column of data in a data-frame, and each element looks something like this:
Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae
What I want is the section after the last semicolon, and I've been trying to use 'sub' and also duplicating the existing column and create a new one with just the endings kept. In essence, I want this (the genus):
Marinilabiaceae
A snippet of the code looks like this:
mydata$new_column<- sub("([\\s\\S]*;)", "", mydata$old_column)
In this situation, I am using \\ rather than \ because of R's escape sequences. The sub replaces the parts I don't want and updates it to the new column. I've tested the Regex several times in places such as this: http://regex101.com/r/kS7fD8/1
However, I'm still struggling because the results are very bizarre. Now my new column is populated with the organism's domain rather than the genus: Bacteria.
How do I resolve this? Are there any good easy-to-understand resources for learning more about R's Regex formats?
Starting with your simple string,
string <- "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"
You can remove everything up to the last semicolon with "^(.*);" in your call to sub
> sub("^(.*);", "", string)
# [1] "Marinilabiaceae"
You can also use strsplit with tail
> tail(strsplit(string, ";")[[1]], 1)
# [1] "Marinilabiaceae"
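Your regular expression, ([\\s\\S]*;), is the usual PCRE idiom for "any character, including newlines", so the lack of spaces in your string is not in itself the problem. It worked on the regex101 site because that regex tester defaults to pcre (php) (see "Flavor" in the top-left corner), whereas sub uses R's own extended regular expression engine unless you pass perl = TRUE, and the two flavours do not necessarily treat escapes such as \\s inside a character class the same way. With sub("([\\s\\S]*;)", "", string, perl = TRUE) the pattern should behave as it does in the tester. R also requires extra backslash escape characters in many situations. For reference, this R text processing wiki has come in handy for me many times before.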
Make it Greedy and get the matched group from desired index.
(.*);(.*)
^^^------- Marinilabiaceae
Here is regex101 demo
Or to get the first word use Non-Greedy way
(.*?);(.*)
Bacteria -----^^^
Here is demo
To extract everything after the last ; to the end of the line you can use:
[^;]*?$
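The greedy, non-greedy and "everything after the last ;" approaches from the answers above can be compared side by side. The sketch below uses Python's re module rather than R, purely for illustration, so the R-specific escaping and engine caveats do not apply to it; the variable names are made up.

```python
import re

taxonomy = "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"

# Greedy .* runs to the LAST semicolon, so only the final field survives.
last_by_sub = re.sub(r"^.*;", "", taxonomy)

# Splitting once from the right avoids regular expressions entirely.
last_by_split = taxonomy.rsplit(";", 1)[-1]

# A run of non-semicolon characters anchored at the end of the string.
last_by_search = re.search(r"[^;]*$", taxonomy).group(0)

# Non-greedy .*? stops at the FIRST semicolon, giving the first field instead.
first_by_lazy = re.match(r"(.*?);(.*)", taxonomy).group(1)

print(last_by_sub, last_by_split, last_by_search, first_by_lazy)
```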

RegEx if target string is a superset of the regex

[edited for hopefully more clarity.]
I'm probably confused, but I have a Mongodb dataset of simple words:
Items:
Boston Beer
Boston Brewery
Coors Brewing Light
I have an input string:
"Boston Beer Company"
I want to find any item that is contained within the input string. In this case,
'Boston Beer' would be a match.
The trouble is, given any input string, I don't know which words in the string would find a match in a field. (The match is not anchored to the beginning or end.)
In Javascript, I'd just create a loop and test.
inputString.indexOf(currentItem) >= 0
I may have confused myself, but I can't find a way to express a regEx where the RegEx is the target string and I am testing if any individual item (field) is contained within the longer string.
I hope this is somewhat clearer.
Thanks in advance-
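The "loop and test" approach described in the question can be sketched as follows. This is plain in-memory Python, not a MongoDB query, and the three items are an assumption about how the item list above is meant to be split. The point it illustrates is the direction of the test: each stored item becomes the (escaped) pattern, and the longer input string is the text being searched.

```python
import re

# Assumed item list; in the question these would come from the database.
items = ["Boston Beer", "Boston Brewery", "Coors Brewing Light"]
input_string = "Boston Beer Company"

def items_contained_in(text, candidates):
    """Return the candidates that occur somewhere inside text."""
    found = []
    for item in candidates:
        # re.escape makes the item a literal pattern, so characters such as
        # '.' or '+' in an item are not treated as regex syntax.
        if re.search(re.escape(item), text):
            found.append(item)
    return found

contained = items_contained_in(input_string, items)
print(contained)
```

For purely literal items, `item in text` does the same job as the escaped regex; the regex form only becomes useful if word boundaries or case-insensitivity are needed.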

regex to match specific words hyphenated at arbitrary positions and split across two lines

I wish to search a text file for a given word that may optionally be hyphenated at an unknown position within the word and split across consecutive lines.
eg. match "hyphenated" within:
This sentence contains a hyphena-
ted word.
Closest (unattractive) solution:
"h\(-\s*\n\s*\)\?y\(-\s*\n\s*\)\?p\(-\s*\n\s*\)\?h\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?n\(-\s*\n\s*\)\?a\(-\s*\n\s*\)\?t\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?d"
I'm hoping that some regex-fu stronger than mine can come up with a regex that clearly includes the word being searched for, i.e. I'd like to see "hyphenated" in there. I haven't found a way to encode something like the following (which would be buggy anyway, since it would match "hy-ted"):
"{prefix-of:hyphenated}{hyphen/linebreak}{suffix-of:hyphenated}"
I realize that pre-processing the document to collapse such words would make the search simpler but I'm looking for a regex that I can use in a context where this won't be possible due to the tools involved.
Considering that hy-phen-ated should also match, I think this is a case where regex alone isn't the right way to go.
I would do this (not knowing your language, I've used pseudo code):
remove hyphens and newlines from input
match cleaned input with .*hyphenated.*
All languages can achieve step 1. trivially, and the code would be so much more readable.
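The two steps above can be sketched in Python as follows. The function name contains_word is hypothetical, and the sketch takes step 1 literally: it removes every hyphen and every line break, which also glues together unhyphenated words that merely sit on either side of a line break. Whether that trade-off is acceptable depends on the text being searched.

```python
import re

def contains_word(text, word):
    """Step 1: strip hyphens and line breaks. Step 2: look for the word."""
    cleaned = re.sub(r"-|\r?\n", "", text)
    return re.search(re.escape(word), cleaned) is not None

sample = "This sentence contains a hyphena-\nted word."
print(contains_word(sample, "hyphenated"))
print(contains_word("hy-phen-ated", "hyphenated"))
```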
I think this would work. If you have many words to search for, you would probably want to create a script to generate the search pattern for you.
[h\-]+\s*[y\-\s]+[p\-\s]+[h\-\s]+[e\-\s]+[n\-\s]+[a\-\s]+[t\-\s]+[e\-\s]+d\b
I don't think you mentioned which language you are using, but I tested this with .Net.
Here's a simple python script that will generate search patterns:
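# patterngen.py
# Usage: python patterngen.py <word>
# Example: python patterngen.py hyphenated
import sys

word = sys.argv[1]
# First letter: may be followed by hyphens, then optional whitespace
pattern = '[' + word[0] + r'\-]+\s*'
# Middle letters: each may be mixed with hyphens and whitespace
for i in range(1, len(word) - 1):
    pattern = pattern + r'[' + word[i]
    pattern = pattern + r'\-\s]+'
# Last letter, followed by a word boundary
pattern = pattern + word[-1] + r'\b'
print(pattern)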
Another way to approach this, right off the bat, is to "slide" the hyphenation like this:
hyphenated|h(-\s*\n\s*)yphenated|hy(-\s*\n\s*)phenated|hyp(-\s*\n\s*)henated|hyph(-\s*\n\s*)enated|hyphe(-\s*\n\s*)nated|hyphen(-\s*\n\s*)ated|hyphena(-\s*\n\s*)ted|hyphenat(-\s*\n\s*)ed|hyphenate(-\s*\n\s*)d
It reads better, but I don't really know how it compares performance-wise to your original pattern.
Yet another idea is to first narrow the search with a pattern along these lines:
h[hypenatd]{0,9}(-\s*\n*\s)?[hypenatd]{0,9}
and then match within the results of this one.
In fact, if I'm not mistaken, if you match with groups like this:
(h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9})
then the occurrences of the word hyphenated are all the matches where, in pseudocode:
(match.group1 + match.group2) == "hyphenated"
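A minimal Python sketch of that filter, using the grouped pattern exactly as written above. The function name find_hyphenated is made up, and the pattern's character class is specific to the word "hyphenated", so this is not a general solution.

```python
import re

# Group 1: "h" plus up to nine letters from the word.
# Optional, non-captured hyphen + line break.
# Group 2: the remaining letters, if the word was split.
CANDIDATE = re.compile(r"(h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9})")

def find_hyphenated(text, word="hyphenated"):
    """Return (start, end) spans of candidates whose two halves join to word."""
    return [m.span() for m in CANDIDATE.finditer(text)
            if m.group(1) + m.group(2) == word]

sample = "This sentence contains a hyphena-\nted word."
spans = find_hyphenated(sample)
print(spans)
```

The candidate pattern also matches unrelated fragments (for example the lone "h" in "This"); the string comparison is what discards them.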

Regex multi word search

What do I use to search for multiple words in a string? I would like the logical operation to be AND so that all the words are in the string somewhere. I have a bunch of nonsense paragraphs and one plain English paragraph, and I'd like to narrow it down by specifying a couple common words like, "the" and "and", but would like it match all words I specify.
Regular expressions support a "lookaround" condition that lets you search for a term within a string and then forget the location of the result; starting at the beginning of the string for the next search term. This will allow searching a string for a group of words in any order.
The regular expression for this is:
^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b)
Where \b is a word boundary and (?=...) is a positive lookahead, one kind of lookaround.
If you have a variable number of words you want to search for, you will need to build this regular expression string with a loop - just wrap each word in the lookaround syntax and append it to the expression.
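Building the expression in a loop might look like the Python sketch below. The function name all_words_pattern is hypothetical. Two choices in it are assumptions rather than part of the answer: IGNORECASE, so that "The" counts as "the", and DOTALL, so that a paragraph containing line breaks is searched as a whole.

```python
import re

def all_words_pattern(words):
    """Compile ^(?=.*\\bw1\\b)(?=.*\\bw2\\b)... for the given words."""
    lookaheads = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile("^" + lookaheads, re.IGNORECASE | re.DOTALL)

pattern = all_words_pattern(["the", "and"])

print(bool(pattern.search("Salt and pepper are on the table")))
print(bool(pattern.search("The band played on")))
```

The second string contains "the" but has "and" only inside "band", so the word boundaries should make it fail.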
AND as concatenation
^(?=.*?\b(?:word1)\b)(?=.*?\b(?:word2)\b)(?=.*?\b(?:word3)\b)
OR as alternation
^(?=.*?\b(?:word1|word2|word3)\b)
^(?=.*?\b(?:word1)\b)|^(?=.*?\b(?:word2)\b)|^(?=.*?\b(?:word3)\b)
Maybe using a language recognition chart to recognize english would work. Some quick tests seem to work (this assumes paragraphs separated by newlines only).
The regexp matches if any one of the alternatives matches: \bword\b is a whole word delimited by boundaries, word\b is a word ending, and a bare word matches anywhere in the paragraph.
my @paragraphs = split(/\n/, $text);
for my $p (@paragraphs) {
    if ($p =~ m/\bthe\b|\band\b|\ban\b|\bin\b|\bon\b|\bthat\b|\bis\b|\bare\b|th|sh|ough|augh|ing\b|tion\b|ed\b|age\b|’s\b|’ve\b|n’t\b|’d\b/) {
        print "Probable english\n$p\n";
    }
}
Firstly I'm not certain what you're trying to return... the whole sentence? The words in between your two given words?
Something like:
\b(word1|word2)\b(\w+\b)*(word1|word2)\b(\w+\b)*\.
(where \b is the word boundary in your language)
would match a complete sentence that contained either of the two words, or both.
You'd probably need to make it case insensitive so that it still matches if the word appears at the start of the sentence.
Assuming PCRE (Perl regexes), I am not sure that you can do it at all easily. The AND operation is concatenation of regexes, but you want to be able to permute the order in which the words appear without having to formally generate the permutation. For N words, when N = 2, it is bearable; with N = 3, it is barely OK; with N > 3, it is unlikely to be acceptable. So, the simple iterative solution - N regexes, one for each word, and iterate ensuring each is satisfied - looks like the best choice to me.
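The iterative approach, one small regex per word with every one required to match, can be sketched in Python as follows. The function name contains_all and the sample paragraphs are invented for illustration, and the case-insensitive flag is an assumption.

```python
import re

def contains_all(text, words):
    """True if every word occurs in text as a whole word, in any order."""
    return all(
        re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
        for word in words
    )

paragraphs = [
    "Xq zvv plorth and wubba",
    "The cat sat on the mat and purred",
    "Thexx andyy zzz",
]
english_like = [p for p in paragraphs if contains_all(p, ["the", "and"])]
print(english_like)
```

The cost grows linearly with the number of words, and no permutations of word order need to be generated.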