Complex regex single quote replace - regex

I have a set of strings in which I would like to replace single quotes with double quotes. However, sometimes the single quote to replace is at the end of the line, and sometimes a single quote should not be replaced because it follows an s as a possessive.
Example :
The song 'Miss you' is featured in The Rolling Stones' album 'Voodoo Lounge'
should be
The song "Miss you" is featured in The Rolling Stones' album "Voodoo Lounge"
Thanks for your help :)

Regular expressions only see raw text; they can't tell context or grammar. So it is pretty much impossible to build a regular expression that reliably distinguishes a possessive apostrophe after an s from a closing quote after a word that happens to end in s.
However, if you'd like to ignore such cases and match the rest of them, you can use the following regex with lookaround assertions:
(?<!s)'(?!s\b)
Note that this will also fail to match valid closing quotes after titles ending in s, like 'Blurred Lines', 'Dangerous' etc.
Working demo
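As a quick check of the lookaround pattern above, here is a minimal Python sketch. The question does not name a language, so the use of Python's re module and the names QUOTE and to_double_quotes are assumptions made for illustration only.

```python
import re

# Pattern from the answer: a single quote that is not preceded by "s"
# and not followed by a word-final "s" (as in "it's").
QUOTE = re.compile(r"(?<!s)'(?!s\b)")

def to_double_quotes(text: str) -> str:
    """Replace every single quote matched by QUOTE with a double quote."""
    return QUOTE.sub('"', text)

sample = "The song 'Miss you' is featured in The Rolling Stones' album 'Voodoo Lounge'"
converted = to_double_quotes(sample)
print(converted)

# Limitation: a closing quote after a title ending in "s" is left untouched.
limitation = to_double_quotes("I like 'Blurred Lines'")
print(limitation)
```

The second call shows the trade-off described in the note: the opening quote is converted, but the closing one is skipped because it follows an s.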

Related

How do I properly format this Regex search in R? It works fine in the online tester

In R, I have a column of data in a data-frame, and each element looks something like this:
Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae
What I want is the section after the last semicolon. I've been trying to use 'sub', duplicating the existing column to create a new one with just the endings kept. In essence, I want this (the genus):
Marinilabiaceae
A snippet of the code looks like this:
mydata$new_column<- sub("([\\s\\S]*;)", "", mydata$old_column)
In this situation, I am using \\ rather than \ because of R's escape sequences. The sub replaces the parts I don't want and updates it to the new column. I've tested the Regex several times in places such as this: http://regex101.com/r/kS7fD8/1
However, I'm still struggling because the results are very bizarre. Now my new column is populated with the organism's domain rather than the genus: Bacteria.
How do I resolve this? Are there any good easy-to-understand resources for learning more about R's Regex formats?
Starting with your simple string,
string <- "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"
You can remove everything up to the last semicolon with "^(.*);" in your call to sub
> sub("^(.*);", "", string)
# [1] "Marinilabiaceae"
You can also use strsplit with tail
> tail(strsplit(string, ";")[[1]], 1)
# [1] "Marinilabiaceae"
Your regular expression, ([\\s\\S]*;), doesn't behave as intended here, most likely because R's default (POSIX-style) engine does not treat \\s and \\S as shorthand classes inside a bracket expression, so [\\s\\S] does not mean "any character" the way it does in PCRE. It worked on the regex101 site because that regex tester defaults to pcre (php) (see "Flavor" in top-left corner), and R regex syntax is slightly different. Passing perl = TRUE to sub should make R use PCRE as well. Also remember that R requires the backslashes to be doubled inside string literals. For reference, this R text processing wiki has come in handy for me many times before.
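For comparison, the two approaches in this answer can be sketched in Python (an assumption made for illustration; the question itself is about R, and the variable names are hypothetical). Python's engine treats [\s\S] as "any character", as PCRE does, so the asker's original pattern also yields the last field here.

```python
import re

string = "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"

# Greedy ".*" runs to the last semicolon, so everything up to it is removed.
by_regex = re.sub(r"^.*;", "", string)

# Splitting once from the right avoids the regex entirely.
by_split = string.rsplit(";", 1)[-1]

# The asker's pattern: "[\s\S]" matches any character in this engine.
by_class = re.sub(r"[\s\S]*;", "", string)

print(by_regex, by_split, by_class)
```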
Make it greedy and take the matched group at the index you need.
(.*);(.*)
     ^^^^------- Marinilabiaceae
Here is regex101 demo
Or, to get the first word, use the non-greedy way
(.*?);(.*)
^^^^^------- Bacteria
Here is demo
To extract everything after the last ; to the end of the line you can use:
[^;]*?$
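The difference between the greedy and non-greedy groups, and the end-anchored pattern, can be checked with a short Python sketch (Python is an assumption here; the variable names are illustrative).

```python
import re

line = "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"

# Greedy first group: it swallows as much as possible, so group 2 is the last field.
greedy = re.match(r"(.*);(.*)", line)
last_field = greedy.group(2)

# Lazy first group: it stops at the first semicolon, so group 1 is the first field.
lazy = re.match(r"(.*?);(.*)", line)
first_field = lazy.group(1)

# Anchored at the end of the line: the run of non-semicolon characters before "$".
tail = re.search(r"[^;]*?$", line).group(0)

print(last_field, first_field, tail)
```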

Regular expression to only return every other match

I'm trying to write a regex that will match humanly readable quoted values. As one example, XML attributes. The problem I'm running into is that the data between quoted areas is actually quoted as well if you consider an attribute's ending quote and a subsequent attribute's beginning quote. Here's the expression I have so far:
(?<=\")(?(?!\s+\")[^\"]+)(?=\")
What I tried to express in plain English was: A quote (don't capture it), if not followed by just spaces terminating in another quote, match anything not a quote that is followed by another quote (not capturing the last quote).
and here's my sample data:
<computer name = "printserver" model = "1000ZS" />
The regex produces 3 matches:
printserver
model =
1000ZS
I think that if I could find a way to tell the regex engine to skip every other occurrence I'd have it.
Here's another sample data set, sort of like QML class attributes:
field1: "value1" field2: "value2" field3: "value3"
I can "see" the quoted data, but extracting it via regex is beating me :-)
I'm using the .NET 4.5 System.Text.RegularExpressions framework in my project. I'm not targeting a specific markup like XML, JSON, QML, etc. but am looking for a general purpose regex that would just grab the quoted values similar to how we interpret the data as humans...
Any suggestions? Thanks!
You can always consume the quote in your match:
\"([^\"]+)\"
And extract the part you need from the first capture group.
If it's explicitly a quote preceded by a space, then you can use the part you used, with a little tweak:
\"((?:(?!\s+\")[^\"])+)\"
Or if you just know that the string contains simple patterns like that, maybe something like this:
(?:(?!\s+\")[^\"])+(?=\")

regex find and replace multiple chars of different kinds in one expression

I need to replace all left { and right } curly braces, as well as all percentage signs %, with their respective HTML entities in a document.
I'm using Sublime Text 2's nice little star icon/button in Find and Replace. I came up with (\{)|(\})|(\%) to match the chars I need. There might be better ways, but hey... it seems to work.
What would the replacement string look like for this? I mean one expression, not with a programming language.
Basically it's replacing what I find with group $1 with something and then the same for group $2 with something else. Here in pseudo-code:
(\{)|(\})|(\%) ==> $1 replace with &#123; AND $2 replace with &#125; AND... etc.
Is this possible? I can provide some target sample data if needed.
Back story
These three characters can't be placed as is inside an attribute's value in HAML, like
:text => "blabliblu {20% lalla...}"
etc., without being escaped.
The percentage sign could theoretically be escaped with \% but the curly braces cannot be escaped with \{ and \}, at least not when I'm preprocessing the HAML with Livereload (Win7). Maybe it's a Ruby thing? Anyhow, I'm going for the HTML entity approach.
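The question asks for an editor-only replacement, so the following is only a fallback sketch for the case where a small scripting step is acceptable. It assumes Python and the numeric entities &#123;, &#125; and &#37;; the names ENTITIES and escape_for_haml are hypothetical.

```python
import re

# Map each character to its numeric HTML entity (assumed target format).
ENTITIES = {"{": "&#123;", "}": "&#125;", "%": "&#37;"}

def escape_for_haml(text: str) -> str:
    """Replace {, } and % in one pass, choosing the entity per matched character."""
    return re.sub(r"[{}%]", lambda m: ENTITIES[m.group(0)], text)

escaped = escape_for_haml("blabliblu {20% lalla...}")
print(escaped)
```

A replacement function is what lets a single pattern map different matches to different replacement strings.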

Expressing basic Access query criteria as regular expressions

I'm familiar with Access's query and filter criteria, but I'm not sure how to express similar statements as regular expression patterns. I'm wondering if someone can help relate them to some easy examples that I understand.
If I were using regular expressions to match fields like Access, how would I express the following statements? Examples are similar to those found on this Access Query and Filter Criteria webpage. As in Access, case is insensitive.
"London"
Strings that match the word London exactly.
"London" or "Paris"
Strings that match either the words London or Paris exactly.
Not "London"
Any string but London.
Like "S*"
Any string beginning with the letter s.
Like "*st"
Any string ending with the letters st.
Like "*the*dog*"
Any strings that contain the words 'the' and 'dog' with any characters before, in between, or at the end.
Like "[A-D]*"
Any strings beginning with the letters A through D, followed by anything else.
Not Like "*London*"
Any strings that do not contain the word London anywhere.
Not Like "L*"
Any strings that don't begin with an L.
Like "L*" And Not Like "London*"
Any strings that begin with the letter L but not the word London.
Regexes are much more powerful than any of the patterns you have been using to create criteria in Access SQL. If you limit yourself to these types of patterns, you will miss most of the really interesting features of regexes.
For instance, with Access patterns you can't search for things like dates, extract IP addresses, do simple email or URL detection or validation, or do basic reference code validation (such as checking whether an Order Reference code follows a mandated coding structure, say something like PO123/C456), etc.
As @Smandoli mentioned, you'd better forget your preconceptions about pattern matching and dive into the regex language.
I found the book Mastering Regular Expressions to be invaluable, but tools are the best to experiment freely with regex patterns; I use RegexBuddy, but there are other tools available.
Basic matches
Now, regarding your list, and using fairly standardized regular expression syntax:
"London"
Strings that match the word London exactly.
^London$
"London" or "Paris"
Strings that match either the words London or Paris exactly.
^(London|Paris)$
Not "London"
Any string but London.
You match for ^London$ and invert the result (NOT)
Like "S*"
Any string beginning with the letter s.
^s
Like "*st"
Any string ending with the letters st.
st$
Like "*the*dog*"
Any strings that contain the words 'the' and 'dog' with any characters before, in between, or at the end.
the.*dog
Like "[A-D]*"
Any strings beginning with the letters A through D, followed by anything else.
^[A-D]
Not Like "*London*"
Any strings that do not contain the word London anywhere.
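Reverse the matching result for London (you can also use a negative lookahead like:
^((?!London).)*$, where the lookahead must come before the dot so that strings starting with London are rejected too; lookahead should be available in the VBScript engine used below, although lookbehind is not).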
Not Like "L*"
Any strings that don't begin with an L.
^[^L] negative matching for single characters is easier than negative matching for a whole word as we've seen above.
Like "L*" And Not Like "London*"
Any strings that begin with the letter L but not the word London.
^L(?!ondon).*$
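To try these patterns outside Access, here is a minimal Python sketch (Python, and the names regex_match, PATTERNS and not_london, are assumptions for illustration). It uses a case-insensitive search to mirror Access, and for Not Like "*London*" it uses the form ^((?!London).)*$, with the lookahead placed before the dot.

```python
import re

def regex_match(value: str, pattern: str) -> bool:
    """Case-insensitive search, mirroring Access's case-insensitive criteria."""
    return re.search(pattern, value, re.IGNORECASE) is not None

# Access criterion -> regex pattern, following the list above.
PATTERNS = {
    '"London"': r"^London$",
    '"London" or "Paris"': r"^(London|Paris)$",
    'Like "S*"': r"^s",
    'Like "*st"': r"st$",
    'Like "*the*dog*"': r"the.*dog",
    'Like "[A-D]*"': r"^[A-D]",
    'Not Like "*London*"': r"^((?!London).)*$",
    'Not Like "L*"': r"^[^L]",
    'Like "L*" And Not Like "London*"': r"^L(?!ondon).*$",
}

def not_london(value: str) -> bool:
    """Not "London": run the exact match and invert the result."""
    return not regex_match(value, PATTERNS['"London"'])

for criterion, pattern in PATTERNS.items():
    print(criterion, "->", pattern)
```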
Using Regexes in SQL Criteria
In Access, creating a user-defined function that can be used directly in SQL queries is easy.
To use regex matching in your queries, place this function in a module:
' ----------------------------------------------------------------------'
' Return True if the given string value matches the given Regex pattern '
' ----------------------------------------------------------------------'
Public Function RegexMatch(value As Variant, pattern As String) As Boolean
    If IsNull(value) Then Exit Function
    ' Using a static, we avoid re-creating the same regex object for every call '
    Static regex As Object
    ' Initialise the Regex object '
    If regex Is Nothing Then
        Set regex = CreateObject("vbscript.regexp")
        With regex
            .Global = True
            .IgnoreCase = True
            .MultiLine = True
        End With
    End If
    ' Update the regex pattern if it has changed since last time we were called '
    If regex.pattern <> pattern Then regex.pattern = pattern
    ' Test the value against the pattern '
    RegexMatch = regex.test(value)
End Function
Then you can use it in your query criteria, for instance to find, in a PartTable table, all parts matching variations of screw 18mm, like Pan Head Screw length 18 mm or even SCREW18mm etc.
SELECT PartNumber, Description
FROM PartTable
WHERE RegexMatch(Description, "screw.*?\d+\s*mm")
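The query's pattern can be tried outside Access with a small Python sketch. This assumes the intended pattern is screw.*?\d+\s*mm (digits via \d+); the sample descriptions beyond the two named in the text are made up for illustration.

```python
import re

# Case-insensitive, like the VBA function above (IgnoreCase = True).
SCREW_MM = re.compile(r"screw.*?\d+\s*mm", re.IGNORECASE)

# The last two entries are hypothetical non-matching descriptions.
descriptions = [
    "Pan Head Screw length 18 mm",
    "SCREW18mm",
    "Hex bolt 18 mm",
    "Screw driver",
]

matching = [d for d in descriptions if SCREW_MM.search(d)]
print(matching)
```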
Caveat
Because the regex matching uses old scripting libraries, the flavour of regex language is a bit more limited than the one found in .NET and other programming languages.
It's still fairly powerful as it is more or less the same as the one used by JavaScript.
Read about the VBScript regex engine to check what you can and cannot do.
The worst part, though, is probably that regex matching with this library is fairly slow, so you should be very careful not to overuse it.
That said, it can be very useful sometimes. For instance, I used regexes to sanitize data input from users and detect entries with similar patterns that should have been normalised.
Well used, regexes can enhance data consistency, but use sparingly.
Regex is difficult to break into initially. Honestly, looking for spoon-fed examples is not going to help as much as "getting your hands dirty" with it. Also, MS Access is not a good springboard. Regex doesn't "cognate" well with the SQL query process -- not in application, and not in mental orientation. What you need is some text files to process, using a text editor.
Our solution was to open the Excel file in OpenCalc (part of Apache OpenOffice, https://www.openoffice.org/) which provides what seems like full regular expressions for both the find and replace.
We test the regular expressions at http://regexr.com/

Regex matching in ColdFusion OR condition

I am attempting to write a CF component that will parse wikiCreole text. I am having trouble getting the correct matches with some of my regular expressions though. I feel like if I can just get my head around the first one, the rest will just click. Here is an example:
The following is sample input:
You can make things **bold** or //italic// or **//both//** or //**both**//.
Character formatting extends across line breaks: **bold,
this is still bold. This line deliberately does not end in star-star.
Not bold. Character formatting does not cross paragraph boundaries.
My first attempt was:
<cfset out = REreplace(out, "\*\*(.*?)\*\*", "<strong>\1</strong>", "all") />
Then I realized that it would not match where the ** is not given, and it should end where there are two carriage returns.
So I tried this:
<cfset out = REreplace(out, "\*\*(.*?)[(\*\*)|(\r\n\r\n)]", "<strong>\1</strong>", "all") />
and it is close but for some reason it gives you this:
You can make things <strong>bold</strong>* or //italic// or <strong>//both//</strong>* or //<strong>both</strong>*//.
Character formatting extends across line breaks: <strong>bold,</strong>
this is still bold. This line deliberately does not end in star-star.
Not bold. Character formatting does not cross paragraph boundaries.
Any ideas?
PS: If anyone has any suggestions for better tags, or a better title for this post I am all ears.
The [...] represents a character class, so this:
[(\*\*)|(\r\n\r\n)]
Is effectively the same as this:
[()*|\r\n]
i.e. it matches a single character (here one "*"), and the "|" isn't an alternation.
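The difference between a character class and real alternation can be seen in a short Python sketch (Python is an assumption here; the thread is about ColdFusion, and the variable names are illustrative).

```python
import re

# Inside [...], parentheses and "|" are ordinary characters,
# so this class matches exactly one character from the set ( ) * | \r \n.
asker_class = re.compile(r"[(\*\*)|(\r\n\r\n)]")
single_chars = asker_class.findall("a*b|c(d)\n")

# A non-capturing group gives real alternation between the two terminators.
alternation = re.compile(r"(?:\*\*|\r\n\r\n)")
terminators = alternation.findall("x**y\r\n\r\nz*")

print(single_chars, terminators)
```

This is why the asker's output kept a stray "*" after each closing tag: the class consumed only the first of the two stars.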
Another problem is that you replace the double linefeed. Even if your match succeeded you would end up merging paragraphs. You need to either restore it or not consume it in the first place. I'd use a positive lookahead to do the latter.
In Perl I'd write it this way:
$string =~ s/\*\*(.*?)(?:\*\*|(?=\n\n))/<strong>$1<\/strong>/sg;
Taking a wild guess, the ColdFusion probably looks like this:
REreplace(out, "\*\*(.*?)(?:\*\*|(?=\r\n\r\n))", "<strong>\1</strong>", "all")
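The same idea (end the bold run at either a closing ** or a paragraph break that is looked at but not consumed) can be sketched in Python. This is an illustration under assumptions: Python instead of ColdFusion, \n\n as the paragraph break rather than \r\n\r\n, a shortened sample text, and hypothetical names BOLD and render_bold.

```python
import re

# DOTALL lets "." cross single line breaks; the lookahead stops the match
# before a blank line without consuming it, so paragraphs are not merged.
BOLD = re.compile(r"\*\*(.*?)(?:\*\*|(?=\n\n))", re.DOTALL)

def render_bold(text: str) -> str:
    """Wrap each bold run in <strong> tags."""
    return BOLD.sub(r"<strong>\1</strong>", text)

creole = (
    "Make things **bold** or //**both**//.\n"
    "\n"
    "Formatting extends across line breaks: **bold,\n"
    "this is still bold.\n"
    "\n"
    "Not bold."
)
html = render_bold(creole)
print(html)
```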
You really should change your
(.*?)
to something like
[^*]*?
to match any character except the *. I don't know if that is the problem, but it could be that the any-character . is eating one of your stars. It is also a generally accepted "best practice", when trying to balance matching characters like the double star or HTML start/end tags, to explicitly exclude them from your match set for the inner text.
*Disclaimer, I didn't test this in ColdFusion for the nuances of the regex engine - but the idea should hold true.
I know this is an older question, but in response to where Ryan Guill said "I tried the $1 but it put a literal $1 in there instead of the match": for ColdFusion you should use \1 instead of $1.
I always use a regex web page. It seems like I start from scratch every time I use regex.
Try using '$1' instead of \1 for this one - the replace is slightly different... but I think the pattern is what you need to get working.
Getting closer with this:
\*\*(.*?)\*\*|//(.*?)//
The tricky part is the //** or **//
Ok, first checking for **//bold//**
then //**bold**// then **bold**, then
//bold//
\*\*//(.*?)//\*\*|//\*\*(.*?)\*\*//|\*\*(.*?)\*\*|//(.*?)//
I find this app immensely helpful when I'm doing anything with regex:
http://www.gskinner.com/RegExr/desktop/
Still doesn't help with your actual issue, but could be useful going forward.