Regular Expression to match string which doesn't contain substring - regex

I have a comma-separated list as shown below. The list is actually on one line, but I have split it up to show the syntax and that each unit contains 5 elements. There is no comma at the end of the list.
ro:2581,1309531682152,A,Place,Page,
me:2642,1310989368864,A,Place,Page,
uk:2556,1309267095061,A,Place,Page,
me:2642,1310989380238,D,Place,Page,
me:2642,1334659643627,D,Place,Page,
ro:3562,1378721526696,A,Place,Page,
uk:1319,1309337246675,D,Place,Page,
ro:2581,1379500694666,D,Place,Page,
uk:1319,1309337246675,A,Place,Page
What I am trying to do is remove any unit (full line) that does not begin with uk:. I.e., the results will be:
uk:2556,1309267095061,A,Place,Page,
uk:1319,1309337246675,D,Place,Page,
uk:1319,1309337246675,A,Place,Page
If the string were on separate lines as in my example, I could do this relatively easily, but because it is all on one line, I cannot get it to work. Can anyone point me in the right direction?
Thanks

This should work:
(uk:\d+,\d+,\w,\w+,\w+)
Demo
It looks for uk: and then it's pretty much comma-counting from there on.
EDIT:
Since OP has now clarified that what they're using can only remove strings:
,?[^u][^k]:\d+,\d+,\w,\w+,\w+
Demo 2
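This looks for an optional comma followed by two characters, the first not u and the second not k, then a colon (:); the rest of the regex is the same. Note that [^u][^k] also skips any prefix that merely starts with u or ends with k (us: or mk:, say), which does not matter here because the sample only contains ro, me and uk.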
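As a rough check, both patterns can be tried in Python, assuming the list is held in a single string. The variable names are illustrative and not from the original posts. The first pattern keeps the uk: units; the second removes everything else and leaves a leading comma that has to be stripped.

```python
import re

# The sample list from the question, joined onto one line.
data = (
    "ro:2581,1309531682152,A,Place,Page,"
    "me:2642,1310989368864,A,Place,Page,"
    "uk:2556,1309267095061,A,Place,Page,"
    "me:2642,1310989380238,D,Place,Page,"
    "me:2642,1334659643627,D,Place,Page,"
    "ro:3562,1378721526696,A,Place,Page,"
    "uk:1319,1309337246675,D,Place,Page,"
    "ro:2581,1379500694666,D,Place,Page,"
    "uk:1319,1309337246675,A,Place,Page"
)

# Keep approach: collect every unit that starts with "uk:".
kept = re.findall(r"uk:\d+,\d+,\w,\w+,\w+", data)
print(",".join(kept))

# Remove approach: delete every unit whose prefix is not "uk".
removed = re.sub(r",?[^u][^k]:\d+,\d+,\w,\w+,\w+", "", data)
# The first surviving unit keeps its separator, so strip it.
print(removed.lstrip(","))
```

Both prints should show the same three uk: units joined by commas.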

I would suggest a simple regex like this:
(\buk:.+?,Page)(?:,|$)
and grab matched group #1
RegEx Demo


Regex Conditionals

I would like to control orphans in InDesign by applying a "No Break" character style based on a GREP expression. Basically, I need to target the last 2 words of a paragraph (That is to say: The last 2 strings of characters separated by a space).
I found a solution for my English publications where (\H+?\h?){2}$ works like a charm.
The problem is with my French publications, where some punctuation marks require a space before them. I am trying to choose the matching pattern based on the last character of the paragraph: if it is a ?, ! or :, I match the last 3 "words" using (\H+?\h?){3}$; if not, then I match the last 2.
I thought the following expression would work:
(?(?=[\?!:]$)((\H+?\h?){3}$)|(\H+?\h?){2}$)
but somehow it always defaults to the "else" branch.
Can someone tell me where I went wrong?
Let me check that I understand correctly. The requirements are:
Capture the last two words
Even if the paragraph ends with ?, ! or :
If so, maybe you want option (A) below.
(A) Use this to capture as group: https://regexr.com/4lr6h
(\w*)(?:\s*)(\w*)(?:\s*)(\w*)(?:[\?!:]|$)
(B) Use this to capture only words: https://regexr.com/4lr84
\w*\s\w*(?=(?:$|[\?!:]))
(C) Use this to capture the last three words with marks: https://regexr.com/4lr87
\w*\s\w*[\?!:]?$
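One likely reason the conditional in the question always takes the "else" branch: the lookahead (?=[\?!:]$) is tested at the position where the match attempt starts, so it only succeeds when that position sits directly before the final punctuation mark. A plain alternation avoids the conditional altogether. The sketch below uses Python's re module to illustrate the idea, with \S and \s standing in for \H and \h; whether InDesign's GREP engine accepts the same alternation unchanged is an assumption that would need checking.

```python
import re

# Three "words" when the paragraph ends with a spaced ?, ! or :
# (the punctuation counts as the third word), otherwise the last two.
LAST_WORDS = re.compile(r"(?:\S+\s){2}[?!:]$|\S+\s\S+$")

def last_words(paragraph):
    """Return the trailing run of text that should not break."""
    m = LAST_WORDS.search(paragraph)
    return m.group(0) if m else ""

print(last_words("Où allez vous ?"))
print(last_words("This works like a charm."))
```

The first alternative is tried before the second at each position, and the three-word match starts further left, so it wins whenever the paragraph ends with a spaced punctuation mark.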

Matching multiple letters and special characters in regex

I am trying to catch strings around the acronym ADJ. The strings look like this:
·NOM·JJ·ADJ+CASE_DEF_GEN
·NOM·JJ·ADJ+CASE_DEF_ACC
·NOM·JJ·ADJ+CASE_INDEF_GEN
·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_GEN
·NOM·JJ·ADJ+CASE_INDEF_GEN
·NOM·JJ·ADJ+NSUFF_FEM_SG+CASE_INDEF_GEN
·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_ACC
So far I have this:
/[A-Z·\+#_]*?[·\+]ADJ[·\+][A-Z_·\+#]*?/g
But it only matches from the beginning of each string up to and including "ADJ+", for example ·NOM·DT+JJ·DET+ADJ+.
Since the rest of each string after ADJ has the same composition as the part before ADJ, I thought this /[A-Z·\+#_]*?[·\+]/g should work, but it doesn't.
How do I get it to match the rest of the string?
My guess is that you want to check whether the string contains ADJ and capture what surrounds it. If so, the expression could be simplified to something like:
([A-Z·+#_]*)\bADJ\b([A-Z·+#_]*)
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
That *? quantifier after the +ADJ+ phrase is satisfied with the empty string right after it, since the ? makes the quantifier before it match "the minimum number of times possible" and for * that is zero times.
So drop that ?, which serves no purpose when you want to match the rest of the line:
perl -wE'$_=q(-XADJX-JJ+ADJ-REST-);
($before, $after) = /(.*?)[+\-]ADJ[+\-](.*)/;
say for $before,$after'
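The same effect can be reproduced in Python as a quick sanity check. This is only an illustration of the lazy versus greedy trailing quantifier, using one of the sample lines from the question; the variable names are made up for the example.

```python
import re

line = "·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_GEN"

# Trailing *? is lazy: it is satisfied with zero characters after "ADJ+".
lazy = re.search(r"[A-Z·+#_]*?[·+]ADJ[·+][A-Z_·+#]*?", line).group(0)

# Trailing * is greedy: it consumes the rest of the line.
greedy = re.search(r"[A-Z·+#_]*?[·+]ADJ[·+][A-Z_·+#]*", line).group(0)

print(lazy)    # stops right after "ADJ+"
print(greedy)  # the whole line
```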
Removing the ? at the end makes it match the whole string:
/[A-Z·\+#_]*?[·\+]ADJ[·\+][A-Z_·\+#]*/g
I am not entirely sure why you needed a ? in a *.

How can I match groups separated by other groups in regex?

I am writing a regex to match a list of items that follow a specific complex format, so the regex for that is very long. The items on this list have to be separated by either a comma, which can optionally be padded with either one space on the right or spaces on both sides, so the regex for matching the delimiter is ( , )|(, ?). Also, I want the list to be between square brackets.
For example, it should match the following:
[]
[validItem]
[validItem,validItem, validItem]
But not the following:
[validItem,invalidItem]
[validItemvalidItem]
[validItem, validItem ]
The regex I currently have is: \[verylongregex(?:(?: , )|(?:, ?)verylongregex)*\], but I'd like to simplify this to include the regex pattern that matches the element format only once.
Does regex have a method to match X groups separated by another group?
Here is an answer. I don't know if it is what you are looking for, but here it is nonetheless.
1/ Assuming you want to capture the list in one group:
(\[(?:complexRegex(?: , |, ?|\]))+)
Demo: http://regex101.com/r/pW2oZ1/1
2/ Assuming you want all elements of the list matched separately, this is much more complex (at least as far as I know). Here is a working (complex) solution:
(?:\[|(?!\[)\G(?: , |, ?))(complexRegex)(?=(?:(?: , |, ?)complexRegex)*\])
Demo: http://regex101.com/r/iB3jD1/2
I don't have the time to write an explanation right now if it's needed. Ask for it in the comments if you want one, I'll write it later today. Sorry...
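If the regex is assembled in code, one way to write the element pattern only once is to keep it in a variable and interpolate it. The sketch below does this in Python; ITEM is a hypothetical placeholder for the very long element regex, and the bracketed lists come from the question.

```python
import re

# Hypothetical stand-in for the long element pattern; defined once.
ITEM = r"validItem"
# Delimiter from the question: " , " or "," with an optional trailing space.
SEP = r"(?: , |, ?)"

# Brackets around an optional run of items separated by SEP.
LIST_RE = re.compile(rf"\[(?:{ITEM}(?:{SEP}{ITEM})*)?\]")

def is_valid_list(text):
    """True when the whole string is a well-formed bracketed list."""
    return LIST_RE.fullmatch(text) is not None

for sample in ["[]", "[validItem]", "[validItem,validItem, validItem]",
               "[validItem,invalidItem]", "[validItemvalidItem]",
               "[validItem, validItem ]"]:
    print(sample, is_valid_list(sample))
```

The compiled pattern still contains the element twice, but the source defines it in one place, so it only has to be maintained once.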

How do I properly format this Regex search in R? It works fine in the online tester

In R, I have a column of data in a data-frame, and each element looks something like this:
Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae
What I want is the section after the last semicolon. I've been trying to use sub, duplicating the existing column to create a new one with just the endings kept. In essence, I want this (the genus):
Marinilabiaceae
A snippet of the code looks like this:
mydata$new_column<- sub("([\\s\\S]*;)", "", mydata$old_column)
In this situation, I am using \\ rather than \ because of R's escape sequences. The sub replaces the parts I don't want and updates it to the new column. I've tested the Regex several times in places such as this: http://regex101.com/r/kS7fD8/1
However, I'm still struggling because the results are very bizarre. Now my new column is populated with the organism's domain rather than the genus: Bacteria.
How do I resolve this? Are there any good easy-to-understand resources for learning more about R's Regex formats?
Starting with your simple string,
string <- "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"
You can remove everything up to the last semicolon with "^(.*);" in your call to sub
> sub("^(.*);", "", string)
# [1] "Marinilabiaceae"
You can also use strsplit with tail
> tail(strsplit(string, ";")[[1]], 1)
# [1] "Marinilabiaceae"
Your regular expression, ([\\s\\S]*;), works on regex101 because that tester defaults to pcre (php) (see "Flavor" in the top-left corner), where [\s\S] matches any character. R's regex syntax is slightly different: sub does not use PCRE unless you pass perl = TRUE, and its default engine does not treat \s and \S inside a bracket expression the way PCRE does. R also requires extra backslash escape characters in many situations. For reference, this R text processing wiki has come in handy for me many times before.
Make it greedy and get the matched group at the desired index.
(.*);(.*)
^^^------- Marinilabiaceae
Here is regex101 demo
Or, to get the first word, use the non-greedy way:
(.*?);(.*)
Bacteria -----^^^
Here is demo
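The greedy and non-greedy behaviour is the same in most engines, so it can be checked quickly outside R. The sketch below uses Python's re module purely as an illustration; it is not R code, and the variable names are made up for the example.

```python
import re

taxonomy = "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"

# Greedy (.*) runs to the LAST semicolon, so group 2 is the final field.
greedy = re.match(r"(.*);(.*)", taxonomy)
print(greedy.group(2))

# Lazy (.*?) stops at the FIRST semicolon, so group 1 is the first field.
lazy = re.match(r"(.*?);(.*)", taxonomy)
print(lazy.group(1))

# Substitution form: delete everything up to the last semicolon.
print(re.sub(r"^.*;", "", taxonomy))
```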
To extract everything after the last ; to the end of the line you can use:
[^;]*?$

Regular expression for a list of items separated by comma or by comma and a space

Hey,
I can't figure out how to write a regular expression for my website. I would like to let the user input a list of items (tags) separated by a comma or by a comma and a space, for example "apple, pie,applepie". Is it possible to write such a regexp?
Thanks!
EDIT:
I would like a regexp for javascript in order to check the input before the user submits a form.
What you're looking for is deceptively easy:
[^,]+
This will give you every comma-separated token, and will exclude empty tokens (if the user enters "a,,b" you will only get 'a' and 'b'), BUT it will break if they enter "a, ,b".
If you want to strip the spaces from either side properly (and exclude whitespace only elements), then it gets a tiny bit more complicated:
[^,\s][^\,]*[^,\s]*
However, as has been mentioned in some of the comments, why do you need a regex where a simple split and trim will do the trick?
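For comparison, the split-and-trim approach can be sketched as below. It is written in Python for illustration, although the question targets JavaScript, where split and trim can be combined the same way. The function name parse_tags is made up for this example.

```python
def parse_tags(raw):
    """Split on commas, trim each piece, and drop empty pieces."""
    return [piece.strip() for piece in raw.split(",") if piece.strip()]

print(parse_tags("apple, pie,applepie"))
print(parse_tags("a, ,b"))
```

Unlike [^,]+, this does not return a whitespace-only token for input such as "a, ,b".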
Assuming the words in your list may be letters from a to z and you allow, but do not require, a space after the comma separators, your reg exp would be
[a-z]+(,\s*[a-z]+)*
This will match "ab" or "ab, de", but not "ab ,dc"
Here's a simpler solution:
console.log("test, , test".match(/[^,(?! )]+/g));
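It doesn't break on empty items and strips spaces before and after items. Note that inside a character class (?! ) is not a lookahead: the class simply excludes commas, spaces and the literal characters (, ?, ! and ), so an item that contains a space is split in two.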
This thread is almost 7 years old and was last active 5 months ago, but I wanted to achieve the same results as OP and after reading this thread, came across a nifty solution that seems to work well
.match(/[^,\s?]+/g)
Regarding the regular expression... I suppose a more accurate statement would be to say "target anything that IS NOT a comma followed by any (optional) amount of white space" ?
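A quick way to see what these character classes really do is to run them on input that contains the characters in question. Inside [...] the ? is a literal question mark and (?! ) is just a handful of literal characters, not a quantifier or a lookahead. The sketch below uses Python, where these classes behave the same way as in JavaScript; the sample strings are made up.

```python
import re

# [^,\s?] means: any character that is not a comma, whitespace, or "?".
tokens_a = re.findall(r"[^,\s?]+", "what?, apple pie")
print(tokens_a)

# [^,(?! )] means: anything except "," "(" "?" "!" " " ")".
tokens_b = re.findall(r"[^,(?! )]+", "test, , test (a)")
print(tokens_b)
```

Both patterns therefore split on every space, so "apple pie" comes back as two tokens rather than one.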
I often work with comma-separated patterns, and for me, this works:
((^|[,])pattern)+
where "pattern" is the single element regexp
This might work:
([^,]*)(, ?([^,]*))*
([^,]*)
Look for commas within the given string, then split on them. As for the whitespace: can't you just split on the commas and remove the whitespace afterwards?
I needed strict validation for comma-separated input of alphabetic characters, no spaces. I ended up using this one, in case anyone needs it:
/^[a-z]+(,[a-z]+)*$/
Or, to support lower- and uppercase words:
/^[A-Za-z]+(?:,[A-Za-z]+)*$/
In case one needs to allow whitespace between words (on both sides of the comma in the first pattern, only after it in the second):
/^[A-Za-z]+(?:\s*,\s*[A-Za-z]+)*$/
/^[A-Za-z]+(?:,\s*[A-Za-z]+)*$/
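The difference between the two whitespace-tolerant patterns shows up on input with a space before the comma. The sketch below checks this in Python; the constant names BOTH_SIDES and AFTER_ONLY and the helper is_valid are made up for the example.

```python
import re

# Whitespace allowed on both sides of each comma.
BOTH_SIDES = re.compile(r"^[A-Za-z]+(?:\s*,\s*[A-Za-z]+)*$")
# Whitespace allowed only after each comma.
AFTER_ONLY = re.compile(r"^[A-Za-z]+(?:,\s*[A-Za-z]+)*$")

def is_valid(pattern, text):
    """True when the whole input is a valid comma-separated word list."""
    return pattern.search(text) is not None

for text in ["apple, pie,applepie", "apple ,pie", "apple,,pie"]:
    print(repr(text), is_valid(BOTH_SIDES, text), is_valid(AFTER_ONLY, text))
```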
You can try this, it worked for me:
/.+?[\|$]/g
or
/[^\|?]+/g
but replace '|' with the separator you need. Also, don't forget to escape it.
Something like this should work: ((apple|pie|applepie),\s?)*