Regular Expression Match (get multiple stuff in a group) - regex

I am having trouble with this regular expression.
The string below is all on one line. I want to extract the entries in swatchColorList, specifically the names Natural Burlap, Navy, and Red.
I have tried '[(.*?)]' to get everything inside the brackets, but I would really like to do the whole extraction with a single expression. Is that possible, or do I need two steps?
Thanks
{"id":"1349306","categoryName":"Kids","imageSource":"7/optimized/8769127_fpx.tif","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],"suppressColorSwatches":false,"primaryColor":"Natural Burlap","clickableSwatch":true,"selectedColorNameID":"Natural Burlap","moreColors":false,"suppressProductAttribute":false,"colorFamily":{"Natural Burlap":"Ivory/Cream"},"maxQuantity":6}

You can try this regex
(?<=[[,]\{\")[^"]+
If lookbehind is not supported, you can use
[[,]\{"([^"]+)
This will save the needed word in group 1.
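As a rough illustration, both patterns can be tried in Python. This is only a sketch, not part of the answer above: the sample string is an abbreviated copy of the JSON from the question, the variable names are made up, and the opening bracket inside the character class is written as \[ because Python warns about a bare [ in that position.

```python
import re

# Abbreviated copy of the JSON line from the question (assumption: only the
# keys relevant to the example are kept).
sample = (
    '{"id":"1349306","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},'
    '{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],'
    '"primaryColor":"Natural Burlap","colorFamily":{"Natural Burlap":"Ivory/Cream"}}'
)

# Capture-group form: a '[' or ',' followed by '{"', then the key name in group 1.
names = re.findall(r'[\[,]\{"([^"]+)', sample)

# Lookbehind form: same idea, but the whole match is the key name.
names_lookbehind = re.findall(r'(?<=[\[,]\{")[^"]+', sample)

print(names)
```

Both patterns rely on the exact punctuation around each swatch entry, so any other list of objects in the same string would be picked up as well.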

import json
# Avoid naming the variable str, which would shadow the built-in type.
raw = '{"id":"1349306","categoryName":"Kids","imageSource":"7/optimized/8769127_fpx.tif","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],"suppressColorSwatches":false,"primaryColor":"Natural Burlap","clickableSwatch":true,"selectedColorNameID":"Natural Burlap","moreColors":false,"suppressProductAttribute":false,"colorFamily":{"Natural Burlap":"Ivory/Cream"},"maxQuantity":6}'
obj = json.loads(raw)
words = []
# Each swatch is a one-key dict; the key is the color name.
for thing in obj["swatchColorList"]:
    for word in thing:
        words.append(word)
        print(word)
Output will be
Natural Burlap
Navy
Red
The words are also stored in the words list. I realize this is not a regex, but I want to discourage using regexes on serialized object notations: regular expressions are not designed to parse strings with nested structures.

Related

Regular expression to place number pair in square brackets

I have a large data file with sequences of numbers bearing the form
6.06038475036627,50.0646896362306\r\n
6.0563435554505,50.0635681152345\r\n
6.05446767807018,50.0632934570313\r\n
which I am trying to modify in Notepad++ so it reads
[6.06038475036627,50.0646896362306]\r\n
[6.0563435554505,50.0635681152345]\r\n
[6.05446767807018,50.0632934570313]\r\n
I can count the number of instances of these occurrences with a relatively simple regex \d{1,2}\.\d+\,\d{1,2}\.\d+. However, there my own regex skills hit the buffers. I am dimly aware that it is possible to go a step further and perform the actual modifications but I have no idea how that should be done.
You would simply need to do as follows:
Find what: (\d+\.\d+,\d+\.\d+)
Replace with: [\1]
Make sure that Regular Expression is checked.
Given this, it will transform this:
6.06038475036627,50.0646896362306\r\n
6.0563435554505,50.0635681152345\r\n
6.05446767807018,50.0632934570313\r\n
Into this:
[6.06038475036627,50.0646896362306]\r\n
[6.0563435554505,50.0635681152345]\r\n
[6.05446767807018,50.0632934570313]\r\n
The expression above will match the comma-separated numbers and put them in a group. The replacement inserts a [, followed by the matched group (denoted by \1), followed by a ].
Try the following regexp (with substitution):
\b(\d{1,2}\.\d+,\d{1,2}\.\d+)\b
https://regex101.com/r/VkHppp/1
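For anyone doing the same replacement outside Notepad++, here is a minimal Python sketch of the capture-and-wrap idea. It is an illustration only: the variable names are invented, and the sample data is the three lines from the question joined with real CRLF line endings.

```python
import re

# The three sample lines from the question, with CRLF line endings.
lines = (
    "6.06038475036627,50.0646896362306\r\n"
    "6.0563435554505,50.0635681152345\r\n"
    "6.05446767807018,50.0632934570313\r\n"
)

# Group 1 captures the whole "number,number" pair; \1 re-inserts it
# between the added square brackets.
wrapped = re.sub(r"(\d+\.\d+,\d+\.\d+)", r"[\1]", lines)

print(wrapped)
```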

How do I properly format this Regex search in R? It works fine in the online tester

In R, I have a column of data in a data-frame, and each element looks something like this:
Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae
What I want is the section after the last semicolon. I've been trying to use 'sub' to create a new column, copied from the existing one, with only the endings kept. In essence, I want this (the genus):
Marinilabiaceae
A snippet of the code looks like this:
mydata$new_column<- sub("([\\s\\S]*;)", "", mydata$old_column)
In this situation, I am using \\ rather than \ because of R's escape sequences. The sub replaces the parts I don't want and updates it to the new column. I've tested the Regex several times in places such as this: http://regex101.com/r/kS7fD8/1
However, I'm still struggling because the results are very bizarre. Now my new column is populated with the organism's domain rather than the genus: Bacteria.
How do I resolve this? Are there any good easy-to-understand resources for learning more about R's Regex formats?
Starting with your simple string,
string <- "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"
You can remove everything up to the last semicolon with "^(.*);" in your call to sub
> sub("^(.*);", "", string)
# [1] "Marinilabiaceae"
You can also use strsplit with tail
> tail(strsplit(string, ";")[[1]], 1)
# [1] "Marinilabiaceae"
Your regular expression, ([\\s\\S]*;), most likely fails because of the regex flavor rather than the idea behind the pattern. In PCRE, [\s\S] matches any character, which is why it worked on the regex101 site: that tester defaults to pcre (php) (see "Flavor" in the top-left corner). R's default engine follows POSIX-style extended syntax, where shorthand classes such as \s and \S are not reliably interpreted inside a bracket expression. Passing perl = TRUE to sub switches R to PCRE, or you can use the simpler pattern above. R also requires extra backslash escape characters in many situations. For reference, this R text processing wiki has come in handy for me many times before.
Make the first group greedy and take the match from the group you need:
(.*);(.*)
Group 2 holds Marinilabiaceae.
Here is regex101 demo
Or, to get the first word, make the first group non-greedy:
(.*?);(.*)
Group 1 holds Bacteria.
Here is demo
To extract everything after the last ; to the end of the line you can use:
[^;]*?$
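The approaches in this thread translate to other languages too. Below is a small Python sketch (not R, and not taken from the answers above) that shows the same three ideas side by side; the variable names are illustrative.

```python
import re

taxonomy = "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"

# Greedy match from the start up to the last semicolon, then delete it.
by_sub = re.sub(r"^.*;", "", taxonomy)

# Match the trailing run of non-semicolon characters at the end of the string.
by_search = re.search(r"[^;]*$", taxonomy).group(0)

# No regex at all: split once from the right and keep the last piece.
by_split = taxonomy.rsplit(";", 1)[-1]

print(by_sub, by_search, by_split)
```

The split-based version is usually the easiest to read when the delimiter is a single fixed character.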

Regular Expression for phrases starting with TO

I am pretty new to regular expressions. I want to write one that matches TO followed by the rest of the phrase, even when the phrase continues on the next line. I tried this, but it doesn't work properly:
^TO\n?\s?[A-Za-z0-9]\n?[A-Za-z0-9]
It only highlights TO W11 correctly, where everything is on one line. For the first entry it highlights only TO, and for the third it highlights only the first line. Basically, it doesn't read across the newlines.
Some of my data looks like this:
TO
EXTERNAL
TRAVERSE
TO W11
TO CONTROL
TRAVERSE
I would appreciate if anybody can help me.
Make sure you use a multiline regex:
var options = RegexOptions.Multiline;
foreach (Match match in Regex.Matches(input, pattern, options))
...
More at: http://msdn.microsoft.com/en-us/library/yd1hzczs(v=vs.110).aspx
It looks like your pattern isn't matching because the start of the string is really a space and not the T character. Also, [A-Za-z0-9] matches only one character, and you want the whole word. I used the + to denote that I want one or more matches of those characters.
(TO\n?\s?[A-Za-z0-9]+)
This regex matches "TO EXTERNAL", "TO W11" and "TO CONTROL". Be sure to use the global modifier so that you get all matches, not just the first one.
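To see the multiline flag and the + quantifier working together, here is a hedged Python sketch. It is not the C# code from the answers: the pattern is a variant that anchors TO to the start of a line and allows leading spaces or tabs, and the sample text assumes the data uses plain \n line endings with no indentation.

```python
import re

# Sample data from the question (assumption: plain "\n" line endings).
data = "TO\nEXTERNAL\nTRAVERSE\nTO W11\nTO CONTROL\nTRAVERSE"

# re.MULTILINE makes ^ match at the start of every line, not only at the
# start of the string. \s+ lets the word after TO sit on the next line.
pattern = re.compile(r"^[ \t]*(TO\s+[A-Za-z0-9]+)", re.MULTILINE)

# Collapse the newline or space after TO into a single space for display.
phrases = [" ".join(m.split()) for m in pattern.findall(data)]

print(phrases)
```

Anchoring with ^ also keeps the pattern from matching TO in the middle of a longer word.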

What regular expression can I use to find the Nᵗʰ entry in a comma-separated list?

I need a regular expression that can be used to find the Nth entry in a comma-separated list.
For example, say this list looks like this:
abc,def,4322,mail@mailinator.com,3321,alpha-beta,43
...and I wanted to find the value of the 6th entry (alpha-beta).
My first thought would be not to use a regular expression at all, but to split the string into an array on the comma. But since you asked for a regex:
Most regex flavors let you specify a minimum or maximum number of repetitions, so something like this would probably work.
/(?:[^,]*,){5}([^,]*)/
This is intended to match any number of characters that are not a comma, followed by a comma, exactly five times: (?:[^,]*,){5} (the ?: says not to capture). It then matches and captures any number of characters that are not a comma: ([^,]*). You want the first capture group.
Let me know if you need more info.
EDIT: I edited the above to not capture the first part of the string. This regex works in C# and Ruby.
You could use something like:
([^,]*,){$m}([^,]*),
As a starting point. (Replace $m with the value of (n-1).) The content would be in capture group 2. This doesn't handle things like lists of size n, but that's just a matter of making the appropriate modifications for your situation.
@list = split /,/ => $string;
$it = $list[5];   # zero-based index: the 6th entry (alpha-beta)
or just
$it = (split /,/ => $string)[5];
Beats writing a pattern with a {5} in it every time.
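Both ideas from this thread, the skip-and-capture regex and the plain split, can be sketched in Python as follows. The function names are hypothetical and n is counted from 1, so alpha-beta is entry 6 of the sample list. Neither version handles quoted fields that contain commas.

```python
import re

def nth_entry_regex(csv_line, n):
    """Return the n-th (1-based) entry by skipping n-1 'field,' groups."""
    pattern = r"(?:[^,]*,){%d}([^,]*)" % (n - 1)
    match = re.match(pattern, csv_line)
    return match.group(1) if match else None

def nth_entry_split(csv_line, n):
    """Return the n-th (1-based) entry by splitting on commas."""
    parts = csv_line.split(",")
    return parts[n - 1] if 0 < n <= len(parts) else None

line = "abc,def,4322,mail@mailinator.com,3321,alpha-beta,43"

print(nth_entry_regex(line, 6))
print(nth_entry_split(line, 6))
```

Both helpers return None when the list has fewer than n entries, which is one way to cover the short-list case mentioned above.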

Regex multi word search

What do I use to search for multiple words in a string? I would like the logical operation to be AND so that all the words are in the string somewhere. I have a bunch of nonsense paragraphs and one plain English paragraph, and I'd like to narrow it down by specifying a couple common words like, "the" and "and", but would like it match all words I specify.
Regular expressions support a "lookaround" condition that lets you search for a term within a string and then forget the location of the result; starting at the beginning of the string for the next search term. This will allow searching a string for a group of words in any order.
The regular expression for this is:
^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b)
Where \b is a word boundary and (?=...) is a positive lookahead, one kind of lookaround.
If you have a variable number of words you want to search for, you will need to build this regular expression string with a loop - just wrap each word in the lookaround syntax and append it to the expression.
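The loop described above might look like this in Python. This is a sketch under a few assumptions: the helper name is made up, each word is passed through re.escape so that punctuation in a word is treated literally, and re.DOTALL is added so that .* can cross line breaks inside a paragraph (the expression above does not include that flag).

```python
import re

def build_all_words_pattern(words):
    """Build ^(?=.*\\bw1\\b)(?=.*\\bw2\\b)... from a list of words."""
    lookaheads = "".join(r"(?=.*\b%s\b)" % re.escape(w) for w in words)
    return re.compile("^" + lookaheads, re.DOTALL)

pattern = build_all_words_pattern(["the", "and"])

print(bool(pattern.search("and then the cat sat")))   # both words present
print(bool(pattern.search("there and back again")))   # "the" only inside "there"
```

Because every lookahead starts from the beginning of the string, the order of the words in the text does not matter.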
AND as concatenation
^(?=.*?\b(?:word1)\b)(?=.*?\b(?:word2)\b)(?=.*?\b(?:word3)\b)
OR as alternation
^(?=.*?\b(?:word1|word2|word3)\b)
^(?=.*?\b(?:word1)\b)|^(?=.*?\b(?:word2)\b)|^(?=.*?\b(?:word3)\b)
Maybe a language recognition chart could be used to recognize English. Some quick tests seem to work (this assumes paragraphs are separated by newlines only).
The regexp matches if any one of its alternatives matches: \bword\b is a whole word delimited by boundaries, word\b is a word ending, and a bare word matches anywhere in the paragraph.
my @paragraphs = split(/\n/, $text);
for my $p (@paragraphs) {
    # Any single alternative is enough to flag the paragraph.
    if ($p =~ m/\bthe\b|\band\b|\ban\b|\bin\b|\bon\b|\bthat\b|\bis\b|\bare\b|th|sh|ough|augh|ing\b|tion\b|ed\b|age\b|’s\b|’ve\b|n’t\b|’d\b/) {
        print "Probable english\n$p\n";
    }
}
Firstly I'm not certain what you're trying to return... the whole sentence? The words in between your two given words?
Something like:
\b(word1|word2)\b(\w+\b)*(word1|word2)\b(\w+\b)*\.
(where \b is the word boundary in your language)
would match a complete sentence that contained either of the two words, or both.
You'd probably need to make it case-insensitive so that a word at the start of the sentence will still match.
Assuming PCRE (Perl regexes), I am not sure that you can do it at all easily. The AND operation is concatenation of regexes, but you want to be able to permute the order in which the words appear without having to formally generate the permutation. For N words, when N = 2, it is bearable; with N = 3, it is barely OK; with N > 3, it is unlikely to be acceptable. So, the simple iterative solution - N regexes, one for each word, and iterate ensuring each is satisfied - looks like the best choice to me.
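The iterative approach, one small regex per word with every one required to match, could be sketched in Python like this. The function name and the sample paragraphs are invented for illustration, and case-insensitive matching is an added assumption.

```python
import re

def contains_all_words(text, words):
    """True if every word occurs in text as a whole word, in any order."""
    return all(
        re.search(r"\b%s\b" % re.escape(word), text, re.IGNORECASE)
        for word in words
    )

# Made-up sample paragraphs: only the second contains both words.
paragraphs = ["Xq zzv the blorf", "The cat and the dog", "and nothing else"]
matches = [p for p in paragraphs if contains_all_words(p, ["the", "and"])]

print(matches)
```

This avoids generating permutations: the cost grows linearly with the number of words rather than with the number of possible orderings.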