Using a regular expression to extract substring - regex

I'm trying to extract a substring in R, using stringr. Some time ago I wrote a script that did the job; however, it does not work anymore. Probably due to an update, but I don't know.
My string looks like(!) this: myStr <- " layout = (3,3); //lala".
The string will always contain the layout keyword, the equal sign and the two braces (open ... close). However, the number of arguments in between can vary: (1,23,455,22) would also be possible. The part after ) can vary as well.
I would like to obtain the substring starting from ( and ending with ). Thus this example must give (3,3); others may give e.g. (1,23,455,22).
Up to now I used this:
library(stringr)
str_extract(" layout = (3,3); //lala", "*\\(.*\\)")
However this does not work anymore. It gives me this error:
Error in stri_extract_first_regex(string, pattern, opts_regex = attr(pattern, :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
It used to work in the past. What is wrong with this regular expression?
EDIT:
If the string contains two pairs of braces, the substring should select the left pair (the other is commented out with //):
Str <- "layout = (1,2,3,4) //lala(huhu)"
gsub(".*([(])(.*)([)]).*", "\\1\\2\\3", Str)
#gives "(huhu)" which is not good; should be (1,2,3,4)

Your regex "*\\(.*\\)" is not correct: it starts with *, a quantifier, and a quantifier must follow something it can repeat. The regex engine checks this while parsing the expression, which is why you get the syntax error.
The substring should select the left pair
Use lazy matching in the left part - .*?:
myStr <- "layout = (1,2,3,4) //lala(huhu)"
gsub(".*?(\\([^()]*\\)).*", "\\1", myStr)
## ^^^
Result: [1] "(1,2,3,4)"
Lazy matching will ensure we match as few characters as possible before the first occurrence of the subsequent pattern.
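For comparison, here is what the greedy variant of the same pattern returns on the same string (a small illustration of why the .*? matters):
myStr <- "layout = (1,2,3,4) //lala(huhu)"
gsub(".*(\\([^()]*\\)).*", "\\1", myStr)
#[1] "(huhu)"
With greedy .*, the engine consumes as much as it can and only backtracks as far as needed for the rest of the pattern to match, so the capture lands on the last parenthesised pair instead of the first.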
Note that if you want to extract multiple (number,number....) values, you need to use
library(stringr)
str_extract(Str,"\\(\\d+(\\s*,\\d+)*\\)")
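Applied to the string with the commented-out pair, this pattern skips (huhu) because it only matches parenthesised lists of digits (a small check, reusing the sample string from above):
str_extract("layout = (1,2,3,4) //lala(huhu)", "\\(\\d+(\\s*,\\d+)*\\)")
#[1] "(1,2,3,4)"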

If the numbers and commas along with the parentheses need to be extracted, use these patterns:
str_extract(Str,"\\([0-9,]+\\)")
#[1] "(1,2,3,4)"
str_extract(myStr,"\\([0-9,]+\\)")
#[1] "(3,3)"

Related

Matching last and first bracket in gsub/r and leaving the remaining content intact

I'm working with a character vector of the following format:
[-0.2122,-0.1213)
[-0.2750,-0.2122)
[-0.1213,-0.0222)
[-0.1213,-0.0222)
I would like to remove [ and ) so I can get the desired result resembling:
-0.2122,-0.1213
-0.2750,-0.2122
-0.1213,-0.0222
-0.1213,-0.0222
Attempts
1 - Groups
I was thinking of capturing the first and second group, along the lines of the syntax:
[[^\[{1}(?![[:digit:]])\){1}
but it doesn't seem to work (regex101).
2 - Punctuation
The pattern [[:punct:]] will capture all punctuation (regex101).
3 - Groups again
Then I tried to match the two groups: (\[)(\)), but, again, no luck (regex101).
The problem can easily be solved by applying gsub twice or by making use of multigsub from the qdap package, but I'm interested in solving this via one expression, if possible.
You could try using lookaheads and lookbehinds in Perl-style regular expressions.
x <- scan(what = character(),
text = "[-0.2122,-0.1213)
[-0.2750,-0.2122)
[-0.1213,-0.0222)
[-0.1213,-0.0222)")
regmatches(x, regexpr("(?<=\\[).+(?=\\))", x, perl = TRUE))
# [1] "-0.2122,-0.1213" "-0.2750,-0.2122" "-0.1213,-0.0222" "-0.1213,-0.0222"

R digit-expression and unlist doesn't work

So I've bought a book on R and automated data collection, and one of the first examples is leaving me baffled.
I have a table with a date-column consisting of numbers looking like this "2001-". According to the tutorial, the line below will remove the "-" from the dates by singling out the first four digits:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]4$"))
When I run this command, "yend_clean" is simply set to "character (empty)".
If I remove the "4$", I get all of the dates split into atoms so that the list that originally looked like this "1992", "2003" now looks like this "1", "9" etc.
So I suspect that something around the "4$" is the problem. I can't find any documentation on this that helps me figure out the correct solution.
Was hoping someone in here could point me in the right direction.
This is a regular expression question. Your regular expression is wrong. Use:
unlist(str_extract_all("2003-", "^[[:digit:]]{4}"))
or equivalently
sub("^(\\d{4}).*", "\\1", "2003-")
or if really all you want is to remove the "-"
sub("-", "", "2003-")
Repetition in regular expressions is controlled by the {} parameter. You were missing that. Additionally $ means match the end of the string, so your expression translates as:
match any single digit, followed by a 4, followed by the end of the string
When you remove the "4", then the pattern becomes "match any single digit", which is exactly what happens (i.e. you get each digit matched separately).
The pattern I propose says instead:
match the beginning of the string (^), followed by a digit repeated four times.
The sub variation is a very common technique where we create a pattern that matches what we want to keep in parentheses, and then everything else outside of the parentheses (.* matches anything, any number of times). We then replace the entire match with just the piece in the parens (\\1 means the first sub-expression in parentheses). \\d is equivalent to [[:digit:]].
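As a quick check of that sub technique on more than one value at once (sub is vectorised over the character vector it is applied to):
sub("^(\\d{4}).*", "\\1", c("1992-", "2003-"))
# [1] "1992" "2003"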
If you mean the book Automated Data Collection with R, the code could be like this:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]{4}[-]$"))
yend_clean <- unlist(str_extract_all(yend_clean, "^[[:digit:]]{4}"))
This assumes that you have a string, "1993–2007, 2010-", and you want to get the last given year, which is "2010". The first line, which means four digits followed by a dash at the end, returns "2010-", and the second line returns "2010".
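If all you need is that last year, a one-step alternative is a lookahead; this is just a sketch, and it assumes the wanted year is always the one immediately followed by the final "-":
str_extract("1993-2007, 2010-", "[[:digit:]]{4}(?=-$)")
#[1] "2010"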

How to apply conditional treatment with line.endswith(x) where x is a regex result?

I am trying to apply conditional treatment to lines in a file (represented by values in a list for demonstration purposes below) and would like to use a regex in the endswith(x) method, where x is a range (page-[1-100]).
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-2']
for line in lines:
    if line.startswith('http') and line.endswith('page-2'):
        print line
So the required functionality is that if the value starts with http and ends with a page in the range of 1-100 then it will be returned.
Edit: After reflecting on this, I guess the corollary questions are:
How do I make a regex pattern, i.e. page-[1-100], a variable?
How do I then use this variable eg x in endswith(x)
Edit:
This is not an answer to the original question (i.e. it does not use startswith() and endswith()), and I have no idea if there are problems with this, but this is the solution I used (because it achieved the same functionality):
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-100']
for line in lines:
    match_beg = re.search(r'^http://', line)
    match_both = re.search(r'^http://.*page-(?:[1-9]|[1-9]\d|100)$', line)
    if match_beg and not match_both:
        print match_beg.group()
    elif match_beg and match_both:
        print match_both.group()
I don't know python well enough to paste usable code, but as far as the regular expression is concerned, this is rather trivial to do:
page-(?:[2-9]|[1-9]\d|100)$
What this expression will match:
page- is just a fixed string that will be matched 1:1 (case insensitive if you set Options for that).
(?:...) is a non-capturing group that's just used for separating the following branching.
| all act as "either or" with the expressions being to their left/right.
[2-9] will match this numerical range, i.e. 2-9.
[1-9]\d will match any two-digit number (10-99); \d matches any digit.
100 is again a plain and simple match.
$ will match the line end or end of string (again based on settings).
Using this expression you don't use any specific "ends with" functionality (that's given through using $).
Considering this will have to parse the whole string anyway, you may include the "begins with" check as well, which shouldn't cause any additional overhead (at least none you'd notice):
^http://.*page-(?:[2-9]|[1-9]\d|100)$
^ matches the beginning of the line or string (based on settings).
http:// is once again a plain match.
. will match any character.
* is a quantifier "none or more" for the previous expression.
To get you going in the right direction, the regex that matches your needed range of pages is:
^http.*page-([2-9]|[1-9][0-9]|100)$
This will match lines that start with http and end with page-<2 to 100> inclusive.

how to use a regular expression to extract json fields?

Beginner RegExp question. I have lines of JSON in a text file, each with slightly different fields, but there are 3 fields I want to extract for each line if it has them, ignoring everything else. How would I use a regex (in editpad or anywhere else) to do this?
Example:
"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24
I want to extract URL, TITLE, TAGS.
/"(url|title|tags)":"((\\"|[^"])*)"/i
I think this is what you're asking for. I'll provide an explanation momentarily. This regular expression (delimited by / - you probably won't have to put those in editpad) matches:
"
A literal ".
(url|title|tags)
Any of the three literal strings "url", "title" or "tags" - in Regular Expressions, by default Parentheses are used to create groups, and the pipe character is used to alternate - like a logical 'or'. To match these literal characters, you'd have to escape them.
":"
Another literal string.
(
The beginning of another group. (Group 2)
(
Another group (3)
\\"
The literal string \" - you have to escape the backslash because otherwise it will be interpreted as escaping the next character, and you never know what that'll do.
|
or...
[^"]
Any single character except a double quote. The brackets denote a character class/set, or a list of characters to match. Any given class matches exactly one character in the string. Using a caret (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.
)
End of group 3...
*
The asterisk causes the previous regular expression (in this case, group 3) to be repeated zero or more times, in this case causing the matcher to match anything that could be inside the double quotes of a JSON string.
)"
The end of group 2, and a literal ".
I've done a few non-obvious things here, that may come in handy:
Group 2 - when dereferenced using Backreferences - will be the actual string assigned to the field. This is useful when getting the actual value.
The i at the end of the expression makes it case insensitive.
Group 1 contains the name of the captured field.
EDIT: So I see that the tags are an array. I'll update the regular expression here in a second when I've had a chance to think about it.
Your new Regex is:
/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i
All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)") with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, substituting our string regex out for the letter S, we can rewrite it as:
\[(S(,S)*)?\]
Which matches a literal opening bracket (hence the backslashes), optionally followed by a comma separated list of strings, and a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly referred to as 'making the previous expression optional', it can also be thought of as exactly 0 or 1 matches.
With our same S Notation, here's the whole dirty Regular Expression:
/"(url|title|tags)":(S|\[(S(,S)*)?\])/i
This question is a bit older, but I browsed a bit on my PC and found this expression. I have published it as a gist; it could be useful to others.
EDIT:
# Expression was tested with PHP and Ruby
# This regular expression finds a key-value pair in JSON formatted strings
# Match 1: Key
# Match 2: Value
# https://regex101.com/r/zR2vU9/4
# http://rubular.com/r/KpF3suIL10
(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)
# test document
[
{
"_id": "56af331efbeca6240c61b2ca",
"index": 120000,
"guid": "bedb2018-c017-429E-b520-696ea3666692",
"isActive": false,
"balance": "$2,202,350",
"object": {
"name": "am",
"lastname": "lang"
}
}
]
The JSON string you'd like to extract a field value from:
{"fid":"321","otherAttribute":"value"}
The following regex extracts exactly the "fid" field value, "321":
(?<=\"fid\":\")[^\"]*
Please try the expression below:
/"(url|title|tags)":("([^""]+)"|\[[^[]+])/gm
Explanation:
1st Capturing Group (url|title|tags): This is alternatively capturing the characters 'url','title' and 'tags' literally (case sensitive).
2nd Capturing Group ("([^""]+)"|[[^[]+]):
1st Alternative "([^""]+)" matches everything between " and ", including the quotes
2nd Alternative [[^[]+] matches everything between [ and ], including the brackets
I adapted regex to work with JSON in my own library. I've detailed the algorithm's behavior below.
First, stringify the JSON object. Then, you need to store the starts and lengths of the matched substrings. For example:
"matched".search("ch") // yields 3
For a JSON string, this works exactly the same (unless you are searching explicitly for commas and curly brackets, in which case I'd recommend some prior transform of your JSON object before performing the regex, i.e. think :, {, }).
Next, you need to reconstruct the JSON object. The algorithm I authored does this by detecting JSON syntax by recursively going backwards from the match index. For instance, the pseudo code might look as follows:
find the next key preceding the match index, call this theKey
then find the number of all occurrences of this key preceding theKey, call this theNumber
using the number of occurrences of all keys with the same name as theKey up to the position of theKey, traverse the object until a key named theKey has been discovered theNumber times
return this object called parentChain
With this information, it is possible to use regex to filter a JSON object to return the key, the value, and the parent object chain.
You can see the library and code I authored at http://json.spiritway.co/
If your JSON is
{"key1":"abc","key2":"xyz"}
then the regex below will extract key1 or key2, based on the key that you pass in the regex:
"key2(.*?)(?=,|}|$)
You can verify it at regex101.com.
Why does it have to be a Regular Expression object?
Here we can just use a Hash object first and then go search it.
mh = {"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],"index":2931,"time_created":1345419323,"num_saves":24}
The output of which would be
=> {:url=>"http://www.netcharles.com/orwell/essays.htm", :domain=>"netcharles.com", :title=>"Orwell Essays & Journalism Section - Charles' George Orwell Links", :tags=>["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"], :index=>2931, :time_created=>1345419323, :num_saves=>24}
Not that I want to avoid using Regexp, but don't you think it would be easier to take it a step at a time until you're getting the data you want to further search through? Just MHO.
mh.values_at(:url, :title, :tags)
The output:
["http://www.netcharles.com/orwell/essays.htm", "Orwell Essays & Journalism Section - Charles' George Orwell Links", ["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"]]
Taking the pattern that FrankieTheKneeman gave you:
pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i
we can search the mh hash by converting it to a json object.
/#{pattern}/.match(mh.to_json)
The output:
=> #<MatchData "\"url\":\"http://www.netcharles.com/orwell/essays.htm\"" 1:"url" 2:"http://www.netcharles.com/orwell/essays.htm" 3:"m">
Of course this is all done in Ruby which is not a tag that you have but relates I hope.
But oops! Looks like we can't do all three at once with that pattern, so I will do them one at a time just for the sake of it.
pattern = /"(title)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
#<MatchData "\"title\":\"Orwell Essays & Journalism Section - Charles' George Orwell Links\"" 1:"title" 2:"Orwell Essays & Journalism Section - Charles' George Orwell Links" 3:"s">
pattern = /"(tags)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
=> nil
Sorry about that last one. It will have to be handled differently.
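For what it's worth, the array case can be pulled out with a lazy bracket match; here is a hedged sketch in R (it assumes a tag value never contains a ] character, and the sample line is shortened for illustration):
line <- '"title":"Orwell Essays","tags":["orwell","writing"],"index":2931'
regmatches(line, regexpr('"tags":\\[.*?\\]', line, perl = TRUE))
# returns the whole "tags":["orwell","writing"] chunk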

R: Find the last dot in a string

In R, is there a better/simpler way than the following of finding the location of the last dot in a string?
x <- "hello.world.123.456"
g <- gregexpr(".", x, fixed=TRUE)
loc <- g[[1]]
loc[length(loc)] # returns 16
This finds all the dots in the string and then returns the last one, but it seems rather clumsy. I tried using regular expressions, but didn't get very far.
Does this work for you?
x <- "hello.world.123.456"
g <- regexpr("\\.[^\\.]*$", x)
g
\. matches a dot
[^\.] matches everything but a dot
* specifies that the previous expression (everything but a dot) may occur between 0 and unlimited times
$ marks the end of the string.
Taking everything together: find a dot that is followed by anything but a dot until the string ends. R requires \ to be escaped, hence \\ in the expression above. See regex101.com to experiment with regex.
How about a minor syntax improvement?
This will work for your literal example where the input vector is of length 1. Use escapes to get a literal "." search, and reverse the result to get the last index as the "first":
rev(gregexpr("\\.", x)[[1]])[1]
A more proper vectorized version (in case x is longer than 1):
sapply(gregexpr("\\.", x), function(x) rev(x)[1])
and another, tidier option is to use tail instead:
sapply(gregexpr("\\.", x), tail, 1)
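For instance, with an input vector of length two (a small check; the second string is made up):
xs <- c("hello.world.123.456", "a.b.c")
sapply(gregexpr("\\.", xs), tail, 1)
# [1] 16  4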
Someone posted the following answer which I really liked, but I notice that they've deleted it:
regexpr("\\.[^\\.]*$", x)
I like it because it directly produces the desired location, without having to search through the results. The regexp is also fairly clean, which is a bit of an exception where regexps are concerned :)
There is a slick stri_locate_last function in the stringi package that can accept both literal strings and regular expressions.
To just find a dot, no regex is required, and it is as easy as
stringi::stri_locate_last_fixed(x, ".")[,1]
If you need to use this function with a regex, to find the location of the last regex match in the string, you should replace _fixed with _regex:
stringi::stri_locate_last_regex(x, "\\.")[,1]
Note that . is a special regex metacharacter and should be escaped when used in a regex to match a literal dot character.
Here is an R demo:
x <- "hello.world.123.456"
stringi::stri_locate_last_fixed(x, ".")[,1]
stringi::stri_locate_last_regex(x, "\\.")[,1]
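Both calls return 16 here, the position of the last dot in x.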