Find non-alphanumeric characters in a column in R - regex

I have a data frame named frame with IDs and names
1 marisa monte
2 dru hill
3 2pac
4 rã¶yksopp
5 cafã© del mar
6 maria bethã¢nia
This is the expected output
> no_alpha
[1] 4 5 6
I want to pick out the names that contain any non-alphanumeric characters, i.e. get the indexes of the rows with such characters. Said differently, if a name has at least one character outside [a-zA-Z0-9], the row should be included. For some reason, I'm having difficulty doing so.
Below are some variations I tried and their output. I still can't see my error; clearly I'm missing something important, since none of the returned vectors satisfy the condition "anything but alphanumeric".
no_alpha <- grep("[^a-zA-Z0-9]", frame$name)
> no_alpha
integer(0)
no_alpha <- grep("[a-zA-Z0-9]", frame$name, invert = TRUE)
> no_alpha
integer(0)
no_alpha <- grep("[^[:alnum:]_]", frame$name)
> no_alpha
[1] 1 2 3 4 5 6
Any ideas?

It seems you need
x <- c('marisa monte', 'dru hill', '2pac', 'rã¶yksopp', 'cafã© del mar', 'maria bethã¢nia')
grep("[^a-zA-Z0-9[:space:]]", x)
## => [1] 4 5 6
The TRE-compliant regex pattern breaks down as follows:
[^ - start of a negated bracket expression:
a-z - lowercase ASCII letters
A-Z - uppercase ASCII letters
0-9 - ASCII digits
[:space:] - a POSIX character class matching any whitespace
] - end of the bracket expression.
So, it matches any char other than the chars belonging to the ranges/POSIX character class listed above.
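Applied to the original data frame (a quick sketch, assuming the column is frame$name as in the question), the same pattern gives the row indexes directly:
frame <- data.frame(
  id   = 1:6,
  name = c('marisa monte', 'dru hill', '2pac',
           'rã¶yksopp', 'cafã© del mar', 'maria bethã¢nia'),
  stringsAsFactors = FALSE
)
no_alpha <- grep("[^a-zA-Z0-9[:space:]]", frame$name)
no_alpha
## [1] 4 5 6
frame[no_alpha, ]   # the rows containing non-alphanumeric, non-space characters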

Related

Regular Expression for parsing a sports score

I'm trying to validate that a form field contains a valid score for a volleyball match. Here's what I have, and I think it works, but I'm not an expert on regular expressions, by any means:
r'^ *([0-9]{1,2} *- *[0-9]{1,2})((( *[,;] *)|([,;] *)|( *[,;])|[,;]| +)[0-9]{1,2} *- *[0-9]{1,2})* *$'
I'm using python/django, not that it really matters for the regex match. I'm also trying to learn regular expressions, so a more optimal regex would be useful/helpful.
Here are rules for the score:
1. There can be one or more valid set (set=game) results included
2. Each result must be of the form dd-dd, where 0 <= dd <= 99
3. Each additional result must be separated by any of [ ,;]
4. Allow any number of sets >=1 to be included
5. Spaces should be allowed anywhere except in the middle of a number
So, the following are all valid:
25-10 or 25 -0 or 25- 9 or 23 - 25 (could be one or more spaces)
25-10,25-15 or 25-10 ; 25-15 or 25-10 25-15 (again, spaces allowed)
25-1 2 -25, 25- 3 ;4 - 25 15-10
Also, I need each result as a separate unit for parsing. So in the last example above, I need to be able to separately work on:
25-1
2 -25
25- 3
4 - 25
15-10
It'd be great if I could strip the spaces from within each result. I can't just strip all spaces, because a space is a valid separator between result sets.
I think this is a solution to your problem. Note that in Python str.replace() does not take a regex, so use re.sub() with backreferences:
import re
cleaned = re.sub(r"(\d{1,2})\s*-\s*(\d{1,2})", r"\1-\2", score)  # 'score' is the input string
How it works:
(\d{1,2}) - a capture group matching 1 or 2 digits.
\s* - zero or more whitespace characters.
- - a literal hyphen.
\1 - in the replacement, refers to the content of capture group 1.
\2 - in the replacement, refers to the content of capture group 2.

R regular expression: find the last but one match

I need a regular expression script (in R language) which finds the last but one match.
Here is an example:
input = c("(test1(test2(test3","(((((othertest1(othertest2(othertest3")
regexpr('the right regular expression here, which finds the last but one "("', input)
The result has to be 7 and 16, because in the first case the last but one '(' is in the 7th position (from the left), and in the second case the last but one '(' is in the 16th position (from the left).
I've found a regular expression which can find the last match, but I could not transform it in the right way:
\\([^\\(]*$
Thanks for any help!
To match a chunk of text beginning with the last but one (, you may use
"(\\([^(]*){2}$"
Details:
(\\([^(]*){2} - 2 sequences of:
\( - a literal (
[^(]* - zero or more chars other than (
$ - end of string.
R test:
> input = c("(test1(test2(test3","(((((othertest1(othertest2(othertest3")
> regexpr("(\\([^(]*){2}$", input)
[1] 7 16
attr(,"match.length")
[1] 12 22
attr(,"useBytes")
[1] TRUE
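If only the position itself is needed, a base-R alternative (a sketch, not part of the original answer) is to list every '(' with gregexpr() and take the second-to-last one:
input <- c("(test1(test2(test3", "(((((othertest1(othertest2(othertest3")
# all positions of '(' in each string, then the last-but-one of each
sapply(gregexpr("(", input, fixed = TRUE), function(m) m[length(m) - 1])
## [1]  7 16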

Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
Desired results
Desired results should look like that:
indicator period values
1 someindicator 2001 0.2655087
2 someindicator 2011 0.3721239
3 some text 20022008 0.5728534
4 another indicator 2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (everything from the first digit onwards, including that digit) are in the second column
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
separate(col = indicator, into = c("indicator", "period"),
sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that the character just before the split point is a lowercase ASCII letter a-z
 ? matches the literal space preceding it between zero and one time, so an optional space between the text and the digits is consumed by the split
(?=[0-9]) is a positive lookahead - it asserts that the character just after the split point is a digit 0-9
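For comparison, tidyr::extract() achieves a similar split with an explicit regex, one capture group per output column (a sketch, assuming the description part never contains digits):
library(tidyr)
extract(dta, indicator, into = c("indicator", "period"), regex = "^(\\D*?)\\s*(\\d+)$")
#           indicator   period    values
# 1     someindicator     2001 0.2655087
# 2     someindicator     2011 0.3721239
# 3         some text 20022008 0.5728534
# 4 another indicator     2003 0.9082078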
You could also use unglue::unglue_unnest():
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)

R regmatches() and stringr str_extract() dragging whitespaces along

Here's the thing:
test=" 2 15 3 23 12 0 0.18"
#I want to extract the 1st number separately
pattern="^ *(\\d+) +"
d=regmatches(test,gregexpr(pattern,test))
> d
[[1]]
[1] " 2 "
library(stringr)
f=str_extract(test,pattern)
> f
[1] " 2 "
They both bring whitespaces to the result despite usage of ()-brackets. Why? The brackets are for specifying which part of the matched pattern you want, am I wrong? I know I can trim them with trimws() or coerce them directly to numeric, but I wonder if I misunderstand some mechanics of patterns.
Using str_match (or str_match_all)
Since you want to extract a capture group, you can use str_match (or str_match_all). str_extract only extracts whole matches.
From R stringr help:
str_match Extract matched groups from a string.
and
str_extract to extract the complete match
R code:
library(stringr)
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
f=str_match(test,pattern)
f[[2]]
## [1] "2"
f[[2]] outputs the second element of the result, which is the first capture group value (the first element is the whole match).
Using regmatches
As it is mentioned in the comment above, it is also possible with regmatches and regexec:
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
res <- regmatches(test,regexec(pattern,test))
res[[1]][2]  # the res list contains the whole match and all submatches
## [1] "2"   # item [2] of the first element is the first capture group, "2"
See regexec help page that says:
regexec returns a list of the same length as text each element of which is either -1 if there is no match, or a sequence of integers with the starting positions of the match and all substrings corresponding to parenthesized subexpressions of pattern, with attribute "match.length" a vector giving the lengths of the matches (or -1 for no match).
OP task specific solution
Actually, since you only are interested in 1 integer number in the beginning of a string, you could achieve what you want with a mere gsub:
> gsub("^ *(\\d+) +.*", "\\1", test)
[1] "2"

R: removing the last three dots from a string

I have a text data file that I likely will read with readLines. The initial portion of each string contains a lot of gibberish followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split the strings after the last three dots, or replace the last three dots with a marker of some sort telling R to treat everything to the left of those three dots as one column.
Here is a similar post on Stackoverflow that will locate the last dot:
R: Find the last dot in a string
However, in my case some of the data have decimals, so locating the last dot will not suffice. Also, I think ... has a special meaning in R, which might be complicating the issue. Another potential complication is that some of the dots are bigger than others. Also, in some lines one of the three dots was replaced with a comma.
In addition to gregexpr in the post above I have tried using gsub, but cannot figure out the solution.
Here is an example data set and the outcome I hope to achieve:
aa = matrix(c(
'first string of junk... 0.2 0 1',
'next string ........2 0 2',
'%%%... ! 1959 ... 0 3 3',
'year .. 2 .,. 7 6 5',
'this_string is . not fine .•. 4 2 3'),
nrow=5, byrow=TRUE,
dimnames = list(NULL, c("C1")))
aa <- as.data.frame(aa, stringsAsFactors=F)
aa
# desired result
# C1 C2 C3 C4
# 1 first string of junk 0.2 0 1
# 2 next string ..... 2 0 2
# 3 %%%... ! 1959 0 3 3
# 4 year .. 2 7 6 5
# 5 this_string is . not fine 4 2 3
I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.
Some of the lines do not contain gibberish or three dots, but only data. However, that might be a complication for a follow up post.
Thank you for any advice.
This does the trick, though not especially elegant...
options(stringsAsFactors = FALSE)
# Search for three consecutive delimiter characters, then pull out
# everything after them (captured in parentheses and referenced
# in the replacement as \\1)
nums <- gsub("^.*[.,•]{3}\\s*(.*)", "\\1", aa$C1)
# Use strsplit to break the results apart at spaces to get just the numbers,
# then rbind the pieces into a matrix with one row per original string
num.mat <- do.call(rbind, strsplit(nums, split = " "))
# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))
# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
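As a small follow-up (not part of the original answer), the numeric columns come out as character; utils::type.convert() can fix the types afterwards:
result[-1] <- lapply(result[-1], type.convert, as.is = TRUE)
str(result)   # num1, num2, num3 are now numeric / integer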
This will get you most of the way there, and it will have no problems with numbers that include commas:
# First, use a regex to eliminate the bad pattern. This regex
# eliminates any three-character combination of periods, commas,
# and big dots (•), so long as the combination is followed by
# 0-2 spaces and then a digit.
aa.sub <- as.matrix(
apply(aa, 1, function (x)
gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))
# Second: it looks as though you want your data split into columns.
# So this regex splits on spaces that are (a) preceded by a letter,
# digit, or space, and (b) followed by a digit. The result is a
# list, each element of which is a list containing the parts of
# one of the strings in aa.
aa.list <- apply(aa.sub, 1, function (x)
strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))
# Remove the second element in aa. There is no space before the
# first data column in this string. As a result, strsplit() split
# it into three columns, not 4. That in turn throws off the code
# below.
aa.list <- aa.list[-2]
# Make the data frame.
aa.list <- lapply(aa.list, unlist) # convert list of lists to list of vectors
aa.df <- data.frame(aa.list)
aa.df <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE)
The only thing remaining is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it's better just to handle cases like that manually.
Reverse the string
Reverse the pattern you're searching for if necessary - it's not in your case
Reverse the result
[haiku-pseudocode]
a = 'first string of junk... 0.2 0 1' // string to search
b = 'junk' // pattern to match
ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
rb = reverseString (b) // now equals 'knuj'
// run your regular expression search / replace - search in 'ra' for 'rb'
// put the result in rResult
// and then unreverse the result
// apologies for not knowing the syntax for 'R' regex
[/haiku-pseudocode]
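For completeness, the idea above translates to R roughly as follows (a sketch; reverse_string() is a hypothetical helper built from base functions):
reverse_string <- function(s) {
  # split into single characters, reverse them, and paste back together
  vapply(strsplit(s, ""), function(ch) paste(rev(ch), collapse = ""), character(1))
}
a  <- 'first string of junk... 0.2 0 1'   # string to search
b  <- 'junk'                              # pattern to match
ra <- reverse_string(a)                   # '1 0 2.0 ...knuj fo gnirts tsrif'
rb <- reverse_string(b)                   # 'knuj'
# sub() replaces the first match in the reversed string, i.e. the *last* match
# in the original; note that the replacement text must be reversed as well
reverse_string(sub(rb, reverse_string("JUNK"), ra))
## [1] "first string of JUNK... 0.2 0 1"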