Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
Desired results
The desired results should look like this:
indicator period values
1 someindicator 2001 0.2655087
2 someindicator 2011 0.3721239
3 some text 20022008 0.5728534
4 another indicator 2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (everything from the first digit onwards, including that digit) are in the second column
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
separate(col = indicator, into = c("indicator", "period"),
sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.

I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
" ?" (a space followed by ?) matches the space character literally, between zero and one times, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
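For readers who want to try the same lookbehind/lookahead split outside R, here is a minimal Python sketch. It reuses the pattern from the answer unchanged; the function name split_indicator is made up for illustration, and splitting on a zero-width match needs Python 3.7 or later.

```python
import re

# Same pattern as in the answer: split where a lowercase letter is followed
# by a digit, consuming one optional space in between.
SPLIT_AT_FIRST_DIGIT = re.compile(r"(?<=[a-z]) ?(?=[0-9])")


def split_indicator(text):
    """Return (description, period), or (text, None) if no split point exists."""
    parts = SPLIT_AT_FIRST_DIGIT.split(text, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return text, None


for raw in ["someindicator2001", "some text 20022008", "another indicator 2003"]:
    print(split_indicator(raw))
```

Because the lookbehind only accepts a-z, a description ending in an uppercase letter or punctuation would not be split; widening the character class is one way to cover those cases.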

You could also use unglue::unglue_unnest():
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)
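The template idea can be approximated in Python with named groups. This is only a rough analogue written for illustration, not a description of how unglue works internally; the regex and the function name unglue_like are assumptions.

```python
import re

# Hypothetical analogue of the template "{indicator}{=\\s*}{period=\\d*}":
# a lazy description, optional whitespace, then the trailing digits.
TEMPLATE = re.compile(r"(?P<indicator>.*?)\s*(?P<period>\d*)")


def unglue_like(text):
    """Return a dict with 'indicator' and 'period' keys, or None on no match."""
    match = TEMPLATE.fullmatch(text)
    return match.groupdict() if match else None


print(unglue_like("some text 20022008"))
```

The lazy first group is what keeps the whitespace and digits out of the description: it grows only until the rest of the string can be read as optional spaces followed by digits.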

Related

Regular Expression to Search Quantity in Human Written Descriptions

Hello and thank you in advance,
I buy items that have a variety of human written listings on auction sites and forums.
Often times, the quantity is clear to a person, but extracting it has been a real challenge. I'm using google sheets and REGEXEXTRACT().
I consider myself to be an intermediate regex user, but this has me stumped, so I need an expert.
Here's a few examples, my desired return, and what I'm getting.
Listing | Desired Return | Actual Return
Red 1996 Corvette 2x - Matchbox | 2 | 2
3 x SmartCar, broken 2nd door | 3 | 3
2nd edition Kindle (x3) | 3 | 3
**1x** 2008 financial crash notice | 1 | 1
Collectors Edition Beannie Baby, item 204/343 | 1 | 4
(6) Nissan window motors (1995-1998 ONLY) | 6 | N/A
White chevy F150, 1996 | 1 | 6
Green bowl, cracked (stored in room 2A5) | 1 | 5
As I thought through this, I think I can put some reasonable limitations on this logic, but the code is harder.
The quantities will only be a single number 1-9. (perhaps reject all numbers > 9?)
They'll possibly be preceded or followed by an X or x, with or without a space
The quantity may be next to a special character like * , () or -
It should ignore all 1st, 2nd, 3rd, - 9th style notation
If a number is mixed in a word, like 2A3, it should ignore all
Obviously most descriptions don't have any quantity, so if there's no return or zero, that's fine.
I have something that feels close, and does a reasonable job:
[^a-wy-zA-WY-Z0-9]*([1-4]){1}([^a-wA-w0-9]|$)
It doesn't return anything with the returns marked of 1*, and that's fine. It breaks on the last two, and I've struggled for too long!
Thanks in advance!

You can use
=IFNA(INT(REGEXEXTRACT(REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2"), "(\d)")), 1)
Here,
REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2") finds and removes chunks of two or more digits, or chunks with a digit and at least one letter, but keeps the sequences where a digit is preceded or followed with x
REGEXEXTRACT(..., "(\d)") extracts the first digit left after the replacement
=IFNA(INT(...), 1) either casts the found digit to integer, or, if there was no match, inserts 1 into the column.
The long regex breaks down as follows:
\d{2,} - two or more digits
| - or
(x\d) - Group 1 ($1): x and a digit
| - or
(\dx) - Group 2 ($2): a digit and x
| - or
[^\W\d]+\d\w* - one or more word chars except digits, a digit and then zero or more word chars
| - or
\d+[^\W\d]\w* - one or more digits, a letter or underscore, and then zero or more word chars.
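To experiment with this logic outside Google Sheets, the formula can be ported to Python roughly as follows. This is a sketch under assumptions: the function name extract_quantity and the default argument are made up, and Python's re module stands in for the RE2 engine that Sheets uses, which should behave the same for ASCII listings like the ones in the question.

```python
import re

# Same alternation as the REGEXREPLACE call; groups 1 and 2 are the
# "x + digit" and "digit + x" cases that must survive the clean-up.
NOISE = re.compile(r"\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*")


def extract_quantity(listing, default=1):
    """Return the single-digit quantity in a listing, or `default` if none."""
    cleaned = NOISE.sub(
        lambda m: (m.group(1) or "") + (m.group(2) or ""),  # like "$1$2"
        listing.lower(),
    )
    first_digit = re.search(r"\d", cleaned)
    return int(first_digit.group()) if first_digit else default


print(extract_quantity("(6) Nissan window motors (1995-1998 ONLY)"))
```

The order of the alternatives matters: the two "x" groups come before the letter-digit alternatives, so a token like 2x is kept rather than discarded as a digit mixed into a word.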

I'm not getting row number from grepl - doing this in R

I'm trying to determine which is the first row with a cell that contains only digits, "," and "$" in a data frame:
Assessment Area Offices Offices Deposits as of 6/30/16 Deposits as of 6/30/16 Assessment Area Reviews Assessment Area Reviews Assessment Area Reviews
2 Assessment Area # % $ (000s) % Full Scope Limited Scope TOTAL
3 Ohio County 1 50.0% $24,451 52.7% 1 0 1
4 Hart County 1 50.0% $21,931 47.3% 1 0 1
5 OVERALL 2 100% $46,382 100.0% 2 0 2
This code does find the row:
grepl("[0-9]",table_1)
But the code returns:
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
I only want to know the row.

Your data could use some cleaning up, but it's not entirely necessary in order to solve your problem. You want to find the first row that contains a dollar sign and an appropriate value. My solution does the following:
Iterates over rows
In each row, asks if there's at least one cell that starts with a dollar sign followed by a specific combination of digits and commas (to be explained in greater detail below)
Stops when we reach that row
Prints the ID of the row
The solution involves a for loop, an if statement, and a regular expression.
First of all, here's my attempt to reproduce your data frame. Again, the details don't matter too much; I just wanted to make the "money row" the second row, which is roughly how it appears in your example.
df<- data.frame(
Assessment_Area = c(2,3,4,5),
Offices = c("#",1,1,2),
Dep_Percent_63016 = c("#","50.0%","50.0%","100.0%"),
Dep_Total_63016 = c("$ (000s)", "$24,451", "$21,931","$46,382"),
Assessment_Area_Rev = rep("Blah",4)
)
df
Assessment_Area Offices Dep_Percent_63016 Dep_Total_63016
1 2 # # $ (000s)
2 3 1 50.0% $24,451
3 4 1 50.0% $21,931
4 5 2 100.0% $46,382
Assessment_Area_Rev
1 Blah
2 Blah
3 Blah
4 Blah
Here's the for loop:
library(stringr)
for (i in 1:nrow(df)) {
if (any(str_detect(df[i,],"^\\$\\d{1,3}(,\\d{3})*"))) {
print(i)
break
}
}
The key is the line with the if statement. any returns TRUE if any element of a logical vector is true. In this case the vector is created by applying stringr::str_detect to a row of the df which is indexed as df[i,]. str_detect returns a logical vector - you supply a character vector and an expression to match in the elements of that vector. It returns TRUE or FALSE for each element in the vector which in this case is each cell in a row. So the crux of this is the regular expression:
"^\\$\\d{1,3}(,\\d{3})*"
This is the pattern we're searching for (the money cell) in each row. ^\\$ indicates we want the string to start with the dollar sign. The two backslashes escape the $ character because it's a metacharacter in regular expressions (end anchor). We then want 1-3 digits. This will match any dollar value below $1,000. Then we specify that the expression can contain any number (including 0) of , followed by three more digits. This will cover any dollar value.
Finally, if we encounter a row which contains one of these expressions, the for loop will print the number of the row and end the loop so it will return the lowest row number containing one desired cell. In this example the output is 2. If no appropriate rows are encountered, nothing will happen.
There may be more you want to do once you have that information, but if all you need is the lowest row number containing your money expression then this is sufficient.
A less elegant regular expression which only looks for dollar signs, commas, and digits would be:
"[0-9$,]+"
which is what you asked for, although I don't think that's what you really want, because it will also match something like ,56$,,$$78
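The same row search can be sketched in Python for comparison. The rows list below is a hypothetical stand-in for the data frame, and first_money_row is a made-up helper name; both regular expressions are the ones discussed in this answer.

```python
import re

# Strict pattern from the answer: "$", 1-3 digits, then ",ddd" groups.
MONEY = re.compile(r"^\$\d{1,3}(,\d{3})*")
# Loose pattern from the end of the answer: any run of digits, "$" and ",".
LOOSE = re.compile(r"[0-9$,]+")


def first_money_row(rows):
    """Return the 1-based index of the first row with a money cell, else None."""
    for index, row in enumerate(rows, start=1):
        if any(MONEY.search(str(cell)) for cell in row):
            return index
    return None


# Hypothetical stand-in for the data frame, one list per row.
rows = [
    [2, "#", "#", "$ (000s)", "Blah"],
    [3, "1", "50.0%", "$24,451", "Blah"],
    [4, "1", "50.0%", "$21,931", "Blah"],
    [5, "2", "100.0%", "$46,382", "Blah"],
]
print(first_money_row(rows))
```

Returning None when nothing matches makes the "no money row" case explicit, whereas the loop version simply prints nothing.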

RegEx for matching group in multiline texts

I have this multi-line text, I want to extract the numerical value before the 'Next' text (in this case 13). The numerical values will change, but the location will stay the same, it indicates total # of pages on website. I am having trouble writing the correct regex to return this value:
Previous
1
2
3
...
13
Next
Showing 1 - 100 of 1227 Results[EXTRACT]
pattern =re.compile(r'(\d{1,2})\r\nNext', re.M)
result = pattern.match(text)
The expected return value is 13.

import re
t = """Previous
1
2
3
...
13
Next
Showing 1 - 100 of 1227 Results[EXTRACT]"""
re.search(r"\d+(?=\s+Next)", t).group(0)
Returns: '13'
The regular expression matches one or more digits and uses a lookahead assertion to require that they are followed by one or more whitespace characters and then the word Next. The lookahead is not part of the match, so group(0) contains only the digits.
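A likely reason the attempt in the question returned nothing is that match() only tries to match at the very start of the string, whereas search() scans the whole text. The sketch below illustrates this, assuming the text uses plain "\n" line endings (the question does not say which it uses).

```python
import re

# Assumed input: the page text with "\n" line endings.
text = "Previous\n1\n2\n3\n...\n13\nNext\nShowing 1 - 100 of 1227 Results[EXTRACT]"

# match() only tries position 0, where the text starts with "Previous".
anchored = re.compile(r"(\d{1,2})\r\nNext", re.M).match(text)

# search() scans the whole string; \r?\n accepts both line-ending styles.
found = re.search(r"(\d+)\r?\nNext", text)
last_page = int(found.group(1)) if found else None

print(anchored, last_page)
```

Using \d+ instead of \d{1,2} also avoids silently truncating a page count of 100 or more.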

Regular Expression for parsing a sports score

I'm trying to validate that a form field contains a valid score for a volleyball match. Here's what I have, and I think it works, but I'm not an expert on regular expressions, by any means:
r'^ *([0-9]{1,2} *- *[0-9]{1,2})((( *[,;] *)|([,;] *)|( *[,;])|[,;]| +)[0-9]{1,2} *- *[0-9]{1,2})* *$'
I'm using python/django, not that it really matters for the regex match. I'm also trying to learn regular expressions, so a more optimal regex would be useful/helpful.
Here are rules for the score:
1. There can be one or more valid set (set=game) results included
2. Each result must be of the form dd-dd, where 0 <= dd <= 99
3. Each additional result must be separated by any of [ ,;]
4. Allow any number of sets >=1 to be included
5. Spaces should be allowed anywhere except in the middle of a number
So, the following are all valid:
25-10 or 25 -0 or 25- 9 or 23 - 25 (could be one or more spaces)
25-10,25-15 or 25-10 ; 25-15 or 25-10 25-15 (again, spaces allowed)
25-1 2 -25, 25- 3 ;4 - 25 15-10
Also, I need each result as a separate unit for parsing. So in the last example above, I need to be able to separately work on:
25-1
2 -25
25- 3
4 - 25
15-10
It'd be great if I could strip the spaces from within each result. I can't just strip all spaces, because a space is a valid separator between result sets.

I think this is a solution for your problem.
import re
re.sub(r"(\d{1,2})\s*-\s*(\d{1,2})", r"\1-\2", score)  # score: the form field value (name assumed)
How it works:
(\d{1,2}) capture group of 1 or 2 digits.
\s* find 0 or more whitespace characters.
- find a literal -.
\1 is replaced with the content of capture group 1
\2 is replaced with the content of capture group 2
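Since the question also asks for each result as a separate unit, the same capture groups can be used to pull the results out one by one. A minimal sketch, where parse_sets is a made-up helper name:

```python
import re

# Two capture groups: the score on each side of the dash.
SET_RESULT = re.compile(r"(\d{1,2})\s*-\s*(\d{1,2})")


def parse_sets(score):
    """Return each set result as a normalized 'dd-dd' string, in order."""
    return [f"{left}-{right}" for left, right in SET_RESULT.findall(score)]


print(parse_sets("25-1 2 -25, 25- 3 ;4 - 25 15-10"))
```

Note that this only extracts results; it does not reject stray text between them, so validating the whole field still needs a separate check such as the full-string regex in the question.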

R: removing the last three dots from a string

I have a text data file that I likely will read with readLines. The initial portion of each string contains a lot of gibberish followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split the strings after the last three dots, or replace the last three dots with a marker of some sort telling R to treat everything to the left of those three dots as one column.
Here is a similar post on Stack Overflow that will locate the last dot:
R: Find the last dot in a string
However, in my case some of the data have decimals, so locating the last dot will not suffice. Also, I think ... has a special meaning in R, which might be complicating the issue. Another potential complication is that some of the dots are bigger than others. Also, in some lines one of the three dots was replaced with a comma.
In addition to gregexpr in the post above I have tried using gsub, but cannot figure out the solution.
Here is an example data set and the outcome I hope to achieve:
aa = matrix(c(
'first string of junk... 0.2 0 1',
'next string ........2 0 2',
'%%%... ! 1959 ... 0 3 3',
'year .. 2 .,. 7 6 5',
'this_string is . not fine .•. 4 2 3'),
nrow=5, byrow=TRUE,
dimnames = list(NULL, c("C1")))
aa <- as.data.frame(aa, stringsAsFactors=F)
aa
# desired result
# C1 C2 C3 C4
# 1 first string of junk 0.2 0 1
# 2 next string ..... 2 0 2
# 3 %%%... ! 1959 0 3 3
# 4 year .. 2 7 6 5
# 5 this_string is . not fine 4 2 3
I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.
Some of the lines do not contain gibberish or three dots, but only data. However, that might be a complication for a follow up post.
Thank you for any advice.

This does the trick, though not especially elegant...
options(stringsAsFactors = FALSE)
# Search for three consecutive characters of your delimiters, then pull out
# all of the characters after that
# (in parentheses, represented in replace by \\1)
nums <- as.vector(gsub(aa$C1, pattern = "^.*[.,•]{3}\\s*(.*)", replace = "\\1"))
# Use strsplit to break the results apart at spaces and just get the numbers
# Use do.call(rbind, ...) to stack the pieces into a matrix with
# one row per original string
num.mat <- do.call(rbind, strsplit(nums, split = " "))
# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))
# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
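The core of this approach is the greedy ^.* prefix, which pushes the match to the last run of three delimiter characters. A short Python sketch of the same idea follows; the helper name split_at_last_triple is made up, and returning an empty field list for lines without a delimiter is an assumption about how such lines should be treated.

```python
import re

# Greedy prefix, then the last run of three delimiter characters, then data.
LAST_TRIPLE = re.compile(r"^(.*)[.,•]{3}\s*(.*)$")


def split_at_last_triple(line):
    """Return (text_before, data_fields); data_fields is [] if no delimiter."""
    match = LAST_TRIPLE.match(line)
    if match is None:
        return line, []
    return match.group(1).rstrip(), match.group(2).split()


print(split_at_last_triple("next string ........2 0 2"))
```

Splitting the data part on any whitespace, rather than on a single space, keeps the field count stable if a line has doubled spaces between values.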

This will get you most of the way there, and it will have no problems with numbers that include commas:
# First, use a regex to eliminate the bad pattern. This regex
# eliminates any three-character combination of periods, commas,
# and big dots (•), so long as the combination is followed by
# 0-2 spaces and then a digit.
aa.sub <- as.matrix(
apply(aa, 1, function (x)
gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))
# Second: it looks as though you want your data split into columns.
# So this regex splits on spaces that are (a) preceded by a letter,
# digit, or space, and (b) followed by a digit. The result is a
# list, each element of which is a list containing the parts of
# one of the strings in aa.
aa.list <- apply(aa.sub, 1, function (x)
strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))
# Remove the second element of aa.list. There is no space before the
# first data column in this string. As a result, strsplit() split
# it into three columns, not 4. That in turn throws off the code
# below.
aa.list <- aa.list[-2]
# Make the data frame.
aa.list <- lapply(aa.list, unlist) # convert list of lists to list of vectors
aa.df <- data.frame(aa.list)
aa.df <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE)
The only thing remaining is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it's better just to handle cases like that manually.

Reverse the string
Reverse the pattern you're searching for if necessary - it's not in your case
Reverse the result
[haiku-pseudocode]
a = 'first string of junk... 0.2 0 1' // string to search
b = 'junk' // pattern to match
ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
rb = reverseString (b) // now equals 'knuj'
// run your regular expression search / replace - search in 'ra' for 'rb'
// put the result in rResult
// and then unreverse the result
// apologies for not knowing the syntax for 'R' regex
[/haiku-pseudocode]
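The reversal trick above can be written out in Python as a small sketch. The function name last_index_via_reverse is made up, and it searches for a literal substring rather than a regular expression; Python's built-in str.rfind gives the same index directly, so this is mainly to show the index arithmetic.

```python
def last_index_via_reverse(haystack, needle):
    """Return the start index of the last occurrence of needle, or -1."""
    reversed_haystack = haystack[::-1]
    reversed_needle = needle[::-1]
    pos = reversed_haystack.find(reversed_needle)  # first hit in the reversed text
    if pos == -1:
        return -1
    # Map the position in the reversed text back to the original string.
    return len(haystack) - pos - len(needle)


line = "next string ........2 0 2"
cut = last_index_via_reverse(line, "...")
print(line[:cut], "|", line[cut + 3:])
```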
