I'm not getting row number from grepl - doing this in R - regex

I'm trying to determine which is the first row with a cell that contains only digits, "," and "$" in a data frame:
Assessment Area Offices Offices Deposits as of 6/30/16 Deposits as of 6/30/16 Assessment Area Reviews Assessment Area Reviews Assessment Area Reviews
2 Assessment Area # % $ (000s) % Full Scope Limited Scope TOTAL
3 Ohio County 1 50.0% $24,451 52.7% 1 0 1
4 Hart County 1 50.0% $21,931 47.3% 1 0 1
5 OVERALL 2 100% $46,382 100.0% 2 0 2
This code does find the row:
grepl("[0-9]",table_1)
But the code returns:
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
I only want to know the row.

Your data could use some cleaning up, but it's not entirely necessary in order to solve your problem. You want to find the first row that contains a dollar sign and an appropriate value. My solution does the following:
Iterates over rows
In each row, asks if there's at least one cell that starts with a dollar sign followed by a specific combination of digits and commas (to be explained in greater detail below)
Stops when we reach that row
Prints the ID of the row
The solution involves a for loop, an if statement, and a regular expression.
First of all, here's my attempt to reproduce a data frame. Again, the details don't matter too much. I just wanted to make the "money row" the second row, which is how it seems to appear in your example.
df<- data.frame(
Assessment_Area = c(2,3,4,5),
Offices = c("#",1,1,2),
Dep_Percent_63016 = c("#","50.0%","50.0%","100.0%"),
Dep_Total_63016 = c("$ (000s)", "$24,451", "$21,931","$46,382"),
Assessment_Area_Rev = rep("Blah",4)
)
df
Assessment_Area Offices Dep_Percent_63016 Dep_Total_63016
1 2 # # $ (000s)
2 3 1 50.0% $24,451
3 4 1 50.0% $21,931
4 5 2 100.0% $46,382
Assessment_Area_Rev
1 Blah
2 Blah
3 Blah
4 Blah
Here's the for loop:
library(stringr)
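# seq_len() avoids looping over c(1, 0) when df has no rows
for (i in seq_len(nrow(df))) {
  # TRUE if any cell in row i starts with "$" followed by a comma-grouped number
  if (any(str_detect(df[i,],"^\\$\\d{1,3}(,\\d{3})*"))) {
    print(i)  # row number of the first "money row"
    break     # stop at the first match
  }
}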
The key is the line with the if statement. any returns TRUE if any element of a logical vector is true. In this case the vector is created by applying stringr::str_detect to a row of the df which is indexed as df[i,]. str_detect returns a logical vector - you supply a character vector and an expression to match in the elements of that vector. It returns TRUE or FALSE for each element in the vector which in this case is each cell in a row. So the crux of this is the regular expression:
"^\\$\\d{1,3}(,\\d{3})*"
This is the pattern we're searching for (the money cell) in each row. ^\\$ indicates we want the string to start with the dollar sign. The two backslashes escape the $ character because it's a metacharacter in regular expressions (end anchor). We then want 1-3 digits. This will match any dollar value below $1,000. Then we specify that the expression can contain any number (including 0) of , followed by three more digits. This will cover any dollar value.
Finally, if we encounter a row that contains one of these expressions, the for loop prints the row number and stops, so you get the lowest row number containing a matching cell. In this example the output is 2. If no appropriate rows are encountered, nothing is printed.
There may be more you want to do once you have that information, but if all you need is the lowest row number containing your money expression then this is sufficient.
A less elegant regular expression which only looks for dollar signs, commas, and digits would be:
"[0-9$,]+"
which is what you asked for, although I don't think that's what you really want, because it will also match something like ,56$,,$$78
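For comparison, the same "first row with a money cell" search can be sketched in Python. This is only an illustration of the logic, not the R answer itself: the function name first_money_row and the list-of-rows layout are assumptions made for the example, and the rows are copied from the reproduced data frame above.

```python
import re

# Same money pattern as in the answer: "$", 1-3 digits, then optional ",ddd" groups.
MONEY = re.compile(r"^\$\d{1,3}(,\d{3})*")

def first_money_row(rows):
    """Return the 1-based index of the first row with a money cell, or None."""
    for i, row in enumerate(rows, start=1):
        # any() plays the role of R's any(str_detect(...)) over the cells of a row
        if any(MONEY.search(str(cell)) for cell in row):
            return i
    return None

# Hypothetical stand-in for the data frame, one inner list per row
rows = [
    [2, "#", "#", "$ (000s)", "Blah"],
    [3, 1, "50.0%", "$24,451", "Blah"],
    [4, 1, "50.0%", "$21,931", "Blah"],
    [5, 2, "100.0%", "$46,382", "Blah"],
]
print(first_money_row(rows))  # 2
```

Returning None instead of printing nothing makes the "no matching row" case explicit for the caller.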

Related

Regular Expression to Search Quantity in Human Written Descriptions

Hello and thank you in advance,
I buy items that have a variety of human written listings on auction sites and forums.
Often times, the quantity is clear to a person, but extracting it has been a real challenge. I'm using google sheets and REGEXEXTRACT().
I consider myself to be an intermediate regex user, but this has me stumped, so I need an expert.
Here's a few examples, my desired return, and what I'm getting.
Listing | Desired Return | Actual Return
Red 1996 Corvette 2x - Matchbox | 2 | 2
3 x SmartCar, broken 2nd door | 3 | 3
2nd edition Kindle (x3) | 3 | 3
**1x** 2008 financial crash notice | 1 | 1
Collectors Edition Beannie Baby, item 204/343 | 1 | 4
(6) Nissan window motors (1995-1998 ONLY) | 6 | N/A
White chevy F150, 1996 | 1 | 6
Green bowl, cracked (stored in room 2A5) | 1 | 5
As I thought through this, I think I can put some reasonable limitations on this logic, but the code is harder.
The quantities will only be a single number 1-9. (perhaps reject all numbers > 9?)
They'll possibly be preceded or followed by an X or x, with or without a space
The quantity may be next to a special character like * , () or -
It should ignore all 1st, 2nd, 3rd, - 9th style notation
If a number is mixed into a word, like 2A3, it should ignore all of it
Obviously most descriptions don't have any quantity, so if there's no return or zero, that's fine.
I have something that feels close, and does a reasonable job:
[^a-wy-zA-WY-Z0-9]*([1-4]){1}([^a-wA-w0-9]|$)
It doesn't return anything with the returns marked of 1*, and that's fine. It breaks on the last two, and I've struggled for too long!
Thanks in advance!
You can use
=IFNA(INT(REGEXEXTRACT(REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2"), "(\d)")), 1)
Here,
REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2") finds and removes chunks of two or more digits, or chunks with a digit and at least one letter, but keeps the sequences where a digit is preceded or followed with x
REGEXEXTRACT(..., "(\d)")) extracts the first digit left after the replacement
=IFNA(INT(...), 1) either casts the found digit to integer, or, if there was no match, inserts 1 into the column.
Details of the long regex:
\d{2,} - two or more digits
| - or
(x\d) - Group 1 ($1): x and a digit
| - or
(\dx) - Group 2 ($2): a digit and x
| - or
[^\W\d]+\d\w* - one or more word chars except digits, a digit and then zero or more word chars
| - or
\d+[^\W\d]\w* - one or more digits, a letter or underscore, and then zero or more word chars.
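The replace-then-extract idea can also be checked outside Google Sheets. The following is a rough Python translation, assuming Python's re module behaves like the Sheets regex engine for this pattern; the function name quantity and its default argument are hypothetical, and \1\2 stands in for the $1$2 replacement.

```python
import re

# Same alternation as the REGEXREPLACE pattern above
CLEANUP = re.compile(r"\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*")

def quantity(listing, default=1):
    """Drop noisy digit chunks, keep 'x<digit>' / '<digit>x', return the first digit left."""
    # Groups that did not take part in a match are replaced with an empty string
    cleaned = CLEANUP.sub(r"\1\2", listing.lower())
    match = re.search(r"\d", cleaned)
    # Fall back to the default quantity, like IFNA(..., 1)
    return int(match.group()) if match else default

print(quantity("2nd edition Kindle (x3)"))  # 3
```

Run against the eight listings from the question, this sketch gives the desired return for each.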

How to use a Regular Expression to find a Specific word and return following 10 characters?

I need to find a regular expression that finds "Order #" and then return the following 10 characters.
For example, I can have the following rows (ignore the row numbers; I'm just using them to show where a new line starts in the original data):
Row 1 Order #100013661 By John DOE
Row 2 REFUND for CHARGE(Order #100013667 By Lara Croft
Row 3 Order #100013668 By Sammy
Row 4 Blah Blah Blah Order #10013664 By Fluffy fluff
I want the expression to return:
ROW 1 100013661
ROW 2 100013667
Row 3 100013668
Row 4 100013664
Use capturing groups for that:
Order #(.{9})
Use the tools in your hosting language to harvest the capturing group.
The regex you need is
(?<=Order #).{10}
Detailed explanation:
(?<=Order #) is a positive lookbehind: it matches if the literal string Order # occurs before current position;
.{10} matches any 10 characters.
Note that this won't match if fewer than 10 characters follow the search string on the line. If you need to match up to 10 characters, not exactly 10 characters, replace {10} with {1,10}.
Order #(.{10}) or Order #(.{1,10}) if it could be up to 10 characters.
Order #(\d{1,10}) if they are always numbers.
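As a quick illustration of harvesting the capturing group in a host language, here is a small Python sketch covering both the capture-group and the lookbehind variants. The helper names are hypothetical, and the sample lines are the ones from the question with the "Row n" prefixes dropped.

```python
import re

lines = [
    "Order #100013661 By John DOE",
    "REFUND for CHARGE(Order #100013667 By Lara Croft",
    "Order #100013668 By Sammy",
    "Blah Blah Blah Order #10013664 By Fluffy fluff",
]

def order_number(line):
    """Capture-group variant: group 1 holds the digits after 'Order #'."""
    m = re.search(r"Order #(\d{1,10})", line)
    return m.group(1) if m else None

def order_number_lookbehind(line):
    """Lookbehind variant: the whole match is the digits themselves."""
    m = re.search(r"(?<=Order #)\d{1,10}", line)
    return m.group(0) if m else None

for line in lines:
    print(order_number(line))
```

With these sample lines the order numbers are 8 or 9 digits long, so .{10} would also pick up the following space and possibly a letter; \d{1,10} avoids that.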

Regular Expression for parsing a sports score

I'm trying to validate that a form field contains a valid score for a volleyball match. Here's what I have, and I think it works, but I'm not an expert on regular expressions, by any means:
r'^ *([0-9]{1,2} *- *[0-9]{1,2})((( *[,;] *)|([,;] *)|( *[,;])|[,;]| +)[0-9]{1,2} *- *[0-9]{1,2})* *$'
I'm using python/django, not that it really matters for the regex match. I'm also trying to learn regular expressions, so a more optimal regex would be useful/helpful.
Here are rules for the score:
1. There can be one or more valid set (set=game) results included
2. Each result must be of the form dd-dd, where 0 <= dd <= 99
3. Each additional result must be separated by any of [ ,;]
4. Allow any number of sets >=1 to be included
5. Spaces should be allowed anywhere except in the middle of a number
So, the following are all valid:
25-10 or 25 -0 or 25- 9 or 23 - 25 (could be one or more spaces)
25-10,25-15 or 25-10 ; 25-15 or 25-10 25-15 (again, spaces allowed)
25-1 2 -25, 25- 3 ;4 - 25 15-10
Also, I need each result as a separate unit for parsing. So in the last example above, I need to be able to separately work on:
25-1
2 -25
25- 3
4 - 25
15-10
It'd be great if I could strip the spaces from within each result. I can't just strip all spaces, because a space is a valid separator between result sets.
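I think this is a solution for your problem. In Python, use re.sub (str.replace does not accept a regular expression), with \1 and \2 as the backreferences; score here stands for the submitted string:
import re
re.sub(r"(\d{1,2})\s*-\s*(\d{1,2})", r"\1-\2", score)
How it works:
(\d{1,2}) captures a group of 1 or 2 digits.
\s* matches 0 or more whitespace characters.
- matches a literal -.
\1 inserts the content of capture group 1.
\2 inserts the content of capture group 2.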
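To get each result as a separate unit, one option is to validate the whole field first and then pull out the individual sets with findall. The sketch below assumes the rules listed in the question; the names FIELD_RE, SET_RE and parse_score are made up for the example, and the validation pattern is one possible simplification of the original regex, not the only one.

```python
import re

# One set result: 1-2 digits, a dash, 1-2 digits, with optional spaces around the dash
SET_RE = re.compile(r"(\d{1,2})\s*-\s*(\d{1,2})")

# Whole field: a result, then further results separated by
# whitespace or by a single "," or ";" with optional surrounding spaces
FIELD_RE = re.compile(
    r"\s*\d{1,2}\s*-\s*\d{1,2}(?:(?:\s*[,;]\s*|\s+)\d{1,2}\s*-\s*\d{1,2})*\s*"
)

def parse_score(field):
    """Return a list of (home, away) integer pairs, or None if the field is invalid."""
    if not FIELD_RE.fullmatch(field):
        return None
    return [(int(a), int(b)) for a, b in SET_RE.findall(field)]

sets = parse_score("25-1 2 -25, 25- 3 ;4 - 25 15-10")
print(sets)
print(["%d-%d" % pair for pair in sets])  # results with the inner spaces stripped
```

Working with integer pairs sidesteps the space-stripping problem, since the spaces never make it into the parsed values.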

Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
Desired results
Desired results should look like that:
indicator period values
1 someindicator 2001 0.2655087
2 someindicator 2011 0.3721239
3 some text 20022008 0.5728534
4 another indicator 2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (everything from the first digit onward, including that digit) are in the second column
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
separate(col = indicator, into = c("indicator", "period"),
sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
" ?" (a space followed by ?) matches the space character literally, between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
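The "split at the first digit" idea can be illustrated in Python as well. This sketch uses two capture groups rather than a lookaround-based split, so it is a different technique from the separate() call above; the function name split_indicator is hypothetical.

```python
import re

def split_indicator(text):
    """Split text into (description, period) at the first digit.

    The lazy \\D*? takes as few non-digits as possible, \\s* swallows an
    optional gap, and (\\d.*) keeps everything from the first digit onward.
    """
    m = re.fullmatch(r"(\D*?)\s*(\d.*)", text)
    if m is None:
        return text, ""  # no digit found: leave the text as the description
    return m.group(1), m.group(2)

for s in ["someindicator2001", "some text 20022008", "another indicator 2003"]:
    print(split_indicator(s))
```

The first digit itself is preserved as the start of the period, which matches the wish expressed in the question.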
You could also use unglue::unglue_unnest():
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)

regex: continuous numbering replace

I have a text with the word 'article' repeated. Using Notepad++, I need to replace each occurrence of that word with a number, starting with '1', then '2' for the next instance, and so on. Something like {\d+[+1]}
For example:
this is article and this is another article. Here is an article etc.
becomes:
this is 1 and this is another 2. Here is an 3 etc.
Yes you can, but this is tricky.
Assuming you don't have any # in your file, replace the first pattern below with the second one (use "Replace All", and repeat it until nothing changes):
(?<!#)(#*article)([^#]*?)((?=\1))
\1\2#\3
So this gives: this is article and this is another #article. Here is an ##article etc. Continue with:
article
\#
So you have:
this is # and this is another ##. Here is an ### etc.
Now every article is mapped to a run of # characters whose length is its position. To convert to base 10, you need to perform the following two replacements alternately and repeatedly (assuming that % does not appear in the original text):
#{10}
%
%{10}
#
And so on, until neither the first nor the second replacement applies. Then you can do:
(#|%)\1{8}
9
(#|%)\1{7}
8
(#|%)\1{6}
7
(#|%)\1{5}
6
(#|%)\1{4}
5
(#|%)\1{3}
4
(#|%)\1{2}
3
(#|%)\1{1}
2
(#|%)
1
And voilà! this is 1 and this is another 2. Here is an 3 etc. If you have n articles in your document, the first operation takes O(n) clicks on "Replace All", the second O(log(n)), and the third O(1), so the total is O(n), which is what you would have expected.
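If running a short script outside Notepad++ is an option, the same numbering can be done with a replacement callback. This is a different technique from the pure find-and-replace approach above, shown only as a sketch; the function name number_occurrences is hypothetical.

```python
import re
from itertools import count

def number_occurrences(text, word="article"):
    """Replace the 1st occurrence of word with 1, the 2nd with 2, and so on."""
    counter = count(1)
    # The callback runs once per match, left to right, and pulls the next number
    return re.sub(re.escape(word), lambda _match: str(next(counter)), text)

text = "this is article and this is another article. Here is an article etc."
print(number_occurrences(text))
# this is 1 and this is another 2. Here is an 3 etc.
```

Note that this replaces the substring wherever it appears, so "articles" would become "1s"; wrapping the pattern in \b word boundaries would restrict it to whole words.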