List all the numbers in various ranges (1 - 6; 1, 2, 5 & 23; etc) - regex

I am working on a project where I have a column of Ward Group data, and the data is formatted as Municipality Name - Type (City, Town, Village) Wards. So for example:
ADAMS - T 1 & 2
Cumberland V 1 - 5
Marshfield - C 1 - 20, 23 - 25 & 27
....
etc.
To link this information to a government-provided ward shapefile, I need to have one line per Ward. So for example, I need to turn the above information into:
ADAMS - T 1
ADAMS - T 2
Cumberland V 1
Cumberland V 2
Cumberland V 3
Cumberland V 4
Cumberland V 5
Marshfield - C 1
....
Marshfield - C 20
Marshfield - C 23
Marshfield - C 24
Marshfield - C 25
Marshfield - C 27
Also, each Ward Group line has several columns of election data, that I want copied into each new row. So for example:
Ward Group         Total Votes
ADAMS - T 1 & 2    300
needs to become:
Ward Group         Total Votes
ADAMS - T 1        300
ADAMS - T 2        300
Is there a way to do this in Excel that isn't by hand, either formula or VBA?

Although not tagged as such, I suspect you are going to require code for this, which means your question "must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results." or it could be deemed off topic. So perhaps start with formulae (assuming each 'row' is a single cell):
Maybe use regex to make the Municipality Name - Type link more consistent (say, look for letter, space, letter, space and replace with letter, space, hyphen, space, letter, space), so that Cumberland V 1 - 5 ends up in the same format as the others, i.e. Cumberland - V 1 - 5.
Then use =SUBSTITUTE(A1,"-",",",1) to replace the first hyphen (the one before the Type) with a comma (you can't rely on hyphens alone as delimiters because they are also used to indicate ranges).
Then "fix" the results (Paste Special Values to replace the formulae).
Then parse out the results (Text to Columns) with "," and "&" as the delimiters (only).
Insert three new columns immediately to the right of the leftmost of these columns and parse the (existing) left column with Space and hyphen as the delimiters.
Applied to your example, at this point columns C & D are the lower and upper bounds of ranges, but I'm afraid E3 alone represents a range (though this could be split out with Text to Columns again). A point to watch out for may be confusing 27-29 with 27 & 29.
There is more that could be done with formulae, but I think VBA is a better bet, unless perhaps this is definitely a one-off requirement. For example, the votes, if in a different cell, could be attached after the wards have been split out, via say a lookup function.
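If doing this outside Excel is an option, the whole expansion can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names expand_ward_group and expand_rows are made up here, the Type is assumed to be a single capital C, T or V followed by a space and the ward list, and the data is assumed to be exported as rows of (Ward Group, other columns...).

```python
import re

def expand_ward_group(label):
    """Turn 'Marshfield - C 1 - 20, 23 - 25 & 27' into one label per ward."""
    # Assumption: the prefix ends with a single-letter type (C, T or V),
    # and everything after it is the ward list.
    m = re.match(r"^(.*?\b[CTV])\s+(\d.*)$", label.strip())
    if m is None:
        return [label]  # leave unrecognised rows untouched
    prefix, ward_spec = m.groups()
    wards = []
    # ',' and '&' separate items; each item is either 'n' or 'low - high'.
    for item in re.split(r"[,&]", ward_spec):
        bounds = [int(n) for n in re.findall(r"\d+", item)]
        if len(bounds) == 1:
            wards.append(bounds[0])
        elif len(bounds) == 2:
            wards.extend(range(bounds[0], bounds[1] + 1))
    return [f"{prefix} {ward}" for ward in wards]

def expand_rows(rows):
    """Copy the remaining columns (e.g. Total Votes) onto every new row."""
    return [(label, *rest)
            for group, *rest in rows
            for label in expand_ward_group(group)]
```

For example, expand_rows([("ADAMS - T 1 & 2", 300)]) gives one row for ADAMS - T 1 and one for ADAMS - T 2, each carrying 300. The split on "," and "&" mirrors the Text to Columns step above, and treating two numbers in one item as a range keeps 27 - 29 distinct from 27 & 29.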

Related

Regular Expression to Search Quantity in Human Written Descriptions

Hello and thank you in advance,
I buy items that have a variety of human written listings on auction sites and forums.
Often times, the quantity is clear to a person, but extracting it has been a real challenge. I'm using google sheets and REGEXEXTRACT().
I consider myself to be an intermediate regex user, but this has me stumped, so I need an expert.
Here's a few examples, my desired return, and what I'm getting.
Listing                                          Desired Return   Actual Return
Red 1996 Corvette 2x - Matchbox                  2                2
3 x SmartCar, broken 2nd door                    3                3
2nd edition Kindle (x3)                          3                3
**1x** 2008 financial crash notice               1                1
Collectors Edition Beannie Baby, item 204/343    1                4
(6) Nissan window motors (1995-1998 ONLY)        6                N/A
White chevy F150, 1996                           1                6
Green bowl, cracked (stored in room 2A5)         1                5
As I thought through this, I think I can put some reasonable limitations on this logic, but the code is harder.
The quantities will only be a single number 1-9. (perhaps reject all numbers > 9?)
They'll possibly be preceded or followed by an X or x, with or without a space
The quantity may be next to a special character like * , () or -
It should ignore all 1st, 2nd, 3rd, - 9th style notation
If a number is mixed in a word, like 2A3, it should ignore all
Obviously most descriptions don't have any quantity, so if there's no return or zero, that's fine.
I have something that feels close, and does a reasonable job:
[^a-wy-zA-WY-Z0-9]*([1-4]){1}([^a-wA-w0-9]|$)
It doesn't return anything with the returns marked of 1*, and that's fine. It breaks on the last two, and I've struggled for too long!
Thanks in advance!
You can use
=IFNA(INT(REGEXEXTRACT(REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2"), "(\d)")), 1)
Here,
REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2") finds and removes chunks of two or more digits, or chunks with a digit and at least one letter, but keeps the sequences where a digit is preceded or followed with x
REGEXEXTRACT(..., "(\d)") extracts the first digit left after the replacement
=IFNA(INT(...), 1) either casts the found digit to integer, or, if there was no match, inserts 1 into the column.
Regex details:
\d{2,} - two or more digits
| - or
(x\d) - Group 1 ($1): x and a digit
| - or
(\dx) - Group 2 ($2): a digit and x
| - or
[^\W\d]+\d\w* - one or more word chars except digits, a digit and then zero or more word chars
| - or
\d+[^\W\d]\w* - one or more digits, a letter or underscore, and then zero or more word chars.
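To check the pattern against the sample listings outside Google Sheets, the same replace-then-extract logic can be sketched in Python. The function name quantity is made up for this illustration, and Python's re module stands in for the RE2 engine that Sheets uses; for this particular pattern the two should behave the same.

```python
import re

# Same alternation as in the formula: only groups 1 and 2 (x + digit, digit + x) are kept.
CHUNKS = re.compile(r"\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*")

def quantity(listing):
    # Equivalent of REGEXREPLACE(LOWER(...), pattern, "$1$2")
    cleaned = CHUNKS.sub(
        lambda m: (m.group(1) or "") + (m.group(2) or ""),
        listing.lower(),
    )
    # Equivalent of IFNA(INT(REGEXEXTRACT(..., "(\d)")), 1)
    first_digit = re.search(r"\d", cleaned)
    return int(first_digit.group()) if first_digit else 1
```

Run against the eight listings from the question, this returns the values in the Desired Return column, with 1 as the fallback when no digit survives the replacement.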

How to regex extract only numbers up to the first comma or after a specific keyword?

I'm having trouble trying to regex extract the 'positions' from the following types of strings:
6 red players position 5, button 2
earn $50 pos3, up to $1,000
earn $50 pos 2, up to $500
table button 4, before Jan 21
I want to get the number that comes after 'pos' or 'position', and if there's no such keyword, get the last number before the first comma. The position value can be a number between 1 and 100. So 'position' for each of the previous rows would be:
Input text                            Desired match (position)
6 red players position 5, button 2    5
earn $50 pos3, up to $1,000           3
earn $50 pos 2, up to $500            2
table button 4, before Jan 21         4
I have a big data set (in BigQuery) populated with basically those 4 types of strings.
I've already searched for this type of problem but found no solution or point to start from.
I've tried .+?(?=,) which extracts everything up to the first comma (,), but then I'm not sure how to go about extracting only the numbers from this.
I've tried (?:position|pos)\s?(\d) which extracts what I want for group 1 (by using non-capturing groups), but doesn't solve the 4th type of string.
I feel like there's a way to combine these two, but I just don't know how to get there yet.
And so, after the two things I've tried, I have two questions:
Is this possible with only regex? If so, how?
What would I need to do in SQL to make my life easier at getting these values?
I'd appreciate the help/guidance with this. Thanks a ton!
You can use
^(?:[^,]*[^0-9,])?(\d+),
See the RE2 regex demo. Details:
^ - start of string
(?:[^,]*[^0-9,])? - an optional sequence of:
[^,]* - zero or more chars other than comma
[^0-9,] - a char other than a digit and comma
(\d+) - Group 1: one or more digits
, - a comma
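As a quick sanity check, the pattern can be run against the four sample strings in Python. This is only a sketch: the function name position is made up, and Python's re module is used in place of RE2, which should not matter here because the pattern uses no lookarounds or other engine-specific syntax.

```python
import re

# Optional prefix that must end in a non-digit, non-comma char,
# then the digits that sit directly before the first comma.
NUMBER_BEFORE_FIRST_COMMA = re.compile(r"^(?:[^,]*[^0-9,])?(\d+),")

def position(text):
    m = NUMBER_BEFORE_FIRST_COMMA.search(text)
    return int(m.group(1)) if m else None

samples = [
    "6 red players position 5, button 2",
    "earn $50 pos3, up to $1,000",
    "earn $50 pos 2, up to $500",
    "table button 4, before Jan 21",
]
positions = [position(s) for s in samples]
```

Because the pattern has exactly one capturing group, it should also be usable directly as the second argument of BigQuery's REGEXP_EXTRACT, which returns the captured substring in that case.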
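Alternatively, use a lookahead for a comma, with a lookbehind requiring the previous char to be a space or a letter, to prevent matching the "1" in "$1,000". Note that RE2, the regex engine BigQuery uses, does not support lookarounds, so this variant only works in engines that do: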
(?<=[ a-z])(\d+)(?=,)
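A minimal sketch of how this lookaround variant behaves, using Python's re module (which supports lookbehind and lookahead); the function name position_lookaround is made up for the illustration.

```python
import re

# Digits preceded by a space or lowercase letter and followed by a comma.
LOOKAROUND = re.compile(r"(?<=[ a-z])(\d+)(?=,)")

def position_lookaround(text):
    m = LOOKAROUND.search(text)
    return int(m.group(1)) if m else None
```

On "earn $50 pos3, up to $1,000" the search skips "50" (preceded by "$", not followed by a comma) and "1" (preceded by "$"), and returns 3.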

Regex in Hive QL (RLIKE) - performance?

I'm wondering how/if I can improve the regex I'm using in a query. I have a set of identifiers for certain user groups. They can be in two main formats:
X123 or XY12, (type 1)
any two letter combo, excluding XY (type 2)
Type 1 groups always are of length 4. It's either letter X followed by a number between 100 and 999 (inclusive) OR XY followed by numbers between 0 and 99 (padded to length 2 with zeros).
Type 2 groups are 2 letter strings, with any letter allowed, excluding XY (although my query doesn't specify this).
User can belong to multiple groups, in which case different groups are separated by pound symbol (#). Here's an example:
groups       user      age
X124         john      23
XY22#AB      mike      33
AB           peter     21
X122#XY01    francis   43
I want to count rows in which at least one group in second format appears, i.e. where user is not exclusively member of groups in first format.
I need to catch all rows (i.e. users) which don't belong exclusively to first type of groups. In the example above, I want to exclude users john and francis because they are members only of type 1 groups.
On the other hand, mike is OK because he's member of AB group (i.e. group of type 2).
I'm currently doing it like this:
select
    count(*)
from
    users
where
    groups not rlike '^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$'
Is this bad performance wise? And how should I approach fixing it?
I want to count rows in which at least one group in second format appears.
It seems a bit simpler then to select where groups rlike:
\b(?:(?!XY)[A-Z]{2})\b
\b is a word boundary. It doesn't consume a character; instead it asserts that the position lies between a word character (letter, digit or underscore) and a non-word character, or at the start or end of the string.
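The pattern can be tried against the sample rows with a short Python sketch. Python's re module is used here as a stand-in for the Java regex engine behind Hive's RLIKE; both support the negative lookahead and \b used in this pattern, but performance in Hive is not something this sketch can show.

```python
import re

# A two-letter group that is not 'XY', delimited by word boundaries
# ('#', the start and the end of the string all count as boundaries).
TYPE2_GROUP = re.compile(r"\b(?:(?!XY)[A-Z]{2})\b")

rows = [
    ("X124", "john"),
    ("XY22#AB", "mike"),
    ("AB", "peter"),
    ("X122#XY01", "francis"),
]

# Like RLIKE, search() succeeds if the pattern matches anywhere in the value.
users_with_type2 = [user for groups, user in rows if TYPE2_GROUP.search(groups)]
```

Only mike and peter are kept, which matches the rows the original not rlike query is meant to count.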

openrefine extracting a number from a text column using regex

I'm trying to parse out a column of data from the OpenFoodFacts dataset that I found via Kaggle. There is an attribute called "serving_size" that contains whatever serving size information is presented on the package for a food item. Most of the time the serving size is expressed in grams (g), however there is often other text as well. I'd like to be able to search through the string, find the number that corresponds to the number of grams, and extract that value into its own field. The value is not just an integer - it might have a decimal.
I'm new to regular expressions, but it seems like it ought to be possible to search for the "g" character and if it is preceded by any numeric values to extract them. I've found some recipes that suggest this is possible, but so far nothing I've tried has worked. In the OpenRefine documentation they give the example of extracting decimal data using this regex: /[-+]?[0-9]+(.[0-9]+)?/, but there was no variation of that I could get to work in our scenario. I've also tried commands like "value.match(/(.)?(/d+[g]).?/)". I'm finding that I don't understand how regex is supposed to work - when I tell it "/d" I'm expecting that it will ONLY give me back numeric values, however that does not appear to be the case - it gives whatever is there regardless of the character type.
Any help would be appreciated.
Here are some example text strings from the data:
serving_size
- 113.5g
- 20g
- 1 cup (227g)
- 4 cookies (15g)
- 13 pieces (39g)
- 1/4 packet (21g) makes 1/2 cup
- 0.75 oz (21g)
- 1 can (12 FL OZ) 355g
- 15.2 fl oz (450g)
- 1 can (355mL)
- 1/4 tsp (1.4g)
- 10 fl oz 1 bottle.
- 20 fl oz
- 1 envelope (21g)
- 1 tbsp (4.5g)
- 45.2g
- 1/2 pack 142.5gms
- 1 carré de chocolat de 20g
- 4 biscottes (≈ 35g) Ce paquet contient 8.5 portions de 4 biscottes.
- 0.33L
- 2galettes 10.5g
- 0.041649313g
- 1 package (79g)
In OpenRefine GREL (the language used to write the transformations) the 'match' function requires the regular expression to match the entire string in the cell - you can't use a partial match.
The output of the 'match' function is an array of all the capture groups. To get a specific value you have to select this from the array, or convert the array to a string.
So for example you could try:
value.match(/.*?(\d+\.?\d*)g(ram)?(s)?\b?.*/)[0]
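This will find all strings where there is a number (with or without a decimal point) directly in front of the letter 'g', or 'gram' or 'grams', and will capture the number as the first member of the resulting array of capture groups. Note that because the word boundary is written as '\b?', it is optional, so the unit does not actually have to be followed by a non-word character (e.g. a space or a bracket); remove that '?' if you want to require one.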
The '?' is needed after the first '.*' to make this lazy, so that the capture group gets the whole number, not just the last digit.
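The behaviour can be explored outside OpenRefine with a small Python sketch. It is an approximation under assumptions: the function name grams is made up, re.fullmatch is used to mimic the whole-string requirement of GREL's match, and the optional '\b?' is left out because an optional zero-width assertion does not change what the pattern matches.

```python
import re

# Lazy prefix, captured number, then 'g', 'gram' or 'grams', then anything.
GRAMS = re.compile(r".*?(\d+\.?\d*)g(ram)?(s)?.*")

def grams(serving_size):
    # fullmatch: the whole cell value has to match, as with GREL's match()
    m = GRAMS.fullmatch(serving_size)
    return float(m.group(1)) if m else None
```

With this, "1 cup (227g)" gives 227.0 and "1 can (355mL)" gives None. A value such as "2galettes 10.5g" gives 2.0 rather than 10.5, because the first 'g' after a number is the one in "galettes", so results for rows like that are worth checking by hand.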

Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                              "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))
Desired results
Desired results should look like that:
          indicator   period    values
1     someindicator     2001 0.2655087
2     someindicator     2011 0.3721239
3         some text 20022008 0.5728534
4 another indicator     2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (counting from, and including, the first digit) are in the second column
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
  separate(col = indicator, into = c("indicator", "period"),
           sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
? matches the space character in front of it literally, between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
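The same lookaround pattern can be tried outside R. The sketch below uses Python's re.split, which has supported splitting on empty matches since Python 3.7; the function name split_indicator is made up for the illustration.

```python
import re

# Split point: after a lowercase letter, before a digit, swallowing one optional space.
SPLIT_POINT = re.compile(r"(?<=[a-z]) ?(?=[0-9])")

def split_indicator(text):
    parts = SPLIT_POINT.split(text, maxsplit=1)
    # No split point found: keep the text and leave the period empty.
    return (parts[0], parts[1]) if len(parts) == 2 else (text, None)
```

Both "someindicator2001" (zero-width split) and "some text 20022008" (split on the space) come out as a description and a period, because the split position is defined by what surrounds it rather than by a separator character.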
You could also use unglue::unglue_unnest():
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)