How do I extract the acres burned from within a variable - SAS

Hi, I need help extracting the acres burned from a SAS file named wildfire_narrative. The variables in the file are episode_id, episode_narrative, event_id, and event_narrative. The acres burned are within the variable episode_narrative, which contains at least a paragraph of text, and the acres burned figure appears somewhere within that text.
EXAMPLE: Dry Santa winds caused a discarded cigarette butt in the median of Interstate 8 to grow into a 10,353 acre brush fire. Resources used to fight the fire cost over $8 million and involved 2000 fire fighters, nine helicopters, and nine air tankers. Property damaged or destroyed in the fire consisted of 15 single family homes, 65 outbuildings, 15 trailers, and 164 motor vehicles. Several livestock were burned and later euthanasized. thank you.
data acres;
set 'C:\Users\scott\Downloads\Wildfire_narrative.sas7bdat';
acresBurned = scan(episode_narrative, findw("acre",0-8,' ')-1, ",");
run;

Something like this with prxchange should work. It uses a regular expression to replace everything else and keep only the number in front of "acre". The code below captures several groups and replaces the whole string with the number in front of "acre".
acres=input(prxchange('s/(.+?)([0-9\,]+)(?=\s?\-?acre)(.+)/$2/i',-1,
acresBurned),comma10.);
Brief explanation of the above code:
(.+?) is the first capture group, which lazily matches everything up to the number that precedes the word acre
([0-9\,]+) is the second capture group, which holds the number (digits and commas)
(?=\s?\-?acre) is a lookahead, not a capture group; it makes sure the number is followed by the word acre, optionally preceded by a space or a hyphen
(.+) is the third capture group, which matches through to the end of the text
/$2/ replaces everything with the second capture group, and the input function converts the value to a number
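For readers without SAS at hand, the same number-before-"acre" idea can be sketched in Python. This is only an illustration, not the SAS answer itself: it swaps the replace-everything approach for a direct search, and the function name acres_burned is made up for this example.

```python
import re

# Digits (with optional thousands separators) followed by an optional
# space or hyphen and the word "acre". The lookahead checks for "acre"
# without making it part of the match.
ACRES = re.compile(r"([0-9][0-9,]*)(?=\s?-?acre)", re.IGNORECASE)

def acres_burned(narrative):
    """Return the acreage as an int, or None if no 'N acre' phrase is found."""
    m = ACRES.search(narrative)
    if m is None:
        return None
    return int(m.group(1).replace(",", ""))

text1 = ("Dry Santa winds caused a discarded cigarette butt in the median "
         "of Interstate 8 to grow into a 10,353 acre brush fire.")
text2 = "It grew into a 100,353-acre brush fire and cost over $8 million."

print(acres_burned(text1))
print(acres_burned(text2))
```

Returning None when nothing matches avoids the situation where a replace-based pattern leaves the whole narrative untouched and the numeric conversion then fails.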
data have ;
length acresBurned $500.;
acresBurned = "Dry Santa winds caused a discarded cigarette butt in the median
of 8 to grow into a 10,353 acre brush fire. Resources used to fight the
fire cost over $8 million and involved 2000 fire fighters, nine helicopters, a
and nine air tankers. Property damaged or destroyed in the fire consisted of
15 single family homes, 65 outbuildings, 15 trailers, and 164 motor vehicles.
Several livestock were burned and later euthanasized";
output;
acresBurned = "Dry Santa winds caused a discarded cigarette butt in the
median
of Interstate 8 to grow into a 100,353-acre brush fire. Resources used
to fight
the fire cost over $8 million and involved 2000 fire fighters, nine
helicopters, and nine air tankers. Property damaged or destroyed in the
fire consisted of 15 single family homes, 65 outbuildings, 15 trailers, and
164 motor vehicles. Several livestock were burned and later
euthanasized";
output;
run;
data have1;
set have;
acres=input(prxchange('s/(.+?)([0-9\,]+)(?=\s?\-?acre)(.+)/$2/i',-1,
acresBurned),comma10.);
run;

Related

Extract the last 1 or 2 alphabetic character(s) from a quantity (mg,g,ml,l,cm,mm,m) when preceded by a digit

I need to be able to populate a cell in a Google Sheets spreadsheet with the measurement units extracted from the end of a string value in another cell. The raw data comes through with every source cell ending with a measurement unit, either preceded with a numeric value or not, as in the example data below...
SAMPLE DATA:
Colgate Plax Spearmint Alcohol Free Mouthwash 500ml
Peckish Tangy BBQ Rice Crackers 100g
Alison's Pantry BBQ Chickpea Snacks kg
Yoghurt Raisins Miscellaneous Confectionery kg
Roasted Unsalted Supreme Mixed Nuts kg
Alison's Pantry Honey & Dijon Snippets kg
Banana Chips kg
Sealord Satay Tuna 95g
Sealord Savoury Onion Tuna 95g
Coca-Cola No Sugar Soft Drink 2.25l
Tongariro Natural Spring 15l
Trident Sweet Chilli Sauce With Ginger 285ml
Pams Lite Whole Egg Mayonnaise 443ml
Value Lite Milk 2l
Morning Harvest Caged Size 7 Eggs 12pk
EXPECTED RESULT:
![New column showing the measurement units][1]
CURRENT METHODOLOGY:
=IF(A1<>"",REGEXEXTRACT(A1,"^.*([a-zA-Z][a-zA-Z])$|^.*([a-zA-Z])$"),"")
CURRENT RESULT:
![Result being split over two columns][2]
While I can combine the two values into a third column using the expression =IF(B1<>"",B1,IF(C1<>"",C1,"")), this becomes messy, convoluted, and adds unnecessary columns. I would prefer to tweak the regular expression to return just a single value, either the one or two character measurement unit. I have no idea how to achieve this, though. Any help would be appreciated.
You could also make the pattern a bit more specific by matching either a digit or a space, and then capturing one of the units at the end of the string.
=IF(A1<>"",REGEXEXTRACT(A1, "[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$"),"")
See a regex demo for the capture group values.
Match 1 optional letter, then 1 letter anchored to end:
=IF(A1<>"",REGEXEXTRACT(A1, "[a-zA-Z]?[a-zA-Z]$"),"")
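To compare the two suggested patterns outside of Sheets, here is a small Python sketch. It assumes that Python's re module treats these particular patterns the same way as the engine behind REGEXEXTRACT; the helper names unit_specific and unit_generic are invented for the example.

```python
import re

# Pattern from the first answer: a digit or space, then one known unit at the end.
UNIT_AT_END = re.compile(r"[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$")
# Pattern from the second answer: one optional letter plus one letter at the end.
LETTERS_AT_END = re.compile(r"[a-zA-Z]?[a-zA-Z]$")

def unit_specific(value):
    """Return the captured unit, or an empty string when nothing matches."""
    m = UNIT_AT_END.search(value)
    return m.group(1) if m else ""

def unit_generic(value):
    """Return the last one or two letters, or an empty string when nothing matches."""
    m = LETTERS_AT_END.search(value)
    return m.group(0) if m else ""

samples = [
    "Colgate Plax Spearmint Alcohol Free Mouthwash 500ml",
    "Peckish Tangy BBQ Rice Crackers 100g",
    "Banana Chips kg",
    "Coca-Cola No Sugar Soft Drink 2.25l",
    "Morning Harvest Caged Size 7 Eggs 12pk",
]

for s in samples:
    print(s, "->", unit_specific(s), "/", unit_generic(s))
```

Both return a single value per row, so no helper column is needed. The generic pattern would also return the last two letters of any trailing word, which is why the unit-specific pattern is safer if some rows might not end in a unit.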

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after them. Unfortunately there are cases where the to-be-extracted street information is separated from the country by only a single space, e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET, which are separated from the country by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
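As a quick check of the lookahead approach, here is a minimal Python sketch. It assumes Python's re behaves like PCRE for this pattern, applies the regex one line at a time instead of using the global and multiline options, and uses a made-up helper name (street_part) and the two-country subset from the question.

```python
import re

# Assumed subset of the country list; the real set is larger.
COUNTRIES = ("USA", "Canada")

# Lazy capture from the start of the line, stopped by a lookahead that
# requires whitespace, a country name, optional whitespace, and end of line.
PATTERN = re.compile(
    r"^(.*?)(?=\s+(?:%s)\s*$)" % "|".join(re.escape(c) for c in COUNTRIES)
)

def street_part(line):
    """Return everything before the trailing country, or None if no country is found."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

for line in ["12H MARKET ST. Canada",
             "1234 WEST SAND LAKE RD ?567 USA",
             "# USA",
             "1234 E MAIN STREET   USA"]:
    print(repr(street_part(line)))
```

Because the country is only inspected by the lookahead, it is never consumed, so a single separating space and several separating spaces are handled the same way.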

Word VBA - replace each integer i in a selection with i-1

I want to search a selection for integers and increment each by some value, such as 1. For example, in the following text from U.S. Patent No. 6,293,874, I would want to make replacements where double brackets indicate deletions and italics indicate insertions:
"In accordance with the present invention, an amusement apparatus is provided that includes a user-operated and controlled apparatus for self-infliction of repetitive blows to the user's buttocks including a plurality of rotating arms bearing flexible extensions for self-paddling the user's buttocks B. [The drawings show an] amusement apparatus [[10]] 11 for self-paddling a user U. As illustrated in [the drawings] the self-paddling apparatus [[10]] 11 includes a display platform [[12]] 13 having a hollow interior and having a first end [[14]] 15 portion and a second end [[16]] 17 portion. The platform [[12]] 13 is constructed of materials adequate to support the weight of at least a few people standing on the platform, such as a user and an observer. In one embodiment, the platform [[12]] 13 includes at least two generally equally sized subunits that are connectable at the mid-section [[20]] 21 for folding together after any upstanding posts mounted thereon are detached. The first end [[14]] 15 portion and the second end [[16]] 17 portion includes storage compartments within the hollow platform [[12]] 13 for storage of disassembled components connectable to the platform [[12]] 13. The first end [[14]] 15 is connected at the mid-section hinges 20 to the second end [[16]] 17 by a plurality of hinges [[22]] 23. Once the upstanding posts and accessories are detached, disassembled and stored, the platform [[12]] 13 is foldable and securable for storage or moving to the next location for reassembly, display, and use for self-paddling."
I know how to iterate over matches in a match collection, but only for whole paragraphs, viz.
For Each para In Selection.Paragraphs
    txt = para.Range.text
    offset = 0
    'any match?
    If re.Test(txt) Then
        'get all matches
        Set allMatches = re.Execute(txt)
        For Each m In allMatches
            num = m.Value + 1
            Set rng = para.Range
            rng.Collapse wdCollapseStart
            rng.MoveStart wdCharacter, m.FirstIndex + offset
            rng.MoveEnd wdCharacter, m.length
            rng.text = re.Replace(rng.text, num)
            'account for added digits:
            offset = offset + Len(CStr(num)) - Len(m.Value)
        Next m
    End If
Next para
Can I either (A) modify the foregoing code to search through the selection only, instead of the entire paragraphs that it touches, or (B) somehow use Selection.Find.Execute Replace:=wdReplaceAll and be assured that it won't replace everything in the document?* (*I have noticed that if a variable is defined in relation to the Selection object and used in the .text field of Selection.Find, Word will replace all instances in the document, even with .Wrap = wdFindStop. See Word VBA replace text appearing in selection only with uppercase)
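Leaving the Word object model aside, the core of option (A) is to apply the replacement to the selected character range only and leave the rest of the text untouched. The Python sketch below illustrates just that idea on a plain string; it is not Word VBA, and the function name increment_in_span and the start/end offsets standing in for the selection are assumptions made for the example.

```python
import re

def increment_in_span(text, start, end, delta=1):
    """Add delta to every integer inside text[start:end]; leave the rest alone."""
    selected = text[start:end]
    # A callback computes each replacement, so no manual offset bookkeeping
    # is needed when a number gains a digit (for example 9 -> 10).
    bumped = re.sub(r"\d+", lambda m: str(int(m.group()) + delta), selected)
    return text[:start] + bumped + text[end:]

text = "arm 9, post 10, hinge 99"
start = text.index("9")          # selection starts at the first number
end = text.index(", hinge")      # and stops before the last clause
print(increment_in_span(text, start, end))
```

In this sketch a number with leading zeros would lose them, and a selection boundary that cuts through a number would change only the selected digits, so both cases would need extra handling in real use.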

Regular Expression for group

I have a text where I need to find 3 groups of strings.
I tried the expression \r?\n\r?\n\r?[0-9A-Z].*\d{7} but I find only 2 strings instead of 3.
I should highlight 00170784, HEDINV, and 00173575, but I get only 00170784 and 00173575.
This is the text:
BUY
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.375
SELL
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.5938
May I suggest using this instead?
^\d{8}$|^[A-Z]{6}$
It has two alternatives, not capture groups. One is an 8-digit sequence that makes up a whole line. The other is a 6-letter sequence that makes up a whole line. Multiline mode must be enabled so that ^ and $ match at line boundaries. That grabs what you're looking for, unless there's a specific reason you're using all those linebreak matches.
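A short Python sketch of the suggested pattern follows, assuming the text uses one item per line as shown in the question. The re.MULTILINE flag is what lets ^ and $ match at each line boundary.

```python
import re

text = """BUY
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.375"""

# Whole-line match: either exactly 8 digits or exactly 6 capital letters.
pattern = re.compile(r"^\d{8}$|^[A-Z]{6}$", re.MULTILINE)

matches = pattern.findall(text)
print(matches)
```

Lines such as USM4 or 56 are skipped because they do not have exactly six capital letters or exactly eight digits.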

Google Scripts replace function to prefix a regexp in CAPS

In GAS, using .replace(), is it possible to match any term within a long text string that is at least 5 consecutive ALL CAPS characters (may have 1 space in there) and prefix it with a string, such as ][? There may be multiple matches within the text string, so I want to insert markers that begin and end a phrase beginning with an ALLCAPS category.
An example of a similar type of text would be this (structurally similar, but with other sensitive data):
"VACATION: Approved by Supervisor - Frequency 1-3 times per year; duration not to exceed 5 days. SICK LEAVE: Approved by Supervisor - Frequency up to 8 per year, no more than 5 days consecutively without MD excuse. FMLA FEDERAL: Approved by HR - Frequency as needed, must be approved at least 14 days in advance, or within 24 hours of employee's identified need."
I have learned, through Serge, how to replace globally, which was a big help, but the more I research regexp's, the more confusing it gets. I tried substituting the all caps regexp for a specific term, but failed. I think that I could go through and extract all of the all caps regexp's and use them in a replace with multiple values, but it seems that would be a very long way around.
Is it possible, in a couple of lines to make the above text look like this:
"][VACATION: Approved by Supervisor - Frequency 1-3 times per year; duration not to exceed 5 days. ][SICK LEAVE: Approved by Supervisor - Frequency up to 8 per year, no more than 5 days consecutively without MD excuse. ][FMLA FEDERAL: Approved by HR - Frequency as needed, must be approved at least 14 days in advance, or within 24 hours of employee's identified need."
My intention is to then split on the ] Which would mean that new cells would start with the all caps term, and end with ]. I have the code to convert the text to an array (there are lots of entries), then use .replace() to find and replace within the array, and to set the values back into the sheet, but I just don't know if there is a way to either prefix (my research says lookback isn't possible in GAS), or to pick up the allcaps value, add the string "][", and put it back.
If this is asking too much, or feels like I haven't included any code, here is the first part that Serge already helped with: Looking for a Google script that will perform CTRL+F replace for a string
Here is the code, as I used it, combining Serge's previous help and the new recommendation. I had to fix some case issues with a term before running the all caps because some people can't follow a template, but it works.
function insertSplitMarkers(){
  var sh = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Freq Iso');
  var data = sh.getRange(2,1,sh.getLastRow(),sh.getLastColumn()).getValues();// get all data
  var regexp = /(([A-Z]\s*){5,})/g;
  for(var n=0;n<data.length;n++){
    for(var m=0;m<data[0].length;m++){
      if(typeof(data[n][m])=='string'){ // if it is a string
        data[n][m]=data[n][m].replace(/Interventions/g,'INTERVENTIONS');// use the regex replace with /g parameter meaning "globally"
        data[n][m]=data[n][m].replace(regexp, "][$1");
      }
    }
  }
  Logger.log(data);
  sh.getRange(2,1,data.length,data[0].length).setValues(data);
}
It looks like this will do what you want although as is, it will also pick out aoAOEOUE:
var yourString = "VACATION: Approved by Supervisor - Frequency 1-3 times per year; duration not to exceed 5 days. SICK LEAVE: Approved by Supervisor - Frequency up to 8 per year, no more than 5 days consecutively without MD excuse. FMLA FEDERAL: Approved by HR - Frequency as needed, must be approved at least 14 days in advance, or within 24 hours of employee's identified need.";
var regexp = /(([A-Z]\s*){5,})/g;
var newString = yourString.replace(regexp, "][$1");
Logger.log(newString);
@user3169581 I've adjusted your regex slightly to try to eliminate matching whitespace around the desired phrase and to ensure you get the whole desired phrase. It will require a little adjustment in the replace:
var regexp = /\b([A-Z\s]{5,})(:)/g
...
data[n][m] = data[n][m].replace(regexp,"][$1$2")
Link to regex101 with working matching here: http://regex101.com/r/rD5kS9
HTH
EDIT: for some reason the existing answer wasn't showing up for me when I started this response. Forgive the redundancy.
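For a quick check outside Apps Script, the colon-anchored pattern can be tried in Python. This is only a sketch and assumes Python's re handles the pattern the same way as the JavaScript engine; the group references are written \1 and \2 instead of $1 and $2, and the sample text is shortened from the question.

```python
import re

text = ("VACATION: Approved by Supervisor - Frequency 1-3 times per year; "
        "duration not to exceed 5 days. SICK LEAVE: Approved by Supervisor - "
        "Frequency up to 8 per year, no more than 5 days consecutively without "
        "MD excuse. FMLA FEDERAL: Approved by HR - Frequency as needed.")

# Group 1: at least 5 capitals or whitespace characters starting at a word
# boundary. Group 2: the colon that must follow the heading.
pattern = re.compile(r"\b([A-Z\s]{5,})(:)")

marked = pattern.sub(r"][\1\2", text)
print(marked)

# Splitting on "]" then yields pieces that start with "[" plus the heading.
pieces = [p for p in marked.split("]") if p]
print(len(pieces))
```

Requiring the colon keeps short capitalized fragments such as MD or HR from being marked, since they are neither five characters long nor followed by a colon.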