I am given this paragraph:
*Cast and characters
* Bob Denver is Gilligan, the inept, accident-prone First Mate
(affectionately known as "Little Buddy" by "the Shipper") of the SS Minnow. Denver was not the first choice to play Gilligan; actor Jerry Van Dyke, phone 210-222-3333, was offered the role on 2/11/1963, but he turned it down, believing that the show would never be successful. He chose instead to play the lead in My Mother the Car, which premiered the following year and was cancelled after one season. The producers looked to Bob Denver, the actor who had played Maynard G. Krebs, ss #111-22-3333, the goofy but lovable beatnik in The Many Loves of Dobie Gillis. None of the show's episodes ever specified Gilligan's full name or clearly indicated whether "Gilligan" was the character's first name or his last. In the DVD collection, Sherwood Schwartz states that he preferred the full name of "%Willy Gilligan%" for the character.
My goal is to make "%Willy Gillgan%" into "" using sed. I have tried s/%[^%]*%// but it also interferes with another sed command s/[0-9]{3}-[0-9]{2}-[0-9]{4}/%%%-%%/%%%%/ that changes #111-22-3333 into #%%%-%%-%%%%. It deletes 2 %'s turning into #%-%%-%%%% incorrectly.
Here are my other sed commands incase it might interfere with something else:
s+([0-9]{1,2})-([0-9]{1,2})-([0-9]{4})+\3-\2-\1+g converts date format
/[*]\s/i\
\n* ATTENTION *\n adds the line * ATTENTION * and a newline when it encounters "* " anywhere in the paragraph.
This is what my script file looks like:
s/[0-9]{3}-[0-9]{2}-[0-9]{4}/%%%-%%/%%%%/
s+([0-9]{1,2})-([0-9]{1,2})-([0-9]{4})+\3-\2-\1+g
s/%[^%]*%//
/[*]\s/i\
\n* ATTENTION *\n
Any help is greatly appreciated.
Try:
$ cat script.sed
s/%[^%]*%//g
s/[0-9]{3}-[0-9]{2}-[0-9]{4}/%%%-%%-%%%%/g
s+([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})+\3-\2-\1+g
/[*]\s/i\
\n* ATTENTION *\n
This produces the output:
$ sed -Ef script.sed text
*Cast and characters
* ATTENTION *
* Bob Denver is Gilligan, the inept, accident-prone First Mate
(affectionately known as "Little Buddy" by "the Shipper") of the SS Minnow. Denver was not the first choice to play Gilligan; actor Jerry Van Dyke, phone 210-222-3333, was offered the role on 1963-11-2, but he turned it down, believing that the show would never be successful. He chose instead to play the lead in My Mother the Car, which premiered the following year and was cancelled after one season. The producers looked to Bob Denver, the actor who had played Maynard G. Krebs, ss #%%%-%%-%%%%, the goofy but lovable beatnik in The Many Loves of Dobie Gillis. None of the show's episodes ever specified Gilligan's full name or clearly indicated whether "Gilligan" was the character's first name or his last. In the DVD collection, Sherwood Schwartz states that he preferred the full name of "" for the character.
Related
Looking for help on building a regex that captures a 1-line string after a specific word.
The challenge I'm running into is that the program where I need to build this regex uses a single line format, in other words dot matches new line. So the formula I created isn't working. See more details below. Any advice or tips?
More specific regex task:
I'm trying to grab the line that comes after the word Details from entries like below. The goal is pull out 100% Silk, or 100% Velvet. This is the material of the product that always comes after Details.
Raw data:
<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p> <p>Size 34 measurements</p>
OR
<p>The velvet version of this dress. High waist fit with hook and zipper closure.
Seams run along edges of pants to create a box-like.</p>
<h3>Details</h3> <p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
Here is the current formula I created that's not working:
Replace (.)(\bDetails\s+(.)) with $3
The output gives the below:
<p>100% Silk.</p>
<p>Made in Portugal.</p>
<h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p>
<p>Size 34 measurements</p>
OR
<p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
`
How do I capture just the desired string? Let me know if you have any tips! Thank you!
Difficult to provide a working solution in your situation as you mention your program has "limited regex features" but don't explain what limitations.
Here is a Regex you can try to work with to capture the target string
^(?:<h3>Details<\/h3>)(.*)$
I would personally use BeautifulSoup for something like this, but here are two solutions you could use:
Match the line after "Details", then pull out the data.
matches = re.findall('(?<=Details<).*$', text)
matches = [i.strip('<>') for i in matches]
matches = [i.split('<')[0] for i in [j.split('>')[-1] for j in matches]]
Replace "Details<...>data" with "Detailsdata", then find the data.
text = re.sub('Details<.*?<.*>', '', text)
matches = re.findall('(?<=Details).*?(?=<)', text)
I have a text file which contains a series of movie titles, which looks like this once opened.
A Nous la Liberte (1932) About Schmidt (2002) Absence of Malice
(1981) Adam's Rib (1949) Adaptation (2002) The Adjuster (1991) The
Adventures of Robin Hood (1938) Affliction (1998) The African Queen
(1952)
Using the code below:
def movie_text():
moviefile = open("movies.txt", 'r')
yourResult = [line.split('\n') for line in moviefile.readlines()]
movie_text()
I get nothing.
Your code doesn't prints right.
If I understand it well,
moviefile = open("movies.txt", 'r')
lines=moviefile.readlines()
print(len(lines)) # Shows list size
for line in lines:
print(line[:1]) # The [:1] part cuts the \n
The method readlines returns a list, I am not sure why your use split. I mean, if all you want is to remove the '\n', you can do it in many ways, being the one I used just one of them.
Hope it works!
I'm working from a spreadsheet of values. I have code that pulls a row of content to analyze. I was planning to split it on commas, but some of the strings inside the cells include commas (that aren't regularly spaced, so escaping them would be difficult). I downloaded the sheet as a tsv instead of a csv and re-uploaded it, but my attempts to split on \t haven't been successful. (For good measure, I've also tried \n, \r, and \f to see if they're involved in delimiting cells. They don't seem to be.)
Is there a special character that means "next cell" or "next record" or something like that? Am I better off trying to end each cell with a particular character that I would then have to strip out of my data after splitting? I'd welcome any other ideas!
Code snippet:
var lastRowContents = dataSheet.getRange(lastRow, 1, 1, 21).getValues();
var contentChunks = lastRowContents.toString().split('\t');
var product = contentChunks[0];
Logger.log(product);
This outputs the entire row as one item in that array, like so:
product: Wed Jan 05 2005 02:00:00 GMT-0600 (CST),001-2005, Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings ,misbranded,http://www.fsis.usda.gov/wps/portal/fsis/topics/recalls-and-public-health-alerts/recall-case-archive/recall-case-archive-2005/!ut/p/a1/jZDBCoJAEIafpQdYdlZN9CgLppa7SGS2l1gW0wVTMfHQ06d0MpScOc3w_XzMYIEzLGo56EL2uqllNc3CvkMCNnEpRNz3fAiZ6acOOxDg9gjcZoBLJiBN-JFScJi5Mb9SHvzLRxsERhfTuMCilX2JdP1ocNblSlYVUvKVI9mpUg_54hIZAHt8xWKuATL2qDlbQcRM4NYvsPCHL7B-aPu8ZO9TADr0dh-fh2db/?1dmy¤t=true&urile=wcm%3apath%3a%2Ffsis-archives-content%2Finternet%2Fmain%2Ftopics%2Frecalls-and-public-health-alerts%2Frecall-case-archive%2Farchives%2Fct_index271,http://www.fsis.usda.gov/wps/portal/fsis/topics/recalls-and-public-health-alerts/recall-case-archive/recall-case-archive-2005/!ut/p/a1/jZDBCoJAEIafpQdYdlZN9CgLppa7SGS2l1gW0wVTMfHQ06d0MpScOc3w_XzMYIEzLGo56EL2uqllNc3CvkMCNnEpRNz3fAiZ6acOOxDg9gjcZoBLJiBN-JFScJi5Mb9SHvzLRxsERhfTuMCilX2JdP1ocNblSlYVUvKVI9mpUg_54hIZAHt8xWKuATL2qDlbQcRM4NYvsPCHL7B-aPu8ZO9TADr0dh-fh2db/?1dmy¤t=true&urile=wcm%3apath%3a%2Ffsis-archives-content%2Finternet%2Fmain%2Ftopics%2Frecalls-and-public-health-alerts%2Frecall-case-archive%2Farchives%2Fct_index386,Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings,Produced 10/6/2004. The products subject to recall are: One pound bags of "DAY-LEE PRIDE BEEF GYOZA POTSTICKERS, VEGETABLE AND BEEF DUMPLINGS." Each bag bears the code "28004," as well as "Est. 17309" inside the USDA mark of inspection.,The packages state that the gyozas are filled with beef, but they may instead contain shrimp, a known allergen.,The problem was discovered by the establishment.,17309 M Day-Lee Foods Inc. 13055 E. Molette St. Santa Fe Springs, CA 90670,,Approximately 2,520 pounds,California, Colorado, Georgia, Maryland, New York, and Washington.,Class I,U.S. Food and Drug Administration (FDA),,,,,
(just for visibility :)
since lastRowContents is an 2D array (link to doc) you have every cell with lastRowContents[0][0],lastRowContents[0][1],lastRowContents[0][2],etc..
in your code :
var lastRowContents = dataSheet.getRange(lastRow, 1, 1, 21).getValues();
var product = lastRowContents[0][0];
Logger.log(product);
I am using perl to scrape the following through .txt which I'd ultimately bring into Stata. What format option works? I have many such observations, so would like to use an approach over which I can generalize.
The original data are of the form:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
Location: District 1, Ocean City, Cape May, New Jersey, USA
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
Location: Precinct 5, District 2, Chicago, Cook, Illinois, USA
The goal is to create the variables in Stata:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
County: Cape May
State: New Jersey
First Name: Allen
Last Name: McBride
Birth Year: 1967
County: Cook
State: Illinois
What possible .txt might lead to such, and how would I load it into Stata?
Also, the amount of terms vary in Location as in these 2 examples, but I always want the 2 before USA.
At the moment, I am putting "", around each variable from the table for the .txt.
"Allen","Von Schmidt","1965","District 1, Ocean City, Cape May, New Jersey, USA"
"Lee Roy","McBride","1967","Precinct 5, District 2, Chicago, Cook, Illinois, USA"
Is there a better way to format the .txt? How would I create the corresponding variables in Stata?
Thank you for your help!
P.S. I know that stata uses infile or insheet and can handle , or tabs to separate variables. I did not know how to scrape a variable like Location in perl with all of the those so I added the ""
There are two ways to do this. The first is to paste the data into your do-file and use input. Assuming the format is fairly regular, you can clean it up easily using commas to parse. Note that I removed the commas:
#delimit;
input
str100(first_name last_name yob geo);
"Allen" "Von Schmidt" "1965" "District 1, Ocean City, Cape May, New Jersey, USA";
end;
compress;
destring, replace;
split geo, parse(,);
rename geo1 district;
rename geo2 city;
rename geo3 county;
rename geo4 state;
rename geo5 country;
drop geo;
The second way is to insheet the data from the txt file directly, which is probably easier. This assumes that the commas were not removed:
#delimit;
insheet first_name last_name yob geo using "raw_data.txt", clear comma nonames;
Then clean it up as in the first example.
This isn't a complete answer, but I need more space and flexibility than comments (easily) allow.
One trick is based on peeling off elements from the end. The easiest way to do that could be to start looking for the last comma, which is in turn the first comma in the reversed string. Use strpos(reverse(stringvar), ",").
For example the first commma is found by strpos() like this
. di strpos("abcd,efg,h", ",")
5
and the last comma like this
. di strpos(reverse("abcd,efg,h"), ",")
2
Once you know where the last comma is you can peel off the last element. If the last comma is at position # in the reversed string, it is at position -# in the string.
. di substr("abcd,efg,h", -2, 2)
,h
These examples clearly are calculator-style examples for single strings. But the last element can be stripped off similarly for entire string variables.
. gen poslastcomma = strpos(reverse(var), ",")
. gen var_end = substr(var, -poslastcomma, poslastcomma)
. gen var_begin = substr(var, 1, length(var) - poslastcomma)
Once you get used to stuff like this you can write more complicated statements with fewer variables, but slowly, slowly step by step is better when you are learning.
By the way, a common Stata learner error (in my view) is to assume that a solution to a string problem must entail the use of regular expressions. If you are very fluent at regular expressions, you can naturally do wonderful things with them, but the other string functions in conjunction can be very powerful too.
In your specific example, it sounds as if you want to ignore a last element such as "USA" and then work in turn on the next elements working backwards.
split in Stata is fine too (I am a fan and indeed am its putative author) but can be awkward if a split yields different numbers of elements, which is where I came in.
Intro
This post is long, but I consider it thorough. I hope this post might be helpful (addresses) to others while teaching complex VIM regexes. Thank you for your time.
Worldwide addresses:
American, Canadian and a few other countries are offered 5 fields on a form, which is then displayed in a comma delimited format that I need to further dissect. Ideally, the comma-separated content looks like:
Some Really Nice Place, 111 Street, Beautiful Town, StateOrProvince, zip
where zip can be either a series of just numbers (US) or numbers and letters (Canada).
Invariably, people throw an extra comma into their text box field input and that adds some complexity to the parsing of this data. For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, StateOrProvince, zip
Further complicating this parse is that the data from non-US and non-Canadian countries contains an extra comma-delimited field that was somehow provided to them - adding a place for them to enter their country. (No, there is no "US" or "Canada" field for their entries. So, it's "in addition" to the original 5 comma-delimited fields.) Such as:
Foreign Name of Building, A street name, A City, ,zip, Country
The ",," is usually empty as non-US countries do are not segmented into states. And, yes, the same "additional commas" as described above happens here too.
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Parsing Strategy:
A country name will never include a digit, whereas a US or Canadian zip will always have at least some digits. If you go backwards using this assumption about the contents of the last field then you should be able to place the country, zip, State (if not empty ",,"), City and Street into their respect positions - which are the most important fields to get right. Anything beyond those sections could be lumped together in the first or or two lines as descriptions of the address (i.e. building, name, suite, cross streets, etc). For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, Digits&Letters
Last section has a digit (therefore a US or Canadian address)
There a total of 6 sections, so that's one more than the original 5
Knowing that sections 5-2 are zip, state, town, address...
6 minus 5 (original) = add an extra Address (Address2) field and leave the first section as the header, resulting in:
Header: Some Really Nice Place, Address1: 111 Street, Address2: Suite 101, Town: Beautiful Town, State/Province: Lovely State, Zip: Digits&Letters
Whereas there might be a discrepancy on where "111 Street" or "Suite 101" goes (Address1 or Address2), it at least gets the zip, state, city and address(s) lumped together and leaves the first section as the "Header" to the email address for data entry purposes.
Under this approach, foreign address get parsed like:
Foreign Name of Building, cross streets, district, A street name, A
City, ,zip, Country
Last section has no digit, so it must be a Country
That means, moving right to left, the second section is the zip
So now (foreign) you have an "original 6 sections" to subtract from the total of 7 in the example
7th section = country, 6th = zip, 5th = state (mostly blank on foreign address), 4th = City, 3rd = address1, 2nd = address2, 1st = header
We knew to use two address fields because the example had 7 sections and foreign addresses have a base of 6 sections. Any number of sections above the base are added to a second address2 field. If there are 3 sections above the base section count then they are appended to each inside the address2 field.
Coding
In this approach using VIM, how would I initially read the number of comma-delimited sections (after I've captured the entire address in a register)? How do I do submatch(es) on a series of comma-delimited sections for which I am not sure the number of sections that exist?
Example Addresses
Here are some practice address (US and Foreign) if you are so inclined to help:
City Gas & Electric - Bldg 4, 222 Middle Park Ct, CP4120F, Dallas, Texas, 44984
MHG Engineering, Inc. Suite 200, 9899 Balboa Ave, San Diego, California, 92123-1502
SolarWind Turbines, 2nd Floor Conference Room, 2300 Ruffin Road, Seattle, Washington, 84444
123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344
Ongwanda Gov't Resources, 6000 Portsmouth Avenue, Ottawa, Ontario, K7M 8A6
Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore
Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia
Virtual Steel, 1 Umgazi Rd Aspec Park, Pretoria, , 0075, South Africa
Idiom Towers South, Fifth Floor, Jasmen Conference Room, 1500 Freedom Street, Pretoria, , 0002, South Africa
The following code is a draft-quality Vim script (hopefully) implementing the
address parsing routine described in the question.
function! ParseAddress(line)
let r = split(a:line, ',\s*', 1)
let hadcountry = r[-1] !~ '\d'
let a = {}
let a.country = hadcountry ? r[-1] : ''
let r = r[:-1-hadcountry]
let a.zip = r[-1]
let a.state = r[-2]
let a.city = r[-3]
let a.header = r[0]
let nleft = len(r) - 4
if hadcountry
let a.address1 = r[-4]
let a.address2 = join(r[1:nleft-1], ', ')
else
let a.address1 = r[1]
let a.address2 = join(r[2:nleft], ', ')
endif
return a
endfunction
function! FormatAddress(a)
let t = map([
\ ['Header', 'header'],
\ ['Address 1', 'address1'],
\ ['Address 2', 'address2'],
\ ['Town', 'city'],
\ ['State/Province', 'state'],
\ ['Country', 'country'],
\ ['Zip', 'zip']],
\ 'has_key(a:a, v:val[1]) && !empty(a:a[v:val[1]])' .
\ '? v:val[0] . ": " . a:a[v:val[1]] : ""')
return join(filter(t, '!empty(v:val)'), '; ')
endfunction
The command below can be used to test the above parsing routines.
:g/\w/call setline(line('.'), FormatAddress(ParseAddress(getline('.'))))
(One can provide a range to the :global command to run it through fewer
number of test address lines.)
Maybe you should review some of the other questions about addresses around the world. The USA and Canada are extraordinarily systematic with their systems; most other countries are a lot less rigorous about the approved formats. Anything you devise for the USA and Canada will run into issues almost immediately you deal with other addresses.
Best practices for storing postal addresses in a database
Is there a common street address database design for all addresses of the world
How many address fields would you use for a UK address
ISO Standard Street Addresses
There are probably other related questions: see the tag street-address for some of them.