Beautiful Soup Exact tag data - python-2.7

I am using BeautifulSoup to extract some data from an HTML page. What I am doing is:
list=soup.find_all('td', {'align': 'left', 'valign': None})
print list[0]
It gives me
<td align="left">\n<h3>Name XYZ</h3>\n CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX, <br/>KANDIVALI EAST,<br/>Mumbai MAHARASHTRA-400101</td>
But I want output like:
Name: Name XYZ, Add: CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX, KANDIVALI EAST, Mumbai MAHARASHTRA-400101
What should I do?

find_all returns a list of tags, so list[0] gives you the first tag, which is the output you are seeing.
If you want the text inside a tag, use tag.text. In your case:
list[0].text

Actually, I think that there are two methods for that, depending on what you are looking for.
I'm not sure whether the "Name" and "Add" labels in front of your desired output are typos or not, so here are the two possible ways I see to do it:
In the case you simply want to extract all the text beneath each tag of your list_tags obtained from the find_all method, without any manipulations such as separating each word, go for the get_text() method.
With it, you could opt for a simple list comprehension like:
>>> simple_uni_text = [tag.get_text() for tag in list_tags]
>>> simple_uni_text
[u'\nName XYZ\n CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX, KANDIVALI EAST,Mumbai MAHARASHTRA-400101', u'\nName ABC\n DUT WITHOUT LAYIN, 45 FOOT AODR, RUKTHA SIMPLE, BOMBAY WEST,BOMBAY RASHTRAMAHA-400101']
>>> len(simple_uni_text)
2  # I pretended list_tags holds two tags, so this produced a list of length two!
The second is the stripped_strings generator.
It's maybe a trickier method, but you gain in precision.
>>> uni_stripped_words = []
>>> for tag in list_tags:
...     for string in tag.stripped_strings:
...         uni_stripped_words.append(string)
...
>>> uni_stripped_words
[u'Name XYZ', u'CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX,', u'KANDIVALI EAST,', u'Mumbai MAHARASHTRA-400101', u'Name ABC', u'DUT WITHOUT LAYIN, 45 FOOT AODR, RUKTHA SIMPLE,', u'BOMBAY WEST,', u'BOMBAY RASHTRAMAHA-400101']
>>> len(uni_stripped_words)
8
Here each string found beneath each tag of your list_tags is separated from the others. So if you do want to put "Name" and "Add" in front of your text, this method may suit your needs better.
>>> for word in uni_stripped_words:
...     print word
...
Name XYZ
CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX,
KANDIVALI EAST,
Mumbai MAHARASHTRA-400101
Name ABC
DUT WITHOUT LAYIN, 45 FOOT AODR, RUKTHA SIMPLE,
BOMBAY WEST,
BOMBAY RASHTRAMAHA-400101 # Sorry for the weird text example haha
However, I find the second method less controllable: for instance, unexpected characters sometimes show up. Personally, I prefer to concatenate when writing the output to a file!
Anyway, in both cases, don't forget that the resulting lists will contain extracted text of unicode type.
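To get the exact "Name: ..., Add: ..." line from the question, the strings that stripped_strings yields can be joined by hand. A minimal sketch, assuming the first string of each tag is always the name and the rest are address fragments; format_record is a hypothetical helper, and the input list is copied from the output above rather than parsed from real HTML:

```python
def format_record(strings):
    """Build 'Name: ..., Add: ...' from one tag's stripped strings.

    Assumption: the first string is the name (the <h3> text) and all
    remaining strings are address fragments, as in the sample above.
    """
    strings = list(strings)
    name = strings[0]
    address = " ".join(strings[1:])
    return "Name: %s, Add: %s" % (name, address)


# Strings as tag.stripped_strings yielded them for the first <td> above.
first_td = [
    u"Name XYZ",
    u"CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX,",
    u"KANDIVALI EAST,",
    u"Mumbai MAHARASHTRA-400101",
]

print(format_record(first_td))
```

With a real tag, this would be called as format_record(tag.stripped_strings).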
Cheers

Get a string after a specific word, using a program that has limited regex features?

Looking for help on building a regex that captures a 1-line string after a specific word.
The challenge I'm running into is that the program where I need to build this regex uses a single line format, in other words dot matches new line. So the formula I created isn't working. See more details below. Any advice or tips?
More specific regex task:
I'm trying to grab the line that comes after the word Details from entries like the ones below. The goal is to pull out 100% Silk, or 100% Velvet. This is the material of the product, which always comes after Details.
Raw data:
<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p> <p>Size 34 measurements</p>
OR
<p>The velvet version of this dress. High waist fit with hook and zipper closure.
Seams run along edges of pants to create a box-like.</p>
<h3>Details</h3> <p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
Here is the current formula I created that's not working:
Replace (.)(\bDetails\s+(.)) with $3
The output gives the below:
<p>100% Silk.</p>
<p>Made in Portugal.</p>
<h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p>
<p>Size 34 measurements</p>
OR
<p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
How do I capture just the desired string? Let me know if you have any tips! Thank you!
It is difficult to provide a working solution here, because you mention that your program has "limited regex features" but don't explain what the limitations are.
Here is a regex you can try to work with to capture the target string:
^(?:<h3>Details<\/h3>)(.*)$
I would personally use BeautifulSoup for something like this, but here are two solutions you could use:
Match the line after "Details", then pull out the data.
import re
matches = re.findall('(?<=Details<).*$', text, flags=re.MULTILINE)  # MULTILINE lets $ match at the end of every line
matches = [i.strip('<>') for i in matches]
matches = [i.split('<')[0] for i in [j.split('>')[-1] for j in matches]]
Replace "Details<...>data" with "Detailsdata", then find the data.
text = re.sub('Details<.*?<.*?>', 'Details', text)  # drop the tags between "Details" and the data
matches = re.findall('(?<=Details).*?(?=<)', text)
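As a self-contained check of the regex approach, here is a minimal sketch run against the first sample entry from the question. It assumes the material always sits in the first <p> right after the <h3>Details</h3> heading; material and DETAILS_RE are hypothetical names. The pattern uses [^<]* instead of a dot, so it does not depend on whether the dot matches newlines:

```python
import re

SAMPLE = """<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
"""

# After the Details heading, skip whitespace and the opening <p>,
# then capture everything up to the next tag.
DETAILS_RE = re.compile(r"<h3>Details</h3>\s*<p>([^<]*)</p>")


def material(text):
    """Return the text of the first <p> after the Details heading, or None."""
    match = DETAILS_RE.search(text)
    return match.group(1).strip() if match else None


print(material(SAMPLE))
```

If the target program supports capture groups in a replace, the same pattern idea (capture [^<]* after the heading) may carry over, but that depends on which regex features it actually has.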

Converting a textfile into a list

I have a text file which contains a series of movie titles, which looks like this once opened.
A Nous la Liberte (1932) About Schmidt (2002) Absence of Malice
(1981) Adam's Rib (1949) Adaptation (2002) The Adjuster (1991) The
Adventures of Robin Hood (1938) Affliction (1998) The African Queen
(1952)
Using the code below:
def movie_text():
    moviefile = open("movies.txt", 'r')
    yourResult = [line.split('\n') for line in moviefile.readlines()]
movie_text()
I get nothing.
Your code never prints or returns anything, which is why you see no output.
If I understand correctly, you want something like this:
moviefile = open("movies.txt", 'r')
lines = moviefile.readlines()
moviefile.close()
print(len(lines))  # Shows list size
for line in lines:
    print(line.rstrip('\n'))  # rstrip('\n') removes the trailing newline
The readlines method already returns a list, so I am not sure why you use split. If all you want is to remove the '\n', there are many ways to do it; the one I used is just one of them.
Hope it works!
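Note that in the sample, titles wrap across lines, so one line is not one movie. If the goal is a list with one entry per movie, a minimal sketch could split on the year instead. This assumes every title ends with a four-digit year in parentheses; split_titles is a hypothetical helper and RAW is the sample text from the question:

```python
import re

RAW = """A Nous la Liberte (1932) About Schmidt (2002) Absence of Malice
(1981) Adam's Rib (1949) Adaptation (2002) The Adjuster (1991) The
Adventures of Robin Hood (1938) Affliction (1998) The African Queen
(1952)
"""


def split_titles(text):
    """Split running text into 'Title (year)' entries."""
    # Collapse line breaks so titles wrapped across lines are rejoined.
    flat = " ".join(text.split())
    # Each entry is the shortest run of text ending in "(dddd)".
    return re.findall(r"\s*(.+?\(\d{4}\))", flat)


movies = split_titles(RAW)
print(len(movies))
print(movies[2])
```

To run it on the file, the text would come from open("movies.txt").read() instead of RAW.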

Parsing text file into a Data Frame

I have a text file which has information, like so:
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: A1QA985ULVCQOB
review/profileName: Carleen M. Amadio "Lady Dragonfly"
review/helpfulness: 2/2
review/score: 5.0
review/time: 1314057600
review/summary: Fun for adults too!
review/text: I really enjoy these scissors for my inspiration books that I am making (like collage, but in books) and using these different textures these give is just wonderful, makes a great statement with the pictures and sayings. Want more, perfect for any need you have even for gifts as well. Pretty cool!
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: ALCX2ELNHLQA7
review/profileName: Barbara
review/helpfulness: 0/0
review/score: 5.0
review/time: 1328659200
review/summary: Making the cut!
review/text: Looked all over in art supply and other stores for "crazy cutting" scissors for my 4-year old grandson. These are exactly what I was looking for - fun, very well made, metal rather than plastic blades (so they actually do a good job of cutting paper), safe ("blunt") ends, etc. (These really are for age 4 and up, not younger.) Very high quality. Very pleased with the product.
I want to parse this into a dataframe with the productID, title, price.. as columns and the data as the rows. How can I do this in R?
A quick and dirty approach:
mytable <- read.table(text=mytxt, sep = ":")
mytable$id <- rep(1:2, each = 10)
res <- reshape(mytable, direction = "wide", timevar = "V1", idvar = "id")
There will be issues if there are other colons in the data. Also assumes that there is an equal number (10) of variables for each case. All
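The two caveats above (colons inside values, a fixed count of 10 fields) can be sidestepped by splitting each line on the first ": " only and starting a new record at each product/productId line. A minimal sketch of that idea, written in Python rather than R; parse_records is a hypothetical helper and the sample is an abridged copy of the data in the question:

```python
def parse_records(text, first_key="product/productId"):
    """Parse 'key: value' lines into a list of dicts, one per record.

    A new record starts whenever first_key is seen, so records do not
    need to have the same number of fields.
    """
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ': ' only, so colons inside values survive.
        key, _, value = line.partition(": ")
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value
    return records


SAMPLE = """product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: A1QA985ULVCQOB
review/summary: Fun for adults too!
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: ALCX2ELNHLQA7
review/summary: Making the cut!
"""

rows = parse_records(SAMPLE)
print(len(rows))
print(rows[1]["review/summary"])
```

Each dict is one row and each key one column, which is the same wide shape the reshape call produces.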

Splitting strings that contain commas (special characters?)

I'm working from a spreadsheet of values. I have code that pulls a row of content to analyze. I was planning to split it on commas, but some of the strings inside the cells include commas (that aren't regularly spaced, so escaping them would be difficult). I downloaded the sheet as a tsv instead of a csv and re-uploaded it, but my attempts to split on \t haven't been successful. (For good measure, I've also tried \n, \r, and \f to see if they're involved in delimiting cells. They don't seem to be.)
Is there a special character that means "next cell" or "next record" or something like that? Am I better off trying to end each cell with a particular character that I would then have to strip out of my data after splitting? I'd welcome any other ideas!
Code snippet:
var lastRowContents = dataSheet.getRange(lastRow, 1, 1, 21).getValues();
var contentChunks = lastRowContents.toString().split('\t');
var product = contentChunks[0];
Logger.log(product);
This outputs the entire row as one item in that array, like so:
product: Wed Jan 05 2005 02:00:00 GMT-0600 (CST),001-2005, Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings ,misbranded,http://www.fsis.usda.gov/wps/portal/fsis/topics/recalls-and-public-health-alerts/recall-case-archive/recall-case-archive-2005/!ut/p/a1/jZDBCoJAEIafpQdYdlZN9CgLppa7SGS2l1gW0wVTMfHQ06d0MpScOc3w_XzMYIEzLGo56EL2uqllNc3CvkMCNnEpRNz3fAiZ6acOOxDg9gjcZoBLJiBN-JFScJi5Mb9SHvzLRxsERhfTuMCilX2JdP1ocNblSlYVUvKVI9mpUg_54hIZAHt8xWKuATL2qDlbQcRM4NYvsPCHL7B-aPu8ZO9TADr0dh-fh2db/?1dmy&current=true&urile=wcm%3apath%3a%2Ffsis-archives-content%2Finternet%2Fmain%2Ftopics%2Frecalls-and-public-health-alerts%2Frecall-case-archive%2Farchives%2Fct_index271,http://www.fsis.usda.gov/wps/portal/fsis/topics/recalls-and-public-health-alerts/recall-case-archive/recall-case-archive-2005/!ut/p/a1/jZDBCoJAEIafpQdYdlZN9CgLppa7SGS2l1gW0wVTMfHQ06d0MpScOc3w_XzMYIEzLGo56EL2uqllNc3CvkMCNnEpRNz3fAiZ6acOOxDg9gjcZoBLJiBN-JFScJi5Mb9SHvzLRxsERhfTuMCilX2JdP1ocNblSlYVUvKVI9mpUg_54hIZAHt8xWKuATL2qDlbQcRM4NYvsPCHL7B-aPu8ZO9TADr0dh-fh2db/?1dmy&current=true&urile=wcm%3apath%3a%2Ffsis-archives-content%2Finternet%2Fmain%2Ftopics%2Frecalls-and-public-health-alerts%2Frecall-case-archive%2Farchives%2Fct_index386,Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings,Produced 10/6/2004. The products subject to recall are: One pound bags of "DAY-LEE PRIDE BEEF GYOZA POTSTICKERS, VEGETABLE AND BEEF DUMPLINGS." Each bag bears the code "28004," as well as "Est. 17309" inside the USDA mark of inspection.,The packages state that the gyozas are filled with beef, but they may instead contain shrimp, a known allergen.,The problem was discovered by the establishment.,17309 M Day-Lee Foods Inc. 13055 E. Molette St. Santa Fe Springs, CA 90670,,Approximately 2,520 pounds,California, Colorado, Georgia, Maryland, New York, and Washington.,Class I,U.S. Food and Drug Administration (FDA),,,,,
(Posting this as an answer just for visibility :)
Since lastRowContents is a 2D array, you can get each cell with lastRowContents[0][0], lastRowContents[0][1], lastRowContents[0][2], etc.
In your code:
var lastRowContents = dataSheet.getRange(lastRow, 1, 1, 21).getValues();
var product = lastRowContents[0][0];
Logger.log(product);
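The reason the split never worked is that calling toString() on a nested JavaScript array joins every cell with commas, so there is no tab or other cell delimiter left to split on. The same effect can be sketched in Python with a list of lists standing in for the result of getValues(); the data below is a shortened, hypothetical row:

```python
# Stand-in for what getValues() returns for a 1-row range: a list of rows,
# each row a list of cell values (hypothetical, shortened data).
last_row_contents = [[
    "001-2005",
    "Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings",
    "misbranded",
]]

# Flattening to one string and splitting on commas breaks cells apart:
# three cells become four chunks.
flattened = ",".join(last_row_contents[0])
chunks = flattened.split(",")
print(len(chunks))

# Indexing the nested structure keeps each cell intact.
product = last_row_contents[0][1]
print(product)
```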

Stata - inputting data from .txt with "" and ,

I am using perl to scrape the following through .txt which I'd ultimately bring into Stata. What format option works? I have many such observations, so would like to use an approach over which I can generalize.
The original data are of the form:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
Location: District 1, Ocean City, Cape May, New Jersey, USA
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
Location: Precinct 5, District 2, Chicago, Cook, Illinois, USA
The goal is to create the variables in Stata:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
County: Cape May
State: New Jersey
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
County: Cook
State: Illinois
What possible .txt might lead to such, and how would I load it into Stata?
Also, the number of terms in Location varies, as in these 2 examples, but I always want the 2 before USA.
At the moment, I am putting "", around each variable from the table for the .txt.
"Allen","Von Schmidt","1965","District 1, Ocean City, Cape May, New Jersey, USA"
"Lee Roy","McBride","1967","Precinct 5, District 2, Chicago, Cook, Illinois, USA"
Is there a better way to format the .txt? How would I create the corresponding variables in Stata?
Thank you for your help!
P.S. I know that Stata uses infile or insheet and can handle commas or tabs to separate variables. I did not know how to scrape a variable like Location in perl with all of those commas, so I added the ""
There are two ways to do this. The first is to paste the data into your do-file and use input. Assuming the format is fairly regular, you can clean it up easily using commas to parse. Note that I removed the commas between the quoted fields:
#delimit;
input
str100(first_name last_name yob geo);
"Allen" "Von Schmidt" "1965" "District 1, Ocean City, Cape May, New Jersey, USA";
end;
compress;
destring, replace;
split geo, parse(,);
rename geo1 district;
rename geo2 city;
rename geo3 county;
rename geo4 state;
rename geo5 country;
drop geo;
The second way is to insheet the data from the txt file directly, which is probably easier. This assumes that the commas were not removed:
#delimit;
insheet first_name last_name yob geo using "raw_data.txt", clear comma nonames;
Then clean it up as in the first example.
This isn't a complete answer, but I need more space and flexibility than comments (easily) allow.
One trick is based on peeling off elements from the end. The easiest way to do that could be to start looking for the last comma, which is in turn the first comma in the reversed string. Use strpos(reverse(stringvar), ",").
For example, the first comma is found by strpos() like this
. di strpos("abcd,efg,h", ",")
5
and the last comma like this
. di strpos(reverse("abcd,efg,h"), ",")
2
Once you know where the last comma is you can peel off the last element. If the last comma is at position # in the reversed string, it is at position -# in the string.
. di substr("abcd,efg,h", -2, 2)
,h
These examples clearly are calculator-style examples for single strings. But the last element can be stripped off similarly for entire string variables.
. gen poslastcomma = strpos(reverse(var), ",")
. gen var_end = substr(var, -poslastcomma, poslastcomma)
. gen var_begin = substr(var, 1, length(var) - poslastcomma)
Once you get used to stuff like this you can write more complicated statements with fewer variables, but slowly, slowly step by step is better when you are learning.
By the way, a common Stata learner error (in my view) is to assume that a solution to a string problem must entail the use of regular expressions. If you are very fluent at regular expressions, you can naturally do wonderful things with them, but the other string functions in conjunction can be very powerful too.
In your specific example, it sounds as if you want to ignore a last element such as "USA" and then work in turn on the next elements working backwards.
split in Stata is fine too (I am a fan and indeed am its putative author) but can be awkward if a split yields different numbers of elements, which is where I came in.
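The "work backwards from the end" idea can be sketched outside Stata as well. Below is a minimal Python illustration, assuming Location always ends with the country and that county and state are the two elements before it; county_and_state and peel_last are hypothetical helper names, and peel_last mirrors the strpos(reverse()) trick by locating the last comma:

```python
def county_and_state(location):
    """Return the two comma-separated elements just before the last one."""
    parts = [part.strip() for part in location.split(",")]
    if len(parts) < 3:
        raise ValueError("expected at least 3 comma-separated elements")
    # parts[-1] is the country; work backwards from there.
    return parts[-3], parts[-2]


def peel_last(text, sep=","):
    """Split text at its last separator, like the strpos(reverse()) trick."""
    pos = text.rfind(sep)
    if pos == -1:
        return text, ""
    return text[:pos], text[pos + 1:]


print(county_and_state("District 1, Ocean City, Cape May, New Jersey, USA"))
print(county_and_state("Precinct 5, District 2, Chicago, Cook, Illinois, USA"))
print(peel_last("abcd,efg,h"))
```

Because the elements are counted from the end, it does not matter that the two sample locations have different numbers of terms.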