After scraping some info from a website, I ended up saving the file with the raw HTML code because I couldn't find a way to find_all the text in a list of lists.
Now I have the data, but I can't get the text because bs4 doesn't accept a list as input.
Here's the code I use to open the file:
from csv import reader

with open('/my_file.csv', 'r') as read_obj:
    csv_reader = reader(read_obj)
    list_of_rows = list(csv_reader)
print(list_of_rows)
This is the list format:
[['', '0', '1', '2', '3'], ['0','<span class="item">Red <small>col.</small></span>',
'<span class="item">120 <small>cc.</small></span>',
'<span class="item">Available <small>in four days</small></span>',
'<span class="item"><small class="txt-highlight-red">15 min</small></span>'],
['1', '<span class="item">Blue <small>col.</small></span>',
'<span class="item">200 <small>cc.</small></span>',
'<span class="item">Available <small>in a week</small></span>',
'<span class="item">04 mar <small></small></span>'],
['0', '<span class="item">Green <small>col.</small></span>',
'<span class="item">Available <small>immediately</small></span>',
'<span class="item"><small class="txt-highlight-red">2 hours</small></span>']]
Is there a way to read CSV files into BeautifulSoup and then parse them?
The aim of the task is to keep only the text, removing everything between '<' and '>' (the symbols included).
You can make a function that applies BeautifulSoup to each cell and returns the text. If there are no tags or content to parse, it just leaves the value as-is.
Also, I'd rather just use pandas to read in that csv.
import pandas as pd
from bs4 import BeautifulSoup
df = pd.read_csv('/my_file.csv')

def foo_bar(x):
    # Parse the cell as HTML and return only its text;
    # non-string cells (numbers, NaN) raise and are returned unchanged.
    try:
        return BeautifulSoup(x, 'lxml').text
    except Exception:
        return x

print('Parsing html in table...')
df = df.applymap(foo_bar)
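Note that pandas 2.1 deprecated DataFrame.applymap in favour of the element-wise DataFrame.map, so on newer versions you would write:
df = df.map(foo_bar)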
Example input:
df = pd.DataFrame([['0','<span class="item">Red <small>col.</small></span>',
'<span class="item">120 <small>cc.</small></span>',
'<span class="item">Available <small>in four days</small></span>',
'<span class="item"><small class="txt-highlight-red">15 min</small></span>'],
['1', '<span class="item">Blue <small>col.</small></span>',
'<span class="item">200 <small>cc.</small></span>',
'<span class="item">Available <small>in a week</small></span>',
'<span class="item">04 mar <small></small></span>'],
['0', '<span class="item">Green <small>col.</small></span>',
'<span class="item">Available <small>immediately</small></span>',
'<span class="item"><small class="txt-highlight-red">2 hours</small></span>']], columns = ['', '0', '1', '2', '3'])
Original table:
print (df.to_string())
0 1 2 3
0 0 <span class="item">Red <small>col.</small></span> <span class="item">120 <small>cc.</small></span> <span class="item">Available <small>in four da... <span class="item"><small class="txt-highlight...
1 1 <span class="item">Blue <small>col.</small></s... <span class="item">200 <small>cc.</small></span> <span class="item">Available <small>in a week<... <span class="item">04 mar <small></small></span>
2 0 <span class="item">Green <small>col.</small></... <span class="item">Available <small>immediatel... <span class="item"><small class="txt-highlight... None
Output:
print (df.to_string())
0 1 2 3
0 0 Red col. 120 cc. Available in four days 15 min
1 1 Blue col. 200 cc. Available in a week 04 mar
2 0 Green col. Available immediately 2 hours None
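If you'd rather stay with csv.reader instead of pandas, a minimal sketch of the same idea applied cell by cell to your list of lists:
from csv import reader
from bs4 import BeautifulSoup

with open('/my_file.csv', 'r') as read_obj:
    list_of_rows = list(reader(read_obj))

# Every cell is a string here, so BeautifulSoup can parse it directly;
# cells without tags come back unchanged.
cleaned = [[BeautifulSoup(cell, 'lxml').text for cell in row]
           for row in list_of_rows]
print(cleaned)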
I have the following dataframe:
pd.DataFrame({
'Code': ['XAW', 'PAK', 'I', 'QP', 'TOPZ', 'XAW', 'APOL'],
'Name': ['George Truck', 'Fred Williams', 'Jessica Weir', 'Tony P.', 'John Truck', 'Liz Moama', 'Emily Truck'],
'Color': ['Blue', 'Green', 'Green', 'Red', 'Pink', 'Blue', 'Pink']
})
Code Name Color
0 XAW George Truck Blue
1 PAK Fred Williams Green
2 I Jessica Weir Green
3 QP Tony P. Red
4 TOPZ John Truck Pink
5 XAW Liz Moama Blue
6 APOL Emily Truck Pink
Given a keyword, such as 'blue', I would like to retrieve the following rows:
0   XAW  George Truck    Blue
5 XAW Liz Moama Blue
The search can contain multiple keywords, for example, 'truck pink' would return:
4 TOPZ John Truck Pink
6 APOL Emily Truck Pink
Imagine that this dataframe has half a million rows and a few extra columns. Is there a fast way I can query the entire dataframe for specific keywords?
With search string s = 'truck pink', set up a search column:
t = (df['Name'] + ' ' + df['Color']).str.lower()
I force everything to lower case, because your search example doesn't seem to be case-sensitive. If you have dynamic search inputs, also force the search field to lower case. Then do searches for contains like so:
d = {}
for i in s.split(' '):
    d[i] = t.str.contains(i, na=False)
I pass na=False because otherwise Pandas returns NA in the mask wherever the string column is itself NA, and we don't want that behaviour. Each term triggers a full scan of the search column, so the cost grows linearly with the number of search terms. Also consider changing this approach if you want to match whole words, because contains matches substrings.
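For whole-word matching, one option (a sketch, not the only way) is a regex with word boundaries:
import re

# word-boundary regex so 'truck' won't match 'trucker'
for i in s.split(' '):
    d[i] = t.str.contains(rf'\b{re.escape(i)}\b', na=False)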
Regardless, take the results and reduce them with bitwise 'and'. You need two imports:
from functools import reduce
from operator import and_
df[reduce(and_, d.values())]
And thus:
Code Name Color
4 TOPZ John Truck Pink
6 APOL Emily Truck Pink
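Putting the pieces together, a minimal sketch of the whole search as one function (the name keyword_filter and the fixed choice of columns are my own):
from functools import reduce
from operator import and_

def keyword_filter(df, s):
    # Build one lower-cased search column from the columns of interest.
    t = (df['Name'] + ' ' + df['Color']).str.lower()
    # One boolean mask per search term; na=False so NA rows never match.
    masks = (t.str.contains(term, na=False) for term in s.lower().split())
    # Keep only the rows matched by every term.
    return df[reduce(and_, masks)]

print(keyword_filter(df, 'truck pink'))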
I am trying to use Python regular expressions to print desired items out of a text file that contains lines like:
-rwxr-xr-x 1 jttoivon hyad-all 2356 Dec 11 11:50 add_colab_link.py
-rw-r--r-- 1 jttoivon hyad-all 164519 Dec 28 17:59 basics.ipynb
-rw-r--r-- 1 jttoivon hyad-all 164477 Nov 5 19:21 basics.ipynb.orig
-rw-r--r-- 1 jttoivon hyad-all 115587 Dec 11 11:50 bayes.ipynb
For this I have written the function below, but it only returns the list from the first row of the file. I want to get the lists of desired items for all the rows in the file.
import re

def file_listing(filename="src/listing.txt"):
    with open('listing.txt', 'r') as f:
        for i in range(47):
            line = f.readline()
            lists = re.findall(r'[-d]\w+\W+\w*\W+\w*\W*\s+\d+\s+\w+\s+\w+\W+\w+\s+(\d+)\s+(\w+)\s+(\d+)\s+(\d+)\W+(\d+)\s+(\w*\W?\w*\W?\w*.\w+)', line)
            return lists

print(file_listing("listing.txt"))
This code gives:
[('2356', 'Dec', '11', '11', '50', 'add_colab_link.py')]
However, I want the function to iterate through all the rows in the file and return the lists of desired items for every row.
Returning from a function ends its processing; nothing runs afterwards. The statement
return lists
will therefore leave the function after the first line has been processed.
Fix:
import re

def file_listing(filename="src/listing.txt"):
    all_hits = []
    with open(filename, 'r') as f:  # use the parameter instead of a hard-coded name
        for line in f:
            if not line.strip():
                continue  # ignore empty lines
            all_hits.append(re.findall(r'[-d]\w+\W+\w*\W+\w*\W*\s+\d+\s+\w+\s+\w+\W+\w+\s+(\d+)\s+(\w+)\s+(\d+)\s+(\d+)\W+(\d+)\s+(\w*\W?\w*\W?\w*.\w+)', line))
    return all_hits

print(file_listing("listing.txt"))
I did not verify your regex as it seems to find what you want.
Output:
[[('2356', 'Dec', '11', '11', '50', 'add_colab_link.py')],
[('164519', 'Dec', '28', '17', '59', 'basics.ipynb')],
[('164477', 'Nov', '5', '19', '21', 'basics.ipynb.orig')],
[('115587', 'Dec', '11', '11', '50', 'bayes.ipynb')]]
There will probably never be two matches per line, so you could use list.extend() instead to get a flat list of tuples.
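A sketch of that variant; only the collecting line changes, and all_hits then ends up as one flat list of tuples:
# extend() flattens each line's result list into one flat list
all_hits.extend(re.findall(r'[-d]\w+\W+\w*\W+\w*\W*\s+\d+\s+\w+\s+\w+\W+\w+\s+(\d+)\s+(\w+)\s+(\d+)\s+(\d+)\W+(\d+)\s+(\w*\W?\w*\W?\w*.\w+)', line))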
I need to extract useful text from news articles. I do it with BeautifulSoup, but the output glues some paragraphs together, which prevents me from analysing the text further.
My code:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/uk-england-39607452")
soup = BeautifulSoup(r.content, "lxml")
# delete unwanted tags:
for s in soup(['figure', 'script', 'style']):
    s.decompose()

article_soup = [s.get_text() for s in soup.find_all(
    'div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)
The output looks like this (just the first five sentences):
The family of British student Hannah Bladon, who was stabbed to death in Jerusalem, have said they are "devastated" by the "senseless
and tragic attack".Ms Bladon, 20, was attacked on a tram in Jerusalem
on Good Friday.She was studying at the Hebrew University of Jerusalem
at the time of her death and had been taking part in an archaeological
dig that morning.Ms Bladon was stabbed several times in the chest and
died in hospital. She was attacked by a man who pulled a knife from
his bag and repeatedly stabbed her on the tram travelling near Old
City, which was busy as Christians marked Good Friday and Jews
celebrated Passover.
I tried adding a space after certain punctuation marks like ".", "?", and "!".
article = article.replace(".", ". ")
It works for paragraphs (although I believe there should be a smarter way of doing this), but not for the subtitles of different sections of the article, which don't have any punctuation at the end. They are structured like this:
</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>
I will be grateful for your advice.
PS: adding a space when I 'join' the article_soup doesn't help.
You can use the separator argument of get_text, which joins all the strings inside the current element with the given character.
article_soup = [s.get_text(separator="\n", strip=True)
                for s in soup.find_all('div', {'class': 'story-body__inner'})]
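Joining with the same character then keeps the paragraph boundaries recoverable, for example:
article = '\n'.join(article_soup)
# one entry per paragraph or subtitle, empty strings dropped
paragraphs = [p for p in article.split('\n') if p]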
I get the following file every hour, only numbers will vary:
<text>
<......>
<smtng>1</smtng> #line 3
......
<smtngelse>5</smtngelse> #line 9
-----
</text>
The next file might have a 2 instead of 1 for example.
How can I extract the numbers on lines 3 and 9, delimit them with a tab, and then append them to a different file?
The result should be 1 and 5, separated by a tab.
Thanks
Diez
You can use BeautifulSoup to get the values out of the XML, and then write the results to the file in append ('a') mode:
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<text>
<smtng>1</smtng>
<smtngelse>5</smtngelse>
</text>
""", 'lxml')

value1 = soup.find('smtng').text
value2 = soup.find('smtngelse').text

with open('output.txt', 'a') as f:
    f.write('%s\t%s\n' % (value1, value2))
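For the real hourly files, a minimal sketch that reads from disk instead (the path hourly.xml is an assumption):
from bs4 import BeautifulSoup

with open('hourly.xml') as f:  # hypothetical name for the hourly file
    soup = BeautifulSoup(f, 'lxml')

value1 = soup.find('smtng').text
value2 = soup.find('smtngelse').text

with open('output.txt', 'a') as out:  # 'a' appends one new line per run
    out.write('%s\t%s\n' % (value1, value2))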
Hope that helps.
https://google-developers.appspot.com/chart/interactive/docs/gallery/linechart#Example
How can I add units to the vertical axis like "$" or "€"? In the example, it should be 1.200 $, 1.000 $, 800 $, 600 $ and 400 $.
Just adding '$' like this doesn't work:
var data = google.visualization.arrayToDataTable([
['Year', 'Sales', 'Expenses'],
['2004', 1000 '$', 400],
['2005', 1170 '$', 460],
['2006', 660 '$', 1120],
['2007', 1030 '$', 540]
]);
I know, it's a bad example as sales doesn't have any unit, but it's just an example.
You need to add a format setting to your chart options, as follows:
var options = {
    vAxis: {format: '# $'}
};
Reference: https://google-developers.appspot.com/chart/interactive/docs/gallery/linechart#Configuration_Options
I stumbled onto this question and found that the newer Google Sheets (2017) has an "Advanced Edit" option for Grids/Charts. Click the down arrow in the upper-right, and you'll see it. The last tab on that interface lets you set a custom prefix/suffix for the data axes if you scroll down.