I want to get a value from HTML code like this:
<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:
As a result I need just the value: "53"
How can this be done using Linux command line tools like grep, awk or sed? I want to use it on a Raspberry Pi.
Trying this doesn't work:
root@raspberrypi:/home/pi# echo "<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:" >> test.txt
root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' test.txt
root@raspberrypi:/home/pi#
Because HTML is not a flat-text format, handling it with flat-text tools such as grep, sed or awk is not advisable. If the format of the HTML changes slightly (for example: if the span node gets another attribute or newlines are inserted somewhere), anything you build this way will have a tendency to break.
It is more robust (if more laborious) to use something that is built to parse HTML. In this case, I'd consider using Python because it has a (rudimentary) HTML parser in its standard library. It could look roughly like this:
#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting and assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but it will make this easier to adapt for other things
    # and is not more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're in the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so as
# binary to get bytes from f.read() instead of a string (which would require
# the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()
    # then feed it the file. If the file is not UTF-8, it is necessary to
    # convert the file contents to UTF-8. I'm assuming latin1-encoded
    # data here; since the example looks German, "latin9" might also be
    # appropriate. Use the encoding in which your data is encoded.
    p.feed(f.read().decode("latin1"))

# trim (in case of newlines/spaces around the data), remove % at the end,
# then print
print(re.compile('%$').sub('', p.data.strip()))
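For a quick check, you can also feed the parser the snippet from the question directly instead of reading a file (assuming the my_parser class above is in scope):

# sanity check with the snippet from the question (no file needed)
p = my_parser()
p.feed('<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:')
print(p.data.strip())  # prints "53%"; the regex above then strips the "%"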
Addendum: Here's a backport to Python 2 that bulldozes right over encoding problems. For this case, that is arguably nicer because encoding doesn't matter for the data we want to extract and you don't have to know the encoding of the input file in advance. The changes are minor, and the way it works is exactly the same:
#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())

print(re.compile('%$').sub('', p.data.strip()))
There is a text file containing data in the form:
[sec1]
"ab": "s"
"sd" : "d"
[sec2]
"rt" : "ty"
"gh" : "rr"
"kk":"op"
We are supposed to return the data of the matching section in JSON format: if the user asks for sec1, we should send the contents of sec1.
The format you specified is very similar to the TOML format; however, TOML uses equals signs for the assignment of key-value pairs. If your format really uses colons for the assignment, the following example may help you.
It uses regular expressions in conjunction with a defaultdict to read the data from the file. The section to be queried is extracted from the URL using a variable rule.
If there is no hit within the loaded data, the server responds with a 404 error (NOT FOUND).
import re
from collections import defaultdict
from flask import (
    Flask,
    abort,
    jsonify
)

def parse(f):
    data = defaultdict(dict)
    section = None
    for line in f:
        line = line.strip()
        if re.match(r'^\[[^\]]+\]$', line):
            section = line[1:-1]
            data[section] = dict()
            continue
        m = re.match(r'^"(?P<key>[^"]+)"\s*:\s*"(?P<val>[^"]+)"$', line)
        if m:
            key, val = m.groups()
            if not section:
                raise ValueError('illegal format')
            data[section][key] = val
            continue
    return dict(data)

app = Flask(__name__)

@app.route('/<string:section>')
def data(section):
    path = 'path/to/file'
    with open(path) as f:
        data = parse(f)
    if section in data:
        return jsonify(data[section])
    abort(404)
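To sanity-check parse() on the data from the question without starting the server, you can feed it an in-memory file (a quick sketch; the file contents are simply inlined here):

import io

sample = io.StringIO(u'[sec1]\n"ab": "s"\n"sd" : "d"\n[sec2]\n"rt" : "ty"\n"gh" : "rr"\n"kk":"op"\n')
print(parse(sample))
# {'sec1': {'ab': 's', 'sd': 'd'}, 'sec2': {'rt': 'ty', 'gh': 'rr', 'kk': 'op'}}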
I want to convert the string tel:123-456-7890.1234 to a link in HTML. The final visible text would be 123-456-7890 ext 1234.
I'm not great with regex and I'm REALLY close, but I need some help. I know that I'm not all the way there with the regex and output. How do I change what I have to make it work?
import re

@register.filter(name='phonify')
@stringfilter
def phonify(val):
    """
    Pass the string 'tel:1234' to the filter and the right tel link is returned.
    """
    # find every instance of 'tel' and then get the number after it
    for tel in re.findall(r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})\D*(\d*)', val):
        # format the tag for each instance of tel
        tag = '<a href="tel:{}">{}</a>'.format(tel, tel)
        # replace the tel instance with the new formatted html
        val = val.replace('tel:{}'.format(tel), tag)
    # return the new output to the template context
    return val
I added the wagtail tag as I've seen other solutions for this in Wagtail and it's something needed for Wagtail, so this might be helpful to others.
You can use re.sub to perform a find and replace:
import re

@register.filter(name='phonify')
@stringfilter
def phonify(val):
    tel_rgx = r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4}\D*\d*)'
    return re.sub(tel_rgx, r'<a href="tel:\1">\1</a>', val)
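For example, with the sample string from the question:

>>> import re
>>> tel_rgx = r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4}\D*\d*)'
>>> re.sub(tel_rgx, r'<a href="tel:\1">\1</a>', 'Call tel:123-456-7890.1234 today')
'Call <a href="tel:123-456-7890.1234">123-456-7890.1234</a> today'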
Note however that in your template, you will need to mark the result as "safe"; otherwise the template engine will escape < to &lt; etc. and thus render <a href=... as plain text.
You can mark the string as safe in your template filter as well:
import re
from django.utils.safestring import mark_safe

@register.filter(name='phonify')
@stringfilter
def phonify(val):
    tel_rgx = r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4}\D*\d*)'
    return mark_safe(re.sub(tel_rgx, r'<a href="tel:\1">\1</a>', val))
Regardless of how you do this, it will mark the entire result as safe, including tags etc. that were part of the original string and should therefore be escaped. For that reason, I'm not sure this is a good idea.
When I send a query request like ?page_no=5 from the browser:
http://127.0.0.1:8001/article/list/1?page_no=5
I get this output in the debugging terminal:
def article_list(request, block_id):
    print(request.GET)

<QueryDict: {'page_no': ['5']}>
Django encapsulates page_no=5 in a dict-like object: {'page_no': ['5']}.
How does Django accomplish such a task? Are regexes or str.split employed?
Short answer: we can inspect the source code to see how the QueryDict is constructed. It is done with a combination of regex splitting and unquoting.
This can be done by using the constructor. Indeed, if we call:
>>> QueryDict('page_no=5')
<QueryDict: {u'page_no': [u'5']}>
The constructor uses the limited_parse_qsl helper function. If we look at the source code [src]:
def __init__(self, query_string=None, mutable=False, encoding=None):
    super().__init__()
    self.encoding = encoding or settings.DEFAULT_CHARSET
    query_string = query_string or ''
    parse_qsl_kwargs = {
        'keep_blank_values': True,
        'fields_limit': settings.DATA_UPLOAD_MAX_NUMBER_FIELDS,
        'encoding': self.encoding,
    }
    if isinstance(query_string, bytes):
        # query_string normally contains URL-encoded data, a subset of ASCII.
        try:
            query_string = query_string.decode(self.encoding)
        except UnicodeDecodeError:
            # ... but some user agents are misbehaving :-(
            query_string = query_string.decode('iso-8859-1')
    for key, value in limited_parse_qsl(query_string, **parse_qsl_kwargs):
        self.appendlist(key, value)
    self._mutable = mutable
If we look at the source code of limited_parse_qsl [src], we see that this parser uses a combination of splitting and decoding:
FIELDS_MATCH = re.compile('[&;]')

# ...

def limited_parse_qsl(qs, keep_blank_values=False, encoding='utf-8',
                      errors='replace', fields_limit=None):
    """
    Return a list of key/value tuples parsed from query string.

    Copied from urlparse with an additional "fields_limit" argument.
    Copyright (C) 2013 Python Software Foundation (see LICENSE.python).

    Arguments:

    qs: percent-encoded query string to be parsed

    keep_blank_values: flag indicating whether blank values in
        percent-encoded queries should be treated as blank strings. A
        true value indicates that blanks should be retained as blank
        strings. The default false value indicates that blank values
        are to be ignored and treated as if they were not included.

    encoding and errors: specify how to decode percent-encoded sequences
        into Unicode characters, as accepted by the bytes.decode() method.

    fields_limit: maximum number of fields parsed or an exception
        is raised. None means no limit and is the default.
    """
    if fields_limit:
        pairs = FIELDS_MATCH.split(qs, fields_limit)
        if len(pairs) > fields_limit:
            raise TooManyFieldsSent(
                'The number of GET/POST parameters exceeded '
                'settings.DATA_UPLOAD_MAX_NUMBER_FIELDS.'
            )
    else:
        pairs = FIELDS_MATCH.split(qs)
    r = []
    for name_value in pairs:
        if not name_value:
            continue
        nv = name_value.split('=', 1)
        if len(nv) != 2:
            # Handle case of a control-name with no equal sign
            if keep_blank_values:
                nv.append('')
            else:
                continue
        if nv[1] or keep_blank_values:
            name = nv[0].replace('+', ' ')
            name = unquote(name, encoding=encoding, errors=errors)
            value = nv[1].replace('+', ' ')
            value = unquote(value, encoding=encoding, errors=errors)
            r.append((name, value))
    return r
So it splits the query string on the regex [&;] and uses unquote (from urllib) to decode the percent-encoded keys and values.
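To make the two steps concrete, here is a rough hand-rolled sketch of the same idea (not Django's actual code; the query string here is made up):

import re
from urllib.parse import unquote

FIELDS_MATCH = re.compile('[&;]')

qs = 'page_no=5&q=caf%C3%A9;flag='
result = []
for pair in FIELDS_MATCH.split(qs):  # regex split on '&' and ';'
    name, _, value = pair.partition('=')
    result.append((unquote(name), unquote(value.replace('+', ' '))))
print(result)  # [('page_no', '5'), ('q', 'café'), ('flag', '')]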
I've written a simple script to extract data from some site. The script works as expected, but I'm not pleased with the output format.
Here is my code:
import urlparse

from scrapy import Spider, Request
from scrapy.loader import ItemLoader  # scrapy.contrib.loader in older Scrapy
from myproject.items import Article   # wherever your Article item is defined

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["example.com"]
    start_urls = (
        "http://example.com/tag/1/page/1",
    )

    def parse(self, response):
        next_selector = response.xpath('//a[@class="next"]/@href')
        url = next_selector[1].extract()
        # url is like "tag/1/page/2"
        yield Request(urlparse.urljoin("http://example.com", url))
        item_selector = response.xpath('//h3/a/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin("http://example.com", url),
                          callback=self.parse_article)

    def parse_article(self, response):
        item = ItemLoader(item=Article(), response=response)
        # here I extract the title of every article
        item.add_xpath('title', '//h1[@class="title"]/text()')
        return item.load_item()
I'm not pleased with the output, something like:
[scrapy] DEBUG: Scraped from <200 http://example.com/tag/1/article_name>
{'title': [u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"']}
I think I need to use a custom ItemLoader class, but I don't know how. I need your help.
TL;DR: I need to convert text scraped by Scrapy from unicode to utf-8.
As you can see below, this isn't so much a Scrapy issue as a Python one. It could also only marginally be called an issue :)
$ scrapy shell http://censor.net.ua/resonance/267150/voobscheto_svoboda_zakanchivaetsya
In [7]: print response.xpath('//h1/text()').extract_first()
"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ"
In [8]: response.xpath('//h1/text()').extract_first()
Out[8]: u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"'
What you see is two different representations of the same thing - a unicode string.
What I would suggest is to run crawls with -L INFO or add LOG_LEVEL='INFO' to your settings.py so that this output is not shown in the console.
One annoying thing is that when you save as JSON, you get escaped unicode JSON e.g.
$ scrapy crawl example -L INFO -o a.jl
gives you:
$ cat a.jl
{"title": "\u00a0\"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f\""}
This is correct, but it takes more space, and most applications handle non-escaped JSON equally well.
Adding a few lines in your settings.py can change this behaviour:
from scrapy.exporters import JsonLinesItemExporter

class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
    'jl': 'myproject.settings.MyJsonLinesItemExporter',
}
Essentially what we do is just setting ensure_ascii=False for the default JSON Item Exporters. This prevents escaping. I wish there was an easier way to pass arguments to exporters but I can't see any since they are initialized with their default arguments around here. Anyway, now your JSON file has:
$ cat a.jl
{"title": " \"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ\""}
which is better-looking, equally valid and more compact.
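The underlying behaviour here is plain json, not Scrapy-specific; ensure_ascii controls whether non-ASCII characters get escaped:

import json

print(json.dumps({'title': u'СВОБОДА'}))
# {"title": "\u0421\u0412\u041e\u0411\u041e\u0414\u0410"}
print(json.dumps({'title': u'СВОБОДА'}, ensure_ascii=False))
# {"title": "СВОБОДА"}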
There are two independent issues affecting the display of unicode strings.
If you return a list of strings, the output file will have issues with them, because the ascii codec is used by default to serialize list elements. You can work around this as below, but it's more appropriate to use extract_first() as suggested by @neverlastn:
from scrapy import Item, Field

class Article(Item):
    title = Field(serializer=lambda x: u', '.join(x))
The default implementation of the __repr__() method serializes unicode strings to their escaped version (\uxxxx). You can change this behaviour by overriding the method in your item class:
class Article(Item):
    def __repr__(self):
        data = dict(self)
        for k in data.keys():
            if type(data[k]) is unicode:
                data[k] = data[k].encode('utf-8')
        return repr(data)
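To see the repr() point in isolation, in a Python 2 shell:

>>> d = {'title': u'СВОБОДА'}
>>> d  # repr() escapes non-ASCII characters in unicode strings
{'title': u'\u0421\u0412\u041e\u0411\u041e\u0414\u0410'}
>>> print d['title']  # printing the value itself renders fine
СВОБОДА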
I'm using Django to manage a Postgres database. I have a value stored in the database representing a city in Spain (Málaga). My Django project uses unicode strings for everything by putting from __future__ import unicode_literals at the beginning of each of the files I created.
I need to pull the city information from the database and send it to another server using an XML request. There is logging in place along the way so that I can observe the flow of data. When I try and log the value for the city I get the following traceback:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 1: ordinal not in range(128)
Here is the code I use to log the values I'm passing.
def createXML(self, dict):
    """
    .. method:: createXML()

    Create a single-depth XML string based on a set of tuples

    :param dict: Set of tuples (simple dictionary)
    """
    xml_string = ''
    for key in dict:
        self.logfile.write('\nkey = {0}\n'.format(key))
        if isinstance(dict[key], basestring):
            self.logfile.write('basestring\n')
            self.logfile.write('value = {0}\n\n'.format(dict[key].decode('utf-8')))
        else:
            self.logfile.write('value = {0}\n\n'.format(dict[key]))
        xml_string += '<{0}>{1}</{0}>'.format(key, dict[key])
    return xml_string
I'm basically saving all the information I have in a simple dictionary and using this function to generate an XML formatted string - this is beyond the scope of this question.
The error I am getting had me wondering what was actually being saved in the database. I have verified the value is utf-8 encoded. I created a simple script to extract the value from the database, decode it and print it to the screen.
from __future__ import unicode_literals
import psycopg2

# Establish the database connection
try:
    db = psycopg2.connect("dbname = 'dbname' \
                           user = 'user' \
                           host = 'IP Address' \
                           password = 'password'")
    cur = db.cursor()
except:
    print "Unable to connect to the database."

# Get database info if any is available
command = "SELECT state FROM table WHERE id = 'my_id'"
cur.execute(command)
results = cur.fetchall()
state = results[0][0]
print "my state is {0}".format(state.decode('utf-8'))
Result: my state is Málaga
In Django I'm doing the following to create the HTTP request:
## Create the header
http_header = "POST {0} HTTP/1.0\nHost: {1}\nContent-Type: text/xml\nAuthorization: Basic {2}\nContent-Length: {3}\n\n"
req = http_header.format(service, host, auth, len(self.xml_string)) + self.xml_string
Can anyone help me correct the problem so that I can write this information to the database and be able to create the req string to send to the other server?
Am I getting this error as a result of how Django is handling this? If so, what is Django doing? Or, what am I telling Django to do that is causing this?
EDIT1:
I've tried to use Django's django.utils.encoding on this state value as well. I read a little from saltycrane about a possible hiccup Django might have with unicode/utf-8 stuff.
I tried to modify my logging to use the smart_str functionality.
def createXML(self, dict):
    """
    .. method:: createXML()

    Create a single-depth XML string based on a set of tuples

    :param dict: Set of tuples (simple dictionary)
    """
    xml_string = ''
    for key in dict:
        if isinstance(dict[key], basestring):
            if key == 'v1:State':
                var_str = smart_str(dict[key])
                for index in range(0, len(var_str)):
                    var = bin(ord(var_str[index]))
                    self.logfile.write(var)
                    self.logfile.write('\n')
                self.logfile.write('{0}\n'.format(var_str))
        xml_string += '<{0}>{1}</{0}>'.format(key, dict[key])
    return xml_string
I'm able to write the correct value to the log this way, but I narrowed down another possible problem with the .format() string functionality in Python. Of course, my Google search for "python format unicode" turned up Issue 7300 as the first result, which states that this is a known "issue" with Python 2.7.
Now, from another stackoverflow post I found a "solution" that does not work in Django with the smart_str functionality (or at least I've been unable to get them to work together).
I'm going to continue digging around and see if I can't find the underlying problem - or at least a work-around.
EDIT2:
I found a work-around by simply concatenating strings rather than using the .format() functionality. I don't like this "solution" - it's ugly, but it got the job done.
def createXML(self, dict):
    """
    .. method:: createXML()

    Create a single-depth XML string based on a set of tuples

    :param dict: Set of tuples (simple dictionary)
    """
    xml_string = ''
    for key in dict:
        xml_string += '<{0}>'.format(key)
        if isinstance(dict[key], basestring):
            xml_string += smart_str(dict[key])
        else:
            xml_string += str(dict[key])
        xml_string += '</{0}>'.format(key)
    return xml_string
I'm going to leave this question unanswered as I'd love to find a solution that lets me use .format() the way it was intended.
This is the correct approach (the problem was with how the file was opened): with UTF-8, you must use codecs.open():
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs

class Writer(object):
    logfile = codecs.open("test.log", "w", 'utf-8')

    def createXML(self, dict):
        xml_string = ''
        for key, value in dict.iteritems():
            self.logfile.write(u'\nkey = {0}\n'.format(key))
            if isinstance(value, basestring):
                self.logfile.write(u'basestring\n')
                self.logfile.write(u'value = {0}\n\n'.format(value))
            else:
                self.logfile.write(u'value = {0}\n\n'.format(value))
            xml_string += u'<{0}>{1}</{0}>'.format(key, value)
        return xml_string
And this is from the Python console:
In [1]: from test import Writer
In [2]: d = { 'a' : u'Zażółć gęślą jaźń', 'b' : u'Och ja Ci zażółcę' }
In [3]: w = Writer()
In [4]: w.createXML(d)
Out[4]: u'<a>Za\u017c\xf3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144</a><b>Och ja Ci za\u017c\xf3\u0142c\u0119</b>'
And this is the test.log file:
key = a
basestring
value = Zażółć gęślą jaźń
key = b
basestring
value = Och ja Ci zażółcę
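For what it's worth, io.open (available since Python 2.6, and the behaviour of the built-in open() on Python 3) accepts an encoding argument as well, so the same fix works without the codecs module, assuming the rest of the class stays unchanged:

import io

class Writer(object):
    logfile = io.open("test.log", "w", encoding='utf-8')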