SQLite like with Regular Expressions - regex

I have a column with HTML content. I want to search for words in that column, but only the text, not the HTML code.
For example:
(1) <p class="last">First time I went there...</p>
(2) This is a <em>very</em> subtle colour.
(1) Searching for last doesn't find it, because it's a class name, not content.
(2) Searching for very subtle will find it, ignoring HTML
Is this possible with SQLite directly?
Note: I cannot define functions.

Don't do it with SQLite.
Do it in your programming language or in the framework that uses SQLite.
In the table that holds the HTML column, add extra columns for data about the HTML. You fill those columns while you parse the HTML in your application code.
Record whatever you need about the structure of the HTML, and save its plain-text content in a separate column.
You can split the HTML into tags and runs of text with a simple regex:
/<?[^<>]+>?/
Scan the HTML with the regex above and iterate over the results, evaluating each one. If a string in the results array starts with a "<", it's a tag. Scanning it with /<\s*\/\s*[^>]+>/ tells you whether it is a closing tag, and scanning it with /<\s*[^\/>]+\s*\/\s*>/ tells you whether it is a self-closing tag. If none of these cases applies, it is textual content.
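The tokenizing idea above can be sketched in Python roughly as follows. This is only an illustration using the three regexes from this answer; the function names classify and text_content are made up for the example, and a regex approach like this will misread edge cases such as a literal "<" inside text or markup inside comments and scripts.

```python
import re

# The three patterns from the answer, written as Python regexes.
TOKEN = re.compile(r'<?[^<>]+>?')                # one tag or one run of text
CLOSING = re.compile(r'<\s*/\s*[^>]+>')          # e.g. </p>
SELF_CLOSED = re.compile(r'<\s*[^/>]+\s*/\s*>')  # e.g. <br />


def classify(token):
    """Label a token as 'text', 'closing', 'self-closed' or 'opening'."""
    if not token.startswith('<'):
        return 'text'
    if CLOSING.fullmatch(token):
        return 'closing'
    if SELF_CLOSED.fullmatch(token):
        return 'self-closed'
    return 'opening'


def text_content(html):
    """Join only the text tokens, dropping every tag."""
    return ''.join(t for t in TOKEN.findall(html) if classify(t) == 'text')


print(text_content('This is a <em>very</em> subtle colour.'))
```

The returned string is what you would store in the extra text column.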

There isn't a good way to do that in SQLite directly (you'd need to build a SQLite extension that would parse the HTML and let you search through it like MSSQL's XML field type).
Your best bet is to parse the HTML in your code and write all the text into a separate column to search on, as @Kevin suggests in the comments.
E.g.
ID | HTML                                   | Text
---------------------------------------------------------------------------
1  | <p class="last">First time ...</p>     | First time ...
2  | This is a <em>very</em> subtle colour. | This is a very subtle colour.
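A minimal sketch of that layout, assuming Python with the standard sqlite3 and html.parser modules (the table name docs, the helper html_to_text and the function search are invented for the illustration). The text is extracted once at insert time, so the query itself is a plain LIKE and needs no user-defined function.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, html TEXT, text TEXT)')
for html in ('<p class="last">First time I went there...</p>',
             'This is a <em>very</em> subtle colour.'):
    conn.execute('INSERT INTO docs (html, text) VALUES (?, ?)',
                 (html, html_to_text(html)))


def search(term):
    """Return the ids of rows whose plain text contains the term."""
    cur = conn.execute('SELECT id FROM docs WHERE text LIKE ? ORDER BY id',
                       ('%' + term + '%',))
    return [row[0] for row in cur]


print(search('last'))         # class name only, so no row matches
print(search('very subtle'))  # matches the second row
```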

Select certain fields/children of flat xml with xpath

I have an XML file that is laid out like a single database table. Is there a way to get the "listpos" rows with only certain fields, i.e. only "type" and "objName"?
<listpos lfdNr="0001" reihe="20140626143443">
<type>Akt</type>
<objName>2#25.6.2014#40801#de</objName>
<laborOrt>au</laborOrt>
...
</listpos>
<listpos lfdNr="0002" reihe="20140626181936">
<type>Akt</type>
<objName>2#25.6.2014#40802#de</objName>
<laborOrt>au</laborOrt>
...
</listpos>
...
So the output should be
<listpos>
<type>Akt</type>
<objName>2#25.6.2014#40801#de</objName>
</listpos>
<listpos>
<type>Akt</type>
<objName>2#25.6.2014#40802#de</objName>
</listpos>
...
Writing this, it occurs to me that this might be a job for XSLT?
What I am actually trying to do is make django-xml work together with django_tables2 ...
Yes, by using XPath you can get that data pretty easily.
You navigate to the appropriate elements using XPath expressions.
For the data you're trying to extract (assuming you want all "type" and "objName" elements), those would be "//listpos/type" and "//listpos/objName".
Those would get you the right nodes from which you can extract the content.
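As a rough illustration in Python, the standard library's xml.etree.ElementTree supports enough of XPath for these two expressions (written relative to the root as ".//listpos/type"). The wrapping <root> element and the helper keep_fields are assumptions made for the example, since the question does not show the document's real root element.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in a hypothetical <root> element.
XML = """<root>
<listpos lfdNr="0001" reihe="20140626143443">
<type>Akt</type>
<objName>2#25.6.2014#40801#de</objName>
<laborOrt>au</laborOrt>
</listpos>
<listpos lfdNr="0002" reihe="20140626181936">
<type>Akt</type>
<objName>2#25.6.2014#40802#de</objName>
<laborOrt>au</laborOrt>
</listpos>
</root>"""

root = ET.fromstring(XML)

# The two expressions from the answer, relative to the root element.
types = [e.text for e in root.findall('.//listpos/type')]
names = [e.text for e in root.findall('.//listpos/objName')]


def keep_fields(root, fields=('type', 'objName')):
    """Build a new tree whose listpos rows contain only the given fields."""
    out = ET.Element('root')
    for listpos in root.iter('listpos'):
        row = ET.SubElement(out, 'listpos')
        for child in listpos:
            if child.tag in fields:
                ET.SubElement(row, child.tag).text = child.text
    return out


print(ET.tostring(keep_fields(root), encoding='unicode'))
```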

Mapping user spreadsheet columns to database fields

I’m not sure where to start on this project. I know how to read the contents of the excel spreadsheet, I know how to identify the header row, I know how to loop over the contents. I believe I have the UX portion worked out but I am not sure how to process the data.
I’ve googled and only found .Net solutions but I’m looking for a ColdFusion/Lucee solution.
I have a working form allowing me to map a user's spreadsheet columns to my database values (this is being kept simple for this post; the user does not have direct access to the database).
Now that I have my data, I'm not sure how to loop over the results. I believe there will be several loops (an outer and an inner). Then of course I also need to loop over the file contents, but I think if I can get the headings mapped out, I can figure out the rest.
Any good links, tutorials, or guides would be greatly appreciated.
Some pseudo code might be enough to get me started.
User uploads form
System reads headers and content.
User is presented with a form listing the columns from their uploaded spreadsheet to match with available database fields (e.g. “column1” matches “customer name”).
User submits form.
Now what?
UPDATED
Here is what the data looks like AFTER the mapping has been done in my form. The column delimiter is ::: and, within each column, ||| marks the ID associated with the selected column value. I've included the ID and the column value since I plan on displaying the mapping again as a confirmation. Having the ID saves a trip to the database.
If I understand correctly, your question is: how do you provide the user a form allowing them to map their spreadsheet columns to those of the database?
Since you have their spreadsheet column names, and you have the database column names, then this problem is essentially a UI/UX problem. You need to show both lists, and allow the user to map them. I can imagine several approaches to this. My first thought would be some sort of drag/drop operation, as follows:
Create a list of boxes, one for each field in your database table, and include the field name in (or above) the box. I'll call this the db field list. Then, create another list for each column from the spreadsheet, which I'll call the spreadsheet column list. The user would drag/drop items from the spreadsheet column list to the db field list.
When the user has completed a mapping, you would store the column/field names as data on the DOM element of the corresponding db field list box. Then, upon submission, you would collect the mapping data by visiting each box and adding it to an array. Then you would serialize that array into JSON and send it to your form submission handler.
This could be difficult or easy, depending on your knowledge of UI implementations using JavaScript. jQuery makes this easy (if you know jQuery). There's even a jquery UI plugin that does this: https://jqueryui.com/droppable/.
A quick search for "javascript drag drop" would help, and here are a few articles I found:
https://www.w3schools.com/html/html5_draganddrop.asp
https://medium.com/quick-code/simple-javascript-drag-drop-d044d8c5bed5
You would also need to submit the array of mappings using javascript. You could search for that as well, and here's an article I found:
https://codereview.stackexchange.com/questions/94493/submit-an-array-as-an-html-form-value-using-javascript
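Once the handler has the mapping, applying it is the outer/inner loop the question anticipates: an outer loop over spreadsheet rows and an inner loop over mapped columns. Below is a language-neutral sketch in Python (a ColdFusion/Lucee version would follow the same shape). The JSON layout with "column" and "field" keys and the function name apply_mapping are assumptions for the example, not the format from the question's form.

```python
import json


def apply_mapping(mapping_json, header, rows):
    """Turn spreadsheet rows into records keyed by database field name.

    mapping_json: JSON array of {"column": ..., "field": ...} pairs
                  (hypothetical layout for this sketch).
    header:       list of spreadsheet column names, in file order.
    rows:         list of rows, each a list of cell values.
    """
    mapping = json.loads(mapping_json)
    position = {name: i for i, name in enumerate(header)}

    records = []
    for row in rows:              # outer loop: one spreadsheet row at a time
        record = {}
        for pair in mapping:      # inner loop: one mapped column at a time
            record[pair['field']] = row[position[pair['column']]]
        records.append(record)
    return records


mapping_json = json.dumps([
    {'column': 'column1', 'field': 'customer_name'},
    {'column': 'column3', 'field': 'email'},
])
header = ['column1', 'column2', 'column3']
rows = [['Ann', 'ignored', 'ann@example.com'],
        ['Bob', 'ignored', 'bob@example.com']]

records = apply_mapping(mapping_json, header, rows)
print(records)
```

Each resulting record can then be passed to a parameterized INSERT; unmapped spreadsheet columns are simply skipped.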

scrape with lxml and python from web si

I would like to extract certain data from websites.
Originally, I was converting the pages to .txt files and then writing some routines in Python to filter/read out the data. That worked for 95% of the data, which is not sufficient. I found that there is a way to do it with lxml, which I tried without success. I think my XPath points to the correct position, but I only get empty brackets [] as the result. If anybody knows how to correct it, it would be highly appreciated.
Thanks.
Peter
from lxml import html
import requests
page=requests.get('http://www.finanzen.net/analyse/ING_Group_NV_overweight-JP_Morgan_Chase__Co__529284')
tree=html.fromstring(page.text)
unternehmen=tree.xpath('/html/body/div[1]/div[8]/div[2]/div[3]/div[4]/div[1]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/td[1]/br')
# This should fetch the information about unternehmen (the company)
print(unternehmen)
That page's source code is a mess: very few id attributes and many anonymous <div> elements. You could find the nearest element with an id attribute and go up to the ancestor that contains the table holding the element you want to extract. The following XPath expression worked for me. Give it a try:
response.xpath('//a[@id="commentLink"]/ancestor::div[4]/div[2]/div[4]/div[2]/table/tr/td[1]/text()')[0]
It yields:
u'ING Group NV'
The final [0] is there because non-closed <br> tags are hard to parse and the td cell yields three text nodes, so I take only the first one. You will have to adapt it to your needs.

How to determine if a given word or phrase from a list is within an anchor tag?

We have a ColdFusion based site that involves a large number of 'document authors' that have little or no knowledge of HTML. The 'documents' they create are comprised of HTML stored in a table in the database. They use a CKEDITOR interface. The content that they create is output into specific area of the page. The document frequently has tons of technical terms that readers may not be familiar with that we would like to have tooltips automatically show up for.
I and the other programmer want to have some code insert 'tooltip' code into the page based on a list of words in a table on our SQL server. The 'dictionary' table in our database has a unique ID, the word/phrase we will look for and a corresponding definition that would be displayed in the tooltip.
For instance, one of the words/phrases we will be looking for is 'Scrum Master'. If it occurs in the document area, we need to insert code around the words to create a tooltip. To do that, we need to see if certain conditions exist. Are the words within an anchor tag? If yes, is there already a title value for the tag (title is used to contain the info to be displayed in a tooltip)? If a title attribute exists, don't do anything. If the words are not in an anchor tag, then we would put anchor tags around the words along with the title that will contain the definition.
The tooltip code we use is via jQuery (http://jqueryui.com/tooltip/). It is quick and simple to use. We just need to figure out how to use it dynamically based on our dictionary table.
Do you have any suggestions of how to go about this?
I was hoping that jSoup might have a function that I could use, but that doesn't seem to be the right technology for what I want to do, but I could be wrong and I am happy to be corrected!
We have a large number of these documents and so manually inserting and maintaining the tooltip code is just not an option.
Update your content with something like:
strOut = ReplaceList(strIn, ValueList(qryTT.find), ValueList(qryTT.replace));
Since words are delimited by spaces, the qryTT.find needs to have spaces. The replace column is going to need to include some of the original content. You are going to have to be careful with words followed by a comma or a period too.
I would cache the results because I would expect it to be memory intensive.
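For comparison, here is a hedged sketch of the same find-and-replace idea in Python rather than ColdFusion. It uses regex word boundaries instead of space-delimited search strings, which sidesteps the comma and period problem, and it leaves text inside any existing anchor untouched (a simplification: the question only requires skipping anchors that already have a title). The function name add_tooltips and the dictionary layout are invented for the example.

```python
import re
from html import escape

TAG = re.compile(r'(<[^<>]+>)')                   # capture tags when splitting
ANCHOR_OPEN = re.compile(r'<\s*a\b', re.I)        # <a ...>
ANCHOR_CLOSE = re.compile(r'<\s*/\s*a\s*>', re.I)  # </a>


def add_tooltips(html, dictionary):
    """Wrap dictionary terms in <a title="..."> unless already inside an anchor."""
    if not dictionary:
        return html
    # Longest terms first so a longer phrase wins over a shorter one it contains.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(t) for t in terms) + r')\b')

    def wrap(match):
        term = match.group(1)
        title = escape(dictionary[term], quote=True)
        return '<a title="%s">%s</a>' % (title, term)

    out = []
    inside_anchor = 0
    for part in TAG.split(html):
        if part.startswith('<') and part.endswith('>'):
            if ANCHOR_OPEN.match(part):
                inside_anchor += 1
            elif ANCHOR_CLOSE.match(part):
                inside_anchor = max(0, inside_anchor - 1)
            out.append(part)                      # tags pass through unchanged
        elif inside_anchor:
            out.append(part)                      # text already inside an anchor
        else:
            out.append(pattern.sub(wrap, part))   # plain text: add tooltips
    return ''.join(out)


doc = '<p>The Scrum Master, and <a href="/x">Scrum Master</a>.</p>'
print(add_tooltips(doc, {'Scrum Master': 'Facilitates the team'}))
```

Matching here is case-sensitive, and the output is a good candidate for the caching suggested above.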

Using RegEx in SSIS

I currently have a package pulling data from an excel file, but when pulling the data out I get rows I do not want. So I need to extract everything from the 'ID' field that has any sort of letter in it.
I need to be able to run a RegEx command such as "%[a-zA-Z]%" to pull out that data. But with the current limitation of conditional split it's not letting me do that. Any ideas on how this can be done?
At the core of the logic, you would use a Script Transformation as that's the only place you can access the regex.
You could simply add a second column to your data flow, IDCleaned, and that column would contain only cleaned values or a NULL. You could then use the Conditional Split to filter good rows vs bad. See also the related question "System.Text.RegularExpressions.Regex.Replace error in C# for SSIS".
If you don't want to add another column, you can set your current ID column to be ReadWrite for the Script and then update in place. Perhaps adding a boolean column might make the Conditional Split logic easier at this point.
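The per-row logic of such a script is small. An actual SSIS Script Transformation would be written in C# or VB.NET, so the Python below only illustrates the logic, under the assumption that an ID containing any letter marks a row you want to filter out. The names clean_row, IDCleaned and IsBadID are placeholders.

```python
import re

HAS_LETTER = re.compile(r'[a-zA-Z]')  # regex equivalent of the %[a-zA-Z]% pattern


def clean_row(row):
    """Return a copy of the row with IDCleaned and IsBadID columns added."""
    raw_id = row['ID']
    is_bad = bool(HAS_LETTER.search(str(raw_id)))
    cleaned = dict(row)
    cleaned['IDCleaned'] = None if is_bad else raw_id  # NULL for rejected IDs
    cleaned['IsBadID'] = is_bad                        # easy to split on
    return cleaned


rows = [{'ID': '12345'}, {'ID': 'Total'}, {'ID': '98A76'}]
cleaned_rows = [clean_row(r) for r in rows]
good_rows = [r for r in cleaned_rows if not r['IsBadID']]
print(good_rows)
```

The final list comprehension plays the role of the Conditional Split.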