Converting Word HTML document to PDF - coldfusion

I have a document that is generated via HTML/CSS and made to be a Word document (i.e., it uses Word formatting so that Word recognizes it as a Word document). I then need to convert this Word document to PDF. However, using <cfdocument> (with OpenOffice) returns an error. I've tried renaming the document to an HTML file, but Word-specific CSS for things like page breaks and page numbers is not translated. Is there a way to convert these Word HTML documents into proper PDFs?

Finding and replacing a pattern with bold and normal characters

So as the title suggests I have a crazy thing that I need to do and was wondering if there is a faster way to do it. Basically I have a list in Word format. On each line there is data that looks like this:
Bold Text Normal Text
I need to insert something between the bold and normal text. Is there any way to find only the places that match that pattern (i.e. B space here N)? I could then easily insert what I need. Maybe something with regex?
OK, so a slightly extreme idea:
Is the document you are talking about a docx? If not, I guess you can convert it to one.
I've tried this on a docx file without a regex, but I'm sure you'll be able to take care of that part :)
So!
Extract the docx file as a zip archive
You can add .zip to the file name, as an extension, or just open with an archiver - such as 7zip.
Navigate to the folder named word, under the extracted folder.
Open document.xml with your preferred editor
Every part of the text whose style changes has its own tag.
Find a string that looks like this: <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000"><w:rPr><w:b w:val="1"/><w:rtl w:val="0"/></w:rPr><w:t xml:space="preserve">bold text </w:t></w:r>
That is what a styled string section (a "run") looks like ^
The tag <w:b w:val="1"/>, with the value 1, indicates that the string inside ("bold text ") has the bold style.
Create a string that looks like the one shown above and insert the text you like. If, for example, you want the new text to have another style, such as italic, use <w:i w:val="1"/> (with i instead of b).
My example:
I wanted to add pictures, but I don't have enough reputation :(
It looks like:
Before: bold text normal text
After: bold text hi im new normal text
The XML example:
https://gist.github.com/arieljannai/08756ef562962eee0798
So the only thing you need to do now is build a regex that finds the parts with w:b tags and everything surrounding them, and then you have it :)
Good luck!
EDIT: A regex example I made, that matches a style string line, like I put in the example above:
(<w:r.*?>(?:<w:b\s{1}.*?\/>){1}.*?(?:<w:t\s{1}.*?>(.*?)<\/w:t>)<\/w:r>)
The regex matches a whole section between the <w:r> tags (first capturing group).
The first non-capturing group makes sure it has the bold tag ((?:<w:b\s{1}.*?\/>)).
The second non-capturing group finds the tag that holds the text (the <w:t> tag).
Inside the second non-capturing group is the second capturing group, (.*?), which actually holds the text of that styled string.
So you have the whole styled string in the first group, and only the actual text in the second group.
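The steps above could be automated roughly as follows. This is a minimal Python sketch, not the answerer's code: the function names insert_after_bold and rewrite_docx are made up for illustration, the pattern is a stricter variant of the regex above, and it assumes bold runs look like the example (a <w:b .../> tag inside the run). It does not check for w:val="0", which turns bold off, so treat it as a starting point only.

```python
import re
import zipfile
from xml.sax.saxutils import escape

# One <w:r>...</w:r> run that contains a <w:b .../> tag.
# (?:(?!</w:r>).) consumes one character at a time but never crosses the end
# of the current run, so a non-bold run cannot "borrow" a later bold tag.
BOLD_RUN = re.compile(
    r'<w:r(?:\s[^>]*)?>(?:(?!</w:r>).)*?<w:b(?:\s[^>]*)?/>(?:(?!</w:r>).)*?</w:r>',
    re.DOTALL,
)


def insert_after_bold(document_xml: str, text: str) -> str:
    """Insert a new unstyled run holding `text` right after every bold run."""
    new_run = f'<w:r><w:t xml:space="preserve">{escape(text)}</w:t></w:r>'
    # A function is used as the replacement so backslashes in the XML are kept as-is.
    return BOLD_RUN.sub(lambda m: m.group(0) + new_run, document_xml)


def rewrite_docx(src_path: str, dst_path: str, text: str) -> None:
    """Copy a .docx (a zip archive), patching only word/document.xml."""
    with zipfile.ZipFile(src_path) as zin, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                patched = insert_after_bold(data.decode("utf-8"), text)
                data = patched.encode("utf-8")
            zout.writestr(item, data)
```

Calling rewrite_docx("list.docx", "list_patched.docx", "hi im new ") (both file names are placeholders) would write a copy with the new text after each bold run. Work on a copy, since Word can split what looks like one bold phrase into several runs.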

Nutch Domain Regular Expression

I am following the tutorial here, trying to build a robot against a website.
I am in a page that contains all the product categories. Say it is www.example.com/allproducts.
After diving into a category, you can see the product list in a table format, and you can click the next page to loop through all the pages inside that category. Actually you can only see links for pages 1, 2, 3, 4, 5 and the last page.
The first page in the category has a URL that looks like www.example.com/level1/level2/_/N-1, then the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on and so forth.
I personally don't have that much Java programming experience and I am wondering
whether I can crawl all the product list pages using Nutch and store the HTML for now,
and maybe later figure out a way to parse/index the HTML correctly.
(1) Can I just modify conf/regex-urlfilter.txt and replace
# accept anything else
+.
with something correct? (I just don't understand how could
+^http://([a-z0-9]*\.)*nutch.apache.org/
restricts URLs to the Nutch domain. I would interpret that regular expression to mean that between the double slash and nutch there can be any characters that are alphanumeric, an asterisk, a backslash or a dot.)
How can I build the regular expression so that it only scrapes http://www.example.com/.../.../_/N-../...
(2) I can see the HTML is stored in the content folder inside the segment. However, when I open that file in vi, it looks like total nonsense to me, and I am wondering if that is the so-called Java serialization, which I would need to deserialize in Java to read.
Forgive me if those questions are too basic and thanks a lot for reading.
(1) Can I just modify conf/regex-urlfilter.txt and replace
Sure. You should replace +. with these lines:
#accept all products page
+www\.example\.com/allproducts
#accept categories pages
+www\.example\.com/level1/level2/_/N-
One important note about regexes in this file: the regular expressions are matched partially. So if you write a rule like "+ab", it means "accept all URLs that contain ab", so it matches these URLs:
ab
abc
http://ab.com/c.html
By default, Nutch filters out URLs containing ? (since they are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:
-[?*!@=]
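To see the partial-match behaviour without running a crawl, here is a small Python sketch that imitates the filter. It is only an approximation: Nutch uses Java regular expressions, re.search stands in for them here, and the accepts function and RULES list are illustrative names, not part of Nutch. It assumes the first matching rule decides and that a URL matching no rule is rejected.

```python
import re

# (sign, pattern) pairs in file order; "+" accepts, "-" rejects.
RULES = [
    ("+", r"www\.example\.com/allproducts"),
    ("+", r"www\.example\.com/level1/level2/_/N-"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern occurs anywhere in url is a '+' rule."""
    for sign, pattern in rules:
        # re.search finds the pattern anywhere in the URL (a partial match).
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is ignored.
    return False


if __name__ == "__main__":
    print(accepts("http://www.example.com/level1/level2/_/N-1/?No=100"))
    print(accepts("http://www.example.com/about"))
    # A simplified stand-in for the default "skip query URLs" rule, placed first:
    print(accepts("http://www.example.com/level1/level2/_/N-1/?No=100",
                  [("-", r"\?")] + RULES))
```

With the stand-in skip rule in front, the paginated URL is rejected even though a later rule would accept it, which is why that line needs to be commented out for this crawl.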
(2) I can see the HTML ...
Nutch saves the files in binary format. See https://stackoverflow.com/a/10150402/1881318

Regex select XML Element (containing hyphen) and inside content

I'm working with an enterprise CMS and, in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being created which has a number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as the content they contain.
I've tried a few different regexes, but I can't seem to match the closing tag, so the match just trails on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete them and save the file again every week.
It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
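If the weekly cleanup ever moves out of TextWrangler into a script, the same idea can be sketched in Python, where re.DOTALL plays the role of the s flag. The function name strip_page_content is made up for this example.

```python
import re

# re.DOTALL lets "." match newlines, so the lazy .*? can span the
# multi-line HTML between the opening and closing tags.
PAGE_CONTENT = re.compile(r"<page-content>.*?</page-content>", re.DOTALL)


def strip_page_content(xml: str) -> str:
    """Remove every <page-content> element together with its content."""
    return PAGE_CONTENT.sub("", xml)


if __name__ == "__main__":
    sample = (
        "<item><title>A</title>"
        "<page-content><p>line 1</p>\n<p>line 2</p></page-content>"
        "</item>"
    )
    print(strip_page_content(sample))
```

Without the flag, the same pattern finds nothing in the sample, because the content contains a newline; that matches the "no matches" symptom described in the question.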

Regular expression to validate length string without including html tags

I am using umbraco where the validation on fields is done by regular expressions. In one field I want to allow users to style their text using the rich text editor (tinymce) but I still want to limit the number of characters they can enter.
I'm currently using this regular expression but it checks the total number of characters so includes the html.
^[\s\S]{0,250}$
Is there a regular expression that wouldn't count the characters in HTML tags?
The short answer is no. At least, not with any sane regex, not without an advanced regex engine that allows recursion or balanced groups, and maybe not at all. A regex that can recognize and ignore HTML tags would have to parse the HTML to do it, and down that road lies madness.
However, you could use some sort of preprocessing, such as jQuery on the client-side or something else on the server-side, to parse the HTML and strip out the tags before you apply length validation.
Are you sure you want to do this, though? If you're storing the styled input in a database, then those HTML tags are going to count against your column size just like everything else will. If you're storing these in a varchar(250) column, you're going to have to either count the HTML tags as part of that 250, or else strip them out and lose all the style information.
It's going to be hard (nigh impossible) to do this in one step, since the grammar you're trying to detect is not regular. Two steps would be easy; just do a s/<.+?>//g substitution first to remove all the tags, then count again.
On a related note, your regex above can be simplified if your engine's dot-matches-newline (single-line) option is on: . then represents any character, and you don't need the "whitespace OR not-whitespace" trick you're using. Without that option, . stops at line breaks, so keep [\s\S] if the input can span several lines.
^.{0,250}$
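The two-step approach (strip the tags, then count) might look like this on the server side. This is a Python sketch for illustration only; it is not Umbraco or TinyMCE code, and the names visible_length and is_within_limit are invented. It uses the standard library's HTML parser rather than a tag-stripping regex, so entities such as &amp; count as one character.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_length(html: str) -> int:
    """Number of characters the user actually typed, markup excluded."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return len("".join(collector.parts))


def is_within_limit(html: str, limit: int = 250) -> bool:
    return visible_length(html) <= limit


if __name__ == "__main__":
    print(visible_length("<p><strong>Hello</strong> world</p>"))
```

Note that the stored value still includes the markup, so the database column has to be sized for the HTML, not for the visible text.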

Getting alt tags with regex

I am parsing some HTML source. Is there a regex script to find out whether alt tags in a html document are empty?
I want to see if the alt tags are empty or not.
Is regex suitable for this or should I use string manipulation in C#?
You have to parse the HTML and check the tags. The following link includes a C# library for parsing HTML tags; you can loop through the tags and count them: Parsing HTML tags.
If this is valid XHTML, why do you need Regex at all? If you simply search for the string:
alt=""
... you should be able to find all empty alt tags.
In any case, it shouldn't be too complicated to construct a Regex for the search too, taking into account poorly written HTML markup (especially with spaces):
alt\s*=\s*"\s*"
If you want to do it just by looking at the page, then CSS selectors might be better, assuming your browser supports the :not selector.
Install the selectorgadget bookmarklet. Activate it on your page, then put the following selector in the input box and press enter.
img:not([alt])
Note that this selector finds images with no alt attribute at all; to find images whose alt attribute is present but empty, use img[alt=""]. If you are automating it and have access to the DOM for the HTML, you could use the same selectors.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
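As an illustration of the parser approach, here is a short sketch using Python's built-in html.parser (the question is about C#, so this only shows the idea; the names AltChecker and check_alts are made up). It separates images with no alt attribute from images whose alt is present but empty.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Records the src of every <img> whose alt is missing or empty."""

    def __init__(self):
        super().__init__()
        self.missing = []  # <img> with no alt attribute at all
        self.empty = []    # <img alt=""> or alt with only whitespace

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src", "")
        if "alt" not in attr_map:
            self.missing.append(src)
        elif not (attr_map["alt"] or "").strip():
            # A bare `alt` with no value is reported as None by the parser.
            self.empty.append(src)


def check_alts(html: str):
    """Return (missing, empty) lists of image sources."""
    checker = AltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing, checker.empty


if __name__ == "__main__":
    page = '<img src="a.png" alt=""><img src="b.png"><img src="c.png" alt="Logo">'
    print(check_alts(page))
```

Whether an empty alt is a problem depends on the image: an empty alt is the usual way to mark a purely decorative image, so the two lists are worth reviewing separately.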