HTML renderer with limited resources (good memory management) - C++

I'm creating a Linux program in C++ for a portable device in order to render HTML files.
The problem is that the device is limited in RAM, which makes it impossible to open big files with existing software.
One solution is to dynamically load/unload parts of the file, but I'm not sure how to implement that.
The ability to scroll is a must, with a smooth experience if possible.
I would like to hear what you think the best approach is for such a situation.
You can suggest an algorithm, an open-source project to take a look at, or a library that supports what I'm trying to do (WebKit?).
EDIT:
I'm writing an ebook reader, so I just need pure HTML rendering: no JavaScript, no CSS, and so on.

To be able to browse a tree document (like HTML) without fully loading it, you'll have to make a few assumptions - such as the document being an actual tree. So, don't bother checking closing tags; closing tags are designed for human consumption anyway, and computers would be happy with <> too.
The first step is to assume that the first part of the rendered output is produced by the first part of the file. That sounds like a tautology, but with "modern" HTML and certainly with JS it is technically no longer true. Still, if any line of HTML can affect any pixel, you simply cannot partially load a page.
So, assuming there's a simple relation between position in the HTML file and pages on screen, the next step is to define the parse state at the end of each page. This state will include a single file offset, probably (but not necessarily) at the end of a paragraph, plus a stack of the currently open tags.
To make paging easier, it's smart to keep this "page boundary" state for each page you've encountered so far. This makes paging back easy.
Now, when rendering a new page, the previous page boundary state will give you the initial rendering state. You simply read HTML and render it element by element until you overflow a single page. You then backtrack a bit and determine the new page boundary state.
Smooth scrolling is basically a matter of rendering two adjacent pages and showing x% of the first and 100-x% of the second. Once you've implemented this bit, it may become smart to finish a paragraph when rendering each page. This will give you slightly different page lengths, but you don't have to deal with broken paragraphs, and that in turn makes your page boundary state a bit smaller.
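To make this concrete, here is a minimal, hypothetical C++ sketch of the page-boundary mechanics described above. PageBoundary, parseNextElement and renderElement are invented names, and the parser and layout code are reduced to toy stubs so that only the control flow matters:

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// State saved at every page boundary: where the page starts in the HTML file
// and which tags are still open at that point.
struct PageBoundary {
    std::size_t offset;                  // byte offset into the HTML file
    std::vector<std::string> openTags;   // stack of open tags at that offset
};

// Toy tokenizer: treats every whitespace-separated token as one "element" and
// keeps the open-tag stack up to date. 'html' stands for a small window of the
// file read around 'pos'; a real reader would refill it from disk as needed.
static bool parseNextElement(const std::string& html, std::size_t& pos,
                             std::vector<std::string>& openTags, std::string& element) {
    while (pos < html.size() && std::isspace(static_cast<unsigned char>(html[pos]))) ++pos;
    if (pos >= html.size()) return false;
    std::size_t start = pos;
    while (pos < html.size() && !std::isspace(static_cast<unsigned char>(html[pos]))) ++pos;
    element = html.substr(start, pos - start);
    if (element.size() > 2 && element.front() == '<' && element.back() == '>') {
        if (element[1] == '/') { if (!openTags.empty()) openTags.pop_back(); }
        else openTags.push_back(element.substr(1, element.size() - 2));
    }
    return true;
}

// Toy layout: pretend every element costs a fixed amount of vertical space.
static int renderElement(const std::string& /*element*/) { return 20; }

// Render one page starting at 'start'; return the boundary where the next page begins.
PageBoundary renderPage(const std::string& html, const PageBoundary& start, int pageHeightPx) {
    std::size_t pos = start.offset;
    std::vector<std::string> openTags = start.openTags;
    int used = 0;
    std::string element;

    for (;;) {
        PageBoundary before{pos, openTags};          // snapshot so we can backtrack
        if (!parseNextElement(html, pos, openTags, element))
            return before;                           // ran out of document
        used += renderElement(element);
        if (used > pageHeightPx)
            return before;                           // element didn't fit: next page starts here
    }
}

The reader would keep a std::vector<PageBoundary> of every boundary returned so far; paging backwards is then just a lookup.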

Dillo is the lightest weight Linux web browser that I'm aware of.
Edit: If it (or its rendering component) won't meet your needs, then you might find Wikipedia's list and comparison of layout engines to be helpful.
Edit 2: I suspect that dynamically loading and unloading parts of an HTML file would be tricky; for example, how would you know that a randomly chosen chunk of the file isn't in the middle of a tag? You'd probably have to use something like SAX to parse the file into an intermediate representation, saving discrete chunks of that representation to persistent storage so that they don't take up too much RAM. Or you could parse the file with SAX to show whatever fits in RAM at once, then re-parse it whenever the user scrolls too far. (Stylesheets and JavaScript would ruin this approach; some plain HTML might too.) If it were me, I'd try to find a simpler markup language or some kind of rich text viewer rather than going to all of that trouble.
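As an illustration of that "parse once, spill an intermediate representation to storage" idea, here is a rough sketch. ChunkRef, buildChunkIndex and loadChunk are made-up names, the "intermediate representation" is just stripped text, and the tag handling is deliberately crude:

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Index entry kept in RAM: where each pre-parsed chunk lives on disk.
struct ChunkRef {
    std::streamoff offset;   // position of the chunk in the spill file
    std::size_t    size;     // chunk size in bytes
};

// One pass over the HTML: strip tags, split the text into ~chunkSize pieces,
// append each piece to 'spillPath' and remember only (offset, size) in memory.
std::vector<ChunkRef> buildChunkIndex(const std::string& htmlPath,
                                      const std::string& spillPath,
                                      std::size_t chunkSize = 16 * 1024) {
    std::ifstream in(htmlPath, std::ios::binary);
    std::ofstream spill(spillPath, std::ios::binary | std::ios::trunc);
    std::vector<ChunkRef> index;
    std::string chunk;
    bool inTag = false;
    char c;

    auto flush = [&]() {
        if (chunk.empty()) return;
        ChunkRef ref{static_cast<std::streamoff>(spill.tellp()), chunk.size()};
        spill.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        index.push_back(ref);
        chunk.clear();
    };

    while (in.get(c)) {
        if (c == '<') inTag = true;          // crude tag stripping, good enough for a sketch
        else if (c == '>') inTag = false;
        else if (!inTag) chunk.push_back(c);
        if (chunk.size() >= chunkSize) flush();
    }
    flush();
    return index;
}

// Load exactly one chunk back when the user scrolls to it.
std::string loadChunk(const std::string& spillPath, const ChunkRef& ref) {
    std::ifstream spill(spillPath, std::ios::binary);
    spill.seekg(ref.offset);
    std::string text(ref.size, '\0');
    spill.read(&text[0], static_cast<std::streamsize>(ref.size));
    return text;
}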

Related

MFC: what is the best way to generate and display a large document?

I am still new to this area. My current project requires generating and displaying a large report (over a few hundred pages). The structure of the document is quite simple but still involves row-column formatting with a few colors, fonts and lines. It also needs to be printable, which is quite a headache. The approach I am taking is to use a browser control plus HTML. One issue is that when the document gets big, the UI lags badly. Is there another way of doing this?

Parsing deleted PDFs

I'm trying to do some file carving on a disk with C++. I can't find any resources on the web about the on-disk structure of a PDF file. The thing is that I can find the %PDF-1.x token at the start of a cluster, but I can't find out the size of the PDF file anywhere.
Let's say hypothetically that the file system entry for this particular document is lost. I find the start of the document and I keep reading until I run into the "startxref number %%EOF". The thing is that I don't know when to stop since there are multiple "%%EOF" markers in the content of a document.
I've tried stopping after reading, let's say, 10 clusters without finding any PDF-specific keyword like "obj", "stream", "trailer" or "xref", but that's quite arbitrary and not a deterministic method of finding the end of the document so I can determine its size.
I've also seen some "Length number" markers at the start of some "obj"s but the number doesn't really fit most of the time.
Any ideas on what I can try next? Is there a way to determine the exact size of the entire document? I'm interested in recovering documents programmatically.
Since PDFs are "free format" (pretty much like text files, but less obvious to humans when it comes to "reading" the content), it's probably hard to piece them together if they aren't in order.
A stream does have a length, which tells you where the endstream goes (with a blank line before and after the stream itself). Streams are used to introduce bitmaps and similar things [fonts, line-art data in compressed form, etc.] into the document. But if you have several 4KB segments that could go in as the same block in the middle of a stream, then there's no way to tell which way they go, other than pasting them together and seeing which combinations look sane and which don't. Similarly, if there are several segments of streams and objects, you can't really tell which goes where.
Of course, this applies to almost all types of files with "variable content" - you can find the first few kilobytes of a JPG, but knowing what the REST of the file is won't be easy; only by visually inspecting the content can you determine which blocks of bytes belong where, and if you get it wrong, you'll probably just get some random garbage.
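For what it's worth, a naive sketch of the marker-scanning heuristic the question describes might look like the following. guessPdfEnd is a made-up name, and, as explained above, the result can only ever be a guess:

#include <algorithm>
#include <cstddef>
#include <string>

// 'data' is a run of raw bytes read from the disk image, starting at the
// cluster where a "%PDF-" header was found. We look for the LAST plausible
// "startxref ... %%EOF" trailer within a bounded window, because an
// incrementally updated PDF contains several %%EOF markers and stream objects
// may even contain those bytes by accident.
std::size_t guessPdfEnd(const std::string& data, std::size_t maxScan = 64u * 1024u * 1024u) {
    const std::size_t limit = std::min(data.size(), maxScan);
    std::size_t best = std::string::npos;
    std::size_t pos = 0;

    while ((pos = data.find("%%EOF", pos)) != std::string::npos && pos < limit) {
        // Only accept the marker if "startxref" appears shortly before it,
        // which is how a genuine trailer ends.
        const std::size_t windowStart = pos > 64 ? pos - 64 : 0;
        if (data.find("startxref", windowStart) < pos)
            best = pos + 5;                // byte just past "%%EOF"
        pos += 5;
    }
    return best;                           // npos if no plausible trailer was found
}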
The open source tool bulk_extractor has a module called scan_pdf that does pretty much what you are describing here. It can recognize the individual parts of a PDF file on a drive, automatically decompresses the compressed regions, and extracts text using two strategies. It will recover data from fragments of PDFs even if the xref table cannot be found.

XSL-FO with FOP: check if content overflows and call a different template?

I have a question about XSL-FO; the generator is FOP. What I want to do:
In the PDF I want to generate an item list, where each item is in a box with a specific width and height. If the content does not fit this box, it should be displayed in a bigger box (also with specific dimensions).
I do not see any way to achieve that in XSL-FO, especially with FOP.
Does anyone have an idea how to solve this?
Thanks for any ideas!
There are two separate, independent processing steps involved here:
Generation of XSL-FO markup (using a stylesheet and an XSLT processor).
Rendering of XSL-FO markup as PDF (using a FO processor, such as FOP).
The second step cannot influence the first. It is not possible to test for overflow conditions during rendering and somehow decide what template to invoke. There is no feedback loop. What you are asking for is not possible.
It is possible to do crude text fitting by estimating the length of text strings in XSLT. That is the idea behind "Saxon Extension for Guessing Composed Text String Length".
I have not used this extension, and it may not even be available anymore (the announcement about it is from 2004). In any case, this is very far from an actual layout feedback mechanism.

How does a large text file viewer work? How to build a large text reader

How does a large text file viewer work?
I'm assuming that:
Threading is used to handle the file
The TextBox is updated line by line
Effective memory handling is used
Are these assumptions correct? If someone were to develop their own, what are the musts and don'ts?
I'm looking to implement one using a DataGrid instead of a TextBox
I'm comfortable with C++ and Python. I'll probably use Qt/PyQt.
EDIT
The files I have are usually between 1.5 and 2 GB. I'm looking at editing and viewing these files.
I believe that the trick is not loading the entire file into memory, but using seek and such to just load the part which is viewed (possibly with a block before and after to handle a bit of scrolling). Perhaps even using memory-mapped buffers, though I have no experience with those.
Do realize that modifying a large file (fast) is different from just viewing it. You might need to copy the gigabytes of data surrounding the edit to a new file, which may be slow.
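A minimal sketch of that seek-based approach, using plain C++ iostreams (buildLineIndex and readWindow are invented names): one sequential pass records where each line starts, and afterwards only the visible window is ever read:

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// One sequential pass to record where every line starts. The index costs
// 8 bytes per line; if that is still too much it can itself be spilled to
// disk or sampled (every Nth line).
std::vector<std::streamoff> buildLineIndex(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::streamoff> starts;
    std::string line;
    while (true) {
        std::streamoff lineStart = in.tellg();
        if (!std::getline(in, line)) break;
        starts.push_back(lineStart);
    }
    return starts;
}

// Load only the lines currently visible (plus whatever margin you want for
// smooth scrolling); nothing else is ever held in memory.
std::vector<std::string> readWindow(const std::string& path,
                                    const std::vector<std::streamoff>& index,
                                    std::size_t firstLine, std::size_t count) {
    std::vector<std::string> out;
    if (firstLine >= index.size()) return out;
    std::ifstream in(path, std::ios::binary);
    in.seekg(index[firstLine]);
    std::string line;
    for (std::size_t i = 0; i < count && std::getline(in, line); ++i)
        out.push_back(line);
    return out;
}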
In Kernighan and Plauger's classic (antique?) book "Software Tools in Pascal" they cover the development and design choices of a version of ed(1) and note:
"A warning: edit is a big
program (excluding contributions from
translit, find, and change; at
950 lines, it is fifty percent bigger
than anything else in this book."
And they (literally) didn't even have string types to use. Since they note that the file to be edited may exist on tape which doesn't support arbitrary writes in the middle, they had to keep an index of line positions in memory and work with a scratch file to store changes, deletions and additions, merging the whole together upon a "save" command. They, like you, were concerned about memory constraining the size of their editable file.
The general structure of this approach is preserved in the GNU ed project, particularly in buffer.c
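A stripped-down sketch of that design (invented names; insertion and deletion of lines are omitted, though they would just be insert/erase on the in-memory table): the original file is never touched while editing, changed lines go to a scratch file, and the table records where each logical line currently lives:

#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Each logical line is either a reference into the original file or a
// reference into a scratch file that accumulates every changed line.
struct LineRef {
    bool inScratch;          // false: original file, true: scratch file
    std::streamoff offset;   // where the line's text starts in that file
};

class EditBuffer {
public:
    EditBuffer(std::string original, std::string scratch)
        : originalPath_(std::move(original)), scratchPath_(std::move(scratch)),
          scratch_(scratchPath_, std::ios::binary | std::ios::trunc) {
        std::ifstream in(originalPath_, std::ios::binary);
        std::string line;
        while (true) {
            std::streamoff pos = in.tellg();
            if (!std::getline(in, line)) break;
            lines_.push_back({false, pos});
        }
    }

    // Replace line 'i': append the new text to the scratch file and repoint the index.
    void replaceLine(std::size_t i, const std::string& text) {
        std::streamoff pos = scratch_.tellp();
        scratch_ << text << '\n';
        scratch_.flush();
        lines_[i] = {true, pos};
    }

    // Merge everything into 'outPath' (the expensive step, done only on save).
    void save(const std::string& outPath) {
        std::ifstream orig(originalPath_, std::ios::binary);
        std::ifstream scratch(scratchPath_, std::ios::binary);
        std::ofstream out(outPath, std::ios::binary | std::ios::trunc);
        std::string line;
        for (const LineRef& ref : lines_) {
            std::ifstream& src = ref.inScratch ? scratch : orig;
            src.clear();
            src.seekg(ref.offset);
            std::getline(src, line);
            out << line << '\n';
        }
    }

private:
    std::string originalPath_, scratchPath_;
    std::ofstream scratch_;
    std::vector<LineRef> lines_;
};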

Writing to the middle of the file (without overwriting data)

In Windows, is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible what approach/workaround is usually taken? Re-writing everything after the insertion point becomes prohibitive really quickly with big (ie, gigabytes) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files in several smaller ones.
I'm unaware of any way to do this if the interim result you need is a flat file that can be used by applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of the file, since it's really just a sequential file.
But that qualification is there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document; rather, they appended a delta record to the end of the file. Then, when re-reading the file, they applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be immediately usable by another application that doesn't understand the file format.
What I'm proposing here is not to store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as often as they want, but the time-expensive operation won't have as much of an impact.
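A hedged sketch of that quick-save idea, with an invented record layout (a side file of insertion records, not MS Word's actual format):

#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// Every quick save appends one insertion record to a small side file; the big
// base file is left untouched. Reopening the document means reading the base
// file and replaying the records in order; an occasional full save rewrites
// the base file and empties the side file.
struct Delta {
    std::uint64_t position;   // where the text is inserted in the document
    std::string   text;       // what was inserted
};

// Quick save: O(size of the edit), regardless of how big the base file is.
void appendDelta(const std::string& deltaPath, const Delta& d) {
    std::ofstream out(deltaPath, std::ios::binary | std::ios::app);
    std::uint64_t len = d.text.size();
    out.write(reinterpret_cast<const char*>(&d.position), sizeof d.position);  // fixed-width, native endianness
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(d.text.data(), static_cast<std::streamsize>(len));
}

// Replay: read the base document once, then apply the deltas in order.
std::string loadDocument(const std::string& basePath, const std::string& deltaPath) {
    std::ifstream base(basePath, std::ios::binary);
    std::string doc((std::istreambuf_iterator<char>(base)), std::istreambuf_iterator<char>());

    std::ifstream deltas(deltaPath, std::ios::binary);
    std::uint64_t position = 0, len = 0;
    while (deltas.read(reinterpret_cast<char*>(&position), sizeof position) &&
           deltas.read(reinterpret_cast<char*>(&len), sizeof len)) {
        std::string text(static_cast<std::size_t>(len), '\0');
        if (!deltas.read(&text[0], static_cast<std::streamsize>(len))) break;
        doc.insert(static_cast<std::size_t>(position), text);
    }
    return doc;
}

For a genuinely huge file a real editor would combine this with windowed reading (or a piece table) rather than pulling the whole base document into one string as this toy replay does.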
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiencies which would speed things up. You'd probably still have to rewrite to the end of the file, but it would be happening at a lower level in the OS.
If the primary reason you want a fast save is to let the user keep working (rather than to make the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. You would then need synchronisation between the two threads to prevent the user modifying data that has yet to be saved to disk.
The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation on a completely unprotected system like MS-DOS would have been difficult but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which adds further difficulty to the implementation. In theory it's still possible; in practice, it's now thoroughly insane to even contemplate, where it was once almost reasonable.
I'm not sure about the format of your file, but you could make it 'record' based.
Write your data in chunks and give each chunk an id.
The id could be the chunk's offset in the file.
At the start of the file you could have a header with a list of ids, so that you can read records in order.
At the end of the 'list of ids' you could point to another location in the file (an id/offset) that stores another list of ids.
Something similar to a filesystem.
To add new data you append it at the end and update the index (add its id to the list).
You have to figure out how to handle record deletion and updates.
If records are all the same size, then to delete one you can just mark it empty and reuse it next time, with appropriate updates to the index table.
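A small sketch of that record/index idea (RecordFile and its layout are invented, and persisting the index alongside the data file is left out): inserting in the middle becomes an append plus an index update, so no existing data is ever rewritten.

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

struct Record {
    std::uint64_t offset;   // where the record's bytes live in the data file
    std::uint32_t size;     // record length; 0 could mark a deleted record
};

class RecordFile {
public:
    explicit RecordFile(std::string dataPath)
        : dataPath_(std::move(dataPath)),
          data_(dataPath_, std::ios::binary | std::ios::app) {}

    // "Insert in the middle" = append the bytes, then splice a Record into the
    // index at the right logical position.
    void insert(std::size_t logicalPos, const std::string& bytes) {
        data_.seekp(0, std::ios::end);
        std::streamoff end = data_.tellp();          // current end of the data file
        data_.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
        data_.flush();
        Record r{static_cast<std::uint64_t>(end), static_cast<std::uint32_t>(bytes.size())};
        index_.insert(index_.begin() + static_cast<std::ptrdiff_t>(logicalPos), r);
    }

    // Deleting is just an index operation; the stale bytes stay in the file
    // until some later compaction pass.
    void erase(std::size_t logicalPos) {
        index_.erase(index_.begin() + static_cast<std::ptrdiff_t>(logicalPos));
    }

    // Read the document in logical order (e.g. to export a flat text file).
    std::string readAll() const {
        std::ifstream in(dataPath_, std::ios::binary);
        std::string out, buf;
        for (const Record& r : index_) {
            buf.assign(r.size, '\0');
            in.seekg(static_cast<std::streamoff>(r.offset));
            in.read(&buf[0], r.size);
            out += buf;
        }
        return out;
    }

private:
    std::string dataPath_;
    std::ofstream data_;
    std::vector<Record> index_;   // would be persisted alongside the data file
};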
Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.
If using .NET 4, try a memory-mapped file if you have an editor-like application - it might just be the ticket. Something like this (I didn't type it into VS so I'm not sure if I got the syntax right):
// (needs using System.IO; and using System.IO.MemoryMappedFiles;)
MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
    @"C:\bigfile.dat",
    FileMode.OpenOrCreate,
    "BigFileMemMapped",
    2L * 1024 * 1024 * 1024,              // capacity must cover the highest offset you will touch
    MemoryMappedFileAccess.ReadWrite);
MemoryMappedViewAccessor view = bigFile.CreateViewAccessor();
long offset = 1000000000;
ObjectType myObject = new ObjectType();   // ObjectType must be a value type (struct)
view.Write(offset, ref myObject);
I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (totalling under 16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternate data stream. Non-trivial, of course.
I know that this question is tagged "Windows", but I'll still add my $0.05 and say that on Linux it is possible to either insert or remove a lump of data to/from the middle of a file without leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
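A minimal sketch of the insert case (the collapse case is symmetric). This assumes a reasonably recent kernel and a filesystem that implements these flags (e.g. ext4 or XFS), and both offset and len must be multiples of the filesystem block size or the call fails with EINVAL. insert_block is just an illustrative name:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int insert_block(const char *path, off_t offset, const char *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return -1; }

    /* Make room in the middle of the file without copying the tail. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, (off_t)len) != 0) {
        perror("fallocate(FALLOC_FL_INSERT_RANGE)");
        close(fd);
        return -1;
    }

    /* The inserted range reads as zeros until we fill it. */
    if (pwrite(fd, data, len, offset) != (ssize_t)len) {
        perror("pwrite");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}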
Again, I know that this probably won't help the OP, but I personally landed here searching for a Linux-specific answer. (There is no "Windows" word in the question, so the web search engine saw no problem with sending me here.)