I'm writing an application which generates vCards from an internal customer database and would like to include additional information in the card, e.g. internal customer number. Reading the RFC, I've see "vendor-specific" extensions, which are to be registered at IANA etc. Are these extensions the way to go? What should I keep in mind while using them, any pitfalls? Are there any alternative ways to define a custom field in a vCard?
RFC 2426 section 4 specifies how to create non-standard names. If you read the production rules, you'll get the exact way properties can be defined. For example, to define a customer number, you could say simply X-CUSTOMER-ID:123 on a separate line.
Things to beware:
Escaping characters like comma, semicolon, colon and newline. This is well defined in the syntax rules.
Which things are case sensitive and which are not. Ditto.
Encoding and line endings in the resulting file. Just to use UTF-8, line endings \r\n (Windows style) and an extra empty line at the end of the file.
Related
I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do in PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for regex I find nothing.
I can't think why you would expect Ghostscript to do search for you.
I'm not sure if you are hoping to get the data type 'heading, page number' etc from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There are such things as 'tagged PDF' which adds non-printing elements to a PDF file which do carry that kind of data around with them. This is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
(1) Tj
(.) Tj
(1) Tj
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly the character code in a PDF content stream need not be (and often is not) a Unicode code point. Or ASCII, or any other standard coding scheme, it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract text, and Ghostscript will do that. It even goes through the heirarchy of fallbacks listed above to try and find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked up text page (the default) which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. Or an 'XML-like' output which will tell you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regex expression(s) to search the files and find your matches.
I'm writing a long HOWTO in reStructuredText format and wondering if there's a way to let user specify values for a couple variables (hostname, ip address) at the top so the rest of the document would be filled with those automatically?
Like me, you are probably looking for substitution. At the bottom of the section you'll find how to replace text.
Substitution Definitions
Doctree element: substitution_definition.
Substitution definitions are indicated by an explicit markup start
(".. ") followed by a vertical bar, the substitution text, another
vertical bar, whitespace, and the definition block. Substitution text
may not begin or end with whitespace. A substitution definition block
contains an embedded inline-compatible directive (without the leading
".. "), such as "image" or "replace".
Specifically about text replacement:
Replacement text
The substitution mechanism may be used for simple macro substitution. This may be appropriate when the replacement text is
repeated many times throughout one or more documents, especially if it
may need to change later. A short example is unavoidably contrived:
|RST|_ is a little annoying to type over and over, especially
when writing about |RST| itself, and spelling out the
bicapitalized word |RST| every time isn't really necessary for
|RST| source readability.
.. |RST| replace:: reStructuredText
.. _RST: http://docutils.sourceforge.net/rst.html
reStructuredText is a markup language to define static content. HTML content (I assumed the desired output format is HTML) is typically generated from reStructuredText on build time and then released/shipped to the user.
To allow users to specify variables, you would need a solution on top of reStructuredText, for example:
Ship the content with a JavaScript plugin that dynamically replaces specific strings in the HTML document with user input.
Generate the documentation on-the-fly after the user has specified the variables.
Note that these examples are not necessarily particularly viable solutions.
Using windows xp, i want to read a value from .ini file.
The value is a path.
Using QSettings, the result of the call to "settings.value("key").toString()" is the the path excluding backslashes, because backslash is escape character.
What is the way to read a path from ini file, using QSettings?
Although backslash is a special character in INI files, most Windows applications don't escape backslashes () in file paths [...]
QSettings always treats backslash as a special character and provides no API for reading or writing such entries.
This is what the documentation has to say about it. It is a polite way of saying "if some other code does it, they're not following the WINAPI spec and it's broken and we shouldn't have to deal with it". Pretty much your .ini files are broken.
If you wish to read them, you may need to provide your own backend for QSettings. Such a backend can be easily obtained by copying the one that comes as part of Qt, and modifying it not to perform escaping.
You'd need to investigate whether writing your own QTextCodec for this purpose and passing it to QSettings::setIniCodec would be sufficient. If sufficient, you wouldn't need to provide an entire backend.
To minimize compatibility issues, any # that doesn't appear at the first position in the value or that isn't followed by a Qt type (Point, Rect, Size, etc.) is treated as a normal character.
Although backslash is a special character in INI files, most Windows applications don't escape backslashes () in file paths
enter link description here
The RFC822/RFC2822 standard says that "Header fields are lines composed of a field name, followed by a colon (':'), followed by a field body, and terminated by CRLF".
But I see at least one RFC822 MIME parser that auto-normalizes payloads that use LF ("\n") into CRLF ("\r\n") before proceeding with parsing.
How safe is it to use an RFC822 format for serializing data that may have been hand-edited in places to use LF instead of CRLF? Would it be safe to send this data around to different programs & expect them to be able to parse it with various RFC822 parser libraries?
In the general case, not safe at all. Be conservative in what you send / generate.
Having said that, most Unix tools expect locally stored email files to use local line ending conventions. RFC5322 really only codifies the format used on the wire.
I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for utf-8 mode in regexes,
You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.