Using General Unicode Properties - regex

I am trying to take advantage of the regex functionality \p{UNICODE PROPERTY NAME}.
However, I am struggling to understand the mapping of those property names.
I went directly to the Unicode.org website (http://www.unicode.org/Public/UCD/latest/ucd/) and downloaded the file 'UnicodeData.txt', which has the category listed... but this only shows 27,268 character values.
But I understand there are 65k characters in UCS-2 alone... so I am confused why the Unicode.org download only has ~27k rows.
Am I missing a point here somewhere?
I am sure I'm just being blind to something simple here... if someone can help me understand, I'd be grateful!

Everything is fine so far. The characters you see are all but the CJK ones (Chinese-Japanese-Korean). The Unicode Consortium leaves those out of the main UnicodeData file (they appear only as compressed First/Last range markers) to keep it at a reasonable size.
If you want to look up properties for single characters only (and not in bulk), you can use websites that prepare that data for you, like Graphemica, FileFormat.info or (my own) Codepoints.net.
If, however, you need bulk lookups, Unicode also provides the data as an XML file with a syntax that groups codepoints together. That might be the best choice for processing the data.
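For example, in Python the general categories from UnicodeData.txt are exposed both to \p{...} patterns (via the third-party regex module; the built-in re module does not support \p{...}) and through the standard unicodedata module. A minimal sketch, with an invented sample string:

# Sketch: \p{...} via the third-party "regex" module (pip install regex),
# and the same general-category codes via the standard library.
import unicodedata

import regex

text = "Straße costs 42€"

print(regex.findall(r"\p{Lu}", text))    # ['S']  (uppercase letters)
print(regex.findall(r"\p{Sc}", text))    # ['€']  (currency symbols)

# Field 2 of each UnicodeData.txt row is exactly this category code:
for ch in "Aß€":
    print(ch, unicodedata.category(ch))  # Lu, Ll, Sc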

Related

Can one disable TextDelimiter within Schema.ini for ADO in MFC?

I have to deal with importing CSV files. The legacy solution uses MFC and the ADO text driver to manage this. I know that you can specify the TextDelimiter option within the corresponding Schema.ini file.
The problem is that for some input files it is impossible to specify a character that isn't used within the file!
All our files are CP1252 encoded - we cannot deal with other encodings, so a "☃" (SNOWMAN, U+2603) or anything like that is no solution.
If I omit the character, ADO seems to fall back to the default character (double quotes):
[Import.txt]
ColNameHeader=False
Format=Delimited(;)
TextDelimiter= // ← omitting the character doesn't work!
col1=...
I also cannot define a sequence of characters, which would reduce the risk of mismatches to an acceptable level:
[Import.txt]
ColNameHeader=False
Format=Delimited(;)
TextDelimiter=##+# // produces an error when opening the ADO connection!
So my question is: Is it possible to completely disable this feature? I just do not want any automatic text delimiting!
The code is implemented in C++ on top of MFC and ADO - so ADO.NET solutions won't help me.
This should do it:
TextDelimiter=none
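Applied to the Schema.ini from the question, that gives (col1 left as the placeholder used above):
[Import.txt]
ColNameHeader=False
Format=Delimited(;)
TextDelimiter=none
col1=...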

Arabic: 'source' Unicode to final display Unicode

simple question:
this is the final display string I am looking for
لعبة ديدة
now below is each of the separate characters, before being 'glued' together (so I've put a space between each of them to stop the joining)
ل ع ب ة د ي د ة
Note how they are NOT the same characters; there is some magical transform that melds them together and converts them to new Unicode characters.
And in the string above, the characters actually appear right to left (in memory, they are stored left to right).
So my simple question is this: where do I get a platform-independent C/C++ function that will take my source 16-bit Unicode string and do the transform on it, producing the Unicode string that renders as the one first quoted above? Doing the RTL conversion, and the joining?
That's all I want, one function that does that.
UPDATE:
OK, yes, I know that the 'characters' are the same in the two examples above - they are the same 'letters' - but (viewing in Chrome, or the latest IE) anyone can CLEARLY see that the glyphs are different. Now I'm fairly confident that this transform can be done at the Unicode level, because my font file, and the Unicode standard, seem to specify different glyphs for the separate and the various joined versions of the characters/letters (unicode.org/charts/PDF/UFB50.pdf, unicode.org/charts/PDF/UFE70.pdf).
So, can I just put my Unicode into a function and get the transformed Unicode out?
The joining and RTL conversion don't happen at the level of Unicode characters.
In other words: the order of the characters and the actual Unicode codepoints are not changed during this process.
In fact, the merging and the handling of RTL/LTR transitions are handled by the text rendering engine.
This quote from the Wikipedia article on the Arabic alphabet explains it quite nicely:
Finally, the Unicode encoding of Arabic is in logical order, that is, the characters are entered, and stored in computer memory, in the order that they are written and pronounced without worrying about the direction in which they will be displayed on paper or on the screen. Again, it is left to the rendering engine to present the characters in the correct direction, using Unicode's bi-directional text features. In this regard, if the Arabic words on this page are written left to right, it is an indication that the Unicode rendering engine used to display them is out-of-date.
The processing you're looking for is called ligature substitution. Unlike many Latin-based languages, where you can simply put one character after another to render the text, ligatures are fundamental in Arabic. The substitution is done in the text rendering engine, and the ligature information is generally stored in font files.
note how they are NOT the same characters
They are the same for an Arabic reader. It is still readable.
There is no transform to do on your UTF-16 source text. You must provide the whole string to your text renderer. In C/C++, and as you are going the platform-independent way, you can use Pango for rendering.
Note: perhaps you meant to write لعبة جديدة (i.e. 'new game')? What you give as an example has no meaning in Arabic.
I realise this is an old question, but what you're looking for is FriBidi, the GNU implementation of the Unicode bidirectional algorithm.
This library does the glyph selection that was asked about in the question, as well as handling bidirectional text (a mixture of right-to-left and left-to-right text).
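For illustration (not FriBidi's own C API), the same two steps are easy to try in Python with the third-party arabic-reshaper and python-bidi packages - contextual shaping into presentation forms, then bidi reordering for display:

# Sketch using two third-party packages
# (pip install arabic-reshaper python-bidi).
import arabic_reshaper
from bidi.algorithm import get_display

logical = "لعبة جديدة"                      # stored in logical (memory) order

shaped = arabic_reshaper.reshape(logical)   # choose joined presentation forms
visual = get_display(shaped)                # reorder for display

print(visual)  # a glyph-level string that a naive LTR renderer can draw as-is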
What you are looking for is an Arabic script synthesis algorithm. I'm not aware of one that exists as open source. If you arrive at one, please post.
Some points:
At the storage level, there is no Unicode transform. There is an abstract representation of the string as pointed out by other answers.
At the rendering level, you could choose to use Unicode Presentation Forms, but you could also choose to use other forms. Unicode Presentation Forms are not a standard for what presentation output encoding should be - rather they are just one example of presentation codes that can be output by the rendering engine using script synthesis.
To make it clearer: there wouldn't be a single standard transform (i.e. synthesis algorithm) from A to B, where A is the standard Unicode Arabic block and B is the Unicode Arabic Presentation Forms. Rather, there can be different transformations that vary in complexity and use different encoding systems for B; the Unicode Presentation Forms are just one encoding that can be used for B.
For example, a simple typewriter style would require a simple rendering algorithm that would not need Presentation Forms. Indeed, there do exist modern writing styles (not in common usage, though) where A and B are actually identical, and only a different font page is used to do the rendering. On the other hand, the transform to render typesetting or traditional calligraphic forms would be more complex and require something similar to the Unicode Presentation Forms.
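A quick way to see the relationship between the base block and the Presentation Forms is Python's standard unicodedata module, which exposes the compatibility decompositions from the standard. A sketch using the four forms of LAM:

# Sketch: each Presentation Form decomposes back to the abstract
# letter it renders (here LAM, U+0644).
import unicodedata

for cp in (0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0):   # isolated/final/initial/medial
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}")
    print("  decomposition:", unicodedata.decomposition(ch))

# NFKC normalization folds a presentation form back to the base letter:
assert unicodedata.normalize("NFKC", "\uFEDF") == "\u0644"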
Here are a couple of pointers for more information on the topic:
http://unicode.org/faq/ligature_digraph.html#Pf1
http://www.decotype.com/publications/unicode-tutorial.pdf
Please see http://www.fileformat.info/info/unicode/block/arabic_presentation_forms_b/list.htm and have a look at this repo: https://github.com/Accorpa/Arabic-Converter-From-and-To-Arabic-Presentation-Forms-B

C++ Logger - Should I use an ordinary XML parser?

I'm working on a logging system for my 2D engine, and I'm confused about how I should go about creating/editing the file, and how I should output that file.
I've learned that XML is more of a data carrier than a data displayer like HTML is. I've read that I can use XML-to-HTML converters. One method I've thought about is writing characters to a file in HTML.
Clarity on these matters is what I ask of you, Stack Overflow.
Creating an XML (or HTML) file doesn't need any special library. Straightforward string concatenation is usually good enough; you may just have to encode some special characters (e.g. > into &gt;).
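A sketch of that escaping step in Python (xml.sax.saxutils.escape is in the standard library; the log message is invented):

# Sketch: escape &, < and > before writing text into a markup log.
from xml.sax.saxutils import escape

message = 'frame time > 16ms for sprite "player"'
print("<entry>" + escape(message) + "</entry>")
# -> <entry>frame time &gt; 16ms for sprite "player"</entry>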
But as Owen says, plain text is a lot more common for log files. One reasonable compromise is comma-separated values in a text file; this gives you a little structure without much overhead. For example, the Windows web server (IIS) uses this format by default, and if you have fields that are output on every line, such as a timestamp or the source filename and line number, it is easy to separate those out again.
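For illustration, a sketch of such a log line in Python (the field layout is invented; the csv module takes care of quoting when a field contains the separator):

# Sketch: one CSV log record per line.
import csv, sys, time

writer = csv.writer(sys.stdout)
writer.writerow([time.strftime("%Y-%m-%d %H:%M:%S"),
                 "WARN", "renderer.cpp", 218,
                 "texture not found, using fallback"])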
Just about every log I've ever worked with has been pure text delimited by newlines. If you're going to depart from that, you may want to ask yourself what it is about your logging needs that you want to accomplish with markup.
If you must go the way of markup, I would suggest an XML format that contains a minimal set of markup that would be useful in your situation. You could use XML to capture structure in your log entries (timestamp, severity, and operational code, for example) that would be inconvenient to code for in HTML.
Note that you could also go hybrid and embed some XHTML tags in an XML element whose purpose is to capture displayable text, if you want.
The problem with XML or HTML files is that you cannot append at any time: you have to close the final (document) tag properly at the end of writing.
Therefore, it's not a popular format for logging.
For logging, I suggest using one of the existing log engines, such as the Apache logger or John Torjo's Boost.Log candidate. They will support log levels, runtime configuration, etc.
If you are considering writing logs to XML files, please stop.
Log files should be simple plain-text files; XML-izing them introduces needless complexity. They are not structured data - they are meant to be read by people, not by automated tools.
It all starts with XML logs, and then it goes downhill from there.

File system regular expression search tool

What is the best tool to make complex (multi-line) regular expression file contents searches with good reporting capabilities?
I need to make a report over large Java/JSP code base and I have to make some charts afterward.
Eclipse is rather good at searches, but it does not provide a good report of what is found. It just shows a tree of files, but I would like to see a table with columns corresponding to the full match, each group, file name, file path, file date, maybe some version control information, etc. Then I can transfer this table to Excel and make the graphs I want.
Is there some generic file system search tool that has such capabilities? Or maybe there is some Eclipse plugin that can give better reports (note that I'm stuck on Eclipse 3.1.2)?
Agent Ransack, TextPad, and UltraEdit allow you to perform regular expression searches against the file system. My favorite is Agent Ransack as you can specify regular expressions for the file names and for the content.
PowerGREP (on Windows) can be used to do (most of) that. You can define the format of your search results quite freely. I haven't tried yet to also add file meta information to the search results, but that should work. Not sure if you can add version control information (where would that come from?) - perhaps if you could be a bit more specific, I could check.
Other than that, why not write a small Python/Ruby/Perl script like JasonTrue suggested?
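A sketch of what such a script could look like in Python (the pattern and file extensions are just examples; re.DOTALL lets matches span lines, and the CSV columns mirror the ones asked for above):

# Sketch: walk a source tree, run a multi-line regex over each file,
# and emit one CSV row per match (full match, group, path, mtime).
import csv, os, re, sys, time

pattern = re.compile(r"<%=\s*(.*?)\s*%>", re.DOTALL)   # e.g. JSP expressions

writer = csv.writer(sys.stdout)
writer.writerow(["match", "group1", "file", "modified"])
for root, _, files in os.walk("src"):
    for name in files:
        if not name.endswith((".java", ".jsp")):
            continue
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        mtime = time.strftime("%Y-%m-%d",
                              time.localtime(os.path.getmtime(path)))
        for m in pattern.finditer(text):
            writer.writerow([m.group(0), m.group(1), path, mtime])

The resulting CSV opens directly in Excel, which covers the charting step.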
For searches over code bases with queries that understand the language structure, look at the SD Search Engine. This tool indexes large source bases to provide very fast query response.
Queries are stated in terms of language elements (identifiers, operators, strings, ...) with constraints over the language elements (including wildcards and regexps on identifiers, strings and comments, as well as range constraints on numbers). Language whitespace and line breaks (and comments, unless you insist) are ignored.
If you want to do a plain regexp search on file character content, you can do that too, but you don't get the speed advantage of the index; it runs more like regular grep.
The interactive query result is shown in a hit window with the other hits; by clicking, you can go to a window containing the full source code of a hit.
In logging mode, all hits found are written to a log file with N lines of context, where you configure N. That's probably the report you want.
um... grep -r ?
Or ruby/perl/python, if you want to have more control over the final output; it sounds like what you're after would only be a few lines.

How can I get Django to output bad characters instead of returning an error

I've got some weird characters in my database which seem to mess up Django when returning a page. I get this error:
TemplateSyntaxError at /search/legacy/
Caught an exception while rendering: Could not decode to UTF-8 column 'maker' with text 'i� G�r'
(the actual text is slightly different, but since it is a company name I've changed it)
How can I get Django to output this text? I'm currently running the site from SQLite (fast dev); is this the issue?
Also, on a completely unrelated note, is it possible to use a database view?
Thanks
Probably not.
Django uses Unicode strings internally, and it seems that your database returns an invalid byte string. You should fix the data in the database and use UTF-8 exclusively throughout your application (data import, database, templates, source files, ...).
I have a related problem with a site owner who uses Apple's Pages for article creation, then does a copy-paste into a Django admin textbox. This process creates 'funny characters' that screw up Django and/or MySQL (you wouldn't believe the number of different double left/right quote characters there are). I can't 'fix' the customer, so I have a function that looks for known strangeness and translates it into something useful first. A complete PITA.
That's a bit of a confusing error message, and without knowing more details I'm not clear what the source of the problem is (the error message phrasing "decode to UTF-8" seems wrong, as normally you would encode to UTF-8). Perhaps Django is expecting to find data in some other encoding and is trying to decode it and re-encode as UTF-8, but is choking on some characters that aren't valid for the encoding it's expecting?
In general, you want to make sure that you're storing UTF-8 in your database, and that internally you're using unicode objects (not str objects) everywhere in your code.
Some other reading that may be helpful:
Unicode in the real world
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Django Tips: UTF-8, ASCII Encoding Errors, Urllib2, and MySQL
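If you just need the page to render while you clean the data up, the usual workaround is to decode defensively before handing the text to the template. A Python sketch (the byte string and the 'latin-1' guess are invented; substitute whatever legacy encoding your data is really in):

# Sketch: decode bytes defensively so bad data renders instead of raising.
raw = b"i\xe9 G\xe9r"            # e.g. latin-1 bytes stored in the column

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # Either guess the legacy encoding...
    text = raw.decode("latin-1")
    # ...or carry on with replacement characters:
    # text = raw.decode("utf-8", errors="replace")

print(text)   # 'ié Gér'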