ICU Layout sample renders text differently than Microsoft Notepad and Word - c++

I have a bidirectional text
1002 -ابو ماجد الانصاري
Most editors notepad++, notepad etc. show the text as it is shown here. But when I get this text processed through ICU the number is shifted to the right then spaces and hyphen and then Arabic. ICU's sample application layout.exe also shows the number on the right.
I have modified paragraphlayout.cpp and set all possible reordering modes but result is still the same:
Can someone help to configure ICU to provide output as other display engines do.

If I understand correctly, your text 'begins' with the numeric, which is followed by the hyphen and text. Notepad and other editors let you choose the 'writing direction'. If you choose right-to-left, you get the same result as your screenshot,
If you want to maintain left-to-right writing direction, you can set it explicitly
ubidi_setPara(para, "1002 -ابو ماجد الانصاري", ‭25, UBIDI_LTR, NULL, pErrorCode);
or you can embed a UNICODE flag U+202A (LEFT-TO-RIGHT EMBEDDING) into your string that will enforce this direction. If your code is in C++, you can write something like
icu::UnicodeString string_to_layout = "\x202a";
string_to_layout += "1002 -ابو ماجد الانصاري";
and not you can use string_to_layout as input parameter for renderParagraph() (see http://icu-project.org/apiref/icu4c-latest/ubidi_8h.htm).

Related

Is it possible to make an index search by regex in PDF?

I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do in PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for regex I find nothing.
I can't think why you would expect Ghostscript to do search for you.
I'm not sure if you are hoping to get the data type 'heading, page number' etc from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There are such things as 'tagged PDF' which adds non-printing elements to a PDF file which do carry that kind of data around with them. This is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
(1) Tj
(.) Tj
(1) Tj
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly the character code in a PDF content stream need not be (and often is not) a Unicode code point. Or ASCII, or any other standard coding scheme, it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract text, and Ghostscript will do that. It even goes through the heirarchy of fallbacks listed above to try and find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked up text page (the default) which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. Or an 'XML-like' output which will tell you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regex expression(s) to search the files and find your matches.

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically but have run into problems with a special character strewn throughout the file. The file is way too large to scan it looking for specific characters. Most of the other unwanted special characters I've been able to get rid of using some regex pattern. But there is a box character, similar to "□". When I tried to copy the character from the actual text file and past it here I get "�", so the example of the box is from Windows character map which includes the code 'U+25A1', which I'm not sure how to interpret or if it's something I could use for a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. This is likely to suffice <meta charset=UTF-8> in the header of the page.
I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square is originally from, but when I post it into the query field in the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table) it gives the hex-character code for the "Replacement Character" which is the diamond with the question mark.
Using this code in a regex expression, \x{FFFD}, within Notepad++ search gave me all the squares, although recognizing them as the Replacement Character.

How to prevent OpenOffice/LibreOffice Calc from changing what you input (data, numbers,...)

Basically, I want LibreOffice Calc to do what I tell it, not what it wants.
For example:
when I input 1.1.12, I want to have 1.1.12 in that cell, not 01.01.2012 or whatever.
when I input 001, I want to have 001 in that cell, not 1
and so on and so forth
I want it to never ever touch my data until I explicitly tell it to. Is that possible at all?
I know I can set format of a cell to text. It doesn't help at all. Example:
Input 1.1.12, it gets displayed as 01.01.12, format as text, it becomes "40909", original input is lost
Format empty cells as text. Paste "000 001 002 ..." separated by line breaks. Displays "0 1 2 ..."
I know I can write ' in front of anything for it to be forced text. Again it doesn't help, because when I paste in text, I cannot have ' auto-appended to it.
I hope this is possible. I tried googling for different problems and never found a good answer.
If you want your input to be interpreted as text and preventing Calc to do fancy (and annoying) things with your input, you have to change the format before entering any value.
Select the cells/columns/rows.
Right-click 'Format Cells...'
Select the tab 'Numbers'
In the list 'Category', select 'Text' (the last option)
Select the format '#' (it is the only one in this category)
Click on 'Ok'
You may need to tweak the 'autocorrect' options as well. Go to 'Tools > Auotcorrect Options...'. Here is a link that may help: https://help.libreoffice.org/Calc/Deactivating_Automatic_Changes
I understand your problem with pasting pure unformatted text. This may be more work than you like (we can try to automate that later) but when I paste data from Notepad, I am prompted with an import screen as you can see below. Select the column header(s) and then select Column type: Text. This should solve your paste/import problem. An alternative is to handle this with an AutoHotKey script.
Oh b.t.w. the # is the format type for text, just like you have HH for 24 hour or ddd for weekdays...
When you are importing, you're given a bunch of options. Select "Quoted field as text" so any text inside quotes is treated as text which is interpreted by LibreOffice as sacred and they do not modify it in the way they they modify something that they identify as number
When you have your data in the clipboard click Edit -> Paste as... in main menu. In next window choose "Paste as text". All your data will be pasted as is.
I initially arrived at this page with a very similar (but not identical) problem. I am posting the solution here for the benefit of those who might be visiting with the same issue.
Every time I would save, close, and then re-open my .XSLX spreadsheet in OpenOffice, it would delete the spaces I had entered in between text. For example:
"Did not attend" would become "Didnotattend".
"John DOE" would become "JohnDOE", etc.
Specifying "text" (#) as the format (as recommended above) did not help me, unfortunately.
What ultimately did solve it was saving it as an .ODS file instead of .XSLX .
just simply put character ' before the text, '0.1.16 and calc will interprate it as text data
My issue was currency, properly formatted would change to a much larger number if the numbers entered could represent a date; such as 4.22 becoming $42,482. I discovered that adding a trailing zero solved the problem.
I had pasted numbers from another site and it kept coming up with dates. I just messed around and hit the arrow that's on the paste board to give me the option of unformatted text or HTML format. I selected unformatted, a window opened to show me the text I wanted so I pressed o.k.

Unicode character for superscript shows a square box: ࠚ

Using the following code to create a Unicode string:
wchar_t HELLO[20];
wsprintf(HELLO, TEXT("%c"), 0x2074);
When I display this onto a Win32 Control like a Text box or a button it gets mapped to a [] Square.
How do I fix this ?
I tried compiling with both Eclipse(MinGW) and Microsoft Visual C++ (2010).
Also, UNICODE is defined at the top
Edit:
I think it might be something to do with my system, because when I visit: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
some of the unicode characters don't appear.
The font you are using does not contain a glyph for that character. You will likely need to install some new fonts to overcome this deficiency.
The character you have picked out is 'SAMARITAN MODIFIER LETTER EPENTHETIC YUT' (U+081A). Perhaps you were after U+2074, i.e. 'SUPERSCRIPT FOUR' (U+2074). You need hex for that: 0x2074.
Note you changed the question to read 0x2074 but the original version read 2074. Either way, if you see a box that indicates your font is missing that glyph.
The characters you are getting from Wikipedia are expressed in hexadecimal, so your code should be:
wchar_t HELLO[20];
wsprintf(HELLO, TEXT("%c"), (wchar_t)0x2074); // or TEXT('\x2074')
If it still doesn't work, it's a font problem; if you need a pan-Unicode font, it seems that Code2000 is one of the most complete out there.
Funny fact: the character that has the decimal code 2074 (i.e. hex 81a) seems to actually be a box (or it's such a strange beast that even the image outline at FileFormat.Info is wrong). :)
For the curious ones: it turns out that 0x081a is this thing:

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for utf-8 mode in regexes,
You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.