Convert hex to utf8 in greenplum in regexp_replace - regex

I have strings in a table that contain hex values such as \ffffffc4. An example is the following:
Urz\ffffffc4\ffffff85dzenie zgodne ze standardem High Definition Audio
The following code can convert the hex into UTF8:
select chr(x'c4'::int)
which returns Ä but when I try to use a regexp_replace I get into problems. I have tried the following:
select regexp_replace(sal_input, E'\\f{6}(..)',convert(E'\\1','xyz','UTF8'),'g')
where XYZ are the various source encodings offered in 8.2 but all I get back is the hex value.
Any idea on how I could use the chr function inside regexp_replace?
Version used: PostgreSQL 8.2.15 (Greenplum Database 4.1.1.1 build 1) on x86_64-unknown-linux-gnu
Thanks in advance for the help

You are misunderstanding the order of evaluation. The 2nd argument to regexp_replace isn't a callback invoked for every substitution of '\1'.
What happens is that your convert call is evaluated first, on the literal value \1, and that result is passed to regexp_replace.
In any case, the SQL doesn't even evaluate on a modern PostgreSQL because of stricter casting rules, as '\1' isn't a valid bytea literal.
In a less ancient Pg version it might be possible to do something with regexp_split_to_table, chr and string_agg. In 8.2, I think you're going to be using a PL. I'd load PL/Perl and write a simple Perl function to do it. It's likely possible to implement in PL/PgSQL, but I suspect any implementation with the functionality available in 8.2 will be verbose and slow. I'd love to be proved wrong.
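To make the "callback per match" idea concrete, here is a minimal sketch of the logic a procedural function would need. It is written in Python rather than PL/Perl purely for illustration, the function name is hypothetical, and it assumes each \ffffffXX escape is one sign-extended byte of a UTF-8 sequence (so c4 and 85 together form a single character).

```python
import re

# One or more adjacent escapes such as \ffffffc4\ffffff85 (assumed format).
_ESCAPED_RUN = re.compile(r"(?:\\ffffff[0-9a-fA-F]{2})+")
_ESCAPED_BYTE = re.compile(r"\\ffffff([0-9a-fA-F]{2})")


def decode_sign_extended_hex(text):
    """Replace runs of \\ffffffXX escapes with the UTF-8 text they encode."""

    def _decode_run(match):
        # Collect the low byte of every escape in the run, then decode the
        # whole run at once so multi-byte characters are reassembled.
        hex_pairs = _ESCAPED_BYTE.findall(match.group(0))
        raw = bytes(int(pair, 16) for pair in hex_pairs)
        return raw.decode("utf-8", errors="replace")

    return _ESCAPED_RUN.sub(_decode_run, text)
```

Decoding a whole run of adjacent escapes together matters here: converting each byte separately, as chr(x'c4'::int) does, would turn c4 into Ä instead of combining c4 85 into one character.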

Trying to write an SQL query with regexp_matches() look behind positive in postgresql

From a PostgreSQL database, I'm trying to match 6 or more digits that come after a string that looks like "(OCoLC)" and I thought I had a working regular expression that would fit that description:
(?<=\(ocolc\))[0-9]{6,}
Here are some strings that it should return the digits for:
|a(OCoLC)08507541 will return 08507541
|a(OCoLC)174097142 will return 174097142
etc...
This seems to work to match strings when I test it on regex101.com, but when I incorporate it into my query:
SELECT
regexp_matches(v.field_content, '(?<=\(ocolc\))[0-9]{6,}', 'gi')
FROM
varfield as v
LIMIT
1;
I get this message:
ERROR: invalid regular expression: quantifier operand invalid
I'm not sure why it doesn't seem to like that expression.
UPDATE
I ended up just resorting to using a case statement, as that seemed to be the best way to work around this...
SELECT
CASE
WHEN v.field_content ~* '\(ocolc\)[0-9]{6,}'
THEN (regexp_matches(v.field_content, '[0-9]{6,}', 'gi'))[1]
ELSE v.field_content
END
FROM
varfield as v
As electricjelly noted, I'm really after just the numeric characters, but they have to be preceded by the "(OCoLC)" string, or they're not what I'm after. This is part of a larger query, so I'm running a second case statement to set a boolean flag in cases where the start of the string wasn't "(OCoLC)". This seems to be more helpful anyway, as I'm probably going to want to preserve those other values somehow.
After looking over your question, it seems your error is caused by a syntax problem rather than by the function not being available on your version of PostgreSQL, as I tested it on 9.6 and received the same error.
However, what you seem to want is to pull the numbers from a given field as in
|a(OCoLC)08507541 becomes 08507541
An easy way to accomplish this would be to use regexp_replace.
The function call would be:
regexp_replace(table.field, '\D', '', 'g')
The \D in the pattern matches every non-digit character, each match is replaced with nothing (hence the ''), and everything else is returned.
After some more searching, it looks like lookbehind constraints are only a feature of PostgreSQL server versions >= 9.6
https://www.postgresql.org/docs/9.6/static/functions-matching.html#POSIX-CONSTRAINTS-TABLE
The version I am running is version 9.4.6
https://www.postgresql.org/message-id/E1ZsIsY-0006z6-6T#gemulon.postgresql.org
So, the answer is it's not available for this version of PostgreSQL, but presumably this would work just fine in the latest version of the server.
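On versions without lookbehind, a capturing group can do the same job: match the "(OCoLC)" prefix normally and capture only the digits. The sketch below shows the idea in Python (the function name is hypothetical). Since regexp_matches returns the captured groups when the pattern contains parentheses, the same pattern should carry over to the query, though that is untested here.

```python
import re

# The prefix is matched as ordinary text; only the digits are captured.
OCLC_NUMBER = re.compile(r"\(ocolc\)([0-9]{6,})", re.IGNORECASE)


def extract_oclc_number(field_content):
    """Return the digits following "(OCoLC)", or None if there is no match."""
    match = OCLC_NUMBER.search(field_content)
    return match.group(1) if match else None
```

Because the prefix is part of the match, digits that are not preceded by "(OCoLC)" are ignored rather than returned.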

Untranslatable character when extracting dates from strings

I am attempting to extract dates from a free-text field (because our process is awesome like that :\ ) and keep hitting Teradata error 6706. The regex I'm using is: REGEXP_SUBSTR(original_field,'(\d{2})\/(\d{2})\/(\d{4})',1) AS new_field. I'm unsure of the field's type; HELP TABLE shows a blank in the Type column for the field.
I've already tried converting using TRANSLATE(col USING LATIN_TO_UNICODE), as well as UNICODE_TO_LATIN; both of those actually cause the error by themselves. A straight CAST(original_field AS VARCHAR(255)) doesn't fix the issue, though that cast does work. I've also tried stripping various special characters (new-line, carriage return, etc.) from the field before letting the REGEXP_SUBSTR take a crack at it, both by itself and with the CAST & TRANSLATEs I already mentioned.
At this point I'm not sure what the issue could be, and could use some guidance on additional options to try.
The final version that worked ended up being
, CASE
WHEN TRANSLATE_CHK(field USING LATIN_TO_UNICODE) = 0 THEN
REGEXP_SUBSTR(TRANSLATE(field USING LATIN_TO_UNICODE),'(\d{2})\/(\d{2})\/(\d{4})',1)
ELSE NULL
END AS Ref_Date
For whatever reason, using a TRIM inside the TRANSLATE seems to cause an issue. Only once I stripped any and all functions from inside the TRANSLATE did the TRANSLATE, and thus the REGEXP_SUBSTR, work.

display a latex string with c++

I'm looking for a function that displays a LaTeX or a MathML string in a windows GUI app.
For example given: char* myLaTeX = "\\dfrac{5}{3}";
the function I'm looking for can display the formatted fraction in my window, in the logical coordinates I set.
Is there a way to do so just using the DrawText() or TextOut()?
I'm a Smalltalk programmer so let me tell you how I've worked this out:
1. Use the EM_GETOLEINTERFACE message to get an IRichEditOle interface
2. Use this interface to QueryInterface ITextDocument2
3. Use GetSelection and then SetText to output '5/3' (the String)
4. Use Range and Select to select all (i.e., '5/3')
5. Use BuildUpMath with argument 0 to produce the math notation
For general expressions replace step 3 with a printing visitor on the expression's parse tree.
Note that steps 3 and 5 are not intended for TeX but for the Unicode Nearly Plain-Text Encoding of Mathematics, which is a derived format. The reason to use this format is that, at least in my experience, only fairly simple TeX expressions got correctly rendered. Of course, it would be worth giving it a try. In such case, use the TeX format (as far as I know LaTeX is not supported, so in the example \dfrac{5}{3} should be written as {5 \over 3}) and the tomTeX constant (=1) instead of 0 as the argument of BuildUpMath.
Here is a TeX example (the rendered image is not reproduced here), which I produced from the expression:
$\int_{-\infty}^\pi {x_0\over {\sqrt{y_0^{t^2} + 1}} + {5\over 3}}\; dt$
Another thing to keep in mind when using this feature is that it requires RichEdit version 6+, which comes with recent versions of Office.
Finally, after some experimentation I realized that only two modules are needed for this to work: RICHED20.dll and MSPTLS.DLL, the first one not to be confused with the dll that comes with Windows. Look for them in
%ProgramFiles%\Microsoft Office\root\VFS\ProgramFilesCommonX86\Microsoft Shared\OFFICE16
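The rewrite from \dfrac{5}{3} to {5 \over 3} mentioned above can be automated for simple cases. The following is a minimal sketch, assuming fraction arguments that contain no nested braces. The function name is hypothetical and the code does not touch the RichEdit API at all, it only prepares the string.

```python
import re

# Matches \frac, \dfrac or \tfrac whose two arguments contain no braces.
_SIMPLE_FRAC = re.compile(r"\\[dt]?frac\{([^{}]*)\}\{([^{}]*)\}")


def frac_to_over(latex):
    """Rewrite simple LaTeX fractions in plain TeX {a \\over b} form."""
    return _SIMPLE_FRAC.sub(
        lambda m: "{" + m.group(1) + r" \over " + m.group(2) + "}",
        latex,
    )
```

Fractions with nested braces in either argument are left unchanged, so a real converter would need a proper parser, as the answer suggests for general expressions.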

camelCase to underscore in vi(m)

If for some reason I want to selectively convert camelCase named things to being underscore separated in vim, how could I go about doing so?
Currently I've found that I can do a search /s[a-z][A-Z] and record a macro to add an underscore and convert to lower case, but I'm curious as to if I can do it with something like :
%s/([a-z])([A-Z])/\1\u\2/gc
Thanks in advance!
EDIT: I figured out the answer for camelCase (which is what I really needed), but can someone else answer how to change CamelCase to camel_case?
You might want to try out the Abolish plugin by Tim Pope. It provides a few shortcuts to coerce from one style to another. For example, starting with:
MixedCase
Typing crc [mnemonic: CoeRce to Camelcase] would give you:
mixedCase
Typing crs [mnemonic: CoeRce to Snake_case] would give you:
mixed_case
And typing crm [mnemonic: CoeRce to MixedCase] would take you back to:
MixedCase
If you also install repeat.vim, then you can repeat the coercion commands by pressing the dot key.
This is a bit long, but seems to do the job:
:%s/\<\u\|\l\u/\= join(split(tolower(submatch(0)), '\zs'), '_')/gc
I suppose I should have just kept trying for about 5 more minutes. Well... if anyone is curious:
%s/\(\l\)\(\u\)/\1\_\l\2/gc does the trick.
Actually, I realized this works for camelCase, but not CamelCase, which could also be useful for someone.
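For comparison, the same substitution idea can be sketched outside Vim. This Python version (the function name is hypothetical) inserts the underscore first and lowercases afterwards, which is why it handles both camelCase and CamelCase.

```python
import re


def camel_to_snake(name):
    """Convert camelCase or CamelCase to snake_case."""
    # Put "_" between a lowercase letter or digit and the uppercase letter
    # that follows it, then lowercase the whole identifier.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()
```

A leading capital is never preceded by a lowercase letter, so it is simply lowercased and no stray underscore is added at the start.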
I whipped up a plugin that does this.
https://github.com/chiedojohn/vim-case-convert
To convert the case, select a block of text in visual mode and then enter one of the following (self-explanatory):
:CamelToHyphen
:CamelToSnake
:HyphenToCamel
:HyphenToSnake
:SnakeToCamel
:SnakeToHyphen
To convert all occurrences in your document, run one of the following commands:
:CamelToHyphenAll
:CamelToSnakeAll
:HyphenToCamelAll
:HyphenToSnakeAll
:SnakeToCamelAll
:SnakeToHyphenAll
Add a bang (e.g. :CamelToHyphen!) to any of the above commands to bypass the prompts before each conversion.
You may not want to do that though, as the plugin wouldn't know the difference between variables and other text in your file.
For the CamelCase case: %s#\(\<\u\|\l\)\(\l\+\)\(\u\)#\l\1\2_\l\3#gc
Tip: the regex delimiters can be altered as in my example to make it (somewhat) more legible.
I have an API for various development oriented processing. Among other things, it provides a few functions for transforming names between (configurable) conventions (variable <-> attribute <-> getter <-> setter <-> constant <-> parameter <-> ...) and styles (camelcase (low/high) <-> underscores). These conversion functions have been wrapped into a plugin.
The plugin + API can be fetched from here: https://github.com/LucHermitte/lh-dev. For this name conversion task, it requires lh-vim-lib.
It can be used the following way:
put the cursor on the symbol you want to rename
type :NameConvert + the type of conversion you wish (here : underscore). NB: this command supports auto-completion.
et voilà!

ICU Custom Currency Formatting (C++)

Is it possible to custom format currency strings using the ICU library, similar to the way it lets you format time strings by providing a format string (e.g. "mm/dd/yyy")?
So that for a given locale (say USD), if I wanted I could have all currency strings come back "xxx.00 $ USD".
See http://icu-project.org/apiref/icu4c/classDecimalFormat.html,
Specifically: http://icu-project.org/apiref/icu4c/classDecimalFormat.html#aadc21eab2ef6252f25eada5440e3c65
For pattern syntax see: http://icu-project.org/apiref/icu4c/classDecimalFormat.html#_details
I haven't used this, but from my knowledge of ICU this is the right direction.
However, I would suggest using:
http://icu-project.org/apiref/icu4c/classNumberFormat.html and its createCurrencyInstance member, and then setMaximumIntegerDigits or other functions to get what you need; that would be much more localized. Try not to assume anything about any culture, because "10,000 USD" may be misinterpreted as "$ 10" in some countries where "," is used as the decimal separator.
So be careful.
You can create a currency instance, then check whether it is safe to cast it to a DecimalFormat:
if (((const NumberFormat*)fmt)->getDynamicClassID() == DecimalFormat::getStaticClassID())
{ const DecimalFormat* df = (const DecimalFormat*) fmt; ...
… then you can call applyPattern on it. See the information on ¤, ¤¤, ¤¤¤ under 'special pattern chars'
Use the ICU library's createCurrencyInstance().