I have a long list in the format like such:
group1 » group2 » group3
Within R, I can use a gsub('»', '-', x) where x is the vector structured like above.
However, I am running into errors when trying to utilize this functionality when loading this into a shiny app. I've tried multiple ways to use gsub, chartr, and some other ones.
Also, the » character is not captured when using [[:punct:]].
Any suggestions?
group1 » group2 » group3 is a UTF-8 encoded string and therefore it would be best if the R application is coded to read the strings with conversion from UTF-8 to Latin 1 as explained on Read or Set the Declared Encodings for a Character Vector and Read text as UTF-8 encoding.
» is the right-pointing double angle quotation mark (U+00BB). UTF-8 encodes it as the 2 bytes with the hexadecimal values C2 BB; when those bytes are interpreted with code page Windows-1252 or ISO 8859-1 (Latin-1), they are displayed (wrongly) as Â».
gsub("\\xC2?\\xBB", "-", x) could be used to find in a UTF-8 encoded string or single byte encoded string (Latin 1 or Windows 1252) all right pointing guillemets and replace each of them by a hyphen character.
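The byte-level explanation above can be checked with a short sketch. It is written in Python rather than R, and the helper name repair_and_replace is hypothetical; it assumes the text was UTF-8 that got read as Windows-1252, and it leaves the text alone when that assumption does not hold.

```python
GUILLEMET = "\u00bb"  # the » character

# UTF-8 stores » as two bytes; reading them as Windows-1252 yields two characters.
utf8_bytes = GUILLEMET.encode("utf-8")
misread = utf8_bytes.decode("cp1252")


def repair_and_replace(text: str) -> str:
    """Undo a UTF-8-read-as-cp1252 mix-up when possible, then replace » with -."""
    try:
        text = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # the text was not misread in this way; keep it unchanged
    return text.replace(GUILLEMET, "-")


print(utf8_bytes)
print(repair_and_replace("group1 \u00c2\u00bb group2 \u00c2\u00bb group3"))
```

Declaring the correct encoding when the file is read remains the cleaner fix; a repair step like this only helps once the damage is already in the data.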
I have a large array of text (text, stored as cell-array), that I want to truncate in matlab, say for 5 characters. Truncating with regexprep is quite efficient, but now, I would love to append a '...' at the end of every truncated match (and only there).
(How) can this be achieved within MATLAB's regexprep?
>> text = {'123456780','1','12'}; %<- small representative sample
>> regexprep(text,'(^.{0,5})(.*)','$1') %capture first 5 characters or less in first group (and replace the text with first group captures)
ans =
1×3 cell array
{'12345'} {'1'} {'12'}
it should read:
ans =
1×3 cell array
{'12345...'} {'1'} {'12'}
You need to use
regexprep(text,'^(.{5}).+','$1...')
See the regex demo.
The main point is that you need to trigger the replacement only if a string is longer than five chars (otherwise, you do not even need to truncate the string).
Note that regexprep returns the input string as is if there was no regex match found, thus you do not need to worry about strings that are zero to five chars long.
Details:
^ - start of string
(.{5}) - Capturing group 1 ($1): any five chars
.+ - any one or more chars, as many as possible.
Note that the string 12345... is in fact 8 characters long. You don't want to make the mistake of truncating 1234567 to 12345..., as the truncated version is longer and therefore shouldn't be truncated in the first place.
A solution that takes this into account is:
regexprep(text,'^(.{5}).{3}.+','$1...')
which will only truncate if there are more than 8 characters and, if so, will display the first 5 with the trailing ellipsis.
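For readers outside MATLAB, the same idea can be sketched in Python. The function truncate and its parameters are hypothetical names chosen for illustration; the pattern is the one shown above, with the ellipsis length used for the middle quantifier so that the result is never longer than the input.

```python
import re


def truncate(s: str, keep: int = 5, ellipsis: str = "...") -> str:
    """Shorten s to keep characters plus an ellipsis, only if that makes it shorter."""
    # keep chars captured, then as many chars as the ellipsis has, then at least one more.
    pattern = r"^(.{%d}).{%d}.+" % (keep, len(ellipsis))
    # A function replacement avoids any escaping issues in the ellipsis text.
    return re.sub(pattern, lambda m: m.group(1) + ellipsis, s)


for sample in ["123456780", "12345678", "1234567", "1", "12"]:
    print(repr(sample), "->", repr(truncate(sample)))
```

As with regexprep, re.sub returns the input unchanged when the pattern does not match, so short strings need no special handling.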
I'm currently working with a database in Googlesheets containing app publishers linked to financial apps. I plan to match it with another list that has multiple bank names from multiple countries. The problem is that the first database has the names of the publishers as the original names, but second list has some bank names translated to English and some of them have the original native name. Specifically, the names of banks that appear to be written originally with a non-Latin script (such as Korean, Cyrillic, Japanese or Arabic) have been translated to English, and the names of banks that are written in a Latin-script-based language (as Spanish, Romanian, Slovenian or French) all appear to be not translated, just without any diacritic.
Because of this, I'm trying to use Regex in Googlesheets in order to check the cells for non-Latin Unicode letter characters (and that's why this question doesn't help, nor this one). Since currently Googlesheets' REGEXMATCH is incompatible with Unicode class characters, I've been forced to use the QUERY function (following this answer). Let's say that I have this column A:
A
融360
הבנק הבינלאומי
АТ "Акцент-Банк
АО АЛСЕКО
АО "Altyn Bank"
АКБ "Kapitalbank
Şekerbank
İninal
Československá obchodní banka
iBillionaire
iBear
iBankマーケティング株式会社
4finance
11번가(주)
11com7 design
1 2 3 Apps
1 2 3 Apps
(주)인포바인
I want to use QUERY in another column combined with WHERE MATCHES in order to be able to use the Latin Unicode class, and I want the QUERY function to give results only when non-Latin letter characters appear. That is, I want something like this as a result in column B:
A:B
融360:融360
הבנק הבינלאומי:הבנק הבינלאומי
АТ "Акцент-Банк:АТ "Акцент-Банк
АО АЛСЕКО:АО АЛСЕКО
АО "Altyn Bank":АО "Altyn Bank" \\ These are Cyrillic A and O
АКБ "Kapitalbank:АКБ "Kapitalbank
Şekerbank:#N/A
İninal:#N/A
Československá obchodní banka:#N/A
iBillionaire:#N/A
iBear:#N/A
iBankマーケティング株式会社:iBankマーケティング株式会社
4finance:#N/A
11번가(주):11번가(주)
11com7 design:#N/A
1 2 3 Apps:#N/A
1 2 3 Apps:#N/A
(주)인포바인:(주)인포바인
I'm doing this with this formula: =QUERY(A;"select A where A matches 'SomeREGEX'";0), but I don't seem to get the correct regular expression. After a lot of unsuccessful attempts, I tried [\p{Latin}\d\s]*[^\p{Latin}]+[\p{Latin}\d\s]* that gives a correct answer to АО АЛСЕКО, Československá obchodní banka, 11번가(주), 11com7 design and (주)인포바인, but not with АО "Altyn Bank", АКБ "Kapitalbank or iBankマーケティング株式会社
What might I be doing wrong?
You may use
=REGEXMATCH(A1,".*([\x{0080}-\x{02AF}]|\d.*[a-zA-Z]|[a-zA-Z].*\d).*|^[a-zA-Z]+$")
See the regex demo
Details
.* - any 0 or more chars other than line break chars, as many as possible
([\x{0080}-\x{02AF}]|\d.*[a-zA-Z]|[a-zA-Z].*\d) -
[\x{0080}-\x{02AF}] - a character from U+0080 to U+02AF range
| - or
\d.*[a-zA-Z]|[a-zA-Z].*\d - a digit, any 0+ chars as many as possible, then an ASCII letter; or an ASCII letter, any 0+ chars as many as possible, then a digit
.* - any 0 or more chars other than line break chars, as many as possible
| - or
^ - start of string
[a-zA-Z]+ - 1 or more ASCII letters
$ - end of string
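To cross-check the expected column B outside the spreadsheet, a rough Python sketch can classify each name. This is not the Sheets formula and does not use the regex above: it is a swapped-in heuristic that looks at Unicode character names, and the function name has_non_latin_letter is hypothetical.

```python
import unicodedata


def has_non_latin_letter(text: str) -> bool:
    """Return True if any alphabetic character is not named as a Latin letter."""
    for ch in text:
        # Digits, spaces, and punctuation are ignored; only letters are checked.
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


names = ["融360", "АО \"Altyn Bank\"", "Şekerbank", "İninal", "4finance", "11번가(주)"]
for name in names:
    print(name, has_non_latin_letter(name))
```

Because the check relies on character names, a few alphabetic symbols that are not named as Latin letters (for example ª) would also be flagged, so treat it as a sanity check rather than a definitive classifier.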
I can find different values of same ASCII characters at the below links:
1.http://www.theasciicode.com.ar/
(Extended ASCII in non-latin format)
2.https://www.ascii-code.com/
(shows Extended Ascii in Latin1 format)
In the first link I can see the value of á (a with acute accent) = 160.
In the second link the value of á (a with acute accent) = 225.
Similarly, the values differ, with no obvious pattern, for the characters in the range 128-255.
I have a C++ application where I am getting the ASCII value in the non-Latin format (1), which I need to output as the Latin value (2). Is there any formula that could help? Please help. Thanks
The first table looks like the codepage 437 you can find on Windows.
If you have a string encoded with this codepage, you can decode it like this:
text = b"your\xa0text".decode("cp437")
You get: 'yourátext'
To convert it to ISO Latin1, you can write:
latin1 = text.encode("latin1")  # or "iso-8859-1"
You get the bytes string: b'your\xe1text'
Notice: some characters won't convert from cp437 to Latin1.
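A minimal sketch of how the non-convertible characters might be handled is shown here. The function name cp437_to_latin1 and its fallback parameter are hypothetical; the sketch assumes that substituting ? for characters missing from Latin-1 is acceptable.

```python
def cp437_to_latin1(data: bytes, fallback: str = "replace") -> bytes:
    """Re-encode cp437 bytes as Latin-1, applying fallback to unmappable characters."""
    text = data.decode("cp437")  # every byte value has a meaning in cp437
    # With "replace", characters absent from Latin-1 become b"?";
    # with "strict", they raise UnicodeEncodeError instead.
    return text.encode("latin-1", errors=fallback)


print(cp437_to_latin1(b"your\xa0text"))
print(cp437_to_latin1(b"\xb0"))  # 0xB0 is a shading block in cp437, not in Latin-1
```

In a C++ application the same mapping would typically be a 128-entry lookup table for the bytes 128-255, since there is no arithmetic formula relating the two code pages.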
I'm parsing a file using Python 2.7 and I'm trying to find all occurrences of the following pattern
ASCII 06 followed by any two characters in the range from ASCII 0 to ASCII 255
Naive try #1 - [chr(6)][chr(0)-chr(255][chr(0)-chr(255]
fails with a message that indicates the range cannot be strings.
I've tried several other combos - no success.
The record that I'm parsing was read in
sF = open('D:\Scratch\xxxxx.01', 'r')
record = sF.read()
Any help will be gratefully appreciated.
Thanx,
Doug
Given you're using python2.7, your open().read() will return bytes. You can use the following bytes regex to match ASCII 06 followed by 2 [0-255] bytes:
reg = re.compile(b'\x06..')
Note that I used . here as any byte will satisfy 0 - 255 (except the newline character \n (\x0a), which . does not match by default)
If you wanted to also match the newline character you could do one of the following things:
# Uses the DOTALL flag to force `.` to match `\n`
reg = re.compile(b'\x06..', re.DOTALL)
# Makes a character class which matches all bytes
reg = re.compile(b'\x06[\x00-\xff]{2}')
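The difference between the three patterns can be seen on a small made-up sample. This sketch uses Python 3 bytes patterns (the question targets Python 2.7, where str is already a byte string); the variable names and the sample data are assumptions for illustration.

```python
import re

# Raw bytes literals let the regex engine interpret the \x escapes itself.
DOT_NO_NEWLINE = re.compile(rb"\x06..")
DOT_ALL = re.compile(rb"\x06..", re.DOTALL)
ANY_TWO_BYTES = re.compile(rb"\x06[\x00-\xff]{2}")

# Two candidate records: the first has a newline right after the 0x06 marker.
sample = b"a\x06\n\xffb\x06xy"

print(DOT_NO_NEWLINE.findall(sample))  # skips the record containing \n
print(DOT_ALL.findall(sample))
print(ANY_TWO_BYTES.findall(sample))
```

Opening the file in binary mode ('rb') is usually the safer choice for data like this, so that no newline translation happens before the regex runs.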
I'm using libconfig to create a configuration file, and one of the fields holds the content of an encrypted file. The problem is that the file contains some escape characters, which cause the content to be stored only partially. What is the best way to store this data and avoid accidental escape characters? Convert it to Unicode?
Any suggestion?
You can use either URL encoding, where each byte that is not a printable, unreserved ASCII character is encoded as a % character followed by two hex digits, or base64 encoding, where each set of 3 bytes is encoded as 4 ASCII characters (3x8 bits -> 4x6 bits).
For example, if you have the following bytes:
00 01 41 31 80 FE
You can URL encode it as follows:
%00%01A1%80%FE
Or you can base64 encode it like this, with 0-25 = A-Z, 26-51 = a-z, 52-61 = 0-9, 62 = +, 63 = /:
(00000000 00000001 01000001) (00110001 10000000 11111110) -->
(000000 000000 000101 000001) (001100 011000 000011 111110)
AAFBMYD+
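The bit regrouping can be reproduced with a short Python sketch that encodes the same six bytes by hand and compares the result with the standard library. The helper base64_by_hand is a hypothetical name; it uses the standard base64 alphabet (+ for 62, / for 63) and deliberately skips padding, so it only accepts input whose length is a multiple of 3.

```python
import base64
from urllib.parse import quote

STANDARD_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def base64_by_hand(data: bytes) -> str:
    """Encode data (length a multiple of 3) by regrouping 3x8 bits into 4x6 bits."""
    if len(data) % 3 != 0:
        raise ValueError("this sketch skips padding; length must be a multiple of 3")
    out = []
    for i in range(0, len(data), 3):
        # Pack three bytes into one 24-bit integer.
        chunk = int.from_bytes(data[i:i + 3], "big")
        # Split it into four 6-bit values, most significant first.
        for shift in (18, 12, 6, 0):
            out.append(STANDARD_ALPHABET[(chunk >> shift) & 0x3F])
    return "".join(out)


raw = bytes([0x00, 0x01, 0x41, 0x31, 0x80, 0xFE])
url_encoded = quote(raw, safe="")
b64_manual = base64_by_hand(raw)
b64_library = base64.b64encode(raw).decode("ascii")
print(url_encoded, b64_manual, b64_library)
```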
The standard for encoding binary data in text used to be uuencode and is now base64. Both use the same paradigm: a byte has 8 bits, so 3 bytes make 24 bits, which fit in four 6-bit characters.
uuencode just used the 6 bits with an offset of 32 (ASCII code for space), so characters are in range 32-96 => all in printable ASCII range, but including space and possibly other characters that could have special meanings
base64 chose these 64 characters to represent values from 0 to 63 (no =:;,'"\*(){}[] that could have special meaning...):
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
and the equal sign(=) being a place holder for empty positions and the end of an encoded string to ensure that the encoded string length is a multiple of 4.
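A small round-trip sketch shows the padding rule in practice. The helper names to_config_text and from_config_text are hypothetical; the sketch assumes the encrypted content is available as bytes and that the configuration field can hold a plain ASCII string.

```python
import base64


def to_config_text(blob: bytes) -> str:
    """Turn arbitrary bytes into ASCII text that is safe to store as a string value."""
    return base64.b64encode(blob).decode("ascii")


def from_config_text(text: str) -> bytes:
    """Recover the original bytes; validate=True rejects characters outside the alphabet."""
    return base64.b64decode(text, validate=True)


for blob in (b"\x00", b"\x00\x01", b"\x00\x01\x41"):
    encoded = to_config_text(blob)
    # One or two = signs pad the output to a multiple of 4 characters.
    print(blob, "->", encoded, len(encoded) % 4 == 0)
```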
Unfortunately, neither the C nor the C++ standard library offers functions for uuencode or base64 conversions, but you can find nice implementations around, with many pointers in this other SO answer: How do I base64 encode (decode) in C?