Using Regex to match Arabic Text in R

I'm trying to match elements in a character vector that contain specific Arabic phrases.
So far I have:
#load list of Arabic phrases
list.of.phrases <- read.table("arabic_phrases.txt")
#look for the first phrase
phrase1 <- arabic.text.vector[grepl(list.of.phrases[1],arabic.text.vector)]
Unfortunately, neither this approach nor using raw Arabic text returns anything, and I get this message:
Error in `[[<-.data.frame`(`*tmp*`, qname, value = 1) :
replacement has 1 row, data has 0
I know that I can match Arabic words using [\u0627-\u06FF]+, as in:
#look for all cells containing Arabic
arabic <- arabic.text.vector[grepl("[\u0627-\u06FF]+", arabic.text.vector)]
...
My current approach is to convert the Arabic text to its Unicode code point values and then use grep; however, I'm having trouble with the conversion.
Am I heading in the right direction, or does anybody have another solution or approach?
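For comparison, here is a minimal sketch of the two ideas in the question, written in Python rather than R: matching a literal Arabic phrase directly, and matching any run of characters in the range U+0627 to U+06FF. The names texts and phrases and their contents are hypothetical stand-ins for arabic.text.vector and the phrase file; this only illustrates the matching logic and is not a diagnosis of the R error above.

```python
import re

# Hypothetical stand-ins for arabic.text.vector and the phrase list.
texts = ["مرحبا بالعالم", "hello world", "صباح الخير يا صديقي"]
phrases = ["صباح الخير", "مرحبا"]

# Literal phrase match: escape the phrase so it is treated as plain text.
first = re.compile(re.escape(phrases[0]))
with_phrase = [t for t in texts if first.search(t)]

# Any Arabic run: the same code point range as in the question.
arabic_run = re.compile("[\u0627-\u06FF]+")
with_arabic = [t for t in texts if arabic_run.search(t)]

print(with_phrase)
print(with_arabic)
```

The point of the sketch is that the pattern must be a single character string; a phrase used as a pattern is matched literally once it is escaped.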

SUM multiple values after a substring within all cells in a column in Google Sheets

For an open source chat analyser in Google Sheets, I need to extract all numeric values after a substring (Example), then total them.
For example, if a cell contains Example1 another text 123 Example500 text, Example1 and Example500 should be extracted out, and their numeric values summed to 501.
This is complicated further by needing to obtain the total for a column of messages.
What I've tried already:
=REGEXEXTRACT(A1, "Example(\d+)"): This only extracts the first matching value, but works!
=SUM(SPLIT(A1, "Example")): This works for messages that only include my target string, but falls apart when other strings are included. The output could possibly be filtered to results that start with a number, but this is very messy and possibly a red herring.
CONCATENATEing all my cells together, then searching for numbers. This is error-prone due to additional numbers within messages.
Another idea is to replace each Example(\d+) with "$1 " (the captured digits followed by a space) and, through the alternation |., to replace every other character with a lone space (regex101 demo). This works because $1 is unset when the right side of the alternation matches. Then split on spaces and sum the numbers (any other digits have already been removed). If Example is a placeholder, replace it with e.g. [[:alpha:]]+ for one or more alphabetic characters.
=IF(ISTEXT(A1);SUM(SPLIT(REGEXREPLACE(A1;"Example(\d+)|.";"$1 ");" "));0)
I added IF(ISTEXT(A1);...) so that only text in the source cell is processed (to avoid errors). If the cell is empty or does not contain text, the result is 0. If the cell always contains text, the check is unnecessary and can be removed.
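The replace-then-split idea can be sketched outside of Sheets as well. The following is a minimal Python illustration, assuming a hypothetical helper name sum_after_prefix; it mirrors the formula's logic (keep the captured digits, blank out everything else, split, sum) rather than any Sheets API.

```python
import re

def sum_after_prefix(message, prefix="Example"):
    """Sum the numbers that directly follow `prefix` in `message`."""
    # Mirror IF(ISTEXT(...)): anything that is not text counts as 0.
    if not isinstance(message, str):
        return 0
    # Left branch captures the digits; right branch matches any other character.
    pattern = re.compile(re.escape(prefix) + r"(\d+)|.", re.DOTALL)
    # Keep the captured digits (if any) and always append a space.
    cleaned = pattern.sub(lambda m: (m.group(1) or "") + " ", message)
    return sum(int(n) for n in cleaned.split())

print(sum_after_prefix("Example1 another text 123 Example500 text"))  # 501
```

A column total is then just the sum of the per-message results, for example sum(sum_after_prefix(m) for m in messages).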
Edit from @TheMaster: As an array formula, we can use BYROW:
=BYROW(A:A; LAMBDA(row; IF(ISTEXT(row); SUM(SPLIT(
REGEXREPLACE(row;"Example(\d+)|.";"$1 ");" "));)))
try:
=LAMBDA(x, REGEXEXTRACT(A1, "(\w+)\d+")&
SUMPRODUCT(IF(IFERROR(REGEXMATCH(x, "\w+\d+")),
REGEXEXTRACT(x, "\w+(\d+)"), )))(SPLIT(A1, " "))
update 1:
=LAMBDA(x, REGEXEXTRACT(A1, "(\D+)\d+")&
SUMPRODUCT(IF(IFERROR(REGEXMATCH(x, "\D+\d+")),
REGEXEXTRACT(x, "\D+(\d+)"), )))(SPLIT(A1, " "))
update 2:
=INDEX(LAMBDA(xx, REGEXEXTRACT(xx, "(\D+)\d+")&
BYROW(LAMBDA(x, IF(IFERROR(REGEXMATCH(x, "\D+\d+")),
REGEXEXTRACT(x, "\D+(\d+)"), ))(SPLIT(xx, " ")), LAMBDA(x, SUMPRODUCT(x))))
(A1:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
if you start from A2 just change A1: to A2:

Split mixed upper and lowercase in the same word into two lines

I have a string in cells that is lacking new lines.
It looks like this:
Text Text TextText Text Text T5df Tdfcv TextNeu
In other words:
If there is a change from Lowercase to Uppercase within a word, this is where a new line should be inserted as \n.
So the example would convert to
Text Text Text
Text Text Text T5df Tdfcv Text
Neu
Or, written with explicit \n:
Text Text Text\nText Text Text T5df Tdfcv Text\nNeu
I found
String[] r = s.split("(?=\\p{Lu})");
I tried REGAUS(F2;"(?=\\p{Upper})";"\n";"g"), but I get Err:502, so something is wrong with the regex.
Which formula do I need in Calc to do this?
With English formula names, the following formula will do the trick:
=REGEX(A1;"([:lower:])([:upper:])";"$1"&CHAR(10)&"$2";"g")
Same on multiple lines for sake of readability:
=REGEX(
A1;
"([:lower:])([:upper:])";
"$1" & CHAR(10) & "$2";
"g"
)
It matches a lower-case letter followed by an upper-case letter, and inserts a newline using the CHAR() function.
You'll have to adjust the row height manually, otherwise you will see only "Neu" (the last line).
For localised formula names (German), it would be:
=REGAUS(A1;"([:lower:])([:upper:])";"$1"&ZEICHEN(10)&"$2";"g")
I would have expected inserting "\n" to work too, but I didn't manage to get it working, hence the recourse to CHAR(10).
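The same capture-and-reinsert technique can be sketched in Python. This is only an illustration under an ASCII-only assumption: [a-z] and [A-Z] stand in for [:lower:] and [:upper:], and the function name break_on_case_change is hypothetical.

```python
import re

def break_on_case_change(text):
    # ASCII-only assumption: [a-z] and [A-Z] stand in for [:lower:] and [:upper:].
    # Group 1 is the lowercase letter, group 2 the uppercase letter that follows it.
    return re.sub(r"([a-z])([A-Z])", r"\1\n\2", text)

result = break_on_case_change("Text Text TextText Text Text T5df Tdfcv TextNeu")
print(result)
```

Both captured letters are written back unchanged, so the only change to the string is the newline inserted between them.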

How can I substitute in strings in Perl 6 by codepoint rather than by grapheme?

I need to remove diacritical marks from a string using Perl 6. I tried doing this:
my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);
I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I expected the code above to return "חום", but it returns "חם".
I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.
I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.
Thanks!
My regex-fu is weak, so I'd go with a less magical solution.
First, you can remove all marks via samemark:
'חוּם'.samemark('a')
Second, you can decompose the graphemes via .NFD and operate on individual codepoints - eg only keeping values with property Grapheme_Base - and then recompose the string:
Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
In case of mixed strings, stripping marks from Hebrew characters only could look like this:
$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));
Here is a simple approach:
my $hum = 'חוּם';
my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;
my @ords;
for $hum.ords {
    @ords.push($_) if $min ≤ $_ ≤ $max;
}
say join('', @ords.map: { .chr });
Output:
חום
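The codepoint-level approaches above (decompose, filter, recompose, or keep only codepoints in a range) can be sketched in Python with the standard unicodedata module. This is an illustration, not a translation of the Perl 6 code: it filters on the general category Mark instead of the Grapheme_Base property, and the function names are hypothetical.

```python
import unicodedata

def strip_marks(text):
    # Decompose so that every combining mark becomes its own code point.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop code points whose general category is a Mark (Mn, Mc, Me).
    kept = [ch for ch in decomposed if not unicodedata.category(ch).startswith("M")]
    return unicodedata.normalize("NFC", "".join(kept))

def keep_hebrew_letters(text):
    # Keep only code points from HEBREW LETTER ALEF (U+05D0) to TAV (U+05EA).
    return "".join(ch for ch in text if 0x05D0 <= ord(ch) <= 0x05EA)

# HET, VAV, DAGESH (a combining mark), FINAL MEM
hum = "\u05D7\u05D5\u05BC\u05DD"
print(strip_marks(hum))
print(keep_hebrew_letters(hum))
```

Python strings are sequences of code points, so iterating over the string already works below the grapheme level, which is the behaviour the question asks for.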

How to match the following?

The data I want to parse has columns with the following format:
Character Big Medium Meaning ImageCode Small Constitutens Lesson Frame Strokes JH JTPL Heisig Story koohiiStory1 koohiiStory2 On-Reading Kun-Reading Examples:
All of those are separated by tabs \t (even though it may not look like it on the browser). Also notice at the end of each line there is a colon :. The problem is that the columns koohiiStory2 and examples may or may not exist and there may also be cases in which the data is corrupt and there is a tab inside Heisig Story but those are the minority.
What I'm trying to match is the values for On-Reading, Kun-Reading and Examples. All of these are distinct from the rest because they use Japanese characters instead of standard English characters (romaji), with the exception of perhaps a few commas or dots. It is also guaranteed that either Kun-Reading or Examples will end with a colon :, that On-Reading and Kun-Reading will exist, and that all three columns will be consecutive.
Here is some sample data.
How can I parse that to return this?
Alright, I'll give it a shot.
The content you expect is mostly non-ASCII characters that sit between a dot followed by a space or tab and a colon :, so:
(?<=\.(\s|\t)) // Positive lookbehind for a 'dot' + 'space or tab'
[^\w]+ // Any non words
(?=\:) // Positive lookahead for a ':'
Working sample on regex101
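A rough Python sketch of the same lookbehind and lookahead idea follows. The row is a made-up example, since the sample data is not shown here, and the re.ASCII flag is an assumption needed in Python: without it \w also matches Japanese characters, so [^\w]+ would not match them.

```python
import re

# re.ASCII makes \w ASCII-only, so [^\w] can match Japanese characters.
# Lookbehind: a dot plus one whitespace character (space or tab).
# Lookahead: the colon that ends the line.
pattern = re.compile(r"(?<=\.\s)[^\w]+(?=:)", re.ASCII)

# Hypothetical row: tab-separated, the story ends with a dot, the line ends with ":".
row = "12\tStory text ends here.\tイチ\tひと\t一つ:"

match = pattern.search(row)
# Split the matched span back into its tab-separated columns.
fields = match.group(0).split("\t") if match else []
print(fields)
```

Rows where the story does not end with a dot, or where ASCII letters or digits appear inside the readings, would not match this pattern and need separate handling.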

Format lists in VIM

I would like to find a way to easily format lists in Vim.
I checked par and Vim's default formatter.
For example, this:
1. this is my text this is my text this is my text
2. this is my text this is my text this is my text
3. this is my text this is my text this is my text
4. this is my text this is my text this is my text
and this
- this is my text this is my text this is my text
- this is my text this is my text this is my text
- this is my text this is my text this is my text
- this is my text this is my text this is my text
When I select the lines and format them to a width of 42 with par and with Vim, these are the results:
NUMBERED LIST
formatting with par:
par error:
(42) <= (0) + (50)
formatting with vim:
1. this is my text this is my text this is
my text
2. this is my text this is my text this is
my text
3. this is my text this is my text this is
my text
4. this is my text this is my text this is
my text
LIST with '-'
formatting with par:
4 lines filtered (no change)
formatting with vim:
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
Vim does a better job of formatting lists, but its result for the numbered list is still not correct.
Par has a lot of trouble formatting lists, even when I use the prefix ("p") option like this:
'<,'>!par w42p4dh or '<,'>!par w42p3dh
Does anyone know a good way to format lists without these problems?
Try set fo+=n. From :help fo-table:
n When formatting text, recognize numbered lists. This actually uses
the 'formatlistpat' option, thus any kind of list can be used. The
indent of the text after the number is used for the next line. The
default is to find a number, optionally followed by '.', ':', ')',
']' or '}'. Note that 'autoindent' must be set too. Doesn't work
well together with "2".
Example:
1. the first item
   wraps
2. the second item
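The hanging-indent behaviour described in the help text can be imitated outside Vim. Below is a minimal Python sketch using the standard textwrap module; the MARKER pattern is a hypothetical simplification (a number followed by one of . : ) ] }, or a lone dash) and is not Vim's actual 'formatlistpat' default.

```python
import re
import textwrap

# Hypothetical marker pattern: a number plus one of . : ) ] } or a lone dash.
MARKER = re.compile(r"^\s*(?:\d+[.:)\]}]|-)\s+")

def format_list_item(line, width=42):
    m = MARKER.match(line)
    # Continuation lines are indented to the column where the item text starts.
    indent = " " * len(m.group(0)) if m else ""
    return textwrap.fill(line, width=width, subsequent_indent=indent)

print(format_list_item("1. this is my text this is my text this is my text"))
print(format_list_item("- this is my text this is my text this is my text"))
```

Each item is wrapped on its own, so the continuation of "1." lines up under the item text instead of under the number.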