What formula can I use to get a count of emoji and characters in a single cell?
For example, in cells A1, A2 and A3:
๐๐๐
๐คโ๏ธ๐๐ค๐ค
??๐๐๐
Total count of characters in each cell (desired output):
3
5
5
For the given emojis, this will work well:
=LEN(REGEXREPLACE(A13,".","."))
MID/LEN considers each emoji as 2 separate characters.
REGEX will consider them as one.
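The difference between the two counts can be reproduced outside the spreadsheet. The Python sketch below is only an illustration: it assumes that LEN counts UTF-16 code units while the regex engine walks code points, and the sample emoji are arbitrary picks, not the ones from the question.

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units (2 per emoji outside the BMP)."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Number of Unicode code points; Python's len() already counts these."""
    return len(text)


# Three arbitrary emoji, written as escapes so the source stays ASCII.
sample = "\U0001F600\U0001F601\U0001F602"

print(utf16_units(sample))  # what a UTF-16 based LEN would report
print(code_points(sample))  # what a per-code-point regex would report
```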
But even REGEX will fail with a complex emoji like this:
👨‍👩‍👧‍👦
This contains a literal man emoji 👨, a woman emoji 👩, a girl emoji 👧 and a boy emoji 👦, all joined by zero-width joiners. You could even swap the boy for another girl with this formula:
=SUBSTITUTE("👨‍👩‍👧‍👦","👦","👧")
It'll become like this:
👨‍👩‍👧‍👧
=COUNTA(FILTER(
SPLIT(REGEXREPLACE(A1,"(.)","#$1"),"#"),
SPLIT(REGEXREPLACE(A1,"(.)","#$1"),"#")<>""
))
Based on the answer by #I'-'I
Some emojis consist of multiple emojis joined by CHAR(8205):
๐จโ๐ฉโ๐งโ๐ฆโ๐ฆโ๐
The result differs depending on the browser you use.
I wonder, how do we count them?
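One way to approach this, sketched in Python rather than as a sheet formula: treat every code point that follows a zero-width joiner (CHAR(8205), U+200D) as part of the previous emoji, and skip variation selectors. The function name `count_visible` is made up for this example, and the sketch ignores skin-tone modifiers and flag sequences, so it is an approximation and not a full grapheme segmenter.

```python
ZWJ = "\u200d"   # zero-width joiner, CHAR(8205) in the formulas above
VS16 = "\ufe0f"  # variation selector that often trails an emoji


def count_visible(text: str) -> int:
    """Count characters, treating ZWJ-joined runs as a single emoji."""
    count = 0
    joined = False  # True when the previous code point was a ZWJ
    for ch in text:
        if ch == ZWJ:
            joined = True
            continue
        if ch == VS16:
            continue
        # A code point glued to the previous one by a ZWJ is not counted again.
        if not joined or count == 0:
            count += 1
        joined = False
    return count


# man + ZWJ + woman + ZWJ + girl + ZWJ + boy
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(len(family), count_visible(family))
```

Here `len(family)` reports the seven underlying code points, while `count_visible` reports the single joined emoji.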
I have a list of different items. Some of them have 8-10 digits in front of the name, some others have these 8-10 digits behind the name and some others again don't have these numbers in the name.
I have two expressions that I use to remove these digits, but I cannot manage to combine them with | (or). Each works on its own, but if I apply the first expression and then the second, I don't get the result I want.
I use these two expressions for now:
(?<=[\d]{8,10}) (.*)
.*?(?=[\d]{8,10})
But if I use them both (first one and then the other), then some of the lines become totally empty.
How can I combine these two to do what I want, or, if it's better, write a new expression that does what I want? :)
List is like this:
12345678 Book
12345678 Book
Book 12345678
Book 12345678
Cabinet 120x30x145
Want this result:
Book
Book
Book
Book
Cabinet 120x30x145
Why not just use the following?
Check whether there are 8 to 10 digits at the beginning of the string or at the end of it, and remove them.
(^\d{8,10}\s*|\s*\d{8,10}$)
It gives the wanted behaviour
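As a quick check, the pattern can be applied to each item separately. This is a minimal Python sketch; the names `PATTERN`, `items` and `cleaned` are placeholders for this example, and handling one item per string is an assumption about how the list is stored.

```python
import re

# The pattern from the answer: 8-10 digits at the start or end of a string,
# together with the whitespace next to them.
PATTERN = re.compile(r"(^\d{8,10}\s*|\s*\d{8,10}$)")

items = [
    "12345678 Book",
    "12345678 Book",
    "Book 12345678",
    "Book 12345678",
    "Cabinet 120x30x145",
]

# Replace every match with an empty string, one item at a time.
cleaned = [PATTERN.sub("", item) for item in items]
print(cleaned)
```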
Instead of only matching everything but a number containing
8-10 digits + adjacent spaces, use a regex to substitute
such a number (also + adjacent spaces) with an empty string.
To match, use the following regex:
 *\d{8,10} *
That is:
" *" (a space followed by an asterisk) - a sequence of spaces (may be empty),
\d{8,10} - a sequence of 8 to 10 digits,
" *" - another sequence of spaces (may be empty).
The replacement string is (as I said) empty. Of course, you should use
the g (global) option.
Note that you cannot use \s instead of the space, as \s also matches
CR and LF and we don't want this.
For a working example see https://regex101.com/r/1hsGzT/1
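The point about \s can be seen by running both variants over the whole list as one multi-line string. This is a small Python sketch (Python's re.sub replaces all matches by default, so no global flag is needed); the variable names are placeholders for this example.

```python
import re

text = "\n".join([
    "12345678 Book",
    "12345678 Book",
    "Book 12345678",
    "Book 12345678",
    "Cabinet 120x30x145",
])

# Literal spaces only: line breaks are left alone.
spaces_only = re.sub(r" *\d{8,10} *", "", text)

# \s also matches the line breaks next to the digits, so lines get merged.
any_whitespace = re.sub(r"\s*\d{8,10}\s*", "", text)

print(spaces_only)
print(repr(any_whitespace))
```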
You need to use the \b word-boundary metasequence:
/\b[0-9\s]{8,10}\b/g;
var str = `12345678 Book
12345678 Book
Book 12345678
Book 12345678
Cabinet 120x30x145`;
var rgx = /\b[0-9\s]{8,10}\b/g;
var res = str.replace(rgx, `\n`)
console.log(res);
I have a Google Sheet with cells that contain different words. I want the words to equal numbers if present in a different cell. Normally I can just use =IF when there is just one word, but I can't for this. I've tried using =REGEXMATCH and =SEARCH but can't get it to work.
For example the cell might contain the following text:
"Uruguay, France, Brazil, Belgium"
I want Uruguay, France, Brazil to each = 3 in another cell but Belgium to = 0 in that same cell.
I think this does what you ask:
=substitute(substitute(substitute(substitute(A1,"Uruguay",3),"France",3),"Brazil",3),"Belgium",0)
but, possibly like the user who voted to close as Unclear, I doubt it is what you want.
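If the goal is a per-country lookup rather than nested text substitution, the same mapping can be sketched in Python. The dictionary `POINTS` and the function `score_cell` are hypothetical names for this illustration, and treating unknown names as 0 is an assumption.

```python
# Values taken from the question: three countries score 3, Belgium scores 0.
POINTS = {"Uruguay": 3, "France": 3, "Brazil": 3, "Belgium": 0}


def score_cell(cell: str) -> list[int]:
    """Split a comma-separated cell and look each name up in POINTS."""
    names = [name.strip() for name in cell.split(",") if name.strip()]
    # Names missing from the table count as 0 (an assumption).
    return [POINTS.get(name, 0) for name in names]


print(score_cell("Uruguay, France, Brazil, Belgium"))
```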
I'm working on a 3.75 million line text catalog of authors' names and titles in EditPad Pro. I need to standardize the authors' initials to have periods after them.
The catalog has the author's name and book title separated by a vertical bar "|" character, like this:
A N Author|A Title
A. N. Name|A Blah
Some A Name|Blah A Lot
A Name|Blah I
Name A|I Blah
B O'Name|A Book
Normally in Calibre I use this regex to standardize the initials
\b([A-Z])\.?\s?(?!'|\-|\.)\b
Replace:"\1. "
but here I need it to work only up to the vertical bar "|" character, and not make any changes to the titles. I cannot seem to get anything to work on all the above authors' names without it also changing the titles.
Results I'm looking for:
A. N. Author|A Title
A. N. Name|A Blah
Some A. Name|Blah A Lot
A. Name|Blah I
Name A.|I Blah
B. O'Name|A Book
Thanks.
Add to your regex a positive lookahead:
(?=.*\|)
It means: Somewhere later in the line there must be a |.
It works as long as there is a single | in the line, and your source
text sample meets this condition.
Single letters before it are matched, single letters after it aren't.
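To see the lookahead in action, here is a rough Python sketch that appends it to the regex from the question and runs it on the sample lines. The function name `fix_initials` is invented for this example. Note one addition that is not part of the answer: because the replacement always ends with a space, an initial directly before the bar would come out as "A. |", so the sketch removes that one space afterwards.

```python
import re

# Regex from the question with the lookahead (?=.*\|) appended.
INITIALS = re.compile(r"\b([A-Z])\.?\s?(?!'|\-|\.)\b(?=.*\|)")


def fix_initials(line: str) -> str:
    """Add periods after single-letter initials in the author part only."""
    fixed = INITIALS.sub(r"\1. ", line)
    # Cleanup step added for this sketch: drop the space left before the bar.
    return fixed.replace(" |", "|", 1)


lines = [
    "A N Author|A Title",
    "A. N. Name|A Blah",
    "Some A Name|Blah A Lot",
    "A Name|Blah I",
    "Name A|I Blah",
    "B O'Name|A Book",
]
for line in lines:
    print(fix_initials(line))
```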
I am trying to extract information about people wounded from several articles. The issue is that journalistic language conveys that information in different ways, since the count can be written in digits or in words.
For instance:
`Security forces had *wounded two* gunmen inside the museum but that two or three accomplices might still be at large.`
`The suicide bomber has wounded *four men* last night.`
`*Dozens* were wounded in a terrorist attack.`
I noticed that most of the time numbers from 1 to 10 are written in words rather than in digits. I was wondering how to extract them without convoluted code, by simply listing the words for 1-10 in the regular expression.
Shall I use a list? And how would it be included?
This is the pattern I used so far for extracting the number of people wounded with digit:
import re

# Read the whole article file into one string.
with open("News") as text_open:
    text_read = text_open.read()
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) injured|(\d+) people were wounded|wounding (\d+)|wounding at least (\d+)"
result = re.findall(pattern, text_read)
print(result)
Try this:
import re
regex = r"(\w)+\s(?=were)|(?<=wounded|injured)\s[\w]{3,}"
test_str = ("`Security forces had wounded two gunmen inside the museum but that two or three accomplices might still be at large.`\n\n"
"`The suicide bomber has wounded four men last night.`\n\n"
"`Dozens were wounded in a terrorist attack.")
matches = re.finditer(regex, test_str)
for match in matches:
    print(match.group().strip())
Output:
two
four
Dozens
(\w)+\s(?=were) : (?=...) is a lookahead; this matches a word (\w+) and the whitespace after it only when "were" comes next.
| : or
(?<=wounded|injured)\s[\w]{3,} : (?<=...) is a lookbehind; this matches a word only when "wounded" or "injured" comes directly before it. {3,} requires the word to be at least 3 characters long, which skips short words such as "in"; every number word has at least 3 letters, so nothing is lost.
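On the question of using a list: one option is to keep the number words in a Python list and join them into an alternation next to \d+. The sketch below is only an illustration; the names `NUMBER_WORDS`, `QUANTITY` and `wounded_counts` are invented here, the word list (including "dozens") is an assumption about what should count, and it covers only a few of the phrasings from the question.

```python
import re

# Words to accept in place of digits; extend as needed (assumed list).
NUMBER_WORDS = ["one", "two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten", "dozens"]

# Either digits or one of the listed words.
QUANTITY = r"(?:\d+|" + "|".join(NUMBER_WORDS) + r")"

PATTERN = re.compile(
    rf"\b(?:wounded|wounding|injured)\s+(?:at\s+least\s+)?({QUANTITY})\b"
    rf"|\b({QUANTITY})\s+(?:people\s+)?were\s+wounded\b",
    re.IGNORECASE,
)


def wounded_counts(text: str) -> list[str]:
    """Return each quantity found, whichever side of the verb it was on."""
    return [m.group(1) or m.group(2) for m in PATTERN.finditer(text)]


sample = (
    "Security forces had wounded two gunmen inside the museum but that two "
    "or three accomplices might still be at large. "
    "The suicide bomber has wounded four men last night. "
    "Dozens were wounded in a terrorist attack."
)
print(wounded_counts(sample))
```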
A friend of mine said if the regex I'm using is too long, it's probably the wrong tool for the job. Any thoughts here on a better way to parse this text? I have a regex that returns everything to an array I can easily just chunk out, but if there's another simpler way I'd really like to see it.
Here's what it looks like:
2 AB 123A 01JAN M ABCDEF AA1 100A 200A 02JAN T /ABCD /E
Here's a break down of that:
2 is the line number, these range from 1 all the way to 99. If you can't see because of formatting, there is a space character prepending numbers less than 10.
The space may or may not be replaced by an *
AB is an important unit of data (UOD).
AB may be prepended by /CD which is another important UOD.
123 is an important UOD. It can range from 1 (prepended by 4 spaces) to 99999.
A is an important UOD.
01JAN is a day/month combination, I need to extract both UODs.
M is a day name short form. This may be a number between 1 and 7.
ABC is an important UOD.
DEF is an important UOD.
The space after DEF may be an *
AA1 may be zero characters, or it may be 5. It is unimportant.
100A is a timestamp, but may be in the format 1300. The A may be N when the time is 1200 or P for times in the PM.
We then see another timestamp.
The next date part may not be there, for example, this is valid:
93*DE/QQ51234 30APR J QWERTY*QQ0 1250 0520 /ABCD*ASDFAS /E
The data where /ABCD*ASDFAS /E appears is irrelevant to the application, but, this is where the second date stamp may appear. The front-slash may be something else (such as a letter).
Note:
It is not space delimited, some parts of the body run into others. Character position is only accurate for the first two or three items on the list
I don't think I left anything out, but, if there's an easier way to parse out a string like this than writing a regex, please let me know.
This is a perfect task for regular expressions. The text does not contain nesting and the items you're matching are fairly simple taken individually.
Most regular expression syntaxes have an extended (x) flag or mode that allows whitespace and comments to improve readability. For example:
$regex = '~
# 2 is the line number, these range from 1 all the way to 99.
# There is a space character prepending numbers less than 10.
# The space may or may not be replaced by an *.
(?:[ *]\d|\d\d)
\s
# AB is an important unit of data (UOD).
# AB may be prepended by /CD which is another important UOD.
(/CD)?AB
\s
# 123 is an important UOD. It can range from 1 (prepended by 4 spaces)
# to 99999.
(?:\s{4}\d|\s{3}\d{2}|\s{2}\d{3}|\s\d{4}|\d{5})
~x';
And so on.
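The same idea carries over to other languages. Below is a rough Python sketch using re.VERBOSE and named groups for the first few fields only. The field layout is inferred from the single sample line as it appears in the question, the group names are invented for this example, and the optional /CD prefix is left out, so treat it as a starting point and not a complete parser.

```python
import re

LINE_START = re.compile(
    r"""
    (?P<line_no>[ *]\d|\d\d)   # line number 1-99; a space or * pads single digits
    [ *]                       # separator, which may be a space or an asterisk
    (?P<unit>[A-Z]{2})         # two-letter unit of data, e.g. AB
    \s*
    (?P<number>\d{1,5})        # number between 1 and 99999
    (?P<letter>[A-Z])          # single-letter unit of data
    \s+
    (?P<day>\d{2})             # day of month
    (?P<month>[A-Z]{3})        # month abbreviation
    """,
    re.VERBOSE,
)

sample = " 2 AB 123A 01JAN M ABCDEF AA1 100A 200A 02JAN T /ABCD /E"
match = LINE_START.match(sample)
fields = match.groupdict() if match else {}
print(fields)
```

Named groups keep each extracted value tied to the comment that describes it, which addresses most of the readability concern with a long pattern.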
A regex seems fine for this application, but for simplicity and readability, you might want to split this into several regexes (one for each field) so people can more easily follow which part of the regex corresponds to which variable.
You can always code your own parser by hand, but that would be more lines of code than a regex. The lines of code, however, will probably be simpler to follow for the reader.
Simply write a custom parser that handles it line by line. It seems like everything is at a fixed position rather than space/comma-delimited, so simply use those as indices into what you need:
line_number = int(line_text[0:2].lstrip(" *"))  # two-character field, padded with a space or *
ab_unit = line_text[3:5]  # two-character unit, e.g. AB
...
If it is indeed space-delimited, simply split() each line and then parse through each, splitting each chunk into component parts where appropriate.