Finding type of break in icu::BreakIterator - c++

I'm trying to understand how to use icu::BreakIterator to find specific words.
For example, I have the following sentence:
To be or not to be? That is the question...
A word instance of the break iterator would put breaks here:
|To| |be| |or| |not| |to| |be|?| |That| |is| |the| |question|.|.|.|
Now, not every pair of break points is an actual word.
The derived class icu::RuleBasedBreakIterator has a getRuleStatus() method that returns some information about a break, and it reports a word status at the following points (marked "/"):
|To/ |be/ |or/ |not/ |to/ |be/?| |That/ |is/ |the/ |question/.|.|.|
But it all depends on the specific rules, and there is no documentation that explains them (short of just trying). What would happen with different locales and languages where dictionaries are used? What happens with backward iteration?
Is there any way to get "begin of word" or "end of word" information, as in Qt's QTextBoundaryFinder: http://qt.nokia.com/doc/4.5/qtextboundaryfinder.html#BoundaryReason-enum?
How should I solve this problem correctly in ICU?

Have you tried the ICU documentation? It appears to explain everything you are asking about including handling of internationalisation, reverse iteration, and the rules, both default and how to create your own custom set. They also have code snippets to help.
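For what it's worth, the value that getRuleStatus() reports for a word iterator falls into numeric ranges tagged by constants in ICU's ubrk.h, and the status describes the text that precedes the boundary. The sketch below is Python rather than C++ and only illustrates that classification. The numeric values are written from memory of ubrk.h, so treat them as an assumption and check your own headers; segment_kind and the sample (text, status) pairs are made up for the illustration.

```python
# Range starts as I understand them from ICU's ubrk.h (assumption: verify locally).
UBRK_WORD_NONE = 0          # spaces, punctuation: not a word
UBRK_WORD_NUMBER = 100      # segment that looks like a number
UBRK_WORD_LETTER = 200      # segment containing letters
UBRK_WORD_KANA = 300        # kana segment
UBRK_WORD_IDEO = 400        # ideographic segment
UBRK_WORD_IDEO_LIMIT = 500  # upper bound of the ideographic range

def segment_kind(status):
    """Map a rule status value to a coarse segment category."""
    if status < UBRK_WORD_NONE or status >= UBRK_WORD_IDEO_LIMIT:
        return "unknown"
    if status >= UBRK_WORD_IDEO:
        return "ideo"
    if status >= UBRK_WORD_KANA:
        return "kana"
    if status >= UBRK_WORD_LETTER:
        return "letter"
    if status >= UBRK_WORD_NUMBER:
        return "number"
    return "none"

# Hypothetical (text, status) pairs, as a word iterator might report them
# for the start of the sentence in the question.
segments = [("To", 200), (" ", 0), ("be", 200), ("?", 0)]
words = [text for text, status in segments
         if segment_kind(status) in ("number", "letter", "kana", "ideo")]
```

With this kind of check, the pair of boundaries around a segment counts as a word only when the status at its end boundary is outside the "none" range.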

Find specified position's value on a function declaration

I'm trying to find method (function) declarations whose 6th argument is true. I had hoped to get the wanted group using regex alone, without any extra programming-language feature, but unfortunately there seems to be no such option.
sdfsdfs(123,234,werer,23324,234324,true,dwfwefwer,sdfdsdff);
sdfsdfs(123,234,true,23324,234324,true,dwfwefwer);
sdfsdfs(123,234,234234,23324,234324,r23423,dwfwefwer);
sdfsdfs(123,234,234234,23324,234324,false,dwfwefwer);
(123,234,werer,23324,234324,true,dwfwefwer,sdfdsdff)
erterterterter(123,234,werer,23324,234324,true,dwfwefwer,sdfdsdff);
What I have tried is here: (\b\w+(?=\s*[,()]))
You can check the following regex.
(?i)\w*\([^,]+,[^,]+,[^,]+,[^,]+,[^,]+,true\b,[^\)]+\)
I have tested it here: https://regex101.com/r/BEJNp4/1/
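As a quick check, here is a small Python sketch that runs the pattern above over the sample lines from the question. The helper name sixth_argument_is_true is just for illustration, and Python's re module is an assumption, since the question does not say which regex engine is in use.

```python
import re

# The pattern from the answer, unchanged.
SIXTH_ARG_TRUE = re.compile(
    r"(?i)\w*\([^,]+,[^,]+,[^,]+,[^,]+,[^,]+,true\b,[^\)]+\)"
)

SAMPLES = [
    "sdfsdfs(123,234,werer,23324,234324,true,dwfwefwer,sdfdsdff);",
    "sdfsdfs(123,234,true,23324,234324,true,dwfwefwer);",
    "sdfsdfs(123,234,234234,23324,234324,r23423,dwfwefwer);",
    "sdfsdfs(123,234,234234,23324,234324,false,dwfwefwer);",
    "(123,234,werer,23324,234324,true,dwfwefwer,sdfdsdff)",
    "erterterterter(123,234,werer,23324,234324,true,dwfwefwer,sdfdsdff);",
]

def sixth_argument_is_true(line):
    """Return True when the line contains a call whose 6th argument is true."""
    return SIXTH_ARG_TRUE.search(line) is not None

results = [sixth_argument_is_true(line) for line in SAMPLES]
```

Two things to keep in mind: because of \w*, the bare argument list without a function name also matches, and the pattern as written expects at least one more argument after the sixth.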

How to set up word wrap for an stc.StyledTextCtrl() in wxPython

I was wondering about this, so I did quite a few Google searches and came up with the SetWrapMode(self, mode) function. However, it was never really detailed, and nothing really said how to use it. I ended up figuring it out, so I thought I'd post a thread here and answer my own question for anyone else who is wondering how to make an stc.StyledTextCtrl() word wrap.
OK, so first you need to have your styled text control already defined, of course. If you don't know how to do this, go watch some tutorials on wxPython. I recommend a YouTuber called sentdex (http://youtube.com/sentdex), who has a complete series on wxPython, as well as Zach King, who has a 4-episode series on making a text editor. Anyway, my definition of my text control looks like this. Yours could look a little different, but the overall idea is the same.
self.control = stc.StyledTextCtrl(self, style=wx.TE_MULTILINE)
Many places show the function as SetWrapMode(self, mode), but if you call it on self.CONTROLNAME like I do, you will get an error if you also pass self as an argument, because the control in front of the dot already counts as that argument. However, if your control is stored as self.CONTROLNAME and you leave self.CONTROLNAME off the front of your SetWrapMode() call, you'll also get an error, so be careful with that. The mode just has to be 0, 1, 2 or 3. For example, mine looks like this: self.control.SetWrapMode(mode=1). Word wrap mode options:
0: None
1: Word Wrap
2: Character Wrap
3: White Space Wrap
My final definition and word wrap setup looks like this:
self.control = stc.StyledTextCtrl(self, style=wx.TE_MULTILINE)
self.control.SetWrapMode(mode=1)
And that's it! Hope this helped.
Thanks to @Chris Beaulieu for correcting me on an issue with the mode options.
I see you answered your own question, and you are right in every way except for one small detail. There are actually several different wrap modes. The types and values corresponding to them are as follows:
0: None
1: Word Wrap
2: Character Wrap
3: White Space Wrap
So you cannot enter just any value above 0 to get word wrap. In fact, if you enter a value outside 0-3 you should end up with no wrap at all, as the value shouldn't be recognized by Scintilla, which is the library that stc wraps.
It would be more maintainable to use the constants stc.WRAP_NONE, stc.WRAP_WORD, stc.WRAP_CHAR and stc.WRAP_WHITESPACE instead of their numerical values.
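A minimal sketch of the constant-based version follows. The helper names are hypothetical, and the stc module is passed in as a parameter only so the snippet does not need wxPython installed to be read or tried; the constant names are the ones listed above.

```python
# Numeric wrap modes from the list above, keyed to the wx.stc constant names.
WRAP_MODE_NAMES = {
    0: "WRAP_NONE",
    1: "WRAP_WORD",
    2: "WRAP_CHAR",
    3: "WRAP_WHITESPACE",
}

def wrap_constant(stc_module, mode):
    """Look up the named constant for a numeric wrap mode (0-3)."""
    return getattr(stc_module, WRAP_MODE_NAMES[mode])

def enable_word_wrap(control, stc_module):
    """Turn on word wrap for an existing StyledTextCtrl."""
    control.SetWrapMode(stc_module.WRAP_WORD)

# Typical use inside a frame, assuming `import wx.stc as stc`:
#     enable_word_wrap(self.control, stc)
```

Reading stc.WRAP_WORD in the call makes the intent obvious, whereas a bare 1 has to be looked up.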

trying to stem a string in natural language using python-2.7

I am importing from nltk.stem.snowball import SnowballStemmer
and I have a string as follows:
text_string="Hi Everyone If you can read this message youre properly using parseOutText Please proceed to the next part of the project"
I run this code on it:
words = " ".join(stemmer.stem(word) for word in text_string.split(" "))
and I get the following, which has a couple of 'e's missing. I can't figure out what is causing it. Any suggestions? Thanks for any feedback.
"hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project"
You're using it correctly; the output is just what the stemmer produces. Snowball is a rule-based suffix stripper rather than a trained statistical model, and what it returns is a stem, not necessarily a dictionary word, so a dropped final "e" ("everyon", "messag", "pleas") is expected behaviour. We can't expect perfection, but it's annoying when it happens with common words. It also stems "everything" to "everyth", as if it were a verb.
The stemmer accepts the option ignore_stopwords=True, which suppresses stemming of words in the stopword list (common words, often irregular). Unfortunately it doesn't help with the particular examples you ask about.

Specific Regex Failing on Neko and Native

So I'm working on some cleanup in HaxeFlixel, and I need to validate a CSV map, so I'm using a regex to check that it's OK (don't mention the trailing commas; I know that's not valid CSV, but I want to allow it). I think I have a decent regex for doing that, and it seems to work well on Flash, but C++ crashes, and Neko gives me this error: An error occured while running pcre_exec....
Here is my regex. I'm sorry it's long, but I have no idea where the problem is...
^(([ ]*-?[0-9]+[ ]*,?)+\r?\n?)+$
If anyone knows what might be going on, I'd appreciate it.
Thanks,
Nico
P.S. There are probably errors in my regex for checking CSV, but I can figure those out (it's kind of enjoyable). I'd rather just know what specifically could be causing this :)
Edit: ah, I've just noticed this doesn't happen on all strings. Once I narrow down which strings, I will post one. As for what I'm checking for, it's basically just to make sure there's no weird XML header or any non-integer value in the map file. Basically, it should validate this:
1,1,1,1
1,1,1,1
1,1,1,1
or this:
1,1,1,1,
1,1,1,1,
1,1,1,1,
but not:
xml blahh blahh>
1,m,1,1
1,1,b,1
1,1,1,1
xml>
(and yes, I know that's not valid XML ;))
Edit: it gets stranger.
I'm trying to determine which strings crash it, and while this still wouldn't explain a normal map crashing, it's definitely weird, and it has the same result.
What happens is:
This will fail a .match() test, but not crash:
a
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
while this will crash the program:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,*a*,1,1,1,1,1,1,1,1,1,1,1,1,1
To be honest, you wrote one of the worst regexps I have ever seen. It actually looks like it was written specifically to be as slow as possible. I write this not to offend you, but to express how much you need to learn about writing regexps (hint: writing your own regexp engine is a good exercise).
As for your problem, I guess it just runs out of memory (it is extremely memory intensive). I am not sure why it happens only on PCRE targets (both the neko and cpp targets use PCRE), but I guess it is down to per-run memory limits in PCRE, or to heuristics in other targets that correct such miswritten regexps.
I'd suggest something along the lines of
~/^(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n)*(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n?)$/
There, the leading "~/" and the trailing "/" are Haxe regexp markers.
I haven't tested it extensively, just run it on your samples, but it should do the job (probably with a bit of tweaking).
Also, as a hint, I'd suggest splitting the file into lines before running any regexps. It will lower memory usage (you will only need to hold part of your text in memory) and simplify your regexp.
I'd also note that since you will need to parse the CSV anyway (for any properly formed input, which I guess makes up most of your data), it might be much faster to do all the checks while actually parsing.
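The line-by-line idea can be sketched as follows. This is Python rather than Haxe, and both the per-row pattern and the function name is_valid_map are illustrative assumptions, not the code from the question. Each row is checked on its own, and the row pattern requires a comma before every repeated number, so there is only one way to match a given row.

```python
import re

# One row: integers (optionally negative, optionally padded with spaces)
# separated by commas, with an optional trailing comma.
ROW = re.compile(r" *-?[0-9]+ *(?:, *-?[0-9]+ *)*,?")

def is_valid_map(text):
    """Return True when every line of the map is a row of integers."""
    lines = text.splitlines()  # handles both \n and \r\n line endings
    if not lines:
        return False
    return all(ROW.fullmatch(line) is not None for line in lines)

good = "1,1,1,1\n1,1,1,1,\n1,1,1,1\n"
bad = "0,0,0,0\n1,*a*,1,1\n"
checked = (is_valid_map(good), is_valid_map(bad))
```

Because each row is validated separately, a bad row fails quickly instead of forcing the engine to retry every earlier row in a different way.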
Edit: the answer to question "why it eats so much memory"
Well, it is not a short topic, and that's why I suggested writing your own regexp engine. There are some differences between implementations, but in general imagine that a regexp engine works like this:
1. parses your regular expression and builds a graph of all possible states (a state is basically a symbol value and a number of links to other symbols which can follow it).
2. sets up a list of read pointer and state pointer pairs, the current state list, consisting of the regexp's initial state and a pointer to the first letter of the string being matched
3. sets up the read pointer to the first symbol of the string
4. sets up the state pointer to the initial state of the regexp
5. takes one pair from the current state list and stores it as the current state and current read pointer
6. reads the symbol under the current read pointer
7. matches it against the symbols in the states that the current state has links to, and makes a list of the states that matched.
8. if there is a final regexp state in this list, goes to 12
9. for each item in this list, adds a pair of the next read pointer (which is current+1) and the item to the current state list
10. if the current state list is empty, returns false, as the string didn't match the regexp
11. goes to 6
12. here it is in a final state of the matched regexp; returns true, the string matches the regexp.
Of course, there are differences between regexp engines, and some of them eliminate some of these problems, AFAIK. They also have pseudo-symbols and groupings, they need to store the positions that the regexp and its groups matched, and they have lookahead, lookbehind and group back-references, which makes things a bit (to put it mildly) more complex and forces them to use more complex data structures, but the main idea is the same. So here we are, and your problem is clearly visible from the algorithm: the less specific you are about what you want to match, and the more chances the engine has to match the same substring via different paths through the state graph, the more memory and processor time it will consume, exponentially.
Try to model how a regexp engine matches the regexp (a+a+)+b against the strings aaaaaab, ab, aa, aaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. (Don't actually run the last one; it would take hours or days to compute on a modern PC.)
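To get a feel for the growth without running the regexp itself, the sketch below counts in how many distinct ways (a+a+)+ can carve up a run of n letters "a". It is not a trace of PCRE or of any real engine; it only sizes the search space that a naive backtracking matcher would have to exhaust before reporting a failure. The function name decompositions is made up for the illustration.

```python
def decompositions(n):
    """Count the ways repetitions of the group (a+a+) can consume exactly n letters."""
    # ways[m] = number of ways to consume exactly m letters;
    # ways[0] = 1 is the empty base case used by the recurrence.
    ways = [0] * (n + 1)
    ways[0] = 1
    for m in range(2, n + 1):
        # The last repetition of the group consumes k >= 2 letters, and there
        # are k - 1 ways to split those k letters between the two a+ parts.
        ways[m] = sum((k - 1) * ways[m - k] for k in range(2, m + 1))
    return ways[n]

for n in (2, 4, 6, 8, 10):
    print(n, decompositions(n))
```

For the sizes printed here the count doubles with every extra letter, which is why a few more characters turn a slow match into one that never finishes.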
It is also worth noting that some regexp engines work somewhat differently so that they can handle these situations properly, but there are always ways to make a regexp extremely slow.
Another thing to note is that I may have been wrong about memory being the exact problem. In this case it may be processor time too, and even before that it may be the engine's own memory/processor limits kicking in, rather than the system actually running out of memory.

internal code-completion in vim

There's a completion type that isn't listed in the vim help files (notably: insert.txt), but which I instinctively feel the need for rather often. Let's say I have the words "Awesome" and "SuperCrazyAwesome" in my file. I find an instance of Awesome that should really be SuperCrazyAwesome, so I hop to the beginning of the word, enter insert mode, and then must type "SuperCrazy".
I feel I should be able to type "S", creating "SAwesome", and then simply hit a completion hotkey or two to have it find what's to the left of the cursor ("S"), what's to the right ("Awesome"), regex this against all words in the file ("/S\w*Awesome/"), and provide me with a completion popup menu of choices, or just do the replace if there's only one match.
I'd like to use the actual completion system for this. There is a "user defined" completion which uses a function, and the help has a good example of completing from a given list. However, I can't seem to track down many of the particulars I'd need to make this happen, including:
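The lookup itself can be sketched outside Vim. The following is a language-neutral illustration in Python, not a Vim plugin: completion_candidates is a hypothetical name, the buffer text is made up from the two words in the example, and how to obtain the text on either side of the cursor is left open.

```python
import re

def completion_candidates(text, left, right):
    """Find words that start with `left` and end with `right`."""
    pattern = re.compile(
        r"\b" + re.escape(left) + r"\w*" + re.escape(right) + r"\b"
    )
    current = left + right  # the half-typed word itself is not a candidate
    candidates = []
    for word in pattern.findall(text):
        if word != current and word not in candidates:
            candidates.append(word)
    return candidates

buffer_text = "Awesome things, SuperCrazyAwesome things, and one SAwesome typo"
matches = completion_candidates(buffer_text, "S", "Awesome")
```

With a single candidate the replacement could be applied directly; with several, the list is what a completion menu would show.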
How do I get a list of all words in the file from a vim function?
Can I list words from all buffers (with filenames), as vim's complete does?
How do I, in insert mode, get the text in the word before/after the cursor?
Can completion replace the entire word, and not just up to the cursor?
I've been at this for a couple of hours now. I keep hitting dead ends, like this one, which introduced me to \%# for matching at the cursor position, which doesn't seem to work for me. For instance, a search for \w*\%# returns only the first character of the word I'm on, regardless of where I am in it. The \%# doesn't seem to anchor.
Although it's not exactly the method you want, in the past I wrote https://github.com/mjbrownie/swapit, which might do the job if you are looking for related keywords. It would fall down in this scenario if you have hundreds of matches.
It's mainly useful for 2-10 possible sequenced matches.
You would define a list
:SwapList awesomes Awesome MoreAwesome SuperCrazyAwesome FullyCompletelyAwesome UnbelievablyAwesome
and move through the matches with the increment and decrement keys (Ctrl-A and Ctrl-X).
There are also a few other cycling type plugins like swap words that I know of on vim.org and github.
The advantage here is you don't have to group words together with regex.
I wrote something like that years ago when working with third-party libraries that had rather long CamelCasePrefixes on every function, different for each component. That was in the pre-GitHub era and I considered it a lost jewel, but a search engine says I am not a complete ass and did post it to the Vim wiki.
Here it is: http://vim.wikia.com/wiki/Custom_keyword_completion
Just do not ask me what 'MKw' means. No idea.
This will need some adaptation to your needs, as it only looks up the word up to the cursor, but the idea is there. It works for the current buffer only. Iterating through all buffers would be sluggish, as it does not build any index. For that purpose I would go with an external grep.