trying to stem a string in natural language using python-2.7

trying to stem a string in natural language using python-2.7 - python-2.7

I am importing from nltk.stem.snowball import SnowballStemmer
and I have a string as follows:
text_string="Hi Everyone If you can read this message youre properly using parseOutText Please proceed to the next part of the project"
I run this code on it:
words = " ".join(stemmer.stem(word) for word in text_string.split(" "))
and I get the following which has a couple of 'e' missing. Can't figure out what is causing it. Any suggestions? Thanks for the feedbacks
"hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project"

You're using it correctly; it's the stemmer that's acting weird. It could be caused by too little training data, or the wrong balance, or simply the wrong conclusion by the stemmer's statistical algorithm. We can't expect perfection, but it's annoying when it happens with common words. It's also stemming "everything" to "everyth", as if it's a verb. At least here it's clear what it's doing. But "-e" is not a suffix in English...
The stemmer allows the option ignore_stopwords=True, which will suppress stemming of words in the stopword list (these are common words, usually irregular, that Porter thought fit to exclude from the training set because he got worse results when they are included.) Unfortunately it doesn't help with the particular examples you ask about.

Related

How can I use RegEx in Source Graph properly?

I have virtually no knowledge of how to use Source Graph but I do know what Source Graph is and what RegEx is and its application across platforms. I am trying to learn how to better search for strings, variables, etc. in Source Graph so I can solve coding issues at work. I am not a coder/programmer/engineer but I have some general knowledge of programming in C and Python and using Query Languages.
I have gone to Source Graph's instructional page about RegEx but I honestly have a hard time understanding it.
Example:
I am trying to find "Delete %(folder_name)s and %(num_folders)s other folder from your ..." without the actual quotes and ellipses.
That is how I receive the code at work but this apparently is not how it is represented in Source Graph in its source file.
If I copy and paste that above line into Source Graph, I get no returns.
Here is what I found how the source file actually looks like in Source Graph:
"Delete \u201c%(folder_name)s\u201d and %(num_folders)s other folder from your ..." , again without actual quotes and ellipses.
I would have no idea that the \u201c and \201d were there in the original code. Is there a way around this?
What I usually have to work with and figure out how to find in Source Graph are singular variables or strings:
%(num_folders)s
This is a problem because the fewer items I have for searching, the harder it is to hunt down their source. I don't know who the author/engineer is until I find the code in Source Graph and check the blame feature (sadly it's a little disorganized at my work).
Sorry if this doesn't make any sense. This is my very first Stack Overflow post.

I can't the snippet you mentioned on sourcegraph.com, so I assume you are hosting Sourcegraph yourself.
In general, you could search for a term like Delete \u201c%(folder_name)s without turning on regular expressions to get literal matches. If you want to convert this into a regular expression, you would need to escape it like this:
Delete \\u201c%\(folder_name\)s
If %(folder_name) is meant to be a placeholder for any other expression, try this one instead:
Delete .*s and .*s other folder from your
https://regex101.com/ is my personal recommendation for learning more about how regular expressions work.

VSCode Snippets: Format File Name from my_file_name to MyFileName

I am creating custom snippets for flutter/dart. My goal is to pull the file name (TM_FILENAME_BASE) remove all of the underscores and convert it to PascalCase (or camelCase).
Here is a link to what I have learned so far regarding regex and vscode's snippets.
https://code.visualstudio.com/docs/editor/userdefinedsnippets
I have been able to remove the underscores nicely with the following code
${TM_FILENAME_BASE/[\\_]/ /}
I can even make it all caps
${TM_FILENAME_BASE/(.*)/${1:/upcase}/}
However, it seems that I cannot do two steps at a time. I am not familiar with regex, this is just me fiddling around with this for the last couple of days.
If anyone could help out a fellow programmer just trying make coding simpler, it would be really appreciated!
I expect the output of "my_file_name" to be "MyFileName".

It's as easy as that: ${TM_FILENAME_BASE/(.*)/${1:/pascalcase}/}

For the camelCase version you mentioned, you can use:
${TM_FILENAME_BASE/(.*)/${1:/camelcase}/}

how to match a sentence having particular word in different patterns

We have a problem here...
We have a text having different patterns of sentences.
We want to get the sentence having a particular word.
Eg:
One further point, by way of providing another model. The analysis in
the second paragraph could lead in the following direction. 'The
Destructors' deals with, obviously, destruction, whilst the book of
Genesis deals with creation. The vocabulary is similar: Blackie
notices that 'chaos had advanced', an ironic reversal of God's
imposing of form on a void. Furthermore, the phrase 'streaks of light
came in through the closed shutters where they worked with the
seriousness of creators', used in the context of destruction, also
parodies the creation of light and darkness in the early passages of
the Biblical book. Greene's ironic use of the vocabulary of the Bible
might be making the point that, for him, the Second World War
signalled the end of a particular Christian era. Now, it is perfectly
arguable that the rise of fascism is linked to this, or that it is the
cause. The cult of personality and secular leadership has, for Greene,
taken over from the key role of the church in Western societies. In
this way the two main themes identified above - the tension between
individual and community, and religion - are linked. In terms of essay
writing this link could well be made after the discussion of the theme
of the individual and the community, and its links with the theme of
leadership. This might be the general conclusion to the essay. After
thoughtful consideration and interpretation a student may well decide
that this is what the (destructors.)' boils down to: Greene is making a
clear link between the rise of fascism and the decline of the Church's
influence. Despite the fact that fascism has been recently defeated,
Greene sees the lack of any contemporary values which could provide
social cohesion as providing the potential for its reappearance.
In the above text, we have bold words (destructors). We want to get the sentences which are having the word "destructors".
The word "destructors" can be present in different formats. Eg: (destructors), (DesTrucTors), (Des.tructors), DESTRUCTORS, destructors, des-tructors.
When we tried writing a regex to match the sentences, we are failing to get the sentences at some conditions(like we are getting half sentences, etc.,).
Could you please help us with this.
If this information doesn't help you to solve, please let us know. Will update it.
Thank you...

I'm not too sure about Python, but I believe this might work:
for match in re.finditer(r"[^.]*destructors[^.]*\.[^\w\s]*", subject, re.IGNORECASE):
# match start: match.start()
# match end (exclusive): match.end()
# matched text: match.group()
In any case, I think the regex you want is:
[^.]*destructors[^.]*\.[^\w\s]*
with the case insensitive and global flags set.

It will be helpful if you could provide the regex pattern which you have tried with so far. The best I can come up with is,
str_text='your text here containing DESTRUCTORS'
match=re.search('pass all the destructors combination here', str_text, flags=re.IGNORECASE)
Try for more patterns available for string formatting with regex here,https://docs.python.org/3/library/re.html

word to syllable converter

I am writing a piece of code in c++ where in i need a word to syllable converter is there any open source standard algorithm available or any other links which can help me build one.
for a word like invisible syllable would be in-viz-uh-ble
it should be ideally be able to even parse complex words like "invisible".
I already found a link for algorithm in perl and python but i want to know if any library is available in c++
Thanks a lot.

Your example shows a phonetic representation of the word, not simply a split into syllables. This is a complex NLP issue.
Take a look at soundex and metaphone. There are C/C++ implementation for both.
Also many dictionaries provide the IPA notation of words. Take a look a Wiktionary API.

For detecting syllables in words, you could adapt a project of mine to your needs.
It's called tinyhyphenator.
It gives you an integer list of all possible hyphenation indices within a word. For German it renders quite exactly. You would have to obtain the index list and insert the hyphens yourself.
By "adapt" I mean adding the specification of English syllables. Take a look at the source code, it is supposed to be quite self explanatory.

Finding type of break in icu::BreakIterator

I'm trying to understang how to use icu::BreakIterator to find specific words.
For example I have following sentence:
To be or not to be? That is the question...
Word instance of break iterator would put breaks there:
|To| |be| |or| |not| |to| |be|?| |That| |is| |the| |question|.|.|.|
Now, not every pair of break points is actual word.
In derived class icu::RuleBasedBreakIterator there is a "getRuleStatus()" that returns some kind of information about break, and it gives "Word status at following points (marked "/")"
|To/ |be/ |or/ |not/ |to/ |be/?| |That/ |is/ |the/ |question/.|.|.|
But... It all depends on specific rules, and there is absolutely no documentation to understand it (unless I just try), but what would happend with different locales and languages where dictionaries are used? what happens with backware iteration?
Is there any way to get "Begin of Word" or "End of Word" information like in Qt QTextBoundaryFinder: http://qt.nokia.com/doc/4.5/qtextboundaryfinder.html#BoundaryReason-enum?
How should I solve such problem in ICU correctly?

Have you tried the ICU documentation? It appears to explain everything you are asking about including handling of internationalisation, reverse iteration, and the rules, both default and how to create your own custom set. They also have code snippets to help.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js