Weka with Missing Values

I have the same question about Weka as this person:
Hi all:
I am really puzzled by WEKA's behavior here.
I have prepared a CSV file which has lots of missing values. A missing value in this file is simply the absence of any value between a pair of commas, i.e. ,random_value1,,random_value2. This is an example of the format: you can see there is a pair of commas with nothing between them, not even a white space, and it should indicate a missing value in the data.
The weird thing is that when I read this CSV into WEKA, WEKA assigns every missing value a question mark, i.e. '?'. This is exactly how WEKA displays it.
Then, when I run my analysis, WEKA starts treating these '?' as some sort of useful information. They are just missing values; could WEKA please just skip over them?
This problem has become a real waste of time. The analysis results read like: if missing then value missing, missing associates with missing, missing correlates with missing.
Can WEKA read missing values as missing values, not as some sort of question mark? Or can I tell WEKA to treat all '?' as missing values?
Thanks guys
He solved his problem with this solution:
I found a way to tell WEKA about the missing values. Just use the find-and-replace function of an ASCII editor to replace every '?' with ?.
But I don't know how to download an ASCII editor or how to use one. Can anyone tell me how?

I suggest you use Notepad2 or Notepad++ on Windows.
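
If you would rather script the fix than install an editor, here is a minimal Python sketch of the same find-and-replace, assuming the data was saved as an ARFF file (in ARFF a bare ? marks a missing value, while the quoted string '?' is an ordinary value); the file names are hypothetical:

# Replace quoted '?' tokens with bare ? so WEKA treats them as missing.
with open("data.arff") as src:
    text = src.read()
with open("data_fixed.arff", "w") as dst:
    dst.write(text.replace("'?'", "?"))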

You don't have to do anything special with missing values. Different algorithms handle missing values differently, so don't worry: they will be handled just the way they should be.

Related

VSCode Snippets: Format File Name from my_file_name to MyFileName

I am creating custom snippets for Flutter/Dart. My goal is to pull the file name (TM_FILENAME_BASE), remove all of the underscores, and convert it to PascalCase (or camelCase).
Here is a link to what I have learned so far regarding regex and vscode's snippets.
https://code.visualstudio.com/docs/editor/userdefinedsnippets
I have been able to remove the underscores nicely with the following code
${TM_FILENAME_BASE/[\\_]/ /}
I can even make it all caps
${TM_FILENAME_BASE/(.*)/${1:/upcase}/}
However, it seems that I cannot do two steps at a time. I am not familiar with regex; this is just me fiddling around with it for the last couple of days.
If anyone could help out a fellow programmer just trying to make coding simpler, it would be really appreciated!
I expect the output of "my_file_name" to be "MyFileName".
It's as easy as that: ${TM_FILENAME_BASE/(.*)/${1:/pascalcase}/}
For the camelCase version you mentioned, you can use:
${TM_FILENAME_BASE/(.*)/${1:/camelcase}/}
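
To see what those transforms produce, here is a small Python sketch that emulates them on a file name (this only illustrates the expected output; it is not how VS Code implements the transforms):

def pascal_case(name):
    # Capitalize each underscore-separated piece: my_file_name -> MyFileName
    return "".join(part.capitalize() for part in name.split("_"))

def camel_case(name):
    # Same, but keep the first piece lowercase: my_file_name -> myFileName
    parts = name.split("_")
    return parts[0].lower() + "".join(part.capitalize() for part in parts[1:])

print(pascal_case("my_file_name"))  # MyFileName
print(camel_case("my_file_name"))   # myFileName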

RegEx: Extract formatted number from string if it exists

First off, I'm sorry if this has already been asked somewhere else; I just could not find it. If it has been, I apologize deeply.
I am terrible at Regular Expressions and generally avoid them but I know the problem I have can be simply solved using them so I have come here for help.
I have a text field containing some information about a company (name, address, identifier, etc.), but not all of the information always appears in the field, and the order the information appears in is not fixed.
What I need is the company identifier, which is a 14-digit number that may or may not be formatted like this: XX.XXX.XXX/XXXX-XX
What expression could I use that would identify if there are either 14 digits in a row or the number formatted in the manner described above?
/[0-9]{2}[.]{1}[0-9]{3}[.]{1}[0-9]{3}[\/]{1}[0-9]{4}[-]{1}[0-9]{2}/ for XX.XXX.XXX/XXXX-XX
/[0-9]{14}/ for 14 digits
There are probably some edge cases in here somewhere.
There is probably also a way to do both of these in one expression, but I don't have the patience or the time to figure it out.
Try Regex: \b(?:\d{14}|\d{2}\.\d{3}\.\d{3}\/\d{4}-\d{2})\b
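A quick Python sketch of that combined pattern, run against a couple of made-up sample strings, just to show it matching both forms:

import re

# Either 14 bare digits or the XX.XXX.XXX/XXXX-XX formatted identifier.
pattern = re.compile(r'\b(?:\d{14}|\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2})\b')

for text in ["id 12.345.678/0001-90 on file",       # hypothetical samples
             "identifier 12345678000190 in text"]:
    match = pattern.search(text)
    print(match.group(0) if match else None)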

Untranslatable character when extracting dates from strings

I am attempting to extract dates from a free-text field (because our process is awesome like that :\ ) and keep hitting Teradata error 6706. The regex I'm using is REGEXP_SUBSTR(original_field,'(\d{2})\/(\d{2})\/(\d{4})',1) AS new_field. I'm unsure of the field's type; HELP TABLE has a blank in the Type column for the field.
I've already tried converting using TRANSLATE(col USING LATIN_TO_UNICODE), as well as UNICODE_TO_LATIN, those both actually cause the error by themselves. A straight CAST(original_field AS VARCHAR(255)) doesn't fix the issue, though that cast does work. I've also tried stripping various special characters (new-line, carriage return, etc.) from the field before letting the REGEXP_SUBSTR take a crack at it, both by itself and with the CAST & TRANSLATEs I already mentioned.
At this point I'm not sure what the issue could be, and could use some guidance on additional options to try.
The final version that worked ended up being
, CASE
      WHEN TRANSLATE_CHK(field USING LATIN_TO_UNICODE) = 0 THEN
          REGEXP_SUBSTR(TRANSLATE(field USING LATIN_TO_UNICODE), '(\d{2})\/(\d{2})\/(\d{4})', 1)
      ELSE NULL
  END AS Ref_Date
For whatever reason, using a TRIM inside the TRANSLATE seems to cause an issue. Only once I stripped any and all functions from inside the TRANSLATE did the TRANSLATE, and thus the REGEXP_SUBSTR, work.
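
The date pattern itself is easy to sanity-check outside the database; here is a small Python sketch with a hypothetical sample string:

import re

# Same shape as the REGEXP_SUBSTR pattern: two digits / two digits / four digits.
date_pattern = re.compile(r'(\d{2})/(\d{2})/(\d{4})')

free_text = "Customer called on 03/14/2019 about an invoice."  # hypothetical
match = date_pattern.search(free_text)
print(match.group(0) if match else None)  # 03/14/2019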

trying to stem a string in natural language using python-2.7

I am importing the stemmer with from nltk.stem.snowball import SnowballStemmer, and I have the following string:
text_string="Hi Everyone If you can read this message youre properly using parseOutText Please proceed to the next part of the project"
I run this code on it:
stemmer = SnowballStemmer("english")
words = " ".join(stemmer.stem(word) for word in text_string.split(" "))
and I get the following, which has a couple of 'e's missing. I can't figure out what is causing this. Any suggestions? Thanks for the feedback.
"hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project"
You're using it correctly; it's the stemmer that's acting weird. It could be caused by too little training data, the wrong balance, or simply a wrong conclusion by the stemmer's statistical algorithm. We can't expect perfection, but it's annoying when it happens with common words. It's also stemming "everything" to "everyth", as if it were a verb. At least there it's clear what it's doing. But "-e" is not a suffix in English...
The stemmer accepts the option ignore_stopwords=True, which suppresses stemming of words in the stopword list (these are common, usually irregular words that Porter saw fit to exclude from the training set because he got worse results when they were included). Unfortunately it doesn't help with the particular examples you ask about.
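For reference, here is a minimal sketch of that option (the word list is just an illustration):

from nltk.stem.snowball import SnowballStemmer

# ignore_stopwords=True leaves words on the stemmer's stopword list unstemmed.
stemmer = SnowballStemmer("english", ignore_stopwords=True)
for word in ["message", "properly", "everyone", "having"]:
    print("%s -> %s" % (word, stemmer.stem(word)))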

fuzzy matching japanese strings in python?

This problem has had me stumped for the whole day.
I have two Japanese strings that I want to fuzzy-match in Python 2.7. Currently I'm using fuzzywuzzy:
from fuzzywuzzy import process
jpnStr = "日本語".encode('utf-8')
jpnList = ["日本語1".encode('utf-8'),"日本語2".encode('utf-8'),"日本語3".encode('utf-8')]
bestmatch = process.extractOne(jpnStr, jpnList)
but the resulting bestmatch is always
("日本語1",0)
How would I go about resolving this issue, or is there a best practice that I'm totally missing here? Sorry if I sound frustrated; it's been a roadblock for a while. Thanks in advance.
OK, I'm not sure how helpful this is, but I've found a workaround.
I found that I could fuzzy-match Japanese strings using fuzzywuzzy.
First, you get the Unicode Japanese string, i.e. "日本語です".
Then you write it out as ASCII text to a text file. The output will look something like "\uf34\ufeac\uewa3..." and so on.
Then you read the text file back and compare the ASCII representations of the Japanese strings against each other. This gives a workable fuzzywuzzy match rating.
It's probably not an ideal method, but it works and is fairly accurate. Hope this helps somebody.
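
Here is a minimal Python 2.7 sketch of that workaround, using unicode_escape to get the ASCII representation in memory rather than round-tripping through a text file:

# -*- coding: utf-8 -*-
from fuzzywuzzy import process

# Compare ASCII escape representations (e.g. "\u65e5\u672c\u8a9e")
# instead of the raw UTF-8 bytes that scored 0 in the question.
jpnStr = u"日本語".encode("unicode_escape")
jpnList = [u"日本語1".encode("unicode_escape"),
           u"日本語2".encode("unicode_escape"),
           u"日本語3".encode("unicode_escape")]
print(process.extractOne(jpnStr, jpnList))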