How to write output from RapidMiner to a txt file? - data-mining

I am using RapidMiner 5.3. I took a small document containing about three English sentences, tokenized it, and filtered the tokens by word length. I want to write the output to a separate document. I tried the Write Document operator, but it does not work: it simply writes the original document to the new file. However, when I write the output to the console, I get the expected result. Something seems to be wrong with Write Document.
Here is my process:
READ DOCUMENT --> TOKENIZE --> FILTER TOKENS --> WRITE DOCUMENT

Try the following:
Use Cut Document (with (\S+) as the regular expression).
Inside the Cut Document operator, use Tokenize with (non letters), followed by Filter Tokens.
Use Combine Documents after Cut Document.
Then use Write Document.
This should do what you want.
Andrew

Related

Can a regex point to a list to find matches

I'm currently using a RegEx procedure in Alteryx to recognize an employee number in a PDF document and split the document into individual PDFs based on the employee number.
Essentially what it does is find the term "Employee" on each page, returns the proceeding six digit number, splits the page out and renames the file using that number. This has, so far, worked fine.
However, I have had some errors/kickouts, and honestly I want to be more sure about the process, so my question is this:
Is there a way to have the regex point to a list of employee numbers (say, in Excel) and split the pages based on matching numbers within the PDF file?
Any help would be greatly appreciated.
dave
A RegEx alone can't do it, but in Alteryx you can add another data stream that reads the list of employee numbers from Excel, then join your data stream to that list. Assume your data stream is the L input and the valid EmpNo list is the R input. Then:
The L output is invalid data stream records: save these for further analysis.
The J output is valid data stream records: continue processing them.
The R output is valid employees not represented in the data stream; retain for further review if interested.
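The three outputs behave like a partition of the two inputs. As a rough illustration outside Alteryx, here is a minimal C++ sketch of the same logic. The names `JoinResult` and `joinOnEmpNo` are hypothetical and are not part of Alteryx; the sketch assumes the employee numbers have already been extracted as strings.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical container mirroring the three outputs of the join.
struct JoinResult {
    std::vector<std::string> leftOnly;   // L: numbers from the data stream that are not in the valid list
    std::vector<std::string> joined;     // J: numbers present in both inputs
    std::vector<std::string> rightOnly;  // R: valid numbers that never appeared in the data stream
};

// Partition the extracted employee numbers against the list of valid ones.
JoinResult joinOnEmpNo(const std::vector<std::string>& extracted,
                       const std::set<std::string>& valid) {
    JoinResult result;
    std::set<std::string> seen;
    for (const std::string& empNo : extracted) {
        if (valid.count(empNo) > 0) {
            result.joined.push_back(empNo);
            seen.insert(empNo);
        } else {
            result.leftOnly.push_back(empNo);
        }
    }
    for (const std::string& empNo : valid) {
        if (seen.count(empNo) == 0) {
            result.rightOnly.push_back(empNo);
        }
    }
    return result;
}
```

Anything that lands in `leftOnly` corresponds to a page whose extracted number is not a known employee, which is where the kickouts would show up.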

How to extract Text with Emoticons from an XML file?

I have an XML file with a lot of tweets and want to extract the text of every tweet which includes an Emoticon.
The XML file looks like this:
<root>
<tweet>
<id>573890929636941824</id>
<name>B&BeyondMagazine</name>
<text>Your Torrent Client May Be Mining Bitcoin Without Telling You http://t.co/xhTdmAYD20</text>
</tweet>
<tweet>
<id>573890929628614656</id>
<name>03/08</name>
<text>#8900Princess that's what I thought you was on off the rip , that's why I said why 😂</text>
</tweet>
</root>
So I need every text-tag value with an emoticon.
I would usually try to use a Regular Expression to identify the strings with Emoticons, but I read that you can't use RegEx with XML files.
How can I do that with an XML Parser (which one) or should I maybe extract all the tweets and use RegEx?
After that, I would have to extract all the emoticons and count the different types; any advice on that would be appreciated, too.
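Once the text values have been pulled out with an XML parser of your choice, detecting and counting emoticons comes down to decoding UTF-8 and checking code-point ranges. The following is a minimal C++ sketch of that second step only. It assumes the input is valid UTF-8 and that "emoticon" means the Unicode Emoticons block (U+1F600 to U+1F64F); the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Decode a UTF-8 string into code points. Assumes the input is valid UTF-8.
std::vector<std::uint32_t> decodeUtf8(const std::string& text) {
    std::vector<std::uint32_t> codePoints;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < text.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(text[i]) & 0x3F);
        }
        codePoints.push_back(cp);
    }
    return codePoints;
}

// Assumption: only the Unicode Emoticons block counts as an emoticon.
bool isEmoticon(std::uint32_t cp) {
    return cp >= 0x1F600 && cp <= 0x1F64F;
}

// Count how often each emoticon code point occurs in one tweet text.
std::map<std::uint32_t, int> countEmoticons(const std::string& text) {
    std::map<std::uint32_t, int> counts;
    for (std::uint32_t cp : decodeUtf8(text)) {
        if (isEmoticon(cp)) ++counts[cp];
    }
    return counts;
}
```

A tweet contains an emoticon when `countEmoticons` returns a non-empty map, and merging the maps of all tweets gives the count per emoticon type. Other symbol blocks can be added to `isEmoticon` if they should count as well.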

Classic ASP: encoding text outside tags with regex

I need a function that can process a string and replace its textual parts with HTML-encoded ones:
Sample 1
Input: "<span>Total amount:<br>€ 50,00</span>"
Output: "<span>Total amount:<br>&euro; 50,00</span>"
Sample 2
Input: "<span>When threshold > x<br>act as described below:</span>"
Output: "<span>When threshold &gt; x<br>act as described below:</span>"
These are simplified cases of course, and yes, I know I could do that with a series of Replace calls for each specific character I need to encode, but I'd rather have a function that recognizes and skips HTML tags using a regex and performs a Server.HTMLEncode on the textual part of the input string.
Any help will be highly appreciated.
I'm not sure why you'd want to do this. Why don't you just pass the innerHTML into your parser using JavaScript and have JavaScript create your span tag? Then you can encode the entire thing. I'd be worried that the encoding here won't add any security to your application, if that's what you are trying to achieve.
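For reference, the behaviour asked for in the question (copy tags through, encode only the text between them) can be sketched as a small state machine instead of a regex. The sketch below is in C++ rather than Classic ASP, `encodeOutsideTags` is a hypothetical name, and it does not call Server.HTMLEncode: it only encodes `&`, `"` and a stray `>`. It also assumes every `<` opens a tag that ends at the next `>`, so a literal `<` in the text would be mistaken for a tag.

```cpp
#include <string>

// Encode &, " and stray > in text; copy everything inside <...> unchanged.
// Assumption: every '<' opens a tag that is closed by the next '>'.
std::string encodeOutsideTags(const std::string& input) {
    std::string output;
    bool insideTag = false;
    for (char ch : input) {
        if (insideTag) {
            output.push_back(ch);
            if (ch == '>') insideTag = false;
        } else if (ch == '<') {
            insideTag = true;
            output.push_back(ch);
        } else if (ch == '>') {
            output += "&gt;";
        } else if (ch == '&') {
            output += "&amp;";
        } else if (ch == '"') {
            output += "&quot;";
        } else {
            output.push_back(ch);  // other bytes, including non-ASCII ones, pass through
        }
    }
    return output;
}
```

Applied to the input of Sample 2, this leaves `<span>` and `<br>` alone and turns the `>` after "threshold" into `&gt;`.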

How do you remove text from example sets before processing the data?

I am using RapidMiner 5.3.013. I am reading from an Excel file with thousands of rows of worklogs from Remedy. I want to remove text based on the regex ^[A-Z][\w\d/?(# ]+[\w0-9#)]{2}: and then use Process Documents from Data. So far I have not figured out how to do this. I could probably just write VBA, but I would like to know how it can be done in RapidMiner.
Having read the Excel data, make sure the field to be processed by the Process Documents operator is of type text; use the Nominal to Text operator for this. Inside the Process Documents loop, split the data into tokens using the Tokenize operator, then use the Filter Tokens operator to remove any tokens you don't want. This operator takes a regular expression as a parameter. Make sure the invert flag is set on this operator so that matching tokens are removed rather than kept.
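To make the tokenize-then-filter logic concrete outside RapidMiner, here is a minimal C++ sketch using `std::regex`. The function names `tokenize` and `dropMatching` are hypothetical, the tokenizer splits on ASCII non-letters only, and treating the filter as a full match on each token is an assumption about how the operator behaves.

```cpp
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Split text into tokens at every non-letter character (ASCII letters only).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalpha(ch)) {
            current.push_back(static_cast<char>(ch));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Keep only the tokens that do NOT fully match the pattern (the inverted filter).
std::vector<std::string> dropMatching(const std::vector<std::string>& tokens,
                                      const std::regex& unwanted) {
    std::vector<std::string> kept;
    for (const std::string& token : tokens) {
        if (!std::regex_match(token, unwanted)) kept.push_back(token);
    }
    return kept;
}
```

Without the inversion the condition would be flipped, and only the tokens matching the pattern would survive.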

Extract specific portion of HTML file using c++/boost::regex

I have a series of thousands of HTML files and, for the ultimate purpose of running a word-frequency counter, I am only interested in a particular portion of each file. For example, suppose the following is part of one of the files:
<!-- Lots of HTML code up here -->
<div class="preview_content clearfix module_panel">
<div class="textelement "><div><div><p><em>"Portion of interest"</em></p></div>
</div>
<!-- Lots of HTML code down here -->
How should I go about using regular expressions in C++ (boost::regex) to extract that particular portion of text highlighted in the example and put that into a separate string?
I currently have some code that opens the HTML file and reads the entire content into a single string, but when I try to run a boost::regex_match looking for that particular beginning of line <div class="preview_content clearfix module_panel">, I don't get any matches. I'm open to any suggestions as long as they're in C++.
"How should I go about using regular expressions in C++ (boost::regex) to extract that particular portion of text highlighted in the example and put that into a separate string?"
You don't.
Never use regular expressions to process HTML. Whether in C++ with Boost.Regex, in Perl, Python, JavaScript, anything and anywhere. HTML is not a regular language; therefore, it cannot be processed in any meaningful way via regular expressions. Oh, in extremely limited cases, you might be able to get it to extract some particular information. But once those cases change, you'll find yourself unable to get done what you need to get done.
I would suggest using an actual HTML parser, like libxml2 (which does have the ability to read HTML4). Using regexes to parse HTML is simply using the wrong tool for the job.
Since all I needed was something quite simple (as per question above), I was able to get it done without using regex or any type of parsing. Following is the code snippet that did the trick:
// Needs <fstream>, <iterator>, and <string>
// Read HTML file into string variable str
std::ifstream t("/path/inputFile.html");
std::string str((std::istreambuf_iterator<char>(t)), std::istreambuf_iterator<char>());
// Find the two "flags" that enclose the content I'm trying to extract
size_t pos1 = str.find("<div class=\"preview_content clearfix module_panel\">");
// Look for the closing flag only after the opening one
size_t pos2 = (pos1 == std::string::npos) ? pos1 : str.find("</em></p></div>", pos1);
// Get that content and store into new string (left empty if a flag is missing)
std::string buf;
if (pos2 != std::string::npos) buf = str.substr(pos1, pos2 - pos1);
Thank you for pointing out the fact that I was totally on the wrong track.
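For thousands of files, the same find-and-substr idea may be easier to reuse as a function that reports when a marker is missing. This is a minimal sketch, not the code from the post: `extractBetween` is a hypothetical helper, and unlike the snippet it returns only the text strictly between the two markers, without the opening marker itself.

```cpp
#include <cstddef>
#include <string>

// Copy the text strictly between the first startMarker and the next endMarker
// into out. Returns false (and leaves out untouched) if either marker is missing.
bool extractBetween(const std::string& html,
                    const std::string& startMarker,
                    const std::string& endMarker,
                    std::string& out) {
    const std::size_t start = html.find(startMarker);
    if (start == std::string::npos) return false;
    const std::size_t contentBegin = start + startMarker.size();
    // Search for the end marker only after the start marker.
    const std::size_t end = html.find(endMarker, contentBegin);
    if (end == std::string::npos) return false;
    out = html.substr(contentBegin, end - contentBegin);
    return true;
}
```

Checking the return value for each file makes it easy to log the files whose layout differs instead of silently producing a wrong substring.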