Read MS Word .doc file with ruby and win32ole

I'm trying to read a .doc file with Ruby, using the win32ole library.
Here is my code:
require 'win32ole'
class DocParser
  def initialize
    @content = ''
  end
  def read_file file_path
    begin
      word = WIN32OLE.connect( 'Word.Application' )
      doc = word.activedocument
    rescue
      word = WIN32OLE.new( 'Word.Application' )
      doc = word.documents.open( file_path )
    end
    word.visible = false
    doc.sentences.each{ |x| @content = @content + x.text }
    word.quit
    @content
  end
end
I start reading the document with DocParser.new.read_file('path/file.doc').
When I run this from rails c, I don't have any problems; it works fine.
But when I run it inside the running Rails application (e.g. after a button click), once in a while (roughly every third or fourth time) this code crashes with the error:
WIN32OLERuntimeError (failed to create WIN32OLE object from `Word.Application'
HRESULT error code:0x800401f0
CoInitialize has not been called.):
lib/file_parsers/doc_parser.rb:14:in `initialize'
lib/file_parsers/doc_parser.rb:14:in `new'
lib/file_parsers/doc_parser.rb:14:in `rescue in read_file'
lib/file_parsers/doc_parser.rb:10:in `read_file'
lib/search_engine.rb:10:in `block in search'
lib/search_engine.rb:43:in `block in each_file_in'
lib/search_engine.rb:42:in `each_file_in'
lib/search_engine.rb:8:in `search'
app/controllers/home_controller.rb:9:in `search'
Rendered c:/Ruby193/lib/ruby/gems/1.9.1/gems/actionpack-4.1.1/lib/action_dispatch/middleware/templates/rescues/_source.erb (0.0ms)
Rendered c:/Ruby193/lib/ruby/gems/1.9.1/gems/actionpack-4.1.1/lib/action_dispatch/middleware/templates/rescues/_trace.text.erb (2.0ms)
Rendered c:/Ruby193/lib/ruby/gems/1.9.1/gems/actionpack-4.1.1/lib/action_dispatch/middleware/templates/rescues/_request_and_response.text.erb (2.0ms)
Rendered c:/Ruby193/lib/ruby/gems/1.9.1/gems/actionpack-4.1.1/lib/action_dispatch/middleware/templates/rescues/diagnostics.erb (56.0ms)
Additionally, sometimes this code reads the .doc file successfully, but Rails crashes after a few seconds:
look at this gist
What is my problem? How can I fix it?
Please, help!

I don't know the difference between rails c and the Rails server, so I'll give some general advice.
First, it is a bad idea to run this in a web server: Word is started on the server for every request, so what happens if multiple users use this at the same time?
You'd be better off converting your .doc files to another format first, such as .rtf or .docx (a batch conversion?), and then using other gems that don't require Word itself.
If you keep it like this, consider not closing Word (remove the word.quit) and only closing the document itself; the running instance will then be picked up the next time by WIN32OLE.connect.
While testing, you'd better keep Word visible so that you can see what is happening (errors?).
I notice your path uses forward slashes, while in this case backslashes are needed, but since your code runs a few times before the error, I suppose that is not the problem.
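The last few suggestions can be sketched roughly as follows. This is only an illustration under assumptions: windows_path and read_doc_text are made-up helper names, the word argument is expected to be a Word.Application object obtained elsewhere (for example via WIN32OLE.connect or WIN32OLE.new), and calling close on the document without arguments is assumed to be enough for a file that was only read.

```ruby
# Hypothetical helper: convert forward slashes to Windows-style backslashes.
def windows_path(path)
  path.tr('/', '\\')
end

# Hypothetical helper: collect the text of every sentence in one document.
# `word` is assumed to be a Word.Application object created by the caller,
# so the caller decides how long Word itself keeps running.
def read_doc_text(word, file_path)
  doc = word.documents.open(windows_path(file_path))
  text = String.new
  doc.sentences.each { |sentence| text << sentence.text }
  text
ensure
  # Close only the document; Word is deliberately left running so that a
  # later WIN32OLE.connect can pick up the same instance.
  doc.close if doc
end
```

Passing the Word object in, rather than creating it inside the method, keeps the "start Word" and "read one file" steps separate, which makes it easier to reuse a single instance across several files.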
Hope this helps.

I upgraded my Ruby from 1.9.3 to 2.0.0.
Now Rails doesn't crash, and I have no problems with win32ole or with reading old-version MS Word documents.
I guess the problem was memory usage, because newer Ruby (2.0.0 and later) uses a new garbage collector.

Related

How can I use regex to construct an API call in my Jekyll plugin?

I'm trying to write my own Jekyll plugin to construct an API query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills, so I'm looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
  class Scryfall < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
    end
    def render(context)
      # Store the name of the card, ie "Arbor Elf"
      @card_name =
      # Store the name of the set, ie "M13"
      @card_set =
      # Build the query
      @query = "https://api.scryfall.com/cards/named?exact=#{@card_name}&set=#{@card_set}"
      # Store a specific JSON property
      @card_art =
      # Finally we render out the result
      "<img src='#{@card_art}' title='#{@card_name}' />"
    end
  end
end
Liquid::Template.register_tag('cards', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempt, after Googling around, was to use a regex to split @text at the |, like so:
@card_name = "#{@text}".split(/| */)
This didn't quite work; instead it output this:
["A", "r", "b", "o", "r", " ", "E", "l", "f", " ", "|", " ", "M", "1", "3", " "]
I'm also not sure how to access and store specific properties within the JSON response. Ideally, I could do something like this:
@card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.

Actually, your split should work; you just need to give it the correct regex (and you can call it on @text directly). You need to escape the pipe character, because an unescaped pipe means alternation in a regex: /| */ reads as "the empty string, or any number of spaces", which matches between every pair of characters and is why you got one element per character. You can use rubular.com to experiment with regexes.
parts = @text.split(/\|/)
# => ["Arbor Elf ", " M13"]
Note that they also contain some extra whitespace, which you can remove with strip.
@card_name = parts.first.strip
@card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
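One possible way to handle those edge cases is sketched below. The helper name parse_card_markup and the choice of ArgumentError are assumptions made for illustration (a real Liquid tag might prefer to raise a Liquid error instead), and stripping double quotes assumes card names never contain them.

```ruby
# Hypothetical helper: turn tag markup such as ' "Arbor Elf | M13" ' into
# a [card_name, card_set] pair, or raise with a readable message.
def parse_card_markup(markup)
  # Drop surrounding whitespace and the double quotes used in the tag.
  cleaned = markup.strip.delete('"')
  # A limit of -1 keeps trailing empty fields, so "Arbor Elf |" still
  # produces two parts and is rejected below for having an empty set.
  parts = cleaned.split('|', -1).map(&:strip)
  unless parts.size == 2 && parts.none?(&:empty?)
    raise ArgumentError, "expected 'Card Name | SET', got #{markup.inspect}"
  end
  parts
end

card_name, card_set = parse_card_markup(' "Arbor Elf | M13" ')
puts "#{card_name} / #{card_set}"
```

Input with no pipe, more than one pipe, or an empty name or set raises the error instead of silently building a broken query.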
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters, exact, set and Dragons. You need to encode the user input to be included in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(@card_name)}&set=#{CGI.escape(@card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with Net::HTTP and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.
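As a starting point, that step might look something like the sketch below. The method names are invented for illustration, and the response shape is an assumption: the code expects a JSON object with an "image_uris" object containing a "large" key, which you should check against a real response before relying on it.

```ruby
require 'cgi'
require 'json'
require 'net/http'
require 'uri'

# Build the query URL, escaping the user-supplied values.
def scryfall_uri(card_name, card_set)
  URI("https://api.scryfall.com/cards/named?exact=#{CGI.escape(card_name)}&set=#{CGI.escape(card_set)}")
end

# Assumption: the body is a JSON object with an "image_uris" object that
# has a "large" key. Returns nil when either key is missing.
def large_image_from(json_body)
  JSON.parse(json_body).dig('image_uris', 'large')
end

# Perform the GET request and extract the image URL from the response.
def fetch_card_art(card_name, card_set)
  response = Net::HTTP.get_response(scryfall_uri(card_name, card_set))
  unless response.is_a?(Net::HTTPSuccess)
    raise "Scryfall request failed with status #{response.code}"
  end
  large_image_from(response.body)
end
```

Keeping the URL building and the JSON handling in separate methods means both can be checked without touching the network; only fetch_card_art actually makes a request.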

Prevent escaping slashes in url helpers (specifically when called using .send() to url_helpers)

Perhaps the title of this question is cryptic, but the problem is real. I've just upgraded an application from Rails 3 to 4 and encountered the following issue (on both Ruby 2.0 and 2.1):
I have a method which calls several url helpers in a loop, using send(), like this:
class Sitemap
include Rails.application.routes.url_helpers
#...
# regions = [:city, :county, :zip]
regions.each do |region|
url_params = ... # [name, additional_id]
send("#{region}_url", url_params)
end
In Rails 3 the above resulted in urls like http://example.com/cities/atlanta/2
In Rails 4 I get http://example.com/cities/atlanta%2f2
The slash gets CGI-escaped, which I don't want. I use this to generate the sitemap XML for my site, and it seems to work even if the forward slash is escaped, but it looks ugly and I don't know if it will work correctly for all bots or clients.
UPDATE: after some debugging I've found out that the CGI escaping occurs somewhere in ActionDispatch::Journey::Visitors::Formatter
Router::Utils.escape_segment() # method call somewhere in
ActionDispatch::Journey::Visitors::Formatter # during
Visitors::Formatter.new(path_options).accept(path.spec) # in
@set.formatter.generate(:path_info, named_route, options, recall, PARAMETERIZE) # in a call to
ActionDispatch::Routing::RouteSet::Generator#generate

I was able to fix the issue with escaped slashes when generating the URL. Here's the change:
# from:
send(region_url, url_params)
# to:
send(region_url, { id: url_params[0], market_id: url_params[1]})
# sometimes url_params is a one element array
The key is to provide a hash of parameters with explicitly assigned keys and values.
I use send to dynamically call a method (xxx_url) from the url_helpers module, which is included in my model. The url_params array looks like ['some-slug', 12].
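Since url_params is sometimes a one-element array, the hash can be built from however many values are present. A minimal sketch, assuming the key names :id and :market_id from the change above (PARAM_KEYS and named_url_params are names made up for this example):

```ruby
# Assumed parameter names, in positional order; adjust to match your routes.
PARAM_KEYS = [:id, :market_id].freeze

# Pair each positional value with its key. Taking only as many keys as
# there are values avoids a nil entry when the array has one element.
def named_url_params(url_params)
  Hash[PARAM_KEYS.first(url_params.size).zip(url_params)]
end

p named_url_params(['some-slug', 12])
p named_url_params(['some-slug'])
```

The result can then be passed as the single argument to send("#{region}_url", ...).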

how to read only URL from txt file in MATLAB

I have a text file containing multiple URLs along with other information about each URL. How can I read the txt file and save only the URLs in an array so that I can download them? I want to use
C = textscan(fileId, formatspec);
What should I mention in formatspec for URL as format?

This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are described here.
For URLs, also refer here or here for examples in other languages.
Here's an example in MATLAB:
% This string is obtained through textscan or something
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% find URLs
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');
C{:}
Result:
ans =
'http://www.example.com/index.php?query=test&otherStuf=info'
ans =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
Note that this regex requires you to have the protocol included, or have a leading www. or ftp.. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
You could go on and make the regex cover more and more cases. However, eventually you'll stumble upon the most important conclusion (as made here, for example, which is where I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex able to always match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.
But please keep in mind that this last statement is more theoretical rather than practical -- those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, you're pretty much covered with the regex I gave you.
Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" to the code (in my timings, the difference was about 40x slower, excluding the imports). Here's my version:
import java.net.URL;
import java.net.MalformedURLException;
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% Attempt to convert each item into an URL.
for ii = 1:numel(str)
cc = textscan(str{ii}, '%s');
for jj = 1:numel(cc{1})
try
url = java.net.URL(cc{1}{jj})
catch ME
% rethrow any non-url related errors
if isempty(regexpi(ME.message, 'MalformedURLException'))
throw(ME);
end
end
end
end
Results:
url =
'http://www.example.com/index.php?query=test&otherStuf=info'
url =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
I'm not too familiar with java.net.URL, but apparently, it is also unable to find URLs without leading protocol or standard domain (e.g., example.com/path/to/page).
This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want to use this longer, inherently slower and far uglier solution :)

As I suspected you could use java.net.URL according to this answer.
To implement the same code in Matlab:
First read the file into a string, using fileread for example:
str = fileread('Sample.txt');
Then split the text with respect to spaces, using strsplit:
spl_str = strsplit(str);
Finally use java.net.URL to detect the URLs:
for k = 1:length(spl_str)
try
url = java.net.URL(spl_str{k})
% Store or save the URL contents here
catch e
% it's not a URL.
end
end
You can write the URL contents into a file using urlwrite. But first convert the URLs obtained from java.net.URL to char:
url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
Hope it helps.

Ignore a certain pattern in Lua (garry's mod lua 5.1)

Found a new way, using a PHP dump of a MySQL database and 15 lines of Lua, with no pattern matching. A mod can delete this.
I'm trying to separate this into separate parts of a table, but I can't figure out how to make a pattern to ignore a specific thing.
local output = "<tr><td>ABAH</td><td>A Basic Anti Hack</td><td>Clark</td><td>Download"
for plugin in output:gmatch("<tr>(.-%S)Download</a>") do
--print( plugin )
for title in plugin:gmatch("<td>(.-%S)</td><td>") do
print(title)
end
for description in plugin:gmatch("</td><td>(.-%S)</t") do
print(description)
end
end
So far it outputs the title and the description, but also outputs the mail link, how can I make it ignore that?
Outputs:
1.ABAH
2.Clark
3.A Basic Anti Hack
I used http://codepad.org/XQ6rZ6ZM for testing.

Is there some reason why you can't simply apply another pattern to clean out the HTML? For example:
local output = "<tr><td>ABAH</td><td>A Basic Anti Hack</td><td>Clark</td><td>Download"
for plugin in output:gmatch("<tr>(.-%S)Download</a>") do
  for title in plugin:gmatch("<td>(.-%S)</td><td>") do
    -- strip any remaining tags from the captured text
    title = title:gsub('<.->', '')
    print(title)
  end
  for description in plugin:gmatch("</td><td>(.-%S)</t") do
    description = description:gsub('<.->', '')
    print(description)
  end
end

Redmine - Add "Spent Time" Field to Issues Display

How would I go about adding the "Spent Time" as a column to be displayed in the issues list?

Consolidating Eric and Joel's answers, this is what I needed to do to get a 'Spent time' column added to Redmine 1.0.3. Not sure if there's a better way to get the translation text added.
To give the new field a localised name, added to config/locales/en.yml around line 299 at the end of the field definitions:
field_spent_hours: Spent time
To add the new column, created lib/spent_time_query_patch.rb with content:
# Based on http://github.com/edavis10/question_plugin/blob/master/lib/question_query_patch.rb
require_dependency 'query'
module QueryPatch
  def self.included(base) # :nodoc:
    base.extend(ClassMethods)
    # Same as typing in the class
    base.class_eval do
      unloadable # Send unloadable so it will not be unloaded in development
      base.add_available_column(QueryColumn.new(:spent_hours))
    end
  end
  module ClassMethods
    unless Query.respond_to?(:available_columns=)
      # Setter for +available_columns+ that isn't provided by the core.
      def available_columns=(v)
        self.available_columns = (v)
      end
    end
    unless Query.respond_to?(:add_available_column)
      # Method to add a column to the +available_columns+ that isn't provided by the core.
      def add_available_column(column)
        self.available_columns << (column)
      end
    end
  end
end
To get the spent_time_query_patch above to actually load, created config/initializers/spent_time_query_patch.rb with content:
require 'spent_time_query_patch'
Query.class_eval do
include QueryPatch
end

You can also do this by adding the column at runtime. This will add the spent hours column without modifying the Redmine core. Just drop the following code into a file in lib/
Adapted from:
Redmine Budget Plugin
Redmine Question Plugin
require_dependency 'query'
module QueryPatch
  def self.included(base) # :nodoc:
    base.extend(ClassMethods)
    # Same as typing in the class
    base.class_eval do
      unloadable # Send unloadable so it will not be unloaded in development
      base.add_available_column(QueryColumn.new(:spent_hours))
    end
  end
  module ClassMethods
    unless Query.respond_to?(:available_columns=)
      # Setter for +available_columns+ that isn't provided by the core.
      def available_columns=(v)
        self.available_columns = (v)
      end
    end
    unless Query.respond_to?(:add_available_column)
      # Method to add a column to the +available_columns+ that isn't provided by the core.
      def add_available_column(column)
        self.available_columns << (column)
      end
    end
  end
end

Also, it would be cool if the "Spent time" column were sortable.
After looking at the generated SQL, I implemented sorting this way:
base.add_available_column(QueryColumn.new(:spent_hours,
:sortable => "(select sum(hours) from time_entries where time_entries.issue_id = t0_r0)")
)
Replace the respective line. I just hope the issue_id column's name is always "t0_r0" ...
PS: You can find lots of examples in app/models/query.rb lines 122++
2-Digits Problem:
Unfortunately, I had to hack one of the core files: app/helpers/queries_helper.rb
Around line 44, change this:
when 'Fixnum', 'Float'
  if column.name == :done_ratio
    progress_bar(value, :width => '80px')
  else
    value.to_s
  end
into:
when 'Fixnum', 'Float'
  if column.name == :done_ratio
    progress_bar(value, :width => '80px')
  elsif column.name == :spent_hours
    sprintf "%.2f", value
  else
    value.to_s
  end
EDIT: Using a patch instead of manipulating the source
Recently we updated our Redmine system, so the fix mentioned above was lost.
This time, we decided to implement it as a patch.
Open up any plugin (we created a plugin for our monkey-patch changes to the core) and edit vendor/plugins/redmine_YOURPLUGIN/app/helpers/queries_helper.rb:
module QueriesHelper
  def new_column_content(column, issue)
    value = column.value(issue)
    if value.class.name == "Float" and column.name == :spent_hours
      sprintf "%.2f", value
    else
      __column_content(column, issue)
    end
  end
  alias_method :__column_content, :column_content
  alias_method :column_content, :new_column_content
end
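The aliasing trick used in this patch can be tried outside Redmine. The sketch below is a stand-alone illustration, not Redmine code: DemoHelper, DemoView and the simplified two-argument column_content are invented stand-ins for the real helper.

```ruby
# Stand-in for the core helper (in Redmine this would be QueriesHelper).
module DemoHelper
  def column_content(name, value)
    value.to_s
  end
end

# Reopen the module, as a plugin would, and wrap the original method.
module DemoHelper
  def column_content_with_hours(name, value)
    if name == :spent_hours && value.is_a?(Float)
      format('%.2f', value)
    else
      # Everything else falls through to the original implementation.
      column_content_without_hours(name, value)
    end
  end

  # Keep the original reachable under a new name, then swap in the wrapper.
  alias_method :column_content_without_hours, :column_content
  alias_method :column_content, :column_content_with_hours
end

class DemoView
  include DemoHelper
end

view = DemoView.new
puts view.column_content(:spent_hours, 1.5)
puts view.column_content(:done_ratio, 40)
```

The order of the two alias_method calls matters: the original has to be saved under its new name before column_content is pointed at the wrapper, otherwise the wrapper would end up calling itself.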

This feature has been built in since version 1.4.0.

You can do this with the AgileDwarf plugin. It gives you spent time, and you can say what you spent the time on (development, design, ...).

Since no one answered, I just poked the source until it yielded results. Then I started a blog to explain how I did it.
Add spent time column to default issues list in Redmine
