Extracting fields from Word files and exporting them to a database

I need some advice on using Aspose.Words for .NET. One of my projects requires that the data "fields" in Word documents be extracted and exported to a database. How should I start? I have tried the demo, but I still have doubts about the implementation. All the fields are in a table with several columns and rows.
The demo seems to be able to extract data from only one Word file. Is it possible to extract data from multiple Word files with different filenames, for example Test1.doc and Test2.doc?
The demo doesn't seem to find my Word file with "fields"; it shows the error "file not found".
I'm thinking of using Aspose.Words to iterate over the Word files, read the "field" contents and save them to the respective columns in the database. Is this correct?

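Please use the following code example to get the values of the form fields.
using System;
using Aspose.Words;
using Aspose.Words.Fields;

// MyDir is a string holding the folder path of the input document.
Document doc = new Document(MyDir + "in.docx");
foreach (FormField formField in doc.Range.FormFields)
{
    // Drop-down fields expose the selected item; text inputs expose their result.
    if (formField.Type == FieldType.FieldFormDropDown)
        Console.WriteLine(formField.DropDownItems[formField.DropDownSelectedIndex]);
    else if (formField.Type == FieldType.FieldFormTextInput)
        Console.WriteLine(formField.Result);
}
I work with Aspose as a Developer Evangelist.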

Related

How can I see if a document contains any of a set of words using ML.NET?

I am learning ML.NET and want to submit the text of a document to see if it contains one or more specific words. In my example I want to categorize my document by the day of the week. For example, if it contains the word "Monday" or "first", then I want to categorize it as "Monday".

You will need to create a training dataset. It should consist of documents that contain the words you are looking for and documents that don't. Label each document in a column that holds Yes or No as the ground truth.
Then use the Model Builder interface and select a classification task.
It is best to start with a small training set, just to check that your data is in the right format.

Exporting from pgadmin reads line breaks in field cells and creates unreadable Excel

I'm new to this, so I am sure it is a silly question, but I have read through every question related on the site and can't find anything!
I am exporting from pgAdmin. A few of the columns have line breaks within the cells, so the exported data is very choppy. Does anyone know how to fix this? Is there a way to keep the line breaks within cells from being read as new rows?
I believe I am using the right export settings, but what happens is that the header names are there, along with one row of content for each column, and then column A has 20 more rows beneath it because of the line breaks in the first cell of column E.
Any help would be much appreciated!

I assume that you're referring to the Query --> Execute to file command in the Query window. I don't think it's a bug that pgAdmin doesn't escape line breaks within strings in its CSV output; Excel can read such a file correctly anyway.
In the export options, please make sure that you use commas as column separators and double quotes as quote characters.
Additionally, when you load your CSV into Excel, please don't use Data -> From Text. That import doesn't parse CSV with line breaks correctly. Just open the file directly in Excel (via Open within Excel, or by right-clicking it in Windows Explorer and choosing Open With -> Microsoft Excel).
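To illustrate why the quoting matters, here is a minimal Python sketch (not part of the original answer). The sample data and column names are made up for illustration. It shows that a CSV parser which honours double quotes treats a cell with embedded line breaks as one field, which is the behaviour the answer relies on when the file is opened directly.

```python
import csv
import io

# Hypothetical export: a header row plus one data row whose "notes" cell
# contains two line breaks. The cell is wrapped in double quotes.
exported = 'id,name,notes\r\n1,Alice,"first line\nsecond line\nthird line"\r\n'

# newline="" stops Python from translating line endings, so the csv module
# can decide which line breaks end a row and which belong to a quoted cell.
reader = csv.reader(io.StringIO(exported, newline=""), delimiter=",", quotechar='"')
rows = list(reader)

print(len(rows))                 # number of parsed rows, header included
print(rows[1][2].count("\n"))    # line breaks kept inside the single cell
```

A parser that ignores the quotes would instead split the same text into four physical lines, which matches the "choppy" result described in the question.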

SP 2013 - Quick edit with Managed Meta Data columns, copy and paste from excel

I'm trying to migrate metadata from an Excel spreadsheet to a SP 2013 document library. The columns are managed metadata columns with predefined terms matching the data in the Excel spreadsheet.
However, I cannot copy and paste data from Excel via Quick Edit in the document library without getting the following error: "The data returned from the tagging UI was not formatted correctly".
This happens even when I remove all formatting or paste to notepad first.
Are there any simple solutions to this issue?
http://i.imgur.com/1bqpMPA.jpg

Any metadata fields are in fact foreign keys, as it were, to a dynamic, hidden table (or 'list', whatever you want to call it) within SharePoint. To paste a value into a metadata column, you need to know your element's guid (as in, within the term set) and then append that to each metadata element you're pasting in as a <name>|<guid> pair.
Getting the GUID for an element within your term set
Browse to [site-root]/TaxonomyHiddenList/AllItems.aspx and create a new view (or edit the default one) to display the field 'IdForTerm'.
Where you have a term 'apple', your IdForTerm may look like '1288beaf-82e0-4d81-b9de-ad5ad8382938'. Take a note of the guid for each term which appears within your input data.
Edit your input to correctly reference each term
Let's say you're importing your data from an Excel spreadsheet. Or from a CSV. It doesn't really matter. What you need to do is, basically, a find and replace down each managed metadata column, replacing 'term' with 'term|guid'. So our example from earlier, with the apple, would become 'apple|1288beaf-82e0-4d81-b9de-ad5ad8382938'.
Finally, assuming the columns in your view are in exactly the same order as in your input data, you should be able to 'edit list' from within the browser, click the leftmost side of your first input row (to select the entire row) and CTRL+V all of your data at the same time.
Note that there appears to be a limit to the number of entries you can paste at once; it seems to sit at around 5,000 elements.
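The find-and-replace step above can be sketched in Python as follows. This is only an illustration under assumptions: the row layout, file names and the `tag_term` helper are hypothetical, and the only GUID used is the 'apple' example from the answer. Terms without a known GUID are left unchanged so they can be spotted before pasting.

```python
# Mapping from term label to its IdForTerm, as read from TaxonomyHiddenList.
# Only the 'apple' example from the answer is filled in here.
term_guids = {"apple": "1288beaf-82e0-4d81-b9de-ad5ad8382938"}


def tag_term(value, guids):
    """Return 'term|guid' when the GUID is known, else the value unchanged."""
    key = value.strip()
    guid = guids.get(key)
    return f"{key}|{guid}" if guid else value


# Hypothetical input rows: file name, then one managed metadata column.
rows = [["report.docx", "apple"], ["notes.docx", "pear"]]
managed_col = 1  # index of the managed metadata column (assumption)

tagged = [
    row[:managed_col] + [tag_term(row[managed_col], term_guids)] + row[managed_col + 1:]
    for row in rows
]

for row in tagged:
    print("\t".join(row))  # tab-separated, ready to copy into Quick Edit
```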

Adding on to @rmacd's answer, you can also get the GUID for a given MMS term by first manually entering the value(s) you need in a Quick Edit cell, then copying and pasting the same value(s) from SharePoint to Excel. The pasted value will appear with the full term|guid that you need to complete the bulk copy/paste.

How to determine if a given word or phrase from a list is within an anchor tag?

We have a ColdFusion based site that involves a large number of 'document authors' that have little or no knowledge of HTML. The 'documents' they create are comprised of HTML stored in a table in the database. They use a CKEDITOR interface. The content that they create is output into specific area of the page. The document frequently has tons of technical terms that readers may not be familiar with that we would like to have tooltips automatically show up for.
I and the other programmer want to have some code insert 'tooltip' code into the page based on a list of words in a table on our SQL server. The 'dictionary' table in our database has a unique ID, the word/phrase we will look for and a corresponding definition that would be displayed in the tooltip.
For instance, one of the words/phrases we will be looking for is 'Scrum Master'. If it occurs in the document area, we need to insert code around the words to create a tooltip. To do that, we need to see if certain conditions exist. Are the words within an anchor tag? If yes, is there already a title value for the tag (title is used to contain the info to be displayed in a tooltip)? If a title attribute exists, don't do anything. If the words are not in an anchor tag, then we would put anchor tags around the words along with the title that will contain the definition.
The tooltip code we use is via jQuery (http://jqueryui.com/tooltip/). It is quick and simple to use. We just need to figure out how to use it dynamically based on our dictionary table.
Do you have any suggestions of how to go about this?
I was hoping that jSoup might have a function that I could use, but it doesn't seem to be the right technology for what I want to do. I could be wrong, and I am happy to be corrected!
We have a large number of these documents and so manually inserting and maintaining the tooltip code is just not an option.

Update your content with something like:
strOut = ReplaceList(strIn, ValueList(qryTT.find), ValueList(qryTT.replace));
Since words are delimited by spaces, the values in qryTT.find need to include the surrounding spaces. The replace column will need to include some of the original content. You will also have to be careful with words followed by a comma or a period.
I would cache the results because I would expect it to be memory intensive.
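As a rough illustration of the anchor-tag check the question asks about, here is a Python sketch. It is a different technique from the ReplaceList call above, not a ColdFusion implementation: it splits the HTML so that existing anchor elements and other tags are left untouched, and wraps dictionary terms only in the plain text in between. The dictionary entry, its definition and the `add_tooltips` name are assumptions. Matching is case-sensitive, existing anchors are skipped whether or not they have a title, and text inside elements such as script or style is not treated specially.

```python
import html
import re

# Hypothetical dictionary rows: term -> definition shown in the tooltip.
dictionary = {"Scrum Master": "The person who facilitates a Scrum team."}

# One capture group, so re.split returns text pieces at even indices and
# the matched markup (a whole <a>...</a> element or a single tag) at odd ones.
SPLIT_RE = re.compile(r"(<a\b[^>]*>.*?</a>|<[^>]+>)", re.IGNORECASE | re.DOTALL)


def add_tooltips(content, terms):
    """Wrap each term found outside tags and anchors in <a title="...">."""
    if not terms:
        return content
    # Longest terms first so a longer phrase wins over a term it contains.
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    term_re = re.compile(r"\b(?:" + alternation + r")\b")

    def wrap(match):
        title = html.escape(terms[match.group(0)], quote=True)
        return f'<a title="{title}">{match.group(0)}</a>'

    pieces = SPLIT_RE.split(content)
    return "".join(
        term_re.sub(wrap, piece) if i % 2 == 0 else piece
        for i, piece in enumerate(pieces)
    )


sample = '<p>Ask the Scrum Master or see <a href="/sm">Scrum Master</a>.</p>'
result = add_tooltips(sample, dictionary)
print(result)
```

Because the word boundaries come from the regular expression, terms followed by a comma or a period are matched without storing the punctuation in the dictionary.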

RegEx for a JSON string

I'm storing a person object as JSON in my SQLite database. The table will have a few thousand records of person objects. What I need is to query persons based on the "name" attribute.
After some investigation, I figured I could use SQLite's GLOB to perform a RegEx-like search in the JSON elements.
My sample JSON is something like this:
{"name":"john","age":"22","father-name":"jackson"}
Now I want a RegEx matcher that returns all the records whose "name" attribute contains the substring provided. It should be case-insensitive too.
For example, a search for "ohn" should fetch john's record.

While you can store JSON and search against it using regexes (which are rather limited in SQLite), that does not mean you should.
Instead, you should really consider splitting your JSON into fields and storing them in a normal SQLite table. Doing so will not only make searches easier, with no need to painfully parse the data every single time; searches will also be much faster (if you add the necessary indexes).
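A minimal sketch of that suggestion, using Python's sqlite3 module. The table name, column names and the `find_by_name` helper are assumptions for illustration. It relies on SQLite's LIKE operator, which is case-insensitive for ASCII characters by default, so no regex is needed for the substring search in the question.

```python
import json
import sqlite3

# Sample records in the shape given in the question.
records = [
    '{"name":"john","age":"22","father-name":"jackson"}',
    '{"name":"mike","age":"22","father-name":"johnson"}',
]

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one column per JSON attribute.
conn.execute("CREATE TABLE person (name TEXT, age TEXT, father_name TEXT)")
for raw in records:
    person = json.loads(raw)
    conn.execute(
        "INSERT INTO person (name, age, father_name) VALUES (?, ?, ?)",
        (person["name"], person["age"], person["father-name"]),
    )


def find_by_name(connection, fragment):
    """Return names containing the fragment; ASCII case is ignored by LIKE."""
    # Note: % and _ inside the fragment act as wildcards and are not escaped here.
    cursor = connection.execute(
        "SELECT name FROM person WHERE name LIKE ? ORDER BY name",
        (f"%{fragment}%",),
    )
    return [row[0] for row in cursor]


matches = find_by_name(conn, "OHN")
print(matches)
```

Only the name column is searched, so "mike" is not returned even though the father's name contains "ohn".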

If you do want to go down the regex route, the following will match the record:
/\{"name":"\w*ohn\w*[^\}]+\}/i
This will match each of these:
{"name":"john","age":"22","father-name":"jackson"}
{"name":"john","age":"22","father-name":"johnson"}
{"name":"johnny","age":"22","father-name":"smith"}
but not:
{"name":"fred","age":"22","father-name":"hall"}
{"name":"mike","age":"22","father-name":"johnson"}
{"name":"bob","age":"22","father-name":"todd"}
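
The pattern above can be checked against the listed records with a short Python sketch. Python's `re` module is used here only as a convenient way to run the expression, which is an assumption about the environment; the pattern itself is copied from the answer, with the `/i` flag expressed as `re.IGNORECASE`.

```python
import re

# The answer's pattern; re.IGNORECASE stands in for the /i flag.
pattern = re.compile(r'\{"name":"\w*ohn\w*[^\}]+\}', re.IGNORECASE)

records = [
    '{"name":"john","age":"22","father-name":"jackson"}',
    '{"name":"john","age":"22","father-name":"johnson"}',
    '{"name":"johnny","age":"22","father-name":"smith"}',
    '{"name":"fred","age":"22","father-name":"hall"}',
    '{"name":"mike","age":"22","father-name":"johnson"}',
    '{"name":"bob","age":"22","father-name":"todd"}',
]

# Keep only the records whose "name" value contains "ohn".
matched = [record for record in records if pattern.search(record)]

for record in matched:
    print(record)
```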
