How to create a CustomMapping transform for data in the middle of the pipeline? (ML.NET)

I have a pipeline for processing text and I'd like to add a stemming step.
var textPipeline = mlContext.Transforms.Text.NormalizeText("Text", "Html", Microsoft.ML.Transforms.Text.TextNormalizingEstimator.CaseMode.Lower, false, false, false)
    .Append(mlContext.Transforms.DropColumns("Html"))
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens", language: StopWordsRemovingEstimator.Language.English))
    // --> stemming step goes here <--
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
    .Append(mlContext.Transforms.Text.LatentDirichletAllocation("Features", "Tokens", numberOfTopics: 20));
Between RemoveDefaultStopWords and MapValueToKey, I'd like to call a CustomMapping action, but all the samples I've seen work on data that was loaded into the initial DataView, not on data created by earlier stages in the pipeline.
How do I create an Action<> that takes a vector of strings and returns a new vector of strings?

After a lot of playing around, I got it working with:
Action<HtmlRaw, StemmedOutput> mapping = (input, output) => output.StemmedTokens = input.Tokens.Select(t => stemmer.Stem(t).Value).ToArray();
It was just a matter of creating new classes and matching their property names to the column names used in the pipeline.
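For completeness, here is a minimal sketch of how the pieces fit together. The class and property names come from the snippet above; the "StemmingMapping" contract name and the switch of the downstream stages to a "StemmedTokens" column are illustrative, and the stemmer is whatever stemming library you already have (anything whose Stem(token) result exposes a Value string):
// (requires the usual using System; using System.Linq; using Microsoft.ML; using Microsoft.ML.Transforms.Text;)
public class HtmlRaw
{
    // Matches the "Tokens" column produced by TokenizeIntoWords/RemoveDefaultStopWords.
    public string[] Tokens { get; set; }
}

public class StemmedOutput
{
    // New column consumed by the rest of the pipeline.
    public string[] StemmedTokens { get; set; }
}

// 'stemmer' is whatever stemming library you already use.
Action<HtmlRaw, StemmedOutput> mapping = (input, output) =>
    output.StemmedTokens = input.Tokens.Select(t => stemmer.Stem(t).Value).ToArray();

var textPipeline = mlContext.Transforms.Text.NormalizeText("Text", "Html", Microsoft.ML.Transforms.Text.TextNormalizingEstimator.CaseMode.Lower, false, false, false)
    .Append(mlContext.Transforms.DropColumns("Html"))
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens", language: StopWordsRemovingEstimator.Language.English))
    // The stemming step slots in here; the contract name only matters if you save/load the model.
    .Append(mlContext.Transforms.CustomMapping<HtmlRaw, StemmedOutput>(mapping, contractName: "StemmingMapping"))
    // Downstream stages now read "StemmedTokens" instead of "Tokens".
    .Append(mlContext.Transforms.Conversion.MapValueToKey("StemmedTokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("StemmedTokens"))
    .Append(mlContext.Transforms.Text.LatentDirichletAllocation("Features", "StemmedTokens", numberOfTopics: 20));
One caveat: if the fitted pipeline needs to be saved to disk and reloaded, ML.NET expects custom mappings to be registered through a CustomMappingFactory class carrying the same contract name; for an in-memory pipeline the Action alone is enough.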

Related

BIGQUERY csv file load with an additional column with a default value

From the example given by Google, I have managed to load CSV files into a BigQuery (BQ) table following the guide (link and code below).
Now I want to load several files into BQ, and I want to add a new column, filename, which contains the source file name.
Is there a way to add a column with a default value?
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the CSV file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.csv';

async function loadCSVFromGCS() {
  // Imports a GCS file into a table with manually defined schema.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'CSV',
    skipLeadingRows: 1,
    schema: {
      fields: [
        {name: 'name', type: 'STRING'},
        {name: 'post_abbr', type: 'STRING'},
        // {name: 'filename', type: 'STRING', value: filename} // I WANT TO ADD A COLUMN WITH THE FILE NAME HERE
      ],
    },
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
I would say you have a few choices:
Add a column to the CSV before uploading, e.g. with awk or preprocessing in JS.
Load the individual CSV files into separate tables. You can easily query across many tables as one in BigQuery, so you can see which data comes from which file, and you can access the table metadata for the file name.
Post-process the data by adding the column after the data is loaded, with normal SQL/API calls.
See also this possible duplicate: How to add new column with metadata value to csv when loading it to bigquery
According to BigQuery's documentation [1], there is no option to set a default value for columns. The closest option without any post-processing would be to use a NULL value for nullable columns.
However, a possible post-processing workaround would be to create a view of the raw table and add a script that maps the NULL value to any default value. Here's some information about scripting in BigQuery [2].
If it is possible to add pre-processing code, adding the value to the source file would be easy to achieve with any scripting language.
I think static and function-based default values would be a good feature for BigQuery in the future.
[1] https://cloud.google.com/bigquery/docs
[2] https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
You have multiple options:
You could rebuild your CSV with the file name as a data column.
You could load the data into a temporary table, then move it to the final table in a second step that fills in the missing file name column.
You could convert the example to use an external table, where _FILE_NAME is a pseudocolumn, and later query it and move the data to a final table. See more about this here.

Need to create a .doc file with list of items using Aspose

I have been working on a legacy application where I am using Aspose.Words.jdk15.jar to produce a .doc file. I have a requirement where I get a list of values, and I have to loop over it and print the values into the doc file.
The values are replaced in the doc using the range.replace() method. The doc already exists in my workspace, with the placeholders mapped like this:
Component Name:$COMPONENT_NAME
Billing Effective Date:$EFFECTIVE_DATE
Billing End Date:$END_DATE
Below is the code I have written to replace the values in the doc. My requirement is that I need these values repeated in the doc as many times as there are items in the list.
for (int i = 0; i < details.size(); i++)
{
    doc.getRange().replace("$COMPONENT_NAME", checkNull(details.get(i).getComponentName()) + ",", false, false);
    doc.getRange().replace("$EFFECTIVE_DATE", checkNull(details.get(i).getBillEffectiveDate()) + ",", false, false);
    doc.getRange().replace("$END_DATE", checkNull(details.get(i).getBillEndDate()) + ",", false, false);
}
The best way to achieve what you need is to use the Mail Merge feature:
https://docs.aspose.com/words/java/about-mail-merge/
But in this case you need to replace your placeholders with merge fields in your template.
If you cannot change the template, you can clone the document for each item in your list, replace the values in the cloned document, and then merge the documents together to get the final result. The code will look like the following:
// Open template
Document doc = new Document("C:\\Temp\\in.doc");

// Create the document where the result will be stored.
// Simply clone the original template without children;
// this way the styles of the original document are kept.
Document result = (Document) doc.deepClone(false);

for (int i = 0; i < details.size(); i++)
{
    Document component = doc.deepClone();
    component.getRange().replace("$COMPONENT_NAME", details.get(i).getComponentName());
    component.getRange().replace("$EFFECTIVE_DATE", details.get(i).getBillEffectiveDate().toString());
    component.getRange().replace("$END_DATE", details.get(i).getBillEndDate().toString());
    // Append the result to the final document.
    result.appendDocument(component, ImportFormatMode.USE_DESTINATION_STYLES);
}
result.save("C:\\Temp\\out.doc");
But I would prefer the Mail Merge approach.

SpreadJS FromJson Chart load

I'm using SpreadJS v12 as a reporting tool. Users enter the page, get the data they want, create charts, and save the report for later use.
When a user saves the report, I get the JSON data (GC.Spread.Sheets.Workbook.toJSON) and save this JSON to the database, and whenever someone opens the same report, I get the JSON from the database and give it to the page (GC.Spread.Sheets.Workbook.fromJSON). Everything works fine, except that if there is a chart on the page, the data source for the chart series (xValues and yValues) changes. When I check the JSON, it looks like this: Sheet2!$B$2:$B$25, but in the chart it's Sheet2!$A$1:$A$24. Am I doing something wrong?
By the way, my serialization options are: { ignoreFormula: false, ignoreStyle: false, rowHeadersAsFrozenColumns: true, columnHeadersAsFrozenRows: true, doNotRecalculateAfterLoad: false }
this.state.spread = new GC.Spread.Sheets.Workbook(document.getElementById("spreadSheetContent"), { sheetCount: 1 });
This is my save method:
var pageJson = this.state.spread.toJSON(this.serializationOption);
let self = this;
let model = {
    Id: "",
    Name: reportName,
    Query: query,
    PageJson: JSON.stringify(pageJson)
};
this.post({ model }, "Query/SaveReportTemplate")
    .done(function (reply) {
        self.createSpreadSheet(reply);
    }).fail(function (reply) {
        self.PopUp(reply, 4);
    });
And this is my load method:
var jsonOptions = {
    ignoreFormula: false,
    ignoreStyle: false,
    frozenColumnsAsRowHeaders: true,
    frozenRowsAsColumnHeaders: true,
    doNotRecalculateAfterLoad: false
};
this.state.spread.fromJSON(JSON.parse(template.PageJson), jsonOptions);
this.state.spread.repaint();
Well, after a long day, I think I've found what's causing the problem and have started working around it.
Let's say we have two sheets: Sheet1's index is 0 and Sheet2's index is 1.
Because of JSON serialization options like frozenColumnsAsRowHeaders and frozenRowsAsColumnHeaders, the row and column numbers in the JSON are different until Sheet2 is painted.
If there is a formula or a chart in Sheet1 that references Sheet2, its references will end up pointing to a different cell from the one you originally set. So the way to avoid this problem is to always reference sheets that are painted earlier.

SAPUI5 - Input error on growing list, logic issue

I am having an issue with a growing list. Previously I had a normal list, but as it is limited to displaying 100 items, I now need to change it to a growing list. That works fine, and I can get over 100 items loaded after putting the growing="true" growingThreshold="50" growingScrollToLoad="false" properties on the list.
But now I have an issue with one of the number inputs in the custom list: when entering a number, it is not staying set (it has a liveChange event that updates a text component).
I've set a breakpoint in the controller to test, and it seems to bug out when I am trying to set the data changes (red arrow on the attached image).
Can anyone see the issue with the logic? If any additional code snippets are required I could provide them.
onReceivedQuantityChange: function (oEvent) {
    // get model and data
    var oModel = this.getOrderModel();
    var oData = oModel.getData();
    // get item from path
    var oItem = this._getOrderItemByPath(oEvent.getSource().getBindingContext(this.MODEL_ORDERS).getPath());
    // set received value
    oItem._ReceivedValue = oEvent.getParameters().newValue * (oItem.ValuationPrice / oItem.Quantity);
    // apply data changes
    oModel.setData(oData);
},
[Controller code image]
onReceivedQuantityChange: function (oEvent) {
    var oModel = this.getOrderModel()
    var sItemPath = oEvent.getSource().getBindingContext(this.MODEL_ORDERS).getPath()
    var iValuationPrice = oModel.getProperty(sItemPath + '/ValuationPrice')
    var iQuantity = oModel.getProperty(sItemPath + '/Quantity')
    var iNewValue = oEvent.getParameters().newValue
    var iReceivedValue = iNewValue * (iValuationPrice / iQuantity)
    oModel.setProperty(sItemPath + '/_ReceivedValue', iReceivedValue)
}
If you use setProperty() on the model, you are only changing that specific property in the data model, and SAPUI5 can process binding changes for that property only (rather than for the whole model).
If you get the data out of the model with getData(), you only get a reference to that object. If you change something on that object, you don't have to set it back with setData(); it is already there, because you were working on the reference.
But SAPUI5 needs to know that there was a specific change in the data model, and that is what setProperty() does.

Using NBuilder to test NHibernate mappings

I have been using NBuilder for a while in unit tests to simulate in-memory data, and it's awesome. Then I wanted to use it to test my NHibernate mappings. I thought it was going to be transparent, but I cannot figure out what I am doing wrong =( It is simply not working.
I am planning to test my NHibernate mappings heavily, but since I have too many entities I do not want to populate the data manually; that's the main reason I want to use NBuilder.
Just as a quick reference:
autoConfig.Override<Planet>(x =>
{
    x.References(y => y.Sun).Cascade.SaveUpdate().Column("Star_id");
});
autoConfig.Override<Star>(y =>
{
    y.HasMany(x => x.Planets).Inverse().Cascade.AllDeleteOrphan();
});
(If you need, I can provide information about the entities and the mappings, but I think they are correct since I am able to save my entities when the data is populated manually.)
Manually:
using (var session = factory.OpenSession())
using (var tran = session.BeginTransaction())
{
    var star = new Star { Class = StarTypes.B, Color = SurfaceColor.Red, Mass = 323.43, Name = "fu..nny star" };
    star.Planets = new List<Planet>
    {
        new Planet { IsHabitable = true, Name = "my pla", Sun = star }
    };
    session.Save(star);
    tran.Commit();
}
The above code works, saving both entities to the database correctly, which means my mappings are correct. But now I want to use NBuilder to auto-populate test data like this:
var star = Builder<Star>.CreateNew().Build();
star.Planets = Builder<Planet>.CreateListOfSize(10).All().With(x => x.Sun, star).Build();
session.Save(star);
tran.Commit();
Inspecting the generated entities while debugging, they look correct to me and I can navigate through them without problems, but when I commit the transaction I get the following error:
Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect): [CH9_NHibernateLinqToNHibernate.Domain.Planet#00000000-0000-0000-0000-000000000001]
Any thoughts?
I found the problem: basically NBuilder was assigning a value to my Id, so NHibernate considered the entity 'persisted' and tried to update the record instead of creating a new one (the error message was not helping me, though):
var star = Builder<Star>.CreateNew().Build();
star.Planets = Builder<Planet>.CreateListOfSize(10).All().With(x => x.Sun, star).With(x => x.Id, Guid.Empty).Build();