I am aware that Dedupe uses active learning to remove duplicates and perform record linkage.
However, I would like to know if we can pass an Excel sheet with already matched pairs (labeled data) as the input for active learning?
Not directly.
You'll need to get your data into a format that markPairs can consume.
Something like:
labeled_examples = {'match': [],
                    'distinct': [({'name': 'Georgie Porgie'},
                                  {'name': 'Georgette Porgette'})]
                    }
deduper.markPairs(labeled_examples)
We do provide a convenience function, trainingDataDedupe, for getting spreadsheet data into this format.
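For example, here is a minimal sketch of building that structure from a spreadsheet with pandas (the file name and the columns name_1, name_2, and label are assumptions about your sheet's layout):

import pandas as pd

# Assumed layout: one row per labeled pair, with a 'label' column
# containing either 'match' or 'distinct' (hypothetical names).
df = pd.read_excel('labeled_pairs.xlsx')

labeled_examples = {'match': [], 'distinct': []}
for _, row in df.iterrows():
    pair = ({'name': row['name_1']}, {'name': row['name_2']})
    labeled_examples[row['label']].append(pair)

deduper.markPairs(labeled_examples)  # deduper: an existing dedupe.Dedupe instance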
(I am an author of dedupe)
I'm trying to inject a series of rows into a SharePoint list via PowerApps, but I'm running into the fact that PowerApps seems to have only ForAll as a looping function, and ForAll does not support Set.
Set(AlertString,""); // to be used later
Set(REQ_Value,"");
Set(RITM_Value,"");
Set(Asset_Value,"");
Set(CustomerSignatureFileLocation_Value,"File location: ");
Set(LoanerKitCode_Value,"");
Set(IncidentCode_Value,"");
Set(TransferOrderCode_Value,"");
Set(TransactionType_Value,Workflow.SelectedText.Value & " - " & Workflow_Steps.SelectedText.Value);
Set(ScanItemCodeType,"");
Set(ErrorString,"");
Collect(ScanDataCollection,Split(ScanData.Text,Char(10))); // Split the data into ScanDataCollection collection
ForAll(
    ScanDataCollection,
    If(
        Left(Result, 4) = "RITM",
        Set(RITM_Value, Result); // FAIL HERE
        Collect('Spider - Master Transaction List', {
            REQ: REQ_Value,
            RITM: RITM_Value,
            Scan_Code: Result,
            Asset: Asset_Value,
            Transaction_Type: TransactionType_Value,
            Timestamp: Now(),
            Agent_Name: User().FullName,
            Agent_Email: User().Email,
            Agent_Location: DD_Location.SelectedText.Value,
            Agent_Notes: "It was weird, man.",
            Customer_Name: Cust_Name.Text,
            Customer_Email: Cust_NTAccount.Text,
            Customer_Signature: CustomerSignatureFileLocation_Value,
            Task_Name: "",
            Task_Action: "",
            State_Name: "",
            State_Action: "",
            Stage_Name: "",
            Stage_Action: "",
            Work_Note_String: "",
            Customer_Note_String: "",
            Loaner_Kit_Code: LoanerKitCode_Value,
            Incident: IncidentCode_Value,
            Transfer_Order_Code: TransferOrderCode_Value,
            Item_Description: ""
        })
    )
);
My scanner tool will pick up a variety of different kinds of item scans, all in the same scan. Depending on what type of data it is, it populates different columns in Spider - Master Transaction List.
But we are not allowed to use the Set function inside a ForAll.
How would you recommend I approach this, considering that each piece of data from the Split could be any of the sorts of codes (such as an RITM code, a REQ code, a Transfer Order code, etc.)?
You can do what you want in various ways, using a collection or a gallery; in PowerApps, galleries can be used like collections. I suggest:
ForAll(
    Gallery1.AllItems,
    Patch(
        'SharePointListName',
        Defaults('SharePointListName'), // create a new record per gallery item
        ThisRecord
    )
);
The fields in the gallery must have the same names as the SharePoint list columns, or you have to create a record to assign the names:
{SharePointColumnName: ThisRecord.ColumnName, ...}
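For example, a sketch combining the two for the list in the question (the gallery name and the columns available on its items are assumptions about your app):

ForAll(
    Gallery1.AllItems,
    Patch(
        'Spider - Master Transaction List',
        Defaults('Spider - Master Transaction List'), // create rather than update
        {
            Scan_Code: ThisRecord.Scan_Code,
            Agent_Name: User().FullName,
            Agent_Email: User().Email
        }
    )
);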
In my case, I have keys in my MongoDB database that contain a dot in their name (see attached screenshot). I have read that it is possible to store data in MongoDB this way, but the driver prevents queries with dots in the key. Anyway, in my MongoDB database, keys do contain dots and I have to work with them.
I have tried encoding the dots in the query (. to \u002e), but it did not seem to work. Then I had the idea of using a regex to replace the dots in the query with another character, but regex seems to work only on the value, not on the key.
Does anyone have a creative idea how I can get around this problem? For example, I want to have all the CVE numbers for 'cve_results.BusyBox 1.12.1'.
Update #1:
The structure of cve_results is as follows:
"cve_results" : {
"BusyBox 1.12.1" : {
"CVE-2018-1000500" : {
"score2" : "6.8",
"score3" : "8.1",
"cpe_version" : "N/A"
},
"CVE-2018-1000517" : {
"score2" : "7.5",
"score3" : "9.8",
"cpe_version" : "N/A"
}
}}
With the following workaround I was able to directly access documents by their keys, even though they have a dot in their key:
db.getCollection('mycollection').aggregate([
{$match: {mymapfield: {$type: "object" }}}, //filter objects with right field type
{$project: {mymapfield: { $objectToArray: "$mymapfield" }}}, //"unwind" map to array of {k: key, v: value} objects
{$match: {mymapfield: {k: "my.key.with.dot", v: "myvalue"}}} //query
])
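Applied to the cve_results structure above, a minimal pymongo sketch of the same idea might look like this (the database and collection names are assumptions):

from pymongo import MongoClient

client = MongoClient()
coll = client['mydb']['mycollection']  # assumed names

docs = coll.aggregate([
    # keep documents whose cve_results field is an object
    {'$match': {'cve_results': {'$type': 'object'}}},
    # turn the map into an array of {k: key, v: value} elements,
    # so the dotted key becomes a plain value we can match on
    {'$project': {'cve_results': {'$objectToArray': '$cve_results'}}},
    {'$unwind': '$cve_results'},
    {'$match': {'cve_results.k': 'BusyBox 1.12.1'}},
])
for doc in docs:
    # doc['cve_results']['v'] is the CVE map; its keys are the CVE numbers
    print(list(doc['cve_results']['v'].keys()))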
If possible, it could be worth inserting documents using \u002e instead of the dot; that way you can query them, while a client rendering the key still displays the dot.
However, it appears there's a workaround to query them like so:
db.collection.aggregate({
    $match: {
        "BusyBox 1.12.1" : "<value>"
    }
})
You should be able to use the $eq operator to query fields with dots in their names.
I need help normalizing the field "DSC_HASH" inside a single column, delimited by colons.
Input:
Output:
I achieved what I needed with a Java transformation:
1) In the Java transformation I created 4 output columns: COD1_out, COD2_out, COD3_out and DSC_HASH_out
2) Then I put the following code:
String[] column_split;
String column_delimiter = ";";
String[] column_data;
String data_delimiter = ":";

column_split = DSC_HASH.split(column_delimiter);
COD1_out = COD1;
COD2_out = COD2;
COD3_out = COD3;
// Emit one output row per ';'-separated entry, keeping the part before ':'
for (int i = 0; i < column_split.length; i++) {
    column_data = column_split[i].split(data_delimiter);
    DSC_HASH_out = column_data[0];
    generateRow();
}
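For comparison, here is the same split-and-emit logic as a standalone Python sketch (the sample input row is hypothetical; the delimiters mirror the Java code above):

def normalize_rows(cod1, cod2, cod3, dsc_hash):
    # Emit one output row per ';'-separated entry in DSC_HASH,
    # keeping only the part before the ':' (as generateRow() does above).
    for entry in dsc_hash.split(';'):
        yield (cod1, cod2, cod3, entry.split(':')[0])

# Hypothetical input row:
for row in normalize_rows(1, 2, 3, '2320:-1950312402;410:103682488'):
    print(row)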
There is no generic parser or loop construct in Informatica that can take one record and output an arbitrary number of records.
There are some ways you can bypass this limitation:
Using the Java Transformation, as you did, which is probably the easiest... if you know Java :) There may be limitations regarding performance or multi-threading.
Using a Router or a Normalizer with a fixed number of output records, high enough to cover all your cases, then filtering out the empty records. The expressions to extract the fields are a bit complex to write (and maintain).
Using the XML Parser, but you have to convert your data to XML first and design an XML schema. For example, your first line would be changed to (on multiple lines for readability):
<e><n>2320</n><h>-1950312402</h></e>
<e><n>410</n><h>103682488</h></e>
<e><n>4301</n><h>933882987</h></e>
<e><n>110</n><h>-2069728628</h></e>
Using the SQL Transformation or the Stored Procedure Transformation to leverage standard or custom database functions, but that would result in one SQL query per input row, which is bad performance-wise.
Using a Custom Transformation. Does anyone want to write C++ for that?
The Java Transformation is clearly a good solution for this situation.
I have some data I need to process. It looks like a dictionary within a dictionary within a dictionary, all of which are being stored in a list! This is parsed JSON data so I have no control over the format of it.
Here is some of the data, I've deleted a lot of it as it's irrelevant and for brevity:
devices = [
    {
        'server.device.base.phyname': 'IEEE802.11',
        'dot11.device': {
            'dot11.device.associated_client_map': {
                '68:96:1E:96:96:B5': '4202770DF206F63E_B5F4CE1EAB680000',
                '60:30:CE:91:4A:96': '4202770DF206F63E_8D4A91D430600000',
                '4C:32:75:66:96:10': '4202770DF206F63E_105F6675324C0000',
                '50:6A:03:3E:0E:17': '4202770DF206F63E_170E3E036A500000',
                '7C:C3:CE:A4:EC:86': '4202770DF206F63E_86ECA4A1C37C0000',
                '2C:BE:08:F0:D5:A0': '4202770DF206F63E_A0D5F008BE2C0000',
                '96:E7:96:76:9A:C7': '4202770DF206F63E_C79A762CE7700000',
                '96:CE:75:57:E2:5A': '4202770DF206F63E_5AE2577510000000',
                '34:68:95:96:3C:96': '4202770DF206F63E_C43C6A9568340000',
                '6C:96:96:9D:CE:57': '4202770DF206F63E_57109DCF966C0000',
                'CE:61:96:CE:B4:69': '4202770DF206F63E_69B4D2AE61900000',
                '04:CE:CE:1C:CE:8C': '4202770DF206F63E_8CAF1CCE0C040000',
                '2C:F0:CE:DC:D6:39': '4202770DF206F63E_39D6DCEEF02C0000'
            }
        }
    }
]
I need to be able to extract the MAC addresses that are stored within the 'dot11.device' pair. I'm so far able to loop through the parent list and display all of the data:
for d in devices:
    print d['dot11.device']['dot11.device.associated_client_map']
however this prints the whole nested dict.
What I'd really like to do is return a new list of just the MAC addresses (are they dictionary keys? I'm not sure).
I'm working with Python2 and any help is much appreciated!
Yes, they are indeed keys, and so the answer is quite simple:
for d in devices:
    print d['dot11.device']['dot11.device.associated_client_map'].keys()
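And if you want them as one flat list across all devices (skipping any device that lacks those keys), a small sketch:

macs = []
for d in devices:
    client_map = d.get('dot11.device', {}).get('dot11.device.associated_client_map', {})
    macs.extend(client_map.keys())
print macs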
I would like to use an if-else statement to decide to which location I have to export data.
My case is:
I extract several files from Azure Blob Storage (it's possible that there are no files!).
I calculate the count of records in the file set.
If the count of records is > 20, then I export the files to the specific report location.
If there are no records in the file set, I have to output a dummy empty file to a different location, because I don't want to replace the existing report with an empty report.
The solution may be an IF..ELSE condition. The problem is that when I calculate the count of records I get a rowset variable, and I cannot compare it with a scalar variable.
@RECORDS =
    SELECT COUNT(id) AS IdsCount
    FROM @final;

IF @RECORDS <= 20 THEN // generate dummy empty file
    OUTPUT @final_result
    TO @EMPTY_OUTPUT_FILE
    USING Outputters.Text(delimiter : '\t', quoting : true, encoding : Encoding.UTF8, outputHeader : true, dateTimeFormat : "s", nullEscape : "NULL");
ELSE
    OUTPUT @final_result
    TO @OUTPUT_FILE
    USING Outputters.Text(delimiter : '\t', quoting : true, encoding : Encoding.UTF8, outputHeader : true, dateTimeFormat : "s", nullEscape : "NULL");
END;
U-SQL's IF statement is currently evaluated only at compile time. So you can do something like
IF FILE.EXISTS() THEN
But if you want to output different files depending on the number of records, you would have to write it at the SDK/CLI level:
The first job writes a temp output file (and maybe a status file that contains the number of rows). Then you check (for example, in PowerShell) whether the file is empty (or whatever criterion you want to use) and, if not, copy the result over; otherwise create the empty output file.
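As a sketch of that second step in Python, using the azure-datalake-store package (all names, paths, and credentials are placeholders, and the empty-file check is just one possible criterion):

from azure.datalake.store import core, lib

# Placeholder credentials and store name -- adjust to your environment.
token = lib.auth(tenant_id='<tenant>', client_id='<app-id>', client_secret='<secret>')
adl = core.AzureDLFileSystem(token, store_name='<adls-account>')

temp_output = '/tmp/report_candidate.tsv'
report_path = '/reports/report.tsv'
empty_path = '/empty/report_empty.tsv'

# Promote the temp file written by the first U-SQL job, or emit a dummy file.
if adl.exists(temp_output) and adl.info(temp_output)['length'] > 0:
    adl.mv(temp_output, report_path)  # replace the real report
else:
    adl.touch(empty_path)             # create the dummy empty file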