I need help in designing my C++ Console application - c++

I have a task to complete.
There are two types of csv files 4000+ both related to each other.
2 types are:
1. Country2.csv
2. Security_Name.csv
Contents of Country2.csv:
Company Name;Security Name;;;;Final NOS;Final FFR
Contents of Security_Name.csv:
Date;Close Price;Volume
There are multiple countries and for each country multiple security files
Now I need to READ them do some CALCULATION and then WRITE the output in another files
READ
Read both the file Country 2.csv and Security.csv and extract all the data from them.
For example :
Read France 2.csv, extract Security_Name, Final NOS, Final FFR
Then Read Security.csv(which matches the Security_Name) and extract Date, Close Price, Volume
Calculation
Calculations are basically finding Median of the values extracted which is quite simple.
For Example:
Monthly Median Traded Values
Daily Traded Value of a Security ... and so on
Write
Based on the month I need to sort the output in two different file with following formats:
If Month % 3 = 0
Save It as MONTH_NAME.csv in following format:
Security name; 12-month indicator; 3-month indicator; FOT
Else
Save It as MONTH_NAME.csv in following format:
Security Name; Monthly Median Traded Value Ratio; Number of days Volume > 0
My question is how do I design my application in such a way that it is maintainable and the flow of data throughout the execution is seamless?

So first thing. Based on the kind of data you are looking to generate, I would probably be looking at moving this data to a SQL db if at all possible. This is "one SQL query" kind of stuff. And far more maintainable than C++ that generates CSV files from CSV files.
Barring that, I would probably look at using datamash and/or perl. On a Windows platform, you could do this through Cygwin or WSL. Probably less maintainable, but so much easier it's not too much of an issue.
That said, if you're looking for something moderately maintainable, C++ could work. The first thing I would do is design my input classes. Data-centric, but it can work. It sounds like you could have a Country class, a Security class, and a SecurityClose class...or something along those lines. You can think about whether a Security class should contain a collection of SecurityClosees (data), or whether the data should just be "loose" and reference the Security it belongs to. Same with the Country->Security relationship.
Once you've decided how all that's going to look, you want something (likely a function) that can tokenize a CSV line. So "1,2,3" gets turned into a vector<string> with the contents "1" "2" "3". Then, each of your input classes should have a constructor or initializer that takes a vector<string> and populates itself. You might need to pass higher level data along too. Like the filename if you want the security data to know which security it belongs to..
That's basically most of the battle there. Once you've pulled your data into sensibly organized classes, the rest should come more easily. And if you run into bumps, hopefully you can ask specific design or implementation questions from there.

Related

AWS GroundTruth text labeling - hide columns in the data, and checking quality of answers

I am new to SageMaker. I have a large csv dataset which I would like labelled:
sentence_id
sentence
pre_agreed_label
148392
A sentence
0
383294
Another sentence
1
For each sentence, I would like a) a yes/no binary classification in response to a question, and b) on a scale of 1-3, how obvious the classification was. I need the sentence id to map to other parts of the dataset, and will use the pre-agreed labels to assess accuracy.
I have identified SageMaker GroundTruth labelling jobs as a possible way to do this. Is this the best way? In trying to set it up I have run into a few problems.
The first problem is I can't find a way to display only the sentence column to the labellers, hiding the sentence_id and pre_agreed_labels.
The second is that there is either single labelling or multi labelling, but I would like a way to have two sets of single-selection labels:
Select one for binary classification:
Yes
No
Select one for difficulty of classification:
Easy
Medium
Hard
It seems as though this can be done using custom HTML, but I don't know how to do this - the template it gives you doesn't even render
Finally, having not used mechanical turk before, are there ways of ensuring people take the work seriously and don't just select random answers? I can see there's an option to have x number of people answer the same question, but is there also a way to put in an obvious question to which we already have a 'pre_agreed_label' every nth question, and kick people off the task if they get it wrong? There also appears to be a maximum of $1.20 per task which seems odd.

Comparing data of payments

At my work we have two systems, one that collects the customers payments automatically every month. And one that manages the memberships of those customers. Sadly our outdated technology doesn’t communicate to each other so we don’t know if a customer actually paid for their membership without manually auditing them.
I’ve been put in charge of this process and boy does it take awhile to do.
I have limited knowledge of C++ and was looking into maybe writing a program to do the comparisons for me.
I have two ideas on how to implement this, and was wondering what you guys thought. If these would be best or if it’s even possible or if there’s a better solution?
Current Setup: We have a list of all members in excel, with how much each should be paying, we then go through the actual money collected and check to make sure everyone’s payment went through and was processed and not declined.
Option 1: have a multi-dimensional array of strings. Read the excel file into this array it would have three Columns, first name, last name, amount they should be paying. This would be put in alphabetical order to help with the searching. I would then export the transactions in css file format and read each line one at a time. When it reads a line it would search the array for the same first and last name. Once found it would take the amount paid confirm it said processed and not declined and if so would subtract it from the customers amount they should be paying. In the end if every customers amount they should be paying is equal to 0 then everyone paid.
Option 2: is similar to option 1 just instead of using a multidimensional array it would use two css files. And not put the items into the array at the start.
Thoughts? Is this a smart way to combat this problem? I’m a newbie programmer so I’m just looking for suggestions/advice.
Your solutions would work, but are suited for small datasets. I don't now what your constraints are, but I think that a more elegant solution would be to setup a database on the first system first(instead of the excel file).
Are you allowed to create a database? How many customers are in the excel file?

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example data about users is coming in from different source systems but there is not a common primary key or user identifier
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if customer in source system 1 is the same as source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
Apache beam DAG is by definition acyclic but I could just republish the data through pubsub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously do a self join against all other elements but this seems it's probably an inefficient method
Immutability of a PCollection is a problem if I want to be constantly modifying things as it goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on subset of this data with a subset of the rules but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way to solving these class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do), few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question of how can you efficiently match a record in general will probably look different under different assumptions and requirements as well. For example I would think about:
can you fit all data in memory;
are your rules dynamic. Do they change at all, what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

How do I change this data into proper format to do association rule mining in WEKA?

This is how the data is stored in the file given. There are 8 attributes given.
I need association rule mining done using Apriori Algorithm in WEKA.
Such as, if item 1 & item 2 are bought --> item 4 is also bought or something reasonable as that.
What I tried:
Converting the file into .arff format and loading into weka.
Turning all attributes in Nominal and running the algorithm Apriori.
But the rules generated are very weird.
This is how the result comes. It has no proper information. No rules like what i want, which actually define what a user will buy with what or anything.
i.e. the rules generated here are of no information to me, there is no relation/rules given as to which item will be bought with what.
How should I preprocess this data to format it well or if am making any other mistake would be really appreciated.

What are the steps of preprocessing anonymized data for predictive analysis?

Suppose we have a large dataset of anonymized data. Dataset consist if certain number of variables and observations. All we can learn about data is a type(numeric, char, date, etc.) of variable. We can do it by looking to data manually.
What are the best practise steps of pre-proccessing dataset for the further analysis?
Just for instance, let this data set be just one table, so we don't need to check any relations between tables.
This link gives the complete set of validations currently in practice. Still, to start with:
wherever possible, have your data written in such a way that you can parse it as fast and as easily as possible, using your preferred programming language's methods/constructors;
you can verify if all the data types match correctly - like int fields do not contain string data etc;
you can verify that your values are in acceptable range;
check if a non-nullable field has null values;
check if dates are in expected ranges;
check if data follows correct set-membership constraints wherever applicable;
if you have pattern following data like phone numbers, make sure they are in (XXX) XXX-XXXX design, if you prefer them that way;
are the zip codes at correct accuracy level (in US you may have 5 or 9 digits of accuracy);
if your data is time-series, is it complete (i.e. you have values for all dates)?
is there any unwanted duplication?
Hope this is good enough to get you started...