Storing values of arbitrary type - clojure

I want to store arbitrary key value pairs. For example,
{:foo "bar" ; string
 :n 12      ; long
 :p 1.2     ; float
}
In datomic, I'd like to store it as something like:
[{:kv/key "foo"
  :kv/value "bar"}
 {:kv/key "n"
  :kv/value 12}
 {:kv/key "p"
  :kv/value 1.2}]
The problem is that :kv/value can only have one type in Datomic. A solution is to split :kv/value into :kv/value-string, :kv/value-long, :kv/value-float, etc. That comes with its own issues, like making sure only one value attribute is used at a time. Suggestions?

If you could give more details on your specific use-case it might be easier to figure out the best answer. At this point it is a bit of a mystery why you may want to have an attribute that can sometimes be a string, sometimes an int, etc.
From what you've said so far, your only real option is to have different attributes like value-string, etc. This is analogous to a SQL DB, where each table column has exactly one type and you would need different columns to store a string, an integer, and so on.
As your problem shows, any tool (such as a DB) is designed with certain assumptions. In this case the DB assumes that each "column" (attribute in Datomic) is always of the same type. The DB also assumes that you will (usually) want to have data in all columns/attrs for each record/entity.
In your problem you are contradicting both of these assumptions. While you can still use the DB to store information, you will have to write custom functions to ensure only one attribute (value-string, value-int, etc.) is in use at a time. You probably want custom insertion functions like "insert-str-val", "insert-int-val", etc., as well as custom read functions like "read-str-val" and so on. It might also be a good idea to have a validation function that accepts any record/entity and verifies that exactly one "type" is in use at any given time.
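For example, the validation function could simply count how many of the typed attributes appear on an entity map. A minimal sketch (the attribute names are the ones proposed in the question):

(def value-attrs #{:kv/value-string :kv/value-long :kv/value-float})

(defn one-value-attr?
  "True when exactly one typed value attribute is present on the entity map."
  [entity]
  (= 1 (count (select-keys entity value-attrs))))

;; (one-value-attr? {:kv/key "n" :kv/value-long 12})                       ;=> true
;; (one-value-attr? {:kv/key "n" :kv/value-long 12 :kv/value-string "12"}) ;=> false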

You can emulate a key-value store with heterogeneous values by making :kv/key a :db.unique/identity attribute, and by making :kv/value either bytes-typed or string-typed and encoding the values in the format you like (e.g. fressian / nippy for :db.type/bytes, edn / json for :db.type/string). I advise that you set :db/index to false for :kv/value in this case; a schema sketch follows the notes below.
Notes:
You will have limited query power, as the values will not be indexed and will need to be deserialized for each query.
If you want to run transaction functions which read or write the values (e.g. for data migrations), you should make your encoding / decoding library available to the Transactor as well.
If the values are large (say, over 20 KB), don't store them in Datomic; use a complementary storage service like AWS S3 and store a URL.
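To make this concrete, here is a minimal sketch of the string-typed variant with edn encoding (it assumes the Datomic peer API, datomic.api; adapt for the client API):

(require '[datomic.api :as d]
         '[clojure.edn :as edn])

(def kv-schema
  [{:db/ident       :kv/key
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}
   {:db/ident       :kv/value
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/index       false}])

(defn upsert-kv-tx
  "Transaction data upserting one pair; the value is encoded as an edn string.
   Upsert works because :kv/key is :db.unique/identity."
  [k v]
  [{:kv/key k :kv/value (pr-str v)}])

(defn read-kv
  "Find the pair by key and decode the edn string; nil when absent."
  [db k]
  (some-> (d/q '[:find ?v .
                 :in $ ?k
                 :where [?e :kv/key ?k]
                        [?e :kv/value ?v]]
               db k)
          edn/read-string))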

Related

Doctrine 2: Best way to store different information

I'd like to know the best practices for storing different information depending on some variables.
For example, I have a ServerEntity, I want to store the disks plugged on this server with a ServerDiskEntity.
If this disk is SSD, I want to store the NAND type (MLC, SLC, TLC).
If this disk is HDD, I want to store the RPM.
Then, when I request ServerEntity->getDisks(), I check the type: if it is SSD, I display the NAND type; if it is HDD, I display the RPM.
Storing everything in the same entity looks awful to me. Having two separate entities (with nothing else gluing them together) is not an option because I store some other information such as tray number.
My closest guess is: ServerDiskEntity stores DiskType and DiskId, and I use this information to getRepository(diskType)->findOneBy(["id" => $DiskId]), but this also seems very unoptimized from my point of view.
Please someone teach me some magic to have a clean way to do this (and I'd like to avoid using ElasticSearch :D )
Single Table Inheritance (or Joined Table Inheritance) is the answer:
https://digitalfortress.tech/php/configure-doctrine-multiple-target-entities/

What are the ways of Key-Value extraction from unstructured text?

I'm trying to figure out the ways (and which of them is best) to extract Values for predefined Keys from unstructured text.
Input:
The doctor prescribed me a drug called favipiravir.
His name is Yury.
Ilya has already told me about that.
The weather is cold today.
I am taking a medicine called nazivin.
Key list: ['drug', 'name', 'weather']
Output:
['drug=favipiravir', 'drug=nazivin', 'name=Yury', 'weather=cold']
So, as you can see, in the 3rd sentence there is no explicit key 'name' and therefore no value is extracted (I think this is the difference from NER). At the same time, 'drug' and 'medicine' are synonyms, so we should treat 'medicine' as the 'drug' key and extract the value as well.
And the next question: what if the key set is mutable?
Should I use a regexp approach as the baseline, since the Keys are predefined, or is there a way to implement this with supervised learning/NNs? (But in that case, how do I deal with mutable keys?)
You can use a parser to tag words. Your problem is similar to Named Entity Recognition (NER). A lot of libraries, like NLTK in Python, have POS taggers available. You can try those. They are generally trained to identify names, locations, etc. Depending on the type of words you need, you may need to train the parser, so you'll also need some labeled data. Check out this link:
https://nlp.stanford.edu/software/CRF-NER.html
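If you do start with the regexp baseline you mentioned, the synonym handling can live in a small pattern table. A naive sketch in Clojure (the patterns are only illustrative and tuned to the example sentences):

;; one pattern per canonical key; "medicine" is treated as a synonym of "drug"
(def key-patterns
  {"drug"    #"(?i)(?:drug|medicine) called (\w+)"
   "name"    #"(?i)name is (\w+)"
   "weather" #"(?i)weather is (\w+)"})

(defn extract-kv
  "Return key=value strings for every pattern match in the text."
  [text]
  (for [[k pattern] key-patterns
        [_ v]       (re-seq pattern text)]
    (str k "=" v)))

;; (extract-kv "The doctor prescribed me a drug called favipiravir. His name is Yury.")
;; => ("drug=favipiravir" "name=Yury")

A mutable key set then just means updating the pattern table, but this approach stays brittle; anything beyond simple templates points back to NER-style models.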

What are the steps of preprocessing anonymized data for predictive analysis?

Suppose we have a large dataset of anonymized data. The dataset consists of a certain number of variables and observations. All we can learn about the data is the type (numeric, char, date, etc.) of each variable. We can do that by looking at the data manually.
What are the best-practice steps for pre-processing the dataset for further analysis?
For instance, let this dataset be just one table, so we don't need to check any relations between tables.
This link gives the complete set of validations currently in practice. Still, to start with:
wherever possible, have your data written in such a way that you can parse it as quickly and easily as possible, using your preferred programming language's methods/constructors;
verify that all the data types match correctly - e.g. that int fields do not contain string data;
verify that your values are in an acceptable range;
check if a non-nullable field has null values;
check if dates are in expected ranges;
check if data follows the correct set-membership constraints wherever applicable;
if you have pattern-following data like phone numbers, make sure they are in the (XXX) XXX-XXXX format, if you prefer them that way;
are the zip codes at the correct accuracy level (in the US you may have 5 or 9 digits of accuracy)?
if your data is a time series, is it complete (i.e. do you have values for all dates)?
is there any unwanted duplication?
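A few of these checks, sketched over a seq of row maps in Clojure (the column names :id and :age are hypothetical):

(defn check-rows
  "Run a handful of the checks above; returns the offending rows per check.
   Assumes hypothetical columns :id (non-nullable) and :age (int, 0-120)."
  [rows]
  {:wrong-type   (remove (comp number? :age) rows)   ; int field holding non-numeric data
   :out-of-range (filter #(let [a (:age %)]
                            (and (number? a) (not (<= 0 a 120))))
                         rows)
   :null-ids     (filter (comp nil? :id) rows)       ; non-nullable field is null
   :duplicates   (for [[row n] (frequencies rows)    ; unwanted duplication
                       :when (> n 1)]
                   row)})

;; (check-rows [{:id 1 :age 34} {:id nil :age "x"} {:id 1 :age 34}])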
Hope this is good enough to get you started...

Is there an established data structure for place/transition Petri nets?

I'm trying to come up with an elegant solution for representing place/transition Petri nets.
So far I save them as follows:
{:netname {:places {:name tokens, ...}
           :transitions #{:t1, :t2, :t3, ...}
           :edges_in #{[:from :to tokens], ...}
           :edges_out #{[:from :to tokens], ...}}}
tokens is a number, and everything else is a symbol with the corresponding name.
//edit - Some more clarification:
The :netname and :name are unique, because it has to be possible to merge two nets, where the places again have to have unique names. The numerical tokens are determined by the user of the Petri nets during creation of a place or edge.
I would be thankful for some pointers or links to a more elaborate / better data structure for my problem.
//edit 2 - I reworked my first take on the data structure because of the uniqueness of place names. :places now references a hashmap. Also, edges_in and edges_out are now hashmaps, because every edge is unique in its origin, destination and token number.
//edit 3 - The use of the structure: it is read and written to in roughly equal measure, I would say. The way a Petri net is used, there is a back and forth between modifying the net and reading it, with maybe slightly more reading towards the end.
I also modified my structure above slightly, so :edges_in and :edges_out now save the triplets as vectors instead of lists. This simplifies saving the hashmap to a file and reading it back, because load-string evaluates lists as expressions.
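Concretely, the reworked structure now looks something like this (one possible shape; here the edge maps are keyed by the unique [from to] pair, and the token numbers are just examples):

(require '[clojure.edn :as edn])

(def example-net
  {:my-net {:places      {:p1 2, :p2 0}   ; place name -> token count
            :transitions #{:t1}
            :edges_in    {[:p1 :t1] 1}    ; [from to] -> tokens
            :edges_out   {[:t1 :p2] 1}}})

;; saving and restoring; edn/read-string reads data without evaluating it,
;; which avoids the load-string pitfall entirely
(spit "net.edn" (pr-str example-net))
(def restored (edn/read-string (slurp "net.edn")))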
You could look at the ISO 15909 interchange format for high-level Petri nets (HLPNs), called PNML. This would at least provide you with a basis for a standard interface to your data structures.

How can I obfuscate/de-obfuscate integer properties?

My users will in some cases be able to view a web version of a database table that stores data they've entered. For various reasons I need to include all the stored data, including a number of integer flags for each record that encapsulate adjacencies and so forth within the data (this is for speed and convenience at runtime). But rather than exposing them one-for-one in the webview, I'd like to have an obfuscated field that's just called "reserved" and contains a single unintelligible string representing those flags that I can easily encode and decode.
How can I do this efficiently in C++/Objective C?
Thanks!
Is it necessary that this field is exposed to the user visually, or just that it's losslessly captured in the HTML content of the webview? If possible, can you include the flags as a hidden input element with each row, i.e., <input type="hidden" …?
Why not convert each of the fields to hex, append them into a single string, and save that value?
As long as you always append the strings in the same order, breaking them back apart and converting them back to numbers should be trivial.
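A sketch of that idea with fixed-width fields (shown in Clojure for brevity; it assumes each flag fits in 32 bits):

(defn pack-flags
  "Concatenate each flag as exactly 8 hex digits, always in the same order."
  [flags]
  (apply str (map #(format "%08x" %) flags)))

(defn unpack-flags
  "Split the string into 8-digit chunks and parse each back as hex."
  [s]
  (map #(Long/parseLong (apply str %) 16) (partition 8 s)))

;; (pack-flags [7 255 4096])                  ;=> "00000007000000ff00001000"
;; (unpack-flags "00000007000000ff00001000")  ;=> (7 255 4096)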
Use symmetric encryption (example) to encode and decode the values. Of course, only you should know the key.
Alternatively, asymmetric RSA is more powerful encryption, but it is less efficient and more complex to use.
Note: I am curious about the "various reasons" that require this design...
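A minimal sketch of the symmetric route on the JVM, using javax.crypto (shown in Clojure for brevity; ECB mode keeps the sketch short and is tolerable only because this is obfuscation, not real security - prefer an authenticated mode like GCM otherwise):

(import '(javax.crypto Cipher)
        '(javax.crypto.spec SecretKeySpec))

(def ^:private secret
  ;; hypothetical hard-coded 16-byte key; load it from secure config in practice
  (SecretKeySpec. (.getBytes "0123456789abcdef") "AES"))

(defn- run-cipher [mode ^bytes data]
  (let [c (Cipher/getInstance "AES/ECB/PKCS5Padding")]
    (.init c (int mode) secret)
    (.doFinal c data)))

(defn encode-reserved
  "Encrypt the flag integer and hex-encode the ciphertext."
  [flags]
  (apply str (map #(format "%02x" %)
                  (run-cipher Cipher/ENCRYPT_MODE (.getBytes (str flags))))))

(defn decode-reserved
  "Hex-decode the string and decrypt back to the flag integer."
  [s]
  (let [bs (byte-array (map #(unchecked-byte (Integer/parseInt (apply str %) 16))
                            (partition 2 s)))]
    (Long/parseLong (String. (run-cipher Cipher/DECRYPT_MODE bs)))))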
Multiply your flag integer by 7, add 3, and convert to base 36. To check that the resulting string is unmodified, convert it back to an integer and check that the result modulo 7 is still 3. If so, subtract 3 and divide by 7 to get the flags. Note that this is subject to replay attacks - users can copy in any valid string.
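That scheme in code (shown in Clojure for brevity; the same arithmetic translates directly to C++):

(defn obfuscate
  "Multiply by 7, add 3, render in base 36."
  [flags]
  (Long/toString (+ (* flags 7) 3) 36))

(defn deobfuscate
  "Return the flags, or nil if the string fails the mod-7 integrity check."
  [s]
  (let [n (Long/parseLong s 36)]
    (when (= 3 (mod n 7))
      (quot (- n 3) 7))))

;; (deobfuscate (obfuscate 12345)) ;=> 12345
;; (deobfuscate "zzzz")            ;=> nil (tampered or corrupted)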
Just calculate a CRC-32 (or similar) and append it to your value. That will tell you, with a very high probability, if your value has been corrupted.