PDI aka Kettle: is it better to use "Add constants" or a literal string in Table input?

In a Kettle/PDI transformation, I need to write to a table the values from another table plus some static strings.
1 Table input: read records;
2 Add constants: add "status" = "A" (and other static strings);
3 Table output: write the old values + status and the other constants.
Is it better to add the literal in the Table input "select" (select id, field1, 'A' as status from ...) or to use an Add constants step?
I suppose it's better to reduce the number of steps, because "Add constants" requires instantiating an extra step.
EDIT: By "better" I mean faster and less memory-consuming.

My opinion is to do the minimum of transformation in the Table input step, because the philosophy of PDI is to make all the transformations visible.
Now, if you're an expert in SQL, or you have a legacy select of 200 lines with complex computations, my answer would be different.

Creating one more step in a transformation means allocating a separate thread, since every step runs as its own thread, as well as at least one BlockingQueue, since rows between steps are distributed in memory through these structures.
So using one more step, even one as simple as Add constants, causes additional resource allocation.
Happily, PDI is still open source.
If you are curious how this is done, this is the base transformation step implementation (and has been for a long time) -> https://github.com/pentaho/pentaho-kettle/blob/master/engine/src/main/java/org/pentaho/di/trans/step/BaseStep.java
This is an example of the code used to distribute rows between steps -> https://github.com/pentaho/pentaho-kettle/blob/master/core/src/main/java/org/pentaho/di/core/BlockingRowSet.java#L54
Sure, for a simple constant, adding it in the SQL query spares you the PDI overhead. There are a lot of ways to make some operations faster or less memory-consuming, but then what about the GUI and the other features PDI is actually famous for?

SAS Hash Tables: Is there a way to find/join on different keys or have optional keys

I frequently work with data for which the keys are not perfect, and I need to join data from a different source. I want to continue using hash objects for the speed advantage; however, when I am using a lot of data I can run into crashes (memory constraints).
A simplified overview: I have two different keys which are both unique but not present for every record; we will call them Key1 and Key2.
My current solution, which is not very elegant (but it works), is the following:
if _N_ = 1 then do;
    declare hash h1(Dataset:"DataSet1");
    h1.DefineKey("key1");
    h1.DefineData("Value");
    h1.DefineDone();
    declare hash h2(Dataset:"DataSet1");
    h2.DefineKey("key2");
    h2.DefineData("Value");
    h2.DefineDone();
end;
set DataSet2;
rc = h1.find();
if rc NE 0 then do;
    rc = h2.find();
end;
So I have exactly the same dataset in two hash tables, but with two different keys defined; if the first key is not found, then I try to find the second key.
Does anyone know of a way to make this more efficient, easier to read, or less memory-intensive?
Apologies if this seems a bad way to accomplish the task; I absolutely welcome criticism so I can learn!
Thanks in advance,
Adam.
I am a huge proponent of hash table lookups - they've helped me do some massive multi-hundred-million-row joins in minutes that otherwise could have taken hours.
The way you're doing it isn't a bad route. If you find yourself running low on memory, the first thing to identify is how much memory your hash table is actually using. This article by sasnrd shows exactly how to do this.
Once you've figured out how much it's using and have a benchmark, or if it doesn't run at all because it runs out of memory, you can play around with some options to see how they improve your memory usage and performance.
1. Include only the keys and data you need
When loading your hash table, exclude any unnecessary variables. You can do this before loading the hash table, or during. You can use dataset options to help reduce table size, such as where, keep, and drop.
dcl hash h1(dataset: 'mydata(keep=key var1)');
2. Reduce the variable lengths
Long character variables take up more memory. Decreasing the length to their minimum required value will help reduce memory usage. Use the %squeeze() macro to automatically reduce all variables to their minimum required size before loading. You can find that macro here.
%squeeze(mydata, mydata_smaller);
3. Adjust the hashexp option
hashexp helps improve performance and reduce hash collisions. Larger values of hashexp will increase memory usage but may improve performance. Smaller values will reduce memory usage. I recommend reading the link above and also looking at the link at the top of this post by sasnrd to get an idea of how it will affect your join. This value should be sized appropriately depending on the size of your table. There's no hard and fast answer as to what value you should use; my recommendation is as big as your system can handle.
dcl hash h1(dataset: 'mydata', hashexp:2);
4. Allocate more memory to your SAS session
If you often run out of memory with your hash tables, you may have too low of a memsize. Many machines have plenty of RAM nowadays, and SAS does a really great job of juggling multiple hard-hitting SAS sessions even on moderately equipped machines. Increasing this can make a huge difference, but you want to adjust this value as a last resort.
The default memsize option is 2GB. Try increasing it to 4GB, 8GB, 16GB, etc., but don't go overboard, like setting it to 0 to use as much memory as it wants. You don't want your SAS session to eat up all the memory on the machine if other users are also on it.
Temporarily setting it to 0 can be a helpful troubleshooting tool to see how much memory your hash object actually occupies if it's not running. But if it's your own machine and you're the only one using it, you can just go ham and set it to 0.
memsize can be adjusted at SAS invocation or within the SAS Configuration File directly (sasv9.cfg on 9.4, or SASV9_Option environment variable in Viya).
I have a fairly similar problem that I approached slightly differently.
First: all of what Stu says is good to keep in mind, regardless of the issue.
If you are in a situation, though, where you can't really reduce the character variable sizes (remember, numerics are always 8 bytes in RAM regardless of their stored length, so don't try to shrink them for this reason), you can approach it this way.
1. Build a hash table with key1 as the key, and key2 as data along with your actual data. Make sure that key1 is the "better" key - the one that is more fully populated. Rename key2 to some other variable name, to make sure you don't overwrite your real key2.
2. Search on key1. If key1 is found, great! Move on.
3. If key1 is missing, then use a hiter object (hash iterator) to iterate over all of the records searching for your key2.
This is not very efficient if key2 is used a lot. Step 3 also might be better done in a different way than using a hiter - you could do a keyed set or something else for those records, for example. In my particular case, both the table and the lookup were missing key1, so it was possible to simply iterate over the much smaller subset missing key1 - if in your case that's not true, and your master table is fully populated for both keys, then this is going to be a lot slower.
The other thing I'd consider is abandoning hash tables and using a keyed set, or a format, or something else that doesn't use RAM.
Or split your dataset:
data haskey1 nokey1;
    set yourdata;
    if missing(key1) then output nokey1;
    else output haskey1;
run;
Then two data steps, one with a hash with key1 and one with a hash with key2, then combine the two back together.
Which of these is the most efficient depends heavily on your dataset sizes (both master and lookup) and on the missingness of key1.

Good way to express dynamic records in C++

What might be a good practice to express dynamic records in C++?
I am trying to write code that handles dynamic records, where the length of a record differs depending on its type. The thing I'm describing doesn't have many operations applied to it and may be accessed directly by the programmer, so I think it should be a struct rather than a class.
The size of a record is fixed on construction, and the number of records may range from hundreds to tens of millions per execution.
I think I have three options. The first is to use a fixed-size array with room for the maximum possible number of items in a record. The maximum possible length is known, and the length varies from 1 to at most 10.
Edited: the records have some common entries.
The second option is to allocate a dynamic array for each record. In this case, would it be better to use std::vector rather than managing the allocation myself? Or is using the STL always better than implementing it myself?
The third option is to use inheritance with polymorphism. I don't really want to do this because there aren't many "common" operations, or even any operations at all, to apply.
Another feature I want: since this is going to be a record, I'd like it to be accessible by a string, or at least by some name-like method. This is not mandatory, but I think it would be good for future code readability. I could use a std::map instead of a dynamic array or std::vector; in that case I would like getters/setters similar to C# properties (which C++ does not have).
So I have a few things to consider. One is the readability of future code: if I sacrifice readability, I could just access records by indexes, not by names. Another is performance and memory usage: I really want to avoid the string comparisons that a std::map lookup would incur, and I also want to use the least amount of memory possible, since there are lots of records to handle.
Until now, while doing personal projects, I have thought a while when choosing which data structure to use, but my reasoning was never very good. Please share your insights, or better yet, recommend some literature to read (not only generic algorithm and data-structure topics, but also real coding practice, ideally with an in-depth explanation of the STL).
Edited: I am keeping a log of the behavior of AIs. Each AI moves through a 2D space, and I am trying to log the behavior of every AI. A character (AI) can be generated, deleted, enter a room, leave a room, kill other AIs, be killed itself, and so on. I am trying to make a log of these behaviors. Some (not all) of the records are:
Type_Enter_Room : LogType | AI_ID | Room_ID | EnterTime
Type_Leave_Room : LogType | AI_ID | Room_ID | LeaveTime
Type_Attack_AI : LogType | AttackerAI_ID | TargetAI_ID | AttackTime | Location
Type_AI_Dead : LogType | AI_ID | DeadTime | DeadPlace
The records share some items, but also have different entries. And all of the records need to be in memory to perform the operations I want to apply.
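For illustration only, here is a minimal sketch of what the first option (one struct with a fixed maximum number of entries) might look like for records such as these; the field names, widths, and the three-slot limit are assumptions, not part of the question.

#include <array>
#include <cstdint>

enum class LogType : std::uint8_t { EnterRoom, LeaveRoom, AttackAI, AIDead };

struct LogRecord {
    LogType       type;                    // discriminator: how to read the extras
    std::uint32_t ai_id;                   // common entry (the acting AI)
    std::uint64_t time;                    // common entry (event time)
    std::array<std::uint32_t, 3> extra{};  // room id, target id, location, ...
    std::uint8_t  extra_count = 0;         // how many extra slots are used
};

// Example: Type_Attack_AI carries a target id and a location in the extras.
LogRecord make_attack(std::uint32_t attacker, std::uint32_t target,
                      std::uint64_t when, std::uint32_t location) {
    LogRecord r;
    r.type = LogType::AttackAI;
    r.ai_id = attacker;
    r.time = when;
    r.extra[0] = target;
    r.extra[1] = location;
    r.extra_count = 2;
    return r;
}

The type field tells the reader how to interpret the extra slots, so no string keys or per-record allocation are needed.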

Stata: saving scalars/globals/matrices to .dta file - workaround?

I compute predicted factor scores for many different measures as well as the measurement error for these factor scores, then delete the measures. I delete the measures because my data is quite large; I do not want all of the measures using RAM in the large data set where I run my analysis.
For my analysis, I regress on factors and other variables. I can correct the regression coefficients for measurement error by using the measurement error of these factor scores. However, I cannot figure out a convenient way to save the measurement error associated with each factor to a .dta file.
Why am I not running all of this in one Stata session, thus obviating the need to save the scalars/macros/matrices?
I work on a server and on my PC. The server has lot of memory and processing power, but is very inconvenient to use. So I often break my work into two stages. First, I clean the data and reduce variable count (in this case, computing factor scores from a large set of measures). Cleaning the data itself often takes a huge amount of memory when I use multiple reshape commands. Then I save the cleaned data to a .dta file and work on it on my PC. The cleaned data is small enough to run on my PC, and doesn't require the sort of manipulations that use an excess of RAM.
I have considered a few approaches.
1. Create a variable for the measurement error of each factor. While this can work, I don't like it for several reasons: A) It is a profligate use of memory. I need only one scalar per factor variable, but I am creating _N cells for this variable. While I may be able to make the data set fit in memory by judiciously dropping variables or using other workarounds, I want a better solution. B) It just seems conceptually wrong.
2. Create a variable that contains all scalar values, and a second variable that contains the names for these scalars (i.e., the factor to which they associate). I am having trouble making this work. How do I extract the value for each non-missing _n and put it into a Stata matrix or Mata matrix? Alternatively, how do I create a set of macros where the macro name comes from the variable containing the name and the macro value comes from the variable containing the scalar?
3. Somehow save the scalars/macros/Mata matrices/Stata matrices directly and load them after opening the .dta file. Apparently, Stata does not save scalars, macros, Mata matrices, or Stata matrices to a .dta file, so the most convenient and obvious solution doesn't exist in Stata. I have seen other people recommend putting the scalars into a Mata matrix, then loading the Mata matrix into memory and saving it as a .dta file. Then I could open this file, save it to a Mata matrix, then load the data I want to work on. All this seems needlessly complicated, and I am hoping there is a better way.
I would love advice on a simpler method to save these scalars, or else a way to make one of the above approaches simpler.
This is very frustrating. While Stata is extremely powerful and easy to use for a variety of things, it also has these frustrating 'holes' where you can spend an entire day trying to make something work that you thought would be quite simple.
Stunningly, the easiest solution would be to simply copy the scalars to a spreadsheet and manually input them. This is not an automated solution, but I am realizing it would take me only maybe a quarter of an hour instead of the enormous time it is taking me to automate this.
To answer your second point:
Suppose the variable "factorname" contains the factor name and the variable "error" contains the measurement error. Then
forval i = 1/`=_N' {
    local factor_`=factorname[`i']' = error[`i']
}
will create _N locals containing the measurement errors.
Your long introduction doesn't give the reader an idea of the order of magnitude of the data you are dealing with. How many factors? Average number of measures for each of them? Observations? Also, you do not provide any code, which makes it quite hard to help you.
Anyway, I would avoid the creation of variables.
Doesn't estimates save whatever.dta work in your case?
If you really want to store your estimation results in macros to load them in another Stata session, then you cannot directly save them in a .dta file.
However, you can still associate the macros with a dataset and retrieve them afterwards by defining characteristics.

Building static (but complicated) lookup table using templates

I am currently in the process of optimizing a numerical analysis code. Within the code, there is a 200x150-element lookup table (currently a static std::vector<std::vector<double>>) that is constructed at the beginning of every run. The construction of the lookup table is actually quite complex: the values in the table are computed using an iterative secant method on a complicated set of equations. Currently, constructing the lookup table takes 20% of the run time (run times are on the order of 25 seconds; lookup table construction takes 5 seconds). While 5 seconds might not seem like a lot, when running our MC simulations, where we run 50k+ simulations, it suddenly becomes a big chunk of time.
Along with some other ideas, one thing that has been floated- can we construct this lookup table using templates at compile time? The table itself never changes. Hard-coding a large array isn't a maintainable solution (the equations that go into generating the table are constantly being tweaked), but it seems that if the table can be generated at compile time, it would give us the best of both worlds (easily maintainable, no overhead during runtime).
So, I propose the following (much simplified) scenario. Let's say you wanted to generate a static array (use whatever container suits you best - a 2D C array, vector of vectors, etc.) at compile time. You have a function defined:
double f(int row, int col);
where the return value is the entry in the table, row is the lookup table row, and col is the lookup table column. Is it possible to generate this static array at compile time using templates, and how?
Usually the best solution is code generation. There you have all the freedom and you can be sure that the output is actually a double[][].
Save the table on disk the first time the program is run, and only regenerate it if it is missing, otherwise load it from the cache.
Include a version string in the file so it is regenerated when the code changes.
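A hedged sketch of that caching scheme; the file layout (a version string followed by the raw doubles) and the helper names are assumptions for illustration only.

#include <array>
#include <cstddef>
#include <cstdio>
#include <cstring>

constexpr int kRows = 200, kCols = 150;
static const char kVersion[] = "table-v3";  // bump whenever the equations change

using Table = std::array<std::array<double, kCols>, kRows>;

bool load_table(const char* path, Table& t) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    char ver[sizeof kVersion] = {};
    bool ok = std::fread(ver, 1, sizeof ver, f) == sizeof ver &&
              std::memcmp(ver, kVersion, sizeof ver) == 0 &&
              std::fread(t.data(), sizeof(double), std::size_t(kRows) * kCols, f) ==
                  std::size_t(kRows) * kCols;
    std::fclose(f);
    return ok;
}

void save_table(const char* path, const Table& t) {
    if (std::FILE* f = std::fopen(path, "wb")) {
        std::fwrite(kVersion, 1, sizeof kVersion, f);
        std::fwrite(t.data(), sizeof(double), std::size_t(kRows) * kCols, f);
        std::fclose(f);
    }
}

At startup you would try load_table first and fall back to the expensive secant-method construction followed by save_table, bumping kVersion whenever the generating equations change.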
A couple of things here.
What you want to do is almost certainly at least partially possible.
Floating point values are invalid template arguments (just is, don't ask why). Although you can represent rational numbers in templates using N1/N2 representation, the amount of math that you can do on them does not encompass the entire set that can be done on rational numbers. root(n) for instance is unavailable (see root(2)). Unless you want a bajillion instantiations of static double variables you'll want your value accessor to be a function. (maybe you can come up with a new template floating point representation that splits exp and mant though and then you're as well off as with the double type...have fun :P)
Metaprogramming code is hard to format in a legible way. Furthermore, by its very nature, it's rather tough to read. Even an expert is going to have a tough time analyzing a piece of TMP code they didn't write even when it's rather simple.
If an intern or anyone under senior level even THINKS about just looking at TMP code their head explodes. Although, sometimes senior devs blow up louder because they're freaking out at new stuff (making your boss feel incompetent can have serious repercussions even though it shouldn't).
All of that said...templates are a Turing-complete language. You can do "anything" with them...and by anything we mean anything that doesn't require some sort of external ability like system access (because you just can't make the compiler spawn new threads for example). You can build your table. The question you'll need to then answer is whether you actually want to.
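To make the N1/N2 remark above concrete, here is a tiny illustrative sketch (in the spirit of std::ratio); it is not a full compile-time table, just the representation idea with a function accessor, and the specializations shown are made up.

#include <cstdint>

template <std::intmax_t Num, std::intmax_t Den>
struct Rational {
    // value accessor is a function, not a static double variable
    static double value() { return static_cast<double>(Num) / Den; }
};

// A table "cell" could then be a specialization per (row, col):
template <int Row, int Col> struct Cell;                 // primary template
template <> struct Cell<0, 0> : Rational<355, 113> {};   // made-up entry
template <> struct Cell<0, 1> : Rational<1, 2> {};       // made-up entry

double example() { return Cell<0, 0>::value() + Cell<0, 1>::value(); }

Each entry needs its own specialization (or a generated one), which is part of why code generation or disk caching tends to win in practice.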
Why not have separate programs? One that generates the table and stores it in a file, and one that loads the file and runs the simulation on it. That way, when you need to tweak the equations that generate the table, you only need to recompile that program.
If your table was a bunch of ints, then yes, you could. Maybe. But what you certainly couldn't do is generate doubles at compile-time.
More importantly, I think that a plain double[][] would be better than a vector of vectors here - you're doing a LOT of dynamic allocation for a statically sized table.

How to handle changing data structures on program version update?

I do embedded software, but this isn't really an embedded question, I guess. I don't (can't for technical reasons) use a database like MySQL, just C or C++ structs.
Is there a generic philosophy of how to handle changes in the layout of these structs from version to version of the program?
Let's take an address book. From program version x to x+1, what if:
a field is deleted (seems simple enough) or added (ok if all can use some new default)?
a string gets longer or shorter? An int goes from 8 to 16 bits, signed or unsigned?
maybe I combine surname/forename, or split name into two fields?
These are just some simple examples; I am not looking for answers to those, but rather for a generic solution.
Obviously I need some hard coded logic to take care of each change.
What if someone doesn't upgrade from version x to x+1, but waits for x+2? Should I try to combine the changes, or just apply x -> x+1 followed by x+1 -> x+2?
What if version x+1 is buggy and we need to roll back to a previous version of the s/w, but have already "upgraded" the data structures?
I am leaning towards TLV (http://en.wikipedia.org/wiki/Type-length-value) but can see a lot of potential headaches.
This is nothing new, so I just wondered how others do it....
I do have some code where a longer string is pieced together from two shorter segments if necessary. Yuck. Here's my experience after 12 years of keeping some data compatible:
Define your goals - there are two:
new versions should be able to read what old versions write
old versions should be able to read what new versions write (harder)
Add version support to release 0 - at the very least, write a version header. Together with keeping (potentially a lot of) old reader code around, that can solve the first case primitively. If you don't want to implement case 2, start rejecting newer data right now!
If you need only case 1, and the expected changes over time are rather minor, you are set. Anyway, these two things done before the first release can save you many headaches later.
Convert during serialization - at run time, keep the data in memory only in the "new" format. Do the necessary conversions and tests at the persistence boundary (convert to the newest format when reading, implement backward compatibility when writing). This isolates version problems in one place, helping to avoid hard-to-track-down bugs.
Keep a set of test data from all versions around.
Store a subset of the available types - limit the actually serialized data to a few data types, such as int, string, and double. In most cases, the extra storage size is made up for by the reduced code size needed to support changes in these types. (That's not always a tradeoff you can make on an embedded system, though.)
E.g. don't store integers shorter than the native width (you might need an exception when you have to store long integer arrays).
Add a breaker - store some key that allows you to intentionally make old code display an error message saying that this new data is incompatible. You can use a string that is part of the error message - then your old version can display an error message it doesn't actually know about: "you can import this data using the ConvertX tool from our web site" is not great in a localized application, but still better than "Ungültiges Format".
Don't serialize structs directly - that's the logical / physical separation. We work with a mix of the two approaches below, both having their pros and cons. Neither can be implemented without some runtime overhead, which can pretty much limit your choices in an embedded environment. At any rate, don't use fixed array/string lengths during persistence; that alone should solve half of your troubles.
(A) A proper serialization mechanism - we use a binary serializer that lets you start a "chunk" when storing, which has its own length header. When reading, extra data is skipped and missing data is default-initialized (which simplifies implementing "read old data" a lot in the serialization code). Chunks can be nested. That's all you need on the physical side, but it needs some sugar-coating for common tasks.
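A minimal sketch of that chunk idea, under assumed names and a made-up on-disk layout ([u32 length][payload]); it is not the author's actual serializer.

#include <cstdint>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

static void put_u32(Buffer& b, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) b.push_back(std::uint8_t(v >> (8 * i)));
}
static std::uint32_t get_u32(const std::uint8_t* p) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) v |= std::uint32_t(p[i]) << (8 * i);
    return v;
}

struct Contact {             // version 2 added 'age'; version 1 files lack it
    std::uint32_t id  = 0;
    std::uint32_t age = 0;   // default-initialized when reading old data
};

void write_contact(Buffer& out, const Contact& c) {
    Buffer payload;
    put_u32(payload, c.id);
    put_u32(payload, c.age);
    put_u32(out, std::uint32_t(payload.size()));  // chunk length header
    out.insert(out.end(), payload.begin(), payload.end());
}

Contact read_contact(const std::uint8_t*& p) {
    const std::uint32_t len = get_u32(p); p += 4;
    const std::uint8_t* end = p + len;
    Contact c;
    if (p + 4 <= end) { c.id  = get_u32(p); p += 4; }
    if (p + 4 <= end) { c.age = get_u32(p); p += 4; }  // missing in v1 -> stays 0
    p = end;                                           // skip unknown extra data
    return c;
}

A version 1 reader given version 2 data skips the trailing bytes it doesn't understand; a version 2 reader given version 1 data leaves age at its default.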
(B) Use a different in-memory representation - the in-memory representation could basically be a map<id, record>, where id would likely be an integer, and record could be:
empty (not stored)
a primitive type (string, integer, double - the less you use the easier it gets)
an array of primitive types
an array of records
I initially wrote that so the guys wouldn't come to me with every format compatibility question, and while the implementation has many shortcomings (I wish I'd recognized the problem with the clarity of today...) it can solve most of these scenarios.
Querying a non-existing value will by default return a default/zero-initialized value. When you keep that in mind when accessing the data and when adding new data, it helps a lot: imagine version 1 calculates "foo length" automatically, whereas in version 2 the user can override that setting. A value of zero in the "calculation type" or "length" should then mean "calculate automatically", and you are set.
The following are "change" scenarios you can expect:
a flag (yes/no) is extended to an enum ("yes/no/auto")
a setting splits up into two settings (e.g. "add border" could be split into "add border on even days" / "add border on odd days".)
a setting is added, overriding (or worse, extending) an existing setting.
For implementing case 2, you also need to consider:
no value may ever be removed or replaced by another one (but in the new format, it could say "not supported", and a new item is added)
an enum may contain unknown values, and the valid range of a value may change in other ways
Phew, that was a lot. But it's not as complicated as it seems.
There's a huge concept that the relational database people use.
It's called breaking the architecture into "Logical" and "Physical" layers.
Your structs are both a logical and a physical layer mashed together into a hard-to-change thing.
You want your program to depend on a logical layer. You want your logical layer to -- in turn -- map to physical storage. That allows you to make changes without breaking things.
You don't need to reinvent SQL to accomplish this.
If your data lives entirely in memory, then think about this. Divorce the physical file representation from the in-memory representation. Write the data in some "generic", flexible, easy-to-parse format (like JSON or YAML). This allows you to read in a generic format and build your highly version-specific in-memory structures.
If your data is synchronized onto a filesystem, you have more work to do. Again, look at the RDBMS design idea.
Don't code a simple brainless struct. Create a "record" which maps field names to field values. It's a linked list of name-value pairs. This is easily extensible to add new fields or change the data type of the value.
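As a rough illustration of that record idea (using a std::vector of pairs rather than a hand-rolled linked list, purely for brevity; the names are made up):

#include <string>
#include <vector>

struct Field {
    std::string name;   // e.g. "surname", "phone"
    std::string value;  // stored generically; parse on use
};

struct Record {
    std::vector<Field> fields;  // extensible: adding a field breaks nothing

    const std::string* find(const std::string& name) const {
        for (const Field& f : fields)
            if (f.name == name) return &f.value;
        return nullptr;  // absent in older data -> caller applies a default
    }
};

Adding a new field to newer data never breaks an older reader; it simply never asks for that name.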
Some simple guidelines if you're talking about a structure used as part of a C API:
have a structure size field at the start of the struct - this way, code using the struct can always ensure it's dealing only with valid data (for example, many of the structures the Windows API uses start with a cbCount field so these APIs can handle calls made by code compiled against old SDKs, or even newer SDKs that have added fields); a minimal sketch follows this list.
Never remove a field. If you don't need to use it anymore, that's one thing, but to keep things sane for code that uses an older version of the structure, don't remove the field.
it may be wise to include a version number field, but often the count field can be used for that purpose.
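For illustration, here is a rough sketch of that size-field pattern; the struct and field names (Options, cbSize, flags, timeout) are made up for this example and not part of the original answer.

#include <cstddef>
#include <cstring>

// Hypothetical versioned struct: version 1 ended after 'flags';
// version 2 appended 'timeout' without touching existing fields.
struct Options {
    std::size_t cbSize;   // caller sets this to sizeof(Options) as it was compiled
    int         flags;
    int         timeout;  // added in version 2
};

void apply_options(const Options* opt) {
    Options local;
    std::memset(&local, 0, sizeof local);   // defaults for fields the caller lacks
    local.timeout = 30;                     // default used when an old caller omits it
    // Copy only as many bytes as the caller actually provided.
    std::size_t n = opt->cbSize < sizeof local ? opt->cbSize : sizeof local;
    std::memcpy(&local, opt, n);
    local.cbSize = sizeof local;
    // ... act on local.flags and local.timeout ...
}

An old binary compiled against the version 1 header passes a smaller cbSize, so the newer code leaves timeout at its default instead of reading past the caller's struct.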
Here's an example - I have a bootloader that looks for a structure at a fixed offset in a program image for information about that image that may have been flashed into the device.
The loader has been revised, and it supports additional items in the struct for some enhancements. However, an older program image might be flashed, and that older image uses the old struct format. Since the rules above were followed from the start, the newer loader is fully able to deal with that. That's the easy part.
And if the struct is revised further and a new image uses the new struct format on a device with an older loader, that loader will be able to deal with it, too - it just won't do anything with the enhancements. But since no fields have been (or will be) removed, the older loader will be able to do whatever it was designed to do and do it with the newer image that has a configuration structure with newer information.
If you're talking about an actual database that has metadata about the fields, etc., then these guidelines don't really apply.
What you're looking for is forward-compatible data structures. There are several ways to do this. Here is the low-level approach.
struct address_book
{
    unsigned int length; // total length of this struct in bytes
    char items[0];
};
where items is a variable-length array of structures, each of which describes its own size and type:
struct item
{
    unsigned int size; // how long data[] is
    unsigned int id;   // first name, phone number, picture, ...
    unsigned int type; // string, integer, jpeg, ...
    char data[0];
};
In your code, you iterate through these items (address_book->length will tell you when you've hit the end) with some intelligent casting. If you hit an item whose ID you don't know or whose type you don't know how to handle, you just skip it by jumping over its data (using item->size) and continue on to the next one. That way, if someone invents a new data field in the next version or deletes one, your code is able to handle it. Your code should be able to handle conversions that make sense (if the employee ID went from integer to string, it should probably handle it as a string), but you'll find that those cases are pretty rare and can often be handled with common code.
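A rough sketch of that iteration loop, using the structs above; the handler cases are illustrative assumptions, not a definitive implementation.

// Walk every item in the block, skipping ones this version doesn't understand.
void walk_address_book(const address_book* book) {
    const char* p   = book->items;
    const char* end = reinterpret_cast<const char*>(book) + book->length;
    while (p + sizeof(item) <= end) {
        const item* it = reinterpret_cast<const item*>(p);
        switch (it->id) {
            case 1: /* first name: interpret it->data according to it->type */ break;
            case 2: /* phone number */                                          break;
            default: /* unknown id from a newer/older version: ignore it */     break;
        }
        p += sizeof(item) + it->size;  // jump over this item's payload
    }
}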
I have handled this in the past, in systems with very limited resources, by doing the translation on the PC as a part of the s/w upgrade process. Can you extract the old values, translate them to the new values, and then update the db in place?
For a simplified embedded db I usually don't reference any structs directly, but put a very lightweight API around any parameters. This allows you to change the physical structure below the API without impacting the higher-level application.
Lately I'm using bencoded data. It's the format that bittorrent uses. It's simple, and you can easily inspect it visually, so it's easier to debug than binary data, and it is tightly packed. I borrowed some code from the high-quality C++ libtorrent. For your problem, it's as simple as checking that the fields exist when you read them back. And, for a gzip-compressed file, it's as simple as doing:
ogzstream os(meta_path_new.c_str(), ios_base::out | ios_base::trunc);
Bencode map(Bencode::TYPE_MAP);
map.insert_key("url", url.get());
map.insert_key("http", http_code);
os << map;
os.close();
To read it back:
igzstream is(metaf, ios_base::in | ios_base::binary);
is.exceptions(ios::eofbit | ios::failbit | ios::badbit);
try {
    torrent::Bencode b;
    is >> b;
    if (b.has_key("url"))
        d->url = b["url"].as_string();
} catch (...) {
}
I have used Sun's XDR format in the past, but I prefer this now. Also it's much easier to read with other languages such as perl, python, etc.
Embed a version number in the struct, or do as Win32 does and use a size parameter.
If the passed struct is not the latest version, then fix up the struct.
About 10 years ago I wrote a similar system to the above for a computer game's save-game system. I actually stored the class data in a separate class description file, and if I spotted a version number mismatch then I could run through the class description file, locate the class, and then upgrade the binary class data based on the description. This obviously required default values to be filled in for new class member entries. It worked really well, and it could also be used to auto-generate .h and .cpp files.
I agree with S.Lott in that the best solution is to separate the physical and logical layers of what you are trying to do. You are essentially combining your interface and your implementation into one object/struct, and in doing so you are missing out on some of the power of abstraction.
However if you must use a single struct for this, there are a few things you can do to help make things easier.
1) Some sort of version number field is practically required. If your structure is changing, you will need an easy way to look at it and know how to interpret it. Along these same lines, it is sometimes useful to have the total length of the struct stored in a structure field somewhere.
2) If you want to retain backwards compatibility, you will want to remember that code will internally reference structure fields as offsets from the structure's base address (from the "front" of the structure). If you want to avoid breaking old code, make sure to add all new fields to the back of the structure and leave all existing fields intact (even if you don't use them). That way, old code will be able to access the structure (but will be oblivious to the extra data at the end) and new code will have access to all of the data.
3) Since your structure may be changing sizes, don't rely on sizeof(struct myStruct) to always return accurate results. If you follow #2 above, then you can see that you must assume that a structure may grow larger in the future. Calls to sizeof() are calculated once (at compile time). Using a "structure length" field allows you to make sure that when you (for example) memcpy the struct you are copying the entire structure, including any extra fields at the end that you aren't aware of.
4) Never delete or shrink fields; if you don't need them, leave them blank. Don't change the size of an existing field; if you need more space, create a new field as a "long version" of the old field. This can lead to data duplication problems, so make sure to give your structure a lot of thought and try to plan fields so that they will be large enough to accommodate growth.
5) Don't store strings in the struct unless you know that it is safe to limit them to some fixed length. Instead, store only a pointer or array index and create a string storage object to hold the variable-length string data. This also helps protect against a string buffer overflow overwriting the rest of your structure's data.
Several embedded projects I have worked on have used this method to modify structures without breaking backwards/forwards compatibility. It works, but it is far from the most efficient method. Before long, you end up wasting space with obsolete/abandoned structure fields, duplicate data, data that is stored piecemeal (first word here, second word over there), etc etc. If you are forced to work within an existing framework then this might work for you. However, abstracting away your physical data representation using an interface will be much more powerful/flexible and less frustrating (if you have the design freedom to use such a technique).
You may want to take a look at how Boost Serialization library deals with that issue.