Hi guys,
I need to create a mapping to do incremental loads in Informatica Cloud. I know that I can do that with parameter files and $LastRunTime, but if I use flat files as the parameter files, those files can be deleted, and relying on $LastRunTime alone could leave temporal gaps in the target.
Is there another way to do incremental loads? Maybe using a lookup, or a way to use two sources in the same mapping: one reading the last data written to the target and the other reading the source data, then comparing both and loading only the newer records.
Any mechanism that reliably allows you to identify which records in your source need to be loaded into your target can be used to build an incremental ETL load - but without knowing your data it is impossible for anyone to tell you what would work for you.
You also need to distinguish between what would work in principle and what would work in practice. For example, comparing your source and target datasets might work with small datasets, but it would quickly become impractical as the size of either dataset grew.
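That said, the two-source idea you describe is essentially the high-watermark pattern: persist the maximum timestamp you actually loaded into the target and filter the source with it, so a deleted parameter file or a skipped run cannot leave a gap. Below is a language-agnostic sketch of that pattern, written in C++ purely for illustration; the watermark file, table and column names are hypothetical, and in Informatica Cloud you would express the same idea with a lookup or a source filter rather than code.

    // Illustrative high-watermark pattern, not Informatica-specific.
    // The watermark is the max timestamp actually written to the target,
    // so gaps cannot appear even if a run is skipped or a file is lost.
    #include <fstream>
    #include <string>

    // Read the last persisted watermark; on the very first run, load everything.
    std::string ReadWatermark(const std::string& path) {
        std::ifstream in(path);
        std::string ts;
        if (!std::getline(in, ts) || ts.empty()) ts = "1970-01-01 00:00:00";
        return ts;
    }

    // Persist the new watermark only after the load has succeeded.
    void WriteWatermark(const std::string& path, const std::string& maxLoadedTs) {
        std::ofstream(path) << maxLoadedTs;
    }

    // Build the delta query: only rows newer than what the target already has.
    std::string DeltaQuery(const std::string& watermark) {
        return "SELECT * FROM src_table WHERE last_modified > '" + watermark + "'";
    }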
I'm trying to make an application to manage information about several providers.
The target system is Windows and I'll be coding in C++.
The users are not expected to be handy with anything related to computers, so I want to make it as fool-proof as possible. Right now my objective is to distribute only an executable, which should store all the information they enter into it.
Each user stores information about their own providers, so I don't need the application to share data with other instances. They do upload the information into a preexisting system via CSV, but I can handle that easily.
I expect them to enter new information at least once a month, so I need to update the embedded information. Is that even possible? Can you make a portable exe and update the information inside it? So far the only portable apps I've seen that allow saving some customisation do so by making you drag files along with the exe.
I'm trying to avoid SQL because of compatibility problems (for my own applications I use external TXT files and parse the data), but if you people tell me it's the only way, I'll use SQL.
I've seen several other questions about embedding files, but it seems all of them deal with constant data. My files need to be updatable.
Thanks in advance!
Edit: Thanks everyone for your comments. I've understood that what I want is not worth the problems it'd create. I'll store the data separately and make an effort so my coworkers understand the difference between an executable and its data (just like explaining the internet to your grandma's grandma...)
While I wouldn't go as far as to say that it's impossible, it will definitely be neither simple nor pretty nor something anyone should ever recommend doing.
The basic problem is: While your .exe is running, the .exe file is mapped into memory and cannot be modified. Now, one thing you could do is have your .exe, when it's started, create a temporary copy of itself somewhere, start that one, tell the new process where the original image is located (e.g., via commandline arguments), and then have the original exit. That temporary copy could then modify the original image. To put data into your .exe, you can either use Resources, or manually modify the PE image, e.g., using a special section created inside the image to hold your data. You can also simply append arbitrary data at the end of an .exe file without corrupting it.
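For what it's worth, the "append data at the end of the .exe" part is the easy bit. Here is a minimal sketch, assuming a hypothetical trailer format (payload, then size, then a magic marker) appended after the image; the copy-yourself-and-relaunch dance described above still applies, since you can only modify the copy that is not currently running.

    // Hypothetical sketch of the "append data to the .exe" idea: payload bytes,
    // then an 8-byte size, then a magic marker, all appended after the PE image.
    // No error handling; must be applied to a copy that is not currently running.
    #include <cstdint>
    #include <cstring>
    #include <fstream>
    #include <string>
    #include <vector>

    static const char kMagic[8] = {'M','Y','D','A','T','A','0','1'};

    void AppendPayload(const std::string& exePath, const std::vector<char>& payload) {
        std::ofstream out(exePath, std::ios::binary | std::ios::app);
        out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
        std::uint64_t size = payload.size();
        out.write(reinterpret_cast<const char*>(&size), sizeof(size));
        out.write(kMagic, sizeof(kMagic));
    }

    // Walk the trailer backwards from the end of the file to find the payload.
    std::vector<char> ReadPayload(const std::string& exePath) {
        std::ifstream in(exePath, std::ios::binary);
        in.seekg(0, std::ios::end);
        std::streamoff fileSize = in.tellg();
        const std::streamoff trailerSize = sizeof(std::uint64_t) + sizeof(kMagic);
        if (fileSize < trailerSize) return {};

        std::uint64_t size = 0;
        char magic[sizeof(kMagic)];
        in.seekg(fileSize - trailerSize);
        in.read(reinterpret_cast<char*>(&size), sizeof(size));
        in.read(magic, sizeof(magic));
        if (std::memcmp(magic, kMagic, sizeof(kMagic)) != 0) return {};  // nothing appended yet

        std::vector<char> payload(size);
        in.seekg(fileSize - trailerSize - static_cast<std::streamoff>(size));
        in.read(payload.data(), static_cast<std::streamsize>(size));
        return payload;
    }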
However, I would like to stress again that I do not recommend actually doing stuff like that. I would simply store data in separate files. If your users are familiar with Excel, then they should be familiar with the idea that data is stored in files…
I want to crunch 10 PB of data. The input data is in a proprietary format (stored in S3), and the first preprocessing step is to convert this proprietary data to CSV and move it back to S3. Due to some constraints, I can't couple the preprocessing step with the map task. What would be the correct way to do that?
I'm planning to use AWS EMR for this. One way would be to run a separate EMR job with no reduce task and upload the data to S3 in the map phase. Is there any better way to do that, as running a MapReduce job without a reduce task just to preprocess data looks like a hacky solution?
It would seem you have at least two options:
Convert the data into a format you find easier to work with. You might want to look at formats such as Parquet or Avro. Using a map-only job for this is an appropriate method; you would only use a reducer in this case if you wanted to control the number of files produced, i.e. combine lots of small files into a larger one.
Create a custom InputFormat and just read the data directly. There are lots of resources on the net about how to do this. Depending on what this proprietary format looks like, you might need to do this anyway to achieve #1.
A few things for you to think about are:
Is the proprietary format space-efficient compared with other formats?
How easy is the format to work with? Would converting it to CSV make your processing jobs simpler?
Is the original data ever updated or added to? Would you continually need to convert it to another format or update already-converted data?
I'm trying to change some information in a DMG file.
It works well, but when I try to open the DMG I get the error message
checksum invalid
So I read the header of my DMG file and got all the information I need.
I have a DataForkChecksum and a MasterChecksum, but I don't know how to calculate them.
Does anyone know how to do this?
The master checksum is a checksum of checksums. It is computed as the CRC-32 checksum (polynomial = 0xedb88320) of the concatenated binary values of all the checksums of the blkx blocks present in the DMG. It's quite painful to compute.
libdmg contains GPL code that does this. In particular have a look at the calculateMasterChecksum() function in http://shanemcc.co.uk/libdmg/libdmg-hfsplus/dmg/dmglib.c
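For illustration, here is a minimal sketch of that final step using zlib's crc32(), which uses the same 0xedb88320 polynomial. It assumes you have already pulled each blkx block's checksum value out of the DMG header as raw bytes, in order; that extraction is the painful part and is exactly what dmglib's calculateMasterChecksum() handles.

    // Sketch only: CRC-32 over the concatenated blkx checksum values.
    // Extracting those values from the DMG header is not shown here.
    #include <cstdint>
    #include <vector>
    #include <zlib.h>

    std::uint32_t MasterChecksum(const std::vector<std::vector<unsigned char>>& blkxChecksums) {
        // Concatenate the binary checksum values of all blkx blocks...
        std::vector<unsigned char> buf;
        for (const auto& c : blkxChecksums)
            buf.insert(buf.end(), c.begin(), c.end());

        // ...then take the CRC-32 (polynomial 0xedb88320) of the whole buffer.
        uLong crc = crc32(0L, Z_NULL, 0);
        crc = crc32(crc, buf.data(), static_cast<uInt>(buf.size()));
        return static_cast<std::uint32_t>(crc);
    }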
If you don't know a lot about checksums I recommend taking a quick look at the Wikipedia entry. Essentially they are used to check the integrity of a file, to make sure that it has not been changed or interfered with in any way. I believe this is especially important in the open-source community, as the code uploaded to sharing websites could have been interfered with by someone other than its original provider.
(Also take a quick look at MD5 on Wikipedia, one read of this and you will quickly appreciate the difficulty of the problem and proposed solutions. )
By including a Checksum the provider is not guaranteeing the quality of the code (they may well do separately) but they are giving you the ability to ensure that what you are downloading is exactly what they provided. A change to a single byte will change the Checksum.
In your case by modifying the DMG you are changing the Checksum. Without knowing the specifics it’s hard to advise you how to get around it. If your setup is communicating with the original DMG provider in some way to compare the checksums then it will be very difficult to fix. You also have no way of knowing what their checksum is.
If it is comparing it with a locally stored file then you have a chance. The simplest way will be to get one of the free tools for creating Checksums and replace them both.
However, all this brings up a question: why are you modifying an externally provided DMG? If you want your computer to perform additional actions when you click on it, I believe there are much simpler ways.
I have more than 32000 binary files that store a certain kind of spatial data. I access the data by file name. The files range in size from 0-400kb. I need to be able to access the content of these files randomly and at various time points. I don't like the idea of having 32000+ separate files of data installed on a mobile device (even though the total file size is < 100mb). I want to merge the files into a single structure that will still let me access the data I need just as quickly. I'd like suggestions as to what the best way to do this is. Any suggestions should have C/C++ libs for accessing the data and should have a liberal license that allows inclusion in commercial, closed-source applications without any issue.
The only thing I've thought of so far is storing everything in an SQLite database, though I'm not sure if this is the best method, or what considerations I need to take into account for storing blob data with quick lookup times (i.e., what schema I'd use).
Why not roll your own?
Your requirements sound pretty simple and straightforward. Just bundle everything into a single binary file and add an index at the beginning telling which file starts where and how big it is.
30 lines of C++ code max. Invest a good 10 minutes designing a good interface for it so you can replace the implementation if and when the need arises.
That is of course if the data is read only. If you need to change it as you go, it gets hairy fast.
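A minimal sketch of that idea for the read-only case, assuming a layout of my own choosing (an entry count, then a name/offset/size index, then the raw blobs back to back):

    // Sketch of a roll-your-own read-only bundle. Layout (no endianness handling,
    // no error checks): [uint32 count][count * (uint16 nameLen, name, uint64 offset,
    // uint64 size)][raw blobs].
    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    struct Entry { std::uint64_t offset; std::uint64_t size; };

    // Build the bundle from (name -> contents) pairs: index first, blobs after.
    void WriteBundle(const std::string& path,
                     const std::map<std::string, std::vector<char>>& files) {
        // First pass: compute the index size so blob offsets are known up front.
        std::uint64_t indexSize = sizeof(std::uint32_t);
        for (const auto& f : files)
            indexSize += sizeof(std::uint16_t) + f.first.size() + 2 * sizeof(std::uint64_t);

        std::ofstream out(path, std::ios::binary);
        std::uint32_t count = static_cast<std::uint32_t>(files.size());
        out.write(reinterpret_cast<const char*>(&count), sizeof(count));

        std::uint64_t offset = indexSize;
        for (const auto& f : files) {
            std::uint16_t nameLen = static_cast<std::uint16_t>(f.first.size());
            std::uint64_t size = f.second.size();
            out.write(reinterpret_cast<const char*>(&nameLen), sizeof(nameLen));
            out.write(f.first.data(), nameLen);
            out.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
            out.write(reinterpret_cast<const char*>(&size), sizeof(size));
            offset += size;
        }
        for (const auto& f : files)
            out.write(f.second.data(), static_cast<std::streamsize>(f.second.size()));
    }

    // Load the index once, then fetch any blob by name with a single seek + read.
    class Bundle {
    public:
        explicit Bundle(const std::string& path) : in_(path, std::ios::binary) {
            std::uint32_t count = 0;
            in_.read(reinterpret_cast<char*>(&count), sizeof(count));
            for (std::uint32_t i = 0; i < count; ++i) {
                std::uint16_t nameLen = 0;
                in_.read(reinterpret_cast<char*>(&nameLen), sizeof(nameLen));
                std::string name(nameLen, '\0');
                in_.read(&name[0], nameLen);
                Entry e{};
                in_.read(reinterpret_cast<char*>(&e.offset), sizeof(e.offset));
                in_.read(reinterpret_cast<char*>(&e.size), sizeof(e.size));
                index_[name] = e;
            }
        }
        std::vector<char> Read(const std::string& name) {
            const Entry& e = index_.at(name);
            std::vector<char> data(e.size);
            in_.seekg(static_cast<std::streamoff>(e.offset));
            in_.read(data.data(), static_cast<std::streamsize>(e.size));
            return data;
        }
    private:
        std::ifstream in_;
        std::map<std::string, Entry> index_;
    };

Lookup is then one in-memory map lookup plus one seek and read, which should be at least as fast as opening one of 32,000 separate files.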
I have a binary file I'm creating in C++, and I'm tasked with creating a metadata format describing the data so that it can be read in Java using the metadata.
One record in the data file has a time, then 64 bytes of data, then a CRC, then a newline delimiter. How should the metadata look to describe what is in the 64 bytes? I've never created a metadata file before.
You probably want to generate a file which describes how many entries there are in the data file, and maybe the time range. Depending on what kind of data you have, the metadata might contain either a per-record entry (RawData, ImageData, etc.) or one global entry (data stored as floats).
It totally depends on what the Java-code is supposed to do, and what use-cases you have. If you want to know whether to open the file at all depending on date, that should be part of the metadata, etc.
I think that maybe you have the design backwards.
First, think about the end.
What result do you want to see? A Java program will create some kind of .csv file?
What kind(s) of file(s)?
What information will be needed to do this?
Then design the metadata to provide the information that is needed to perform the necessary tasks (and any extra tasks you anticipate).
Try to make the metadata extensible so that adding extra metadata in the future will not break the programs that you are writing now. e.g. if the Java program finds metadata it doesn't understand, it just skips it.
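As a purely hypothetical illustration of that kind of extensibility, the C++ writer could emit a simple key/value metadata file describing the record layout field by field; the Java reader walks the "field" lines in order and skips any key it doesn't recognize. The field names, types and sizes below are made up, not taken from your format.

    // Hypothetical example only: one "field" line per on-disk field, in order,
    // giving a (made-up) name, type and size in bytes.
    #include <fstream>
    #include <string>

    void WriteMetadata(const std::string& path) {
        std::ofstream meta(path);
        meta << "version: 1\n"
                "byte_order: little_endian\n"
                "record_delimiter: \\n\n"
                "field: time     uint64  8\n"
                "field: payload  bytes   64\n"   // break the 64 bytes into more fields as needed
                "field: crc32    uint32  4\n";
        // Future keys (e.g. record_count, time_range) can be added later without
        // breaking a Java reader that ignores keys it does not understand.
    }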