I have approximately 130,000 gz files, each containing one file which needs to be renamed with a .xml extension in order to be viewed properly. For example, one gz file might be named 100.san.form.gz and contain the file 100.san.form, which needs to be renamed to 100.san.form.xml to be viewed properly.
I am trying to write a script to unzip all of these files and then rename them appropriately, and read them into my program to parse.
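One way to approach the unzip-and-rename step is sketched below in Python, using only the standard library. The function names `gunzip_to_xml` and `gunzip_all` and the directory arguments are made up for illustration, and the sketch assumes every archive name ends in `.gz` and holds a single compressed stream.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_to_xml(gz_path: Path, out_dir: Path) -> Path:
    """Decompress one .gz file and save the result with an added .xml suffix."""
    # "100.san.form.gz" -> "100.san.form" -> "100.san.form.xml"
    target = out_dir / (gz_path.name[: -len(".gz")] + ".xml")
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        # Stream the data so large files are never held in memory at once.
        shutil.copyfileobj(src, dst)
    return target


def gunzip_all(in_dir: Path, out_dir: Path) -> list[Path]:
    """Decompress every *.gz file in in_dir into out_dir (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return [gunzip_to_xml(p, out_dir) for p in sorted(in_dir.glob("*.gz"))]
```

The returned list of paths can then be handed to whatever XML parser the program already uses.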
In Data Fusion pipeline:
How do I read all the file names from a bucket, load some based on file name, and archive the others?
Is it possible to run a gsutil script from the Data Fusion pipeline?
Sometimes more complex logic is needed to decide which files should be loaded. I need to go through all the files in a location and load only those dated today or later. The date is in the file name as a suffix, e.g. customer_accounts_2021_06_15.csv
Depending on where you are planning on writing the files to, you may be able to use the GCS Source plugin with the logicalStartTime macro in the Regex Path Filter field to pick up only files after a certain date. However, this may cause all your file data to be condensed down to record formats. If you want to retain each file in its original format, you may want to consider writing your own custom plugin.
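Independently of Data Fusion, the date-based selection itself can be sketched in plain Python. This is not Data Fusion or gsutil code. The name `split_by_date` is hypothetical, the file names are assumed to follow the `_YYYY_MM_DD.csv` suffix from the question, and sending undated files to the archive list is an assumption made only for this example.

```python
import re
from datetime import date

# Matches a trailing date suffix such as "_2021_06_15.csv".
DATE_SUFFIX = re.compile(r"_(\d{4})_(\d{2})_(\d{2})\.csv$")


def split_by_date(names, cutoff):
    """Split file names into (to_load, to_archive) by their date suffix."""
    to_load, to_archive = [], []
    for name in names:
        match = DATE_SUFFIX.search(name)
        try:
            file_date = date(*map(int, match.groups())) if match else None
        except ValueError:
            file_date = None  # suffix looked like a date but was not a valid one
        if file_date is not None and file_date >= cutoff:
            to_load.append(name)
        else:
            # Assumption: older, undated, or malformed names are archived.
            to_archive.append(name)
    return to_load, to_archive
```

Calling `split_by_date(names, date.today())` would keep only files dated today or later in the first list.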
I have written a zip class that uses functions and code from miniz to: Open an archive, Close an archive, Open a file in the archive, Close a file in the archive, and write to the currently open file in the archive.
Currently opening a file in an archive overwrites it if it already exists. I would like to know if it is possible to APPEND to a file within a zip archive that has already been closed?
I want to say that it is possible, but I would have to edit all the offsets in each of the other files' internal states and within the central directory. If it is possible, is this the right path to look into?
Note:
I deal with large files so decompressing and compressing again is not ideal and neither is doing any copying of files. I would just like to "open" a file in the zip archive to continue writing compressed data to it.
I would just like to "open" a file in the zip archive to continue writing compressed data to it.
Compressed archives do not work like a file system or folder in which you can change individual files. They keep, for example, checksums that have to hold for the whole archive.
So no, you can't do this in place; you have to unpack the compressed file, apply your changes, and compress everything again.
I am using .txt files in my program for reading and writing records (the records contain both text and numerals). Recently I learned that .dat files can also be used like .txt files for file operations. I would like to know the difference between the two and the advantages and disadvantages of one over the other.
Text files (.txt files) are easy for people to read but a bit hard to parse in programs, whereas .dat is usually used to store data that is not just plain text.
Generally, .txt files contain letters, characters and symbols that are human-readable.
A .dat file is usually a binary file whose data is not always printable on screen.
The extension of a file is a helper so that the operating system (or user) can choose the appropriate program to open it. It has no effect on the actual file contents. There are some conventions about which extensions to use, but there is nothing keeping you from using any arbitrary extension for your files. For instance, you can rename a .jar file to a .zip file and open it with pkunzip.
So for C++ the extension does not matter, but for you as a programmer it may give a hint of the file contents i.e. open it in text or binary mode.
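The same point can be shown with a small sketch, written here in Python rather than C++ to keep it short. The helper name `roundtrip` and the sample payload are invented for the example. The idea is that the bytes that come back depend on the mode used to open the file, not on its extension.

```python
import tempfile
from pathlib import Path


def roundtrip(extension: str, payload: bytes) -> bytes:
    """Write payload to a temporary file with the given extension and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"records{extension}"
        path.write_bytes(payload)   # binary write: bytes are stored unchanged
        return path.read_bytes()    # binary read: bytes come back unchanged


# The payload mixes printable text with bytes that are not printable.
payload = b"id=7\x00\x1a\xff"
same_for_txt = roundtrip(".txt", payload) == payload
same_for_dat = roundtrip(".dat", payload) == payload
```

Both comparisons are true: opened in binary mode, a `.txt` file and a `.dat` file behave identically.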
In most languages, such as C/C++, the file type makes no difference to file operations (read, write or edit).
Only if you want to work with binary files should you open them in binary mode, because text mode may translate line endings and, on Windows, treats a Ctrl-Z byte (0x1A) as the end of the file. .dat files are often binary too!
If you want to store and read some data, XML files and sometimes .dat files are better, because there are good libraries for reading them and they don't need the hard parsing that plain text files do.
I tried to open a .dat file using Stata, and it actually opened, but the data set was a complete mess. I took the file from NBER (CPS data)...
Click on the A icon for March 1964.
I tried the regular Stata procedure for .dat files: File->Import->ASCII data created by spreadsheet (delimiter " "), as recommended in the Stata manual for .dat files.
But it is still not working. Are there any other ways to open a .dat file? Can I convert it to .csv somehow?
(All the data files are ASCII files compressed with the Unix compress command.)
There is a Java app, DataFerrett, that gets you the data from CPS. It lets you get CPS and other data sets, but it is not very efficient.
I can show you an example how to open one of them yourself (you can use it for any years in the interval 1989 till 2012).
1. Download the .dat file
2. Save it in a Desktop folder (C:\Users\Owner...)
3. Download the corresponding .do and .dct files from here
4. Save them in the same folder
5. Open the .dat file in Stata just the way you opened it in your question
6. Save it as a Stata .dta file in the same folder (C:\Users\Owner...)
7. Open the .do file (using Notepad++) that is in your (C:\Users\Owner...) folder
8. At the very beginning you will see that the author prescribes local variables for the paths of the .dta, .dat and .dct files. Change the paths so that they point to the saved .dta, .dat and .dct files in your folder (C:\Users\Owner...) on your Desktop
9. Reopen Stata, and run the .do file from your folder (C:\Users\Owner...)
10. Done! Save the .dta file
Now, for the years 1962 to 1988, you can follow the same procedure (10 steps) as I explained above, but unfortunately NBER does not provide the .do and .dct files. That means you have to write them yourself. Take one of the available .do and .dct files from any of the years (1989 - 2012) as a benchmark, and write your own .do and .dct files. You will have to make corrections so that the new .do and .dct files are consistent with the corresponding .pdf documentation for each year. I know it is very tedious, but this is the only way you can handle it.
We need more information.
".dat" is not an extension that is special so far as Stata is concerned. Perhaps you meant .dta.
Even if so, what file was it, what command did you use and what was wrong?
The page you linked to leads to numerous files. We have not a hope of guessing which you mean.
Spelling is "Stata".
This might not save you from spending days digging into that data, but here are some ideas:
The file contains two completely different kinds of lines. This might be the reason why you can't import them. You can see this by opening the unzipped file in a text editor. You have to find out what that means.
What do you want to obtain from this file? According to the PDF, it contains 85 different values per record. Do you need them all? If you're only interested in a few values, you could extract them in a Unix shell.
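As an alternative to the shell, extracting a few values and converting them to .csv can be sketched in Python. The field names and column offsets in `FIELDS` are purely hypothetical; the real positions have to be taken from the PDF documentation for the year in question, and lines of the second record type would need their own layout.

```python
import csv
import io

# Hypothetical layout: (name, start, end) as 0-based, end-exclusive column offsets.
# Replace these with the positions given in the documentation for the file.
FIELDS = [("age", 0, 2), ("income", 2, 8)]


def fixed_width_to_csv(lines, fields=FIELDS):
    """Cut the chosen columns out of fixed-width lines and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _, _ in fields])
    for line in lines:
        # Slice each field by position and trim the padding around it.
        writer.writerow([line[start:end].strip() for _, start, end in fields])
    return out.getvalue()
```

Passing an open text file as `lines` processes it one record at a time, so the whole data set never has to fit in memory.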
I have a folder that contains 300 different files. There are 150 .cft files and 150 .s01 files. Each .cft file has a corresponding .s01 file of the same name. I would like to create a program that can read the files from the folder and place each .cft file and its corresponding .s01 file into an Excel document. I would like the .cft file to be on the first worksheet in the document and the .s01 file to be on the second sheet. Then I would like the program to save the file and name it (---------).xls. The (---------) would be the name of the .cft and .s01 file since they are both the same.
So!!! I wrote a program that is able to take the .cft file and the .s01 file, append them and place them in a user-defined .xls document. However...I don't want to manually get the names of the 150 files and have to type each one into the program. I also don't want the files to be placed on the same worksheet.
So!!!! I don't want to waste time trying to code something impossible, so before I spend any more time on this I have a few questions:
Is it possible to read all of the files in a folder and match files of the same name but with different types?
If this is possible, is it then possible to place the corresponding .cft file and .s01 file in the same Excel document but on different worksheets?
Then, is it possible to create and save this workbook as (---------).xls, (---------) being the name of the matching .cft and .s01 file?
So basically...I want to write this code because I am lazy and I don't want to do anything manually ><;;; lol
Example:
The main folder contains 8 files:
dog.cft dog.s01 cat.cft cat.s01 tree.cft tree.s01 bird.cft bird.s01
The program reads all of the files in the folder and recognizes that dog.cft and dog.s01 go together.
The program then creates an Excel document and on worksheet 1 places dog.cft and on worksheet 2 places dog.s01.
The program then saves the Excel document as dog.xls
Then the program loops through the main folder repeating this process for each of the .cft and .s01 pairs until all 150 pairs have been separated and saved in their own excel document.
I don't know if I'm dreaming a little too big with this but any advice is much appreciated!
Personally, I would do this with a macro in Excel rather than in C++, because doing Excel-related work is much easier that way. All of the requirements are possible using VBA within Excel.
Yes, it's possible.
For the listing of files in a folder, you can use the Windows API functions FindFirstFile and FindNextFile. When you finish iterating the folder, you'll need to call FindClose.
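The matching step itself is independent of the Windows API. As a rough illustration of the logic only, here is a Python sketch that lists a folder and pairs files sharing a base name. The function name `pair_by_stem` is hypothetical, and writing the Excel workbook is deliberately left out.

```python
from collections import defaultdict
from pathlib import Path


def pair_by_stem(folder, extensions=(".cft", ".s01")):
    """Return {base name: (cft_path, s01_path)} for every complete pair in folder."""
    found = defaultdict(dict)
    for path in Path(folder).iterdir():
        ext = path.suffix.lower()
        if path.is_file() and ext in extensions:
            found[path.stem][ext] = path  # group files by name without extension
    # Keep only base names for which every expected extension was found.
    return {
        stem: tuple(files[ext] for ext in extensions)
        for stem, files in sorted(found.items())
        if len(files) == len(extensions)
    }
```

Each key (for example `dog`) then gives the name of the output workbook, `dog.xls`, and the two paths give the contents for worksheet 1 and worksheet 2.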
For creating the Excel spreadsheet and working with the workbook's sheets, you can use COM automation. Here's a link to an article on doing so from C++ (MFC); the article explains where to find one that isn't MFC based.
If you get started and have specific questions about either of the tasks, please post them as separate questions. This should have been two individual questions, in fact - one about iterating the content of a folder and a different one about working with Excel files from C++.