I import multiple Excel files on a daily basis; example code for one of the files is here:
Booked <- read_excel("./Source_Data/CONFIDENTIAL - MI8455 Future Change 20180717.xlsx", skip = 1, sheet = "Appendix 1 - Info Data")
Each day this file changes: the name and structure are always the same, the only difference being the date at the end of the file name.
Is there any way to have R search for the specific name starting with "CONFIDENTIAL - MI8455 Future Change" and import the data accordingly?
To extract the path of the file, you can use this pattern:
(?'path'\.\/Source_Data\/CONFIDENTIAL - MI8455 Future Change \d+\.xlsx)
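If you want to try the same pattern in Python, note that Python's `re` module uses `(?P<name>…)` for named groups rather than the `(?'name'…)` syntax above. A minimal sketch, using the same path format as the example:

```python
import re

# Same pattern as above, adapted to Python's named-group syntax
pattern = re.compile(
    r"(?P<path>\./Source_Data/CONFIDENTIAL - MI8455 Future Change \d+\.xlsx)"
)

text = 'Booked <- read_excel("./Source_Data/CONFIDENTIAL - MI8455 Future Change 20180717.xlsx", skip = 1)'
match = pattern.search(text)
if match:
    # Extract the captured path via the group name
    print(match.group("path"))  # ./Source_Data/CONFIDENTIAL - MI8455 Future Change 20180717.xlsx
```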
OK, through a lot of trial, error, and Google I found an answer, and I hope it helps someone else who is new to R and has come across the same problem.
First I needed to identify the file. In the end I used the list.files command:

MI8455 <- list.files(path = "G:/MY/FilE/PATH/MI8455", pattern = "^MI8455_Rate_Change_Report_1.*\\.xlsx$", full.names = TRUE)

If, like myself, your files are in other folders/subfolders of the working directory, the first argument specifies where list.files should look. The pattern argument lets you describe the format of the name, including the file type. Adding full.names = TRUE makes list.files return the full path, so the result can be passed straight to read_excel.
Next you can import using the read_excel function from the readxl package, but rather than specifying a file path you give it the value you created earlier:

Customer_2017 <- read_excel(MI8455, skip = 5, sheet = "Case Listing - Eml")
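The same idea works outside R too. A short Python sketch of the "find the matching file, then read it" approach, assuming the folder and prefix from the question (since the date suffix is YYYYMMDD, sorting the names lexicographically also sorts them chronologically):

```python
import glob
import os

def latest_report(folder, prefix):
    """Return the newest file matching "<prefix> <digits>.xlsx" in folder.

    The YYYYMMDD suffix sorts chronologically, so the last sorted
    match is the most recent file; returns None if nothing matches.
    """
    matches = sorted(glob.glob(os.path.join(folder, prefix + " *.xlsx")))
    return matches[-1] if matches else None

# Hypothetical usage, mirroring the R example (pandas not imported here):
# booked = pd.read_excel(
#     latest_report("./Source_Data", "CONFIDENTIAL - MI8455 Future Change"),
#     skiprows=1, sheet_name="Appendix 1 - Info Data")
```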
I have been searching for days, but I can't find an answer to my question.
I need to change a single cell currently named "29" into "Si29" in hundreds of csv files.
The position of the cell is the same in every file [3,7].
Then I need to save the files again (can be under the same name).
For one file I would do:

read_data[3, 7] <- "Si29"
However, I have no clue how to apply this to multiple files.
Cheers
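One way to do this without R, as a hedged sketch in Python using only the standard library: R's read_data[3, 7] is row 3, column 7 of the data frame (1-based, header excluded), which corresponds to rows[3][6] here if row 0 is the header line. The folder path is an assumption.

```python
import csv
import glob

def fix_cell(path, row=3, col=6, value="Si29"):
    """Overwrite one cell of a CSV file in place.

    R's read_data[3, 7] (1-based, header excluded) corresponds to
    rows[3][6] here, assuming row 0 is the header line.
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    rows[row][col] = value
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# Apply to every CSV in a (hypothetical) folder:
for path in glob.glob("./data/*.csv"):
    fix_cell(path)
```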
I am writing a program to evaluate the hourly energy output from PV in various cities in the US. For simplicity, I keep them in a dictionary (tmy3_cities) so that I can loop through them. I followed the TMY to Power Tutorial on GitHub. Rather than showing the whole loop, I have only added the code that reads the data and shifts the time by 30 min.
The code taken from the tutorial works for all of the TMY3 files except for Houston, Atlanta, and Baltimore (HAB, for simplicity). All of the tmy3 files were downloaded from NREL and renamed for my own use. The error I get in reading these three files is related to the datetime, and it essentially comes down to a "ValueError: invalid literal for int() with base 10: '1'" after some traceback.
Rather than looping into the same problem, I entered each file into the reader individually, and sure enough, only the HAB tmy3 files give errors.
Secondly, I downloaded the files again. This, obviously, had no impact.
In a lazy attempt to bypass the issue, I also copied and pasted the date and time columns from working tmy3 files (e.g., Miami) into the non-working ones (i.e., HAB) via Excel.
I am not sure what else to do, as I am still fairly new to Python and coding in general.
# The dictionary below is not important to the problem,
# but is provided for some clarification.
tmy3_cities = {'miami': 'TMY3Miami.csv',
'houston': 'TMY3Houston.csv',
'phoenix': 'TMY3Phoenix.csv',
'atlanta': 'TMY3Atlanta.csv',
'los_angeles': 'TMY3LosAngeles.csv',
'las_vegas': 'TMY3LasVegas.csv',
'san_francisco': 'TMY3SanFrancisco.csv',
'baltimore': 'TMY3Baltimore.csv',
'albuquerque': 'TMY3Albuquerque.csv',
'seattle': 'TMY3Seattle.csv',
'chicago': 'TMY3Chicago.csv',
'denver': 'TMY3Denver.csv',
'minneapolis': 'TMY3Minneapolis.csv',
'helena': 'TMY3Helena.csv',
'duluth': 'TMY3Duluth.csv',
'fairbanks': 'TMY3Fairbanks.csv'}
#The code below was taken from the tutorial.
pvlib_abspath = os.path.dirname(os.path.abspath(inspect.getfile(pvlib)))
#This is the only section of the code that was modified.
datapath = os.path.join(pvlib_abspath, 'data', 'TMY3Atlanta.csv')
tmy_data, meta = pvlib.tmy.readtmy3(datapath, coerce_year=2015)
tmy_data.index.name = 'Time'
# TMY data seems to be given as hourly data with time stamp at the end
# Shift the index 30 Minutes back for calculation of sun positions
tmy_data = tmy_data.shift(freq='-30Min')['2015']
print(tmy_data.head())
I would expect each tmy3 file that is read to produce its own tmy_data DataFrame. Please comment if you'd like to see the whole loop.
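One way to narrow down a datetime parsing error like this is to compare the raw date/time strings of a working file (e.g., Miami) with a failing one (e.g., Atlanta). A minimal stdlib sketch, assuming the standard TMY3 layout (one station-metadata line, then a header line, then hourly rows with Date (MM/DD/YYYY) and Time (HH:MM) in the first two columns):

```python
import csv

def peek_datetime(path, n=3):
    """Return the raw (Date, Time) strings from the first n data rows
    of a TMY3-style CSV, skipping the metadata and header lines.

    Comparing this output across files should show exactly which
    strings the tutorial's datetime parser chokes on.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # station metadata line
        next(reader)  # column header line
        return [(row[0], row[1]) for _, row in zip(range(n), reader)]
```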
The M query code below extracts a range of rows from a sheet called "Survey Information" in an Excel file called Paris. But what if I wish to do this not just for a single Excel file, but for all of the Excel files located in Folder1 (Berlin, Milan, etc.)? (While each Excel file contains multiple sheets, each file has a single sheet called "Survey Information".)
Unfortunately there are no named ranges, and I have a large number of Excel files to pull the data out of.
Very grateful for any insight,
Chris
let
    Source = Excel.Workbook(File.Contents("Q:\Folder1\Paris.xls"), null, true),
    #"Survey Information1" = Source{[Name="Survey Information"]}[Data],
    #"Kept Range of Rows" = Table.Range(#"Survey Information1", 19, 14)
in
    #"Kept Range of Rows"
I think I may have found an answer to my own question: I need to turn the code into a function by prefixing it with (filepath)=>, so it can be invoked once per file in the folder.
I found a step-by-step guide here:
https://www.excelguru.ca/blog/2015/02/25/combine-multiple-excel-workbooks-in-power-query/
All the best.
I'm trying to create a transformation that reads CSV files and checks the data type of each field in the CSV.
For example: the standard for field A is a string(1) character, and field B must be an integer/number.
What I want is to check/validate: if A is not string(1), set Status = Not Valid, and likewise if B is not an integer/number. Then every file with status Not Valid should be moved to an error folder.
I know I can use the Data Validator to do the checking, but how do I move a file based on that status? I can't find any step to do it.
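For comparison, outside PDI the same check-then-move logic is only a few lines. A hedged Python sketch (the column names A and B, the single-character/integer rules, and the error-folder layout are all taken from the question; adjust to your real headers):

```python
import csv
import os
import shutil

def is_valid(path):
    """Check every row: field A must be one character, field B an integer."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if len(row["A"]) != 1:
                return False
            try:
                int(row["B"])
            except ValueError:
                return False
    return True

def sort_file(path, error_folder):
    """Move a file that fails validation into the error folder."""
    if not is_valid(path):
        shutil.move(path, os.path.join(error_folder, os.path.basename(path)))
```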
You can read the files in a loop and add steps as follows:
After the Data Validator, filter the rows with a negative result (not matched), then use an Add constants step with error = 1, followed by a Set variable step that sets an ERROR variable (default value 0).
After the transformation finishes, add a Simple evaluation step in the parent job to check the value of the ERROR variable.
If it has the value 1, move the file; otherwise carry on.
I hope this helps.
You can do the same as in this question. Once read, use a Group by to get one flag per file. However, this time you cannot do it in one transformation; you have to use a job.
Your use case is covered by the samples shipped with your PDI distribution, in the folder your-PDI/samples/jobs/run_all. Open Run all sample transformations.kjb and replace the Filter 2 of Get Files - Get all transformations.ktr with your own logic, which includes a Group by so that you get one status per file rather than one status per row.
In case you wonder why you need such complex logic for such a task, remember that PDI starts all the steps of a transformation at the same time. That is its great power, but it means you cannot know whether the file should be moved until every row has been processed.
Alternatively, there is the quick and dirty solution from your similar question: replace the Filter rows with a type check, and the final Synchronize after merge with a Process files/Move.
And a final piece of advice: instead of checking the type with a Data validator, which is a good solution in itself, you may use a JavaScript step as shown there. It is more flexible if you need to maintain it in the long run.
I'm working with a project using Kettle (PDI).
I have to input multiple .csv or .xls files and insert them into a DB.
The file names follow the pattern AAMMDDBBBB, where AA is the code for the city, MMDD is the date in MM-DD format, and BBBB is the code for the shop. For example: LA0326F5CA.csv.
The regexp I use in the input file steps looks like LA.*\.csv or DT.*\.xls, which returns all the files to insert into the DB.
Can you indicate how to select only the files for yesterday (based on the MMDD in the file name)?
As you need some "complex" logic in your selection, you cannot filter based only on a regexp. I suggest you first read all the file names, then filter them based on their "age", then read the files based on the selected names.
In detail:
Use the Get File Names step with the same regexp you currently use (LA.*\.csv or DT.*\.xls). You may be more restrictive at that stage, with a regexp like LA\d\d\d\d....\.csv, to ensure MM and DD are numbers and BBBB is exactly 4 characters.
Filter based on the date. You can do this with a Java Filter, but it is an order of magnitude easier to use a JavaScript step to compute the "age" of the file and then a Filter rows step to keep only yesterday's files.
To compute the age of the file from the MM and DD, you can use something like this (other methods are available):

var regexp = filename.match(/..(\d\d)(\d\d).*/);
if (regexp) {
    // JavaScript months are 0-based, so subtract 1 from MM; year is hard-coded here
    var fileDate = new Date(2018, parseInt(regexp[1], 10) - 1, parseInt(regexp[2], 10));
    var age = (new Date() - fileDate) / 1000 / 60 / 60 / 24; // milliseconds to days
}
If you are not familiar with JavaScript regexps: match tests the filename against the regexp and keeps the values of the parentheses in an array. If the test succeeds (which you must explicitly check to avoid a run-time failure), use the captured values to compute the corresponding date and subtract it from today's date to get the age. This age is in milliseconds, which is then converted to days.
Use the Text File Input or Excel Input step with the option Accept file names from previous step. Note that CSV Input does not have this option, but the more powerful Text File Input does.
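The filename-age logic of the steps above can also be sketched in Python, if you ever script the selection outside PDI. This sketch mirrors the JavaScript: 2-character city code, then MMDD, as in LA0326F5CA.csv; the year is assumed to be the current one, so the Dec 31 / Jan 1 boundary is not handled:

```python
import re
from datetime import date

def is_yesterday(filename, today=None):
    """True if an AAMMDDBBBB-style name (e.g. LA0326F5CA.csv) is dated yesterday."""
    if today is None:
        today = date.today()
    m = re.match(r"..(\d\d)(\d\d)", filename)
    if not m:
        return False
    try:
        file_date = date(today.year, int(m.group(1)), int(m.group(2)))
    except ValueError:  # e.g. month 00 or day 32 in the name
        return False
    return (today - file_date).days == 1

# Keep only yesterday's files from a Get File Names-style list:
# yesterdays = [f for f in filenames if is_yesterday(f)]
```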
Well, I replaced the Java Filter with a Modified Java Script Value step and it works fine now.
Another question: how can I improve the performance and speed of my current transformations (right now I have 2 transformations for 2 cities)? My Insert/Update step makes the transformation slow: it needs almost 1 hour 30 min to process 500k rows with a lot of fields (300 MB), and that is not all of my data. If it runs faster and my company likes it, I am going to use it on about 10 TB of data per year, which means a lot of transformations and rows. I need suggestions about this.