VBA Excel: read a folder and store image names in a column - regex

I am new to VBA (I mean, REALLY new) and I would like you to give me some tips.
I have an Excel file with 2 columns: SKU and media_gallery.
I also have images stored in a folder (let's name it /imageFolder).
I need to parse imageFolder and look for ALL images whose names start with the SKU, and put them into the media_gallery column separated by a semicolon ( ; ).
Example: my SKU is "1001"; I need to parse the image folder for all images starting with 1001 (all images have this pattern: 1001-2.jpg, 1001-3.jpg, etc.).
I can do that in Java or C#, but I want to give VBA a chance. :)
How can I do that?
EDIT: I only need the file names, yes! And I should have said that I have 20,000 images in my folder and 8,000 SKUs, so I don't know how to handle looping over 20,000 image names.
EDIT2: If a SKU contains a dash ( - ), I don't need to treat it, so I can skip to the next SKU. And each SKU has a maximum of 5 images (....;SKU-5.jpg).
Thanks all.

For how to insert the images themselves, given one image name per cell in a column, see: How to get images to appear in Excel given image url
Take the above and introduce an inner loop for the file name; Dir with a wildcard lets the file system do the matching, so you never have to enumerate all 20,000 names yourself:
If InStr(url_column.Cells(i).Value, "-") = 0 Then
    Dim cur_file_name As String
    cur_file_name = Dir("imageFolder\" & url_column.Cells(i).Value & "*.jpg")
    Do Until Len(cur_file_name) = 0
        image_column.Cells(i).Value = image_column.Cells(i).Value & cur_file_name & ";"
        cur_file_name = Dir
    Loop
End If

Related

Replace blanks with a string using DAX in Power BI

I have data in Excel that I have uploaded to Power BI, where I created a visualisation with a horizontal bar chart that looks like the following.
blank, CP, Jj10, etc. are my y-axis categories and the dashes are the bars of the horizontal chart. I have drawn the chart like this because I don't have any other way to show it:
(Blank)-------------998
CP-----------56
Jj10--------44
0BN--------------77
Hi-po---2
Naas-------21
There is a column named performance (sheet_name=Empl_data), and what I want is to replace the blanks with Non-GT in Power BI by creating a new column.
What my output should look like -
(Non-GT)-------------998
CP-----------56
Jj10--------44
0BN--------------77
Hi-po---2
Naas-------21
I have tried this -
Non-GT = IF(ISBLANK('Empl_data'[performance]),"Non-GT",'Empl_data'[performance])
What I get is
Non-GT----------------964
(Blank)-------------34
CP-----------56
Jj10--------44
0BN--------------77
Hi-po---2
Naas-------21
I just want to replace the blanks with Non-GT completely, but it still shows blank. Please help me solve the problem, and please let me know if I have made clear what my problem is.
My data -
Empl_id  Empl_name         performance
99807    Somman paul
0076     Richards.M
8870     Maheen Josef.T
11209    Dojar Farah
6651     Macklegn Sagoe    Hi-po
551      Cada Farez        Jj10
12       Qwezy Goha        Hi-po
6567     Beheriop Produse  CP
2227     John semmers      0BN
656      Majeeio .f
80100    Drejju Yan
Is it actually blank where it shows nothing? First confirm whether the cells contain spaces or are really null in the data, then apply conditions as below:
Non-GT =
IF(
    'Empl_data'[performance] = BLANK() || 'Empl_data'[performance] = "",
    "Non-GT",
    'Empl_data'[performance]
)
Q: Where are you transforming the data with that condition? In the Power Query Editor? Or are you creating a measure or a calculated column?

How to load specific columns with varying locations from a text file in Python?

I'm trying to read the discharge data of 346 US rivers, stored online in text files. The files are more or less in this format:
Measurement_number Date Gage_height Discharge_value
1 2017-01-01 10 1000
2 2017-01-20 15 2000
# etc.
I only want to read the gage height and discharge value columns.
The problem is that in most files additional metadata columns are added in front of the 'Gage_height' column, so I cannot simply read the 3rd and 4th columns, because their indices vary.
I'm trying to find a way to say "read the columns named 'Gage_height' and 'Discharge_value'", but I haven't succeeded yet.
I hope someone can help. I'm currently trying to load the text files with numpy.genfromtxt, so it would be great to find a solution with that package, but other solutions are also more than welcome.
This is my code so far
data_url = urllib2.urlopen(url)  # the url of this specific site
data = np.genfromtxt(data_url, skip_header=1, comments='#', usecols=[2, 3])
You can use the names=True option of genfromtxt, and then use the column names to select which columns to read with usecols.
For example, to read 'Gage_height' and 'Discharge_value' from your data file:
data = np.genfromtxt(filename, names=True, usecols=['Gage_height', 'Discharge_value'])
Note that you don't need to set skip_header=1 if you use names=True.
You can then access the columns using their names:
gage_height = data['Gage_height'] # == array([ 10., 15.])
discharge_value = data['Discharge_value'] # == array([ 1000., 2000.])
See the docs here for more information.
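For reading straight from a URL, as in the question, a minimal sketch could look like this (Python 3 with urllib.request instead of urllib2; the URL is a placeholder, not the real site):

from urllib.request import urlopen
import numpy as np

# placeholder URL: substitute the real address of the discharge data file
url = 'https://example.com/discharge.txt'

with urlopen(url) as data_url:
    # names=True picks the columns up by header name, so their position no longer matters
    data = np.genfromtxt(data_url, names=True, comments='#',
                         usecols=['Gage_height', 'Discharge_value'])

print(data['Gage_height'])      # e.g. array([  10.,   15.])
print(data['Discharge_value'])  # e.g. array([ 1000.,  2000.])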

BadDataError when editing a .dbf file using dbf package

I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a Unix system. The converter txt2dbf was used to convert the shapefile attribute tables (text files) to .dbf files.
Unfortunately, something has gone wrong (probably a separator/field-length error), because there are 2 problems with the output .dbf files, as follows:
Some fields of the dbf contain data that should not be there; this data has "spilled over" from neighbouring fields.
An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using the dbview Unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed after each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (Python package dbf), then delete the old incorrect fields using dbf.delete_fields.
Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
    db.add_fields("TRAJNUMc C(4)")  # create new fields
    db.add_fields("YYYYMMDDc C(8)")
    db.add_fields("TIMEc C(4)")
    for record in db:  # extract data from fields
        dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
        dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
        dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
    db.delete_fields('Trajnum')  # delete the incorrect fields
    db.delete_fields('Yyyymmdd')
    db.delete_fields('Time')
    db.delete_fields('1000 201')  # delete the unwanted field
    db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problems with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information I need (frustrating, because I can see that it exists)?
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to 1. extract the required data from the bad files to a text file and then 2. write it to a new dbf. (See Ethan Furman's comments/answer below.)
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf

table = dbf.Table('/path/to/scramble/table.dbf')
with table:
    fixed_data = []
    for record in table:
        # convert to str/bytes while skipping the delete flag
        data = record._data[1:].tostring()
        trajnum = data[:4]
        ymd = data[4:12]
        time = data[12:16]
        level = data[16:].strip()
        # append one [trajnum, ymd, time, level] row per record
        # (extend would flatten the fields into one long list and break the join below)
        fixed_data.append([trajnum, ymd, time, level])

with open('repaired_data.txt', 'w') as new_file:
    for line in fixed_data:
        new_file.write(','.join(line) + '\n')
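With the expected record shown earlier, each line written to repaired_data.txt would look like 1000,20111231,2300,0. , which is exactly the comma-separated shape the rebuild snippet below expects.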
Assuming all your data files look like your sample (the big IF being that the data has no embedded commas), then this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')
final_table = dbf.Table(
    'dest_table.dbf',
    'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
)
with final_table:
    for line in raw_data:
        fields = line.split(',')
        final_table.append(tuple(fields))
# table has been populated and closed
Of course, you could get fancier and use actual date and number fields if you want to:

# the dbf field spec becomes
'trajnum N; yyyymmdd D; time C(4); level N'

# the appending loop becomes
for line in raw_data:
    trajnum, ymd, time, level = line.split(',')
    trajnum = int(trajnum)
    ymd = dbf.Date(ymd[:4], ymd[4:6], ymd[6:])
    level = int(level)
    final_table.append((trajnum, ymd, time, level))
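As a quick sanity check (a minimal sketch, assuming the dest_table.dbf built above with the character field spec), you can reopen the rebuilt table and print each record:

import dbf

# reopen the table created by the rebuild snippet and dump its rows
table = dbf.Table('dest_table.dbf')
with table:
    for record in table:
        print(record.trajnum, record.yyyymmdd, record.time, record.level)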

Fixed-width source file, need only the header and footer in the target (Oracle)

I have a scenario with a fixed-width flat file as the source, and I need to load only the header and footer records to the target, not the detail records.
From the header I need to trim the first column (PA22109 ) to keep only PA, and load the next 2 columns as two different dates.
From the footer I need to keep only the PT (PT000000000700000030620E00000055612I00000010277I) and load the rest into a column of the target.
How can I achieve this logic? Inputs are appreciated.
source file :
PA22109 00153252015110905408179 2015110820151108PO ---header
DE0E9D TESTGROUPEXCH TESTINSEXCH TESTLOCEXCH ID014 LNAME014 FNAME014 14 MAIN ST ANYWHERE NJ011110000 195001012Z 01000000014 LNAME014 PATFIRST014 14 MAIN ST ANYWHERE NJ011110000 1955010110106000220 TESTGROUPEXCH 8179 TESTBENEXCH TESTCNTE53 0000000000 0000002643005 011234567890 011234567890 1234 TEST PHARMACY TEST PHARMACY LANE PHARMACYTOWN NJ09876 5555555555 11Y5 019876543210 019876543210 NJPRESCLAST PRESCFIRST 5555555551 DRLAST DRFIRST 110110000009770990300406048410 2015092720150927154401000000000000120150929 0000100000000000000000000000000
PT000000000700000030620E00000055612I00000010277I --Footer
As this is a fixed-width file, you can do the following to meet your requirement.
In your Informatica mapping, read each row into a single column.
In an Expression transformation, mark each record to be filtered out if it does not start with PA or PT (assuming your detail records do not start with PA or PT), then drop those records using a Filter transformation.
Now you have only the header and footer records.
Now you can apply the respective conditions in an Expression for the PA and PT records; a sketch of the same filter logic follows below.
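If you want to prototype that filter logic outside Informatica first, here is a minimal Python sketch (the file name is hypothetical, and it assumes detail records never start with PA or PT):

# keep only the header (PA) and footer (PT) records, drop detail rows
kept = []
with open('source_fixed_width.txt') as src:  # hypothetical file name
    for line in src:
        if line.startswith(('PA', 'PT')):
            kept.append(line.rstrip('\n'))

# kept[0] is the PA header and kept[-1] the PT footer,
# ready for the positional parsing described above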

How to rename a column of a data frame with part of the data frame identifier in R?

I've got a number of files that contain gene expression data. In each file, the gene name is kept in a column "Gene_symbol" and the expression measure (a real number) is kept in a column "RPKM". The file name consists of an identifier followed by _ and the rest of the name (ending with "expression.txt"). I would like to load all of these files into R as data frames, rename each data frame's "RPKM" column with the identifier of the original file, and then join the data frames by "Gene_symbol" into one large data frame: one column "Gene_symbol" followed by all the columns with the expression measures from the individual files, each labeled with the original identifier.
I've managed to transfer the identifier of the original files to the names of the individual data frames as follows.
files <- list.files(pattern = "expression.txt$")
for (i in files) {
  var_name <- paste("Data", strsplit(i, "_")[[1]][1], sep = "_")
  assign(var_name, read.table(i, header = TRUE)[, c("Gene_symbol", "RPKM")])
}
So now I'm at a stage where I have dataframes as follows:
Data_id0001 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(2.43,5.24,6.53))
Data_id0002 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(4.53,1.07,2.44))
But then I don't seem to be able to rename the RPKM column with the id000x bit (in a fully automated way, of course, looping over all the data frames I will generate in the real scenario).
I've tried to store the identifier bit as a comment with the data frames, but I seem unable to assign the comment from within a loop.
Any help would be appreciated,
mce
You should never work this way in R; always try to keep all your data frames in a list and operate on them using functions such as lapply etc. Thus, instead of using assign, just create an empty list the length of your files list and fill it within the for loop.
For your current situation, we can fix it using an ls and mget combination in order to pull these data frames from the global environment into a list and then change the columns of interest.
temp <- mget(ls(pattern = "Data_id\\d+$"))
lapply(names(temp), function(x) names(temp[[x]])[2] <<- gsub("Data_", "", x))
temp
#$Data_id0001
# Gene_symbol id0001
# 1 geneA 2.43
# 2 geneB 5.24
# 3 geneC 6.53
#
# $Data_id0002
# Gene_symbol id0002
# 1 geneA 4.53
# 2 geneB 1.07
# 3 geneC 2.44
You could eventually use list2env in order to get them back to the global environment, but you should use it with caution.
Thanks a lot for your suggestions! I think I get the point. The way I'm doing it now (see below) is hopefully a lot more R-like, and it works fine!
Cheers,
Maik
library(plyr)
files <- list.files(pattern = "expression.txt$")
temp <- list()
for (i in 1:length(files)) {
  temp[[i]] <- read.table(files[i], header = TRUE)[, c("Gene_symbol", "RPKM")]
}
for (i in 1:length(temp)) {
  temp[[i]] <- rename(temp[[i]], c("RPKM" = strsplit(files[i], "_")[[1]][1]))
}
combined_expression <- join_all(temp, by="Gene_symbol", type="full")