How can I parse an Excel file within a zip file?

I want to parse an Excel file that sits inside a zip file. I can already read the zip file and list the files it contains; if a regex match finds an Excel file among them, I'd like to parse that file.
Here's the script that searches the zip file for the name of the Excel spreadsheet:
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip;
use Spreadsheet::ParseExcel;
my $zipFile = Archive::Zip->new();
my $xl_file = "";
#open zipfile
$zipFile->read( '/home/user/Desktop/test.zip' ) == 0 || die "cannot read zip file\n";
#find all files within zipfile
my @files = $zipFile->memberNames();
foreach my $file (sort @files) {
#find all excel files
if($file =~ m/(.*xls)/){
$xl_file = $1;
print "excel file found.\n";
}
}
And this is the script that parses for the value in cells.
#!/usr/bin/perl
use strict;
use warnings;
my $filename = "/home/user/worksheet.xls";
use Spreadsheet::ParseExcel;
my $parser = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse("$filename");
if ( !defined $workbook ) {
die $parser->error(), ".\n";
}
open(FILE, '>', "parse.txt")||die "cannot open parse.txt!\n";
for my $worksheet ( $workbook->worksheets() ) {
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();
my $s = $worksheet -> get_cell(2,2);
my $p = $worksheet-> get_cell(2,3);
print FILE $s->value()."\n";
print FILE $p->value()."\n";
}
close FILE;
How do I integrate these together?

According to the documentation of Archive::Zip, it's possible to get the contents of a compressed file member as a string:
my $xls_content = $zipFile->contents($file);
And according to the documentation of Spreadsheet::ParseExcel, it's possible to parse a string containing the contents of an Excel file by passing a reference to the string:
my $workbook = $parser->parse(\$xls_content);
So you should be able to combine the two: call contents() on the matching member inside your loop and hand the result to the parser.
Another possibility is to extract the zip file member into a temporary file and parse that file by name.
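For comparison, here is a minimal sketch of the same two options (member contents in memory, or member extracted to a temporary file) written in Python rather than Perl, using only the standard library. The function names are hypothetical, and the Excel parsing step is left out because it would need a third-party library.

```python
import io
import os
import tempfile
import zipfile


def find_xls_members(zip_path):
    """Return the names of archive members ending in .xls (case-insensitive)."""
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist() if name.lower().endswith('.xls')]


def member_as_stream(zip_path, member):
    """Option 1: read one member fully into memory as a file-like object."""
    with zipfile.ZipFile(zip_path) as zf:
        return io.BytesIO(zf.read(member))


def member_to_tempfile(zip_path, member):
    """Option 2: write one member to a temporary file and return its path."""
    with zipfile.ZipFile(zip_path) as zf:
        data = zf.read(member)
    fd, path = tempfile.mkstemp(suffix='.xls')
    with os.fdopen(fd, 'wb') as out:
        out.write(data)
    return path
```

The in-memory variant avoids touching the disk; the temporary-file variant is the fallback when a parser only accepts a file name. The caller is responsible for deleting the temporary file afterwards.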

Related

Matlab Use Regex Expression to load file from directory

I have looked a lot for this but have not found anything. I am very new to matlab and regex in general.
My problem is this: I have a directory path 'dir' with only one .txt file in it, but I do not know the name of that file. I want to load it.
I have tried multiple things but cannot find the solution.
foo = load(fullfile(dir, '-regexp', '*.txt'))
Thank you for your help!
That syntax isn't valid for fullfile, and dir is an in-built function which it appears you're using as a variable... Here is something a little clearer which should work when you have a single txt file within a given folder
folder = 'my\folder\path\';
files = dir( fullfile( folder, '*.txt' ) );
if numel( files ) ~= 1
error( 'More or less than one .txt file found!' );
end
filepath = fullfile( files(1).folder, files(1).name );
foo = load( filepath ); % load is designed for .mat files, if your .txt contains anything
% non-numeric then you may want something more like readtable here...

Drupal 8 unable to get path location from custom module

I created a module in Drupal 8 that needs to load a csv file from the module folder, but I was unable to do it, I have already tried:
$directory = drupal_get_path('module', 'my_module');
$file = 'source.csv';
$path = $directory . '/' . $file;
kint($path);
// open the CSV file
$handle = fopen($path, 'r');
if (!$handle) {
// ...
}
But fopen returns false when loading the file, so it looks like this is not the correct way.
I found a way to get it working using the following code:
$file = 'source.csv';
$path = __DIR__ . '/' . $file;
// open the CSV file
$handle = @fopen($path, 'r');
if (!$handle) {
// ...
}
If there is a better way just let me know.
Basically,
$moduleDir = drupal_get_path('module', 'my_module');
is the right way.
So if your source.csv file is located at modules/MY_MODULE/files/source.csv, then you should be able to do something like the following in your my_module.module file or elsewhere:
$file = $moduleDir . DIRECTORY_SEPARATOR . 'files' . DIRECTORY_SEPARATOR . 'source.csv';
if (file_exists($file)) {
// do your stuff
}

Bash & Perl script to convert relative paths to absolute paths

I have a top level dir path and I want to convert all the relative paths to absolute paths existing in all files inside this directory recursively.
e.g. I have this dir structure:
$ tree
.
|-- DIR
| |-- inner_level.ext1
| `-- inner_level.ext2
|-- top_level.ext1
`-- top_level.ext2
Content of top_level.ext1:
../../a/b/c/filename_1.txt
../../a/d/e/filename_2.txt
Assume the top level dir path is /this/is/the/abs/dir/path/
Want to convert the content of top_level.ext1 to:
/this/is/the/abs/a/b/c/filename_1.txt
/this/is/the/abs/a/d/e/filename_2.txt
Content of top_level.ext2:
cc_include+=-I../../util1/src/module1/moduleController -I../../util/src/module1/module2Controller;
cc_include+=-I../../util2/src/module2/moduleUtility;
Want to convert the content of top_level.ext2 to:
cc_include+=-I/this/is/the/abs/util1/src/module1/moduleController -I/this/is/the/abs/util/src/module1/module2Controller;
cc_include+=-I/this/is/the/abs/util2/src/module2/moduleUtility;
Also, want to apply this same conversion over the files inside DIR.
e.g.
Content of DIR/inner_level.ext1:
../../../a/b/c/filename_1.txt
../../../a/d/e/filename_2.txt
Want to convert the content of DIR/inner_level.ext1 to:
/this/is/the/abs/a/b/c/filename_1.txt
/this/is/the/abs/a/d/e/filename_2.txt
Same for the DIR/inner_level.ext2 also.
I have written these two scripts.
Conversion of top_level.ext1 is working successfully.
file_manager.sh:
#!/usr/bin/bash
file='resolve_path.pl'
basedir='/this/is/the/abs/dir/path'
run_perl(){
echo -e "\n File getting modified: $1"
cp $1 tmp.in
perl $file
mv tmp.out $1
rm tmp.in
}
find $basedir -type f |while read inputfile
do
run_perl $inputfile
done
resolve_path.pl:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use 5.010;
use Switch;
#******************************************************
# Set-up Directory And Input/Output File Names
#******************************************************
our $in_file = glob('tmp.in');
my $out_file1 = 'tmp.out';
print "Input file: $in_file\n";
#************************************
# Local and Global Variables
#*************************************
my $current_path = "/this/is/the/abs/dir/path";
my $temp_path = $current_path;
#************************************
# Open Read and Write File
#************************************
open(READ, $in_file) || die "cannot open $in_file";
open(WRITE, ">$out_file1") || die "cannot open $out_file1";
#******************************************************
# Read The Input [*.out] File Line By Line
#******************************************************
while (<READ>) {
if(/^(\.\.\/){1,}(\w+\/)*(\w+).(\w+)/){
my $file_name = $3;
my $file_ext = $4;
my @count = ($_ =~ /\.\.\//g);
my $cnt = @count;
my @prev_dir = ($_ =~ /\w+\//g);
my $prev_dir_cnt = @prev_dir;
my $file_prev_dir = join('', @prev_dir);
$temp_path = $current_path;
for(my $i=0; $i<$cnt; $i++){
if($temp_path =~m/(\/.*)\/\w+/){
$temp_path = $1;
}
}
print WRITE "$temp_path"."\/"."$file_prev_dir"."$file_name"."\."."$file_ext"."\n";
} else {
print WRITE "$_";
}
}
Issues I am facing:
1. No conversion is applied to top_level.ext2 and DIR/inner_level.ext2, because my Perl script does not match the ../ sequences when they are not at the start of the line (i.e. when cc_include+=-I comes first).
2. The conversion from relative to absolute paths does not work properly for DIR/inner_level.ext1; a wrong path gets written.
It would be helpful if someone can suggest expected changes in my scripts to solve the above said two issues.
Why the 2 scripts? That's inefficient.
Perl is perfectly capable of retrieving the list of files itself, and it has modules that simplify that process as well as modules to parse and alter paths:
File::Find - Traverse a directory tree.
File::Find::Rule - Alternative interface to File::Find
File::Basename - Parse file paths into directory, filename and suffix.
File::Spec - portably perform operations on file names
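To make the logic concrete, here is a minimal single-script sketch of the approach, written in Python rather than Perl. It assumes POSIX-style paths and that every run of ../ in a file is a path relative to the directory containing that file; the names REL_PATH, absolutize and rewrite_tree are hypothetical. It addresses both reported issues: the pattern is not anchored to the start of the line, and each path is resolved against the directory of the file being edited instead of one fixed top-level path.

```python
import os
import posixpath
import re

# One or more "../" followed by path characters; stops at spaces, ';' or newlines.
REL_PATH = re.compile(r'(?:\.\./)+[\w./-]+')


def absolutize(text, file_dir):
    """Replace every ../-style path in text with an absolute path.

    file_dir is the directory of the file the text came from, because
    "../" is relative to that file, not to the top-level directory.
    """
    def resolve(match):
        return posixpath.normpath(posixpath.join(file_dir, match.group(0)))
    return REL_PATH.sub(resolve, text)


def rewrite_tree(basedir):
    """Rewrite every file under basedir in place (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(basedir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'r') as f:
                original = f.read()
            updated = absolutize(original, os.path.abspath(dirpath))
            if updated != original:
                with open(path, 'w') as f:
                    f.write(updated)
```

Calling rewrite_tree('/this/is/the/abs/dir/path') would then process top_level.* and DIR/inner_level.* in one pass, without the tmp.in/tmp.out round trip.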

ARFF to CSV multiple files conversions

Has anyone successfully converted many ARFF files to CSV files from the Windows command line?
I tried to use weka.core.converters.CSVSaver but it works for a single file only.
Can it be done for multiple files?
I found a way to solve this conversion by using R as shown in the following Script:
#### Set the default directory to the folder that contains all ARFF files
temp = list.files(pattern="*.arff")
library(foreign)
for (i in 1:length(temp)) assign(temp[i], read.arff(temp[i]))
for(i in 1:length(temp))
{
mydata=read.arff(temp[i])
t=temp[i]
x=paste0(t,".csv")
write.csv(mydata,x,row.names=FALSE)
mydata=0
}
On a Windows command line, type powershell.
Change to the directory where your *.arff files reside.
Enter this command:
dir *.arff | Split-Path -Leaf | ForEach-Object {Invoke-Expression "java -cp 'C:\Program Files\Weka-3-6\weka.jar;.' weka.core.converters.CSVSaver -i $_ -o $_.csv"}
This assumes that your filenames do not contain any blanks, and all arff files reside in a single directory, and you want to convert them all. It will create a new csv file from each arff file. myfile.arff will be exported/converted to myfile.arff.csv
I wrote a simple Python script, arff2csv.py (on GitHub). Here is the code.
"""Convert a multi-label *.arff file to a *.csv file."""
import re


def trans_arff2csv(file_in, file_out):
    """Convert a *.arff file to a *.csv file."""
    columns = []
    data = []
    with open(file_in, 'r') as f:
        data_flag = 0
        for line in f:
            if line[:2] == '@a':
                # attribute line: the name lies between the first and last space
                indices = [i for i, x in enumerate(line) if x == ' ']
                columns.append(re.sub(r'^[\'\"]|[\'\"]$|\\+', '', line[indices[0] + 1:indices[-1]]))
            elif line[:2] == '@d':
                # the @data line marks the start of the data rows
                data_flag = 1
            elif data_flag == 1:
                data.append(line)
    content = ','.join(columns) + '\n' + ''.join(data)
    # save to file
    with open(file_out, 'w') as f:
        f.write(content)


if __name__ == '__main__':
    # setting arff file path
    file_attr_in = r'D:\Downloads\birds\birds-test.arff'
    # setting output csv file path
    file_csv_out = r"D:\Downloads\birds\birds-test.csv"
    # convert
    trans_arff2csv(file_attr_in, file_csv_out)
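Since the question is about many files, a small driver loop can apply a single-file converter to every .arff file in a folder. The sketch below is one possible way to do that; convert_all is a hypothetical name, and the converter is passed in as a parameter so that any function taking (input_path, output_path), such as trans_arff2csv above, can be used.

```python
from pathlib import Path


def convert_all(folder, convert):
    """Run convert(input_path, output_path) on every *.arff file in folder.

    Each name.arff is written to name.csv in the same folder.
    Returns the list of output paths that were written.
    """
    outputs = []
    for arff_path in sorted(Path(folder).glob('*.arff')):
        csv_path = arff_path.with_suffix('.csv')
        convert(str(arff_path), str(csv_path))
        outputs.append(csv_path)
    return outputs
```

For example, convert_all(r'D:\Downloads\birds', trans_arff2csv) would convert every ARFF file in that folder, assuming trans_arff2csv is defined in the same script.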

use shell command tesseract in perl script to print a text output

Hi, I have a script that first extracts the image URL from the HTML, and then I want to run tesseract on the image to get the text output.
I can't really figure out how to do it.
Here is the code:
#!/usr/bin/perl -X
##########
$user = ''; # Enter your username here
$pass = ''; # Enter your password here
###########
# Server settings (no need to modify)
$home = "http://37.48.90.31";
$url = "$home/c/test.cgi?u=$user&p=$pass";
# Get HTML code
$html = `GET "$url"`;
#### Add code here:
# Grab img from HTML code
if ($html =~ /\img[^>]* src=\"([^\"]*)\"[^>]*/) {
$takeImg = $1;
}
@dirs = split m!/!, $takeImg;
$img = $dirs[2];
#########
die "<img> not found\n" if (!$img);
# Download img to server (save as: ocr_me.img)
print "GET '$img' > ocr_me.img\n";
system "GET '$img' > ocr_me.img";
#### Add code here:
# Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system ("tesseract", "tesseract ocr_me.img ocr_result");
###########
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# Check OCR results:
$txt = `cat ocr_result.txt`;
Did I extract the image from the HTML correctly, or do I need a different regex?
And how do I display the contents of 'ocr_result.txt'?
Thanks to all who will help!