Regex and file processing

This question relates to R but really isn't language specific per se. I have a bunch of csv files with this general format "sitename_03082015.csv". The files have 5 columns and a varying number of rows:
Host MaximumIn MaximumOut AverageIn AverageOut
device1 30.63 Kbps 0 bps 24.60 Kbps 0 bps
device2 1.13 Mbps 24.89 Kbps 21.76 Kbps 461 bps
device5 698.44 Kbps 37.71 Kbps 17.49 Kbps 3.37 Kbps
I ultimately want to read in all the files and merge them, which I can do, but during the merge I want to pull the site name and date out of each file name and add them to every related row, so the output looks like this:
Host MaximumIn MaximumOut AverageIn AverageOut Site Name Date
device1 30.63 Kbps 0 bps 24.60 Kbps 0 bps SiteA 3/7/15
device12 1.13 Mbps 24.89 Kbps 21.76 Kbps 461 bps SiteA 3/8/15
device1 698.44 Kbps 37.71 Kbps 17.49 Kbps 3.37 Kbps SiteB 3/7/15
device2 39.08 Kbps 1.14 Mbps 10.88 Kbps 27.06 Kbps SiteB 3/8/15
device3 123.43 Kbps 176.86 Kbps 8.62 Kbps 3.78 Kbps SiteB 3/9/15
With my R code I can do the following:
# Load stringr for str_extract()
library(stringr)
# Get list of file names
filenames <- list.files(pattern = ".csv$")
# This extracts everything up to the underscore to get the site name
siteName <- str_extract(string = filenames, "[^_]*")
# Extract the date from the file names
date <- str_extract(string = filenames, "\\d{8}")
With the code below I can merge all the files, but without the added Site Name and Date columns that I want.
myDF<-do.call("rbind", lapply(filenames, read.table, header=TRUE, sep=","))
I just can't get my head around how to combine those extracts with adding and populating the Site Name and Date columns to create my ideal data frame, which is the second table above.
The solution that best worked for me was posted below :)

The way that immediately comes to my mind is to cbind the additional information onto each file as it is read and rbind afterwards. Something similar to this:
myDF <- do.call("rbind",
                lapply(filenames,
                       function(x) cbind(read.table(x, header = TRUE, sep = ","),
                                         "Site Name" = str_extract(string = x, "[^_]*"),
                                         "Date" = as.Date(str_extract(string = x, "\\d{8}"), "%m%d%Y"))))

I have done something similar which can be applied here. You can add more file names, separated by commas. The site can be extracted similarly (see the sketch after the loop below). Let me know if you need more help.
##Assuming your csv files are saved in location C:/
library(stringr)
##List all filenames
fileNames <- c("hist_03082015.csv","hist_03092015.csv")
##Create an empty dataframe to save all output to
final_df <- NULL
for (i in fileNames) {
  ##Read CSV
  df <- read.csv(paste("C:/", i, sep = ""), header = TRUE,
                 sep = ",", colClasses = 'character')
  ##Extract the date digits from the filename into a column
  df$Date <- gsub("\\D", "", i)
  ##Convert the string to a date (filenames use mmddyyyy, as in the question)
  df$Date <- as.Date(paste(str_sub(df$Date, 1, 2),
                           str_sub(df$Date, 3, -5),
                           str_sub(df$Date, 5, -1), sep = "-"), "%m-%d-%Y")
  ##Save all data into one dataframe
  final_df <- rbind(final_df, df)
  print(summary(final_df))
}
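The loop above only adds the Date column. A minimal sketch of the "extracted similarly" part for the site, assuming the file names follow the question's sitename_date.csv pattern, is one more line inside the loop (it mirrors the str_extract call from the question):

##Inside the for loop, after read.csv():
df$Site <- str_extract(string = i, "[^_]*")  ## everything before the first underscore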

Related

Computing the gradient from a 2-column array in an external .dat file

I have a .dat file with 2 columns and between 14,000 and 36,000 rows, saved like the sample below:
0.00 0.00
2.00 1.00
2.03 1.01
2.05 1.07
.
.
.
79.03 23.01
The 1st column is extension, the 2nd is strain. When I want to compute the gradient to determine Hooke's law from the plot, I use the code below.
CCCCCC
      Program gradient
      REAL S(40000),E(40000),GRAD(40000,1)
      open(unit=300, file='Probka1A.dat', status='OLD')
      open(unit=321, file='result.out', status='unknown')
      write(321,400)
  400 format('alfa')
  260 DO 200 i=1, 40000
        read(300,30) S(i),E(i)
   30   format(2F7.2)
        GRAD(i,1)=(S(i)-S(i-1))/(E(i)-E(i-1))
        write(321,777) GRAD(i,1)
  777   Format(F7.2)
  200 Continue
      END
But after I executed it I got the following error:
PGFIO-F-231/formatted read/unit=300/error on data conversion.
File name = Probka1A.dat formatted, sequential access record = 1
In source file gradient1.f, at line number 9
What can I do to compute the gradient this way, or another way, in Fortran 77?
You are reading from the file without checking for the end of the file. Your code should be like this:
C     Jump to a new label (500) at end of data, since label 400 is already
C     taken by the FORMAT statement earlier in the program
  260 DO 200 i=1, 40000
        read(300,*,ERR=500,END=500) S(i),E(i)
        if (i>1) then
          GRAD(i-1,1)=(S(i)-S(i-1))/(E(i)-E(i-1))
          write(321,777) GRAD(i-1,1)
        end if
  777   Format(F7.2)
  200 Continue
  500 continue

Two sample T-Test in SAS

I am trying to run a campaign analysis. I have two campaigns, A and B, and I have their sample sizes and response rates.
Data structure:
Campaign_Name Response_Flag
A 0
A 1
A 1
B 1
B 0
I have summarized the data to get a response rate and sample size:
Campaign_name Sample_size Response Rate
A 6500 0.7%
B 3600 1.2%
I want to see if the two campaigns are statistically similar or different.
Please help!
Thanks
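What is being described is a two-sample test of proportions (in SAS this is typically done from the raw Response_Flag data, for example with PROC FREQ and its chi-square option). As a rough, hedged illustration only, the summarized counts above could be checked in R with prop.test; the responder counts below are rounded from the stated rates, so they are approximate:

## approximate responder counts implied by the summary table above
n <- c(6500, 3600)               # sample sizes for campaigns A and B
x <- round(n * c(0.007, 0.012))  # ~46 and ~43 responders
prop.test(x, n)                  # two-sample test for equality of proportions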

Detecting and Removing Commas from only a portion of a string in a large list of strings (R)

I have a large list of strings where each item on the list looks like this:
largeList<-
c("\t\t\t73,Tuesday,08/23/2014,09:03PM,Data Transfer,KB,\"60 KB\",MSDG,AT,GPRR,,0.00",
"\t\t\t74,Tuesday,08/23/2014,10:17PM,Data Transfer,KB,\"1,412 KB\",MSDG,AT,GPRR,,0.00",
"\t\t\t75,Wednesday,08/24/2014,12:08AM,Data Transfer,KB,\"2,589 KB\",MSDG,AT,GPRR,,0.00",
"\t\t\t76,Wednesday,08/24/2014,12:26PM,Data Transfer,KB,\"23,576 KB\",MSDG,AT,GPRR,,0.00",
"\t\t\t85,Thursday,08/25/2014,05:17PM,Data Transfer,KB,\"78,088 KB\",MSDG,AT,GPRR,,0.00")
I am trying to split the data by commas using
lapply(largeList, "strsplit",",")
but the issue I am running into is that while most of the values are less than 1000 (like \"60 KB\"), every once in a while there are large values that have a comma in them (like \"23,576 KB\"). I have tried
grep('(["KB"])', test, value=TRUE)
to try to find the pattern for that portion only, but all that keeps happening is that the whole string is returned. I know that eventually I would use gsub() to replace only that portion, but I am at a loss as to what the pattern should be. The best partial solution I was able to come up with uses the stringr package:
str_locate_all(test, '([""])')
which returns with
[[1]]
start end
[1,] 52 52
[2,] 62 62
on the 5th value of the example list above:
[5] "\t\t\t85,Thursday,08/25/2014,05:17PM,Data Transfer,KB,\"78,088 KB\",MSDG,AT,GPRR,,0.00"
As I understand it, this does target the start and end of the portion I want to change. But I feel like there is a better way to manipulate the string; I just can't seem to figure out the regular expression for it. Does anyone have a more elegant solution to this?
Perhaps save yourself an afternoon of head-banging regular expressions and consider read.csv(). Since the KB values you are looking for are surrounded by quotation marks in your data, and you want to split the rest of the data on the comma anyway, this seems like a nice choice. Notice column V7 in the following.
read.csv(text = largeList, header = FALSE, stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
# 1 73 Tuesday 08/23/2014 09:03PM Data Transfer KB 60 KB MSDG AT GPRR NA 0
# 2 74 Tuesday 08/23/2014 10:17PM Data Transfer KB 1,412 KB MSDG AT GPRR NA 0
# 3 75 Wednesday 08/24/2014 12:08AM Data Transfer KB 2,589 KB MSDG AT GPRR NA 0
# 4 76 Wednesday 08/24/2014 12:26PM Data Transfer KB 23,576 KB MSDG AT GPRR NA 0
# 5 85 Thursday 08/25/2014 05:17PM Data Transfer KB 78,088 KB MSDG AT GPRR NA 0
To deliver only the KB values you can use
read.csv(text = largeList, header = FALSE, stringsAsFactors = FALSE)[[7]]
# [1] "60 KB" "1,412 KB" "2,589 KB" "23,576 KB" "78,088 KB"
Additionally, if you need to retain the exact text like 0.00 and \t in the split data, you can add the argument colClasses = "character" and remove stringsAsFactors = FALSE. This way the data will look exactly as it did, only split on the relevant commas.
read.csv(text = largeList, header = FALSE, colClasses = "character")
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
# 1 \t\t\t73 Tuesday 08/23/2014 09:03PM Data Transfer KB 60 KB MSDG AT GPRR 0.00
# 2 \t\t\t74 Tuesday 08/23/2014 10:17PM Data Transfer KB 1,412 KB MSDG AT GPRR 0.00
# 3 \t\t\t75 Wednesday 08/24/2014 12:08AM Data Transfer KB 2,589 KB MSDG AT GPRR 0.00
# 4 \t\t\t76 Wednesday 08/24/2014 12:26PM Data Transfer KB 23,576 KB MSDG AT GPRR 0.00
# 5 \t\t\t85 Thursday 08/25/2014 05:17PM Data Transfer KB 78,088 KB MSDG AT GPRR 0.00
read.csv(text = largeList, header = FALSE, colClasses = "character")[[7]]
# [1] "60 KB" "1,412 KB" "2,589 KB" "23,576 KB" "78,088 KB"
To get all the values inside double quotes, use
gsub("^[^\"]*\"([^\"]+).*", "\\1", largeList)
The pattern matches 0 or more characters other than " from the start of the string up to the first ", then captures the contents inside the double quotes, and matches the rest of the contents. Then the captured text replaces the whole match.
Try:
gsub('.*\"(.*)\".*','\\1',largeList)
[1] "60 KB" "1,412 KB" "2,589 KB" "23,576 KB" "78,088 KB"
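Since the goal in the title was to remove the commas from only that portion, a short follow-up sketch (building on the values extracted above) strips the thousands separators and, if needed, converts the result to numeric KB:

kb <- gsub('.*\"(.*)\".*', '\\1', largeList)                         # "60 KB" "1,412 KB" ...
kb_no_comma <- gsub(",", "", kb, fixed = TRUE)                       # "60 KB" "1412 KB" ...
kb_numeric <- as.numeric(sub(" KB", "", kb_no_comma, fixed = TRUE))  # 60 1412 2589 23576 78088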

Search for multiple strings in many text files, count hits on combinations

I'm struggling to automate a reporting exercise, and would appreciate some pointers or advice please.
I have several hundred thousand small (<5kb) text files. Each contains a few variables, and I need to count the number of files that match each combination of variables.
Each file contains a device number, such as /001/ /002/.../006/.
Each file also contains a date string, such as 01.10.14 (dd.mm.yy)
Some files contain a 'status' string which is always "Not Settled"
I need a way to trawl through each file on a Linux system (spread across several subdirectories) and produce a report file that counts, per device, how many files include each date stamp (over a 6-month range) and, for each of those dates, how many contain the status string.
The report might look like this:
device, date, total count of files
device, date, total "not settled" count
e.g.
/001/, 01.12.14, 356
/001/, 01.12.14, 12
/001/, 02.12.14, 209
/001/, 02.12.14, 8
/002/, 01.12.14, 209
/002/, 01.12.14, 7
etc etc
In other words:
Foreach /device/
  Foreach <date>
    count total matching files - write number to file
    count total matching 'not settled' files - write number to file
Each string to match could appear anywhere in the file.
I tried piping grep to a second (and third) grep command, but I'd like to automate this and loop through the variables (6 devices, about 180 dates, 2 status strings). I suspect Perl or Bash is the answer, but I'm out of my depth.
Please can anyone recommend an approach to this?
Edit: some sample data, as mentioned in the comments. The information is basically receipt data from tills, as would be sent to a printer. Here's a sample (identifying bits stripped out).
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37!
c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/003/132 08.01.15 11:18 A-00
Contents = Not Settled
In the case above, I'd be looking for /003/, 08.01.15, and "Not Settled".
Many thanks.
First, read everything into an SQLite database, then run queries against it to your heart's content. Putting the data in an SQL database is going to save you time if you need to tweak anything. Besides, even simple SQL can tackle this kind of thing if you have the right tables set up.
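The suggestion above is language-agnostic. As a hedged illustration of the idea (shown here with R's DBI/RSQLite packages, since the rest of this page leans on R; the receipts table and its columns are made up for the example), the schema and the report query could be as small as:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "receipts.db")
dbExecute(con, "CREATE TABLE IF NOT EXISTS receipts
                (device TEXT, date TEXT, not_settled INTEGER)")
## ... insert one row per parsed file, then:
report <- dbGetQuery(con, "
  SELECT device, date,
         COUNT(*)         AS total_files,
         SUM(not_settled) AS not_settled_files
  FROM receipts
  GROUP BY device, date
  ORDER BY device, date")
dbDisconnect(con)

A single GROUP BY then replaces the nested Foreach loops sketched in the question.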
First of all I agree with @Sinan :-)
The following might work as a hack to make a hash out of your file data.
# report.pl
use strict;
use warnings;
use Data::Dumper;

my %report;
my ($date, $device);

while (<>) {
    next unless m/^ .*
                  (?<device>\/00[1-3]\/) .*
                  (?<date>\d{2}\.\d{2}\.\d{2})
                  .*$/x;
    ($date, $device) = ($+{date}, $+{device});
    $_ = <> unless eof;
    if (/Contents/) {
        $report{$date}{$device}{"u_count"}++;
    }
    else {
        $report{$date}{$device}{"count"}++;
    }
}
print Dumper(\%report);
This seems to work with a collection of data files in the format shown below (since you don't say or show where the Contents = Not Settled appears, I assume it is either part of the last line along with the device ID or in a separate and final line for each file).
Explanation:
The script reads all of the files passed on the command line (the shell expands the glob) through the while (<>) {} loop. First, next unless m/.../ skips lines of input until it matches the line with the device and date information.
Next, the match uses named capture groups ((?<device>, (?<date>) to hold the values of the patterns it finds and places those values in the corresponding variables (($date, $device) = ($+{date}, $+{device});). These could simply be $1 and $2, but naming keeps me organized here.
Then, in case there is another line to read, $_ = <> unless eof; reads it and tries the final conditional match in order to increment either the count or the u_count entry.
Data file format:
file1.data
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37! c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/003/132 08.01.15 11:18 A-00
file2.data
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37! c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/002/132 08.01.15 11:18 A-00
Contents = Not Settled
(a set of files for testing are listed here: http://pastebin.com/raw.php?i=7ALU80fE).
perl report.pl file*.data
Data::Dumper Output:
$VAR1 = {
          '08.01.15' => {
                          '/002/' => {
                                       'u_count' => 4
                                     },
                          '/003/' => {
                                       'count' => 1
                                     }
                        },
          '08.12.15' => {
                          '/003/' => {
                                       'count' => 1
                                     }
                        }
        };
From that you can make a report by iterating through the hash with keys() (the dates) and retrieving the inner hash and count values per machine. Really, it would be a good idea to have some tests to make sure everything works as expected - that, or just do as @Sinan Ünür suggests: use SQLite!
NB: this code was not extensively tested :-)

Hive Compression Orc in Snappy

Using: Amazon AWS Hive (0.13)
Trying to: output ORC files with Snappy compression.
create external table output (
  col1 string
)
partitioned by (col2 string)
stored as orc
location 's3://mybucket'
tblproperties ("orc.compress"="SNAPPY");
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.output = true;
set mapred.output.compression.type = BLOCK;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
insert into table output
partition(col2)
select col1,col2 from input;
The problem is that, when I look at the output in the mybucket directory, it does not have a SNAPPY extension; it is a binary file, though. What setting am I missing to get these ORC files compressed and written with a SNAPPY extension?
ORC files are binary files in a specialized format. When you specify orc.compress = SNAPPY, the contents of the file are compressed using Snappy. ORC is a semi-columnar file format.
Take a look at this documentation for more information about how data is laid out.
Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.
In short, your files are compressed using Snappy codec, you just can't tell that they are because the blocks inside the file are what's actually compressed.
Additionally, you can use hive --orcfiledump /apps/hive/warehouse/orc/000000_0 to see the details of your file. The output will look like:
Reading ORC rows from /apps/hive/warehouse/orc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:int>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 6
    Column 1: count: 6 min: Beth max: Owen sum: 29
    Column 2: count: 6 min: 1 max: 6 sum: 21
File Statistics:
  Column 0: count: 6
  Column 1: count: 6 min: Beth max: Owen sum: 29
  Column 2: count: 6 min: 1 max: 6 sum: 21
....