R: need to replace invisible/accented characters with regex

I'm working with a file generated from several different machines that had different locale-settings, so I ended up with a column of a data frame with different writings for the same word:
CÓRDOBA
CÓRDOBA
CÒRDOBA
I'd like to convert all those to CORDOBA. I've tried doing
t<-gsub("Ó|Ó|Ã’|°|°|Ò","O",t,ignore.case = T) # t is the vector of names
Which works until it finds some "invisible" characters:
As you can see, I'm not able to see, in R, the additional character that lies between Ã and \ (if I copy-paste into MS Word, Word shows it as an empty rectangle). I've tried to dput the vector, but it shows exactly as on screen (without the "invisible" character).
I ran Encoding(t), and it returns unknown for all values.
My system configuration follows:
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Spanish_Colombia.1252 LC_CTYPE=Spanish_Colombia.1252 LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Colombia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] zoo_1.7-12 dplyr_0.4.2 data.table_1.9.4
loaded via a namespace (and not attached):
[1] R6_2.1.0 assertthat_0.1 magrittr_1.5 plyr_1.8.3 parallel_3.2.1 DBI_0.3.1 tools_3.2.1 reshape2_1.4.1 Rcpp_0.11.6 stringi_0.5-5
[11] grid_3.2.1 stringr_1.0.0 chron_2.3-47 lattice_0.20-31
I've used saveRDS to save a data frame of actual and expected toy values, which can be read back with readRDS from here. I'm not absolutely sure it will load with the same problems I have (depending on your locale), but I hope it does, so you can provide some help.
In the end, I'd like to convert all those special characters to unaccented ones (Ó to O, etc.), hopefully without having to manually enter each of them into a regex; in other words, I'd like (if possible) some sort of gsub("[:weird:]","[:equivalentToWeird:]",t). If that's not possible, I'd at least like to be able to find (and replace) those "invisible" characters.
Thanks,
############## EDIT TO ADD ###################
If I run the following code:
d <- readRDS("c:/path/to/downloaded/Dropbox/file/inv_char.Rdata")
stri_escape_unicode(d$actual)
This is what I get:
[1] "\\u00c3\\u201cN N\\u00c2\\u00b0 08 \\\"CACIQUE CALARC\\u00c3\\u0081\\\" - ARMENIA"
[2] "\\u00d3N N\\u00b0 08 \\\"CACIQUE CALARC\\u00c1\\\" - ARMENIA"
[3] "\\u00d3N N\\u00b0 08 \\\"CACIQUE CALARC\\u00c1\\\" - ARMENIA(ALTERNO)"
Normal output is:
> d$actual
[1] ÓN N° 08 "CACIQUE CALARCÃ" - ARMENIA ÓN N° 08 "CACIQUE CALARCÁ" - ARMENIA ÓN N° 08 "CACIQUE CALARCÁ" - ARMENIA(ALTERNO)
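As an aside, the \u00c3\u201c and \u00c2\u00b0 pairs in the escaped output are the classic signature of UTF-8 bytes that were decoded as Windows-1252 (mojibake). When every byte has a cp1252 mapping, re-encoding with the wrong codec and decoding with the right one repairs the text; a Python sketch of the idea, for illustration only:

```python
def fix_mojibake(s):
    # "\u00c3\u201c" is the two-byte UTF-8 encoding of "\u00d3" (O with
    # acute) misread as cp1252: undo the wrong decoding, redo the right one.
    return s.encode("cp1252").decode("utf-8")

print(fix_mojibake("\u00c3\u201cN N\u00c2\u00b0 08"))  # ÓN N° 08
```

Bytes with no cp1252 mapping (such as the 0x81 hiding in CALARC\u00c3\u0081 above) make this round-trip raise an error, which is where a dedicated fixer like ftfy earns its keep.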

With the help of @hadley, who pointed me towards stringi, I ended up discovering the offending characters and replacing them. This was my initial attempt:
unweird <- function(t) {
  t <- stri_escape_unicode(t)
  t <- gsub("\\\\u00c3\\\\u0081|\\\\u00c1", "A", t)
  t <- gsub("\\\\u00c3\\\\u02c6|\\\\u00c3\\\\u2030|\\\\u00c9|\\\\u00c8", "E", t)
  t <- gsub("\\\\u00c3\\\\u0152|\\\\u00c3\\\\u008d|\\\\u00cd|\\\\u00cc", "I", t)
  t <- gsub("\\\\u00c3\\\\u2019|\\\\u00c3\\\\u201c|\\\\u00c2\\\\u00b0|\\\\u00d3|\\\\u00b0|\\\\u00d2|\\\\u00ba|\\\\u00c2\\\\u00ba", "O", t)
  t <- gsub("\\\\u00c3\\\\u2018|\\\\u00d1", "N", t)
  t <- gsub("\\u00a0|\\u00c2\\u00a0", "", t)
  t <- gsub("\\\\u00f3", "o", t)
  t <- stri_unescape_unicode(t)
}
which produced the expected result. I was a little curious about other stringi functions, so I wondered whether its replacement function could be faster on my 3.3 million rows. I then tried stri_replace_all_regex like this:
stri_unweird <- function(t) {
  stri_unescape_unicode(stri_replace_all_regex(stri_escape_unicode(t),
    c("\\\\u00c3\\\\u0081|\\\\u00c1",
      "\\\\u00c3\\\\u02c6|\\\\u00c3\\\\u2030|\\\\u00c9|\\\\u00c8",
      "\\\\u00c3\\\\u0152|\\\\u00c3\\\\u008d|\\\\u00cd|\\\\u00cc",
      "\\\\u00c3\\\\u2019|\\\\u00c3\\\\u201c|\\\\u00c2\\\\u00b0|\\\\u00d3|\\\\u00b0|\\\\u00d2|\\\\u00ba|\\\\u00c2\\\\u00ba",
      "\\\\u00c3\\\\u2018|\\\\u00d1",
      "\\u00a0|\\u00c2\\u00a0",
      "\\\\u00f3"),
    c("A", "E", "I", "O", "N", "", "o"),
    vectorize_all = FALSE))
}
As a side note, I ran microbenchmark on both methods; these are the results:
g<-microbenchmark(unweird(t),stri_unweird(t),times = 100L)
summary(g)
             expr      min       lq     mean   median       uq      max neval cld
1      unweird(t) 423.0083 425.6400 431.9609 428.1031 432.6295 490.7658   100   b
2 stri_unweird(t) 118.5831 119.5057 121.2378 120.3550 121.8602 138.3111   100  a
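As a closing note on the approach itself: the escape/substitute/unescape dance is a hand-rolled transliteration table. Once the encodings are sane, stringi can do the whole accent-stripping step in one call with stri_trans_general(t, "Latin-ASCII"). The underlying idea (decompose each character, then drop the combining marks) sketched in Python for illustration:

```python
import unicodedata

def strip_accents(s):
    # NFKD splits an accented letter into its base letter plus combining
    # marks, e.g. "\u00d3" -> "O" + U+0301; keep only the non-marks.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("C\u00d3RDOBA C\u00d2RDOBA"))  # CORDOBA CORDOBA
```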


Amount of time to count lines for sequence of files using Rcpp higher than expected

I have a process for cleaning files and then saving them into correctly formatted files that arrow can read at a later time. The files are tsv format and have around 30 columns, mixed data types -- mostly character, but a couple of numeric columns. There is a significant number of files that have only a header and no data content. I decided that prior to reading in the files for cleaning, I would check to ensure that they have content, rather than reading in the file as a data frame, and then checking for content. So, essentially, I just wanted to check that the number of lines in the file was >= 2. I am using a simple C++ method that I pulled into R using Rcpp:
Rcpp::cppFunction("
  #include <fstream>
  bool more_than_one_line(std::string filepath) {
    std::ifstream input_file;
    input_file.open(filepath);
    std::string unused;
    int numLines = 0;
    while (std::getline(input_file, unused)) {
      ++numLines;
      if (numLines >= 2) {
        return true;
      }
    }
    return false;
  }
")
I take some timing measurements like so:
v <- vector(mode="numeric", length=1000)
ii = 0
for (file in listOfFiles[1:1000]) {
print(ii)
ii = ii + 1
t0 <- Sys.time()
more_than_one_line(file);
v[ii] <- difftime(Sys.time(), t0)
}
When I run this code, it takes about 1 second per file if the files have never been read before; it's much, much faster on files that have previously been processed. Yet, according to this SO answer, the fastest time for counting the lines in a 12M-row file is 0.1 seconds (my files are at most 500k rows), and the SO user who recommended that fastest strategy (which used Linux wc) also noted that C++ would be quite fast. I thought the C++ method I wrote would be at least as fast as the wc method, if not faster, since I read, at most, only the first two lines.
Am I thinking about this wrong? Is my approach wrong?
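For what it's worth, the check itself only has to read until a second line is evident, so the parsing strategy is unlikely to be the bottleneck. A Python sketch of the same early-exit idea (an illustration, not a drop-in for the Rcpp version):

```python
def more_than_one_line(path, chunk_size=4096):
    # Early exit: a file has at least two lines as soon as we see a
    # newline followed by at least one more byte.
    saw_newline = False
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk_size)
            if not block:
                return False          # EOF before a second line appeared
            if saw_newline:
                return True           # any byte after a newline: line 2
            pos = block.find(b"\n")
            if pos != -1:
                if pos + 1 < len(block):
                    return True       # bytes follow the newline in this block
                saw_newline = True    # newline was the last byte; keep reading
```

That previously read files are much faster suggests the cold run is dominated by the OS reading the file from disk (plus any antivirus scanning), not by the counting itself.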
In my answer to the question linked to by the OP, I mention package fpeek and its function peek_count_lines. This is a fast function coded in C++. With a directory of 82 CSV files ranging from 4.8K lines to 108K lines, my computer (a one-year-old 11th Gen Intel Core i5-1135G7 @ 2.40GHz running Windows 11) averages 0.03 seconds per file to get the number of lines.
Then you can use these values and subset on the condition flsize >= 2.
#path <- "my/path/omitted"
fls <- list.files(path, full.names = TRUE)
# how many files
length(fls)
#> [1] 82
# range of file sizes in MB
range(file.size(fls)) / 1024L / 1024L
#> [1] 0.766923 19.999812
# total file size in MB
sum(file.size(fls)) / 1024L / 1024L
#> [1] 485.6675
# this is the main problem
t0 <- system.time(
flsize <- sapply(fls, fpeek::peek_count_lines)
)
# the files have from 4.8K to 108K lines
range(flsize)
#> [1] 4882 108503
# how many files have more than just the header line
sum(flsize >= 2)
#> [1] 82
# timings
t0
#> user system elapsed
#> 0.28 1.12 2.30
# average timings per file
t0/length(flsize)
#> user system elapsed
#> 0.003414634 0.013658537 0.028048780
Created on 2023-02-04 with reprex v2.0.2

Split one column into two columns while retaining the separator

I have a very large data array:
'data.frame': 40525992 obs. of 14 variables:
$ INSTNM : Factor w/ 7050 levels "A W Healthcare Educators"
$ Total : Factor w/ 3212 levels "1","10","100",
$ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",
$ Count : num 0 0 0 0 0 0 0 0 0 0 ...
The Crime_Type column contains the type of crime and the year, so "MURD11" is Murder in 2011. These are college campus crime statistics my kid is analyzing for her school project; I am helping when she is stuck. I am currently stuck at creating a clean data file she can analyze.
Once I converted the wide file (all crime types in columns) to a long file using gather, the file size went from 300 MB to 8 GB, which is the file I am working on now. Do you think that is the problem? How do I convert it to a data.table for faster processing?
What I want to do is to split this 'Crime_Type' column into two columns 'Crime_Type' and 'Year'. The data contains alphanumeric and numbers. There are also some special characters like NEG_M which is 'Negligent Manslaughter'.
We will replace the full names later, but can someone suggest how I can separate
MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)
etc...
I have tried using,
df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")
The first one separates the Crime as it looks for numbers. The second one does not work at all.
I also tried
df$Crime_Type<- apply (strsplit(as.character(df$Crime_Type), split="[:digit:]"))
That does not work at all. I have gone through many posts on Stack Overflow, and that's where I got these commands, but I am now truly stuck and would appreciate your help.
Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.
library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
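The same two-group regex works anywhere; here it is in Python, purely as an illustration of what extract() captures:

```python
import re

# Group 1: everything non-numeric; group 2: the trailing year digits.
pattern = re.compile(r"^([^0-9]+)([0-9]+)$")

for code in ["MURD11", "NEG_M10"]:
    crime, year = pattern.match(code).groups()
    print(crime, year)  # MURD 11, then NEG_M 10
```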
In base R, one option is to create a unique delimiter between the non-numeric and numeric parts. We capture the non-numeric ([^0-9]+) and numeric ([0-9]+) characters as groups by wrapping them in parentheses ((..)), and in the replacement we use \\1 for the first capture group, followed by a comma and the second group (\\2). The result can be used as the input vector to read.table with sep=',' to read it as two columns.
df1 <- read.table(text=gsub('([^0-9]+)([0-9]+)', '\\1,\\2',
totallong$Crime_Type),sep=",", col.names=c('Crime', 'Year'))
df1
# Crime Year
#1 MURD 11
#2 NEG_M 11
If we need, we can cbind with the original dataset
cbind(totallong, df1)
Or in base R, we can use strsplit with split specifying the boundary between a non-number ((?<=[^0-9])) and a number ((?=[0-9])). Here we use lookarounds to match the boundary. The output is a list; we can rbind the list elements with do.call(rbind, ...) and convert the result to a data.frame.
as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type),
split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
# V1 V2
#1 MURD 11
#2 NEG_M 11
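The same zero-width boundary is portable to any PCRE-style engine; in Python, for illustration:

```python
import re

# Zero-width split point: right after a non-digit, right before a digit.
boundary = re.compile(r"(?<=[^0-9])(?=[0-9])")

print(boundary.split("MURD11"))   # ['MURD', '11']
print(boundary.split("NEG_M11"))  # ['NEG_M', '11']
```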
Or another option is tstrsplit from the devel version of data.table, i.e. v1.9.5. Here also we use the same regex. In addition, there is an option to convert the output columns into different classes.
library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type,
"(?<=[^0-9])(?=[0-9])", perl=TRUE, type.convert=TRUE)]
# Crime_Type Crime Year
#1: MURD11 MURD 11
#2: NEG_M11 NEG_M 11
If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL
totallong[, Crime_Type:= NULL]
NOTE: Instructions to install the devel version are here
Or a faster option would be stri_extract_all from library(stringi) after collapsing the rows into a single string ('v2'). The alternating elements in 'v3' can then be extracted by indexing with seq to create the new data.frame.
library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
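The collapse-and-interleave trick reads the same in other languages; a Python sketch (sample string invented for illustration):

```python
import re

v2 = "MURD11NEG_M11"              # all rows collapsed into one string
v3 = re.findall(r"\d+|\D+", v2)   # alternating text/number runs
crime, year = v3[0::2], v3[1::2]  # odd and even positions

print(crime, year)  # ['MURD', 'NEG_M'] ['11', '11']
```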
Benchmarks
v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))
system.time({
v2 <- paste(test$v1, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
})
#user system elapsed
#56.019 1.709 57.838
data
totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))

parsing access.log to data.frame

I want to parse an access.log in R. It has the following form and I want to get it into a data.frame:
TIME="2013-07-25T06:28:38+0200" MOBILE_AGENT="0" HTTP_REFERER="-" REQUEST_HOST="www.example.com" APP_ENV="envvar" APP_COUNTRY="US" APP_DEFAULT_LOCATION="New York" REMOTE_ADDR="11.222.33.444" SESSION_ID="rstg35tsdf56tdg3" REQUEST_URI="/get/me/something" HTTP_USER_AGENT="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" REQUEST_METHOD="GET" REWRITTEN_REQUEST_URI="/index.php?url=/get/me/something" STATUS="200" RESPONSE_TIME="155,860ms" PEAK_MEMORY="18965" CPU="99,99"
The logs are 400MB per file, and currently I have about 4GB of logs, so size matters.
Another thing: there are two different log structures (different columns are included), so you cannot assume the same columns are always present, but you can assume that only one kind of structure is parsed at a time.
What I have up to now is a regex for this structure:
(\\w+)[=][\"](.*?)[\"][ ]{0,1}
I can read the data in and somehow fit it into a data frame using readLines, gsub and read.table, but it is slow and messy.
Any ideas? Tnx!
You can do this for example:
text <- readLines(textConnection(text))
## since we can't use = as splitter (used in url) I create a new splitter
dd <- read.table(text=gsub('="','|"',text),sep=' ')
## use data.table since it is faster to apply operation by columns and bind them again
library(data.table)
DT <- as.data.table(dd)
DT.split <- DT[, lapply(.SD, function(x)
  unlist(strsplit(as.character(x), "|", fixed = TRUE)))]
DT.split[c(F, T)]
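Since every record is a run of KEY="value" pairs, the OP's capture-group regex can also feed a keyed structure directly, which sidesteps the column-order problem between the two log formats. A Python sketch of that idea (not the read.table route above; sample line shortened from the question):

```python
import re

line = ('TIME="2013-07-25T06:28:38+0200" REQUEST_HOST="www.example.com" '
        'STATUS="200" RESPONSE_TIME="155,860ms"')

# Each KEY="value" pair becomes one entry; whichever columns a given
# log structure has, they all end up keyed by name.
record = dict(re.findall(r'(\w+)="(.*?)"', line))

print(record["STATUS"], record["REQUEST_HOST"])  # 200 www.example.com
```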

Stata macro and for loop when there are quotes and numbers

Suppose you have the macro
global LabNames "3M" "ABBOTT" "MERCK SHARP DOHME"
I am using the quotes so that the words are correctly grouped (MERCK SHARP DOHME is one company, not three different ones). I am trying to write a program that goes over a variable and replaces its value when it contains one of the strings of LabNames as a substring.
Let us start with the part of the code that works fine.
foreach company of global LabNames {
display "`company'"
}
This code proceeds as needed in my case: it lists 3 different companies (not 5). The following code, however, does not run correctly. It breaks down for 3M.
gen hasLab = 0
foreach company of global LabNames {
display "`company'"
replace hasLab = (index(lab,`"`company'"'))
replace lab = `"`company'"' if hasLab > 0
}
If we apply this code to
lab
asdf 3M
3M
ABBOTT
ABBOTT asdf
MERCK SHARP DOHME AS
MERCK SHARP DOHME 4
we get
lab
3M
asdf 3M
ABBOTT
ABBOTT
MERCK SHARP DOHME
MERCK SHARP DOHME
Would you know what to do so that the code can handle the 3M case correctly?
You have unnecessary quotes in your global. It's getting messed up. See
. global LabNames "3M" "ABBOTT" "MERCK SHARP DOHME"
. mac list LabNames
LabNames: 3M" "ABBOTT" "MERCK SHARP DOHME
You can just type
global LabNames 3M ABBOTT "MERCK SHARP DOHME"
See help macrolists for some tips.
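The grouping rule at work here is ordinary shell-style tokenization: unquoted whitespace separates items, and quoted runs stay together. A Python illustration via shlex (just the concept, not Stata):

```python
import shlex

# Same macro text as the fixed global: quotes keep the three-word
# company name as a single token.
print(shlex.split('3M ABBOTT "MERCK SHARP DOHME"'))
# ['3M', 'ABBOTT', 'MERCK SHARP DOHME']
```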

Perl RegEx for Matching 11 column File

I'm trying to write a perl regex to match the 5th column of files that contain 11 columns. There's also a preamble and footer which are not data. Any good thoughts on how to do this? Here's what I have so far:
if($line =~ m/\A.*\s(\b\w{9}\b)\s+(\b[\d,.]+\b)\s+(\b[\d,.sh]+\b)\s+.*/i) {
And this is what the forms look like:
No. Form 13F File Number Name
____ 28-________________ None
[Repeat as necessary.]
FORM 13F INFORMATION TABLE
TITLE OF VALUE SHRS OR SH /PUT/ INVESTMENT OTHER VOTING AUTHORITY
NAME OF INSURER CLASS CUSSIP (X$1000) PRN AMT PRNCALL DISCRETION MANAGERS SOLE SHARED NONE
Abbott Laboratories com 2824100 4,570 97,705 SH sole 97,705 0 0
Allstate Corp com 20002101 12,882 448,398 SH sole 448,398 0 0
American Express Co com 25816109 11,669 293,909 SH sole 293,909 0 0
Apollo Group Inc com 37604105 8,286 195,106 SH sole 195,106 0 0
Bank of America com 60505104 174 12,100 SH sole 12,100 0 0
Baxter Internat'l Inc com 71813109 2,122 52,210 SH sole 52,210 0 0
Becton Dickinson & Co com 75887109 8,216 121,506 SH sole 121,506 0 0
Citigroup Inc com 172967101 13,514 3,594,141 SH sole 3,594,141 0 0
Coca-Cola Co. com 191216100 318 6,345 SH sole 6,345 0 0
Colgate Palmolive Co com 194162103 523 6,644 SH sole 6,644 0 0
If you ever do write a regex this long, you should at least use the x flag to ignore whitespace, and importantly allow whitespace and comments:
m/
whatever
something else # actually trying to do this
blah # for fringe case X
/xi
If you find it hard to read your own regex, others will find it impossible.
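Python's equivalent of Perl's /x is re.VERBOSE: unescaped whitespace in the pattern is ignored and # starts a comment, so a long pattern can be laid out and annotated. A small illustration against an invented 13F-style row:

```python
import re

# Same idea as Perl's /x: whitespace is ignored and "#" comments are
# allowed inside the pattern.
cusip_value = re.compile(r"""
    (\w{9})        # nine-character CUSIP
    \s+
    ([\d,.]+)      # dollar value, possibly with thousands separators
""", re.VERBOSE)

m = cusip_value.search("3M CO COM 88579Y101 478 6051")
print(m.groups())  # ('88579Y101', '478')
```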
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the field separator (sep_char).
Like Ether said, another tool would be appropriate for this job.
my @fields = split /\t/, $line;
if (@fields == 11) {    # less than 11 fields is probably header/footer
    $the_5th_column = $fields[4];
    ...
}
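A Python sketch of the same split-and-count filter, assuming tab-separated rows (the sample row is invented to match the 11-column layout):

```python
def fifth_column(line):
    # Preamble and footer lines won't have all 11 tab-separated fields,
    # so the length check screens them out.
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 11:
        return fields[4].strip()   # 5th column, trimmed of padding
    return None

row = "\t".join(["Abbott Laboratories", "com", "2824100", "4,570",
                 "97,705", "SH", "", "sole", "", "97,705", "0"])
print(fifth_column(row))          # 97,705
print(fifth_column("preamble"))   # None
```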
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.
If you are dealing with columnar data, you can go after it with substr() or unpack() more easily than with a regex. You can use a regex to parse out the data, but most of us who've been programming Perl a while have also learned that a regex is often not the first tool to reach for. That's why you got the other comments. Regex is a powerful weapon, but it's also an easy way to shoot yourself in the foot.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC edgar site, I've found that the 13F files are nicely formatted. And, you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE
VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0
ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0
AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0
ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing correctly because there are tabs between columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
Based on some of the other files here's some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress runs of multiple tabs into a single tab, then probably split on tabs to get your columns.
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others wrap columns to the next line (like a spreadsheet would in a column that is not wide enough). It's not too hard to figure out how to deal with that, but how to do it is left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think someone loaded their spreadsheet or report into a word processor and then saved it. Reading that is a two-step process: get rid of the wrapping carriage returns, then re-process the data to parse out the columns. As an added task, they also have column headings embedded in the data at page breaks.
You should be able to get 100% of the files parsed, however you'll probably want to do it with a couple different parsing methods because of the use of tabs and blank lines and embedded column headers.
Ah, the fun of processing data from the wilderness.