R: removing the last three dots from a string - regex

I have a text data file that I likely will read with readLines. The initial portion of each string contains a lot of gibberish followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split the strings after the last three dots, or replace the last three dots with a marker of some sort telling R to treat everything to the left of those three dots as one column.
Here is a similar post on Stackoverflow that will locate the last dot:
R: Find the last dot in a string
However, in my case some of the data have decimals, so locating the last dot will not suffice. Also, I think ... has a special meaning in R, which might be complicating the issue. Another potential complication is that some of the dots are bigger than others. Also, in some lines one of the three dots was replaced with a comma.
In addition to gregexpr in the post above I have tried using gsub, but cannot figure out the solution.
Here is an example data set and the outcome I hope to achieve:
aa = matrix(c(
  'first string of junk... 0.2 0 1',
  'next string ........2 0 2',
  '%%%... ! 1959 ... 0 3 3',
  'year .. 2 .,. 7 6 5',
  'this_string is . not fine .•. 4 2 3'),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("C1")))
aa <- as.data.frame(aa, stringsAsFactors=F)
aa
# desired result
#                          C1  C2 C3 C4
# 1      first string of junk 0.2  0  1
# 2         next string .....   2  0  2
# 3             %%%... ! 1959   0  3  3
# 4                 year .. 2   7  6  5
# 5 this_string is . not fine   4  2  3
I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.
Some of the lines do not contain gibberish or three dots, but only data. However, that might be a complication for a follow up post.
Thank you for any advice.

This does the trick, though not especially elegant...
options(stringsAsFactors = FALSE)
# Search for three consecutive delimiter characters, then pull out
# all of the characters after them (captured in parentheses and
# referenced in the replacement as \\1)
nums <- gsub("^.*[.,•]{3}\\s*(.*)", "\\1", aa$C1)
# Use strsplit to break the results apart at spaces and just get the numbers
# Use do.call(rbind, ...) to bind the resulting pieces back together
# into a matrix with one row per original string
num.mat <- do.call(rbind, strsplit(nums, split = " "))
# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))
# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
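If you also want the left-hand column cleaned up as in the desired result (everything from the last three delimiters onward dropped), a similar greedy sub() should do it. A sketch, reusing the delimiter class from above; trimws() just removes the trailing space the match leaves behind:
# Capture everything before the LAST run of three delimiter characters;
# the greedy .* pushes the delimiter match as far right as possible
junk <- trimws(sub("^(.*)[.,•]{3}\\s*\\d.*$", "\\1", aa$C1))
result2 <- data.frame(junk, num.mat, stringsAsFactors = FALSE)
names(result2) <- c("C1", "C2", "C3", "C4")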

This will get you most of the way there, and it will have no problems with numbers that include commas:
# First, use a regex to eliminate the bad pattern. This regex
# eliminates any three-character combination of periods, commas,
# and big dots (•), so long as the combination is followed by
# 0-2 spaces and then a digit.
aa.sub <- as.matrix(
  apply(aa, 1, function (x)
    gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))
# Second: it looks as though you want your data split into columns.
# So this regex splits on spaces that are (a) preceded by a letter,
# digit, or space, and (b) followed by a digit. The result is a
# list, each element of which is a list containing the parts of
# one of the strings in aa.
aa.list <- apply(aa.sub, 1, function (x)
  strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))
# Remove the second element in aa. There is no space before the
# first data column in this string. As a result, strsplit() split
# it into three columns, not 4. That in turn throws off the code
# below.
aa.list <- aa.list[-2]
# Make the data frame.
aa.list <- lapply(aa.list, unlist) # convert list of lists to list of vectors
aa.df <- data.frame(aa.list)
aa.df <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE)
The only thing remaining is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it's better just to handle cases like that manually.
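One possible modification, sketched below: add an alternation that also splits where a digit directly follows two literal dots. That covers the second string ('.....2' becomes '.....' and '2') without breaking decimals like 0.2, which have only a single dot in front of the digit. It handles all five rows of the example aa, so the aa.list[-2] step is no longer needed:
# The extra branch (?<=\.\.)(?=\d) splits between a run of two or more
# literal dots and a following digit
aa.list2 <- apply(aa.sub, 1, function (x)
  strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)|(?<=\\.\\.)(?=\\d)', perl = TRUE))
aa.df2 <- data.frame(t(sapply(aa.list2, unlist)),
                     row.names = NULL, stringsAsFactors = FALSE)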

Reverse the string
Reverse the pattern you're searching for if necessary (it isn't necessary in your case, since '...' reads the same reversed)
Reverse the result
[haiku-pseudocode]
a = 'first string of junk... 0.2 0 1' // string to search
b = 'junk' // pattern to match
ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
rb = reverseString (b) // now equals 'knuj'
// run your regular expression search / replace - search in 'ra' for 'rb'
// put the result in rResult
// and then unreverse the result
// apologies for not knowing the syntax for 'R' regex
[/haiku-pseudocode]
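For what it's worth, the same trick is short in R with the stringi package. A sketch (here the pattern really does read the same reversed, so step 2 can be skipped):
library(stringi)
a <- 'first string of junk... 0.2 0 1'
ra <- stri_reverse(a) # '1 0 2.0 ...knuj fo gnirts tsrif'
# a lazy match replaces the FIRST three dots of the reversed string,
# i.e. the LAST three dots of the original, with a '|' marker
rr <- sub('^(.*?)\\.{3}', '\\1|', ra, perl = TRUE)
stri_reverse(rr)
# [1] "first string of junk| 0.2 0 1"
Splitting on the '|' marker then separates the junk from the data.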

Related

Obtaining geographic decimal coordinates from proprietary text format using regex

Using only Notepad++ with regex support, I would like to extract some data from a txt file representing geographic coordinates and organize the output like this:
-123456789 becomes -123.456789
123456789 becomes 123.456789
-23456789 becomes -23.456789
56789 becomes 0.056789
-89 becomes -0.000089
I tried this: (-?)([0-9]*)([0-9]{6}), but it fails when the input is less than 6 digits long.
You will need two steps in Notepad++ to do this. First, let's take a look at the regex:
(?<sign>-?)(?<first>\d+(?=\d{6}))?(?<last>\d+)
captures the necessary parts in groups.
Explanation: (you can lose the named grouping if you want)
(?<sign>-?) # read the '-' sign
(?<first>\d+(?=\d{6}))? # read as many digits as possible,
# leaving 6 digits at the end.
(?<last>\d+) # read the remaining digits.
see regex101.com
How to use this in Notepad++? Using a two-step search and replace:
(-?)(\d+(?=\d{6}))?(\d+)
replace with:
\1(?2\2.:0.)000000\3 # copy sign, if group 2 contains any
# values, copy them, followed by '.'.
# If not show a '0.'
# Print 6 zeros, followed by group 3.
Next, replace the superfluous zeros.
\.(0+(?=\d{6}\b))(\d{6}) # Replace the maximum number of zeros,
# leaving 6 digits at the end.
replace with:
.\2
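For example, -89 becomes -0.00000089 after the first replacement (group 2 is empty, so the sign is followed by 0., six zeros and 89); the second replacement then strips the surplus zeros, leaving -0.000089.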
You can do it in three steps:
Step 1: replace (-?)\b(\d{1,6})\b with \10000000\2
Step 2: replace (-?)(\d{0,})(\d{6}) with \1\2.\3
Step 3: replace 0{2,}\. with 0.
The idea is simple:
In the first step, pad every number up to 6 digits long with zeros in front, to ensure its length exceeds 6
In step two, put the dot before the last 6 digits
In step three, collapse the run of zeros before the dot to a single zero
In the end, the output is:
-123.456789
123.456789
-23.456789
0.056789
-0.000089
You could use the Python Script plugin available for Notepad++:
editor.rereplace('(\d+)', lambda m: ('%f' % (float(m.group(1))/1000000)))
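As that script suggests, once you leave the editor the whole job is a single division by 10^6. For comparison, a sketch of the same transformation in R (assuming the numbers sit in a character vector):
x <- c("-123456789", "123456789", "-23456789", "56789", "-89")
sprintf("%.6f", as.numeric(x) / 1e6)
# [1] "-123.456789" "123.456789"  "-23.456789"  "0.056789"    "-0.000089"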

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to figure out a way to search for matching strings of 7 different numbers (the decimal portion of the gps), but so far I've only been able to figure out how to search for repeating digits or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail:
([^,]+(?:,[^,]+){2})\v+ is a group consisting of one or more non-commas, followed twice by a comma plus one or more non-commas (e.g. 1,1,1), and then vertical whitespace (a linebreak) that is not part of the group
(?:[^,]+,){3} matches one or more non-commas followed by a comma, three times (the leading columns that don't have to be considered)
\1 is a backreference to group 1, matching only if it contains exactly the same text as group 1
(?:\v+|$) matches either more vertical whitespace or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
However, if you are using any programming language to check this, I wouldn't walk the path of regex, as checking for those repetitions can be done a lot more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = [0, 0, 0]
repetitions = 0
lines = [line.rstrip('\n') for line in open(r'C:\temp\gps.csv')]
for line in lines:
    gpscoords = line.split(',')[3:6]
    if gpscoords == oldcoords:
        repetitions += 1
    else:
        oldcoords = gpscoords
        repetitions = 0
    if repetitions == 4:  # or however you define more than a few
        print(', '.join(gpscoords) + ' is repeated')
If you can use perl, and if I understood you:
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equal to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath should do the trick.
Sample output:
Last three fields in lines 1 to 7 are equal to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equal to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equal to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equal to:
35.5128764,71.2874545,575.631
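In R the same repetition check is also compact via run-length encoding. A sketch, assuming a headerless file whose GPS values are columns 4 to 6 ("gps.csv" stands in for your file path):
gps <- read.csv("gps.csv", header = FALSE)
coords <- do.call(paste, c(gps[4:6], sep = ",")) # one string per line
runs <- rle(coords)                              # run lengths of identical coordinates
runs$values[runs$lengths >= 4]                   # coordinates repeated 4 or more times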

Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator = c("someindicator2001", "someindicator2011",
                                "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))
Desired results
Desired results should look like this:
          indicator   period    values
1     someindicator     2001 0.2655087
2     someindicator     2011 0.3721239
3         some text 20022008 0.5728534
4 another indicator     2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (everything from the first digit onward) are in the second column
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
  separate(col = indicator, into = c("indicator", "period"),
           sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
  indicator period    values
1               001 0.2655087
2               011 0.3721239
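(The reason: separate() drops whatever the separator pattern matches, and ^[^\\d]*(2+) matches everything from the start of the string through the first run of 2s, so the description and the leading digits are consumed along with the separator.)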
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]", but it breaks the column down into too many columns, as it appears to split at every non-alphanumeric character.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
#           indicator   period    values
# 1     someindicator     2001 0.2655087
# 2     someindicator     2011 0.3721239
# 3         some text 20022008 0.5728534
# 4 another indicator     2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
? makes the literal space before it optional, matching it between zero and one time (greedily, giving back as needed)
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
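The same first-digit idea also works in base R without tidyr. A sketch, using regexpr() to locate the first digit and substr() to cut around it:
ind <- as.character(dta$indicator)
pos <- regexpr("[0-9]", ind) # position of the first digit in each string
data.frame(indicator = trimws(substr(ind, 1, pos - 1)),
           period = substr(ind, pos, nchar(ind)),
           values = dta$values)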
You could also use unglue::unglue_unnest():
dta <- data.frame(indicator = c("someindicator2001", "someindicator2011",
                                "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#>       values         indicator   period
#> 1 0.43234262     someindicator     2001
#> 2 0.65890900     someindicator     2011
#> 3 0.93576805         some text 20022008
#> 4 0.01934736 another indicator     2003
Created on 2019-09-14 by the reprex package (v0.3.0)

regex variable substitution in "replacement" argument

I have a string in R. I want to find part of the string and append a variable number of zeroes. For example, I have 1 2 3. Sometimes I want it to be 1 20 3; sometimes I want it to be 1 2000 3. If I store the number of appended zeroes in a variable, how can I use it in the "replacement" part of a sub command?
I have in mind code like this:
s <- '1 2 3'
z <- '3'
sub('(\\s\\d)(\\s.*)', '\\10{z}\\2', s)
This code returns 1 20{z} 3. But I want 1 2000 3. How can I get this sort of result?
One way is
s <- '1 2 3'
z <- '3'
zx <- paste(rep(0, z), collapse = '')
sub('(\\s\\d)(\\s.*)', paste0('\\1', zx, '\\2'), s)
but this is a little clunky.
Try the concatenation operator from the stringi package:
require(stringi)
"abc"%stri+%"123abc"
## [1] "abc123abc"
Your approach to creating the replacement string zx is pretty good. However, you can improve your sub command: if you use lookbehind and lookahead instead of capture groups, you don't need to paste together a new replacement string; you can use zx directly.
sub("(?<=\\s\\d)(?=\\s)", zx, s, perl = TRUE)
# [1] "1 2000 3"

Count the occurrences of word by pattern in R

Perhaps an oft-asked question, but I am royally stuck here.
From an XML file, I'm trying to find, for each 12-character string containing only letters and numerals (literally alphanumeric), all of its occurrences, the lines they appear on, and the total occurrence count.
For example: if my file is xmlInput, I'm trying to search for and extract all the occurrences, positions, and total counts of each 12-character alphanumeric string.
Example output:
String        Total Count  Line-Num
CPXY180D2324  2            132, 846
CPXY180D2131  1            372
CPCY180D2139  1            133
I know that I could use regmatches to get all occurrences of a string by pattern. I've been using the call below for that (thanks to your help on this).
ProNum12 <- regmatches(xmlInput, regexpr("([A-Z0-9]{12})", xmlInput))
ProNum12
regmatches gives me all the matches for the pattern, but it doesn't give me the line numbers where the pattern appeared. grep gives me the line numbers of all occurrences.
I thought I could use the textcnt function from the tau package, but couldn't get it to run correctly. Perhaps it is not the right tool?
Is there a package/library in R which will search for all words matching the pattern and return the total count of appearances and the line numbers of each occurrence? If no such package exists, any idea how I can do this using any of the above, or better?
Without seeing your data, it is hard to offer a suggestion on how to proceed. Here is an example with some plain character strings that might help you get started on finding a solution of your own.
First, some sample data (which probably looks nothing like your data):
x <- c("Some text with a strange CPXY180D2324 string stuck in it.",
"Some more text with CPXY180D2131 strange strings CPCY180D2139 stuck in it.",
"Even more text with strings that CPXY180D2131 don't make much sense.",
"I'm CPXY180D2324 tired CPXY180D2324 of CPXY180D2324 text with CPXY180D2131 strange strings CPCY180D2139 stuck in it.")
We can split it by spaces. This is another area where it might not fit your actual problem, but again, this is just to help you get started (or to help others provide a much better answer, as may be the case).
x2 <- strsplit(x, " ")
Search the split data for values matching your regex pattern. Create a data.frame that includes the line numbers and the matched string.
temp <- do.call(rbind, lapply(seq_along(x2), function(y) {
  data.frame(line = y,
             value = grep("([A-Z0-9]{12})", x2[[y]], value = TRUE))
}))
temp
#   line        value
# 1    1 CPXY180D2324
# 2    2 CPXY180D2131
# 3    2 CPCY180D2139
# 4    3 CPXY180D2131
# 5    4 CPXY180D2324
# 6    4 CPXY180D2324
# 7    4 CPXY180D2324
# 8    4 CPXY180D2131
# 9    4 CPCY180D2139
Create your data.frame of line numbers and counts.
with(temp, data.frame(
  lines = tapply(line, value, paste, collapse = ", "),
  count = tapply(line, value, length)))
#                   lines count
# CPXY180D2324 1, 4, 4, 4     4
# CPCY180D2139       2, 4     2
# CPXY180D2131    2, 3, 4     3
Anyway, this is purely a guess (and me killing time....)
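If splitting on spaces turns out to be too fragile (say, IDs with punctuation stuck to them), gregexpr() avoids it entirely by returning every match per line. A sketch on the same sample data:
m <- gregexpr("[A-Z0-9]{12}", x) # all matches in each line
hits <- regmatches(x, m)         # list of matched strings per line
temp <- data.frame(line = rep(seq_along(hits), lengths(hits)),
                   value = unlist(hits))
temp then has the same shape as above, so the tapply() summary applies unchanged.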