Modify a file with R at a given position - regex

How can I modify a text file with R at a given position?
It's not a table but any file, for example an xml file.
For example, I want to replace the content of line 122, columns 7 to 10, with the content of a variable (say "3.14") and then update the file.
Imagine that line was
<name=0.32>
now should be
<name=3.14>
Or another option, maybe easier.
Look for all occurrences of "Variable=" and change the next 4 characters.

You would have to read in the entire file and then use string manipulation. E.g.,
f <- "file.xml"
x <- readLines(f)
# keep columns 1-6, insert the new value, keep everything from column 11 onwards
x[122] <- paste0(substring(x[122], 1, 6), "3.14", substring(x[122], 11))
writeLines(x, f)
I assume there is a command-line tool that would be better suited to this.
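The same idea can be sketched in Python, which also covers the second option from the question (changing the four characters after every "Variable="). This is only a minimal sketch under stated assumptions: the function names replace_columns, replace_after_key and edit_line_in_file are made up for illustration, columns are taken to be 1-based and inclusive as in the question, and the file is assumed small enough to read into memory.

```python
import re


def replace_columns(line, start, end, new_text):
    """Replace 1-based columns start..end (inclusive) of line with new_text."""
    return line[:start - 1] + new_text + line[end:]


def replace_after_key(text, key, new_text, width=4):
    """Replace the `width` characters that follow every occurrence of key."""
    pattern = re.compile("(" + re.escape(key) + ").{%d}" % width)
    # A function is used as the replacement so that backslashes in
    # new_text are never interpreted as regex escapes.
    return pattern.sub(lambda m: m.group(1) + new_text, text)


def edit_line_in_file(path, line_no, start, end, new_text):
    """Rewrite one line (1-based line_no) of the file at path in place."""
    with open(path) as fh:
        lines = fh.readlines()
    lines[line_no - 1] = replace_columns(lines[line_no - 1], start, end, new_text)
    with open(path, "w") as fh:
        fh.writelines(lines)


print(replace_columns("<name=0.32>", 7, 10, "3.14"))
print(replace_after_key("Variable=0.32 and Variable=1.00", "Variable=", "3.14"))
```

For the case in the question, edit_line_in_file("file.xml", 122, 7, 10, "3.14") would be the equivalent call, assuming the file has at least 122 lines.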

Related

How to get the C/C++ source code of a secondary function of R?

I was wondering what the correct way is to get the C/C++ source code of any secondary function in R ("secondary" to distinguish it from the Primitive/Internal ones).
Related questions are here, here, here and here:
Mine is different, which is why I used "secondary" in my question.
For example, the read.table() function, within R console I got:
>?read.table
read.table package:utils R Documentation
Data Input
Description:
Reads a file in table format and creates a data frame from it,
with cases corresponding to lines and variables to fields in the
file.
Usage:
read.table(file, header = FALSE, sep = "", quote = "\"'",
......
Or
> getAnywhere(read.table)
A single object matching ‘read.table’ was found
It was found in the following places
package:utils
namespace:utils
with value
function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
......
attr(data, "row.names") <- row.names
data
}
<bytecode: 0x560ff88edd40>
<environment: namespace:utils>
Searching the web, I found:
https://svn.r-project.org/R/trunk/src/library/utils/src/utils.c
https://svn.r-project.org/R/trunk/src/library/utils/src/utils.h
How can I get the C/C++ source code behind the read.table function, rather than the R code, if that is a reasonable thing to ask?
The searchable R source code at https://github.com/wch/r-source is really useful for this:
First we can look for the read.table definition
The actual data reading is done by the scan function which in the end uses
.Internal(scan(file, what, nmax, sep, dec, quote, skip, nlines,
[...]
Now scan is mapped to do_scan
So here you are: The underlying C implementation for read.table can be found in src/main/scan.c, starting with the function do_scan.

Regular expression search and substitute values in CSV file

I want to find and replace all of the Managerial positions in a CSV file with number 3. The list contains different positions from simple ",Manager," to ",Construction Project Manager and Project Superintendent," but all of them are placed between two commas. I wrote this to find them all:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?,
The problem is that sometimes a comma is shared between two adjacent managerial positions. So I need to include the comma when I search for the positions, but exclude it when I replace the position with 3. How can I do that with a regular expression in Python?
Here is the CSV file.
I suggest using Python's built-in CSV module instead. Let's not reinvent the wheel here and consider handling CSV as a solved problem.
Here is some sample code that demonstrates how it can be done: The csv module is responsible for reading and writing the file with the correct delimiter and quotation char.
re.search is used to search individual cells/columns for your keyword. If manager is found, put a 3, otherwise, put the original content and write the row back when done.
import csv, sys, re
infile = r'in.csv'
outfile = r'out.csv'
o = open(outfile, 'w', newline='')
csvwri = csv.writer(o, delimiter=',', quotechar='\"', quoting=csv.QUOTE_MINIMAL)
with open(infile, newline='') as f:
    reader = csv.reader(f, delimiter=',', quotechar='\"', quoting=csv.QUOTE_MINIMAL)
    try:
        for row in reader:
            newrow = []
            for col in row:
                if re.search("manager", col, re.I):
                    newrow.append("3")
                else:
                    newrow.append(col)
            csvwri.writerow(newrow)
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(infile, reader.line_num, e))
o.flush()
o.close()
Straightforward and clean, I would say.
If you insist on using a regex, here's an improved pattern:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?(?=,)
Replace with 3, as shown in the demo.
However, I believe you are still better off with the csv lib approach.
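For readers who want to try the "match the comma but do not consume it" idea directly in Python, here is a small sketch. Note that it does not use the pattern above: it is a simplified variant that uses a lookbehind and a lookahead, so both surrounding commas are required but neither is replaced. The function name replace_manager_fields and the sample line are made up for illustration, and the sketch assumes that fields never contain quoted commas (the csv module handles that case properly, the regex does not).

```python
import re

# A field is "anything without commas" that contains "manager" (any case),
# preceded by a comma and followed by a comma. The lookarounds check the
# commas without making them part of the match.
MANAGER_FIELD = re.compile(r"(?<=,)[^,\n]*manager[^,\n]*(?=,)", re.IGNORECASE)


def replace_manager_fields(line, replacement="3"):
    """Replace every comma-delimited field mentioning 'manager'."""
    return MANAGER_FIELD.sub(replacement, line)


sample = "17,Manager,Construction Project Manager and Project Superintendent,Engineer,x"
print(replace_manager_fields(sample))
```

Because the commas are never consumed, two adjacent managerial fields that share a comma are both matched, which is the case the question was struggling with.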

I have Python code that prints what I need, but I'd like it to write the result to a new file

The file looks like a series of lines with IDs:
aaaa
aass
asdd
adfg
aaaa
I'd like to get, in a new file, each ID and its number of occurrences in the old file, in the form:
aaaa 2
asdd 1
aass 1
adfg 1
with the two elements separated by a tab.
The code I have prints what I want but doesn't write it to a new file:
with open("Only1ID.txt", "r") as file:
    file = [item.lower().replace("\n", "") for item in file.readlines()]
    for item in sorted(set(file)):
        print item.title(), file.count(item)
As you use Python 2, the simplest approach to convert your console output to file output is by using the print chevron (>>) syntax which redirects the output to any file-like object:
with open("filename", "w") as f: # open a file in write mode
    print >> f, "some data" # print 'into the file'
Your code could look like this after simply adding another open to open the output file and adding the chevron to your print statement:
with open("Only1ID.txt", "r") as file, open("output.txt", "w") as out_file:
    file = [item.lower().replace("\n", "") for item in file.readlines()]
    for item in sorted(set(file)):
        print >> out_file, item.title(), file.count(item)
However, your code has a few other issues that could be improved:
Do not use the same variable name file for both the file object returned by open and your processed list of strings. This is confusing, just use two different names.
You can iterate directly over the file object, which works like a generator that returns the file's lines as strings. A generator produces the next element only when it is requested, so it does not first load the whole file into memory the way file.readlines() does; it reads and stores one line at a time. That improves the code's performance and resource efficiency.
If you write a list comprehension but don't actually need its result as a list, because you only iterate over it with a for loop, it's more efficient to use a generator expression (same effect as the file object's line generator described above). The only syntactical difference between a list comprehension and a generator expression is the brackets: replace [...] with (...) and you have a generator. The downsides of a generator are that you can neither find out its length nor access items directly by index. As you don't need either of these features, the generator is fine here.
There is a simpler way to remove trailing newline characters from a line: line.rstrip() removes all trailing whitespace. If you want to keep e.g. spaces and only remove the newline, pass that character as an argument: line.rstrip("\n").
However, it could possibly be even easier and faster to just not add another implicit line break during the print call instead of removing it first to have it re-added later. You would suppress the line break of print in Python 2 by simply adding a comma at the end of the statement:
print >> out_file, item.title(), file.count(item),
There is a type Counter to count occurrences of elements in a collection, which is faster and easier than writing it yourself, because you don't need the additional count() call for every element. The Counter behaves mostly like a dictionary with your items as keys and their count as values. Simply import it from the collections module and use it like this:
from collections import Counter
c = Counter(lines)
for item in c:
    print item, c[item]
With all those suggestions (except the one not to remove the line breaks) applied and the variables renamed to something more clear, the optimized code looks like this:
from collections import Counter
with open("Only1ID.txt") as in_file, open("output.txt", "w") as out_file:
    counter = Counter(line.lower().rstrip("\n") for line in in_file)
    for item in sorted(counter):
        print >> out_file, item.title(), counter[item]
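The answer above targets Python 2, where print is a statement. As a rough sketch of how the same approach might look in Python 3, with the tab separator the question asked for, consider the following. The function names count_ids and write_counts are made up for illustration, and two things differ from the code above by assumption: all surrounding whitespace is stripped (not only the newline) and blank lines are skipped.

```python
from collections import Counter


def count_ids(lines):
    """Return 'Id<TAB>count' strings, one per distinct ID, sorted by ID."""
    # Generator expression: lines are consumed one at a time.
    counter = Counter(line.strip().lower() for line in lines if line.strip())
    return ["{}\t{}".format(item.title(), counter[item]) for item in sorted(counter)]


def write_counts(in_path, out_path):
    """Read IDs from in_path and write the tab-separated counts to out_path."""
    with open(in_path) as in_file, open(out_path, "w") as out_file:
        for row in count_ids(in_file):
            out_file.write(row + "\n")


for row in count_ids(["aaaa\n", "aass\n", "asdd\n", "adfg\n", "aaaa\n"]):
    print(row)
```

With the file names from the question, the call would be write_counts("Only1ID.txt", "output.txt").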

Parameterize pattern match as function argument in R

I have a directory with about 12k csv files, named in the format
YYYY-MM-DD<TICK>.csv
where <TICK> is the ticker of a stock, e.g. MSFT, GS, QQQ etc. There are 500 tickers in total, of various lengths.
My aim is to merge all the csv files for a particular ticker and save the result as a zoo object in an individual RData file in a separate directory.
To automate this, I've managed to do the csv manipulation and set it up as a function that takes a ticker as input and does all the data modification. But I'm stuck at the file listing stage: I can't make the pattern passed to list.files depend on the ticker being processed.
Below is the function I've tried, which doesn't work:
csvlist2zoo <- function(symbol){
  csvlist=list.files(path = "D:/dataset/",pattern=paste("'.*?",symbol,".csv'",sep=""),full.names=T)
}
This works with a hard-coded ticker, but I can't make it work with the function argument:
csvlist2zoo <- function(symbol){
  csvlist=list.files(path = "D:/dataset/",pattern='.*?ibm.csv',full.names=T)
}
I searched SO and found similar questions, but none exactly meets my requirement. If I missed something, please point me in the right direction. I'm still fighting with regex.
OS: Win8 64bit, R version-3.1.0 (if needed)
Try:
csvlist2zoo <- function(symbol){
  # escape the dot and anchor the end so only names ending in "<symbol>.csv" match
  list.files(pattern=paste0('\\d{4}-\\d{2}-\\d{2}',symbol, "\\.csv$"))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
csvlist2zoo("GS")
#[1] "2005-05-18GS.csv"
I created some files in the working directory (linux)
v1 <- c("2001-05-17MSFT.csv", "2005-05-18GS.csv", "2002-12-19QQQ.csv", "2008-01-25QQQ.csv")
lapply(v1, function(x) write.csv(1:3, file=x))
Update
Using paste
csvlist2zoo <- function(symbol){
  list.files(pattern=paste('\\d{4}-\\d{2}-\\d{2}',symbol, "\\.csv$", sep=""))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
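For comparison, the same "build the pattern from the argument" technique can be sketched in Python. This is an illustration rather than part of the original answer: the function names ticker_pattern, filter_ticker_files and list_ticker_files are made up, and the file names are assumed to follow the YYYY-MM-DD<TICK>.csv convention from the question. The ticker is escaped so that any regex metacharacters in it are matched literally.

```python
import os
import re


def ticker_pattern(symbol):
    """Compile a regex for names of the form YYYY-MM-DD<symbol>.csv."""
    return re.compile(r"^\d{4}-\d{2}-\d{2}" + re.escape(symbol) + r"\.csv$")


def filter_ticker_files(names, symbol):
    """Return the sorted names that belong to the given ticker."""
    pattern = ticker_pattern(symbol)
    return sorted(name for name in names if pattern.match(name))


def list_ticker_files(directory, symbol):
    """Like filter_ticker_files, but reads the names from a directory
    and returns full paths."""
    names = filter_ticker_files(os.listdir(directory), symbol)
    return [os.path.join(directory, name) for name in names]


names = ["2001-05-17MSFT.csv", "2005-05-18GS.csv",
         "2002-12-19QQQ.csv", "2008-01-25QQQ.csv"]
print(filter_ticker_files(names, "QQQ"))
```

Anchoring the pattern at both ends means a short ticker such as "Q" does not accidentally match the QQQ files.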

CSV File line data to dictionary input formatting

PYTHON - I would like to read the elements of each line of a csv file into a dictionary. For example, 1,3,4 is the first line of the CSV file, and I would like to make a dictionary of tuples where the first two numbers (1,3) are the key and the value is the third number (4).
So the output looks like this,
{('1','3') : 4}
What's the difficulty you face? Please post the code you tried. The following should do the trick:
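d = {}
with open("csv", "r") as fp:
    for line in fp:
        # split on commas so that multi-digit values work too
        first, second, third = line.strip().split(",")
        d[(first, second)] = third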
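A slightly more robust variant can be sketched with Python's csv module, which takes care of splitting the fields. The function name rows_to_dict is made up for illustration, and the sketch assumes that every useful line has at least three fields; shorter and empty lines are skipped. Keys and values stay strings, so the value for the line 1,3,4 is '4', not the integer 4.

```python
import csv


def rows_to_dict(lines):
    """Map (first, second) -> third for each CSV line."""
    result = {}
    # csv.reader accepts any iterable of strings, including an open file.
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip empty or incomplete lines
        result[(row[0], row[1])] = row[2]
    return result


print(rows_to_dict(["1,3,4\n", "10,20,30\n"]))
```

To read from a file instead, pass the open file object, e.g. with open("csv", newline="") as fp: d = rows_to_dict(fp).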