I have a bunch of Excel files in my directory. Is there a way to read all of them in separately (not appended to one another) with a single command? For example:
I have 3 files in my folder
File1.xlsx
File2.xlsx
File3.xlsx
Expected output in R (instead of reading them in one at a time):
File1
## should have File1 contents
File2
## should have File2 contents
File3
## should have File3 contents
The function assign should help you - it assigns a value to a name given as a string. So the following code should do what you want (hint: use gsub to clean non-word characters to make a valid variable name):
library(readxl)
for (file in list.files(".", pattern = "\\.xlsx?$", full.names = TRUE))
  assign(gsub("\\W", "", file), read_excel(file))
With two sample files, file_1.xls and file_2.xls, in the working directory, we observe two objects in our workspace:
> ls()
[1] "file_1xls" "file_2xls"
Now let's see what they are:
> for(x in ls()) str(get(x))
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables:
$ a: num 1
$ b: num 2
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables:
$ c: num 4
$ d: num 5
The code below:
loads the readxl library
stores the names of the .xlsx files in the current directory in a variable
reads those files into a list of data frames with lapply()
library(readxl)
file.list <- list.files(pattern='\\.xlsx$')
df.list <- lapply(file.list, read_excel)
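If each file's contents should remain separately addressable, as the question asks, you can also name the list elements after the files. This is a small optional addition (names<- on a list is base R; the bracketed lookup below just reuses the question's file names as an example):
names(df.list) <- file.list
df.list[["File1.xlsx"]]   # contents of File1.xlsx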
cat practice.txt
test_0909_3434 test_8838 test_case_5656_5433 case_4333_3211 note_4433_2212
The practice.txt file contains a list of further file names.
Required output:
test_0909_3434 test_8838
These test files contain some data, and I need to merge the data from these two files into one final file.
The test_0909_3434 file contains:
id name
1 hh
2 ii
The test_8838 file contains:
id name
2 ii
3 gg
4 kk
The final output file, mergedfile.txt, will look like this:
id name
1 hh
2 ii
3 gg
4 kk
Duplicate rows also need to be removed, as in the mergedfile.txt shown above.
1) simplistic (data is "in-order" and "well-formatted" in both input files):
cat f1 f2 | sort -u > f3
2) More complex (data not "in-order" and not "well-formatted"): use a regex.
Read records from both input files; assume the current input record is in 'x'. First declare aa, a Bash associative array (available in Bash >= 4.x):
declare -A aa=()
Then capture the numeric key and the rest of each record, and store the record under its key so that duplicates collapse:
if [[ "${x}" =~ ^[[:space:]]*([[:digit:]]+)[[:space:]]+(.*)$ ]]; then
    d="${BASH_REMATCH[1]}"
    s="${BASH_REMATCH[2]}"
    echo "d == $d, s == $s"   # debug output
    aa["${d}"]="${s}"         # store the record under its key
fi
This assumes the first field is an integer (and the key). You can process accordingly, depending on whether or not the key must be unique.
If it's any more complex than this, consider using Perl.
I am writing a script to extract data from a file and split it into multiple files; the contents for each file are delimited by five "#" characters.
Example:
#####
hello
#####
world
#####
In this case, "hello" should end up in one file and "world" in another.
I am using Python.
If I understand your requirements correctly, you want to be able to take input from a file with a delimiter of #####
#####
hello
#####
world
#####
and this should generate a separate file for each block between the delimiters: one containing hello and one containing world.
You can use re.split to get the splits:
splits = re.split(r"[#]{5}\n", input_buffer)
which would give something like the following (note: the pattern above also consumes the newline after each delimiter):
['', 'hello\n', 'world\n', '']
and to keep only the splits with actual text (assuming trailing newlines should also be removed):
[i.strip() for i in splits if i]
The output filename was not specified, so I used
for index, val in enumerate([i.strip() for i in splits if i]):
    with open("output%d" % index, "w+") as f:
to create files named output0 … outputN.
import re
from io import StringIO

input_text = '''#####
hello
#####
world
#####
'''
string_file = StringIO(input_text)
input_buffer = string_file.read()
splits = re.split(r"[#]{5}\n", input_buffer)
for index, val in enumerate([i.strip() for i in splits if i]):
    with open("output%d" % index, "w+") as f:
        f.write(val)
This is just a helper; you can obviously split on a different regular expression, change the output names to something more suitable, etc.
Also, if, as the title of this question says, the text sits between [- and -], the splits could be obtained using re.findall instead:
input_text = '''[-hello-]
[-world-]
'''
string_file = StringIO(input_text)
input_buffer = string_file.read()
splits = re.findall(r"\[-(.*)-\]", input_buffer)
for index, val in enumerate(splits):
    with open("output%d" % index, "w+") as f:
        f.write(val)
This could do the trick:
with open('a.txt') as r:                       # open the source file
    r = r.read().split('#####')                # read the contents and split on '#####'
new = [item.strip() for item in r if item]     # drop empty chunks and strip whitespace
for i, item in enumerate(new):                 # number each chunk, starting at 0
    with open('a%s.txt' % (i + 1), 'w') as w:  # note the parentheses: % binds tighter than +
        w.write(item)                          # write the current chunk to its own file
This will read your file (which I called 'a.txt') and produce files named a1.txt, a2.txt, ..., an.txt.
I have a very large data frame:
'data.frame': 40525992 obs. of 14 variables:
$ INSTNM : Factor w/ 7050 levels "A W Healthcare Educators"
$ Total : Factor w/ 3212 levels "1","10","100",
$ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",
$ Count : num 0 0 0 0 0 0 0 0 0 0 ...
The Crime_Type column contains the type of crime and the year, so "MURD11" is murder in 2011. These are college campus crime statistics my kid is analyzing for her school project; I am helping when she is stuck, and I am currently stuck at creating a clean data file she can analyze.
Once I converted the wide file (all crime types in columns) to a long file using gather, the file size went from 300 MB to 8 GB, so the file I am now working on is 8 GB. Do you think that is the problem? How do I convert it to a data.table for faster processing?
What I want to do is split this Crime_Type column into two columns, Crime_Type and Year. The values mix letters and numbers, and some names contain special characters, like NEG_M, which is 'Negligent Manslaughter'.
We will replace the full names later, but can someone suggest how I separate
MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)
etc...
I have tried using,
df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")
The first one separates the Crime as it looks for numbers. The second one does not work at all.
I also tried
df$Crime_Type<- apply (strsplit(as.character(df$Crime_Type), split="[:digit:]"))
That does not work at all. I have gone through many posts on Stack Overflow, and that's where I got these commands, but I am now truly stuck and would appreciate your help.
Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.
library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
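For example, with a toy data frame mirroring the question's values (hypothetical sample data, not part of the original answer):
df <- data.frame(Crime_Type = c('MURD11', 'NEG_M11'))
extract(df, 'Crime_Type', into = c('Crime', 'Year'), regex = '^([^0-9]+)([0-9]+)$')
#   Crime Year
# 1  MURD   11
# 2 NEG_M   11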
In base R, one option is to create a unique delimiter between the non-numeric and numeric parts. We capture the non-numeric ([^0-9]+) and numeric ([0-9]+) characters as groups by wrapping them in parentheses, and in the replacement we use \\1 for the first capture group, followed by a comma, and then the second group (\\2). The result can then be passed to read.table with sep=',' to read it as two columns.
df1 <- read.table(text=gsub('([^0-9]+)([0-9]+)', '\\1,\\2',
totallong$Crime_Type),sep=",", col.names=c('Crime', 'Year'))
df1
# Crime Year
#1 MURD 11
#2 NEG_M 11
If we need, we can cbind with the original dataset
cbind(totallong, df1)
Or, in base R, we can use strsplit with split specifying the boundary between a non-number ((?<=[^0-9])) and a number ((?=[0-9])); here we use lookarounds to match that boundary. The output is a list, whose elements we can rbind with do.call(rbind, ...) and convert to a data.frame:
as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type),
split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
# V1 V2
#1 MURD 11
#2 NEG_M 11
Another option is tstrsplit from the devel version of data.table, i.e. v1.9.5+. It uses the same regex; in addition, there is an option (type.convert) to convert the output columns to appropriate classes.
library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type,
"(?<=[^0-9])(?=[0-9])", perl=TRUE, type.convert=TRUE)]
# Crime_Type Crime Year
#1: MURD11 MURD 11
#2: NEG_M11 NEG_M 11
If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL
totallong[, Crime_Type:= NULL]
NOTE: the devel version of data.table requires a separate installation; see the data.table project page for instructions.
A faster option is stri_extract_all from library(stringi), after collapsing the rows into a single string ('v2'). The alternating elements of 'v3' can then be separated by indexing with seq to build the new data.frame:
library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
Benchmarks
v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))
system.time({
v2 <- paste(test$v1, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
})
#user system elapsed
#56.019 1.709 57.838
data
totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))
I've got a number of files that contain gene expression data. In each file, the gene name is kept in a column "Gene_symbol" and the expression measure (a real number) is kept in a column "RPKM". The file name consists of an identifier followed by _ and the rest of the name (ends with "expression.txt"). I would like to load all of these files into R as data frames, for each data frame rename the column "RPKM" with the identifier of the original file and then join the data frames by "Gene_symbol" into one large data frame with one column "Gene_symbol" followed by all the columns with the expression measures from the individual files, each labeled with the original identifier.
I've managed to transfer the identifier of the original files to the names of the individual data frames as follows.
files <- list.files(pattern = "expression.txt$")
for (i in files) {
  var_name <- paste("Data", strsplit(i, "_")[[1]][1], sep = "_")
  assign(var_name, read.table(i, header = TRUE)[, c("Gene_symbol", "RPKM")])
}
So now I'm at a stage where I have dataframes as follows:
Data_id0001 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(2.43,5.24,6.53))
Data_id0002 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(4.53,1.07,2.44))
But I don't seem to be able to rename the RPKM column with the id000x bit (in a fully automated way, of course, looping through all the data frames I will generate in the real scenario).
I've tried to store the identifier bit as a comment with the data frames but seem to be unable to assign the comment from within a loop.
Any help would be appreciated,
mce
You should never work this way in R; always try to keep all your data frames in a list and operate over them with functions such as lapply. So, instead of using assign, just create an empty list of the same length as your files list and fill it within the for loop, as in the sketch below.
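A minimal sketch of that list-based pattern, reusing the files vector from the question:
files <- list.files(pattern = "expression.txt$")
temp <- vector("list", length(files))
for (i in seq_along(files)) {
  temp[[i]] <- read.table(files[i], header = TRUE)[, c("Gene_symbol", "RPKM")]
}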
For your current situation, we can fix it using an ls and mget combination to pull these data frames from the global environment into a list, and then change the columns of interest:
temp <- mget(ls(pattern = "Data_id\\d+$"))
lapply(names(temp), function(x) names(temp[[x]])[2] <<- gsub("Data_", "", x))
temp
#$Data_id0001
# Gene_symbol id0001
# 1 geneA 2.43
# 2 geneB 5.24
# 3 geneC 6.53
#
# $Data_id0002
# Gene_symbol id0002
# 1 geneA 4.53
# 2 geneB 1.07
# 3 geneC 2.44
You could eventually use list2env to get them back into the global environment, but you should use it with caution:
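For instance (list2env is base R; this simply writes the list elements back as individual objects):
list2env(temp, envir = .GlobalEnv)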
Thanks a lot for your suggestions! I think I get the point. The way I'm doing it now (see below) is hopefully a lot more R-like, and it works fine!
Cheers,
Maik
library(plyr)
files <- list.files(pattern = "expression.txt$")
temp <- list()
for (i in 1:length(files)) {
  temp[[i]] <- read.table(files[i], header = TRUE)[, c("Gene_symbol", "RPKM")]
}
for (i in 1:length(temp)) {
  temp[[i]] <- rename(temp[[i]], c("RPKM" = strsplit(files[i], "_")[[1]][1]))
}
combined_expression <- join_all(temp, by="Gene_symbol", type="full")
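With the two example data frames shown earlier (Data_id0001 and Data_id0002), combined_expression would come out as:
#   Gene_symbol id0001 id0002
# 1       geneA   2.43   4.53
# 2       geneB   5.24   1.07
# 3       geneC   6.53   2.44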
I have an interesting issue I am trying to solve, and I have taken a good stab at it but need a little help. I have a squishy file that contains some Lua code. I am trying to read this file and build a file path out of it. However, depending on where the file was generated, it may contain some information or it may be missing some. Here is an example of the squishy file I need to parse.
Module "foo1"
Module "foo2"
Module "common.command" "common/command.lua"
Module "common.common" "common/common.lua"
Module "common.diagnostics" "common/diagnostics.lua"
Here is the code I have written to read the file and search for the lines containing Module. You will see that there are three different sections or columns to this file. If you look at line 3 you will have "Module" for column1, "common.command" for column2 and "common/command.lua" for column3.
Taking Column3 as an example: if data exists in the third column, then I just need to strip the quotes and grab the data in Column3; in this case it would be common/command.lua. If there is no data in Column3, then I need to take the data from Column2, replace the period (.) with os.path.sep, and tack a .lua extension onto it. For example, a line with only common.common in Column2 would become common/common.lua.
squishyContent = []
if os.path.isfile(root + os.path.sep + "squishy"):
    self.Log("Parsing Squishy")
    with open(root + os.path.sep + "squishy") as squishyFile:
        lines = squishyFile.readlines()
    squishyFile.close()
    for line in lines:
        if line.startswith("Module "):
            path = line.replace('Module "', '').replace('"', '').replace("\n", '').replace(".", "/") + ".lua"
Just need some examples/help in getting through this.
This might sound silly, but the easiest approach is to convert everything you told us about your task to code.
for line in lines:
    # ignore lines that don't start with "Module "
    if not line.startswith('Module '):
        continue
    # There are 3 columns separated by blanks; strip the trailing newline
    # first so the quote-trimming below works, then split into columns.
    line = line.strip().split(' ')
    # if there are more than 2 columns, use the 3rd column's text (minus the quotes "")
    if len(line) > 2:
        line = line[2][1:-1]
    # otherwise, ...
    else:
        line = line[1]                          # use the 2nd column's text
        line = line[1:-1]                       # remove the quotes ""
        line = line.replace('.', os.path.sep)   # replace . with the path separator
        line += '.lua'                          # and add .lua
    print(line)  # prove it works
With a simple problem like this, it's easy to make the program do exactly what you yourself would do if you did the task manually.