I am using Pandoc to process some R Markdown files. These files include citations to works specified in a BibTeX (.bib) file. In the YAML header of the R Markdown file, I specify the path to this file:
bibliography: 'c:/myDir/myRefs.bib'
None of that is problematic. But the .bib file contains many entries that Pandoc won't process: specifically, entries in which field names begin with an asterisk. For example:
@ARTICLE{Smith_Hello_2021,
  AUTHOR = {John Smith},
  TITLE = {Some Title},
  JOURNAL = {Some Journal},
  YEAR = {2021},
  volume = {1},
  number = {1},
  pages = {1-2},
  *month = {},
}
The problem is the *month field. I often add an asterisk to the start of field names when I don't want them to be processed by BibTeX; I have hundreds of .bib entries like this. When Pandoc comes across such an entry, it gives me this error message:
Error reading bibliography file c:/myDir/myRefs.bib:
(line 54, column 3):
unexpected "*"
expecting space, white space or "}"
Error: pandoc document conversion failed with error 25
Execution halted
Is there any workaround, short of removing the asterisks from my .bib file?
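One workaround, short of editing myRefs.bib itself, is to preprocess the bibliography and point Pandoc at a cleaned copy. A minimal sketch (the myRefs_clean.bib name is just an illustration, and it assumes each starred field sits on its own line, as in the example entry):

import re

# Write a cleaned copy of the .bib file, dropping every field whose
# name starts with "*" (e.g. "*month = {},"); the original is untouched.
with open('c:/myDir/myRefs.bib') as src, \
     open('c:/myDir/myRefs_clean.bib', 'w') as dst:
    for line in src:
        if re.match(r'\s*\*\w+\s*=', line):  # a starred field line
            continue
        dst.write(line)

Then the YAML header becomes bibliography: 'c:/myDir/myRefs_clean.bib'.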
I'm trying to iterate over multiple rasters (500+) in a for loop, but I'm facing some problems.
First I want to reproject them from CRS EPSG:4326 to CRS EPSG:32614, then resample them using a mask raster that has a smaller resolution and extent, and finally write a result raster for each input raster to the working directory. But I've been getting the following error message regarding the CRS argument:
Error in CRS(x) : PROJ4 argument-value pairs must begin with +: E:\Proyecto PM2.5\2_PM_2.5_Processing\Test/AOD_MOD_CDTDB_April_2016.tif
I took a look at multiple posts here, but I couldn't get past this problem. Below is my code; any help would be really appreciated by this R beginner.
# find all tifs in your directory
dir <- "E:\\Proyecto PM2.5\\2_PM_2.5_Processing\\Test"
# get a list of all files with .tif in the name in the directory
files <- list.files(path=dir, pattern='.tif', full.names=TRUE)
# raster with the expected characteristics: extent, cell size, number of pixels
r_ref <- raster("E:\\Proyecto PM2.5\\3_PM_2.5_Entrega\\temporal\\Raster_C.tif")
for (file in files){
  name <- file
  projectRaster(name, crs="+init=epsg:32614")
  resample(file, r_ref, method="ngb")
  savename <- sub("ZMVM", name, basename(file))
  writeRaster(r, file=savename,)
}
You do
for (file in files){
  name <- file
  projectRaster(name, crs="+init=epsg:32614")
So name is the same as file (why make a copy?): a filename.
You are asking projectRaster to project a character string (a file name). What you surely intended is something like this:
for (file in files){
  r <- raster(file)
  r <- projectRaster(r, crs="+init=epsg:32614")
  # ... then call resample() and writeRaster() on r, the RasterLayer, not on the filename
}
Input: I have more than 100 samples. Each sample has two files: one with a *.column extension and one with a *.datatypes extension.
The *.column file holds the column names, and the *.datatypes file holds the corresponding datatype descriptions.
What I need is an output file for each sample, containing the column names along with their datatypes.
Currently I am getting the data from all 100 samples merged and saved into one file.
Eg: file_1:
column names datatypes
id int
name string
Eg: file_2:
column names datatypes
id int
name string
I got the output with the column names and datatypes of all files merged into one single file.
What I need is a separate merged file for each sample.
for name in os.listdir("C:\Python27"):
    if name.endswith(".column"):
        for file in name:
            file = os.path.join(name)
            joined = file + ".joined"
            with open(joined, "w") as fout:
                filenames = glob.glob('*.column')
                for filename in filenames:
                    with open(filename) as f1:
                        file_names = glob.glob('*.datatypes')
                        for filename in file_names:
                            with open(filename) as f2:
                                for line1, line2 in zip(f1, f2):
                                    x = ("{0} {1} \n".format(line1.rstrip(), line2.rstrip()))
                                    y = x.strip()
                                    fout.write(y.strip() + ',\n')
Please assist me.
Hopefully the code below will work. This is on the understanding that each *.column file has a corresponding *.datatypes file; if not, the code will throw a file-not-found error.
import os

for colname in os.listdir("C:\Python27"):
    if colname.endswith(".column"):
        print('Processing:' + colname)
        file = os.path.splitext(colname)[0]
        joined = file + ".joined"
        with open(joined, "w") as fout:
            with open(colname) as f1:
                datname = file + '.datatypes'
                with open(datname) as f2:
                    for line1, line2 in zip(f1, f2):
                        x = ("{0} {1}".format(line1.rstrip(), line2.rstrip()))
                        y = x.strip()
                        fout.write(y.strip() + ',\n')
        print('Finished writing to :' + joined)
I test-ran this with a few sample input files, as below.
file1.column
date_sev
pos
file1.datatypes
timestamp
date
file2.column
id
name
file2.datatypes
int
string
file3.column
id
name
file3.datatypes
int
string
When I run the script I get the output below in the console:
Processing:file1.column
Finished writing to :file1.joined
Processing:file2.column
Finished writing to :file2.joined
Processing:file3.column
Finished writing to :file3.joined
And the output files I get are:
file1.joined
date_sev timestamp,
pos date,
file2.joined
id int,
name string,
file3.joined
id int,
name string,
Also, if you want to improve the output syntax of the files, I would make the changes below.
From
x = ("{0} {1}".format(line1.rstrip(),line2.rstrip()))
To
x = ("{0},{1}".format(line1.rstrip(),line2.rstrip()))
From
fout.write(y.strip() + ',\n')
To
fout.write(y.strip() + '\n')
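With both changes applied, file1.joined would come out as:
date_sev,timestamp
pos,date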
In my original solution posted above, I left the formatting as it was in your initial version.
I have not done any programming in about 12 years and have been asked by one of my colleagues to help with what is apparently a basic Python 2.7 script. My question is very similar to what this person asked (though it has not been answered):
Python - Batch combine Multiple large CSV, filter data, skip header, appending vertically into a single CSV
I need to prompt the user for the folder path, read in each file from that folder (there are hundreds of CSV files), process it, and then write the processed output of every file into a single CSV file, with each file's output preceded by the filename it was read from and separated by a blank line.
It would result in something like this:
CHEM_0_5
etc etc
etc etc
etc etc
LAW_4_1
etc etc
etc etc
LAW_7_3
etc etc
etc etc
Currently the script has to be edited with the name of the file it has to read, saved, and then run. Then the contents of the output file has to be manually copied into a new csv file. It is very tedious and time consuming.
This is what I currently have. Please note I have removed some of the processing from the example.
import time
import datetime

x = 0
stamp = 0
compare = 1
values = []

## INSERT NAME OF FILE YOU WANT TO CLEAN
g = open('CHEM_0_5.csv','r')
for line in g:
    lis = [line.split() for line in g]
lis.pop(0)
lis.pop(0)
timestamps = []
results = []
x = 0
for i in cl:  # 'cl' and 'ts' are built by the processing removed from this example
    ## INSERT WHAT YOU WANT TO SAVE THE FILE AS
    fd = open('new.csv','a')
    fd.write(str(ts[x]) + "," + str(i) + "\n")
    fd.close()
    x = x + 1
g.close()
I have been trying to re-learn python in the process of searching for answers but given that I don't really know what I'm doing I feel that this could be something to do after I've completed the task for my colleague.
Thank you for taking the time to read my submission!
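A minimal sketch of the batch version (Python 2.7): the folder prompt, the combined_output.csv name, and the placeholder processing step are assumptions, since the real per-file processing was removed from the question.

import glob
import os

folder = raw_input("Enter folder path: ")
out_path = os.path.join(folder, "combined_output.csv")

with open(out_path, "w") as out:
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        if os.path.abspath(path) == os.path.abspath(out_path):
            continue  # don't re-read our own output file
        name = os.path.splitext(os.path.basename(path))[0]
        out.write(name + "\n")  # filename header, e.g. CHEM_0_5
        with open(path, "r") as f:
            for line in f:
                # ... per-file cleaning goes here ...
                out.write(line)
        out.write("\n")  # blank line between files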
I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a unix system. The converter txt2dbf is used to convert shapefile attribute tables (text file) to a .dbf.
Unfortunately, something has gone wrong (probably a separator/field-length error), because there are two problems with the output .dbf files, as follows:
1. Some fields of the dbf contain data that should not be there. This data has "spilled over" from neighbouring fields.
2. An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using dbview unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
1. Extract the information from the fields that do exist into new fields using dbf.add_fields and dbf.write (Python package dbf), then delete the old incorrect fields using dbf.delete_fields.
2. Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
    db.add_fields("TRAJNUMc C(4)")  # create new fields
    db.add_fields("YYYYMMDDc C(8)")
    db.add_fields("TIMEc C(4)")
    for record in db:  # extract data from fields
        dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
        dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
        dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
    db.delete_fields('Trajnum')  # delete the incorrect fields
    db.delete_fields('Yyyymmdd')
    db.delete_fields('Time')
    db.delete_fields('1000 201')  # delete the unwanted field
    db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problem with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information I need? (Frustrating, because I can see that it exists.)
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to (1) extract the required data from the bad files to a text file, and then (2) write it to a new dbf. (See Ethan Furman's comments/answer below.)
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf

table = dbf.Table('/path/to/scramble/table.dbf')

with table:
    fixed_data = []
    for record in table:
        # convert to str/bytes while skipping the delete flag
        data = record._data[1:].tostring()
        trajnum = data[:4]
        ymd = data[4:12]
        time = data[12:16]
        level = data[16:].strip()
        fixed_data.append([trajnum, ymd, time, level])

with open('repaired_data.txt', 'w') as new_file:
    for line in fixed_data:
        new_file.write(','.join(line) + '\n')
Assuming all your data files look like your sample (the big IF being that the data has no embedded commas), this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')

final_table = dbf.Table(
        'dest_table.dbf',
        'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
        )

with final_table:
    for line in raw_data:
        if not line:  # skip any trailing blank line
            continue
        fields = line.split(',')
        final_table.append(tuple(fields))
# table has been populated and closed
Of course, you could get fancier and use actual date and number fields if you want to:
# the dbf field-spec string becomes
'trajnum N; yyyymmdd D; time C(4); level N'

# and the appending loop becomes
for line in raw_data:
    trajnum, ymd, time, level = line.split(',')
    trajnum = int(trajnum)
    ymd = dbf.Date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))
    level = int(level)
    final_table.append((trajnum, ymd, time, level))
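To chain the two steps on one of the faulty files, the repaired text produced by the first snippet can feed straight into the rebuild loop (a sketch using the filenames from above):

# the output of the repair step becomes the input of the rebuild step
raw_data = open('repaired_data.txt').read().split('\n')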
I got a list of around 5000 genes as a search result from Gene Expression Atlas. From the result page I can download all the results in a file. That file contains a gene identifier (Ensembl Gene ID) for each gene. I now want the corresponding EMBL-Bank ID for each Ensembl Gene ID, so that I can download their nucleotide sequences in batch from Dbfetch.
Does anyone know how we can achieve that?
Can we use Biopython to achieve that?
The file you can download is in a custom tab-delimited format (which none of Biopython's parsers are equipped to handle).
Instead, you can just use the csv module to extract what you'd like:
import csv

with open("listd1.tab") as tab_file:
    data_lines = (line for line in tab_file if not line.startswith("#"))
    csv_data = csv.reader(data_lines, delimiter="\t")
    header = csv_data.next()  # ['Gene name', 'Gene identifier', ...]
    gene_id_index = header.index("Gene identifier")
    for line in csv_data:
        gene_id = line[gene_id_index]  # Do whatever you'd like with this
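For instance, to gather every identifier for a batch Dbfetch request (a sketch; the gene_ids.txt name is just an illustration), the final loop could collect the IDs instead of discarding them:

# collect all identifiers; Dbfetch accepts comma-separated ID lists
gene_ids = [row[gene_id_index] for row in csv_data]
with open("gene_ids.txt", "w") as out:
    out.write(",".join(gene_ids))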