Line-breaking issue when moving a CSV file to Linux - regex

I have moved a CSV file onto a Linux system in binary mode. The content of one field, the comments section, is split across multiple lines. I need to remove the newlines but keep the same format. Please help with a shell or Perl command.
Here is an example of the original content of the field. After the move to Linux, the comments field is split into 4 lines; I want to keep the comment field in the same format but without the newline characters:
"First line
Second line
Third line
all lines format should not change"

As I said in my comment above, the specs are not clear, but I suspect this is what you are trying to do. Here's a way to load data into Oracle using sqlldr where a field is surrounded by double quotes and contains linefeeds, and where the end of the record is a carriage-return/linefeed combination. This can happen when the data comes from an Excel spreadsheet saved as a .csv, for example, where a cell contains linefeeds.
Here's the data file as exported by Excel as a .csv and viewed in gvim, with the option turned on to show control characters. You can see the linefeeds as the '$' character and the carriage returns as the '^M' character:
100,test1,"1line1$
1line2$
1line3"^M$
200,test2,"2line1$
2line2$
2line3"^M$
Construct the control file like this, using the "str" clause on the infile option line to set the end-of-record character. It tells sqlldr that hex 0D (carriage return, or ^M) is the record separator, so it will ignore the linefeeds inside the double quotes:
LOAD DATA
infile "test.dat" "str x'0D'"
TRUNCATE
INTO TABLE test
fields terminated by ","
optionally enclosed by '"'
(
cola char,
colb char,
colc char
)
After loading, the data looks like this with linefeeds in the comment field (I called it colc) preserved:
SQL> select *
2 from test;
COLA                 COLB                 COLC
-------------------- -------------------- --------------------
100                  test1                1line1
                                          1line2
                                          1line3
200                  test2                2line1
                                          2line2
                                          2line3
SQL>
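Since the original question asked for a shell or Perl command, here is a rough alternative sketch in Python for when loading through sqlldr isn't an option (the file names are placeholders): the csv module parses the quoted fields, including the embedded linefeeds, and each field's newlines can then be collapsed before writing the file back out.
import csv

# Sketch: csv.reader understands linefeeds inside quoted fields, so each
# logical record comes back as one row; joining on splitlines() collapses
# any embedded line breaks. "in.csv" and "out.csv" are placeholder names.
with open("in.csv", newline="") as src, open("out.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([" ".join(field.splitlines()) for field in row])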

Related

Remove newline after incorrect field splitting in csv file

I use Linux and I'm trying to use sed for this. I download a CSV from an institutional site providing some data to be analyzed. There are several thousand lines per CSV, and many columns per row (I haven't counted them, but the exact number shouldn't matter). The fields are separated by semicolons and quoted, so the format per line is:
"Field 1";"Field 2";"Field 3"; .... ;"Field X";
Each correct line ends with a semicolon and '\n'. The problem is that, from time to time, some field incorrectly contains a newline, and the solution is to delete the newline character so the two lines are joined back into one. Example of an incorrect line:
"Field 1";"Field 2";"Fi
eld 3";"Field X";
I've found that there can be a \n right after the opening quote or somewhere between the quotes.
I've found a way to manage this last case, where the newline is right after the quote:
sed ':a;N;$!ba;s/";"\n/";"/g' file.csv
but not for "any number of alphabet characters after the quote not ending in semicolon". I have a pattern file (to be used with -f) with these lines:
:a;N;$!ba;s/";"\n/";"/g
:a;N;$!ba;s/\([A-z]\)\n/\1/g
:a;N;$!ba;s/\([:alpha:]\)\n/\1/g
The first line of the pattern file works, but I've tried combinations of the second and third and I always get an empty file.
If the current line doesn't end with a semicolon, read and append the next line to the pattern space and remove the line break; the :a/ba loop repeats the test so fields broken across more than two lines are joined as well. (Incidentally, your third pattern fails because a POSIX class must sit inside a bracket expression, i.e. [[:alpha:]] rather than [:alpha:].)
sed ':a;/[^;]$/{N;s/\n//;ba}' file
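For comparison, here is the same join-until-semicolon idea as a minimal Python sketch (file name taken from the question):
# Keep appending physical lines to a buffer until the record ends with
# a semicolon, which every correct line in this format does.
with open("file.csv") as src:
    buf = ""
    for line in src:
        buf += line.rstrip("\n")
        if buf.endswith(";"):
            print(buf)  # record is complete
            buf = ""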

Adding a space within a line in a file with a specific pattern

I have a file with some data as follows:
795 0.16254624E+01-0.40318151E-03 0.45064186E+04
I want to add a space before the third number using search and replace as
795 0.16254624E+01 -0.40318151E-03 0.45064186E+04
The regular expression for the search is \d-\d. But what should I write in the replace, so that I get the output above? I have over 4000 similar lines and cannot do it manually. Also, can I do it in Python, if possible?
Perhaps you could use re.findall to get your matches and then join them with a single space, returning a string with your values separated by whitespace.
[+-]?\d+(?:\.\d+E[+-]\d+)?\b
import re

# one optionally signed number, either an integer or scientific notation
regex = r"[+-]?\d+(?:\.\d+E[+-]\d+)?\b"
test_str = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
matches = re.findall(regex, test_str)
print(" ".join(matches))  # rejoin the numbers with single spaces
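Since the question literally asks what to write in the replace, a substitution with capture groups also works. A small sketch, assuming (as in the sample) that a minus sign glued directly onto a preceding digit always marks a field boundary:
import re

line = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
# Insert a space between a digit and a following minus sign that starts
# a number; exponent signs such as E-03 are untouched, since 'E' is not a digit.
print(re.sub(r"(\d)(-\d)", r"\1 \2", line))
# -> 795 0.16254624E+01 -0.40318151E-03 0.45064186E+04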
You could do it very easily in MS Excel:
copy the content of your file into a new Excel sheet, in one column
select the complete column and, from the Data ribbon, select Text to Columns
a wizard dialog will appear; select Fixed width, then Next
click on the position where you want to add the new space, to tell Excel to split the text into a new column there, and click Next
select each column header, and under column data format choose Text to keep all formatting, then click Finish
you can then copy the new columns or export them to a new text file

Hadoop process file with different field delimiters

What are the options for processing a text file that has different field delimiters in the same file and a row delimiter other than a newline?
Some fields in the file can be fixed length and some can be separated by a character.
Example:
100 xyz |abc#hello#200 xyz1 |abc1#world
In this example, 100 is the first field value, xyz the second, abc the third, and hello the fourth. | and # are the delimiters for the 3rd and 4th fields, and records are likewise separated by #.
Any of Map reduce or pig or hive solution is fine.
One option may be a MapReduce job configured with a custom row delimiter, reading the entire record and processing it. But does any InputFormat accept a custom delimiter?
You can override the record delimiter and set it to #. After that, load each record as a line, then replace the '|' and '#' characters with spaces. All the fields will then be separated by ' '; use STRSPLIT to get the individual fields.
SET textinputformat.record.delimiter '#';
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,'\\|',' '),'#',' ') AS line; -- Note: '\\|' because '|' is a regex metacharacter
C = FOREACH B GENERATE STRSPLIT(line,' ',4);
DUMP C;
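To make the record structure concrete, here is an illustrative Python sketch of the same parsing idea, assuming (as the example suggests) that every second '#' ends a record and that the fixed-length fields are space-padded:
line = "100 xyz |abc#hello#200 xyz1 |abc1#world"
chunks = line.split("#")                # what splitting on '#' yields
for head, fourth in zip(chunks[0::2], chunks[1::2]):
    fixed, third = head.split("|")      # '|' introduces the third field
    first, second = fixed.split()       # the fixed-width leading fields
    print(first, second, third, fourth)
# -> 100 xyz abc hello
# -> 200 xyz1 abc1 world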
You could try Hive with RegexSerDe.

Reading files which contain unquoted newline characters in text fields using R

I am trying to read a large table into R but one of the text fields occasionally contains one or more unquoted, un-escaped newline characters (\n), thus the read.table() function is not able to easily import this file. The file is pipe delimited and the text fields are not quoted.
I can read it in if I pass the argument fill=T to read.table() but, of course, rows with newline characters in a text field are corrupted by this.
I have successfully been able to use f <- readChar(fname, nchars=file.info(fname)["size"], TRUE) to read sub-segments of the file, then use gsub() to search and destroy the offending newline characters. (see code below) However, the full file is > 100mb, so gsub() does little more than turn my laptop into a hand-warmer (it's still trying to gsub all the newline characters as I write this).
Anyone have any suggestions for how to efficiently read in a file like this?
It seems like there should be some way of telling R to expect a certain number of delimiters before expecting a newline, but I haven't been able to find any way to do this in the documentation.
Sorry, this seems like it should be easy, but it's really been stumping me, and I have not been able to find anything in stackoverflow or google offering a solution.
Here is the code I've tried so far:
attempt 1:
fdat = read.table(file=fname,
allowEscapes=F,
stringsAsFactors=F,
quote="",
fill=T,
strip.white=T,
comment.char="",
header=T,
sep="|")
attempt 2:
f <- readChar(fname, nchars=file.info(fname)["size"], TRUE)
f2 = gsub(pattern="\n(?!NCT)",replacement=" ",x=f, perl=T)
fdat = read.table(text=f2,
allowEscapes=F,
stringsAsFactors=F,
quote="",
fill=F,
strip.white=T,
comment.char="",
header=T,
sep="|")
Here are a few lines from the file:
NCT_ID|DOWNLOAD_DATE|DOWNLOAD_DATE_DT|ORG_STUDY_ID|BRIEF_TITLE|OFFICIAL_TITLE|ACRONYM|SOURCE|HAS_DMC|OVERALL_STATUS|START_DATE|COMPLETION_DATE|COMPLETION_DATE_TYPE|PRIMARY_COMPLETION_DATE|PRIMARY_COMPLETION_DATE_TYPE|PHASE|STUDY_TYPE|STUDY_DESIGN|NUMBER_OF_ARMS|NUMBER_OF_GROUPS|ENROLLMENT_TYPE|ENROLLMENT|BIOSPEC_RETENTION|BIOSPEC_DESCR|GENDER|MINIMUM_AGE|MAXIMUM_AGE|HEALTHY_VOLUNTEERS|SAMPLING_METHOD|STUDY_POP|VERIFICATION_DATE|LASTCHANGED_DATE|FIRSTRECEIVED_DATE|IS_SECTION_801|IS_FDA_REGULATED|WHY_STOPPED|HAS_EXPANDED_ACCESS|FIRSTRECEIVED_RESULTS_DATE|URL|TARGET_DURATION|STUDY_RANK
NCT00000105|Information obtained from ClinicalTrials.gov on September 25, 2012|9/25/2012|2002LS032|Vaccination With Tetanus and KLH to Assess Immune Responses.|Vaccination With Tetanus Toxoid and Keyhole Limpet Hemocyanin (KLH) to Assess Antigen-Specific Immune Responses||Masonic Cancer Center, University of Minnesota|Yes|Terminated|July 2002|March 2012|Actual|March 2012|Actual|N/A|Observational|Observational Model: Case Control, Time Perspective: Prospective||3|Actual|112|Samples With DNA|analysis of blood samples before and 4 weeks postvaccination|Both|18 Years|N/A|Accepts Healthy Volunteers|Probability Sample|- Normal volunteers
- Patients with Cancer (breast, melanoma, hematologic)
- Transplant patients (umbilical cord blood transplant, autologous transplant)
- Patients receiving other cancer vaccines|March 2012|March 26, 2012|November 3, 1999|Yes|Yes|Replaced by another study.|No||http://clinicaltrials.gov/show/NCT00000105||6670
NCT00000106|Information obtained from ClinicalTrials.gov on September 25, 2012|9/25/2012|NCRR-M01RR03186-9943|41.8 Degree Centigrade Whole Body Hyperthermia for the Treatment of Rheumatoid Diseases|||National Center for Research Resources (NCRR)||Active, not recruiting||||||N/A|Interventional|Allocation: Randomized, Intervention Model: Parallel Assignment, Primary Purpose: Treatment|||||||Both|18 Years|65 Years|No|||November 2000|June 23, 2005|January 18, 2000||||No||http://clinicaltrials.gov/show/NCT00000106||7998
As can be seen, these sample lines from my problem file include the header (line 1), a problematic line (line 2), and a non-problematic line (line 3). Each non-header line starts with NCT and ends with \n (this is leveraged in gsub's regular expression).
Any suggestions are much appreciated.
It seems there is no way to solve this using read.table. Sadly, it doesn't let you change the "record separator" the way awk can, for example.
Your attempt 2 failed because the DOS-format newline is \r\n (0x0d 0x0a) and only the \n is matched by your gsub. Say you have the following file:
NCTa|b|c
NCT1|how
are
you?|well
NCT2|are
you
sure?|yes
Then look at the output of your second command:
f2 <- gsub(pattern="\n(?!NCT)",replacement=" ",x=f, perl=TRUE)
f2
# [1] "NCTa|b|c\r\nNCT1|how\r are\r you?|well\r\nNCT2|are\r you\r sure?|yes\r "
So you have to remove \r too. Just fix it to:
f2 <- gsub(pattern="\r?\n(?!NCT)",replacement=" ",x=f, perl=TRUE)
And it will work.
Regarding performance, you could readChar the file in smaller chunks in a loop, gsub each chunk, and write the results back to a file, then read.table it. Just an idea.
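As one way to realize that idea without holding the whole file in memory, here is a minimal streaming sketch in Python (hypothetical file names; it relies on the fact, noted in the question, that every genuine record starts with NCT):
# Join every physical line that does not start with "NCT" onto the
# previous record, writing completed records as we go.
def fix_records(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        record = None
        for line in src:
            line = line.rstrip("\r\n")      # drop the \n and any DOS \r
            if record is None:
                record = line               # header line
            elif line.startswith("NCT"):
                dst.write(record + "\n")    # previous record is complete
                record = line
            else:
                record += " " + line        # continuation of a broken field
        if record is not None:
            dst.write(record + "\n")
The cleaned file can then be read normally with read.table and sep="|".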

Stata: removing line feed control characters

I have a dataset which I export with the outsheet command into a csv file. Some rows break the line at a certain place. Using a hexadecimal editor I could see the control character for line feed, "0a", in the record. The value of the variable producing the line break visually shows only 5 characters (in Stata). But if I count the number of characters:
gen xlen = length(x)
I get 6. I could write a Perl program to get rid of this problem, but I would prefer to remove the control characters in Stata before exporting (for example using regexr()). Does anyone have an idea how to remove the control characters?
The char() function calls up particular ASCII characters, so you can delete such characters by replacing them with empty strings:
replace x = subinstr(x, char(10), "", .)
If a stray carriage return ever travels with the line feed, the same call with char(13) removes it as well.