Combining values from different files into one CSV file - regex

I have a couple of files containing a value in each line.
EDIT :
I figured out the answer to this question while in the midst of writing the post and didn't realize I had posted it by mistake in its incomplete state.
I was trying to do:
paste -d ',' file1 file2 file3 file4 > file5.csv
and was getting weird output. I later realized this was happening because some files ended each line with a carriage return plus a newline character while others had only the newline character. I've got to always remember to pay attention to those things.
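If you run into the same mixed line endings again, stripping the carriage returns first avoids the problem. A minimal sketch with tr (dos2unix does the same job where it's installed); the file names here are just placeholders:

```shell
# file1 has DOS (CRLF) line endings, file2 has plain LF endings
printf '1\r\n2\r\n3\r\n' > file1
printf 'a\nb\nc\n' > file2
tr -d '\r' < file1 > file1.unix          # delete every carriage return
paste -d ',' file1.unix file2 > file5.csv
```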

file 1:
1
2
3
file2:
2
4
6
paste --delimiters=\; file1 file2
Will yield:
1;2
2;4
3;6

I have a feeling you haven't finished typing your question yet, but I'll give it a shot still. ;)
file1:  file2:  file3:
1       a       A
2       b       B
3       c       C
~$ paste file{1,2,3} | sed 's/^\|$/"/g; s/\t/","/g'
"1","a","A"
"2","b","B"
"3","c","C"
Or,
~$ paste --delimiters=, file{1,2,3}
1,a,A
2,b,B
3,c,C

You probably need to clarify or retag your question, but as it stands, the answer is below.
joining two files under Linux
cat filetwo >> fileone

Also don't forget about the ever versatile LogParser if you're on Windows.
It can run SQL-like queries against flat text files to perform all sorts of merge operations.

The previous answers using LogParser or the command-line tools should work. If you want to do more complicated operations on the records, like filtering or joins, you could consider using an ETL tool (Pentaho, MapForce and Talend come to mind). These tools generally give you a graphical palette to define the relationships between data sources and any operations you want to perform on the rows.

Related

How can I combine multiple text files, remove duplicate lines and split the remaining lines into several files of certain length?

I have a lot of relatively small files with about 350.000 lines of text.
For example:
File 1:
asdf
wetwert
ddghr
vbnd
...
sdfre
File 2:
erye
yren
asdf
jkdt
...
uory
As you can see line 3 of file 2 is a duplicate of line 1 in file 1.
I want a program / Notepad++ Plugin that can check and remove these duplicates in multiple files.
The next problem I have is that I want all lists to be combined into large 1.000.000 line files.
So, for example, I have these files:
648563 lines
375924 lines
487036 lines
I want them to result in these files:
1.000.000 lines
511.523 lines
And the last 2 files must consist of only unique lines.
How can I possibly do this? Can I use some programs for this? Or a combination of multiple Notepad++ Plugins?
I know GSplit can split files of 1.536.243 into files of 1.000.000 and 536.243 lines, but that is not enough, and it doesn't remove duplicates.
I do want to create my own Notepad++ plugin or program if needed, but I have no idea how and where to start.
Thanks in advance.
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
To run the commands, you have to type (or paste) them in the terminal / console.
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
cat reads the files and echoes them, normally to the screen; the pipe | takes the output of the command on its left and feeds it to the command on its right.
sort obviously sorts them, and the switch -u tells it to remove duplicate lines.
The output is then piped to split, which the switch -l1000000 tells to split after 1000000 lines. The - (with spaces around it) tells it to read its input not from a file but from "standard input"; in this case, the output of sort -u. The last word, outfile_, can be changed by you, if you want.
Written like this, the command will result in files like outfile_aa, outfile_ab and so on; you can modify that with the last word of the command.
If you have all the files in one directory, and nothing else is in there, you can use * instead of listing all the files:
cat * | sort -u | split -l1000000 - outfile_
If the files might contain empty lines, you might want to remove them. Otherwise, they'll be sorted to the top and your first file will not have the full 1.000.000 values:
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_
This will also remove lines that consist only of whitespace.
grep filters input using regular expressions. -v inverts the filter; normally, grep keeps only lines that match. Now, it keeps only lines that don't match. ^\s*$ matches all lines that consist of nothing else than 0 or more characters of whitespace (like spaces or tabs).
If you need to do this regularly, you can write a script so you don't have to remember the details:
#!/bin/sh
cat * | sort -u | split -l1000000 - outfile_
Save this as a file (for example combine.sh, kept outside the data directory so that cat * doesn't pick up the script itself) and run it with
./combine.sh
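To sanity-check the pipeline before running it on real data, you can try it on toy files with a tiny split size (-l2 here instead of -l1000000):

```shell
# two small files with an overlapping line ("b")
printf 'a\nb\nc\n' > part1
printf 'b\nd\n' > part2
# dedupe, then split into 2-line chunks named outfile_aa, outfile_ab, ...
cat part1 part2 | sort -u | split -l2 - outfile_
```

The duplicate "b" appears only once in the output, and the four surviving lines land in two 2-line files.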

Grep pattern match between very large files is way too slow

I've spent way too much time on this and am looking for suggestions. I have two very large files (FASTQ files from an Illumina sequencing run, for those interested). What I need to do is match a pattern common between both files and print that line plus the 3 lines below it into two separate files, without duplications (which exist in the original files). Grep does this just fine, but the files are ~18GB and matching between them is ridiculously slow. An example of what I need to do is below.
FileA:
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
#DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
TGTTCAAAGCAGGCGTATTGCTCGAATATATTAGCATGGAATAATAGAAT
+DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
__\^c^ac]ZeaWdPb_e`KbagdefbZb[cebSZIY^cRaacea^[a`c
You can see 3 unique headers starting with #, each followed by 3 additional lines.
FileB:
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
_[_ceeeefffgfdYdffed]e`gdghfhiiihdgcghigffgfdceffh
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
There are 4 headers here, but only 2 are unique, as one of them is repeated 3 times.
I need the common headers between the two files without duplicates plus the 3 lines below them. In the same order in each file.
Here's what I have so far:
grep -E '#DLZ38V1.*/' --only-matching FileA | sort -u -o FileA.sorted
grep -E '#DLZ38V1.*/' --only-matching FileB | sort -u -o FileB.sorted
comm -12 FileA.sorted FileB.sorted > combined
combined
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/
This is only the common headers between the two files without duplicates. This is what I want.
Now I need to match these headers to the original files and grab the 3 lines below them but only once.
If I use grep I can get what I want for each file
while read -r line; do
grep -A3 -m1 -F "$line" FileA
done < combined > FileA.Final
FileA.Final
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
The while loop is repeated to generate FileB.Final
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
This works but FileA and FileB are ~18GB and my combined file is around ~2GB. Does anyone have any suggestions on how I can dramatically speed up the last step?
Depending on how often you need to run this:
you could dump your data into a Postgres (or SQLite?) database (you'll probably want bulk inserts, with the index built afterwards), build an index on it, and enjoy the fruits of 40 years of research into efficient implementations of relational databases with practically no investment from you.
you could mimic having a relational database by using the Unix utility join, but there wouldn't be much joy, since that doesn't give you an index; still, it is likely to be faster than grep. You might hit physical limitations, though; I never tried to join two 18GB files.
you could write a bit of C code (put your favourite compiled-to-machine-code language here) which converts your strings (four letters only, right?) into binary and builds one or more indexes based on it. This could be made lightning fast with a small memory footprint, as your fifty-character string would take up only two 64-bit words.
Thought I should post the fix I came up with for this. Once I obtained the combined file (above), I used a Perl hash reference to read the headers into memory and scan file A. Matches in file A were hashed and used to scan file B. This still takes a lot of memory but works very fast: from 20+ days with grep down to ~20 minutes.
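For reference, the same hash-lookup idea can be sketched in awk without writing Perl. This assumes 4-line records whose header, once the trailing mate number (1 or 2) is stripped, matches an entry in combined; the toy data below just stands in for the real files:

```shell
# toy stand-ins for the real files: one common header, one duplicate record
printf '%s\n' '#H1/' > combined
printf '%s\n' '#H1/1' 'SEQ1' '+H1/1' 'Q1' \
              '#H1/1' 'SEQ1' '+H1/1' 'Q1' \
              '#H2/1' 'SEQ2' '+H2/1' 'Q2' > FileA
awk 'NR==FNR { want[$0]; next }      # first file: the common-header list
     FNR % 4 == 1 {                  # every 4th line is a header
         key = $0
         sub(/[12]$/, "", key)       # drop the trailing mate number
         emit = (key in want)
         if (emit) delete want[key]  # print each record only once
     }
     emit' combined FileA > FileA.Final
```

A single pass per file, no per-header grep over 18GB; rerun with FileB in place of FileA for the second output.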

sed or cut? How do I only get column headers from a file?

First off, thank you for everyone's help in advance! I've been learning Unix in school and have been doing well up until this most recent homework assignment.
I'm trying to figure out what the best way to approach this particular part in my homework assignment.
I have a headers file which I must separate into two separate files. There are two pieces to this part of the assignment:
First, the first two lines of the file go into one file. I did this by doing:
head -2 headers > file1
However, the next request is to take two column headers (--Regular-- and --Overtime--) and put them into another file...which is what I'm having trouble with.
The header file looks like this:
Merry Land Training Academy
Pay Report
Week of June 12, 1999
          --Regular---  --Overtime--  Gross  Net
Employee  Hours  Rate   Hours  Rate   Pay    Pay
I know that grep can only match lines that contain the pattern, however how can I remove characters after the last two -- in Overtime?
For example, my grep will return the following:
egrep 'Regular' headers
--Regular--- --Overtime-- Gross Net
I know I could grep the line and then manually sed-replace "Gross" and "Net" to remove those words, however I know this is inefficient.
This command will be part of a script which will contain many other processes (which I have been able to do thus far).
In my research online, I know a lot of people recommend using awk, however we have not yet learned this in the course.
Again, thank you in advance. I really look forward to learning from everyone's experience.
Why do you think using sed would be inefficient? Certainly piping grep to sed would be a mistake, but sed is pretty good. You haven't really defined the problem very well, but assuming that you can distinguish a header by the existence of the string --, you could simply do:
sed -n -e '/--/s/[^-]*$//p' input > output
This will take all lines that contain -- and output everything up to the final -. If you only want to print the first such line:
sed -n -e '/--/{s/[^-]*$//p;q;}' input > output
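For what it's worth, checking that command against the sample header file from the question:

```shell
# recreate the sample header file from the question
printf '%s\n' 'Merry Land Training Academy' \
              'Pay Report' \
              'Week of June 12, 1999' \
              '--Regular--- --Overtime-- Gross Net' \
              'Employee Hours Rate Hours Rate Pay Pay' > headers
# keep lines containing --, drop everything after the final dash
sed -n -e '/--/s/[^-]*$//p' headers
```

Only the one line containing -- survives, with "Gross Net" trimmed off.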

Using Sed or Script to Inline Edit Values in Data Files With Variable Spacing

I have a number of scripts that replace variables separate by white space.
e.g.
sed -i 's/old/new/g' filename.conf
But say I have
#NAME  Weight   Age  Name
Boss   160.000  43   BOB
The above data is made more readable if it stays within the current alignment, so to speak. So if I'm writing a new double, I'd like to overwrite only the width of each of the fields.
My questions are:
1. How do I capture the patterns between values to preserve spaces?
2. Does sed feature a way to force a shell variable say ${FOOBAR} to be a certain width?
3a. If so how do I define this replace field width?
3b. If not what program in Linux is best suited for this truncation assuming I use a mix of number and string data?
EDIT 1
Let me give a couple more examples.
Let's say my file is:
#informative info on this config var.
VAR1   131      comment               second_comment
#more informative info
VAR2   3.4      13132                 yet_another_comment
#FOO THE VALUE WARNING
Foo    5.6      donteditthis_comment
#BAR ANOTHER VALUE WARNING
Bar    6.5      donteditthis_comment
#Yet another informative comment
VAR3   321
in my bash script I have:
#!/bin/bash
#Vars -- real script will have vals in arrays as
#multiple identically tagged config files will be altered
FOO='Foo'
BAR='Bar'
FOO_VAL_NEW='33.3333'
BAR_VAL_NEW='22.1111'
FILENAME='file.conf'
#Define sed patterns
#These could be inline, but are defined here for readability as they're long.
FOO_MATCH=${FOO}<...whatever special character can be used to capture whitespace...>'[0-9]*.*[0-9]*'
FOO_REPLACE=${FOO}<...whatever special characters to output captured whitespace...>${FOO_VAL_NEW}
BAR_MATCH=${BAR}<...whatever special character can be used to capture whitespace...>'[0-9]*.*[0-9]*'
BAR_REPLACE=${BAR}<...whatever special characters to output captured whitespace...>${BAR_VAL_NEW}
#Do the inline edit ... will be in a loop to handle multiple
#identically tagged config files in full-fledged script.
sed -i "s/${FOO_MATCH}/${FOO_REPLACE}/g" ${FILENAME}
sed -i "s/${BAR_MATCH}/${BAR_REPLACE}/g" ${FILENAME}
My expected output is:
#informative info on this config var.
VAR1   131      comment               second_comment
#more informative info
VAR2   3.4      13132                 yet_another_comment
#FOO THE VALUE WARNING
Foo    33.3333  donteditthis_comment
#BAR ANOTHER VALUE WARNING
Bar    22.1111  donteditthis_comment
#Yet another informative comment
VAR3   321
Currently my script works... but there's a couple of annoyances/dangers.
PROBLEM 1
Currently to match the tag, I include the exact whitespace characters after it. E.g. for the given example I would define
FOO='Foo '
...as I'm unsure of how to capture whitespace characters and then output them in the replace field.
This is nice for me, as I know I'm going to keep the spaces to the first field the same, to maintain readability. But if one of my users (this is for a public project) writes their own file and writes:
#FOO THE VALUE WARNING
Foo 22.0
Now my script is broken for them. I need to capture the whitespace chars in my match pattern, then output them in my output pattern. That way it will play nice with my file (optimally spaced for readability) but if someone wants to muck things up and not space things nicely it will still work for them as well, preserving their current spaces.
PROBLEM 2
Okay so we've read a tag and injected a consistent amount of spaces after it for the replace, based on what we found with a regex in the match.
But now I need to replace fields within the string.
Currently my script does this. However, the result isn't the clean style I show above in my desired output. For the above script, for example, I'd get:
#informative info on this config var.
VAR1 131 comment second_comment
#more informative info
VAR2 3.4 13132 yet_another_comment
#FOO THE VALUE WARNING
Foo 33.3333 donteditthis_comment
#BAR ANOTHER VALUE WARNING
Bar 22.1111 donteditthis_comment
#Yet another informative comment
VAR3 321
Well the values are right, but all that work for readability is ruined.... argghhh. Now if I opened the files in emacs and pressed the insert key I would be able to arrow over to the '3' in the Foo tagged value and then start typing the new value and get the output file I listed as desired. I want my sed inline edit to do the same thing... (Maybe as Kent showed this is possible with column?)
I want it to only overwrite on the trailing end. Further, I want it to start the next field (let's say I do end up editing the warning) at the same column it started at in the old file.
Put more simply, I want a variant of sed -i "s/${MATCH}/${REPLACE}/g" ${FILENAME} that writes replacement variables to a tagged line, starting at the same column that entry is at in the CURRENT version of the config file.
This requires both saving the spaces and somehow coding to only write on the trailing end and pad the output so that the next entry stays in the same starting column if my new value's string is shorter than the old one.
In order to improve upon my current solution it is crucial to both maintain the column start position for each piece of data in a tagged entry and secondly to be able to match a tag with an arbitrary amount of trailing whitespace (which must be preserved)... these are trivial operations in a text editor (see the emacs example above) with the help of the insert key, but more complicated in the script scenario.
This way:
1. I make sure the values can be written no matter how other users space their file.
2. If users (like myself) do bother to match the fields column-wise to the comment above to improve readability, then the script won't mess this up, as it only writes on the trailing side.
Let me know if this is unclear at all.
If this can't be done or is overly onerous with sed alone, I'd be open to an efficient perl or python subscript that my bashscript would call, although obviously an inline solution (if concise and understandable) is preferable, if possible.
The column command may help you; see the example below if it is what you are looking for:
kent$ cat f
#NAME  Weight   Age  Name
Boss   160.000  43   BOB
kent$ sed 's/160.000/7.0/' f | column -t
#NAME  Weight  Age  Name
Boss   7.0     43   BOB
kent$ sed 's/160.000/7.7777777777/' f | column -t
#NAME  Weight        Age  Name
Boss   7.7777777777  43   BOB
Using one of your sample datasets, you can get
$ doit Weight 160 7.555555 <<\EOD
#NAME Weight Age Name
Boss 160.000 43 BOB
Me 180 25 JAKE
EOD
#NAME  Weight    Age  Name
Boss   7.555555  43   BOB
Me     180       25   JAKE
$
with this function:
$ doit ()
{
awk -v tag=$1 -v old=$2 -v new=$3 '
NR==1 { for (i=0;i++<NF;) field[$i]=i } # read column headers
$field[tag] == old {
$field[tag] = new
}
{print}
' | column -t
}
the useful part being loading the column headers into the field name->column map. With tag being "Weight", field[tag] evaluates to 2 for this input so $field[tag] is $2 i.e. the second field, the Weight column.
To answer your questions as asked:
My questions are:
How do I capture the patterns between values to preserve spaces?
Because of what Kent pointed out, it's probably best to regenerate spacing correct for the new data. If you really need to preserve the exact input spacing wherever possible, even where that forces lines with replacement values out of alignment for some values, I'd say ask that again as a separate "no, really, help me here" question.
Does sed feature a way to force a shell variable say ${FOOBAR} to be a certain width?
sed's Turing complete, but that's as close to a feature as it's got for this. Sardonic humor aside, the only correct answer here is "no".
3b. If not, what program in Linux is best suited for this truncation, assuming I use a mix of number and string data?
Kent got that one. I didn't know about column; I get questions answered here that I didn't even know to ask. For locating and substituting the value, awk should do you just fine.
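On question 1 specifically, the whitespace after the tag can be captured with a group and echoed back with a backreference. A sketch (the file name and values echo the question's example; this preserves the spacing before the value but does not pad the trailing columns):

```shell
FOO_VAL_NEW='33.3333'
printf 'Foo    5.6 donteditthis_comment\n' > file.conf
# \1 replays the captured tag plus its whitespace, so the user's spacing survives
sed -E "s/^(Foo[[:space:]]+)[0-9.]+/\1${FOO_VAL_NEW}/" file.conf > file.conf.new
```

However many spaces (or tabs) the user put after Foo, they come back unchanged in the output.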

Is there a c++ library that reads named columns from files?

I regularly deal with files that look like this (for compatibility with R):
# comments
# more comments
col1 col2 col3
1 a hi
2 b there
. . .
Very often, I will want to read col2 into a vector or other container. It's not hard to write a function that parses this kind of file, but I would be surprised if there were no well-tested library to do it for me. Does such a library exist? (As I say, it's not hard to roll your own, but as I am not a C++ expert, it would be some trouble for me to use the templates that would allow me to use an arbitrary container to contain arbitrary data types.)
EDIT:
I know the name of the column I want, but not what order the columns in this particular file will be in. Columns are separated by an unknown amount of white space, which may be tabs or spaces (probably not both). The first entry on each line may or may not be preceded by white space, and sometimes that will change within one file, e.g.:
number letter
 8 g
 9 h
10 i
Boost split may do what you want, providing you can consistently split on whitespace.
I am not aware of any C++ library that will do this. A simple solution, however, would be to use linux cut. You would have to remove the comments first, which is easily done with sed:
sed -e '/^#/d' <your_file>
Then you could apply the following command which would select just the text from the third column:
cut -d' ' -f3 <your_file>
You could combine those together with a pipe to make it a single command:
sed -e '/^#/d' <your_file> | cut -d' ' -f3
You could run this command programmatically, then rather simply append each line to a stl container.
// append each line of the command's output (exposed here as the
// input stream `file`) to a container
std::vector<std::string> column;
std::string line;
while (std::getline(file, line))
    column.push_back(line);
For how to actually run cut from within code, see this answer.
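As a shell-side alternative that copes with columns arriving in any order, you could select the column by header name rather than by position before handing the data to your program. A sketch, with data.txt and the column name as placeholders; awk splits on any run of spaces or tabs by default, so the variable whitespace is handled for free:

```shell
# recreate the question's sample file
printf '%s\n' '# comments' '# more comments' \
              'col1 col2 col3' '1 a hi' '2 b there' > data.txt
# locate the field named "col2" on the header row, then print that
# field from every following data row
awk -v col=col2 '
    /^#/ { next }                      # skip comment lines
    !hdr { for (i = 1; i <= NF; i++)
               if ($i == col) c = i
           hdr = 1; next }             # first non-comment line is the header
    { print $c }' data.txt > column.txt
```

Unlike cut -d' ' -f3, this keeps working if the columns are reordered or padded with extra spaces.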