bash: clean and merge data - regex

I have three csv files containing different data for a common object. These represent data about distinct collections of items at work. These objects have unique codes. The number of files is not important so I will set this problem up with two. I have a handy recipe for joining these files using join -- but the cleaning part is killing me.
File A snippet - contains unique data. Also the cataloging error E B.
B 547
J 65
EB 289
E B 1
CO 8900
ZX 7
File B snippet - unique data about a different dimension of the objects.
B 5
ZX 67
SD 4
CO 76
J 54
EB 10
Note that file B contains a code not in common with file A.
Now I submit to you the "official" canon of codes designated for this set of objects:
B
CO
ZX
J
EB
Note that File B contains a non-canonical code with data. It needs to be captured and documented. Same with bad code in file A.
End goal: run trend and stats on the collections using the various fields from the multiple reports. They mostly match the canon but there are oddballs due to cataloging errors and codes that are no longer in use.
End goal result after merge/join:
B 547 5
J 65 54
EB 289 10
CO 8900 76
ZX 7 67
So my first idea was to use grep -F -f for this, using the canonical codes as a search list then merge with join. Problem is, with one letter codes it's too inclusive. It would seem like a job for awk where it can work with tab delimiters and REGEX the oddball codes. I'm not sure though, how to get awk to use a list to sift other files. Will join alone handle all this? Maybe I merge with join or paste, then sift out the weirdos? Which method is the least brittle and more likely to handle edge cases like the drunk cataloger?
If you're thinking, "Dude, this is better done with Perl or Python ...etc.". I'm all ears. No rules, I just need to deliver!

Your question says the data is csv, but based on your samples I'm assuming it's tsv. I'm also assuming E B should end up in the outlier output and that NA values should be filled with 0.
Given those assumptions, the following may be sufficient:
sort -t $'\t' -k 1b,1 fileA > fileA.sorted && sort -t $'\t' -k 1b,1 fileB > fileB.sorted
join -t $'\t' -a1 -a2 -e0 -o auto fileA.sorted fileB.sorted > out
grep -f codes out > out-canon
grep -vf codes out > out-oddball
The content of file codes:
^B\s
^CO\s
^ZX\s
^J\s
^EB\s
Result:
$ cat out-canon
B 547 5
CO 8900 76
EB 289 10
J 65 54
ZX 7 67
$ cat out-oddball
E B 1 0
SD 0 4

Try this(GNU awk):
awk 'BEGIN{FS=OFS="\t";}ARGIND==1{c[$1]++;}ARGIND==2{b[$1]=$2}ARGIND==3{if (c[$1]) {print $1,$2,b[$1]+0; delete b[$1];} else {if(tolower($1)~"[a-z]+ +[a-z]+")print>"error.fileA"; else print>"oddball.fileA";}}END{for (i in b) {print i,0,b[i] " (? maybe?)";print i,b[i] > "oddball.fileB";}}' codes fileB fileA
It will create error.fileA, oddball.fileA if such lines exists, oddball.fileB.
Normal output didn't write to file, you can write with > yourself when results are ok:
B 547 5
J 65 54
EB 289 10
CO 8900 76
ZX 7 67
SD 0 4 (? maybe?)
Had a hard time reading your description, not sure if this is what you want.
Anyway it's easy to improve this awk code.
You can change to FILENAME=="file1", or FILENAME==ARGV[1] if ARGIND is not working.

Related

How can I fetch only files which can contains 'test' word in practice.txt file and merging data into

cat practice.txt
test_0909_3434 test_8838 test_case_5656_5433 case_4333_3211 note_4433_2212
practice.txt file contains some more files.
required output:
test_0909_3434 test_8838
These test files contain some data so that I need to merge these two files data into one final file.
test_0909_3434 file contains:
id name
1 hh
2 ii
test_8838 file contains:
id name
2 ii
3 gg
4 kk
Final Output of output file: mergedfile.txt will be like follows:
id name
1 hh
2 ii
3 gg
4 kk
we need to remove redundant data also like above mergedfile.txt
1) simplistic (data is "in-order" and "well-formatted" in both input files):
cat f1 f2 | sort -u > f3
2) more complex (not "in-order" and not "well-formatted"). use regex.
Read records from both input files. Assume input record is called 'x'.
if [[ "${x}" =~ ^[[:space:]]*([[:digit:]]+)[[:space:]]+(.*)$ ]]; then
d="${BASH_REMATCH[1]}"
s="${BASH_REMATCH[2]}"
echo "d == $d, s == $s"
fi
aa["${d}"]="${k}"
Where aa is a Bash associative array (available in Bash >= 4x).
declare -A aa=()
This assumes the the first field is an integer (and a key). You can process accordingly, as to whether or not the key is unique.
If it's any more complex than this, consider using Perl.

Codeeval Challenge 230: Football, Answer Only Partially Correct

I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so country 1 supports team 1, 2, and 3 as shown in the first line of output: 1:1,2,3.
Below is my solution, and since I have no clue why the solution only works for the two sample cases lited in the description link, I'd like to ask anyone for comments and hints on how to correct my code. Thank you very much for your time and assistance ahead of time.
import sys
def football(string):
countries = map(str.split, string.split('|'))
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
results = []
for i in range(len(teams)):
results.append([teams[i]+':'])
for j in range(len(countries)):
if teams[i] in countries[j]:
results[i].append(str(j+1))
for i in range(len(results)):
results[i] = results[i][0]+','.join(results[i][1:])
return '; '.join(results) + '; '
if __name__ == '__main__':
lines = [line.rstrip() for line in open(sys.argv[1])]
for line in lines:
print football(line)
After deliberately failing an attempt to checkout the complete test input and my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted by numbers in ascending order, which is not achieved by the above-mentioned code as the numbers are in string format, not integer format. Therefore the solution is simply adding a key to sort the teams list by ascending order for integer:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With a small change in this line, the code passes through the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better codes or great suggestions on improving my programming skills.
Here's how I solved it:
import sys
with open(sys.argv[1]) as test_cases:
for test in test_cases:
if test:
team_supporters = {}
for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
for team in map(int, nation_teams.split()):
team_supporters.setdefault(team, []).append(nation)
print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
The output we build using str.format calls and the default space separator of the print function. If the final semicolon wasn't desired, I'd have used print("; ".join("{}:{}.format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make some sorted calls where necessary.
Sorting the inner lists is probably not even be necessary, since the nations were processed in order, with their numbers derived from the order they had in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.

Forvalues dropping leading 0's, how to fix?

I am attempting to create a loop to save me having to type out the code many times. Essentially, I have 60 csv files that I need to alter and save. My code looks as follows:
forvalues i = 0203 0206 : 1112 {
cd "C:\Users\User\Desktop\Data\"
import delimited `i'.csv, varnames(1)
gen time=`i'
keep rssd9017 rssd9010 bhck4074 bhck4079 bhck4093 bhck2170 time
save `i'.dta, replace
}
However, I am getting the error "203.csv" does not exist. It seems to be dropping the leading 0, any way to fix this?
You are asking for a numlist, but in this context 0203, with nothing else said, just looks to Stata like a quirky but acceptable way to write 203: hence your problem.
But do you really have a numlist that is 0203 0206 : 1112?
Try it:
numlist "0203 0206 : 1112"
ret li
The list starts 203 206 209 212 215 218 221 224 227 230 233 236 ...
My wild guess is that you have files, one for each quarter over a period, labelled 0203 for March 2002 through to 1112 for December 2011. In fact you do say that you have times, even though my guess implies 40 files, not 60. If so, that means you won't have a file that is labelled 0215, so this is the wrong way to think in any case.
Here is a better approach. First take the cd out of the loop: you need only do that once!
cd "C:\Users\User\Desktop\Data"
Now find the files that are ????.csv. You need only install fs once.
ssc inst fs
fs ????.csv
foreach f in `r(files)' {
import delimited `f', varnames(1)
gen time = substr("`f'", 1, 4)
keep rssd9017 rssd9010 bhck4074 bhck4079 bhck4093 bhck2170 time
save `time'.dta, replace
}
On my guess, you still need to fix the time to something civilised and you would be better off appending the files, but one problem at a time.
Note that insisting on leading zeros, which you think is the problem here, but is probably a red herring, is written up here.

How optimize word counting in Python?

I'm taking my first steps writing code to do linguistic analysis of texts. I use Python and the NLTK library. The problem is that the actual counting of words takes up close to 100 % of my CPU (iCore5, 8GB RAM, macbook air 2014) and ran for 14 hours before I shut the process down. How can I speed the looping and counting up?
I have created a corpus in NLTK out of three Swedish UTF-8 formatted, tab-separated files Swe_Newspapers.txt, Swe_Blogs.txt, Swe_Twitter.txt. It works fine:
import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
Then I've loaded a text-file with one word per line into NLTK. That also works fine.
my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")
The text-file I want to analyse (Swe_Blogs.txt) has this structure, and works fine to parse:
Wordpress.com 2010/12/08 3 1,4,11 osv osv osv …
bloggagratis.se 2010/02/02 3 0 Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com 2010/03/10 3 0 1 kruka Sallad, riven
EDIT: The suggestion to produce the counter as below, does not work, but can be fixed:
counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)
This produces the error:
IOError Traceback (most recent call last)
<ipython-input-41-1868952ba9b1> in <module>()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
182 def words(self, fileids=None, categories=None):
183 return PlaintextCorpusReader.words(
--> 184 self, self._resolve(fileids, categories))
185 def sents(self, fileids=None, categories=None):
186 return PlaintextCorpusReader.sents(
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
89 encoding=enc)
90 for (path, enc, fileid)
---> 91 in self.abspaths(fileids, True, True)])
92
93
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
165 fileids = [fileids]
166
--> 167 paths = [self._root.join(f) for f in fileids]
168
169 if include_encoding and include_fileid:
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/ lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
174 def join(self, fileid):
175 path = os.path.join(self._path, *fileid.split('/'))
--> 176 return FileSystemPathPointer(path)
177
178 def __repr__(self):
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/ lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
152 path = os.path.abspath(path)
153 if not os.path.exists(path):
--> 154 raise IOError('No such file or directory: %r' % path)
155 self._path = path
IOError: No such file or directory: '/Users/mos/Documents/Blogs'
A fix is to assign my_corpus(categories=["Blogs"] to a variable:
blogs_text = my_corpus.words(categories=["Blogs"])
It's when I try to count all occurrences of each word (about 20K words) in the wordlist within the blogs in the corpus (115,7 MB) that my computer get's a little tired. How can I speed up the following code? It seems to work, no error messages, but it takes >14h to execute.
import collections
counter = collections.Counter()
for word in my_corpus.words(categories="Blogs"):
for token in my_wordlist.words():
if token == word:
counter[token]+=1
else:
continue
Any help to improve my coding skills is much appreciated!
It seems like your double loop could be improved:
for word in mycorp.words(categories="Blogs"):
for token in my_wordlist.words():
if token == word:
counter[token]+=1
This would be much faster as:
words = set(my_wordlist.words()) # call once, make set for fast check
for word in mycorp.words(categories="Blogs"):
if word in words:
counter[word] += 1
This takes you from doing len(my_wordlist.words()) * len(mycorp.words(...)) operations to closer to len(my_wordlist.words()) + len(mycorp.words(...)) operations, as building the set is O(n) and checking whether a word is in the set is O(1) on average.
You can also build the Counter direct from an iterable, as Two-Bit Alchemist points out:
counter = Counter(word for word in mycorp.words(categories="Blogs")
if word in words)
You already got good answers on how to count words properly with Python. The problem is that it will still be quite slow. If you are just exploring the corpora, using a chain of UNIX tools gives you a much quicker result. Assuming that your text is tokenized, something like this gives you the first 100 tokens in descending order:
cat Swe_Blogs.txt | cut --delimiter='\t' --fields=5 | tr ' ' '\n' | sort | uniq -c | sort -nr | head -n 100

Perl RegEx for Matching 11 column File

I'm trying to write a perl regex to match the 5th column of files that contain 11 columns. There's also a preamble and footer which are not data. Any good thoughts on how to do this? Here's what I have so far:
if($line =~ m/\A.*\s(\b\w{9}\b)\s+(\b[\d,.]+\b)\s+(\b[\d,.sh]+\b)\s+.*/i) {
And this is what the forms look like:
No. Form 13F File Number Name
____ 28-________________ None
[Repeat as necessary.]
FORM 13F INFORMATION TABLE
TITLE OF VALUE SHRS OR SH /PUT/ INVESTMENT OTHER VOTING AUTHORITY
NAME OF INSURER CLASS CUSSIP (X$1000) PRN AMT PRNCALL DISCRETION MANAGERS SOLE SHARED NONE
Abbott Laboratories com 2824100 4,570 97,705 SH sole 97,705 0 0
Allstate Corp com 20002101 12,882 448,398 SH sole 448,398 0 0
American Express Co com 25816109 11,669 293,909 SH sole 293,909 0 0
Apollo Group Inc com 37604105 8,286 195,106 SH sole 195,106 0 0
Bank of America com 60505104 174 12,100 SH sole 12,100 0 0
Baxter Internat'l Inc com 71813109 2,122 52,210 SH sole 52,210 0 0
Becton Dickinson & Co com 75887109 8,216 121,506 SH sole 121,506 0 0
Citigroup Inc com 172967101 13,514 3,594,141 SH sole 3,594,141 0 0
Coca-Cola Co. com 191216100 318 6,345 SH sole 6,345 0 0
Colgate Palmolive Co com 194162103 523 6,644 SH sole 6,644 0 0
If you ever do write a regex this long, you should at least use the x flag to ignore whitespace, and importantly allow whitespace and comments:
m/
whatever
something else # actually trying to do this
blah # for fringe case X
/xi
If you find it hard to read your own regex, others will find it Impossible.
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the record separator (sep_char).
Like Ether said, another tool would be appropriate for this job.
#fields = split /\t/, $line;
if (#fields == 11) { # less than 11 fields is probably header/footer
$the_5th_column = $fields[4];
...
}
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.
If you are dealing with columnar data, you can go after it using substr() or unpack() easier than you can using regex. You can use regex to parse out the data, but most of us who've been programming Perl a while also learned that regex is not the first tool to grab a lot of times. That's why you got the other comments. Regex is a powerful weapon, but it's also easy to shoot yourself in the foot.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC edgar site, I've found that the 13F files are nicely formatted. And, you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE
VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0
ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0
AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0
ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing correctly because there are tabs between columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
Based on some of the other files here's some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns.
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others allow wrap columns to the next line (like a spreadsheet would in a column that is not wide enough. It's not too hard to figure out how to deal with that, but how to do it is being left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor then saved it. Reading that is a two step process of getting rid of the wrapping carriage-returns, then re-processing the data to parse out the columns. As an added task they also have column headings in the data for page breaks.
You should be able to get 100% of the files parsed, however you'll probably want to do it with a couple different parsing methods because of the use of tabs and blank lines and embedded column headers.
Ah, the fun of processing data from the wilderness.