Parse ASCII Output of a Device-File in C++

I wrote a kernel-space driver for a USB device. When the device is connected, it shows up as /dev/myusbdev0, for example.
From the command line I can send commands to the device with echo -en "command" > /dev/myusbdev0 and read results with cat /dev/myusbdev0.
OK, now I have to write a C++ program. First I would open the device file for read/write with:
int fd = open("/dev/echo", O_RDWR);
After that, a command is sent to get the device working:
char cmd[] = { "\x02sEN LMDscandata 1\x03" };
write(fd, cmd, sizeof(cmd));
Now I get to the part I don't know how to handle yet: I need to read from the device, as it keeps sending data continuously. This data I need to read and parse:
char buf[512];
read(fd, buf, sizeof(buf));
The data looks like the following; each message starts with \x02 and ends with \x03, and they are not always the same size:
sRA LMDscandata 1 1 89A27F 0 0 343 347 27477BA9 2747813B 0 0 7 0 0
1388 168 0 1 DIST1 3F800000 00000000 186A0 1388 15 8A1 8A5 8AB 8AC 8A6
8AC 8B6 8C8 8C2 8C9 8CB 8C4 8E4 8E1 8EB 8E0 8F5 908 8FC 907 906 0 0 0
0 0 0
All values are separated with a 0x20 {SPC}.
I think I need some kind of while loop to continuously read the data from an \x02 until I read an \x03.
Once I have a complete scan, I need to parse this ASCII message into its separate parts (some variables are uint16, uint8, enum16, ...).
Any idea how I can read a complete scan into a buffer and then parse out its components?

Since, as you say, the device is sending continuously, I would recommend adding a queue to hold the chunks coming in, and some dispatching that takes complete parts out of the queue, i.e. \x02 to \x03, decoupling the work that is done from receiving chunks.
Furthermore, you can then have single objects handling one complete block from \x02 to \x03, perhaps threaded (which makes sense with the information given).
device => chunk reader => input queue => input reader => data handling
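A minimal, single-threaded sketch of the chunk-reader part (no queue or threads yet), assuming the device node and command from the question; the framing on \x02/\x03 and the splitting into fields is the point here, everything else is illustrative:

#include <fcntl.h>
#include <unistd.h>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    int fd = open("/dev/myusbdev0", O_RDWR);
    if (fd < 0)
        return 1;

    const char cmd[] = "\x02sEN LMDscandata 1\x03";
    write(fd, cmd, sizeof(cmd) - 1);          // -1: don't send the trailing '\0'

    std::string frame;                        // message currently being assembled
    bool in_frame = false;                    // true once a \x02 has been seen
    char buf[512];

    for (;;)
    {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n <= 0)
            break;                            // error or end of stream

        for (ssize_t i = 0; i < n; ++i)
        {
            char c = buf[i];
            if (c == '\x02')                  // start of a new scan
            {
                in_frame = true;
                frame.clear();
            }
            else if (c == '\x03' && in_frame) // complete scan received
            {
                std::istringstream iss(frame);
                std::vector<std::string> fields;
                std::string field;
                while (iss >> field)          // values are separated by 0x20
                    fields.push_back(field);
                std::cout << "scan with " << fields.size() << " fields\n";
                in_frame = false;
            }
            else if (in_frame)
            {
                frame += c;
            }
        }
    }
    close(fd);
    return 0;
}

From the vector of fields you can then convert each entry according to the telegram description, e.g. std::stoul(field, nullptr, 16) for the hex values, narrowed to uint16_t/uint8_t as needed. In the queued design above, the "complete scan received" branch would instead push the frame onto the input queue for a worker to process.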
Hope this helps.

Related

How do I convert s.st_dev to /sys/block/<name>

I want to determine whether a file is on an HDD or an SSD.
I found out that I could check the type of drive using the /sys/block info:
prompt$ cat /sys/block/sdc/queue/rotational
1
This has 1 if it is rotational or unknown. It is 0 when the disk is an SSD.
Now I have a file and want to know whether it is on an HDD or an SSD. I can stat() the file to get the device number:
struct stat s;
stat(filename, &s);
// what do I do with s.st_dev now?
I'd like to convert s.st_dev to a drive name as I have in my /sys/block directory, in C.
What functions do I have to use to get that info? Or is it available in some /proc file?
First of all, for the input file we need to find out on which partition the file resides.
You can use the following command for that:
df -P <file name> | tail -1 | cut -d ' ' -f 1
which will give you output something like this: /dev/sda3
Now you can apply the following command to determine HDD vs. SSD:
cat /sys/block/sdc/queue/rotational
You can use popen() in your program to capture the output of these shell commands.
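A rough sketch of that popen() approach, using the exact df pipeline from above (the helper name is mine; extracting the disk name such as sda from the partition name such as sda3 is still left to do):

#include <cstdio>
#include <iostream>
#include <string>

// Returns the device holding 'filename' (e.g. "/dev/sda3"), or "" on failure.
std::string device_of(const std::string& filename)
{
    std::string cmd = "df -P '" + filename + "' | tail -1 | cut -d ' ' -f 1";
    FILE* p = popen(cmd.c_str(), "r");
    if (p == nullptr)
        return "";
    char line[256] = "";
    if (fgets(line, sizeof(line), p) == nullptr)
        line[0] = '\0';
    pclose(p);
    std::string dev(line);
    while (!dev.empty() && (dev.back() == '\n' || dev.back() == '\r'))
        dev.pop_back();                      // strip the trailing newline
    return dev;
}

int main(int argc, char** argv)
{
    if (argc > 1)
        std::cout << device_of(argv[1]) << "\n";
    return 0;
}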
Okay, I really found it!
So my first solution, reading the partitions, wouldn't work. It would give me sdc1 instead of sdc. I also found /proc/mounts, which includes some info about what's mounted where, but it would still not help me convert the value to sdc.
Instead, I found another solution, which is to look at the block devices and more specifically this softlink:
/sys/dev/block/<major>:<minor>
The <major> and <minor> numbers can be extracted using the functions of the same name in C (I use C++, but the basic functions are all in C):
#include <sys/types.h>
#include <sys/sysmacros.h>   // major() and minor() live here on modern glibc
...
std::string dev_path("/sys/dev/block/");
dev_path += std::to_string(major(s.st_dev));
dev_path += ":";
dev_path += std::to_string(minor(s.st_dev));
That path is a soft link and I want to get the real path of the destination:
char device_path[PATH_MAX + 1];
if(realpath(dev_path.c_str(), device_path) == nullptr)
{
    return true;
}
From that real path, I then break up the path in segments and search for a directory with a sub-directory named queue and a file named rotational.
advgetopt::string_list_t segments;
advgetopt::split_string(device_path, segments, { "/" });
while(segments.size() > 3)
{
    std::string path("/"
        + boost::algorithm::join(segments, "/")
        + "/queue/rotational");
    std::ifstream in;
    in.open(path);
    if(in.is_open())
    {
        char line[32];
        in.getline(line, sizeof(line));
        return std::atoi(line) != 0;
    }
    segments.pop_back();
}
The in.getline() is what reads the .../queue/rotational file. If the value is not 0, then I consider the drive to be an HDD. If something fails, I also consider the drive to be an HDD. The only way my function returns false is if the rotational file exists and is set to 0.
My function can be found here. The line number may change over time, search for tool::is_hdd.
Old "Solution"
The file /proc/partitions includes the major & minor device numbers, a size, and a name. So I just have to parse that one and return the name I need. Voilà.
$ cat /proc/partitions
major minor #blocks name
8 16 1953514584 sdb
8 17 248832 sdb1
8 18 1 sdb2
8 21 1953263616 sdb5
8 0 1953514584 sda
8 1 248832 sda1
8 2 1 sda2
8 5 1953263616 sda5
11 0 1048575 sr0
8 32 976764928 sdc
8 33 976763904 sdc1
252 0 4096 dm-0
252 1 1936375808 dm-1
252 2 1936375808 dm-2
252 3 1936375808 dm-3
252 4 16744448 dm-4
As you can see in this example, the first two lines are the column names and an empty line. The Name column is what I was looking for.
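For completeness, a sketch of what parsing /proc/partitions could look like (my own code, not the function linked above; as noted, for a file on sdc1 this returns the partition name sdc1, not the disk name sdc, which is why I abandoned this approach):

#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Look up the /proc/partitions name matching the major/minor of st_dev.
std::string partition_name(dev_t dev)
{
    std::ifstream in("/proc/partitions");
    std::string line;
    std::getline(in, line);                  // skip "major minor  #blocks  name"
    std::getline(in, line);                  // skip the empty line
    while (std::getline(in, line))
    {
        unsigned maj, min;
        unsigned long long blocks;
        std::string name;
        std::istringstream iss(line);
        if (iss >> maj >> min >> blocks >> name
            && maj == major(dev) && min == minor(dev))
        {
            return name;
        }
    }
    return "";
}

int main(int argc, char** argv)
{
    struct stat s;
    if (argc > 1 && stat(argv[1], &s) == 0)
        std::cout << partition_name(s.st_dev) << "\n";
    return 0;
}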

How do I set the column width of a pexpect ssh session?

I am writing a simple Python script to connect to a SAN via SSH, run a set of commands, and then exit. Ultimately each command's output will be logged to a separate log along with a timestamp. This is because the device we are connecting to doesn't support certificate-based SSH connections and doesn't have decent logging capabilities on its current firmware revision.
The issue that I seem to be running into is that the SSH session that is created seems to be limited to 78 characters wide. The results generated from each command are significantly wider - 155 characters. This is causing a bunch of funkiness.
First, the results in their current state are significantly more difficult to parse. Second, because the buffer is significantly smaller, the final volume command won't execute properly, because the pexpect-launched SSH session actually gets prompted to "press any key to continue".
How do I change the column width of the pexpect session?
Here is the current code (it works but is incomplete):
#!/usr/bin/python
import pexpect
import os
PASS='mypassword'
HOST='1.2.3.4'
LOGIN_COMMAND='ssh manage@'+HOST
CTL_COMMAND='show controller-statistics'
VDISK_COMMAND='show vdisk-statistics'
VOL_COMMAND='show volume-statistics'
VDISK_LOG='vdisk.log'
VOLUME_LOG='volume.log'
CONTROLLER_LOG='volume.log'
DATE=os.system('date +%Y%m%d%H%M%S')
child=pexpect.spawn(LOGIN_COMMAND)
child.setecho(True)
child.logfile = open('FetchSan.log','w+')
child.expect('Password: ')
child.sendline(PASS)
child.expect('# ')
child.sendline(CTL_COMMAND)
print child.before
child.expect('# ')
child.sendline(VDISK_COMMAND)
print child.before
child.expect('# ')
print "Sending "+VOL_COMMAND
child.sendline(VOL_COMMAND)
print child.before
child.expect('# ')
child.sendline('exit')
child.expect(pexpect.EOF)
print child.before
The output expected:
# show controller-statistics
Durable ID CPU Load Power On Time (Secs) Bytes per second IOPS Number of Reads Number of Writes Data Read Data Written
---------------------------------------------------------------------------------------------------------------------------------------------------------
controller_A 0 45963169 1573.3KB 67 386769785 514179976 6687.8GB 5750.6GB
controller_B 20 45963088 4627.4KB 421 3208370173 587661282 63.9TB 5211.2GB
---------------------------------------------------------------------------------------------------------------------------------------------------------
Success: Command completed successfully.
# show vdisk-statistics
Name Serial Number Bytes per second IOPS Number of Reads Number of Writes Data Read Data Written
------------------------------------------------------------------------------------------------------------------------------------------------
CRS 00c0ff13349e000006d5c44f00000000 0B 0 45861 26756 3233.0MB 106.2MB
DATA 00c0ff1311f300006dd7c44f00000000 2282.4KB 164 23229435 76509765 5506.7GB 1605.3GB
DATA1 00c0ff1311f3000087d8c44f00000000 2286.5KB 167 23490851 78314374 5519.0GB 1603.8GB
DATA2 00c0ff1311f30000c2f8ce5700000000 0B 0 26 4 1446.9KB 65.5KB
FRA 00c0ff13349e000001d8c44f00000000 654.8KB 5 3049980 15317236 1187.3GB 1942.1GB
FRA1 00c0ff13349e000007d9c44f00000000 778.7KB 6 3016569 15234734 1179.3GB 1940.4GB
------------------------------------------------------------------------------------------------------------------------------------------------
Success: Command completed successfully.
# show volume-statistics
Name Serial Number Bytes per second IOPS Number of Reads Number of Writes Data Read Data Written
-----------------------------------------------------------------------------------------------------------------------------------------------------
CRS_v001 00c0ff13349e0000fdd6c44f01000000 14.8KB 5 239611146 107147564 1321.1GB 110.5GB
DATA1_v001 00c0ff1311f30000d0d8c44f01000000 2402.8KB 218 1701488316 336678620 33.9TB 3184.6GB
DATA2_v001 00c0ff1311f3000040f9ce5701000000 0B 0 921 15 2273.7KB 2114.0KB
DATA_v001 00c0ff1311f30000bdd7c44f01000000 2303.4KB 209 1506883611 250984824 30.0TB 2026.6GB
FRA1_v001 00c0ff13349e00001ed9c44f01000000 709.1KB 28 25123082 161710495 1891.0GB 2230.0GB
FRA_v001 00c0ff13349e00001fd8c44f01000000 793.0KB 34 122052720 245322281 3475.7GB 3410.0GB
-----------------------------------------------------------------------------------------------------------------------------------------------------
Success: Command completed successfully.
The output as printed to the terminal (as mentioned, the 3rd command won't execute in its current state):
show controller-statistics
Durable ID CPU Load Power On Time (Secs) Bytes per second
IOPS Number of Reads Number of Writes Data Read
Data Written
----------------------------------------------------------------------
controller_A 3 45962495 3803.1KB
73 386765821 514137947 6687.8GB
5748.9GB
controller_B 20 45962413 5000.7KB
415 3208317860 587434274 63.9TB
5208.8GB
----------------------------------------------------------------------
Success: Command completed successfully.
Sending show volume-statistics
show vdisk-statistics
Name Serial Number Bytes per second IOPS
Number of Reads Number of Writes Data Read Data Written
----------------------------------------------------------------------------
CRS 00c0ff13349e000006d5c44f00000000 0B 0
45861 26756 3233.0MB 106.2MB
DATA 00c0ff1311f300006dd7c44f00000000 2187.2KB 152
23220764 76411017 5506.3GB 1604.1GB
DATA1 00c0ff1311f3000087d8c44f00000000 2295.2KB 154
23481442 78215540 5518.5GB 1602.6GB
DATA2 00c0ff1311f30000c2f8ce5700000000 0B 0
26 4 1446.9KB 65.5KB
FRA 00c0ff13349e000001d8c44f00000000 1829.3KB 14
3049951 15310681 1187.3GB 1941.2GB
FRA1 00c0ff13349e000007d9c44f00000000 1872.8KB 14
3016521 15228157 1179.3GB 1939.5GB
----------------------------------------------------------------------------
Success: Command completed successfully.
Traceback (most recent call last):
File "./fetchSAN.py", line 34, in <module>
child.expect('# ')
File "/Library/Python/2.7/site-packages/pexpect-4.2.1-py2.7.egg/pexpect/spawnbase.py", line 321, in expect
timeout, searchwindowsize, async)
File "/Library/Python/2.7/site-packages/pexpect-4.2.1-py2.7.egg/pexpect/spawnbase.py", line 345, in expect_list
return exp.expect_loop(timeout)
File "/Library/Python/2.7/site-packages/pexpect-4.2.1-py2.7.egg/pexpect/expect.py", line 107, in expect_loop
return self.timeout(e)
File "/Library/Python/2.7/site-packages/pexpect-4.2.1-py2.7.egg/pexpect/expect.py", line 70, in timeout
raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x105333910>
command: /usr/bin/ssh
args: ['/usr/bin/ssh', 'manage@10.254.27.49']
buffer (last 100 chars): '-------------------------------------------------------------\r\nPress any key to continue (Q to quit)'
before (last 100 chars): '-------------------------------------------------------------\r\nPress any key to continue (Q to quit)'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 19519
child_fd: 5
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: <open file 'FetchSan.log', mode 'w+' at 0x1053321e0>
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
0: re.compile("# ")
And here is what is captured in the log:
Password: mypassword
HP StorageWorks MSA Storage P2000 G3 FC
System Name: Uninitialized Name
System Location:Uninitialized Location
Version:TS230P008
# show controller-statistics
show controller-statistics
Durable ID CPU Load Power On Time (Secs) Bytes per second
IOPS Number of Reads Number of Writes Data Read
Data Written
----------------------------------------------------------------------
controller_A 3 45962495 3803.1KB
73 386765821 514137947 6687.8GB
5748.9GB
controller_B 20 45962413 5000.7KB
415 3208317860 587434274 63.9TB
5208.8GB
----------------------------------------------------------------------
Success: Command completed successfully.
# show vdisk-statistics
show vdisk-statistics
Name Serial Number Bytes per second IOPS
Number of Reads Number of Writes Data Read Data Written
----------------------------------------------------------------------------
CRS 00c0ff13349e000006d5c44f00000000 0B 0
45861 26756 3233.0MB 106.2MB
DATA 00c0ff1311f300006dd7c44f00000000 2187.2KB 152
23220764 76411017 5506.3GB 1604.1GB
DATA1 00c0ff1311f3000087d8c44f00000000 2295.2KB 154
23481442 78215540 5518.5GB 1602.6GB
DATA2 00c0ff1311f30000c2f8ce5700000000 0B 0
26 4 1446.9KB 65.5KB
FRA 00c0ff13349e000001d8c44f00000000 1829.3KB 14
3049951 15310681 1187.3GB 1941.2GB
FRA1 00c0ff13349e000007d9c44f00000000 1872.8KB 14
3016521 15228157 1179.3GB 1939.5GB
----------------------------------------------------------------------------
Success: Command completed successfully.
# show volume-statistics
show volume-statistics
Name Serial Number Bytes per second
IOPS Number of Reads Number of Writes Data Read
Data Written
----------------------------------------------------------------------
CRS_v001 00c0ff13349e0000fdd6c44f01000000 11.7KB
5 239609039 107145979 1321.0GB
110.5GB
DATA1_v001 00c0ff1311f30000d0d8c44f01000000 2604.5KB
209 1701459941 336563041 33.9TB
3183.3GB
DATA2_v001 00c0ff1311f3000040f9ce5701000000 0B
0 921 15 2273.7KB
2114.0KB
DATA_v001 00c0ff1311f30000bdd7c44f01000000 2382.8KB
194 1506859273 250871273 30.0TB
2025.4GB
FRA1_v001 00c0ff13349e00001ed9c44f01000000 1923.5KB
31 25123006 161690520 1891.0GB
2229.1GB
FRA_v001 00c0ff13349e00001fd8c44f01000000 2008.5KB
37 122050872 245301514 3475.7GB
3409.1GB
----------------------------------------------------------------------
Press any key to continue (Q to quit)%
As a starting point: According to the manual, that SAN has a command to disable the pager. See the documentation for set cli-parameters pager off. It may be sufficient to execute that command. It may also have a command to set the terminal rows and columns that it uses for formatting output, although I wasn't able to find one.
Getting to your question: When an ssh client connects to a server and requests an interactive session, it can optionally request a PTY (pseudo-tty) for the server side of the session. When it does that, it informs the server of the lines, columns, and terminal type which the server should use for the TTY. Your SAN may honor PTY requests and use the lines and columns values to format its output. Or it may not.
The ssh client gets the rows and columns for the PTY request from the TTY for its standard input. This is the PTY which pexpect is using to communicate with ssh.
This question discusses how to set the terminal size for a pexpect session. ssh doesn't honor the LINES or COLUMNS environment variables as far as I can tell, so I doubt that would work. However, calling child.setwinsize() after spawning ssh ought to work:
child = pexpect.spawn(cmd)
child.setwinsize(400,400)
If you have trouble with this, you could try setting the terminal size by invoking stty locally before ssh:
child=pexpect.spawn('stty rows x cols y; ssh user@host')
Finally, you need to make sure that ssh actually requests a PTY for the session. It does this by default in some cases, which should include the way you are running it. But it has a command-line option -tt to force it to allocate a PTY. You could add that option to the ssh command line to make sure:
child=pexpect.spawn('ssh -tt user@host')
or
child=pexpect.spawn('stty rows x cols y; ssh -tt user@host')

Proper reading of MP3 file disrupted by ID3 tags

My semester project is due this Thursday and I have a major problem with reading an MP3 file (the project is about sound analysis; don't ask me what exactly it is about or why I'm doing it so late).
First, I read the first 10 bytes to check for ID3 tags. If they're present, I'll just skip to the first MP3 header - or at least that's the big idea. Here is how I compute the ID3 tag size:
if (inbuf[0] == 'I' && inbuf[1] == 'D' && inbuf[2] == '3') //inbuf contains first 10 bytes from file
{
    int size = inbuf[3] * 2097152 + inbuf[4] * 16384 + inbuf[5] * 128 + inbuf[6]; //Will change to binary shifts later
    //Do something else with it - skip rest of ID3 tags etc
}
It works fine for files without ID3 tags and for some files with them, but for some other files ffmpeg (which I use for decoding) returns a "no header" error, which means it didn't find the MP3 header correctly. I know that because if I remove the ID3 tag from that .mp3 file (with Winamp, for example), no errors occur. The conclusion is that the size calculation isn't always valid.
So the question is: how do I find out exactly how big the entire ID3 part of the .mp3 file is (all possible tags, album picture and whatever else)? I have been looking everywhere, but I just keep finding the algorithm I posted above. Sometimes there is also mention of a 10-byte footer I need to take into account, but it often seems to take more than 10 extra bytes before I eventually reach a proper MP3 frame.
The size of an ID3v1 tag is always a fixed 128 bytes.
I found the following description:
If one sums up the size of all these fields, we see that 30+30+30+4+30+1 equals 125 bytes and not 128 bytes. The missing three bytes can be found at the very beginning of the tag, before the song title. These three bytes are always "TAG" and are the identification that this is indeed an ID3 tag. The easiest way to find an ID3v1/1.1 tag is to look for the word "TAG" 128 bytes from the end of a file.
Source: http://id3.org/ID3v1
There is another version, called ID3v2:
One of the design goals were that the ID3v2 should be very flexible and expandable...
Since each frame can be 16MB and the entire tag can be 256MB you'll probably never again be in the same situation as when you tried to write a useful comment in the old ID3 being limited to 30 characters.
An ID3v2 tag always starts at the beginning of the audio file, as you can read here: http://id3.org/ID3v2Easy
ID3v2/file identifier "ID3"
ID3v2 version $03 00
ID3v2 flags %abc00000
ID3v2 size 4 * %0xxxxxxx
The ID3v2 tag size is encoded with four bytes where the most significant bit (bit 7) is set to zero in every byte, making a total of 28 bits. The zeroed bits are ignored, so a 257 bytes long tag is represented as $00 00 02 01.
bool LameDecoder::skipDataIfRequired()
{
    auto data = m_file.read(3);
    Q_ASSERT(data.size() == 3);
    if (data.size() != 3)
        return false;
    if (memcmp(data.constData(), "ID3", 3))
    {
        m_file.seek(0);
        return true;
    }
    // ID3v2 tag is detected; skip it
    m_file.seek(3+2+1);
    data = m_file.read(4);
    if (data.size() != 4)
        return false;
    qint32 size = (data[0] << (7*3)) | (data[1] << (7*2)) |
                  (data[2] << 7) | data[3];
    m_file.seek(3+2+1+4+size);
    return true;
}
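Regarding the 10-byte footer mentioned in the question: in ID3v2.4 the flags byte (the sixth byte of the header) has a footer-present bit (0x10), and the footer is not included in the synchsafe size field. A plain-C++ sketch of computing the total number of bytes to skip (function name and I/O style are mine, not from the code above):

#include <cstdio>

// Returns the total size of an ID3v2 tag at the current file position
// (header + body + optional footer), or 0 if no tag is present.
long id3v2_total_size(std::FILE* f)
{
    unsigned char hdr[10];
    if (std::fread(hdr, 1, sizeof(hdr), f) != sizeof(hdr))
        return 0;
    if (hdr[0] != 'I' || hdr[1] != 'D' || hdr[2] != '3')
        return 0;                            // no ID3v2 tag at this position
    // hdr[3..4] = version, hdr[5] = flags, hdr[6..9] = synchsafe size
    long size = (long(hdr[6]) << 21) | (long(hdr[7]) << 14) |
                (long(hdr[8]) << 7)  |  long(hdr[9]);
    bool footer = (hdr[5] & 0x10) != 0;      // footer flag (defined in ID3v2.4)
    return 10 + size + (footer ? 10 : 0);
}

Note that the size bytes start at offset 6 of the header (after "ID3", two version bytes and one flags byte), which is also why the code above seeks past 3+2+1 bytes before reading them.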

Parsing (partially) non-uniform text blocks in Perl

I have a file with a few blocks that look like this (already read into a variable, at this point in the program).
Vlan2 is up, line protocol is up
....
reliability 255/255, txload 1/255, rxload 1/255^M
....
Last clearing of "show interface" counters 49w5d
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
....
L3 out Switched: ucast: 17925 pkt, 23810209 bytes mcast: 0 pkt, 0 bytes
33374 packets input, 13154058 bytes, 0 no buffer
Received 926 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
3094286 packets output, 311981311 bytes, 0 underruns
0 output errors, 0 interface resets
0 output buffer failures, 0 output buffers swapped out
Here's a second block, to show you how the blocks can slightly vary:
port-channel86 is down (No operational members)
...
reliability 255/255, txload 1/255, rxload 1/255
...
Last clearing of "show interface" counters 31w2d
...
RX
147636 unicast packets 0 multicast packets 0 broadcast packets
84356 input packets 119954232 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 0 CRC 0 no buffer
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
TX
147636 unicast packets 0 multicast packets 0 broadcast packets
84356 output packets 119954232 bytes
0 jumbo packets
0 output error 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause
0 interface resets
I want to pick out certain data elements from each block, which may or may not exist in each block. For example, in the first block I posted I may want to know that there are 0 runts, 0 input errors and 0 overrun. In the second block, I might want to know that there are 0 jumbo packets, collisions, etc. If a given query isn't in the block, it's acceptable to just return na, as this is designed to be processed uniformly.
Each block is structured in a similar way to the two I posted; newlines and spaces delimiting some entries, commas delimiting others.
I have a few ideas as to how this might work. I'm not sure whether there is any kind of "look back" (lookbehind) in Perl, but I could attempt to look for the field names (runts, "input errors", etc.) and then grab the preceding integer; that seems like it would be the most elegant solution for this, but I'm unsure if it's possible.
Currently, I'm doing this in Perl. Each "block" that I'm processing is actually several of these blocks (separated by double newlines). It doesn't have to be done in a single regular expression; I believe it can be done by applying several regular expressions per block. Performance is not really a factor, as this script will run maybe once per hour.
My goal is to get all of this into a .csv file (or some other data format that's easily graphable) in an automated fashion.
Any ideas?
Edit: example output in CSV as I mentioned, which would be written line by line (for multiple entries like this) to a file as the end result. If a particular entry isn't found in the block, it is marked na in the corresponding line:
interface_name,txload,rxload,last_clearing,input_queue,output_drops,runts,....
vlan2,1,1,49w5d,0-75-0-0,0,0,....
port-channel86,1,1,31w2d,na,na,0,...
Simple hash of properties and numbers.
sub extract {
    my ($block) = @_;
    my %r;
    while ($block =~ /(?<num>\d+) \s (?<name>[A-Za-z\s]+)/gmsx) {
        my $name = $+{name};
        my $num  = $+{num};
        $name =~ s/\A \s+//msx;
        $name =~ s/\s+ \z//msx;
        $r{$name} = $num;
    }
    return %r;
}
my $block = <<'';
Vlan2 is up, line protocol is up
⋮
my $block2 = <<'';
port-channel86 is down (No operational members)
⋮
use Data::Dumper qw(Dumper);
print Dumper {extract $block};
print Dumper {extract $block2};
I don't think a single regex could do it, nor would I want to support it if it could.
Using multiple regexes, you could easily use something like:
(\d+) runts
(\d+) input errors
...etc...
A simple array of property names and a loop could solve this pretty quickly and with very little fuss.
If you can strip down the input to smaller chunks with some preprocessing, you would be less likely to get false positives.
Here is one way to do it in awk, but this needs a lot of tweaking to be perfect.
But again, use SNMP.
awk '{
    printf $1
    for (i=1;i<=NF;i++) {
        if ($i" "$(i+1)~/Input queue:/) printf ",%s",$(i+2)
        if ($i~/runts/) printf ",%s",$(i-1)
        if ($i~/multicast,/) printf ",%s",$(i-1)
    }
    print ""
}' RS="swapped out" file

Perl RegEx for Matching 11 column File

I'm trying to write a perl regex to match the 5th column of files that contain 11 columns. There's also a preamble and footer which are not data. Any good thoughts on how to do this? Here's what I have so far:
if($line =~ m/\A.*\s(\b\w{9}\b)\s+(\b[\d,.]+\b)\s+(\b[\d,.sh]+\b)\s+.*/i) {
And this is what the forms look like:
No. Form 13F File Number Name
____ 28-________________ None
[Repeat as necessary.]
FORM 13F INFORMATION TABLE
TITLE OF VALUE SHRS OR SH /PUT/ INVESTMENT OTHER VOTING AUTHORITY
NAME OF INSURER CLASS CUSSIP (X$1000) PRN AMT PRNCALL DISCRETION MANAGERS SOLE SHARED NONE
Abbott Laboratories com 2824100 4,570 97,705 SH sole 97,705 0 0
Allstate Corp com 20002101 12,882 448,398 SH sole 448,398 0 0
American Express Co com 25816109 11,669 293,909 SH sole 293,909 0 0
Apollo Group Inc com 37604105 8,286 195,106 SH sole 195,106 0 0
Bank of America com 60505104 174 12,100 SH sole 12,100 0 0
Baxter Internat'l Inc com 71813109 2,122 52,210 SH sole 52,210 0 0
Becton Dickinson & Co com 75887109 8,216 121,506 SH sole 121,506 0 0
Citigroup Inc com 172967101 13,514 3,594,141 SH sole 3,594,141 0 0
Coca-Cola Co. com 191216100 318 6,345 SH sole 6,345 0 0
Colgate Palmolive Co com 194162103 523 6,644 SH sole 6,644 0 0
If you ever do write a regex this long, you should at least use the /x flag so that literal whitespace is ignored, which lets you use whitespace and comments to lay the pattern out:
m/
    whatever
    something else  # actually trying to do this
    blah            # for fringe case X
/xi
If you find it hard to read your own regex, others will find it impossible.
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the record separator (sep_char).
Like Ether said, another tool would be appropriate for this job.
@fields = split /\t/, $line;
if (@fields == 11) { # less than 11 fields is probably header/footer
    $the_5th_column = $fields[4];
    ...
}
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.
If you are dealing with columnar data, you can go after it using substr() or unpack() more easily than with a regex. You can use a regex to parse out the data, but most of us who've been programming Perl a while have also learned that regex is often not the first tool to grab. That's why you got the other comments. Regex is a powerful weapon, but it's also easy to shoot yourself in the foot.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC edgar site, I've found that the 13F files are nicely formatted. And, you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE
VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0
ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0
AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0
ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing them correctly, because there are tabs between columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
Based on some of the other files, here are some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns.
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others allow columns to wrap to the next line (like a spreadsheet would in a column that is not wide enough). It's not too hard to figure out how to deal with that, but how to do it is left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor and then saved it. Reading that is a two-step process: get rid of the wrapping carriage returns, then re-process the data to parse out the columns. As an added task, they also have column headings embedded in the data for page breaks.
You should be able to get 100% of the files parsed; however, you'll probably want to do it with a couple of different parsing methods because of the use of tabs, blank lines, and embedded column headers.
Ah, the fun of processing data from the wilderness.