What is wrong with this Fortran write-statement? [duplicate]

Why does the Fortran standard limit source lines to a maximum of 132 characters? Is it about performance, clean source code, compilers, ...? I know that many compilers allow longer lines as an extension. But if this extension is possible without any compromise, then why does the Fortran standard strictly adhere to this rule?
I know that this is a very general question (Stack Overflow warns me that this question might be downvoted given its title), but I cannot find any resources that explain the logic behind the maximum line length of 132 characters in the modern Fortran standard.
Update Oct 22, 2019: See https://j3-fortran.org/doc/year/19/19-138r1.txt for a proposal accepted as a work item for the next 202X revision of the Fortran standard, which eliminates the maximum line length and continuation limits.

Take a look at the specification:
ftp://ftp.nag.co.uk/sc22wg5/N001-N1100/N692.pdf
section: 3.3.1
It's just a convention. Somebody decided that 132 would be OK. In the Fortran 66 version it was 72.
Standards: https://gcc.gnu.org/wiki/GFortranStandards#Fortran_Standards_Documents
Usually, these limitations (like 80 or 132 characters per line) were dictated by terminals.
Just to illustrate, in a "funny" way, how it was to code back in the '90s ;)

The first programming language I learned back in the 1980s was Fortran (FORTRAN 77, to be exact).
Everybody was super excited because my group of students were the first ones allowed to use the brand new terminals that had just been set up in the room next to the computer. BTW: The computer was an IBM mainframe and it resided in a room the size of a small concert hall, about four times the size of the classroom with the 16 terminals.
I remember having more than once spent hours and hours debugging my code only to find out that in one of my code lines I had again been using the full line width of 80 characters that the terminal provided instead of the 72 characters allowed by Fortran 77. I used to call the language "Fortran 72" because of that restriction.
When I asked my tutor for the reason he pointed me to the stack of cardboard boxes in the hallway. It was rather a wall of boxes, 8m long and almost 2m high. All these boxes were full of unused punch cards that they did not need anymore after the installation of the terminals.
And yes the punchcards only used 72 characters per code line because the remaining 8 were required for the sequence number of the card.
(Imagine dropping a stack of cards with no sequence numbers punched in.)
I am aware that I broke some rules and conventions here: I hope you still like that little piece of trivia and won't mind that my story does not exactly answer the original question. And yeah, it also repeats some information from previous answers.

The old IBM line printers had a 132-character width, so when IBM designed Fortran, that was the maximum line length.

The reason was the sequence numbers punched in columns 73-80 of the source code cards. When you dropped your program deck on the floor, they allowed you to bring the scrambled deck to a sorting machine (a large, 5-foot-long stand-alone machine) and sort the deck back into order.
A sequencer program read the deck and could punch a new deck with updated sequence numbers, so the programmer did not get involved in the numbering. You punched a new deck after every few dozen changes.
I did it many times between 1970 and 1990.

In the olden days the punch cards were also of finite length. I forget what was being used for terminals in the '90s other than that they were CRTs, and I do not recall the resolution, but it was NOT 2K pixels wide.


Can we move the pointer of ofstream back and forth for output to file? [closed]

I need to output results to a file that has a predefined format. The columns look like this:
TIME COL1 COL2 COL3 COL4 ...
I am using ofstream. I have to output results line by line. However, it may be that results for certain columns are not available at a certain time. The results may also not be in sorted order.
I can control the spacing between the columns while initially specifying the headers.
I guess my question is: Is it possible to move the ofstream pointer back and forth horizontally per line?
What I tried so far:
1) Find the current position of the ofstream pointer:
long pos = fout.tellp();
2) Calculate the position to be shifted to, based on the spacing:
long spacing = column_spacing * column_number;
long newpos = pos + spacing;
3) Then use seekp() to move the pointer:
fout.seekp(newpos);
4) Provide the output:
fout << "output";
This does not work; basically, the pointer does not move. The idea is to make my ofstream fout move back and forth if possible. I would appreciate any suggestions on how to control it.
Some information about the output: I am computing the elevation angle of GPS satellites in the sky over time. Hence, there are 32 columns, one per GPS satellite. At any point in time, not all satellites are visible, hence the need to skip some satellites/columns. Also, the list of satellite elevations may not be arranged in ascending order due to the limitations of the observation file. I hope that helps in drawing the situation.
An example of the desired output: the header (TIME, SAT1, ..., SAT32) is defined prior to the output of the results and is not part of the question here. The spacing between the columns is controlled when the headers are defined (let's say 15 spaces between each column). The output can be truncated to 1 decimal place. A new line occurs once all results at the current time t are written. Then I process the observations for time t+1 and write the outputs again, and so on; hence the writing occurs epoch by epoch. The satellite elevations are stored in a vector<double> and the satellite numbers in a vector<int>; both vectors are of the same length. I just need to write them to a file. For the example below, time is in seconds and satellite elevation is in degrees:
TIME SAT1 SAT2 SAT3 ... SAT12 SAT13 ... SAT32
1 34.3 23.2 12.2 78.2
2 34.2 23.1 12.3 78.2
3 34.1 11.3 23.0 78.3
And so on... As you may notice, satellite elevations may or may not be available; it all depends on the observations. Let's also assume that output size and efficiency are not a priority here. Based on 24 hours of observations, the output file size can reach up to 5-10 MB.
Thanks for your time in advance!
Can we move the pointer of ofstream back and forth for output to file?
No, you probably don't want to do that (even if it would be doable in principle, it would be inefficient, very brittle to code, and nearly impossible to debug), in particular for a textual output whose width is variable (I am guessing that your COLi could have variable width, as is usual in most textual formats). It looks like your approach is wrong.
The general way is to build in memory, as some graph of "objects" or "data structure", the entire representation of your output file. This is generally enough, unless you really need to output something huge.
If your typical textual output is of reasonable size (a few gigabytes at most) then representing the data as some internal data structure is worthwhile and it is very common practice.
If your textual output is huge (dozens of gigabytes or terabytes, which is really unlikely), then you won't be able to represent it in memory (unless you have a costly computer with a terabyte of RAM). However, you could use some database (perhaps sqlite) to serve as internal representation.
In practice, textual formats are always output in sequence (from some internal representation), and textual files in those formats have a reasonable size (so it is uncommon to have a textual file of many gigabytes today; in such cases, databases, or splitting the output file into several pieces in some directory, are better).
Without specifying your textual format precisely (e.g. using EBNF notation), giving an example, and estimating the output size, your question is too broad, and you can only get hints like the ones above.
the output file size can reach upto 5-10 MB's
This is really tiny on current computers (even a cheap smartphone has a gigabyte of RAM). So build the data structure in memory, and output it at once when it is completed.
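For illustration, here is a minimal sketch of that approach in C++ (the container layout, the 15-character column width and the file name are assumptions based on the example in the question, not a definitive implementation): build the whole result set in memory, then write it sequentially with std::setw so the columns stay aligned even when a satellite is missing.

#include <fstream>
#include <iomanip>
#include <map>
#include <string>

int main() {
    // epoch (seconds) -> (satellite number -> elevation in degrees)
    std::map<int, std::map<int, double>> results;
    results[1] = {{1, 34.3}, {2, 23.2}, {3, 12.2}, {13, 78.2}};
    results[2] = {{1, 34.2}, {2, 23.1}, {3, 12.3}, {13, 78.2}};

    std::ofstream fout("elevations.txt");
    const int width = 15;            // assumed column width

    // Header row
    fout << std::setw(width) << "TIME";
    for (int sat = 1; sat <= 32; ++sat)
        fout << std::setw(width) << ("SAT" + std::to_string(sat));
    fout << '\n';

    // One line per epoch, written strictly sequentially; a satellite that is
    // not visible simply leaves its column blank but keeps the alignment.
    fout << std::fixed << std::setprecision(1);
    for (const auto& [time, sats] : results) {
        fout << std::setw(width) << time;
        for (int sat = 1; sat <= 32; ++sat) {
            auto it = sats.find(sat);
            if (it != sats.end())
                fout << std::setw(width) << it->second;
            else
                fout << std::setw(width) << "";
        }
        fout << '\n';
    }
}

No seekp() is needed at all: each line is assembled in order from the in-memory map before anything is written.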
What data structures you should use depends on your actual problem (and the inputs your program gets, and the precise output you want it to produce). Since you don't specify your program in your question, we cannot help. Probably C++ standard containers and smart pointers could be useful (but this is just a guess).
You should read some introduction to programming (like SICP), then some good C++ programming book, and some good Introduction to Algorithms. You probably need to read something about compilation techniques (since they include parsing and outputting structured data), like the Dragon Book. Learning to program takes a lot of time.
C++ is really a very difficult programming language, and I believe it is not the best way to learn programming. Once you have learned a bit how to program, invest your time in learning C++. Your issue is not with std::ostream or C++ but with designing your program and its architecture correctly.
BTW, if the output of your program is feeding some other program (and is not only or mostly for human consumption) you might use some established textual format, perhaps JSON, YAML, CSV, XML (see also this example), ....
2 34.2 23.1 12.3 78.2
How significant are the spaces in the above line (what would happen if a space is inserted after the first 2 and another space is removed after 12.3)? Can a wide number like 3.14159265358979323846264 appear in your output? Or how many digits do you want? That should be documented precisely somewhere! Are you allowed to improve the output format above (you might perhaps use some sign like ? for missing numbers; that would make the output less ambiguous and more readable for humans and easier to parse by some other program)?
You need to define precisely (in English) the behavior of your program, including its input and output formats. An example of input and output is not a specification (it is just an example).
BTW, you may also want to code your program to provide several different output formats. For example, you could decide to provide CSV format for usage in spreadsheets, JSON format for other data processing, gnuplot output to get nice figures, LaTeX output to be able to insert your output in some technical report, HTML output to be usable through a browser, etc. Once you have a good internal representation (as convenient data structures) of your computed data, outputting it in various formats is easy and very convenient.
Probably your domain (satellite processing) has defined some widely used data formats. Study them in detail (at least for inspiration on specifying your own output format). I am not at all an expert on satellite data, but with Google I quickly found examples like the GEOSCIENCE AUSTRALIA (CCRS) LANDSAT THEMATIC MAPPER DIGITAL DATA FORMAT DESCRIPTION (more than a hundred pages). You should specify your output format as precisely as they do (perhaps several dozen pages of English, with a few pages of EBNF); EBNF is a convenient notation for that (with a lot of additional explanations in English).
Look also for inspiration into other output data format descriptions.
You probably should, if you invent your output format, publish its specification (in English) so that other people could code programs taking your output as input to their code.
In many domains, data is much more valuable (i.e. costs much more, in € or US$) than the code processing it. This is why its format should be precisely documented. You need to specify that format so that a future programmer in 2030 could easily write a parser for it. So details matter a great deal. Specify your output format unambiguously and in great detail (in some English document).
Once you have specified that output format, coding the output routines from some good enough internal data representation is easy work (and doesn't require insane tricks like moving the file offset of the output). And a good enough specification of the output format is also a guideline for designing your internal data representations.
Is it possible to move the ofstream pointer back and forth horizontally per line?
It might be doable, but it is so inefficient and error-prone (and impossible to debug) that in practice you should never do that (instead, specify your output in detail and code simple sequential output routines, as all software dealing with textual formats does).
BTW, today we use UTF-8 everywhere in textual files, and a single UTF-8 encoded Unicode character might span one byte (e.g. for a digit like 0 or a Latin letter like E) or several bytes (e.g. for accented letters like é, Cyrillic letters like я, or symbols like ∀), so replacing a single UTF-8 character with another could mean inserting or deleting bytes.
Notice that current file systems do not allow inserting or deleting a span of bytes or characters in the middle of a file (for example, on Linux, nothing in syscalls(2) allows this) and do not really know about lines (the end of line is just a convention, e.g. the \n byte on Linux). Programs that appear to do this (like your favorite source code editor) always represent the data in memory. Today, a file is a sequence of bytes, and from the operating system's point of view you can only append bytes at its end or replace bytes in the middle; inserting or deleting a span of bytes in the middle of the file is not possible, and that is why a textual file is, in practice, always written sequentially, from start to end, without moving the current file offset (other than to append bytes at the end).
(if this is homework for some CS college or undergraduate course, I guess that your teacher is expecting you to define and document your output format)

Why is strlen() about 20 times faster than manually looping to check for null-terminated character?

The original question was badly received and got many downvotes, so I thought I'd revise it to make it easier to read and hopefully more helpful to anyone seeing it. The original question was why strlen() was 20 times faster than manually looping through the string and finding the '\0' character. I thought this question was well founded, since everywhere I'd read, strlen()'s technique for finding the string length is essentially to loop until it finds the null-terminating character '\0'. This is a common criticism of C strings for more reasons than one. Well, as many people pointed out, functions that are part of the C library are written by smart programmers to maximise performance.
Thanks to ilen2, who linked me to a VERY clever way of using bitwise operators to check 8 bytes at once, I managed to get something that runs faster than strlen() on strings larger than about, say, 8 to 15 characters, and many, many times faster when the string is considerably larger. Strangely, strlen() seems to take time linearly dependent on the length of the string, whereas the custom one takes pretty much the same amount of time no matter the string length (I tested up to a couple of hundred characters). Anyway, my results are rather surprising; I measured them with optimisation turned OFF, and I don't know how valid they are. Thanks a lot to ilen2 for the link, and to John Zwinck. Interestingly, John Zwinck suggested SIMD as a possible reason why strlen() might be faster, but I don't know anything about that.
strlen() is a very heavily hit function and you can bet that several very bright people have spent days and months optimizing it. Once you get your algorithm right, the next thing is, can you check multiple bytes at once? The answer of course is that you can, using SIMD (SSE) or other tricks. If your processor can operate on 128 bits at a time, that's 16 characters per clock instead of 1.
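For illustration only, here is a minimal sketch of that word-at-a-time idea (this is not the actual library implementation; real strlen() versions add SIMD, unrolling and much more careful tuning):

#include <cstddef>
#include <cstdint>
#include <cstring>

std::size_t word_strlen(const char* s) {
    const char* p = s;
    // Advance byte by byte until p is 8-byte aligned, so the word reads
    // below can never cross a page boundary.
    while (reinterpret_cast<std::uintptr_t>(p) % sizeof(std::uint64_t) != 0) {
        if (*p == '\0') return static_cast<std::size_t>(p - s);
        ++p;
    }
    for (;;) {
        std::uint64_t v;
        // May inspect bytes past the terminator, as real implementations do
        // at the assembly level; treat this as a sketch, not portable ISO C++.
        std::memcpy(&v, p, sizeof v);
        // "Has a zero byte" trick: (v - 0x01..01) & ~v & 0x80..80 is nonzero
        // exactly when some byte of v is zero.
        if ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) {
            for (std::size_t i = 0; i < sizeof v; ++i)
                if (p[i] == '\0') return static_cast<std::size_t>(p + i - s);
        }
        p += sizeof v;
    }
}

Note that benchmarking this fairly requires turning optimisation on; timings taken at -O0 say very little about how either version behaves in practice.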

4 randomly pulled cards at least one would be ace

Please help me with my C++ program that I don't know how to write. The question is as below.
There is a well mixed deck of 32 cards. Method of statistical tests to obtain the probability of an event that of the 4 randomly pulled charts at least one would be ace.
Compare the error of the calculated probability against the true value (the true probability is approximately equal to 0.432). Vary the number of experiments n.
What are the odds of not drawing an ace in one draw?
In four successive draws?
What are the odds that that doesn't happen?
From what I understand of your question, you have already calculated the odds of drawing the ace, but now need a program to prove it.
Shuffle your cards.
Draw 4 cards.
Check your hand for the presence of an ace.
Repeat these steps n times, where n is the number of tests you need to make. Your final, "proven" probability is a/n, where a is the number of times an ace came up.
Of course, given the nature of randomness, there's no way to ensure that your results will be near the mathematical answer, unless you have the time available to make n equal to infinity.
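A minimal sketch of that simulation, assuming the 32-card deck contains the usual four aces (that assumption matches the quoted true value of about 0.432); the names and the trial count are illustrative:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(std::random_device{}());
    std::vector<int> deck(32);
    std::iota(deck.begin(), deck.end(), 0);   // cards 0..31; call 0..3 the aces

    const int trials = 100000;                // vary n and watch the error shrink
    int hands_with_ace = 0;

    for (int t = 0; t < trials; ++t) {
        std::shuffle(deck.begin(), deck.end(), rng);              // shuffle
        bool ace = std::any_of(deck.begin(), deck.begin() + 4,    // draw 4 cards
                               [](int c) { return c < 4; });      // check for an ace
        if (ace) ++hands_with_ace;
    }

    double estimate = static_cast<double>(hands_with_ace) / trials;   // a/n
    std::cout << "estimated P(at least one ace) = " << estimate
              << " (true value ~0.432)\n";
}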
Unfortunately I need to 'answer' rather than comment as I would wish because my rep is not high enough to allow me to do so.
There is information missing which will make it impossible to be sure of providing a correctly functioning program.
Most importantly, coming to your problem from a mathematical/probability background:
I need to know for sure how many of the reduced deck of 32 cards are aces!
Unfortunately this sentence:
Method of statistical tests to obtain
the probability of an event that of
the 4 randomly pulled charts at least
one would be ace.
is mathematical gobbledygook!
You need to correctly quote the sentences given to you in your assignment.
Those sentences hold vital information on which depends what the c++ program is to simulate!

Identifying keywords of a (programming) language

This is a follow-up to my recent question (Code for identifying programming language in a text file). I'm really thankful for all the answers I got; they helped me very much. My code for this task is complete and it works fairly well: quick and reasonably accurate.
The method I used is the following: I have a "learning" Perl script that identifies the most frequently used words in a language by doing a word histogram over a set of sample files. These data are then loaded by the C++ program, which checks the given text, accumulates a score for each language based on the words found, and then simply checks which language accumulated the highest score.
Now I would like to make it even better and work a bit on the quality of identification. The problem is that I often get "unknown" as the result (many languages accumulate a small score, but none anything bigger than my threshold). After some debugging, research, etc. I found out that this is probably due to the fact that all words are considered equal. This means that seeing a "#include", for example, has the same effect as seeing a "while"; both indicate that it might be C/C++ (I'm ignoring for now the fact that "while" is used in many other languages), but of course in larger .cpp files there might be a ton of "while" and most of the time only a few "#include".
So the fact that a "#include" is more important is ignored, because I could not come up with a good way to identify whether one word is more important than another. Now bear in mind that the script which creates the data is fairly stupid; it is only a word histogram, and it assigns every chosen word a score of 1. It does not even look at the words (so if a "#&|?/" occurs very often in a file, it might get chosen as a good word).
Also, I would like to have the data creation part fully automated, so nobody should have to look at the data and alter them, change scores, change words, etc. All the "brainz" should be in the script and the C++ program.
Does somebody have a suggestion on how to identify keywords or, more generally, important words? Some things that might help: I have the number of occurrences of each word and the total number of words (so a ratio may be calculated). I have also thought about wiping out characters like ";", etc., since the histogram script often puts, for example, "continue;" in the result when the important word is "continue". Last note: all checks for equality are exact matches: no substrings, case sensitive. This is mainly because of speed, but substrings might help (or hurt, I don't know)...
NOTE: thanks to all who bothered to answer; you helped me a lot.
My work on this is almost finished, so I will describe what I did to get good results.
1) Get a decent training set, about 30-50 files per language, from various sources to avoid coding-style bias.
2) Write a Perl script that does a word histogram. Implement a blacklist and a whitelist (more about them below).
3) Add bogus words to the blacklist, like "license", "the", etc. These are often found at the start of a file in license information.
4) Add the roughly five most important words per language to the whitelist. These are words that are found in most source code of a given language but are not frequent enough to get into the histogram. For example, for C/C++ I had #include, #define, #ifdef, #ifndef and #endif in the whitelist.
5) Emphasize the start of a file, so give more points to words found in the first 50-100 lines.
6) When doing the word histogram, tokenize the file using @words = split(/[\s\(\){}\[\];.,=]+/, $_); This should be OK for most languages, I think (it gives me the best results). For each language, keep about the 10-20 most frequent words in the final results.
7) When the histogram is complete, remove all words that are found in the blacklist and add all those that are found in the whitelist.
8) Write a program which processes a text file in the same way as the script: tokenize using the same rules. If a word is found in the histogram data, add points to the right language. Words in the histogram which correspond to only one language should add more points; those which belong to multiple languages should add less (a rough sketch of this scoring step is given below).
Comments are welcome. Currently, on about 1000 text files, I get 80 unknowns (mostly on extremely short files, mainly JavaScript with just one or two lines). About 20 files are recognized wrongly. The average file size is about 11 kB, ranging from 100 bytes to 100 kB (almost 11 MB total). It takes one second to process them all, which is good enough for me.
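A rough sketch of the scoring step in 8), with illustrative data structures; the histogram contents, the weighting rule and the threshold here are assumptions for the example, not the poster's actual code:

#include <iostream>
#include <map>
#include <regex>
#include <string>
#include <vector>

int main() {
    // word -> languages it was seen in (as produced by the training script)
    std::map<std::string, std::vector<std::string>> histogram = {
        {"#include", {"C++"}},
        {"while",    {"C++", "Java", "Python"}},
        {"def",      {"Python"}},
    };

    std::string text = "#include <vector>\nint main() { while (true) {} }\n";

    // Same tokenization rule as the Perl script: split on whitespace,
    // parentheses, braces, brackets, ';', '.', ',' and '='.
    std::regex separators(R"([\s(){}\[\];.,=]+)");
    std::map<std::string, double> score;
    for (std::sregex_token_iterator it(text.begin(), text.end(), separators, -1), end;
         it != end; ++it) {
        std::string token = *it;
        if (token.empty()) continue;
        auto hit = histogram.find(token);
        if (hit == histogram.end()) continue;
        // Words unique to one language count more than shared ones.
        double points = 1.0 / hit->second.size();
        for (const auto& lang : hit->second) score[lang] += points;
    }

    const double threshold = 1.0;   // assumed cut-off for "unknown"
    std::string best = "unknown";
    double best_score = threshold;
    for (const auto& [lang, s] : score)
        if (s > best_score) { best = lang; best_score = s; }
    std::cout << "detected: " << best << '\n';
}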
I think you're approaching this from the wrong viewpoint. From your description, it sounds like you are building a classifier. A good classifier needs to discriminate between different classes; it doesn't need to precisely estimate the correspondence between the input and the most likely class.
Practically: your classifier doesn't need to assess precisely how close to C++ a certain input is; it merely needs to determine if the input is more like C than C++. This makes your work a lot easier - most of your current "unknown" cases will be close to one or two languages, even though they don't exceed your basic threshold.
Now, once you realize this, you will also see what training your classifier needs: not some random aspect of the sample files, but what sets two languages apart. Hence, when you have parsed your C samples, and your C++ samples, you will see that #include does not set them apart. However, class and template will be far more common in C++. On the other hand, #include does distinguish between C++ and Java.
There are of course other aspects besides keywords that you can use. For instance, the most obvious would be the frequency of {, and ; is similarly distinguishing. Another very useful feature for your classifier would be the comment tokens for the different languages. The basic problem of course would be automatically identifying them. Again, hardcoding //, /*, ', --, # and ! as pseudo-keywords would help.
This also identifies another classification rule: SQL will often have -- at the beginning of a line, whereas in C it will often appear somewhere else. Thus it may be useful for your classifier to take the context into account as well.
Use Google Code Search to learn weights for the set of keywords: #include in C++ gets 672,000 hits, in Python only ~5,000.
You can normalize the results by looking at the number of results for the language in total:
C++ gives about 770,000 files whereas Python returns 120,000.
Thus "#include" is extremely rare in Python files, but exists in almost every C++ file. (Now you still have to learn to distinguish C++ and C of course.) All that is left is to do the correct reasoning about probabilities.
You need to get some exclusiveness into your lookup data.
When training on the programming languages you expect, you should search for words typical of one or a few languages. If a word appears in several code files of the same language but in few or none of the other languages' files, that is a strong suggestion of that language.
So the score of a word could be calculated at the lookup side by selecting the words that are exclusive to a language or a group of languages. Find several of these words and get the intersection of these by adding the scores, and found your language you will have.
In an answer to your other question, someone recommended a naïve Bayes classifier. You should implement this suggestion because the technique is good at separating according to distinguishing features. You mentioned the while keyword, but that's not likely to be useful because so many languages use it—and a Bayes classifier won't treat it as useful.
An interesting part of your problem is how to tokenize an unknown program. Whitespace-separated chunks is a decent rough start, but going meaningfully beyond that will be tricky.

How do I improve breaking substitution ciphers programmatically?

I have written (am writing) a program to analyze encrypted text and attempt to break it using frequency analysis.
The encrypted text takes the form of each letter being substituted for some other letter, i.e. a->m, b->z, c->t, etc. All spaces and non-alphabetic characters are removed and uppercase letters are made lowercase.
An example would be:
Orginal input - thisisasamplemessageitonlycontainslowercaseletters
Encrypted output - ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl
Attempt at cracking - omieieaeanuhtnteeawtiorshylrsoaisehrctdlaethtootde
Here it has only got I, A and Y correct.
Currently my program cracks it by analysing the frequency of each individual character and mapping it to the character that appears at the same frequency rank in a non-encrypted text.
I am looking for methods and ways to improve the accuracy of my program, as at the moment I don't get many characters right. For example, when attempting to crack X characters from Pride and Prejudice, I get:
1600 - 10 letters correct
800 - 7 letters correct
400 - 2 letters correct
200 - 3 letters correct
100 - 3 letters correct.
I am using Romeo and Juliet as a base to get the frequency data.
It has been suggested to me to look at and use the frequency of character pairs, but I am unsure how to use this: unless I am using very large encrypted texts, I imagine an approach similar to what I am doing for single characters would be even more inaccurate and cause more errors than successes. I am also hoping to make my cracker more accurate for shorter inputs.
I'm not sure how constrained this problem is, i.e. how many of the decisions you made are yours to change, but here are some comments:
1) Frequency mapping is not enough to solve a puzzle like this; many frequencies are very close to each other, and if you aren't using the same text for the frequency source and the plaintext, you are almost guaranteed to have a few letters off no matter how long the text is. Different materials will have different usage patterns.
2) Don't strip the spaces if you can help it. This will allow you to validate your potential solution by checking that some percentage of the words exist in a dictionary you have access to.
3) Look into natural language processing if you really want to get into the language side of this. This book has all you could ever want to know about it.
Edit:
I would look into bigraphs and trigraphs first. If you're fairly confident of one or two letters, they can help predict likely candidates for the letters that follow. They're basically probability tables where AB would be the probability of an A being followed by a B. So assuming you have a given letter solved, that can be used to solve the letters next to it, rather than just guessing. For example, if you've got the word "y_u", it's obvious to you that the word is "you", but not to the computer. If you've got the letters N, C, and O left, bigraphs will tell you that YN and YC are very uncommon whereas YO is much more likely, so even if your text has unusual letter frequencies (which is easy when it's short) you still have a fairly accurate system for solving for unknowns. You can hunt around for a compiled dataset, or do your own analysis, but make sure to use a lot of varied text; a lot of Shakespeare is not the same as half Shakespeare and half journal articles.
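A minimal sketch of that idea, using the "y_u" example with candidates N, C and O; the tiny probability table and the fallback value are made up for illustration:

#include <iostream>
#include <map>
#include <utility>
#include <vector>

int main() {
    // P(second letter | first letter), only for the few pairs needed here.
    std::map<std::pair<char, char>, double> bigram = {
        {{'y', 'o'}, 0.10},  {{'o', 'u'}, 0.15},
        {{'y', 'n'}, 0.001}, {{'n', 'u'}, 0.01},
        {{'y', 'c'}, 0.001}, {{'c', 'u'}, 0.01},
    };
    auto p = [&](char a, char b) {
        auto it = bigram.find({a, b});
        return it != bigram.end() ? it->second : 1e-4;   // floor for unseen pairs
    };

    // Candidate letters left for the blank in "y_u".
    std::vector<char> candidates = {'n', 'c', 'o'};
    for (char c : candidates) {
        // Score = P(c follows 'y') * P('u' follows c); 'o' wins clearly.
        double score = p('y', c) * p(c, 'u');
        std::cout << 'y' << c << "u : " << score << '\n';
    }
}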
Looking at character pairs makes a lot of sense to me.
Every single letter of the alphabet can be used in valid text, but there are many pairs that are either extremely unlikely or will never happen.
For example, there is no way to get qq using valid English words, as every q must be followed by a u. If you have the same letters repeated in the encrypted text, you can automatically exclude the possibility that they represent q.
The fact that you are removing spaces from the input limits the utility somewhat, since combinations that would never exist in a single word, e.g. ht, can now occur if the h ends one word and the t begins another. Still, I suspect that these additional data points will enable you to resolve much shorter strings of text.
Also, I would suggest that Romeo and Juliet is only a good basis for statistical data if you intend to analyze writings of the period. There have been some substantial changes to spelling and word usage that may skew the statistics.
First of all, Romeo and Juliet probably isn't a very good basis to use. Second, yes digraphs are helpful (and so are trigraphs). For a substitution cipher like you're looking at, a good place to start would be the Military Cryptanalysis books by William Friedman.
Well, I have solved some simple substitution ciphers in my time, so I can speak freely.
Removing the spaces from the input string makes it nearly impossible to solve.
While it is true that most English sentences have 'e' in higher frequency, that is not all there is to the process.
The part that makes the activity fun, is the series of trial hypothesis/test hypothesis/accept or reject hypothesis that makes the whole thing an iterative process.
Many sentences contain the words 'of' and 'the'. Looking at your sentence and assuming that one of the two-letter words is 'of' implies further substitutions that can allow you to make inferences about other words. In short, you need a dictionary of high-frequency words to allow you to make further inferences.
As there could be a large amount of backtracking involved, it may be wise to consider a Prolog or Erlang implementation as a basis for developing the C++ one.
Best of luck to you.
Kindly share your results when done.
Single-letter words are a big hint (generally only "A" and "I", rarely "O"; casual language allows "K"). There is also a finite set of two- and three-letter words. No help if spaces have been stripped.
Pairs are much more diagnostic than you would think. For instance: some letters never appear doubled in English (though this is not absolute if the spaces have been stripped or if foreign vocabulary is allowed), while others commonly appear doubled; also, some heterogeneous pairs are very frequent.
As a general rule, no one analysis will provide certainty. You need to assign each cipher letter a set of possible translations with associated probabilities, and combine several tests until the probabilities become very significant.
You may be able to determine when you've gotten close by checking the Shannon Entropy.
Not a complete answer, but maybe a helpful pointer: you can use a dictionary to determine how good your plaintext candidate is. On a UNIX system with aspell installed, you can extract an English word list with the command
aspell -l en dump master
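For instance, a minimal sketch of such a dictionary check in C++ (it assumes the aspell output has been saved to a file named words.txt, and that spaces were kept in the candidate plaintext):

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_set>

int main() {
    // Load the word list dumped by aspell into a hash set, lowercased.
    std::unordered_set<std::string> dictionary;
    std::ifstream wordfile("words.txt");
    for (std::string w; wordfile >> w; ) {
        std::transform(w.begin(), w.end(), w.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        dictionary.insert(w);
    }

    // Candidate plaintext produced by some trial key.
    std::string candidate = "this is a sample message it only contains lowercase letters";
    std::istringstream words(candidate);
    int total = 0, known = 0;
    for (std::string w; words >> w; ) {
        ++total;
        if (dictionary.count(w)) ++known;
    }
    // A trial key that yields mostly dictionary words is probably close.
    std::cout << "dictionary hit rate: "
              << (total ? static_cast<double>(known) / total : 0.0) << '\n';
}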
You might try looking at pairs rather than individual letters. For instance, a t is often followed by an h in English, as is an s. Markov modeling would be useful here.
Frequency Analysis
Frequency analysis is a great place to start. However, Romeo and Juliet is not a very good choice to take character frequencies from to decipher Pride and Prejudice text. I would suggest using frequencies from this page because it uses 7 different texts that are closer in age to Pride and Prejudice. It also lists probabilities for digraphs and trigraphs. However, digraphs and trigraphs may not be as useful when spaces are removed from the text because this introduces the noise of digraphs and trigraphs created by words being mashed together.
Another resource for character frequencies is this site. It claims to use 'a good mix of different literary genres.'
Frequency analysis generally becomes more probabilistically correct with increased length of the encrypted text as you've seen. Frequency analysis also only helps to suggest the right direction in which to go. For instance, the encrypted character with the highest frequency may be the e, but it could also very well be the a which also has a high frequency. One common method is to start with some of the highest frequency letters in the given language, try matching these letters with different letters of high frequency in the text, and look to see if they form common words like the, that, is, as, and, and so on. Then you go from there.
A Good Introductory Book
If you are looking for a good layman's introduction to cryptography, you might try The Code Book by Simon Singh. It's very readable and interesting. The book looks at the development of codes and codebreaking throughout history. He covers substitution ciphers fairly early on and describes some common methods for breaking them. Also, he had a Cipher Challenge in the book (which has already been completed) that consisted of various codes to try to break, including some substitution ciphers. You might try reading through how the Swedish team broke these ciphers at this site. However, I might suggest reading at least through the substitution cipher part of the book before reading these solutions.
By the way I'm not affiliated in any way with the publication of this book. I just really enjoyed it.
Regarding digraphs, digrams and word approximations, John Pierce (who coined the name "transistor" and was a pioneer of PCM) wrote an excellent book, Introduction to Information Theory, that contains an extended analysis of calculating their characteristics, why you would want to, and how to locate them. I found it helpful when writing a frequency analysis decryption code myself.
Also, you will probably want to write an ergodic source to feed your system, rather than relying on a single source (e.g., a novel).
Interesting question; I asked a similar question :)
One thing I'm trying to find out and do is:
to scan the bigger words that have repeating letters in them,
then find a corresponding word with a similar pattern to the bigger word from the cipher.
The reason why is simply that the bigger the word, the more distinct deciphered letters you find at once, and bigger words are easier to decode, just as a bigger text is easier to decode: more chances to see patterns emerge :)