The "compiled reference material" would be the "hallmark" that any sample is tested by.
Isn't a list of that caliber usually decided upon by peer review?
How to acquire a complete Haber "set", that would be difficult, to say the least.
Yes, peer reviewers would decide through their critique whether the data set was adequate. But they wouldn't decide what the experimenter should or should not use. The experimental design and its data set are decided upon solely by the researcher. The reviewers' job is to judge the quality of the work, its adequacy, and whether the conclusions are justified by the information and techniques used. They can (and do) make suggestions about where the design can be improved.
True; collecting every word ever written by Haber would be impossible. We don't have access to the entire body of his writings. But every word that can be discovered should be included. Cherry picking equals experimental failure.
Context also plays a part. It possibly wouldn't be of much use to take as a sampling a technical manual written by Haber and compare that against Titor's informal online posts. Writing a tech manual usually involves the writer carefully choosing his or her words, whereas online posts are generally written on the fly. The specific words, doublets, triplets, etc. used by a writer when producing a tech manual could well have little correlation with the same person's informal writing (though that's a question that can only be answered through a separate experiment).
Author identification is a very difficult task. Worse, in the end the most accurate analysis only tends to eliminate candidates rather than unambiguously identifying matches. The Chi Square Test, for example, only tells you the degree of confidence that the result varies from expected randomness.
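For anyone who wants to see the arithmetic, here is a minimal sketch of a Pearson chi-square statistic computed over two word-count tables. It is only an illustration of the general technique, not the procedure we ran in 2004; the function name `chi_square_statistic`, its arguments, and the idea of comparing counts over a shared vocabulary are all assumptions made for the example.

```python
def chi_square_statistic(counts_a, counts_b, vocabulary):
    """Pearson chi-square statistic for a 2 x k table of word counts.

    counts_a, counts_b: dicts mapping word -> count for each sample.
    vocabulary: the words to compare (hypothetical choice of the analyst).
    Returns (statistic, degrees_of_freedom).
    """
    row_a = [counts_a.get(word, 0) for word in vocabulary]
    row_b = [counts_b.get(word, 0) for word in vocabulary]
    total_a, total_b = sum(row_a), sum(row_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one counted word")
    grand_total = total_a + total_b

    statistic = 0.0
    columns_used = 0
    for a, b in zip(row_a, row_b):
        column_total = a + b
        if column_total == 0:
            continue  # word absent from both samples, contributes nothing
        columns_used += 1
        # Expected counts if both samples drew words at the same rates.
        expected_a = total_a * column_total / grand_total
        expected_b = total_b * column_total / grand_total
        statistic += (a - expected_a) ** 2 / expected_a
        statistic += (b - expected_b) ** 2 / expected_b

    # (rows - 1) * (columns - 1) with two rows
    degrees_of_freedom = max(columns_used - 1, 0)
    return statistic, degrees_of_freedom
```

The statistic is then compared against the chi-square distribution with the returned degrees of freedom. With small samples many expected counts become very small, which makes the test unreliable, and that is one reason sample size matters so much here.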
When we originally performed the Chi Square based linguistic analysis back in 2004, we did include every word written by Titor/TTO in his posts and did the same for both the randomly chosen control group and the experimental group. We also stated that no group (control or experimental) really provided a sufficiently large body of written material for a true analysis. A true analysis usually requires 100,000 to 200,000 words. I checked my old files a few minutes ago and the Titor/TTO concordance only contains ~4,600 words. He really didn't make very many posts. When he did post, a large proportion of the text was in the form of direct quotes from other posters. The quotes were eliminated because, obviously, they were not his writing. We didn't include in the Titor concordance the posts submitted by Pamela because we had no way of being sure that he wrote the material that she reposted. That was a judgement call, but we believed that it was the correct call.
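As a rough illustration of what building a concordance with the quotes stripped out can look like, here is a small sketch. It assumes, purely for the example, that quoted material appears on lines starting with ">"; the real forum software marked quotes differently, and `build_concordance` and `quote_prefix` are hypothetical names, not anything from our 2004 files.

```python
import re
from collections import Counter

# Lowercase words, allowing apostrophes so "didn't" stays one token.
WORD_PATTERN = re.compile(r"[a-z']+")

def build_concordance(posts, quote_prefix=">"):
    """Count the words in a poster's own text, skipping quoted lines.

    posts: iterable of post bodies as strings.
    quote_prefix: assumed marker for lines quoted from other posters.
    """
    concordance = Counter()
    for post in posts:
        for line in post.splitlines():
            if line.lstrip().startswith(quote_prefix):
                continue  # quoted from someone else, not the author's writing
            concordance.update(WORD_PATTERN.findall(line.lower()))
    return concordance

if __name__ == "__main__":
    sample_posts = ["> you said this\nI said that", "I agree"]
    words = build_concordance(sample_posts)
    print(sum(words.values()), "words counted")
```

Summing the counts gives the concordance size, which is the figure to weigh against the 100,000 to 200,000 words a proper analysis would want.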
Pamela herself was suspected by a few people of being Titor. Thus, in both his and her concordances we could only include writings that were posted directly under their own names through their own TTI and Post-2-Post accounts. Mixing the two together by including materials posted by her with the claim that it was actually written by him would have skewed the results, especially given the small size of the sampling, toward an increased confidence that the results were not random (a possible match) - not because it was a true match but because we would have taken two separate concordances and combined them into one. Of course they would be similar in that case.