The reliable identification of IGHD genes within human immunoglobulin heavy chains is challenging, with up to one third of rearrangements having no identifiable IGHD gene. The short, mutated IGHD genes are generally assumed to be indistinguishable from the N-REGIONs of non-template-encoded nucleotides that surround them. In this study we have characterised N-REGIONs, demonstrating the importance of nucleotide composition biases in the addition process, including the formation of homopolymer tracts. We then use a simulation approach to determine the likelihood of misidentification of highly mutated IGHD genes among the JUNCTION nucleotides. These likelihoods provide general rules for the identification of mutated D-REGIONs, and suggest that longer D-REGIONs (>25 nucleotides) with as many as ten mutations can be identified with a low risk of error. Shorter D-REGIONs (>16 nucleotides) with as many as four mutations are also identifiable. The reliability of different alignments depends on the junction length (combined N-REGIONs and D-REGION). Data are presented that can guide the alignment of sequences with junction lengths from 5 to 50 nucleotides, including explicit selection between two D-REGION possibilities. The use of such a statistically based approach to the alignment of IGHD genes will improve the reliability of the partitioning of immunoglobulin sequences, and this in turn will facilitate the study of the many processes that contribute to the diversity of the immunoglobulin repertoire. © 2007 Elsevier B.V. All rights reserved.