A primer of signal detection theory

11/18/2023

Collections of genetic sequences belonging to related organisms contain information on the evolutionary constraints to which the organisms have been subjected. Heavily constrained regions can be investigated to understand their roles in an organism’s life cycle, and drugs can be sought to disrupt these roles. In organisms with low genetic diversity, such as newly-emerged pathogens, it is key to obtain this information early to develop new treatments. Here, we present methods that ensure we can leverage all the information available in a low-signal, low-noise set of sequences, to find contiguous regions of relatively conserved nucleic acid. We demonstrate the application of these methods by analysing over 5 million genome sequences of the recently-emerged RNA virus SARS-CoV-2 and correlating these results with an analysis of 119 genome sequences of SARS-CoV. We propose the precise location of a previously described packaging signal, and discuss explanations for other regions of high conservation.

A key challenge in the development of antivirals is the relative lack of targets that are sufficiently conserved across multiple families of viruses for drugs to exhibit a broad spectrum of activity. Coupled with the need to avoid host toxicity from drug cross-reactivity with host molecules, the identification of drug targets that control key lifecycle features of viruses, but that do not have close analogues in the host or in unrelated viruses, becomes an important goal. The identification and functional characterisation of such targets is a difficult informatic and biological problem, precisely because of the lack of analogues for comparison. One class of approaches to this problem seeks to find regions of high conservation in viral nucleic acid that cannot be explained by conservation in resultant amino acids. The idea of these approaches is that such nucleic acid conservation may correlate with previously uncharacterised lifecycle features. Such approaches have, for example, found packaging signals and previously undescribed proteins in influenza A virus [1, 2, 3], noted a previously undescribed open reading frame in enteroviruses [4], and identified conserved structural elements in group A rotaviruses [5]. Many such approaches have used a sliding window technique to find conserved regions in sequences: such a technique is often adequate, as demonstrated by its successes, and is particularly appropriate when the length scale of the expected regions is known, but otherwise it imposes a length scale on the problem that may make it more difficult to discover regions differing from this scale. The difficulty may be overcome by repeating analyses with different window sizes, at a cost of increased computational and analytical time and increased risk of false-positive signals. We have previously described a scale-agnostic approach to this problem [6], showing that this approach could find previously discovered conserved regions of nucleic acid in influenza A virus and group A rotaviruses, and successfully applied this approach to a new analysis of HIV-1 [7]. Such an approach is more efficient at finding conserved regions of unexpected length scales, potentially at the cost of being less efficient at finding conserved regions at tuned length scales.

All the approaches described so far work in RNA virus genome datasets in which relatively high variability is seen between different sequenced samples. This means that both “signal” (conserved) and “background” (variable) regions are seen to exhibit variability, but the lower variability in signal regions can be detected. However, even amongst RNA virus genome datasets there may be found examples where the overall variability between samples is low and so it is more challenging to distinguish signal from background. There are two important classes of such datasets. The first class contains datasets of truly newly emerged viruses, such as SARS-CoV-2, where the dataset available represents well the evolutionary space the human-tropic virus has explored but variability is still low because limited time means the overall explored region is small. The second class contains datasets of highly adapted viruses, such as the flaviviruses, which have likely co-evolved with more than one host species for timescales around 100 million years [8].
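
As an illustration of the sliding-window technique mentioned above, the following is a minimal sketch, not the method of any of the cited studies. It assumes the sequences are already aligned to equal length, scores each alignment column by its Shannon entropy, and averages those scores over a fixed window. The function names, the toy alignment, and the choice of entropy as the variability measure are all assumptions made for illustration. In particular, the sketch measures raw nucleotide variability and makes no attempt to separate it from conservation explained by the encoded amino acids.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; gaps and ambiguity codes are ignored."""
    counts = Counter(base for base in column if base in "ACGTU")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def sliding_window_variability(sequences, window):
    """Mean per-column entropy for every window start position (hypothetical helper)."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")
    if not 1 <= window <= length:
        raise ValueError("window must be between 1 and the alignment length")
    entropies = [column_entropy(col) for col in zip(*sequences)]
    return [
        sum(entropies[i:i + window]) / window
        for i in range(length - window + 1)
    ]


# Toy alignment, made up for illustration: positions 0-3 are identical
# in every sample, positions 4-7 differ in every sample.
alignment = [
    "ACGUACGU",
    "ACGUCAUG",
    "ACGUGUAC",
    "ACGUUGCA",
]

# Repeating the scan with several window sizes, as the text describes.
for window in (2, 4):
    scores = sliding_window_variability(alignment, window)
    most_conserved_start = min(range(len(scores)), key=scores.__getitem__)
    print(window, most_conserved_start, scores)
```

The window size has to be chosen in advance, which is the imposed length scale the text refers to: a conserved region much shorter or longer than the window is diluted or split across windows, and rescanning with further window sizes multiplies both the running time and the number of comparisons made.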
