A team of researchers at New England Biolabs Inc. (NEB) has found that sequenced DNA samples held in public databases have a higher-than-expected rate of low-frequency mutation errors. In their paper published in the journal Science, the team describes how they created an algorithm that calculates an error rate for samples in a database, and what it showed when run on two public genome databases.
Researchers studying the role DNA plays in the cell mutations that lead to cancerous tumors rely on the accuracy of databases that hold sequencing information. Those looking for commonalities among different groups of people, for example, use such databases when attempting to isolate trends. Such studies compare the genomes of people who carry low-frequency mutations with those of the general population and use the results to build cancer datasets. The work done by the team at NEB now calls the accuracy of public databases into question, and with it the accuracy of the cancer datasets built on them.
To measure the accuracy of a given dataset, the researchers created an algorithm that counts the sequences showing mutations caused by damage during the sequencing process versus those showing mutations that occurred naturally. The team then used the algorithm to calculate error rates for several public databases, most notably the 1000 Genomes Project and part of the TCGA database, and report error rates of 41 percent and 73 percent, respectively.
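The article does not describe how the algorithm works internally, so the following is only a rough sketch of one way such a count could be set up, not the team's method. It assumes, as an illustration, that damage-induced mismatches appear unevenly between the two reads of a read pair while genuine variants appear about equally in both. The input layout, the function names, and the 1.5 threshold are all hypothetical.

```python
from collections import Counter

def imbalance_score(mismatches, ref, alt):
    """Ratio of ref>alt mismatch counts seen in read 1 versus read 2.

    `mismatches` is a hypothetical list of (read_number, ref_base, alt_base)
    tuples, one per observed mismatch. A score near 1.0 means the mismatch
    type is balanced between the two reads.
    """
    counts = Counter(mismatches)
    in_read1 = counts[(1, ref, alt)]
    in_read2 = counts[(2, ref, alt)]
    if in_read2 == 0:
        # Avoid division by zero: no evidence at all counts as balanced.
        return float("inf") if in_read1 else 1.0
    return in_read1 / in_read2

def fraction_flagged(datasets, ref, alt, threshold=1.5):
    """Share of datasets whose imbalance score exceeds an assumed threshold."""
    flagged = sum(
        1 for mismatches in datasets.values()
        if imbalance_score(mismatches, ref, alt) > threshold
    )
    return flagged / len(datasets)

# Toy data, invented for illustration only.
datasets = {
    "sample_a": [(1, "G", "T")] * 30 + [(2, "G", "T")] * 10,  # imbalanced
    "sample_b": [(1, "G", "T")] * 10 + [(2, "G", "T")] * 10,  # balanced
}
share = fraction_flagged(datasets, "G", "T")
```

With the toy data above, one of the two samples exceeds the assumed threshold, so `share` is 0.5. Percentages like those reported for the public databases would come from a calculation of this general shape, applied to real sequencing runs with a validated scoring rule.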
The researchers note that their algorithm cannot reveal the source of the unnatural damage, but suggest it is likely due to certain sample preparation techniques used prior to sequencing. They also point out that other algorithms already exist that let those doing the sequencing check their own work for errors, but that these have not been widely used for lack of a compelling reason, and they suggest that DNA sequencers begin using them. Finally, they note that new tools have been developed that could help minimize DNA damage during preparation, and that their use could improve the accuracy of public databases.
Most researchers rely in their studies on information drawn from databases built by comparing the human genome with the mutations that occur in its genes.
Recently, researchers developed an algorithm that can calculate the error rate in samples, and used it to calculate error rates in public databases available online.
The researchers found error rates of 41% and 73%, most notably in the 1000 Genomes Project and part of the TCGA database, respectively.
The cause of these high error rates is not yet known, but the researchers suggest they may stem from a problem in preparing DNA samples prior to DNA sequencing.
This finding raises the question of how far these databases can be trusted, given that many studies, cancer studies in particular, rely on them.
This study was published in the journal Science.