
Topic: 2009 National Excellent Doctoral Dissertation: Research on Data Mining Modeling and Its Applications in Bioinformatics

nanafly (original post, posted 2009-10-10)

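Author: Shen Hong-Bin
Dissertation title: Research on Data Mining Modeling and Its Applications in Bioinformatics
About the author: Shen Hong-Bin, male, born in August 1979, began his studies under Professor Yang Jie of Shanghai Jiao Tong University in April 2004 and received his doctorate in March 2007.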

Chinese Abstract (translated)
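With the rapid development of science and technology, the economy and society have made great progress, and at the same time large volumes of data have been produced in every field. How to discover valuable knowledge and regularities in these data has become both a focus and a difficulty of current theoretical and practical research. Meanwhile, the rapid development of life science technologies has also produced large amounts of biological data. Traditional biological experiments alone can hardly process so much biological data quickly and comprehensively, which inevitably constrains the development of life science and pharmaceutical engineering. Bioinformatics emerged in response. It is a young discipline formed at the intersection of biology and information science. It aims to apply the theories and methods of informatics, physics, chemistry, mathematics, computer science, and systems science to study the information content and information flow of biological systems and processes, to discover regularities and knowledge in existing data, and to use them to guide and explain biological experiments and life phenomena, accelerating our understanding of the essential characteristics of life. This dissertation carries out in-depth research on the theory and methods of data mining and bioinformatics.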
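Cluster analysis is an important part of data mining research and has become an important tool in many disciplines. In practice, however, one often has to process high-dimensional datasets, and in most cases the attributes of such datasets are unbalanced with respect to the individual clusters, that is, different attributes matter differently to different clusters. Based on this observation, this dissertation proposes a novel attribute-weighted kernel clustering algorithm that operates in the kernel feature space. Experiments show that the new algorithm reflects well the importance of each attribute to each cluster and therefore achieves better results than traditional clustering algorithms. Traditional clustering algorithms are usually applied to a single, independent dataset, but in many cases a dataset is related to other datasets. Based on information theory, this dissertation proposes a collaborative clustering algorithm that reflects the interactions between datasets; the results show that the clustering outcome is influenced by the other datasets. We also prove the convergence of both algorithms theoretically.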
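Protein folding is a deeper level of knowledge than protein three-dimensional structure and is therefore a more difficult research topic. At the same time, predicting the fold type from the protein sequence can provide highly valuable information for subsequently predicting the protein's three-dimensional structure. Starting from the complexity of biological systems, this dissertation proposes a novel protein fold prediction system based on an ensemble classifier framework, which fuses sequence information sources and features from multiple biological perspectives to make the prediction decision. The results show that the resulting ensemble prediction system is very effective, raising protein fold prediction accuracy by 6-21%.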
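The three-dimensional structure of a protein is an important attribute that characterizes its overall folding type. Even when proteins differ in sequence or in functional properties, their folding types or structural types may be similar. In view of this, Levitt and Chothia classified proteins into the following four structural classes: (1) all-α, (2) all-β, (3) α/β, and (4) α+β. Predicting a protein's structural class from its sequence is an important research topic in protein science. This dissertation is the first to combine a supervised clustering algorithm with a fuzzy system learning algorithm for protein tertiary structure prediction, improving the prediction accuracy. This work introduces fuzzy system learning into protein structure prediction for the first time and opens up new ideas for further research in bioinformatics.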
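Membrane proteins are a very important kind of protein. They account for about one third of all human proteins, yet known membrane protein structures account for only about 1%. One of the main functions of membrane proteins is to act as ion channels. Our cognition, sensation, and emotions all arise from the continual opening and closing of these channels, so the importance of membrane proteins to the human body is self-evident; for example, the ion channel protein phospholamban plays an important role in heart function. The great majority of diseases are caused by a deficiency of a particular membrane protein, and 80% of the drugs now on the market target membrane proteins. Studying the sequence features and three-dimensional structures of membrane proteins is therefore important for understanding their function and has become a hot topic in structural biology. At the same time, because membrane proteins are insoluble in water, determining their structures experimentally is very difficult, which poses both a challenge and a new research topic: using computational methods to predict membrane protein topology from sequence. This dissertation proposes a novel PsePSSM discrete model based on an ensemble classifier model and protein sequence evolutionary information, together with a protein attribute prediction framework that fuses sequence functional domain features with PsePSSM features. These were successfully applied to predicting membrane protein topology and enzyme functional families. On eight classes of membrane protein topology the new model achieves an accuracy above 85%, about 30% higher than traditional methods.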
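A protein's location in the cell is closely related to its function. Even when a protein's function is known, it is still very important to know where in the cell it performs that function. For example, the nucleus contains the cell's genetic material, DNA, and controls the cell's entire activity. With the successful implementation of the Human Genome Project, the number of newly discovered proteins has grown exponentially. According to statistics from the international protein database UniProtKB/Swiss-Prot, the number of protein entries reached 223,100 in June 2006, more than 56 times the 1986 figure. Faced with such rapid growth in the number of proteins, determining subcellular locations by biological experiments alone is an almost impossible task. There is an urgent need for bioinformatics methods that, building on existing knowledge, predict the subcellular locations of new proteins, in order to accelerate life science research and pharmaceutical engineering. This dissertation is the first internationally to propose and investigate (a) a prediction model for proteins that occur at multiple locations in the cell; (b) a prediction model for where proteins are located within the nucleus, i.e. a "sub-subcellular location prediction model", which has been recognized by the international academic community; and (c) an extension of subcellular localization prediction to cover 22 subcellular locations, which greatly improves the practical value of the prediction model. It also proposes a subcellular location prediction method that fuses high-level gene ontology features of the protein sequence with amino acid features of the sequence itself, and a new idea of organism-specific subcellular localization prediction. The results show that on stringent datasets the new methods achieve prediction accuracy more than 35% higher than traditional methods, and the tools developed are widely used in biological experiments.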
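To promote the application of the theoretical results, we also set up 15 online bioinformatics web servers during this research: http://www.csbio.sjtu.edu.cn/bioinf/. Biologists in related fields anywhere in the world need only submit their biological data over the Internet to receive results computed immediately by the site. According to incomplete statistics, the servers have been used more than 1,100,000 times, greatly promoting the translation of bioinformatics theory into applied results. Many biologists around the world have used data computed by our bioinformatics platforms in their published papers to validate their experimental results, and the platforms have been well received.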

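Keywords: data mining, cluster analysis, bioinformatics, machine learning, information theory, evidence theory, ensemble classifier, protein structure prediction, protein subcellular location prediction, membrane protein recognition, cellular networks, protein evolution theory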


nanafly (first reply, posted 2009-10-10)

Researches on data mining modeling theories and its applications in bioinformatics
Shen Hong-Bin
ABSTRACT
In the past decades, large amounts of data have been produced by the rapid development of science, the economy, and society. How to find valuable knowledge and rules behind these data is a critical problem and a hot topic in both theoretical and practical research. At the same time, biological data have increased exponentially with the development of various biological devices. Under such conditions, handling data of this size with conventional biological experiments alone is both very expensive and time-consuming. Bridging the gap between the volume of newly generated data and the understanding of the knowledge they contain has become a major challenge. Bioinformatics is a very young research field that tries to find the knowledge and rules behind biological data by combining information science, computer science, and physics with life science knowledge; these findings can in turn be used to explain biological phenomena. Bioinformatics research is expected to speed up life science research and drug discovery. In this paper, we focus on theoretical and practical research in data mining and bioinformatics.
Clustering analysis is one of the most important research areas in data mining. In the real world, we often have to deal with high-dimensional datasets in which, in most cases, different attributes contribute differently to each cluster. To address this problem, an attribute-weighted fuzzy kernel clustering algorithm is proposed. The new kernel clustering algorithm properly reflects the importance of each attribute for each cluster and hence yields much higher clustering accuracy than conventional clustering algorithms. Another situation we often encounter in the real world is that a dataset is independent of others yet also cooperates with them. Based on such cooperative constraints, a new information-based collaborative clustering algorithm is proposed. This algorithm takes into account the influence of other datasets, and the corresponding clustering results are more flexible.
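The idea that different attributes contribute differently to each cluster can be sketched in a few lines. The abstract does not give the update equations, so the code below is only a simplified stand-in and not the algorithm proposed in the thesis: it uses hard assignments in the input space (no kernel, no fuzzy memberships), and the weighting rule (inverse within-cluster variance, normalized per cluster) and all function names are assumptions made for illustration. It also assumes every cluster has at least one member.

```python
def attribute_weights(points, labels, n_clusters, eps=1e-9):
    """Per-cluster attribute weights (illustrative rule, not the thesis's).

    An attribute whose values are tightly grouped inside a cluster gets a
    large weight for that cluster; weights in each cluster sum to 1.
    """
    n_attrs = len(points[0])
    weights = []
    for k in range(n_clusters):
        members = [p for p, lab in zip(points, labels) if lab == k]
        inv_dispersion = []
        for j in range(n_attrs):
            mean_j = sum(p[j] for p in members) / len(members)
            var_j = sum((p[j] - mean_j) ** 2 for p in members) / len(members)
            inv_dispersion.append(1.0 / (var_j + eps))
        total = sum(inv_dispersion)
        weights.append([v / total for v in inv_dispersion])
    return weights


def weighted_sq_distance(x, center, w):
    """Squared distance in which attribute j is scaled by the weight w[j]."""
    return sum(wj * (xj - cj) ** 2 for wj, xj, cj in zip(w, x, center))


# Toy data: cluster 0 is tight in attribute 0, cluster 1 is tight in attribute 1.
points = [(1.0, 0.0), (1.1, 5.0), (0.9, 10.0),
          (0.0, 7.0), (5.0, 7.1), (10.0, 6.9)]
labels = [0, 0, 0, 1, 1, 1]
weights = attribute_weights(points, labels, n_clusters=2)
```

On the toy data, nearly all of the weight in cluster 0 goes to attribute 0 and nearly all of the weight in cluster 1 goes to attribute 1, so a distance computed with `weighted_sq_distance` largely ignores the attribute that is uninformative for that cluster.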
Prediction of protein folding patterns is one level deeper than that of protein structural classes, and hence is much more complicated and difficult. To deal with this challenging problem, an ensemble classifier was introduced. It was formed from a set of basic classifiers, each trained in a different parameter system, such as predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, and different dimensions of pseudo amino acid composition, all extracted from a training dataset. Their outcomes were combined through weighted voting to give a final decision when classifying a query protein. The recognition task was to find the true fold among the 27 possible patterns. The overall success rate thus obtained was 62% on a testing dataset in which most of the proteins have less than 25% sequence identity with the proteins used to train the classifier. This rate is 6-21% higher than the corresponding rates obtained by various existing NN (neural network) and SVM (support vector machine) approaches, suggesting that the ensemble classifier is very promising and might become a useful vehicle in protein science, as well as in proteomics and bioinformatics.
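The weighted voting step described above can be illustrated with a minimal sketch. The abstract does not say how the weights are chosen or how ties are broken, so the weights, the fold labels, and the function name below are hypothetical; only the general technique (sum the weights of the base classifiers that vote for each class, then pick the class with the largest total) is taken from the text.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine base-classifier outputs by weighted voting.

    predictions[i] is the label chosen by base classifier i and weights[i]
    is the (hypothetical) confidence assigned to that classifier.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per base classifier")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # The label with the largest accumulated weight wins.
    return max(scores, key=scores.get)


# Five hypothetical base classifiers, e.g. one per feature set.
base_predictions = ["fold_3", "fold_7", "fold_3", "fold_7", "fold_7"]
base_weights = [0.9, 0.4, 0.8, 0.3, 0.2]
winner = weighted_vote(base_predictions, base_weights)
```

Here `fold_7` receives three of the five votes, but `fold_3` wins because its two supporters carry more total weight (1.7 against 0.9). This is the practical difference between weighted voting and a plain majority vote.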
The structural class is an important attribute used to characterize the overall folding type of a protein. Proteins often have quite similar or identical folding patterns even if they consist of very different sequences or have different biological functions. In view of this, Levitt and Chothia classified proteins into the following four structural classes: (1) all-α, (2) all-β, (3) α/β, and (4) α+β. Prediction of the structural class from the sequence is both an important and a tempting topic in protein science. This is not only because the knowledge thus obtained can provide useful information about the overall structure of a query protein, but also because the practice itself can stimulate the development of novel predictors that may be applied directly to many other relevant areas. In this paper, a novel approach, the so-called “supervised fuzzy clustering approach”, is introduced; its distinguishing feature is that it uses the class label information during the training process. Based on this approach, a set of “if-then” fuzzy rules for predicting protein structural classes is extracted from a training dataset. Tests on three different working datasets show that the overall prediction success rates obtained by the supervised fuzzy clustering approach are all higher than those of the unsupervised fuzzy c-means approach introduced by previous investigators. It is anticipated that the current predictor may play an important complementary role to other existing predictors in this area, further strengthening the power to predict the structural classes of proteins and their other characteristic attributes.
As a “building block of life”, the cell is deemed the most basic structural and functional unit of all living organisms. It is highly organized, with many functional units or organelles according to the cellular anatomy. Most of these units are “enveloped” by one or more membranes, which are the structural basis for many important biological functions. Membrane proteins are a special group among the protein families: they account for ~30% of all proteins, yet solved membrane protein structures represent <1% of the protein structures known to date. This class of proteins constitutes the majority of ion channels, transporters, and receptors in living organisms; for example, phospholamban is an integral membrane protein that regulates the Ca2+ pump in the heart. Because of their importance, membrane proteins are the targets of approximately 80% of the drugs on the market. Hence, solving the structures of membrane proteins plays a key role in modern life science research. Due to the intrinsic structural plasticity associated with many of these proteins, the chance of obtaining crystals suitable for X-ray or electron diffraction studies is small. Although helical membrane proteins pose a higher degree of experimental difficulty, their conformation is, in a number of ways, more predictable than that of water-soluble proteins. In this paper, we propose a novel discrete model of the protein sequence, PsePSSM, and an ensemble classifier framework to predict the topology of membrane proteins in the cell membrane. Experimental results on a stringent dataset show that the prediction accuracy for membrane protein topology across the 8 classes is more than 85%, which is about 30% higher than that of conventional methods.
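The abstract names PsePSSM as a discrete (fixed-length) model of a protein sequence but does not define it. The sketch below shows one way such a descriptor can be built from a position-specific scoring matrix, by analogy with pseudo amino acid composition: the per-column means, followed by lagged squared-difference terms that retain some sequence-order information. This formulation, the function name, and the `max_lag` parameter are assumptions for illustration, not a statement of the thesis's exact definition; a real PSSM would have 20 columns and would normally be normalized first.

```python
def pse_pssm(pssm, max_lag):
    """Turn an L x C scoring matrix into a vector of length C * (1 + max_lag).

    pssm is a list of L rows (one per sequence position), each holding C
    scores. The output length does not depend on L, so proteins of different
    lengths map to vectors of the same size.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Composition-like part: mean score of every column.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    # Order-related part: mean squared difference between positions `lag` apart.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum((pssm[i][j] - pssm[i + lag][j]) ** 2
                        for i in range(length - lag))
            features.append(total / (length - lag))
    return features


# Toy matrix with 3 positions and 2 columns (a real PSSM has 20 columns).
toy = [[0.0, 1.0],
       [2.0, 1.0],
       [4.0, 1.0]]
features = pse_pssm(toy, max_lag=1)  # [2.0, 1.0, 4.0, 0.0]
```

The first two values are the column means, and the last two are the lag-1 terms: column 0 changes by 2 at every step, giving 4.0, while column 1 is constant, giving 0.0. A fixed-length vector of this kind is what allows sequences of different lengths to be fed to the same classifier.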
Knowledge of a protein's location in the cell is closely related to its function. Even when the functional characteristics of a protein are known, it is still critical to know where in the cell the protein functions. One of the fundamental goals in molecular cell biology and proteomics is to identify the subcellular locations or environments of proteins, because the function of a protein and its role in a cell are closely correlated with the compartment or organelle in which it resides. The number of known sequences has grown rapidly: in 1986 the SWISS-PROT databank contained only 3,939 protein sequence entries, whereas the version released in June 2006 at http://www.ebi.ac.uk/swissprot/ contained 223,100, more than 56 times the 1986 figure. With the avalanche of protein sequences generated in the post-genomic era, it is highly desirable to develop an automated method for fast and reliable annotation of the subcellular locations of uncharacterized proteins. The knowledge thus obtained can help us make timely use of these newly found protein sequences in both basic research and drug discovery. In this paper, a) we are the first in the literature to propose a prediction algorithm that captures the dynamic feature that proteins may simultaneously exist at, or move between, two or more different subcellular locations, i.e. a model that can deal with proteins with multiple subcellular location sites; b) we are the first to propose a model for the sub-subcellular location problem, i.e. prediction of protein subnuclear locations; c) we have for the first time extended the prediction scope to cover 22 subcellular locations, which greatly improves the practical value of the computational models. We have also proposed combining “high-level” gene ontology features with “ab-initio” sequence features to predict protein subcellular locations, and we have introduced the “organism-specific” idea in developing protein subcellular location prediction models. Experimental results on stringent datasets show that the performance of the new models proposed in this paper is 35% higher than that of conventional methods. All of this work has been accepted and used by other international researchers.
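The database growth figure quoted above is easy to check. The short sketch below recomputes the ratio from the two entry counts given in the abstract; the annualized rate is an extra illustration that assumes constant compound growth over the 20 years, which the abstract does not claim.

```python
entries_1986 = 3939     # SWISS-PROT entries in 1986, as quoted in the abstract
entries_2006 = 223100   # entries in the June 2006 release, as quoted

# Ratio of the two counts: the "more than 56 times" statement.
fold_increase = entries_2006 / entries_1986

# Assumption (not from the abstract): constant compound growth over 20 years.
years = 2006 - 1986
annual_growth = fold_increase ** (1 / years) - 1

print(f"{fold_increase:.1f}x overall, "
      f"{annual_growth:.1%} per year under the constant-growth assumption")
```

The ratio comes out between 56 and 57, which is consistent with the wording in the abstract.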