We have developed an EP/ES non-linear SVRc hybrid that should, in theory, optimize non-linear survival-system performance as well as assess the efficacy of drug treatments. This new hybrid is currently being implemented and tested, and the results will be compared with those obtained using the "gold standard" Cox PH and K-M approaches. This is joint work with Xingye Qiao, Dan Margolis and Ron Gottlieb.
Stability is an important yet under-addressed issue in feature selection from high-dimensional, small-sample data. In this paper, we show that the stability of feature selection depends strongly on sample size. We propose a novel framework for stable feature selection which first identifies consensus feature groups from subsamples of the training data, and then performs feature selection by treating each consensus feature group as a single entity. Experiments on both synthetic and real-world data sets show that an algorithm developed under this framework is effective at alleviating the problem of small sample size and leads to more stable feature selection results and comparable or better generalization performance than state-of-the-art feature selection algorithms. Synthetic data sets and algorithm source code are available at http://www.cs.binghamton.edu/~lyu/KDD09/.
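To make the consensus-group idea concrete, here is a minimal sketch under stated assumptions; it is not the authors' KDD'09 algorithm, and the correlation-based scoring, the hierarchical clustering step, and the group count `n_groups` are all illustrative stand-ins. Features whose relevance profiles agree across random subsamples are grouped, and selection then ranks whole groups rather than individual features.

```python
# Minimal sketch of consensus feature groups (illustrative, not the
# authors' exact algorithm): score features on many random subsamples,
# group features whose score profiles agree, then rank whole groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def consensus_feature_groups(X, y, n_subsamples=50, frac=0.8,
                             n_groups=20, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.empty((n_subsamples, p))
    for b in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        Xc = X[idx] - X[idx].mean(axis=0)
        yc = y[idx] - y[idx].mean()
        # per-subsample relevance: |correlation| of each feature with y
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
        scores[b] = np.abs(Xc.T @ yc) / denom
    # features with similar score profiles across subsamples form a group
    labels = fcluster(linkage(scores.T, method="average"),
                      t=n_groups, criterion="maxclust")
    # rank the groups by their mean score across all subsamples
    ranked = sorted(np.unique(labels),
                    key=lambda g: scores[:, labels == g].mean(),
                    reverse=True)
    return labels, ranked  # each group is then treated as one entity
```

A selected group would then be collapsed to a single entity, for example the mean of its member features, before training the final classifier.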
Bioinformatics is a rapidly developing field that is making major contributions to our understanding of biological complexity. Despite these advances, numerous challenges remain. I will highlight a few of them, including: 1) the small-n, large-p problem; 2) the high dimensionality of biological data; 3) multiple comparisons; 4) controlling for variability in automated scoring of biological traits; 5) integrating multiple types of biological data; and 6) making solutions accessible to biologists. I will provide biological examples of each of these challenges, usually focusing on Next Generation Sequencing (NGS) applications in genomics.
I have attached a PDF that highlights some challenges for genome-wide association studies. Interested parties are welcome to read the paper, but doing so is certainly not expected.
High-dimensional data exist everywhere in our lives and in every sector of our society, in every modality of data we live with today, including text, imagery, audio, video, and graphics. Pattern change discovery from high-dimensional data sets is a general problem that arises in almost every real-world application; examples include concept drift mining in text data, event discovery in surveillance video data, event discovery in news data, hot topic discovery in the literature, image pattern change detection, and genome sequence change detection in bioinformatics, to name just a few.
This work investigates the general problem of pattern change discovery between high-dimensional data sets. Current methods either focus mainly on magnitude change detection in low-dimensional data sets or operate under supervised frameworks. The notion of the principal angles between subspaces is introduced as a measurement of the subspace difference between two high-dimensional data sets. Principal angles have the useful property of isolating subspace change from magnitude change. To address the challenge of directly computing the principal angles, we use matrix factorization as a statistical framework and develop the principle of dominant subspace mapping to transfer the principal-angle-based detection to a matrix factorization problem. We show how matrix factorization can be naturally embedded into a likelihood ratio test based on linear models. The proposed method is unsupervised and addresses the statistical significance of pattern changes between high-dimensional data sets. We showcase this solution in several real-world applications to demonstrate its power and effectiveness. A link to the full paper can be found here.
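For concreteness, the sketch below computes the principal angles themselves via the standard QR/SVD recipe; it is a minimal illustration of the underlying measurement, not the paper's matrix-factorization test. SciPy ships the same computation as scipy.linalg.subspace_angles.

```python
# Principal angles between the column spaces of two data matrices.
# Standard recipe: orthonormalize, then take singular values of Qa^T Qb.
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians) between span(A) and span(B)."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # singular values of Qa^T Qb are the cosines of the principal angles
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# A pure magnitude change leaves the angles at (numerically) zero:
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = 3.0 * X @ rng.standard_normal((5, 5))  # same subspace, rescaled/mixed
print(principal_angles(X, Y))              # all angles near zero
```

This is the property the abstract relies on: the angles respond to a change in the subspace occupied by the data, not to a change in its scale.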
In biomedical science, data mining techniques have been applied to extract statistically significant and clinically useful information from a given dataset. Finding biomarker gene sets for diseases can aid in understanding disease diagnosis, prognosis, and therapy response. Gene expression microarrays have played an important role in such studies, and yet their analysis has also drawn criticism: these datasets carry a high risk of over-fitting (discovering spurious patterns) because of their feature-rich but case-poor nature. This paper describes a GA-SVM hybrid combined with Gaussian noise perturbation (with a manual noise gain) to combat over-fitting, determine the strongest signal in the dataset, and discover stable biomarker sets. A colon cancer gene expression microarray dataset is used to show that the strongest signal in the data (the optimal noise gain, at which a modest number of similar candidates emerge) can be found by a binary search. The noise perturbation reduces the diversity of the candidates (measured by cluster analysis), indicating that some of the patterns, we hope mostly spurious ones, are being eliminated. Initial biological validation has been performed, and genes show differing levels of significance across the candidates, although the discovered biomarker sets should be studied further to ascertain their biological significance and clinical utility. Furthermore, statistical validation indicates that the strongest signal in the data is spurious, and the discovered biomarker sets should therefore be rejected.
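As an illustration of the search procedure, the sketch below bisects on the noise gain. The GA-SVM hybrid itself is abstracted behind `run_selection`, a hypothetical stand-in that returns the surviving candidate biomarker sets, and the stopping criterion `target` is an assumed proxy for "a modest number of similar candidates emerge".

```python
# Sketch of the noise-gain binary search (assumptions: `run_selection`
# stands in for the GA-SVM hybrid; `target` is an assumed criterion).
import numpy as np

def perturb(X, gain, rng):
    # additive Gaussian noise scaled by the manual noise gain
    return X + gain * rng.standard_normal(X.shape)

def strongest_signal_gain(X, y, run_selection, lo=0.0, hi=2.0,
                          target=10, tol=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    while hi - lo > tol:
        gain = 0.5 * (lo + hi)
        candidates = run_selection(perturb(X, gain, rng), y)
        if len(candidates) > target:
            lo = gain   # too many surviving patterns: raise the noise
        else:
            hi = gain   # too few survive: back the noise off
    return 0.5 * (lo + hi)
```

The bisection exploits the assumed monotone effect of the noise gain: more noise eliminates more (ideally spurious) patterns, so the candidate count shrinks as the gain grows.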
Re-identification in networked data models involves testing procedures for the identification of similar observations. We consider this testing problem from first principles: we derive probability distributions for a version of a similarity score under three well-known network data models. Our method is unique in that it suggests a sufficiency property for (at least) these distributions, an unexplored area of network/graphical modeling.