
ADSPA: Anticancer Drug Sensitivity Prediction Analysis

ElasticNet[1] is a linear regression model that combines Lasso and Ridge regression: it is trained with both the L1 and L2 norms as regularization terms. This combination lets the model learn a sparse solution with only a few non-zero coefficients, like Lasso, while still retaining some of the regularization properties of Ridge. The convex combination of L1 and L2 is controlled by the l1_ratio parameter (0 <= l1_ratio <= 1): l1_ratio = 0 gives a pure L2 penalty, l1_ratio = 1 a pure L1 penalty, and 0 < l1_ratio < 1 a mixture of the two. Elastic nets are fitted iteratively.
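As a minimal sketch of how the l1_ratio parameter is used in practice (scikit-learn's ElasticNet on synthetic data; the alpha and l1_ratio values are illustrative only, not those used by ADSPA):

```python
# Fit an elastic net on a synthetic expression-like matrix X and response y.
# alpha controls overall regularization strength; l1_ratio mixes L1 and L2.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))                               # 100 samples x 500 features
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=100)    # synthetic response

model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000)  # 0.5 = equal L1/L2 mix
model.fit(X, y)

print("non-zero coefficients:", np.sum(model.coef_ != 0))     # sparse, Lasso-like solution
```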

Elastic nets are useful when many features are correlated. Lasso is likely to pick only one of a group of correlated features at random, whereas an elastic net tends to keep several. In practice, one advantage of this trade-off between Lasso and Ridge is that it lets the elastic net inherit some of Ridge's stability under rotation.

Random forest[2] is an ensemble learning method consisting of multiple decision trees, each of which is a classifier. For an input sample, N trees produce N classification results; the random forest aggregates all of the votes and outputs the category with the most votes as the final prediction.

RF performs well on large data sets because of the two sources of randomness it introduces (bootstrap sampling of the data and random selection of features at each split), which make the forest resistant to overfitting and give it good noise tolerance. It can handle high-dimensional data directly without feature selection, processes both discrete and continuous data, trains quickly, provides a ranking of feature importance, and is easy to parallelize, so it has clear advantages on large data sets.
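A minimal sketch of these points with scikit-learn's random forest on synthetic data (hyperparameters and data are illustrative only): the model exposes a feature-importance ranking and trains trees in parallel via n_jobs.

```python
# Random forest regression on synthetic data, with feature-importance ranking.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)  # parallel training
rf.fit(X, y)

ranking = np.argsort(rf.feature_importances_)[::-1]  # built-in feature importance
print("top features:", ranking[:5])
```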

However, RF has difficulty learning combined features, and because every split made while growing each decision tree is a locally optimal choice, the final result is not guaranteed to be globally optimal.

SVR (support vector regression)[3] is an important application of SVM (support vector machine). The difference between SVR and SVM classification is that SVM separates the sample points into two classes, whereas in SVR all sample points ultimately belong to a single class. The optimal hyperplane SVR seeks is therefore not the one that separates two or more classes of sample points as widely as possible, as in SVM, but the one that minimizes the total deviation of all sample points from it. In other words, SVM maximizes the "distance" from the nearest sample points to the hyperplane, while SVR minimizes the "distance" from the farthest sample points to the hyperplane. For SVR, the task is to find a surface or function that fits all of the data, so that every data point, regardless of class, lies as close as possible to it.

Unlike traditional regression methods, in which a prediction is considered correct only when f(x) exactly equals y, support vector regression (SVR) considers a prediction acceptable, and incurs no loss, as long as the deviation between f(x) and y is not too large. Concretely, a threshold alpha is set, and the loss is computed only for data points with |f(x) - y| > alpha. As shown in the figure below, values inside the dotted tube are considered correct, and only values outside the dotted tube contribute to the loss.
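A minimal sketch of this idea with scikit-learn (the threshold called alpha above corresponds to the `epsilon` parameter of sklearn's SVR; data and hyperparameters are illustrative only):

```python
# Epsilon-insensitive support vector regression: points inside the tube incur no loss.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.05, size=80)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)  # |f(x) - y| <= 0.1 contributes no loss
svr.fit(X, y)
print("support vectors:", len(svr.support_))  # only points outside the tube become support vectors
```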

FM (factorization machine)[4] was proposed in 2010. It essentially adds a polynomial term to a linear model to describe second-order interactions between features. In addition, a k-dimensional (k << n) auxiliary vector V is introduced for each feature, and each original pairwise interaction parameter is represented by the inner product of the corresponding auxiliary vectors, which solves the problem that most interaction parameters cannot be trained adequately under sparse conditions. Finally, the quadratic term is simplified so that the time complexity of the model becomes linear.
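A minimal NumPy sketch of the FM prediction equation with the standard linear-time simplification of the pairwise term; the weights here are random placeholders (in practice they are learned), so this only illustrates the scoring formula:

```python
# FM prediction: y(x) = w0 + sum_i w_i x_i
#   + 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 4                                   # n features, k-dimensional auxiliary vectors (k << n)
x = rng.integers(0, 2, size=n).astype(float)   # a sparse binary feature vector
w0, w = 0.1, rng.normal(size=n)                # global bias and linear weights
V = rng.normal(scale=0.1, size=(n, k))         # auxiliary vector for each feature

linear = w0 + w @ x
s = V.T @ x                                    # sum_i V[i,f] * x_i, for each factor f
pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
print("FM prediction:", linear + pairwise)
```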

Advantages: the second-order cross features improve the expressive power of the model; the latent (auxiliary) vectors alleviate the parameter-training problem caused by sparse data; and the model complexity is reduced to linear, remaining linear even when extended to higher-order feature combinations.

Drawbacks: only second-order feature interactions are considered, and no deep model is involved. Using the same latent vector for a feature across all of its combinations ignores the fact that a feature's importance can differ across different feature combinations.

DeepDSC[5] is a deep learning architecture designed to improve drug-sensitivity prediction from these data. It integrates both genomic features of cell lines and chemical information of compounds to predict half-maximal inhibitory concentrations (IC50) on the Cancer Cell Line Encyclopedia (CCLE) and the Genomics of Drug Sensitivity in Cancer (GDSC) datasets using a deep neural network. Specifically, it first applies a stacked deep autoencoder to extract genomic features of cell lines from gene expression data, and then combines the compounds' chemical features with these genomic features to produce the final response predictions. The authors conducted 10-fold cross-validation to evaluate DeepDSC in terms of root-mean-square error (RMSE) and coefficient of determination (R2), showing that DeepDSC outperforms previous approaches with an RMSE of 0.23 and R2 of 0.78 on the CCLE dataset, and an RMSE of 0.52 and R2 of 0.78 on the GDSC dataset. Moreover, to demonstrate DeepDSC's ability to predict for novel cell lines or novel compounds, they held out cell lines originating from the same tissue, and each compound, as test sets, with the rest as training sets; the performance was comparable to other methods.

Figure 1: Flowchart of DeepDSC[5], with a three-layer encoder and a three-layer decoder, where x and x' denote the input and the reconstruction output, and h denotes the encoded feature representation. An autoencoder first encodes the input x to a hidden representation h and then uses a decoder to reconstruct the input.
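A minimal PyTorch sketch of the encoder/decoder idea in Figure 1: the expression profile x is encoded to a hidden representation h, and x' is reconstructed from h. The layer sizes and input dimension are placeholders, not the ones used in DeepDSC.

```python
# Stacked autoencoder sketch: minimize the reconstruction error ||x - x'||^2.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_genes=1000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 32),
        )
        self.decoder = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, n_genes),
        )

    def forward(self, x):
        h = self.encoder(x)       # encoded genomic feature h
        return self.decoder(h)    # reconstruction x'

x = torch.randn(8, 1000)          # a batch of 8 expression profiles (synthetic)
model = Autoencoder()
loss = nn.MSELoss()(model(x), x)  # reconstruction error to minimize
loss.backward()
```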

GBDT (Gradient Boosting Decision Tree)[6] works well for data analysis and prediction. It is an ensemble algorithm based on decision trees. Gradient boosting belongs to the boosting family of ensemble methods and builds new learners iteratively via gradient descent. GBDT uses CART decision trees as its base learners.

GBDT performs classification or regression by adopting an additive model (a linear combination of basis functions) and continuously reducing the residuals generated during training.

The training process of GBDT is illustrated in the figure below (taken from the source):

Figure 1: GBDT training process.

GBDT iterates over multiple rounds, generating one weak classifier per round, and each classifier is trained on the residuals of the previous round. The weak classifiers are generally required to be simple, with low variance and high bias, because the training process improves the accuracy of the final classifier by continuously reducing bias.

The weak classifiers are usually CART trees (classification and regression trees). Because of the requirements above for high bias and simplicity, the depth of each tree is kept small. The final overall classifier is the weighted sum of the weak classifiers obtained in each training round (i.e., the additive model).

The model can finally be described as FM(x) = Σ(m=1..M) T(x;θm).

The model is trained for a total of M rounds, and each round m generates a weak classifier T(x;θm). The loss function of the weak classifier is θm* = argmin over θm of Σ(i=1..N) L(yi, Fm-1(xi) + T(xi;θm)),

where Fm-1(x) is the current model; GBDT determines the parameters of the next weak classifier through empirical risk minimization. For the choice of the loss function L, options include the squared loss, the 0-1 loss, the logarithmic loss, and so on. If the squared loss is chosen, the quantity being fitted is in fact what we usually call the residual.

The key idea is to make the loss function decrease along its gradient; this is the "gradient boosting" core of GBDT. The value of the negative gradient of the loss function under the current model is used as an approximation of the residuals in the boosting tree algorithm for regression, and a regression tree is fitted to it. In each iteration of GBDT, a tree is fitted to the negative gradient of the loss function under the current model.

In this way, the loss function decreases as quickly as possible in each round of training, so that a local or global optimum can be approached as soon as possible.
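A minimal sketch of this boosting loop (with squared loss the negative gradient equals the residual, so each round fits a shallow CART tree to the current residuals and adds it, scaled by a learning rate, to the model). Tree depth, learning rate, and data are illustrative only, not a full GBDT implementation.

```python
# Hand-rolled gradient boosting with squared loss and shallow regression trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

lr, trees = 0.1, []
F = np.full_like(y, y.mean())                  # F_0: constant initial model
for m in range(100):
    residual = y - F                           # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3)  # simple, high-bias weak learner
    tree.fit(X, residual)
    trees.append(tree)
    F += lr * tree.predict(X)                  # F_m = F_{m-1} + lr * T(x; theta_m)

print("training MSE:", np.mean((y - F) ** 2))
```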

XGBoost (eXtreme Gradient Boosting)[7] is an algorithm toolkit (including an engineering implementation) based on the boosting framework; it is very strong in parallel computing efficiency, missing-value handling, and prediction performance.

XGBoost is an improvement on the gradient boosting algorithm. It uses Newton's method to solve for the extremum of the loss function, expanding the loss function to second order with a Taylor expansion, and it adds regularization terms to the loss function. The objective function during training consists of two parts: the first is the loss of the gradient boosting algorithm and the second is the regularization term. The loss part is defined as Σ(i=1..n) l(y'i, yi),

where n is the number of training samples, l is the loss on a single sample (assumed to be a convex function), y'i is the model's predicted value for training sample i, and yi is the sample's true label. The regularization term defines the complexity of the model: Ω(f) = γT + (1/2)λ‖w‖²,

where γ and λ are manually set parameters, w is the vector formed by the values of all leaf nodes of the decision tree, and T is the number of leaf nodes.
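A minimal sketch with the xgboost package, exposing the two regularization constants discussed above: `gamma` penalizes the number of leaves T and `reg_lambda` penalizes the leaf values w. The data and hyperparameter values are illustrative only.

```python
# XGBoost regression with explicit gamma / lambda regularization settings.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    gamma=0.1,        # penalty per additional leaf (gamma * T)
    reg_lambda=1.0,   # L2 penalty on leaf values ((1/2) * lambda * ||w||^2)
)
model.fit(X, y)
print("training R^2:", model.score(X, y))
```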

ComboFM[8] is an artificial intelligence algorithm developed by researchers from Aalto University and the Universities of Helsinki and Turku in Finland. It can accurately predict whether different combinations of anticancer drugs have a combined killing effect on cancer cells. The research paper, titled "Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects", was published in Nature Communications on December 1, 2020.

This artificial intelligence model was trained on a large amount of data from previous studies and reportedly provides a highly efficient means of systematically pre-screening drug combinations.

In practice, comboFM models whether different drug combinations have synergistic effects through higher-order tensors and scores them with correlation coefficients. Based on tensor factorization, comboFM can exploit previous data on similar drugs and cells to predict the response of untested cells to new drug combinations, so even with limited data it can still achieve highly accurate predictions. The figure below shows the prediction performance of comboFM-5, comboFM-1, and random forest (RF) across tissue types and drug categories.
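A minimal NumPy sketch of the tensor idea only (not the actual comboFM model, which uses higher-order factorization machines over feature representations): responses for (drug 1, drug 2, cell line) triples form a 3-way tensor, which a low-rank CP-style factorization approximates as a sum of rank-1 terms, so unmeasured entries can be filled in from the reconstruction. The factor matrices here are random placeholders, not fitted values.

```python
# Reconstruct a (drug, drug, cell line) response tensor from low-rank factors.
import numpy as np

rng = np.random.default_rng(0)
n_drugs, n_cells, rank = 10, 6, 3
A = rng.normal(size=(n_drugs, rank))   # drug-1 factors
B = rng.normal(size=(n_drugs, rank))   # drug-2 factors
C = rng.normal(size=(n_cells, rank))   # cell-line factors

# T_hat[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
T_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
print("predicted response for drug 0 + drug 3 on cell line 2:", T_hat[0, 3, 2])
```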

DeepSynergy[9] comes from the paper "DeepSynergy: predicting anti-cancer drug synergy with Deep Learning", published in Bioinformatics on May 1, 2018. It uses a feed-forward neural network as its architecture. The input layer takes the chemical properties of the two drugs and the gene expression of the cell line as a feature vector; after transformation through the hidden layers of the DeepSynergy network, the network outputs a predicted synergy score representing the interaction of the drug combination on a cell line, where the score can indicate either synergy or antagonism. The study trained and predicted on a dataset of 23,062 samples, each consisting of two drugs and a cell line, drawn from 38 anticancer drugs and 39 human tumor cell lines. The predicted synergy scores were compared with previous experimental results, confirming that DeepSynergy's predictions were consistent with them, which also reduces the time and cost required for experimental validation. In addition, the study compared several other machine learning methods, such as random forests, support vector machines (SVM), and gradient boosting machines, and found that DeepSynergy achieved an area under the ROC curve (AUC) of 0.90, with overall accuracy better than those traditional machine learning methods.
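A minimal PyTorch sketch of the feed-forward idea: concatenate the chemical features of the two drugs with the cell line's expression features and regress a single synergy score. The layer sizes and feature dimensions are placeholders, not the DeepSynergy architecture or data.

```python
# Feed-forward synergy-score regressor on concatenated drug/drug/cell features.
import torch
import torch.nn as nn

drug_dim, cell_dim = 256, 1000
net = nn.Sequential(
    nn.Linear(2 * drug_dim + cell_dim, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 1),                      # predicted synergy score
)

drug_a = torch.randn(4, drug_dim)           # batch of 4 synthetic combinations
drug_b = torch.randn(4, drug_dim)
cell = torch.randn(4, cell_dim)
score = net(torch.cat([drug_a, drug_b, cell], dim=1))
print(score.shape)                          # torch.Size([4, 1])
```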


Fig. 1. Synergy calculation workflow. (a) Single-agent screens at 8 concentration points were run for each of the 38 compounds against each of the 39 cell lines. (b) Checkerboards of 4-by-4 non-zero concentrations were measured for each of the 583 tested combinations, again for each of the cell lines. (c) Values at the checkerboard concentrations were interpolated from the fitted Hill curves from (a) and combined with the measured checkerboards from (b) to yield a 5-by-5 matrix, from which Loewe synergy values could be obtained. (d) The procedure from (c) was performed for each pairwise combination of the drug pairs. Notably, self-self combinations were not explicitly measured. Furthermore, pairwise combinations within a set of 16 of the drugs (the 'supplemental' set) were similarly not measured (hence the gray block in the bottom right of the heatmap). This procedure was repeated for each cell line to yield a 38 × 38 × 39 data cube of input data, from which training, validation, and test data were drawn using stratified nested cross-validation.

synergy[10] is a Python package for calculating, analyzing, and visualizing drug-combination synergy and antagonism. It currently supports multiple synergy models, including MuSyC, Bliss, Loewe, Combination Index, ZIP, Zimmer, BRAID, Schindler, and HSA. The package:

(i) implements a broad array of popular synergy models;

(ii) provides tools for evaluating confidence intervals and conducting power analysis;

(iii) provides standardized tools to analyze and visualize drug combinations and their synergies and antagonisms.

synergy provides Python implementations of four single-drug models and nine synergy models. The single-drug models include 2- and 4-parameter Hill equations, the median-effect equation (MEE) (Chou and Talalay, 1984), and a nonparametric interpolated model. The synergy models include MuSyC (Meyer et al., 2019), Zimmer (Yadav et al., 2015), BRAID (Twarog et al., 2016), Bliss (Bliss, 1939), Loewe (Loewe and Muischnek, 1926), HSA (Berenbaum, 1989), CI (Chou and Talalay, 1984), ZIP (Yadav et al., 2015), and Schindler (Schindler, 2017). MuSyC, Bliss, Loewe, HSA, CI, and Schindler are supported for combinations of three or more drugs.

Parameters for parametric models are estimated using the trust region reflective fitting algorithm, with bounds on parameters. Parameter searches for the Hill and two-drug MuSyC models use Jacobians. If the parameter search converges, synergy automatically calculates model fit scores. To estimate parameter confidence intervals, synergy uses the Monte Carlo residual resampling method from Chapter 17 of Motulsky and Christopoulos (2003).
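A minimal SciPy sketch of the kind of bounded fit described above: a 4-parameter Hill equation fitted with the trust-region-reflective algorithm (curve_fit with method="trf" and parameter bounds). This illustrates the fitting approach only, on synthetic dose-response data; it does not use the synergy package's own classes.

```python
# Bounded trust-region-reflective fit of a 4-parameter Hill dose-response curve.
import numpy as np
from scipy.optimize import curve_fit

def hill(d, E0, Emax, h, C):
    """4-parameter Hill equation: effect as a function of dose d."""
    return E0 + (Emax - E0) * d**h / (C**h + d**h)

dose = np.logspace(-3, 1, 10)
rng = np.random.default_rng(0)
effect = hill(dose, 1.0, 0.1, 1.5, 0.3) + rng.normal(scale=0.02, size=dose.size)

popt, _ = curve_fit(
    hill, dose, effect,
    p0=[1.0, 0.0, 1.0, 0.5],
    bounds=([0, 0, 0.1, 1e-4], [2, 2, 10, 10]),  # bounds on E0, Emax, h, C
    method="trf",                                # trust region reflective
)
print("fitted [E0, Emax, h, C]:", popt)
```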

(A) Example data, where each row represents a single measurement. (B–D) Example supported visualizations, including (B) heatmap, (C) surface and (D) isosurface. Isosurfaces may be thought of as stacking multiple heatmaps on top of one another.

[1] Basu A, Mitra R, Liu H, et al. RWEN: response-weighted elastic net for prediction of chemosensitivity of cancer cell lines [J]. Bioinformatics, 2018, 34 (19): 3332–3339.

[2] Breiman L. Random forests [J]. Machine learning, 2001, 45 (1): 5–32.

[3] Drucker H, Burges C J, Kaufman L, et al. Support vector regression machines [J]. Advances in neural information processing systems, 1996, 9.

[4] Rendle S, Gantner Z, Freudenthaler C, et al. Fast context-aware recommendations with factorization machines [C]. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, 2011: 635–644.

[5] Li M, Wang Y, Zheng R, et al. DeepDSC: a deep learning method to predict drug sensitivity of cancer cell lines [J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019, PP (99): 1–1.

[6] Ke G, Meng Q, Finley T, et al. Lightgbm: A highly efficient gradient boosting decision tree [J]. Advances in neural information processing systems, 2017, 30.

[7] Chen T, Guestrin C. Xgboost: A scalable tree boosting system [C]. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016: 785–794.

[8] Julkunen, H., Cichonska, A., Gautam, P. et al. Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects. Nat Commun 11, 6136 (2020). https://doi.org/10.1038/s41467-020-19950-z

[9] Preuer K, Lewis R P I, Hochreiter S, et al. DeepSynergy: predicting anti-cancer drug synergy with Deep Learning [J]. Bioinformatics, 2018, 34 (9): 1538–1546.

[10] Wooten D J, Albert R. synergy: a Python library for calculating, analyzing and visualizing drug combination synergy [J]. Bioinformatics, 2021, 37 (10).
