In simulations on the database of the US National Cancer Institute of the National Institutes of Health {http://www.nci.nih.gov}, a study was made to test Matrix's ability to rapidly optimize and identify compounds. The database consists of experimental results on the impact of more than 30,000 compounds on 6 different types of cancer.
Starting with an initial selection of only 50 compounds and adding 20 compounds per cycle, the activity of the best molecule was improved by more than 4 orders of magnitude. More than 50% of the 12 most active compounds were found. We repeated the simulation over several runs, each using a different initial selection of compounds. The results were reproduced in all simulation runs.
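The cycle structure described above (an initial random pick, then a fixed number of model-proposed compounds per cycle, each looked up in the already measured database) can be sketched as follows. This is only an illustration: the Matrix modelling technology is not described here, so a simple nearest-neighbour predictor on a single hypothetical numeric descriptor stands in for it, and the names `run_cycles`, `activities` and `descriptors` are assumptions. The default of 8 cycles is taken from the cycle numbers in the figure 1 caption.

```python
import random


def run_cycles(activities, descriptors, n_initial=50, n_per_cycle=20,
               n_cycles=8, seed=0):
    """Return the best measured activity after the initial pick and each cycle.

    activities  -- measured activity per compound (looked up, never predicted)
    descriptors -- one hypothetical numeric descriptor per compound
    """
    rng = random.Random(seed)
    all_ids = range(len(activities))
    tested = set(rng.sample(all_ids, n_initial))
    best = [max(activities[i] for i in tested)]

    for _ in range(n_cycles):
        untested = [i for i in all_ids if i not in tested]
        if not untested:
            break
        known = sorted(tested)  # fixed order keeps tie-breaking reproducible

        def predict(i):
            # Stand-in model (assumption): copy the activity of the
            # tested compound with the closest descriptor value.
            nearest = min(known,
                          key=lambda j: abs(descriptors[i] - descriptors[j]))
            return activities[nearest]

        # Propose the untested compounds with the highest predicted activity,
        # then "measure" them by looking them up in the database.
        ranked = sorted(untested, key=predict, reverse=True)
        tested.update(ranked[:n_per_cycle])
        best.append(max(activities[i] for i in tested))
    return best
```

Because compounds are only ever added to the tested set, the returned list can never decrease from one cycle to the next; how quickly it rises depends entirely on the quality of the model used for the proposals.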
Results are summarised in table 1 and figure 1.
[table]
Table 1: Improvement of the activities of the compounds (simulations on the NCI database, see text). The criterion was the mean GI-50 activity over all 6 cancer types listed in the database. The NCI internal number is also given. The corresponding 3D structures are shown in figure 1.
Figure 1: Evolution of the structures in simulations on the NCI database, see text. Shown are the structures of the best molecules of each cycle. Cycles 0--3 are shown in the top line and cycles 4, 6--8 in the bottom line (from left to right). Note the ability to 'jump' between diverse motifs and chemical structures.
Some additional comments on the results:
1. In total there are 31097 entries in the database.
2. 14781 entries do not show any effect on the targets (activity < 4.1); all values below 4.0 were set to 4.0 for simplicity.
3. The mean activity of the molecules is 4.50.
4. Only 25 molecules (<0.1%) show activities higher than 9.00.
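The figures in this list are simple summary statistics, and a minimal sketch of how they could be computed from a list of mean GI-50 values is given below. The function name `summarise` and its parameter names are hypothetical; the thresholds (floor of 4.0, no effect below 4.1, highly active above 9.00) are the ones quoted in the list.

```python
def summarise(mean_gi50, floor=4.0, inactive_below=4.1, high=9.0):
    """Summarise a list of mean GI-50 activities (logarithmic scale)."""
    # Values below the floor are set to the floor, as described in the text.
    clipped = [max(value, floor) for value in mean_gi50]
    n = len(clipped)
    n_high = sum(value > high for value in clipped)
    return {
        "entries": n,
        "inactive": sum(value < inactive_below for value in clipped),
        "mean": sum(clipped) / n,
        "highly_active": n_high,
        "highly_active_fraction": n_high / n,
    }
```

For the numbers quoted above, 25 highly active molecules out of 31097 entries give a fraction of about 0.0008, which is consistent with the stated "<0.1%".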
The distribution of the mean GI-50 over 6 cancer types for all entries in the NCI database is shown in figure 2.
Figure 2: Statistical distribution of the mean GI-50 over 6 cancer types for all 31097 entries in the NCI database, on a logarithmic scale.
watch the modelling demo
This film shows the progress of one optimization cycle targeting activity against melanoma. Initially the activity (measured as the GI-50 value, which is on a logarithmic scale) of the best compound found so far is 4.80. The best molecule is shown in its 3D structure. Subsequently the optimization process is illustrated. The horizontal axis represents the true activity and the vertical axis the corresponding predicted activity. In the beginning the models do not capture the internal relations within the data. You can see this from the fact that almost all molecules are predicted to have essentially the same activity: the points scatter around a horizontal line representing the mean value. This is reflected by a root mean squared (RMS) error of almost 0.5 (0.497).
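For readers who want to reproduce the error measure, the RMS error between true and predicted activities can be computed as in the sketch below. The function name `rms_error` is an assumption, and the example values are made up for illustration; they are not taken from the NCI data.

```python
import math


def rms_error(true_values, predicted_values):
    """Root mean squared error between two equally long sequences."""
    if len(true_values) != len(predicted_values):
        raise ValueError("sequences must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))


# A model that predicts the mean value for every molecule (the horizontal
# line in the plot) has an RMS error equal to the spread of the true values.
true_activity = [4.0, 5.0]
mean_prediction = [4.5, 4.5]
baseline = rms_error(true_activity, mean_prediction)  # 0.5
```

A perfect model, with every point on the diagonal, would give an RMS error of 0.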
Using all information from the 20 molecules available in the simulation at this point, the Matrix modeling technology was applied to this data set: molecule structures on the one hand and activities on the other. What you now watch is the part of the process that runs completely within the computer; no additional data is fed into the process at this time. The points continuously approach the diagonal, which represents the outcome of perfect models for the given data points. In parallel the RMS error is reduced to under 0.07. Finally, 10 molecules are proposed based on the final models and are then checked for activity in the database. The best of these is shown and has an activity of over 9.5, representing a roughly 50,000-fold increase in activity. The associated 3D structure indeed shows almost no similarity to the best compound from the initial cycle.
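The 50,000-fold figure follows from the difference between the two activity values, assuming the logarithmic scale is base 10 (an assumption; the text only states that the scale is logarithmic). A short check, with the hypothetical helper name `fold_improvement`:

```python
def fold_improvement(initial_activity, final_activity):
    """Fold change between two activities on a base-10 logarithmic scale."""
    return 10 ** (final_activity - initial_activity)


# From 4.80 to 9.5 is a difference of 4.7 log units.
ratio = fold_improvement(4.80, 9.5)  # roughly 5.0e4
```

Since the final activity is stated as "over 9.5", the value computed here is a lower bound under that assumption.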