MEDIUM - Machine lEarning Drug dIscovery throUgh dynaMics
Giorgio Colombo Group (UNIPV)
Why is this tool useful?
Predicting the best ligand for a specific protein can be a major challenge with classical approaches such as molecular docking and stabilisation energy calculations.
Here we report a fast and robust workflow that starts from our DF (distance fluctuation) matrix method, which analyses how the protein behaves globally in the presence of a ligand. A Convolutional Neural Network (CNN) model is trained directly on the pixel images of the DF matrices: training is performed using a known ligand, and the different behaviour of the protein is evaluated in the presence and in the absence of that ligand.
The trained model can then be used to make predictions for different ligands.
How to use the script
• Requirements
- Python 3.0 (or newer)
- NumPy
- TensorFlow
- pandas
- scikit-learn (sklearn)
- OpenCV (cv2)
- Matplotlib
• Usage
- CNN-training-script.py is the main code of the tool: different CNN models can be customized here, including the activation function and the classification mode. In its final part it also runs a test on unseen data and saves the trained model as a .h5 file.
The first step required of the user is the initial preparation of the DF images from the DF matrices [see the following link for the DF preparation: https://wiki.ebrains.eu/bin/view/Collabs/distance-fluctuations]. This can be done using the gnuplot.in file and the exectute-DF.sh file, which renames each .png according to the time (in nanoseconds) used to extract the image.
The images required to train the model have to be selected by the user and classified between the two states of interest; the random-selector files are then used to divide them into test, training and validation datasets. We usually performed a random separation into test (20%), training (64%) and validation (16%) sets using the last 200 ns of the equilibrated dynamics.
- CNN-external-data-test.py uses the trained model (.h5 file) and tests it on data from proteins other than those used to build the model.
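The 20% / 64% / 16% separation described for the training images can be reproduced with two successive 80/20 cuts: 20% of all images go to the test set, and 20% of the remaining 80% (that is, 16% of the total) go to validation. The sketch below is only an illustration of that arithmetic and is not the random-selector code shipped with the tool; the function name `split_images`, the seed, and the example file names are assumptions.

```python
import random
from pathlib import Path


def split_images(paths, test_frac=0.20, val_frac=0.20, seed=0):
    """Shuffle paths and split them into train/validation/test lists.

    test_frac is taken from the full set; val_frac is taken from what
    remains, so 0.20 and 0.20 give 64% train, 16% validation, 20% test.
    """
    shuffled = sorted(paths)               # fixed starting order
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return {"train": train, "validation": val, "test": test}


if __name__ == "__main__":
    # Hypothetical file names: one DF image per nanosecond over 200 ns.
    images = [Path(f"bound/{ns}ns.png") for ns in range(801, 1001)]
    for name, items in split_images(images).items():
        print(name, len(items))
```

The same call would be applied separately to each of the two states of interest, so that both classes keep the same proportions in every dataset.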
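As a rough idea of what testing a saved model on external data involves, the sketch below loads a .h5 file with Keras, reads a folder of DF images, and reports the fraction of frames assigned to one class. It is not the content of CNN-external-data-test.py. Several points are assumptions: the function names, the 128x128 grayscale input, the scaling to [0, 1], and a binary model with a single sigmoid output. The image size and preprocessing must match whatever was used during training.

```python
from pathlib import Path


def load_images(image_dir, size=(128, 128)):
    """Read every .png in image_dir as a grayscale array scaled to [0, 1]."""
    import cv2          # heavy dependencies are imported only when needed
    import numpy as np

    arrays = []
    for path in sorted(Path(image_dir).glob("*.png")):
        img = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
        if img is None:  # unreadable file: skip it
            continue
        img = cv2.resize(img, size).astype("float32") / 255.0
        arrays.append(img[..., np.newaxis])  # add the channel axis
    return np.stack(arrays)


def fraction_in_class(probabilities, threshold=0.5):
    """Fraction of frames whose predicted probability exceeds threshold."""
    probabilities = list(probabilities)
    if not probabilities:
        raise ValueError("no predictions to summarise")
    hits = sum(1 for p in probabilities if p > threshold)
    return hits / len(probabilities)


def evaluate_external(model_path, image_dir, size=(128, 128)):
    """Load a saved .h5 model and classify DF images of another protein."""
    import tensorflow as tf

    model = tf.keras.models.load_model(model_path)
    batch = load_images(image_dir, size)
    # Assumes one sigmoid output per image (binary classification).
    probs = model.predict(batch).ravel()
    return fraction_in_class(probs.tolist())
```

A call such as `evaluate_external("model.h5", "external_protein_images/")` (both paths hypothetical) would then return a value between 0 and 1 summarising how many frames the model assigns to the positive class.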