Your goal is to build a model that assigns a kinase type (as defined by Enzyme Commission number) to an arbitrary sequence of amino acids. 
All the sequences in both the training and test datasets are complete protein kinase sequences. Each comprises up to 560 positions, and each position can take one of 20 values (the 20 standard amino acids produced in eukaryotic cells).
As each letter in an amino acid sequence represents a physical structure (one amino acid), these sequences can be augmented by converting each letter into a numerical representation of that amino acid. There are several ways to do this:
- Using the physical and chemical properties of the amino acids (https://github.com/vadimnazarov/kidera-atchley/blob/master/aa_properties.txt contains a table with this kind of data)
- Using pre-computed low-dimensional embeddings, namely Atchley or Kidera factors. These take a comprehensive list of biophysical and biochemical descriptors and map them to a five- (Atchley) or ten- (Kidera) dimensional space. This means that each amino acid can be represented by five or ten values, respectively. These embeddings have been computed in such a way that they explain a large fraction of the variance in the biochemical descriptors (in a similar way to Principal Component Analysis), so amino acids with similar factors will tend to have similar properties. You can find tables of these factors for the different amino acids here: https://github.com/vadimnazarov/kidera-atchley . Using either the raw AAindex properties or the embeddings may help you design an improved classification method.
- Using a larger set of embeddings. 'embeddings.csv' is similar to the low-dimensional embeddings described above but contains 1024 values for each amino acid. You can see how it was generated in the deep learning starter notebook. You may use these embeddings in your solution.
You may use all the data in https://github.com/vadimnazarov/kidera-atchley as well as the provided embeddings as part of your solutions for this challenge.
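To make the encoding idea concrete, here is a minimal sketch of mapping a sequence to a numeric feature matrix. The five-value factor table below is illustrative only; in your solution you would load the real Atchley (or Kidera) factors from the repository linked above.

```python
import numpy as np

# Illustrative 5-dimensional factors for a few amino acids.
# Replace with the real Atchley factor table from the
# kidera-atchley repository; these numbers are placeholders.
FACTORS = {
    "A": [-0.59, -1.30, -0.73,  1.57, -0.15],
    "C": [-1.34,  0.47, -0.86, -1.02, -0.26],
    "G": [-0.38,  1.65,  1.33,  1.05,  2.06],
    "K": [ 1.83, -0.56,  0.53, -0.28,  1.65],
}

def encode_sequence(seq, factors=FACTORS):
    """Map an amino-acid string of length L to an (L, 5) matrix."""
    return np.array([factors[aa] for aa in seq])

features = encode_sequence("ACGK")
print(features.shape)  # (4, 5)
```

The same pattern works for the 1024-dimensional embeddings: only the lookup table changes, producing an (L, 1024) matrix per sequence.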
The sizes of the classes are not uniform, so we have separated some of the subclasses to make the classification problem more balanced and more relevant to real life. The class labels and counts in the training set are as follows:
2.7.13.3     634731
2.7.11.1     240371
2.7.11        64559
2.7.11.24     16152
2.7.11.30      8103
2.7.11.32      5436
2.7.12         4252
2.7.14         2126
It is not guaranteed that the test set has the same label distribution.
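Because the classes are this skewed (the largest class is about 300 times the smallest), you may want to weight classes during training. A hypothetical sketch, deriving inverse-frequency weights from the counts above (e.g. to pass to a classifier's class-weight or sample-weight option):

```python
# Training label counts, copied from the table above.
counts = {
    "2.7.13.3": 634731,
    "2.7.11.1": 240371,
    "2.7.11":    64559,
    "2.7.11.24": 16152,
    "2.7.11.30":  8103,
    "2.7.11.32":  5436,
    "2.7.12":     4252,
    "2.7.14":     2126,
}
total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weighting: rare classes get larger weights,
# and the weighted counts sum back to the total sample count.
weights = {label: total / (n_classes * n) for label, n in counts.items()}
```

This is one common balancing scheme, not the only option; resampling or focal-style losses are alternatives.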
Files available for download:
- Train.csv - contains an ID, a string representing the protein sequence, and the target. This is the dataset that you will use to train your model.
- Test.csv - resembles Train.csv but without the target column. This is the dataset to which you will apply your model.
- SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv and the target column containing your predictions. The order of the rows does not matter, but the ID values must be correct.
- Amino_acid_embeddings.csv - amino acid values mapped into 1024-dimensional space.
- StarterNotebook_ML.ipynb - this Machine Learning starter notebook will help you make your first submission on Zindi.
- starter_notbook_dl__v1.ipynb - this is an UPDATED Deep Learning notebook. If you would like to try deep learning, use this notebook to make your first submission.
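The end of the workflow is assembling predictions into the SampleSubmission.csv shape. A minimal sketch, assuming the submission has an "ID" column and a single target column (here called "LABEL" for illustration; check SampleSubmission.csv for the actual column name):

```python
import pandas as pd

def make_submission(test_ids, predictions):
    """Assemble one predicted class label per test ID.

    Column names are assumptions; mirror SampleSubmission.csv exactly.
    """
    return pd.DataFrame({"ID": test_ids, "LABEL": predictions})

# Usage with dummy data (real IDs come from Test.csv):
sub = make_submission(["ID_0", "ID_1"], ["2.7.13.3", "2.7.11.1"])
sub.to_csv("submission.csv", index=False)
```

Since row order does not matter, only the ID values and column names need to match the sample file.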
Here is a Gdrive link (http://bit.ly/UmojaHackTunisiaData) that contains a .zip folder of the data. This folder is password protected and the password is in the discussion forum.