Structure-based machine-guided mapping of amyloid sequence space reveals uncharted sequence clusters with higher solubilities - Nature
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
ARTICLE
https://doi.org/10.1038/s41467-020-17207-3 OPEN
Structure-based machine-guided mapping of
amyloid sequence space reveals uncharted
sequence clusters with higher solubilities
Nikolaos Louros1,2, Gabriele Orlando1,2, Matthias De Vleeschouwer1,2, Frederic Rousseau 1,2 ✉ &
Joost Schymkowitz 1,2 ✉
1234567890():,;
The amyloid conformation can be adopted by a variety of sequences, but the precise
boundaries of amyloid sequence space are still unclear. The currently charted amyloid
sequence space is strongly biased towards hydrophobic, beta-sheet prone sequences that
form the core of globular proteins and by Q/N/Y rich yeast prions. Here, we took advantage
of the increasing amount of high-resolution structural information on amyloid cores currently
available in the protein databank to implement a machine learning approach, named Cordax
(https://cordax.switchlab.org), that explores amyloid sequence beyond its current bound-
aries. Clustering by t-Distributed Stochastic Neighbour Embedding (t-SNE) shows how our
approach resulted in an expansion away from hydrophobic amyloid sequences towards
clusters of lower aliphatic content and higher charge, or regions of helical and disordered
propensities. These clusters uncouple amyloid propensity from solubility representing
sequence flavours compatible with surface-exposed patches in globular proteins, functional
amyloids or sequences associated to liquid-liquid phase transitions.
1 Switch Laboratory, VIB Center for Brain and Disease Research, Herestraat 49, 3000 Leuven, Belgium. 2 Switch Laboratory, Department of Cellular and
Molecular Medicine, KU Leuven, Herestraat 49, 3000 Leuven, Belgium. ✉email: Frederic.Rousseau@kuleuven.vib.be; Joost.Schymkowitz@kuleuven.vib.be
NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunications 1ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3
T
he amyloid cross-β state is a polypeptide conformation that identified less hydrophobic polar and charged aggregation prone
is adopted by 36 proteins or peptides associated to human sequences that increasingly uncouple solubility and amyloid
protein deposition pathologies1. It also constitutes the propensity, closely resembling characteristics of phase-separation
structural core of a growing number of functional amyloids in inducers. Clustering by t-distributed stochastic neighbour
both bacteria and eukaryotes2,3. Beyond these bona fide func- embedding reveals the heterogeneous substructure of amyloid
tional and pathological amyloids it has been demonstrated that sequence space consisting in varying clusters corresponding to
many if not most proteins can adopt an amyloid-like con- sequences compatible with globular structure, functional scaf-
formation upon unfolding/misfolding4. This has led to the notion folding amyloids, N/Q/Y-rich prions, helical peptides and
that just like the α-helix or β-sheet, the amyloid state is a generic intrinsically disordered sequences. Together, the structural
polypeptide backbone conformation but also that amino acids exploration performed here demonstrates that the field now
have different propensities to adopt the amyloid conformation5. gathered sufficient structural and sequence information to start
Initially, it was observed that amyloid-like aggregation correlates classifying amyloids according to different structural and func-
with hydrophobicity, β-strand propensity, and (lack of) net tional niches. Just like for globular proteins in the 1980s, this will
charge6. This triggered the development of aggregation prediction allow to fine-tune both general and context-dependent structural
algorithms that essentially evaluate the above biophysical rule learning allowing to manipulate and design amyloid struc-
propensities7,8. Others extended to scaling residue propensities ture and function.
between protein folding and aggregation9,10. These algorithms
confirmed the ubiquity of amyloid-like propensity in natural
Results
protein sequences and particularly in globular proteins as it was
Overall approach of Cordax. We wanted to design a novel
estimated that 15–20% of residues in a typical globular domain
structure-based amyloid core sequence prediction method that
are within aggregation-prone regions (APRs)11,12. These APRs
(a) leverages all the available structure information that is cur-
are sequence segments of six to seven amino acids in length on
rently available, and (b) employs a machine-learning element for
average and are mostly buried within the protein structure where
optimal prediction performance. To this end, we first built a
they constitute the hydrophobic core stabilising tertiary protein
curated template library of amyloid core structures as described
structure13–15. On the other hand, the increasing identification of
in the paragraph below. In the vein of previous prediction
both yeast prions and functional amyloids clearly indicated that
methods29, we fixed on the hexapeptide as a unit of prediction. In
amyloid sequence space is not monolithic and that more polar/
order to determine the amyloid propensity of a query hexapeptide
less aliphatic sequences represent important alternative popula-
we start by modelling its side chains on all the available amyloid
tions of amyloid sequence space3. The limited sensitivity of the
template structures using the FoldX force field30, which yields a
above cited algorithms to specifically identify these other sub-
model and an associated free energy estimate (ΔG, kcal/mol) for
populations confirmed the underestimated sequence versatility of
each template. These free energies are then fed into a logistic
the amyloid conformation. Indeed, more recently the role of
regression model, which is a simple statistical method relating a
amyloid-like sequences in proteins mediating liquid–liquid phase
binary outcome to continuous variables. The prediction output of
transitions again demonstrates the ubiquity of the amyloid in
Cordax is multiple: First, there is the prediction from the logistic
biological function and further withers the image of the amyloid
regression whether or not the segment is an amyloid core
state as a predominantly disease and/or toxicity-associated pro-
sequence. Second, for the sequences deemed amyloid core, the
tein conformation16–18. Rather, this suggests that like globular
most likely amyloid core model is provided. For longer query
protein folding, amyloid assembly is a matter of kinetic and
sequences, a sliding window approach is adopted. The technical
thermodynamic control that can be evolutionary tuned by
details of the pipeline can be found in the “Methods” section.
sequence variation and selection.
Efforts to develop aggregation predictors that can identify a
broader spectrum of amyloid sequences have increased over the Refinement of fibril structures for machine learning. We iso-
years19. Such approaches focused on identifying position-specific lated 78 short segment fibril core high-resolution structures from
patterns by reference to accumulated experimental data of the Protein Data Bank (Supplementary Data 1). Templates were
APRs20–22, or by using energy functions of cross-beta pairings23. grouped into seven distinct topological classes out of eight the-
Recently developed meta-predictors produce consensus outputs oretically possible based on their overall structural properties, as
by combining previous methods, in an attempt to boost previously proposed by Sawaya et al.31. Briefly, topologies are
performance24,25. Indirect structure-based methods were initially defined by whether β-sheets have parallel versus antiparallel
developed by considering secondary structure propensities26,27. orientation, by the orientation of the strand faces that form the
Complementary studies extended this notion by suggesting that steric zipper (face-to-face versus face-to-back), and finally the
disease-related amyloids form β-strand-loop-β-strand motifs28. orientation of both sheets towards each other and whether that
However, the principle of using structural information to accu- results in identical or different fibril edges. This complexity was
rately predict aggregation prone segments in protein sequences addressed by generating an ensemble of amyloid cores per
stems from the detailed work of Eisenberg and co-workers. The structure using crystal contact information derived from the
3D-profiling method utilised the crystal structure of the fibril- solved structures. Every template comprises two facing β-sheets,
forming segment NNQQNY (PDB ID: 1YJO) derived from the each composed of five successive β-strands. Since parallel archi-
Sup35 prion protein, to thread and evaluate sequence fitting using tectures can share more than one homotypic packing interface,
the Rosetta energy function29. In this work, we build on this those structures were split into separate individual entries (Fig. 1).
principle to develop Cordax, an exhaustively trained regression To ensure uniformity, we expanded the number of structural
model that leverages a substantial library of curated template variants by breaking down longer segments into hexapeptide
structures combined with machine learning. Cordax not only constituents, thus yielding a library of 179 peptide fragment
detects APRs in proteins, but also predicts the structural topology, structures (Fig. 1 and Supplementary Data 1).
orientation and overall architecture of the resulting putative fibril The amyloid interaction interfaces were analysed in detail
core. To validate the accuracy of our predictions, we designed a following energy refinement by the FoldX force field30. During
screen of 96 newly predicted APRs and experimentally deter- this step we identified and rejected 33 imperfect β-packing
mined their aggregation properties. Using this approach, we interfaces formed by β-strands that contribute less than three
2 NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunicationsNATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3 ARTICLE
Interface fragmentation
1
90° 90°
Interface a Interface b
2
Hexapeptide fragmentation
Fragment a_1 Fragment a_2 Fragment b_1 Fragment b_2
Fig. 1 Processing steps of the peptide fragment library. a Crystal contact information was used to generate fibril cores from isolated PDB structures.
Structures containing multiple packing interfaces were split into individual templates (1), which were in turn split into hexapeptide core fragments (2).
Source data are provided as a Source Data file.
interacting residues, thus reducing the ensemble to 146 structures characteristic (ROC) analysis33. The ROC curves generated
(Supplementary Data 1). Detailed analysis of the contributions of highlight that Cordax performance exceeds over seven state-of-
various energy components showed that these excluded β- the-art methods, which we applied using optimised options
packing interfaces have inefficient shape complementarity and defined by the developers7,9,21–24,34. In detail, Cordax performs
low overall stability, stemming from a combination of weak well over random as depicted by the highest total area under the
electrostatic contributions, diminished van der Waals interactions curve (AUC) value of 0.87 (Fig. 3a). Distribution analysis of the
and exposure of hydrophobic residues to the solvent (Fig. 2a). scoring values indicates that the method achieves optimal
Previous work has highlighted that distinct topological layouts separation, resulting in minimal scoring overlay between positive
can potentially introduce a stronger tolerance for the integration and negative amyloid forming sequences (Fig. 3b). As previously
of protein sequence segments and as a result can generate several reported, TANGO showed high specificity due to the over-
potential type-I errors (false positives)29. To address this issue, we representation of unscored values, which is also evident for
implemented a two-step cross-threading exploration of putative WALTZ as well as MetAmyl, which incorporates the latter
structural promiscuous traps. In more detail, we extracted a non- method in its meta-prediction. The cost of high specificity is also
redundant set of hexapeptide sequences from the structural reflected by the calculated F1 values, as PASTA and TANGO
library (73 sequences), which was subsequently cross-modelled in report low recall values. On the other hand, AGGRESCAN and
an all-against-all reiteration process. Using an empirical cut-off GAP produce significant overpredictions as depicted by their
threshold (=5), a sum of three structural fragments was initially reported false-positive rates (FPR values of 0.54 and 0.76,
identified and removed. Eliminating these structures led to the respectively) (Fig. 3c). The optimal score thresholding of our
identification and subsequent elimination of three additional method was determined from the ROC curve analysis as the score
promiscuous templates, resulting in the final Cordax library, where predictions show the highest sensitivity-to-specificity ratio.
composed of 140 zipper structures (Fig. 2b and c). According to this, Cordax achieves a well-balanced prediction by
reporting with high specificity (86%) more than 7 out of 10
Benchmarking aggregation propensity detection with Cordax. aggregation prone segments (72%), which is reflected by the
As an initial test of the prediction accuracy of the regression highest calculated MCC, AUC and F1 values compared to other
model, we performed leave-one-out cross-validation on the available software (Fig. 3c).
training dataset32 and performance metrics were determined on a To further benchmark the method, we tested it against full-
peptide basis. Due to the extensive size of the dataset, comparison length protein sequences. For this we used a standardised set of
to other software was performed only with methods supporting 34 annotated amyloidogenic proteins that was previously
multiple sequence input and a non-binary scoring function, implemented for validation of several previous aggregation
since performances were compared using receiver operating predictors25, following a filtering step for potential overlaps to
NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunications 3ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3
a ΔG BackHbonds SideHbonds VdW Electrostatics SolvPolar SolvHphobic
b 1YJO 3FR1 6CFH_3 3FOD 4XFN 4W67_2
Cor : 0.221 Cor : 0.0339 Cor : 0.611 Cor : 0.559 Cor : –0.0299 Cor : 0.697
Accepted: 0.206 Accepted: –0.0205 Accepted: 0.524 Accepted: 0.526 Accepted: 0.158 Accepted: 0.654
ΔG
146 Structures 143 Structures 140 Structures
Rejected: 0.211 Rejected: –0.306 Rejected: 0.791 Rejected: 0.365 Rejected: –0.538 Rejected: 0.806 Starting Filtered CORDAX
library library library
Cor : 0.128 Cor : 0.213 Cor : –0.197 Cor : –0.287 Cor : 0.136
BackHbonds
Accepted: 0.132 Accepted: 0.173 Accepted: –0.248 Accepted: –0.219 Accepted: 0.11 1st Cross-thread step 2nd Cross-thread step
Rejected: 0.0738 Rejected: 0.215 Rejected: –0.342 Rejected: –0.516 Rejected: 0.101
15 1yjo 15 15
Cor : 0.365 Cor : 0.0391 Cor : –0.549 Cor : –0.0231
10 6cfh_3 10 10
SideHbonds
Accepted: 0.385 Accepted: –0.0142 Accepted: –0.56 Accepted: –0.0886
5 5 5
Selection counts
Selection counts
Selection counts
Rejected: –0.145 Rejected: –0.0855 Rejected: 0.0135 Rejected: –0.223 4xfn
3fr1 4w67_2 3fod
0 0 0
Cor : 0.22 Cor : –0.677 Cor : 0.901
VdW
Accepted: 0.151 Accepted: –0.595 Accepted: 0.859
Rejected: –0.0543 Rejected: –0.879 Rejected: 0.982
Cor : –0.0129 Cor : 0.217
Electrostatics
Accepted: 0.0724 Accepted: 0.169 PDB Code PDB Code PDB Code
Rejected: 0.31 Rejected: –0.0201
WALTZ-DB 2.0 1 DATABASE OF AMYLOID
Cor : –0.416 c FORMING PEPTIDES
Sequence
input
SolvPolar
Accepted: –0.263
1402 Sequences
Rejected: –0.804
CORDAX
library
Topology model
196.280 Regression
SolvHphobic
ΔG Energies model
Training data
140 Structures
Amyloid scoring
Fig. 2 Optimising the Cordax structural library. a Correlation plot of interface energies calculated using FoldX. Top half shows correlation values with
scatter plots indicated at the bottom half. Rejected fragments sharing low shape complementarity (shown in yellow) have correlating weak van der Waals
interfaces, as well as poor solvation energies for hydrophobic side chains compared to the remaining library (indicated in purple). Linear regression lines
are shown with 95% confidence interval (shown in grey shaded areas). b Promiscuity sorting of the structural library performed as a two-step cross-
threading process. Circular histograms highlight three major promiscuous structures (n > 5) which were removed during the primary (PDB ID: 1YJO, 3FR1
and 6CFH_3) and secondary step (PDB ID: 3FOD, 4XFN and 4W67_2). c Schematic representation of Cordax training and the derived pipeline.
a b 1.00
1.0 1 5
PASTA 2.0 energy
0.75
CORDAX score
0
0.50
0.25 –5
0.8 0.8
0.00
Amyloid Non-Amyloid 0.0 0.5 1.0 1.5 2.0 2.5 Amyloid Non-Amyloid 0.00 0.05 0.10 0.15 0.20
Classification Density Classification Density
Average true positive rate
0.6 0.6 600
CORDAX cutoffs
n
tio
400
ic
ed
WALTZ score
pr
TANGO score
400 300
m
do
an
200
R
200
0.4 AUC = 0.87 0.4
100
CORDAX 0 0
WALTZ
PASTA 2.0 Amyloid Non-Amyloid 0.00 0.25 0.50 0.75 1.00 Amyloid Non-Amyloid 0.00 0.25 0.50 0.75 1.00
AGGRESCAN Classification Scaled density Classification Scaled density
MetAmyl
TANGO
0.2 GAP 0.2
MILAMP
0.15
MetAmyl (total hot spot area)
100
AGGRESCAN (Na4vSS)
50
0.10
0
0.0 0 –50
0.05
–100
0.0 0.2 0.4 0.6 0.8 1.0
0.00
Average false positive rate –150
Amyloid Non-Amyloid 0.00 0.25 0.50 0.75 1.00 Amyloid Non-Amyloid 0.000 0.005 0.010
Classification Scaled density Classification Density
c CORDAX WALTZ PASTA 2.0 AGGRESCAN MetAmyl TANGO GAP MILAMP
1.00 1.00
Accuracy 0.81 0.67 0.76 0.57 0.79 0.73 0.51 0.73
Precision 0.74 0.54 0.74 0.60 0.72 0.82 0.42 0.66 0.75 0.75
MILAMP score
GAP score
TPR (Recall) 0.72 0.65 0.52 0.85 0.69 0.31 0.94 0.49 0.50
0.50
FPR 0.14 0.33 0.10 0.54 0.15 0.04 0.74 0.14
0.25 0.25
MCC 0.57 0.34 0.46 0.29 0.54 0.38 0.25 0.38
F1 0.73 0.60 0.61 0.71 0.70 0.45 0.58 0.56 0.00 0.00
Amyloid Non-Amyloid 0.00 0.25 0.50 0.75 1.00 Amyloid Non-Amyloid 0.0 0.5 1.0 1.5 2.0
AUC 0.87 0.73 0.84 0.84 0.78 0.64 0.70 0.76 Classification Scaled density Classification Density
Fig. 3 Benchmarking of CORDAX. a ROC curve analysis for Cordax and seven other state-of-art methods against WALTZ-DB 2.0. For WALTZ, TANGO
and MetAmyl, FPR stops at earlier rates due to minimal scoring variations. b Cordax score distribution compared to other tools. The regression model
achieves better scoring separation for predictions between amyloid-forming (shown in blue) and non-amyloid sequences (shown in red). Distributions are
shown as standard box-plots highlighting median, upper (Q3) and lower (Q1) quartiles. Whiskers indicate the ranges for values within 1.5 interquartile
range (IQR) above and below Q3 and Q1, respectively. Density plots for WALTZ, TANGO, MetAmyl and GAP are scaled due to the overrepresentation of
unscored values or false positives, respectively. c Performance metrics comparison indicating Cordax superiority to other sequence predictors (MCC =
0.57, F1 = 0.73 and AUC = 0.87).
4 NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunicationsNATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3 ARTICLE
the training data set. Despite its wide use, this collection suffers picked up both by Cordax and traditional sequence-based
from insufficient experimental characterisation of certain large methods are usually found buried within the core of soluble
entries (i.e. gelsolin, kerato-epithelin, lactoferrin, amphoterin and proteins. Contrary to what has been previously reported14,15,
others), which has been shown to introduce type-I errors (false however, our regression model also discovered additional nucle-
positives). This error propensity derives from non-amyloid ating sequences that primarily appear to reside on the surface of
annotations which primarily correspond to regions of undeter- protein molecules (Fig. 5b–h) and as a result, are characterised by
mined aggregation propensity, a notion that is highlighted by high solvent exposure (Fig. 5i and j). Partition coefficients clearly
recent studies, such as in the case of calcitonin35, cystatin-C36 and indicate that these exposed peptide segments identified by Cordax
transthyretin37. In contrast, other proteins have been linked to are primarily water-soluble sequences, whereas APRs that are
the formation of β-helical structures and as an after effect contain predicted by the majority of sequence-based predictors are largely
elongated fragments characterised, yet unverified in their entirety, insoluble (Fig. 5k). Sequence distribution analysis signifies that
as amyloidogenic, which can introduce type-II errors (false this increased exposure and solubility is complemented by an
negatives) when applying predictors of local aggregation expected decrease in sequence hydrophobicity (Fig. 5l). More
propensity38–41. The aforementioned shortcomings are reflected specifically, APRs identified solely by Cordax are relatively enri-
by the low MCC values that are reported for all aggregation ched in charged or polar side chains (Fig. 5l) and are frequently
predictors (Supplementary Table 1) and the fact that predicted parts of α-helical or unstructured segments (Fig. 5m). This
segments were originally considered neutral, but later shown to implies that these regions are in fact conformational switches that
be aggregation hotspots (Supplementary Fig. 1)35–41. may, under fitting misfolding conditions, transiently move
towards the formation of β-aggregates. The fact that these
sequences are not dictated by typical sequence propensities, such
Designed APR nucleators validate the accuracy of Cordax
as hydrophobicity or β-structure tendency, explains why
predictions. In the interest of improving the current description
sequence-based predictors overlook them.
of the familiar amyloidogenic protein dataset, we selected and
synthesised a subset of 96 peptides corresponding to strong
aggregation prone regions identified in these proteins by Cordax.
Cordax infiltrates uncharted areas of amyloid sequence space.
Apart of prediction strength, the peptide screen was also selec-
To further explore the capabilities of our method, we composed a
tively constructed to ensure broad sequence variability and a wide
map of the known amyloid-forming sequence space using t-
distribution on the proteins of the dataset, with a preference for
distributed stochastic neighbour embedding (t-SNE) for dimen-
longer entries defined by inadequate previous characterisation.
sionality reduction (Fig. 6a). As input, we used a 20-dimensional
Peptide sequences were cross-checked and filtered to exclude
parameterisation vector describing all newly identified amyloi-
overlapping sequences with previously identified amyloid regions
dogenic peptides merged to the known amyloid-forming hex-
and WALTZ-DB (Supplementary Data 2). The remaining selec-
apeptide sequences in WALTZ-DB, in terms of their basic
tion of 96 peptides were synthesised using standard solid phase
physicochemical properties and amino acid composition, as well
synthesis and their amyloid-forming properties were initially
as prediction outputs derived from Cordax and other high spe-
examined using Thioflavin-T (Th-T) or pFTAA binding, fol-
cificity predictors. t-SNE mapping pinpointed clear areas of
lowing rotating incubation for 5 days at room temperature. The
sequence space where Cordax correctly identifies amyloid pro-
binding assays are complementary, as Th-T and pFTAA are
pensity (purple colour in Fig. 6a), which primarily extend towards
opposingly charged molecules, which increases the amyloid
regions that remain unpredicted (shown in black) and seclude
identification rate by overcoming cases of dye-specific failure to
from a large base of sequences identified by multiple methods,
bind to amyloid surfaces based on charge repulsion. Under these
including Cordax (cyan colour). Clustering analysis (Fig. 6b)
conditions, 66 peptides successfully bind to the specific dyes
performed using physicochemical properties (Figs. 6c–e), sec-
(Fig. 4a and b) by forming fibrils with typical amyloid
ondary structure propensities (Fig. 6f) and side chain size dis-
morphologies and properties that were verified using transmis-
tributions (Fig. 6g, h) identifies that this common base of by-now
sion electron microscopy (Fig. 4c) and Congo red staining for
easy to predict APRs are characterised by high hydrophobicity,
selected cases (Fig. 4d). As these dyes are known to yield false
strong β-sheet propensity and a high relative content of aliphatic
negatives, in particular for short peptides, all dye-negative pep-
side chains (cluster 1 in Fig. 6b), still echoing the initial discovery
tides were further investigated using electron microscopy. During
of APRs by these features6. Cordax explores regions adjacent to
this scan, we recovered 19 additional sequences that were capable
this with a higher content of shorter side chains (clusters 2 and 5).
of forming sparse amyloid-like fibrils with shorter lengths (Sup-
Notably, amyloid nucleators of this composition are an invaluable
plementary Fig. 2). Taking the latter into account, Cordax was
resource for amyloid nanomaterial designs with elastin-like
able to fish out a total number of 85 novel nucleation segments
properties, are enriched in functional amyloids and have also
with unparalleled accuracy (89%), thus providing a rigorously
been linked to ancestral amyloid scaffolds in early life42–45. A
improved description of the protein set to be used for the efficient
similar trend in amino acid composition has also been reported
testing and development of future predictors (Supplementary
for proteins that form condensates through phase transition, such
Fig. 1).
as TDP-43 and FUS16,18. Low complexity regions (LCRs) that are
enriched in short side chains, such as Gly or Ala, have been
Cordax detects highly soluble surface-exposed conformational shown to drive phase separation, often as an intermediate event
switches. The expanded amyloidogenic annotation of the protein towards fibrillation, particularly in polar LCRs with lower ali-
dataset was supplemented with structural analysis of the newly phatic content and strong disorder or α-helical propensities, such
identified aggregation prone regions. Out of 96 peptides designed as the sequences discovered in cluster 517,46. Further to this,
and experimentally tested, 85 peptides were found to display Cordax provides significant advancement by traversing in areas
evident amyloid-forming features, with more than half (55.3%) with a higher content of negatively or positively charged regions
being predicted specifically by Cordax, contrary to shared pre- (clusters 3, 4, 6 and 7, respectively). Charged residues often act as
dictions with sequence-based tools of high specificity (44.7%) gatekeepers that directly disrupt aggregation or modulate it by
(Supplementary Data 2). Pinpointing the location of the identified flanking APRs within protein sequences47. Based on this premise,
nucleators in parental protein folds (Fig. 5a) revealed that APRs most sequence-based predictors negatively correlate net charge to
NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunications 56
Th-T fluorescence intensity (490 nm)
e
c
a
pFTAA fluorescence intensity (510 nm)
5.0 × 10
2 × 10
4 × 10
0.0
2.5 × 103
3
4
4
6 × 104
8 × 104
0
3 × 104
6 × 104
9 × 104
(SVDYEV) Acylphos2 (9 – 14) p > 0.9999 (SVDYEV) Acylphos2 (9 – 14) p < 0.0001
(IGVVGW) Acylphos2 (34 – 39) p < 0.0001 (IGVVGW) Acylphos2 (34 – 39) p = 0.0322
(NFSIRY) Acylphos2 (94 – 99) p = 0.9995 (NFSIRY) Acylphos2 (94 – 99) p < 0.0001
(VNFSEF) Amphot (36 – 41) p = 0.9995 (VNFSEF) Amphot (36 – 41) p = 0.8600
IFTVQL
STVISY
ARTICLE
IQVYSR
p > 0.9999 p = 0.9804
NDAFVL
(SEFSKK) Amphot (39 – 44) (SEFSKK) Amphot (39 – 44)
(AFFLFC) Amphot (101 – 106) p < 0.0001 (AFFLFC) Amphot (101 – 106) p < 0.0001
(LSSYWE) ApoC2 (44 – 49) p > 0.9999 (LSSYWE) ApoC2 (44 – 49) p = 0.9995
(NLYEKT) ApoC2 (57 – 62) p > 0.9999 (NLYEKT) ApoC2 (57 – 62) p = 0.9790
(VYVDVL) ApoAI (41 – 46) p < 0.0001 (VYVDVL) ApoAI (41 – 46) p = 0.9988
(LNLKLL) ApoAI (66 – 71) p > 0.9999 (LNLKLL) ApoAI (66 – 71) p < 0.0001
p > 0.9999 p = 0.9792
500 nm
(LLDNWD) ApoAI (70 – 75) (LLDNWD) ApoAI (70 – 75)
200 nm
500 nm
500 nm
(MDVFMK) -Syn (1 – 6) p > 0.9999 (MDVFMK) -Syn (1 – 6) p = 0.0433
(IQVYSR) 2-micro (27 – 32) p < 0.0001 (IQVYSR) 2-micro (27 – 32) p < 0.0001
(LCSTFC) Casein (50 – 55) p < 0.0001 (LCSTFC) Casein (50 – 55) p < 0.0001
(LNFLKK) Casein (176 – 181) p > 0.9999 (LNFLKK) Casein (176 – 181) p = 0.9998
(KTVYQH) Casein (196 – 201) p > 0.9999 (KTVYQH) Casein (196 – 201) p = 0.7809
VHIVKL
IIYNWQ
NFSIRY
VYVDVL
(TKVIPY) Casein (213 – 218) p > 0.9999 (TKVIPY) Casein (213 – 218) p = 0.8792
(GIVAGA) Het-S (6 – 11) p > 0.9999 (GIVAGA) Het-S (6 – 11) p < 0.0001
(DCFEYV) Het-S (23 – 28) p < 0.0001 (DCFEYV) Het-S (23 – 28) p < 0.0001
(FEYVQL) Het-S (25 – 30) p < 0.0001 (FEYVQL) Het-S (25 – 30) p = 0.0443
(ILLLFE) Het-S (87 – 92) p < 0.0001 (ILLLFE) Het-S (87 – 92) p < 0.0001
(DLVVFE) Het-S (109 – 114) p < 0.0001 (DLVVFE) Het-S (109 – 114) p < 0.0001
p < 0.0001 (AEIEIE) Het-S (184 – 189) p < 0.0001
500 nm
200 nm
200 nm
(AEIEIE) Het-S (184 – 189)
500 nm
(ASLTIL) Het-S (195 – 200) p < 0.0001 (ASLTIL) Het-S (195 – 200) p < 0.0001
(MVVEHP) Gelsolin (52 – 57) p > 0.9999 (MVVEHP) Gelsolin (52 – 57) p = 0.6738
(AYVILK) Gelsolin (94 – 99) p > 0.9999 (AYVILK) Gelsolin (94 – 99) p < 0.0001
(GWFLGW) Gelsolin (758 – 763) p = 0.9781 (GWFLGW) Gelsolin (758 – 763) p = 0.9995
(YDLHYW) Gelsolin (110 – 115) p < 0.0001 (YDLHYW) Gelsolin (110 – 115) p = 0.0396
YIILYN
ASLTIL
FVIEEV
(IFTVQL) Gelsolin (130 – 135) p < 0.0001 (IFTVQL) Gelsolin (130 – 135) p = 0.0042
QVFVWV
(FLGYFK) Gelsolin (157 – 162) p > 0.9999 (FLGYFK) Gelsolin (157 – 162) p = 0.9991
(MSVSLV) Gelsolin (310 – 315) p < 0.0001 (MSVSLV) Gelsolin (310 – 315) p < 0.0001
(EDCFIL) Gelsolin (329 – 334) p < 0.0001 (EDCFIL) Gelsolin (329 – 334) p < 0.0001
(KIFVWK) Gelsolin (341 – 346) p > 0.9999 (KIFVWK) Gelsolin (341 – 346) p = 0.0235
(DFITKM) Gelsolin (364 – 369) p > 0.9999 (DFITKM) Gelsolin (364 – 369) p = 0.9998
p < 0.0001 p < 0.0001
200 nm
200 nm
500 nm
(YIILYN) Gelsolin (474 – 479)
500 nm
(YIILYN) Gelsolin (474 – 479)
(IIYNWQ) Gelsolin (489 – 494) p < 0.0001 (IIYNWQ) Gelsolin (489 – 494) p < 0.0001
(MIIYKG) Gelsolin (544 – 549) p = 0.9997 (MIIYKG) Gelsolin (544 – 549) p < 0.0001
(NDAFVL) Gelsolin (591 – 596) p < 0.0001 (NDAFVL) Gelsolin (591 – 596) p < 0.0001
(AAYLWV) Gelsolin (601 – 606) p < 0.0001 (AAYLWV) Gelsolin (601 – 606) p < 0.0001
(FVIEEV) Gelsolin (679 – 684) p < 0.0001 (FVIEEV) Gelsolin (679 – 684) p < 0.0001
ILLLFE
AEIEIE
p = 0.1841 p < 0.0001
VNFSEF
YQLVQL
(QVFVWV) Gelsolin (706 – 711) (QVFVWV) Gelsolin (706 – 711)
(GYLTIR) RepA (45 – 50) p > 0.9999 (GYLTIR) RepA (45 – 50) p = 0.0002
(KLFNRD) RepA (75 – 80) p = 0.9998 (KLFNRD) RepA (75 – 80) p = 0.0003
(FHVKYR) RepA (97 – 102) p > 0.9999 (FHVKYR) RepA (97 – 102) p < 0.0001
(MLHKEF) RepA (122 – 127) p = 0.9995 (MLHKEF) RepA (122 – 127) p = 0.9037
(FYAVRL) RepA (141 – 146) p > 0.9999 (FYAVRL) RepA (141 – 146) p = 0.9996
500 nm
p > 0.9999 p = 0.0261
2000 nm
200 nm
(SQFIKL) RepA (151 – 156) (SQFIKL) RepA (151 – 156)
200 nm
(FSFTIA) RepA (216 – 221) p < 0.0001 (FSFTIA) RepA (216 – 221) p < 0.0001
Th-T Control pFTAA Control
Th-T fluorescence intensity (490 nm)
f
pFTAA fluorescence intensity (510 nm)
d
b
0.0
2.5 × 103
5.0 × 103
2 × 104
4 × 104
6 × 104
8 × 104
0
3 × 104
6 × 104
9 × 104
(YQLVLQ) Keratoep (30 – 35) p < 0.0001 (YQLVLQ) Keratoep (30 – 35) p < 0.0001
(STVISY) Keratoep (77 – 82) p < 0.0001 (STVISY) Keratoep (77 – 82) p < 0.0001
(VNIELL) Keratoep (163 – 168) p < 0.0001 (VNIELL) Keratoep (163 – 168) p < 0.0001
polarised light (CP). Scale bars: 500 μm. Source data are provided as a Source Data file.
(IQIHHY) Keratoep (202 – 207) p > 0.9999 (IQIHHY) Keratoep (202 – 207) p < 0.0001
(QIIEIE) Keratoep (245 – 250) p < 0.0001 (QIIEIE) Keratoep (245 – 250) p < 0.0001
(GVIHYI) Keratoep (362 – 367) p > 0.9999 p = 0.0003
BF
(GVIHYI) Keratoep (362 – 367)
(LNSVFK) Keratoep (416 – 421) p > 0.9999 (LNSVFK) Keratoep (416 – 421) p = 0.4282
(LYHGQT) Keratoep (449 – 454) p > 0.9999 (LYHGQT) Keratoep (449 – 454) p = 0.0006
(TLFTMD) Keratoep (490 – 495) p = 0.9997 (TLFTMD) Keratoep (490 – 495) p = 0.0925
(VYTVFA) Keratoep (536 – 541) p < 0.0001 (VYTVFA) Keratoep (536 – 541) p < 0.0001
(HYYAVA) Lactoferrin (110 – 115) p > 0.9999 (HYYAVA) Lactoferrin (110 – 115) p < 0.0001
(GDVAFI) Lactoferrin (223 – 228) p < 0.0001 (GDVAFI) Lactoferrin (223 – 228) p < 0.0001
YIILYN
AEIEIE
p > 0.9999 p = 0.0319
NFSIRY
(LLFKDS) Lactoferrin (317 – 322) (LLFKDS) Lactoferrin (317 – 322)
IQVYSR
(YFTAIQ) Lactoferrin (343 – 348) p < 0.0001 (YFTAIQ) Lactoferrin (343 – 348) p < 0.0001
(GWNIPM) Lactoferrin (487 – 492) p = 0.9997 (GWNIPM) Lactoferrin (487 – 492) p < 0.0001
(MDKVER) Lactoferrin (622 – 627) p = 0.9998 (MDKVER) Lactoferrin (622 – 627) p = 0.9650
CP
(YVAGIT) Lactoferrin (686 – 691) p < 0.0001 (YVAGIT) Lactoferrin (686 – 691) p < 0.0001
(CEFLRK) Lactoferrin (705 – 710) p = 0.9999 (CEFLRK) Lactoferrin (705 – 710) p = 0.0014
(LQVDLG) Medin (289 – 294) p < 0.0001 (LQVDLG) Medin (289 – 294) p = 0.0453
(MFSEFD) Prolactin (64 – 69) p < 0.0001 (MFSEFD) Prolactin (64 – 69) p < 0.0001
(LIVSIL) Prolactin (111 – 116) p = 0.9813 (LIVSIL) Prolactin (111 – 116) p < 0.0001
(LYHLVT) Prolactin (123 – 128) p > 0.9999 (LYHLVT) Prolactin (123 – 128) p = 0.9988
(MELIVS) Prolactin (158 – 163) p < 0.0001 (MELIVS) Prolactin (158 – 163) p < 0.0001
(LHCLRR) Prolactin (200 – 205) p = 0.9998 (LHCLRR) Prolactin (200 – 205) p = 0.8737
(FDKFKH) Myoglobin (44 – 49) p > 0.9999 (FDKFKH) Myoglobin (44 – 49) p = 0.9921
(QVYYRP) PrP (160 – 165) p > 0.9999 p = 0.0130
(QVYYRP) PrP (160 – 165)
(YFHARG) SAA (53 – 58) p > 0.9999 (YFHARG) SAA (53 – 58) p = 0.0001
(LIFMGH) Sup35 (263 – 268) p = 0.9990 (LIFMGH) Sup35 (263 – 268) p < 0.0001
BF
(MYVSEM) Sup35 (350 – 355) p < 0.0001 (MYVSEM) Sup35 (350 – 355) p < 0.0001
(VVVNKM) Sup35 (403 – 408) p > 0.9999 (VVVNKM) Sup35 (403 – 408) p < 0.0001
(SNFLRA) Sup35 (428 – 433) p > 0.9999 p = 0.0420
(SNFLRA) Sup35 (428 – 433)
(IGYNIK) Sup35 (434 – 439) p > 0.999 p = 0.1168
(IGYNIK) Sup35 (434 – 439)
(TDVVFM) Sup35 (440 – 445) p < 0.0001 (TDVVFM) Sup35 (440 – 445) p < 0.0001
(VDMAMC) Sup35 (542 – 547) p < 0.0001 (VDMAMC) Sup35 (542 – 547) p < 0.0001
(VHIVKL) Sup35 (612 – 617) p > 0.9999 (VHIVKL) Sup35 (612 – 617) p < 0.0001
ILLLFE
ASLTIL
FVIEEV
p = 0.0445
NDAFVL
(INFEFS) Ure2p (35 – 40) p < 0.0001 (INFEFS) Ure2p (35 – 40)
(FKVAIV) Ure2p (126 – 131) p > 0.9999 (FKVAIV) Ure2p (126 – 131) p = 0.0004
(GYTLFS) Ure2p (113 – 118) p > 0.9999 (GYTLFS) Ure2p (113 – 118) p = 0.0003
p < 0.0001
CP
(TIFLDF) Ure2p (141 – 146) p < 0.0001 (TIFLDF) Ure2p (141 – 146)
(LSIWES) Ure2p (176 – 181) p < 0.0001 p < 0.0001
(LSIWES) Ure2p (176 – 181)
NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunications
(HLVNKY) Ure2p (187 – 192) p = 0.9999 (HLVNKY) Ure2p (187 – 192) p = 0.9910
(INAWLF) Ure2p (211 – 216) p < 0.0001 (INAWLF) Ure2p (211 – 216) p < 0.0001
(LVMELD) Ure2p (270 – 275) p < 0.0001 (LVMELD) Ure2p (270 – 275) p < 0.0001
(IGINIK) Ure2p (323 – 328) p > 0.9999 p = 0.9999
(IGINIK) Ure2p (323 – 328)
(EVYKWT) Ure2p (333 – 338) p > 0.9999 (EVYKWT) Ure2p (333 – 338) p < 0.0001
(SYVLQT) Semenogelin (193 – 198) p = 0.9329 (SYVLQT) Semenogelin (193 – 198) p = 0.0036
(LLVYNK) Semenogelin (260 – 265) p > 0.9999 p = 0.0941
(LLVYNK) Semenogelin (260 – 265)
(LHYGEN) Semenogelin (298 – 303) p > 0.9999 p = 0.2828
(LHYGEN) Semenogelin (298 – 303)
Th-T Control pFTAA Control
amyloid fibrils bind Congo red as displayed under bright field illumination (BF) and exhibit typical for amyloids apple-green birefringence under crossed-
multiple comparison against the vehicle control). e Electron micrographs of amyloid fibrils formed by Th-T or pFTAA-binding peptides. f Suspensions of
Fig. 4 Amyloid-forming properties of the peptide screen designed by employing Cordax. a and b Measured pFTAA and c and d Th-T fluorescence of
synthetic peptides following rotation at 200 μM for 5 days. Data represents mean ± SD (n = 6 independent experiments, statistics: one-way ANOVA with
NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3 ARTICLE
a b c RepA
180° 180°
Ure2p PDB: 1HQO PDB: 1HKQ
Sup35
d e
Acylphosphatase-2
Ure2p
180°
180°
PDB: 1APS PDB: 4CRN
Prolactin
Lactoferrin
f g h Keratoepithelin
180°
180°
180°
PDB: 1RW5 PDB: 1CB6 PDB: 5NV6
i 1.2
p < 0.0001
j 1.2
p < 0.0001
k 5 p < 0.0001
l 40 m 100
CORDAX CORDAX
1.0 1.0 vi Joined Joined
30 75
0.8 i
Side chain burial
Main chain burial
0.8 0 n
Residue counts
Residue (%)
logD values
s
0.6 0.6 20 50
vs
0.4 0.4 –5
10 25
0.2 0.2
0.0 0.0 –10 0 0
ys
AX
ed
ge
ge
ic
x
ic
tic
r
ed
ed
AX
la
da
l
ix
nd
at
ob
oi
C
ha
in
ar
ar
Po
D
el
in
in
om
C
D
or
o/
ra
ph
Jo
ch
ch
R
H
Jo
lip
Jo
R
Pr
C
St
O
α-
Ar
ro
O
ta
e
e
C
β-
C
yd
iv
tiv
or
at
si
H
Sh
eg
Po
N
Fig. 5 Cordax identifies surface-exposed aggregation nucleators spanning residues that are typically considered unconventional for amyloid fibril
formation. a Schematic representation of Cordax-predicted topological models for APRs charted against the cognate native crystal structure of the
amyloidogenic protein Ure2p. b–h Surface representation of folded structures for b Ure2p, c RepA, d Acylphosphatase-2, e Sup35, f Prolactin, g Lactoferrin
and h Kerato-epithelin reveals that aggregation nucleators uniquely identified by Cordax (highlighted in red) are primarily exposed to the surface of
proteins, compared to segments of joint prediction (shown in blue) which are predominantly buried within the hydrophobic core of the native fold. Cordax-
specific predicted APRs produced lower volumetric burial values, calculated using FoldX, for i side chain and j main chain groups indicating that they are
considerably exposed compared to jointly identified nucleators. k Partition coefficients indicate that Cordax-specific APRs are significantly more soluble
compared to typically predicted sequences that are primarily hydrophobic and therefore insoluble. Solubility regions (vi very insoluble, i insoluble, n neutral,
s soluble, vs very soluble) are shown as coloured backgrounds72. Significant differences were computed using one-way ANOVA with multiple comparison.
l Surface-exposed Cordax-specific APRs are composed of residues with a 20% increase in polar and charged side chains, in expense of hydrophobic
residues (n = 219 residues in APRs identified by Cordax, n = 203 residues in APRs from joined predictions). m Secondary structure analysis, using FoldX,
indicates that Cordax identifies several APRs that reside in α-helical or unstructured regions within the native fold, suggesting that amyloidogenic proteins
may harbour a plethora of exposed conformation switches that can act as potential nucleators of amyloid fibril formation, under suitable misfolding
conditions (n = 219 residues in APRs identified by Cordax, n = 203 residues in APRs from joined predictions). Violin plots represent the kernel probability
densities of the data with the median, upper and bottom quartiles. Source data are provided as a Source Data file.
protein aggregation and have increased failure rates when iden- scope to improve the method, as in particular, the region with
tifying such amyloid-forming stretches. On the other hand, high disorder propensity (cluster 11) still contains many false
sequences with a high content of aromatic side chains are rela- negatives, in spite of the ability of Cordax to partially pick up a
tively easy to identify (clusters 9a and 9b), following several lines minority of sequences. Interestingly, a closer look at the partition
of evidence supporting their role in amyloid fibril formation48. coefficients of the known amyloid sequence space reveals that
Cordax also pushes forward into less well-charted areas of amy- although Cordax takes a significant step forward towards the
loid sequence space, e.g. exploring clusters with high α-helical right direction, these APRs remain very hard to identify as they
content (cluster 10) and overall a low content of aliphatic amino are characterised by even higher solubility values (Fig. 6i). Similar
acids (clusters 5, 6, 7, 8 and 9b). These regions also reveal the charting of the amyloid sequence space is achieved by using
NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunications 7ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3
a b
1
1. Hydrophobic/β-propensity/High aIndex
2 2. Short side chains/hydrophobic/High aIndex
3 3. Acidic/High aIndex
4. Basic/High aIndex
CORDAX
Clustering sorting scheme
4
5. Short side chains/hydrophobic/Low aIndex
Predictions
Non-CORDAX 5
6. Acidic/Low aIndex
Joined
9a 7. Basic/Low aIndex
Unpredicted
8. Hydrophobic/High instability/Low aIndex
7 9a. Aromatic/High aIndex
9b 9b. Aromatic/Low aIndex
10. α-helix propensity
6
11. Unstructured propensity
11
8
t-SNE dim2
t-SNE dim1 10
c d e i
10
p < 0.0001
p < 0.0001
1 2 300 p = 0.343
Hydrophobicity
1
Alndex
0.5 200 p = 0.0002
Charge
0
0 p = 0.027
–1 100
–0.5 –2 p < 0.0001
0 5
–1 –3
vi
logD values
i
0
n
s
f g h
–5 vs
100
α-helix
Secondary structure
(%) Tiny sc
60
(%) Aromatic
75
α/β 40 50
20 25
β-strand 0
0 –10
Unstructured
x
x
ed
ed
da
da
in
ct
or
or
Jo
di
C
C
re
n-
np
no
U
Fig. 6 t-SNE 2D-representation of the known experimentally determined amyloidogenic sequence space. a State-of-the-art sequence-based methods
predict amyloid sequences, with (shown in cyan) or without Cordax (shown in yellow), that are grouped together in a major landing cluster and two
islands. Cordax predictions (shown in purple) transgress towards areas of amyloid-forming sequences that remain undetected by most methods (shown in
black). b Clustering of the t-SNE map using basic physicochemical properties and amino acid composition of the amyloid peptides. Each data point is
colour-coded based on the sorting scheme shown in the legend and background areas are used to pinpoint the major clusters of each defined category. The
clustering scheme was defined by characterising the t-SNE map using peptide c hydrophobicity, d net charge, e aliphatic index, f secondary structure
propensity and percentage content of g aromatic or h short residue side chains. i Partition coefficient analysis reveals that APRs identified by Cordax are
primarily soluble sequences compared to easy to identify sequences of joint prediction. On the other hand, APRs that remain hard to detect are
characterised by higher solubilities. Solubility regions (vi very insoluble, i insoluble, n neutral, s soluble, vs very soluble) are shown as coloured
backgrounds. Significant differences were computed using one-way ANOVA with multiple comparison. Violin plots represent the kernel probability
densities of the data with the median, upper and bottom quartiles.
uniform manifold approximation and projection (UMAP) for structural compatibility can overcome typical sequence propen-
dimensionality reduction (Supplementary Fig. 3a and b), while sities as a pivotal driver of aggregation nucleating sequences and
PCA analysis highlights that CORDAX slowly infiltrates the suggests that under the proper conditions, the boundaries currently
sequence space of higher solubilities (Supplementary Fig. 3c and d). considered compatible to protein amyloid-like assembly are
Overall, dimensionality reduction transformation highlights that potentially far wider than previously expected.
8 NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunicationsNATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3 ARTICLE
a 100 b
CORDAX
d
3D Profile
te
tric
es
Fib Predictor
l-R
80
lle
ra
Pa
d
Prediction (%)
te
ric
60 VIGIAQ NNQGSN KGAIIG GVNLVN GAIIGL AVVTGV
st
Re
1-
RMSD : 1.6 Å RMSD : 1.3 Å RMSD : 2.7 Å RMSD : 1.2 Å RMSD : 1.4 Å RMSD : 2.1 Å
ss
Cla
40
20
0
y
na nd
og
HGVTTV GYLLGS GVTGIA IFQINS GVTAVA AALQSS
io a
y
l
ct str
lit
po
re β-
To
RMSD : 2.3 Å RMSD : 1.1 Å RMSD : 0.9 Å RMSD : 2.5 Å RMSD : 1.3 Å RMSD : 1.2 Å
di
c p < 0.0001 d e (IYQYGG) 6G8C (LNIYQY) 6G8D (VTQVGF) 6G8E (IFAEDV) 5TXH
40 p < 0.0001
20
p < 0.0001
30
Blosum62 score
Blosum62 score
10
20
PDB
10 0 strucure
0
–10
–10
–20 –20
0 20 40 60 80
at e
at y
at y
m log
m g
m plat
is lo
ch
ch
ch
po
m opo
Model rank
m
Cordax
To
Te
T
Template mismatches
model
f PDB lD CORDAX 3D Profile FibPredictor
Topology prediction (Å)
6G8C 0.4 2.7 3.8
6G8D 2.1 2.1 2.4
Structural
overlay
6G8E 1.1 1.5 Class mismatch
5TXH 1.7 Class mismatch Class mismatch
RMSD : 0.4 Å RMSD : 2.1 Å RMSD : 1.1 Å RMSD : 1.7 Å
Fig. 7 High-precision recognition of amyloid fibril structural architectures using Cordax. a Prediction accuracy comparison of Cordax to the only publicly
available structural predictors, Fibpredictor and 3D-profile. For comparison, methods were run against a non-redundant sequence set extracted from
amyloid-forming peptide interfaces (n = 73 templates). b Model topologies, predicted by applying Cordax (shown in orange), strongly superimpose to
matching solved structural layouts of amyloidogenic nucleators (shown in magenta), as indicated by the reported minor RMSD values. c Sequence identity
contribution for template selection during cross-threading analysis of the Cordax structural library. Alignment scores for selected models matching the
template sequences compared to mismatching template selections of similar or different topological layouts (n = 73 templates). d Alignment scores of the
APRs newly identified by Cordax to the sequence of the selected templates, plotted against their corresponding model ranks (n = 96 sequences threaded).
e Structural alignment of Cordax outputs to experimentally determined 3D structures. Models were calculated for three aggregation prone sequences
derived from CsgA curli forming protein (PDB IDs: 6G8C, 6G8D and 6G8E, respectively) and a peptide mutant sequence derived from Aβ amyloid peptide
(PDB ID: 5TXH). Predicted topologies are overlapping representations of the experimentally determined amyloid fibril cores, f as displayed by a direct
comparison to other software. Violin plots represent the kernel probability densities of the data with the median, upper and bottom quartiles. Source data
are provided as a Source Data file.
Cordax predicts the structural layout and topology of fibril closer look reveals that sequence specificity may be a modulating,
cores. Due to restricted availability of experimentally determined yet not determining factor for this selection process. Steric per-
structures not included in the Cordax library, we first analysed turbations can be introduced due to restrictions deriving from
the information derived from cross-threading analysis in order to closely interdigitating side chains within the packed interfaces,
test the performance of the tool in predicting the structural therefore, key residue positions can be bound to the overall sta-
architecture of aggregation prone stretches. Among 73 unique bility of certain structural topologies and decrease the acceptable
sequences corresponding to the structural library, Cordax was sequence space that can accommodate energetically favourable
able to accurately assign the correct architecture to 63%, whereas interactions. This is highlighted by the sequence similarity
81% was identified with proper β-strand orientation (parallel/ observed between topological matches (Fig. 7c, Supplementary
antiparallel) (Fig. 7a, Supplementary Data 3 and 4). In compar- Data 4). On the other hand, topologically different model selec-
ison, FibPredictor49 correct topology allocation was limited to tions could also be a consequential outcome of amyloid poly-
9.5% of the sequences and assigned β-strand directionality morphism. The observed sequence redundancy of the Cordax
amounted to 32.9%, while introducing an evident preference library illustrates that APRs can form amyloid fibrils with distinct
towards antiparallel architectures (Fig. 7a). Similarly, the 3D- morphological layouts50–52, a notion that is also supported by the
profile method is restricted to linking all potential queries with a common morphological variability of aggregates formed at the
class 1 topology, hence was incapable of predicting alternative level of full-length amyloid-forming proteins53,54. The modulat-
architectures (Fig. 7a). Structural alignment indicated that even in ing role of sequence dependency was also evident for the 96-
cases of mismatching selected templates, modelled architectures peptide screen. A ranked analysis of the output models indicated
strongly superimpose to the solved structures (Fig. 7b), suggesting that templates with higher alignment scores were not crucial for
that Cordax identifies the correct topology with high accuracy. A the topology selection process, although could often correspond
NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunications 9ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17207-3
to the favourable architectures (Fig. 7d), thus highlighting that the provided significant insights on the polymorphic architectures of
structural predictions of Cordax are relatively unbiased in terms amyloid fibrils57. In this line, the emergence of cryo-EM has been
of the sequence space composing the structural templates. pivotal in determining features of amyloid fibril polymorphs58,
The accuracy of the tool was also cross-referenced against complementing earlier efforts developed using solid-state NMR
experimentally determined structures of fibril cores not included spectroscopy53,59. Notably, these structures represent snapshots
in the structural library. We utilised the recently solved structures of the kinetic cores of aggregation or end-state morphologies of
of parallel fibril-forming segments derived from the major curli amyloid fibrils and therefore provide limited information on the
protein CsgA55, as well as an anti-parallel polymorphic APR underlying aggregation pathways and toxicity-related effects of
variant segment derived from the amyloid-β peptide56. Compared amyloids. On the other hand, the growing number of high-
to other structural predictors, only Cordax could invariantly resolution cryo-EM structures has highlighted the in vivo struc-
predict the correct architecture for every steric zipper as the tural diversity of amyloid fibrils60, whereas steric zippers have
closest representation of the experimentally determined reference been recently used for the development of targeted ther-
structures (Fig. 7e and f). This performance can only improve as apeutics61–63. However, determining the structural layout of
the fragment library expands, so we aim to update it at regular amyloid fibrils still remains challenging. Cordax attempts to
intervals, providing there is a noticeable increase in solved provide a cost-effective complementary powerful computational
structures in the future. alternative that can be operated without any required scientific
expertise necessary to apply the intricate technical approaches.
Apart from its function as an aggregation predictor, the tool is
Discussion uniquely poised to provide detailed complementary structural
The number of amyloid structures in the protein databank has information on the putative amyloid fibril architecture of iden-
been steadily increasing over the last two decades. It has now tified APRs. Users can utilise the method to structurally char-
achieved a number (>80) that was reached for globular proteins at acterise identified APRs by classifying their overall specific
the beginning of the 1980s and that then triggered the first topological preferences, including β-strand directionality and key
developments of template-based modelling methods including residue positions that are integral parts of the amyloid core. The
homology-based and threading (or fold recognition) in an latter information is imperative for efforts focused on under-
attempt to estimate the versatility of individual folds and discover standing the underlying mechanisms that dictate amyloid-related
novel folds in a more directed manner. Similarly, we here diseases or the formation of functional amyloids, but can also
developed Cordax, an exhaustively trained regression model that have an immense impact on the design of applied nano-
leverages a substantial library of curated amyloid template biomaterials64, targeted amyloid inducers65 or counteragents,
structures combined with machine learning. Cordax uses a following the increased interest in the development of structure-
logistic regression approach to translate structural compatibility based inhibitors of aggregation61–63.
and interaction energies into sequence aggregation propensity
and is therefore unconstrained by defined sequence tendencies, Methods
such as hydrophobicity or secondary structure preference that Regression model training. In previous work we synthesised and explored the
direct most sequence-based predictors. As a result, we discovered aggregation potential of 940 peptide sequences derived from both functional and
unconventional amyloid-like sequences, including sequences with pathological amyloid-forming proteins, which were supplemented with additional
data on 462 hexapeptides derived from other published sources to develop
low aliphatic content, high net charge or sequences with low WALTZ-DB 2.032, the largest public comprehensive repository of experimentally
intrinsic structural propensities. Clustering amyloid sequences by defined amyloidogenic peptides. In total, 1402 hexapeptide sequences from
t-SNE two-dimensional reduction revealed the substructure of WALTZ-DB were modelled on the 140 backbone structures of the Cordax library,
amyloid sequence space. Apart from a large cluster corresponding leading to the generation of 196,280 models. The thermodynamic stability of each
model (ΔG, kcal mol−1) was calculated using FoldX and fed into a logistic
to sequences found in the hydrophobic core of globular proteins, regression model (Fig. 2c). This model was used to distil the aggregation propensity
we also found clusters corresponding to surface-exposed amyloid from the free energy values. Towards this end, from the calculated ΔGs, we isolated
sequences in globular proteins, small aliphatic functional amy- 50 representative energies using a recursive feature elimination algorithm (using
loids, N/Q/Y prions, strongly helical and intrinsically disordered the RFE module of the SciKit-learn python package33 and selecting for the set of
sequences which could be compatible with liquid–liquid phase templates that maximised the AUC). As a result, each sequence is described with a
50-dimensional vector. Next, the data were transformed in order to be constrained
responsive sequences. Our analysis highlights the discovery of in a scoring range between 0 and 1, using a Min/Max scaling algorithm. The
highly soluble, yet amyloid-forming, sequences and suggests that regression model was trained with L2 penality and regularisation strength (C) equal
the largest portion of the remaining uncharted amyloid sequence to 1. Both scaling of the estimated ΔG and the machine-learning model were
space is hidden in this corner (Fig. 6a and i). Indeed, most developed using the SciKit-learn python package66.
archetypal hydrophobic APR sequences have low intrinsic solu-
bility. As a result, low solubility and aggregation propensity are Model pipeline. Cordax receives a protein sequence in FASTA format as input,
properties that are often wrongly used interchangeably. It is which is fragmented into hexapeptides using a sliding window process. Sequences
are then threaded against the fragment library utilising FoldX and the derived free
important to differentiate between the initial solubility and energies are translated into scoring values for every peptide window. An energe-
aggregation propensity of a peptide, as soluble monomeric tically fitted model is selected as the closest representative of the overall topology of
sequences can often self-assemble, at later time points, into the amyloid fibril core for each predicted window and is provided as output in
insoluble amyloid fibrils. The APRs that are newly discovered by standard PDB format to the users (Fig. 2c). An amyloidogenic profile is generated
by scoring every single residue of the input sequence with the maximum calculated
Cordax are often highly soluble in their monomeric form, even score of the corresponding windows, followed by a binary prediction for every
more than the already known polar APRs from the yeast prions, segment. Finally, calculated energies are stored automatically in a growing local
as they contain many charged and polar residues, yet surprisingly database and can be retrieved, thus creating a ‘lazy’ interface that bypasses unne-
can still assemble into amyloids. Overall, our approach demon- cessary computation for recurring sequence segments or future runs.
strates that the increasing structural information on amyloids
now allows for more fine-graded structural rule learning of the Datasets. Performance assessment of Cordax was carried out utilising two indi-
amyloid state. vidual data sets for peptide and protein aggregation propensity detection. Further
validation of the method was performed against an independent subset screen of 96
Recent developments in microcrystal electron diffraction have hexapeptides sequences.
enabled structural determination from nanocrystals that are not For peptide aggregation propensity, we used a dataset of 1402 non-redundant
typically suited for traditional X-ray diffraction and have hexapeptides contained in the WALTZ-DB 2.0 repository32. This database is the
10 NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunicationsYou can also read