Bystroff Publications with abstracts
Motivation: We present HMMSTRTM, a Hidden Markov Model (HMM) for predicting the topology of transmembrane (TM) proteins. HMMSTRTM provides additional prediction categories of TM regions provided by the PDBTM corpus, such as transmembrane beta sheets, coils, and reentrant loops. Results: HMMSTRTM is competitive with existing TM protein topology predictors such as TMHMM; it correctly predicts at least half the residues in 96.18% of all transmembrane helices in a cross-validation dataset. Availability: Model architecture, source code, and supplementary figures are available on GitHub: github.com/TiburonB/HMMSTRTM.
The fusion loop (FL), a 51-residue segment of the dengue virus (DENV) envelope (E) protein, has been shown to bind antibodies that neutralize DENV infection in cell culture. Vaccination with this loop could raise broadly neutralizing antibodies and avoid antibody dependent enhancement in second serotype infections associated with whole virus vaccination. We propose a new DENV vaccine in which FL has been genetically fused to a well-known and highly immunogenic carrier, the human papillomavirus (HPV) L1 protein (L1). Chimeric L1-FL was expressed in human cell culture, but expression levels of virus-like particles (VLP) were initially low. Expression levels were improved after adding a bridging disulfide bond at the base of the loop, and were further improved by transfecting cells with a mixture of 9 parts chimera to 1 part wild-type L1 expression vectors. VLPs formed from the chimeric construct were purified using ultracentrifugation and were shown to form hollow particles of the expected size using transmission electron microscopy. The improvements in expression are discussed in the context of a theoretical pathway for folding and assembly of VLPs.
Leave-One-Out Green Fluorescent Proteins (LOO_GFPs) unite a peptide sensor and detection method in a single biocompatible molecule. In LOO_GFPs, one β-strand of the GFP protein is removed and the cavity is re-engineered to specifically bind a desired analyte. Unfortunately, LOO_GFPs have a reduced quantum yield relative to the parent protein, and form fluorescent oligomers in the unbound state, reducing signal-to-noise. Immobilizing LOO_GFPs in materials composed of the Drosophila protein Ultrabithorax (Ubx) via gene fusion both increased the fluorescent signal and prevented oligomerization, substantially reducing background noise. These fibers represent the first incorporation of a heterodimeric protein into materials via gene fusion. Interactions between LOO_GFP and Ubx that hampered analyte rebinding were mitigated by optimizing salt and detergent concentrations in the assay. The result is a useful first-generation fluorescent biosensor, immobilized in and stabilized by robust protein fibers. This study highlights the advantages and identifies potential pitfalls associated with protein immobilization in materials.
Asthenozoospermia accounts for over 80% of primary male infertility cases. Reduced sperm motility in asthenozoospermic patients is often accompanied by teratozoospermia, or defective sperm morphology, with varying severity. Multiple morphological abnormalities of the flagella (MMAF) is one of the most severe forms of asthenoteratozoospermia, characterized by heterogeneous flagellar abnormalities. Among various genetic factors known to cause MMAF, multiple variants in the DNAH2 gene are reported to underlie MMAF in humans. However, the pathogenicity of DNAH2 mutations remains largely unknown. In this study, we identified a novel recessive variant (NM_020877:c.12720G/T;p.W4240C) in DNAH2 by whole-exome sequencing, which fully co-segregated with the infertile male members in a consanguineous Pakistani family diagnosed with asthenozoospermia. 80-90% of the sperm from the patients are morphologically abnormal, and in silico analysis models reveal that the nonsynonymous variant substitutes a residue in the dynein heavy chain domain and destabilizes DNAH2. To better understand the pathogenicity of various DNAH2 variants underlying MMAF in general, we functionally characterized Dnah2-mutant mice generated by CRISPR/Cas9 genome editing. Dnah2-null males, but not females, are infertile. Dnah2-null sperm cells display absent, short, bent, coiled, and/or irregular flagella consistent with the MMAF phenotype. We found misexpression of centriolar proteins and delocalization of annulus proteins in Dnah2-null spermatids and sperm, suggesting dysregulated flagella development in spermiogenesis. Scanning and transmission electron microscopy analyses revealed that flagella ultrastructure is severely disorganized in Dnah2-null sperm. Absence of DNAH2 compromises the expression of other axonemal components such as DNAH1 and RSPH3. Our results demonstrate that DNAH2 is essential for multiple steps in sperm flagella formation and provide insights into molecular and cellular mechanisms of MMAF pathogenesis.
Vaccines train the immune system to recognize antigens, preventing disease by enabling a timely adaptive immune response. Vaccine efficacy and safety may depend on directing the antibodies to specific epitopes, especially if the antigenic target protein is immunologically "self", such as a cancer cell or a sperm cell. Short peptides can focus antibodies to the desired epitopes, but such small antigens are unstable unless incorporated into a larger protein. We selected human papillomavirus (HPV) L1 protein as a carrier for a novel peptide subunit vaccine, constructed by inserting the coding sequences of the desired epitopes into sites of natural variation in the L1 proteins of the many HPV strains. The native conformation of these epitopes, based on molecular models, is preserved by disulfide linking its endpoints, an act which may also help to preserve the folding pathway of L1. Successful raising of antibodies that bind the target is seen as an indicator of successful native-state antigenic loop modeling. Successful self-assembly into 55 nm virus-like particles (VLPs) is an indicator that the inserted loops do not interfere with folding and assembly of the L1 protein. Previous work has shown that VLPs are especially immunogenic scaffolds, and that the polyvalency possible in VLP display of peptide antigens sometimes improves the immune response. This work focuses on the design and testing of VLPs displaying peptides from a sperm-specific calcium channel, CatSper, to produce an anti-sperm vaccine for contraceptive use. We will discuss the folding and assembly of the designed chimeric capsid proteins.
Projections of future global human population are traditionally made using birth/death trend extrapolations, but these methods ignore limits. Expressing humanity as a K-selected species whose numbers are limited by the global carrying capacity produces a different outlook. Population data for the second millennium up to the year 1970 was fit to a hyper-exponential growth equation, where the rate constant for growth itself grows exponentially due to growth of life-saving technology. The discrepancies between the projected growth and the actual population data since 1970 are accounted for by a decrease in the global carrying capacity due to ecosystem degradation. A system dynamics model that best fits recent population numbers suggests that the global biocapacity may already have been reduced to one-half of its historical value and global carrying capacity may be at its 1965 level and falling. Simulations suggest that population may soon peak or may have already peaked. Population projections depend strongly on the unknown fragility or robustness of the Earth's essential ecosystem services that affect agricultural production. Numbers for the 2020 global census were not available for this study.
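One way to read the hyper-exponential form described in the abstract above is dN/dt = r(t)·N with a rate constant that itself grows exponentially, r(t) = r0·exp(a·t). The following is a minimal sketch under that assumption only; the function name and the parameters `n0`, `r0`, and `a` are illustrative placeholders rather than the fitted values or the system dynamics model of the study, and the carrying-capacity limit is deliberately left out.

```python
import math

def hyper_exponential(t, n0, r0, a):
    """Population at time t, assuming dN/dt = r(t) * N with r(t) = r0 * exp(a * t).

    Integrating r(t) from 0 to t gives (r0 / a) * (exp(a * t) - 1), so
    N(t) = n0 * exp((r0 / a) * (exp(a * t) - 1)).
    All parameter names are hypothetical; no carrying capacity is modeled.
    """
    if a == 0.0:
        # With a constant rate the model reduces to ordinary exponential growth.
        return n0 * math.exp(r0 * t)
    return n0 * math.exp((r0 / a) * math.expm1(a * t))
```

With `a > 0` the curve rises faster than any plain exponential that starts from the same `n0` and `r0`, which is the qualitative behavior the abstract attributes to the growth of life-saving technology.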
CatSper is a voltage-dependent calcium channel located in the plasma membrane of the sperm flagellum and is responsible for triggering hyperactive motility. A homology model for the transmembrane region was built in which the arrangement of the subunits around the pseudo-four-fold symmetry axis was deduced by the pairing of conserved transmembranal cysteines across mammals. Emerging directly from the predicted quaternary structure is an architecture in which tetramers polymerize through additional, highly conserved cysteines, creating one or more double rows of channels extending the length of the principal piece of the mammalian sperm tail. The few species that are missing these cysteines are eusocial or otherwise monogamous, suggesting that sperm competition is selective for a disulfide-crosslinked macromolecular architecture. The model suggests testable hypotheses for how CatSper channel opening might behave in response to pH, 2-arachidonoylglycerol, and mechanical force. A flippase function is hypothesized, and a source of the concomitant disulfide isomerase activity is found in CatSper-associated proteins β, δ and ε.
Background With increasing interest in ab initio protein design, there is a desire to be able to fully explore the design space of insertions and deletions. Nature inserts and deletes residues to optimize energy and function, but allowing variable length indels in the context of an interactive protein design session presents challenges with regard to speed and accuracy.
Results Here we present a new module (INDEL) for InteractiveRosetta which allows the user to specify a range of lengths for a desired indel, and which returns a set of low-energy backbones in a matter of seconds. To make the loop search fast, loop anchor points are geometrically hashed using Cα-Cα and Cβ-Cβ distances, and the hash is mapped to start and end points in a pre-compiled random access file of non-redundant protein backbone coordinates. Loops with superposable anchors are filtered for collisions and returned to InteractiveRosetta as poly-alanine for display and selective incorporation into the design template. Sidechains can then be added using RosettaDesign tools.
Conclusions INDEL was able to find viable loops in 100% of 500 attempts for all lengths from 3 to 20 residues. INDEL has been applied to the task of designing a domain-swapping loop for T7-endonuclease I, changing its specificity from Holliday junctions to paranemic crossover (PX) DNA.
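The anchor-hashing idea in the INDEL abstract above can be illustrated with a small in-memory sketch. It is not the INDEL implementation: the bin width, the names `anchor_key` and `LoopIndex`, and the use of a Python dictionary in place of the pre-compiled random access file are all assumptions made for illustration.

```python
import math
from collections import defaultdict

BIN_WIDTH = 0.5  # Angstroms; an assumed bin size, not taken from the paper

def anchor_key(ca_start, ca_end, cb_start, cb_end, bin_width=BIN_WIDTH):
    """Hash a pair of loop anchors by their binned C-alpha and C-beta distances."""
    ca_bin = int(math.dist(ca_start, ca_end) // bin_width)
    cb_bin = int(math.dist(cb_start, cb_end) // bin_width)
    return (ca_bin, cb_bin)

class LoopIndex:
    """Hypothetical lookup table from anchor keys to (start, end) residue indices."""

    def __init__(self):
        self._table = defaultdict(list)

    def add(self, key, start, end):
        # start and end are the anchor positions in the backbone database.
        self._table[key].append((start, end))

    def lookup(self, key, min_len, max_len):
        # Loop length is counted here as the residues strictly between anchors.
        return [(s, e) for s, e in self._table.get(key, [])
                if min_len <= e - s - 1 <= max_len]
```

A query computes `anchor_key` for the two anchor residues in the design template and calls `lookup` with the requested length range; candidates would still need the superposition and collision filters that the abstract describes.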
Paranemic crossover DNA (PX-DNA) is a four-stranded multicrossover structure that has been implicated in recombination-independent recognition of homology. Although existing evidence has suggested that PX is the DNA motif in homologous pairing (HP), this conclusion remains ambiguous. Further investigation is needed but will require development of new tools. Here, we report characterization of the complex between PX-DNA and T7 endonuclease I (T7endoI), a junction-resolving protein that could serve as the prototype of an anti-PX ligand (a critical prerequisite for the future development of such tools). Specifically, nuclease-inactive T7endoI was produced and its ability to bind to PX-DNA was analyzed using a gel retardation assay. The molar ratio of PX to T7endoI was determined using gel electrophoresis and confirmed by the Hill equation. Hydroxyl radical footprinting of T7endoI on PX-DNA was used to verify the positive interaction between PX and T7endoI and to provide insight into the binding region. Cleavage of PX-DNA by wild-type T7endoI produced DNA fragments, which were used to identify the interacting sites on PX for T7endoI and led to a computational model of their interaction. Altogether, this study has identified a stable complex of PX-DNA and T7endoI and lays the foundation for engineering an anti-PX ligand, which can potentially assist in the study of molecular mechanisms for HP at an advanced level.
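For readers unfamiliar with the Hill equation mentioned in the abstract above, the standard form gives the fraction of sites bound as θ = [L]^n / (K^n + [L]^n). The sketch below shows only that textbook relation; the function and parameter names are illustrative, and no fitted values from the PX-DNA study are implied.

```python
def hill_fraction_bound(ligand, k_half, n):
    """Fraction bound from the standard Hill equation.

    ligand: free ligand concentration (same units as k_half)
    k_half: concentration giving half-maximal binding
    n:      Hill coefficient (n > 1 suggests positive cooperativity)
    """
    ligand_n = ligand ** n
    return ligand_n / (k_half ** n + ligand_n)
```

By construction the function returns 0.5 whenever `ligand == k_half`, whatever the Hill coefficient.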
Previously, a method called GeoFold was described for structure-based modeling of protein folding pathways using pivots, hinges and breaks (Ramakrishnan et al., 2012). The method reproduces experimental observations for proteins that have been covalently modified by introduced disulfide linkages. However, GeoFold fails to find experimentally reasonable unfolding steps for proteins that contain closed beta barrels. Here we add a barrel opening move for GeoFold and validate the results of pathway simulations using experimental data. Barrels are detected as cycles in a network graph of beta strand contacts. A barrel opening move is defined as the breaking of one row of beta-beta contacts (called a "seam") along with contacts to any residues that bridge the seam (called "buttons"). Unfolding pathways were generated for three well-studied green fluorescent protein variants that have one or two added disulfide bonds. The GeoFold results are consistent with the experimental data when seam moves are used but not when using the previous method. The new model for protein unfolding helps to understand how disulfide linkages can be used to engineer increased kinetic and thermodynamic stability in proteins.
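The statement above that barrels are detected as cycles in a graph of beta strand contacts can be illustrated with a generic cycle search. This is a sketch of ordinary depth-first cycle detection on an undirected graph, not the GeoFold code; the function name and the input format (pairs of strand labels) are assumptions.

```python
def find_strand_cycle(contacts):
    """Return a list of strands forming one cycle in the contact graph, or None.

    contacts: iterable of (strand_a, strand_b) pairs, one per pair of
    strands in contact. Self-contacts are ignored.
    """
    graph = {}
    for a, b in contacts:
        if a == b:
            continue
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    visited = set()
    for root in graph:
        if root in visited:
            continue
        parent = {root: None}  # discovery tree for this connected component
        stack = [root]
        while stack:
            node = stack.pop()
            visited.add(node)
            for nxt in graph[node]:
                if nxt == parent[node]:
                    continue  # the edge we arrived by
                if nxt in parent:
                    # A second route to an already discovered strand closes a cycle.
                    return _cycle_through(parent, node, nxt)
                parent[nxt] = node
                stack.append(nxt)
    return None

def _cycle_through(parent, u, v):
    """Join the tree paths from u and v up to their nearest common ancestor."""
    path_u = []
    while u is not None:
        path_u.append(u)
        u = parent[u]
    seen = set(path_u)
    path_v = []
    while v not in seen:
        path_v.append(v)
        v = parent[v]
    return path_u[:path_u.index(v) + 1] + path_v[::-1]
```

An open sheet, whose strands pair only with their neighbors along a line, yields `None`, while a closed barrel returns the ring of strands; a seam would then be one of the strand-strand contacts on that ring.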
Fluctuations in the fluorescence levels of mutants of GFP have been attributed to differences in the amount of the fluorescent chromophore that is synthesized, along with variations in its quantum yield. The majority of these studies analyzed these features as a function of structural changes and mutations to key residues that define the chromophore microenvironment. Still unanswered, however, is the question of how maturation levels change when the chromophore microenvironment is preserved across mutants of GFP. Here we present the results of a study that attempted to decipher the mechanism that regulates chromophore maturation efficiency and quantum yield in GFP. In the context of mutants with conserved chromophore microenvironments, we looked at how mutations distal to this region affected the amount and fluorescence of chromophore synthesized. Mutations that have a destabilizing impact on the structure of GFP were found to restrict the function of two key residues (R96 and E222) in chromophore maturation while reducing the quantum yield of fluorescence. In summary, these data identify structural features of GFP that can be used in the assessment of engineered mutants prior to synthesis in the lab.
Leave-one-out GFP can be redesigned to accept an exogenous replacement for the missing strand, creating a protein that glows in the presence of the targeted sequence. To probe the possibility of using this approach to detect dengue virus in mosquitos, we first asked whether we could detect an s11-tagged envelope protein using LOO11-GFP. The results were mixed. If the s11 tag was placed on the C-terminus of the E-protein, we saw very little LOO11-GFP fluorescence, but if s11 was placed on the N-terminus we saw considerably more signal. The control experiment, wherein we added un-tagged protein, produced a smaller signal. Chemical or heat denaturation allowed detection of the C-terminally tagged protein, showing that steric occlusion was the cause of the lack of signal. The results provide an initial protocol for detection of virus by LOO-GFP and a baseline for the expected performance of computationally designed LOO-GFPs that do not depend on tagging for detection. The source of non-specific binding is discussed. Secondly, we computationally designed a LOO-GFP to bind the sequence  from dengue serotype 3 envelope protein, producing several variants that glow in the presence of the co-expressed dengue peptide and which do not glow in its absence. The two experiments pave the way for the selection of computationally designed LOO-GFP biosensors that detect the targeted peptide in the context of the virus.
The auto-catalytic maturation of the chromophore in green fluorescent protein (GFP) was thought to require the precise positioning of the side chains surrounding it in the core of the protein, many of which are strongly conserved among homologous fluorescent proteins. In this study, we screened for green fluorescence in an exhaustive set of point mutations of seven residues that make up the chromophore microenvironment, excluding R96 and E222 because mutations of these positions have been previously characterized. Contrary to expectations, nearly all amino acids were tolerated at all seven positions. Only four point mutations knocked out fluorescence entirely. However, chromophore maturation was found to be slower and/or fluorescence reduced in several cases. Selected combinations of mutations showed non-additive effects including cooperativity and rescue. The results provide guidelines for the computational engineering of GFPs.
Cutinases are esterases of industrial importance for applications in recycling and surface modification of polyesters. The cutinase from Thielavia terrestris (TtC) is distinct in its ability to retain its stability and activity at acidic pH. Stability and activity at acidic pH are desirable for esterases because the pH of the reaction tends to fall as acid is generated. The pH stability and activity are governed by the charged state of the residues involved in catalysis or in substrate binding. In this study we performed a detailed structural and biochemical characterization of TtC coupled with surface charge analysis to understand its acid tolerance. The stability of TtC at acidic pH was rationalized by evaluating the contribution of charge interactions to the Gibbs free energy of unfolding at varying pH. The activity of TtC was found to be limited by substrate binding affinity, which is a function of the surface charge. Additionally, the presence of glycosylation affects the biochemical characteristics of TtC owing to steric interactions with residues involved in substrate binding.
Motivation: Mutations in homologous proteins affect changes in the backbone conformation that involve a complex interplay of forces and are hard to predict. Protein design algorithms need to anticipate these backbone changes in order to accurately calculate the energy of the structure given an amino acid sequence, and they must do so without the knowledge of the final, designed sequence.
Results: We explored the ability of the Rosetta suite of protein design tools to move the backbone from its position in one structure (template) to its position in a homolog structure (target) as a function of the diversity of the backbone ensemble, the percent sequence identity, and the size of the local zone being modeled. We describe a Pareto front in the likelihood of moving the backbone toward the target as a function of ensemble diversity and zone size. The numbers presented here will be useful for homology modeling and for protein design using the piecemeal approach.
Availability: PyRosetta scripts available here.
Cutinases are powerful hydrolases that can cleave ester bonds of polyesters such as poly(ethylene terephthalate) (PET), opening up new options for enzymatic routes for polymer recycling and surface modification reactions. Cutinase from Aspergillus oryzae (AoC) is promising owing to the presence of an extended groove near the catalytic triad, which is important for the orientation of polymeric chains. However, the catalytic efficiency of AoC on rigid polymers like PET is limited by its low thermostability, as it is essential to work at or above the glass transition temperature (Tg) of PET, i.e., 70 °C. Consequently, in this study we worked towards the thermostabilization of AoC. Use of Rosetta computational protein design software in conjunction with rational design led to a 6 °C improvement in the thermal unfolding temperature (Tm) and a 10-fold increase in the half-life of the enzyme activity at 60 °C. However, thermostabilization did not improve the rate or temperature optimum of enzyme activity. This study presents the first systematic approach towards the thermostabilization of cutinases. The notable findings include (a) surface salt bridge optimization leads to enthalpic stabilization, (b) mutations to proline lower the free energy by reducing the configurational entropy loss upon folding and (c) the lack of a correlative increase in the temperature optimum of catalytic activity with thermodynamic stability suggests that the active site is locally denatured at a temperature below the Tm of the global structure.
Leave-one-out green fluorescent protein (LOO-GFP) is a circularly permuted and split protein lacking one secondary structural element, and fluorescence. LOO-GFP folds and reconstitutes fluorescence upon addition of the missing element. Computational protein design may be used to modify the sequence of LOO-GFP to fit a new peptide sequence, while retaining its ability to reconstitute fluorescence. In this proof of concept, we present the results of computational design of a LOO-GFP that is missing strand 7, and has been designed to accommodate a 12-residue peptide from the H5 antigenic region of the Thailand strain A/Thailand/16/2004 H5N1 H. influenza hemagglutinin (HA) in place of strand 7. The new protein design software, DEEdesign, uses dead-end elimination (DEE) combined with Monte Carlo rotamer search methods, with novel energy terms to account for solvation, buried hydrogen bonding groups and buried void volume. A profile of candidate designs was converted to a gene library using degenerate oligo whole-gene assembly PCR, and this was coexpressed with the HA peptide. Proteins from fluorescent colonies were sequenced and purified. However, in vitro binding of the HA peptide was weak, and the unbound LOO-GFP was still fluorescent. Separating homodimers by refolding on beads removed the auto-fluorescence, which originated from dimer formation. Immobilized monomeric LOO-GFP glows upon addition of either the HA peptide or the wild type strand 7, whereas the wild type LOO-GFP binds only the wild type peptide but not the HA peptide. Further diversification of the LOO-GFP gene library by DNA shuffling has produced mutants in which the speed of chromophore maturation was increased by the presence of the HA peptide. The results show that a computationally designed LOO-GFP reports the presence of a specific peptide, and holds promise for computational design of biosensors using the leave-one-out method.
Summary: Modern biotechnical research is becoming increasingly reliant on computational structural modeling programs to develop novel solutions to pressing scientific questions. Rosetta is one such protein modeling suite that has already demonstrated wide applicability to a number of diverse research projects. Unfortunately, Rosetta is largely a command-line driven software package, which restricts its use among non-computational researchers. Some graphical interfaces for Rosetta exist, but typically are not as sophisticated as commercial software. Here we present InteractiveROSETTA, a graphical interface for the PyRosetta framework that presents easy-to-use controls for several of the most widely used Rosetta protocols alongside a sophisticated selection system utilizing PyMOL as a visualizer. InteractiveROSETTA is also capable of interacting with remote servers running a standalone Rosetta install, rendering it easy to incorporate more sophisticated protocols that are not accessible in PyRosetta and/or require significant computational resources. Availability: InteractiveROSETTA is freely available at github.com/schenc3/InteractiveROSETTA, and relies upon a separate download of PyRosetta which is available at http://www.pyrosetta.org after obtaining a license (free for academic use).
We have introduced two disulfide crosslinks into the loop regions on opposite ends of the beta barrel in superfolder green fluorescent protein (GFP) in order to better understand the nature of its folding pathway. When the disulfide on the side opposite the N/C-termini is formed, folding is 2× faster, unfolding is 2000× slower, and the protein is stabilized by 16 kJ/mol. But when the disulfide bond on the side of the termini is formed we see little change in the kinetics and stability. The stabilization upon combining the two crosslinks is approximately additive. When the kinetic effects are broken down into multiple phases, we observe Hammond behavior in the upward shift of the kinetic m-value of unfolding. We use these results in conjunction with structural analysis to assign folding intermediates to two parallel folding pathways. The data are consistent with a view that the two fastest transition states of folding are "barrel closing" steps. The slower of the two phases passes through an intermediate with the barrel opening occurring between strands 7 and 8, while the faster phase opens between 9 and 4. We conclude that disulfide crosslink-induced perturbations in kinetics are useful for mapping the protein folding pathway.
The green fluorescent protein (GFP) has seen its utility expand far beyond that of a fluorophore in Aequorea victoria. Some of the main drivers of this have been work done to increase the spectroscopic range of the protein as well as to develop GFP as a biosensor. In our study we have generated leave-one-out variants of GFP (LOO-GFPs) that have been circularly permuted with a secondary structural element omitted. Co-expression of the truncated GFP (sensor) with wt peptides (target) reconstitutes fluorescence to varying degrees depending on which strand has been omitted.
Motivation: Accuracy in protein design requires a fine grained rotamer search, multiple backbone conformations, and a detailed energy function, creating a burden in runtime and memory requirements. A design task may be split into manageable pieces in both three dimensional space and in the rotamer search space to produce small, fast jobs that are easily distributed. However, these jobs must overlap, presenting a problem in resolving conflicting solutions in the overlap regions. Results: Piecemeal design, in which the design space is split into overlapping regions and rotamer search spaces, accelerates the design process whether jobs are run in series or in parallel. Large jobs that cannot fit in memory were made possible by splitting. Accepting the consensus amino acid selection in conflict regions led to non-optimal choices. Instead, conflicts were resolved using a second pass, in which the split regions were re-combined and designed as one, producing results that were closer to optimal with a minimal increase in runtime over the consensus strategy. Splitting the search space at the rotamer level instead of at the amino acid level further improved the efficiency by reducing the search space in the second pass. Availability and Implementation: Programs for splitting protein design expressions are available at www.bioinfo.rpi.edu/tools/piecemeal.html
Wild-type green fluorescent protein (GFP) folds on a time-scale of minutes. The slow step in folding is a cis-trans peptide bond isomerization. The only conserved cis-peptide bond in the native GFP structure, at P89, was remodeled by the insertion of two residues, followed by iterative energy minimization and side chain design. The engineered GFP was synthesized and found to fold faster and more efficiently than its template protein, recovering 50% more of its fluorescence upon refolding. The slow phase of folding is faster and smaller in amplitude, and hysteresis in refolding has been eliminated. The elimination of a previously reported kinetically trapped state in refolding suggests that X-P89 is trans in the trapped state. A 2.55 Å resolution crystal structure revealed that the new variant contains only trans-peptide bonds, as designed. This is the first instance of a computationally remodeled fluorescent protein that folds faster and more efficiently than wild-type.
Nature possesses a secret formula for the energy as a function of the structure of a protein. In protein design, approximations are made to both the structural representation of the molecule and to the form of the energy equation, such that the existence of a general energy function for proteins is by no means guaranteed. Here we present new insights towards the application of machine learning to the problem of finding a general energy function for protein design. Machine learning requires the definition of an objective function, which carries with it the implied definition of success in protein design. We explored four functions, consisting of two functional forms, each with two criteria for success. Optimization was carried out by a Monte Carlo search through the space of all variable parameters. Cross-validation of the optimized energy function against a test set gave significantly different results depending on the choice of objective function, pointing to relative correctness of the built-in assumptions. Novel energy cross-terms correct for the observed non-additivity of energy terms and an imbalance in the distribution of predicted amino acids.
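The abstract above says only that optimization used a Monte Carlo search through the space of variable parameters. As an illustration of what such a search can look like, the sketch below perturbs one parameter at a time and applies a Metropolis acceptance rule; the Metropolis criterion, the Gaussian step, the fixed temperature, and every name in the code are assumptions, not details taken from the paper.

```python
import math
import random

def monte_carlo_optimize(objective, params, steps=2000, step_size=0.1,
                         temperature=1.0, seed=0):
    """Minimize objective(params) by random single-parameter perturbations.

    Returns the best parameter vector seen and its score. The acceptance
    rule is the Metropolis criterion, chosen here for illustration only.
    """
    rng = random.Random(seed)
    current = list(params)
    current_score = objective(current)
    best, best_score = list(current), current_score

    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.gauss(0.0, step_size)  # perturb one parameter
        trial_score = objective(trial)
        delta = trial_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score
```

In the setting of the abstract, `params` would be the weights of the energy terms and `objective` one of the four objective functions that score how well designs made with those weights match the training set; the choice of that objective is exactly what the abstract reports as decisive.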
The ability to selectively activate function of particular proteins via pharmacological agents is a longstanding goal in chemical biology. Recently, we reported an approach for designing a de novo allosteric effector site directly into the catalytic domain of an enzyme. This approach is distinct from traditional chemical rescue of enzymes in that it relies on disruption and restoration of structure, rather than active site chemistry, as a means to modulate function. However, rationally identifying analogous de novo binding sites in other enzymes represents a key challenge for extending this approach to introduce allosteric control into other enzymes. Here we show that mutation sites leading to protein inactivation via tryptophan-to-glycine substitution and allowing (partial) reactivation by the subsequent addition of indole are remarkably frequent. Through a suite of methods including a cell-based reporter assay, computational structure prediction and energetic analysis, fluorescence studies, enzymology, pulse proteolysis, x-ray crystallography and hydrogen-deuterium mass spectrometry, we find that these switchable proteins are most commonly modulated indirectly, through control of protein stability. Addition of indole in these cases rescues activity not by reverting a discrete conformational change, as we had observed in the sole previously reported example, but rather by restoring protein stability. This important finding will dramatically impact the design of future switches and sensors built by this approach, since evaluating stability differences associated with cavity-forming mutations is a far more tractable task than predicting allosteric conformational changes. By analogy to natural signaling systems, the insights from this study further raise the exciting prospect of modulating stability to design optimal recognition properties into future de novo switches and sensors built through chemical rescue of structure.
The use of cell-cell communication or "quorum sensing (QS)" elements from Gram-negative Proteobacteria has enabled synthetic biologists to begin engineering systems composed of multiple interacting organisms. However, additional tools are necessary if we are to progress towards synthetic microbial consortia that exhibit more complex, dynamic behaviors. EsaR from Pantoea stewartii subsp. stewartii is a QS regulator that binds to DNA as an apo-protein, and releases the DNA when it binds to its cognate signal molecule, 3-oxohexanoyl-homoserine lactone (3OC6HSL). In the absence of 3OC6HSL, EsaR binds to DNA and can act as either an activator or a repressor of transcription. Gene expression from PesaR, which is repressed by wild-type EsaR, requires 100- to 1000-fold higher concentrations of signal than commonly used QS activators, such as LuxR and LasR. Here we have identified EsaR variants with increased sensitivity to 3OC6HSL using directed evolution and a dual ON/OFF screening strategy. Although we targeted EsaR-dependent derepression of PesaR, our EsaR variants also showed increased 3OC6HSL-sensitivity at a second promoter, PesaS, which is activated by EsaR in the absence of 3OC6HSL. Here, the increase in AHL sensitivity led to gene expression being turned off at lower concentrations of 3OC6HSL. Overall, we have increased the signal sensitivity of EsaR more than 70-fold and generated a set of EsaR variants that recognize 3OC6HSL concentrations ranging over four orders of magnitude. QS-dependent transcriptional regulators that bind to DNA and are active in the absence of a QS signal represent a new set of tools for engineering cell-cell communication-dependent gene expression.
Nature possesses a secret formula for the energy as a function of the structure of a protein. In protein design, approximations are made to both the structural representation of the molecule and to the form of the energy equation, such that the existence of a general energy function for proteins is by no means guaranteed. Here we present new insights towards the application of machine learning to the problem of finding a general energy function for protein design. Machine learning requires the definition of an objective function, which carries with it the implied definition of success in protein design. We explored four functions, consisting of two functional forms, each with two criteria for success. Optimization was carried out by a Monte Carlo search through the space of all variable parameters. Cross-validation of the optimized energy function against a test set gave significantly different results depending on the choice of objective function, pointing to relative correctness of the built-in assumptions. Novel energy cross-terms correct for the observed non-additivity of energy terms and an imbalance in the distribution of predicted amino acids.
Green fluorescent protein (GFP) has journeyed far from its role in nature as a cofactor in the jellyfish light organ, having been bioengineered and re-engineered over three decades to take on a wide variety of service roles in molecular imaging and sensing. In this chapter, we explore the ways GFP has been used as a biomarker and biosensor, its capabilities, its strengths and weaknesses, and its potential for future applications. To begin, we will review what is known about the GFP structure, its extreme kinetic stability and its very slow and multiphasic folding kinetics. Biophysical characteristics will be covered, including the chemical and structural requirements for the autocatalyzed maturation of the integral fluorescent chromophore, its excitation/emission spectra, and its variety of engineered emission wavelengths. Efforts in protein engineering have produced GFP variants with faster folding, faster chromophore maturation and increased solubility. Circular and non-circular permutations of the GFP polypeptide chain are found to be well tolerated, as are many ways of splitting the chain into two parts, leading to biosensors based on circularly permuted and split GFP.
We review several GFP-based biomarkers and biosensors, with emphasis on their construction, their detection targets and their applications. Among the detection targets are pH, ions, reactive oxygen species, proteins, peptides and enzyme activity. Biosensors are created from GFP by making mutations that change its sensitivity, or by fusing it to functional domains, or by splicing functional domains and loops into exposed loops of GFP, or by splitting GFP. Förster resonance energy transfer (FRET) is used in many cases as a powerful and sensitive means of detecting interacting components of a system. Finally, there is considerable promise for the future of GFP-biosensors created by computational protein design, in which the site of one of the eleven beta strands is replaced by a binding site for a desired target peptide. Proofs of concept are presented here.
Summary: Protein unfolding is modeled as an ensemble of pathways, where each step in each pathway is the addition of one topologically possible conformational degree of freedom. Starting with a known protein structure, GeoFold hierarchically partitions (cuts) the native structure into substructures using revolute joints and translations. The energy of each cut and its activation barrier are calculated using buried solvent accessible surface area, side chain entropy, hydrogen bonding, buried cavities, and backbone degrees of freedom. A directed acyclic graph is constructed from the cuts, representing a network of simultaneous equilibria. Finite difference simulations on this graph simulate native unfolding pathways. Experimentally observed changes in the unfolding rates for disulfide mutants of barnase, T4 lysozyme, dihydrofolate reductase, and factor for inversion stimulation were qualitatively reproduced in these simulations. Detailed unfolding pathways for each case explain the effects of changes in the chain topology on the folding energy landscape. GeoFold is a useful tool for the inference of the effects of topology and mutation on the energy landscape of protein unfolding.
Psychrophilic organisms have adapted to live at low temperatures by using a variety of mechanisms. Here, we examine twenty homologous enzyme pairs from psychrophiles and mesophiles to investigate flexibility as a key characteristic for cold adaptation. B-factors in protein X-ray structures are one way to measure flexibility. Comparing psychrophilic to mesophilic protein B-factors shows that psychrophilic enzymes are more flexible in 3-turn and strand secondary structures. Enzyme cavities, identified using CASTp at various probe sizes, indicate that psychrophilic enzymes have larger average cavity sizes at probe radii of 1.4-1.5 Å. Furthermore, amino acid side chains lining these cavities show an increased frequency of acidic groups in psychrophilic enzymes. These findings suggest that embedded water molecules may play a significant role in cavity flexibility, and therefore, overall protein flexibility. Thus, our results point to the important role enzyme flexibility plays in adaptation to cold environments.
ABSTRACT: Several versions of split green fluorescent protein (GFP) fold and reconstitute fluorescence, as do many circular permutants, but little is known about the dependence of reconstitution on circular permutation. Explored here is the capacity of GFP to fold and reconstitute fluorescence from various truncated circular permutants, herein called "leave-one-outs", using a quantitative in vivo solubility assay and in vivo reconstitution of fluorescence. Twelve leave-one-out permutants are discussed, one for each of the 12 secondary structure elements. The results expand the outlook for the use of permuted split GFPs as specific and self-reporting gene encoded affinity reagents.
The pathway which proteins take to fold can be influenced from the earliest events of structure formation. In this light, it was both predicted and confirmed that increasing the stiffness of a beta hairpin turn decreased the size of the transition state ensemble (TSE) while increasing the folding rate. Thus there appears to be a relationship between conformationally restricting the TSE and increasing the folding rate, at least for beta hairpin turns. In this study, we hypothesize that the enormous sampling necessary to fold even two-state folding proteins in silico could be reduced if local structure constraints were used to restrict structural heterogeneity by polarizing folding pathways or forcing folding into preferred routes. Using a Go model we fold Chymotrypsin Inhibitor 2 (CI-2) and the SH3 domain after constraining local sequence windows to their native structure by rigid body dynamics. Trajectories were monitored for any changes to the folding pathway and differences in the kinetics compared to unconstrained simulations. For both proteins folding time is generally decreased after constraining any local sequence window. Structural polarization of the folding pathway appears to explain these rate increases and occurs regardless of whether the locally constrained structure exists in the native TSE or not. Folding rate enhancements are consistent with the goal to reduce sampling time necessary to reach native structures during folding simulations. Interestingly, not all constrained windows decreased folding time equally. We conclude by analyzing these differences and explain why rigid body dynamics may be the preferred way to constrain structure.
The sequential order of secondary structural elements in proteins affects the folding and activity to an unknown extent. To test the dependence on sequential connectivity, we reconnected secondary structural elements by their solvent-exposed ends, permuting their sequential order, called "rewiring". This new protein design strategy changes the topology of the backbone without changing the core side chain packing arrangement. While circular and noncircular permutations have been observed in protein structures that are not related by sequence homology, to date no one has attempted to rationally design and construct a protein with a sequence that is noncircularly permuted while conserving three-dimensional structure. Herein, we show that green fluorescent protein can be rewired, still functionally fold, and exhibit wild-type fluorescence excitation and emission spectra.
Background: Proteins have evolved subject to energetic selection pressure for stability and flexibility. Structural similarity between proteins that have gone through conformational changes can be captured effectively if flexibility is considered. Topologically unrelated proteins that preserve secondary structure packing interactions can be detected if both flexibility and sequential permutations are considered. We propose the FlexSnap algorithm for flexible non-topological protein structural alignment. Results: The effectiveness of FlexSnap is demonstrated by measuring the agreement of its alignments with manually curated non-sequential structural alignments. FlexSnap showed competitive results against state-of-the-art algorithms, like DALI, SARF2, MultiProt, FlexProt, and FATCAT. Moreover, on the DynDom dataset, FlexSnap reported longer alignments with smaller rmsd. Conclusions: We have introduced FlexSnap, a greedy chaining algorithm that reports both sequential and non-sequential alignments and allows twists (hinges). We assessed the quality of the FlexSnap alignments by measuring their agreement with manually curated non-sequential alignments. On the FlexProt dataset, FlexSnap was competitive with state-of-the-art flexible alignment methods. Moreover, we demonstrated the benefits of introducing hinges by showing significant improvements in the alignments reported by FlexSnap for the structure pairs for which rigid alignment methods reported alignments with either low coverage or large rmsd.
The remarkable predominance of right-handedness in beta-alpha-beta helical crossovers has been previously explained in terms of thermodynamic stability and kinetic accessibility, but a different kinetic trapping mechanism may also play a role. If the beta-sheet contacts are made before the crossover helix is fully formed, and if the backbone angles of the folding helix follow the energetic pathway of least resistance, then the helix would impart a torque on the ends of the two strands. Such a torque would tear apart a left-handed conformation but hold together a right-handed one. Right-handed helical crossovers predominate even in all-alpha proteins, where previous explanations based on the preferred twist of the beta sheet do not apply. Using simple molecular simulations, we can reproduce the right-handed preference in beta-alpha-beta units, without imposing specific beta strand geometry. The new kinetic trapping mechanism is dubbed the "phone cord effect" because it is reminiscent of the way a helical phone cord forms superhelices to relieve torsional stress. Kinetic trapping explains the presence of a right-handed superhelical preference in alpha helical crossovers, and provides a possible folding mechanism for knotted proteins. See Supplementary materials.
Structural similarity between proteins gives us insights into their evolutionary relationships when there is low sequence similarity. In this paper, we present a novel approach called SNAP for non-sequential pair-wise structural alignment. Starting from an initial alignment, our approach iterates over a two-step process consisting of a superposition step and an alignment step, until convergence. We propose a novel greedy algorithm to construct both sequential and non-sequential alignments. The quality of SNAP alignments was assessed by comparing against the manually curated reference alignments in the challenging SISY and RIPC datasets. Moreover, when applied to a dataset of 4410 protein pairs selected from the CATH database, SNAP produced longer alignments with lower rmsd than several state-of-the-art alignment methods. Classification of folds using SNAP alignments was both highly sensitive and highly selective. The SNAP software, along with the datasets, is available online at http://www.cs.rpi.edu/~zaki/software/SNAP.
Green fluorescent protein (GFP) has been used as a proof of concept for a novel "leave-one-out" biosensor design in which a protein with a segment omitted from the middle of the sequence, by circular permutation and truncation, binds the missing peptide and reconstitutes its function. Three variants of GFP have been synthesized that are each missing one of the eleven beta strands from its beta barrel structure, and in two of the variants adding the omitted peptide sequence in trans reconstitutes fluorescence. Detailed biochemical analysis indicates that GFP with beta-strand 7 "left out" (t7SPm) exists in a partially unfolded state. The apo-form t7SPm binds the free beta-strand 7 peptide with a dissociation constant of ~0.5 µM and folds into the native state of GFP, resulting in fluorescence recovery. Folding of t7SPm, both with and without the peptide ligand, is at least a three-state process and has a rate comparable to the full-length and unpermuted GFP. The conserved kinetic properties strongly suggest that the rate limiting steps in the folding pathway have not been altered by circular permutation and truncation in t7SPm. This study shows that structural and functional reconstitution of GFP can occur with a segment omitted from the middle of the chain, and that the unbound form is in a partially unfolded state.
Protein folding is a hierarchical process where structure forms locally first, then globally. Some short sequence segments initiate folding through strong structural preferences that are independent of their three-dimensional context in proteins. We have constructed a knowledge-based force field in which the energy functions are conditional on local sequence patterns, as expressed in the hidden Markov model HMMSTR. CALF (C-ALpha Force field) builds sequence specific statistical potentials based on database frequencies for alpha-carbon virtual bond opening and dihedral angles, pair-wise contacts and hydrogen bond donor-acceptor pairs, and simulates folding via Brownian dynamics. We introduce hydrogen bond donor and acceptor potentials as alpha-carbon probability fields that are conditional on the predicted local sequence. Constant temperature simulations were carried out using 27 peptides selected as putative folding initiation sites, each 12 residues in length, representing several different local structure motifs. Each 0.6 microsecond trajectory was clustered based on structure. Simulation convergence or representativeness was assessed by subdividing trajectories and comparing clusters. For 21 of the 27 sequences, the largest cluster made up more than half of the total trajectory. Of these 21 sequences, 14 had cluster centers that were at most 2.6 Å RMSD from their native structure in the corresponding full-length protein. To assess the adequacy of the energy function on non-local interactions, 11 full length native structures were relaxed using low-temperature Brownian dynamics. Equilibrated structures deviated from their native states but retained their overall topology and compactness. A simple potential that folds proteins locally and stabilizes proteins globally may enable a more realistic understanding of hierarchical folding pathways.
Amino acid sequence probability distributions, or profiles, have been used successfully to predict secondary structure and local structure in proteins. Profile models assume the statistical independence of each position in the sequence, but the energetics of protein folding is better captured in a scoring function that is based on pairwise interactions, like a force field. Results: I-sites motifs are short sequence/structure motifs that populate the protein structure database due to energy-driven convergent evolution. Here we show that a pairwise covariant sequence model does not predict alpha helix or beta strand significantly better overall than a profile-based model, but it does improve the prediction of certain loop motifs. The finding is best explained by considering secondary structure profiles as multivariant, all-or-none models, which subsume covariant models. Pairwise covariance is nonetheless present and energetically rational. Examples of negative design are present, where the covariances disfavor non-native structures. Conclusions: Measured pairwise covariances are shown to be statistically robust in cross-validation tests, as long as the amino acid alphabet is reduced to nine classes. Availability: An updated I-sites local structure motif library that provides sequence covariance information for all types of local structure in globular proteins and a web server for local structure prediction are available at www.bioinfo.rpi.edu/bystrc/hmmstr/server.php .
Summary: Most proteins are in equilibrium with partially and globally unfolded conformations. In contrast, kinetically stable proteins (KSPs) are trapped by an energy barrier in a specific state, unable to transiently sample other conformations. Among many potential roles, it appears that kinetic stability (KS) is a feature used by nature to allow proteins to maintain activity under harsh conditions and to preserve the structure of proteins that are prone to misfolding. The biological and pathological significance of KS remains very poorly understood due to the lack of simple experimental methods to identify this property, and its infrequent occurrence in proteins. Based on our previous correlation between KS and a protein's resistance to the denaturing detergent sodium dodecyl sulfate (SDS), we show here the application of a diagonal two-dimensional (D2D) SDS-polyacrylamide gel electrophoresis (PAGE) assay to identify KSPs in complex mixtures. We applied this method to the lysate of E. coli, and upon proteomics analysis have identified 50 non-redundant proteins that were SDS resistant (i.e. putatively kinetically stable), either individually or as part of a protein complex. Structural and functional analyses of a subset (44) of these proteins with known 3D structure revealed some potential structural and functional biases towards and against KS. This simple D2D SDS-PAGE assay will allow the widespread investigation of KS, including the proteomics-level identification of KSPs in different systems, potentially leading to a better understanding of the biological and pathological significance of this intriguing property of proteins.
Summary: We describe an efficient method for partial complementary shape matching for use in rigid protein-protein docking. The local shape features of a protein are represented using boolean data structures called context shapes. The relative orientation of the receptor and ligand surfaces is searched using pre-calculated lookup tables. Energetic quantities are derived from shape complementarity and buried surface area computations using efficient boolean operations. Preliminary results indicate that our context shapes based approach outperforms state-of-the-art geometric shape based rigid docking algorithms like ZDOCK(PSC) and PatchDock. Binary code of the implementation is available on request. The code will be available for downloading once the project website is set up.
Summary: Hidden Markov models (HMMs) are an extremely versatile statistical representation that can be used to model any set of one-dimensional discrete symbol data. HMMs can model protein sequences in many ways, depending on what features of the protein are represented by the Markov states. For protein structure prediction, states have been chosen to represent either homologous sequence positions, local or secondary structure types, or transmembrane locality. The resulting models can be used to predict common ancestry, secondary or local structure, or membrane topology by applying one of the two standard algorithms for comparing a sequence to a model. In this chapter we review those algorithms and discuss how HMMs have been constructed and refined for the purpose of protein structure prediction.
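As an illustration of comparing a sequence to a model, the sketch below implements the Viterbi algorithm, a widely used standard HMM algorithm that finds the single most probable state path. The abstract does not name the algorithms it reviews, so treating Viterbi as one of them is an assumption, and the two-state model, its probabilities, and the two-letter alphabet are hypothetical toy values chosen only to show the mechanics.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs, computed in log space."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][obs[t]])
            back[t][s] = prev

    # Trace back from the best-scoring final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy model: state "H" prefers symbol "a", state "C" prefers "g".
toy_states = ("H", "C")
toy_start = {"H": 0.5, "C": 0.5}
toy_trans = {"H": {"H": 0.9, "C": 0.1}, "C": {"H": 0.1, "C": 0.9}}
toy_emit = {"H": {"a": 0.9, "g": 0.1}, "C": {"a": 0.1, "g": 0.9}}
toy_path = viterbi("aaaggg", toy_states, toy_start, toy_trans, toy_emit)
```

Working in log space avoids numerical underflow on long sequences, which matters for protein-length inputs.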
Proteins are linear chains that fold into characteristic shapes and features. To understand proteins and protein folding, we try to represent the protein molecule in such a way that its features are easy to see and manipulate. A simple representation facilitates algorithm design for structure prediction. The simplicity of the 3-state character string representation of secondary structure is part of the reason for secondary structure prediction receiving so much attention early in the era of computational biology. One-dimensional strings are easily understood, parsed, mined and manipulated. But secondary structure alone does not tell us enough about the overall shapes and features of a protein. We need a simple way to represent the overall tertiary structure of a protein.
Here we explore a two-dimensional Boolean matrix representation of protein structure, where each dimension is the residue number and each value is true if the residues are spatial neighbors and false otherwise -- called a contact map. A contact map is the simplest representation of a protein that can be faithfully projected back into three dimensions. As such it has received increased attention in recent years from bioinformaticists, who see this as a data structure that is readily amenable to data mining and machine learning.
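The contact map described above can be sketched in a few lines. This is a minimal illustration, assuming that each residue is represented by a single point (for example its alpha-carbon) and that "spatial neighbors" means a distance at or below a cutoff; the 8 Å cutoff, the function name `contact_map`, and the toy coordinates are assumptions, not values taken from the text.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return an n x n Boolean matrix, True where two residues are spatial neighbors.

    coords: one (x, y, z) point per residue, in angstroms.
    cutoff: assumed neighbor distance threshold (hypothetical default).
    The diagonal is left False; a residue is not counted as its own neighbor.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True  # contact maps are symmetric
    return cmap

# Hypothetical coordinates for a four-residue toy chain lying on the x axis.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

Because the matrix is symmetric and Boolean, it can be stored compactly and mined with ordinary set or bit operations.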
Motivation: In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the "Twilight Zone" of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure.
Results: HMMSUM (HMMSTR-based SUbstitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequence-structure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM50 matrix when validated against curated remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.
Availability: Matrices and programs are available at http://www.bioinfo.rpi.edu/bystrc/downloads.html.
Contact: firstname.lastname@example.org, email@example.com
Summary: We present a method for constructing thousands of compact protein conformations from fragments and then connecting these structures to form a network of physically plausible folding pathways. This is the first attempt to merge the previous successes in fragment assembly methods with probabilistic roadmap (PRM) methods. Previous PRM methods have used the knowledge of the true structure to sample conformational space. Our method uses only the amino acid sequence to bias the conformational sampling. Conformational sampling is done using HMMSTR, a hidden Markov model for local sequence-structure correlations. We then build a PRM graph and find paths that have the lowest energy climb. We find that favored folding pathways exist, corresponding to deep valleys in the energy landscape. We describe the pathways for three small proteins with different secondary structure content in the context of a folding funnel model.
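One way to picture a search for the path with the lowest energy climb on a roadmap graph is a bottleneck variant of Dijkstra's algorithm. The sketch below assumes that each node (conformation) has a scalar energy and that the quantity to minimize is the highest energy visited along the path; that reading of "lowest energy climb", the function name `lowest_barrier_path`, and the toy graph are assumptions for illustration, not the published method.

```python
import heapq

def lowest_barrier_path(graph, energy, start, goal):
    """Find a path from start to goal whose highest-energy node is as low as possible.

    graph: dict mapping each node to a list of neighboring nodes.
    energy: dict mapping each node to its energy.
    Returns (path, barrier), or (None, inf) if goal is unreachable.
    """
    best = {start: energy[start]}
    parent = {start: None}
    heap = [(energy[start], start)]
    while heap:
        barrier, node = heapq.heappop(heap)
        if node == goal:
            break
        if barrier > best[node]:
            continue  # stale heap entry, a lower barrier was already found
        for nxt in graph[node]:
            cand = max(barrier, energy[nxt])  # barrier so far, or the new node if higher
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                parent[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    if goal not in parent:
        return None, float("inf")
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1], best[goal]

# Hypothetical roadmap: unfolded state U, intermediates A and B, native state N.
toy_graph = {"U": ["A", "B"], "A": ["U", "N"], "B": ["U", "N"], "N": ["A", "B"]}
toy_energy = {"U": 0.0, "A": 5.0, "B": 2.0, "N": -3.0}
toy_route, toy_barrier = lowest_barrier_path(toy_graph, toy_energy, "U", "N")
```

In this toy graph the route through B is preferred because its highest point (2.0) is lower than the route through A (5.0).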
ECOME is an interactive, graph-based model for simulating an evolving, closed consumption web. It demonstrates the fundamental behavior of a global ecosystem over evolutionary time using well-established ecological/evolutionary principles. Nodes in the graph send biomass along weighted, directed edges. New nodes evolve by speciation and disappear when biomass (i.e. population) shrinks to zero. Consumption rates, predator/prey relationships, and speciation rates are user-defined, following theoretic distributions. The output shows the biomass and biodiversity over time for up to five trophic levels. Using this simple system, we demonstrate that closed ecosystems are inherently unstable in the absence of evolution or in the presence of a single, hyperchanging species, but are dynamically stable and robust to perturbations when the evolution rates for all species follow a normal distribution. Our new application provides provocative lessons for biology students during a time of mass extinction.
Motivation: Proteins of the same class often share a secondary structure packing arrangement but differ in how the secondary structure units are ordered in the sequence. We find that proteins that share a common core also share local sequence-structure similarities, and these can be exploited to align structures with different topologies. In this study, segments from a library of local sequence-structure alignments were assembled hierarchically, enforcing the compactness and conserved inter-residue contacts but not sequential ordering. Previous structure-based alignment methods often ignore sequence similarity, local structural equivalence, and compactness.
Results: The new program, SCALI (Structural Core ALIgnment), can efficiently find conserved packing arrangements, even if they are non-sequentially ordered in space. SCALI alignments conserve remote sequence similarity and contain fewer alignment errors. Clustering of our pairwise non-sequential alignments shows that recurrent packing arrangements exist in topologically different structures. For example, the 3-layer sandwich domain architecture may be divided into four structural subclasses based on internal packing arrangements. These subclasses represent an intermediate level of structure classification, more general than topology but more specific than architecture as defined in CATH. A strategy is presented for developing a set of predictive hidden Markov models based on multiple SCALI alignments.
Availability: An online topology independent SCALI structure comparison server is available at http://www.bioinfo.rpi.edu/bystrc/scali.html.
Summary: A structured folding pathway, which is a time-ordered sequence of folding events, plays an important role in the protein folding process and hence in the conformational search. Pathway prediction thus gives more insight into the folding process and is a valuable guiding tool for searching the conformation space. In this paper, we propose a novel unfolding approach to predict the folding pathway. We apply graph-based methods on a weighted secondary structure graph of a protein to predict the sequence of unfolding events. When viewed in reverse, this yields the folding pathway. We demonstrate the success of our approach on several proteins whose pathway is partially known.
Remote homology detection refers to the detection of structural homology in proteins when there is little or no sequence similarity. In this article, we present a remote homolog detection method called SVM-HMMSTR that overcomes the reliance on detectable sequence similarity by transforming the sequences into strings of hidden Markov states that represent local folding motif patterns. These state strings are transformed into fixed dimension feature vectors for input to a support vector machine. Two sets of features are defined: an order-independent feature set that captures the amino acid and local structure composition; and an order-dependent feature set that captures the sequential ordering of the local structures. Tests using the Structural Classification of Proteins (SCOP) 1.53 data set show that the SVM-HMMSTR gives a significant improvement over several current methods. Proteins 2004;57:518-30.
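The idea of turning a variable-length state string into fixed-dimension feature vectors can be sketched as follows. This is only an illustration: using state fractions for the order-independent set and adjacent-pair fractions for the order-dependent set is an assumption, as are the function names and the three-letter toy alphabet; the actual SVM-HMMSTR features are not specified in the abstract.

```python
from collections import Counter
from itertools import product

def composition_features(state_string, alphabet):
    """Order-independent features: fraction of positions occupied by each state."""
    counts = Counter(state_string)
    n = len(state_string) or 1  # avoid division by zero on an empty string
    return [counts[s] / n for s in alphabet]

def bigram_features(state_string, alphabet):
    """Order-dependent features: fraction of adjacent pairs for each ordered state pair."""
    pairs = Counter(zip(state_string, state_string[1:]))
    n = max(len(state_string) - 1, 1)
    return [pairs[(a, b)] / n for a, b in product(alphabet, repeat=2)]

# Hypothetical three-state alphabet and state string, for illustration only.
toy_alphabet = "HEL"
toy_states = "HHHLEEL"
toy_comp = composition_features(toy_states, toy_alphabet)   # length 3
toy_bigrams = bigram_features(toy_states, toy_alphabet)     # length 9
```

Both vectors have a length that depends only on the alphabet, not on the protein, which is what makes them usable as input to a support vector machine.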
A review of recent work toward modeling the protein folding pathway using a bioinformatics approach is presented. Statistical models have been developed for sequence-structure correlations in proteins at five levels of structural complexity: (1) short motifs, (2) extended motifs, (3) non-local pairs of motifs, (4) three dimensional arrangements of multiple motifs, and (5) global structural homology. Here we review statistical models, including sequence profiles, hidden Markov models and interaction potentials, for the first four levels of structural detail. The I-sites Library (folding Initiation sites) models local structure motifs. HMMSTR (Hidden Markov Model for STRucture) is a hidden Markov model for extended motifs. HMMSTR-CM (Contact Maps) is a model for pairwise interactions between motifs. And SCALI-HMM (HMMs for Structural Core Alignments) is a set of hidden Markov models for spatial arrangements of motifs. Global sequence models have been extensively reviewed elsewhere and are not discussed here. The parallels between the statistical models and the theoretical models for folding pathways are discussed.
The data used and algorithms presented in this paper are available at http://www.bioinfo.rpi.edu/bystrc/ or by request to firstname.lastname@example.org. HMMSTR predictions may be obtained from this web site: http://www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
Knowledge-based potential functions for protein structure prediction assume that the frequency of occurrence of a given structure or a contact in the protein database is a measure of its free energy. Here, we put this assumption to the test by comparing the results obtained from sequence-structure cluster analysis with those obtained from long all-atom molecular dynamics simulations. Sixty-four eight-residue peptide sequences with varying degrees of similarity to the canonical sequence pattern for amphipathic helix were drawn from known protein structures, regardless of whether they were helical in the protein. Each was simulated using AMBER6.0 for at least 10 ns using explicit waters. The total simulation time was 1176 ns. The resulting trajectories were tested for reproducibility, and the helical content was measured. Natural peptides whose sequences matched the amphipathic helix motif with greater than 50% confidence were significantly more likely to form helix during the course of the simulation than peptides with lower confidence scores. The sequence pattern derived from the simulation data closely resembles the motif pattern derived from the database cluster analysis. The difficulties encountered in sampling conformational space and sequence space simultaneously are discussed. Proteins 2003;50:552-562.
The function of an unknown biological sequence can often be accurately inferred if we are able to map this unknown sequence to its corresponding homologous family. Currently, the discriminative approach, which combines support vector machines and sequence similarity, is recognized as the most accurate approach. SVM-Fisher and SVM-pairwise methods are two representatives of this approach, and SVM-pairwise is the most accurate method. However, these methods only encode sequence information into their feature vectors and ignore the structure information. In addition, one of their major drawbacks is their computational inefficiency. Based on this observation, we present an alternative method for SVM-based protein classification. Our method, SVM-I-sites, uses structure similarity instead of sequence similarity for remote homology detection. Our studies show that SVM-I-sites is much more efficient than both SVM-Fisher and SVM-pairwise while achieving a comparable performance with SVM-pairwise.
Result: We adopt SCOP 1.53 as our dataset. The result shows that SVM-I-sites runs much faster and is able to out-perform many state-of-the-art sequence-based methods such as PSI-BLAST, SAM and SVM-Fisher, and comparable to SVM-pairwise.
Availability: I-sites server is accessible through the web at http://www.bioinfo.rpi.edu.
Programs are available upon request for academics. Licensing agreements are available for commercial interests. The framework of encoding local structure into feature vector is available upon request.
We present a novel method, HMMSTR-CM, for protein contact map predictions. Contact potentials were calculated using HMMSTR, a hidden Markov model for local sequence structure correlations. Targets were aligned against protein templates using a Bayesian method and contact maps were generated using these alignments. Contact potentials then were used to evaluate these templates. An ab initio method was developed based on the target contact potentials using a rule-based strategy to model the protein folding pathway. Fold recognition and ab initio methods were combined to produce accurate, protein-like contact maps. Pathways sometimes led to an unambiguous prediction of topology, even without using templates. The results on CASP5 targets are discussed. Also included is a brief update on the quality of fully automated ab initio predictions using the I-sites server.
Proteins fold through a series of intermediate states called a pathway. Protein folding pathways have been modeled using either simulations or a hierarchy of statistical models. Here we present a series of related statistical models that attempt to predict early, middle and late intermediates along the folding pathway. I-sites motifs are discrete models for folding initiation sites. HMMSTR is a model for local structure patterns composed of I-sites motifs. HMMSTR-CM is an approach toward assembling motifs and groups of motifs in a contact map representation, using heuristic rules to predict contact maps either with or without the use of templates. We also discuss the I-sites/ROSETTA server, which is a folding simulation algorithm that uses a fragment library as input. The results of blind structure prediction experiments are discussed. Pathway-based predictions sometimes lead to an unambiguous prediction of the fold topology, even without using templates.
Ab initio prediction is the challenging attempt to predict protein structures based only on sequence information and without using templates. It is often divided into two distinct sub-problems: (1) the scoring function that can distinguish native or native-like structures from non-native ones, and (2) the method of searching the conformational space. Currently there does not exist a reliable scoring function that can always drive a search to the native fold, and there is no general search method that can guarantee a significant sampling of near-natives. Pathway models combine the scoring function and the search. In this short review, we explore some of the ways pathway models are used in folding, in published works since 2001, and present a new pathway model, HMMSTR-CM, that uses a fragment library and a set of nucleation/propagation-based rules. The new method was used for ab initio predictions as part of CASP5. This work was presented at the Winter School in Bioinformatics, Bologna, Italy, Feb 10-14, 2003.
A fast algorithm for computing the solvent accessible molecular surface area (SAS) using Boolean masks (Le Grand, S. M. & Merz, K. M. J. (1993). J. Comp. Chem. 14, 349-52.) has been modified to estimate the solvent excluded molecular surface area (SES), including contact, toroidal and reentrant surface components. Numerical estimates of arc lengths of intersecting atomic SAS are used to estimate the toroidal surface, and intersections between those arcs are used to estimate the reentrant surface area. The new method is compared to an exact analytical method. Boolean molecular surface areas are continuous and pairwise differentiable, and should be useful for molecular dynamics simulations, especially as the basis for an implicit solvent model.
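For readers unfamiliar with numerical surface-area estimates, the sketch below shows the generic point-counting idea for SAS (often called the Shrake-Rupley approach): place test points on each probe-expanded atom sphere and count those not buried inside a neighbor. It is a swapped-in illustration and not the Boolean-mask algorithm or the SES extension described in the abstract; the function names, the 1.4 A probe radius, and the point count are assumptions.

```python
import math

def sphere_points(n):
    """Return n roughly uniform points on the unit sphere (golden-spiral layout)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n      # z runs strictly between 1 and -1
        r = math.sqrt(1.0 - z * z)         # radius of the circle at height z
        theta = golden_angle * i
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points

def sas_area(atoms, probe=1.4, n_points=960):
    """Estimate solvent accessible surface area by point counting.

    atoms: list of (x, y, z, radius) tuples, in Angstroms.
    A test point is accessible if it lies outside every other expanded sphere.
    """
    unit = sphere_points(n_points)
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms)
                if j != i
            )
            if not buried:
                exposed += 1
        # Each test point represents an equal share of the expanded sphere's area.
        total += 4.0 * math.pi * big_r ** 2 * exposed / n_points
    return total

single = sas_area([(0.0, 0.0, 0.0, 1.0)])
overlapping = sas_area([(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0)])
```

A point-counting estimate of this kind changes in discrete steps as atoms move, which is one reason a continuous and differentiable surface area, as claimed for the Boolean method above, is attractive for molecular dynamics.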
Motivation: The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence structure motifs and the HMMSTR model for local structure in proteins, to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction. Meeting reasonable service goals required improvements in efficiency, in particular for the ROSETTA algorithm. Results: The new server was used for blind predictions of 40 protein sequences as part of the CASP4 blind structure prediction experiment. The results for 31 of those predictions are presented here. 61% of the residues overall were found in topologically correct predictions, which are defined as fragments of 30 residues or more with a root-mean-square deviation in superimposed alpha carbons of less than 6 A. HMMSTR 3-state secondary structure predictions were 73% correct overall. Tertiary structure predictions did not improve the accuracy of secondary structure prediction. Availability: The server is accessible through the web at www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
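A 3-state secondary structure accuracy such as the 73% quoted above is conventionally the percentage of residues whose predicted state matches the observed state. The sketch below assumes the common H/E/C (helix, strand, coil) letter coding; the function name and example strings are hypothetical and are not data from the paper.

```python
def q3_accuracy(predicted, observed):
    """Return the percentage of positions where predicted == observed.

    predicted, observed: equal-length strings over the 3-state alphabet
    (assumed here to be H = helix, E = strand, C = coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Hypothetical six-residue example: five of six states agree.
example = q3_accuracy("HHHEEC", "HHHECC")
```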
Torsion space molecular dynamics may be more efficiently encoded if the global motions are separated from the internal motions. The equations of motion for single, non-cyclic chains are shown to be first order in the backbone angle parameters when the global frame of reference is ignored, and second order otherwise. Adding a simple heuristic substitute for the global motions enables the encoding of dynamics for mixed constrained/unconstrained model systems.
We describe a hidden Markov model, HMMSTR, for general protein sequence based on the I-sites library of sequence-structure motifs. Unlike the linear hidden Markov models used to model individual protein families, HMMSTR has a highly branched topology and captures recurrent local features of protein sequences and structures that transcend protein family boundaries. The model extends the I-sites library by describing the adjacencies of different sequence-structure motifs as observed in the protein database and, by representing overlapping motifs in a much more compact form, achieves a great reduction in parameters. The HMM attributes a considerably higher probability to coding sequence than does an equivalent dipeptide model, predicts secondary structure with an accuracy of 74.3%, predicts backbone torsion angles better than any previously reported method, and predicts the structural context of beta strands and turns with an accuracy that should be useful for tertiary structure prediction.
Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short (≤ 50 amino acids) non-coding segments extracted from the S. cerevisiae genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.
In this paper, we develop data mining techniques to predict 3D contact potentials among protein residues (or amino acids) based on the hierarchical nucleation-propagation model of protein folding. We apply a hybrid approach, using a hidden Markov model (HMM) to extract folding initiation sites, and then apply association mining to discover contact potentials. The new hybrid approach achieves accuracy results better than those reported previously.
We describe the development of a scoring function based on the decomposition P(structure|sequence) proportional to P(sequence|structure) * P(structure), which outperforms previous scoring functions in correctly identifying native-like protein structures in large ensembles of compact decoys. The first term captures sequence-dependent features of protein structures, such as the burial of hydrophobic residues in the core; the second term captures universal sequence-independent features, such as the assembly of beta-strands into beta-sheets. The efficacies of a wide variety of sequence-dependent and sequence-independent features of protein structures for recognizing native-like structures were systematically evaluated using ensembles of approximately 30,000 compact conformations with fixed secondary structure for each of 17 small protein domains. The best results were obtained using a core scoring function with P(sequence|structure) parameterized similarly to our previous work (Simons et al., J Mol Biol 1997;268:209-225) and P(structure) focused on secondary structure packing preferences; while several additional features had some discriminatory power on their own, they did not provide any additional discriminatory power when combined with the core scoring function. Our results, on both the training set and the independent decoy set of Park and Levitt (J Mol Biol 1996;258:367-392), suggest that this scoring function should contribute to the prediction of tertiary structure from knowledge of sequence and secondary structure.
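The Bayesian decomposition in this abstract can be illustrated with a toy decoy-ranking sketch: each decoy is scored by the log of P(sequence given structure) plus the log of P(structure), and decoys are sorted by that sum. Everything concrete below is an assumption for illustration only: the two burial environments, the three-letter amino acid alphabet, the probability table, and the function names are invented and are not the parameterization used in the paper.

```python
import math

# Hypothetical toy table of P(amino acid | environment); not values from the paper.
P_AA_GIVEN_ENV = {
    "buried":  {"L": 0.6, "K": 0.1, "G": 0.3},
    "exposed": {"L": 0.2, "K": 0.5, "G": 0.3},
}

def log_posterior_score(sequence, environments, log_p_structure):
    """Return log P(sequence|structure) + log P(structure), up to a constant.

    environments: one environment label per residue, taken from the decoy.
    log_p_structure: sequence-independent log prior for the decoy (assumed given).
    Residues are treated as independent given their environment (a simplification).
    """
    if len(sequence) != len(environments):
        raise ValueError("sequence and environments must have the same length")
    log_likelihood = sum(
        math.log(P_AA_GIVEN_ENV[env][aa]) for aa, env in zip(sequence, environments)
    )
    return log_likelihood + log_p_structure

def rank_decoys(sequence, decoys):
    """Sort (name, environments, log_p_structure) decoys from best to worst score."""
    return sorted(
        decoys,
        key=lambda d: log_posterior_score(sequence, d[1], d[2]),
        reverse=True,
    )

# Decoy A buries both leucines; decoy B exposes them and buries the lysine.
decoys = [
    ("B", ["exposed", "buried", "exposed"], -1.0),
    ("A", ["buried", "exposed", "buried"], -1.0),
]
ranking = rank_decoys("LKL", decoys)
```

With equal structure priors, the ranking here is decided entirely by the sequence-dependent term, which mirrors the abstract's example of hydrophobic burial as a sequence-dependent feature.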
We describe a new method for local protein structure prediction based on a library of short sequence patterns that correlate strongly with protein three-dimensional structural elements. The library was generated using an automated method for finding correlations between protein sequence and local structure, and contains most previously described local sequence-structure correlations as well as new relationships, including a diverging type-II beta-turn, a frayed helix, and a proline-terminated helix. The query sequence is scanned for segments 7 to 19 residues in length that strongly match one of the 82 patterns in the library. Matching segments are assigned the three-dimensional structure characteristic of the corresponding sequence pattern, and backbone torsion angles for the entire query sequence are then predicted by piecing together mutually compatible segment predictions. In predictions of local structure in a test set of 55 proteins, about 50% of all residues, and 76% of residues covered by high-confidence predictions, were found in eight-residue segments within 1.4 A of their true structures. The predictions are complementary to traditional secondary structure predictions because they are considerably more specific in turn regions, and may contribute to ab initio tertiary structure prediction and fold recognition.
Previous studies of the conformations of peptides spanning the length of the alpha-spectrin SH3 domain suggested that SH3 domains lack independently folding substructures. Using a local structure prediction method based on the I-sites library of sequence-structure motifs, we identified a seven residue peptide in the src SH3 domain predicted to adopt a native-like structure, a type II beta-turn bridging unpaired beta-strands, that was not contained intact in any of the SH3 domain peptides studied earlier. NMR characterization confirmed that the isolated peptide, FKKGERL, adopts a structure similar to that adopted in the native protein: the NOE and 3JNHalpha coupling constant patterns were indicative of a type II beta-turn, and NOEs between the Phe and the Leu side-chains suggest that they are juxtaposed as in the prediction and the native structure. These results support the idea that high-confidence I-sites predictions identify protein segments that are likely to form native-like structures early in folding. Copyright 1998 Academic Press.
Blind predictions of the local structure of nine CASP2 targets were made using the I-sites library of short sequence-structure motifs, revealing strengths and weaknesses in this new knowledge-based method. Many turns between secondary structural elements were accurately predicted. Estimates of the confidence of prediction correlated well with the accuracy over the whole set. Bias toward structures used to develop the library was minimal, probably because of the extensive use of cross-validation. However, helix positions were better predicted by the PHD program. The method is likely to be sensitive to the quality of the sequence alignment. A general measure for evaluating local structure predictions is suggested.
We have used cluster analysis to identify recurring sequence patterns that transcend protein family boundaries. A subset of these patterns occurs predominantly in a single type of local structure in proteins. Here we characterize the three-dimensional structures and contexts in which these sequence patterns occur, with particular attention to the interactions responsible for their structural selectivity.
Considerable progress has been made in understanding the relationship between local amino acid sequence and local protein structure. Recent highlights include numerous studies of the structures adopted by short peptides, new approaches to correlating sequence patterns with structure patterns, and folding simulations using simple potentials.
The 2.4 A crystal structure (R = 0.180) of the serine protease inhibitor ecotin was determined in a complex with trypsin. Ecotin's dimer structure provides a second discrete and distal binding site for trypsin and, as shown by modelling experiments, other serine proteases. The second site is approximately 45 A from the reactive/active site of the complex and features 13 hydrogen bonds, including six that involve carbonyl oxygen atoms and four bridged by water molecules. Contacts ecotin makes with trypsin's active site are similar to, though more extensive than, those found between trypsin and basic pancreatic trypsin inhibitor. The side chain of ecotin Met84 is found in the substrate binding pocket of trypsin where it makes few contacts, but also does not disrupt the solvent structure or cause misalignment of the scissile bond. This first case of protein dimerization being used to augment binding energy and allow chelation of a target protein provides a new model for protein-protein interactions and for protease inhibition.
The authors describe the further development of phase refinement by iterative skeletonization (PRISM), a recently introduced phase-refinement strategy which makes use of the information that proteins consist of connected linear chains of atoms. An initial electron-density map is generated with inaccurate phases derived from a partial structure or from isomorphous replacement. A linear connected skeleton is then constructed from the map using a modified version of Greer's algorithm (1985) and a new map is created from the skeleton. This 'skeletonized' map is Fourier transformed to obtain new phases, which are combined with any starting-phase information and the experimental structure-factor amplitudes to produce a new map. The procedure is iterated until convergence is reached. In the paper, significant improvements to the method are described, as is a challenging molecular-replacement test case in which initial phases are calculated from a model containing only one third of the atoms of the intact protein.
A phase-refinement strategy for protein crystallography that exploits the information that proteins consist of connected linear chains of atoms is applied to a molecular-replacement problem, the structure of the protease inhibitor ecotin bound to trypsin, and a single isomorphous replacement problem, the structure of the N-terminal domain of apolipoprotein E. The starting phases for the ecotin-trypsin complex were based on a partial model (trypsin) containing 61% of the atoms in the complex. Iterative skeletonization gave better results than either solvent flattening or twofold non-crystallographic symmetry averaging as measured by the reduction in the free R factor. Protection of the trypsin density during the course of the refinement greatly improved the performance of both skeletonizing and solvent flattening. In the case of apolipoprotein E, the combination of iterative skeletonization and solvent flattening decreased the phase error with respect to the final refined structure significantly more than solvent flattening alone.
The crystal structure of subtilisin BL, an alkaline protease from Bacillus lentus with activity at pH 11, has been determined to 1.4 A resolution. The structure was solved by molecular replacement starting with the 2.1 A structure of subtilisin BPN' followed by molecular dynamics refinement using X-PLOR. A final crystallographic R-factor of 19% overall was obtained. The enzyme possesses stability at high pH, which is a result of the high pI of the protein. Almost all of the acidic side-chains are involved in some type of electrostatic interaction (ion pairs, calcium binding, etc.). Furthermore, three of seven tyrosine residues have potential partners for forming salt bridges. All of the potential partners are arginine with a pK around 12. Lysine would not function well in a salt bridge with tyrosine as it deprotonates at around the same pH as tyrosine ionizes. Stability at high pH is acquired in part from the pI of the protein, but also from the formation of salt bridges (which would affect the pI). The overall structure of the enzyme is very similar to other subtilisins and shows that the subtilisin fold is more highly conserved than would be expected from the differences in amino acid sequence. The amino acid side-chains in the hydrophobic core are not conserved, though the inter-residue interactions are. Finally, one third of the serine side-chains in the protein have multiple conformations. This presents an opportunity to correlate computer simulations with observed occupancies in the crystal structure.
The crystal structure of unliganded dihydrofolate reductase (DHFR) from Escherichia coli has been solved and refined to an R factor of 19% at 2.3-A resolution in a crystal form that is nonisomorphous with each of the previously reported E. coli DHFR crystal structures [Bolin, J. T., Filman, D. J., Matthews, D. A., Hamlin, R. C., & Kraut, J. (1982) J. Biol. Chem. 257, 13650-13662; Bystroff, C., Oatley, S. J., & Kraut, J. (1990) Biochemistry 29, 3263-3277]. Significant conformational changes occur between the apoenzyme and each of the complexes: the NADP+ holoenzyme, the folate-NADP+ ternary complex, and the methotrexate (MTX) binary complex. The changes are small, with the largest about 3 A and most of them less than 1 A. For simplicity a two-domain description is adopted in which one domain contains the NADP+ 2'-phosphate binding site and the binding sites for the rest of the coenzyme and for the substrate lie between the two domains. Binding of either NADP+ or MTX induces a closing of the PABG-binding cleft and realignment of alpha-helices C and F which bind the pyrophosphate of the coenzyme. Formation of the ternary complex from the holoenzyme does not involve further relative domain shifts but does involve a shift of alpha-helix B and a floppy loop (the Met-20 loop) that precedes alpha B. These observations suggest a mechanism for cooperativity in binding between substrate and coenzyme wherein the greatest degree of cooperativity is expressed in the transition-state complex. We explore the idea that the MTX binary complex in some ways resembles the transition-state complex.
The crystal structure of dihydrofolate reductase (EC 1.5.1.3) from Escherichia coli has been solved as the binary complex with NADP+ (the holoenzyme) and as the ternary complex with NADP+ and folate. The Bragg law resolutions of the structures are 2.4 and 2.5 A, respectively. The new crystal forms are nonisomorphous with each other and with the methotrexate binary complex reported earlier [Bolin, J. T., Filman, D. J., Matthews, D. A., Hamlin, R. C., & Kraut, J. (1982) J. Biol. Chem. 257, 13650-13662]. In general, NADP+ and folate binding conform to predictions, but the nicotinamide moiety of NADP+ is disordered in the holoenzyme and ordered in the ternary complex. A mobile loop (residues 16-20) involved in binding the nicotinamide is also disordered in the holoenzyme. We report a detailed analysis of the binding interactions for both ligands, paying special attention to several apparently strained interactions that may favor the transition state for hydride transfer. Hypothetical models are presented for the binding of 7,8-dihydrofolate in the Michaelis complex and for the transition-state complex.