X-ray sequencing

I am looking for suggestions of ways to check a 1.7 Angstrom X-ray sequence for a protein where it is impractical to do experimental sequencing, protein or DNA. The structure refines to publishable R/R-free and the main ambiguities seem to be Thr/Val, Asp/Asn and Glu/Gln where alternative H-bonding networks are possible. Would running AlphaFold be an interesting option? Any suggestions much appreciated. Jon Cooper.

If you know at least something about your protein (organism, type of molecule), you could try mass spectrometry peptide mapping against known sequences. If nothing else, this may give you some answers for the ambiguities you are seeing. Jan Dohnalek.

I think a BLAST search with a few (10) consecutive amino-acid residues is sufficient to tell you which protein it is. Ana Luisa Carvalho.
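
A minimal sketch of how such short queries could be prepared, assuming the X-ray sequence is available as a one-letter string. The function names, the window length of 10 and the step of 5 are illustrative choices, not part of any BLAST tool; the output is plain FASTA text that can be pasted into a search form.

```python
def peptide_windows(sequence, size=10, step=5):
    """Yield (start, peptide) pairs of consecutive residues; start is 1-based."""
    seq = sequence.strip().upper()
    for i in range(0, max(len(seq) - size + 1, 0), step):
        yield i + 1, seq[i:i + size]


def to_fasta(sequence, size=10, step=5, name="xray_seq"):
    """Format every window as a FASTA record (the record names are hypothetical)."""
    records = []
    for start, peptide in peptide_windows(sequence, size, step):
        records.append(f">{name}_{start}-{start + size - 1}\n{peptide}")
    return "\n".join(records)


if __name__ == "__main__":
    # Toy sequence for illustration only, not a real protein.
    print(to_fasta("ACDEFGHIKLMNPQRSTVWY"))
```

Overlapping windows are used so that a stretch containing one doubtful residue is still covered by a neighbouring query.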

Thank you so much for your replies. I apologise for being unclear. The protein is purified from a plant that hasn't had its genome sequence determined. We know the enzyme family of the protein and therefore the structure was originally solved by MR. The 'X-ray sequence' we have is just determined from looking at the 1.7 Angstrom density, which is good, over several refinement and rebuilding rounds. The resulting sequence has been run through BLAST and it is up to 58% identical with other family members. To me this seemed low but that degree of identity is typical of other family members. The postgrad who did the work did obtain some peptide sequences and prior to that about 20% of the sequence was determined by the Edman method with the usual Asp/Asn and Glu/Gln ambiguity. However, there isn't any prospect of us doing further experimental work, sorry, but that's the way it is!! Jon Cooper.

Reminds me very much of one of our projects: FEBS J. 2014 281(18):4319-33. doi: 10.1111/febs.12900. A plant peroxidase, and the tree is not sequenced. We tried RNA extraction from the latex, but the tree is sacred and I didn't want the student to bring living material of a sacred tree from India to Germany. We got some material prepared by the botanical garden in Giessen, but that didn't work. Well, we are not plant biochemists, either. As far as I know, the tree is still not sequenced. The problem during deposition was that the PDB wants a link for the sequence, but UniProt only created a sequence entry after the PDB entry went public. So all we could do is what we tell the students: compare with the sequence homologues (similar to your numbers), check hydrogen bonds, compare B-factors in Asp/Asn models. Gottfried Palm.

We did exactly the same back in 1999: 1) Natesh R, et al., (1999) J. Mol. Biol. 288, 999-1012, and then updated the sequence derived from the 1.8 A crystal structure using an ultra-high-resolution electron density map. We were also able to model additional terminal residues in the ultra-high-resolution map: 2) Natesh R, et al., (2003). Acta Crystallogr., D59, 105-117. Ramanathan Natesh.

Then I'd just be "honest" about it and submit to your favourite IUCr journal. Mark J van Raaij.

Maybe a crazy idea, but couldn't one use various model/geometry validation tools to figure out some of those ambiguities? As a test one could take a very good 1.7 A structure and do some random ASN->ASP, THR->VAL etc mutations followed by refinement (including hydrogens). Wouldn't some validation tool pick up unfavourable conformations, poor rotamers and/or hydrogen clashes and poor H-networks (compared to the initial, correct sequence)? Maybe there is some kind of "fingerprint" in validation results for such incorrect residue assignments that can distinguish correct from incorrect sequences ... Or put another way, if model validation cannot pick up such sequence errors: should we be worried about the reliability of our validation criteria? A large scale re-refinement of deposited structures with (1) the current/correct sequence and (2) those ASN/ASP, THR/VAL etc ambiguities artificially introduced, could provide a clever algorithm (AI?) with the data basis to figure out those "fingerprints". Maybe even for the ASN/GLN/HIS side-chain orientations when the sequence is actually correct. Clemens Vonrhein.
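
The first step of that test, introducing random near-isosteric swaps into a known-correct sequence, could be sketched as below. This is only an illustration: the swap table, the fraction and the function name are assumptions, and the refinement and validation runs themselves are not covered, since they depend on whichever programs are used.

```python
import random

# Near-isosteric pairs named in this thread (one-letter codes).
ISOSTERIC = {"N": "D", "D": "N", "Q": "E", "E": "Q", "T": "V", "V": "T"}


def make_decoy(sequence, fraction=0.2, seed=0):
    """Swap a random fraction of swappable residues for their isosteric partner.

    Returns the decoy sequence and a list of (position, old, new) tuples,
    with 1-based positions, so the errors can be tracked after refinement.
    """
    rng = random.Random(seed)  # fixed seed keeps the decoy reproducible
    candidates = [i for i, aa in enumerate(sequence) if aa in ISOSTERIC]
    n_swap = round(fraction * len(candidates))
    chosen = sorted(rng.sample(candidates, n_swap))
    residues = list(sequence)
    for i in chosen:
        residues[i] = ISOSTERIC[residues[i]]
    swaps = [(i + 1, sequence[i], residues[i]) for i in chosen]
    return "".join(residues), swaps


if __name__ == "__main__":
    # Toy sequence for illustration only.
    decoy, swaps = make_decoy("MNDTVEQKNDTV", fraction=0.5, seed=1)
    print(decoy)
    for pos, old, new in swaps:
        print(f"{old}{pos}{new}")
```

Keeping the list of introduced swaps is what would let one ask afterwards whether a given validation metric flags those positions more often than the untouched ones.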

Maybe SEQUENCE SLIDER from Borges & Uson could be a good option here actually: http://chango.ibmb.csic.es/tutorial_slider_remote described in: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7057211/. I think I came across it from reading this paper: https://www.sciencedirect.com/science/article/pii/S0300908420302078. Never used it though but could be a nice and unbiased way of assigning the sequence! Ulrich Eckhard.

Thank you, yes, threading-style tools to assess the likelihood of having a given amino acid in a certain position in the fold would be a good approach. I have tried one but it wasn't hugely informative, in my hands anyway. All suggestions very welcome but big database science is a bit outside my skill set. Of course, sequence conservation in the family helps a lot with the assignment but there are bigger ambiguities in less conserved surface regions. For example, a disordered Lys can refine well as a Ser, Ala or even Gly. Nearby conserved acidic groups might suggest the presence of the basic amino acid, which could salt-bridge with them, but why then would it be so disordered? Jon Cooper.

There are placeholders for ASP/ASN and GLU/GLN ambiguities: ASX and GLX respectively. You can just use those. AFAICT there is no such thing for VAL/THR ambiguities. You could look for the most likely candidates based on multiple sequence alignments. Refinement of both alternatives can give hints in B-factors and, if you are lucky, in difference density. But if hydrogen bonding gives no hints, then the residues are also not in a place where the identity really matters. You can give your best guess with a CAVEAT record or use the name UNK to indicate that you do not know what the residue is. You would lose the knowledge that it is either VAL or THR in that case. Robbie Joosten.
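
As a small sketch of that naming rule, the mapping below turns a set of candidate residue names into the label one would write. The function name and the fallback to UNK for any other unresolved pair are illustrative assumptions; as noted above, falling back to UNK discards the information that the residue is, say, VAL or THR.

```python
# Placeholder names for the two ambiguities that have one.
AMBIGUOUS = {
    frozenset({"ASP", "ASN"}): "ASX",
    frozenset({"GLU", "GLN"}): "GLX",
}


def residue_label(candidates):
    """Return a three-letter residue name for a set of candidate identities."""
    cands = frozenset(name.upper() for name in candidates)
    if len(cands) == 1:
        return next(iter(cands))  # unambiguous assignment
    # Any other unresolved combination (e.g. VAL/THR) has no placeholder.
    return AMBIGUOUS.get(cands, "UNK")


if __name__ == "__main__":
    print(residue_label(["Asp", "Asn"]))  # ASX
    print(residue_label(["Val", "Thr"]))  # UNK
```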

I don't know how much effort I would put into that, given how easy nucleic acid sequencing has become. David J. Schuller.

Maybe worth asking someone if AlphaFold/RoseTTAFold can be run in reverse (i.e. what sequence would give the maximum score for a given alpha carbon matrix). Might be totally impossible. Matthew Snee.

In 2003, I faced the same problem and, despite all my efforts to convince the referees of my approach, they were not receptive. I had to perform DNA sequencing with a degenerate sequence based on the primary structure derived from my protein's electron density. We used the following approaches:
1. calculated sequence similarity with dozens of homologous proteins
2. calculated correlation between calculated and measured electron densities
3. calculated R-factors and solvent accessibility per residue
All of these were used to propose a level of confidence for each residue. Ronaldo Nagem.
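
One way such per-residue evidence could be combined into a confidence level is sketched below. This is not the scheme used in 2003: the three inputs, the thresholds and the simple vote count are all assumptions chosen for illustration, and the per-residue R-factor is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ResidueEvidence:
    conservation: float       # fraction of homologues with this residue, 0..1
    density_cc: float         # correlation of calculated vs measured density
    rel_accessibility: float  # relative solvent accessibility, 0..1


def confidence(ev, cc_min=0.8, cons_min=0.5, acc_max=0.5):
    """Count how many criteria support the assignment (thresholds are assumed)."""
    votes = 0
    votes += ev.density_cc >= cc_min          # density fits the side chain
    votes += ev.conservation >= cons_min      # homologues agree
    votes += ev.rel_accessibility <= acc_max  # buried, so usually better ordered
    if votes == 3:
        return "high"
    return "medium" if votes >= 1 else "low"


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    print(confidence(ResidueEvidence(0.9, 0.95, 0.1)))
    print(confidence(ResidueEvidence(0.1, 0.50, 0.9)))
```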

I am the programmer of SEQUENCE SLIDER, which Ulrich Eckhard mentioned. The code for sequence assignment is available at https://github.com/LBME/slider and the related article is https://academic.oup.com/nar/article/50/9/e50/6519368 . I applied it to assign the sequence, and its uncertainties, of snake venom isoforms that were purified directly from natural sources; this is one of the articles Ulrich sent you. There are also https://www.sciencedirect.com/science/article/pii/S014181302102002X and https://www.sciencedirect.com/science/article/pii/S0304416521000714 , but in those I integrated mass spectrometry (MS) and phylogenetic analysis to validate sequence hypotheses proposed from crystallographic data.

Feel free to test it if you would like, or I am happy to do it for you. The code is rather simple. It has different modes but, roughly speaking, it will generate all 20 side chains for each residue, calculate the RSCC using a polder map to avoid bias, and organize the values in tables. As you already expected, a systematic evaluation with SLIDER will leave some residues discriminated and others ambiguous. This ambiguity can be further reduced using MS and phylogenetic analysis or, the other way around, data from MS/phylogenetic analysis can restrict the sequence possibilities evaluated against the crystallographic data.
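
To illustrate the idea only (this is not SLIDER's code), the sketch below computes a real-space correlation coefficient as a plain Pearson correlation over density values sampled around a residue, then splits the candidate side chains at one position into a discriminated or an ambiguous result. The function names and the tolerance of 0.02 are assumptions, and the density values are taken as already extracted from the maps.

```python
from math import sqrt


def rscc(rho_obs, rho_calc):
    """Pearson correlation between observed and calculated density values."""
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    return cov / sqrt(var_o * var_c)


def classify(scores, tolerance=0.02):
    """Return ('discriminated' | 'ambiguous', candidates close to the best RSCC).

    scores maps a residue name to its RSCC at one sequence position.
    """
    best = max(scores.values())
    close = sorted(aa for aa, cc in scores.items() if best - cc <= tolerance)
    status = "discriminated" if len(close) == 1 else "ambiguous"
    return status, close


if __name__ == "__main__":
    # Made-up RSCC values for illustration only.
    print(classify({"THR": 0.93, "VAL": 0.92, "SER": 0.85}))
    print(classify({"TRP": 0.95, "PHE": 0.80}))
```

Positions that come out ambiguous are the ones where MS, phylogeny or the hydrogen-bonding environment would then have to decide.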

What has not yet been published is another data set for which we have no more sample to perform sequencing experiments, so I am in the same situation as you. For that case, I am now characterizing the physicochemical environment to aid discrimination of approximately isosteric side chains and testing machine learning to help sequence prediction. The machine learning part is not yet distributed, but I am happy to run it on your case and share the results if you would like. Rafael Borges

A couple of other relevant papers:
http://www.biotechnologia-journal.org/Identification-of-amino-acid-sequences-r-nvia-X-ray-crystallography-a-mini-review-of-case-studies,85,23820,0,1.html
https://bmcstructbiol.biomedcentral.com/articles/10.1186/1472-6807-7-29