# Blog

## About the classic papers club

The aim of this club is to read the papers that everyone keeps citing but which few people have read. We plan to read a paper every third week for the next 20 years.

We tend to meet Wednesday mornings, and tend to announce the papers we will read ahead of time. Everyone is welcome: if a paper sounds interesting to you, please come by.

This reading group used to be organised on facebook. The old page can be found here.

## About the Science Book Club

*Mathematical Chemistry and Chemoinformatics*, by A. Kerber et al. (summary part I and part II)*Phylogeny: Discrete and random processes in evolution*, by M. Steel (review published on SIAM News Blog: part I, part II, part III)*Bayesian Methods in Structural Bioinformatics*, edited byT. Hamelryck, K. Mardia and J. Ferkinghoff-Borg-
*Phylogenetics*, by C. Semple and M. A. Steel *Protein Physics – A course of lectures*, by A. V. Finkelstein and O. Ptitsyn (summary slides)

## About the Humanities Book Club

This book club is an endeavour to broaden our horizons and critically engage with good writing from across the humanities. We intend to go through a book per term, and tend to meet roughly once every 3rd week to discuss new sections of whatever book we are currently going through. Past books that we have read include:

*The General Theory of Employment, Interest and Money,*by John Maynard Keynes*The*, by Arthur J. Droge*Qurʼān – A New Annoteted Translation*

*On Politics*, by Alan Ryan*Capital in the Twenty-First Century*, by Thomas Piketty

The size of the reading group is capped at 6 persons. Current members include Jotun Hein, Mathias Cronjäger, James Anderson (former DPhil student), and Eddie Rolls (former summer school student).

## Today: OMICS and Deep Learning

we had 3 papers to read for today, but only discussed one “Predicting the effects of non-coding variants..” (2015) Zhou and Troyanskaya. It had impressive results but was frustrating to read since details of both data and model were not described.

Wednesday 9AM we will discuss “END-TO-END DIFFERENTIABLE LEARNING OF PROTEIN STRUCTURE” (2018) by Mohammed AlQuraishi who has solved the protein folding problem. And “Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model” (2017)

## Deep Learning Study Group

Since early May we have met Monday Wednesday and Friday 9AM reading books and papers relevant to DL. We started with reading/discussing 700 pages in Goodfellow (2016) et al : Deep Learning and Giron (2017) Hands-On using TensorFlow. after that we switch to papers from different application areas and have taken papers on Games (GO, Backgammon, Atari), Biosciences (Baldi, 2018) and are now in the middle of Genotype–>Phenotype mapping. We have 4 weeks to go and hope to cover protein structure, chemoinformatics, finance and a few deeper methodological papers from people in the Department. Any can join and come with suggestions about what to read.

## IT WILL BE AWFULL

I hope I am wrong – I worked hard for this not to be the case, but maybe not hard enough. I am giving a graduate lecture – 4th in 18 months on “Algorithms, Combinatorics in ChemoInformatics”.

Title: Combinatorics and Algorithms in ChemoInformatics Venue: May 17th 3.30 PM Department of Statistics, LG.04

Summary: Chemoinformatics is central to Drug Development and Design. In this lecture, we will go through key algorithms and combinatorics related to Chemoinformatics. Such algorithms are graph isomorphism, subgraph isomorphism, maximal common subgraphs and double pushout graph grammars. Combinatorics include generating functions for counting/enumerating special classes of molecules starting with alkanes, Polya-counting/enumerating molecules with symmetries, recursive enumeration of molecular graphs. We will also mention calculation of synthetic pathways, prediction of reactions and catalysis, exploration of chemical space and the potential for the use of Deep Learning. The talk attempts to survey these techniques in a way that should be useful for users that normally don’t venture into these techniques but maybe use chemoinformatics tools. 90 minutes – 30 slides:

## THE BRAIN and CONSCIOUSNESS

We are done with Roemer (1996): “Theories of Distributive Justice”. It was hard to digest to say the least and we might still have a wrap-up dinner at University College, if I find an expert who are willing to discuss with the 4 readers.

We are real happy to move on and somehow we found the above topic appealing. I have avoided reading on this as I feel the Brain extremely complex, textbooks on Neuroscience are often huge, I am sceptical about the contribution about philosophy to the kind of knowledge I strive for. I personally feel absolutely fine with a life without consciousness and have repeatedly recommended it to others.

But it is one of the BIG QUESTIONS and the ability to simulate the brain grows, so it would be real exciting to scratch the surface of this topic. I have also included 4 lectures in my course Topics in Computational Biology in 2019, so I better get started reading the background literature and for once this fits right in.

I emailed 5 Professors of Neuroscience I knew, for recommendations and so far this book got unambigous praise, but I hope for a bit more feedback and advice on supplementary readings. We are real careful with the books we read and it is not easy to get through the needle’s eye. It has to be hard to read so it needs thorough discussion for each chapter. In Roemer we at times used 30 minutes per page…..

But if I don’t find a better book by Monday morning, we will start Friday May 4th 6.30AM UK time per skype to discuss Stan Dehaene (2014): “Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts”. We might supplement this with technical literature if need be.

## Talk by Julián Echave Tues 11am

### Professor Julián Echave will be giving an informal talk on Tuesday 10 April at 11am in the Small Lecture Theatre at the Department of Statistics.

### Title: Protein evolutionary divergence is not random

#### Abstract

A simple comparison of homologous proteins shows clear patterns of differential conservation/variation at the levels of amino-acid sequence, 3D structure, and protein motions. For instance, the rate of sequence evolution varies among sites; protein structures diverge more at some sites than others, and some protein vibrations are more variable than others. The default explanation of evolutionary patterns is the rather fuzzy concept of “functional importance”: the underlying assumption is that any extra conservation/variability is due to natural selection. However, while selection does indeed shape sequence divergence, the patterns of divergence of structure and motion are mostly shaped by the physics of the response of proteins to random mutations.

## Enumerating Molecules – Summary of 2 papers

We read:

Faulon, Jean‐Loup, Donald P. Visco, and Diana Roe. “Enumerating Molecules.” Reviews in Computational Chemistry, Volume 21: 209-286.

and

Meringer, Markus. “Structure enumeration and sampling.” Handbook of chemoinformatics algorithms (2010): 233-267.

——

There are 3 main topics in enumeration molecules:

1) Counting (which produces a number)

2) Enumeration (which produces a list of molecular graphs)

3) Sampling (which produces a random molecular graph)

1) Counting

The most important tool in this area is “Polya’s counting method”. It can be used to count derivatives of benzene, by not double counting molecules which can be turned into each other through a rotation.

In general, you identify the similar components of your molecule which you want to label by something. You then work out the subgroup of the symmetric group that acts on these components. Afterwards, the cycle types give rise to some variables and added together to create a polynomial called Z.

Finally, substituting into Z gives a polynomial C, whose coefficients describe how many different graphs exist with the labels corresponding to the variables in the term.

If the number of labels is substitutes instead of the labels in polynomial form, then one obtains the sum of the coefficients directly.

The reason Polya counting works can be understood, by looking at Burnsides lemma, which states that the number of orbits (e.g number of different molecules) equals the average number of fixed points.

Another application of Polya’s counting method comes from applying it to counting series themselves (meta level warning) – this idea gives a counting series for the number of Alkynes (rooted carbon trees). Many variations of Alkynes with other molecules attached can be counted with the initial result. Ironically, Alkanes, which usually seems simpler than alkynes, is one such example, but it is way more complicated to count them.

Faulon suggests that further progress in this direction might lead to a solution of counting isomers in general, which is not currently possible in a closed form way. The closest is by Wang, Li, Wang, which found a counting series for C_iH_{2i+2}O_j, with some further restrictions, which is far from all molecular formulas, but still quite general.

It is, of course, possible to brute-force the problem by constructing all the molecular graphs, which brings us to the second topic.

2) Enumeration

When it comes to enumeration, the “equivalent of Polya counting” in this area is orderly enumeration. It basically consists of having an ordering on the graphs, which allows for the notion of canonical. The best software seems to be McKay’s nauty for enumerating graphs.

The idea is the following:

a) Nodes have an order

b) This induces a lexicographical order on the edges

c) This induces a partial order on the graphs, by letting A be smaller than B if B contains A or B equals A up to a point and then has a larger edge.

If one is interested in particular graphs, rather than all graphs on n nodes:

If the criteria are “consistent” (meaning graphs fail if their subgraphs fail), then it can be incorporated into the algorithm in a natural way (since both the generation and the criteria are “bottom-up”)

An interesting comment on why Goldberg’s theoretical result on enumeration, that runs in polynomial time per outputs cannot be directly implemented (An assumption is made, that the number of (n,q)-graphs is not necessarily greater than the number of (n,q-1)-graphs).

3) Sampling

Dixon and Wilf have a cool and simple algorithm. The main idea is this:

a) Pick a random permutation in S_n, where n is the number of atoms.

b) Then let the permutation act on the edges (a,b) by acting on the nodes a,b individually – compute the cycles of this edge-permutation.

c) For each cycle, let all the edges exist or none of them.

This produces a random graph.

Several expansions have been made, such as Goldberg and Jerrums algorithm, which shows that it is theoretically possible to sample molecules in polynomial time. This is peculiar since both the enumeration and enumeration computational complexities are open problems.

Simulated annealing sampling and a “genetic” algorithm are also mentioned.

According to Faulon, the sampling problem lack really strong tools which compare to Polya counting or ordered enumeration.

Our thoughts:

These 2 reviews are very similar and if you only read one, then we would recommend the first.

These predate the work of Reymond and the GDB-17 databases of all small molecules organic molecules with 17 non-hydrogen atoms or fewer.

One thing, that is not mentioned in the papers is chemical filters – it seems to this reader, that molecular graphs, such as 4 oxygen-atoms all bound together (O1OOO1) satisfies the mathematical rules, but not the chemical ones. Another example of unreasonable molecular graphs can be found in the creation of Reymond’s GDB-17 databases there were a series of filters applied, which came from chemical considerations of reasonability and not any theory mentioned in this review.

If this example points to a deeper problem, then all the enumeration and counting are merely supersets and upper bounds to the molecules, which would explain why mathematicians are often more interested in these problems than chemists.

Overall, these papers are worth reading.

## THE GREAT GOLDEN

He will talk on an correlation model for RNA evolution 3.30PM in the lower floor of Department of Statistics. The slides can be seen here: https://goo.gl/e6r4gx I spoke a week ago on the work I did in Israel: https://tinyurl.com/longisrael