
# Author Archives: thewildwilli

## Enumerating Molecules – Summary of 2 papers

We read:

Faulon, Jean‐Loup, Donald P. Visco, and Diana Roe. “Enumerating Molecules.” Reviews in Computational Chemistry, Volume 21: 209-286.

and

Meringer, Markus. “Structure enumeration and sampling.” Handbook of chemoinformatics algorithms (2010): 233-267.

——

There are three main topics in enumerating molecules:

1) Counting (which produces a number)

2) Enumeration (which produces a list of molecular graphs)

3) Sampling (which produces a random molecular graph)

1) Counting

The most important tool in this area is “Polya’s counting method”. It can be used, for example, to count the derivatives of benzene without double counting molecules that can be turned into one another through a rotation.

In general, you identify the similar components of your molecule which you want to label by something. You then work out the subgroup of the symmetric group that acts on these components. The cycle types of the group elements then give rise to monomials in some variables, which are averaged to create a polynomial Z, the cycle index.

Finally, substituting the label variables into Z gives a polynomial C, whose coefficients describe how many different graphs exist with the labels corresponding to the variables in each term.

If the number of labels is substituted for each variable instead of the labels in polynomial form, then one obtains the sum of the coefficients directly, i.e. the total number of distinct labelled structures.
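As a sketch, the benzene example can be carried out in a few lines of Python. The symmetries of the hexagon (here taken to be the full dihedral group of rotations and reflections) are written out as permutations, and the substitution x_i = k into the cycle index is performed directly via the cycle counts. This is an illustration of the method, not code from either review:

```python
def num_cycles(perm):
    """Number of cycles of a permutation, given as a tuple with i -> perm[i]."""
    seen, count = set(), 0
    for i in range(len(perm)):
        if i not in seen:
            count += 1
            j = i
            while j not in seen:
                seen.add(j)
                j = perm[j]
    return count

n = 6
# The 12 symmetries of the hexagon (dihedral group) acting on the 6 positions:
rotations = [tuple((i + r) % n for i in range(n)) for r in range(n)]
reflections = [tuple((r - i) % n for i in range(n)) for r in range(n)]
group = rotations + reflections

def count_labelings(k):
    """Average k^(number of cycles) over the group elements; this is
    equivalent to substituting x_i = k into the cycle index Z."""
    return sum(k ** num_cycles(g) for g in group) // len(group)

print(count_labelings(2))  # 13: distinct ways to place 2 labels (e.g. H/Cl)
print(count_labelings(3))  # 92: with three possible substituents
```

Substituting k = 2 corresponds to two possible labels per position and recovers the classical count of 13 distinct benzene substitution patterns.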

The reason Polya counting works can be understood via Burnside’s lemma, which states that the number of orbits (e.g. the number of different molecules) equals the average number of fixed points of the group elements.

Another application of Polya’s counting method comes from applying it to counting series themselves (meta level warning) – this idea gives a counting series for the number of alkynes (which correspond to rooted carbon trees). Many variations of alkynes with other groups attached can be counted from this initial result. Ironically, alkanes, which usually seem simpler than alkynes, are one such example, but they are much more complicated to count.

Faulon suggests that further progress in this direction might lead to a solution of counting isomers in general, which is not currently possible in closed form. The closest result is by Wang, Li, and Wang, who found a counting series for C_iH_{2i+2}O_j with some further restrictions; this is far from covering all molecular formulas, but still quite general.

It is, of course, possible to brute-force the problem by constructing all the molecular graphs, which brings us to the second topic.

2) Enumeration

When it comes to enumeration, the “equivalent of Polya counting” in this area is orderly enumeration. It consists of imposing an ordering on graphs, which allows a canonical representative of each isomorphism class to be defined. The best available software for enumerating graphs seems to be McKay’s nauty.

The idea is the following:

a) Nodes have an order

b) This induces a lexicographical order on the edges

c) This induces a partial order on graphs: A is smaller than B if B contains A, or if A and B agree up to some point and B then has a larger edge.

If one is interested only in particular graphs, rather than all graphs on n nodes, and the criteria are “consistent” (meaning a graph fails whenever one of its subgraphs fails), then they can be incorporated into the algorithm in a natural way, since both the generation and the criteria are “bottom-up”.
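The scheme in a)–c) can be illustrated for small unlabelled graphs. The toy sketch below uses the lexicographically smallest edge list over all relabelings as the canonical form, and extends a graph only with edges larger than its current largest edge; this shows the idea behind orderly generation, not nauty's far more efficient implementation:

```python
from itertools import combinations, permutations

def canonical(edges, n):
    """Lexicographically smallest edge list over all relabelings of the n nodes."""
    best = None
    for p in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((p[a], p[b]))) for a, b in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best

def orderly_enumerate(n):
    """One representative per isomorphism class of simple graphs on n nodes.
    A graph is extended only with edges larger than its current largest edge,
    and a child is kept only if it equals its own canonical form."""
    all_edges = list(combinations(range(n), 2))
    graphs, frontier = [()], [()]
    while frontier:
        new_frontier = []
        for g in frontier:
            start = all_edges.index(g[-1]) + 1 if g else 0
            for e in all_edges[start:]:
                child = g + (e,)
                if child == canonical(child, n):
                    new_frontier.append(child)
        graphs.extend(new_frontier)
        frontier = new_frontier
    return graphs

print(len(orderly_enumerate(4)))  # 11 non-isomorphic graphs on 4 nodes
```

The key property making this correct is “bottom-up” consistency: removing the largest edge of a canonical graph leaves a canonical graph, so every representative is reached exactly once.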

There is an interesting comment on why Goldberg’s theoretical result on enumeration, which runs in polynomial time per output, cannot be directly implemented: the algorithm makes an assumption about the number of (n,q)-graphs relative to the number of (n,q-1)-graphs that does not necessarily hold.

3) Sampling

Dixon and Wilf have a cool and simple algorithm. The main idea is this:

a) Pick a random permutation in S_n, where n is the number of atoms.

b) Then let the permutation act on the edges (a,b) by acting on the nodes a,b individually – compute the cycles of this edge-permutation.

c) For each cycle, let all the edges exist or none of them.

This produces a random graph.
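The three steps can be sketched as follows. Note that the full Dixon–Wilf algorithm chooses the permutation with probabilities weighted by conjugacy class so that the sampling is exactly uniform over isomorphism classes; the sketch below simplifies this to a uniformly random permutation:

```python
import random

def sample_graph(n, seed=None):
    """Sample a random graph on n nodes following the Dixon-Wilf idea:
    pick a permutation, find the cycles of its action on potential edges,
    and include each cycle of edges entirely or not at all."""
    rng = random.Random(seed)
    perm = list(range(n))              # a) a random permutation of the atoms
    rng.shuffle(perm)
    edges = [(a, b) for a in range(n) for b in range(a + 1, n)]
    # b) the induced permutation on edges: (a, b) -> (perm[a], perm[b])
    act = {(a, b): tuple(sorted((perm[a], perm[b]))) for a, b in edges}
    seen, chosen = set(), set()
    for e in edges:
        if e in seen:
            continue
        cycle, f = [], e
        while f not in seen:           # trace one cycle of the edge permutation
            seen.add(f)
            cycle.append(f)
            f = act[f]
        if rng.random() < 0.5:         # c) all edges of the cycle, or none
            chosen.update(cycle)
    return sorted(chosen)
```

Because the induced map on edges is a bijection, its cycles partition the potential edges, and each coin flip decides one whole cycle.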

Several extensions have been made, such as Goldberg and Jerrum’s algorithm, which shows that it is theoretically possible to sample molecules in polynomial time. This is peculiar, since the computational complexities of both counting and enumeration remain open problems.

Simulated annealing sampling and a “genetic” algorithm are also mentioned.

According to Faulon, the sampling problem lacks really strong tools comparable to Polya counting or orderly enumeration.

Our thoughts:

These two reviews are very similar; if you only read one, we would recommend the first.

They predate the work of Reymond and the GDB-17 database of all small organic molecules with 17 non-hydrogen atoms or fewer.

One thing that is not mentioned in the papers is chemical filters. It seems to this reader that some molecular graphs, such as four oxygen atoms all bound together in a ring (SMILES O1OOO1), satisfy the mathematical rules but not the chemical ones. Along the same lines, in the creation of Reymond’s GDB-17 database a series of filters was applied, based on chemical considerations of reasonableness rather than on any theory mentioned in these reviews.

If this example points to a deeper problem, then all the enumeration and counting results are merely supersets of, and upper bounds on, the chemically realistic molecules, which would explain why mathematicians are often more interested in these problems than chemists are.

Overall, these papers are worth reading.

## What did we learn at: Origins of Life Conference – ISSOL17


- written by Jotun Hein

Overall, attending the conference was very useful, since I haven’t been to an Origins conference for more than five years, and since I stopped teaching Origins I generally don’t read so much on the chemical nitty-gritty.

There was much interesting material at the conference, and of course I met some people from Oxford working on catalysis whom I had never seen before.

The first day [Monday] was mainly devoted to Exoplanets and Meteorites/Comets/Transport of Organic Matter.

The second day [Tuesday] was about the physical conditions on Earth 4 billion or so years ago.

The third 1/2 day [Wednesday] was dedicated to the first chemical steps towards life.

The last 2 days were on the early evolution of life and more theoretical models.

Origins of Life studies are clearly getting a lot more attention/funding now. Computational studies play a much larger role. There are much more serious attempts at synthesizing life de novo. But I can’t say there is a single convincing scenario for planet Earth. Exoplanets clearly are very exciting, but there is no way to study the architecture of life so far away [barring SETI – which was unrepresented at ISSOL], so all one can hope for over the next couple of centuries is the observation of convincing biosignatures.

There seemed to have been a lot of organizational problems. I didn’t know where to go to sleep and ended up sitting all night in the airport (while paying for a room at UCSD). Another person I met had experienced something similar. The conference dinner was not very different from the free dinner, and there were no arrangements for where to go afterwards. Anybody going to conferences/workshops knows that many connections are made during the evening socializing.

I, William Kurdahl, and possibly some of the Oxford catalysis people will give an informal orientation about the meeting on Tuesday, August 29th, 3 PM, in the small lecture room in the Department of Statistics, Oxford.

William and I both chose 5 papers/presentations that we liked.

These are the slides in progress:

https://www.dropbox.com/s/p5tmy3a1g8i2kd0/ISSOL.pptx?dl=0

## Extreme Reading – status report

There is a famous Danish sketch called “Jarl Kakadue” from the show “Casper og Mandrilaftalen”. In the sketch, Jarl explains how he completed an iron man, except that instead of running a marathon, he got a good night’s sleep.

“But isn’t that cheating?” the host asks, to which Jarl replies, “No, because such a run takes a couple of hours, but a proper night’s sleep is at least 8 hours.”

As the sketch goes on, more and more of the exercise gets replaced. The full thing can be seen here (in Danish):

The concept of Extreme Reading is also a modified iron man in the following sense:

instead of swimming, we read a book.

instead of cycling, we summarise the book

and

instead of running a marathon, we run half a marathon (over 3 days)

So each day, we read for a couple of hours, ran 7 kilometers, read some more and then we summarized the book for each other and discussed it.

The book in question was “The Origin and Nature of Life on Earth – The Emergence of the Fourth Geosphere” by Eric Smith and Harold J. Morowitz.

Unfortunately, the book is rather wordy and not very mathematical. The individual sections are nicely structured, but the book lacks a main message and a sense of direction.

This is puzzling, since Morowitz’s other books are usually shorter and more precise. However, Morowitz died before the book was published, had been very weak in his last decade, published little in that period, and was in general very concise in his formulations, while this book is very long (at times lengthy). It is unclear how much Morowitz contributed to the present book.

This book is 600 pages long and consists of 8 chapters. This is a very hard topic to write a coherent book about, and the chapters are quite free-standing contributions to describing or explaining the theory of life.

Eric Smith gave a talk somewhat based on the book, which can be found here: https://www.youtube.com/watch?v=0cwvj0XBKlE

The 4 geospheres are:

Atmosphere (air)

Hydrosphere (water)

Lithosphere (earth)

Biosphere (life)

The point of the title is that life should be thought of as a planetary property. However, the point seems more philosophical than scientific, which is the case with many of the subtle points in the book.

A longer summary will be added later.

Overall, the project was a success. We managed to run and read a lot. It is a very satisfying feeling to be both mentally and physically exhausted and we can definitely recommend similar undertakings.

## Limits of Complexity

[written by William K. Larsen]

We read two papers. Both were review articles from the journal Nature.

#### Summary of: Ultimate physical limits to computation (2000), by Seth Lloyd

The article is written for a generally scientifically literate audience and is mostly easily digestible. A description of the concept of an AND gate is given, which is rather elementary, but the paper covers advanced topics as well.

Lloyd defines the **ultimate physical laptop** as a “computer” with a mass of 1 kg and a volume of 1 litre.

Two main arguments are made concerning the ultimate physical laptop:

- Energy limits the speed of computation, which leads to a limit of approximately 10^50 operations per second (10^41 GHz).
- Entropy limits memory space, which leads to a limit of approximately 10^31 bits (10^21 GB).

Overall, the paper makes you think of a computer as a physical object with energy, mass and volume, rather than the idealized model often used in computer science. It gives a taste of a number of physical concepts and ideas along the way.
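The speed bound can be reproduced from the Margolus–Levitin theorem, which limits a system of average energy E to at most 2E/(πħ) operations per second; taking E = mc² for the laptop's 1 kg of mass gives the order of magnitude quoted above (constants rounded):

```python
import math

# Margolus-Levitin: a system with average energy E performs at most
# 2E / (pi * hbar) elementary operations per second.
c = 2.998e8         # speed of light, m/s
hbar = 1.0546e-34   # reduced Planck constant, J*s
m = 1.0             # the ultimate laptop's mass, kg

E = m * c ** 2                        # all mass counted as energy, ~9e16 J
ops_per_sec = 2 * E / (math.pi * hbar)
print(f"{ops_per_sec:.2e}")           # ~5.4e50 operations per second
```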

Our discussion:

The limits presented in the paper are very large upper bounds. One could theoretically fill an empty jar with colored sand and consider the very large number of possible configurations as being representative of the information stored. The problem is that the jar of sand cannot be utilized for work in the same way that a computer can.

This paper was written before the death of Moore’s law, and thus the limits may have seemed more like an important perspective than a fun thought experiment.

#### Brief summary of: Limits on fundamental limits to computation (2014), by Igor L. Markov

The aim of this paper is to get a perspective on the different possible barriers to the development of computers.

The other paper considered the problem of Fundamental limits to computation by looking at Energy, Space and Information.

This corresponds to only 3 of 25 cells in the following table from the paper:

Markov considers different cells in this table and makes some concluding remarks.

Our discussion of the paper:

The paper is really hard on quantum computing, claiming that quantum computers can only find use in simulating quantum-chemical phenomena, and that even this is uncertain.

Markov generally does not go into too much depth with anything, but the references seem very useful. Eight of them (out of 99) are highlighted and given a short description.

## Limits of computation

Thursday February 16th 10AM we will discuss:

http://www.nature.com/nature/journal/v512/n7513/full/nature13570.html

and

http://www.nature.com/nature/journal/v406/n6799/full/4061047a0.html

Jotun will join via Skype; at least Soren and William will be in the Department of Statistics, Oxford, room LG.05.

## Graph Grammar Library

Summary of the paper “The Graph Grammar Library – a generic framework for chemical graph rewrite systems” by Flamm et al.

The paper explains how a C++ package works that implements graph grammar rewriting rules for chemical reaction networks. There is some other chem(o)informatics functionality in the package as well.

The main story of the paper is that Yadav et al. requested a package like GGL, which is now available.

Most of it seems quite logical/intuitive based on what it tries to do.

The software translates to/from SMILES at the beginning and end, but everything is done using graphs.
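As a rough illustration of what a graph rewrite rule does (a generic sketch, not GGL's actual C++ API or rule format), a molecule can be represented as a set of bonds, and a rule as a pair of bond sets to delete and insert; the rule fires only if its left-hand side is present:

```python
def apply_rule(bonds, remove, add):
    """Fire a rewrite rule on a molecule given as a set of bonds:
    delete the bonds in `remove`, insert those in `add`.
    Returns None when the rule's left-hand side does not match."""
    if not remove <= bonds:
        return None
    return (bonds - remove) | add

# Toy molecule and rule: a hydrogen migrating from atom C1 to atom O.
molecule = {frozenset({"C1", "H"}), frozenset({"C1", "C2"}), frozenset({"C2", "O"})}
product = apply_rule(molecule,
                     remove={frozenset({"C1", "H"})},
                     add={frozenset({"O", "H"})})
print(sorted(tuple(sorted(b)) for b in product))
# [('C1', 'C2'), ('C2', 'O'), ('H', 'O')]
```

Real systems like GGL additionally handle subgraph matching (finding where the left-hand side occurs), atom/bond labels and chemical validity, which this sketch omits.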

For more information and guides on how to use the software, see: http://www.tbi.univie.ac.at/software/GGL/

## Mathematical Chemistry and ChemoInformatics part II

Jotun Hein and William Kurdahl will review two books: Leach and Gillet (2010) Introduction to Chemoinformatics, and chapters from Faulon’s Handbook of Chemoinformatics Algorithms (2010). They will additionally briefly summarize key papers on graph grammars, reaction prediction and related topics. The talk ends with an attempt to identify projects in chemoinformatics that would be worthwhile to work on.

The slides can be found here: http://tinyurl.com/j28y67s

Thursday, December 8th, 10 AM to noon, room LG, Department of Statistics.