On Encoding Molecular Structures
Bin Shao
President of Zhongguancun Institute of Artificial Intelligence



Molecular Representations
Extracted Features
(Molecular fingerprints & descriptors)

Very lossy!
Bond-based Representation
(SMILES & Graph)
C1=CC=C(C=C1)C2=CC=CC=C2

Somewhat lossy
Geometric Representation
(3D structure & Point Cloud)
A little lossy

Electronic Structures


A property graph annotated only with the distances between neighboring nodes is not enough to recover the molecular geometry.
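A minimal sketch of why neighbor distances alone fall short, using assumed toy values: two three-atom chains A-B-C share the same bond lengths but differ in bond angle, so their non-bonded A-C distances differ. The helper name `chain` and the numbers are hypothetical.

```python
import math

def chain(bond_length, angle_deg):
    """Place three atoms A-B-C with equal bond lengths and a given A-B-C angle."""
    theta = math.radians(angle_deg)
    a = (bond_length, 0.0, 0.0)
    b = (0.0, 0.0, 0.0)
    c = (bond_length * math.cos(theta), bond_length * math.sin(theta), 0.0)
    return [a, b, c]

bent = chain(1.0, 104.5)     # bent geometry (toy value)
linear = chain(1.0, 180.0)   # linear geometry (toy value)

# Distances between bonded neighbors: identical in both geometries.
neighbors_bent = (math.dist(bent[0], bent[1]), math.dist(bent[1], bent[2]))
neighbors_linear = (math.dist(linear[0], linear[1]), math.dist(linear[1], linear[2]))

# Distance between the two non-bonded end atoms: differs between geometries.
end_bent = math.dist(bent[0], bent[2])
end_linear = math.dist(linear[0], linear[2])

print(neighbors_bent, neighbors_linear)
print(end_bent, end_linear)
```

Both chains produce the same graph with the same edge lengths, so the graph cannot tell them apart.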

What makes a good representation for molecules
- Invariance: the encoding should be invariant under translations and rotations, which preserve all interatomic distances.
- Distinguishability: the encodings of two distinguishable structures should themselves be distinguishable.
- Recoverability: the molecular structure should be recoverable from its encoding.
- Smoothness: the encoding should not change dramatically when the structure changes slightly.
Cartesian Coordinates

Pairwise Distances
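The contrast between the two encodings can be sketched as follows: rotating and translating a structure changes its Cartesian coordinates but leaves every pairwise distance unchanged. The four toy atom positions and the helper names are assumptions for illustration only.

```python
import math
from itertools import combinations

def rotate_z(points, angle):
    """Rotate all points about the z-axis by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def translate(points, shift):
    """Shift all points by a constant vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in points]

def pairwise_distances(points):
    """Distances for all unordered atom pairs, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
moved = translate(rotate_z(atoms, 0.7), (5.0, -2.0, 1.5))

# Cartesian coordinates are not invariant under the rigid motion.
coords_changed = any(
    not math.isclose(a, b) for p, q in zip(atoms, moved) for a, b in zip(p, q)
)

# Pairwise distances are invariant under the rigid motion.
distances_kept = all(
    math.isclose(d1, d2, abs_tol=1e-12)
    for d1, d2 in zip(pairwise_distances(atoms), pairwise_distances(moved))
)

print(coords_changed, distances_kept)
```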

Do we differ from our images in the mirror?
Are two mirror molecules (chiral molecules) different?

Well, it really depends.
If we concern ourselves only with the intrinsic properties of a molecule (energies, interatomic forces, ...) and ignore its interactions with other molecules, two mirror-image molecules are essentially the same molecule.
But mirror molecules function differently in our 3D world
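A small sketch of the two viewpoints, with assumed toy coordinates: a structure and its mirror image have identical pairwise distances, yet a signed quantity such as the scalar triple product changes sign, which is one way to tell the two hands apart in 3D. The function names are hypothetical.

```python
import math
from itertools import combinations

def mirror_x(points):
    """Reflect all points through the plane x = 0."""
    return [(-x, y, z) for x, y, z in points]

def signed_volume(a, b, c, d):
    """Scalar triple product (b-a) . ((c-a) x (d-a)): six times the signed
    volume of the tetrahedron abcd."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))

def pairwise_distances(points):
    return [math.dist(p, q) for p, q in combinations(points, 2)]

atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
mirrored = mirror_x(atoms)

# Distance-based (intrinsic) view: the two structures look identical.
same_distances = all(
    math.isclose(d1, d2, abs_tol=1e-12)
    for d1, d2 in zip(pairwise_distances(atoms), pairwise_distances(mirrored))
)

# Orientation-aware view: the sign flips under reflection.
volume_original = signed_volume(*atoms)
volume_mirrored = signed_volume(*mirrored)

print(same_distances, volume_original, volume_mirrored)
```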
Distinctiveness of Handedness Collapses in 4D Space


Canonicalized Coordinate System

Mirror Molecules in the Canonicalized Coordinate System

Canonicalized Coordinates for the Mirror Molecules

Distance-preserving Transformation in 4D Space
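One way to see this, sketched under assumed toy coordinates: embed the 3D points in 4D with a zero fourth coordinate, then rotate by 180 degrees in the plane spanned by the x-axis and the fourth axis. The rotation is continuous and distance-preserving, and it lands exactly on the 3D mirror image. The name `rotate_xw` is hypothetical.

```python
import math

def rotate_xw(point4, angle):
    """Rotate a 4D point in the plane spanned by the x-axis and the w-axis."""
    x, y, z, w = point4
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * w, y, z, s * x + c * w)

atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]

# Embed the 3D structure in 4D with w = 0.
embedded = [(x, y, z, 0.0) for x, y, z in atoms]

# Halfway through, the structure has left the 3D subspace (w != 0 for some atom).
halfway = [rotate_xw(p, math.pi / 2) for p in embedded]

# After a half turn, it is back in the w = 0 subspace, but mirrored in x.
rotated = [rotate_xw(p, math.pi) for p in embedded]
mirrored = [(-x, y, z, 0.0) for x, y, z in atoms]

matches_mirror = all(
    math.isclose(a, b, abs_tol=1e-12)
    for p, q in zip(rotated, mirrored) for a, b in zip(p, q)
)

# Distances are preserved at the intermediate angle as well.
d_before = math.dist(embedded[1], embedded[2])
d_halfway = math.dist(halfway[1], halfway[2])

print(matches_mirror, d_before, d_halfway)
```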




Intrinsic Plane

Trilateration
Intrinsic Plane and its Normal Vector

Trilateration with Four Spheres
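A minimal sketch of trilateration with four spheres, assuming four non-coplanar centers with known positions. Subtracting the first sphere equation from the other three gives a 3x3 linear system for the unknown point. With only three spheres the intersection generally contains two points that mirror each other across the plane through the three centers, and a fourth center off that plane removes the ambiguity. The function names and the toy centers are hypothetical, and this is not necessarily the procedure used in the talk.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def trilaterate(centers, radii):
    """Recover a 3D point from its distances to four non-coplanar centers."""
    c0, r0 = centers[0], radii[0]
    sq0 = sum(v * v for v in c0)
    rows, rhs = [], []
    for ci, ri in zip(centers[1:], radii[1:]):
        # From |p - ci|^2 = ri^2 minus |p - c0|^2 = r0^2:
        # 2 (ci - c0) . p = |ci|^2 - |c0|^2 - ri^2 + r0^2
        rows.append([2.0 * (ci[k] - c0[k]) for k in range(3)])
        rhs.append(sum(v * v for v in ci) - sq0 - ri * ri + r0 * r0)
    d = det3(rows)
    if abs(d) < 1e-12:
        raise ValueError("centers are coplanar; the point is not uniquely determined")
    solution = []
    for k in range(3):
        # Cramer's rule: replace column k with the right-hand side.
        m = [row[:] for row in rows]
        for i in range(3):
            m[i][k] = rhs[i]
        solution.append(det3(m) / d)
    return tuple(solution)

centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
target = (0.3, -0.4, 0.5)
radii = [math.dist(target, c) for c in centers]

recovered = trilaterate(centers, radii)
print(recovered)
```

The encoding here is just the four distances in `radii`; the point is recovered from them without reference to any external coordinate frame beyond the centers themselves.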
Encoding with Four Intrinsic Centers



Are there other ways to encode a molecular structure?
Epicycles
Electronic Structures

