On Encoding Molecular Structures

 

Bin Shao
President of Zhongguancun Institute of Artificial Intelligence

Data Model Task

Molecular Representations

Extracted Features
(Molecular fingerprints & descriptors)

Very lossy!

Bond-based Representation
(SMILES & Graph)
C1=CC=C(C=C1)C2=CC=CC=C2
Mol

Somewhat lossy

Geometric Representation
(3D structure & Point Cloud)

A little lossy

Electronic Structures

What makes a good representation for molecules?

  • Invariance: the encoding should be invariant under translations and rotations, which preserve all interatomic distances.
  • Distinguishability: two distinguishable structures should have distinguishable encodings.
  • Recoverability: the molecular structure should be recoverable from its encoding.
  • Smoothness: the encoding should not change dramatically when the structure changes slightly.

Cartesian Coordinates

Pairwise Distances
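
A minimal sketch of why pairwise distances are attractive compared with raw Cartesian coordinates: a rotation followed by a translation changes every coordinate but leaves every interatomic distance untouched. The four-atom geometry and the helper names (`pairwise_distances`, `rotate_z`, `translate`) are made up for illustration and are not taken from the slides.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Return the distances for all atom pairs (i < j), in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, shift):
    """Shift a 3D point by the vector `shift`."""
    return tuple(p + d for p, d in zip(point, shift))

# Hypothetical four-atom geometry (arbitrary units), for illustration only.
coords = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (-0.4, 1.0, 0.0), (-0.4, -0.5, 0.9)]

# Apply a rigid motion: rotate about z, then translate.
moved = [translate(rotate_z(p, 0.7), (2.0, -1.0, 3.5)) for p in coords]

before = pairwise_distances(coords)
after = pairwise_distances(moved)

# The Cartesian coordinates differ, but the distances agree.
coordinates_changed = any(
    not math.isclose(a, b, abs_tol=1e-9)
    for p, q in zip(coords, moved)
    for a, b in zip(p, q)
)
distances_unchanged = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after)
)
print(coordinates_changed, distances_unchanged)
```

Cartesian coordinates fail the invariance criterion unless they are first put into a canonical frame, whereas the distance list satisfies it by construction.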

Well, it really depends.

If we are concerned only with the intrinsic properties of a molecule (energies, interatomic forces, ...) and not with its interactions with other molecules, a chiral molecule and its mirror image are essentially the same molecule.
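
The point above can be checked numerically. In this sketch, a reflection leaves every interatomic distance unchanged, so any quantity that depends only on distances is identical for the two mirror images, while a signed volume (a scalar triple product) flips sign and so can tell them apart. The geometry and the helper names (`signed_volume`, `mirror_z`) are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Return the distances for all atom pairs (i < j), in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def signed_volume(a, b, c, d):
    """Scalar triple product (b-a) . ((c-a) x (d-a)).

    This equals six times the signed volume of the tetrahedron a, b, c, d.
    """
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    cross = (
        v[1] * w[2] - v[2] * w[1],
        v[2] * w[0] - v[0] * w[2],
        v[0] * w[1] - v[1] * w[0],
    )
    return sum(u[i] * cross[i] for i in range(3))

def mirror_z(point):
    """Reflect a 3D point through the xy-plane."""
    x, y, z = point
    return (x, y, -z)

# Hypothetical four-atom geometry (arbitrary units), for illustration only.
coords = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (-0.4, 1.0, 0.0), (-0.4, -0.5, 0.9)]
mirrored = [mirror_z(p) for p in coords]

# Distances cannot distinguish the two mirror images ...
same_distances = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(pairwise_distances(coords), pairwise_distances(mirrored))
)

# ... but the signed volume changes sign under reflection.
vol = signed_volume(*coords)
vol_mirror = signed_volume(*mirrored)
print(same_distances, vol, vol_mirror)
```

An encoding built from distances alone is therefore blind to handedness; whether that is acceptable depends on the task.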

 

But,

Mirror molecules function differently in our 3D world

 

Distinctiveness of Handedness Collapses in 4D Space

Canonicalized Coordinate System

Canonicalized Coordinates for the Mirror Molecules

Distance-preserving Transformation in 4D Space
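
One way to see the collapse of handedness in 4D, sketched under the assumption that the molecule is embedded in 4D space with a fourth coordinate w = 0: a rotation by π in the (z, w) plane is a proper, distance-preserving rotation in 4D, yet it carries each point (x, y, z, 0) to (x, y, -z, 0), which is the 3D mirror image. The geometry and the function names (`embed`, `rotate_zw`) are illustrative and not necessarily the construction shown on the slide.

```python
import math

def embed(point3):
    """Embed a 3D point into 4D by appending w = 0."""
    x, y, z = point3
    return (x, y, z, 0.0)

def rotate_zw(point4, angle):
    """Rotate a 4D point by `angle` radians in the (z, w) plane."""
    x, y, z, w = point4
    c, s = math.cos(angle), math.sin(angle)
    return (x, y, c * z - s * w, s * z + c * w)

# Hypothetical four-atom geometry (arbitrary units), for illustration only.
coords = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (-0.4, 1.0, 0.0), (-0.4, -0.5, 0.9)]

# A half-turn in the (z, w) plane, applied to the embedded molecule.
rotated = [rotate_zw(embed(p), math.pi) for p in coords]

# The 3D mirror image through the xy-plane, embedded the same way.
mirrored = [(x, y, -z, 0.0) for x, y, z in coords]

matches_mirror = all(
    math.isclose(a, b, abs_tol=1e-12)
    for p, q in zip(rotated, mirrored)
    for a, b in zip(p, q)
)

# Distances are preserved at every intermediate angle as well.
halfway = [rotate_zw(embed(p), math.pi / 2) for p in coords]
distance_preserved = math.isclose(
    math.dist(halfway[0], halfway[3]), math.dist(coords[0], coords[3])
)
print(matches_mirror, distance_preserved)
```

At intermediate angles the atoms leave the w = 0 slice, which is why no such continuous rigid motion exists inside 3D space itself.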

Intrinsic Plane

Encoding with Four Intrinsic Centers


Are there other ways to encode a molecular structure?

Electronic Structures

Ground State (left) and Excited State (right) of Acetone

The End