From 2ed2568fd8819635ad960abc06df25d121f00460 Mon Sep 17 00:00:00 2001
From: David Dotson
Date: Mon, 27 Feb 2017 14:49:32 -0700
Subject: [PATCH 1/5] Added new topology system post. Included some of what @richardjgowers wrote on the wiki way back when.
---
 _posts/2017-03-06-new-topology-system.md | 102 +++++++++++++++++++++++
 1 file changed, 102 insertions(+)
 create mode 100644 _posts/2017-03-06-new-topology-system.md

diff --git a/_posts/2017-03-06-new-topology-system.md b/_posts/2017-03-06-new-topology-system.md
new file mode 100644
index 00000000..0459ec46
--- /dev/null
+++ b/_posts/2017-03-06-new-topology-system.md
@@ -0,0 +1,102 @@
+---
+layout: post
+title: A shiny, new topology system
+---
+
+With MDAnalysis 0.16.0 on the horizon, we wanted to showcase a major development that most users will probably not notice if we've done our job well.
+In fall 2015, @richardjgowers and I set to work on redesigning the topology system from scratch.
+This system determines how atom, residue, and segment information is internally represented and exposed to everything in the API (``Universe``, ``AtomGroup``, etc.), and the old scheme had issues with data duplication, maintaining consistency between atom and residue attributes, and performance for large systems.
+We hoped to resolve all of these issues with our new design.
+
+The starting point of this work was (the now infamous) [issue 363](https://github.com/MDAnalysis/mdanalysis/issues/363), which floated the idea of holding all atom, residue, and segment attributes in arrays instead of lists of ``Atom``, ``Residue``, and ``Segment`` objects.
+This approach turned the way topology data such as atom names, resids, masses, etc. are stored in a ``Universe`` on its head, going from an array of structs (list of ``Atom`` objects with individual attributes) to a struct of arrays (an array for each attribute, one entry per ``Atom``).
+
+Now, over a year later, the finishing touches on this work are being prepared for release.
+This post is meant to serve as a brief overview of what has changed internally, what has changed externally, and what benefits this gives us looking forward to the future.
+
+## Internal changes that shouldn't affect external behavior
+
+In the new system, each atom is a member of exactly one residue, and each residue is a member of exactly one segment.
+The new `Topology` object keeps an array giving the residue membership of each atom, and likewise an array giving segment membership of each residue.
+Getting the resname of the residue of a group of atoms, then, is achieved by taking the indices of these atoms to fancy-index the `Atoms->Residues` array, and then using the result of this to fancy-index the `Resnames` array.
+For example, if the `Topology` has 5 atoms and 3 residues, with membership (`Atoms->Residues`) and `Resnames` arrays as below:
+
+```
+          Atoms->Residues          Resnames
+  index   ---------------  index   --------
+    0            0           0       GLU
+    1            2           1       LYS
+    2            1           2       ALA
+    3            1
+    4            2
+```
+
+calling `AtomGroup.resnames` for an `AtomGroup` with atoms [2, 0, 1, 2] will yield (pseudocode):
+
+```
+"Atoms->Residues"[[2, 0, 1, 2]] --> [1, 0, 2, 1]
+"Resnames"[[1, 0, 2, 1]] --> ['LYS', 'GLU', 'ALA', 'LYS']
+```
+
+This scheme only works if each atom is a member of one and only one residue, and likewise if residues are members of one and only one segment.
+Furthermore, `AtomGroup`s, `ResidueGroup`s, and `SegmentGroup`s are very thin, storing only the indices of their members as a `numpy` array.
+This gives a number of advantages:
+
+1. **Performance**. We get up to an 8x speedup over the old scheme when accessing attributes. Setting attributes can give up to a 40x speedup.
+2. **Memory**. We don't store, for example, a resname for each atom, but instead store attributes at the level they make sense for.
+3. **Consistency**. Since attributes are stored in one place, we avoid cases where the topology is in an inconsistent state, e.g. two atoms in the same residue give a different resname.
+4. **No staleness**. Because e.g. `ResidueGroup`s are only an array of indices, not a list of `Residue` objects generated upon creation of the group, changes of residue-level properties by another `ResidueGroup` are always reflected consistently by every other one. Data is not duplicated anywhere in this scheme, and is all contained in the `Topology` object.
+
+For further performance comparisons, check out this [notebook](http://nbviewer.jupyter.org/gist/dotsdl/0e0fbd409e3e102d0458).
+
+## External changes that may affect how you use MDAnalysis
+
+Previously, every object except ``Atom`` subclassed from ``AtomGroup``.
+This meant that calling `.positions` on any of them would give you the positions of the ``Atom``s contained within that group.
+
+Previous class structure:
+```
+Atom
+
+AtomGroup -> Residue
+          -> ResidueGroup -> Segment
+                          -> SegmentGroup
+```
+New class structure:
+```
+Group -> AtomGroup
+      -> ResidueGroup
+      -> SegmentGroup
+
+Atom
+Residue
+Segment
+```
+
+Now each object only contains information pertaining to that particular object.
+A ``Residue`` object only yields information about the residue; to get to the atoms, use ``Residue.atoms``.
+
+### Why this was changed
+
+Previously everything inheriting from ``AtomGroup`` made it unclear at what level of topology a given method or attribute was working on.
+For example, does ``ResidueGroup.charges`` give the charge of the residues or the atoms?
+Also, it was unclear what size a given output would be (see [issue 411](https://github.com/MDAnalysis/mdanalysis/issues/411)).
+
+### How to work around this
+
+To access atom-level information from anything that isn't an ``AtomGroup``, use the `.atoms` level accessor.
+For example, changing all `.positions` calls on anything that isn't an `AtomGroup` to `.atoms.positions`.
+
+
+## Going forward: what does this mean for MDAnalysis as a project?
+
+A major benefit of the new topology system is that information about the topology of a ``Universe`` is now completely encapsulated in the ``Topology`` object.
+This not only makes development and maintenance easier, but also opens the door to some exciting new possibilities as simulation systems grow larger.
+A single ``Topology`` object can now be cleanly shared by multiple ``Universe`` instances, each with their own trajectory reader(s), making common operations such as fitting a trajectory to a reference structure or doing parallel analysis of many trajectories more feasible for large systems.
+The ``Topology`` object can also be serialized more easily, making parallelization on workers without shared memory (using libraries such as [``distributed``](http://distributed.readthedocs.io/en/latest/)) within the realm of possibility out-of-the-box.
+
+Making these things work is an ongoing effort, but the MDAnalysis [coredevs](https://github.com/orgs/MDAnalysis/teams/coredevs) are working to take advantage of all these possibilities.
+We look forward to the benefits this brings not only to the project, but also to all our users going forward.
+We hope you like what we've done here.
+
+-- @dotsdl

From 36995a017f20714ced8278d4a1af4faaf60feb68 Mon Sep 17 00:00:00 2001
From: Richard Gowers
Date: Wed, 1 Mar 2017 11:16:59 +0000
Subject: [PATCH 2/5] Update 2017-03-06-new-topology-system.md

---
 _posts/2017-03-06-new-topology-system.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/_posts/2017-03-06-new-topology-system.md b/_posts/2017-03-06-new-topology-system.md
index 0459ec46..81dfb2ff 100644
--- a/_posts/2017-03-06-new-topology-system.md
+++ b/_posts/2017-03-06-new-topology-system.md
@@ -4,7 +4,7 @@ title: A shiny, new topology system
 ---
 
 With MDAnalysis 0.16.0 on the horizon, we wanted to showcase a major development that most users will probably not notice if we've done our job well.
-In fall 2015, @richardjgowers and I set to work on redesigning the topology system from scratch.
+In fall 2015, we (@richardjgowers and @dotsdl) set to work on redesigning the topology system from scratch.
 This system determines how atom, residue, and segment information is internally represented and exposed to everything in the API (``Universe``, ``AtomGroup``, etc.), and the old scheme had issues with data duplication, maintaining consistency between atom and residue attributes, and performance for large systems.
 We hoped to resolve all of these issues with our new design.
 
@@ -99,4 +99,4 @@ Making these things work is an ongoing effort, but the MDAnalysis [coredevs](htt
 We look forward to the benefits this brings not only to the project, but also to all our users going forward.
 We hope you like what we've done here.
 
--- @dotsdl
+-- @dotsdl @richardjgowers

From be74d779e2c3c24cc02e8f6194f2b0b6a30644e9 Mon Sep 17 00:00:00 2001
From: Max Linke
Date: Sun, 2 Apr 2017 12:57:05 +0200
Subject: [PATCH 3/5] address comments

---
 _posts/2017-03-06-new-topology-system.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/_posts/2017-03-06-new-topology-system.md b/_posts/2017-03-06-new-topology-system.md
index 81dfb2ff..0e251a87 100644
--- a/_posts/2017-03-06-new-topology-system.md
+++ b/_posts/2017-03-06-new-topology-system.md
@@ -1,9 +1,9 @@
 ---
 layout: post
-title: A shiny, new topology system
+title: A shiny, new and faster topology system
 ---
 
-With MDAnalysis 0.16.0 on the horizon, we wanted to showcase a major development that most users will probably not notice if we've done our job well.
+With MDAnalysis 0.16.0 on the horizon, we wanted to showcase a major development.
 In fall 2015, we (@richardjgowers and @dotsdl) set to work on redesigning the topology system from scratch.
 This system determines how atom, residue, and segment information is internally represented and exposed to everything in the API (``Universe``, ``AtomGroup``, etc.), and the old scheme had issues with data duplication, maintaining consistency between atom and residue attributes, and performance for large systems.
 We hoped to resolve all of these issues with our new design.
@@ -14,11 +14,12 @@ This approach turned the way topology data such as atom names, resids, masses, e
 Now, over a year later, the finishing touches on this work are being prepared for release.
 This post is meant to serve as a brief overview of what has changed internally, what has changed externally, and what benefits this gives us looking forward to the future.
 
-## Internal changes that shouldn't affect external behavior
+## Invisible changes to make working with MDAnalysis *faster*
 
+Most of the changes are (or should be) invisible to the user. But they made some of the most fundamental operations in MDAnalysis quite a bit *faster*. Although this section is mostly of interest to developers, it is useful for all users to know the operations that MDAnalysis can now do much faster than before (and why).
 In the new system, each atom is a member of exactly one residue, and each residue is a member of exactly one segment.
 The new `Topology` object keeps an array giving the residue membership of each atom, and likewise an array giving segment membership of each residue.
-Getting the resname of the residue of a group of atoms, then, is achieved by taking the indices of these atoms to fancy-index the `Atoms->Residues` array, and then using the result of this to fancy-index the `Resnames` array.
+Getting the resname of the residue of a group of atoms, then, is achieved by taking the indices of these atoms to [fancy-index](https://docs.scipy.org/doc/numpy/user/basics.indexing.html#index-arrays) the `Atoms->Residues` array, and then using the result of this to fancy-index the `Resnames` array.
 For example, if the `Topology` has 5 atoms and 3 residues, with membership (`Atoms->Residues`) and `Resnames` arrays as below:
 
 ```
@@ -46,6 +47,7 @@ This gives a number of advantages:
 2. **Memory**. We don't store, for example, a resname for each atom, but instead store attributes at the level they make sense for.
 3. **Consistency**. Since attributes are stored in one place, we avoid cases where the topology is in an inconsistent state, e.g. two atoms in the same residue give a different resname.
 4. **No staleness**. Because e.g. `ResidueGroup`s are only an array of indices, not a list of `Residue` objects generated upon creation of the group, changes of residue-level properties by another `ResidueGroup` are always reflected consistently by every other one. Data is not duplicated anywhere in this scheme, and is all contained in the `Topology` object.
+5. **Serialization**. Topologies become serializable and changes to topologies can be easily saved and communicated around. This is an important step towards implementing parallel algorithms in MDAnalysis.
 
 For further performance comparisons, check out this [notebook](http://nbviewer.jupyter.org/gist/dotsdl/0e0fbd409e3e102d0458).
 
@@ -75,6 +77,9 @@ Segment
 
 Now each object only contains information pertaining to that particular object.
 A ``Residue`` object only yields information about the residue; to get to the atoms, use ``Residue.atoms``.
+Similarly, to get the atoms from a Segment or a SegmentGroup use Segment.atoms or SegmentGroup.atoms.
+As before, you can get all residues associated with a group with Group.residues (which returns a ResidueGroup) and all segments with Group.segments (a SegmentGroup).
+Bottom line: you should now always be explicit about what you want.
 
 ### Why this was changed
 
@@ -82,12 +87,11 @@ Previously everything inheriting from ``AtomGroup`` made it unclear at what leve
 For example, does ``ResidueGroup.charges`` give the charge of the residues or the atoms?
 Also, it was unclear what size a given output would be (see [issue 411](https://github.com/MDAnalysis/mdanalysis/issues/411)).
 
-### How to work around this
+### How to work with the new system
 
 To access atom-level information from anything that isn't an ``AtomGroup``, use the `.atoms` level accessor.
 For example, changing all `.positions` calls on anything that isn't an `AtomGroup` to `.atoms.positions`.
 
-
 ## Going forward: what does this mean for MDAnalysis as a project?
 
 A major benefit of the new topology system is that information about the topology of a ``Universe`` is now completely encapsulated in the ``Topology`` object.

From aab758782cff635976eaecd14ac38bc15a76956a Mon Sep 17 00:00:00 2001
From: Oliver Beckstein
Date: Sun, 2 Apr 2017 22:48:51 -0700
Subject: [PATCH 4/5] topology posts: minor edits

---
 _posts/2017-03-06-new-topology-system.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/_posts/2017-03-06-new-topology-system.md b/_posts/2017-03-06-new-topology-system.md
index 0e251a87..6cd2c7b0 100644
--- a/_posts/2017-03-06-new-topology-system.md
+++ b/_posts/2017-03-06-new-topology-system.md
@@ -77,9 +77,9 @@ Segment
 
 Now each object only contains information pertaining to that particular object.
 A ``Residue`` object only yields information about the residue; to get to the atoms, use ``Residue.atoms``.
-Similarly, to get the atoms from a Segment or a SegmentGroup use Segment.atoms or SegmentGroup.atoms.
-As before, you can get all residues associated with a group with Group.residues (which returns a ResidueGroup) and all segments with Group.segments (a SegmentGroup).
-Bottom line: you should now always be explicit about what you want.
+Similarly, to get the atoms from a ``Segment`` or a ``SegmentGroup`` use ``Segment.atoms`` or ``SegmentGroup.atoms``.
+As before, you can get all residues associated with a group with ``Group.residues`` (which returns a ``ResidueGroup``) and all segments with ``Group.segments`` (a ``SegmentGroup``).
+Bottom line: you should now *always be explicit about what you want*.
 
 ### Why this was changed
 
@@ -96,11 +96,13 @@ For example, changing all `.positions` calls on anything that isn't an `AtomGrou
 
 A major benefit of the new topology system is that information about the topology of a ``Universe`` is now completely encapsulated in the ``Topology`` object.
 This not only makes development and maintenance easier, but also opens the door to some exciting new possibilities as simulation systems grow larger.
-A single ``Topology`` object can now be cleanly shared by multiple ``Universe`` instances, each with their own trajectory reader(s), making common operations such as fitting a trajectory to a reference structure or doing parallel analysis of many trajectories more feasible for large systems.
-The ``Topology`` object can also be serialized more easily, making parallelization on workers without shared memory (using libraries such as [``distributed``](http://distributed.readthedocs.io/en/latest/)) within the realm of possibility out-of-the-box.
+A single ``Topology`` object can now be cleanly shared by multiple ``Universe`` instances, each with their own trajectory reader(s).
+This could make common operations such as fitting a trajectory to a reference structure or doing parallel analysis of many trajectories more efficient for large systems.
+The ``Topology`` object can also be serialized more easily.
+This should enable parallelization on workers without shared memory (using libraries such as [``distributed``](http://distributed.readthedocs.io/en/latest/)) out-of-the-box.
 
 Making these things work is an ongoing effort, but the MDAnalysis [coredevs](https://github.com/orgs/MDAnalysis/teams/coredevs) are working to take advantage of all these possibilities.
 We look forward to the benefits this brings not only to the project, but also to all our users going forward.
 We hope you like what we've done here.
 
--- @dotsdl @richardjgowers
+-- @dotsdl and @richardjgowers

From e3881953a95b68ae0d6a0265996b1ba743da94a1 Mon Sep 17 00:00:00 2001
From: Oliver Beckstein
Date: Sun, 2 Apr 2017 22:50:30 -0700
Subject: [PATCH 5/5] topology post: changed date to tomorrow

---
 ...6-new-topology-system.md => 2017-04-03-new-topology-system.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename _posts/{2017-03-06-new-topology-system.md => 2017-04-03-new-topology-system.md} (100%)

diff --git a/_posts/2017-03-06-new-topology-system.md b/_posts/2017-04-03-new-topology-system.md
similarity index 100%
rename from _posts/2017-03-06-new-topology-system.md
rename to _posts/2017-04-03-new-topology-system.md
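
Reviewer note on the post's `Atoms->Residues` / `Resnames` example: the two-step index lookup it describes can be sketched as below. This is only an illustration under stated assumptions, not the MDAnalysis implementation. Plain Python lists stand in for the `numpy` arrays the post mentions, and the names `atom_to_residue`, `resnames_for_atoms`, and `set_resname` are hypothetical.

```python
# Hypothetical stand-ins for the arrays held by the Topology in the post;
# these names are illustrative and are not part of the MDAnalysis API.
atom_to_residue = [0, 2, 1, 1, 2]    # "Atoms->Residues": residue index of each atom
resnames = ["GLU", "LYS", "ALA"]     # "Resnames": one entry per residue


def resnames_for_atoms(atom_indices):
    """Return the resname of the residue that each listed atom belongs to."""
    # Step 1: atom indices -> residue indices
    residue_indices = [atom_to_residue[i] for i in atom_indices]
    # Step 2: residue indices -> residue names
    return [resnames[r] for r in residue_indices]


def set_resname(residue_index, new_name):
    """Rename a residue in the single place where its name is stored."""
    resnames[residue_index] = new_name


group = [2, 0, 1, 2]                 # a "thin" group stores only member indices
result = resnames_for_atoms(group)
print(result)
```

Because a group holds only indices, calling `set_resname(1, "ARG")` and then `resnames_for_atoms(group)` again would pick up the new name for atom 2 without rebuilding the group, which is the "no staleness" property the post lists.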