Skip to content

Fixing Issue#550#580

Merged
richardjgowers merged 13 commits into
developfrom
gro-fix-truncation
Jan 4, 2016
Merged

Fixing Issue#550#580
richardjgowers merged 13 commits into
developfrom
gro-fix-truncation

Conversation

@tylerjereddy

Copy link
Copy Markdown
Member

Attempted fix for Issue #550. Although this now enables us to use the GROWriter to write usable .gro files with > 99,999 atoms, there are some caveats here:

  • This solution produces an atom number of 0 after 99,999 rather than 00000. The latter would be the left-truncation result. Why did I do this? Well, the new unit test .gro data file, with > 100,000 atoms and generated by genbox 4.6.3, also does this!
  • It turns out that not all GROMACS utilities behave in this manner. For example, it appears that trjconv keeps the extra zeros.
  • Overall, my suggestion is that this PR fixes the major issue (you can't work with MDA-written .gro files for systems with >99,999 particles), even if there is some room for debate about the correct approach here. Both VMD and MDA can read the newly-written .gro files back in. I have to concede that most of my large systems stored in .gro format are actually truncated to keep all the extra zeros / digits, probably because I've used trjconv on them at some point.

The reason for this behaviour is because:

int(str(00000)) #where 00000 is the truncated form of 100000

returns 0

Of course, I'm sure we could find a way to do the truncation with digit-retention, but I ultimately aimed for a fix that matched the new unit test file, which I had generated in the manner Chris Ing suggested.

sseyler and others added 11 commits December 9, 2015 16:26
Tests for PSA now use tempdir

Fixes Issue #557
…re than 99,999 atoms, in a manner consistent with genbox in GROMACS 4.6.3. Note that it appears that some GROMACS utilities (i.e., trjconv) actually write 00000 instead of 0 for the 100,000th atom. Both formats can be read in by MDAnalysis and VMD, and this is (overall) an improvement that allows us to write large .gro files in a workable manner. Fixes #550.
@richardjgowers

Copy link
Copy Markdown
Member

bigbox.gro is 5.7M. Is it possible to test the written values just by calling readlines and then checking that line[100000] is a certain string?

@richardjgowers richardjgowers self-assigned this Dec 11, 2015
@tylerjereddy

Copy link
Copy Markdown
Member Author

@richardjgowers I thought we discussed that. It might be possible to come up with something more elegant by subclassing GROWriter and modifying it such that we can write > 100,000 particle data with much smaller input. However, there are bigger data files already present for other tests, and is the extra effort to engineer this behaviour really worth saving the space?

Maybe it is! Note that subclassing for test purposes does add a layer of complexity if the GROWriter changes substantially in the future, since we'd be bending it to do something it doesn't normally do for good reason.

@richardjgowers

Copy link
Copy Markdown
Member

I don't mind creating the large gro file, it's the large reference file that's the problem. The only section of it we're using is this:

33333SOL    HW199998   7.632   5.963   2.548
33333SOL    HW299999   7.702   5.897   2.680
33334SOL     OW    0   8.642   7.198   1.893
33334SOL    HW1    1   8.648   7.105   1.930

So we could change expected_lines to be those 4 lines, then check a slice of mda_output against that.

@tylerjereddy

Copy link
Copy Markdown
Member Author

And how are you planning to create a large gro file with those lines if you don't have a large reference file with coordinates to read in? Where is the big universe, that will be used for writing, coming from? This is the same discussion from the issue itself. Either we include a large gro file or we subclass the writer and do some magic so that we can use smaller input or topology manipulation (from a blank universe or whatever) to fake reading in a large coordinate file.

@orbeckst

Copy link
Copy Markdown
Member

For a start, try bzip2ing the bigbox.gro file. The GRO reader uses openany() so it should be happy to read it. It will probably compress down to 1/10 of the original size. Not elegant but possibly workable.

…which is involved in testing resolution of Issue#550.
@tylerjereddy

Copy link
Copy Markdown
Member Author

Ok, compressing the file has reduced its size to 1.1M, and I've pushed the adjustments needed to account for the compressed file as well. The 'objections' from QuantifiedCode aren't even confined to things I've changed or added, and those that are involve policies that are not unique to my changes (i.e., they are policies / practices used in other parts of the MDA codebase, though that doesn't necessarily make them 'right'). Should PRs that pass all tests but face objections from QuantifiedCode really be labelled with a red 'x?'

Also, would it be of interest to add a flag that allows the user to specify if they want digit-retained truncation (i.e., 00000) vs. what we currently have (i.e., 0) for atom numbers when writing gro files? If yes, then I'd probably suggest we should use another bz2-compressed file with that adjusted numbering for unit testing. However, one drawback here is that you might only be able to specify the flag when calling GROWriter directly rather than inferring the writer form the filename as in u.atoms.write("filename.gro"). Also, extending the writing functionality in this way would be more of an enhancement than a resolution of the botched (unworkable) .gro files. The latter is the main target for resolution with this particular issue / PR.

@mnmelo

mnmelo commented Dec 13, 2015

Copy link
Copy Markdown
Member

@orbeckst expressed the same discontent at QC output in #563. I'm opening an issue so we can discuss what the output should be.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do both universe and large_universe need to be created for each test of that class while none of the tests use both? Could large_universe be created only in the relevant test?

@tylerjereddy

Copy link
Copy Markdown
Member Author

@jbarnoud If the tests in the class don't really share the same requirements for the setUp and tearDown methods, that is probably an indication that refactoring into separate test classes is needed, until tests do share the same setup/ teardown requirements. Although my contributions in this PR may highlight the increasing heterogeneity in this test class, and the need to split into more classes, I'd suggest that if we are going to get more stringent with this kind of thing (I'm certain that not all MDAnalysis test classes are so clean that all of their tests share the same setUp and tearDown method requirements), then that should be a separate issue for overhauling test infrastructure / policy.

As for the other comment, I actually thought that using the same temporary output directory structure (and defining it in the setUp) made it nice and clear for deletion / cleanup of this test data, so that the tests can focus on the actual testing code.

As an aside, my PR adds a third statement of the following form to the tearDown method:

try:
 os.unlink(self.outfile3)
except OSError:
 pass

Doing the above operation in three separate try/except loops for three different temporary files in the same tearDown method just doesn't feel right! But it was already done twice for the first two files, so when I added a third (large) file, I just continued the trend. However, even in this case, I think that is a separate design issue that should be raised elsewhere, rather than turning this into a complete refactoring of the GRO writing test infrastructure.

@kain88-de

Copy link
Copy Markdown
Member

@tylerjereddy I haven't read the tests right now. But the os.unlink(...) shouldn't be necessary anymore if you switched to use tempdir. The teardown method should only delete the temporary directory created by tempdir. In other tests I used a pattern like this.

def setup():
    self.tempdir = tempdir.tempdir()

def tearDown():
    del self.tempdir

def test_1():
    outfile = self.tempdir + '/name of file'

Personally even better for me works

class test_xyz:
    def __init__(self):
          self.tempdir = tempdir.tempdir()

   def test_1(self):
         outfile = self.tempdir + '/test-1.txt'

This way all the tests in this class work in one temp directory which gets cleaned once you are done.
Just make sure that every test in the class uses a unique filename and you don't need
to care about file deletion unlinking at all anymore.

@kain88-de

Copy link
Copy Markdown
Member

You can check the second example here

@jbarnoud

Copy link
Copy Markdown
Contributor

I agree these tests would benefit from some cleaning that is out of scope of this PR. Yet, maybe the test introduced here could be move in a new class so a large universe does not get created for each test of the TestGROWriter class. Tests are long enough to run already.

Also, a docstring with a reference to #550 would be nice as it would give the test some context. But this is just nitpicking.

@tylerjereddy

Copy link
Copy Markdown
Member Author

I've moved the test for #550 to its own class and made an effort to comply with the suggestions in this PR thread. The original TestGROWriter class should now be unchanged, and any issues with that should be raised elsewhere.

@jbarnoud

Copy link
Copy Markdown
Contributor

👍

@tylerjereddy

Copy link
Copy Markdown
Member Author

Any other suggestions / problems with this PR?

richardjgowers added a commit that referenced this pull request Jan 4, 2016
@richardjgowers richardjgowers merged commit 0371edc into develop Jan 4, 2016
@richardjgowers richardjgowers deleted the gro-fix-truncation branch January 4, 2016 10:39
@richardjgowers

Copy link
Copy Markdown
Member

Nope, looks good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants