- re-tested: the timing reported was global (full code, including malloc(), which for large data sets eats a noticeable number of ns)
- uploaded correct numbers, with timing covering strictly the fit itself (lxplus766, relatively low load)
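
For illustration only, a minimal sketch of the difference between the two measurements: a "global" timer that wraps allocation, data filling and the fit, and a second timer that wraps strictly the fit. This is not the code that was benchmarked. `fit_line`, `seconds_between` and `time_fit` are hypothetical names, and a simple least-squares straight-line fit stands in for the real fit. It assumes a POSIX system where `clock_gettime` with `CLOCK_MONOTONIC` is available.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the real fit: least-squares line y = a*x + b. */
static void fit_line(const double *x, const double *y, size_t n,
                     double *a, double *b)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double d = (double)n * sxx - sx * sx;
    *a = ((double)n * sxy - sx * sy) / d;
    *b = (sy - *a * sx) / (double)n;
}

/* Elapsed time between two timestamps, in seconds. */
static double seconds_between(struct timespec t0, struct timespec t1)
{
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

/* Runs the fit on n synthetic points (y = 2x + 1) and reports two timings:
 * fit_s   = strictly the fit,
 * total_s = everything, including malloc(), data filling and free().
 * Returns 0 on success, -1 if allocation fails. */
int time_fit(size_t n, double *a, double *b, double *fit_s, double *total_s)
{
    struct timespec g0, f0, f1, g1;

    clock_gettime(CLOCK_MONOTONIC, &g0);           /* global timer starts */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        x[i] = (double)i;
        y[i] = 2.0 * x[i] + 1.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &f0);           /* fit-only timer starts */
    fit_line(x, y, n, a, b);
    clock_gettime(CLOCK_MONOTONIC, &f1);           /* fit-only timer stops */

    free(x);
    free(y);
    clock_gettime(CLOCK_MONOTONIC, &g1);           /* global timer stops */

    *fit_s   = seconds_between(f0, f1);
    *total_s = seconds_between(g0, g1);
    return 0;
}
```

Calling `time_fit` with a large `n` and comparing `*total_s` with `*fit_s` shows how much of the global number comes from allocation and setup rather than from the fit. Repeating the call several times on a lightly loaded machine helps reduce noise in the fit-only figure.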