Commits · aad6a5fa767529d3353bd3beb89e126c7b0868ca · Chen Yisong / benchmark

17 Nov, 2017 1 commit

authored Nov 17, 2017

Define BENCHMARK_OS_NETBSD for NetBSD.

Add detection of cpuinfo_cycles_per_second and cpuinfo_num_cpus.
This code shares detection of these properties with FreeBSD.

aad6a5fa

15 Nov, 2017 1 commit
- Add a pkg-config file, for the benefit of projects not using CMake. (#480) · 0c3ec998
  Steinar H. Gunderson authored Nov 15, 2017
  
  0c3ec998
13 Nov, 2017 1 commit
- Add doc specifying the scope of the timing calculation · ed5764ea
  Dominic Hamon authored Nov 13, 2017
```
Fixes #479
```
  ed5764ea
07 Nov, 2017 4 commits

[Tools] A new, more versatile benchmark output compare tool (#474) · 5e66248b

authored Nov 08, 2017

* [Tools] A new, more versatile benchmark output compare tool

Sometimes there is more than one implementation of some functionality,
and the obvious use case is to benchmark them: which one is better?

Currently, there is no easy way to compare the benchmarking results
in that case. The obvious solution is to have multiple binaries, each one
containing/running one implementation. Each binary must then use
exactly the same benchmark family name, which is super bad,
because now the binary name has to carry all the info about the
benchmark family...

What if I tell you that is not the solution?
What if we could avoid producing one binary per benchmark family,
with the same family name used in each binary,
but instead could keep all the related families in one binary,
with their proper names, AND still be able to compare them?

There are three modes of operation:
1. Compare two benchmarks, as `compare_bench.py` did:
```
$ ../tools/compare.py benchmarks ./a.out ./a.out
RUNNING: ./a.out --benchmark_out=/tmp/tmprBT5nW
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:16:44
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            36 ns         36 ns   19101577   211.669MB/s
BM_memcpy/64           76 ns         76 ns    9412571   800.199MB/s
BM_memcpy/512          84 ns         84 ns    8249070   5.64771GB/s
BM_memcpy/1024        116 ns        116 ns    6181763   8.19505GB/s
BM_memcpy/8192        643 ns        643 ns    1062855   11.8636GB/s
BM_copy/8             222 ns        222 ns    3137987   34.3772MB/s
BM_copy/64           1608 ns       1608 ns     432758   37.9501MB/s
BM_copy/512         12589 ns      12589 ns      54806   38.7867MB/s
BM_copy/1024        25169 ns      25169 ns      27713   38.8003MB/s
BM_copy/8192       201165 ns     201112 ns       3486   38.8466MB/s
RUNNING: ./a.out --benchmark_out=/tmp/tmpt1wwG_
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:16:53
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            36 ns         36 ns   19397903   211.255MB/s
BM_memcpy/64           73 ns         73 ns    9691174   839.635MB/s
BM_memcpy/512          85 ns         85 ns    8312329   5.60101GB/s
BM_memcpy/1024        118 ns        118 ns    6438774   8.11608GB/s
BM_memcpy/8192        656 ns        656 ns    1068644   11.6277GB/s
BM_copy/8             223 ns        223 ns    3146977   34.2338MB/s
BM_copy/64           1611 ns       1611 ns     435340   37.8751MB/s
BM_copy/512         12622 ns      12622 ns      54818   38.6844MB/s
BM_copy/1024        25257 ns      25239 ns      27779   38.6927MB/s
BM_copy/8192       205013 ns     205010 ns       3479    38.108MB/s
Comparing ./a.out to ./a.out
Benchmark                 Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------
BM_memcpy/8            +0.0020         +0.0020            36            36            36            36
BM_memcpy/64           -0.0468         -0.0470            76            73            76            73
BM_memcpy/512          +0.0081         +0.0083            84            85            84            85
BM_memcpy/1024         +0.0098         +0.0097           116           118           116           118
BM_memcpy/8192         +0.0200         +0.0203           643           656           643           656
BM_copy/8              +0.0046         +0.0042           222           223           222           223
BM_copy/64             +0.0020         +0.0020          1608          1611          1608          1611
BM_copy/512            +0.0027         +0.0026         12589         12622         12589         12622
BM_copy/1024           +0.0035         +0.0028         25169         25257         25169         25239
BM_copy/8192           +0.0191         +0.0194        201165        205013        201112        205010
```

2. Compare two different filters of one benchmark:
(for simplicity, the benchmark is executed twice)
```
$ ../tools/compare.py filters ./a.out BM_memcpy BM_copy
RUNNING: ./a.out --benchmark_filter=BM_memcpy --benchmark_out=/tmp/tmpBWKk0k
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:37:28
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            36 ns         36 ns   17891491   211.215MB/s
BM_memcpy/64           74 ns         74 ns    9400999   825.646MB/s
BM_memcpy/512          87 ns         87 ns    8027453   5.46126GB/s
BM_memcpy/1024        111 ns        111 ns    6116853    8.5648GB/s
BM_memcpy/8192        657 ns        656 ns    1064679   11.6247GB/s
RUNNING: ./a.out --benchmark_filter=BM_copy --benchmark_out=/tmp/tmpAvWcOM
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:37:33
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_copy/8           227 ns        227 ns    3038700   33.6264MB/s
BM_copy/64         1640 ns       1640 ns     426893   37.2154MB/s
BM_copy/512       12804 ns      12801 ns      55417   38.1444MB/s
BM_copy/1024      25409 ns      25407 ns      27516   38.4365MB/s
BM_copy/8192     202986 ns     202990 ns       3454   38.4871MB/s
Comparing BM_memcpy to BM_copy (from ./a.out)
Benchmark                               Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------
[BM_memcpy vs. BM_copy]/8            +5.2829         +5.2812            36           227            36           227
[BM_memcpy vs. BM_copy]/64          +21.1719        +21.1856            74          1640            74          1640
[BM_memcpy vs. BM_copy]/512        +145.6487       +145.6097            87         12804            87         12801
[BM_memcpy vs. BM_copy]/1024       +227.1860       +227.1776           111         25409           111         25407
[BM_memcpy vs. BM_copy]/8192       +308.1664       +308.2898           657        202986           656        202990
```

3. Compare filter one from benchmark one to filter two from benchmark two:
(for simplicity, the benchmark is executed twice)
```
$ ../tools/compare.py benchmarksfiltered ./a.out BM_memcpy ./a.out BM_copy
RUNNING: ./a.out --benchmark_filter=BM_memcpy --benchmark_out=/tmp/tmp_FvbYg
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:38:27
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            37 ns         37 ns   18953482   204.118MB/s
BM_memcpy/64           74 ns         74 ns    9206578   828.245MB/s
BM_memcpy/512          91 ns         91 ns    8086195   5.25476GB/s
BM_memcpy/1024        120 ns        120 ns    5804513   7.95662GB/s
BM_memcpy/8192        664 ns        664 ns    1028363   11.4948GB/s
RUNNING: ./a.out --benchmark_filter=BM_copy --benchmark_out=/tmp/tmpDfL5iE
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:38:32
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_copy/8           230 ns        230 ns    2985909   33.1161MB/s
BM_copy/64         1654 ns       1653 ns     419408   36.9137MB/s
BM_copy/512       13122 ns      13120 ns      53403   37.2156MB/s
BM_copy/1024      26679 ns      26666 ns      26575   36.6218MB/s
BM_copy/8192     215068 ns     215053 ns       3221   36.3283MB/s
Comparing BM_memcpy (from ./a.out) to BM_copy (from ./a.out)
Benchmark                               Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------
[BM_memcpy vs. BM_copy]/8            +5.1649         +5.1637            37           230            37           230
[BM_memcpy vs. BM_copy]/64          +21.4352        +21.4374            74          1654            74          1653
[BM_memcpy vs. BM_copy]/512        +143.6022       +143.5865            91         13122            91         13120
[BM_memcpy vs. BM_copy]/1024       +221.5903       +221.4790           120         26679           120         26666
[BM_memcpy vs. BM_copy]/8192       +322.9059       +323.0096           664        215068           664        215053
```

* [Docs] Document tools/compare.py

* [docs] Document how the change is calculated
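
The "Time" and "CPU" change columns in this log are consistent with a relative difference of the form (new - old) / |old|; for example, BM_2xFaster going from 50 to 25 is reported as -0.50. A minimal sketch of that calculation follows. The function name `RelativeChange` is hypothetical, and the handling of a zero baseline is an assumption made for illustration, not something stated in the commit message.

```cpp
#include <cmath>

// Relative change from old_value to new_value, as a fraction (not a percent).
// Negative means the new run was faster, positive means it was slower.
double RelativeChange(double old_value, double new_value) {
  if (old_value == 0.0 && new_value == 0.0) return 0.0;
  if (old_value == 0.0) {
    // Assumption: with a zero baseline, use the midpoint of the two values
    // as the reference so the result stays finite.
    return (new_value - old_value) / ((old_value + new_value) / 2.0);
  }
  return (new_value - old_value) / std::fabs(old_value);
}
```

Under this definition a benchmark that becomes 100x slower reports +99.0 and one that becomes 100x faster reports -0.99, which matches the BM_100xSlower and BM_100xFaster rows shown elsewhere in this log.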

5e66248b

Reorder inline to avoid warning on MSVC (#469) · 90aa8665
Dominic Hamon authored Nov 07, 2017
```
Fixes #467
```
90aa8665
Fix #476. Explicit coercion of size_t to boolean (#477) · f4009ef8
Dominic Hamon authored Nov 07, 2017

f4009ef8

Fix #382 - MinGW often reports negative CPU times. (#475) · 72a4581c

authored Nov 07, 2017

When stopping a timer, the start time is subtracted
from the current time. However, when the two times are identical,
or sufficiently close together, the subtraction can result
in a negative number.

For some reason MinGW is the only platform where this problem
manifests. I suspect it's due to MinGW specific behavior in either
the CPU timing code, floating point model, or printf formatting.

Either way, the fix for MinGW should be correct across all platforms.
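
The message does not show the patch itself. One plausible guard, sketched here under that caveat, is to clamp each measured interval at zero so that a tiny negative difference never reaches the accumulated total. The name `ElapsedSeconds` is hypothetical.

```cpp
#include <algorithm>

// Elapsed time for one timer interval, in seconds.
// Clamping at zero keeps rounding noise (stop marginally below start)
// from producing a negative duration. This is an illustrative guard,
// not necessarily the exact change made in the commit.
double ElapsedSeconds(double start, double stop) {
  return std::max(stop - start, 0.0);
}
```

A clamp of this kind is harmless on platforms where the difference is never negative, which fits the remark that the fix should be correct everywhere.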

72a4581c

06 Nov, 2017 1 commit
- Remove deprecated headers (#473) · f65c6d9a
  Dominic Hamon authored Nov 06, 2017
  
  f65c6d9a
03 Nov, 2017 2 commits
- Add releasing doc (#472) · 1e525601
  Dominic Hamon authored Nov 03, 2017
  
  1e525601
- Update AUTHORS/CONTRIBUTORS (#471) · 336bb8db
  Roman Lebedev authored Nov 03, 2017
```
As requested, in a pr form :)
```
  336bb8db
02 Nov, 2017 1 commit

Mention how to disable CPU frequency scaling while running the benchmark. (#466) · 4463a60e

authored Nov 02, 2017

Describe how to use the cpupower command to disable CPU frequency scaling.
Document this, since there are other ways that don't seem to have the same
effect. See #325

4463a60e

31 Oct, 2017 1 commit

Improve BM_SetInsert example (#465) · fa341e51

authored Oct 31, 2017

* Fix BM_SetInsert example

Move declaration of `std::set<int> data` outside the timing loop, so that the
destructor is not timed.

* Speed up BM_SetInsert test

Since the time taken to ConstructRandomSet() is so large compared to the time
to insert one element, but only the latter is used to determine number of
iterations, this benchmark now takes an extremely long time to run in
benchmark_test.

Speed it up two ways:
  - Increase the Ranges() parameters
  - Cache ConstructRandomSet() result (it's not random anyway), and do only
    O(N) copy every iteration

* Fix same issue in BM_MapLookup test

* Make BM_SetInsert test consistent with README

- Use the same Ranges everywhere, but increase the 2nd range
- Change order of Args() calls in README to more closely match the result of Ranges
- Don't cache ConstructRandomSet, since it doesn't make sense in README
- Get a smaller optimization inside it, by giving a hint to insert()

fa341e51

20 Oct, 2017 1 commit
- Add option to install benchmark (#463) · 491360b8
  Yangqing Jia authored Oct 20, 2017
```
* Add option to install benchmark

* Change to BENCHMARK_ENABLE_INSTALL per @dominichamon
```
  491360b8
17 Oct, 2017 3 commits

Refactor most usages of KeepRunning to use the preferred ranged-for. (#459) · 25acf220

authored Oct 17, 2017

Recently the library added a new ranged-for variant of the KeepRunning
loop that is much faster. For this reason it should be preferred in all
new code.

Because a library, its documentation, and its tests should all embody
the best practices of using the library, this patch changes all but a
few usages of KeepRunning() into for (auto _ : state).

The remaining usages in the tests and documentation persist only
to document and test behavior that is different between the two formulations.

Also note that because the range-for loop requires C++11, the KeepRunning
variant has not been deprecated at this time.

25acf220

Fix and document SkipWithError(...) using ranged-for loop. · 22fd1a55
Eric Fiselier authored Oct 17, 2017

22fd1a55

Improve KeepRunning loop performance to be similar to the range-based for. (#460) · a37fc0c4

authored Oct 17, 2017

This patch improves the performance of the KeepRunning loop in two ways:

(A) it removes the dependency on the max_iterations variable, preventing
it from being loaded every iteration.

(B) it loops to zero, instead of to an upper bound. This allows a single
decrement instruction to be used instead of an arithmetic op followed by a
comparison.
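
Points (A) and (B) can be illustrated with a small stand-in class. This is a sketch only: `CountdownState` and `RunLoop` are hypothetical names and this is not the library's `State` class. The counter starts at the iteration limit and runs down to zero, so the loop condition never has to read the limit.

```cpp
#include <cstdint>

// Stand-in for a benchmark state object (hypothetical, for illustration).
class CountdownState {
 public:
  explicit CountdownState(std::uint64_t max_iterations)
      : max_iterations_(max_iterations), remaining_(max_iterations) {}

  // The hot path touches only remaining_: test against zero, then decrement.
  bool KeepRunning() {
    if (remaining_ == 0) return false;
    --remaining_;
    return true;
  }

  // The limit is read only outside the loop, to recover the iteration count.
  std::uint64_t iterations() const { return max_iterations_ - remaining_; }

 private:
  const std::uint64_t max_iterations_;
  std::uint64_t remaining_;
};

// Runs an empty loop body and reports how many times it executed.
std::uint64_t RunLoop(std::uint64_t max_iterations) {
  CountdownState state(max_iterations);
  std::uint64_t body_runs = 0;
  while (state.KeepRunning()) {
    ++body_runs;
  }
  return body_runs;
}
```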

a37fc0c4

16 Oct, 2017 1 commit
- Correct typo in sample code for range-based for loop. (#458) · 2fc2ea0e
  Fred Tingaud authored Oct 16, 2017
  
  2fc2ea0e
13 Oct, 2017 1 commit
- Avoid implicit float to double conversion (#457) · cacd3218
  Raúl Marín authored Oct 13, 2017
```
Triggered by -Werror=double-promotion
```
  cacd3218
10 Oct, 2017 1 commit

Add C++11 Ranged For loop alternative to KeepRunning (#454) · 05267559

authored Oct 10, 2017

* Add C++11 Ranged For loop alternative to KeepRunning

As pointed out by @astrelni and @dominichamon, the KeepRunning
loop requires a bunch of memory loads and stores every iteration,
which affects the measurements.

The main reason for these additional loads and stores is that the
State object is passed in by reference, making its contents externally
visible memory, and the compiler doesn't know it hasn't been changed
by non-visible code.

It's also possible the large size of the State struct is hindering
optimizations.

This patch allows the `State` object to be iterated over using
a range-based for loop. Example:

    void BM_Foo(benchmark::State& state) {
        for (auto _ : state) {
            [...]
        }
    }

This formulation is much more efficient, because the variable counting
the loop index is stored in the iterator produced by `State::begin()`,
which itself is stored in function-local memory and therefore not accessible
by code outside of the function. Therefore the compiler knows the iterator
hasn't been changed every iteration.
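
To make the mechanism concrete, here is a self-contained sketch of a type that can be driven by a range-based for loop with the counter held in the iterator. `MiniState`, its nested types, and `CountIterations` are hypothetical stand-ins, not the library's `State` class.

```cpp
#include <cstddef>

// Stand-in for benchmark::State (hypothetical, for illustration only).
class MiniState {
 public:
  struct Value {};  // What `_` binds to; it carries no data.

  class Iterator {
   public:
    explicit Iterator(std::size_t remaining) : remaining_(remaining) {}
    Value operator*() const { return Value(); }
    Iterator& operator++() {
      --remaining_;
      return *this;
    }
    // The end iterator is a sentinel: iteration stops when the counter
    // held by the left-hand iterator reaches zero.
    bool operator!=(const Iterator&) const { return remaining_ != 0; }

   private:
    std::size_t remaining_;  // Lives in a function-local iterator object.
  };

  explicit MiniState(std::size_t max_iterations) : max_(max_iterations) {}
  Iterator begin() const { return Iterator(max_); }
  Iterator end() const { return Iterator(0); }

 private:
  std::size_t max_;
};

// Counts how many times the loop body runs for a given iteration limit.
std::size_t CountIterations(std::size_t max_iterations) {
  MiniState state(max_iterations);
  std::size_t count = 0;
  for (auto _ : state) {
    (void)_;  // Silence unused-variable warnings.
    ++count;
  }
  return count;
}
```

Because the iterator is a local object whose address never escapes `CountIterations`, the compiler is free to keep `remaining_` in a register for the whole loop.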

This initial patch and idea was from Alex Strelnikov.

* Fix null pointer initialization in C++03

05267559

09 Oct, 2017 3 commits

Always use inline asm DoNotOptimize with clang. (#452) · f3cd636f

authored Oct 09, 2017

* Always use inline asm DoNotOptimize with clang.

clang-cl masquerades as MSVC but not GCC, so it was using the
MSVC-compatible definitions of DoNotOptimize and ClobberMemory.
Presumably, it's better in general to use the targeted assembly for
this functionality (the codegen is different), but the specific issue
is that clang-cl deprecates the usage of _ReadWriteBarrier, and this
gets rid of that warning.
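
For readers unfamiliar with the inline-assembly approach, the sketch below shows the general shape of such barriers for GCC-compatible compilers. The names carry a `Sketch` suffix because this is an illustration and the exact constraints used by the library may differ; `SumUpTo` is a hypothetical caller.

```cpp
// An empty asm statement that takes `value` as an input and declares a
// memory clobber: the compiler must materialize `value` and may not assume
// memory is unchanged afterwards. GCC/Clang syntax only.
template <class T>
inline void DoNotOptimizeSketch(T const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

// Forces pending writes to be treated as observable.
inline void ClobberMemorySketch() { asm volatile("" : : : "memory"); }

// Example caller: the barrier does not alter the computed result.
int SumUpTo(int n) {
  int total = 0;
  for (int i = 1; i <= n; ++i) {
    total += i;
    DoNotOptimizeSketch(total);
  }
  ClobberMemorySketch();
  return total;
}
```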

* triggering another AppVeyor run

f3cd636f

Add macros for creating benchmarks with templated fixtures (#451) · 819adb4c

authored Oct 09, 2017

* Add macros for creating benchmarks with templated fixtures

* Add info about templated fixtures to README.md

* Add tests for templated fixtures

819adb4c

Minor move of code to clean up namespace spaghetti a bit · 2409cb2e
Dominic Hamon authored Oct 09, 2017

2409cb2e

27 Sep, 2017 5 commits
- Alphabets are hard. AUTHORS version. · a96ff121
  Dominic Hamon authored Sep 27, 2017
```
#448
```
  a96ff121
- Alphabets are hard. CONTRIBUTORS version. · 5d47e987
  Dominic Hamon authored Sep 27, 2017
```
#448
```
  5d47e987
- Remove myself from AUTHORS · 8792dff1
  Dominic Hamon authored Sep 27, 2017
```
Covered by Google Inc here and i'm in CONTRIBUTORS
```
  8792dff1
- Order CONTRIBUTORS · 359120be
  Dominic Hamon authored Sep 27, 2017
```
Fixes #448
```
  359120be
- Organize AUTHORS · 84a54ae9
  Dominic Hamon authored Sep 27, 2017
```
Part of #448
```
  84a54ae9
14 Sep, 2017 2 commits

Fix #444 - Use BENCHMARK_HAS_CXX11 over __cplusplus. (#446) · 6d8339dd

authored Sep 14, 2017

* Fix #444 - Use BENCHMARK_HAS_CXX11 over __cplusplus.

MSVC incorrectly defines __cplusplus to report C++03, despite the compiler
actually providing C++11 or greater. Therefore we have to detect C++11 differently
for MSVC. This patch uses `_MSVC_LANG`, which has been defined since
Visual Studio 2015 Update 3 and should be sufficient for detecting C++11.

Secondly this patch changes over most usages of __cplusplus >= 201103L to
check BENCHMARK_HAS_CXX11 instead.
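
A detection macro along these lines could look like the sketch below. The macro name `SKETCH_HAS_CXX11` and the helper `HasCxx11` are hypothetical and used only for illustration; the library's own macro is `BENCHMARK_HAS_CXX11`, whose exact definition is not shown in the commit message.

```cpp
// Treat the compiler as C++11-capable if either the standard macro or the
// MSVC-specific one reports 201103L or later.
#if __cplusplus >= 201103L || (defined(_MSVC_LANG) && _MSVC_LANG >= 201103L)
#define SKETCH_HAS_CXX11 1
#else
#define SKETCH_HAS_CXX11 0
#endif

// Small helper so the result can be checked at run time.
bool HasCxx11() { return SKETCH_HAS_CXX11 != 0; }
```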

* remove redundant comment

6d8339dd

Improve README's basic usage example (#433) · 2a05f248
Disconnect3d authored Sep 14, 2017

2a05f248

13 Sep, 2017 1 commit
- Fix Markdown typos in readme. (#445) · 24b80427
  Andre Schroeder authored Sep 13, 2017
  
  24b80427
28 Aug, 2017 2 commits

[RFC] Tools: compare-bench.py: print change% with two decimal digits (#440) · 886585a3

authored Aug 29, 2017

* Tools: compare-bench.py: print change% with two decimal digits

Here is a comparison of before vs. after:
```diff
-Benchmark                      Time           CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------
-BM_SameTimes                  +0.00         +0.00            10            10            10            10
-BM_2xFaster                   -0.50         -0.50            50            25            50            25
-BM_2xSlower                   +1.00         +1.00            50           100            50           100
-BM_1PercentFaster             -0.01         -0.01           100            99           100            99
-BM_1PercentSlower             +0.01         +0.01           100           101           100           101
-BM_10PercentFaster            -0.10         -0.10           100            90           100            90
-BM_10PercentSlower            +0.10         +0.10           100           110           100           110
-BM_100xSlower                +99.00        +99.00           100         10000           100         10000
-BM_100xFaster                 -0.99         -0.99         10000           100         10000           100
-BM_10PercentCPUToTime         +0.10         -0.10           100           110           100            90
+Benchmark                        Time             CPU      Time Old      Time New       CPU Old       CPU New
+-------------------------------------------------------------------------------------------------------------
+BM_SameTimes                  +0.0000         +0.0000            10            10            10            10
+BM_2xFaster                   -0.5000         -0.5000            50            25            50            25
+BM_2xSlower                   +1.0000         +1.0000            50           100            50           100
+BM_1PercentFaster             -0.0100         -0.0100           100            99           100            99
+BM_1PercentSlower             +0.0100         +0.0100           100           101           100           101
+BM_10PercentFaster            -0.1000         -0.1000           100            90           100            90
+BM_10PercentSlower            +0.1000         +0.1000           100           110           100           110
+BM_100xSlower                +99.0000        +99.0000           100         10000           100         10000
+BM_100xFaster                 -0.9900         -0.9900         10000           100         10000           100
+BM_10PercentCPUToTime         +0.1000         -0.1000           100           110           100            90
+BM_ThirdFaster                -0.3333         -0.3333           100            67           100            67

```

So the first ("Time") column is exactly where it was, but with
two more decimal digits. The position of the '.' in the second
("CPU") column is shifted right by those two positions, and the
rest is unmodified, but simply shifted right by those 4 positions.

As for the reasoning, I guess it is more or less the same as
with #426. Sadly, microbenchmarking is not always applicable.
In those cases, the more precise the change report is, the better.

The current formatting prints not so much a percentage
as a fraction, I'd say. That is most useful for huge changes,
much more than 100%. But changes are not always that large, especially
outside of microbenchmarks. Then, even though the change
may be good/bad, it is small (<0.5% or so),
rounding happens, and it is no longer possible to tell.
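
The effect of the extra digits is easy to reproduce with a signed fixed-point format. This is a sketch: `FormatChange` is a hypothetical helper written in C++ for illustration, whereas the tool itself is a Python script.

```cpp
#include <cstdio>
#include <string>

// Formats a fractional change with an explicit sign and the requested
// number of digits after the decimal point.
std::string FormatChange(double change, int digits) {
  char buffer[64];
  std::snprintf(buffer, sizeof(buffer), "%+.*f", digits, change);
  return std::string(buffer);
}
```

With two digits a 0.4% slowdown (0.004) prints as `+0.00` and is indistinguishable from no change; with four digits it prints as `+0.0040`.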

I do acknowledge that this change does not fix that problem. Of
course, confidence intervals and such would be better, and they
would probably fix the problem. But I think this is good as-is
too, because now you see 2 fractional percentage digits!

The obvious downside is that the output is now even wider.

* Revisit tests, more closely documents the current behavior.

886585a3

Attempting to resolve submoduling issues... (#439) · 6e066481
Roman Lebedev authored Aug 29, 2017

6e066481

23 Aug, 2017 1 commit

Drop Stat1, refactor statistics to be user-providable, add median. (#428) · a271c36a

authored Aug 24, 2017

* Drop Stat1, refactor statistics to be user-providable, add median.

My main goal was to add a median statistic. Since Stat1
calculated the stats incrementally, and did not store
the values themselves, that was not possible. Thus,
I have replaced Stat1 with a simple std::vector<double>
containing all the values.

Then, i have refactored current mean/stdev to be a
function that is provided with values vector, and
returns the statistic. While there, it seemed to make
sense to deduplicate the code by storing all the
statistics functions in a map, and then simply iterate
over it. And the interface to add new statistics is
intentionally exposed, so they may be added easily.
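
The design described above can be sketched as plain functions over a vector of values, registered by name in an ordered container. All names here (`StatisticFn`, `Mean`, `Median`, `SampleStdDev`, `ComputeAll`) are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption; the commit message does not say which variant the library computes.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// A statistic is any function from the stored values to a single number.
typedef double (*StatisticFn)(const std::vector<double>&);

double Mean(const std::vector<double>& v) {
  if (v.empty()) return 0.0;
  return std::accumulate(v.begin(), v.end(), 0.0) /
         static_cast<double>(v.size());
}

// Needs all values at once, which an incremental accumulator cannot provide.
double Median(const std::vector<double>& v) {
  if (v.empty()) return 0.0;
  std::vector<double> sorted(v);
  std::sort(sorted.begin(), sorted.end());
  const std::size_t mid = sorted.size() / 2;
  if (sorted.size() % 2 == 1) return sorted[mid];
  return (sorted[mid - 1] + sorted[mid]) / 2.0;
}

// Assumption: sample standard deviation (divides by n - 1).
double SampleStdDev(const std::vector<double>& v) {
  if (v.size() < 2) return 0.0;
  const double mean = Mean(v);
  double sum_sq = 0.0;
  for (double x : v) sum_sq += (x - mean) * (x - mean);
  return std::sqrt(sum_sq / static_cast<double>(v.size() - 1));
}

// Applies every registered statistic in registration order.
std::vector<std::pair<std::string, double>> ComputeAll(
    const std::vector<double>& values) {
  const std::vector<std::pair<std::string, StatisticFn>> statistics = {
      {"mean", Mean}, {"median", Median}, {"stddev", SampleStdDev}};
  std::vector<std::pair<std::string, double>> results;
  for (const auto& entry : statistics) {
    results.push_back(std::make_pair(entry.first, entry.second(values)));
  }
  return results;
}
```

A vector of pairs is used rather than a map so that the statistics come out in a fixed order, in line with the later bullet about storing them in a vector.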

The notable change is that Iterations are no longer
displayed as 0 for stdev. It could be changed, but
I'm not sure how to nicely fit that into the API.

Similarly, this dance about sometimes (for some fields,
for some statistics) dividing by run.iterations, and
then multiplying the calculated statistic back is also
dropped, and if you do the math, I fail to see why
it was needed there in the first place.

Since that was the only use of stat.h, it is removed.

* complexity.h: attempt to fix MSVC build

* Update README.md

* Store statistics to compute in a vector, ensures ordering.

* Add a bit more tests for repetitions.

* Partially address review notes.

* Fix gcc build: drop extra ';'

clang, why didn't you warn me?

* Address review comments.

* double() -> 0.0
* early return

a271c36a

21 Aug, 2017 1 commit

Allow the definition of 1k to be flexible. (#438) · d7041799

authored Aug 21, 2017

When generating a human-readable number for user counters, we don't
generally expect 1k to be 1024. The 1024 default comes from the more
general-purpose string utility.
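
A minimal sketch of a formatter whose notion of "1k" is a parameter. `HumanReadable` and its signature are hypothetical, and the sketch ignores negative and fractional-prefix values for brevity.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Scales `value` down by `one_k` until it fits, then appends a suffix.
// Pass 1000.0 for counter-style output or 1024.0 for byte-style output.
std::string HumanReadable(double value, double one_k) {
  static const char* const kSuffixes[] = {"", "k", "M", "G", "T"};
  const std::size_t kNumSuffixes = sizeof(kSuffixes) / sizeof(kSuffixes[0]);
  std::size_t index = 0;
  while (value >= one_k && index + 1 < kNumSuffixes) {
    value /= one_k;
    ++index;
  }
  char buffer[64];
  std::snprintf(buffer, sizeof(buffer), "%g%s", value, kSuffixes[index]);
  return std::string(buffer);
}
```

With `one_k` set to 1000 a counter value of 1000 prints as `1k`; with 1024 the same value stays `1000`, which is the surprise the commit addresses.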

Fixes #437

d7041799

18 Aug, 2017 1 commit

compare_bench.py: fixup benchmark_options. (#435) · c7192c8a

authored Aug 18, 2017

https://github.com/google/benchmark/commit/2373382284918fda13f726aefd6e2f700784797f
reworked parsing, and introduced a regression
in handling of the optional options that
should be passed to both of the benchmarks.

Now, unless the *first* optional argument starts with
'-', it would just complain about that argument:
	Unrecognized positional argument arguments: '['q']'
which is wrong. However if some dummy arg like '-q' was
passed first, it would then happily passthrough them all...

This commit fixes benchmark_options behavior, by
restoring original passthrough behavior for all
the optional positional arguments.

c7192c8a

15 Aug, 2017 1 commit
- CMake: Fallback from try_run to try_compile when cross-compiling. (#436) · 90293603
  Victor Costan authored Aug 15, 2017
  
  90293603
01 Aug, 2017 1 commit
- reporter_output_test: json: iterations is int, not float (#431) · 3347a20e
  Roman Lebedev authored Aug 01, 2017
```
May be relevant for flakiness of win builds

Noted by @KindDragon
```
  3347a20e
31 Jul, 2017 1 commit

Suppress -Wodr on C++03 tests when LTO is enabled. · abafced9

authored Jul 30, 2017

The benchmark library is compiled as C++11, but certain
tests are compiled as C++03. When -flto is enabled GCC 5.4
and above will diagnose an ODR violation in libstdc++'s <map>.

This ODR violation, although real, should likely be benign. For
this reason it seems sensible to simply suppress -Wodr when building
the C++03 test.

This patch fixes #420 and supersedes PR #424.

abafced9

25 Jul, 2017 1 commit

Tooling: generate_difference_report(): show old/new for both values (#427) · d474450b

authored Jul 25, 2017

While the percentages are displayed for both of the columns,
the old/new values are only displayed for the second column,
for the CPU time. And the column is not even spelled out.

In cases where b->UseRealTime(); is used, this is at the
very least highly confusing. So why don't we just
display both the old/new for both the columns?

Fixes #425

d474450b

24 Jul, 2017 1 commit

Json reporter: don't cast floating-point to int; adjust tooling (#426) · b9be142d

authored Jul 25, 2017

* Json reporter: passthrough fp, don't cast it to int; adjust tooling

Json output format is generally meant for further processing
using some automated tools. Thus, it makes sense not to
intentionally limit the precision of the values contained
in the report.

As can be seen, FormatKV() for doubles used the %.2f format,
which was meant to preserve at least some of the precision.
However, before that function is ever called, the doubles
were already cast to integers via RoundDouble()...

The same cast is applied in the console reporter, where it makes
sense because screen space is limited, and in this reporter;
the CSV reporter, however, does output some decimal digits.

Thus i can only conclude that the loss of the precision
was not really considered, so i have decided to adjust the
code of the json reporter to output the full fp precision.

There are several reasons why that is the right thing
to do. The bigger the time_unit used, the greater the
precision loss, so I'd say any sort of further processing
(like e.g. tools/compare_bench.py does) is best done
on the values with the most precision.

Also, that cast skewed the data away from zero, which
I think may or may not result in false positives/negatives
in the output of tools/compare_bench.py
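
A small worked example shows how much a comparison can shift when times are rounded to integers before being compared. The function names are hypothetical, `std::llround` stands in for the reporter's rounding step, and the sample values are made up for illustration.

```cpp
#include <cmath>

// Relative change computed on the full-precision values.
double ChangeExact(double old_time, double new_time) {
  return (new_time - old_time) / old_time;
}

// Relative change computed after both values are rounded to integers,
// as happens when a report stores integer times.
// Precondition: old_time rounds to a non-zero integer.
double ChangeAfterIntegerRounding(double old_time, double new_time) {
  const double old_rounded = static_cast<double>(std::llround(old_time));
  const double new_rounded = static_cast<double>(std::llround(new_time));
  return (new_rounded - old_rounded) / old_rounded;
}
```

For times of 1.4 and 1.6 in some large unit, the exact change is about +0.14, while the rounded values 1 and 2 give +1.00, turning a modest slowdown into an apparent doubling.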

* Json reporter: FormatKV(double): address review note

* tools/gbench/report.py: skip benchmarks with different time units

It may be useful to teach it to operate on
measurements with different time units, which is now
possible since floats are stored rather than integers.
For now, though, this sanity check
is better than providing misinformation.
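
The sanity check can be sketched as a filter that pairs results by name and keeps only those reported in the same unit. This is an illustration in C++ with hypothetical names (`Measurement`, `ComparableNames`); the actual tool is the Python script tools/gbench/report.py and its data layout is not shown here.

```cpp
#include <string>
#include <vector>

// One benchmark result as it might appear in a report (hypothetical layout).
struct Measurement {
  std::string name;
  double real_time;
  std::string time_unit;
};

// Returns the names present in both runs with matching time units.
// Pairs whose units differ are skipped rather than compared incorrectly.
std::vector<std::string> ComparableNames(
    const std::vector<Measurement>& old_run,
    const std::vector<Measurement>& new_run) {
  std::vector<std::string> names;
  for (const Measurement& old_result : old_run) {
    for (const Measurement& new_result : new_run) {
      if (old_result.name == new_result.name &&
          old_result.time_unit == new_result.time_unit) {
        names.push_back(old_result.name);
      }
    }
  }
  return names;
}
```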

b9be142d