- 13 Sep, 2018 2 commits
-
-
Roman Lebedev authored
As discussed with @dominichamon and @dbabokin, sugar is nice. Well, maybe not for the health, but it's sweet. Alright, enough puns. Special care needs to be taken not to break the CSV reporter. UGH. We end up shedding some code over this. We no longer specially pretty-print them; they are printed just like the rest of the custom counters. Fixes #627.
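A minimal sketch of the user-facing sugar this keeps intact (the benchmark name and record size are made up); per this change, the values land in the counters map and are printed like any other custom counter:

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>

static void BM_parse(benchmark::State& state) {
  constexpr std::int64_t kRecordSize = 512;  // hypothetical record size
  for (auto _ : state) {
    // ... parse one record of kRecordSize bytes ...
  }
  // Same calls as before; they are now stored and reported as regular
  // custom counters rather than special-cased fields.
  state.SetItemsProcessed(state.iterations());
  state.SetBytesProcessed(state.iterations() * kRecordSize);
}
BENCHMARK(BM_parse);
BENCHMARK_MAIN();
```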
-
Roman Lebedev authored
This is related to @BaaMeow's work in https://github.com/google/benchmark/pull/616 but is not based on it. Two new fields are tracked and dumped into JSON:

* If the run is an aggregate, the aggregate's name is stored. It can be RMS, BigO, mean, median, stddev, or any custom stat name.
* The aggregate-name-less run name is additionally stored; i.e. not the name of the benchmark function, but the actual run name, without the 'aggregate name' suffix.

This way one can group/filter all the runs, and filter by the particular aggregate type. I *might* need this for further tooling improvements. Or maybe not. But this is certainly worthwhile for custom tooling.
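For illustration, an aggregate entry in the JSON might then carry fields along these lines (a hedged sketch: the values are invented, and the exact field names are an assumption based on how the library's JSON output looks today):

```json
{
  "name": "BM_demo/repeats:10_mean",
  "run_name": "BM_demo/repeats:10",
  "aggregate_name": "mean",
  "iterations": 10,
  "real_time": 42.0,
  "cpu_time": 41.5
}
```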
-
- 12 Sep, 2018 1 commit
-
-
Roman Lebedev authored
There is a flag https://github.com/google/benchmark/blob/d9cab612e40017af10bddaa5b60c7067032a9e1c/src/benchmark.cc#L75-L78 and a call https://github.com/google/benchmark/blob/d9cab612e40017af10bddaa5b60c7067032a9e1c/include/benchmark/benchmark.h#L837-L840. But that affects everything: every reporter, every destination: https://github.com/google/benchmark/blob/d9cab612e40017af10bddaa5b60c7067032a9e1c/src/benchmark.cc#L316. It would be quite useful to have the ability to be more picky. More specifically, I would like to only see the aggregates in the on-screen output, but have the file output still contain everything. The former is useful in the case of many repetitions (and even more so if every iteration is reported separately), while the latter is **great** for tooling. Fixes https://github.com/google/benchmark/issues/664
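A minimal sketch of the per-benchmark knob this asks for (assuming the `DisplayAggregatesOnly()` builder that this line of work introduces):

```cpp
#include <benchmark/benchmark.h>

static void BM_demo(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.iterations());
  }
}
// Only the mean/median/stddev aggregate rows reach the console, while
// the file output (e.g. --benchmark_out=out.json) still contains every
// individual repetition.
BENCHMARK(BM_demo)->Repetitions(10)->DisplayAggregatesOnly(true);
BENCHMARK_MAIN();
```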
-
- 10 Sep, 2018 1 commit
-
-
Roman Lebedev authored
`MSVC` is true for clang-cl, but `"${CMAKE_CXX_COMPILER_ID}" STREQUAL "MSVC"` is false, so we would enable -Wall, which clang-cl treats as -Weverything, and we get tons of undesired warnings. Use the simpler condition to fix this. Patch by: Reid Kleckner @rnk
-
- 05 Sep, 2018 2 commits
-
-
Roman Lebedev authored
I have absolutely no way to test this, but it looks obviously good. This was reported by Tim Northover @TNorthover in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180903/584223.html:

> I think this breaks some 32-bit configurations (well, mine at least).
> I was using Clang (from Xcode 10 beta) on macOS and got a bunch of
> errors referencing sysinfo.cc:292 and onwards:
>
>     /Users/tim/llvm/llvm-project/llvm/utils/benchmark/src/sysinfo.cc:292:47:
>     error: non-constant-expression cannot be narrowed from type
>     'std::__1::array<unsigned long long, 4>::value_type' (aka 'unsigned
>     long long') to 'size_t' (aka 'unsigned long') in initializer list
>     [-Wc++11-narrowing]
>     } Cases[] = {{"hw.l1dcachesize", "Data", 1, CacheCounts[1]},
>                                                 ^~~~~~~~~~~~~~
>
> The same happens when self-hosting ToT. Unfortunately I couldn't
> reproduce the issue on Debian (Clang 6.0.1) even with libc++; I'm not
> sure what the difference is.
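For context, a self-contained sketch of the problem and the fix (the struct shape is a guess; the names mirror the quoted error):

```cpp
#include <array>
#include <cstddef>

// CacheCounts holds 64-bit values. On a 32-bit target, size_t is
// narrower, so brace-initializing a size_t field from CacheCounts[1]
// is a -Wc++11-narrowing error; the explicit cast silences it.
std::array<unsigned long long, 4> CacheCounts{};
struct {
  const char* name;
  const char* type;
  int level;
  std::size_t num_sharing;
} Cases[] = {
    {"hw.l1dcachesize", "Data", 1, static_cast<std::size_t>(CacheCounts[1])},
};
```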
-
Roman Lebedev authored
Maybe this will show https://github.com/google/benchmark/pull/667, or maybe this is completely wrong.
-
- 04 Sep, 2018 1 commit
-
-
Changming Sun authored
-
- 03 Sep, 2018 1 commit
-
-
pseyfert authored
-
- 30 Aug, 2018 1 commit
-
-
Roman Lebedev authored
They are basically a proto-version of custom user counters. It does not seem that they do anything that custom user counters don't, and having two similar entities is not good for generalization. Migration plan (a concrete example follows below):

* ```
  SetItemsProcessed(<val>)
  =>
  state.counters.insert({
    {"<Name>", benchmark::Counter(<val>, benchmark::Counter::kIsRate)},
    ...
  });
  ```
* ```
  SetBytesProcessed(<val>)
  =>
  state.counters.insert({
    {"<Name>", benchmark::Counter(<val>, benchmark::Counter::kIsRate,
                                  benchmark::Counter::OneK::kIs1024)},
    ...
  });
  ```
* ```
  <Name>_processed()
  =>
  state.counters["<Name>"]
  ```

One thing the custom user counters miss is better support for units of measurement. Refs. https://github.com/google/benchmark/issues/627
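Concretely, the bytes migration might look like this (a hedged sketch; the benchmark and counter names are made up):

```cpp
#include <benchmark/benchmark.h>

#include <cstring>
#include <vector>

static void BM_memcpy(benchmark::State& state) {
  std::vector<char> src(state.range(0)), dst(state.range(0));
  for (auto _ : state) {
    std::memcpy(dst.data(), src.data(), dst.size());
  }
  // Was: state.SetBytesProcessed(state.iterations() * state.range(0));
  state.counters.insert(
      {{"Bytes",
        benchmark::Counter(
            static_cast<double>(state.iterations() * state.range(0)),
            benchmark::Counter::kIsRate, benchmark::Counter::OneK::kIs1024)}});
}
BENCHMARK(BM_memcpy)->Arg(1 << 13);
BENCHMARK_MAIN();
```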
-
- 29 Aug, 2018 3 commits
-
-
Roman Lebedev authored
* Counter(): add 'one thousand' param. Needed for https://github.com/google/benchmark/pull/654. Custom user counters are quite custom. It is not guaranteed that the user *always* expects these to have 1k == 1000. If the counter represents bytes/memory/etc., 1k should be 1024. Some bikeshedding points:
  1. Is this sufficient, or do we really want to go full-on into custom types with names? I think just the '1000' is sufficient for now.
  2. Should there be helper benchmark::Counter::Counter{1000,1024}() static 'constructor' functions, since these two will, by far, be the most used?
  3. In the future, we should somehow encode this info into the JSON.
* Counter(): use std::pair<> to represent 'one thousand'
* Counter(): just use a new enum with two values, 1000 vs 1024. Simpler is better. If someone comes up with a real reason to need something more advanced, it can be added later on.
* Counter: just store the 1000 or 1024 in the One_K values directly
* Counter: s/One_K/OneK/
-
Roman Lebedev authored
There are two destinations:
* display (console, terminal), and
* file.
And each of the destinations can be populated with one of the reporters:
* console - human-friendly table-like display
* json
* csv (deprecated)
So using the name console_reporter is confusing: is it talking about the console reporter in the sense of the table-like reporter, or in the sense of the display destination?
-
Michael "Croydon" Keck authored
-
- 28 Aug, 2018 3 commits
-
-
Roman Lebedev authored
This is *only* exposed in the JSON. Not in the CSV, which is deprecated. This is *only* supposed to track these two states; an additional field could later track which aggregate this is, specifically (statistic name, RMS, BigO, ...). The motivation is that we already have ReportAggregatesOnly, but it affects the entire report, both the display and the reporters (JSON files), which isn't ideal. It would be very useful to have a 'display aggregates only' option, both in the library's console reporter and in the Python tooling. This will be especially needed for 'store separate iterations'.
-
Roman Lebedev authored
This specifically represents only the handling of reporting of aggregates, not of anything else. Making it more specific makes the name less generic. This is an issue because I want to add an 'iteration report mode', and the naming would conflict.
-
Bernhard M. Wiedemann authored
Found while working on reproducible builds for openSUSE. To reproduce there:
`osc checkout openSUSE:Factory/benchmark && cd $_`
`osc build -j1 --vm-type=kvm`
-
- 16 Aug, 2018 1 commit
-
-
BaaMeow authored
-
- 13 Aug, 2018 1 commit
-
-
Roman Lebedev authored
As discussed on IRC, time to deduplicate.
-
- 08 Aug, 2018 1 commit
-
-
Kirill Bobyrev authored
* Remove redundant default which causes failures
* Fix old GCC warnings caused by poor analysis
* Use __builtin_unreachable
* Use BENCHMARK_UNREACHABLE()
* Pull __has_builtin to benchmark.h too
* Also move compiler identification macro to main header
* Move custom compiler identification macro back
-
- 26 Jul, 2018 1 commit
-
-
Dominic Hamon authored
* Clarifications and cleaning of the core documentation.
-
- 24 Jul, 2018 1 commit
-
-
Dominic Hamon authored
* Introduce memory manager interface
* Add memory stats to JSON reporter and a test
* Add comments and switch json output test to int
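A rough sketch of how such a manager plugs in (hedged: the Start()/Stop() hooks and the RegisterMemoryManager() entry point match the shape of the interface as introduced, though exact signatures have shifted across releases):

```cpp
#include <benchmark/benchmark.h>

#include <vector>

static void BM_vector_push(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    v.push_back(42);
    benchmark::DoNotOptimize(v.data());
  }
}
BENCHMARK(BM_vector_push);

// A toy manager reporting fixed numbers; a real one would hook the
// allocator between Start() and Stop().
class ToyMemoryManager : public benchmark::MemoryManager {
 public:
  void Start() override {}
  void Stop(Result* result) override {
    result->num_allocs = 1;
    result->max_bytes_used = sizeof(int);
  }
};

int main(int argc, char** argv) {
  ToyMemoryManager mm;
  benchmark::RegisterMemoryManager(&mm);
  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
}
```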
-
- 23 Jul, 2018 1 commit
-
-
Dominic Hamon authored
-
- 09 Jul, 2018 2 commits
-
-
Federico Ficarelli authored
Adding myself to AUTHORS and CONTRIBUTORS according to guidelines.
-
Federico Ficarelli authored
* Set -Wno-deprecated-declarations for Intel
  The Intel compiler silently ignores -Wno-deprecated-declarations, so warning no. 1786 must be explicitly suppressed.
* Make std::int64_t → double casts explicit
  While std::int64_t → double is a perfectly conformant implicit conversion, the Intel compiler warns about it. Make them explicit via static_cast<double>.
* Make std::int64_t → int casts explicit
  The Intel compiler warns about emplacing a std::int64_t into an int container. Just make the conversion explicit via static_cast<int>.
* Clean up the Intel -Wno-deprecated-declarations workaround logic
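For illustration, the two cast patterns look like this (a minimal sketch with made-up names):

```cpp
#include <cstdint>
#include <vector>

void Example(std::int64_t iterations) {
  // std::int64_t -> double is conformant implicitly, but ICC warns,
  // so spell it out:
  double scale = static_cast<double>(iterations);
  (void)scale;
  // Same story when emplacing a std::int64_t into an int container:
  std::vector<int> args;
  args.emplace_back(static_cast<int>(iterations));
}
```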
-
- 03 Jul, 2018 1 commit
-
-
Federico Ficarelli authored
-
- 28 Jun, 2018 1 commit
-
-
Yoshinari Takaoka authored
-
- 27 Jun, 2018 2 commits
-
-
Roman Lebedev authored
Inspired by these [two](https://github.com/darktable-org/rawspeed/commit/a1ebe07bea5738f8607b48a7596c172be249590e) [bugs](https://github.com/darktable-org/rawspeed/commit/0891555be56b24f9f4af716604cedfa0da1efc6b) I have found and fixed in my own code, caused by the lack of these:

* `kIsIterationInvariant` - `* state.iterations()`. The value is constant for every iteration, and needs to be **multiplied** by the iteration count.
* `kAvgIterations` - `/ state.iterations()`. The value is global over all the iterations, and needs to be **divided** by the iteration count.

They play nice with `kIsRate`:

* `kIsIterationInvariantRate`
* `kAvgIterationsRate`

I'm not sure how meaningful they are when combined with `kAvgThreads`. I guess `kIsThreadInvariant` can be added, too, for symmetry with `kAvgThreads`. A usage sketch follows below.
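A minimal usage sketch of the two new flags (the benchmark body and values are made up):

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>

static void BM_process_chunk(benchmark::State& state) {
  std::int64_t total_allocs = 0;
  for (auto _ : state) {
    // ... process one fixed-size 4096-byte chunk per iteration ...
    total_allocs += 3;  // pretend each iteration allocates three times
  }
  // Constant per iteration: multiplied by the iteration count, and
  // kIsRate on top turns it into bytes per second.
  state.counters["Bytes"] = benchmark::Counter(
      4096, benchmark::Counter::kIsIterationInvariantRate);
  // A total over all iterations: divided by the iteration count,
  // yielding allocations per iteration.
  state.counters["Allocs"] = benchmark::Counter(
      static_cast<double>(total_allocs), benchmark::Counter::kAvgIterations);
}
BENCHMARK(BM_process_chunk);
BENCHMARK_MAIN();
```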
-
Dominic Hamon authored
* Use EXPECT_DOUBLE_EQ when comparing doubles in tests. Fixes #623
* Disable the 'float-equal' warning
-
- 18 Jun, 2018 1 commit
-
-
Roman Lebedev authored
As previously discussed, let's flip the switch ^^. This exposes the problem that it will now be run for everyone, even if one did not read the help about the recommended repetition count. This is not good. So I think we can do the smart thing:

```
$ ./compare.py benchmarks gbench/Inputs/test3_run{0,1}.json
Comparing gbench/Inputs/test3_run0.json to gbench/Inputs/test3_run1.json
Benchmark        Time      CPU   Time Old   Time New   CPU Old   CPU New
--------------------------------------------------------------------------
BM_One        -0.1000  +0.1000         10          9       100       110
BM_Two        +0.1111  -0.0111          9         10        90        89
BM_Two        +0.2500  +0.1125          8         10        80        89
BM_Two_pvalue  0.2207   0.6831   U Test, Repetitions: 2. WARNING: Results unreliable! 9+ repetitions recommended.
BM_Two_stat   +0.0000  +0.0000          8          8        80        80
```

(old screenshot) 

Or, in the good case (noise omitted):

```
$ ./compare.py benchmarks /tmp/run{0,1}.json
Comparing /tmp/run0.json to /tmp/run1.json
Benchmark                                      Time      CPU   Time Old   Time New   CPU Old   CPU New
--------------------------------------------------------------------------------------------------------
<99 more rows like this>
./_T012014.RW2/threads:8/real_time          +0.0160  +0.0596         46         47        10        10
./_T012014.RW2/threads:8/real_time_pvalue    0.0000   0.0000   U Test, Repetitions: 100
./_T012014.RW2/threads:8/real_time_mean     +0.0094  +0.0609         46         47        10        10
./_T012014.RW2/threads:8/real_time_median   +0.0104  +0.0613         46         46        10        10
./_T012014.RW2/threads:8/real_time_stddev   -0.1160  -0.1807          1          1         0         0
```

(old screenshot) 
-
- 07 Jun, 2018 1 commit
-
-
Dominic Hamon authored
Fixes #608
-
- 06 Jun, 2018 1 commit
-
-
Marat Dukhan authored
list(FILTER ...) is a CMake 3.6 feature, but benchmark targets CMake 2.8.12
-
- 05 Jun, 2018 2 commits
-
-
Sergiu Deitsch authored
-
Marat Dukhan authored
* Fix compilation on Android with GNU STL
  GNU STL in the Android NDK lacks the C++11 string conversion functions, including std::stoul, std::stoi, and std::stod. This patch reimplements these functions in the benchmark:: namespace using C-style equivalents from C++03.
* Avoid use of log2, which doesn't exist in Android GNU STL
  GNU STL in the Android NDK lacks the log2 function from C99/C++11. This patch replaces its use in the code with the double log(double) function.
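A minimal sketch of the reimplementation approach for one of these functions (hedged: error handling is simplified here, and the real patch also covers std::stoi and std::stod):

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

namespace benchmark {
// C-style stand-in for C++11's std::stoul, built on strtoul.
inline unsigned long stoul(const std::string& str, std::size_t* pos = nullptr,
                           int base = 10) {
  const char* begin = str.c_str();
  char* end = nullptr;
  const unsigned long result = std::strtoul(begin, &end, base);
  if (end == begin) throw std::invalid_argument("stoul: no conversion");
  if (pos != nullptr) *pos = static_cast<std::size_t>(end - begin);
  return result;
}
}  // namespace benchmark
```

Likewise, a `log2(x)` call can be expressed as `log(x) / log(2.0)` using only double log(double).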
-
- 01 Jun, 2018 1 commit
-
-
BaaMeow authored
* Format all documents according to contributor guidelines and specifications; use clang-format on/off markers to stop formatting where it makes excessively poor decisions
* Format all tests as well, and mark blocks which change too much
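The markers look like this (an illustrative sketch, not taken from the patch):

```cpp
// clang-format off
const int kTable[] = {0x01, 0x02, 0x04, 0x08,
                      0x10, 0x20, 0x40, 0x80};  // keep the hand alignment
// clang-format on
```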
-
- 30 May, 2018 1 commit
-
-
Dominic Hamon authored
Specifically some iOS targets.
-
- 29 May, 2018 6 commits
-
-
Dominic Hamon authored
-
Eric authored
As @dominichamon and I have discussed, the current reporter interface is poor at best. And something should be done to fix it. I strongly suspect such a fix will require an entire reimagining of the API, and therefore breaking backwards compatibility fully. For that reason we should start deprecating and removing parts that we don't intend to replace. One of these parts, I argue, is the CSVReporter. I propose that the new reporter interface should choose a single output format (JSON) and traffic entirely in that. If somebody really wanted to replace the functionality of the CSVReporter they would do so as an external tool which transforms the JSON. For these reasons I propose deprecating the CSVReporter.
-
Dominic Hamon authored
-
Dominic Hamon authored
-
Roman Lebedev authored
The first problem you have to solve yourself. The second one can be aided. The benchmark library can compute some statistics over the repetitions, which helps with grasping the results somewhat. But that is only for the one set of results. It does not really help to compare the two benchmark results, which is the interesting bit.

Thankfully, there are the bundled `tools/compare.py` and `tools/compare_bench.py` scripts. They can provide a diff between two benchmarking results. Yay! Except not really: it's just a diff, and while it is very informative and better than nothing, it does not really help answer The Question - am I just looking at the noise? It's like not having these per-benchmark statistics...

Roughly, we can formulate the question as:

> Are these two benchmarks the same?
> Did my change actually change anything, or is the difference below the noise level?

Well, this really sounds like a [null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis), does it not? So maybe we can use statistics here, and solve all our problems? lol, no, it won't solve all the problems. But maybe it will act as a tool, to better understand the output, just like the usual statistics on the repetitions...

I'm making an assumption here that most people care about the change of the average value, not the standard deviation. Thus I believe we can use a t-test, be it either [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test) or [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test). **EDIT**: however, after @dominichamon's review, it was decided that it is better to use the more robust [Mann–Whitney U test](https://en.wikipedia.org/wiki/Mann–Whitney_U_test). I'm using [scipy.stats.mannwhitneyu](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html#scipy.stats.mannwhitneyu).

There are two new user-facing knobs:

```
$ ./compare.py --help
usage: compare.py [-h] [-u] [--alpha UTEST_ALPHA]
                  {benchmarks,filters,benchmarksfiltered} ...

versatile benchmark output compare tool

<...>

optional arguments:
  -h, --help           show this help message and exit
  -u, --utest          Do a two-tailed Mann-Whitney U test with the null
                       hypothesis that it is equally likely that a randomly
                       selected value from one sample will be less than or
                       greater than a randomly selected value from a second
                       sample. WARNING: requires **LARGE** (9 or more) number
                       of repetitions to be meaningful!
  --alpha UTEST_ALPHA  significance level alpha. if the calculated p-value is
                       below this value, then the result is said to be
                       statistically significant and the null hypothesis is
                       rejected. (default: 0.0500)
```

Example output:



As you can guess, the alpha does not affect anything but the coloring of the computed p-values. If it is green, then the change in the average values is statistically significant.

I'm detecting the repetitions by matching name. This way, no changes to the json are _needed_. Caveats:

* This won't work if the json is not in the same order as outputted by the benchmark, or if the parsing does not retain the ordering.
* This won't work if, after the grouped repetitions, there isn't at least one row with a different name (e.g. a statistic). Since there isn't a knob to disable printing of statistics (only the other way around), I'm not too worried about this.
* **The results will be wrong if the repetition count is different between the two benchmarks being compared.**
* Even though I have added (hopefully full) test coverage, the code of these python tools is starting to look a bit jumbled.



* So far I have added this only to `tools/compare.py`. Should I add it to `tools/compare_bench.py` too? Or should we deduplicate them (by removing the latter one)?
-
Dominic Hamon authored
-