Add Options to Generate Build Metrics Report #1369

divyegala · 2023-03-23T22:02:31Z

This PR will also automatically generate an HTML report in conda-cpp-build CI runs, available under the Upload Additional Artifacts task.

cjnolet · 2023-03-23T22:57:11Z

cc @ahendriksen for review


build.sh

ahendriksen

Looks good! Some small comments.

ahendriksen · 2023-03-24T13:05:57Z

build.sh

@@ -18,7 +18,7 @@ ARGS=$*
 # script, and that this script resides in the repo dir!
 REPODIR=$(cd $(dirname $0); pwd)

-VALIDARGS="clean libraft pylibraft raft-dask docs tests bench clean --uninstall  -v -g -n --compile-lib --allgpuarch --no-nvtx --show_depr_warn -h"
+VALIDARGS="clean libraft pylibraft raft-dask docs tests bench clean --uninstall  -v -g -n --compile-lib --allgpuarch --no-nvtx --show_depr_warn --build_metrics --incl_cache_stats -h"


Nit: Apart from --show_depr_warn, most other options use hyphens (-) to separate words. I would prefer the new flags to follow that convention for consistency.

Can I also change --show_depr_warn then?

I am okay with changing --show_depr_warn to --show-depr-warn for consistency across the build.sh flags of RAFT. @cjnolet, are you okay with this too?
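For illustration, a minimal sketch of how a VALIDARGS whitelist check behaves with the hyphenated spelling. It assumes build.sh validates each argument by matching it as a whole entry against the space-separated VALIDARGS string; the abbreviated flag list and the validate_args helper are hypothetical, not RAFT's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: check each CLI argument against a VALIDARGS whitelist
# that uses the hyphenated spellings proposed in this thread. The flag list is
# abbreviated and validate_args is illustrative, not RAFT's actual code.
VALIDARGS="clean libraft tests -v -g -n --show-depr-warn --build-metrics --incl-cache-stats -h"

validate_args() {
    local a
    for a in "$@"; do
        # Pad both sides with spaces so a prefix such as "--build" cannot
        # match "--build-metrics"; only whole entries count.
        if [[ " ${VALIDARGS} " != *" ${a} "* ]]; then
            echo "Invalid option: ${a}" >&2
            return 1
        fi
    done
}

validate_args libraft --build-metrics --incl-cache-stats && echo "accepted"
validate_args libraft --build_metrics || echo "rejected (underscore spelling)"
```

Once the whitelist switches to hyphens, the old underscore spelling is rejected outright, which is why renaming --show_depr_warn at the same time keeps the flag set uniform.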


ajschmidt8

Looks good.

I left some comments soliciting feedback from you and others.

Let's wait to make any of these changes until we reach consensus, so that we can avoid wasteful iterations.

ajschmidt8 · 2023-03-24T20:32:16Z

build.sh

+    # get the current count before the compile starts
+    CACHE_COMMAND=""
+    if [[ -z "${CACHE_TOOL}" ]]; then
+        # default is sccache for CI
+        CACHE_COMMAND="sccache"
+    else
+        CACHE_COMMAND="${CACHE_TOOL}"
+    fi
+    if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ${CACHE_COMMAND})" ]]; then
+        # zero the cache statistics
+        "${CACHE_COMMAND}" --zero-stats
+    fi


Suggested change

-    # get the current count before the compile starts
-    CACHE_COMMAND=""
-    if [[ -z "${CACHE_TOOL}" ]]; then
-        # default is sccache for CI
-        CACHE_COMMAND="sccache"
-    else
-        CACHE_COMMAND="${CACHE_TOOL}"
-    fi
-    if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ${CACHE_COMMAND})" ]]; then
-        # zero the cache statistics
-        "${CACHE_COMMAND}" --zero-stats
-    fi
+    if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ${CACHE_TOOL:-sccache})" ]]; then
+        "${CACHE_COMMAND}" --zero-stats
+    fi

Two quick comments:

Bash supports default values in parameter expansion (see "Shell Parameter Expansion" in the Bash manual), which we can use to simplify this logic to just "$(command -v ${CACHE_TOOL:-sccache})".

We can omit comments like # zero the cache statistics when using the long form of CLI flags like --zero-stats, since they're a bit redundant.
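A short demonstration of the default-value expansion mentioned above. resolve_cache_tool is a hypothetical helper used only to show the behavior. Note that ${VAR:-default} substitutes the default without assigning it, so CACHE_TOOL itself stays unset or empty.

```shell
#!/usr/bin/env bash
# ${VAR:-default} expands to "default" when VAR is unset OR empty;
# it does not assign anything to VAR.
resolve_cache_tool() {
    echo "${CACHE_TOOL:-sccache}"
}

unset CACHE_TOOL
resolve_cache_tool          # prints: sccache

CACHE_TOOL=""
resolve_cache_tool          # prints: sccache (empty is treated like unset)

CACHE_TOOL="ccache"
resolve_cache_tool          # prints: ccache

# Without the colon, ${VAR-default} only falls back when VAR is unset;
# an empty value is kept as-is.
CACHE_TOOL=""
echo "[${CACHE_TOOL-sccache}]"   # prints: []
```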

Okay! I didn't know about default values in bash. But I think the correct expression here is:

CACHE_COMMAND=""
if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ${CACHE_TOOL:-sccache})" ]]; then
    CACHE_COMMAND=$CACHE_TOOL
    "${CACHE_COMMAND}" --zero-stats
fi

Good point. LGTM.

This change doesn't work. Can you help figure out what is causing the failure? My bash is really weak:

./build.sh: line 394: : command not found
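A likely cause, offered as an assumption rather than a confirmed diagnosis: ${CACHE_TOOL:-sccache} only substitutes the default inside the command -v check. When CACHE_TOOL is unset (the CI default), CACHE_COMMAND=$CACHE_TOOL assigns an empty string, and bash then tries to run a command named "", which produces the ": command not found" message. The sketch below resolves the default once and reuses it; the zero_cache_stats wrapper is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of one possible fix (an assumption about the cause, not a confirmed
# diagnosis): resolve the default once so CACHE_COMMAND is never empty.
# zero_cache_stats is a hypothetical wrapper used for illustration.
zero_cache_stats() {
    # ${CACHE_TOOL:-sccache} falls back to sccache (the CI default) when
    # CACHE_TOOL is unset or empty. Assigning it unconditionally also gives
    # later checks such as [[ ${CACHE_COMMAND} == "sccache" ]] a real value.
    CACHE_COMMAND="${CACHE_TOOL:-sccache}"

    if [[ "${BUILD_REPORT_INCL_CACHE_STATS}" == "ON" && -x "$(command -v "${CACHE_COMMAND}")" ]]; then
        "${CACHE_COMMAND}" --zero-stats
    fi
}
```

Setting CACHE_COMMAND outside the if matters because the post-build section compares CACHE_COMMAND against "sccache" and "ccache" when formatting the hit rate.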

ajschmidt8 · 2023-03-24T21:06:34Z

build.sh

+  if [[ "$BUILD_REPORT_METRICS" == "ON" && -f "${LIBRAFT_BUILD_DIR}/.ninja_log" ]]; then
+      if ! rapids-build-metrics-reporter.py 2> /dev/null && [ ! -f rapids-build-metrics-reporter.py ]; then
+          echo "Downloading rapids-build-metrics-reporter.py"
+          curl -sO https://raw.githubusercontent.com/rapidsai/build-metrics-reporter/v1/rapids-build-metrics-reporter.py
+      fi
+
+      echo "Formatting build metrics"
+      MSG=""
+      # get some sccache/ccache stats after the compile
+      if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" ]]; then
+          if [[ ${CACHE_COMMAND} == "sccache" && -x "$(command -v sccache)" ]]; then
+              COMPILE_REQUESTS=$(sccache -s | grep "Compile requests \+ [0-9]\+$" | awk '{ print $NF }')
+              CACHE_HITS=$(sccache -s | grep "Cache hits \+ [0-9]\+$" | awk '{ print $NF }')
+              HIT_RATE=$(echo - | awk "{printf \"%.2f\n\", $CACHE_HITS / $COMPILE_REQUESTS * 100}")
+              MSG="${MSG}<br/>cache hit rate ${HIT_RATE} %"
+          elif [[ ${CACHE_COMMAND} == "ccache" && -x "$(command -v ccache)" ]]; then
+              CACHE_STATS_LINE=$(ccache -s | grep "Hits: \+ [0-9]\+ / [0-9]\+" | tail -n1)
+              if [[ ! -z "$CACHE_STATS_LINE" ]]; then
+                  CACHE_HITS=$(echo "$CACHE_STATS_LINE" - | awk '{ print $2 }')
+                  COMPILE_REQUESTS=$(echo "$CACHE_STATS_LINE" - | awk '{ print $4 }')
+                  HIT_RATE=$(echo - | awk "{printf \"%.2f\n\", $CACHE_HITS / $COMPILE_REQUESTS * 100}")
+                  MSG="${MSG}<br/>cache hit rate ${HIT_RATE} %"
+              fi
+          fi
+      fi
+      MSG="${MSG}<br/>parallel setting: $PARALLEL_LEVEL"
+      MSG="${MSG}<br/>parallel build time: $compile_total seconds"
+      if [[ -f "${LIBRAFT_BUILD_DIR}/libraft.so" ]]; then
+          LIBRAFT_FS=$(ls -lh ${LIBRAFT_BUILD_DIR}/libraft.so | awk '{print $5}')
+          MSG="${MSG}<br/>libraft.so size: $LIBRAFT_FS"
+      fi
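One fragile spot in the hit-rate lines above: if sccache or ccache reports zero compile requests (for example, right after --zero-stats with nothing compiled), the awk expression divides by zero, and depending on the awk implementation it either aborts or prints a meaningless value. A minimal guarded sketch follows; the hit_rate helper and the n/a fallback are illustrative choices, not part of the PR.

```shell
#!/usr/bin/env bash
# Sketch: compute a cache hit rate as a percentage with two decimals,
# guarding against zero (or missing) compile requests.
# hit_rate and the "n/a" fallback are hypothetical, for illustration only.
hit_rate() {
    local hits="$1" requests="$2"
    # Pass values with -v instead of splicing them into the awk program,
    # and use a BEGIN block so no input (the `echo - |` trick) is needed.
    awk -v h="${hits}" -v r="${requests}" \
        'BEGIN { if (r + 0 == 0) print "n/a"; else printf "%.2f\n", h / r * 100 }'
}

hit_rate 45 60   # prints: 75.00
hit_rate 0 0     # prints: n/a
```

Because an empty string also evaluates to 0 in awk, the same guard covers the case where the grep for "Compile requests" matches nothing.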


I'd like to include @vyasr and @davidwendt in a discussion here about how we can eliminate some of this boilerplate code.

Presumably some of this functionality will be needed by other repositories, so we should strive to make rapids-build-metrics-reporter.py handle as much of this as possible so that we have a single source of truth.

I have some thoughts and proposals here:

I think rapids-build-metrics-reporter.py should handle sccache -s/ccache -s parsing. The lines of code in this PR are already duplicated from cudf's build.sh, so this is a prime candidate for deduping. Perhaps we can expose a --cache-tool/-c flag that will let consumers indicate whether ccache or sccache should be used.

Whenever possible, I'm in favor of avoiding the manual generation of HTML in bash variables (e.g. <br/>). In place of the --msg flag (which appears to take in an HTML string), I think we should create the following flags:

--stat/-s - this flag should be able to be specified multiple times and its value will be a key/value pair that can be added to a final summary table (e.g. --stat="Parallel Setting=$PARALLEL_LEVEL", --stat="Parallel Build Time=$compile_total seconds")

--file-size/-f - this flag should be able to be specified multiple times and its value will be a path to a file whose name and size will be added to a final summary table (e.g. --file-size="${LIBRAFT_BUILD_DIR}/libraft.so")

The sum of these changes should allow us to dedupe quite a bit here and also eliminate the MSG_OUTFILE lines in this PR.
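To make the proposed interface concrete, here is a hypothetical shell sketch of collecting repeated --stat and --file-size flags into one summary table. Neither flag exists in rapids-build-metrics-reporter.py as of this discussion; summarize_stats and its tab-separated output are assumptions used only to show how repeated key/value and file-size entries could be gathered.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the proposed flags: --stat may be repeated with
# "Key=Value" pairs and --file-size may be repeated with file paths. The
# collected rows are printed as a two-column, tab-separated summary table.
summarize_stats() {
    local -a rows=()
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --stat=*)
                rows+=("${1#--stat=}") ;;
            --file-size=*)
                local path="${1#--file-size=}"
                # Reading from stdin makes wc print only the byte count;
                # tr strips the padding some wc implementations add.
                rows+=("$(basename "${path}")=$(wc -c < "${path}" | tr -d ' ') bytes") ;;
            *)
                echo "unknown argument: $1" >&2
                return 1 ;;
        esac
        shift
    done
    local row
    for row in "${rows[@]}"; do
        # Split on the first "=" only, so values may themselves contain "=".
        printf '%s\t%s\n' "${row%%=*}" "${row#*=}"
    done
}
```

For example, summarize_stats --stat="Parallel Setting=$PARALLEL_LEVEL" --file-size="${LIBRAFT_BUILD_DIR}/libraft.so" would print one row per entry, leaving the choice of HTML, CSV, or XML rendering to the reporter rather than to bash string concatenation.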

Once we get a consensus on all of this, I will open an issue to track the rollout of this new script to other repositories as well.

I'd rather have the .py strictly handle the .ninja_log file. It currently supports other output formats besides HTML, including CSV and XML (unittest format?). Both of these are handy at times, and there is no place for the stats in these outputs.
Rather than including the ccache stats in the HTML output, maybe we generate a separate file with the sccache stats in the artifacts directory. The stats were originally incorporated into the HTML output only to minimize the burden on the Jenkins job at the time. Simply redirecting sccache -s > $RAPIDS_ARTIFACTS_DIR/cache-stats.txt seems like a reasonable approach.
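A minimal sketch of the "separate raw file" approach. It assumes RAPIDS_ARTIFACTS_DIR is set by CI and that the tool is chosen the same way as elsewhere in this PR (CACHE_TOOL, defaulting to sccache); both sccache -s and ccache -s print statistics. The save_cache_stats name and the fallback to the current directory are illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: save raw cache statistics as a separate artifact instead of
# embedding them in the HTML report. RAPIDS_ARTIFACTS_DIR is assumed to be
# set by CI; falling back to "." is an illustrative choice.
save_cache_stats() {
    local tool="${CACHE_TOOL:-sccache}"
    local out_dir="${RAPIDS_ARTIFACTS_DIR:-.}"
    if [[ -x "$(command -v "${tool}")" ]]; then
        mkdir -p "${out_dir}"
        # Both sccache and ccache accept -s to show statistics.
        "${tool}" -s > "${out_dir}/cache-stats.txt"
    else
        echo "${tool} not found; skipping cache stats" >&2
    fi
}
```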

So we would entirely eliminate formatting the sccache/ccache statistics and just add them as a raw file for the user to parse?

> Perhaps we can expose a --cache-tool/-c flag

This already exists in RAFT!

If by "raw" you mean the output of sccache -s, then yes.

I think centralizing more of the logic is the right approach as long as it is done in a way that the different pieces of the document (cache stats, ninja log, etc) are composable and extensible by packages that want to add new files etc. I would advocate for starting by building out the different pieces and then figuring out how to combine them afterwards. As long as there is a consistent HTML output the combining shouldn't be too difficult.

For building up the HTML in Python a reasonable low-level approach is building up the HTML tree with xml.etree. A nicer approach might be to define a Jinja template that can be filled in, leaving some arbitrary extra sections that could be appended to so that the output format is extensible.

How about a rapids-reporter.sh script that builds the HTML output the way build.sh currently does and calls the .py to build the ninja HTML? We can look at different ways to combine the results into a single file.

Are you suggesting that we just move the existing logic to another shell script? That doesn't seem to solve the problem of reusability / keeping things DRY. All of the repositories will just have another rapids-reporter.sh script that contains all of the duplicated lines of code still.

> I think centralizing more of the logic is the right approach as long as it is done in a way that the different pieces of the document (cache stats, ninja log, etc) are composable and extensible by packages that want to add new files etc. I would advocate for starting by building out the different pieces and then figuring out how to combine them afterwards. As long as there is a consistent HTML output the combining shouldn't be too difficult.

+1.

> For building up the HTML in Python a reasonable low-level approach is building up the HTML tree with xml.etree. A nicer approach might be to define a Jinja template that can be filled in, leaving some arbitrary extra sections that could be appended to so that the output format is extensible.

In order to keep this script curl-able (to avoid the overhead of a pip package), we should probably try to stick to the Python standard library.

@davidwendt, can you point out where / how the other output formats that you mentioned will be used? As far as I can tell, build.sh only outputs a single format type. Are you trying to preserve the multiple output formats for users who may run this tool manually outside of build.sh?

> Are you suggesting that we just move the existing logic to another shell script? That doesn't seem to solve the problem of reusability / keeping things DRY. All of the repositories will just have another rapids-reporter.sh script that contains all of the duplicated lines of code still.

The existing shell script logic does not necessarily have to move into a new shell script. I just don't want to move this part of the existing shell script into the rapids-build-metrics-reporter.py file. We could move the shell script logic into a new, separate .py file if that helps. I'm suggesting there would be two files in this repository: one to build/format the sccache/ccache results and one to build/format the ninja-log results. And maybe a third file to combine them, if necessary.
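As a rough illustration of the optional third "combine" step, the hypothetical sketch below assumes each tool writes its own self-contained HTML fragment (for example, one for cache stats and one for ninja-log timings) and simply stitches them into one page. The combine_reports name and fragment file names are assumptions; the thread has not settled on a format.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the "combine" step: each tool writes its own HTML
# fragment, and this function stitches the fragments into a single page.
# Function and file names are illustrative only.
combine_reports() {
    local out="$1"; shift
    {
        echo "<html><body>"
        local frag
        for frag in "$@"; do
            # Skip fragments that a given build did not produce
            # (e.g. no cache stats when the cache tool is absent).
            [[ -f "${frag}" ]] && cat "${frag}"
        done
        echo "</body></html>"
    } > "${out}"
}
```

Keeping each producer independent means a repository can add its own fragment without touching the ninja-log or cache-stats code, which matches the composability goal raised earlier in the thread.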

> can you point out where / how the other output formats that you mentioned will be used? As far as I can tell, build.sh only outputs a single format type. Are you trying to preserve the multiple output formats for users who may run this tool manually outside of build.sh?

The other output formats are not being used by CI currently. I'm hoping the xml can be used in the future to fail a build if there is a problem with specific source file compile times. The CSV format has been helpful for me locally so I can load the results into Excel and sort them in different ways as well as compare them with other runs.
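As a rough illustration of failing a build on slow compiles, the sketch below reads .ninja_log directly instead of the reporter's XML output. This is a swapped-in approach, since the XML schema isn't shown here. It assumes the ninja log v5 layout: a "# ninja log v5" header followed by tab-separated start time (ms), end time (ms), mtime, output path, and command hash. check_compile_times and the threshold are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: exit non-zero when any target in a .ninja_log took longer than a
# threshold. Reads the log directly (a swapped-in approach, not the XML
# output discussed above). Assumes the ninja log v5 layout:
# start_ms <TAB> end_ms <TAB> mtime <TAB> output <TAB> command_hash
check_compile_times() {
    local log="$1" max_seconds="$2"
    awk -F '\t' -v max_ms="$((max_seconds * 1000))" '
        /^#/ { next }    # skip the "# ninja log v5" header
        ($2 - $1) > max_ms {
            printf "%s took %.1f s\n", $4, ($2 - $1) / 1000
            slow = 1
        }
        END { exit slow }
    ' "${log}"
}
```

Note that ninja appends to the log across incremental builds, so a real check would likely need to consider only the most recent entry per output.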

I'd like to merge what we have in RAFT at this moment, and when there is consensus on how to share these scripts and how much logic to pluck out we can go ahead and update it. Is everyone okay with that? There are some major changes happening in RAFT right now where it would be very valuable to have these metrics for hindsight.

> I'd like to merge what we have in RAFT at this moment, and when there is consensus on how to share these scripts and how much logic to pluck out we can go ahead and update it. Is everyone okay with that? There are some major changes happening in RAFT right now where it would be very valuable to have these metrics for hindsight.

That's fine.

Feel free to fix the merge conflicts and make any other changes necessary to this PR and I will approve.

I will open an issue in https://github.com/rapidsai/build-metrics-reporter to continue this discussion.

ajschmidt8

Looks like there are some errors that might need to be sorted out, but I'm pre-approving in anticipation of those fixes.

divyegala · 2023-03-30T18:38:47Z

/merge

adding build metrics report

5f4c4f3

divyegala added the feature request, non-breaking, and ci labels Mar 23, 2023

cjnolet assigned divyegala Mar 23, 2023

divyegala added 4 commits March 23, 2023 16:13

fix invalid arg

7ce6f65

correctly order build metrics report

7b729ad

add RAPIDS_ARTIFACTS_DIR to libraft meta

3b566e3

Merge remote-tracking branch 'upstream/branch-23.04' into build-metrics-report

1da41c9

divyegala marked this pull request as ready for review March 24, 2023 02:18

divyegala requested review from a team as code owners March 24, 2023 02:18

divyegala requested a review from ahendriksen March 24, 2023 02:19

cjnolet reviewed Mar 24, 2023


cjnolet reviewed Mar 24, 2023


ahendriksen reviewed Mar 24, 2023


divyegala added 2 commits March 24, 2023 10:15

address review, add ccache option

df9a539

Merge remote-tracking branch 'upstream/branch-23.04' into build-metrics-report

229ceb8

ajschmidt8 reviewed Mar 24, 2023


ajschmidt8 mentioned this pull request Mar 29, 2023

User interface discussion rapidsai/build-metrics-reporter#4 (open)

divyegala added 2 commits March 29, 2023 13:51

address review

9c612da

merge upstream

89a6f57

ajschmidt8 approved these changes Mar 29, 2023


cjnolet approved these changes Mar 30, 2023


fix error

d18fb7d

rapids-bot bot merged commit e456207 into rapidsai:branch-23.04 Mar 30, 2023


divyegala commented Mar 23, 2023 (edited)

cjnolet commented Mar 23, 2023

ahendriksen left a comment

ahendriksen Mar 24, 2023

divyegala Mar 24, 2023

ahendriksen Mar 24, 2023

ajschmidt8 left a comment

ajschmidt8 Mar 24, 2023

divyegala Mar 28, 2023

ajschmidt8 Mar 29, 2023

divyegala Mar 29, 2023

ajschmidt8 Mar 24, 2023

davidwendt Mar 24, 2023

divyegala Mar 24, 2023

divyegala Mar 24, 2023

davidwendt Mar 24, 2023

vyasr Mar 27, 2023

ajschmidt8 Mar 27, 2023

davidwendt Mar 27, 2023

divyegala Mar 29, 2023

ajschmidt8 Mar 29, 2023

ajschmidt8 left a comment

divyegala commented Mar 30, 2023
