Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Options to Generate Build Metrics Report #1369

Merged
merged 10 commits into from
Mar 30, 2023

Conversation

divyegala
Copy link
Member

@divyegala divyegala commented Mar 23, 2023

This PR will also automatically generate an HTML report in conda-cpp-build CI runs under the task Upload Additional Artifacts

@divyegala divyegala added feature request New feature or request non-breaking Non-breaking change ci labels Mar 23, 2023
@cjnolet
Copy link
Member

cjnolet commented Mar 23, 2023

cc @ahendriksen for review

@divyegala divyegala marked this pull request as ready for review March 24, 2023 02:18
@divyegala divyegala requested review from a team as code owners March 24, 2023 02:18
build.sh Outdated Show resolved Hide resolved
build.sh Outdated Show resolved Hide resolved
build.sh Outdated Show resolved Hide resolved
Copy link
Contributor

@ahendriksen ahendriksen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some small comments.

build.sh Outdated
@@ -18,7 +18,7 @@ ARGS=$*
# script, and that this script resides in the repo dir!
REPODIR=$(cd $(dirname $0); pwd)

VALIDARGS="clean libraft pylibraft raft-dask docs tests bench clean --uninstall -v -g -n --compile-lib --allgpuarch --no-nvtx --show_depr_warn -h"
VALIDARGS="clean libraft pylibraft raft-dask docs tests bench clean --uninstall -v -g -n --compile-lib --allgpuarch --no-nvtx --show_depr_warn --build_metrics --incl_cache_stats -h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Apart from --show_depr_warn, most other options use -. I would prefer this for consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I also change --show_depr_warn then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with changing --show_depr_warn to --show-depr-warn for consistency in the build.sh flags of RAFT. @cjnolet : are you okay too?

build.sh Show resolved Hide resolved
build.sh Outdated Show resolved Hide resolved
Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I left some comments soliciting feedback from you and others.

Let's wait to make any of these changes until we get some consensus so that we can avoid wasteful iterations.

build.sh Outdated
Comment on lines 361 to 372
# get the current count before the compile starts
CACHE_COMMAND=""
if [[ -z "${CACHE_TOOL}" ]]; then
# default is sccache for CI
CACHE_COMMAND="sccache"
else
CACHE_COMMAND="${CACHE_TOOL}"
fi
if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ${CACHE_COMMAND})" ]]; then
# zero the cache statistics
"${CACHE_COMMAND}" --zero-stats
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# get the current count before the compile starts
CACHE_COMMAND=""
if [[ -z "${CACHE_TOOL}" ]]; then
# default is sccache for CI
CACHE_COMMAND="sccache"
else
CACHE_COMMAND="${CACHE_TOOL}"
fi
if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ${CACHE_COMMAND})" ]]; then
# zero the cache statistics
"${CACHE_COMMAND}" --zero-stats
fi
if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ${CACHE_TOOL:-sccache})" ]]; then
"${CACHE_COMMAND}" --zero-stats
fi

Two quick comments:

  • Bash has the concept of default values that we can use to simplify this logic (src) to just "$(command -v ${CACHE_TOOL:-sccache})"
  • We can omit comments like # zero the cache statistics when using the long form of CLI flags like --zero-stats since they're a bit redundant

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! I didn't know about default values in bash. But I think the correct expression here is:

    CACHE_COMMAND=""
    if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ${CACHE_TOOL:-sccache})" ]]; then
        CACHE_COMMAND=$CACHE_TOOL
        "${CACHE_COMMAND}" --zero-stats
    fi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. LGTM.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't work. Can you verify what the failure is about? My bash is really weak:

./build.sh: line 394: : command not found

build.sh Outdated
Comment on lines 402 to 432
if [[ "$BUILD_REPORT_METRICS" == "ON" && -f "${LIBRAFT_BUILD_DIR}/.ninja_log" ]]; then
if ! rapids-build-metrics-reporter.py 2> /dev/null && [ ! -f rapids-build-metrics-reporter.py ]; then
echo "Downloading rapids-build-metrics-reporter.py"
curl -sO https://github.com/raw/rapidsai/build-metrics-reporter/v1/rapids-build-metrics-reporter.py
fi

echo "Formatting build metrics"
MSG=""
# get some sccache/ccache stats after the compile
if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" ]]; then
if [[ ${CACHE_COMMAND} == "sccache" && -x "$(command -v sccache)" ]]; then
COMPILE_REQUESTS=$(sccache -s | grep "Compile requests \+ [0-9]\+$" | awk '{ print $NF }')
CACHE_HITS=$(sccache -s | grep "Cache hits \+ [0-9]\+$" | awk '{ print $NF }')
HIT_RATE=$(echo - | awk "{printf \"%.2f\n\", $CACHE_HITS / $COMPILE_REQUESTS * 100}")
MSG="${MSG}<br/>cache hit rate ${HIT_RATE} %"
elif [[ ${CACHE_COMMAND} == "ccache" && -x "$(command -v ccache)" ]]; then
CACHE_STATS_LINE=$(ccache -s | grep "Hits: \+ [0-9]\+ / [0-9]\+" | tail -n1)
if [[ ! -z "$CACHE_STATS_LINE" ]]; then
CACHE_HITS=$(echo "$CACHE_STATS_LINE" - | awk '{ print $2 }')
COMPILE_REQUESTS=$(echo "$CACHE_STATS_LINE" - | awk '{ print $4 }')
HIT_RATE=$(echo - | awk "{printf \"%.2f\n\", $CACHE_HITS / $COMPILE_REQUESTS * 100}")
MSG="${MSG}<br/>cache hit rate ${HIT_RATE} %"
fi
fi
fi
MSG="${MSG}<br/>parallel setting: $PARALLEL_LEVEL"
MSG="${MSG}<br/>parallel build time: $compile_total seconds"
if [[ -f "${LIBRAFT_BUILD_DIR}/libraft.so" ]]; then
LIBRAFT_FS=$(ls -lh ${LIBRAFT_BUILD_DIR}/libraft.so | awk '{print $5}')
MSG="${MSG}<br/>libraft.so size: $LIBRAFT_FS"
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to include @vyasr and @davidwendt in a discussion here about how we can eliminate some of this boilerplate code.

Presumably some of this functionality will be needed by other repositories, so we should strive to make rapids-build-metrics-reporter.py handle as much of this as possible so that we have a single source of truth.

I have some thoughts and proposals here:

  • I think rapids-build-metrics-reporter.py should handle sccache -s/ccache -s parsing. The lines of code in this PR are already duplicated from cudf's build.sh, so this is a prime candidate for deduping. Perhaps we can expose a --cache-tool/-c flag that will let consumers indicate whether ccache or sccache should be used.
  • Whenever possible, I'm in favor of avoiding the manual generation of HTML in bash variables (e.g. <br/>). In place of the --msg flag (which appears to take in an HTML string), I think we should create the following flags:
    • --stat/-s - this flag should be able to be specified multiple times and its value will be a key/value pair that can be added to a final summary table (e.g. --stat="Parallel Setting=$PARALLEL_LEVEL", --stat="Parallel Build Time=$compile_total seconds")
    • --file-size/-f - this flag should be able to be specified multiple times and its value will be a path to a file whose name and size will be added to a final summary table (e.g. --file-size="${LIBRAFT_BUILD_DIR}/libraft.so")

The sum of these changes should allow us to dedupe quite a bit here and also eliminate the MSG_OUTFILE lines in this PR.

Once we get a consensus on all of this, I will open an issue to track the rollout of this new script to other repositories as well.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather have the .py strictly handle the .ninja_log file. It currently supports other formats besides html including csv and xml (unittest format?). Both of these are handy at times and and there is no place for the stats in these outputs.
Rather than including the ccache stats in the html output, maybe we generate a separate file with sccache states into the artifacts directory. The reason is what incorporated in the html output originally was to just minimize the burden on the Jenkins job at the time. Just piping the sccache -s > $RAPIDS_ARTIFACT_DIR/cache-stats.txt seems a reasonable approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we entirely eliminate formating sccache/ccache statistics and just add it as a raw file for the user to parse?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can expose a --cache-tool/-c flag

This already exists in RAFT!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are calling the output of sccache -s raw output then yes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think centralizing more of the logic is the right approach as long as it is done in a way that the different pieces of the document (cache stats, ninja log, etc) are composable and extensible by packages that want to add new files etc. I would advocate for starting by building out the different pieces and then figuring out how to combine them afterwards. As long as there is a consistent HTML output the combining shouldn't be too difficult.

For building up the HTML in Python a reasonable low-level approach is building up the HTML tree with xml.etree. A nicer approach might be to define a Jinja template that can be filled in, leaving some arbitrary extra sections that could be appended to so that the output format is extensible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about rapids-reporter.sh which can build the html output like build.sh currently does and calls the .py to build the ninja html? We can look at different ways to combine the results into a single file.

Are you suggesting that we just move the existing logic to another shell script? That doesn't seem to solve the problem of reusability / keeping things DRY. All of the repositories will just have another rapids-reporter.sh script that contains all of the duplicated lines of code still.

I think centralizing more of the logic is the right approach as long as it is done in a way that the different pieces of the document (cache stats, ninja log, etc) are composable and extensible by packages that want to add new files etc. I would advocate for starting by building out the different pieces and then figuring out how to combine them afterwards. As long as there is a consistent HTML output the combining shouldn't be too difficult.

+1.

For building up the HTML in Python a reasonable low-level approach is building up the HTML tree with xml.etree. A nicer approach might be to define a Jinja template that can be filled in, leaving some arbitrary extra sections that could be appended to so that the output format is extensible.

In order to keep this script curl-able (to avoid the overhead of a pip package), we should probably try to stick to Python standard libraries.


@davidwendt, can you point out where / how the other output formats that you mentioned will be used? As far as I can tell, build.sh only outputs a single format type. Are you trying to preserve the multiple output formats for users who may run this tool manually outside of build.sh?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting that we just move the existing logic to another shell script? That doesn't seem to solve the problem of reusability / keeping things DRY. All of the repositories will just have another rapids-reporter.sh script that contains all of the duplicated lines of code still.

The existing shell script logic does not necessarily have move to into a new shell script. I just don't want to move this part of the existing shell script into the rapids-build-metrics-reporter.py file. We could move the shell script logic into a new separate .py file if that helps. I'm suggesting there will be 2 files in this repository -- one to build/format the sccache/ccache results and one to build/format the ninja-log results. And maybe a 3rd file to combine them if necessary.

can you point out where / how the other output formats that you mentioned will be used? As far as I can tell, build.sh only outputs a single format type. Are you trying to preserve the multiple output formats for users who may run this tool manually outside of build.sh?

The other output formats are not being used by CI currently. I'm hoping the xml can be used in the future to fail a build if there is a problem with specific source file compile times. The CSV format has been helpful for me locally so I can load the results into Excel and sort them in different ways as well as compare them with other runs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to merge what we have in RAFT at this moment, and when there is consensus on how to share these scripts and how much logic to pluck out we can go ahead and update it. Is everyone okay with that? There are some major changes happening in RAFT right now where it would be very valuable to have these metrics for hindsight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to merge what we have in RAFT at this moment, and when there is consensus on how to share these scripts and how much logic to pluck out we can go ahead and update it. Is everyone okay with that? There are some major changes happening in RAFT right now where it would be very valuable to have these metrics for hindsight.

That's fine.

Feel free to fix the merge conflicts and make any other changes necessary to this PR and I will approve.

I will open an issue in https://github.com/rapidsai/build-metrics-reporter to continue this discussion.

Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like there are some errors that might need to be sorted out, but pre-approving in anticipation of those fixes.

@divyegala
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit e456207 into rapidsai:branch-23.04 Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci feature request New feature or request non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

6 participants