make train test traceable when error occurs #243

Merged (4 commits) on Oct 28, 2019
Changes from 2 commits
2 changes: 1 addition & 1 deletion .ci-cd/build.sh
@@ -41,7 +41,7 @@ function check_style() {

 function test() {
     cd alf
-    python3 -m unittest discover -p "*_test.py" -v
+    python3 -m unittest discover -p "*_test.py" -v
     cd ..
 }

20 changes: 16 additions & 4 deletions alf/bin/train_test.py
@@ -18,7 +18,9 @@
import subprocess
from pathlib import Path
import numpy as np
import sys

import logging as sys_logging
Contributor:
Not used

Contributor Author:
removed

from absl import logging
import tensorflow as tf

@@ -34,17 +36,27 @@ def run_and_stream(cmd, cwd):
        cwd (str): working directory for the process
    """
    logging.info("Running %s", " ".join(cmd))

    # Create a logger for subprocess output:
    # 1. Log all subprocess output to sys.stderr so that a failure is traceable
    #    (CI suppresses stdout to keep the log file under 4 MB).
    # 2. Use a plain formatter with no prefix, because each subprocess log line
    #    already has its own log prefix.
    logger = logging.ABSLLogger('')
Contributor:
Instead of changing the code here, can we simply send both stdout and stderr to stderr in build.sh?

python3 -m unittest discover -p "*_test.py" -v 1>&2

If not, we need a comment in the code explaining why all of this is needed, so that future readers can understand it.

Contributor Author (@witwolf, Oct 25, 2019):
Yes, it can be done with python3 -m unittest discover -p "*_test.py" -v 1>&2,
but there is a potential problem: we could hit "Log length exceeded 4 MB" if we log all of stdout (the log file is now about 3.2 MB).

Contributor:
I see. Then please add some comments here.

Contributor Author:
added
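
The trade-off in this thread (only the subprocess output is sent to stderr, so the CI log stays under its size limit) can be sketched with the standard library alone. This is a minimal sketch rather than the PR's code: it uses `logging.getLogger` instead of absl's `ABSLLogger`, and the names `make_passthrough_logger` and `subprocess_output` are hypothetical.

```python
import io
import logging
import sys


def make_passthrough_logger(name="subprocess_output", stream=None):
    """Return a logger that writes bare messages to `stream` (default: sys.stderr).

    Hypothetical helper for illustration; the PR itself builds an absl logger.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not duplicate lines through the root logger
    # Replace existing handlers so repeated calls do not stack them up.
    for old in list(logger.handlers):
        logger.removeHandler(old)
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    # '%(message)s' adds no prefix: the child's lines already carry their own.
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = make_passthrough_logger(stream=buffer)
    log.info("I1025 12:00:00 train.py:42] step 1")
    print(repr(buffer.getvalue()))
```

Setting `propagate = False` is a design choice in this sketch: without it, a configured root logger would print each line a second time with its own prefix.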

    handler = sys_logging.StreamHandler(sys.stderr)
    handler.setFormatter(sys_logging.Formatter('%(message)s'))
    logger.addHandler(handler)

    process = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=cwd)

    while process.poll() is None:
        with io.TextIOWrapper(process.stdout, encoding="utf-8") as text_io:
            for line in text_io:
-               logging.info(line.strip())
+               logger.info(line.strip())
Contributor:
Now the original stdout goes to stderr, and the original stderr goes to stdout (because stderr=subprocess.STDOUT at line 50). So I am still confused. And why not simply use stdout=subprocess.STDERR at line 50?
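
For reference on the question above: with `stdout=subprocess.PIPE, stderr=subprocess.STDOUT`, the child's stderr is merged into the same pipe as its stdout, so the parent reads both streams together and decides where to re-emit them. The `subprocess` module defines `PIPE`, `STDOUT`, and `DEVNULL`, but no `STDERR` constant. A minimal sketch, in which `collect_merged_output` is a hypothetical helper and the child command is only an illustration:

```python
import subprocess
import sys


def collect_merged_output(cmd):
    """Run `cmd` and return (returncode, lines), with stderr merged into stdout."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # child's stderr is pointed at the stdout pipe
        universal_newlines=True)
    # Iterating the pipe ends at EOF, i.e. once the child has closed its end.
    lines = [line.rstrip("\n") for line in process.stdout]
    process.stdout.close()
    returncode = process.wait()
    return returncode, lines


if __name__ == "__main__":
    child = [
        sys.executable, "-c",
        "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
    ]
    code, lines = collect_merged_output(child)
    for line in lines:
        # The parent chooses the destination; here everything goes to stderr.
        print(line, file=sys.stderr)
```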


    if process.returncode != 0:
        logging.error("cmd: {0} exited with code {1}".format(
            " ".join(cmd), process.returncode))
    assert process.returncode == 0, ("cmd: {0} exit abnormally".format(
        " ".join(cmd)))

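One possible alternative to polling plus `assert` (assert statements are stripped under `python -O`) is to read the pipe to EOF, call `wait()`, and raise `subprocess.CalledProcessError` on a non-zero exit code. This is a hedged sketch, not the PR's implementation: `run_checked` is a hypothetical name, and the `emit` callback stands in for the logger.

```python
import subprocess
import sys


def run_checked(cmd, cwd=None, emit=print):
    """Stream merged stdout/stderr of `cmd` line by line, then check its exit code."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        cwd=cwd,
        universal_newlines=True)
    with process.stdout:
        for line in process.stdout:  # ends at EOF, when the child closes the pipe
            emit(line.strip())
    returncode = process.wait()  # reap the child; the exit code is now available
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
    return returncode


if __name__ == "__main__":
    run_checked([sys.executable, "-c", "print('ok')"])
    try:
        run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
    except subprocess.CalledProcessError as err:
        print("failed with code", err.returncode, file=sys.stderr)
```

Reading to EOF once and then calling `wait()` avoids re-wrapping `process.stdout` on a later loop iteration after the wrapper has already closed it.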

def get_metrics_from_eval_tfevents(eval_dir):