
TEST-#5172: Add fuzzydata logs to artifacts #5173

Merged: 4 commits into modin-project:master on Nov 10, 2022

Conversation

suhailrehman (Contributor)

Signed-off-by: Suhail Rehman <suhailrehman@gmail.com>

What do these changes do?

Adds logs to each fuzzydata artifact generated by the CI pipeline. The artifact zip should contain a run.log file that details the exact sequence of operations and randomization parameters used in that fuzzydata run.
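
For illustration only, here is a minimal sketch of how the generated run.log could be inspected locally. The check_run_log helper is hypothetical (not part of this PR); the path and the log line patterns are taken from the test command and captured log shown later in this thread.

import os

def check_run_log(log_dir: str) -> None:
    # Hypothetical helper: print the operation sequence recorded in a fuzzydata run.log.
    log_path = os.path.join(log_dir, "run.log")
    if not os.path.exists(log_path):
        raise FileNotFoundError(f"expected a run.log at {log_path}")
    with open(log_path) as log_file:
        for line in log_file:
            # fuzzydata's generator logs each chained operation and its arguments at INFO level,
            # e.g. "Chaining Operation: merge" or "Executing current operation list: {...}".
            if "Chaining Operation" in line or "Executing current operation" in line:
                print(line.rstrip())

# Example usage against the directory used by the CI test (assumed path for the Ray engine):
# check_run_log("/tmp/fuzzydata-test-wf-ray")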

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves TEST: CI: Fuzzydata should provide a log dump within the artifacts #5172
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Suhail Rehman <suhailrehman@gmail.com>
@suhailrehman suhailrehman requested a review from a team as a code owner November 1, 2022 16:06
@suhailrehman suhailrehman marked this pull request as draft November 2, 2022 18:46
@suhailrehman (Contributor, Author)

Based on input from @mvashishtha, adding a few more quality-of-life improvements to this PR.

@mvashishtha mvashishtha marked this pull request as ready for review November 4, 2022 20:16
@anmyachev (Collaborator) left a comment


@suhailrehman thanks! LGTM! @mvashishtha it's up to you

@mvashishtha (Collaborator)

Sample logs from a failing run (captured at INFO level) LGTM (this run was against devin-petersohn's commit 42ddd1a780c03094f2c9624c9edb88553574e824 on his fork of Modin):

Command:

python -m pytest modin/experimental/fuzzydata/test/test_fuzzydata.py -Wignore::UserWarning --log-file=/tmp/fuzzydata-test-wf-ray/run.log --log-file-level=INFO
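For reference, the same invocation can be expressed programmatically through pytest's standard pytest.main API; a sketch using exactly the flags above (the paths are the reviewer's local ones, not something this PR prescribes):

import pytest

# Programmatic equivalent of the command above.
exit_code = pytest.main([
    "modin/experimental/fuzzydata/test/test_fuzzydata.py",
    "-Wignore::UserWarning",
    "--log-file=/tmp/fuzzydata-test-wf-ray/run.log",
    "--log-file-level=INFO",
])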

Failure logs:

================================================================= test session starts ==================================================================
platform darwin -- Python 3.10.6, pytest-7.1.2, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/maheshvashishtha/software_sources/modin, configfile: setup.cfg
plugins: benchmark-3.4.1, xdist-2.5.0, forked-1.4.0, dash-2.6.1, Faker-13.15.1, repeat-0.9.1, cov-3.0.0
collected 1 item

modin/experimental/fuzzydata/test/test_fuzzydata.py F [100%]CoverageWarning: Couldn't parse '/Users/maheshvashishtha/software_sources/modin/modin/core/execution/container.py': No source for code: '/Users/maheshvashishtha/software_sources/modin/modin/core/execution/container.py'. (couldnt-parse)

======================================================================= FAILURES =======================================================================
____________________________________________________________ test_fuzzydata_sample_workflow ____________________________________________________________

def test_fuzzydata_sample_workflow():
    # Workflow Generation Options
    wf_name = str(uuid.uuid4())[:8]  # Unique name for the generated workflow
    num_versions = 10  # Number of unique CSV files to generate
    cols = 33  # Columns in Base Artifact
    rows = 1000  # Rows in Base Artifact
    bfactor = 1.0  # Branching Factor - 0.1 is linear, 10.0 is star-like
    exclude_ops = ["groupby"]  # In-Memory groupby operations cause issue #4287
    matfreq = 2  # How many operations to chain before materialization

    engine = Engine.get().lower()

    # Create Output Directory for Workflow Data
    base_out_directory = (
        f"/tmp/fuzzydata-test-wf-{engine}/"  # Must match corresponding github-action
    )
    if os.path.exists(base_out_directory):
        shutil.rmtree(base_out_directory)
    output_directory = f"{base_out_directory}/{wf_name}/"
    os.makedirs(output_directory, exist_ok=True)

    # Start Workflow Generation
  workflow = generate_workflow(
        workflow_class=ModinWorkflow,
        name=wf_name,
        num_versions=num_versions,
        base_shape=(cols, rows),
        out_directory=output_directory,
        bfactor=bfactor,
        exclude_ops=exclude_ops,
        matfreq=matfreq,
        wf_options={"modin_engine": engine},
    )

modin/experimental/fuzzydata/test/test_fuzzydata.py:46:


../../opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/fuzzydata/core/generator.py:379: in generate_workflow
raise e
../../opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/fuzzydata/core/generator.py:367: in generate_workflow
wf.execute_current_operation(next_label)
../../opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/fuzzydata/core/workflow.py:168: in execute_current_operation
new_artifact = self.current_operation.execute(new_label)
../../opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/fuzzydata/core/operation.py:174: in execute
logger.debug(f"After Op: {result.to_df()}")
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/pandas/base.py:3380: in __str__
return repr(self)
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/pandas/dataframe.py:224: in __repr__
result = repr(self._build_repr_df(num_rows, num_cols))
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/pandas/base.py:190: in _build_repr_df
return self.iloc[indexer]._query_compiler.to_pandas()
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/core/storage_formats/pandas/query_compiler.py:286: in to_pandas
return self._modin_frame.to_pandas()
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/core/dataframe/pandas/dataframe/dataframe.py:124: in run_f_on_minimally_updated_metadata
result = f(self, *args, **kwargs)
modin/core/dataframe/pandas/dataframe/dataframe.py:3043: in to_pandas
ErrorMessage.catch_bugs_and_request_email(


cls = <class 'modin.error_message.ErrorMessage'>, failure_condition = True, extra_log = 'Internal and external indices on axis 0 do not match.'

@classmethod
def catch_bugs_and_request_email(
    cls, failure_condition: bool, extra_log: str = ""
) -> None:
    if failure_condition:
        get_logger().info(f"Modin Error: Internal Error: {extra_log}")
      raise Exception(
            "Internal Error. "
            + "Please visit https://github.com/modin-project/modin/issues "
            + "to file an issue with the traceback and the command that "
            + "caused this error. If you can't file a GitHub issue, "
            + f"please email bug_reports@modin.org.\n{extra_log}"
        )

E Exception: Internal Error. Please visit https://github.com/modin-project/modin/issues to file an issue with the traceback and the command that caused this error. If you can't file a GitHub issue, please email bug_reports@modin.org.
E Internal and external indices on axis 0 do not match.

modin/error_message.py:80: Exception
----------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------
materializing with code self.sources[0].table[['NBKE2__sha1', '3QE6r__military_state', '3ppzZ__random_number', 'abpBL__ean']].sample(frac=0.8)
materializing with code self.sources[0].table[['VSmyB__random_int', 'abpBL__ean']].merge(self.sources[1].table, on="abpBL__ean")
materializing with code self.sources[0].table.sample(frac=0.74).sample(frac=0.71)
----------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------
2022-11-10 06:54:53,978 INFO worker.py:1519 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
------------------------------------------------------------------ Captured log call -------------------------------------------------------------------
INFO fuzzydata.core.workflow:workflow.py:53 Creating new Workflow 8c10cc63
INFO fuzzydata.core.generator:generator.py:72 Generating base df with 1000 rows and 33 columns
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_0), initializing operation chain
INFO fuzzydata.core.generator:generator.py:70 Generating right-merge df df with 1000 rows and 18 columns
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: merge
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_0), initializing operation chain
INFO fuzzydata.core.generator:generator.py:70 Generating right-merge df df with 1000 rows and 14 columns
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: merge
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_2), initializing operation chain
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: project
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: sample
INFO fuzzydata.core.generator:generator.py:365 Executing current operation list: {'sources': ['artifact_2'], 'new_label': None, 'op_list': [{'op': 'project', 'args': {'output_cols': ['NBKE2__sha1', '3QE6r__military_state', '3ppzZ__random_number', 'abpBL__ean']}}, {'op': 'sample', 'args': {'frac': 0.8}}]}
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_2), initializing operation chain
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: project
INFO fuzzydata.core.generator:generator.py:70 Generating right-merge df df with 1000 rows and 11 columns
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: merge
INFO fuzzydata.core.generator:generator.py:365 Executing current operation list: {'sources': ['artifact_2', 'artifact_4'], 'new_label': None, 'op_list': [{'op': 'project', 'args': {'output_cols': ['VSmyB__random_int', 'abpBL__ean']}}, {'op': 'merge', 'args': {'key_col': 'abpBL__ean'}}]}
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_5), initializing operation chain
INFO fuzzydata.core.generator:generator.py:70 Generating right-merge df df with 1000 rows and 12 columns
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: merge
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_6), initializing operation chain
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: sample
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: sample
INFO fuzzydata.core.generator:generator.py:365 Executing current operation list: {'sources': ['artifact_6'], 'new_label': None, 'op_list': [{'op': 'sample', 'args': {'frac': 0.74}}, {'op': 'sample', 'args': {'frac': 0.71}}]}
INFO modin.logger.default:error_message.py:79 Modin Error: Internal Error: Internal and external indices on axis 0 do not match.
ERROR fuzzydata.core.generator:generator.py:376 Error during generation, stopping...
ERROR fuzzydata.core.generator:generator.py:377 Writing out all files to /tmp/fuzzydata-test-wf-ray//8c10cc63/

---------- coverage: platform darwin, python 3.10.6-final-0 ----------
Coverage XML written to file coverage.xml

=============================================================== short test summary info ================================================================
FAILED modin/experimental/fuzzydata/test/test_fuzzydata.py::test_fuzzydata_sample_workflow - Exception: Internal Error. Please visit https://github.c...

@mvashishtha mvashishtha merged commit d2ae95f into modin-project:master Nov 10, 2022