
TEST-#5172: Add fuzzydata logs to artifacts #5173

Merged: 4 commits into modin-project:master on Nov 10, 2022

Conversation

suhailrehman (Contributor)

Signed-off-by: Suhail Rehman <suhailrehman@gmail.com>

What do these changes do?

Adds logs to each fuzzydata artifact generated by the CI pipeline. The artifact zip should contain a run.log file that details the exact sequence of operations and randomization parameters used in that fuzzydata run.
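
For illustration only, here is a minimal sketch of how the generated run.log could be inspected locally. The check_run_log helper is hypothetical (not part of this PR); the path and the log line patterns are taken from the test command and captured log shown later in this thread.

import os

def check_run_log(log_dir: str) -> None:
    # Hypothetical helper: print the operation sequence recorded in a fuzzydata run.log.
    log_path = os.path.join(log_dir, "run.log")
    if not os.path.exists(log_path):
        raise FileNotFoundError(f"expected a run.log at {log_path}")
    with open(log_path) as log_file:
        for line in log_file:
            # fuzzydata's generator logs each chained operation and its arguments at INFO level,
            # e.g. "Chaining Operation: merge" or "Executing current operation list: {...}".
            if "Chaining Operation" in line or "Executing current operation" in line:
                print(line.rstrip())

# Example usage against the directory used by the CI test (assumed path for the Ray engine):
# check_run_log("/tmp/fuzzydata-test-wf-ray")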

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves TEST: CI: Fuzzydata should provide a log dump within the artifacts #5172
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Suhail Rehman <suhailrehman@gmail.com>
@suhailrehman suhailrehman requested a review from a team as a code owner November 1, 2022 16:06
@suhailrehman suhailrehman marked this pull request as draft November 2, 2022 18:46
@suhailrehman (Contributor, Author)

Based on input from @mvashishtha, adding a few more quality-of-life improvements to this PR.

@mvashishtha mvashishtha marked this pull request as ready for review November 4, 2022 20:16
@anmyachev (Collaborator) left a comment


@suhailrehman thanks! LGTM! @mvashishtha it's up to you

@mvashishtha (Collaborator)

Sample logs from a failing run (captured at INFO level) LGTM (this run was against devin-petersohn's commit 42ddd1a780c03094f2c9624c9edb88553574e824 on his fork of Modin):

Command:

python -m pytest modin/experimental/fuzzydata/test/test_fuzzydata.py -Wignore::UserWarning --log-file=/tmp/fuzzydata-test-wf-ray/run.log --log-file-level=INFO
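For reference, the same invocation can be expressed programmatically through pytest's standard pytest.main API; a sketch using exactly the flags above (the paths are the reviewer's local ones, not something this PR prescribes):

import pytest

# Programmatic equivalent of the command above.
exit_code = pytest.main([
    "modin/experimental/fuzzydata/test/test_fuzzydata.py",
    "-Wignore::UserWarning",
    "--log-file=/tmp/fuzzydata-test-wf-ray/run.log",
    "--log-file-level=INFO",
])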

Failure logs:

================================================================= test session starts ==================================================================
platform darwin -- Python 3.10.6, pytest-7.1.2, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/maheshvashishtha/software_sources/modin, configfile: setup.cfg
plugins: benchmark-3.4.1, xdist-2.5.0, forked-1.4.0, dash-2.6.1, Faker-13.15.1, repeat-0.9.1, cov-3.0.0
collected 1 item

modin/experimental/fuzzydata/test/test_fuzzydata.py F [100%]CoverageWarning: Couldn't parse '/Users/maheshvashishtha/software_sources/modin/modin/core/execution/container.py': No source for code: '/Users/maheshvashishtha/software_sources/modin/modin/core/execution/container.py'. (couldnt-parse)

======================================================================= FAILURES =======================================================================
____________________________________________________________ test_fuzzydata_sample_workflow ____________________________________________________________

def test_fuzzydata_sample_workflow():
    # Workflow Generation Options
    wf_name = str(uuid.uuid4())[:8]  # Unique name for the generated workflow
    num_versions = 10  # Number of unique CSV files to generate
    cols = 33  # Columns in Base Artifact
    rows = 1000  # Rows in Base Artifact
    bfactor = 1.0  # Branching Factor - 0.1 is linear, 10.0 is star-like
    exclude_ops = ["groupby"]  # In-Memory groupby operations cause issue #4287
    matfreq = 2  # How many operations to chain before materialization

    engine = Engine.get().lower()

    # Create Output Directory for Workflow Data
    base_out_directory = (
        f"/tmp/fuzzydata-test-wf-{engine}/"  # Must match corresponding github-action
    )
    if os.path.exists(base_out_directory):
        shutil.rmtree(base_out_directory)
    output_directory = f"{base_out_directory}/{wf_name}/"
    os.makedirs(output_directory, exist_ok=True)

    # Start Workflow Generation
  workflow = generate_workflow(
        workflow_class=ModinWorkflow,
        name=wf_name,
        num_versions=num_versions,
        base_shape=(cols, rows),
        out_directory=output_directory,
        bfactor=bfactor,
        exclude_ops=exclude_ops,
        matfreq=matfreq,
        wf_options={"modin_engine": engine},
    )

modin/experimental/fuzzydata/test/test_fuzzydata.py:46:


../../opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/fuzzydata/core/generator.py:379: in generate_workflow
raise e
../../opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/fuzzydata/core/generator.py:367: in generate_workflow
wf.execute_current_operation(next_label)
../../opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/fuzzydata/core/workflow.py:168: in execute_current_operation
new_artifact = self.current_operation.execute(new_label)
../../opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/fuzzydata/core/operation.py:174: in execute
logger.debug(f"After Op: {result.to_df()}")
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/pandas/base.py:3380: in __str__
return repr(self)
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/pandas/dataframe.py:224: in __repr__
result = repr(self._build_repr_df(num_rows, num_cols))
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/pandas/base.py:190: in _build_repr_df
return self.iloc[indexer]._query_compiler.to_pandas()
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/core/storage_formats/pandas/query_compiler.py:286: in to_pandas
return self._modin_frame.to_pandas()
modin/logging/logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin/core/dataframe/pandas/dataframe/dataframe.py:124: in run_f_on_minimally_updated_metadata
result = f(self, *args, **kwargs)
modin/core/dataframe/pandas/dataframe/dataframe.py:3043: in to_pandas
ErrorMessage.catch_bugs_and_request_email(


cls = <class 'modin.error_message.ErrorMessage'>, failure_condition = True, extra_log = 'Internal and external indices on axis 0 do not match.'

@classmethod
def catch_bugs_and_request_email(
    cls, failure_condition: bool, extra_log: str = ""
) -> None:
    if failure_condition:
        get_logger().info(f"Modin Error: Internal Error: {extra_log}")
      raise Exception(
            "Internal Error. "
            + "Please visit https://github.com/modin-project/modin/issues "
            + "to file an issue with the traceback and the command that "
            + "caused this error. If you can't file a GitHub issue, "
            + f"please email bug_reports@modin.org.\n{extra_log}"
        )

E Exception: Internal Error. Please visit https://github.com/modin-project/modin/issues to file an issue with the traceback and the command that caused this error. If you can't file a GitHub issue, please email bug_reports@modin.org.
E Internal and external indices on axis 0 do not match.

modin/error_message.py:80: Exception
----------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------
materializing with code self.sources[0].table[['NBKE2__sha1', '3QE6r__military_state', '3ppzZ__random_number', 'abpBL__ean']].sample(frac=0.8)
materializing with code self.sources[0].table[['VSmyB__random_int', 'abpBL__ean']].merge(self.sources[1].table, on="abpBL__ean")
materializing with code self.sources[0].table.sample(frac=0.74).sample(frac=0.71)
----------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------
2022-11-10 06:54:53,978 INFO worker.py:1519 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
------------------------------------------------------------------ Captured log call -------------------------------------------------------------------
INFO fuzzydata.core.workflow:workflow.py:53 Creating new Workflow 8c10cc63
INFO fuzzydata.core.generator:generator.py:72 Generating base df with 1000 rows and 33 columns
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_0), initializing operation chain
INFO fuzzydata.core.generator:generator.py:70 Generating right-merge df df with 1000 rows and 18 columns
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: merge
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_0), initializing operation chain
INFO fuzzydata.core.generator:generator.py:70 Generating right-merge df df with 1000 rows and 14 columns
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: merge
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_2), initializing operation chain
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: project
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: sample
INFO fuzzydata.core.generator:generator.py:365 Executing current operation list: {'sources': ['artifact_2'], 'new_label': None, 'op_list': [{'op': 'project', 'args': {'output_cols': ['NBKE2__sha1', '3QE6r__military_state', '3ppzZ__random_number', 'abpBL__ean']}}, {'op': 'sample', 'args': {'frac': 0.8}}]}
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_2), initializing operation chain
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: project
INFO fuzzydata.core.generator:generator.py:70 Generating right-merge df df with 1000 rows and 11 columns
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: merge
INFO fuzzydata.core.generator:generator.py:365 Executing current operation list: {'sources': ['artifact_2', 'artifact_4'], 'new_label': None, 'op_list': [{'op': 'project', 'args': {'output_cols': ['VSmyB__random_int', 'abpBL__ean']}}, {'op': 'merge', 'args': {'key_col': 'abpBL__ean'}}]}
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_5), initializing operation chain
INFO fuzzydata.core.generator:generator.py:70 Generating right-merge df df with 1000 rows and 12 columns
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: merge
INFO fuzzydata.core.generator:generator.py:301 Selected Artifact: Artifact(label=artifact_6), initializing operation chain
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: sample
INFO fuzzydata.core.generator:generator.py:339 Chaining Operation: sample
INFO fuzzydata.core.generator:generator.py:365 Executing current operation list: {'sources': ['artifact_6'], 'new_label': None, 'op_list': [{'op': 'sample', 'args': {'frac': 0.74}}, {'op': 'sample', 'args': {'frac': 0.71}}]}
INFO modin.logger.default:error_message.py:79 Modin Error: Internal Error: Internal and external indices on axis 0 do not match.
ERROR fuzzydata.core.generator:generator.py:376 Error during generation, stopping...
ERROR fuzzydata.core.generator:generator.py:377 Writing out all files to /tmp/fuzzydata-test-wf-ray//8c10cc63/

---------- coverage: platform darwin, python 3.10.6-final-0 ----------
Coverage XML written to file coverage.xml

=============================================================== short test summary info ================================================================
FAILED modin/experimental/fuzzydata/test/test_fuzzydata.py::test_fuzzydata_sample_workflow - Exception: Internal Error. Please visit https://github.c...

@mvashishtha mvashishtha merged commit d2ae95f into modin-project:master Nov 10, 2022