Releases: modin-project/modin

Modin 0.23.1

22 Aug 10:02
0.23.1
b5545c6

This release contains fixes that improve Modin's performance for both the NumPy and pandas APIs, and it removes the experimental Modin in the Cloud feature. It also includes upgrades to Modin's testing suite that significantly speed up CI.

Key Features and Updates Since 0.23.0

  • Stability and Bugfixes
    • FIX-#0000: don't test experimental xgboost with Ray nightly build (#6424)
    • FIX-#0000: fix xgboost tests with ray>2.6.0 (#6425)
    • FIX-#1930: Fix one of the cases of heterogeneous data for read_csv (#5507)
    • FIX-#4580: Fix access by row label in query and eval (#6488)
    • FIX-#5627: Stop checking temp_df.dtype == 'category' (#6360)
    • FIX-#5972: compute correct dtype for Series.str.find/index/rfind/rindex (#6426)
    • FIX-#6219: don't default to pandas for 'copy' on empty DataFrame/Series objects (#6371)
    • FIX-#6299: array method always returns array of vanilla numpy (#6300)
    • FIX-#6334: improve error message if hdk isn't installed in the environment (#6358)
    • FIX-#6347: remove 'modin in the cloud' experimental feature (#6408)
    • FIX-#6364: Make reshuffling work with 'BenchmarkMode.put(True)' (#6365)
    • FIX-#6367: Enable support for 'groupby.size()' in reshuffling groupby (#6370)
    • FIX-#6368: Apply deferred indices before map-reduce groupby (#6369)
    • FIX-#6372: precompute dtypes for 'sum' operation (#6421)
    • FIX-#6375: don't initialize engines at import time (#6374)
    • FIX-#6386: don't make unnecessary 'astype' calls for modin.array.sum op (#6395)
    • FIX-#6396: set '__factory' to 'None' in case of any problems during initialization (#6397)
    • FIX-#6402: Allow datetime and timedelta types in diff (#6403)
    • FIX-#6405: Apply disable_logging to __getattr__ (#6406)
    • FIX-#6410: add a link to @modin_project twitter (#6411)
    • FIX-#6414: fix 'read_feather' with pyarrow<11.0 (#6415)
    • FIX-#6427: make code compatible with flake8==6.1.0 (#6428)
    • FIX-#6429: exclude pymssql==2.2.8 from environments (#6430)
    • FIX-#6436: Support ~ in paths in IO functions correctly (#6448)
    • FIX-#6443: Cast boolean columns before sum|mean|median groupby aggregations (#6444)
    • FIX-#6456: create fake xgboost module for building docs (#6457)
    • FIX-#6459: support fastparquet>=2023.1.0 (#6458)
    • FIX-#6483: Default to pandas for __array_ufunc__ (#6486)
  • Performance enhancements
    • PERF-#6437: preserve dtypes for 'reindex' (#6438)
  • Update testing suite
    • TEST-#2008: Reduce runtime of CI checks a lot (#6356)
    • TEST-#6349: Update minimum versions for test dependencies in general environments (#6350)
    • TEST-#6469: pin numexpr<2.8.5 (#6474)
  • New Features
    • FEAT-#6407: update minimum dependency versions (#6342)
  • Uncategorized improvements
    • Release version 0.23.1 (#6495)
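
As an illustration of the '~' handling mentioned in FIX-#6436 above, the sketch below shows the general idea of expanding a leading '~' before a path reaches an IO routine. It is a minimal sketch using only the standard library, not Modin's actual implementation; the helper name expand_user_path is hypothetical.

```python
import os


def expand_user_path(path):
    """Expand a leading '~' in str or os.PathLike paths (hypothetical helper).

    Anything that is not path-like (for example an open file object or an
    in-memory buffer) is returned unchanged.
    """
    if isinstance(path, (str, os.PathLike)):
        # os.path.expanduser replaces a leading '~' with the user's home directory.
        return os.path.expanduser(path)
    return path


# Example: normalize the path before handing it to a reader function.
csv_path = expand_user_path("~/data.csv")
```

Passing non-path objects through untouched keeps the helper safe to call on every 'path or buffer' argument.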

Contributors

@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@vnlitvinov

Modin 0.23.0

06 Jul 15:35
6a5416c

This release upgrades the pandas version to 2.0. It also includes a '.corr' speed-up, new features, and bug fixes.

Key Features and Updates Since 0.22.0

  • Stability and Bugfixes
    • FIX-#1851: Squash multiple LogicalProject nodes (#6306)
    • FIX-#3371: Remove pandas patch level pin (#6211)
    • FIX-#4048: support sqlalchemy objects in con parameter for to_sql (#5940)
    • FIX-#4485: fix 'clip' with list-like bounds and axis=None (#6344)
    • FIX-#4954: defaults to pandas in read_json in case of rows having different columns (#5946)
    • FIX-#5077: fix 'Series.rename_axis' signature (#6324)
    • FIX-#5461: fix groupby if dataframe has empty partitions (#6307)
    • FIX-#6035: Fall back to Pandas, when merging unsupported column types (#6036)
    • FIX-#6085: HDK: Implemented support for datetime64 dtypes serialization (#6086)
    • FIX-#6208: HDK: Added support for median aggregation (#6209)
    • FIX-#6215: Process '.corr(numeric_only=False)' parameter at the qc level (#6242)
    • FIX-#6218: Fix read_excel and unpin openpyxl (#6247)
    • FIX-#6229: fix Series.equals/DataFrame.equals with NA entries (#6270)
    • FIX-#6232: support DataFrame.cov(numeric_only=False) without fallback to pandas (#6262)
    • FIX-#6237: Log errors only from deepest modin layer (#6238)
    • FIX-#6245: support datetime64 with different resolutions types for HDK (#6255)
    • FIX-#6246: fix 'groupby(..., as_index=False).agg(...)' case (#6263)
    • FIX-#6258: Fix series to_dict (#6260)
    • FIX-#6259: Fix astype("category") causing read-only buffer error (#6267)
    • FIX-#6273: fix DataFrame.min/max/mean/median/skew/kurt with axis=None (#6275)
    • FIX-#6297: fix experimental numpy.argmax/argmin with NaNs in data (#6298)
    • FIX-#6309: do not materialize axes for 'rank' operation (#6310)
    • FIX-#6313: update MIN_RAY_VERSION var: 1.4.0 -> 1.13.0 (#6314)
    • FIX-#6317: fix syntax error in 'push-to-master.yml' (#6318)
    • FIX-#6336: pin 'pydantic<2' to fix CI (#6337)
    • FIX-#6338: fix TypeError: WorksheetReader.__init__() got an unexpected keyword argument 'rich_text' (#6339)
    • FIX-#6341: call _filter_empties only if shapes are different on particular axis (#6333)
    • FIX-#6352: Fix the HdkOnNativeDataframePartition._width_cache property computation (#6353)
    • FIX-#6354: Skip bad and pre-release versions (#6355)
  • Performance enhancements
    • PERF-#4560: Implement '.corr()' method using MapReduce pattern (#6193)
    • PERF-#6319: remove '__make_init_labels_args' explicit calls that materialize axes (#6312)
  • Refactor Codebase
    • REFACTOR-#0000: Remove OmnisciWorker as unused (#6278)
    • REFACTOR-#0000: rename 'exc' -> 'err' (#6252)
    • REFACTOR-#6279: HDK DataFrame should not have more than one partition (#6280)
    • REFACTOR-#6329: deprecate cloud feature (#6330)
  • Update testing suite
    • TEST-#6282: Reduce copy-pasteness in ci.yml (#6283)
    • TEST-#6308: add to_numpy ASV bench (#6305)
    • TEST-#6315: increase 'install_timeout' for ASV benchmarks: 600 -> 6000 sec (#6316)
  • New Features
    • FEAT-#5684: Use TreeReduce implementation for 'pivot_table' in certain cases (#6089)
    • FEAT-#5759: Implement lazy Arrow execution for the HDK engine (#6251)
    • FEAT-#5936: support pandas 2.0.2 (#5995)
    • FEAT-#6048: add wait method for Dask/Ray/Unidist wrappers (#6049)
    • FEAT-#6191: Implement groupby.rolling API (#6292)
    • FEAT-#6253: add 'dtype_backend' parameter support for read_parquet/read_feather (#6264)
    • FEAT-#6256: HDK: Add support for DataFrameGroupBy.head/tail() (#6257)
    • FEAT-#6284: Do not convert HDK query execution result to arrow. (#6286)
    • FEAT-#6296: Add additional pyhdk launch parameters (#6303)
    • FEAT-#6322: Give a warning only if the major or minor part of pandas version are different (#6323)
    • FEAT-#6325: Add GPU execution option for HDK backend (#6326)
    • FEAT-#6327: Bump pyhdk version to 0.7 (#6328)
    • FEAT-#6351: Add a simple heuristic for fragment size when running on a GPU (#6346)
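
PERF-#4560 above mentions computing '.corr()' with a MapReduce pattern. The sketch below illustrates that general pattern for the Pearson correlation of two columns in pure Python: each partition is reduced to a few partial sums (map), the partial sums are added together (reduce), and the coefficient is computed from the totals. It is a minimal sketch of the technique under these assumptions, not Modin's implementation; all function names are hypothetical.

```python
import math
from functools import reduce


def map_partition(rows):
    """Map step: summarize one partition of (x, y) pairs as partial sums."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    syy = sum(y * y for _, y in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, syy, sxy)


def combine(a, b):
    """Reduce step: partial sums from two partitions add element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def pearson_from_sums(stats):
    """Compute Pearson's r from the combined sums."""
    n, sx, sy, sxx, syy, sxy = stats
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)


def mapreduce_corr(partitions):
    """Correlation of x and y over a list of row partitions."""
    return pearson_from_sums(reduce(combine, map(map_partition, partitions)))


# Two row partitions of the same two-column dataset, where y = 2 * x.
partitions = [[(1, 2), (2, 4)], [(3, 6), (4, 8), (5, 10)]]
r = mapreduce_corr(partitions)
```

Because only six numbers per partition travel to the reduce step, the result does not depend on how the rows are split across partitions.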

Contributors

@AndreyPavlenko
@YarShev
@alexbaden
@anmyachev
@dchigarev
@kurapov-peter
@mvashishtha
@vnlitvinov

Modin 0.22.3

04 Jul 18:16
0.22.3
d69bcad

Patch release whose main purpose is to pin pydantic<2 to resolve Ray issues, plus a few bug fixes.

Key Features and Updates Since 0.22.2

  • Stability and Bugfixes
    • FIX-#5461: fix groupby if dataframe has empty partitions (#6307)
    • FIX-#6035: Fall back to Pandas, when merging unsupported column types (#6036)
    • FIX-#6297: fix experimental numpy.argmax/argmin with NaNs in data (#6298)
    • FIX-#6309: do not materialize axes for 'rank' operation (#6310)
    • FIX-#6313: update MIN_RAY_VERSION var: 1.4.0 -> 1.13.0 (#6314)
    • FIX-#6336: pin 'pydantic<2' to fix CI (#6337)

Contributors

@AndreyPavlenko
@anmyachev

Modin 0.23.0rc0

17 Jun 01:23
0.23.0rc0
e1d4241
Pre-release

This release includes support for pandas 2.0, a '.corr' speed-up, new features, and bug fixes.

Note: this is a release candidate. If everything goes well, we'll release Modin 0.23.0 in two weeks.

Key Features and Updates Since 0.22.0

  • Stability and Bugfixes
    • FIX-#3371: Remove pandas patch level pin (#6211)
    • FIX-#4954: Defaults to pandas in read_json in case of rows having different columns (#5946)
    • FIX-#6215: Process '.corr(numeric_only=False)' parameter at the qc level (#6242)
    • FIX-#6218: Fix read_excel and unpin openpyxl (#6247)
    • FIX-#6232: Support DataFrame.cov(numeric_only=False) without fallback to pandas (#6262)
    • FIX-#6237: Log errors only from deepest modin layer (#6238)
    • FIX-#6245: Support datetime64 with different resolutions types for HDK (#6255)
    • FIX-#6246: Fix 'groupby(..., as_index=False).agg(...)' case (#6263)
    • FIX-#6258: Fix series to_dict (#6260)
    • FIX-#6259: Fix astype("category") causing read-only buffer error (#6267)
    • FIX-#6273: Fix DataFrame.min/max/mean/median/skew/kurt with axis=None (#6275)
  • Performance enhancements
    • PERF-#4560: Implement '.corr()' method using MapReduce pattern (#6193)
  • New Features
    • FEAT-#5759: Implement lazy Arrow execution for the HDK engine (#6251)
    • FEAT-#5936: Support pandas 2.0.2 (#5995)
    • FEAT-#6048: Add wait method for Dask/Ray/Unidist wrappers (#6049)
    • FEAT-#6253: Add 'dtype_backend' parameter support for read_parquet/read_feather (#6264)
    • FEAT-#6256: HDK: Add support for DataFrameGroupBy.head/tail() (#6257)

Contributors

@AndreyPavlenko
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@vnlitvinov

Modin 0.22.2

14 Jun 20:44
0.22.2
fdb79c6

This release includes several bug fixes.

Key Features and Updates Since 0.22.1

  • Stability and Bugfixes
    • FIX-#6258: Fix series to_dict (#6260)
    • FIX-#6259: Fix astype("category") causing read-only buffer error (#6267)

Contributors

@mvashishtha

Modin 0.22.1

07 Jun 17:39
eeb410c

This release includes a bug fix.

Key Features and Updates Since 0.22.0

  • Stability and Bugfixes
    • FIX-#6237: Log errors only from deepest modin layer (#6238)

Contributors

@mvashishtha

Modin 0.22.0

01 Jun 17:33
9869832

This release includes support for pyhdk=0.6, a few performance enhancements,
new features and bug fixes.

Key Features and Updates Since 0.21.0

  • Stability and Bugfixes
    • FIX-#6104: Stop selecting same column twice for repr (#6210)
    • FIX-#6199: make sure read_html return a list of DataFrames (#6200)
    • FIX-#6201: align groupby objects signatures with pandas (#6202)
    • FIX-#6212: Fix '.read_feather()' failure if the file contains index metadata (#6213)
    • FIX-#6216: make sure 'infer_objects' returns DataFrame (#6217)
    • FIX-#5722: Use full axis function when casting to "category" (#6222)
    • FIX-#5889: HDK: Combine multiple lazy concat operations into a single one and replace recursion with iteration (#5932)
  • Performance enhancements
    • PERF-#6126: Remove redundant '.fillna(0)' at the end of '.size()' and '.count()' (#6127)
    • PERF-#6224: Use 'Map' operator to retrieve categorical codes (#6230)
  • Refactor Codebase
    • REFACTOR-#5916: Align Python engine's API with other engines (#6214)
  • New Features
    • FIX-#6189: Bump pyhdk version to 0.6 (#6190)
    • FEAT-#6225: Allow set_index to take an object to be handled by the backend (a backend index, etc) (#6228)
  • Dependencies
    • FIX-#6072: unpin pyarrow and xfail test_read_parquet_pandas_index test (#6223)

Contributors

@mvashishtha
@AndreyPavlenko
@anmyachev
@dchigarev
@jkew
@YarShev

Modin 0.21.0

24 May 21:27
e8e57d9

This release includes many bug fixes, performance enhancements, and new features.

Key Features and Updates Since 0.20.0

  • Stability and Bugfixes
    • FIX-#4828: allow dict_apply_builder use keyword argument internal_indices (#5945)
    • FIX-#5091: Handle pd.Grouper objects correctly (#6174)
    • FIX-#5203: don't raise AttributeError: 'list' object has no attribute '_query_compiler' in join op (#5939)
    • FIX-#5985: BUG: ArrowPeriodType and ArrowIntervalType are not supported by HDK (#5987)
    • FIX-#5988: BUG: Concatenation of frames with strings is not supported by HDK (#5989)
    • FIX-#5993: Fix documentation building in CI (#5994)
    • FIX-#5997: Run build-docs CI job regardless of the files being changed (#5998)
    • FIX-#6000: HDK: read_csv(): Do not parse dates, if the parse_dates argument is not specified (#6001)
    • FIX-#6022: support lazy import of modin.pandas module (#6023)
    • FIX-#6037: Simplified filter node expression for ranges (#6038)
    • FIX-#6053: align 'Series.str' signatures with pandas (#6054)
    • FIX-#6069: Improve the way resample is handled at the API layer (#6179)
    • FIX-#6070: Simplify implementation of shift (#6168)
    • FIX-#6074: cap pyarrow<12 to fix CI (#6075)
    • FIX-#6094: pin 'urllib3<2' for pip command in 'test-ray-master' job (#6178)
    • FIX-#6095: Implement the to_csv() method in the HDK backend (#6099)
    • FIX-#6097: Pass storage_options to the to_csv function of PandasOnRayIO class with fsspec (#6098)
    • FIX-#6106: Fix API layer implementation of reindex_like (#6131)
    • FIX-#6107: Allow pass through of tz_convert and tz_localize to QC if possible (#6137)
    • FIX-#6109: Don't use join() when indicator is true (#6130)
    • FIX-#6110: Generalize logic to test if an index is a MultiIndex (#6135)
    • FIX-#6112: Ensure that truncate verifies that before <= after (#6134)
    • FIX-#6113: Add QC Layer implementation for idxmin/max (#6170)
    • FIX-#6114: Fix series groupby list of numpy methods (#6129)
    • FIX-#6115: Check for _to_datetime attribute in pd.to_datetime (#6133)
    • FIX-#6117: Add error checking at API level for diff (#6167)
    • FIX-#6120: HDK read_csv(): Fixed parsing dates with nanosecond precision (#6121)
    • FIX-#6146: Fix pivot when values=None (#6166)
    • FIX-#6152: make numeric_only default to True (#6162)
    • FIX-#6154: Ensure GroupBy.__getitem__ preserves key order (#6164)
    • FIX-#6155: Fully implement droplevel for axis=0 (#6180)
    • FIX-#6175: Fix groupby agg columns for empty column partition (#6176)
    • FIX-#6181: Do not ignore copy argument in tz_convert and tz_localize (#6182)
    • FIX-#6183: Ensure array resets index and columns for all storage formats (#6185)
    • FIX-#6184: Make Series.to_list return proper list (#6188)
    • FIX-#6186: Don't use pandas extension types (#6187)
    • FIX-#6194: Fix crashes on groupby.{pct_change,diff} (#6195)
    • FIX-#6196: Align 'Series.cat' signatures with pandas (#6061)
    • FIX-#6204: Use reset_index instead of insert in to_sql (#6205)
    • FIX-#6172: Pass storage_options to the to_csv function of PandasOnUnidist class with fsspec (#6173)
  • Performance enhancements
    • PERF-#5835: Introduce lazy categorical proxy for pandas backend (#6055)
    • PERF-#5840: Precompute dtypes cache for binary operations more often (#5949)
    • PERF-#5841: Precompute dtypes for boolean setitem (#5952)
    • PERF-#5999: Do not set Ray's runtime_env for a single-node case (#6028)
    • PERF-#6122: Extract Feather's metadata without reading a whole file (#6123)
  • Refactor Codebase
    • REFACTOR-#5844: remove inplace kwarg from query compiler clip arguments (#5954)
    • REFACTOR-#5951: remove code duplication for to_pickle_distributed (#5950)
    • REFACTOR-#5992: remove 'apply_license_header.py' as unused (#5990)
    • REFACTOR-#6012: move experimental dispatchers under modin/experimental/... folder (#6011)
    • REFACTOR-#6024: remove code duplication for to_* functions (#5953)
    • REFACTOR-#6044: remove code duplication for 'get_objects_from_partitions' (#6045)
    • REFACTOR-#6046: remove code duplication for 'progress_bar_wrapper' (#6047)
    • REFACTOR-#6062: Add query compiler interfaces for expanding methods (#6064)
    • REFACTOR-#6063: Add query compiler interfaces for some strings methods. (#6088)
    • REFACTOR-#6065: Use between_time in at_time (#6158)
    • REFACTOR-#6066: Support rolling.{rank,quantile,sem} (#6084)
    • REFACTOR-#6067: Simplify describe() query compiler interface (#6082)
    • REFACTOR-#6068: Simplify info() call (#6087)
    • REFACTOR-#6071: Push first and last down to query compiler. (#64) (#6125)
    • REFACTOR-#6091: Push more of memory_usage down to query compiler. (#6092)
    • REFACTOR-#6105: Explicitly pass default value of np.nan to Series.reindex (#6138)
    • REFACTOR-#6108: Move implementation of pd.cut to QC layer (#6136)
    • REFACTOR-#6116: Move groupby_ohlc implementation to QC layer (#6132)
    • REFACTOR-#6119: #6118: Add query compiler methods for groupby diff, pct_change (#6128)
    • REFACTOR-#6151: Get slicer without constructing pandas dataframe. (#6161)
    • REFACTOR-#6159: Stop defaulting at API layer for a few more methods (#6160)
  • Update testing suite
    • TEST-#5956: Verify dtypes equality in tests (#5955)
    • TEST-#5980: use cancel-in-progress only for PRs (#5917)
    • TEST-#5991: add simple tests for read_orc, read_spss, json_normalize, read_xml, read_gbq (#5983)
    • TEST-#6004: add more '# pragma: no cover' for io functions (#6002)
    • TEST-#6006: test modin/test/test_partition_api.py on unidist and dask (#6003)
    • TEST-#6009: use tmp_path fixture instead of ensure_clean_dir as pandas 2.0.0 does (#6008)
    • TEST-#6010: add some more test directories into 'setup.cfg' (#6007)
    • TEST-#6020: exclude '_version.py' from coverage (#6019)
    • TEST-#6027: Test installing Unidist via pip in a clean environment, as we do for Dask and Ray (#6025)
    • TEST-#6030: test the function parameters of Series.str accessor for pandas equivalence (#6033)
    • TEST-#6031: test the function parameters of 'Series.dt' accessor for pandas equivalence (#6197)
    • TEST-#6076: Use 2 cores for experimental groupby on dask (#6077)
    • TEST-#6198: add 'pragma: no cover' for unidist and ray utils that are used in remote context (#6059)
    • TEST-#6260: Increase test_io timeout (#6207)
  • Documentation improvements
    • DOCS-#5449: Add page for Modin interoperability with select third party libraries (#5517)
    • DOCS-#6021: Add a section regarding reshuffling groupby to Modin's documentation (#6051)
    • DOCS-#6078: correct default values for MODIN_CPUS and MODIN_NPARTITIONS (#6177)
    • DOCS-#6079: Make 'experimental/index.html' accessible through the readthedocs website (#6080)
  • New Features
    • FEAT-#5816: Implement '.split' method for axis partitions (#5856)
    • FEAT-#5867: Introduce groupby implementation via range-partitioning (#5928)
    • FEAT-#6014: Stop defaulting to pandas in groupby frontend for fill-like methods (#5996)
    • FEAT-#6039: Implement Series.str through CachedAccessor (#6043)
    • FEAT-#6040: implement 'Series.dt' through 'CachedAccessor' (#6056)
    • FEAT-#6041: implement 'Series.cat' through 'CachedAccessor' (#6057)
    • FEAT-#6144: Stop defaulting at API layer for a bunch of methods (#6145)
    • FEAT-#6147: HDK: Arrow-based columns concatenation of frames with trivial index. (#6148)
    • FEAT-#6153: Add API layer implementations for some stat methods. (#6156)
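
FEAT-#5867 above introduces a groupby implementation via range-partitioning. The sketch below illustrates the general technique in pure Python: pick split points from the keys, route every row to the bucket whose key range contains it, and aggregate each bucket independently, since no key can appear in two buckets. It is a minimal sketch under these assumptions, not Modin's implementation; all function names and the default num_partitions are hypothetical.

```python
import bisect
from collections import defaultdict


def pick_pivots(keys, num_partitions):
    """Choose num_partitions - 1 split points from the sorted keys."""
    ordered = sorted(keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(rows, pivots):
    """Route each (key, value) row to the bucket covering its key range."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for key, value in rows:
        buckets[bisect.bisect_right(pivots, key)].append((key, value))
    return buckets


def groupby_sum(bucket):
    """Aggregate one bucket on its own; in a real engine this runs per partition."""
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    return dict(totals)


def range_partitioned_groupby_sum(rows, num_partitions=2):
    """Group (key, value) rows by key and sum the values."""
    if not rows:
        return {}
    pivots = pick_pivots([key for key, _ in rows], num_partitions)
    result = {}
    for bucket in range_partition(rows, pivots):
        # Equal keys always land in the same bucket, so results never collide.
        result.update(groupby_sum(bucket))
    return result


rows = [(3, 1), (1, 2), (3, 3), (2, 4), (1, 5)]
totals = range_partitioned_groupby_sum(rows)
```

The per-bucket results can simply be concatenated, which is what removes the need for a final merge step across partitions.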

Contributors

@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@arunjose696
@dchigarev
@devin-petersohn
@helmeleegy
@jkew
@labanyamukhopadhyay
@mdatre
@mvashishtha
@noloerino
@pyrito
@vnlitvinov
@naren-ponder

Modin 0.20.1

24 Apr 13:54
0.20.1
1fb9eb8

This release includes some fixes.

Key Features and Updates Since 0.20.0

  • Stability and Bugfixes
    • FIX-#4828: Allow dict_apply_builder use keyword argument internal_indices (#5945)
    • FIX-#5203: Don't raise AttributeError: 'list' object has no attribute '_query_compiler' in join op (#5939)
    • FIX-#5985: BUG: ArrowPeriodType and ArrowIntervalType are not supported by HDK (#5987)
    • FIX-#5988: BUG: Concatenation of frames with strings is not supported by HDK (#5989)
    • FIX-#5993: Fix documentation building in CI (#5994)
    • FIX-#5997: Run build-docs CI job regardless of the files being changed (#5998)
    • FIX-#6000: HDK: read_csv(): Do not parse dates, if the parse_dates argument is not specified (#6001)
    • FIX-#6022: Support lazy import of modin.pandas module (#6023)

Contributors

@AndreyPavlenko
@anmyachev
@dchigarev

Modin 0.20.0

12 Apr 12:52
0.20.0
daec667

This release adds parallel implementations for some functions on Dask that were previously implemented for other engines.
It also includes support for pyhdk 0.5, many bug fixes and some performance enhancements.

Key Features and Updates Since 0.19.0

  • Stability and Bugfixes
    • FIX-#2850: use modin.pandas.Series instead of pandas.Series for where func (#5883)
    • FIX-#3925: Fixed AssertionError on columns and index drop (#5156)
    • FIX-#4227: Calling FactoryDispatcher.get_factory also initializes the engine (#4228)
    • FIX-#4635: allow pass modin functions to apply (#5915)
    • FIX-#4924: fix read_excel when header is None (#5919)
    • FIX-#5309: series iloc/loc raises IndexingError if a key is too long (#5784)
    • FIX-#5373: Fix Series.shift() for named Series (#5823)
    • FIX-#5432: don't return None when astype used with copy=False parameter (#5918)
    • FIX-#5454: add missed methods for SeriesGroupBy, DataFrameGroupBy objects (#5866)
    • FIX-#5509: default to pandas for read_parquet if any additional kwargs are passed to the engine (#5911)
    • FIX-#5566: Enable test_indexing test on the HDK engine and add to ci (#5567)
    • FIX-#5576: Enable test_join_sort test on the HDK engine and add to CI (#5578)
    • FIX-#5580: HDK-BUG: 'AVG|SUM' is only valid on integer and floating point (#5583)
    • FIX-#5618: don't ignore 'errors' parameter for astype (#5895)
    • FIX-#5653: implement convert_dtypes as a full-axis operation instead of using map approach (#5885)
    • FIX-#5737: BUG: String columns are converted to Categorical, if exported from HDK (#5738)
    • FIX-#5767: cast pathlib.Path to str for read_parquet (#5860)
    • FIX-#5770: Enable test_series test on the HDK engine and add to ci (#5771)
    • FIX-#5774: Correctly calculate shape of single row (#5775)
    • FIX-#5776: fix IndexError when concatenating dict of series along columns (#5804)
    • FIX-#5781: Fix sort in descending order for columns with highly dense values (#5783)
    • FIX-#5787: Enable test_reduce test on the HDK engine and add to ci (#5788)
    • FIX-#5794: Enable test_default test on the HDK engine and add to ci (#5795)
    • FIX-#5806: Enable test_io test on the HDK engine and add to ci (#5807)
    • FIX-#5810: Enable test_binary test on the HDK engine (#5811)
    • FIX-#5819: Fix np.argmax/argmin on 1D arrays (#5820)
    • FIX-#5829: fix ndarray assignment via loc (#5847)
    • FIX-#5846: add Series.str.removeprefix/removesuffix/fullmatch methods (#5845)
    • FIX-#5849: add Series.dt.day_of_week/day_of_year/isocalendar/asfreq methods (#5848)
    • FIX-#5859: Fix '.sort_values()' when there's only one row partition (#5869)
    • FIX-#5862: fix Inline strong start-string without end-string for read_custom_text (#5861)
    • FIX-#5870: Enable test_general test on the HDK engine and add to ci (#5871)
    • FIX-#5888: Fix to_parquet in s3. (#5912)
    • FIX-#5891: BUG: HDK: Query execution fails because the query contains not supported self-join pattern (#5892)
    • FIX-#5927: Enable test_map_metadata test on the HDK engine and add to ci (#5929)
    • FIX-#5934: Enable test_window test on the HDK engine and add to ci (#5935)
    • FIX-#5941: TEST: The test test_io.py fails on HDK (#5942)
    • FIX-#5976: correct use of dtypes cache for concat op (#5975)
    • FIX-#5977: use wrapper.materialize instead of wait_partitions; use AWS env vars in pytest_sessionstart function (#5981)
  • Performance enhancements
    • PERF-#5590: Precompute columns and dtypes metadata for '.merge()' (#5594)
    • PERF-#5670: create self._identity in partitions only for "debug" logging level (#5679)
    • PERF-#5674: reduce data transferring in _launch_tasks function (#5678)
    • PERF-#5675: make index calculation for read_csv function lazy; introduce ModinIndex (#5677)
    • PERF-#5740: allow read_csv, read_fwf, read_table, read_custom_text functions be executed fully asynchronous; introduce ModinDtypes (#5713)
    • PERF-#5777: Filter out empty bins at range-based reshuffling (#5779)
    • PERF-#5778: Avoid extra materialization at range-based reshuffling (#5780)
    • PERF-#5808: Delay metadata computations for '.sort_values' result (#5828)
    • PERF-#5837: Defer index materialization for MapReduce implemented groupby (#5948)
  • Refactor Codebase
    • REFACTOR-#2863: remove 'other_name' from broadcast_apply (#5882)
    • REFACTOR-#5414: Move partition.get into base class (#5408)
    • REFACTOR-#5417: fix FutureWarning: the mangle_dupe_cols keyword is deprecated (#5407)
    • REFACTOR-#5683: remove Engine.subscribe(_update_engine) in DataFrame/Series constructors (#5855)
    • REFACTOR-#5786: align logging of Dask partitions with other executions (#5785)
    • REFACTOR-#5799: Clean up numpy array operations (#5800)
    • REFACTOR-#5830: rename experimental dispatchers and parsers (#5864)
    • REFACTOR-#5874: move lazy_metadata_decorator into utils.py (#5872)
    • REFACTOR-#5875: use default implementations for dt methods from the base query compiler (#5873)
    • REFACTOR-#5902: use __make_read for non experimental IO classes (#5898)
    • REFACTOR-#5908: remove unused parameters from 'run_exec_plan' (#5907)
    • REFACTOR-#5910: remove '_dtypes_for_cols' internal function as unused (#5909)
    • REFACTOR-#5922: let upload-coverage action fail if there is no .coverage file (#5921)
    • REFACTOR-#5923: add pragma: no cover for functions that are used in apply_full_axis (#5920)
  • Update testing suite
    • TEST-#2544: delay codecov notifications until all reports have been sent (#5782)
    • TEST-#4261: test rolling with axis=1, win_type=, and center=True (#5881)
    • TEST-#5477: fix typo: read_stata kwargs -> read_sas kwargs (#5854)
    • TEST-#5790: add ASV configs for Dask and Unidist (#5789)
    • TEST-#5802: update some actions in CI (#5801)
    • TEST-#5826: remove _propagate_index_objs internal function usage from tests (#5813)
    • TEST-#5832: Suppress pytest coverage messages in terminal (#5833)
    • TEST-#5851: test api of cat/sparse accessors (#5850)
    • TEST-#5878: exclude modin/experimental/batch/test/ folder from computing coverage (#5877)
    • TEST-#5897: Add more robust tests for numpy API (#5900)
    • TEST-#5913: Cancel CI for commits to same branch. (#5914)
    • TEST-#5933: Add assert_array_equals utility to numpy tests (#5947)
    • TEST-#5943: Rebalance tests between different CI jobs (#5890)
    • TEST-#5977: Add AWS mock keys to moto in push-to-master.yml (#5978)
  • Documentation improvements
    • DOCS-#0000: fix pip install command for macos (#5749)
    • DOCS-#5659: Supplement quickstart notebook with a note regarding OOM issue (#5821)
    • DOCS-#5852: Add mention of read_custom_text experimental api in docs (#5853)
    • DOCS-#5957: Compress Import.gif as it's too large (#5958)
  • New Features
    • FEAT-#4624: add to_parquet parallel implementation for Dask (#5876)
    • FEAT-#5497: add several experimental functions for Dask (#5496)
    • FEAT-#5880: add to_sql parallel implementation for Dask (#5879)
    • FEAT-#5901: add read_fwf parallel implementation for Dask (#5899)
    • FEAT-#5930: Bump pyhdk version to 0.5 (#5931)

Contributors

@MSHADroo
@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@noloerino
@pyrito
@vnlitvinov