bug(python): bitpacking index is out of bounds #1490

Closed
Evan-Kim2028 opened this issue Aug 1, 2024 · 0 comments · Fixed by lancedb/lance#2692
Labels: bug (Something isn't working)


### LanceDB version

v0.7.14

### What happened?

I made a standard Polars dataframe without any exotic data types and tried to create a Lance v2 table, but got an index-out-of-bounds error. Here is my Python code:

```python
import asyncio
import os
import polars as pl
import lancedb

# 8/1/24 - THE GOAL IS TO SUCCESSFULLY MAKE A LANCE v2 TABLE AND COMPARE THE COMPRESSION TO PARQUET


# Set environment variables to enable the new features
os.environ['LANCE_USE_FSST'] = '1'
os.environ['LANCE_USE_BITPACKING'] = '1'

# Check that the data directory exists; if not, create it
if not os.path.exists('data'):
    os.makedirs('data')


async def main():
    df = pl.read_parquet('data/transactions.parquet')

    with await lancedb.connect_async("data") as conn:
        await conn.create_table(name="transactions_v1", data=df, use_legacy_format=True)
        await conn.create_table(name="transactions_v2", data=df, use_legacy_format=False)

# Run the async main function
asyncio.run(main())
```

Error:

```
thread 'tokio-runtime-worker' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-encoding-0.15.0/src/encodings/physical/bitpack.rs:161:13:
index out of bounds: the len is 34689 but the index is 34689
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-encoding-0.15.0/src/encodings/logical/primitive.rs:421:32:
called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(2158), ...)
Traceback (most recent call last):
  File "/home/evan/Documents/hypersync_lancev2/write_lance.py", line 23, in <module>
    asyncio.run(main())
  File "/home/evan/.rye/py/cpython@3.12.2/install/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/evan/.rye/py/cpython@3.12.2/install/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/evan/.rye/py/cpython@3.12.2/install/lib/python3.12/asyncio/base_events.py", line 685, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/evan/Documents/hypersync_lancev2/write_lance.py", line 20, in main
    await conn.create_table(name="transactions_v2", data=df, use_legacy_format=False)
  File "/home/evan/Documents/hypersync_lancev2/.venv/lib/python3.12/site-packages/lancedb/db.py", line 778, in create_table
    new_table = await self._inner.create_table(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_asyncio.RustPanic: rust future panicked: unknown error
```
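
A possible temporary workaround (assuming these environment variables are strictly opt-in, which the script above suggests but which isn't confirmed here) is to leave `LANCE_USE_BITPACKING` unset, or to keep writing with `use_legacy_format=True`, so the bitpacking encoder is never selected:

```python
# Hedged workaround sketch: do not opt in to the bitpacking encoding, so the
# code path that panics in bitpack.rs is (assumption) never taken.
import os

os.environ['LANCE_USE_FSST'] = '1'            # keep the FSST opt-in from the original script
os.environ.pop('LANCE_USE_BITPACKING', None)  # leave bitpacking disabled
```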



### Are there known steps to reproduce?

The dataset I used can be replicated with `pip install hypersync` and running the following code to download the data:

```python
import asyncio
import hypersync
import polars as pl
import time

from hypersync import ColumnMapping, DataType, TransactionField, BlockField, TransactionSelection


async def historical_blocks_txs_sync():
    """
    Use hypersync to query blocks and transactions and write to a LanceDB table. Assumes existence of a previous LanceDB table to
    query for the latest block number to resume querying.
    """
    # hypersync client, load with specific url
    client = hypersync.HypersyncClient(hypersync.ClientConfig())

    # set the block range
    from_block: int = 20000000
    to_block: int = 20025000

    # add +/-1 to the block range because the query bounds are not inclusive of the block number
    query = hypersync.Query(
        from_block=from_block-1,
        to_block=to_block+1,
        include_all_blocks=True,
        transactions=[TransactionSelection()],
        field_selection=hypersync.FieldSelection(
            block=[e.value for e in BlockField],
            transaction=[e.value for e in TransactionField],
        )
    )
    # Setting this number lower reduces client sync console error messages.
    query.max_num_transactions = 1_000  # for troubleshooting

    # configuration settings to predetermine type output here
    config = hypersync.StreamConfig(
        hex_output=hypersync.HexOutput.PREFIXED,
        column_mapping=ColumnMapping(
            transaction={
                TransactionField.GAS_USED: DataType.FLOAT64,
                TransactionField.MAX_FEE_PER_BLOB_GAS: DataType.FLOAT64,
                TransactionField.MAX_PRIORITY_FEE_PER_GAS: DataType.FLOAT64,
                TransactionField.GAS_PRICE: DataType.FLOAT64,
                TransactionField.CUMULATIVE_GAS_USED: DataType.FLOAT64,
                TransactionField.EFFECTIVE_GAS_PRICE: DataType.FLOAT64,
                TransactionField.NONCE: DataType.INT64,
                TransactionField.GAS: DataType.FLOAT64,
                TransactionField.MAX_FEE_PER_GAS: DataType.FLOAT64,
                TransactionField.VALUE: DataType.FLOAT64,
            },
            block={
                BlockField.GAS_LIMIT: DataType.FLOAT64,
                BlockField.GAS_USED: DataType.FLOAT64,
                BlockField.SIZE: DataType.FLOAT64,
                BlockField.BLOB_GAS_USED: DataType.FLOAT64,
                BlockField.EXCESS_BLOB_GAS: DataType.FLOAT64,
                BlockField.BASE_FEE_PER_GAS: DataType.FLOAT64,
                BlockField.TIMESTAMP: DataType.INT64,
            }
        )
    )

    return await client.collect_parquet('data', query, config)

# time the query
start_time = time.time()
data = asyncio.run(historical_blocks_txs_sync())
end_time = time.time()
print(f"Time taken: {end_time - start_time}")