[KED-2639] Cannot read csv in chunks with pandas #598
Comments
It's been a while since I have used chunksize. If I remember correctly, it returns a generator.

```python
chunks = catalog.load("train_dataset")

for chunk in chunks:
    # chunk is a DataFrame; do what you need with it
    process(chunk)
```
@WaylonWalker Thanks for jumping in. I have read your blog about Kedro before; it helped me understand some concepts better. When I iterate over it, it throws an error saying the file is closed already.
I was able to replicate. I set up a pipeline with a CSV and a catalog entry just as you did, and I ran into the same error when I tried to iterate over the chunks. I posted my replica of the issue here: https://github.com/WaylonWalker/kedro_chunked.
That is awesome!!! And potentially motivating to keep making more content.
@WaylonWalker I did the same thing to check whether it is a problem with fsspec -> seems not. But I haven't dug deep into transformers before yet; it would be great if someone with more knowledge could jump in.
I'm facing the same issue. Does anyone have updates on this problem?
Have we got a solution for this? I have been having a rough time trying to integrate big data with Kedro.
Looking for a solution too, still bugging me.
@noklam Did you find a workaround?
I solved the problem by creating a custom class which basically loads the file using fsspec (like the CSVDataSet does) and saves it to a temp file. I then pass the file reference through the load function, and inside my pipeline functions I just have to delete it after use (if I forget, no problem, because it's created using tempfile). I forgot to mention that the file reference is basically an iterator over the file chunks. A sketch of this approach is below.
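A minimal sketch of that workaround, assuming any fsspec-readable path; the class name and constructor arguments are illustrative, not from the original comment:

```python
import tempfile

import fsspec
import pandas as pd
from kedro.io import AbstractDataSet


class ChunkedCSVViaTempFile(AbstractDataSet):
    """Copy the file to a local temp file, then hand back a chunk iterator."""

    def __init__(self, filepath: str, chunksize: int = 1000):
        self._filepath = filepath
        self._chunksize = chunksize

    def _load(self) -> pd.io.parsers.TextFileReader:
        # Download the raw bytes with fsspec, much like CSVDataSet does internally
        tmp = tempfile.NamedTemporaryFile(suffix=".csv", delete=False)
        with fsspec.open(self._filepath, mode="rb") as src:
            tmp.write(src.read())
        tmp.close()
        # The temp file outlives this method, so the chunked reader stays valid;
        # the node deletes it after use (or the OS cleans it up eventually)
        return pd.read_csv(tmp.name, chunksize=self._chunksize)

    def _save(self, data) -> None:
        raise NotImplementedError("Read-only dataset sketch")

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "chunksize": self._chunksize}
```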
My solution is simply to give up on using a dataset. I just load it in a node via the typical pandas.read_csv, as in the sketch below.
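For illustration, a minimal node along those lines (the function name, path parameter, and per-chunk step are hypothetical):

```python
import pandas as pd


def load_and_process(csv_path: str) -> pd.DataFrame:
    # Bypass the catalog entirely: read in chunks inside the node,
    # so the file handle lives exactly as long as the iteration does
    processed = []
    for chunk in pd.read_csv(csv_path, chunksize=1000):
        processed.append(chunk)  # replace with real per-chunk work
    return pd.concat(processed, ignore_index=True)
```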
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
It is still a relevant bug; I don't think it should be closed.
Thanks for reflagging this @noklam. I've added this as a bug ticket to our backlog, but of course we still very much welcome a PR fix on this if you have one.
I believe the problem here is that the context manager used in CSVDataSet._load closes the underlying file as soon as _load returns. With chunksize set, pd.read_csv returns a lazy TextFileReader rather than a DataFrame, so by the time you iterate over it the handle is already closed. Since pandas added native fsspec support, the context manager should eventually become unnecessary. In the meantime, I think you should be able to easily fix it just by removing the context manager to give the following (I just tried this out briefly and it seemed to work, but use at your own risk...):
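The code block from this comment did not survive the page scrape; the following is a reconstruction of the suggested fix based on the hotfix posted later in the thread. The "before" version is a paraphrase of the 0.17-era CSVDataSet, so treat it as an approximation:

```python
import pandas as pd
from kedro.io.core import get_filepath_str


# Roughly what CSVDataSet._load did at the time: the `with` block closes
# fs_file on return, so a chunked TextFileReader is left with a dead handle
def _load_with_context_manager(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
        return pd.read_csv(fs_file, **self._load_args)


# The suggested fix: drop the context manager and pass the path straight
# to pandas, letting it manage the file handle for the reader's lifetime
def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    return pd.read_csv(load_path, **self._load_args)
```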
Note also that since pandas 1.2, pd.read_csv accepts a storage_options argument and can read from fsspec URLs natively, which is what makes passing the raw path straight to pandas work for remote files too.
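For example, a chunked read straight from remote storage with pandas >= 1.2 (the bucket, key, and credentials here are made up):

```python
import pandas as pd

# pandas forwards storage_options to fsspec, so no manual file handling is needed
reader = pd.read_csv(
    "s3://some-bucket/train.csv",
    chunksize=1000,
    storage_options={"key": "...", "secret": "..."},  # hypothetical credentials
)
for chunk in reader:
    print(chunk.shape)
```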
For anyone who is looking for a hotfix: thanks to the dynamic nature of Python, we can fix it without touching the source code. Alternatively, you can create a custom DataSet that inherits from CSVDataSet and simply overrides the _load method (see the sketch after this code block).

```python
import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str


def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    return pd.read_csv(load_path, **self._load_args)


# Monkey-patch: every CSVDataSet now loads without the context manager
CSVDataSet._load = _load
```
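And a sketch of the subclass alternative mentioned above (the class name is illustrative):

```python
import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str


class ChunkAwareCSVDataSet(CSVDataSet):
    """CSVDataSet whose _load survives chunked reads."""

    def _load(self) -> pd.DataFrame:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        return pd.read_csv(load_path, **self._load_args)
```

Registered in the catalog via its full import path, it behaves exactly like CSVDataSet except that chunked loads keep working.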
This is a great point, thanks @noklam.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I can confirm that this will be fixed in 0.18 - see 4f5f9c1. The fix should work for both pandas.CSVDataSet and others that currently use a context manager. |
Description
Cannot read a CSV in chunks with the Kedro data catalog. The equivalent plain-pandas usage works:

```python
df = pd.read_csv(csv, chunksize=1000)
df.get_chunk()
```
Steps to Reproduce

```python
df = catalog.load("train_dataset")
df.get_chunk()
```
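For a self-contained reproduction, the catalog can also be built in code; the file path below is an assumption (the thread only establishes a train_dataset entry loaded with a chunksize argument):

```python
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog

catalog = DataCatalog(
    {
        "train_dataset": CSVDataSet(
            filepath="data/01_raw/train.csv",  # hypothetical path
            load_args={"chunksize": 1000},
        )
    }
)

reader = catalog.load("train_dataset")
reader.get_chunk()  # raises: ValueError: I/O operation on closed file.
```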
Expected Result
I should be able to loop over the reader.
Actual Result
ValueError: I/O operation on closed file.