Export does not work #515

Closed
theTibi opened this issue Dec 20, 2019 · 6 comments

Comments

@theTibi

theTibi commented Dec 20, 2019

Hi,

I was trying to export a filtered dataset to a CSV file, but the export function does not save the file. It runs without any error message, but it does not do anything. export_hdf5 works well.

Here is an example with the titanic dataset:

In [5]: import vaex.ml
   ...: df = vaex.ml.datasets.load_titanic()

In [7]: df
Out[7]:
#      pclass    survived    name                                             sex     age     sibsp    parch    ticket    fare      cabin    embarked    boat    body    home_dest
0      1         True        Allen, Miss. Elisabeth Walton                    female  29.0    0        0        24160     211.3375  B5       S           2       nan     St Louis, MO
1      1         True        Allison, Master. Hudson Trevor                   male    0.9167  1        2        113781    151.55    C22 C26  S           11      nan     Montreal, PQ / Chesterville, ON
2      1         False       Allison, Miss. Helen Loraine                     female  2.0     1        2        113781    151.55    C22 C26  S           None    nan     Montreal, PQ / Chesterville, ON
3      1         False       Allison, Mr. Hudson Joshua Creighton             male    30.0    1        2        113781    151.55    C22 C26  S           None    135.0   Montreal, PQ / Chesterville, ON
4      1         False       Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0    1        2        113781    151.55    C22 C26  S           None    nan     Montreal, PQ / Chesterville, ON
...    ...       ...         ...                                              ...     ...     ...      ...      ...       ...       ...      ...         ...     ...     ...
1,304  3         False       Zabour, Miss. Hileni                             female  14.5    1        0        2665      14.4542   None     C           None    328.0   None
1,305  3         False       Zabour, Miss. Thamine                            female  nan     1        0        2665      14.4542   None     C           None    nan     None
1,306  3         False       Zakarian, Mr. Mapriededer                        male    26.5    0        0        2656      7.225     None     C           None    304.0   None
1,307  3         False       Zakarian, Mr. Ortin                              male    27.0    0        0        2670      7.225     None     C           None    nan     None
1,308  3         False       Zimmerman, Mr. Leo                               male    29.0    0        0        315082    7.875     None     S           None    nan     None

In [8]: df.export('test.txt')

In [9]: df_filtered=df[df['sex']=='female']

In [10]: df_filtered
Out[10]:
#    pclass    survived    name                                             sex     age    sibsp    parch    ticket    fare      cabin    embarked    boat    body    home_dest
0    1         True        Allen, Miss. Elisabeth Walton                    female  29.0   0        0        24160     211.3375  B5       S           2       nan     St Louis, MO
1    1         False       Allison, Miss. Helen Loraine                     female  2.0    1        2        113781    151.55    C22 C26  S           None    nan     Montreal, PQ / Chesterville, ON
2    1         False       Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0   1        2        113781    151.55    C22 C26  S           None    nan     Montreal, PQ / Chesterville, ON
3    1         True        Andrews, Miss. Kornelia Theodosia                female  63.0   1        0        13502     77.9583   D7       S           10      nan     Hudson, NY
4    1         True        Appleton, Mrs. Edward Dale (Charlotte Lamson)    female  53.0   2        0        11769     51.4792   C101     S           D       nan     Bayside, Queens, NY
...  ...       ...         ...                                              ...     ...    ...      ...      ...       ...       ...      ...         ...     ...     ...
461  3         True        Whabee, Mrs. George Joseph (Shawneene Abi-Saab)  female  38.0   0        0        2688      7.2292    None     C           None    nan     None
462  3         True        Wilkes, Mrs. James (Ellen Needs)                 female  47.0   1        0        363272    7.0       None     S           None    nan     None
463  3         True        Yasbeck, Mrs. Antoni (Selini Alexander)          female  15.0   1        0        2659      14.4542   None     C           None    nan     None
464  3         False       Zabour, Miss. Hileni                             female  14.5   1        0        2665      14.4542   None     C           None    328.0   None
465  3         False       Zabour, Miss. Thamine                            female  nan    1        0        2665      14.4542   None     C                   nan     None

In [11]: df_filtered.export('test_filtered.txt')

Neither test.txt nor test_filtered.txt was created. Is there any way to enable some debug information to see what is going on and why the data does not get exported?

@maartenbreddels
Member

Hi,

'.txt' is not a known extension. We should raise an error on this, so I will leave this open until it is fixed. Try df.export('test.hdf5') or df.export('test.arrow').
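
To illustrate the kind of check meant here, below is a minimal sketch in plain Python. It is not vaex's actual implementation: the helper name `check_export_path` and the `SUPPORTED_EXTENSIONS` set are hypothetical, and the set only lists the two extensions mentioned in this comment.

```python
import os

# Assumption: only the two extensions named in this thread are listed.
SUPPORTED_EXTENSIONS = {".hdf5", ".arrow"}


def check_export_path(path):
    """Hypothetical helper: fail loudly on an unknown extension
    instead of returning without writing anything."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Unsupported export extension {ext!r} for {path!r}; "
            f"expected one of {sorted(SUPPORTED_EXTENSIONS)}"
        )
    return path
```

With a check like this, a call such as `check_export_path('test.txt')` would raise a `ValueError` rather than silently doing nothing.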

cheers,

Maarten

@theTibi
Author

theTibi commented Dec 20, 2019

Ahh I see. Is there a way to save/export a filtered dataset back to a CSV or any text file?

@maartenbreddels
Member

We could do this, but we would rely on pandas or arrow for it. I feel users should be comfortable doing:

pandas_df = df.to_pandas_df()
pandas_df.to_csv('test.csv')

We may end up wrapping everything in pandas, which does not scale, but this might be a case where it's convenient/common enough that it makes sense. What do you think?

@JovanVeljanoski what do you think, should we support export to CSV?

@JovanVeljanoski
Member

Pandas .to_csv can append to a file, so we can do it in chunks and save big dataframes to disk in CSV format.
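
A rough sketch of that chunked approach, using only pandas: the function name `write_csv_in_chunks` is hypothetical, and how the chunks are produced from a vaex DataFrame is left out here. The first chunk creates the file and writes the header, and every later chunk is appended without one.

```python
import pandas as pd


def write_csv_in_chunks(chunks, path):
    """Write an iterable of pandas DataFrames to a single CSV file.

    Hypothetical helper: `chunks` is assumed to yield DataFrames that
    all share the same columns, in the same order.
    """
    first = True
    for chunk in chunks:
        chunk.to_csv(
            path,
            mode="w" if first else "a",  # overwrite once, then append
            header=first,                # header only for the first chunk
            index=False,
        )
        first = False
```

Only one chunk needs to be held in memory at a time, which is the point of appending rather than converting the whole dataframe at once.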

Output to CSV/ASCII should be avoided, however, especially for bigger datasets, since in that case one loses one of the main strengths of Vaex. If the use case is interoperability with other tools (Java, etc.), arrow promises to bridge this gap. CSVs are still useful for reports (e.g. in corporate environments), in which case small aggregations will be saved, and for that the example by @maartenbreddels is quite suitable, I would think.

For completeness, I guess we could support output to CSV (as a wrapper around pandas), maybe as part of the same effort planned for improving I/O, @maartenbreddels?

I could also improve the docstring of the df.export method a bit.

@maartenbreddels
Member

I think with #516 we could do a very efficient CSV exporter, provided we also do a parallel/chunked CSV reader (based on pandas or arrow).

@JovanVeljanoski
Member

Since #708, the documentation page features an example about reading and exporting data with vaex.

It also describes the improvements we have recently made to data I/O.
