Displaying data labels in Y axis on the left (instead of 1 and number of rows) #36

Open
gurol opened this issue Aug 30, 2017 · 6 comments


gurol commented Aug 30, 2017

Could we display the data labels on the Y axis, just as is done for time-series data? (As in the given example, msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ'), but for text labels.)

DataLabels DS2 DS0 DS1 DS3 DS5
LABEL_1 0.001132 NaN 0.011811 0.002 0.000712
LABEL_2 0.013395 0.012160 0.007874 0.007 0.005013
@ResidentMario
Owner

Can you describe this a bit more? I'm not sure what you mean.

@gurol
Author

gurol commented Aug 30, 2017

[Image: missingnoexample]
The image above is an example showing all the nos (data labels).

Even better, there could be the following display options:

  • missing_only (labels missing in at least one column of the source data frame),
  • existing_only (labels existing in all the columns of the source data frame),
  • all_values (union of all the labels)

where the labels are defined in the index of the source data frame. Parametric text colors can be used to distinguish missing and existing labels.
It is not necessary to position the labels at the exact point in the graph. Just list the labels on the left side, starting from the top, and adjust the font size to fit.
Thank you for your interest.
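
The three proposed options could be sketched roughly as below. This is only an illustration using plain Python and the two sample rows from the first comment; select_labels and is_missing are hypothetical helper names, not part of missingno.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def select_labels(rows, mode="all_values"):
    """Pick row labels to display. `rows` maps label -> list of cell values.

    Hypothetical helper: the mode names mirror the options proposed above.
    """
    if mode not in {"missing_only", "existing_only", "all_values"}:
        raise ValueError(f"unknown mode: {mode!r}")
    selected = []
    for label, values in rows.items():
        has_missing = any(is_missing(v) for v in values)
        # missing_only keeps rows with a gap, existing_only keeps complete rows.
        if mode == "all_values" or (mode == "missing_only") == has_missing:
            selected.append(label)
    return selected


# Sample rows taken from the table in the first comment (columns DS2, DS0, DS1, DS3, DS5).
rows = {
    "LABEL_1": [0.001132, float("nan"), 0.011811, 0.002, 0.000712],
    "LABEL_2": [0.013395, 0.012160, 0.007874, 0.007, 0.005013],
}
```

With a pandas data frame, the same row test could presumably be written as df.isna().any(axis=1) and used to filter df.index.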

@ResidentMario
Owner

Looking at this again, I don't think this is possible. The problem is that to get a useful sample of your dataset you need to include at least 100 or so records, which would mean 100 or so labels, which would be tiny. You wouldn't be able to read them at all!

It should be possible (with text label collision detection, which IIRC exists somewhere) to label some subset of the offending data in the display. However, that requires user input explaining, somehow, what the "anomaly threshold" is. And that starts to look too complicated to me for such a basic chart!

I'll nevertheless leave this feature request open, for now.
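
One simple way around the readability problem, short of collision detection, would be to show only every k-th label. A minimal sketch, assuming row i of the matrix is drawn at y = i; thin_labels and max_labels are hypothetical names, not existing missingno parameters:

```python
import math


def thin_labels(labels, max_labels=20):
    """Return (positions, shown): at most `max_labels` evenly spaced row labels.

    Hypothetical helper, not part of missingno.
    """
    n = len(labels)
    if n == 0 or max_labels < 1:
        return [], []
    # Keep every `step`-th label so that no more than max_labels remain.
    step = max(1, math.ceil(n / max_labels))
    positions = list(range(0, n, step))
    return positions, [labels[i] for i in positions]


# 100 made-up labels, thinned to 10.
labels = [f"LABEL_{i}" for i in range(1, 101)]
positions, shown = thin_labels(labels, max_labels=10)
```

The result could then be applied with matplotlib's ax.set_yticks(positions) followed by ax.set_yticklabels(shown) on the axes that the matrix was drawn on.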

@jason-r-becker

I think this could still be a useful feature for time-series data. It could function similarly to autofmt_xdate() from matplotlib (https://matplotlib.org/_modules/matplotlib/figure.html#Figure.autofmt_xdate). Having a few dates would be useful to visualize the times associated with missing data.

@remisphere

This would also be useful when working with Pandas' MultiIndex, with the option to choose a particular level that encompasses several samples.

For example, I am working with a dataset consisting of stereo video frames recorded from a car, and have sorted them in a dataframe with the following row multi-index:
environment / recording session / stereo side / timestamp
While displaying every timestamp would be, as you said, impossible (also because timestamps are unrelated from one recording session to another), printing only the much sparser environment or recording-session labels would make it easier to locate where data is missing (provided that the dataframe is sorted).

From what I have seen when trying to use a multi-index on the column axis, missingno just reads it as a tuple.
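
To illustrate the idea, here is a rough sketch that computes one tick per contiguous block of a chosen index level, placed at the middle of the block. It works on plain tuples (which is what a sorted MultiIndex yields via df.index.tolist()); level_ticks and the sample values are made up for the example.

```python
from itertools import groupby


def level_ticks(index_tuples, level):
    """Return (positions, labels): one tick at the middle row of each
    contiguous run of equal values at `level`.

    Hypothetical helper; assumes the index is already sorted.
    """
    positions, labels = [], []
    start = 0
    for value, run in groupby(index_tuples, key=lambda t: t[level]):
        length = sum(1 for _ in run)
        # Middle of the run, in row coordinates (row i sits at y = i).
        positions.append(start + (length - 1) / 2)
        labels.append(value)
        start += length
    return positions, labels


# Made-up rows: (environment, recording session, stereo side, timestamp).
index_tuples = [
    ("city", "session_1", "left", 0),
    ("city", "session_1", "right", 0),
    ("city", "session_2", "left", 0),
    ("highway", "session_3", "left", 0),
    ("highway", "session_3", "right", 0),
    ("highway", "session_3", "left", 1),
]
positions, labels = level_ticks(index_tuples, level=0)
```

Because each tick sits at the center of its block, this also copes with blocks of unequal size.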

@arturomoncadatorres

@gurol I solved this with a few extra lines of code after calling msno.matrix. In my df, I had a column called year and I wanted to see if there were some years that had missing values. Therefore, my code looked like this:

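import matplotlib.pyplot as plt
import missingno as msno
import numpy as np

# Sort so that rows from the same year are contiguous.
df = df.sort_values(by=['year'])

fontsize = 20

fig, ax = plt.subplots(1, 1, figsize=[20, 14])
msno.matrix(df=df, ax=ax, color=(0.2, 0.2, 0.2), sparkline=False, fontsize=fontsize)

# One evenly spaced tick per year, starting from the top of the matrix.
# This assumes each year has roughly the same number of rows.
years = list(df['year'].unique())
ylim_start, ylim_end = ax.get_ylim()
step_size = df.shape[0] / len(years)
_ = ax.yaxis.set_ticks(np.arange(ylim_end, ylim_start, step_size))
_ = ax.yaxis.set_ticklabels(years, fontsize=fontsize)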

@ResidentMario would this be a feature that you would be interested in adding to missingno? If so, we could further discuss the implementation and I could take the lead in making a PR


5 participants