DataGrid

kangas.datatypes.datagrid

DataGrid Objects

class DataGrid()

DataGrid instances have the following attributes:

columns - a list of column names, or a dict of column names mapped to column types
data - a list of lists where each is a row of data
name - a name of the tabular data

__init__

def __init__(data=None,
             columns=None,
             name="Untitled",
             datetime_format="%Y/%m/%d",
             heuristics=False,
             converters=None)

Create a DataGrid instance.

Arguments:

data - (optional, list of lists) The rows of data
columns - (optional, list of strings) the column titles
name - (optional, str) a name of the tabular data
datetime_format - (optional, str) the Python date format that dates are read. For example, use "%Y/%m/%d" for dates like "2022/12/01".
heuristics - if True, guess that some numbers might be dates
converters - (optional, dict) dictionary of functions to convert items into values. Keys are str (to match column name)
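As a rough illustration of datetime_format and converters, the sketch below uses only the Python standard library rather than Kangas itself. It assumes that datetime_format follows the datetime.strptime directives and that each converter is called with the raw cell value. The helper convert_row and the sample column names are hypothetical.

```python
from datetime import datetime

# Assumption: datetime_format follows the strptime/strftime directives.
datetime_format = "%Y/%m/%d"
parsed = datetime.strptime("2022/12/01", datetime_format)

# Converters map a column name to a function applied to each raw value.
converters = {"Score": float, "Label": str.strip}


def convert_row(row, converters):
    """Hypothetical helper: apply per-column converters to a dict row."""
    return {
        name: converters[name](value) if name in converters else value
        for name, value in row.items()
    }


row = convert_row({"Score": "0.75", "Label": " cat ", "ID": 7}, converters)
print(parsed.date(), row)
```

Columns without a converter (here "ID") pass through unchanged.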

NOTES:

The variable dg is used below as an example DataGrid instance.

If column names are not provided, then names will be generated in the sequence "A", "B", ... "Z", "AA", "BB", ...
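The naming sequence can be sketched in plain Python. This is not the Kangas implementation. It assumes that after "Z" each letter is simply repeated ("AA", "BB", ...), as the sequence above suggests, and how names continue past "ZZ" is a guess. The function name generated_column_name is hypothetical.

```python
from string import ascii_uppercase


def generated_column_name(index):
    """Return the generated name for a zero-based column index.

    Assumption: after "Z" the sequence repeats each letter, so index 26
    is "AA" and index 27 is "BB".
    """
    letter = ascii_uppercase[index % 26]
    repeats = index // 26 + 1
    return letter * repeats


names = [generated_column_name(i) for i in range(28)]
print(names[0], names[25], names[26], names[27])
```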

The DataGrid instance can be imagined as a two-dimensional list of lists. The first index selects the row and the second selects the column; both are zero-based. For example, dg[5][2] returns the value in the 6th row and the 3rd column.

Likewise, you can use dg.append(ROW), dg.extend(ROWS), and dg.pop(INDEX) methods.

Rows can be either lists of values, or JSON-like dictionaries of the form {"COLUMN NAME": VALUE, ...}.
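To make the two row forms concrete, here is a minimal sketch with ordinary Python lists standing in for a DataGrid. The helper normalize_row is hypothetical, and filling a missing column with None is an assumption of the sketch, not documented Kangas behavior.

```python
columns = ["Image", "Score"]


def normalize_row(row, columns):
    """Hypothetical helper: return the row as a list in column order.

    Assumption: a column missing from a dict row becomes None.
    """
    if isinstance(row, dict):
        return [row.get(name) for name in columns]
    return list(row)


data = []
# A row given as a list of values.
data.append(normalize_row(["cat.jpg", 0.9], columns))
# A row given as a dictionary keyed by column name.
data.append(normalize_row({"Score": 0.4, "Image": "dog.jpg"}, columns))
# Several rows at once, then removal of the first row by index.
data.extend(normalize_row(r, columns) for r in [{"Image": "owl.jpg"}])
first = data.pop(0)
print(first, data)
```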

These are common methods to use on a DataGrid:
- dg.info() - data about rows, columns, and datatypes
- dg.head() - show the first few rows of a DataGrid
- dg.tail() - show the last few rows of a DataGrid
- dg.show() - open an IFrame (if in a Jupyter Notebook) or a web browser page showing the DataGrid UI

Examples:

>>> from kangas import DataGrid, Image
>>> import glob
>>> dg = DataGrid(name="Images", columns=["Image", "Score"])
>>> for filename in glob.glob("*.jpg"):
...     score = model.predict(filename)
...     dg.append([Image(filename), score])
>>> dg.show()

show

def show(filter=None,
         host=None,
         port=4000,
         debug=None,
         height="750px",
         width="100%",
         protocol="http",
         hide_selector=None,
         use_ngrok=False,
         cli_kwargs=None,
         **kwargs)

Open DataGrid in an IFrame in the Jupyter environment or browser.

Arguments:

host - (optional, str) the host name or IP number for the servers to listen to
filter - (optional, str) a filter to set on the DataGrid
port - (optional, int) the port number for the servers to listen to
debug - (optional, str) will display additional information from the server (may not be visible in a notebook)
height - (optional, str) the height of iframe in px or percentage
width - (optional, str) the width of iframe in px or percentage
use_ngrok - (optional, bool) force using ngrok as a proxy
cli_kwargs - (dict) a dictionary mapping kangas server flag names to their values (such as: {"backend-port": 8000})
kwargs - additional URL parameters to pass to server

Example:

>>> import kangas as kg
>>> dg = kg.DataGrid()
>>> # append data to DataGrid
>>> dg.show()
>>> dg.show("{'Column Name'} == 'category three'")
>>> dg.show("{'Column Name'} == 'category three'",
...     group="Another Column Name")

set_columns

def set_columns(columns)

Set the columns. columns is either a list of column names, or a dict where the key is the column name, and the value is a DataGrid type. Valid DataGrid types are: "INTEGER", "FLOAT", "BOOLEAN", "DATETIME", "TEXT", "JSON", "VECTOR", or "IMAGE-ASSET".

Example:

>>> dg = DataGrid()
>>> dg.set_columns(["Column 1", "Column 2"])

__iter__

def __iter__()

Iterate over data.

to_csv

def to_csv(filename,
           sep=",",
           header=True,
           quotechar='"',
           encoding="utf8",
           converters=None)

Save a DataGrid as a Comma Separated Values (CSV) file.

Arguments:

filename - (str) the file to save the CSV data to
sep - (str) separator to use in CSV; default is ","
header - (bool) if True, write out the header; default is True
quotechar - (str) the character to use to surround text; default is '"'
encoding - (str) the encoding to use in the saved file; default is "utf8"
converters - (optional, dict) dictionary of functions to convert items into values. Keys are str (to match column name)
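The effect of sep, header, and quotechar can be previewed with the standard csv module. This is only a sketch under the assumption that the arguments behave like their csv.writer counterparts. The helper write_csv is hypothetical and writes to an in-memory buffer instead of a file.

```python
import csv
import io

columns = ["ID", "Label"]
rows = [[1, 'say "hi"'], [2, "a,b"]]


def write_csv(stream, columns, rows, sep=",", header=True, quotechar='"'):
    """Hypothetical helper mirroring the arguments described above."""
    writer = csv.writer(
        stream, delimiter=sep, quotechar=quotechar, lineterminator="\n"
    )
    if header:
        writer.writerow(columns)
    writer.writerows(rows)


buffer = io.StringIO()
write_csv(buffer, columns, rows)
text = buffer.getvalue()
print(text)
```

Values containing the separator or the quote character are wrapped in the quote character, so they survive a round trip through a CSV reader.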

Example:

>>> dg.to_csv("results.csv")

to_dataframe

def to_dataframe()

Convert a DataGrid into a pandas dataframe.

Example:

>>> df = dg.to_dataframe()

to_dicts

def to_dicts(column_names=None, format_map=None)

Iterate over data, returning dicts.

Arguments:

column_names - (optional, list of str) only return the given column names
format_map - (optional, dict) dictionary of column type to function that takes a value, and returns a new value.

>>> dg = DataGrid(columns=["column 1", "column 2"])
>>> dg.append([1, "one"])
>>> dg.append([2, "two"])
>>> dg.to_dicts()
[
 {"column 1": value1_1, "column 2": value1_2, ...},
 {"column 1": value2_1, "column 2": value2_2, ...},
]
>>> dg.to_dicts(["column 2"])
[
 {"column 2": value1_2, ...},
 {"column 2": value2_2, ...},
]

__getitem__

def __getitem__(item)

Get either a row or a column from the DataGrid.

Arguments:

item - (str or int) - if int, return the zero-based row; if str then item is the column name to return

>>> dg = DataGrid(columns=["column 1", "column 2"])
>>> dg.append([1, "one"])
>>> dg.append([2, "two"])
>>> dg[0]
[1, "one"]
>>> dg["column 1"]
[1, 2]
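The dual behavior of indexing (integer for a row, string for a column) can be sketched with a small stand-in class. TinyGrid is hypothetical and written for illustration only; it is not the Kangas class.

```python
class TinyGrid:
    """Hypothetical stand-in for illustration; not the Kangas class."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.data = []

    def append(self, row):
        self.data.append(list(row))

    def __getitem__(self, item):
        if isinstance(item, int):
            # Integer: zero-based row lookup.
            return self.data[item]
        # String: collect that column's value from every row.
        position = self.columns.index(item)
        return [row[position] for row in self.data]


grid = TinyGrid(["column 1", "column 2"])
grid.append([1, "one"])
grid.append([2, "two"])
print(grid[0], grid["column 1"], grid[1][1])
```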

nrows

@property
def nrows()

The number of rows in the DataGrid.

Example:

>>> dg.nrows
42

ncols

@property
def ncols()

The number of columns in the DataGrid.

Example:

>>> dg.ncols
10

shape

@property
def shape()

The (rows, columns) in the DataGrid.

Example:

>>> dg.shape
(42, 10)

download

@classmethod
def download(cls, url, ext=None)

Download a file from a URL.

Example:

>>> DataGrid.download("https://example.com/file.zip")

read_sklearn

@classmethod
def read_sklearn(cls, dataset_name)

Load a sklearn dataset by name.

Arguments:

dataset_name - (str) one of: 'boston', 'breast_cancer', 'diabetes', 'digits', 'iris', 'wine'

Example:

>>> dg = DataGrid.read_sklearn("iris")

read_parquet

@classmethod
def read_parquet(cls, filename, **kwargs)

Takes a parquet filename or URL and returns a DataGrid.

Note: requires pyarrow to be installed.

Example:

>>> dg = DataGrid.read_parquet("userdata1.parquet")

read_dataframe

@classmethod
def read_dataframe(cls, dataframe, **kwargs)

Takes a columnar pandas dataframe and returns a DataGrid.

Example:

>>> dg = DataGrid.read_dataframe(df)

read_json

@classmethod
def read_json(cls, filename, **kwargs)

Read JSON data, or a JSON or JSON Lines file [1]. The JSON should be a list of objects, or a series of objects, one per line.

Arguments:

filename - the name of the file or URL to read the JSON from, or the data
datetime_format - (str) the Python date format that dates are read. For example, use "%Y/%m/%d" for dates like "2022/12/01".
heuristics - (bool) whether to guess that some float values are datetime representations
name - (str) the name to use for the DataGrid
converters - (dict) dictionary of functions where the key is the column name, and the value is a function that takes a value and converts it to the proper type and form.
Note - the file or URL may end with ".zip", ".tgz", ".gz", or ".tar" extension. If so, it will be downloaded and unarchived. The JSON file is assumed to be in the archive with the same name as the file/URL. If it is not, then please use the kangas.download() function to download, and then read from the downloaded file.

[1] - https://jsonlines.org/
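The two accepted layouts differ only in how the objects are delimited. The sketch below parses both with the standard json module. Detecting the layout by a leading "[" is an assumption made for brevity, not a description of how Kangas decides, and the helper name is hypothetical.

```python
import json


def parse_json_or_json_lines(text):
    """Hypothetical helper: return a list of objects from either layout."""
    stripped = text.strip()
    if stripped.startswith("["):
        # A single JSON document holding a list of objects.
        return json.loads(stripped)
    # JSON Lines: one object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


as_list = parse_json_or_json_lines('[{"a": 1}, {"a": 2}]')
as_lines = parse_json_or_json_lines('{"a": 1}\n{"a": 2}\n')
print(as_list == as_lines)
```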

Example:

>>> from kangas import DataGrid
>>> dg = DataGrid.read_json("json_line_file.json")
>>> dg = DataGrid.read_json("https://instances.social/instances.json")
>>> dg = DataGrid.read_json("https://company.com/data.json.zip")
>>> dg = DataGrid.read_json("https://company.com/data.json.gz")
>>> dg.save()

read_datagrid

@classmethod
def read_datagrid(cls, filename, **kwargs)

Read (load) a datagrid file.

Arguments:

kwargs - any keyword to pass to the DataGrid constructor

Example:

>>> dg = DataGrid.read_datagrid("mnist.datagrid")

read_csv

@classmethod
def read_csv(cls,
             filename,
             header=0,
             sep=",",
             quotechar='"',
             datetime_format=None,
             heuristics=False,
             converters=None)

Takes a CSV filename and returns a DataGrid.

Arguments:

filename - the CSV file to import
header - (optional, int) row number (zero-based) of column headings
sep - (optional, str) used in the CSV parsing
quotechar - (optional, str) used in the CSV parsing
datetime_format - (optional, str) the datetime format
heuristics - (optional, bool) whether to guess that some float values are datetime representations
converters - (optional, dict) A dictionary of functions for converting values in certain columns. Keys are column labels.

Example:

>>> dg = DataGrid.read_csv("results.csv")

info

def info()

Display information about the DataGrid.

Example:

>>> dg.info()
DataGrid (on disk)
    Name   : coco-500-with-bbox
    Rows   : 500
    Columns: 7
#   Column                Non-Null Count DataGrid Type
--- -------------------- --------------- --------------------
1   ID                               500 INTEGER
2   Image                            500 IMAGE-ASSET
3   Score                            500 FLOAT
4   Confidence                       500 FLOAT
5   Filename                         500 TEXT
6   Category 5                       500 TEXT
7   Category 10                      500 TEXT

head

def head(n=5)

Display the first n rows of the DataGrid.

Arguments:

n - (optional, int) number of rows to show

Example:

>>> dg.head()
         row-id              ID           Score      Confidence        Filename
              1          391895 0.4974163872616 0.5726406230662 COCO_val2014_00
              2          522418 0.3612518386682 0.8539611863547 COCO_val2014_00
              3          184613 0.1060265192042 0.1809083103203 COCO_val2014_00
              4          318219 0.8879546879811 0.2918134509273 COCO_val2014_00
              5          554625 0.5889039105388 0.8253719528139 COCO_val2014_00
[500 rows x 4 columns]

tail

def tail(n=5)

Display the last n rows of the DataGrid.

Arguments:

n - (optional, int) number of rows to show

Example:

>>> dg.tail()
         row-id              ID           Score      Confidence        Filename
            496          391895 0.4974163872616 0.5726406230662 COCO_val2014_00
            497          522418 0.3612518386682 0.8539611863547 COCO_val2014_00
            498          184613 0.1060265192042 0.1809083103203 COCO_val2014_00
            499          318219 0.8879546879811 0.2918134509273 COCO_val2014_00
            500          554625 0.5889039105388 0.8253719528139 COCO_val2014_00

[500 rows x 4 columns]

get_columns

def get_columns()

Get the public-facing, non-hidden columns. Returns a list of strings.

Example:

>>> dg.get_columns()
['ID', 'Image', 'Score', 'Confidence', 'Filename']

append_iou_columns

def append_iou_columns(image_column_name, layer1, layer2)

Add Intersection Over Union columns between two layers on an image column.
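For readers unfamiliar with the metric, Intersection Over Union is the area shared by two regions divided by the area covered by either. The sketch below computes it for two axis-aligned boxes. The (x_min, y_min, x_max, y_max) box format is an assumption of the sketch and may differ from how Kangas stores annotations.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max).

    Assumption: this corner-based box format is chosen for the sketch and
    may differ from how Kangas stores bounding boxes.
    """
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])
    # Clamp to zero so that disjoint boxes have no intersection.
    intersection = max(0, right - left) * max(0, bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


iou = intersection_over_union((0, 0, 2, 2), (1, 1, 3, 3))
print(iou)
```

Identical boxes score 1.0 and boxes that do not overlap score 0.0.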

append_iou_column

def append_iou_column(image_column_name,
                      layer1,
                      layer2,
                      label,
                      new_column=None)

Add an Intersection Over Union column for the given label between two layers on an image column.

remove_unused_assets

def remove_unused_assets()

Remove any assets that don't have a reference to them from the datagrid table.

remove_select

def remove_select(where,
                  computed_columns=None,
                  limit=None,
                  offset=0,
                  debug=False)

Remove the rows that match a filter.

Arguments:

where - (str) a Python expression where column names are written as {"Column Name"}.
limit - (optional, int) select at most this value
offset - (optional, int) start selection at this offset
computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.

remove_rows

def remove_rows(*row_ids)

Remove specific rows, and any associated assets.

remove_columns

def remove_columns(*column_names)

Delete columns from the saved DataGrid.

Arguments:

column_names - list of column names to delete

Example:

>>> dg = kg.DataGrid(columns=["a", "b"])
>>> dg.save()
>>> dg.remove_columns("a")

append_column

def append_column(column_name, rows, verify=True)

Append a column to the DataGrid.

Arguments:

column_name - column name to append
rows - list of values
verify - (optional, bool) if True, verify the data
NOTE - rows is a list of values, one for each row.

Example:

>>> dg.append_column("New Column Name", ["row1", "row2", "row3", "row4"])

append_columns

def append_columns(columns, rows=None, verify=True)

Append columns to the DataGrid.

Arguments:

columns - either a list of column names to append (when rows is given), or a dictionary with column names as keys and column rows as values.
rows - (optional, list) list of list of values per row
verify - (optional, bool) if True, verify the data

Example:

>>> dg = kg.DataGrid(columns=["a", "b"])
>>> dg.append([11, 12])
>>> dg.append([21, 22])
>>> dg.append_columns(
...     ["New Column 1", "New Column 2"],
...     [
...      ["row1 col1",
...       "row2 col1"],
...      ["row1 col2",
...       "row2 col2"],
...     ])
>>> dg.append_columns(
...     {"New Column 3": ["row1 col3",
...                       "row2 col3"],
...      "New Column 4": ["row1 col4",
...                       "row2 col4"],
...     })
>>> dg.info()
row-id   a   b  New Column 1  New Column 2  New Column 3  New Column 4
     1  11  12     row1 col1     row1 col2     row1 col3     row1 col4
     2  21  22     row2 col1     row2 col2     row2 col3     row2 col4
[2 rows x 6 columns]
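Both call forms can be imitated on a plain list of lists. This sketch assumes that, in the list form, the second argument holds one inner list per new column (matching the layout of the example above). The helper add_columns is hypothetical, and its length check is only a guess at the kind of validation verify might perform.

```python
data = [[11, 12], [21, 22]]
columns = ["a", "b"]


def add_columns(columns, data, new_columns, values=None):
    """Hypothetical helper covering both call forms shown above.

    Assumption: in the list form, values holds one inner list per new
    column, matching the layout of the example above.
    """
    if isinstance(new_columns, dict):
        names, values = list(new_columns), list(new_columns.values())
    else:
        names = list(new_columns)
    for column_values in values:
        if len(column_values) != len(data):
            raise ValueError("each new column needs one value per row")
    columns.extend(names)
    # Transpose the per-column lists into per-row tuples and append them.
    for row, extra in zip(data, zip(*values)):
        row.extend(extra)


add_columns(columns, data, ["New Column 1"], [["row1 col1", "row2 col1"]])
add_columns(columns, data, {"New Column 3": ["row1 col3", "row2 col3"]})
print(columns, data)
```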

pop

def pop(index)

Pop a row by index from an in-memory DataGrid.

Arguments:

index - (int) position (zero-based) of row to remove

Example:

>>> row = dg.pop(0)

append

def append(row)

Append this row onto the datagrid data.

Example:

>>> dg.append(["column 1 value", "column 2 value", ...])

get_asset_ids

def get_asset_ids()

Get all of the asset IDs from the DataGrid.

Returns a list of asset IDs.

extend

def extend(rows, verify=True)

Extend the datagrid with the given rows.

Example:

>>> dg.extend([
...     ["row 1, column 1 value", "row 1, column 2 value", ...],
...     ["row 2, column 1 value", "row 2, column 2 value", ...],
...     ...,
... ])

get_schema

def get_schema()

Get the DataGrid schema.

Example:

>>> dg.get_schema()
{'row-id': {'field_name': 'column_0', 'type': 'ROW_ID'},
 'ID': {'field_name': 'column_1', 'type': 'INTEGER'},
 'Image': {'field_name': 'column_2', 'type': 'IMAGE-ASSET'},
 'Score': {'field_name': 'column_3', 'type': 'FLOAT'},
 'Confidence': {'field_name': 'column_4', 'type': 'FLOAT'},
 'Filename': {'field_name': 'column_5', 'type': 'TEXT'},
 'Category 5': {'field_name': 'column_6', 'type': 'TEXT'},
 'Category 10': {'field_name': 'column_7', 'type': 'TEXT'},
 'Image--metadata': {'field_name': 'column_8', 'type': 'JSON'}}

select_count

def select_count(where="1")

Get the count of items given a where expression.

Arguments:

where - a Python expression where column names are written as {"Column Name"}.

Example:

>>> dg.select_count("{'column 1'} > 0.5")
894
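To show what a where expression means, the toy evaluator below rewrites each {"Column Name"} reference into a lookup on a dict row and counts the rows for which the expression is true. It is a sketch only: the helper count_matching is hypothetical, eval is used purely for brevity and must never receive untrusted input, and this is not how Kangas executes queries.

```python
import re

# Matches {'Column Name'} or {"Column Name"}.
COLUMN_REFERENCE = re.compile(r"""\{(['"])(.+?)\1\}""")


def count_matching(rows, where):
    """Hypothetical helper: count dict rows for which the expression holds.

    Assumption: eval is used only to keep the sketch short; it must never
    be given untrusted input, and it is not how Kangas runs queries.
    """
    # Rewrite {"Column Name"} as a lookup into the current row.
    expression = COLUMN_REFERENCE.sub(lambda m: f"row[{m.group(2)!r}]", where)
    code = compile(expression, "<where>", "eval")
    return sum(
        1 for row in rows if eval(code, {"__builtins__": {}}, {"row": row})
    )


rows = [{"column 1": 0.2}, {"column 1": 0.7}, {"column 1": 0.9}]
count = count_matching(rows, "{'column 1'} > 0.5")
print(count)
```

With the default expression "1", every row is truthy, so the count equals the number of rows.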

select_dataframe

def select_dataframe(where="1",
                     sort_by=None,
                     sort_desc=False,
                     computed_columns=None,
                     limit=None,
                     offset=0,
                     select_columns=None)

Select rows from the database, optionally filtered by a query, and return them as a pandas dataframe in the requested sort order.

Arguments:

where - (optional, str) a Python expression where column names are written as {"Column Name"}.
select_columns - (list of str, optional) list of column names to select
sort_by - (optional, str) name of column to sort on
sort_desc - (optional, bool) sort descending?
limit - (optional, int) select at most this value
offset - (optional, int) start selection at this offset
computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.

Example:

>>> df = dg.select_dataframe("{'column name 1'} == {'column name 2'} and {'score'} < -1")

select

def select(where="1",
           sort_by=None,
           sort_desc=False,
           to_dicts=False,
           count=False,
           computed_columns=None,
           limit=None,
           offset=0,
           debug=False,
           select_columns=None)

Select rows from the database, optionally filtered by a query, and return them in the requested sort order.

Arguments:

where - (optional, str) a Python expression where column names are written as {"Column Name"}.
select_columns - (optional, list of str) a list of column names to select
sort_by - (optional, str) name of column to sort on
sort_desc - (optional, bool) sort descending?
limit - (optional, int) select at most this value
offset - (optional, int) start selection at this offset
to_dicts - (optional, bool) if True, return the rows in dicts where the keys are the column names.
count - (optional, bool) if True, return the count of matching rows
computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.

Example:

>>> dg.select("{'column name 1'} == {'column name 2'} and {'score'} < -1")
[
   ["row 1, column 1 value", "row 1, column 2 value", ...],
   ["row 2, column 1 value", "row 2, column 2 value", ...],
   ...
]
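How sort_by, sort_desc, limit, and offset combine can be sketched on a list of dict rows. The sketch assumes that offset and limit are applied after sorting, as in a SQL ORDER BY ... LIMIT ... OFFSET query. The helper select_rows is hypothetical.

```python
def select_rows(rows, sort_by=None, sort_desc=False, limit=None, offset=0):
    """Hypothetical helper: sort dict rows, then apply offset and limit.

    Assumption: offset and limit are applied after sorting, as in a SQL
    ORDER BY ... LIMIT ... OFFSET query.
    """
    if sort_by is not None:
        rows = sorted(rows, key=lambda row: row[sort_by], reverse=sort_desc)
    end = None if limit is None else offset + limit
    return list(rows[offset:end])


rows = [{"score": s} for s in (0.3, 0.9, 0.1, 0.5)]
page = select_rows(rows, sort_by="score", sort_desc=True, limit=2, offset=1)
print(page)
```

Here the rows are sorted from highest to lowest score, the first one is skipped, and the next two are returned.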

save

def save(filename=None, create_thumbnails=None)

Create the SQLite database on disk.

Arguments:

filename - (optional, str) the name of the filename to save to
create_thumbnails - (optional, bool) if True, then create thumbnail images for assets

Example:

>>> dg.save()
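For a feel of what storing rows in SQLite involves, the sketch below writes a small list of rows with the standard sqlite3 module. The in-memory database and the table name "grid" are assumptions standing in for the file and schema that Kangas actually writes. The generic field names imitate the style of the get_schema() output shown earlier.

```python
import sqlite3

columns = ["ID", "Score"]
data = [[1, 0.5], [2, 0.75]]

# Assumption: an in-memory database and a table named "grid" stand in for
# the file and schema that Kangas actually writes.
connection = sqlite3.connect(":memory:")
# Generic field names in the style of the get_schema() output.
fields = [f"column_{i + 1}" for i in range(len(columns))]
connection.execute(f"CREATE TABLE grid ({', '.join(fields)})")
placeholders = ", ".join("?" for _ in fields)
connection.executemany(f"INSERT INTO grid VALUES ({placeholders})", data)
stored = connection.execute("SELECT * FROM grid ORDER BY column_1").fetchall()
print(stored)
```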

set_about

def set_about(markdown)

Set the about page for this DataGrid.

Arguments:

markdown - (str) the markdown text of the About page

set_about_from_script

def set_about_from_script(filename)

Set the about page for this DataGrid from the script that created it.

Arguments:

filename - (str) the file that created the DataGrid

get_about

def get_about()

Get the about page for this DataGrid.

display_about

def display_about()

Display the about page for this DataGrid as markdown.

Note: this requires being in an IPython-like environment.

upgrade

def upgrade()

Upgrade to latest version of datagrid.

Kangas DataGrid is completely open source; sponsored by Comet ML
