DataGrid

Doug Blank edited this page May 20, 2023 · 10 revisions

kangas.datatypes.datagrid

DataGrid Objects

class DataGrid()

DataGrid instances have the following attributes:

  • columns - a list of column names, or a dict of column names mapped to column types
  • data - a list of lists where each is a row of data
  • name - a name of the tabular data

__init__

def __init__(data=None,
             columns=None,
             name="Untitled",
             datetime_format="%Y/%m/%d",
             heuristics=False,
             converters=None)

Create a DataGrid instance.

Arguments:

  • data - (optional, list of lists) The rows of data

  • columns - (optional, list of strings) the column titles

  • name - (optional, str) a name of the tabular data

  • datetime_format - (optional, str) the Python date format used to parse dates. For example, use "%Y/%m/%d" for dates like "2022/12/01".

  • heuristics - if True, guess that some numbers might be dates

  • converters - (optional, dict) dictionary of functions to convert items into values. Keys are str (to match column name)

    NOTES:

    The variable dg is used below as an example DataGrid instance.

    If column names are not provided, then names will be generated in the sequence "A", "B", ... "Z", "AA", "BB", ...

    The DataGrid instance can be imagined as a two-dimensional list of lists. The first dimension is the row, and the second dimension is the column. For example, dg[5][2] would return the value in the 6th row and 3rd column (both indexes are zero-based).

    Likewise, you can use dg.append(ROW), dg.extend(ROWS), and dg.pop(INDEX) methods.

    Rows can be either lists of values, or JSON-like dictionaries of the form {"COLUMN NAME": VALUE, ...}.

    These are common methods to use on a DataGrid:

    • dg.info() - data about rows, columns, and datatypes
    • dg.head() - show the first few rows of a DataGrid
    • dg.tail() - show the last few rows of a DataGrid
    • dg.show() - open up an IFrame (if in a Jupyter Notebook) or a web browser page showing the DataGrid UI
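
To make the datetime_format and converters arguments more concrete, here is a minimal standard-library sketch of the kind of per-column conversion they describe. It does not call Kangas; the helper name apply_converters and the sample row are hypothetical, and the real constructor may apply converters differently.

```python
from datetime import datetime

# Hypothetical helper (not part of Kangas): apply per-column converter
# functions to one row; columns without a converter are left unchanged.
def apply_converters(row, columns, converters):
    return [converters.get(name, lambda value: value)(value)
            for name, value in zip(columns, row)]

columns = ["Date", "Score"]
datetime_format = "%Y/%m/%d"  # same pattern as the documented default

# Keys are column names, values are functions of a single item
converters = {
    "Date": lambda text: datetime.strptime(text, datetime_format),
    "Score": float,
}

converted = apply_converters(["2022/12/01", "0.75"], columns, converters)
print(converted)
```

Keying the dictionary by column name mirrors the documented "Keys are str (to match column name)" convention.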

Examples:

>>> from kangas import DataGrid, Image
>>> import glob
>>> dg = DataGrid(name="Images", columns=["Image", "Score"])
>>> for filename in glob.glob("*.jpg"):
...     score = model.predict(filename)  # model is your own, previously defined model
...     dg.append([Image(filename), score])
>>> dg.show()

show

def show(filter=None,
         host=None,
         port=4000,
         debug=None,
         height="750px",
         width="100%",
         protocol="http",
         hide_selector=None,
         use_ngrok=False,
         cli_kwargs=None,
         **kwargs)

Open DataGrid in an IFrame in the Jupyter environment or browser.

Arguments:

  • host - (optional, str) the host name or IP number for the servers to listen to
  • filter - (optional, str) a filter to set on the DataGrid
  • port - (optional, int) the port number for the servers to listen to
  • debug - (optional, str) will display additional information from the server (may not be visible in a notebook)
  • height - (optional, str) the height of iframe in px or percentage
  • width - (optional, str) the width of iframe in px or percentage
  • use_ngrok - (optional, bool) force using ngrok as a proxy
  • cli_kwargs - (dict) a dictionary mapping kangas server flag names to their values (such as: {"backend-port": 8000})
  • kwargs - additional URL parameters to pass to server

Example:

>>> import kangas as kg
>>> dg = kg.DataGrid()
>>> # append data to DataGrid
>>> dg.show()
>>> dg.show("{'Column Name'} == 'category three'")
>>> dg.show("{'Column Name'} == 'category three'",
...     group="Another Column Name")

set_columns

def set_columns(columns)

Set the columns. columns is either a list of column names, or a dict where the key is the column name, and the value is a DataGrid type. Valid DataGrid types are: "INTEGER", "FLOAT", "BOOLEAN", "DATETIME", "TEXT", "JSON", "VECTOR", or "IMAGE-ASSET".

Example:

>>> dg = DataGrid()
>>> dg.set_columns(["Column 1", "Column 2"])

__iter__

def __iter__()

Iterate over data.

to_csv

def to_csv(filename,
           sep=",",
           header=True,
           quotechar='"',
           encoding="utf8",
           converters=None)

Save a DataGrid as a Comma Separated Values (CSV) file.

Arguments:

  • filename - (str) the file to save the CSV data to
  • sep - (str) separator to use in CSV; default is ","
  • header - (bool) if True, write out the header; default is True
  • quotechar - (str) the character to use to surround text; default is '"'
  • encoding - (str) the encoding to use in the saved file; default is "utf8"
  • converters - (optional, dict) dictionary of functions to convert items into values. Keys are str (to match column name)

Example:

>>> dg.to_csv("results.csv")
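
As a rough illustration of what the sep, header, and quotechar arguments control, the following sketch writes a small table with Python's standard csv module. It is not the Kangas implementation; the columns and rows are made-up sample data.

```python
import csv
import io

columns = ["ID", "Label"]
rows = [[1, 'say "hi"'], [2, "plain"]]

buffer = io.StringIO()
# delimiter plays the role of sep; quotechar surrounds text that needs quoting
writer = csv.writer(buffer, delimiter=",", quotechar='"', lineterminator="\n")
writer.writerow(columns)  # corresponds to header=True
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```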

to_dataframe

def to_dataframe()

Convert a DataGrid into a pandas dataframe.

Example:

>>> df = dg.to_dataframe()

to_dicts

def to_dicts(column_names=None, format_map=None)

Iterate over data, returning dicts.

Arguments:

  • column_names - (optional, list of str) only return the given column names
  • format_map - (optional, dict) dictionary of column type to function that takes a value, and returns a new value.

Example:

>>> dg = DataGrid(columns=["column 1", "column 2"])
>>> dg.append([1, "one"])
>>> dg.append([2, "two"])
>>> dg.to_dicts()
[
 {"column 1": value1_1, "column 2": value1_2, ...},
 {"column 1": value2_1, "column 2": value2_2, ...},
]
>>> dg.to_dicts(["column 2"])
[
 {"column 2": value1_2, ...},
 {"column 2": value2_2, ...},
]
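
The relationship between rows, column names, and the returned dicts can be sketched in plain Python. The function rows_to_dicts below is a hypothetical stand-in, not the Kangas method, and it ignores format_map.

```python
# Hypothetical stand-in for the row-to-dict conversion; not Kangas code.
def rows_to_dicts(columns, rows, column_names=None):
    wanted = columns if column_names is None else column_names
    for row in rows:
        record = dict(zip(columns, row))
        # Keep only the requested columns, in the requested order
        yield {name: record[name] for name in wanted}

columns = ["column 1", "column 2"]
rows = [[1, "one"], [2, "two"]]

all_columns = list(rows_to_dicts(columns, rows))
only_second = list(rows_to_dicts(columns, rows, column_names=["column 2"]))
print(all_columns)
print(only_second)
```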

__getitem__

def __getitem__(item)

Get either a row or a column from the DataGrid.

Arguments:

  • item - (str or int) - if int, return the zero-based row; if str then item is the column name to return

Example:

>>> dg = DataGrid(columns=["column 1", "column 2"])
>>> dg.append([1, "one"])
>>> dg.append([2, "two"])
>>> dg[0]
[1, "one"]
>>> dg["column 1"]
[1, 2]
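
One way to picture the int-versus-str dispatch is the small class below. TinyGrid is a hypothetical illustration written for this page, not the Kangas implementation.

```python
class TinyGrid:
    """Hypothetical stand-in showing row/column lookup; not Kangas code."""

    def __init__(self, columns, data):
        self.columns = list(columns)
        self.data = [list(row) for row in data]

    def __getitem__(self, item):
        if isinstance(item, int):
            # Integer index: return the zero-based row
            return self.data[item]
        if isinstance(item, str):
            # String index: return every value in the named column
            index = self.columns.index(item)
            return [row[index] for row in self.data]
        raise TypeError("item must be an int or a str")

grid = TinyGrid(["column 1", "column 2"], [[1, "one"], [2, "two"]])
first_row = grid[0]
first_column = grid["column 1"]
cell = grid[1][1]  # row 1, then column 1 within that row
```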

nrows

@property
def nrows()

The number of rows in the DataGrid.

Example:

>>> dg.nrows
42

ncols

@property
def ncols()

The number of columns in the DataGrid.

Example:

>>> dg.ncols
10

shape

@property
def shape()

The (rows, columns) in the DataGrid.

Example:

>>> dg.shape
(42, 10)

download

@classmethod
def download(cls, url, ext=None)

Download a file from a URL.

Example:

>>> DataGrid.download("https://example.com/file.zip")

read_sklearn

@classmethod
def read_sklearn(cls, dataset_name)

Load a sklearn dataset by name.

Arguments:

  • dataset_name - (str) one of: 'boston', 'breast_cancer', 'diabetes', 'digits', 'iris', 'wine'

Example:

>>> dg = DataGrid.read_sklearn("iris")

read_parquet

@classmethod
def read_parquet(cls, filename, **kwargs)

Takes a parquet filename or URL and returns a DataGrid.

Note: requires pyarrow to be installed.

Example:

>>> dg = DataGrid.read_parquet("userdata1.parquet")

read_dataframe

@classmethod
def read_dataframe(cls, dataframe, **kwargs)

Takes a columnar pandas dataframe and returns a DataGrid.

Example:

>>> dg = DataGrid.read_dataframe(df)

read_json

@classmethod
def read_json(cls, filename, **kwargs)

Read JSON data, a JSON file, or a JSON Lines file [1]. JSON should be a list of objects, or a series of objects, one per line.

Arguments:

  • filename - the name of the file or URL to read the JSON from, or the data

  • datetime_format - (str) the Python date format used to parse dates. For example, use "%Y/%m/%d" for dates like "2022/12/01".

  • heuristics - (bool) whether to guess that some float values are datetime representations

  • name - (str) the name to use for the DataGrid

  • converters - (dict) dictionary of functions where the key is the column name, and the value is a function that takes a value and converts it to the proper type and form.

  • Note - the file or URL may end with ".zip", ".tgz", ".gz", or ".tar" extension. If so, it will be downloaded and unarchived. The JSON file is assumed to be in the archive with the same name as the file/URL. If it is not, then please use the kangas.download() function to download, and then read from the downloaded file.

    [1] - https://jsonlines.org/

Example:

>>> from kangas import DataGrid
>>> dg = DataGrid.read_json("json_line_file.json")
>>> dg = DataGrid.read_json("https://instances.social/instances.json")
>>> dg = DataGrid.read_json("https://company.com/data.json.zip")
>>> dg = DataGrid.read_json("https://company.com/data.json.gz")
>>> dg.save()
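
The two accepted layouts (a JSON list of objects, or one object per line) can be told apart with the standard json module. The helper parse_json_records is a hypothetical sketch of that distinction and is not how Kangas necessarily reads the data.

```python
import json

def parse_json_records(text):
    """Hypothetical helper: accept a JSON list of objects or JSON Lines."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document, so treat it as one object per line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]

as_list = parse_json_records('[{"a": 1}, {"a": 2}]')
as_lines = parse_json_records('{"a": 1}\n{"a": 2}\n')
print(as_list == as_lines)
```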

read_datagrid

@classmethod
def read_datagrid(cls, filename, **kwargs)

Read (load) a datagrid file.

Arguments:

  • kwargs - any keyword to pass to the DataGrid constructor

Example:

>>> dg = DataGrid.read_datagrid("mnist.datagrid")

read_csv

@classmethod
def read_csv(cls,
             filename,
             header=0,
             sep=",",
             quotechar='"',
             datetime_format=None,
             heuristics=False,
             converters=None)

Takes a CSV filename and returns a DataGrid.

Arguments:

  • filename - the CSV file to import
  • header - (optional, int) row number (zero-based) of column headings
  • sep - (optional, str) used in the CSV parsing
  • quotechar - (optional, str) used in the CSV parsing
  • datetime_format - (optional, str) the datetime format
  • heuristics - (optional, bool) whether to guess that some float values are datetime representations
  • converters - (optional, dict) A dictionary of functions for converting values in certain columns. Keys are column labels.

Example:

>>> dg = DataGrid.read_csv("results.csv")
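
The roles of header, sep, quotechar, and converters can be sketched with Python's standard csv module. The function read_rows and the sample text are hypothetical; this illustrates the arguments and is not the Kangas parser.

```python
import csv
import io

def read_rows(text, header=0, sep=",", quotechar='"', converters=None):
    """Hypothetical sketch: parse CSV text into column names and rows."""
    converters = converters or {}
    lines = list(csv.reader(io.StringIO(text), delimiter=sep, quotechar=quotechar))
    columns = lines[header]  # zero-based row holding the column headings
    rows = [
        # Columns without a converter keep their text value
        [converters.get(name, str)(value) for name, value in zip(columns, raw)]
        for raw in lines[header + 1:]
    ]
    return columns, rows

sample = 'Name;Score\n"Smith; J.";0.5\n"Lee";0.9\n'
columns, rows = read_rows(sample, sep=";", converters={"Score": float})
print(columns, rows)
```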

info

def info()

Display information about the DataGrid.

Example:

>>> dg.info()
DataGrid (on disk)
    Name   : coco-500-with-bbox
    Rows   : 500
    Columns: 7
#   Column                Non-Null Count DataGrid Type
--- -------------------- --------------- --------------------
1   ID                               500 INTEGER
2   Image                            500 IMAGE-ASSET
3   Score                            500 FLOAT
4   Confidence                       500 FLOAT
5   Filename                         500 TEXT
6   Category 5                       500 TEXT
7   Category 10                      500 TEXT

head

def head(n=5)

Display the first n rows of the DataGrid.

Arguments:

  • n - (optional, int) number of rows to show

Example:

>>> dg.head()
         row-id              ID           Score      Confidence        Filename
              1          391895 0.4974163872616 0.5726406230662 COCO_val2014_00
              2          522418 0.3612518386682 0.8539611863547 COCO_val2014_00
              3          184613 0.1060265192042 0.1809083103203 COCO_val2014_00
              4          318219 0.8879546879811 0.2918134509273 COCO_val2014_00
              5          554625 0.5889039105388 0.8253719528139 COCO_val2014_00
[500 rows x 4 columns]

tail

def tail(n=5)

Display the last n rows of the DataGrid.

Arguments:

  • n - (optional, int) number of rows to show

Example:

>>> dg.tail()
         row-id              ID           Score      Confidence        Filename
            496          391895 0.4974163872616 0.5726406230662 COCO_val2014_00
            497          522418 0.3612518386682 0.8539611863547 COCO_val2014_00
            498          184613 0.1060265192042 0.1809083103203 COCO_val2014_00
            499          318219 0.8879546879811 0.2918134509273 COCO_val2014_00
            500          554625 0.5889039105388 0.8253719528139 COCO_val2014_00

[500 rows x 4 columns]

get_columns

def get_columns()

Get the public-facing, non-hidden columns. Returns a list of strings.

Example:

>>> dg.get_columns()
['ID', 'Image', 'Score', 'Confidence', 'Filename']

append_iou_columns

def append_iou_columns(image_column_name, layer1, layer2)

Add Intersection Over Union columns between two layers on an image column.

append_iou_column

def append_iou_column(image_column_name,
                      layer1,
                      layer2,
                      label,
                      new_column=None)

Add an Intersection Over Union column between two layers on an image column, for the given label.
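
For readers unfamiliar with the metric, Intersection Over Union is the overlap area of two regions divided by the area of their union. The sketch below computes it for two axis-aligned boxes; the (x, y, width, height) box format and the function name iou are assumptions for illustration and do not describe how Kangas stores layers.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, width, height); format is assumed."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if disjoint)
    overlap_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union else 0.0

score = iou((0, 0, 10, 10), (5, 0, 10, 10))
print(score)
```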

remove_unused_assets

def remove_unused_assets()

Remove any assets that don't have a reference to them from the datagrid table.

remove_select

def remove_select(where,
                  computed_columns=None,
                  limit=None,
                  offset=0,
                  debug=False)

Remove the rows that match a filter.

Arguments:

  • where - (str) a Python expression where column names are written as {"Column Name"}.
  • limit - (optional, int) select at most this value
  • offset - (optional, int) start selection at this offset
  • computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.

remove_rows

def remove_rows(*row_ids)

Remove specific rows, and any associated assets.

remove_columns

def remove_columns(*column_names)

Delete columns from the saved DataGrid.

Arguments:

  • column_names - list of column names to delete

Example:

>>> dg = kg.DataGrid(columns=["a", "b"])
>>> dg.save()
>>> dg.remove_columns("a")

append_column

def append_column(column_name, rows, verify=True)

Append a column to the DataGrid.

Arguments:

  • column_name - column name to append

  • rows - list of values

  • verify - (optional, bool) if True, verify the data

  • NOTE - rows is a list of values, one for each row.

Example:

>>> dg.append_column("New Column Name", ["row1", "row2", "row3", "row4"])

append_columns

def append_columns(columns, rows=None, verify=True)

Append columns to the DataGrid.

Arguments:

  • columns - a list of column names to append (if rows is given), or a dictionary with column names as keys and lists of column values as values
  • rows - (optional, list) list of lists of values per row
  • verify - (optional, bool) if True, verify the data

Example:

>>> dg = kg.DataGrid(columns=["a", "b"])
>>> dg.append([11, 12])
>>> dg.append([21, 22])
>>> dg.append_columns(
...     ["New Column 1", "New Column 2"],
...     [
...      ["row1 col1",
...       "row2 col1"],
...      ["row1 col2",
...       "row2 col2"],
...     ])
>>> dg.append_columns(
...     {"New Column 3": ["row1 col3",
...                       "row2 col3"],
...      "New Column 4": ["row1 col4",
...                       "row2 col4"],
...     })
>>> dg.info()
row-id   a   b  New Column 1  New Column 2  New Column 3  New Column 4
     1  11  12     row1 col1     row1 col2     row1 col3     row1 col4
     2  21  22     row2 col1     row2 col2     row2 col3     row2 col4
[2 rows x 6 columns]

pop

def pop(index)

Pop a row by index from an in-memory DataGrid.

Arguments:

  • index - (int) position (zero-based) of row to remove

Example:

>>> row = dg.pop(0)

append

def append(row)

Append this row onto the datagrid data.

Example:

>>> dg.append(["column 1 value", "column 2 value", ...])

get_asset_ids

def get_asset_ids()

Get all of the asset IDs from the DataGrid.

Returns a list of asset IDs.

extend

def extend(rows, verify=True)

Extend the datagrid with the given rows.

Example:

>>> dg.extend([
...     ["row 1, column 1 value", "row 1, column 2 value", ...],
...     ["row 2, column 1 value", "row 2, column 2 value", ...],
...     ...,
... ])

get_schema

def get_schema()

Get the DataGrid schema.

Example:

>>> dg.get_schema()
{'row-id': {'field_name': 'column_0', 'type': 'ROW_ID'},
 'ID': {'field_name': 'column_1', 'type': 'INTEGER'},
 'Image': {'field_name': 'column_2', 'type': 'IMAGE-ASSET'},
 'Score': {'field_name': 'column_3', 'type': 'FLOAT'},
 'Confidence': {'field_name': 'column_4', 'type': 'FLOAT'},
 'Filename': {'field_name': 'column_5', 'type': 'TEXT'},
 'Category 5': {'field_name': 'column_6', 'type': 'TEXT'},
 'Category 10': {'field_name': 'column_7', 'type': 'TEXT'},
 'Image--metadata': {'field_name': 'column_8', 'type': 'JSON'}}

select_count

def select_count(where="1")

Get the count of items given a where expression.

Arguments:

  • where - a Python expression where column names are written as {"Column Name"}.

Example:

>>> dg.select_count("{'column 1'} > 0.5")
894
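
To show how a where expression reads, the sketch below rewrites each {'Column Name'} reference into a lookup on a row and evaluates the result as Python. This is a hypothetical, in-memory approximation (the function matches is not part of Kangas, which queries its database instead), and eval should not be used on untrusted input.

```python
import re

def matches(where, row):
    """Hypothetical evaluator: replace {'Name'} with a lookup in row (a dict)."""
    expression = re.sub(r"\{(['\"])(.+?)\1\}",
                        lambda m: "row[%r]" % m.group(2), where)
    # Illustration only: eval is unsafe for untrusted expressions
    return bool(eval(expression, {"__builtins__": {}}, {"row": row}))

rows = [{"column 1": 0.9}, {"column 1": 0.2}, {"column 1": 0.7}]
count = sum(matches("{'column 1'} > 0.5", row) for row in rows)
total = sum(matches("1", row) for row in rows)  # the default where="1" keeps every row
print(count, total)
```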

select_dataframe

def select_dataframe(where="1",
                     sort_by=None,
                     sort_desc=False,
                     computed_columns=None,
                     limit=None,
                     offset=0,
                     select_columns=None)

Perform a selection on the database, including possibly a query, and returning rows in various sort orderings.

Arguments:

  • where - (optional, str) a Python expression where column names are written as {"Column Name"}.
  • select_columns - (list of str, optional) list of column names to select
  • sort_by - (optional, str) name of column to sort on
  • sort_desc - (optional, bool) sort descending?
  • limit - (optional, int) select at most this value
  • offset - (optional, int) start selection at this offset
  • computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.

Example:

>>> df = dg.select_dataframe("{'column name 1'} == {'column name 2'} and {'score'} < -1")

select

def select(where="1",
           sort_by=None,
           sort_desc=False,
           to_dicts=False,
           count=False,
           computed_columns=None,
           limit=None,
           offset=0,
           debug=False,
           select_columns=None)

Perform a selection on the database, including possibly a query, and returning rows in various sort orderings.

Arguments:

  • where - (optional, str) a Python expression where column names are written as {"Column Name"}.
  • select_columns - (optional, list of str) a list of column names to select
  • sort_by - (optional, str) name of column to sort on
  • sort_desc - (optional, bool) sort descending?
  • limit - (optional, int) select at most this value
  • offset - (optional, int) start selection at this offset
  • to_dicts - (optional, bool) if True, return the rows in dicts where the keys are the column names.
  • count - (optional, bool) if True, return the count of matching rows
  • computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.

Example:

>>> dg.select("{'column name 1'} == {'column name 2'} and {'score'} < -1")
[
   ["row 1, column 1 value", "row 1, column 2 value", ...],
   ["row 2, column 1 value", "row 2, column 2 value", ...],
   ...
]
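
The interaction of sort_by, sort_desc, limit, and offset can be pictured with plain lists. The function select_rows is a hypothetical in-memory analogue; it assumes, as in SQL, that the offset and limit are applied after sorting, which this page does not state explicitly.

```python
def select_rows(rows, sort_by=None, sort_desc=False, limit=None, offset=0):
    """Hypothetical in-memory analogue of the sort/limit/offset arguments."""
    result = list(rows)
    if sort_by is not None:
        result.sort(key=lambda row: row[sort_by], reverse=sort_desc)
    # Assumption: skip `offset` rows first, then keep at most `limit` rows
    end = None if limit is None else offset + limit
    return result[offset:end]

rows = [{"score": 0.2}, {"score": 0.9}, {"score": 0.5}, {"score": 0.7}]
page = select_rows(rows, sort_by="score", sort_desc=True, limit=2, offset=1)
print(page)
```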

save

def save(filename=None, create_thumbnails=None)

Create the SQLite database on disk.

Arguments:

  • filename - (optional, str) the name of the file to save to
  • create_thumbnails - (optional, bool) if True, then create thumbnail images for assets

Example:

>>> dg.save()

set_about

def set_about(markdown)

Set the about page for this DataGrid.

Arguments:

  • markdown - (str) the text of the markdown About text

set_about_from_script

def set_about_from_script(filename)

Set the about page for this DataGrid from the script that created it.

Arguments:

  • filename - (str) the file that created the DataGrid

get_about

def get_about()

Get the about page for this DataGrid.

display_about

def display_about()

Display the about page for this DataGrid as markdown.

Note: this requires being in an IPython-like environment.

upgrade

def upgrade()

Upgrade to latest version of datagrid.
