Releases: rstudio/pointblank
v0.12.1
v0.12.0
New features
- Complete {tidyselect} support for the `columns` argument of all validation functions, as well as in `has_columns()` and `info_columns()`. The `columns` argument can now take familiar column-selection expressions as one would use inside `dplyr::select()`. This also begins a process of deprecation: `columns = vars(...)` will continue to work, but `c()` now supersedes `vars()`.
- If passing an external vector of column names, it should be wrapped in `all_of()`.
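A brief sketch of the new selection style (the validation calls and the built-in `small_table` dataset are from the package; the specific columns chosen are just for illustration):

```r
library(pointblank)

# Bare column names given via c() now supersede vars()
agent <- create_agent(tbl = small_table) |>
  col_vals_not_null(columns = c(a, b)) |>             # was: columns = vars(a, b)
  col_vals_not_null(columns = starts_with("date")) |> # tidyselect helpers work too
  interrogate()

# An external vector of column names should be wrapped in all_of()
key_cols <- c("a", "b")
agent_2 <- create_agent(tbl = small_table) |>
  col_vals_not_null(columns = all_of(key_cols)) |>
  interrogate()
```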
- The `label` argument of validation functions now exposes the following string variables via `{glue}` syntax:
  - `"{.step}"`: the validation step name
  - `"{.col}"`: the current column name
  - `"{.seg_col}"`: the current segment's column name
  - `"{.seg_val}"`: the current segment's value/group

  These dynamic values may be useful for validations that get expanded into multiple steps.
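For instance, a single call that expands over several columns can label each generated step with the column it checks (a minimal sketch using the package's `small_table` dataset; the columns chosen are illustrative):

```r
library(pointblank)

agent <- create_agent(tbl = small_table) |>
  col_vals_not_null(
    columns = c(a, b),
    label = "NULL check on column {.col}"  # resolves per expanded step
  ) |>
  interrogate()
```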
- `interrogate()` gains two new options for printing progress in the console output:
  - `progress`: whether interrogation progress should be printed to the console (`TRUE` for interactive sessions, same as before)
  - `show_step_label`: whether each validation step's label value should be printed alongside the progress
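Used together, the two options might look like this (a sketch; `small_table` and the particular validation step are illustrative):

```r
library(pointblank)

agent <- create_agent(tbl = small_table) |>
  col_vals_gt(columns = vars(d), value = 0, label = "d is positive") |>
  interrogate(
    progress = TRUE,        # print interrogation progress to the console
    show_step_label = TRUE  # print each step's label alongside the progress
  )
```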
Minor improvements and bug fixes
- Fixes an issue with rendering reports in Quarto HTML documents.
- When no columns are returned from a {tidyselect} expression in `columns`, the agent's report now displays the originally supplied expression rather than a blank (e.g., in `create_agent(small_table) |> col_vals_null(matches("z"))`).
- Fixes an issue with the hashing implementation, improving performance and the alignment of validation steps in the multiagent.
v0.11.4
- Fixes an issue with gt `0.9.0` compatibility.
v0.11.3
- Fixes an issue with tables not rendering due to an interaction with the gt package.
v0.11.2
- Internal changes were made to ensure compatibility with an in-development version of R.
v0.11.1
- Updated all help files to pass HTML validation.
v0.11.0
New features
- The `row_count_match()` function can now match the count of rows in the target table to a literal value (in addition to comparing row counts to a secondary table).
- The analogous `col_count_match()` function was added to compare column counts in the target table to those of a secondary table, or to match on a literal value.
- Substitution syntax has been added to the `tbl_store()` function with `{{ <name> }}`. This is a great way to make table prep more concise, readable, and less prone to errors.
- The `get_informant_report()` function has been enhanced with more `width` options. Aside from the `"standard"` and `"small"` sizes, we can now supply any pixel- or percent-based width to precisely size the reporting.
- Added support for validating data in BigQuery tables.
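A sketch of the literal-value forms (assuming `count` is the comparison argument in both functions, and that `small_table` has 13 rows and 8 columns):

```r
library(pointblank)

agent <- create_agent(tbl = small_table) |>
  row_count_match(count = 13) |>  # literal row count instead of a second table
  col_count_match(count = 8) |>   # literal column count
  interrogate()
```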
Documentation
- All functions in the package now have better usage examples.
v0.10.0
New features
- The new function `row_count_match()` (plus `expect_row_count_match()` and `test_row_count_match()`) checks for an exact match of row counts across two tables (the target table and a comparison table of your choosing). Works equally well for local tables and for database and Spark tables.
- The new `tbl_match()` function (along with `expect_tbl_match()` and `test_tbl_match()`) checks for an exact match of the target table against a comparison table. It checks for a strict match on table schemas, for equivalent row counts, and then for exact matches of cell values across the two tables.
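A minimal sketch of the test variants (the `tbl_compare` argument name is an assumption here; the identical tables are purely for illustration):

```r
library(pointblank)

tbl_a <- small_table
tbl_b <- small_table  # identical copy for illustration

# Passes only when schemas, row counts, and all cell values match
test_tbl_match(tbl_a, tbl_compare = tbl_b)

# Row-count comparison against another table
test_row_count_match(tbl_a, count = tbl_b)
```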
Minor improvements and bug fixes
- The `set_tbl()` function gains the `tbl_name` and `label` arguments to provide an opportunity to set metadata on the new target table.
- Support for `mssql` tables has been restored and works exceedingly well for the majority of validation functions (the few that are incompatible provide messaging about not being supported).
Documentation
- All functions in the package now have usage examples.
- An RStudio Cloud project has been prepared with .Rmd files that contain explainers and runnable examples for each function in the package. Look at the project README for a link to the project.
Breaking changes
- The `read_fn` argument in `create_agent()` and `create_informant()` has been deprecated in favor of an enhanced `tbl` argument. Now, we can supply a variety of inputs to `tbl` for associating a target table with an agent or an informant. With `tbl`, it's possible to provide a table (e.g., `data.frame`, `tbl_df`, `tbl_dbi`, `tbl_spark`, etc.), an expression (a table-prep formula or a function) to read in the table only at interrogation time, or a table source expression to get table preparations from a table store (as an in-memory object or as defined in a YAML file).
- The `set_read_fn()`, `remove_read_fn()`, and `remove_tbl()` functions were removed since the `read_fn` argument has been deprecated (and there's virtually no need to remove a table from an object with `remove_tbl()` now).
v0.9.0
New features
- The new `rows_complete()` validation function (along with the `expect_rows_complete()` and `test_rows_complete()` expectation and test variants) checks that rows are complete, i.e., free of any `NA`/`NULL` values (optionally constrained to a selection of specified `columns`).
- The new function `serially()` (along with `expect_serially()` and `test_serially()`) allows a series of tests to run in sequence before either culminating in a final validation step or simply exiting the series. This construction allows for pre-testing that may make sense before a validation step. For example, there may be situations where it's vital to check a column's type before performing a validation on the same column.
- The `specially()`/`expect_specially()`/`test_specially()` functions enable custom validations/tests/expectations with a user-defined function. We still have `preconditions` and other common arguments available for convenience. Because the user-defined function must return a logical vector of passing/failing test units (or a table where the rightmost column is logical), the results can be incorporated quite easily into the standard pointblank reporting.
- The `info_columns_from_tbl()` function is a super-convenient wrapper for the `info_columns()` function. Say you're making a data dictionary with an informant and you already have the table metadata somewhere as a table: you can use that here and not have to call `info_columns()` many, many times.
- Added the `game_revenue_info` dataset, which contains metadata for the extant `game_revenue` dataset. Both datasets pair nicely together in examples that create a data dictionary with `create_informant()` and `info_columns_from_tbl()`.
- Added the table transformer function `tt_tbl_colnames()` to get a table's column names for validation.
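A sketch combining two of these additions (the `fn` argument name for `specially()` and the custom predicate are assumptions for illustration; `small_table` is the package's demo dataset):

```r
library(pointblank)

agent <- create_agent(tbl = small_table) |>
  # Rows must be free of NA values, here only within columns a and b
  rows_complete(columns = vars(a, b)) |>
  # A custom check: the user-defined function returns one logical per test unit
  specially(fn = function(x) !is.na(x$d)) |>
  interrogate()
```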
Minor improvements and bug fixes
- Input data tables with `label` attribute values in their columns will have those labels displayed in the 'Variables' section of the `scan_data()` report. This is useful when scanning imported SAS tables (which often have labeled variables).
- The `all_passed()` function has been improved such that failed validation steps (that return an evaluation error, perhaps because of a missing column) result in `FALSE`; the `i` argument has been added to `all_passed()` to optionally get a subset of validation steps before evaluation.
- For those `expect_*()` functions that can handle multiple columns, pointblank now correctly stops at the first failure and provides the correct reporting for it. Passing multiple columns really means processing multiple steps in serial, and previously this was handled incorrectly.
v0.8.0
New features
- The new `draft_validation()` function will create a starter validation .R or .Rmd file with just a table as an input. It uses a new 'column roles' feature to develop a starter set of validation steps based on the kind of data the columns contain (e.g., latitude/longitude values, URLs, email addresses, etc.).
- The validation function `col_vals_within_spec()` (and the variants `expect_col_vals_within_spec()` and `test_col_vals_within_spec()`) will test column values against a specification like phone numbers (`"phone"`), VIN numbers (`"VIN"`), URLs (`"url"`), email addresses (`"email"`), and much more (`"isbn"`, `"postal_code[<country_code>]"`, `"credit_card"`, `"iban[<country_code>]"`, `"swift"`, `"ipv4"`, `"ipv6"`, and `"mac"`).
- A large cross section of row-based validation functions can now operate on segments of the target table, so you can run a particular validation on slices (or segments) of the target table. The segmentation is made possible by the new `segments` argument, which takes an expression that serves to segment the target table by column values. It can be given in one of two ways: (1) as one or more column names containing keys to segment on, or (2) as a two-sided formula where the LHS holds a column name and the RHS contains the column values to segment on (allowing for a subset of keys for segmentation).
- The default printing of the multiagent object is now a stacked display of agent reports. The wide report (useful for comparing validations targeting the same table over time) is available via the improved `get_multiagent_report()` function (with `display_mode = "wide"`).
- Exporting the reporting is now much easier with the new `export_report()` function. It will export objects such as the agent (for validations), the informant (for table metadata), and the multiagent (a series of validations), and also those objects containing customized reports (from `scan_data()`, `get_agent_report()`, `get_informant_report()`, and `get_multiagent_report()`). You'll always get a self-contained HTML file of the report from any use of `export_report()`.
- A new family of functions has been added to pointblank: Table Transformers! These functions can radically transform a data table and either provide a wholly different table (like a summary table or a table-properties table) or do some useful filtering in a single step. This can be useful for preparing the target table for validation or when creating temporary tables (through `preconditions`) for a few validation steps (e.g., validating table properties or string lengths). As a nice bonus, these transformer functions work equally well with data frames, database tables, and Spark tables. The included functions are: `tt_summary_stats()`, `tt_string_info()`, `tt_tbl_dims()`, `tt_time_shift()`, and `tt_time_slice()`.
- Two new datasets have been added: `specifications` and `game_revenue`. The former can be used to test out the `col_vals_within_spec()` validation function. The latter (with 2,000 rows) can be used to experiment with the `tt_time_shift()` and `tt_time_slice()` table transformer functions.
Minor improvements and bug fixes
- Added the Polish (`"pl"`), Danish (`"da"`), Turkish (`"tr"`), Swedish (`"sv"`), and Dutch (`"nl"`) translations.
- The `scan_data()` function is now a bit more performant, more testable, and better at communicating progress in generating the report.
- The `preconditions` argument, used to modify the target table in a validation step, is now improved by (1) checking that a table object is returned after evaluation, and (2) correcting the YAML writing of any `preconditions` expression that's provided as a function.
- The `x_write_disk()` and `x_read_disk()` functions have been extended to allow the writing and reading of `ptblank_tbl_scan` objects (returned by `scan_data()`).
- Print methods received some love in this release. Now, `scan_data()` table scan reports look much better in R Markdown. Reporting objects from `get_agent_report()`, `get_informant_report()`, and `get_multiagent_report()` now have print methods and, as a result, work beautifully in R Markdown.
- The `incorporate()` function, when called on an informant object, now emits styled messages to the console. And when using `yaml_exec()` to process an arbitrary number of YAML-based agents and informants, you'll be given information about that progress in the console.
Documentation
- Many help files were overhauled so that (1) things are clearer, (2) more details are provided (where things are complex), and (3) many ready-to-run examples are present. The functions with improved help in this release are: `all_passed()`, `get_data_extracts()`, `get_multiagent_report()`, `get_sundered_data()`, `has_columns()`, `write_testthat_file()`, `x_write_disk()`, and `yaml_exec()`.