energia

Prototype package for measuring the structural complexity of nested arrays.

For some of our research, we've discovered many JSONified nested arrays (ElasticSearch results, cough cough). To interpret the data more readily, we needed a way to compute the relative complexity of the nested array structures: signatures, complexity scores, and so forth. The reasoning:

  1. There are probably many known datasets with a predictable signature that we would like to deprioritize, e.g. example datasets, generic logs, and so forth.
  2. Several of the ElasticSearch results could have been discovered simply by looking for complex nested arrays, e.g. structures that are deep, inconsistent, and broad.

Signatures

Still ruminating. Fuzzy hashing, maybe. But that could be hard on our workflow unless we add a new datastore and are comfortable with O(N^2) complexity on certain checks.

Complexity

Partially inspired by Kolmogorov complexity, we implement a five-metric scoring system that should let us distill how complex certain document structures in ElasticSearch results are:

Approximate array dimensions:

  • Count how many items sit in the arrays at each depth, returning a mapping of depth->sum(items) (shape)
  • ... and count how many total arrays there are in this nested array (breadth)
  • maybe others?
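
As a rough illustration, here is a minimal sketch of the first two measurements, assuming the nested arrays have already been parsed into Python dicts and lists; shape and breadth are hypothetical names, and shape follows the depth->sum(items) reading above:

```python
from collections import defaultdict

def shape(doc, depth=0, counts=None):
    """Total number of items observed at each depth of the nested structure."""
    counts = counts if counts is not None else defaultdict(int)
    if isinstance(doc, (dict, list)):
        counts[depth] += len(doc)
        children = doc.values() if isinstance(doc, dict) else doc
        for child in children:
            shape(child, depth + 1, counts)
    return dict(counts)

def breadth(doc):
    """How many arrays (dicts or lists) appear anywhere in the nested structure."""
    if isinstance(doc, dict):
        return 1 + sum(breadth(v) for v in doc.values())
    if isinstance(doc, list):
        return 1 + sum(breadth(v) for v in doc)
    return 0

# shape({"hits": [{"id": 1}, {"id": 2}]})   -> {0: 1, 1: 2, 2: 2}
# breadth({"hits": [{"id": 1}, {"id": 2}]}) -> 4
```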

Document structural complexity (DSM):

  • Start a counter at 0
  • For each value that maps to another array, add 1
  • For all other values, add 0.1
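
A minimal sketch of that scoring walk, assuming it recurses into nested containers (the recursion is our reading; dsm is a hypothetical name):

```python
def dsm(doc):
    """Document structural complexity: +1 for each value that maps to another
    array (dict or list), +0.1 for every other value."""
    if isinstance(doc, dict):
        values = doc.values()
    elif isinstance(doc, list):
        values = doc
    else:
        return 0.0
    score = 0.0
    for value in values:
        if isinstance(value, (dict, list)):
            score += 1.0 + dsm(value)  # count the nested array, then descend into it
        else:
            score += 0.1
    return score
```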

Item duplication-averse DSM (IDA DSM):

  • Remove any key->value pairs that are duplicated (removing the duplicates but not the original)
  • Recompute DSM
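
A minimal sketch of the de-duplication pass, assuming duplicates are detected with a JSON fingerprint of each key->value pair and that only dict entries carry keys; dedupe_pairs and ida_dsm are hypothetical names, and the ignore_containers flag only matters for the TDA variant below:

```python
import json

def dedupe_pairs(doc, seen=None, ignore_containers=False):
    """Keep the first occurrence of each key->value pair and drop later
    duplicates, recursing into nested containers."""
    seen = seen if seen is not None else set()
    if isinstance(doc, dict):
        kept = {}
        for key, value in doc.items():
            is_container = isinstance(value, (dict, list))
            if not (is_container and ignore_containers):
                fingerprint = json.dumps([key, value], sort_keys=True, default=str)
                if fingerprint in seen:
                    continue  # duplicate pair: drop it
                seen.add(fingerprint)
            kept[key] = dedupe_pairs(value, seen, ignore_containers) if is_container else value
        return kept
    if isinstance(doc, list):
        return [dedupe_pairs(item, seen, ignore_containers) if isinstance(item, (dict, list)) else item
                for item in doc]
    return doc

def ida_dsm(doc):
    """Item duplication-averse DSM: de-duplicate, then rescore."""
    return dsm(dedupe_pairs(doc))
```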

Type duplication-averse DSM (TDA DSM):

  • For any key that does not map to another array, replace its value with a string naming its type ('int', 'str', etc.)
  • Remove any key->value pairs that are duplicated elsewhere (removing the duplicates but not the original), ignoring key->array mappings
  • Recompute DSM
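
A minimal sketch of the type-level pass, reusing the hypothetical dedupe_pairs from above with key->array mappings ignored; type_skeleton and tda_dsm are likewise hypothetical names:

```python
def type_skeleton(doc):
    """Replace every scalar value with the name of its type, leaving arrays in place."""
    if isinstance(doc, dict):
        return {k: type_skeleton(v) if isinstance(v, (dict, list)) else type(v).__name__
                for k, v in doc.items()}
    if isinstance(doc, list):
        return [type_skeleton(v) if isinstance(v, (dict, list)) else type(v).__name__
                for v in doc]
    return type(doc).__name__

def tda_dsm(doc):
    """Type duplication-averse DSM: reduce values to type names, de-duplicate
    scalar pairs only, then rescore."""
    return dsm(dedupe_pairs(type_skeleton(doc), ignore_containers=True))
```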

Pile of bones DSM (POB DSM):

  • Flatten the nested array into one exceptionally long array
  • For any keys that should have mapped to an array, recreate them as null
  • Remove any key->value pairs that are duplicated elsewhere (removing both the original and all duplicates)
  • Recompute DSM
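
A minimal sketch of the flattening pass, assuming only dict entries carry key->value pairs and that "duplicated elsewhere" means any pair that occurs more than once; flatten_pairs and pob_dsm are hypothetical names:

```python
from collections import Counter

def flatten_pairs(doc, out=None):
    """Collect every key->value pair into one flat list; keys that mapped to an
    array are kept, but their value becomes None."""
    out = out if out is not None else []
    if isinstance(doc, dict):
        for key, value in doc.items():
            if isinstance(value, (dict, list)):
                out.append((key, None))  # key that should have mapped to an array -> null
                flatten_pairs(value, out)
            else:
                out.append((key, value))
    elif isinstance(doc, list):
        for item in doc:
            flatten_pairs(item, out)
    return out

def pob_dsm(doc):
    """Pile-of-bones DSM: flatten, drop every pair that appears more than once
    (original included), then rescore what is left."""
    pairs = flatten_pairs(doc)
    counts = Counter(pairs)
    survivors = [value for (key, value) in pairs if counts[(key, value)] == 1]
    # every surviving value is a scalar (or None), so dsm() just adds 0.1 per pair
    return dsm(survivors)
```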
