Code Quality

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing, and IDE setup.

Summary

This module captures code-specific metrics of input data. The implementation is borrowed from the work done in the CodeParrot and StarCoder projects. In the current implementation, the module computes the following metrics and reports each metric in an individual column:

  • line-specific metrics, including the mean and max line length
  • character-to-token ratio - uses the input tokenizer to tokenize the input data and measures the ratio between the number of characters and the number of tokens
  • identifies a high occurrence of the keywords "test" or "config" and tags those samples as config or test samples
  • tags samples as autogenerated if they contain keywords such as auto-generated, autogenerated, or automatically generated
  • programming-language-specific identification, where:
    • if the input sample is in the Python programming language and has no reference to constructs such as def or class, it is flagged as has_no_keywords
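
The line metrics and the autogenerated check above can be sketched as follows (an illustrative sketch, not the transform's actual implementation):

```python
def line_metrics(text: str) -> dict:
    """Compute simple line-based metrics for a code sample."""
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "total_num_lines": len(lines),
    }

def is_autogenerated(text: str, max_lines: int = 5) -> bool:
    """Check the first few lines of a sample for autogeneration markers."""
    head = "\n".join(text.splitlines()[:max_lines]).lower()
    keywords = ("auto-generated", "autogenerated", "automatically generated")
    return any(k in head for k in keywords)

sample = "# automatically generated file\nx = 1\n"
print(line_metrics(sample))   # line_mean, line_max, total_num_lines
print(is_autogenerated(sample))  # True
```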

This module adds the following fields into the output file:

  • line_mean
  • line_max
  • total_num_lines
  • avg_longest_lines
  • alphanum_frac
  • char_token_ratio
  • autogenerated
  • config_or_test
  • has_no_keywords
  • has_few_assignments
  • is_xml
  • is_html
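
Several of these fields come from simple keyword tests; for example, the has_no_keywords check for Python samples could be sketched like this (illustrative only, with the keyword set as an assumption):

```python
# Assumed keyword set for illustration; the real transform may check more constructs.
PYTHON_KEYWORDS = ("def ", "class ")

def has_no_keywords(text: str, language: str) -> bool:
    """True when a Python sample contains none of the expected constructs."""
    if language.lower() != "python":
        return False
    return not any(kw in text for kw in PYTHON_KEYWORDS)

print(has_no_keywords("x = 1\ny = 2\n", "Python"))       # True
print(has_no_keywords("def f():\n    pass\n", "Python"))  # False
```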

A tokenizer is used to collect the token-ratio metric. If the input tokenizer is not found in the local cache, it is downloaded from Hugging Face. By default, the codeparrot/codeparrot tokenizer is used.
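
The character-to-token ratio itself is a simple division; the sketch below uses a stand-in whitespace tokenizer so it is self-contained, whereas the transform uses a real Hugging Face tokenizer (codeparrot/codeparrot by default):

```python
def char_token_ratio(text: str, tokenize) -> float:
    """Characters per token; unusually low or high values can flag non-code content."""
    tokens = tokenize(text)
    return len(text) / max(len(tokens), 1)

# Stand-in tokenizer for illustration only; the transform tokenizes with a
# Hugging Face tokenizer instead.
def whitespace_tokenize(text: str) -> list:
    return text.split()

sample = "def add(a, b):\n    return a + b\n"
print(char_token_ratio(sample, whitespace_tokenize))
```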

Running

Launcher Command Line Options

The following command line arguments are available in addition to the options provided by the ray launcher and the python launcher.

  • "--contents_column_name" - input a column name which contains data to process. The default column name: contents
  • "--language_column_name" - input a column name which contains programming language details. The default column name: language
  • "--tokenizer" - input a tokenizer to convert the data into tokens. The default tokenizer is codeparrot/codeparrot
  • "--hf_token" - input the Hugging Face auth token to download the tokenizer. This option is only required for the tokenizer's whose access is restricted in Hugging Face.

Running the samples

To run the samples, use the following make targets:

  • run-cli-sample - runs src/code_quality_transform_python.py using command line args
  • run-local-sample - runs src/code_quality_local_python.py

These targets will activate the virtual environment and set up any configuration needed. Use the -n option of make to see the detail of what is done to run the sample.

For example,

make run-cli-sample
...

Then

ls output

to see the results of the transform.

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.