[FEAT]: Allow users to define an include pattern for the Github Data Connector #1798

Open
oleander opened this issue Jul 1, 2024 · 0 comments
Labels
enhancement New feature or request feature request

oleander commented Jul 1, 2024

What would you like to see?

The current filtering feature under upload a document -> Data Connectors -> File Ignores, used for file uploads, has a few limitations and, for me, often results in unwanted documents being passed on to the LLM. The UI accepts ignore patterns, which can technically be inverted with ! to specify which files to include. I mainly want to choose which files go into the vector database rather than which to exclude, so I frequently use the inverted form.

The filtering is implemented with the ignore npm package which, from what I can tell, doesn't let an inverted pattern re-include files in nested directories. A small script to reproduce this:

// Only the `ignore` package is needed; the paths below are plain strings.
const ignore = require('ignore');

const files = [
  'README.md',
  'index.js',
  'docs/guide.md',
  'src/components/Button.md',
  'src/components/Button.js',
  'src/utils/helpers.md',
  'notes.txt',
];

// Return the files that survive the given ignore patterns (i.e. are NOT ignored).
const matchFiles = (patterns) => {
  const ig = ignore().add(patterns);
  return files.filter(file => !ig.ignores(file));
};

const patternsList = [
  ['*', '!.md'],
  ['*', '!*.md'],
  ['*', '!**/*.md'],
  ['*', '!.md', '!**/*.md'],
];

patternsList.forEach((patterns, index) => {
  console.log(`Test ${index + 1} with patterns: ${patterns}`);
  const matchedFiles = matchFiles(patterns);
  console.log('Matched files:', matchedFiles);
  console.log('---');
});

With every pattern set, at most the root-level README.md is included; the .md files in subdirectories are never matched.

> node main.js
Test 1 with patterns: *,!.md
Matched files: []
---
Test 2 with patterns: *,!*.md
Matched files: [ 'README.md' ]
---
Test 3 with patterns: *,!**/*.md
Matched files: [ 'README.md' ]
---
Test 4 with patterns: *,!.md,!**/*.md
Matched files: [ 'README.md' ]
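
A likely explanation, going by git's documentation for .gitignore: a file cannot be re-included once one of its parent directories is excluded. The leading * excludes docs and src themselves, so the negated *.md patterns never get a chance to apply inside them. In git, the usual workaround is to re-include directories before re-including files. Whether the ignore package handles this pattern list the same way is an assumption, not something the output above shows:

```gitignore
*
!*/
!*.md
```

Even if this works, it is not obvious to someone typing patterns into the UI, which is part of the motivation for the suggestion below.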

To simplify the UI, I suggest the following changes:

  • Allow users to define the files to include using a glob pattern, e.g. **/**/*.md
  • Before sending the loaded data to the LLM, list all files matched by the pattern, so the pattern can be checked first.
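
As a rough illustration of the two points above, here is a minimal sketch of an include-pattern preview. It is not how the connector currently works. The helpers globToRegExp and previewIncluded are hypothetical names, and the glob support is deliberately tiny (only **, * and ?), so a real implementation would more likely use an existing glob library:

```javascript
// Sample paths, copied from the test script earlier in this issue.
const files = [
  'README.md',
  'index.js',
  'docs/guide.md',
  'src/components/Button.md',
  'src/components/Button.js',
  'src/utils/helpers.md',
  'notes.txt',
];

// Hypothetical helper: convert a simple glob into a RegExp.
// Supports only `**/`, `**`, `*` and `?`; no braces, character classes or negation.
function globToRegExp(glob) {
  let source = '';
  let i = 0;
  while (i < glob.length) {
    const ch = glob[i];
    if (glob.startsWith('**/', i)) {
      source += '(?:[^/]+/)*'; // zero or more directory levels
      i += 3;
    } else if (glob.startsWith('**', i)) {
      source += '.*'; // anything, including slashes
      i += 2;
    } else if (ch === '*') {
      source += '[^/]*'; // anything within one path segment
      i += 1;
    } else if (ch === '?') {
      source += '[^/]'; // exactly one character within a segment
      i += 1;
    } else {
      source += ch.replace(/[.+^${}()|[\]\\]/g, '\\$&'); // escape regex metacharacters
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Hypothetical helper: return the paths matched by at least one include glob.
function previewIncluded(paths, includeGlobs) {
  const matchers = includeGlobs.map(globToRegExp);
  return paths.filter((p) => matchers.some((re) => re.test(p)));
}

// Preview what would be sent before anything reaches the LLM.
console.log('Would include:', previewIncluded(files, ['**/*.md']));
```

With the sample list, the preview for **/*.md contains README.md, docs/guide.md, src/components/Button.md and src/utils/helpers.md, which is the result I was trying to get from the inverted ignore patterns.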

What do you guys think? Would this be useful? I could look into a solution once we agree on an approach, or we can decide to leave it as is.

@oleander oleander added enhancement New feature or request feature request labels Jul 1, 2024