πŸ”€πŸ“Š SEEKTOPIC: Summarization by Extracting Entities, Keyword Tokens, and Outline Phrases Important to Context πŸ€–πŸ”Ž STREAM: Search with Top Result Extraction & Answer Model πŸšœπŸ“œ Tractor the Text Extractor

vtempest/ai-research-agent

Being is Becoming: Whatever the future of research can be, that is what it must become.

πŸ€–πŸ”Ž STREAM: Search with Top Result Extraction & Answer Model

searchSTREAM Docs

  1. Search the web for the query via metasearch of major engines, or search your own custom data.
  2. Extract the text of the top results with Tractor the Text Extractor.
  3. SEEKTOPIC: extract keyphrase topics and the top sentences that centralize those topics.
  4. Rerank the documents' chunks by relevance to the query, using embeddings that convert text to a concept vector of numbers within the LLM's "concept space" and the cosine similarity of query to topic, returning the sentences central to the key relevant parts of the article.
  5. Prompt the Research Agent with key sentences from relevant sources to answer via a Groq Llama, OpenAI, or Anthropic API key, and suggest follow-ups (see the usage sketch below).
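A minimal usage sketch of the whole pipeline, assuming the `searchSTREAM` export named in the docs above; the option names here are illustrative, not the confirmed signature:

```ts
import { searchSTREAM } from "ai-research-agent";

// Illustrative options; check the searchSTREAM docs for the real signature.
const results = await searchSTREAM("how do mRNA vaccines work", {
  maxTopResults: 5,         // hypothetical: how many top results to extract
  model: "llama3-70b-8192", // hypothetical: Groq / OpenAI / Anthropic model
});

// Expected shape (per the steps above): extracted text, key sentences,
// an answer from the agent, and suggested follow-up questions.
console.log(results);
```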

πŸšœπŸ“œ Tractor the Text Extractor

extract Docs

  • Extracts the main content from a URL or HTML, based on an improved version of Readability using 100+ custom adapters for major websites.
  • Strips down to basic HTML for reading mode or for saving research notes.
  • YouTube: fetches the full transcript if a YouTube video is detected.
  • PDF: extracts formatted text from PDFs, parsing line breaks, page headers, and footnotes, and inferring headings based on standard deviation from the average text height.
  • Cite: extracts author, date, source, and title from HTML using meta tags and common class names. Validates the author string against a list of 90k common first names, last names, and organizations to infer whether the author's last name should come first (accounting for affixes/titles), since organization names are not reversed. A usage sketch follows.
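A sketch of calling the extractor, assuming the `extract` export from the docs link above; the result fields shown are assumptions:

```ts
import { extract } from "ai-research-agent";

// URL or raw HTML both work; YouTube and PDF links are detected and
// routed to the transcript / PDF parsers described above.
const article = await extract("https://example.com/article");

// Hypothetical result fields, for illustration:
console.log(article.title, article.author, article.date);
console.log(article.html); // basic HTML of the main content
```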

πŸ”€πŸ“Š SEEKTOPIC: Summarization by Extracting Entities, Keyword Tokens, and Outline Phrases Important to Context

Demo video: kg.webm

extractSEEKTOPIC Docs

SEEKTOPIC Sample Output

SEEKTOPIC can be used to find unique, domain-specific keyphrases using noun Ngrams. The user can click on keyphrases, or an LLM can suggest questions based on them. The user can see highlighted just the most important sentences that centralize and tie in the core topics. It is also possible to vectorize keyphrases and compare the dot-product similarity of a query to them, mapping the query to parts of the document such as section labels. This is more in line with how humans think of article organization: section headings and lead sentences that tie in concepts from the other sections.

  1. Split into sentences, with exceptions for 222 common abbreviations, numbers, URLs, etc.
  2. Use the Wiki Phrases tokenizer to extract wiki topics, phrases, and nouns. It checks for spelling typos and uses the Porter Stemmer to check root words if the original word is not found.
  3. Extract noun edgegrams; stop words are allowed in the middle, as in "state of the art".
  4. Fold smaller Ngrams that are subsets of larger ones into keyphrases by comparing weights.
  5. Calculate named entities and phrase domain specificity to reward unique keyphrases, using WikiIDF. Domain-specific examples would be "endocrinology" in medical data or "thou shall" in religion, which can help build category-label classifiers. We can find repeated phrases that are unique to that document's field, as opposed to phrases common across all docs.
  6. Pass to the next layer only a cut of the top keyphrases, sorted by frequency ^ word count.
  7. Create a double-ring weighted graph with keyphrases as the central ring and each sentence that uses a concept on the outer ring, giving each link a weight that determines the probability of following that link.
  8. Weight sentences using TextRank on noun keyphrase frequency to find which sentences centralize and tie together the keyphrase concepts referred to most by other sentences. Based on the TextRank and PageRank algorithms, it randomly surfs links between nodes to find the probability of being at each node, thus ranking influence. Random jumps prevent getting stuck in a loop around the same sentences.
  9. Cut off the top number (or percent, for larger docs) of sentences and keyphrases by overall weight and graph centrality.
  10. Return the top sentences (with the keyphrases in each sentence) and the top keyphrases (with the sentences for each keyphrase).
  11. If the user clicks a keyphrase, or a search query led to the doc, we can compare the query's similarity to each keyphrase to find the most similar one, then give that keyphrase much more weight and rerank everything from the TextRank step (#8). A usage sketch follows.
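A usage sketch, assuming the `extractSEEKTOPIC` export named above; the options and result shape are illustrative:

```ts
import { extractSEEKTOPIC } from "ai-research-agent";

const text = "..."; // any long document text

// Hypothetical options; see the extractSEEKTOPIC docs.
const { topSentences, keyphrases } = await extractSEEKTOPIC(text, {
  maxTopSentences: 5,
});

console.log(topSentences); // each with the keyphrases it centralizes
console.log(keyphrases);   // each with the sentences it appears in
```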

πŸŒπŸ“– WORLD: Wikipedia Outline Relational Lexicon & Dictionary

compileTopicModel Docs

Search and outline a research base using Wikipedia's 100k most popular pages as the core topic-phrase graph for LLM Research Agents. Most documents online (and, by extension, thinking in the collective consciousness) revolve around core topic phrases linked as a graph. If all the available docs are nodes, the links in the graph can be extracted Wiki page entities and mappings of dictionary phrases to their wiki page. These can serve as topic labels, keywords, and suggestions for LLM follow-up questions. Documents can be linked in a graph with: 1. wiki page entity recognition, 2. frequent keyphrases, 3. HTML links, 4. research paper references, 5. keyphrases to query in global web search, 6. site-specific recommendations. These can lay the foundation for LLM Research Agents to fully grok, summarize, and outline a research base.
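As an illustration of that document-graph idea (the types and helper below are hypothetical sketches, not the library's API):

```ts
// Sketch: treat docs as nodes and link them by shared wiki page entities.
interface Doc {
  id: string;
  wikiEntities: string[]; // extracted Wikipedia page titles
}

function linkDocsByEntities(docs: Doc[]): Map<string, Set<string>> {
  const edges = new Map<string, Set<string>>();
  for (const a of docs)
    for (const b of docs) {
      if (a.id === b.id) continue;
      // Link two docs if they mention a common wiki page entity.
      if (a.wikiEntities.some((e) => b.wikiEntities.includes(e))) {
        if (!edges.has(a.id)) edges.set(a.id, new Set());
        edges.get(a.id)!.add(b.id);
      }
    }
  return edges;
}
```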

  • 240K total words and phrases, with the first 117K being first words or single words to check every token against. 100K Wikipedia page titles and links, from Wikipedia's most popular pages, also including a domain specificity score and which letters should be capitalized.
  • 84K words and 67K phrases in the dictionary lexicon from Open English WordNet, a better-maintained update of WordNet: multiple definitions per term, 120k definitions, 45 concept categories.
  • JSON Prefix Trie, arranged by sorting words and phrases for lookup by first word: tokenize by word, then check whether that word starts a phrase based on the entries, for phrase extraction from a text. There is "unanimous consensus" that a Prefix Trie's O(1) lookups (instead of having to loop through the index for each lookup) make it the best data structure for this task (see the sketch below).
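A sketch of the first-word lookup idea; the JSON shape here is a simplified illustration, not the actual index format:

```ts
// Simplified illustration: index phrases under their first word, so the
// tokenizer can check in O(1) whether a word starts a known phrase.
const trie: Record<string, string[][]> = {
  state: [["of", "the", "art"]], // "state of the art"
  white: [["house"]],            // "white house"
};

function matchPhrase(tokens: string[], i: number): string | null {
  for (const rest of trie[tokens[i]] ?? []) {
    const candidate = tokens.slice(i + 1, i + 1 + rest.length);
    if (rest.every((w, j) => candidate[j] === w))
      return [tokens[i], ...rest].join(" ");
  }
  return null; // fall back to the single word
}

matchPhrase("a state of the art model".split(" "), 1); // "state of the art"
```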

πŸ“ˆπŸ“ WRITEFAT: Weight Relevance by Inference of Topics, Entities, and Frequency Averages for Terms

weighRelevanceTermFrequency Docs

Calculate term specificity for a single doc with the BM25 formula by using Wikipedia term frequencies as the baseline inverse document frequency. WikiBM25 removes the need to pass in all docs and compute against every document in a database. The problem with BM25 and TF-IDF is that a large set of documents is needed to find which words are repeated often across all of them. These overused words are largely the same list everywhere, so using Wikipedia's term frequencies provides a common-sense baseline against a neutral corpus.

Use this list to replace or combine with an all-documents IDF: many websites have fewer than a hundred pages to search through, which is not enough to find which terms are domain-specific. They can score a single doc at a time to find the weight each word in the query gets. WikiIDF can serve as a baseline IDF to average with the all-docs IDF, capturing uniqueness both across the general public corpus and within the specific domain.

Example: given the query "Superbowl wins by year", we do not want to simply return docs filled with common words like "year", but rather recognize that "Superbowl" is more domain-specific. Normally this requires precomputing IDF values across all docs; websites that do not have many docs to start with may consider averaging their precomputed scores with WikiIDF values to ensure the most unique words get a score, as in the sketch below.
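A sketch of that averaging idea; the lookup tables, values, and blend weight are hypothetical illustrations, not the library's API:

```ts
// Hypothetical IDF tables: "local" comes from your own small corpus,
// "wiki" from the precomputed Wikipedia term-frequency list.
const localIDF: Record<string, number> = { superbowl: 6.2, year: 2.8 };
const wikiIDF: Record<string, number> = { superbowl: 9.1, year: 1.3 };

// Average the two so small sites still get stable specificity scores.
function blendedIDF(term: string, alpha = 0.5): number {
  return alpha * (localIDF[term] ?? 5) + (1 - alpha) * (wikiIDF[term] ?? 5);
}

blendedIDF("superbowl"); // high: domain-specific
blendedIDF("year");      // low: common everywhere
```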

LLM RAG chunk-to-query similarity: when we chunk a document into parts to find which to pass into an LLM prompt, the chunks need to be weighted by relevance to the query. Semantic embedding with an LLM not only takes resources to compute and store the vectors, it also performs worse than BM25 on its own. A hybrid of BM25 and embeddings is best for RAG, but there may not be time to compute BM25 IDF scores across all doc chunks. We need a fast way to give more unique words more weight, rather than common short words that happen to repeat a lot in an edge-case paragraph. WikiBM25 is best in use cases like realtime web search, where the text cannot be chunked beforehand.

$$\text{score}(D,Q) = \sum_{i=1}^{N} \text{Wiki-IDF}(q_i) \times \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
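A direct transcription of the formula above into code; `wikiIDF` here is an assumed lookup function, and the k₁ and b defaults are the common BM25 choices:

```ts
// score(D, Q) with Wikipedia IDF as the baseline (WikiBM25).
function wikiBM25Score(
  queryTerms: string[],
  docTerms: string[],
  wikiIDF: (term: string) => number, // assumed term -> IDF lookup
  avgdl: number, // average doc length in the collection
  k1 = 1.2,
  b = 0.75
): number {
  // f(q_i, D): term frequency of each query term in the doc
  const tf = new Map<string, number>();
  for (const t of docTerms) tf.set(t, (tf.get(t) ?? 0) + 1);

  let score = 0;
  for (const q of queryTerms) {
    const f = tf.get(q) ?? 0;
    score +=
      (wikiIDF(q) * (f * (k1 + 1))) /
      (f + k1 * (1 - b + (b * docTerms.length) / avgdl));
  }
  return score;
}
```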

WikiIDF Term Specificity Statistics Metadata

  • Phrase Starter Words and Single Words: 104556

  • Total Terms: 204169

  • Dict single words: 84493

  • Dict phrases: 67444

All words in English Wikipedia are sorted by the number of pages they appear in: 325K words with frequencies of at least 32 wiki pages, between 3 and 23 characters, consisting of Latin alphanumerics (a-z, 0-9), punctuation like ".-", and diacritics like "éï", while filtering out numbers and foreign-language terms.

πŸ§©πŸ” Autocomplete & Query To Topic Phrase Tokenization

suggestNextWordCompletions Docs

Search on keystroke and load this JSON index for word and phrase completion, sorted by how common the terms are per IDF, to power a search autocomplete dropdown. Tokenizing by word can often yield a meaning widely different from the same word as part of a phrase, so it is better to extract phrases via first-word/next-words pairings. Search results are more accurate if we infer likely phrases and search for those words occurring together, not just split text into words and count frequency. Examples are "white house" or "state of the art", which should be searched as phrases but would return different contexts if split into words. As Led Zeppelin famously put it: β™« "'Cause you know sometimes words have two meanings."
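A sketch assuming the `suggestNextWordCompletions` export named above; the call and result shapes are illustrative:

```ts
import { suggestNextWordCompletions } from "ai-research-agent";

// Called on each keystroke to fill the autocomplete dropdown.
// Results are assumed to be completions ranked by IDF commonness.
const suggestions = await suggestNextWordCompletions("state of");
console.log(suggestions); // e.g. "state of the art", "state of the union"
```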

Further Research
