Fix from code review + ADR
InAnYan committed Jul 29, 2024
1 parent f669a95 commit efed0cc
Showing 11 changed files with 122 additions and 12 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv

### Added

- We added an AI chat for linked files. [#11430](https://github.com/JabRef/jabref/pull/11430)
- We added an AI-based chat for entries with linked PDF files. [#11430](https://github.com/JabRef/jabref/pull/11430)
- We added support for selecting and using CSL Styles in JabRef's OpenOffice/LibreOffice integration for inserting bibliographic and in-text citations into a document. [#2146](https://github.com/JabRef/jabref/issues/2146), [#8893](https://github.com/JabRef/jabref/issues/8893)
- We added Tools > New library based on references in PDF file... to create a new library based on the references section in a PDF file. [#11522](https://github.com/JabRef/jabref/pull/11522)
- When converting the references section of a paper (PDF file), more than the last page is treated. [#11522](https://github.com/JabRef/jabref/pull/11522)
Expand Down
1 change: 1 addition & 0 deletions docs/decisions/0033-store-chats-in-mvstore.md
Expand Up @@ -42,6 +42,7 @@ Chosen option: "MVStore", because it is simple and memory-efficient.

* Good, because automatic loading and saving to disk
* Good, because memory-efficient
* Bad, because it does not support mutable values in maps
* Bad, because the order of messages needs to be "hand-crafted" (e.g., by mapping from an Integer to the concrete message)
* Bad, because it stores data as key-values, but not as a custom data type (like tables in RDBMS)

Expand Down
107 changes: 107 additions & 0 deletions docs/decisions/0037-rag-architecture-implementation.md
@@ -0,0 +1,107 @@
---
nav_order: 0037
parent: Decision Records
---

# RAG architecture implementation

## Context and Problem Statement

The current trend in question answering (Q&A) with large language models (LLMs) and other
AI-related technology is retrieval-augmented generation (RAG).

RAG is a form of [Open Generative QA](https://huggingface.co/tasks/question-answering):
the LLM (which generates text) is supplied with context (chunks of information extracted
from various sources) and then generates an answer.

A RAG architecture consists of [these steps](https://www.linkedin.com/pulse/rag-architecture-deep-dive-frank-denneman-4lple) (simplified):

How source data is processed:
1. **Indexing**: the application is supplied with information sources (PDFs, text files, web pages, etc.).
2. **Conversion**: files are converted to strings (because an LLM works on text data).
3. **Splitting**: the string from the previous step is split into chunks (because an LLM has a fixed context window and
cannot handle big documents).
4. **Embedding generation**: a vector of float values is generated for each chunk. This vector represents the meaning
of the text; the main property of such vectors is that chunks with similar meaning have vectors that are close to each
other. The vectors are produced by a separate model called an *embedding model*.
5. **Store**: chunks, their embedding vectors, and relevant metadata (for example, which document they were generated from) are stored in a vector database.
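The splitting step (3) can be sketched as a fixed-size splitter with overlap (a simplified, illustrative stand-in for the document splitters `langchain4j` provides; `ChunkSplitter` and its parameters are hypothetical names, not JabRef code):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {
    /**
     * Splits text into fixed-size chunks. Consecutive chunks overlap by
     * {@code overlap} characters so that a sentence cut at a chunk boundary
     * still appears whole in a neighboring chunk.
     */
    public static List<String> split(String text, int chunkSize, int overlap) {
        if (chunkSize <= overlap) {
            throw new IllegalArgumentException("chunkSize must be greater than overlap");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break;
            }
        }
        return chunks;
    }
}
```

Real splitters usually work on tokens rather than characters and try to respect sentence boundaries, but the sliding-window idea is the same.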

How an answer is generated:
1. **Ask**: the user asks the AI a question.
2. **Question embedding**: the embedding model generates an embedding vector for the query.
3. **Data finding**: the vector database searches for the most relevant pieces of information (a fixed number of them).
This is done by vector similarity: how close each chunk vector is to the question vector.
4. **Prompt generation**: using a prompt template, the user question is *augmented* with the retrieved information.
The retrieved chunks are generally not shown to the user inline, as it would look odd to see one's question echoed back
together with them; they can be either hidden entirely or shown separately in a "Sources" UI tab.
5. **LLM generation**: the LLM generates the answer.
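The data-finding step (3) boils down to scoring every stored chunk vector against the question vector, typically by cosine similarity. A brute-force sketch (a real vector database would use an index instead of a full scan; the names here are illustrative):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class VectorSearch {
    /** Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones. */
    public static double cosine(float[] a, float[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the k chunk texts whose embeddings are closest to the query embedding. */
    public static List<String> topK(Map<String, float[]> store, float[] query, int k) {
        return store.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, float[]> e) -> -cosine(e.getValue(), query)))
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```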

This ADR concerns the implementation of this architecture.

## Decision Drivers

* Prefer good, maintained libraries over self-made solutions, for better quality.
* The framework should be easy to use. It would be strange if a user who wants to download a BIB editor were
required to install separate software (or even a Python runtime).
* RAG shouldn't incur additional monetary costs; users should pay only for LLM generation.

## Considered Options

* Use a hand-crafted RAG
* Use a third-party Java library
* Use a standalone application
* Use an online service

## Decision Outcome

Chosen option: a mix of "Use a hand-crafted RAG" and "Use a third-party Java library".

Third-party libraries provide excellent resources for connecting to an LLM or extracting text from PDF files. For RAG,
we mostly used the machinery provided by `langchain4j`, but some parts had to be hand-crafted:
- **LLM connection**: due to https://github.com/langchain4j/langchain4j/issues/1454 (https://github.com/InAnYan/jabref/issues/77),
this was delegated to another library, `jvm-openai`.
- **Embedding generation**: due to https://github.com/langchain4j/langchain4j/issues/1492 (https://github.com/InAnYan/jabref/issues/79),
this was delegated to another library, `djl`.
- **Indexing**: `langchain4j` is a collection of useful tools, but we still have to orchestrate when indexing should
happen and which files should be processed.
- **Vector database**: there seems to be no embedded vector database (except SQLite with the `sqlite-vss` extension). We
implemented the vector database on top of `MVStore` because that was easy.
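Since a key-value store like `MVStore` only offers maps, the vector database reduces to a few parallel maps plus a brute-force similarity scan. A sketch of the layout (plain `HashMap`s stand in for MVStore's `MVMap`, which the real code would obtain via `store.openMap(...)`; class and method names are illustrative, not JabRef code):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyValueVectorStore {
    private final Map<Integer, String> contents = new HashMap<>();    // chunk id -> chunk text
    private final Map<Integer, float[]> embeddings = new HashMap<>(); // chunk id -> embedding vector
    private final Map<Integer, String> sources = new HashMap<>();     // chunk id -> source document (metadata)
    private int nextId = 0;

    /** Stores a chunk together with its embedding and metadata; returns its id. */
    public int add(String text, float[] embedding, String source) {
        int id = nextId++;
        contents.put(id, text);
        embeddings.put(id, embedding);
        sources.put(id, source);
        return id;
    }

    public String contentOf(int id) {
        return contents.get(id);
    }

    public String sourceOf(int id) {
        return sources.get(id);
    }

    /** Exposed so a similarity search can scan all stored vectors. */
    public Map<Integer, float[]> embeddings() {
        return embeddings;
    }
}
```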

## Pros and Cons of the Options

### Use a hand-crafted RAG

* Good, because we have the full control over generation
* Good, because extendable
* Bad, because LLM connection, embedding models, vector storage, and file conversion must be implemented manually
* Bad, because it's hard to make a complex RAG architecture

### Use a third-party Java library

* Good, because provides well-tested and maintained tools
* Good, because libraries have many LLM integrations, as well as embedding models, vector storage, and file conversion tools
* Good, because they provide complex RAG pipelines and extensions
* Neutral, because they provide many tools and functions, but these must be orchestrated in a real application
* Bad, because some of them are raw and undocumented
* Bad, because they are all similar to `langchain`
* Bad, because they may have bugs

### Use a standalone application

* Good, because they provide complex RAG pipelines and extensions
* Good, because no additional code is required (except connecting to API)
* Neutral, because they do not provide that many LLM integrations, embedding models, or vector stores
* Bad, because a standalone app must be running; users may need to set it up properly
* Bad, because the internal workings of the app are hidden; additional agreement to a Privacy Policy or Terms of Service is needed
* Bad, because hard to extend

### Use an online service

* Good, because all data is processed and stored off the user's machine: faster, and no local memory is used
* Good, because they provide complex RAG pipelines and extensions
* Good, because no additional code is required (except connecting to API)
* Neutral, because they do not provide that many LLM integrations, embedding models, or vector stores
* Bad, because an Internet connection is required
* Bad, because data is processed by a third-party company
* Bad, because most of them require additional payment (in fact, it would be impossible to offer such a service for
free)
1 change: 1 addition & 0 deletions src/main/java/module-info.java
Expand Up @@ -154,4 +154,5 @@
// Provides number input fields for parameters in AI expert settings
requires com.dlsc.unitfx;
requires de.saxsys.mvvmfx.validation;
requires dd.plist;
}
1 change: 0 additions & 1 deletion src/main/java/org/jabref/gui/Dark.css
Expand Up @@ -163,4 +163,3 @@
.file-row-text {
-fx-text-fill: -fx-light-text-color;
}

Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ private void initialize() {
sourceLabel.setText(Localization.lang("AI"));
contentTextArea.setText(aiMessage.text());
} else {
LOGGER.warn("ChatMessageComponent supports only user or AI messages, but other type was passed: " + chatMessage.type().name());
LOGGER.warn("ChatMessageComponent supports only user or AI messages, but other type was passed: {}", chatMessage.type().name());
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@

<fx:root prefHeight="200.0" prefWidth="500.0" type="BorderPane" xmlns="http://javafx.com/javafx/17.0.2-ea" xmlns:fx="http://javafx.com/fxml/1" fx:controller="org.jabref.gui.ai.components.errorstate.ErrorStateComponent">
<center>
<VBox alignment="CENTER" spacing="10.0">
<VBox fx:id="contentsVBox" alignment="CENTER" spacing="10.0">
<children>
<Text fx:id="titleText" strokeType="OUTSIDE" strokeWidth="0.0" text="Title">
<font>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
public class ErrorStateComponent extends BorderPane {
@FXML private Text titleText;
@FXML private Text contentText;
@FXML private VBox contentsVBox;

public ErrorStateComponent(String title, String content) {
ViewLoader.view(this)
Expand All @@ -25,7 +26,7 @@ public ErrorStateComponent(String title, String content) {
public static ErrorStateComponent withSpinner(String title, String content) {
ErrorStateComponent errorStateComponent = new ErrorStateComponent(title, content);

((VBox) errorStateComponent.getCenter()).getChildren().add(new ProgressIndicator());
errorStateComponent.contentsVBox.getChildren().add(new ProgressIndicator());

return errorStateComponent;
}
Expand All @@ -36,7 +37,7 @@ public static ErrorStateComponent withTextArea(String title, String content, Str
TextArea textArea = new TextArea(additional);
textArea.setEditable(false);

((VBox) errorStateComponent.getCenter()).getChildren().add(textArea);
errorStateComponent.contentsVBox.getChildren().add(textArea);

return errorStateComponent;
}
Expand Down
4 changes: 2 additions & 2 deletions src/main/java/org/jabref/gui/preferences/ai/AiTab.java
Expand Up @@ -40,8 +40,6 @@ public class AiTab extends AbstractPreferenceTabView<AiTabViewModel> implements
@FXML private IntegerInputField ragMaxResultsCountTextField;
@FXML private DoubleInputField ragMinScoreTextField;

private final ControlsFxVisualizer visualizer = new ControlsFxVisualizer();

@FXML private Button chatModelHelp;
@FXML private Button embeddingModelHelp;
@FXML private Button apiBaseUrlHelp;
Expand All @@ -54,6 +52,8 @@ public class AiTab extends AbstractPreferenceTabView<AiTabViewModel> implements

@FXML private Button resetExpertSettingsButton;

private final ControlsFxVisualizer visualizer = new ControlsFxVisualizer();

public AiTab() {
ViewLoader.view(this)
.root(this)
Expand Down
6 changes: 4 additions & 2 deletions src/main/java/org/jabref/logic/ai/models/EmbeddingModel.java
Expand Up @@ -23,11 +23,13 @@
import dev.langchain4j.model.output.Response;

/**
* Wrapper around langchain4j embedding model.
* Wrapper around langchain4j {@link dev.langchain4j.model.embedding.EmbeddingModel}.
* <p>
* This class listens to preferences changes.
*/
public class EmbeddingModel implements dev.langchain4j.model.embedding.EmbeddingModel, AutoCloseable {
private static final String DJL_AI_DJL_HUGGINGFACE_PYTORCH_SENTENCE_TRANSFORMERS = "djl://ai.djl.huggingface.pytorch/sentence-transformers/";

private final AiPreferences aiPreferences;

private final ExecutorService executorService = Executors.newCachedThreadPool(
Expand All @@ -48,7 +50,7 @@ private void rebuild() {
return;
}

String modelUrl = "djl://ai.djl.huggingface.pytorch/sentence-transformers/" + aiPreferences.getEmbeddingModel().getLabel();
String modelUrl = DJL_AI_DJL_HUGGINGFACE_PYTORCH_SENTENCE_TRANSFORMERS + aiPreferences.getEmbeddingModel().getLabel();

Criteria<String, float[]> criteria =
Criteria.builder()
Expand Down
3 changes: 1 addition & 2 deletions src/main/resources/tinylog.properties
Expand Up @@ -12,8 +12,7 @@ exception = strip: jdk.internal

level@org.jabref.http.server.Server = debug

# FIXME: Remove before merging the branch

# AI debugging
#level@org.jabref.gui.entryeditor.aichattab.AiChat = trace
#level@org.jabref.gui.JabRefGUI = trace
#level@org.jabref.logic.ai.AiService = trace
Expand Down
