
multilingual coreference and singletons support #1403

Closed
wants to merge 7 commits

Conversation

@Jemoka (Member) commented Jul 16, 2024

Adds support for multilingual coreference and singletons through xlm-roberta-large and t5-large.

  • adds an Rn -> R1 projection anaphoricity scorer for start-of-chain mentions in order to support singletons (a minimal sketch follows this list)
  • integrates the newer PEFT architecture for fine-tuning xlm-roberta with adapters
  • adds an adapter for CorefUD data
  • training throughput fixes for long documents
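As a concrete illustration of the first bullet, here is a minimal sketch of an Rn -> R1 start-of-chain (anaphoricity) scorer; the class name, sizes, and the way the score is concatenated to the pairwise scores are illustrative assumptions, not the PR's exact code.

import torch
from torch import nn

class StartOfChainScorer(nn.Module):
    """Sketch: project each mention representation (R^n) down to a single
    start-of-chain score (R^1)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, mentions: torch.Tensor) -> torch.Tensor:
        # mentions: [n_mentions, hidden_size] -> [n_mentions, 1]
        return self.proj(mentions)

# usage sketch: prepend the start-of-chain column to the antecedent scores so
# that an argmax over [start-of-chain | antecedents] can mark a mention as
# opening a new chain, i.e. a potential singleton:
#   scores = torch.cat([scorer(mentions), pairwise_scores], dim=1)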

@Jemoka Jemoka requested a review from AngledLuffa July 16, 2024 19:10
@Jemoka Jemoka marked this pull request as ready for review July 16, 2024 20:09
@@ -87,10 +87,6 @@ def predict(self, batch, unsort=True):
pred_tokens.append("".join(pred_seq))
else:
pred_tokens = ["".join(seq) for seq in pred_seqs] # join chars to be tokens
# if any tokens are predicted to expand to blank,
Member Author (@Jemoka, Jul 16, 2024)

I honestly forgot why this is removed; perhaps it is a merging artifact?

Collaborator

I suspect so. I just made a couple changes on the MWT processing last week in order to fix some weird tokenization of previously unknown lemmas in Spanish

Collaborator

yes, would try to merge in that change or otherwise undo this. shouldn't be here

Member Author

done


@staticmethod
def _get_pair_matrix(all_mentions: torch.Tensor,
mentions_batch: torch.Tensor,
def _get_pair_matrix(mentions_batch: torch.Tensor,
Collaborator

so the change here is to push the dereferencing up into the caller? sounds fair, could maybe split that out for readability of the PR but it's not necessary

Member Author (@Jemoka, Jul 23, 2024)

also, the indexing is required multiple times instead of once, so I felt it would be easier than passing everything around between the two stacks.
happy to split it; do you mean I should undo the change and apply another diff?

thanks in advance!

Collaborator

no change needed, was just thinking in terms of making the big change more readable with smaller pieces cut off. it's not an issue though.

you can if you like split off individual edits with git rebase -i dev and then edit the change. it's really not necessary in this case, though, unless you wanted the practice

@@ -40,14 +44,19 @@ def get_subwords_batches(doc: Doc,
while end and doc["sent_id"][doc["word_id"][end - 1]] == sent_id:
end -= 1

# if we ended up at prev end, well, looks like we will
Collaborator

this is what happens if a single sentence is longer than the maximum length of the transformer?

Member Author

yes, clarified comment
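For context, a minimal sketch of the batching fallback under discussion; the function name and the prev_end/batch_size variables are illustrative assumptions, not the PR's exact code.

def next_batch_end(doc: dict, prev_end: int, batch_size: int) -> int:
    """Sketch: prefer to end a subword batch on a sentence boundary, but if a
    single sentence is longer than the transformer window, split it
    mid-sentence rather than loop forever."""
    total_len = len(doc["word_id"])
    end = min(prev_end + batch_size, total_len)
    if end == total_len:
        return end
    # walk back so the batch ends on a sentence boundary
    sent_id = doc["sent_id"][doc["word_id"][end - 1]]
    while end > prev_end and doc["sent_id"][doc["word_id"][end - 1]] == sent_id:
        end -= 1
    # we ended up back at the previous end: this one sentence exceeds the window
    if end == prev_end:
        end = min(prev_end + batch_size, total_len)
    return end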

@@ -15,7 +21,24 @@ def __init__(self):
self._r = 0.0
self._p_weight = 0.0
self._r_weight = 0.0
self._num_preds = 0.0

# muc
Collaborator

these names are somewhat opaque but i assume they're just the standard scoring names? seems reasonable

Member Author

updated with better variable names; they are underscore-prefixed, so hopefully folks won't try to access them from the outside

try:
return int(split[-1].replace(")", "").strip())
except ValueError:
breakpoint()
Collaborator

could remove this?

Member Author

done
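For reference, a minimal sketch of the cleaned-up parse once the debugging breakpoint() is gone; the function wrapper, the split on "e", and the error message are assumptions, only the int(...) line comes from the diff above.

def parse_cluster_id(marker: str) -> int:
    # e.g. a closing marker such as "e12)" -> 12
    split = marker.split("e")
    try:
        return int(split[-1].replace(")", "").strip())
    except ValueError as err:
        # raise instead of dropping into a debugger
        raise ValueError(f"could not parse cluster marker: {marker!r}") from err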

cluster_info_lst.append(f"e{cluster_marker})")


# we need our clusters to be ordered such that the one that closest first is listed last
Collaborator

"that closest first" --> "that is closest to the first" ?

Member Author

done

breakpoint()
else:
# we want everything that's a closer to be first
return 1000000000
Collaborator

how about float('inf')

Member Author

of course; no clue why I did this....
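For what it's worth, a tiny sketch of the suggested fix; the helper name and argument are assumptions about the surrounding sort key, and float("inf") replaces the 1000000000 sentinel from the diff above.

def sort_key(is_closer: bool) -> float:
    if is_closer:
        # we want everything that's a closer to be first
        return 0.0
    # everything else sorts after the closers
    return float("inf")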

#dev_data = "data/coref/corefud_concat_v1_0_langid-bal.dev.json"
#test_data = "data/coref/corefud_concat_v1_0_langid-bal.test.json"

train_data = "data/coref/corefud_concat_v1_0_langid.train.json"
Collaborator

is there a script or an explanation of how to build this?

Collaborator

wondering if some of the others could be cleaned up

Member Author

done. left the ontonotes + gum + balanced langid. Also, it looks like I missed committing the conversion scripts: would love your review.

  • convert_udcoref.py: converts depparse-annotated udcoref files into our format
  • balance_languages.py: takes a dataset built by the previous script and balances the document counts for each language within the JSON (a rough sketch of this balancing step follows this list)
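For reference, a rough sketch of what the balancing step could look like, assuming the concatenated JSON is a list of documents that each carry a language id; the field name "lang" and the downsample-to-smallest strategy are assumptions, not necessarily what balance_languages.py does.

import json
import random
from collections import defaultdict

def balance_languages(in_path: str, out_path: str, seed: int = 1234) -> None:
    """Sketch: downsample every language to the size of the smallest one so
    each language contributes the same number of documents."""
    with open(in_path, encoding="utf-8") as fin:
        docs = json.load(fin)

    by_lang = defaultdict(list)
    for doc in docs:
        by_lang[doc["lang"]].append(doc)

    target = min(len(group) for group in by_lang.values())
    random.seed(seed)
    balanced = []
    for group in by_lang.values():
        balanced.extend(random.sample(group, target))

    with open(out_path, "w", encoding="utf-8") as fout:
        json.dump(balanced, fout, ensure_ascii=False)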

lora_dropout = 0.1
lora_alpha = 128
Collaborator

could be undone just to make the change cleaner

Member Author

done. no clue how this ended up changed. apologies

@@ -58,7 +60,18 @@ def forward(self, # type: ignore # pylint: disable=arguments-differ #35566 in
distance = torch.where(distance < 5, distance - 1, log_distance + 2)
distance = self.distance_emb(distance)

genre = torch.tensor(self.genre2int[doc["document_id"][:2]],
if not self.__full_pw:
Collaborator

this is for VRAM OOM issues?

Member Author

not quite: this is for documents that have genre and speaker embeddings, which don't exist for UDCoref

Collaborator

ah, got it. is that detected automatically from the input files? i might have missed that if it is. if not, it would be simpler for the user to do that rather than make it an option
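To make the distinction concrete, a minimal sketch of gating the genre/speaker pairwise features behind a flag like __full_pw; the function and its arguments are illustrative assumptions, not the PR's code.

import torch
from typing import Optional

def pairwise_features(distance: torch.Tensor,
                      speaker: Optional[torch.Tensor],
                      genre: Optional[torch.Tensor],
                      full_pw: bool) -> torch.Tensor:
    """Sketch: only concatenate genre/speaker features when the dataset
    (e.g. OntoNotes) actually provides them; UDCoref data has neither."""
    if full_pw and speaker is not None and genre is not None:
        return torch.cat([distance, speaker, genre], dim=-1)
    return distance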

@@ -29,6 +29,7 @@ def forward(self, # type: ignore # pylint: disable=arguments-differ #35566 in
Returns rough anaphoricity scores for candidates, which consist of
the bilinear output of the current model summed with mention scores.
"""

Collaborator

maybe undo just to keep things cleaner?

Member Author

done

help="Adjust the dummy mix")
argparser.add_argument("--bert_finetune_begin_epoch", type=float,
help="Adjust the bert finetune begin epoch")
argparser.add_argument("--warm_start", action="store_true",
Collaborator

worth adding an argument for --full_pairwise here?

Member Author

done. I expect that for non-OntoNotes documents this will rarely be used, however, because most datasets don't have speaker embeddings.

Collaborator

heh, adding that flag might have been the opposite of what i just suggested above with the __full_pw option
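Following the style of the argparse snippet above, the flag in question might look roughly like this (the exact name and help text in the PR may differ):

argparser.add_argument("--full_pairwise", action="store_true",
                       help="Use the full pairwise features (genre and speaker "
                            "embeddings), which only OntoNotes-style data provides")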

@@ -30,13 +30,67 @@
from stanza.models.coref.utils import GraphNode
from stanza.models.coref.word_encoder import WordEncoder

from torch.utils.data import Dataset
from functools import lru_cache, wraps
import weakref
Collaborator

is this still used? maybe it could go away

Member Author

yes, apologies. it used to be used for dataset memoization, but it turns out we were hitting system OOM on long docs instead

Collaborator

same with wraps, a quick ctrl-f doesn't find it anywhere else

Member Author

removed, thanks

from peft import LoraConfig, get_peft_model, get_peft_model_state_dict, set_peft_model_state_dict

from stanza.utils.get_tqdm import get_tqdm # type: ignore
tqdm = get_tqdm()

logger = logging.getLogger('stanza')

class CorefDataset(Dataset):
Collaborator

can / should this be refactored into a different module?

Member Author

done

self.config = config
self.tokenizer = tokenizer

self.__filter_func = TOKENIZER_FILTERS.get(self.config.bert_model,
Collaborator

maybe leave a comment here to specify that the default is to not filter anything? it takes a couple seconds to understand, so maybe that time can be saved for the reader instead

Member Author

done
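The requested comment could be as small as this; the lambda default shown is an assumption about what the original code falls back to.

# default to a pass-through filter, i.e. keep every subword token, when the
# chosen bert model has no registered tokenizer filter
self.__filter_func = TOKENIZER_FILTERS.get(self.config.bert_model,
                                           lambda _: True)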

@@ -91,14 +144,15 @@ def __init__(self,
modules_to_save=self.config.lora_fully_tune,
bias="none")

self.bert = get_peft_model(self.bert, self.__peft_config)
self.bert.train()
Collaborator

is this switch necessary? it was working before

Member Author

Reverted. There's a chance I did this because certain loading paths load the model in eval mode, which makes PEFT do weird things. But I can't seem to reproduce it. Apologies

Collaborator

no worries, mostly was wondering about needing to change other models
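For readers unfamiliar with the PEFT step being discussed, a minimal sketch of wrapping an encoder with LoRA adapters; the rank and target modules are illustrative assumptions, while lora_alpha and lora_dropout mirror the values in the config diff above.

from transformers import AutoModel
from peft import LoraConfig, get_peft_model

bert = AutoModel.from_pretrained("xlm-roberta-large")
peft_config = LoraConfig(r=64,
                         lora_alpha=128,
                         lora_dropout=0.1,
                         target_modules=["query", "value"],
                         bias="none")
# the thread above settled on wrapping first, then switching to train mode
bert = get_peft_model(bert, peft_config)
bert.train()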

self.bert.train()
self.bert = get_peft_model(self.bert, self.__peft_config)
self.trainable["bert"] = self.bert

if build_optimizers:
self._build_optimizers()
self._set_training(False)
self._coref_criterion = CorefLoss(self.config.bce_loss_weight)
Collaborator

maybe a comment on the distinction between the coref & rough criterions would be helpful

Member Author

done

@@ -117,13 +171,15 @@ def training(self, new_value: bool):
@torch.no_grad()
def evaluate(self,
data_split: str = "dev",
word_level_conll: bool = False
word_level_conll: bool = False,
eval_lang=None
Collaborator

consistency on the typing might be nice

Member Author

done
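For instance, the consistently typed signature might read as follows; Optional[str] for eval_lang is an assumption about the intended type.

from typing import Optional

@torch.no_grad()
def evaluate(self,
             data_split: str = "dev",
             word_level_conll: bool = False,
             eval_lang: Optional[str] = None):
    ...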

@@ -185,8 +244,9 @@ def evaluate(self,
f" p: {s_lea[1]:.5f},"
f" r: {s_lea[2]:<.5f}"
)
logger.info(f"BAKE!: {w_checker.bakeoff:.5f}")
Collaborator

i do think a more informative log line would be helpful

Member Author

done. apologies

@@ -421,12 +488,17 @@ def train(self, log=False):
for doc_indx, doc_id in enumerate(pbar):
doc = docs[doc_id]

# skip very long documents during training time
Collaborator

could this be an option?

Member Author

I believe we discussed this being strictly good, simply because it quadruples the memory limit during training and seems to confer no actual performance benefits. Happy to make this a flag if needed, however.

Collaborator

that's fair. although i was thinking that there could be a cutoff where those lines become batches by themselves. however, it's also not necessary to do that, i think, especially if it's not giving any benefit

Collaborator

maybe a comment on how many training lines will be skipped, in that case?
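A minimal sketch of the skip plus the suggested count logging; the threshold name (max_train_len), the doc["subwords"] length check, and the log wording are assumptions, while pbar, docs, and logger come from the surrounding code.

# skip very long documents during training and report how many were dropped
skipped_docs = 0
for doc_indx, doc_id in enumerate(pbar):
    doc = docs[doc_id]
    if len(doc["subwords"]) > max_train_len:
        skipped_docs += 1
        continue
    ...  # normal training step
if skipped_docs:
    logger.info("Skipped %d training documents longer than %d subwords",
                skipped_docs, max_train_len)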

running_c_loss += c_loss.item()
running_s_loss += s_loss.item()

# log every 50 docs
if log and doc_indx % 50 == 0:
# log every 100 docs
Collaborator

this could also be an option

Member Author

happy to do so; do you think there are other areas where this flag would be used? i.e.: this only affects wandb logs; would love to hear how best I could implement this.

Thanks in advance!

Collaborator

eh, guess not. having some default behavior for the logs is fine until someone complains about wanting more granularity

@@ -490,30 +564,44 @@ def train(self, log=False):
# ========================================================= Private methods

def _bertify(self, doc: Doc) -> torch.Tensor:
Collaborator

some similar pieces of logic are also in models/common/bert_embedding.py
not saying this needs to happen this time around, but it would be useful to unify that so there's just one source of truth

that code also handles some other model types, such as the VI extension to bert (phobert)

Member Author

it looks like the original design put bert.py there as bert utilities (which have no logic dependent upon bert choice/initialization), whereas this does. happy to refactor if you think that's best

y = (y == cluster_ids.unsqueeze(1)) # True if coreferent
# For all rows with no gold antecedents setting dummy to True
y[y.sum(dim=1) == 0, 0] = True

if singletons:
# add another dummy for firts coref
Collaborator

*first

Member Author

whoops; thanks
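For concreteness, a sketch of the extra dummy column being discussed, extending the target construction shown in the diff above; which rows get the new column set (here: mentions with a gold cluster but no antecedent) is an assumption about the PR's target format.

import torch

def add_singleton_dummy(y: torch.Tensor, cluster_ids: torch.Tensor,
                        singletons: bool) -> torch.Tensor:
    """Sketch: y already has a dummy in column 0 and candidate-antecedent
    cluster ids in the remaining columns; cluster_ids[i] is mention i's gold
    cluster id (0 if it is in no cluster)."""
    y = (y == cluster_ids.unsqueeze(1))        # True if coreferent
    # rows with no gold antecedent point at the dummy column
    y[y.sum(dim=1) == 0, 0] = True
    if singletons:
        # add another dummy for the first mention of a chain; marking it for
        # mentions that have a gold cluster but no antecedent is an assumption
        first = (y[:, 0] & (cluster_ids > 0)).unsqueeze(1)
        y = torch.cat([first, y], dim=1)
    return y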

@AngledLuffa (Collaborator)

Overall it looks good, thanks! Just a bunch of random nitpicks and of course the MWT code being reverted.

I would think that with the config changes the way they are, the original model is no longer viable, right? Or does loading the pipeline with the old model and the new code still work?

If the old model is dead (which is fine) we should either fix the existing .pt file or rebuild it.

IS_UDCOREF_FORMAT = True
UDCOREF_ADDN = 0 if not IS_UDCOREF_FORMAT else 1

# TODO: move this to a utility module and try it on other languages
Collaborator

this is just a copy of the one in convert_ontonotes.py, right? can we refactor that now?

Member Author

ah, yes, absolutely. will do ASAP with the other items today (just got a temp laptop, setting up dev env)

@Jemoka (Member Author) commented Jul 26, 2024

closing in favor of #1406

@Jemoka Jemoka closed this Jul 26, 2024
@AngledLuffa AngledLuffa deleted the multilingual-coref-2 branch July 31, 2024 21:53