POSTCLITIC gets generated in the list of word even though not present in source transcript #26

timotheecour · 2024-07-09T18:47:16Z

Describe the bug
POSTCLITIC is output as a word even though it never appears in the source transcript. I wonder what else gets generated this way; it makes the data harder to use for transcription purposes.

Relevant CHILDES or TalkBank data
https://sla.talkbank.org/TBB/homebank/Public/VanDam-5minute/ML77/ML77_020400a.cha

To reproduce

def bug_D20240709T163017_POSTCLITIC():
    import pylangacq
    # Local copy of the .cha file linked above
    filename = "/data/timothee/talkbank/media/homebank/Public/VanDam-5minute/ML77/ML77_020400a.cha"
    reader = pylangacq.read_chat(filename)
    # Count tokens whose word is the placeholder "POSTCLITIC"
    count = 0
    for a in reader.utterances():
        for b in a.tokens:
            count += b.word == "POSTCLITIC"
    print(count)
bug_D20240709T163017_POSTCLITIC()

This prints a number greater than 0.

Expected behavior
No POSTCLITIC should be output.

Note
zooming in on where this occurs:

%gra:   1|2|SUBJ 2|0|ROOT 3|4|NEG 4|2|COMP 5|2|PUNCT
*CHI:   wow . ^U17651_20428^U
%mor:   co|wow .
%gra:   1|0|INCROOT 2|1|PUNCT
*MOT:   (..) okay gotta put these back . ^U20428_26799^U
%mor:   co|okay mod|got~inf|to v|put&ZERO pro:dem|these v|back .
%gra:   1|4|COM 2|4|AUX 3|4|INF 4|0|ROOT 5|6|SUBJ 6|4|COMP 7|4|PUNCT
*CHI:   &+ba back . ^U26799_29336^U
%mor:   adv|back .

=>

"okay gotta POSTCLITIC put these back ."

Note 2
In a comment on #23, @jacksonllee mentions:

Token(word='POSTCLITIC', pos='pro:obj', mor='me', gra=Gra(dep=2, head=1, rel='OBJ')),

which makes me wonder: is this even intentional? How can a caller tell which tokens are actual words?
Should this code (to get the transcript) be replaced by something else?

transcript = " ".join([b.word for b in a.tokens])


jacksonllee · 2024-07-11T01:56:43Z

If you're accessing data from the Token objects, then yes seeing "POSTCLITIC" (or sometimes "PRECLITIC") at the word attribute of Token is intentional by design and not a bug. The package has a strong emphasis on aligning the cleaned-up utterance with the available %mor tier, so when we get, say, 5 words from the utterance but 6 elements from %mor (which is the situation with pre-clitics like French l'article or post-clitics like English gotta), then there'd be one Token object without a word form, in which case I've decided to put in something like "POSTCLITIC" in its word attribute. I can update the documentation to mention that "PRECLITIC" and "POSTCLITIC" are the only possibilities in Token's word attribute that are not from the data.

Also, it sounds like you're interested in getting the transcription data that's cleaned up and without the CHAT annotations? The way the package does the utterance cleaning is to use the currently private _clean_utterance function. I think I can consider adding a new attribute to the Utterance object to hold the cleaned-up utterance (which wouldn't have "PRECLITIC" or "POSTCLITIC"), so that users wouldn't have to access Token objects' word attribute to join back an utterance on their own.
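Until such an attribute exists, a caller-side workaround could look like the sketch below. It relies only on the statement above that "PRECLITIC" and "POSTCLITIC" are the sole non-data values of Token's word attribute. The helper name join_words and the FakeToken stand-in are hypothetical and not part of pylangacq; the stand-in exists only so the snippet runs without the library or the data file.

```python
from collections import namedtuple

# Per the explanation above, these are the only placeholder values that
# can appear in Token.word without coming from the transcript itself.
CLITIC_PLACEHOLDERS = {"PRECLITIC", "POSTCLITIC"}


def join_words(tokens):
    """Join token words into one string, skipping clitic placeholders."""
    return " ".join(t.word for t in tokens if t.word not in CLITIC_PLACEHOLDERS)


# Hypothetical stand-in for pylangacq's Token; only .word is needed here.
FakeToken = namedtuple("FakeToken", "word")

tokens = [FakeToken(w) for w in "okay gotta POSTCLITIC put these back .".split()]
print(join_words(tokens))  # okay gotta put these back .
```

With real data the same helper could be called as join_words(a.tokens) for each utterance a. One caveat: a transcript that literally contains the uppercase word POSTCLITIC would lose it, which seems unlikely but is not ruled out.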

timotheecour · 2024-07-29T18:50:46Z

I think I can consider adding a new attribute to the Utterance object to hold the cleaned-up utterance

that would be great

timotheecour · 2024-07-29T19:33:26Z

Actually, calling _clean_utterance doesn't make any difference; it looks like reader.utterances() already calls it. E.g. this assertion passes for every utterance:

for a in reader.utterances():
    transcript = " ".join([b.word for b in a.tokens])
    assert _clean_utterance(transcript) == transcript

The POSTCLITIC is still there (e.g. we don't POSTCLITIC need any soap .), and the other CHA artifacts are still there, e.g.
∇oh I don't know , I think I'll go ▔home tomorrow▔∇
Ideally, there would be an API to get the raw transcript, e.g.:
oh I don't know , I think I'll go home tomorrow

and (optionally) an API to get the raw transcript with just the rich annotations like breaths, laughs, etc. (I realize that might be hard to define, but something like "all the audible sounds"):

I mean: , but like <I was like one and a half centimeters> [% laughing fast]
=>
I mean, but like I was like one and a half centimeters [laughing fast]

How do I get these back? reader.utterances() strips out these [% laughing fast] and similar
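To make the request concrete, here is a rough sketch of the two transformations, written as plain string processing rather than anything pylangacq provides. The function names are hypothetical. It assumes the raw main-tier text is available as a string, since reader.utterances() has already dropped the [% ...] comments. It covers only the markers that appear in this thread (∇, ▔, <...> scoping and [% ...] comments), not the full CHAT inventory.

```python
import re

# Only the two markers seen in the example above; CHAT defines many more.
PROSODY_MARKS = "∇▔"


def strip_marks(text):
    """Remove the listed marker characters, leaving everything else as-is."""
    return text.translate({ord(c): None for c in PROSODY_MARKS})


def keep_comments(text):
    """Drop <...> scoping brackets and turn [% comment] into [comment]."""
    text = re.sub(r"<([^<>]*)>", r"\1", text)
    return re.sub(r"\[%\s*([^\]]*)\]", r"[\1]", text)


print(strip_marks("∇oh I don't know , I think I'll go ▔home tomorrow▔∇"))
print(keep_comments("I mean: , but like <I was like one and a half centimeters> [% laughing fast]"))
```

This does not touch the colon in "I mean: ," or the spacing before punctuation, so the second result is still a step away from the ideal output shown above.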

timotheecour added the bug label Jul 9, 2024
