POSTCLITIC gets generated in the list of words even though it is not present in the source transcript #26

Open
timotheecour opened this issue Jul 9, 2024 · 3 comments

timotheecour commented Jul 9, 2024

Describe the bug
"POSTCLITIC" is output as a word even though it does not appear in the transcript. I wonder what else is generated in the same way; this makes the data harder to use for transcription purposes.

Relevant CHILDES or TalkBank data
https://sla.talkbank.org/TBB/homebank/Public/VanDam-5minute/ML77/ML77_020400a.cha

To reproduce

def bug_D20240709T163017_POSTCLITIC():
    import pylangacq
    # Local copy of the .cha file linked above
    filename = "/data/timothee/talkbank/media/homebank/Public/VanDam-5minute/ML77/ML77_020400a.cha"
    reader = pylangacq.read_chat(filename)
    count = 0
    for a in reader.utterances():
        for b in a.tokens:
            # Count tokens whose word attribute is the "POSTCLITIC" placeholder
            count += b.word == "POSTCLITIC"
    print(count)
bug_D20240709T163017_POSTCLITIC()

This prints a count greater than 0.

Expected behavior
No "POSTCLITIC" should be output, since it is not in the source transcript.

Note
Zooming in on where this occurs in the .cha file:

%gra:   1|2|SUBJ 2|0|ROOT 3|4|NEG 4|2|COMP 5|2|PUNCT
*CHI:   wow . ^U17651_20428^U
%mor:   co|wow .
%gra:   1|0|INCROOT 2|1|PUNCT
*MOT:   (..) okay gotta put these back . ^U20428_26799^U
%mor:   co|okay mod|got~inf|to v|put&ZERO pro:dem|these v|back .
%gra:   1|4|COM 2|4|AUX 3|4|INF 4|0|ROOT 5|6|SUBJ 6|4|COMP 7|4|PUNCT
*CHI:   &+ba back . ^U26799_29336^U
%mor:   adv|back .

%mor: co|okay mod|got~inf|to v|put&ZERO pro:dem|these v|back .

=>

"okay gotta POSTCLITIC put these back ."

Note 2
In #23 (comment), @jacksonllee mentions:

Token(word='POSTCLITIC', pos='pro:obj', mor='me', gra=Gra(dep=2, head=1, rel='OBJ')),

This makes me wonder: is this even intentional? How can a caller distinguish actual words from generated ones?
Should this code (used to get the transcript) be replaced by something else?

transcript = " ".join([b.word for b in a.tokens])
@jacksonllee
Owner

If you're accessing data from the Token objects, then yes, seeing "POSTCLITIC" (or sometimes "PRECLITIC") in the word attribute of a Token is intentional and not a bug. The package puts a strong emphasis on aligning the cleaned-up utterance with the available %mor tier. When we get, say, 5 words from the utterance but 6 elements from %mor (which is what happens with pre-clitics like French l'article or post-clitics like English gotta), one Token object would be left without a word form, so I decided to put a placeholder such as "POSTCLITIC" in its word attribute. I can update the documentation to mention that "PRECLITIC" and "POSTCLITIC" are the only possible values of Token's word attribute that do not come from the data.
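
To illustrate the mismatch with the utterance from this issue, here is a minimal sketch. It assumes that in %mor notation "~" attaches a post-clitic and "$" attaches a pre-clitic to its host; count_mor_elements is a hypothetical helper written for this example and is not part of pylangacq.

```python
def count_mor_elements(mor_tier: str) -> int:
    """Count %mor elements, treating every clitic as an element of its own."""
    count = 0
    for item in mor_tier.split():
        # Assumption: "~" marks a post-clitic boundary and "$" a pre-clitic
        # boundary, so each occurrence adds one more element to align.
        count += 1 + item.count("~") + item.count("$")
    return count


# Main tier and %mor tier of the *MOT utterance quoted in this issue
words = "okay gotta put these back .".split()
mor = "co|okay mod|got~inf|to v|put&ZERO pro:dem|these v|back ."

print(len(words), count_mor_elements(mor))  # prints: 6 7
```

The main tier gives 6 items (including the final period) while %mor gives 7, so one element ("inf|to") has no word form of its own to align with.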

Also, it sounds like you're interested in getting the transcription data that's cleaned up and without the CHAT annotations? The way the package does the utterance cleaning is to use the currently private _clean_utterance function. I think I can consider adding a new attribute to the Utterance object to hold the cleaned-up utterance (which wouldn't have "PRECLITIC" or "POSTCLITIC"), so that users wouldn't have to access Token objects' word attribute to join back an utterance on their own.
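
Until such an attribute exists, a caller could drop the two placeholders when joining words. This is only a sketch: the Token namedtuple below is a hypothetical stand-in for pylangacq's Token (only its word attribute matters here), the token values are illustrative, and join_words is not a pylangacq function.

```python
from collections import namedtuple

# Hypothetical stand-in for pylangacq's Token; only `word` is used below.
Token = namedtuple("Token", ["word", "pos", "mor"])

# The two placeholder strings described above as not coming from the data
CLITIC_PLACEHOLDERS = {"PRECLITIC", "POSTCLITIC"}


def join_words(tokens):
    """Join token word forms, skipping the clitic placeholders."""
    return " ".join(t.word for t in tokens if t.word not in CLITIC_PLACEHOLDERS)


# Illustrative tokens for "okay gotta put these back ."
tokens = [
    Token("okay", "co", "okay"),
    Token("gotta", "mod", "got"),
    Token("POSTCLITIC", "inf", "to"),
    Token("put", "v", "put&ZERO"),
    Token("these", "pro:dem", "these"),
    Token("back", "v", "back"),
    Token(".", ".", ""),
]

print(join_words(tokens))  # prints: okay gotta put these back .
```

One caveat of filtering by string: a transcript that genuinely contained the word "POSTCLITIC" in upper case would lose it, which seems unlikely but is worth knowing.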

@timotheecour
Author

I think I can consider adding a new attribute to the Utterance object to hold the cleaned-up utterance

That would be great.

@timotheecour
Author

timotheecour commented Jul 29, 2024

Actually, calling _clean_utterance doesn't make any difference; it looks like it is already called by reader.utterances().
For example, if I run:

for a in reader.utterances():
  transcript = " ".join([b.word for b in a.tokens])
  assert _clean_utterance(transcript)==transcript

the assertion passes, and the POSTCLITIC is still there (e.g. "we don't POSTCLITIC need any soap ."). Other CHAT markers are still there too, e.g.:
∇oh I don't know , I think I'll go ▔home tomorrow▔∇
Ideally, there would be an API to get the raw transcript, e.g.:
oh I don't know , I think I'll go home tomorrow

It would also help to have (optionally) an API that returns the raw transcript with only the rich annotations such as breaths and laughs (I realize that might be hard to define, but something like "all the audible sounds"):

I mean: , but like <I was like one and a half centimeters> [% laughing fast]
=>
I mean, but like I was like one and a half centimeters [laughing fast]

How do I get these back? reader.utterances() strips out [% laughing fast] and similar annotations.
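
To make the two requests concrete, here is a rough sketch of the transformations I have in mind, written with plain regular expressions over a main-tier string. Both function names are hypothetical and nothing here is pylangacq API. It only handles the markers that appear in the examples above (CHAT defines many more), and it assumes the unprocessed main-tier line is available to the caller in the first place, which would need checking against the installed pylangacq version.

```python
import re

# Only the two special symbols seen in the example above; CHAT defines more.
CA_MARKS = str.maketrans("", "", "∇▔")


def strip_marks(text: str) -> str:
    """Remove the special symbols and normalize whitespace."""
    return " ".join(text.translate(CA_MARKS).split())


def keep_comments(text: str) -> str:
    """Drop scoping/lengthening markup but keep [% ...] comments as [...]."""
    text = re.sub(r"[<>]", "", text)                   # scope brackets <...>
    text = re.sub(r"\[%\s*([^\]]*)\]", r"[\1]", text)  # [% comment] -> [comment]
    text = re.sub(r"(?<=\w):", "", text)               # colon right after a word
    text = re.sub(r"\s+,", ",", text)                  # attach commas to the word
    return " ".join(text.split())


print(strip_marks("∇oh I don't know , I think I'll go ▔home tomorrow▔∇"))
print(keep_comments(
    "I mean: , but like <I was like one and a half centimeters> [% laughing fast]"
))
```

On the two examples from this comment, this prints "oh I don't know , I think I'll go home tomorrow" and "I mean, but like I was like one and a half centimeters [laughing fast]". A proper solution inside the package would presumably work from the CHAT grammar rather than from ad hoc patterns like these.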
