Understanding _REGEX_REPLACE #21
Comments
Hello! For datasets that come directly from the CHILDES / Talkbank ecosystem or those that strictly follow their CHAT data format, transcribed data in the form of |
Thanks for the response! I am working with .cha files coming from the CLAN software. The .cha files are transcripts of classroom video recordings. I don't know much about how CLAN works, or whether it enforces the CHILDES/Talkbank CHAT format such that question marks in sentences are preceded by white space. |
Thanks, Jackson, for your response. We are using existing transcripts created in CLAN, but they are not part of the CHILDES / Talkbank datasets. |
Feature you are interested in and your specific question(s):
In the _clean_utterance() function there is a _REGEX_REPLACE tuple which contains a sub-tuple. I noticed that this regex takes strings like "hello?" and replaces them with "hell ?".
What you are trying to accomplish with this feature or functionality:
I would expect the value to return as "hello ? ".
Additional context:
This is causing incorrect calculations for the type-to-token ratio.
Do you have any feedback on why this regex is used and if this is expected behavior?
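To make the report easier to reproduce, here is a minimal sketch of the symptom. The sub-tuple itself is not shown above, so both patterns below are hypothetical stand-ins and are not copied from the library: LOSSY is one pattern that would produce the "hell ?" result, and PRESERVING is a variant that captures the preceding character and writes it back. The helper names apply_rule and type_token_ratio are also made up for illustration.

```python
import re

# Hypothetical (pattern, replacement) pair, NOT taken from the library.
# It matches the character before "?" together with the "?" itself,
# then replaces both, so the preceding character is lost.
LOSSY = (re.compile(r"\S\?"), " ?")

# Hypothetical variant: capture the preceding character and put it back.
PRESERVING = (re.compile(r"(\S)\?"), r"\1 ?")


def apply_rule(rule, text):
    """Apply a single (compiled pattern, replacement) pair to text."""
    pattern, replacement = rule
    return pattern.sub(replacement, text)


def type_token_ratio(text):
    """Unique whitespace-separated tokens divided by total tokens."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens)


utterance = "hello? hello ."
lossy = apply_rule(LOSSY, utterance)
preserving = apply_rule(PRESERVING, utterance)

print(repr(lossy), type_token_ratio(lossy))
print(repr(preserving), type_token_ratio(preserving))
```

With the lossy pattern, "hell" and "hello" are counted as two different types, which inflates the ratio. A question mark that already has preceding white space is left untouched by both patterns.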