Understanding _REGEX_REPLACE #21

sladenheim · 2024-02-12T18:35:14Z

Feature you are interested in and your specific question(s):
In the _clean_utterance() function, there is a _REGEX_REPLACE tuple that contains the following sub-tuple:

(re.compile(r"[^\[\./!]\?"), " ? ")

I noticed that this regex turns strings like "hello?" into "hell ?", dropping the last letter of the word.
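
For reference, a minimal reproduction using only the pattern and replacement quoted above (the variable names are just for illustration). It shows that the character class matches the character before the question mark, so that character is replaced along with it:

```python
import re

# Pattern and replacement copied from the sub-tuple quoted above.
pattern = re.compile(r"[^\[\./!]\?")

# The match covers the letter before "?" as well as the "?" itself.
match = pattern.search("hello?")
print(match.group())  # o?

# Substituting therefore removes the "o" together with the "?".
result = pattern.sub(" ? ", "hello?")
print(repr(result))  # 'hell ? '
```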

What you are trying to accomplish with this feature or functionality:
I would expect the returned value to be "hello ? ".

Additional context:
This is causing incorrect calculations of the type-token ratio.

Do you have any feedback on why this regex is used and if this is expected behavior?

jacksonllee · 2024-02-13T03:32:04Z

Hello! For datasets that come directly from the CHILDES / Talkbank ecosystem or those that strictly follow their CHAT data format, transcribed data in the form of "hello?" would not be expected, and instead would be "hello ?" with a space before the question mark. (This being said, I've recently become aware that some official CHILDES / Talkbank datasets might have these spaces missing after their semi-regular dataset updates, but this seems to occur only when there are non-ASCII characters around the spot where the space would be expected.) Where does your dataset come from?
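
To illustrate the spacing point, here is a small check that assumes nothing beyond the pattern quoted in the question. When the space is present, the character the pattern consumes is the space itself, so no letter is lost:

```python
import re

# Pattern and replacement as quoted in the original question.
pattern = re.compile(r"[^\[\./!]\?")

# CHAT-style input: the space before "?" is what the character class matches.
spaced = pattern.sub(" ? ", "hello ?")
print(repr(spaced))  # 'hello ? '
```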

sladenheim · 2024-02-14T01:19:48Z

Thanks for the response! I am working with .cha files coming from the CLAN software. The .cha files are transcripts of classroom video recordings. I'm not that knowledgeable about how CLAN works or whether it enforces the CHILDES/Talkbank CHAT format so that question marks in sentences have preceding whitespace.
Is this the only punctuation that follows this convention (e.g., do periods work the same way)? This helps clarify that what is happening is the expected behavior. Do you think it is worthwhile to build in functionality to handle cases where this convention is violated? Then again, if the input violates the convention, it might be better to just raise a warning.
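
A rough sketch of what such a check could look like as a preprocessing step on my side. Everything here is hypothetical and not part of the library: the name space_question_marks and the pattern _MISSING_SPACE are made up for illustration, and the excluded characters simply mirror the character class in the quoted regex, presumably so that codes such as [?] are left alone. It uses a lookbehind so the preceding character is kept, inserts the missing space, and warns when it had to:

```python
import re
import warnings

# Hypothetical pattern: a "?" directly preceded by a character that is not
# whitespace and not one of "[", ".", "/", "!". The lookbehind does not
# consume that character.
_MISSING_SPACE = re.compile(r"(?<=[^\s\[\./!])\?")


def space_question_marks(utterance: str) -> str:
    """Insert a space before any "?" that lacks one, warning if any were found."""
    fixed, count = _MISSING_SPACE.subn(" ?", utterance)
    if count:
        warnings.warn(
            f"{count} question mark(s) lacked a preceding space: {utterance!r}"
        )
    return fixed


print(space_question_marks("hello ?"))  # hello ?
```

Calling space_question_marks("hello?") would return "hello ?" and emit a warning, while input that already follows the convention passes through unchanged.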

JohaCM · 2024-02-14T18:43:49Z

Thanks, Jackson, for your response. We are using existing transcripts created in CLAN, but they are not part of the CHILDES / Talkbank datasets.

sladenheim added the question label Feb 12, 2024
