
YouTube is not getting rid of its hit-or-miss auto-generated captions. Here's an instance from this year where the auto-generator thought people were saying slurs when they weren't:

https://twitter.com/thetomska/status/1222197520891334656



> the auto-generator thought people were saying slurs, but they weren't

I'm sympathetic towards YouTube on this particular issue. I clearly remember an early period when a lot of slurs were filtered from YouTube's auto-generator output to prevent exactly such incidents, and there were furious complaints about how YouTube censors words in the captions. For example, see [0], and note how the comments are overwhelmingly negative: "Is YouTube becoming a kindergarten now?", "Haven't you got the latest dictionary of Newspeak?", "The age of snowflakes". Yet no one recognized the serious technical problem of incorrectly identifying words as slurs.

Now that many word filters on the auto-generator's output have been lifted as the system has become more reliable, we get furious complaints about how YouTube incorrectly identifies words as slurs.

I think there has been a similar accusation claiming that Google removes negative keywords for certain politicians from its search recommendations. A quick test confirmed that it does so even for convicted criminals. Same reason: incorrectly recommending negative keywords about a living person would be a serious incident.

[0] Youtube are Censoring the word BULLSHIT https://www.youtube.com/watch?v=bHvkEPL7mVA


What's the point in the system censoring these words in the subtitles, if the words are uncensored in the source audio?


The point is to avoid false positives for autogenerated captions, since false positives are what's damaging about allowing slurs in autogenerated captions, not slurs in captions in general.

For example, imagine a video where someone says a slur and the system transcribes it without censoring. If the system detects the slur correctly (a true positive), everything is fine. But if there was no slur and the system detects one anyway (a false positive), that's a big problem. Imagine watching a video about perfume where the narrator says something like "the smell lingers for a while", and the transcription code thinks it heard the n-word instead of "lingers". It would be disastrous.

Now imagine there is a censoring system in place for slurs. What it does is eliminate false positives at the root, by completely removing those words from the list of possible words the generator can output, guaranteeing that situations like the one above can never happen. If a true positive would net a slur, the word is simply dropped. PR disaster avoided.
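To picture the word filter concretely, here's a toy sketch of my own (placeholder blocklist entries, not YouTube's actual code): drop any blocklisted word from the generated caption outright, so a false positive can never surface, at the cost of dropping true positives too.

```python
# Toy sketch (my illustration, not YouTube's actual system): strip
# blocklisted words from an auto-generated caption entirely.

BLOCKLIST = {"slur1", "slur2"}  # hypothetical placeholder entries

def scrub(caption: str) -> str:
    """Remove any blocklisted word from the caption outright."""
    return " ".join(w for w in caption.split() if w.lower() not in BLOCKLIST)

print(scrub("Because slur1 you, that's why"))  # -> Because you, that's why
```

Whether the word was a true or false positive, the output simply never contains it.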

"Censorship", in this case, is just a very unfortunate side-effect of solving that problem I described above. As long as they don't censor those words in manually created captions, I am totally ok with it, because then the burden of a false positive is on the person who wrote the captions. I.e., if google doesn't ban "true positive slur" scenarios in manually generated captions, imo it is a reasonable and understandable compromise.


Censoring (as in bleeping) doesn't prevent the damage. It's like a phone autocorrecting "duck" to "f*ck". Putting a star there doesn't undo the "damage" of getting it wrong.

Autocorrects solve this by omitting the word from their dictionary altogether, so if a false positive happens it's from duck to luck, not duck to something censored.


If the censorship of text is complete (i.e. just the text "[censored]" instead of any word at all), then in a false-positive scenario, the "[censored]" will appear entirely out-of-context (since it wasn't the right word to begin with), so there'll be no way, just by reading the resulting text, to intuitively figure out what word the system thought it had recognized.


"[censored]" standalone is no better, because of the implication. If you want to go that route, pick something else.

Voice: "Because lucky you, that's why"

Caption 2: "Because f*ck you, that's why"

Caption 3: "Because [censored] you, that's why"

Caption 4: "Because [unintelligible] you, that's why"

You tell me, which one of these failure modes is better?


The best failure mode is none of those. Using your example, the best one is the one that finds the closest approximate word that isn't a slur.

That would solve both problems at once, because there is a really strong chance that the closest guess that isn't a slur is the word that was actually said. And if it picks an incorrect word, it would have picked an incorrect word either way and been just as wrong.

So if the voice said "Because lucky you, that's why" and YouTube's auto-transcript thinks it heard "fuck" rather than "lucky", it would either default to "lucky" (which is good) or to another close non-slur word like "mucky" or "rocky".

And, in my opinion, this is way better than putting [censored] or any other workaround out of those you mentioned, because all of them have zero chance to end up as the correct word (in case the slur was detected incorrectly), while with the approach I described, it is very possible.
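A toy sketch of that fallback (my own illustration; the candidate words and scores are made up, not from any real recognizer): keep the recognizer's ranked hypotheses and emit the highest-scoring one that isn't on the blocklist.

```python
# Toy sketch (my own illustration): instead of censoring, emit the
# highest-scoring candidate word that is not on the blocklist.

BLOCKLIST = {"fuck"}

def best_non_slur(candidates):
    """candidates: (word, score) pairs from the recognizer, any order."""
    allowed = [(w, s) for w, s in candidates if w not in BLOCKLIST]
    if not allowed:
        return "[unintelligible]"  # nothing safe left to emit
    return max(allowed, key=lambda p: p[1])[0]

# The recognizer mishears "lucky" and ranks a slur first; the
# next-best non-blocked guess is often the word actually said.
print(best_non_slur([("fuck", 0.62), ("lucky", 0.58), ("mucky", 0.21)]))
# -> lucky
```

In the false-positive case this recovers the real word; in the true-positive case it degrades to a nearby word, which is the compromise described above.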


I don't think deaf users would agree with you, because they're going to be trying to figure out whether the speaker actually said "lucky" rather than "fuck". Better to choose the best match, and fuck the prudes.


I feel like we're arguing for the same thing.


The false-positive case is more like, e.g. "[censored] the chicken crossed the road." There's no slur/curse that actually makes sense in that position, according to the rules of the English language, so there's no implication.


This thread isn't about what usually happens, but about the potential damage of an outlier, such as auto-captioning a slur in the speech of a foreign official with a heavy accent.


When YouTube was censoring slurs in automatic captions, it was just a simple search and removal.

Caption 5: "Because you, that's why"

Not the best failure mode, but it does completely prevent the damage, <del>and your argument about how censorship doesn't prevent damage doesn't apply</del>.


My argument was that bleeping doesn't prevent the damage, because there is an implication that a slur was said. Full removal or full replacement does.


I see; apologies for ignoring the context. I missed your grandparent comment.


[flagged]


Please don't take HN threads into nationalistic flamewar, regardless of which country you're for or against. It's definitely not what we want here.

https://news.ycombinator.com/newsguidelines.html


Inadvertently censoring slurs from all subtitles, when the intention was only to remove them from automatically generated subtitles out of concern for their correctness, would be a plausible explanation.

BTW, I don't know whether the word filter was applied to all subtitles or only to automatic ones, so I'm limiting the scope of my discussion strictly to automatically generated subtitles.


I think there's a big legal distinction between publishing unaltered user content that happens to be offensive and creating such content yourself (the subtitles are created by Google software, and thus by Google). It's similar to why Google bans many more words from autocomplete than from the search results you get if you actually type them: autocomplete suggestions can be argued to be Google-generated content, and Google might be legally liable for them.

https://www.theguardian.com/technology/2016/dec/05/google-al...


Yes, that's what I meant by "Google removes negative keywords for certain politicians from its search recommendations".



