People talk so much about fake news. There's an abundance of photoshopped images, and now there are even doctored videos and engineered voice messages.
Has this been historically true?
Is there anything in that noise? Can we protect ourselves in any way?
Cheers,
-- Sophie
"Fake news" is a neologism, a phrase of fairly recent origin. It is used as a catch-all for anything we suspect could be fabricated. Traditionally, we call attempts to shape public opinion propaganda. It has been around since the dawn of humanity, quietly marching alongside history, shaping our views without our knowledge, molding itself to fit cultures, languages, and times.
It is therefore paramount that we have a solid definition of propaganda. What are we looking for?
One of my favorite examples of propaganda is the history textbooks of a small Eastern European country. Children in the '90s learned the history of their tiny country as the Western world knows it. Children of the '00s read textbooks that were rewritten to make Moldova sound like a key player in European history. Children of the current decade know that Moldova has always been and will always be a part of the European Union, regardless of what The Hand Of Moscow wants. In this case, the reshaping of the national mentality was a state-sanctioned, state-directed effort carried out over three decades. Do the parents notice the changes in these shiny new textbooks? Perhaps. The children, however, still must learn the history in its current iteration to graduate from high school.
Another good example of propaganda is a 2015 article released by a Russian newspaper regarding the state of affairs in Ukraine. It had three paragraphs. The first was about the subway system's crippling debt, which could be easily verified with a quick Google search. The second was about cigarettes disappearing from the Ukrainian capital - hard to verify, possibly subjective, and thus skimmed over. The third was about rising unemployment, an expected side effect of war that does not require verification at all. Three simple paragraphs, one big idea. A Western or, perhaps, younger reader will not make anything of it. But to anyone who grew up in the USSR the article sets off a subconscious alarm. In the USSR, as a remnant of World War II, all tobacco factories made cigarettes in the exact dimensions of AK-47 bullets. The idea was that the entire country could be turned into a war complex in under three hours. This alarm bypasses the conscious mind because it appears subjective, but passes the subconscious gateways because it is surrounded by verifiable or common-sense ideas. That simple article sends a clear and scary message - the war is here.
Propaganda is meant to target a large stratum of people. It uses images and powerful language to speak directly to their subconscious. It is about the strategic placement of information and does not require anything fabricated. The scary part is that anyone can slip propaganda into any news source and, unless the facts are wrong, it could well bypass all of the defences that we have learned to put up.
This definition is hard to communicate to a machine, partly because it does not require the facts to be wrong, and partly because computational techniques are still relatively weak at detecting subliminal messages.
But machines can help anyway.
Some solid data came to my attention through a paper: Rashkin, Hannah, et al. "Truth of varying shades: Analyzing language in fake news and political fact-checking." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. Its advantage is a corpus of almost 50,000 annotated articles totalling over 60 million words. The training set comes from all kinds of sources, including Onion, American News, The Activist, and the Gigaword news excluding American PressWorks, WPB magazine. The test set is clever as well. It contains 6,000 articles from Borowitz Report, DC Gazette, Activist News, Activist Post, Natural News, Clickhole, Xinhua News, AFP, NYTimes, and Associated Press.
This data set is robust. A first look at word count and general tone in clean news and propaganda reveals not only the trivial challenge of imbalanced train and test sets, but also a complete overlap in both word count and tone for the bulk of the clean news and propaganda slices of our data set. This is reflected in the outcome reported in Rashkin's paper: an F1 score of 0.91 on the dev set and 0.65 on the test set.
Don't fret. There's a way to improve on that test score. But first, let's discuss some tools and findings.
A dim = 60 word vector is, frankly, too much for an average machine to handle gracefully in a run-of-the-mill machine learning algorithm. Or ungracefully, for that matter. So AWS and LIWC lent a bit of their black magic and spat out 95 dimensions covering features like pronouns, numbers, temporal focus, clout, punctuation, and male and female language.
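For anyone unfamiliar with the F1 score, here is a minimal sketch of how it is computed for a single class: it is the harmonic mean of precision and recall. The function name, the label strings, and the toy labels are my own assumptions for illustration, and the paper may well average the score across several classes rather than use this single-class form.

```python
def f1_score(true_labels, predicted_labels, positive="propaganda"):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / (tp + fp)  # how many flagged articles were truly positive
    recall = tp / (tp + fn)     # how many positive articles were flagged
    return 2 * precision * recall / (precision + recall)


# Made-up labels, only to show the call.
truth = ["propaganda", "news", "propaganda", "news", "propaganda"]
guess = ["propaganda", "propaganda", "news", "news", "propaganda"]
print(round(f1_score(truth, guess), 3))
```

A score of 1.0 would mean every propaganda article was caught and nothing clean was flagged by mistake.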
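To give a feel for what such features look like, here is a rough sketch that counts a handful of them by hand. This is not LIWC, which relies on its own curated dictionaries. The pronoun list, the function name `style_features`, and the sample sentence are all assumptions of mine, and the counts are crude approximations scaled per 100 words.

```python
import re

# A tiny, hand-picked pronoun list (an assumption, far smaller than a real lexicon).
PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your", "he", "him",
            "his", "she", "her", "they", "them", "their", "it", "its"}


def style_features(text):
    """Return a few style counts, each scaled per 100 words."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)  # avoid dividing by zero on empty text

    def per_100(count):
        return 100.0 * count / n_words

    return {
        "pronouns": per_100(sum(1 for w in words if w in PRONOUNS)),
        "numbers": per_100(len(re.findall(r"\d+(?:[.,]\d+)*", text))),
        "apostrophes": per_100(text.count("'")),
        "dashes": per_100(text.count("-")),
        "colons": per_100(text.count(":")),
        "exclamations": per_100(text.count("!")),
    }


# A made-up sentence, only to show the call.
sample = "We won't wait: they took 3 of our 12 jobs!"
print(style_features(sample))
```

Each article becomes a short row of numbers like this one, which is far easier on an average machine than raw text.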
Exploratory data analysis showed that:
revealed an elevated use of:
Apostrophes
Numbers
Focus on the past
Dashes
revealed an elevated use of:
Pronouns
Emotions
Focus on the present
Exclamation marks
Colons
If everything else is fairly self-explanatory and expected, the use of dashes vs colons remains a fascinating question and requires input from a linguist. Perhaps dashes are used to extend a thought, and colons are used to hammer it in.
So, armed with this, some 400 models of all shapes and forms were run on the available data set. After the first results revealed the distinct success of the K-Nearest Neighbors model, some algebraic geometry sorcery was performed.
The result to beat, the one that Rashkin's paper had, was a test F1 score of 0.65. After all of the effort, it was beaten by almost 25 percentage points, for a test F1 score of 0.898.
This means that, roughly speaking, data can currently help identify propaganda in close to 9 out of 10 articles.
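For the curious, the core idea of K-Nearest Neighbors fits in a few lines: label a new article by majority vote among the k training articles whose feature vectors lie closest to it. What follows is a bare-bones sketch of that idea only, not the tuned model described here, and it leaves out the sorcery entirely. The function name, the choice of Euclidean distance, k = 3, and the toy vectors are all my assumptions.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbors."""
    # Pair every training vector's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-dimensional feature vectors, only to show the call.
train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
         [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
labels = ["news", "news", "news", "propaganda", "propaganda", "propaganda"]

print(knn_predict(train, labels, [1.1, 1.0]))
print(knn_predict(train, labels, [4.0, 4.0]))
```

The appeal of the method is that it makes no assumption about the shape of the boundary between the two kinds of articles; it only asks which known examples a new one resembles.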
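The gap between the two scores can be read in two ways, as an absolute difference in F1 points or as a gain relative to the baseline. A quick check of the arithmetic, using only the two scores quoted above (the variable names are mine):

```python
baseline_f1 = 0.65   # test F1 from Rashkin's paper, as quoted above
improved_f1 = 0.898  # test F1 after the new modelling effort

absolute_gain = improved_f1 - baseline_f1    # difference in F1 points
relative_gain = absolute_gain / baseline_f1  # as a fraction of the baseline

print(f"absolute gain: {absolute_gain:.3f}")  # absolute gain: 0.248
print(f"relative gain: {relative_gain:.1%}")  # relative gain: 38.2%
```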
Wouldn't it be nice if in the near future this algorithm was incorporated into a mobile app?
In the meantime, keep your free will intact!
Love,
-- Data