We Go Deep: Data-Mining in Pornography
We Go Deep seems to be the title of a porn movie; I caught it while my crawler worked its way through the whole data set at cavr.com, a pornographic movie database that lists ratings, actors and actresses and often a description. The descriptions on cavr are especially interesting for data-mining purposes, because they do not consist of complete natural language sentences, but just feature the core keywords. This makes it easier to analyze the contents of pornography.
The intention of this article is to give a short overview of some statistical facts about a business that is rarely analyzed and considered, even though more than half of our population (almost all males plus a part of the females, according to a little survey I did) knows it. I’m talking about the porn industry.
Sometimes there is criticism, but most of the time the pornography industry leads a pretty easy-going life, whereas other industries (e.g. currently the web industry with regard to privacy) have to justify themselves.
Moreover, I want to show how you can create such statistics from data you have.
Most popular first names
Let’s start with a very basic analysis that is easy to create. How about the first names of porn actresses? Which ones are used most often?
Technical approach
Doing this is pretty easy if you have a list of actresses’ names. Just take all the names, split off the first word at the first space and sum up all occurrences. Of course some actresses could have double names, but as these are stage names, I guess the first name would still be the more important and common one. I even doubt that many people choose double first names as stage names.
We use the Natural Language Toolkit (NLTK), which includes features for a lot of tasks in natural language processing. This is the first project I am doing with it. NLTK comes with a utility for frequency distributions: we can throw a list in and get back an object we can plot directly.
In Python the code looks like this:
import nltk

# Database access uses MongoDB, stars.find() returns all actresses
firstnames = [star['name'].split()[0] for star in stars.find()]
fdist = nltk.FreqDist(firstnames)
fdist.plot(50)
Results
The most frequently used first name is Nikki (40 actresses use it), followed by Vanessa and Victoria (both with more than 35 occurrences). Then come Kelly, Ashley, Jessica, Angel, Samantha, Tiffany, Michelle and so on.
Most commonly used words in porn titles
Let’s do something more complicated. Now we want to see which words are most commonly used in the titles of American pornography productions. There are some interesting results, but first let’s again have a look at how we can get such an analysis.
Technical approach
We already have a database of American porn movies. We read all the titles and again split them by spaces. But this time we have to consider stopwords. These are the little words like me, at, with etc. that appear in every text, so they are not statistically relevant. As a lot of titles include numbers (because they are series), we also exclude those. Special characters like & are not relevant either.
To be able to group similar words together we also use stemming. This reduces a word to its stem (which is a bit similar to the grammatical stem of a word, but not necessarily the same).
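To get a feeling for what the stemmer does, here is a tiny example using NLTK’s Porter stemmer (the same one used in the code below; the exact output can vary slightly between NLTK versions):

import nltk

porter = nltk.PorterStemmer()
print porter.stem('dirty')     # dirti
print porter.stem('screwing')  # screw
print porter.stem('barely')    # bare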
In Python the code for keyword generation can look like this:
porter = nltk.PorterStemmer()
stopwords_english = stopwords.words('english')  # requires: from nltk.corpus import stopwords

all_keywords = []
for movie in movies.find():  # 'movies' is the MongoDB collection of movie records
    keywords = [porter.stem(keyword.lower())  # use stemming and lowercase
                for keyword in movie['name'].split()
                if not keyword.lower() in stopwords_english  # exclude stopwords
                and keyword.isalpha()]  # keep only alphabetic words (no numbers, no special characters)
    all_keywords = all_keywords + keywords
Then we throw all these keywords into a frequency distribution. We also reduce the number of displayed items in our plot to the 100 most frequent, because there are just too many different words to display them all in one plot.
fdist = nltk.FreqDist(all_keywords)
fdist.plot(100)
Results
And already we can see the result: since the pornography industry does not use many clothes, skin color plays a huge role. Black (about 800 occurrences) and white (about 300) are among the most frequently used words in porn titles. The next most frequent ethnicity keyword is asian (about 180).
The information from the documentary “9 to 5 - Days in Porn” is also confirmed: there is a hell of a lot of anal sex in the American pornography industry. Ass, anal and butt all belong to the most important words. If you summed them up, they would exceed the term girl, which is mentioned in almost 1000 titles. Later we will see what percentage of pornography really includes anal sex in the scenes (as the title only indicates a specialization on that topic).
The pornography industry also likes to use derogatory words like slut, whore or bitch, paired with dirti (the stem of dirty). On the other hand, purity and youth play a large role: teen, young, first, angel, virgin.
It is interesting that there are so many episodes of Barely Legal that the series even made it into the top keywords (check the stems bare and legal in the graph).
Finding collocations in description texts
Let’s move on to the description texts of our porn movies. You might want to know whether there are any terms that are usually used together in pornography. In non-pornographic text one example would be the term “United States”, which consists of two distinct words but forms one term.
Technical approach
So how do we do this? With the NLTK package it’s actually quite easy, but all the details come from an answer on stackoverflow. We already used the frequency distribution utility; now we will use it to count both single words and bigrams. A bigram is just a tuple of two words that follow each other. In the text “I am a man” we have three bigrams: (I, am), (am, a), (a, man).
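NLTK can generate these bigrams for us, as this tiny example shows:

import nltk

print list(nltk.bigrams("I am a man".split()))
# [('I', 'am'), ('am', 'a'), ('a', 'man')]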
What we will do now is check which words follow each other most frequently. We have to count the occurrences of single words to normalize the whole calculation: if there are 20 occurrences of “I”, 20 occurrences of “me” and only 5 occurrences of “hello”, then of course I and me will follow each other more frequently, but that does not mean they follow each other unusually frequently. Thus we need to know how often each word occurs.
These two distributions are then thrown into a class BigramCollocationFinder which does all the calculating for us.
To avoid finding collocations across dots, we have to ensure that words before and after a dot cannot be seen as a bigram. Under normal circumstances this is not important, because sentence endings and beginnings differ too much to show up as collocations, but as we only use mini sentences of mostly 2-3 words, many of which begin with she, we have to watch out. So we split off dots as distinct tokens with the WordPunctTokenizer and then filter the special characters out. We also want to filter stopwords, as collocations containing them are not really useful.
In source code the whole action looks like this:
import nltk
from nltk.corpus import stopwords

stopwords_english = stopwords.words('english')
tokens = nltk.WordPunctTokenizer().tokenize(fulltext)
bigram_measures = nltk.collocations.BigramAssocMeasures()
word_fd = nltk.FreqDist(tokens)
bigram_fd = nltk.FreqDist(nltk.bigrams(tokens))
finder = nltk.BigramCollocationFinder(word_fd, bigram_fd)
finder.apply_word_filter(lambda w: w in stopwords_english or not w.isalpha())
# Print the 50 best-scoring bigrams, these might be collocations
print sorted(finder.nbest(bigram_measures.raw_freq, 50), reverse=True)
You might have noticed that there is still an unknown variable fulltext. It holds all values from our scene analysis. How you build it depends on how the data is stored, but it could look like this:
all_scenes = [scene['description'] for scene in scenes.find()]
fulltext = " ".join(all_scenes)
Results
Let’s see what collocations are found in all scene descriptions:
[(u'various', u'positions'), (u'toe', u'sucking'), (u'titty', u'sucking'), (u'titty', u'screwing'),
(u'titty', u'play'), (u'solo', u'fingering'), (u'side', u'ways'), (u'side', u'saddle'), (u'sexy', u'outfits'),
(u'sexy', u'outfit'), (u'sexy', u'black'), (u'self', u'titty'), (u'screwing', u'side'), (u'screwing', u'reverse'),
(u'screwing', u'rev'), (u'screwing', u'doggy'), (u'screwing', u'doggie'), (u'screwing', u'cowgirl'),
(u'safe', u'screwing'), (u'reverse', u'cowgirl'), (u'rev', u'cowgirl'), (u'open', u'mouth'),
(u'mouth', u'facials'), (u'mouth', u'facial'), (u'many', u'positions'), (u'les', u'eating'), (u'hands', u'bj'),
(u'guys', u'stroke'), (u'face', u'sitting'), (u'dual', u'bj'), (u'dp', u'rev'), (u'dp', u'doggie'),
(u'doggy', u'position'), (u'doggie', u'style'), (u'doggie', u'position'), (u'dildo', u'solo'), (u'dildo', u'play'),
(u'dildo', u'bj'), (u'deep', u'throat'), (u'cream', u'pie'), (u'cowgirl', u'riding'), (u'clit', u'solo'),
(u'black', u'guy'), (u'bj', u'clean'), (u'ball', u'sucking'), (u'anal', u'solo'), (u'anal', u'side'), (u'anal', u'rev'),
(u'anal', u'doggie'), (u'anal', u'dildo')]
Of course collocation finding is always a bit difficult and, if you can, you should check the results manually. As already mentioned in the technical approach section, this list is not sorted by number of occurrences, but by collocation strength (how often these words occur together versus how often they occur apart).
Skimming through the words, you might see that there are a lot of real collocations, but sometimes something seems to be missing. These are probably trigrams (three words following each other). Look for example at “self titty”: what is that supposed to mean? Very probably it belongs together with “titty play” in the trigram “self titty play”. With a similar method to the above code (only changing bigram to trigram; a sketch follows after the results), we can find out the strongest trigrams:
[(u'sexy', u'white', u'outfit'), (u'sexy', u'teaser', u'opening'), (u'sexy', u'red', u'outfit'),
(u'sexy', u'pink', u'outfit'), (u'sexy', u'lowcut', u'dress'), (u'sexy', u'blue', u'outfit'), (u'sexy', u'black', u'outfit'),
(u'sexy', u'black', u'dress'), (u'self', u'titty', u'play'), (u'screwing', u'side', u'ways'), (u'screwing', u'side', u'saddle'),
(u'screwing', u'reverse', u'cowgirl'), (u'screwing', u'rev', u'cowgirl'), (u'screwing', u'doggy', u'position'), (u'screwing', u'doggie', u'style'), (u'screwing', u'doggie', u'position'),
(u'screwing', u'cowgirl', u'riding'), (u'safe', u'screwing', u'reverse'), (u'safe', u'screwing', u'rev'), (u'safe', u'screwing', u'doggy'), (u'safe', u'screwing', u'doggie'),
(u'safe', u'screwing', u'cowgirl'), (u'rubbing', u'boxes', u'together'), (u'reverse', u'cowgirl', u'anal'),
(u'open', u'mouth', u'facials'), (u'open', u'mouth', u'facial'), (u'les', u'titty', u'sucking'), (u'lee', u'stone', u'ii'),
(u'large', u'back', u'tattoo'), (u'guy', u'gets', u'bj'), (u'glass', u'dildo', u'solo'), (u'fingers', u'anal', u'solo'),
(u'dual', u'titty', u'sucking'), (u'dual', u'open', u'mouth'), (u'dp', u'reverse', u'cowgirl'), (u'dp', u'rev', u'cowgirl'),
(u'dp', u'doggy', u'position'), (u'dp', u'doggie', u'position'), (u'circle', u'jerk', u'bjs'), (u'bj', u'rev', u'cowgirl'), (u'anal', u'solo', u'fingering'), (u'anal', u'side', u'ways'),
(u'anal', u'side', u'saddle'), (u'anal', u'reverse', u'cowgirl'),
(u'anal', u'rev', u'cowgirl'), (u'anal', u'doggy', u'position'), (u'anal', u'doggie', u'position'), (u'anal', u'dildo', u'solo'),
(u'anal', u'dildo', u'play'), (u'anal', u'cream', u'pie')]
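For reference, the trigram variant of the code might look like the following sketch. It uses TrigramCollocationFinder.from_words, which builds the necessary frequency distributions directly from the token list, and reuses tokens and stopwords_english from the bigram code above:

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.TrigramCollocationFinder.from_words(tokens)
finder.apply_word_filter(lambda w: w in stopwords_english or not w.isalpha())
# Print the 50 best-scoring trigrams
print sorted(finder.nbest(trigram_measures.raw_freq, 50), reverse=True)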
There are some interesting findings in these trigram collocations compared with the bigram collocations. At first, the trigram lee stone ii sucks pretty much, but we can exclude it manually. Another method would be to check for possible names automatically. More interestingly, we can see that most sexual positions that show up as a bigram can also be extended with anal and with safe. We could also extract sexual positions from these trigrams, because in the trigram collocations they begin with screwing (if the sexual position itself consists of two words).
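The automatic name check mentioned above could be sketched roughly like this. It assumes that the performers appearing in the descriptions are also present in the stars collection from the beginning of the article, which is not necessarily the case, and it may also drop ordinary words that double as names:

# Collect all tokens that occur in performer names
name_tokens = set()
for star in stars.find():
    for part in star['name'].split():
        name_tokens.add(part.lower())

# Remove collocations that contain a name token
finder.apply_word_filter(lambda w: w.lower() in name_tokens)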
What are the most common actions in pornography?
Let’s go on and find out which actions are most frequently part of porn movies. This is not hard either, but we can improve it with the collocations.
At first we just want to see which words are used most often. This will give us a quick overview of the data we have.
Technical approach
For this we just use a frequency distribution again and fill it with all the words we have:
porter = nltk.PorterStemmer()
stopwords_english = stopwords.words('english')

fdist = nltk.FreqDist()
for scene in scenes.find():  # 'scenes' is the MongoDB collection of scene descriptions
    keywords = [porter.stem(keyword.lower())
                for keyword in scene['description'].split()
                if not keyword.lower() in stopwords_english
                and keyword.isalpha()]
    fdist.update(keywords)
fdist.plot(100)
Results
We can also count each word only once per movie instead of once per occurrence.
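A sketch of how this per-movie counting might be done (assuming each scene document references its movie in a hypothetical field scene['movie']):

# Collect the keywords of all scenes belonging to the same movie into one set
keywords_per_movie = {}
for scene in scenes.find():
    keywords = set(porter.stem(w.lower()) for w in scene['description'].split()
                   if w.lower() not in stopwords_english and w.isalpha())
    movie_id = scene['movie']  # hypothetical reference to the movie
    keywords_per_movie.setdefault(movie_id, set()).update(keywords)

fdist_unique = nltk.FreqDist()
for keywords in keywords_per_movie.values():
    fdist_unique.update(keywords)  # each word is counted at most once per movie
fdist_unique.plot(100)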
Next thing we want to do is find out more specifically which positions are often shown.
Technical approach
This is some manual work, but in natural language processing you will often have to work things out manually to improve them. As we already saw before, we can use the collocations to get an overview of the positions, but we have to remove clothing terms like sexy red dress first. We also remove some collocations that do not provide more information than the single word (e.g. doggie position means the same as doggie, but dp doggie is something different). Then we combine the collocation positions with some keywords from the most frequent words and check how common each one is (if we find a collocation, we remove it from the text, so that it will not be counted again as a single word).
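A sketch of this counting step could look like this. The position list here is only a small, hypothetical excerpt of the list that was assembled manually from the collocations and the frequent keywords:

# Hypothetical excerpt of the manually assembled position list,
# multi-word collocations first so they are matched before single words
positions = ['reverse cowgirl', 'doggie position', 'dp doggie', 'cowgirl', 'doggie', 'anal']
position_counts = dict((p, 0) for p in positions)
for scene in scenes.find():
    text = scene['description'].lower()
    for position in positions:
        if position in text:
            position_counts[position] += 1
            # remove the matched collocation so it is not counted again as a single word
            text = text.replace(position, '')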
Results
Maybe you also think, like me, that this is not good enough yet. Many terms appear twice, because we did not use stemming on the composite terms or because the website uses abbreviations (rev and reverse). So let’s create display groups. All we have to do is create a Python dictionary with the old positions as keys and the names they shall be mapped to as values.
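Such a mapping and the regrouping could look like this (the entries are again hypothetical examples):

# Hypothetical display groups: map raw position names to the name they should be shown under
display_groups = {
    'rev cowgirl': 'reverse cowgirl',
    'doggy position': 'doggie',
    'doggie position': 'doggie',
}

grouped_counts = {}
for position, count in position_counts.items():
    name = display_groups.get(position, position)  # fall back to the original name
    grouped_counts[name] = grouped_counts.get(name, 0) + count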
The improved result might look like this:
Further Questions
Of course you can always extend such questions to get even more interesting results.
When I began this article, I intended to later analyze questions like “Comparing popular and not-so-popular porn actresses, is there any difference in what actions they perform?”.
However, it has been a long time since I worked on this article and I do not have my raw data anymore. Maybe you want to continue my work?
I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.