Archive for the ‘Labs’ Category


Musicmetric’s Sentiment Analysis v1.0 Beta

JAN. 31
2010

Today we are introducing another piece of technology we have developed at Musicmetric. As you may know, parts of our product are driven by semantic analysis; we don’t just tell you how many people are talking about your artists, but also what their opinions are, and the sentiment and common topics surrounding them. How do we do this? Sentiment analysis is a challenging problem that has still not been completely solved. Many so-called sentiment analysis systems use very naive methods to detect sentiment, e.g. keyword matching or very basic sentence decomposition. However, human language is not that simple, so these approaches fail to capture irony, sarcasm, slang and other idiomatic expressions.

Our methods are much more advanced than simple word detection. We have implemented a set of machine learning models that can be trained on different corpora (contexts), so they work well for general language but are also much more accurate for the predefined contexts. For example, professionally written articles, fan comments and tweets are all different contexts, and each has its own sentiment analysis model. This approach allows our models to keep improving as we download more data and retrain them frequently. The accuracy of our method is shown in the confusion matrices below:

Musicmetric polarity confusion matrix


So what does this matrix mean? A confusion matrix shows what percentage of the time the system confuses two classes (i.e. mislabels one as another). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class (the cells highlighted in yellow are correct predictions). For example, we can see that 16% of neutral reviews are predicted as negative, but only 2% of negative reviews are predicted as neutral or positive. Notably, 96% of negative reviews are predicted correctly.
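
For readers who want to build this kind of matrix themselves, here is a minimal sketch in Python. It assumes you already have a list of actual labels and a list of predicted labels; the function name `confusion_matrix_percent` and the toy labels are made up for illustration and are not Musicmetric's code or evaluation data.

```python
from collections import Counter


def confusion_matrix_percent(actual, predicted, labels):
    """Row-normalised confusion matrix.

    Rows are actual classes, columns are predicted classes, and each
    cell is the percentage of that row's instances given that prediction.
    """
    counts = Counter(zip(actual, predicted))
    matrix = {}
    for a in labels:
        row_total = sum(counts[(a, p)] for p in labels)
        matrix[a] = {
            p: (100.0 * counts[(a, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix


# Toy labels invented for illustration only.
labels = ["negative", "neutral", "positive"]
actual = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 2
predicted = [
    "negative", "negative", "negative", "neutral",   # actual: negative
    "neutral", "neutral", "negative", "positive",    # actual: neutral
    "positive", "positive",                          # actual: positive
]

matrix = confusion_matrix_percent(actual, predicted, labels)
for a in labels:
    print(a, [round(matrix[a][p]) for p in labels])
```

Each row sums to 100%, so the diagonal cell of a row is the share of that class predicted correctly.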


The second confusion matrix breaks the results down by score from 1 to 5 (3 is neutral):

Confusion matrix for score from 1 to 5

Similarly, we can see that 90% of reviews with a score of 2 are predicted correctly, while 5% of them are predicted as 3 and none are predicted as 5. The table below shows the number of reviews we evaluated:

Number of evaluated samples


The third confusion matrix breaks the results down by score from 1 to 10 (note: this test data is different from the data used above and does not include any reviews with scores below 5):

Musicmetric sentiment confusion matrix for score from 1-10


We think that is enough of our talking. It is now your turn to see how it works by playing around with the interface to our general purpose music sentiment analysis engine below:

Note: Our sentiment analysis usually works better for longer reviews or paragraphs than for single short sentences, and definitely works better for music-related topics. Try pasting in an album review.



Analysing trends over time with musicmetric

DEC. 13
2009

In this blog post we’re going to look at an example of the data mining and large-scale analysis we do at musicmetric: detecting patterns and similarities in time series data.

One use of this analysis is that, given an artist, we can find another artist with the closest trend in some variable over time, for example MySpace plays per hour. Alternatively, we could generate a list of artists who are increasing in popularity in a certain way, or show which artists have had a brief surge in activity, maybe caused by an album release or gig.

Because we store all the data indefinitely and in such a way that we can access it very rapidly, we can run regular batch analysis on the contents of our data warehouse to unlock interesting information.

In this example, we will compare the play count time series data for the top 20,000 artists by total plays on MySpace. It is important to consider that some trends may follow each other with a time lag, so we compare the 20,000 time series at multiple time lags from 0 to 30 days in the past, in 1-day increments. This means our analysis servers must do approximately 6 billion time series comparisons for this particular problem, each one comparing hourly-resolution data over a period of 4 months.
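
To give a feel for what one such comparison might involve, here is a small Python sketch. The post does not say which similarity measure we use, so the Pearson correlation below is an assumption chosen for illustration, as are the names `pearson` and `best_lag` and the synthetic series. The sketch also shows one way to arrive at a figure of roughly 6 billion, assuming every unordered pair of artists is compared once at each of the 31 lags.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(series_a, series_b, max_lag_days=30, samples_per_day=24):
    """Find the lag (in days) at which series_b best matches series_a.

    At lag d, series_b[t] is compared with series_a[t - d * samples_per_day],
    i.e. with series_a as it looked d days in the past.
    """
    best_lag_days, best_r = 0, float("-inf")
    for lag_days in range(max_lag_days + 1):
        shift = lag_days * samples_per_day
        b = series_b[shift:]
        a = series_a[: len(b)]
        n = min(len(a), len(b))
        if n < 2:
            break  # not enough overlap left to compare
        r = pearson(a[:n], b[:n])
        if r > best_r:
            best_lag_days, best_r = lag_days, r
    return best_lag_days, best_r


# Synthetic hourly data (20 days): artist B repeats artist A's bump two days later.
series_a = [100 + (50 if 100 <= i < 110 else 0) for i in range(480)]
series_b = [100] * 48 + series_a[:-48]

lag_days, r = best_lag(series_a, series_b, max_lag_days=5)

# Rough size of the problem under the assumptions stated above.
n_artists = 20_000
n_lags = 31  # 0 to 30 days inclusive
n_comparisons = n_artists * (n_artists - 1) // 2 * n_lags
print(lag_days, n_comparisons)
```

Under these assumptions the count comes to about 6.2 billion, in line with the figure quoted above.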

Let’s take a look at which artist has a similar trend to Kings of Leon:

Kings of Leon and The Fray - MySpace Plays Per Hour

We can see the plays per hour for The Fray seem to be following a similar long term trend to that of Kings of Leon, but offset by the difference in their popularity on MySpace – although they are converging as time goes on. The peaks and troughs also line up, so clearly the fine resolution hourly variation in the data has something to do with the overall use of MySpace at any period in time, not just the popularity of the artist. This is something that can be seen over most MySpace data.

Now let’s look at two artists who have even more similar plays per hour to each other:

Dido and The Clash - MySpace Plays Per Hour

The Clash and Dido show very high similarity for plays per hour on MySpace over the time frame shown in the chart above. A lot of this will have to do with the overall use of MySpace at any period of time, and the fact that the two artists have not had a lot of activity during that period to make their play counts diverge from each other.

Finally, we’ll search for artists that show similar short-term peaks to one another. In this case Muse was flagged as a high match for 50 Cent in September 2009, as is clear in the chart below:

Muse and 50 Cent - MySpace Plays Per Hour

If we look at their discographies, we discover that both Muse and 50 Cent made a release on the same day in September.

We’ll investigate the different reasons why two artists might have similar trends to each other in another blog post, so check back soon!

Twitter Filtering

DEC. 4
2009

In this blog post we’re going to show you an important feature that helps distinguish the quality of data supplied by musicmetric: the ability to tell whether mentions of an artist whose name is a common word are in fact referring to the artist, and likewise to distinguish between two artists that have the same name.

These methods are applicable to any text based data, but for this example we’ll take a look at Twitter.

Musicmetric collects all mentions of an artist on Twitter. Taking the rock band Oasis as an example, we collect tweets in the following 3 categories:

  • name mentions: “Oasis”
  • replies: “@Oasis”
  • retweets: “RT @Oasis”

If the artist does not have a Twitter ID, we still track their name mentions, and we are currently tracking over 500,000 artists.

All replies and retweets are clearly relevant to the band, but some name mentions are probably not. When people post a tweet which includes the word “Oasis”, they might mean the rock band Oasis, an isolated area of vegetation and water in a desert, or just the name of a random bar or restaurant. It would therefore be naive to collect tweets without filtering them, because the resulting trend data would not reflect the real popularity of the band Oasis on Twitter.

These name mentions are important, since a lot of the time people will not cite the @username of the artist when referring to them on Twitter (as can be seen in the examples below) and, of course, not all bands even have a Twitter ID.

At musicmetric, we have developed proprietary algorithms to deal with irrelevant tweets effectively. We analyse all tweets and successfully filter out irrelevant messages by assigning a probability that the tweet is relevant to that particular artist.
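
Our own algorithms are proprietary, so the Python sketch below is not what we run. It is only a toy illustration of the general idea of assigning each tweet a probability of being relevant, using a simple naive Bayes classifier as a stand-in technique. The class name `RelevanceModel`, the training tweets and the test phrases are all invented for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class RelevanceModel:
    """Toy naive Bayes model; an illustrative stand-in, not Musicmetric's algorithm."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        self.doc_counts[relevant] += 1
        self.word_counts[relevant].update(tokenize(text))

    def probability_relevant(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            # Smoothed class prior.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokenize(text):
                # Laplace-smoothed word likelihood (+1 leaves room for unseen words).
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into a probability without overflow.
        top = max(log_scores.values())
        exp_true = math.exp(log_scores[True] - top)
        exp_false = math.exp(log_scores[False] - top)
        return exp_true / (exp_true + exp_false)


# Hand-labelled toy tweets, invented for illustration.
model = RelevanceModel()
model.train("oasis gig was amazing liam sang wonderwall", True)
model.train("listening to the new oasis album tonight", True)
model.train("this spa is an oasis of calm in the city", False)
model.train("found an oasis in the desert with palm trees", False)

p_band = model.probability_relevant("oasis album wonderwall")
p_desert = model.probability_relevant("desert oasis palm trees")
print(round(p_band, 2), round(p_desert, 2))
```

A filter built on a model like this would keep only the tweets whose probability is above some chosen threshold; where to set that threshold is a trade-off between missing relevant tweets and letting irrelevant ones through.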

The table below shows a good example of our algorithm’s effectiveness:

Filtering tweets about the band "Oasis"

Even though there are still a few irrelevant tweets (highlighted in red) and some vague tweets whose relevance we cannot determine (highlighted in blue), the accuracy is much improved in comparison to the raw data. Currently, for bands or artists who have very common names like Oasis, our model can filter out up to 70%-80% of irrelevant tweets. For bands or artists who have distinctive names like Lady Gaga or Robbie Williams, the model can filter out up to 95%-100% of irrelevant tweets.

The chart below shows the number of tweets mentioning Oasis per hour before and after filtering. You can see a big difference, which is why the filter is so important.

Filtered and unfiltered tweets mentioning "Oasis"

We are still collecting more data and adding more valuable information to our model, so we expect it to become more and more accurate. It learns as it goes, and it can read 96 million tweets per day, so it learns very quickly.

Why not check some live stats for your bands by registering for a musicmetric Essentials trial?

Trung