Today we are going to introduce another piece of technology we have developed at Musicmetric. As you may know, parts of our product are driven by semantic analysis: we don’t just tell you how many people are talking about your artists, but also their opinions, the sentiment and the common topics surrounding them. How do we do this? Sentiment analysis is a challenging problem that has still not been completely solved. Many so-called sentiment analysis systems use a very naive method to detect sentiment in a context, for example keyword matching or very basic sentence decomposition. However, human language is not that simple, so these approaches fail to capture irony, sarcasm, slang and other idiomatic expressions.
In this blog post we’re going to look at an example of the data mining and large-scale analysis we do at musicmetric: detecting patterns and similarities in time series data.
One use of this analysis is that, given an artist, we can find another artist with the closest trend in some variable over time – for example MySpace plays per hour. Alternatively, we could generate a list of artists who are increasing in popularity in a certain way, or show which artists have had a brief surge in activity – maybe caused by an album release or gig.
Because we store all the data indefinitely and in such a way that we can access it very rapidly, we can run regular batch analysis on the contents of our data warehouse to unlock interesting information.
In this example, we will compare the play count time series data for the top 20,000 artists by total plays on MySpace. It is important to consider that some trends may follow each other with a time lag, so we compare the 20,000 time series at multiple time lags from 0 to 30 days in the past, in 1-day increments. This means our analysis servers must do approximately 6 billion time series comparisons for this particular problem, each one comparing hourly resolution data over a period of 4 months.
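To make the scale and the lagged comparison concrete, here is a minimal sketch in Python. It rests on assumptions the post does not state: similarity is measured with Pearson correlation, each unordered pair of artists is compared once per lag, and the function names (`comparison_count`, `pearson`, `best_lag`) are hypothetical. It illustrates the idea and is not the production analysis code.

```python
from math import comb, sqrt


def comparison_count(n_series, max_lag_days):
    """Unordered pairs of series, each compared at lags 0..max_lag_days."""
    return comb(n_series, 2) * (max_lag_days + 1)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(a, b, max_lag_days=30, samples_per_day=24):
    """Find the lag (in days) at which series `a` best follows series `b`.

    At each lag, a[t] is compared with b[t - lag], i.e. `b` taken from
    further in the past. Returns (lag_days, correlation).
    """
    n = min(len(a), len(b))
    best_days, best_r = 0, float("-inf")
    for lag_days in range(max_lag_days + 1):
        shift = lag_days * samples_per_day
        if n - shift < 2:
            break  # not enough overlapping samples left to compare
        r = pearson(a[shift:n], b[:n - shift])
        if r > best_r:
            best_days, best_r = lag_days, r
    return best_days, best_r
```

Under these assumptions, `comparison_count(20000, 30)` gives 6,199,690,000, in line with the roughly 6 billion comparisons quoted above. Four months of hourly data is about 2,900 samples per series, so each of those comparisons is itself a non-trivial calculation.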
Let’s take a look at which artist has a similar trend to Kings of Leon:

Kings of Leon and The Fray - MySpace Plays Per Hour
We can see the plays per hour for The Fray seem to be following a similar long term trend to that of Kings of Leon, but offset by the difference in their popularity on MySpace – although they are converging as time goes on. The peaks and troughs also line up, so clearly the fine resolution hourly variation in the data has something to do with the overall use of MySpace at any period in time, not just the popularity of the artist. This is something that can be seen over most MySpace data.
Now let’s look at two artists who have even more similar plays per hour to each other:

Dido and The Clash - MySpace Plays Per Hour
The Clash and Dido show very high similarity for plays per hour on MySpace over the time frame shown in the chart above. A lot of this will have to do with the overall use of MySpace at any period of time, and the fact that the two artists have not had a lot of activity during that period to make their play counts diverge from each other.
Finally, we’ll search for artists that show similar short-term peaks to one another. In this case Muse was flagged as a high match for 50 Cent in September 2009, as is clear in the chart below:

Muse and 50 Cent - MySpace Plays Per Hour
If we look at their discographies, we discover that both Muse and 50 Cent made a release on the same day in September.
We’ll investigate the different reasons why two artists might have similar trends to each other in another blog post, so check back soon!
In this blog post we’re going to show you an important feature that helps distinguish the quality of data supplied by musicmetric: the ability to tell whether mentions of an artist whose name is a common word are in fact referring to the artist, and likewise to distinguish between two artists that have the same name.
These methods are applicable to any text based data, but for this example we’ll take a look at Twitter.
Musicmetric collects all mentions of an artist on Twitter. Taking the rock band Oasis as an example, we collect tweets in the following 3 categories:
- name mentions: “Oasis”
- replies: “@Oasis”
- retweets: “RT @Oasis”
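The three categories above can be sketched as a simple pattern check. This is an illustrative Python example only: the function name `mention_category` and the exact matching rules (case-insensitive, whole-word matches, any `@handle` mention counted as a reply) are assumptions rather than a description of how musicmetric's collector works.

```python
import re
from typing import Optional


def mention_category(tweet: str, name: str, handle: Optional[str] = None) -> Optional[str]:
    """Classify a tweet as 'retweet', 'reply', 'name mention', or None.

    `handle` is the artist's Twitter ID without the leading '@', or None
    if the artist has no account.
    """
    if handle:
        escaped = re.escape(handle)
        # Check retweets first, because "RT @Oasis" also contains "@Oasis".
        if re.search(r"\bRT\s+@" + escaped + r"\b", tweet, re.IGNORECASE):
            return "retweet"
        if re.search(r"@" + escaped + r"\b", tweet, re.IGNORECASE):
            return "reply"
    # A plain name mention: the name as a whole word, not preceded by '@'.
    if re.search(r"(?<![@\w])" + re.escape(name) + r"(?!\w)", tweet, re.IGNORECASE):
        return "name mention"
    return None
```

For example, `mention_category("Listening to Oasis all day", "Oasis", "Oasis")` falls through to the name mention branch, which is exactly the category that needs the relevance filtering discussed in the rest of this post.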
If the artist does not have a Twitter ID, we still track their name mentions – and we are currently tracking over 500,000 artists.
Replies and retweets are clearly relevant to the band, but some name mentions are probably not. When people post a tweet which includes the word “Oasis”, they might mean the rock band Oasis, an isolated area of vegetation and water in a desert, or just the name of a bar or restaurant. It would therefore be naive to collect tweets without filtering them, because the resulting trend data would not reflect the real popularity of the band Oasis on Twitter.
These name mentions are important, since a lot of the time people will not cite the @username of the artist when referring to them on Twitter (as can be seen in the examples below) and, of course, not all bands even have a Twitter ID.
At musicmetric, we have developed proprietary algorithms to deal with irrelevant tweets effectively. We analyse all tweets and successfully filter out irrelevant messages by assigning a probability that the tweet is relevant to that particular artist.
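The algorithms themselves are proprietary and are not described here, but the general idea of assigning each tweet a probability of relevance can be illustrated with a textbook technique. The Python sketch below uses a two-class naive Bayes classifier with add-one smoothing; the class name `RelevanceModel`, the tokenizer and the idea of training on hand-labelled tweets are all assumptions made for illustration, not musicmetric's actual method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class RelevanceModel:
    """Two-class naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        """Add one labelled tweet to the model."""
        self.doc_counts[relevant] += 1
        self.word_counts[relevant].update(tokenize(text))

    def probability_relevant(self, text):
        """Return P(relevant | text). Needs at least one example of each class."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in (True, False):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into a probability for the "relevant" class.
        return 1.0 / (1.0 + math.exp(log_score[False] - log_score[True]))
```

A filter built on such a model would keep a name mention only if its probability exceeds some threshold. The threshold trades off how many irrelevant tweets slip through against how many relevant ones are discarded, and would need tuning per artist.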
The table below shows a good example of our algorithm’s effectiveness:

Even though there are still a few irrelevant tweets (highlighted red) and some vague tweets whose relevance we cannot determine (highlighted blue), the accuracy has improved a lot in comparison to the raw data. Currently, for bands or artists who have very common names like Oasis, our model can filter up to 70%-80% of irrelevant tweets. For bands or artists who have distinct names like Lady Gaga or Robbie Williams, the model can filter up to 95%-100% of irrelevant tweets.
The chart below shows the number of tweets mentioning Oasis per hour before and after filtering. The difference is large, which is why the filter is so important.

We are still collecting more data and adding more valuable information to our model, so we expect its accuracy to keep improving – it learns as it goes, and it can read 96 million tweets per day, so it learns very quickly.
Why not check some live stats for your bands by registering for a musicmetric Essentials trial?
Trung