Marc Moens, senior partner at Pentech, will join the Board of Semetric. The new investment enables us to expand our data collection infrastructure and applications, as well as our very talented team. The additional resources will allow us to focus on extending our applications and services to the music industry as well as exploring other content sectors. The funding will fast-track our development roadmap and see us well funded for a two-year runway.
Today we are going to introduce to you another piece of technology we have developed at Musicmetric. As you may know, parts of our product are driven by semantic analysis; we don’t just tell you how many people are talking about your artists, but also their opinions, the sentiment and common topics surrounding them. How do we do this? Sentiment analysis is a challenging problem that still has not been solved completely. Many so-called sentiment analysis systems use a very naive method to detect sentiment in a context, i.e. using key words or very basic sentence decomposition. However, human language is not that simple, so these approaches fail to capture irony, sarcasm, slang and other idiomatic expressions.
In this post we’re going to show you an important feature that helps distinguish the quality of data supplied by musicmetric: the ability to work out whether mentions of an artist whose name is a common word actually refer to the artist, and likewise to distinguish between two artists that share the same name.
These methods are applicable to any text based data, but for this example we’ll take a look at Twitter.
Musicmetric collects all mentions of an artist on Twitter. Taking the rock band Oasis as an example, we collect tweets in the following 3 categories:
- name mentions: “Oasis”
- replies: “@Oasis”
- retweets: “RT @Oasis”
If the artist does not have a twitter ID, we still track their name mentions – and we are currently tracking over 500,000 artists.
It is obvious that all replies and retweets are definitely relevant to the band, but some name mentions are probably not. When people post a tweet which includes the word “Oasis”, they might mean the rock band Oasis, an isolated area of vegetation and water in a desert, or just the name of a random bar or restaurant. It would therefore be naive to collect tweets without filtering them, because the resulting trend data would not reflect the real popularity of the band Oasis on Twitter.
These name mentions are important since a lot of the time people will not cite the @username of the artist when referring to them on twitter (as can be seen in the examples below) and of course, not all bands even have a twitter ID.
At musicmetric, we have developed proprietary algorithms to deal with irrelevant tweets effectively. We analyse all tweets and successfully filter out irrelevant messages by assigning a probability that the tweet is relevant to that particular artist.
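To make the idea of "assigning a probability that the tweet is relevant" concrete, here is a minimal sketch using a naive Bayes word model with Laplace smoothing. It is not the proprietary musicmetric algorithm, which is not described in this post; the class name `RelevanceModel`, the tokeniser and the four training tweets are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(".,!?:;\"'()#").lower() for w in text.split())
    return [w for w in words if w]


class RelevanceModel:
    """Toy naive Bayes classifier (an assumption, not the production model)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        self.doc_counts[relevant] += 1
        self.word_counts[relevant].update(tokenize(text))

    def probability_relevant(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            # Smoothed class prior, then smoothed per-word likelihoods.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            n_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                log_p += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_scores[label] = log_p
        # Convert the two log scores into a probability for "relevant".
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights[True] / (weights[True] + weights[False])


# Hypothetical hand-labelled examples.
model = RelevanceModel()
model.train("Oasis gig tonight, Liam Gallagher was amazing", True)
model.train("listening to the new Oasis album, Wonderwall on repeat", True)
model.train("found an oasis in the desert with water and palm trees", False)
model.train("drinks at the Oasis bar and restaurant tonight", False)

p_band = model.probability_relevant("Oasis album and gig with Liam")
p_desert = model.probability_relevant("an oasis in the desert with palm trees")
```

A tweet would then be kept or dropped by comparing its probability against a threshold such as 0.5. The threshold is also an assumption here, and a real system would need far more training data than four tweets.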
The table below shows a good example of our algorithm’s efficiency:
Even though there are still a few irrelevant tweets (highlighted red) and some vague tweets for which we cannot tell whether they are relevant or not (highlighted blue), the accuracy has improved a lot in comparison to the raw data. Currently, for bands or artists who have very common names like Oasis, our model can filter out up to 70%-80% of irrelevant tweets. For bands or artists who have distinct names like Lady Gaga or Robbie Williams, the model can filter out up to 95%-100% of irrelevant tweets.
The chart below shows the number of tweets mentioning Oasis per hour before and after being filtered. You can see a big difference and that is why the filter is very important.
We are still collecting more data and adding more valuable information to our model, so we expect it to become more and more accurate. It learns as it goes, and it can read 96 million tweets per day, so it learns very quickly.
Why not check some live stats for your bands by registering for a musicmetric Essentials trial?
Not that relevant to music, but this graph is pretty cool. We ran a really basic text extraction on 11 million tweets logged by our servers during the past week, and plotted the proportion of messages each day that contain ’ ‘
It’s been corrected for varying popularity of twitter on different days.
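The correction for varying tweet volume amounts to plotting a proportion rather than a raw count. Here is a minimal sketch of that calculation; the function name, the sample tweets and the search token `:)` are placeholders chosen for illustration, not the data or the token used for the graph.

```python
from collections import defaultdict
from datetime import datetime


def daily_proportion(tweets, token):
    """Return {date: share of that day's tweets whose text contains token}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for timestamp, text in tweets:
        day = timestamp.date()
        totals[day] += 1
        if token in text:
            matches[day] += 1
    # Dividing by the daily total removes the effect of busier days.
    return {day: matches[day] / totals[day] for day in totals}


# Made-up sample: the second day has twice as many tweets overall.
sample_tweets = [
    (datetime(2009, 12, 4, 9, 0), "great gig :)"),
    (datetime(2009, 12, 4, 10, 0), "stuck in traffic"),
    (datetime(2009, 12, 5, 11, 0), "weekend :)"),
    (datetime(2009, 12, 5, 12, 0), "new album out :)"),
    (datetime(2009, 12, 5, 13, 0), "raining again"),
    (datetime(2009, 12, 5, 14, 0), "lunch"),
]

proportions = daily_proportion(sample_tweets, ":)")
```

In this sample the second day has twice as many matching tweets as the first, but also twice as many tweets in total, so both days end up with the same proportion.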
Saturday is a happy day, and it’s tomorrow – so cheer up!
I should mention, our sentiment analysis algorithms at musicmetric are rather more advanced than this.
In this post we’re going to give a quick-fire tour of some charts you can see in our app, demonstrating some of the main functions and how they can be used.
Let’s start off with the big picture. Online Buzz gives an indicator of how many people are talking about an artist on the web. We use clever machines that learn how to cut through the noise and only detect the artist in question.
The chart below shows how the Online Buzz for the band Muse has changed since 2006. It shows the number of comments per day about Muse, compared to the overall number of comments about bands.
If we zoom in to the last 6 months as is shown below, we can see the online buzz for Muse has been pretty constant, with a slight increase overall:
If you need a more granular view than Online Buzz, you can check what’s happening on some music social networks in the Social Networks section.
So, below are the MySpace Views and Plays per hour for Muse; the big spike in September shows when they released their single “Uprising”. The peak immediately after that one was the album release:
These charts show a 24 hour moving average for Plays and Views per hour.
That means we take the average number of plays or views for the last 24 hours and plot that on the graph.
This gives a better visualisation of the trend as the raw data can be confusing. Below (in red) we can see what the raw data looks like without the moving average overlaid:
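The 24 hour moving average described above can be sketched in a few lines. This is an illustration under assumptions: the function name is hypothetical, and averaging over fewer hours at the very start of the series (before 24 hours of data exist) is one possible choice, not necessarily what the app does.

```python
def moving_average(hourly_counts, window=24):
    """Average each hour with the hours before it, up to `window` hours."""
    averaged = []
    running_total = 0
    for i, count in enumerate(hourly_counts):
        running_total += count
        if i >= window:
            # Drop the hour that has just fallen out of the window.
            running_total -= hourly_counts[i - window]
        averaged.append(running_total / min(i + 1, window))
    return averaged


# Spiky made-up data, smoothed with a short window for readability.
smoothed = moving_average([0, 24, 0, 24], window=2)
```

With real hourly plays or views you would leave `window` at 24, so each plotted point is the mean of the previous day of data.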
Remember, musicmetric isn’t just limited to superstar bands like Muse. Let’s take a look at some stats for Master Shortie – an up and coming London rapper.
Here is a view of where people follow Master Shortie online:
Looking at some data about those fans, we can see Master Shortie is pretty popular with the ladies:
And their age profile fits a distribution around the 18 year old mark:
Now let’s drill down a bit to see where their MySpace fans live.
The chart below shows that fans of Master Shortie on MySpace are located mainly in the USA and UK:
The overall user demographic of MySpace is pretty biased towards these two countries, so let’s check out the top cities for fans of Master Shortie on Twitter:
Nine of the top 10 cities for locations of fans of Master Shortie on Twitter are in the UK, with only New York showing up for the USA.
Now let’s look at where Master Shortie’s Twitter fans live on a map of the world:
Each one of those circles represents one or more downloads. When you hover over a circle in the musicmetric application with your mouse, you can see an instant pop-up of where and how many downloads the circle represents. It even tells you the exact time a download was made.
The darker and more solid the colour, the more downloads are being overlaid onto the same area, giving a really good indication of popularity by region.
Here is the same map for the location of Master Shortie’s fans, this time on MySpace:
Now let’s look at the most influential people relevant to Master Shortie on Twitter.
This will tell you the most relevant people on Twitter to target with marketing material, because they actually care about the artist in question, and are very influential in those circles.
We don’t just calculate this based on the number of followers each person gets, but the number of followers their followers get, and so on.
If that doesn’t make sense, imagine it works a bit like the Google PageRank algorithm, because it does. Someone with a million spam bots following them will have a lower rank than another person who’s only being followed by a few very influential people (like a music magazine or a record label).
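For readers who want to see the PageRank idea in code, here is a minimal sketch of a PageRank-style power iteration over a follower graph. It illustrates the general technique only; the function name, the damping factor of 0.85, the iteration count and the tiny example graph are assumptions, not a description of our production ranking.

```python
def influence_rank(follows, damping=0.85, iterations=50):
    """follows maps each user to the list of users they follow.

    Rank flows from a follower to the people they follow, so being
    followed by highly ranked users is worth more than raw follower count.
    """
    users = set(follows)
    for followed in follows.values():
        users.update(followed)
    n = len(users)
    rank = {user: 1.0 / n for user in users}
    for _ in range(iterations):
        new_rank = {user: (1.0 - damping) / n for user in users}
        for user in users:
            followed = follows.get(user, [])
            if followed:
                share = damping * rank[user] / len(followed)
                for target in followed:
                    new_rank[target] += share
            else:
                # Users who follow nobody spread their rank evenly.
                share = damping * rank[user] / n
                for target in users:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Made-up graph: ten bots follow one account, while a magazine and a
# label (each with ten fans of their own) both follow "tastemaker".
graph = {}
for i in range(10):
    graph[f"bot{i}"] = ["bot_magnet"]
    graph[f"magazine_fan{i}"] = ["magazine"]
    graph[f"label_fan{i}"] = ["label"]
graph["magazine"] = ["tastemaker"]
graph["label"] = ["tastemaker"]

ranks = influence_rank(graph)
```

In this toy graph "tastemaker" has only two followers against ten for "bot_magnet", yet ends up with the higher rank because those two followers are themselves well followed.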
Let’s move on to BitTorrent data now, and take a look at some charts for Robbie Williams.
The chart below shows the number of peers per hour connected to the torrents for the single Bodies and the new album Reality Killed the Video Star. Just so you know, our BitTorrent data is anonymous and aggregated to the city level. Tracking individuals isn’t our game.
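As a rough illustration of what "aggregated to the city level" means in practice, the sketch below counts peers per hour and city and throws away anything that identifies an individual peer. The record layout (`hour`, `city`, `peer`) and both function names are hypothetical; the real pipeline is not described in this post.

```python
from collections import Counter


def aggregate_peers(records):
    """Count peers per (hour, city); the per-peer field is never stored."""
    counts = Counter()
    for record in records:
        counts[(record["hour"], record["city"])] += 1
    return counts


def top_cities(counts, limit=10):
    """Sum the hourly counts per city and return the busiest cities."""
    per_city = Counter()
    for (_hour, city), n in counts.items():
        per_city[city] += n
    return per_city.most_common(limit)


# Made-up observations.
records = [
    {"hour": "2009-11-30T19", "city": "London", "peer": "a"},
    {"hour": "2009-11-30T19", "city": "London", "peer": "b"},
    {"hour": "2009-11-30T19", "city": "Berlin", "peer": "c"},
    {"hour": "2009-11-30T20", "city": "London", "peer": "a"},
]

hourly = aggregate_peers(records)
ranking = top_cities(hourly)
```

Only the aggregated counts leave the function, which is the property that matters for anonymity in this sketch.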
And here is the map of locations of people downloading the torrents at 7:00pm yesterday (30th November 2009):
Now prepare yourself for the all time cumulative map for BitTorrent downloads of Robbie Williams – Reality Killed the Video Star:
Clearly Robbie is very popular worldwide, so let’s get a closer look below at the largest solid coloured area in the UK and Europe:
To clearly see the top cities, a table is more suitable. Below are the top cities for Robbie Williams – Bodies on BitTorrent:
So there you have it!
These were just some of the top functions currently launched in our beta version of musicmetric.
Get ready for our full launch over the next few weeks as we’ll be unveiling a rocking host of extra functions, including twitter activity, results from wider ranging web crawls, sentiment analysis for tracks and artists, more social networks, authority ranking for all sources of data, and individual song tracking.
Plus, we’ll be revealing our advanced analytics functions which allow the whole collection of data to be probed in more detail, picking out patterns, similarities, trends and more.
Our development cycle has been insane and it’s really ramping up now! We’ve hired more full time developers, upgraded our data centre, bought dozens more servers, hundreds of TB of storage… We’re just about ready to explode with data, and we love it.
Keep checking back because the updates will keep coming, and if you just can’t wait then register now to begin tracking everything in real time with a free demo of musicmetric essentials.
We saw The Temper Trap play a gig in London the other day. Check out their online social network buzz over the past few years:
How far will they go?
Why don’t you check out a free trial of the musicmetric Essentials desktop application and see this chart for your own artists?
Check out this snapshot of BitTorrent activity for ‘Susan Boyle – I Dreamed a Dream’ during her album release week.
The top country is the UK, and the top city is London.
This is a screen shot of the analytics available in the musicmetric application.
We welcome Trung Huynh, the latest addition to the musicmetric development team. He’s an Oxford graduate with some cool skills in computing, AI and machine learning. Trung will be working on the analytics side of musicmetric – helping to squeeze more valuable and accurate analysis from our huge and ever increasing warehouse of data.
Check out his blog: http://www.trunghlt.com/techblog/
TechCrunch EU & TechCrunch USA have written a great article supporting our beta musicmetric Essentials launch today. You can read it here: http://bit.ly/234mF0
We’re giving away 250 voucher codes for TechCrunch readers to register for a 1 month free trial of Essentials, with some sneak previews of Professional added in, so get registering!
An update from the development team…
Our aim at musicmetric is quite simple: We will collect and analyse all the data on the web (and some that isn’t) related to trends in music and present it to our users in an easily accessible and actionable format. Over the next few months we will have downloaded and analysed a large proportion of all relevant published articles, and will continue to do so as they are written to keep right up to date with opinions, trends and buzz.
Our aims are simple, but the challenges we’ve faced over the last year and a half approaching our launch have been far from trivial, and hopefully this post will give some insight into the technical side of what we’re doing.
Gathering the data, although the easy part, needs an extensive hardware infrastructure to download, extract and archive text from millions of pages a month. Accurately analysing, scaling and detecting patterns in the data locked up in these terabytes of text is the real challenge and the most interesting part of working on musicmetric. It would be naive to simply present raw data as trends in the global music landscape (although we do supply raw data). The trend tracking methods we have developed would be useless if not scaled by accurate influence ranking for the sources of these trends, and simply calculating these scores is a huge task in itself.
Likewise, following activity on just one or two social media websites and presenting this as trends would give a massively biased view of where an artist is actually popular. For example, the social media website Orkut is hugely popular in Brazil, so all data originating from this website would be biased towards that country. Likewise with Twitter, trends would lean towards the UK / USA and not necessarily reflect a global view. We are rolling out tracking for multiple social networks over the next month.
Another challenge lies in the methods we have developed for text mining and sentiment analysis (and not just the fact that we need to analyse over a million documents per day). An example would be the band Pavement. How does a machine know if a piece of text is referring to the band, or to a pavement alongside a road? What about two artists with the same name? There are three artists that go by the name Nirvana, and seven called Justice. Which one does our customer care about? Perhaps all of them? Disambiguation is key for these applications to work correctly. The methods we use for sentiment analysis also have to cope with changing vocabulary, or even different languages, so adaptive methods are key. For this reason we employ a machine learning approach to this problem, which again has taken a long time to develop.
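To give a feel for the same-name problem, here is a deliberately simple sketch that picks between candidate artists by counting how many context words from each artist's profile appear in the text. This keyword overlap is a stand-in for illustration, not the machine learning approach mentioned above; the function name, the candidate labels and the profile terms are all assumptions.

```python
def disambiguate(text, candidates, min_overlap=1):
    """Return the candidate whose context terms best match the text.

    candidates maps a label to a set of lowercase context terms.
    Returns None when no candidate reaches min_overlap matches.
    """
    words = {w.strip(".,!?:;\"'()").lower() for w in text.split()}
    best_name, best_score = None, 0
    for name, context_terms in candidates.items():
        score = len(words & context_terms)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None


# Illustrative profiles for two artists sharing one name.
nirvana_candidates = {
    "nirvana_grunge_band": {"cobain", "grunge", "nevermind"},
    "nirvana_1960s_band": {"psychedelic", "1960s"},
}

match = disambiguate(
    "Nirvana's Nevermind is still the best grunge record", nirvana_candidates
)
no_match = disambiguate("reached nirvana after yoga class", nirvana_candidates)
```

Returning `None` when nothing matches covers the "pavement alongside a road" case, where the text is not about any of the artists. A learned model would replace the hand-written term sets with weights estimated from labelled text.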
Because we know our customers are using this data to make important decisions in how they run their business or manage their artists, we are making absolutely sure that the data is reliable, trustworthy and complete. Traceability of data sources is paramount to reliability. Our infrastructure allows a full audit of any piece of data at any time, from how it was scaled or normalised, right back to which one of our servers originally collected the raw version. This is important for a variety of reasons, particularly the ability to show exactly why trends are occurring, and it improves trust in our analytics. It is one thing to display a line chart or an index showing success for an artist; it is quite another to present a full breakdown of each source of data and how it was included in the analysis, giving a clear perspective on how that line chart or index was calculated.
musicmetric is a well funded team of 6 full-time staff (and growing) with extensive backgrounds and deep knowledge in the field. We use cutting edge technology and work closely with our partners to solve difficult problems, and have spent the last year and a half working these out. We are extremely excited to be coming towards the end of our development / alpha stage and into our official beta, then preparing for our full launch in November.