In order to recommend personalized content to users, we must first distinguish them from one another. While there are a number of established approaches for uniquely identifying website users, it remains a difficult problem.
In this article, I will introduce a handful of identification approaches and the challenges associated with each, then give an overview of a hybrid approach that attempts to solve many of these issues at once. One challenge that will not be addressed further is two or more people sharing the same device or user account. Throughout, this article assumes that the same person is always the one looking at the screen.
User Authentication
Probably the simplest and most effective approach is to have the user sign in to an account. Whether it’s a Facebook, Google, LinkedIn, or site-specific account doesn’t matter. Assuming that they’re the only one using that account, you can uniquely identify them wherever they go and on any device, provided they remain logged in; but therein lies the problem. On many sites, like the ones where we provide our recommendations, only a very small fraction of users choose to sign in. This leaves publishers having to rely on other methods.
Cookies
A cookie is a small piece of data that is sent by a website and stored on the user’s device. It can contain a unique string of text that allows the website to identify that browser (cookies are browser-specific) and, by extension, the device it runs on. Users have the ability to delete their cookies or refuse to store them altogether. If they decide not to store your cookie, you lose your ability to identify their activity on your site. You also lose track of them when they switch between their home PC, work PC, iPhone, iPad, etc., because each browser holds its own cookies.
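As a rough illustration of the cookie mechanism described above, the following Python sketch reads an identifier from an incoming Cookie header and issues a new one when none is present. The cookie name `visitor_id`, the one-year lifetime, and the function name are assumptions made for this example, not details of any particular site.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"          # hypothetical cookie name
ONE_YEAR_SECONDS = 60 * 60 * 24 * 365  # assumed lifetime


def read_or_issue_visitor_id(cookie_header):
    """Return (visitor_id, set_cookie_value).

    set_cookie_value is None when the browser already sent an ID,
    otherwise it is the value to place in a Set-Cookie response header.
    """
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    if COOKIE_NAME in jar:
        # Returning browser: reuse the ID it already holds.
        return jar[COOKIE_NAME].value, None

    # New (or cookie-less) browser: mint a random ID and ask it to store it.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = ONE_YEAR_SECONDS
    return visitor_id, out[COOKIE_NAME].OutputString()
```

A browser that deletes or refuses the cookie simply arrives with no ID on its next request and is handed a fresh one, which is exactly the loss of continuity described above.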
IP Address
All devices on a network have some type of IP address, but a website only has access to public IPs. A typical home or office setup involves many devices that are assigned only a private IP (hidden from the internet), and all traffic from those devices is forwarded through a router. A website sees traffic from all of those devices as coming from a single public IP address, which is that of the router. This presents a problem in that you can’t distinguish devices/users that all browse from the same location. Of course, if a user moves location (e.g. the same iPhone, but using the WiFi at home and work, and their 3G network in between), their traffic also comes from different IP addresses as they move, complicating the problem even further.
IP Address + User Agent String
A user agent string is a bit of text that describes the software that is being used to browse a website. Included in a user agent string are details like device type, operating system/version, browser/version and a few others. While not unique enough to be used alone, when paired with an IP address it can sometimes uniquely identify devices, for example if there is only one iPhone running Safari version 600.1.4 at a given location. Obviously, once two devices behind the same public IP share a user agent string, this approach can no longer tell them apart.
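One way to picture the pairing is as a single lookup key built from both values. The sketch below is only an illustration: the function name `device_key` and the choice to hash the pair with SHA-256 are assumptions, not something the approach requires.

```python
import hashlib


def device_key(ip_address, user_agent):
    """Combine a public IP and a user agent string into one lookup key.

    Hashing is an assumption of this sketch; it just gives a fixed-length
    key. Any stable combination of the two values would serve.
    """
    raw = f"{ip_address}|{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


# Two different browsers behind the same router produce different keys...
key_phone = device_key("203.0.113.7", "ExamplePhoneBrowser/1.0")
key_laptop = device_key("203.0.113.7", "ExampleLaptopBrowser/2.0")

# ...but two identical phones behind that router are indistinguishable.
key_other_phone = device_key("203.0.113.7", "ExamplePhoneBrowser/1.0")
```

The IP address and user agent strings here are placeholders; the last line shows the overlap case in which the key stops being useful.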
Hybrid Approach
The hybrid approach given here moves through stages, stopping at the first match. First, if the user is signed in to the site and we have a record of their account, we use the existing user ID. Otherwise, if we have a record of their cookie, we use the user ID associated with it. Finally, if both of those stages fall through, we look for a record of their IP address and user agent string combination. If no stage finds a match, we create a new user ID.
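The staged lookup can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not a description of any production system: the class name `IdentityResolver`, the dictionary-based storage, and the step that records every available signal against the resolved ID (so later visits can match on it) are all choices made for this example.

```python
import itertools


class IdentityResolver:
    """Resolve a visit to a user ID: account, then cookie, then IP + user agent."""

    def __init__(self):
        self._by_account = {}   # account ID -> user ID
        self._by_cookie = {}    # cookie ID -> user ID
        self._by_ip_ua = {}     # (IP, user agent) -> user ID
        self._next_id = itertools.count(1)

    def resolve(self, account_id=None, cookie_id=None, ip=None, user_agent=None):
        has_ip_ua = ip is not None and user_agent is not None
        ip_ua = (ip, user_agent)

        # Stage 1: signed-in account. Stage 2: cookie. Stage 3: IP + user agent.
        if account_id is not None and account_id in self._by_account:
            user_id = self._by_account[account_id]
        elif cookie_id is not None and cookie_id in self._by_cookie:
            user_id = self._by_cookie[cookie_id]
        elif has_ip_ua and ip_ua in self._by_ip_ua:
            user_id = self._by_ip_ua[ip_ua]
        else:
            # No stage matched: create a new user ID.
            user_id = next(self._next_id)

        # Assumption of this sketch: remember every signal we saw, without
        # overwriting earlier records, so weaker signals can match next time.
        if account_id is not None:
            self._by_account.setdefault(account_id, user_id)
        if cookie_id is not None:
            self._by_cookie.setdefault(cookie_id, user_id)
        if has_ip_ua:
            self._by_ip_ua.setdefault(ip_ua, user_id)
        return user_id
```

For instance, a visitor first seen with cookie `c1` who later clears that cookie is still matched as long as they return from an IP address and user agent combination that was recorded while the cookie was present.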
Evaluation
Cookies handle shared IP addresses and change of IP addresses well, as long as the cookies persist, but cannot account for changes of device or browser.
Using an IP address to identify users across browsers and devices works in the narrow situation where the user is the only one that visits your site from a given public IP address.
The big problem is having multiple users behind a single public IP address, which is common in many homes and offices. In this case, using the IP address + user agent string can offer some help.
User authentication works great if you can get your users to log into your site, but we find that this is a rare occurrence on the sites that we work with.
Using a hybrid approach, you gain the benefits of all of these methods. The hybrid approach also allows you to “backfill”, using multiple approaches to create a connection between two otherwise separate identities. For example, a change of device and IP address is one of the more difficult scenarios. Consider a user that browses your site on their home PC (home IP address), work PC (work IP address) and phone (both home and work IP addresses). You can use the cookie on the phone to identify the two IP addresses as being used by the same device, then use those IP addresses to make an inferred association of the home and work PCs to the user of the phone.
This is just one example of many situations where backfilling using the hybrid approach can be useful.
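The home PC, work PC and phone scenario can be sketched as a small lookup over (cookie, IP address) sightings. The function name `backfill_candidates` and the data layout are assumptions for illustration. The result is only an inferred association: on a shared IP address it will also pick up other people’s devices, so in practice it would need to be combined with other signals.

```python
from collections import defaultdict


def backfill_candidates(sightings, anchor_cookie):
    """Return cookies seen on any IP address that the anchor cookie also used.

    sightings: iterable of (cookie_id, ip_address) pairs.
    anchor_cookie: the cookie that roams between locations (e.g. the phone).
    """
    ips_by_cookie = defaultdict(set)
    cookies_by_ip = defaultdict(set)
    for cookie_id, ip_address in sightings:
        ips_by_cookie[cookie_id].add(ip_address)
        cookies_by_ip[ip_address].add(cookie_id)

    linked = set()
    for ip_address in ips_by_cookie.get(anchor_cookie, set()):
        # Every other cookie seen at this IP is a candidate for the same user.
        linked |= cookies_by_ip[ip_address]
    linked.discard(anchor_cookie)
    return linked


# Placeholder data: the phone's cookie appears at both the home and work IPs.
sightings = [
    ("phone", "203.0.113.7"),     # home IP
    ("phone", "198.51.100.9"),    # work IP
    ("home-pc", "203.0.113.7"),
    ("work-pc", "198.51.100.9"),
    ("stranger", "192.0.2.44"),   # unrelated location
]
candidates = backfill_candidates(sightings, "phone")
```

Here `candidates` contains the home and work PC cookies but not the unrelated one, because only those two share an IP address with the phone.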
Conclusion
By approaching unique user identification with a hybrid of methods, we can obtain the benefits of multiple techniques. It allows us to make use of better approaches, such as user authentication and cookies, when they are available, but gives us fallbacks like IP addresses and user agent strings when needed.
Find out more about the team and the work Youneeq does at youneeq.ca