Friday, January 25, 2008

How to detect mobile devices hitting your web server in PHP

by Andrea Trasatti

http://wurfl.sourceforge.net/index.php

PHP is a great platform for WAP development. Thousands of developers worldwide love PHP for its performance and for the simplicity of its model.
It should come as no surprise that some PHP developers quickly built the tools to tap the power of WURFL from PHP.

One easy way to play with WURFL is to use what I call the "WURFL PHP Library". The package includes the library and some support files: a readme (I STRONGLY SUGGEST THAT EVERYONE READ IT), check_wurfl.php, which lets you quickly read all the capabilities of a selected device, and update_cache.php, which is described in more detail later.

* wurfl_parser.php: parses the XML and puts it in an array that the other scripts work with. Before starting, make sure PHP was compiled with XML support ( http://www.php.net/manual/en/ref.xml.php ); wurfl_parser.php uses the basic XML functions that are available when PHP is compiled with expat.
* wurfl_class.php: this script lets you access the data in the wurfl array (see previous bullet) in an object-oriented fashion.
* wurfl_config.php: with the growing number of possible configurations in the PHP library, I am now introducing a single file for configuration. This should hopefully make it clearer where you need to configure the scripts and maybe also make it easier to integrate it with your central configuration files (if your application has any)


wurfl_parser.php

This is a VERY simple XML parser that reads WURFL and puts all the needed data into an array. Considering how long the parser was taking to parse the XML, I thought that some kind of caching mechanism was probably a good idea. This is discussed in more detail later.
The generated array is an associative array that looks like this:

$wurfl["devices"]["ericsson_generic"]["fall_back"]="generic";

The first key is "devices". Other possible values such as authors and contributors are probably not so interesting to you.

The second key is the UNIQUE ID assigned to each user agent. All other keys are related to the attributes or groups and capabilities of the device.
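
To make the structure concrete, here is a minimal sketch of how a capability could be resolved by following the fall_back chain through such an array. The sample devices, the "display" group, the capability names and the helper find_capability() are all hypothetical and made up for illustration; the real array is produced by wurfl_parser.php and the library's own lookup may work differently.

```php
<?php
// Hypothetical sample data in the shape described above:
// $wurfl["devices"][<device id>][<attribute or group>]...
$wurfl = array(
    "devices" => array(
        "generic" => array(
            "fall_back" => "root",
            "display" => array("resolution_width" => 90, "colors" => 2),
        ),
        "ericsson_generic" => array(
            "fall_back" => "generic",
            "display" => array("resolution_width" => 101),
        ),
    ),
);

// Illustrative helper (not part of the library): look for a capability on
// the device itself, then on each ancestor reached through "fall_back".
function find_capability($wurfl, $id, $group, $capability)
{
    $seen = array(); // guards against a cyclic fall_back chain
    while (isset($wurfl["devices"][$id]) && !isset($seen[$id])) {
        $seen[$id] = true;
        $device = $wurfl["devices"][$id];
        if (isset($device[$group][$capability])) {
            return $device[$group][$capability];
        }
        if (!isset($device["fall_back"])) {
            break;
        }
        $id = $device["fall_back"];
    }
    return null; // capability not defined anywhere in the chain
}

// Defined on the device itself:
echo find_capability($wurfl, "ericsson_generic", "display", "resolution_width"), "\n";
// Not defined on the device, inherited from "generic":
echo find_capability($wurfl, "ericsson_generic", "display", "colors"), "\n";
```

The point of the sketch is only that a device entry needs to store what differs from its parent; everything else can be found by walking up the chain.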

Of course, this may not seem very flexible in practice, since devices tell us their user agent strings, not their WURFL IDs.
This is solved thanks to an additional array called $wurfl_agents, which is simply an associative array mapping each user agent to its unique ID. This array makes your life much simpler (and your search much faster). $wurfl_agents is cached too.
Let's look at a concrete example. Someone visits your site with, say, a Siemens S45 (user agent: "SIE-S45/24 UP.Browser/5.0"). In order to find the device ID, you would normally have to cycle through most of the $wurfl array to retrieve the ID before you could look up the actual capabilities.
Thanks to $wurfl_agents, checking whether an ID is known is as simple as:

if ( in_array($id, $wurfl_agents) ) {
    echo "The ID $id is known\n";
}

Example of finding the ID when you know the user agent:

// $wurfl_agents is keyed by user agent, so a direct lookup is enough
if ( isset($wurfl_agents[$user_agent]) ) {
    $id = $wurfl_agents[$user_agent];
}
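
An exact lookup only helps when the visitor's user agent is stored in $wurfl_agents verbatim. The text does not say how the library picks a "best fit" otherwise, so the following is only one plausible strategy, sketched under assumptions: try the exact string first, then shorten the user agent one character at a time until a known user agent starts with what is left. The helper guess_device_id(), the sample table and the "generic" catch-all are hypothetical.

```php
<?php
// Hypothetical excerpt of the user agent => unique ID table.
$wurfl_agents = array(
    "SIE-S45/00" => "sie_s45_ver1",
    "SonyEricssonZ600/R601" => "sonyericsson_z600_ver1_subr601",
);

// Illustrative helper (not the library's algorithm): try an exact match,
// then shorten the user agent one character at a time and accept the first
// known user agent that starts with the shortened string.
function guess_device_id($wurfl_agents, $user_agent, $min_length = 5)
{
    if (isset($wurfl_agents[$user_agent])) {
        return $wurfl_agents[$user_agent];
    }
    for ($len = strlen($user_agent) - 1; $len >= $min_length; $len--) {
        $prefix = substr($user_agent, 0, $len);
        foreach ($wurfl_agents as $known_agent => $id) {
            if (strncmp($known_agent, $prefix, $len) === 0) {
                return $id;
            }
        }
    }
    return "generic"; // assumed catch-all ID when nothing matches
}

echo guess_device_id($wurfl_agents, "SIE-S45/24 UP.Browser/5.0"), "\n";
```

The $min_length floor keeps a very short prefix from matching an unrelated device; the value 5 is arbitrary.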


wurfl_class.php

wurfl_class.php was created to make our life simpler. Once we had the data ready in an array, we needed an easy way to access it and some methods to manipulate it. By looking closely at what had been done with the Java API, I reproduced many of those useful methods in PHP.

The class loads the parser automatically. To load the parser, a define must be set appropriately in the config file.

The class is initialized by calling the constructor, wurfl_class, and passing two variables that may be empty. The former is the fully parsed XML (as produced by the parser) and the latter is the array of user agents and IDs as generated by the parser. Pass two empty variables (or nothing) if you want them to be filled in when needed, or the real values if you already have them. The class will check the values and the cache files (if enabled) and decide what to do. Also check your configuration, because the behaviour changes with it.
Use the public method GetDeviceCapabilitiesFromAgent(), passing the user agent, to make the class search for the best fit and fill the object's properties. Once again, what the class does depends on your configuration: whether you enabled cache files and so on.
Once you have instantiated the object and passed it a user agent, you may use all the class's methods.
Here is a list of properties:

$wurfl_class->_wurfl: the WURFL array (all of it)

$wurfl_class->_wurfl_agents: the associative array made of the user agents and unique IDs

$wurfl_class->user_agent: the visitor's user agent

$wurfl_class->wurfl_agent: WURFL's best-fitting user agent

$wurfl_class->id: the corresponding ID

$wurfl_class->GUI: true if the device supports Openwave's GUI extensions

$wurfl_class->browser_is_wap: true if the device is WAP capable. It is here only for legacy support; you should use the is_wireless_device capability from WURFL. browser_is_wap now has the same value as the capability, when found. If you want to take full advantage of this capability you should download the web patch from the WURFL site or CVS!

$wurfl_class->capabilities: the array of the device's capabilities

Note: PHP (up to version 4.3) does not distinguish between private and public methods. In the wurfl_class implementation, I named all private methods with a leading underscore to distinguish them from public methods, which have no underscore. If you are interested in the details, just open the class; there are beautiful JAVADOC-like comments for each variable and method.
These libraries no longer require register_globals (from version 2 of the library and up), but will not work with PHP versions before 4.1.

wurfl_class($wurfl, $wurfl_agents): the constructor. Built to work best with wurfl_parser.

GetDeviceCapabilitiesFromAgent($ua): given a user agent, searches WURFL for the best fit.

getDeviceCapability($capability): given a capability, returns its value. Remember that capabilities might be strings, integers or booleans.

wurfl_config.php

The purpose of this file is straightforward: modify it as you wish to configure the library to behave the way you like best.
Please check the section about caching for more info about cache files.
Here is a quick explanation of all the fields:

WURFL_CONFIG: boolean, set to true by default. It is used as a simple check to make sure the configuration was included. Add this to your own configuration files if you don't use wurfl_config.php; otherwise just leave it as it is.

DATADIR: string, where all data is stored (wurfl.xml, cache file, logs, etc.).

WURFL_FILE: string, full path and filename of wurfl.xml.

WURFL_PARSER_FILE: string, full path and filename of wurfl_parser.php.

WURFL_CLASS_FILE: string, full path and filename of wurfl_class.php.

WURFL_USE_CACHE: boolean, true if you want to use a cache file (strongly suggested). If only this parameter is set to true, the single cache file cache.php is used.

WURFL_USE_MULTICACHE: boolean, true if you want to use multicache files instead of a single BIG cache file (cache.php).

MULTICACHE_DIR: string, used only if you enabled multicache; defines where the cache files will be stored. WARNING: while cache.php will grow in size but remain a single file, here the files will grow in number. Expect more than 5000 tiny files.

MULTICACHE_SUFFIX: string, suffix for the files generated using multicache. Useful if you use a PHP caching system and don't want to load your shared memory with a ton of tiny files.

CACHE_FILE: string, full path and filename of the cache file to use (refreshed when a new WURFL is found, if WURFL_CACHE_AUTOUPDATE is set to true).

WURFL_CACHE_AUTOUPDATE: boolean, tells the class to automatically update the cached files when a new XML is found. This is NOT suggested when using multicache because of the high number of files to be updated: race conditions are very likely. The use of update_cache.php is strongly suggested for production environments.

WURFL_PATCH_FILE: string, optional patch file for WURFL.

WURFL_AGENT2ID_FILE: string, used by wurfl_class.php. Used only when WURFL_USE_CACHE is set to true.

MAX_UA_CACHE: integer, max number of user agents to store in WURFL_AGENT2ID_FILE. Setting the limit too high can slow things down instead of speeding them up.

WURFL_LOG_FILE: string, full path and filename of the log file.

LOG_LEVEL: integer, desired logging level. Use the same constants as for PHP logging.

WURFL_AUTOLOAD: boolean, true if you want the XML to be loaded at every startup. If not, the XML will be loaded when needed.
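
As an illustration, a configuration for the single-file cache might look roughly like the sketch below. The constant names come from the list above, but every path and value is a placeholder rather than a library default, and the multicache and patch constants are left out here on the assumption that they are not needed in this setup.

```php
<?php
// Sketch of a configuration for single-file caching. All paths and values
// below are placeholders, not the library's defaults.
define("WURFL_CONFIG", true);
define("DATADIR", "/var/wurfl/data/");
define("WURFL_FILE", DATADIR . "wurfl.xml");
define("WURFL_PARSER_FILE", "/var/wurfl/lib/wurfl_parser.php");
define("WURFL_CLASS_FILE", "/var/wurfl/lib/wurfl_class.php");
define("WURFL_USE_CACHE", true);
define("WURFL_USE_MULTICACHE", false);
define("CACHE_FILE", DATADIR . "cache.php");
define("WURFL_CACHE_AUTOUPDATE", false); // refresh with update_cache.php instead
define("WURFL_AGENT2ID_FILE", DATADIR . "agent2id_cache.php");
define("MAX_UA_CACHE", 30);
define("WURFL_LOG_FILE", DATADIR . "wurfl.log");
define("LOG_LEVEL", LOG_WARNING); // one of PHP's built-in LOG_* constants
define("WURFL_AUTOLOAD", false);
```

Deriving the file paths from DATADIR keeps everything in one place, which is convenient if the data directory ever moves.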


Caching

Considering how slow PHP can be when parsing a big XML file, caching was a must.
Currently there are two caching systems. The older one is activated by setting WURFL_USE_CACHE to true and uses DATADIR to store its files. The concept is quite simple: dump the array generated by the parser into a big file, by default called cache.php (set by the define CACHE_FILE). This file also stores the array called $wurfl_agents and a timestamp, which is used to check whether a new XML was deployed.
This system is very simple in concept and worked well for quite some time. As cache.php grew larger, it became necessary (and in fact I have always strongly suggested it) to use a caching system at the PHP level, such as Zend Accelerator, Turck MMCache or APC 2.0. Such tools let you store the cache file (cache.php) in shared memory and provide really good performance from the third hit on (first hit: the XML is parsed, slow; second hit: the cache is stored in shared memory, slow; third hit and later: the cache is read from shared memory, fast!).

The new caching system was dubbed "multicache" because instead of generating a single big cache file it generates one cache file for every device in WURFL. For this reason you should create a dedicated directory for it (or at least this is suggested), because the library will generate about 6000 tiny files when this feature is activated.
To activate the multicache system you will need to set WURFL_USE_CACHE to true.
CACHE_FILE will still be used, but the file will only contain the array $wurfl_agents and the timestamp.
Also set WURFL_USE_MULTICACHE to true and set the appropriate path for MULTICACHE_DIR; an absolute path is suggested. Don't forget the trailing slash (for example '/tmp/cache/multicache/').
MULTICACHE_SUFFIX should be left unchanged in most cases. It defines the extension of the tiny files. You might want to set an unusual extension if, for example, you want to prevent those files from being cached in shared memory by a PHP cache. Change it only if you know what you're doing!
The multicache system provides MUCH faster data retrieval: the files are smaller, so it takes far less time to read them. It also means higher I/O on your system; consider returning to the older cache system if you have problems. Generating many tiny files also opens the door to race conditions if a new XML is deployed and the library is configured to update the cache automatically. Read on for more info.
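
As a rough illustration of the one-file-per-device idea, the sketch below stores each device's array in its own small PHP file and reads back only the file it needs. The file naming (device ID plus suffix), the file contents and the three helper functions are assumptions made for this example; the library's real cache format may well differ.

```php
<?php
// Illustrative sketch of the idea behind multicache; file naming and file
// contents are assumptions, not the library's actual format.
function multicache_path($dir, $device_id, $suffix)
{
    // $dir is expected to end with a slash, like MULTICACHE_DIR.
    // A real implementation would also have to make sure the ID is a
    // safe file name.
    return $dir . $device_id . $suffix;
}

// Write one device's data as a tiny PHP file that returns the array.
function multicache_store($dir, $device_id, $device, $suffix = ".php")
{
    $code = "<?php\nreturn " . var_export($device, true) . ";\n";
    $file = multicache_path($dir, $device_id, $suffix);
    return file_put_contents($file, $code) !== false;
}

// Read back a single device without touching any other cache file.
function multicache_load($dir, $device_id, $suffix = ".php")
{
    $file = multicache_path($dir, $device_id, $suffix);
    if (!is_file($file)) {
        return null;
    }
    return include $file;
}

$dir = sys_get_temp_dir() . "/multicache_demo/";
if (!is_dir($dir)) {
    mkdir($dir);
}
multicache_store($dir, "sie_s45_ver1", array("fall_back" => "generic"));
print_r(multicache_load($dir, "sie_s45_ver1"));
```

Each request then reads a file of a few hundred bytes instead of one large cache.php, at the price of many more files on disk.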

If WURFL_CACHE_AUTOUPDATE is set to true, the library (specifically wurfl_class.php) will check the timestamp in cache.php against the modification time of wurfl.xml. If the XML is newer than the cache, it is reloaded. This is not suggested for production environments: with many concurrent hits you might have more than one process trying to refresh the same cache, wasting a lot of resources. To avoid this, set the automatic update to false and use the ad hoc script called update_cache.php. This script was created to be called from the command line (or a hidden URL if you'd like, but the command line is suggested when available) to force a cache update. This way the new cache is prepared separately and the file is swapped in at the very last second, so a single process does the work and a lot of resources are saved. On sites with MANY hits you might also consider preparing the cache files on a separate system and moving them to the production server all at once. There isn't a script to do this automatically at this time, but the new update_cache.php is already a step in that direction.
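
The two ideas in this paragraph, comparing a stored timestamp with the XML's modification time and replacing the cache file only once the new one is complete, can be sketched as follows. This is not the code of wurfl_class.php or update_cache.php; the helper names, the $cache_stat variable and the demo file names are hypothetical.

```php
<?php
// Illustrative helpers, not taken from the library.

// True when the XML file was modified after the cache was generated.
function cache_is_outdated($cache_timestamp, $xml_file)
{
    clearstatcache(); // avoid a stale modification time from PHP's stat cache
    return filemtime($xml_file) > $cache_timestamp;
}

// Build the new cache in a temporary file, then move it into place so that
// concurrent readers do not see a half-written file.
function write_cache_atomically($cache_file, $wurfl_agents, $timestamp)
{
    $tmp = tempnam(dirname($cache_file), "wurfl");
    $code = "<?php\n"
          . '$cache_stat = ' . var_export($timestamp, true) . ";\n"
          . '$wurfl_agents = ' . var_export($wurfl_agents, true) . ";\n";
    file_put_contents($tmp, $code);
    return rename($tmp, $cache_file);
}

$xml = sys_get_temp_dir() . "/wurfl_demo.xml";
$cache = sys_get_temp_dir() . "/wurfl_demo_cache.php";

touch($xml, 1000); // pretend the XML was deployed at time 1000
write_cache_atomically($cache, array("SIE-S45/00" => "sie_s45_ver1"), 1000);
var_dump(cache_is_outdated(1000, $xml)); // cache is as new as the XML

touch($xml, 2000); // a newer XML is deployed
var_dump(cache_is_outdated(1000, $xml)); // cache now needs a refresh
```

Writing to a temporary file in the same directory and renaming it is a common way to get the "changed at the very last second" behaviour on POSIX filesystems, where a rename within one filesystem replaces the target in a single step.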

When you set WURFL_USE_CACHE to true you also enable another simple caching system (active with both the standard cache and multicache). When a user agent hits your site, it will most likely hit it again a few times. It would be wasteful to search for all its capabilities on every subsequent hit. For this reason we store the user agent and its capabilities in the file named by WURFL_AGENT2ID_FILE, which makes every subsequent hit A LOT faster.
Storing all the user agents that hit your site would end up creating a second full cache file, and the benefit would drop to zero. For this reason you can limit the number of user agents stored using MAX_UA_CACHE. The "perfect" value depends on your server's performance, on the variety of user agents visiting your site and so on. A good number is between 30 and 50. I suggest you start with 30 and then check the logs to see how often the oldest user agents are purged from the cache and how often a user agent that is still visiting your site is purged and has to be looked up again.
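
The effect of such a limit can be shown with a small bounded cache that drops its oldest entry once the maximum is exceeded. This is a sketch of the general idea only; the helper remember_user_agent() and the sample entries are hypothetical, and the library's own eviction rule is not described here.

```php
<?php
// Illustrative sketch, not the library's code: keep at most $max entries,
// dropping the oldest user agent when the limit is exceeded.
function remember_user_agent($cache, $user_agent, $capabilities, $max)
{
    unset($cache[$user_agent]);          // re-inserting moves it to the end
    $cache[$user_agent] = $capabilities;
    while (count($cache) > $max) {
        reset($cache);
        unset($cache[key($cache)]);      // first key is the oldest entry
    }
    return $cache;
}

$cache = array();
$cache = remember_user_agent($cache, "UA-1", array("id" => "device_1"), 2);
$cache = remember_user_agent($cache, "UA-2", array("id" => "device_2"), 2);
$cache = remember_user_agent($cache, "UA-3", array("id" => "device_3"), 2);
print_r(array_keys($cache)); // UA-1 has been dropped
```

With a limit that is too low, user agents that are still active keep getting evicted and looked up again, which is the pattern to watch for in the logs.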

It's a direct consequence of the cache system that you do not need to read and parse the entire XML every time you create the wurfl object. You may still want to force this, for example for debugging. You can do this by setting the define named WURFL_AUTOLOAD to true. If you are using any of the caching systems, I suggest you disable it. If you're not using any cache, the XML will be loaded anyway, so just set this to true if you'd like.


Logging

While logging is outside the scope of the WURFL PHP Libraries and I suggest you integrate the libraries with your own logging system (if you have one), a basic logging feature is included. This should work fine on Linux, Solaris and Windows.
Logging is done to the file configured with WURFL_LOG_FILE. The log level is set with the define named LOG_LEVEL, using the PHP constants. It used to be buggy in previous releases; it has since been changed and is now supposed to work properly. When set to the highest level the library might generate a lot of logs.
In any case, the logs should give you all the info you might need. This is a sample log at the highest detail level:

[LuckyTitan.local 327][constructor] Class Initiated
[LuckyTitan.local 327][GetDeviceCapabilitiesFromAgent] searching for SonyEricssonZ600/R601
[LuckyTitan.local 327][_cacheIsValid] cache file is outdated
[LuckyTitan.local 327][GetDeviceCapabilitiesFromAgent] cache enabled, WURFL is not loaded, now loading
[LuckyTitan.local 327][GetDeviceCapabilitiesFromAgent] loading WURFL from XML
[LuckyTitan.local 327][parse] No XML patch file defined
[LuckyTitan.local 327][GetDeviceCapabilitiesFromAgent] Searching in the agent database
[LuckyTitan.local 327][_GetFullCapabilities] searching for sonyericsson_z600_ver1_subr601
[LuckyTitan.local 327][_GetDeviceCapabilitiesFromId] reading id:sonyericsson_z600_ver1_subr601
[LuckyTitan.local 327][_GetDeviceCapabilitiesFromId] I have it in wurfl_agents cache, done
