Because the source data were heterogeneous, a number of steps were required to harmonise them to the level at which they could be matched with the cluster definitions. The main part of this process was bringing the whole dataset to NACE 4-digit level by splitting data reported at higher levels of aggregation wherever 4-digit data were not available.

Industry splitting algorithm
The data were processed on the same regional division level as they were reported, which corresponds to NUTS 2 regions for all countries except for Ireland, where the data were only available for Ireland as a whole (Ireland is composed of two NUTS 2 regions). Initial processing was also done only for the years for which the data were reported.

The harmonisation algorithm worked in the following way:

1. First, the script checked for availability of employment data for each combination of year, region and NACE 4-digit industry (for years/regions where at least some data was available).

2. If the data was present, it moved to the next combination.

3. If the data cell was blank, it searched upwards through the parent industries of the missing one until it found a value (e.g. if industry 51.83 was missing, it would check 51.8; if that was blank too, it would check 51 and then section G, which is the parent of 51). If all the parent industries were missing, the cell was left blank.

4. When a value was found, the script went on to determine all the children (and sub-children) of that parent industry that are at cluster definition level (i.e. NACE 4-digit), are themselves missing, and have no available intermediate industry between them and the parent. The script then calculated the sum of all the cells already available within the given parent, deducted this sum from the value of the parent cell and split the remainder equally between the children determined previously.

For example, suppose industries 40, 40.11, 40.13 and 40.3 are available, but 40.1, 40.12, 40.2, 40.20 and 40.30 are not. The script skips 40.11 as it is available, determines that 40.12 is missing and goes on to check 40.1. That is missing as well, so it checks 40 and finds a value. Then it looks for the children to fill: 40.12 is the first obvious one. Next comes 40.20, since 40.2 is not available either. 40.30 is not included, since the script sees that it can be derived from 40.3 with better precision. So it takes the sum of the three available cells (40.11, 40.13 and 40.3), deducts it from the value of 40, divides the result by two (since there are two target industries) and assigns the resulting value to each of 40.12 and 40.20. Then it assigns the value of 40.3 to 40.30, since that is the only industry whose parent is 40.3.
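The steps and the worked example above can be sketched in Python as follows. This is only an illustrative sketch, not the script used in the project: the function names, the dictionary layout (one dictionary of available values per year and region) and the employment figures are assumptions. It also assumes that an available cell is deducted from a parent only when no other available industry lies between the two, which avoids counting the same employment twice.

```python
def parent(code, sections=None):
    """Return the parent NACE code, or None at the top of the hierarchy."""
    if "." in code:
        head, tail = code.split(".")
        # "40.11" -> "40.1", "40.1" -> "40"
        return head if len(tail) == 1 else f"{head}.{tail[:-1]}"
    # A 2-digit division maps to its section letter if a mapping is supplied.
    return (sections or {}).get(code)


def nearest_available(code, values, sections=None):
    """Walk up the hierarchy until an available ancestor is found."""
    p = parent(code, sections)
    while p is not None and p not in values:
        p = parent(p, sections)
    return p


def split_industries(values, four_digit_codes, sections=None):
    """Fill missing 4-digit cells for one year and region."""
    # Cells that are already available are kept as reported.
    result = {c: values[c] for c in four_digit_codes if c in values}

    # Group every missing 4-digit code under its nearest available ancestor.
    targets = {}
    for code in four_digit_codes:
        if code not in values:
            ancestor = nearest_available(code, values, sections)
            if ancestor is not None:  # otherwise the cell stays blank
                targets.setdefault(ancestor, []).append(code)

    for ancestor, children in targets.items():
        # Sum of available cells directly below this ancestor (assumption:
        # cells nested under another available cell are not counted again).
        known = sum(v for c, v in values.items()
                    if nearest_available(c, values, sections) == ancestor)
        share = (values[ancestor] - known) / len(children)
        for child in children:
            result[child] = share
    return result


# Hypothetical employment figures for the example with division 40.
values = {"40": 100.0, "40.11": 20.0, "40.13": 10.0, "40.3": 30.0}
codes = ["40.11", "40.12", "40.13", "40.20", "40.30"]
filled = split_industries(values, codes)
```

With these made-up figures, 40.12 and 40.20 each receive (100 - 60) / 2 = 20, and 40.30 receives the full value of 40.3, i.e. 30. The sketch does not handle inconsistent inputs, for example a parent value smaller than the sum of its available children.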

This method not only allowed us to transform data that were available only at NACE 3-digit level, but also to use data reported at multiple levels for the same country. This was particularly valuable for countries like the United Kingdom, which provide data at all NACE levels but withhold more values for confidentiality reasons the less aggregated the data are.

The method does have drawbacks, the main one being that the remainder is split equally among all the target industries. In most cases the split is close to reality, especially in countries with NACE 4-digit coverage, where normally only very small values are withheld (e.g. when fewer than ten people are employed in the industry). There are, however, some special cases in which the results are wrong. One particularly erroneous case occurred in Vienna: the 2-digit code for transportation was split equally between railway transportation and pipeline transportation. This resulted in Vienna appearing as a 3-star oil and gas cluster, whereas pipeline transportation employment is in fact only a small fraction of railway employment. In this single case, we manually assigned all of the transportation employment to the railways code.

Some clusters, especially small ones composed of NACE 4-digit industries from different sections, are more prone to errors than others. One example is the Building Fixtures cluster, whose split data could instead have gone to the furniture, construction and other clusters.
Fortunately, all of these potential errors are traced and indicated in red in the data tables presented in the mapping sections. The basic formula for a star to be considered certain is:

$$E_{min} \geq E_{cutoff}$$

where $E_{min}$ is the employment in the cluster in the case where everything that could have been split to other clusters ended up in those clusters, i.e. the cluster employment of which we are certain regardless of split errors, and $E_{cutoff}$ is the employment required to receive a star in the indicator concerned (size, specialization or focus). If this inequality does not hold for at least one of the stars, the cluster receives an “a” note and its stars are indicated in red.
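The certainty check can be illustrated with a short Python sketch. It assumes that the inequality compares the certain employment with the cutoff as "E min is at least E cutoff"; the function names and all figures are hypothetical and are not taken from the project data.

```python
def star_is_certain(e_min, e_cutoff):
    """A star is certain if the employment we are sure of reaches the cutoff."""
    return e_min >= e_cutoff  # assumed form of the inequality


def needs_a_note(e_min, cutoffs_of_awarded_stars):
    """Return True if at least one awarded star is not certain."""
    return any(not star_is_certain(e_min, cutoff)
               for cutoff in cutoffs_of_awarded_stars.values())


# Hypothetical cluster: 4,200 employees are certain, and stars were awarded
# for two indicators with the employment cutoffs given below.
flagged = needs_a_note(4200, {"size": 4000, "focus": 5000})
```

Here the size star is certain, but the focus star is not, so the cluster would be flagged.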

Processing regions and years
After all the operations on industries were completed, the data at definition level were aggregated to the regions that we use in this project. This concerned only the countries that we decided to use at NUTS 1 regional level and some minor islands, like Åland, that were merged with the mainland (see the Regional Aggregation subsection for more details).
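As a rough illustration, this regional aggregation amounts to summing employment along a mapping from reported regions to project regions. The sketch below assumes a dictionary keyed by (region, industry); the region names, the mapping and the figures are placeholders, not the actual regional definitions.

```python
from collections import defaultdict


def aggregate_regions(employment, region_map):
    """Sum employment over reported regions that share a project region."""
    totals = defaultdict(float)
    for (region, industry), value in employment.items():
        # Regions absent from the mapping are kept as they are.
        target = region_map.get(region, region)
        totals[(target, industry)] += value
    return dict(totals)


# Hypothetical data: a small island region is merged with the mainland.
employment = {
    ("island", "15.11"): 5.0,
    ("mainland", "15.11"): 95.0,
    ("other", "15.11"): 10.0,
}
merged = aggregate_regions(employment, {"island": "mainland"})
```

In this made-up case the island's 5 employees are added to the mainland's 95, while the third region is left unchanged.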
The data then underwent two final processes. First, they were aggregated from NACE 4-digit industry level to the cluster categories that we use. Finally, a table was created showing not only the actual data year, but also the reference year to which it belongs. The reference years range from 1991 to 2006, and for each reference year the latest available data year was used. For the purpose of this project only the latest data were used (i.e. reference year 2006); the actual data years that the data on the website represent are given in the following table.

Country           Data Year    Country           Data Year
Austria           2004         Lithuania         2004
Belgium           2004         Luxembourg        2005
Bulgaria          2005         Malta             2005
Cyprus            2005         Netherlands       2005
Czech Republic    2005         Norway            2006
Denmark           2005         Poland            2001
Estonia           2004         Portugal          2004
Finland           2004         Romania           2005
France            2005         Slovakia          2005
Germany           2006         Slovenia          2006
Greece            2006         Spain             2005
Hungary           2005         Sweden            2005
Iceland           2004         Switzerland       2005
Ireland           2004         Turkey            2002
Italy             2005         United Kingdom    2005
Latvia            2005
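The link between reference years and data years can be sketched as follows. The sketch assumes that "latest available" means the most recent data year that is not later than the reference year, which is consistent with the table (e.g. 2001 data standing in for reference year 2006) but is not stated explicitly. The function name and the list of available years are hypothetical.

```python
def data_year_for(reference_year, available_years):
    """Return the latest data year not later than the reference year."""
    candidates = [y for y in available_years if y <= reference_year]
    # None means no data had been reported by that reference year.
    return max(candidates) if candidates else None


# Hypothetical country with data reported for two years only.
available = [1998, 2001]
table = {ref: data_year_for(ref, available) for ref in range(1991, 2007)}
```

For this made-up country, reference years 1991 to 1997 have no data, 1998 to 2000 use the 1998 data, and 2001 to 2006 use the 2001 data.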

