Mining the Web

June 8, 2013

I recently had a colleague over from Germany, and each morning I would pick him up and we would drive to our client in Parramatta, a trip that takes about an hour. That twice-daily commute gave us ample opportunity to chat, and we covered a variety of topics. One day he remarked that there seemed to be more duplicate street names in Sydney than in Germany. That raised the question of how to test such a hypothesis, and first of all where to get relevant data. We quickly settled on OpenStreetMap as the likely best source. The next morning I checked and found that not only does it have the required data, but there is also a powerful query language for retrieving all roads and their names.
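For readers who want to try the retrieval step, here is a minimal sketch. It assumes the query language in question is Overpass QL, which the post does not actually name, and that the response is requested as JSON. The function names `build_query` and `parse_ways` are made up for illustration, and the area filter by city name is a simplification that may match more than one area in practice.

```python
import json


def build_query(city_name, timeout_s=180):
    """Build an Overpass QL query for all named highway ways in a city.

    Assumption: the city can be selected as an administrative area by name.
    """
    return (
        f"[out:json][timeout:{timeout_s}];\n"
        f'area["name"="{city_name}"]["boundary"="administrative"]->.city;\n'
        'way(area.city)["highway"]["name"];\n'
        "out body;"
    )


def parse_ways(response_text):
    """Extract (street name, node ids) pairs from an Overpass JSON response."""
    data = json.loads(response_text)
    segments = []
    for element in data.get("elements", []):
        if element.get("type") != "way":
            continue  # skip nodes and relations
        name = element.get("tags", {}).get("name")
        if name:
            segments.append((name, element.get("nodes", [])))
    return segments
```

The query string would be sent to an Overpass endpoint of your choice; the parsing step works on the returned text and needs no network access.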

In the end it took all of 20 minutes to download the road segment data for Berlin and Sydney, and about 60 lines of Python to construct and identify the unique roads. The two cities are actually quite comparable: Sydney has about 12,500 roads and Berlin 11,700.
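The original script is not shown here, so the following is only a sketch of one way to turn road segments into roads. It assumes each segment is a pair of a street name and a list of node ids, and it treats two segments as the same road when they share a name and at least one node. The function name `count_roads_per_name` and the input layout are hypothetical.

```python
from collections import defaultdict


def count_roads_per_name(segments):
    """Return {name: number of distinct roads carrying that name}.

    Segments with the same name that share a node are merged into one road
    using a small union-find structure.
    """
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # (name, node id) -> index of first segment using it
    for idx, (name, nodes) in enumerate(segments):
        for node in nodes:
            key = (name, node)
            if key in first_seen:
                union(first_seen[key], idx)
            else:
                first_seen[key] = idx

    roots_by_name = defaultdict(set)
    for idx, (name, _nodes) in enumerate(segments):
        roots_by_name[name].add(find(idx))
    return {name: len(roots) for name, roots in roots_by_name.items()}
```

A real data set will contain roads with small gaps between their segments, so a distance threshold might be a better merge criterion than a shared node; that refinement is left out of this sketch.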

Lo and behold, it turns out that there is very little difference between the two cities. In Berlin about 88% of all road names are used only once; in Sydney it is 85%, hardly a difference.
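The percentage itself is simple to compute once you know how many roads carry each name. The sketch below assumes a dictionary mapping each street name to the number of distinct roads using it; the function names are hypothetical and the sample names are invented, not taken from either city.

```python
from collections import Counter


def share_used_once(roads_per_name):
    """Fraction of street names that are carried by exactly one road."""
    if not roads_per_name:
        return 0.0
    used_once = sum(1 for count in roads_per_name.values() if count == 1)
    return used_once / len(roads_per_name)


def usage_histogram(roads_per_name):
    """Map 'number of roads sharing a name' to 'how many names have that count'."""
    return Counter(roads_per_name.values())


# Invented example: three names used once, one name used twice.
example = {"Alpha St": 1, "Beta Rd": 1, "Gamma Ave": 2, "Delta Ln": 1}
print(share_used_once(example))   # 0.75
print(usage_histogram(example))
```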

This is what the internet revolution has actually brought us: the general availability of data. In contrast, 30 years ago, when I was working as a programmer at an econometrics institute, all data had to be painstakingly researched from publications and typed in by legions of students.
Yes, it still requires expertise to retrieve and process the data, but at least it is available.

I have since compared other cities such as Munich, Hamburg, Paris, London, Rome and Denver. It was a great surprise that the results for all but two of them were within 1-2 percentage points. Originally we speculated that duplication might be a matter of age: as cities grow older they absorb nearby communities, each of which brings its own street names. By that hypothesis, older cities would have more duplication than newer ones. As it turns out, the opposite occurred: Rome has 97% unique street names whereas Denver has only 71%.

Another surprise is that the plot of how many street names are used once, twice, etc. follows a power-law distribution, i.e. the number of street names used twice is 10% of the number of unique names, and the number of street names used four times is 1%. Again, the exceptions are Rome and Denver.
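To make the power-law claim concrete: if the number of names used k times behaves like N(k) = N(1) * k^(-alpha), then the two ratios above (10% at k = 2, 1% at k = 4) both imply alpha = log2(10), roughly 3.32. The sketch below estimates alpha with an ordinary least-squares fit on the log-log values. The function name is hypothetical, and the counts in the usage example are illustrative numbers chosen to match those ratios, not measured data.

```python
import math


def fit_power_law_exponent(histogram):
    """Estimate alpha in count ~ k**(-alpha) by a least-squares line fit
    of log(count) against log(k)."""
    points = [
        (math.log(k), math.log(count))
        for k, count in histogram.items()
        if k > 0 and count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two usable (k, count) pairs")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all k values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # slope is negative for a decaying power law


# Illustrative counts only: 10% at k=2 and 1% at k=4, as described in the text.
illustrative = {1: 10000, 2: 1000, 4: 100}
print(round(fit_power_law_exponent(illustrative), 2))  # 3.32
```

A straight-line fit on log-log values is a quick check rather than a rigorous test for power-law behaviour, but it is enough to see whether a city such as Rome or Denver falls off the line.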
