• Data mining Split Ticketing Information

    by  • May 18, 2012 • .Net, Journal, Programming • 3 Comments

    In my last blog post I talked about what data I would need to go about building a Split Ticketing website, where you split up your train journey into multiple segments to reduce the overall cost.

    To do this I first off needed to find out information about train stations, particularly where they all are! This is because geographic distances play a part in the simple heuristic function employed by the A* search algorithm when plotting the cheapest route.

    Train Station Locations (Click to see Zoomed in)

    First off I needed a list of all the train stations, luckily this is provided by the ‘Office of Rail Regulation’ where there are currently 2352 stations in the UK, and they conveniently provide a spreadsheet with their names and postcodes – which using Google Docs can be saved as a CSV file.

    By using Ordance Surveys OpenData project I can for free convert these postcodes into grid references that I can use to easily plot locations and calculate distances. You have to request the PostCode info and 10-20 mins later you get huge CSV files full of Easting/Northing data for each postcode.

    I wrote a series of parsers to go through the data and pick out the information I need and match it up between the 2 data sets:

    • Train Station Code
    • Train Station Name
    • PostCode
    • Easting
    • Northing

    Once done I enumerated through, and using the relative Easting / Northing data I plotted the information on a bitmap along with the station code, the PNG of which can be seen at the top of the blog post – Click on it to get the full effect!

    For the nerds out there, you can download the CSV file of the data that was generated here which contains the categories described above

    Determining routes

    National Rail and ATCO have routing guides, these are used to generate train routes when finding tickets. Although good, we can use the A* algorithm to see if there are better routes.

    We need a mechanism for determining which stations are connected, to do this we need to mine National Public Transport Data Repository (NPTDR) data from the ATCO, luckily we can get all this (250mb+) from data.gov.uk.

    This provides a HUGE amount of data on every stop (bus, train, tram, ferry, airport etc.) in the country, each is given a special ATCO code. By stripping out the other forms of transport we don’t need, we can match up the Train Station to its ATCO code (RLY code) based on the Easting/Northing information, which can then be used in the provided stopping data to plot journeys.

    Below is the map which includes the ATCO routing information overlaid (showing direct routes between stations), it looks a bit rough and some of the journeys don’t seem right at the moment but I will look into this further.

    Routing Information (Click to see Zoomed in)

    So there we have part 1, I have everything (nearly) in place to start searching for possible routes, join me in part 2 where I will connect it to one of the many ticket information & timetable APIs to generate the A* search algorithm.


    Well I tinkered away and fixed most of the bizarre connections (due mainly to invalidate coordinate data), I added the A* search algorithm and using the heuristic of shortest distance & actual, we get the route of shortest distance across the rail network, I choose CPM and YRK as the endpoints.

    Shortest Route Map (Click for Zoomed Image)


    Software engineer. Tea drinker


    3 Responses to Data mining Split Ticketing Information

    1. Andrew Murray
      June 4, 2012 at 9:05 pm

      Very interesting stuff – it’s quite good all the free data available. Nice blog too btw. Now you just need to figure out how to make your millions!

    2. Pingback: Visualising data – what is it, what is important, how best to display it | Mr Pfisters Random Waffle

    3. Ali
      September 15, 2014 at 5:42 pm

      So what is the difference between A* and Dijkstra algorithm for pathfinding? Was there a reason for using A* or was it ‘just cause’? :)

      iirc, they use Dijkstra in devices like Tom Tom etc. I just can’t remember the reason.

    Leave a Reply

    Your email address will not be published. Required fields are marked *