• Week 24

    1 April 2013

    This week was all about Detling: finishing up the data-ingest and the “catalogue” display of all content. From this point, we can wrap functionality around it.

    I seem to have spent a lot of my professional life working on data-ingest or screen-scraping. To be honest: I quite enjoy it. It’s a strange skillset, and scraping is always a kind of hack; I’d dearly love nicely structured data from the get-go.

    But by engaging with how human-entered data is structured and organised, you get a good feel for the shape of the information: how well the people working with it know the domain; how much it varies; what the faster-moving layers of it are, that you’ll need to be able to edit later.

    To that end, Detling has one of the more refined ingest processes I’ve worked on. A spreadsheet is sucked into a set of holding objects, each object representing one “object” in the spreadsheet – in this case, a concert, the information for which is distributed across two columns and many rows per concert.
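
    As a rough illustration of that first step, here is a minimal Python sketch. It is not the project's actual code: the label/value column layout, the blank-row separator between concerts, and the names HoldingObject and ingest are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HoldingObject:
    """Raw, uncorrected fields for one concert, kept exactly as scraped."""
    fields: dict[str, list[str]] = field(default_factory=dict)


def ingest(rows):
    """Group two-column (label, value) rows into one HoldingObject per concert.

    Assumptions (hypothetical layout): a fully blank row separates concerts,
    and a row with an empty label continues the previous label.
    """
    holdings = []
    current = HoldingObject()
    last_label = None
    for label, value in rows:
        label, value = label.strip(), value.strip()
        if not label and not value:
            # Blank row: close off the concert collected so far, if any.
            if current.fields:
                holdings.append(current)
                current = HoldingObject()
            last_label = None
            continue
        if label:
            last_label = label
        if last_label is None:
            continue  # a value with no label to attach it to
        current.fields.setdefault(last_label, []).append(value)
    if current.fields:
        holdings.append(current)
    return holdings
```

    Every value stays a plain string at this stage; nothing is interpreted or linked until an administrator has looked it over.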

    To ‘atomize’ the holding object into its constituent parts, an administrator needs to eyeball the scraped data, and correct it where appropriate. This won’t change the holding object – but it will affect the new objects created when we explode and normalize the holding object. Where appropriate, these fields autocomplete based on values we already know.

    (Adding autocompletion meant sticking pg_search in, which has the added bonus of giving me search “for free”, as it were, a bit further down the line. Thanks, Postgres, for your full-text search.)

    Once the fields are deemed as OK as makes no difference, a click of the ‘atomize’ button fragments the holding object, creating Composers, Venues, Performers and so forth where appropriate – or linking the Concert to existing ones.
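
    The atomize step boils down to a find-or-create pass over the corrected fields. A minimal Python sketch, assuming in-memory dictionaries in place of database tables and case-insensitive name matching; Catalogue, find_or_create, atomize and the sample data are all invented for the example, not taken from the project:

```python
from dataclasses import dataclass, field


@dataclass
class Catalogue:
    """Normalized records, keyed by a case-insensitive name (an assumption)."""
    composers: dict = field(default_factory=dict)
    venues: dict = field(default_factory=dict)
    concerts: list = field(default_factory=list)


def find_or_create(table, name):
    """Return the existing record for this name, or create and store one."""
    clean = name.strip()
    return table.setdefault(clean.casefold(), {"name": clean})


def atomize(holding, corrections, catalogue):
    """Explode one holding record into linked, normalized records.

    The administrator's corrections are layered over a copy of the scraped
    values, so the holding record itself is never modified.
    """
    merged = {**holding, **corrections}
    concert = {
        "date": merged["date"],
        "venue": find_or_create(catalogue.venues, merged["venue"]),
        "composers": [find_or_create(catalogue.composers, n)
                      for n in merged["composers"]],
    }
    catalogue.concerts.append(concert)
    return concert


# Invented sample data: the second record carries a scraped typo in the venue.
catalogue = Catalogue()
first = {"date": "1 May", "venue": "Town Hall", "composers": ["Elgar"]}
second = {"date": "2 May", "venue": "Town Hal", "composers": ["Elgar", "Holst"]}

a = atomize(first, {}, catalogue)
b = atomize(second, {"venue": "Town Hall"}, catalogue)  # typo fixed by hand
```

    Because the correction is applied before the lookup, both concerts end up linked to the same venue record, while the second holding record still shows what was originally scraped.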

    From that point, edits to individual objects can still be made, but the bulk of atomisation is done.

    It doesn’t sound wildly sophisticated, but it’s where the bulk of the progress this week has been made. The one-day demo that I produced in the first week of the project resulted in a superficially very similar site. But it had very limited editing capabilities, and no potential to “massage” the data.

    Being able to do that up-front, swiftly and easily, saves a vast amount of time later. I learned this in many hard ways on Schooloscope, a product that relied heavily on other people’s information, and on massaging it into shape as the structures of the source data changed year-on-year. So for this project, even though it’s much smaller, I wanted to make sure that an appropriate amount of time was spent on making data-ingest something anybody could do – not just me.

    That’s now out of the way – it’s inappropriate to spend much longer on it – and the next few weeks will be about building user-facing functionality; the meat of the product. But it feels good to be where we are now, and to have got there the way we did.

    Also, worth noting the domain vocabulary cropping up – Atomize, Ingest, Explode. At Berg, we referred to project languages as Dutch. It’s no problem that a project has its own language – but it’s important that language is shared with everybody. So I’ve kicked off a Glossary page on the GitHub wiki for the project, and every time I encounter something that I’ve named, I stick it on there. Minimum Viable Documentation goes a long way.