f-log

just another web log

10 Jul 2016:
fight the easy way to a xml bleb
More bleb-ing now we have the data.

XML is a wonderful idea for presenting structured data, but depending on what language you use to read it, the "wonderful idea" can be a big pain in the butt!

To get the programme data out of the channel xml files I needed to target only the film entries and ignore all the rest.

xmllint --xpath '//programme[ type = "Film" ]' raw/bleb-XML/0/$channelName.xml | tr -d '\n' | sed -re "s/<\/programme>/\n/gi"


gets only the programmes that are marked as Film, using standard XPath notation. It then joins all the results into one big line by removing the newlines (\n), and re-splits the lines on the closing programme tag. This deals with the long descriptions that are spread over multiple lines in the file.
I found an XPath tester that proved invaluable during testing:
http://chris.photobooks.com/xml/default.htm
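To see what the tr/sed part of that pipeline is doing, here it is run on a hypothetical miniature of a channel file (xmllint left out so the snippet stands alone):

```shell
# Hypothetical miniature of a channel file; real bleb XML carries more fields.
sample='<channel><programme>
<type>Film</type> <title>A</title>
</programme><programme>
<type>Film</type> <title>B</title>
</programme></channel>'

# Remove all newlines, then start a new line at every closing programme tag.
echo "$sample" | tr -d '\n' | sed -re "s/<\/programme>/\n/gi"
```

Each programme now sits on its own line, however many lines its description originally spanned.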

Then it's just a matter of extracting and massaging the data into the SQLite DB. I did not have to change my existing database schema, which is handy. What I did do though is create a new table for channel names. This allows the database channel ids to map to the channel names, which are also the file names.
ls -1 | cat -n | sed -re "s/\s+([0-9]+)\s+([^.]+).+/insert into channel values(\1,'\2','+');/gi" | sqlite3 ../../../database/tv.db

That gets a list of all the file names, adds line numbers, and then stores each line number as the channel id alongside the channel name.
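On a couple of hypothetical file names (and without touching a real database), the cat -n and sed steps turn the listing into ready-made SQL:

```shell
# Hypothetical file listing standing in for the real ls -1 output.
printf 'bbc1.xml\nch4.xml\n' \
  | cat -n \
  | sed -re "s/\s+([0-9]+)\s+([^.]+).+/insert into channel values(\1,'\2','+');/gi"
# first line produced: insert into channel values(1,'bbc1','+');
```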

So for each of the 7 days (including today)
for i in $(seq 0 6)
do
getDaysProgramming $i
done


loop through each channel, getting the date as we go
function getDaysProgramming() {
dayNo=$1
for channel in $(sqlite3 database/tv.db "select id,name from channel;")
do
    channelId=$(echo $channel | cut -d '|' -f1)
    channelName=$(echo $channel | cut -d '|' -f2)
    echo $channelName $channelId
    programmeDate=$(xmllint --xpath '//channel/@date' raw/bleb-XML/$dayNo/$channelName.xml)
    programmeDate=$( getDate $programmeDate)
    ...

(getDate just re-formats the date string)
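getDate itself is not shown in the post; here is a minimal sketch, assuming the xmllint attribute output looks like date="10/07/2016" (DD/MM/YYYY) and the database wants ISO dates. The real function may well differ.

```shell
# Hypothetical sketch: strip the attribute wrapper and flip DD/MM/YYYY
# into YYYY-MM-DD. The input format here is an assumption, not confirmed
# by the post.
function getDate() {
  echo "$1" | sed -re 's/.*date="([0-9]{2})\/([0-9]{2})\/([0-9]{4})".*/\3-\2-\1/'
}

getDate ' date="10/07/2016"'
```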

Then we process each programme in those channels
    IFS=$'\n';
    for programme in $(xmllint --xpath '//programme[ type = "Film" ]' raw/bleb-XML/$dayNo/$channelName.xml | tr -d '\n' | sed -re "s/<\/programme>/\n/gi")
    do
     echo "[-- $programme --]"
     title=$( getTagValue $programme "title" )
     startTime=$( getTagValue $programme "start" )
     endTime=$( getTagValue $programme "end" )
     year=$( getTagValue $programme "year" )
     description=$( getTagValue $programme "desc" )
     startTime=$(getTime $startTime)
     endTime=$(getTime $endTime)
     title=$(safeEntities $title)
     description=$(safeEntities $description)
     set +e
     if [[ -z "${year// }" ]]
     then
        year=$(getDeepYear $description)
        echo "=====================recovered deep year [$year]"
     fi
     set -e
     echo "insert into programme select null,$channelId,'$title','','','$year','','','','true','','','','','','','','','','$description','','$programmeDate','$startTime','$endTime',0;" | sqlite3 database/tv.db
    done
    unset IFS


The IFS bit is important: setting IFS to a newline makes the for loop split the xmllint output on line boundaries rather than on every space, so each programme stays in one piece. To avoid confusion I unset it at the end.
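The effect is easy to demonstrate: with the default IFS every space-separated word becomes its own loop item, while with a newline IFS each line survives as one item.

```shell
# Two lines, each containing a space.
lines=$'first programme\nsecond programme'

count=0
for item in $lines; do count=$((count + 1)); done
echo "default IFS: $count items"   # splits on the spaces too -> 4

IFS=$'\n'
count=0
for item in $lines; do count=$((count + 1)); done
unset IFS
echo "newline IFS: $count items"   # one item per line -> 2
```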

function getTagValue() {
raw=$1
tag=$2
echo $raw | egrep -q ".*<$tag>.+<\/$tag>.*"
if [ $? -eq 0 ]
then
    echo $raw | sed -re "s/.*<$tag>(.+)<\/$tag>.*/\1/gi"
else
    echo ""
fi
}

This checks the programme string and, if the requested tag exists, returns its value. The egrep -q suppresses the matched output; only the exit status is used.
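A quick sanity check of getTagValue against a hypothetical programme fragment (the function body repeated so the snippet is self-contained):

```shell
function getTagValue() {
  raw=$1
  tag=$2
  echo $raw | egrep -q ".*<$tag>.+<\/$tag>.*"
  if [ $? -eq 0 ]
  then
    echo $raw | sed -re "s/.*<$tag>(.+)<\/$tag>.*/\1/gi"
  else
    echo ""
  fi
}

# Hypothetical fragment; real bleb programmes carry more tags than this.
getTagValue "<title>Alien</title><year>1979</year>" "year"   # prints 1979
getTagValue "<title>Alien</title><year>1979</year>" "desc"   # prints an empty string
```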

I then have a number of string manipulation functions to ensure the data is in the correct format for the database, but they are not very exciting.
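None of those helpers appear in the post, but as an illustration, here are hypothetical sketches of two of them — assuming bleb times arrive as bare HHMM and that single quotes are the main characters breaking the quoted SQL literals; the real functions may well do more:

```shell
# Hypothetical: 2130 -> 21:30 (assumes times come through as bare HHMM).
function getTime() {
  echo "$1" | sed -re 's/^([0-9]{2})([0-9]{2})$/\1:\2/'
}

# Hypothetical: double up single quotes so titles like "Don't Look Now"
# survive inside a '...' SQL string literal.
function safeEntities() {
  echo "$1" | sed -e "s/'/''/g"
}

getTime 2130                    # 21:30
safeEntities "Don't Look Now"   # Don''t Look Now
```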

Finally I really want the year the film was released. Some channels complete the required data field but most do not. However, they often have the year somewhere in the description. It is always four digits, but sometimes in round brackets, other times in square brackets, and other times with additional information tacked on the end.

For some reason this final piece of the puzzle took the longest. It should have been very easy to check the state of $year and, if empty, get the last instance of four consecutive digits from the description. The set +e and set -e lines disable, then re-enable, the behaviour that makes the script exit when a command fails.
Every time I checked $year for being empty and it was not, the script ended. Very annoying.

function getDeepYear() {
raw=$1
echo $raw | grep -Po '\d{4}(?!.*\d{4})'
}


Doing a look-ahead in this regular expression required the -P option, which enables Perl regular expression handling. The -o part is simpler: it makes grep print only the matching part.
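With a hypothetical description containing two year-like runs, only the last one comes back:

```shell
# The negative lookahead (?!.*\d{4}) rejects any match that still has
# another four-digit run somewhere to its right, so only the final
# four-digit run in the string can succeed.
echo "Made in 1999, remastered [2010]" | grep -Po '\d{4}(?!.*\d{4})'
# prints 2010
```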

All the data that I need is then inserted into the existing database and my old unchanged nodejs server works fine.

Why did I say this is not 100% complete when it looks feature complete compared with the old code? I still want a better mechanism for fetching outstanding days' data AND a way to make them obvious in the GUI.

Now it should be pointed out that when I was trying to find a straightforward method for XML parsing in shell script, there was a Stack Overflow answer along the lines of "shell is dead, use a proper language", and all I can say to that is: where is the fun in doing it the easy way? ;)

10 Jul 2016:
bleb api proves easy on the curl
I wanted to wait until the TvDig solution was 100% complete but that may never happen ;)

Of the two options, I chose http://www.bleb.org/tv/

API
http://www.bleb.org/tv/data/listings
Files
http://www.bleb.org/tv/data/listings/
(and yes that is just a / difference)

Although at first I thought I would be downloading individual files for each channel, as with the previous solution, it turns out they have a nice API. The API will package all the days and all the channels for you, avoiding the delay required between downloads of multiple files.

http://www.bleb.org/tv/data/listings?days=...&format=...&channels=...&file=


So for my choice of channels it's just
http://www.bleb.org/tv/data/listings?days=0..6&format=bleb&channels=bbc1,bbc2,ch4,five,bbc1_hd,bbc3,bbc4,bbc_hd,boomerang,bravo,cartoon_network,cbbc,cbeebies,ch4_hd,dave,e4,film_four,five_us,fiver,itv1_hd,more4,nick_junior,nickelodeon,tcm,tmf,uk_gold,watch,yesterday

(basically all the free ones that might show films)

And the curl command that will fetch all 7 days (today plus the next 6)
curl --user-agent "TVDIG::MEANINGFUL-IDENTIFIER-EMAIL-URL-ETC" -o raw/tv.tbz2 "http://www.bleb.org/tv/data/listings?days=0..6&format=bleb&channels=bbc1,bbc2,ch4,five,bbc1_hd,bbc3,bbc4,bbc_hd,boomerang,bravo,cartoon_network,cbbc,cbeebies,ch4_hd,dave,e4,film_four,five_us,fiver,itv1_hd,more4,nick_junior,nickelodeon,tcm,tmf,uk_gold,watch,yesterday&file=tbz2"


Then I unpack them into the raw folder
pushd raw
tar xf tv.tbz2
popd


which gives a tree structure

raw/bleb-XML
raw/bleb-XML/0
raw/bleb-XML/1
raw/bleb-XML/2
raw/bleb-XML/3
raw/bleb-XML/4
raw/bleb-XML/5
raw/bleb-XML/6


and each of those folders has an xml file for each channel

bbc1.xml     boomerang.xml        ch4_hd.xml     five_us.xml      tcm.xml
bbc1_hd.xml  cartoon_network.xml  dave.xml       fiver.xml        tmf.xml
bbc2.xml     cbbc.xml             e4.xml         itv1_hd.xml      uk_gold.xml
bbc4.xml     cbeebies.xml         film_four.xml  more4.xml        watch.xml
bbc_hd.xml   ch4.xml              five.xml       nickelodeon.xml


That has got the data; next time I will explain how I process it.

05 Jul 2016:
scottish trampolines spam the real me
Something strange happened about a month ago. I started getting invoices, just your usual spam, except that they all had the jumpstation domain name on them. Considering that in the past I have received physical spam, these were just curious. Maybe an attempt to catch a web site owner's attention?

Then I got an invoice that was a note about an invoice with an image of an invoice. I kid you not: the image of the invoice also had the jumpstation domain name in printed form.

"We have a minor issue with our Internet at the moment so I have had to
photograph the attached Invoice rather than scan it.
Hopefully its readable. If there are any issues please let me know."

As the item was a "Tuk Tuk", was quoted at thousands of pounds, and had odd grammar, I gave it very little thought. But when I then got a couple of emails regarding a site visit, listing networking and other items, these all noted an address in Scotland.

So I went back to my SpamAssassin folder, and the photograph too had a Scottish address.

[invoice image on a wooden table, heavily obfuscated]
(and as you can see it was on a wooden table)

After googling around for ages I found planning permission for a trampoline park that had the same postcode. But there was no web presence.

Weeks later I found that a new company with the name Jump Station (note the space) had registered jumpstations.co.uk, but everyone was misspelling it.

Now I was getting a few emails a day that appeared to be for the new Jump Station (Scotland?).

Lots of job applications, all with dodgy Word document CVs attached, and a couple of competition winners thanking the organiser and providing mailing addresses.
(this all feels like spam)

Apart from the consistency of location I would have disregarded all of these as the usual bombardment of spam. Having had this domain since before the millennium, and having created literally hundreds of separate email addresses, I do get a lot of spam daily.

I tried responding to the latest CV with a nice "I think this was meant for this other domain" and just got the same CV back. Now I reply with


+++++++++++++++++++++++++++++++++++++++++
| !! MAIL BOUNCE !!                     |
+++++++++++++++++++++++++++++++++++++++++
| ERROR: 404                           |
| SYSTEM: X504.mailsubsystem           |
| MESSAGE ID: 0008                     |
| DETAILS:                             |
| Your message has not been delivered  |
| to the intended recipient.           |
+++++++++++++++++++++++++++++++++++++++++

Automated message please do not respond.

Please check your spelling and try again.

common misspellings include

jumpstations.co.uk
jumpstation.com
jumpnation.com
jumpfactory.co.uk
jumpstation.fm

Automated message please do not respond.


It was interesting to see all the other jumpstations[sic] that are registered.


