As part of the Twitter content analysis solution I am working on (see my other post on sentiment extraction), I also have to track tweets that link to specific content on the net, e.g. marketing content on YouTube, websites, etc.
In order to do so, I have to expand the shortened URLs, and this is where I run into massive performance issues.
I have tried using external URL expander services - I tried a bunch of them, and they are all very slow (resolving 3,500 URLs took over 20 minutes). Now, my development machine is very slow (believe me, it is seriously slow), but it was idling for the most part. Our network connectivity is not too bad either, so that's not the issue.
So I tried having the UDT resolve the URL directly, by opening it with urllib2.urlopen and calling geturl() to get the final URL.
This also solves the problem of daisy-chained shortened URLs (from t.co to bit.ly to tinyurl to the actual URL), since urllib2.urlopen follows the whole redirect chain through to the actual destination URL, whereas a URL expander service may just hand you yet another shortened URL.
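A minimal sketch of that approach, assuming the UDT's Python (2.x) script can call urllib2 directly - the function name, the 10-second timeout, and the example t.co URL below are my own placeholders, not from my actual transform:

import urllib2

def expand_url(short_url, timeout=10):
    # urlopen follows the whole redirect chain (t.co -> bit.ly -> ...),
    # so geturl() returns the final destination URL.
    try:
        response = urllib2.urlopen(short_url, timeout=timeout)
        final_url = response.geturl()
        response.close()
        return final_url
    except (urllib2.URLError, ValueError):
        # Dead or malformed link - skip it rather than abort the batch.
        return None

# e.g. expand_url('http://t.co/abc123') -> URL of the final landing page

Note that urlopen issues a separate GET request for every hop in the redirect chain, so each shortened URL costs several network round trips.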
However, using urllib2.urlopen to resolve the destination URL is even slower than using a single expander service.
Interestingly though, the Twitter search API works much faster: Data Services happily scoops up piles of tweets in short order, so external HTTP connectivity through the UDT and Python's urllib2 is performing just fine.
Any suggestions on how to speed things up a bit? Anyone else dealing with the same issue of expanding these shortened URLs?
ErikR (BOB member since 2007-01-10)