BusinessObjects Board

Shortened URL expander (User Defined Transform)

As part of the twitter content analysis solution I am working on (see my other post on sentiment extraction), I also have to track those tweets that link to certain content on the net, e.g. specific marketing content on YouTube, websites, etc.

In order to do so, I have to expand the shortened URLs and this is where I run into massive performance issues.

I have tried using an external URL expander service - I tried a bunch of them and they are all very slow (resolving 3,500 URLs took over 20 minutes). Now my development machine is very slow (believe me, it is seriously slow) but it was idling for the most part. Our network connectivity is not too bad, so that’s not the issue either.

So I tried having the UDT resolve the URL directly by opening it with urllib2.urlopen and using geturl() to get the final URL.

This also resolves the problem of daisy chained shortened URLs (from t.co to bit.ly to tinyurl to the actual URL) as urllib2.urlopen will ultimately resolve to the actual destination URL. (Whereas URL expanders may just provide you with yet another shortened URL).
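
In essence, the resolver boils down to something like this (a simplified sketch - the timeout value and error handling are illustrative, not my exact code):

    # Minimal sketch of resolving a shortened URL with urllib2 (Python 2).
    import urllib2

    def expand_url(short_url, timeout=10):
        """Follow every redirect (t.co -> bit.ly -> ...) and return the final URL."""
        try:
            response = urllib2.urlopen(short_url, timeout=timeout)
            return response.geturl()    # URL after all redirects have been followed
        except (urllib2.URLError, ValueError):
            return short_url            # on failure, fall back to the original URL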

However, using urllib2.urlopen to resolve the destination URL is even slower than using a single expander service.

Interestingly though, the twitter search API works much quicker. Data Services is happily scooping up piles of tweets in rapid order - so external HTTP connectivity through the UDT and Python’s urllib2 is performing just fine.

Any suggestions on how to speed things up a bit? Anyone else dealing with the same issue of expanding these shortened URLs?


ErikR :new_zealand: (BOB member since 2007-01-10)

Question: In the context of the UDT, is the use of multi-threading Python code supported?

Python can easily spawn and manage multiple threads for async operations - and it actually works within the UDT (although I have seen some stability issues) … but is this actually supported within the context of the User Defined Transform?


ErikR :new_zealand: (BOB member since 2007-01-10)

I think you are in completely uncharted territory here…

Regarding multi-threading: not sure if that is supported or not. However, if you raise the DOP of a DF with a UDT in it, it will spawn multiple al_engines, and from a BODS perspective it will run with multiple processes.

I think that is the easiest answer to your question.


Johannes Vink :netherlands: (BOB member since 2012-03-20)

Multi-threading does work inside the UDT. I can call the threading and queue modules, spawn different threads and handle the locks and thread collection just fine.

The only problem is that I may have floored our network by creating 10,000 HTTP connections at the same time :mrgreen:
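
In essence, that version spawned one thread per URL, roughly like this (a simplified sketch - input_urls stands for the list of shortened URLs taken from the input, and the lock-protected result list is just one way to collect the output):

    # Simplified sketch of the naive approach: one thread per input URL.
    # With 10,000 input rows this opens 10,000 HTTP connections at once.
    import threading
    import urllib2

    results = []
    results_lock = threading.Lock()

    def resolve(short_url):
        try:
            final_url = urllib2.urlopen(short_url, timeout=10).geturl()
        except urllib2.URLError:
            final_url = short_url
        with results_lock:                  # protect the shared result list
            results.append((short_url, final_url))

    # input_urls: the list of shortened URLs gathered from the input collection
    threads = [threading.Thread(target=resolve, args=(u,)) for u in input_urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                            # wait for every thread to finish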


ErikR :new_zealand: (BOB member since 2007-01-10)

In that case you are way further than I am with Python coding :mrgreen: I can only change a few functions based on examples and a lot of googling…

Compared to single threaded processing: does the performance/throughput increase a lot with multi threading?


Johannes Vink :netherlands: (BOB member since 2012-03-20)

Yes, the output speed increases quite a bit. However, there is already quite a difference between record-based and collection-based processing.

The Stack Overflow website/forum is a very good place to look for insight into Python. Coming from a mixed background of .NET and VB/COM+/ASP, Python does take some getting used to, but it is a very elegant, smooth language.

My question to you is - is it possible to return “nested” schemas in the UDT or can it only handle a “flat” record output?


ErikR :new_zealand: (BOB member since 2007-01-10)

I tested the input side: not in BODS 3.x and 4.0. BODS forces you to select a field, but you cannot select a node. In fact, a field from a nested schema cannot be selected as input.

As for output: I do not think so either. You need to map each specific field to an output field in the Python code.

I have never tried collection-based processing because we are processing relatively big datasets (1M+ records). What is the impact on memory usage? Will it take the full record set into memory?


Johannes Vink :netherlands: (BOB member since 2012-03-20)

I could see that one coming. The problem is that the Dataflow will make unlimited function calls. I saw the same thing when I wrote an ETL job to gather statistics on database tables. 50 tables needed statistics, 50 database connections were created. The DBA was not happy.

The solution was to single-thread it using a custom function that did not allow parallel execution. This isn’t a good solution in your case because you DO want parallel threads.

I suspect that I could do this with Oracle scheduling. It has the ability to set the maximum number of events to run in parallel.


eganjp :us: (BOB member since 2007-09-12)

Hmm yes, I already suspected that the UDT would not allow you to input or output a nested schema. So I will have to do the ‘flattening’ in Python and create a list of ‘flat’ fields for output. Not exactly the solution I was after, but you have to row with the paddles you find… or something along those lines. 8)

The exact workings of the UDT are still a bit of a mystery to me at times, but my understanding is that when using collection mode, the UDT receives an entire record set (or array) as input - as opposed to a single record.

Which is why direct “GetField” calls against the incoming data set will not work. You need to cycle through the input collection. In my code, I first cycle through the input fields, storing all the records in a Python list (array or collection) so that I can kill the Data Services object and continue “internally” within the Python environment.

On the output side, you have to do the reverse - cycling through the internal Python lists to create and populate the Data Services Data Manager object.

So my code is divided into 3 sections (sketched in the snippet below):

  1. retrieve data from Data Services
  2. do something funky (like killing our network :mrgreen: )
  3. output data back to Data Services
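
In skeleton form, a collection-mode UDT then looks roughly like this (a simplified sketch - the field names URL_IN and URL_OUT are placeholders, expand_url is the resolver sketched earlier in the thread, and the Collection / DataManager calls are the UDT Python API as I understand it, so double-check the exact method names against the Reference Guide):

    # Simplified collection-mode skeleton. Collection and DataManager are the
    # objects the UDT exposes to the embedded Python interpreter; URL_IN and
    # URL_OUT are placeholder field names.

    # 1. retrieve data from Data Services into a plain Python list
    urls = []
    record = DataManager.NewDataRecord()
    for i in range(1, Collection.Size() + 1):   # the collection is 1-based
        Collection.GetRecord(record, i)
        urls.append(record.GetField(u'URL_IN'))
    record.Delete()

    # 2. do something funky, purely inside Python (here: resolve the URLs)
    resolved = [expand_url(u) for u in urls]

    # 3. output data back to Data Services
    Collection.Truncate()                       # drop the input records
    for original, final in zip(urls, resolved):
        out = DataManager.NewDataRecord()
        out.SetField(u'URL_IN', original)
        out.SetField(u'URL_OUT', final)
        Collection.AddRecord(out)
        out.Delete()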

I tested record by record and collection modes and in my tests the collection mode achieved a much higher throughput.

The only thing I noticed is that the UDT acts like a Data Transfer object - in that it does not seem to ‘trickle’ data through the Data Flow. Instead, the UDT will absorb ALL source data and output everything as a huge single ‘blob’ of data.

I have no idea if this is a side effect of the collection mode or if the record mode does the same thing. From my limited tests with the record mode, I do recall seeing very similar behaviour in that mode as well.

In the UDT you do have more control over the number of threads you create. In fact, you have precise control over the number of connections you spawn, as it does appear that the Python code is executed outside of the usual DOP * Table Loaders paradigm.

My mistake was that I simply took the entire input data set (of 10,000 records) and generated a separate thread for each URL in the data set, all at once. So the server spawned a massive number of threads simultaneously. Although the network traffic generated per HTTP call was very little, the problem was that it all happened at the same time.

If I control the number of threads generated and resolve everything in batches of, say, 100 URLs (and thus 100 threads), the network should take somewhat less of a hit, I think.
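
Something like a fixed pool of worker threads pulling URLs off a Queue should cap the concurrency (a sketch - the pool size of 100 is just the batch figure mentioned above, and urls is again the list built from the input collection):

    # Sketch: resolve URLs with a fixed pool of worker threads instead of one
    # thread per record, so at most 100 HTTP connections are open at any time.
    import threading
    import Queue
    import urllib2

    NUM_WORKERS = 100
    work = Queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            short_url = work.get()
            try:
                final_url = urllib2.urlopen(short_url, timeout=10).geturl()
            except urllib2.URLError:
                final_url = short_url
            with results_lock:
                results[short_url] = final_url
            work.task_done()

    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker)
        t.daemon = True     # let the process exit even if a worker hangs
        t.start()

    for url in urls:        # 'urls' comes from the input collection, as above
        work.put(url)
    work.join()             # block until every URL has been resolved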


ErikR :new_zealand: (BOB member since 2007-01-10)

I am going to read your reply at a later stage, but maybe you should talk to the network/firewall admins. It could be that either the firewall or a proxy is limiting the number of calls that you create.

Could be that those guys can create a highway for you in the network.

And I guess that they are not so happy with you, as you are basically DDoS’ing some components of the network now :mrgreen:


Johannes Vink :netherlands: (BOB member since 2012-03-20)

True - but is there a better way to “mass resolve” a large collection of short URLs?


ErikR :new_zealand: (BOB member since 2007-01-10)

Without actually resolving them one by one? I guess not :expressionless:


Johannes Vink :netherlands: (BOB member since 2012-03-20)