Job failed with "Communication link failure" error

system · December 24, 2007, 8:24pm

Hi,

I have a job failed giving the following error :

(11.5) 12-24-07 10:54:37 (E) (3032:2152) DBS-070404: |Dataflow DF_SRC_STG1_StoreOrders_F_Parallel_2|Loader Query_STOREORDERS_STG_PRIM_TEMP
SQL submitted to ODBC data source resulted in error <[Microsoft][ODBC SQL Server Driver]Communication link failure>.
The SQL submitted is <UPDATE “AL_ATTR” SET “ATTR_VALUE” = ‘TABLE’ WHERE “ATTR_KEY” = 29385 >.
(11.5) 12-24-07 10:54:37 (E) (3032:2152) DBS-070404: |Dataflow DF_SRC_STG1_StoreOrders_F_Parallel_2|Loader Query_STOREORDERS_STG_PRIM_TEMP
SQL submitted to ODBC data source resulted in error <[Microsoft][ODBC SQL Server Driver]Communication link failure>.
The SQL submitted is <select OBJECT_KEY, MACHINE, SERVER, TYPE, SERVICE, CONTAINER, INST_MACHINE, INST_SERVER,
INST_SERVER_PORT, INST_REPO, INST_EXEC, INST_EXEC_KEY, RUN_SEQ, START_TIME, END_TIME, EXECUTION_TIME, STATUS, HAS_ERROR,
IGNORE_ERROR, NORM_MACHINE, NORM_SERVER, NORM_TYPE, NORM_SERVICE, NORM_CONTAINER, NORM_INST_MACHINE, NORM_INST_SERVER,
NORM_INST_REPO, NORM_INST_EXEC from AL_HISTORY where OBJECT_KEY = 840>.
(11.5) 12-24-07 10:54:37 (E) (3032:2152) REP-100117: |Dataflow DF_SRC_STG1_StoreOrders_F_Parallel_2|Loader Query_STOREORDERS_STG_PRIM_TEMP
Cannot retrieve <Job_runid> <select OBJECT_KEY, MACHINE, SERVER, TYPE, SERVICE, CONTAINER, INST_MACHINE, INST_SERVER,
INST_SERVER_PORT, INST_REPO, INST_EXEC, INST_EXEC_KEY, RUN_SEQ, START_TIME, END_TIME, EXECUTION_TIME, STATUS, HAS_ERROR,
IGNORE_ERROR, NORM_MACHINE, NORM_SERVER, NORM_TYPE, NORM_SERVICE, NORM_CONTAINER, NORM_INST_MACHINE, NORM_INST_SERVER,
NORM_INST_REPO, NORM_INST_EXEC from AL_HISTORY where OBJECT_KEY = 840> from the repository. Additional database information:
<SQL submitted to ODBC data source resulted in error <[Microsoft][ODBC SQL Server Driver]Communication link failure>.
The SQL submitted is <select OBJECT_KEY, MACHINE, SERVER, TYPE, SERVICE, CONTAINER, INST_MACHINE, INST_SERVER,
INST_SERVER_PORT, INST_REPO, INST_EXEC, INST_EXEC_KEY, RUN_SEQ, START_TIME, END_TIME, EXECUTION_TIME, STATUS, HAS_ERROR,
IGNORE_ERROR, NORM_MACHINE, NORM_SERVER, NORM_TYPE, NORM_SERVICE, NORM_CONTAINER, NORM_INST_MACHINE, NORM_INST_SERVER,
NORM_INST_REPO, NORM_INST_EXEC from AL_HISTORY where OBJECT_KEY = 840>.>.
(11.5) 12-24-07 10:54:37 (E) (3032:2152) REP-100109: |Dataflow DF_SRC_STG1_StoreOrders_F_Parallel_2|Loader Query_STOREORDERS_STG_PRIM_TEMP
Cannot save into the repository. Additional database information: <SQL submitted to ODBC data
source resulted in error <[Microsoft][ODBC SQL Server Driver]Communication link failure>. The SQL submitted is
<UPDATE “AL_ATTR” SET “ATTR_VALUE” = ‘TABLE’ WHERE “ATTR_KEY” = 29385 >.>.
(11.5) 12-24-07 10:54:37 (E) (3032:2152) DBS-070404: |Dataflow DF_SRC_STG1_StoreOrders_F_Parallel_2|Loader Query_STOREORDERS_STG_PRIM_TEMP
SQL submitted to ODBC data source resulted in error <>. The SQL submitted is .
(11.5) 12-24-07 10:54:37 (E) (3032:2152) REP-100119: |Dataflow DF_SRC_STG1_StoreOrders_F_Parallel_2|Loader Query_STOREORDERS_STG_PRIM_TEMP
Cannot perform operation from the repository. Additional database information: <SQL submitted to ODBC
data source resulted in error <>. The SQL submitted is .>.

The surprising thing is, sometimes it succeeds and sometimes it fails. The Target database is SQL SERVER 2005, Source is Oracle 9i and DI is 11.5.

Please help. Thanks.

bodi_cm (BOB member since 2007-08-09)

system · December 26, 2007, 9:59pm

Be afraid. Be very afraid.

We get these sporadically, and in the same sort of place (when DI is trying to update the repo with statistics as a job is running). This generic SQL error message is hard to debug.

Assuming it isn’t a legitimate network connectivity problem between the DI server and the repository database server… you might also check to see if your SQL server is completely overloaded.

My current suspicion is that we get this when a dataflow runs for a very long time (an hour or more). While the dataflow is running, the “main” connection from your Job has a connection open to the repo db. Eventually it seems to time out.

dnewton (BOB member since 2004-01-30)

system · December 27, 2007, 6:29am

Hi Newton,

Thanks a lot for your reply.

Regarding the time out issue, is there any setting in DI where we can specify the Connection time out value ? Usually I am getting this error when my job runs for a higher volume, I would say which runs more or less longer than 3 Hrs. If it runs for a lesser volume (within 2 Hrs), it succeeds mostly. But I will have instances soon where I have to run this job for history loading and each run will span for atleast 5-6 Hrs.
In the current setting, the Database Repository and the Job server are hosted on different servers/machines but in the sane Subnet. Do you think this might result in a network related issue? If there is any network related issue, what is the usual way to debug?

I am getting this error in my QA/PROD environment. In DEV environment, the same job runs for 5-6 Hrs and it succeeds.

Thanks

bodi_cm (BOB member since 2007-08-09)

system · December 27, 2007, 3:29pm

I think the last time I checked with tech support, there was no timeout setting on the DI side of things. But you might want to ask also and see what they say (And let us know!).

For network issues, it’s beyond me – you’d need some help from a networking person to monitor the network and re-tries/errors etc.

dnewton (BOB member since 2004-01-30)

system · December 30, 2007, 7:02pm

This time I got one more error.

(11.5) 12-30-07 10:22:25 (W) (2468:5312) DBS-070402: |Dataflow
DF_SRC_STG1_StoreOrders_F_Parallel_2|Loader Query_STOREORDERS_STG_PRIM_TEMP ODBC data source
<monbodb.traderjoes.com> warning message for operation <bcp_init>:
<[Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).>, SQLState <01000>.

Can you please give some idea about this error?

Thanks

bodi_cm (BOB member since 2007-08-09)

system · December 30, 2007, 9:29pm

As far as I know, it’s very similar to the other error – so the same issue.

dnewton (BOB member since 2004-01-30)

system · January 15, 2008, 3:23am

Hi,

Pardon me if this workaround was posted previously.

After much head-ache, we found this issue is due to Microsoft’s TCP Chimney Offload feature. The TCP Chimney Offload feature is enabled by the Windows Server 2003 Scalable Networking Pack which comes with sp2 (surprise!), details here http://support.microsoft.com/kb/942861.

We actually found this MS KB article via another application upgrade using a similar hardware config. as described and were experiencing all 3 symptoms: SQL Server on Windows Server 2003: “General Network error,” “Communication link failure,” or “A transport-level error”.

Microsoft’s recommended resolution did not work (Update vendor drivers), but, after applying the workaround (disable the TCP Chimney Offload feature), our ETL jobs have been completing 100% successful in less than 1/3 of the time compared to with TCP Chimney enabled.

I hope others experiencing the symptoms above can utilize this

real80 (BOB member since 2007-09-04)

system · January 15, 2008, 1:36pm

Thanks for the update, that’s good information.

To everyone else: We were having this same problem (connectivity errors on SQL statements updating the repository tables, as noted in the original post above). We worked around the problem by moving the repository database local to the BODI job server, rather than having it on a remote database server.

dnewton (BOB member since 2004-01-30)

system · August 25, 2009, 12:48pm

Hi,

Thought I’d add an update to this post as we had the same problem, but our symptoms were a ConnectionRead error rather than ConnectionWrite (so didn’t find this post on my original search!) - and we also spent a lot of time scratching our heads

We used the same fix as described in real80’s post - described in detail in http://support.microsoft.com/kb/942861 - again, hardware updates didn’t work, but disabling TCP Chimney Offload did.

It seems to be an incompatibility issue with some network adapters & Windows SP2 - since making the changes (required on both the database server & BODI application server) this is resolved.

dnewton - I wonder if you were experiencing the same problem, but by moving the repository to the local job server, you’re now avoiding having to use the network adapter.

jefale (BOB member since 2005-07-28)

system · August 25, 2009, 12:54pm

We did try the chimney fix (before moving the repository), and it didn’t help us, in our case.

dnewton (BOB member since 2004-01-30)