Product direction: Local Repo versus Central Repo only!!!!

system · October 16, 2012, 3:45pm

Sorry for the lengthy question, but I really am desperate for your opinions… Please respond with as many posts as possible; it is your future!

When talking about the long term future of DS Repo we immediately come to the following question: Are local repos required for maximum independence or can we use a central repo alone?

Imagine two developers working together on the same project: I work on DIM_CUSTOMER, you work on FACT_FINANCE. We have both checked in our objects to the central repo, but we are working with the versions we have in our local repos. So you can execute the job loading the fact and the dimension using the most recent version of your object and an older version of my dataflow. You are impacted neither by what I am working on at the moment (maybe my dataflow is not finished, or is even syntactically invalid) nor by the version I last checked in to the central repo. You have 100% control of the version you execute.

If we had a central repo only, you would not have this flexibility. There is no local repo; you execute off the central repo. In theory we could ask you to confirm, for each dependent object, which version should be used and so get to the same level of flexibility as with the many local repos, but that is not very handy. I would think you have two operations in the central repo: a save and a check-in. When you check out something and then save, we create a new version but keep that object marked as checked out. Only once it is checked in will others see it. So when you execute the job you will use your current (saved) version of the fact dataflow and my last checked-in version of the dataflow loading DIM_CUSTOMER.
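
The save/check-in rule described above can be sketched as a small version-resolution function. This is only an illustration of the proposal, not how Data Services works today; the class, the field names, and the dataflow names DF_FACT_FINANCE and DF_DIM_CUSTOMER are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RepoObject:
    """One object in a hypothetical central-only repository."""
    name: str
    # Version numbers that have been checked in and are visible to everyone.
    checked_in: List[int] = field(default_factory=list)
    # Developer currently holding the checkout, if any.
    checked_out_by: Optional[str] = None
    # Versions saved since the checkout; visible only to that developer.
    saved: List[int] = field(default_factory=list)

    def version_for(self, developer: str) -> int:
        """Pick the version this developer would execute."""
        if self.checked_out_by == developer and self.saved:
            return self.saved[-1]      # own latest saved work
        if not self.checked_in:
            raise LookupError(f"{self.name} has never been checked in")
        return self.checked_in[-1]     # everybody else's last check-in


def execution_plan(objects: List[RepoObject], developer: str) -> Dict[str, int]:
    """Map each object name to the version used when `developer` runs the job."""
    return {obj.name: obj.version_for(developer) for obj in objects}


if __name__ == "__main__":
    fact = RepoObject("DF_FACT_FINANCE", checked_in=[1, 2],
                      checked_out_by="you", saved=[3, 4])
    dim = RepoObject("DF_DIM_CUSTOMER", checked_in=[1, 2, 3],
                     checked_out_by="werner", saved=[4])
    print(execution_plan([fact, dim], "you"))
```

In this sketch "you" would run version 4 of the fact dataflow (your latest save) and version 3 of the dimension dataflow (the last check-in), never the unfinished version 4 that only its author can see.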

The fear people have with the central-repo only approach is that you will never be able to execute the job successfully.

I might have checked in something that is not working, a broken mapping formula or so. Okay, I should have just saved it rather than checking it in, but it might happen.
I have re-imported a table and it has a different column layout; now your dataflow throws an error that column xyz no longer exists because it was renamed.

I fully understand these fears, and obviously it would be best if you could say “execute my fact dataflow with the table structure as it was in version 7 and Werner’s dataflow of 3 weeks ago”, but that is not manageable. With that many objects to choose versions for, using local repos would be simpler.
At the same time, if you argue that things like the table structure might differ and therefore you need a separate repository, you could also argue that you need a separate data warehouse database for each developer. I mean, I loaded the customer dimension yesterday and today the new field is still empty?? How could that be? I don’t know that you executed a version of my dataflow that is a year old. Or the table structure did change? Then you get an error in the SQL statement anyhow.

There is no perfect solution. In the past we said we want to be 100% independent, and that approach has its merits. And its downsides.

Which one would you prefer?

Werner Daehn (BOB member since 2004-12-17)

system · October 16, 2012, 6:43pm

If you take away my local repository I’m going to have to fly to Germany and have a rather frank discussion with you.

The local repository allows for independence in the development process. A table can indeed have a different look throughout the development process and it is very appropriate for me to have a newer version than another developer. If some other bozo doesn’t know what they are doing I sure don’t want their mistakes to impact my development.

As a consultant there are times when I have a copy of a production repository loaded into my local so I can diagnose issues or perform an audit. How would that work without a local repository?

Plus, if the Central Repository is all we’ve got then there will be hundreds if not thousands of versions in there that have absolutely nothing to do with versions that were actually deployed/used beyond the development environment.

I think the local/central repository setup is a great way to make people really think about their development process. What is the migration path? How do we do version control? Who has access to each environment?

I’m not much of a Java developer but I don’t know any Enterprise level Java developers that don’t use some form of version control. They don’t all work out of the same directory.

The interface with the Central Repository could stand to be improved. Maybe that would solve some of the issues?

eganjp (BOB member since 2007-09-12)

system · October 18, 2012, 5:00am

That’s the way I’d argue

Each developer should have their own database/schema(s) to fully implement the concept of private workspaces and allow private builds, i.e. run any object, of any version, of any config with any data of their choosing without impacting other developers. Each environment is completely siloed.

I’m totally behind Jim on this, keep local repositories.

somebodi (BOB member since 2010-08-04)

system · October 18, 2012, 6:42am

Agreed, my vote goes to keep local repos.

Nemesis (BOB member since 2004-06-09)

system · October 18, 2012, 10:53pm

Keep local repositories. Don’t try to fix something that isn’t broken.

And improve the labelling. I really do not like the “free hand” text search for labels when constructing a release from the Central Repo. If anyone made a typo, the get-by-label approach could miss out vital parts of a job.
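
The typo risk can be illustrated with a short sketch: an exact-match lookup silently skips an object whose label was mistyped, while a similarity check flags it. The object names, the labels, and the `(object, label)` pair format are hypothetical; this does not read a real repository.

```python
import difflib


def get_by_label(labelled_objects, wanted):
    """Exact-match lookup: a mistyped label is silently left out."""
    return [name for name, label in labelled_objects if label == wanted]


def near_miss_labels(labelled_objects, wanted, cutoff=0.8):
    """Labels that look like `wanted` but are not identical (likely typos)."""
    others = sorted({label for _, label in labelled_objects if label != wanted})
    return difflib.get_close_matches(wanted, others, n=10, cutoff=cutoff)


if __name__ == "__main__":
    objects = [
        ("JOB_LOAD_DWH", "REL_2012_10"),
        ("DF_DIM_CUSTOMER", "REL_2012_10"),
        ("DF_FACT_FINANCE", "REL_2012_1O"),   # letter O typed instead of zero
        ("DF_OLD_STAGING", "QA_BASELINE"),
    ]
    print(get_by_label(objects, "REL_2012_10"))
    print(near_miss_labels(objects, "REL_2012_10"))
```

Running a check like `near_miss_labels` before building a release would at least show that a suspiciously similar label exists, which is the kind of safeguard a pick-list of labels would give for free.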

Instead, I would like to see the Central Repo work more in line with the way Microsoft VSS deals with build releases, branches etc.

ErikR (BOB member since 2007-01-10)

system · October 19, 2012, 8:24am

I’d like to be more precise then.

Yes, we would keep the local repositories for QA and for Prod (and other purposes), but in development you would not have to use local repositories anymore. The central repo would have a jobserver of its own, and you would execute the latest checked-in version of each object others created; your own objects would be executed in the latest version you saved.

What would be your answer then? Useful change and you would stop using local repos for development or not useful at all given that you do not have control over the versions being executed and the dependencies?

Werner Daehn (BOB member since 2004-12-17)

system · October 19, 2012, 10:18am

This discussion reminds me of something I wrote-up a while back (when I was managing a DS practice) on applying Martin Fowler’s Continuous Integration principles to team development in BODS. Directly related to this, and very interested in what people think. In my experience, BODS developers tend to be lone wolf types…

See the file attachment for the full write-up. But here’s the bulk of it:

Introduction
In software development projects with multiple developers, things can go astray with disconcerting ease. This is as true for team development in SAP BusinessObjects Data Services (BODS) as for Java or C++.
If Tom makes two weeks’ worth of changes to his code units which all work fine for him and Kathy does likewise, when it comes time to integrate their pieces together, they will likely begin their descent into integration hell. To avoid hell requires religious adherence to a methodology that guards against the divergence of parallel streams of development. Our recommendation is for BODS development teams to follow the practice of Continuous Integration.

In a nutshell, Continuous Integration prevents divergence by requiring full daily builds of everybody’s committed code and daily code commits. We didn’t create Continuous Integration (it’s an accepted methodology with an established literature) but have attempted to add value by translating it here into BODS-specific terms. Continuous Integration informs much of the Automated Continuous Integration Testing (ACIT) facility in the practice of Agile Data Warehousing (ADW) as described by Ralph Hughes in his book of the same name.

Your understanding will benefit by consulting the following background material:
Continuous integration - Wikipedia
Access Denied - Jazz Community Site
Continuous Integration
Integration Hell (this one is pretty funny)
Continuous Integration (original version) – a foundational article

Continuous Integration - General Principles
Continuous Integration enjoins the development team to adhere to the following principles:

Maintain a Single Source Repository.
Automate the Build.
Make Your Build Self-Testing.
Everyone Commits To the Mainline Every Day.
Every Commit Should Build the Mainline on an Integration Machine.
Keep the Build Fast.
Test in a Clone of the Production Environment.
Make it Easy for Anyone to Get the Latest Executable.
Everyone can see what’s happening.
Automate Deployment.

These principles work, but they need to be translated into terms and technologies specific to BODS. In the following, we’re ignoring testing, both in general and as handled in an ACIT facility, as well as how to automate certain operations; both are simply too large to be addressed here, where the focus is more on team dynamics and dos and don’ts.

In what follows, we’ll use “commit” to mean “check-in”, and the terms “mainline” and “code base” will be synonymous with the latest version of the job or jobs in the BODS central repository.

Continuous Integration in BODS

Use a Single Central Repository and Strive to Keep Everything There
In a team working on a single project, we have to have a single DS central repository for code which contains everything necessary for a build. We recommend that teams create a central repository BODS_CENTRAL for the purpose.

There’s a lot of talk about “builds” in the literature. In DS terms, what runs is the job, so, for us, build = job. Some people say it should be a project, and that would be OK; the build, then, would be the set of jobs. This doesn’t affect the general discussion.

We can have multiple jobs in a single DS central repository, and those jobs can use shared components (like custom functions and tables), and that’s all fine as long as we have a single central.

We do not recommend creating multiple central repositories for different code life cycle phases (typically, dev, QA, and prod). Central repositories intrinsically maintain versions, and we can use the code labeling feature to label our code with version numbers if desired.

Martin Fowler emphasizes that the repository must contain everything, and it should be possible for a developer to start with a virgin local repository, get the latest job, and run it successfully, with no external dependencies. A Data Services central repository is not a full-featured, Subversion-style repository, and we can’t easily add Word documents, DDL scripts, etc. But by properly documenting the code that can get added, we can, at least, refer to such dependencies, and make our jobs self-documenting and self-contained to the maximum possible extent. The general rule should be: by performing a “Get latest version” of a job, everything required to run the job should be either directly present or referred to within the job.

We encourage the practice of using BODS to create tables, vs. doing that with a modeling tool and DDL outside of BODS. Special “setup environment” or “create tables” jobs can be created, using template tables as the target, in lieu of external DDL scripts, and this helps keep everything self-contained in the central repository. BODS jobs can also be written to be self-checking and self-initializing, running scripts that check for the existence of objects (typically DBMS tables) and conditionally taking action to initialize those tables. Where you have need for advanced logical or physical data modeling, this won’t work, or will only work partially; it would be a stretch to write lots of advanced DDL script and execute it from BODS (although, yes, you could). But for many purposes, regarding tables, all you need is the table and a primary key, which a BODS template table will handle just fine.
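
The “check, then conditionally initialize” pattern can be sketched outside BODS as follows. This is a Python illustration against an in-memory SQLite database, not BODS script language, and the DIM_CUSTOMER column names are assumptions made up for the example.

```python
import sqlite3


def table_exists(conn, table_name):
    """Return True if the table is already present in the SQLite catalog."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


def ensure_dim_customer(conn):
    """Create DIM_CUSTOMER only if it is missing; return True if it was created."""
    if table_exists(conn, "DIM_CUSTOMER"):
        return False
    # Just the table and a primary key, as in the template-table case.
    conn.execute(
        "CREATE TABLE DIM_CUSTOMER ("
        "CUSTOMER_KEY INTEGER PRIMARY KEY, "
        "CUSTOMER_NAME TEXT)"
    )
    conn.commit()
    return True


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(ensure_dim_customer(connection))  # first run creates the table
    print(ensure_dim_customer(connection))  # second run finds it and does nothing
```

The point is that the step is safe to run at the start of every job execution, so a developer starting from an empty schema needs no external DDL script.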

Centralize and Standardize System Configurations
System configurations are not, unfortunately, objects we can put in a central repository, but they need to be treated like code. Designate a person to be the keeper or manager of system configurations, data store objects, data store configurations, and substitution parameters, which all work in concert. Post the ATL files for system configurations and substitution parameters in a central network share. Each developer on the team should, every day, do an import of the latest official system configuration and substitution parameter ATL files from this share.

Everybody Starts Fresh on Everything Every Morning
Each developer must start the day, each day, every day, by performing a Get Latest Version of the job (or jobs) in question and all dependents, for any job or jobs the developer intends to work on that day. Each developer should also import the latest ATL files for the system configurations and substitution parameters.

The point of continuous integration is to continuously (at least daily) integrate-and-test to avoid serious divergence and speed overall efficiency and code quality. Will developers experience unpleasant surprises after doing complete Get Latest Version operations? Of course. But the surprises should always be from recent changes, and relatively easy to find and resolve. It is always easier to code in isolation in the short term, but the short-term productivity gains of ignoring coordination are paid for in spades later. Thus: developers are not allowed to pick up where they left off on modifications to units which they’ve had continuously checked-out and uncommitted for days. At the beginning of each work day, each developer must re-align to the mainline or code base, and start from there, from an up-to-date base.

Developers Always and Only Check-Out Units They Intend to Modify Soon
Once a developer has a fresh code base, they check-out the specific unit or units they intend to modify soon, that is, within a few hours. They do not preemptively check-out large branches of code containing far more units than they could reasonably modify in that time.

If they want to make large structural changes to flow units such as workflows and conditionals, and want to perform check-out-with-dependents of the root workflow of a large branch to get everything at once for some major reorganization effort, then (in this typical example) they should immediately undo the checkout for all the dataflows within and any small workflows known to be irrelevant to this high-level restructuring.

Developers should think of checking-out an object as an act of communicating to their team members: “Hey, everybody, I’m actively working on this, right now.” If you check something out, but aren’t working on it, you’re misleading and confusing your team members.

Avoid Checkout Without Replacement
The “checkout without replacement” operation causes confusion, because it’s almost always used to get changes to a unit uploaded to the central repository in the absence of having properly checked-out the object beforehand.

Let’s say that on Tuesday, unbeknownst to each other, Tom and Kathy both decide to work on dataflow DF_ABC, but only Tom remembers to check it out. If Kathy had remembered, she would have found that Tom already had DF_ABC out, and would have been warned that whatever she intended to do in parallel would need to be manually merged with Tom’s changes; remember, a primary benefit of checking-out objects is to communicate. Indeed, Kathy should find something else to do: it makes little sense for her to work on DF_ABC if Tom’s got it checked-out and, presumably, is making changes she can’t see yet. But Kathy forgets checking-out, and forgets to check to see if anybody else has DF_ABC checked-out, and starts making complicated adjustments to the dataflow. At around 3pm that day, Tom finishes with his changes and checks-in the code. At 6pm, Kathy is finished for the day, and, satisfied with her changes, needs to get her code committed to the central repository. Only then does she remember that she did the day’s work on an object she hadn’t checked-out.

What should she do? She can certainly perform a checkout without replacement and upload her new version of DF_ABC. But she worked off a dirty code base: her coding didn’t reflect Tom’s changes, which paralleled hers and were committed in advance. Tomorrow morning, when Tom performs a Get Latest Version on the job, his changes to DF_ABC will all have disappeared in favor of Kathy’s, and after a round of recriminations and hurt feelings, they’ll need to manually piece through their parallel efforts, that is, they will need to spend some time in integration hell.

Avoid checkout without replacement.

If a developer does forget to checkout, however, all is not lost. The developer has two options:

If the object in question has not been versioned by anybody else since the beginning of the day, then the developer can safely perform a checkout without replacement and check the changes back in. No harm done.
If the object has been versioned, the offending developer should go ahead and perform a checkout without replacement and check-in, creating a new version, but then immediately use the comparison features in Data Services to see how the two versions differ and work with the other developer to integrate as necessary, expressing apologies.
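
The two options above boil down to one comparison: has the central version moved since the developer last synchronized? A minimal sketch, assuming version numbers are plain integers that only ever increase; the function and enum names are hypothetical.

```python
from enum import Enum


class Recovery(Enum):
    CHECK_IN_DIRECTLY = "checkout without replacement, then check in"
    CHECK_IN_THEN_MERGE = (
        "checkout without replacement, check in, "
        "then compare versions and merge with the other developer"
    )


def recover_forgotten_checkout(version_at_start_of_day, latest_central_version):
    """Decide how to recover after editing an object that was never checked out."""
    if latest_central_version < version_at_start_of_day:
        raise ValueError("central version cannot be older than the synced version")
    if latest_central_version == version_at_start_of_day:
        # Nobody else versioned the object today: no harm done.
        return Recovery.CHECK_IN_DIRECTLY
    # Someone else checked in meanwhile: their changes must be merged back.
    return Recovery.CHECK_IN_THEN_MERGE


if __name__ == "__main__":
    # Kathy synced DF_ABC at version 7 in the morning; Tom checked in version 8 at 3pm.
    print(recover_forgotten_checkout(7, 8).value)
```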

Always Test Against the Very Latest Code Before Checking-In
Another way of saying this is “Never knowingly break the current job” or “Never commit from dirty code.”

Before checking-in changed units, developers are responsible for making sure they pass both their own unit testing (of course) and mini-integration testing, i.e., running the relevant jobs successfully, against up-to-the-minute latest code. If you work on a given unit till, say, 3pm, you can and should assume that other developers will have committed updates to other units earlier that day. Those updates haven’t been made in relation to your changes (you haven’t committed yet, so they’ve been working off the version as of that morning), but they’ve done their part to not break the job(s). Before you check-in your unit(s), you must perform a Get Latest Version of the entire job again, as you did that morning, and make sure your units still work with the changes that have been committed so far from everybody else. (Your code units are in a checked-out state and will not be overwritten by a Get Latest Version operation.)

Everyone Commits Everything Daily
Developers should check-in their changed code at least once per day, and preferably more often. Code should not be left in a checked-out state overnight. The divergence that leads to integration hell grows wider the longer code is modified and not returned to the mainline, and under Continuous Integration, a day’s worth of changes is the limit of tolerance for this divergence.
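
A team could police the “nothing stays checked-out overnight” rule with a simple report. The sketch below assumes the checkout list is available as `(object, developer, checked_out_at)` tuples; how that list would be obtained from a central repository is not covered here, and all names are hypothetical.

```python
from datetime import datetime


def overnight_checkouts(checkouts, now):
    """Return (object, developer) pairs checked out before today's date."""
    return [
        (name, developer)
        for name, developer, checked_out_at in checkouts
        if checked_out_at.date() < now.date()
    ]


if __name__ == "__main__":
    now = datetime(2012, 10, 19, 9, 0)
    checkouts = [
        ("DF_ABC", "tom", datetime(2012, 10, 19, 8, 30)),    # checked out this morning
        ("DF_XYZ", "kathy", datetime(2012, 10, 18, 16, 0)),  # left out overnight
    ]
    print(overnight_checkouts(checkouts, now))
```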

What if a developer checks-out a unit in the morning, works all day to make changes, and still doesn’t have it working by the end of the day? Then in general that developer is biting off more than they can chew, and needs to decompose the work into smaller pieces.

A developer may not check-in broken code to the mainline, period. If he finds himself at the end of the day with a broken, in-process unit, and there are no smaller pieces of it which can be committed, then he will simply need to make a duplicate of the code and undo the morning’s checkout operation on the still-broken unit. He is not allowed to retain it overnight in a checked-out state. In the morning, he’ll get a fresh copy of the unit in question (with no guarantee that someone else may not have modified it during the night), and will need to start afresh on the changes.

ETL Doctor Data Services Best Practice for Team Development and Continuous Integration.pdf (121.0 KB)

JeffPrenevost (BOB member since 2010-10-09)

system · October 24, 2012, 7:52am

Is there a specific reason/motivation for wanting to remove the development local repositories? I personally find them very useful and wouldn’t want to see them removed, but I remember some discussions with the Max Attention team about making the DEV/TEST/PROD promotion process easier and this resulted in the removal of local repositories.

Nemesis (BOB member since 2004-06-09)

system · October 25, 2012, 6:18pm

No, I still want local repositories. Sometimes I write quick-n-dirty jobs that have no business being in a central repository.

eganjp (BOB member since 2007-09-12)

system · October 26, 2012, 5:31pm

That’s a good one, Jim.

No particular reason. I don’t like having to create so many repositories, one per developer, copying objects back and forth if not needed is another overhead. But no particular reason for having to remove local repos. Just checking…

Werner Daehn (BOB member since 2004-12-17)

system · October 27, 2012, 1:01pm

At least, centralized control of datastores, system configurations, and substitution parameters might be nice. Local control over that set of objects frequently leads to a mess.

I think an architecture in which four people can easily have four uncoordinated, divergent copies of DF_DIM_CUST, four versions of the DIM_CUST table, etc. is questionable. You can avoid integration hell under a tight team protocol of check-out, check-in, etc., but the current architecture encourages divergence with its focus on local repositories. If, to make a change to DF_DIM_CUST, Bob has to open it from the central code base, and anybody else attempting to do so while he’s got it open has to do so in read-only mode (as though it were a Word document on a file share), with no such thing as a local repo, and no ability to “save local” in any way – that, in many team environments, would be a very good thing.

JeffPrenevost (BOB member since 2010-10-09)

system · October 29, 2012, 6:07pm

hi

I think I said this already in previous posts, but I really don’t like the use of databases as a place where code is written to and maintained, and would much prefer a file-based approach.

I’ve a background in Java, where version control, code integration, promotion etc. are much easier, especially when you already have tools such as ant, maven, subversion etc. doing all this for you. If SDS were file based then all these tools would be available to exploit, with no need to re-invent the wheel.

The other advantage of being able to use file based repos is that you can package all the other components of the project in same repo such as release notes, deployment guides and, hopefully in future, universes and reports if BusObj just use subversion without having to wrap it in LCM.

As well as this the other main advantages of file based code are,

offline working
You may not be able to execute a job if you’re disconnected from the job server or source/target databases, but you can still do some basic work: code review, add comments, minor edits etc. Also, I can easily load an ATL file without having to worry about it overwriting content in my local repo, as I can just save it to another folder on my desktop.

offshore/remote development
When working with offshore teams they cannot use a local install of SDS as the latency between the offshore desktop and the onshore RDBMS server is too great - I’ve seen logon taking 2 mins, retrieving list of objects taking 5-10 mins etc. A workaround is to build a citrix server hosting an SDS Designer install or replicate all databases offshore - again costly. If file based then code is written to local file system and can then be submitted to central server for execution.

I know this isn’t answering your question but with file based code your “local” repository is just the local hard disk of the developer. You still need a centralised location for integrated code but this is then just any version control tool.

AL

agulland (BOB member since 2004-03-17)

system · October 31, 2012, 1:55pm

I understand the idea and goal. But how would the jobserver and engine execute something off your local files?

Werner Daehn (BOB member since 2004-12-17)

system · October 31, 2012, 2:18pm

And that right there is the Kiss of Death for using files for a repository.

A file would work well if it was on your local drive but it has to be accessible to the job server which may be on a different machine. Now you’re manipulating a potentially large file across the network. (Yes, I know that development can be done without connecting to a job server.)

No thank you, I’ll keep my repository in a database.

In my opinion, SAP should leverage the excellent desktop database from Sybase called SQL Anywhere for situations where a remote repository is to be used.

eganjp (BOB member since 2007-09-12)

system · October 31, 2012, 6:18pm

Ah that would be the technical challenge for SAP dev!

We know that designer can read log files generated on the job server so one option would be something similar in reverse.

Or when the developer executes a job Designer submits the ATL (for that job and any dependencies) to the job server.

Or a local jobserver runs on developers desktop

Yes, it would be quite a big change, but I believe it would be worth it long term: to have a single repo for the full BI project containing everything (ETL, reports, universes, data models, release notes and other docs etc.).

al

agulland (BOB member since 2004-03-17)

system · October 31, 2012, 9:03pm

I was just thinking on this for few minutes today…

I believe that the existing setup holds up very well, except for the CHECK-IN/CHECK-OUT process…

–> I have certain ugly situations where I have to aggressively solve the problem… In which case, as Jim said, I can’t write those funky little jobs in CENTRAL; for that reason I need LOCAL

–> Next, Jeffprenevost said about centralized Datastores…Yes I like it, but again for a PROD issue solver, I need a right to create my own Datastore and try loading something and delete it off!!!

Maybe I can think of a collaborative approach. Local Repo development can be plucked off in certain cases and give local repo rights to so called FireFighters maybe? Having said that, it will create a mess around the working piece of code…

So the best way is to maintain your piece of code yourself. I will review it and take it up!
If code is created directly in CENTRAL and a job is developed by, say, a fairly new developer on the team who doesn’t know the exact standards, all those non-standard jobs will sit there as one piece of crap. I had a code reviewer who would ask about the existence of each query transform… What would he do?

So I vote for not plucking off the LOCAL repo concept!

ganeshxp (BOB member since 2008-07-17)

system · November 1, 2012, 1:22am

I agree with ErikR - Keep local repo and central.

But, please… please… improve the functioning of Central with labelling, moving labels, branching, etc., like other versioning products. I’m using BODS 3.x and labelling is a pain for us, especially in a large deployment: after an initial label of all objects, if I then turn around and make a minor change to one of the objects, I cannot re-use the same label. Why???
I should be able to re-use the same label on the latest object.

chiha (BOB member since 2011-06-10)

system · March 13, 2013, 4:40pm

Another welcomed feature would be the ability to pull the latest version from the central repository using a command line option for the purposes of building an automated Continuous Integration solution. For a local repo, we have the al_engine command, but strangely this doesn’t exist for a central repository. This makes it impossible to script a Continuous Integration solution which automatically runs tests against the latest version of the central repository every night.

It seems a bit odd that one has to move away from the central repository, i.e. to a local repository in order to perform an export.

Please correct me if I have overlooked some feature, but from what I have read, and having talked to my BODS colleagues, the only way to get the latest version from the central repository is via the Data Integrator tool using manual steps.

gje306 (BOB member since 2013-01-28)

system · March 13, 2013, 4:48pm

OK, you’re wrong. There is another way but it is not one that most people would care to explore. It is possible to create your own ATL file by directly reading the repository tables. This requires an advanced level of programming as well as a rather detailed knowledge of the repository tables.

eganjp (BOB member since 2007-09-12)

system · March 13, 2013, 5:13pm

That doesn’t seem challenging enough to me, why not go down a level and reverse engineer the underlying db binary representation

gje306 (BOB member since 2013-01-28)