Compiling a debug build of the TensorFlow whl package from source

Compile TensorFlow 2.3 from source with debug symbols

Docker environment

  • Docker image: tensorflow/tensorflow:2.3.0-gpu
  • Ubuntu: 18.04
  • CUDA: 10.1 (incomplete; cublas_api.h is missing)
  • gcc: 7.5.0
  • Python: 3.6.9

Compiling and errors:

  • First, git clone the TensorFlow 2.3.1 source code.

  • Install cuDNN 7.6.5 using the deb package from the Nvidia website.

  • Running ./configure in the TF source directory reports that cublas_api.h is missing, because the CUDA components in this Docker image are incomplete. (NOTE: consider using a TensorFlow devel image instead.)

  • Install the cuda-repo deb package from the official Nvidia website (the network-installer deb package).

  • After the installation completes, run apt update. The individual CUDA component packages are now available from the apt source, but there is no cuBLAS package for version 10.1.

  • Run apt install cuda-libraries-dev-10-1 cuda-libraries-10-1. This installs cuBLAS version 10.2.

  • Copy the cuBLAS-related files from the CUDA 10.2 directory to the CUDA 10.1 directory.

  • Run ./configure in the TF source directory again, enable CUDA support, and leave other options such as TensorRT disabled.

  • Use bazelisk. First try bazel's -c dbg compilation mode. The build command is:

    bazel build --config=cuda --strip=never -c dbg --verbose_failures --keep_going //tensorflow/tools/pip_package:build_pip_package

  • Some files (llvm, aws-sdk) cannot be downloaded by bazel. Download them manually, then replace the download URL used by bazel with file:///path/to/downloaded/file (bazel accepts file:// URLs).

  • aws-checksums fails to compile in dbg mode. As a workaround, modify third_party/aws/aws-checksums.bazel so that DEBUG_BUILD is defined when building in dbg mode:

    29a30,35
    >     defines = select({
    >         "@org_tensorflow//tensorflow:debug": [
    >             "DEBUG_BUILD"
    >         ],
    >         "//conditions:default": [],
    >     }),

    See https://github.com/tensorflow/tensorflow/issues/37498 and
    https://github.com/tensorflow/tensorflow/pull/42743/files

  • Installing the generated package in a venv virtual environment fails with "invalid command bdist_wheel". Running pip3 install wheel first fixes this.

  • After the wheel package is installed, importing tensorflow in Python fails with an undefined symbol in _pywrap_tensorflow_internal.so:

    _ZN10tensorflow4data12experimental19SnapshotDatasetV2Op11kReaderFuncE
    which is: tensorflow::data::experimental::SnapshotDatasetV2Op::kReaderFunc

  • Switch to the TF 2.3 source code and compile again. Importing tensorflow still fails with the same undefined symbol error.

  • Stay on the 2.3.0 source code, but give up the -c dbg compilation mode. Use -c opt instead and adjust the build command:

    bazel build --config=cuda -c opt --copt -g --strip=never --keep_going --verbose_failures //tensorflow/tools/pip_package:build_pip_package

  • Also remove the DEBUG_BUILD define from aws-checksums. With these changes the package compiles, installs, and runs successfully.
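
The cuBLAS workaround in the steps above (copying the cuBLAS files that apt placed under the CUDA 10.2 directory into the CUDA 10.1 directory) can be sketched as a small script. This is only a sketch: the `include`/`lib64` layout, the `*cublas*` file name pattern, and the helper name `copy_cublas_files` are assumptions, not details taken from the original setup, so check where the packages were actually installed before relying on it.

```python
import shutil
from pathlib import Path


def copy_cublas_files(src_root, dst_root, subdirs=("include", "lib64")):
    """Copy files and symlinks whose name contains 'cublas' from src_root to dst_root.

    The subdirectory names are an assumed CUDA layout. Symlinks are
    recreated as symlinks instead of being followed.
    """
    copied = []
    for sub in subdirs:
        src_dir = Path(src_root) / sub
        if not src_dir.is_dir():
            continue  # this part of the assumed layout does not exist
        dst_dir = Path(dst_root) / sub
        dst_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted(src_dir.glob("*cublas*")):
            if src.is_dir() and not src.is_symlink():
                continue  # only plain files and symlinks are copied
            dst = dst_dir / src.name
            if dst.exists() or dst.is_symlink():
                dst.unlink()  # replace a stale file or link
            shutil.copy2(src, dst, follow_symlinks=False)
            copied.append(dst)
    return copied
```

Under the assumption that the toolkits live in `/usr/local/cuda-10.2` and `/usr/local/cuda-10.1`, the call would be `copy_cublas_files("/usr/local/cuda-10.2", "/usr/local/cuda-10.1")`. Writing under `/usr/local` typically requires root privileges.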

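For the archives that bazel fails to download, the step above replaces the download link with a `file://` URL. A small helper can produce that URL for a manually downloaded file, together with its SHA-256 digest, which can be compared with the `sha256` value listed for that archive in the bazel workspace files. This is a minimal sketch; the function name `local_archive_info` is hypothetical.

```python
import hashlib
from pathlib import Path


def local_archive_info(path, chunk_size=1 << 20):
    """Return (file_url, sha256_hex) for a manually downloaded archive."""
    archive = Path(path).resolve()
    digest = hashlib.sha256()
    with archive.open("rb") as f:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return archive.as_uri(), digest.hexdigest()
```

If the digest does not match the expected value, the manual download is probably incomplete or is a different version of the archive.
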
Debugging TensorFlow

  • TensorFlow has LOG messages and VLOG messages.
    TF_CPP_MIN_LOG_LEVEL controls LOG messages;
    it is the usual log level (INFO=0, WARNING=1, ERROR=2, etc.).
    VLOG messages are controlled by TF_CPP_MIN_VLOG_LEVEL
    and are always logged at the INFO log level,
    so TF_CPP_MIN_LOG_LEVEL=0 is required to see any VLOG message.
    TF_CPP_MIN_VLOG_LEVEL defaults to 0; as it increases,
    more debugging messages are logged.
    Consider adding the following lines before importing tensorflow:

    import os

    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'
    os.environ['TF_CPP_MIN_VLOG_LEVEL'] = '3'
    os.environ['TF_DUMP_GRAPH_PREFIX'] = '/tmp/tf_dump_graph'
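
Since the advice above is to set these variables before importing tensorflow, a small guard can catch the mistake of setting them too late. The following is a minimal sketch: the helper name `configure_tf_logging` and its default arguments are hypothetical, and only the three environment variable names come from the text above.

```python
import os
import sys


def configure_tf_logging(log_level=0, vlog_level=3,
                         dump_prefix="/tmp/tf_dump_graph"):
    """Set the TensorFlow logging variables, refusing to run after import."""
    if "tensorflow" in sys.modules:
        # Too late: tensorflow has already been imported in this process.
        raise RuntimeError(
            "call configure_tf_logging() before importing tensorflow")
    settings = {
        "TF_CPP_MIN_LOG_LEVEL": str(log_level),
        "TF_CPP_MIN_VLOG_LEVEL": str(vlog_level),
        "TF_DUMP_GRAPH_PREFIX": dump_prefix,
    }
    os.environ.update(settings)
    return settings
```

Calling `configure_tf_logging()` at the top of a script, followed by `import tensorflow`, gives the same settings as the three lines above.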