Compile tensorflow 2.3 from source with Debug symbols
Source code compiling handbook: https://www.tensorflow.org/install/source
bazel manual: https://docs.bazel.build/versions/master/user-manual.html
Docker environment
- Docker image: tensorflow/tensorflow:2.3.0-gpu
- ubuntu: 18.04
- CUDA: incomplete CUDA 10.1
- gcc: 7.5.0
- python: 3.6.9
Compiling and errors:
Firstly, git clone tensorflow 2.3.1’s cource code
Use deb package install cudnn7.6.5 from Nvidia website
When executing ./configure in the tf source code directory, it prompts that cublas_api.h is missing, because the cuda component in the docker image used is incomplete. (NOTE: Maybe consider using tensorflow devel image.)
Install the deb package of cuda-repo from the Nvidia official website (cuda network installation version deb package).
After the installation is complete, execute apt update. At this time, you can find various cuda component packages from the apt source. But cublas lacks version 10.1.
apt install cuda-libraries-dev-10-1, cuda-libraries-10-1
, then cublas 10.2 version will be installedCopy the cublas related files in the cuda10.2 directory to the cuda10.1 directory.
Execute ./configure in the tf source code directory, choose to support cuda, and do not choose other things such as tensorrt.
Use bazelisk. Try to use bazel’s
-c dbg
compilation option. The compilation command is1
bazel build --config=cuda --strip=never -c dbg --verbose_failures --keep_going //tensorflow/tools/pip_package:build_pip_package
Some files cannot be downloaded by bazel (llvm, aws-sdk), you can download them manually. Then replace the download link of bazel with file:///path/to/downloaded/file (bazel uses curl and supports file://)
aws-checksum compilation error in dbg mode, you could modify third_party/aws/aws-checksums.bazel, add DEBUG_BUILD in gdb mode.
1
2
3
4
5
6
729a30,35
> defines = select({
> "@org\_tensorflow//tensorflow:debug": [
> "DEBUG\_BUILD"
> ],
> "//conditions:default": [],
> }),https://github.com/tensorflow/tensorflow/issues/37498 ,
https://github.com/tensorflow/tensorflow/pull/42743/filesAn error occurred when installing the generated package in the venv virtual environment: invalid command bdist_wheel, pip3 install wheel is required.
After the wheel package is installed, an error occurs when importing tensorflow in python: Prompt that there are undefined symbols in the _pywrap_tensorflow_internal.so file
1
2\_ZN10tensorflow4data12experimental19SnapshotDatasetV2Op11kReaderFuncE
which is: tensorflow::data::experimental::SnapshotDatasetV2Op::kReaderFunc
Switch to tf2.3 version source code. Try to compile again, and still get the same undefined symbol error when importing tensorflow.
Continue to use the source code of version 2.3.0, but give up the -c dgb compilation mode, use -c opt, and adjust the compilation command
1
bazel build --config=cuda -c opt --copt -g --strip=never --keep_going --verbose_failures //tensorflow/tools/pip_package:build_pip_package
Also delete DEBUG_BUILD of aws-checksums. The we could successfully Compile, install, and run.
Debug for tensorflow
- tensorflow has LOG messages and VLOG messages.
TF_CPP_MIN_LOG_LEVEL controls LOG messages.
It is the usual log level (INFO=0, WARNING=1, ERROR=2, etc.).
VLOG messages are controled by TF_CPP_MIN_VLOG_LEVEL,
and are actually always logged at the INFO log level.
It means that in any case,
TF_CPP_MIN_LOG_LEVEL=0 is needed to see any VLOG message.
TF_CPP_MIN_VLOG_LEVEL defaults to 0 and as it increases,
more debugging messages are logged in.
Consider adding the following lines before importing tensorflow:1
2
3os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'
os.environ['TF_CPP_MIN_VLOG_LEVEL'] = '3'
os.environ['TF_DUMP_GRAPH_PREFIX'] = '/tmp/tf_dump_graph'