解决-进一步分析与改造[一]
# 背景与现象还原
在 Rocky Linux 8.10 环境下,Ambari 3.0.0 升级到 JDK 17 后,服务启动常见卡死在数据源初始化阶段,表现为长时间无响应。通过截取的典型日志片段如下:
23:13:10.870 [main] INFO org.apache.ambari.server.orm.dao.HostRoleCommandDAO -- Host role command status summary cache enabled !
23:13:10.874 [main] INFO org.apache.ambari.annotations.TransactionalLock$LockArea -- LockArea HRC_STATUS_CACHE is enabled
23:13:10.945 [main] INFO org.apache.ambari.server.configuration.AmbariServerConfigurationProvider -- Registered org.apache.ambari.server.ldap.service.AmbariLdapConfigurationProvider in event publisher
log4j:WARN No such property [fileNamePattern] in org.apache.log4j.RollingFileAppender.
23:13:11.165 [main] INFO org.apache.ambari.server.logging.LockFactory -- Lock profiling is disabled
23:13:11.170 [main] INFO org.apache.ambari.server.controller.AmbariServer -- Getting the controller
23:13:11.170 [main] DEBUG org.apache.ambari.server.controller.AmbariServer -- Proxy authentication not specified
23:13:11.665 [main] WARN eclipselink -- [EL Warning]: 2025-06-14 23:13:11.66--ServerSession(1136705850)-- You have specified multiple ids for the entity class [org.apache.ambari.server.orm.entities.KerberosKeytabServiceMappingEntity] without specifying an @IdClass. By doing this you may lose the ability to find by identity, distributed cache support etc. Note: You may however use EntityManager find operations by passing a list of primary key fields. Else, you will have to use JPQL queries to read your entities. For other id options see @PrimaryKey.
23:13:11.703 [main] WARN eclipselink -- [EL Warning]: 2025-06-14 23:13:11.703--ServerSession(1136705850)-- The reference column name [resource_type_id] mapped on the element [field permissions] does not correspond to a valid id or basic field/column on the mapping reference. Will use referenced column name as provided.
23:13:11.938 [main] INFO org.apache.ambari.server.checks.DatabaseConsistencyCheckHelper -- Checking DB store version
23:13:11.938 [pool-2-thread-1] INFO org.apache.ambari.server.configuration.AmbariServerConfigurationProvider -- JPA initialized event received: JpaInitializedEvent{eventType=JPA_INITIALIZED}
23:13:11.938 [pool-2-thread-1] INFO org.apache.ambari.server.configuration.AmbariServerConfigurationProvider -- Loading ldap-configuration configuration data
23:13:11.946 [pool-2-thread-1] INFO org.apache.ambari.server.configuration.AmbariServerConfigurationProvider -- =================Current get time is [0] ms
23:16:34.339 [main] INFO eclipselink -- [EL Info]: 2025-06-14 23:16:34.338--ServerSession(1136705850)-- EclipseLink, version: Eclipse Persistence Services - 2.7.14.v20231208-d05cebc9b0
23:16:34.363 [main] INFO com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource -- Initializing c3p0 pool... com.mchange.v2.c3p0.ComboPooledDataSource [ acquireIncrement -> 5, acquireRetryAttempts -> 30, acquireRetryDelay -> 1000, autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false, checkoutTimeout -> 0, connectionCustomizerClassName -> null, connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester, contextClassLoaderSource -> caller, dataSourceName -> 2rwhi8bbusrj4l1eqmvvd|5d9b7a8a, debugUnreturnedConnectionStackTraces -> false, description -> null, driverClass -> org.postgresql.Driver, extensions -> {}, factoryClassLocation -> null, forceIgnoreUnresolvedTransactions -> false, forceSynchronousCheckins -> false, forceUseNamedDriverClass -> false, identityToken -> 2rwhi8bbusrj4l1eqmvvd|5d9b7a8a, idleConnectionTestPeriod -> 7200, initialPoolSize -> 5, jdbcUrl -> jdbc:postgresql://192.168.3.1:5432/ambari, maxAdministrativeTaskTime -> 0, maxConnectionAge -> 0, maxIdleTime -> 14400, maxIdleTimeExcessConnections -> 0, maxPoolSize -> 32, maxStatements -> 0, maxStatementsPerConnection -> 0, minPoolSize -> 5, numHelperThreads -> 3, preferredTestQuery -> SELECT 1, privilegeSpawnedThreads -> false, properties -> {password=******, user=******}, propertyCycle -> 0, statementCacheNumDeferredCloseThreads -> 0, testConnectionOnCheckin -> false, testConnectionOnCheckout -> false, unreturnedConnectionTimeout -> 0, userOverrides -> {}, usesTraditionalReflectiveProxies -> false ]
23:16:34.374 [main] DEBUG com.mchange.v2.cfg.MConfig -- The configuration file for resource identifier '/mchange-commons.properties' could not be found. Skipping.
23:16:34.374 [main] DEBUG com.mchange.v2.cfg.MConfig -- The configuration file for resource identifier '/mchange-log.properties' could not be found. Skipping.
23:16:34.374 [main] DEBUG com.mchange.v2.cfg.MConfig -- The configuration file for resource identifier '/c3p0.properties' could not be found. Skipping.
23:16:34.374 [main] DEBUG com.mchange.v2.cfg.MConfig -- The configuration file for resource identifier 'hocon:/reference,/application,/c3p0,/' could not be found. Skipping.
23:16:34.375 [main] DEBUG com.mchange.v2.resourcepool.BasicResourcePool -- com.mchange.v2.resourcepool.BasicResourcePool@3f1d6a13 config: [start -> 5; min -> 5; max -> 32; inc -> 5; num_acq_attempts -> 30; acq_attempt_delay -> 1000; check_idle_resources_delay -> 7200000; max_resource_age -> 0; max_idle_time -> 14400000; excess_max_idle_time -> 0; destroy_unreturned_resc_time -> 0; expiration_enforcement_delay -> 900000; break_on_acquisition_failure -> false; debug_store_checkout_exceptions -> false; force_synchronous_checkins -> false]
23:16:34.375 [main] DEBUG com.mchange.v2.c3p0.impl.C3P0PooledConnectionPoolManager -- Created new pool for auth, username (masked): 'am******'.
23:16:34.375 [main] DEBUG com.mchange.v2.resourcepool.BasicResourcePool -- acquire test -- pool size: 0; target_pool_size: 5; desired target? 1
23:16:34.375 [main] DEBUG com.mchange.v2.resourcepool.BasicResourcePool -- awaitAvailable(): [unknown]
23:16:34.652 [main] WARN eclipselink -- [EL Warning]: 2025-06-14 23:16:34.652--ServerSession(1136705850)-- Reverting the lazy setting on the OneToOne or ManyToOne attribute [viewInstanceEntity] for the entity class [class org.apache.ambari.server.orm.entities.ViewURLEntity] since weaving was not enabled or did not occur.
23:16:34.653 [main] WARN eclipselink -- [EL Warning]: 2025-06-14 23:16:34.653--ServerSession(1136705850)-- Reverting the lazy setting on the OneToOne or ManyToOne attribute [user] for the entity class [class org.apache.ambari.server.orm.entities.UserAuthenticationEntity] since weaving was not enabled or did not occur.
23:16:34.656 [main] WARN eclipselink -- [EL Warning]: 2025-06-14 23:16:34.656--ServerSession(1136705850)-- Reverting the lazy setting on the OneToOne or ManyToOne attribute [requestEntity] for the entity class [class org.apache.ambari.server.orm.entities.UpgradeEntity] since weaving was not enabled or did not occur.
23:16:34.658 [main] WARN eclipselink -- [EL Warning]: 2025-06-14 23:16:34.658--ServerSession(1136705850)-- Reverting the lazy setting on the OneToOne or ManyToOne attribute [executionCommand] for the entity class [class org.apache.ambari.server.orm.entities.HostRoleCommandEntity] since weaving was not enabled or did not occur.
23:16:34.882 [pool-2-thread-1] INFO org.apache.ambari.server.orm.dao.AmbariConfigurationDAO -- ======== [ldap-configuration] build query time is [5] ms
23:16:34.904 [pool-2-thread-1] INFO org.apache.ambari.server.orm.dao.AmbariConfigurationDAO -- ======== select time is [22] ms
23:16:34.916 [main] INFO org.apache.ambari.server.checks.DatabaseConsistencyCheckHelper -- DB store version is compatible
23:16:34.919 [pool-2-thread-1] INFO org.apache.ambari.server.configuration.AmbariServerConfigurationProvider -- ======== toPropertis time is [0] ,map is [{
}]
23:16:34.919 [pool-2-thread-1] INFO org.apache.ambari.server.configuration.AmbariServerConfigurationProvider -- =================Current Execute loading time is [202974] ms
23:16:34.919 [pool-2-thread-1] INFO org.apache.ambari.server.configuration.AmbariServerConfigurationProvider -- =================Current Execute decryptSensitiveData time is [0] ms
23:16:35.056 [main] INFO org.apache.ambari.server.stack.StackManager -- Initializing the stack manager...
23:16:35.240 [main] DEBUG org.apache.ambari.server.stack.ServiceDirectory -- Service package folder for service AMBARI_INFRA_SOLR-3.0.0 in common-services has been resolved to /var/lib/ambari-server/resources/common-services/AMBARI_INFRA_SOLR/3.0.0/package
23:16:35.240 [main] DEBUG org.apache.ambari.server.stack.ServiceDirectory -- Service folder /var/lib/ambari-server/resources/common-services/AMBARI_INFRA_SOLR/3.0.0/upgradesdoes not exist.
23:16:35.240 [main] DEBUG org.apache.ambari.server.stack.ServiceDirectory -- Service folder /var/lib/ambari-server/resources/common-services/AMBARI_INFRA_SOLR/3.0.0/checksdoes not exist.
23:16:35.240 [main] DEBUG org.apache.ambari.server.stack.ServiceDirectory -- Service folder /var/lib/ambari-server/resources/common-services/AMBARI_INFRA_SOLR/3.0.0/server_actionsdoes not exist.
23:16:35.298 [main] INFO org.apache.ambari.server.stack.ServiceDirectory -- Role command order info was loaded from file: /var/lib/ambari-server/resources/common-services/AMBARI_INFRA_SOLR/3.0.0/role_command_order.json
23:16:35.298 [main] DEBUG org.apache.ambari.server.stack.ServiceDirectory -- Role Command Order for /var/lib/ambari-server/resources/common-services/AMBARI_INFRA_SOLR/3.0.0/role_command_order.json
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
提示
从日志可见,Provider 获取与 JPA 查询本身耗时极短,并未发生数据库瓶颈,反而在 ORM 数据源初始化(EclipseLink + c3p0)时产生了异常耗时,基本排除 DB 端的性能瓶颈假设。
# 线程分工与堆栈特征分析
进一步追踪日志结构可以看到,主线程([main]
)负责 ORM
组件与数据源(EclipseLink/c3p0)的初始化,只有初始化完成后,后台线程如 [pool-2-thread-1]
才能继续执行数据查询、配置加载等操作。此时如果
JVM 主线程迟迟无法返回,就会造成服务整体假死。
注意
Ambari 的 ORM 初始化迟滞,往往意味着底层 JVM 某个 API 卡住,需进一步借助线程分析工具定位根因。
# jstack 实战定位
通过如下命令锁定实际 JVM 进程:
ps -ef | grep ambari-server
1
找到实际 PID 后,执行
jstack <pid>
1
即可获取完整线程堆栈:
节选主线程堆栈关键段:
"main" #1 prio=5 os_prio=0 cpu=4460.52ms elapsed=97.10s tid=0x00007ff6f002ce80 nid=0x232b59 runnable [0x00007ff6f9cbc000]
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(java.base@17.0.15/Native Method)
at sun.security.provider.NativePRNG$RandomIO.readFully(java.base@17.0.15/NativePRNG.java:431)
...
at org.eclipse.persistence.internal.security.JCEEncryptor.encryptPassword(JCEEncryptor.java:102)
...
at org.apache.ambari.server.controller.AmbariServer.main(AmbariServer.java:1096)
1
2
3
4
5
6
7
8
2
3
4
5
6
7
8
主线程卡在 sun.security.provider.NativePRNG
,即 SecureRandom 初始化环节。重点
线程名 | 状态 | 典型调用栈 | 说明 |
---|---|---|---|
main | RUNNABLE | SecureRandom → NativePRNG → /dev/random | 随机数获取阻塞 |
pool-2-thread-1 | BLOCKED | 等待 EntityManagerFactory 初始化 | 等待主线程锁释放 |
其他 GC/监控线程 | RUNNABLE | 系统健康无异常 |
笔记
通过 jstack 可见,绝大多数业务线程都在等待主线程初始化数据源,这也间接说明了性能瓶颈高度集中。
- 01
- bigtop-select 打包缺 compat 报错修复 deb07-16
- 02
- bigtop-select 打包缺 control 文件报错修复 deb07-16
- 03
- 首次编译-环境初始化 必装07-16