A common symptom pattern includes the following: The OCI SDN connector is enabled and configured with metadata IAM. OCI endpoints resolve successfully. Instance metadata is available and shows valid instance identity information.
InvalidParameter.
Invalid Certs.
401 NotAuthenticated
This workflow is intended to isolate the exact failure domain and distinguish:
This workflow applies to environments where: The OCI SDN connector is operational on one HA member. The same connector appears down on another member. The affected member shows authentication failures during OCI token generation.
Solution: Check HA status first:
get system ha status
diagnose sys ha status
diagnose sys ha checksum show
Expected result: The primary node should be identified clearly. The secondary node may report passive/standby behavior for connector updates. HA checksum should be in sync if the configuration is consistent across members.
Example observations: A healthy member may operate as primary and update the connector normally. An affected member may report secondary mode and not actively update the connector, but local authentication tests should still be evaluated separately.
Relevant behavior observed in the case:
HA state: primary
HA state: secondary
ocid running in secondary mode, won't update
Important note: 'secondary' status alone does not explain 'Invalid Certs' or 'NotAuthenticated'. The HA role explains update behavior, but not a local token generation failure.
Check the full connector definition.
show full-configuration system sdn-connector
Expected result: Connector should show:
set type oci
set use-metadata-iam enable
set ha-status enable
Valid: tenant-id
Valid: compartment-list
Expected: oci-region-type
Expected: update-interval
Example connector definition:
config system sdn-connector
edit "OCI"
set status enable
set type oci
set use-metadata-iam enable
set ha-status enable
set tenant-id "ocid1.tenancy.oc1...."
config compartment-list
edit "ocid1.compartment.oc1...."
next
end
set oci-region-type commercial
set update-interval 60
next
end
If the healthy and affected nodes show the same connector configuration and HA checksum is in sync, the problem is likely not caused by connector configuration drift.
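The comparison described above can be sketched offline once the 'show full-configuration system sdn-connector' output of each member has been saved as text. This is a minimal sketch, not a Fortinet tool: the function names are hypothetical, and it assumes the two outputs differ only where the configuration really differs.

```python
import difflib


def normalize(config_text):
    """Strip indentation and drop blank lines so formatting is not reported as drift."""
    return [line.strip() for line in config_text.splitlines() if line.strip()]


def connector_drift(healthy_text, affected_text):
    """Return the added/removed lines between two saved connector definitions."""
    diff = difflib.unified_diff(
        normalize(healthy_text),
        normalize(affected_text),
        fromfile="healthy",
        tofile="affected",
        lineterm="",
        n=0,  # no context lines, only the differences
    )
    # Keep only real changes; skip the '---'/'+++' file headers and '@@' hunk markers.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```

An empty list supports the conclusion that the connector definition is identical on both members; any returned line points to the setting that should be reviewed first.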
Run:
diagnose sys sdn status
Healthy output example:
SDN Connector Type Status
-------------------------------------------------------------
OCI-PRD-ASH oci Up
Failing output example:
SDN Connector Type Status
-------------------------------------------------------------
OCI-PRD-ASH oci Down
In the analyzed case, the healthy member reported 'Up' and the affected member reported 'Down'.
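When the status output is collected from several members, it can be reduced to a simple name-to-status mapping. The following is a sketch under assumptions: the function name is hypothetical, connector names are assumed to contain no spaces, and the three-column layout is assumed to match the examples above.

```python
def parse_sdn_status(output):
    """Map each connector name to its lower-cased status ('up' or 'down')."""
    connectors = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows have exactly three columns: name, type, status.
        # The header row (four words) and the dashed separator are skipped.
        if len(parts) == 3 and parts[2].lower() in ("up", "down"):
            name, _connector_type, status = parts
            connectors[name] = status.lower()
    return connectors
```

Lower-casing the status avoids false mismatches when the same state is written with different capitalization.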
List the supported test options:
diagnose test application ocid -1
Example output:
1. list sdn connectors
2. filter list test
3. list available compartment
4. HA test
5. print nic metadata
6. instance metadata
7. force token refresh
8. list compartments in HA
99. restart
These commands help isolate the stage where the failure occurs. The same command set was available on the healthy member used as the baseline.
Run:
diagnose test application ocid 3
Healthy output example:
Available Compartments for OCI:
HUB_Network (ocid1.compartment.oc1....)
Failing output example:
OCI has no active compartment
In the analyzed case, the failing member repeatedly reported 'OCI has no active compartment'. This condition followed the authentication failure and should usually be interpreted as a downstream effect, not the first failure point.
Run:
diagnose test application ocid 5
This command is useful because it may show both the endpoint resolution and the result of the token request in a single output.
Healthy pattern: A healthy member may show successful endpoint resolution and continue with compartment validation or inventory collection. Example:
core api endpoint iaas.sa-xxxxx-1.oraclecloud.com is resolved at 140.x.x.x
identity api endpoint identity.sa-xxxxx-1.oraclecloud.com is resolved at 140.x.x.x
Failing pattern: A failing member may show the following sequence:
ocid api url: https://auth.sa-xxxxx-1.oraclecloud.com/v1/x509, ret: 400
http response err: 400
{
"code" : "InvalidParameter",
"message" : "Invalid Certs"
}
OCID failed to get metadata token
core api endpoint iaas.sa-xxxxx-1.oraclecloud.com is resolved at 140.x.x.x
identity api endpoint identity.sa-xxxxx-1.oraclecloud.com is resolved at 140.x.x.x
rsa key file open error: /etc/cert/local/root_.key
ocid api url: https://identity.sa-xxxxx-1.oraclecloud.com/20160918/compartments/ocid1.compartment.oc1...., ret: 401
http response err: 401
{
"code" : "NotAuthenticated",
"message" : "The required information to complete authentication was not provided or was incorrect."
}
rsa key file open error: /etc/cert/local/root_.key
ocid api url: https://identity.sa-xxxxx-1.oraclecloud.com/20160918/availabilityDomains?compartmentId=ocid1.compartment.oc1...., ret: 400
http response err: 400
<h1>Bad Message 400</h1><pre>reason: Illegal character CNTL=0x1</pre>
ocid failed to list availability domain
Interpretation: Endpoint resolution is successful. The failure happens first at /v1/x509. 401 NotAuthenticated follows the failed token request. Availability domain lookup fails after authentication has already failed.
This exact sequence was observed on the affected member.
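Because later errors are consequences of the first one, it helps to extract the first API call that returned an error status. The sketch below assumes the 'ocid api url: <url>, ret: <code>' line format shown in the example output; the function name is hypothetical.

```python
import re

# Matches lines such as: ocid api url: https://host/path, ret: 400
API_LINE = re.compile(r"ocid api url: (\S+), ret: (\d+)")


def first_failed_call(output):
    """Return (url, status) of the first API call with status >= 400, or None."""
    for match in API_LINE.finditer(output):
        url, status = match.group(1), int(match.group(2))
        if status >= 400:
            return url, status
    return None
```

Applied to the failing pattern above, the first failed call is the request to /v1/x509, which matches the interpretation that the 401 and the availability domain error are downstream effects.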
Run:
diagnose test application ocid 6
Expected result: The output shows the instance name, instance OCID, compartment OCID, OCI region, availability domain, and realm domain.
Example output:
Instance Name: xxxxxfortinet02
Instance Id: ocid1.instance.oc1.sa-xxxxx-1....
Compartment Id: ocid1.compartment.oc1....
OCI Region: sa-xxxxx-1
Availability Domain: gVeo:SA-xxxxx-1-AD-1
Realm Domain: oraclecloud.com
If this command succeeds while 'ocid 5' fails with 'Invalid Certs', the failure domain is narrowed to the token/authentication stage rather than metadata access. This exact condition was observed on the affected node.
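To confirm that metadata access is complete rather than merely reachable, the 'ocid 6' output can be checked for the expected fields. This is an illustrative sketch: the field labels are taken from the example output above, and the function names are hypothetical.

```python
REQUIRED_FIELDS = (
    "Instance Name",
    "Instance Id",
    "Compartment Id",
    "OCI Region",
    "Availability Domain",
    "Realm Domain",
)


def parse_metadata(output):
    """Parse 'Label: value' lines into a dict, splitting only on the first colon."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(fields):
    """Return the expected labels that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

Splitting on the first colon only keeps values such as the availability domain, which itself contains a colon, intact.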
Run:
diagnose test application ocid 7
Healthy output example:
Instance Principal Token has been refreshed
Failing output example:
metadata url: http://169.x.x.x/opc/v2/identity/cert.pem
metadata url: http://169.x.x.x/opc/v2/identity/key.pem
metadata url: http://169.x.x.x/opc/v2/identity/intermediate.pem
ocid api url: https://auth.sa-xxxxx-1.oraclecloud.com/v1/x509, ret: 400
http response err: 400
{
"code" : "InvalidParameter",
"message" : "Invalid Certs"
}
OCID failed to get metadata token
Failed to refresh token
In the analyzed case, the healthy member refreshed the token successfully, while the affected member failed with the same 'Invalid Certs' pattern during token refresh.
Use the following sequence:
diagnose debug reset
diagnose debug console timestamp enable
diagnose debug application ocid -1
diagnose debug enable
Then repeat relevant tests:
diagnose test application ocid 4
diagnose test application ocid 5
diagnose test application ocid 7
Disable debug when finished:
diagnose debug disable
Typical failing pattern: Metadata URLs are accessed. /v1/x509 returns Invalid Certs. Token creation fails. Compartment validation fails. NotAuthenticated is returned.
Example debug fragment:
2026-03-19 14:45:25 ocid stats: secondary
2026-03-19 14:45:25 metadata url: http://169.254.169.254/opc/v2/identity/cert.pem
2026-03-19 14:45:25 metadata url: http://169.254.169.254/opc/v2/identity/key.pem
2026-03-19 14:45:25 metadata url: http://169.254.169.254/opc/v2/identity/intermediate.pem
2026-03-19 14:45:25 ocid api url: https://auth.sa-santiago-1.oraclecloud.com/v1/x509, ret: 400
2026-03-19 14:45:25 http response err: 400
{
"code" : "InvalidParameter",
"message" : "Invalid Certs"
}
2026-03-19 14:45:25 OCID failed to get metadata token
This pattern indicates that the failure is in OCI token generation/authentication, not in initial metadata access.
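The ordering argument above (metadata is fetched first, the token request fails afterwards) can be verified on a captured debug log. The following sketch assumes the timestamp format produced by 'diagnose debug console timestamp enable' as shown in the fragment; the function names are hypothetical.

```python
import re
from datetime import datetime

# A timestamped debug line: 'YYYY-MM-DD HH:MM:SS message'
STAMPED = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_debug(output):
    """Return (timestamp, message) tuples; lines without a timestamp are skipped."""
    events = []
    for line in output.splitlines():
        match = STAMPED.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def metadata_fetched_before_token_failure(events):
    """True if every 'metadata url:' line precedes the first token failure line."""
    messages = [message for _, message in events]
    fetched = [i for i, m in enumerate(messages) if m.startswith("metadata url:")]
    failed = [i for i, m in enumerate(messages) if "failed to get metadata token" in m]
    return bool(fetched) and bool(failed) and max(fetched) < min(failed)
```

A True result is consistent with a failure in token generation rather than in metadata access. Untimestamped lines, such as the JSON error body, are ignored by this sketch.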
Run:
execute time
diagnose sys ntp status
Healthy example:
synchronized: yes
Abnormal example:
synchronized: no
reachable(0xff)
no data
Time synchronization was healthy on the baseline member and unhealthy on the affected member in the analyzed case. This difference should be corrected, even though the strongest failure signature remained 'Invalid Certs' during X.509 token generation.
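The NTP comparison between members can be automated with a small check on the saved 'diagnose sys ntp status' output. This sketch assumes the 'synchronized: yes/no' line shown in the examples; the function name is hypothetical.

```python
def ntp_synchronized(output):
    """Return True/False from the 'synchronized:' line, or None if it is absent."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "synchronized":
            return value.strip().lower() == "yes"
    return None
```

Returning None when the line is missing keeps "not synchronized" distinct from "output could not be interpreted".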
This message may appear during the failing flow:
rsa key file open error: /etc/cert/local/root_.key
This message should not be treated alone as the root cause.
Reason: It appears during the failing token flow. However, a missing or unreadable key path does not by itself fully distinguish the failing member from the healthy member's authentication result. The more reliable discriminator is the full sequence. Healthy member: active connector, active compartment, successful token refresh. Affected member: /v1/x509 returns Invalid Certs, followed by NotAuthenticated, followed by no active compartment.
This interpretation is consistent with the comparative analysis performed on the two cluster members.
If all of the following are true:
diagnose test application ocid 6 -> succeeds.
diagnose test application ocid 5 -> fails at /v1/x509 with Invalid Certs.
401 NotAuthenticated -> follows.
diagnose test application ocid 3 -> reports no active compartment.
diagnose test application ocid 7 -> fails to refresh the token.
The most likely failure domain is OCI token generation / OCI metadata IAM authentication, rather than the HA role, connector configuration drift, endpoint resolution, or instance metadata access.
That was the exact fault pattern observed on the affected member, while the healthy member kept the connector up and refreshed the token successfully.
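The decision rule above can be written as a small function. This is an illustrative sketch only: the class and field names are hypothetical, and the rule returns a conclusion only when every condition in the list is met.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    metadata_ok: bool            # 'ocid 6' succeeded
    x509_invalid_certs: bool     # 'ocid 5' failed at /v1/x509 with Invalid Certs
    not_authenticated: bool      # 401 NotAuthenticated followed
    no_active_compartment: bool  # 'ocid 3' reported no active compartment
    token_refresh_failed: bool   # 'ocid 7' failed to refresh the token


def likely_failure_domain(obs):
    """Apply the all-conditions rule; anything else is left undetermined."""
    conditions = (
        obs.metadata_ok,
        obs.x509_invalid_certs,
        obs.not_authenticated,
        obs.no_active_compartment,
        obs.token_refresh_failed,
    )
    if all(conditions):
        return "OCI token generation / OCI metadata IAM authentication"
    return "undetermined"
```

Returning "undetermined" for partial matches is a deliberately conservative choice: the workflow only supports the conclusion when the complete pattern is present.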
Restart the ocid process on the affected member and repeat the tests:
diagnose test application ocid 99
diagnose test application ocid 5
diagnose test application ocid 7
Correct NTP if the affected node is not synchronized. Correlate the issue with OCI using the specific instance OCID of the affected node. If HA failover testing is required, perform it only in a controlled maintenance window.