Re: [PATCH v2 6/7] Watchdog: introduce ARM SBSA watchdog driver

From: Fu Wei
Date: Sat May 23 2015 - 12:51:10 EST


Hi Timur,


On 24 May 2015 at 00:28, Fu Wei <fu.wei@xxxxxxxxxx> wrote:
> Hi Timur,
>
>
>
> On 21 May 2015 at 23:42, Timur Tabi <timur@xxxxxxxxxxxxxx> wrote:
>> On 05/21/2015 03:32 AM, fu.wei@xxxxxxxxxx wrote:
>>
>>> +static void reload_timeout_to_wcv(struct watchdog_device *wdd)
>>> +{
>>> + struct sbsa_gwdt *gwdt = to_sbsa_gwdt(wdd);
>>> + u64 wcv;
>>> +
>>> + wcv = arch_counter_get_cntvct() +
>>> + (u64)(wdd->timeout - wdd->pretimeout) * gwdt->clk;
>>> +
>>> + sbsa_gwdt_set_wcv(wdd, wcv);
>>> +}
>>
>>
>> ...
>>
>>> +static int sbsa_gwdt_set_timeout(struct watchdog_device *wdd,
>>> + unsigned int timeout)
>>> +{
>>> + wdd->timeout = timeout;
>>> +
>>> + return 0;
>>> +}
>>
>>
>> ...
>>
>>> +static irqreturn_t sbsa_gwdt_interrupt(int irq, void *dev_id)
>>> +{
>>> + struct sbsa_gwdt *gwdt = (struct sbsa_gwdt *)dev_id;
>>> + struct watchdog_device *wdd = &gwdt->wdd;
>>> + u32 status;
>>> +
>>> + status = sbsa_gwdt_cf_read(SBSA_GWDT_WCS, wdd);
>>> +
>>> + if (status & SBSA_GWDT_WCS_WS0)
>>> + panic("SBSA Watchdog pre-timeout");
>>> +
>>> + return IRQ_HANDLED;
>>> +}
>>
>>
>> There's one thing I don't understand about your driver. The 'timeout' value
>> from the kernel is supposed to to be the number of seconds until the system
>> reboots. You are programming the WCV with that value, which means that the
>> WS0 interrupt will fire when the timeout expires the first time. However,
>> you don't reboot the system during this interrupt. The "panic" will cause
>> the system to halt, but not reboot. Instead, it will just sit there.
>
> the "panic" is not just halt the system, please check the code :
> (1)It can trigger Kdump (not just print the panic message), if you
> enable this in the config. that can help server administrator to
> figure out why the system goes wrong.
> (2)panic also can trigger a reboot, if you set up "panic timeout".
>
> Obviously, it won't just sit there, it can help user figure out the problem.
>
> At the beginning, I would like to make the first signal more useful,
> but for simplifying the first version of driver , I decide to use
> panic(). but if there is better "alerts" for a ARM server , I will go
> on maintaining this driver to make WS0 more useful.
>
>> You're waiting for the WS1 timeout for the system to reboot, but this is not
>> a clean reboot, and it occurs at 2*timeout seconds.
>>
>> That's why I like my driver better. It doesn't have any of this pretimeout
>> stuff, and when the timeout expires during the WS0 interrupt, it calls
>> emergency_restart() which reboots the system properly. The WS1 hard reset
>> is used as a "backup" reset in case emergency_restart() fails.
>
> OK, If you think so, I hope you can read the SBSA spec more carefully
> For the watchdog signal (WS0/WS1), SBSA say:
> "The initial signal is typically wired to an interrupt and alerts the
> system. The system can attempt to take corrective
> action that includes refreshing the watchdog within the second watch
> period. If the refresh is successful the
> system returns to the previous normal operation.

>From here, you can see, even a panic is not good enough. we even can
refreshing the watchdog.

But for simplifying the driver, I think, at least, panic() can help
user to backup system context, it is very helpful for a server
administrator.
Because server should be very stable and important , if its software
goes wrong, we must figure out the problem, we can not let it happen
again.

but in WS0 interrupt routine , just simply restart , it is not a
server watchdog should do.

> If it fails then the
> second watch period expires and a second
> signal is generated. The signal is fed to a higher agent as an
> interrupt or reset for it to take executive action."
>
> So WS0 is a warning, but not a reset. the WS1 maybe a reset, or a
> interrupt to higher agent.
>
> That is different from a normal watchdog use before. the two stage of
> WS are not just for reset , at least the first one is definitely not a
> reset. and the second one is not a backup.
>
> If you make SBSA watchdog work like a normal watchdog,:
> (1)why we need a new driver and new device? you can just use SP805 in
> the system.
> (2) why we need a two stages? ( if the second hardware reset signal
> can work more reliably , why use emergency_restart() which is a
> software reset, does it clean the system and do some useful backup or
> sync? )
> the only useful thing done by emergency_restart() is
> kmsg_dump(KMSG_DUMP_EMERG);)
> (3)why the first WS is connect to a interrupt, but not a reset
> signal(I believe the direct reset signal is far more reliable than a
> interrupt to trigger a software reset )
>
> And because of WS0 is a warning, so I decide to use a existing
> watchdog concept "pretimeout":
> -----------------
> Pretimeouts:
>
> Some watchdog timers can be set to have a trigger go off before the
> actual time they will reset the system. This can be done with an NMI,
> interrupt, or other mechanism. This allows Linux to record useful
> information (like panic information and kernel coredumps) before it
> resets.
> -----------------
>
>>
>> --
>> Qualcomm Innovation Center, Inc.
>> The Qualcomm Innovation Center, Inc. is a member of the
>> Code Aurora Forum, a Linux Foundation Collaborative Project.
>
>
>
> --
> Best regards,
>
> Fu Wei
> Software Engineer
> Red Hat Software (Beijing) Co.,Ltd.Shanghai Branch
> Ph: +86 21 61221326(direct)
> Ph: +86 186 2020 4684 (mobile)
> Room 1512, Regus One Corporate Avenue,Level 15,
> One Corporate Avenue,222 Hubin Road,Huangpu District,
> Shanghai,China 200021



--
Best regards,

Fu Wei
Software Engineer
Red Hat Software (Beijing) Co.,Ltd.Shanghai Branch
Ph: +86 21 61221326(direct)
Ph: +86 186 2020 4684 (mobile)
Room 1512, Regus One Corporate Avenue,Level 15,
One Corporate Avenue,222 Hubin Road,Huangpu District,
Shanghai,China 200021
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/